idct_ap922x87.cpp

来自「这是一组DCT和iDCT的代码」· C++ 代码 · 共 718 行 · 第 1/2 页
CPP
718 行
#include "StdAfx.h"
#pragma warning(once:4305 4244)
/*
 --------------------------------------------------------
 idct_ap922x87.c - AP922float iDCT, C-code implementation
 --------------------------------------------------------

 AP-922 floating-point iDCT (x87 FPU)
 ------------------------------------
 
    Intel Application Note AP-922 discusses an MMX (scaled integer) 
  implementation of the MPEG fDCT/iDCT algorithm.  
  "AP922float" is a floating-point adaptation of the AP922 algorithm.
 
    The X87 implementation (this file) should compile and run on any
  i486DX or higher CPU.  *This* C-implementation attempts to emulate the 
  SSE-version of AP922float.  The emulation is only approximately, however,
  due to 2 numerical issues :

 Numerical Issue #1 : float->int conversion
 ------------------------------------------

    All modern CPUs can accurately convert float->int, but since this
  algorithm was coded on an x86 platform (with its aging heritage),
  float conversion is not as straightforward as one would expect.

    In a C-compiler, the fastest way to convert float->int is to
  simply type-cast.  Unfortunately, typecasting results in 
  truncation.  Truncation is *NOT* equivalent to rounding down!
  Truncated positive numbers are rounded down (toward 0), while 
  truncated negative numbers are rounded up (toward 0.)  On 
  average, the truncated number's magnitude is reduced by 0.5.

    The standard C-library function "floor()" explicitly rounds
  down the input, but calling a function to perform a simple
  rounding operation wastes CPU time, especially when each iDCT
  call requires 64 float->int conversions.
 
    The x87 FPU supports all rounding modes, but a FPU state change 
  must be executed prior to the rounding instructions, and this
  operation is not directly accessible from a high-level language.
  Therefore, if we are not willing to hand-code assembly,
  we have two choices: 

      a) maintain speed, use truncation (C type-casting)
    or
      b) maintain accuracy, use C-function floor()
 


   a) "FAST_FLOAT2INT" uses a truncation policy to convert float->int.
      "Truncation" is equivalent to rounding-down (toward 0) for positive #s,
      and rounding-up (toward 0) for negative #s.  Statistically, the 
      truncation introduces an average offset of -0.5, meaning the least
      significant bit of the converted-int is wrong.
  
      To minimize the error impact in the final iDCT output, the
      C-code iDCT prescales input data by a 16-bit left-shift. 
      The output is descaled by a complementary right-shift.
      The prescale/descale operation reduces truncation error by a factor
      of 65536.

   b) Precise float->int conversion can be achieved with the standard 
      C-library function 'floor( value + 0.5)'  This statement implements
      a standard rounding policy : "round-up if >0.5, otherwise down."
    
      The floor() function incurs a severe performance penalty: with 
      accurate_rounding, the iDCT's overall execution time doubles!
    
      Alternatively, the accuate rounding could be accelerated by hand-
      coding in x86 assembly.


  The actual-SSE assembly code implements accurate-rounding for
  float->int conversion, but without additional speed loss.  The


 SSE Numerical precision vs X87 numerical precision
 ------------------------------------------------------------
    On Intel x86 CPUs, FPU add/multiply operations are always calculated 
  with 80-bit internal precision.  Therefore, all other things being 
  equal, the X87 iDCT will be more accurate than Intel-SSE, even
  when the X87 iDCT stores intermediate results in 32-bit float.

    As a side note, the X87 iDCT can achieve the precision-level of the 
  IEEE-1180/1990 reference iDCT.  The "Reference-precision" mode is
  accessed by setting 'typdedef double ssefloat', and disabling
  "USE_FAST_FLOAT2INT" (delete the #define statement.)  

 Output clipping 
 ------------------------------------

    The iDCT's output range is limited to {-256,+255}
  The clipper is easy to implement in standard C, but slow to execute.
  In this C-implementation, usage of the clipper is left as a decision 
  to the user.  For strict IEEE-compliance, the output clipper must be used.

  (MMX offers several instructions to efficiently implement clipping.
   Thus, output clipping is done in both AP922float_SSE and 
   AP922float_3DN.)
 
 ------------------------------------------
 LEGAL MUMBO-JUMBO : Disclaimer of Warranty
 ------------------------------------------
  This software is available to the user without any license fee or
  royalty on an "as is" basis.  The author disclaims any and all warranties,
  whether express, implied, or statuary, including any implied warranties or 
  merchantability or of fitness for a particular purpose.  In no event shall 
  the copyright-holder be liable for any incidental, punitive, or 
  consequential damages of any kind whatsoever arising from the use of these
  programs.

  This disclaimer of warranty extends to the user of these programs and user's
  customers, employees, agents, transferees, successors, and assigns.

 Revision history
 ----------------
 09/03/2000 initial release of 'AP922float_x87' iDCT
            Based on AP922, this iDCT uses floating-point instructions for 
            enhanced accuracy.  To the best of my knowledge, AP922float is
            fully IEEE 1180/1990 compliant iDCT, including output-range 
            clipping to {-256,+255}
            When set to use double-precision and accurate-rounding (see
            above), AP922float is numerically equivalent to the 
            IEEE-1180/1990 reference_iDCT.
            
 09/03/2000 liaor@iname.com   http://members.tripod.com/~liaor
 liaor@iname.com   http://members.tripod.com/~liaor
*/
#include<math.h>  // for "floor()" function

typedef float ssefloat;  // ssefloat = IEEE-754 32-bit float

  #define PSCALE_SHIFT 16  // 16 bit left-shift
  #define DESCALE_SHIFT 16  // 16 bit right-shift
  #define DESCALE_ROUND_COMPENSATION (1<<(PSCALE_SHIFT-1))
  
  #define ROW_PRESCALE( x ) (((int)(x))<<PSCALE_SHIFT)

#define USE_FAST_FLOAT2INT 1
// comment out the previous line to use accurate (slower) rounding

#ifdef USE_FAST_FLOAT2INT

  // approximate rounding
  #define SHIFT_ROUND_COLF( x )  ( ( ( (int)(x) ) + DESCALE_ROUND_COMPENSATION) >> DESCALE_SHIFT)

#else

  // accurate rounding
  //#define SHIFT_ROUND_COLF( x )  (  (((int)floor( (x) + DESCALE_ROUND_COMPENSATION )))>>DESCALE_SHIFT  ) 
  #define SHIFT_ROUND_COLF( x )  (  ( ((int)floor( (x) )) + DESCALE_ROUND_COMPENSATION )>>DESCALE_SHIFT  ) 

#endif // USE_FAST_FLOAT2INT



#define NCOS1_16 (0.980785280403230449126182236134239) // cosine(  Pi/16 )
#define NCOS2_16 (0.923879532511286756128183189396788) // cosine( 2Pi/16 )
#define NCOS3_16 (0.831469612302545237078788377617906) // cosine( 3Pi/16 )
#define NCOS4_16 (0.707106781186547524400844362104849) // cosine( 4Pi/16 )
#define NCOS5_16 (0.555570233019602224742830813948533) // cosine( 5Pi/16 )
#define NCOS6_16 (0.382683432365089771728459984030399) // cosine( 6Pi/16 )
#define NCOS7_16 (0.195090322016128267848284868477022) // cosine( 7Pi/16 )

#define TANG1_16 ( NCOS7_16 / NCOS1_16) // tangent( Pi/16)
#define TANG2_16 ( NCOS6_16 / NCOS2_16) // tangent(2Pi/16)
#define TANG3_16 ( NCOS5_16 / NCOS3_16) // tangent(3Pi/16)

// Scaled floating-point coefficient table
// w00, w04, w08, w12
// w01, w05, w09, w13
// w02, w06, w10, w14
// w03, w07, w11, w15    ( all elements in float table are multiplied by 0.5)
// w16, w20, w24, w28
// w17, w21, w25, w29
// w18, w22, w26, w30
// w19, w23, w27, w31
//
// Compared to the integer-based table, the float table is downscaled by a
// multiplication factor of 0.5
// Scaled floating-point coefficient table
//      Original iDCT               SSE-iDCT 
//     table-ordering      ->    table-ordering
//    ------------------       ------------------
//    w00, w02, w04, w06,  ->  w00, w04, w08, w12,
//    w01, w03, w05, w07,  ->  w01, w05, w09, w13,
//    w08, w10, w12, w14,  ->  w02, w06, w10, w14,
//    w09, w11, w13, w15,  ->  w03, w07, w11, w15,
//    w16, w18, w20, w22,  ->  w16, w20, w24, w28,
//    w17, w19, w21, w23,  ->  w17, w21, w25, w29,
//    w24, w26, w28, w30,  ->  w18, w22, w26, w30,
//    w25, w27, w29, w31   ->  w19, w23, w27, w31
//
// Compared to the integer-based table, the float table is downscaled by a
// multiplication factor of 0.25*"DESCALE_SHIFT0"

#define IDCT_CONSTANT (0.25)
 // all table entries multiplied by (0.5*0.5)=0.25


#define RS0 (NCOS4_16*IDCT_CONSTANT)  // iDCT row#0 scalar
#define RS1 (NCOS1_16*IDCT_CONSTANT)  // iDCT row#1 scalar
#define RS2 (NCOS2_16*IDCT_CONSTANT)  // iDCT row#2 scalar
#define RS3 (NCOS3_16*IDCT_CONSTANT)  // iDCT row#3 scalar
#define RS4 (NCOS4_16*IDCT_CONSTANT)  // iDCT row#4 scalar
#define RS5 (NCOS3_16*IDCT_CONSTANT)  // iDCT row#5 scalar
#define RS6 (NCOS2_16*IDCT_CONSTANT)  // iDCT row#6 scalar
#define RS7 (NCOS1_16*IDCT_CONSTANT)  // iDCT row#7 scalar

const static ssefloat tab_i_01234567x87[] =
{ 
 // Row #0
  NCOS4_16*RS0,  NCOS4_16*RS0,  NCOS4_16*RS0,  NCOS4_16*RS0, 
  NCOS2_16*RS0,  NCOS6_16*RS0, -NCOS6_16*RS0, -NCOS2_16*RS0, 
  NCOS4_16*RS0, -NCOS4_16*RS0, -NCOS4_16*RS0,  NCOS4_16*RS0, 
  NCOS6_16*RS0, -NCOS2_16*RS0,  NCOS2_16*RS0, -NCOS6_16*RS0, 
  NCOS1_16*RS0,  NCOS3_16*RS0,  NCOS5_16*RS0,  NCOS7_16*RS0, 
  NCOS3_16*RS0, -NCOS7_16*RS0, -NCOS1_16*RS0, -NCOS5_16*RS0, 
  NCOS5_16*RS0, -NCOS1_16*RS0,  NCOS7_16*RS0,  NCOS3_16*RS0, 
  NCOS7_16*RS0, -NCOS5_16*RS0,  NCOS3_16*RS0, -NCOS1_16*RS0, 

 // Row #1
  NCOS4_16*RS1,  NCOS4_16*RS1,  NCOS4_16*RS1,  NCOS4_16*RS1, 
  NCOS2_16*RS1,  NCOS6_16*RS1, -NCOS6_16*RS1, -NCOS2_16*RS1, 
  NCOS4_16*RS1, -NCOS4_16*RS1, -NCOS4_16*RS1,  NCOS4_16*RS1, 
  NCOS6_16*RS1, -NCOS2_16*RS1,  NCOS2_16*RS1, -NCOS6_16*RS1, 
  NCOS1_16*RS1,  NCOS3_16*RS1,  NCOS5_16*RS1,  NCOS7_16*RS1, 
  NCOS3_16*RS1, -NCOS7_16*RS1, -NCOS1_16*RS1, -NCOS5_16*RS1, 
  NCOS5_16*RS1, -NCOS1_16*RS1,  NCOS7_16*RS1,  NCOS3_16*RS1, 
  NCOS7_16*RS1, -NCOS5_16*RS1,  NCOS3_16*RS1, -NCOS1_16*RS1, 

 // Row #2
  NCOS4_16*RS2,  NCOS4_16*RS2,  NCOS4_16*RS2,  NCOS4_16*RS2, 
  NCOS2_16*RS2,  NCOS6_16*RS2, -NCOS6_16*RS2, -NCOS2_16*RS2, 
  NCOS4_16*RS2, -NCOS4_16*RS2, -NCOS4_16*RS2,  NCOS4_16*RS2, 
  NCOS6_16*RS2, -NCOS2_16*RS2,  NCOS2_16*RS2, -NCOS6_16*RS2, 
  NCOS1_16*RS2,  NCOS3_16*RS2,  NCOS5_16*RS2,  NCOS7_16*RS2, 
  NCOS3_16*RS2, -NCOS7_16*RS2, -NCOS1_16*RS2, -NCOS5_16*RS2, 
  NCOS5_16*RS2, -NCOS1_16*RS2,  NCOS7_16*RS2,  NCOS3_16*RS2, 
  NCOS7_16*RS2, -NCOS5_16*RS2,  NCOS3_16*RS2, -NCOS1_16*RS2, 

 // Row #3
  NCOS4_16*RS3,  NCOS4_16*RS3,  NCOS4_16*RS3,  NCOS4_16*RS3, 
  NCOS2_16*RS3,  NCOS6_16*RS3, -NCOS6_16*RS3, -NCOS2_16*RS3, 
  NCOS4_16*RS3, -NCOS4_16*RS3, -NCOS4_16*RS3,  NCOS4_16*RS3, 
  NCOS6_16*RS3, -NCOS2_16*RS3,  NCOS2_16*RS3, -NCOS6_16*RS3, 
  NCOS1_16*RS3,  NCOS3_16*RS3,  NCOS5_16*RS3,  NCOS7_16*RS3, 
  NCOS3_16*RS3, -NCOS7_16*RS3, -NCOS1_16*RS3, -NCOS5_16*RS3, 
  NCOS5_16*RS3, -NCOS1_16*RS3,  NCOS7_16*RS3,  NCOS3_16*RS3, 
  NCOS7_16*RS3, -NCOS5_16*RS3,  NCOS3_16*RS3, -NCOS1_16*RS3, 

 // Row #4
  NCOS4_16*RS4,  NCOS4_16*RS4,  NCOS4_16*RS4,  NCOS4_16*RS4, 
  NCOS2_16*RS4,  NCOS6_16*RS4, -NCOS6_16*RS4, -NCOS2_16*RS4, 
  NCOS4_16*RS4, -NCOS4_16*RS4, -NCOS4_16*RS4,  NCOS4_16*RS4, 
  NCOS6_16*RS4, -NCOS2_16*RS4,  NCOS2_16*RS4, -NCOS6_16*RS4, 
  NCOS1_16*RS4,  NCOS3_16*RS4,  NCOS5_16*RS4,  NCOS7_16*RS4, 
  NCOS3_16*RS4, -NCOS7_16*RS4, -NCOS1_16*RS4, -NCOS5_16*RS4, 
  NCOS5_16*RS4, -NCOS1_16*RS4,  NCOS7_16*RS4,  NCOS3_16*RS4, 
  NCOS7_16*RS4, -NCOS5_16*RS4,  NCOS3_16*RS4, -NCOS1_16*RS4, 

 // Row #5
  NCOS4_16*RS5,  NCOS4_16*RS5,  NCOS4_16*RS5,  NCOS4_16*RS5, 
  NCOS2_16*RS5,  NCOS6_16*RS5, -NCOS6_16*RS5, -NCOS2_16*RS5, 
  NCOS4_16*RS5, -NCOS4_16*RS5, -NCOS4_16*RS5,  NCOS4_16*RS5, 
  NCOS6_16*RS5, -NCOS2_16*RS5,  NCOS2_16*RS5, -NCOS6_16*RS5, 
  NCOS1_16*RS5,  NCOS3_16*RS5,  NCOS5_16*RS5,  NCOS7_16*RS5, 
  NCOS3_16*RS5, -NCOS7_16*RS5, -NCOS1_16*RS5, -NCOS5_16*RS5, 
  NCOS5_16*RS5, -NCOS1_16*RS5,  NCOS7_16*RS5,  NCOS3_16*RS5, 
  NCOS7_16*RS5, -NCOS5_16*RS5,  NCOS3_16*RS5, -NCOS1_16*RS5, 

 // Row #6
  NCOS4_16*RS6,  NCOS4_16*RS6,  NCOS4_16*RS6,  NCOS4_16*RS6, 
  NCOS2_16*RS6,  NCOS6_16*RS6, -NCOS6_16*RS6, -NCOS2_16*RS6, 
  NCOS4_16*RS6, -NCOS4_16*RS6, -NCOS4_16*RS6,  NCOS4_16*RS6, 
  NCOS6_16*RS6, -NCOS2_16*RS6,  NCOS2_16*RS6, -NCOS6_16*RS6, 
  NCOS1_16*RS6,  NCOS3_16*RS6,  NCOS5_16*RS6,  NCOS7_16*RS6, 
  NCOS3_16*RS6, -NCOS7_16*RS6, -NCOS1_16*RS6, -NCOS5_16*RS6, 
  NCOS5_16*RS6, -NCOS1_16*RS6,  NCOS7_16*RS6,  NCOS3_16*RS6, 
  NCOS7_16*RS6, -NCOS5_16*RS6,  NCOS3_16*RS6, -NCOS1_16*RS6, 

 // Row #7
  NCOS4_16*RS7,  NCOS4_16*RS7,  NCOS4_16*RS7,  NCOS4_16*RS7, 
  NCOS2_16*RS7,  NCOS6_16*RS7, -NCOS6_16*RS7, -NCOS2_16*RS7, 
  NCOS4_16*RS7, -NCOS4_16*RS7, -NCOS4_16*RS7,  NCOS4_16*RS7, 
  NCOS6_16*RS7, -NCOS2_16*RS7,  NCOS2_16*RS7, -NCOS6_16*RS7, 
  NCOS1_16*RS7,  NCOS3_16*RS7,  NCOS5_16*RS7,  NCOS7_16*RS7, 
  NCOS3_16*RS7, -NCOS7_16*RS7, -NCOS1_16*RS7, -NCOS5_16*RS7, 
  NCOS5_16*RS7, -NCOS1_16*RS7,  NCOS7_16*RS7,  NCOS3_16*RS7, 
  NCOS7_16*RS7, -NCOS5_16*RS7,  NCOS3_16*RS7, -NCOS1_16*RS7, 

 // iDCT column coefficients
//  NCOS4_16, NCOS4_16, NCOS4_16, NCOS4_16,
//  TANG1_16, TANG1_16, TANG1_16, TANG1_16,
//  TANG2_16, TANG2_16, TANG2_16, TANG2_16,
//  TANG3_16, TANG3_16, TANG3_16, TANG3_16
};

// externally allocated array of 64 floats, for temporary storage
//extern void *fTempArray;
static float fTempArray[64];

const static ssefloat cos_4_16f = NCOS4_16;
const static ssefloat tg_1_16f= NCOS7_16/NCOS1_16; // tangent (Pi/16)
const static ssefloat tg_2_16f= NCOS6_16/NCOS2_16; // tangent (2Pi/16)
const static ssefloat tg_3_16f= NCOS5_16/NCOS3_16; // tangent (3Pi/16)


/*
; The original, unscaled idct coefficient table
; static const short w[32] = {
; FIX(cos_4_16), FIX(cos_2_16), FIX(cos_4_16), FIX(cos_6_16),
; FIX(cos_4_16), FIX(cos_6_16), -FIX(cos_4_16), -FIX(cos_2_16),
; FIX(cos_4_16), -FIX(cos_6_16), -FIX(cos_4_16), FIX(cos_2_16),
; FIX(cos_4_16), -FIX(cos_2_16), FIX(cos_4_16), -FIX(cos_6_16),
; FIX(cos_1_16), FIX(cos_3_16), FIX(cos_5_16), FIX(cos_7_16),
; FIX(cos_3_16), -FIX(cos_7_16), -FIX(cos_1_16), -FIX(cos_5_16),
; FIX(cos_5_16), -FIX(cos_1_16), FIX(cos_7_16), FIX(cos_3_16),
; FIX(cos_7_16), -FIX(cos_5_16), FIX(cos_3_16), -FIX(cos_1_16) };
;
; #define DCT_8_INV_ROW(x, y)


#define BITS_INV_ACC	4	//; 4 or 5 for IEEE
	// 5 yields higher accuracy, but lessens dynamic range on the input matrix
#define SHIFT_INV_ROW	(16 - BITS_INV_ACC)
#define SHIFT_INV_COL	(1 + BITS_INV_ACC )

#define RND_INV_ROW		(1 << (SHIFT_INV_ROW-1))
#define RND_INV_COL		(1 << (SHIFT_INV_COL-1)) 
#define RND_INV_CORR	(RND_INV_COL - 1)		//; correction -1.0 and round
//#define RND_INV_ROW		(1024 * (6 - BITS_INV_ACC)) //; 1 << (SHIFT_INV_ROW-1)
//#define RND_INV_COL		(16 * (BITS_INV_ACC - 3)) //; 1 << (SHIFT_INV_COL-1)


const static long r_inv_row[2] = { RND_INV_ROW, RND_INV_ROW};

//#define SHIFT_ROUND_COL( x ) ( ( (x) + RND_INV_COL ) >> (SHIFT_INV_COL) )
//#define SHIFT_ROUND_ROW( x ) ( ( (x) + RND_INV_ROW ) >> (SHIFT_INV_ROW) )


const static short tg_1_16[4] = {13036, 13036, 13036, 13036 }; //tg * (2<<16) + 0.5
const static short tg_2_16[4] = {27146, 27146, 27146, 27146 }; //tg * (2<<16) + 0.5
const static short tg_3_16[4] = {-21746, -21746, -21746, -21746 }; //tg * (2<<16) + 0.5
const static short cos_4_16[4] = {-19195, -19195, -19195, -19195 }; //cos * (2<<16) + 0.5
const static short ocos_4_16[4] = {23170, 23170, 23170, 23170 }; //cos * (2<<15) + 0.5
*/

// AP922float_x87(), X87 implementation of floating-point AP922
void idct_ap922float_x87(short *data)
{
//  static ssefloat fTempArray[64]; // intermediate (row-iDCT output) matrix
  static ssefloat a[8], b[8], e[8], dp[8]; // intermediate results
  static ssefloat tmp[2];  // temp calculation, scratch pad

  static int xi[8]; // temp 32-bit ints for initial scale-up shift

//  static short const _one_corr=0x0001;
idct_ap922x87.cpp - 源码说明

本页面展示了「这是一组DCT和iDCT的代码」中的 idct_ap922x87.cpp 源码文件，采用 C++ 编程语言编写，共 718 行代码。您可以在线阅读完整代码内容，也可以返回资源详情页下载完整源码包进行本地学习和开发。
虫虫下载站收录了大量与iDCT相关的技术资源，包括源代码、技术文档、电路图等，是电子工程师和嵌入式开发者的专业学习平台。
⌨️ 快捷键说明

复制代码Ctrl + C
搜索代码Ctrl + F
全屏模式F11
增大字号Ctrl + =
减小字号Ctrl + -
显示快捷键?