📄 idct_ap922x87.cpp
字号:
#include "StdAfx.h"
#pragma warning(once:4305 4244)
/*
--------------------------------------------------------
idct_ap922x87.c - AP922float iDCT, C-code implementation
--------------------------------------------------------
AP-922 floating-point iDCT (x87 FPU)
------------------------------------
Intel Application Note AP-922 discusses an MMX (scaled integer)
implementation of the MPEG fDCT/iDCT algorithm.
"AP922float" is a floating-point adaptation of the AP922 algorithm.
The X87 implementation (this file) should compile and run on any
i486DX or higher CPU. *This* C-implementation attempts to emulate the
SSE-version of AP922float. The emulation is only approximately, however,
due to 2 numerical issues :
Numerical Issue #1 : float->int conversion
------------------------------------------
All modern CPUs can accurately convert float->int, but since this
algorithm was coded on an x86 platform (with its aging heritage),
float conversion is not as straightforward as one would expect.
In a C-compiler, the fastest way to convert float->int is to
simply type-cast. Unfortunately, typecasting results in
truncation. Truncation is *NOT* equivalent to rounding down!
Truncated positive numbers are rounded down (toward 0), while
truncated negative numbers are rounded up (toward 0.) On
average, the truncated number's magnitude is reduced by 0.5.
The standard C-library function "floor()" explicitly rounds
down the input, but calling a function to perform a simple
rounding operation wastes CPU time, especially when each iDCT
call requires 64 float->int conversions.
The x87 FPU supports all rounding modes, but a FPU state change
must be executed prior to the rounding instructions, and this
operation is not directly accessible from a high-level language.
Therefore, if we are not willing to hand-code assembly,
we have two choices:
a) maintain speed, use truncation (C type-casting)
or
b) maintain accuracy, use C-function floor()
a) "FAST_FLOAT2INT" uses a truncation policy to convert float->int.
"Truncation" is equivalent to rounding-down (toward 0) for positive #s,
and rounding-up (toward 0) for negative #s. Statistically, the
truncation introduces an average offset of -0.5, meaning the least
significant bit of the converted-int is wrong.
To minimize the error impact in the final iDCT output, the
C-code iDCT prescales input data by a 16-bit left-shift.
The output is descaled by a complementary right-shift.
The prescale/descale operation reduces truncation error by a factor
of 65536.
b) Precise float->int conversion can be achieved with the standard
C-library function 'floor( value + 0.5)' This statement implements
a standard rounding policy : "round-up if >0.5, otherwise down."
The floor() function incurs a severe performance penalty: with
accurate_rounding, the iDCT's overall execution time doubles!
Alternatively, the accuate rounding could be accelerated by hand-
coding in x86 assembly.
The actual-SSE assembly code implements accurate-rounding for
float->int conversion, but without additional speed loss. The
SSE Numerical precision vs X87 numerical precision
------------------------------------------------------------
On Intel x86 CPUs, FPU add/multiply operations are always calculated
with 80-bit internal precision. Therefore, all other things being
equal, the X87 iDCT will be more accurate than Intel-SSE, even
when the X87 iDCT stores intermediate results in 32-bit float.
As a side note, the X87 iDCT can achieve the precision-level of the
IEEE-1180/1990 reference iDCT. The "Reference-precision" mode is
accessed by setting 'typdedef double ssefloat', and disabling
"USE_FAST_FLOAT2INT" (delete the #define statement.)
Output clipping
------------------------------------
The iDCT's output range is limited to {-256,+255}
The clipper is easy to implement in standard C, but slow to execute.
In this C-implementation, usage of the clipper is left as a decision
to the user. For strict IEEE-compliance, the output clipper must be used.
(MMX offers several instructions to efficiently implement clipping.
Thus, output clipping is done in both AP922float_SSE and
AP922float_3DN.)
------------------------------------------
LEGAL MUMBO-JUMBO : Disclaimer of Warranty
------------------------------------------
This software is available to the user without any license fee or
royalty on an "as is" basis. The author disclaims any and all warranties,
whether express, implied, or statuary, including any implied warranties or
merchantability or of fitness for a particular purpose. In no event shall
the copyright-holder be liable for any incidental, punitive, or
consequential damages of any kind whatsoever arising from the use of these
programs.
This disclaimer of warranty extends to the user of these programs and user's
customers, employees, agents, transferees, successors, and assigns.
Revision history
----------------
09/03/2000 initial release of 'AP922float_x87' iDCT
Based on AP922, this iDCT uses floating-point instructions for
enhanced accuracy. To the best of my knowledge, AP922float is
fully IEEE 1180/1990 compliant iDCT, including output-range
clipping to {-256,+255}
When set to use double-precision and accurate-rounding (see
above), AP922float is numerically equivalent to the
IEEE-1180/1990 reference_iDCT.
09/03/2000 liaor@iname.com http://members.tripod.com/~liaor
liaor@iname.com http://members.tripod.com/~liaor
*/
#include<math.h> // for "floor()" function
typedef float ssefloat; // ssefloat = IEEE-754 32-bit float
#define PSCALE_SHIFT 16 // 16 bit left-shift
#define DESCALE_SHIFT 16 // 16 bit right-shift
#define DESCALE_ROUND_COMPENSATION (1<<(PSCALE_SHIFT-1))
#define ROW_PRESCALE( x ) (((int)(x))<<PSCALE_SHIFT)
#define USE_FAST_FLOAT2INT 1
// comment out the previous line to use accurate (slower) rounding
#ifdef USE_FAST_FLOAT2INT
// approximate rounding
#define SHIFT_ROUND_COLF( x ) ( ( ( (int)(x) ) + DESCALE_ROUND_COMPENSATION) >> DESCALE_SHIFT)
#else
// accurate rounding
//#define SHIFT_ROUND_COLF( x ) ( (((int)floor( (x) + DESCALE_ROUND_COMPENSATION )))>>DESCALE_SHIFT )
#define SHIFT_ROUND_COLF( x ) ( ( ((int)floor( (x) )) + DESCALE_ROUND_COMPENSATION )>>DESCALE_SHIFT )
#endif // USE_FAST_FLOAT2INT
#define NCOS1_16 (0.980785280403230449126182236134239) // cosine( Pi/16 )
#define NCOS2_16 (0.923879532511286756128183189396788) // cosine( 2Pi/16 )
#define NCOS3_16 (0.831469612302545237078788377617906) // cosine( 3Pi/16 )
#define NCOS4_16 (0.707106781186547524400844362104849) // cosine( 4Pi/16 )
#define NCOS5_16 (0.555570233019602224742830813948533) // cosine( 5Pi/16 )
#define NCOS6_16 (0.382683432365089771728459984030399) // cosine( 6Pi/16 )
#define NCOS7_16 (0.195090322016128267848284868477022) // cosine( 7Pi/16 )
#define TANG1_16 ( NCOS7_16 / NCOS1_16) // tangent( Pi/16)
#define TANG2_16 ( NCOS6_16 / NCOS2_16) // tangent(2Pi/16)
#define TANG3_16 ( NCOS5_16 / NCOS3_16) // tangent(3Pi/16)
// Scaled floating-point coefficient table
// w00, w04, w08, w12
// w01, w05, w09, w13
// w02, w06, w10, w14
// w03, w07, w11, w15 ( all elements in float table are multiplied by 0.5)
// w16, w20, w24, w28
// w17, w21, w25, w29
// w18, w22, w26, w30
// w19, w23, w27, w31
//
// Compared to the integer-based table, the float table is downscaled by a
// multiplication factor of 0.5
// Scaled floating-point coefficient table
// Original iDCT SSE-iDCT
// table-ordering -> table-ordering
// ------------------ ------------------
// w00, w02, w04, w06, -> w00, w04, w08, w12,
// w01, w03, w05, w07, -> w01, w05, w09, w13,
// w08, w10, w12, w14, -> w02, w06, w10, w14,
// w09, w11, w13, w15, -> w03, w07, w11, w15,
// w16, w18, w20, w22, -> w16, w20, w24, w28,
// w17, w19, w21, w23, -> w17, w21, w25, w29,
// w24, w26, w28, w30, -> w18, w22, w26, w30,
// w25, w27, w29, w31 -> w19, w23, w27, w31
//
// Compared to the integer-based table, the float table is downscaled by a
// multiplication factor of 0.25*"DESCALE_SHIFT0"
#define IDCT_CONSTANT (0.25)
// all table entries multiplied by (0.5*0.5)=0.25
#define RS0 (NCOS4_16*IDCT_CONSTANT) // iDCT row#0 scalar
#define RS1 (NCOS1_16*IDCT_CONSTANT) // iDCT row#1 scalar
#define RS2 (NCOS2_16*IDCT_CONSTANT) // iDCT row#2 scalar
#define RS3 (NCOS3_16*IDCT_CONSTANT) // iDCT row#3 scalar
#define RS4 (NCOS4_16*IDCT_CONSTANT) // iDCT row#4 scalar
#define RS5 (NCOS3_16*IDCT_CONSTANT) // iDCT row#5 scalar
#define RS6 (NCOS2_16*IDCT_CONSTANT) // iDCT row#6 scalar
#define RS7 (NCOS1_16*IDCT_CONSTANT) // iDCT row#7 scalar
const static ssefloat tab_i_01234567x87[] =
{
// Row #0
NCOS4_16*RS0, NCOS4_16*RS0, NCOS4_16*RS0, NCOS4_16*RS0,
NCOS2_16*RS0, NCOS6_16*RS0, -NCOS6_16*RS0, -NCOS2_16*RS0,
NCOS4_16*RS0, -NCOS4_16*RS0, -NCOS4_16*RS0, NCOS4_16*RS0,
NCOS6_16*RS0, -NCOS2_16*RS0, NCOS2_16*RS0, -NCOS6_16*RS0,
NCOS1_16*RS0, NCOS3_16*RS0, NCOS5_16*RS0, NCOS7_16*RS0,
NCOS3_16*RS0, -NCOS7_16*RS0, -NCOS1_16*RS0, -NCOS5_16*RS0,
NCOS5_16*RS0, -NCOS1_16*RS0, NCOS7_16*RS0, NCOS3_16*RS0,
NCOS7_16*RS0, -NCOS5_16*RS0, NCOS3_16*RS0, -NCOS1_16*RS0,
// Row #1
NCOS4_16*RS1, NCOS4_16*RS1, NCOS4_16*RS1, NCOS4_16*RS1,
NCOS2_16*RS1, NCOS6_16*RS1, -NCOS6_16*RS1, -NCOS2_16*RS1,
NCOS4_16*RS1, -NCOS4_16*RS1, -NCOS4_16*RS1, NCOS4_16*RS1,
NCOS6_16*RS1, -NCOS2_16*RS1, NCOS2_16*RS1, -NCOS6_16*RS1,
NCOS1_16*RS1, NCOS3_16*RS1, NCOS5_16*RS1, NCOS7_16*RS1,
NCOS3_16*RS1, -NCOS7_16*RS1, -NCOS1_16*RS1, -NCOS5_16*RS1,
NCOS5_16*RS1, -NCOS1_16*RS1, NCOS7_16*RS1, NCOS3_16*RS1,
NCOS7_16*RS1, -NCOS5_16*RS1, NCOS3_16*RS1, -NCOS1_16*RS1,
// Row #2
NCOS4_16*RS2, NCOS4_16*RS2, NCOS4_16*RS2, NCOS4_16*RS2,
NCOS2_16*RS2, NCOS6_16*RS2, -NCOS6_16*RS2, -NCOS2_16*RS2,
NCOS4_16*RS2, -NCOS4_16*RS2, -NCOS4_16*RS2, NCOS4_16*RS2,
NCOS6_16*RS2, -NCOS2_16*RS2, NCOS2_16*RS2, -NCOS6_16*RS2,
NCOS1_16*RS2, NCOS3_16*RS2, NCOS5_16*RS2, NCOS7_16*RS2,
NCOS3_16*RS2, -NCOS7_16*RS2, -NCOS1_16*RS2, -NCOS5_16*RS2,
NCOS5_16*RS2, -NCOS1_16*RS2, NCOS7_16*RS2, NCOS3_16*RS2,
NCOS7_16*RS2, -NCOS5_16*RS2, NCOS3_16*RS2, -NCOS1_16*RS2,
// Row #3
NCOS4_16*RS3, NCOS4_16*RS3, NCOS4_16*RS3, NCOS4_16*RS3,
NCOS2_16*RS3, NCOS6_16*RS3, -NCOS6_16*RS3, -NCOS2_16*RS3,
NCOS4_16*RS3, -NCOS4_16*RS3, -NCOS4_16*RS3, NCOS4_16*RS3,
NCOS6_16*RS3, -NCOS2_16*RS3, NCOS2_16*RS3, -NCOS6_16*RS3,
NCOS1_16*RS3, NCOS3_16*RS3, NCOS5_16*RS3, NCOS7_16*RS3,
NCOS3_16*RS3, -NCOS7_16*RS3, -NCOS1_16*RS3, -NCOS5_16*RS3,
NCOS5_16*RS3, -NCOS1_16*RS3, NCOS7_16*RS3, NCOS3_16*RS3,
NCOS7_16*RS3, -NCOS5_16*RS3, NCOS3_16*RS3, -NCOS1_16*RS3,
// Row #4
NCOS4_16*RS4, NCOS4_16*RS4, NCOS4_16*RS4, NCOS4_16*RS4,
NCOS2_16*RS4, NCOS6_16*RS4, -NCOS6_16*RS4, -NCOS2_16*RS4,
NCOS4_16*RS4, -NCOS4_16*RS4, -NCOS4_16*RS4, NCOS4_16*RS4,
NCOS6_16*RS4, -NCOS2_16*RS4, NCOS2_16*RS4, -NCOS6_16*RS4,
NCOS1_16*RS4, NCOS3_16*RS4, NCOS5_16*RS4, NCOS7_16*RS4,
NCOS3_16*RS4, -NCOS7_16*RS4, -NCOS1_16*RS4, -NCOS5_16*RS4,
NCOS5_16*RS4, -NCOS1_16*RS4, NCOS7_16*RS4, NCOS3_16*RS4,
NCOS7_16*RS4, -NCOS5_16*RS4, NCOS3_16*RS4, -NCOS1_16*RS4,
// Row #5
NCOS4_16*RS5, NCOS4_16*RS5, NCOS4_16*RS5, NCOS4_16*RS5,
NCOS2_16*RS5, NCOS6_16*RS5, -NCOS6_16*RS5, -NCOS2_16*RS5,
NCOS4_16*RS5, -NCOS4_16*RS5, -NCOS4_16*RS5, NCOS4_16*RS5,
NCOS6_16*RS5, -NCOS2_16*RS5, NCOS2_16*RS5, -NCOS6_16*RS5,
NCOS1_16*RS5, NCOS3_16*RS5, NCOS5_16*RS5, NCOS7_16*RS5,
NCOS3_16*RS5, -NCOS7_16*RS5, -NCOS1_16*RS5, -NCOS5_16*RS5,
NCOS5_16*RS5, -NCOS1_16*RS5, NCOS7_16*RS5, NCOS3_16*RS5,
NCOS7_16*RS5, -NCOS5_16*RS5, NCOS3_16*RS5, -NCOS1_16*RS5,
// Row #6
NCOS4_16*RS6, NCOS4_16*RS6, NCOS4_16*RS6, NCOS4_16*RS6,
NCOS2_16*RS6, NCOS6_16*RS6, -NCOS6_16*RS6, -NCOS2_16*RS6,
NCOS4_16*RS6, -NCOS4_16*RS6, -NCOS4_16*RS6, NCOS4_16*RS6,
NCOS6_16*RS6, -NCOS2_16*RS6, NCOS2_16*RS6, -NCOS6_16*RS6,
NCOS1_16*RS6, NCOS3_16*RS6, NCOS5_16*RS6, NCOS7_16*RS6,
NCOS3_16*RS6, -NCOS7_16*RS6, -NCOS1_16*RS6, -NCOS5_16*RS6,
NCOS5_16*RS6, -NCOS1_16*RS6, NCOS7_16*RS6, NCOS3_16*RS6,
NCOS7_16*RS6, -NCOS5_16*RS6, NCOS3_16*RS6, -NCOS1_16*RS6,
// Row #7
NCOS4_16*RS7, NCOS4_16*RS7, NCOS4_16*RS7, NCOS4_16*RS7,
NCOS2_16*RS7, NCOS6_16*RS7, -NCOS6_16*RS7, -NCOS2_16*RS7,
NCOS4_16*RS7, -NCOS4_16*RS7, -NCOS4_16*RS7, NCOS4_16*RS7,
NCOS6_16*RS7, -NCOS2_16*RS7, NCOS2_16*RS7, -NCOS6_16*RS7,
NCOS1_16*RS7, NCOS3_16*RS7, NCOS5_16*RS7, NCOS7_16*RS7,
NCOS3_16*RS7, -NCOS7_16*RS7, -NCOS1_16*RS7, -NCOS5_16*RS7,
NCOS5_16*RS7, -NCOS1_16*RS7, NCOS7_16*RS7, NCOS3_16*RS7,
NCOS7_16*RS7, -NCOS5_16*RS7, NCOS3_16*RS7, -NCOS1_16*RS7,
// iDCT column coefficients
// NCOS4_16, NCOS4_16, NCOS4_16, NCOS4_16,
// TANG1_16, TANG1_16, TANG1_16, TANG1_16,
// TANG2_16, TANG2_16, TANG2_16, TANG2_16,
// TANG3_16, TANG3_16, TANG3_16, TANG3_16
};
// externally allocated array of 64 floats, for temporary storage
//extern void *fTempArray;
static float fTempArray[64];
const static ssefloat cos_4_16f = NCOS4_16;
const static ssefloat tg_1_16f= NCOS7_16/NCOS1_16; // tangent (Pi/16)
const static ssefloat tg_2_16f= NCOS6_16/NCOS2_16; // tangent (2Pi/16)
const static ssefloat tg_3_16f= NCOS5_16/NCOS3_16; // tangent (3Pi/16)
/*
; The original, unscaled idct coefficient table
; static const short w[32] = {
; FIX(cos_4_16), FIX(cos_2_16), FIX(cos_4_16), FIX(cos_6_16),
; FIX(cos_4_16), FIX(cos_6_16), -FIX(cos_4_16), -FIX(cos_2_16),
; FIX(cos_4_16), -FIX(cos_6_16), -FIX(cos_4_16), FIX(cos_2_16),
; FIX(cos_4_16), -FIX(cos_2_16), FIX(cos_4_16), -FIX(cos_6_16),
; FIX(cos_1_16), FIX(cos_3_16), FIX(cos_5_16), FIX(cos_7_16),
; FIX(cos_3_16), -FIX(cos_7_16), -FIX(cos_1_16), -FIX(cos_5_16),
; FIX(cos_5_16), -FIX(cos_1_16), FIX(cos_7_16), FIX(cos_3_16),
; FIX(cos_7_16), -FIX(cos_5_16), FIX(cos_3_16), -FIX(cos_1_16) };
;
; #define DCT_8_INV_ROW(x, y)
#define BITS_INV_ACC 4 //; 4 or 5 for IEEE
// 5 yields higher accuracy, but lessens dynamic range on the input matrix
#define SHIFT_INV_ROW (16 - BITS_INV_ACC)
#define SHIFT_INV_COL (1 + BITS_INV_ACC )
#define RND_INV_ROW (1 << (SHIFT_INV_ROW-1))
#define RND_INV_COL (1 << (SHIFT_INV_COL-1))
#define RND_INV_CORR (RND_INV_COL - 1) //; correction -1.0 and round
//#define RND_INV_ROW (1024 * (6 - BITS_INV_ACC)) //; 1 << (SHIFT_INV_ROW-1)
//#define RND_INV_COL (16 * (BITS_INV_ACC - 3)) //; 1 << (SHIFT_INV_COL-1)
const static long r_inv_row[2] = { RND_INV_ROW, RND_INV_ROW};
//#define SHIFT_ROUND_COL( x ) ( ( (x) + RND_INV_COL ) >> (SHIFT_INV_COL) )
//#define SHIFT_ROUND_ROW( x ) ( ( (x) + RND_INV_ROW ) >> (SHIFT_INV_ROW) )
const static short tg_1_16[4] = {13036, 13036, 13036, 13036 }; //tg * (2<<16) + 0.5
const static short tg_2_16[4] = {27146, 27146, 27146, 27146 }; //tg * (2<<16) + 0.5
const static short tg_3_16[4] = {-21746, -21746, -21746, -21746 }; //tg * (2<<16) + 0.5
const static short cos_4_16[4] = {-19195, -19195, -19195, -19195 }; //cos * (2<<16) + 0.5
const static short ocos_4_16[4] = {23170, 23170, 23170, 23170 }; //cos * (2<<15) + 0.5
*/
// AP922float_x87(), X87 implementation of floating-point AP922
void idct_ap922float_x87(short *data)
{
// static ssefloat fTempArray[64]; // intermediate (row-iDCT output) matrix
static ssefloat a[8], b[8], e[8], dp[8]; // intermediate results
static ssefloat tmp[2]; // temp calculation, scratch pad
static int xi[8]; // temp 32-bit ints for initial scale-up shift
// static short const _one_corr=0x0001;
⌨️ 快捷键说明
复制代码
Ctrl + C
搜索代码
Ctrl + F
全屏模式
F11
切换主题
Ctrl + Shift + D
显示快捷键
?
增大字号
Ctrl + =
减小字号
Ctrl + -