📄 idct_ap922sse_rawcode.cpp
字号:
#include "StdAfx.h"
/*
------------------------------------------------------------------------
idct_ap922sse_rawcode.c - AP922float iDCT, Intel SSE implementation
Requirements :
Visual C++ 6.0 Pro + Visual Studio Service Pack 4 + Processor Pack Beta
SSE capable CPU (Pentium-III, Celeron-II)
------------------------------------------------------------------------
AP-922 floating-point iDCT (SSE)
--------------------------------
Intel Application Note AP-922 discusses an MMX
(scaled integer) implementation of the MPEG fDCT/iDCT operations.
"AP922float" is a floating-point adaptation of the AP922 algorithm.
The SSE implementation of AP922float follows the AP-922 document
very closely. However, since all math operations use 32-bit floats
instead of 16/32-bit integers, most of the actual assembly code
was rewritten from scratch.
Several coding-issues arose during the development of this file.
They are summarized here :
#1) 16-bit integer -> 32-bit float conversion
--------------------------------------------
The usual MMX/SSE sequence for converting int->float is as follows:
a) mov mm0, [int16_data];
// extract element 0 only
b) pshufw mm1, mm0, 00000000b; // mm1 = [d0 d0 d0 d0] <16-bit int>
c) psrad mm1, 16; // mm1 = [ d0 d0] <32-bit int>
d) cvtpi2ps xmm0, mm1; // xmm0= [ d0 d0] <32b float>
//done
This sequence has several problems. pshufw, psrad, cvtpi2ps all issue
from "port1", meaning the instructions will not pair on Pentium-III.
Without further optimization, the int->float unpack step wastes many
instruction slots.
Although we cannot eliminate decoder contention, we can reduce it by
using an instruction other than psrad. For example, we substitute
"pand" for psrad.
a) mov mm0, [int16_data];
// extract element 0 only
b) pshufw mm1, mm0, 00000000b;// mm1 = [d0 d0 d0 d0] <16-bit int>
c) pand mm1, mmAndmask; // mm1 = [d0 d0 ] <32-bit int>*PSCALE
// PSCALE = 65536, a row_prescale factor
d) cvtpi2ps xmm0, mm1; // xmm0= [ xd0 xd0] <32b float>
// xd0 = d0 * PSCALE
pand can issue from either port0 or port1. As several int->float unpack
operations are overlapped, pand can pair with any other MMX/SSE
instruction. Some instruction slots are still empty, but overall
less decoder bandwidth is wasted.
#2) AP922 iDCT_row macro, PMADDWD/PADDD vs MULPS/ADDPS
-----------------------------------------------------
AP922 w[] coefficient table is ordered for PMADDWD+PADDD.
<4 16bit integers> <4 16bit integers>
[X4 X0 X4 X0] mm0 [X6 X2 X6 X2] mm2
[W6 W4 W2 W0] mm1 [W7 W5 W3 W1] mm3
| |
PMADDWD mm0,mm1 PMADDWD mm2,mm3
[X0*W4 | X0*W0 ] [X2*W5 | X2*W1 ]
[ + | + ] [ + | + ]
[X4*W6 | X4*W2 ] [X6*W7 | X6*W3 ]
2 32-bit integers 2 32-bit integers
PADDD mm0, mm2
mm0 <= [A1 A0]
; a0 =x[0]*w[0]+x[2]*w[1]+x[4]*w[2]+x[6]*w[3];
; a1 =x[0]*w[4]+x[2]*w[5]+x[4]*w[6]+x[6]*w[7];
The key observation is PMADDWD's *horizontal* add operation.
Intel-SSE has no equivalent for MMX's PMADDWD, so the w[] table must
be reordered for optimal execution with MULPS + ADDPS.
// XMM0 = [x0 x0 x0 x0]
// XMM1 = [x2 x2 x2 x2]
// XMM2 = [x4 x4 x4 x4]
// XMM3 = [x6 x6 x6 x6]
MULPS XMM0, [TABLE+00]; // XMM0= [x0*w12 x0*w8 x0*w4 x0*w0]
MULPS XMM1, [TABLE+16]; // XMM0= [x2*w13 x2*w9 x2*w5 x2*w1]
MULPS XMM2, [TABLE+32]; // XMM0= [x4*w14 x4*w10 x4*w6 x4*w2]
MULPS XMM3, [TABLE+48]; // XMM0= [x6*w15 x6*w11 x6*w7 x6*w3]
ADDPS XMM1, XMM0;
ADDPS XMM2, XMM1;
ADDPS XMM3, XMM2; // XMM3 =[ A3 A2 A1 A0 ]
#3) AP922 iDCT_col macro, descale/shift/round (float->int16 converion)
----------------------------------------------------------------------
The final operation in the AP922float iDCT requires float->int16
conversion. This potentially introduces numerical error which
we've worked so hard to minimize.
cvttps2pi mm0, xmm0; // xmm0[y1 y0] -> truncate -> mm0[y1 y0]
The cvttps2pi stores 32-bit integer from truncatation. Truncation
is equivalent to "negative_round_down / positive_round_up." This
means the 32-bit integer's magnitude is reduced by 0.5, on average.
The usual method of adding a rounding_offset does not work,
because the offset's sign (+/-) depends on the input value's sign.
Fortunately, Intel-SSE offers cvtps2pi. This instruction uses the
current rounding mode (stored in register mxcsr.) MXCSR is usually set
to "round-nearest." Precise rounding is achieved by pre-adding a
"correction_offset", then issuing cvtps2pi.
ADDPS XMM0, [round_float_offset]; // [y3+. y2+. y1+. y0+.]
CVTPS2PI MM0, XMM0; // mm0 = [y1 y0] <int32>
MOVHLPS XMM0, XMM0; // XMM0 =[y3 y2 y3 y2] <float>
CVTPS2PI MM1, XMM0; // mm1 = [y3 y2] <int32>
Execution wise, PADDD(MMX) would be faster than ADDPS, but as
its name suggests, "round_float_offset" must be added as a
floating-point instruction, because numbers <1.0 cannot be
represented in integer!
Recall from optimization (#1), the iDCT column output will produce
outputs that are scaled up by 65536 (row_prescale.)
The prescaling can be canceled by compensating the w[] table, but
we defer discussion until item #4.
**The above code assumes the rounding-mode bits ("mxcsr register")
are set for "round-nearest" mode! If a different round-mode is
in use (always round down, or always round up), the constant
round_float_offset must be changed accordingly.
#4) iDCT Output range clipping {-256,+255}
----------------------------------------------------------------------
The iDCT's output range must be clipped to {-256,+255} This range
equals exactly a 9-bit signed integer, allowing us to leverage the
packssdw instruction.
cvtps2pi mm0, xmm0; // mm0 = [y1 y0]*PRESCALE <int32>
movhlps xmm0, xmm0;
cvtps2pi mm1 xmm0; // mm1 = [y3 y2]*PRESCALE <int32>
#define DESCALE_SHIFT0 9 // 16-9 = 7 bit shift
#define DESCALE_SHIFT 7 // 16-7 = output y[] range is 9-bit int
// (DESCALE+SHIFT0 + DESCALE_SHIFT) must equal the PRESCALE_SHIFT, 16.
//psrad mm0, DESCALE_SHIFT0; // don't need, see below
//psrad mm1, DESCALE_SHIFT0; // don't need, see below
packssdw mm0, mm1; // mm0 = [y3 y2 y1 y0]*PRESCALE <int16>
psraw mm0, DESCALE_SHIFT; // mm0 = [y3 y2 y1 y0]
movq [data], mm0; // data is clipped to {-XXX,+YYY}
"packssdw" applies floor/ceiling (arithmetic saturation) to -32768,+32767
This range equals a 16-bit signed integer. Our goal is to clip the
final output y[] to a 9-bit signed integer. Therefore, we choose
DESCALE_SHIFT to be 16-9=7 bits.
If we choose DESCALE_SHIFT = 7, the descale-factor is 2^7=128.
But from optimization (#1), the PRESCALE factor = 65536 (16-bits.)
Somewhere between the iDCT-input and iDCT-output, the difference of
these two (16-9=7bits) must be descaled.
One possible solution is to insert psrad instructions between the
packssdw and cvtps2pi. The psrad descales by "DESCALE_SHIFT0."
But this solution is wasteful, because the descale_shift0 can be
implicitly performed by adjusting the w[] coefficient table.
Therefore, we choose to scae the coefficient table. Since the table
is stored in floating-point format, the stored constants do not lose
any precision. (The scale/descale factors are integer powers of 2.)
#5) Compiler issues, data alignment to 16-byte address offset
-------------------------------------------------------------
I haven't figured out how to force alignment to 16-bytes, using
Visual C++.
Numerical precision
-------------------
On Intel x86 CPUs, X87-FPU add/multiply operations are always calculated
with 80-bit internal precision, regardless of the input type. In contrast,
Pentium-III SSE calculations are performed with lower precision. Therefore,
all other things being equal, the X87 DCT will be more accurate than
Intel-SSE DCT.
------------------------------------------
LEGAL MUMBO-JUMBO : Disclaimer of Warranty
------------------------------------------
This software is available to the user without any license fee or
royalty on an "as is" basis. The author disclaims any and all warranties,
whether express, implied, or statuary, including any implied warranties or
merchantability or of fitness for a particular purpose. In no event shall
the copyright-holder be liable for any incidental, punitive, or
consequential damages of any kind whatsoever arising from the use of these
programs.
This disclaimer of warranty extends to the user of these programs and user's
customers, employees, agents, transferees, successors, and assigns.
Revision history
----------------
09/03/2000 initial release of 'AP922float_SSE' iDCT
Based on AP922, this iDCT uses SSE-float instructions for enhanced
accuracy. To the best of my knowledge, AP922float is fully
IEEE 1180/1990 compliant iDCT, including output-range clipping
to {-256,+255}
Requires Visual C++ 6.0 Pro, Visual Studio Service Pack 4,
and Visual C++ 6 Processor Pack Beta. SP4 and PPBeta are both
available at http://www.microsoft.com
IMPORTANT, Visual C++ does not seem to align data to 16-byte
address offset. You will need to manually
09/03/2000 liaor@iname.com http://members.tripod.com/~liaor
*/
#define NCOS1_16 (0.980785280403230449126182236134239) // cosine( Pi/16 )
#define NCOS2_16 (0.923879532511286756128183189396788) // cosine( 2Pi/16 )
#define NCOS3_16 (0.831469612302545237078788377617906) // cosine( 3Pi/16 )
#define NCOS4_16 (0.707106781186547524400844362104849) // cosine( 4Pi/16 )
#define NCOS5_16 (0.555570233019602224742830813948533) // cosine( 5Pi/16 )
#define NCOS6_16 (0.382683432365089771728459984030399) // cosine( 6Pi/16 )
#define NCOS7_16 (0.195090322016128267848284868477022) // cosine( 7Pi/16 )
#define TANG1_16 ( NCOS7_16 / NCOS1_16) // tangent( Pi/16)
#define TANG2_16 ( NCOS6_16 / NCOS2_16) // tangent(2Pi/16)
#define TANG3_16 ( NCOS5_16 / NCOS3_16) // tangent(3Pi/16)
// Scaled floating-point coefficient table
// Original (MMX) iDCT SSE-iDCT
// table-ordering -> table-ordering
// ------------------ ------------------
// w00, w02, w04, w06, -> w00, w04, w08, w12,
// w01, w03, w05, w07, -> w01, w05, w09, w13,
// w08, w10, w12, w14, -> w02, w06, w10, w14,
// w09, w11, w13, w15, -> w03, w07, w11, w15,
// w16, w18, w20, w22, -> w16, w20, w24, w28,
// w17, w19, w21, w23, -> w17, w21, w25, w29,
// w24, w26, w28, w30, -> w18, w22, w26, w30,
// w25, w27, w29, w31 -> w19, w23, w27, w31
//
// Compared to the integer-based table, the float table is downscaled by a
// multiplication factor of 0.5
#define FIXED_DOWNSCALE (0.25/(1<<9))
// 'fixed_downscale' includes a descale shift factor, discussed in the
// text at the beginning of this file.
#define RS0 (NCOS4_16*FIXED_DOWNSCALE) // iDCT row#0 scalar
#define RS1 (NCOS1_16*FIXED_DOWNSCALE) // iDCT row#1 scalar
#define RS2 (NCOS2_16*FIXED_DOWNSCALE) // iDCT row#2 scalar
#define RS3 (NCOS3_16*FIXED_DOWNSCALE) // iDCT row#3 scalar
#define RS4 (NCOS4_16*FIXED_DOWNSCALE) // iDCT row#4 scalar
#define RS5 (NCOS3_16*FIXED_DOWNSCALE) // iDCT row#5 scalar
#define RS6 (NCOS2_16*FIXED_DOWNSCALE) // iDCT row#6 scalar
#define RS7 (NCOS1_16*FIXED_DOWNSCALE) // iDCT row#7 scalar
#define ROW_PRESCALE_SHIFT (16) // do not change!
#define DESCALE_SHIFT2 (7)
// compute the round_float_offset, for numerically accurate rounding
#define RND_INV_ROWF (1 << (DESCALE_SHIFT2-1) )
#define RND_FLOAT_OFFSET ((RND_INV_ROWF) - 0.5)
//////////////////////////////////////////////////////////////////////
//
// If the function crashes after compiling, try inserting dummy
// variables to shift around the sse-table's address.
// Otherwise, you should recode the function to dynamically allocate
// RAM for the table. Then generate the table on the fly.
//
const static __int64 useless_padding[1];
const float tab_i_01234567ssef[] =
{
// Row #0
NCOS4_16*RS0, NCOS4_16*RS0, NCOS4_16*RS0, NCOS4_16*RS0,
NCOS2_16*RS0, NCOS6_16*RS0, -NCOS6_16*RS0, -NCOS2_16*RS0,
NCOS4_16*RS0, -NCOS4_16*RS0, -NCOS4_16*RS0, NCOS4_16*RS0,
NCOS6_16*RS0, -NCOS2_16*RS0, NCOS2_16*RS0, -NCOS6_16*RS0,
NCOS1_16*RS0, NCOS3_16*RS0, NCOS5_16*RS0, NCOS7_16*RS0,
NCOS3_16*RS0, -NCOS7_16*RS0, -NCOS1_16*RS0, -NCOS5_16*RS0,
NCOS5_16*RS0, -NCOS1_16*RS0, NCOS7_16*RS0, NCOS3_16*RS0,
NCOS7_16*RS0, -NCOS5_16*RS0, NCOS3_16*RS0, -NCOS1_16*RS0,
// Row #1
NCOS4_16*RS1, NCOS4_16*RS1, NCOS4_16*RS1, NCOS4_16*RS1,
NCOS2_16*RS1, NCOS6_16*RS1, -NCOS6_16*RS1, -NCOS2_16*RS1,
NCOS4_16*RS1, -NCOS4_16*RS1, -NCOS4_16*RS1, NCOS4_16*RS1,
NCOS6_16*RS1, -NCOS2_16*RS1, NCOS2_16*RS1, -NCOS6_16*RS1,
NCOS1_16*RS1, NCOS3_16*RS1, NCOS5_16*RS1, NCOS7_16*RS1,
NCOS3_16*RS1, -NCOS7_16*RS1, -NCOS1_16*RS1, -NCOS5_16*RS1,
NCOS5_16*RS1, -NCOS1_16*RS1, NCOS7_16*RS1, NCOS3_16*RS1,
NCOS7_16*RS1, -NCOS5_16*RS1, NCOS3_16*RS1, -NCOS1_16*RS1,
// Row #2
NCOS4_16*RS2, NCOS4_16*RS2, NCOS4_16*RS2, NCOS4_16*RS2,
NCOS2_16*RS2, NCOS6_16*RS2, -NCOS6_16*RS2, -NCOS2_16*RS2,
NCOS4_16*RS2, -NCOS4_16*RS2, -NCOS4_16*RS2, NCOS4_16*RS2,
NCOS6_16*RS2, -NCOS2_16*RS2, NCOS2_16*RS2, -NCOS6_16*RS2,
NCOS1_16*RS2, NCOS3_16*RS2, NCOS5_16*RS2, NCOS7_16*RS2,
NCOS3_16*RS2, -NCOS7_16*RS2, -NCOS1_16*RS2, -NCOS5_16*RS2,
NCOS5_16*RS2, -NCOS1_16*RS2, NCOS7_16*RS2, NCOS3_16*RS2,
NCOS7_16*RS2, -NCOS5_16*RS2, NCOS3_16*RS2, -NCOS1_16*RS2,
// Row #3
NCOS4_16*RS3, NCOS4_16*RS3, NCOS4_16*RS3, NCOS4_16*RS3,
NCOS2_16*RS3, NCOS6_16*RS3, -NCOS6_16*RS3, -NCOS2_16*RS3,
NCOS4_16*RS3, -NCOS4_16*RS3, -NCOS4_16*RS3, NCOS4_16*RS3,
NCOS6_16*RS3, -NCOS2_16*RS3, NCOS2_16*RS3, -NCOS6_16*RS3,
NCOS1_16*RS3, NCOS3_16*RS3, NCOS5_16*RS3, NCOS7_16*RS3,
NCOS3_16*RS3, -NCOS7_16*RS3, -NCOS1_16*RS3, -NCOS5_16*RS3,
NCOS5_16*RS3, -NCOS1_16*RS3, NCOS7_16*RS3, NCOS3_16*RS3,
NCOS7_16*RS3, -NCOS5_16*RS3, NCOS3_16*RS3, -NCOS1_16*RS3,
// Row #4
NCOS4_16*RS4, NCOS4_16*RS4, NCOS4_16*RS4, NCOS4_16*RS4,
NCOS2_16*RS4, NCOS6_16*RS4, -NCOS6_16*RS4, -NCOS2_16*RS4,
NCOS4_16*RS4, -NCOS4_16*RS4, -NCOS4_16*RS4, NCOS4_16*RS4,
NCOS6_16*RS4, -NCOS2_16*RS4, NCOS2_16*RS4, -NCOS6_16*RS4,
NCOS1_16*RS4, NCOS3_16*RS4, NCOS5_16*RS4, NCOS7_16*RS4,
NCOS3_16*RS4, -NCOS7_16*RS4, -NCOS1_16*RS4, -NCOS5_16*RS4,
NCOS5_16*RS4, -NCOS1_16*RS4, NCOS7_16*RS4, NCOS3_16*RS4,
NCOS7_16*RS4, -NCOS5_16*RS4, NCOS3_16*RS4, -NCOS1_16*RS4,
// Row #5
NCOS4_16*RS5, NCOS4_16*RS5, NCOS4_16*RS5, NCOS4_16*RS5,
NCOS2_16*RS5, NCOS6_16*RS5, -NCOS6_16*RS5, -NCOS2_16*RS5,
NCOS4_16*RS5, -NCOS4_16*RS5, -NCOS4_16*RS5, NCOS4_16*RS5,
NCOS6_16*RS5, -NCOS2_16*RS5, NCOS2_16*RS5, -NCOS6_16*RS5,
NCOS1_16*RS5, NCOS3_16*RS5, NCOS5_16*RS5, NCOS7_16*RS5,
NCOS3_16*RS5, -NCOS7_16*RS5, -NCOS1_16*RS5, -NCOS5_16*RS5,
NCOS5_16*RS5, -NCOS1_16*RS5, NCOS7_16*RS5, NCOS3_16*RS5,
NCOS7_16*RS5, -NCOS5_16*RS5, NCOS3_16*RS5, -NCOS1_16*RS5,
// Row #6
NCOS4_16*RS6, NCOS4_16*RS6, NCOS4_16*RS6, NCOS4_16*RS6,
NCOS2_16*RS6, NCOS6_16*RS6, -NCOS6_16*RS6, -NCOS2_16*RS6,
NCOS4_16*RS6, -NCOS4_16*RS6, -NCOS4_16*RS6, NCOS4_16*RS6,
NCOS6_16*RS6, -NCOS2_16*RS6, NCOS2_16*RS6, -NCOS6_16*RS6,
NCOS1_16*RS6, NCOS3_16*RS6, NCOS5_16*RS6, NCOS7_16*RS6,
NCOS3_16*RS6, -NCOS7_16*RS6, -NCOS1_16*RS6, -NCOS5_16*RS6,
NCOS5_16*RS6, -NCOS1_16*RS6, NCOS7_16*RS6, NCOS3_16*RS6,
NCOS7_16*RS6, -NCOS5_16*RS6, NCOS3_16*RS6, -NCOS1_16*RS6,
// Row #7
⌨️ 快捷键说明
复制代码
Ctrl + C
搜索代码
Ctrl + F
全屏模式
F11
切换主题
Ctrl + Shift + D
显示快捷键
?
增大字号
Ctrl + =
减小字号
Ctrl + -