idct_ap922sse_rawcode.cpp

来自「这是一组DCT和iDCT的代码」· C++ 代码 · 共 1,111 行 · 第 1/3 页
CPP
1,111 行
#include "StdAfx.h"
/*
 ------------------------------------------------------------------------
 idct_ap922sse_rawcode.c - AP922float iDCT, Intel SSE implementation

 Requirements : 
  Visual C++ 6.0 Pro + Visual Studio Service Pack 4 + Processor Pack Beta
  SSE capable CPU (Pentium-III, Celeron-II)
 ------------------------------------------------------------------------

 AP-922 floating-point iDCT (SSE)
 --------------------------------
 
 Intel Application Note AP-922 discusses an MMX
 (scaled integer) implementation of the MPEG fDCT/iDCT operations.
 "AP922float" is a floating-point adaptation of the AP922 algorithm.
 
 The SSE implementation of AP922float follows the AP-922 document
 very closely.  However, since all math operations use 32-bit floats 
 instead of 16/32-bit integers, most of the actual assembly code 
 was rewritten from scratch.
 
 Several coding-issues arose during the development of this file.
 They are summarized here :

  #1) 16-bit integer -> 32-bit float conversion
  --------------------------------------------
     The usual MMX/SSE sequence for converting int->float is as follows:

     a) mov mm0, [int16_data];
        // extract element 0 only

     b) pshufw mm1, mm0, 00000000b; // mm1 = [d0 d0 d0 d0] <16-bit int>

     c) psrad mm1, 16;              // mm1 = [   d0    d0] <32-bit int>

     d) cvtpi2ps xmm0, mm1;         // xmm0= [   d0    d0] <32b float>

        //done

     This sequence has several problems.  pshufw, psrad, cvtpi2ps all issue
     from "port1", meaning the instructions will not pair on Pentium-III.
     Without further optimization, the int->float unpack step wastes many
     instruction slots.

     Although we cannot eliminate decoder contention, we can reduce it by
     using an instruction other than psrad.  For example, we substitute
     "pand" for psrad.

     a) mov mm0, [int16_data];
        // extract element 0 only

     b) pshufw mm1, mm0, 00000000b;// mm1 = [d0 d0 d0 d0] <16-bit int>

     c) pand mm1, mmAndmask;       // mm1 = [d0    d0   ] <32-bit int>*PSCALE
        // PSCALE = 65536, a row_prescale factor

     d) cvtpi2ps xmm0, mm1;        // xmm0= [   xd0   xd0] <32b float>
        // xd0 = d0 * PSCALE

     pand can issue from either port0 or port1.  As several int->float unpack
     operations are overlapped, pand can pair with any other MMX/SSE 
     instruction.  Some instruction slots are still empty, but overall
     less decoder bandwidth is wasted.

  #2) AP922 iDCT_row macro, PMADDWD/PADDD vs MULPS/ADDPS
  -----------------------------------------------------
     AP922 w[] coefficient table is ordered for PMADDWD+PADDD.

     <4 16bit integers>            <4 16bit integers>
     [X4  X0  X4  X0] mm0           [X6  X2  X6  X2] mm2
     [W6  W4  W2  W0] mm1           [W7  W5  W3  W1] mm3
            
            |                              |

       PMADDWD mm0,mm1                PMADDWD mm2,mm3

     [X0*W4 | X0*W0 ]               [X2*W5 | X2*W1 ]
     [  +   |   +   ]               [  +   |   +   ]
     [X4*W6 | X4*W2 ]               [X6*W7 | X6*W3 ]
      2 32-bit integers              2 32-bit integers

                     PADDD mm0, mm2

                      mm0 <= [A1 A0]

         ; a0 =x[0]*w[0]+x[2]*w[1]+x[4]*w[2]+x[6]*w[3];
         ; a1 =x[0]*w[4]+x[2]*w[5]+x[4]*w[6]+x[6]*w[7];

     The key observation is PMADDWD's *horizontal* add operation.

     Intel-SSE has no equivalent for MMX's PMADDWD, so the w[] table must
     be reordered for optimal execution with MULPS + ADDPS.

     // XMM0 = [x0 x0 x0 x0]     
     // XMM1 = [x2 x2 x2 x2]     
     // XMM2 = [x4 x4 x4 x4]     
     // XMM3 = [x6 x6 x6 x6]     

      MULPS XMM0, [TABLE+00];    // XMM0= [x0*w12 x0*w8  x0*w4 x0*w0]
      MULPS XMM1, [TABLE+16];    // XMM0= [x2*w13 x2*w9  x2*w5 x2*w1]
      MULPS XMM2, [TABLE+32];    // XMM0= [x4*w14 x4*w10 x4*w6 x4*w2]
      MULPS XMM3, [TABLE+48];    // XMM0= [x6*w15 x6*w11 x6*w7 x6*w3]
      ADDPS XMM1, XMM0;
      ADDPS XMM2, XMM1;
      ADDPS XMM3, XMM2;          // XMM3 =[  A3     A2     A1    A0 ]


  #3) AP922 iDCT_col macro, descale/shift/round (float->int16 converion)
  ----------------------------------------------------------------------
     The final operation in the AP922float iDCT requires float->int16
     conversion.  This potentially introduces numerical error which
     we've worked so hard to minimize.

     cvttps2pi mm0, xmm0;  // xmm0[y1 y0] -> truncate -> mm0[y1 y0]

     The cvttps2pi stores 32-bit integer from truncatation.  Truncation
     is equivalent to "negative_round_down / positive_round_up."  This
     means the 32-bit integer's magnitude is reduced by 0.5, on average.
     The usual method of adding a rounding_offset does not work, 
     because the offset's sign (+/-) depends on the input value's sign.

     Fortunately, Intel-SSE offers cvtps2pi.  This instruction uses the
     current rounding mode (stored in register mxcsr.)  MXCSR is usually set
     to "round-nearest."  Precise rounding is achieved by pre-adding a
     "correction_offset", then issuing cvtps2pi.

        ADDPS XMM0, [round_float_offset]; // [y3+. y2+. y1+. y0+.]
        CVTPS2PI MM0, XMM0;   // mm0 = [y1 y0]       <int32>
        MOVHLPS XMM0, XMM0;   // XMM0 =[y3 y2 y3 y2] <float>
        CVTPS2PI MM1, XMM0;   // mm1 = [y3 y2]       <int32>

        Execution wise, PADDD(MMX) would be faster than ADDPS, but as
        its name suggests, "round_float_offset" must be added as a 
        floating-point instruction, because numbers <1.0 cannot be 
        represented in integer!


     Recall from optimization (#1), the iDCT column output will produce
     outputs that are scaled up by 65536 (row_prescale.)
     The prescaling can be canceled by compensating the w[] table, but
     we defer discussion until item #4.

     **The above code assumes the rounding-mode bits ("mxcsr register") 
       are set for "round-nearest" mode!  If a different round-mode is
       in use (always round down, or always round up), the constant
       round_float_offset must be changed accordingly.

  #4) iDCT Output range clipping {-256,+255}
  ----------------------------------------------------------------------
     The iDCT's output range must be clipped to {-256,+255}  This range
     equals exactly a 9-bit signed integer, allowing us to leverage the
     packssdw instruction.

     cvtps2pi mm0, xmm0;  // mm0 = [y1 y0]*PRESCALE <int32>
     movhlps xmm0, xmm0;
       cvtps2pi mm1 xmm0; // mm1 = [y3 y2]*PRESCALE <int32>

     #define DESCALE_SHIFT0  9     // 16-9 = 7 bit shift
     #define DESCALE_SHIFT   7     // 16-7 = output y[] range is 9-bit int
     // (DESCALE+SHIFT0 + DESCALE_SHIFT) must equal the PRESCALE_SHIFT, 16.
     

     //psrad  mm0, DESCALE_SHIFT0; // don't need, see below

     //psrad  mm1, DESCALE_SHIFT0; // don't need, see below


     packssdw mm0, mm1;   // mm0 = [y3 y2 y1 y0]*PRESCALE <int16>

     psraw mm0, DESCALE_SHIFT; // mm0 = [y3 y2 y1 y0]
     
     movq [data], mm0;    // data is clipped to {-XXX,+YYY}

     "packssdw" applies floor/ceiling (arithmetic saturation) to -32768,+32767
     This range equals a 16-bit signed integer.  Our goal is to clip the
     final output y[] to a 9-bit signed integer.  Therefore, we choose
     DESCALE_SHIFT to be 16-9=7 bits.

     If we choose DESCALE_SHIFT = 7, the descale-factor is 2^7=128.  
     But from optimization (#1), the PRESCALE factor = 65536 (16-bits.)
     Somewhere between the iDCT-input and iDCT-output, the difference of
     these two (16-9=7bits) must be descaled. 
     
     One possible solution is to insert psrad instructions between the
     packssdw and cvtps2pi.  The psrad descales by "DESCALE_SHIFT0."
     But this solution is wasteful, because the descale_shift0 can be
     implicitly performed by adjusting the w[] coefficient table.

     Therefore, we choose to scae the coefficient table.  Since the table
     is stored in floating-point format, the stored constants do not lose
     any precision.  (The scale/descale factors are integer powers of 2.)
     
  #5) Compiler issues, data alignment to 16-byte address offset
  -------------------------------------------------------------

     I haven't figured out how to force alignment to 16-bytes, using 
     Visual C++.

 Numerical precision
 -------------------
  On Intel x86 CPUs, X87-FPU add/multiply operations are always calculated 
  with 80-bit internal precision, regardless of the input type.  In contrast,
  Pentium-III SSE calculations are performed with lower precision.  Therefore,
  all other things being equal, the X87 DCT will be more accurate than 
  Intel-SSE DCT.


 ------------------------------------------
 LEGAL MUMBO-JUMBO : Disclaimer of Warranty
 ------------------------------------------
  This software is available to the user without any license fee or
  royalty on an "as is" basis.  The author disclaims any and all warranties,
  whether express, implied, or statuary, including any implied warranties or 
  merchantability or of fitness for a particular purpose.  In no event shall 
  the copyright-holder be liable for any incidental, punitive, or 
  consequential damages of any kind whatsoever arising from the use of these
  programs.

  This disclaimer of warranty extends to the user of these programs and user's
  customers, employees, agents, transferees, successors, and assigns.

 Revision history
 ----------------
 09/03/2000 initial release of 'AP922float_SSE' iDCT
            Based on AP922, this iDCT uses SSE-float instructions for enhanced
            accuracy.  To the best of my knowledge, AP922float is fully
            IEEE 1180/1990 compliant iDCT, including output-range clipping
            to {-256,+255}

            Requires Visual C++ 6.0 Pro, Visual Studio Service Pack 4,
             and Visual C++ 6 Processor Pack Beta.  SP4 and PPBeta are both
             available at http://www.microsoft.com

            IMPORTANT, Visual C++ does not seem to align data to 16-byte
            address offset.  You will need to manually 
            
 09/03/2000 liaor@iname.com   http://members.tripod.com/~liaor
*/

#define NCOS1_16 (0.980785280403230449126182236134239) // cosine(  Pi/16 )
#define NCOS2_16 (0.923879532511286756128183189396788) // cosine( 2Pi/16 )
#define NCOS3_16 (0.831469612302545237078788377617906) // cosine( 3Pi/16 )
#define NCOS4_16 (0.707106781186547524400844362104849) // cosine( 4Pi/16 )
#define NCOS5_16 (0.555570233019602224742830813948533) // cosine( 5Pi/16 )
#define NCOS6_16 (0.382683432365089771728459984030399) // cosine( 6Pi/16 )
#define NCOS7_16 (0.195090322016128267848284868477022) // cosine( 7Pi/16 )

#define TANG1_16 ( NCOS7_16 / NCOS1_16) // tangent( Pi/16)
#define TANG2_16 ( NCOS6_16 / NCOS2_16) // tangent(2Pi/16)
#define TANG3_16 ( NCOS5_16 / NCOS3_16) // tangent(3Pi/16)

// Scaled floating-point coefficient table
//    Original (MMX) iDCT         SSE-iDCT 
//     table-ordering      ->    table-ordering
//    ------------------       ------------------
//    w00, w02, w04, w06,  ->  w00, w04, w08, w12,
//    w01, w03, w05, w07,  ->  w01, w05, w09, w13,
//    w08, w10, w12, w14,  ->  w02, w06, w10, w14,
//    w09, w11, w13, w15,  ->  w03, w07, w11, w15,
//    w16, w18, w20, w22,  ->  w16, w20, w24, w28,
//    w17, w19, w21, w23,  ->  w17, w21, w25, w29,
//    w24, w26, w28, w30,  ->  w18, w22, w26, w30,
//    w25, w27, w29, w31   ->  w19, w23, w27, w31
//
// Compared to the integer-based table, the float table is downscaled by a
// multiplication factor of 0.5

#define FIXED_DOWNSCALE (0.25/(1<<9))
 // 'fixed_downscale' includes a descale shift factor, discussed in the
 // text at the beginning of this file.

#define RS0 (NCOS4_16*FIXED_DOWNSCALE)  // iDCT row#0 scalar
#define RS1 (NCOS1_16*FIXED_DOWNSCALE)  // iDCT row#1 scalar
#define RS2 (NCOS2_16*FIXED_DOWNSCALE)  // iDCT row#2 scalar
#define RS3 (NCOS3_16*FIXED_DOWNSCALE)  // iDCT row#3 scalar
#define RS4 (NCOS4_16*FIXED_DOWNSCALE)  // iDCT row#4 scalar
#define RS5 (NCOS3_16*FIXED_DOWNSCALE)  // iDCT row#5 scalar
#define RS6 (NCOS2_16*FIXED_DOWNSCALE)  // iDCT row#6 scalar
#define RS7 (NCOS1_16*FIXED_DOWNSCALE)  // iDCT row#7 scalar

#define ROW_PRESCALE_SHIFT (16) // do not change!
#define DESCALE_SHIFT2 (7)

// compute the round_float_offset, for numerically accurate rounding
#define RND_INV_ROWF (1 << (DESCALE_SHIFT2-1) )
#define RND_FLOAT_OFFSET ((RND_INV_ROWF) - 0.5)

//////////////////////////////////////////////////////////////////////
//
//  If the function crashes after compiling, try inserting dummy
//  variables to shift around the sse-table's address.
//  Otherwise, you should recode the function to dynamically allocate
//  RAM for the table.  Then generate the table on the fly.
//

 const static __int64 useless_padding[1];

const float tab_i_01234567ssef[] =
{ 
 // Row #0
  NCOS4_16*RS0,  NCOS4_16*RS0,  NCOS4_16*RS0,  NCOS4_16*RS0, 
  NCOS2_16*RS0,  NCOS6_16*RS0, -NCOS6_16*RS0, -NCOS2_16*RS0, 
  NCOS4_16*RS0, -NCOS4_16*RS0, -NCOS4_16*RS0,  NCOS4_16*RS0, 
  NCOS6_16*RS0, -NCOS2_16*RS0,  NCOS2_16*RS0, -NCOS6_16*RS0, 
  NCOS1_16*RS0,  NCOS3_16*RS0,  NCOS5_16*RS0,  NCOS7_16*RS0, 
  NCOS3_16*RS0, -NCOS7_16*RS0, -NCOS1_16*RS0, -NCOS5_16*RS0, 
  NCOS5_16*RS0, -NCOS1_16*RS0,  NCOS7_16*RS0,  NCOS3_16*RS0, 
  NCOS7_16*RS0, -NCOS5_16*RS0,  NCOS3_16*RS0, -NCOS1_16*RS0, 

 // Row #1
  NCOS4_16*RS1,  NCOS4_16*RS1,  NCOS4_16*RS1,  NCOS4_16*RS1, 
  NCOS2_16*RS1,  NCOS6_16*RS1, -NCOS6_16*RS1, -NCOS2_16*RS1, 
  NCOS4_16*RS1, -NCOS4_16*RS1, -NCOS4_16*RS1,  NCOS4_16*RS1, 
  NCOS6_16*RS1, -NCOS2_16*RS1,  NCOS2_16*RS1, -NCOS6_16*RS1, 
  NCOS1_16*RS1,  NCOS3_16*RS1,  NCOS5_16*RS1,  NCOS7_16*RS1, 
  NCOS3_16*RS1, -NCOS7_16*RS1, -NCOS1_16*RS1, -NCOS5_16*RS1, 
  NCOS5_16*RS1, -NCOS1_16*RS1,  NCOS7_16*RS1,  NCOS3_16*RS1, 
  NCOS7_16*RS1, -NCOS5_16*RS1,  NCOS3_16*RS1, -NCOS1_16*RS1, 

 // Row #2
  NCOS4_16*RS2,  NCOS4_16*RS2,  NCOS4_16*RS2,  NCOS4_16*RS2, 
  NCOS2_16*RS2,  NCOS6_16*RS2, -NCOS6_16*RS2, -NCOS2_16*RS2, 
  NCOS4_16*RS2, -NCOS4_16*RS2, -NCOS4_16*RS2,  NCOS4_16*RS2, 
  NCOS6_16*RS2, -NCOS2_16*RS2,  NCOS2_16*RS2, -NCOS6_16*RS2, 
  NCOS1_16*RS2,  NCOS3_16*RS2,  NCOS5_16*RS2,  NCOS7_16*RS2, 
  NCOS3_16*RS2, -NCOS7_16*RS2, -NCOS1_16*RS2, -NCOS5_16*RS2, 
  NCOS5_16*RS2, -NCOS1_16*RS2,  NCOS7_16*RS2,  NCOS3_16*RS2, 
  NCOS7_16*RS2, -NCOS5_16*RS2,  NCOS3_16*RS2, -NCOS1_16*RS2, 

 // Row #3
  NCOS4_16*RS3,  NCOS4_16*RS3,  NCOS4_16*RS3,  NCOS4_16*RS3, 
  NCOS2_16*RS3,  NCOS6_16*RS3, -NCOS6_16*RS3, -NCOS2_16*RS3, 
  NCOS4_16*RS3, -NCOS4_16*RS3, -NCOS4_16*RS3,  NCOS4_16*RS3, 
  NCOS6_16*RS3, -NCOS2_16*RS3,  NCOS2_16*RS3, -NCOS6_16*RS3, 
  NCOS1_16*RS3,  NCOS3_16*RS3,  NCOS5_16*RS3,  NCOS7_16*RS3, 
  NCOS3_16*RS3, -NCOS7_16*RS3, -NCOS1_16*RS3, -NCOS5_16*RS3, 
  NCOS5_16*RS3, -NCOS1_16*RS3,  NCOS7_16*RS3,  NCOS3_16*RS3, 
  NCOS7_16*RS3, -NCOS5_16*RS3,  NCOS3_16*RS3, -NCOS1_16*RS3, 

 // Row #4
  NCOS4_16*RS4,  NCOS4_16*RS4,  NCOS4_16*RS4,  NCOS4_16*RS4, 
  NCOS2_16*RS4,  NCOS6_16*RS4, -NCOS6_16*RS4, -NCOS2_16*RS4, 
  NCOS4_16*RS4, -NCOS4_16*RS4, -NCOS4_16*RS4,  NCOS4_16*RS4, 
  NCOS6_16*RS4, -NCOS2_16*RS4,  NCOS2_16*RS4, -NCOS6_16*RS4, 
  NCOS1_16*RS4,  NCOS3_16*RS4,  NCOS5_16*RS4,  NCOS7_16*RS4, 
  NCOS3_16*RS4, -NCOS7_16*RS4, -NCOS1_16*RS4, -NCOS5_16*RS4, 
  NCOS5_16*RS4, -NCOS1_16*RS4,  NCOS7_16*RS4,  NCOS3_16*RS4, 
  NCOS7_16*RS4, -NCOS5_16*RS4,  NCOS3_16*RS4, -NCOS1_16*RS4, 

 // Row #5
  NCOS4_16*RS5,  NCOS4_16*RS5,  NCOS4_16*RS5,  NCOS4_16*RS5, 
  NCOS2_16*RS5,  NCOS6_16*RS5, -NCOS6_16*RS5, -NCOS2_16*RS5, 
  NCOS4_16*RS5, -NCOS4_16*RS5, -NCOS4_16*RS5,  NCOS4_16*RS5, 
  NCOS6_16*RS5, -NCOS2_16*RS5,  NCOS2_16*RS5, -NCOS6_16*RS5, 
  NCOS1_16*RS5,  NCOS3_16*RS5,  NCOS5_16*RS5,  NCOS7_16*RS5, 
  NCOS3_16*RS5, -NCOS7_16*RS5, -NCOS1_16*RS5, -NCOS5_16*RS5, 
  NCOS5_16*RS5, -NCOS1_16*RS5,  NCOS7_16*RS5,  NCOS3_16*RS5, 
  NCOS7_16*RS5, -NCOS5_16*RS5,  NCOS3_16*RS5, -NCOS1_16*RS5, 

 // Row #6
  NCOS4_16*RS6,  NCOS4_16*RS6,  NCOS4_16*RS6,  NCOS4_16*RS6, 
  NCOS2_16*RS6,  NCOS6_16*RS6, -NCOS6_16*RS6, -NCOS2_16*RS6, 
  NCOS4_16*RS6, -NCOS4_16*RS6, -NCOS4_16*RS6,  NCOS4_16*RS6, 
  NCOS6_16*RS6, -NCOS2_16*RS6,  NCOS2_16*RS6, -NCOS6_16*RS6, 
  NCOS1_16*RS6,  NCOS3_16*RS6,  NCOS5_16*RS6,  NCOS7_16*RS6, 
  NCOS3_16*RS6, -NCOS7_16*RS6, -NCOS1_16*RS6, -NCOS5_16*RS6, 
  NCOS5_16*RS6, -NCOS1_16*RS6,  NCOS7_16*RS6,  NCOS3_16*RS6, 
  NCOS7_16*RS6, -NCOS5_16*RS6,  NCOS3_16*RS6, -NCOS1_16*RS6, 

 // Row #7
idct_ap922sse_rawcode.cpp - 源码说明

本页面展示了「这是一组DCT和iDCT的代码」中的 idct_ap922sse_rawcode.cpp 源码文件，采用 C++ 编程语言编写，共 1,111 行代码。您可以在线阅读完整代码内容，也可以返回资源详情页下载完整源码包进行本地学习和开发。
虫虫下载站收录了大量与iDCT相关的技术资源，包括源代码、技术文档、电路图等，是电子工程师和嵌入式开发者的专业学习平台。
⌨️ 快捷键说明

复制代码Ctrl + C
搜索代码Ctrl + F
全屏模式F11
增大字号Ctrl + =
减小字号Ctrl + -
显示快捷键?