⭐ 欢迎来到虫虫下载站! | 📦 资源下载 📁 资源专辑 ℹ️ 关于我们
⭐ 虫虫下载站

📄 idct_ap922x87.cpp

📁 这是一组DCT和iDCT的代码
💻 CPP
📖 第 1 页 / 共 2 页
字号:
 
  short *x; // pointer to input
  short *y; // pointer to final output
  ssefloat *yr, *xc; // pointer to output row
  ssefloat *ti; // pointer to IDCT row coefficient table

  int i;

  ////////////////////////////////////////////////////////////////////////
  //
  // AP922float iDCT row transform (x87 version)
  //
  // Row IDCT operator :A_T*M_T*P_T
  // Let Y=[output column data, 8 elements] 32-bit IEEE-754 float
  //     X=[input column data, 8 elements] 16-bit short integer
  //
  //     Y= [ A_T*M_T*P_T ] * X
  //
  //   (Y and X are both column vectors)

  for ( i = 0; i < 8; ++i ) // row iDCT
  {
    ti = (float*)&tab_i_01234567x87[ i * 32 ];
    x = &data[ 8*i ];
    yr = ((ssefloat *) fTempArray) + 8*i; // intermediate output, select row#i

    // *Prescale input values by left-shifting them
    //
    // The prescaling operation duplicates the Intel-SSE code.  
    //   The prescaling arises from an optimzation within the actual SSE 
    // code.  Our SSE code uses "pshufw+pand" to convert 16-bit integers
    // (short) to 32-bit integers(long.)  Using 'pand' instead of 'psrad'
    // reduces port-contention for the MMXSHIFT/PFADDER/PSHUFW resources.
    // However, the optimization's side effect is a scaleup of 65536 
    // (left shift by 16.)
    //
    // Here, the prescaling is included in this C-Code to mimic the SSE code.

    // 0) apply P_T operator, permute input values

    xi[0] = ROW_PRESCALE( x[0] );
    xi[1] = ROW_PRESCALE( x[1] );
    xi[2] = ROW_PRESCALE( x[2] );
    xi[3] = ROW_PRESCALE( x[3] );
    xi[4] = ROW_PRESCALE( x[4] );
    xi[5] = ROW_PRESCALE( x[5] );
    xi[6] = ROW_PRESCALE( x[6] );
    xi[7] = ROW_PRESCALE( x[7] );

//////////////////////////////////////////////////////////////////////
//
// original table order,
//    The scaled-int (MMX) iDCT implementation uses "PMADDWD", which
//    simultaneously multiplies and adds adjacent data elements.  The
//    original AP922 w[] table is ordered for optimum execution
//    with pmaddwd and paddd.
//
//    a0 = x[0] * ti[0]  + x[4] * ti[1]  + x[2] * ti[4]  + x[6] * ti[5];
//    a1 = x[0] * ti[2]  + x[4] * ti[3]  + x[2] * ti[6]  + x[6] * ti[7];
//    a2 = x[0] * ti[8]  + x[4] * ti[9]  + x[2] * ti[12] + x[6] * ti[13];
//    a3 = x[0] * ti[10] + x[4] * ti[11] + x[2] * ti[14] + x[6] * ti[15];
//
//    b0 = x[1] * ti[16] + x[5] * ti[17] + x[3] * ti[20] + x[7] * ti[21];
//    b1 = x[1] * ti[18] + x[5] * ti[19] + x[3] * ti[22] + x[7] * ti[23];
//    b2 = x[1] * ti[24] + x[5] * ti[25] + x[3] * ti[28] + x[7] * ti[29];
//    b3 = x[1] * ti[26] + x[5] * ti[27] + x[3] * ti[30] + x[7] * ti[31];    
//
//////////////////////////////////////////////////////////////////////


    // 1) Apply M_T operator 
    //    
    //    Since Intel SSE has no floating-point equivalent of pmaddwd,
    //    the w[] coefficient table must be changed.  The new table-order
    //    works with the available MULPS and ADDPS instructions.

    //    SSE packing :
    //
    //    let a_vX[] = ( a_vX[0] )   ,  where [0] is stored first (LSB)
    //                 ( a_vX[1] )   
    //                 ( a_vX[2] )
    //                 ( a_vX[3] )
    //
    //     ;// xmm0 <= [ xi[0] xi[0] xi[0] xi[0] ]
    //     ;// xmm1 <= [ xi[2] xi[2] xi[2] xi[2] ]
    //     ;// xmm2 <= [ xi[4] xi[4] xi[4] xi[4] ]
    //     ;// xmm3 <= [ xi[6] xi[6] xi[6] xi[6] ]
    //
    //     ;// TABLE[0 ] <= [ w12 w8  w4 w0] ;// 4 floats (16 bytes)
    //     ;// TABLE[4 ] <= [ w13 w9  w5 w1] ;// 4 floats (16 bytes)
    //     ;// TABLE[8 ] <= [ w14 w10 w6 w2] ;// 4 floats (16 bytes)
    //     ;// TABLE[12] <= [ w15 w11 w7 w3] ;// 4 floats (16 bytes)

    //   1a) Compute "vertical" vectors (a_v0[], a_v1[], ... a_v3[])
    //
    //    MULPS xmm0, [TABLE + 0*16]; // xmm0 <= a_v0[]
    //    MULPS xmm1, [TABLE + 1*16]; // xmm1 <= a_v1[]
    //    MULPS xmm2, [TABLE + 2*16]; // xmm2 <= a_v2[]
    //    MULPS xmm3, [TABLE + 3*16]; // xmm3 <= a_v3[]
    //
    //   1b) Sum the vertical vectors to produce a0, a1, a2, a3
    //
    //    ADDPS xmm1, xmm0;  // xmm1 <= a_v1[] + ( a_v0[]                   )
    //    ADDPS xmm2, xmm1; //  xmm2 <= a_v2[] + ( a_v0[] + a_v1[]          )
    //    ADDPS xmm3, xmm2; //  xmm3 <= a_v3[] + ( a_v0[] + a_v1[] + a_v2[] )
    //
    //     ;// xmm3 <= [ a3 a2 a1 a0 ]
    //
    //   Repeat (1a) and (1b) for the vectors b_v[], 
    //     ...
    //     ...
    //     ;// xmm7 <= [ b3 b2 b1 b0 ]
    //


 // a_v[]=  ( a_v0[] ) +  ( a_v1[] )  +  ( a_v2[] )  +  ( a_v3[] );
    a[0] = xi[0]*ti[0] + xi[2]*ti[4]  + xi[4]*ti[8]  + xi[6]*ti[12];
    a[1] = xi[0]*ti[1] + xi[2]*ti[5]  + xi[4]*ti[9]  + xi[6]*ti[13];
    a[2] = xi[0]*ti[2] + xi[2]*ti[6]  + xi[4]*ti[10] + xi[6]*ti[14];
    a[3] = xi[0]*ti[3] + xi[2]*ti[7]  + xi[4]*ti[11] + xi[6]*ti[15];

 // b_v[]=  ( b_v0[] ) +  ( b_v1[] )  +  ( b_v2[] )  +  ( b_v3[] );
    b[0] = xi[1]*ti[16]+ xi[3]*ti[20] + xi[5]*ti[24] + xi[7]*ti[28];
    b[1] = xi[1]*ti[17]+ xi[3]*ti[21] + xi[5]*ti[25] + xi[7]*ti[29];
    b[2] = xi[1]*ti[18]+ xi[3]*ti[22] + xi[5]*ti[26] + xi[7]*ti[30];
    b[3] = xi[1]*ti[19]+ xi[3]*ti[23] + xi[5]*ti[27] + xi[7]*ti[31];

    // 2) Apply A_T operator, store outputs
    //
    //    1  0  0  0   1  0  0  0
    //    0  1  0  0   0  1  0  0
    //    0  0  1  0   0  0  1  0
    //    0  0  0  1   0  0  0  1
    //    0  0  0  1   0  0  0 -1
    //    0  0  1  0   0  0 -1  0
    //    0  1  0  0   0 -1  0  0
    //    1  0  0  0  -1  0  0  0

    //    Since the internal calculations produce floating-point results,
    //    the row-store operation does not round/scale the results.
    //
    //       ;// xmm3 <= [ a3 a2 a1 a0 ]
    //       ;// xmm7 <= [ b3 b2 b1 b0 ]
    //
    //      MOVPS xmm4, xmm3; // xmm3 <= [a3 a2 a1 a0]
    //
    //      SUBPS xmm3, xmm7; // xmm3 <= [a3-b3 a2-b2 a1-b1 a0-b0]
    //      ADDPS xmm4, xmm7; // xmm4 <= [a3+b3 a2+b2 a1+b1 a0+b0]
    //
    //       ;// xmm4 <= [ y3 y2 y1 y0]
    //       ;// xmm3 <= [ y4 y5 y6 y7]  *** (reversed order)
    //
    //      SHUFPS xmm3, xmm3, 00011011b;// xmm4 <=[y7 y6 y5 y4]; // shuffle
    //
    //       ;// xmm3 <= [ y7 y6 y5 y4]  (correct order)
    //
    //      MOVPS [ OUTR + 0*16 ], xmm4; // store [y3 y2 y1 y0]
    //      MOVPS [ OUTR + 1*16 ], xmm3; // store [y7 y6 y5 y4]

    yr[0] = ( a[0] + b[0] );
    yr[1] = ( a[1] + b[1] );
    yr[2] = ( a[2] + b[2] );
    yr[3] = ( a[3] + b[3] );

    yr[4] = ( a[3] - b[3] );
    yr[5] = ( a[2] - b[2] );
    yr[6] = ( a[1] - b[1] );
    yr[7] = ( a[0] - b[0] );
  } // end for( i = 0; i < 8; ++i ) // end of row iDCT


  ////////////////////////////////////////////////////////////////////////
  //
  // AP922float iDCT column transform
  // Base code originally from VirtualDub (Avery Lee),
  //   http://www.concentric.net/~psilon

  // Column IDCT operator :A_T*(F_T*E_T*B_T*D_T)*P_T
  // Let Y=[output column data, 8 elements], 16-bit short integer
  //     X=[input column data, 8 elements], 32-bit IEEE-754 float
  //
  //     Y= [ A_T*(F_T*E_T*B_T*D_T)*P_T ] * X
  //
  //   (Y and X are both column vectors)

  for ( i = 0; i < 8; ++i ) // column iDCT
  {
    xc = ((ssefloat *) fTempArray) + i; // select column #i
    y = &data[ i ];

    // 1) Apply (D_T * P_T) - the cos() coefficients of D_T are implicit
    //    in the idct_row operation.  But we still need to apply the
    //    shuffling operation of D_T.
    //
    //    1  0  0  0   0  0  0  0
    //    0  0  0  0   1  0  0  0
    //    0  0  1  0   0  0  0  0
    //    0  0  0  0   0  0  1  0
    //    0  1  0  0   0  0  0  0
    //    0  0  0  0   0  0  0  1
    //    0  0  0  1   0  0  0  0
    //    0  0  0  0   0  1  0  0

    dp[0] = xc[ 0 *8];
    dp[1] = xc[ 4 *8];
    dp[2] = xc[ 2 *8];
    dp[3] = xc[ 6 *8];

    dp[4] = xc[ 1 *8];
    dp[5] = xc[ 7 *8];
    dp[6] = xc[ 3 *8];
    dp[7] = xc[ 5 *8];
 
    // 2) Apply B_T
    //
    //    1  1  0  0
    //    1 -1  0  0
    //    0  0  1 t2
    //    0  0 t2 -1
    //                1 t1  0  0
    //               t1 -1  0  0
    //                0  0  1 t3
    //                0  0 t3 -1
 
 
    b[0] =   dp[1]            + dp[0];
    b[1] =   dp[0]            - dp[1];

    b[2] = ( dp[3]*tg_2_16f ) + dp[2];
    b[3] = ( dp[2]*tg_2_16f ) - dp[3];

    b[4] = ( dp[5]*tg_1_16f ) + dp[4];
    b[5] = ( dp[4]*tg_1_16f ) - dp[5];

    b[6] = ( dp[7]*tg_3_16f ) + dp[6];
    b[7] = ( dp[6]*tg_3_16f ) - dp[7];
  
    // 3) Apply E_T
    //
    //    1  0  1  0
    //    0  1  0  1
    //    0  1  0 -1
    //    1  0 -1  0
    //                1  0  1  0
    //                1  0 -1  0
    //                0  1  0  1
    //                0  1  0 -1
 
    e[0] = b[0] + b[2];
    e[1] = b[1] + b[3];
    e[2] = b[1] - b[3];
    e[3] = b[0] - b[2];
    e[4] = b[4] + b[6];
    e[5] = b[4] - b[6];
    e[6] = b[5] + b[7];
    e[7] = b[5] - b[7];

    // 4) Apply F_T
    //
    //    1  0  0  0
    //    0  1  0  0
    //    0  0  1  0
    //    0  0  0  1
    //                1  0  0  0
    //                0  1  0  0
    //                0  0  1  0
    //                0  0  0  1

#define _F0 e[0] 
#define _F1 e[1] 
#define _F2 e[2] 
#define _F3 e[3] 
#define _F4 e[4] 
#define _F5 e[5] 
#define _F6 e[6] 
#define _F7 e[7] 

    tmp[0] = (e[5] + e[6]) * cos_4_16f;
    tmp[1] = (e[5] - e[6]) * cos_4_16f;
    _F5 = tmp[0];
    _F6 = tmp[1];

    // 5) Apply A_T
    //
    //    1  0  0  0   1  0  0  0
    //    0  1  0  0   0  1  0  0
    //    0  0  1  0   0  0  1  0
    //    0  0  0  1   0  0  0  1
    //    0  0  0  1   0  0  0 -1
    //    0  0  1  0   0  0 -1  0
    //    0  1  0  0   0 -1  0  0
    //    1  0  0  0  -1  0  0  0
    //
    //    yfloat[0]= F0 + F4
    //    yfloat[1]= F1 + F5
    //    yfloat[2]= F2 + F6
    //    yfloat[3]= F3 + F7
    //           
    //    yfloat[4]= F3 - F7
    //    yfloat[5]= F2 - F6
    //    yfloat[6]= F1 - F5
    //    yfloat[7]= F0 - F4
    //
    //
    // 6) float -> int, shift/round to final output y[]
    //    The final shift&round operation reverses the row-input prescaling.
    //    It also applies the chosen rounding-mode (accurate or fast.)
    //
    //    Note, the C-code below differs *substantially* from the Intel-SSE 
    //    implementation.  Intel_SSE uses this basic sequence:
    //
    //       ADDPS xmm0, [round_compensation for y3 y2 y1 y0]  <float32>
    //
    //       cvtps2pi mm0,xmm0;  // mm0 <= [y1 y0] <int32>
    //     
    //       movhlps xmm0, xmm0; // xmm0 <=[y3 y2 y3 y2]
    //       cvtps2pi mm1,xmm0;  // mm1 <= [y3 y2] <int32>
    //
    //       ;// psrad is incorporated in the iDCT row coeff table w[]
    // 
    //       ;//psrad mm0, DESCALE_SHIFT1;  // NOT USED in ap922float_sse
    //       ;//psrad mm1, DESCALE_SHIFT1;  // NOT USED in ap922float_sse
    //
    //
    //       packssdw mm0, mm1;  // mm0 <= [y3 y2 y1 y0] <int16>
    //
    //       psraw mm0, 7;       // mm0 clipped to {-256,+255}
    //
    //       movq [ OUTC + 0], mm0;  // store
 

    y[0*8] = SHIFT_ROUND_COLF( _F0 + _F4 );
    y[1*8] = SHIFT_ROUND_COLF( _F1 + _F5 );
    y[2*8] = SHIFT_ROUND_COLF( _F2 + _F6 );
    y[3*8] = SHIFT_ROUND_COLF( _F3 + _F7 );

    y[4*8] = SHIFT_ROUND_COLF( _F3 - _F7 );
    y[5*8] = SHIFT_ROUND_COLF( _F2 - _F6 );
    y[6*8] = SHIFT_ROUND_COLF( _F1 - _F5 );
    y[7*8] = SHIFT_ROUND_COLF( _F0 - _F4 );

  } // end for ( i = 0; i < 8; ++i ) // end of SSEfloat column iDCT


  //    In standard-C, the output clip code adds significantly to 
  //    execution time.  

#define IDCT_CLIP_LOW -256  // IDCT output range is 9-bits
#define IDCT_CLIP_HIGH 255  // IDCT output range is 9-bits

  for ( i = 0; i < 64; ++i )
  { // clip output to {-256,+255}
    if ( data[ i ] < IDCT_CLIP_LOW )
      data[ i ] = IDCT_CLIP_LOW;

    if ( data[ i ] > IDCT_CLIP_HIGH )
      data[ i ] = IDCT_CLIP_HIGH;
  }
}

⌨️ 快捷键说明

复制代码 Ctrl + C
搜索代码 Ctrl + F
全屏模式 F11
切换主题 Ctrl + Shift + D
显示快捷键 ?
增大字号 Ctrl + =
减小字号 Ctrl + -