/*******************************************************************************
Copyright(c) 2000 - 2002 Analog Devices. All Rights Reserved.
Developed by Joint Development Software Application Team, IPDC, Bangalore, India
for Blackfin DSPs  ( Micro Signal Architecture 1.0 specification).

By using this module you agree to the terms of the Analog Devices License
Agreement for DSP Software. 
********************************************************************************
Module Name     : r8x8invdct_ieee.asm
Label name      :  __r8x8invdct_ieee
Version         :   1.3
Change History  :

                Version     Date          Author        Comments
                1.3         11/18/2002    Swarnalatha   Tested with VDSP++ 3.0
                                                        compiler 6.2.2 on 
                                                        ADSP-21535 Rev.0.2
                1.2         11/13/2002    Swarnalatha   Tested with VDSP++3.0
                                                        on ADSP-21535 Rev.0.2
                1.1         02/19/2002    Vijay         Modified to match
                                                        silicon cycle count
                1.0         03/13/2001    shailendra    Original 

Description     : This is an implementation of Chen's algorithm for the IDCT.
                  It exploits the separability of the multidimensional IDCT.
                  The input is an 8x8 matrix of real data. First, a
                  one-dimensional 8-point IDCT is computed for each of the 8
                  rows, and the output is stored transposed in a separate
                  matrix. Then an 8-point IDCT is computed on each row of that
                  matrix, and the output is again stored transposed. This is
                  the final output.
                 
                  Chen's algorithm has 4 stages (parts) of implementation.

                  This implementation works only for 8x8 input. The input data
                  must be real and in the range -256 to 255.

                  The algorithm operates in place.

                  Note :  The algorithm reads the input data from the "in"
                  matrix at bit-reversed locations.
                  First, the 8-point IDCT is computed for all 8 rows, and the
                  output is stored in the "temp" buffer in transposed form.
                  The 8-point IDCT is then applied to all 8 rows of the "temp"
                  buffer, and the final output is stored in the "in" buffer in
                  transposed form. The transposition and the bit-reversed
                  addressing are folded into the data writes, so no separate
                  code is needed for them.
                  The output of the function is returned in the "in" buffer in
                  normal order.
                  This function passes all the IEEE-1180 test cases.

Prototype       : void _r8x8invdct_ieee(fract16 *in, fract16 *coeff,
                                        fract16 *temp);

                         *in -> Pointer to Input vector.
                         *coeff -> Pointer to coefficients.
                         *temp -> Pointer to temporary data. 

Registers Used  : A0, A1, R0-R7, I0-I3, B0, B2, B3, M0-M3, L0-L3, P0-P5, LC0.

Performance     : 
                    Code Size   : 498 Bytes.
                    Cycle Count : 417 Cycles
******************************************************************************/
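/*
    Illustrative reference sketch in C, not part of the original module and
    not verified against it. It shows only the two-pass structure described
    above: an 8-point IDCT on every row with a transposed store into "temp",
    then the same pass from "temp" back into "in". It assumes double
    precision and orthonormal DCT scaling; the fixed-point scaling, rounding
    and bit-reversed read order of the assembly are not modeled. All names
    (COS16, cos16, idct8_ref, idct_rows_transposed, idct8x8_ref) are
    hypothetical.

```c
#include <assert.h>
#include <stddef.h>

// cos(k*pi/16) for k = 0..8, hard-coded so the sketch needs no math library.
static const double COS16[9] = {
    1.0,              0.98078528040323, 0.92387953251129,
    0.83146961230255, 0.70710678118655, 0.55557023301960,
    0.38268343236509, 0.19509032201613, 0.0
};

// cos(m*pi/16) for any non-negative integer m, using the cosine symmetries.
static double cos16(int m) {
    m %= 32;                            // period of 2*pi
    if (m > 16) m = 32 - m;             // cos(2*pi - a) = cos(a)
    if (m > 8) return -COS16[16 - m];   // cos(pi - a) = -cos(a)
    return COS16[m];
}

// Direct-form 8-point 1-D IDCT of one row (reference only, not Chen's form).
static void idct8_ref(const double *x, double *y) {
    for (int n = 0; n < 8; n++) {
        double acc = 0.0;
        for (int k = 0; k < 8; k++) {
            double scale = (k == 0) ? 0.5 * COS16[4] : 0.5;
            acc += scale * x[k] * cos16((2 * n + 1) * k);
        }
        y[n] = acc;
    }
}

// One pass: IDCT every row of src and store the result transposed in dst.
static void idct_rows_transposed(const double *src, double *dst) {
    for (int r = 0; r < 8; r++) {
        double row[8];
        idct8_ref(src + 8 * r, row);
        for (int c = 0; c < 8; c++) {
            dst[8 * c + r] = row[c];
        }
    }
}

// Two passes: in -> temp (transposed), then temp -> in (transposed again).
void idct8x8_ref(double *in, double *temp) {
    assert(in != NULL && temp != NULL);
    idct_rows_transposed(in, temp);
    idct_rows_transposed(temp, in);
}
```

    Because each pass transposes its output, the two transposes cancel and
    the result lands in "in" in normal row-major order.
*/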
.section    L1_code;
.global     __r8x8invdct_ieee;
.align      8;
    
__r8x8invdct_ieee:

/********************** Function Prologue *********************************/
    [--SP] = (R7:4, P5:3);  // Pushing the registers on stack.
    B0 = R0;                // Pointer to Input matrix.
    B3 = R1;                // Pointer to Coefficients
    B2 = R2;                // Pointer to Temporary matrix.
    L0 = 0;                 // L registers are initialized to 0
    L1 = 0;                 // --------- do --------
    L2 = 0;                 // --------- do --------
    L3 = 20;                // L3 is used for making coefficients array
                            // circular.
    M1 = 16 (X);            // All these registers are initialized for
    M3 = 8(X);              // modifying address offsets.
    
    I0 = B0;                // I0 points to Input Element (0, 0)
    I2 = B0;                // I2 points to Input Element (0, 0)
    I2 += M3 || R0.H = W[I0];
                            // I2 now points to input Element (0, 4)
                            // Element 0 is read in R0.H 
    I1 = I2;                // I1 points to input Element (0, 4)
    I1 += 4  || R0.L = W[I2++];
                            // I1 now points to input Element (0, 6)
                            // Element 4 is read in R0.L; I2 then points to
                            // input Element (0, 5)
    P2 = 8 (X);
    P3 = 32 (X);
    P4 = -32 (X);
    P5 = 98 (X);
    
    I3 = B3;                // I3 points to Coefficients
    P0 = B2;                // P0 points to array Element (0, 0) of temp
    P1 = B2;
    R7 = [I3++];            // Coefficient C4 is read in R7.H and R7.L
    MNOP;
    NOP;
    
/*
    According to Chen's algorithm, an 8-point IDCT is first calculated for each
    of the 8 rows, and the output is stored transposed in another matrix. The
    8-point IDCT is then applied to all 8 rows of that matrix, and the output
    is again stored transposed. This is the final output. The two passes are
    implemented by two loops, ROW1_START and ROW2_START.

    In the first loop (ROW1_START), the input is read from the "in" buffer and
    the output is written to the "temp" buffer. In the second loop
    (ROW2_START), the input is read from the "temp" buffer and the output is
    written to the "in" buffer, which holds the final output.
*/
    
/*
*   The following operation is done in 2 instructions.
*   A1 = Element 0 * cos(pi/4) 
*   A0 =  Element 0 * cos(pi/4)
*   A1 = A1 + Element 4 * cos(pi/4)
*   A0 = A0 - Element 4 * cos(pi/4)
*   At the same time, Elements 2 and 6 are read into R1.H and R1.L
    respectively, and the coefficients C2 and C6 are read into R7.H and R7.L.
*   At the end, R3 holds Element 0 and R2 holds Element 4.
*/
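/*
    Illustrative C sketch of the arithmetic listed above, not part of the
    original module. It assumes plain signed 16 x 16 -> 32-bit integer
    multiply-accumulates and does not model accumulator width or saturation.
    The function name even_stage_0_4 and its parameters are hypothetical; c4
    stands for the coefficient held in R7.H.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Elements 0 and 4 are scaled by c4 and combined as a sum and a difference.
void even_stage_0_4(int16_t x0, int16_t x4, int16_t c4,
                    int32_t *sum, int32_t *diff) {
    assert(sum != NULL && diff != NULL);
    int32_t a1 = (int32_t)c4 * x0;   // A1  = Element 0 * C4
    int32_t a0 = (int32_t)c4 * x0;   // A0  = Element 0 * C4
    a1 += (int32_t)c4 * x4;          // A1 += Element 4 * C4
    a0 -= (int32_t)c4 * x4;          // A0 -= Element 4 * C4
    *sum = a1;                       // corresponds to R3 in the assembly
    *diff = a0;                      // corresponds to R2 in the assembly
}
```
*/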
    
    A1 = R7.H * R0.H, A0 = R7.H * R0.H (IS) || I0 += 4  || R1.L = W[I1++];
    R3 = (A1 += R7.H * R0.L), R2 = ( A0 -= R7.H * R0.L) (IS) || R1.H = W[I0--]  
    || R7 = [I3++];
    
    LSETUP (ROW1_START, ROW1_END) LC0 = P2;
                            //Loop for 8 rows. 
    P2 = 112 (X); 
    P1 = P1 + P2;           // P1 points to element (7, 0) of temp buffer.
    P2 = -94(X);
    
ROW1_START:
    
/*
*   The following two instructions do -
*   A1 = Element 2 * cos(3pi/8) 
*   A0 =  Element 2 * cos(pi/8)
*   A1 = A1 - Element 6 * cos(pi/8)
*   A0 = A0 + Element 6 * cos(3pi/8)
*   Element 1 and 7 are read in R5.H and R5.L.
*   Coefficients C1 and C7 are in register R7.H and R7.L respectively.
*   At the end, R1 holds Element 2 and R0 holds Element 6.
*/
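/*
    Illustrative C sketch of the rotation listed above, not part of the
    original module. It assumes plain signed integer arithmetic without
    saturation. The function name rotate_2_6 is hypothetical; c2 and c6 stand
    for the coefficients held in R7.H and R7.L.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Rotation of Elements 2 and 6 by the coefficient pair (c2, c6).
void rotate_2_6(int16_t x2, int16_t x6, int16_t c2, int16_t c6,
                int32_t *a1, int32_t *a0) {
    assert(a1 != NULL && a0 != NULL);
    // A1 = Element 2 * C6 - Element 6 * C2  (R1 in the assembly)
    *a1 = (int32_t)c6 * x2 - (int32_t)c2 * x6;
    // A0 = Element 2 * C2 + Element 6 * C6  (R0 in the assembly)
    *a0 = (int32_t)c2 * x2 + (int32_t)c6 * x6;
}
```
*/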
    
        A1= R7.L * R1.H, A0 = R7.H * R1.H (IS)  || I0 += 4  || R5.H = W[I0];
        R1 = (A1 -= R7.H * R1.L) , R0 = (A0 += R7.L * R1.L) (IS)
        || R5.L = W[I1--] || R7 = [I3++];
    
/*
*   The following two instructions do -
*   Element 0 = Element 0 + Element 6.
*   Element 4 = Element 4 + Element 2.
*   Element 2 = Element 4 - Element 2.
*   Element 6 = Element 0 - Element 6.
*   R3 is saved on the stack to free the register. Element 3 is read into R6.L.
*   At this stage Element 0 is in R3, 4 is in R2, 2 is in R1 and 6 is in R0.
*/
    
        R3 = R3 + R0, R0 = R3 - R0;     
        R2 = R2 + R1, R1 = R2 - R1 || [SP + 32] = R3 || R6.L = W[I0--];
    
/*
*  The following 8 instructions combine the Stage 4, 3 and 2 butterfly
*  computations for Elements 1, 5, 3 and 7.
*  R5.H and R5.L hold Elements 1 and 7 respectively.
*  R6.H and R6.L hold Elements 5 and 3 respectively.
*  For the first two instructions R7.H has C1 and R7.L has C7.
*  For the next four instructions R7.H has C3 and R7.L has C5.
*  For the last two instructions R7.H has C1 and R7.L has C7 again.
*  Combining Stages 4, 3 and 2 yields the four equations below, which give
*  the Stage 2 output directly.
*
*  Element 1 = C7 * Element 1 - C1 * Element 7 + C3 * Element 5 - C5 * Element 3.
*  Element 7 = C1 * Element 1 + C7 * Element 7 + C5 * Element 5 + C3 * Element 3.
*  Element 5 = C5 * Element 1 + C3 * Element 7 + C7 * Element 5 - C1 * Element 3.
*  Element 3 = C3 * Element 1 - C5 * Element 7 - C1 * Element 5 - C7 * Element 3.
*
*  The first four instructions implement the first two equations. The next
*  four instructions implement the last two equations.
*  At the end, Element 1 is in R3, 7 in R2, 5 in R7 and 3 in R6.
*  Meanwhile, the address offsets are modified.
*/
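/*
    Illustrative C sketch of the four combined equations above, not part of
    the original module. It assumes plain signed integer arithmetic without
    saturation, and the coefficient values are left to the caller. The
    function name odd_part and the output ordering are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Odd-part outputs in the order out[0..3] = Elements 1, 7, 5, 3.
void odd_part(int16_t x1, int16_t x7, int16_t x5, int16_t x3,
              int16_t c1, int16_t c3, int16_t c5, int16_t c7,
              int32_t out[4]) {
    assert(out != NULL);
    // Element 1 = C7*x1 - C1*x7 + C3*x5 - C5*x3
    out[0] = (int32_t)c7 * x1 - (int32_t)c1 * x7
           + (int32_t)c3 * x5 - (int32_t)c5 * x3;
    // Element 7 = C1*x1 + C7*x7 + C5*x5 + C3*x3
    out[1] = (int32_t)c1 * x1 + (int32_t)c7 * x7
           + (int32_t)c5 * x5 + (int32_t)c3 * x3;
    // Element 5 = C5*x1 + C3*x7 + C7*x5 - C1*x3
    out[2] = (int32_t)c5 * x1 + (int32_t)c3 * x7
           + (int32_t)c7 * x5 - (int32_t)c1 * x3;
    // Element 3 = C3*x1 - C5*x7 - C1*x5 - C7*x3
    out[3] = (int32_t)c3 * x1 - (int32_t)c5 * x7
           - (int32_t)c1 * x5 - (int32_t)c7 * x3;
}
```

    Each output needs four multiply-accumulates, so the dual MAC units can
    produce two outputs in four instructions, which matches the 8-instruction
    count stated above.
*/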
    
        A1  = R7.L * R5.H, A0  = R7.H * R5.H (IS) || [SP + 36] = R2 
        || R6.H = W[I2--];
        A1 -= R7.H * R5.L, A0 += R7.L * R5.L (IS) || I0 -= 4 || R7 = [I3++];
        A1 += R7.H * R6.H, A0 += R7.L * R6.H (IS) || I0 += M1;
        R3 = (A1 -= R7.L * R6.L), R2 = (A0 += R7.H * R6.L) (IS);                
        A1  = R7.L * R5.H, A0  = R7.H * R5.H (IS)  || R4 = [SP + 32];
        A1 += R7.H * R5.L, A0 -= R7.L * R5.L (IS)   || I1 += M1 || R7 = [I3++];
        A1 += R7.L * R6.H, A0 -= R7.H * R6.H (IS);