/*******************************************************************************
Copyright(c) 2000 - 2002 Analog Devices. All Rights Reserved.
Developed by Joint Development Software Application Team, IPDC, Bangalore, India
for Blackfin DSPs ( Micro Signal Architecture 1.0 specification).
By using this module you agree to the terms of the Analog Devices License
Agreement for DSP Software.
********************************************************************************
Module Name : r8x8invdct_ieee.asm
Label name : __r8x8invdct_ieee
Version : 1.3
Change History :
Version   Date         Author        Comments
1.3       11/18/2002   Swarnalatha   Tested with VDSP++ 3.0 compiler 6.2.2 on
                                     ADSP-21535 Rev.0.2
1.2       11/13/2002   Swarnalatha   Tested with VDSP++ 3.0 on ADSP-21535 Rev.0.2
1.1       02/19/2002   Vijay         Modified to match silicon cycle count
1.0       03/13/2001   shailendra    Original
Description : This is an implementation of Chen's IDCT algorithm. It exploits
the separability of the multi-dimensional IDCT. The input is an 8x8
matrix of real data. First, a one-dimensional 8-point IDCT is
calculated for each of the 8 rows, and the output is stored
transposed in a separate matrix. Then an 8-point IDCT is again
calculated on each row of that matrix, and the output is again
stored transposed. This is the final output.
Chen's algorithm is implemented in 4 stages.
This implementation works only for 8x8 input. The input data
must be real and in the range -256 to 255.
The algorithm operates in place.
Note : The algorithm reads the input data from bit-reversed
locations of the "in" matrix.
First, the 8-point IDCT is calculated for all 8 rows, and the
output is stored in the "temp" buffer in transposed form.
The 8-point IDCT is then applied to all 8 rows of the "temp"
buffer, and the final output is stored in the "in" buffer in
transposed form. The transposition and the bit-reversed
ordering are handled while writing the data, without any
explicit code.
The function's output is provided in the "in" buffer in normal order.
This function passes all the IEEE-1180 test cases.
Prototype : void _r8x8invdct_ieee(fract16 *in, fract16 *coeff, fract16 *temp);
*in -> Pointer to Input vector.
*coeff -> Pointer to coefficients.
*temp -> Pointer to temporary data.
Registers Used : A0, A1, R0-R7, I0-I3, B0, B2, B3, M0-M2, L0-L3, P0-P5, LC0.
Performance :
Code Size : 498 Bytes.
Cycle Count : 417 Cycles
******************************************************************************/
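// Illustrative reference only (not part of the original module): a minimal C
// sketch of the row-column scheme described in the header above. It assumes a
// hypothetical helper idct8() that computes one 8-point IDCT; all names here
// are made up for illustration. The assembly below folds the transposes into
// its stores instead of using a separate copy loop.
//
//   void idct8(const short *src, short *dst);    // hypothetical 1-D IDCT
//
//   void idct8x8_sketch(short *in, short *temp)
//   {
//       short row[8];
//       for (int r = 0; r < 8; r++) {             // pass 1: rows of "in"
//           idct8(&in[8 * r], row);
//           for (int c = 0; c < 8; c++)
//               temp[8 * c + r] = row[c];         // store transposed
//       }
//       for (int r = 0; r < 8; r++) {             // pass 2: rows of "temp"
//           idct8(&temp[8 * r], row);
//           for (int c = 0; c < 8; c++)
//               in[8 * c + r] = row[c];           // transpose back into "in"
//       }
//   }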
.section L1_code;
.global __r8x8invdct_ieee;
.align 8;
__r8x8invdct_ieee:
/********************** Function Prologue *********************************/
[--SP] = (R7:4, P5:3); // Pushing the registers on stack.
B0 = R0; // Pointer to Input matrix.
B3 = R1; // Pointer to Coefficients
B2 = R2; // Pointer to Temporary matrix.
L0 = 0; // L registers are initialized to 0
L1 = 0; // --------- do --------
L2 = 0; // --------- do --------
L3 = 20; // L3 is used for making coefficients array
// circular.
M1 = 16 (X); // All these registers are initialized for
M3 = 8(X); // modifying address offsets.
I0 = B0; // I0 points to Input Element (0, 0)
I2 = B0; // I2 points to Input Element (0, 0)
I2 += M3 || R0.H = W[I0];
// Element 0 is read in R0.H
// I2 now points to input Element (0, 4)
I1 = I2; // I1 points to input Element (0, 4)
I1 += 4 || R0.L = W[I2++];
// I1 now points to input Element (0, 6)
// Element 4 is read in R0.L; I2 then
// advances to input Element (0, 5)
P2 = 8 (X);
P3 = 32 (X);
P4 = -32 (X);
P5 = 98 (X);
I3 = B3; // I3 points to Coefficients
P0 = B2; // P0 points to array Element (0, 0) of temp
P1 = B2;
R7 = [I3++]; // Coefficient C4 is read in R7.H and R7.L
MNOP;
NOP;
/*
According to Chen's algorithm, the 8-point IDCT is first calculated for all
8 rows. The output of this calculation is stored, transposed, in another
matrix. The 8-point IDCT is then applied again to all 8 rows, and the output
is again stored in transposed form. This is the final output. It is done
with the help of two loops, ROW1_START and ROW2_START.
In the first loop (ROW1_START) the input is read from the "in" buffer and the
output is written to the "temp" buffer. In the second loop (ROW2_START) the
input is read from the "temp" buffer and the output is written to the "in"
buffer, which holds the final output.
*/
/*
* The following operation is done in 2 instructions.
* A1 = Element 0 * cos(pi/4)
* A0 = Element 0 * cos(pi/4)
* A1 = A1 + Element 4 * cos(pi/4)
* A0 = A0 - Element 4 * cos(pi/4)
* At the same time, Elements 2 and 6 are read into R1.H and R1.L
respectively, and coefficients C2 and C6 are read into R7.H and R7.L.
* In the end R3 holds Element 0 and R2 holds Element 4.
*/
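// Illustrative restatement (not from the original source), writing x0 and x4
// for Elements 0 and 4 and C4 for the coefficient held in R7.H:
//   R3 = C4 * x0 + C4 * x4 = C4 * (x0 + x4)
//   R2 = C4 * x0 - C4 * x4 = C4 * (x0 - x4)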
A1 = R7.H * R0.H, A0 = R7.H * R0.H (IS) || I0 += 4 || R1.L = W[I1++];
R3 = (A1 += R7.H * R0.L), R2 = ( A0 -= R7.H * R0.L) (IS) || R1.H = W[I0--]
|| R7 = [I3++];
LSETUP (ROW1_START, ROW1_END) LC0 = P2;
//Loop for 8 rows.
P2 = 112 (X);
P1 = P1 + P2; // P1 points to element (7, 0) of temp buffer.
P2 = -94(X);
ROW1_START:
/*
* The following two instructions do -
* A1 = Element 2 * cos(3pi/8)
* A0 = Element 2 * cos(pi/8)
* A1 = A1 - Element 6 * cos(pi/8)
* A0 = A0 + Element 6 * cos(3pi/8)
* Element 1 and 7 are read in R5.H and R5.L.
* Coefficients C1 and C7 are in register R7.H and R7.L respectively.
* In the end R1 holds Element 2 and R0 holds Element 6.
*/
A1= R7.L * R1.H, A0 = R7.H * R1.H (IS) || I0 += 4 || R5.H = W[I0];
R1 = (A1 -= R7.H * R1.L) , R0 = (A0 += R7.L * R1.L) (IS)
|| R5.L = W[I1--] || R7 = [I3++];
/*
* The following two instructions do -
* Element 0 = Element 0 + Element 6.
* Element 4 = Element 4 + Element 2.
* Element 2 = Element 4 - Element 2.
* Element 6 = Element 0 - Element 6.
* R3 is saved on the stack to free the register. Element 3 is read in R6.L
* At this stage Element 0 is in R3, 4 is in R2, 2 is in R1 and 6 is in R0
*/
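// Illustrative note (not from the original source): each dual add/subtract
// below reads both source registers before writing either result, so the sum
// and the difference are both formed from the old values. In C, with
// hypothetical temporaries s and d, the first instruction behaves like:
//   int s = r3 + r0;
//   int d = r3 - r0;
//   r3 = s;                                       // Element 0 + Element 6
//   r0 = d;                                       // Element 0 - Element 6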
R3 = R3 + R0, R0 = R3 - R0;
R2 = R2 + R1, R1 = R2 - R1 || [SP + 32] = R3 || R6.L = W[I0--];
/*
* In the following 8 instructions the Stage 4, 3 and 2 computation of butterfly
* for elements 1, 5, 3 and 7 has been combined.
* R5.H and R5.L has data 1 and 7 respectively.
* R6.H and R6.L has data 5 and 3 respectively.
* For the first two instructions R7.H has C1 and R7.L has C7.
* For the next four instructions R7.H has C3 and R7.L has C5.
* For the last two instructions R7.H has C1 and R7.L has C7 again.
* After combining the stage 4, 3, and 2 the final four equations are
* obtained. These give the output of Stage 2 straight way.
*
* Element 1 = C7 * Element1 - C1 * Element 7 + C3 * Element 5 - C5 * Element 3.
* Element 7 = C1 * Element1 + C7 * Element 7 + C5 * Element 5 + C3 * Element 3.
* Element 5 = C5 * Element1 + C3 * Element 7 + C7 * Element 5 - C1 * Element 3.
* Element 3 = C3 * Element1 - C5 * Element 7 - C1 * Element 5 - C7 * Element 3.
*
* The first 4 instructions implement the first two equations. The next four
* instructions implement last two equations.
* In the last the Element 1 is in R3, 7 in R2, 5 in R7 and 3 in R6.
* Mean while the address offsets are modified.
*/
A1 = R7.L * R5.H, A0 = R7.H * R5.H (IS) || [SP + 36] = R2
|| R6.H = W[I2--];
A1 -= R7.H * R5.L, A0 += R7.L * R5.L (IS) || I0 -= 4 || R7 = [I3++];
A1 += R7.H * R6.H, A0 += R7.L * R6.H (IS) || I0 += M1;
R3 = (A1 -= R7.L * R6.L), R2 = (A0 += R7.H * R6.L) (IS);
A1 = R7.L * R5.H, A0 = R7.H * R5.H (IS) || R4 = [SP + 32];
A1 += R7.H * R5.L, A0 -= R7.L * R5.L (IS) || I1 += M1 || R7 = [I3++];
A1 += R7.L * R6.H, A0 -= R7.H * R6.H (IS);