📄 r8x8dct.asm
字号:
/*******************************************************************************
Copyright(c) 2000 - 2002 Analog Devices. All Rights Reserved.
Developed by Joint Development Software Application Team, IPDC, Bangalore, India
for Blackfin DSPs ( Micro Signal Architecture 1.0 specification).
By using this module you agree to the terms of the Analog Devices License
Agreement for DSP Software.
********************************************************************************
Module Name : r8x8dct.asm
Label Name : __r8x8dct
Version : 1.3
Change History :
Version Date Author Comments
1.3 11/18/2002 Swarnalatha Tested with VDSP++ 3.0
compiler 6.2.2 on
ADSP-21535 Rev.0.2
1.2 11/13/2002 Swarnalatha Tested with VDSP++3.0
on ADSP-21535 Rev.0.2
1.1 02/19/2002 Vijay Modified to match
silicon cycle count
1.0 03/13/2001 Vijay Original
Description : This is the implementation of Chen's algorithm of DCT.
It is based on the separable nature of DCT for multi-
dimension. The input matrix is 8x8 real data. First, one dime-
nsional 8-point DCT is calculated for each of the 8 rows. The
output is stored in a separate matrix after transpose. Then
again 8-point DCT is calculated on each row of matrix. The
output is again stored in a transpose matrix. This is final
output.
Chen's algorithm has 4 stages of implementation. In the first
stage there are additions and subtractions only. In the second
stage addition and subtraction is done with one multiplication
In the third and last (fourth) stages more MAC operations are
involved.
This implementation works only for 8x8 input. The input data
should be real. The range of input should be -256 to 255.
The algorithm is in-placed.
Note : The algorithm reads the input data from the "in"
matrix.
First 8-point DCT will be calculated for all the 8 rows.
This output is stored in "temp" buffer in the
transposed form at bit reversed locations.
Again the 8-point DCT is applied on all the 8 rows of
"temp" buffer. Final output computed is stored in "in"
buffer in transposed form at bit reversed locations.
The operation of transposing the matrix and
calculation of bit reversed are carried out while
writing the data without any explicit code.
Output of function is provided "in" buffer in normal
order.
Prototype : void _r8x8dct(fract16 *in, fract16 *coeff, fract16 *temp);
*in -> Pointer to Input vector.
*coeff -> Pointer to coefficients.
*temp -> Pointer to temporary data.
Registers Used : A0, A1, R0-R7, I0-I3, B0-B3, M0, M1, L0-L3, P0-P5, LC0, LC1.
Performance :
Code Size : 236 Bytes.
Cycle Count : 293 Cycles.
*******************************************************************************/
.section L1_code;
.global __r8x8dct;
.align 8;
__r8x8dct:
/******************************* Function Prologue ***************************/
[--SP] = (R7:4, P5:3); //Pushing the Registers on stack.
B0 = R0; //Pointer to Input matrix.
B3 = R1; //Pointer to Coefficients.
B2 = R2; //Pointer to Temporary matrix.
L0 = 0; //L registers are initialized to 0
L1 = 0; //-------- do --------
L2 = 0; //-------- do --------
L3 = 16; //L3 is set to 16 to make the coefficients
//array Circular.
/*
I0, I1, and I2 registers are used to read the input data. I3 register is used
to read the coefficients. P0 and P1 registers are used for writing the output
data.
*/
M0 = 12 (X); // All these initialization are used in the
// modification of address offsets.
M1 = 16 (X);
P2 = 16;
P3 = 32 (X);
P4 = -110 (X);
P5 = -62 (X);
P0 = 2;
/*
According to Chen's algorithm, first 8-point DCT will be calculated for all
the 8 rows. The output of this calculation is stored in another transpose
matrix. Now again the 8-point DCT is applied on all the 8 rows. The output
is stored in matrix in transposed form. This is the final output.
Therefore,
a loop of 2 iteration (DCT_START, DCT_end) is set.
B0 points to the "in" buffer and B2 points to "temp" buffer in the first
iteration. The input is read from "in" buffer and output is written to
"temp" buffer. In the second iteration of DCT_START B0 points to "temp" and
B2 points to "in" buffer. The input is read from "temp" buffer and output
is written to "in" buffer. "in" buffer holds the final output.
*/
LSETUP (DCT_START, DCT_END) LC0 = P0;
DCT_START:
I0 = B0; //I0 points to Input Element (0, 0)
I1 = B0; //Element 1 and 0 is read in R0.
I1 += M0 || R0 = [I0++];
//I1 points to Input Element (0, 6)
I2 = I1; //Element 6 is read in R3.H
I2 -= 4 || R3.H = W[I1++];
//I2 points to Input Element (0, 4)
I3 = B3; //I3 points to Coefficients
P0 = B2; //P0 points to temporary array Element (0, 0)
P1 = B2; //P1 points to temporary array
R7 = [P1++P2] || R2 = [I2++];
//P1 points to temporary array Element (1, 0)
//R7 is a dummy read. Element 4 and 5 are read in R2
R3.L = W[I1--]; //Element 7 is read in R3.L
R1.H = W[I0++]; //Element 2 is read in R1.H
//******************************* Implementation of Part 1 ********************
/*
All the additions from Stage 1 and Stage 2 are implemented in Part1 1. for
the optimization
It is taken out of the loop for optimization point of view. (Part 1)
The following instruction does the following job.
Element 0 = (Element 0 + Element 7) / 2.
Element 1 = (Element 1 + Element 6) / 2.
Element 6 = (Element 1 - Element 6) / 2.
Element 7 = (Element 0 - Element 7) / 2.
It reads the data 3 in R1.L.
*/
R0 = R0 +|+ R3, R3 = R0 -|- R3 (ASR) || R1.L = W[I0++] || NOP;
/*
This single instruction does the following job.
Element 2 = (Element 2 + Element 5) / 2.
Element 3 = (Element 3 + Element 4) / 2.
Element 4 = (Element 3 - Element 4) / 2.
Element 5 = (Element 2 - Element 5) / 2.
It reads the Coefficients C4 = cos(4*pi/16) in register R7.
*/
R1 = R1 +|+ R2, R2 = R1 -|- R2 (ASR, CO) || NOP || R7 = [I3++];
/*
At the end of stage 1 R0 has (1,0), R1 has (2,3), R2 has (4, 5) and
R3 has (6,7). Where notation (x, y) means the element from column x is in
upper half of register and element from column y is in lower half of the
register.
*/
//******************************* Implementation of Part 2 *********************
/*
The following addition/subtraction instruction does -
Element 0 = Element 0 + Element 3.
Element 1 = Element 1 + Element 2.
Element 2 = Element 1 - Element 2.
Element 3 = Element 0 - Element 3.
*/
R0 = R0 +|+ R1, R1 = R0 -|- R1;
LSETUP (ROW_START, ROW_END) LC1 = P2 >> 1;
//The loop is set for 8 rows.
ROW_START:
/*
This is part 2 computation continued.....
The following two instructions do -
A1 = Element 6 * cos(pi/4)
A0 = Element 6 * cos(pi/4)
A1 = A1 - Element 5 * cos(pi/4)
A0 = A0 + Element 5 * cos(pi/4).
The instruction W[I0] = R3.L is used for packing it to R2.L.
*/
A1 = R3.H * R7.L, A0 = R3.H * R7.L || I1 += M1 || W[I0] = R3.L;
R4.H = ( A1 -= R2.L * R7.L), R4.L = (A0 += R2.L * R7.L) || I2 += M0;
/*
At the end of stage 2 R0 has (1,0), R1 has (2,3), R4 has (5, 6).
*/
//************************ Implementation of Part 3 ***************************
/*
The following two instruction does the job of stage 3 -
A1 = Element 0 * cos(pi/4)
A0 = Element 0 * cos(pi/4)
A1 = A1 - Element 1 * cos(pi/4)
A0 = A0 + Element 1 * cos(pi/4)
The value of coefficients C2 and C6 are read in register R7.
*/
A1 = R0.L * R7.L, A0 = R0.L * R7.L || NOP || R3.H = W[I1++];
R5.H = (A1 -= R0.H * R7.L), R5.L = (A0 += R0.H * R7.L)
|| R7 = [I3++];
/*
The following three instructions do -
A1 = Element 2 * cos(3pi/8)
A0 = Element 3 * cos(3pi/8)
A1 = A1 + Element 3 * cos(pi/8)
A0 = A0 - Element 2 * cos(pi/8)
R3 reads the value of cos pi/4.
The value of coefficients C7 and C1 is read in register R7.
Element 4 = Element 4 + Element 5.
Element 5 = Element 4 - Element 5.
Element 6 = Element 7 - Element 6.
Element 7 = Element 7 + Element 6.
*/
A1 = R1.H * R7.L, A0 = R1.L * R7.L || W[P0++P3] = R5.L
|| R2.L = W[I0];
R2 = R2 +|+ R4, R4 = R2 -|- R4 || I0 += 4 || R3.L = W[I1--];
R6.H = (A1 += R1.L * R7.H), R6.L = (A0 -= R1.H * R7.H) || I0 += 4
|| R7 = [I3++];
/*
At the end of part 3 R2 has (4, 7), R4 has (5,6), R5 has (1, 0) and
R6 has (2,3).
*/
//****************************** Implementation of Part 4 **********************
/*
The following two instructions do -
A1 = Element 4 * cos(7pi/16)
A0 = Element 7 * cos(7pi/16)
A1 = A1 + Element 7 * cos(pi/16)
A0 = A0 - Element 4 * cos(pi/16)
The value of next coefficients are read, and the registers are written to
their locations.
*/
A1 = R2.H * R7.L, A0 = R2.L * R7.L || W[P0++P3] = R6.H
|| R0 = [I0++];
R2.H = ( A1 += R2.L * R7.H), R2.L = ( A0 -= R2.H * R7.H)
|| W[P0++P3] = R5.H || R7 = [I3++];
/*
The following two instructions do -
A1 = Element 5 * cos(3pi/16)
A0 = Element 6 * cos(3pi/16)
A1 = A1 + Element 6 * cos(5pi/16)
A0 = A0 - Element 5 * cos(5pi/16)
The output values are written.
*/
A1 = R4.H * R7.H, A0 = R4.L *R7.H || W[P0++P2] = R6.L
|| R1.H = W[I0++];
R4.H = (A1 += R4.L * R7.L), R4.L = ( A0 -= R4.H * R7.L)
|| W[P0++P4] = R2.L || R1.L = W[I0++];
//******************* Implementation of Part 1 *****************************
/*
This is the same part as part 1 specified earlier. First time the part 1
calculation is done outside the loop, after wards it is done here. It serves
two purpose. Firts it computes part 1 for the next data, and it writes the
data 5, and 6 to its bit reversed order in transpose way.
*/
R0 = R0 +|+ R3, R3 = R0 -|- R3 (ASR) || W[P1++P3] = R2.H
|| R2 = [I2++];
R1 = R1 +|+ R2, R2 = R1 -|- R2 (ASR, CO) || W[P1++P3] = R4.L
|| R7 = [I3++];
ROW_END: R0 = R0 +|+ R1, R1 = R0 -|- R1 || W[P1++P5] = R4.H || NOP;
B1 = B0; //Swapping of Input and output address pointers
B0 = B2; //B0 points to input buffer.
DCT_END:B2 = B1; //B2 points to output buffer.
TERMINATE:
(R7:4,P5:3)=[SP++]; //Pop the registers before returning.
RTS; //Return.
NOP; //to avoid one stall if LINK or UNLINK happens to be
//the next instruction after RTS in the memory.
⌨️ 快捷键说明
复制代码
Ctrl + C
搜索代码
Ctrl + F
全屏模式
F11
切换主题
Ctrl + Shift + D
显示快捷键
?
增大字号
Ctrl + =
减小字号
Ctrl + -