📄 cr2fftnasm_outplace.asm
字号:
/*******************************************************************************************
Copyright(c) 2000 Analog Devices/Intel
Developed by JD(FRIO) Software Application Team, IPDC, Bangalore, India
********************************************************************************************
File Name : cr2fftn_outplace.asm
Module Name : one dimensional radix 2 forward FFT for complex data (Out of place).
Label name : __cfftN
Description : This file contains the code for the implementation of FFT. The Algorithm
used is Decimation in Time. For the optimization point of view, the whole
computaion of butterfly signal flow has beeen divided in three parts. The
first part, the middle part and the last part. In the first part, the
Stage 1 and Stage 2 of the butterfly structure are implemented. In the 2nd
part the general butterfly computation is done, which corresponds to the
middle stages of butterfly. In the last part the last stage of the buttrfly
structure is implemented, where mainly the loop overheads are saved.
Input and output both the data are complex and 16 bit and represented in 1.15
format. The loop unroling is also done for optimization point of view.
The C callable prototype of the function is as follows:
void cfftN(complex_fract16 *in, complex_fract16 *out, complex_fract16 *w,
int wst, int n);
*in -> Pointer to Input array.
*out -> Pointer to Output array.
*w -> Pointer to Twiddle factor.
wst -> Twiddle factor Stride. It is equal to 512/(size of input).
n -> length of Input data.
The C equivalent code for the main loop of each butterfly is
as follows.
for(i=0; i<le; i++)
{
xire = out[add].re >> 1;
xiim = out[add].im >> 1;
xipre = out[add+indx].re >> 1;
xipim = out[add+indx].im >> 1;
mult.re = ((xipre * wptr[indx_w].re - xipim * wptr[indx_w].im) >>15);
mult.im = ((xipre * wptr[indx_w].im + xipim * wptr[indx_w].re) >>15);
out[add].re = xire + mult.re;
out[add].im = xiim + mult.im;
out[add+indx].re = xire - mult.re;
out[add+indx].im = xiim - mult.im;
add = add + 2*offset;
}
The ouput is provided in normal order.
Restrictions : The length of input array should be more than 4.i.e, 8, 16, 32
and it should be power of 2.
Registers Used :
R0 -> It is mainly used for counter for middle stages. Its value is equal to m-3.
if m are the total number of stages for a perticular n.
R1 -> It is used for the storing the value of wst. wst = 512/n.
R2 -> Used for storing Input/Output data.
R3 -> used for storing the twiddle factor value.
R4 -> Used for storing Input/Output data.
R5 -> Used for storing Input/Output data.
R6 -> Used for storing Input/Output data.
R7 -> Used for calculating and storing the address offset for twiddle factor
array.
P0 -> It is used for storing the address offset of output buffer in middle part
of implementation.
P1 -> It is used for storing the address offset of output buffer in middle part
of implementation.
P2 -> It is used for storing the address offset of output buffer in middle part
of implementation.
P3 -> It stores the number of lines for the butterflies at a perticular stage.
P4 -> It stores the value of input array length.
P5 -> It stores the number of butterflies at a perticular stage.
A0 -> Used for storing the value of MAC temporarily.
A1 -> Used for storing the value of MAC temporarily.
B0 -> Start address of Input array.
B2 -> start address of output array.
B3 -> Start address of twiddle factor buffer.
I0 -> Address for input array.
I1 -> Address for output and temporary array while computing.
I2 -> Address for output array while computing.
I3 -> Address for twiddle factor array.
M0 -> Address offset for input/output array.
M1 -> Address offset for output array.
M2 -> Address offset for output array.
M3 -> Address offset for twiddle factor.
LC0 -> Counter.
LC1 -> Counter.
PERFORMANCE: (With new timerfrio1.dll)
N = 256.
The cycle count for 256 point fft calculation is 3176 cycles.
The Cycle count for Stage1 + Stage2 + Bitreverse computation = 389
The Cycle count for middle stage computation = 2291
The Cycle count for last stage computation = 397
The Cycle count for prologue, epilogue... etc = 99
----------------------------------------------------------------------
TOTAL = 3176
Code Size : 484 Bytes.
****************************************************************************************/
.section data1;
.align 4096;
.global _in;
.var _in[512];
.align 4096;
.global _out;
.var _out[512];
.align 4096;
.global _w;
.var _w[256];
/***************************************************************************************/
.section program;
.global __cfftN;
.align 8;
__cfftN:
/**************** Function Prologue ****************************************************/
/*
* B0 stores the starting address of Input array.
* B2 stores the starting address of Output array.
* B3 stores the starting address of Twiddle factor.
*/
B0 = R0; //Pointer of Input buffer
B2 = R1; //Pointer of output buffer
B3 = R2; //Address of Twiddle factor
M3 = B2;
P1 = [SP+12]; //The value of wst. wst = 512/n.
R2 = [SP+16]; //length of input array
[--SP] = (R7:4, P5:3);
CC = R2 <= 4 (IU); //Come out of the program if the size of input data
If CC Jump Terminate; //is less than equal to 4.
/***************************************************************************************/
/************************** Implementation of First part *******************************/
/*
* The stage 1 and 2 of the butterfly structure are implemented. After the comput-
* ation the result is stored to output array. The main reason for separating it out from
* the general computation, is that the multiplications of both the stages can be avoided.
*
* In the first stage of signal flow, there are n/2 number of butterflies. Each butterfly
* works on two inputs. These input are multiplied by W0 and added and subtracted.
* Multiplying a data with W0 which is eqaul to 1 +0j will result the same data.
*
* In the second stage the number of butterflies are n/4. The data are added and subtra-
* cted after multiplication with W0 and Wn/4. The multiplication with W0 doesn't have
* any impact. The multiplication of data x+jy with Wn/4 will give the value y-jx.
*
* Therefore, the multiplications involved in both the stages 1 and 2, can be reduced to
* additions only. The output is stored after dividing by 2 for scaling purpose. In one loop
* data correspodding to 2 butterflies are processed.
*
* The input is read from Input array and output of computation at this stage is stored in
* output buffer, which is of the size of input array and passed at the time of calling
* the fft function.
*/
P4 = R2;
I0 = B0; //Address of Input array
I2 = B2; //Address of Output array
R3 = R2 << 1 || nop || nop;
M0 = R3; //M0 stores the offset required for bit reversing
P5 = P4 >> 2; //P5 is equal to number of butterflies at Stage 2
P0 = P5 >> 1; //P0 is half of th enumber of butterflies at stage 2.
P0 += -1; //P0 is one less than 1/2 of butterfly at stage 2.
/*
* Register I0 is used for reading Input from Input array from Bit reversed locations.
* I2 is used for writing the output of this stage in output buffer in normal
* order.
* The register set R2, R3, R4 and R5 process the four data corresponding to one
* butterfly at Stage 2.
* Register set R0, R1, R6 and R7 process the four data corresponding to next
* butterfly at Stage 2.
* The loop is set for one less than the half of the number of butterflies at stage2.
* The last butterfly is processed separately.
* All the data are divided by 2, before processing to avoid saturation and for
⌨️ 快捷键说明
复制代码
Ctrl + C
搜索代码
Ctrl + F
全屏模式
F11
切换主题
Ctrl + Shift + D
显示快捷键
?
增大字号
Ctrl + =
减小字号
Ctrl + -