📄 cr2fftn_outplace.asm
字号:
/*******************************************************************************
Copyright(c) 2000 - 2002 Analog Devices. All Rights Reserved.
Developed by Joint Development Software Application Team, IPDC, Bangalore, India
for Blackfin DSPs ( Micro Signal Architecture 1.0 specification).
By using this module you agree to the terms of the Analog Devices License
Agreement for DSP Software.
********************************************************************************
Module Name : cr2fftn_outplace.asm
Label name : __cr2fftn_outplace
Version : 1.4
Change History :
Version Date Author Comments
1.4 11/18/2002 Swarnalatha Tested with VDSP++ 3.0
compiler 6.2.2 on
ADSP-21535 Rev.0.2
1.3 11/13/2002 Swarnalatha Tested with VDSP++3.0
on ADSP-21535 Rev.0.2
1.2 02/18/2002 Nishanth Modified to match
silicon cycle count
1.1 11/12/2001 Nishanth Modified to support 1024
points
1.0 01/20/2001 Shailendra Original
Description : This file contains the code for the implementation of FFT. The
Algorithm used is Decimation in Time. For the optimization
point of view, the whole computation of butterfly signal flow
has been divided in three parts. The first part, the middle
part and the last part. In the first part, the Stage 1 and
Stage 2 of the butterfly structure are implemented. In the 2nd
part the general butterfly computation is done, which
corresponds to the middle stages of butterfly. In the last
part the last stage of the butterfly structure is implemented,
where mainly the loop overheads are saved.
Input and output both the data are complex and 16 bit and
represented in 1.15 format. The loop unrolling is also done
for optimization point of view.
Twiddle factor array has real and imag. values interleaved
Real -> Cosine and imag -> -sine
w = e^(-2*j*pi*[0:N/2 - 1]/N)
Algorithm : The C equivalent code for the main loop of each butterfly is
as follows:
for(i=0; i<le; i++)
{
xire = out[add].re >> 1;
xiim = out[add].im >> 1;
xipre = out[add+indx].re >> 1;
xipim = out[add+indx].im >> 1;
mult.re = ((xipre * wptr[indx_w].re - xipim *
wptr[indx_w].im) >>15);
mult.im = ((xipre * wptr[indx_w].im + xipim *
wptr[indx_w].re) >>15);
out[add].re = xire + mult.re;
out[add].im = xiim + mult.im;
out[add+indx].re = xire - mult.re;
out[add+indx].im = xiim - mult.im;
add = add + 2*offset;
}
The ouput is provided in normal order.
Assumptions : 1. The length of input array should be more than 4 and it
should be power of 2.i.e, 8, 16, 32
2. All the inputs should be less than 0x3fff to avoid overflow
in the first stage.
3. The input array base address in[] should have 'x' zeros in
LSB for bit reversing properly.
where x = log (2*N) to the base 2.
Prototype : void _cr2fftn_outplace(
complex_fract16 in[],
// (i) : Pointer to the input array.
complex_fract16 out[],
// (o) : Pointer to the output array.
complex_fract16 w[],
// (i) : Twiddle Factor array
int wst,
// (i) : Twiddle Factor Stride = 1024 / n
int n)
// (i) : FFT length.
Registers used : A0,A1, R0-R7, I0-I3, B0,B2,B3, M0-M3, L0-L3, P0-P5, LC0,LC1.
Performance :
Code Size : 500 Bytes.
Cycle Count : 195 cycles for FFT size of 16.
711 cycles for FFT size of 64.
3171 cycles for FFT size of 256.
15135 cycles for FFT size of 1024.
*******************************************************************************/
.section L1_code;
.global __cr2fftn_outplace;
.align 8;
__cr2fftn_outplace:
/**************** Function Prologue *******************************************/
/*
* B0 stores the starting address of Input array.
* B2 stores the starting address of Output array.
* B3 stores the starting address of Twiddle factor.
*/
[--SP] = (R7:4, P5:3);
L0 = 0; //Disable circular buffering
L1 = 0;
L2 = 0;
L3 = 0;
B0 = R0; //Pointer of Input buffer
B2 = R1; //Pointer of output buffer
B3 = R2; //Address of Twiddle factor
P1 = [SP+40]; //The value of wst. wst = 512/n.
R2 = [SP+44]; //length of input array
M3 = B2;
CC = R2 <= 4 (IU); //Come out of the program if the size of input data
If CC Jump Terminate; //is less than equal to 4.
/***************** Implementation of First part *******************************/
/*
* The stage 1 and 2 of the butterfly structure are implemented. After the
* computation the result is stored to output array. The main reason for
separating it out from the general computation, is that the multiplications of
both the stages can be avoided.
*
* In the first stage of signal flow, there are n/2 number of butterflies. Each
butterfly works on two inputs. These input are multiplied by W0 and added and
subtracted. Multiplying a data with W0 which is equal to 1 +0j will result the
same data.
*
* In the second stage the number of butterflies are n/4. The data are added and
subtracted after multiplication with W0 and Wn/4. The multiplication with W0
doesn't have any impact. The multiplication of data x+jy with Wn/4 will give
the value y-jx.
*
* Therefore, the multiplications involved in both the stages 1 and 2, can be
reduced to additions only. The output is stored after dividing by 2 for
scaling purpose. In one loop data corresponding to 2 butterflies are
processed.
*
* The input is read from Input array and output of computation at this stage is
stored in output buffer, which is of the size of input array and passed at the
time of calling the fft function.
*/
P4 = R2;
I0 = B0; //Address of Input array
I2 = B2; //Address of Output array
R3 = R2 << 1;
M0 = R3; //M0 stores the offset required for bit reversing
P5 = P4 >> 2; //P5 is equal to number of butterflies at Stage 2
P0 = P5 >> 1; //P0 is half of the number of butterflies at stage 2
P0 += -1; //P0 is one less than 1/2 of butterfly at stage 2.
MNOP;
/*
* Register I0 is used for reading Input from Input array from Bit reversed
locations.I2 is used for writing the output of this stage in output buffer in
normal order.
* The register set R2, R3, R4 and R5 process the four data corresponding to one
* butterfly at Stage 2.
* Register set R0, R1, R6 and R7 process the four data corresponding to next
* butterfly at Stage 2.
* The loop is set for one less than the half of the number of butterflies at
stage2.
* The last butterfly is processed separately.
* All the data are divided by 2, before processing to avoid saturation and for
* scaling purpose.
* All the data are stored to output buffer after the division by 2, for scaling
* purpose.
*/
/*
* These four instructions reads the input data in R2, R3, R4 and R5 registers
from bit reversed locations. The first stage computation is done. It is taken
out of the loop for optimization point of view.
*/
⌨️ 快捷键说明
复制代码
Ctrl + C
搜索代码
Ctrl + F
全屏模式
F11
切换主题
Ctrl + Shift + D
显示快捷键
?
增大字号
Ctrl + =
减小字号
Ctrl + -