📄 cr2fftn_outplace.asm

📁 ADI BF DSP的FFT汇编优化后的代码
💻 ASM
📖 第 1 页 / 共 2 页
字号:
12 下一页
/*******************************************************************************
Copyright(c) 2000 - 2002 Analog Devices. All Rights Reserved.
Developed by Joint Development Software Application Team, IPDC, Bangalore, India
for Blackfin DSPs  ( Micro Signal Architecture 1.0 specification).

By using this module you agree to the terms of the Analog Devices License
Agreement for DSP Software. 
********************************************************************************
Module Name     : cr2fftn_outplace.asm
Label name      : __cr2fftn_outplace
Version         : 1.4
Change History  :

                Version   Date            Author        Comments
                1.4       11/18/2002      Swarnalatha   Tested with VDSP++ 3.0
                                                        compiler 6.2.2 on 
                                                        ADSP-21535 Rev.0.2
                1.3       11/13/2002      Swarnalatha   Tested with VDSP++3.0
                                                        on ADSP-21535 Rev.0.2
                1.2       02/18/2002      Nishanth      Modified to match 
                                                        silicon cycle count
                1.1       11/12/2001      Nishanth      Modified to support 1024
                                                        points
                1.0       01/20/2001      Shailendra    Original

Description     : This file contains the code for the implementation of FFT. The
                  Algorithm used is Decimation in Time. For the optimization 
                  point of view, the whole computation of butterfly signal flow 
                  has been divided in three parts. The first part, the middle 
                  part and the last part. In the first part, the Stage 1 and 
                  Stage 2 of the butterfly structure are implemented. In the 2nd
                  part the general butterfly computation is done, which 
                  corresponds to the middle stages of butterfly. In the last 
                  part the last stage of the butterfly structure is implemented,
                  where mainly the loop overheads are saved.

                  Input and output both the data are complex and 16 bit and 
                  represented in 1.15 format. The loop unrolling is also done 
                  for optimization point of view. 

                  Twiddle factor array has real and imag. values interleaved 
                        Real -> Cosine and imag -> -sine
                                w = e^(-2*j*pi*[0:N/2 - 1]/N)

Algorithm       : The C equivalent code for the main loop of each butterfly is 
                  as follows:

                    for(i=0; i<le; i++)
                        {
                        xire = out[add].re >> 1;
                        xiim = out[add].im >> 1;
                        xipre = out[add+indx].re >> 1;
                        xipim = out[add+indx].im >> 1;

                        mult.re = ((xipre * wptr[indx_w].re - xipim *
                                    wptr[indx_w].im) >>15);
                        mult.im = ((xipre * wptr[indx_w].im + xipim * 
                                    wptr[indx_w].re) >>15);
                
                        out[add].re = xire + mult.re; 
                        out[add].im = xiim + mult.im;
                        out[add+indx].re = xire - mult.re;
                        out[add+indx].im = xiim - mult.im;
                        add = add + 2*offset;
                        }
                    The ouput is  provided in normal order.

Assumptions     : 1. The length of input array should be more than 4 and it 
                     should be power of 2.i.e, 8, 16, 32

                  2. All the inputs should be less than 0x3fff to avoid overflow
                     in the first stage.

                  3. The input array base address in[] should have 'x' zeros in 
                     LSB for bit reversing properly.
                      where x = log (2*N) to the base 2.

Prototype       :        void _cr2fftn_outplace(
                            complex_fract16 in[],
                                // (i) :  Pointer to the input array. 
                            complex_fract16 out[],
                                // (o) :  Pointer to the output array. 
                            complex_fract16 w[],
                                // (i) :  Twiddle Factor array 
                            int             wst,
                                // (i) :  Twiddle Factor Stride = 1024 / n 
                            int             n)
                                // (i) :  FFT length. 

Registers used  : A0,A1, R0-R7, I0-I3, B0,B2,B3, M0-M3, L0-L3, P0-P5, LC0,LC1.

Performance     :
                Code Size   : 500 Bytes.
                Cycle Count : 195 cycles for FFT size of   16.
                              711 cycles for FFT size of   64.
                             3171 cycles for FFT size of  256.
                            15135 cycles for FFT size of 1024.
*******************************************************************************/
.section L1_code;
.global __cr2fftn_outplace;
.align 8;
    
__cr2fftn_outplace:
    
/**************** Function Prologue *******************************************/
    
/*
*  B0 stores the starting address of Input array.
*  B2 stores the starting address of Output array.
*  B3 stores the starting address of Twiddle factor.
*/
    [--SP] = (R7:4, P5:3);
    L0 = 0;                 //Disable circular buffering
    L1 = 0;
    L2 = 0;
    L3 = 0;
    
    B0 = R0;                //Pointer of Input buffer
    B2 = R1;                //Pointer of output buffer
    B3 = R2;                //Address of Twiddle factor
    P1 = [SP+40];           //The value of wst.  wst = 512/n.
    R2 = [SP+44];           //length of input array
    M3 = B2;
    
    CC = R2 <= 4 (IU);      //Come out of the program if the size of input data
    If CC Jump Terminate;   //is less than equal to 4.
    
/***************** Implementation of First part *******************************/
    
/* 
* The stage 1 and 2 of the butterfly structure are implemented. After the 
* computation the result is stored to output array. The main reason for 
  separating it out from the general computation, is that the multiplications of
  both the stages can be avoided. 
* 
* In the first stage of signal flow, there are n/2 number of butterflies. Each 
 butterfly works on two inputs. These input are multiplied by W0 and added and 
 subtracted. Multiplying a data with W0 which is equal to 1 +0j will result the 
 same data.
* 
* In the  second stage the number of butterflies are n/4. The data are added and
 subtracted after multiplication with W0 and Wn/4. The multiplication with W0 
 doesn't have any impact. The multiplication of data x+jy with Wn/4 will give 
 the value y-jx. 
*
* Therefore, the multiplications involved in both the stages 1 and 2, can be 
  reduced to additions only. The output is stored after dividing by 2 for 
  scaling purpose. In one loop data corresponding to 2 butterflies are 
  processed.
*
* The input is read from Input array and output of computation at this stage is 
  stored in output buffer, which is of the size of input array and passed at the
  time of calling the fft function.
*/
    P4 = R2;    
    I0 = B0;                //Address of Input array
    I2 = B2;                //Address of Output array
    R3 = R2 << 1;      
    M0 = R3;                //M0 stores the offset required for bit reversing
    P5 = P4 >> 2;           //P5 is equal to number of butterflies at Stage 2
    P0 = P5 >> 1;           //P0 is half of the number of butterflies at stage 2
    P0 += -1;               //P0 is one less than 1/2 of butterfly at stage 2.
    MNOP;
    
/*
*  Register I0 is used for reading Input from Input array from Bit reversed 
   locations.I2 is used for writing the output of this stage in output buffer in
   normal order.
*  The register set R2, R3, R4 and R5 process the four data corresponding to one
*  butterfly at Stage 2.
*  Register set R0, R1, R6 and R7 process the four data corresponding to next
*  butterfly at Stage 2.
*  The loop is set for one less than the half of the number of butterflies at 
   stage2.
*  The last butterfly is processed separately.
*  All the data are divided by 2, before processing to avoid saturation and for
*  scaling purpose.
*  All the data are stored to output buffer after the division by 2, for scaling
*  purpose.
*/
    
/*
*  These four instructions reads the input data in R2, R3, R4 and R5 registers 
   from bit reversed locations. The first stage computation is done. It is taken
   out of the loop for optimization point of view.
*/
12 下一页
⌨️ 快捷键说明

复制代码 Ctrl + C
搜索代码 Ctrl + F
全屏模式 F11
切换主题 Ctrl + Shift + D
显示快捷键 ?
增大字号 Ctrl + =
减小字号 Ctrl + -