📄 fir_decima_spl.asm

📁 This program implements FIR filtering on an ADI DSP. The development environment is VDSP++ 3.5.
/*******************************************************************************
Copyright(c) 2000 - 2002 Analog Devices. All Rights Reserved.
Developed by Joint Development Software Application Team, IPDC, Bangalore, India
for Blackfin DSPs  ( Micro Signal Architecture 1.0 specification).

By using this module you agree to the terms of the Analog Devices License
Agreement for DSP Software. 
********************************************************************************
Module Name     : fir_decima_spl.asm
Label name      : __fir_decima_spl
Version         : 1.3
Change History  :
                Version   Date            Author        Comments
                1.3       11/18/2002      Swarnalatha   Tested with VDSP++ 3.0
                                                        compiler 6.2.2 on 
                                                        ADSP-21535 Rev.0.2
                1.2       11/13/2002      Swarnalatha   Tested with VDSP++ 3.0
                                                        on ADSP-21535 Rev. 0.2
                1.1       03/27/2002      Nishanth      Modified to match 
                                                        silicon cycle count
                1.0       06/02/2001      Nishanth      Original

Description     : This function implements an FIR-based decimation filter. It 
                  produces the filtered, decimated output for a given block of 
                  input data. The characteristics of the filter depend on the 
                  coefficient values, the number of taps (L) and the decimation
                  index (M) supplied by the calling program. 
                  The coefficients stored in vector `h[]` are applied to the 
                  elements of vector `x[]`. A 40-bit accumulator is used for 
                  filtering. The most significant 16 bits of each result are 
                  stored in the output vector `y[]`, computed according to the 
                  decimation index `M`.

                  The program demonstrates the implementation of a zero-phase 
                  decimator.
                  The implementation stops using the delay line once an output 
                  no longer requires samples older than x(0).
                  This avoids the overhead of unnecessarily duplicating input 
                  data.

                  The equation for decimation by M can be expressed as:
                    y(n) = h(0) * x(n*M) + h(1) * x(n*M-1) + ...... + h(L-1) * 
                           x(n*M+1-L)
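
                  The equation above can be sketched as a plain C reference 
                  model, useful for checking the assembly output on a host. 
                  This is only an illustrative sketch: `fir_decima_ref`, 
                  `sample_at` and `sat16` are hypothetical names that are not 
                  part of this module, `fract16` is assumed to be a 16-bit Q15 
                  integer, and `d[]` is assumed to hold the L-1 samples before 
                  x(0), oldest first. The model takes acc >> 15 with 16-bit 
                  saturation; the rounding and 40-bit accumulator behaviour of 
                  the hardware MAC are not modeled, so the low bit may differ.

```c
#include <assert.h>
#include <stdint.h>

typedef int16_t fract16;   // assumption: Q15 stand-in for the VDSP++ fract16 type

// Hypothetical helper: return x(idx), reading from the delay line when idx < 0.
// Assumed layout: d[L-2] = x(-1), d[0] = x(-(L-1)).
static fract16 sample_at(const fract16 x[], const fract16 d[], int L, int idx)
{
    return (idx >= 0) ? x[idx] : d[L - 1 + idx];
}

// Saturate a wide intermediate value to the 16-bit range.
static fract16 sat16(int64_t v)
{
    if (v > INT16_MAX) return INT16_MAX;
    if (v < INT16_MIN) return INT16_MIN;
    return (fract16)v;
}

// Hypothetical reference model of y(n) = sum over k of h(k) * x(n*M - k).
void fir_decima_ref(const fract16 x[], fract16 y[], int Nout,
                    const fract16 h[], int L, int M, fract16 d[])
{
    assert(Nout > 0 && L > 0 && M > 0);

    for (int n = 0; n < Nout; n++) {
        int64_t acc = 0;                       // wide accumulator (Q30 products)
        for (int k = 0; k < L; k++)
            acc += (int64_t)h[k] * sample_at(x, d, L, n * M - k);
        y[n] = sat16(acc >> 15);               // back to Q15, no rounding
    }

    // Refresh the delay line with the last L-1 input samples (Ni = Nout * M).
    // Ascending k is safe in place: any read from d[] is at an index above k.
    int Ni = Nout * M;
    for (int k = 0; k < L - 1; k++)
        d[k] = sample_at(x, d, L, Ni - (L - 1) + k);
}
```

                  For example, with h = {0.5, 0.5} in Q15 (16384 each), M = 2, 
                  x = {1000, 2000, 3000, 4000} and x(-1) = 500 in the delay 
                  line, this model gives y = {750, 2500}.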

                  This implementation is divided into three stages.
                  a) In the first stage, it computes only y(0) since all 
                     samples are from delay line (x(0) is copied to delay line.)
                     y(0) = h(0) * x(0) + h(1) * x(-1) + ... + h(L-1) * x(-L+1)

                  b) In the second stage, it finds the output samples which 
                     require delay line except y(0), i.e. for the first 
                     Ceil(L/M)-1 output samples
                     y(1) = h(0) * x(M) + h(1) * x(M-1) + ...+ h(L-1) * x(M-L+1)
                     ...
                     y(f) = h(0) * x(f*M) + h(1) * x(f*M-1) + ... + h(L-1) *
                            x(f*M-L+1)
                        , where f = Ceil(L/M) - 1.   
                  This stage is kept separate because it uses the delay line. 
                  It has two inner loops: one sums the terms whose inputs are 
                  in the input buffer, and the other sums the terms whose 
                  inputs are in the delay line.

                  c) In the third stage, all the remaining output samples are 
                     calculated.
                     i.e. y(Ceil(L/M)) to y(Nout - 1) are computed in stage 3.

                  d) After filtering the input, the delay line is updated with 
                     the last L-1 input samples.

Assumptions     : 1. This routine assumes that the number of filter 
                     coefficients (L) is even, since each sample is filtered 
                     by two MACs simultaneously.
                  2. The decimation factor (M) is assumed to be even so that 
                     each sample to be filtered has an offset of 2 bytes from 
                     a 4-byte boundary.
                  3. It also assumes that L > M. If L <= M, Ceil(L/M)-1 = 0, 
                     i.e., Stage 2 need not be done. But the loop using LC0 
                     executes at least once.
                  4. It also assumes that the number of input samples is an 
                     integral multiple of the decimation factor. This is needed 
                     for the delay line to be updated correctly.
                  5. It assumes that there is one extra location at the end of 
                     the delay line, to which x[0] is copied.
                  6. It assumes an input buffer aligned to a 32-bit boundary, 
                     with the first 2 bytes occupied by any value other than 
                     the actual input.
                  7. The filter length L divided by 2 must be at least 3, i.e.
                     L/2 >= 3.

Prototype       : void fir_decima(const fract16 x[], fract16 y[], int Nout, 
                              fract16 h[], int L, int M, int LBYM, fract16 d[]);
                            x[]  -  input array 
                            y[]  -  output array
                            Nout -  Number of output samples
                            h[]  -  Filter coefficient array
                            L    -  No. of coefficients 
                            M    -  Decimation Factor
                            LBYM -  Ceil(L/M)
                            d[]  -  Delay line buffer
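
                  Since the caller must supply LBYM and satisfy the assumptions 
                  listed above, a small host-side check can be sketched as 
                  follows. Both `fir_decima_lbym` and `fir_decima_args_ok` are 
                  hypothetical helper names, not part of this module, and the 
                  checks cover only the numeric assumptions (1, 2, 3, 4 and 7); 
                  buffer alignment and layout are not checked.

```c
#include <assert.h>
#include <stdbool.h>

// Hypothetical helper: integer Ceil(L/M) for positive L and M.
static int fir_decima_lbym(int L, int M)
{
    assert(L > 0 && M > 0);
    return (L + M - 1) / M;
}

// Hypothetical helper: check the numeric assumptions of the routine.
// Ni is the number of input samples.
static bool fir_decima_args_ok(int Ni, int L, int M)
{
    if (Ni <= 0 || L <= 0 || M <= 0) return false;
    if (L % 2 != 0)  return false;   // assumption 1: L is even
    if (M % 2 != 0)  return false;   // assumption 2: M is even
    if (L <= M)      return false;   // assumption 3: L > M
    if (Ni % M != 0) return false;   // assumption 4: Ni is a multiple of M
    if (L / 2 < 3)   return false;   // assumption 7: L/2 >= 3
    return true;
}
```

                  For the configuration quoted under Performance (Ni=256, 
                  L=16, M=2), the check passes and LBYM evaluates to 8.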


Registers used  : A0, A1, R0-R3, R5-R7, I0-I3, B2, M0, M1, L0-L3, P0-P2, P4, P5,
                  LC0, LC1.

Performance     :
                Code size   : 260  bytes
                Cycle Count : 1530 Cycles (For Ni=256, L=16 and M=2)

*******************************************************************************/
.section  L1_code;
.global __fir_decima_spl;
.align 8;
    
__fir_decima_spl:                     
    [--SP]=(R7:5,P5:4);     // Push R7-R5 and P5-P4
    
    P4 = [SP+32];           // Address of filter coefficients
    P1 = [SP+36];           // Number of Coefficients (L)
    P0 = [SP+44];           // Ceil(L/M) (Stage 1 + 2 counter)
    R7 = [SP+48];           // Address of delay line buffer
    
    P2 = R2;                // Number of output samples
    
    I0 = R0;                // Address of input buffer
    L0 = 0;                 // Input buffer is a linear Buffer
    
    R2 = P1;                // R2 = L
    R2 <<= 1;               // R2 = 2 * L (length(L2) for circular buffer)
    
    R7 = R7 + R2(S) || R3=[SP+40];  
                            // R7 = Delay line add. + 2*L , R3=Decimation 
                            // Factor (M)
    I1 = R7;                // Location after end of delay line
    L1 = 0;                 // Delay line buffer is a linear buffer
    
    P2 -= P0;               // Stage3 counter
    P0 += -1;               // Stage2 counter
    P5 = P1;                // Stage2b counter = L
    
    B2 = P4;                // Address of coefficients
    I2 = P4;
    L2 = R2;                // Circular buffering of coefficients(length = 2*L)
    
    P4 = R3;                // P4 = M
    
    I3 = R1;                // Address of output buffer
    L3 = 0;                 // Linear Buffer
    
    
    M0 = R2;                // M0 = 2*L
    
    R3 = R3 + R3(S) || R7.L = W[I0++] || R7.H = W[I1--];
                            // R3=2*M as input data is of type fract16
                            // Fetch x(0) from input buffer, Modify delay 
                            // line pointer.
    W[I1--] = R7.l;         // Store x(0) to location after delay line
    R6 = R3;                // R6 = 2*M
    R6 += 8;                // R6 = 2*M + 8
    
    M1 = R6;                // M1 = 2*M + 8
    
    P5 -= P4;               // Stage2b counter = L-M
    
// Start of stage 1
    
    A1=A0=0 || R0 = [I1--] || R1 = [I2++];
                            // Fetch x(0) and x(-1) to R0, Fetch h(0) and h(1) 
                            // to R1
    LSETUP (FIR_DEC_STG1,FIR_DEC_STG1) LC0 = P1 >> 1;
                            // Loop for finding y(0), count value = L/2
FIR_DEC_STG1:
        A1+=R0.H*R1.L, A0+=R0.L*R1.H || R0 = [I1--]|| R1 = [I2++];
                            // A1+=x(0)*h(0), A0+=x(-1)*h(1)
                            // Fetch x(-2) and x(-3) into R0.H and R0.L
                            // Fetch h2 and h3 into R1.L and R1.H (first time)
        R7.L = (A0+=A1) || I1 += M0;
                            // Add the two MACs results, Modify Delay line 
                            // pointer
        I1 += 4;            // Modify delay line pointer
    
// End of stage 1

// Start of stage 2
    
        LSETUP (FIR_DEC_STG2_ST,FIR_DEC_STG2_END) LC0 = P0;
                            // Stage 2 loop (Ceil(L/M) - 1 iterations)
        P0 = P4;
FIR_DEC_STG2_ST:
            A1=A0=0 || R0 = [I0--];
                            // Fetch samples from input buffer to R0
                            
        R6 = R6 + R3(S) || W[I3++] = R7.L || R5 = [I1--];
                            // Adjust modifier for input buffer
                            // Store previous result, Fetch samples from delay 
                            // line to R5
                            
        LSETUP(FIR_DEC_STG2A,FIR_DEC_STG2A) LC1 = P4 >> 1;
                            // Loop for terms containing samples from input 
                            // buffer
         
      
        P4 = P4 + P0;       // Increment counter for input buffer loop
FIR_DEC_STG2A:
            A1+=R0.H*R1.L, A0+=R0.L*R1.H  || R0 = [I0--] || R1 = [I2++];
                            // Find sum of terms containing samples from input 
                            // buffer
    
        R2 = R2 - R3(S) || I0 += M1;
        M1 = R6;
    
        LSETUP(FIR_DEC_STG2B,FIR_DEC_STG2B) LC1 = P5>>1;
                            // Loop for terms containing samples from delay 
                            // line  buffer
FIR_DEC_STG2B:
            A1+=R5.H*R1.L, A0+=R5.L*R1.H  || R5 = [I1--] || R1 = [I2++];
                            // Find sum of terms containing samples from delay 
                            // line buffer
        P5 -= P0;           // Decrement counter for delay line loop
                            // Adjust modifier, Modify delay line pointer
        R7.L=(A0+=A1)||R0 = [I1++M0] ;
                            // Add the two MACs results, Modify delay line 
                            // pointer by M0 (dummy load into R0)

FIR_DEC_STG2_END:
        M0 = R2;            // Decrement modifier by 2*M
    
// End of stage 2

// Start of stage 3
        P1+=-4;
        MNOP;
        NOP;
    LSETUP (FIR_DEC_STG3_ST,FIR_DEC_STG3_END) LC0 = P2;     
                            // Loop for Nout - Ceil(L/M)
    
FIR_DEC_STG3_ST:    

        A1=A0=0 || R0 = [I0--] || W[I3++] = R7.L; 
                            // Fetch input into R0 and store output present in 
                            // R7.L
        A1+=R0.H*R1.L, A0+=R0.L*R1.H || R0 = [I0--] || R1 = [I2++];
        A1+=R0.H*R1.L, A0+=R0.L*R1.H || R0 = [I0--] || R1 = [I2++];
        LSETUP (FIR_DEC_STG3A,FIR_DEC_STG3A) LC1 = P1 >> 1; 
                            // LC1 is the no. of coefficients(L)/2 - 2
        
FIR_DEC_STG3A:
            A1+=R0.H*R1.L, A0+=R0.L*R1.H || R0 = [I0--] || R1 = [I2++];
                            // A1+=x(kM)*h(0), A0+=x(kM-1)*h(1)
                            // Fetch x(M-2) and x(M-3) into R0.H and R0.L
                            // Fetch h2 and h3 into R1.L and R1.H (first time)
FIR_DEC_STG3_END:
        R7.L=(A0+=A1) || I0 += M1;
                            // Get y to R7.L and modify I0

// End of stage 3
    
    P1 += 3;                // Loop counter = L-1
    W[I3++] = R7.L || R0.L = W[I0--];
                            // Fetch last input sample and Store final output 
                            // sample
    
    LSETUP( FIR_DEC_DELUPDATE,FIR_DEC_DELUPDATE) LC0 = P1;
FIR_DEC_DELUPDATE:
        R0.L = W[I0--] || W[I1--] = R0.L;
                            // Update delay line buffer with last input samples
    
    (R7:5,P5:4)=[SP++];     // Pop R7 and P5-P4
    RTS;
    NOP;                    //to avoid one stall if LINK or UNLINK happens to be
                            //the next instruction after RTS in the memory.
                            
__fir_decima_spl.end: