/*       Floating point IIR using software pipelining

	iir.asm

  Revision history:

    10/06/99  BL    Initial (and so far the only) version


TigerSHARC real IIR filter. SECTIONS (= Nbiq, the number of biquads) must be
a multiple of 4. Performance approaches 2.25 cycles/biquad as Nbiq -> infinity.

  Cycle counts   |  Formula               Cycles
  ---------------+------------------------------
  Power-up       |  6                          6
  Initialization |  3                          3
  Kernel code    |  N*(11+9*Nbiq/4)+6      23506
  Termination    |  1                          1
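
The kernel formula can be checked with a few lines of Python. This is only a sketch: the sizes N = 500 and Nbiq = 16 are inferred from the tables (16*Nbiq = 256 bytes of constant data, and 47*N + 6 = 23506), not stated in this header, and `kernel_cycles` is a hypothetical helper name.

```python
def kernel_cycles(n_samples, n_biquads):
    """Kernel cycle count from the header formula N*(11 + 9*Nbiq/4) + 6."""
    if n_biquads % 4 != 0:
        raise ValueError("Nbiq must be a multiple of 4")
    return n_samples * (11 + 9 * n_biquads // 4) + 6

# Assumed example sizes, inferred from the tables rather than given in the source.
N, NBIQ = 500, 16
total = kernel_cycles(N, NBIQ)
print(total)                       # 23506, matching the table
print((total - 6) / (N * NBIQ))    # 2.9375 cycles per biquad at Nbiq = 16

# The fixed 11-cycle per-sample overhead is amortised as Nbiq grows,
# so the per-biquad cost tends towards 9/4 = 2.25.
for nbiq in (16, 64, 256, 4096):
    print(nbiq, (11 + 9 * nbiq // 4) / nbiq)
```

At the assumed example size the cost is therefore still well above the asymptotic 2.25 cycles/biquad.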


  Memory usage      |  Formula                 Bytes
  ------------------+-------------------------------
  Power-up          |                             60
  Initialization    |                             32
  Kernel code       |                            220
  Termination       |                              4
  ------------------+-------------------------------
  Constant data     |  16*Nbiq                   256
  Non-constant data |  8*Nbiq                    128

  The example input waveform is a sum of 5 sine waves; the filter coefficients
  remove 4 of them, using four biquads per removed sine wave. The filter's
  scalecoef should actually be slightly larger than 1.0 to preserve unity gain;
  this does not affect the cycle count.

  To exploit the TigerSHARC's parallel processing and to hide computational
  stalls, software pipelining is used. The number of biquads is presumed to be
  a multiple of 4; the biquads are divided into 4 sets, S1, S2, S3 and S4.
  The input is denoted X(n), the output Y(n), and the intermediate results are:

    I1 = S1(X),
    I2 = S2(I1),
    I3 = S3(I2)

(Thus, also, Y = S4(I3)).

The parallel execution structure is:

                     ----------
    X(n)     ->     |    S1    | -> I1(n)      done in X block registers xR3:0, xR8, xR11
                     ----------
                     ----------                                ||
    I1(n-1)  ->     |    S2    | -> I2(n-1)    done in Y block registers yR3:0, yR8, yR11
                     ----------
                     ----------
    I2(n-2)  ->     |    S3    | -> I3(n-2)    done in X block registers xR7:4, xR9, xR12
                     ----------
                     ----------                                ||
    I3(n-3)  ->     |    S4    | -> Y(n-3)     done in Y block registers yR7:4, yR9, yR12
                     ----------
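
A software model can make the staggering in the diagram concrete. The sketch below is not the author's code: it is a plain-Python illustration with hypothetical names (`make_biquad`, `make_set`, `run_direct`, `run_pipelined`) and made-up coefficients. All four sets are evaluated in the same iteration, each on a value of a different age, and with all state initially zero the pipelined output is a delayed copy of the straight cascade output.

```python
def make_biquad(a1, a2, b1, b2):
    """Stateful canonical-form biquad with zero initial state."""
    w1 = w2 = 0.0
    def step(x):
        nonlocal w1, w2
        w = x + a1 * w1 + a2 * w2
        y = w + b1 * w1 + b2 * w2
        w2, w1 = w1, w
        return y
    return step

def make_set(coeffs):
    """Chain several biquads into one set (S1..S4 in the diagram)."""
    stages = [make_biquad(*c) for c in coeffs]
    def step(x):
        for stage in stages:
            x = stage(x)
        return x
    return step

# Made-up example coefficients (a1, a2, b1, b2); two biquads per set.
SETS = [
    [(0.5, -0.25, 1.0, 0.5), (0.3, -0.1, -0.4, 0.2)],
    [(0.2, -0.3, 0.6, 0.1), (-0.4, -0.2, 0.5, 0.5)],
    [(0.1, -0.5, -1.0, 0.25), (0.6, -0.4, 0.3, 0.7)],
    [(-0.3, -0.15, 0.8, 0.4), (0.45, -0.35, -0.2, 0.9)],
]

def build():
    return [make_set(c) for c in SETS]

def run_direct(xs):
    s1, s2, s3, s4 = build()
    return [s4(s3(s2(s1(x)))) for x in xs]

def run_pipelined(xs):
    s1, s2, s3, s4 = build()
    i1 = i2 = i3 = 0.0          # values handed from one set to the next
    out = []
    for x in xs:
        # The right-hand side uses the values left by the previous iteration.
        y, i3, i2, i1 = s4(i3), s3(i2), s2(i1), s1(x)
        out.append(y)
    return out

xs = [1.0, 0.5, -0.25, 0.75, 0.0, -1.0, 0.25, 0.5]
direct = run_direct(xs)
pipelined = run_pipelined(xs)
print(pipelined[:3])                    # [0.0, 0.0, 0.0] while the pipeline fills
print(pipelined[3:] == direct[:-3])     # True
```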

Thus, there is a three-sample delay from input to output.
To make this work, the filter coefficients must be interleaved in sets of 4.
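
One possible layout, read off the order of the coefficient loads in the kernel (a2, a1, b2, b1, each fetched first for S1,S2 and then for S3,S4), is sketched below in Python. It assumes that a long load places its first word in the X block and its second word in the Y block, which this file does not state. `interleave_coeffs` is a hypothetical helper, not part of the original project.

```python
def interleave_coeffs(sets):
    """Flatten four sets of (a1, a2, b1, b2) biquads into kernel load order."""
    if len(sets) != 4 or len({len(s) for s in sets}) != 1:
        raise ValueError("expected four sets with the same number of biquads")
    flat = []
    for group in zip(*sets):         # one biquad from each of S1..S4
        for idx in (1, 0, 3, 2):     # a2, a1, b2, b1
            flat.extend(bq[idx] for bq in group)
    return flat

# Symbolic coefficients make the resulting order easy to read.
sets = [[tuple(f"{c}_S{k}" for c in ("a1", "a2", "b1", "b2"))] for k in (1, 2, 3, 4)]
print(interleave_coeffs(sets))
```

Each group of four biquads occupies 16 words, which is consistent with the 16*Nbiq bytes of constant data in the memory table if coefficients are 32-bit floats.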

Individual bi-quads are of the form

                    1 + b1*z^(-1) + b2*z^(-2)
    H(z) =  scale * -------------------------
                    1 - a1*z^(-1) - a2*z^(-2)

All the scale coefficients are combined into one scalecoef with a single
multiply at the end.
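
Because every section is linear, a gain applied after each section can be moved to the end of the cascade and merged into one factor. A minimal Python sketch of that equivalence follows; the coefficients and per-section gains are made up for illustration, and `cascade` is a hypothetical helper.

```python
import math

def cascade(xs, sections, gains=None):
    """Run xs through canonical-form biquads, optionally scaling each output."""
    gains = gains or [1.0] * len(sections)
    state = [[0.0, 0.0] for _ in sections]      # [w(n-1), w(n-2)] per section
    out = []
    for x in xs:
        for (a1, a2, b1, b2), g, st in zip(sections, gains, state):
            w = x + a1 * st[0] + a2 * st[1]
            x = g * (w + b1 * st[0] + b2 * st[1])
            st[1], st[0] = st[0], w
        out.append(x)
    return out

sections = [(0.5, -0.25, 1.0, 0.5), (0.3, -0.1, -0.4, 0.2)]
gains = [0.8, 1.3]                               # made-up per-section scales
xs = [1.0, 0.5, -0.25, 0.75, 0.0, -1.0]

per_section = cascade(xs, sections, gains)       # scale after every section
scalecoef = math.prod(gains)                     # single combined scale
combined = [scalecoef * y for y in cascade(xs, sections)]
print(all(math.isclose(p, c, rel_tol=1e-12, abs_tol=1e-12)
          for p, c in zip(per_section, combined)))   # True
```

The two results agree up to floating-point rounding, so the kernel needs only one multiply per output sample for scaling.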

The recursion is implemented using the canonical form:

    w(n) = x(n) + a1*w(n-1) + a2*w(n-2)
    y(n) = w(n) + b1*w(n-1) + b2*w(n-2)

Note that the input x(n) may be the output of the previous biquad section.
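
The two-line recursion maps directly onto code. The following Python sketch is a reference model for a single section, not a translation of the kernel; `biquad` is a hypothetical name and the coefficients are arbitrary illustrative values.

```python
def biquad(xs, a1, a2, b1, b2):
    """Canonical-form biquad: one shared delay line w serves both halves."""
    w1 = w2 = 0.0                            # w(n-1), w(n-2)
    ys = []
    for x in xs:
        w = x + a1 * w1 + a2 * w2            # recursive (pole) part
        ys.append(w + b1 * w1 + b2 * w2)     # non-recursive (zero) part
        w2, w1 = w1, w                       # age the delay line
    return ys

impulse = [1.0, 0.0, 0.0, 0.0]
print(biquad(impulse, a1=0.5, a2=-0.25, b1=1.0, b2=0.5))   # [1.0, 1.5, 1.0, 0.125]
```

Only two state words per section are needed, which is consistent with the 8*Nbiq bytes of non-constant data in the memory table if the state is held as 32-bit floats.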

Note:
    In this revision the performance approaches 2.25 cycles/biquad as
    Nbiq -> infinity, because each inner-loop pass spends 9 cycles on 4 biquads.

    This could be improved to 2.0 cycles/biquad: if all coefficient loads were
    made quad-word loads, an instruction slot would be freed for the loop
    termination jump. The same change would also improve bus utilization,
    freeing additional time slots for DMA transfers.

************************************************************************/

#include "defTS201.h"
#include "IIRDef.h"

.extern coeffs;
.extern delayline;
.extern input;
.extern output;
.extern scalecoef;

/************************************************************************/
.section reset;
	jump _main (NP);;

/************************************************************************/
.section program;
.global _main;

/************************************** Start of code *****************************************/
_main:
    j4 = j31+input;;                                        	 // j4 -> input                                
    j5 = j31+output;;                                       	 // j5 -> output            
    LC0=N; r9:8=r9:8-r9:8; yr10=[j31+scalecoef];;                // N samples, init accumulators to 0, fetch scale

    j0 = j31+delayline;;                                         // j0 -> delayline
    j1 = j31+delayline;;                                         // j1 -> delayline
    k0 = k31+coeffs;;                                            // k0 -> coeffs

/********************************* Benchmark kernel code ****************************************/
.align_code 4;
main_loop:
        r3=r3-r3; r0=l[j0+=2]; r2=l[k0+=2];;                     // r3=0, r0=w(n-2), r2 = a2 for S1,S2
        r7=r7-r7; xr8=[j4+=1]; r6=l[k0+=2];;                     // r7=0, xr8=input, r6 = a2 for S3,S4
        LC1 = SECTIONS/4; r4=l[j0+=2];;                          // init counter, r4=w(n-2) for S3,S4

.align_code 4;
biq:                                                             // Inner loop computations are SIMD, using both X and Y comp blocks
            fr3=r0*r2; fr8=r8+r3; r1=l[j0+=2]; r2=l[k0+=2];;     // r3=a2*w(n-2), r8=x(n), r1=w(n-1), r2=a1 (S1,S2)
            fr7=r4*r6; fr9=r9+r7; r5=l[j0+=2]; r6=l[k0+=2];;     // r7=a2*w(n-2), r9=x(n), r5=w(n-1), r6=a1 (S3,S4)
            fr3=r1*r2; fr8=r8+r3; l[j1+=2]=r1; r2=l[k0+=2];;     // r3=a1*w(n-1), r8=x(n)+a2*w(n-2), store new w(n-2), r2=b2 (S1,S2)
            fr7=r5*r6; fr9=r9+r7; l[j1+=2]=r5; r6=l[k0+=2];;     // r7=a1*w(n-1), r9=x(n)+a2*w(n-2), store new w(n-2), r6=b2 (S3,S4)
            fr3=r0*r2; fr11=r8+r3; r0=l[j0+=2]; r2=l[k0+=2];;    // r3=b2*w(n-2), r11=new w(n), r0=next w(n-2), r2=b1 (S1,S2)
            fr7=r4*r6; fr12=r9+r7; r4=l[j0+=2]; r6=l[k0+=2];;    // r7=b2*w(n-2), r12=new w(n), r4=next w(n-2), r6=b1 (S3,S4)
            fr3=r1*r2; fr8=r11+r3; l[j1+=2]=r11; r2=l[k0+=2];;   // r3=b1*w(n-1), r8=w(n)+b2*w(n-2), store new w(n-1), r2=next a2 (S1,S2)
            fr7=r5*r6; fr9=r12+r7; l[j1+=2]=r12; r6=l[k0+=2];;   // r7=b1*w(n-1), r9=w(n)+b2*w(n-2), store new w(n-1), r6=next a2 (S3,S4)
            if NLC1E, jump biq;;                                 // loop over the SECTIONS/4 biquads in each set

        fr9=r9+r7; j0=j31+delayline;;                            // xr9=I3(n-2), yr9=Y(n-3), setup j0 for next sample
        fr8=r8+r3; j1=j31+delayline;;                            // xr8=I1(n), yr8=I2(n-1), setup j1 for next sample
        yfr11=r9*r10; yr9=xr9;;                                  // scale output, yr9=I3(n-2)
        xr9=yr8; k0=k31+coeffs;;                                 // xr9=I2(n-1), setup k0 for next sample
        if NLC0E, jump main_loop; yr8=xr8; [j5+=1]=yr11;;        // write output data, yr8=I1(n)

/******************************************* Done ***********************************************/
done:                                                            // all N samples processed: idle here forever
    nop; nop; nop;;
    jump done (NP);;

_main.end:
