📄 dsp_fft32x32.h64

📁 dm642函数库
💻 H64
📖 第 1 页 / 共 2 页
字号:
上一页 12
*                                                                           *
*                        xh0  = x[2 * i0    ] + x[2 * i2    ];              *
*                        xh1  = x[2 * i0 + 1] + x[2 * i2 + 1];              *
*                        xl0  = x[2 * i0    ] - x[2 * i2    ];              *
*                        xl1  = x[2 * i0 + 1] - x[2 * i2 + 1];              *
*                                                                           *
*                        xh20 = x[2 * i1    ] + x[2 * i3    ];              *
*                        xh21 = x[2 * i1 + 1] + x[2 * i3 + 1];              *
*                        xl20 = x[2 * i1    ] - x[2 * i3    ];              *
*                        xl21 = x[2 * i1 + 1] - x[2 * i3 + 1];              *
*                                                                           *
*                        x[2 * i0    ] = xh0 + xh20;                        *
*                        x[2 * i0 + 1] = xh1 + xh21;                        *
*                                                                           *
*                        xt0  = xh0 - xh20;                                 *
*                        yt0  = xh1 - xh21;                                 *
*                        xt1  = xl0 + xl21;                                 *
*                        yt2  = xl1 + xl20;                                 *
*                        xt2  = xl0 - xl21;                                 *
*                        yt1  = xl1 - xl20;                                 *
*                                                                           *
*                        x[2 * i1    ] = (xt1 * co1 + yt1 * si1) >> 15;     *
*                        x[2 * i1 + 1] = (yt1 * co1 - xt1 * si1) >> 15;     *
*                        x[2 * i2    ] = (xt0 * co2 + yt0 * si2) >> 15;     *
*                        x[2 * i2 + 1] = (yt0 * co2 - xt0 * si2) >> 15;     *
*                        x[2 * i3    ] = (xt2 * co3 + yt2 * si3) >> 15;     *
*                        x[2 * i3 + 1] = (yt2 * co3 - xt2 * si3) >> 15;     *
*                    }                                                      *
*              }                                                            *
*                                                                           *
*              ie <<= 2;                                                    *
*          }                                                                *
*      }                                                                    *
*                                                                           *
*       The conventional Cooley Tukey FFT, is written using three loops.    *
*       The outermost loop "k" cycles through the stages. There are log     *
*       N to the base 4 stages in all. The loop "j" cycles through the      *
*       groups of butterflies with different twiddle factors, loop "i"      *
*       reuses the twiddle factors for the different butterflies within     *
*       a stage. It is interesting to note the following:                   *
*                                                                           *
*-------------------------------------------------------------------------- *
*       Stage#     #Groups     # Butterflies with common     #Groups*Bflys  *
*                                twiddle factors                            *
*-------------------------------------------------------------------------- *
*        1         N/4          1                            N/4            *
*        2         N/16         4                            N/4            *
*        ..                                                                 *
*        logN      1            N/4                          N/4            *
*-------------------------------------------------------------------------- *
*                                                                           *
*       The following statements can be made based on above observations:   *
*                                                                           *
*       a) Inner loop "i0" iterates a veriable number of times. In          *
*       particular the number of iterations quadruples every time from      *
*       1..N/4. Hence software pipelining a loop that iterates a vraiable   *
*       number of times is not profitable.                                  *
*                                                                           *
*       b) Outer loop "j" iterates a variable number of times as well.      *
*       However the number of iterations is quartered every time from       *
*       N/4 .  . Hence the behaviour in (a) and (b) are exactly opposite    *
*       to each other.                                                      *
*                                                                           *
*       c) If the two loops "i" and "j" are colaesced together then they    *
*       will iterate for a fixed number of times namely N/4. This allows    *
*       us to combine the "i" and "j" loops into 1 loop. Optimized impl-    *
*       ementations will make use of this fact.                             *
*                                                                           *
*       In addition the Cooley Tukey FFT accesses three twiddle factors     *
*       per iteration of the inner loop, as the butterflies that re-use     *
*       twiddle factors are lumped together. This leads to accessing the    *
*       twiddle factor array at three points each sepearted by "ie". Note   *
*       that "ie" is initially 1, and is quadrupled with every iteration.   *
*       Therfore these three twiddle factors are not even contiguous in     *
*       the array.                                                          *
*                                                                           *
*       In order to vectorize the FFT, it is desirable to access twiddle    *
*       factor array using double word wide loads and fetch the twiddle     *
*       factors needed. In order to do this a modified twiddle factor       *
*       array is created, in which the factors WN/4, WN/2, W3N/4 are        *
*       arranged to be contiguous. This eliminates the seperation between   *
*       twiddle factors within a butterfly. However this implies that as    *
*       the loop is traversed from one stage to another, that we maintain   *
*       a redundant version of the twiddle factor array. Hence the size     *
*       of the twiddle factor array increases as compared to the normal     *
*       Cooley Tukey FFT.  The modified twiddle factor array is of size     *
*       "2 * N" where the conventional Cooley Tukey FFT is of size"3N/4"    *
*       where N is the number of complex points to be transformed. The      *
*       routine that generates the modified twiddle factor array was        *
*       presented earlier. With the above transformation of the FFT,        *
*       both the input data and the twiddle factor array can be accessed    *
*       using double-word wide loads to enable packed data processing.      *
*                                                                           *
*       The final stage is optimised to remove the multiplication as        *
*       w0 = 1.  This stage also performs digit reversal on the data,       *
*       so the final output is in natural order.                            *
*                                                                           *
*       The fft() code shown here performs the bulk of the computation      *
*       in place. However, because digit-reversal cannot be performed       *
*       in-place, the final result is written to a separate array, y[].     *
*                                                                           *
*       There is one slight break in the flow of packed processing that     *
*       needs to be comprehended. The real part of the complex number is    *
*       in the lower half, and the imaginary part is in the upper half.     *
*       The flow breaks in case of "xl0" and "xl1" because in this case     *
*       the real part needs to be combined with the imaginary part because  *
*       of the multiplication by "j". This requires a packed quantity like  *
*       "xl21xl20" to be rotated as "xl20xl21" so that it can be combined   *
*        using add2's and sub2's. Hence the natural version of C code       *
*       shown below is transformed using packed data processing as shown:   *
*                                                                           *
*                        xl0  = x[2 * i0    ] - x[2 * i2    ];              *
*                        xl1  = x[2 * i0 + 1] - x[2 * i2 + 1];              *
*                        xl20 = x[2 * i1    ] - x[2 * i3    ];              *
*                        xl21 = x[2 * i1 + 1] - x[2 * i3 + 1];              *
*                                                                           *
*                        xt1  = xl0 + xl21;                                 *
*                        yt2  = xl1 + xl20;                                 *
*                        xt2  = xl0 - xl21;                                 *
*                        yt1  = xl1 - xl20;                                 *
*                                                                           *
*                        xl1_xl0   = _sub2(x21_x20, x21_x20)                *
*                        xl21_xl20 = _sub2(x32_x22, x23_x22)                *
*                        xl20_xl21 = _rotl(xl21_xl20, 16)                   *
*                                                                           *
*                        yt2_xt1   = _add2(xl1_xl0, xl20_xl21)              *
*                        yt1_xt2   = _sub2(xl1_xl0, xl20_xl21)              *
*                                                                           *
*       Also notice that xt1, yt1 endup on seperate words, these need to    *
*       be packed together to take advantage of the packed twiddle fact     *
*       ors that have been loaded. In order for this to be achieved they    *
*       are re-aligned as follows:                                          *
*                                                                           *
*       yt1_xt1 = _packhl2(yt1_xt2, yt2_xt1)                                *
*       yt2_xt2 = _packhl2(yt2_xt1, yt1_xt2)                                *
*                                                                           *
*       In the folllowing code since all data elements are 32 bits, add2    *
*       sub2 are replaced with normal 32 bit add's and subtracts.           *
*       The packed words "yt1_xt1" allows the loaded"sc" twiddle factor     *
*       to be used for the complex multiplies. The real part of the         *
*       multiply and the imaginary part of the multiply are performed       *
*       as 16x32 multiplies using MPYLIR and MPYHIR                         *
*                                                                           *
*       (X + jY) ( C + j S) = (XC + YS) + j (YC - XS).                      *
*                                                                           *
*       The actual twiddle factors for the FFT are cosine, - sine. The      *
*       twiddle factors stored in the table are csine and sine, hence       *
*       the sign of the "sine" term is comprehended during multipli-        *
*       cation as shown above.                                              *
*                                                                           *
*   MEMORY NOTE                                                             *
*       The optimized implementations are written for LITTLE ENDIAN.        *
*                                                                           *
*   CYCLES                                                                  *
*       [(N/4 + 1) * 10 + 10] * ceil(log4(N) - 1) + 6 * (N/4 + 2) + 27      *
*                                                                           *
*       N = 512, [1290 + 10] * 4 + 6 * 130 + 27 = 6007 cycles               *
*                                                                           *
*   CODESIZE                                                                *
*       972 bytes                                                           *
* ------------------------------------------------------------------------- *
*             Copyright (c) 2003 Texas Instruments, Incorporated.           *
*                            All Rights Reserved.                           *
* ========================================================================= *
*============================================================================*

        .global _DSP_fft32x32

* ========================================================================= *
*   End of file:  dsp_fft32x32.h64                                          *
* ------------------------------------------------------------------------- *
*             Copyright (c) 2003 Texas Instruments, Incorporated.           *
*                            All Rights Reserved.                           *
* ========================================================================= *
上一页 12
⌨️ 快捷键说明

复制代码 Ctrl + C
搜索代码 Ctrl + F
全屏模式 F11
切换主题 Ctrl + Shift + D
显示快捷键 ?
增大字号 Ctrl + =
减小字号 Ctrl + -