\documentclass[fleqn,12pt]{article}
%\setlength{\oddsidemargin}{-0.25in}\setlength{\textwidth}{7.0in}\setlength{\topmargin}{-0.25in}\setlength{\textheight}{9.5in}
%
\usepackage{algorithmic}
\newenvironment{algorithm}{\begin{quote} %\vspace{1em}
\begin{algorithmic}\samepage}{\end{algorithmic} %\vspace{1em}
\end{quote}}
\newcommand{\Real}{\mathop{\mathrm{Re}}}
\newcommand{\Imag}{\mathop{\mathrm{Im}}}
\begin{document}
\title{FFT Algorithms}
\author{Brian Gough, {\tt bjg@network-theory.co.uk}}
\date{May 1997}
\maketitle

\section{Introduction}

Fast Fourier Transforms (FFTs) are efficient algorithms for
calculating the discrete Fourier transform (DFT),
%
\begin{eqnarray}
h_a &=& \mathrm{DFT}(g_b) \\
    &=& \sum_{b=0}^{N-1} g_b \exp(-2\pi i a b /N) \qquad 0 \leq a \leq N-1 \\
    &=& \sum_{b=0}^{N-1} g_b W_N^{ab} \qquad W_N = \exp(-2\pi i/N)
\end{eqnarray}
%
The DFT usually arises as an approximation to the continuous Fourier
transform when functions are sampled at discrete intervals in space or
time. The naive evaluation of the discrete Fourier transform is a
matrix-vector multiplication ${\mathbf W}\vec{g}$, and would take
$O(N^2)$ operations for $N$ data-points. The general principle of the
Fast Fourier Transform algorithms is to use a divide-and-conquer
strategy to factorize the matrix $W$ into smaller sub-matrices,
typically reducing the operation count to $O(N \sum f_i)$ if $N$ can
be factorized into smaller integers, $N = f_1 f_2 \dots f_n$.

This chapter explains the algorithms used in the GSL FFT routines and
provides some information on how to extend them. To learn more about
the FFT you should read the review article {\em Fast Fourier
Transforms: A Tutorial Review and A State of the Art} by Duhamel and
Vetterli~\cite{duhamel90}. There are several introductory books on the
FFT with example programs, such as {\em The Fast Fourier Transform} by
Brigham~\cite{brigham74} and {\em DFT/FFT and Convolution Algorithms}
by Burrus and Parks~\cite{burrus84}. In 1979 the IEEE published a
compendium of carefully-reviewed Fortran FFT programs in {\em Programs
for Digital Signal Processing}~\cite{committee79}, which is a useful
reference for implementations of many different FFT algorithms. If you
are interested in using DSPs then the {\em Handbook of Real-Time Fast
Fourier Transforms}~\cite{smith95} provides detailed information on
the algorithms and hardware needed to design, build and test DSP
applications. Many FFT algorithms rely on results from number theory.
These results are covered in the books {\em Fast transforms:
algorithms, analyses, applications} by Elliott and
Rao~\cite{elliott82}, {\em Fast Algorithms for Digital Signal
Processing} by Blahut~\cite{blahut} and {\em Number Theory in Digital
Signal Processing} by McClellan and Rader~\cite{mcclellan79}. There is
also an annotated bibliography of papers on the FFT and related topics
by Burrus~\cite{burrus-note}.
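As a concrete baseline for the operation counts discussed above, here
is a minimal sketch of the naive $O(N^2)$ evaluation in C99 (the
function name {\tt naive\_dft} is arbitrary, and {\tt M\_PI} is
assumed to be provided by {\tt math.h}; this is an illustration, not
the GSL implementation):
%
\begin{verbatim}
#include <complex.h>
#include <math.h>

/* Naive O(N^2) evaluation of h_a = sum_b g_b exp(-2 pi i a b / N).
   Illustration only -- not the GSL implementation. */
void
naive_dft (const double complex *g, double complex *h, int N)
{
  for (int a = 0; a < N; a++)
    {
      double complex sum = 0.0;

      for (int b = 0; b < N; b++)
        {
          double theta = -2.0 * M_PI * (double) a * (double) b / (double) N;
          sum += g[b] * cexp (I * theta);   /* g_b W_N^{ab} */
        }

      h[a] = sum;
    }
}
\end{verbatim}
%
Each of the $N$ outputs requires a pass over all $N$ inputs, which is
the $O(N^2)$ cost that the algorithms below reduce.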
\section{Families of FFT algorithms}
%
There are two main families of FFT algorithms: the Cooley-Tukey
algorithm and the Prime Factor algorithm. These differ in the way they
map the full FFT into smaller sub-transforms. Of the Cooley-Tukey
algorithms there are two types of routine in common use: mixed-radix
(general-$N$) algorithms and radix-2 (power of 2) algorithms. Each
type of algorithm can be further classified by additional
characteristics, such as whether it operates in-place or uses
additional scratch space, whether its output is in a sorted or
scrambled order, and whether it uses decimation-in-time or -frequency
iterations.

Mixed-radix algorithms work by factorizing the data vector into
shorter lengths. These can then be transformed by small-$N$ FFTs.
Typical programs include FFTs for small prime factors, such as 2, 3,
5, \dots, which are highly optimized. The small-$N$ FFT modules act as
building blocks and can be multiplied together to make longer
transforms. By combining a reasonable set of modules it is possible to
compute FFTs of many different lengths. If the small-$N$ modules are
supplemented by an $O(N^2)$ general-$N$ module then an FFT of any
length can be computed, in principle. Of course, any lengths which
contain large prime factors would perform only as $O(N^2)$.

Radix-2 algorithms, or ``power of two'' algorithms, are simplified
versions of the mixed-radix algorithm. They are restricted to lengths
which are a power of two. The basic radix-2 FFT module only involves
addition and subtraction, so the algorithms are very simple. Radix-2
algorithms have been the subject of much research into optimizing the
FFT. Many of the most efficient radix-2 routines are based on the
``split-radix'' algorithm. This is actually a hybrid which combines
the best parts of both radix-2 and radix-4 (``power of 4'')
algorithms~\cite{sorenson86,duhamel86}.

The prime factor algorithm (PFA) is an alternative form of general-$N$
algorithm based on a different way of recombining small-$N$ FFT
modules~\cite{temperton83pfa,temperton85}. It has a very simple
indexing scheme which makes it attractive. However it only works in
the case where all factors are mutually prime. This requirement makes
it more suitable as a specialized algorithm for given lengths.

\begin{subsection}{FFTs of prime lengths}
%
Large prime lengths cannot be handled efficiently by any of these
algorithms. However it may still be possible to compute a DFT, by
using results from number theory. Rader showed that it is possible to
convert a length-$p$ FFT (where $p$ is prime) into a convolution of
length-($p-1$). There is a simple identity between the convolution of
length $N$ and the FFT of the same length (made explicit below), so if
$p-1$ is easily factorizable this allows the convolution to be
computed efficiently via the FFT. The idea is presented in the
original paper by Rader~\cite{raderprimes} (also reprinted
in~\cite{mcclellan79}), but for more details see the theoretical books
mentioned earlier.
\end{subsection}
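The identity referred to here is the cyclic convolution theorem, which
is standard material (stated for reference, not part of the original
text): under the DFT a convolution becomes a pointwise product,
%
\begin{equation}
\mathrm{DFT}(x \ast y)_a = \mathrm{DFT}(x)_a \, \mathrm{DFT}(y)_a
\qquad \mathrm{where} \qquad
(x \ast y)_c = \sum_{b=0}^{N-1} x_b \, y_{(c-b) \bmod N} ,
\end{equation}
%
so a convolution of length $N$ can be computed with two forward FFTs,
a pointwise multiplication, and one inverse FFT.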
%\subsection{In-place algorithms vs algorithms using scratch space}
%
%For simple algorithms, such as the radix-2 algorithms, the FFT can be
%performed in-place, without additional memory. For this to be possible
%the index calculations must be simple enough that the calculation of
%the result to be stored in index $k$ can be carried out at the same
%time as the data in $k$ is used, so that no temporary storage is
%needed.
%
%The mixed-radix algorithm is too complicated for this. It requires an
%area of scratch space sufficient to hold intermediate copies of the
%input data.
%
%\subsection{Self-sorting algorithms vs scrambling algorithms}
%
%In a self-sorting algorithm internal permutations are included at each
%iteration of the FFT which leave the final iteration in its natural
%order. The mixed-radix algorithm can be made self-sorting. The
%additional scratch space helps here, although it is in fact possible
%to design self-sorting algorithms which do not require scratch space.
%
%The in-place radix-2 algorithm is not self-sorting. The data is
%permuted, and transformed into bit-reversed order. Note that
%bit-reversal only refers to the order of the array, not the values for
%each index, which are of course not changed. More precisely, the data
%stored in location $[b_n \dots b_2 b_1 b_0]$ is moved to location
%$[b_0 b_1 b_2 \dots b_n]$, where $b_0 \dots b_n$ are the raw bits of a
%given index. Placing the data in bit-reversed order is easily done,
%using right shifts to extract the least significant bits of the index
%and left shifts to feed them into a register for the bit-reversed
%location. Simple radix-2 FFTs usually recompute the bit-reversed
%ordering in the naive way for every call. For repeated calls it might
%be worthwhile to precompute the permutation and cache it. There are
%also more sophisticated algorithms which reduce the number of
%operations needed to perform bit-reversal~\cite{bit-reversal}.
%
%\begin{subsection}{Decimation-in-time (DIT) vs Decimation-in-Frequency (DIF)}
%\end{subsection}

\subsection{Optimization}
%
There is no such thing as the single fastest FFT {\em algorithm}. FFT
algorithms involve a mixture of floating point calculations, integer
arithmetic and memory access. Each of these operations will have
different relative speeds on different platforms. The performance of
an algorithm is a function of the hardware it is implemented on. The
goal of optimization is thus to choose the algorithm best suited to
the characteristics of a given hardware platform.

For example, the Winograd Fourier Transform Algorithm (WFTA) is
designed to reduce the number of floating point multiplications in the
FFT. However, it does this at the expense of using many more additions
and data transfers than other algorithms. As a consequence the WFTA
might be a good candidate algorithm for machines where data transfers
occupy a negligible time relative to floating point arithmetic.
However, on most modern machines, where the speed of data transfers is
comparable to or slower than floating point operations, it would be
outperformed by an algorithm which used a better mix of operations
(i.e. more floating point operations but fewer data transfers).

For a detailed study of this sort of effect, comparing the different
algorithms on different platforms, consult the paper {\it Effects of
Architecture Implementation on DFT Algorithm Performance} by Mehalic,
Rustan and Route~\cite{mehalic85}. The paper was written in the early
1980s and has data for super- and mini-computers which you are
unlikely to see today, except in a museum. However, the methodology is
still valid and it would be interesting to see similar results for
present-day computers.

\section{FFT Concepts}
%
Factorization is the key principle of the mixed-radix FFT
divide-and-conquer strategy. If $N$ can be factorized into a product
of $n_f$ integers,
%
\begin{equation}
N = f_1 f_2 \dots f_{n_f} ,
\end{equation}
%
then the FFT itself can be divided into smaller FFTs for each factor.
More precisely, an FFT of length $N$ can be broken up into,
%
\begin{quote}
\begin{tabular}{l}
$(N/f_1)$ FFTs of length $f_1$, \\
$(N/f_2)$ FFTs of length $f_2$, \\
\dots \\
$(N/f_{n_f})$ FFTs of length $f_{n_f}$.
\end{tabular}
\end{quote}
%
The total number of operations for these sub-operations will be
$O(N (f_1 + f_2 + \dots + f_{n_f}))$. When the factors of $N$ are all
small integers this will be substantially less than $O(N^2)$.
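The first step of a mixed-radix FFT is therefore to factorize $N$.
The following C sketch shows one way this might be done; the preferred
set of factors $\{7,6,5,4,3,2\}$ is purely illustrative, and the
routine is not the GSL factorization function:
%
\begin{verbatim}
/* Factorize n, preferring sizes for which optimized small-N modules
   exist (an illustrative set).  The factors[] array must have room
   for at least log2(n) entries.  Sketch only, not the GSL routine. */
int
fft_factorize (int n, int factors[])
{
  static const int small[] = { 7, 6, 5, 4, 3, 2 };
  int nf = 0;

  for (int i = 0; i < 6; i++)
    {
      while (n % small[i] == 0)
        {
          factors[nf++] = small[i];
          n /= small[i];
        }
    }

  /* remaining prime factors (all >= 11) would each be handled by a
     general-N module */
  for (int p = 11; n > 1; p += 2)
    {
      while (n % p == 0)
        {
          factors[nf++] = p;
          n /= p;
        }
    }

  return nf;   /* number of factors found */
}
\end{verbatim}
%
Each factor then selects the small-$N$ module used for one pass of the
transform.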
For example, when $N$ is a power of 2 an FFT of length $N=2^m$ can be
reduced to $m N/2$ FFTs of length 2, or $O(N\log_2 N)$ operations.
Here is a demonstration which shows this. We start with the full DFT,
%
\begin{eqnarray}
h_a &=& \sum_{b=0}^{N-1} g_b W_N^{ab} \qquad W_N=\exp(-2\pi i/N)
\end{eqnarray}
%
and split the sum into even and odd terms,
%
\begin{eqnarray}
\phantom{h_a} &=& \left(\sum_{b=0,2,4,\dots} + \sum_{b=1,3,5,\dots}\right) g_b W_N^{ab} \\
 &=& \sum_{b=0}^{N/2-1} g_{2b} W_N^{a(2b)} + \sum_{b=0}^{N/2-1} g_{2b+1} W_N^{a(2b+1)}.
\end{eqnarray}
%
This converts the original DFT of length $N$ into two DFTs of length
$N/2$,
%
\begin{equation}
h_a = \sum_{b=0}^{N/2-1} g_{2b} W_{(N/2)}^{ab}
    + W_N^a \sum_{b=0}^{N/2-1} g_{2b+1} W_{(N/2)}^{ab} .
\end{equation}
%
The first term is a DFT of the even elements of $g$. The second term
is a DFT of the odd elements of $g$, premultiplied by an exponential
factor $W_N^a$ (known as a {\em twiddle factor}),
%
\begin{equation}
h = \mathrm{DFT}(g_{even}) + W_N^a \, \mathrm{DFT}(g_{odd}).
\end{equation}
%
By splitting the DFT into its even and odd parts we have reduced the
operation count from $N^2$ (for a DFT of length $N$) to $2 (N/2)^2$
(for two DFTs of length $N/2$). The cost of the splitting is that we
need an additional $O(N)$ operations to multiply by the twiddle factor
$W_N^a$ and recombine the two sums.

We can repeat the splitting procedure recursively $\log_2 N$ times
until the full DFT is reduced to DFTs of single terms. The DFT of a
single value is just the identity operation, which costs nothing.
However, since $O(N)$ operations were needed at each stage to
recombine the even and odd parts, the total number of operations to
obtain the full DFT is $O(N \log_2 N)$. If we had used a length which
was a product of factors $f_1$, $f_2$, \dots\ we could have split the
sum in a similar way. First we would split terms corresponding to the
factor $f_1$, instead of the even and odd terms corresponding to a
factor of two. Then we would repeat this procedure for the subsequent
factors. This would lead to a final operation count of
$O(N \sum f_i)$.

This procedure gives some motivation for why the number of operations
in a DFT can in principle be reduced from $O(N^2)$ to $O(N \sum f_i)$.
It does not give a good explanation of how to implement the algorithm
in practice, which is what we shall do in the next section.

\section{Radix-2 Algorithms}
%
For radix-2 FFTs it is natural to write array indices in binary form
because the length of the data is a power of two. This is nicely
explained in the article {\em The FFT: Fourier Transforming One Bit at
a Time} by P.B. Visscher~\cite{visscher96}. A binary representation
for indices is the key to deriving the simplest efficient radix-2
algorithms.

We can write an index $b$ ($0 \leq b \leq 2^n - 1$) in binary
representation like this,
%
\begin{equation}
b = [b_{n-1} \dots b_1 b_0] = 2^{n-1}b_{n-1} + \dots + 2 b_1 + b_0 .
\end{equation}
%
Each of the $b_0, b_1, \dots, b_{n-1}$ are the bits (either 0 or 1) of
$b$.

Using this notation the original definition of the DFT,
%
\begin{equation}
 h(a) = \sum_{b=0}^{N-1} g_b \exp(-2\pi i a b /N) ,
\end{equation}
%
can be rewritten as a sum over the bits of $b$ to give an equivalent
summation,
%
\begin{equation}
 h([a_{n-1} \dots a_1 a_0]) = \sum_{b_0=0}^{1} \sum_{b_1=0}^{1} \dots
 \sum_{b_{n-1}=0}^{1} g([b_{n-1} \dots b_1 b_0]) W_N^{ab}
\end{equation}
%
where the bits of $a$ are $a=[a_{n-1} \dots a_1 a_0]$.
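Before deriving the in-place iterations it may help to see the
even/odd splitting of the previous section written out directly as
code. The following recursive routine is a minimal sketch, assuming
$N$ is a power of two and using scratch storage at each level (the
function name {\tt fft\_radix2} is arbitrary; this is not the GSL
implementation):
%
\begin{verbatim}
#include <complex.h>
#include <math.h>
#include <stdlib.h>

/* Recursive radix-2 DIT sketch: h = DFT(g_even) + W_N^a DFT(g_odd).
   Assumes N is a power of two.  Illustration only. */
void
fft_radix2 (double complex *g, int N)
{
  if (N == 1)
    return;                /* the DFT of a single value is the identity */

  int M = N / 2;
  double complex *even = malloc (M * sizeof (double complex));
  double complex *odd = malloc (M * sizeof (double complex));

  for (int b = 0; b < M; b++)
    {
      even[b] = g[2 * b];
      odd[b] = g[2 * b + 1];
    }

  fft_radix2 (even, M);    /* two half-length DFTs */
  fft_radix2 (odd, M);

  for (int a = 0; a < M; a++)
    {
      double complex w = cexp (-2.0 * M_PI * I * a / N);  /* twiddle W_N^a */
      g[a] = even[a] + w * odd[a];
      g[a + M] = even[a] - w * odd[a];   /* uses W_N^{a+N/2} = -W_N^a */
    }

  free (even);
  free (odd);
}
\end{verbatim}
%
The recursion and scratch storage make the splitting explicit; the
bit-indexed derivation which follows reduces the same splitting to
iterations over the bits of the indices.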
To reduce the number of operations in the sum we will use the
periodicity of the exponential term,
%
\begin{equation}
W_N^{x+N} = W_N^{x}.
\end{equation}
%
Most of the products $ab$ in $W_N^{ab}$ are greater than $N$. By
making use of this periodicity they can all be collapsed down into the
range $0 \dots N-1$. This allows us to reduce the number of operations
by combining common terms, modulo $N$. Using this idea we can derive
decimation-in-time or decimation-in-frequency algorithms, depending on
how we break the DFT summation down into common terms. We'll first
consider the decimation-in-time algorithm.

\subsection{Radix-2 Decimation-in-Time (DIT)}
%
To derive the decimation-in-time algorithm we start by separating out
the most significant bit of the index $b$,
%
\begin{equation}
[b_{n-1} \dots b_1 b_0] = 2^{n-1}b_{n-1} + [b_{n-2} \dots b_1 b_0] .
\end{equation}
%
Now we can evaluate the innermost sum of the DFT without any
dependence on the remaining bits of $b$ in the exponential,
%
\begin{eqnarray}
 h([a_{n-1} \dots a_1 a_0]) &=& \sum_{b_0=0}^{1} \sum_{b_1=0}^{1} \dots
\sum_{b_{n-1}=0}^{1} g(b) W_N^{a(2^{n-1}b_{n-1}+[b_{n-2} \dots b_1 b_0])} \\
 &=& \sum_{b_0=0}^{1} \sum_{b_1=0}^{1} \dots
\sum_{b_{n-2}=0}^{1} W_N^{a[b_{n-2} \dots b_1 b_0]}
\sum_{b_{n-1}=0}^{1} g(b) W_N^{a(2^{n-1}b_{n-1})}
\end{eqnarray}
%
Looking at the term $W_N^{a(2^{n-1}b_{n-1})}$ we see that we can also
remove most of the dependence on $a$, by using the periodicity of the
exponential,
%
\begin{eqnarray}
W_N^{a(2^{n-1}b_{n-1})} &=&
\exp(-2\pi i [a_{n-1} \dots a_1 a_0] 2^{n-1} b_{n-1}/ 2^n ) \\
&=& \exp(-2\pi i [a_{n-1} \dots a_1 a_0] b_{n-1}/ 2 ) \\
&=& \exp(-2\pi i ( 2^{n-2}a_{n-1} + \dots + a_1 + (a_0/2)) b_{n-1} ) \\
&=& \exp(-2\pi i a_0 b_{n-1}/2 ) \\
&=& W_2^{a_0 b_{n-1}}
\end{eqnarray}
%
Thus the innermost exponential term simplifies so that it only
involves the highest order bit of $b$ and the lowest order bit of $a$,
and the sum can be reduced to,
%
\begin{equation}
h([a_{n-1} \dots a_1 a_0])
= \sum_{b_0=0}^{1} \sum_{b_1=0}^{1} \dots
\sum_{b_{n-2}=0}^{1} W_N^{a[b_{n-2} \dots b_1 b_0]}
\sum_{b_{n-1}=0}^{1} g(b) W_2^{a_0 b_{n-1}}.
\end{equation}
%
We can repeat this procedure for the next most significant bit of
$b$, $b_{n-2}$, using a similar identity,
%
\begin{eqnarray}
W_N^{a(2^{n-2}b_{n-2})} &=&
\exp(-2\pi i [a_{n-1} \dots a_1 a_0] 2^{n-2} b_{n-2}/ 2^n ) \\
&=& W_4^{[a_1 a_0] b_{n-2}} ,
\end{eqnarray}
%
to give a formula with even less dependence on the bits of $a$,
%
\begin{equation}
h([a_{n-1} \dots a_1 a_0])
= \sum_{b_0=0}^{1} \sum_{b_1=0}^{1} \dots
\sum_{b_{n-3}=0}^{1} W_N^{a[b_{n-3} \dots b_1 b_0]}
\sum_{b_{n-2}=0}^{1} W_4^{[a_1 a_0] b_{n-2}}
\sum_{b_{n-1}=0}^{1} g(b) W_2^{a_0 b_{n-1}}.
\end{equation}
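Since $W_2 = \exp(-2\pi i/2) = -1$, the innermost length-2 transform
can be made explicit (a standard observation, added here for
emphasis),
%
\begin{equation}
\sum_{b_{n-1}=0}^{1} g(b) W_2^{a_0 b_{n-1}}
 = g([0 \, b_{n-2} \dots b_0]) + (-1)^{a_0} \, g([1 \, b_{n-2} \dots b_0]) ,
\end{equation}
%
an addition when $a_0 = 0$ and a subtraction when $a_0 = 1$. This is
the sense in which the basic radix-2 module involves only additions
and subtractions.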