📄 ofdm_ldpc_sse2.lyx
字号:
#LyX 1.3 created this file. For more info see http://www.lyx.org/\lyxformat 221\textclass article\language american\inputencoding auto\fontscheme default\graphics default\paperfontsize 12\spacing onehalf \papersize Default\paperpackage a4\use_geometry 1\use_amsmath 0\use_natbib 0\use_numerical_citations 0\paperorientation portrait\topmargin 2cm\secnumdepth 3\tocdepth 3\paragraph_separation indent\defskip medskip\quotes_language english\quotes_times 2\papercolumns 1\papersides 1\paperpagestyle headings\layout Standard\align center \begin_inset Graphics filename epfl-logo.eps display color rotateOrigin center\end_inset \layout Standard\align center \size large Swiss Federal Institute of Technology, Lausanne\layout Standard\align center \series medium \size large Mobile Communication Laboratory\series default \series medium (LCM)\layout Standard\added_space_top vfill \align center \series bold \size larger \emph on \noun on Report of Activity\layout Standard\added_space_bottom 6cm \align center \series bold \size larger \emph on \noun on Summer Internship Project\layout Standard\align center Gaurav Gupta IIT Guwahati \newline Y.Srinivasulu IIT Delhi\layout Standard\pagebreak_bottom \align center Guides: Linus Gasser and Nicolae Chiurtu\newline Professor: Bixio Rimoldi\newline \layout Standard\pagebreak_bottom \begin_inset LatexCommand \tableofcontents{}\end_inset \layout SectionIntroduction\layout StandardThis work is a part of our Summer Internship Project at the Mobile Communication Laboratory (LCM), EPFL. The primary objectives of this project are:\layout SubsectionFamiliarisation with the Software Radio Platform\layout StandardThe Software Radio is a radio test bed which has been developed for providing real-time wideband radio communication capabilities in a form attractive for university research and teaching. It provides a platform for development of modules for wireless communication system oriented, but not limited, to 3G standards. Within the physical constraints and hardware limitations the platform is designed to be as adaptable and flexible as possible to allow the evaluation of new communication schemes in a \begin_inset Quotes eld\end_inset real\begin_inset Quotes erd\end_inset environment. \layout SubsectionFast Fourier Transform (FFT) Module\layout StandardA module had to be developed to perform FFT and inverse FFT on the Software Radio platform. An FFT module is needed to perform OFDM based communication, where first the inverse FFT of the frequency domain samples is taken to put them onto the channel and the FFT is taken at the other end of the channel. This module was to be developed exploiting the advantages provided by the parallel processing capabilities of the Intel Streaming SIMD Extensions (SSE and SSE2). \layout StandardThere was a FFT module already present which computed the FFT of an input size of 256 samples in 100 ms. This module used the Intel Multimedia Extensions(MMX) which did the operations using 64 bit operands. It did all its calculations in short integers. We have been able to develop an FFT module which could compute the same FFT in 50 ms by using SSE instructions which operated on 128 bit operands. Our module does the calculations in single precision floating point numbers which can give us better precision than the previous module. However, this advantage is not being exploited currently in the Software Radio because the inputs to the module are short integers and so we convert the inputs from short integers to single precision floating point numbers.\layout SubsectionLow Density Parity Check (LDPC) Codes\layout StandardThe already existing LDPC module has transmit frequency of 2 GHz with 5 MHz bandwidth. Each frame consists of 15 slots of 660 microseconds. Each slot contains 2560 chips, where each chip represents 2 bits. The midamble and guard intervals are included. The decoding was done using Iterative Belief Propagation Algorithm using message passing decoders with inputs to the variable nodes as the Log Likelyhood Ratio(LLR)s of the message from the demapper. The implementation was done using exponential and logarithm of the LLRs which are the mathematical equivalent of hyperbolic tangent and inverse hyperbolic tangent method. The encoding and decoding of LDPC codes was taking 1.55 ms and 186 ms respectively for code rate 0.25 on a 3 Ghz Pentium 4 machine. The number of iterations is set to 30. \layout Standard\pagebreak_bottom Therefore only one slot out of 300 could be decoded in real time. In order to make this real time,the decoding algorithm was to be optimized using the Intel SSE and SSE2 instructions. The goal was to compute the messages in as little time as possible.\layout SectionFFT Module\layout SubsectionAbout FFT\layout StandardTo compute the FFT and the IFFT of the input samples, the Butterfly approach is being used. It involves shuffling the inputs; multiplication with exponential factors; and additions and subtractions with the size of the butterfly increasing at each step.\layout StandardThe DFT of a signal s[n] is defined as S[k]=\begin_inset Formula $\sum_{n=0}^{n=N-1}$\end_inset s[n]\begin_inset Formula $W_{N}^{nk}$\end_inset , where \begin_inset Formula $W=e^{-i\frac{2\pi}{N}}$\end_inset \layout StandardFFT algorithms rely on the fact that an N-point DFT involves redundant calculation. The same values of \begin_inset Formula $W_{N}^{nk}$\end_inset are calculated many times as the computation proceeds because \begin_inset Formula $W_{N}^{nk}$\end_inset is periodic . \layout StandardS[k]=G[k]+\begin_inset Formula $W_{N}^{nk}$\end_inset H[k]\layout Standardwhere G[k] and H[k] are two N/2 point DFTs( see figure 1). \layout Standard\align center \begin_inset Graphics filename newdecomposition.fig\end_inset \layout Standard\align center Figure 1: Decomposition of an N-point FFT into two \begin_inset Formula $\frac{N}{2}$\end_inset -point FFTs\layout StandardThese DFTs can be further split into yet smaller DFTs and the process continues till we reach 2 point DFTs in \begin_inset Formula $log{}_{\textrm{2}}N$\end_inset stages( see figure 2). The 2 point DFT can be calculated using a butterfly of size 2. The process can be propagated to further stages with progressively larger butterflies of sizes 4,8,16,...,N.\layout Standard\align center \begin_inset Graphics filename 2butterfly.fig\end_inset \layout Standard\align center Figure 2: Calculation of FFT using Butterflies of various sizes.\layout SubsectionThe Implementation\layout StandardThe input to the module is complex samples, with 16-bit short integers being used for the real and imaginary parts. To fully utilize the SSE features, the short integers are converted to 32-bit single precision floating point numbers. Each 128-bit register can hold 2 complex samples, with their real and imaginary parts(Im2,Re2,Im1,Re1). Thus two complex operations can be done in one instruction. The functions - handle2(), handle4() and handle8() use SSE registers. For handle8(), the eight input samples from the memory can be loaded into 4 xmm registers, thereby minimizing the data fetching between memory and registers and saving time. Also, the memory has been allocated using swr_malloc(), so that specialized data transfer instructions can be used which operate on 16-byte aligned memory. \layout StandardA handle16() function is not possible because all the 8 xmm registers will be required to load the 16 input samples and no register will be left for temporary data copy. If short integers are used for intermediate calculations, a handle16() function can be made.\layout StandardAt each stage, the input array is reordered with the even indexed samples first and then the odd indexed ones. The first half and the second half part of the array is passed to the \begin_inset Quotes eld\end_inset handle()\begin_inset Quotes erd\end_inset function seperately to calculate their N/2-point DFTs. The second half part is then multiplied by the \begin_inset Formula $W_{N}^{nk}$\end_inset factors from n=0,1,..,N/2, . The computation of the \begin_inset Formula $W_{N}^{nk}$\end_inset factors is done during initialization to save time. These are stored in a temporary array temp[N/2]. The first half of the output to next stage is obtained by adding the first half of the output and the temporary array elements and the second half is obtained by subtracting the temp array elements from the first half of the output array.\layout Subsection*SSE optimization\layout StandardThe FFT for input sizes less than or equal to eight has been optimized using SSE. Consider the 4-point FFT as shown in figure 3.\layout Standard\align center \begin_inset Graphics filename new4butterfly.fig\end_inset \layout Standard\align center Figure 3: The 4-point FFT.\layout StandardThis FFT has been implemented in SSE as shown in the figure 4.\layout Standard\align center \begin_inset Graphics filename xmm.fig\end_inset \layout Standard\align center Figure 4: The implementation of 4-point FFT in SSE.\layout Standardhandle4() takes an complex input array of size 4 and places the input[0] and input[1] in xmm0, input[2] and input[3] in xmm1 with input[0] and input[2] in the lower 64 bits. A copy of xmm0 is made in xmm1 ( ie \begin_inset Formula $\mathnormal{xmm1=xmm0}$\end_inset ).\layout StandardNow the following operations take place \layout Standard\begin_inset Formula $\mathnormal{xmm0=xmm0+xmm2}$\end_inset \layout Standard\begin_inset Formula $\mathnormal{xmm1=xmm1-xmm2}$\end_inset \layout StandardAt this point, xmm0 has \emph on input[0]\emph default , \emph on input[1]\emph default and xmm1 has \emph on input[2]\emph default , \emph on input[3]\emph default . Now, \emph on input[2]\emph default is to be multiplied with \begin_inset Formula $W_{\textrm{4}}^{0}$\end_inset =1 and \emph on input[3]\emph default by \begin_inset Formula $W_{\textrm{4}}^{1}$\end_inset =-j. Therefore \emph on Re3\emph default =\emph on input[3].imag\emph default and \emph on Im3\emph default =\emph on -input[3].real\emph default . This is accomplished by moving the multiplication factor in xmm3.\layout Standard\begin_inset Formula $\mathnormal{xmm1=xmm1*xmm3}$\end_inset .\layout Standard\begin_inset Formula $\mathnormal{xmm2=xmm0}$\end_inset .\layout StandardThe xmm0 and xmm1 are shuffled appropriately to get { Re2, Im2 } in xmm0 and { Re1, Im1 } in xmm2 for the next stage. xmm0 is copied to xmm1. \layout Standard\begin_inset Formula $\mathnormal{xmm1=xmm0}$\end_inset \layout Standard\begin_inset Formula $\mathnormal{xmm0=xmm0+xmm2}$\end_inset \layout Standard\begin_inset Formula $\mathnormal{xmm1=xmm1-xmm2}$\end_inset \layout Standardxmm0 contains output[0],output[1]\layout Standardxmm1 contains output[2],output[3]\layout StandardThus the computation of four point FFT is accomplished. \layout StandardA similar procedure has been adopted for the computation of FFT's of size 2 and 8 so that these computations are highly optimised.\layout StandardFor larger input sizes, the recursive algorithm comes into play and we see that the FFT loses its time advantage as we go for larger input sizes because only a small part is being computed using SSE.\layout Standard\pagebreak_bottom The inverse FFT is done is a similar manner except that for the computation of \begin_inset Formula $W_{N}^{nk}$\end_inset we have W=\begin_inset Formula $e^{i\frac{2\pi}{N}}$\end_inset , and the final output is obtained by dividing the result by N in the final stage.\layout SectionLow Density Parity Check Codes\layout StandardThe code we were using had a rate of 0.25 with 4000 vnodes and 3000 cnodes. We were doing 30 message passing iterations in the decoding and the SNR was around 0 dB.\layout SubsectionStarting Version\layout Standard\added_space_bottom 1cm The already existing program for the decoding of LDPC had a non-optimized graph structure. It read the entire graph from a file called \begin_inset Quotes eld\end_inset graphs.c\begin_inset Quotes erd\end_inset and initialized the graph by calling a function \begin_inset Quotes eld\end_inset getGraph()\begin_inset Quotes erd\end_inset . The graph structure had arrays of variable nodes (vnodes) and check nodes (cnodes). The vnodes and cnodes were each themselves structures. The vnode structure had the received value at the vnode, the degree of the vnode and the edgelist array which gave information about the interconnection of the vnode to other cnodes. The cnode structure had data members degree, which gave the degree of the cnode and an edgelist array which again gave the interconnection of the check node to other vnodes. These interconnections were initialized while reading the graph from the file. After this, if the channel was a Binary Symmetric Channel, the demodulation was also carried out. These soft values were then converted to Log Likelihood Ratios (LLR's) and the vnodes were initialized with these received values by calling \begin_inset Quotes eld\end_inset initializegraph()\begin_inset Quotes erd\end_inset .
⌨️ 快捷键说明
复制代码
Ctrl + C
搜索代码
Ctrl + F
全屏模式
F11
切换主题
Ctrl + Shift + D
显示快捷键
?
增大字号
Ctrl + =
减小字号
Ctrl + -