nz</CODE>.
<LI>The data layout only depends upon the dimensions of the array, not on
the plan, so you are guaranteed that different plans for the same size
(or inverse plans) will use the same (consistent) data layouts.
</UL>

<H3><A NAME="SEC59">Usage of MPI FFTW for Real Multi-dimensional Transforms</A></H3>

<P>MPI transforms specialized for real data are also available, similar to
the uniprocessor <CODE>rfftwnd</CODE> transforms.  Just as in the uniprocessor
case, the real-data MPI functions gain roughly a factor of two in speed
(and save a factor of two in space) at the expense of more complicated
data formats in the calling program.  Before reading this section, you
should definitely understand how to call the uniprocessor <CODE>rfftwnd</CODE>
functions and also the complex MPI FFTW functions.

<P>The following is an example of a program using <CODE>rfftwnd_mpi</CODE>.  It
computes the size <CODE>nx x ny x nz</CODE> transform of a real function
<CODE>f(x,y,z)</CODE>, multiplies the imaginary part by <CODE>2</CODE> for fun, then
computes the inverse transform.  (We'll also use
<CODE>FFTW_TRANSPOSED_ORDER</CODE> output for the transform, and additionally
supply the optional workspace parameter to <CODE>rfftwnd_mpi</CODE>, just to
add a little spice.)

<PRE>
#include &#60;rfftw_mpi.h&#62;

int main(int argc, char **argv)
{
     const int nx = ..., ny = ..., nz = ...;
     int local_nx, local_x_start, local_ny_after_transpose,
         local_y_start_after_transpose, total_local_size;
     int x, y, z;
     rfftwnd_mpi_plan plan, iplan;
     fftw_real *data, *work;
     fftw_complex *cdata;

     MPI_Init(&#38;argc,&#38;argv);

     /* create the forward and backward plans: */
     plan = rfftw3d_mpi_create_plan(MPI_COMM_WORLD,
                                    nx, ny, nz,
                                    FFTW_REAL_TO_COMPLEX,
                                    FFTW_ESTIMATE);
     iplan = rfftw3d_mpi_create_plan(MPI_COMM_WORLD,
      /* dim.'s of REAL data --&#62; */  nx, ny, nz,
                                     FFTW_COMPLEX_TO_REAL,
                                     FFTW_ESTIMATE);

     rfftwnd_mpi_local_sizes(plan, &#38;local_nx, &#38;local_x_start,
                             &#38;local_ny_after_transpose,
                             &#38;local_y_start_after_transpose,
                             &#38;total_local_size);

     data = (fftw_real*) malloc(sizeof(fftw_real) * total_local_size);

     /* workspace is the same size as the data: */
     work = (fftw_real*) malloc(sizeof(fftw_real) * total_local_size);

     /* initialize data to f(x,y,z): */
     for (x = 0; x &#60; local_nx; ++x)
             for (y = 0; y &#60; ny; ++y)
                     for (z = 0; z &#60; nz; ++z)
                             data[(x*ny + y) * (2*(nz/2+1)) + z]
                                     = f(x + local_x_start, y, z);

     /* Now, compute the forward transform: */
     rfftwnd_mpi(plan, 1, data, work, FFTW_TRANSPOSED_ORDER);

     /* the data is now complex, so typecast a pointer: */
     cdata = (fftw_complex*) data;

     /* multiply imaginary part by 2, for fun:
        (note that the data is transposed) */
     for (y = 0; y &#60; local_ny_after_transpose; ++y)
             for (x = 0; x &#60; nx; ++x)
                     for (z = 0; z &#60; (nz/2+1); ++z)
                             cdata[(y*nx + x) * (nz/2+1) + z].im
                                     *= 2.0;

     /* Finally, compute the inverse transform; the result
        is transposed back to the original data layout: */
     rfftwnd_mpi(iplan, 1, data, work, FFTW_TRANSPOSED_ORDER);

     free(data);
     free(work);
     rfftwnd_mpi_destroy_plan(plan);
     rfftwnd_mpi_destroy_plan(iplan);
     MPI_Finalize();
}
</PRE>
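<P>The padded indexing in the loops above is easy to get wrong.  Purely as
a local convenience (these macros are <EM>not</EM> part of FFTW), you could
make the two layouts explicit:

<PRE>
/* hypothetical helper macros, assuming nx, ny, nz are in scope:
   index into the padded real array (untransposed order, x distributed): */
#define REAL_IDX(x, y, z)  (((x)*ny + (y)) * (2*(nz/2+1)) + (z))
/* index into the complex view (FFTW_TRANSPOSED_ORDER, y distributed): */
#define CPLX_IDX(y, x, z)  (((y)*nx + (x)) * (nz/2+1) + (z))
</PRE>

<P>With these, the assignment in the first loop would read
<CODE>data[REAL_IDX(x, y, z)] = f(x + local_x_start, y, z);</CODE> and the
update in the second would read <CODE>cdata[CPLX_IDX(y, x, z)].im *= 2.0;</CODE>.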
<P>There's a lot of stuff in this example, but it's all just what you would
have guessed, right?  We replaced all the <CODE>fftwnd_mpi*</CODE> functions
by <CODE>rfftwnd_mpi*</CODE>, but otherwise the parameters were pretty much
the same.  The data is distributed among the processes just like for
the complex transforms (see Section <A HREF="fftw_4.html#SEC58">MPI Data Layout</A>), but in addition the
final dimension is padded just like it is for the uniprocessor in-place
real transforms (see Section <A HREF="fftw_3.html#SEC37">Array Dimensions for Real Multi-dimensional Transforms</A>).
In particular, the <CODE>z</CODE> dimension of the real input data is padded
to a size <CODE>2*(nz/2+1)</CODE>, and after the transform it contains
<CODE>nz/2+1</CODE> complex values.

<P>Some other important things to know about the real MPI transforms:

<UL>
<LI>As for the uniprocessor <CODE>rfftwnd_create_plan</CODE>, the dimensions
passed for the <CODE>FFTW_COMPLEX_TO_REAL</CODE> plan are those of the
<EM>real</EM> data.  In particular, even when <CODE>FFTW_TRANSPOSED_ORDER</CODE>
is used as in this case, the dimensions are those of the (untransposed)
real output, not the (transposed) complex input.  (For the complex MPI
transforms, on the other hand, the dimensions are always those of the
input array.)
<LI>The output ordering of the transform (<CODE>FFTW_TRANSPOSED_ORDER</CODE> or
<CODE>FFTW_NORMAL_ORDER</CODE>) <EM>must</EM> be the same for both forward
and backward transforms.  (This is not required in the complex case.)
<LI><CODE>total_local_size</CODE> is the required size in <CODE>fftw_real</CODE> values,
not <CODE>fftw_complex</CODE> values as it is for the complex transforms.
<LI><CODE>local_ny_after_transpose</CODE> and <CODE>local_y_start_after_transpose</CODE>
describe the portion of the array after the transform; that is, they are
indices in the complex array for an <CODE>FFTW_REAL_TO_COMPLEX</CODE> transform
and in the real array for an <CODE>FFTW_COMPLEX_TO_REAL</CODE> transform.
<LI><CODE>rfftwnd_mpi</CODE> always expects <CODE>fftw_real*</CODE> array arguments, but
of course these pointers can refer to either real or complex arrays,
depending upon which side of the transform you are on.  Just as for
in-place uniprocessor real transforms (and also in the example above),
this is most easily handled by typecasting to a complex pointer when
handling the complex data.
<LI>As with the complex transforms, there are also
<CODE>rfftwnd_mpi_create_plan</CODE> and <CODE>rfftw2d_mpi_create_plan</CODE>
functions, and any rank greater than one is supported.  (A short
two-dimensional sketch follows below.)
</UL>

<P>Programs using the MPI FFTW real transforms should link with
<CODE>-lrfftw_mpi -lfftw_mpi -lrfftw -lfftw -lm</CODE> on Unix.
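<P>For instance, a two-dimensional real transform would be planned and
cleaned up as in the following minimal sketch (the sizes <CODE>nx</CODE> and
<CODE>ny</CODE> are placeholders for your real-data dimensions; everything
else works as in the three-dimensional example above):

<PRE>
rfftwnd_mpi_plan p, ip;

/* plans for an nx x ny real transform and its inverse: */
p  = rfftw2d_mpi_create_plan(MPI_COMM_WORLD, nx, ny,
                             FFTW_REAL_TO_COMPLEX, FFTW_ESTIMATE);
ip = rfftw2d_mpi_create_plan(MPI_COMM_WORLD, nx, ny,
                             FFTW_COMPLEX_TO_REAL, FFTW_ESTIMATE);
...
rfftwnd_mpi_destroy_plan(p);
rfftwnd_mpi_destroy_plan(ip);
</PRE>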
<H3><A NAME="SEC60">Usage of MPI FFTW for Complex One-dimensional Transforms</A></H3>

<P>The MPI FFTW also includes routines for parallel one-dimensional
transforms of complex data (only).  Although the speedup is generally
worse than it is for the multi-dimensional routines,<A NAME="DOCF6" HREF="fftw_foot.html#FOOT6">(6)</A> these
distributed-memory one-dimensional transforms are especially useful for
performing one-dimensional transforms that don't fit into the memory of
a single machine.

<P>The usage of these routines is straightforward, and is similar to that
of the multi-dimensional MPI transform functions.  You first include the
header <CODE>&#60;fftw_mpi.h&#62;</CODE> and then create a plan by calling:

<PRE>
fftw_mpi_plan fftw_mpi_create_plan(MPI_Comm comm, int n,
                                   fftw_direction dir, int flags);
</PRE>

<P>The last three arguments are the same as for <CODE>fftw_create_plan</CODE>
(except that all MPI transforms are automatically <CODE>FFTW_IN_PLACE</CODE>).
The first argument specifies the group of processes you are using, and
is usually <CODE>MPI_COMM_WORLD</CODE> (all processes).
A plan can be used for many transforms of the same size, and is
destroyed when you are done with it by calling
<CODE>fftw_mpi_destroy_plan(plan)</CODE>.

<P>If you don't care about the ordering of the input or output data of the
transform, you can include <CODE>FFTW_SCRAMBLED_INPUT</CODE> and/or
<CODE>FFTW_SCRAMBLED_OUTPUT</CODE> in the <CODE>flags</CODE>.
These save some communications at the expense of having the input and/or
output reordered in an undocumented way.  For example, if you are
performing an FFT-based convolution, you might use
<CODE>FFTW_SCRAMBLED_OUTPUT</CODE> for the forward transform and
<CODE>FFTW_SCRAMBLED_INPUT</CODE> for the inverse transform.

<P>The transform itself is computed by:

<PRE>
void fftw_mpi(fftw_mpi_plan p, int n_fields,
              fftw_complex *local_data, fftw_complex *work);
</PRE>

<P><CODE>n_fields</CODE>, as in <CODE>fftwnd_mpi</CODE>, is equivalent to
<CODE>howmany=n_fields</CODE>, <CODE>stride=n_fields</CODE>, and <CODE>dist=1</CODE>, and
should be <CODE>1</CODE> when you are computing the transform of a single
array.  <CODE>local_data</CODE> contains the portion of the array local to the
current process, described below.  <CODE>work</CODE> is either <CODE>NULL</CODE> or
an array exactly the same size as <CODE>local_data</CODE>; in the latter case,
FFTW can use the <CODE>MPI_Alltoall</CODE> communications primitive which is
(usually) faster at the expense of extra storage.  Upon return,
<CODE>local_data</CODE> contains the portion of the output local to the
current process (see below).

<P>To find out what portion of the array is stored local to the current
process, you call the following routine:

<PRE>
void fftw_mpi_local_sizes(fftw_mpi_plan p,
                          int *local_n, int *local_start,
                          int *local_n_after_transform,
                          int *local_start_after_transform,
                          int *total_local_size);
</PRE>

<P><CODE>total_local_size</CODE> is the number of <CODE>fftw_complex</CODE> elements
you should actually allocate for <CODE>local_data</CODE> (and <CODE>work</CODE>).
<CODE>local_n</CODE> and <CODE>local_start</CODE> indicate that the current process
stores <CODE>local_n</CODE> elements corresponding to the indices
<CODE>local_start</CODE> to <CODE>local_start+local_n-1</CODE> in the "real"
array.  <EM>After the transform, the process may store a different
portion of the array.</EM>  The portion of the data stored on the process
after the transform is given by <CODE>local_n_after_transform</CODE> and
<CODE>local_start_after_transform</CODE>.  This data is exactly the same as a
contiguous segment of the corresponding uniprocessor transform output
(i.e. an in-order sequence of sequential frequency bins).
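<P>Putting these routines together, a complete one-dimensional transform
might look like the following minimal sketch, modeled on the
multi-dimensional example earlier in this section.  (The initialization
function <CODE>g</CODE> is a hypothetical stand-in for your own code, and the
size <CODE>n</CODE> is left unspecified; only the FFTW and MPI calls are taken
from the interfaces described above.)

<PRE>
#include &#60;fftw_mpi.h&#62;

int main(int argc, char **argv)
{
     const int n = ...;
     int local_n, local_start, local_n_after_transform,
         local_start_after_transform, total_local_size;
     int i;
     fftw_mpi_plan plan;
     fftw_complex *data, *work;

     MPI_Init(&#38;argc, &#38;argv);

     plan = fftw_mpi_create_plan(MPI_COMM_WORLD, n,
                                 FFTW_FORWARD, FFTW_ESTIMATE);

     fftw_mpi_local_sizes(plan, &#38;local_n, &#38;local_start,
                          &#38;local_n_after_transform,
                          &#38;local_start_after_transform,
                          &#38;total_local_size);

     data = (fftw_complex*) malloc(sizeof(fftw_complex) * total_local_size);

     /* supplying a workspace lets FFTW use MPI_Alltoall (usually faster): */
     work = (fftw_complex*) malloc(sizeof(fftw_complex) * total_local_size);

     /* initialize the local portion from a (hypothetical) function g: */
     for (i = 0; i &#60; local_n; ++i) {
          data[i].re = g(local_start + i);
          data[i].im = 0.0;
     }

     fftw_mpi(plan, 1, data, work);

     /* data now holds the local_n_after_transform outputs beginning at
        index local_start_after_transform of the "global" result. */

     free(data);
     free(work);
     fftw_mpi_destroy_plan(plan);
     MPI_Finalize();
}
</PRE>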
<P>Note that, if you compute both a forward and a backward transform of the
same size, the local sizes are guaranteed to be consistent.  That is,
the local size after the forward transform will be the same as the local
size before the backward transform, and vice versa.

<P>Programs using the FFTW MPI routines should be linked with
<CODE>-lfftw_mpi -lfftw -lm</CODE> on Unix, in addition to whatever libraries
are required for MPI.

<H3><A NAME="SEC61">MPI Tips</A></H3>

<P>There are several things you should consider in order to get the best
performance out of the MPI FFTW routines.

<P>First, if possible, the first and second dimensions of your data should
be divisible by the number of processes you are using.  (If only one can
be divisible, then you should choose the first dimension.)  This allows
the computational load to be spread evenly among the processes, and also
reduces the communications complexity and overhead.  In the
one-dimensional transform case, the size of the transform should ideally
be divisible by the <EM>square</EM> of the number of processors.

<P>Second, you should consider using the <CODE>FFTW_TRANSPOSED_ORDER</CODE>
output format if it is not too burdensome.  The speed gains from
communications savings are usually substantial.

<P>Third, you should consider allocating a workspace for
<CODE>(r)fftw(nd)_mpi</CODE>, as this can often (but not always) improve
performance (at the cost of extra storage).

<P>Fourth, you should experiment with the best number of processors to use
for your problem.  (There comes a point of diminishing returns, when the
communications costs outweigh the computational benefits.<A NAME="DOCF7" HREF="fftw_foot.html#FOOT7">(7)</A>)
The <CODE>fftw_mpi_test</CODE> program can output helpful performance
benchmarks.  It accepts the same parameters as the uniprocessor test
programs (c.f. <CODE>tests/README</CODE>) and is run like an ordinary MPI
program.  For example, <CODE>mpirun -np 4 fftw_mpi_test -s 128x128x128</CODE>
will benchmark a <CODE>128x128x128</CODE> transform on four processors,
reporting timings and parallel speedups for all variants of
<CODE>fftwnd_mpi</CODE> (transposed, with workspace, etcetera).  (Note also
that there is the <CODE>rfftw_mpi_test</CODE> program for the real
transforms.)
