<A NAME="IDX230"></A><P>The only requirement of the FFTW MPI code is that you have the standardMPI 1.1 (or later) libraries and header files installed on your system.A free implementation of MPI is available from<A HREF="http://www-unix.mcs.anl.gov/mpi/mpich/">the MPICH home page</A>.<P>Previous versions of the FFTW MPI routines have had an unfortunatetendency to expose bugs in MPI implementations. The current version hasbeen largely rewritten, and hopefully avoids some of the problems. Ifyou run into difficulties, try passing the optional workspace to<CODE>(r)fftwnd_mpi</CODE> (see below), as this allows us to use the standard(and hopefully well-tested) <CODE>MPI_Alltoall</CODE> primitive for<A NAME="IDX231"></A>communications. Please let us know (<A HREF="mailto:fftw@fftw.org">fftw@fftw.org</A>)how things work out.<P><A NAME="IDX232"></A><A NAME="IDX233"></A>Several test programs are included in the <CODE>mpi</CODE> directory. Theones most useful to you are probably the <CODE>fftw_mpi_test</CODE> and<CODE>rfftw_mpi_test</CODE> programs, which are run just like an ordinary MPIprogram and accept the same parameters as the other FFTW test programs(c.f. <CODE>tests/README</CODE>). For example, <CODE>mpirun <I>...params...</I>fftw_mpi_test -r 0</CODE> will run non-terminating complex-transformcorrectness tests of random dimensions. They can also do performancebenchmarks.<H3><A NAME="SEC57">Usage of MPI FFTW for Complex Multi-dimensional Transforms</A></H3><P>Usage of the MPI FFTW routines is similar to that of the uniprocessorFFTW. We assume that the reader already understands the usage of theuniprocessor FFTW routines, described elsewhere in this manual. Somefamiliarity with MPI is also helpful.<P>A typical program performing a complex two-dimensional MPI transformmight look something like:<PRE>#include <fftw_mpi.h>int main(int argc, char **argv){ const int NX = ..., NY = ...; fftwnd_mpi_plan plan; fftw_complex *data; MPI_Init(&argc,&argv); plan = fftw2d_mpi_create_plan(MPI_COMM_WORLD, NX, NY, FFTW_FORWARD, FFTW_ESTIMATE); ...allocate and initialize data... fftwnd_mpi(p, 1, data, NULL, FFTW_NORMAL_ORDER); ... fftwnd_mpi_destroy_plan(plan); MPI_Finalize();}</PRE><P>The calls to <CODE>MPI_Init</CODE> and <CODE>MPI_Finalize</CODE> are required in all<A NAME="IDX234"></A><A NAME="IDX235"></A>MPI programs; see the <A HREF="http://www.mcs.anl.gov/mpi/">MPI home page</A>for more information. Note that all of your processes run the programin parallel, as a group; there is no explicit launching ofthreads/processes in an MPI program.<P><A NAME="IDX236"></A>As in the ordinary FFTW, the first thing we do is to create a plan (oftype <CODE>fftwnd_mpi_plan</CODE>), using:<PRE>fftwnd_mpi_plan fftw2d_mpi_create_plan(MPI_Comm comm, int nx, int ny, fftw_direction dir, int flags);</PRE><P><A NAME="IDX237"></A><A NAME="IDX238"></A><P>Except for the first argument, the parameters are identical to those of<CODE>fftw2d_create_plan</CODE>. (There are also analogous<CODE>fftwnd_mpi_create_plan</CODE> and <CODE>fftw3d_mpi_create_plan</CODE>functions. Transforms of any rank greater than one are supported.)<A NAME="IDX239"></A><A NAME="IDX240"></A>The first argument is an MPI <EM>communicator</EM>, which specifies thegroup of processes that are to be involved in the transform; thestandard constant <CODE>MPI_COMM_WORLD</CODE> indicates all availableprocesses.<A NAME="IDX241"></A><P>Next, one has to allocate and initialize the data. 
<P>
Next, one has to allocate and initialize the data.  This is somewhat
tricky, because the transform data is distributed across the processes
involved in the transform.  It is discussed in detail in the next
section (see Section <A HREF="fftw_4.html#SEC58">MPI Data Layout</A>).

<P>
The actual computation of the transform is performed by the function
<CODE>fftwnd_mpi</CODE>, which differs somewhat from its uniprocessor
equivalent and is described by:

<PRE>
void fftwnd_mpi(fftwnd_mpi_plan p,
                int n_fields,
                fftw_complex *local_data, fftw_complex *work,
                fftwnd_mpi_output_order output_order);
</PRE>

<P>
<A NAME="IDX242"></A>

<P>
There are several things to notice here:

<UL>
<LI><A NAME="IDX243"></A>
First of all, all <CODE>fftw_mpi</CODE> transforms are in-place: the output is
in the <CODE>local_data</CODE> parameter, and there is no need to specify
<CODE>FFTW_IN_PLACE</CODE> in the plan flags.

<LI><A NAME="IDX244"></A>
<A NAME="IDX245"></A>
The MPI transforms also support only a limited subset of the
<CODE>howmany</CODE>/<CODE>stride</CODE>/<CODE>dist</CODE> functionality of the
uniprocessor routines: the <CODE>n_fields</CODE> parameter is equivalent to
<CODE>howmany=n_fields</CODE>, <CODE>stride=n_fields</CODE>, and <CODE>dist=1</CODE>.
(Conceptually, the <CODE>n_fields</CODE> parameter allows you to transform an
array of contiguous vectors, each with length <CODE>n_fields</CODE>.)
<CODE>n_fields</CODE> is <CODE>1</CODE> if you are only transforming a single,
ordinary array.

<LI>The <CODE>work</CODE> parameter is an optional workspace.  If it is not
<CODE>NULL</CODE>, it should be exactly the same size as the <CODE>local_data</CODE>
array.  If it is provided, FFTW is able to use the built-in
<CODE>MPI_Alltoall</CODE> primitive for (often) greater efficiency at the
<A NAME="IDX246"></A>
expense of extra storage space.

<LI>Finally, the last parameter specifies whether the output data has the
same ordering as the input data (<CODE>FFTW_NORMAL_ORDER</CODE>), or if it is
transposed (<CODE>FFTW_TRANSPOSED_ORDER</CODE>).  Leaving the data transposed
<A NAME="IDX247"></A>
<A NAME="IDX248"></A>
results in significant performance improvements due to a saved
communication step (needed to un-transpose the data).  Specifically, the
first two dimensions of the array are transposed, as described in more
detail in the next section.  (A sketch combining the workspace and
transposed-order options appears just after this list.)
</UL>
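<P>
To make the last two points concrete, the following sketch (not part of the
original example) allocates a workspace of the same size as the local data
and requests transposed output.  It assumes that <CODE>plan</CODE> and
<CODE>local_data</CODE> have already been set up as described above, and that
<CODE>total_local_size</CODE> was obtained from <CODE>fftwnd_mpi_local_sizes</CODE>
(see the next section):

<PRE>
fftw_complex *work;

/* optional workspace: the same size as the local data; this lets FFTW
   use MPI_Alltoall for its communication step */
work = (fftw_complex *) malloc(sizeof(fftw_complex) * total_local_size);

/* transform a single array (n_fields = 1), leaving the first two
   dimensions transposed to save a communication step */
fftwnd_mpi(plan, 1, local_data, work, FFTW_TRANSPOSED_ORDER);

free(work);
</PRE>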
<P>
<A NAME="IDX249"></A>
The output of <CODE>fftwnd_mpi</CODE> is identical to that of the
corresponding uniprocessor transform.  In particular, you should recall
our conventions for normalization and the sign of the transform
exponent.

<P>
The same plan can be used to compute many transforms of the same size.
After you are done with it, you should deallocate it by calling
<CODE>fftwnd_mpi_destroy_plan</CODE>.
<A NAME="IDX250"></A>

<P>
<A NAME="IDX251"></A>
<A NAME="IDX252"></A>
<B>Important:</B> The FFTW MPI routines must be called in the same order by
all processes involved in the transform.  You should assume that they
all are blocking, as if each contained a call to <CODE>MPI_Barrier</CODE>.

<P>
Programs using the FFTW MPI routines should be linked with
<CODE>-lfftw_mpi -lfftw -lm</CODE> on Unix, in addition to whatever libraries
are required for MPI.
<A NAME="IDX253"></A>

<H3><A NAME="SEC58">MPI Data Layout</A></H3>

<P>
<A NAME="IDX254"></A>
<A NAME="IDX255"></A>
The transform data used by the MPI FFTW routines is <EM>distributed</EM>: a
distinct portion of it resides with each process involved in the
transform.  This allows the transform to be parallelized, for example,
over a cluster of workstations, each with its own separate memory, so
that you can take advantage of the total memory of all the processors
you are parallelizing over.

<P>
In particular, the array is divided according to the rows (first
dimension) of the data: each process gets a subset of the rows of the
<A NAME="IDX256"></A>
data.  (This is sometimes called a "slab decomposition.")  One
consequence of this is that you can't take advantage of more processors
than you have rows (e.g. a <CODE>64x64x64</CODE> matrix can use at most 64
processors).  This isn't usually much of a limitation, however, as each
processor needs a fair amount of data in order for the
parallel-computation benefits to outweigh the communications costs.

<P>
Below, the first dimension of the data will be referred to as
<SAMP>`<CODE>x</CODE>'</SAMP> and the second dimension as <SAMP>`<CODE>y</CODE>'</SAMP>.

<P>
FFTW supplies a routine to tell you exactly how much data resides on the
current process:

<PRE>
void fftwnd_mpi_local_sizes(fftwnd_mpi_plan p,
                            int *local_nx,
                            int *local_x_start,
                            int *local_ny_after_transpose,
                            int *local_y_start_after_transpose,
                            int *total_local_size);
</PRE>

<P>
<A NAME="IDX257"></A>

<P>
Given a plan <CODE>p</CODE>, the other parameters of this routine are set to
values describing the required data layout, described below.

<P>
<CODE>total_local_size</CODE> is the number of <CODE>fftw_complex</CODE> elements
that you must allocate for your local data (and workspace, if you
choose).  (This value should, of course, be multiplied by
<CODE>n_fields</CODE> if that parameter to <CODE>fftwnd_mpi</CODE> is not <CODE>1</CODE>.)

<P>
The data on the current process has <CODE>local_nx</CODE> rows, starting at
row <CODE>local_x_start</CODE>.  If <CODE>fftwnd_mpi</CODE> is called with
<CODE>FFTW_TRANSPOSED_ORDER</CODE> output, then <CODE>y</CODE> will be the first
dimension of the output, and the local <CODE>y</CODE> extent will be given by
<CODE>local_ny_after_transpose</CODE> and
<CODE>local_y_start_after_transpose</CODE>.  Otherwise, the output has the
same dimensions and layout as the input.
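<P>
As a small illustrative sketch (not from the manual), each process can query
and report its own portion of the layout; this assumes a plan <CODE>plan</CODE>
has already been created, and note that <CODE>local_nx</CODE> may be zero on some
processes:

<PRE>
int rank;
int local_nx, local_x_start;
int local_ny_after_transpose, local_y_start_after_transpose;
int total_local_size;

MPI_Comm_rank(MPI_COMM_WORLD, &rank);
fftwnd_mpi_local_sizes(plan, &local_nx, &local_x_start,
                       &local_ny_after_transpose,
                       &local_y_start_after_transpose,
                       &total_local_size);

printf("process %d: %d rows starting at x = %d; "
       "must allocate %d fftw_complex elements\n",
       rank, local_nx, local_x_start, total_local_size);
</PRE>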
<P>
For instance, suppose you want to transform three-dimensional data of
size <CODE>nx x ny x nz</CODE>.  Then, the current process will store a subset
of this data, of size <CODE>local_nx x ny x nz</CODE>, where the <CODE>x</CODE>
indices correspond to the range <CODE>local_x_start</CODE> to
<CODE>local_x_start+local_nx-1</CODE> in the "real" (i.e. logical) array.
If <CODE>fftwnd_mpi</CODE> is called with <CODE>FFTW_TRANSPOSED_ORDER</CODE> output,
<A NAME="IDX258"></A>
then the result will be a <CODE>ny x nx x nz</CODE> array, of which a
<CODE>local_ny_after_transpose x nx x nz</CODE> subset is stored on the
current process (corresponding to <CODE>y</CODE> values starting at
<CODE>local_y_start_after_transpose</CODE>).

<P>
The following is an example of allocating such a three-dimensional array
(<CODE>local_data</CODE>) before the transform and initializing it to some
function <CODE>f(x,y,z)</CODE>:

<PRE>
     fftwnd_mpi_local_sizes(plan, &local_nx, &local_x_start,
                            &local_ny_after_transpose,
                            &local_y_start_after_transpose,
                            &total_local_size);

     local_data = (fftw_complex*) malloc(sizeof(fftw_complex) *
                                         total_local_size);

     for (x = 0; x < local_nx; ++x)
           for (y = 0; y < ny; ++y)
                 for (z = 0; z < nz; ++z)
                       local_data[(x*ny + y)*nz + z]
                             = f(x + local_x_start, y, z);
</PRE>

<P>
Some important things to remember:

<UL>
<LI>Although the local data is of dimensions <CODE>local_nx x ny x nz</CODE> in
the above example, do <EM>not</EM> allocate the array to be of size
<CODE>local_nx*ny*nz</CODE>.  Use <CODE>total_local_size</CODE> instead.

<LI>The amount of data on each process will not necessarily be the same; in
fact, <CODE>local_nx</CODE> may even be zero for some processes.  (For
example, suppose you are doing a <CODE>6x6</CODE> transform on four
processors.  There is no way to effectively use the fourth processor in
a slab decomposition, so we leave it empty.  Proof left as an exercise
for the reader.)

<LI><A NAME="IDX259"></A>
All arrays are, of course, in row-major order (see Section <A HREF="fftw_2.html#SEC7">Multi-dimensional Array Format</A>).

<LI>If you want to compute the inverse transform of the output of
<CODE>fftwnd_mpi</CODE>, the dimensions of the inverse transform are given by
the dimensions of the output of the forward transform.  For example, if
you are using <CODE>FFTW_TRANSPOSED_ORDER</CODE> output in the above example,
then the inverse plan should be created with dimensions
<CODE>ny x nx x nz</CODE> (see the sketch following this list).
</UL>
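<P>
The following sketch shows one way the forward/inverse pair might be set up
for the transposed-order case; it is an illustration based on the rules
above, not code from the manual.  It assumes the data is logically
<CODE>nx x ny x nz</CODE>, that <CODE>local_data</CODE> and the optional <CODE>work</CODE>
array hold <CODE>total_local_size</CODE> elements each, and that the inverse
transform is also called with <CODE>FFTW_TRANSPOSED_ORDER</CODE> so that its
output comes back in the original (un-transposed) layout:

<PRE>
fftwnd_mpi_plan fplan, iplan;

fplan = fftw3d_mpi_create_plan(MPI_COMM_WORLD, nx, ny, nz,
                               FFTW_FORWARD, FFTW_ESTIMATE);
/* inverse plan: dimensions of the forward transform's transposed output */
iplan = fftw3d_mpi_create_plan(MPI_COMM_WORLD, ny, nx, nz,
                               FFTW_BACKWARD, FFTW_ESTIMATE);

fftwnd_mpi(fplan, 1, local_data, work, FFTW_TRANSPOSED_ORDER);
/* ...operate on the spectrum, which is stored in transposed order... */
fftwnd_mpi(iplan, 1, local_data, work, FFTW_TRANSPOSED_ORDER);

/* per FFTW's normalization convention, the round trip scales the
   data by nx*ny*nz */

fftwnd_mpi_destroy_plan(fplan);
fftwnd_mpi_destroy_plan(iplan);
</PRE>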