How to run the EuroBen Efficiency Benchmark
===========================================

Below we describe how to install and run the EuroBen Efficiency Benchmark.
In case you run into trouble, please mail to:

    Aad van der Steen; steen@phys.uu.nl

-----------------------------------

The EuroBen Efficiency Benchmark has the following structure:

                  |-- Makefile
                  |-- commun/
                  |-- dddot/
                  |-- fft1d/
                  |-- gmxm/
                  |-- gmxv/
    effbench/ ----|-- linful/
                  |-- linspar/
                  |-- ping/
                  |-- pingpong/
                  |-- qsort/
                  |-- smxv/
                  |-- transp/
                  |-- wvlt2d/

================================================================================
 INSTALLATION AND EXECUTION
================================================================================

The master Makefile in effbench/ can be used for installation of the
13 programs:

commun   -- A test for the speed of various communication patterns (MPI).
dddot    -- A test for the speed of a distributed dot product (MPI).
fft1d    -- A test for the speed of a 1-D FFT.
gmxv     -- A test for the speed of the matrix-vector multiply Ab = c.
gmxm     -- A test for the speed of the matrix-matrix multiply AB = C.
linful   -- A test for the solution of a full linear system Ax = b.
linspar  -- A test for the solution of a sparse linear system Ax = b.
ping     -- A very detailed test to assess bandwidth and latency in
            point-to-point communication (1-sided communication, MPI).
pingpong -- A very detailed pingpong test to assess bandwidth and latency in
            point-to-point communication (2-sided communication, MPI).
qsort    -- A test for the speed of Quicksort on integers and 8-byte reals.
smxv     -- A test for the speed of the sparse matrix-vector multiply Ab = c
            in CRS format.
transp   -- A test for the speed of a global distributed matrix transpose (MPI).
wvlt2d   -- A test for the speed of a 2-D Haar Wavelet Transform.

We assume that, at least for the first run, you will want to run all programs
with the same compiler options. You should perform the following steps:

1) cd basics/
   1a - Modify the subroutine 'state.f' such that it reflects the state of the
        system: type of machine, compiler version, compiler options, OS
        version, etc.
   1b - OPTIONAL (non-MPI programs): The program directories for the
        sequential programs contain the timing functions 'wclock.f' and
        'cclock.c'. 'wclock.f' is a Fortran interface routine that calls
        'cclock.c', which in turn relies on the 'gettimeofday' Unix system
        routine. This timer works almost everywhere (except under UNICOS) and
        delivers the wallclock time with a resolution of a few microseconds.
        It is generally better than the Fortran 90 routine 'System_Clock'.
        If, for any reason, you want to use another/better wallclock timer,
        modify the Real*8 function 'wclock.f' in basics/.

2) Go back to effbench/
   2a - Do a 'make state': the 'state.f' routine that you have modified is
        copied to all the program directories.
   2b - OPTIONAL (non-MPI programs): If you have modified 'wclock.f' for the
        sequential programs in basics/, do a 'make clock-seq': 'wclock.f' is
        copied to the relevant program directories.
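For reference, steps 1 and 2 condense to a short shell session like the one
below. The 'vi' calls are only placeholders for whatever editor you prefer,
and it is assumed that basics/ sits directly under effbench/, as the
"go back to effbench/" wording above suggests:

    cd effbench/basics
    vi state.f         # 1a: describe machine, compiler, options, OS, ...
    vi wclock.f        # 1b: OPTIONAL, only if you want a different wallclock timer
    cd ..
    make state         # 2a: copy state.f to all program directories
    make clock-seq     # 2b: OPTIONAL, copy the modified wclock.f as well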
3) cd install/
   3a - In install/ you will find header files with definitions for the 'make'
        utility.
        3a1: Sequential programs: Modify 'Make.Incl-seq' such that it contains
             the correct name for the Fortran 90 compiler, the loader (usually
             the same as the compiler), and the options for the Fortran 90 and
             C compilers. For completeness' sake there are empty definitions
             for libraries (LIBS) and include files (INCS) you might want to
             use, but in normal situations they are not needed for the
             sequential programs.
        3a2: Parallel programs: Modify 'Make.Incl-mpi' such that it contains
             the correct name for the Fortran 90 compiler, the loader (usually
             the same as the compiler), and the options for the Fortran 90
             compiler. The names of the compiler systems for MPI programs may
             be different from those for the sequential programs. For
             completeness' sake there is an empty definition for the include
             file (INCS) you might want to use, but in normal situations this
             is not needed. For libraries (LIBS), fill in the name of the MPI
             library (if necessary).
        3a3: Modify the 'Speed.Incl' file: it contains only one line, starting
             with '++++'. Replace this by the Theoretical Peak Performance of
             your system expressed in Mflop/s per CPU. So for a system with a
             Theoretical Peak Performance of 3.6 Gflop/s per processor:
             ++++ --> 3600.
             NOTE: This should really be per processor (per socket, if you
             will) and NOT per core.

4) Go back to effbench/
   Do a 'make lib'. This builds an object library 'intlib.a' that is used by
   the sequential numerical programs to compute the integral of the
   performance over the appropriate problem-size ranges and to calculate
   latencies for some MPI programs.

5) 5a: Do a 'make make-seq': the Makefiles in the directories of the
       sequential programs are completed according to the specifications you
       made in 'install/Make.Incl-seq'.
   5b: Do a 'make make-mpi': the Makefiles in the directories of the MPI
       programs are completed according to the specifications you made in
       'install/Make.Incl-mpi'.

6) Do a 'make makeall': the executables, each with the name x.<prog>, are
   built in the directories <prog>, where <prog> is 'commun/', 'dddot/',
   'fft1d/', etc. This will take a minute.
   6a - The non-MPI programs can be run by: 'x.<prog>'.
   6b - Run the MPI programs by: 'mpirun -np <p> x.<prog>' or
        'mpiexec -n <p> x.<prog>', where <p> is the desired number of
        processes and x.<prog> the MPI executable (or by any equivalent of
        mpirun if this is not available; also see 8b below).

7) Do a 'make speed': the Theoretical Peak Performance is set to the correct
   value in the relevant directories.

8) 8a: Do a 'make runall': this runs all sequential programs in turn. The
       results are placed in a directory called 'Log.`hostname`', where
       'hostname' is the local name of your system. This will take a few
       minutes. The results have names '<prog>.log', where <prog> is any of
       the programs listed above.
   8b: For the MPI programs, 'make runall' causes the MPI programs to be run
       and the results to be transferred to 'Log.`hostname`'. The programs are
       run with the following numbers of processes:

           mpirun -np  6 x.commun   > commun.log
           mpirun -np 16 x.dddot    > dddot.log
           mpirun -np  2 x.ping     > ping.log
           mpirun -np  2 x.pingpong > pingpong.log
           mpirun -np  8 x.transp   > transp.log

       NOTE: Although improbable, with newer MPI-2 implementations
       'mpirun -np <procs> <x.prog>' may have to be replaced by
       'mpiexec -n <procs> <x.prog>'. This is provided for in the scripts
       'x.all' in the 5 relevant directories: comment out the 'mpirun' line
       and decomment the 'mpiexec' line.
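Putting steps 3 to 8 together, a first full configure-build-run session
typically looks like the sketch below (the 'vi' calls again only stand for
editing the three files in install/ by hand):

    cd install
    vi Make.Incl-seq   # 3a1: compiler, loader and options for the sequential programs
    vi Make.Incl-mpi   # 3a2: compiler, loader, options and MPI library for the MPI programs
    vi Speed.Incl      # 3a3: replace '++++' by the peak performance in Mflop/s per processor
    cd ..
    make lib           # 4:  build intlib.a
    make make-seq      # 5a: complete the Makefiles of the sequential programs
    make make-mpi      # 5b: complete the Makefiles of the MPI programs
    make makeall       # 6:  build all executables x.<prog>
    make speed         # 7:  propagate the peak performance to the program directories
    make runall        # 8:  run everything; results end up in Log.`hostname`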
================================================================================
 CUSTOMISING THE RUNS (OPTIONAL)
================================================================================

You might want to run some of the programs in an alternative setting. This
might include:

- Other compiler options. In that case do the following for any of the
  programs <prog> (see the example session after this list):
  9a) cd <prog>/
      9a1 - Modify the definition of 'FFLAGS' in the Makefile.
      9a2 - Modify the compiler options line in subroutine 'state.f'.
      9a3 - Do a 'make veryclean': this removes all old objects and the
            executable.
      9a4 - Do a 'make'.
      9a5 - Do an 'x.all': this runs the program and writes the result to
            '<prog>.log'.
      9a6 - mv <prog>.log ../Log.`hostname`/
      or,
      9a7 - ALTERNATIVELY, when you have run several programs with altered
            repeat counts:
            a. cd ..
            b. Do a 'make collect': this moves any result file '<prog>.log'
               from the '<prog>/' directories to 'Log.`hostname`/'.

- Substitute library calls or other equivalent code for that of the model
  implementation.
  9b) cd <prog>/
      9b1 - Modify the definition of 'FFLAGS' in the Makefile (if required).
      9b2 - Modify the compiler options line in subroutine 'state.f'
            (if required).
      9b3 - Do a 'make veryclean': this removes all old objects and the
            executable.
      9b4 - Invalidate the routines to be replaced by removing or renaming
            them and, if necessary, modify the Makefile accordingly.
            Specifically:
            A. For the programs 'gmxm' and 'gmxv' it is assumed that you would
               like to replace the given Fortran routines by the routines
               'dgemm' and 'dgemv', respectively. If so, change the zero in
               the first line of 'gmxm.in' and 'gmxv.in' to an integer value
               /= 0 (and invalidate the supplied BLAS routines in the
               respective directories). If you use routines that are different
               from the BLAS routines, still modify the 'gmxm.in' and
               'gmxv.in' files by changing the zero to a non-zero value, but,
               in addition, change the calls to 'dgemm' and 'dgemv' to those
               of your own favourite library routines.
            B. In program 'linful' the factorisation and solution are based on
               the usual LAPACK routines. So, you only have to invalidate the
               source routines present in the directory.
            C. As there is no universally accepted standard for FFTs, there is
               no alternative to modifying the code in 'fft1d.f'. Replace
               lines 81 and 82 by the call(s) to your favourite library
               routine.
      9b5 - Do a 'make'.
      9b6 - Run the program as before.
      9b7 - BE SURE TO REPORT THE SUBSTITUTION(S) IN THE RESULTS!
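As an illustration of the 9a workflow, rebuilding and rerunning a single
program with other compiler options might look like the session below. Here
'fft1d' is used purely as an example name; any <prog> directory works the
same way, and 'vi' again stands for your editor of choice:

    cd fft1d
    vi Makefile        # 9a1: change FFLAGS
    vi state.f         # 9a2: record the new compiler options
    make veryclean     # 9a3: remove old objects and the executable
    make               # 9a4: rebuild x.fft1d
    ./x.all            # 9a5: run; the result is written to fft1d.log
    cd ..
    make collect       # 9a7: move all <prog>.log files to Log.`hostname`/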
================================================================================
 ABOUT THE EFFICIENCY MEASURE
================================================================================

1) The programs 'fft1d', 'linful', 'linspar', and 'wvlt2d' measure an overall
   efficiency (the ratio of the actual performance and the theoretical peak
   performance) by integrating the actual performance over a range of problem
   sizes. For instance, 'linspar' is evaluated in the range N = 1000,...,20000
   with 10 additional problem sizes in between. The problem sizes are given in
   the appropriate '<prog>.in' file, with <prog> any of the four programs
   mentioned. If, for any reason (for instance, because you suspect that the
   curve used for the integration does not capture the performance behaviour
   of your processor adequately), you wish to add measuring points WITHIN the
   range given for each of the programs, you are welcome to do so by adding
   the appropriate line(s) to the '<prog>.in' file(s). Note, however, that it
   is NOT allowed to modify the lower and upper bounds themselves.

2) The four programs show the fraction of the peak performance that is
   required to be attained and also the efficiency measure that is actually
   attained by integrating over the observation range. Obviously, the actual
   fraction of the theoretical peak performance must be greater than or equal
   to the required fraction; equally obviously, the better this fraction, the
   higher the efficiency of the processor. It does not matter whether your
   final result is obtained by using the original code, by optimising it, or
   by using a library, as long as the library is a standard tool and generally
   accessible to the average users of such processors.

================================================================================
 FURTHER REMARKS
================================================================================

1) The program 'ping' measures bandwidth and latency by means of one-sided MPI
   communication (MPI_Get and MPI_Put). At present many MPI implementations
   still do not support one-sided communication as required in MPI-2.
   Consequently, the program 'ping' may not compile on your system, and hence
   you will have no result for it. Because of the slow adoption of full MPI-2
   we presently do not consider this result mandatory, but it certainly adds
   value to the total result of the benchmark.

2) Please run the benchmark FIRST AS-IS, i.e., with the minimal changes needed
   to get it running (probably none are necessary). Then, if you are inclined
   to do so, do the optimisations you have in mind and run again.

================================================================================

Lastly,

    ====================
    | Best of success! |
    ====================