how_to_compile.txt
-------
FFdecsa
-------

Compiling is as easy as running a make command, if you have gcc and are
using a little endian machine. 64 bit machines have not been tested but
may work with little or no changes; big endian machines will certainly
give incorrect results (read the technical_background.txt to know where
the problem is).

Before compiling you could edit the Makefile to tweak compiler flags for
optimal performance. If you want to play with different bit-grouping
strategies you have to edit FFdecsa_DBG.c and change the "our choice"
definition. This is highly critical for performance.

After compilation run the FFdecsa_test application. It will test correct
decryption and print the measured speed (use "nice --19 ./FFdecsa_test"
on an idle machine for better results). Or just use "make test".

gcc >=3.3.3 is highly recommended. Older versions could give performance
problems.

icc is currently unusable. In the initial phases of development of
FFdecsa icc was able to compile the code and gave interesting speed
results when using the 8charA grouping mode (arrays of 8 characters are
automatically manipulated through MMX instructions). At some point the
code began to work incorrectly because of a compiler bug (but I found a
workaround). Then, the performance dropped with no reason; I found a
workaround by adding an unused variable (alignment problem, grep for icc
in the code to see where it happens). Then, with the introduction of
group modes based on intrinsics, gcc was finally able to go beyond the
speed record originally set by icc. Additional code tweaks added more
speed to gcc, while icc started to segfault on compilation (both version
7 and 8). In conclusion, icc is bugged and this code is too hard for it.

gcc on the other hand is great. I tried to inspect the generated
assembler to find weak spots, and the generated code is very good indeed.

Note: the code can be compiled with gcc or g++. g++ is 3% faster for
some reason.

You should not get any errors or warnings.
I only get two "inlining failed" warnings on two functions I asked to be
inlined but gcc doesn't want to inline.

The build process creates additional temp files by running grep
commands. This is how debugging output is handled. All the lines
containing DBG are removed and the temp file is compiled (so the line
numbers change between temp and original files). Don't edit the temp
files, they will be overwritten. If you don't remove the DBG lines (for
example, by changing "grep -v DBG" into "grep -v aaDBG" in Makefile) a
lot of output will be generated. This is useful to understand what's
wrong when the FFdecsa_test is failing. I included a reference "known
good" output in the debug_output directory. Extra debug output is
commented out in the code.

The debug output functionality could be... bugged. This is because I
tested everything using hard coded int grouping mode and then
generalized the debug output to abstract grouping modes. A bug where 4
bytes are printed instead of 8 could be present somewhere. I think it
isn't, but you've been warned.

This code was only tried on Linux.

It should work on Windows or other platforms, but you may encounter
problems related to the compiler quality. If you want to try, begin with
the int grouping mode. It is only 30% slower than the best (MMX) and it
should be easily portable because no intrinsics are used.
I'm particularly interested in hearing what kind of performance can be
obtained on x86_64 processors in int, long long int, mmx, 2mmx, sse
modes.

As a reference, here are the results I get on an Athlon XP 2400+ (this
processor runs at 2000MHz); other processors belonging to the Athlon XP
architecture, including Durons, should have the same speed per MHz.
Cache size and bus speed don't matter.

CPU: AMD Athlon XP 2400+
Compiler: g++ (gcc version 3.3.3 20040412 (Red Hat Linux 3.3.3-7))
Flags: -O3 -march=athlon-xp -fexpensive-optimizations -funroll-loops
       --param max-unrolled-insns=500

grouping mode          speed (Mbit/s)  notes
---------------------------------------------------------------------
PARALLEL_32_4CHAR             14
PARALLEL_32_4CHARA            12
PARALLEL_32_INT              125       very good and very portable
PARALLEL_64_8CHAR             17
PARALLEL_64_8CHARA            15       needs a vectorizing compiler
PARALLEL_64_2INT              75       x86 has too few registers
PARALLEL_64_LONG              97       try this on x86_64
PARALLEL_64_MMX              165       the best
PARALLEL_128_16CHAR            6
PARALLEL_128_16CHARA           7
PARALLEL_128_4INT             69
PARALLEL_128_2LONG            52
PARALLEL_128_2MMX             36       slower than expected
PARALLEL_128_SSE             156       just slower than 64_MMX

Best speeds are obtained with native data types: int, mmx, sse (this
could be a compiler artifact).

64 bit processors should try 64_LONG.

Vectorizing compilers should like *CHARA.

64_MMX is faster than 128_SSE on the Athlon; perhaps SSE instructions
are internally split into 64 bit chunks. Could be different on x86_64 or
Intel processors.

128_SSE has a 64 bit (MMX) batch type because SSE has no shifting
instructions; they are only available in SSE2. As the Athlon XP doesn't
support SSE2, I couldn't experiment with that.