First up, let me say I don't like writing in assembler. It is not portable,
dependent on the particular CPU architecture release and is generally a pig
to debug and get right. Having said that, the x86 architecture is probably
the most important for speed, both because of the number of boxes out there
and because it appears to be the worst architecture to get good C compilers
for. So due to this, I have lowered myself to do assembler for the inner
DES routines in libdes :-).

The file to implement in assembler is des_enc.c. Replace the following
4 functions

	des_encrypt1(DES_LONG data[2],des_key_schedule ks, int encrypt);
	des_encrypt2(DES_LONG data[2],des_key_schedule ks, int encrypt);
	des_encrypt3(DES_LONG data[2],des_key_schedule ks1,ks2,ks3);
	des_decrypt3(DES_LONG data[2],des_key_schedule ks1,ks2,ks3);

They encrypt/decrypt the 64 bits held in 'data' using the 'ks' key
schedules. The only difference between the 4 functions is that
des_encrypt2() does not perform IP() or FP() on the data (this is an
optimization for when doing triple DES), and des_encrypt3() and
des_decrypt3() perform triple DES. The triple DES routines are in here
because it does make a big difference to have them located near the
des_encrypt2 function at link time.

Now as we all know, there are lots of different operating systems running
on x86 boxes, and unfortunately they normally try to make sure their
assembler formatting is not the same as other people's.
The 4 main formats I know of are

	Microsoft	Windows 95/Windows NT
	Elf		Includes Linux and FreeBSD(?).
	a.out		The older Linux.
	Solaris		Same as Elf but different comments :-(.

Now I was not overly keen to write 4 different copies of the same code,
so I wrote a few perl routines to output the correct assembler, given
a target assembler type.
This code is ugly and is just a hack.
The libraries are x86unix.pl and x86ms.pl.
des586.pl, des686.pl and des-som[23].pl are the programs to actually
generate the assembler.

So to generate elf assembler

	perl des-som3.pl elf >dx86-elf.s

For Windows 95/NT

	perl des-som2.pl win32 >win32.asm

[ update 4 Jan 1996 ]
I have added another way to do things.

	perl des-som3.pl cpp >dx86-cpp.s

generates a file that will be included by dx86unix.cpp when it is compiled.
To build for elf, a.out, solaris, bsdi etc,

	cc -E -DELF asm/dx86unix.cpp | as -o asm/dx86-elf.o
	cc -E -DSOL asm/dx86unix.cpp | as -o asm/dx86-sol.o
	cc -E -DOUT asm/dx86unix.cpp | as -o asm/dx86-out.o
	cc -E -DBSDI asm/dx86unix.cpp | as -o asm/dx86bsdi.o

This was done to cut down the number of files in the distribution.

Now the ugly part. I acquired my copy of Intel's
"Optimizations for Intel's 32-Bit Processors" and found a few interesting
things. First, the aim of the exercise is to 'extract' one byte at a time
from a word and do an array lookup. This involves getting the byte from
each of the 4 locations in the word, moving it to a new word and doing the
lookup. The most obvious way to do this is

	xor	eax,	eax				# clear word
	mov	al,	cl				# get low byte
	xor	edi,	DWORD PTR 0x100+des_SP[eax]	# xor in word
	mov	al,	ch				# get next byte
	xor	edi,	DWORD PTR 0x300+des_SP[eax]	# xor in word
	shr	ecx,	16

which seems ok. For the Pentium, this system appears to be the best.
One has to do instruction interleaving to keep both functional units
operating, but it is basically very efficient.

Now the crunch.
When a full register is used after a partial write, e.g.

	mov	al,	cl
	xor	edi,	DWORD PTR 0x100+des_SP[eax]

the cost is

	386	- 1 cycle stall
	486	- 1 cycle stall
	586	- 0 cycle stall
	686	- at least 7 cycle stall (page 22 of the above mentioned document).

So the technique that produces the best results on a Pentium, according to
the documentation, will produce hideous results on a Pentium Pro.

To get around this, des686.pl will generate code that is not as fast on
a Pentium, but should be very good on a Pentium Pro.

	mov	eax,	ecx				# copy word
	shr	ecx,	8				# line up next byte
	and	eax,	0fch				# mask byte
	xor	edi,	DWORD PTR 0x100+des_SP[eax]	# xor in array lookup
	mov	eax,	ecx				# get word
	shr	ecx,	8				# line up next byte
	and	eax,	0fch				# mask byte
	xor	edi,	DWORD PTR 0x300+des_SP[eax]	# xor in array lookup

Due to the execution units in the Pentium, this actually works quite well.
For a Pentium Pro it should be very good. This is the type of output
Visual C++ generates.

There is a third option. Instead of using

	mov	al,	ch

which is bad on the Pentium Pro, one may be able to use

	movzx	eax,	ch

which may not incur the partial write penalty. On the Pentium,
this instruction takes 4 cycles so is not worth using, but on the
Pentium Pro it appears it may be worthwhile. I need access to one to
experiment :-).

eric (20 Oct 1996)

22 Nov 1996 - I have asked people to run the 2 different versions on
Pentium Pros and it appears that the Intel documentation is wrong. The
mov al,bh is still faster on a Pentium Pro, so just use des586.pl
instead of des686.pl.

3 Dec 1996 - I added des_encrypt3/des_decrypt3 because I have moved these
functions into des_enc.c, since it does make a massive performance
difference on some boxes to have the functions' code located close to
the des_encrypt2() function.

9 Jan 1997 - des-som2.pl is now the correct perl script to use for
Pentiums. It contains an inner loop from
Svend Olaf Mikkelsen <svolaf@inet.uni-c.dk> which does raw ecb DES calls at
273,000 per second.
He had a previous version at 250,000 and the best
I was able to get was 203,000. The content has not changed; this is all
due to instruction sequencing (and the actual choice of instructions), which
is able to keep both functional units of the Pentium going.
We may have lost the ugly register usage restrictions when x86 went 32 bit,
but for the Pentium they have been replaced by evil instruction ordering
tricks.

13 Jan 1997 - des-som3.pl, more optimizations from Svend Olaf.
raw DES at 281,000 per second on a Pentium 100.
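
For readers who do not read x86 assembler, the two byte-extraction styles
discussed in this file (the partial-register form used by des586.pl and the
shift-and-mask form used by des686.pl) can be sketched in C roughly as
follows. This is only an illustration under stated assumptions: sp_demo,
its made-up contents and all the function names are hypothetical stand-ins,
not the real des_SP tables or any libdes code. It just shows that both
styles pick the same table entries for the low two bytes of a word.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for two des_SP sub-tables: 64 entries of 32 bits
 * each. The contents are arbitrary, NOT the real DES SP boxes. */
static uint32_t sp_demo[2][64];

static void fill_demo_tables(void)
{
    for (uint32_t t = 0; t < 2; t++)
        for (uint32_t i = 0; i < 64; i++)
            sp_demo[t][i] = 0x9e3779b9u * (t * 64u + i + 1u);
}

/* Style 1 (des586.pl flavour): read each byte of the word directly, the
 * way "mov al, cl" and "mov al, ch" do. The word is masked first so each
 * byte is a multiple of 4, i.e. a byte offset into a table of 32-bit
 * entries. */
static uint32_t lookup_bytes(uint32_t w)
{
    uint32_t acc = 0;
    w &= 0xfcfcfcfcu;                          /* keep bits 2..7 of each byte */
    uint8_t lo = (uint8_t)(w & 0xffu);         /* like: mov al, cl */
    uint8_t hi = (uint8_t)((w >> 8) & 0xffu);  /* like: mov al, ch */
    acc ^= sp_demo[0][lo / 4];                 /* byte offset -> entry index */
    acc ^= sp_demo[1][hi / 4];
    return acc;
}

/* Style 2 (des686.pl flavour): copy the whole word, shift the source to
 * line up the next byte, and mask the copy with 0xfc. No partial
 * register write is needed. */
static uint32_t lookup_shift_mask(uint32_t w)
{
    uint32_t acc = 0, t;
    t = w; w >>= 8; t &= 0xfcu;                /* mov / shr / and */
    acc ^= sp_demo[0][t / 4];
    t = w; w >>= 8; t &= 0xfcu;
    acc ^= sp_demo[1][t / 4];
    return acc;
}

/* Sweep every value of the low 16 bits (the only bits either function
 * looks at) and check that the two styles agree. */
static void check_styles_agree(void)
{
    for (uint32_t w = 0; w <= 0xffffu; w++) {
        uint32_t x = w | 0xabcd0000u;          /* high bits must be ignored */
        assert(lookup_bytes(x) == lookup_shift_mask(x));
    }
}
```

To try it, call fill_demo_tables() once and then check_styles_agree() from
a small main(). The sketch says nothing about speed; which form is faster
depends on the CPU, as described in the notes above.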