📄 rc4-x86_64.pl

📁 OpenSSL source code, useful for improving your ability to program with it. It is a good tool.
💻 PL
#!/usr/bin/env perl
#
# ====================================================================
# Written by Andy Polyakov <appro@fy.chalmers.se> for the OpenSSL
# project. Rights for redistribution and usage in source and binary
# forms are granted according to the OpenSSL license.
# ====================================================================
#
# 2.22x RC4 tune-up:-) It should be noted though that my hand [as in
# "hand-coded assembler"] doesn't stand for the whole improvement
# coefficient. It turned out that eliminating RC4_CHAR from config
# line results in ~40% improvement (yes, even for C implementation).
# Presumably it has everything to do with AMD cache architecture and
# RAW or whatever penalties. Once again! The module *requires* config
# line *without* RC4_CHAR! As for coding "secret," I bet on partial
# register arithmetics. For example instead of 'inc %r8; and $255,%r8'
# I simply 'inc %r8b'. Even though optimization manual discourages
# to operate on partial registers, it turned out to be the best bet.
# At least for AMD... How IA32E would perform remains to be seen...

# As was shown by Marc Bevand reordering of couple of load operations
# results in even higher performance gain of 3.3x:-) At least on
# Opteron... For reference, 1x in this case is RC4_CHAR C-code
# compiled with gcc 3.3.2, which performs at ~54MBps per 1GHz clock.
# Latter means that if you want to *estimate* what to expect from
# *your* Opteron, then multiply 54 by 3.3 and clock frequency in GHz.

# Intel P4 EM64T core was found to run the AMD64 code really slow...
# The only way to achieve comparable performance on P4 was to keep
# RC4_CHAR. Kind of ironic, huh? As it's apparently impossible to
# compose blended code, which would perform even within 30% marginal
# on either AMD and Intel platforms, I implement both cases. See
# rc4_skey.c for further details...

# P4 EM64T core appears to be "allergic" to 64-bit inc/dec. Replacing
# those with add/sub results in 50% performance improvement of folded
# loop...

# As was shown by Zou Nanhai loop unrolling can improve Intel EM64T
# performance by >30% [unlike P4 32-bit case that is]. But this is
# provided that loads are reordered even more aggressively! Both code
# pathes, AMD64 and EM64T, reorder loads in essentially same manner
# as my IA-64 implementation. On Opteron this resulted in modest 5%
# improvement [I had to test it], while final Intel P4 performance
# achieves respectful 432MBps on 2.8GHz processor now. For reference.
# If executed on Xeon, current RC4_CHAR code-path is 2.7x faster than
# RC4_INT code-path. While if executed on Opteron, it's only 25%
# slower than the RC4_INT one [meaning that if CPU