Copyright 2000, 2001 Free Software Foundation, Inc.

This file is part of the GNU MP Library.

The GNU MP Library is free software; you can redistribute it and/or modify
it under the terms of the GNU Lesser General Public License as published by
the Free Software Foundation; either version 3 of the License, or (at your
option) any later version.

The GNU MP Library is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU Lesser General Public
License for more details.

You should have received a copy of the GNU Lesser General Public License
along with the GNU MP Library.  If not, see http://www.gnu.org/licenses/.




                      AMD K7 MPN SUBROUTINES


This directory contains code optimized for the AMD Athlon CPU.

The mmx subdirectory has routines using MMX instructions.  All Athlons have
MMX, the separate directory is just so that configure can omit it if the
assembler doesn't support MMX.



STATUS

Times for the loops, with all code and data in L1 cache.

                               cycles/limb
        mpn_add/sub_n             1.6

        mpn_copyi                 0.75 or 1.0   \ varying with data alignment
        mpn_copyd                 0.75 or 1.0   /

        mpn_divrem_1             17.0 integer part, 15.0 fractional part
        mpn_mod_1                17.0
        mpn_divexact_by3          8.0

        mpn_l/rshift              1.2

        mpn_mul_1                 3.4
        mpn_addmul/submul_1       3.9

        mpn_mul_basecase          4.42 cycles/crossproduct (approx)
        mpn_sqr_basecase          2.3 cycles/crossproduct (approx)
                                  or 4.55 cycles/triangleproduct (approx)

Prefetching of sources hasn't yet been tried.



NOTES

cmov, MMX, 3DNow and some extensions to MMX and 3DNow are available.

Write-allocate L1 data cache means prefetching of destinations is unnecessary.

Floating point multiplications can be done in parallel with integer
multiplications, but there doesn't seem to be any way to make use of this.

Unsigned "mul"s can be issued every 3 cycles.  This suggests 3 is a limit on
the speed of the multiplication routines.
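
As a rough illustration of how the STATUS figures relate to the 3 cycle mul
issue rate, the C sketch below turns them into cycle estimates.  The
function and macro names are invented here and are not part of GMP.  The
estimates ignore call overhead and assume everything is in L1 cache, as the
table does, and the square estimate assumes an n-limb square is counted as
n*n crossproducts, which is an interpretation of the table rather than
something it states.

```c
#include <assert.h>

/* Figures copied from the STATUS table.  All names are hypothetical. */
#define K7_MUL_1_CPL         3.4    /* mpn_mul_1, cycles/limb */
#define K7_MUL_BASECASE_CPX  4.42   /* mpn_mul_basecase, cycles/crossproduct */
#define K7_SQR_BASECASE_CPX  2.3    /* mpn_sqr_basecase, cycles/crossproduct */
#define K7_MUL_ISSUE         3.0    /* one unsigned mul every 3 cycles */

/* Estimated loop cycles for mpn_mul_1 on n limbs. */
double est_mul_1(unsigned long n)
{
    return K7_MUL_1_CPL * (double) n;
}

/* Estimated cycles for mpn_mul_basecase, un limbs times vn limbs. */
double est_mul_basecase(unsigned long un, unsigned long vn)
{
    return K7_MUL_BASECASE_CPX * (double) un * (double) vn;
}

/* Estimated cycles for mpn_sqr_basecase, assuming n*n crossproducts. */
double est_sqr_basecase(unsigned long n)
{
    return K7_SQR_BASECASE_CPX * (double) n * (double) n;
}

/* Fraction of the mul issue rate reached by a routine that spends
   cycles_per_product cycles on each limb product. */
double mul_issue_fraction(double cycles_per_product)
{
    assert(cycles_per_product > 0.0);
    return K7_MUL_ISSUE / cycles_per_product;
}
```

With these figures a 10x10 limb mpn_mul_basecase comes to about 442 cycles
and a 10 limb square to about 230, and mul_issue_fraction(3.4) is about
0.88, so mpn_mul_1 is already fairly close to the 3 cycle limit.
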
The documentation shows mul executing in IEU0 (or maybe in IEU0 and IEU1
together), so it might be that, to get near 3 cycles code has to be arranged
so that nothing else is issued to IEU0.  A busy IEU0 could explain why some
code takes 4 cycles and other apparently equivalent code takes 5.



OPTIMIZATIONS

Unrolled loops are used to reduce looping overhead.  The unrolling is
configurable up to 32 limbs/loop for most routines and up to 64 for some.
The K7 has 64k L1 code cache so quite big unrolling is allowable.

Computed jumps into the unrolling are used to handle sizes not a multiple of
the unrolling.  An attractive feature of this is that times increase
smoothly with operand size, but it may be that some routines should just
have simple loops to finish up, especially when PIC adds between 2 and 16
cycles to get %eip.

Position independent code is implemented using a call to get %eip for the
computed jumps and a ret is always done, rather than an addl $4,%esp or a
popl, so the CPU return address branch prediction stack stays synchronised
with the actual stack in memory.

Branch prediction, in absence of any history, will guess forward jumps are
not taken and backward jumps are taken.  Where possible it's arranged that
the less likely or less important case is under a taken forward jump.



CODING

Instructions in general code have been shown grouped if they can execute
together, which means up to three direct-path instructions which have no
successive dependencies.
K7 always decodes three and has out-of-order execution, but the groupings
show what slots might be available and what dependency chains exist.

When there's vector-path instructions an effort is made to get triplets of
direct-path instructions in between them, even if there's dependencies,
since this maximizes decoding throughput and might save a cycle or two if
decoding is the limiting factor.



INSTRUCTIONS

adcl       direct
divl       39 cycles back-to-back
lodsl,etc  vector
loop       1 cycle vector (decl/jnz opens up one decode slot)
movd reg   vector
movd mem   direct
mull       issue every 3 cycles, latency 4 cycles low word, 6 cycles high word
popl       vector (use movl for more than one pop)
pushl      direct, will pair with a load
shrdl %cl  vector, 3 cycles, seems to be 3 decode too
xorl r,r   false read dependency recognised



REFERENCES

"AMD Athlon Processor X86 Code Optimization Guide", AMD publication number
22007, revision K, February 2002.  Available on-line,

http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/22007.pdf

"3DNow Technology Manual", AMD publication number 21928G/0-March 2000.
This describes the femms and prefetch instructions.  Available on-line,

http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/21928.pdf

"AMD Extensions to the 3DNow and MMX Instruction Sets Manual", AMD
publication number 22466, revision D, March 2000.  This describes
instructions added in the Athlon processor, such as pswapd and the extra
prefetch forms.  Available on-line,

http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/22466.pdf

"3DNow Instruction Porting Guide", AMD publication number 22621, revision B,
August 1999.  This has some notes on general Athlon optimizations as well as
3DNow.  Available on-line,

http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/22621.pdf



----------------
Local variables:
mode: text
fill-column: 76
End:
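
The computed jumps described under OPTIMIZATIONS can be pictured in C with a
switch that enters an unrolled loop part-way through its body (the construct
often called Duff's device).  This is only a sketch of the idea, not the
code in this directory: the real routines are assembly, and limb_t, UNROLL
and copy_unrolled are names invented here for illustration.  An unrolling of
8 is assumed to keep the example short.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical limb type for illustration; GMP's own type is mp_limb_t. */
typedef unsigned long limb_t;

#define UNROLL 8   /* illustrative only; must match the case labels below */

/* Copy n limbs from src to dst, low to high.  The switch jumps into the
   middle of the unrolled loop body so that the first, partial pass handles
   the n % UNROLL leftover limbs and every later pass copies UNROLL limbs. */
void copy_unrolled(limb_t *dst, const limb_t *src, size_t n)
{
    if (n == 0)
        return;
    assert(dst != NULL && src != NULL);

    /* Number of passes through the loop, counting the partial first one. */
    size_t passes = (n + UNROLL - 1) / UNROLL;

    switch (n % UNROLL) {           /* the "computed jump" */
    case 0: do { *dst++ = *src++;
    case 7:      *dst++ = *src++;
    case 6:      *dst++ = *src++;
    case 5:      *dst++ = *src++;
    case 4:      *dst++ = *src++;
    case 3:      *dst++ = *src++;
    case 2:      *dst++ = *src++;
    case 1:      *dst++ = *src++;
            } while (--passes > 0);
    }
}
```

There is no separate clean-up loop, so each extra limb costs one more copy
statement, which is the smooth increase in time with operand size mentioned
above.  The price is the jump computation itself, which in the assembly
versions includes the cost of getting %eip when PIC is in use.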