Copyright 1996, 1999, 2000, 2001, 2003 Free Software Foundation, Inc.

This file is part of the GNU MP Library.

The GNU MP Library is free software; you can redistribute it and/or modify
it under the terms of the GNU Lesser General Public License as published by
the Free Software Foundation; either version 3 of the License, or (at your
option) any later version.

The GNU MP Library is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU Lesser General Public
License for more details.

You should have received a copy of the GNU Lesser General Public License
along with the GNU MP Library.  If not, see http://www.gnu.org/licenses/.





                   INTEL PENTIUM P5 MPN SUBROUTINES


This directory contains mpn functions optimized for Intel Pentium (P5,P54)
processors.  The mmx subdirectory has additional code for Pentium with MMX
(P55).


STATUS

                                cycles/limb

        mpn_add_n/sub_n            2.375

        mpn_mul_1                 12.0
        mpn_add/submul_1          14.0

        mpn_mul_basecase          14.2 cycles/crossproduct (approx)

        mpn_sqr_basecase           8 cycles/crossproduct (approx)
                                   or 15.5 cycles/triangleproduct (approx)

        mpn_l/rshift               5.375 normal (6.0 on P54)
                                   1.875 special shift by 1 bit

        mpn_divrem_1              44.0
        mpn_mod_1                 28.0
        mpn_divexact_by3          15.0

        mpn_copyi/copyd            1.0

Pentium MMX gets the following improvements

        mpn_l/rshift               1.75

        mpn_mul_1                 12.0 normal, 7.0 for 16-bit multiplier


mpn_add_n and mpn_sub_n run at asymptotically 2 cycles/limb.  Due to loop
overhead and other delays (cache refill?), they run at or near 2.5
cycles/limb.

mpn_mul_1, mpn_addmul_1, mpn_submul_1 all run 1 cycle faster than they
should.  Intel documentation says a mul instruction is 10 cycles, but it
measures 9 and the routines using it run as if it were 9.



P55 MMX AND X87

The cost of switching between MMX and x87 floating point on P55 is about 100
cycles (fld1/por/emms for instance).  To avoid that cost the two aren't
mixed, and currently that means using MMX and not x87.

MMX offers a big speedup for lshift and rshift, and a nice speedup for
16-bit multipliers in mpn_mul_1.  If fast code using x87 is found then
perhaps the preference for MMX will be reversed.




P54 SHLDL

mpn_lshift and mpn_rshift run at about 6 cycles/limb on P5 and P54, but the
documentation indicates that they should take only 43/8 = 5.375 cycles/limb,
or 5 cycles/limb asymptotically.  The P55 runs them at the expected speed.

It seems that on P54 a shldl or shrdl allows pairing in one following cycle,
but not two.  For example, back to back repetitions of the following

        shldl(  %cl, %eax, %ebx)
        xorl    %edx, %edx
        xorl    %esi, %esi

run at 5 cycles, as expected, but repetitions of the following run at 7
cycles, whereas 6 would be expected (and is achieved on P55),

        shldl(  %cl, %eax, %ebx)
        xorl    %edx, %edx
        xorl    %esi, %esi
        xorl    %edi, %edi
        xorl    %ebp, %ebp

Three xorls run at 7 cycles too, so it doesn't seem to be just that pairing
is inhibited only in the second following cycle (or something like that).

Avoiding this problem would bring P54 shifts down from 6.0 c/l to 5.5 with a
pattern of shift, 2 loads, shift, 2 stores, shift, etc.  A start has been
made on something like that, but it's not yet complete.




OTHER NOTES

Prefetching Destinations

    Pentium doesn't allocate cache lines on writes, unlike most other modern
    processors.  Since the functions in the mpn class do array writes, we
    have to handle allocating the destination cache lines by reading a word
    from it in the loops, to achieve the best performance.

Prefetching Sources

    Prefetching of sources is pointless since there are no out-of-order
    loads.  Any load instruction blocks until the line is brought to L1, so
    it may as well be the load that wants the data which blocks.

Data Cache Bank Clashes

    Pairing of memory operations requires that the two issued operations
    refer to different cache banks (i.e. different addresses modulo 32
    bytes).  The simplest way to ensure this is to read/write two words from
    the same object.  If we make operations on different objects, they might
    or might not be to the same cache bank.

PIC %eip Fetching

    A simple call $+5 and popl can be used to get %eip, there's no need to
    balance calls and returns since P5 doesn't have any return stack branch
    prediction.

Float Multiplies

    fmul is pairable and can be issued every 2 cycles (with a 4 cycle
    latency for data ready to use).  This is a lot better than integer mull
    or imull at 9 cycles non-pairing.  Unfortunately the advantage is
    quickly eaten away by needing to throw data through memory back to the
    integer registers to adjust for fild and fist being signed, and to do
    things like propagating carry bits.




REFERENCES

"Intel Architecture Optimization Manual", 1997, order number 242816.  This
is mostly about P5, the parts about P6 aren't relevant.  Available on-line:

        http://download.intel.com/design/PentiumII/manuals/242816.htm




----------------
Local variables:
mode: text
fill-column: 76
End:
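
EXAMPLE: CACHE BANK CHECK

The rule under "Data Cache Bank Clashes" can be sketched in C as below.
This is only an illustration, not code from GMP.  The names bank_of,
banks_clash and bank_self_check are hypothetical, and the 4-byte bank width
(8 banks per 32 bytes) is an assumption that this file does not state; it
only says banks repeat modulo 32 bytes.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption (not stated in the README): each bank is 4 bytes wide, so
   there are 8 banks in every 32-byte span. */
#define BANK_WIDTH  4u
#define BANK_SPAN  32u

/* Bank index of an address: its offset modulo 32 bytes, in 4-byte units. */
unsigned bank_of(uintptr_t addr)
{
    return (unsigned)((addr % BANK_SPAN) / BANK_WIDTH);
}

/* Nonzero when two accesses fall in the same bank, in which case the two
   memory operations would not be expected to pair. */
int banks_clash(uintptr_t a, uintptr_t b)
{
    return bank_of(a) == bank_of(b);
}

/* The two cases described in the text, using made-up addresses. */
void bank_self_check(void)
{
    /* Two adjacent words of one object: different banks. */
    assert(!banks_clash(0x1000, 0x1004));
    /* Words in different objects that are 32 bytes apart: same bank. */
    assert(banks_clash(0x1000, 0x1020));
}
```

Under that assumption, adjacent words of one object always land in
different banks, which matches the advice to read/write two words from the
same object.  For two unrelated objects the result depends on where each
happens to be placed.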
