POSITION INDEPENDENT CODE

Defining the symbol PIC in m4 processing selects SVR4 / ELF style position
independent code. This is necessary for shared libraries because they can
be mapped into different processes at different virtual addresses. Actually
relocations are allowed, but presumably pages with relocations aren't
shared, defeating the purpose of a shared library.

The use of the PLT adds a fixed cost to every function call, and the GOT
adds a cost to any function accessing global variables. These are small but
might be noticeable when working with small operands.

Calls from one library function to another don't need to go through the PLT,
since of course the call instruction uses a displacement, not an absolute
address, and the relative locations of object files are known when libgmp.so
is created. "ld -Bsymbolic" (or "gcc -Wl,-Bsymbolic") will resolve calls
this way, so that there's no jump through the PLT, but of course leaving
setups of the GOT address in %ebx that may be unnecessary.

The %ebx setup could be avoided in assembly if a separate option controlled
PIC for calls as opposed to computed jumps etc. But there's only ever
likely to be a handful of calls out of assembler, and getting the same
optimization for C intra-library calls would be more important. There seems
no easy way to tell gcc that certain functions can be called non-PIC, and
unfortunately many GMP functions use the global memory allocation variables,
so they need the GOT anyway. Object files with no global data references
and only intra-library calls could go into the library as non-PIC under
-Bsymbolic. Integrating this into libtool and automake is left as an
exercise for the reader.



GLOBAL OFFSET TABLE CODING

It's believed the magic _GLOBAL_OFFSET_TABLE_ used by code establishing the
address of the GOT should be written without a GSYM_PREFIX, ie.
that it's the same "_GLOBAL_OFFSET_TABLE_" on an underscore or
non-underscore system. Certainly this is true for instance of NetBSD 1.4
which is an underscore system but requires "_GLOBAL_OFFSET_TABLE_".

Old gas 1.92.3 which comes with FreeBSD 2.2.8 gets a segmentation fault when
asked to assemble the following,

    L1:
            addl    $_GLOBAL_OFFSET_TABLE_+[.-L1], %ebx

It seems that using the label in the same instruction it refers to is the
problem, since a nop in between works. But the simplest workaround is to
follow gcc and omit the +[.-L1] since it does nothing,

            addl    $_GLOBAL_OFFSET_TABLE_, %ebx

Current gas 2.10 generates incorrect object code when %eax is used in such a
construction (with or without +[.-L1]),

            addl    $_GLOBAL_OFFSET_TABLE_, %eax

The R_386_GOTPC gets a displacement of 2 rather than the 1 appropriate for
the 1 byte opcode of "addl $n,%eax". The best workaround is just to use any
other register, since then it's a two byte opcode+mod/rm. GCC for example
always uses %ebx (which is needed for calls through the PLT).

A similar problem occurs in an leal (again with or without a +[.-L1]),

            leal    _GLOBAL_OFFSET_TABLE_(%edi), %ebx

This time the R_386_GOTPC gets a displacement of 0 rather than the 2
appropriate for the opcode and mod/rm, making this form unusable.



SIMPLE LOOPS

The overheads in setting up for an unrolled loop can mean that at small
sizes a simple loop is faster. Making small sizes go fast is important,
even if it adds a cycle or two to bigger sizes. To this end various
routines choose between a simple loop and an unrolled loop according to
operand size. The path to the simple loop, or to special case code for
small sizes, is always as fast as possible.

Adding a simple loop requires a conditional jump to choose between the
simple and unrolled code.
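
As a rough illustration of that single conditional jump, here is a minimal
Python sketch of a limb copy that picks a path by operand size. It is not
taken from the GMP sources: the function names, the 4-way unrolling and the
THRESHOLD value are all assumptions made up for the example.

```python
# Illustrative sketch only. THRESHOLD and all function names are
# hypothetical; the real routines are x86 assembly.
THRESHOLD = 8  # assumed crossover size, in limbs


def copy_simple(dst, src, n):
    """One limb per iteration: no setup cost, suits small n."""
    for i in range(n):
        dst[i] = src[i]


def copy_unrolled(dst, src, n):
    """Four limbs per iteration, then a short loop for leftover limbs."""
    i = 0
    while i + 4 <= n:
        dst[i] = src[i]
        dst[i + 1] = src[i + 1]
        dst[i + 2] = src[i + 2]
        dst[i + 3] = src[i + 3]
        i += 4
    while i < n:
        dst[i] = src[i]
        i += 1


def copy_limbs(dst, src, n):
    """A single size test selects the simple or the unrolled path."""
    if n < THRESHOLD:
        copy_simple(dst, src, n)
    else:
        copy_unrolled(dst, src, n)
```

Both paths give the same result for every size. Only the setup cost and the
per-limb cost differ, which is why the cutoff has to be found by measurement.
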
The size of a branch misprediction penalty affects whether a simple loop is
worthwhile.

The convention is for an m4 definition UNROLL_THRESHOLD to set the crossover
point, with sizes < UNROLL_THRESHOLD using the simple loop, sizes >=
UNROLL_THRESHOLD using the unrolled loop. If position independent code adds
a couple of cycles to an unrolled loop setup, the threshold will vary with
PIC or non-PIC. Something like the following is typical.

            deflit(UNROLL_THRESHOLD, ifdef(`PIC',10,8))

There's no automated way to determine the threshold. Setting it to a small
value and then to a big value makes it possible to measure the simple and
unrolled loops each over a range of sizes, from which the crossover point
can be determined. Alternately, just adjust the threshold up or down until
there's no more speedups.



UNROLLED LOOP CODING

The x86 addressing modes allow a byte displacement of -128 to +127, making
it possible to access 256 bytes, which is 64 limbs, without adjusting
pointer registers within the loop. Dword sized displacements can be used
too, but they increase code size, and unrolling to 64 ought to be enough.

When unrolling to the full 64 limbs/loop, the limb at the top of the loop
will have a displacement of -128, so pointers have to have a corresponding
+128 added before entering the loop. When unrolling to 32 limbs/loop
displacements 0 to 127 can be used with 0 at the top of the loop and no
adjustment needed to the pointers.

Where 64 limbs/loop is supported, the +128 adjustment is done only when 64
limbs/loop is selected. Usually the gain in speed using 64 instead of 32 or
16 is small, so support for 64 limbs/loop is generally only for comparison.



COMPUTED JUMPS

When working from least significant limb to most significant limb (most
routines) the computed jump and pointer calculations in preparation for an
unrolled loop are as follows.

            S = operand size in limbs
            N = number of limbs per loop (UNROLL_COUNT)
            L = log2 of unrolling (UNROLL_LOG2)
            M = mask for unrolling (UNROLL_MASK)
            C = code bytes per limb in the loop
            B = bytes per limb (4 for x86)

            computed jump            (-S & M) * C + entrypoint
            subtract from pointers   (-S & M) * B
            initial loop counter     (S-1) >> L
            displacements            0 to B*(N-1)

The loop counter is decremented at the end of each loop, and the looping
stops when the decrement takes the counter to -1. The displacements are for
the addressing accessing each limb, eg. a load with "movl disp(%ebx), %eax".

Usually the multiply by "C" can be handled without an imul, using instead an
leal, or a shift and subtract.

When working from most significant to least significant limb (eg. mpn_lshift
and mpn_copyd), the calculations change as follows.

            add to pointers          (-S & M) * B
            displacements            0 to -B*(N-1)



OLD GAS 1.92.3

This version comes with FreeBSD 2.2.8 and has a couple of gremlins that
affect GMP code.

Firstly, an expression involving two forward references to labels comes out
as zero. For example,

            addl    $bar-foo, %eax
    foo:
            nop
    bar:

This should lead to "addl $1, %eax", but it comes out as "addl $0, %eax".
When only one forward reference is involved, it works correctly, as for
example,

    foo:
            addl    $bar-foo, %eax
            nop
    bar:

Secondly, an expression involving two labels can't be used as the
displacement for an leal. For example,

    foo:
            nop
    bar:
            leal    bar-foo(%eax,%ebx,8), %ecx

A slightly cryptic error is given, "Unimplemented segment type 0 in
parse_operand". When only one label is used it's ok, and the label can be a
forward reference too, as for example,

            leal    foo(%eax,%ebx,8), %ecx
            nop
    foo:

These problems only affect PIC computed jump calculations. The workarounds
are just to do an leal without a displacement and then an addl, and to make
sure the code is placed so that there's at most one forward reference in the
addl.



REFERENCES

"Intel Architecture Software Developer's Manual", volumes 1 to 3, 2001,
order numbers 245470, 245471 and 245472.
Available on-line,

            http://developer.intel.com/design/pentium4/manuals/245470.htm
            http://developer.intel.com/design/pentium4/manuals/245471.htm
            http://developer.intel.com/design/pentium4/manuals/245472.htm

"System V Application Binary Interface", Unix System Laboratories Inc, 1992,
published by Prentice Hall, ISBN 0-13-880410-9. And the "Intel386 Processor
Supplement", AT&T, 1991, ISBN 0-13-877689-X. These have details of calling
conventions and ELF shared library PIC coding. Versions of both available
on-line,

            http://www.sco.com/developer/devspecs

"Intel386 Family Binary Compatibility Specification 2", Intel Corporation,
published by McGraw-Hill, 1991, ISBN 0-07-031219-2. (Same as the above 386
ABI supplement.)



----------------
Local variables:
mode: text
fill-column: 76
End: