📄 s_tanhl.s
.file "tanhl.s"

// Copyright (c) 2001 - 2003, Intel Corporation
// All rights reserved.
//
// Contributed 2001 by the Intel Numerics Group, Intel Corporation
//
// Redistribution and use in source and binary forms, with or without
// modification, are permitted provided that the following conditions are
// met:
//
// * Redistributions of source code must retain the above copyright
// notice, this list of conditions and the following disclaimer.
//
// * Redistributions in binary form must reproduce the above copyright
// notice, this list of conditions and the following disclaimer in the
// documentation and/or other materials provided with the distribution.
//
// * The name of Intel Corporation may not be used to endorse or promote
// products derived from this software without specific prior written
// permission.
// THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
// "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
// LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
// A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL INTEL OR ITS
// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
// OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY OR TORT (INCLUDING
// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
//
// Intel Corporation is the author of this code, and requests that all
// problem reports or change requests be submitted to it directly at
// http://www.intel.com/software/products/opensource/libraries/num.htm.
//
// History
//==============================================================
// 11/29/01 Initial version
// 05/20/02 Cleaned up namespace and sf0 syntax
// 08/14/02 Changed mli templates to mlx
// 02/10/03 Reordered header: .section, .global, .proc, .align
//
// API
//==============================================================
// long double tanhl(long double)
//
// Overview of operation
//==============================================================
//
// Algorithm description
// ---------------------
//
// There are 4 paths:
//
// 1. Special path: x = 0, Inf, NaNs, denormals
//    Return tanhl(x) = +/-0.0 for zeros
//    Return tanhl(x) = QNaN for NaNs
//    Return tanhl(x) = sign(x)*1.0 for Inf
//    Return tanhl(x) = x + x^2 for - denormals
//    Return tanhl(x) = x - x^2 for + denormals
//
// 2. [0;1/8] path: 0.0 < |x| < 1/8
//    Return tanhl(x) = x + x^3*A3 + ... + x^15*A15
//
// 3. Main path: 1/8 <= |x| < 22.8
//    For several ranges of 1/8 <= |x| < 22.8
//    Return tanhl(x) = sign(x)*((A0H+A0L) + y*(A1H+A1L) + y^2*(A2H+A2L)
//                      + y^3*A3 + y^4*A4 + ... + y^25*A25 )
//    where y = (|x|/a) - b
//
//    For each range there is a particular set of coefficients.
//    Below is the list of ranges:
//     1/8  <= |x| < 1/4    a = 0.125, b = 1.5
//     1/4  <= |x| < 1/2    a = 0.25,  b = 1.5
//     1/2  <= |x| < 1.0    a = 0.5,   b = 1.5
//     1.0  <= |x| < 2.0    a = 1.0,   b = 1.5
//     2.0  <= |x| < 3.25   a = 2.0,   b = 1.5
//     3.25 <= |x| < 4.0    a = 2.0,   b = 2.0
//     4.0  <= |x| < 6.5    a = 4.0,   b = 1.5
//     6.5  <= |x| < 8.0    a = 4.0,   b = 2.0
//     8.0  <= |x| < 13.0   a = 8.0,   b = 1.5
//    13.0  <= |x| < 16.0   a = 8.0,   b = 2.0
//    16.0  <= |x| < 22.8   a = 16.0,  b = 1.5
//    ( the [3.25;4.0], [6.5;8.0], [13.0;16.0] subranges are separated
//      to resolve monotonicity issues )
//
// 4. Saturation path: 22.8 <= |x| < +INF
//    Return tanhl(x) = sign(x)*(1.0 - tiny_value)
//    (tiny_value ~ 1e-1233)
//
// Implementation notes
// --------------------
//
// 1. Special path: x = 0, INF, NaNs, denormals
//
//    This branch is cut off by one fclass operation.
//    Then zeros+NaNs, infinities and denormals are processed separately.
//    For denormals we use a simple fma operation x+x*x (- for +denorms)
//
// 2. [0;1/8] path: 0.0 < |x| < 1/8
//
//    Here we use simple polynomial computations, where the last step
//    is performed as x + x^3*A3+...
//    The rest of the polynomial is factorized using the binary tree technique.
//
// 3. Main path: 1/8 <= |x| < 22.8
//
//    Multiprecision has to be used only for the first few
//    polynomial iterations (up to the 3rd degree of x)
//    Here we use the same parallelisation approach as above:
//    Split the whole polynomial into a first, "multiprecision" part, and a
//    second, so called "tail", native precision part.
//
//    1) Multiprecision part:
//       [v1=(A0H+A0L)+y*(A1H+A1L)] + [v2=y^2*((A2H+A2L)+y*A3)]
//       v1 and v2 terms are calculated in parallel
//
//    2) Tail part:
//       v3 = x^4 * ( A4 + x*A5 + ... + x^21*A25 )
//       v3 is split into 2 even parts (10 coefficients in each one).
//       These 2 parts are also factorized using the binary tree technique.
//
//    So the cost of the Multiprecision and Tail parts is almost the same
//    and we have both results ready before the final summation.
//
//    Some tricks were applied to maintain symmetry in directed
//    rounding modes (to +/-inf). We had to set the result sign
//    not at the last operation but much earlier and in
//    several places.
//
// 4. Saturation path: 22.8 <= |x| < +INF
//
//    We use the formula sign(x)*(1.0 - tiny_value) instead of simple sign(x)*1.0
//    just to meet IEEE requirements for different rounding modes in this case.
//
// Registers used
//==============================================================
// Floating Point registers used:
// f8 - input & output
// f32 -> f92
// General registers used:
// r2, r3, r32 -> r52
// Predicate registers used:
// p0, p6 -> p11, p14, p15
// p6  - arg is zero, denormal or special IEEE
// p7  - arg is in [16;32] binary interval
// p8  - arg is in one of subranges
//       [3.25;4.0], [6.5;8.0], [13.0;16.0]
// p9  - arg < 1/8
// p10 - arg is NOT in one of subranges
//       [3.25;4.0], [6.5;8.0], [13.0;16.0]
// p11 - arg in saturation domain
// p14 - arg is positive
// p15 - arg is negative

// Assembly macros
//==============================================================
rDataPtr = r2
rTailDataPtr = r3
rBias = r33
rSignBit = r34
rInterval = r35
rArgExp = r36
rArgSig = r37
r3p25Offset = r38
r2to4 = r39
r1p25 = r40
rOffset = r41
r1p5 = r42
rSaturation = r43
r1625Sign = r44
rTiny = r45
rAddr1 = r46
rAddr2 = r47
rTailAddr1 = r48
rTailAddr2 = r49
rTailOffset = r50
rTailAddOffset = r51
rShiftedDataPtr = r52
//==============================================================
fA0H = f32
fA0L = f33
fA1H = f34
fA1L = f35
fA2H = f36
fA2L = f37
fA3 = f38
fA4 = f39
fA5 = f40
fA6 = f41
fA7 = f42
fA8 = f43
fA9 = f44
fA10 = f45
fA11 = f46
fA12 = f47
fA13 = f48
fA14 = f49
fA15 = f50
fA16 = f51
fA17 = f52
fA18 = f53
fA19 = f54
fA20 = f55
fA21 = f56
fA22 = f57
fA23 = f58
fA24 = f59
fA25 = f60
fArgSqr = f61
fArgCube = f62
fArgFour = f63
fArgEight = f64
fArgAbsNorm = f65
fArgAbsNorm2 = f66
fArgAbsNorm2L = f67
fArgAbsNorm3 = f68
fArgAbsNorm4 = f69
fArgAbsNorm11 = f70
fRes = f71
fResH = f72
fResL = f73
fRes1H = f74
fRes1L = f75
fRes1Hd = f76
fRes2H = f77
fRes2L = f78
fRes3H = f79
fRes3L = f80
fRes4 = f81
fTT = f82
fTH = f83
fTL = f84
fTT2 = f85
fTH2 = f86
fTL2 = f87
f1p5 = f88
f2p0 = f89
fTiny = f90
fSignumX = f91
fArgAbsNorm4X = f92

// Data tables
//==============================================================
RODATA
.align 16

LOCAL_OBJECT_START(tanhl_data)

////////// Main tables ///////////
_0p125_to_0p25_data: // exp = 2^-3
// Polynomial coefficients for the tanh(x), 1/8 <= |x| < 1/4
data8 0x93D27D6AE7E835F8, 0x0000BFF4 //A3 = -5.6389704216278164626050408239e-04
data8 0xBF66E8668A78A8BC //A2H = -2.7963640930198357253955165902e-03
data8 0xBBD5384EFD0E7A54 //A2L = -1.7974001252014762983581666453e-20
data8 0x3FBEE69E31DB6156 //A1H = 1.2070645062647619716322822114e-01
data8 0x3C43A0B4E24A3DCA //A1L = 2.1280460108882061756490131241e-18
data8 0x3FC7B8FF903BF776 //A0H = 1.8533319990813951205765874874e-01
data8 0x3C593F1A61986FD4 //A0L = 5.4744612262799573374268254539e-18
data8 0xDB9E6735560AAE5A, 0x0000BFA3 //A25 = -3.4649731131719154051239475238e-28
data8 0xF0DDE953E4327704, 0x00003FA4 //A24 = 7.6004173864565644629900702857e-28
data8 0x8532AED11DEC5612, 0x00003FAB //A23 = 5.3798235684551098715428515761e-26
data8 0xAEF72A34D88B0038, 0x0000BFAD //A22 = -2.8267199091484508912273222600e-25
data8 0x9645EF1DCB759DDD, 0x0000BFB2 //A21 = -7.7689413112830095709522203109e-24
data8 0xA5D12364E121F70F, 0x00003FB5 //A20 = 6.8580281614531622113161030550e-23
data8 0x9CF166EA815AC705, 0x00003FB9 //A19 = 1.0385615003184753213024737634e-21
data8 0x852B1D0252498752, 0x0000BFBD //A18 = -1.4099753997949827217635356478e-20
data8 0x9270F5716D25EC9F, 0x0000BFC0 //A17 = -1.2404055949090177751123473821e-19
data8 0xC216A9C4EEBDDDCA, 0x00003FC4 //A16 = 2.6303900460415782677749729120e-18
data8 0xDCE944D89FF592F2, 0x00003FC6 //A15 = 1.1975620514752377092265425941e-17
data8 0x83C8DDF213711381, 0x0000BFCC //A14 = -4.5721980583985311263109531319e-16
LOCAL_OBJECT_END(tanhl_data)

LOCAL_OBJECT_START(_0p25_to_0p5_data)
// Polynomial coefficients for the tanh(x), 1/4 <= |x| < 1/2
data8 0xB6E27B747C47C8AD, 0x0000BFF6 //A3 = -2.7905990032063258105302045572e-03
data8 0xBF93FD54E226F8F7 //A2H = -1.9521070769536099515084615064e-02
data8 0xBC491BC884F6F18A //A2L = -2.7222721075104525371410300625e-18
data8 0x3FCBE3FBB015A591 //A1H = 2.1789499376181400980279079249e-01
data8 0x3C76AFC2D1AE35F7 //A1L = 1.9677459707672596091076696742e-17
data8 0x3FD6EF53DE8C8FAF //A0H = 3.5835739835078589399230963863e-01
data8 0x3C8E2A1C14355F9D //A0L = 5.2327050592919416045278607775e-17
data8 0xF56D363AAE3BAD53, 0x00003FBB //A25 = 6.4963882412697389947564301120e-21
data8 0xAD6348526CEEB897, 0x0000BFBD //A24 = -1.8358149767147407353343152624e-20
data8 0x85D96A988565FD65, 0x0000BFC1 //A23 = -2.2674950494950919052759556703e-19
data8 0xD52CAF6B1E4D9717, 0x00003FC3 //A22 = 1.4445269502644677106995571101e-18
data8 0xBD7E1BE5CBEF7A01, 0x00003FC5 //A21 = 5.1362075721080004718090799595e-18
data8 0xAE84A9B12ADD6948, 0x0000BFC9 //A20 = -7.5685210830925426342786733068e-17
data8 0xEAC2D5FCF80E250C, 0x00003FC6 //A19 = 1.2726423522879522181100392135e-17
data8 0xE0D2A8AC8C2EDB95, 0x00003FCE //A18 = 3.1200443098733419749016380203e-15
data8 0xB22F0AB7B417F78E, 0x0000BFD0 //A17 = -9.8911854977385933809488291835e-15
data8 0xE25A627BAEFFA7A4, 0x0000BFD3 //A16 = -1.0052095388666003876301743498e-13
data8 0xC90F32EC4A17F908, 0x00003FD6 //A15 = 7.1430637679768183097897337145e-13
data8 0x905F6F124AF956B1, 0x00003FD8 //A14 = 2.0516607231389483452611375485e-12
LOCAL_OBJECT_END(_0p25_to_0p5_data)

LOCAL_OBJECT_START(_0p5_to_1_data)
// Polynomial coefficients for the tanh(x), 1/2 <= |x| < 1
data8 0xAB402BE491EE72A7, 0x00003FF7 //A3 = 5.2261556931080934657023772945e-03
data8 0xBFB8403D3DDA87BE //A2H = -9.4730212784752659826992271519e-02
data8 0xBC6FF7BC2AB71A8B //A2L = -1.3863786398568460929625760740e-17
data8 0x3FD3173B1EFA6EF4 //A1H = 2.9829290414066567116435635398e-01
data8 0x3C881E4DCABDE840 //A1L = 4.1838710466827119847963316219e-17
data8 0x3FE45323E552F228 //A0H = 6.3514895238728730220145735075e-01
data8 0x3C739D5832BF7BCF //A0L = 1.7012977006567066423682445459e-17
data8 0xF153980BECD8AE12, 0x00003FD0 //A25 = 1.3396313991261493342597057700e-14
data8 0xEC9ACCD245368129, 0x0000BFD3 //A24 = -1.0507358886349528807350792383e-13
data8 0x8AE6498CA36D2D1A, 0x00003FD4 //A23 = 1.2336759149738309660361813001e-13
data8 0x8DF02FBF5AC70E64, 0x00003FD7 //A22 = 1.0085317723615282268326194551e-12
data8 0x9E15C7125DA204EE, 0x0000BFD9 //A21 = -4.4930478919612724261941857560e-12
data8 0xA62C6F39BDDCEC1C, 0x00003FD7 //A20 = 1.1807342457875095150035780314e-12
data8 0xDFD8D65D30F80F52, 0x00003FDC //A19 = 5.0896919887121116317817665996e-11
data8 0xB795AFFD458F743E, 0x0000BFDE //A18 = -1.6696932710534097241291327756e-10
data8 0xFEF30234CB01EC89, 0x0000BFDD //A17 = -1.1593749714588103589483091370e-10
data8 0xA2F638356E13761E, 0x00003FE2 //A16 = 2.3714062288761887457674853605e-09
data8 0xC429CC0D031E4FD5, 0x0000BFE3 //A15 = -5.7091025466377379046489586383e-09
data8 0xC78363FF929EFF62, 0x0000BFE4 //A14 = -1.1613199289622686725595739572e-08
LOCAL_OBJECT_END(_0p5_to_1_data)

LOCAL_OBJECT_START(_1_to_2_data)
// Polynomial coefficients for the tanh(x), 1 <= |x| < 2.0
data8 0xB3D8FB48A548D99A, 0x00003FFB //A3 = 8.7816203264683800892441646129e-02
data8 0xBFC4EFBD8FB38E3B //A2H = -1.6356629864377389416141284073e-01
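
The range list in the header comment and the `data8` encodings of the A0H coefficients can be cross-checked outside the assembler. The following is a minimal Python sketch, not part of the original source: the names `RANGES`, `reduce_argument`, `bits_to_double` and `a0h_matches_center` are hypothetical, and the sketch works in binary64 only, so it ignores the A0L low parts and the 80-bit extended coefficients. It relies on one consequence of the stated formula: at y = 0, that is |x| = a*b, the main-path polynomial reduces to A0H+A0L, so A0H should be close to tanh(a*b).

```python
import math
import struct

# (lower bound, upper bound, a, b), copied from the range list in the header comment.
RANGES = [
    (0.125, 0.25, 0.125, 1.5),
    (0.25, 0.5, 0.25, 1.5),
    (0.5, 1.0, 0.5, 1.5),
    (1.0, 2.0, 1.0, 1.5),
    (2.0, 3.25, 2.0, 1.5),
    (3.25, 4.0, 2.0, 2.0),
    (4.0, 6.5, 4.0, 1.5),
    (6.5, 8.0, 4.0, 2.0),
    (8.0, 13.0, 8.0, 1.5),
    (13.0, 16.0, 8.0, 2.0),
    (16.0, 22.8, 16.0, 1.5),
]


def reduce_argument(x):
    """Return (a, b, y) with y = |x|/a - b for a main-path argument."""
    ax = abs(x)
    for lo, hi, a, b in RANGES:
        if lo <= ax < hi:
            return a, b, ax / a - b
    raise ValueError("x is outside the main path 1/8 <= |x| < 22.8")


def bits_to_double(bits):
    """Interpret a 64-bit integer (a single-word data8 entry) as IEEE-754 binary64."""
    return struct.unpack(">d", struct.pack(">Q", bits))[0]


def a0h_matches_center(bits, a, b, tol=1e-12):
    """Check that a decoded A0H is close to tanh(a*b), the value at y = 0."""
    return abs(bits_to_double(bits) - math.tanh(a * b)) < tol
```

For example, `reduce_argument(-3.5)` selects the 3.25 <= |x| < 4.0 range (a = 2.0, b = 2.0) and gives y = -0.25, and `a0h_matches_center(0x3FC7B8FF903BF776, 0.125, 1.5)` compares the first table's A0H against tanh(0.1875).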