📄 libm_tan.s
字号:
.file "libm_tan.s"// Copyright (C) 2000, 2001, Intel Corporation// All rights reserved.// // Contributed 2/2/2000 by John Harrison, Ted Kubaska, Bob Norin, Shane Story,// and Ping Tak Peter Tang of the Computational Software Lab, Intel Corporation.//// Redistribution and use in source and binary forms, with or without// modification, are permitted provided that the following conditions are// met://// * Redistributions of source code must retain the above copyright// notice, this list of conditions and the following disclaimer.//// * Redistributions in binary form must reproduce the above copyright// notice, this list of conditions and the following disclaimer in the// documentation and/or other materials provided with the distribution.//// * The name of Intel Corporation may not be used to endorse or promote// products derived from this software without specific prior written// permission.//// THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS // "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT // LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR// A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL INTEL OR ITS // CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, // PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR // PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY // OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY OR TORT (INCLUDING// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS // SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. // // Intel Corporation is the author of this code, and requests that all// problem reports or change requests be submitted to it directly at // http://developer.intel.com/opensource.//// *********************************************************************//// History: // 02/02/00 Initial Version // 4/04/00 Unwind support added// 12/28/00 Fixed false invalid flags//// *********************************************************************//// Function: tan(x) = tangent(x), for double precision x values//// *********************************************************************//// Accuracy: Very accurate for double-precision values //// *********************************************************************//// Resources Used://// Floating-Point Registers: f8 (Input and Return Value)// f9-f15// f32-f112//// General Purpose Registers:// r32-r48// r49-r50 (Used to pass arguments to pi_by_2 reduce routine)//// Predicate Registers: p6-p15//// *********************************************************************//// IEEE Special Conditions://// Denormal fault raised on denormal inputs// Overflow exceptions do not occur// Underflow exceptions raised when appropriate for tan // (No specialized error handling for this routine)// Inexact raised when appropriate by algorithm//// tan(SNaN) = QNaN// tan(QNaN) = QNaN// tan(inf) = QNaN// tan(+/-0) = +/-0//// *********************************************************************//// Mathematical Description//// We consider the computation of FPTAN of Arg. Now, given//// Arg = N pi/2 + alpha, |alpha| <= pi/4,//// basic mathematical relationship shows that//// tan( Arg ) = tan( alpha ) if N is even;// = -cot( alpha ) otherwise.//// The value of alpha is obtained by argument reduction and// represented by two working precision numbers r and c where//// alpha = r + c accurately.//// The reduction method is described in a previous write up.// The argument reduction scheme identifies 4 cases. For Cases 2// and 4, because |alpha| is small, tan(r+c) and -cot(r+c) can be// computed very easily by 2 or 3 terms of the Taylor series// expansion as follows://// Case 2:// -------//// tan(r + c) = r + c + r^3/3 ...accurately// -cot(r + c) = -1/(r+c) + r/3 ...accurately//// Case 4:// -------//// tan(r + c) = r + c + r^3/3 + 2r^5/15 ...accurately// -cot(r + c) = -1/(r+c) + r/3 + r^3/45 ...accurately////// The only cases left are Cases 1 and 3 of the argument reduction// procedure. These two cases will be merged since after the// argument is reduced in either cases, we have the reduced argument// represented as r + c and that the magnitude |r + c| is not small// enough to allow the usage of a very short approximation.//// The greatest challenge of this task is that the second terms of// the Taylor series for tan(r) and -cot(r)//// r + r^3/3 + 2 r^5/15 + ...//// and//// -1/r + r/3 + r^3/45 + ...//// are not very small when |r| is close to pi/4 and the rounding// errors will be a concern if simple polynomial accumulation is// used. When |r| < 2^(-2), however, the second terms will be small// enough (5 bits or so of right shift) that a normal Horner// recurrence suffices. Hence there are two cases that we consider// in the accurate computation of tan(r) and cot(r), |r| <= pi/4.//// Case small_r: |r| < 2^(-2)// --------------------------//// Since Arg = N pi/4 + r + c accurately, we have//// tan(Arg) = tan(r+c) for N even,// = -cot(r+c) otherwise.//// Here for this case, both tan(r) and -cot(r) can be approximated// by simple polynomials://// tan(r) = r + P1_1 r^3 + P1_2 r^5 + ... + P1_9 r^19// -cot(r) = -1/r + Q1_1 r + Q1_2 r^3 + ... + Q1_7 r^13//// accurately. Since |r| is relatively small, tan(r+c) and// -cot(r+c) can be accurately approximated by replacing r with// r+c only in the first two terms of the corresponding polynomials.//// Note that P1_1 (and Q1_1 for that matter) approximates 1/3 to// almost 64 sig. bits, thus//// P1_1 (r+c)^3 = P1_1 r^3 + c * r^2 accurately.//// Hence,//// tan(r+c) = r + P1_1 r^3 + P1_2 r^5 + ... + P1_9 r^19// + c*(1 + r^2)//// -cot(r+c) = -1/(r+c) + Q1_1 r + Q1_2 r^3 + ... + Q1_7 r^13// + Q1_1*c////// Case normal_r: 2^(-2) <= |r| <= pi/4// ------------------------------------//// This case is more likely than the previous one if one considers// r to be uniformly distributed in [-pi/4 pi/4].//// The required calculation is either//// tan(r + c) = tan(r) + correction, or// -cot(r + c) = -cot(r) + correction.//// Specifically,//// tan(r + c) = tan(r) + c tan'(r) + O(c^2)// = tan(r) + c sec^2(r) + O(c^2)// = tan(r) + c SEC_sq ...accurately// as long as SEC_sq approximates sec^2(r)// to, say, 5 bits or so.//// Similarly,//// -cot(r + c) = -cot(r) - c cot'(r) + O(c^2)// = -cot(r) + c csc^2(r) + O(c^2)// = -cot(r) + c CSC_sq ...accurately// as long as CSC_sq approximates csc^2(r)// to, say, 5 bits or so.//// We therefore concentrate on accurately calculating tan(r) and// cot(r) for a working-precision number r, |r| <= pi/4 to within// 0.1% or so.//// We will employ a table-driven approach. Let//// r = sgn_r * 2^k * 1.b_1 b_2 ... b_5 ... b_63// = sgn_r * ( B + x )//// where//// B = 2^k * 1.b_1 b_2 ... b_5 1// x = |r| - B//// Now,// tan(B) + tan(x)// tan( B + x ) = ------------------------// 1 - tan(B)*tan(x)//// / \ // | tan(B) + tan(x) |// = tan(B) + | ------------------------ - tan(B) |// | 1 - tan(B)*tan(x) |// \ ///// sec^2(B) * tan(x)// = tan(B) + ------------------------// 1 - tan(B)*tan(x)//// (1/[sin(B)*cos(B)]) * tan(x)// = tan(B) + --------------------------------// cot(B) - tan(x)////// Clearly, the values of tan(B), cot(B) and 1/(sin(B)*cos(B)) are// calculated beforehand and stored in a table. Since//// |x| <= 2^k * 2^(-6) <= 2^(-7) (because k = -1, -2)//// a very short polynomial will be sufficient to approximate tan(x)// accurately. The details involved in computing the last expression// will be given in the next section on algorithm description.////// Now, we turn to the case where cot( B + x ) is needed.////// 1 - tan(B)*tan(x)// cot( B + x ) = ------------------------// tan(B) + tan(x)//// / \ // | 1 - tan(B)*tan(x) |// = cot(B) + | ----------------------- - cot(B) |// | tan(B) + tan(x) |// \ ///// [tan(B) + cot(B)] * tan(x)// = cot(B) - ----------------------------// tan(B) + tan(x)//// (1/[sin(B)*cos(B)]) * tan(x)// = cot(B) - --------------------------------// tan(B) + tan(x)////// Note that the values of tan(B), cot(B) and 1/(sin(B)*cos(B)) that// are needed are the same set of values needed in the previous// case.//// Finally, we can put all the ingredients together as follows://// Arg = N * pi/2 + r + c ...accurately//// tan(Arg) = tan(r) + correction if N is even;// = -cot(r) + correction otherwise.//// For Cases 2 and 4,//// Case 2:// tan(Arg) = tan(r + c) = r + c + r^3/3 N even// = -cot(r + c) = -1/(r+c) + r/3 N odd// Case 4:// tan(Arg) = tan(r + c) = r + c + r^3/3 + 2r^5/15 N even// = -cot(r + c) = -1/(r+c) + r/3 + r^3/45 N odd////// For Cases 1 and 3,//// Case small_r: |r| < 2^(-2)//// tan(Arg) = r + P1_1 r^3 + P1_2 r^5 + ... + P1_9 r^19// + c*(1 + r^2) N even//// = -1/(r+c) + Q1_1 r + Q1_2 r^3 + ... + Q1_7 r^13// + Q1_1*c N odd//// Case normal_r: 2^(-2) <= |r| <= pi/4//// tan(Arg) = tan(r) + c * sec^2(r) N even// = -cot(r) + c * csc^2(r) otherwise//// For N even,//// tan(Arg) = tan(r) + c*sec^2(r)// = tan( sgn_r * (B+x) ) + c * sec^2(|r|)// = sgn_r * ( tan(B+x) + sgn_r*c*sec^2(|r|) )// = sgn_r * ( tan(B+x) + sgn_r*c*sec^2(B) )//// since B approximates |r| to 2^(-6) in relative accuracy.//// / (1/[sin(B)*cos(B)]) * tan(x)// tan(Arg) = sgn_r * | tan(B) + --------------------------------// \ cot(B) - tan(x)// \ // + CORR |// /// where//// CORR = sgn_r*c*tan(B)*SC_inv(B); SC_inv(B) = 1/(sin(B)*cos(B)).//// For N odd,//// tan(Arg) = -cot(r) + c*csc^2(r)// = -cot( sgn_r * (B+x) ) + c * csc^2(|r|)// = sgn_r * ( -cot(B+x) + sgn_r*c*csc^2(|r|) )// = sgn_r * ( -cot(B+x) + sgn_r*c*csc^2(B) )//// since B approximates |r| to 2^(-6) in relative accuracy.//// / (1/[sin(B)*cos(B)]) * tan(x)// tan(Arg) = sgn_r * | -cot(B) + --------------------------------// \ tan(B) + tan(x)// \ // + CORR |// /// where//// CORR = sgn_r*c*cot(B)*SC_inv(B); SC_inv(B) = 1/(sin(B)*cos(B)).////// The actual algorithm prescribes how all the mathematical formulas// are calculated.////// 2. Algorithmic Description// ==========================//// 2.1 Computation for Cases 2 and 4.// ----------------------------------//// For Case 2, we use two-term polynomials.//// For N even,//// rsq := r * r// Result := c + r * rsq * P1_1// Result := r + Result ...in user-defined rounding//// For N odd,// S_hi := -frcpa(r) ...8 bits// S_hi := S_hi + S_hi*(1 + S_hi*r) ...16 bits// S_hi := S_hi + S_hi*(1 + S_hi*r) ...32 bits// S_hi := S_hi + S_hi*(1 + S_hi*r) ...64 bits// S_lo := S_hi*( (1 + S_hi*r) + S_hi*c )// ...S_hi + S_lo is -1/(r+c) to extra precision// S_lo := S_lo + Q1_1*r//// Result := S_hi + S_lo ...in user-defined rounding//// For Case 4, we use three-term polynomials//// For N even,//// rsq := r * r// Result := c + r * rsq * (P1_1 + rsq * P1_2)// Result := r + Result ...in user-defined rounding//// For N odd,// S_hi := -frcpa(r) ...8 bits// S_hi := S_hi + S_hi*(1 + S_hi*r) ...16 bits// S_hi := S_hi + S_hi*(1 + S_hi*r) ...32 bits// S_hi := S_hi + S_hi*(1 + S_hi*r) ...64 bits// S_lo := S_hi*( (1 + S_hi*r) + S_hi*c )// ...S_hi + S_lo is -1/(r+c) to extra precision// rsq := r * r// P := Q1_1 + rsq*Q1_2// S_lo := S_lo + r*P//// Result := S_hi + S_lo ...in user-defined rounding////// Note that the coefficients P1_1, P1_2, Q1_1, and Q1_2 are// the same as those used in the small_r case of Cases 1 and 3// below.////// 2.2 Computation for Cases 1 and 3.// ----------------------------------// This is further divided into the case of small_r,// where |r| < 2^(-2), and the case of normal_r, where |r| lies between// 2^(-2) and pi/4.//// Algorithm for the case of small_r// ---------------------------------//// For N even,// rsq := r * r// Poly1 := rsq*(P1_1 + rsq*(P1_2 + rsq*P1_3))// r_to_the_8 := rsq * rsq// r_to_the_8 := r_to_the_8 * r_to_the_8// Poly2 := P1_4 + rsq*(P1_5 + rsq*(P1_6 + ... rsq*P1_9))// CORR := c * ( 1 + rsq )// Poly := Poly1 + r_to_the_8*Poly2// Result := r*Poly + CORR// Result := r + Result ...in user-defined rounding// ...note that Poly1 and r_to_the_8 can be computed in parallel// ...with Poly2 (Poly1 is intentionally set to be much// ...shorter than Poly2 so that r_to_the_8 and CORR can be hidden)//// For N odd,// S_hi := -frcpa(r) ...8 bits// S_hi := S_hi + S_hi*(1 + S_hi*r) ...16 bits// S_hi := S_hi + S_hi*(1 + S_hi*r) ...32 bits// S_hi := S_hi + S_hi*(1 + S_hi*r) ...64 bits// S_lo := S_hi*( (1 + S_hi*r) + S_hi*c )// ...S_hi + S_lo is -1/(r+c) to extra precision// S_lo := S_lo + Q1_1*c//// ...S_hi and S_lo are computed in parallel with// ...the following// rsq := r*r// P := Q1_1 + rsq*(Q1_2 + rsq*(Q1_3 + ... + rsq*Q1_7))//// Result := r*P + S_lo// Result := S_hi + Result ...in user-defined rounding////// Algorithm for the case of normal_r// ----------------------------------//// Here, we first consider the computation of tan( r + c ). As// presented in the previous section,//// tan( r + c ) = tan(r) + c * sec^2(r)// = sgn_r * [ tan(B+x) + CORR ]// CORR = sgn_r * c * tan(B) * 1/[sin(B)*cos(B)]//// because sec^2(r) = sec^(|r|), and B approximate |r| to 6.5 bits.//// tan( r + c ) =// / (1/[sin(B)*cos(B)]) * tan(x)// sgn_r * | tan(B) + -------------------------------- +// \ cot(B) - tan(x)// \ // CORR |// ///// The values of tan(B), cot(B) and 1/(sin(B)*cos(B)) are// calculated beforehand and stored in a table. Specifically,// the table values are//// tan(B) as T_hi + T_lo;// cot(B) as C_hi + C_lo;// 1/[sin(B)*cos(B)] as SC_inv//// T_hi, C_hi are in double-precision memory format;// T_lo, C_lo are in single-precision memory format;// SC_inv is in extended-precision memory format.//// The value of tan(x) will be approximated by a short polynomial of// the form//// tan(x) as x + x * P, where// P = x^2 * (P2_1 + x^2 * (P2_2 + x^2 * P2_3))//// Because |x| <= 2^(-7), cot(B) - x approximates cot(B) - tan(x)// to a relative accuracy better than 2^(-20). Thus, a good// initial guess of 1/( cot(B) - tan(x) ) to initiate the iterative// division is://// 1/(cot(B) - tan(x)) is approximately// 1/(cot(B) - x) is// tan(B)/(1 - x*tan(B)) is approximately// T_hi / ( 1 - T_hi * x ) is approximately//// T_hi * [ 1 + (Thi * x) + (T_hi * x)^2 ]//// The calculation of tan(r+c) therefore proceed as follows://// Tx := T_hi * x// xsq := x * x//// V_hi := T_hi*(1 + Tx*(1 + Tx))// P := xsq * (P1_1 + xsq*(P1_2 + xsq*P1_3))// ...V_hi serves as an initial guess of 1/(cot(B) - tan(x))// ...good to about 20 bits of accuracy//// tanx := x + x*P// D := C_hi - tanx// ...D is a double precision denominator: cot(B) - tan(x)//// V_hi := V_hi + V_hi*(1 - V_hi*D)// ....V_hi approximates 1/(cot(B)-tan(x)) to 40 bits//// V_lo := V_hi * ( [ (1 - V_hi*C_hi) + V_hi*tanx ]// - V_hi*C_lo ) ...observe all order// ...V_hi + V_lo approximates 1/(cot(B) - tan(x))// ...to extra accuracy//// ... SC_inv(B) * (x + x*P)// ... tan(B) + ------------------------- + CORR// ... cot(B) - (x + x*P)// ...// ... = tan(B) + SC_inv(B)*(x + x*P)*(V_hi + V_lo) + CORR// ...//// Sx := SC_inv * x// CORR := sgn_r * c * SC_inv * T_hi//// ...put the ingredients together to compute// ... SC_inv(B) * (x + x*P)// ... tan(B) + ------------------------- + CORR// ... cot(B) - (x + x*P)// ...// ... = tan(B) + SC_inv(B)*(x + x*P)*(V_hi + V_lo) + CORR// ...// ... = T_hi + T_lo + CORR +// ... Sx * V_hi + Sx * V_lo + Sx * P *(V_hi + V_lo)//// CORR := CORR + T_lo// tail := V_lo + P*(V_hi + V_lo)// tail := Sx * tail + CORR// tail := Sx * V_hi + tail// T_hi := sgn_r * T_hi//// ...T_hi + sgn_r*tail now approximate// ...sgn_r*(tan(B+x) + CORR) accurately//// Result := T_hi + sgn_r*tail ...in user-defined// ...rounding control// ...It is crucial that independent paths be fully// ...exploited for performance's sake.////// Next, we consider the computation of -cot( r + c ). As// presented in the previous section,//// -cot( r + c ) = -cot(r) + c * csc^2(r)// = sgn_r * [ -cot(B+x) + CORR ]// CORR = sgn_r * c * cot(B) * 1/[sin(B)*cos(B)]//// because csc^2(r) = csc^(|r|), and B approximate |r| to 6.5 bits.
⌨️ 快捷键说明
复制代码
Ctrl + C
搜索代码
Ctrl + F
全屏模式
F11
切换主题
Ctrl + Shift + D
显示快捷键
?
增大字号
Ctrl + =
减小字号
Ctrl + -