📄 s_cosl.s
字号:
.file "sincosl.s"// Copyright (C) 2000, 2001, Intel Corporation// All rights reserved.// // Contributed 2/2/2000 by John Harrison, Ted Kubaska, Bob Norin, Shane Story,// and Ping Tak Peter Tang of the Computational Software Lab, Intel Corporation.//// Redistribution and use in source and binary forms, with or without// modification, are permitted provided that the following conditions are// met://// * Redistributions of source code must retain the above copyright// notice, this list of conditions and the following disclaimer.//// * Redistributions in binary form must reproduce the above copyright// notice, this list of conditions and the following disclaimer in the// documentation and/or other materials provided with the distribution.//// * The name of Intel Corporation may not be used to endorse or promote// products derived from this software without specific prior written// permission.//// THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS // "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT // LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR// A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL INTEL OR ITS // CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, // PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR // PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY // OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY OR TORT (INCLUDING// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS // SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. // // Intel Corporation is the author of this code, and requests that all// problem reports or change requests be submitted to it directly at // http://developer.intel.com/opensource.//// *********************************************************************//// History: // 2/02/2000 (hand-optimized)// 4/04/00 Unwind support added//// *********************************************************************//// Function: Combined sinl(x) and cosl(x), where//// sinl(x) = sine(x), for double-extended precision x values// cosl(x) = cosine(x), for double-extended precision x values//// *********************************************************************//// Resources Used://// Floating-Point Registers: f8 (Input and Return Value) // f32-f99//// General Purpose Registers:// r32-r43 // r44-r45 (Used to pass arguments to pi_by_2 reduce routine)//// Predicate Registers: p6-p13//// *********************************************************************//// IEEE Special Conditions://// Denormal fault raised on denormal inputs// Overflow exceptions do not occur// Underflow exceptions raised when appropriate for sin // (No specialized error handling for this routine)// Inexact raised when appropriate by algorithm//// sinl(SNaN) = QNaN// sinl(QNaN) = QNaN// sinl(inf) = QNaN // sinl(+/-0) = +/-0// cosl(inf) = QNaN // cosl(SNaN) = QNaN// cosl(QNaN) = QNaN// cosl(0) = 1// // *********************************************************************//// Mathematical Description// ========================//// The computation of FSIN and FCOS is best handled in one piece of // code. The main reason is that given any argument Arg, computation // of trigonometric functions first calculate N and an approximation // to alpha where//// Arg = N pi/2 + alpha, |alpha| <= pi/4.//// Since//// cosl( Arg ) = sinl( (N+1) pi/2 + alpha ),//// therefore, the code for computing sine will produce cosine as long // as 1 is added to N immediately after the argument reduction // process.//// Let M = N if sine// N+1 if cosine. //// Now, given//// Arg = M pi/2 + alpha, |alpha| <= pi/4,//// let I = M mod 4, or I be the two lsb of M when M is represented // as 2's complement. I = [i_0 i_1]. Then//// sinl( Arg ) = (-1)^i_0 sinl( alpha ) if i_1 = 0,// = (-1)^i_0 cosl( alpha ) if i_1 = 1.//// For example:// if M = -1, I = 11 // sin ((-pi/2 + alpha) = (-1) cos (alpha)// if M = 0, I = 00 // sin (alpha) = sin (alpha)// if M = 1, I = 01 // sin (pi/2 + alpha) = cos (alpha)// if M = 2, I = 10 // sin (pi + alpha) = (-1) sin (alpha)// if M = 3, I = 11 // sin ((3/2)pi + alpha) = (-1) cos (alpha)//// The value of alpha is obtained by argument reduction and // represented by two working precision numbers r and c where//// alpha = r + c accurately.//// The reduction method is described in a previous write up.// The argument reduction scheme identifies 4 cases. For Cases 2 // and 4, because |alpha| is small, sinl(r+c) and cosl(r+c) can be // computed very easily by 2 or 3 terms of the Taylor series // expansion as follows://// Case 2:// -------//// sinl(r + c) = r + c - r^3/6 accurately// cosl(r + c) = 1 - 2^(-67) accurately//// Case 4:// -------//// sinl(r + c) = r + c - r^3/6 + r^5/120 accurately// cosl(r + c) = 1 - r^2/2 + r^4/24 accurately//// The only cases left are Cases 1 and 3 of the argument reduction // procedure. These two cases will be merged since after the // argument is reduced in either cases, we have the reduced argument // represented as r + c and that the magnitude |r + c| is not small // enough to allow the usage of a very short approximation.//// The required calculation is either//// sinl(r + c) = sinl(r) + correction, or// cosl(r + c) = cosl(r) + correction.//// Specifically,//// sinl(r + c) = sinl(r) + c sin'(r) + O(c^2)// = sinl(r) + c cos (r) + O(c^2)// = sinl(r) + c(1 - r^2/2) accurately.// Similarly,//// cosl(r + c) = cosl(r) - c sinl(r) + O(c^2)// = cosl(r) - c(r - r^3/6) accurately.//// We therefore concentrate on accurately calculating sinl(r) and // cosl(r) for a working-precision number r, |r| <= pi/4 to within// 0.1% or so.//// The greatest challenge of this task is that the second terms of // the Taylor series// // r - r^3/3! + r^r/5! - ...//// and//// 1 - r^2/2! + r^4/4! - ...//// are not very small when |r| is close to pi/4 and the rounding // errors will be a concern if simple polynomial accumulation is // used. When |r| < 2^-3, however, the second terms will be small // enough (6 bits or so of right shift) that a normal Horner // recurrence suffices. Hence there are two cases that we consider // in the accurate computation of sinl(r) and cosl(r), |r| <= pi/4.//// Case small_r: |r| < 2^(-3)// --------------------------//// Since Arg = M pi/4 + r + c accurately, and M mod 4 is [i_0 i_1],// we have//// sinl(Arg) = (-1)^i_0 * sinl(r + c) if i_1 = 0// = (-1)^i_0 * cosl(r + c) if i_1 = 1//// can be accurately approximated by//// sinl(Arg) = (-1)^i_0 * [sinl(r) + c] if i_1 = 0// = (-1)^i_0 * [cosl(r) - c*r] if i_1 = 1//// because |r| is small and thus the second terms in the correction // are unneccessary.//// Finally, sinl(r) and cosl(r) are approximated by polynomials of // moderate lengths.//// sinl(r) = r + S_1 r^3 + S_2 r^5 + ... + S_5 r^11// cosl(r) = 1 + C_1 r^2 + C_2 r^4 + ... + C_5 r^10//// We can make use of predicates to selectively calculate // sinl(r) or cosl(r) based on i_1. //// Case normal_r: 2^(-3) <= |r| <= pi/4// ------------------------------------//// This case is more likely than the previous one if one considers// r to be uniformly distributed in [-pi/4 pi/4]. Again,// // sinl(Arg) = (-1)^i_0 * sinl(r + c) if i_1 = 0// = (-1)^i_0 * cosl(r + c) if i_1 = 1.//// Because |r| is now larger, we need one extra term in the // correction. sinl(Arg) can be accurately approximated by//// sinl(Arg) = (-1)^i_0 * [sinl(r) + c(1-r^2/2)] if i_1 = 0// = (-1)^i_0 * [cosl(r) - c*r*(1 - r^2/6)] i_1 = 1.//// Finally, sinl(r) and cosl(r) are approximated by polynomials of // moderate lengths.//// sinl(r) = r + PP_1_hi r^3 + PP_1_lo r^3 + // PP_2 r^5 + ... + PP_8 r^17//// cosl(r) = 1 + QQ_1 r^2 + QQ_2 r^4 + ... + QQ_8 r^16//// where PP_1_hi is only about 16 bits long and QQ_1 is -1/2. // The crux in accurate computation is to calculate //// r + PP_1_hi r^3 or 1 + QQ_1 r^2//// accurately as two pieces: U_hi and U_lo. The way to achieve this // is to obtain r_hi as a 10 sig. bit number that approximates r to // roughly 8 bits or so of accuracy. (One convenient way is//// r_hi := frcpa( frcpa( r ) ).)//// This way,//// r + PP_1_hi r^3 = r + PP_1_hi r_hi^3 +// PP_1_hi (r^3 - r_hi^3)// = [r + PP_1_hi r_hi^3] + // [PP_1_hi (r - r_hi) // (r^2 + r_hi r + r_hi^2) ]// = U_hi + U_lo//// Since r_hi is only 10 bit long and PP_1_hi is only 16 bit long,// PP_1_hi * r_hi^3 is only at most 46 bit long and thus computed // exactly. Furthermore, r and PP_1_hi r_hi^3 are of opposite sign // and that there is no more than 8 bit shift off between r and // PP_1_hi * r_hi^3. Hence the sum, U_hi, is representable and thus // calculated without any error. Finally, the fact that //// |U_lo| <= 2^(-8) |U_hi|//// says that U_hi + U_lo is approximating r + PP_1_hi r^3 to roughly // 8 extra bits of accuracy.//// Similarly,//// 1 + QQ_1 r^2 = [1 + QQ_1 r_hi^2] +// [QQ_1 (r - r_hi)(r + r_hi)]// = U_hi + U_lo.// // Summarizing, we calculate r_hi = frcpa( frcpa( r ) ). //// If i_1 = 0, then//// U_hi := r + PP_1_hi * r_hi^3// U_lo := PP_1_hi * (r - r_hi) * (r^2 + r*r_hi + r_hi^2)// poly := PP_1_lo r^3 + PP_2 r^5 + ... + PP_8 r^17// correction := c * ( 1 + C_1 r^2 )//// Else ...i_1 = 1//// U_hi := 1 + QQ_1 * r_hi * r_hi// U_lo := QQ_1 * (r - r_hi) * (r + r_hi)// poly := QQ_2 * r^4 + QQ_3 * r^6 + ... + QQ_8 r^16// correction := -c * r * (1 + S_1 * r^2)//// End//// Finally,// // V := poly + ( U_lo + correction )//// / U_hi + V if i_0 = 0// result := |// \ (-U_hi) - V if i_0 = 1//// It is important that in the last step, negation of U_hi is // performed prior to the subtraction which is to be performed in // the user-set rounding mode. ////// Algorithmic Description// =======================//// The argument reduction algorithm is tightly integrated into FSIN // and FCOS which share the same code. The following is complete and // self-contained. The argument reduction description given // previously is repeated below.////// Step 0. Initialization. //// If FSIN is invoked, set N_inc := 0; else if FCOS is invoked,// set N_inc := 1.//// Step 1. Check for exceptional and special cases.//// * If Arg is +-0, +-inf, NaN, NaT, go to Step 10 for special // handling.// * If |Arg| < 2^24, go to Step 2 for reduction of moderate// arguments. This is the most likely case.// * If |Arg| < 2^63, go to Step 8 for pre-reduction of large// arguments.// * If |Arg| >= 2^63, go to Step 10 for special handling.//// Step 2. Reduction of moderate arguments.//// If |Arg| < pi/4 ...quick branch// N_fix := N_inc (integer)// r := Arg// c := 0.0// Branch to Step 4, Case_1_complete// Else ...cf. argument reduction// N := Arg * two_by_PI (fp)// N_fix := fcvt.fx( N ) (int)// N := fcvt.xf( N_fix )// N_fix := N_fix + N_inc// s := Arg - N * P_1 (first piece of pi/2)// w := -N * P_2 (second piece of pi/2)//// If |s| >= 2^(-33)// go to Step 3, Case_1_reduce// Else// go to Step 7, Case_2_reduce// Endif// Endif//// Step 3. Case_1_reduce.//// r := s + w// c := (s - r) + w ...observe order// // Step 4. Case_1_complete//// ...At this point, the reduced argument alpha is// ...accurately represented as r + c.// If |r| < 2^(-3), go to Step 6, small_r.//// Step 5. Normal_r.//// Let [i_0 i_1] by the 2 lsb of N_fix.// FR_rsq := r * r// r_hi := frcpa( frcpa( r ) )// r_lo := r - r_hi//// If i_1 = 0, then// poly := r*FR_rsq*(PP_1_lo + FR_rsq*(PP_2 + ... FR_rsq*PP_8))// U_hi := r + PP_1_hi*r_hi*r_hi*r_hi ...any order// U_lo := PP_1_hi*r_lo*(r*r + r*r_hi + r_hi*r_hi)// correction := c + c*C_1*FR_rsq ...any order// Else// poly := FR_rsq*FR_rsq*(QQ_2 + FR_rsq*(QQ_3 + ... + FR_rsq*QQ_8))// U_hi := 1 + QQ_1 * r_hi * r_hi ...any order// U_lo := QQ_1 * r_lo * (r + r_hi)// correction := -c*(r + S_1*FR_rsq*r) ...any order// Endif//// V := poly + (U_lo + correction) ...observe order//// result := (i_0 == 0? 1.0 : -1.0)//// Last instruction in user-set rounding mode//// result := (i_0 == 0? result*U_hi + V :// result*U_hi - V)//// Return//// Step 6. Small_r.// // ...Use flush to zero mode without causing exception// Let [i_0 i_1] be the two lsb of N_fix.//// FR_rsq := r * r//// If i_1 = 0 then// z := FR_rsq*FR_rsq; z := FR_rsq*z *r// poly_lo := S_3 + FR_rsq*(S_4 + FR_rsq*S_5)// poly_hi := r*FR_rsq*(S_1 + FR_rsq*S_2)// correction := c// result := r// Else// z := FR_rsq*FR_rsq; z := FR_rsq*z// poly_lo := C_3 + FR_rsq*(C_4 + FR_rsq*C_5)// poly_hi := FR_rsq*(C_1 + FR_rsq*C_2) // correction := -c*r// result := 1// Endif//// poly := poly_hi + (z * poly_lo + correction)//// If i_0 = 1, result := -result//// Last operation. Perform in user-set rounding mode//// result := (i_0 == 0? result + poly :// result - poly )// Return//// Step 7. Case_2_reduce.//// ...Refer to the write up for argument reduction for // ...rationale. The reduction algorithm below is taken from// ...argument reduction description and integrated this.//// w := N*P_3// U_1 := N*P_2 + w ...FMA// U_2 := (N*P_2 - U_1) + w ...2 FMA// ...U_1 + U_2 is N*(P_2+P_3) accurately// // r := s - U_1// c := ( (s - r) - U_1 ) - U_2//// ...The mathematical sum r + c approximates the reduced// ...argument accurately. Note that although compared to// ...Case 1, this case requires much more work to reduce// ...the argument, the subsequent calculation needed for// ...any of the trigonometric function is very little because// ...|alpha| < 1.01*2^(-33) and thus two terms of the // ...Taylor series expansion suffices.//// If i_1 = 0 then// poly := c + S_1 * r * r * r ...any order// result := r// Else// poly := -2^(-67)// result := 1.0// Endif// // If i_0 = 1, result := -result//// Last operation. Perform in user-set rounding mode//// result := (i_0 == 0? result + poly :// result - poly )// // Return//// // Step 8. Pre-reduction of large arguments.// // ...Again, the following reduction procedure was described// ...in the separate write up for argument reduction, which// ...is tightly integrated here.// N_0 := Arg * Inv_P_0// N_0_fix := fcvt.fx( N_0 )// N_0 := fcvt.xf( N_0_fix) // Arg' := Arg - N_0 * P_0// w := N_0 * d_1// N := Arg' * two_by_PI// N_fix := fcvt.fx( N )// N := fcvt.xf( N_fix )// N_fix := N_fix + N_inc //// s := Arg' - N * P_1// w := w - N * P_2//// If |s| >= 2^(-14)// go to Step 3// Else// go to Step 9// Endif//// Step 9. Case_4_reduce.// // ...first obtain N_0*d_1 and -N*P_2 accurately// U_hi := N_0 * d_1 V_hi := -N*P_2// U_lo := N_0 * d_1 - U_hi V_lo := -N*P_2 - U_hi ...FMAs//// ...compute the contribution from N_0*d_1 and -N*P_3// w := -N*P_3// w := w + N_0*d_2// t := U_lo + V_lo + w ...any order//// ...at this point, the mathematical value// ...s + U_hi + V_hi + t approximates the true reduced argument// ...accurately. Just need to compute this accurately.//// ...Calculate U_hi + V_hi accurately:// A := U_hi + V_hi// if |U_hi| >= |V_hi| then// a := (U_hi - A) + V_hi// else// a := (V_hi - A) + U_hi// endif// ...order in computing "a" must be observed. This branch is// ...best implemented by predicates.// ...A + a is U_hi + V_hi accurately. Moreover, "a" is // ...much smaller than A: |a| <= (1/2)ulp(A).//// ...Just need to calculate s + A + a + t// C_hi := s + A t := t + a// C_lo := (s - C_hi) + A // C_lo := C_lo + t//// ...Final steps for reduction// r := C_hi + C_lo// c := (C_hi - r) + C_lo//// ...At this point, we have r and c// ...And all we need is a couple of terms of the corresponding// ...Taylor series.//// If i_1 = 0// poly := c + r*FR_rsq*(S_1 + FR_rsq*S_2)// result := r// Else// poly := FR_rsq*(C_1 + FR_rsq*C_2)// result := 1// Endif//// If i_0 = 1, result := -result//// Last operation. Perform in user-set rounding mode//// result := (i_0 == 0? result + poly :// result - poly )// Return// // Large Arguments: For arguments above 2**63, a Payne-Hanek// style argument reduction is used and pi_by_2 reduce is called.//#include "libm_support.h" #ifdef _LIBC.rodata#else.data#endif.align 64 FSINCOSL_CONSTANTS:ASM_TYPE_DIRECTIVE(FSINCOSL_CONSTANTS,@object)data4 0x4B800000, 0xCB800000, 0x00000000,0x00000000 // two**24, -two**24data4 0x4E44152A, 0xA2F9836E, 0x00003FFE,0x00000000 // Inv_pi_by_2data4 0xCE81B9F1, 0xC84D32B0, 0x00004016,0x00000000 // P_0 data4 0x2168C235, 0xC90FDAA2, 0x00003FFF,0x00000000 // P_1 data4 0xFC8F8CBB, 0xECE675D1, 0x0000BFBD,0x00000000 // P_2 data4 0xACC19C60, 0xB7ED8FBB, 0x0000BF7C,0x00000000 // P_3 data4 0x5F000000, 0xDF000000, 0x00000000,0x00000000 // two_to_63, -two_to_63data4 0x6EC6B45A, 0xA397E504, 0x00003FE7,0x00000000 // Inv_P_0 data4 0xDBD171A1, 0x8D848E89, 0x0000BFBF,0x00000000 // d_1 data4 0x18A66F8E, 0xD5394C36, 0x0000BF7C,0x00000000 // d_2 data4 0x2168C234, 0xC90FDAA2, 0x00003FFE,0x00000000 // pi_by_4 data4 0x2168C234, 0xC90FDAA2, 0x0000BFFE,0x00000000 // neg_pi_by_4 data4 0x3E000000, 0xBE000000, 0x00000000,0x00000000 // two**-3, -two**-3data4 0x2F000000, 0xAF000000, 0x9E000000,0x00000000 // two**-33, -two**-33, -two**-67data4 0xA21C0BC9, 0xCC8ABEBC, 0x00003FCE,0x00000000 // PP_8 data4 0x720221DA, 0xD7468A05, 0x0000BFD6,0x00000000 // PP_7 data4 0x640AD517, 0xB092382F, 0x00003FDE,0x00000000 // PP_6 data4 0xD1EB75A4, 0xD7322B47, 0x0000BFE5,0x00000000 // PP_5 data4 0xFFFFFFFE, 0xFFFFFFFF, 0x0000BFFD,0x00000000 // C_1 data4 0x00000000, 0xAAAA0000, 0x0000BFFC,0x00000000 // PP_1_hi data4 0xBAF69EEA, 0xB8EF1D2A, 0x00003FEC,0x00000000 // PP_4 data4 0x0D03BB69, 0xD00D00D0, 0x0000BFF2,0x00000000 // PP_3 data4 0x88888962, 0x88888888, 0x00003FF8,0x00000000 // PP_2data4 0xAAAB0000, 0xAAAAAAAA, 0x0000BFEC,0x00000000 // PP_1_lo data4 0xC2B0FE52, 0xD56232EF, 0x00003FD2,0x00000000 // QQ_8data4 0x2B48DCA6, 0xC9C99ABA, 0x0000BFDA,0x00000000 // QQ_7data4 0x9C716658, 0x8F76C650, 0x00003FE2,0x00000000 // QQ_6data4 0xFDA8D0FC, 0x93F27DBA, 0x0000BFE9,0x00000000 // QQ_5data4 0xAAAAAAAA, 0xAAAAAAAA, 0x0000BFFC,0x00000000 // S_1 data4 0x00000000, 0x80000000, 0x0000BFFE,0x00000000 // QQ_1 data4 0x0C6E5041, 0xD00D00D0, 0x00003FEF,0x00000000 // QQ_4 data4 0x0B607F60, 0xB60B60B6, 0x0000BFF5,0x00000000 // QQ_3 data4 0xAAAAAA9B, 0xAAAAAAAA, 0x00003FFA,0x00000000 // QQ_2 data4 0xFFFFFFFE, 0xFFFFFFFF, 0x0000BFFD,0x00000000 // C_1 data4 0xAAAA719F, 0xAAAAAAAA, 0x00003FFA,0x00000000 // C_2 data4 0x0356F994, 0xB60B60B6, 0x0000BFF5,0x00000000 // C_3data4 0xB2385EA9, 0xD00CFFD5, 0x00003FEF,0x00000000 // C_4 data4 0x292A14CD, 0x93E4BD18, 0x0000BFE9,0x00000000 // C_5data4 0xAAAAAAAA, 0xAAAAAAAA, 0x0000BFFC,0x00000000 // S_1 data4 0x888868DB, 0x88888888, 0x00003FF8,0x00000000 // S_2 data4 0x055EFD4B, 0xD00D00D0, 0x0000BFF2,0x00000000 // S_3 data4 0x839730B9, 0xB8EF1C5D, 0x00003FEC,0x00000000 // S_4data4 0xE5B3F492, 0xD71EA3A4, 0x0000BFE5,0x00000000 // S_5data4 0x38800000, 0xB8800000, 0x00000000 // two**-14, -two**-14ASM_SIZE_DIRECTIVE(FSINCOSL_CONSTANTS)FR_Input_X = f8 FR_Neg_Two_to_M3 = f32 FR_Two_to_63 = f32 FR_Two_to_24 = f33 FR_Pi_by_4 = f33 FR_Two_to_M14 = f34 FR_Two_to_M33 = f35 FR_Neg_Two_to_24 = f36 FR_Neg_Pi_by_4 = f36 FR_Neg_Two_to_M14 = f37 FR_Neg_Two_to_M33 = f38 FR_Neg_Two_to_M67 = f39 FR_Inv_pi_by_2 = f40 FR_N_float = f41 FR_N_fix = f42 FR_P_1 = f43 FR_P_2 = f44 FR_P_3 = f45
⌨️ 快捷键说明
复制代码
Ctrl + C
搜索代码
Ctrl + F
全屏模式
F11
切换主题
Ctrl + Shift + D
显示快捷键
?
增大字号
Ctrl + =
减小字号
Ctrl + -