📄 ieee754.texi
字号:
@cindex IEEE floating point This chapter describes functions for examining the representation offloating point numbers and controlling the floating point environment ofyour program. The functions described in this chapter are declared inthe header file @file{gsl_ieee_utils.h}.@menu* Representation of floating point numbers:: * Setting up your IEEE environment:: * IEEE References and Further Reading:: @end menu@node Representation of floating point numbers@section Representation of floating point numbers@cindex IEEE format for floating point numbers@cindex bias, IEEE format@cindex exponent, IEEE format@cindex sign bit, IEEE format@cindex mantissa, IEEE formatThe IEEE Standard for Binary Floating-Point Arithmetic defines binaryformats for single and double precision numbers. Each number is composedof three parts: a @dfn{sign bit} (@math{s}), an @dfn{exponent}(@math{E}) and a @dfn{fraction} (@math{f}). The numerical value of thecombination @math{(s,E,f)} is given by the following formula,@tex\beforedisplay$$(-1)^s (1 \cdot fffff\dots) 2^E$$\afterdisplay@end tex@ifinfo@example(-1)^s (1.fffff...) 2^E@end example@end ifinfo@noindent@cindex normalized form, IEEE format@cindex denormalized form, IEEE formatThe sign bit is either zero or one. The exponent ranges from a minimum value@c{$E_{min}$}@math{E_min} to a maximum value@c{$E_{max}$}@math{E_max} depending on the precision. The exponent is converted to an unsigned number@math{e}, known as the @dfn{biased exponent}, for storage by adding a@dfn{bias} parameter,@c{$e = E + \hbox{\it bias}$}@math{e = E + bias}.The sequence @math{fffff...} represents the digits of the binaryfraction @math{f}. The binary digits are stored in @dfn{normalizedform}, by adjusting the exponent to give a leading digit of @math{1}. Since the leading digit is always 1 for normalized numbers it isassumed implicitly and does not have to be stored.Numbers smaller than @c{$2^{E_{min}}$}@math{2^(E_min)}are be stored in @dfn{denormalized form} with a leading zero,@tex\beforedisplay$$(-1)^s (0 \cdot fffff\dots) 2^{E_{min}}$$\afterdisplay@end tex@ifinfo@example(-1)^s (0.fffff...) 2^(E_min)@end example@end ifinfo@noindent@cindex zero, IEEE format@cindex infinity, IEEE formatThis allows gradual underflow down to @c{$2^{E_{min} - p}$}@math{2^(E_min - p)} for @math{p} bits of precision. A zero is encoded with the special exponent of @c{$2^{E_{min}-1}$}@math{2^(E_min - 1)} and infinities with the exponent of @c{$2^{E_{max}+1}$}@math{2^(E_max + 1)}.@noindent@cindex single precision, IEEE formatThe format for single precision numbers uses 32 bits divided in thefollowing way,@smallexampleseeeeeeeefffffffffffffffffffffff s = sign bit, 1 bite = exponent, 8 bits (E_min=-126, E_max=127, bias=127)f = fraction, 23 bits@end smallexample@noindent@cindex double precision, IEEE formatThe format for double precision numbers uses 64 bits divided in thefollowing way,@smallexampleseeeeeeeeeeeffffffffffffffffffffffffffffffffffffffffffffffffffffs = sign bit, 1 bite = exponent, 11 bits (E_min=-1022, E_max=1023, bias=1023)f = fraction, 52 bits@end smallexample@noindentIt is often useful to be able to investigate the behavior of acalculation at the bit-level and the library provides functions forprinting the IEEE representations in a human-readable form.@comment float vs double vs long double @comment (how many digits are available for each)@deftypefun void gsl_ieee_fprintf_float (FILE * @var{stream}, const float * @var{x})@deftypefunx void gsl_ieee_fprintf_double (FILE * @var{stream}, const double * @var{x})These functions output a formatted version of the IEEE floating-pointnumber pointed to by @var{x} to the stream @var{stream}. A pointer isused to pass the number indirectly, to avoid any undesired promotionfrom @code{float} to @code{double}. The output takes one of thefollowing forms,@table @code@item NaNthe Not-a-Number symbol@item Inf, -Infpositive or negative infinity@item 1.fffff...*2^E, -1.fffff...*2^E a normalized floating point number@item 0.fffff...*2^E, -0.fffff...*2^E a denormalized floating point number@item 0, -0positive or negative zero@comment @item [non-standard IEEE float], [non-standard IEEE double]@comment an unrecognized encoding@end tableThe output can be used directly in GNU Emacs Calc mode by preceding itwith @code{2#} to indicate binary.@end deftypefun@deftypefun void gsl_ieee_printf_float (const float * @var{x})@deftypefunx void gsl_ieee_printf_double (const double * @var{x})These functions output a formatted version of the IEEE floating-pointnumber pointed to by @var{x} to the stream @code{stdout}.@end deftypefun@noindentThe following program demonstrates the use of the functions by printingthe single and double precision representations of the fraction@math{1/3}. For comparison the representation of the value promoted fromsingle to double precision is also printed.@example#include <stdio.h>#include <gsl/gsl_ieee_utils.h>intmain (void) @{ float f = 1.0/3.0; double d = 1.0/3.0; double fd = f; /* promote from float to double */ printf(" f="); gsl_ieee_printf_float(&f); printf("\n"); printf("fd="); gsl_ieee_printf_double(&fd); printf("\n"); printf(" d="); gsl_ieee_printf_double(&d); printf("\n"); return 0;@}@end example@noindentThe binary representation of @math{1/3} is @math{0.01010101... }. Theoutput below shows that the IEEE format normalizes this fraction to givea leading digit of 1,@smallexample f= 1.01010101010101010101011*2^-2fd= 1.0101010101010101010101100000000000000000000000000000*2^-2 d= 1.0101010101010101010101010101010101010101010101010101*2^-2@end smallexample@noindentThe output also shows that a single-precision number is promoted todouble-precision by adding zeros in the binary representation.@comment importance of using 1.234L in long double calculations@comment @example@comment int main (void)@comment @{@comment long double x = 1.0, y = 1.0; @comment x = x + 0.2;@comment y = y + 0.2L;@comment printf(" d %.20Lf\n",x);@comment printf("ld %.20Lf\n",y);@comment return 1;@comment @}@comment d 1.20000000000000001110@comment ld 1.20000000000000000004@comment @end example@node Setting up your IEEE environment@section Setting up your IEEE environment@cindex IEEE exceptions@cindex precision, IEEE arithmetic@cindex rounding mode@cindex arithmetic exceptions@cindex exceptions, IEEE arithmetic@cindex division by zero, IEEE exceptions@cindex underflow, IEEE exceptions@cindex overflow, IEEE exceptionsThe IEEE standard defines several @dfn{modes} for controlling thebehavior of floating point operations. These modes specify the importantproperties of computer arithmetic: the direction used for rounding (e.g.whether numbers should be rounded up, down or to the nearest number),the rounding precision and how the program should handle arithmeticexceptions, such as division by zero.Many of these features can now be controlled via standard functions suchas @code{fpsetround}, which should be used whenever they are available.Unfortunately in the past there has been no universal API forcontrolling their behavior -- each system has had its own way ofaccessing them. For example, the Linux kernel provides the function@code{__setfpucw} (@dfn{set-fpu-control-word}) to set IEEE modes, whileHP-UX and Solaris use the functions @code{fpsetround} and@code{fpsetmask}. To help you write portable programs GSL allows you tospecify modes in a platform-independent way using the environmentvariable @code{GSL_IEEE_MODE}. The library then takes care of all thenecessary machine-specific initializations for you when you call thefunction @code{gsl_ieee_env_setup}.@deftypefun void gsl_ieee_env_setup ()This function reads the environment variable @code{GSL_IEEE_MODE} andattempts to set up the corresponding specified IEEE modes. Theenvironment variable should be a list of keywords, separated bycommas, like this,@display@code{GSL_IEEE_MODE} = "@var{keyword},@var{keyword},..."@end display@noindentwhere @var{keyword} is one of the following mode-names,@itemize @asis@item @code{single-precision}@item @code{double-precision}@item @code{extended-precision}@item @code{round-to-nearest}@item @code{round-down}@item @code{round-up}@item @code{round-to-zero}@item @code{mask-all}@item @code{mask-invalid}@item @code{mask-denormalized}@item @code{mask-division-by-zero}@item @code{mask-overflow}@item @code{mask-underflow}@item @code{trap-inexact}@item @code{trap-common}@end itemizeIf @code{GSL_IEEE_MODE} is empty or undefined then the function returnsimmediately and no attempt is made to change the system's IEEEmode. When the modes from @code{GSL_IEEE_MODE} are turned on thefunction prints a short message showing the new settings to remind youthat the results of the program will be affected.If the requested modes are not supported by the platform being used thenthe function calls the error handler and returns an error code of@code{GSL_EUNSUP}.The following combination of modes is convenient for many purposes,@exampleGSL_IEEE_MODE="double-precision,"\ "mask-underflow,"\ "mask-denormalized"@end example@noindentThis choice ignores any errors relating to small numbers (eitherdenormalized, or underflowing to zero) but traps overflows, division byzero and invalid operations.@end deftypefun@noindentTo demonstrate the effects of different rounding modes consider thefollowing program which computes @math{e}, the base of naturallogarithms, by summing a rapidly-decreasing series,@tex\beforedisplay$$e = 1 + {1 \over 2!} + {1 \over 3!} + {1 \over 4!} + \dots = 2.71828182846...$$\afterdisplay@end tex@ifinfo@examplee = 1 + 1/2! + 1/3! + 1/4! + ... = 2.71828182846...@end example@end ifinfo @example#include <math.h>#include <stdio.h>#include <gsl/gsl_ieee_utils.h>intmain (void)@{ double x = 1, oldsum = 0, sum = 0; int i = 0; gsl_ieee_env_setup (); /* read GSL_IEEE_MODE */ do @{ i++; oldsum = sum; sum += x; x = x / i; printf("i=%2d sum=%.18f error=%g\n", i, sum, sum - M_E); if (i > 30) break; @} while (sum != oldsum); return 0;@}@end example@noindentHere are the results of running the program in @code{round-to-nearest}mode. This is the IEEE default so it isn't really necessary to specifyit here,@exampleGSL_IEEE_MODE="round-to-nearest" ./a.out i= 1 sum=1.000000000000000000 error=-1.71828i= 2 sum=2.000000000000000000 error=-0.718282....i=18 sum=2.718281828459045535 error=4.44089e-16i=19 sum=2.718281828459045535 error=4.44089e-16@end example@noindentAfter nineteen terms the sum converges to within @c{$4 \times 10^{-16}$}@math{4 \times 10^-16} of the correct value. If we now change the rounding mode to@code{round-down} the final result is less accurate,@exampleGSL_IEEE_MODE="round-down" ./a.out i= 1 sum=1.000000000000000000 error=-1.71828....i=19 sum=2.718281828459041094 error=-3.9968e-15@end example@noindentThe result is about @c{$4 \times 10^{-15}$}@math{4 \times 10^-15} below the correct value, an order of magnitude worse than the resultobtained in the @code{round-to-nearest} mode.If we change to rounding mode to @code{round-up} then the series nolonger converges (the reason is that when we add each term to the sumthe final result is always rounded up. This is guaranteed to increase the sumby at least one tick on each iteration). To avoid this problem we wouldneed to use a safer converge criterion, such as @code{while (fabs(sum -oldsum) > epsilon)}, with a suitably chosen value of epsilon.Finally we can see the effect of computing the sum usingsingle-precision rounding, in the default @code{round-to-nearest}mode. In this case the program thinks it is still using double precisionnumbers but the CPU rounds the result of each floating point operationto single-precision accuracy. This simulates the effect of writing theprogram using single-precision @code{float} variables instead of@code{double} variables. The iteration stops after about half the numberof iterations and the final result is much less accurate,@exampleGSL_IEEE_MODE="single-precision" ./a.out ....i=12 sum=2.718281984329223633 error=1.5587e-07@end example@noindentwith an error of @c{$O(10^{-7})$}@math{O(10^-7)}, which corresponds to singleprecision accuracy (about 1 part in @math{10^7}). Continuing theiterations further does not decrease the error because all thesubsequent results are rounded to the same value.@node IEEE References and Further Reading@section References and Further Reading@noindentThe reference for the IEEE standard is,@itemize @asis@itemANSI/IEEE Std 754-1985, IEEE Standard for Binary Floating-Point Arithmetic@end itemize@noindentA more pedagogical introduction to the standard can be found in thepaper ``What Every Computer Scientist Should Know About Floating-PointArithmetic''.@itemize @asis@itemDavid Goldberg: What Every Computer Scientist Should Know AboutFloating-Point Arithmetic. @cite{ACM Computing Surveys}, Vol. 23, No. 1(March 1991), pages 5-48Corrigendum: @cite{ACM Computing Surveys}, Vol. 23, No. 3 (September1991), page 413.@itemSee also the sections by B. A. Wichmann and Charles B. Dunham inSurveyor's Forum: ``What Every Computer Scientist Should Know AboutFloating-Point Arithmetic''. @cite{ACM Computing Surveys}, Vol. 24,No. 3 (September 1992), page 319@end itemize@noindent@comment to turn on math exception handling use __setfpucw, see@comment /usr/include/i386/fpu_control.h@comment e.g.@comment #include <math.h>@comment #include <stdio.h>@comment #include <fpu_control.h>@comment double f (double x);@comment int main ()@comment {@comment double a = 0;@comment double y, z;@comment __setfpucw(0x1372); @comment mention extended vs double precision on Pentium, and how to get around@comment it (-ffloat-store, or selective use of volatile)@comment On the alpha the option -mieee is needed with gcc@comment In Digital's compiler the equivalent is -ieee, -ieee-with-no-inexact @comment and -ieee-with-inexact, or -fpe1 or -fpe2
⌨️ 快捷键说明
复制代码
Ctrl + C
搜索代码
Ctrl + F
全屏模式
F11
切换主题
Ctrl + Shift + D
显示快捷键
?
增大字号
Ctrl + =
减小字号
Ctrl + -