WO2002052401A2

WO2002052401A2 - Method of obtaining correctly rounded results in ieee double precision format of transcendental functions

Info

Publication number: WO2002052401A2
Application number: PCT/US2001/049052
Authority: WO
Inventors: Kwok Choi
Original assignee: Sun Microsystems, Inc.
Priority date: 2000-12-27
Filing date: 2001-12-17
Publication date: 2002-07-04
Also published as: WO2002052401A3; AU2002227437A1

Abstract

A method for correctly rounding a result of transcendental function such as sine, cosine, tangent, arc tangent, exponent, logarithm, or power in double precision arithmetic is disclosed. All arguments of the functions that are undefined, overflow or underflow are fist purged off. Then the given argument is reduced within a primary range to compute the function as a pair of double precision numbers. Multi-precision arithmetic with a specified word length is used if the sum of the pair of double precision number is close to the boundary case. The multi-precision arithmetic is repeated with greater precision each time until the result of the multi-precision arithmetic is not close to the boundary case. The result of the multi-precision arithmetic or the sum of said pair of double precision numbers is finally rounded according to the given rounding mode.

Description

S P E C I F I C A T I O N

TITLE OF INVENTION

METHOD OF OBTAINING CORRECTLY ROUNDED RESULTS IN IEEE DOUBLE PRECISION FORMAT OF TRANSCENDENTAL FUNCTIONS

BACKGROUND OF THE INVENTION Field of the Invention

The present invention relates to data processing systems having a floating point arithmetic unit. More particularly, the present invention relates to a data processing system performing correctly rounded floating point operations.

The Background Art

Floating point units conventionally use hardware or software to perform mathematical operations. Regardless of the approach used, the floating point unit can produce an inexact result, that is a result having an error in the last few digits. Rounding the result discards one or more of the trailing significant digits and adjusts the retained part in accordance with some specified rule.

The purpose for rounding is to reduce the number of digits in the result so that the rounded result can fit within the word size of a floating point number in the microprocessor. Floating point notation is a method of writing numeric quantities by using an exponent and a significand (also called the mantissa). The exponent represents the location of the decimal point in base 10 of the number, and the significand represents the significant digits of the number. For example, the number 12.846 can also be written as 1.2846 x 10¹ which comprises an exponent of 1 and a significand of 1.2846.

In many situations, the exponent and the significand of the result of a floating point operation exceed a microprocessor's word size. For example, the addition of two numbers having n significand bits might result in a sum of having n+1 bits.

I Consequently, an overflow bit results if the length of the significand is only n-bits long. The overflow bit must then be rounded off.

In general, the precision of a floating point number is determined by the length of the exponent and the significand. The longer the exponent and the significand length, the more precise it becomes. However, the tradeoff is that calculations take longer because more bits are being manipulated. Commonly, precisions are adopted according to the specifications set forth in Institute Electrical and Electronic Engineers (IEEE) standard 754 (1981). The standard specifies four precisions for a binary machine: single, single extended, double, and double extended. Implementations are not required to have all four precisions.

By way of illustration as defined in the IEEE standard, single precision includes 1 sign bit, 7 exponent bits and 24 significand bits. Double precision includes 1 sign bit, 10 exponent bits and 53 significand bits. The sign bit, exponent bits, and significand bits are packed in a 32 bit word for single precision, and a 64 bit word for double precision. Once it is determined that a number needs to be rounded to a given precision, a rounding method or mode must then be selected. Some of the more common rounding modes include chopping, rounding up, rounding down, and rounding to nearest/even. Thus, a result is said to be "correctly rounded" if it is rounded according to the rounding mode given.

In the prior art, correctly rounding computations are limited to addition, subtraction, multiplication, division, and square root computations. Rounding a transcendental number produced by transcendental functions such as a sine or cosine function is inherently difficult because a transcendental number can be arbitrarily close, but not equal, to a rational number. Transcendental numbers are numbers that cannot be represented with a finite number of digits. Consequently, one may need to examine as many extra trailing bits as necessary in order to determine the correct rounding. Multi- precision arithmetic is usually required in this case.

For IEEE single precision arithmetic, this problem can be solved by resorting to double precision arithmetic and exhaustive computation of all 2³² arguments. However, for double precision arithmetic, exhaustive computation is not currently possible because there are close to 2⁶³ double precision numbers, and the cost of multi-precision arithmetic at this level can be expensive.

Therefore, a method is needed to correctly round transcendental function results in double precision floating point without expending an inordinate number of computation clock cycles.

BRIEF DESCRIPTION OF THE INVENTION

A_; method for correctly rounding a result of transcendental function such as sine, cosine, tangent, arc tangent, exponent, logarithm, or power in double precision arithmetic is disclosed. All arguments of the functions that are undefined, overflow or underflow are first purged off. Then the given argument is reduced within a primary range to compute the function as a pair of double precision numbers. Multi-precision arithmetic with a specified word length is used if the sum of the pair of double precision number is close to the boundary case. The multi-precision arithmetic is repeated with greater precision each time until the result of the multi-precision arithmetic is not close to the boundary case. The result of the multi-precision arithmetic or the sum of said pair of double precision numbers is finally rounded according to the given rounding mode.

BRIEF DESCRIPTION OF THE DRAWINGS

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate one or more embodiments of the invention and, together with the present description, serve to explain the principles of the invention.

In the drawings:

FIG. 1 is a block diagram illustrating a method for correctly rounding one of the transcendental functions sine, cosine, tangent, arc tangent, exponential, logarithm, and power in IEEE double precision arithmetic in accordance with a specific embodiment of the present invention. FIG. 2 is a system block diagram of a system for implementing the method of the present invention.

DETAILED DESCRIPTION OF A PREFERRED EMBODIMENT One embodiment of the present invention is described herein in the context of algorithm performing correctly rounded floating point operations. Those of ordinary skill in the art will realize that the following description of the present invention is illustrative only and not in any way limiting. Other embodiments of the invention will readily suggest themselves to such skilled persons having the benefit of this disclosure. Reference will now be made in detail to an implementation of the present invention as illustrated in the accompanying drawings. The same reference numbers will be used throughout the drawings and the following description to refer to the same or like parts.

In accordance with a presently preferred embodiment of the present invention, the components, algorithmic steps, and/or data structures are implemented using a source code programmed in the C source computer programming language. This implementation is not intended to be limiting in any way. Different implementations may be used and may include other types of language, computing platforms, program storage devices and/or computer programs. In addition, those of ordinary skill in the art will readily recognize that devices of a less general purpose nature, such as hardwired devices, devices relying on FPGA (field programmable gate array) or ASIC (application specific integrated circuit) technology, or the like, may also be used without departing from the scope and spirit of the inventive concepts disclosed herewith.

In general, except for certain special cases which can be identified easily, for any double precision argument (which happens to be a rational number), the value of a transcendental function is a transcendental number, i.e., a number that cannot be represented with a finite number of digits. Normally one simply cuts the processing off after a certain specified number of bits, hence resulting in an approximation error, then rounds the number using a conventional rounding algorithm, and declares the result. The problem with this approach is that two different machines could, conceivably, produce different approximation errors and return slightly different rounded results. The precept of the JAVA programming language is that all machines should return the same result.

H The simplest way to achieve that is to require that the results of all library functions must be "correctly rounded" as if the results are precise and computed infinitely. Consequently, the rounded number will always be the same no matter what processor (as long as it conforms to the IEEE 754 standard) was used to compute it.

The method of the present invention utilizes an iterative approach to request additional bits of the solution until the number can be correctly rounded even in the presence of an approximation error. Since the number will never be equal to a boundary case (which is the case for a rational number), the iterative process will terminate. It is strongly believed that one will never need to do more than a few iterations to achieve correct roundedness. In particular, the first computation will result in 73 bits accuracy, which will yield correctly rounded result 99.9999% of time. A second iteration will give 120 bits accuracy improving the chances of success to 1-2^"67. The method of the present invention attempts to achieve "correct rounding" in compliance with IEEE standards without forcing all calculations to be done at a number of bits which would guarantee a solution in all cases. In this way, the correct rounding is performed significantly more efficiently than, for example, a 120 bit calculation were require for all results.

FIG. 1 is a block diagram illustrating a general method for correctly rounding one of the transcendental functions sine, cosine, tangent, arc tangent, exponential, logarithm, and power in IEEE double precision arithmetic in accordance with a specific embodiment of the present invention.

"x" is an argument that a transcendental function f(x) operates on. The transcendental function f may be sine, cosine, tangent, exponential, logarithm, power, or arc tangent (or a combination thereof).

In a first block 100, all arguments x are first purged off when/(x) is undefined, overflow or underflow. An example of a code in C language for the exponential function is illustrated below:

/* filter out non-finite argument */ if(hx >= 0x40862E42) { /* if |x|>=709.78... */ if(hx>=0x7ff0₀00₀) { if(((hx&0xfffff)|LO _WORD(x)) 1=0) return x+x; /* NaN */ else return (xsb==0)? x.0.0; /* exp(+-inf)={inf,0 } */

} if(x > o_threshold) return huge*huge; /* overflow */ if(x < u_threshold) return twoml00₀*twoml000; /* underflow */

First, the bit pattern of x is examined. If it is a not a number (NaN), then the code "NaN" is returned. If it is a positive infinite number, then a "+Infinity"code is returned. Next, the range of x is checked. If x is more than the threshold for an overflow condition of the exponential function, then an overflow result is returned. If x is less than the threshold for an underflow condition of the exponential function, then an underflow result is returned.

In block 102, argument x is preferably reduced to argument y such that y falls within a primary range. For trigonometric functions such as sine, cosine, and tangent, argument x is reduced to argument y through the following mathematical relation: y=x-k*pi 2 where k is an integer chosen so that |y|</7z/4. Let n= k mod 4. Then the computation of the trigonometric functions are reduced to the computation of sin(v) or cos(y) as indicated in the table below:

For an arctangent function, a special table value t; is chosen so that

will be less than 1/8 in absolute value. Then atan(x) can be computed through the following relationship atan( ) = atan(t +atan(y), where atanfø) is obtained from a lookup table.

For an exponential function, an integer n is chosen such that

Also, y must be less than ln2/32 in absolute value. Let k = [π/16] andj = n-lβk. Then the value of the exponential of x is equal to

For a logarithm function, choose an integer n such that x = r*2 where l/sqrt(2) < r < sqrt(2) . A value is chosen from a table so that | t,-r | < 0.0315. Let y = r/t_t. Then log( ) is computed as follows:

\ogix) = n*\og(2) + logfø) + log(y) where log(t,) is obtained from a look-up table.

For a power function, the following expression is used:

^ _ _£y logM thereby reducing the computation to the exponential function and the logarithm function as described above.

It is commonly known in the art that higher precision arithmetic may be achieved using a pair of floating point numbers. The second number is usually regarded as a correction term for the first number. In order to preserve accuracy, the reduced argument y is represented as a pair of double precision numbers (yl, y2) so that the sum of yl and y2 will approximate y to more than 73 significant bits. An example of a code written in C computer programming language performing the argument reduction for the exponential function is illustrated below:

/* compute (rh+rt) = x - n*(ln2)/16 */ n = x*invln2_nb+half[xsb]; / ^H determine n by rounding x*16/ln2 */ k = π»NB; j=n&((l«NB)-l); /* k = [π/16], j = n-16k */ c = (double)n;

/* simulate higher precision arithmetic */ tl = x - c*ln2_nbhl; t2 = c*ln2_nbh2; ph = tl-t2; a = c*ln2_nbh3; pt = t2-(tl-ph); t4 = c*ln2_nbh4; if(pt==zero) { qh = ph-t3; qt=t3-(ph-qh); ph = qh; if(qt=zero) pt= -t4; else pt= -(qt+t4); } else pt = t3-(pt-t4);

/* represent result in two floating point numbers */ r = ph+pt; n rh = (double)(float)r; rt = pt-(rh-ph);

The integer n is determined by rounding 16 /ln2 to the nearest integer. Then x - rc ι2)/16 is computed by simulating higher precision arithmetic as follows. First, (ln2)/16 is pre-computed to 144 significant bits and then split into four 36-bit chunks, ln2_nbhl, In2_nbh2, In2_nbh3, In2_nbh4. Each chunk is represented by a floating point number. The purpose of the split is to guarantee tl = -n*ln2_hl as well as t2=n*ln2_h2, t3=n*ln2_h3, and t4=n*ln2_h4 can all be computed exactly in the IEEE double precision arithmetic. By carefully manipulating the sum of tl, t2, t3, and t4 as illustrated in the C code, one is able to distill the sum into two floating point number rh and rt, whose sum will approximate x - nf ln2)/16 to at least 77 bits.

The computation of f(y), where y is represented as two floating point number yl and y2, will also make use of simulated higher precision arithmetic. In general, the computation ofβy) is usually accomplished by using a polynomial such as: f(y)= a₀ + aiy + a₂y² + ... + a_nyⁿ Using a commonly known in the art "Remez" algorithm, coefficients a; are generated to achieve the desired accuracy of f(y). The above polynomial can be computed using a simulated higher precision arithmetic. However, higher precision arithmetic could be slow.

To optimize the performance of computation of the polynomial, an integer k is predetermined such that the "tail" portion of polynomial sny + ... + a_ny" can be computed using normal arithmetic without affecting the desired final accuracy of f(y). Then simulated higher precision arithmetic is used to compute f(y) to achieve more than 73 significant bit accuracy. Then method of the simulation is similar to the code in the computation of the argument reduction.

Finally, fix), represented as a pair of double precision numbers (zl,z2), is retrieved by reversing the argument reduction step from f(y) to achieve at least 73 bits accuracy.

In a block 104, the sum of double precision number zl and z2 are checked for boundary cases by examining the bit pattern from the 54^th significant bit to the 73^rd significant bit. Suppose the rounding mode is nearest/even: if the 54^th significant bit is 0 and the rest of the bits from the 55^th significant bit to the 73^rd significant bit are l's, or if the 54^th significant bit is 1 and the rest of the bits from the 55^th significant bit to the 73^rd significant bit are O's, then this number is identified as too close to the boundary (halfway) number. In this situation, a multi-precision routine as defined below in block 106 must be used for the computation of f(x) again. However, such multi-precision routine actually slows the computation. A multi-precision word length, nw, is initially set to 8 word length yielding 24*(8-3)=120 bits accuracy.

If the sum of double precision number zl and z2 is not close to the boundary cases, it is concluded that rounding the sum of zl and z2 will give a correctly rounded result. In block 108, the rounding is accomplished per IEEE standard 754.

In block 106, a multi-precision arithmetic with nw, initially set to 8 word length, is used to request additional bits of the solution. The number of words for a multi- precision number is called nw. The reliable number of significant bits is 24*(nw-3). The multi-precision number, mx, is an integer array defined by:

mx[0], mx[l], mx[2], mx[3], ..., mx[nw-l]

where mx[0] is the sign word, mx[l] is the exponent word, mx[2] is the leading word, and mx[3] to mx[nw-l] are the 24-bit chung significant words. The addition, subtraction, multiplication, and division of multi-precision numbers are accomplished by calling special subroutines. An example of a multiplication subroutine code written in C language is illustrated below:

/* compute the prouct of a*b */

/* initialize t[i] = b[2]*a[i]+b[3]*a[i-l]+b[4]*a[i-2] */ x = (double)b[2]; y = (double)b[3]; z = (double)b[4]; ta[2] = (double)a[2]; ta[3] = (double)a[3]; t[2] = x*ta[2]; t[3] = x*ta[3]+y*ta[2];

for(i=4;i<nw;i++) { ta[i] = (double)a[i]; t[i] = x*ta[i]+y*ta[i-l]+z*ta[i-2]; } t[nw] = y*ta[nw-l]+z*ta[nw-2]; t[nw+l] = z*ta[nw-l]; t[nw+2] = zero;

n = 3; for (j=5;j<nw;j++) { if(b[j]!=0) { n += 1; y = (double) b[j]; m = 2-j; for (i=j;i<=nw+2;i+=l) t[i] += y*ta[m+i];

} if(n>30) { /* more than 3Q term have been added, need to normalize t[i] to avoid inexact arithmetic */ i for(jj=nw+2;jj>j;jj-) { x = (double)((intχtwom48*t[jj])); t(jj] -= x*two48; t[jj-2] += x;

} n = 0; /* reset n to zero */

The code first converts the "digit" of the multi-precision number a[] to a floating point number. Then it performs multiplication using an algorithm learnt in school. During the computation, care must be taken to avoid inexact arithmetic.

The algorithm of fix) is also re-designed to use the multi-precision arithmetic. For example, to compute exp( ), we first perform the argument reduction y = x - n*lτι2 Then we further reduce y to z = y 2", where n is chosen so that |z|<2^"10 . Now we use Taylor series to compute exp(z) = 1 + 1/(1 !)*z + l/(2!)*z² + 1/(3 !)*z³ + ... and exp(y) is recovered by squaring exp(z) n times: exp(y) = (((exp(z)²)²)-)² Finally the exp(x) is recovered by exp(x) = 2" expO).

A decision block 110 checks whether the multi-precision result of function fix) which has 24*(nw-3) reliable significant bits is too close to boundary cases. The procedure for checking boundary cases is repeated from block 104. If the multi- precision result with nw number of words is a boundary case, nw is replaced with nw+2 in block 112, and block 106 is reiterated. Otherwise, the result from block 110 is the correctly rounded result of function fix). Such result is then rounded in block 108 per IEEE standard 754.

JO FIG. 2 is a system block diagram of a system for implementing the method of the present invention. The number on which a transcendental function will operate on is entered into the system through input 202. Input 202 may be in the form of any input source, such as a keyboard, or data stored in other format readable by input 202. A storage medium 204 for storing the above code can take the form of a hard drive, floppy disk, or laser disc. Storage medium 204 provides instructions to computing system 206 that comprises a central processing unit (CPU) 208 and a floating point unit (FPU) 210. Both CPU 208 and FPU 210 perform operations on the number provided by input 202. An output 212 provides a user with the results of the computation. Such output may be in the form of a display, printed output, or other output form.

While embodiments and applications of this invention have been shown and described, it would be apparent to those skilled in the art having the benefit of this disclosure that many more modifications than mentioned above are possible without departing from the inventive concepts herein. The invention, therefore, is not to be restricted except in the spirit of the appended claims.

li

Claims

CLAIMS What is claimed is:

1. A method for correctly rounding a transcendental function in IEEE double precision arithmetic, the method comprising: using multi-precision arithmetic with a specified word length if the sum of said pair of double precision number is close to a boundary case; repeating said multi-precision arithmetic if the result of said multi-precision arithmetic is close to said boundary case; and rounding the result of said multi-precision arithmetic or the sum of said pair of double precision number.

2. The method according to claim 1 further comprising purging off a plurality of arguments of the transcendental function when an operation of the transcendental function on said plurality of arguments yields an undefined, overflow or underflow result.

3. The method according to claim 2 further comprising reducing said plurality of arguments that are not purged off to within a primary range for computing the function as a pair of double precision number.

4. The method according to claim 3 wherein the transcendental function is at least selected from the group consisting of: sine, cosine, tangent, exponential, logarithm, power, and arc tangent.

5. The method according to claim 4 wherein the sum of said pair of double precision number matches the transcendental function to 73 significant bits.

6. The method according to claim 3 wherein said reducing further comprises: processing the transcendental function with a polynomial; generating a plurality of coefficients of said polynomial; computing said polynomial using said plurality of coefficients;

\

7. The method according to claim 6 wherein said plurality of coefficients of said polynomial is generated using a "Remez" algorithm.

8. The method according to claim 7 wherein said polynomial is computed by determining the tail portion of said polynomial using a predetermined degree of said polynomial.

9. The method according to claim 3 wherein said boundary case is determined by examining the bit pattern of said pair of double precision number from the 54^th significand bit to the 73^rd significand bit.

10. The method according to claim 3 wherein said repeating further comprises increasing said specified word length by two.

11. A program storage device readable by a machine, tangibly embodying a program of instructions readable by the machine to perform a method for correctly rounding a transcendental function in IEEE double precision arithmetic, the method comprising: using multi-precision arithmetic with a specified word length if the sum of said pair of double precision number is close to a boundary case; repeating said multi-precision arithmetic if the result of said multi-precision arithmetic is close to said boundary case; and rounding the result of said multi-precision arithmetic or the sum of said pair of double precision number.

12. The program storage device according to claim 11 wherein the method further comprises purging off a plurality of arguments of the transcendental function when an operation of the transcendental function on said plurality of arguments yields an undefined, overflow or underflow result.

13. The program storage device according to claim 12 wherein the method further comprises reducing said plurality of arguments that are not purged off to within a primary range for computing the function as a pair of double precision number.

'3

14. The program storage device according to claim 11 wherein the transcendental function is at least selected from the group consisting of: sine, cosine, tangent, exponential, logarithm, power, and arc tangent.

15. The program storage device according to claim 11 wherein the sum of said pair of double precision number matches the transcendental function to 73 significant bits.

16. The program storage device according to claim 11 wherein said reducing further comprises: processing the transcendental function with a polynomial; generating a plurality of coefficients of said polynomial; computing said polynomial using said plurality of coefficients;

17. The program storage device according to claim 16 wherein said plurality of coefficients of said polynomial is generated using a "Remez" algorithm.

18. The program storage device according to claim 17 wherein said polynomial is computed by determining the tail part of said polynomial using a predetermined degree of said polynomial.

19. The program storage device according to claim 11 wherein said boundary case is determined by examining the bit pattern of said pair of double precision number from the 54^th significand bit to the 73^rd significand bit.

20. The program storage device according to claim 11 wherein said repeating further comprises increasing said specified word length by two.

H