WO2002052401A2 - Method of obtaining correctly rounded results in ieee double precision format of transcendental functions - Google Patents

Method of obtaining correctly rounded results in ieee double precision format of transcendental functions Download PDF

Info

Publication number
WO2002052401A2
WO2002052401A2 PCT/US2001/049052 US0149052W WO02052401A2 WO 2002052401 A2 WO2002052401 A2 WO 2002052401A2 US 0149052 W US0149052 W US 0149052W WO 02052401 A2 WO02052401 A2 WO 02052401A2
Authority
WO
WIPO (PCT)
Prior art keywords
polynomial
double precision
pair
precision
storage device
Prior art date
Application number
PCT/US2001/049052
Other languages
French (fr)
Other versions
WO2002052401A3 (en
Inventor
Kwok Choi
Original Assignee
Sun Microsystems, Inc.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sun Microsystems, Inc. filed Critical Sun Microsystems, Inc.
Priority to AU2002227437A priority Critical patent/AU2002227437A1/en
Publication of WO2002052401A2 publication Critical patent/WO2002052401A2/en
Publication of WO2002052401A3 publication Critical patent/WO2002052401A3/en

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/544Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices for evaluating functions by calculation

Definitions

  • the present invention relates to data processing systems having a floating point arithmetic unit. More particularly, the present invention relates to a data processing system performing correctly rounded floating point operations.
  • Floating point units conventionally use hardware or software to perform mathematical operations. Regardless of the approach used, the floating point unit can produce an inexact result, that is a result having an error in the last few digits. Rounding the result discards one or more of the trailing significant digits and adjusts the retained part in accordance with some specified rule.
  • Floating point notation is a method of writing numeric quantities by using an exponent and a significand (also called the mantissa).
  • the exponent represents the location of the decimal point in base 10 of the number
  • the significand represents the significant digits of the number.
  • the number 12.846 can also be written as 1.2846 x 10 1 which comprises an exponent of 1 and a significand of 1.2846.
  • the exponent and the significand of the result of a floating point operation exceed a microprocessor's word size.
  • the addition of two numbers having n significand bits might result in a sum of having n+1 bits.
  • the precision of a floating point number is determined by the length of the exponent and the significand. The longer the exponent and the significand length, the more precise it becomes. However, the tradeoff is that calculations take longer because more bits are being manipulated. Commonly, precisions are adopted according to the specifications set forth in Institute Electrical and Electronic Engineers (IEEE) standard 754 (1981). The standard specifies four precisions for a binary machine: single, single extended, double, and double extended. Implementations are not required to have all four precisions.
  • single precision includes 1 sign bit, 7 exponent bits and 24 significand bits.
  • Double precision includes 1 sign bit, 10 exponent bits and 53 significand bits.
  • the sign bit, exponent bits, and significand bits are packed in a 32 bit word for single precision, and a 64 bit word for double precision.
  • a ; method for correctly rounding a result of transcendental function such as sine, cosine, tangent, arc tangent, exponent, logarithm, or power in double precision arithmetic is disclosed. All arguments of the functions that are undefined, overflow or underflow are first purged off. Then the given argument is reduced within a primary range to compute the function as a pair of double precision numbers. Multi-precision arithmetic with a specified word length is used if the sum of the pair of double precision number is close to the boundary case. The multi-precision arithmetic is repeated with greater precision each time until the result of the multi-precision arithmetic is not close to the boundary case. The result of the multi-precision arithmetic or the sum of said pair of double precision numbers is finally rounded according to the given rounding mode.
  • All arguments of the functions that are undefined, overflow or underflow are first purged off. Then the given argument is reduced within a primary range to compute the function as a
  • FIG. 1 is a block diagram illustrating a method for correctly rounding one of the transcendental functions sine, cosine, tangent, arc tangent, exponential, logarithm, and power in IEEE double precision arithmetic in accordance with a specific embodiment of the present invention.
  • FIG. 2 is a system block diagram of a system for implementing the method of the present invention.
  • the components, algorithmic steps, and/or data structures are implemented using a source code programmed in the C source computer programming language.
  • This implementation is not intended to be limiting in any way. Different implementations may be used and may include other types of language, computing platforms, program storage devices and/or computer programs.
  • devices of a less general purpose nature such as hardwired devices, devices relying on FPGA (field programmable gate array) or ASIC (application specific integrated circuit) technology, or the like, may also be used without departing from the scope and spirit of the inventive concepts disclosed herewith.
  • a transcendental function is a transcendental number, i.e., a number that cannot be represented with a finite number of digits. Normally one simply cuts the processing off after a certain specified number of bits, hence resulting in an approximation error, then rounds the number using a conventional rounding algorithm, and declares the result.
  • the problem with this approach is that two different machines could, conceivably, produce different approximation errors and return slightly different rounded results.
  • the precept of the JAVA programming language is that all machines should return the same result.
  • the method of the present invention utilizes an iterative approach to request additional bits of the solution until the number can be correctly rounded even in the presence of an approximation error. Since the number will never be equal to a boundary case (which is the case for a rational number), the iterative process will terminate. It is strongly believed that one will never need to do more than a few iterations to achieve correct roundedness. In particular, the first computation will result in 73 bits accuracy, which will yield correctly rounded result 99.9999% of time. A second iteration will give 120 bits accuracy improving the chances of success to 1-2 "67 .
  • the method of the present invention attempts to achieve "correct rounding" in compliance with IEEE standards without forcing all calculations to be done at a number of bits which would guarantee a solution in all cases. In this way, the correct rounding is performed significantly more efficiently than, for example, a 120 bit calculation were require for all results.
  • FIG. 1 is a block diagram illustrating a general method for correctly rounding one of the transcendental functions sine, cosine, tangent, arc tangent, exponential, logarithm, and power in IEEE double precision arithmetic in accordance with a specific embodiment of the present invention.
  • x is an argument that a transcendental function f(x) operates on.
  • the transcendental function f may be sine, cosine, tangent, exponential, logarithm, power, or arc tangent (or a combination thereof).
  • bit pattern of x is examined. If it is a not a number (NaN), then the code "NaN” is returned. If it is a positive infinite number, then a "+Infinity"code is returned. Next, the range of x is checked. If x is more than the threshold for an overflow condition of the exponential function, then an overflow result is returned. If x is less than the threshold for an underflow condition of the exponential function, then an underflow result is returned.
  • argument x is preferably reduced to argument y such that y falls within a primary range.
  • ⁇ /7z/4. Let n k mod 4. Then the computation of the trigonometric functions are reduced to the computation of sin(v) or cos(y) as indicated in the table below:
  • n is chosen such that Also, y must be less than ln2/32 in absolute value.
  • the value of the exponential of x is equal to
  • n r*2 where l/sqrt(2) ⁇ r ⁇ sqrt(2) .
  • a value is chosen from a table so that
  • y r/t t .
  • log( ) is computed as follows:
  • ⁇ ogix n* ⁇ og(2) + logf ⁇ ) + log(y) where log(t,) is obtained from a look-up table.
  • n is determined by rounding 16 /ln2 to the nearest integer. Then x - rc ⁇ 2)/16 is computed by simulating higher precision arithmetic as follows. First, (ln2)/16 is pre-computed to 144 significant bits and then split into four 36-bit chunks, ln2_nbhl, In2_nbh2, In2_nbh3, In2_nbh4. Each chunk is represented by a floating point number.
  • f(y) where y is represented as two floating point number yl and y2, will also make use of simulated higher precision arithmetic.
  • coefficients a are generated to achieve the desired accuracy of f(y).
  • the above polynomial can be computed using a simulated higher precision arithmetic. However, higher precision arithmetic could be slow.
  • an integer k is predetermined such that the "tail" portion of polynomial sny + ... + a n y" can be computed using normal arithmetic without affecting the desired final accuracy of f(y). Then simulated higher precision arithmetic is used to compute f(y) to achieve more than 73 significant bit accuracy. Then method of the simulation is similar to the code in the computation of the argument reduction.
  • fix represented as a pair of double precision numbers (zl,z2), is retrieved by reversing the argument reduction step from f(y) to achieve at least 73 bits accuracy.
  • a block 104 the sum of double precision number zl and z2 are checked for boundary cases by examining the bit pattern from the 54 th significant bit to the 73 rd significant bit.
  • the rounding mode is nearest/even: if the 54 th significant bit is 0 and the rest of the bits from the 55 th significant bit to the 73 rd significant bit are l's, or if the 54 th significant bit is 1 and the rest of the bits from the 55 th significant bit to the 73 rd significant bit are O's, then this number is identified as too close to the boundary (halfway) number.
  • a multi-precision routine as defined below in block 106 must be used for the computation of f(x) again. However, such multi-precision routine actually slows the computation.
  • nw The number of words for a multi- precision number is called nw.
  • the reliable number of significant bits is 24*(nw-3).
  • the multi-precision number, mx is an integer array defined by:
  • mx[0] is the sign word
  • mx[l] is the exponent word
  • mx[2] is the leading word
  • mx[3] to mx[nw-l] are the 24-bit chung significant words.
  • the addition, subtraction, multiplication, and division of multi-precision numbers are accomplished by calling special subroutines.
  • An example of a multiplication subroutine code written in C language is illustrated below:
  • the code first converts the "digit" of the multi-precision number a[] to a floating point number. Then it performs multiplication using an algorithm learnt in school. During the computation, care must be taken to avoid inexact arithmetic.
  • exp( ) (((exp(z) 2 ) 2 )-) 2
  • exp(x) 2" expO).
  • a decision block 110 checks whether the multi-precision result of function fix) which has 24*(nw-3) reliable significant bits is too close to boundary cases. The procedure for checking boundary cases is repeated from block 104. If the multi- precision result with nw number of words is a boundary case, nw is replaced with nw+2 in block 112, and block 106 is reiterated. Otherwise, the result from block 110 is the correctly rounded result of function fix). Such result is then rounded in block 108 per IEEE standard 754.
  • FIG. 2 is a system block diagram of a system for implementing the method of the present invention.
  • the number on which a transcendental function will operate on is entered into the system through input 202.
  • Input 202 may be in the form of any input source, such as a keyboard, or data stored in other format readable by input 202.
  • a storage medium 204 for storing the above code can take the form of a hard drive, floppy disk, or laser disc.
  • Storage medium 204 provides instructions to computing system 206 that comprises a central processing unit (CPU) 208 and a floating point unit (FPU) 210. Both CPU 208 and FPU 210 perform operations on the number provided by input 202.
  • An output 212 provides a user with the results of the computation. Such output may be in the form of a display, printed output, or other output form.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Mathematics (AREA)
  • Computing Systems (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Complex Calculations (AREA)

Abstract

A method for correctly rounding a result of transcendental function such as sine, cosine, tangent, arc tangent, exponent, logarithm, or power in double precision arithmetic is disclosed. All arguments of the functions that are undefined, overflow or underflow are fist purged off. Then the given argument is reduced within a primary range to compute the function as a pair of double precision numbers. Multi-precision arithmetic with a specified word length is used if the sum of the pair of double precision number is close to the boundary case. The multi-precision arithmetic is repeated with greater precision each time until the result of the multi-precision arithmetic is not close to the boundary case. The result of the multi-precision arithmetic or the sum of said pair of double precision numbers is finally rounded according to the given rounding mode.

Description

S P E C I F I C A T I O N
TITLE OF INVENTION
METHOD OF OBTAINING CORRECTLY ROUNDED RESULTS IN IEEE DOUBLE PRECISION FORMAT OF TRANSCENDENTAL FUNCTIONS
BACKGROUND OF THE INVENTION Field of the Invention
The present invention relates to data processing systems having a floating point arithmetic unit. More particularly, the present invention relates to a data processing system performing correctly rounded floating point operations.
The Background Art
Floating point units conventionally use hardware or software to perform mathematical operations. Regardless of the approach used, the floating point unit can produce an inexact result, that is a result having an error in the last few digits. Rounding the result discards one or more of the trailing significant digits and adjusts the retained part in accordance with some specified rule.
The purpose for rounding is to reduce the number of digits in the result so that the rounded result can fit within the word size of a floating point number in the microprocessor. Floating point notation is a method of writing numeric quantities by using an exponent and a significand (also called the mantissa). The exponent represents the location of the decimal point in base 10 of the number, and the significand represents the significant digits of the number. For example, the number 12.846 can also be written as 1.2846 x 101 which comprises an exponent of 1 and a significand of 1.2846.
In many situations, the exponent and the significand of the result of a floating point operation exceed a microprocessor's word size. For example, the addition of two numbers having n significand bits might result in a sum of having n+1 bits.
I Consequently, an overflow bit results if the length of the significand is only n-bits long. The overflow bit must then be rounded off.
In general, the precision of a floating point number is determined by the length of the exponent and the significand. The longer the exponent and the significand length, the more precise it becomes. However, the tradeoff is that calculations take longer because more bits are being manipulated. Commonly, precisions are adopted according to the specifications set forth in Institute Electrical and Electronic Engineers (IEEE) standard 754 (1981). The standard specifies four precisions for a binary machine: single, single extended, double, and double extended. Implementations are not required to have all four precisions.
By way of illustration as defined in the IEEE standard, single precision includes 1 sign bit, 7 exponent bits and 24 significand bits. Double precision includes 1 sign bit, 10 exponent bits and 53 significand bits. The sign bit, exponent bits, and significand bits are packed in a 32 bit word for single precision, and a 64 bit word for double precision. Once it is determined that a number needs to be rounded to a given precision, a rounding method or mode must then be selected. Some of the more common rounding modes include chopping, rounding up, rounding down, and rounding to nearest/even. Thus, a result is said to be "correctly rounded" if it is rounded according to the rounding mode given.
In the prior art, correctly rounding computations are limited to addition, subtraction, multiplication, division, and square root computations. Rounding a transcendental number produced by transcendental functions such as a sine or cosine function is inherently difficult because a transcendental number can be arbitrarily close, but not equal, to a rational number. Transcendental numbers are numbers that cannot be represented with a finite number of digits. Consequently, one may need to examine as many extra trailing bits as necessary in order to determine the correct rounding. Multi- precision arithmetic is usually required in this case.
For IEEE single precision arithmetic, this problem can be solved by resorting to double precision arithmetic and exhaustive computation of all 232 arguments. However, for double precision arithmetic, exhaustive computation is not currently possible because there are close to 263 double precision numbers, and the cost of multi-precision arithmetic at this level can be expensive.
Therefore, a method is needed to correctly round transcendental function results in double precision floating point without expending an inordinate number of computation clock cycles.
BRIEF DESCRIPTION OF THE INVENTION
A; method for correctly rounding a result of transcendental function such as sine, cosine, tangent, arc tangent, exponent, logarithm, or power in double precision arithmetic is disclosed. All arguments of the functions that are undefined, overflow or underflow are first purged off. Then the given argument is reduced within a primary range to compute the function as a pair of double precision numbers. Multi-precision arithmetic with a specified word length is used if the sum of the pair of double precision number is close to the boundary case. The multi-precision arithmetic is repeated with greater precision each time until the result of the multi-precision arithmetic is not close to the boundary case. The result of the multi-precision arithmetic or the sum of said pair of double precision numbers is finally rounded according to the given rounding mode.
BRIEF DESCRIPTION OF THE DRAWINGS
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate one or more embodiments of the invention and, together with the present description, serve to explain the principles of the invention.
In the drawings:
FIG. 1 is a block diagram illustrating a method for correctly rounding one of the transcendental functions sine, cosine, tangent, arc tangent, exponential, logarithm, and power in IEEE double precision arithmetic in accordance with a specific embodiment of the present invention. FIG. 2 is a system block diagram of a system for implementing the method of the present invention.
DETAILED DESCRIPTION OF A PREFERRED EMBODIMENT One embodiment of the present invention is described herein in the context of algorithm performing correctly rounded floating point operations. Those of ordinary skill in the art will realize that the following description of the present invention is illustrative only and not in any way limiting. Other embodiments of the invention will readily suggest themselves to such skilled persons having the benefit of this disclosure. Reference will now be made in detail to an implementation of the present invention as illustrated in the accompanying drawings. The same reference numbers will be used throughout the drawings and the following description to refer to the same or like parts.
In accordance with a presently preferred embodiment of the present invention, the components, algorithmic steps, and/or data structures are implemented using a source code programmed in the C source computer programming language. This implementation is not intended to be limiting in any way. Different implementations may be used and may include other types of language, computing platforms, program storage devices and/or computer programs. In addition, those of ordinary skill in the art will readily recognize that devices of a less general purpose nature, such as hardwired devices, devices relying on FPGA (field programmable gate array) or ASIC (application specific integrated circuit) technology, or the like, may also be used without departing from the scope and spirit of the inventive concepts disclosed herewith.
In general, except for certain special cases which can be identified easily, for any double precision argument (which happens to be a rational number), the value of a transcendental function is a transcendental number, i.e., a number that cannot be represented with a finite number of digits. Normally one simply cuts the processing off after a certain specified number of bits, hence resulting in an approximation error, then rounds the number using a conventional rounding algorithm, and declares the result. The problem with this approach is that two different machines could, conceivably, produce different approximation errors and return slightly different rounded results. The precept of the JAVA programming language is that all machines should return the same result.
H The simplest way to achieve that is to require that the results of all library functions must be "correctly rounded" as if the results are precise and computed infinitely. Consequently, the rounded number will always be the same no matter what processor (as long as it conforms to the IEEE 754 standard) was used to compute it.
The method of the present invention utilizes an iterative approach to request additional bits of the solution until the number can be correctly rounded even in the presence of an approximation error. Since the number will never be equal to a boundary case (which is the case for a rational number), the iterative process will terminate. It is strongly believed that one will never need to do more than a few iterations to achieve correct roundedness. In particular, the first computation will result in 73 bits accuracy, which will yield correctly rounded result 99.9999% of time. A second iteration will give 120 bits accuracy improving the chances of success to 1-2"67. The method of the present invention attempts to achieve "correct rounding" in compliance with IEEE standards without forcing all calculations to be done at a number of bits which would guarantee a solution in all cases. In this way, the correct rounding is performed significantly more efficiently than, for example, a 120 bit calculation were require for all results.
FIG. 1 is a block diagram illustrating a general method for correctly rounding one of the transcendental functions sine, cosine, tangent, arc tangent, exponential, logarithm, and power in IEEE double precision arithmetic in accordance with a specific embodiment of the present invention.
"x" is an argument that a transcendental function f(x) operates on. The transcendental function f may be sine, cosine, tangent, exponential, logarithm, power, or arc tangent (or a combination thereof).
In a first block 100, all arguments x are first purged off when/(x) is undefined, overflow or underflow. An example of a code in C language for the exponential function is illustrated below:
/* filter out non-finite argument */ if(hx >= 0x40862E42) { /* if |x|>=709.78... */ if(hx>=0x7ff00000) { if(((hx&0xfffff)|LO _WORD(x)) 1=0) return x+x; /* NaN */ else return (xsb==0)? x.0.0; /* exp(+-inf)={inf,0 } */
} if(x > o_threshold) return huge*huge; /* overflow */ if(x < u_threshold) return twoml000*twoml000; /* underflow */
First, the bit pattern of x is examined. If it is a not a number (NaN), then the code "NaN" is returned. If it is a positive infinite number, then a "+Infinity"code is returned. Next, the range of x is checked. If x is more than the threshold for an overflow condition of the exponential function, then an overflow result is returned. If x is less than the threshold for an underflow condition of the exponential function, then an underflow result is returned.
In block 102, argument x is preferably reduced to argument y such that y falls within a primary range. For trigonometric functions such as sine, cosine, and tangent, argument x is reduced to argument y through the following mathematical relation: y=x-k*pi 2 where k is an integer chosen so that |y|</7z/4. Let n= k mod 4. Then the computation of the trigonometric functions are reduced to the computation of sin(v) or cos(y) as indicated in the table below:
Figure imgf000008_0004
For an arctangent function, a special table value t; is chosen so that
Figure imgf000008_0001
will be less than 1/8 in absolute value. Then atan(x) can be computed through the following relationship atan( ) = atan(t +atan(y), where atanfø) is obtained from a lookup table.
For an exponential function, an integer n is chosen such that
Figure imgf000008_0002
Also, y must be less than ln2/32 in absolute value. Let k = [π/16] andj = n-lβk. Then the value of the exponential of x is equal to
Figure imgf000008_0003
For a logarithm function, choose an integer n such that x = r*2 where l/sqrt(2) < r < sqrt(2) . A value is chosen from a table so that | t,-r | < 0.0315. Let y = r/tt. Then log( ) is computed as follows:
\ogix) = n*\og(2) + logfø) + log(y) where log(t,) is obtained from a look-up table.
For a power function, the following expression is used:
^ _ £y logM thereby reducing the computation to the exponential function and the logarithm function as described above.
It is commonly known in the art that higher precision arithmetic may be achieved using a pair of floating point numbers. The second number is usually regarded as a correction term for the first number. In order to preserve accuracy, the reduced argument y is represented as a pair of double precision numbers (yl, y2) so that the sum of yl and y2 will approximate y to more than 73 significant bits. An example of a code written in C computer programming language performing the argument reduction for the exponential function is illustrated below:
/* compute (rh+rt) = x - n*(ln2)/16 */ n = x*invln2_nb+half[xsb]; / H determine n by rounding x*16/ln2 */ k = π»NB; j=n&((l«NB)-l); /* k = [π/16], j = n-16k */ c = (double)n;
/* simulate higher precision arithmetic */ tl = x - c*ln2_nbhl; t2 = c*ln2_nbh2; ph = tl-t2; a = c*ln2_nbh3; pt = t2-(tl-ph); t4 = c*ln2_nbh4; if(pt==zero) { qh = ph-t3; qt=t3-(ph-qh); ph = qh; if(qt=zero) pt= -t4; else pt= -(qt+t4); } else pt = t3-(pt-t4);
/* represent result in two floating point numbers */ r = ph+pt; n rh = (double)(float)r; rt = pt-(rh-ph);
The integer n is determined by rounding 16 /ln2 to the nearest integer. Then x - rc ι2)/16 is computed by simulating higher precision arithmetic as follows. First, (ln2)/16 is pre-computed to 144 significant bits and then split into four 36-bit chunks, ln2_nbhl, In2_nbh2, In2_nbh3, In2_nbh4. Each chunk is represented by a floating point number. The purpose of the split is to guarantee tl = -n*ln2_hl as well as t2=n*ln2_h2, t3=n*ln2_h3, and t4=n*ln2_h4 can all be computed exactly in the IEEE double precision arithmetic. By carefully manipulating the sum of tl, t2, t3, and t4 as illustrated in the C code, one is able to distill the sum into two floating point number rh and rt, whose sum will approximate x - nf ln2)/16 to at least 77 bits.
The computation of f(y), where y is represented as two floating point number yl and y2, will also make use of simulated higher precision arithmetic. In general, the computation ofβy) is usually accomplished by using a polynomial such as: f(y)= a0 + aiy + a2y2 + ... + anyn Using a commonly known in the art "Remez" algorithm, coefficients a; are generated to achieve the desired accuracy of f(y). The above polynomial can be computed using a simulated higher precision arithmetic. However, higher precision arithmetic could be slow.
To optimize the performance of computation of the polynomial, an integer k is predetermined such that the "tail" portion of polynomial sny + ... + any" can be computed using normal arithmetic without affecting the desired final accuracy of f(y). Then simulated higher precision arithmetic is used to compute f(y) to achieve more than 73 significant bit accuracy. Then method of the simulation is similar to the code in the computation of the argument reduction.
Finally, fix), represented as a pair of double precision numbers (zl,z2), is retrieved by reversing the argument reduction step from f(y) to achieve at least 73 bits accuracy.
In a block 104, the sum of double precision number zl and z2 are checked for boundary cases by examining the bit pattern from the 54th significant bit to the 73rd significant bit. Suppose the rounding mode is nearest/even: if the 54th significant bit is 0 and the rest of the bits from the 55th significant bit to the 73rd significant bit are l's, or if the 54th significant bit is 1 and the rest of the bits from the 55th significant bit to the 73rd significant bit are O's, then this number is identified as too close to the boundary (halfway) number. In this situation, a multi-precision routine as defined below in block 106 must be used for the computation of f(x) again. However, such multi-precision routine actually slows the computation. A multi-precision word length, nw, is initially set to 8 word length yielding 24*(8-3)=120 bits accuracy.
If the sum of double precision number zl and z2 is not close to the boundary cases, it is concluded that rounding the sum of zl and z2 will give a correctly rounded result. In block 108, the rounding is accomplished per IEEE standard 754.
In block 106, a multi-precision arithmetic with nw, initially set to 8 word length, is used to request additional bits of the solution. The number of words for a multi- precision number is called nw. The reliable number of significant bits is 24*(nw-3). The multi-precision number, mx, is an integer array defined by:
mx[0], mx[l], mx[2], mx[3], ..., mx[nw-l]
where mx[0] is the sign word, mx[l] is the exponent word, mx[2] is the leading word, and mx[3] to mx[nw-l] are the 24-bit chung significant words. The addition, subtraction, multiplication, and division of multi-precision numbers are accomplished by calling special subroutines. An example of a multiplication subroutine code written in C language is illustrated below:
/* compute the prouct of a*b */
/* initialize t[i] = b[2]*a[i]+b[3]*a[i-l]+b[4]*a[i-2] */ x = (double)b[2]; y = (double)b[3]; z = (double)b[4]; ta[2] = (double)a[2]; ta[3] = (double)a[3]; t[2] = x*ta[2]; t[3] = x*ta[3]+y*ta[2];
for(i=4;i<nw;i++) { ta[i] = (double)a[i]; t[i] = x*ta[i]+y*ta[i-l]+z*ta[i-2]; } t[nw] = y*ta[nw-l]+z*ta[nw-2]; t[nw+l] = z*ta[nw-l]; t[nw+2] = zero;
n = 3; for (j=5;j<nw;j++) { if(b[j]!=0) { n += 1; y = (double) b[j]; m = 2-j; for (i=j;i<=nw+2;i+=l) t[i] += y*ta[m+i];
} if(n>30) { /* more than 3Q term have been added, need to normalize t[i] to avoid inexact arithmetic */ i for(jj=nw+2;jj>j;jj-) { x = (double)((intχtwom48*t[jj])); t(jj] -= x*two48; t[jj-2] += x;
} n = 0; /* reset n to zero */
The code first converts the "digit" of the multi-precision number a[] to a floating point number. Then it performs multiplication using an algorithm learnt in school. During the computation, care must be taken to avoid inexact arithmetic.
The algorithm of fix) is also re-designed to use the multi-precision arithmetic. For example, to compute exp( ), we first perform the argument reduction y = x - n*lτι2 Then we further reduce y to z = y 2", where n is chosen so that |z|<2"10 . Now we use Taylor series to compute exp(z) = 1 + 1/(1 !)*z + l/(2!)*z2 + 1/(3 !)*z3 + ... and exp(y) is recovered by squaring exp(z) n times: exp(y) = (((exp(z)2)2)-)2 Finally the exp(x) is recovered by exp(x) = 2" expO).
A decision block 110 checks whether the multi-precision result of function fix) which has 24*(nw-3) reliable significant bits is too close to boundary cases. The procedure for checking boundary cases is repeated from block 104. If the multi- precision result with nw number of words is a boundary case, nw is replaced with nw+2 in block 112, and block 106 is reiterated. Otherwise, the result from block 110 is the correctly rounded result of function fix). Such result is then rounded in block 108 per IEEE standard 754.
JO FIG. 2 is a system block diagram of a system for implementing the method of the present invention. The number on which a transcendental function will operate on is entered into the system through input 202. Input 202 may be in the form of any input source, such as a keyboard, or data stored in other format readable by input 202. A storage medium 204 for storing the above code can take the form of a hard drive, floppy disk, or laser disc. Storage medium 204 provides instructions to computing system 206 that comprises a central processing unit (CPU) 208 and a floating point unit (FPU) 210. Both CPU 208 and FPU 210 perform operations on the number provided by input 202. An output 212 provides a user with the results of the computation. Such output may be in the form of a display, printed output, or other output form.
While embodiments and applications of this invention have been shown and described, it would be apparent to those skilled in the art having the benefit of this disclosure that many more modifications than mentioned above are possible without departing from the inventive concepts herein. The invention, therefore, is not to be restricted except in the spirit of the appended claims.
li

Claims

CLAIMS What is claimed is:
1. A method for correctly rounding a transcendental function in IEEE double precision arithmetic, the method comprising: using multi-precision arithmetic with a specified word length if the sum of said pair of double precision number is close to a boundary case; repeating said multi-precision arithmetic if the result of said multi-precision arithmetic is close to said boundary case; and rounding the result of said multi-precision arithmetic or the sum of said pair of double precision number.
2. The method according to claim 1 further comprising purging off a plurality of arguments of the transcendental function when an operation of the transcendental function on said plurality of arguments yields an undefined, overflow or underflow result.
3. The method according to claim 2 further comprising reducing said plurality of arguments that are not purged off to within a primary range for computing the function as a pair of double precision number.
4. The method according to claim 3 wherein the transcendental function is at least selected from the group consisting of: sine, cosine, tangent, exponential, logarithm, power, and arc tangent.
5. The method according to claim 4 wherein the sum of said pair of double precision number matches the transcendental function to 73 significant bits.
6. The method according to claim 3 wherein said reducing further comprises: processing the transcendental function with a polynomial; generating a plurality of coefficients of said polynomial; computing said polynomial using said plurality of coefficients;
\
7. The method according to claim 6 wherein said plurality of coefficients of said polynomial is generated using a "Remez" algorithm.
8. The method according to claim 7 wherein said polynomial is computed by determining the tail portion of said polynomial using a predetermined degree of said polynomial.
9. The method according to claim 3 wherein said boundary case is determined by examining the bit pattern of said pair of double precision number from the 54th significand bit to the 73rd significand bit.
10. The method according to claim 3 wherein said repeating further comprises increasing said specified word length by two.
11. A program storage device readable by a machine, tangibly embodying a program of instructions readable by the machine to perform a method for correctly rounding a transcendental function in IEEE double precision arithmetic, the method comprising: using multi-precision arithmetic with a specified word length if the sum of said pair of double precision number is close to a boundary case; repeating said multi-precision arithmetic if the result of said multi-precision arithmetic is close to said boundary case; and rounding the result of said multi-precision arithmetic or the sum of said pair of double precision number.
12. The program storage device according to claim 11 wherein the method further comprises purging off a plurality of arguments of the transcendental function when an operation of the transcendental function on said plurality of arguments yields an undefined, overflow or underflow result.
13. The program storage device according to claim 12 wherein the method further comprises reducing said plurality of arguments that are not purged off to within a primary range for computing the function as a pair of double precision number.
'3
14. The program storage device according to claim 11 wherein the transcendental function is at least selected from the group consisting of: sine, cosine, tangent, exponential, logarithm, power, and arc tangent.
15. The program storage device according to claim 11 wherein the sum of said pair of double precision number matches the transcendental function to 73 significant bits.
16. The program storage device according to claim 11 wherein said reducing further comprises: processing the transcendental function with a polynomial; generating a plurality of coefficients of said polynomial; computing said polynomial using said plurality of coefficients;
17. The program storage device according to claim 16 wherein said plurality of coefficients of said polynomial is generated using a "Remez" algorithm.
18. The program storage device according to claim 17 wherein said polynomial is computed by determining the tail part of said polynomial using a predetermined degree of said polynomial.
19. The program storage device according to claim 11 wherein said boundary case is determined by examining the bit pattern of said pair of double precision number from the 54th significand bit to the 73rd significand bit.
20. The program storage device according to claim 11 wherein said repeating further comprises increasing said specified word length by two.
H
PCT/US2001/049052 2000-12-27 2001-12-17 Method of obtaining correctly rounded results in ieee double precision format of transcendental functions WO2002052401A2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU2002227437A AU2002227437A1 (en) 2000-12-27 2001-12-17 Method of obtaining correctly rounded results in ieee double precision format of transcendental functions

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US75262800A 2000-12-27 2000-12-27
US09/752,628 2000-12-27

Publications (2)

Publication Number Publication Date
WO2002052401A2 true WO2002052401A2 (en) 2002-07-04
WO2002052401A3 WO2002052401A3 (en) 2003-01-23

Family

ID=25027107

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2001/049052 WO2002052401A2 (en) 2000-12-27 2001-12-17 Method of obtaining correctly rounded results in ieee double precision format of transcendental functions

Country Status (2)

Country Link
AU (1) AU2002227437A1 (en)
WO (1) WO2002052401A2 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109271134A (en) * 2018-12-13 2019-01-25 上海燧原科技有限公司 Surmount function operation method and device, storage medium and electronic equipment

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
"CORRECTLY ROUNDED ELEMENTARY FUNCTIONS" IBM TECHNICAL DISCLOSURE BULLETIN, IBM CORP. NEW YORK, US, vol. 33, no. 1A, 1 June 1990 (1990-06-01), pages 353-355, XP000117760 ISSN: 0018-8689 *
LEFEVRE V ET AL: "Towards correctly rounded transcendentals" PROCEEDINGS 13TH IEEE SYMPOSIUM ON COMPUTER ARITHMETIC. ASILOMAR, CA, JULY 6 - 9, 1997, IEEE SYMPOSIUM ON COMPUTER ARITHMETIC, LOS ALAMITOS, CA: IEEE COMP. SOC. PRESS, US, 6 July 1997 (1997-07-06), pages 132-137, XP010241202 ISBN: 0-8186-7846-1 *
LYNCH T ET AL: "The K5 transcendental functions" COMPUTER ARITHMETIC, 1995., PROCEEDINGS OF THE 12TH SYMPOSIUM ON BATH, UK 19-21 JULY 1995, LOS ALAMITOS, CA, USA,IEEE COMPUT. SOC, US, 19 July 1995 (1995-07-19), pages 163-170, XP010146644 ISBN: 0-8186-7089-4 *
SCHULTE M ET AL: "Exact rounding of certain elementary functions" COMPUTER ARITHMETIC, 1993. PROCEEDINGS., 11TH SYMPOSIUM ON WINDSOR, ONT., CANADA 29 JUNE-2 JULY 1993, LOS ALAMITOS, CA, USA,IEEE COMPUT. SOC, 29 June 1993 (1993-06-29), pages 138-145, XP010128520 ISBN: 0-8186-3862-1 *
ZIV A: "FAST EVALUATION OF ELEMENTARY MATHEMATICAL FUNCTIONS WITH CORRECTLY ROUNDED LAST BIT" ACM TRANSACTIONS ON MATHEMATICAL SOFTWARE, ASSOCIATION FOR COMPUTING MACHINERY, NEW YORK, US, vol. 17, no. 3, 1 September 1991 (1991-09-01), pages 410-423, XP000446457 ISSN: 0098-3500 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109271134A (en) * 2018-12-13 2019-01-25 上海燧原科技有限公司 Surmount function operation method and device, storage medium and electronic equipment
CN109271134B (en) * 2018-12-13 2020-08-25 上海燧原科技有限公司 Transcendental function operation method and device, storage medium and electronic equipment

Also Published As

Publication number Publication date
WO2002052401A3 (en) 2003-01-23
AU2002227437A1 (en) 2002-07-08

Similar Documents

Publication Publication Date Title
Lefèvre et al. Toward correctly rounded transcendentals
US5570310A (en) Method and data processor for finding a logarithm of a number
US5404324A (en) Methods and apparatus for performing division and square root computations in a computer
US5892697A (en) Method and apparatus for handling overflow and underflow in processing floating-point numbers
CN1271508C (en) Producing logarithmic signal approximation with vairable precision
EP0421092B1 (en) Method and apparatus for performing mathematical functions using polynomial approximation and a rectangular aspect ratio multiplier
US5954789A (en) Quotient digit selection logic for floating point division/square root
RU2412462C2 (en) Low wattage floating-point processor for selected sub-accuracy
US5671170A (en) Method and apparatus for correctly rounding results of division and square root computations
US5696712A (en) Three overlapped stages of radix-2 square root/division with speculative execution
Cornea-Hasegan et al. Correctness proofs outline for Newton-Raphson based floating-point divide and square root algorithms
US5463574A (en) Apparatus for argument reduction in exponential computations of IEEE standard floating-point numbers
US7865882B2 (en) Fast correctly rounding floating point conversion and identifying exceptional conversion
EP0416308A2 (en) Rectangular array signed digit multiplier
JPH05216620A (en) Method and circuit for normalizing floating point
US20070162535A1 (en) Rounding floating point division results
US6598065B1 (en) Method for achieving correctly rounded quotients in algorithms based on fused multiply-accumulate without requiring the intermediate calculation of a correctly rounded reciprocal
WO2002052401A2 (en) Method of obtaining correctly rounded results in ieee double precision format of transcendental functions
JP6919539B2 (en) Arithmetic processing unit and control method of arithmetic processing unit
EP0377992A2 (en) Floating point division method and apparatus
US20090094307A1 (en) Accuracy improvement in cordic through precomputation of the error bias
CN115357216A (en) Data processing method, medium, electronic device, and program product
CN106528050B (en) Trailing or leading digit predictor
US20220137925A1 (en) Device and method for hardware-efficient adaptive calculation of floating-point trigonometric functions using coordinate rotate digital computer (cordic)
US5710730A (en) Divide to integer

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A2

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT TZ UA UG UZ VN YU ZA ZW

AL Designated countries for regional patents

Kind code of ref document: A2

Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZM ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG

121 Ep: the epo has been informed by wipo that ep was designated in this application
AK Designated states

Kind code of ref document: A3

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT TZ UA UG UZ VN YU ZA ZW

AL Designated countries for regional patents

Kind code of ref document: A3

Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZM ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG

REG Reference to national code

Ref country code: DE

Ref legal event code: 8642

122 Ep: pct application non-entry in european phase
NENP Non-entry into the national phase in:

Ref country code: JP

WWW Wipo information: withdrawn in national office

Country of ref document: JP