US7720900B2: Fused multiply add split for multiple precision arithmetic
Publication number: US7720900B2 (application number US11/223,641)
Authority: US (United States)
Prior art keywords: carry, instruction, addend, result, operands
Prior art date: 2005-09-09
Legal status: Active, expires (the legal status is an assumption and is not a legal conclusion)
Classifications

 G—PHYSICS
 G06—COMPUTING; CALCULATING; COUNTING
 G06F—ELECTRIC DIGITAL DATA PROCESSING
 G06F7/00—Methods or arrangements for processing data by operating upon the order or content of the data handled
 G06F7/38—Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
 G06F7/48—Methods or arrangements for performing computations using exclusively denominational number representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
 G06F7/483—Computations with numbers represented by a non-linear combination of denominational numbers, e.g. rational numbers, logarithmic number system, floating-point numbers
 G06F7/499—Denomination or exception handling, e.g. rounding, overflow
 G06F7/49936—Normalisation mentioned as feature only
 G06F7/544—Methods or arrangements for performing computations using non-contact-making devices, for evaluating functions by calculation
 G06F7/5443—Sum of products
 G06F2207/00—Indexing scheme relating to methods or arrangements for processing data by operating upon the order or content of the data handled
 G06F2207/38—Indexing scheme relating to groups G06F7/38 - G06F7/575
 G06F2207/3804—Details
 G06F2207/386—Special constructional features
 G06F2207/3884—Pipelining
Abstract
Description
1. Field of the Invention
The present invention relates generally to performing floating-point operations in a Central Processing Unit (CPU) of a computing device, and more particularly to an improved floating-point unit for efficiently performing multiple MULTIPLY ADD operations at the rate of one per cycle.
2. Description of the Prior Art
Many compute-intensive applications today use extended-precision fixed-point arithmetic. This includes applications such as conversion between binary and decimal, and public-key algorithms such as Diffie-Hellman, DSA, ElGamal, and (most importantly) RSA. Public-key-algorithm (PKA) cryptography, in particular, has become an essential part of the Internet. The most compute-intensive part of PKA is a modular exponentiation using very large integers, typically 1024 bits, 2048 bits, or even larger. This computation is executed in software using multiple-precision arithmetic. For example, a typical 1024-bit RSA exponentiation requires about 200,000 64-bit multiplies and twice that many 64-bit adds. The computing time for this on a workstation or a personal computer is not normally significant, as it occurs only once per secure-socket-layer (SSL) transaction. At the server, however, where many sessions can be in progress at the same time, this computation tends to be the limiting factor for the number of SSL transactions that can be performed.
The software on the IBM eServer zSeries® (z/OS), available from assignee International Business Machines Corporation, uses 64-bit fixed-point instructions to perform this operation. Fixed-point multiply on the zSeries is relatively slow (a 64-bit multiply typically takes more than 20 cycles) and is not pipelined. Additionally, there are not enough fixed-point registers to keep intermediate results in the registers.
One solution is to implement special cryptographic accelerators. With current technology, it takes several accelerators (usually more than 10) to provide the performance required by one mainframe server. Current technology trends indicate that server performance is increasing faster than accelerator performance, so this imbalance will continue to worsen. Additionally, these accelerators run asynchronously to the central processing unit (CPU), so there is also significant performance overhead in the CPU to interface with the accelerator.
Moreover, most current floating-point improvements are primarily concerned with performance (not function), especially as this applies to denormalized operands. In the application for which the MAA function is intended, denormalized operands do not occur. (Denormalized operands are very tiny values; unnormalized operands can have values in the normal range, but with leftmost zeros in the fraction.) For example, U.S. Pat. Nos. 5,943,249 and 6,732,134 describe processors for performing floating-point operations, but they are concerned with denormalized operands and not normal values. U.S. Pat. Nos. 6,256,655 and 6,904,446 describe floating-point processing that does not meet the criteria for preserving the integrity of the result (e.g., the alignment of the result fractions is affected by the alignment of the input fractions).
It would be highly desirable to provide an improved floating-point unit for providing efficient processing of multiple-precision fixed-point operands.
It is therefore one object of the present invention to provide an improved method of processing floating-point operations in a computer system.
It is another object of the present invention to provide such a method which more efficiently handles the processing of unnormalized floating-point numbers.
Particularly, the present invention is directed to an improved floating-point unit for a computing device, e.g., a server, that provides efficient processing of multiple-precision fixed-point operands, and additionally provides a set of floating-point instructions.
According to one aspect of the invention, the efficient processing of multiple-precision fixed-point operands is based on a basic building block called a multiply-add-add (MAA) function. The MAA building block has four input operands and two output operands, all with the same number of bits, and can be represented mathematically as: H, L = A*B + C + D. The result (H, L) is returned in two parts: a high-order part (H) and a low-order part (L).
The MAA building block has the property that the result always fits with no loss of information (no carry out). The MAA building block also permits the software to use a carry-save technique and permits a parallel algorithm that can be pipelined.
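For illustration only, the no-carry-out property can be checked with a small software model of the MAA function; the function name and the use of Python are a sketch, not part of the disclosed hardware:

```python
# Toy model of the MAA building block H, L = A*B + C + D on n-bit
# unsigned integers (illustrative; not the patented circuit).
def maa(a, b, c, d, n=52):
    """Return (high, low), the two n-bit parts of a*b + c + d."""
    for x in (a, b, c, d):
        assert 0 <= x < (1 << n)
    t = a * b + c + d
    # Worst case: (2**n - 1)**2 + 2*(2**n - 1) = 2**(2*n) - 1,
    # so the result always fits in 2n bits -- no carry out.
    assert t < 1 << (2 * n)
    return t >> n, t & ((1 << n) - 1)
```

Because each part is again an n-bit value, H and L can feed directly into further MAA stages, which is what enables the carry-save, pipelined algorithm described here.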
According to this aspect of the invention, the input operands and result parts are all floating-point operands in the same format. Each result part is an entire floating-point number, complete with sign, exponent, and fraction, and thus each can be used directly, with no conversion, as an input to the next stage, which is another MAA.
Thus a key aspect of the invention is the capability of generating a two-part result, both parts in a format compatible with the inputs. A further key aspect is the preservation of the integrity of the result. To preserve the integrity of the result:
The alignment of the resulting fractions must not be affected by the alignment of the input fractions. (The alignment is affected by the input exponents.); and
The resulting exponents must be a function only of the input exponents and not affected by the alignment of the input fractions.
According to one aspect of the present invention, there is provided a computing system having an arithmetic logic unit adapted to produce both a high-order part (H) and a low-order part (L) of a fused multiply-add operation result according to H, L = A*B + C, where A, B are input operands and C an addend, and where each part is formatted the same as the format of the input operands, wherein alignment of the result is not affected by alignment of the input operands.
According to a further aspect of the present invention, there is provided an arithmetic logic unit apparatus for processing an instruction for calculating A×B+C, the instruction indicating a plurality of operands (A, B) including an addend (C). The apparatus comprises:

 a. a multiplier means for performing a multiplication of the A and B operands to obtain an intermediate partial sum result and partial carry results;
 b. a carry-save adder block for receiving the partial sum and carry expression and generating the explicit value of the result in a double wide format;
 c. an aligner means for aligning, in parallel operation, the C operand to the product fraction, and generating the aligned addend, which is in the range of the product; and
 d. a carry-propagate adder means for generating an intermediate extended result in a double wide format; and
 e. means for suppressing left-alignment of the intermediate extended result, whereby input operands for a subsequent A×B+C operation remain right-aligned.
According to a further aspect of the present invention, there is provided a method of processing an instruction in an arithmetic logic unit, the instruction indicating a plurality of operands (A, B) including an addend (C). The method comprises:

 a. receiving, by an arithmetic logic unit, input operands A, B and C, said arithmetic logic unit including a hardware structure for executing an instruction for calculating A×B+C;
 b. performing a multiplication of the A and B operands in a Multiplier block to obtain an intermediate partial sum result and partial carry results;
 c. inputting said partial sum and carry expression to a carry-save adder block that generates the explicit value of the result in a double wide format;
 d. aligning, in parallel operation, the C addend to the product and generating the aligned addend;
 e. generating an intermediate extended result in a carry-propagate adder that produces a result in a double wide format; and
 f. suppressing left-alignment of said intermediate extended result, whereby input operands for a subsequent A×B+C operation remain right-aligned.
The objects, features and advantages of the present invention will become apparent to one skilled in the art, in view of the following detailed description taken in combination with the attached drawings, in which:
The floating-point unit of a zSeries machine is capable of performing 64-bit hexadecimal floating-point (HFP) MULTIPLY ADDs at the rate of one per cycle. According to the invention, a new set of instructions is defined which can utilize this same data flow with very minor changes. One significant change is the elimination of the final normalization step, and the provision of new instructions named MULTIPLY AND ADD UNNORMALIZED and MULTIPLY UNNORMALIZED.
Basically, the mathematics of modular exponentiation is reformulated into a “carry-save” approach which permits the computations to be performed in parallel, thus utilizing the full pipelining of the hardware in an inner loop; the basic unit of the inner loop is one multiply and two adds (MAA). These new instructions can perform a 52-bit MAA in 5 cycles. (The fixed-point instructions can perform a 64-bit MAA in 28 cycles.)
In operation, the three different operands are available in Areg, Breg and Creg. Mantissas of the A operand 113 and B operand 116 are multiplied in the Multiplier block 123 to obtain a partial “sum” result 133 and a partial “carry” expression 136 of the product. These product expressions are in carry-save form with a width of 112 bits each. The intermediate partial sum result and partial carry result 133, 136 and the aligned addend fraction 139 are input to a carry-save adder block 143 that generates the result value in a double wide format comprising 112 bits. This result is again in carry-save form, consisting of two partial results 144, 145, which are both in that double wide format (i.e., HFP extended). That is, the carry-save adder condenses the three 112-bit-wide operands into two. Here a fourth operand could be added, when the carry-save adder is expanded from 3:2 to 4:2. That is, as shown in
In parallel operation, the C operand (addend) 119 is aligned to the product fraction by comparing the product exponent and the addend exponent. The difference delivers the align shift amount (positive or negative) by which the addend fraction has to be shifted. The result of this is a 112-bit-wide aligned addend fraction 139. It is the 112-bit-wide window of the extended C fraction which is in the range of the product.
The Main Adder block 153 generates the intermediate extended result 160. It is a carry-propagate adder, which delivers the explicit value of the fraction result in a double wide format (e.g., 112 bits in an example implementation). The 112-bit intermediate result 160 may thus be split as two 56-bit fractions, referred to as a high-order part (H) and a low-order part (L), that may be placed in a floating-point register pair (not shown).
For normal floating-point operation, the Normalizer block 153 would perform a left shift to remove leading zero bits of the result. However, according to the invention, this normalization is suppressed. The basic operation performed by the Normalizer 153 is a shift-left operation. The Normalizer obtains the NormShiftAmount, as indicated by receipt of a signal 150 from the control block 101, which indicates how many leading zeros are in the adder output. The suppressing of normalization is performed by forcing that NormShiftAmount to zero, independent of the actual adder result.
The Rounder/Reformatter block again generates the 64-bit HFP format with exponent and 56-bit fraction. Rounding is done by truncation, if needed. At this point, the high-order part (H) or low-order part (L) result is selected. When both results are required, an additional cycle is needed.
As further shown in

 a. Exp(Prod):=Exp(A)+Exp(B)−x′40′
 b. where x′40′ is a bias, in an example embodiment; and compute an AlignShiftAmount according to:
 c. SA(Align):=Exp(Prod)−Exp(C)=Exp(A)+Exp(B)−x′40′−Exp(C);
3) perform the multiplication and alignment, perform the carry-save addition of the Multiplier and Aligner outputs, and perform the main addition to retrieve the extended result; 4) suppress normalization by forcing NormShiftAmount=0; and 5) depending on the instruction, the High, Low, or both result parts are taken and written back as the result in HFP form. The exponent is combined with the fraction.
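The exponent bookkeeping in the steps above can be modeled in software as follows; the characteristics are the biased HFP exponents, and the function names are illustrative, not part of the disclosure:

```python
# Sketch of the HFP exponent handling; x'40' (= 64) is the exponent bias.
BIAS = 0x40

def product_characteristic(char_a, char_b):
    """Exp(Prod) := Exp(A) + Exp(B) - x'40'."""
    return char_a + char_b - BIAS

def align_shift_amount(char_a, char_b, char_c):
    """SA(Align) := Exp(Prod) - Exp(C); the sign gives the shift direction."""
    return product_characteristic(char_a, char_b) - char_c
```

A positive shift amount means the addend fraction lies to the right of the product fraction, a negative one that it lies to the left.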

 a. MAYH: High Result
 b. MAYL: Low Result (Exponent := HighExponent − 14, i.e., x′E′)
 c. MAY: High Result and Low Result (one extra cycle necessary)
Thus, according to the invention, the control block implements logic for forcing a shift amount of zero, or in other words, suppressing the normalization. Advantageously, the control logic, implemented by control block 101, is less expensive to design, as it can be synthesized and does not need the costly manual custom design that the fraction dataflow requires. It is less timing-critical than the dataflow, which limits the cycle time and performance.
Moreover, the fraction dataflow is designed and implemented to allow fast execution of a floating-point MULTIPLY AND ADD instruction. It can be taken unchanged for the unnormalized MULTIPLY AND ADD. With that, a pipelined performance of one instruction per cycle is possible. For instructions which need to write the extended result, two cycles per instruction are necessary.
It should be understood that handling of negative signs in the context of the invention is implemented in a manner similar to current floating-point MAA units. It is noted that, as each of the operands can have a negative sign, there is a differentiation between “effective addition” and “effective subtraction”. For “effective subtraction”, the C operand is inverted and, after the Main Adder, the two's complement is used, which has the effect of a subtraction.
According to the invention, the new HFP instructions MULTIPLY UNNORMALIZED and MULTIPLY AND ADD UNNORMALIZED are extensions to the hardware required to implement the HFP instructions MULTIPLY AND ADD and MULTIPLY AND SUBTRACT. Further details of the HFP instructions may be found in the z/Architecture “Principles of Operation”, SA22-7832-02, Chapter 18, dated June 2003, the whole contents and disclosure of which are incorporated by reference as if fully set forth herein. According to the invention, twelve operation codes are defined, all of which are simple variations (or subsets) of the following function:
(table not reproduced)
where the source operands are HFP long (56-bit fractions); the multiply and add operations are performed without normalization; the intermediate result is HFP extended (112-bit fraction split as two 56-bit fractions called the high-order part and low-order part); and the value returned is placed into a target location designated by a field.
The instructions are now described in greater detail as follows:
HFP Multiply Unnormalized
The HFP MULTIPLY UNNORMALIZED instruction in a first variation has a structure as follows:
with a first variation MYR utilizing Long HFP multiplier and multiplicand (operands) producing an extended HFP product; a second variation MYHR utilizing Long HFP multiplier and multiplicand (operands) producing a high-order part of the extended HFP product; and a third variation MYLR utilizing Long HFP multiplier and multiplicand operands producing a low-order part of the extended HFP product.
In a second variation, the Multiply Unnormalized instruction has a structure as follows:
with a first variation MY utilizing Long HFP multiplier and multiplicand (operands) producing an extended HFP product; a second variation MYH utilizing Long HFP multiplier and multiplicand producing a high-order part of the extended HFP product; and a third variation MYL utilizing Long HFP multiplier and multiplicand producing the low-order part of the extended HFP product.
In both instructions, the second and third HFP operands are multiplied, forming an intermediate product, which, in turn, is used to form an intermediate extended result. All (or a part) of the intermediate extended result is placed in the floating-point-register pair (or floating-point register) designated by the R1 field. The operands, intermediate values, and results are not normalized to eliminate leading hexadecimal zeros. Multiplication of two HFP numbers consists of exponent addition and fraction multiplication. The sum of the characteristics of the second and third operands, less 64, is used as the characteristic of the high-order part of the intermediate product; this value is independent of whether the result fraction is zero. The characteristic of the intermediate product is maintained correctly and does not wrap.
The high-order characteristic of the intermediate extended result is set to the characteristic of the intermediate product, modulo 128. The low-order characteristic of the intermediate extended result is set to 14 less than the high-order characteristic, modulo 128. Wraparound of the characteristic is independent of whether the result fraction is zero. In all cases, the second- and third-operand fractions have 14 digits; the intermediate-product fraction contains 28 digits and is an exact product of the operand fractions. The intermediate-product fraction is not inspected for leading hexadecimal zero digits and is used without shifting as the fraction of the intermediate extended result. The sign of the result is the exclusive OR of the operand signs, including the case when the result fraction is zero.
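These characteristic rules can be sketched in software as follows; this is an illustrative model of the stated arithmetic, not the hardware implementation:

```python
# Sketch of the result-characteristic rules: the high-order characteristic
# is the intermediate-product characteristic modulo 128, and the low-order
# characteristic sits 14 hex digits (x'E') below it, also modulo 128.
def extended_result_characteristics(char2, char3):
    prod_char = char2 + char3 - 64      # sum of characteristics, less 64
    high = prod_char % 128              # wraps around; no exception raised
    low = (high - 14) % 128
    return high, low
```

Note that the modulo-128 wrap replaces the exponent-overflow and exponent-underflow exceptions of the normalized instructions.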
For MY and MYR, the entire intermediate extended result is placed in the floating-point register pair designated by the R1 field. For MYH and MYHR, the high-order part of the intermediate extended result is placed in the floating-point register designated by the R1 field and the low-order part is discarded. For MYL and MYLR, the low-order part of the intermediate extended result is placed in the floating-point register designated by the R1 field and the high-order part is discarded. HFP-exponent-overflow and HFP-exponent-underflow exceptions are not recognized. Characteristics of the intermediate extended result wrap around modulo 128 and no exception is reported.
The R1 field for MY and MYR must designate a valid floating-point-register pair. Otherwise, a specification exception is recognized.
It is understood that HFP MULTIPLY UNNORMALIZED differs from HFP MULTIPLY in the following ways: 1) source operands are not normalized to eliminate leading hexadecimal zeros; 2) the intermediate-product fraction is not inspected for leading hexadecimal zeros and no normalization occurs; 3) HFP exponent overflow and HFP exponent underflow are not recognized; and 4) zero fractions are not forced to true zero.
HFP Multiply and Add Unnormalized
The HFP MULTIPLY AND ADD UNNORMALIZED instruction according to a first variation has a structure as follows:
with a first variation MAYR utilizing Long HFP sources producing an extended HFP result; a second variation MAYHR utilizing Long HFP sources to produce a high-order part of an extended HFP result; and a third variation MAYLR utilizing Long HFP sources to produce the low-order part of an extended HFP result.
In a second variation, the Multiply And Add Unnormalized instruction has a structure as follows:
with a first variation MAY utilizing Long HFP sources producing an extended HFP result; a second variation MAYH utilizing Long HFP sources to produce the high-order part of the extended HFP result; and a third variation MAYL utilizing Long HFP sources to produce the low-order part of the extended HFP result.
The second and third HFP operands are multiplied, forming an intermediate product; the first operand (addend) is then added algebraically to the intermediate product to form an intermediate sum; the intermediate-sum fraction is truncated on the left or on the right, if need be, to form an intermediate extended result. All (or a part) of the intermediate extended result is placed in the floating-point-register pair (or floating-point register) designated by the R1 field. The operands, intermediate values, and results are not normalized to eliminate leading hexadecimal zeros.
In addition to the register-to-register variation, whereby the second operand is in a floating-point register and is designated by the R2 field (in the RRF-format instruction), this instruction includes a storage-to-register variation, whereby the second operand is in storage and is designated by the X2, B2, and D2 fields (in an RXF-format instruction). In all variations, the third operand, the multiplicand, is in a floating-point register and is designated by the R3 field in the instruction. Moreover, in all variations, the target location is designated by the R1 field in the instruction. For MULTIPLY AND ADD UNNORMALIZED, the R1 field also designates the addend. When, for MULTIPLY AND ADD UNNORMALIZED, the target location is one floating-point register, the same floating-point register is used as both the addend and the target. When the target location is a floating-point-register pair, the R1 field may designate either the lower-numbered or higher-numbered register of the pair; thus, the first operand may be located in either of the two registers of the floating-point-register pair into which the extended result is placed.
The MULTIPLY AND ADD UNNORMALIZED operations may be summarized as:
(table not reproduced)
Multiplication of two HFP numbers consists of exponent addition and fraction multiplication. The sum of the characteristics of the second and third operands, less 64, is used as the characteristic of the high-order part of the intermediate product; this value is independent of whether the result fraction is zero. The characteristic of the intermediate product is maintained correctly and does not wrap.
In all cases, the second- and third-operand fractions have 14 digits; the intermediate-product fraction contains 28 digits and is an exact product of the operand fractions. The intermediate-product fraction is not inspected for leading hexadecimal zero digits and is used without shifting in the subsequent addition.
In all cases, the first operand is located in the floating-point register designated by the R1 field, and the first-operand fraction has 14 digits. Addition of two HFP numbers consists of characteristic comparison, fraction alignment, and signed fraction addition. The characteristics of the intermediate product and the addend are compared. If the characteristics are equal, no alignment is required. If the characteristic of the addend is smaller than the characteristic of the product, the fraction of the addend is aligned with the product fraction by a right shift, with its characteristic increased by one for each hexadecimal digit of shift. If the characteristic of the addend is larger than the characteristic of the product, the fraction of the addend is aligned with the product fraction by a left shift, with its characteristic decreased by one for each hexadecimal digit of shift. Shifting continues until the two characteristics agree. All hexadecimal digits shifted out are preserved and participate in the subsequent addition.
After alignment, the fractions with signs are added algebraically to form a signed intermediate sum. The fraction of the intermediate sum is maintained exactly. The intermediate-sum fraction is not inspected for leading hexadecimal zero digits and is not shifted. Only those 28 hexadecimal digits of the intermediate-sum fraction which are aligned with the 28 hexadecimal digits of the intermediate-product fraction are used as the fraction of the intermediate extended result.
The high-order characteristic of the intermediate extended result is set to the characteristic of the intermediate product, modulo 128. The low-order characteristic of the intermediate extended result is set to 14 less than the high-order characteristic, modulo 128. Wraparound of the characteristic is independent of whether the result fraction is zero.
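A minimal arithmetic model of this align-add-and-window step follows; it handles positive values only and omits signs and the characteristic bookkeeping, so it is an illustration of the stated digit rules, not the hardware:

```python
from fractions import Fraction

# Toy model of the addition step: form the exact algebraic sum, then keep
# only the 28 hex digits aligned with the intermediate-product fraction.
# A 28-digit product fraction p with characteristic c represents
# p * 16**(c - 64 - 28); a 14-digit addend a with characteristic d
# represents a * 16**(d - 64 - 14).
def intermediate_extended_fraction(prod_frac, prod_char, add_frac, add_char):
    s16 = Fraction(16)
    total = (prod_frac * s16 ** (prod_char - 64 - 28)
             + add_frac * s16 ** (add_char - 64 - 14))   # exact sum
    # Re-express the sum on the product's 28-digit grid, then truncate:
    window = total / s16 ** (prod_char - 64 - 28)
    return int(window) % 16 ** 28   # int() truncates right, % truncates left
```

Using exact rationals makes the rule "all digits shifted out are preserved and participate in the addition" hold trivially; the truncation to the product-aligned window happens only at the very end.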
The sign of the result is determined by the rules of algebra, unless the entire intermediate-sum fraction is zero, in which case the sign of the result is made positive. For MAY and MAYR, the entire intermediate extended result is placed in the floating-point register pair designated by the R1 field; the R1 field may designate either the lower-numbered or higher-numbered register of a floating-point register pair. For MAYH and MAYHR, the high-order part of the intermediate extended result is placed in the floating-point register designated by the R1 field and the low-order part is discarded.
For MAYL and MAYLR, the low-order part of the intermediate extended result is placed in the floating-point register designated by the R1 field and the high-order part is discarded. HFP-exponent-overflow and HFP-exponent-underflow exceptions are not recognized. Characteristics of the intermediate extended result wrap around modulo 128 and no exception is reported.
It should be understood that MULTIPLY AND ADD UNNORMALIZED can be used to efficiently perform multiple-precision arithmetic on numbers of any arbitrary size. This is accomplished by organizing the numbers into big digits of 52 bits each, with each big digit maintained as an integer in the HFP long format. Using a radix of 2^52 with big digits which can hold up to 56 bits provides a redundant representation. This redundant representation permits multiplication and addition using a “carry save” technique and permits maximum utilization of the floating-point pipeline.
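The big-digit organization can be sketched in software as follows; the helper names are illustrative and the model uses plain integers where the hardware would use HFP long values:

```python
# Redundant base-2**52 "big digit" representation: each digit fits in the
# 56-bit fraction of an HFP long value, leaving 4 bits of headroom so a
# multiply-accumulate step never overflows a digit.
RADIX_BITS = 52
RADIX = 1 << RADIX_BITS

def to_big_digits(n, count):
    """Split a non-negative integer into base-2**52 digits, LSD first."""
    return [(n >> (RADIX_BITS * i)) % RADIX for i in range(count)]

def from_big_digits(digits):
    """Reassemble an integer from its base-2**52 digits."""
    return sum(d << (RADIX_BITS * i) for i, d in enumerate(digits))

def maa_digit(a, b, c, d):
    """One carry-save MAA step: the low 52 bits are the 'sum' digit; the
    rest is the 'carry' passed to the next stage instead of rippling."""
    t = a * b + c + d
    return t % RADIX, t >> RADIX_BITS
```

In this scheme the carry from one digit position is simply added into the next position on a later iteration, which is what lets consecutive MAA operations on independent digits fill the floating-point pipeline.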
Further, by setting the multiplier to an integer value of 1 with the proper characteristic, the multiplicand can be scaled by any power of 16 and then added to the addend. This permits, for example, adding the “carry” from one stage of a multiplication to the “sum” of the next stage to the left. In the same manner, the “sum” of one stage can be scaled to be added to the “carry” of the stage to the right.
Moreover, it should be understood that in the first round of a multiply-and-accumulate, the step of clearing the accumulated value to zero may be avoided by using MULTIPLY UNNORMALIZED instead of MULTIPLY AND ADD UNNORMALIZED.
HFP MULTIPLY AND ADD UNNORMALIZED differs from HFP MULTIPLY AND ADD in the following ways: 1) source operands are not normalized to eliminate leading hexadecimal zeros; 2) when the characteristic of the intermediate product and the characteristic of the addend differ, the addend is always shifted; 3) there is no shifting after the addition; only the rightmost 28 digits of the intermediate sum are preserved in the intermediate extended result; 4) the low-order part of the intermediate extended result can be returned; 5) HFP exponent overflow and HFP exponent underflow are not recognized; and 6) zero fractions are not forced to true zero.
Advantageously, the HFP MULTIPLY UNNORMALIZED and HFP MULTIPLY AND ADD UNNORMALIZED instructions can be run on any CPU in an IBM System z9 or an IBM eServer® zSeries® system (e.g., a zSeries 990 (z990, z900) or zSeries 890 (z890)). Thus, the solution scales with the number of CPUs and with CPU performance. The use and advantages of the proposed facility include:
It utilizes the floating-point hardware pipeline to multiply two 56-bit fractions to produce a 112-bit intermediate-product fraction, then add a 56-bit addend fraction to produce a 112-bit result fraction. The expected latency is seven cycles, but throughput is expected to be one HFP long result every cycle. Either the low-order part or the high-order part can be returned at the rate of one per cycle, or the entire 112-bit fraction can be returned in two cycles. This is contrasted with the instructions MULTIPLY LOGICAL (MLG), which multiplies two 64-bit unsigned integers to form a 128-bit unsigned product, and ADD LOGICAL WITH CARRY (ALCG), which adds two 64-bit unsigned integers. MLG and ALCG take 20 cycles and 2 cycles, respectively, and are not pipelined.
Use of the 16 floating-point registers (FPRs) for intermediate results greatly reduces the number of load and store operations. As an example, the basic multiply-accumulate step is reduced from 5 cycles per big digit, if load and store are necessary, to 3 cycles per big digit if the results can be maintained in the FPRs. This is contrasted with MLG and ALCG, which use general registers as accumulators, with much less opportunity to keep intermediate results in registers.
Computations are performed on big digits in the HFP long format. The HFP long format has a 56-bit fraction, but a radix of 2^52 is used. This redundant representation permits multiplication and addition of larger numbers without intermediate overflow. For example, the product of two 53-bit values can be added to a 56-bit intermediate sum to produce a 52-bit “sum” and a 55-bit “carry” without overflow. The 55-bit “carry” can be added to the 52-bit “sum” of the next digit to form the next 56-bit intermediate sum. This technique is called “carry save”, as carries do not need to ripple all the way from right to left during an addition. Use of the “carry save” technique maximizes utilization of the floating-point pipeline. Thus, the basic multiply-accumulate step is 3 cycles per big digit, compared to a ripple carry, which would require 7 cycles per big digit.
Use of the HFP format, including the exponent, permits automatic scaling of numbers in the redundant format. The basic multiply-accumulate step involves one multiply and two adds. The first add is included with the multiply. The second add, which combines the “carry” from one stage into the next stage, is performed using MULTIPLY ADD UNNORMALIZED (MAYLR) rather than ADD UNNORMALIZED (AWR). This permits scaling of the exponent to properly align the value for addition. Because the multiply-add instructions are expected to be pipelined at a rate of one instruction per cycle, there is very little additional overhead in using MAYLR rather than AWR.
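Putting the pieces together, the per-digit loop can be sketched as one fused multiply-add per position, with the carry folded into the next stage. This is a hedged integer model: the name `mp_mul_digit`, the little-endian digit list, and the optional `addend` parameter are assumptions, and a plain integer add stands in for the exponent-scaled carry fold that MAYLR performs in the hardware:

```python
RADIX_BITS = 52
MASK = (1 << RADIX_BITS) - 1  # 52-bit big-digit mask

def mp_mul_digit(digits, y, addend=0):
    """Multiply a little-endian base-2^52 number by one digit y and
    add `addend`, carrying between stages in the carry-save pattern
    described above."""
    out, carry = [], addend
    for x in digits:
        t = x * y + carry        # multiply with the first add fused in
        out.append(t & MASK)     # low 52 bits: this position's "sum"
        carry = t >> RADIX_BITS  # high bits: "carry" into next stage
    out.append(carry)            # final carry becomes the top digit
    return out
```

Each loop iteration corresponds to the 3-cycle multiply-accumulate step described above; in software-model form, correctness can be checked by reassembling the digit list into a single integer.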
While it is apparent that the invention herein disclosed is well calculated to fulfill the objects stated above, it will be appreciated that numerous modifications and embodiments may be devised by those skilled in the art and it is intended that the appended claims cover all such modifications and embodiments as fall within the true spirit and scope of the present invention.
Claims (14)
Priority Applications (1)
Application Number  Priority Date  Filing Date  Title 

US11/223,641 US7720900B2 (en)  20050909  20050909  Fused multiply add split for multiple precision arithmetic 
Applications Claiming Priority (2)
Application Number  Priority Date  Filing Date  Title 

US11/223,641 US7720900B2 (en)  20050909  20050909  Fused multiply add split for multiple precision arithmetic 
CNA2006101281972A CN1928809A (en)  20050909  20060906  System, apparatus and method for performing floating-point operations 
Publications (2)
Publication Number  Publication Date 

US20070061392A1 US20070061392A1 (en)  20070315 
US7720900B2 true US7720900B2 (en)  20100518 
Family
ID=37856572
Family Applications (1)
Application Number  Title  Priority Date  Filing Date 

US11/223,641 Active 20290318 US7720900B2 (en)  20050909  20050909  Fused multiply add split for multiple precision arithmetic 
Country Status (2)
Country  Link 

US (1)  US7720900B2 (en) 
CN (1)  CN1928809A (en) 
Cited By (5)
Publication number  Priority date  Publication date  Assignee  Title 

US9430190B2 (en)  20130227  20160830  International Business Machines Corporation  Fused multiply add pipeline 
US10241756B2 (en)  20170711  20190326  International Business Machines Corporation  Tiny detection in a floating-point unit 
US10242423B2 (en)  20170428  20190326  Intel Corporation  Compute optimizations for low precision machine learning operations 
US10255656B2 (en)  20170424  20190409  Intel Corporation  Compute optimization mechanism 
US10303438B2 (en)  20170116  20190528  International Business Machines Corporation  Fused-multiply-add floating-point operations on 128-bit wide operands 
Families Citing this family (14)
Publication number  Priority date  Publication date  Assignee  Title 

US7254698B2 (en)  20030512  20070807  International Business Machines Corporation  Multifunction hexadecimal instructions 
US8073892B2 (en) *  20051230  20111206  Intel Corporation  Cryptographic system, method and multiplier 
US7912887B2 (en) *  20060510  20110322  Qualcomm Incorporated  Mode-based multiply-add recoding for denormal operands 
US8838663B2 (en)  20070330  20140916  Intel Corporation  Method and apparatus for performing multiplicative functions 
US8078660B2 (en) *  20070410  20111213  The Board Of Regents, University Of Texas System  Bridge fused multiply-adder circuit 
CN100570552C (en)  20071220  20091216  清华大学  Paralleling floating point multiplication addition unit 
US8046399B1 (en) *  20080125  20111025  Oracle America, Inc.  Fused multiply-add rounding and unfused multiply-add rounding in a single multiply-add module 
US8046400B2 (en)  20080410  20111025  Via Technologies, Inc.  Apparatus and method for optimizing the performance of x87 floating point addition instructions in a microprocessor 
US8499017B2 (en) *  20090812  20130730  Arm Limited  Apparatus and method for performing fused multiply add floating point operation 
CN102339217B (en) *  20100727  20140910  中兴通讯股份有限公司  Fusion processing device and method for floating-point number multiplication-addition device 
US8914430B2 (en)  20100924  20141216  Intel Corporation  Multiply add functional unit capable of executing scale, round, GETEXP, round, GETMANT, reduce, range and class instructions 
US9389871B2 (en)  20130315  20160712  Intel Corporation  Combined floating point multiplier adder with intermediate rounding logic 
DE112013007736T5 (en) *  20131228  20161222  Intel Corporation  RSA algorithm acceleration processors, procedures, systems and commands 
US9996320B2 (en) *  20151223  20180612  Intel Corporation  Fused multiply-add (FMA) low functional unit 
Citations (5)
Publication number  Priority date  Publication date  Assignee  Title 

US4969118A (en) *  19890113  19901106  International Business Machines Corporation  Floating point unit for calculating A=XY+Z having simultaneous multiply and add 
US5928316A (en) *  19961118  19990727  Samsung Electronics Co., Ltd.  Fused floating-point multiply-and-accumulate unit with carry correction 
US6061707A (en) *  19980116  20000509  International Business Machines Corporation  Method and apparatus for generating an end-around carry in a floating-point pipeline within a computer system 
US6256655B1 (en) *  19980914  20010703  Silicon Graphics, Inc.  Method and system for performing floating point operations in unnormalized format using a floating point accumulator 
US6751644B1 (en) *  19990915  20040615  Sun Microsystems, Inc.  Method and apparatus for elimination of inherent carries 
Family Cites Families (4)
Publication number  Priority date  Publication date  Assignee  Title 

US6209106B1 (en) *  19980930  20010327  International Business Machines Corporation  Method and apparatus for synchronizing selected logical partitions of a partitioned information handling system to an external time reference 
SE513899C2 (en) *  19990112  20001120  Ericsson Telefon Ab L M  Method and arrangement for synchronizing 
US6763474B1 (en) *  20000803  20040713  International Business Machines Corporation  System for synchronizing nodes in a heterogeneous computer system by using multistage frequency synthesizer to dynamically adjust clock frequency of the nodes 
US6826123B1 (en) *  20031014  20041130  International Business Machines Corporation  Global recovery for time of day synchronization 

20050909: US application US11/223,641 filed (patent US7720900B2, active)
20060906: CN application CNA2006101281972A filed (publication CN1928809A, application discontinued)
Also Published As
Publication number  Publication date 

CN1928809A (en)  20070314 
US20070061392A1 (en)  20070315 
Legal Events
Date  Code  Title  Description 

AS  Assignment 
Owner name: INTERNATIONAL BUSINESS MACHINES CORPORATION, NEW YORK
Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:GERWIG, GUENTER;SCHWARZ, ERIC M.;SMITH, SR., RONALD M.;REEL/FRAME:017169/0239
Effective date: 20050908

STCF  Information on status: patent grant 
Free format text: PATENTED CASE 

REMI  Maintenance fee reminder mailed  
FPAY  Fee payment 
Year of fee payment: 4 

SULP  Surcharge for late payment  
MAFP  Maintenance fee payment 
Free format text: PAYMENT OF MAINTENANCE FEE, 8TH YEAR, LARGE ENTITY (ORIGINAL EVENT CODE: M1552) Year of fee payment: 8 