CN116521124A

CN116521124A - Vector floating point multiply-add device suitable for multiple precision floating point operations

Info

Publication number: CN116521124A
Application number: CN202310334526.2A
Authority: CN
Inventors: 赖书浩; 贺小勇
Original assignee: South China University of Technology SCUT
Current assignee: South China University of Technology SCUT
Priority date: 2023-03-30
Filing date: 2023-03-30
Publication date: 2023-08-01

Abstract

The invention discloses a vector floating-point multiply-add device suitable for multiple precision floating-point operations, comprising a first, a second, a third and a fourth operation module. The first operation module comprises a partial product generation module, a Wallace network, a first inversion module, an exponent alignment module, a mantissa composite right shifter, a sticky logic module and an exception pre-judgment module; the second operation module comprises a 3:2 CSA adder, a CPA adder, an add-1 circuit, a GRS logic module and a sign pre-judgment module; the third operation module comprises a second inversion module, a leading 0 detection module, a trailing 0 detection module, a normalization composite left shift module, a normalization correction module, a rounding preprocessing module, a fast GRS solving module and an exponent adjustment module; the fourth operation module comprises a mantissa plus-1 logic module, an exponent plus-1 logic module, a sign judgment module, an exception judgment module and a control logic output module. The invention balances operation speed against chip area, and can execute floating-point multiply-add operations of several precisions in parallel.

Description

Vector floating point multiply-add device suitable for multiple precision floating point operations

Technical Field

The invention relates to the technical field of floating point operation, in particular to a vector floating point multiply adder suitable for multiple precision floating point operation.

Background

Computers represent data as either fixed-point or floating-point numbers. Fixed-point arithmetic logic is simple, so fixed-point functional units are easier to design, but fixed-point numbers cover a small numerical range and offer low precision. A floating-point number can move its radix point dynamically, so it can represent a much larger range of data with greater precision than a fixed-point number. Its disadvantage is that the arithmetic algorithms are far more complex than fixed-point logic, which makes hardware design considerably harder.

Among floating-point instructions, multiplication and addition are used most frequently. To improve the throughput of a floating-point processor, the multiply instruction and the add instruction can be fused into a single floating-point multiply-add instruction. This raises operation speed, and because the fused operation omits one rounding step, the precision loss is also smaller.

At present, floating-point multiply-add units usually adopt a dual-path design. A common optimization splits the hardware into a close path and a far path according to the exponent difference in order to shorten the critical path. Although this design achieves high-speed operation, it generally occupies excessive hardware area and reduces area efficiency. In addition, existing floating-point multiply-add devices are generally suitable for only one precision, which is a limitation when multiply-add operations of several precisions are required.

Disclosure of Invention

In view of this, the embodiment of the invention provides a vector floating-point multiply-add device suitable for multiple precision floating-point operations.

The first aspect of the invention provides a vector floating point multiply adder suitable for multiple precision floating point operations, which comprises a first operation module, a second operation module, a third operation module and a fourth operation module;

the first operation module specifically comprises a partial product generation module, a Wallace network, a first inversion module, an exponent alignment module, a mantissa composite right shifter, a sticky logic module and an exception pre-judgment module; wherein the Wallace network is connected with the output of the partial product generation module; the mantissa composite right shifter is connected with the output of the first inversion module and with the output of the exponent alignment module; the sticky logic module is connected with the output of the mantissa composite right shifter;

the second operation module specifically comprises a 3:2 CSA adder, a CPA adder, an add-1 circuit, a GRS logic module and a sign pre-judgment module; wherein the 3:2 CSA adder is connected with the output of the Wallace network; the CPA adder is connected with the output of the 3:2 CSA adder; the add-1 circuit is connected with the output of the CPA adder, the output of the mantissa composite right shifter and the output of the sticky logic module; the GRS logic module is connected with the output of the mantissa composite right shifter and the output of the sticky logic module; the sign pre-judgment module is connected with the output of the GRS logic module, the output of the CPA adder and the output of the add-1 circuit;

the third operation module specifically comprises a second inversion module, a leading 0 detection module, a trailing 0 detection module, a normalization composite left shift module, a normalization correction module, a rounding preprocessing module, a fast GRS solving module and an exponent adjustment module; the second inversion module is connected with the output of the GRS logic module, the output of the CPA adder and the output of the add-1 circuit; the leading 0 detection module is connected with the output of the second inversion module; the trailing 0 detection module is connected with the output of the GRS logic module, the output of the CPA adder and the output of the add-1 circuit; the normalization composite left shift module is connected with the output of the second inversion module, the output of the leading 0 detection module and the output of the exponent alignment module; the exponent adjustment module is connected with the output of the exponent alignment module; the normalization correction module is connected with the output of the normalization composite left shift module; the fast GRS solving module is connected with the output of the leading 0 detection module and the output of the trailing 0 detection module; the rounding preprocessing module is connected with the output of the normalization correction module and the output of the fast GRS solving module;

the fourth operation module specifically comprises a mantissa plus-1 logic module, an exponent plus-1 logic module, a sign judgment module, an exception judgment module and a control logic output module; the mantissa plus-1 logic module is connected with the output of the normalization correction module and the output of the rounding preprocessing module; the exponent plus-1 logic module is connected with the output of the exponent adjustment module and the output of the rounding preprocessing module; the sign judgment module is connected with the output of the normalization correction module, the output of the rounding preprocessing module and the output of the sign pre-judgment module; the exception judgment module is connected with the output of the exception pre-judgment module, the output of the exponent adjustment module, the output of the normalization correction module and the output of the rounding preprocessing module; the control logic output module is connected with the outputs of the mantissa plus-1 logic module, the exponent plus-1 logic module, the sign judgment module and the exception judgment module.

Further, in the first operation module:

the partial product generation module is used for obtaining the input floating point number and carrying out partial product multiplication to obtain 27 partial products;

the Wallace network is used for calculating a first summation Sum and a first Carry of the floating point number according to 27 partial products of the partial product generation module as a first output result;

the first inversion module is used for performing a bitwise inversion of the mantissa of the floating point number whose sign bit is opposite when an effective floating point multiply-subtract operation is performed on the input floating point numbers;

the exponent alignment module is used for obtaining, according to the exponents of the floating point numbers that are not shifted, the shift value required for the alignment shift of the input floating point number, and for generating an alignment signal;

the mantissa composite right shifter is used for shifting the mantissa of the floating point number that must be shifted to the right by the required shift value;

the sticky logic module is used for calculating the sticky bit of the input floating point number;

the exception pre-judgment module is used for judging NaN and infinity values among the input floating point numbers and generating an exception pre-judgment signal.

Further, the sticky logic module calculates the sticky bit of the input floating point number according to the following formula:

$$\text{sticky} = \text{rshiftnum} > (\text{tzd}_{f_c} + 2 \times \text{width} + 2)$$

where $f_c$ denotes the mantissa of the floating point number to be shifted; rshiftnum denotes the shift value; $\text{tzd}_{f_c}$ denotes the trailing 0 detection value of $f_c$; and width denotes the bit width of $f_c$.

Further, in the second operation module:

the 3:2CSA adder is used for storing the carry value of the input floating point number;

the CPA adder is used for adding a first output result output by the first operation module in combination with a Carry value to obtain a second summation Sum and a second Carry as second output results;

the add-1 circuit consists of cascaded half adders; the input of each half adder is selected, according to the Carry value, to be either the output of the preceding half adder or a constant 1, which determines whether 1 must be added to the bit values of the second Sum and the second Carry;

the GRS logic module is used for calculating a GRS value of the second output result;

the sign pre-judgment module is used for generating a sign pre-judgment signal according to the GRS value and the add-1 result.

Further, the GRS logic module calculates a GRS value of the second output result by:

when the CPA adder performs addition, the GRS value of the second output result is the GRS value of the shifted floating point mantissa $f_c$;

when the CPA adder performs subtraction, the GRS value of the second output result is the complement of the GRS value of the shifted floating point mantissa $f_c$;

wherein, when the CPA adder performs subtraction and the GRS value of the shifted floating point mantissa $f_c$ is 0, 1 is added to the least significant bit of the Carry in the second output result.

Further, in the third operation module:

the second inversion module is used for performing a bitwise inversion of the second output result when the second output result output by the second operation module is negative;

the leading 0 detection module is used for leading 0 detection on the inverted second output result;

the trailing 0 detection module is used for carrying out trailing 0 detection on the second output result;

the normalized composite left shift module is used for normalized left shift of mantissas of the second output result to obtain a third output result;

the normalization correction module is used for carrying out negative result correction on the third output result;

the fast GRS solving module is used for calculating the G bit, R bit and S bit values of the floating point number;

the rounding preprocessing module is used for generating an add-1 enable signal according to the negative result correction and the G bit, R bit and S bit values;

the exponent adjustment module generates an exponent adjustment signal according to the alignment signal.

Further, the shift value of the normalization left shift is determined as follows:

when the temporary exponent is larger than the preset exponent value, the shift value is the temporary exponent minus 1, the temporary exponent being obtained from the input floating point numbers;

when the temporary exponent is smaller than the preset exponent value, the shift value is the preset exponent value.

Further, the negative result correction means that, when the second output result is negative, a correction of adding 1 is applied to the mantissa of the floating point number.

Further, the G bit, R bit and S bit values of the floating point number are determined as follows. The S (sticky) bit is given by:

$$\text{sticky} = (\text{lzd}_{\text{invf}} + \text{tzd}_{f}) < (2 \times \text{width} + 3)$$

where $\text{lzd}_{\text{invf}}$ is the leading 0 detection result, $\text{tzd}_{f}$ is the trailing 0 detection result, and width is the bit width of the mantissa;

when the second output result is negative and S is found to be 0 by the above inequality, R is inverted (1 is corrected to 0, and 0 is corrected to 1); if S is found to be 1, or the result is positive, R does not need to be corrected;

when the second output result is negative and R has been corrected to 0, G is inverted (1 is corrected to 0, and 0 is corrected to 1); if R has been corrected to 1, or the result is positive, G does not need to be corrected;

when the result is negative and GRS is 0, an add-1 enable signal is output.
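The idea behind a sticky test built from leading and trailing 0 counts can be illustrated with a minimal Python sketch. It is not the circuit of the invention: it covers only a non-negative value, and the names `nbits` (total width of the value) and `kept` (number of bits retained after normalization) are hypothetical parameters introduced here, with `nbits - kept` playing the role of the constant on the right-hand side of the inequality.

```python
def leading_zeros(value: int, nbits: int) -> int:
    """Number of 0 bits above the most significant 1 in an nbits-wide value."""
    return nbits - value.bit_length()


def trailing_zeros(value: int, nbits: int) -> int:
    """Number of 0 bits below the least significant 1 (nbits if value is 0)."""
    if value == 0:
        return nbits
    return (value & -value).bit_length() - 1


def sticky_fast(value: int, nbits: int, kept: int) -> bool:
    """Sticky bit from the two zero counts only, without shifting the value."""
    if value == 0:
        return False
    return leading_zeros(value, nbits) + trailing_zeros(value, nbits) < nbits - kept


def sticky_reference(value: int, nbits: int, kept: int) -> bool:
    """Reference: normalize by shifting left, then OR the bits below the kept field."""
    normalized = (value << leading_zeros(value, nbits)) & ((1 << nbits) - 1)
    dropped = normalized & ((1 << (nbits - kept)) - 1)
    return dropped != 0
```

The comparison can run in parallel with the normalization shift, because it needs the two zero counts but not the shifted mantissa.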

Further, in the fourth operation module:

the mantissa plus-1 logic module is used for adding 1 to the mantissa of the third output result according to the add-1 enable signal and the normalization correction result;

the exponent plus-1 logic module is used for adding 1 to the exponent of the third output result according to the add-1 enable signal and the exponent adjustment signal;

the sign judgment module is used for adjusting the sign of the third output result according to the add-1 enable signal, the normalization correction result and the sign pre-judgment signal;

the exception judgment module is used for generating an exception indication signal according to the add-1 enable signal, the exponent adjustment signal, the alignment signal and the exception pre-judgment signal; the exception indication signal includes an invalid operation exception, an underflow exception, an overflow exception, a division-by-zero exception and an inexact exception;

the control logic output module is used for outputting the multiply-add result according to the logic operation results of the mantissa plus-1 logic module, the exponent plus-1 logic module, the sign judgment module and the exception judgment module.

The embodiment of the invention has the following beneficial effects:

1. The floating-point multiply-add device supports half-precision, single-precision and double-precision floating-point multiply-add operations and handles all floating point number types specified by the IEEE-754 standard, including normalized numbers, denormalized numbers, infinity and NaN.

2. The floating point multiply-add device can realize vectorized floating point multiply-add operation, and can execute 4 half-precision floating point multiply-add operations, or 2 single-precision floating point multiply-add operations, or 1 double-precision floating point multiply-add operation in parallel.

3. The floating-point multiply-add device has clear advantages in both speed and area and achieves high area efficiency. Synthesized with Design Compiler in a TSMC 7nm process, the maximum path delay is no more than 0.32ns, the maximum operating frequency reaches 3.125GHz, and the area is no more than 3639.744nm2.

Additional aspects and advantages of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

FIG. 1 is a flow chart of a vector floating point multiply-add architecture suitable for multiple precision floating point operations in accordance with the present invention;

FIG. 2 is a schematic diagram of a data structure of a partial product generation module in a vector floating point multiply-add device adapted for multiple precision floating point operations according to the present invention;

FIG. 3 is a schematic diagram of a shared partial product generated by a partial product generation module in a vector floating point multiply-add device suitable for multiple precision floating point operations according to the present invention;

FIG. 4 is a schematic diagram of a Wallace network architecture in a vector floating point multiply-add device adapted for multiple precision floating point operations in accordance with the present invention;

FIG. 5 is a schematic diagram of a mantissa composite right shifter in a vector floating point multiply-add device adapted for multiple precision floating point operations in accordance with the present invention;

FIG. 6 is a schematic diagram of a second operation module in a vector floating-point multiply-add device adapted for multiple precision floating-point operations according to the present invention;

FIG. 7 is a schematic diagram of a vectorization implementation of an add-1 circuit in a vector floating-point multiply-add device adapted for multiple precision floating-point operations in accordance with the present invention;

FIG. 8 is a schematic diagram of a fast GRS solving module solving the sticky bit in a vector floating point multiply-add device suitable for multiple precision floating point operations according to the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application.

In information fields with large data volumes, high precision requirements and wide data ranges, such as image processing, signal transmission and aerospace, the precision of fixed-point numbers cannot meet the requirements, so floating point numbers must be used to represent the data, and computation on that data depends on a floating point arithmetic unit. Floating point numbers of different precisions have different bit widths and different calculation errors. In the prior art, floating point operations on different precisions are generally implemented by several sets of independent, parallel operation logic, which makes the floating point arithmetic unit occupy excessive area in hardware.

Vector operation is the lowest-cost way to improve data throughput and parallelism; it is more efficient and allows one instruction to complete several floating point operations. However, combining vectorization with several sets of independent parallel arithmetic units makes the area problem of the floating point arithmetic unit even more prominent, and the prior art offers no design that balances area and speed.

To solve this problem, the embodiment of the invention provides a vector floating point multiply-add device suitable for multiple precision floating point operations, which implements the floating point multiply-add operation A×B+C.

The floating point multiply-add device of the embodiment adopts a four-stage module pipeline to realize vector floating point multiply-add operation, can execute four half-precision floating point multiply-add operations at a time, or two single-precision floating point multiply-add operations or one double-precision floating point multiply-add operation, and outputs a correct result and generates an abnormality indication signal.

Referring to fig. 1, the four-stage module pipeline of the floating-point multiply-add device of this embodiment is divided into a first, a second, a third and a fourth operation module. The first operation module completes the operations related to mantissa multiplication, the second completes the addition-related operations, the third completes normalization and exponent adjustment, and the fourth completes the add-1 operation and exception judgment.

First operation module: referring to fig. 1, the first operation module specifically includes a partial product generation module, a Wallace network, a first inversion module, an exponent alignment module, a mantissa composite right shifter, a sticky logic module and an exception pre-judgment module; wherein the Wallace network is connected with the output of the partial product generation module; the mantissa composite right shifter is connected with the output of the first inversion module and with the output of the exponent alignment module; the sticky logic module is connected with the output of the mantissa composite right shifter.

In the first operation module:

The partial product generation module obtains the input floating point numbers and generates 27 partial products. In this embodiment, the partial product generation module uses radix-4 Booth multiplication: the multiplicand and multiplier of the floating point numbers are input to the Booth partial product generation module, which generates the 27 partial products. The data structure of the multiplicand is shown in fig. 2, and the Booth partial products generated from this data structure can be used for operations of different precisions.

As shown in FIG. 2, $f_d$, $f_s$ and $f_h$ denote the double-precision, single-precision and half-precision mantissas respectively. Taking single and half precision as an example, this data structure allows the partial products of $f_{h2}$ and $f_{h3}$ to be obtained at the same time as the partial products of $f_{s1}$ are calculated. The high-order bits of each partial product must be filled with a sequence related to the sign bit S, and high-order alignment lets partial products of different precisions share this filling sequence, which simplifies the control logic and reduces area. Therefore $f_{d0}$, $f_{s0}$ and $f_{h0}$ are aligned at the high-order end, and $f_{s1}$ and $f_{h2}$ are aligned likewise.

In particular, in fig. 2 a certain number of "0"s separate the different mantissas of the same precision, so that during partial product accumulation the partial products belonging to different mantissas of a vector operation do not interfere with each other. Based on the data structure of fig. 2, Booth multiplication at every precision therefore shares one set of partial products, which makes vector multiplication convenient to implement in hardware.

As shown in fig. 3, the embodiment applies SIMD (Single Instruction Multiple Data, i.e. one instruction completes multiple floating point operations) optimization to the partial product generation module and, based on the idea of hardware isolation, designs a mixed-precision fused vector mantissa multiplication structure, so that the half-precision, single-precision and double-precision vector operations share the 27 partial products. Taking single-precision floating-point multiply-add as an example, a 24-bit mantissa multiplication is required, which generates 13 partial products of 27 bits each (including the high-order extension sequence {1, -, S }, where S is the sign bit of the partial product). These can also serve the 11-bit mantissa multiplications of two groups of half-precision floating-point multiply-add: each 11-bit multiplication uses 6 partial products, and the required partial product bit width is 14 bits.
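The partial product counts quoted above (27, 13 and 6) match what textbook radix-4 Booth recoding gives for unsigned 53-bit, 24-bit and 11-bit mantissas. The following Python sketch shows only that generic recoding; it does not model the shared data structure of fig. 2 or the sign-extension sequence, and the function name `booth_radix4_digits` is a hypothetical helper introduced for illustration.

```python
def booth_radix4_digits(multiplier: int, nbits: int) -> list[int]:
    """Recode an unsigned nbits-wide multiplier into radix-4 Booth digits.

    Each digit lies in {-2, -1, 0, 1, 2} and selects one partial product
    (0, +/-1x or +/-2x the multiplicand). An unsigned operand needs one
    extra zero bit on top, which gives nbits // 2 + 1 digits.
    """
    ndigits = nbits // 2 + 1
    padded = multiplier << 1  # append the implicit bit y[-1] = 0
    digits = []
    for i in range(ndigits):
        triplet = (padded >> (2 * i)) & 0b111  # bits y[2i+1], y[2i], y[2i-1]
        low = triplet & 1
        mid = (triplet >> 1) & 1
        high = (triplet >> 2) & 1
        digits.append(low + mid - 2 * high)
    return digits


def booth_value(digits: list[int]) -> int:
    """Recombine the digits: digit i carries weight 4**i."""
    return sum(d * 4 ** i for i, d in enumerate(digits))
```

For example, `len(booth_radix4_digits(m, 53))` is 27 for any 53-bit mantissa `m`, and `booth_value` returns the original multiplier, which is a quick way to check the recoding.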

The Wallace network calculates, from the 27 partial products of the partial product generation module, a first Sum and a first Carry as the first output result. As shown in FIG. 4, the Wallace network in this embodiment consists of a three-stage array of 3:2 CSAs (carry-save adders) and a two-stage array of 4:2 CSAs. A 3:2 CSA reduces three partial products x, y and z to two partial products, Sum and Carry, according to the following formulas:

$$\text{sum} = x \oplus y \oplus z$$

$$\text{carry} = \{(x \,\&\, y) \mid (y \,\&\, z) \mid (x \,\&\, z),\ 0\}$$

The Sum and Carry obtained from each 3:2 CSA stage are reduced further as the x, y and z of the next stage.

The 4:2 CSA calculation is similar, except that the input is four partial products:

$$\text{sum}_i = x_i \oplus y_i \oplus z_i \oplus w_i \oplus c_i$$

$$c_{i+1} = \text{Mux}(x_i \oplus y_i,\ z_i,\ x_i)$$

$$\text{carry}_i = \text{Mux}(x_i \oplus y_i \oplus z_i \oplus w_i,\ c_i,\ w_i)$$

The 4:2 CSA calculation is written per bit because it uses the carry $c_i$ from the previous bit. Mux selects its second argument when its first argument is 1, and its third argument otherwise.

In this embodiment, the CSA computes and stores Carry and Sum separately; each bit of Carry and Sum is computed independently without interference, so it is extremely fast. After the third stage the number of partial products is a multiple of 4, and 4:2 CSAs are used to reduce the number of CSA stages. The delay of a 3:2 CSA is two XOR gates and the delay of a 4:2 CSA is three XOR gates; for accumulating the same number of partial products, the two-stage 4:2 CSA array reduces the delay from 8 XOR gates to 6, which shortens the critical path.
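The 3:2 and 4:2 reduction formulas of the Wallace network can be checked with a short Python sketch that treats each partial product as an integer. It is a behavioural model under the assumption that carry_i and c_(i+1) both carry the weight of bit i+1; the function names and the `nbits` parameter are illustrative and not taken from the patent.

```python
def csa_3to2(x: int, y: int, z: int) -> tuple[int, int]:
    """3:2 carry-save adder: returns (sum, carry) with x + y + z == sum + carry."""
    total = x ^ y ^ z
    carry = ((x & y) | (y & z) | (x & z)) << 1  # {majority, 0}: shifted left by one
    return total, carry


def mux(select: int, if_one: int, if_zero: int) -> int:
    """Mux(select, a, b): a when select is 1, otherwise b."""
    return if_one if select else if_zero


def csa_4to2(x: int, y: int, z: int, w: int, nbits: int) -> tuple[int, int]:
    """4:2 carry-save adder over nbits-wide inputs, evaluated bit by bit.

    One extra bit position is evaluated so that the last intermediate
    carry c is absorbed into the sum word.
    """
    total = 0
    carry = 0
    c = 0  # c_0 = 0
    for i in range(nbits + 1):
        xi, yi, zi, wi = ((v >> i) & 1 for v in (x, y, z, w))
        xor_all = xi ^ yi ^ zi ^ wi
        total |= (xor_all ^ c) << i
        carry |= mux(xor_all, c, wi) << (i + 1)  # carry_i has weight 2**(i+1)
        c = mux(xi ^ yi, zi, xi)                 # c_(i+1)
    return total, carry
```

Because the intermediate carry c only travels one bit position, the 4:2 stage still has a constant delay regardless of the operand width.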

The first inversion module handles effective subtraction: when A×B+C is performed on the input floating point numbers and the sign bits of A×B and C are opposite, it performs a bitwise inversion of the mantissa $f_c$ of the floating point number C.

The exponent alignment module obtains, from the exponents, the shift value required to align the floating point number that must be shifted, and generates an alignment signal. As shown in FIG. 1, in this embodiment the mantissa of the floating point number C is the one that is shifted, and the exponent alignment module obtains the shift value required for the alignment. Let the exponents of the floating point numbers A, B and C be $e_a$, $e_b$ and $e_c$ respectively, and let the exponent of the product of A and B be $e_{ab}$. When $e_c - e_{ab} \geq 56$, the mantissa $f_c$ of C does not need to be shifted, and the aligned exponent is $e_c$. When $e_c - e_{ab} < 56$, $f_c$ is shifted to the right; taking the exponent bias into account, the number of bits to shift is $r = e_a + e_b - e_c - \text{bias} + \text{width} + 3$, the temporary exponent is $e_c + r$, and width is the mantissa bit width.
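The alignment rule can be written out as a small Python sketch. Two points are assumptions rather than statements of the patent: the bias is taken to enter with a minus sign (so that $e_{ab} = e_a + e_b - \text{bias}$), and the threshold 56 is read as width + 3 for double precision (width = 53). The function name `alignment_shift` is hypothetical.

```python
def alignment_shift(e_a: int, e_b: int, e_c: int, bias: int, width: int) -> tuple[int, int]:
    """Return (right-shift amount for f_c, temporary exponent).

    All exponents are biased. Assumption: e_ab = e_a + e_b - bias, and the
    no-shift threshold equals width + 3 (56 when width is 53).
    """
    e_ab = e_a + e_b - bias
    if e_c - e_ab >= width + 3:
        # C is far larger than the product: f_c stays in place
        return 0, e_c
    r = e_a + e_b - e_c - bias + width + 3
    return r, e_c + r


# Double precision example: A, B and C all have unbiased exponent 0
shift, temp_exponent = alignment_shift(1023, 1023, 1023, bias=1023, width=53)
```

Under these assumptions the shift amount is never negative, because the no-shift case is exactly the case in which r would be zero or less.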

The mantissa composite right shifter shifts the mantissa of the floating point number that must be shifted to the right by the required shift value. As shown in fig. 5, the vector right shifter is based on the idea of a staged shifter: the shift operation of each stage is controlled by one bit of the shift amount, and the specific control signal is selected according to the precision of the operation currently being executed. Take the stage that shifts s0 as an example. If a 16-bit vector shift is performed, the shift of s0[31:16] is controlled by r1[0]; whether a 16-bit or a 32-bit shift is performed, "0" is shifted in when s0[31:16] shifts right. The shift logic of s0[15:0] is similar, and it uses the r0 control signal whether a single-precision or a half-precision operation is performed, but the bit shifted into it is not necessarily "0" as for the former. If a 16-bit operation is performed, the shift operations of s0[31:16] and s0[15:0] are independent of each other and "0" is shifted in on a right shift; if a 32-bit operation is performed, s0[31:0] is a single whole, and the bit shifted out of s0[31:16] is shifted into s0[15]. The second to fourth stages are similar except that they shift right by 2, 4 and 8 bits respectively. The 32-bit vector right shifter in this embodiment can thus right-shift one 32-bit number or two 16-bit numbers.
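A behavioural Python sketch of such a lane-aware staged right shifter is given below. It models only the four stages named in the text (1, 2, 4 and 8 bits), so shift amounts are limited to 0..15, and it assumes that in 32-bit mode both halves are controlled by r0; the function name and the `half_mode` flag are hypothetical.

```python
MASK16 = 0xFFFF


def vector_right_shift(s0: int, r0: int, r1: int, half_mode: bool) -> int:
    """Shift a 32-bit word right as one 32-bit lane or as two 16-bit lanes.

    half_mode=True : s0[31:16] shifts by r1 and s0[15:0] by r0, independently.
    half_mode=False: the whole word shifts by r0 (assumption: r0 drives both halves).
    """
    value = s0 & 0xFFFFFFFF
    for stage in range(4):
        step = 1 << stage  # stages shift by 1, 2, 4 and 8 bits
        lo_enable = (r0 >> stage) & 1
        hi_enable = ((r1 if half_mode else r0) >> stage) & 1
        hi = value >> 16
        lo = value & MASK16
        # the upper half always takes in zeros from the left
        new_hi = hi >> step if hi_enable else hi
        new_lo = lo
        if lo_enable:
            new_lo = lo >> step
            if not half_mode:
                # 32-bit mode: bits leaving the upper half enter the lower half
                new_lo |= (hi << (16 - step)) & MASK16
        value = (new_hi << 16) | new_lo
    return value
```

The only difference between the two modes is the bits fed into the top of the lower half, which is why one shifter can serve both precisions with a small amount of extra selection logic.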

The sticky logic module calculates the sticky bit of the input floating point number. The sticky calculation in this embodiment uses fast sticky logic based on trailing 0 detection. Trailing 0 detection (Trailing Zeros Detector, TZD) examines a binary number bit by bit from the lowest bit to the highest bit to find the position of the first "1"; the 0s below this "1" are the trailing 0s. The logic is essentially a selection logic: assuming TZD is performed on a binary number X, the TZD result is 0 when X[0] = 1, 1 when X[0] = 0 and X[1] = 1, and so on.

In this embodiment, the sticky logic module calculates the sticky bit of the input floating point number according to the following formula:

$$\text{sticky} = \text{rshiftnum} > (\text{tzd}_{f_c} + 2 \times \text{width} + 2)$$

where $f_c$ denotes the mantissa of the floating point number to be shifted; rshiftnum denotes the shift value and is obtained by the exponent alignment module described above; $\text{tzd}_{f_c}$ denotes the trailing 0 detection value of $f_c$; and width denotes the bit width of $f_c$.
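The reasoning behind this comparison can be sketched in Python: a 1 is lost during the right shift exactly when the shift amount exceeds the position of the lowest 1. In the sketch, `kept` is a hypothetical name for the number of bit positions below $f_c$ that are still retained (the term 2×width+2 in the formula), and the zero-mantissa case is handled explicitly as an assumption.

```python
def trailing_zeros(value: int, nbits: int) -> int:
    """Trailing 0 detection: index of the lowest 1 bit (nbits if value is 0)."""
    if value == 0:
        return nbits
    return (value & -value).bit_length() - 1


def sticky_by_tzd(f_c: int, rshiftnum: int, width: int, kept: int) -> bool:
    """Sticky bit from a single comparison, without performing the shift."""
    if f_c == 0:  # assumption: a zero mantissa never sets sticky
        return False
    return rshiftnum > trailing_zeros(f_c, width) + kept


def sticky_by_shifting(f_c: int, rshiftnum: int, kept: int) -> bool:
    """Reference: place f_c above `kept` zero bits, shift, and OR the lost bits."""
    field = f_c << kept
    lost = field & ((1 << rshiftnum) - 1)
    return lost != 0


# Example with a 4-bit mantissa, so kept = 2 * width + 2 = 10
width = 4
kept = 2 * width + 2
```

Since the comparison needs only the shift amount and the trailing 0 count, it can be evaluated alongside the shifter instead of after it.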

The exception pre-judgment module judges NaN and infinity values among the input floating point numbers and generates an exception pre-judgment signal. For example, for multiplicand A and multiplier B: when one is infinity and the other is neither 0 nor NaN, the result of A×B is infinity; if either A or B is NaN, the result of A×B is treated as qNaN, and if either A or B is sNaN, the invalid operation exception indication is also pulled high; if infinity is multiplied by 0, the result of A×B is qNaN and the invalid operation exception indication is pulled high. A×B is then treated as a whole, denoted AB: if AB or C is NaN, the result is qNaN, and if C is sNaN, the invalid operation exception indication is pulled high; when both AB and C are infinity and an effective subtraction is performed, the result is NaN and the invalid operation exception indication is pulled high.
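These rules can be condensed into a decision function. The Python sketch below works on operand classes rather than bit patterns; the class labels ("zero", "finite", "inf", "qnan", "snan") and the function name `prejudge` are hypothetical, and the "finite" return value simply means that no special result was detected at this stage.

```python
NANS = ("qnan", "snan")


def prejudge(a: str, b: str, c: str, effective_subtract: bool) -> tuple[str, bool]:
    """Return (result class, invalid-operation flag) for A*B+C."""
    invalid = False
    # Step 1: classify the product A*B
    if a in NANS or b in NANS:
        ab = "qnan"
        invalid = "snan" in (a, b)
    elif "inf" in (a, b):
        if "zero" in (a, b):
            ab, invalid = "qnan", True  # infinity times zero
        else:
            ab = "inf"
    else:
        ab = "finite"
    # Step 2: combine the product with C
    if ab == "qnan" or c in NANS:
        return "qnan", invalid or c == "snan"
    if ab == "inf" and c == "inf":
        # infinity minus infinity is invalid; same-sign infinities stay infinite
        return ("qnan", True) if effective_subtract else ("inf", False)
    if ab == "inf" or c == "inf":
        return "inf", False
    return "finite", False
```

For instance, `prejudge("inf", "zero", "finite", False)` yields a qNaN result with the invalid flag set, while `prejudge("qnan", "finite", "finite", False)` yields qNaN without raising the flag.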

Second operation module: referring to fig. 1, the second operation module specifically includes a 3:2 CSA adder, a CPA adder, a 1 adding circuit, a GRS logic module, and a symbol pre-judging module; wherein the 3:2 CSA adder is connected to the output of the Wallace network; the CPA adder is connected with the output of the 3:2 CSA adder; the 1 adding circuit is connected with the output of the CPA adder, the output of the mantissa composite right shifter and the output of the stick logic module; the GRS logic module is connected with the output of the mantissa composite right shifter and the output of the stick logic module; the symbol pre-judging module is connected with the output of the GRS logic module, the output of the CPA adder and the output of the 1 adding circuit. After the second operation module receives the two addends output by the CSA array, these two addends and the alignment-shifted result are first reduced by the 3:2 CSA and then fed to a carry-propagate adder to complete the addition. To reduce the resource consumption of a wide adder, an improved wide adder is adopted.

In the second operation module:

the 3:2 CSA adder is used for saving the carry value of the input floating point number; the CPA adder is used for adding the first output result output by the first operation module, combined with the Carry value, to obtain a second summation Sum and a second Carry as the second output result. As shown in fig. 6, in this embodiment the Sum and Carry output by the Wallace network of the first operation module, together with the sign bit, require a 162-bit addition. Because the upper 56 bits of the addend do not overlap the Sum and Carry output by the multiplier, the result of the upper 56-bit addition depends only on the carry out of the lower 107-bit addition. Therefore, to save carry-propagation logic, the adder used by the second operation module is 107 bits wide and consists of one stage of carry-save adder followed by one stage of two-input carry-propagate adder (CPA); whether the upper 56 bits need to be incremented by 1 is then determined from the computed 107th bit. The 1 adding circuit is implemented in hardware with cascaded half adders to optimize area and can run in parallel with the 107-bit addition, which removes the 56-bit addition from the critical path and greatly shortens the timing path.
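The split between a narrow adder and an upper incrementer can be illustrated with integer arithmetic. The sketch assumes the lower part is 106 bits, so that the 107th bit of the adder output is the carry into the upper 56 bits, and that Sum + Carry fits in 106 bits; these widths and the function names are assumptions chosen to be consistent with the 162-bit total, not a statement of the actual circuit.

```python
LOW, HIGH = 106, 56      # assumed split of the 162-bit datapath

def csa_3_2(a: int, b: int, c: int):
    """3:2 carry-save adder: three operands in, sum word and carry word out."""
    s = a ^ b ^ c
    carry = ((a & b) | (a & c) | (b & c)) << 1
    return s, carry

def split_add(sum_w: int, carry_w: int, addend: int) -> int:
    """Add Sum, Carry and a 162-bit aligned addend, modulo 2**162."""
    low_mask = (1 << LOW) - 1
    s, c = csa_3_2(sum_w, carry_w, addend & low_mask)
    low_total = s + c                       # 107-bit CPA result
    carry_out = (low_total >> LOW) & 1      # the 107th bit
    high = addend >> LOW
    # The upper bits never see a full adder: only "high" or "high + 1".
    high_plus_1 = (high + 1) & ((1 << HIGH) - 1)
    high_sel = high_plus_1 if carry_out else high
    return (high_sel << LOW) | (low_total & low_mask)
```

Because `high_plus_1` does not depend on the lower addition, it can be prepared while the 107-bit adder is still working, which is the reason the text gives for the shorter timing path.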

The 1 adding circuit consists of cascaded half adders; the input of a half adder is selected, according to the Carry value, from the preceding half adder or a constant 1, which determines whether the bit values of the second Sum and the second Carry need to be incremented by 1. As shown in fig. 7, the present invention applies SIMD optimization to the 56-bit 1 adding circuit based on the idea of hardware isolation: the circuit is composed of cascaded half adders, and the input of a half adder is selected from the preceding half adder or a constant 1 according to a format signal indicating the current operation precision.
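A behavioural sketch of such a chain is given below. The lane boundaries used in the example (two lanes of 28 bits) are hypothetical, since the text does not list the lane widths, and the per-lane enable that stands in for the "constant 1" is likewise an assumption of the sketch.

```python
def half_adder(a: int, b: int):
    return a ^ b, a & b          # (sum, carry)

def add1_simd(value: int, width: int, lane_starts: set, lane_enable: dict) -> int:
    """Increment each lane of `value` using a chain of half adders.

    lane_starts: bit positions at which a new lane begins (set by the format).
    lane_enable: maps each lane start to 0 or 1; 1 injects the constant 1.
    """
    result, carry = 0, 0
    for i in range(width):
        # At a lane boundary the chain is isolated: the input is the injected
        # constant instead of the carry from the preceding half adder.
        cin = lane_enable[i] if i in lane_starts else carry
        s, carry = half_adder((value >> i) & 1, cin)
        result |= s << i
    return result
```

With a single lane, `add1_simd(0xFF, 56, {0}, {0: 1})` returns `0x100`; with two lanes the carry out of the lower lane is discarded at the boundary instead of rippling into the upper lane.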

The GRS logic module is used for calculating the GRS value of the second output result. As shown in FIG. 1, GRS computation is implemented by the GRS logic: when the CPA adder performs addition, the GRS of the result is the GRS value of f_c; when the CPA adder performs subtraction, the GRS of the result is the complement of the GRS of f_c. In addition, when subtraction is performed and the GRS of f_c is 0, a carry is generated upward; this carry is applied to the lowest bit of the Carry output by the Wallace network, that is, a 1 is filled into the lowest bit of Carry.
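The GRS rule can be written out as follows. The sketch assumes that "complement" means the two's complement over the three GRS bits, which is the reading consistent with a carry being generated exactly when the GRS of f_c is 0; the function name `grs_result` is hypothetical.

```python
def grs_result(grs_fc: int, subtract: bool):
    """Return (grs, carry_in) for a 3-bit GRS value of f_c.

    carry_in is the 1 filled into the lowest bit of the Wallace Carry word.
    """
    if not subtract:
        return grs_fc, 0
    comp = (-grs_fc) & 0b111              # two's complement over 3 bits (assumed)
    carry_in = 1 if grs_fc == 0 else 0    # ~000 + 1 overflows into the mantissa
    return comp, carry_in
```

Under this reading, inverting the mantissa bits, taking the complemented GRS and injecting `carry_in` together reproduce the two's complement of the whole shifted operand.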

Third operation module: referring to fig. 1, the third operation module specifically includes a second negation module, a leading 0 detection module, a trailing 0 detection module, a normalized compound left shift module, a normalized correction module, a rounding preprocessing module, a quick GRS solving module, and a step code adjusting module; the second negation module is connected with the output of the GRS logic module, the output of the CPA adder and the output of the 1 adding circuit; the leading 0 detection module is connected with the output of the second negation module; the trailing 0 detection module is connected with the output of the GRS logic module, the output of the CPA adder and the output of the 1 adding circuit; the normalized compound left shift module is connected with the output of the second negation module, the output of the leading 0 detection module and the output of the index opposite-order module; the step code adjusting module is connected with the output of the index opposite-order module; the normalized correction module is connected with the output of the normalized compound left shift module; the quick GRS solving module is connected with the output of the leading 0 detection module and the output of the trailing 0 detection module; the rounding preprocessing module is connected with the output of the normalized correction module and the output of the quick GRS solving module.

In the third operation module:

the second inverting module is used for performing bit inverting operation on the second output result when the second output result output by the second operation module is a negative result.

The leading 0 detection module is used for performing leading 0 detection on the inverted second output result; the principle of leading 0 detection (Leading Zeros Detector, LZD) is similar to that of trailing 0 detection, differing only in the detection direction. In a conventional design the leading 0 detection module would be placed before the adder, but because the mantissa of the first output result is as wide as 161 bits, leading 0 detection is performed after the adder in this embodiment in order to shorten the timing path.

And the trailing 0 detection module is used for carrying out trailing 0 detection on the second output result.

And the normalized composite left shift module is used for normalized left shift of mantissas of the second output result to obtain a third output result.

In this embodiment, the normalized compound left shift module performs the normalized left shift of the mantissa using the leading 0 detection result z and adjusts the step code (exponent) accordingly: for each bit the mantissa is shifted left, the step code is decremented by 1. When the temporary step code e_tmp is less than or equal to z, the left shift amount is e_tmp - 1; otherwise it is z. Here the temporary step code is the one calculated by the index opposite-order module in the first operation module.
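The choice of shift amount can be sketched in a few lines. The sketch assumes e_tmp is at least 1 and tracks the step code under the usual convention that a left shift by one bit lowers it by one; the function name `normalize_shift` and that convention are assumptions of the sketch.

```python
def normalize_shift(e_tmp: int, z: int):
    """Return (left_shift, adjusted_step_code) for leading 0 count z.

    Assumes e_tmp >= 1. When e_tmp <= z the shift is capped at e_tmp - 1,
    so the step code stops at 1 instead of going to 0 or below.
    """
    shift = e_tmp - 1 if e_tmp <= z else z
    return shift, e_tmp - shift
```

For instance, `normalize_shift(10, 3)` gives `(3, 7)`, a full normalization, while `normalize_shift(3, 5)` gives `(2, 1)`, where the shift is limited by the step code.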

The normalization correction module is used for carrying out negative result correction on the third output result.

As shown in fig. 1, because the first and second operation modules only invert negative numbers bit by bit and do not add 1, if the resulting negative number consists of one run of consecutive "1"s and one run of consecutive "0"s, the normalized mantissa must be shifted right by one bit as a normalization correction. When the second output result is negative, only the bitwise inversion is performed at that point, and the add-1 operation cannot be merged into the CSA unit of the second operation module; the 1 therefore has to be added later in the path, which is the negative result correction performed by the normalized correction module in the third operation module. The effective add-1 caused by negative result correction has the same effect as the mantissa add-1 during rounding, and the two never occur at the same time, so they are merged to reduce the number of add-1 circuits.

The quick GRS solving module is used for calculating the G bit, R bit and S bit values of the floating point number. In this embodiment, let the trailing 0 detection result be tzd_f and the leading 0 detection result be lzd_{invf}; the actual value of the sticky bit is 1 when the following condition holds:

$$\mathrm{sticky} = (\mathrm{lzd}_{invf} + \mathrm{tzd}_{f}) < (2 \times \mathrm{width} + 3)$$

where lzd_{invf} is the leading 0 detection result, tzd_f is the trailing 0 detection result, and width is the bit width of the mantissa.
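The condition can be compared with a direct computation of the sticky bit after normalization. The sketch assumes a positive result (so no inversion is involved), a full normalization shift, and a total operand width of 3 × width + 5 bits, chosen because it makes the threshold 2 × width + 3 exact; that total width and the helper names are assumptions, not values stated in the text.

```python
def lzd(x: int, total: int) -> int:
    return total - x.bit_length()

def tzd(x: int, total: int) -> int:
    return (x & -x).bit_length() - 1 if x else total

def sticky_fast(lzd_invf: int, tzd_f: int, width: int) -> bool:
    # The comparison given in the text.
    return (lzd_invf + tzd_f) < (2 * width + 3)

def sticky_reference(value: int, width: int) -> bool:
    # Keep `width` mantissa bits plus G and R below the leading 1;
    # sticky is the OR of everything under the R bit.
    if value == 0:
        return False
    hi = value.bit_length() - 1
    low_kept = hi - (width + 2) + 1      # bit index of R
    if low_kept <= 0:
        return False
    return (value & ((1 << low_kept) - 1)) != 0
```

Since both detection results are available before the shifted mantissa is, the sticky bit follows from one addition and one comparison rather than from an OR over the shifted-out bits.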

When the second output result is negative and R is corrected to 0, G is also corrected: if G is 1 it becomes 0, and if G is 0 it becomes 1. If R is corrected to 1, or the result is positive, G needs no correction;

when the result is negative and GRS is 0, an add-1 enable signal is output; otherwise rounding is performed conventionally.

FIG. 8 shows the logic for fast solving of the sticky bit under negative result correction. After the sticky bit is solved, if sticky is 0, 1 is added to the round bit; if the round bit is 1 at this time, its actual value becomes 0, and the guard bit is handled in the same way.
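The correction of R and G can be modelled as a two-bit ripple. In the sketch, `g_inv` and `r_inv` are the guard and round bits taken from the inverted (not yet incremented) result and `sticky` is the already solved sticky bit; this reading of the text, and the name `correct_grs`, are assumptions.

```python
def correct_grs(g_inv: int, r_inv: int, sticky: int, negative: bool):
    """Return (g, r, s, add1_enable) after negative result correction."""
    if not negative or sticky:
        # Positive result, or the deferred 1 is absorbed below the R bit.
        return g_inv, r_inv, sticky, 0
    r = r_inv ^ 1            # add 1 at the round bit
    carry = r_inv            # R was 1, so it becomes 0 and carries into G
    g = g_inv ^ carry        # G flips only when that carry arrives
    add1 = g_inv & carry     # final GRS is 000: enable the mantissa add-1
    return g, r, 0, add1
```

For example, `correct_grs(1, 1, 0, True)` returns `(0, 0, 0, 1)`: the final GRS is 0 and the add-1 enable for the mantissa is raised.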

In this embodiment, because TZD runs in parallel with LZD and the normalized left shift, the GRS solution introduces no new delay in the data path, so the actual value of the S bit is obtained more quickly.

In this embodiment, the GRS solution is combined with the rounding preprocessing: if the result of an effective subtraction is negative and the final GRS value is 0, the rounding preprocessing issues an add-1 enable signal for the effective mantissa.

Fourth operation module: referring to fig. 1, the fourth operation module specifically includes a mantissa plus 1 logic module, a step code plus 1 logic module, a symbol judgment module, an abnormality judgment module, and a control logic output module; the mantissa plus 1 logic module is connected with the output of the normalized correction module and the output of the rounding preprocessing module; the step code plus 1 logic module is connected with the output of the step code adjusting module and the output of the rounding preprocessing module; the symbol judgment module is connected with the output of the normalized correction module, the output of the rounding preprocessing module and the output of the symbol pre-judging module; the abnormality judgment module is connected with the output of the abnormality pre-judging module, the output of the step code adjusting module, the output of the normalized correction module and the output of the rounding preprocessing module; the control logic output module is connected with the outputs of the mantissa plus 1 logic module, the step code plus 1 logic module, the symbol judgment module and the abnormality judgment module.

In the fourth operation module:

the abnormality judgment module is used for generating an abnormality indication signal according to the 1-adding enabling signal, the step code adjusting signal, the step comparison signal and the abnormality pre-judgment signal; the abnormality indication signal includes an invalid operation abnormality, an underflow abnormality, an overflow abnormality, a divide by 0 abnormality, and an imprecise abnormality;

The fourth operation module in this embodiment completes the abnormality determination and generates the abnormality indication signal, which covers an invalid operation abnormality, an underflow abnormality, an overflow abnormality, a divide-by-0 abnormality (which does not occur in multiply-add operations), an imprecise abnormality, and the like. Based on the sign, the step code, the mantissa and the abnormality signal, the output control logic selects the value to output in conformance with the IEEE-754 standard, including normalized numbers, denormalized numbers, infinity and NaN, and outputs a 5-bit abnormality indication signal.

Compared with the prior art, the invention has the advantages that:

1. The floating-point multiply-add device can perform floating-point multiply-add operations in half precision, single precision and double precision, and supports all floating-point number types specified by the IEEE-754 standard, including normalized numbers, denormalized numbers, infinity and NaN.

3. The floating-point multiply-adder has significant advantages in speed and area and achieves high area efficiency. Synthesized with Design Compiler in a TSMC 7 nm process, the maximum path delay is 0.32 ns, the maximum operating frequency is 3.125 GHz, and the area is 3639.744 nm².

In some alternative embodiments, the functions/acts noted in the block diagrams may occur out of the order noted in the operational illustrations. For example, two blocks shown in succession may in fact be executed substantially concurrently or the blocks may sometimes be executed in the reverse order, depending upon the functionality/acts involved. Furthermore, the embodiments presented and described in the flowcharts of the present invention are provided by way of example in order to provide a more thorough understanding of the technology. The disclosed methods are not limited to the operations and logic flows presented herein. Alternative embodiments are contemplated in which the order of various operations is changed, and in which sub-operations described as part of a larger operation are performed independently.

Furthermore, while the invention is described in the context of functional modules, it should be appreciated that, unless otherwise indicated, one or more of the functions and/or features may be integrated in a single physical device and/or software module or may be implemented in separate physical devices or software modules. It will also be appreciated that a detailed discussion of the actual implementation of each module is not necessary to an understanding of the present invention. Rather, the actual implementation of the various functional modules in the apparatus disclosed herein will be apparent to those skilled in the art from consideration of their attributes, functions and internal relationships. Accordingly, one of ordinary skill in the art can implement the invention as set forth in the claims without undue experimentation. It is also to be understood that the specific concepts disclosed are merely illustrative and are not intended to be limiting upon the scope of the invention, which is to be defined in the appended claims and their full scope of equivalents.

In the description of the present specification, a description referring to terms "one embodiment," "some embodiments," "examples," "specific examples," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic representations of the above terms do not necessarily refer to the same embodiments or examples. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.

While embodiments of the present invention have been shown and described, it will be understood by those of ordinary skill in the art that: many changes, modifications, substitutions and variations may be made to the embodiments without departing from the spirit and principles of the invention, the scope of which is defined by the claims and their equivalents.

While the preferred embodiment of the present invention has been described in detail, the present invention is not limited to the embodiments, and those skilled in the art can make various equivalent modifications or substitutions without departing from the spirit of the present invention, and these equivalent modifications or substitutions are included in the scope of the present invention as defined in the appended claims.

Claims

1. The vector floating point multiply adder suitable for the multiple-precision floating point operation is characterized by comprising a first operation module, a second operation module, a third operation module and a fourth operation module;

2. The vector floating point multiply-add device of claim 1 adapted for multiple precision floating point operations, wherein the first operation module:

3. The vector floating point multiply-add device of claim 2, wherein the stick logic module computes the sticky bits of the input floating point number by specifically:

$$\mathrm{sticky} = \mathrm{rshiftnum} > (\mathrm{tzd}_{f_c} + 2 \times \mathrm{width} + 2)$$

4. The vector floating point multiply-add device of claim 2 adapted for multiple precision floating point operations, wherein the second operation module:

5. The vector floating point multiply-add device of claim 4, wherein the GRS logic module calculates the GRS value of the second output result by:

when the CPA adder performs subtraction, the GRS value of the second output result is the complement of the GRS value of the mantissa f_c of the shifted floating point number;

6. The vector floating point multiply-add device of claim 4 adapted for multiple precision floating point operations, wherein:

7. The vector floating point multiply-add device of claim 6, wherein the normalized left shift determines the shift value by:

8. The vector floating point multiply-add device of claim 6 wherein the negative result correction is a 1-add correction at the mantissa of the floating point number when the second output result is negative.

9. The vector floating point multiply-add device according to claim 6, wherein the calculating the G-bit, R-bit, S-bit values of the floating point number is determined by the following formula:

$$\mathrm{sticky} = (\mathrm{lzd}_{invf} + \mathrm{tzd}_{f}) < (2 \times \mathrm{width} + 3)$$

10. The vector floating point multiply-add device of claim 6 adapted for multiple precision floating point operations, wherein the fourth operation module: