CN116643718A

CN116643718A - Floating-point fused multiply-add device and method with pipeline structure, and processor

Info

Publication number: CN116643718A
Application number: CN202310721698.5A
Authority: CN
Inventors: 马思杰; 冯春阳; 李坤; 刘刚
Original assignee: Beijing Hexin Digital Technology Co ltd; Hexin Technology Co ltd
Current assignee: Beijing Hexin Digital Technology Co ltd; Hexin Technology Co ltd
Priority date: 2023-06-16
Filing date: 2023-06-16
Publication date: 2023-08-25
Anticipated expiration: 2043-06-16
Also published as: CN116643718B

Abstract

The invention relates to the technical field of digital signal processing, and in particular to a floating-point fused multiply-add device and method with a pipeline structure, and a processor. The device comprises: a first pipeline stage that encodes the mantissas of the multiplication operands; a second pipeline stage that performs two-level compression of the multiplier mantissa partial products; a third pipeline stage comprising a parallel dual-path addition processing module, a leading zero prediction parallel error correction module and a special leading zero prediction error correction module; a fourth pipeline stage that implements the normalization left shift, the denormal left shift and the right shift; and a fifth pipeline stage that selects, in parallel, the sign bit, exponent and rounded mantissa for the different precisions to obtain the floating-point result. By implementing fused multiply-add in five pipeline stages, the invention shortens the critical path of the overall structure and allows the pipeline to be divided evenly, so that no timing slack is wasted and high-performance computation is achieved.

Description

Floating-point fused multiply-add device and method with pipeline structure, and processor

Technical Field

The present invention relates to the field of digital signal processing, and in particular to a floating-point fused multiply-add device and method with a pipeline structure, and a processor.

Background

The floating-point fused multiply-add (FMA) operation computes expressions of the form A×C+B and covers addition, subtraction, multiplication and multiply-add/subtract in floating-point computation: C is set to 1 to compute an addition or subtraction, and B is set to 0 to compute a multiplication. Multiple operations can therefore be implemented with a single computation unit. As one of the core operations of a high-performance processor, FMA has a great influence on the floating-point performance of the whole processor.
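The difference between a fused operation and a separate multiply followed by an add can be illustrated with a short Python sketch. It models an ideal FMA with `fractions.Fraction` (exact product and sum, then a single rounding to double) and shows how setting C = 1 or B = 0 reduces it to plain addition or multiplication. The function name `fma_exact` is illustrative only and is not part of the described device.

```python
from fractions import Fraction

def fma_exact(a: float, c: float, b: float) -> float:
    """Return a*c + b evaluated exactly, then rounded once to a double.

    Fraction holds the exact value; float() on a Fraction rounds to nearest.
    """
    return float(Fraction(a) * Fraction(c) + Fraction(b))

a = 1.0 + 2.0**-30
c = 1.0 - 2.0**-30
b = -1.0

fused = fma_exact(a, c, b)   # exact product is 1 - 2**-60, so this is -2**-60
separate = a * c + b         # a*c first rounds to 1.0, so this is 0.0

# The same A*C+B unit covers plain addition (C = 1) and multiplication (B = 0).
x, y = 0.1, 0.2
add_via_fma = fma_exact(x, 1.0, y)   # same result as x + y
mul_via_fma = fma_exact(x, y, 0.0)   # same result as x * y
```

The separate form loses the low-order bits of the product before the addition; the fused form keeps them until the single final rounding, which is why FMA hardware carries the full-width product into the adder.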

However, although research on multipliers and adders is mature and further large performance gains from them alone are difficult to achieve, existing floating-point fused multiply-add algorithms are complex, with long logic delay and large area, which lowers the execution speed of the fused multiply-add operation. Research on high-performance floating-point fused multiply-add units therefore has broad application value and practical significance, and the main difficulty at present is how to build a high-performance floating-point multiply-add unit from existing multipliers and adders.

Disclosure of Invention

The invention aims to provide a floating-point fused multiply-add device and method with a pipeline structure, and a processor, which implement a high-performance floating-point multiply-add device using existing multipliers and adders and increase the execution speed of the floating-point fused multiply-add operation.

To solve the above technical problem, the invention provides a floating-point fused multiply-add device and method with a pipeline structure, and a processor.

In a first aspect, the present invention provides a floating-point fused multiply-add device with a pipeline structure, the device comprising a first pipeline stage, a second pipeline stage, a third pipeline stage, a fourth pipeline stage and a fifth pipeline stage, wherein the third pipeline stage comprises a parallel dual-path addition processing module, a leading zero prediction parallel error correction module and a special leading zero prediction error correction module;

the first pipeline stage is configured to, in response to the input of floating-point operands, split each floating-point operand into its sign bit, exponent and mantissa, and encode the mantissas of the multiplication operands to obtain multiplier mantissa partial products, wherein the floating-point operands include multiplication operands and an addition operand;

the second pipeline stage is configured to perform two-level compression on the multiplier mantissa partial products, perform an alignment shift on the mantissa of the addition operand, and perform a leading zero operation on the mantissas of the floating-point operands;

the third pipeline stage is configured to process the two-level-compressed multiplier mantissa partial products and the alignment-shifted mantissa of the addition operand with a leading zero prediction parallel error correction tree to obtain a leading zero prediction parallel error correction result, and to perform a special leading zero prediction error correction operation on the floating-point operand mantissas after the leading zero operation to obtain a special leading zero prediction error correction result;

the fourth pipeline stage is configured to obtain a leading zero prediction result from the leading zero prediction parallel error correction result and the special leading zero prediction error correction result, and, according to the leading zero prediction result, to perform the mantissa normalization shift and the mantissa denormal left shift on the addition result in parallel to obtain a shift mantissa, wherein the addition result is generated by adding the addition operand and the product of the multiplication operands;

and the fifth pipeline stage is configured to round the shift mantissa to obtain a mantissa rounding result and, at the same time, to select among the mantissa rounding results for the different precisions so as to produce the final floating-point result.

In a further embodiment, the second pipeline stage comprises:

a partial product two-level compression module, configured to input the multiplier mantissa partial products into a first-level partial product compressor to obtain a partial product zero and intermediate partial products, and to input the intermediate partial products into a second-level partial product compressor to obtain a pseudo sum signal and a pseudo carry signal;

an addend mantissa processing module, configured to perform an alignment shift on the mantissa of the addition operand according to a shift amount to obtain an addend shift value, determine the operation mode according to the sign bits and the operation type, and process the addend shift value according to the operation mode to obtain an addend partial product, wherein the shift amount is the exponent difference obtained by subtracting the exponent of the addition operand from the exponent sum of the multiplication operands;

a leading zero module, configured to determine the number of leading zeros of the mantissas and to perform a leading zero operation on the mantissas of the floating-point operands based on that number, to obtain a floating-point mantissa leading zero result;

and an exception processing module, configured to perform exception processing on the NaNs, infinities and some of the zeros obtained when the first pipeline stage splits the floating-point operands, to obtain an exception processing result.

In a further embodiment, the third pipeline stage further comprises a compression encoding module;

the compression encoding module is configured to compress and then encode the addend partial product, the partial product zero, the pseudo sum signal and the pseudo carry signal to obtain a compression-encoded signal;

the dual-path addition processing module is configured to input the compression-encoded signal into a dual-path adder to obtain two intermediate addition results;

the leading zero prediction module is configured to perform leading zero prediction on the compression-encoded signal to obtain a leading zero value;

the leading zero prediction parallel error correction module is configured to input the compression-encoded signal into the leading zero prediction parallel error correction tree to obtain the leading zero prediction parallel error correction result, which comprises a positive error correction signal and a negative error correction signal;

the special leading zero prediction error correction module is configured to left-shift the floating-point mantissa leading zero result and then apply a preset special leading zero prediction error correction rule to it, to obtain the special leading zero prediction error correction result.

In a further embodiment, the special leading zero prediction error correction rule is specifically:

when the mantissa of the addition operand is alignment-shifted and the addition operand is shifted beyond the preset computation bit range, the least significant bit of the addition result is compensated.

In a further embodiment, the fourth pipeline stage comprises:

a mantissa shifting module, configured to obtain the leading zero prediction result from the leading zero prediction parallel error correction result, the leading zero value and the special leading zero prediction error correction result; to process the intermediate addition results according to their sign to obtain the addition result; to left-shift the addition result for normalization according to the leading zero prediction result to obtain a normalized left-shift mantissa; and, at the same time, to perform a denormal left shift and a right shift on the addition result, based on a minimum exponent difference determined in the third pipeline stage from the exponent difference, to obtain a denormal left-shift mantissa and a right-shift mantissa;

a mantissa selection module, configured to select among the normalized left-shift mantissa, the denormal left-shift mantissa and the right-shift mantissa based on the minimum exponent difference to obtain the shift mantissa;

and a sign bit selection module, configured to compute the multiply-add sign from the intermediate addition results to obtain a rounding sign bit, wherein the multiply-add sign is obtained by XORing the sign bits produced when the first pipeline stage splits the floating-point operands.

In a further embodiment, the mantissa selection module is specifically configured to:

generate a mantissa selection control signal according to the minimum exponent difference, the mantissa selection control signal being either a normalized control signal or a non-normalized control signal;

if the mantissa selection control signal is the normalized control signal, select the normalized left-shift mantissa as the shift mantissa;

if the mantissa selection control signal is the non-normalized control signal and the current operands are single-precision or double-precision data, select the denormal left-shift mantissa as the shift mantissa;

and if the mantissa selection control signal is the non-normalized control signal and the current operands are single-precision data in double-precision format, select the right-shift mantissa as the shift mantissa.
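The three selection rules above amount to a small multiplexer. The following Python sketch writes them as a plain function so the decision table is explicit. The enum `Fmt`, the function `select_shift_mantissa` and its parameter names are hypothetical, and the derivation of the control signal from the minimum exponent difference is not detailed in the text, so the sketch takes it as an input.

```python
from enum import Enum

class Fmt(Enum):
    SINGLE = "single"
    DOUBLE = "double"
    SINGLE_IN_DOUBLE = "single-in-double"   # the special format of Table 3

def select_shift_mantissa(normalized_ctrl: bool, fmt: Fmt,
                          norm_left: int, denorm_left: int, right: int) -> int:
    """Choose the shift mantissa according to the three rules above.

    normalized_ctrl is True for the normalized control signal and False
    for the non-normalized control signal.
    """
    if normalized_ctrl:
        return norm_left                    # normalized left-shift mantissa
    if fmt in (Fmt.SINGLE, Fmt.DOUBLE):
        return denorm_left                  # denormal left-shift mantissa
    return right                            # single-in-double, non-normalized
```

In hardware all three candidate mantissas are produced in parallel in the fourth pipeline stage, so only this final selection lies on the path after the shifters.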

In a further embodiment, the fifth pipeline stage comprises:

a rounding module, configured to round the shift mantissa to obtain the mantissa rounding result;

and a precision selection module, configured to select, simultaneously and in parallel, the mantissa rounding result, the rounding sign bit and the exponent for each precision, and to screen out the final floating-point result according to the exception processing result, wherein the mantissa selection comprises some or all of single-precision mantissa selection, single-precision-in-double-format mantissa selection and double-precision mantissa selection; the sign bit selection comprises some or all of single-precision sign bit selection, single-precision-in-double-format sign bit selection and double-precision sign bit selection; and the exponent selection comprises some or all of single-precision exponent selection, single-precision-in-double-format exponent selection and double-precision exponent selection.
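The text does not name the rounding mode. The Python sketch below assumes IEEE-754 round-to-nearest-even and shows what rounding and per-precision selection do to a normalized shift mantissa: the same wide mantissa is rounded to 24 significand bits for single precision and single-in-double, and to 53 bits for double precision. The names `round_to_precision`, `PRECISION_BITS` and `round_all_precisions` are hypothetical.

```python
def round_to_precision(mant: int, width: int, p: int) -> tuple[int, int]:
    """Round a normalized width-bit mantissa (MSB set) to p bits, ties to even.

    Returns (rounded p-bit mantissa, exponent increment). The increment is 1
    when rounding carries out of the top bit (1.11...1 -> 10.00...0).
    """
    assert width > p and mant >> (width - 1) == 1, "expects a normalized mantissa wider than p"
    drop = width - p
    kept = mant >> drop
    rest = mant & ((1 << drop) - 1)       # guard bit and everything below it
    half = 1 << (drop - 1)                # weight of the guard bit alone
    if rest > half or (rest == half and kept & 1):
        kept += 1                         # round up; exact ties go to even
    if kept >> p:                         # carry out of the top bit
        return kept >> 1, 1               # renormalize and bump the exponent
    return kept, 0

# Significand widths, hidden bit included.
PRECISION_BITS = {"single": 24, "single-in-double": 24, "double": 53}

def round_all_precisions(mant: int, width: int) -> dict[str, tuple[int, int]]:
    # The hardware evaluates the precisions in parallel and then selects one;
    # this sketch simply computes all of them.
    return {fmt: round_to_precision(mant, width, p) for fmt, p in PRECISION_BITS.items()}
```

For example, with an 8-bit mantissa rounded to 4 bits, `0b1011_1000` is an exact tie with an odd kept part and rounds up to `0b1100`, while `0b1111_1001` rounds up past the top bit and returns `(0b1000, 1)`.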

In a second aspect, the present invention provides a floating-point fused multiply-add method with a pipeline structure, the method comprising the following steps:

in response to the input of floating-point operands, splitting each floating-point operand into its sign bit, exponent and mantissa, and encoding the mantissas of the multiplication operands to obtain multiplier mantissa partial products, wherein the floating-point operands include multiplication operands and an addition operand;

performing two-level compression on the multiplier mantissa partial products, performing an alignment shift on the mantissa of the addition operand, and performing a leading zero operation on the mantissas of the floating-point operands;

performing a parallel dual-path addition operation, a leading zero prediction parallel error correction operation and a special leading zero prediction error correction operation, wherein the leading zero prediction parallel error correction operation processes the two-level-compressed multiplier mantissa partial products and the alignment-shifted mantissa of the addition operand with a leading zero prediction parallel error correction tree to obtain a leading zero prediction parallel error correction result, and the special leading zero prediction error correction operation is performed on the floating-point operand mantissas after the leading zero operation to obtain a special leading zero prediction error correction result;

obtaining a leading zero prediction result from the leading zero prediction parallel error correction result and the special leading zero prediction error correction result, and, according to the leading zero prediction result, performing the mantissa normalization shift and the mantissa denormal left shift on the addition result in parallel to obtain a shift mantissa, wherein the addition result is generated by adding the addition operand and the product of the multiplication operands;

and rounding the shift mantissa to obtain a mantissa rounding result and, at the same time, selecting among the mantissa rounding results for the different precisions to produce the final floating-point result.

In a third aspect, the present invention further provides a processor comprising a processor body and the floating-point fused multiply-add device described above, disposed in the processor body.

In a fourth aspect, the present invention further provides a computer device comprising a processor and a memory connected to the processor, wherein the processor comprises a processor body and the floating-point fused multiply-add device described above, disposed in the processor body.

The invention provides a floating-point fused multiply-add device and method with a pipeline structure, and a processor. The device comprises five pipeline stages: the first pipeline stage splits the floating-point operands into sign bits, exponents and mantissas and encodes the mantissas of the multiplication operands to obtain multiplier mantissa partial products; the second pipeline stage performs the alignment shift, the two-level partial product compression and the leading zero operation; the third pipeline stage comprises a parallel dual-path addition processing module, a leading zero prediction parallel error correction module and a special leading zero prediction error correction module, and performs the leading zero prediction parallel error correction operation and the special leading zero prediction error correction operation; the fourth pipeline stage performs the normalization shift, the denormal left shift and the right shift; and the fifth pipeline stage performs rounding and precision selection. Compared with the prior art, the device performs the leading zero prediction parallel error correction and the special leading zero prediction error correction in the third pipeline stage, and adds a denormal left shifter and a denormal right shifter in the fourth pipeline stage. As a result, the device can handle special data formats and execute logic in parallel, and the execution time of the pipeline stages is balanced, which further increases the execution speed of the floating-point fused multiply-add device.

Drawings

FIG. 1 is a block diagram of a floating-point fused multiply-add device with a pipeline structure according to an embodiment of the present invention;

FIG. 2 is a schematic block diagram of a specific example of the floating-point fused multiply-add device according to an embodiment of the present invention;

FIG. 3 is a schematic diagram of the alignment mode of the floating-point fused multiply-add device according to an embodiment of the present invention;

FIG. 4 is a schematic diagram of the second pipeline stage according to an embodiment of the present invention;

FIG. 5 is a flowchart of a floating-point fused multiply-add method with a pipeline structure according to an embodiment of the present invention;

FIG. 6 is a schematic structural diagram of a computer device according to an embodiment of the present invention.

Detailed Description

The following examples are given for illustration only and are not to be construed as limiting the invention. The drawings are for reference and description only and do not limit the scope of protection, since many variations are possible without departing from the spirit and scope of the invention.

Referring to FIG. 1, an embodiment of the present invention provides a floating-point fused multiply-add device with a pipeline structure. As shown in FIG. 1, the device comprises a first pipeline stage 1, a second pipeline stage 2, a third pipeline stage 3, a fourth pipeline stage 4 and a fifth pipeline stage 5.

In this embodiment, the first pipeline stage 1 comprises a data splitting module 11 and an encoding module 12. In response to the input of floating-point operands, the data splitting module 11 splits each operand into its sign bit, exponent and mantissa; the floating-point operands include multiplication operands and an addition operand. The encoding module 12 encodes the mantissas of the multiplication operands to obtain the multiplier mantissa partial products.

This embodiment is compatible with the IEEE-754 standard and additionally supports a special single-precision floating-point number in double-precision format, as well as some special operations specified by the instruction set, such as adding a fixed value to the exponent on underflow and subtracting a fixed value from the exponent on overflow in the fifth pipeline stage shown in FIG. 2. The floating-point precisions defined by the IEEE-754 standard and the corresponding data types are as follows:

The IEEE-754 standard defines 32-bit single-precision floating-point numbers as shown in Table 1:

TABLE 1

Sign	Exponent	Mantissa
1 bit	8 bits	23 bits

It defines 64-bit double-precision floating-point numbers as shown in Table 2:

TABLE 2

Sign	Exponent	Mantissa
1 bit	11 bits	52 bits

This embodiment also supports a special single-precision floating-point number in double-precision format, as shown in Table 3:

TABLE 3

Sign	Exponent	Mantissa	Padding
1 bit	11 bits	23 bits	29 bits, all 0

Although this format is 64 bits wide, it represents only single-precision floating-point numbers, so its exponent takes only the values 897 to 1150, whereas the single-precision exponent field ranges over 0-255 and the double-precision exponent field over 0-2047;
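A consequence of Tables 2 and 3 (an observation, not a statement from the text) is that, for a normal single-precision value, the Table 3 layout coincides bit for bit with the IEEE-754 double encoding of that value: the exponent is rebiased from 127 to 1023 and the 23 fraction bits are followed by 29 zeros. The Python sketch below builds the layout with the standard `struct` module and shows where the 897-1150 exponent range comes from (1023 - 126 and 1023 + 127). The function names are hypothetical, and denormal single-precision inputs are deliberately rejected.

```python
import struct

def f32_fields(x: float) -> tuple[int, int, int]:
    """Round x to float32 and return (sign, 8-bit exponent, 23-bit fraction)."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return bits >> 31, (bits >> 23) & 0xFF, bits & 0x7FFFFF

def single_in_double_bits(x: float) -> int:
    """Pack a normal float32 value in the Table 3 layout:
    1 sign bit | 11-bit exponent | 23-bit mantissa | 29 zero bits."""
    s, e8, m23 = f32_fields(x)
    if not 0 < e8 < 0xFF:
        raise ValueError("sketch covers normal single-precision values only")
    e11 = e8 - 127 + 1023                         # rebias from 127 to 1023
    return (s << 63) | (e11 << 52) | (m23 << 29)  # low 29 bits stay 0
```

With e8 ranging over the normal values 1 to 254, e11 ranges over 897 to 1150, which matches the exponent range stated above.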

The data types of floating-point numbers are shown in Table 4:

TABLE 4

Data type	Exponent	Mantissa	Hidden bit
NaN	All 1	Not all 0	Undefined
Infinity	All 1	All 0	Undefined
Zero	All 0	All 0	0
Denormalized number	All 0	Not all 0	0
Normalized number	Not all 1 and not all 0	Any	1
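Table 4 can be read directly as a classification rule on the exponent and mantissa fields. The Python sketch below applies it to IEEE-754 doubles using the standard `struct` module; the function name `classify_double` is hypothetical, and only the double-precision field widths are shown.

```python
import struct

def classify_double(x: float) -> str:
    """Classify a double by its exponent and mantissa fields, per Table 4."""
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    exp = (bits >> 52) & 0x7FF            # 11-bit exponent field
    frac = bits & ((1 << 52) - 1)         # 52-bit mantissa field
    if exp == 0x7FF:                      # exponent all 1
        return "NaN" if frac else "infinity"
    if exp == 0:                          # exponent all 0, hidden bit 0
        return "denormalized" if frac else "zero"
    return "normalized"                   # hidden bit 1
```

NaN, infinity and some of the zero cases are the classes that the device routes to exception processing rather than through the normal datapath.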

For ease of understanding, this embodiment is described using an A×C+B floating-point multiply-add operation performed in the floating-point fused multiply-add device. As shown in FIG. 2, when three floating-point operands A, B and C are input, the multiplication operand C is set to 1 if the current operation is an addition or subtraction, and the addition operand B is set to 0 if the current operation is a multiplication. The data splitting module then marks the data type according to the precision of each input operand and splits the operand into three parts: sign bit, exponent and mantissa. The sign bit is taken directly from the most significant bit. The exponent is rebiased from the bias of its original precision to a 13-bit bias: for single precision, 0x7F is subtracted and 0xFFF is added; for double precision, 0x3FF is subtracted and 0xFFF is added (all values hexadecimal). The mantissa is extended with the hidden bit according to the data type and unified to a width of 53 bits: for double precision, the hidden bit followed by the 52 mantissa bits; for single precision, the hidden bit followed by the 23 mantissa bits and 29 zero bits. Note that in FIG. 2 the multiplier is the multiplication operand and the addend is the addition operand.
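The splitting step above can be sketched in Python as follows: the exponent is rebiased to the 13-bit bias 0xFFF and the mantissa is widened to 53 bits with the hidden bit made explicit. The text does not say how denormal exponents are adjusted, so this sketch assumes the usual IEEE-754 convention of treating a zero exponent field as 1 for denormals; NaN and infinity are rejected because they take the exception path. All function names are hypothetical, and `value_of` exists only to check that the split preserves the value.

```python
import struct
from fractions import Fraction

BIAS13 = 0xFFF  # unified 13-bit exponent bias from the text

def split_double(x: float) -> tuple[int, int, int]:
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    sign, e, frac = bits >> 63, (bits >> 52) & 0x7FF, bits & ((1 << 52) - 1)
    if e == 0x7FF:
        raise ValueError("NaN/infinity take the exception path")
    hidden = 1 if e else 0
    e_eff = e if e else 1            # assumption: denormals use exponent field 1
    return sign, e_eff - 0x3FF + BIAS13, (hidden << 52) | frac

def split_single(x: float) -> tuple[int, int, int]:
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign, e, frac = bits >> 31, (bits >> 23) & 0xFF, bits & ((1 << 23) - 1)
    if e == 0xFF:
        raise ValueError("NaN/infinity take the exception path")
    hidden = 1 if e else 0
    e_eff = e if e else 1            # same denormal assumption as above
    # hidden bit + 23 mantissa bits + 29 zero bits = 53 bits
    return sign, e_eff - 0x7F + BIAS13, ((hidden << 23) | frac) << 29

def value_of(sign: int, exp13: int, mant53: int) -> Fraction:
    """Exact value (-1)^sign * mant53 * 2^(exp13 - 0xFFF - 52)."""
    v = Fraction(mant53) * Fraction(2) ** (exp13 - BIAS13 - 52)
    return -v if sign else v
```

Because both precisions end up with the same 53-bit mantissa and the same 13-bit exponent bias, `split_single(1.0)` and `split_double(1.0)` both return `(0, 0xFFF, 1 << 52)`, and the rest of the datapath can treat the operands uniformly.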

After splitting, NaNs, infinities and some zeros are sent to the second pipeline stage for exception processing, while the remaining zeros, denormalized numbers and normalized numbers are computed normally. The split sign bit and exponent of the addition operand are passed directly to the second pipeline stage, and the exponent sum of the multiplication operands is computed as:

f = exp(A) + exp(C) + 56 - λ

where f is the exponent sum of the multiplication operands; exp(A) and exp(C) are the exponents of the multiplication operands A and C; and λ is the bias, whose value is 0xFFF. The term +56 is an implementation detail of the algorithm: it compensates for the 56-bit left pre-shift of the addition operand B described below with reference to FIG. 3.
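Because exp(A) + exp(C) carries the bias λ twice, subtracting λ once leaves f in the same biased form as a single exponent, and the +56 accounts for the addend's pre-shift. The Python sketch below evaluates the formula together with the addend right-shift amount f - exp(B) used in the second pipeline stage. The helper names and the example exponents are hypothetical.

```python
LAMBDA = 0xFFF  # the 13-bit bias λ

def biased(e: int) -> int:
    """Convert an unbiased exponent to the 13-bit biased form."""
    return e + LAMBDA

def product_exponent_sum(exp_a: int, exp_c: int) -> int:
    """f = exp(A) + exp(C) + 56 - λ, with all exponents in biased form."""
    return exp_a + exp_c + 56 - LAMBDA

def alignment_shift(exp_a: int, exp_c: int, exp_b: int) -> int:
    """Right-shift amount for the addend: exponent sum minus exp(B)."""
    return product_exponent_sum(exp_a, exp_c) - exp_b

# Example: unbiased exponents 3 and -1 give a product exponent of 2.
f = product_exponent_sum(biased(3), biased(-1))        # biased(2) + 56
d_equal = alignment_shift(biased(3), biased(-1), biased(2))
d_small = alignment_shift(biased(3), biased(-1), biased(-8))
```

An addend whose exponent equals the product's exponent receives a right shift of exactly 56, which cancels the 56-bit pre-shift; each unit by which the addend's exponent is smaller adds one more position of right shift.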

Meanwhile, the split mantissas are passed directly to the multiplication: the mantissas of the multiplication operands are encoded in the first pipeline stage, and the resulting multiplier mantissa partial products are passed to the second pipeline stage once encoding is complete.
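The text later states that Booth encoding produces 27 partial products. That count matches radix-4 (modified) Booth recoding of an unsigned 53-bit mantissa, which needs ⌊53/2⌋ + 1 = 27 digits, so the Python sketch below assumes radix-4. Each digit in {-2, -1, 0, 1, 2} selects one partial product. The function names are hypothetical, and Python's signed integers stand in for the inverted copies and correction bits that hardware typically uses for negative digits.

```python
def booth_radix4_digits(y: int, n: int) -> list[int]:
    """Radix-4 Booth recoding of an unsigned n-bit multiplier y.

    Returns digits d_i in {-2, -1, 0, 1, 2} with y == sum(d_i * 4**i).
    Each digit looks at the overlapping bit triple (y[2i+1], y[2i], y[2i-1]),
    with y[-1] = 0 and zero extension above bit n-1.
    """
    groups = n // 2 + 1                    # one extra group keeps y unsigned
    digits = []
    for i in range(groups):
        b_hi = (y >> (2 * i + 1)) & 1
        b_mid = (y >> (2 * i)) & 1
        b_lo = (y >> (2 * i - 1)) & 1 if i > 0 else 0
        digits.append(-2 * b_hi + b_mid + b_lo)
    return digits

def partial_products(x: int, y: int, n: int) -> list[int]:
    """One partial product per Booth digit: d_i * x, weighted by 4**i."""
    return [d * x * 4**i for i, d in enumerate(booth_radix4_digits(y, n))]
```

For 53-bit mantissas, this halves the 53 partial products of a plain array multiplier to 27, which is what the two-level compression in the second pipeline stage then reduces.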

The second pipeline stage 2 is configured to perform two-level compression on the multiplier mantissa partial products, perform an alignment shift on the mantissa of the addition operand, and perform a leading zero operation on the mantissas of the floating-point operands. In this embodiment, the second pipeline stage 2 comprises a partial product two-level compression module 21, an addend mantissa processing module 22, a leading zero module 23 and an exception processing module 24, whose functions are as follows:

the partial product two-level compression module 21 is configured to input the multiplier mantissa partial products into the first-level partial product compressor to obtain a partial product zero and intermediate partial products, and to input the intermediate partial products into the second-level partial product compressor to obtain a pseudo sum signal and a pseudo carry signal;

the addend mantissa processing module 22 is configured to perform an alignment shift on the mantissa of the addition operand according to a shift amount to obtain an addend shift value, determine the operation mode according to the sign bits and the operation type, and process the addend shift value according to the operation mode to obtain an addend partial product, wherein the shift amount is the exponent difference obtained by subtracting the exponent of the addition operand from the exponent sum of the multiplication operands;

the leading zero module 23 is configured to determine the number of leading zeros of the mantissas and to perform the leading zero operation on the mantissas of the floating-point operands based on that number, to obtain the floating-point mantissa leading zero result;

the exception processing module 24 is configured to perform exception processing on the NaNs, infinities and some of the zeros obtained when the first pipeline stage splits the floating-point operands, to obtain an exception processing result.

As shown in FIG. 3, the alignment process of the fused multiply-add device in the second pipeline stage is as follows. The product is 106 bits wide, with the binary point after its second bit. Because this embodiment shifts the addition operand B left by 56 bits in advance, which is equivalent to shifting the product right by 56 bits, the first pipeline stage must add an offset of 56 when computing the exponent sum of the multiplication operands. Alignment in this embodiment follows the principle of aligning the smaller exponent to the larger one; because B has already been shifted left by 56 bits, after alignment B only ever needs to be shifted right, and the right-shift amount is the exponent difference obtained by subtracting the exponent of the addition operand from the exponent sum of the multiplication operands (in FIGS. 2 and 3, the addend is the addition operand). This embodiment places a right shifter in the addend mantissa processing module of the second pipeline stage. After the shift, the module determines from the sign bits and the operation type which operation follows, that is, whether an addition or a subtraction is to be performed. For an addition, the shifted addition operand B is written directly into the fourth partial product unit d, through which it enters the third pipeline stage; for a subtraction, B is first inverted and then written into unit d. The inversion is the first step of forming the two's complement: only the bit inversion is performed here, and the remaining adjustment is decided after the addition, in the fourth pipeline stage.
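The alignment and inversion steps can be sketched in Python as below. Two points are assumptions: the datapath width `W = 163` is chosen only because the text mentions a 163-bit computation range (the actual register width is not stated), and the effective-subtraction rule `(sA xor sC) != sB` is the usual one for plain A×C+B, whereas the device also takes the operation type into account. The sketch ignores bits shifted out on the right (no sticky bit) and assumes a non-negative shift. All names are hypothetical.

```python
W = 163                 # hypothetical width, echoing the 163-bit range in the text
MASK = (1 << W) - 1

def effective_subtraction(sign_a: int, sign_c: int, sign_b: int) -> bool:
    """For plain A*C+B: subtract when the product sign differs from B's sign."""
    return (sign_a ^ sign_c) != sign_b

def aligned_addend(mant_b: int, shift: int, subtract: bool) -> int:
    """Pre-shift B left by 56, shift it right by the exponent difference, and
    invert it (one's complement) for an effective subtraction. The +1 that
    completes the two's complement is supplied later in the datapath."""
    v = ((mant_b << 56) >> shift) & MASK
    return (~v & MASK) if subtract else v
```

The identity behind deferring the +1 is x + (~v mod 2^W) + 1 ≡ x - v (mod 2^W), so inverting now and adding the carry-in later yields the same difference without an extra adder in the second pipeline stage.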

As shown in FIG. 4, in the partial product two-level compression module, the 27 partial products produced by Booth encoding in the first pipeline stage of this embodiment first pass through the first-level partial product compressor. The partial product zero obtained there is sent to the third partial product unit c, and the remaining intermediate partial products are passed on to the second-level partial product compressor, whose pseudo sum signal and pseudo carry signal are sent to the second partial product unit b and the first partial product unit a. In this embodiment, the first-level and second-level partial product compressors preferably use CSA42 4-2 compressors, where CSA denotes a carry-save adder and a CSA42 uses carry-save adders to compress four numbers into two. Those skilled in the art may use other compressors depending on the implementation, and the embodiments of the present invention are not limited in this respect; notably, replacing the 4-2 compressors with 3-2 compressors has essentially no effect on the execution time of the second pipeline stage. Meanwhile, the exception processing module of the second pipeline stage processes the NaNs, infinities and some zeros split off in the first pipeline stage, covering NaN exceptions, infinite multipliers, infinite addends and the like. The second pipeline stage also performs the leading zero operation on the 53-bit mantissas of the three input floating-point operands, for the subsequent leading zero error correction in special cases.
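The carry-save principle behind the CSA42 compressors can be sketched in Python with integers as bit vectors: a 3:2 carry-save adder turns three operands into a sum word and a carry word with the same total, and a 4:2 compressor chains two of them. The text does not specify the grouping of operands (for example, which partial product becomes partial product zero), so the reduction schedule in `compress_tree` is illustrative only and does not reproduce the exact two-level structure of FIG. 4. All names are hypothetical.

```python
def csa32(a: int, b: int, c: int) -> tuple[int, int]:
    """3:2 carry-save adder: (sum, carry) with sum + carry == a + b + c."""
    s = a ^ b ^ c
    carry = ((a & b) | (a & c) | (b & c)) << 1
    return s, carry

def csa42(a: int, b: int, c: int, d: int) -> tuple[int, int]:
    """4:2 compressor built from two 3:2 stages: four operands in, two out."""
    s1, c1 = csa32(a, b, c)
    return csa32(s1, c1, d)

def compress_tree(operands: list[int]) -> tuple[int, int]:
    """Reduce non-negative operands to a (pseudo sum, pseudo carry) pair,
    using 4:2 compressors where possible and a 3:2 stage for a remainder of 3."""
    ops = list(operands)
    while len(ops) > 2:
        nxt = []
        while len(ops) >= 4:
            nxt.extend(csa42(*ops[:4]))
            ops = ops[4:]
        if len(ops) == 3:
            nxt.extend(csa32(*ops))
            ops = []
        nxt.extend(ops)          # 0, 1 or 2 leftovers pass through unchanged
        ops = nxt
    ops += [0] * (2 - len(ops))
    return ops[0], ops[1]
```

No carry propagates across bit positions inside the compressors, which is why the delay of each level is independent of the operand width; the single carry-propagating addition is deferred to the adder in the third pipeline stage.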

Meanwhile, in the second pipeline stage, once the exponent difference between the exponent of the multiplication operands and the exponent of the addition operand has been obtained, this embodiment selects the larger exponent as the first exponent according to the sign of the exponent difference, and XORs the sign bits of the three floating-point operands split in the first pipeline stage to obtain the multiply-add sign.

In a pipeline, the address of the next instruction depends on the execution result of the previous instruction; if multiple tasks contend for the same pipeline stage in the same time period, the next instruction cannot be executed in its designed clock cycle and subsequent instructions must be stalled for several clock cycles, which reduces pipeline performance.

The third pipeline stage 3 is configured to process the two-level-compressed multiplier mantissa partial products and the alignment-shifted mantissa of the addition operand with the leading zero prediction parallel error correction tree to obtain the leading zero prediction parallel error correction result, and to perform the special leading zero prediction error correction operation on the floating-point operand mantissas after the leading zero operation to obtain the special leading zero prediction error correction result. In this embodiment, the third pipeline stage 3 comprises a compression encoding module 31, a parallel dual-path addition processing module 32, a leading zero prediction module 33, a leading zero prediction parallel error correction module 34 and a special leading zero prediction error correction module 35. Although modules 32, 33, 34 and 35 operate in parallel, they are listed in order of importance. The functions of the modules of the third pipeline stage are as follows:

The compression encoding module 31 sequentially compresses and encodes the addend partial product, partial product zero, the pseudo-sum signal and the pseudo-carry signal to obtain a compression-encoded signal;

The two-way addition processing module 32 feeds the compression-encoded signal into the dual-path adder to obtain two intermediate addition results;

The leading zero prediction module 33 performs leading zero prediction on the compression-encoded signal to obtain a leading zero value;

The leading zero prediction parallel error correction module 34 feeds the compression-encoded signal into a leading zero prediction parallel error correction tree to obtain a leading zero prediction parallel error correction result, which comprises a positive error correction signal and a negative error correction signal;

The special leading zero prediction error correction module 35 applies a preset special leading zero prediction error correction rule to sequentially left-shift and correct the floating-point mantissa leading zero result, obtaining a special leading zero prediction error correction result.

In this embodiment, the special leading zero prediction error correction rule is: when the mantissa of the addition operand is alignment-shifted and the shift exceeds the preset computation width, the least significant bit of the addition result is compensated. More specifically, the rule corrects the floating-point mantissa leading zero result under this special condition and generates a leading zero prediction error correction signal. The special condition arises when the addend is alignment-shifted beyond the 163-bit computation range; to preserve accuracy, the least significant bit of the addition result must then be compensated: 1 is added to the LSB for an addition, and 1 is subtracted from the LSB for a subtraction. In the subtraction case, if the result is 1.000…000 (all zeros after the binary point), subtracting 1 from the LSB turns it into 0.111…111 (all ones after the binary point), so the leading zero count changes from 0 to 1. Because this compensation is applied after the addition, the leading zero prediction parallel error correction tree cannot detect the error. This embodiment therefore adds parallel special-case error correction logic, which corrects this case through the special leading zero prediction error correction operation.
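The special case can be reproduced numerically with a short Python sketch. It assumes the 163-bit mantissa width stated for the third stage and places the leading 1 of 1.000…000 in the most significant bit. The helper names are hypothetical, and the sketch models only the LSB compensation and its effect on the leading zero count, not the correction logic of module 35 itself.

```python
# Hypothetical model of the LSB compensation and its effect on leading zeros.
WIDTH = 163  # mantissa datapath width stated for the third stage

def leading_zeros(x: int, width: int = WIDTH) -> int:
    """Leading zero count of a non-negative integer in a width-bit field."""
    return width - x.bit_length()

def compensate_lsb(result: int, effective_sub: bool, width: int = WIDTH):
    """Apply +1 (addition) or -1 (subtraction) at the LSB.

    Returns (adjusted_result, change_in_leading_zero_count).
    """
    before = leading_zeros(result, width)
    adjusted = result - 1 if effective_sub else result + 1
    adjusted &= (1 << width) - 1  # wrap to the fixed datapath width
    return adjusted, leading_zeros(adjusted, width) - before
```

`compensate_lsb(1 << 162, True)` returns `((1 << 162) - 1, 1)`: after the compensating subtraction the leading zero count grows by one, which is the error a prediction made before the compensation cannot see. The same input with `effective_sub=False` leaves the count unchanged.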

In the third-stage pipeline segment, the exponent path computes the difference between the first exponent and the minimum exponent of the current precision, i.e. the minimum exponent difference, which is used by the denormalized left shift and right shift in the fourth-stage pipeline segment. On the mantissa path, the four partial products produced by the second-stage pipeline segment (the addend partial product, partial product zero, the pseudo-sum signal and the pseudo-carry signal) undergo one further compression step and partial-product encoding and are then fed into the dual-path adder, the leading zero prediction module and the leading zero prediction parallel error correction tree. As shown in FIG. 4, the compressor of the third-stage pipeline segment is preferably a 4-2 compressor; those skilled in the art may use other compressors as appropriate, and the embodiment of the invention is not limited in this respect. The dual-path adder outputs two intermediate addition results: the normal result and the normal result plus 1. The leading zero prediction module outputs a leading zero value, and the leading zero prediction parallel error correction tree outputs positive and negative error correction signals. All mantissa computations in the third-stage pipeline segment are 163 bits wide. Meanwhile, the special leading zero prediction error correction module performs a 53-bit left shift and the special leading zero prediction error correction operation on the three 53-bit mantissa leading zero results output by the second-stage pipeline segment, generating the special leading zero prediction error correction result for the special case.
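The carry-save reduction and the two adder outputs can be modelled at the integer level as below. This is an illustrative sketch, not the patented circuit: the 4-2 compressor is modelled functionally as two cascaded 3:2 carry-save adders (a common equivalent formulation), operands are treated as unsigned 163-bit vectors, and all names are hypothetical.

```python
# Hypothetical functional model: 4-2 compression followed by a dual-path adder.
WIDTH = 163

def csa(x: int, y: int, z: int, mask: int):
    """3:2 carry-save adder: x + y + z == s + c (mod 2**width)."""
    s = x ^ y ^ z                                  # bitwise sum without carries
    c = ((x & y) | (x & z) | (y & z)) << 1         # majority bits become carries
    return s & mask, c & mask

def compress_4_2(a: int, b: int, c: int, d: int, width: int = WIDTH):
    """Reduce four addends to a (sum, carry) pair with the same total."""
    mask = (1 << width) - 1
    s1, c1 = csa(a, b, c, mask)
    return csa(s1, c1, d, mask)

def dual_add(s: int, c: int, width: int = WIDTH):
    """Return (s + c, s + c + 1), both truncated to the datapath width."""
    mask = (1 << width) - 1
    total = (s + c) & mask
    return total, (total + 1) & mask
```

For example, `compress_4_2(5, 9, 100, 7)` yields a (sum, carry) pair whose total is 121, and `dual_add` on that pair returns `(121, 122)`. Producing both results at once lets the later stage choose between them without another carry-propagate addition.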

In this embodiment, the third-stage pipeline segment implements the compression encoding operation, the leading zero prediction parallel error correction operation and the special-case leading zero prediction error correction operation. Implementing the special-case error correction as parallel logic simplifies the datapath while introducing as few errors as possible, which preserves computational accuracy; performing the error correction in parallel also speeds up this pipeline stage, achieving both high performance and high precision.

The fourth-stage pipeline segment 4 obtains a leading zero prediction result from the leading zero prediction parallel error correction result and the special leading zero prediction error correction result and, according to this result, performs in parallel a mantissa normalization shift and a mantissa denormalized left shift on the addition result produced from the alignment-shifted addend mantissa, yielding a shifted mantissa. The addition result is produced by adding the addition operand and the product of the multiplication operands. In this embodiment, the fourth-stage pipeline segment 4 includes a mantissa shift module 41, a mantissa selection module 42 and a sign bit selection module 43, whose functions are as follows:

The mantissa shift module 41 obtains the leading zero prediction result from the leading zero prediction parallel error correction result, the leading zero value and the special leading zero prediction error correction result; processes the intermediate addition results according to their sign to obtain the addition result; left-shifts the addition result by the leading zero prediction result to obtain a normalized left-shift mantissa; and, at the same time, left-shifts and right-shifts the addition result based on the minimum exponent difference, which the third-stage pipeline segment determines from the exponent difference, to obtain a denormalized left-shift mantissa and a denormalized right-shift mantissa respectively;

The mantissa selection module 42 selects among the normalized left-shift mantissa, the denormalized left-shift mantissa and the right-shift mantissa based on the minimum exponent difference to obtain the shifted mantissa;

The sign bit selection module 43 derives the rounding sign bit from the multiply-add sign and the intermediate addition results, where the multiply-add sign is obtained by exclusive-OR of the sign bits split from the floating-point operands in the first-stage pipeline segment.

In this embodiment, the mantissa selection module is specifically configured to: generate a mantissa selection control signal from the minimum exponent difference, the control signal being either a normalized control signal or a denormalized control signal; if it is a normalized control signal, select the normalized left-shift mantissa produced by the normalized left shifter as the shifted mantissa; if it is a denormalized control signal and the current operands are single-precision or double-precision data, select the denormalized left-shift mantissa produced by the denormalized left shifter as the shifted mantissa; and if it is a denormalized control signal and the current operands are single-precision data in double-precision format, select the right-shift mantissa produced by the right shifter as the shifted mantissa.
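The three-way selection can be expressed as a small decision function. The sketch below assumes that all three shifter outputs have already been computed in parallel and models only the final multiplexer. The rule in `is_normalized` (choose full normalization when the leading zero count does not exceed the minimum exponent difference, so the exponent stays in range) is an assumed, conventional criterion, and the enum label `SINGLE_IN_DOUBLE` is a hypothetical name for single-precision data in double-precision format.

```python
from enum import Enum

class Precision(Enum):
    SINGLE = 1
    DOUBLE = 2
    SINGLE_IN_DOUBLE = 3  # hypothetical label: single-precision data in double format

def is_normalized(lz_count: int, min_exp_diff: int) -> bool:
    """Assumed rule: a full normalizing shift keeps the exponent >= the minimum."""
    return lz_count <= min_exp_diff

def select_shift_mantissa(normalized: bool, precision: Precision,
                          norm_left: int, denorm_left: int, right: int) -> int:
    """Sketch of mantissa selection module 42: pick one of three shifter outputs."""
    if normalized:
        return norm_left          # normalized left shifter output
    if precision in (Precision.SINGLE, Precision.DOUBLE):
        return denorm_left        # denormalized left shifter output
    return right                  # right shifter output (single in double format)
```

Because the three shifters run in parallel, only this selection remains after the shift amounts are known, rather than a serial chain of shifts.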

Specifically, when an effective subtraction is performed, the fourth-stage pipeline segment must complete the addend to its two's complement, because the second-stage pipeline segment only bitwise-inverted the addend without adding 1. The fourth-stage pipeline segment therefore uses the sign of the two intermediate addition results to decide whether to add 1 or to negate, producing the final addition result. The leading zero prediction result is selected from the leading zero value using the control signals from the leading zero prediction parallel error correction tree and from the special-case leading zero prediction error correction, and is used as the normalized left-shift amount for the addition result. The denormalized left-shift and right-shift amounts are determined by the minimum exponent difference obtained in the third-stage pipeline segment. After shifting, the outputs of the normalized left shifter, the denormalized left shifter and the right shifter are selected three-way according to the case, giving the shifted mantissa. The shifters and computation results in the fourth-stage pipeline segment are all 163 bits wide. In addition, the leading zero prediction result is subtracted from the first exponent to obtain the result exponent, and the rounding sign bit is derived from the multiply-add sign and the sign of the intermediate addition results.
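The add-one-or-negate decision corresponds to the standard ones'-complement subtraction technique, sketched below under stated assumptions: the addend magnitude is only bitwise-inverted in stage 2, the two dual-adder outputs are sum and sum + 1, both magnitudes are below 2^162 so the top bit of the 163-bit datapath acts as a sign bit, and the function name is hypothetical. The returned flag indicates whether the result sign should be inverted relative to the product sign, which is one plausible input to the sign bit selection described above.

```python
# Hypothetical sketch of effective subtraction with an inverted-only addend.
WIDTH = 163

def effective_subtract(p: int, c: int, width: int = WIDTH):
    """Return (|p - c|, sign_flipped) using only the ones' complement of c.

    p and c are non-negative magnitudes below 2**(width - 1).
    """
    mask = (1 << width) - 1
    not_c = ~c & mask                  # stage 2: invert only, no +1
    x = (p + not_c) & mask             # dual adder "sum":     p - c - 1
    x_plus_1 = (x + 1) & mask          # dual adder "sum + 1": p - c
    negative = (x >> (width - 1)) & 1  # sign of p - c - 1
    if negative:
        # p <= c: the magnitude c - p equals the bitwise NOT of x.
        return ~x & mask, True
    # p > c: the "sum + 1" output is already the correct positive result.
    return x_plus_1, False
```

`effective_subtract(10, 3)` returns `(7, False)` and `effective_subtract(3, 10)` returns `(7, True)`; in neither case is a separate carry-propagate +1 addition needed once the sign is known.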

In this embodiment, a normalized left shifter, a denormalized left shifter and a right shifter are provided in the fourth-stage pipeline segment. The additional right shifter, working in parallel, handles single-precision data in double-precision format under a denormalized control signal, which shortens the critical path delay, reduces execution time and improves the performance of the floating-point fused multiply-add device.

The fifth-stage pipeline segment 5 rounds the shifted mantissa to obtain a mantissa rounding result and, at the same time, selects among the mantissas of different precisions to determine the final floating-point result. In this embodiment, the fifth-stage pipeline segment 5 includes a rounding module 51 and a precision selection module 52, whose functions are as follows:

The rounding module 51 performs a rounding operation on the shifted mantissa to obtain a mantissa rounding result;

The precision selection module 52 selects, simultaneously and in parallel, the mantissa rounding result, the rounding sign bit and the exponent for each precision, and determines the final floating-point result according to the exception handling result. Mantissa selection covers some or all of three cases: single-precision mantissa, single-precision-in-double-format mantissa, and double-precision mantissa. Sign bit selection covers some or all of the corresponding three sign bit cases, and exponent selection covers some or all of the corresponding three exponent cases. Note that overflow detection of the second exponent in the fifth-stage pipeline segment, and the adjustment of adding a fixed value to the exponent on underflow or subtracting a fixed value on overflow, are prior art in the field and are not described here.

Specifically, the fifth-stage pipeline segment feeds the 163-bit shifted mantissa and the sign bit into the rounding module, which outputs a 53-bit mantissa together with the rounding and inexact flag bits. These flags are fed into the floating-point status register bit generation module, and the result determines whether the rounding sign bit must be negated, from which the final sign bit is selected. Note that some instructions require negation after the computation completes; because rounding is affected by the sign bit, whether the rounding sign bit is negated is decided only after rounding with that sign bit is complete. The exponent computed in the fourth stage can be used for the final result, while the exponent plus the fixed value and the exponent minus the fixed value are computed at the same time. The sign, exponent and mantissa bits for the different precisions are then selected simultaneously in parallel, and the result of the conventional computation is selected according to the precision. Here, precision refers to 32-bit single-precision floating-point numbers, 64-bit double-precision floating-point numbers, and the special case of single-precision values in double-precision format.
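As an illustration of the rounding step, the sketch below rounds a wide, normalized mantissa to p bits with round-to-nearest-even and reports the inexact flag. It is a simplification under stated assumptions: only round-to-nearest-even is modelled (the directed, sign-dependent rounding modes mentioned above are omitted), the leading 1 is assumed to sit in the most significant bit of the input, and the function name is hypothetical. Choosing p = 24 or p = 53 would correspond to single and double precision respectively, e.g. `round_nearest_even(mant, 163, 53)` for the 163-bit datapath.

```python
def round_nearest_even(mant: int, width: int, p: int):
    """Round a width-bit mantissa (leading 1 in the MSB) to p bits.

    Returns (rounded, inexact, carry_out). A carry out of the MSB
    (e.g. 1.11...1 rounding up to 10.00...0) is renormalized and
    reported so that the exponent can be incremented.
    """
    drop = width - p
    if drop <= 0:
        raise ValueError("width must exceed the target precision p")
    kept = mant >> drop                      # the p most significant bits
    rem = mant & ((1 << drop) - 1)           # discarded bits
    half = 1 << (drop - 1)                   # guard bit set, rest zero
    inexact = rem != 0
    # Round up if above half, or exactly half and the kept LSB is odd (ties to even).
    if rem > half or (rem == half and (kept & 1)):
        kept += 1
    carry_out = kept >> p
    if carry_out:
        kept >>= 1                           # renormalize to p bits
    return kept, inexact, bool(carry_out)
```

With an 8-bit input rounded to 4 bits, `0b10011000` (a tie with an odd kept part) rounds up to `0b1010`, while `0b10001000` (a tie with an even kept part) stays at `0b1000`; both are flagged inexact.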

Compared with the prior art, which combines the sign, exponent and mantissa bits and then selects the mantissa according to precision, the fifth-stage pipeline segment uses the rounding module for mantissa rounding and selects the sign, exponent and mantissa bits for the different precisions simultaneously in parallel. It supports conversion among multiple data precisions, shortens the critical path of the overall structure, saves resources, speeds up computation and improves efficiency, yielding a floating-point fused multiply-add device with high precision, high performance and short timing.

The embodiment of the present application provides a floating-point fused multiply-add device with a pipeline structure comprising five pipeline segments. The first-stage segment splits the sign bits, exponents and mantissas of the floating-point operands and encodes the mantissas of the multiplication operands to obtain multiplier mantissa partial products; the second-stage segment performs the alignment shift, two-stage partial product compression and the leading zero operation; the third-stage segment performs parallel dual-path addition, leading zero prediction parallel error correction and special leading zero prediction error correction; the fourth-stage segment performs the normalized shift, the denormalized shift and the right shift; and the fifth-stage segment performs rounding and precision selection. Compared with the prior art, and given that process advances have reduced how strongly area constrains performance, this embodiment follows a parallelism-first principle: at a small cost in area, it adds the leading zero prediction parallel error correction operation and the special leading zero prediction error correction operation to the third-stage segment, and provides parallel normalized left, denormalized left and right shifters in the fourth-stage segment. This allows special data formats to be processed and improves accuracy, and logic that the prior art implements serially is implemented in parallel, which shortens the critical path of the overall structure, balances the pipeline segments, avoids wasted timing slack and enables high-performance computation. The structure is also simple in principle, easy to implement and efficient.

In one embodiment, as shown in FIG. 5, an embodiment of the present invention provides a floating-point fused multiply-add method with a pipeline structure, comprising the following steps:

S1: in response to input floating-point operands, split the sign bits, exponents and mantissas of the floating-point operands, and encode the mantissas of the multiplication operands to obtain multiplier mantissa partial products; the floating-point operands comprise the multiplication operands and an addition operand;

S2: perform two-stage compression on the multiplier mantissa partial products, alignment-shift the mantissa of the addition operand, and perform the leading zero operation on the floating-point operand mantissas;

S3: perform parallel dual-path addition, the leading zero prediction parallel error correction operation and the special leading zero prediction error correction operation. The leading zero prediction parallel error correction operation uses the leading zero prediction parallel error correction tree to process the multiplier mantissa partial products after two-stage compression and the alignment-shifted addend mantissa, obtaining a leading zero prediction parallel error correction result; the special leading zero prediction error correction operation is applied to the floating-point operand mantissas after the leading zero operation, obtaining a special leading zero prediction error correction result;

S4: obtain a leading zero prediction result from the leading zero prediction parallel error correction result and the special leading zero prediction error correction result and, according to it, perform in parallel a mantissa normalized shift and a mantissa denormalized left shift on the addition result produced from the alignment-shifted addend mantissa, to obtain a shifted mantissa; the addition result is produced by adding the addition operand and the product of the multiplication operands;

S5: round the shifted mantissa to obtain a mantissa rounding result, and simultaneously select among mantissas of different precisions from the mantissa rounding result to determine the final floating-point result.
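The defining property of a fused multiply-add, a single rounding of the exact value a × b + c, can be checked against a software reference. The sketch below is not the pipelined method of steps S1 to S5. It uses Python's `fractions.Fraction` for exact arithmetic and relies on `float()` of a `Fraction` being correctly rounded (round-to-nearest-even); it handles finite inputs only, and the function name is hypothetical.

```python
from fractions import Fraction

def fma_reference(a: float, b: float, c: float) -> float:
    """Exact a*b + c, rounded once to double precision.

    Fraction(x) is exact for finite floats, so the product and sum carry
    no rounding error; float() then performs the single final rounding.
    Infinities and NaNs are not handled in this sketch.
    """
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return float(exact)
```

With `a = 1 + 2**-30`, `b = 1 - 2**-30` and `c = -1.0`, the unfused expression `a * b + c` evaluates to `0.0` because the product rounds to 1.0, whereas `fma_reference(a, b, c)` returns `-2**-60`, the exact result.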

It should be noted that the sequence numbers of the steps do not imply an order of execution; the execution order of each step is determined by its function and internal logic and should not limit the implementation of the embodiments of the present application.

For specific limitations of the floating-point fused multiply-add method with a pipeline structure, refer to the limitations of the floating-point fused multiply-add device with a pipeline structure described above; they are not repeated here. Those of ordinary skill in the art will appreciate that the modules and steps described in connection with the disclosed embodiments of the application may be implemented in hardware, software, or a combination of both. Whether such functionality is implemented as hardware or software depends on the particular application and the design constraints of the solution. Skilled artisans may implement the described functionality in different ways for each particular application, but such implementation decisions should not be interpreted as departing from the scope of the present application.

The embodiment of the invention provides a floating-point fused multiply-add method with a pipeline structure that balances the work of each stage, adds the leading zero prediction parallel error correction operation and the special-case leading zero prediction error correction operation, provides parallel normalized left, denormalized left and right shifters, and selects mantissa rounding results of different precisions simultaneously to determine the final floating-point result. This parallel organization speeds up execution of the method, and the added error correction operations enable special data formats to be processed.

The embodiment of the invention also provides a processor, which comprises a processor body and the floating point fusion multiply-add device arranged in the processor body.

FIG. 6 shows a computer device according to an embodiment of the present invention, comprising a memory, a processor and a transceiver connected by a bus; the processor comprises a processor body and the floating point fusion multiply-add device arranged in the processor body.

The memory may comprise volatile memory, nonvolatile memory, or both; the processor may be a central processing unit, a microprocessor, an application specific integrated circuit, a programmable logic device, or a combination thereof. By way of example and not limitation, the programmable logic device may be a complex programmable logic device, a field programmable gate array, generic array logic, or any combination thereof.

In addition, the memory may be a physically separate unit or may be integrated with the processor.

It will be appreciated by those of ordinary skill in the art that the structure shown in FIG. 6 is merely a block diagram of part of the structure associated with the present solution and does not limit the computer device to which the solution may be applied; a particular computer device may include more or fewer components than shown, combine some components, or have a different arrangement of components.

The floating point fusion multiply-add device with a pipeline structure, the method and the processor provided by the embodiments of the application process special data formats through five pipeline segments and, through their parallel design, shorten the critical path of the overall structure, speed up execution and save execution time.

The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, they may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions according to the embodiments of the present application are produced in whole or in part. The computer may be a general purpose computer, a special purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, they may be transmitted from one website, computer, server, or data center to another by wired means (e.g., coaxial cable, optical fiber, digital subscriber line) or wireless means (e.g., infrared, radio, microwave). The computer-readable storage medium may be any usable medium accessible by a computer, or a data storage device, such as a server or data center, that integrates one or more usable media. The usable medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., SSD).

Those skilled in the art will appreciate that implementing all or part of the above described embodiment methods may be accomplished by way of a computer program stored on a computer readable storage medium, which when executed, may comprise the steps of embodiments of the methods described above.

The foregoing examples represent only a few preferred embodiments of the present application; although they are described in some detail, they are not to be construed as limiting the scope of the application. It should be noted that modifications and substitutions can be made by those skilled in the art without departing from the technical principles of the present application, and such modifications and substitutions should also be considered within the scope of the present application. Therefore, the scope of protection of this patent is defined by the claims.

Claims

1. A floating point fusion multiply-add device of a pipeline structure, comprising: a first-stage pipeline segment, a second-stage pipeline segment, a third-stage pipeline segment, a fourth-stage pipeline segment and a fifth-stage pipeline segment; the third-stage pipeline segment comprises a parallel dual-path addition processing module, a leading zero prediction parallel error correction module and a special leading zero prediction error correction module;

2. The floating point fused multiply-add device of claim 1, wherein the second stage pipeline stage comprises:

3. The floating point fusion multiply-add device of a pipeline structure of claim 2, wherein the third stage pipeline stage further comprises a compression encoding module;

4. A floating point fused multiply-add device of a pipeline structure as claimed in claim 3, wherein said special leading zero prediction error correction rule is specifically:

5. A floating point fused multiply-add device in a pipeline architecture as claimed in claim 3, wherein said fourth stage pipeline stage comprises:

6. The floating point fused multiply-add device of claim 5, wherein the mantissa selection module is specifically configured to:

7. The floating point fused multiply-add device of claim 2, wherein the fifth stage pipeline stage comprises:

8. A floating point fusion multiply-add method for a pipeline structure, the method comprising the steps of:

9. A processor, characterized by comprising a processor body and the floating point fusion multiply-add device according to any one of claims 1 to 7 disposed in the processor body.

10. A computer device, characterized by comprising a processor and a memory, wherein the processor is connected to the memory and comprises a processor body and the floating point fusion multiply-add device according to any one of claims 1 to 7 arranged in the processor body.