CN116820395A

CN116820395A - Floating point multiply-add unit supporting packet-level operation and application method thereof

Info

Publication number: CN116820395A
Application number: CN202310713118.8A
Authority: CN
Inventors: 黄立波; 谭弘兵; 王永文; 郭辉; 郑重; 雷国庆; 王俊辉; 郭维; 邓全; 隋兵才; 孙彩霞; 常俊胜; 沈俊忠
Original assignee: National University of Defense Technology
Current assignee: National University of Defense Technology
Priority date: 2023-06-15
Filing date: 2023-06-15
Publication date: 2023-09-29

Abstract

The application discloses a floating-point multiply-add unit supporting packet-level operation and an application method thereof. The application aims to alleviate the performance loss that iterative computation imposes on high-precision floating-point operations without affecting artificial intelligence computing power, so that both high-performance computing and artificial intelligence applications can be supported at lower hardware cost and without extensive compiler adaptation.

Description

Floating point multiply-add unit supporting packet-level operation and application method thereof

Technical Field

The application belongs to the technical field of microprocessor design, and particularly relates to a floating point multiply-add unit supporting packet-level operation and an application method thereof.

Background

As high-performance computing and artificial intelligence continue to converge, the mix of computing power required of high-performance computing systems is changing accordingly. Conventional high-performance computing typically uses double-precision floating-point numbers, which provide higher precision and a larger range of values and thereby limit accumulated error during computation. Artificial intelligence algorithms, in contrast, are robust to numerical error and can often use low-precision floating-point computation to reduce computational and storage overhead. To meet the diverse computing-power requirements of high-performance computing and artificial intelligence workloads, current mainstream high-performance computing systems often support several data formats, such as double precision (DP, 64-bit), single precision (SP, 32-bit), half precision (HP, 16-bit) and brain floating point (BF16, 16-bit). Designing functional units that support multi-precision floating-point operations is an effective way to provide this diversity. Multi-precision floating-point units typically multiplex the mantissa multipliers to reduce hardware overhead, i.e., several smaller multiplier units suited to low-precision computation are organized into a multiplier array that performs high-precision multiplication. Because the size of a mantissa multiplier is proportional to the square of the data bit width, a multiplier that supports 64-bit double-precision floating-point computation can simultaneously support 4 single-precision (32-bit) multiplications or 16 half-precision (16-bit) multiplications.

However, in a multi-precision floating-point arithmetic unit, the performance of each precision is inversely proportional to its bit width, which means that even if the multiplier supporting double precision is halved in size, the computational performance of single-precision and half-precision operations is not affected. In practical applications that combine high-performance computing and artificial intelligence, the artificial intelligence algorithms demand far more computing power than the high-performance computing workloads. Reducing the size of the mantissa multiplier in the multi-precision floating-point arithmetic unit can therefore significantly reduce hardware cost without lowering the performance of the low-precision formats suited to artificial intelligence, which maximizes performance under a limited area constraint. The drawback is that reducing the multiplier size lowers the performance of high-precision (double-precision) floating-point computation.

Disclosure of Invention

The technical problem to be solved by the application is as follows: in view of the above problems in the prior art, the application provides a floating-point multiply-add unit supporting packet-level operation and an application method thereof. The application aims to alleviate the performance loss that iterative computation imposes on high-precision floating-point operations without affecting artificial intelligence computing power, so that high-performance computing and artificial intelligence applications can be supported at lower hardware cost and without extensive compiler adaptation.

To solve the above technical problem, the application adopts the following technical solution:

a floating-point multiply-add unit supporting packet-level operations, comprising:

a data preprocessing module, configured to preprocess the input floating-point operands by extracting the sign bit, exponent field and mantissa field of each operand according to the data format specified by the operation type, restoring the hidden bit of the mantissa, and performing exception detection on the input data;

a mantissa multiplication module, configured to multiply the unsigned mantissas obtained after the hidden bits are restored to obtain a multiplication result, wherein the mantissa multiplication module is an iteratively operated multiplier array composed of 2 multipliers that each support single-precision multiplication;

an exponent difference module, configured to calculate the exponent differences between the different operands in the earlier and later cycles of the iterative computation of the mantissa multiplication module;

an addend alignment shift module, configured to shift the addend according to the exponent difference obtained by the exponent difference module;

an addition module, configured to perform, on the outputs of the mantissa multiplication module and the addend alignment shift module, an addition that supports packet-level operation;

a leading zero calculation module, configured to count the leading zeros in the mantissa output by the addition module, so as to control the subsequent mantissa normalization shift and exponent adjustment;

a normalization shift module, configured to logically shift left the mantissa output by the addition module, the shift amount being the leading zero count calculated in the previous step;

a mantissa rounding module, configured to round the shifted mantissa according to the mantissa bit width of the output data format to obtain the final mantissa;

an exponent adjustment module, configured to adjust the exponent according to the leading zero count and the carry produced by mantissa rounding to obtain the final exponent;

and an output processing module, configured to combine the sign bit, the final mantissa and the final exponent according to the output data format to form the final output result, and to perform exception detection on the output data.
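By way of a non-limiting illustration, the field extraction performed by the data preprocessing module can be sketched in software as follows. The sketch assumes IEEE 754 binary64 and binary32 encodings for the double-precision and single-precision formats; the function name `preprocess`, the `FORMATS` table and the flag names are hypothetical and do not appear in the application.

```python
import struct

# Assumed IEEE 754 field widths: binary64 for DP, binary32 for SP.
FORMATS = {
    "dp": {"exp_bits": 11, "man_bits": 52, "pack": "<d", "unpack": "<Q"},
    "sp": {"exp_bits": 8, "man_bits": 23, "pack": "<f", "unpack": "<I"},
}

def preprocess(value, fmt="dp"):
    """Return (sign, biased exponent, significand with hidden bit, flags)."""
    f = FORMATS[fmt]
    # Reinterpret the floating-point encoding as an unsigned integer.
    bits = struct.unpack(f["unpack"], struct.pack(f["pack"], value))[0]
    man = bits & ((1 << f["man_bits"]) - 1)
    exp = (bits >> f["man_bits"]) & ((1 << f["exp_bits"]) - 1)
    sign = bits >> (f["man_bits"] + f["exp_bits"])
    exp_max = (1 << f["exp_bits"]) - 1
    # Exception detection on the input: flag the special encodings.
    flags = {
        "is_zero": exp == 0 and man == 0,
        "is_subnormal": exp == 0 and man != 0,
        "is_inf": exp == exp_max and man == 0,
        "is_nan": exp == exp_max and man != 0,
    }
    # The hidden bit is 1 unless the exponent field is zero (zero/subnormal).
    hidden = 0 if exp == 0 else 1
    significand = (hidden << f["man_bits"]) | man
    return sign, exp, significand, flags
```

For example, `preprocess(1.0)` yields a sign of 0, a biased exponent of 1023 and a 53-bit significand whose only set bit is the hidden bit.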

Optionally, the iteratively operated multiplier array composed of 2 multipliers supporting single-precision multiplication includes two data selection modules, two multipliers, a 4-2 compression tree, a Carry register and a Sum register. The two multipliers, multiplier 0 and multiplier 1, are each 27 × 27 in size and correspond one-to-one with the two data selection modules; each data selection module selects the multiplier operand that is fed to its corresponding multiplier. For single-precision computation, the two multipliers complete 2 multiplications every clock cycle. For double-precision operation, the 53 × 53 multiplication is performed iteratively: in the first clock cycle, the two multipliers complete a 53 × 27 operation, and the resulting Sum and Carry are latched into the Sum register and the Carry register; in the second clock cycle, the 53 × 26 operation is completed and added to the result latched in the first clock cycle to obtain the final result.
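By way of a non-limiting illustration, the two-cycle iteration described above can be modelled at the integer level as follows. The model collapses the carry-save Sum and Carry pair into a single integer, and it assumes that the low 27-bit slice of the second operand is processed in the first cycle and the high 26-bit slice in the second; the application does not state which slice comes first. All function names are hypothetical.

```python
MASK27 = (1 << 27) - 1

def mul27(x, y):
    """Model of one 27 x 27 unsigned multiplier."""
    assert 0 <= x < (1 << 27) and 0 <= y < (1 << 27)
    return x * y

def dp_mantissa_multiply(a, b):
    """Multiply two 53-bit significands in two modelled clock cycles."""
    assert 0 <= a < (1 << 53) and 0 <= b < (1 << 53)
    a_lo, a_hi = a & MASK27, a >> 27  # 27-bit and 26-bit halves of operand a
    b_lo, b_hi = b & MASK27, b >> 27  # 27-bit and 26-bit slices of operand b

    # Cycle 1 (53 x 27): both multipliers use the low slice of b;
    # the result is latched (Sum/Carry registers in the hardware).
    latched = mul27(a_lo, b_lo) + (mul27(a_hi, b_lo) << 27)

    # Cycle 2 (53 x 26): both multipliers use the high slice of b,
    # and the latched cycle-1 result is added in.
    upper = mul27(a_lo, b_hi) + (mul27(a_hi, b_hi) << 27)
    return latched + (upper << 27)

def sp_mantissa_multiply_pair(a0, b0, a1, b1):
    """Two independent 24 x 24 single-precision products in one modelled cycle."""
    assert all(0 <= v < (1 << 24) for v in (a0, b0, a1, b1))
    return mul27(a0, b0), mul27(a1, b1)
```

Because each 26-bit or 27-bit slice fits a 27 × 27 multiplier, four multiplier uses spread over two cycles cover the full 53 × 53 product, while a 24-bit single-precision significand fits a single multiplier directly.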

Optionally, the exponent difference module includes a data selector, an exponent difference calculation module and an addend alignment shift module. The data selector selects either the exponent of addend 1 or the product exponent sum as one input of the exponent difference calculation module; the other input of the exponent difference calculation module is the exponent of addend 2; and the output of the exponent difference calculation module is connected to the input of the addend alignment shift module. In double-precision floating-point operation, which is performed iteratively, the two clock cycles treat the exponent differently. The first clock cycle performs the addition: the data selector selects addend 1, and the exponent difference calculation module calculates the exponent difference between addend 1 and addend 2. In the second clock cycle, the data selector selects the product exponent sum, and the exponent difference calculation module calculates the exponent difference between the product and the addend. The data selector thus allows the exponent difference calculation module to be reused across the two cycles. Single-precision floating-point computation is identical to the second clock cycle of double-precision computation: the data selector selects the product exponent sum, and the exponent difference calculation module calculates the exponent difference between the product and the addend.
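By way of a non-limiting illustration, the sharing of the exponent difference calculation logic through the data selector can be sketched as follows. The sketch assumes biased IEEE 754 exponents, computes the product exponent as the sum of the two factor exponents minus the bias, and uses "selected exponent minus addend 2 exponent" as the sign convention; these choices and the function name are assumptions, not details taken from the application.

```python
DP_BIAS = 1023  # assumed bias of the double-precision exponent field

def exponent_difference(exp_addend1, exp_a, exp_b, exp_addend2,
                        dp_first_cycle, bias=DP_BIAS):
    """Model the data selector followed by the shared exponent subtractor.

    dp_first_cycle=True  : first cycle of a double-precision packet (ADD),
                           the selector picks the exponent of addend 1.
    dp_first_cycle=False : second cycle of a double-precision packet, or any
                           single-precision operation; the selector picks the
                           product exponent sum.
    """
    product_exp = exp_a + exp_b - bias  # biased exponent of the product a * b
    selected = exp_addend1 if dp_first_cycle else product_exp
    return selected - exp_addend2
```

The same subtractor therefore serves both cycles, and only the selector input changes between the addition and the multiply-add.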

Optionally, the adding module includes:

a data selector, configured to select among the mantissa-multiplication Sum, the mantissa-multiplication Carry and the aligned addend 1;

a 4-2 compression tree, configured to compress and combine its inputs, which include the output of the data selector, the aligned addend 2 and the output of the second selector;

an adder, configured to add the outputs of the 4-2 compression tree;

an enable signal generation module (DP & Pack), configured to generate a double-precision computation enable signal DP and a packet-level operation enable signal Pack;

a first selector, configured to output, under the control of the packet-level operation enable signal Pack, either the output of the adder or the aligned addend 2;

a second selector, configured to output to the 4-2 compression tree, under the control of the double-precision computation enable signal DP, either 0 or the aligned addend latched in the previous clock cycle;

a register stage, configured to register and output the output of the first selector, and to supply the aligned addend latched in the previous clock cycle to the second selector;

For single-precision computation, the data flow of the addition stage is identical to that of a conventional floating-point multiply-add unit: the aligned addend, a zero input, and the Sum and Carry of the mantissa multiplication are combined by the 4-2 compression tree, and the resulting sum and carry are then added to obtain the output. For packet-level computation in double-precision operation, the addition module works in one of two modes. In the first mode, the addition and the multiply-add are independent, and the computation flow is the same as for single precision. In the second mode, the addition result and the multiply-add result are accumulated: in the first clock cycle the aligned addend is latched directly through a bypass, and in the second clock cycle the latched aligned addend, the multiplication result and the other addend are added through the 4-2 compression tree to obtain the output.
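By way of a non-limiting illustration, the two packet-level modes can be modelled behaviourally as follows. The model works on finite Python floats, computes each result exactly with `fractions.Fraction` and rounds once when converting back to a float; whether the hardware rounds the accumulated value of the second mode only once is an assumption. The function name `packet_op` and its parameters are hypothetical.

```python
from fractions import Fraction

def packet_op(x, y, a, b, c=0.0, accumulate=False):
    """Model one double-precision packet (two consecutive clock cycles).

    Returns a list of (cycle, result) pairs.
    Mode 1 (accumulate=False): cycle 1 outputs x + y, cycle 2 outputs a*b + c.
    Mode 2 (accumulate=True) : the aligned addend is latched in cycle 1 and
                               cycle 2 outputs a*b + x + y.
    Inputs must be finite floats (Fraction rejects infinities and NaN).
    """
    fx, fy, fa, fb, fc = (Fraction(v) for v in (x, y, a, b, c))
    if not accumulate:
        return [(1, float(fx + fy)), (2, float(fa * fb + fc))]
    return [(2, float(fa * fb + fx + fy))]
```

For example, `packet_op(1.0, 2.0, 1.5, 2.0, 0.25)` models the independent mode and produces one result per cycle, whereas passing `accumulate=True` produces a single accumulated result in the second cycle.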

In addition, the application also provides a processor chip, which comprises a chip body and an operation unit arranged in the chip body, wherein the operation unit comprises the floating point multiply-add unit supporting packet-level operation.

In addition, the application also provides an accelerator board card, which comprises a board card body and an accelerator chip arranged in the board card body, wherein the accelerator chip comprises a chip body and an operation unit arranged in the chip body, and the operation unit comprises the floating-point multiply-add unit supporting packet-level operation.

Optionally, the accelerator chip is a GPU chip.

In addition, the application also provides a computer device, which comprises a microprocessor and a memory which are connected with each other, wherein the microprocessor comprises a microprocessor body and an operation unit arranged in the microprocessor body, and the operation unit comprises the floating point multiply-add unit supporting packet-level operation.

Optionally, the microprocessor is further connected with an accelerator board card through a bus, the accelerator board card comprises a board card body and an accelerator chip arranged in the board card body, the accelerator chip comprises a chip body and an operation unit arranged in the chip body, and the operation unit comprises the floating-point multiply-add unit supporting packet-level operation.

Optionally, the accelerator chip is a GPU chip.

In addition, the application also provides an application method of the floating-point multiply-add unit supporting packet-level operation, which comprises the following step: the floating-point multiply-add unit supporting packet-level operation completes one double-precision floating-point addition and one double-precision floating-point multiply-add in two consecutive clock cycles, respectively, and the addition result and the multiply-add result are either output independently or accumulated and then output, thereby supporting packet-level computation.

Compared with the prior art, the application has the following advantages:

1. The low-overhead floating-point multiply-add unit multiplexes the mantissa multipliers iteratively to reduce hardware overhead, and it supports packet-level operation: the two clock cycles of an iterative computation form one computation packet, and an addition is performed in the first clock cycle of the multiplier iteration to improve performance. The performance loss that iterative computation imposes on high-precision floating-point operations is thereby alleviated without affecting artificial intelligence computing power, and high-performance computing and artificial intelligence applications can be supported at lower hardware overhead.

2. The floating-point multiply-add unit supporting packet-level operation only requires the compiler to slightly adjust the instruction execution order and needs no further compiler adaptation.

3. The low-overhead floating-point multiply-add unit supporting packet-level computation is a basic structural unit of the processor, and its interface is consistent with that of a conventional operation unit, so it can directly replace the conventional unit in a processor and is easy to implement.

Drawings

FIG. 1 is a schematic diagram of a floating-point multiply-add unit supporting packet-level operations according to an embodiment of the present application.

FIG. 2 is a schematic diagram of a multiplier array according to an embodiment of the present application.

FIG. 3 is a schematic structural diagram of the exponent difference module in an embodiment of the present application.

FIG. 4 is a schematic structural diagram of the addition module in an embodiment of the present application.

FIG. 5 is a diagram illustrating an application of a packet-level operation in an assembler instruction according to an embodiment of the present application.

Detailed Description

The floating-point multiply-add unit supporting packet-level operation in this embodiment is designed as a low-overhead unit that maximizes the computing performance of a processor under a limited semiconductor area constraint. For ease of understanding, a dual-mode floating-point multiply-add unit with a 3-stage pipeline (S1, S2 and S3) that supports double precision and single precision is described here as an example.

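To support packet-level operation and multiple precision data formats, the dual-mode floating-point multiply-add unit modifies the structure of a conventional double-precision floating-point multiply-add unit and comprises a plurality of sub-modules. As shown in FIG. 1, the floating-point multiply-add unit supporting packet-level operation in this embodiment includes the data preprocessing module, the mantissa multiplication module, the exponent difference module, the addend alignment shift module, the addition module, the leading zero calculation module, the normalization shift module, the mantissa rounding module, the exponent adjustment module and the output processing module described above.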

As shown in FIG. 2, the iteratively operated multiplier array of this embodiment, composed of 2 multipliers supporting single-precision multiplication, includes two data selection modules, two 27 × 27 multipliers (multiplier 0 and multiplier 1), a 4-2 compression tree, a Carry register and a Sum register. Each data selection module selects the multiplier operand that is fed to its corresponding multiplier. For single-precision computation, the two multipliers complete 2 multiplications every clock cycle. For double-precision operation, the 53 × 53 multiplication is performed iteratively: in the first clock cycle, the two multipliers complete a 53 × 27 operation, and the resulting Sum and Carry are latched into the Sum register and the Carry register; in the second clock cycle, the 53 × 26 operation is completed and added to the result latched in the first clock cycle to obtain the final result. It should be noted that the 4-2 compression tree, also called a 4-2 compressor, compresses 4 input operands into 2 operands; it is an existing logic structure, so its implementation is not described in detail.
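By way of a non-limiting illustration, the function of a 4-2 compressor can be sketched on non-negative integers as two cascaded 3-2 carry-save stages. This is one common way to build such a compressor and is not asserted to be the structure used in the application; the function names are hypothetical.

```python
def csa_3_2(x, y, z):
    """Carry-save adder: reduce three operands to a (sum, carry) pair
    without propagating carries, so that sum + carry == x + y + z."""
    s = x ^ y ^ z                           # bitwise sum without carries
    c = ((x & y) | (x & z) | (y & z)) << 1  # carries, moved up one position
    return s, c

def compress_4_2(w, x, y, z):
    """Reduce four operands to two whose total equals w + x + y + z."""
    s1, c1 = csa_3_2(w, x, y)
    return csa_3_2(s1, c1, z)
```

The two outputs still have to pass through an ordinary carry-propagate adder to obtain a single result, which is the role of the adder that follows the compression tree in the addition module.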

As shown in FIG. 3, the exponent difference module of this embodiment includes a data selector, an exponent difference calculation module and an addend alignment shift module. The data selector selects either the exponent of addend 1 or the product exponent sum as one input of the exponent difference calculation module; the other input of the exponent difference calculation module is the exponent of addend 2; and the output of the exponent difference calculation module is connected to the input of the addend alignment shift module. In double-precision floating-point operation, which is performed iteratively, the two clock cycles treat the exponent differently. The first clock cycle performs the addition: the data selector selects addend 1, and the exponent difference calculation module calculates the exponent difference between addend 1 and addend 2. In the second clock cycle, the data selector selects the product exponent sum, and the exponent difference calculation module calculates the exponent difference between the product and the addend. The data selector thus allows the exponent difference calculation module to be reused across the two cycles. Single-precision floating-point computation is identical to the second clock cycle of double-precision computation: the data selector selects the product exponent sum, and the exponent difference calculation module calculates the exponent difference between the product and the addend.

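As shown in FIG. 4, the addition module of the present embodiment includes a data selector, a 4-2 compression tree, an adder, an enable signal generation module (DP & Pack), a first selector, a second selector and a register stage, configured as described above.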

It should be noted that the data preprocessing module, the addend alignment shift module, the leading zero calculation module, the normalization shift module, the mantissa rounding module, the exponent adjustment module and the output processing module are all conventional modules of a floating-point multiply-add unit, so their implementation is not described in detail.
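By way of a non-limiting illustration of these conventional post-addition stages, the leading zero count, normalization shift, mantissa rounding and exponent adjustment can be sketched on integers as follows. The sketch assumes round-to-nearest with ties to even, which the application does not specify, and ignores subnormal results, overflow and the sign; the function name and parameters are hypothetical.

```python
def normalize_round(mantissa, exponent, width, out_bits):
    """Normalize and round a non-zero unsigned mantissa.

    mantissa : value held in a `width`-bit datapath
    exponent : exponent associated with the most significant datapath bit
    out_bits : significand width of the output format, hidden bit included
    Returns (final mantissa of out_bits bits, final exponent).
    """
    assert 0 < mantissa < (1 << width) and width > out_bits
    lz = width - mantissa.bit_length()  # leading zero calculation
    shifted = mantissa << lz            # normalization left shift
    drop = width - out_bits             # number of bits discarded by rounding
    kept = shifted >> drop
    rem = shifted & ((1 << drop) - 1)
    half = 1 << (drop - 1)
    # Round to nearest, ties to even (assumed rounding mode).
    if rem > half or (rem == half and (kept & 1)):
        kept += 1
    carry = kept >> out_bits            # 1 if rounding overflowed the mantissa
    if carry:
        kept >>= 1
    # Exponent adjustment: subtract the shift amount, add the rounding carry.
    return kept, exponent - lz + carry
```

For example, an 8-bit datapath value of `0b00010110` with exponent 10 and a 4-bit output significand has three leading zeros, so the mantissa becomes `0b1011` and the exponent becomes 7.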

The core idea of the application is to reduce hardware overhead by iteratively multiplexing the mantissa multipliers of the floating-point multiply-add unit, at the cost of some double-precision floating-point performance. In practical high-performance computing workloads, double-precision multiply-add (FMA) and addition (ADD) instructions tend to occur together. In a floating-point multiply-add unit implemented iteratively, a multiply-add instruction requires 4 clock cycles to complete and cannot be pipelined, which stalls the pipeline and costs performance, as shown in (a) of FIG. 5. The operation unit designed in this embodiment supports packet-level operation: it completes one ADD instruction and one FMA instruction within 4 clock cycles and supports both the case where the two instructions are data dependent and the case where they are independent, which effectively alleviates the double-precision floating-point performance loss caused by iteration, as shown in (c) of FIG. 5. Compared with the independently executed instructions shown in (b) of FIG. 5, this embodiment also supports multiply-add and addition instructions between which a data dependence exists, which likewise alleviates the double-precision floating-point performance loss caused by iteration. In addition, the multiply-add unit supporting packet-level operation provided in this embodiment only requires the instruction order to be adjusted at compile time and needs no additional compiler support, so it is easy to implement.

The floating-point multiply-add unit supporting packet-level operation halves the size of the mantissa multiplier of the dual-mode floating-point multiply-add unit and multiplexes the multipliers iteratively, so that the unit completes either two parallel single-precision floating-point operations per clock cycle or one double-precision floating-point operation in two clock cycles. In addition, the data flow is adjusted so that the operation unit can complete one double-precision addition in the first clock cycle of the iterative computation, which reduces the performance loss of double-precision floating-point operation. The packet-level operation supported by the application has two modes. In the first mode, the addition and the multiply-add work independently, and one result is output in each clock cycle. In the second mode, the addition result and the multiply-add result are accumulated, and the accumulated result is output in the second clock cycle. The design starts from an analysis of the computational requirements of actual application code, in which a large number of multiply-add instructions are often accompanied by addition instructions. The packet-level computing mode is provided to compensate for the double-precision floating-point performance loss caused by the iterative mantissa multiplier. It mainly uses hardware resources that would otherwise be idle in the first clock cycle of the iterative computation, so it adds almost no hardware cost. The circuit is custom designed according to the functional requirements of packet-level computation and multi-precision computation.

The data flow of the dual-mode floating-point multiply-add unit is similar to that of a conventional floating-point multiply-add unit, except that the data paths of the component modules are partitioned and multiplexed to support parallel single-precision computation. The application mainly modifies the mantissa multiplier, the exponent difference logic and the addition module, as follows:

Iterative multiplier: a double-precision mantissa multiplier is 53 × 53 in size (52-bit mantissa plus 1 hidden bit), and a single-precision mantissa multiplier is 24 × 24 (23-bit mantissa plus 1 hidden bit). The application uses two 27 × 27 multipliers and multiplexes them iteratively to complete a double-precision mantissa multiplication in two clock cycles, while each clock cycle can still complete two parallel single-precision mantissa multiplications.

Exponent difference: when the dual-mode floating-point multiply-add unit performs a double-precision floating-point operation iteratively, the two clock cycles treat the exponent differently. The first clock cycle performs the addition and therefore calculates the exponent difference of the two addends; the second clock cycle is the same as in a conventional multiply-add unit and calculates the exponent difference between the product and the addend. A data selector allows the exponent difference calculation logic to be reused for both calculations. Single-precision floating-point computation is identical to the second clock cycle of double-precision computation.

Addition module: when the dual-mode floating-point multiply-add unit performs a double-precision floating-point operation, the data flow is adjusted according to the packet-level operation mode. In the first mode, independent addition and multiply-add operations are completed in the first and second clock cycles respectively, and one result is output per cycle. In the second mode, the addition result of the first clock cycle is latched and then accumulated with the multiplication result in the second clock cycle. Single-precision floating-point computation is consistent with the multiply-add operation of the first mode.
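By way of a non-limiting illustration of the multiplier sizing discussed above, the following sketch compares the partial-product array of a full 53 × 53 multiplier with that of two 27 × 27 multipliers. Counting rows times columns is only a first-order proxy for silicon area and is an assumption of this sketch, as are the names used.

```python
def array_size(rows, cols, count=1):
    """First-order size proxy: number of partial-product bits."""
    return rows * cols * count

full_dp = array_size(53, 53)        # one full double-precision multiplier
iterative = array_size(27, 27, 2)   # two 27 x 27 multipliers, reused over 2 cycles
ratio = iterative / full_dp         # fraction of the full array that remains

# A 24-bit single-precision significand fits within one 27 x 27 multiplier,
# so the pair still delivers two single-precision products per cycle.
sp_fits = 24 <= 27

print(f"full: {full_dp}, iterative: {iterative}, ratio: {ratio:.3f}")
```

Under this proxy the two smaller multipliers occupy roughly half of the full array, which is consistent with the statement that the mantissa multiplier is reduced by half.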

In summary, the low-overhead floating-point multiply-add unit supporting packet-level computation according to this embodiment is designed around the actual computing requirements of high-performance computing and artificial intelligence applications. It effectively reduces hardware overhead at the cost of a small loss of double-precision floating-point performance, and thus maximizes computing performance under a limited semiconductor area constraint. Specifically, the mantissa multiplier array is multiplexed iteratively to reduce hardware implementation overhead. In addition, the application provides a packet-level computing mode in which one double-precision floating-point addition and one double-precision floating-point multiply-add are completed in two consecutive clock cycles, and the addition result and the multiply-add result can either be output independently or be accumulated and then output, which effectively alleviates the performance loss caused by iterative computation. The packet-level operation supported by the floating-point multiply-add unit of this embodiment only requires the compiler to adjust the instruction order; it needs no additional support and is easy to use.

High-performance computing and artificial intelligence applications have different computing-power requirements, and computing units often need to support multiple data formats to meet them. This embodiment designs a low-overhead multi-precision floating-point multiply-add unit that supports floating-point computation in multiple formats and maximizes computing performance under a limited semiconductor area constraint. Multiplexing the mantissa multiplier iteratively effectively reduces the hardware overhead of the floating-point multiply-add unit, but an iteratively computed multiplication takes two clock cycles, which can stall the computation pipeline and reduce performance. The packet-level computing mode of this embodiment, which completes one double-precision floating-point addition and one double-precision floating-point multiply-add in two consecutive clock cycles, effectively alleviates this loss.

The multi-precision arithmetic unit is a basic building block of current high-performance computing systems and often supports multiple data formats to meet the computing requirements of different application scenarios. Because it supports multiple data formats, it has a larger hardware overhead than a conventional arithmetic unit. This embodiment multiplexes the mantissa multiplier of the multiply-add unit iteratively, which significantly reduces the hardware implementation cost. To alleviate the resulting loss of double-precision floating-point performance, this embodiment provides packet-level operation: one addition can be executed in the first clock cycle of the iterative computation, and its result can either be output independently or be accumulated with the multiply-add result in the second clock cycle, which matches the computational pattern of actual program instructions. The low-overhead floating-point multiply-add unit supporting packet-level computation provided by this embodiment only requires the compiler to adjust the execution order of instructions, needs no additional support, is easy to implement, and maximizes the computing performance of the processor under a limited semiconductor area constraint.

In addition, the embodiment also provides a processor chip, which comprises a chip body and an operation unit arranged in the chip body, wherein the operation unit comprises the floating point multiply-add unit supporting packet-level operation.

In addition, the embodiment also provides an accelerator board card, which comprises a board card body and an accelerator chip arranged in the board card body, wherein the accelerator chip comprises a chip body and an operation unit arranged in the chip body, and the operation unit comprises the floating-point multiply-add unit supporting packet-level operation. As an alternative embodiment, the accelerator chip is a GPU chip. The accelerator chip may also be another type of computing accelerator chip, such as a DSP chip.

In addition, the embodiment also provides a computer device, which comprises a microprocessor and a memory connected to each other, wherein the microprocessor comprises a microprocessor body and an operation unit arranged in the microprocessor body, and the operation unit comprises the floating-point multiply-add unit supporting packet-level operation. In this embodiment, the microprocessor is further connected to an accelerator board card through a bus; the accelerator board card includes a board card body and an accelerator chip disposed in the board card body, the accelerator chip includes a chip body and an operation unit disposed in the chip body, and the operation unit includes the floating-point multiply-add unit supporting packet-level operation. As an alternative embodiment, the accelerator chip is a GPU chip. The accelerator chip may also be another type of computing accelerator chip, such as a DSP chip.

In addition, the embodiment also provides an application method of the floating point multiply-add unit supporting packet-level operation, which comprises the following steps: in two consecutive clock cycles, the floating point multiply-add unit supporting packet-level operation completes one double-precision floating point addition and one double-precision floating point multiply-add, respectively; the addition result and the multiply-add result are output independently and are accumulated and then output, so as to support packet-level computation.
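The two-cycle method above can be illustrated with a small behavioral model. This is only a sketch under stated assumptions: the function names (`dp_add`, `dp_fma`, `packet_op`) are hypothetical, each cycle is assumed to receive its own operands, and the fused multiply-add is emulated with exact rational arithmetic followed by a single rounding rather than modeled at the hardware level. It does not represent the pipeline structure of the unit itself.

```python
from fractions import Fraction


def dp_add(x: float, y: float) -> float:
    """Double-precision addition (Python floats are IEEE 754 binary64)."""
    return x + y


def dp_fma(a: float, b: float, c: float) -> float:
    """Emulated fused multiply-add: a*b + c with one final rounding.

    The product and sum are formed exactly as rationals, then rounded once
    when converted back to float. Finite inputs only (assumption).
    """
    return float(Fraction(a) * Fraction(b) + Fraction(c))


def packet_op(add_x, add_y, mul_a, mul_b, mul_c):
    """Hypothetical model of one packet-level operation.

    Cycle 1: one double-precision addition.
    Cycle 2: one double-precision multiply-add.
    The operand assignment is an assumption, not taken from the source.
    Returns both individual results and their accumulated sum.
    """
    add_result = dp_add(add_x, add_y)          # first clock cycle
    fma_result = dp_fma(mul_a, mul_b, mul_c)   # second clock cycle
    accumulated = add_result + fma_result      # accumulated output
    return add_result, fma_result, accumulated


if __name__ == "__main__":
    print(packet_op(0.25, 0.5, 1.5, 2.0, 1.0))
```

Returning all three values mirrors the statement that the two results are available independently as well as in accumulated form.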

It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-readable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein. The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks. These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks. 
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

The above description is only a preferred embodiment of the present application, and the protection scope of the present application is not limited to the above examples, and all technical solutions belonging to the concept of the present application belong to the protection scope of the present application. It should be noted that modifications and adaptations to the present application may occur to one skilled in the art without departing from the principles of the present application and are intended to be within the scope of the present application.

Claims

1. A floating point multiply-add unit supporting packet-level operations, comprising:

2. The floating point multiply-add unit supporting packet-level operations according to claim 1, wherein said multiplier array, which consists of 2 multipliers supporting single-precision multiplication and is implemented in an iterative manner, comprises: two data selection modules, multipliers, a 4-2 compression tree, a Carry register and a Sum register, wherein the multipliers comprise multiplier 1 and multiplier 0, which are two multipliers each of size 27×27 and are in one-to-one correspondence with the two data selection modules; each data selection module is used for selecting the operand to be sent into its corresponding multiplier (multiplier 0 or multiplier 1); for single-precision calculation, the two multipliers complete 2 multiplication operations every clock cycle; for double-precision operation, the 53×53 multiplication is realized in an iterative manner: the former clock cycle completes the 53×27 operation through the two multipliers to obtain Sum and Carry results, which are latched into the Sum register and the Carry register; the latter clock cycle completes the 53×26 operation and adds the result latched in the former clock cycle to obtain the final result.

3. The floating point multiply-add unit supporting packet-level operations according to claim 2, wherein the exponent difference module comprises a data selector, an exponent difference calculation module and an addend-to-order shift module; the data selector is used for selecting either addend 1 or the product exponent sum as one input of the exponent difference calculation module, the other input of the exponent difference calculation module is addend 2, and the exponent difference calculation module is connected with an input end of the addend-to-order shift module; when a double-precision floating point operation is carried out in the iterative implementation, the exponent is handled differently in the two clock cycles: the addition is completed in the former clock cycle, in which the data selector selects addend 1 and the exponent difference calculation module calculates the exponent difference between addend 1 and addend 2; in the latter clock cycle, the data selector selects the product exponent sum and the exponent difference calculation module calculates the exponent difference between the product and the addend, so that the exponent difference calculation module is multiplexed across the two cycles through the data selector; the single-precision floating point calculation is completely consistent with the operation of the latter clock cycle of the double-precision floating point calculation, in which the data selector selects the product exponent sum and the exponent difference calculation module calculates the exponent difference between the product and the addend.

4. The floating point multiply-add unit supporting packet-level operations of claim 3, wherein said add module comprises:

5. A processor chip comprising a chip body and an operation unit provided in the chip body, wherein the operation unit comprises the floating-point multiply-add unit supporting packet-level operation according to any one of claims 1 to 4.

6. An accelerator board card comprising a board card body and an accelerator chip provided in the board card body, the accelerator chip comprising a chip body and an operation unit provided in the chip body, wherein the operation unit comprises the floating point multiply-add unit supporting packet-level operation according to any one of claims 1 to 4.

7. A computer device comprising a microprocessor and a memory connected to each other, the microprocessor comprising a microprocessor body and an operation unit provided in the microprocessor body, wherein the operation unit comprises the floating point multiply-add unit supporting packet-level operation according to any one of claims 1 to 4.

8. The computer device according to claim 7, wherein the microprocessor is further connected with an accelerator board card through a bus, the accelerator board card includes a board card body and an accelerator chip provided in the board card body, the accelerator chip includes a chip body and an operation unit provided in the chip body, and the operation unit includes the floating point multiply-add unit supporting packet-level operation according to any one of claims 1 to 4.

9. The computer device of claim 8, wherein the accelerator chip is a GPU chip.

10. A method of using the floating point multiply-add unit supporting packet-level operations as recited in any one of claims 1-4, comprising: in two consecutive clock cycles, the floating point multiply-add unit supporting packet-level operation completes one double-precision floating point addition and one double-precision floating point multiply-add, respectively; the addition result and the multiply-add result are output independently and are accumulated and then output, so as to support packet-level computation.
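For illustration only, the iterative double-precision mantissa multiplication recited in claim 2 (a 53×27 pass followed by a 53×26 pass on two 27×27 multipliers) can be sketched at the integer level. The sketch makes several assumptions that are not stated in the source: the low 27 bits of the second operand are processed in the first cycle, the 53-bit first operand is split into a 27-bit low part and a 26-bit high part across the two multipliers, and the Sum and Carry registers are collapsed into a single latched integer instead of being kept in carry-save form. The names `mul27x27` and `mul53_iterative` are hypothetical.

```python
import random

MASK27 = (1 << 27) - 1  # selects the low 27 bits of an operand


def mul27x27(x: int, y: int) -> int:
    """Model of one 27x27 unsigned multiplier."""
    assert 0 <= x < (1 << 27) and 0 <= y < (1 << 27)
    return x * y


def mul53_iterative(a: int, b: int) -> int:
    """53x53 unsigned multiply in two passes over two 27x27 multipliers."""
    assert 0 <= a < (1 << 53) and 0 <= b < (1 << 53)
    a_lo, a_hi = a & MASK27, a >> 27   # 27-bit and 26-bit halves of a
    b_lo, b_hi = b & MASK27, b >> 27   # 27-bit and 26-bit halves of b

    # Former cycle: 53x27, i.e. a times the low 27 bits of b (assumed order).
    m0 = mul27x27(a_lo, b_lo)          # multiplier 0
    m1 = mul27x27(a_hi, b_lo)          # multiplier 1
    latched = m0 + (m1 << 27)          # stands in for the Sum/Carry registers

    # Latter cycle: 53x26, i.e. a times the high 26 bits of b.
    m0 = mul27x27(a_lo, b_hi)
    m1 = mul27x27(a_hi, b_hi)
    second = m0 + (m1 << 27)

    # Add the latched first-cycle result to the shifted second-cycle result.
    return latched + (second << 27)


if __name__ == "__main__":
    for _ in range(1000):
        a = random.getrandbits(53)
        b = random.getrandbits(53)
        assert mul53_iterative(a, b) == a * b
    print("iterative 53x53 product matches direct multiplication")
```

In hardware the first-cycle result would remain in redundant Sum/Carry form and be merged through the 4-2 compression tree; the plain integer addition here only checks that the two-pass decomposition is arithmetically consistent.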