WO2021073512A1 - Multiplier, method, integrated circuit chip, and computing device for floating-point operations - Google Patents
- Publication number
- WO2021073512A1 (application PCT/CN2020/120717)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- floating
- mantissa
- point number
- multiplier
- exponent
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F7/00—Methods or arrangements for processing data by operating upon the order or content of the data handled
- G06F7/38—Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
- G06F7/48—Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
- G06F7/499—Denomination or exception handling, e.g. rounding or overflow
- G06F7/49905—Exception handling
- G06F7/4991—Overflow or underflow
- G06F7/49915—Mantissa overflow or underflow in handling floating-point numbers
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F7/00—Methods or arrangements for processing data by operating upon the order or content of the data handled
- G06F7/38—Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
- G06F7/48—Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
- G06F7/483—Computations with numbers represented by a non-linear combination of denominational numbers, e.g. rational numbers, logarithmic number system or floating-point numbers
- G06F7/487—Multiplying; Dividing
- G06F7/4876—Multiplying
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F7/00—Methods or arrangements for processing data by operating upon the order or content of the data handled
- G06F7/38—Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
- G06F7/48—Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
- G06F7/499—Denomination or exception handling, e.g. rounding or overflow
- G06F7/49942—Significance control
- G06F7/49947—Rounding
- G06F7/49957—Implementation of IEEE-754 Standard
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F7/00—Methods or arrangements for processing data by operating upon the order or content of the data handled
- G06F7/38—Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
- G06F7/48—Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
- G06F7/52—Multiplying; Dividing
- G06F7/523—Multiplying only
- G06F7/53—Multiplying only in parallel-parallel fashion, i.e. both operands being entered in parallel
- G06F7/5318—Multiplying only in parallel-parallel fashion, i.e. both operands being entered in parallel with column wise addition of partial products, e.g. using Wallace tree, Dadda counters
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F7/00—Methods or arrangements for processing data by operating upon the order or content of the data handled
- G06F7/38—Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
- G06F7/48—Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
- G06F7/52—Multiplying; Dividing
- G06F7/523—Multiplying only
- G06F7/533—Reduction of the number of iteration steps or stages, e.g. using the Booth algorithm, log-sum, odd-even
- G06F7/5332—Reduction of the number of iteration steps or stages, e.g. using the Booth algorithm, log-sum, odd-even by skipping over strings of zeroes or ones, e.g. using the Booth Algorithm
Definitions
- This disclosure generally relates to the field of floating-point operations. More specifically, the present disclosure relates to methods, multipliers, integrated circuit chips, and computing devices for floating-point operations.
- the solution of the present disclosure provides a multiplier and method for floating-point operations, an integrated circuit chip including the multiplier, and a computing device.
- the present disclosure provides a multiplier for performing floating-point number multiplication according to an operation mode, wherein the floating-point number includes at least an exponent and a mantissa, and the multiplier includes: an exponent processing unit configured to obtain the exponent after the multiplication operation according to the operation mode, the exponent of the first floating-point number, and the exponent of the second floating-point number; and a mantissa processing unit configured to obtain the mantissa after the multiplication operation according to the operation mode, the mantissa of the first floating-point number, and the mantissa of the second floating-point number, wherein the operation mode is used to indicate the data format of the first floating-point number and the data format of the second floating-point number.
- the present disclosure provides a method for performing a floating-point number multiplication operation using a multiplier, wherein the floating-point number includes at least an exponent and a mantissa, the multiplier performs the multiplication operation based on an operation mode, and the method includes: using the exponent processing unit of the multiplier to obtain the exponent after the multiplication operation according to the operation mode, the exponent of the first floating-point number, and the exponent of the second floating-point number; and using the mantissa processing unit of the multiplier to obtain the mantissa after the multiplication operation according to the operation mode, the mantissa of the first floating-point number, and the mantissa of the second floating-point number, wherein the operation mode is used to indicate the data format of the first floating-point number and the data format of the second floating-point number.
- the present disclosure provides an integrated circuit chip including the multiplier described above.
- the multiplier of the present disclosure may constitute an independent integrated circuit chip or be arranged on an integrated circuit chip or a computing device to implement operations on floating-point numbers in a variety of different data formats.
- with the multiplier, corresponding operation method, integrated circuit chip, and computing device of the present disclosure, it is possible to support operations on multiple floating-point data formats without providing a separate multiplier for each format. The multiplier of the present disclosure is therefore flexible and can be widely used in various floating-point data operations. In addition, when processing input data with a larger bit width, the multiplier of the present disclosure supports cyclic multiplexing, so there is no need to arrange more processing chips, which also reduces the layout area of the integrated circuit.
- FIG. 1 is a schematic diagram showing a floating-point data format according to an embodiment of the present disclosure.
- FIG. 2 is a schematic structural block diagram showing a multiplier according to an embodiment of the present disclosure.
- FIG. 3 is a block diagram showing more details of the multiplier according to an embodiment of the present disclosure.
- FIG. 4 is a schematic block diagram showing a mantissa processing unit according to an embodiment of the present disclosure.
- FIG. 5 is a schematic diagram showing a partial product operation according to an embodiment of the present disclosure.
- FIG. 6 is a schematic block diagram showing the operation flow of the Wallace tree compressor according to an embodiment of the present disclosure.
- FIG. 7 is an overall schematic block diagram showing a multiplier according to an embodiment of the present disclosure.
- FIG. 8 is a flowchart illustrating a method for performing floating-point number multiplication using a multiplier according to an embodiment of the present disclosure.
- FIG. 9 is a structural diagram showing a combined processing device according to an embodiment of the present disclosure.
- FIG. 10 is a schematic diagram showing the structure of a board card according to an embodiment of the present disclosure.
- the technical solution of the present disclosure provides a multiplier, method, integrated circuit chip, and computing device for floating-point number operations as a whole.
- the present disclosure provides a multiplier that supports multiple operation modes, thereby overcoming the defect that the existing multiplier can only support one type of floating-point arithmetic.
- the present disclosure uses multiple operation modes to indicate different floating-point data types. During the multiplication of floating-point numbers, the various operations on the data (including, for example, encoding, compression, summation, normalization, and rounding) are performed based on one of the operation modes, so as to implement the operations associated with one of multiple floating-point data types. Therefore, the multiplier of the present disclosure can support operations in multiple modes, further improving the flexibility of floating-point operations and reducing the cost of operations.
- FIG. 1 is a schematic diagram showing a floating point data format 100 according to an embodiment of the present disclosure.
- the floating-point number to which the technical solution of the present disclosure can be applied can include three parts, such as sign (or sign bit) 102, exponent (or exponent bit) 104, and mantissa (or mantissa bit) 106.
- the floating-point numbers suitable for the multiplier of the present disclosure may include at least one of half-precision floating-point numbers, single-precision floating-point numbers, brain floating-point numbers, double-precision floating-point numbers, and custom floating-point numbers.
- the floating-point number format to which the technical solution of the present disclosure can be applied may be a floating-point format that conforms to the IEEE 754 standard, such as double-precision floating-point number (float64, abbreviated as "FP64"), single-precision floating-point number (float32, abbreviated as "FP32"), or half-precision floating-point number (float16, abbreviated as "FP16").
- the floating-point number format can also be the existing 16-bit brain floating-point number (bfloat16, abbreviated as "BF16"), or a custom floating-point number format, such as an 8-bit brain floating-point number (bfloat8, abbreviated as "BF8"), an unsigned half-precision floating-point number (unsigned float16, abbreviated as "UFP16"), or an unsigned 16-bit brain floating-point number (unsigned bfloat16, abbreviated as "UBF16").
- the multiplier of the present disclosure can at least support the multiplication operation between two floating-point numbers having any of the above-mentioned formats, wherein the two floating-point numbers can have the same or different floating-point data formats.
- the multiplication operation between two floating-point numbers can be, for example, FP16*FP16, BF16*BF16, FP32*FP32, FP32*BF16, FP16*BF16, FP32*FP16, BF8*BF16, UBF16*UFP16, or UBF16*FP16.
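The sign, exponent, and mantissa fields of FIG. 1 can be illustrated with a small sketch. The field widths below are the standard ones for FP16, BF16, FP32, and FP64; the `FloatFormat` class and `split_fields` helper are hypothetical names introduced only for this illustration, and the custom formats (BF8, UFP16, UBF16) are left out because their field widths are not stated here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FloatFormat:
    """Hypothetical description of a floating-point layout (sign | exponent | mantissa)."""
    name: str
    sign_bits: int
    exp_bits: int
    man_bits: int  # stored mantissa bits, excluding the implicit leading 1

    @property
    def bias(self) -> int:
        # Standard IEEE 754 style exponent bias: 2^(exp_bits - 1) - 1
        return (1 << (self.exp_bits - 1)) - 1


FP16 = FloatFormat("FP16", 1, 5, 10)
BF16 = FloatFormat("BF16", 1, 8, 7)
FP32 = FloatFormat("FP32", 1, 8, 23)
FP64 = FloatFormat("FP64", 1, 11, 52)


def split_fields(bits: int, fmt: FloatFormat):
    """Split a raw bit pattern into (sign, exponent, mantissa) for the given format."""
    mantissa = bits & ((1 << fmt.man_bits) - 1)
    exponent = (bits >> fmt.man_bits) & ((1 << fmt.exp_bits) - 1)
    sign = ((bits >> (fmt.man_bits + fmt.exp_bits)) & 1) if fmt.sign_bits else 0
    return sign, exponent, mantissa
```

For instance, `split_fields(0x3C00, FP16)` splits the FP16 encoding of 1.0 into a zero sign, a stored exponent equal to the bias, and a zero mantissa.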
- FIG. 2 is a schematic structural block diagram of a multiplier 200 according to an embodiment of the present disclosure.
- the multiplier of the present disclosure supports multiplication operations of floating-point numbers in various data formats, and these data formats can be indicated by the operation mode of the present disclosure, so that the multiplier works in one of a variety of operation modes.
- the multiplier of the present disclosure may generally include an exponent processing unit 202 and a mantissa processing unit 204, wherein the exponent processing unit is used to process the exponent bits of a floating-point number, and the mantissa processing unit is used to process the mantissa bits of a floating-point number.
- the multiplier may further include a sign processing unit 206, which may be used to process a floating point number including a sign bit.
- the multiplier can perform floating-point operations on the received, input, or buffered first floating-point number and second floating-point number according to one of the operation modes, the first floating-point number and the second floating-point number each having one of the floating-point data formats discussed above. For example, when the multiplier is in the first operation mode, it can support the multiplication of two floating-point numbers FP16*FP16, and when the multiplier is in the second operation mode, it can support the multiplication of two floating-point numbers BF16*BF16.
- similarly, when the multiplier is in the third operation mode, it can support the multiplication of two floating-point numbers FP32*FP32, and when the multiplier is in the fourth operation mode, it can support the multiplication of two floating-point numbers FP32*BF16.
- the correspondence between example operation modes and floating-point number types is shown in Table 2 below.

| Operation mode number | Floating-point number types of the operation |
|---|---|
| 1 | FP16*FP16 |
| 2 | BF16*BF16 |
| 3 | FP32*FP32 |
| 4 | FP32*BF16 |
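As a minimal sketch of how such a mode table might be consulted, the following lookup maps a pair of input formats to the operation mode numbers listed in Table 2. The names `OPERATION_MODES` and `select_mode` are hypothetical and not part of the disclosure; a hardware implementation would use a stored table or decoder rather than Python code.

```python
# Operation mode numbers and format pairs as listed in Table 2.
OPERATION_MODES = {
    1: ("FP16", "FP16"),
    2: ("BF16", "BF16"),
    3: ("FP32", "FP32"),
    4: ("FP32", "BF16"),
}


def select_mode(first_format: str, second_format: str) -> int:
    """Return the operation mode whose format pair matches the two inputs."""
    for mode, formats in OPERATION_MODES.items():
        if formats == (first_format, second_format):
            return mode
    raise ValueError(f"no operation mode for {first_format}*{second_format}")
```

A pair that is not in the table raises `ValueError` in this sketch; how an actual multiplier treats unsupported combinations is not specified here.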
- the above-mentioned Table 2 may be stored in a memory of the multiplier, and the multiplier selects one of the operation modes in the table according to an instruction received from an external device, which may be, for example, the external device 1012 shown in FIG. 10.
- the input of the operation mode can also be realized automatically via the mode selection unit 308 as shown in FIG. 3.
- the mode selection unit can select the multiplier to work in the first operation mode according to the data format of the two floating-point numbers.
- the mode selection unit may select the multiplier to work in the fourth operation mode according to the data format of the two floating point numbers.
- the different operation modes of the present disclosure are associated with corresponding floating-point data. That is to say, the operation mode of the present disclosure can be used to indicate the data format of the first floating-point number and the data format of the second floating-point number. In another embodiment, the operation mode of the present disclosure can not only indicate the data format of the first floating-point number and the data format of the second floating-point number, but can also be used to indicate the data format after the multiplication operation.
- the operation mode extended in conjunction with Table 2 is shown in Table 3 below.
- the operation modes in Table 3 are extended by one bit to indicate the data format after floating-point multiplication.
- for example, when the multiplier works in operation mode 21, it performs a floating-point multiplication on the two input BF16 floating-point numbers (BF16*BF16) and outputs the product in the FP16 data format.
- the above operation mode in number form to indicate the floating point data format is only exemplary and not restrictive. According to the teaching of the present disclosure, it is also conceivable to establish an index according to the operation mode to determine the format of the multiplier and the multiplicand.
- the operation mode includes two indexes, the first index is used to indicate the type of the first floating-point number, and the second index is used to indicate the type of the second floating-point number.
- for example, the first index "1" in the operation mode 13 indicates that the first floating-point number (or multiplicand) is in the first floating-point format, that is, FP16, and the second index "3" indicates that the second floating-point number (or multiplier) is in the second floating-point format, that is, FP32.
- a third index may be added to the operation mode, which indicates the data format of the output result.
- for example, the third index "1" in the operation mode 131 may indicate that the data format of the output result is the first floating-point format, that is, FP16.
- the instruction may include three fields: the first field is used to indicate the data format of the first floating-point number, the second field is used to indicate the data format of the second floating-point number, and the third field is used to indicate the data format of the output result.
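The index-based operation mode can be sketched as a simple decoder. The text only states that index 1 denotes FP16 and index 3 denotes FP32; the mapping of index 2 to BF16 below is an assumption made for illustration, as are the names `INDEX_TO_FORMAT` and `decode_mode`.

```python
# Index 1 -> FP16 and index 3 -> FP32 follow the examples in the text.
# Index 2 -> BF16 is an ASSUMPTION made only for this sketch.
INDEX_TO_FORMAT = {1: "FP16", 2: "BF16", 3: "FP32"}


def decode_mode(mode: int):
    """Decode a 2- or 3-index operation mode into (first, second, output) formats.

    The output format is None when the mode carries no third index.
    """
    digits = [int(d) for d in str(mode)]
    if len(digits) not in (2, 3):
        raise ValueError("operation mode must have two or three indexes")
    formats = [INDEX_TO_FORMAT[d] for d in digits]
    output = formats[2] if len(formats) == 3 else None
    return formats[0], formats[1], output
```

With this decoder, mode 13 yields an FP16 multiplicand and an FP32 multiplier, and mode 131 additionally selects FP16 for the output.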
- FIG. 3 is a block diagram showing a more detailed structure of the multiplier 300 according to an embodiment of the present disclosure. As can be seen from FIG. 3, the multiplier not only includes the exponent processing unit 202, the mantissa processing unit 204, and the optional sign processing unit 206 shown in FIG. 2, but also includes internal components that these units may contain and units related to their operation. An exemplary operation of these units will be described in detail below with reference to FIG. 3.
- the exponent processing unit may be used to obtain the multiplied exponent according to the aforementioned operation mode, the exponent of the first floating-point number and the exponent of the second floating-point number.
- the exponent processing unit may be implemented by an addition and subtraction circuit.
- the exponent processing unit here can be used to add the exponent of the first floating-point number, the exponent of the second floating-point number, and the respective offset values of the corresponding input floating-point data formats, and then subtract the offset value of the output floating-point data format, to obtain the exponent after the multiplication of the first floating-point number and the second floating-point number.
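The exponent arithmetic can be sketched as follows. This sketch assumes the standard IEEE 754 convention in which the stored exponent equals the true exponent plus a positive bias (15 for FP16, 127 for BF16 and FP32, 1023 for FP64); the "offset values" of the disclosure may follow a different sign convention. The function name `product_exponent` is hypothetical, and exponent adjustment from mantissa normalization, overflow, and underflow are not handled.

```python
# Standard exponent biases (positive-bias convention, assumed for this sketch).
BIAS = {"FP16": 15, "BF16": 127, "FP32": 127, "FP64": 1023}


def product_exponent(e1: int, fmt1: str, e2: int, fmt2: str, fmt_out: str) -> int:
    """Stored exponent of a product, before any normalization adjustment.

    e1 and e2 are the stored (biased) exponents of the two inputs.
    """
    true_exponent = (e1 - BIAS[fmt1]) + (e2 - BIAS[fmt2])
    return true_exponent + BIAS[fmt_out]
```

For example, multiplying an FP16 value with stored exponent 16 (true exponent 1) by a BF16 value with stored exponent 129 (true exponent 2) gives a true exponent of 3, which is stored as 130 in an FP32 result.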
- the mantissa processing unit of the multiplier can be used to obtain the mantissa after the multiplication operation according to the foregoing operation mode, the first floating-point number, and the second floating-point number.
- the mantissa processing unit may include a partial product operation unit 312 and a partial product summation unit 314, wherein the partial product operation unit is configured to obtain an intermediate result according to the mantissa of the first floating point number and the mantissa of the second floating point number.
- the intermediate result may be multiple partial products obtained during the multiplication operation of the first floating-point number and the second floating-point number (as shown schematically in FIG. 5 and FIG. 6).
- the partial product summation unit is configured to perform an addition operation on the intermediate result to obtain an addition result, and use the addition result as the mantissa after the multiplication operation.
- the present disclosure uses a Booth encoding circuit to pad the high and low bits of the mantissa of the second floating-point number (for example, the one serving as the multiplier in floating-point operations) with 0 (padding the high-order bits with 0 converts the mantissa from an unsigned number to a signed number) in order to obtain the intermediate result.
- the mantissa of the first floating-point number (such as the multiplicand in floating-point operations) can also be encoded, for example by padding its high and low bits with 0, or both mantissas can be encoded.
- the partial product summation unit may include an adder, which is used to add the intermediate result to obtain the sum result.
- the partial product summation unit includes a Wallace tree and an adder, wherein the Wallace tree is used to add the intermediate results to obtain a second intermediate result, and the adder is used to add the second intermediate result to obtain the sum result.
- the adder may include at least one of a full adder, a serial adder, and a carry-lookahead adder.
- the mantissa processing unit may further include a control circuit 316 which is used, when the operation mode indicates that the mantissa bit width of at least one of the first floating-point number or the second floating-point number is greater than the bit width that the mantissa processing unit can process at one time, to call the mantissa processing unit multiple times according to the operation mode.
- the control circuit may be implemented as a control signal in an embodiment, for example, it may be a counter or a control flag.
- the partial product summation unit may also include a shifter.
- when the mantissa processing unit is called multiple times, the shifter is used in each call to shift the existing sum result, which is then added to the sum result obtained in the current call to obtain a new sum result; the new sum result obtained in the last call is used as the mantissa after the multiplication operation.
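The shift-and-accumulate scheme for wide mantissas can be sketched as below. This is an illustration under stated assumptions, not the circuit of the disclosure: the multiplier mantissa is split into fixed-width chunks processed from the most significant chunk down, and the existing sum is shifted left before each accumulation. The name `multiply_in_passes` and the `chunk_bits` parameter are hypothetical.

```python
def multiply_in_passes(x: int, y: int, y_bits: int, chunk_bits: int) -> int:
    """Multiply x by y using a unit that handles only chunk_bits of y per call."""
    if y < 0 or y >> y_bits:
        raise ValueError("y must fit in y_bits")
    passes = -(-y_bits // chunk_bits)  # ceiling division
    mask = (1 << chunk_bits) - 1
    acc = 0
    # Process chunks from most significant to least significant.
    for p in reversed(range(passes)):
        chunk = (y >> (p * chunk_bits)) & mask
        partial = x * chunk                 # result of the current call
        acc = (acc << chunk_bits) + partial  # shift existing sum, then add
    return acc
```

For a 24-bit mantissa and a unit that handles 12 bits per call, two calls are needed, and the value held in `acc` after the last call is the full product.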
- the multiplier of the present disclosure further includes a regularization unit 318 and a rounding unit 320.
- the regularization unit can be used to perform floating-point regularization processing on the mantissa and exponent after the multiplication operation to obtain a regularized exponent result and a regularized mantissa result, and to use the regularized exponent result and the regularized mantissa result as the exponent after the multiplication operation and the mantissa after the multiplication operation.
- the regularization unit can adjust the bit width of the exponent and the mantissa to meet the requirements of the data format indicated above.
- the regularization unit can also make other adjustments to the exponent or mantissa.
- the regularization unit may also adjust the exponent after the multiplication operation according to the mantissa after the multiplication operation. For example, when the highest bit of the mantissa after the multiplication operation is 1, the exponent obtained after the multiplication operation can be increased by 1.
- the rounding unit may be used to perform a rounding operation on the regularized mantissa result according to a rounding mode, and use the mantissa after the rounding operation is performed as the mantissa after the multiplication operation.
- the rounding unit may perform rounding operations including rounding down, rounding up, and rounding to the nearest significant number, for example.
- the rounding unit can also round the 1 that is shifted out in the process of shifting the mantissa to the right.
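The regularization and rounding steps can be sketched together. The sketch assumes both inputs are normalized significands with an implicit leading 1 and `man_bits` fraction bits, so their product lies in [1, 4); treating ties in "nearest" mode as round-half-to-even is an assumption, since the tie rule is not stated. The function name and mode strings are hypothetical.

```python
def normalize_and_round(product: int, exponent: int, man_bits: int,
                        out_bits: int, mode: str = "nearest"):
    """Normalize a significand product and round it to out_bits fraction bits.

    Returns (exponent, significand), where the significand includes the
    leading 1 and therefore has out_bits + 1 bits.
    """
    width = 2 * (man_bits + 1)
    if product >> (width - 1):      # highest bit is 1: product is in [2, 4)
        exponent += 1
        frac_bits = width - 1
    else:                           # product is in [1, 2)
        frac_bits = width - 2
    drop = frac_bits - out_bits
    if drop < 1:
        raise ValueError("out_bits too large for this sketch")
    kept = product >> drop
    remainder = product & ((1 << drop) - 1)
    half = 1 << (drop - 1)
    if mode == "up":
        kept += 1 if remainder else 0
    elif mode == "nearest":         # ties to even (assumed tie rule)
        if remainder > half or (remainder == half and kept & 1):
            kept += 1
    elif mode != "down":
        raise ValueError(f"unknown rounding mode: {mode}")
    if kept >> (out_bits + 1):      # rounding carried out of the significand
        kept >>= 1
        exponent += 1
    return exponent, kept
```

With 10 fraction bits, the product of 1.5 and 1.5 (significand 1536 each) has its highest bit set, so the exponent is incremented and the significand becomes 1152, which represents 1.125.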
- the multiplier of the present disclosure may also optionally include a sign processing unit.
- the sign processing unit can be used to obtain the sign after the multiplication operation according to the sign of the first floating-point number and the sign of the second floating-point number.
- the sign processing unit may include an exclusive OR logic circuit 322 for performing an exclusive OR operation on the sign of the first floating-point number and the sign of the second floating-point number to obtain the sign after the multiplication operation.
- the sign processing unit can also be implemented by a truth table or logical judgment.
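The two implementations mentioned above, an exclusive OR and a truth table, give the same result, as this small sketch shows. The names `product_sign` and `SIGN_TRUTH_TABLE` are hypothetical.

```python
def product_sign(sign_a: int, sign_b: int) -> int:
    """Sign bit of a product: 0 for positive, 1 for negative."""
    return sign_a ^ sign_b


# Equivalent truth-table form, indexed by (sign_a, sign_b).
SIGN_TRUTH_TABLE = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}
```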
- the multiplier of the present disclosure may further include a normalization processing unit 324 which, when the first floating-point number or the second floating-point number is a non-normalized non-zero floating-point number, normalizes the first floating-point number or the second floating-point number according to the operation mode to obtain the corresponding exponent and mantissa.
- the normalization processing unit can be used to normalize the FP16 type data to BF16 type data, so that the multiplier can operate in the second operation mode.
- the normalization processing unit may also be used to preprocess the mantissa of a normalized floating-point number, which has an implicit 1, and the mantissa of a non-normalized floating-point number, which has no implicit 1 (for example, by extending the mantissa), to facilitate the subsequent operation of the mantissa processing unit.
- the normalization processing unit 324 and the aforementioned regularization unit 318 can also perform the same or similar operations in some embodiments.
- the difference is that the normalization processing unit 324 performs normalization processing on the input floating-point data, whereas the regularization unit 318 performs regularization processing on the mantissa and exponent to be output.
- the multiplier of the present disclosure and its various embodiments have been described above with reference to FIG. 3. Based on the above description, those skilled in the art can understand that the solution of the present disclosure obtains the result of the multiplication operation (including the exponent, the mantissa, and an optional sign) through the execution of the multiplier. Depending on the application scenario, for example, when the aforementioned regularization processing and rounding processing are not required, the result obtained by the mantissa processing unit and the exponent processing unit can be regarded as the final operation result. When the aforementioned regularization processing and rounding processing are required, the exponent and mantissa obtained after the regularization processing and rounding processing can be regarded as the final operation result, or as a part of the final operation result (when the final sign is also considered).
- the solution of the present disclosure uses multiple operation modes to enable the multiplier to support the operation of floating-point numbers of different types or data formats, so that the multiplexing of the multiplier can be realized, thereby saving the overhead of chip design and saving the calculation cost.
- the multiplier of the present disclosure also supports the calculation of high-bit-width floating-point numbers.
- the operation on the mantissa (also called the mantissa bit or the mantissa part) of the present disclosure will be described below in conjunction with FIG. 4.
- FIG. 4 is a schematic block diagram showing an operation 400 of a mantissa processing unit according to an embodiment of the present disclosure.
- the mantissa processing operation of the present disclosure may mainly involve two units, namely, the partial product operation unit and the partial product summation unit discussed in combination with FIG. 3.
- the mantissa processing operation can be roughly divided into a first stage and a second stage. In the first stage, the mantissa processing operation obtains the intermediate results, and in the second stage, it obtains the mantissa result output by the adder 408.
- the first floating-point number and the second floating-point number received by the multiplier may be divided into multiple parts, namely the aforementioned sign (optional), exponent, and mantissa.
- the mantissa part of the two floating-point numbers will enter the mantissa processing unit as input (such as the mantissa processing unit in FIG. 2 or FIG. 3), and specifically enter the partial product operation unit.
- the present disclosure uses the Booth encoding circuit 402 to pad the high and low bits of the mantissa of the second floating-point number (that is, the multiplier in floating-point operations) with 0 and to perform Booth encoding, so that the intermediate result is obtained in the partial product generation circuit 404.
- the first floating-point number and the second floating-point number here are only for illustrative and not restrictive purposes. Therefore, in some application scenarios, the first floating-point number can be the multiplier and the second floating-point number can be the multiplicand.
- encoding operations can also be performed on floating-point numbers that serve as multiplicands.
- Booth coding is briefly introduced below.
- a large number of intermediate results called partial products will be produced through the multiplication operation, and then these partial products will be accumulated to obtain the final result of the multiplication of the two binary numbers.
- the greater the number of partial products the greater the area and power consumption of the array multiplier, the slower the execution speed, and the more difficult it is to implement the circuit.
- the purpose of Booth coding is to effectively reduce the number of summations of partial products, thereby reducing the circuit area.
- the algorithm is to first encode the input multiplier according to the corresponding rules.
- the encoding rules may be, for example, the rules shown in Table 4 below:
- y_{2i+1}, y_{2i}, and y_{2i-1} in Table 4 can represent the values corresponding to each group of sub-data to be encoded (i.e., of the multiplier), and X can represent the mantissa of the first floating-point number (i.e., the multiplicand).
- the coded signal obtained after Booth coding can include five types, which are -2X, 2X, -X, X, and 0, respectively.
- for example, when the received multiplicand is the 8-bit data "X7X6X5X4X3X2X1X0", the following partial products can be obtained:
- when the multiplier bits include the three consecutive bits "001" in the above table, the partial product is X, which can be expressed as "X7X6X5X4X3X2X1X0", the 9th
- when the multiplier bits include the three consecutive bits "011" in the above table
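The encoding described above matches standard radix-4 Booth recoding, which the following sketch implements. It is assumed, not confirmed, that the table below corresponds to Table 4 of the disclosure; it does agree with the two examples given in the text ("001" selects X and "011" selects 2X). The names `BOOTH_TABLE` and `booth_partial_products` are hypothetical.

```python
# Standard radix-4 Booth recoding: (y_{2i+1}, y_{2i}, y_{2i-1}) -> multiple of X.
BOOTH_TABLE = {
    (0, 0, 0): 0, (0, 0, 1): 1, (0, 1, 0): 1, (0, 1, 1): 2,
    (1, 0, 0): -2, (1, 0, 1): -1, (1, 1, 0): -1, (1, 1, 1): 0,
}


def booth_partial_products(x: int, y: int, y_bits: int):
    """Return the shifted Booth partial products of x * y for unsigned y."""
    if y < 0 or y >> y_bits:
        raise ValueError("y must be an unsigned value that fits in y_bits")
    padded = y << 1          # low-order 0 acts as y_{-1}; high-order bits read as 0
    groups = y_bits // 2 + 1  # enough groups to cover a zero sign bit above y
    products = []
    for i in range(groups):
        triple = (padded >> (2 * i)) & 0b111
        key = ((triple >> 2) & 1, (triple >> 1) & 1, triple & 1)
        products.append((BOOTH_TABLE[key] * x) << (2 * i))
    return products
```

An 8-bit multiplier produces five partial products here instead of eight, and their sum equals the ordinary product, which is the reduction in summation count that Booth encoding aims for.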
- the adder may be, for example, one or more full adders, half adders, or various combinations of the two.
- the Wallace tree compressor (or Wallace tree for short) is mainly used to sum the above-mentioned intermediate results (i.e., multiple partial products) so as to reduce the number of accumulations of partial products (i.e., compression).
- a Wallace tree compressor can adopt a carry-save adder (CSA) architecture and the Wallace tree algorithm.
- the calculation speed of the Wallace tree array is much faster than the traditional carry-save addition.
- the Wallace tree compressor can calculate the sum of the partial products of each row in parallel. For example, it can reduce the number of accumulations of N partial products from N-1 to log2(N), thereby increasing the speed of the multiplier, which is of great significance for improving resource utilization.
- the Wallace tree compressor can be designed into many types, such as 7-2 Wallace tree, 4-2 Wallace tree and 3-2 Wallace tree.
- the present disclosure uses a 7-2 Wallace tree as an example of implementing various floating-point operations of the present disclosure, which will be described in detail later in conjunction with FIG. 5 and FIG. 6.
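The compression idea can be sketched with 3-2 carry-save compressors operating on whole words. This is a simplified illustration: the disclosure's example uses 7-2 Wallace trees arranged per bit column, whereas the sketch below repeatedly applies 3-2 compression until two rows remain, which a final adder would then sum. The function names are hypothetical, and rows are assumed to be non-negative integers.

```python
def compress_3_2(a: int, b: int, c: int):
    """Carry-save step: three rows in, a sum row and a carry row out."""
    sum_row = a ^ b ^ c
    carry_row = ((a & b) | (a & c) | (b & c)) << 1
    return sum_row, carry_row


def wallace_reduce(rows):
    """Reduce a list of non-negative rows to at most two rows with the same total."""
    rows = list(rows)
    while len(rows) > 2:
        next_rows = []
        # Compress each complete group of three rows; the groups in one
        # pass are independent and could be evaluated in parallel.
        for i in range(0, len(rows) - 2, 3):
            next_rows.extend(compress_3_2(rows[i], rows[i + 1], rows[i + 2]))
        # Carry over the one or two rows that did not fill a group.
        next_rows.extend(rows[len(rows) - len(rows) % 3:])
        rows = next_rows
    return rows
```

No carry propagates across bit positions inside `compress_3_2`; carry propagation is deferred to the single final addition of the two remaining rows.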
- the Wallace tree compressor disclosed in the present disclosure may be arranged as Wallace trees each having M inputs and N outputs, the number of Wallace trees being not less than K, where N is a preset positive integer less than M, and K is a positive integer not less than the maximum bit width of the intermediate result.
- M can be 7, and N can be 2, which is a 7-2 Wallace tree which will be described in detail below.
- K can take a positive integer of 48, which means that the number of Wallace trees can be 48.
- one or more groups of the Wallace trees can be selected to add the intermediate results, wherein each group has X Wallace trees, and X is the number of bits of the intermediate results.
- the Wallace trees in each group may have a sequential carry relationship, but there is no carry relationship between each group.
- the Wallace tree compressors can be connected by carries: for example, the carry output of a lower-order Wallace tree compressor is passed (as Cin in Figure 6) to the next higher-order Wallace tree compressor, and the carry output (Cout) of that higher-order compressor in turn serves as the carry input of the next higher-order Wallace tree compressor.
- any selection can be made. For example, the Wallace tree compressors can be connected in the order of numbers 0, 1, 2, and 3, or in the order of numbers 0, 2, 4, and 6, as long as the selected Wallace tree compressors follow the above-mentioned carry relationship.
- the 0th to 23rd Wallace trees (that is, the 24 Wallace trees in the first group of Wallace trees) can complete the partial product summation of the first group of multiplications, and the Wallace trees in the group can be connected by carries in turn.
- the 24th to 47th Wallace trees (that is, the 24 Wallace trees in the second group of Wallace trees) can complete the partial product summation of the second group of multiplications, where the Wallace trees in the group are connected by carries in turn.
- there is no carry relationship between the 23rd Wallace tree in the first group and the 24th Wallace tree in the second group; that is, there is no carry relationship between Wallace trees in different groups.
- the compressed partial products are summed by the adder to obtain the result of the mantissa multiplication operation.
- in one or more embodiments of the present disclosure, the adder may include one of a full adder, a serial adder, and a carry-lookahead adder, and is used to sum the last two rows of partial products obtained by the Wallace tree compressor to obtain the result of the mantissa multiplication operation.
- the mantissa multiplication operation shown in FIG. 4 can effectively obtain the result of the mantissa multiplication operation.
- Booth coding can effectively reduce the number of partial product summations, thereby reducing the circuit area
- the Wallace compression tree can calculate the sum of partial products of each row in parallel, thereby increasing the speed of the multiplier.
- Figure 5 shows the partial products 500 obtained from the partial product generating circuit in the mantissa processing unit described in conjunction with Figures 2 to 4, drawn in the figure as four rows of white dots between the two dashed lines, where the white dots in each row indicate one partial product.
- the number of bits can be expanded in advance.
- the black dots in Figure 5 are copies of the highest bit of each 9-bit partial product. It can be seen that the partial products are expanded and aligned to 16 (8+8) bits (that is, the 8-bit width of the multiplicand mantissa plus the 8-bit width of the multiplier mantissa).
- the partial product is expanded to 38 (25+13) bits (that is, the 25-bit width of the multiplicand mantissa plus the 13-bit width of the multiplier mantissa).
- FIG. 6 is a schematic block diagram 600 showing the operation flow of the Wallace tree compressor according to an embodiment of the present disclosure.
- the 7 partial products shown in Figure 6 can be obtained by performing Booth coding on the multiplier and applying the coded signals to the multiplicand. Because the Booth coding algorithm is used, the number of partial products generated is reduced.
- a dashed frame is used in the partial product part to identify a Wallace tree that includes 7 elements, and the process of compressing it from 7 elements to 2 elements is further shown with arrows.
- the compression process (also called the addition process) can be implemented by means of full adders, that is, three elements are input and two elements are output (i.e., a sum "sum" and a carry "carry" to the next higher order).
- the schematic block diagram of the 7-2 Wallace tree compressor is shown on the right side of Figure 6. It can be understood that the Wallace tree compressor takes 7 inputs from one column of partial products (the seven elements indicated in the dashed box on the left side of Figure 6). In operation, the carry input of the Wallace tree in the 0th column is 0, and the carry output Cout of each Wallace tree is used as the carry input Cin of the next Wallace tree.
- the Wallace tree including 7 elements can be compressed to include 2 elements.
- this disclosure uses a 7-2 Wallace tree compressor to compress the 7 rows of partial products into two rows (i.e., the second intermediate result of this disclosure), and uses an adder (for example, a carry-lookahead adder) to obtain the mantissa result.
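The compression of seven partial-product rows into two rows can be illustrated with a word-level carry-save model. This is only a sketch under simplifying assumptions: it applies 3:2 compression (full adders) to whole rows represented as non-negative Python integers, rather than modelling the per-column 7-2 compressors with Cin/Cout of Figure 6, and the function names are hypothetical.

```python
def carry_save_add(a, b, c):
    """3:2 compression: returns (sum, carry) with a + b + c == sum + carry."""
    partial_sum = a ^ b ^ c                      # bitwise sum without carries
    carry = ((a & b) | (a & c) | (b & c)) << 1   # carries, moved one bit up
    return partial_sum, carry


def compress_to_two(rows):
    """Reduce non-negative partial-product rows to at most two rows."""
    rows = list(rows)
    while len(rows) > 2:
        next_rows = []
        full_groups = len(rows) // 3 * 3
        # Each group of three rows is compressed independently (in parallel
        # in hardware); leftover rows pass through to the next level.
        for i in range(0, full_groups, 3):
            next_rows.extend(carry_save_add(*rows[i:i + 3]))
        next_rows.extend(rows[full_groups:])
        rows = next_rows
    return rows


def sum_partial_products(rows):
    """Compress to two rows, then perform one final carry-propagating add."""
    return sum(compress_to_two(rows))
```

Only the final addition of the two remaining rows needs carry propagation, which is where an adder such as a carry-lookahead adder would be used.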
- the mantissa bits of the floating-point number are 10 bits.
- the mantissa bits can be extended by 1 bit, so that the mantissa bits are 11 bits.
- since the mantissa is an unsigned number, when the Booth coding algorithm is used a 0 bit can be added at the high end, so the mantissa is 12 bits in total.
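The two widening steps for a 10-bit mantissa (10 to 11 to 12 bits) can be sketched as below. The sketch assumes the first added bit is the IEEE 754 implicit leading bit (1 for normalized numbers, 0 when the exponent field is 0), which is one reading of the text and not confirmed by it; `extend_mantissa` is a hypothetical helper.

```python
def extend_mantissa(fraction, exponent_field, fraction_bits):
    """Return (value, width) of the mantissa as fed to a Booth encoder.

    Assumption: the first extra bit is the implicit leading bit of IEEE 754,
    and the second is a 0 so the unsigned value reads as non-negative in
    two's complement.
    """
    hidden = 0 if exponent_field == 0 else 1
    value = (hidden << fraction_bits) | fraction  # fraction_bits + 1 bits
    width = fraction_bits + 2                     # plus the leading 0 bit
    return value, width
```

With `fraction_bits=10` this gives a 12-bit operand, with `fraction_bits=7` a 9-bit operand, and with `fraction_bits=23` a 25-bit operand, matching the widths quoted in this section.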
- the partial product generation circuit can obtain 7 partial products in the high and low parts respectively, of which the seventh partial product is 0.
- the bit width of each partial product is 24bit.
- 48 7-2 Wallace trees can be used for compression processing, and the carry from the 23rd Wallace tree to the 24th Wallace tree is 0.
- the mantissa of the floating-point number is 7 bits.
- the mantissa can be expanded to 9 bits.
- the partial product generating circuit can obtain 7 effective partial products in the high and low parts respectively, of which the 6th and 7th partial products are 0. The bit width of each partial product is 18 bits, and compression is performed by two groups of 7-2 Wallace trees, the 0th to 17th and the 24th to 41st, where the carry from the 23rd Wallace tree to the 24th Wallace tree is 0.
- the mantissa of the floating-point number can be 23 bits. Considering the non-normalized non-zero numbers under the IEEE754 standard, the mantissa can be expanded to 24 bits. In order to save the area of the multiplication unit, the multiplier of the present disclosure can be called twice in this operation mode to complete an operation.
- each mantissa multiplication is 25bit*13bit; that is, the mantissa of the first floating-point number ina is extended with a 1-bit 0 to become a 25-bit signed number, and the 24-bit mantissa of the second floating-point number inb is divided into a high 12-bit part and a low 12-bit part, each of which is extended with a 1-bit 0 to obtain two 13-bit multipliers, denoted inb_high13 and inb_low13 respectively.
- the multiplier of the present disclosure is called for the first time to calculate ina*inb_low13, and the multiplier is called for the second time to calculate ina*inb_high13.
- 7 effective partial products are generated by Booth coding, and the bit width of each partial product is 38 bits, compressed by the 0th to 37th 7-2 Wallace trees.
- the mantissa bit of the first floating-point number ina is 23 bits
- the mantissa bit of the second floating-point number inb is 7 bits.
- the mantissas can be expanded to 25 bits and 9 bits respectively, and a 25bit × 9bit multiplication is performed to obtain 7 effective partial products, of which the 6th and 7th partial products are 0. The bit width of each partial product is 34 bits, and compression is performed by the 0th to 33rd Wallace trees.
- the aforementioned mantissa processing unit may further include a control circuit, which may be used when the mantissa bit width of the first floating-point number indicated by the operation mode and/or the mantissa bit width of the first floating-point number is greater than
- the mantissa processing unit is called multiple times according to the operation mode.
- the partial product summation circuit may further include a shifter which, when the mantissa processing unit is called multiple times according to the operation mode and a sum result already exists, shifts the existing sum result and adds it to the sum result obtained by the current call to obtain a new sum result, and uses the new sum result as the sum result.
- the mantissa processing unit can be called twice in the FP32*FP32 operation mode. Specifically, in the first call to the mantissa processing unit, the mantissa bits (i.e., ina*inb_low13) are added in the second stage by the carry-lookahead adder to obtain the second low-order intermediate result. In the second call to the mantissa processing unit, the mantissa bits (i.e., ina*inb_high13) are added in the second stage by the carry-lookahead adder to obtain the second high-order intermediate result. Thereafter, in one embodiment, the second low-order intermediate result and the second high-order intermediate result can be accumulated through the shift operation of the shifter to obtain the mantissa after the multiplication operation.
- the shift operation can be expressed by the following formula:
- the second high-order intermediate result sumh[37:0] is shifted to the left by 12 bits and accumulated with the second low-order intermediate result suml[37:0].
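The two-call scheme and the final shift-and-accumulate step can be checked with a short Python sketch. The names `ina`, `suml`, and `sumh` follow the text; the function `split_multiply` and its `low_bits` parameter are hypothetical, and the leading 0 bits added for Booth coding are omitted because they do not change the numeric values.

```python
def split_multiply(ina, inb, low_bits=12):
    """Multiply ina by inb in two passes over the halves of inb."""
    mask = (1 << low_bits) - 1
    inb_low = inb & mask          # low 12 bits of the 24-bit mantissa
    inb_high = inb >> low_bits    # high 12 bits of the 24-bit mantissa
    suml = ina * inb_low          # first call: low-order intermediate result
    sumh = ina * inb_high         # second call: high-order intermediate result
    # Shift the high-order result left by 12 bits and accumulate.
    return (sumh << low_bits) + suml
```

Because inb equals inb_high * 2^12 + inb_low, the accumulated value is exactly ina * inb.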
- FIG. 4 does not depict or describe other units, such as an exponent processing unit and a sign processing unit.
- an overall description of the multiplier of the present disclosure will be given below with reference to FIG. 7.
- the previous description of the mantissa processing unit is also applicable to the situation depicted in FIG. 7.
- FIG. 7 is an overall schematic block diagram showing a multiplier 700 according to an embodiment of the present disclosure. It should be understood that the positions, existence, and connection relationships of the various units depicted in the figure are only exemplary and not restrictive. For example, some of the units can be integrated, while other units can be separated, or omitted or replaced depending on the application scenario.
- the multiplier of the present disclosure can be exemplarily divided into a first stage and a second stage in the operation of each operation mode according to the operation flow, as shown by the dotted line in the figure.
- the first stage outputs the calculation result of the sign bit, the intermediate calculation result of the exponent bits, and the intermediate calculation result of the mantissa bits (for example, including the aforementioned Booth encoding process of the fixed-point multiplication of the input mantissas and the Wallace tree compression process).
- the second stage regularizes and rounds the exponent and mantissa to output the calculation result of the exponent and the calculation result of the mantissa.
- the multiplier of the present disclosure may include a mode selection unit 702 and a normalization processing unit 704, wherein the mode selection unit may select an operation mode according to an input mode signal (in_mode).
- the input mode signal may correspond to the operation mode number in Table 2.
- the multiplier can be made to work in the operation mode of FP16*FP16, and when the input mode signal indicates the operation mode number "3" in Table 2, the multiplier can be made to work in the FP32*FP32 operation mode.
- FIG. 7 only shows four exemplary operation modes: FP16*FP16, BF16*BF16, FP32*FP32, and FP32*BF16.
- the multiplier of the present disclosure also supports many other different operation modes.
- the normalization processing unit may be configured to, when the first floating-point number or the second floating-point number is a non-normalized non-zero floating-point number, perform normalization processing on it according to the operation mode to obtain the corresponding exponent and mantissa, for example, by regularizing the floating-point number in the data format indicated by the operation mode according to the IEEE754 standard.
- the multiplier includes a mantissa processing unit to perform a multiplication operation of the first floating-point number mantissa and the second floating-point number mantissa.
- the mantissa processing unit may include a bit expansion circuit 706, a Booth encoder 708, a partial product generation circuit 710, a Wallace tree compressor 712, and an adder 714.
- the bit expansion circuit can be used to expand the mantissa in consideration of the denormalized non-zero numbers under the IEEE754 standard, so as to be suitable for the operation of the Booth encoder. Since the Booth encoder, the partial product generation circuit, the Wallace tree compressor, and the adder have been described in detail in conjunction with FIGS. 4-6, the same description is equally applicable here and therefore will not be repeated.
- the multiplier of the present disclosure further includes a regularization unit 716 and a rounding unit 718, which have the same functions as the units shown in FIG. 3.
- the regularization unit can perform floating-point regularization on the addition result and the exponent data from the exponent processing unit according to the data format indicated by the output mode signal "out_mode" shown in the figure, to obtain a regularized exponent result and a regularized mantissa result.
- the regularization unit can adjust the bit width of the exponent and the mantissa to make it meet the requirements of the aforementioned indicated data format.
- the regularization unit can repeatedly shift the mantissa by 1 bit to the left, and subtract 1 from the exponent until the highest bit value is 1.
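The shift-and-decrement loop just described can be sketched as follows. This is a minimal model with hypothetical names; it assumes the mantissa fits in `width` bits and ignores exponent underflow, which a real regularization unit would also have to handle.

```python
def normalize(mantissa, exponent, width):
    """Shift left until bit (width - 1) is 1, decrementing the exponent."""
    if mantissa >> width:
        raise ValueError("mantissa does not fit in the given width")
    if mantissa == 0:
        return 0, exponent        # zero has no leading 1 to align
    top_bit = 1 << (width - 1)
    while not mantissa & top_bit:
        mantissa <<= 1            # shift the mantissa left by 1 bit
        exponent -= 1             # and subtract 1 from the exponent
    return mantissa, exponent
```

For example, `normalize(0b0011, 0, 4)` shifts twice and returns `(0b1100, -2)`.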
- in one embodiment, the rounding unit can be used to perform a rounding operation on the regularized mantissa result according to a rounding mode to obtain a rounded mantissa, and to use the rounded mantissa as the mantissa after the multiplication operation.
- the aforementioned output mode signal may be a part of the operation mode, and is used to indicate the data format after the multiplication operation.
- the output mode signal may be combined with the aforementioned input mode signal and provided to the mode selection unit. Based on the combined mode signal, the mode selection unit can determine the data formats of the input data and the output result at the initial stage of the multiplier's operation, without the output mode signal being provided separately to the regularization unit, which can further simplify the operation.
- the following five rounding modes can be exemplarily included.
- mantissa rounding in "rounding" mode: for example, the 24-bit mantissas of two normalized floating-point numbers are multiplied to obtain a 48-bit mantissa (bits 47 to 0), which is normalized (if the highest bit of the mantissa is 0, the mantissa is shifted left by 1 bit; if the highest bit of the mantissa is 1, the mantissa is not shifted and the temporary exponent obtained above is incremented by 1), and only bits 46 to 24 are taken when outputting.
- bits 23 to 0 are discarded; when bit 23 of the mantissa is 1, a 1 is carried into bit 24 and bits 23 to 0 are discarded.
- the multiplier of the present disclosure further includes an exponent processing unit 720 and a sign processing unit 722, wherein the exponent processing unit can be used to obtain the exponent after the multiplication operation according to the operation mode, the exponent of the first floating-point number, and the exponent of the second floating-point number. For example, the exponent processing circuit can add the exponent bit data of the first floating-point number, the exponent bit data of the second floating-point number, and the respective offset values of the corresponding input floating-point data types, and subtract the offset value of the output floating-point data type, to obtain the exponent bit data of the product of the first floating-point number and the second floating-point number.
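The exponent calculation can be illustrated with the standard IEEE 754 biased-exponent arithmetic. This is a sketch under an assumption: the sign convention used here (remove each input bias, then apply the output bias) is the conventional one and may differ from how the offset values are defined in the disclosure. The bias values for FP16, BF16, and FP32 are the standard ones; `BIAS` and `product_exponent` are hypothetical names.

```python
# Standard exponent biases of the three formats named in this section.
BIAS = {"FP16": 15, "BF16": 127, "FP32": 127}


def product_exponent(exp_a, fmt_a, exp_b, fmt_b, fmt_out):
    """Biased exponent of a product, before normalization and rounding.

    Assumption: conventional handling, i.e. unbias both inputs, add, and
    re-bias for the output format.
    """
    unbiased = (exp_a - BIAS[fmt_a]) + (exp_b - BIAS[fmt_b])
    return unbiased + BIAS[fmt_out]
```

For instance, multiplying FP16 1.0 (exponent field 15) by FP16 2.0 (exponent field 16) with an FP32 result gives the exponent field 128, the FP32 encoding of 2^1.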
- the exponent processing unit may be implemented as or include an addition and subtraction circuit, which is configured to obtain the exponent after the multiplication operation according to the operation mode, the exponent of the first floating-point number, and the exponent of the second floating-point number.
- in one embodiment, the sign processing unit may be implemented as an exclusive OR circuit, which is used to perform an exclusive OR operation on the sign bit data of the first floating-point number and the second floating-point number to obtain the sign bit data of the product of the first floating-point number and the second floating-point number.
- the multiplier of the present disclosure supports operations in multiple operation modes, thereby overcoming the defect of the multiplier that only supports a single floating-point operation in the prior art. Furthermore, since the multiplier of the present disclosure can be multiplexed, it also supports high-bit wide floating-point data, which reduces the operation cost and overhead. In one or more embodiments, the multiplier of the present disclosure may also be arranged or included in an integrated circuit chip or a computing device to implement multiplication operations on floating-point numbers in multiple operation modes.
- the multiplier of the present disclosure may support parallel multiplication operations of multiple sets of floating-point numbers, the multiple sets of floating-point numbers each including a first floating-point number and a second floating-point number.
- the first floating-point numbers in each group of floating-point numbers can be spliced together and input into the multiplier, or they can be input in parallel without splicing
- the second floating-point numbers in each group of floating-point numbers can be spliced together and input into the multiplier, or they can be input into the multiplier in parallel without splicing.
- each set of input floating-point numbers can use the sign processing unit, the mantissa processing unit, and the exponent processing unit to complete the multiplication of the floating-point numbers.
- the multiplier can also include a plurality of sign processing units, mantissa processing units, and exponent processing units as described above. Each set of input floating-point numbers can be processed by different sign processing units, mantissa processing units, and exponent processing units.
- the multiplier may include one or more symbol processing units, one or more exponent processing units, and one or more mantissa processing units, where the number of the three processing units can be combined arbitrarily, for example, the multiplier includes multiple symbols.
- the mantissa processing unit may include multiple Wallace trees, and the multiple Wallace trees may be divided into one or more groups according to actual conditions (for example, the operation mode), with each group of Wallace trees responsible for processing the mantissas of one set of floating-point numbers. For example, the Wallace trees may be divided into two groups, each supporting the multiplication of two 16-bit mantissas: if the mantissas of the first floating-point number and the second floating-point number in each of two sets of floating-point numbers are 16 bits, the Wallace trees in the multiplier support parallel operation on these two sets of 16-bit mantissas.
- the mantissa processing unit may also include multiple groups of other component parts (such as Booth coding circuit, etc.), and each group of other component parts (such as Booth coding circuit, etc.) is responsible for processing the mantissa of a set of floating-point numbers.
- multiple calls can be made to the component parts in the mantissa processing unit, and there is no need to set more than one.
- the first floating-point number and the second floating-point number in each group of floating-point numbers can be spliced together and input into the multiplier.
- FIG. 8 is a flowchart illustrating a method 800 for performing a floating-point number multiplication operation using a multiplier according to an embodiment of the present disclosure. It is understandable that the multiplier described here is the multiplier described in detail above in conjunction with Figures 1 to 7, so the previous descriptions of the multiplier and its internal composition, functions, and operations are also applicable to the description here.
- the method 800 may include, at step S802, using the exponent processing unit of the multiplier to obtain the exponent after the multiplication operation according to the operation mode, the exponent of the first floating-point number, and the exponent of the second floating-point number. The operation mode can be one of a variety of operation modes and can be used to indicate the data format of a floating-point number. In one or more embodiments, the operation mode can also be used to determine the data format of the floating-point number of the output result.
- the method 800 may use the mantissa processing unit of the multiplier to obtain the mantissa after the multiplication operation according to the operation mode, the first floating-point number, and the second floating-point number.
- the present disclosure uses the Booth coding algorithm and the Wallace tree compressor in some preferred embodiments, so as to improve the efficiency of the mantissa processing.
- the method 800 may also, at step S806, obtain the sign after the multiplication operation according to the sign of the first floating-point number and the sign of the second floating-point number.
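The three parts of the method (exponent, mantissa, and sign) can be put together in a small software model for a single FP16*FP16 multiplication. This is a sketch under stated assumptions, not the disclosed circuit: it handles only normalized finite inputs, truncates instead of applying a rounding mode, uses an ordinary integer multiply in place of Booth coding and Wallace tree compression, and all function names are hypothetical.

```python
import struct


def fp16_fields(x):
    """Split a Python float, encoded as FP16, into (sign, exponent, fraction)."""
    bits = struct.unpack("<H", struct.pack("<e", x))[0]
    return bits >> 15, (bits >> 10) & 0x1F, bits & 0x3FF


def fp16_multiply(a, b):
    """Multiply two normalized FP16 values; the result mantissa is truncated."""
    sign_a, exp_a, frac_a = fp16_fields(a)
    sign_b, exp_b, frac_b = fp16_fields(b)
    if exp_a in (0, 31) or exp_b in (0, 31):
        raise ValueError("only normalized finite inputs are handled")
    sign = sign_a ^ sign_b                          # sign: exclusive OR
    exponent = exp_a + exp_b - 15                   # exponent: add, fix bias
    product = (frac_a | 0x400) * (frac_b | 0x400)   # 11-bit x 11-bit mantissas
    if product >> 21:                               # product in [2, 4): renormalize
        product >>= 1
        exponent += 1
    if not 0 < exponent < 31:
        raise OverflowError("result is outside the normalized FP16 range")
    fraction = (product >> 10) & 0x3FF              # drop hidden bit, truncate
    bits = (sign << 15) | (exponent << 10) | fraction
    return struct.unpack("<e", struct.pack("<H", bits))[0]
```

For products that are exactly representable in FP16, such as 1.5 * 2.5 or -3.0 * 3.0, truncation and rounding coincide, so the model returns the exact result.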
- FIG. 9 is a structural diagram showing a combined processing device 900 according to an embodiment of the present disclosure.
- the combined processing device 900 includes a computing device 902, which may include the multiplier of the present disclosure as described above with reference to the accompanying drawings.
- the combined processing device also includes a universal interconnection interface 904 and other processing devices 906.
- the computing device according to the present disclosure interacts with other processing devices to jointly complete the operation specified by the user.
- the other processing device may include one or more types of general-purpose and/or special-purpose processors such as a central processing unit (“CPU"), a graphics processing unit (“GPU”), and a neural network processor.
- the number of processors is not limited but determined according to actual needs.
- the other processing device can be used as an interface between the computing device of the present disclosure (which can be embodied as a machine learning computing device) and external data and control, performing operations including but not limited to data transfer, and completing basic control of the machine learning computing device such as starting and stopping; other processing devices can also cooperate with the machine learning computing device to complete computing tasks.
- the universal interconnection interface can be used to transmit data and control commands between the computing device and other processing devices.
- the computing device can obtain required input data from other processing devices via the universal interconnection interface, and write the input data to the on-chip storage device of the computing device.
- the computing device can obtain control instructions from other processing devices via the universal interconnection interface, and write them into the on-chip control buffer of the computing device.
- the universal interconnection interface can also read the data in the storage module of the computing device and transmit it to other processing devices.
- the combined processing device may further include a storage device 908, which may be connected to the computing device and the other processing device respectively.
- the storage device may be used to store the data of the computing device and the other processing device, and it is especially suitable for data to be calculated that cannot be entirely saved in the internal storage of the computing device or the other processing device.
- the combined processing device of this disclosure can be used as a system on chip (SoC) for mobile phones, robots, drones, video capture and video surveillance equipment, and other equipment, thereby effectively reducing the core area of the control part, increasing the processing speed, and reducing the overall power consumption.
- the universal interconnection interface of the combined processing device is connected to some parts of the equipment.
- Some components here can be, for example, a camera, a monitor, a mouse, a keyboard, a network card or a wifi interface.
- the present disclosure also discloses a chip or integrated circuit chip, which includes the above-mentioned computing device, the combined processing device, and the multiplier of the present disclosure. In other embodiments, the present disclosure also discloses a chip packaging structure, which includes the above-mentioned chip.
- the present disclosure also discloses a board card, which includes the above-mentioned chip packaging structure.
- the board may also include other supporting components.
- the supporting components may include, but are not limited to: a storage device 1004, an interface device 1006, and a control device 1008.
- the storage device is connected to the chip in the chip packaging structure through a bus for storing data.
- the storage device may include multiple groups of storage units 1010. Each group of the storage unit and the chip are connected by a bus. It can be understood that each group of the storage units may be DDR SDRAM ("Double Data Rate SDRAM", double-rate synchronous dynamic random access memory).
- the storage device may include 4 groups of the storage unit. Each group of the storage unit may include a plurality of DDR4 particles (chips). In an embodiment, the chip may include four 72-bit DDR4 controllers. In the 72-bit DDR4 controller, 64 bits are used for data transmission and 8 bits are used for ECC verification.
- each group of the storage unit may include a plurality of double-rate synchronous dynamic random access memories arranged in parallel.
- DDR can transmit data twice in one clock cycle.
- a controller for controlling the DDR is provided in the chip, which is used to control the data transmission and data storage of each storage unit.
- the interface device is electrically connected with the chip in the chip packaging structure.
- the interface device is used to implement data transmission between the chip and an external device 1012 (for example, a server or a computer).
- the interface device may be a standard PCIE interface.
- the data to be processed is transferred from the server to the chip through a standard PCIE interface to realize data transfer.
- the interface device may also be other interfaces, and the present disclosure does not limit the specific manifestations of the above other interfaces, as long as the interface unit can realize the switching function.
- the calculation result of the chip is still transmitted back to the external device (e.g., server) by the interface device.
- the control device is electrically connected with the chip to monitor the state of the chip.
- the chip and the control device may be electrically connected through an SPI interface.
- the control device may include a single-chip microcomputer ("MCU", Micro Controller Unit).
- the chip may include multiple processing chips, multiple processing cores, or multiple processing circuits, and can drive multiple loads. Therefore, the chip can be in different working states such as multi-load and light-load.
- the control device can realize the regulation and control of the working states of multiple processing chips, multiple processing cores, and/or multiple processing circuits in the chip.
- the present disclosure also discloses an electronic device or device, which includes the above-mentioned board.
- electronic equipment or devices can include data processing devices, robots, computers, printers, scanners, tablets, smart terminals, mobile phones, driving recorders, navigators, sensors, cameras, servers, cloud servers, video cameras, projectors, watches, earphones, mobile storage, wearable devices, vehicles, household appliances, and/or medical equipment.
- the transportation means include airplanes, ships, and/or vehicles;
- the household appliances include TVs, air conditioners, microwave ovens, refrigerators, rice cookers, humidifiers, washing machines, electric lights, gas stoves, and range hoods;
- the medical equipment includes nuclear magnetic resonance, B-ultrasound and/or electrocardiograph.
- the disclosed device can be implemented in other ways.
- the device embodiments described above are merely illustrative. For example, the division of the units is only a logical function division, and there may be other divisions in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features can be ignored or not implemented.
- the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, devices or units, and may be in electrical, optical, acoustic, magnetic or other forms.
- the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they may be located in one place, or they may be distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
- the functional units in the various embodiments of the present disclosure may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
- the above-mentioned integrated unit can be realized in the form of hardware or software program module.
- if the integrated unit is implemented in the form of a software program module and sold or used as an independent product, it can be stored in a computer-readable memory.
- the computer software product is stored in a memory and includes several instructions that enable a computer device (which can be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the methods described in the various embodiments of the present disclosure.
- the aforementioned memory includes: a USB flash drive, read-only memory ("ROM"), random access memory ("RAM"), a removable hard disk, a magnetic disk, an optical disk, and other media that can store program code.
- a multiplier for multiplying floating-point numbers according to an operation mode, wherein the floating-point number includes at least an exponent and a mantissa
- the multiplier includes: an exponent processing unit for obtaining the exponent after the multiplication operation according to the operation mode, the exponent of the first floating-point number, and the exponent of the second floating-point number; and a mantissa processing unit for obtaining the mantissa after the multiplication operation according to the operation mode, the mantissa of the first floating-point number, and the mantissa of the second floating-point number, wherein the operation mode is used to indicate the data format of the first floating-point number and the data format of the second floating-point number.
- the multiplier according to any one of clauses A1-A3, wherein the floating-point number further includes a sign, and the multiplier further includes: a sign processing unit configured to obtain the sign after the multiplication operation according to the sign of the first floating-point number and the sign of the second floating-point number.
- Clause A5 the multiplier according to any one of clauses A1-A4, wherein the sign processing unit includes an exclusive OR logic circuit, and the exclusive OR logic circuit is used to perform an exclusive OR operation on the sign of the first floating-point number and the sign of the second floating-point number to obtain the sign after the multiplication operation.
- the multiplier according to any one of clauses A1-A5, further comprising: a normalization processing unit configured to, when the first floating-point number or the second floating-point number is a denormalized non-zero floating-point number, normalize the first floating-point number or the second floating-point number according to the operation mode to obtain the corresponding exponent and mantissa.
- the multiplier according to any one of clauses A1-A7, wherein the mantissa processing unit includes a partial product operation unit and a partial product summation unit, wherein the partial product operation unit is configured to obtain an intermediate result according to the mantissa of the first floating-point number and the mantissa of the second floating-point number, and the partial product summation unit is configured to perform an addition operation on the intermediate result to obtain a sum result and use the sum result as the mantissa after the multiplication operation.
- Clause A8 the multiplier according to any one of clauses A1-A7, wherein the partial product operation unit includes a Booth encoding circuit, and the Booth encoding circuit is used to pad the high and low bits of the mantissa of the first floating-point number or the second floating-point number with 0 and perform Booth encoding to obtain the intermediate result.
- the multiplier according to any one of clauses A1-A8, wherein the partial product summation unit includes an adder, and the adder is configured to add the intermediate result to obtain the sum result.
- the multiplier according to any one of clauses A1-A9, wherein the partial product summation unit includes a Wallace tree and an adder, wherein the Wallace tree is used to sum the intermediate result to obtain a second intermediate result, and the adder is used to sum the second intermediate result to obtain the sum result.
- the multiplier according to any one of clauses A1-A10, wherein the adder includes at least one of a full adder, a serial adder, and a carry-lookahead adder.
- each of the Wallace trees has M inputs and N outputs, and the number of Wallace trees is not less than K, where N is a preset positive integer less than M, and K is a positive integer not less than the maximum bit width of the intermediate result.
- Clause A14 the multiplier according to any one of clauses A1-A13, wherein the partial product summation unit is used to select one or more groups of the Wallace trees to sum the intermediate result according to the operation mode, where each group has X Wallace trees and X is the number of bits of the intermediate result, wherein the Wallace trees within each group are connected by sequential carries, and there is no carry relationship between Wallace trees of different groups.
- the multiplier according to any one of clauses A1-A14, wherein the mantissa processing unit further includes a control circuit configured to call the mantissa processing unit multiple times according to the operation mode when the operation mode indicates that the mantissa bit width of at least one of the first floating-point number or the second floating-point number is greater than the data bit width that the mantissa processing unit can process at one time.
- the multiplier according to any one of clauses A1-A15, wherein the partial product summation unit further includes a shifter; when the control circuit calls the mantissa processing unit multiple times according to the operation mode, the shifter is used in each call to shift the existing sum result and add it to the sum result obtained in the current call to obtain a new sum result, and the new sum result obtained in the last call is used as the mantissa after the multiplication operation.
- the multiplier according to any one of clauses A1-A16, further comprising a regularization unit for performing floating-point regularization processing on the mantissa and exponent after the multiplication operation to obtain a regularized exponent result and a regularized mantissa result, and for using the regularized exponent result and the regularized mantissa result as the exponent after the multiplication operation and the mantissa after the multiplication operation.
- the multiplier according to any one of clauses A1-A17, further comprising a rounding unit configured to perform a rounding operation on the regularized mantissa result according to a rounding mode to obtain a rounded mantissa, and to use the rounded mantissa as the mantissa after the multiplication operation.
- the multiplier according to any one of clauses A1-A18, further comprising: a mode selection unit configured to select, from a plurality of operation modes supported by the multiplier, the operation mode that indicates the data formats of the first floating-point number and the second floating-point number.
- Clause A20 a method for performing floating-point multiplication using a multiplier, wherein the floating-point number includes at least an exponent and a mantissa, the multiplier performs multiplication based on an operation mode, and the method includes: using the exponent processing unit of the multiplier to obtain the exponent after the multiplication operation according to the operation mode, the exponent of the first floating-point number, and the exponent of the second floating-point number;
- the mantissa processing unit of the multiplier is used to obtain the mantissa after the multiplication operation according to the operation mode, the mantissa of the first floating-point number, and the mantissa of the second floating-point number; wherein the operation mode is used to indicate the data format of the first floating-point number and the data format of the second floating-point number.
- Clause A22 a computing device, comprising the multiplier according to any one of clauses A1 to A19 or the integrated circuit chip according to clause A21.
- the term “if” can be interpreted as “when” or “once” or “in response to determination” or “in response to detection” depending on the context.
- the phrase “if determined” or “if detected [described condition or event]” can be interpreted as meaning “once determined” or “in response to determination” or “once detected [described condition or event]” or “in response to detection of [described condition or event]”, depending on the context.
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Computational Mathematics (AREA)
- Mathematical Analysis (AREA)
- Mathematical Optimization (AREA)
- Pure & Applied Mathematics (AREA)
- Computing Systems (AREA)
- General Engineering & Computer Science (AREA)
- Nonlinear Science (AREA)
- Complex Calculations (AREA)
Abstract
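A multiplier, a method, an integrated circuit chip, and a computing device for floating-point operations, which can be widely applied to all kinds of floating-point data operations. The computing device (902) may be included in a combined processing device (900), which may further include a universal interconnection interface (904) and other processing devices (906). The computing device (902) interacts with the other processing devices (906) to jointly complete the computing operations specified by a user. The combined processing device (900) may further include a storage device (908), which is connected to the computing device (902) and to the other processing devices (906) and is used to store the data of the computing device (902) and of the other processing devices (906).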
Description
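Cross-Reference to Related Applications
This application claims priority to Chinese patent application No. 201910970802.8, filed on October 14, 2019 and entitled "Multiplier, Method, Integrated Circuit Chip, and Computing Device for Floating-Point Operations", and to Chinese patent application No. 202011075144.5, filed on October 9, 2020 under the same title, both of which are incorporated herein by reference in their entirety.
The present disclosure relates generally to the field of floating-point operations. More specifically, the present disclosure relates to a method, a multiplier, an integrated circuit chip, and a computing device for floating-point operations.
Current signal processing algorithms, such as inner products between vectors and convolutions of matrices, use a large number of multiply-add operations, and the efficiency of these operations often depends on the execution speed of the multiplier. Although current multipliers have improved significantly in execution efficiency, there is still room for improvement in how they handle floating-point data. How to obtain a high-efficiency, low-power, and low-cost multiplier for multiplying floating-point data has therefore become a problem to be solved in the prior art.
Summary
To at least partially solve the technical problems mentioned in the background, the solution of the present disclosure provides a multiplier for floating-point operations, a method, and an integrated circuit chip and a computing device that include the multiplier.
In one aspect, the present disclosure provides a multiplier for performing floating-point multiplication according to an operation mode, wherein each floating-point number includes at least an exponent and a mantissa. The multiplier includes: an exponent processing unit configured to obtain the exponent after the multiplication according to the operation mode, the exponent of a first floating-point number, and the exponent of a second floating-point number; and a mantissa processing unit configured to obtain the mantissa after the multiplication according to the operation mode, the mantissa of the first floating-point number, and the mantissa of the second floating-point number, wherein the operation mode indicates the data format of the first floating-point number and the data format of the second floating-point number.
In another aspect, the present disclosure provides a method of performing floating-point multiplication using a multiplier, wherein each floating-point number includes at least an exponent and a mantissa and the multiplier performs the multiplication based on an operation mode. The method includes: using the exponent processing unit of the multiplier to obtain the exponent after the multiplication according to the operation mode, the exponent of a first floating-point number, and the exponent of a second floating-point number; and using the mantissa processing unit of the multiplier to obtain the mantissa after the multiplication according to the operation mode, the mantissa of the first floating-point number, and the mantissa of the second floating-point number, wherein the operation mode indicates the data format of the first floating-point number and the data format of the second floating-point number.
In yet another aspect, the present disclosure provides an integrated circuit chip including the multiplier described above. In one or more embodiments, the multiplier of the present disclosure may form a standalone integrated circuit chip or be arranged on an integrated circuit chip or a computing device, so as to operate on floating-point numbers in multiple different data formats.
With the multiplier, the corresponding operation method, the integrated circuit chip, and the computing device of the present disclosure, operations on multiple types of floating-point data can be supported without providing a separate multiplier for each floating-point type. The multiplier of the present disclosure is therefore flexible and can be widely applied to all kinds of floating-point data operations. In addition, when processing input data with a large bit width, the multiplier supports cyclic reuse, so that no additional processing chips need to be arranged, which also reduces the layout area of the integrated circuit.
The above and other objects, features, and advantages of the exemplary embodiments of the present disclosure will become easier to understand by reading the following detailed description with reference to the accompanying drawings. In the drawings, several embodiments of the present disclosure are shown by way of example and not limitation, and the same or corresponding reference numerals denote the same or corresponding parts, in which: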
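Fig. 1 is a schematic diagram showing a floating-point data format according to an embodiment of the present disclosure;
Fig. 2 is a schematic structural block diagram showing a multiplier according to an embodiment of the present disclosure;
Fig. 3 is a structural block diagram showing more details of the multiplier according to an embodiment of the present disclosure;
Fig. 4 is a schematic block diagram showing a mantissa processing unit according to an embodiment of the present disclosure;
Fig. 5 is a schematic diagram showing a partial product operation according to an embodiment of the present disclosure;
Fig. 6 shows the operation flow and a schematic block diagram of a Wallace tree compressor according to an embodiment of the present disclosure;
Fig. 7 is an overall schematic block diagram showing a multiplier according to an embodiment of the present disclosure;
Fig. 8 is a flowchart showing a method of performing floating-point multiplication using a multiplier according to an embodiment of the present disclosure;
Fig. 9 is a structural diagram showing a combined processing device according to an embodiment of the present disclosure; and
Fig. 10 is a schematic structural diagram showing a board card according to an embodiment of the present disclosure.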
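As a whole, the technical solution of the present disclosure provides a multiplier, a method, an integrated circuit chip, and a computing device for floating-point operations. Unlike prior-art floating-point multipliers, the present disclosure provides a multiplier that supports multiple operation modes, thereby overcoming the limitation that existing multipliers support only one type of floating-point operation. In particular, the present disclosure uses multiple operation modes to indicate different floating-point data types and, during floating-point multiplication, performs the various data operations based on one of those modes, including for example encoding, compression, summation, normalization, and rounding, so as to carry out the operation associated with one of the floating-point data types. The multiplier of the present disclosure can thus operate in multiple modes, which further improves the flexibility of floating-point operations and reduces their cost.
The technical solution of the present disclosure and several of its embodiments are described in detail below with reference to the accompanying drawings. It should be understood that many specific details about floating-point operations are set forth in order to provide a thorough understanding of the embodiments described herein. However, under the teaching of the present disclosure, a person of ordinary skill in the art can practice the described embodiments without these specific details. In other instances, well-known methods, procedures, and components are not described in detail, to avoid unnecessarily obscuring the described embodiments. Nor should this description be regarded as limiting the scope of the embodiments of the present disclosure.
Fig. 1 is a schematic diagram showing a floating-point data format 100 according to an embodiment of the present disclosure. As shown in Fig. 1, a floating-point number to which the technical solution of the present disclosure applies may include three parts, for example a sign (or sign bit) 102, an exponent (or exponent bits) 104, and a mantissa (or mantissa bits) 106; an unsigned floating-point number may have no sign or sign bit. In some embodiments, the floating-point numbers suitable for the multiplier of the present disclosure may include at least one of half-precision, single-precision, brain floating-point, double-precision, and custom floating-point numbers. Specifically, in some embodiments the floating-point format may conform to the IEEE 754 standard, for example double precision (float64, abbreviated "FP64"), single precision (float32, "FP32"), or half precision (float16, "FP16"). In other embodiments, the format may be the existing 16-bit brain floating-point format (bfloat16, "BF16") or a custom format such as an 8-bit brain floating-point number (bfloat8, "BF8"), an unsigned half-precision floating-point number (unsigned float16, "UFP16"), or an unsigned 16-bit brain floating-point number (unsigned bfloat16, "UBF16"). For ease of understanding, Table 1 below shows some of these data formats; the sign, exponent, and mantissa bit widths in it are for illustration only.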
Table 1
Data type | Sign bit width | Exponent bit width | Mantissa bit width |
FP16 | 1 | 5 | 10 |
BF16 | 1 | 8 | 7 |
FP32 | 1 | 8 | 23 |
BF8 | 1 | 5 | 3 |
UFP16 | 0 | 5 (or 6) | 11 (or 10) |
UBF16 | 0 | 8 | 8 |
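As a rough illustration of how the field widths in Table 1 partition a raw bit pattern, the following Python sketch records each format as a (sign, exponent, mantissa) width triple and splits an integer accordingly. The names `FloatFormat`, `FORMATS`, and `split_fields` are hypothetical, the widths are copied from the table as listed, and UFP16 is assumed to use the first of its two listed variants.

```python
from typing import NamedTuple


class FloatFormat(NamedTuple):
    sign_bits: int
    exp_bits: int
    man_bits: int


# Widths copied from Table 1; UFP16 uses the 5/11 variant (an assumption).
FORMATS = {
    "FP16": FloatFormat(1, 5, 10),
    "BF16": FloatFormat(1, 8, 7),
    "FP32": FloatFormat(1, 8, 23),
    "BF8": FloatFormat(1, 5, 3),
    "UFP16": FloatFormat(0, 5, 11),
    "UBF16": FloatFormat(0, 8, 8),
}


def split_fields(bits: int, fmt: FloatFormat) -> tuple:
    """Split a raw bit pattern into (sign, exponent, mantissa) fields."""
    mantissa = bits & ((1 << fmt.man_bits) - 1)
    exponent = (bits >> fmt.man_bits) & ((1 << fmt.exp_bits) - 1)
    # For unsigned formats the sign mask is 0, so the sign is always 0.
    sign = (bits >> (fmt.man_bits + fmt.exp_bits)) & ((1 << fmt.sign_bits) - 1)
    return sign, exponent, mantissa
```

For instance, the FP16 pattern `0x3C00` (the value 1.0) splits into sign 0, stored exponent 15, and mantissa 0.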
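For the floating-point formats mentioned above, the multiplier of the present disclosure can in operation support at least the multiplication of two floating-point numbers in any of those formats, where the two numbers may have the same or different floating-point data formats. For example, the multiplication between two floating-point numbers may be FP16*FP16, BF16*BF16, FP32*FP32, FP32*BF16, FP16*BF16, FP32*FP16, BF8*BF16, UBF16*UFP16, UBF16*FP16, and so on.
Fig. 2 is a schematic structural block diagram showing a multiplier 200 according to an embodiment of the present disclosure. As described above, the multiplier of the present disclosure supports the multiplication of floating-point numbers in various data formats, and these formats can be indicated by the operation mode of the present disclosure, so that the multiplier works in one of multiple operation modes.
As shown in Fig. 2, the multiplier of the present disclosure may generally include an exponent processing unit 202 and a mantissa processing unit 204, where the exponent processing unit processes the exponent bits of the floating-point numbers and the mantissa processing unit processes their mantissa bits. Optionally or additionally, in some embodiments, when the floating-point numbers processed by the multiplier have a sign bit, the multiplier may further include a sign processing unit 206, which can be used to process floating-point numbers that include a sign bit.
In operation, the multiplier may perform a floating-point operation, according to one of the operation modes, on a first floating-point number and a second floating-point number that are received, input, or buffered, each of which has one of the floating-point data formats discussed above. For example, in the first operation mode the multiplier can support the multiplication FP16*FP16 of two floating-point numbers, and in the second operation mode it can support the multiplication BF16*BF16. Similarly, in the third operation mode it can support the multiplication FP32*FP32, and in the fourth operation mode it can support the multiplication FP32*BF16. An example correspondence between operation modes and floating-point numbers is shown in Table 2 below.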
Table 2
Operation mode number | Floating-point types operated on |
1 | FP16*FP16 |
2 | BF16*BF16 |
3 | FP32*FP32 |
4 | FP32*BF16 |
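In one embodiment, Table 2 above may be stored in a memory of the multiplier, and the multiplier selects one of the operation modes in the table according to an instruction received from an external device, which may be, for example, the external device 1012 shown in Fig. 10. In another embodiment, the operation mode may also be supplied automatically via the mode selection unit 308 shown in Fig. 3. For example, when two FP16 floating-point numbers are input to the multiplier of the present disclosure, the mode selection unit may, based on the data formats of the two numbers, select the first operation mode for the multiplier. As another example, when one FP32 floating-point number and one BF16 floating-point number are input to the multiplier, the mode selection unit may, based on the data formats of the two numbers, select the fourth operation mode.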
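The automatic selection described above amounts to a lookup from the pair of input formats to a mode number. A minimal Python sketch under that reading follows; the table contents come from Table 2, while the names `MODES` and `select_mode` are hypothetical and the error handling for unsupported pairs is an assumption.

```python
# Operation modes from Table 2: mode number -> (first format, second format).
MODES = {
    1: ("FP16", "FP16"),
    2: ("BF16", "BF16"),
    3: ("FP32", "FP32"),
    4: ("FP32", "BF16"),
}


def select_mode(fmt_a: str, fmt_b: str) -> int:
    """Return the operation mode whose operand formats match the inputs."""
    for mode, pair in MODES.items():
        if pair == (fmt_a, fmt_b):
            return mode
    # Assumption: a pair not listed in Table 2 is simply rejected here.
    raise ValueError(f"no operation mode for {fmt_a}*{fmt_b}")
```

With this sketch, two FP16 inputs select mode 1 and an FP32 input paired with a BF16 input selects mode 4, matching the two examples in the text.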
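It can be seen that the different operation modes of the present disclosure are associated with the corresponding floating-point data. That is, the operation mode can be used to indicate the data format of the first floating-point number and the data format of the second floating-point number. In another embodiment, the operation mode can indicate not only these two data formats but also the data format after the multiplication. The operation modes extended from Table 2 are shown in Table 3 below.
Table 3
Unlike the operation mode numbers shown in Table 2, the operation modes in Table 3 are extended by one digit to indicate the data format after the floating-point multiplication. For example, when the multiplier works in operation mode 21, it performs the floating-point operation on the two input floating-point numbers BF16*BF16 and outputs the result of the multiplication in the FP16 data format.
Indicating the floating-point data formats by numbered operation modes, as above, is merely exemplary and not restrictive. Under the teaching of the present disclosure, it is also conceivable to build indexes from the operation mode to determine the formats of the multiplier and the multiplicand. For example, the operation mode may include two indexes, the first indicating the type of the first floating-point number and the second indicating the type of the second floating-point number. In operation mode 13, the first index "1" indicates that the first floating-point number (the multiplicand) is in the first floating-point format, namely FP16, while the second index "3" indicates that the second floating-point number (the multiplier) is in the second floating-point format, namely FP32. Further, a third index indicating the data format of the output result may be added to the operation mode; for example, the third index "1" in operation mode 131 may indicate that the output result is in the first floating-point format, namely FP16. As the number of operation modes grows, corresponding indexes or index levels can be added as needed to establish the relationship between operation modes and data formats.
In addition, although operation modes are referred to here by numbers, in other examples they may be referred to by other symbols or codes as the application requires, for example letters, symbols, digits, or combinations thereof, and such expressions can both designate the operation mode and identify the data formats of the first floating-point number, the second floating-point number, and the output result. When these expressions take the form of an instruction, the instruction may include three fields: the first field indicates the data format of the first floating-point number, the second field indicates the data format of the second floating-point number, and the third field indicates the data format of the output result. These fields may of course be merged into one field, or new fields may be added to indicate more content related to floating-point data formats. It can be seen that the operation mode of the present disclosure can not only be associated with the data formats of the input floating-point numbers but can also be used to normalize the output result, so as to obtain a product in the desired data format.
Fig. 3 is a structural block diagram showing more details of a multiplier 300 according to an embodiment of the present disclosure. As can be seen from Fig. 3, it includes not only the exponent processing unit 202, the mantissa processing unit 204, and the optional sign processing unit 206 shown in Fig. 2, but also shows the internal components these units may include and the units related to their operation. The exemplary operation of these units is described below with reference to Fig. 3.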
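To perform the multiplication of floating-point numbers, the exponent processing unit can be used to obtain the exponent after the multiplication according to the aforementioned operation mode, the exponent of the first floating-point number, and the exponent of the second floating-point number. In one embodiment, the exponent processing unit may be implemented by an addition and subtraction circuit. For example, the exponent processing unit may add the exponent of the first floating-point number, the exponent of the second floating-point number, and the offset values of their respective input floating-point data formats, and then subtract the offset value of the output floating-point data format, to obtain the exponent of the product of the first and second floating-point numbers.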
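As a rough illustration of the exponent path, the following Python sketch assumes the conventional IEEE 754 biased-exponent encoding, in which each stored exponent carries its format's bias. Under that assumption, the result exponent is the sum of the two unbiased input exponents, re-biased for the output format; the sign convention of the "offset values" in the text may differ from the one used here. The `BIAS` table and the name `product_exponent` are illustrative only, and overflow, underflow, and the later regularization adjustment are ignored.

```python
# Conventional exponent biases: 2**(exp_bits - 1) - 1 for each format.
BIAS = {"FP16": 15, "BF16": 127, "FP32": 127}


def product_exponent(exp_a: int, fmt_a: str,
                     exp_b: int, fmt_b: str,
                     fmt_out: str) -> int:
    """Stored exponent of a*b in fmt_out, before any normalization."""
    # Remove each input bias to get the true powers of two, then add them.
    unbiased = (exp_a - BIAS[fmt_a]) + (exp_b - BIAS[fmt_b])
    # Re-apply the bias of the requested output format.
    return unbiased + BIAS[fmt_out]
```

For example, 2.0 and 4.0 in FP16 have stored exponents 16 and 17; their product 8.0 has stored exponent 18 in FP16 and 130 in FP32.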
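Further, the mantissa processing unit of the multiplier can be used to obtain the mantissa after the multiplication according to the aforementioned operation mode, the first floating-point number, and the second floating-point number. In one embodiment, the mantissa processing unit may include a partial product operation unit 312 and a partial product summation unit 314, where the partial product operation unit obtains an intermediate result from the mantissa of the first floating-point number and the mantissa of the second floating-point number. In some embodiments, the intermediate result may be the multiple partial products obtained while multiplying the first and second floating-point numbers (as schematically shown in Fig. 5 and Fig. 6). The partial product summation unit sums the intermediate result to obtain a sum result and uses the sum result as the mantissa after the multiplication.
To obtain the intermediate result, in one embodiment the present disclosure uses a Booth encoding circuit to pad the high and low bits of the mantissa of the second floating-point number (acting, for example, as the multiplier in the floating-point operation) with 0 (padding the high bits with 0 converts the mantissa from an unsigned number to a signed number). It should be understood that, depending on the encoding method, the mantissa of the first floating-point number (acting, for example, as the multiplicand) may be encoded instead (for example by padding its high and low bits with 0), or both may be encoded, to obtain the partial products. Partial products are described further below with reference to the drawings.
In another embodiment, the partial product summation unit may include an adder that sums the intermediate result to obtain the sum result. In yet another embodiment, the partial product summation unit includes a Wallace tree and an adder, where the Wallace tree sums the intermediate result to obtain a second intermediate result and the adder sums the second intermediate result to obtain the sum result. In these embodiments, the adder may include at least one of a full adder, a serial adder, and a carry-lookahead adder.
In one embodiment, the mantissa processing unit may further include a control circuit 316, which calls the mantissa processing unit multiple times according to the operation mode when the operation mode indicates that the mantissa bit width of at least one of the first or second floating-point number is greater than the data bit width the mantissa processing unit can process at one time. In one embodiment, the control circuit may be implemented as a control signal, for example a counter or a control flag bit. To support these repeated calls, the partial product summation unit may further include a shifter. When the control circuit calls the mantissa processing unit multiple times according to the operation mode, the shifter in each call shifts the existing sum result and adds it to the sum result obtained in the current call to obtain a new sum result, and the new sum result obtained in the last call is used as the mantissa after the multiplication.
In one embodiment, the multiplier of the present disclosure further includes a regularization unit 318 and a rounding unit 320. The regularization unit can perform floating-point regularization on the mantissa and exponent after the multiplication to obtain a regularized exponent result and a regularized mantissa result, and use these as the exponent and mantissa after the multiplication. For example, according to the data format indicated by the operation mode, the regularization unit may adjust the bit widths of the exponent and mantissa so that they meet the requirements of that format. The regularization unit may also adjust the exponent or mantissa in other ways. For example, in some application scenarios, when the value of the mantissa is not 0, the most significant bit of the mantissa should be 1; otherwise, the exponent bits can be modified while the mantissa bits are shifted, so that the number takes normalized form. In another embodiment, the regularization unit may also adjust the exponent after the multiplication according to the mantissa after the multiplication. For example, when the highest bit of the mantissa after the multiplication is 1, the exponent obtained after the multiplication may be incremented by 1. Correspondingly, the rounding unit can perform a rounding operation on the regularized mantissa result according to a rounding mode and use the rounded mantissa as the mantissa after the multiplication. Depending on the application scenario, the rounding unit may, for example, round down, round up, or round to the nearest representable number. In some application scenarios, the rounding unit may also round the 1s shifted out while the mantissa is shifted right.
Besides the exponent processing unit and the mantissa processing unit, the multiplier of the present disclosure optionally includes a sign processing unit. When the input floating-point numbers carry a sign bit, the sign processing unit can obtain the sign after the multiplication from the sign of the first floating-point number and the sign of the second floating-point number. For example, in one embodiment the sign processing unit may include an exclusive OR logic circuit 322, which performs an exclusive OR operation on the sign of the first floating-point number and the sign of the second floating-point number to obtain the sign after the multiplication. In another embodiment, the sign processing unit may also be implemented by a truth table or by logic decisions.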
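In addition, so that the input or received first and second floating-point numbers conform to the specified format, in one embodiment the multiplier of the present disclosure may further include a normalization processing unit 324, which, when the first or second floating-point number is a denormalized non-zero floating-point number, normalizes it according to the operation mode to obtain the corresponding exponent and mantissa. For example, when the selected operation mode is the second mode shown in Table 2 and the input first and second floating-point numbers are FP16 data, the normalization processing unit can normalize the FP16 data into BF16 data so that the multiplier can operate in the second mode. In one or more embodiments, the normalization processing unit can also preprocess the mantissas of normalized floating-point numbers, which have an implicit 1, and of denormalized floating-point numbers, which do not (for example by extending the mantissa), to facilitate the subsequent operation of the mantissa processing unit. Based on the above, it can be understood that the normalization processing unit 324 and the aforementioned regularization unit 318 may in some embodiments perform the same or similar operations; the difference is that the normalization processing unit 324 normalizes the input floating-point data, whereas the regularization unit 318 regularizes the mantissa and exponent that are about to be output.
The multiplier of the present disclosure and several of its embodiments have been described above with reference to Fig. 3. Based on this description, those skilled in the art will understand that the solution obtains the result of the multiplication (including the exponent, the mantissa, and the optional sign) through the operation of the multiplier. Depending on the application scenario, for example when the aforementioned regularization and rounding are not needed, the results obtained by the mantissa processing unit and the exponent processing unit can be regarded as the final result. When regularization and rounding are needed, the exponent and mantissa obtained after them can be regarded as the final result, or as part of it (when the final sign is considered). Further, through multiple operation modes the solution lets the multiplier support floating-point numbers of different types or data formats, so that the multiplier can be reused, which saves chip design overhead and computing cost. In addition, through the multiple-call mechanism, the multiplier also supports the computation of floating-point numbers with a large bit width. Since in floating-point multiplication the multiplication of the mantissas (also called the mantissa bits or mantissa part) is critical to the performance of the whole operation, the mantissa operation of the present disclosure is described below with reference to Fig. 4.
Fig. 4 is a schematic block diagram showing the operation 400 of a mantissa processing unit according to an embodiment of the present disclosure. As shown in Fig. 4, the mantissa processing operation mainly involves two units, namely the partial product operation unit and the partial product summation unit discussed above with reference to Fig. 3. In terms of timing, the operation can be roughly divided into a first stage and a second stage: the first stage produces the intermediate result, and the second stage produces the mantissa result output from the adder 408.
In an exemplary operation, the first and second floating-point numbers received by the multiplier can be divided into several parts, namely the (optional) sign, the exponent, and the mantissa described above. Optionally after normalization, the mantissa parts of the two floating-point numbers are input to the mantissa processing unit (such as the mantissa processing unit in Fig. 2 or Fig. 3), and specifically to the partial product operation unit. As shown in Fig. 4, the present disclosure uses a Booth encoding circuit 402 to pad the high and low bits of the mantissa of the second floating-point number (the multiplier in the floating-point operation) with 0 and perform Booth encoding, so that the intermediate result is obtained in the partial product generation circuit 404. The roles of the first and second floating-point numbers here are of course illustrative and not restrictive; in some application scenarios the first floating-point number may be the multiplier and the second the multiplicand. Correspondingly, in some encoding schemes the encoding may also be applied to the floating-point number acting as the multiplicand.
For a better understanding of the technical solution of the present disclosure, Booth encoding is briefly introduced below. In general, multiplying two binary numbers produces a large number of intermediate results called partial products, which are then accumulated to obtain the final product. The more partial products there are, the larger the area and power consumption of the array multiplier, the slower its execution, and the harder the circuit is to implement. The purpose of Booth encoding is to effectively reduce the number of partial product terms to be summed, thereby reducing circuit area. The algorithm first encodes the input multiplier according to a rule; in one embodiment, the encoding rule may be, for example, the one shown in Table 4 below:
Table 4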
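In Table 4, $y_{2i+1}$, $y_{2i}$, and $y_{2i-1}$ denote the values of each group of sub-data to be encoded (taken from the multiplier), and X denotes the mantissa of the first floating-point number (the multiplicand). Booth-encoding each group of data to be encoded yields the corresponding encoded signal $PP_i$ ($i = 0, 1, 2, \ldots, n$). As Table 4 schematically shows, the encoded signals obtained after Booth encoding fall into five classes: -2X, 2X, -X, X, and 0. By way of example, under the above encoding rule, if the received multiplicand is the 8-bit data "$X_7X_6X_5X_4X_3X_2X_1X_0$", the following partial products can be obtained:
1) When the multiplier bits contain the three consecutive bits "001" from the table above, the partial product is X, which can be written as "$X_7X_6X_5X_4X_3X_2X_1X_0$", with the 9th bit being the sign bit, i.e., PPi={X[7],X};
2) When the multiplier bits contain the three consecutive bits "011" from the table above, the partial product is 2X, which can be expressed as X shifted left by one bit, giving "$X_7X_6X_5X_4X_3X_2X_1X_0 0$", i.e., PPi={X,0};
5) When the multiplier bits contain the three consecutive bits "111" or "000" from the table above, the partial product is 0, i.e., PPi={9′b0}.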
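It should be understood that the above description, given with Table 4, of how the partial products are obtained is merely exemplary and not restrictive. Under the teaching of the present disclosure, those skilled in the art may change the rules in Table 4 to obtain partial products different from those shown there. For example, when the multiplier bits contain a specific number spanning multiple consecutive bits (for example 3 bits or more), the resulting partial product may be the two's complement of the multiplicand, or the "add 1" operation of items 3) and 4) above may instead be performed after the partial products have been summed.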
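To make the encoding step concrete, here is a minimal Python sketch of the standard radix-4 (modified) Booth recoding, which is consistent with the cases spelled out above ("001" gives X, "011" gives 2X, "000" and "111" give 0). Because Table 4 itself is not reproduced in this text, the full digit table `BOOTH_DIGIT` is the textbook one and is an assumption, as are the function name and the use of signed Python integers in place of fixed-width two's-complement rows with a separate "add 1".

```python
# Textbook radix-4 Booth digits for the triplet (y[2i+1], y[2i], y[2i-1]).
BOOTH_DIGIT = {
    0b000: 0, 0b001: 1, 0b010: 1, 0b011: 2,
    0b100: -2, 0b101: -1, 0b110: -1, 0b111: 0,
}


def booth_partial_products(x: int, y: int, y_bits: int) -> list:
    """Partial products of x * y for an unsigned y that fits in y_bits bits."""
    # Pad at least one 0 above the MSB so y reads as a non-negative signed
    # number, and round the width up to an even number of bits.
    width = y_bits + 1
    if width % 2:
        width += 1
    # Pad one 0 below the LSB so the first triplet sees y[-1] = 0.
    padded = y << 1
    products = []
    for i in range(width // 2):
        triplet = (padded >> (2 * i)) & 0b111
        # Each digit selects 0, +/-X or +/-2X, weighted by 4**i.
        products.append((BOOTH_DIGIT[triplet] * x) << (2 * i))
    return products
```

Summing the returned list reproduces `x * y`, and an 8-bit multiplier yields 5 partial products instead of the 8 rows of a plain shift-and-add scheme.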
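From the above introduction it can be understood that, by encoding the mantissa of the second floating-point number with the Booth encoding circuit and using the mantissa of the first floating-point number, the partial product generation circuit can produce multiple partial products as the intermediate result and deliver them to the Wallace tree compressor 406 in the partial product summation unit. It should be understood that obtaining the partial products by Booth encoding is only a preferred approach of the present disclosure, and those skilled in the art may obtain them in other ways. For example, they may be obtained by shifting, that is, depending on whether a multiplier bit is 1 or 0, either the shifted multiplicand or 0 is selected as the corresponding partial product. Similarly, using a Wallace tree compressor to add the partial products is merely exemplary and not restrictive; those skilled in the art may use other types of adders for this purpose, for example one or more full adders, half adders, or various combinations of the two.
The Wallace tree compressor (or simply Wallace tree) is mainly used to sum the above intermediate result (the multiple partial products) so as to reduce the number of accumulations (that is, to compress). A Wallace tree compressor typically adopts a carry-save adder (CSA) architecture and the Wallace tree algorithm, and computing with a Wallace tree array is much faster than conventional carry-propagate addition.
Specifically, the Wallace tree compressor can compute the sums of the partial product rows in parallel. For example, it can reduce the number of accumulation steps for N partial products from N-1 to $\log_2 N$, which increases the speed of the multiplier and makes effective use of resources. Depending on the application, the Wallace tree compressor can be designed in various types, such as 7-2, 4-2, and 3-2 Wallace trees. In one or more embodiments, the present disclosure uses the 7-2 Wallace tree as the example for implementing its floating-point operations, which is described in detail below with reference to Fig. 5 and Fig. 6.
In some embodiments, each Wallace tree used for the compression disclosed herein may have M inputs and N outputs, and the number of Wallace trees may be not less than K, where N is a preset positive integer less than M and K is a positive integer not less than the maximum bit width of the intermediate result. For example, M may be 7 and N may be 2, giving the 7-2 Wallace tree described in detail below. When the maximum bit width of the intermediate result is 48, K may be 48, that is, there may be 48 Wallace trees.
In some embodiments, depending on the operation mode, one or more groups of the Wallace trees may be selected to sum the intermediate result, where each group has X Wallace trees and X is the number of bits of the intermediate result. Further, the Wallace trees within a group may be connected by sequential carries, while there is no carry relationship between groups. In an exemplary connection, the Wallace tree compressors are linked through carries: the carry output of a lower-order Wallace tree compressor is fed (as Cin in Fig. 6) to the next higher-order Wallace tree, and the carry output (Cout) of that higher-order compressor in turn becomes the carry input of the compressor one position higher. In addition, when one or more Wallace trees are selected from the available compressors, the selection may be arbitrary, for example in the order numbered 0, 1, 2, and 3, or in the order numbered 0, 2, 4, and 6, as long as the selected compressors are connected according to the above carry relationship.
The Wallace tree and its operation are introduced below with an illustrative example. Suppose the first and second floating-point numbers are 16-bit data (for example FP16*FP16), the data bit width supported by the multiplier is 32 bits (so that two groups of 16-bit numbers can be multiplied in parallel), and the Wallace tree is a 7-2 Wallace tree compressor with 7 inputs (an example value of M above) and 2 outputs (an example value of N above). In this scenario, 48 Wallace trees (an example value of K above) can be used to complete the two multiplications in parallel.
Of these 48 Wallace trees, trees 0 to 23 (the 24 Wallace trees of the first group) complete the partial product summation of the first multiplication, and the trees within this group are connected in sequence by carries. Trees 24 to 47 (the 24 Wallace trees of the second group) complete the partial product summation of the second multiplication, and the trees within this group are likewise connected in sequence by carries. There is no carry relationship between tree 23 in the first group and tree 24 in the second group, that is, there is no carry relationship between Wallace trees of different groups.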
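Returning to Fig. 4, after the Wallace tree compressor has summed and compressed the partial products, the compressed partial products are summed by an adder to obtain the result of the mantissa multiplication. In one or more embodiments of the present disclosure, the adder may be one of a full adder, a serial adder, and a carry-lookahead adder, and it sums the last two rows of partial products produced by the Wallace tree compressor to obtain the result of the mantissa multiplication.
It can be understood that the mantissa multiplication shown in Fig. 4, in particular with the exemplary use of Booth encoding and the Wallace tree, obtains the result efficiently. Specifically, Booth encoding effectively reduces the number of partial product terms to be summed, thereby reducing circuit area, while the Wallace compression tree computes the sums of the partial product rows in parallel, thereby increasing the speed of the multiplier.
An example operation of the partial products and the 7-2 Wallace tree is described in detail below with reference to Fig. 5 and Fig. 6. It should be understood that this description is merely exemplary and not restrictive, and is intended only to aid understanding of the solution of the present disclosure.
Fig. 5 shows the partial products 500 obtained after the partial product generation circuit in the mantissa processing unit described above with reference to Fig. 2 to Fig. 4, shown as the four rows of white dots between the two dashed lines in the figure, where each row of white dots identifies one partial product. To facilitate the subsequent Wallace tree compression, the number of bits can be extended in advance. For example, the black dots in Fig. 5 are copies of the highest bit of each 9-bit partial product, and it can be seen that the partial products are extended and aligned to 16 (8+8) bits (that is, the 8-bit width of the multiplicand mantissa plus the 8-bit width of the multiplier mantissa). In another embodiment, for example for the partial products of a 25*13 binary multiplication, the partial products are extended to 38 (25+13) bits (that is, the 25-bit width of the multiplicand mantissa plus the 13-bit width of the multiplier mantissa).
Fig. 6 shows the operation flow and a schematic block diagram 600 of a Wallace tree compressor according to an embodiment of the present disclosure.
As shown in Fig. 6, after the mantissas of two floating-point numbers are multiplied, for example by Booth-encoding the multiplier and applying the multiplicand as described above, the 7 partial products shown in Fig. 6 are obtained. The use of the Booth encoding algorithm reduces the number of partial products generated. For ease of understanding, a dashed box in the partial product part of the figure marks one Wallace tree of 7 elements, and arrows show how it is compressed from 7 elements to 2. In one embodiment, this compression (or summation) can be implemented with full adders, each taking three elements as input and producing two as output (a sum and a carry to the next higher bit). The schematic block diagram of the 7-2 Wallace tree compressor is shown on the right of Fig. 6; the compressor has 7 inputs from one column of partial products (the seven elements marked in the dashed box on the left of Fig. 6). In operation, the carry input of the Wallace tree for column 0 is 0, and the carry output Cout of each column's Wallace tree serves as the carry input Cin of the next column's Wallace tree.
As can be seen from the left part of Fig. 6, after four compression steps a Wallace tree of 7 elements is compressed to 2 elements. As mentioned above, the present disclosure uses 7-2 Wallace tree compressors to compress the 7 rows of partial products into two rows (the second intermediate result of the present disclosure) and uses an adder (for example a carry-lookahead adder) to obtain the mantissa result.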
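The compress-then-add idea can be sketched at word level in Python. This is a behavioural model built from 3:2 carry-save steps, not the column-wise 7-2 compressor with Cin/Cout chains drawn in Fig. 6; the function names are hypothetical and the rows are assumed to be non-negative integers.

```python
def carry_save_add(a: int, b: int, c: int) -> tuple:
    """Full adders applied bitwise: three rows in, a sum row and a carry row out."""
    sum_row = a ^ b ^ c
    # The majority function gives the carry bit, which has weight 2.
    carry_row = ((a & b) | (a & c) | (b & c)) << 1
    return sum_row, carry_row


def compress_to_two(rows: list) -> tuple:
    """Reduce any number of rows to two rows without carry propagation."""
    rows = list(rows)
    while len(rows) > 2:
        a, b, c = rows.pop(), rows.pop(), rows.pop()
        rows.extend(carry_save_add(a, b, c))
    while len(rows) < 2:
        rows.append(0)  # pad with zero rows, as for missing partial products
    return rows[0], rows[1]
```

The two returned rows play the role of the second intermediate result: a single conventional addition of them, standing in for the final carry-propagating adder, gives the total of all the input rows.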
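To further illustrate the principle of the solution, the following describes by way of example how the multiplier of the present disclosure completes the first-stage operation in the four operation modes FP16*FP16, BF16*BF16, FP32*FP32, and FP32*BF16, that is, up to the point where the Wallace tree compressor has summed the intermediate result to obtain the second intermediate result:
(1) FP16*FP16
In this operation mode, the mantissa of each floating-point number is 10 bits. To account for denormalized non-zero numbers under the IEEE 754 standard, it can be extended by 1 bit, giving an 11-bit mantissa. In addition, because the mantissa is an unsigned number, a 1-bit 0 can be added at the high end when the Booth encoding algorithm is used, so the total mantissa width is 12 bits. When the second floating-point number, acting as the multiplier, is Booth-encoded with reference to the first floating-point number, the partial product generation circuit produces 7 partial products in each of the high and low parts, of which the seventh is 0. Each partial product is 24 bits wide, so 48 7-2 Wallace trees can be used for compression, and the carry from tree 23 to tree 24 is 0.
(2) BF16*BF16
In this operation mode, the mantissa of each floating-point number is 7 bits. Accounting for denormalized non-zero numbers under the IEEE 754 standard and for extension to a signed number, the mantissa can be extended to 9 bits. When the second floating-point number, acting as the multiplier, is Booth-encoded with reference to the first floating-point number, the partial product generation circuit produces 7 valid partial products in each of the high and low parts, of which the 6th and 7th are 0. Each partial product is 18 bits wide, and compression uses two groups of 7-2 Wallace trees, trees 0 to 17 and trees 24 to 41, where the carry from tree 23 to tree 24 is 0.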
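(3) FP32*FP32
In this operation mode, the mantissa of each floating-point number may be 23 bits. Accounting for denormalized non-zero numbers under the IEEE 754 standard, the mantissa can be extended to 24 bits. To save multiplier area, the multiplier of the present disclosure may be called twice in this mode to complete one operation. Each mantissa multiplication is therefore 25 bits by 13 bits: the mantissa of the first floating-point number ina is extended with one 0 bit to become a 25-bit signed number, and the 24-bit mantissa of the second floating-point number inb is split into high and low 12-bit halves, each extended with one 0 bit to give two 13-bit multipliers, denoted inb_high13 and inb_low13. In operation, the first call to the multiplier computes ina*inb_low13 and the second call computes ina*inb_high13. In each computation, Booth encoding generates 7 valid partial products, each 38 bits wide, which are compressed by 7-2 Wallace trees 0 to 37.
(4) FP32*BF16
In this operation mode, the mantissa of the first floating-point number ina is 23 bits and the mantissa of the second floating-point number inb is 7 bits. Accounting for denormalized non-zero numbers under the IEEE 754 standard and for extension to signed numbers, the mantissas can be extended to 25 bits and 9 bits respectively, and a 25-bit by 9-bit multiplication is performed, yielding 7 valid partial products, of which the 6th and 7th are 0. Each partial product is 34 bits wide and is compressed by Wallace trees 0 to 33.
The specific examples above describe how the multiplier of the present disclosure completes the first-stage operation in four operation modes, preferably using the Booth encoding algorithm and 7-2 Wallace trees. Based on this description, those skilled in the art will understand that the present disclosure uses 7 partial products so that the 7-2 Wallace trees can be reused across the different operation modes.
In some operation modes, the aforementioned mantissa processing unit may further include a control circuit, which calls the mantissa processing unit multiple times according to the operation mode when the mantissa bit width of the first floating-point number and/or of the second floating-point number indicated by the operation mode is greater than the data bit width the mantissa processing unit can process at one time. Further, for the multiple-call case, the partial product summation circuit may also include a shifter which, when the mantissa processing unit is called multiple times according to the operation mode and a sum result already exists, shifts the existing sum result and adds it to the sum result obtained in the current call to obtain a new sum result, which is used as the mantissa after the multiplication.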
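For example, as described above, the mantissa processing unit can be called twice in the FP32*FP32 operation mode. Specifically, in the first call the mantissa bits (ina*inb_low13) are added by the carry-lookahead adder in the second stage to obtain the second low-order intermediate result, and in the second call the mantissa bits (ina*inb_high13) are added by the carry-lookahead adder in the second stage to obtain the second high-order intermediate result. Then, in one embodiment, the second low-order and second high-order intermediate results can be accumulated through the shift operation of the shifter to obtain the mantissa after the multiplication. The shift operation can be expressed as:
$r_{fp32 \times fp32} = (sum_h[37:0] \ll 12) + sum_l[37:0]$
That is, the second high-order intermediate result $sum_h[37:0]$ is shifted left by 12 bits and accumulated with the second low-order intermediate result $sum_l[37:0]$.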
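The two-pass accumulation can be checked with a short Python sketch. It models each pass as a plain integer multiplication of the 24-bit mantissa `man_a` by one 12-bit half of `man_b`, which stands in for the Booth and Wallace hardware path; the function and variable names are illustrative, and the extra zero bits added for signed encoding are omitted because they do not change the values.

```python
def fp32_mantissa_product(man_a: int, man_b: int) -> int:
    """Multiply two 24-bit mantissas using two 24x12-bit passes."""
    assert 0 <= man_a < (1 << 24) and 0 <= man_b < (1 << 24)
    inb_low = man_b & 0xFFF      # low 12 bits of the multiplier mantissa
    inb_high = man_b >> 12       # high 12 bits of the multiplier mantissa
    sum_l = man_a * inb_low      # first call: low-order intermediate result
    sum_h = man_a * inb_high     # second call: high-order intermediate result
    # Shift the high-order result left by 12 bits and accumulate.
    return (sum_h << 12) + sum_l
```

The parentheses around the shift matter in code: in Python, as in C, `+` binds more tightly than `<<`, so `sum_h << 12 + sum_l` would compute something different.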
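The operations the multiplier of the present disclosure performs to multiply the mantissas of the first and second floating-point numbers during a floating-point operation have been described in detail above with reference to Fig. 4 to Fig. 6. To focus on the operation of the mantissa processing unit, Fig. 4 does not draw or describe the other units, such as the exponent processing unit and the sign processing unit. The multiplier as a whole is described below with reference to Fig. 7, and the earlier description of the mantissa processing unit applies equally to the situation depicted in Fig. 7.
Fig. 7 is an overall schematic block diagram showing a multiplier 700 according to an embodiment of the present disclosure. It should be understood that the positions, presence, and connections of the units drawn in the figure are merely exemplary and not restrictive; for example, some of the units may be integrated, while others may be separated, or omitted or replaced depending on the application scenario.
In each operation mode, the operation of the multiplier can be divided by flow into a first stage and a second stage, as drawn by the dashed line in the figure. In summary, the first stage outputs the computed sign bit, the intermediate result for the exponent bits, and the intermediate result for the mantissa bits (including, for example, the Booth encoding and the Wallace tree compression of the fixed-point multiplication of the input mantissas described above). The second stage regularizes and rounds the exponent and mantissa to output the computed exponent and the computed mantissa.
As shown in Fig. 7, the multiplier of the present disclosure may include a mode selection unit 702 and a normalization processing unit 704, where the mode selection unit can select the operation mode according to an input mode signal (in_mode). In one embodiment, the input mode signal may correspond to the operation mode numbers in Table 2. For example, when the input mode signal indicates operation mode number "1" in Table 2, the multiplier can be made to work in the FP16*FP16 mode, and when it indicates operation mode number "3", the multiplier can be made to work in the FP32*FP32 mode. For illustration, Fig. 7 shows only the four exemplary operation modes FP16*FP16, BF16*BF16, FP32*FP32, and FP32*BF16. As stated above, however, the multiplier of the present disclosure also supports many other operation modes.
The normalization processing unit may be configured, when the first or second floating-point number is a denormalized non-zero floating-point number, to normalize it according to the operation mode to obtain the corresponding exponent and mantissa, for example by regularizing, in accordance with the IEEE 754 standard, a floating-point number in the data format indicated by the operation mode.
Further, the multiplier includes a mantissa processing unit to multiply the mantissa of the first floating-point number by the mantissa of the second floating-point number. To this end, in one or more embodiments the mantissa processing unit may include a bit extension circuit 706, a Booth encoder 708, a partial product generation circuit 710, a Wallace tree compressor 712, and an adder 714, where the bit extension circuit extends the mantissa to account for denormalized non-zero numbers under the IEEE 754 standard so that it suits the operation of the Booth encoder. Because the Booth encoder, the partial product generation circuit, the Wallace tree compressor, and the adder have been described in detail with reference to Fig. 4 to Fig. 6, that description applies here as well and is not repeated.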
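In some embodiments, the multiplier of the present disclosure further includes a regularization unit 716 and a rounding unit 718, which have the same functions as the corresponding units shown in Fig. 3. Specifically, the regularization unit can perform floating-point regularization on the sum result and on the exponent data from the exponent processing unit, according to the data format indicated by the output mode signal "out_mode" shown in Fig. 7, to obtain a regularized exponent result and a regularized mantissa result. For example, according to the data format indicated by the output mode signal, the regularization unit can adjust the bit widths of the exponent and mantissa so that they meet the requirements of that format. As another example, when the highest bit of the mantissa is 0 and the mantissa is not 0, the regularization unit can repeatedly shift the mantissa left by 1 bit and decrement the exponent by 1 until the highest bit is 1. In one embodiment, the rounding unit can perform a rounding operation on the regularized mantissa result according to a rounding mode to obtain a rounded mantissa, and use the rounded mantissa as the mantissa after the multiplication.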
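The shift-until-normalized rule in the paragraph above can be sketched in a few lines of Python. The function name and the `width` parameter (the mantissa register width) are assumptions made for illustration, exponent underflow is not modelled, and the mantissa is assumed to fit in `width` bits.

```python
def normalize(mantissa: int, exponent: int, width: int) -> tuple:
    """Shift a non-zero mantissa left until its top bit is 1."""
    if mantissa == 0:
        return mantissa, exponent  # a zero mantissa is left untouched
    top_bit = 1 << (width - 1)
    while not (mantissa & top_bit):
        mantissa <<= 1   # each left shift doubles the mantissa ...
        exponent -= 1    # ... so the exponent drops by 1 to compensate
    return mantissa, exponent
```

For example, a 4-bit mantissa `0b0011` with exponent 10 becomes `0b1100` with exponent 8.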
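In one or more embodiments, the aforementioned output mode signal may be part of the operation mode and indicates the data format after the multiplication. For example, as described for Table 3 above, when the operation mode number is "12", the digit "1" corresponds to the "in_mode" signal described earlier and indicates an FP16*FP16 multiplication, while the digit "2" corresponds to the "out_mode" signal and indicates that the data type of the output result is BF16. It can therefore be understood that, in some application scenarios, the output mode signal can be merged with the input mode signal and supplied to the mode selection unit. Based on this merged mode signal, the mode selection unit can determine the data formats of the input data and of the output result at the initial stage of the multiplier's operation, without supplying a separate output mode signal to the regularization unit, which further simplifies the operation.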
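The merged mode number can be decoded by splitting it into its two decimal digits. The Python sketch below does this for two-digit modes; the input table comes from Table 2, but since Table 3 is not reproduced in this text, the output table `OUT_FORMATS` contains only the two digits spelled out in the description (1 for FP16, 2 for BF16) and is otherwise an assumption, as are the names.

```python
# in_mode digit -> operand formats (from Table 2).
IN_MODES = {
    1: ("FP16", "FP16"),
    2: ("BF16", "BF16"),
    3: ("FP32", "FP32"),
    4: ("FP32", "BF16"),
}
# out_mode digit -> result format; only these two are stated in the text.
OUT_FORMATS = {1: "FP16", 2: "BF16"}


def decode_mode(mode: int) -> tuple:
    """Split a two-digit operation mode into (operand formats, output format)."""
    in_mode, out_mode = divmod(mode, 10)
    return IN_MODES[in_mode], OUT_FORMATS[out_mode]
```

Mode 12 then decodes to an FP16*FP16 multiplication with a BF16 result, and mode 21 to a BF16*BF16 multiplication with an FP16 result, as in the two examples given in the description.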
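In one or more embodiments, the aforementioned rounding operation may include, by way of example, the following 5 rounding modes.
(1) Round to nearest: in this mode, when two values are equally close, the even one is preferred. The result is rounded to the nearest representable value, but when two values are equally near, the even one (the one ending in 0 in binary) is taken as the rounded result;
(2) Round half up: see the example below for an exemplary operation;
(3) Round toward +∞: under this rule, the result is rounded toward positive infinity;
(4) Round toward -∞: under this rule, the result is rounded toward negative infinity; and
(5) Round toward 0: under this rule, the result is rounded toward 0.
An example of mantissa rounding in the "round half up" mode: multiplying the 24-bit mantissas of two normalized floating-point numbers gives a 48-bit mantissa (bits 47 to 0). After normalization (if the highest bit of the mantissa is 0, the mantissa is shifted left by 1 bit; if the highest bit is 1, the mantissa is left unchanged and the temporary exponent computed earlier is incremented by 1), only bits 46 to 24 are taken for output. When bit 23 of the mantissa is 0, bits 23 to 0 are discarded; when bit 23 is 1, a carry of 1 is added into bit 24 and bits 23 to 0 are discarded.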
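The worked example above can be traced in Python as follows. The sketch follows the steps in the text (normalize on bit 47, keep bits 46 to 24, round on bit 23); the handling of a carry out of the 23 kept bits is not described in the text and is an assumption here, as are the function and variable names.

```python
def round_product_mantissa(product: int, exponent: int) -> tuple:
    """Round a 48-bit mantissa product (bits 47..0) to 23 stored bits."""
    if (product >> 47) & 1:
        exponent += 1                               # top bit set: bump exponent
    else:
        product = (product << 1) & ((1 << 48) - 1)  # otherwise shift left by 1
    kept = (product >> 24) & ((1 << 23) - 1)        # bits 46..24
    if (product >> 23) & 1:
        kept += 1                                   # bit 23 set: carry into bit 24
    if kept >> 23:
        # Assumption: a carry out of the kept bits renormalizes the result.
        kept = 0
        exponent += 1
    return kept, exponent
```

For instance, 1.5 * 1.5 corresponds to the mantissa product `9 << 44`; the sketch returns the fraction field `1 << 20` (0.125) with the exponent incremented by 1, that is, 1.125 * 2 = 2.25.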
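Returning to Fig. 7, the multiplier of the present disclosure further includes an exponent processing unit 720 and a sign processing unit 722, where the exponent processing unit can obtain the exponent after the multiplication according to the operation mode, the exponent of the first floating-point number, and the exponent of the second floating-point number. For example, the exponent processing circuit may add the exponent bit data of the first floating-point number, the exponent bit data of the second floating-point number, and the offset values of their respective input floating-point data types, and subtract the offset value of the output floating-point data type, to obtain the exponent bit data of the product of the first and second floating-point numbers. In one or more embodiments, the exponent processing unit may be implemented as, or include, an addition and subtraction circuit, which obtains the exponent after the multiplication according to the operation mode, the exponent of the first floating-point number, and the exponent of the second floating-point number.
In one embodiment, the sign processing unit may be implemented as an exclusive OR circuit, which performs an exclusive OR operation on the sign bit data of the first and second floating-point numbers to obtain the sign bit data of their product.
The multiplier of the present disclosure as a whole has been described in detail above with reference to Fig. 7. From this description, those skilled in the art will understand that the multiplier supports operation in multiple operation modes, thereby overcoming the limitation of prior-art multipliers that support only a single floating-point type. Further, because the multiplier can be reused, it also supports floating-point data with a large bit width, reducing computing cost and overhead. In one or more embodiments, the multiplier may also be arranged as, or included in, an integrated circuit chip or a computing device, so as to multiply floating-point numbers in multiple operation modes.
In another aspect, the multiplier of the present disclosure can support the parallel multiplication of multiple groups of floating-point numbers, each group including a first floating-point number and a second floating-point number. The first floating-point numbers of the groups may be concatenated and input to the multiplier together, or input in parallel without concatenation, and likewise for the second floating-point numbers. Each input group can use the sign processing unit, the mantissa processing unit, and the exponent processing unit to complete its floating-point multiplication. The multiplier may also include multiple sign processing units, mantissa processing units, and exponent processing units as described above, so that each input group is processed by its own sign, mantissa, and exponent processing units. More generally, the multiplier may include one or more sign processing units, one or more exponent processing units, and one or more mantissa processing units, in any combination of counts. For example, the multiplier may include multiple sign processing units, multiple exponent processing units, and one mantissa processing unit, where each group uses a different sign processing unit and a different exponent processing unit and the groups use the same mantissa processing unit in turn. Further, the mantissa processing unit may, for example, include multiple Wallace trees, which can be divided into one or more groups according to the actual situation (for example the operation mode), each group handling the mantissas of one group of floating-point numbers. For example, the trees may be divided into two groups, each supporting the operation of two 16-bit mantissas: if the mantissas of the first and second floating-point numbers in each of two groups are all 16 bits, the Wallace trees in the multiplier support the parallel operation of these two groups of 16-bit mantissas. The mantissa processing unit may also include multiple sets of other components (for example Booth encoding circuits), each set handling the mantissas of one group of floating-point numbers. Alternatively, the components of the mantissa processing unit may be called multiple times instead of being provided in multiples. In addition, the first and second floating-point numbers of each group may be concatenated and input to the multiplier together.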
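Fig. 8 is a flowchart showing a method 800 of performing floating-point multiplication using a multiplier according to an embodiment of the present disclosure. It can be understood that the multiplier referred to here is the multiplier described in detail above with reference to Fig. 1 to Fig. 7, so the earlier description of the multiplier and of its internal composition, functions, and operation applies here as well.
As shown in Fig. 8, the method 800 may include, at step S802, using the exponent processing unit of the multiplier to obtain the exponent after the multiplication according to the operation mode, the exponent of the first floating-point number, and the exponent of the second floating-point number. As stated above, the operation mode may be one of multiple operation modes and may be used to indicate the data format of the floating-point numbers. In one or more embodiments, the operation mode may also be used to determine the data format of the floating-point output result.
Next, at step S804, the method 800 may use the mantissa processing unit of the multiplier to obtain the mantissa after the multiplication according to the operation mode, the first floating-point number, and the second floating-point number. For the mantissa operation, some preferred embodiments of the present disclosure use the Booth encoding algorithm and a Wallace tree compressor to improve the efficiency of mantissa processing. In addition, when the first and second floating-point numbers are signed, the method 800 may, at step S806, obtain the sign after the multiplication from the sign of the first floating-point number and the sign of the second floating-point number.
Although the above method presents the floating-point multiplication performed by the multiplier as a series of steps, this order does not mean that the steps must be performed in the stated order; they may be processed in another order or in parallel. In addition, other steps of the method 800 are not set forth here for brevity, but those skilled in the art will understand from the present disclosure that the method can also use the multiplier to perform the various operations described above with reference to Fig. 1 to Fig. 7.
In the above embodiments of the present disclosure, the description of each embodiment has its own emphasis; for parts not detailed in one embodiment, refer to the related description of other embodiments. The technical features of the above embodiments can be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; however, as long as a combination of these features involves no contradiction, it should be regarded as falling within the scope of this specification.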
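Fig. 9 is a structural diagram showing a combined processing device 900 according to an embodiment of the present disclosure. As shown, the combined processing device 900 includes a computing device 902, which may include the multiplier of the present disclosure as described above with reference to the drawings. The combined processing device also includes a universal interconnection interface 904 and other processing devices 906. The computing device according to the present disclosure interacts with the other processing devices to jointly complete the operations specified by a user.
According to the solution of the present disclosure, the other processing devices may include one or more types of general-purpose and/or special-purpose processors, such as a central processing unit ("CPU"), a graphics processing unit ("GPU"), or a neural network processor; their number is not limited and is determined by actual need. In one or more embodiments, the other processing devices may serve as the interface between the computing device of the present disclosure (which may be embodied as a machine learning computing device) and external data and control, performing tasks including but not limited to data transfer and basic control such as starting and stopping the machine learning computing device; the other processing devices may also cooperate with the machine learning computing device to complete computing tasks.
According to the solution of the present disclosure, the universal interconnection interface can be used to transfer data and control instructions between the computing device and the other processing devices. For example, the computing device may obtain the required input data from the other processing devices via the universal interconnection interface and write it to the on-chip storage of the computing device. Further, the computing device may obtain control instructions from the other processing devices via the universal interconnection interface and write them to the on-chip control cache of the computing device. Alternatively or optionally, the universal interconnection interface may also read data from the storage module of the computing device and transfer it to the other processing devices.
Optionally, the combined processing device may further include a storage device 908, which may be connected to the computing device and to the other processing devices. In one or more embodiments, the storage device may be used to store the data of the computing device and of the other processing devices, and is especially suitable for data to be operated on that cannot be held entirely in the internal storage of the computing device or of the other processing devices.
Depending on the application scenario, the combined processing device of the present disclosure can serve as the system on chip (SoC) of devices such as mobile phones, robots, drones, and video capture or video surveillance equipment, thereby effectively reducing the core area of the control part, increasing processing speed, and lowering overall power consumption. In this case, the universal interconnection interface of the combined processing device is connected to certain components of the device, such as a camera, a display, a mouse, a keyboard, a network card, or a Wi-Fi interface.
In some embodiments, the present disclosure also discloses a chip or integrated circuit chip that includes the above computing device, the combined processing device, and the multiplier of the present disclosure. In other embodiments, the present disclosure also discloses a chip package structure that includes the above chip.
In some embodiments, the present disclosure also discloses a board card that includes the above chip package structure. Referring to Fig. 10, which shows such an exemplary board card, the board card may include, in addition to the above chip 1002, other supporting components, including but not limited to a storage device 1004, an interface device 1006, and a control device 1008.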
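The storage device is connected by a bus to the chip in the chip package structure and is used to store data. The storage device may include multiple groups of storage units 1010. Each group of storage units is connected to the chip by a bus. It can be understood that each group of storage units may be DDR SDRAM ("Double Data Rate SDRAM", double data rate synchronous dynamic random access memory).
DDR can double the speed of SDRAM without raising the clock frequency. DDR allows data to be read on both the rising and falling edges of the clock pulse, so its speed is twice that of standard SDRAM. In one embodiment, the storage device may include 4 groups of the storage units, and each group may include multiple DDR4 dies (chips). In one embodiment, the chip may contain four 72-bit DDR4 controllers, in each of which 64 bits are used for data transfer and 8 bits for ECC checking.
In one embodiment, each group of storage units may include multiple double data rate synchronous dynamic random access memories arranged in parallel. DDR can transfer data twice in one clock cycle. A controller for the DDR is provided in the chip to control the data transfer and data storage of each storage unit.
The interface device is electrically connected to the chip in the chip package structure. The interface device is used to transfer data between the chip and an external device 1012 (for example a server or a computer). In one embodiment, for example, the interface device may be a standard PCIE interface, and the data to be processed is passed from the server to the chip through the standard PCIE interface. In another embodiment, the interface device may be another kind of interface; the present disclosure does not limit its specific form, as long as the interface unit can perform the transfer function. In addition, the computation results of the chip are sent back to the external device (for example a server) by the interface device.
The control device is electrically connected to the chip so as to monitor the state of the chip. Specifically, the chip and the control device may be electrically connected through an SPI interface. The control device may include a microcontroller unit ("MCU"). The chip may include multiple processing chips, multiple processing cores, or multiple processing circuits, and can drive multiple loads. The chip can therefore be in different working states such as heavy load and light load. The control device can regulate the working states of the multiple processing chips, multiple processing cores, and/or multiple processing circuits in the chip.
In some embodiments, the present disclosure also discloses an electronic device or apparatus that includes the above board card. Depending on the application scenario, the electronic device or apparatus may include a data processing device, robot, computer, printer, scanner, tablet, smart terminal, mobile phone, driving recorder, navigator, sensor, webcam, server, cloud server, camera, video camera, projector, watch, earphones, mobile storage, wearable device, vehicle, household appliance, and/or medical equipment. The vehicles include airplanes, ships, and/or cars; the household appliances include televisions, air conditioners, microwave ovens, refrigerators, rice cookers, humidifiers, washing machines, electric lights, gas stoves, and range hoods; the medical equipment includes nuclear magnetic resonance instruments, B-mode ultrasound scanners, and/or electrocardiographs.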
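It should be noted that, for simplicity, the foregoing method embodiments are each presented as a series of actions, but those skilled in the art should know that the present disclosure is not limited by the described order of actions, because according to the present disclosure some steps may be performed in another order or simultaneously. Those skilled in the art should also know that the embodiments described in the specification are all optional embodiments, and the actions and modules involved are not necessarily required by the present disclosure.
In the above embodiments, the description of each embodiment has its own emphasis; for parts not detailed in one embodiment, refer to the related description of other embodiments.
In the several embodiments provided by the present disclosure, it should be understood that the disclosed apparatus may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative: the division into units is only a division by logical function, and other divisions are possible in an actual implementation; multiple units or components may be combined or integrated into another system, or some features may be ignored or not performed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices, or units, and may be electrical, optical, acoustic, magnetic, or of another form.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, the functional units in the embodiments of the present disclosure may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software program module.
If the integrated unit is implemented in the form of a software program module and sold or used as an independent product, it may be stored in a computer-readable memory. Based on this understanding, when the technical solution of the present disclosure is embodied in the form of a software product, the computer software product is stored in a memory and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods described in the embodiments of the present disclosure. The aforementioned memory includes various media that can store program code, such as a USB flash drive, read-only memory ("ROM"), random access memory ("RAM"), a removable hard disk, a magnetic disk, or an optical disk.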
依据以下条款可更好地理解前述内容:
条款A1,一种乘法器,用于根据运算模式进行浮点数乘法运算,其中所述浮点数至少包括指数和尾数,所述乘法器包括:指数处理单元,用于根据所述运算模式、第一浮点数的指数和第二浮点数的指数来获得所述乘法运算后的指数;以及尾数处理单元,用于根据所述运算模式、所述第一浮点数的尾数和所述第二浮点数的尾数来获得所述乘法运算后的尾数,其中,所述运算模式用于指示所述第一浮点数的数据格式和所述第二浮点数的数据格式。
条款A2,根据条款A1所述的乘法器,其中所述运算模式还用于指示所述乘法运算后的数据格式。
条款A3,根据条款A1或条款A2所述的乘法器,其中所述数据格式包括半精度浮点数、 单精度浮点数、脑浮点数、双精度浮点数、自定义浮点数中的至少一种。
条款A4,根据条款A1-A3任一项所述的乘法器,其中所述浮点数还包括符号,所述乘法器进一步包括:符号处理单元,用于根据第一浮点数的符号和第二浮点数的符号获得乘法运算后的符号。
条款A5,根据条款A1-A4任一项所述的乘法器,其中所述符号处理单元包括异或逻辑电路,所述异或逻辑电路用于根据所述第一浮点数的符号和所述第二浮点数的符号进行异或运算,获得所述乘法运算后的符号。
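Clause A6. The multiplier according to any one of clauses A1-A5, further comprising: a normalization processing unit configured to, when the first floating-point number or the second floating-point number is a non-normalized non-zero floating-point number, normalize the first floating-point number or the second floating-point number according to the operation mode to obtain the corresponding exponent and mantissa.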
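The normalization of a non-normalized (subnormal) non-zero input in clause A6 can be sketched as follows. This is a hedged illustration assuming IEEE 754 style subnormals, whose value is `man / 2**man_bits * 2**(1 - bias)`; the function name `normalize_subnormal` and its parameters are hypothetical.

```python
def normalize_subnormal(man: int, man_bits: int, bias: int):
    """Return (unbiased exponent, significand) for a non-zero subnormal input.

    The returned significand has its leading 1 at bit position man_bits, so it
    can be fed to the same mantissa datapath as a normal number.
    """
    if man == 0:
        raise ValueError("zero has no normalized form")
    exponent = 1 - bias                      # exponent shared by all subnormals
    shift = man_bits + 1 - man.bit_length()  # left shift that moves the leading 1 into place
    return exponent - shift, man << shift
```

With the half-precision parameters (`man_bits=10`, `bias=15`), the smallest subnormal, whose mantissa field is 1, normalizes to exponent -24 with significand `1 << 10`, that is, 1.0 × 2^-24.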
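Clause A7. The multiplier according to any one of clauses A1-A6, wherein the mantissa processing unit includes a partial product operation unit and a partial product summation unit, wherein the partial product operation unit is configured to obtain intermediate results according to the mantissa of the first floating-point number and the mantissa of the second floating-point number, and the partial product summation unit is configured to sum the intermediate results to obtain a sum result and to use the sum result as the mantissa after the multiplication operation.
Clause A8. The multiplier according to any one of clauses A1-A7, wherein the partial product operation unit includes a Booth encoding circuit configured to pad the high-order and low-order bits of the mantissa of the first floating-point number or the second floating-point number with 0 and to perform Booth encoding to obtain the intermediate results.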
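To make the Booth encoding in clause A8 concrete, the sketch below generates radix-4 Booth partial products for unsigned significands. The radix and the function name `booth_radix4_partial_products` are assumptions; the clause does not fix them. One 0 is appended below the least significant bit and zeros are assumed above the most significant bit, so that the unsigned multiplier is read as a non-negative two's-complement value.

```python
def booth_radix4_partial_products(multiplicand: int, multiplier: int, width: int):
    """Return radix-4 Booth partial products whose sum is multiplicand * multiplier.

    width is the bit width of the unsigned multiplier (for example 11 for a
    half-precision significand including the hidden bit).
    """
    # Booth digit for each overlapping 3-bit window (b[2i+1], b[2i], b[2i-1]).
    digit_of = {0: 0, 1: 1, 2: 1, 3: 2, 4: -2, 5: -1, 6: -1, 7: 0}
    padded = multiplier << 1   # low-order 0 appended; high-order zeros are implicit
    groups = width // 2 + 1    # enough windows to reach a zero bit above the MSB
    products = []
    for i in range(groups):
        window = (padded >> (2 * i)) & 0b111
        products.append((digit_of[window] * multiplicand) << (2 * i))
    return products
```

An 11-bit multiplier yields 6 partial products instead of the 11 that a plain shift-and-add scheme would produce. Negative entries correspond to the two's-complement rows that a hardware implementation would generate.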
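Clause A9. The multiplier according to any one of clauses A1-A8, wherein the partial product summation unit includes an adder configured to sum the intermediate results to obtain the sum result.
Clause A10. The multiplier according to any one of clauses A1-A9, wherein the partial product summation unit includes a Wallace tree and an adder, wherein the Wallace tree is configured to sum the intermediate results to obtain a second intermediate result, and the adder is configured to sum the second intermediate result to obtain the sum result.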
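The two-stage summation of clause A10 can be sketched in software as a carry-save reduction followed by one ordinary addition. This is a word-level illustration under the assumption of non-negative partial products, not the bit-sliced tree arrangement of the later clauses; the names `carry_save_add` and `wallace_reduce` are hypothetical.

```python
def carry_save_add(x: int, y: int, z: int):
    """3:2 compressor applied to whole words: x + y + z == sum_word + carry_word."""
    sum_word = x ^ y ^ z
    carry_word = ((x & y) | (x & z) | (y & z)) << 1
    return sum_word, carry_word


def wallace_reduce(partial_products):
    """Reduce non-negative partial products to two words without carry propagation.

    The returned pair plays the role of the second intermediate result; a final
    adder (plain + here) turns it into the sum result.
    """
    rows = list(partial_products)
    while len(rows) > 2:
        next_rows = []
        for i in range(0, len(rows) - 2, 3):
            next_rows.extend(carry_save_add(rows[i], rows[i + 1], rows[i + 2]))
        next_rows.extend(rows[len(rows) - len(rows) % 3:])  # leftover rows pass through
        rows = next_rows
    while len(rows) < 2:
        rows.append(0)  # pad with zero rows when fewer than two inputs exist
    return rows[0], rows[1]
```

Calling `s, c = wallace_reduce(rows)` and then computing `s + c` corresponds to the Wallace tree stage followed by the adder stage. Only the final addition propagates carries across the full width.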
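Clause A11. The multiplier according to any one of clauses A1-A10, wherein the adder includes at least one of a full adder, a serial adder, and a carry-lookahead adder.
Clause A12. The multiplier according to any one of clauses A1-A11, wherein, when the number of intermediate results is less than M, zero values are added as intermediate results so that the number of intermediate results equals M, where M is a preset positive integer.
Clause A13. The multiplier according to any one of clauses A1-A12, wherein each Wallace tree has M inputs and N outputs, and the number of Wallace trees is not less than K, where N is a preset positive integer less than M, and K is a positive integer not less than the maximum bit width of the intermediate results.
Clause A14. The multiplier according to any one of clauses A1-A13, wherein the partial product summation unit is configured to select, according to the operation mode, one or more groups of Wallace trees to sum the intermediate results, wherein each group has X Wallace trees, X being the number of bits of the intermediate results, and wherein the Wallace trees within each group have a sequential carry relationship, while Wallace trees in different groups have no carry relationship.
Clause A15. The multiplier according to any one of clauses A1-A14, wherein the mantissa processing unit further includes a control circuit configured to invoke the mantissa processing unit multiple times according to the operation mode when the operation mode indicates that the mantissa bit width of at least one of the first floating-point number and the second floating-point number is greater than the data bit width that the mantissa processing unit can process at one time.
Clause A16. The multiplier according to any one of clauses A1-A15, wherein the partial product summation unit further includes a shifter, and when the control circuit invokes the mantissa processing unit multiple times according to the operation mode, the shifter is used in each invocation to shift the existing sum result, which is then added to the sum result obtained in the current invocation to obtain a new sum result, and the new sum result obtained in the last invocation is used as the mantissa after the multiplication operation.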
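The repeated invocation with shifting in clauses A15 and A16 amounts to multiplying a wide mantissa piece by piece. The sketch below assumes that the pieces are processed from the most significant one downward and that the running sum is shifted left by one piece width per pass; the clauses do not state the processing order, and the name `multiply_in_passes` and its parameters are hypothetical.

```python
def multiply_in_passes(a: int, b: int, b_width: int, chunk_bits: int) -> int:
    """Multiply a by b on a datapath that accepts only chunk_bits of b per pass."""
    num_chunks = -(-b_width // chunk_bits)  # ceiling division
    mask = (1 << chunk_bits) - 1
    acc = 0
    # Most significant chunk first, so the running sum is shifted left each pass.
    for i in reversed(range(num_chunks)):
        chunk = (b >> (i * chunk_bits)) & mask
        partial = a * chunk                  # stands in for one call of the mantissa unit
        acc = (acc << chunk_bits) + partial  # shift the existing sum, add the new one
    return acc
```

For instance, a 53-bit double-precision significand can be handled in three passes on a 24-bit-wide datapath; the value of `acc` after the last pass is the full product.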
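Clause A17. The multiplier according to any one of clauses A1-A16, further comprising a regularization unit configured to perform floating-point regularization on the mantissa and exponent after the multiplication operation to obtain a regularized exponent result and a regularized mantissa result, and to use the regularized exponent result and the regularized mantissa result as the exponent after the multiplication operation and the mantissa after the multiplication operation.
Clause A18. The multiplier according to any one of clauses A1-A17, further comprising a rounding unit configured to perform a rounding operation on the regularized mantissa result according to a rounding mode to obtain a rounded mantissa, and to use the rounded mantissa as the mantissa after the multiplication operation.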
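The regularization and rounding of clauses A17 and A18 can be illustrated with the following sketch. It assumes a single rounding mode, round to nearest with ties to even (the clauses allow other modes), and it ignores exponent overflow, underflow, and subnormal outputs. The function name `normalize_and_round` and its parameters are hypothetical.

```python
def normalize_and_round(product: int, frac_bits: int, exponent: int, out_man_bits: int):
    """Return (exponent, mantissa field) for a non-zero significand product.

    product has frac_bits bits after the binary point; the result is 1.f with
    out_man_bits fraction bits, rounded to nearest with ties to even.
    """
    if product == 0:
        raise ValueError("zero product is handled separately")
    msb = product.bit_length() - 1  # position of the leading 1
    exponent += msb - frac_bits     # move the binary point just after the leading 1
    shift = msb - out_man_bits      # number of low-order bits to drop
    if shift <= 0:
        sig = product << -shift     # result is exact, no rounding needed
    else:
        sig = product >> shift
        remainder = product & ((1 << shift) - 1)
        half = 1 << (shift - 1)
        if remainder > half or (remainder == half and sig & 1):
            sig += 1                # round up, or break a tie towards even
        if sig >> (out_man_bits + 1):
            sig >>= 1               # rounding carried out to 10.00...0
            exponent += 1
    return exponent, sig & ((1 << out_man_bits) - 1)
```

As a usage example, squaring half-precision 1.5 gives the significand product 1536 × 1536 with 20 fraction bits; `normalize_and_round(1536 * 1536, 20, 0, 10)` returns exponent 1 and mantissa field 128, that is, 1.125 × 2 = 2.25.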
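Clause A19. The multiplier according to any one of clauses A1-A18, further comprising: a mode selection unit configured to select, from the multiple operation modes supported by the multiplier, the operation mode that indicates the data formats of the first floating-point number and the second floating-point number.
Clause A20. A method of performing a floating-point multiplication operation using a multiplier, wherein a floating-point number includes at least an exponent and a mantissa and the multiplier performs the multiplication operation based on an operation mode, the method comprising: using the exponent processing unit of the multiplier to obtain the exponent after the multiplication operation according to the operation mode, the exponent of a first floating-point number, and the exponent of a second floating-point number; and using the mantissa processing unit of the multiplier to obtain the mantissa after the multiplication operation according to the operation mode, the mantissa of the first floating-point number, and the mantissa of the second floating-point number; wherein the operation mode indicates the data format of the first floating-point number and the data format of the second floating-point number.
Clause A21. An integrated circuit chip, comprising the multiplier according to any one of clauses A1-A19.
Clause A22. A computing device, comprising the multiplier according to any one of clauses A1-A19 or the integrated circuit chip according to clause A21.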
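The embodiments of the present disclosure have been described in detail above. Specific examples are used herein to explain the principles and implementations of the present disclosure, and the description of the above embodiments is intended only to help in understanding the method of the present disclosure and its core idea. Meanwhile, those of ordinary skill in the art may, based on the idea of the present disclosure, make changes to the specific implementations and the scope of application. In summary, the content of this specification should not be construed as limiting the present disclosure.
It should be understood that the terms "first", "second", "third", "fourth", and the like in the claims, specification, and drawings of the present disclosure are used to distinguish different objects, not to describe a particular order. The terms "include" and "comprise" used in the specification and claims of the present disclosure indicate the presence of the described features, integers, steps, operations, elements, and/or components, but do not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or collections thereof.
It should also be understood that the terms used in the specification of the present disclosure are only for the purpose of describing particular embodiments and are not intended to limit the present disclosure. As used in the specification and claims of the present disclosure, the singular forms "a", "an", and "the" are intended to include the plural forms unless the context clearly indicates otherwise. It should be further understood that the term "and/or" used in the specification and claims of the present disclosure refers to any combination and all possible combinations of one or more of the associated listed items, and includes these combinations.
As used in this specification and the claims, the term "if" may be interpreted, depending on the context, as "when", "once", "in response to determining", or "in response to detecting". Similarly, the phrase "if it is determined" or "if [the described condition or event] is detected" may be interpreted, depending on the context, as "once it is determined", "in response to determining", "once [the described condition or event] is detected", or "in response to detecting [the described condition or event]".
The embodiments of the present disclosure have been described in detail above. Specific examples are used herein to explain the principles and implementations of the present disclosure, and the description of the above embodiments is intended only to help in understanding the method of the present disclosure and its core idea. Meanwhile, changes or variations that those skilled in the art make, based on the idea of the present disclosure, to the specific implementations and the scope of application of the present disclosure all fall within the scope of protection of the present disclosure. In summary, the content of this specification should not be construed as limiting the present disclosure.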
Claims (22)
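- A multiplier for performing a floating-point multiplication operation according to an operation mode, wherein a floating-point number includes at least an exponent and a mantissa, the multiplier comprising: an exponent processing unit configured to obtain the exponent after the multiplication operation according to the operation mode, the exponent of a first floating-point number, and the exponent of a second floating-point number; and a mantissa processing unit configured to obtain the mantissa after the multiplication operation according to the operation mode, the mantissa of the first floating-point number, and the mantissa of the second floating-point number, wherein the operation mode indicates the data format of the first floating-point number and the data format of the second floating-point number.
- The multiplier according to claim 1, wherein the operation mode further indicates the data format after the multiplication operation.
- The multiplier according to claim 1 or 2, wherein the data format includes at least one of a half-precision floating-point number, a single-precision floating-point number, a brain floating-point number, a double-precision floating-point number, and a custom floating-point number.
- The multiplier according to claim 1 or 2, wherein the floating-point number further includes a sign, the multiplier further comprising: a sign processing unit configured to obtain the sign after the multiplication operation according to the sign of the first floating-point number and the sign of the second floating-point number.
- The multiplier according to claim 4, wherein the sign processing unit includes an exclusive-OR logic circuit configured to perform an exclusive-OR operation on the sign of the first floating-point number and the sign of the second floating-point number to obtain the sign after the multiplication operation.
- The multiplier according to claim 1 or 2, further comprising: a normalization processing unit configured to, when the first floating-point number or the second floating-point number is a non-normalized non-zero floating-point number, normalize the first floating-point number or the second floating-point number according to the operation mode to obtain the corresponding exponent and mantissa.
- The multiplier according to claim 1 or 2, wherein the mantissa processing unit includes a partial product operation unit and a partial product summation unit, wherein the partial product operation unit is configured to obtain intermediate results according to the mantissa of the first floating-point number and the mantissa of the second floating-point number, and the partial product summation unit is configured to sum the intermediate results to obtain a sum result and to use the sum result as the mantissa after the multiplication operation.
- The multiplier according to claim 7, wherein the partial product operation unit includes a Booth encoding circuit configured to pad the high-order and low-order bits of the mantissa of the first floating-point number or the second floating-point number with 0 and to perform Booth encoding to obtain the intermediate results.
- The multiplier according to claim 8, wherein the partial product summation unit includes an adder configured to sum the intermediate results to obtain the sum result.
- The multiplier according to claim 8, wherein the partial product summation unit includes a Wallace tree and an adder, wherein the Wallace tree is configured to sum the intermediate results to obtain a second intermediate result, and the adder is configured to sum the second intermediate result to obtain the sum result.
- The multiplier according to claim 9 or 10, wherein the adder includes at least one of a full adder, a serial adder, and a carry-lookahead adder.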
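- The multiplier according to claim 10, wherein, when the number of intermediate results is less than M, zero values are added as intermediate results so that the number of intermediate results equals M, where M is a preset positive integer.
- The multiplier according to claim 12, wherein each Wallace tree has M inputs and N outputs, and the number of Wallace trees is not less than K, where N is a preset positive integer less than M, and K is a positive integer not less than the maximum bit width of the intermediate results.
- The multiplier according to claim 13, wherein the partial product summation unit is configured to select, according to the operation mode, one or more groups of Wallace trees to sum the intermediate results, wherein each group of Wallace trees has X Wallace trees, X being the number of bits of the intermediate results, and wherein the Wallace trees within each group have a sequential carry relationship, while Wallace trees in different groups have no carry relationship.
- The multiplier according to any one of claims 12-14, wherein the mantissa processing unit further includes a control circuit configured to invoke the mantissa processing unit multiple times according to the operation mode when the operation mode indicates that the mantissa bit width of at least one of the first floating-point number and the second floating-point number is greater than the data bit width that the mantissa processing unit can process at one time.
- The multiplier according to claim 15, wherein the partial product summation unit further includes a shifter, and when the control circuit invokes the mantissa processing unit multiple times according to the operation mode, the shifter is used in each invocation to shift the existing sum result, which is then added to the sum result obtained in the current invocation to obtain a new sum result, and the new sum result obtained in the last invocation is used as the mantissa after the multiplication operation.
- The multiplier according to claim 16, further comprising a regularization unit configured to: perform floating-point regularization on the mantissa and exponent after the multiplication operation to obtain a regularized exponent result and a regularized mantissa result, and use the regularized exponent result and the regularized mantissa result as the exponent after the multiplication operation and the mantissa after the multiplication operation.
- The multiplier according to claim 17, further comprising: a rounding unit configured to perform a rounding operation on the regularized mantissa result according to a rounding mode to obtain a rounded mantissa, and to use the rounded mantissa as the mantissa after the multiplication operation.
- The multiplier according to claim 1 or 2, further comprising: a mode selection unit configured to select, from the multiple operation modes supported by the multiplier, the operation mode that indicates the data formats of the first floating-point number and the second floating-point number.
- A method of performing a floating-point multiplication operation using a multiplier, wherein a floating-point number includes at least an exponent and a mantissa and the multiplier performs the multiplication operation based on an operation mode, the method comprising: using the exponent processing unit of the multiplier to obtain the exponent after the multiplication operation according to the operation mode, the exponent of a first floating-point number, and the exponent of a second floating-point number; and using the mantissa processing unit of the multiplier to obtain the mantissa after the multiplication operation according to the operation mode, the mantissa of the first floating-point number, and the mantissa of the second floating-point number; wherein the operation mode indicates the data format of the first floating-point number and the data format of the second floating-point number.
- An integrated circuit chip, comprising the multiplier according to any one of claims 1-19.
- A computing device, comprising the multiplier according to any one of claims 1-19 or the integrated circuit chip according to claim 21.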
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US17/620,601 US20230076931A1 (en) | 2019-10-14 | 2020-10-13 | Multiplier for floating-point operation, method, integrated circuit chip, and calculation device |
Applications Claiming Priority (4)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910970802 | 2019-10-14 | ||
CN201910970802.8 | 2019-10-14 | ||
CN202011075144.5A CN112732221A (zh) | 2019-10-14 | 2020-10-09 | 用于浮点运算的乘法器、方法、集成电路芯片和计算装置 |
CN202011075144.5 | 2020-10-09 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2021073512A1 true WO2021073512A1 (zh) | 2021-04-22 |
Family
ID=75538449
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2020/120717 WO2021073512A1 (zh) | 2019-10-14 | 2020-10-13 | 用于浮点运算的乘法器、方法、集成电路芯片和计算装置 |
Country Status (2)
Country | Link |
---|---|
US (1) | US20230076931A1 (zh) |
WO (1) | WO2021073512A1 (zh) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113608718A (zh) * | 2021-07-12 | 2021-11-05 | 中国科学院信息工程研究所 | 一种实现素数域大整数模乘计算加速的方法 |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108459840A (zh) * | 2018-02-14 | 2018-08-28 | 中国科学院电子学研究所 | 一种simd结构浮点融合点乘运算单元 |
CN108958705A (zh) * | 2018-06-26 | 2018-12-07 | 天津飞腾信息技术有限公司 | 一种支持混合数据类型的浮点融合乘加器及其应用方法 |
US20190042193A1 (en) * | 2018-09-27 | 2019-02-07 | Intel Corporation | Floating-Point Dynamic Range Expansion |
CN109634558A (zh) * | 2018-12-12 | 2019-04-16 | 上海燧原科技有限公司 | 可编程的混合精度运算单元 |
CN109643227A (zh) * | 2016-08-22 | 2019-04-16 | 阿尔特拉公司 | 可变精度浮点乘法器 |
-
2020
- 2020-10-13 WO PCT/CN2020/120717 patent/WO2021073512A1/zh active Application Filing
- 2020-10-13 US US17/620,601 patent/US20230076931A1/en active Pending
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109643227A (zh) * | 2016-08-22 | 2019-04-16 | 阿尔特拉公司 | 可变精度浮点乘法器 |
CN108459840A (zh) * | 2018-02-14 | 2018-08-28 | 中国科学院电子学研究所 | 一种simd结构浮点融合点乘运算单元 |
CN108958705A (zh) * | 2018-06-26 | 2018-12-07 | 天津飞腾信息技术有限公司 | 一种支持混合数据类型的浮点融合乘加器及其应用方法 |
US20190042193A1 (en) * | 2018-09-27 | 2019-02-07 | Intel Corporation | Floating-Point Dynamic Range Expansion |
CN109634558A (zh) * | 2018-12-12 | 2019-04-16 | 上海燧原科技有限公司 | 可编程的混合精度运算单元 |
Also Published As
Publication number | Publication date |
---|---|
US20230076931A1 (en) | 2023-03-09 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2021078212A1 (zh) | 用于向量内积的计算装置、方法和集成电路芯片 | |
TWI763079B (zh) | 用於浮點運算的乘法器、方法、積體電路晶片和計算裝置 | |
WO2021078210A1 (zh) | 用于神经网络运算的计算装置、方法、集成电路和设备 | |
CN110515589B (zh) | 乘法器、数据处理方法、芯片及电子设备 | |
CN111008003B (zh) | 数据处理器、方法、芯片及电子设备 | |
TWI774093B (zh) | 用於轉換資料類型的轉換器、晶片、電子設備及其方法 | |
CN110515587B (zh) | 乘法器、数据处理方法、芯片及电子设备 | |
WO2021078209A1 (zh) | 用于转换数据类型的转换器、芯片、电子设备及其方法 | |
CN110515590B (zh) | 乘法器、数据处理方法、芯片及电子设备 | |
WO2021073512A1 (zh) | 用于浮点运算的乘法器、方法、集成电路芯片和计算装置 | |
WO2021185262A1 (zh) | 计算装置、方法、板卡和计算机可读存储介质 | |
CN111258541B (zh) | 乘法器、数据处理方法、芯片及电子设备 | |
CN111258633B (zh) | 乘法器、数据处理方法、芯片及电子设备 | |
WO2021073511A1 (zh) | 用于浮点运算的乘法器、方法、集成电路芯片和计算装置 | |
CN209895329U (zh) | 乘法器 | |
CN113033799B (zh) | 数据处理器、方法、装置及芯片 | |
CN113031911B (zh) | 乘法器、数据处理方法、装置及芯片 | |
CN110647307B (zh) | 数据处理器、方法、芯片及电子设备 | |
CN210109863U (zh) | 乘法器、装置、神经网络芯片及电子设备 | |
CN110515586B (zh) | 乘法器、数据处理方法、芯片及电子设备 | |
WO2023231363A1 (zh) | 乘累加操作数的方法及其设备 | |
CN111258542A (zh) | 乘法器、数据处理方法、芯片及电子设备 | |
JP7269382B2 (ja) | 計算装置、方法、プリント基板、およびコンピュータ読み取り可能な記録媒体 | |
CN111258546B (zh) | 乘法器、数据处理方法、芯片及电子设备 | |
CN110378478B (zh) | 乘法器、数据处理方法、芯片及电子设备 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 20876694 Country of ref document: EP Kind code of ref document: A1 |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 20876694 Country of ref document: EP Kind code of ref document: A1 |