WO2021073511A1 - Multiplier, method, integrated circuit chip, and computing device for floating-point operations - Google Patents
- Publication number
- WO2021073511A1 (PCT/CN2020/120716)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- bit width
- mantissa
- floating
- processing unit
- exponent
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F7/00—Methods or arrangements for processing data by operating upon the order or content of the data handled
- G06F7/38—Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
- G06F7/48—Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
- G06F7/52—Multiplying; Dividing
- G06F7/523—Multiplying only
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F7/00—Methods or arrangements for processing data by operating upon the order or content of the data handled
- G06F7/38—Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
- G06F7/48—Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
- G06F7/483—Computations with numbers represented by a non-linear combination of denominational numbers, e.g. rational numbers, logarithmic number system or floating-point numbers
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F7/00—Methods or arrangements for processing data by operating upon the order or content of the data handled
- G06F7/38—Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
- G06F7/48—Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
- G06F7/52—Multiplying; Dividing
- G06F7/523—Multiplying only
- G06F7/533—Reduction of the number of iteration steps or stages, e.g. using the Booth algorithm, log-sum, odd-even
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F7/00—Methods or arrangements for processing data by operating upon the order or content of the data handled
- G06F7/38—Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
- G06F7/48—Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
- G06F7/544—Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices for evaluating functions by calculation
- G06F7/5443—Sum of products
Definitions
- This disclosure generally relates to the field of floating-point operations. More specifically, the present disclosure relates to methods, multipliers, integrated circuit chips, and computing devices for floating-point operations.
- the solution of the present disclosure provides a multiplier and method for floating-point operations, an integrated circuit chip including the multiplier, and a computing device.
- the present disclosure provides a multiplier for performing multiplication operations of floating-point numbers, wherein the multiplier includes a mantissa processing unit configured to obtain the mantissa after the multiplication operation based on the mantissas of the floating-point numbers.
- the mantissa processing unit includes a control circuit configured to invoke the mantissa processing unit multiple times when the bit width of at least one of the two floating-point numbers is greater than the data bit width that the mantissa processing unit can process at one time.
- the present disclosure provides a method for performing a floating-point number multiplication operation using a multiplier, wherein the mantissa processing unit of the multiplier is used to obtain the mantissa after the multiplication operation according to the mantissa of the floating-point number,
- the mantissa processing unit includes a control circuit for invoking the mantissa processing unit multiple times when the bit width of at least one of the two floating-point numbers is greater than the data bit width that the mantissa processing unit can process at one time.
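- the repeated invocation described above can be illustrated with a minimal sketch. It assumes a hypothetical processing unit that multiplies `unit_bits`-wide unsigned chunks, and it splits each mantissa into such chunks and accumulates the shifted chunk products. The function name, the chunking scheme, and the 8-bit default are illustrative assumptions, not details taken from the disclosure.

```python
def multiply_wide(a, b, unit_bits=8):
    """Multiply two unsigned mantissas with a unit limited to unit_bits per operand.

    Hypothetical illustration: each (chunk_a * chunk_b) stands for one
    invocation of the narrow mantissa processing unit.
    """
    mask = (1 << unit_bits) - 1

    def chunks(x):
        # Split x into unit_bits-wide pieces, least significant first.
        out = []
        while True:
            out.append(x & mask)
            x >>= unit_bits
            if x == 0:
                return out

    product = 0
    calls = 0
    for i, chunk_a in enumerate(chunks(a)):
        for j, chunk_b in enumerate(chunks(b)):
            # One pass through the narrow unit, then align and accumulate.
            product += (chunk_a * chunk_b) << (unit_bits * (i + j))
            calls += 1
    return product, calls


result, passes = multiply_wide(0xABCDEF, 0x123456, unit_bits=8)
```

- in this sketch a 24-bit operand pair needs nine passes through an 8-bit unit, which shows the trade of extra cycles for a smaller circuit.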
- the present disclosure provides an integrated circuit chip including the multiplier described above.
- the multiplier of the present disclosure may constitute an independent integrated circuit chip or be arranged on an integrated circuit chip or a computing device to implement operations on floating-point numbers in a variety of different data formats.
- with the multiplier, corresponding operation method, integrated circuit chip, and computing device of the present disclosure, it is possible to support operations on multiple floating-point data formats without providing a separate multiplier for each format. Therefore, the multiplier of the present disclosure is flexible and can be widely used in various floating-point data operations. In addition, when processing input data with a larger bit width, the multiplier of the present disclosure supports cyclic multiplexing operation, so there is no need to arrange more processing chips, thereby also reducing the layout area of the integrated circuit.
- FIG. 1 is a schematic diagram showing a floating-point data format according to an embodiment of the present disclosure.
- FIG. 2 is a schematic structural block diagram showing a multiplier according to an embodiment of the present disclosure.
- FIG. 3 is a block diagram showing more details of the multiplier according to an embodiment of the present disclosure.
- FIG. 4 is a schematic block diagram showing a mantissa processing unit according to an embodiment of the present disclosure.
- FIG. 5 is a schematic diagram showing a partial product operation according to an embodiment of the present disclosure.
- FIG. 6 is a schematic block diagram showing the operation flow of the Wallace tree compressor according to an embodiment of the present disclosure.
- FIG. 7 is an overall schematic block diagram showing a multiplier according to an embodiment of the present disclosure.
- FIG. 8 is a flowchart illustrating a method for performing floating-point number multiplication using a multiplier according to an embodiment of the present disclosure.
- FIG. 9 is a structural diagram showing a combined processing device according to an embodiment of the present disclosure.
- FIG. 10 is a schematic diagram showing the structure of a board card according to an embodiment of the present disclosure.
- the technical solution of the present disclosure provides a multiplier, method, integrated circuit chip, and computing device for floating-point number operations as a whole.
- the present disclosure provides a multiplier that supports multiple operation modes, thereby overcoming the defect that the existing multiplier can only support one type of floating-point arithmetic.
- the present disclosure uses multiple operation modes to indicate different floating-point data types, and in the multiplication of floating-point numbers, the various operations on the data (including, for example, encoding, compression, summation, normalization, and rounding) are performed based on one of the operation modes, so as to implement operations associated with one of multiple floating-point data types. Therefore, the multiplier of the present disclosure can support operations in multiple modes, further improving the flexibility of floating-point operations and reducing the cost of operations.
- FIG. 1 is a schematic diagram showing a floating point data format 100 according to an embodiment of the present disclosure.
- the floating-point number to which the technical solution of the present disclosure can be applied can include three parts, such as sign (or sign bit) 102, exponent (or exponent bit) 104, and mantissa (or mantissa bit) 106.
- sign or sign bit
- exponent or exponent bit
- mantissa or mantissa bit
- the floating-point numbers suitable for the multiplier of the present disclosure may include at least one of half-precision floating-point numbers, single-precision floating-point numbers, brain floating-point numbers, double-precision floating-point numbers, and custom floating-point numbers.
- the floating-point number format to which the technical solution of the present disclosure can be applied may be a floating-point format that conforms to the IEEE 754 standard, such as a double-precision floating-point number (float64, abbreviated as "FP64"), a single-precision floating-point number (float32, abbreviated as "FP32"), or a half-precision floating-point number (float16, abbreviated as "FP16").
- FP64 double-precision floating-point number
- FP32 single-precision floating-point number
- FP16 half-precision floating-point number
- the floating-point number format can also be the existing 16-bit brain floating-point number (bfloat16, abbreviated as "BF16"), or a custom floating-point number format, such as an 8-bit brain floating-point number (bfloat8, abbreviated as "BF8"), an unsigned half-precision floating-point number (unsigned float16, abbreviated as "UFP16"), or an unsigned 16-bit brain floating-point number (unsigned bfloat16, abbreviated as "UBF16").
- bfloat8 8-bit brain floating-point number
- UFP16 unsigned half-precision floating point numbers
- UBF16 unsigned 16-bit brain floating point numbers
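- as a minimal sketch of the sign, exponent, and mantissa layout, the snippet below splits a raw bit pattern into its three fields. The exponent and mantissa widths for FP16, FP32, and FP64 follow IEEE 754, and those for BF16 follow the common bfloat16 layout. The custom formats (BF8, UFP16, UBF16) are left out because their field widths are not given here. The table and function names are illustrative assumptions.

```python
# (exponent bits, mantissa bits); every listed format also has one sign bit.
FORMATS = {
    "FP16": (5, 10),
    "BF16": (8, 7),
    "FP32": (8, 23),
    "FP64": (11, 52),
}


def split_fields(bits, fmt):
    """Return (sign, stored exponent, stored mantissa) of a raw bit pattern."""
    exp_bits, man_bits = FORMATS[fmt]
    mantissa = bits & ((1 << man_bits) - 1)
    exponent = (bits >> man_bits) & ((1 << exp_bits) - 1)
    sign = (bits >> (exp_bits + man_bits)) & 1
    return sign, exponent, mantissa


one_fp16 = split_fields(0x3C00, "FP16")           # 1.0 in FP16
one_bf16 = split_fields(0x3F80, "BF16")           # 1.0 in BF16
minus_pi_fp32 = split_fields(0xC0490FDB, "FP32")  # about -3.1415927 in FP32
```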
- the multiplier of the present disclosure can at least support the multiplication operation between two floating-point numbers in any of the above-mentioned formats, wherein the two floating-point numbers can have the same or different floating-point data formats.
- the multiplication operation between two floating-point numbers can be, for example, FP16*FP16, BF16*BF16, FP32*FP32, FP32*BF16, FP16*BF16, FP32*FP16, BF8*BF16, UBF16*UFP16, or UBF16*FP16.
- FIG. 2 is a schematic structural block diagram of a multiplier 200 according to an embodiment of the present disclosure.
- the multiplier of the present disclosure supports multiplication operations of floating-point numbers in various data formats, and these data formats can be indicated by the operation mode of the present disclosure, so that the multiplier works in one of a variety of operation modes.
- the multiplier of the present disclosure may generally include an exponent processing unit 202 and a mantissa processing unit 204, wherein the exponent processing unit is used to process the exponent bits of a floating-point number, and the mantissa processing unit is used to process the mantissa bits of a floating-point number.
- the multiplier may further include a sign processing unit 206, which may be used to process a floating point number including a sign bit.
- the multiplier can perform floating-point operations on the received, input, or buffered first floating-point number and second floating-point number according to one of the operation modes, the first floating-point number and the second floating-point number each having one of the floating-point data formats discussed above. For example, when the multiplier is in the first operation mode, it can support the multiplication FP16*FP16 of two floating-point numbers, and when the multiplier is in the second operation mode, it can support the multiplication BF16*BF16.
- when the multiplier is in the third operation mode, it can support the multiplication FP32*FP32 of two floating-point numbers, and when the multiplier is in the fourth operation mode, it can support the multiplication FP32*BF16.
- the corresponding relationship between the sample operation mode and the floating-point number is shown in Table 2 below.
- Table 2:
  Operation mode number | Floating-point types of the operation
  1 | FP16*FP16
  2 | BF16*BF16
  3 | FP32*FP32
  4 | FP32*BF16
- the above-mentioned Table 2 may be stored in a memory of the multiplier, and the multiplier selects one of the operation modes in the table according to an instruction received from an external device, which may be, for example, the external device 1012 shown in FIG. 10.
- the input of the operation mode can also be realized automatically via the mode selection unit 308 as shown in FIG. 3.
- the mode selection unit can select the multiplier to work in the first operation mode according to the data format of the two floating-point numbers.
- the mode selection unit may select the multiplier to work in the fourth operation mode according to the data format of the two floating point numbers.
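- a minimal sketch of such a mode selection follows. It uses only the four mode numbers and format pairs listed in Table 2. The dictionary and function names are hypothetical, and raising an error for an unlisted pair is an assumption rather than behavior described in the disclosure.

```python
# Operation modes from Table 2, keyed by (first format, second format).
MODE_TABLE = {
    ("FP16", "FP16"): 1,
    ("BF16", "BF16"): 2,
    ("FP32", "FP32"): 3,
    ("FP32", "BF16"): 4,
}


def select_mode(first_fmt, second_fmt):
    """Pick the operation mode from the data formats of the two operands."""
    try:
        return MODE_TABLE[(first_fmt, second_fmt)]
    except KeyError:
        # Assumption: unlisted format pairs are rejected in this sketch.
        raise ValueError(f"no operation mode for {first_fmt}*{second_fmt}") from None


mode_a = select_mode("FP16", "FP16")
mode_b = select_mode("FP32", "BF16")
```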
- the different operation modes of the present disclosure are associated with corresponding floating-point data. That is to say, the operation mode of the present disclosure can be used to indicate the data format of the first floating-point number and the data format of the second floating-point number. In another embodiment, the operation mode of the present disclosure can not only indicate the data format of the first floating-point number and the data format of the second floating-point number, but can also be used to indicate the data format after the multiplication operation.
- the operation mode extended in conjunction with Table 2 is shown in Table 3 below.
- the operation modes in Table 3 are extended by one bit to indicate the data format after floating-point multiplication.
- when the multiplier works in operation mode 21, it multiplies the two input BF16 floating-point numbers (BF16*BF16) and outputs the result of the floating-point multiplication in the FP16 data format.
- using operation modes in numeric form to indicate the floating-point data format is only exemplary and not restrictive. According to the teaching of the present disclosure, it is also conceivable to establish an index according to the operation mode to determine the format of the multiplier and the multiplicand.
- the operation mode includes two indexes, the first index is used to indicate the type of the first floating-point number, and the second index is used to indicate the type of the second floating-point number.
- the first index "1" in the operation mode 13 indicates that the first floating-point number (or multiplicand) is in the first floating-point format, that is, FP16, and the second index "3" indicates that the second floating-point number (or multiplier) is in the second floating-point format, that is, FP32.
- a third index may be added to the operation mode, which indicates the data format of the output result.
- the third index "1" in the operation mode 131 may indicate that the data format of the output result is the first floating-point format, that is, FP16.
- the instructions may include three domains or fields: the first field is used to indicate the data format of the first floating-point number, the second field is used to indicate the data format of the second floating-point number, and the third field is used to indicate the data format of the output result.
- FIG. 3 is a block diagram showing a more detailed structure of the multiplier 300 according to an embodiment of the present disclosure. As FIG. 3 shows, the multiplier 300 not only includes the exponent processing unit 202, the mantissa processing unit 204, and the optional sign processing unit 206 shown in FIG. 2, but also units related to the operation of these units. An exemplary operation of these units is described in detail below with reference to FIG. 3.
- the exponent processing unit may be used to obtain the multiplied exponent according to the aforementioned operation mode, the exponent of the first floating-point number and the exponent of the second floating-point number.
- the exponent processing unit may be implemented by an addition and subtraction circuit.
- the exponent processing unit here can be used to add the exponent of the first floating-point number, the exponent of the second floating-point number, and the respective offset values of the corresponding input floating-point data formats, and then subtract the offset value of the output floating-point data format, to obtain the exponent after the multiplication of the first floating-point number and the second floating-point number.
- the mantissa processing unit of the multiplier can be used to obtain the mantissa after the multiplication operation according to the foregoing operation mode, the first floating-point number, and the second floating-point number.
- the mantissa processing unit may include a partial product operation unit 312 and a partial product summation unit 314, wherein the partial product operation unit is configured to obtain an intermediate result according to the mantissa of the first floating point number and the mantissa of the second floating point number.
- the intermediate result may be multiple partial products obtained during the multiplication operation of the first floating-point number and the second floating-point number (as shown schematically in FIG. 5 and FIG. 6).
- the partial product summation unit is configured to perform an addition operation on the intermediate result to obtain an addition result, and use the addition result as the mantissa after the multiplication operation.
- the present disclosure uses a Booth encoding circuit to pad the high and low bits of the mantissa of the second floating-point number (serving, for example, as the multiplier in the floating-point operation) with 0 (the 0 added at the high-order end converts the mantissa from an unsigned number to a signed number) in order to obtain the intermediate result.
- the mantissa of the first floating-point number (for example, the multiplicand in the floating-point operation) can also be encoded, for example by padding its high and low bits with 0, or both mantissas can be encoded.
- the partial product summation unit may include an adder, which is used to add the intermediate result to obtain the sum result.
- the partial product summation unit includes a Wallace tree and an adder, wherein the Wallace tree is used to add the intermediate results to obtain a second intermediate result, and the adder is used to add the second intermediate result to obtain the addition result.
- the adder may include at least one of a full adder, a serial adder, and a carry-lookahead adder.
- the multiplier of the present disclosure further includes a regularization unit 318 and a rounding unit 320.
- the regularization unit can be used to perform floating-point regularization processing on the mantissa and exponent after the multiplication operation to obtain a regularized exponent result and a regularized mantissa result, and use the regularized exponent result and the regularized mantissa result as the exponent after the multiplication operation and the mantissa after the multiplication operation.
- the regularization unit can adjust the bit width of the exponent and the mantissa to meet the requirements of the data format indicated above.
- the regularization unit can also make other adjustments to the exponent or mantissa.
- the regularization unit may also adjust the exponent after the multiplication operation according to the mantissa after the multiplication operation. For example, when the highest bit of the mantissa after the multiplication operation is 1, the exponent obtained after the multiplication operation can be increased by 1.
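- the exponent adjustment in that example can be sketched as follows. The sketch assumes both input significands are `sig_bits` wide with the leading 1 included, so their integer product is at most `2 * sig_bits` bits wide and represents a value in [1, 4). The function name and the choice to left-align the result are illustrative assumptions.

```python
def normalize_product(product, exponent, sig_bits):
    """Align the significand product so that its leading 1 sits in the top bit.

    product: integer product of two sig_bits-wide significands (leading 1 included).
    Returns (aligned product, adjusted exponent); the aligned product represents
    product_aligned / 2**(2 * sig_bits - 1), a value in [1, 2).
    """
    top = 2 * sig_bits - 1  # index of the highest possible product bit
    if (product >> top) & 1:
        # Highest bit is 1: the value is in [2, 4), so the exponent grows by 1.
        return product, exponent + 1
    # Highest bit is 0: the value is already in [1, 2); shift the leading 1 up.
    return product << 1, exponent


# 1.5 * 1.5 with 4-bit significands: 0b1100 * 0b1100 = 144 = 0b10010000
big = normalize_product(0b1100 * 0b1100, 0, sig_bits=4)
# 1.0 * 1.0 with 4-bit significands: 0b1000 * 0b1000 = 64 = 0b01000000
small = normalize_product(0b1000 * 0b1000, 0, sig_bits=4)
```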
- the rounding unit may be used to perform a rounding operation on the regularized mantissa result according to a rounding mode, and use the mantissa after the rounding operation is performed as the mantissa after the multiplication operation.
- the rounding unit may perform rounding operations including rounding down, rounding up, and rounding to the nearest significant number, for example.
- the rounding unit can also round the 1 that is shifted out in the process of shifting the mantissa to the right.
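- a minimal sketch of these rounding modes, applied to the bits shifted out of a non-negative mantissa, is given below. The mode names and the ties-to-even rule for "nearest" are assumptions, since the tie-breaking rule is not stated here.

```python
def round_shift(mantissa, shift, mode="nearest"):
    """Drop the low `shift` bits of a non-negative mantissa with a rounding mode."""
    if shift <= 0:
        return mantissa << -shift
    kept = mantissa >> shift
    dropped = mantissa & ((1 << shift) - 1)  # the bits shifted out
    if mode == "down":
        return kept
    if mode == "up":
        return kept + (1 if dropped else 0)
    if mode == "nearest":
        half = 1 << (shift - 1)
        # Assumption: exact ties go to the even neighbor.
        if dropped > half or (dropped == half and kept & 1):
            return kept + 1
        return kept
    raise ValueError(f"unknown rounding mode: {mode}")


down = round_shift(0b10110, 2, "down")
up = round_shift(0b10110, 2, "up")
tie_odd = round_shift(0b10110, 2, "nearest")
tie_even = round_shift(0b10010, 2, "nearest")
```

- when rounding up carries out of the kept bits, the result gains a bit, so a real datapath would normalize again after rounding.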
- the multiplier of the present disclosure may also optionally include a sign processing unit.
- the sign processing unit can be used to obtain the sign after the multiplication operation according to the sign of the first floating-point number and the sign of the second floating-point number.
- the sign processing unit may include an exclusive OR logic circuit 322 for performing an exclusive OR operation on the sign of the first floating-point number and the sign of the second floating-point number to obtain the sign after the multiplication operation.
- the sign processing unit can also be implemented by a truth table or logical judgment.
- the multiplier of the present disclosure may further include a normalization processing unit 324 which, when the first floating-point number or the second floating-point number is a non-normalized non-zero floating-point number, normalizes the first floating-point number or the second floating-point number according to the operation mode to obtain the corresponding exponent and mantissa.
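- both variants of the sign logic can be sketched in a few lines, with the sign bit taken as 0 for positive and 1 for negative. The function and table names are illustrative.

```python
def product_sign_xor(sign_a, sign_b):
    """Sign of the product via exclusive OR of the operand sign bits."""
    return sign_a ^ sign_b


# The same function expressed as a truth table lookup.
SIGN_TRUTH_TABLE = {
    (0, 0): 0,
    (0, 1): 1,
    (1, 0): 1,
    (1, 1): 0,
}

signs_agree = all(
    product_sign_xor(a, b) == SIGN_TRUTH_TABLE[(a, b)]
    for a in (0, 1)
    for b in (0, 1)
)
```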
- the normalization processing unit can be used to normalize the FP16 type data to BF16 type data, so that the multiplier can operate in the second operation mode.
- the normalization processing unit may also be used to preprocess the mantissa of a normalized floating-point number with an implicit 1 and the mantissa of a non-normalized floating-point number without the implicit 1 (for example, by extending the mantissa) to facilitate the subsequent operation of the mantissa processing unit.
- the normalization processing unit 324 and the aforementioned regularization unit 318 can also perform the same or similar operations in some embodiments.
- the difference is that the normalization processing unit 324 performs normalization processing on the input floating-point data, whereas the regularization unit 318 performs regularization processing on the mantissa and exponent to be output.
- the multiplier of the present disclosure and its various embodiments have been described above with reference to FIG. 3. Based on the above description, those skilled in the art can understand that the solution of the present disclosure obtains the result of the multiplication operation (including the exponent, the mantissa, and the optional sign) through the execution of the multiplier. Depending on the application scenario, for example, when the aforementioned regularization processing and rounding processing are not required, the result obtained by the mantissa processing unit and the exponent processing unit can be regarded as the final operation result. When the aforementioned regularization processing and rounding processing are required, the exponent and mantissa obtained after the regularization processing and rounding processing can be regarded as the final calculation result, or as a part of the final calculation result (when the final sign is considered).
- the solution of the present disclosure uses multiple operation modes to enable the multiplier to support the operation of floating-point numbers of different types or data formats, so that the multiplexing of the multiplier can be realized, thereby saving the overhead of chip design and saving the calculation cost.
- the multiplier of the present disclosure also supports the calculation of high-bit-width floating-point numbers.
- the mantissa also called the mantissa bit or the mantissa part
- the mantissa operation of the present disclosure will be described below in conjunction with FIG. 4.
- FIG. 4 is a schematic block diagram showing an operation 400 of a mantissa processing unit according to an embodiment of the present disclosure.
- the mantissa processing operation of the present disclosure may mainly involve two units, namely, the partial product operation unit and the partial product summation unit discussed in combination with FIG. 3.
- the mantissa processing operation can be roughly divided into a first stage and a second stage. In the first stage, the mantissa processing operation obtains the intermediate results, and in the second stage, it obtains the mantissa result output by the adder 408.
- the first floating-point number and the second floating-point number received by the multiplier may be divided into multiple parts, namely the aforementioned sign (optional), exponent, and mantissa.
- the mantissa part of the two floating-point numbers will enter the mantissa processing unit as input (such as the mantissa processing unit in FIG. 2 or FIG. 3), and specifically enter the partial product operation unit.
- the present disclosure uses the Booth encoding circuit 402 to pad the high and low bits of the mantissa of the second floating-point number (that is, the multiplier in the floating-point operation) with 0 and performs Booth encoding, so that the intermediate result is obtained in the partial product generation circuit 404.
- the first floating-point number and the second floating-point number here are only for illustrative and not restrictive purposes. Therefore, in some application scenarios, the first floating-point number can be the multiplier and the second floating-point number can be the multiplicand.
- encoding operations can also be performed on floating-point numbers that serve as multiplicands.
- Booth coding is briefly introduced below.
- a large number of intermediate results called partial products will be produced through the multiplication operation, and then these partial products will be accumulated to obtain the final result of the multiplication of the two binary numbers.
- the greater the number of partial products, the greater the area and power consumption of the array multiplier, the slower the execution speed, and the more difficult it is to implement the circuit.
- the purpose of Booth coding is to effectively reduce the number of summations of partial products, thereby reducing the circuit area.
- the algorithm is to first encode the input multiplier according to the corresponding rules.
- the encoding rules may be, for example, the rules shown in Table 4 below:
- y2i+1, y2i, and y2i-1 in Table 4 can represent the values corresponding to each group of sub-data to be encoded (ie, multipliers), and X can represent the mantissa in the first floating-point number (ie, multiplicand).
- the coded signal obtained after Booth coding can include five types, which are -2X, 2X, -X, X, and 0, respectively.
- when the received multiplicand is the 8-bit data "X7X6X5X4X3X2X1X0", the following partial products can be obtained:
- when the multiplier digits include the consecutive three-bit data "001" in the above table, the partial product is X, which can be expressed as "X7X6X5X4X3X2X1X0", the 9th
- the multiplier digits include the continuous three-digit data "011" in the above table
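- the encoding described above can be illustrated with a minimal sketch of standard radix-4 Booth recoding, in which each overlapping triple (y2i+1, y2i, y2i-1) selects one of -2X, -X, 0, X, or 2X. The digit rule used here, -2*y2i+1 + y2i + y2i-1, is the textbook rule and is assumed to match Table 4, which is not reproduced in this text. The function names are illustrative.

```python
def booth_radix4_digits(multiplier, bits):
    """Recode an unsigned `bits`-wide multiplier into radix-4 Booth digits.

    Digits are returned least significant first, each in {-2, -1, 0, 1, 2}.
    A 0 is appended below the lowest bit, and zeros above the highest bit
    make the unsigned value behave as a non-negative signed number.
    """
    padded = multiplier << 1   # low-order padding: y(-1) = 0
    groups = bits // 2 + 1     # enough triples to reach a padded high 0
    digits = []
    for i in range(groups):
        triple = (padded >> (2 * i)) & 0b111  # y(2i+1), y(2i), y(2i-1)
        y_hi, y_mid, y_lo = (triple >> 2) & 1, (triple >> 1) & 1, triple & 1
        digits.append(-2 * y_hi + y_mid + y_lo)
    return digits


def booth_multiply(x, y, bits):
    """Multiply x by the unsigned `bits`-wide y by summing Booth partial products."""
    digits = booth_radix4_digits(y, bits)
    partial_products = [(d * x) << (2 * i) for i, d in enumerate(digits)]
    return sum(partial_products), partial_products


digits_of_6 = booth_radix4_digits(0b0110, 4)
product, rows = booth_multiply(0b10110101, 0b01101110, 8)
```

- for an 8-bit multiplier this sketch yields five partial products instead of the eight rows of a plain shift-and-add array.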
- the adder may be, for example, one or more full adders, half adders, or various combinations of the two.
- a Wallace tree compressor (or Wallace tree for short) is mainly used to sum the above-mentioned intermediate results (i.e., multiple partial products) so as to reduce the number of accumulations of partial products (i.e., compression).
- a Wallace tree compressor can adopt a carry-save adder (CSA) architecture and a Wallace tree algorithm.
- the calculation speed of the Wallace tree array is much faster than the traditional carry-save addition.
- the Wallace tree compressor can calculate the sum of the partial products of each row in parallel. For example, it can reduce the number of accumulations of N partial products from N-1 to log2(N), which is of great significance for increasing the speed of the multiplier and improving resource utilization.
- the Wallace tree compressor can be designed into many types, such as 7-2 Wallace tree, 4-2 Wallace tree and 3-2 Wallace tree.
- the present disclosure uses a 7-2 Wallace tree as an example of implementing various floating-point operations of the present disclosure, which will be described in detail later in conjunction with FIG. 5 and FIG. 6.
- the Wallace tree compression operation disclosed in the present disclosure may be arranged to have M inputs and N outputs, and the number of Wallace trees may be not less than K, where N is a preset positive integer less than M, and K is a positive integer not less than the maximum bit width of the intermediate result.
- M can be 7, and N can be 2, which is a 7-2 Wallace tree which will be described in detail below.
- K can be, for example, 48, which means that the number of Wallace trees can be 48.
- one or more groups of the Wallace trees can be selected to add the intermediate results, wherein each group has X Wallace trees, and X is the number of bits of the intermediate results. Further, the Wallace trees within each group may have a sequential carry relationship, but there is no carry relationship between groups.
- the Wallace tree compressors can be connected by carries. For example, the carry output of a lower-order Wallace tree compressor is passed to the higher-order Wallace tree compressor as its carry input (Cin in Figure 6), and the carry output (Cout) of that higher-order compressor in turn becomes the carry input of the next higher-order Wallace tree compressor.
- when one or more Wallace tree compressors are selected from multiple Wallace tree compressors, the selection can be arbitrary. For example, they can be connected in the order 0, 1, 2, and 3, or in the order 0, 2, 4, and 6, as long as the selected Wallace tree compressors are connected according to the above-mentioned carry relationship.
- the 0th to 23rd Wallace trees (that is, the 24 Wallace trees in the first group of Wallace trees) can complete the partial product addition operation of the first group of multiplications, and the Wallace trees in the group can be connected by carry in turn.
- the 24th to 47th Wallace trees (that is, the 24 Wallace trees in the second group of Wallace trees) can complete the partial product addition operation of the second group of multiplications, where the Wallace trees in the group are connected by carry in turn.
- there is no carry relationship between the 23rd Wallace tree in the first group and the 24th Wallace tree in the second group, that is, there is no carry relationship between Wallace trees in different groups.
- the compressed partial products are summed by the adder to obtain the result of the mantissa multiplication operation.
- the adder in one or more embodiments of the present disclosure may include one of a full adder, a serial adder, and a carry-lookahead adder, and is used to sum the last two rows of partial products obtained by the Wallace tree compressor to obtain the result of the mantissa multiplication operation.
- the mantissa multiplication operation shown in FIG. 4 can effectively obtain the result of the mantissa multiplication operation.
- Booth coding can effectively reduce the number of partial product summations, thereby reducing the circuit area
- the Wallace compression tree can calculate the sum of partial products of each row in parallel, thereby increasing the speed of the multiplier.
- Figure 5 shows the partial products 500 obtained from the partial product generating circuit in the mantissa processing unit described in conjunction with Figures 2 to 4, shown in the figure as the four rows of white dots between the two dashed lines, where the white dots in each row represent one partial product.
- the number of bits can be expanded in advance.
- the black dots in Figure 5 are copies of the highest bit of each 9-bit partial product. It can be seen that each partial product is extended and aligned to 16 (8+8) bits (that is, the 8bit width of the multiplicand mantissa plus the 8bit width of the multiplier mantissa).
- the partial product is expanded to 38 (25+13) bits (that is, the 25bit width of the multiplicand mantissa plus the 13bit width of the multiplier mantissa).
- FIG. 6 is a schematic block diagram 600 showing the operation flow of the Wallace tree compressor according to an embodiment of the present disclosure.
- the 7 partial products shown in Figure 6 can be obtained by performing Booth coding on the multiplier and applying the coded signals to the multiplicand. Because the Booth coding algorithm is used, the number of partial products generated is reduced.
- a dashed frame is used in the partial product part to identify a Wallace tree that includes 7 elements, and the process of compressing it from 7 elements to 2 elements is further shown with arrows.
- the compression process (also called the addition process) can be implemented by means of a full adder, that is, three elements are input and two elements are output (i.e., a sum "sum" and a carry "carry" to the higher order).
- the schematic block diagram of the 7-2 Wallace tree compressor is shown on the right side of Figure 6. It can be understood that the Wallace tree compressor includes 7 inputs from a column of partial products (the seven elements indicated in the dashed box on the left side of Figure 6). In operation, the carry input of the Wallace tree in the 0th column is 0, and the carry output Cout of each Wallace tree is used as the carry input Cin of the next Wallace tree.
- the Wallace tree including 7 elements can be compressed to include 2 elements.
- this disclosure uses a 7-2 Wallace tree compressor to compress the 7 rows of partial products into two rows of partial products (i.e., the second intermediate result of this disclosure), and uses an adder (for example, a carry-lookahead adder) to obtain the mantissa result.
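The idea of compressing several rows of partial products into two rows before a single final addition can be sketched in software as follows. This is only an illustration under stated assumptions: it applies 3-input, 2-output carry-save steps (the full-adder behaviour described above) level by level to whole rows held as non-negative integers, and it does not reproduce the exact 7-2 cell or the Cin/Cout wiring of Figure 6. The function names are hypothetical.

```python
def compress_3_to_2(a: int, b: int, c: int) -> tuple[int, int]:
    """Carry-save step: three rows in, a sum row and a carry row out."""
    sum_row = a ^ b ^ c                            # per-column sum bit
    carry_row = ((a & b) | (a & c) | (b & c)) << 1  # per-column carry, moved up one column
    return sum_row, carry_row


def reduce_to_two_rows(rows: list[int]) -> tuple[int, int]:
    """Repeatedly compress groups of three rows until only two rows remain."""
    rows = list(rows)
    while len(rows) < 2:
        rows.append(0)
    while len(rows) > 2:
        next_rows = []
        for i in range(0, len(rows) - 2, 3):
            next_rows.extend(compress_3_to_2(rows[i], rows[i + 1], rows[i + 2]))
        # Rows that did not fill a complete group of three pass through unchanged.
        next_rows.extend(rows[len(rows) - len(rows) % 3:])
        rows = next_rows
    return rows[0], rows[1]


def sum_partial_products(rows: list[int]) -> int:
    """Compress to two rows, then perform the single carry-propagating addition."""
    sum_row, carry_row = reduce_to_two_rows(rows)
    return sum_row + carry_row
```

Only the last line of `sum_partial_products` propagates carries across the full width, which is the role the text assigns to the final adder.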
- the mantissa bits of the floating-point number are 10 bits.
- the mantissa bits can be extended by 1 bit, so that the mantissa bits are 11 bits.
- since the mantissa bits form an unsigned number, when the Booth coding algorithm is used a 0 bit can be added at the high end, so the total mantissa width is 12 bits.
- the partial product generation circuit can obtain 7 partial products in the high and low parts respectively, of which the seventh partial product is 0.
- the bit width of each partial product is 24bit.
- 48 7-2 Wallace trees can be used for compression processing, and the carry of the 23rd to 24th Wallace trees is 0.
- the mantissa of the floating-point number is 7 bits.
- the mantissa can be expanded to 9 bits.
- the partial product generating circuit can obtain 7 effective partial products in the high and low parts respectively, of which the 6th and 7th partial products are 0 and the bit width of each partial product is 18bit; the two groups of 7-2 Wallace trees numbered 0 to 17 and 24 to 41 are used for compression processing, and the carry from the 23rd to the 24th Wallace tree is 0.
- the mantissa bits of the floating-point number can be 23 bits.
- the mantissa can be expanded to 25 bits.
- the bit width supported by the multiplier can be designed to be smaller, and the multiplier of the present disclosure can be called twice in this operation mode to complete an operation.
- the multiplication of the mantissa bits in each call is 25bit*13bit, that is, the mantissa of the first floating-point number ina is extended by one 0 bit to become a 25-bit signed number, and the 24bit mantissa of the second floating-point number inb is divided into a high part and a low part of 12 bits each, which are each extended by one 0 bit to obtain two 13-bit multipliers, denoted inb_high13 and inb_low13.
- the multiplier of the present disclosure is called for the first time to calculate ina*inb_low13, and the multiplier is called for the second time to calculate ina*inb_high13.
- 7 effective partial products are generated by Booth coding, and the bit width of each partial product is 38 bits, compressed by the 0th to 37th 7-2 Wallace trees.
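The two-call scheme described above can be checked numerically with a short sketch. It assumes that the two partial results are recombined as (ina*inb_high13 << 12) + ina*inb_low13, which follows from inb = inb_high * 2^12 + inb_low; the helper `multiply_25x13` is a hypothetical stand-in for one call of the hardware multiplier.

```python
MASK12 = (1 << 12) - 1


def multiply_25x13(a25: int, b13: int) -> int:
    """Stand-in for one call of a 25bit x 13bit signed mantissa multiplier.

    Both operands carry a 0 sign bit, so only 24 and 12 value bits are used.
    """
    assert 0 <= a25 < (1 << 24) and 0 <= b13 < (1 << 12)
    return a25 * b13


def fp32_mantissa_product(ina: int, inb: int) -> int:
    """Multiply two 24-bit mantissas using two calls of the narrower multiplier."""
    inb_low13 = inb & MASK12           # low 12 bits, sign bit 0
    inb_high13 = (inb >> 12) & MASK12  # high 12 bits, sign bit 0
    low = multiply_25x13(ina, inb_low13)    # first call: ina*inb_low13
    high = multiply_25x13(ina, inb_high13)  # second call: ina*inb_high13
    return (high << 12) + low
```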
- the mantissa of the first floating-point number ina is 23 bits
- the mantissa of the second floating-point number inb is 7 bits.
- the mantissas can be expanded to 25bit and 9bit respectively, and a 25bit×9bit multiplication is performed to obtain 7 effective partial products, among which the 6th and 7th partial products are 0 and the bit width of each partial product is 34bit; these are compressed by the 0th to 33rd Wallace trees.
- multiplier mantissa processing unit and exponent processing unit
- the mantissa processing unit may include a control circuit 316, and the control circuit 316 may be used to call the mantissa processing unit multiple times when the mantissa bit width of at least one of the two floating-point numbers is greater than the data bit width that the mantissa processing unit can process at one time.
- the data bit width that can be processed by the mantissa processing unit at one time refers to two bit widths (for example, the multiplier bit width and the multiplicand bit width) supported by the mantissa processing unit.
- the control circuit is configured to determine, according to the mantissa bit width of one of the two floating-point numbers and one of the two bit widths supported by the mantissa processing unit, or according to the mantissa bit widths of the two floating-point numbers and the two bit widths supported by the mantissa processing unit, that the mantissa processing unit is to be called multiple times to obtain the mantissa after the multiplication operation. This repeated invocation of the mantissa processing unit avoids both arranging large-area multiplier components to handle large-bit-width mantissa operations and arranging small-area multiplier components that cannot handle large-bit-width mantissa operations. It therefore has stronger applicability and helps reduce the chip area.
- the two floating-point numbers include a first floating-point number and a second floating-point number
- the mantissa processing unit supports a first bit width and a second bit width
- the mantissa of the first floating-point number is used as the first input corresponding to the first bit width, the mantissa of the second floating-point number is used as the second input corresponding to the second bit width, and the bit width of the first input is less than or equal to the first bit width; the control circuit is used to call the mantissa processing unit multiple times to obtain the mantissa after the multiplication operation when the bit width of the second input is greater than the second bit width.
- the bit width of one of the two inputs is always less than or equal to the corresponding bit width supported by the mantissa processing unit. Therefore, only the size relationship between the other input and its corresponding bit width supported by the mantissa processing unit needs to be examined to determine whether to call the mantissa processing unit multiple times.
- the two floating-point numbers include a first floating-point number and a second floating-point number
- the mantissa processing unit supports a first bit width and a second bit width
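- the mantissa of the first floating-point number is used as the first input corresponding to the first bit width, the mantissa of the second floating-point number is used as the second input corresponding to the second bit width, and the control circuit is used to call the mantissa processing unit multiple times when the bit width of the first input is greater than the first bit width and/or the bit width of the second input is greater than the second bit width.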
- the relationship between the bit widths of the two inputs and the two bit widths supported by the mantissa processing unit is not known in advance, so each input must be compared with its corresponding bit width supported by the mantissa processing unit to determine whether to call the mantissa processing unit multiple times.
- the control circuit selects the mantissa of the first floating-point number as the second input corresponding to the second bit width and selects the mantissa of the second floating-point number as the first input corresponding to the first bit width.
- the mantissas of the two input floating-point numbers can first be matched to the two bit widths supported by the mantissa processing unit according to a strategy of large bit width to large bit width and small bit width to small bit width, so as to avoid multiple calls in cases where the mantissa operation of the two floating-point numbers could be completed in a single call.
- when the bit width of the first input is greater than the first bit width and the bit width of the second input is less than or equal to the second bit width, the control circuit determines, according to the bit width of the first input and the first bit width, the number of times the mantissa processing unit is called and the data input to the mantissa processing unit in each call.
- the control circuit determines, according to the bit width of the second input and the second bit width, the number of times the mantissa processing unit is called and the data input to the mantissa processing unit in each call.
- the control circuit determines, according to the bit width of the first input and the first bit width together with the bit width of the second input and the second bit width, the number of times the mantissa processing unit is called and the data input to the mantissa processing unit in each call.
- the description of the first floating-point number and the second floating-point number is only for distinguishing the two floating-point numbers, where "first" and "second" do not have a limiting effect.
- the description of the first bit width and the second bit width is only for distinguishing the two maximum processing bit widths supported by the mantissa processing unit, and the description of the first input and the second input is only for distinguishing the two inputs of the mantissa processing unit corresponding to the two maximum processing bit widths, so neither "first" nor "second" has a limiting effect.
- the floating-point number input to the multiplier described in the above embodiments is a floating-point number that meets the format required by the operation and by the components inside and outside the multiplier, that is, a floating-point number that has undergone preprocessing such as normalization. It should be understood that the floating-point number input to the multiplier can be a normalized or non-normalized floating-point number. Combining the description of the normalization unit above, if at least one of the two input floating-point numbers is a non-normalized floating-point number, the at least one floating-point number may first be normalized by the normalization unit to obtain the normalized exponent and mantissa, and the normalized mantissa is then used as the input of the mantissa processing unit to perform the above-mentioned floating-point multiplication operation.
- the Booth coding circuit mentioned in the present disclosure performs signed fixed-point multiplication, so the mantissa needs to be extended by one 0 bit so that it becomes a signed positive number, and the extended signed mantissa is then used as the input of the mantissa processing unit to perform the floating-point multiplication described above.
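As a concrete illustration of the mantissa widths discussed here, the following sketch extracts the fields of an FP32 value and forms the extended mantissa. It assumes that the one-bit extension attributed to normalization corresponds to the IEEE 754 hidden bit (1 for normal numbers, 0 for subnormals), and it relies on the fact that prepending a 0 sign bit does not change the numeric value. The function names are hypothetical.

```python
import struct


def fp32_fields(value: float) -> tuple[int, int, int]:
    """Return (sign, biased exponent, 23-bit fraction) of value encoded as FP32."""
    (bits,) = struct.unpack(">I", struct.pack(">f", value))
    return bits >> 31, (bits >> 23) & 0xFF, bits & 0x7FFFFF


def extended_mantissa(value: float) -> int:
    """Return the 24-bit mantissa with the hidden bit made explicit.

    Read as a 25-bit signed number, the result has bit 24 (the sign bit) equal
    to 0, so it can be fed to a signed multiplier as a positive operand.
    """
    _sign, exponent, fraction = fp32_fields(value)
    hidden = 1 if exponent != 0 else 0  # assumption: IEEE 754 hidden-bit rule
    return (hidden << 23) | fraction
```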
- the first and second embodiments of the present disclosure are also applicable to the arithmetic of floating-point numbers according to the arithmetic modes as described above.
- the above-mentioned first input may be a multiplier
- the second input may be a multiplicand
- the first bit width may be, for example, the maximum multiplier bit width supported by the mantissa processing unit.
- the second bit width may be, for example, the maximum multiplicand bit width supported by the mantissa processing unit.
- the two floating-point numbers input to the multiplier of the present disclosure are denormalized non-zero floating-point numbers.
- the two floating-point numbers are first normalized, so the mantissa of each is extended by 1 bit; in addition, because the Booth coding circuit used in the present disclosure performs signed fixed-point multiplication, each mantissa is then extended by a further 1 bit to form a signed number.
- the mantissas of the two floating-point numbers are matched to the inputs of the mantissa processing unit. Therefore, when the bit width of the multiplier is greater than the maximum multiplier bit width and the bit width of the multiplicand is less than or equal to the maximum multiplicand bit width, the control circuit takes only the mantissa formed by normalizing the original mantissa corresponding to the multiplier as the mantissa to be truncated, and, to suit the Booth coding circuit in the embodiments of the present disclosure, the sign bit is extended for each truncated part.
- in each call, a part with a bit width of A-1 is truncated from the mantissa to be truncated, where A represents the maximum bit width of the multiplier supported by the mantissa processing unit.
- each truncated part with a bit width of A-1 is supplemented with a 0 at the high bit as a sign bit to form a multiplier part with a bit width of A.
- the multiplier part is used as an input to the mantissa processing unit in each call.
- the multiplicand (in this embodiment, the multiplicand is a normalized mantissa with an extended sign bit) is input to the mantissa processing unit as the other input in each call. Therefore, the following formula can be used to determine the number of calls of the mantissa processing unit:
- n = ceil((B+1)/(A-1))
- n represents the number of times the mantissa processing unit is called
- B represents the bit width of the unnormalized mantissa without extending the sign bit
- B+1 represents the bit width after the mantissa is normalized
- B+1 can also be understood as B+2-1, that is, the bit width of the multiplier minus the bit width of the sign bit
- A represents the bit width of the multiplier part (the maximum bit width of the multiplier supported by the mantissa processing unit)
- A-1 represents the bit width of the part truncated from the mantissa to be truncated in each call.
- the maximum multiplier bit width supported by the mantissa processing unit is, for example, 8bit
- the maximum multiplicand bit width is, for example, 32bit.
- the two floating-point numbers input to the multiplier are FP32 and BF16 floating-point numbers, so the multiplication is performed in the FP32*BF16 operation mode, and the two floating-point numbers are non-normalized non-zero numbers. Therefore, the mantissas of the two floating-point numbers have bit widths of 23bit and 7bit respectively; considering the IEEE754 standard, the two mantissas can be expanded to 24bit and 8bit.
- the two mantissas are each extended by one 0 bit to become 25-bit and 9-bit signed numbers. Therefore, the control circuit takes the mantissa with a bit width of 9 bits as the multiplier corresponding to the maximum multiplier bit width and the mantissa with a bit width of 25 bits as the multiplicand corresponding to the maximum multiplicand bit width. Because only the bit width of the multiplier (9bit) is greater than the maximum multiplier bit width (8bit), while the bit width of the multiplicand (25bit) is less than the maximum multiplicand bit width (32bit), only the mantissa formed by normalizing the original mantissa corresponding to the multiplier is used as the mantissa to be truncated inb, and the multiplicand is used as the multiplicand ina input to the mantissa processing unit.
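The truncation of inb in this FP32*BF16 example can be sketched as below. The sketch assumes truncation from high order to low order, with a short final part padded by leading 0 bits; as integers, the padding and the added 0 sign bit leave the value of each part unchanged. The function name is hypothetical.

```python
def split_high_to_low(mantissa: int, total_bits: int, chunk_bits: int) -> list[int]:
    """Cut a `total_bits`-wide mantissa into parts of `chunk_bits`, high order first.

    Each part is smaller than 2**chunk_bits, so placing it in a field of
    chunk_bits + 1 bits leaves the top (sign) bit at 0.
    """
    parts = []
    remaining = total_bits
    while remaining > 0:
        take = min(chunk_bits, remaining)  # the last part may be shorter
        remaining -= take
        parts.append((mantissa >> remaining) & ((1 << take) - 1))
    return parts


# Example widths from the text: the normalized BF16 mantissa has B+1 = 8 bits
# and the maximum multiplier bit width is A = 8, so each part carries A-1 = 7 bits.
example_parts = split_high_to_low(0b10110101, total_bits=8, chunk_bits=7)
```

With these widths the 8-bit mantissa yields two parts, which matches n = ceil((7+1)/(8-1)) = 2 calls.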
- the two floating-point numbers input to the multiplier of the present disclosure are denormalized non-zero floating-point numbers.
- the two floating-point numbers are first normalized, so the mantissa of each is extended by 1 bit; in addition, because the Booth coding circuit used in the present disclosure performs signed fixed-point multiplication, each mantissa is then extended by a further 1 bit to form a signed number.
- the mantissas of the two floating-point numbers are matched to the inputs of the mantissa processing unit. Therefore, when the bit width of the multiplicand is greater than the maximum multiplicand bit width and the bit width of the multiplier is less than or equal to the maximum multiplier bit width, the control circuit takes only the mantissa formed by normalizing the original mantissa corresponding to the multiplicand as the mantissa to be truncated, and, to suit the Booth coding circuit in the embodiments of the present disclosure, the sign bit is extended for each truncated part.
- in each call, a part with a bit width of C-1 is truncated from the mantissa to be truncated, where C represents the maximum bit width of the multiplicand supported by the mantissa processing unit.
- each truncated part with a bit width of C-1 is supplemented with a 0 at the high bit as a sign bit to form a multiplicand part with a bit width of C, and the multiplicand part is used as one input of the mantissa processing unit in each call.
- the multiplier (in this embodiment, the multiplier is a normalized mantissa with an extended sign bit) is input to the mantissa processing unit as the other input in each call. Therefore, the following formula can be used to determine the number of calls of the mantissa processing unit:
- n = ceil((D+1)/(C-1))
- n represents the number of times the mantissa processing unit is called
- D represents the bit width of the unnormalized mantissa without extending the sign bit
- D+1 represents the bit width after the mantissa is normalized
- D+1 can also be understood as D+2-1, that is, the bit width of the multiplicand minus the bit width of the sign bit
- C represents the bit width of the multiplicand part (the maximum bit width of the multiplicand supported by the mantissa processing unit)
- C-1 represents the bit width of the part truncated from the mantissa to be truncated in each call.
- the maximum multiplier bit width supported by the mantissa processing unit is, for example, 12bit
- the maximum multiplicand bit width is, for example, 16bit.
- the two floating-point numbers input to the multiplier are FP32 and BF16 floating-point numbers, so the multiplication is performed in the FP32*BF16 operation mode, and the two floating-point numbers are non-normalized non-zero numbers. Therefore, the mantissas of the two floating-point numbers have bit widths of 23bit and 7bit respectively; considering the IEEE754 standard, the two mantissas can be expanded to 24bit and 8bit.
- the two mantissas are each extended by one 0 bit to become 25-bit and 9-bit signed numbers. Therefore, the control circuit takes the mantissa with a bit width of 9 bits as the multiplier corresponding to the maximum multiplier bit width and the mantissa with a bit width of 25 bits as the multiplicand corresponding to the maximum multiplicand bit width. Because only the bit width of the multiplicand (25bit) is greater than the maximum multiplicand bit width (16bit) supported by the mantissa processing unit, while the bit width of the multiplier (9bit) is smaller than the maximum multiplier bit width (12bit), only the mantissa formed by normalizing the original mantissa corresponding to the multiplicand is used as the mantissa to be truncated ina, and the multiplier is used as the multiplier inb input to the mantissa processing unit.
- the calculation performed at this time is ina_m*inb, that is, the multiplication operation of the multiplicand part with a bit width of 16 bits and a multiplier with a bit width of 9 bits, so that the mantissa result obtained by this call can be calculated.
- the two floating-point numbers input to the multiplier of the present disclosure are denormalized non-zero floating-point numbers.
- the two floating-point numbers are first normalized, so the mantissa of each is extended by 1 bit; in addition, because the Booth coding circuit used in the present disclosure performs signed fixed-point multiplication, each mantissa is then extended by a further 1 bit to form a signed number.
- the mantissas of the two floating-point numbers are matched to the inputs of the mantissa processing unit. Therefore, when the bit width of the multiplier is greater than the maximum multiplier bit width and the bit width of the multiplicand (here a normalized mantissa with an extended sign bit) is greater than the maximum multiplicand bit width, the control circuit takes both the mantissa formed by normalizing the original mantissa corresponding to the multiplier and the mantissa formed by normalizing the original mantissa corresponding to the multiplicand as mantissas to be truncated, and, to suit the Booth coding circuit in the embodiments of the present disclosure, the sign bit is extended for each truncated part.
- in each call, a part with a bit width of A-1 is truncated from the mantissa to be truncated corresponding to the multiplier, and a part with a bit width of C-1 is truncated from the mantissa to be truncated corresponding to the multiplicand, where A represents the maximum multiplier bit width supported by the mantissa processing unit, and C represents the maximum multiplicand bit width supported by the mantissa processing unit.
- each truncated part with a bit width of A-1 is supplemented with a 0 at the high bit as a sign bit to form a multiplier part with a bit width of A, and the multiplier part is used as one input of the mantissa processing unit in each call.
- each truncated part with a bit width of C-1 is supplemented with a 0 at the high bit as a sign bit to form a multiplicand part with a bit width of C, which is used as the other input of the mantissa processing unit in each call. Therefore, the following formula can be used to determine the number of calls of the mantissa processing unit:
- n = ceil((B+1)/(A-1))*ceil((D+1)/(C-1))
- n represents the number of times the mantissa processing unit is called
- B represents the bit width of the unnormalized mantissa without extending the sign bit
- B+1 represents the bit width after the mantissa is normalized
- B+1 can also be understood as B+2-1, that is, the bit width of the multiplier minus the bit width of the sign bit
- A represents the bit width of the multiplier part (the maximum bit width of the multiplier supported by the mantissa processing unit)
- A-1 represents the bit width of the part truncated in each call from the mantissa to be truncated corresponding to the multiplier
- D represents the bit width of the unnormalized mantissa without the extended sign bit
- D+1 represents the bit width after the mantissa is normalized
- D+1 can also be understood as D+2-1, that is, the bit width of the multiplicand minus the bit width of the sign bit
- C represents the bit width of the multiplicand part (the maximum bit width of the multiplicand supported by the mantissa processing unit)
- the maximum multiplier bit width supported by the mantissa processing unit is, for example, 8bit, and the maximum multiplicand bit width is, for example, 16bit.
- the two floating-point numbers input to the multiplier are both FP32-type floating-point numbers, so the multiplication is performed in the FP32*FP32 operation mode, and the two floating-point numbers are non-normalized non-zero numbers. Therefore, the mantissa width of each of the two floating-point numbers is 23 bits; considering the IEEE754 standard, the two mantissas can be expanded to 24 bits.
- the two mantissas are each extended by one 0 bit to become 25-bit signed numbers. Therefore, the control circuit selects the mantissas of the two floating-point numbers as the multiplier corresponding to the maximum multiplier bit width and the multiplicand corresponding to the maximum multiplicand bit width (because the two mantissas have the same bit width after expansion, either can be chosen as the multiplier and the other as the multiplicand). Because the bit width of the multiplier (25bit) is greater than the maximum multiplier bit width (8bit) and the bit width of the multiplicand (25bit) is greater than the maximum multiplicand bit width (16bit), the mantissa formed by normalizing the original mantissa corresponding to the multiplier is used as the mantissa to be truncated inb, and the mantissa formed by normalizing the original mantissa corresponding to the multiplicand is used as the mantissa to be truncated ina.
- n = ceil((23+1)/(8-1))*ceil((23+1)/(16-1)) = 4*2 = 8; therefore, the mantissa processing unit needs to be called eight times.
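The call-count formula can be evaluated for the three example configurations given in the text with a short sketch. One observation, offered as an assumption rather than a statement from the disclosure: when an operand already fits its supported bit width, the corresponding ceil factor evaluates to 1, so the product form also reproduces the single-operand cases. The function names are hypothetical.

```python
def calls_for_operand(raw_bits: int, max_bits: int) -> int:
    """ceil((raw_bits + 1) / (max_bits - 1)) using integer arithmetic.

    raw_bits: mantissa width before normalization and sign extension (B or D).
    max_bits: maximum operand width supported by the mantissa unit (A or C).
    """
    return -(-(raw_bits + 1) // (max_bits - 1))


def call_count(b: int, a: int, d: int, c: int) -> int:
    """n = ceil((B+1)/(A-1)) * ceil((D+1)/(C-1))."""
    return calls_for_operand(b, a) * calls_for_operand(d, c)


# FP32*BF16 with A=8, C=32: only the 9-bit multiplier exceeds its limit.
n_multiplier_only = call_count(b=7, a=8, d=23, c=32)
# FP32*BF16 with A=12, C=16: only the 25-bit multiplicand exceeds its limit.
n_multiplicand_only = call_count(b=7, a=12, d=23, c=16)
# FP32*FP32 with A=8, C=16: both operands exceed their limits.
n_both = call_count(b=23, a=8, d=23, c=16)
```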
- 7bit data is intercepted in inb each time.
- in the last truncation, all the remaining data is taken and padded with 0s at the front to make up 7bit; the 7bit data obtained in each truncation is extended by one 0 bit (the sign bit) to become 8 bits and serves as the multiplier part inb_m. Since inb is cut into four parts, there are four multiplier parts inb_m1, inb_m2, inb_m3, and inb_m4.
- 15bit data is intercepted each time in ina.
- in the last truncation, all the remaining data is taken and padded with 0s to make up 15bit; the 15bit data obtained in each truncation is extended by one 0 bit (the sign bit) to become 16bit and serves as the multiplicand part ina_m. Since ina is truncated into two parts, there are two multiplicand parts ina_m1 and ina_m2.
- the following calculations can be performed in sequence: ina_m1*inb_m1, ina_m1*inb_m2, ina_m1*inb_m3, ina_m1*inb_m4, ina_m2*inb_m1, ina_m2*inb_m2, ina_m2*inb_m3, ina_m2*in
- the following calculations can also be performed sequentially: inb_m1*ina_m1, inb_m1*ina_m2, inb_m2*ina_m1, inb_m2*ina_m2, inb_m3*ina_m1, inb_m3*ina_m2, inb_m4*ina_m1, inb_m4*ina_m2.
- the calculation for each call is the multiplication of the multiplicand part with a bit width of 16 bits and the multiplier part with a bit width of 8 bits, so that the mantissa result obtained by the call can be calculated. It is worth noting that the truncation of the mantissa to be truncated can be performed in the order from high to low, or from low to high.
- the mantissa processing unit may further include a shift and add circuit for obtaining the mantissa after the multiplication operation according to the mantissa result obtained in each call of the mantissa processing unit.
- the shift and add circuit includes a shifter, an intermediate memory, and an adder.
- when the control circuit calls the mantissa processing unit multiple times according to the operation mode, after the first call the shifter shifts the mantissa result obtained in the first call to obtain a shifted mantissa result and stores the shifted mantissa result in the intermediate memory.
- in each subsequent call, the shifter shifts the mantissa result obtained in that call to obtain the current mantissa result, and the adder adds the current mantissa result to the result stored in the intermediate memory and stores the sum back in the intermediate memory, thereby updating it; the result stored in the intermediate memory after the last call is used as the mantissa after the multiplication operation.
- the truncation of the mantissa to be truncated is performed in the order from high order to low order.
- the shifter shifts the mantissa result obtained in the current call according to the following formula:
- Y represents the number of shifts required for the mantissa result obtained in the current call
- k represents the sum of the digits of all data after the truncated part used in the current call in the mantissa to be truncated corresponding to the multiplier
- J represents the sum of all data bits in the mantissa to be truncated corresponding to the multiplicand after the truncated part used in the current call.
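The formula itself did not survive in the text above. Under the definitions of k and J given there, the shift that places each partial product at its correct weight is Y = k + J, and the sketch below uses that as a stated assumption. The function names and the 15-bit and 7-bit piece sizes are illustrative; the plain integer `intermediate` stands in for the intermediate memory.

```python
def split_with_trailing(value: int, width: int, chunk: int):
    """Return (piece, bits_after_piece) pairs, cutting high order first."""
    out = []
    remaining = width
    while remaining > 0:
        take = min(chunk, remaining)
        remaining -= take
        out.append(((value >> remaining) & ((1 << take) - 1), remaining))
    return out


def multiply_by_calls(ina: int, wa: int, inb: int, wb: int,
                      ca: int = 15, cb: int = 7) -> int:
    """Shift-and-add over repeated narrow multiplications."""
    intermediate = 0  # models the intermediate memory
    for a_part, j in split_with_trailing(ina, wa, ca):
        for b_part, k in split_with_trailing(inb, wb, cb):
            partial = a_part * b_part     # one call of the narrow multiplier
            y = k + j                     # assumed shift amount Y = k + J
            intermediate += partial << y  # shifter, then adder
    return intermediate


product = multiply_by_calls(0x2AAAAAAA, 30, 0x0BEEF123, 28)
```

With this choice of Y, the accumulated value equals the full-width product, which is why the last value in the intermediate memory can be taken as the mantissa after the multiplication.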
- the mantissa processing unit when only the bit width of the multiplier is greater than the maximum multiplier bit width, the mantissa processing unit is called twice, and for example, the mantissa is to be truncated.
- the multiplier parts in the two calls are inb_m1 and inb_m2 respectively.
- the adder adds this R2 to R1 stored in the intermediate memory, and stores the result of the addition in the intermediate memory to update it.
- the result stored in the intermediate memory after the second call is the mantissa after the multiplication operation.
- the shift addition circuit can work in the same way.
- the mantissa processing unit is called eight times, and for example, the truncation of the mantissa to be truncated is performed in the order from high to low.
- the multiplier parts in the eight calls are inb_m1, inb_m2, inb_m3, and inb_m4, and the multiplicand parts are ina_m1 and ina_m2.
- when the mantissa processing unit is called eight times, the following calculations are sequentially performed: ina_m1*inb_m1, ina_m1*inb_m2, ina_m1*inb_m3, ina_m1*inb_m4, ina_m2*inb_m1, ina_m2*inb_m2, ina_m2*inb_m3, ina_m2*inb_m4.
- the shifter shifts the result of ina_m1*inb_m1 to the left.
- the mantissa Since the 7bit data is intercepted in the mantissa to be truncated corresponding to the multiplier in the first call, the mantissa is to be truncated
- 15bit data is intercepted in the mantissa to be truncated corresponding to the multiplicand, so the mantissa to be truncated is
- the shifter shifts the result of ina_m1*inb_m4 to the left, because in the fourth call
- the truncation mantissa the same 7-bit data as the last call is intercepted.
- the exponent processing unit includes a second control circuit (not shown in the figure), which is used to determine, according to the exponent bit width of one of the two floating-point numbers and one of the two bit widths supported by the exponent processing unit, or according to the exponent bit widths of both floating-point numbers and both bit widths supported by the exponent processing unit, whether to call the exponent processing unit multiple times to obtain the exponent after the multiplication operation.
- the two floating-point numbers include a first floating-point number and a second floating-point number
- the exponent processing unit supports a third bit width and a fourth bit width; the exponent of the first floating-point number is used as the third input corresponding to the third bit width, and the exponent of the second floating-point number is used as the fourth input corresponding to the fourth bit width.
- the bit width of the third input is less than or equal to the third bit width, and the second control circuit is used to call the exponent processing unit multiple times to obtain the exponent after the multiplication operation when the bit width of the fourth input is greater than the fourth bit width.
- the bit width of one of the two inputs is fixed to be less than or equal to the corresponding bit width supported by the exponent processing unit. Therefore, only the size relationship between the bit width of the other input and its corresponding supported bit width needs to be checked to determine whether to call the exponent processing unit multiple times.
- the two floating-point numbers include a first floating-point number and a second floating-point number
- the exponent processing unit supports a third bit width and a fourth bit width; the exponent of the first floating-point number is used as the third input corresponding to the third bit width, and the exponent of the second floating-point number is used as the fourth input corresponding to the fourth bit width. The second control circuit is used to call the exponent processing unit multiple times to obtain the exponent after the multiplication operation in any of the following cases:
- the bit width of the third input is greater than the third bit width and the bit width of the fourth input is less than or equal to the fourth bit width;
- the bit width of the fourth input is greater than the fourth bit width and the bit width of the third input is less than or equal to the third bit width;
- the bit width of the third input is greater than the third bit width and the bit width of the fourth input is greater than the fourth bit width.
- the relationship between the bit widths of the two inputs and the two bit widths supported by the exponent processing unit is not fixed in advance, so each input must be compared with its corresponding supported bit width to determine whether to call the exponent processing unit multiple times.
- the second control circuit selects the exponent of the first floating-point number as the fourth input corresponding to the fourth bit width and selects the exponent of the second floating-point number as the third input corresponding to the third bit width.
- the exponents of the two input floating-point numbers can first be matched with the two bit widths supported by the exponent processing unit according to the strategy of large bit width to large bit width and small bit width to small bit width, so as to avoid making multiple calls for an exponent operation of two floating-point numbers that could be processed in a single call.
- bit width of the third input is greater than the third bit width and the bit width of the fourth input is less than or equal to the fourth bit width
- bit width of the fourth input is greater than the
- the second control circuit is used, when the bit width of the third input is less than or equal to the bit width of the fourth input and the third bit width is less than or equal to the fourth bit width, to determine the number of times the exponent processing unit is called and the data input to the exponent processing unit in each call according to the bit width of the fourth input and the third bit width. It is worth noting that, in the above three cases, the number of calls of the exponent processing unit and the data input to it in each call are determined by the larger of the bit widths of the third input and the fourth input and by the smaller of the third bit width and the fourth bit width. Of course, when the bit widths of the third input and the fourth input are the same, or the third bit width and the fourth bit width are the same, either of the two equal bit widths can be chosen.
- the description of the first floating-point number and the second floating-point number is only for distinguishing between the two floating-point numbers, where "first" and "second" do not have a limiting effect.
- the description of the third input and the fourth input is only to distinguish the two inputs of the exponent processing unit, and the description of the third bit width and the fourth bit width is only to distinguish the two maximum processing bit widths that the exponent processing unit supports for its two inputs, so neither "third" nor "fourth" has a limiting effect.
- the floating-point numbers input to the multiplier in the above embodiments are floating-point numbers that meet the format required by the operation of the multiplier and its internal components, that is, floating-point numbers that have undergone preprocessing such as normalization. It should be understood that the floating-point numbers input to the multiplier can be normalized or non-normalized. Combining the description of the normalization unit above, if at least one of the two input floating-point numbers is a non-normalized non-zero floating-point number, the at least one floating-point number may first be normalized by the normalization unit to obtain the normalized exponent and mantissa, and the normalized exponent is then used as the input of the exponent processing unit to perform the above-mentioned floating-point multiplication operation.
- the third and fourth embodiments of the present disclosure are also applicable to the arithmetic of floating-point numbers according to the arithmetic modes as described above.
- the above-mentioned third input may be an augend
- the fourth input may be an addend
- the third bit width may be, for example, the maximum augend bit width supported by the exponent processing unit.
- the fourth bit width may be, for example, the maximum addend bit width supported by the exponent processing unit.
- the two floating-point numbers input to the multiplier of the present disclosure are normalized, so the mantissa of the two floating-point numbers is extended by 1 bit.
- the exponents of the two floating-point numbers are matched with the input of the exponent processing unit.
- control circuit may determine the number of calls of the exponent processing unit according to the following formula :
- m represents the number of times the exponent processing unit is called
- P represents the bit width of the addend
- Q represents the maximum addend bit width
- Q-1 represents the bit width of the part intercepted from the augend and the addend in each call.
- in each call, parts of bit width Q-1 are intercepted from the augend and the addend at the same bit positions, so that parts with the same bit width and the same bit positions are added. If the part intercepted in a call has fewer than Q-1 bits or contains no data, 0s are added to make up Q-1 bits of data.
- every time the exponent processing unit is called, the second control circuit can intercept the Q-1-bit parts from the augend and the addend in the same order as the input of the exponent processing unit, and the exponent processing unit produces the exponent result of this call; the final exponent is obtained after calling the exponent processing unit m times. It is worth noting that the above-mentioned same order can be from high order to low order, or from low order to high order.
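The formula for m is not reproduced in the text above. A reading consistent with the definitions of P, Q and Q-1, and with the two-call example that follows (P = 9, Q = 8), is m = ceil(P / (Q - 1)); the sketch below adopts that reading as an assumption, with P taken as the wider of the two operand bit widths. The function name is hypothetical.

```python
def exponent_call_count(p: int, q: int) -> int:
    """Assumed rule: m = ceil(P / (Q - 1)).

    p: bit width of the wider operand (P)
    q: maximum operand bit width supported by the unit (Q);
       one bit is reserved for the carry, so Q - 1 bits are consumed per call.
    """
    if q < 2:
        raise ValueError("Q must leave at least one data bit per call")
    return -(-p // (q - 1))  # ceiling division on integers


m = exponent_call_count(9, 8)  # 9-bit operand, 8-bit unit, 7 bits per call
print(m)
```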
- the bit width of the augend is 6 bits
- the bit width of the addend is 9 bits
- in the second call, 0s are added before the intercepted 2 bits of data to make up 7 bits, and a carry bit is extended to form two 8-bit data with carry to be added.
- the call to the exponent processing unit is also applicable to the third embodiment of the present disclosure.
- the exponent processing unit may further include a second shift and add circuit configured to obtain the exponent after the multiplication operation according to the exponent result obtained in each call of the exponent processing unit.
- the second shift and add circuit includes a second shifter, a second intermediate memory, and a second adder.
- after the first call of the exponent processing unit, the second shifter shifts the exponent result obtained by the first call and stores the shifted exponent result in the second intermediate memory. Starting from the second call of the exponent processing unit, the second shifter shifts the exponent result obtained in the current call, and the second adder adds the shifted exponent result to the value stored in the second intermediate memory and stores the sum in the second intermediate memory to update it. The value stored in the second intermediate memory after the last call is used as the exponent after the multiplication operation.
- the second shifter shifts the exponent result obtained in the current call in the following manner: when the augend and the addend are intercepted in order from high order to low order across the calls of the exponent processing unit, the exponent result of the parts intercepted in the current call is shifted to the left.
- the number of bits shifted is the number of bits of the part that follows the part intercepted from the addend in the current call.
- the bit width of the augend is 6 bits
- the bit width of the addend is 9 bits
- the maximum augend bit width and the maximum addend bit width supported by the exponent processing unit are both 8 bits.
- both the augend and the addend are truncated with a bit width of 7 bits.
- the second shifter shifts the exponent result obtained in the first call by 2 bits to the left (because there are 2 bits of data after the part intercepted from the addend in this call) and stores the shifted exponent result in the second intermediate memory.
- in the second call, the second shifter shifts the exponent result obtained in the current call to the left; since there is no more data after the part intercepted in this call, it is shifted by 0 bits, that is, not shifted.
- the second adder adds the exponent result shifted by 0 bits to the value stored in the second intermediate memory and stores the sum in the second intermediate memory to update it. Since this second call is the last call, the value stored in the second intermediate memory after the second call is the exponent after the multiplication operation.
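The two-call example above (a 6-bit and a 9-bit operand on a unit that supports 8 bits) can be traced with a short sketch. It assumes high-to-low slicing at aligned bit positions and zero padding on the high side of a short slice; the function name and the sample operand values are hypothetical.

```python
def chunked_add(x: int, y: int, width: int, q: int) -> int:
    """Add two unsigned values of at most `width` bits in (q-1)-bit slices.

    Each slice pair is added by a q-bit adder (q-1 data bits plus carry),
    shifted left by the number of bits that follow the slice, and
    accumulated in `intermediate`, which models the second intermediate memory.
    """
    slice_bits = q - 1
    intermediate = 0
    remaining = width
    while remaining > 0:
        take = min(slice_bits, remaining)
        remaining -= take
        mask = (1 << take) - 1
        xs = (x >> remaining) & mask   # slice of the first operand
        ys = (y >> remaining) & mask   # slice of the second operand
        partial = xs + ys              # one call; result fits in q bits
        intermediate += partial << remaining  # second shifter, second adder
    return intermediate


x = 0b101101      # hypothetical 6-bit operand (45)
y = 0b110010011   # hypothetical 9-bit operand (403)
total = chunked_add(x, y, width=9, q=8)
```

In the first iteration the slice covers bits 8 to 2 and is shifted left by 2; in the second it covers bits 1 to 0 and is not shifted, matching the walk-through above.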
- the control module may include multiple sub-modules, which may be used to perform the various operations involved in multiple calls, such as determining whether to call the mantissa processing unit multiple times, determining the number of calls, determining the data input to the mantissa processing unit in each call, determining whether the mantissa bit width matches the bit width supported by the mantissa processing unit, adjusting the mantissa input, and so on.
- the second control module may also include multiple sub-modules, and similarly, these sub-modules may respectively perform the various operations in multiple calls.
- FIG. 4 does not draw or describe other units, such as the exponent processing unit and the sign processing unit.
- the overall description of the multiplier of the present disclosure will be given below with reference to FIG. 7.
- the previous description of the mantissa processing unit is also applicable to the situation depicted in FIG. 7.
- FIG. 7 is an overall schematic block diagram showing a multiplier 700 according to an embodiment of the present disclosure. It should be understood that the positions, existence, and connection relationships of the various units depicted in the figure are only exemplary and not restrictive. For example, some of the units can be integrated, while other units can be separated, or omitted or replaced depending on the application scenario.
- the multiplier of the present disclosure can be exemplarily divided into a first stage and a second stage in the operation of each operation mode according to the operation flow, as shown by the dotted line in the figure.
- in the first stage, the calculation result of the sign bit, the intermediate calculation result of the exponent bits, and the intermediate calculation result of the mantissa bits are output (the latter including, for example, the aforementioned Booth encoding process of the fixed-point multiplication of the input mantissas and the Wallace tree compression process).
- in the second stage, the exponent and mantissa are regularized and rounded to output the calculation result of the exponent and the calculation result of the mantissa.
- the multiplier of the present disclosure may include a mode selection unit 702 and a normalization processing unit 704, wherein the mode selection unit may select an operation mode according to an input mode signal (in_mode).
- the input mode signal may correspond to the operation mode number in Table 2.
- the multiplier can be made to work in the FP16*FP16 operation mode, and when the input mode signal indicates the operation mode number "3" in Table 2, the multiplier can be operated in the FP32*FP32 operation mode.
- FIG. 7 only shows four exemplary operation modes: FP16*FP16, BF16*BF16, FP32*FP32, and FP32*BF16.
- the multiplier of the present disclosure also supports many other different operation modes.
- the normalization processing unit may be configured to perform normalization processing on the first floating-point number or the second floating-point number according to the operation mode when that number is a non-normalized non-zero floating-point number, to obtain the corresponding exponent and mantissa, for example by regularizing the floating-point number in the data format indicated by the operation mode according to the IEEE 754 standard.
- the multiplier includes a mantissa processing unit to perform a multiplication operation of the first floating-point number mantissa and the second floating-point number mantissa.
- the mantissa processing unit may include a bit expansion circuit 706, a Booth encoder 708, a partial product generation circuit 710, a Wallace tree compressor 712, and an adder 714.
- the bit expansion circuit may be used to extend the mantissa of at least one of the first floating-point number and the second floating-point number by a number of bits, for example by adding zeros to the upper bits, so as to be suitable for the operation of the Booth encoder.
- the control circuit can perform the above operations of calling the mantissa processing unit multiple times according to the mantissa obtained after the sign bit extension of the mantissa by the bit extension circuit. Since the Booth encoder, the partial product generation circuit, the Wallace tree compressor and the adder have been described in detail in conjunction with FIGS. 4-6, the same description is equally applicable here and therefore will not be repeated.
- the multiplier of the present disclosure further includes a regularization unit 716 and a rounding unit 718, which have the same functions as the units shown in FIG. 3.
- the regularization unit can perform floating-point regularization processing on the addition result and the exponent data from the exponent processing unit according to the data format indicated by the output mode signal "out_mode" shown in the figure, to obtain a regularized exponent result and a regularized mantissa result.
- the regularization unit can adjust the bit width of the exponent and the mantissa to make it meet the requirements of the aforementioned indicated data format.
- the regularization unit can repeatedly shift the mantissa 1 bit to the left and subtract 1 from the exponent until the highest bit of the mantissa is 1.
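The shift-and-decrement loop just described can be sketched as below. The function name, the explicit `width` parameter and the early return for a zero mantissa are assumptions added so the loop terminates; the text does not say how a zero mantissa is handled.

```python
def regularize(mantissa: int, exponent: int, width: int) -> tuple[int, int]:
    """Shift the mantissa left one bit at a time, decrementing the exponent,
    until the highest of `width` mantissa bits is 1."""
    if mantissa == 0:
        return mantissa, exponent  # assumption: zero is passed through unchanged
    top = 1 << (width - 1)
    while not (mantissa & top):
        mantissa <<= 1
        exponent -= 1
    return mantissa, exponent


result = regularize(0b00010110, 10, 8)
```

For the 8-bit value `0b00010110` three shifts are needed, so the exponent drops from 10 to 7 and the mantissa becomes `0b10110000`.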
- the rounding unit, in one embodiment, can be used to perform a rounding operation on the regularized mantissa result according to a rounding mode to obtain a rounded mantissa, and to use the rounded mantissa as the mantissa after the multiplication operation.
- the aforementioned output mode signal may be a part of the operation mode, and is used to indicate the data format after the multiplication operation.
- the output mode signal may be combined with the aforementioned input mode signal and provided to the mode selection unit. Based on the combined mode signal, the mode selection unit can determine the data formats of the input data and the output result at the initial stage of the operation of the multiplier, without separately providing the output mode signal to the regularization unit, which can further simplify the operation.
- the following five rounding modes can be exemplarily included.
- mantissa rounding in the "rounding" mode: for example, the 24-bit mantissas of two normalized floating-point numbers are multiplied to obtain a 48-bit mantissa (bits 47 to 0), which is then normalized (if the highest bit of the mantissa is 0, the mantissa is shifted 1 bit to the left; if the highest bit of the mantissa is 1, the mantissa is not shifted and the temporary exponent obtained above is incremented by 1), and only bits 46 to 24 are taken for output.
- when bit 23 of the mantissa is 0, bits 23 to 0 are discarded; when bit 23 of the mantissa is 1, 1 is added to bit 24 and bits 23 to 0 are discarded.
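A minimal sketch of this normalize-then-round step for 24-bit mantissas follows. The function name is hypothetical, and the handling of a rounding carry that overflows the 23 kept bits is an added assumption, since the text above does not cover that case.

```python
def normalize_and_round(product: int, exponent: int) -> tuple[int, int]:
    """product: 48-bit product (bits 47..0) of two 24-bit normalized mantissas.

    Returns the 23 stored mantissa bits (original bits 46..24 after
    normalization) and the adjusted temporary exponent.
    """
    if (product >> 47) & 1:
        exponent += 1      # highest bit is 1: keep mantissa, bump exponent
    else:
        product <<= 1      # highest bit is 0: shift left by one bit
    kept = (product >> 24) & 0x7FFFFF  # bits 46..24
    if (product >> 23) & 1:            # bit 23 decides the rounding
        kept += 1
        if kept == 1 << 23:
            # Assumption (not in the text): carry out of the kept bits
            # renormalizes to 1.0 with the exponent incremented.
            kept = 0
            exponent += 1
    return kept, exponent


rounded_down = normalize_and_round(0xC00000 * 0xC00000, 0)  # 1.5 * 1.5
rounded_up = normalize_and_round(0xC00001 * 0xC00000, 0)
```

In the first call bit 23 of the product is 0, so the low bits are simply dropped; in the second call bit 23 is 1, so the kept mantissa is incremented by one unit in the last place.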
- the multiplier of the present disclosure further includes an exponent processing unit 720 and a sign processing unit 722, wherein the exponent processing unit can be used to obtain the exponent after the multiplication operation according to the operation mode, the exponent of the first floating-point number, and the exponent of the second floating-point number.
- for example, the exponent processing circuit can add the exponent bit data of the first floating-point number, the exponent bit data of the second floating-point number, and the respective offset values of the corresponding input floating-point data types, and subtract the offset value of the output floating-point data type, to obtain the exponent bit data of the product of the first floating-point number and the second floating-point number.
- the exponent processing unit may be implemented as or include an addition and subtraction circuit, which is configured to obtain the exponent after the multiplication operation according to the operation mode, the exponent of the first floating-point number, and the exponent of the second floating-point number.
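As a hedged illustration of the offset arithmetic, the sketch below uses the IEEE 754 bias convention, in which a stored exponent equals the true exponent plus a bias (15 for FP16, 127 for BF16 and FP32). Under that convention the input biases are removed and the output bias is applied; the sign convention of the "offset value" in the text may be defined differently, so this is one reading rather than the disclosed circuit. The table and function names are hypothetical.

```python
# Exponent biases of the formats named in the operation modes.
BIAS = {"FP16": 15, "BF16": 127, "FP32": 127}


def product_exponent(e1: int, fmt1: str, e2: int, fmt2: str, out_fmt: str) -> int:
    """Stored exponent of a product, before normalization and rounding.

    e1, e2: stored (biased) exponent fields of the two inputs.
    """
    true_exponent = (e1 - BIAS[fmt1]) + (e2 - BIAS[fmt2])
    return true_exponent + BIAS[out_fmt]


# 2.0 in FP16 has stored exponent 16; 2.0 * 2.0 = 4.0, stored as 129 in FP32.
e_out = product_exponent(16, "FP16", 16, "FP16", "FP32")
```

The same function covers mixed modes such as FP32*BF16, since each input is unbiased with its own format's value.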
- the sign processing unit may be implemented as an exclusive OR circuit in one embodiment, which is used to perform an exclusive OR operation on the sign bit data of the first floating-point number and the second floating-point number to obtain the sign bit data of the product of the first floating-point number and the second floating-point number.
- the multiplier of the present disclosure supports operations in multiple operation modes, thereby overcoming the defect of the multiplier that only supports a single floating-point operation in the prior art. Furthermore, since the multiplier of the present disclosure can be multiplexed, it also supports high-bit wide floating-point data, which reduces the operation cost and overhead. In one or more embodiments, the multiplier of the present disclosure may also be arranged or included in an integrated circuit chip or a computing device to implement multiplication operations on floating-point numbers in multiple operation modes.
- FIG. 8 is a flowchart illustrating a method 800 for performing a floating-point number multiplication operation using a multiplier according to an embodiment of the present disclosure. It is understandable that the multiplier described here is the multiplier described in detail above in conjunction with FIGS. 1 to 7, so the previous descriptions of the multiplier and its internal composition, functions and operations are also applicable to the description here.
- the method 800 may include, at step S802, using the exponent processing unit of the multiplier to obtain the exponent after the multiplication operation according to the operation mode, the exponent of the first floating-point number, and the exponent of the second floating-point number.
- the operation mode can be one of a variety of operation modes and can be used to indicate the data format of a floating-point number. In one or more embodiments, the operation mode can also be used to determine the data format of the floating-point number of the output result.
- the method 800 may use the mantissa processing unit of the multiplier to obtain the mantissa after the multiplication operation according to the operation mode, the first floating-point number, and the second floating-point number.
- the present disclosure uses the Booth coding algorithm and the Wallace tree compressor in some preferred embodiments, so as to improve the efficiency of the mantissa processing.
- the method 800 may also, at step S806, obtain the sign after the multiplication operation according to the sign of the first floating-point number and the sign of the second floating-point number.
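The three parts of method 800 (exponent, mantissa, sign) can be exercised in software for the FP32*FP32 mode. The sketch below is an assumption-laden model, not the hardware flow: it handles only normalized, non-zero inputs whose product stays in the normal range, truncates the low mantissa bits instead of applying a rounding mode, and uses hypothetical function names. Python's `struct` module is used only to unpack and repack the IEEE 754 bit fields.

```python
import struct


def fp32_fields(x: float) -> tuple[int, int, int]:
    """Return (sign, stored exponent, 23-bit mantissa) of x as FP32."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return bits >> 31, (bits >> 23) & 0xFF, bits & 0x7FFFFF


def fp32_multiply(x: float, y: float) -> float:
    s1, e1, m1 = fp32_fields(x)
    s2, e2, m2 = fp32_fields(y)

    # Exponent: sum of stored exponents with one bias removed.
    exponent = e1 + e2 - 127

    # Mantissa: restore the hidden leading 1, multiply, normalize.
    product = (m1 | 0x800000) * (m2 | 0x800000)  # 48-bit result
    if product >> 47:
        exponent += 1
    else:
        product <<= 1
    mantissa = (product >> 24) & 0x7FFFFF  # truncation, no rounding

    # Sign: exclusive OR of the two sign bits.
    sign = s1 ^ s2

    bits = (sign << 31) | (exponent << 23) | mantissa
    return struct.unpack(">f", struct.pack(">I", bits))[0]


print(fp32_multiply(1.5, -2.5))
```

The inputs here are chosen so that the product is exactly representable, which is why truncation is sufficient for this check.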
- FIG. 9 is a structural diagram showing a combined processing device 900 according to an embodiment of the present disclosure.
- the combined processing device 900 includes a computing device 902, which may include the multiplier of the present disclosure as described above with reference to the accompanying drawings.
- the combined processing device also includes a universal interconnection interface 904 and other processing devices 906.
- the computing device according to the present disclosure interacts with other processing devices to jointly complete the operation specified by the user.
- the other processing device may include one or more types of general-purpose and/or special-purpose processors such as a central processing unit (“CPU"), a graphics processing unit (“GPU”), and a neural network processor.
- the number of processors is not limited but determined according to actual needs.
- the other processing device can serve as an interface between the computing device of the present disclosure (which can be embodied as a machine learning computing device) and external data and control, performing operations including but not limited to data transfer, and completing basic control such as starting and stopping the machine learning computing device; other processing devices can also cooperate with the machine learning computing device to complete computing tasks.
- the universal interconnection interface can be used to transmit data and control commands between the computing device and other processing devices.
- the computing device can obtain required input data from other processing devices via the universal interconnection interface, and write the input data to the on-chip storage device of the computing device.
- the computing device can obtain control instructions from other processing devices via the universal interconnection interface, and write them into the on-chip control buffer of the computing device.
- the universal interconnection interface can also read the data in the storage module of the computing device and transmit it to other processing devices.
- the combined processing device may further include a storage device 908, which may be connected to the computing device and the other processing device respectively.
- the storage device may be used to store the data of the computing device and the other processing device, and is especially suitable for data to be calculated that cannot be fully held in the internal storage of the computing device or the other processing device.
- the combined processing device of this disclosure can be used as a system on chip (SoC) for mobile phones, robots, drones, video capture and video surveillance equipment, and other equipment, thereby effectively reducing the core area of the control part, increasing the processing speed, and reducing the overall power consumption.
- the universal interconnection interface of the combined processing device is connected to some parts of the equipment.
- Some components here can be, for example, a camera, a monitor, a mouse, a keyboard, a network card or a wifi interface.
- the present disclosure also discloses a chip or integrated circuit chip, which includes the above-mentioned computing device, the combined processing device, and the multiplier of the present disclosure. In other embodiments, the present disclosure also discloses a chip packaging structure, which includes the above-mentioned chip.
- the present disclosure also discloses a board card, which includes the above-mentioned chip packaging structure.
- the board may also include other supporting components.
- the supporting components may include, but are not limited to: a storage device 1004, an interface device 1006, and a control device 1008.
- the storage device is connected to the chip in the chip packaging structure through a bus for storing data.
- the storage device may include multiple groups of storage units 1010. Each group of the storage unit and the chip are connected by a bus. It can be understood that each group of the storage units may be DDR SDRAM ("Double Data Rate SDRAM", double-rate synchronous dynamic random access memory).
- the storage device may include 4 groups of the storage unit. Each group of the storage unit may include a plurality of DDR4 particles (chips). In an embodiment, the chip may include four 72-bit DDR4 controllers. In the 72-bit DDR4 controller, 64 bits are used for data transmission and 8 bits are used for ECC verification.
- each group of the storage unit may include a plurality of double-rate synchronous dynamic random access memories arranged in parallel.
- DDR can transmit data twice in one clock cycle.
- a controller for controlling the DDR is provided in the chip, which is used to control the data transmission and data storage of each storage unit.
- the interface device is electrically connected with the chip in the chip packaging structure.
- the interface device is used to implement data transmission between the chip and an external device 1012 (for example, a server or a computer).
- the interface device may be a standard PCIE interface.
- the data to be processed is transferred from the server to the chip through a standard PCIE interface to realize data transfer.
- the interface device may also be other interfaces, and the present disclosure does not limit the specific manifestations of the above other interfaces, as long as the interface unit can realize the switching function.
- the calculation result of the chip is still transmitted by the interface device back to an external device (such as a server).
- the control device is electrically connected with the chip to monitor the state of the chip.
- the chip and the control device may be electrically connected through an SPI interface.
- the control device may include a single-chip microcomputer ("MCU", Micro Controller Unit).
- the chip may include multiple processing chips, multiple processing cores, or multiple processing circuits, and can drive multiple loads. Therefore, the chip can be in different working states such as multi-load and light-load.
- the control device can realize the regulation and control of the working states of multiple processing chips, multiple processing cores, and/or multiple processing circuits in the chip.
- the present disclosure also discloses an electronic device or device, which includes the above-mentioned board.
- electronic equipment or devices can include data processing devices, robots, computers, printers, scanners, tablets, smart terminals, mobile phones, driving recorders, navigators, sensors, servers, cloud servers, cameras, video cameras, projectors, watches, earphones, mobile storage, wearable devices, vehicles, household appliances, and/or medical equipment.
- the transportation means include airplanes, ships, and/or vehicles;
- the household appliances include TVs, air conditioners, microwave ovens, refrigerators, rice cookers, humidifiers, washing machines, electric lights, gas stoves, and range hoods;
- the medical equipment includes nuclear magnetic resonance instruments, B-mode ultrasound scanners, and/or electrocardiographs.
- the disclosed device can be implemented in other ways.
- the device embodiments described above are merely illustrative. For example, the division of the units is only a logical function division, and other divisions are possible in actual implementation: multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
- the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, devices or units, and may be in electrical, optical, acoustic, magnetic or other forms.
- the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they may be located in one place, or they may be distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
- the functional units in the various embodiments of the present disclosure may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
- the above-mentioned integrated unit can be realized in the form of hardware or software program module.
- if the integrated unit is implemented in the form of a software program module and sold or used as an independent product, it can be stored in a computer-readable memory.
- the computer software product is stored in a memory and includes several instructions that enable a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the methods described in the various embodiments of the present disclosure.
- the aforementioned memory includes media that can store program code, such as a USB flash drive, read-only memory ("ROM"), random access memory ("RAM"), a removable hard disk, a magnetic disk, or an optical disk.
- a multiplier for multiplying floating-point numbers, wherein the multiplier includes a mantissa processing unit for obtaining the mantissa after the multiplication operation according to the mantissas of the floating-point numbers, and the mantissa processing unit includes a control circuit configured to call the mantissa processing unit multiple times when the mantissa bit width of at least one of the two floating-point numbers is greater than the data bit width that the mantissa processing unit can process at one time.
- the multiplier according to clause A1, wherein the two floating-point numbers include a first floating-point number and a second floating-point number, the mantissa processing unit supports a first bit width and a second bit width, the mantissa of the first floating-point number is used as the first input corresponding to the first bit width, the mantissa of the second floating-point number is used as the second input corresponding to the second bit width, the bit width of the first input is less than or equal to the first bit width, and the control circuit is configured to call the mantissa processing unit multiple times to obtain the mantissa after the multiplication operation when the bit width of the second input is greater than the second bit width.
- the mantissa of the first floating-point number is used as the first input corresponding to the first bit width, the mantissa of the second floating-point number is used as the second input corresponding to the second bit width, and the control circuit is configured to call the mantissa processing unit multiple times to obtain the mantissa after the multiplication operation in any of the following cases: when the bit width of the first input is greater than the first bit width and the bit width of the second input is less than or equal to the second bit width; when the bit width of the second input is greater than the second bit width and the bit width of the first input is less than or equal to the first bit width; or when the bit width of the first input is greater than the first bit width and the bit width of the second input is greater than the second bit width.
- Clause A4 the multiplier according to any one of clauses A1-A3, wherein, when the mantissa bit width of the first floating-point number is less than the mantissa bit width of the second floating-point number and the first bit width is greater than the second bit width, or when the mantissa bit width of the first floating-point number is greater than the mantissa bit width of the second floating-point number and the first bit width is smaller than the second bit width, the control circuit selects the mantissa of the first floating-point number as the second input corresponding to the second bit width and selects the mantissa of the second floating-point number as the first input corresponding to the first bit width.
- Clause A5 the multiplier according to any one of clauses A1-A4, wherein when the bit width of the first input is greater than the first bit width and the bit width of the second input is less than or equal to the second bit width, the control circuit determines, according to the bit width of the first input and the first bit width, the number of calls of the mantissa processing unit and the data input to the mantissa processing unit in each call.
- Clause A6 the multiplier according to any one of clauses A1-A5, wherein when the bit width of the second input is greater than the second bit width and the bit width of the first input is less than or equal to the first bit width, the control circuit determines, according to the bit width of the second input and the second bit width, the number of calls of the mantissa processing unit and the data input to the mantissa processing unit in each call.
- when the bit width of the first input is greater than the first bit width and the bit width of the second input is greater than the second bit width, the control circuit determines, according to the bit width of the first input and the first bit width as well as the bit width of the second input and the second bit width, the number of calls of the mantissa processing unit and the data input to the mantissa processing unit in each call.
- the multiplier according to any one of clauses A1-A7, wherein the mantissa processing unit further includes a shift and add circuit, and the shift and add circuit is configured to obtain the mantissa after the multiplication operation according to the mantissa result obtained in each call of the mantissa processing unit.
- the multiplier according to any one of clauses A1-A8, wherein the shift and add circuit includes a shifter, an intermediate memory, and an adder, and when the control circuit calls the mantissa processing unit multiple times: after the first call, the shifter shifts the mantissa result obtained in the first call and stores the shifted mantissa result in the intermediate memory; starting from the second call, the shifter shifts the mantissa result obtained in the current call to obtain the current mantissa result, and the adder adds the current mantissa result to the result stored in the intermediate memory and stores the sum in the intermediate memory to update it; and the result stored in the intermediate memory after the last call is used as the mantissa after the multiplication operation.
- the multiplier according to any one of clauses A1-A9, wherein the multiplier further includes an exponent processing unit configured to obtain the exponent after the multiplication operation according to the exponents of the two floating-point numbers, the exponent processing unit includes a second control circuit, and the second control circuit is configured to determine, according to the exponent bit width of one of the two floating-point numbers and one of the two bit widths supported by the exponent processing unit, or according to the exponent bit widths of the two floating-point numbers and the two bit widths supported by the exponent processing unit, that the exponent processing unit is to be called multiple times to obtain the exponent after the multiplication operation.
- the multiplier according to any one of clauses A1-A10, wherein the two floating-point numbers include a first floating-point number and a second floating-point number, the exponent processing unit supports a third bit width and a fourth bit width, the exponent of the first floating-point number is used as the third input corresponding to the third bit width, the exponent of the second floating-point number is used as the fourth input corresponding to the fourth bit width, the bit width of the third input is less than or equal to the third bit width, and the second control circuit is configured to call the exponent processing unit multiple times to obtain the exponent after the multiplication operation when the bit width of the fourth input is greater than the fourth bit width.
- the multiplier according to any one of clauses A1-A11, wherein the two floating-point numbers include a first floating-point number and a second floating-point number, the exponent processing unit supports a third bit width and a fourth bit width, the exponent of the first floating-point number is used as the third input corresponding to the third bit width, the exponent of the second floating-point number is used as the fourth input corresponding to the fourth bit width, and the second control circuit is configured to call the exponent processing unit multiple times to obtain the exponent after the multiplication operation in any of the following cases: when the bit width of the third input is greater than the third bit width and the bit width of the fourth input is less than or equal to the fourth bit width; when the bit width of the fourth input is greater than the fourth bit width and the bit width of the third input is less than or equal to the third bit width; or when the bit width of the third input is greater than the third bit width and the bit width of the fourth input is greater than the fourth bit width.
- the multiplier according to any one of clauses A1-A14, wherein the exponent processing unit further includes a second shift and add circuit, and the second shift and add circuit is configured to obtain the exponent after the multiplication operation according to the exponent result obtained in each call of the exponent processing unit.
- the multiplier according to any one of clauses A1-A15, wherein the mantissa processing unit includes a partial product operation unit and a partial product summation unit, wherein the partial product operation unit is configured to obtain an intermediate result according to the mantissas of the floating-point numbers, and the partial product summation unit is configured to perform an addition operation on the intermediate result to obtain an addition result, and to use the addition result as the mantissa after the multiplication operation.
- Clause A17 The multiplier according to any one of clauses A1 to A16, wherein the partial product operation unit includes a Booth encoding circuit, and the Booth encoding circuit is used to perform Booth encoding on the mantissa of the first floating-point number or of the second floating-point number to obtain the intermediate result.
- the multiplier according to any one of clauses A1 to A18, wherein the partial product summation unit includes a Wallace tree and an adder, the Wallace tree is used to sum the intermediate result to obtain a second intermediate result, and the adder is used to sum the second intermediate result to obtain the addition result.
- Clause A20 The multiplier according to any one of clauses A1-A19, wherein the adder includes at least one of a full adder, a serial adder, and a carry-lookahead adder.
- each of the Wallace trees has M inputs and N outputs, and the number of Wallace trees is not less than K, where N is a preset positive integer less than M, and K is a positive integer not less than the maximum bit width of the intermediate result.
- Clause A23 the multiplier according to any one of clauses A1-A22, wherein the partial product summation unit is used to select one or more groups of the Wallace trees to sum the intermediate results, wherein each group has X Wallace trees, X being the number of bits of the intermediate result, the Wallace trees within each group have a sequential carry relationship, and the Wallace trees of different groups have no carry relationship.
- the multiplier according to any one of clauses A1-A23, wherein the multiplier further includes a normalization processing unit configured to: when at least one of the two floating-point numbers is a non-normalized non-zero floating-point number, normalize the at least one floating-point number to obtain the corresponding exponent and mantissa.
- the multiplier according to any one of clauses A1-A24, wherein the multiplier is used to perform multiplication of the two floating-point numbers according to an operation mode, the operation mode indicates the data formats of the two floating-point numbers, the mantissa processing unit is used to obtain the mantissa after the multiplication operation according to the operation mode and the mantissas of the two floating-point numbers, and the exponent processing unit is used to obtain the exponent after the multiplication operation according to the operation mode and the exponents of the two floating-point numbers.
- the normalization processing unit is further configured to perform normalization processing on at least one of the two floating-point numbers according to the operation mode, to obtain the corresponding exponent and mantissa.
- the multiplier according to any one of clauses A1-A26, wherein the data format includes at least one of half-precision floating-point numbers, single-precision floating-point numbers, brain floating-point numbers, double-precision floating-point numbers, and custom floating-point numbers.
- Clause A28 The multiplier according to any one of clauses A1-A27, wherein the mantissa processing unit includes a bit number expansion circuit, and the bit number expansion circuit is configured to extend the number of bits of the mantissa of at least one of the first floating-point number and the second floating-point number.
- the sign processing unit is used to obtain the sign after the multiplication operation according to the signs of the two floating-point numbers.
- the multiplier according to any one of clauses A1-A29, wherein the sign processing unit includes an exclusive OR logic circuit, and the exclusive OR logic circuit is configured to perform an exclusive OR operation on the signs of the two floating-point numbers to obtain the sign after the multiplication operation.
- the multiplier according to any one of clauses A1-A30 further includes a regularization unit for:
- the multiplier according to any one of clauses A1-A31, further comprising: a rounding unit configured to perform a rounding operation on the regularized mantissa result according to a rounding mode to obtain a rounded mantissa, and to use the rounded mantissa as the mantissa after the multiplication operation.
- a method for performing floating-point number multiplication using a multiplier, wherein the mantissa processing unit of the multiplier is used to obtain the mantissa after the multiplication operation according to the mantissas of the floating-point numbers, the mantissa processing unit includes a control circuit, and the control circuit is configured to call the mantissa processing unit multiple times when the mantissa bit width of at least one of the two floating-point numbers is greater than the data bit width that the mantissa processing unit can process at one time.
- Clause A34 an integrated circuit chip including the multiplier according to any one of clauses A1-A31.
- Clause A35 a computing device comprising the multiplier according to any one of clauses A1-A31 or the integrated circuit chip according to clause A34.
- the term “if” can be interpreted as “when” or “once” or “in response to determination” or “in response to detection” depending on the context.
- the phrase “if determined” or “if [described condition or event] is detected” can be interpreted as meaning “once determined” or “in response to determination” or “once [described condition or event] is detected” or “in response to detection of [described condition or event]”, depending on the context.
Abstract
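A multiplier, a method, an integrated circuit chip, and a computing device for floating-point operations, wherein the computing device (902) may be included in a combined processing device, and the combined processing device may further include a universal interconnect interface (904) and other processing devices (906). The computing device interacts with the other processing devices to jointly complete a calculation operation specified by the user. The combined processing device may further include a storage device (908), which is connected to the computing device and to the other processing devices respectively and holds data of the computing device and the other processing devices. The solution can be widely applied to all kinds of floating-point data operations.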
Description
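Cross-Reference to Related Applications
This application claims priority to Chinese patent application No. 201910970802.8, filed on October 14, 2019 and entitled "Multiplier, Method, Integrated Circuit Chip, and Computing Device for Floating-Point Operations", and to Chinese patent application No. 202011074061.4, filed on October 9, 2020 under the same title, both of which are incorporated herein by reference in their entirety.
The present disclosure relates generally to the field of floating-point operations. More specifically, the present disclosure relates to a method, a multiplier, an integrated circuit chip, and a computing device for floating-point operations.
Current signal processing algorithms, such as inner products between vectors and convolutions of matrices, use a large number of multiply-add operations, and the efficiency of these operations often depends on the execution speed of the multiplier. Although current multipliers have improved significantly in execution efficiency, there is still room for improvement in processing floating-point data. How to obtain a high-efficiency, low-power, and low-cost multiplier for multiplying floating-point data has therefore become a problem to be solved in the prior art.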
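Summary
To at least partially solve the technical problems mentioned in the background, the solution of the present disclosure provides a multiplier and a method for floating-point operations, as well as an integrated circuit chip and a computing device that include the multiplier.
In one aspect, the present disclosure provides a multiplier for multiplying floating-point numbers, wherein the multiplier includes a mantissa processing unit for obtaining the mantissa after the multiplication operation according to the mantissas of the floating-point numbers, the mantissa processing unit includes a control circuit, and the control circuit is configured to call the mantissa processing unit multiple times when the mantissa bit width of at least one of the two floating-point numbers is greater than the data bit width that the mantissa processing unit can process at one time.
In another aspect, the present disclosure provides a method for performing floating-point multiplication using a multiplier, wherein the mantissa processing unit of the multiplier is used to obtain the mantissa after the multiplication operation according to the mantissas of the floating-point numbers, the mantissa processing unit includes a control circuit, and the control circuit is configured to call the mantissa processing unit multiple times when the mantissa bit width of at least one of the two floating-point numbers is greater than the data bit width that the mantissa processing unit can process at one time.
In yet another aspect, the present disclosure provides an integrated circuit chip including the multiplier. In one or more embodiments, the multiplier of the present disclosure may constitute an independent integrated circuit chip or be arranged on an integrated circuit chip or a computing device, so as to operate on floating-point numbers of multiple different data formats.
With the multiplier, the corresponding operation method, the integrated circuit chip, and the computing device of the present disclosure, operations on multiple types of floating-point data can be supported without providing a separate multiplier for each floating-point type. The multiplier of the present disclosure is therefore flexible and can be widely applied to all kinds of floating-point data operations. In addition, when processing input data with a large bit width, the multiplier supports cyclic reuse, so that no additional processing chips need to be arranged, which also reduces the layout area of the integrated circuit.
The above and other objects, features, and advantages of the exemplary embodiments of the present disclosure will become easier to understand by reading the following detailed description with reference to the accompanying drawings. The drawings show several embodiments of the present disclosure by way of example and not limitation, and identical or corresponding reference numerals denote identical or corresponding parts, in which: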
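FIG. 1 is a schematic diagram showing a floating-point data format according to an embodiment of the present disclosure;
FIG. 2 is a schematic structural block diagram showing a multiplier according to an embodiment of the present disclosure;
FIG. 3 is a structural block diagram showing more details of the multiplier according to an embodiment of the present disclosure;
FIG. 4 is a schematic block diagram showing a mantissa processing unit according to an embodiment of the present disclosure;
FIG. 5 is a schematic diagram showing a partial product operation according to an embodiment of the present disclosure;
FIG. 6 shows the operation flow and a schematic block diagram of a Wallace tree compressor according to an embodiment of the present disclosure;
FIG. 7 is an overall schematic block diagram showing a multiplier according to an embodiment of the present disclosure;
FIG. 8 is a flowchart showing a method for performing floating-point multiplication using a multiplier according to an embodiment of the present disclosure;
FIG. 9 is a structural diagram showing a combined processing device according to an embodiment of the present disclosure; and
FIG. 10 is a schematic structural diagram showing a board according to an embodiment of the present disclosure.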
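Overall, the technical solution of the present disclosure provides a multiplier, a method, an integrated circuit chip, and a computing device for floating-point operations. Unlike prior-art floating-point multipliers, the present disclosure provides a multiplier that supports multiple operation modes, thereby overcoming the limitation that existing multipliers support only one type of floating-point operation. In particular, the present disclosure uses multiple operation modes to indicate different floating-point data types, and during the multiplication of floating-point numbers it performs the various data operations, including for example encoding, compression, summation, normalization, and rounding, based on one of the operation modes, so as to realize the operation associated with one of the floating-point data types. The multiplier of the present disclosure can therefore operate in multiple modes, which further improves the flexibility of floating-point operations and reduces their cost.
The technical solution of the present disclosure and its embodiments are described in detail below with reference to the accompanying drawings. It should be understood that many specific details about floating-point operations are set forth in order to provide a thorough understanding of the embodiments. However, a person of ordinary skill in the art, guided by the present disclosure, can practice the described embodiments without these specific details. In other cases, well-known methods, procedures, and components are not described in detail, so as not to obscure the described embodiments unnecessarily. Moreover, this description should not be regarded as limiting the scope of the embodiments of the present disclosure.
FIG. 1 is a schematic diagram showing a floating-point data format 100 according to an embodiment of the present disclosure. As shown in FIG. 1, a floating-point number to which the technical solution of the present disclosure can be applied may include three parts, namely a sign (or sign bit) 102, an exponent (or exponent bits) 104, and a mantissa (or mantissa bits) 106; an unsigned floating-point number may have no sign or sign bit. In some embodiments, the floating-point numbers suitable for the multiplier of the present disclosure may include at least one of half-precision, single-precision, brain floating-point, double-precision, and custom floating-point numbers. Specifically, in some embodiments the floating-point format may be a format conforming to the IEEE 754 standard, such as double precision (float64, abbreviated "FP64"), single precision (float32, abbreviated "FP32"), or half precision (float16, abbreviated "FP16"). In other embodiments, the floating-point format may be the existing 16-bit brain floating-point format (bfloat16, abbreviated "BF16") or a custom format, such as an 8-bit brain floating-point format (bfloat8, abbreviated "BF8"), an unsigned half-precision format (unsigned float16, abbreviated "UFP16"), or an unsigned 16-bit brain floating-point format (unsigned bfloat16, abbreviated "UBF16"). For ease of understanding, Table 1 below shows some of these data formats; the sign, exponent, and mantissa bit widths in it are for illustration only.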
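Table 1
Data type | Sign bit width | Exponent bit width | Mantissa bit width |
FP16 | 1 | 5 | 10 |
BF16 | 1 | 8 | 7 |
FP32 | 1 | 8 | 23 |
BF8 | 1 | 5 | 3 |
UFP16 | 0 | 5 (or 6) | 11 (or 10) |
UBF16 | 0 | 8 | 8 |
For the floating-point formats mentioned above, the multiplier of the present disclosure can at least support multiplication between two floating-point numbers of any of these formats, where the two numbers may have the same or different floating-point data formats. For example, the multiplication may be FP16*FP16, BF16*BF16, FP32*FP32, FP32*BF16, FP16*BF16, FP32*FP16, BF8*BF16, UBF16*UFP16, or UBF16*FP16.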
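As a minimal software sketch of the three-part layout described above, the following Python code splits a raw bit pattern into sign, exponent, and mantissa fields using the bit widths of Table 1. The names `FloatFormat`, `FORMATS`, and `split_fields` are hypothetical and introduced only for illustration; only the three signed formats whose widths are standard are included.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FloatFormat:
    sign_bits: int  # 0 for an unsigned format, 1 otherwise
    exp_bits: int
    man_bits: int


# Bit widths taken from Table 1 (illustrative subset).
FORMATS = {
    "FP16": FloatFormat(1, 5, 10),
    "BF16": FloatFormat(1, 8, 7),
    "FP32": FloatFormat(1, 8, 23),
}


def split_fields(bits: int, fmt: FloatFormat):
    """Return (sign, stored exponent, stored mantissa) of a raw bit pattern."""
    mantissa = bits & ((1 << fmt.man_bits) - 1)
    exponent = (bits >> fmt.man_bits) & ((1 << fmt.exp_bits) - 1)
    sign = (bits >> (fmt.man_bits + fmt.exp_bits)) & 1 if fmt.sign_bits else 0
    return sign, exponent, mantissa
```

For example, the FP16 encoding of 1.0 is `0x3C00`, which splits into sign 0, stored exponent 15, and mantissa 0.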
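FIG. 2 is a schematic structural block diagram showing a multiplier 200 according to an embodiment of the present disclosure. As mentioned above, the multiplier of the present disclosure supports multiplication of floating-point numbers in various data formats, and these formats can be indicated by the operation mode of the present disclosure, so that the multiplier works in one of multiple operation modes.
As shown in FIG. 2, the multiplier of the present disclosure may generally include an exponent processing unit 202 and a mantissa processing unit 204, where the exponent processing unit processes the exponent bits of the floating-point numbers and the mantissa processing unit processes their mantissa bits. Optionally or additionally, in some embodiments, when the floating-point numbers processed by the multiplier have a sign bit, the multiplier may further include a sign processing unit 206 for processing floating-point numbers that include a sign bit.
In operation, the multiplier can perform a floating-point operation, according to one of the operation modes, on a received, input, or buffered first floating-point number and second floating-point number, each of which has one of the floating-point data formats discussed above. For example, in the first operation mode the multiplier can support the multiplication FP16*FP16, and in the second operation mode it can support BF16*BF16. Similarly, in the third operation mode it can support FP32*FP32, and in the fourth operation mode it can support FP32*BF16. The exemplary correspondence between operation modes and floating-point types is shown in Table 2 below.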
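Table 2
Operation mode number | Floating-point types of the operation |
1 | FP16*FP16 |
2 | BF16*BF16 |
3 | FP32*FP32 |
4 | FP32*BF16 |
In one embodiment, Table 2 above may be stored in a memory of the multiplier, and the multiplier selects one of the operation modes in the table according to an instruction received from an external device, which may be, for example, the external device 1012 shown in FIG. 10. In another embodiment, the operation mode may also be supplied automatically via the mode selection unit 308 shown in FIG. 3. For example, when two FP16 floating-point numbers are input to the multiplier, the mode selection unit can select the first operation mode according to the data formats of the two numbers. As another example, when one FP32 floating-point number and one BF16 floating-point number are input, the mode selection unit can select the fourth operation mode according to their data formats.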
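The automatic selection performed by the mode selection unit can be pictured as a table lookup keyed on the two input formats. The sketch below is only an illustration under that assumption; `MODE_TABLE` and `select_mode` are hypothetical names, and the table holds just the four example modes of Table 2.

```python
# Table 2 as a lookup keyed on (first format, second format).
MODE_TABLE = {
    ("FP16", "FP16"): 1,
    ("BF16", "BF16"): 2,
    ("FP32", "FP32"): 3,
    ("FP32", "BF16"): 4,
}


def select_mode(fmt_a: str, fmt_b: str) -> int:
    """Pick the operation mode number from the formats of the two inputs."""
    try:
        return MODE_TABLE[(fmt_a, fmt_b)]
    except KeyError:
        raise ValueError(f"no operation mode for {fmt_a}*{fmt_b}") from None
```

With this table, two FP16 inputs select mode 1, and an FP32 input paired with a BF16 input selects mode 4, matching the two examples in the text.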
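It can be seen that the different operation modes of the present disclosure are associated with corresponding floating-point data. That is, the operation mode can be used to indicate the data format of the first floating-point number and the data format of the second floating-point number. In another embodiment, the operation mode indicates not only the data formats of the first and second floating-point numbers but also the data format after the multiplication operation. The operation modes extended from Table 2 are shown in Table 3 below.
Table 3
Unlike the operation mode numbers shown in Table 2, the operation modes in Table 3 are extended by one digit that indicates the data format after the floating-point multiplication. For example, when the multiplier works in operation mode 21, it multiplies the two input floating-point numbers BF16*BF16 and outputs the result of the multiplication in the FP16 data format.
Indicating the floating-point data format by a numbered operation mode, as above, is merely exemplary and not restrictive. Under the teaching of the present disclosure, it is also conceivable to build indexes into the operation mode to determine the formats of the multiplier and the multiplicand. For example, the operation mode may include two indexes, the first indicating the type of the first floating-point number and the second indicating the type of the second floating-point number. In operation mode 13, the first index "1" indicates that the first floating-point number (the multiplicand) is in the first floating-point format, namely FP16, and the second index "3" indicates that the second floating-point number (the multiplier) is in the second floating-point format, namely FP32. Further, a third index may be added to the operation mode to indicate the data format of the output result; for example, the third index "1" in operation mode 131 can indicate that the output is in the first floating-point format, namely FP16. As the number of operation modes grows, corresponding indexes or index levels can be added as needed to establish the relationship between operation modes and data formats.
In addition, although operation modes are referred to here by numbers, in other examples they may be referred to by other symbols or codes as the application requires, such as letters, symbols, numbers, or combinations thereof, and such expressions identify the operation mode and the data formats of the first floating-point number, the second floating-point number, and the output result. When these expressions are formed as an instruction, the instruction may include three fields: the first field indicates the data format of the first floating-point number, the second field indicates the data format of the second floating-point number, and the third field indicates the data format of the output result. Of course, these fields may also be merged into one field, or new fields may be added to indicate more content related to the floating-point data format. It can be seen that the operation mode of the present disclosure can not only be associated with the input floating-point data formats but can also be used to normalize the output result, so as to obtain a product in the desired data format.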
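The index-based encoding can be sketched as a per-digit lookup. In the hypothetical `decode_mode` helper below, the mappings "1" to FP16 and "3" to FP32 come from the text, while "2" to BF16 is an assumption made only to complete the example.

```python
# Index digit -> data format. "1" and "3" follow the text; "2" is assumed.
FORMAT_INDEX = {"1": "FP16", "2": "BF16", "3": "FP32"}


def decode_mode(mode: str):
    """Decode a two- or three-index mode string.

    Returns (first input format, second input format, output format or None).
    """
    if len(mode) not in (2, 3):
        raise ValueError("mode must have two or three index digits")
    formats = [FORMAT_INDEX[digit] for digit in mode]
    output = formats[2] if len(formats) == 3 else None
    return formats[0], formats[1], output
```

Under these assumptions, `decode_mode("131")` yields an FP16 multiplicand, an FP32 multiplier, and an FP16 output, as in the example of operation mode 131.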
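FIG. 3 is a structural block diagram showing more details of a multiplier 300 according to an embodiment of the present disclosure. As can be seen from FIG. 3, it includes not only the exponent processing unit 202, the mantissa processing unit 204, and the optional sign processing unit 206 shown in FIG. 2, but also the internal components these units may include and the units related to their operation. The exemplary operation of these units is described below with reference to FIG. 3.
To perform floating-point multiplication, the exponent processing unit can obtain the exponent after the multiplication operation according to the aforementioned operation mode, the exponent of the first floating-point number, and the exponent of the second floating-point number. In one embodiment, the exponent processing unit can be implemented by an addition and subtraction circuit. For example, the exponent processing unit can add the exponent of the first floating-point number, the exponent of the second floating-point number, and the offset values of their respective input floating-point data formats, and then subtract the offset value of the output floating-point data format, to obtain the exponent after multiplying the first and second floating-point numbers.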
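The exponent arithmetic can be illustrated with the usual IEEE 754 convention, in which the stored exponent equals the true exponent plus a format-specific bias. The sketch below follows that convention; the sign with which the "offset values" enter depends on how the offset is defined, so this is an assumption rather than a restatement of the circuit. `BIAS` and `product_exponent` are hypothetical names.

```python
# Exponent biases of the standard formats: 2**(exp_bits - 1) - 1.
BIAS = {"FP16": 15, "BF16": 127, "FP32": 127}


def product_exponent(exp_a: int, fmt_a: str, exp_b: int, fmt_b: str,
                     fmt_out: str) -> int:
    """Stored exponent of a*b before any normalization adjustment."""
    # Remove each input bias to get true exponents, add them,
    # then re-bias for the output format.
    true_exp = (exp_a - BIAS[fmt_a]) + (exp_b - BIAS[fmt_b])
    return true_exp + BIAS[fmt_out]
```

For instance, FP16 values 2.0 (stored exponent 16) and 4.0 (stored exponent 17) give a true exponent of 3, which is stored as 130 in an FP32 result. Overflow and underflow of the output exponent range are not handled in this sketch.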
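Further, the mantissa processing unit of the multiplier can obtain the mantissa after the multiplication operation according to the aforementioned operation mode, the first floating-point number, and the second floating-point number. In one embodiment, the mantissa processing unit may include a partial product operation unit 312 and a partial product summation unit 314, where the partial product operation unit obtains an intermediate result according to the mantissa of the first floating-point number and the mantissa of the second floating-point number. In some embodiments, the intermediate result may be the multiple partial products obtained while multiplying the first and second floating-point numbers (as schematically shown in FIG. 5 and FIG. 6). The partial product summation unit sums the intermediate result to obtain a summation result and uses the summation result as the mantissa after the multiplication operation.
To obtain the intermediate result, in one embodiment the present disclosure uses a Booth encoding circuit to pad the high and low bits of the mantissa of the second floating-point number (for example, the multiplier of the floating-point operation) with 0 (padding the high bit with 0 converts the mantissa from an unsigned number to a signed number). It should be understood that, depending on the encoding method, the mantissa of the first floating-point number (for example, the multiplicand) may also be encoded (for example, padded with 0 in its high and low bits), or both may be encoded, to obtain multiple partial products. Partial products are described further below with reference to the drawings.
In another embodiment, the partial product summation unit may include an adder that sums the intermediate result to obtain the summation result. In yet another embodiment, the partial product summation unit includes a Wallace tree and an adder, where the Wallace tree sums the intermediate result to obtain a second intermediate result, and the adder sums the second intermediate result to obtain the summation result. In these embodiments, the adder may include at least one of a full adder, a serial adder, and a carry-lookahead adder.
In one embodiment, the multiplier of the present disclosure further includes a regularization unit 318 and a rounding unit 320. The regularization unit can perform floating-point regularization on the mantissa and exponent after the multiplication operation to obtain a regularized exponent result and a regularized mantissa result, and use them as the exponent and the mantissa after the multiplication operation. For example, according to the data format indicated by the operation mode, the regularization unit can adjust the bit widths of the exponent and mantissa so that they meet the requirements of that format. The regularization unit can also adjust the exponent or mantissa in other respects. For example, in some application scenarios, when the value of the mantissa is not 0, the most significant bit of the mantissa should be 1; otherwise, the exponent bits can be modified while the mantissa bits are shifted, so that the number takes normalized form. In another embodiment, the regularization unit can also adjust the exponent after the multiplication operation according to the mantissa after the multiplication operation. For example, when the highest bit of the mantissa after the multiplication operation is 1, the exponent obtained after the multiplication operation can be increased by 1. Correspondingly, the rounding unit can perform a rounding operation on the regularized mantissa result according to a rounding mode and use the rounded mantissa as the mantissa after the multiplication operation. Depending on the application scenario, the rounding unit can perform rounding operations including, for example, rounding down, rounding up, and rounding to the nearest significant number. In some application scenarios, the rounding unit can also round the 1s shifted out when the mantissa is shifted right.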
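The exponent adjustment and rounding described above can be sketched for the product of two normalized significands. This is a minimal illustration, assuming both inputs are integers of the form 1.f with `frac_bits` fractional bits and that the rounding mode is round-to-nearest with ties to even (one of several modes the text allows). The function name `normalize_and_round` is hypothetical.

```python
def normalize_and_round(product: int, frac_bits: int, out_frac_bits: int,
                        exponent: int):
    """Regularize and round the integer product of two 1.f significands.

    `product` carries 2*frac_bits fractional bits and represents a value
    in [1, 4). Returns (significand with out_frac_bits fractional bits,
    adjusted exponent).
    """
    prod_frac = 2 * frac_bits
    if product >> (prod_frac + 1):       # value in [2, 4): top bit is set
        prod_frac += 1                   # reinterpret with one more fraction bit
        exponent += 1                    # and compensate in the exponent
    drop = prod_frac - out_frac_bits     # number of low bits to discard
    if drop > 0:
        kept = product >> drop
        remainder = product & ((1 << drop) - 1)
        half = 1 << (drop - 1)
        # Round to nearest, ties to even.
        if remainder > half or (remainder == half and (kept & 1)):
            kept += 1
    else:
        kept = product << -drop
    if kept >> (out_frac_bits + 1):      # rounding carried out to 2.0
        kept >>= 1
        exponent += 1
    return kept, exponent
```

As a usage example, 1.5 * 1.5 with 10 fractional bits gives the integer product 1536 * 1536; the top bit is set, so the exponent is incremented and the result is the significand 1152 (1.125) with exponent 1, that is 2.25.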
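In addition to the exponent processing unit and the mantissa processing unit, the multiplier of the present disclosure optionally includes a sign processing unit. When the input floating-point numbers carry a sign bit, the sign processing unit can obtain the sign after the multiplication operation according to the sign of the first floating-point number and the sign of the second floating-point number. For example, in one embodiment the sign processing unit may include an exclusive OR logic circuit 322, which performs an exclusive OR operation on the sign of the first floating-point number and the sign of the second floating-point number to obtain the sign after the multiplication operation. In another embodiment, the sign processing unit can also be implemented by a truth table or logical judgment.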
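The exclusive OR rule for the sign is small enough to state directly. The helper name `product_sign` is hypothetical; the sign bits are assumed to be 0 for positive and 1 for negative, as in IEEE 754.

```python
def product_sign(sign_a: int, sign_b: int) -> int:
    """Sign bit of a product: 1 only when exactly one operand is negative."""
    return (sign_a ^ sign_b) & 1
```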
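In addition, to make the input or received first and second floating-point numbers conform to the prescribed format, in one embodiment the multiplier of the present disclosure may further include a normalization processing unit 324, which, when the first or second floating-point number is a non-normalized non-zero floating-point number, normalizes it according to the operation mode to obtain the corresponding exponent and mantissa. For example, when the selected operation mode is the second operation mode shown in Table 2 and the input first and second floating-point numbers are FP16 data, the normalization processing unit can normalize the FP16 data into BF16 data so that the multiplier operates in the second operation mode. In one or more embodiments, the normalization processing unit can also preprocess (for example, extend) the mantissas of normalized floating-point numbers, which have an implicit 1, and of non-normalized floating-point numbers, which have no implicit 1, to facilitate the subsequent operation of the mantissa processing unit. Based on the above, it can be understood that the normalization processing unit 324 and the aforementioned regularization unit 318 may in some embodiments perform the same or similar operations; the difference is that the normalization processing unit 324 normalizes the input floating-point data, whereas the regularization unit 318 regularizes the mantissa and exponent that are about to be output.
The multiplier of the present disclosure and its embodiments have been described above with reference to FIG. 3. Based on this description, those skilled in the art can understand that the solution of the present disclosure obtains the result of the multiplication (including the exponent, the mantissa, and the optional sign) through the execution of the multiplier. Depending on the application scenario, for example when the aforementioned regularization and rounding are not needed, the result obtained by the mantissa processing unit and the exponent processing unit can be regarded as the final operation result. When regularization and rounding are needed, the exponent and mantissa obtained after that processing can be regarded as the final operation result, or as part of it (when the final sign is also considered). Further, the solution of the present disclosure uses multiple operation modes to let the multiplier support floating-point numbers of different types or data formats, so that the multiplier can be reused, which saves chip design overhead and computing cost. In addition, through the multiple-call mechanism, the multiplier also supports the calculation of floating-point numbers with large bit widths. Since the multiplication of the mantissas (also called the mantissa bits or mantissa part) is critical to the performance of the whole floating-point operation, the mantissa operation of the present disclosure is described below with reference to FIG. 4.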
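FIG. 4 is a schematic block diagram showing a mantissa processing unit operation 400 according to an embodiment of the present disclosure. As shown in FIG. 4, the mantissa processing operation mainly involves two units, namely the partial product operation unit and the partial product summation unit discussed above with reference to FIG. 3. In terms of timing, the mantissa processing operation can be roughly divided into a first stage, in which it obtains the intermediate result, and a second stage, in which it obtains the mantissa result output from the adder 408.
In an exemplary operation, the first and second floating-point numbers received by the multiplier can be divided into several parts, namely the aforementioned sign (optional), exponent, and mantissa. Optionally, after normalization, the mantissa parts of the two floating-point numbers enter the mantissa processing unit (such as the mantissa processing unit in FIG. 2 or FIG. 3) as inputs, and specifically enter the partial product operation unit. As shown in FIG. 4, the present disclosure uses a Booth encoding circuit 402 to pad the high and low bits of the mantissa of the second floating-point number (the multiplier of the floating-point operation) with 0 and to perform Booth encoding, so that the intermediate result is obtained in the partial product generation circuit 404. Of course, the first and second floating-point numbers here serve only illustrative and not restrictive purposes, so in some application scenarios the first floating-point number may be the multiplier and the second floating-point number the multiplicand. Correspondingly, in some encoding schemes the encoding operation may also be performed on the floating-point number that serves as the multiplicand.
For a better understanding of the technical solution of the present disclosure, Booth encoding is briefly introduced below. In general, when two binary numbers are multiplied, the multiplication generates a large number of intermediate results called partial products, which are then accumulated to obtain the final product. The more partial products there are, the larger the area and power consumption of the array multiplier, the slower its execution, and the harder its circuit is to implement. The purpose of Booth encoding is to reduce the number of partial-product summands effectively and thereby reduce the circuit area. The algorithm first encodes the input multiplier according to a corresponding rule; in one embodiment, the encoding rule may be, for example, the rule shown in Table 4 below:
Table 4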
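Here $y_{2i+1}$, $y_{2i}$, and $y_{2i-1}$ in Table 4 denote the values of each group of sub-data to be encoded (taken from the multiplier), and X denotes the mantissa of the first floating-point number (the multiplicand). After Booth encoding is applied to each group of data to be encoded, the corresponding encoded signal $PP_i$ (i = 0, 1, 2, ..., n) is obtained. As schematically shown in Table 4, the encoded signals obtained after Booth encoding fall into five classes: -2X, 2X, -X, X, and 0. For example, based on the above encoding rule, if the received multiplicand is the 8-bit data "$X_7X_6X_5X_4X_3X_2X_1X_0$", the following partial products can be obtained:
1) when the multiplier bits include the three consecutive bits "001" from the table above, the partial product is X, which can be written as "$X_7X_6X_5X_4X_3X_2X_1X_0$" with the ninth bit being the sign bit, i.e. $PP_i$ = {X[7], X};
2) when the multiplier bits include the three consecutive bits "011" from the table above, the partial product is 2X, which can be written as X shifted left by one bit, giving "$X_7X_6X_5X_4X_3X_2X_1X_0$0", i.e. $PP_i$ = {X, 0};
5) when the multiplier bits include the three consecutive bits "111" or "000" from the table above, the partial product is 0, i.e. $PP_i$ = {9'b0}.
It should be understood that the above description of obtaining partial products in connection with Table 4 is merely exemplary and not restrictive. Under the teaching of the present disclosure, those skilled in the art can change the rules in Table 4 to obtain partial products different from those shown there. For example, when the multiplier bits contain a run of several consecutive bits (for example, 3 or more) of a specific value, the resulting partial product may be the two's complement of the multiplicand; or, for example, the "add 1" operation in items 3) and 4) above may be performed after the partial products have been summed.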
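The encoding idea can be tried out in software with the standard radix-4 (modified) Booth recoding, in which each overlapping group of three multiplier bits yields a digit in {-2, -1, 0, 1, 2}. The sketch below uses that textbook rule, which agrees with the cases listed above ("001" gives X, "011" gives 2X, "000" and "111" give 0), but it is not a reproduction of Table 4. The function names are hypothetical, and negative partial products are formed with ordinary signed integers rather than the invert-and-add-1 circuit.

```python
def booth_radix4_digits(multiplier: int, bits: int) -> list[int]:
    """Radix-4 Booth digits of a `bits`-wide two's-complement multiplier.

    `bits` must be even; an unsigned mantissa is first zero-extended so
    that its top bit is 0 (the sign extension described in the text).
    """
    if bits % 2:
        raise ValueError("bits must be even; extend the multiplier first")
    y = (multiplier & ((1 << bits) - 1)) << 1   # append y[-1] = 0 below the LSB
    digits = []
    for i in range(0, bits, 2):
        triple = (y >> i) & 0b111               # bits y[i+1], y[i], y[i-1]
        hi, mid, lo = (triple >> 2) & 1, (triple >> 1) & 1, triple & 1
        digits.append(-2 * hi + mid + lo)       # one of -2, -1, 0, 1, 2
    return digits


def booth_multiply(multiplicand: int, multiplier: int, bits: int) -> int:
    """Sum the Booth partial products, each weighted by 4**i."""
    return sum((digit * multiplicand) << (2 * i)
               for i, digit in enumerate(booth_radix4_digits(multiplier, bits)))
```

A 12-bit extended mantissa produces six digits, so at most six non-zero partial products, roughly half the count of a bit-by-bit shift-and-add scheme.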
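From the above introduction it can be understood that, by encoding the mantissa of the second floating-point number with the Booth encoding circuit and using the mantissa of the first floating-point number, the partial product generation circuit can produce multiple partial products as the intermediate result and feed them into the Wallace tree compressor 406 in the partial product summation unit. It should be understood that obtaining partial products by Booth encoding is only one preferred approach of the present disclosure, and those skilled in the art may obtain them in other ways. For example, they can be obtained by shift operations, that is, according to whether a bit of the multiplier is 1 or 0, either the shifted multiplicand or 0 is added to obtain the corresponding partial product. Similarly, using a Wallace tree compressor to add the partial products is merely exemplary and not restrictive, and those skilled in the art may use other types of adders for this purpose, for example one or more full adders, half adders, or combinations of the two.
The Wallace tree compressor (or Wallace tree for short) is mainly used to sum the above intermediate result (the multiple partial products) so as to reduce the number of accumulations of partial products (that is, to compress them). Generally, a Wallace tree compressor can adopt a carry-save adder (CSA) architecture and the Wallace tree algorithm; computation with a Wallace tree array is much faster than conventional carry-propagate addition.
Specifically, a Wallace tree compressor can compute the sum of the rows of partial products in parallel; for example, it can reduce the number of accumulations of N partial products from N-1 to Log2N, which increases the speed of the multiplier and is significant for the effective use of resources. Depending on the application, the Wallace tree compressor can be designed in several types, such as a 7-2 Wallace tree, a 4-2 Wallace tree, or a 3-2 Wallace tree. In one or more embodiments, the present disclosure uses the 7-2 Wallace tree as the example for implementing the various floating-point operations, which is described in detail below with reference to FIG. 5 and FIG. 6.
In some embodiments, the Wallace trees disclosed herein may each be arranged with M inputs and N outputs, and their number may be no less than K, where N is a preset positive integer less than M and K is a positive integer no less than the maximum bit width of the intermediate result. For example, M may be 7 and N may be 2, which is the 7-2 Wallace tree described in detail below. When the maximum bit width of the intermediate result is 48, K may be 48, that is, there may be 48 Wallace trees.
In some embodiments, one or more groups of Wallace trees can be selected according to the operation mode to sum the intermediate result, where each group has X Wallace trees and X is the number of bits of the intermediate result. Further, the Wallace trees within a group may be linked by sequential carries, while there is no carry relationship between groups. In an exemplary connection, the Wallace tree compressors are connected through carries: the carry output of a lower-order Wallace tree compressor is fed to the next higher-order Wallace tree as its carry input (Cin in FIG. 6), and the carry output (Cout) of that higher-order compressor in turn becomes the carry input of the compressor above it. In addition, when one or more Wallace trees are selected from the available compressors, the selection can be arbitrary; for example, they can be selected in the order 0, 1, 2, 3 or connected in the order 0, 2, 4, 6, as long as the selected Wallace tree compressors follow the carry relationship described above.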
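The Wallace tree and its operation are now introduced with an illustrative example. Suppose the first and second floating-point numbers are 16-bit data (for example FP16*FP16), the multiplier supports a data bit width of 32 bits (and thus supports two 16-bit multiplications in parallel), and the Wallace tree is a 7-2 Wallace tree compressor with 7 inputs (an example value of M above) and 2 outputs (an example value of N above). In this scenario, 48 Wallace trees (an example value of K above) can be used to complete the two multiplications in parallel.
Among these 48 Wallace trees, trees 0 to 23 (the 24 Wallace trees of the first group) can sum the partial products of the first multiplication, and the Wallace trees within this group can be connected in sequence by carries. Further, trees 24 to 47 (the 24 Wallace trees of the second group) can sum the partial products of the second multiplication, with the Wallace trees within that group likewise connected in sequence by carries. There is no carry relationship between tree 23 of the first group and tree 24 of the second group, that is, Wallace trees of different groups have no carry relationship.
Returning to FIG. 4, after the Wallace tree compressor has summed and compressed the partial products, the compressed partial products are summed by an adder to obtain the result of the mantissa multiplication. In one or more embodiments of the present disclosure, the adder may include one of a full adder, a serial adder, and a carry-lookahead adder, and it sums the last two rows of partial products produced by the Wallace tree compressor to obtain the result of the mantissa multiplication.
It can be understood that the mantissa multiplication shown in FIG. 4, in particular the exemplary use of Booth encoding and the Wallace tree, obtains the result of the mantissa multiplication efficiently. Specifically, Booth encoding effectively reduces the number of partial-product summands and thus the circuit area, while the Wallace compression tree computes the sum of the rows of partial products in parallel and thus increases the speed of the multiplier.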
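An example operation of the partial products and the 7-2 Wallace tree is described in detail below with reference to FIG. 5 and FIG. 6. This description is merely exemplary and not restrictive, and is intended only to aid understanding of the solution of the present disclosure.
FIG. 5 shows the partial products 500 obtained from the partial product generation circuit in the mantissa processing unit described above with reference to FIGS. 2 to 4, drawn as the four rows of white dots between the two dashed lines, where each row of white dots identifies one partial product. To facilitate the subsequent Wallace tree compression, the number of bits can be extended in advance. For example, the black dots in FIG. 5 are copies of the most significant bit of each 9-bit partial product, and it can be seen that the partial products are extended and aligned to 16 (8+8) bits (that is, the 8-bit width of the multiplicand mantissa plus the 8-bit width of the multiplier mantissa). In another embodiment, for example for the partial products of a 25*13 binary multiplication, the partial products are extended to 38 (25+13) bits (that is, the 25-bit width of the multiplicand mantissa plus the 13-bit width of the multiplier mantissa).
FIG. 6 shows the operation flow and a schematic block diagram 600 of a Wallace tree compressor according to an embodiment of the present disclosure.
As shown in FIG. 6, after the mantissas of two floating-point numbers are multiplied, for example by Booth-encoding the multiplier and applying it to the multiplicand as described above, the 7 partial products shown in FIG. 6 are obtained. The use of the Booth encoding algorithm reduces the number of partial products generated. For ease of understanding, a dashed box in the partial-product part of the figure marks one Wallace tree comprising 7 elements, and arrows show how it is compressed from 7 elements to 2. In one embodiment, this compression (or summation) can be implemented with full adders, each of which takes three elements as input and outputs two (a sum "sum" and a carry "carry" to the next higher bit). The schematic block diagram of the 7-2 Wallace tree compressor is shown on the right of FIG. 6; it can be seen that the compressor has 7 inputs from one column of partial products (the seven elements marked in the dashed box on the left of FIG. 6). In operation, the carry input of the Wallace tree of column 0 is 0, and the carry output Cout of each column's Wallace tree serves as the carry input Cin of the next column's Wallace tree.
As can be seen from the left part of FIG. 6, after four compression passes a Wallace tree comprising 7 elements is compressed to 2 elements. As mentioned above, the present disclosure uses the 7-2 Wallace tree compressor to compress 7 rows of partial products into two rows (the second intermediate result of the present disclosure) and uses an adder (for example a carry-lookahead adder) to obtain the mantissa result.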
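The compression of seven rows to two can be imitated with row-wide carry-save additions, where each 3-to-2 step is a bitwise full adder applied to whole rows at once. This is only a behavioural sketch for non-negative rows, not the column-wise 7-2 compressor of FIG. 6; the names `carry_save_add` and `compress_to_two` and the grouping order are assumptions.

```python
def carry_save_add(a: int, b: int, c: int):
    """Bitwise full adder on whole rows: returns (sum row, carry row)."""
    sum_row = a ^ b ^ c
    carry_row = ((a & b) | (a & c) | (b & c)) << 1   # carries move one bit up
    return sum_row, carry_row


def compress_to_two(rows):
    """Reduce rows of non-negative partial products to two rows.

    Returns (rows, number of compression passes). The arithmetic sum of
    the rows is preserved by every pass.
    """
    rows = list(rows)
    passes = 0
    while len(rows) > 2:
        next_rows = []
        for i in range(0, len(rows) - 2, 3):
            next_rows.extend(carry_save_add(rows[i], rows[i + 1], rows[i + 2]))
        # Rows that did not fill a group of three pass through unchanged.
        next_rows.extend(rows[len(rows) - len(rows) % 3:])
        rows = next_rows
        passes += 1
    return rows, passes
```

Starting from seven rows, the row count goes 7, 5, 4, 3, 2, which is four passes, in line with the four compression passes mentioned for FIG. 6; a final carry-propagate addition of the two remaining rows gives the product.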
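To further explain the principle of the solution, the following describes by way of example how the multiplier of the present disclosure completes the first-stage operation in the four operation modes FP16*FP16, BF16*BF16, FP32*FP32, and FP32*BF16, that is, up to the point where the Wallace tree compressor completes the summation of the intermediate result to obtain the second intermediate result:
(1) FP16*FP16
In this operation mode, the mantissa of the floating-point number has 10 bits. Taking into account non-normalized non-zero numbers under the IEEE 754 standard, it can be extended by 1 bit, giving an 11-bit mantissa. In addition, since the mantissa is an unsigned number, a 1-bit 0 can be added at the high end when the Booth encoding algorithm is used, so the total mantissa width is 12 bits. When Booth encoding is applied to the second floating-point number, i.e. the multiplier, with reference to the first floating-point number, the partial product generation circuit can obtain 7 partial products in each of the high and low parts, of which the seventh is 0, and each partial product is 24 bits wide. Compression can then be performed by the 48 7-2 Wallace trees, with the carry from the 23rd to the 24th Wallace tree being 0.
(2) BF16*BF16
In this operation mode, the mantissa of the floating-point number has 7 bits. Taking into account non-normalized non-zero numbers under the IEEE 754 standard and the extension to a signed number, the mantissa can be extended to 9 bits. When Booth encoding is applied to the second floating-point number, i.e. the multiplier, with reference to the first floating-point number, the partial product generation circuit can obtain 7 valid partial products in each of the high and low parts, of which the 6th and 7th are 0, and each partial product is 18 bits wide. Compression is performed using two groups of 7-2 Wallace trees, trees 0 to 17 and trees 24 to 41, with the carry from the 23rd to the 24th Wallace tree being 0.
(3) FP32*FP32
In this operation mode, the mantissa of the floating-point number may have 23 bits. Taking into account non-normalized non-zero numbers under the IEEE 754 standard and the extension to a signed number, the mantissa can be extended to 25 bits. To save the area of the multiplication unit, the bit width supported by the multiplier can be designed to be smaller, so that the multiplier is called twice in this operation mode to complete one operation. To this end, each mantissa multiplication is 25 bits * 13 bits: the first floating-point number ina is extended by one 0 bit into a 25-bit signed number, and the 24-bit mantissa of the second floating-point number inb is divided into high and low 12-bit parts, each extended by one 0 bit to give two 13-bit multipliers, denoted inb_high13 and inb_low13. In operation, the first call of the multiplier computes ina*inb_low13 and the second call computes ina*inb_high13. In each call, Booth encoding generates 7 valid partial products, each 38 bits wide, which are compressed by 7-2 Wallace trees 0 to 37.
(4) FP32*BF16
In this operation mode, the mantissa of the first floating-point number ina has 23 bits and the mantissa of the second floating-point number inb has 7 bits. Taking into account non-normalized non-zero numbers under the IEEE 754 standard and the extension to signed numbers, the mantissas can be extended to 25 bits and 9 bits respectively, and a 25-bit by 9-bit multiplication is performed, yielding 7 valid partial products, of which the 6th and 7th are 0. Each partial product is 34 bits wide, and compression is performed by Wallace trees 0 to 33.
The above examples describe how the multiplier of the present disclosure completes the first-stage operation in four operation modes, preferably using the Booth encoding algorithm and the 7-2 Wallace tree. Based on this description, those skilled in the art can understand that the present disclosure uses 7 partial products so that the 7-2 Wallace tree can be reused across the different operation modes.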
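The partial-product counts and widths quoted for the four modes can be cross-checked with a short calculation. The sketch assumes radix-4 Booth encoding, which produces one partial product per pair of multiplier bits, and a product width equal to the sum of the two extended mantissa widths; the helper name `booth_shape` and the dictionary `MODES` are hypothetical.

```python
def booth_shape(multiplicand_bits: int, multiplier_bits: int):
    """(possibly non-zero partial products, product bit width) for radix-4 Booth."""
    partial_products = (multiplier_bits + 1) // 2   # ceil(multiplier_bits / 2)
    product_bits = multiplicand_bits + multiplier_bits
    return partial_products, product_bits


# Extended mantissa widths (multiplicand, multiplier) per mode, from the text.
MODES = {
    "FP16*FP16": (12, 12),
    "BF16*BF16": (9, 9),
    "FP32*FP32 (per call)": (25, 13),
    "FP32*BF16": (25, 9),
}

for name, (wa, wb) in MODES.items():
    count, width = booth_shape(wa, wb)
    print(f"{name}: {count} partial products, {width} bits each")
```

Under these assumptions no mode needs more than seven partial products (6, 5, 7, and 5 respectively), which is consistent with padding to seven rows and reusing the same 7-2 Wallace trees in every mode.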
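The cases in which the multiplier of the present disclosure (its mantissa processing unit and exponent processing unit) is called multiple times are described more specifically below.
According to another aspect of the present disclosure, as shown in FIG. 3, the mantissa processing unit may include a control circuit 316, and the control circuit 316 can call the mantissa processing unit multiple times when the mantissa bit width of at least one of the two floating-point numbers is greater than the data bit width that the mantissa processing unit can process at one time. The data bit width that the mantissa processing unit can process at one time refers to the two bit widths it supports (for example, the multiplier bit width and the multiplicand bit width). It can therefore be understood that the control circuit determines that the mantissa processing unit is to be called multiple times, to obtain the mantissa after the multiplication operation, either according to the mantissa bit width of one of the two floating-point numbers and one of the two bit widths supported by the mantissa processing unit, or according to the mantissa bit widths of both floating-point numbers and both supported bit widths. Repeatedly calling the mantissa processing unit in this way avoids arranging large-area multiplier components to handle wide-mantissa operations, and also avoids the situation in which small-area multiplier components cannot handle wide-mantissa operations, which makes the multiplier more widely applicable while helping to reduce chip area.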
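The multiple-call idea can be sketched in software as slicing the wider mantissa into chunks, multiplying each chunk with the narrow unit, and accumulating the shifted results. This is a simplified illustration with unsigned chunks; the extra sign bit that the text adds for the Booth encoder is omitted, and the function name and the low-to-high chunk order are assumptions.

```python
def multiply_in_calls(multiplicand: int, multiplier: int,
                      multiplier_bits: int, chunk_bits: int):
    """Multiply by feeding the multiplier to a narrow unit chunk by chunk.

    Returns (product, number of calls).
    """
    calls = -(-multiplier_bits // chunk_bits)        # ceiling division
    mask = (1 << chunk_bits) - 1
    intermediate = 0                                 # accumulated result
    for call in range(calls):
        shift = call * chunk_bits
        chunk = (multiplier >> shift) & mask         # slice used in this call
        partial = multiplicand * chunk               # one call of the narrow unit
        intermediate += partial << shift             # shift, add, store back
    return intermediate, calls
```

For the FP32*FP32 example, a 24-bit mantissa split into two 12-bit halves needs two calls: one for the low half and one for the high half, whose result is shifted left by 12 bits before being added.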
根据本公开的第一实施例,所述两个浮点数包括第一浮点数和第二浮点数,所述尾数处理单元支持第一位宽和第二位宽,所述第一浮点数的尾数作为与所述第一位宽对应的第一输入,所述第二浮点数的尾数作为与所述第二位宽对应的第二输入,所述第一输入的位宽小于或等于所述第一位宽,所述控制电路用于当所述第二输入的位宽大于所述第二位宽时,多次调用所述尾数处理单元来获得所述乘法运算后的尾数。根据该实施例,已知两个输入中的一个的位宽固定小于或等于与其对应的尾数处理单元所支持的一个位宽,由此,只需判断另一个输入与对应的尾数处理单元所支持位宽的大小关系,即可确定是否多次调用尾数处理单元。
根据本公开的第二实施例,所述两个浮点数包括第一浮点数和第二浮点数,所述尾数处理单元支持第一位宽和第二位宽,所述第一浮点数的尾数作为与所述第一位宽对应的第一输入,所述第二浮点数的尾数作为与所述第二位宽对应的第二输入,所述控制电路用于当所述第一输入的位宽大于所述第一位宽且所述第二输入的位宽小于或等于所述第二位宽时、当所述第二输入的位宽大于所述第二位宽且所述第一输入的位宽小于或等于所述第一位宽时或者当所述第一输入的位宽大于所述第一位宽且所述第二输入的位宽大于所述第二位宽时,多次调用所述尾数处理单元来获得所述乘法运算后的尾数。根据该实施例,两个输入的位宽与尾数处理单元所支持的两个位宽的大小关系不确定,需要判断两个输入与各自对应的尾数处理单元所支持位宽的大小关系,来确定是否多次调用尾数处理单元。
根据该第二实施例,当所述第一浮点数的尾数位宽小于所述第二浮点数的尾数位宽并且所述第一位宽大于所述第二位宽时,或者当所述第一浮点数的尾数位宽大于所述第二浮点数的尾数位宽并且所述第一位宽小于所述第二位宽时,所述控制电路选择所述第一浮点数的尾数作为与所述第二位宽对应的所述第二输入并且选择所述第二浮点数的尾数作为与所述第一位宽对应的第一输入。应当理解,在两个浮点数的尾数无规则输入时,可以先将输入的两个浮点数的尾数根据大位宽对大位宽、小位宽对小位宽的策略与尾数处理单元支持的两个位宽进行匹配,以避免本可一次处理完成两个浮点数的尾数运算,却进行了多次调用。
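The matching strategy above (larger mantissa to the larger supported width, smaller to the smaller) can be sketched as a small decision function. This is only an illustration of the stated conditions; `match_inputs` and `needs_multiple_calls` are hypothetical names, and all widths are passed in as plain integers.

```python
def match_inputs(first_mantissa_bits: int, second_mantissa_bits: int,
                 first_width: int, second_width: int) -> tuple[int, int]:
    """Return (first_input_bits, second_input_bits) after the optional swap.

    The swap condition mirrors the text: swap when the narrower mantissa would
    otherwise face the wider port, or the wider mantissa the narrower port.
    """
    swap = (
        (first_mantissa_bits < second_mantissa_bits and first_width > second_width)
        or (first_mantissa_bits > second_mantissa_bits and first_width < second_width)
    )
    if swap:
        return second_mantissa_bits, first_mantissa_bits
    return first_mantissa_bits, second_mantissa_bits


def needs_multiple_calls(first_input_bits: int, second_input_bits: int,
                         first_width: int, second_width: int) -> bool:
    """True when either input exceeds the width supported for it."""
    return first_input_bits > first_width or second_input_bits > second_width


# A 25-bit and a 9-bit mantissa on a unit supporting 12-bit and 32-bit inputs.
inputs = match_inputs(25, 9, 12, 32)
print(inputs, needs_multiple_calls(*inputs, 12, 32))
```

Without the swap, the 25-bit mantissa would face the 12-bit port and trigger multiple invocations that the matched pairing avoids.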
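Further, when the bit width of the first input is greater than the first bit width and the bit width of the second input is less than or equal to the second bit width, the control circuit determines, from the bit width of the first input and the first bit width, the number of times the mantissa processing unit is invoked and the data input to the mantissa processing unit in each invocation. When the bit width of the second input is greater than the second bit width and the bit width of the first input is less than or equal to the first bit width, the control circuit determines these from the bit width of the second input and the second bit width. When the bit width of the first input is greater than the first bit width and the bit width of the second input is greater than the second bit width, the control circuit determines them from the bit width of the first input and the first bit width together with the bit width of the second input and the second bit width.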
在本公开中,关于第一浮点数和第二浮点数的描述只是为了区分两个浮点数,其中“第一”和“第二”不具有限定作用。同样地,关于第一位宽和第二位宽的描述只是为了区分尾数处理单元所支持的两个最大处理位宽,并且关于第一输入和第二输入的描述只是为了区分所述尾数处理单元的与所述两个最大处理位宽对应的两个输入,因此其中“第一”和“第二”都不具有限定作用。
值得注意的是,以上实施例描述的输入乘法器的浮点数是符合运算要求格式以及适用乘法器内部部件和外部部件的浮点数,即经过例如规格化等预处理的浮点数。应当理解,输入乘法器的浮点数可以是规格化或非规格化的浮点数,结合以上关于规格化单元的描述可知,如果输入的两个浮点数中的至少一个浮点数为非规格化的非零浮点数,可以首先通过规格化单元对所述至少一个浮点数进行规格化处理,以获得规格化后的指数和尾数,然后使用规格化后的尾数作为尾数处理单元的输入来进行上述的浮点数乘法运算。另外,本公开之前提到的布斯编码电路进行有符号定点数乘法计算,因此还需要对尾数前面扩展1位0,即将尾数变为有符号正数,然后使用扩展后的有符号尾数作为尾数处理单元的输入来进行上述的浮点数乘法运算。当然,还可以对浮点数进行其他的预处理,并将预处理后的浮点数的尾数作为尾数处理单元的输入来进行上述的浮点数乘法运算,例如以上关于规格化单元的描述中提到的为了适用运算模式而对浮点数进行的规格化,本公开的第一实施例和第二实施例同样适用于如上所述的根据运算模式进行浮点数的运算。
下面将详细说明根据本公开的上述第二实施例的多次调用尾数处理单元的三个示例。为了更清楚直观地理解这三个示例,上述第一输入例如可以是乘数,第二输入例如可以是被乘数,第一位宽例如可以是尾数处理单元所支持的最大乘数位宽,第二位宽例如可以是尾数处理单元所支持的最大被乘数位宽。
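According to a first example of invoking the mantissa processing unit multiple times in the present disclosure, and in combination with the floating-point multiplication according to operation mode described above, take as an example the case where the two floating-point numbers input to the multiplier of the present disclosure are denormalized non-zero floating-point numbers and the Booth encoding circuit used in the present disclosure performs signed fixed-point multiplication. The two floating-point numbers are first normalized, which extends each mantissa by 1 bit; then, to suit the Booth encoding circuit in the embodiments of the present disclosure, each mantissa is extended by a further 1 bit to form a signed number. After this preprocessing, the mantissas of the two floating-point numbers are matched to the inputs of the mantissa processing unit. When the bit width of the multiplier operand is greater than the maximum multiplier bit width and the bit width of the multiplicand is less than or equal to the maximum multiplicand bit width, the control circuit takes as the mantissa to be sliced the mantissa formed by only normalizing the original mantissa corresponding to the multiplier operand and, to suit the Booth encoding circuit, extends each sliced part with a sign bit. So that the mantissa processing unit can process the mantissa to be sliced, a part of width A-1 is sliced from it in each invocation, where A denotes the maximum multiplier bit width supported by the mantissa processing unit. Each sliced part of width A-1 is prefixed with one 0 bit as its sign to form a multiplier part of width A, which serves as one input to the mantissa processing unit in that invocation. The multiplicand (in this embodiment, a normalized and sign-extended mantissa) is supplied as the other input in every invocation. The number of invocations of the mantissa processing unit can therefore be determined with the following formula: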
n=ceil((B+1)/(A-1)),
其中,n代表调用尾数处理单元的次数,B代表未规格化且未扩展符号位的尾数的位宽,B+1代表对尾数规格化后的位宽,B+1也可理解为B+2-1,即乘数的位宽减去符号位的位宽,A代表乘数部分的位宽(尾数处理单元所支持的最大乘数位宽),A-1代表每次调用中从待截取尾数中截取的部分的位宽。
举例来说,尾数处理单元所支持的最大乘数位宽例如为8bit,最大被乘数位宽例如为32bit,输入乘法器的两个浮点数分别是FP32类型和BF16类型的浮点数,因此选择在FP32*BF16运算模式中进行乘法运算,并且两个浮点数是非规格化非零数,因此两个浮点数的尾数分别具有23bit和7bit的位宽,考虑IEEE754标准,则两个尾数的位宽可以扩展为24bit和8bit。为了适用于本公开实施例中的布斯编码电路,再将两个尾数扩展1比特0成为25bit和9bit的有符号数。因此 控制电路将位宽为9bit的尾数作为与最大乘数位宽对应的乘数并且将位宽为25bit的尾数作为与最大被乘数位宽对应的被乘数,由于仅乘数的位宽(9bit)大于最大乘数位宽(8bit),而被乘数的位宽(25bit)小于最大被乘数位宽(32bit),因此将该乘数所对应的原始尾数仅规格化后形成的尾数作为待截取尾数inb,则被乘数作为输入尾数处理单元的被乘数ina。根据以上公式,ceil((7+1)/(8-1))=2,因此,需要调用两次尾数处理单元,并且在每次调用时,在inb中每次截取7bit数据,最后一次调用(第二次调用)时,不足7bit数据,则将剩余数据全部截取并在前面补0凑齐7bit,并且每次截取的7bit数据扩展1比特0(符号位)成为8bit作为乘数部分inb_m,因此,在每次调用时进行的计算为ina*inb_m,即位宽为25bit的被乘数与位宽为8bit的乘数部分的乘法运算,从而可以计算得出该次调用所获得的尾数结果。值得注意的是,对待截取尾数的截取可以按照从高位到低位的顺序进行,也可以按照从低位到高位的顺序进行。值得注意的是,该示例同样适用于本公开上述第一实施例。
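The invocation count n = ceil((B+1)/(A-1)) and the high-to-low slicing in this example can be checked with a few lines of Python. This is a sketch only: `call_count` and `slice_high_to_low` are illustrative names, and the 8-bit value used for inb is an arbitrary sample, not data from the disclosure.

```python
import math


def call_count(raw_mantissa_bits: int, max_width: int) -> int:
    """n = ceil((B + 1) / (A - 1)): B raw mantissa bits, A supported width."""
    return math.ceil((raw_mantissa_bits + 1) / (max_width - 1))


def slice_high_to_low(value: int, total_bits: int, chunk_bits: int) -> list[int]:
    """Cut a total_bits-wide value into chunk_bits-wide parts, high bits first.

    The last part may hold fewer real bits; as an integer it is already
    zero-padded on the high side, and the leading 0 sign bit added for Booth
    encoding does not change its value either.
    """
    parts = []
    remaining = total_bits
    while remaining > 0:
        take = min(chunk_bits, remaining)
        remaining -= take
        parts.append((value >> remaining) & ((1 << take) - 1))
    return parts


# BF16 mantissa: B = 7 raw bits, 8 bits after normalization, A = 8.
inb = 0b10110101                      # hypothetical normalized 8-bit mantissa
print(call_count(7, 8))               # number of invocations
print(slice_high_to_low(inb, 8, 7))   # multiplier parts inb_m, high part first
```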
根据本公开的多次调用尾数处理单元的第二示例,结合以上描述的根据运算模式的浮点数乘法运算,以输入到本公开乘法器的两个浮点数为非规格化的非零浮点数为例,并结合本公开使用的布斯编码电路进行有符号定点数乘法运算的情况,首先将两个浮点数规格化,因此两个浮点数的尾数扩展1位,另外为了适用于本公开实施例中的布斯编码电路,再将两个尾数扩展1位而形成有符号数。在经过这些预处理后,将两个浮点数的尾数和尾数处理单元的输入进行匹配。因此,当被乘数的位宽大于最大被乘数位宽且乘数的位宽小于或等于最大乘数位宽时,所述控制电路将该被乘数对应的原始尾数仅规格化后形成的尾数作为待截取尾数,并且为了适用于本公开实施例中的布斯编码电路,对每次截取的部分扩展符号位。为了使得尾数处理单元可以处理该待截取尾数,在每次调用中从该尾数中截取位宽为C-1的部分,其中,C代表尾数处理单元所支持的最大被乘数位宽,对每次截取的位宽为C-1的部分在高位补充一位0作为符号形成位宽为C的被乘数部分,该被乘数部分作为在每次调用中输入尾数处理单元的一个输入。另外,所述乘数(在该实施例中,该乘数是规格化且扩展符号位的尾数)在每次调用中作为另一个输入而输入尾数处理单元。由此,可以使用以下公式来确定尾数处理单元的调用次数:
n=ceil((D+1)/(C-1)),
其中,n代表调用尾数处理单元的次数,D代表未规格化且未扩展符号位的尾数的位宽,D+1代表对尾数规格化后的位宽,D+1也可理解为D+2-1,即被乘数的位宽减去符号位的位宽,C代表被乘数部分的位宽(尾数处理单元所支持的最大被乘数位宽),C-1代表每次调用中从待截取尾数中截取的部分的位宽。
举例来说,尾数处理单元所支持的最大乘数位宽例如为12bit,最大被乘数位宽例如为16bit,输入乘法器的两个浮点数分别是FP32类型和BF16类型的浮点数,因此选择在FP32*BF16运算模式中进行乘法运算,并且两个浮点数是非规格化非零数,因此两个浮点数的尾数分别具有23bit和7bit的位宽,考虑IEEE754标准,则两个尾数的位宽可以扩展为24bit和8bit。为了适用于本公开实施例中的布斯编码电路,再将两个尾数扩展1比特0成为25bit和9bit的有符号数。因此控制电路将位宽为9bit的尾数作为与最大乘数位宽对应的乘数并且将位宽为25bit的尾数作为与最大被乘数位宽对应的被乘数,由于仅被乘数的位宽(25bit)大于尾数处理单元所支持的最大被乘数位宽(16bit),而乘数的位宽(9bit)小于最大乘数位宽(12bit),因此将该被乘数所对应的原始尾数仅规格化后形成的尾数作为待截取尾数ina,则乘数作为输入尾数处理单元的乘数inb。根据以上公式,ceil((23+1)/(16-1))=2,因此,需要调用两次尾数处理单元,并且在每次调用时,在ina中每次截取15bit数据,最后一次调用(第二次调用)时,不足15bit数据则在前面补0凑齐15bit,并且每次截取的15bit数据扩展1比特0(符号位)成为16bit作为被乘数部分ina_m,因此,在每次调用时进行的计算为ina_m*inb,即位宽为16bit的被乘数部分与位宽为9bit的乘数的乘法运算,从而可以计算得出该次调用所获得的尾数结果。值得注意的是,对待截取尾数的截取可以按照从高位到低位的顺序进行,也可以按照从低位到高位的顺序进行。值得注意的是,该示例同样适用于本公开上述第一实施例。
根据本公开的多次调用尾数处理单元的第三示例,结合以上描述的根据运算模式的浮点数 乘法运算,以输入到本公开乘法器的两个浮点数为非规格化的非零浮点数为例,并结合本公开使用的布斯编码电路进行有符号定点数乘法运算的情况,首先将两个浮点数规格化,因此两个浮点数的尾数扩展1位,另外为了适用于本公开实施例中的布斯编码电路,再将两个尾数扩展1位而形成有符号数。在经过这些预处理后,将两个浮点数的尾数和尾数处理单元的输入进行匹配。因此,当所述乘数的位宽大于所述最大乘数位宽且所述被乘数(在该实施例中,该被乘数是规格化且扩展符号位的尾数)的位宽大于所述最大被乘数位宽时,所述控制电路将该乘数对应的原始尾数仅规格化后形成的尾数和该被乘数对应的原始尾数仅规格化后形成的尾数作为待截取尾数,并且为了适用于本公开实施例中的布斯编码电路,对每次截取的部分扩展符号位。为了使得尾数处理单元可以处理这两个待截取尾数,在每次调用中分别从与乘数对应的待截取尾数中截取位宽为A-1的部分并且从与被乘数对应的待截取尾数中截取位宽为C-1的部分,其中,A代表尾数处理单元所支持的最大乘数位宽,C代表尾数处理单元所支持的最大被乘数位宽,对每次截取的位宽为A-1的部分在高位补充一位0作为符号形成位宽为A的乘数部分,该乘数部分作为在每次调用中输入尾数处理单元的一个输入,并且对每次截取的位宽为C-1的部分在高位补充一位0作为符号形成位宽为C的被乘数部分,该被乘数部分作为在每次调用中输入尾数处理单元的另一个输入。由此,可以使用以下公式来确定尾数处理单元的调用次数:
n=ceil((B+1)/(A-1))*ceil((D+1)/(C-1))
其中,n代表调用尾数处理单元的次数,B代表未规格化且未扩展符号位的尾数的位宽,B+1代表对尾数规格化后的位宽,B+1也可理解为B+2-1,即乘数的位宽减去符号位的位宽,A代表乘数部分的位宽(尾数处理单元所支持的最大乘数位宽),A-1代表每次调用中从与乘数对应的待截取尾数中截取的部分的位宽,D代表未规格化且未扩展符号位的尾数的位宽,D+1代表对尾数规格化后的位宽,D+1也可理解为D+2-1,即被乘数的位宽减去符号位的位宽,C代表被乘数部分的位宽(尾数处理单元所支持的最大被乘数位宽),C-1代表每次调用中从待截取尾数中截取的部分的位宽。
举例来说,尾数处理单元所支持的最大乘数位宽例如为8bit,最大被乘数位宽例如为16bit,输入乘法器的两个浮点数都是FP32类型的浮点数,因此选择在FP32*FP32运算模式中进行乘法运算,并且两个浮点数是非规格化非零数,因此两个浮点数的尾数位宽都为23bit,考虑IEEE754标准,则两个尾数的位宽可以扩展为24bit。为了适用于本公开实施例中的布斯编码电路,再将两个尾数扩展1比特0成为25bit的有符号数。因此控制电路将两个浮点数的尾数分别选择作为与最大乘数位宽对应的乘数和与最大被乘数位宽对应的被乘数(由于两个浮点数的尾数在扩展后位宽相同,因此任选一个作为乘数,另一个作为被乘数),由于所述乘数的位宽(25bit)大于所述最大乘数位宽(8bit)且所述被乘数的位宽(25bit)大于所述最大被乘数位宽(16bit),因此将乘数所对应的原始尾数规格化后形成的尾数作为待截取尾数inb并且将被乘数所对应的原始尾数规格化后形成的尾数作为待截取尾数ina。根据以上公式,ceil((23+1)/(8-1))*ceil((23+1)/(16-1))=8,因此,需要调用八次尾数处理单元。在每次调用时,在inb中每次截取7bit数据,最后一次调用时,不足7bit数据,则将剩余数据全部截取并在前面补0凑齐7bit,并且每次截取的7bit数据扩展1比特0(符号位)成为8bit作为乘数部分inb_m,由于将inb截取为四个部分,因此可以具有四个乘数部分inb_m1、inb_m2、inb_m3、inb_m4。另外在每次调用时,在ina中每次截取15bit数据,最后一次调用时,不足15bit数据,则将剩余数据全部截取并在前面补0凑齐15bit,并且每次截取的15bit数据扩展1比特0(符号位)成为16bit作为被乘数部分ina_m,由于将ina截取为两个部分,因此可以具有两个被乘数部分ina_m1、ina_m2。因此,例如在八次调用尾数处理单元时可以依次进行以下计算:ina_m1*inb_m1、ina_m1*inb_m2、ina_m1*inb_m3、ina_m1*inb_m4、ina_m2*inb_m1、ina_m2*inb_m2、ina_m2*inb_m3、ina_m2*inb_m4,当然也可以依次进行以下计算:inb_m1*ina_m1、inb_m1*ina_m2、inb_m2*ina_m1、inb_m2*ina_m2、inb_m3*ina_m1、inb_m3*ina_m2、inb_m4*ina_m1、inb_m4*ina_m2。每次调用进行的计算为位宽为16bit的被乘数部分与位宽为8bit的乘数部分的乘法运算,从而可以计算得出该次调用所获得的尾数结果。值得注意的是,对待截取尾数的截 取可以按照从高位到低位的顺序进行,也可以按照从低位到高位的顺序进行。
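When both operands must be sliced, the count is the product of the two ceilings, and the invocations enumerate every pairing of a multiplicand part with a multiplier part. The sketch below reproduces the count of 8 and the first of the two orderings listed above; `call_count_2d` and `schedule` are hypothetical helper names.

```python
import math


def call_count_2d(B: int, A: int, D: int, C: int) -> int:
    """n = ceil((B+1)/(A-1)) * ceil((D+1)/(C-1)), using the symbols of the text."""
    return math.ceil((B + 1) / (A - 1)) * math.ceil((D + 1) / (C - 1))


def schedule(multiplicand_parts: int, multiplier_parts: int) -> list[tuple[str, str]]:
    """Pair every multiplicand part with every multiplier part, ina-major order."""
    return [
        (f"ina_m{i}", f"inb_m{j}")
        for i in range(1, multiplicand_parts + 1)
        for j in range(1, multiplier_parts + 1)
    ]


# FP32*FP32 on a unit supporting an 8-bit multiplier and a 16-bit multiplicand.
print(call_count_2d(B=23, A=8, D=23, C=16))
for pair in schedule(multiplicand_parts=2, multiplier_parts=4):
    print(" * ".join(pair))
```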
以上示例仅仅用于说明性而非限制性的目的,根据这些示例,本领域技术人员可以想到在其它运算模式下多次调用最大支持任意位宽的尾数处理单元所进行的浮点数乘法运算。
针对以上多次调用尾数处理单元,所述尾数处理单元还可以包括移位加法电路,所述移位加法电路用于根据每次调用所述尾数处理单元所获得的尾数结果来获得所述乘法运算后的尾数。
进一步,所述移位加法电路包括移位器、中间存储器和加法器,当所述控制电路根据所述运算模式多次调用所述尾数处理单元时,在第一次调用后,所述移位器将第一次调用获得的尾数结果进行移位获得移位后尾数结果并将所述移位后尾数结果存入所述中间存储器中,从第二次调用开始,所述移位器将当次调用中获得的尾数结果进行移位获得当次尾数结果,所述加法器将所述当次尾数结果与存储在所述中间存储器中的结果相加并且将相加后的结果存储在所述中间存储器中来更新所述中间存储器,并且在最后一次调用后存储在所述中间存储器中的结果作为所述乘法运算后的尾数。
在该实施例中,例如,对待截取尾数的截取按照从高位到低位的顺序进行。在每次调用所述尾数处理单元时,所述移位器将当次调用中获得的尾数结果按照以下公式进行移位:
Y=k+j
其中,Y代表当次调用中获得的尾数结果所需进行的移位数,k代表在与乘数对应的待截取尾数中在当次调用所使用的截取部分后面的全部数据的位数之和,j代表在与被乘数对应的待截取尾数中在当次调用所使用的截取部分后面的全部数据的位数之和。应当理解,如果仅乘数的位宽大于最大乘数位宽或者仅被乘数的位宽大于最大被乘数位宽,则只需要对与乘数对应的待截取尾数或与被乘数对应的待截取尾数进行截取,而不需要截取的尾数每次调用时使用的是其全部数据,因此后面不存在数据,从而k或j的取值为0,由此可知对于仅乘数的位宽大于最大乘数位宽的情况,以上计算移位数的公式可以写为:Y=k,对于仅被乘数的位宽大于最大被乘数位宽的情况,以上计算移位数的公式可以写为:Y=j。
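For example, as described above, in the FP32*BF16 operation mode, when only the bit width of the multiplier operand exceeds the maximum multiplier bit width, the mantissa processing unit is invoked twice, and the mantissa to be sliced is, for example, sliced from the high bits to the low bits. Specifically, let the multiplier parts in the two invocations be inb_m1 and inb_m2. After the first invocation, the shifter shifts the result of ina*inb_m1 to the left. Because 7 bits are sliced in the first invocation, the bits that follow these 7 bits total k=8-7=1 bit; by the formula above, Y=1, so the result is shifted left by 1 bit to give R1, which is stored in the intermediate memory. After the second (last) invocation, the shifter shifts the result of ina*inb_m2 to the left. Because the second invocation slices the final 1 bit, no data follows the 1 bit used in this invocation; by the formula above, Y=0, so the shift is 0 bits, i.e. no shift, giving the result R2. The adder adds R2 to R1 stored in the intermediate memory and stores the sum in the intermediate memory to update it. Since the second invocation is the last one, the result stored in the intermediate memory after it is the mantissa after the multiplication. The shift-add circuit works in the same way for the case where only the bit width of the multiplicand exceeds the maximum multiplicand bit width.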
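For example, as described above, in the FP32*FP32 operation mode, when the bit width of the multiplier operand exceeds the maximum multiplier bit width and the bit width of the multiplicand exceeds the maximum multiplicand bit width, the mantissa processing unit is invoked eight times, and the mantissas to be sliced are, for example, sliced from the high bits to the low bits. Specifically, the multiplier parts are inb_m1, inb_m2, inb_m3 and inb_m4, the multiplicand parts are ina_m1 and ina_m2, and the eight invocations perform, in order: ina_m1*inb_m1, ina_m1*inb_m2, ina_m1*inb_m3, ina_m1*inb_m4, ina_m2*inb_m1, ina_m2*inb_m2, ina_m2*inb_m3, ina_m2*inb_m4.
In the first invocation, the shifter shifts the result of ina_m1*inb_m1 to the left. 7 bits are sliced from the mantissa to be sliced that corresponds to the multiplier operand, so the bits following them total k=24-7=17 bits; 15 bits are sliced from the mantissa to be sliced that corresponds to the multiplicand, so the bits following them total j=24-15=9 bits. By the formula above, Y=17+9=26, so the result is shifted left by 26 bits to give S1, which is stored in the intermediate memory.
In the second invocation, the shifter shifts the result of ina_m1*inb_m2 to the left. The next 7 bits are sliced from the mantissa corresponding to the multiplier operand, so k=24-7-7=10 bits, while the same 15 bits of the mantissa corresponding to the multiplicand are used as in the previous invocation, so j remains 24-15=9 bits. By the formula above, Y=10+9=19, so the result is shifted left by 19 bits to give S2. The adder adds S2 to S1 stored in the intermediate memory and stores the sum in the intermediate memory to update it.
The mantissa processing unit is invoked in this way up to the fourth invocation, in which the shifter shifts the result of ina_m1*inb_m4 to the left. The fourth invocation slices the last 3 bits of the mantissa corresponding to the multiplier operand, after which no data remains, so k=0; the same 15 bits of the mantissa corresponding to the multiplicand are used, so j remains 9 bits. By the formula above, Y=0+9=9, so the result is shifted left by 9 bits to give S4. The adder adds S4 to the result stored in the intermediate memory and stores the sum in the intermediate memory to update it.
The fifth to eighth invocations all use the last 9 bits of the mantissa corresponding to the multiplicand, after which no data remains, so j=0 in these invocations. In the fifth invocation, the shifter shifts the result of ina_m2*inb_m1 to the left. The same 7 bits of the mantissa corresponding to the multiplier operand are used as in the first invocation, so k=24-7=17 bits and Y=17+0=17; the result is shifted left by 17 bits to give S5, which the adder adds to the result stored in the intermediate memory, storing the sum to update the intermediate memory.
The mantissa processing unit is invoked in this way up to the eighth invocation, in which the shifter shifts the result of ina_m2*inb_m4 to the left. The eighth invocation slices the last 3 bits of the mantissa corresponding to the multiplier operand, after which no data remains, so k=0 and Y=0+0=0; the shift is 0 bits, i.e. no shift, giving the unshifted result S8. The adder adds S8 to the result stored in the intermediate memory and stores the sum to update the intermediate memory. Since the eighth invocation is the last one, the result stored in the intermediate memory after it is the mantissa after the multiplication.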
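The shift rule Y=k+j can be checked numerically: accumulating every partial result shifted by Y must reproduce the full 24 bit × 24 bit product. The following Python sketch models the shifter, adder and intermediate memory as ordinary integer operations; the names and the sample mantissa values are illustrative assumptions.

```python
def slice_with_trailing_bits(value: int, total_bits: int, chunk_bits: int):
    """Yield (part, bits_after_part) pairs, slicing from high bits to low bits."""
    remaining = total_bits
    while remaining > 0:
        take = min(chunk_bits, remaining)
        remaining -= take
        yield (value >> remaining) & ((1 << take) - 1), remaining


def multiply_by_parts(ina: int, inb: int, bits: int = 24,
                      multiplicand_chunk: int = 15, multiplier_chunk: int = 7):
    """Accumulate shifted partial results as the shift-add circuit would."""
    accumulator = 0                       # plays the role of the intermediate memory
    shifts = []
    for a_part, j in slice_with_trailing_bits(ina, bits, multiplicand_chunk):
        for b_part, k in slice_with_trailing_bits(inb, bits, multiplier_chunk):
            y = k + j                     # Y = k + j from the text
            shifts.append(y)
            accumulator += (a_part * b_part) << y
    return accumulator, shifts


ina, inb = 0xABCDEF, 0x923456             # hypothetical normalized 24-bit mantissas
result, shifts = multiply_by_parts(ina, inb)
print(shifts)
print(result == ina * inb)
```

The shift amounts for invocations 1, 2, 4, 5 and 8 come out as 26, 19, 9, 17 and 0, matching the walkthrough.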
另一方面,为了进一步减小乘法器的面积,所述指数处理单元包括第二控制电路(图中未示出),所述第二控制电路用于根据所述两个浮点数中的一个的指数位宽和所述指数处理单元所支持的两个位宽中的一个或者根据所述两个浮点数的指数位宽和所述指数处理单元所支持的两个位宽来确定多次调用所述指数处理单元以获得所述乘法运算后的指数。
根据本公开的第三实施例,所述两个浮点数包括第一浮点数和第二浮点数,所述指数处理单元支持第三位宽和第四位宽,所述第一浮点数的指数作为与所述第三位宽对应的第三输入,所述第二浮点数的指数作为与所述第四位宽对应的第四输入,所述第三输入的位宽小于或等于所述第三位宽,所述第二控制电路用于当所述第四输入的位宽大于所述第四位宽时,多次调用所述指数处理单元来获得所述乘法运算后的指数。根据该实施例,已知两个输入中的一个的位宽固定小于或等于与其对应的指数处理单元所支持的一个位宽,由此,只需判断另一个输入与对应的指数处理单元所支持位宽的大小关系,即可确定是否多次调用指数处理单元。
根据本公开的第四实施例,所述两个浮点数包括第一浮点数和第二浮点数,所述指数处理单元支持第三位宽和第四位宽,所述第一浮点数的指数作为与所述第三位宽对应的第三输入,所述第二浮点数的指数作为与所述第四位宽对应的第四输入,所述第二控制电路用于当所述第三输入的位宽大于所述第三位宽且所述第四输入的位宽小于或等于所述第四位宽时、当所述第四输入的位宽大于所述第四位宽且所述第三输入的位宽小于或等于所述第三位宽时或者当所述第三输入的位宽大于所述第三位宽且所述第四输入的位宽大于所述第四位宽时,多次调用所述 指数处理单元来获得所述乘法运算后的指数。根据该实施例,两个输入的位宽与指数处理单元所支持的两个位宽的大小关系不确定,需要判断两个输入与各自对应的指数处理单元所支持位宽的大小关系,来确定是否多次调用指数处理单元。
根据该第四实施例,当所述第一浮点数的指数位宽小于所述第二浮点数的指数位宽并且所述第三位宽大于所述第四位宽时,或者当所述第一浮点数的指数位宽大于所述第二浮点数的指数位宽并且所述第三位宽小于所述第四位宽时,所述第二控制电路选择所述第一浮点数的指数作为与所述第四位宽对应的所述第四输入并且选择所述第二浮点数的指数作为与所述第三位宽对应的第三输入。应当理解,在两个浮点数的指数无规则输入时,可以先将输入的两个浮点数的指数根据大位宽对大位宽、小位宽对小位宽的策略与指数处理单元支持的两个位宽进行匹配,以避免本可一次处理完成两个浮点数的指数运算,却进行了多次调用。
进一步地,当所述第三输入的位宽大于所述第三位宽且所述第四输入的位宽小于或等于所述第四位宽时、当所述第四输入的位宽大于所述第四位宽且所述第三输入的位宽小于或等于所述第三位宽时或者当所述第三输入的位宽大于所述第三位宽且所述第四输入的位宽大于所述第四位宽时,所述第二控制电路用于当所述第三输入的位宽小于或等于所述第四输入的位宽且所述第三位宽小于或等于所述第四位宽时,根据所述第四输入的位宽和所述第三位宽来确定调用所述指数处理单元的次数以及在每次调用中输入所述指数处理单元的数据。值得注意的是,以上三种情况下,指数处理单元的调用次数以及在每次调用中输入所述指数处理单元的数据都是根据第三输入和第四输入的位宽中的较大者与第三位宽和第四位宽中的较小者来确定。当然,当第三输入和第四输入的位宽相同或者第三位宽和第四位宽相同时,可以在相同位宽的两者中任选其一。
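In this embodiment, the terms first floating-point number and second floating-point number serve only to distinguish the two floating-point numbers; "first" and "second" are not limiting. Likewise, the terms third input and fourth input serve only to distinguish the two inputs of the exponent processing unit, and the terms third bit width and fourth bit width serve only to distinguish the two maximum processing bit widths supported by the exponent processing unit that correspond to its two inputs; "third" and "fourth" are therefore not limiting either.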
值得注意的是,以上实施例描述的输入乘法器的浮点数是符合运算要求格式以及适用乘法器内部部件和外部部件的浮点数,即经过例如规格化等预处理的浮点数。应当理解,输入乘法器的浮点数可以是规格化或非规格化的浮点数,结合以上关于规格化单元的描述可知,如果输入的两个浮点数中的至少一个浮点数为非规格化的非零浮点数,可以首先通过规格化单元对所述至少一个浮点数进行规格化处理,以获得规格化后的指数和尾数,然后使用规格化后的指数作为指数处理单元的输入来进行上述的浮点数乘法运算。当然,还可以对浮点数进行其他的预处理,并将预处理后的浮点数的指数作为指数处理单元的输入来进行上述的浮点数乘法运算,例如以上关于规格化单元的描述中提到的为了适用运算模式而对浮点数进行的规格化,本公开的第三实施例和第四实施例同样适用于如上所述的根据运算模式进行浮点数的运算。
下面将详细说明多次调用指数处理单元的示例。为了更清楚直观地理解该示例,上述第三输入例如可以是加数,第四输入例如可以是被加数,第三位宽例如可以是指数处理单元所支持的最大加数位宽,第四位宽例如可以是指数处理单元所支持的最大被加数位宽。
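According to the example of invoking the exponent processing unit multiple times in the present disclosure, and in combination with the floating-point multiplication according to operation mode described above, take as an example the case where the two floating-point numbers input to the multiplier of the present disclosure are denormalized non-zero floating-point numbers. The two floating-point numbers are first normalized, which extends each mantissa by 1 bit. After this preprocessing, the exponents of the two floating-point numbers are matched to the inputs of the exponent processing unit. When the bit width of the addend is greater than the maximum addend bit width and the bit width of the augend is less than or equal to the maximum augend bit width, when the bit width of the augend is greater than the maximum augend bit width and the bit width of the addend is less than or equal to the maximum addend bit width, or when the bit width of the addend is greater than the maximum addend bit width and the bit width of the augend is greater than the maximum augend bit width, the second control circuit can determine the number of invocations of the exponent processing unit with the following formula: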
m=ceil(P/(Q-1)),
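where m denotes the number of invocations of the exponent processing unit, P the bit width of the augend, Q the maximum addend bit width, and Q-1 the bit width of the part sliced from the addend and from the augend in each invocation. In each invocation, parts of width Q-1 are sliced from the addend and the augend simultaneously, so that parts of the same width and at the same bit positions are added. If a sliced part has fewer than Q-1 bits of data, or none, zeros are prepended, or the whole part is filled with zeros, to make up Q-1 bits. After one carry bit is prepended to each sliced part, they form the addend part and the augend part input to the exponent processing unit; Q is therefore also the bit width of the addend part and the augend part input to the exponent processing unit in each invocation.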
由此,第二控制电路可在每次调用指数处理单元时,从加数和被加数中按照相同的顺序截取Q-1位的部分作为指数处理单元的输入,通过指数处理单元获得该次调用的指数结果,并且在调用指数处理单元m次之后获得最终的指数。值得注意的是,上述相同的顺序可以是从高位到低位的顺序,也可以从低位到高位的顺序。
举例来说,加数的位宽为6bit,被加数的位宽为9bit,指数处理单元所支持的最大加数位宽和最大被加数位宽都为8bit。因此,调用指数处理单元的次数为ceil(9/(8-1))=2,并且首先将加数前面补0,使得加数的位宽和被加数的位宽相同,然后在每次调用中按照从高位到低位的顺序同时对加数和被加数截取位宽为7位的部分,并将这两个截取的部分分别扩展一位进位位,形成两个8位的带进位数据进行相加,在第二次调用(即最后一次调用)时,只能从加数和被加数中截取2位数据(只剩2位数据),因此,在第二次调用时截取的2位数据前补0凑齐7位,并且扩展一位进位位,形成两个8位的带进位数据进行相加。
值得注意的是,该示例中的当加数的位宽大于最大加数位宽且被加数的位宽小于或等于最大被加数位宽时和当被加数的位宽大于最大被加数位宽且加数的位宽小于或等于最大加数位宽时对指数处理单元的调用同样适用于本公开上述第三实施例。
根据实施例,所述指数处理单元还可以包括第二移位加法电路,所述第二移位加法电路用于根据每次调用所述指数处理单元所获得的指数结果来获得所述乘法运算后的指数。
进一步,所述第二移位加法电路包括第二移位器、第二中间存储器和第二加法器,当所述第二控制电路多次调用所述指数处理单元时,在第一次调用后,所述第二移位器将第一次调用获得的指数结果进行移位并将移位后的指数结果存入所述第二中间存储器中,从第二次调用指数处理单元开始,所述第二移位器将当次调用中获得的指数结果进行移位,所述第二加法器将移位后的指数结果与存储在第二中间存储器中的数值相加并且将相加后的结果存储在所述第二中间存储器中来更新所述第二中间存储器,并且将在最后一次调用中存储在所述第二中间存储器中的数值作为所述乘法运算后的指数。
在每次调用所述指数处理单元时,所述第二移位器将当次调用中获得的指数结果按照以下方式进行移位:若在调用指数处理单元时按照从高位到低位的顺序截取加数和被加数时,对当次调用从加数和被加数中所截取的部分向左移位,移位位数是当次调用中从被加数中截取的部分之后的部分的位数。
举例来说,结合以上示例,例如加数的位宽为6bit,被加数的位宽为9bit,指数处理单元所支持的最大加数位宽和最大被加数位宽都为8bit,在每次调用中按照从高位到低位的顺序同时对加数和被加数截取位宽为7位的部分。具体地,在第一次调用指数处理单元后,所述第二移位器将第一次调用获得的指数结果向左移2位(因为该次调用中被加数截取的部分之后有2位数据)并将移位后的指数结果存入所述第二中间存储器中,从第二次调用指数处理单元开始,所述第二移位器将当次调用中获得的指数结果向左移位,由于该次调用中截取的部分之后不再有数据,因此向左移0位,即不移位,所述第二加法器将移0位后的指数结果与存储在第二中间存储器中的数值相加并且将相加后的结果存储在所述第二中间存储器中来更新所述第二中间存储器,由于该第二次调用即为最后一次调用,因此在该第二次调用后存储在所述第二中间存储器中的数值即为所述乘法运算后的指数。
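The exponent example can be traced the same way: m=ceil(P/(Q-1)) invocations, each adding two (Q-1)-bit slices, with every partial sum shifted left by the number of bits that follow the slice. The sketch below is an assumption-laden model rather than the circuit itself; `add_by_parts` is an illustrative name, and the two sample exponent values are arbitrary.

```python
import math


def add_by_parts(addend: int, augend: int, P: int = 9, Q: int = 8):
    """Add two exponents slice by slice, high bits first.

    P is the augend width and Q the supported input width (one carry bit plus
    Q - 1 data bits). The addend is treated as zero-extended to P bits.
    """
    chunk = Q - 1
    calls = math.ceil(P / chunk)          # m = ceil(P / (Q - 1))
    accumulator = 0                       # second intermediate memory
    remaining = P
    for _ in range(calls):
        take = min(chunk, remaining)
        remaining -= take
        mask = (1 << take) - 1
        a_part = (addend >> remaining) & mask
        b_part = (augend >> remaining) & mask
        partial = a_part + b_part         # fits in Q bits thanks to the carry bit
        accumulator += partial << remaining   # shift by the bits after the slice
    return accumulator, calls


total, calls = add_by_parts(0b101101, 0b110010011)   # 6-bit addend, 9-bit augend
print(calls, total == 0b101101 + 0b110010011)
```

In the first invocation the shift is 2 bits and in the second it is 0, as in the example above.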
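From the above detailed description of the cases where the multiplier of the present disclosure (mantissa processing unit and exponent processing unit) is invoked multiple times, it follows that the control circuit may include multiple sub-modules, each used to perform one of the operations involved in multiple invocations, such as determining that the mantissa processing unit is to be invoked multiple times, determining the number of invocations, determining the data input to the mantissa processing unit in each invocation, judging whether the mantissa bit widths match the bit widths supported by the mantissa processing unit, and adjusting the mantissa inputs. The second control circuit may likewise include multiple sub-modules that perform the respective operations involved in multiple invocations.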
上文结合图4-图6详细描述了本披露的乘法器在执行浮点运算时,对第一浮点数和第二浮点数的尾数相乘所执行的操作。当然,图4为了注重描述本披露乘法器的尾数处理单元的操作,并没有绘出其他的单元,例如指数处理单元和符号处理单元,并对其进行描述。下面将结合图7对本披露的乘法器进行整体上的描述,对于前文针对尾数处理单元所做的描述,同样也适用于图7所绘的情形。
图7是示出根据本披露实施例的乘法器700的整体示意框图。需要理解的是图中绘出的各类单元的位置、存在和连接关系仅仅是示例性的而非限制性的,例如其中的一些单元可以集成,而另一些单元也可以分离或依应用场景的不同而被省略或替换。
本披露的乘法器在每种运算模式的操作中按操作流程可以示例性地分为第一阶段和第二阶段,如图中的虚线所绘出的。概括来说,在第一阶段中:输出符号位的计算结果,输出指数位的中间计算结果,输出尾数位的中间计算结果(例如包括前述的输入尾数位定点乘法布斯算法的编码过程和华莱士树压缩过程)。在第二阶段中:对指数和尾数进行规则化和舍入操作,以输出指数的计算结果和输出尾数的计算结果。
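As shown in FIG. 7, the multiplier of the present disclosure may include a mode selection unit 702 and a normalization processing unit 704, where the mode selection unit can select the operation mode according to an input mode signal (in_mode). In one embodiment, the input mode signal may correspond to the operation mode numbers in Table 2. For example, when the input mode signal indicates operation mode number "1" in Table 2, the multiplier works in the FP16*FP16 operation mode, and when the input mode signal indicates operation mode number "3" in Table 2, the multiplier works in the FP32*FP32 operation mode. For the purpose of illustration, FIG. 7 shows only four exemplary operation modes: FP16*FP16, BF16*BF16, FP32*FP32 and FP32*BF16. As noted earlier, however, the multiplier of the present disclosure also supports various other operation modes.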
规格化处理单元可以配置成用于当第一浮点数或第二浮点数为非规格化的非零浮点数时,根据运算模式,对第一浮点数或第二浮点数进行规格化处理,以获得对应的指数和尾数,例如按照IEEE754标准、对运算模式所指示的数据格式的浮点数进行规则化处理。
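Further, the multiplier includes a mantissa processing unit to multiply the mantissa of the first floating-point number by the mantissa of the second floating-point number. To this end, in one or more embodiments, the mantissa processing unit may include a bit extension circuit 706, a Booth encoder 708, a partial product generation circuit 710, a Wallace tree compressor 712 and an adder 714. The bit extension circuit can extend the mantissa of at least one of the first floating-point number and the second floating-point number, for example by prepending 0 at the high end, to suit the operation of the Booth encoder. The control circuit can perform the multiple invocations of the mantissa processing unit described above on the mantissa obtained after the bit extension circuit has extended it with a sign bit. Since the Booth encoder, the partial product generation circuit, the Wallace tree compressor and the adder have already been described in detail with reference to FIGS. 4-6, the same description applies here and is not repeated.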
在一些实施例中,本披露的乘法器还包括规则化单元716和舍入单元718,该规则化单元和舍入单元具有与图3中所示出的单元相同的功能。具体地,对于规则化单元,其可以根据如图7中所示的输出模式信号“out_mode”所指示的数据格式来对所述加和结果和来自于指数处理单元的指数数据进行浮点数规则化处理以获得规则化指数结果和规则化尾数结果。例如,根据输出模式信号所指示的数据格式,规则化单元可以调整指数和尾数的位宽,以使其符合前述指示的数据格式的要求。再例如,当尾数的最高位为0,且该尾数不为0,则规则化单元可以重复将尾数左移1位,并且指数减1,直到最高位数值为1。对于舍入单元,在一个实施例中,其可以用于根据舍入模式对所述规则化尾数结果执行舍入操作以获得舍入后的尾数,并将舍入后的尾数作为所述乘法运算后的尾数。
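The shift-until-leading-one behaviour described for the regularization unit can be written as a short loop. This is a behavioural sketch only; `regularize` is a hypothetical name, and the mantissa width is passed explicitly because the text does not fix it.

```python
def regularize(mantissa: int, exponent: int, width: int) -> tuple[int, int]:
    """Shift the mantissa left until its top bit is 1, decrementing the exponent.

    A zero mantissa is returned unchanged, since the text applies the shift
    only when the mantissa is non-zero.
    """
    if mantissa == 0:
        return mantissa, exponent
    top_bit = 1 << (width - 1)
    while not mantissa & top_bit:
        mantissa <<= 1
        exponent -= 1
    return mantissa, exponent


print(regularize(0b0011, 5, width=4))
```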
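In one or more embodiments, the aforementioned output mode signal may be part of the operation mode and indicates the data format after the multiplication. For example, as described in Table 3 above, when the operation mode number is "12", the digit "1" may correspond to the aforementioned "in_mode" signal and indicates an FP16*FP16 multiplication, while the digit "2" may correspond to the "out_mode" signal and indicates that the data type of the output result is BF16. It can therefore be understood that in some application scenarios the output mode signal can be merged with the input mode signal and provided to the mode selection unit. Based on this merged mode signal, the mode selection unit can establish the data formats of the input data and the output result at the initial stage of the multiplier's operation, without separately providing the output mode signal to the regularization unit, which further simplifies operation.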
在一个或多个实施例中,对于前述的舍入操作,可以示例性包括如下5种舍入模式。
(1)舍入到最接近的值:在此模式下,当两个值同样接近的情况下,偶数优先。此时会将结果舍入为最接近且可以表示的值,但是当存在两个数同样接近的时候,则取其中的偶数作为舍入结果(在二进制中是以0结尾的数);
(2)四舍五入:示例性操作参见下面的例子;
(3)朝+∞方向舍入:在此规则下,会将结果朝正无限大的方向舍入;
(4)朝-∞方向舍入:在此规则下,会将结果朝负无限大的方向舍入;以及
(5)朝0方向舍入:在此规则下,会将结果朝0的方向舍入。
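As an example of mantissa rounding in the "round half up" mode: the 24-bit mantissas of two normalized floating-point numbers are multiplied to give a 48-bit mantissa (bits 47 to 0). After regularization (if the most significant bit of the mantissa is 0, the mantissa is shifted left by 1 bit; if the most significant bit is 1, the mantissa is left unchanged and the temporary exponent obtained earlier is incremented by 1), only bits 46 to 24 are output. When bit 23 of the mantissa is 0, bits 23 to 0 are discarded; when bit 23 is 1, a carry of 1 is added into bit 24 and bits 23 to 0 are discarded.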
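The "round half up" example can be followed step by step in code. The sketch below implements exactly the steps listed (adjust on bit 47, keep bits 46 to 24, round on bit 23); `round_half_up_48` is an illustrative name, and the case where the rounding carry overflows the 23 kept bits is deliberately left unhandled because the passage does not describe it.

```python
def round_half_up_48(product: int, exponent: int) -> tuple[int, int]:
    """Reduce a 48-bit mantissa product to 23 fraction bits, rounding on bit 23."""
    if (product >> 47) & 1:
        exponent += 1                 # top bit set: keep mantissa, bump exponent
    else:
        product <<= 1                 # top bit clear: shift left by one
    kept = (product >> 24) & ((1 << 23) - 1)   # bits 46..24
    if (product >> 23) & 1:
        kept += 1                     # carry into bit 24 (overflow not handled here)
    return kept, exponent


# 1.5 * 1.5 = 2.25: both 24-bit mantissas are 0xC00000.
print(round_half_up_48(0xC00000 * 0xC00000, 0))
```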
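Returning to FIG. 7, the multiplier of the present disclosure further includes an exponent processing unit 720 and a sign processing unit 722. The exponent processing unit can obtain the exponent after the multiplication from the operation mode, the exponent of the first floating-point number and the exponent of the second floating-point number. For example, the exponent processing unit can add the exponent bit data of the first floating-point number, the exponent bit data of the second floating-point number and the offset values of the respective input floating-point data types, and subtract the offset value of the output floating-point data type, to obtain the exponent bit data of the product of the first floating-point number and the second floating-point number. In one or more embodiments, the exponent processing unit may be implemented as, or include, an addition and subtraction circuit that obtains the exponent after the multiplication from the operation mode, the exponent of the first floating-point number and the exponent of the second floating-point number.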
符号处理单元在一个实施例中可以实现为异或电路,其用于对所述第一浮点数和第二浮点数的符号位数据执行异或操作,以获得所述第一浮点数和第二浮点数的乘积的符号位数据。
上文结合图7对本披露的乘法器整体进行了详细的描述。通过该描述,本领域技术人员可以理解本披露的乘法器支持多种运算模式下的操作,从而克服了现有技术中仅支持单一浮点型运算的乘法器的缺陷。进一步,由于本披露的乘法器可以复用,因此也支持高位宽的浮点型数据,降低了运算成本和开销。在一个或多个实施例中,本披露的乘法器还可以布置成或包括于集成电路芯片或计算装置中,以实现在多种运算模式下对浮点数执行乘法运算。
图8是示出根据本披露实施例的使用乘法器执行浮点数乘法运算的方法800的流程图。可以理解的是此处所述的乘法器即前面结合图1-图7详细描述的乘法器,因此在前关于该乘法器及其内部组成、功能和操作的描述也同样适用于此处的描述。
如图8中所示,所述方法800可以包括在步骤S802处利用所述乘法器的指数处理单元来根据运算模式、第一浮点数的指数和第二浮点数的指数获得所述乘法运算后的指数。正如前所述,该运算模式可以是多种运算模式中的一种,并且可以用于指示浮点数的数据格式。在一个或多个实施例中,该运算模式还可以用于确定输出结果的浮点数的数据格式。
接着,在步骤S804处,该方法800可以利用乘法器的尾数处理单元来根据所述运算模式、第一浮点数和第二浮点数获得所述乘法运算后的尾数。关于尾数的示例性操作,本披露在一些优选的实施例中使用了布斯编码算法和华莱士树压缩器,从而提高尾数处理的效率。另外,当第一浮点数和第二浮点数是有符号数时,方法800还可以在步骤S806中用于根据第一浮点数的符号和第二浮点数的符号获得乘法运算后的符号。
尽管上述方法以步骤形式示出利用本披露的乘法器来执行浮点数乘法运算,但这些步骤顺序并不意味着本方法的步骤必须依所述顺序来执行,而是可以以其他顺序或并行的方式来处理。另外,此处为了描述的简明而没有阐述方法800的其他步骤,但本领域技术人员根据本披露的内容可以理解该方法也可以通过使用乘法器来执行前述结合图1-图7描述的各种操作。
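In the above embodiments of the present disclosure, each embodiment is described with its own emphasis; for parts not detailed in one embodiment, refer to the related descriptions of other embodiments. The technical features of the above embodiments can be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; however, as long as a combination of these technical features involves no contradiction, it should be regarded as within the scope of this specification.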
图9是示出根据本披露实施例的一种组合处理装置900的结构图。如图所示,该组合处理装置900包括计算装置902,该计算装置可以包括如前结合附图描述的本披露的乘法器。另外,该组合处理装置还包括通用互联接口904和其他处理装置906。根据本披露的计算装置与其他处理装置进行交互,共同完成用户指定的操作。
根据本披露的方案,该其他处理装置可以包括中央处理器(“CPU”)、图形处理器(“GPU”)、神经网络处理器等通用和/或专用处理器中的一种或多种类型的处理器,其数目不做限制而是依实际需要来确定。在一个或多个实施例中,该其他处理装置可以作为本披露的计算装置(其可以具体化为机器学习运算装置)与外部数据和控制的接口,执行包括但不限于数据搬运,完成对本机器学习运算装置的开启、停止等的基本控制;其他处理装置也可以和机器学习运算装置协作共同完成运算任务。
根据本披露的方案,该通用互联接口可以用于在计算装置与其他处理装置间传输数据和控制指令。例如,该计算装置可以经由所述通用互联接口从其他处理装置中获取所需的输入数据,写入该计算装置片上的存储装置。进一步,该计算装置可以经由所述通用互联接口从其他处理装置中获取控制指令,写入计算装置片上的控制缓存。替代地或可选地,通用互联接口也可以读取计算装置的存储模块中的数据并传输给其他处理装置。
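Optionally, the combined processing device may further include a storage device 908, which can be connected to the computing device and to the other processing devices respectively. In one or more embodiments, the storage device can store data of the computing device and of the other processing devices, and is particularly suited to data to be operated on that cannot be held entirely in the internal storage of the computing device or of the other processing devices.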
根据应用场景的不同,本披露的组合处理装置可以作为手机、机器人、无人机、视频采集、视频监控设备等设备的SOC片上系统,从而有效地降低控制部分的核心面积,提高处理速度并降低整体的功耗。在此情况时,该组合处理装置的通用互联接口与设备的某些部件相连接。此处的某些部件可以例如是摄像头,显示器,鼠标,键盘,网卡或wifi接口。
在一些实施例里,本披露还公开了一种芯片或集成电路芯片,其包括了上述计算装置、组合处理装置以及本披露的乘法器。在另一些实施例里,本披露还公开了一种芯片封装结构,其包括了上述芯片。
在一些实施例里,本披露还公开了一种板卡,其包括了上述芯片封装结构。参阅图10,其提供了前述的示例性板卡,上述板卡除了包括上述芯片1002以外,还可以包括其他的配套部件,该配套部件可以包括但不限于:存储器件1004、接口装置1006和控制器件1008。
所述存储器件与所述芯片封装结构内的芯片通过总线连接,用于存储数据。所述存储器件可以包括多组存储单元1010。每一组所述存储单元与所述芯片通过总线连接。可以理解,每一组所述存储单元可以是DDR SDRAM(“Double Data Rate SDRAM”,双倍速率同步动态随机存储器)。
DDR不需要提高时钟频率就能加倍提高SDRAM的速度。DDR允许在时钟脉冲的上升沿和下降沿读出数据。DDR的速度是标准SDRAM的两倍。在一个实施例中,所述存储器件可以包括4组所述存储单元。每一组所述存储单元可以包括多个DDR4颗粒(芯片)。在一个实施例中,所述芯片内部可以包括4个72位DDR4控制器,上述72位DDR4控制器中64bit用于传输数据,8bit用于ECC校验。
在一个实施例中,每一组所述存储单元可以包括多个并联设置的双倍速率同步动态随机存储器。DDR在一个时钟周期内可以传输两次数据。在所述芯片中设置控制DDR的控制器,用于对每个所述存储单元的数据传输与数据存储的控制。
所述接口装置与所述芯片封装结构内的芯片电连接。所述接口装置用于实现所述芯片与外部设备1012(例如服务器或计算机)之间的数据传输。例如在一个实施例中,所述接口装置可以 为标准PCIE接口。例如,待处理的数据由服务器通过标准PCIE接口传递至所述芯片,实现数据转移。在另一个实施例中,所述接口装置还可以是其他的接口,本披露并不限制上述其他的接口的具体表现形式,所述接口单元能够实现转接功能即可。另外,所述芯片的计算结果仍由所述接口装置传送回外部设备(例如服务器)。
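The control device is electrically connected to the chip in order to monitor the state of the chip. Specifically, the chip and the control device can be electrically connected through an SPI interface. The control device may include a microcontroller unit ("MCU", Micro Controller Unit). The chip may include multiple processing chips, multiple processing cores or multiple processing circuits, and can drive multiple loads. The chip can therefore be in different working states such as multi-load and light-load. The control device can regulate the working states of the multiple processing chips, multiple processing cores and/or multiple processing circuits in the chip.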
在一些实施例里,本披露还公开了一种电子设备或装置,其包括了上述板卡。根据不同的应用场景,电子设备或装置可以包括数据处理装置、机器人、电脑、打印机、扫描仪、平板电脑、智能终端、手机、行车记录仪、导航仪、传感器、摄像头、服务器、云端服务器、相机、摄像机、投影仪、手表、耳机、移动存储、可穿戴设备、交通工具、家用电器、和/或医疗设备。所述交通工具包括飞机、轮船和/或车辆;所述家用电器包括电视、空调、微波炉、冰箱、电饭煲、加湿器、洗衣机、电灯、燃气灶、油烟机;所述医疗设备包括核磁共振仪、B超仪和/或心电图仪。
需要说明的是,对于前述的各方法实施例,为了简单描述,故将其都表述为一系列的动作组合,但是本领域技术人员应该知悉,本披露并不受所描述的动作顺序的限制,因为依据本披露,某些步骤可以采用其他顺序或者同时进行。其次,本领域技术人员也应该知悉,说明书中所描述的实施例均属于可选实施例,所涉及的动作和模块并不一定是本披露所必须的。
在本披露所提供的几个实施例中,应该理解到,所披露的装置,可通过其它的方式实现。例如,以上所描述的装置实施例仅仅是示意性的,例如所述单元的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,例如多个单元或组件可以结合或者可以集成到另一个系统,或一些特征可以忽略,或不执行。另一点,所显示或讨论的相互之间的耦合或直接耦合或通信连接可以是通过一些接口,装置或单元的间接耦合或通信连接,可以是电性、光学、声学、磁性或其它的形式。
所述作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部单元来实现本实施例方案的目的。
另外,在本披露各个实施例中的各功能单元可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以两个或两个以上单元集成在一个单元中。上述集成的单元既可以采用硬件的形式实现,也可以采用软件程序模块的形式实现。
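If the integrated unit is implemented as a software program module and sold or used as an independent product, it can be stored in a computer-readable memory. On this understanding, when the technical solution of the present disclosure is embodied as a software product, that computer software product is stored in a memory and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, etc.) to execute all or some of the steps of the methods described in the embodiments of the present disclosure. The aforementioned memory includes various media that can store program code, such as a USB flash drive, read-only memory ("ROM", Read-Only Memory), random access memory ("RAM", Random Access Memory), a removable hard disk, a magnetic disk or an optical disc.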
依据以下条款可更好地理解前述内容:
条款A1,一种乘法器,用于进行浮点数的乘法运算,其中,所述乘法器包括:尾数处理单元,用于根据所述浮点数的尾数来获得所述乘法运算后的尾数,所述尾数处理单元包括控制电路,所述控制电路用于在两个浮点数中的至少一个的尾数位宽大于所述尾数处理单元一次可处理的数据位宽时,多次调用所述尾数处理单元。
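Clause A2. The multiplier according to clause A1, wherein the two floating-point numbers include a first floating-point number and a second floating-point number, the mantissa processing unit supports a first bit width and a second bit width, the mantissa of the first floating-point number serves as a first input corresponding to the first bit width, the mantissa of the second floating-point number serves as a second input corresponding to the second bit width, the bit width of the first input is less than or equal to the first bit width, and the control circuit is configured to invoke the mantissa processing unit multiple times to obtain the mantissa after the multiplication when the bit width of the second input is greater than the second bit width.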
条款A3,根据条款A1或条款A2所述的乘法器,其中,所述两个浮点数包括第一浮点数和第二浮点数,所述尾数处理单元支持第一位宽和第二位宽,所述第一浮点数的尾数作为与所述第一位宽对应的第一输入,所述第二浮点数的尾数作为与所述第二位宽对应的第二输入,所述控制电路用于当所述第一输入的位宽大于所述第一位宽且所述第二输入的位宽小于或等于所述第二位宽时、当所述第二输入的位宽大于所述第二位宽且所述第一输入的位宽小于或等于所述第一位宽时或者当所述第一输入的位宽大于所述第一位宽且所述第二输入的位宽大于所述第二位宽时,多次调用所述尾数处理单元来获得所述乘法运算后的尾数。
条款A4,根据条款A1-A3任一项所述的乘法器,其中,当所述第一浮点数的尾数位宽小于所述第二浮点数的尾数位宽并且所述第一位宽大于所述第二位宽时,或者当所述第一浮点数的尾数位宽大于所述第二浮点数的尾数位宽并且所述第一位宽小于所述第二位宽时,所述控制电路选择所述第一浮点数的尾数作为与所述第二位宽对应的所述第二输入并且选择所述第二浮点数的尾数作为与所述第一位宽对应的第一输入。
条款A5,根据条款A1-A4任一项所述的乘法器,其中,当所述第一输入的位宽大于所述第一位宽且所述第二输入的位宽小于或等于所述第二位宽时,所述控制电路根据所述第一输入的位宽和所述第一位宽来确定调用所述尾数处理单元的次数以及在每次调用中输入所述尾数处理单元的数据。
条款A6,根据条款A1-A5任一项所述的乘法器,其中,当所述第二输入的位宽大于所述第二位宽且所述第一输入的位宽小于或等于所述第一位宽时,所述控制电路根据所述第二输入的位宽和所述第二位宽来确定调用所述尾数处理单元的次数以及在每次调用中输入所述尾数处理单元的数据。
条款A7,根据条款A1-A6任一项所述的乘法器,其中,当所述第一输入的位宽大于所述第一位宽且所述第二输入的位宽大于所述第二位宽时,所述控制电路根据所述第一输入的位宽和所述第一位宽以及所述第二输入的位宽和所述第二位宽来确定调用所述尾数处理单元的次数以及在每次调用中输入所述尾数处理单元的数据。
条款A8,根据条款A1-A7任一项所述的乘法器,其中,所述尾数处理单元还包括移位加法电路,所述移位加法电路用于根据每次调用所述尾数处理单元所获得的尾数结果来获得所述乘法运算后的尾数。
条款A9,根据条款A1-A8任一项所述的乘法器,其中,所述移位加法电路包括移位器、中间存储器和加法器,当所述控制电路多次调用所述尾数处理单元时,在第一次调用后,所述移位器将第一次调用获得的尾数结果进行移位获得移位后尾数结果并将所述移位后尾数结果存入所述中间存储器中,从第二次调用开始,所述移位器将当次调用中获得的尾数结果进行移位获得当次尾数结果,所述加法器将所述当次尾数结果与存储在所述中间存储器中的结果相加并且将相加后的结果存储在所述中间存储器中来更新所述中间存储器,并且在最后一次调用后存储在所述中间存储器中的结果作为所述乘法运算后的尾数。
条款A10,根据条款A1-A9任一项所述的乘法器,其中,所述乘法器还包括指数处理单元,所述指数处理单元用于根据所述两个浮点数的指数来获得所述乘法运算后的指数,所述指数处理单元包括第二控制电路,所述第二控制电路用于根据所述两个浮点数中的一个的指数位宽和所述指数处理单元所支持的两个位宽中的一个或者根据所述两个浮点数的指数位宽和所述指数处理单元所支持的两个位宽来确定多次调用所述指数处理单元以获得所述乘法运算后的指数。
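Clause A11. The multiplier according to any one of clauses A1-A10, wherein the two floating-point numbers include a first floating-point number and a second floating-point number, the exponent processing unit supports a third bit width and a fourth bit width, the exponent of the first floating-point number serves as a third input corresponding to the third bit width, the exponent of the second floating-point number serves as a fourth input corresponding to the fourth bit width, the bit width of the third input is less than or equal to the third bit width, and the second control circuit is configured to invoke the exponent processing unit multiple times to obtain the exponent after the multiplication when the bit width of the fourth input is greater than the fourth bit width.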
条款A12,根据条款A1-A11任一项所述的乘法器,其中,所述两个浮点数包括第一浮点数和第二浮点数,所述指数处理单元支持第三位宽和第四位宽,所述第一浮点数的指数作为与所述第三位宽对应的第三输入,所述第二浮点数的指数作为与所述第四位宽对应的第四输入,所述第二控制电路用于当所述第三输入的位宽大于所述第三位宽且所述第四输入的位宽小于或等于所述第四位宽时、当所述第四输入的位宽大于所述第四位宽且所述第三输入的位宽小于或等于所述第三位宽时或者当所述第三输入的位宽大于所述第三位宽且所述第四输入的位宽大于所述第四位宽时,多次调用所述指数处理单元来获得所述乘法运算后的指数。
条款A13,根据条款A1-A12任一项所述的乘法器,其中,当所述第一浮点数的指数位宽小于所述第二浮点数的指数位宽并且所述第三位宽大于所述第四位宽时,或者当所述第一浮点数的指数位宽大于所述第二浮点数的指数位宽并且所述第三位宽小于所述第四位宽时,所述第二控制电路选择所述第一浮点数的指数作为与所述第四位宽对应的所述第四输入并且选择所述第二浮点数的指数作为与所述第三位宽对应的第三输入。
条款A14,根据条款A1-A13任一项所述的乘法器,其中,所述第二控制电路用于当所述第三输入的位宽小于或等于所述第四输入的位宽且所述第三位宽小于或等于所述第四位宽时,根据所述第四输入的位宽和所述第三位宽来确定调用所述指数处理单元的次数以及在每次调用中输入所述指数处理单元的数据。
条款A15,根据条款A1-A14任一项所述的乘法器,其中,所述指数处理单元还包括第二移位加法电路,所述第二移位加法电路用于根据每次调用所述指数处理单元所获得的指数结果来获得所述乘法运算后的指数。
条款A16,根据条款A1-A15任一项所述的乘法器,其中,所述尾数处理单元包括部分积运算单元和部分积求和单元,其中所述部分积运算单元用于根据所述两个浮点数的尾数获得中间结果,所述部分积求和单元用于将所述中间结果进行加和运算以获得加和结果,并将所述加和结果作为所述乘法运算后的尾数。
条款A17,根据条款A1-A16任一项所述的乘法器,其中,所述部分积运算单元包括布斯编码电路,所述布斯编码电路用于对所述第一浮点数或所述第二浮点数的尾数进行布斯编码处理,以获得所述中间结果。
条款A18,根据条款A1-A17任一项所述的乘法器,其中,所述部分积求和单元包括加法器,所述加法器用于对所述中间结果进行加和,以获得所述加和结果。
条款A19,根据条款A1-A18任一项所述的乘法器,其中,所述部分积求和单元包括华莱士树和加法器,其中所述华莱士树用于对所述中间结果进行加和,以获得第二中间结果,所述加法器用于对所述第二中间结果进行加和,以获得所述加和结果。
条款A20,根据条款A1-A19任一项所述的乘法器,其中,所述加法器包括全加器、串行加法器和超前进位加法器中的至少一种。
条款A21,根据条款A1-A20任一项所述的乘法器,其中,当所述中间结果的个数不足M个时,补充零值作为中间结果,使得所述中间结果的数量等于M,其中M为预设的正整数。
条款A22,根据条款A1-A21任一项所述的乘法器,其中,每个所述华莱士树具有M个输入和N个输出,所述华莱士树的数目不小于K,其中N为预设的小于M的正整数,K为不小于所述中间结果的最大位宽的正整数。
条款A23,根据条款A1-A22任一项所述的乘法器,其中,所述部分积求和单元用于选用一组或多组所述华莱士树对所述中间结果进行加和,其中每组所述华莱士树有X个华莱士树,X为所述中间结果的位数,其中各组内的所述华莱士树之间存在依次进位的关系,而各组之间的华莱士树不存在进位的关系。
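Clause A24. The multiplier according to any one of clauses A1-A23, wherein the multiplier further includes: a normalization processing unit configured to, when at least one of the two floating-point numbers is a denormalized non-zero floating-point number, normalize the at least one floating-point number to obtain the corresponding exponent and mantissa.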
条款A25,根据条款A1-A24任一项所述的乘法器,其中,所述乘法器用于根据运算模式进行所述两个浮点数的乘法运算,所述运算模式指示所述两个浮点数的数据格式,所述尾数处理单元用于根据所述运算模式以及所述两个浮点数的尾数来获得所述乘法运算后的尾数,并且所述指数处理单元用于根据所述运算模式以及所述两个浮点数的指数来获得所述乘法运算后的指数。
条款A26,根据条款A1-A25任一项所述的乘法器,所述规格化处理单元还用于根据所述运算模式,对所述两个浮点数中的至少一个浮点数进行规格化处理,以获得对应的指数和尾数。
条款A27,根据条款A1-A26任一项所述的乘法器,其中,所述数据格式包括半精度浮点数、单精度浮点数、脑浮点数、双精度浮点数、自定义浮点数中的至少一种。
条款A28,根据条款A1-A27任一项所述的乘法器,其中,所述尾数处理单元包括位数扩展电路,所述位数扩展电路用于对所述第一浮点数和所述第二浮点数中的至少一个的尾数进行位数扩展。
条款A29,根据条款A1-A28任一项所述的乘法器,其中,所述浮点数还包括符号,所述乘法器进一步包括:
符号处理单元,用于根据所述两个浮点数的符号获得乘法运算后的符号。
条款A30,根据条款A1-A29任一项所述的乘法器,其中,所述符号处理单元包括异或逻辑电路,所述异或逻辑电路用于根据所述两个浮点数的符号进行异或运算,获得所述乘法运算后的符号。
条款A31,根据条款A1-A30任一项所述的乘法器,进一步包括规则化单元,用于:
对所述乘法运算后的尾数和指数进行浮点数规则化处理,以获得规则化指数结果和规则化尾数结果,并且将所述规则化指数结果和所述规则化尾数结果作为所述乘法运算后的指数和所述乘法运算后的尾数。
条款A32,根据条款A1-A31任一项所述的乘法器,进一步包括:舍入单元,用于根据舍入模式对所述规则化尾数结果执行舍入操作以获得舍入后的尾数,并将所述舍入后的尾数作为所述乘法运算后的尾数。
条款A33,一种使用乘法器执行浮点数乘法运算的方法,其中,利用所述乘法器的尾数处理单元根据所述浮点数的尾数来获得所述乘法运算后的尾数,所述尾数处理单元包括控制电路,所述控制电路用于在两个浮点数中的至少一个的尾数位宽大于所述尾数处理单元一次可处理的数据位宽时,多次调用所述尾数处理单元。
条款A34,一种集成电路芯片,包括根据条款A1-A31的任意一项所述的乘法器。
条款A35,一种计算装置,包括根据条款A1-A31的任意一项所述的乘法器或根据条款A34所述的集成电路芯片。
以上对本披露实施例进行了详细介绍,本文中应用了具体个例对本披露的原理及实施方式进行了阐述,以上实施例的说明只是用于帮助理解本披露的方法及其核心思想;同时,对于本领域的一般技术人员,依据本披露的思想,在具体实施方式及应用范围上均会有改变之处,综上所述,本说明书内容不应理解为对本披露的限制。
应当理解,本披露的权利要求、说明书及附图中的术语“第一”、“第二”、“第三”和“第四”等是用于区别不同对象,而不是用于描述特定顺序。本披露的说明书和权利要求书中使用的术语“包括”和“包含”指示所描述特征、整体、步骤、操作、元素和/或组件的存在,但并不排除一个或多个其它特征、整体、步骤、操作、元素、组件和/或其集合的存在或添加。
还应当理解,在此本披露说明书中所使用的术语仅仅是出于描述特定实施例的目的,而并不意在限定本披露。如在本披露说明书和权利要求书中所使用的那样,除非上下文清楚地指明其它情况,否则单数形式的“一”、“一个”及“该”意在包括复数形式。还应当进一步理解,在本披露说明书和权利要求书中使用的术语“和/或”是指相关联列出的项中的一个或多个的任何组合以及所有可能组合,并且包括这些组合。
如在本说明书和权利要求书中所使用的那样,术语“如果”可以依据上下文被解释为“当...时”或“一旦”或“响应于确定”或“响应于检测到”。类似地,短语“如果确定”或“如果检测到[所描述条件或事件]”可以依据上下文被解释为意指“一旦确定”或“响应于确定”或“一旦检测到[所描述条件或事件]”或“响应于检测到[所描述条件或事件]”。
以上对本披露实施例进行了详细介绍,本文中应用了具体个例对本披露的原理及实施方式进行了阐述,以上实施例的说明仅用于帮助理解本披露的方法及其核心思想。同时,本领域技术人员依据本披露的思想,基于本披露的具体实施方式及应用范围上做出的改变或变形之处,都属于本披露保护的范围。综上所述,本说明书内容不应理解为对本披露的限制。
Claims (35)
- 一种乘法器,用于进行浮点数的乘法运算,其中,所述乘法器包括:尾数处理单元,用于根据所述浮点数的尾数来获得所述乘法运算后的尾数,所述尾数处理单元包括控制电路,所述控制电路用于在两个浮点数中的至少一个的尾数位宽大于所述尾数处理单元一次可处理的数据位宽时,多次调用所述尾数处理单元。
- 根据权利要求1所述的乘法器,其中,所述两个浮点数包括第一浮点数和第二浮点数,所述尾数处理单元支持第一位宽和第二位宽,所述第一浮点数的尾数作为与所述第一位宽对应的第一输入,所述第二浮点数的尾数作为与所述第二位宽对应的第二输入,所述第一输入的位宽小于或等于所述第一位宽,所述控制电路用于当所述第二输入的位宽大于所述第二位宽时,多次调用所述尾数处理单元来获得所述乘法运算后的尾数。
- 根据权利要求1所述的乘法器,其中,所述两个浮点数包括第一浮点数和第二浮点数,所述尾数处理单元支持第一位宽和第二位宽,所述第一浮点数的尾数作为与所述第一位宽对应的第一输入,所述第二浮点数的尾数作为与所述第二位宽对应的第二输入,所述控制电路用于当所述第一输入的位宽大于所述第一位宽且所述第二输入的位宽小于或等于所述第二位宽时、当所述第二输入的位宽大于所述第二位宽且所述第一输入的位宽小于或等于所述第一位宽时或者当所述第一输入的位宽大于所述第一位宽且所述第二输入的位宽大于所述第二位宽时,多次调用所述尾数处理单元来获得所述乘法运算后的尾数。
- 根据权利要求3所述的乘法器,其中,当所述第一浮点数的尾数位宽小于所述第二浮点数的尾数位宽并且所述第一位宽大于所述第二位宽时,或者当所述第一浮点数的尾数位宽大于所述第二浮点数的尾数位宽并且所述第一位宽小于所述第二位宽时,所述控制电路选择所述第一浮点数的尾数作为与所述第二位宽对应的所述第二输入并且选择所述第二浮点数的尾数作为与所述第一位宽对应的第一输入。
- 根据权利要求4所述的乘法器,其中,当所述第一输入的位宽大于所述第一位宽且所述第二输入的位宽小于或等于所述第二位宽时,所述控制电路根据所述第一输入的位宽和所述第一位宽来确定调用所述尾数处理单元的次数以及在每次调用中输入所述尾数处理单元的数据。
- 根据权利要求4所述的乘法器,其中,当所述第二输入的位宽大于所述第二位宽且所述第一输入的位宽小于或等于所述第一位宽时,所述控制电路根据所述第二输入的位宽和所述第二位宽来确定调用所述尾数处理单元的次数以及在每次调用中输入所述尾数处理单元的数据。
- 根据权利要求4所述的乘法器,其中,当所述第一输入的位宽大于所述第一位宽且所述第二输入的位宽大于所述第二位宽时,所述控制电路根据所述第一输入的位宽和所述第一位宽以及所述第二输入的位宽和所述第二位宽来确定调用所述尾数处理单元的次数以及在每次调用中输入所述尾数处理单元的数据。
- 根据权利要求2至7中任一项所述的乘法器,其中,所述尾数处理单元还包括移位加法电路,所述移位加法电路用于根据每次调用所述尾数处理单元所获得的尾数结果来获得所述乘法运算后的尾数。
- 根据权利要求8所述的乘法器,其中,所述移位加法电路包括移位器、中间存储器和加法器,当所述控制电路多次调用所述尾数处理单元时,在第一次调用后,所述移位器将第一次调用获得的尾数结果进行移位获得移位后尾数结果并将所述移位后尾数结果存入所述中间存储器中,从第二次调用开始,所述移位器将当次调用中获得的尾数结果进行移位获得当次尾数结果,所述加法器将所述当次尾数结果与存储在所述中间存储器中的结果相加并且将相加后的结果存储在所述中间存储器中来更新所述中间存储器,并且在最后一次调用后存储在所述中间存储器中的结果作为所述乘法运算后的尾数。
- 根据权利要求1所述的乘法器,其中,所述乘法器还包括指数处理单元,所述指数处理单元用于根据所述两个浮点数的指数来获得所述乘法运算后的指数,所述指数处理单元包括第二控制电路,所述第二控制电路用于根据所述两个浮点数中的一个的指数位宽和所述指数处理单元所支持的两个位宽中的一个或者根据所述两个浮点数的指数位宽和所述指数处理单元所支持的两个位宽来确定多次调用所述指数处理单元以获得所述乘法运算后的指数。
- 根据权利要求10所述的乘法器,其中,所述两个浮点数包括第一浮点数和第二浮点数,所述指数处理单元支持第三位宽和第四位宽,所述第一浮点数的指数作为与所述第三位宽对应的第三输入,所述第二浮点数的指数作为与所述第四位宽对应的第四输入,所述第三输入的位宽小于或等于所述第三位宽,所述第二控制电路用于当所述第四输入的位宽大于所述第四位宽时,多次调用所述指数处理单元来获得所述乘法运算后的指数。
- 根据权利要求10所述的乘法器,其中,所述两个浮点数包括第一浮点数和第二浮点数,所述指数处理单元支持第三位宽和第四位宽,所述第一浮点数的指数作为与所述第三位宽对应的第三输入,所述第二浮点数的指数作为与所述第四位宽对应的第四输入,所述第二控制电路用于当所述第三输入的位宽大于所述第三位宽且所述第四输入的位宽小于或等于所述第四位宽时、当所述第四输入的位宽大于所述第四位宽且所述第三输入的位宽小于或等于所述第三位宽时或者当所述第三输入的位宽大于所述第三位宽且所述第四输入的位宽大于所述第四位宽时,多次调用所述指数处理单元来获得所述乘法运算后的指数。
- 根据权利要求12所述的乘法器,其中,当所述第一浮点数的指数位宽小于所述第二浮点数的指数位宽并且所述第三位宽大于所述第四位宽时,或者当所述第一浮点数的指数位宽大于所述第二浮点数的指数位宽并且所述第三位宽小于所述第四位宽时,所述第二控制电路选择所述第一浮点数的指数作为与所述第四位宽对应的所述第四输入并且选择所述第二浮点数的指数作为与所述第三位宽对应的第三输入。
- 根据权利要求13所述的乘法器,其中,所述第二控制电路用于当所述第三输入的位宽小于或等于所述第四输入的位宽且所述第三位宽小于或等于所述第四位宽时,根据所述第四输入的位宽和所述第三位宽来确定调用所述指数处理单元的次数以及在每次调用中输入所述指数处理单元的数据。
- 根据权利要求11至14中任一项所述的乘法器,其中,所述指数处理单元还包括第二移位加法电路,所述第二移位加法电路用于根据每次调用所述指数处理单元所获得的指数结果来获得所述乘法运算后的指数。
- 根据权利要求1所述的乘法器,其中,所述尾数处理单元包括部分积运算单元和部分积求和单元,其中所述部分积运算单元用于根据所述两个浮点数的尾数获得中间结果,所述部分积求和单元用于将所述中间结果进行加和运算以获得加和结果,并将所述加和结果作为所述乘法运算后的尾数。
- 根据权利要求16所述的乘法器,其中,所述部分积运算单元包括布斯编码电路,所述布斯编码电路用于对所述第一浮点数或所述第二浮点数的尾数进行布斯编码处理,以获得所述中间结果。
- 根据权利要求17所述的乘法器,其中,所述部分积求和单元包括加法器,所述加法器用于对所述中间结果进行加和,以获得所述加和结果。
- 根据权利要求17所述的乘法器,其中,所述部分积求和单元包括华莱士树和加法器,其中所述华莱士树用于对所述中间结果进行加和,以获得第二中间结果,所述加法器用于对所述第二中间结果进行加和,以获得所述加和结果。
- 根据权利要求18或19所述的乘法器,其中,所述加法器包括全加器、串行加法器和超前进位加法器中的至少一种。
- 根据权利要求19所述的乘法器,其中,当所述中间结果的个数不足M个时,补充零值作为中间结果,使得所述中间结果的数量等于M,其中M为预设的正整数。
- 根据权利要求21所述的乘法器,其中,每个所述华莱士树具有M个输入和N个输出,所述华莱士树的数目不小于K,其中N为预设的小于M的正整数,K为不小于所述中间结果的最大位宽的正整数。
- 根据权利要求22所述的乘法器,其中,所述部分积求和单元用于选用一组或多组所述华莱士树对所述中间结果进行加和,其中每组所述华莱士树有X个华莱士树,X为所述中间结果的位数,其中各组内的所述华莱士树之间存在依次进位的关系,而各组之间的华莱士树不存在进位的关系。
- 根据权利要求10所述的乘法器,其中,所述乘法器还包括:规格化处理单元,用于当所述两个浮点数中的至少一个浮点数为非规格化的非零浮点数时,对所述至少一个浮点数进行规格化处理,以获得对应的指数和尾数。
- 根据权利要求10所述的乘法器,其中,所述乘法器用于根据运算模式进行所述两个浮点数的乘法运算,所述运算模式指示所述两个浮点数的数据格式,所述尾数处理单元用于根据所述运算模式以及所述两个浮点数的尾数来获得所述乘法运算后的尾数,并且所述指数处理单元用于根据所述运算模式以及所述两个浮点数的指数来获得所述乘法运算后的指数。
- 根据权利要求25所述的乘法器,所述规格化处理单元还用于根据所述运算模式,对所述两个浮点数中的至少一个浮点数进行规格化处理,以获得对应的指数和尾数。
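- The multiplier according to claim 26, wherein the data format includes at least one of a half-precision floating-point number, a single-precision floating-point number, a brain floating-point number, a double-precision floating-point number and a custom floating-point number.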
- 根据权利要求17所述的乘法器,其中,所述尾数处理单元包括位数扩展电路,所述位数扩展电路用于对所述第一浮点数和所述第二浮点数中的至少一个的尾数进行位数扩展。
- 根据权利要求1所述的乘法器,其中,所述浮点数还包括符号,所述乘法器进一步包括:符号处理单元,用于根据所述两个浮点数的符号获得乘法运算后的符号。
- 根据权利要求29所述的乘法器,其中,所述符号处理单元包括异或逻辑电路,所述异或逻辑电路用于根据所述两个浮点数的符号进行异或运算,获得所述乘法运算后的符号。
- 根据权利要求25所述的乘法器,进一步包括规则化单元,用于:对所述乘法运算后的尾数和指数进行浮点数规则化处理,以获得规则化指数结果和规则化尾数结果,并且将所述规则化指数结果和所述规则化尾数结果作为所述乘法运算后的指数和所述乘法运算后的尾数。
- 根据权利要求31所述的乘法器,进一步包括:舍入单元,用于根据舍入模式对所述规则化尾数结果执行舍入操作以获得舍入后的尾数,并将所述舍入后的尾数作为所述乘法运算后的尾数。
- 一种使用乘法器执行浮点数乘法运算的方法,其中,利用所述乘法器的尾数处理单元根据所述浮点数的尾数来获得所述乘法运算后的尾数,所述尾数处理单元包括控制电路,所述控制电路用于在两个浮点数中的至少一个的尾数位宽大于所述尾数处理单元一次可处理的数据位宽时,多次调用所述尾数处理单元。
- 一种集成电路芯片,包括权利要求1-32的任意一项所述的乘法器。
- 一种计算装置,包括根据权利要求1-32的任意一项所述的乘法器或根据权利要求34所述的集成电路芯片。
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US17/620,583 US20240289092A1 (en) | 2019-10-14 | 2020-10-13 | Multiplier, method, integrated circuit chip, and computing device for floating point operation |
Applications Claiming Priority (4)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910970802 | 2019-10-14 | ||
CN201910970802.8 | 2019-10-14 | ||
CN202011074061.4A CN112732220A (zh) | 2019-10-14 | 2020-10-09 | 用于浮点运算的乘法器、方法、集成电路芯片和计算装置 |
CN202011074061.4 | 2020-10-09 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2021073511A1 true WO2021073511A1 (zh) | 2021-04-22 |
Family
ID=75538419
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2020/120716 WO2021073511A1 (zh) | 2019-10-14 | 2020-10-13 | 用于浮点运算的乘法器、方法、集成电路芯片和计算装置 |
Country Status (2)
Country | Link |
---|---|
US (1) | US20240289092A1 (zh) |
WO (1) | WO2021073511A1 (zh) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8019805B1 (en) * | 2003-12-09 | 2011-09-13 | Globalfoundries Inc. | Apparatus and method for multiple pass extended precision floating point multiplication |
CN102722352A (zh) * | 2012-05-21 | 2012-10-10 | South China University of Technology | Booth multiplier |
US20190042244A1 (en) * | 2018-09-27 | 2019-02-07 | Intel Corporation | Computer processor for higher precision computations using a mixed-precision decomposition of operations |
US20190196785A1 (en) * | 2017-12-21 | 2019-06-27 | Qualcomm Incorporated | System and method of floating point multiply operation processing |
2020
- 2020-10-13 US US17/620,583 patent/US20240289092A1/en active Pending
- 2020-10-13 WO PCT/CN2020/120716 patent/WO2021073511A1/zh active Application Filing
Also Published As
Publication number | Publication date |
---|---|
US20240289092A1 (en) | 2024-08-29 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
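TWI763079B (zh) | Multiplier, method, integrated circuit chip, and computing device for floating point operation | |
WO2021078212A1 (zh) | Computing device, method, and integrated circuit chip for vector inner product | |
WO2021078210A1 (zh) | Computing device, method, integrated circuit, and apparatus for neural network operation | |
CN111008003B (zh) | Data processor, method, chip, and electronic device | |
CN110515589B (zh) | Multiplier, data processing method, chip, and electronic device | |
TWI774093B (zh) | Converter for converting data types, chip, electronic device, and method thereof | |
CN110362293B (zh) | Multiplier, data processing method, chip, and electronic device | |
CN110515587B (zh) | Multiplier, data processing method, chip, and electronic device | |
WO2021078209A1 (zh) | Converter for converting data types, chip, electronic device, and method thereof | |
CN110515590B (zh) | Multiplier, data processing method, chip, and electronic device | |
CN110531954B (zh) | Multiplier, data processing method, chip, and electronic device | |
TW202319909A (zh) | Hardware circuit and method for multiplying sets of inputs, and non-transitory machine-readable storage device | |
WO2021073512A1 (zh) | Multiplier, method, integrated circuit chip, and computing device for floating point operation | |
CN111258541B (zh) | Multiplier, data processing method, chip, and electronic device | |
CN111258633B (zh) | Multiplier, data processing method, chip, and electronic device | |
WO2021073511A1 (zh) | Multiplier, method, integrated circuit chip, and computing device for floating point operation | |
CN209895329U (zh) | Multiplier | |
CN113033799B (zh) | Data processor, method, apparatus, and chip | |
CN110647307B (zh) | Data processor, method, chip, and electronic device | |
CN113031911B (zh) | Multiplier, data processing method, apparatus, and chip | |
CN110515586B (zh) | Multiplier, data processing method, chip, and electronic device | |
CN210109863U (zh) | Multiplier, apparatus, neural network chip, and electronic device | |
CN111258542B (zh) | Multiplier, data processing method, chip, and electronic device | |
CN111258545B (zh) | Multiplier, data processing method, chip, and electronic device | |
WO2023231363A1 (zh) | Method for multiplying and accumulating operands and device thereof |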
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 20876618; Country of ref document: EP; Kind code of ref document: A1 |
| NENP | Non-entry into the national phase | Ref country code: DE |
| 122 | Ep: pct application non-entry in european phase | Ref document number: 20876618; Country of ref document: EP; Kind code of ref document: A1 |