WO2022068327A1 - Operation unit, floating-point number calculation method, apparatus, chip and computing device - Google Patents

Operation unit, floating-point number calculation method, apparatus, chip and computing device

Info

Publication number
WO2022068327A1
Authority
WO
WIPO (PCT)
Prior art keywords
floating
point number
calculated
mantissa
point
Prior art date
Application number
PCT/CN2021/106965
Other languages
English (en)
French (fr)
Inventor
潘秋萍
Original Assignee
Huawei Technologies Co., Ltd. (华为技术有限公司)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd. (华为技术有限公司)
Priority to EP21873989.4A (published as EP4206902A4)
Publication of WO2022068327A1
Priority to US18/191,688 (published as US20230289141A1)

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/483Computations with numbers represented by a non-linear combination of denominational numbers, e.g. rational numbers, logarithmic number system or floating-point numbers
    • G06F7/485Adding; Subtracting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/499Denomination or exception handling, e.g. rounding or overflow
    • G06F7/49905Exception handling
    • G06F7/4991Overflow or underflow
    • G06F7/49915Mantissa overflow or underflow in handling floating-point numbers
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/483Computations with numbers represented by a non-linear combination of denominational numbers, e.g. rational numbers, logarithmic number system or floating-point numbers
    • G06F7/487Multiplying; Dividing
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2207/00Indexing scheme relating to methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F2207/38Indexing scheme relating to groups G06F7/38 - G06F7/575
    • G06F2207/3804Details
    • G06F2207/3808Details concerning the type of numbers or the way they are handled
    • G06F2207/3812Devices capable of handling different types of numbers
    • G06F2207/382Reconfigurable for different fixed word lengths
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • the present application relates to the field of computer technology, and in particular, to an arithmetic unit, a method, an apparatus, a chip, and a computing device for calculating floating-point numbers.
  • Floating-point numbers are an important digital format in computers.
  • floating-point numbers are composed of three parts: sign, exponent and mantissa.
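As a concrete illustration (not part of the patent text), the three parts of an IEEE 754 single-precision (binary32) number can be extracted from its bit pattern as follows; the bit widths shown are those of binary32.

```python
import struct

def fp32_fields(x: float):
    """Split an IEEE 754 binary32 value into sign, biased exponent and
    stored mantissa (the leading 1 of normal numbers is implicit)."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = bits >> 31                  # 1 bit
    exponent = (bits >> 23) & 0xFF     # 8 bits, bias 127
    mantissa = bits & 0x7FFFFF         # 23 stored bits
    return sign, exponent, mantissa

print(fp32_fields(-1.5))  # (1, 127, 4194304): -1.5 = -1.1b * 2**0
```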
  • To meet the differing data-precision demands of different services, computers usually need to support multiple floating-point calculation types.
  • in existing designs, each operation unit can implement only one type of floating-point operation.
  • the present application provides an operation unit, a floating-point number calculation method, an apparatus, a chip and a computing device, so as to improve the utilization and processing efficiency of the chip.
  • In a first aspect, an operation unit is provided, including a disassembly circuit and an operator. The disassembly circuit is used to obtain the mode and the floating-point number to be calculated included in a calculation instruction, and to disassemble the floating-point number to be calculated according to a preset rule, where the mode indicates the operation type of the floating-point number to be calculated. The operator is used to complete the processing of the calculation instruction according to the mode and the disassembled floating-point number to be calculated.
  • the control unit in the processor can obtain the calculation instruction from the storage unit or the memory, and send it to the operation unit.
  • the disassembly circuit in the operation unit receives the calculation instruction, disassembles the mantissa of the floating-point number to be calculated according to its type, the number of mantissa segments corresponding to that type and the bit width of each segment, and outputs the disassembled mantissa segments, sign and exponent to the operator.
  • the operator processes the input mantissa segments, sign and exponent of the floating-point number to be calculated according to the mode and obtains the calculation result. That is, in the solution shown in the present application, floating-point operations of different precisions and operation types can be realized by a single operation unit, giving the operation unit wider applicability.
  • when the floating-point number to be calculated is a high-precision floating-point number, the disassembly circuit is used to disassemble it into multiple low-precision floating-point numbers according to its mantissa.
  • the disassembly circuit can disassemble the high-precision floating-point number to be calculated into multiple low-precision floating-point numbers, and then multiplex the low-precision floating-point multiplier and the low-precision floating-point adder to perform the corresponding processing, rather than designing a dedicated high-precision floating-point multiplier or high-precision floating-point adder, which saves operator cost.
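One way to see why such a decomposition is possible (this is an illustrative software analogue, not the circuit the patent describes): Dekker's classic splitting writes any float64 exactly as the sum of two values, each carrying roughly half the mantissa bits.

```python
def dekker_split(x: float):
    """Split a float64 exactly into hi + lo, each carrying at most ~26
    significant mantissa bits (Veldkamp/Dekker splitting)."""
    c = 134217729.0 * x   # 134217729 = 2**27 + 1
    hi = c - (c - x)      # top half of the mantissa
    lo = x - hi           # bottom half; the split is exact: hi + lo == x
    return hi, lo

hi, lo = dekker_split(1.0 / 3.0)
assert hi + lo == 1.0 / 3.0   # exact reconstruction
```

Because each half fits in fewer mantissa bits, narrower arithmetic units can process the halves and combine the results.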
  • the exponent bit width of the disassembled floating-point number to be calculated is larger than the exponent bit width of the floating-point number to be calculated.
  • the solution can disassemble the floating-point number to be calculated into a floating-point number of a specified type.
  • the floating-point number to be calculated of the specified type can be of a non-standard type; to satisfy the exponent-shift condition during alignment, it is only necessary to ensure that the exponent bit width of the specified type is greater than the exponent bit width of the floating-point number to be calculated.
  • the disassembly circuit is configured to disassemble the floating-point number to be calculated into a sign, an exponent and a mantissa, and to disassemble the mantissa of the floating-point number to be calculated into a plurality of mantissa segments.
  • the disassembly circuit can disassemble the mantissa of the floating-point number to be calculated.
  • the floating-point multiplier in this embodiment of the present application can support the lowest-precision floating-point multiplication. Therefore, the mantissa of the lowest-precision floating-point number does not need to be disassembled.
  • the bit width of each mantissa segment to be disassembled can be smaller than or equal to the maximum bit width of the mantissa supported by the floating-point multiplier.
  • the mantissas of various types of high-precision floating-point numbers can be split according to the mantissa bit width of the lowest-precision floating-point number, so the bit widths of the resulting mantissa segments are similar.
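A sketch of such a split (segment count and widths are illustrative): a w-bit mantissa is cut into x segments of ceil(w/x) bits each, and can be reassembled losslessly.

```python
def split_mantissa(mantissa: int, width: int, x: int):
    """Cut a `width`-bit mantissa into x segments of similar bit width,
    least-significant segment first."""
    seg_bits = -(-width // x)               # ceil(width / x)
    mask = (1 << seg_bits) - 1
    return [(mantissa >> (i * seg_bits)) & mask for i in range(x)], seg_bits

def join_mantissa(segments, seg_bits):
    """Reassemble the segments into the original mantissa."""
    return sum(s << (i * seg_bits) for i, s in enumerate(segments))

# e.g. a 53-bit double-precision significand into two 27-bit halves
m = (1 << 52) | 0xABCDE12345678
segs, w = split_mantissa(m, 53, 2)
assert join_mantissa(segs, w) == m
```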
  • the operator includes a floating-point multiplier and a floating-point adder; the floating-point multiplier is used to perform the multiplication operation on the disassembled floating-point number to be calculated, and the floating-point adder is used to perform the addition operation on the disassembled floating-point number to be calculated.
  • the operator includes a plurality of floating-point multipliers and a floating-point adder. The first floating-point multiplier among them is used to XOR the signs of the input disassembled floating-point numbers to be calculated, add their exponents, multiply their mantissa segments, and output the sign XOR result, the exponent sum and the mantissa-segment products to the floating-point adder. The second floating-point multiplier among them is used to multiply, in parallel, the mantissa segments of the input disassembled floating-point numbers to be calculated, and output the mantissa-segment products to the floating-point adder. The floating-point adder is used to add the input mantissa-segment products to obtain the mantissa-segment sum, and, according to the mode, output the calculation result of the floating-point number to be calculated based on the mantissa-segment sum, the sign XOR result and the exponent sum.
  • the operation unit can be provided with multiple floating-point multipliers, which can perform mantissa-segment multiplications in parallel or perform floating-point multiplications in parallel, effectively improving the efficiency of floating-point operations.
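The partial products behind such a design can be sketched in software; each (i, j) product below corresponds to one narrow multiplier working independently, so all x*x of them could run in parallel in hardware.

```python
def mul_via_segments(a: int, b: int, width: int, x: int) -> int:
    """Multiply two `width`-bit mantissas with x*x narrow partial
    products, mimicking x**2 low-precision multipliers."""
    seg_bits = -(-width // x)               # ceil(width / x)
    mask = (1 << seg_bits) - 1
    a_segs = [(a >> (i * seg_bits)) & mask for i in range(x)]
    b_segs = [(b >> (j * seg_bits)) & mask for j in range(x)]
    total = 0
    for i, ai in enumerate(a_segs):         # each (i, j) pair is one
        for j, bj in enumerate(b_segs):     # independent small multiply
            total += (ai * bj) << ((i + j) * seg_bits)
    return total

assert mul_via_segments(0xFFFFF, 0x12345, 20, 2) == 0xFFFFF * 0x12345
```

The sum of shifted partial products equals the full product by distributivity, so no precision is lost in the segment-wise scheme.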
  • the operator includes x^2 floating-point multipliers and a floating-point adder, and the disassembly circuit is used to disassemble the mantissa of each floating-point number to be calculated into x mantissa segments, where x is an integer greater than 1.
  • the operation unit may be provided with x^2 floating-point multipliers, at least one floating-point adder and at least one disassembly circuit.
  • x is the number of segments into which the mantissa of the highest-precision floating-point number supported by the operation unit is disassembled.
  • the disassembled floating-point numbers are processed in parallel by multiple multipliers, which improves the efficiency of floating-point number operations.
  • the disassembly circuit is used to obtain the mode and the floating-point number vectors to be calculated included in the calculation instruction, disassemble each floating-point number to be calculated in each vector into a sign, an exponent and a mantissa, disassemble the mantissa of each floating-point number to be calculated into a plurality of mantissa segments, and output sign combinations, exponent combinations and mantissa-segment combinations to the first floating-point multiplier, where each sign combination includes the signs of two floating-point numbers to be calculated from different vectors, each exponent combination includes their exponents, and each mantissa-segment combination includes two mantissa segments disassembled from two floating-point numbers to be calculated from different vectors.
  • the first floating-point multiplier is used to XOR the signs in the input sign combination, add the exponents in the input exponent combination, multiply the mantissa segments in the input mantissa-segment combination, and output the sign XOR result, the exponent sum and the mantissa-segment product to the floating-point adder; the second floating-point multiplier is used to multiply, in parallel, the mantissa segments in the input mantissa-segment combinations, and output the mantissa-segment products to the floating-point adder;
  • the floating-point adder is used to add the input mantissa-segment products that come from the same floating-point numbers to be calculated, obtaining the mantissa-segment sum corresponding to each floating-point number to be calculated.
  • the arithmetic unit can realize the calculation of the floating-point number vector.
  • the present application can realize the relevant calculation on the floating-point number vector.
  • the disassembly circuit first disassembles each vector into floating-point scalars, and then disassembles each floating-point scalar into three parts: sign, exponent and mantissa.
  • the mantissa must be disassembled to obtain multiple mantissa segments.
  • the sign, exponent, and mantissa segments are output to the floating-point multiplier.
  • the floating-point multiplier XORs the two input signs, sums the input exponents and multiplies the input mantissa segments, then outputs the sign XOR result, the exponent sum and the mantissa-segment product to the floating-point adder.
  • the normalization processing circuit then performs normalization processing and outputs the result.
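The scalar path just described (sign XOR, exponent addition, mantissa multiplication, then normalization) can be imitated in software for binary32. This sketch truncates rather than rounds and ignores subnormals, infinities, NaN and exponent overflow; it is an illustration, not the patent's circuit.

```python
import struct

def f2bits(x): return struct.unpack(">I", struct.pack(">f", x))[0]
def bits2f(b): return struct.unpack(">f", struct.pack(">I", b))[0]

def fp32_mul(a: float, b: float) -> float:
    """Multiply two binary32 values from their parts: XOR the signs,
    add the exponents (re-biasing once), multiply the significands,
    then renormalize the 48-bit product back to 24 bits."""
    ba, bb = f2bits(a), f2bits(b)
    sign = (ba >> 31) ^ (bb >> 31)
    exp = ((ba >> 23) & 0xFF) + ((bb >> 23) & 0xFF) - 127   # remove one bias
    ma = (ba & 0x7FFFFF) | 0x800000        # restore implicit leading 1
    mb = (bb & 0x7FFFFF) | 0x800000
    prod = ma * mb                         # 47- or 48-bit product
    if prod & (1 << 47):                   # product in [2, 4): shift extra bit
        prod >>= 24
        exp += 1
    else:                                  # product in [1, 2)
        prod >>= 23
    return bits2f((sign << 31) | (exp << 23) | (prod & 0x7FFFFF))

print(fp32_mul(1.5, -2.5))  # -3.75
```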
  • the mode indicates that the operation type of the floating-point number vectors to be calculated is a vector element-by-element multiplication operation; the floating-point adder is used to output, for each floating-point number to be calculated, the corresponding mantissa-segment sum, sign XOR result and exponent sum as the element-wise product result.
  • element-by-element multiplication of vectors can be implemented.
  • in this mode, the floating-point adder only needs to output the results to the normalization processing circuit.
  • the mode indicates that the operation type of the floating-point number vectors to be calculated is a vector inner product operation; the floating-point adder is used to align, according to the exponent sum corresponding to each floating-point number to be calculated, the mantissa-segment sums corresponding to the floating-point numbers to be calculated, add the aligned mantissa-segment sums, and output the result of the vector inner product operation.
  • the vector inner product operation can also be implemented.
  • the floating-point adder also needs to calculate the exponent difference from the exponent sums corresponding to the floating-point numbers to be calculated, align the mantissa-segment sums corresponding to the floating-point numbers to be calculated based on the calculated exponent difference, and add the aligned mantissa-segment sums.
  • the calculation result is output to the normalization processing circuit, and the calculation result is a complete floating point number, including sign, exponent and mantissa. After the calculation result is normalized by the normalization processing circuit, it can be output.
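Exponent alignment before addition can be sketched with plain integers; each term below stands for (-1)^sign * significand * 2^exponent, every significand is shifted to the largest exponent before the signed sum, and bits discarded by the shift model the alignment loss. The triple encoding is illustrative, not the patent's internal format.

```python
def align_and_sum(terms):
    """terms: list of (sign, exponent, significand) with value
    (-1)**sign * significand * 2**exponent. Align all significands
    to the largest exponent, then add them as signed integers."""
    emax = max(e for _, e, _ in terms)
    acc = 0
    for sign, e, m in terms:
        aligned = m >> (emax - e)        # exponent alignment
        acc += -aligned if sign else aligned
    return acc, emax

# 12 + 8 - 4 = 16, encoded as (sign, exponent, significand) triples
assert align_and_sum([(0, 2, 3), (0, 0, 8), (1, 1, 2)]) == (4, 2)
```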
  • the mode indicates that the operation type of the floating-point number vector to be calculated is a vector element accumulation operation
  • the disassembly circuit is used to obtain the mode and the first floating-point number vector included in the calculation instruction, and to generate a second floating-point number vector, where the type of each floating-point number to be calculated in the second vector is the same as the type of the floating-point numbers to be calculated in the first vector, the value of each floating-point number to be calculated in the second vector is 1, and the first and second floating-point number vectors together serve as the floating-point number vectors to be calculated;
  • the floating-point number adder is used to align the summation result of each mantissa segment corresponding to the floating-point number to be calculated according to the summation result of the order code corresponding to each floating-point number to be calculated, and add the sum of each mantissa segment after the balancing. The result is added, and the result of the vector element accumulation is output.
  • the vector element accumulation operation can also be implemented.
  • the input floating-point number to be calculated is a floating-point number vector.
  • when the disassembly circuit obtains the calculation instruction and determines that the operation type indicated by the mode is vector element accumulation, it can first generate a floating-point number vector of the same type as the input vector to be calculated, with every element equal to 1.
  • the input floating-point number vector to be calculated and the generated floating-point number vector can be jointly used as the floating-point number vector to be calculated.
  • The subsequent disassembly, multiplication and addition steps are the same as in the vector inner product operation.
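The trick reads naturally in software: summing a vector's elements is the inner product of the vector with an all-ones vector of the same length, so an inner-product datapath needs no extra hardware for accumulation.

```python
def inner_product(a, b):
    """Plain inner product, standing in for the unit's datapath."""
    return sum(x * y for x, y in zip(a, b))

def accumulate(v):
    """Vector element accumulation reusing the inner-product path:
    pair the input with a generated vector whose elements are all 1."""
    ones = [1.0] * len(v)
    return inner_product(v, ones)

assert accumulate([1.5, 2.25, -0.75]) == 3.0
```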
  • In a second aspect, a method for calculating floating-point numbers is provided, including: acquiring the mode and the floating-point number to be calculated included in a calculation instruction; disassembling the floating-point number to be calculated according to a preset rule, where the mode indicates the operation type of the floating-point number to be calculated; and completing the processing of the calculation instruction according to the mode and the disassembled floating-point number to be calculated.
  • the control unit in the processor can obtain the calculation instruction from the storage unit or the memory, and send it to the operation unit.
  • the operator in the operation unit receives the calculation instruction and disassembles the mantissa of the floating-point number to be calculated according to its type, the number of mantissa segments corresponding to that type and the bit width of each mantissa segment.
  • the disassembled mantissa segments, sign and exponent are then processed according to the mode to obtain the calculation result. That is, in the solution shown in the present application, one operation unit can realize operations of different types.
  • the floating-point number to be calculated is a high-precision floating-point number
  • disassembling the floating-point number to be calculated according to a preset rule includes: disassembling the floating-point number to be calculated into multiple low-precision floating-point numbers according to its mantissa.
  • the arithmetic unit can decompose the high-precision floating-point number to be calculated into multiple low-precision floating-point numbers, and then multiplex the low-precision floating-point multiplier and the low-precision floating-point adder to perform the corresponding processing, rather than designing a dedicated high-precision floating-point multiplier or high-precision floating-point adder, which saves operator cost.
  • the exponent bit width of the disassembled floating-point number to be calculated is larger than the exponent bit width of the floating-point number to be calculated.
  • the arithmetic unit can disassemble the floating-point number to be calculated into a floating-point number of a specified type.
  • the floating-point number to be calculated of the specified type can be of a non-standard type; to satisfy the exponent-shift condition during alignment, it is only necessary to ensure that the exponent bit width of the specified type is greater than the exponent bit width of the floating-point number to be calculated.
  • disassembling the floating-point number to be calculated according to a preset rule includes: disassembling the floating-point number to be calculated into a sign, an exponent and a mantissa, and disassembling the mantissa of the floating-point number to be calculated into multiple mantissa segments.
  • the arithmetic unit can disassemble the mantissa of the floating-point number to be calculated.
  • the floating-point multiplier in this embodiment of the present application may support the lowest-precision floating-point multiplication, so the mantissa of the lowest-precision floating-point number may not need to be disassembled. When disassembling the mantissa of a high-precision floating-point number, the bit width of each mantissa segment can be smaller than or equal to the maximum mantissa bit width supported by the floating-point multiplier.
  • the mantissas of various types of high-precision floating-point numbers can be split according to the mantissa bit width of the lowest-precision floating-point number, so the bit widths of the resulting mantissa segments are similar.
  • the operation unit XORs the signs of the disassembled floating-point numbers to be calculated to obtain the sign XOR result, adds the exponents of the disassembled floating-point numbers to be calculated to obtain the exponent sum, multiplies the mantissa segments disassembled from different floating-point numbers to be calculated and outputs the mantissa-segment products, and then adds the mantissa-segment products to obtain the mantissa-segment sum. According to the mode, the calculation result of the floating-point number to be calculated is obtained from the mantissa-segment sum, the sign XOR result and the exponent sum.
  • the operation of floating-point numbers of different precisions in different modes can be completed.
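How one unit might dispatch on the mode field can be sketched as follows; the mode names are illustrative, since the patent does not specify an encoding.

```python
def execute(mode, a, b=None):
    """One datapath, several operation types selected by the mode."""
    if mode == "elementwise_mul":
        return [x * y for x, y in zip(a, b)]
    if mode == "inner_product":
        return sum(x * y for x, y in zip(a, b))
    if mode == "accumulate":
        # inner product of `a` with an implicit all-ones vector
        return sum(x * 1.0 for x in a)
    raise ValueError(f"unsupported mode: {mode}")

assert execute("elementwise_mul", [2.0, 3.0], [4.0, 5.0]) == [8.0, 15.0]
assert execute("inner_product", [1.0, 2.0], [3.0, 4.0]) == 11.0
assert execute("accumulate", [1.0, 2.0, 3.0]) == 6.0
```

All three modes share the multiply-and-add datapath; only the final combination step differs, which is what lets one operation unit serve several operation types.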
  • acquiring the mode and the floating-point number to be calculated included in the calculation instruction and disassembling the floating-point number to be calculated according to a preset rule includes: acquiring the mode and the floating-point number vectors to be calculated included in the calculation instruction, disassembling each floating-point number to be calculated in each vector into a sign, an exponent and a mantissa, and obtaining multiple sign combinations, exponent combinations and mantissa-segment combinations, where each sign combination includes the signs of two floating-point numbers to be calculated from different vectors, each exponent combination includes their exponents, and each mantissa-segment combination includes two mantissa segments disassembled from two floating-point numbers to be calculated from different vectors.
  • performing the XOR calculation on the signs of the disassembled floating-point numbers to be calculated to obtain the sign XOR result, adding the exponents of the disassembled floating-point numbers to be calculated to obtain the exponent sum, and multiplying the mantissa segments disassembled from different floating-point numbers to be calculated to obtain the mantissa-segment products includes: XORing the signs in each sign combination, adding the exponents in each exponent combination, and multiplying the mantissa segments in each mantissa-segment combination.
  • the present application can realize the relevant calculation on the floating-point number vector.
  • the operation unit first disassembles each vector into floating-point scalars, and then disassembles each floating-point scalar into three parts: sign, exponent and mantissa.
  • the mantissa must be disassembled to obtain multiple mantissa segments.
  • the XOR calculation is performed on the signs of the two floating-point scalars at the corresponding positions in the two floating-point number vectors, the summation calculation is performed on the exponent, and the multiplication calculation is performed on the mantissa segment.
  • the obtained mantissa-segment products are added together and output to the normalization processing circuit, which performs normalization processing and outputs the result.
  • the mode indicates that the operation type of the floating-point number vectors to be calculated is a vector element-by-element multiplication operation; outputting, according to the mode, the vector calculation results corresponding to the floating-point number vectors to be calculated from the mantissa-segment sums, the sign XOR results and the exponent sums includes: outputting the mantissa-segment sum, the sign XOR result and the exponent sum corresponding to each floating-point number to be calculated as the element-wise product result.
  • vector element-by-element multiplication can be implemented.
  • the operation unit only needs to output the mantissa-segment sum, the sign XOR result and the exponent sum corresponding to each floating-point number to be calculated to the normalization processing circuit.
  • the mode indicates that the operation type of the vector of floating-point numbers to be calculated is vector inner product operation;
  • outputting, according to the mantissa-segment sums, the sign XOR results and the exponent sums, the vector calculation results corresponding to the floating-point number vectors to be calculated includes: aligning, according to the exponent sum corresponding to each floating-point number to be calculated, the mantissa-segment sums corresponding to the floating-point numbers to be calculated, adding the aligned mantissa-segment sums, and outputting the result of the vector inner product operation.
  • the vector inner product operation can also be implemented in this application.
  • the operation unit also needs to calculate the exponent difference from the exponent sums corresponding to the floating-point numbers to be calculated, align the mantissa-segment sums corresponding to the floating-point numbers to be calculated based on the calculated exponent difference, and add the aligned mantissa-segment sums.
  • the calculation result is output to the normalization processing circuit, and the calculation result is a complete floating point number, including sign, exponent and mantissa. After the calculation result is normalized by the normalization processing circuit, it can be output.
  • the mode indicates that the operation type of the floating-point number vectors to be calculated is a vector element accumulation operation; acquiring the mode and the floating-point number to be calculated included in the calculation instruction includes: acquiring the mode and the first floating-point number vector included in the calculation instruction, and generating a second floating-point number vector, where the type of each floating-point number to be calculated in the second vector is the same as the type of the floating-point numbers to be calculated in the first vector, the value of each floating-point number to be calculated in the second vector is 1, and the first and second floating-point number vectors together serve as the floating-point number vectors to be calculated;
  • outputting, according to the mode, the vector calculation results corresponding to the floating-point number vectors to be calculated from the mantissa-segment sums, the sign XOR results and the exponent sums includes: aligning, according to the exponent sum corresponding to each floating-point number to be calculated, the mantissa-segment sums corresponding to the floating-point numbers to be calculated, adding the aligned mantissa-segment sums, and outputting the vector element accumulation result.
  • the vector element accumulation operation can also be implemented.
  • the input floating-point number to be calculated is a floating-point number vector.
  • when the operation unit obtains the calculation instruction and determines that the operation type indicated by the mode is vector element accumulation, it can first generate a floating-point number vector of the same type as the input vector to be calculated, with every element equal to 1.
  • the input floating-point number vector to be calculated and the generated floating-point number vector can be jointly used as the floating-point number vector to be calculated.
  • The subsequent disassembly, multiplication and addition steps are the same as in the vector inner product operation.
  • In a third aspect, an apparatus for calculating floating-point numbers is provided, the apparatus including modules for executing the floating-point number calculation method of the second aspect or any possible implementation of the second aspect.
  • a chip in a third aspect includes at least one operation unit as described in the first aspect.
  • a computing device in a fourth aspect includes a motherboard and the chip described in the third aspect; the chip is provided on the motherboard.
  • the operation unit is composed of a disassembly circuit and an operator.
  • the disassembly circuit can obtain the mode and the floating-point number to be calculated included in the calculation instruction, and disassemble the floating-point number to be calculated according to preset rules. Then, the operator completes the processing of the calculation instruction according to the mode and the disassembled floating-point number to be calculated.
  • the mode in the calculation instruction is used to indicate the operation type of the floating-point number to be calculated, that is, one operation unit in this application can implement a variety of different operation types.
  • FIG. 1 is a schematic diagram of a floating-point number composition provided by an embodiment of the present application.
  • FIG. 2 is a schematic diagram of the composition of a floating point number provided by an embodiment of the present application.
  • FIG. 3 is a schematic diagram of the composition of a floating point number provided by an embodiment of the present application.
  • FIG. 4 is a logical architecture diagram of a chip provided by an embodiment of the present application.
  • FIG. 5 is a schematic structural diagram of a computing unit provided by an embodiment of the present application.
  • FIG. 6 is a schematic structural diagram of a dismantling circuit provided by an embodiment of the present application.
  • FIG. 7 is a schematic diagram of an adder arrangement provided by an embodiment of the present application.
  • FIG. 8 is a flowchart of a method for calculating floating-point numbers provided by an embodiment of the present application.
  • FIG. 9 is a flowchart of a method for calculating floating-point numbers provided by an embodiment of the present application.
  • FIG. 10 is a schematic structural diagram of a computing unit provided by an embodiment of the present application.
  • FIG. 11 is a schematic structural diagram of an apparatus for floating-point number calculation provided by an embodiment of the present application.
  • FIG. 12 is a schematic structural diagram of a computing device provided by an embodiment of the present application.
  • the half-precision floating-point number FP16 occupies 16 bits in computer storage, including the sign, exponent and mantissa. The bit width of the sign is 1 bit, the bit width of the exponent is 5 bits, and the bit width of the mantissa is 10 bits (the fractional part of the mantissa). In addition to the stored 10-bit fractional part, the mantissa also includes a hidden 1-bit integer part; that is, the mantissa has a total of 11 bits.
  • the single-precision floating-point number FP32 occupies 32 bits in computer storage, including the sign, exponent and mantissa. The bit width of the sign is 1 bit, the bit width of the exponent is 8 bits, and the bit width of the mantissa is 23 bits (the fractional part of the mantissa). In addition to the stored 23-bit fractional part, the mantissa also includes a hidden 1-bit integer part; that is, the mantissa has a total of 24 bits.
  • the double-precision floating-point number FP64 occupies 64 bits in computer storage, including the sign, exponent and mantissa. The bit width of the sign is 1 bit, the bit width of the exponent is 11 bits, and the bit width of the mantissa is 52 bits (the fractional part of the mantissa). In addition to the stored 52-bit fractional part, the mantissa also includes a hidden 1-bit integer part; that is, the mantissa has a total of 53 bits.
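The three layouts above can be illustrated with a short sketch (plain Python; the `LAYOUTS` table and `decompose` function are our own names, not part of the patent). It extracts the sign, the biased exponent and the mantissa, prepending the hidden integer bit 1 for normal numbers.

```python
import struct

# Bit layouts from the text: (struct format, exponent bits, stored fraction bits)
LAYOUTS = {
    "FP16": ("<e", 5, 10),
    "FP32": ("<f", 8, 23),
    "FP64": ("<d", 11, 52),
}

def decompose(value, kind):
    fmt, exp_bits, frac_bits = LAYOUTS[kind]
    bits = int.from_bytes(struct.pack(fmt, value), "little")
    sign = bits >> (exp_bits + frac_bits)
    exp = (bits >> frac_bits) & ((1 << exp_bits) - 1)   # biased exponent
    frac = bits & ((1 << frac_bits) - 1)                # stored fraction bits
    # For normal numbers the full mantissa prepends the hidden integer bit 1.
    mantissa = (1 << frac_bits) | frac if exp != 0 else frac
    return sign, exp, mantissa
```

For example, `decompose(1.0, "FP32")` yields sign 0, biased exponent 127, and a 24-bit mantissa whose only set bit is the hidden integer bit.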
  • a1, a2, ..., an and b1, b2, ..., bn are floating-point numbers.
  • the system architecture of the present application is the logical architecture of the chip 100, including a control unit 1, an arithmetic unit 2 and a storage unit 3 (for example, a cache). The control unit 1, the arithmetic unit 2 and the storage unit 3 are pairwise connected through internal buses.
  • the control unit 1 is configured to send instructions to the storage unit 3 and the operation unit 2 to control the storage unit 3 and the operation unit 2 .
  • the operation unit 2 is configured to receive the instruction sent by the control unit 1, and perform corresponding processing according to the instruction, for example, to perform the floating-point multiplication calculation method provided in this application.
  • the storage unit 3 may also be called a cache, and data may be stored in the storage unit 3, for example, floating-point numbers to be calculated may be stored.
  • the operation unit 2 may include an arithmetic operation ALU 20 for performing arithmetic operations, and a logic operation ALU 21 for performing logical operations.
  • the arithmetic ALU 20 may be provided with sub-units that perform basic operations such as addition (add), subtraction (sub), multiplication (mul) and division (div) and their derived operations, and is also provided with a multi-mode floating-point arithmetic sub-unit 22 that can perform the floating-point arithmetic method provided in this application.
  • the logic operation ALU 21 may be provided with sub-units that respectively perform operations such as shifting, logical AND (and), logical OR (or), and comparison of two values.
  • the chip 100 may also be connected to the memory 200 for data interaction and instruction transmission with the memory 200 .
  • the memory 200 is connected to the control unit 1 and the storage unit 3, and the control unit 1 can obtain the instructions or data stored in the memory 200. For example, the control unit 1 reads an instruction from the memory 200 and sends it to the operation unit 2, and the operation unit 2 executes the instruction.
  • the logic architecture of the chip 100 shown in FIG. 4 may be the logic architecture of any chip, for example, a central processing unit (CPU) chip, a graphics processing unit (GPU) chip, a field-programmable gate array (FPGA) chip, an application-specific integrated circuit (ASIC) chip, a tensor processing unit (TPU) chip or another artificial intelligence (AI) chip, etc.
  • the floating-point arithmetic sub-unit 22 in the operation unit 2 further includes a disassembly circuit 221 and an operator 222. The floating-point arithmetic sub-unit 22 can disassemble floating-point numbers through the disassembly circuit 221 and calculate the disassembled floating-point numbers through the operator 222, so as to realize calculation of floating-point numbers of various precisions in various modes.
  • the disassembly circuit 221 is used to obtain the mode and the floating point number to be calculated included in the calculation instruction, and disassemble the floating point number to be calculated according to a preset rule.
  • the mode is used to indicate the operation type of the floating-point number to be calculated, and the operation type may include vector inner product operation, vector element-by-element multiplication operation, vector element accumulation operation, and so on.
  • the operator 222 is configured to complete the processing of the calculation instruction according to the mode in the above calculation instruction and the disassembled floating-point number to be calculated.
  • the operator 222 may include a floating-point number multiplier 2221 and a floating-point number adder 2222 .
  • the operation of the above-mentioned disassembly circuit 221 disassembling the floating-point number to be calculated according to a preset rule may be: disassembling the mantissa of the floating-point number to be calculated into a plurality of mantissa segments. After the disassembly is completed, the disassembly circuit 221 outputs the disassembled mantissa segments, the content of the sign segment of the floating-point number to be calculated, and the content of the exponent segment to the floating-point number multiplier 2221.
  • the floating-point number multiplier 2221 performs an XOR calculation on the contents of the sign segments of the floating-point numbers to be calculated, an addition calculation on the contents of the exponent segments, and multiplication on the disassembled mantissa segments. Then, the floating-point number multiplier 2221 outputs the sign XOR result, the exponent summation result and the mantissa-segment product results to the floating-point number adder 2222, and the floating-point number adder completes the addition of the mantissa-segment product results and outputs the calculation result as a floating-point number.
  • the floating-point multiplier can also perform conventional floating-point multiplication
  • the floating-point adder can also perform conventional floating-point addition.
  • in the following, the operation unit 2 is described by taking the case where it includes two disassembly circuits 221 as an example.
  • each disassembly circuit 221 can disassemble one floating-point number to be calculated.
  • the disassembly circuit 221 may include a floating-point number disassembly sub-circuit 211 and a mantissa disassembly sub-circuit 212, wherein the floating-point number disassembly sub-circuit 211 is used to disassemble the input floating-point number to be calculated into the sign, the exponent and the mantissa, and the mantissa disassembly sub-circuit 212 is used to disassemble the mantissa of the floating-point number to be calculated into a plurality of mantissa segments.
  • the floating-point multiplier in this embodiment of the present application can support the lowest-precision floating-point multiplication. Therefore, the mantissa of the lowest-precision floating-point number does not need to be disassembled.
  • the bit width of each mantissa segment to be disassembled can be smaller than or equal to the maximum bit width of the mantissa supported by the floating-point multiplier.
  • the disassembly can be set so that the mantissa bit width of the lowest-precision floating-point number and the bit widths of the mantissa segments obtained by disassembling the mantissas of the various types of higher-precision floating-point numbers are similar.
  • the dismantling circuit can be preset to dismantle various types of floating-point numbers.
  • when an existing floating-point multiplier is multiplexed, multiple floating-point multipliers can be used to process the disassembled floating-point numbers in parallel.
  • the disassembly circuit 221 may first determine the type of the floating-point number to be calculated, and then disassemble the mantissa of the floating-point number to be calculated according to the preset disassembly method corresponding to that type to obtain a plurality of mantissa segments. Disassembly methods can be preset for the various types of floating-point numbers.
  • the setting principle of the disassembly method of floating-point numbers is: when multiplexing an existing floating-point number multiplier, the maximum mantissa bit width a supported by the multiplier for the lowest-precision floating-point number can first be determined; then, with a as the maximum mantissa-segment bit width, the number of mantissa segments into which each type of floating-point number is disassembled is determined.
  • the floating-point multiplier can also be redesigned as needed.
  • the redesigned floating-point multiplier needs to support multiplication of the lowest-precision floating-point number, and the maximum mantissa bit width it supports must be greater than or equal to the bit widths of the mantissa segments obtained by disassembling the various types of floating-point numbers.
  • when setting the disassembly method and designing the floating-point multiplier, the maximum mantissa bit width supported by the redesigned floating-point multiplier, the mantissa bit width of the lowest-precision floating-point number, and the bit widths of the mantissa segments obtained by disassembling the various types of higher-precision floating-point numbers can also be made as close as possible.
  • for FP16: usually FP16 is the lowest-precision floating-point number; therefore, the mantissa of FP16 does not need to be disassembled.
  • the mantissa of FP32 can be disassembled into 2 mantissa segments, each with a bit width of 12 bits.
  • each mantissa segment disassembled from FP32 is 12 bits; the disassembly of FP64 is required to make the mantissa bit width of FP16, the bit widths of the mantissa segments obtained by disassembling the mantissa of FP32, and the bit widths of the mantissa segments of FP64 similar to each other and to the maximum mantissa bit width supported by the floating-point multiplier.
  • the mantissa of FP64 can be disassembled into 4 mantissa segments, of which 3 mantissa segments have a bit width of 13 bits and 1 mantissa segment has a bit width of 14 bits.
  • each mantissa segment is 12bits.
  • the disassembly circuit 221 may include a floating-point number disassembly sub-circuit corresponding to FP16, a floating-point number disassembly sub-circuit and a mantissa disassembly sub-circuit corresponding to FP32, and a floating-point number disassembly sub-circuit and a mantissa disassembly sub-circuit corresponding to FP64. In addition, the disassembly circuit 221 may also include an output selection circuit, which can select, according to the mode, the disassembly result output by the corresponding floating-point number disassembly sub-circuit or mantissa disassembly sub-circuit.
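The disassembly rule above can be sketched in Python (our own naming; the text says one FP64 segment is 14 bits and three are 13 bits but does not say which, so the lowest-order segment is assumed to carry the extra bit here):

```python
# Mantissa segment widths from the text, low-order segment first.
SEGMENT_WIDTHS = {
    "FP16": [11],            # lowest precision: the mantissa is not split
    "FP32": [12, 12],        # 24-bit mantissa -> two 12-bit segments
    "FP64": [14, 13, 13, 13] # 53-bit mantissa; extra bit assumed on the low segment
}

def split_mantissa(mantissa, kind):
    """Split an integer mantissa into segments, low-order segment first."""
    segments = []
    for width in SEGMENT_WIDTHS[kind]:
        segments.append(mantissa & ((1 << width) - 1))
        mantissa >>= width
    return segments
```

Shifting each segment back by the sum of the widths below it and adding recovers the original mantissa, which is what lets the partial products be recombined exactly.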
  • N floating-point number multipliers 2221 may be provided in the operation unit 2 .
  • each floating-point number multiplier can independently calculate one complete floating-point number multiplication, which includes the sign XOR, the exponent addition and the mantissa multiplication.
  • the number N of floating-point number multipliers 2221 may be the square of the number m of mantissa segments into which the mantissa of the highest-precision floating-point number supported by the operation unit 2 is split; that is, N = m².
  • when the number of floating-point number multipliers is N, the length of the lowest-precision floating-point number vector supported by the operation unit 2 is N, the length of a supported higher-precision floating-point number vector is N/o², where o is the number of mantissa segments into which the mantissa of that higher-precision floating-point number is split, the length of a still-higher-precision floating-point number vector is N/p², where p is the corresponding number of mantissa segments, and so on.
  • for example, the number of floating-point number multipliers 2221 can be 16; the lowest-precision floating-point number FP16 vector supported by the operation unit 2 then has a length of 16, the supported FP32 vector has a length of 4, and the supported FP64 vector has a length of 1.
  • the bit width of the exponent adder of each floating-point number multiplier needs to be greater than or equal to the bit width required for exponent calculation of the lowest-precision floating-point number; among them, N/o² floating-point multipliers have exponent adders whose bit width is greater than or equal to the bit width required for exponent calculation of the corresponding higher-precision floating-point number, and among those N/o² floating-point multipliers, N/p² have exponent adders whose bit width is greater than or equal to the bit width required for exponent calculation of the still-higher-precision floating-point number, and so on.
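The sizing rule above amounts to one integer division (a sketch with our own naming; `segments` is the number of mantissa segments for the precision in question):

```python
# Supported vector length for a given precision: N / o^2, where N is the
# number of floating-point multipliers and o the mantissa segment count.
def supported_vector_length(n_multipliers, segments):
    return n_multipliers // (segments ** 2)

# With FP64 split into 4 segments, N = 4^2 = 16 multipliers are provided:
# FP16 (1 segment) -> 16, FP32 (2 segments) -> 4, FP64 (4 segments) -> 1.
```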
  • the number of floating-point number adders 2222 is related to the number of floating-point numbers that the floating-point number adder 2222 can support for simultaneous calculation and the maximum length of the lowest-precision floating-point number vector supported by the operation unit 2 .
  • for example, the operation unit 2 supports lowest-precision floating-point number (such as FP16) vectors with a maximum length of 16, and one floating-point number adder 2222 can simultaneously calculate the addition of 4 floating-point numbers, or the addition of 2 floating-point numbers.
  • the floating-point number adders may be arranged in groups. The first group of floating-point number adders can perform the addition of the mantissa-segment product results of floating-point numbers, or the addition of floating-point numbers. For the vector inner product operation of FP32 vectors of length 4, the first group of floating-point number adders completes the addition of the mantissa-segment product results, so that the summation result of the four mantissa-segment product results can be obtained for each element product. Adding the resulting 4 floating-point numbers then requires 2 floating-point number adders, which can serve as the second group of floating-point number adders, and one floating-point number adder in a third group is also required to add the two sums obtained by the second group.
  • for another example, the operation unit 2 supports lowest-precision floating-point number (such as FP16) vectors with a maximum length of 16, and one floating-point number adder can calculate the addition of two floating-point numbers. Then, in order to realize the inner product operation of floating-point number vectors and the element accumulation operation of floating-point number vectors, the floating-point number adders can be divided into 4 groups: the first group includes 8 floating-point number adders, the second group includes 4, the third group includes 2, and the fourth group includes 1.
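The grouped arrangement in this second example (two-input adders, maximum vector length 16) behaves like a binary reduction tree. A short sketch (our own naming) sums 16 values and counts the adders used at each level:

```python
# Model of the grouped adder tree: each level halves the number of values;
# the group sizes are the number of two-input adders used per level.
def adder_tree_sum(values):
    level = list(values)
    groups = []                 # adders used at each level of the tree
    while len(level) > 1:
        groups.append(len(level) // 2)
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
    return level[0], groups

# For 16 inputs, groups comes out as [8, 4, 2, 1], matching the four groups
# of 8, 4, 2 and 1 adders described above.
```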
  • when the floating-point number adder adds complete floating-point numbers (for example FP16), it can complete the comparison of the exponents, the calculation of the exponent difference, the alignment of the mantissas and the addition of the mantissas.
  • the operation unit 2 may further include a normalization processing circuit 2223.
  • the normalization processing circuit can perform conventional mantissa rounding and exponent conversion operations.
  • the mantissa rounding operation, that is, performing a rounding operation on the mantissa of the floating-point number to be output and converting it into a standard format, such as the IEEE 754 standard format.
  • the mantissa bit widths corresponding to FP16, FP32, and FP64 are 11bits, 24bits, and 53bits respectively;
  • the exponent conversion operation is to convert the exponent of the floating-point number to be output into the corresponding exponent format of a standard floating-point number, such as the IEEE 754 standard format.
  • for example, for FP16 the exponent bit width is 5 bits; when the exponent value exceeds the representable range, the exponent value is corrected to 5'b11111, where 5'b represents a 5-bit binary number; when the exponent value is less than -14 and the integer bit of the mantissa is 0, the exponent value is corrected to 5'b0.
  • for FP32, the bit width of the exponent is 8 bits.
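The FP16 exponent-correction rule above can be sketched as a small function (our own naming; the bias of 15 is the standard IEEE 754 half-precision bias, which the text does not state explicitly):

```python
# Sketch of the FP16 exponent conversion: clamp out-of-range exponents.
def fp16_exponent_field(unbiased_exp, mantissa_integer_bit):
    if unbiased_exp > 15:
        return 0b11111        # overflow: exponent field forced to 5'b11111
    if unbiased_exp < -14 and mantissa_integer_bit == 0:
        return 0b00000        # subnormal range: exponent corrected to 5'b0
    return unbiased_exp + 15  # normal case: apply the FP16 bias of 15
```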
  • the embodiment of the present application also provides a method for calculating floating-point numbers.
  • the method can be implemented by the above-mentioned arithmetic unit, and the arithmetic unit may include a disassembly circuit and an arithmetic unit.
  • the method may include the following processing flow:
  • Step 801 the disassembly circuit acquires the mode included in the calculation instruction and the floating-point number to be calculated.
  • in implementation, the control unit obtains the calculation instruction from the storage unit or the memory and sends it to the operation unit.
  • the disassembly circuit in the arithmetic unit receives the calculation instruction, and obtains the mode and the floating-point number to be calculated carried in the calculation instruction.
  • the floating-point numbers to be calculated can be two floating-point scalars of the same type, two floating-point scalars of different types, two floating-point number vectors of the same type and the same length, or two floating-point number vectors of different types and the same length.
  • the length of the two floating-point number vectors that can be input to the operation unit is related to the number of floating-point number multipliers in the operation unit. Specifically, when the number of floating-point number multipliers is N, the length of the lowest-precision floating-point number vector supported by the operation unit is N, and the length of a supported higher-precision floating-point number vector is N/o², where o is the number of mantissa segments into which the mantissa of the higher-precision floating-point number is split, and so on.
  • Step 802 The disassembly circuit disassembles the floating-point number to be calculated according to a preset rule, wherein the mode is used to indicate the operation type of the floating-point number to be calculated.
  • the operation type indicated by the mode may include element-by-element multiplication of vectors, inner product of vectors, accumulation of vector elements, and the like.
  • the disassembly circuit can disassemble the mantissa of the floating-point number to be calculated according to the type of the floating-point number to be calculated and the stored number of mantissa segments and bit width of each mantissa segment corresponding to that type.
  • the disassembled mantissa segments, sign and exponent are output to the operator. When outputting the mantissa segments, they need to be sorted according to a preset fixed order, so that the mantissa segments of the mantissas of the different floating-point numbers to be calculated that need to be multiplied can be combined in all possible ways.
  • step 802 The disassembly method of step 802 is described below in conjunction with the operation unit shown in FIG.
  • each disassembly circuit can disassemble one of the FP16 vectors.
  • the floating-point number disassembly sub-circuit in the disassembly circuit disassembles each FP16 into a group of {sign (sign), exponent (exp), mantissa (mts)} according to the bit widths occupied by the sign, exponent and mantissa in FP16.
  • the mantissa obtained by dismantling refers to the mantissa including integer bits.
  • specifically, FP16 is disassembled into three parts in the order of 1 bit, 5 bits and 10 bits from high to low: the 1 bit of the first part is the sign, the 5 bits of the second part are the exp, and a hidden integer bit 1 is prepended to the highest of the 10 bits of the third part to obtain the 11-bit mts.
  • for an FP16 vector, 16 sets of {sign, exp, mts} can be disassembled. Because the floating-point multiplier in the embodiment of the present application supports multiplication of the lowest-precision floating-point number, it is not necessary to disassemble the mantissa of the lowest-precision floating-point number FP16.
  • the disassembly circuit inputs each obtained set of {sign, exp, mts} into a floating-point multiplier. The sets can be input in order according to the position of each set {sign, exp, mts} in the FP16 vector, and the two sets of {sign, exp, mts} corresponding to the floating-point numbers to be calculated at the same position in the two different vectors are input to the same floating-point number multiplier.
  • the two vectors are vector A(a1,a2...a16) and vector B(b1,b2...b16).
  • the first floating-point number a1 to be calculated in vector A can be disassembled to obtain ⁇ signA1, expA1, mtsA1 ⁇
  • the first floating-point number b1 to be calculated in vector B can be disassembled to obtain ⁇ signB1, expB1, mtsB1 ⁇
  • ⁇ signA1, expA1, mtsA1 ⁇ and ⁇ signB1, expB1, mtsB1 ⁇ can be input into the same floating-point multiplier.
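The FP16 disassembly above can be sketched directly on a 16-bit pattern (plain Python, our own function name; the hidden integer bit 1 applies to normal numbers):

```python
# Disassemble a 16-bit FP16 pattern into {sign, exp, mts}: 1/5/10 bits from
# high to low, with the hidden integer bit 1 prepended to form the 11-bit mts.
def disassemble_fp16(bits16):
    sign = bits16 >> 15           # 1-bit sign
    exp = (bits16 >> 10) & 0x1F   # 5-bit exponent
    frac = bits16 & 0x3FF         # 10 stored fraction bits
    mts = (1 << 10) | frac        # 11-bit mantissa with the hidden 1
    return sign, exp, mts

# 1.5 in FP16 is 0x3E00 -> sign 0, exp 15, mts 0b11000000000
```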
  • each disassembly circuit can disassemble one of the FP32 vectors.
  • the floating-point number disassembly sub-circuit in the disassembly circuit disassembles each FP32 into a set of ⁇ sign, exp, mts ⁇ according to the bit width occupied by the sign, exponent and mantissa in the FP32.
  • specifically, FP32 is disassembled into three parts in the order of 1 bit, 8 bits and 23 bits from high to low. The 1 bit of the first part is the sign, the 8 bits of the second part are the exp, and a hidden integer bit 1 is prepended to the highest of the 23 bits of the third part to obtain the 24-bit mts.
  • for an FP32 vector of length 4, 4 sets of {sign, exp, mts} can be obtained by disassembly, and the disassembled mantissas can be input to the mantissa disassembly sub-circuit.
  • the mantissa dismantling subcircuit dismantles the input mts according to the preset dismantling method for FP32.
  • the preset disassembly method of FP32 is to disassemble the mantissa into two mantissa segments, and the bit width of each mantissa segment is 12 bits.
  • the two FP32 vectors are vector C(c1, c2, c3, c4) and vector D(d1, d2, d3, d4).
  • for vector C, according to the bit widths occupied by the sign, exponent and mantissa in FP32, the floating-point numbers in vector C are disassembled into {signC1, expC1, mtsC1}, {signC2, expC2, mtsC2}, {signC3, expC3, mtsC3} and {signC4, expC4, mtsC4} respectively.
  • mtsC1 will be disassembled into mtsC10 and mtsC11
  • mtsC2 will be disassembled into mtsC20 and mtsC21
  • mtsC3 will be disassembled into mtsC30 and mtsC31
  • mtsC4 will be disassembled into mtsC40 and mtsC41, where mtsC10, mtsC20, mtsC30 and mtsC40 represent the low-order mantissa segments, and mtsC11, mtsC21, mtsC31 and mtsC41 represent the high-order mantissa segments.
  • similarly, the signs obtained by disassembling vector D include signD1, signD2, signD3 and signD4, the exponents obtained by disassembling include expD1, expD2, expD3 and expD4, and the mantissa segments obtained by disassembling include mtsD10, mtsD11, mtsD20, mtsD21, mtsD30, mtsD31, mtsD40 and mtsD41, where mtsD10, mtsD20, mtsD30 and mtsD40 represent the low-order mantissa segments, and mtsD11, mtsD21, mtsD31 and mtsD41 represent the high-order mantissa segments.
  • the mantissa segments for each mantissa in the first FP32 vector are arranged in the order of ⁇ mts1, mts1, mts0, mts0 ⁇ , and then each mantissa segment is output to a floating-point multiplier.
  • the mantissa segments for each mantissa in the second FP32 vector are arranged in the order of ⁇ mts1, mts0, mts1, mts0 ⁇ , and then each mantissa segment is output to a floating-point multiplier.
  • for example, for the first floating-point number c1 to be calculated in vector C, the mantissa segments of its mantissa mtsC1 can be arranged as {mtsC11, mtsC11, mtsC10, mtsC10}; correspondingly, for the first floating-point number d1 to be calculated in vector D, the mantissa segments of its mantissa mtsD1 can be arranged as {mtsD11, mtsD10, mtsD11, mtsD10}.
  • the above-mentioned sorting method of the mantissa segments is only an example, and the purpose of sorting the output is to make the mantissa segments of the mantissas of the floating-point numbers to be calculated at the corresponding positions in the two vectors to be combined in various possible ways.
  • the output arrangement is not limited in the embodiments of the present application, as long as it is ensured that the output is in a fixed arrangement and the above purpose is achieved.
  • the mantissa segments at the same position in the two arrangements are input into the same floating-point number multiplier; for example, the first mantissa segment in the arrangement corresponding to mtsC1 and the first mantissa segment in the arrangement corresponding to mtsD1 can be input into the same floating-point number multiplier.
  • Each disassembly circuit can disassemble one of the FP64s.
  • the floating-point number disassembly subcircuit disassembles each FP64 into ⁇ sign, exp, mts ⁇ according to the bit width occupied by the sign, exponent and mantissa in FP64.
  • specifically, FP64 is disassembled into three parts in the order of 1 bit, 11 bits and 52 bits from high to low. The 1 bit of the first part is the sign, the 11 bits of the second part are the exp, and a hidden integer bit 1 is prepended to the highest of the 52 bits of the third part to obtain the 53-bit mts.
  • the mantissa disassembly subcircuit disassembles the received mts according to the preset disassembly method for the FP64.
  • the preset disassembly method of FP64 is to disassemble the mantissa into 4 mantissa segments, and the bit width of each mantissa segment is 13bits, 13bits, 13bits and 14bits respectively.
  • the two floating point numbers to be computed are E and F.
  • for E, it can first be disassembled into {signE, expE, mtsE} according to the bit widths occupied by the sign, exponent and mantissa in FP64, and then mtsE can be disassembled into mtsE3, mtsE2, mtsE1 and mtsE0, where mtsE3, mtsE2, mtsE1 and mtsE0 represent the mantissa segments from high to low.
  • F can be disassembled into ⁇ signF, expF, mtsF ⁇ , and then, mtsF can be disassembled into mtsF3, mtsF2, mtsF1 and mtsF0, where mtsF3, mtsF2, mtsF1 and mtsF0 represent the mantissa segment from high to low .
  • the mantissa segments of the first FP64's mantissa are arranged in the order of {mts3,mts3,mts2,mts3,mts2,mts1,mts3,mts2,mts1,mts0,mts2,mts1,mts0,mts1,mts0,mts0}, and then each mantissa segment is output to a floating-point multiplier.
  • the mantissa segments of the second FP64's mantissa are arranged in the order of {mts3,mts2,mts3,mts1,mts2,mts3,mts0,mts1,mts2,mts3,mts0,mts1,mts2,mts0,mts1,mts0}, and then each mantissa segment is output to a floating-point multiplier.
  • the mantissa segment of the mantissa mtsE of the floating-point number E can be arranged as ⁇ mtsE3, mtsE3, mtsE2, mtsE3, mtsE2, mtsE1, mtsE3, mtsE2, mtsE1, mtsE0, mtsE2, mtsE1, mtsE0, mtsE1, mtsE0, mtsE0 ⁇ .
  • the mantissa segments of the mantissa mtsF of the floating-point number F can be arranged as {mtsF3, mtsF2, mtsF3, mtsF1, mtsF2, mtsF3, mtsF0, mtsF1, mtsF2, mtsF3, mtsF0, mtsF1, mtsF2, mtsF0, mtsF1, mtsF0}.
  • the above-mentioned sorting method of the mantissa segments is only an example, and the purpose of sorting the output is to make the mantissa segments of the mantissas of the floating-point numbers to be calculated at the corresponding positions in the two vectors to be combined in various possible ways.
  • the output arrangement is not limited in the embodiments of the present application, as long as it is ensured that the output is in a fixed arrangement and the above purpose is achieved.
  • the disassembled sign and exp only need to be output to the floating-point multiplier to which the first mantissa segment in the ordering of the mantissa segments corresponding to that mantissa is input.
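The FP64 orderings can be checked the same way as the FP32 case: pairing the two 16-entry arrangements position by position covers every (i, j) combination of the four segments of one mantissa with the four of the other exactly once:

```python
# Segment-index orderings for the two FP64 mantissas, as listed above.
order_e = [3, 3, 2, 3, 2, 1, 3, 2, 1, 0, 2, 1, 0, 1, 0, 0]
order_f = [3, 2, 3, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 0, 1, 0]

pairs = set(zip(order_e, order_f))
# pairs contains all 16 (i, j) combinations with i, j in {0, 1, 2, 3},
# so the 16 multipliers compute all 16 partial products of the two mantissas.
```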
  • before outputting a mantissa segment to the floating-point number multiplier, the mantissa segment may be padded with 0s in the high-order bits, so that the bit width of the mantissa segment after zero-padding is the same as the multiplication bit width supported by the floating-point number multiplier.
  • Step 803 The operator completes the processing of the calculation instruction according to the mode and the disassembled floating-point number to be calculated.
  • step 803 may be implemented by a floating-point number multiplier and a floating-point number adder in the operator. Specifically, as shown in FIG. 9 , step 803 may include the following processing flow:
  • Step 8031: The floating-point multiplier in the operator performs an XOR calculation on the signs of the input disassembled floating-point numbers to be calculated, an addition calculation on their exponents, and multiplication on their mantissa segments, and outputs the sign XOR result, the exponent summation result and the mantissa-segment product result to the floating-point number adder in the operator.
  • Each floating-point multiplier multiplies its input floating-point numbers; specifically, it performs an XOR calculation on the two input signs, an addition calculation on the two input exponents, and a multiplication calculation on the two input mantissa segments. The 16 floating-point multipliers can execute in parallel.
  • Each floating-point multiplier can output the sign XOR result, the exponent summation result and the mantissa segment product result to the normalization processing circuit, which normalizes them to obtain a normalized FP16.
  • the normalization processing circuit can obtain 4 normalized FP16 outputs as the result of element-by-element multiplication of vectors.
  • when the normalization processing circuit normalizes the input sign XOR result, exponent summation result and mantissa segment product result, the processing is the same as the normalization of the sign, exponent and mantissa of a conventional floating-point number.
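The per-element flow just described (sign XOR, exponent addition, mantissa product, then normalization) can be sketched for FP16 as follows. The helper names are hypothetical, and subnormals, rounding and overflow handling are deliberately omitted:

```python
def fp16_fields(bits):
    # Hypothetical helper: split an IEEE-754 binary16 bit pattern into
    # sign (1 bit), exponent (5 bits) and stored mantissa (10 bits).
    return (bits >> 15) & 1, (bits >> 10) & 0x1F, bits & 0x3FF

def fp16_mul(abits, bbits):
    sa, ea, ma = fp16_fields(abits)
    sb, eb, mb = fp16_fields(bbits)
    s = sa ^ sb                      # sign: XOR of the two signs
    e = ea + eb - 15                 # exponent: add, remove one bias
    m = (0x400 | ma) * (0x400 | mb)  # 11-bit mantissas with implicit leading 1
    # Normalization: the 22-bit product lies in [1.0, 4.0) in fixed point;
    # if it reached [2.0, 4.0), shift right once and bump the exponent.
    if m & (1 << 21):
        m >>= 1
        e += 1
    m = (m >> 10) & 0x3FF            # truncate back to 10 stored bits (no rounding)
    return (s << 15) | (e << 10) | m

# 1.5 (0x3E00) * 2.0 (0x4000) = 3.0 (0x4200)
assert fp16_mul(0x3E00, 0x4000) == 0x4200
```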
  • the input is two FP16 vectors of length 16:
  • Each floating-point multiplier multiplies its input floating-point numbers; specifically, it performs an XOR calculation on the two input signs, an addition calculation on the two input exponents, and a multiplication calculation on the two input mantissa segments.
  • 16 floating-point multipliers can be executed in parallel to obtain 16 floating-point product results.
  • the 16 floating-point product results output by the floating-point multipliers are divided into 4 groups, each of which is output to one of the 4 floating-point adders of the first group.
  • Each floating-point multiplier multiplies its input mantissa segments, and the 16 floating-point multipliers can perform the mantissa segment multiplications in parallel. For the input signs and exponents, the floating-point multiplier also performs a sign XOR operation and an exponent addition operation.
  • in this way, the product results of 16 mantissa segments can be obtained. The 16 mantissa segment product results are divided into 4 groups, and each group is output to one floating-point adder in the first group of 4 floating-point adders, where the mantissa segment product results in the same group come from the same pair of floating-point numbers to be calculated.
  • the mantissa segment product results included in the first group can be: mtsC11*mtsD11, mtsC11*mtsD10, mtsC10*mtsD11, mtsC10*mtsD10;
  • the mantissa segment product results included in the second group can be: mtsC21*mtsD21, mtsC21*mtsD20, mtsC20*mtsD21, mtsC20*mtsD20;
  • the mantissa segment product results included in the third and fourth groups can be deduced by analogy.
  • Each floating-point multiplier multiplies its input mantissa segments, and the 16 floating-point multipliers can perform the mantissa segment multiplications in parallel. For the input signs and exponents, the floating-point multiplier also needs to perform a sign XOR operation and an exponent addition operation.
  • the product results of the 16 mantissa segments obtained by the 16 floating-point multipliers can be divided into 4 groups, and each group is output to one floating-point number adder in the first group of 4 floating-point number adders.
  • the mantissa segment product results included in the first group can be: mtsE3*mtsF3, mtsE3*mtsF2, mtsE2*mtsF3, mtsE3*mtsF1;
  • the mantissa segment product results included in the second group can be: mtsE2*mtsF2, mtsE1*mtsF3, mtsE1*mtsF2, mtsE0*mtsF3;
  • the mantissa segment product results included in the third group can be: mtsE3*mtsF0, mtsE2*mtsF1, mtsE2*mtsF0, mtsE0*mtsF2; the mantissa segment product results included in the fourth group can be: mtsE1*mtsF1, mtsE1*mtsF0, mtsE0*mtsF1, mtsE0*mtsF0.
  • Step 8032: The floating-point adder performs an addition calculation on the input mantissa segment product results to obtain the mantissa segment summation result, and outputs the calculation result of the floating-point numbers to be calculated according to the mode in the calculation instruction, the mantissa segment summation result, the sign XOR result and the exponent summation result.
  • Each floating-point adder of the first group obtains the corresponding fixed shift values according to the type of floating-point number to be calculated indicated by the input mode. Then, the input mantissa segment product results are exponent-aligned according to the fixed shift values, and the aligned mantissa segment product results are added to obtain the first-stage summation result.
  • in this way, the first group of 4 floating-point adders can obtain 4 first-stage summation results. Then, each floating-point adder outputs a first-stage summation result and the corresponding sign result and exponent summation result to the normalization processing circuit.
  • the normalization processing circuit normalizes each input first-stage summation result together with the corresponding sign result and exponent summation result, and outputs a normalized FP32.
  • the normalization processing circuit can obtain 4 normalized FP32s and output them as the result of element-by-element multiplication of vectors.
  • the fixed shift values are pre-calculated and stored: because the mantissa segments output by the disassembly circuit are output to the corresponding floating-point multipliers in a fixed order, and the output of each floating-point multiplier is fixedly output to the corresponding floating-point adder, the floating-point adder can store the fixed shift values in advance. The fixed shift values can also differ for different types of floating-point numbers to be calculated.
  • each fixed shift value is determined by the positions and bit widths, within the mantissa of the floating-point number to be calculated, of the mantissa segments corresponding to the mantissa segment product result.
  • the fixed shift value corresponding to FP32 is illustrated below as an example.
  • the mantissa segment of the floating-point number c1 (FP32) to be calculated includes mtsC11 and mtsC10, and the mantissa segment of the floating-point number d1 to be calculated includes mtsD11 and mtsD10.
  • the mantissa segment product results include mtsC11*mtsD11, mtsC11*mtsD10, mtsC10*mtsD11, and mtsC10*mtsD10.
  • taking mtsC10*mtsD10 as the base, its fixed shift value is 0. Because the difference between the sum of the lowest bit positions of the two mantissa segments corresponding to mtsC10*mtsD11 and that of the two mantissa segments corresponding to mtsC10*mtsD10 is 12, the fixed shift value of mtsC10*mtsD11 is 12.
  • the fixed shift value of mtsC11*mtsD10 is 12, and the fixed shift value of mtsC11*mtsD11 is 24. That is, the fixed shift value stored corresponding to FP32 can be 0, 12, 12, and 24 in sequence.
  • when the floating-point adder adds mtsC11*mtsD11, mtsC11*mtsD10, mtsC10*mtsD11 and mtsC10*mtsD10, it shifts mtsC11*mtsD10 and mtsC10*mtsD11 to the left by 12 bits and mtsC11*mtsD11 to the left by 24 bits, and then sums the shifted mtsC11*mtsD10, mtsC10*mtsD11 and mtsC11*mtsD11 with mtsC10*mtsD10.
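A small numeric sketch (the 24-bit significand values below are hypothetical, with the implicit leading 1 included) confirms that summing the four segment products with these fixed shifts reproduces the full significand product:

```python
# Hypothetical 24-bit significands of c1 and d1 (implicit leading 1 included).
m_c = 0b110101101010001111000101
m_d = 0b101100010111111000101001

# Disassemble each 24-bit significand into a high and a low 12-bit segment.
mtsC11, mtsC10 = m_c >> 12, m_c & 0xFFF
mtsD11, mtsD10 = m_d >> 12, m_d & 0xFFF

# Align the four segment products with the fixed shift values 24, 12, 12, 0
# and sum them; the result equals the full 48-bit significand product.
aligned_sum = ((mtsC11 * mtsD11) << 24) \
            + ((mtsC11 * mtsD10) << 12) \
            + ((mtsC10 * mtsD11) << 12) \
            +  (mtsC10 * mtsD10)
assert aligned_sum == m_c * m_d
```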
  • the input is two FP32 vectors of length 4:
  • Each floating-point adder of the first group obtains the corresponding fixed shift values according to the type of floating-point number to be calculated indicated by the input mode. Then, the input mantissa segment product results are exponent-aligned according to the fixed shift values, and the aligned mantissa segment product results are added to obtain the first-stage summation result.
  • in this way, the first group of 4 floating-point adders can obtain 4 first-stage summation results. Then, the floating-point adders of the first group divide the 4 first-stage summation results into 2 groups, each of which is output to a floating-point adder of the second group. When a first-stage summation result is output to a floating-point adder of the second group, the sign result and exponent summation result corresponding to that first-stage summation result are also output to the floating-point adder of the second group.
  • the floating-point adder of the second group compares the two input exponent summation results to find the larger one and calculates the exponent difference. Then, according to the calculated exponent difference, the two input first-stage summation results are exponent-aligned, and the aligned first-stage summation results are added to obtain the second-stage summation result.
  • in this way, the floating-point adders of the second group can obtain 2 second-stage summation results (here, what the floating-point adder of the second group performs is essentially a complete floating-point addition calculation, so the second-stage summation result it outputs is a complete floating-point number), which are then output to the floating-point adder of the third group.
  • the floating-point adder of the third group performs an addition calculation on the second-stage summation results to obtain the third-stage summation result. Finally, the floating-point adder of the third group outputs the third-stage summation result to the normalization processing circuit. After the normalization processing circuit performs normalization processing, a normalized FP32 is obtained and output as the floating-point vector inner product calculation result.
  • Each floating-point adder of the first group obtains the corresponding fixed shift values according to the type of floating-point number to be calculated indicated by the input mode. Then, the input mantissa segment product results are exponent-aligned according to the fixed shift values, and the aligned mantissa segment product results are added to obtain the first-stage summation result.
  • in this way, the first group of 4 floating-point adders can obtain 4 first-stage summation results. Then, the floating-point adders of the first group divide the 4 first-stage summation results into 2 groups, each of which is output to a floating-point adder of the second group. At the same time, the input sign XOR results and exponent summation results are also output to the floating-point adders of the second group.
  • the fixed shift value corresponding to FP64 in the floating-point number adder of the first group will be illustrated below as an example.
  • the mantissa segment of the floating-point number E to be calculated includes mtsE3, mtsE2, mtsE1 and mtsE0, and the mantissa segment of the floating-point number F to be calculated includes mtsF3, mtsF2, mtsF1 and mtsF0.
  • the mantissa segment product results include the pairwise products of mtsE3, mtsE2, mtsE1 and mtsE0 with mtsF3, mtsF2, mtsF1 and mtsF0. Taking mtsE0*mtsF0 as the base, the fixed shift value of mtsE0*mtsF0 is 0;
  • the fixed shift value of mtsE0*mtsF1 is 13
  • the fixed shift value of mtsE1*mtsF0 is 13
  • the fixed shift value of mtsE1*mtsF1 is 26. The above 4 mantissa segment product results form one group, whose addition calculation is performed by one floating-point adder, and the fixed shift values stored in that floating-point adder corresponding to FP64 can be 0, 13, 13 and 26 in sequence.
  • similarly, taking mtsE2*mtsF0 as the base, the fixed shift value of mtsE2*mtsF0 is 0, that is, no shift is required;
  • the fixed shift value of mtsE0*mtsF2 is 0;
  • the fixed shift value of mtsE2*mtsF1 is 13;
  • the fixed shift value of mtsE3*mtsF0 is 13;
  • the above 4 mantissa segment product results form one group, whose addition is performed by one floating-point adder, and the fixed shift values stored in that floating-point adder corresponding to FP64 can be 0, 0, 13 and 13 in sequence.
  • similarly, taking mtsE1*mtsF2 as the base, the fixed shift value of mtsE1*mtsF2 is 0, that is, no shift is required;
  • the fixed shift value of mtsE0*mtsF3 is 0;
  • the fixed shift value of mtsE1*mtsF3 is 13;
  • the fixed shift value of mtsE2*mtsF2 is 13;
  • the above 4 mantissa segment product results form one group, whose addition is performed by one floating-point adder, and the fixed shift values stored in that floating-point adder corresponding to FP64 can be 0, 0, 13 and 13 in sequence.
  • similarly, taking mtsE3*mtsF1 as the base, the fixed shift value of mtsE3*mtsF1 is 0;
  • the fixed shift value of mtsE2*mtsF3 is 13;
  • the fixed shift value of mtsE3*mtsF2 is 13;
  • the fixed shift value of mtsE3*mtsF3 is 26;
  • the above 4 mantissa segment product results form one group, whose addition is performed by one floating-point adder, and the fixed shift values stored in that floating-point adder corresponding to FP64 can be 0, 13, 13 and 26 in sequence.
  • when the floating-point adder adds mtsE0*mtsF0, mtsE0*mtsF1, mtsE1*mtsF0 and mtsE1*mtsF1, it first shifts mtsE0*mtsF1 and mtsE1*mtsF0 to the left by 13 bits and mtsE1*mtsF1 to the left by 26 bits, and then adds the shifted mtsE0*mtsF1, mtsE1*mtsF0 and mtsE1*mtsF1 to mtsE0*mtsF0.
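A minimal numeric check of this first group (hypothetical 26-bit low halves built from two 13-bit segments each) shows that the fixed shifts 0, 13, 13 and 26 reproduce the product of the two low halves:

```python
# Hypothetical low halves of two FP64 significands, each made of two
# 13-bit segments (mtsX1 = bits 25..13, mtsX0 = bits 12..0).
mE_low, mF_low = 0x2F3A9C1, 0x1B07E55          # 26-bit values
mtsE1, mtsE0 = mE_low >> 13, mE_low & 0x1FFF
mtsF1, mtsF0 = mF_low >> 13, mF_low & 0x1FFF

# First-group addition with the fixed shift values 0, 13, 13, 26.
P1 = (mtsE0 * mtsF0) \
   + ((mtsE0 * mtsF1) << 13) \
   + ((mtsE1 * mtsF0) << 13) \
   + ((mtsE1 * mtsF1) << 26)
assert P1 == mE_low * mF_low
```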
  • the floating-point adders of the second group exponent-align the input first-stage summation results according to the fixed shift values, then perform the addition calculation to obtain the second-stage summation results, and output them to the floating-point adder of the third group.
  • the input sign XOR result and the exponent sum result are also output to the floating-point number adder of the third group.
  • assume that the four first-stage summation results are P1, P2, P3 and P4, where P1 is obtained by adding mtsE1*mtsF1, mtsE1*mtsF0, mtsE0*mtsF1 and mtsE0*mtsF0 after shifting, P2 is obtained by adding mtsE3*mtsF0, mtsE2*mtsF1, mtsE2*mtsF0 and mtsE0*mtsF2 after shifting, P3 is obtained by adding mtsE2*mtsF2, mtsE1*mtsF3, mtsE1*mtsF2 and mtsE0*mtsF3 after shifting, and P4 is obtained by adding mtsE3*mtsF3, mtsE3*mtsF2, mtsE2*mtsF3 and mtsE3*mtsF1 after shifting.
  • P1 and P2 are taken as one group. Taking P1 as the base, the bit-position difference between the lowest bit of the base mantissa segment product result in P2 and the lowest bit of the base mantissa segment product result mtsE0*mtsF0 in P1 is 26, so the fixed shift value of P2 is 26; that is, the fixed shift values corresponding to FP64 stored in the corresponding floating-point adder can be 0 and 26 in sequence.
  • P3 and P4 are taken as one group, where, taking P3 as the base, the fixed shift value of P4 is 13; that is, the fixed shift values corresponding to FP64 stored in the corresponding floating-point adder can be 0 and 13 in sequence.
  • when the floating-point adder of the second group performs the addition calculation on P1 and P2, it first shifts P2 to the left by 26 bits, and then adds the shifted P2 to P1. When adding P3 and P4, it first shifts P4 to the left by 13 bits, and then adds the shifted P4 to P3.
  • the floating-point adder of the third group exponent-aligns the input second-stage summation results according to the fixed shift values, and then performs the addition calculation to obtain the third-stage summation result.
  • the fixed shift values corresponding to FP64 in the floating-point adder of the third group are illustrated below as an example.
  • assume that adding the above P1 and P2 yields the second-stage summation result Q1, and adding P3 and P4 yields the second-stage summation result Q2.
  • the fixed shift value of Q1 is 0, that is, no shift is required
  • the fixed shift value of Q2 is 39. That is, the fixed shift values corresponding to FP64 stored in the floating-point adder of the third group are 0 and 39 in sequence.
  • when adding Q1 and Q2, the floating-point adder of the third group first shifts Q2 to the left by 39 bits, and then adds the shifted Q2 to Q1.
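The whole three-stage FP64 addition tree can be checked end to end. The significand values below are hypothetical, and a uniform 13-bit split of a 52-bit value is used only to keep the illustration simple (a real FP64 significand has 53 bits including the implicit 1):

```python
def split13(m):
    """Split a 52-bit value into four 13-bit segments, index 0 = least significant."""
    return [(m >> (13 * i)) & 0x1FFF for i in range(4)]

mE = 0xB3C59A214F0D7   # hypothetical 52-bit significand values
mF = 0x6E1028D9C44AB
E, F = split13(mE), split13(mF)

# First stage: four groups of segment products with their fixed shifts.
P1 = E[0]*F[0] + ((E[0]*F[1] + E[1]*F[0]) << 13) + (E[1]*F[1] << 26)
P2 = E[2]*F[0] + E[0]*F[2] + ((E[2]*F[1] + E[3]*F[0]) << 13)
P3 = E[1]*F[2] + E[0]*F[3] + ((E[2]*F[2] + E[1]*F[3]) << 13)
P4 = E[3]*F[1] + ((E[3]*F[2] + E[2]*F[3]) << 13) + (E[3]*F[3] << 26)

# Second stage: P2 is shifted by 26 relative to P1, P4 by 13 relative to P3.
Q1 = P1 + (P2 << 26)
Q2 = P3 + (P4 << 13)

# Third stage: Q2 is shifted by 39 relative to Q1; the tree reproduces
# the full 104-bit significand product.
assert Q1 + (Q2 << 39) == mE * mF
```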
  • the input is two FP16 vectors of length 16:
  • Each floating-point number adder in the first group adds the product results of the four input floating-point numbers to obtain the first-stage addition result.
  • the floating-point number adder of the first group can obtain four first-stage addition results, and then divide the four first-stage addition results into two groups and output them to the floating-point number adder of the second group respectively.
  • the floating-point adders of the second group perform an addition calculation on the input first-stage addition results to obtain two second-stage addition results.
  • the floating-point adders of the second group output the two second-stage addition results to the floating-point adders of the third group.
  • the floating-point adder of the third group performs an addition calculation on the second-stage summation results to obtain the third-stage summation result. Finally, the floating-point adder of the third group outputs the third-stage summation result to the normalization processing circuit, and the normalization processing circuit performs normalization processing to obtain a normalized FP16 as the vector inner product output.
  • an element-accumulation operation of a floating-point number vector may also be implemented.
  • the input floating-point number to be calculated is a floating-point number vector.
  • in this case, a floating-point number vector of the same type as the input floating-point number vector to be calculated can be generated first, with the value of each element in the generated floating-point number vector being 1.
  • the input floating-point number vector to be calculated and the generated floating-point number vector can be jointly used as the floating-point number vectors to be calculated.
  • the processing of the floating-point vector element accumulation operation in the above steps 801 to 8032 is otherwise the same as the processing of the floating-point vector inner product operation in the above steps 801 to 8032, and is not repeated here.
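The accumulation trick above reduces element accumulation to an inner product against an all-ones vector; a plain-Python sketch (the input values are hypothetical):

```python
v = [1.5, -2.25, 4.0, 0.5]    # hypothetical floating-point vector to accumulate
ones = [1.0] * len(v)          # generated vector with every element equal to 1

# Inner product against the all-ones vector equals the element accumulation.
inner = sum(a * b for a, b in zip(v, ones))
assert inner == sum(v) == 3.75
```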
  • an embodiment of the present application also provides a device for calculating floating-point numbers.
  • the device may be the above-mentioned arithmetic unit. As shown in FIG. 11 , the device includes:
  • the disassembly module 130 is used to obtain the mode and the floating-point numbers to be calculated included in the calculation instruction, and to disassemble the floating-point numbers to be calculated according to a preset rule, wherein the mode is used to indicate the operation type for the floating-point numbers to be calculated;
  • the calculation module 131 is configured to complete the processing of the calculation instruction according to the mode and the disassembled floating-point number to be calculated.
  • the apparatus in the embodiments of the present application may be implemented by an application-specific integrated circuit (ASIC) or a programmable logic device (PLD), and the above-mentioned PLD may be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), generic array logic (GAL), or any combination thereof.
  • the floating-point number to be calculated is a high-precision floating-point number
  • the disassembly module is used for:
  • the to-be-calculated floating-point number is decomposed into multiple low-precision floating-point numbers according to the mantissa of the to-be-calculated floating-point number.
  • the exponent bit width of the disassembled floating-point number to be calculated is larger than the exponent bit width of the floating-point number to be calculated.
  • the disassembly module 130 is used for:
  • the floating-point number to be calculated is decomposed into a sign, an exponent and a mantissa, and the mantissa of the floating-point number to be calculated is decomposed into a plurality of mantissa segments.
  • the calculation module 131 includes a floating-point number multiplication calculation unit and a floating-point number addition calculation unit;
  • the floating-point number multiplication calculation unit is used to perform an XOR calculation on the signs of the disassembled floating-point numbers to be calculated to obtain the sign XOR result, perform an addition calculation on the exponents of the disassembled floating-point numbers to be calculated to obtain the exponent summation result, multiply the disassembled mantissa segments from different floating-point numbers to be calculated, and output the mantissa segment product results;
  • the floating-point number addition calculation unit is configured to perform an addition calculation on the mantissa segment product results to obtain the mantissa segment summation result, and to obtain the calculation result of the floating-point numbers to be calculated according to the mode, the mantissa segment summation result, the sign XOR result and the exponent summation result.
  • when the device for calculating floating-point numbers provided in the above embodiment calculates floating-point numbers, the division into the above functional modules is used only as an example for illustration. In practical applications, the above functions can be allocated to different functional modules as required, that is, the internal structure of the computing device can be divided into different functional modules to complete all or part of the functions described above.
  • the device for calculating floating-point numbers provided in the above embodiments belongs to the same concept as the embodiments of the method for calculating floating-point numbers, and the specific implementation process is detailed in the method embodiments, which will not be repeated here.
  • the embodiment of the present application also provides a chip, the structure of the chip may be the same as that of the chip 100 shown in FIG. 1 , and the chip can implement the floating-point number calculation method provided by the embodiment of the present application.
  • an embodiment of the present application provides a computing device 1300 .
  • the computing device 1300 includes at least a processor 1301 , a bus system 1302 , a memory 1303 , a communication interface 1304 and a memory unit 1305 .
  • the processor 1301 may be a general-purpose central processing unit (CPU), a network processor (NP), a graphics processing unit (GPU), a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits used to control the execution of the program of the present application.
  • the bus system 1302 described above may include a path to transfer information between the above described components.
  • the above-mentioned memory 1303 may be a read-only memory (ROM) or another type of static storage device that can store static information and instructions, or a random access memory (RAM) or another type of dynamic storage device that can store information and instructions.
  • it may also be an electrically erasable programmable read-only memory (EEPROM), a compact disc read-only memory (CD-ROM) or other optical disc storage (including compact discs, laser discs, digital versatile discs, Blu-ray discs, etc.), magnetic disk storage media or other magnetic storage devices, or any other medium that can carry or store desired program code in the form of instructions or data structures and can be accessed by a computer, without being limited thereto.
  • the memory can exist independently and be connected to the processor through a bus.
  • the memory can also be integrated with the processor.
  • the memory unit 1305 is used to store the application code for executing the solution of the present application, and the execution is controlled by the processor 1301 .
  • the processor 1301 is configured to execute the application program code stored in the memory unit 1305, thereby implementing the floating-point number calculation method proposed in this application.
  • the processor 1301 may include one or more processors 1301 .
  • the communication interface 1304 is used to enable the connection and communication of the computing device 1300 with external devices.
  • with the above solution, the computing device can obtain multiple low-precision floating-point numbers by disassembling the floating-point numbers to be calculated, and multiple floating-point multipliers perform operations on the disassembled floating-point numbers in parallel, so that the same computing device can support operations on floating-point numbers of different precisions without providing a dedicated computing unit for operations on floating-point numbers of a specified precision, and the compatibility of the entire computing device is stronger.
  • the number of different precision floating-point number operators is reduced, and the cost is reduced.
  • multiple floating-point multipliers can respectively perform operations on the disassembled floating-point numbers in parallel, the processing delay is reduced and the processing efficiency is improved.
  • the above embodiments may be implemented in whole or in part by software, hardware, firmware or any combination thereof.
  • the above-described embodiments may be implemented in whole or in part in the form of a computer program product.
  • the computer program product includes one or more computer instructions. When the computer program instructions are loaded or executed on a computer, all or part of the processes or functions described in the embodiments of the present application are generated.
  • the computer may be a general purpose computer, special purpose computer, computer network, or other programmable device.
  • the computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another computer-readable storage medium; for example, the computer instructions may be transmitted from a website, computer, server or data center to another website, computer, server or data center in a wired manner (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or a wireless manner (e.g., infrared, radio, microwave).
  • the computer-readable storage medium can be any available medium that can be accessed by a computer or a data storage device such as a server, a data center, or the like that contains one or more sets of available media.
  • the usable media may be magnetic media (eg, floppy disks, hard disks, magnetic tapes), optical media (eg, DVDs), or semiconductor media.
  • the semiconductor medium may be a solid state drive (SSD).

Abstract

An arithmetic unit, a method and an apparatus for floating-point number calculation, belonging to the field of computer technology. The arithmetic unit consists of a disassembly circuit (221) and an operator (222). The disassembly circuit (221) can obtain the mode and the floating-point numbers to be calculated included in a calculation instruction, and disassemble the floating-point numbers to be calculated according to a preset rule. Then, the operator (222) completes the processing of the calculation instruction according to the mode and the disassembled floating-point numbers to be calculated.

Description

Arithmetic Unit, Method and Apparatus for Floating-Point Number Calculation, Chip and Computing Device

Technical Field

The present application relates to the field of computer technology, and in particular to an arithmetic unit, a method and an apparatus for floating-point number calculation, a chip and a computing device.

Background

Floating-point numbers are an important numeric format in computers. In a computer, a floating-point number consists of three parts: a sign, an exponent and a mantissa. To meet the different data precision requirements of different services, a computer usually needs to support multiple floating-point calculation types.

At present, to implement different floating-point operation types, multiple independent arithmetic units are usually designed correspondingly, and each arithmetic unit can implement one floating-point operation type.

In the process of implementing the present application, the inventors found that the related art has at least the following problems:

When multiple arithmetic units that respectively support different floating-point operation types are independently designed in a chip, and the system performs floating-point operations using only the arithmetic unit of one of these operation types, the remaining arithmetic units are idle, which greatly wastes computing resources.
Summary

The present application provides an arithmetic unit, a method and an apparatus for floating-point number calculation, a chip and a computing device, to improve the utilization and processing efficiency of a chip.

In a first aspect, an arithmetic unit is provided. The arithmetic unit includes a disassembly circuit and an operator. The disassembly circuit is configured to obtain the mode and the floating-point numbers to be calculated included in a calculation instruction, and to disassemble the floating-point numbers to be calculated according to a preset rule, where the mode is used to indicate the operation type for the floating-point numbers to be calculated. The operator is configured to complete the processing of the calculation instruction according to the mode and the disassembled floating-point numbers to be calculated.

The control unit in the processor can obtain the calculation instruction from the storage unit or the memory and send it to the arithmetic unit. The disassembly circuit in the arithmetic unit receives the calculation instruction and, according to the type of the floating-point numbers to be calculated and the stored number of mantissa segments and bit width of each mantissa segment corresponding to that type of floating-point number, disassembles the mantissas of the floating-point numbers to be calculated, and outputs the disassembled mantissa segments, signs and exponents to the operator. The operator performs the corresponding processing on the input mantissa segments, signs and exponents of the floating-point numbers to be calculated according to the mode, to obtain the calculation result. That is, in the solution shown in the present application, floating-point operations of different precisions and operation types can be implemented by a single arithmetic unit, so the arithmetic unit has higher applicability.

In a possible implementation, the floating-point numbers to be calculated are high-precision floating-point numbers, and the disassembly circuit is configured to disassemble each floating-point number to be calculated into multiple low-precision floating-point numbers according to its mantissa.

The disassembly circuit can disassemble a high-precision floating-point number to be calculated into multiple low-precision floating-point numbers; low-precision floating-point multipliers and low-precision floating-point adders can then be reused to perform the corresponding processing, without separately designing a high-precision floating-point multiplier or a high-precision floating-point adder, which can save operator cost.

In a possible implementation, the exponent bit width of the disassembled floating-point numbers to be calculated is larger than the exponent bit width of the floating-point numbers to be calculated.

In this solution, the floating-point numbers to be calculated can be disassembled into floating-point numbers of a specified type, and the floating-point numbers of the specified type may be non-standard floating-point numbers. To satisfy the shift condition of the exponent, it is only necessary to ensure that the exponent bit width of the floating-point numbers of the specified type is larger than the exponent bit width of the floating-point numbers to be calculated.

In a possible implementation, the disassembly circuit is configured to disassemble each floating-point number to be calculated into a sign, an exponent and a mantissa, and to disassemble the mantissa of each floating-point number to be calculated into multiple mantissa segments.

The disassembly circuit can disassemble the mantissa of a floating-point number to be calculated. So that the floating-point multiplier can be reused by floating-point multiplication calculations of multiple precisions, the floating-point multiplier in the embodiments of the present application can support the lowest-precision floating-point multiplication; therefore, the mantissa of the lowest-precision floating-point number does not need to be disassembled. When the mantissa of a high-precision floating-point number is disassembled, the bit width of each disassembled mantissa segment can be made less than or equal to the maximum mantissa bit width supported by the floating-point multiplier. In addition, so that the mantissa multiplier resources in each floating-point multiplier are fully utilized when different types of floating-point numbers are calculated, the mantissa bit width of the lowest-precision floating-point number and the bit width of each mantissa segment obtained by disassembling the mantissa of each type of high-precision floating-point number can be made similar.
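As a concrete illustration of this sizing rule, one assumed split scheme (the widths are illustrative, not values mandated by the application) keeps every segment within a single shared multiplier width:

```python
# Significand widths of the IEEE-754 formats: stored mantissa bits + implicit 1.
SIGNIFICAND_BITS = {"FP16": 11, "FP32": 24, "FP64": 53}

# Assumed disassembly scheme: the lowest precision is not split, and the
# higher precisions are split into roughly equal segments.
SEGMENT_WIDTHS = {
    "FP16": [11],               # lowest precision: no disassembly needed
    "FP32": [12, 12],           # two roughly equal segments
    "FP64": [14, 13, 13, 13],   # four roughly equal segments
}

for fmt, widths in SEGMENT_WIDTHS.items():
    assert sum(widths) == SIGNIFICAND_BITS[fmt]   # segments cover the significand
    assert max(widths) <= 14                      # each fits a 14-bit multiplier (assumption)
```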
In a possible implementation, the operator includes a floating-point multiplier and a floating-point adder; the floating-point multiplier is used to perform the multiplication operations on the disassembled floating-point numbers to be calculated, and the floating-point adder is used to perform the addition operations on the disassembled floating-point numbers to be calculated.

In a possible implementation, the operator includes multiple floating-point multipliers and a floating-point adder. A first floating-point multiplier among the multiple floating-point multipliers is configured to perform an XOR calculation on the signs of the input disassembled floating-point numbers to be calculated, perform an addition calculation on the exponents of the input disassembled floating-point numbers to be calculated, perform a multiplication calculation on the mantissa segments of the input disassembled floating-point numbers to be calculated, and output the sign XOR result, the exponent summation result and the mantissa segment product result to the floating-point adder. A second floating-point multiplier among the multiple floating-point multipliers is configured to perform, in parallel, a multiplication calculation on the mantissa segments of the input disassembled floating-point numbers to be calculated, and output the mantissa segment product result to the floating-point adder. The floating-point adder is configured to perform an addition calculation on the input mantissa segment product results to obtain the mantissa segment summation result, and to output the calculation result of the floating-point numbers to be calculated according to the mode, the mantissa segment summation result, the sign XOR result and the exponent summation result.

Multiple floating-point multipliers can be provided in the arithmetic unit, and the multiple floating-point multipliers can perform mantissa segment multiplication calculations in parallel, or perform floating-point multiplication calculations in parallel, which can effectively improve the calculation efficiency of floating-point numbers.

In a possible implementation, the operator includes x² floating-point multipliers and a floating-point adder, and the disassembly circuit is configured to disassemble the mantissa of each floating-point number to be calculated into x mantissa segments, where x is an integer greater than 1.

The arithmetic unit can be provided with x² floating-point multipliers, at least one floating-point adder, and at least one disassembly circuit, where x is the number of mantissa segments into which the mantissa of the highest-precision floating-point number supported by the arithmetic unit is disassembled. The multiple multipliers process the disassembled floating-point numbers in parallel, improving the efficiency of floating-point operations.

In a possible implementation, the disassembly circuit is configured to obtain the mode and the floating-point number vectors to be calculated included in the calculation instruction, disassemble each floating-point number to be calculated in each floating-point number vector to be calculated into a sign, an exponent and a mantissa, disassemble the mantissa of each floating-point number to be calculated into multiple mantissa segments, and output sign combinations, exponent combinations and mantissa segment combinations to the first floating-point multiplier, where each sign combination includes the signs disassembled from a pair of floating-point numbers to be calculated, each exponent combination includes the exponents disassembled from a pair of floating-point numbers to be calculated, each mantissa segment combination includes two mantissa segments disassembled from a pair of floating-point numbers to be calculated, and each pair of floating-point numbers to be calculated includes two floating-point numbers to be calculated from different floating-point number vectors to be calculated. The first floating-point multiplier is configured to perform an XOR calculation on the signs in the input sign combination, perform an addition calculation on the exponents in the input exponent combination, perform a multiplication calculation on the mantissa segments in the input mantissa segment combination, and output the sign XOR result, the exponent summation result and the mantissa segment product result to the floating-point adder. The second floating-point multiplier is configured to perform, in parallel, a multiplication calculation on the mantissa segments in the input mantissa segment combination, and output the mantissa segment product result to the floating-point adder. The floating-point adder is configured to perform an addition calculation on the input mantissa segment product results that come from the same pair of floating-point numbers to be calculated, to obtain the mantissa segment summation result corresponding to each pair of floating-point numbers to be calculated, and to output the vector calculation result according to the mode, the mantissa segment summation result corresponding to each pair of floating-point numbers to be calculated, the sign XOR results and the exponent summation results. In this way, the arithmetic unit can implement calculations on floating-point number vectors.

The present application can implement calculations related to floating-point number vectors. When the calculation instruction includes floating-point number vectors to be calculated, the disassembly circuit first disassembles each vector into floating-point scalars, and then disassembles each floating-point scalar into three parts: a sign, an exponent and a mantissa. For a high-precision floating-point number, the mantissa is further disassembled to obtain multiple mantissa segments. Then, the signs, exponents and mantissa segments are output to the floating-point multipliers. A floating-point multiplier performs an XOR calculation on the two input signs, an addition calculation on the input exponents, and a multiplication calculation on the input mantissa segments. The resulting sign XOR result, exponent summation result and mantissa segment product result are then output to the floating-point adder, which performs exponent-aligned addition on the mantissa segments and outputs the result to the normalization processing circuit, which performs normalization processing and outputs the result.

In a possible implementation, the mode indicates that the operation type for the floating-point number vectors to be calculated is a vector element-wise multiplication operation; the floating-point adder is configured to output the mantissa segment summation result, the sign XOR result and the exponent summation result corresponding to each pair of floating-point numbers to be calculated as one element product result.

A vector element-wise multiplication operation can be implemented in the present application. For the vector element-wise multiplication operation, the floating-point adder only needs to output the mantissa segment summation result, the sign XOR result and the exponent summation result corresponding to each pair of floating-point numbers to be calculated to the normalization processing circuit for output.

In a possible implementation, the mode indicates that the operation type for the floating-point number vectors to be calculated is a vector inner product operation; the floating-point adder is configured to exponent-align the mantissa segment summation results corresponding to each pair of floating-point numbers to be calculated according to the exponent summation result corresponding to each pair, perform an addition calculation on the aligned mantissa segment summation results, and output the vector inner product operation result.

A vector inner product operation can also be implemented in the present application. For the vector inner product operation, the floating-point adder further needs to calculate the exponent difference according to the exponent summation result corresponding to each pair of floating-point numbers to be calculated, exponent-align the mantissa segment summation result corresponding to each pair based on the calculated exponent difference, and then perform an addition calculation on the aligned mantissa segment summation results. Finally, the calculation result is output to the normalization processing circuit; the calculation result is a complete floating-point number, including a sign, an exponent and a mantissa. The normalization processing circuit normalizes the calculation result and outputs it.

In a possible implementation, the mode indicates that the operation type for the floating-point number vectors to be calculated is a vector element accumulation operation;

the disassembly circuit is configured to obtain the mode and a first floating-point number vector included in the calculation instruction, and to generate a second floating-point number vector, where the type of each floating-point number to be calculated in the second floating-point number vector is the same as the type of the floating-point numbers to be calculated in the first floating-point number vector, the value of each floating-point number to be calculated in the second floating-point number vector is 1, and the first floating-point number vector and the second floating-point number vector are used as the floating-point number vectors to be calculated;

the floating-point adder is configured to exponent-align the mantissa segment summation results corresponding to each pair of floating-point numbers to be calculated according to the exponent summation result corresponding to each pair, perform an addition calculation on the aligned mantissa segment summation results, and output the vector element accumulation result.

A vector element accumulation operation can also be implemented in the present application. For the vector element accumulation operation, the input floating-point numbers to be calculated are a single floating-point number vector. After obtaining the calculation instruction and determining that the calculation type indicated by the mode is vector element accumulation, the disassembly circuit can first generate a floating-point number vector of the same type as the input floating-point number vector to be calculated, with the value of each element in the generated vector being 1. The input floating-point number vector to be calculated and the generated floating-point number vector can jointly serve as the floating-point number vectors to be calculated. The subsequent disassembly, multiplication and summation are the same as in the vector inner product operation.
第二方面,提供了一种浮点数计算的方法,该方法包括:获取计算指令中包括的模式和待计算浮点数;根据预设规则拆解所述待计算浮点数,其中,所述模式用于指示对所述待计算浮点数的运算类型;按照所述模式和拆解后的待计算浮点数完成所述计算指令的处理。
处理器中的控制单元可以在存储单元或者内存的获取计算指令,并发送给运算单元。运算单元中的接收该计算指令并根据待计算浮点数的类型,以及存储该类型的浮点数对应的尾数拆解段数和各尾数段的位宽,将对待计算浮点数的尾数进行拆解,并将拆解后的尾数段、 符号和阶码进行相应的处理,得到计算结果。即,在本申请所示的方案中,由一个运算单元即可实现不用运算类型的运算。
在一种可能的实现方式中,所述待计算浮点数为高精度浮点数,所述根据预设规则拆解所述待计算浮点数,包括:根据所述待计算浮点数的尾数将所述待计算浮点数拆解为多个低精度浮点数。
运算单元可以将高精度的待计算浮点数拆解为多个低精度浮点数,然后,可以复用低精度浮点数乘法器以及低精度浮点数加法器执行相应处理,而不用单独设计高精度浮点数乘法器或高精度浮点数加法器,可以节约运算器成本。
在一种可能的实现方式中,拆解后的待计算浮点数的阶码位宽大于所述待计算浮点数的阶码位宽。
运算单元可以对将待计算浮点数拆解为指定类型的浮点数,该指定类型的待计算浮点数可以为非标准类型的浮点数,为了满足阶码的移位条件,只需保证该指定类型的浮点数的阶码位宽大于待计算浮点数的阶码位宽即可。
在一种可能的实现方式中,所述根据预设规则拆解所述待计算浮点数,包括:将所述待计算浮点数拆解为符号、阶码和尾数,将所述待计算浮点数的尾数拆解为多个尾数段。
运算单元可以对待计算浮点数的尾数进行拆解。为了使运算单元中的浮点数乘法器可以被多种精度浮点数乘法计算复用,本申请实施例中浮点数乘法器可以支持最低精度浮点数乘法,因此,对于最低精度浮点数的尾数可以不用拆解。对于高精度浮点数的尾数在拆解时,可以使拆解的每个尾数段的位宽小于等于浮点数乘法器支持的尾数最大位宽。此外,为了使进行不同类型浮点数的计算时,每个浮点数乘法器中的尾数乘法器资源得到充分利用,可以使最低精度浮点数的尾数位宽、各类型的高精度浮点数的尾数拆解得到的每个尾数段的位宽相近。
在一种可能的实现方式中,运算单元对拆解后的待计算浮点数的符号进行异或计算,得到符号异或结果,对拆解后的待计算浮点数的阶码进行加法计算,得到阶码加和结果,对拆解后的来自不同待计算浮点数的尾数段进行乘法计算,输出尾数段乘积结果。并对尾数段乘积结果进行加法计算,得到尾数段加和结果,然后,根据模式、尾数段加和结果、符号异或结果以及阶码加和结果,得到对待计算浮点数的计算结果。采用一个运算单元,即可完成不同精度的浮点数在不同模式下的运算。
在一种可能的实现方式中,获取计算指令中包括的模式和待计算浮点数,根据预设规则拆解所述待计算浮点数,包括:获取计算指令中包括的模式和待计算浮点数向量,将每个待计算浮点数向量中的待计算浮点数拆解为符号、阶码和尾数,得到多个符号组合、阶码组合以及尾数段组合,其中,每个符号组合包括拆解自一对待计算浮点数的符号,每个阶码组合包括拆解自一对待计算浮点数的阶码,每个尾数段组合包括拆解自一对待计算浮点数的两个尾数段,每对待计算浮点数包括来自不同待计算浮点数向量的两个待计算浮点数;所述对拆解后的待计算浮点数的符号进行异或计算,得到符号异或结果,对拆解后的待计算浮点数的阶码进行加法计算,得到阶码加和结果,对拆解后的来自不同待计算浮点数的尾数段进行乘法计算,得到尾数段乘积结果,包括:对每个符号组合中的符号进行异或计算,得到所述符号组合对应的符号异或结果,对每个阶码组合中的阶码进行加法计算,得到阶码加和结果,对每个尾数段组合中的尾数段进行乘法计算,得到尾数段乘积结果;所述对所述尾数段乘积结果进行加法计算,得到尾数段加和结果,根据所述模式、所述尾数段加和结果、所述符号异或结果以及所述阶码加和结果,得到对所述待计算浮点数的计算结果,包括:对来自同一 对待计算浮点数的尾数段乘积结果,按照每个尾数段乘积结果对应的固定移位值,进行加法计算,得到每对待计算浮点数对应的尾数段加和结果,根据所述模式、每对待计算浮点数对应的尾数段加和结果、符号异或结果以及阶码加和结果,输出向量计算结果。
本申请可以实现对浮点数向量的相关计算,在计算指令中包括待计算浮点数向量时,运算单元先将向量拆解为浮点数标量,再将每个浮点数标量拆解为符号、阶码和尾数三部分。对于高精度浮点数,还要继续对尾数进行拆解,得到多个尾数段。然后,对两浮点数向量中对应位置的两浮点数标量的符号进行异或计算,对阶码进行加和计算,对尾数段进行乘法计算。再将得到的尾数段乘积结果进行对阶加和,并输出至规格化处理电路,由规格化处理电路进行规格化处理后输出。
在一种可能的实现方式中,所述模式指示对所述待计算浮点数向量的运算类型为向量逐元素乘运算;所述根据所述模式、每对待计算浮点数对应的尾数段加和结果、符号异或结果以及阶码加和结果,输出所述多个待计算浮点数向量对应的向量计算结果,包括:将每对待计算浮点数对应的尾数段加和结果、符号异或结果和阶码加和结果,作为一个元素乘积结果输出。
本申请中可以实现向量逐元素乘运算,对于向量逐元素乘运算,运算单元只需将每对待计算浮点数对应的尾数段加和结果、符号异或结果和阶码加和结果输出至规格化处理电路,由规格化处理电路进行规格化处理后输出即可。
在一种可能的实现方式中,所述模式指示对所述待计算浮点数向量的运算类型为向量内积运算;所述根据所述模式、每对待计算浮点数对应的尾数段加和结果、符号异或结果以及阶码加和结果,输出所述多个待计算浮点数向量对应的向量计算结果,包括:根据每对待计算浮点数对应的阶码加和结果,对每对待计算浮点数对应的尾数段加和结果进行对阶,对对阶后的各尾数段加和结果进行加法计算,输出向量内积运算结果。
本申请中还可以实现向量内积运算,对于向量内积运算,运算单元还需要根据每对待计算浮点数对应的阶码加和结果计算阶差,并基于计算出的阶差对每对待计算浮点数对应的尾数段加和结果进行对阶,对对阶后的各尾数段加和结果再进行加法计算。最后,向规格化处理电路输出计算结果,计算结果为完整的浮点数,包括符号、阶码和尾数。由规格化处理电路对计算结果进行规格化处理后,输出即可。
在一种可能的实现方式中,模式指示对所述待计算浮点数向量的运算类型为向量元素累加运算;所述获取计算指令中包括的模式和待计算浮点数,包括:获取计算指令中包括的模式和第一浮点数向量,生成第二浮点数向量,其中,所述第二浮点数向量中的各待计算浮点数的类型与所述第一浮点数向量中的待计算浮点数的类型相同,所述第二浮点数向量中的各待计算浮点数的值均为1,将所述第一浮点数向量和所述第二浮点数向量作为待计算浮点数向量;所述根据所述模式、每对待计算浮点数对应的尾数段加和结果、符号异或结果以及阶码加和结果,输出所述多个待计算浮点数向量对应的向量计算结果,包括:根据每对待计算浮点数对应的阶码加和结果,对每对待计算浮点数对应的尾数段加和结果进行对阶,对对阶后的各尾数段加和结果进行加法计算,输出向量元素累加结果。
本申请中还可以实现向量元素累加运算,对于向量元素累加运算,输入的待计算浮点数为一个浮点数向量。运算单元获取到计算指令后,确定模式指示的计算类型为向量元素累加,则可以先生成与输入的待计算浮点数向量类型相同的浮点数向量,且生成的浮点数向量中的各元素值均为1。输入的待计算浮点数向量和生成的浮点数向量可以共同作为待计算浮点数向量。接下来的拆解、乘积、加和等处理与向量内积运算相同。
第三方面,提供了一种浮点数计算的装置,所述装置包括用于执行第二方面或第二方面任一种可能实现方式中的浮点数计算方法的各个模块。
第四方面,提供了一种芯片,所述芯片包括至少一个如上述第一方面所述的运算单元。
第五方面,提供了一种计算设备,所述计算设备包括主板以及上述第四方面所述的芯片;所述芯片设置在所述主板上。
本申请实施例提供的技术方案带来的有益效果至少包括:
运算单元由拆解电路和运算器组成,拆解电路可以获取计算指令中包括的模式和待计算浮点数,并根据预设规则拆解所述待计算浮点数。然后,运算单元,再按照模式和拆解后的待计算浮点数完成计算指令的处理。在本申请中,计算指令中的模式用于指示对待计算浮点数的运算类型,即本申请中的一个运算单元可以实现多种不同的运算类型。
附图说明
图1是本申请实施例提供的一种浮点数组成示意图;
图2是本申请实施例提供的一种浮点数组成示意图;
图3是本申请实施例提供的一种浮点数组成示意图;
图4是本申请实施例提供的一种芯片的逻辑架构图;
图5是本申请实施例提供的一种运算单元的结构示意图;
图6是本申请实施例提供的一种拆解电路的结构示意图;
图7是本申请实施例提供的一种加法器排布的示意图;
图8是本申请实施例提供的一种浮点数计算的方法流程图;
图9是本申请实施例提供的一种浮点数计算的方法流程图;
图10是本申请实施例提供的一种运算单元的结构示意图;
图11是本申请实施例提供的一种浮点数计算的装置结构示意图;
图12是本申请实施例提供的一种计算设备的结构示意图。
具体实施方式
为了便于理解本申请实施例提供的技术方案,下面先对于几种常用类型的浮点数的组成、以及几种常用类型的浮点数向量计算进行介绍:
1、半精度浮点数:
如图1所示,半精度浮点数FP16在计算机的存储占用16bits,其中,包括符号、阶码和尾数。具体的,符号的位宽为1bit,阶码的位宽为5bits,尾数的位宽为10bits(尾数的小数部分)。其中,尾数除存储的10bits小数部分,还包括隐藏的1bit整数部分,即,尾数总共11bits。
2、单精度浮点数:
如图2所示,单精度浮点数FP32在计算机的存储占用32bits,其中,包括符号、阶码和尾数。具体的,符号的位宽为1bit,阶码的位宽为8bits,尾数的位宽为23bits(尾数的小数部分)。其中,尾数除存储的23bits小数部分,还包括隐藏的1bit整数部分,即,尾数总共24bits。
3、双精度浮点数:
如图3所示,双精度浮点数FP64在计算机的存储占用64bits,其中,包括符号、阶码和 尾数。具体的,符号的位宽为1bit,阶码的位宽为11bits,尾数的位宽为52bits(尾数的小数部分)。其中,尾数除存储的52bits小数部分,还包括隐藏的1bit整数部分,即,尾数总共53bits。
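上述字段划分可以用一段Python代码直观示意(仅为说明FP64字段布局的示例,借助标准库struct按IEEE754位模式解码,函数名fp64_fields为本示例假设,并非本申请的电路实现):

```python
import struct

def fp64_fields(x):
    # 取FP64的原始64bits位模式, 按1bit符号、11bits阶码、52bits尾数(小数部分)拆出各字段
    bits = struct.unpack('<Q', struct.pack('<d', x))[0]
    sign = bits >> 63
    exp = (bits >> 52) & 0x7FF
    frac = bits & ((1 << 52) - 1)
    # 规格化数的尾数还包括隐藏的1bit整数部分, 即尾数总共53bits
    mts = frac | (1 << 52) if exp != 0 else frac
    return sign, exp, mts
```

例如fp64_fields(-1.5)返回符号1、带偏移的阶码1023、含隐藏整数位的53bits尾数;对FP16、FP32只需将位宽相应替换为1/5/10与1/8/23。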
4、浮点数的向量逐元素乘(element-wise multiplication):
(c1, c2, …, cn) = (a1×b1, a2×b2, …, an×bn)
其中,(a1, a2, …, an)和(b1, b2, …, bn)为浮点数向量,a1, a2, …, an和b1, b2, …, bn为浮点数。
5、浮点数的向量内积运算:
c = a1×b1 + a2×b2 + … + an×bn
其中,(a1, a2, …, an)和(b1, b2, …, bn)为浮点数向量,a1, a2, …, an和b1, b2, …, bn为浮点数。
6、浮点数的元素累加运算:
设(a1, a2, …, an)为浮点数向量,元素累加为:c = a1 + a2 + … + an。
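上述三种向量运算的数学定义可以用如下Python代码示意(仅为数学定义层面的示例,向量取值为假设,与电路实现无关):

```python
a = [1.5, 2.0, -0.5, 4.0]
b = [2.0, 0.5, 4.0, 0.25]

elementwise = [x * y for x, y in zip(a, b)]  # 向量逐元素乘: ci = ai×bi
inner = sum(x * y for x, y in zip(a, b))     # 向量内积: c = Σ ai×bi
accumulate = sum(a)                          # 向量元素累加: c = Σ ai
```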
下面结合图4对本申请的系统架构进行说明:
如图4所示,本申请的系统架构为芯片100的逻辑架构,包括控制单元1、运算单元2和存储单元3(例如,Cache),控制单元1、运算单元2和存储单元3之间通过内部总线两两连接。控制单元1用于向存储单元3和运算单元2发送指令,以对存储单元3和运算单元2进行控制。运算单元2用于接收控制单元1发送的指令,并根据指令执行相应的处理,例如,执行本申请提供的浮点数乘法计算的方法。存储单元3也可以称为缓存,存储单元3中可以存储有数据,例如,可以存储有待计算浮点数。运算单元2可以包括用于执行算术运算的算术运算ALU20,以及用于执行逻辑运算的逻辑运算ALU21。其中,算术运算ALU20中可以设置有分别执行加(add)、减(sub)、乘(mul)、除(div)等基本运算及其附加运算的子单元,还设置有用于执行多模式浮点数运算的浮点数运算子单元22,可以执行本申请提供的浮点数计算的方法。逻辑运算ALU21中可以设置有分别执行移位、逻辑与(and)、逻辑或(or)以及两个值的比较等运算的子单元。
芯片100还可以与内存200连接,用于与内存200进行数据交互和指令传输。如图4所示,内存200与控制单元1和存储单元3连接,控制单元1可以从内存200中获得其存储的指令或数据。例如:控制单元1从内存200读取指令,进一步地发送给运算单元2,由运算单元2执行指令。
需要说明的是,图4所示的芯片100的逻辑架构可以为任意一种芯片的逻辑架构,例如,中央处理器(central processing unit,CPU)芯片、图形处理器(graphics processing unit,GPU)芯片、现场可编程门阵列(field-programmable gate array,FPGA)芯片、专用集成电路(application specific integrated circuits,ASIC)芯片、张量处理单元(tensor processing unit,TPU)芯片或其他人工智能(artificial intelligence,AI)芯片等。不同类型的芯片的主要区别在于控制单元1、存储单元3和运算单元2的比例不同。
接下来,结合图5进一步介绍图4中运算单元2。如图5所示,该运算单元2中的浮点数运算子单元22又包括拆解电路221和运算器222。该浮点数运算子单元22通过拆解电路221对浮点数的拆解,以及通过运算器222对拆解后浮点数的计算,可以实现多种模式下的多种精度浮点数的计算。
其中,拆解电路221,用于获取计算指令中包括的模式和待计算浮点数,并根据预设规则拆解待计算浮点数。其中,模式用于指示对待计算浮点数的运算类型,运算类型可以包括向量内积运算、向量逐元素乘运算、向量元素累加运算等。
运算器222,用于按照上述计算指令中的模式和拆解后的待计算浮点数完成计算指令的处理。该运算器222可以包括浮点数乘法器2221和浮点数加法器2222。
在一种可能的实现方式中,上述拆解电路221根据预设规则拆解待计算浮点数的操作可以为:将待计算浮点数的尾数拆解为多个尾数段。拆解完成后,拆解电路221将拆解后的尾数段以及待计算浮点数的符号段的内容、阶码段的内容输出至浮点数乘法器2221。浮点数乘法器2221对待计算浮点数的符号段的内容进行异或计算,对阶码段的内容进行加法计算,并对拆解的尾数段进行乘法运算。然后,浮点数乘法器2221将符号异或结果、阶码加和结果以及尾数段乘积结果输出至浮点数加法器2222,由浮点数加法器完成对尾数段乘积结果的加和,并将计算结果以浮点数的形式输出。
此外,上述浮点数乘法器还可以执行常规的浮点数乘法计算,上述浮点数加法器还可以执行常规的浮点数加法计算。
下面对于拆解电路221、浮点数乘法器2221和浮点数加法器2222做进一步的说明:
拆解电路221,为了提高对待计算浮点数的拆解效率,在同一运算单元2中可以设置两个或多个拆解电路221,为了便于描述,以下以同一运算单元2包括两个拆解电路221为例进行说明。在计算两个待计算浮点数的相关运算时,每个拆解电路221可以分别对一个待计算浮点数进行拆解。
如图5所示,拆解电路221可以包括浮点数拆解子电路211和尾数拆解子电路212,其中,浮点数拆解子电路211用于将输入的待计算浮点数拆解为符号、阶码和尾数,尾数拆解子电路212用于将待计算浮点数的尾数拆解为多个尾数段。
为了使浮点数乘法器可以被多种精度浮点数乘法计算复用,本申请实施例中浮点数乘法器可以支持最低精度浮点数乘法,因此,对于最低精度浮点数的尾数可以不用拆解。对于高精度浮点数的尾数在拆解时,可以使拆解的每个尾数段的位宽小于等于浮点数乘法器支持的尾数最大位宽。此外,为了使进行不同类型浮点数的计算时,每个浮点数乘法器中的尾数乘法器资源得到充分利用,可以使最低精度浮点数的尾数位宽、各类型的高精度浮点数的尾数拆解得到的每个尾数段的位宽相近。
在本申请中,可以对拆解电路预先设定对各种类型浮点数的拆解方式,例如,以浮点数乘法器所支持的最大尾数位宽对浮点数进行拆解,当存在多个浮点数乘法器时,可以使多个浮点数乘法器并行对拆解后的浮点数进行处理。示例地,拆解电路221在获取到待计算浮点数后,可以先确定待计算浮点数的类型。再按照预先设定的该类型的浮点数对应的拆解方式,对该待计算浮点数的尾数进行拆解,得到多个尾数段。
其中,浮点数的拆解方式的设置原则为:在复用已有的浮点数乘法器的情况下,可以确定最低精度浮点数乘法器运算器支持的最大尾数位宽a。然后,以a为最大尾数段位宽,确定每种类型的浮点数拆解的尾数段数。
此外,还可以根据需要重新设计浮点数乘法器,重新设计的浮点数乘法器需要支持最低精度浮点数的乘法计算,且其支持的最大尾数位宽要大于各类型浮点数拆解的尾数段的位宽。并且,为了使重新设计的浮点数乘法器的尾数乘法器资源得到充分利用,还可以在设定拆解方式以及设计浮点数乘法器时,使重新设计的浮点数乘法器支持的最大尾数位宽、最低精度浮点数的尾数位宽、以及各类型高精度浮点数拆解的各尾数段的位宽之间尽量相近。
下面对几种常见类型浮点数的尾数的拆解方式进行说明。
对于FP16来说,通常FP16为最低精度浮点数,因此,对于FP16的尾数可以不用拆解。
对于FP32来说,因为FP16的尾数共11bits,FP32的尾数共24bits,要使FP16的尾数位宽和FP32的尾数拆解得到的每个尾数段的位宽相近,可以将FP32的尾数拆解为2个尾数段,每个尾数段12bits。
对于FP64来说,因为FP16的尾数共11bits,FP32拆解的每个尾数段12bits,要使FP16的尾数位宽、FP32的尾数拆解得到的每个尾数段的位宽、FP64的尾数拆解得到的每个尾数段的位宽和浮点数乘法器支持的尾数最大位宽相近,可以将FP64的尾数拆解为4个尾数段,其中,3个尾数段位宽为13bits,1个尾数段位宽为14bits。
为了更清楚说明对于不同类型的浮点数的尾数的拆解,下面列举几个不同类型的浮点数的尾数段的拆解示例进行说明。
例如,对于FP32的尾数1.010 1010 1010 1010 1010 1010,可以拆解为2个尾数段,分别为:x 1=1010 1010 1010和x 2=1010 1010 1010,每个尾数段12bits。
又例如,对于FP64的尾数1.010 1010 1010 1010 1010 1010 1010 1010 1010 1010 1010 1010 0101 0,可以拆解为4个尾数段,分别为:y 1=1010 1010 1010 10、y 2=10 1010 1010 101,y 3=0 1010 1010 1010,y 4=1010 1010 0101 0,其中,y 1共14bits,y 2、y 3和y 4各13bits。
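按给定位宽序列拆解尾数段的过程可以用如下Python代码示意(示例代码,函数名split_mantissa为假设,非本申请的电路实现):

```python
def split_mantissa(mts, widths):
    # widths为由高位到低位的各尾数段位宽, 假设其总和等于尾数总位宽
    segs = []
    shift = sum(widths)
    for w in widths:
        shift -= w
        segs.append((mts >> shift) & ((1 << w) - 1))
    return segs

# FP32尾数(含隐藏整数位)共24bits, 拆为2个12bits的尾数段
hi, lo = split_mantissa(0b101010101010101010101010, [12, 12])
```

对FP64,可按位宽[14, 13, 13, 13](由高位到低位)得到4个尾数段。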
由于不同类型的浮点数的尾数的拆解规则不相同,因此,在浮点数乘法器的每种类型的浮点数运算中,可以分别对应有拆分该类型浮点数的浮点数拆解子电路以及尾数拆解子电路。
如图6所示,对于一个支持FP16、FP32以及FP64的运算单元2,其拆解电路221中可以包括有对应FP16的浮点数拆解子电路、对应FP32的浮点数拆解子电路以及尾数拆解子电路、对应FP64的浮点数拆解子电路以及尾数拆解子电路,此外,拆解电路221还可以包括有输出选择电路,该输出选择电路可以根据模式,选择对应的浮点数拆解子电路或者尾数拆解子电路输出的拆解结果进行输出。
浮点数乘法器2221
为了提高对待计算浮点数的尾数乘法的计算效率,运算单元2中可以设置有N个浮点数乘法器2221。其中,每个浮点数乘法器可以独立计算一组完整浮点数乘法,一组完整的浮点数乘法,包括符号异或、阶码相加以及尾数乘法。
在一种可能的实现方式中,浮点数乘法器2221的个数N可以为该运算单元2支持的最高精度浮点数的尾数拆分的尾数段个数m的平方,即N=m²。在浮点数乘法器个数为N的情况下,运算单元2支持的最低精度浮点数向量的长度为N,支持的较高精度浮点数向量的长度为N/o²,其中,o为较高精度浮点数的尾数拆分的尾数段个数,支持的更高精度浮点数向量的长度为N/p²,其中,p为更高精度浮点数的尾数拆分的尾数段个数,以此类推。
例如,运算单元2支持的最高精度浮点数为FP64,其尾数拆分的尾数段个数为4,则浮点数乘法器2221的个数可以为16,运算单元2支持的最低精度浮点数FP16向量的长度为16,支持的较高精度浮点数FP32向量的长度为16/4=4,支持的更高精度浮点数FP64向量的长度为16/16=1。
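上述乘法器个数与各精度向量长度的关系可以用如下Python代码验算(仅为算术关系的示例):

```python
m = 4                 # 最高精度FP64的尾数拆为m=4段
N = m ** 2            # 浮点数乘法器个数N = m² = 16
len_fp16 = N          # 最低精度FP16向量的长度为N
len_fp32 = N // 2**2  # FP32尾数拆为o=2段, 向量长度为N/o²
len_fp64 = N // 4**2  # FP64尾数拆为p=4段, 向量长度为N/p²
```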
为了能实现对多种类型的待计算浮点数的阶码相加,在这N个浮点数乘法器2221中,需要每个浮点数乘法器的阶码加法器的位宽大于等于最低精度浮点数的阶码计算位宽,且有N/o²个浮点数乘法器的阶码加法器的位宽大于等于较高精度浮点数的阶码计算位宽,且在N/o²个浮点数乘法器中,有N/p²个浮点数乘法器的阶码加法器的位宽大于等于更高精度浮点数的阶码计算位宽,以此类推。
浮点数加法器2222
为了能实现多种模式的浮点数运算类型,浮点数加法器2222可以有多个,并以树形排布。 具体的,浮点数加法器2222的个数与浮点数加法器2222可支持同时计算的浮点数个数以及运算单元2支持的最低精度浮点数向量的最大长度有关。
例如,运算单元2支持最低精度浮点数(如FP16)的最大长度为16,一个浮点数加法器2222可同时计算4个浮点数的加法计算,或可计算2个浮点数的加法计算。如图7所示,为了实现浮点数向量内积运算以及浮点数向量的元素累加运算,可以对浮点数加法器进行分组排布。第一组浮点数加法器可以执行浮点数的尾数段乘法结果的加和运算或者浮点数加法运算,对于长度为4的FP32向量的向量内积运算来说,在第一组浮点数加法器完成尾数段乘法结果的加和运算后,可以得到4个尾数段乘法结果的加和结果。对于这4个加和结果需要结合符号、阶码进行浮点数加法运算,考虑到一个浮点数加法器同时计算4个较高精度的浮点数加法时,会导致尾数对阶时移位过多,产生较大误差。因此,对于第一组浮点数加法器之后的执行浮点数加法运算的浮点数加法器来说,可以选择支持2个浮点数加法运算的浮点数加法器。这样,4个尾数段乘法结果的加和结果对应的4个浮点数加法运算,需要由2个浮点数加法器实现,这2个浮点数加法器可以作为第二组浮点数加法器。又因为向量内积运算需要对所有浮点数乘积进行累加,所以还需要第三组的一个浮点数加法器,对第二组浮点数加法器得到的加和进行加法运算。
又例如,运算单元2支持最低精度浮点数(如FP16)的最大长度为16,一个浮点数加法器可计算2个浮点数的加法计算。则为了实现浮点数向量内积运算以及浮点数向量的元素累加运算,可以对浮点数加法器分为4组,第一组包括8个浮点数加法器,第二组包括4个浮点数加法器,第三组包括2个浮点数加法器,第四组包括1个浮点数加法器。
需要说明的是,浮点数加法器在对完整的浮点数进行加法运算时,可以完成阶码最大值比较、计算阶差、尾数对阶以及尾数加和,在对尾数段乘积结果进行加法运算时,可以直接执行尾数对阶和尾数加和,其中,尾数对阶时采用的是固定移位值。
此外,为了使运算单元2输出规格化的浮点数计算结果,在运算单元2中还可以包括有规格化处理电路223。规格化处理电路可以完成常规的尾数舍入操作以及阶码转换操作。
尾数舍入操作,即对需要输出的浮点数的尾数进行舍入(rounding)操作,转换为标准格式,如IEEE754标准格式。其中,FP16、FP32、FP64对应的尾数位宽分别为11bits、24bits、53bits;
阶码转换操作,即将需要输出的浮点数的阶码转换为标准浮点数中对应的阶码格式,如IEEE754标准格式。
其中,FP16,阶码位宽为5bits,偏移值(bias)=15,若实际阶码值大于16,则将阶码值修正为5’b11111,5’b表示5位二进制数;若实际阶码值小于-14,且尾数的整数位为0,则将阶码值修正为5’b0。FP32,阶码位宽为8bits,bias=127,若实际阶码值大于128,则将阶码值修正为8’b11111111;若实际阶码值小于-126,且尾数整数位为0,阶码值修正为8’b0。FP64,阶码位宽为11bits,bias=1023,若实际阶码值大于1024,阶码值修正为11’b11111111111;若实际阶码值小于-1022,且尾数整数位为0,阶码值修正为11’b0。
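以FP16为例,上述阶码修正规则可以用如下Python代码示意(简化示例:省略对尾数整数位的判断,并忽略inf/NaN、非规格化数等编码细节,函数名encode_exp_fp16为假设):

```python
def encode_exp_fp16(actual_exp):
    # FP16阶码位宽5bits, 偏移值bias=15
    if actual_exp > 16:
        return 0b11111  # 上溢: 阶码值修正为5'b11111
    if actual_exp < -14:
        return 0b00000  # 下溢: 阶码值修正为5'b0(此处省略尾数整数位为0的条件)
    return actual_exp + 15
```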
本申请实施例还提供了一种浮点数计算的方法,该方法可以由上述运算单元实现,运算单元可以包括拆解电路和运算器,具体的,如图8所示,该方法可以包括如下处理流程:
步骤801、拆解电路获取计算指令中包括的模式和待计算浮点数。
在实施中,控制单元在存储单元或者内存中获取计算指令,并发送给运算单元。运算单元中的拆解电路接收该计算指令,并获取该计算指令中携带的模式和待计算浮点数。其中,待计算浮点数可以为两个相同类型的浮点数标量,或者为两个不同类型的浮点数标量,或者 两个相同类型相同长度的浮点数向量,又或者两个不同类型相同长度的浮点数向量。
运算单元中可以输入的两个浮点数向量的长度与运算单元中浮点数乘法器的个数有关,具体的,在浮点数乘法器个数为N的情况下,运算单元支持的最低精度浮点数向量的长度为N,支持的较高精度浮点数向量的长度为N/o²,其中,o为较高精度浮点数的尾数拆分的尾数段个数,以此类推。
例如,如图10所示,运算器包括16个浮点数乘法器,则可以输入两个长度为16的FP16向量,或者输入两个长度为4的FP32向量,或者输入两个FP64标量。
步骤802、拆解电路根据预设规则拆解待计算浮点数,其中,模式用于指示对待计算浮点数的运算类型。
其中,模式指示的运算类型可以包括向量逐元素乘、向量内积和向量元素累加等。
在实施中,拆解电路可以根据待计算浮点数的类型,以及存储的该类型的浮点数对应的尾数拆解段数和各尾数段的位宽,对待计算浮点数的尾数进行拆解,并将拆解后的尾数段、符号和阶码输出至运算器。且在输出尾数段时,需要按照预设的固定顺序进行排序后输出,以使需要进行乘法计算的不同待计算浮点数的尾数的尾数段可以以各种可能的方式组合。
下面结合图10所示的运算单元分别以输入两个长度为16的FP16向量、输入两个长度为4的FP32向量,以及输入两个FP64标量为例,对步骤802的拆解方法进行说明。
输入两个长度为16的FP16向量:
每个拆解电路可以对其中一个FP16向量进行拆解。拆解电路中的浮点数拆解子电路按照FP16中符号、阶码和尾数所占位宽,将每个FP16拆解为1组{符号(sign),阶码(exp),尾数(mts)}。其中,对于拆解得到的尾数是指包括整数位的尾数。具体的,按照由高位到低位的顺序1bit、5bits、10bits将FP16拆解为三部分,第一部分的1bit为sign、第二部分的5bits为exp,对于第三部分的10bits,在这10bits的最高位前补1(隐藏的整数位)得到11bits作为mts。对于一个FP16向量,可以拆解得到16组{sign,exp,mts}。因为在本申请实施例中浮点数乘法器是支持最低精度浮点数的乘法计算的,所以对于最低精度浮点数FP16的尾数可以无需进行拆解。
然后,拆解电路将得到的每组{sign,exp,mts}输入到一个浮点数乘法器中,输入时可以按照该组{sign,exp,mts}在FP16向量中的位置按顺序输入,位于不同向量中的相同位置的待计算浮点数对应的两组{sign,exp,mts},输入至同一浮点数乘法器中。
例如,两个向量为向量A(a1,a2…a16)和向量B(b1,b2…b16)。向量A中的第一个待计算浮点数a1,可以拆解得到{signA1,expA1,mtsA1},向量B中的第一个待计算浮点数b1,可以拆解得到{signB1,expB1,mtsB1},则可以将{signA1,expA1,mtsA1}和{signB1,expB1,mtsB1}输入到同一个浮点数乘法器中。
输入两个长度为4的FP32向量:
每个拆解电路可以对其中一个FP32向量进行拆解。首先,拆解电路中的浮点数拆解子电路按照FP32中符号、阶码和尾数所占位宽,将每个FP32拆解为一组{sign,exp,mts}。具体的,按照由高位到低位的顺序1bit、8bits、23bits将FP32拆解为三部分,第一部分的1bit为sign、第二部分的8bits为exp,对于第三部分的23bits,在这23bits的最高位前补1(隐藏的整数位)得到24bits作为mts。对于一个FP32向量,可以拆解得到4组{sign,exp,mts},并将拆解得到的尾数输入至尾数拆解子电路。尾数拆解子电路按照预先设定的对FP32的拆解方式,对输入的mts进行拆解。例如,预先设定的对FP32的拆解方式为拆解为2个尾数段,每个尾数段的位宽为12bits。
例如,两个FP32向量为向量C(c1,c2,c3,c4)和向量D(d1,d2,d3,d4)。对于向量C,先按照FP32中符号、阶码和尾数所占位宽,将向量C中的浮点数分别拆解为{signC1,expC1,mtsC1}、{signC2,expC2,mtsC2}、{signC3,expC3,mtsC3}和{signC4,expC4,mtsC4}。然后,按照预先设定的对FP32的拆解方式,将mtsC1拆解为mtsC10和mtsC11,将mtsC2拆解为mtsC20和mtsC21,将mtsC3拆解为mtsC30和mtsC31,将mtsC4拆解为mtsC40和mtsC41,其中,mtsC10、mtsC20、mtsC30、mtsC40表示低位的尾数段,mtsC11、mtsC21、mtsC31、mtsC41表示高位的尾数段。同样的,对于向量D可以拆解得到的符号包括signD1、signD2、signD3和signD4,拆解得到的阶码包括expD1、expD2、expD3和expD4,拆解得到的尾数段包括mtsD10、mtsD11、mtsD20、mtsD21、mtsD30、mtsD31、mtsD40和mtsD41,其中,mtsD10、mtsD20、mtsD30、mtsD40表示低位的尾数段,mtsD11、mtsD21、mtsD31、mtsD41表示高位的尾数段。
对于第一个FP32向量中的每个尾数的尾数段均以{mts1,mts1,mts0,mts0}顺序排列,再将每个尾数段分别输出至一个浮点数乘法器。对于第二个FP32向量中的每个尾数的尾数段均以{mts1,mts0,mts1,mts0}顺序排列,再将每个尾数段分别输出至一个浮点数乘法器。
例如,对于向量C中的第一个待计算浮点数c1的尾数mtsC1的尾数段,可以排列为{mtsC11,mtsC11,mtsC10,mtsC10},相应的,对于向量D的第一个待计算浮点数d1的尾数mtsD1的尾数段可以排列为{mtsD11,mtsD10,mtsD11,mtsD10}。排序后,按照排序分别输出至浮点数乘法器,其中,mtsC1对应的排序中的第一个尾数段与mtsD1对应的排序中的第一个尾数段输出至同一浮点数乘法器,以此类推。
需要说明的是,上述尾数段的排序方式仅为一种示例,排序输出的目的在于使得两个向量中对应位置的待计算浮点数的尾数的尾数段可以以各种可能的方式组合,具体以何种排列方式输出本申请实施例不做限定,只需保证是按照固定的排列方式输出,且达到上述目的即可。
此外,对于拆解得到的每组中sign和exp,只需输出至同一组的尾数对应的排序中的第一个尾数段所输入的浮点数乘法器中即可。
例如,对于向量C中的第一个待计算浮点数c1的符号signC1和阶码expC1,可以与mtsC1对应的排序中的第一个尾数段输入同一个浮点数乘法器中。
输入两个FP64标量:
每个拆解电路可以对其中一个FP64进行拆解。首先,浮点数拆解子电路按照FP64中符号、阶码和尾数所占位宽,将每个FP64拆解为{sign,exp,mts}。具体的,按照由高位到低位的顺序1bit、11bits、52bits将FP64拆解为三部分,第一部分的1bit为sign、第二部分的11bits为exp,对于第三部分的52bits,在这52bits的最高位前补1(隐藏的整数位)得到53bits作为mts。然后,再将mts输入至尾数拆解子电路。然后,尾数拆解子电路按照预先设定的对FP64的拆解方式,对接收到的mts进行拆解。例如,预先设定的对FP64的拆解方式为将尾数拆解为4个尾数段,每个尾数段的位宽分别为13bits、13bits、13bits和14bits。
例如,两个待计算浮点数为E和F。对于E可以先按照FP64中符号、阶码和尾数所占位宽,拆解为{signE,expE,mtsE},然后,按照预先设定的对FP64的拆解方式,将mtsE拆解为mtsE3、mtsE2、mtsE1和mtsE0,其中,mtsE3、mtsE2、mtsE1和mtsE0表示由高位到低位的尾数段。同样的,对于F可以先拆解为{signF,expF,mtsF},然后,将mtsF拆解为mtsF3、mtsF2、mtsF1和mtsF0,其中,mtsF3、mtsF2、mtsF1和mtsF0表示由高位到低位的尾数段。
对于第一个FP64的尾数的尾数段以{mts3,mts3,mts2,mts3,mts2,mts1,mts3,mts2,mts1,mts0, mts2,mts1,mts0,mts1,mts0,mts0}顺序排列,再将每个尾数段分别输出至一个浮点数乘法器。对于第二个FP64的尾数的尾数段以{mts3,mts2,mts3,mts1,mts2,mts3,mts0,mts1,mts2,mts3,mts0,mts1,mts2,mts0,mts1,mts0}顺序排列,再将每个尾数段分别输出至一个浮点数乘法器。
例如,对于待计算浮点数E的尾数mtsE的尾数段,可以排列为{mtsE3,mtsE3,mtsE2,mtsE3,mtsE2,mtsE1,mtsE3,mtsE2,mtsE1,mtsE0,mtsE2,mtsE1,mtsE0,mtsE1,mtsE0,mtsE0}。相应的,对于待计算浮点数F的尾数mtsF的尾数段可以排列为{mtsF3,mtsF2,mtsF3,mtsF1,mtsF2,mtsF3,mtsF0,mtsF1,mtsF2,mtsF3,mtsF0,mtsF1,mtsF2,mtsF0,mtsF1,mtsF0}。排序后,按照排序分别输出至浮点数乘法器,其中,mtsE对应的排序中的第一个尾数段与mtsF对应的排序中的第一个尾数段输出至同一浮点数乘法器,以此类推。
需要说明的是,上述尾数段的排序方式仅为一种示例,排序输出的目的在于使得两个向量中对应位置的待计算浮点数的尾数的尾数段可以以各种可能的方式组合,具体以何种排列方式输出本申请实施例不做限定,只需保证是按照固定的排列方式输出,且达到上述目的即可。
此外,对于拆解得到的sign和exp,只需输出至尾数对应的尾数段的排序中的第一个尾数段所输入的浮点数乘法器中即可。
在一种可能的实现方式中,在将尾数段输出至浮点数乘法器之前,可以先对尾数段进行高位补0,使得补0后的尾数段的位宽与浮点数乘法器支持的乘法位宽相同。
步骤803、运算器按照模式和拆解后的待计算浮点数完成所述计算指令的处理。
在实施中,步骤803可以由运算器中的浮点数乘法器和浮点数加法器实现。具体的,如图9所示,步骤803可以包括如下处理流程:
步骤8031、运算器中的浮点数乘法器对输入的拆解后的待计算浮点数的符号进行异或计算,对输入的拆解后的待计算浮点数的阶码进行加法计算,对输入的拆解后的待计算浮点数的尾数段进行乘法计算,并向运算器中的浮点数加法器输出符号异或结果、阶码加和结果和尾数段乘积结果。
下面结合图10所示的运算单元,对不同计算类型下,上述步骤802中示例的几种输入在该步骤8031的处理进行说明。
浮点数向量逐元素乘运算,输入为两个长度为16的FP16向量:
每个浮点数乘法器对输入的浮点数进行乘法运算,具体的,包括对输入的两个符号进行异或计算,对输入的两个阶码进行加法计算,对输入的两个尾数段进行乘法计算。16个浮点数乘法器可以并行执行。
每个浮点数乘法器可以将符号异或结果、阶码加和结果和尾数段乘积结果输出至规格化处理电路,规格化处理电路对同一浮点数乘法器输入的符号异或结果、阶码加和结果和尾数段乘积结果进行规格化处理,得到一个规格化的FP16。对于16个浮点数乘法器分别输入的符号异或结果、阶码加和结果和尾数段乘积结果,规格化处理电路可以得到16个规格化的FP16作为向量逐元素乘运算结果输出。此处,需要说明的是,规格化处理电路对于输入的符号异或结果、阶码加和结果和尾数段乘积结果进行规格化处理时,与对常规的浮点数的符号、阶码和尾数进行规格化处理是相同的。
浮点数向量内积运算,输入为两个长度为16的FP16向量:
每个浮点数乘法器对输入的浮点数进行乘法运算,具体的,包括对输入的两个符号进行异或计算,对输入的两个阶码进行加法计算,对输入的两个尾数段进行乘法计算。16个浮点数乘法器可以并行执行,得到16个浮点数乘积结果。
浮点数乘法器输出的16个浮点数乘积结果分为4组,分别输出至第一组的4个浮点数加法器中的一个浮点数加法器。
浮点数向量逐元素乘运算,输入为两个长度为4的FP32向量:
每个浮点数乘法器对输入的尾数段进行乘法计算,16个浮点数乘法器可以并行执行对尾数段的乘法。且对于输入的符号和阶码,浮点数乘法器还要执行符号异或运算以及阶码加法运算。
16个浮点数乘法器分别对输入的尾数段进行乘法运算后,可以得到16个尾数段乘积结果。并将这16个尾数段乘积结果分为4组,每组输出至第一组4个浮点数加法器中的一个浮点数加法器,其中,同一组中的各尾数段乘积结果均来自同一对待计算浮点数。
例如,在上述对FP32向量的拆解与尾数段排序的示例基础上,此处,16个尾数段乘积结果分成的4组中,第一组包括的尾数段乘积结果可以为:mtsC11*mtsD11、mtsC11*mtsD10、mtsC10*mtsD11、mtsC10*mtsD10;第二组包括的尾数段乘积结果可以为:mtsC21*mtsD21、mtsC21*mtsD20、mtsC20*mtsD21、mtsC20*mtsD20,第三组和第四组包括的尾数段乘积结果可以以此类推。
浮点数向量逐元素乘运算,输入为两个FP64标量:
每个浮点数乘法器对输入的尾数段进行乘法计算,16个浮点数乘法器可以并行执行对尾数段的乘法。且对于输入的符号和阶码,浮点数乘法器还需要执行符号异或运算以及阶码加法运算。
16个浮点数乘法器得到的16个尾数段乘积结果可以分为4组,每组输出至第一组4个浮点数加法器中的一个浮点数加法器。
例如,在上述对FP64的拆解与尾数段排序的示例基础上,此处,16个尾数段乘积结果分成的4组中,第一组包括的尾数段乘积结果可以为:mtsE3*mtsF3、mtsE3*mtsF2、mtsE2*mtsF3、mtsE3*mtsF1;第二组包括的尾数段乘积结果可以为:mtsE2*mtsF2、mtsE1*mtsF3、mtsE1*mtsF2、mtsE0*mtsF3;第三组包括的尾数段乘积结果可以为:mtsE3*mtsF0、mtsE2*mtsF1、mtsE2*mtsF0、mtsE0*mtsF2;第四组包括的尾数段乘积结果可以为:mtsE1*mtsF1、mtsE1*mtsF0、mtsE0*mtsF1、mtsE0*mtsF0。
需要说明的是,FP32的浮点数向量内积运算和浮点数向量逐元素乘运算在该步骤8031的处理是相同的,因此,对于FP32的浮点数向量内积运算在步骤8031中的处理不再赘述。
步骤8032、浮点数加法器对输入的尾数段乘积结果进行加法计算,得到尾数段加和结果,并根据计算指令模式、尾数段加和结果、符号异或结果以及阶码加和结果,输出对待计算浮点数的计算结果。
下面结合图10所示的运算单元,对不同计算类型下,上述步骤802中示例的几种输入在该步骤8032的处理进行说明。
浮点数向量逐元素乘运算,输入为两个长度为4的FP32向量:
第一组的每个浮点数加法器根据输入的模式所指示的待计算浮点数的类型获取对应的固定移位值。然后,对于输入的浮点数尾数段乘积结果,按照固定移位值进行对阶,再对对阶后的尾数段乘积结果进行加法运算,得到第一阶段加和结果。第一组的4个浮点数加法器可以得到4个第一阶段加和结果。然后,每个浮点数加法器向规格化处理电路输出一个第一阶段加和结果以及对应的符号异或结果和阶码加和结果。规格化处理电路,对输入的每组第一阶段加和结果以及对应的符号异或结果和阶码加和结果进行规格化处理,输出一个规格化的FP32。该规格化处理电路可以得到4个规格化的FP32并作为向量逐元素乘运算结果输出。
其中,固定移位值是预先计算并存储的,因为拆解电路输出的尾数段是按照固定排序输出至对应的浮点数乘法器中的,且浮点数乘法器的输出会固定输出至对应的浮点数加法器,所以浮点数加法器可以预先存储固定移位值,对于不同类型的待计算浮点数固定移位值也可以不同。固定移位值与尾数段乘积结果所对应的尾数段在原待计算浮点数的尾数中的位置以及所占位宽有关。
下面对FP32对应的固定移位值进行举例说明。
待计算浮点数c1(FP32)的尾数段包括mtsC11和mtsC10,待计算浮点数d1的尾数段包括mtsD11和mtsD10。其中,尾数段乘积结果包括mtsC11*mtsD11、mtsC11*mtsD10、mtsC10*mtsD11和mtsC10*mtsD10。以mtsC10*mtsD10为基准,即mtsC10*mtsD10的固定移位值为0,因为mtsC10*mtsD11对应的两个尾数段最低位之和与mtsC10*mtsD10对应的两个尾数段最低位之和的位差为12,所以mtsC10*mtsD11的固定移位值为12,同理,mtsC11*mtsD10的固定移位值为12,mtsC11*mtsD11的固定移位值为24。即,对应FP32存储的固定移位值可以依次为0、12、12、24。
需要说明的是,上述固定移位值均表示左移位数。
浮点数加法器在对mtsC11*mtsD11、mtsC11*mtsD10、mtsC10*mtsD11和mtsC10*mtsD10进行加法计算时,分别将mtsC11*mtsD10、mtsC10*mtsD11和mtsC11*mtsD11左移12、12、24位,再对移位后的mtsC11*mtsD10、mtsC10*mtsD11和mtsC11*mtsD11与mtsC10*mtsD10进行加和。
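上述固定移位值0、12、12、24的来源可以用如下Python代码验证:将两个24bits尾数各拆为高、低两个12bits尾数段后,4个尾数段乘积按固定移位加和应等于完整的尾数乘积(示例验算,尾数取值为假设):

```python
mtsC = 0b110101010101010101010101  # 24bits尾数, 含隐藏整数位(假设取值)
mtsD = 0b101010101010111100001111
c1, c0 = mtsC >> 12, mtsC & 0xFFF  # 高、低尾数段
d1, d0 = mtsD >> 12, mtsD & 0xFFF
# 按固定移位值24、12、12、0对4个尾数段乘积进行加和
total = ((c1 * d1) << 24) + ((c1 * d0) << 12) + ((c0 * d1) << 12) + (c0 * d0)
```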
浮点数向量内积运算,输入为两个长度为4的FP32向量:
第一组的每个浮点数加法器根据输入的模式所指示的待计算浮点数的类型获取对应的固定移位值。然后,对于输入的浮点数尾数段乘积结果,按照固定移位值进行对阶,再对对阶后的尾数段乘积结果进行加法运算,得到第一阶段加和结果。第一组的4个浮点数加法器可以得到4个第一阶段加和结果。然后,第一组的浮点数加法器将四个第一阶段加和结果分为2组,每组输出至第二组的一个浮点数加法器。在向第二组的浮点数加法器输出第一阶段加和结果时,同时将该第一阶段加和结果对应的符号异或结果和阶码加和结果也输出至该第二组的浮点数加法器。
第二组的浮点数加法器,对于输入的两个阶码加和结果进行最大阶码比较,并计算阶差。然后根据计算出的阶差对输入的两个第一阶段加和结果进行对阶,再对对阶后的第一阶段加和结果进行加法计算,得到第二阶段加和结果。第二组的浮点数加法器可以得到两个第二阶段加和结果(此处,在第二组浮点数加法器实质上是完成的完整的浮点数加法计算,则第二组的浮点数加法器输出的第二阶段加和结果为完整的浮点数),再输出至第三组的浮点数加法器中。
第三组的浮点数加法器对第二阶段加和结果进行加法计算,得到第三阶段加和结果。最后,第三组的浮点数加法器将第三阶段加和结果输出至规格化处理电路,由规格化处理电路进行规格化处理后,得到1个规格化的FP32作为浮点数向量内积计算结果输出。
浮点数向量逐元素乘运算,输入为FP64标量:
第一组的每个浮点数加法器根据输入的模式所指示的待计算浮点数的类型获取对应的固定移位值。然后,对于输入的浮点数尾数段乘积结果,按照固定移位值进行对阶,再对对阶后的尾数段乘积结果进行加法运算,得到第一阶段加和结果。第一组的4个浮点数加法器可以得到4个第一阶段加和结果。然后,第一组的浮点数加法器将4个第一阶段加和结果分为两组,每组输出至第二组的一个浮点数加法器。同时将输入的符号异或结果和阶码加和结果也输出至第二组的一个浮点数加法器。
下面对第一组的浮点数加法器中的FP64对应的固定移位值进行举例说明。
待计算浮点数E的尾数段包括mtsE3、mtsE2、mtsE1和mtsE0,待计算浮点数F的尾数段包括mtsF3、mtsF2、mtsF1和mtsF0。在各尾数段乘积结果中:
以mtsE0*mtsF0为基准,mtsE0*mtsF1的固定移位值为13,mtsE1*mtsF0的固定移位值为13,mtsE1*mtsF1的固定移位值为26,以上4个尾数乘积结果为一组,由一个浮点数加法器进行加法计算,在该浮点数加法器中对应FP64存储的固定移位值可以依次为0、13、13、26。
以mtsE0*mtsF2为基准,mtsE2*mtsF0的固定移位值为0,即不用进行移位,mtsE2*mtsF1的固定移位值为13,mtsE3*mtsF0的固定移位值为13,以上4个尾数乘积结果为一组,由一个浮点数加法器进行加法计算,在该浮点数加法器中对应FP64存储的固定移位值可以依次为0、0、13、13。
以mtsE0*mtsF3为基准,mtsE1*mtsF2的固定移位值为0,即不用进行移位,mtsE1*mtsF3的固定移位值为13,mtsE2*mtsF2的固定移位值为13,以上4个尾数乘积结果为一组,由一个浮点数加法器进行加法计算,在该浮点数加法器中对应FP64存储的固定移位值可以依次为0、0、13、13。
以mtsE3*mtsF1为基准,mtsE2*mtsF3的固定移位值为13,mtsE3*mtsF2的固定移位值为13,mtsE3*mtsF3的固定移位值为26,以上4个尾数乘积结果为一组,由一个浮点数加法器进行加法计算,在该浮点数加法器中对应FP64存储的固定移位值可以依次为0、13、13、26。
需要说明的是,上述固定移位值均表示左移位数。
浮点数加法器在对mtsE0*mtsF0、mtsE0*mtsF1、mtsE1*mtsF0和mtsE1*mtsF1进行加法计算时,先将mtsE0*mtsF1、mtsE1*mtsF0和mtsE1*mtsF1分别左移13、13、26位,再对移位后的mtsE0*mtsF1、mtsE1*mtsF0和mtsE1*mtsF1与mtsE0*mtsF0进行加和。在对mtsE0*mtsF2、mtsE2*mtsF0、mtsE2*mtsF1和mtsE3*mtsF0进行加法计算时,先分别将mtsE2*mtsF1和mtsE3*mtsF0左移13、13位,再对移位后的mtsE2*mtsF1、mtsE3*mtsF0与mtsE0*mtsF2、mtsE2*mtsF0进行加和。在对mtsE0*mtsF3、mtsE1*mtsF2、mtsE1*mtsF3和mtsE2*mtsF2进行加法计算时,先将mtsE1*mtsF3和mtsE2*mtsF2分别左移13、13位,再对移位后的mtsE1*mtsF3、mtsE2*mtsF2和mtsE0*mtsF3、mtsE1*mtsF2进行加和。在对mtsE3*mtsF1、mtsE2*mtsF3、mtsE3*mtsF2和mtsE3*mtsF3进行加法计算时,先将mtsE2*mtsF3、mtsE3*mtsF2和mtsE3*mtsF3分别左移13、13、26位,再对移位后的mtsE2*mtsF3、mtsE3*mtsF2、mtsE3*mtsF3和mtsE3*mtsF1进行加和。
第二组的浮点数加法器,对于输入的第一阶段加和结果按照固定移位值进行对阶后,再进行加法运算,得到第二阶段加和结果,再输出至第三组的浮点数加法器。同时将输入的符号异或结果和阶码加和结果也输出至第三组的浮点数加法器。
在上述第一组的浮点数加法器中存储的固定移位值的示例的基础上,对于第二组的浮点数加法器中的FP64对应的固定移位值进行说明。
例如,4个第一阶段加和结果分别为P1、P2、P3和P4,其中,P1是由mtsE1*mtsF1、mtsE1*mtsF0、mtsE0*mtsF1、mtsE0*mtsF0经过移位后相加得到的,P2是由mtsE3*mtsF0、mtsE2*mtsF1、mtsE2*mtsF0、mtsE0*mtsF2经过移位后相加得到的,P3是由mtsE2*mtsF2、mtsE1*mtsF3、mtsE1*mtsF2、mtsE0*mtsF3经过移位后相加得到的,P4是由mtsE3*mtsF3、mtsE3*mtsF2、mtsE2*mtsF3、mtsE3*mtsF1经过移位后相加得到的。
P1和P2作为一组,以P1为基准,由于P2对应的尾数段乘积结果中作为基准的尾数段乘积结果对应的最低位和,与P1对应的尾数段乘积结果中作为基准的mtsE0*mtsF0的尾数段乘积结果对应的最低位和的位差为26,所以P2的固定移位值为26,即在对应的浮点数加法器中存储的对应FP64的固定移位值可以依次为0、26。P3和P4作为一组,其中,以P3为基准,P4的固定移位值为13,即在对应的浮点数加法器中存储的对应FP64的固定移位值可以依次为0、13。
第二组的浮点数加法器在对P1和P2进行加法计算时,先将P2左移26位,再对移位后的P1和P2进行加和。在对P3和P4进行加法计算时,先将P4左移13位,再对移位后的P3和P4进行加和。
第三组的浮点数加法器,对于输入的第二阶段加和结果按照固定移位值进行对阶后,再进行加法运算,得到第三阶段加和结果。
在上述第二组的浮点数加法器中存储的固定移位值的示例的基础上,对于第三组的浮点数加法器中的FP64对应的固定移位值进行举例说明。
例如,上述P1和P2进行加法计算,得到的第二阶段加和结果为Q1,P3和P4进行加法计算,得到的第二阶段加和结果为Q2。其中,Q1的固定移位值为0,即不用进行移位,Q2的固定移位值为39。即在第三组的浮点数加法器中存储的对应FP64的固定移位值依次为0、39。
需要说明的是,上述固定移位值均表示左移位数。
第三组的浮点数加法器在对Q1和Q2进行加法计算时,先对Q2左移39位,再对移位后的Q2和Q1进行加和。
最后,将符号异或结果、阶码加和结果和第三阶段加和结果输出至规格化处理电路,由规格化处理电路进行规格化处理后,得到1个规格化的FP64作为计算结果输出。
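上述FP64的三级加和过程可以用如下Python代码整体验算:按正文的分组与各级固定移位(第一级13/26、第二级26/13、第三级39)加和后,结果应等于两个53bits尾数的完整乘积(示例验算,尾数取值为假设):

```python
def split4(m):
    # 53bits尾数按14/13/13/13由高位到低位拆为mts3、mts2、mts1、mts0
    return (m >> 39) & 0x3FFF, (m >> 26) & 0x1FFF, (m >> 13) & 0x1FFF, m & 0x1FFF

mtsE = (1 << 53) - 987654321    # 假设的两个53bits尾数
mtsF = (1 << 52) | 123456789123
e3, e2, e1, e0 = split4(mtsE)
f3, f2, f1, f0 = split4(mtsF)

# 第一组浮点数加法器: 各组内按固定移位值加和, 得到P1~P4
P1 = (e1*f1 << 26) + (e1*f0 << 13) + (e0*f1 << 13) + e0*f0
P2 = e0*f2 + e2*f0 + (e2*f1 << 13) + (e3*f0 << 13)
P3 = e0*f3 + e1*f2 + (e1*f3 << 13) + (e2*f2 << 13)
P4 = e3*f1 + (e2*f3 << 13) + (e3*f2 << 13) + (e3*f3 << 26)
# 第二组: P2左移26位, P4左移13位
Q1 = P1 + (P2 << 26)
Q2 = P3 + (P4 << 13)
# 第三组: Q2左移39位
result = Q1 + (Q2 << 39)
```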
浮点数向量内积运算,输入为两个长度为16的FP16向量:
第一组的每个浮点数加法器对输入的4个浮点数乘积结果进行加法计算,得到第一阶段加和结果。第一组的浮点数加法器可以得到4个第一阶段加和结果,再将4个第一阶段加和结果分为2组分别输出至第二组的浮点数加法器。
第二组的浮点数加法器对输入的第一阶段加和结果进行加法计算,得到两个第二阶段加和结果。第二组的浮点数加法器将两个第二阶段加和结果输出至第三组的浮点数加法器。
第三组的浮点数加法器对第二阶段加和结果进行加法计算,得到第三阶段加和结果。最后,第三组的浮点数加法器将第三阶段加和结果输出至规格化处理电路,由规格化处理电路进行规格化处理,得到1个规格化的FP16作为向量内积结果输出。
还需说明的是,本申请实施例中还可以实现浮点数向量元素累加运算,在此种运算类型中,输入的待计算浮点数为一个浮点数向量。拆解电路获取到计算指令后,确定模式指示的计算类型为向量元素累加,则可以先生成与输入的待计算浮点数向量类型相同的浮点数向量,且生成的浮点数向量中的各元素值均为1。输入的待计算浮点数向量和生成的浮点数向量可以共同作为待计算浮点数向量。接下来,浮点数向量元素累加运算在上述步骤801-步骤8032中的处理,与上述浮点数向量内积运算在上述步骤801-步骤8032中的处理相同,在此不再赘述。
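向量元素累加复用向量内积数据通路的等价性可以用如下Python代码示意(示例,向量取值为假设):

```python
a = [0.5, 1.25, -2.0, 3.0]                 # 输入的待计算浮点数向量
ones = [1.0] * len(a)                      # 生成的各元素值均为1的同类型向量
acc = sum(x * y for x, y in zip(a, ones))  # 与全1向量做内积即得元素累加结果
```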
基于相同的技术构思,本申请实施例还提供了一种浮点数计算的装置,该装置可以为上述运算单元,如图11所示,该装置包括:
拆解模块130,用于获取计算指令中包括的模式和待计算浮点数;根据预设规则拆解所述待计算浮点数,其中,所述模式用于指示对所述待计算浮点数的运算类型;
计算模块131,用于按照所述模式和拆解后的待计算浮点数完成所述计算指令的处理。
应理解的是,本申请实施例的装置可以通过专用集成电路(application-specific integrated circuit,ASIC)实现,或可编程逻辑器件(programmable logic device,PLD)实现,上述PLD可以是复杂可编程逻辑器件(complex programmable logical device,CPLD)、现场可编程门阵列(field-programmable gate array,FPGA)、通用阵列逻辑(generic array logic,GAL)或其任意组合。当通过软件实现图1至图10所示的浮点数计算方法时,装置及其各个模块也可以为软件模块。
在一种可能的实现方式中,所述待计算浮点数为高精度浮点数,所述拆解模块,用于:
根据所述待计算浮点数的尾数将所述待计算浮点数拆解为多个低精度浮点数。
在一种可能的实现方式中,所述拆解后的待计算浮点数的阶码位宽大于所述待计算浮点数的阶码位宽。
在一种可能的实现方式中,所述拆解模块130,用于:
将所述待计算浮点数拆解为符号、阶码和尾数,将所述待计算浮点数的尾数拆解为多个尾数段。
在一种可能的实现方式中,所述计算模块131包括浮点数乘法计算单元和浮点数加法计算单元;
所述浮点数乘法计算单元,用于对拆解后的待计算浮点数的符号进行异或计算,得到符号异或结果,对拆解后的待计算浮点数的阶码进行加法计算,得到阶码加和结果,对拆解后的来自不同待计算浮点数的尾数段进行乘法计算,输出尾数段乘积结果;
所述浮点数加法计算单元,用于对所述尾数段乘积结果进行加法计算,得到尾数段加和结果,根据所述模式、所述尾数段加和结果、所述符号异或结果以及所述阶码加和结果,得到对所述待计算浮点数的计算结果。
还需要说明的是,上述实施例提供的浮点数计算的装置在计算浮点数时,仅以上述各功能模块的划分进行举例说明,实际应用中,可以根据需要而将上述功能分配由不同的功能模块完成,即将计算设备的内部结构划分成不同的功能模块,以完成以上描述的全部或者部分功能。另外,上述实施例提供的浮点数计算的装置与浮点数计算的方法实施例属于同一构思,其具体实现过程详见方法实施例,这里不再赘述。
本申请实施例还提供了一种芯片,该芯片的结构可以与图1中所示的芯片100的结构相同,该芯片可以实现本申请实施例所提供的浮点数计算的方法。
参见图12,本申请实施例提供了一种计算设备1300。该计算设备1300包括至少一个处理器1301,总线系统1302,存储器1303,通信接口1304和内存单元1305。
上述处理器1301可以是一个通用中央处理器(central processing unit,CPU)、网络处理器(network processor,NP)、图形处理器(graphics processing unit,GPU)、微处理器、特定应用集成电路(application-specific integrated circuit,ASIC),或一个或多个用于控制本申请方案程序执行的集成电路。
上述总线系统1302可包括一通路,在上述组件之间传送信息。
上述存储器1303可以是只读存储器(read-only memory,ROM)或可存储静态信息和指令的其他类型的静态存储设备,随机存取存储器(random access memory,RAM)或者可存储信息和指令的其他类型的动态存储设备,也可以是电可擦可编程只读存储器(electrically erasable  programmable read-only memory,EEPROM)、只读光盘(compact disc read-only memory,CD-ROM)或其他光盘存储、光碟存储(包括压缩光碟、激光碟、光碟、数字通用光碟、蓝光光碟等)、磁盘存储介质或者其他磁存储设备、或者能够用于携带或存储具有指令或数据结构形式的期望的程序代码并能够由计算机存取的任何其他介质,但不限于此。存储器可以是独立存在,通过总线与处理器相连接。存储器也可以和处理器集成在一起。
内存单元1305用于存储执行本申请方案的应用程序代码,并由处理器1301来控制执行。处理器1301用于执行内存单元1305中存储的应用程序代码,从而实现本申请提出的浮点数计算方法。
在具体实现中,作为一种实施例,处理器1301可以包括一个或多个处理器1301。
通信接口1304用于实现计算设备1300与外部设备的连接和通信。
综上所述,计算设备可以通过对待计算浮点数的拆解,获得多个低精度浮点数,并分别由多个浮点数乘法器并行对拆解后的浮点数进行运算处理,使得同一计算设备可以支持不同精度浮点数的运算,无需设置专有计算单元执行指定精度浮点数的运算,整个计算设备的兼容性更强,另一方面,由于单一计算设备即可完成不同精度浮点数运算过程,减少了不同精度浮点数运算器的个数,降低了成本。此外,由于多个浮点数乘法器可以分别对拆解后的浮点数并行执行运算操作,降低了处理时延,提升了处理效率。
上述实施例,可以全部或部分地通过软件、硬件、固件或其他任意组合来实现。当使用软件实现时,上述实施例可以全部或部分地以计算机程序产品的形式实现。所述计算机程序产品包括一个或多个计算机指令。在计算机上加载或执行所述计算机程序指令时,全部或部分地产生按照本申请实施例所述的流程或功能。所述计算机可以为通用计算机、专用计算机、计算机网络、或者其他可编程装置。所述计算机指令可以存储在计算机可读存储介质中,或者从一个计算机可读存储介质向另一个计算机可读存储介质传输,例如,所述计算机指令可以从一个网站站点、计算机、服务器或数据中心通过有线(例如同轴电缆、光纤、数字用户线(DSL))或无线(例如红外、无线、微波等)方式向另一个网站站点、计算机、服务器或数据中心进行传输。所述计算机可读存储介质可以是计算机能够存取的任何可用介质或者是包含一个或多个可用介质集合的服务器、数据中心等数据存储设备。所述可用介质可以是磁性介质(例如,软盘、硬盘、磁带)、光介质(例如,DVD)、或者半导体介质。半导体介质可以是固态硬盘(solid state drive,SSD)。
以上所述仅为本申请一个实施例,并不用以限制本申请,凡在本申请的精神和原则之内,所作的任何修改、等同替换、改进等,均应包含在本申请的保护范围之内。

Claims (16)

  1. 一种运算单元,其特征在于,所述运算单元包括拆解电路和运算器;
    所述拆解电路,用于获取计算指令中包括的模式和待计算浮点数;根据预设规则拆解所述待计算浮点数,其中,所述模式用于指示对所述待计算浮点数的运算类型;
    所述运算器,用于按照所述模式和拆解后的待计算浮点数完成所述计算指令的处理。
  2. 根据权利要求1所述运算单元,其特征在于,
    拆解电路,还用于在所述待计算浮点数为高精度浮点数时,根据所述待计算浮点数的尾数将所述待计算浮点数拆解为多个低精度浮点数。
  3. 根据权利要求1所述运算单元,其特征在于,所述拆解后的待计算浮点数的阶码位宽大于所述待计算浮点数的阶码位宽。
  4. 根据权利要求1所述运算单元,其特征在于,
    所述拆解电路,还用于将所述待计算浮点数拆解为符号、阶码和尾数,将所述待计算浮点数的尾数拆解为多个尾数段。
  5. 根据权利要求1至4中任一项所述的运算单元,其特征在于,所述运算器包括浮点数乘法器和浮点数加法器,
    所述浮点数乘法器,用于执行所述拆解后的待计算浮点数的乘法运算,所述浮点数加法器用于执行所述拆解后的待计算浮点数的加法运算。
  6. 一种浮点数计算的方法,其特征在于,所述方法包括:
    获取计算指令中包括的模式和待计算浮点数;
    根据预设规则拆解所述待计算浮点数,其中,所述模式用于指示对所述待计算浮点数的运算类型;
    按照所述模式和拆解后的待计算浮点数完成所述计算指令的处理。
  7. 根据权利要求6所述的方法,其特征在于,所述待计算浮点数为高精度浮点数,所述根据预设规则拆解所述待计算浮点数,包括:
    根据所述待计算浮点数的尾数将所述待计算浮点数拆解为多个低精度浮点数。
  8. 根据权利要求6所述的方法,其特征在于,所述拆解后的待计算浮点数的阶码位宽大于所述待计算浮点数的阶码位宽。
  9. 根据权利要求6所述的方法,其特征在于,所述根据预设规则拆解所述待计算浮点数,包括:
    将所述待计算浮点数拆解为符号、阶码和尾数,将所述待计算浮点数的尾数拆解为多个尾数段。
  10. 根据权利要求6所述的方法,其特征在于,所述按照所述模式和拆解后的待计算浮点数完成所述计算指令的处理,包括:
    对拆解后的待计算浮点数的符号进行异或计算,得到符号异或结果,对拆解后的待计算浮点数的阶码进行加法计算,得到阶码加和结果,对拆解后的来自不同待计算浮点数的尾数段进行乘法计算,输出尾数段乘积结果;
    对所述尾数段乘积结果进行加法计算,得到尾数段加和结果,根据所述模式、所述尾数段加和结果、所述符号异或结果以及所述阶码加和结果,得到对所述待计算浮点数的计算结果。
  11. 一种浮点数计算的装置,其特征在于,所述装置包括:
    拆解模块,用于获取计算指令中包括的模式和待计算浮点数;根据预设规则拆解所述待计算浮点数,其中,所述模式用于指示对所述待计算浮点数的运算类型;
    计算模块,用于按照所述模式和拆解后的待计算浮点数完成所述计算指令的处理。
  12. 根据权利要求11所述的装置,其特征在于,所述待计算浮点数为高精度浮点数,所述拆解模块,用于:
    根据所述待计算浮点数的尾数将所述待计算浮点数拆解为多个低精度浮点数。
  13. 根据权利要求11所述的装置,其特征在于,所述拆解后的待计算浮点数的阶码位宽大于所述待计算浮点数的阶码位宽。
  14. 根据权利要求11所述的装置,其特征在于,所述拆解模块,用于:
    将所述待计算浮点数拆解为符号、阶码和尾数,将所述待计算浮点数的尾数拆解为多个尾数段。
  15. 一种芯片,其特征在于,所述芯片包括运算单元,所述运算单元用于实现如权利要求1-5中任一项所述权利要求中所述运算单元所实现的功能。
  16. 一种计算设备,其特征在于,所述计算设备包括主板以及如权利要求15所述的芯片;所述芯片设置在所述主板上,所述芯片包括运算单元,所述运算单元用于实现如权利要求1-5中任一项所述权利要求中所述运算单元所实现的功能。
PCT/CN2021/106965 2020-09-29 2021-07-17 运算单元、浮点数计算的方法、装置、芯片和计算设备 WO2022068327A1 (zh)