WO2023147770A1 - Floating-point operation method and related arithmetic unit - Google Patents

Floating-point operation method and related arithmetic unit Download PDF

Info

Publication number
WO2023147770A1
WO2023147770A1 (PCT/CN2023/074108, CN2023074108W)
Authority
WO
WIPO (PCT)
Prior art keywords
exponent
mantissa
register
threshold
floating
Prior art date
Application number
PCT/CN2023/074108
Other languages
English (en)
French (fr)
Inventor
吴润身
吕仁硕
Original Assignee
吕仁硕
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 吕仁硕 filed Critical 吕仁硕
Publication of WO2023147770A1 publication Critical patent/WO2023147770A1/zh

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/483Computations with numbers represented by a non-linear combination of denominational numbers, e.g. rational numbers, logarithmic number system or floating-point numbers

Definitions

  • the invention relates to an application of floating-point calculation, in particular to a floating-point calculation method and a related arithmetic unit.
  • the object of the present invention is to disclose an efficient floating-point encoding and operation method that remedies the defects of prior-art floating-point computation without greatly increasing cost, thereby increasing computation speed and reducing power consumption.
  • An embodiment of the present invention discloses a floating-point operation method, applied to a multiplication between a first register and a second register, where the first register stores a first floating-point number and the second register stores a second floating-point number;
  • the first register includes a first sign bit (Sign bit), a first exponent bit (Exponent bit) and a first mantissa bit (Mantissa bit), respectively storing the first sign, the first exponent and the first mantissa;
  • the second register includes a second sign bit, a second exponent bit and a second mantissa bit, respectively storing a second sign, a second exponent and a second mantissa;
  • the method includes using an arithmetic unit (Arithmetic Unit) to perform the following steps: comparing the first exponent with an exponent threshold, wherein when the first exponent is not less than the exponent threshold, the first mantissa is multiplied by the second mantissa to generate a mantissa operation result, and when the first exponent is less than the exponent threshold, at least one bit of the first mantissa (or all of its bits) is discarded before it is multiplied by the second mantissa to generate the mantissa operation result; adding the first exponent and the second exponent to generate an exponent operation result; and generating a calculated floating-point number according to the mantissa operation result and the exponent operation result.
  • another embodiment of the present invention discloses an arithmetic unit coupled to a first register and a second register, where the first register stores a first floating-point number and the second register stores a second floating-point number; the first register includes a first exponent bit and a first mantissa bit, respectively storing a first sign, a first exponent and a first mantissa; the second register includes a second exponent bit and a second mantissa bit, respectively storing a second sign, a second exponent and a second mantissa; when processing a multiplication between the first register and the second register, the arithmetic unit performs the following steps: comparing the first exponent with an exponent threshold, wherein when the first exponent is not less than the exponent threshold, the first mantissa is multiplied by the second mantissa to generate a mantissa operation result, and when the first exponent is less than the exponent threshold, at least one bit of the first mantissa (or all of its bits) is discarded before it is multiplied by the second mantissa to generate the mantissa operation result; adding the first exponent and the second exponent to generate an exponent operation result; and generating a calculated floating-point number according to the mantissa operation result and the exponent operation result.
  • an arithmetic device comprising an arithmetic unit, a first register and a second register, where the arithmetic unit is coupled to the first register and the second register, the first register stores a first floating-point number, and the second register stores a second floating-point number;
  • the first register includes a first exponent bit and a first mantissa bit, respectively storing a first sign, a first exponent and a first mantissa;
  • the second register includes a second exponent bit and a second mantissa bit, respectively storing a second sign, a second exponent and a second mantissa; wherein, when processing the multiplication between the first register and the second register,
  • the arithmetic unit performs the steps of: comparing the first exponent with an exponent threshold, wherein when the first exponent is not less than the exponent threshold, the first mantissa is multiplied by the second mantissa to generate a mantissa operation result, and when the first exponent is less than the exponent threshold, at least one bit of the first mantissa is discarded before it is multiplied by the second mantissa to generate the mantissa operation result; adding the first exponent and the second exponent to generate an exponent operation result; and generating a calculated floating-point number according to the mantissa operation result and the exponent operation result.
  • the exponent threshold is stored in a third register, and the arithmetic unit accesses the third register when performing a multiplication between the first register and the second register.
  • the first register further includes a first sign bit (Sign bit), and the first sign bit stores a first sign; the second register further includes a second sign bit, and the second sign bit stores a second sign; the floating-point operation method further includes: performing an exclusive-OR (XOR) operation on the first sign and the second sign to generate a sign operation result; and generating a calculated floating-point number according to the mantissa operation result, the sign operation result and the exponent operation result.
  • when the first exponent is smaller than the exponent threshold, the first mantissa is only temporarily stored and is not involved in any operation.
  • the exponent threshold is dynamically adjustable.
  • the exponent threshold is dynamically adjusted according to the temperature of the arithmetic unit and/or the type of task the arithmetic unit is processing.
  • the exponent threshold lies within a dynamically adjustable range:
  • the arithmetic unit starts training with an exponent threshold of 1,
  • the arithmetic unit judges whether the operation accuracy is higher than an accuracy threshold condition, and if the condition is met the exponent threshold is incremented until the operation accuracy is no longer higher than the accuracy threshold,
  • and the dynamically adjustable range consists of the exponent threshold values that met the condition.
  • the first register is coupled to a memory, the memory stores the first exponent, and when the first exponent is smaller than the exponent threshold, at least one bit of the first mantissa is discarded rather than stored in the memory.
  • when the first exponent is smaller than the exponent threshold, at least one bit of the first mantissa is in a don't-care state (Don't care).
  • the first floating-point number is decoded as (-1)^Sign1 × 2^Exponent1, where Sign1 represents the first sign and Exponent1 represents the first exponent.
  • the second floating-point number is decoded as (-1)^Sign2 × 2^Exponent2, where Sign2 represents the second sign and Exponent2 represents the second exponent.
  • the floating-point operation method further includes accessing a memory with the arithmetic unit, the memory storing multiple sets of batch normalization coefficients (Batch Normalization Coefficient) respectively corresponding to multiple candidate thresholds, and the exponent threshold is selected from one of the multiple candidate thresholds.
  • batch normalization coefficients are the coefficients used in artificial-intelligence computation to adjust the mean and standard deviation of values.
  • one set of feature map (Feature map) data usually corresponds to one specific set of batch normalization coefficients.
  • during the operation, different exponent thresholds lead to different mantissa omissions, so one set of feature map data corresponds to multiple sets of batch normalization coefficients.
  • when the value of the exponent field of a floating-point number is less than the threshold, the present invention can discard the mantissa to further save storage space, or merely store the mantissa without decoding and computing it, so as to save data-transmission and computation power.
  • through the adjustability of the threshold, the associated electronic products can flexibly trade off between a high-performance mode and a low-power mode, saving power and speeding up computation while meeting the accuracy requirements of the application.
  • FIG. 1 is a schematic diagram of an arithmetic unit applied to an arithmetic device according to an embodiment of the present invention.
  • Fig. 2 is a schematic diagram of registers storing floating point numbers in the prior art.
  • FIG. 3 is a schematic diagram of registers storing floating point numbers according to an embodiment of the present invention.
  • FIG. 4 is a schematic diagram of an arithmetic unit architecture for multiplying two floating-point numbers in the present invention.
  • Fig. 5 is a flow chart of training an artificial intelligence model by an arithmetic unit according to an embodiment of the present invention.
  • FIG. 6 is a flow chart of reducing chip power consumption by a computing device according to an embodiment of the present invention.
  • FIG. 7 is a flow chart of adaptively adjusting chip power consumption by a computing device while maintaining accuracy according to an embodiment of the present invention.
  • FIG. 8 is a flow chart of a floating-point number calculation method according to an embodiment of the present invention.
  • the words "substantially", "around", "about" or "approximately" shall generally mean within 20% of a given value or range, preferably within 10%. Furthermore, quantities provided herein may be approximate, meaning that, unless otherwise stated, they may be qualified by "about", "approximately" or "around". Where quantities, concentrations or other values or parameters are given as ranges, preferred ranges, or lists of upper and lower ideal values, this shall be deemed to specifically disclose all ranges formed by any pair of upper and lower limits or ideal values, regardless of whether those ranges are disclosed separately. For example, if a length is disclosed as ranging from X centimeters to Y centimeters, it shall be deemed that a length of H centimeters is disclosed, where H can be any real number between X and Y.
  • the term "electrically coupled" or "electrically connected"
  • covers any direct or indirect means of electrical connection.
  • if a first device is electrically coupled to a second device, the first device can be connected to the second device directly, or indirectly through other devices or connection means.
  • where the description concerns the transmission or provision of electrical signals, those skilled in the art will understand that the transmission may be accompanied by attenuation or other non-ideal changes, but unless otherwise specified the signal at the source and at the receiving end should be regarded as the same signal.
  • for example, when an electrical signal S is transmitted (or provided) from terminal A of an electronic circuit to terminal B of the electronic circuit, a voltage drop may occur across the source and drain of a transistor switch and/or through possible stray capacitance; however, if the design does not deliberately exploit the attenuation or other non-ideal changes arising during transmission (or provision) to achieve a specific technical effect, the electrical signal S at terminal A and terminal B of the electronic circuit should be regarded as effectively the same signal.
  • FIG. 1 is a schematic diagram of an arithmetic unit 110 applied to a computing device 100 according to an embodiment of the present invention.
  • the arithmetic device 100 includes an arithmetic unit 110, a first register 111, a second register 112, a third register 113 and a memory 114, and the arithmetic unit 110 is coupled to the first register 111, the second register 112 and the second register 112.
  • the memory 114 is coupled to the first register 111 , the second register 112 and the third register 113 .
  • the memory 114 is only a general term for the memory units in the computing device 100; that is to say, the memory 114 can be an independent memory unit, or can refer collectively to all possible memory units in the computing device 100, and the first register 111, the second register 112 and the third register 113 may, for example, each be coupled to different memories.
  • the computing device 100 can be any device with computing capability, such as a central processing unit (CPU), a graphics processing unit (GPU), an artificial-intelligence accelerator (AI Accelerator), a programmable logic array (FPGA), a desktop computer, a notebook computer, a smartphone, a tablet computer, a smart wearable device, etc.
  • CPU central processing unit
  • GPU graphics processing unit
  • AI Accelerator artificial intelligence accelerator
  • FPGA programmable logic array
  • for the mantissas of the floating-point numbers stored in the first register 111 and the second register 112, the present invention can ignore them and not store them in the memory 114, thereby saving memory space.
  • the memory 114 may store multiple sets of batch normalization coefficients (Batch Normalization Coefficient), respectively corresponding to multiple candidate thresholds, and the above-mentioned exponent threshold is selected from one of the multiple candidate thresholds.
  • batch normalization coefficients are the coefficients used in artificial-intelligence computation to adjust the mean and standard deviation of values.
  • one set of feature map (Feature map) data usually corresponds to one specific set of batch normalization coefficients.
  • during the operation, different exponent thresholds lead to different mantissa omissions, so one set of feature map data corresponds to multiple sets of batch normalization coefficients.
  • the first register 111 is used to store the first floating-point number
  • the second register 112 is used to store the second floating-point number
  • the third register 113 is used to store an exponent threshold
  • the first register 111 and the second register 112 perform operations
  • the third register 113 is accessed to read the exponent threshold.
  • FIG. 2 is a schematic diagram of a register storing a floating-point number in the prior art. As shown in FIG. 2, a floating-point number is divided into a sign (Sign), an exponent (Exponent) and a mantissa (Mantissa), stored in three different fields of the register, and during decoding it is always decoded as (-1)^Sign × 1.Mantissa × 2^Exponent, where:
  • Sign represents the sign of the floating-point number
  • Exponent represents the exponent of the floating-point number
  • the leftmost bit of the register will be allocated as the sign bit to store the sign
  • the remaining bits (for example, 7-63 bits) are allocated as exponent bits and mantissa bits to store the exponent and mantissa respectively.
  • the total number of sign, exponent and mantissa bits can be 8 to 64, but the present invention is not limited thereto; the total can also be fewer than 8 bits, for example 7 bits.
  • FIG. 3 is a schematic diagram of registers storing floating point numbers according to an embodiment of the present invention.
  • the present invention compares the exponent bits of the floating-point number with an exponent threshold, and selects the handling mode for the mantissa of the floating-point number mainly by setting this exponent threshold. As shown in FIG. 3, in the single-precision (Float 32) representation,
  • the decimal value "0.3057" converted into a binary floating-point number is "00111110100111001000010010110110", in which the first (most significant) bit stores "0" to indicate the sign, the second to ninth bits store the exponent, and the remaining bits store the mantissa.
  • when the second to ninth bits "01111101" are higher than the exponent threshold, the mantissa "00111001000010010110110" is considered valid and is stored in the 10th to 32nd bits. In this way, when this floating-point number is subsequently operated on with other floating-point numbers, the mantissa part is actually used.
  • the decimal value "-0.002" converted into a binary floating-point number is "10111011000000110001001001101111", in which the first (most significant) bit stores "1" to represent the sign, the second to ninth bits store the exponent, and the remaining bits store the mantissa.
  • when the second to ninth bits "01110110" are less than the exponent threshold, the mantissa "00000110001001001101111" is regarded as invalid and is not stored, so the 10th to 32nd bits are empty. In this way, when this floating-point number is subsequently operated on with other floating-point numbers, the mantissa does not take part in the calculation.
  • when the value of the exponent field of a floating-point number is less than the threshold, the floating-point number has a small value, and, with its mantissa ignored, the floating-point number can be decoded as (-1)^Sign × 2^Exponent.
  • all bits of the mantissa need not participate in the calculation and need not be transferred into the register, which saves power and transmission; the mantissa may even not be stored in the memory at all, to further save storage space.
  • alternatively, at least one bit of the mantissa does not participate in the calculation, is not transferred into the register, and is not even stored in the memory, to further save storage space.
  • the decimal value "0.003" converted into a binary floating-point number is "00111011010001001001101110100110", in which the first (most significant) bit stores "0" to represent the sign, the second to ninth bits store the exponent, and the remaining bits store the mantissa.
  • when the second to ninth bits "01110110" are less than the exponent threshold, the mantissa "10001001001101110100110" is regarded as negligible, but it is still stored in the 10th to 32nd bits and marked as a don't-care state (Don't care). In this way, when this floating-point number is subsequently operated on with other floating-point numbers, the mantissa does not take part in the calculation.
  • the mantissa can thus exist without being decoded and computed, to further save data-transmission and computation power.
  • the total number of sign, exponent and mantissa bits can be 8 to 64, but the present invention is not limited thereto; the total can also be fewer than 8 bits, for example 7 bits.
  • FIG. 4 is a schematic diagram of an arithmetic unit architecture for multiplying two floating-point numbers in the present invention.
  • the first floating point number can be extracted from the first register 111
  • the second floating point number can be extracted from the second register 112
  • the exponent threshold can be extracted from the third register 113 .
  • the first register comprises a first sign bit, exponent bits and mantissa bits, respectively storing the first sign (that is, the sign of the first floating-point number), the first exponent and the first mantissa;
  • the second register comprises a second sign bit, exponent bits and mantissa bits, respectively storing the second sign, the second exponent and the second mantissa.
  • the arithmetic unit 110 compares the first exponent with the exponent threshold through the comparison logic 144; when the first exponent is not less than the exponent threshold, the first floating-point number is relatively large and the significant digits of its mantissa cannot be ignored,
  • so the first mantissa and the second mantissa are multiplied by the multiplication logic 143 to generate the mantissa operation result (that is, the output of the comparison logic 144); if the first exponent is less than the exponent threshold, the first floating-point number is relatively small and the significant digits of its mantissa can be ignored, so at least one bit (for example, one or more bits) of the first mantissa is discarded before it is multiplied by the second mantissa to obtain the mantissa operation result; this step may include discarding only one or several bits, or discarding all bits.
  • discarding the entire first mantissa reduces power consumption the most, but where accuracy is required, discarding even a single bit still achieves some power reduction.
  • the XOR operation between the first sign and the second sign can be performed by the XOR logic 141 to generate a sign operation result (that is, the output of the XOR logic 141), and the first exponent and the second exponent can be added by the addition logic 142
  • to produce an exponent operation result (i.e., the output of the addition logic 142).
  • a calculated floating-point number is generated according to the mantissa operation result, the sign operation result and the exponent operation result as the final operation result.
  • the first floating-point number is decoded as (-1)^Sign1 × 2^Exponent1, where Sign1 represents the first sign and Exponent1 represents the first exponent.
  • this embodiment can further compare the second exponent with the exponent threshold.
  • the second floating-point number is decoded as (-1)^Sign2 × 2^Exponent2, where Sign2 represents the second sign and Exponent2 represents the second exponent.
  • the presentation of the XOR logic 141, the addition logic 142, the multiplication logic 143 and the comparison logic 144 is only an example; the exact implementation can be changed according to actual needs and may differ from the aspect shown in this embodiment, and the present invention covers all such possible detailed adjustments without additional restriction.
  • the multiplication logic 143 of the single-precision floating-point arithmetic unit interprets Mantissa as 1.Mantissa, that is, the bit to the left of the binary point is 1 and the bits to the right are Mantissa, but the invention is not limited thereto.
  • the addition logic 142 of the single-precision floating-point arithmetic unit interprets the Exponent as (Exponent-127) before performing the addition, but the invention is not limited thereto.
  • although the above mainly simplifies the storage and transmission of the first mantissa, the same concept can also be applied to the second mantissa, for example by swapping the roles of the first and second mantissas, or by simplifying the storage and transmission of both.
  • the exponent threshold may be a fixed value or may be dynamically adjustable. The adjustable-threshold design makes it possible to choose the required accuracy of the floating-point operations: for example, if the threshold is high, more mantissas are left undecoded, so the power consumed by data transmission and computation can be greatly reduced.
  • the exponent threshold can be dynamically adjusted according to the temperature of the arithmetic unit 110 and/or the type of task the arithmetic unit 110 is processing. For example, when the current temperature of the computing device 100 is too high and needs to be lowered, the exponent threshold can be raised so that the arithmetic unit 110 operates in a low-power, low-temperature mode.
  • when the computing device 100 is a mobile device in a low-battery state, the exponent threshold can also be raised to prolong the standby time of the mobile device.
  • when precise computation is required, the exponent threshold can be lowered so that more mantissas are decoded, thereby improving accuracy.
  • the exponent threshold lies within a dynamically adjustable range:
  • the arithmetic unit 110 starts training with an exponent threshold of 1, and the arithmetic unit 110 judges whether the operation accuracy is higher than the accuracy threshold condition; if the condition is met, the exponent threshold is incremented until the operation accuracy is no longer higher than the accuracy threshold, and the dynamically adjustable range consists of the exponent threshold values that met the condition.
  • the present invention ignores the mantissa field of floating-point numbers with small values and decodes the mantissa field only for floating-point numbers with large values; therefore, compared with the prior art, the present invention avoids over-design of the hardware architecture (that is, the hardware architecture can be simplified), saving the power and time of data storage and data transmission.
  • FIG. 5 is a flowchart of training the artificial intelligence model by the arithmetic unit 110 according to an embodiment of the present invention, which can be simply summarized as follows:
  • Step S502: set the initial value of the exponent threshold to 1;
  • Step S504: apply the exponent threshold to the AI model;
  • Step S506: retrain the AI model according to the exponent threshold (retrain);
  • Step S508: determine whether the decrease in accuracy of the floating-point operations has reached the maximum acceptable level of the AI model; if yes, execute step S510; if not, execute step S512;
  • Step S510: raise the exponent threshold;
  • Step S512: training is complete.
  • FIG. 5 shows a low-power-mode training scheme. If it is judged in step S508 that the decrease in accuracy of the floating-point operations does not exceed the maximum acceptable level of the AI model, the current floating-point accuracy is still higher than expected, and the exponent threshold can be raised, as far as the error tolerance permits, to further reduce power consumption and processing time.
  • FIG. 6 is a flow chart of reducing chip power consumption by the computing device 100 according to an embodiment of the present invention, which can be briefly summarized as follows:
  • Step S602: determine whether the chip needs to reduce power consumption; if yes, execute step S604; if not, the process skips to step S608;
  • Step S604: determine whether the decrease in accuracy of the floating-point operations has reached the maximum acceptable level of the AI model; if not, execute step S606; if yes, the process skips to step S608;
  • Step S606: raise the exponent threshold;
  • Step S608: the process ends.
  • in step S602 it is determined whether there is a need to reduce power consumption. Taking a smartphone as an example, if the phone's battery is sufficient or the phone is in heavy use, power consumption is not reduced; conversely, if the battery is low or the phone is lightly used, power consumption should be reduced.
  • step S604 then checks the current accuracy of the floating-point operations: if the decrease in accuracy has not reached the maximum acceptable level of the AI model, the current accuracy is still higher than expected,
  • and the exponent threshold can be raised, as far as the error tolerance permits, to further reduce power consumption and processing time.
  • FIG. 7 is a flow chart of adaptively adjusting chip power consumption by the computing device 100 while maintaining accuracy according to an embodiment of the present invention, which can be briefly summarized as follows:
  • Step S702: determine whether the chip needs to improve computation accuracy; if yes, execute step S704; if not, the process skips to step S708;
  • Step S704: determine whether the exponent threshold is 1 (i.e. the minimum exponent threshold); if not, execute step S706; if yes, the flow skips to step S708;
  • Step S706: lower the exponent threshold;
  • Step S708: the process ends.
  • in step S702 it is determined whether computation accuracy needs to be improved. Taking a smartphone as an example, if the phone is performing high-definition image processing, which has a high accuracy requirement, the chip enters a high-performance (Turbo) mode without regard to power saving; conversely, if the phone is performing image recognition, which requires less accuracy, power consumption can be saved.
  • step S704 then judges whether the exponent threshold is the minimum exponent threshold (the present invention takes 1 as an example, but is not limited thereto); if it is not yet the minimum, the threshold continues to be lowered through step S706.
  • FIG. 8 is a flow chart of a floating-point calculation method according to an embodiment of the present invention. Please note that these steps do not have to be performed in the order shown in FIG. 8 if substantially the same result can be obtained.
  • the floating-point arithmetic method shown in FIG. 8 can be adopted by the arithmetic device 100 or the arithmetic unit 110 shown in FIG. 1, and can be simply summarized as the following steps:
  • Step S802: compare the first exponent with the exponent threshold; when the first exponent is not less than the exponent threshold, multiply the first mantissa by the second mantissa to generate a mantissa operation result; when the first exponent is less than the exponent threshold, discard at least one bit of the first mantissa and then multiply it by the second mantissa to generate the mantissa operation result;
  • Step S804: perform an exclusive-OR operation on the first sign and the second sign to generate a sign operation result;
  • Step S806: add the first exponent and the second exponent to generate an exponent operation result;
  • Step S808: generate a calculated floating-point number according to the mantissa operation result, the sign operation result and the exponent operation result.
  • when the value of the exponent field of a floating-point number is less than the threshold (meaning the floating-point number is too small), the mantissa can be discarded (that is, not stored in the memory) to further save storage space, or the mantissa can merely be stored without being decoded and computed, to save data-transmission and computation power.
  • through the adjustability of the threshold, the matched electronic products can flexibly trade off between a high-performance mode and a low-power mode (for example, with a high threshold more mantissas are left undecoded, so the power of data transmission and computation is reduced); in this way, the present invention saves power and speeds up computation while meeting the accuracy requirements of the application.

Landscapes

  • Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Mathematical Optimization (AREA)
  • Computing Systems (AREA)
  • Mathematical Analysis (AREA)
  • Computational Mathematics (AREA)
  • Pure & Applied Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Nonlinear Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Complex Calculations (AREA)
  • Executing Machine-Instructions (AREA)

Abstract

A floating-point operation method is applied to the multiplication of a first floating-point number and a second floating-point number. The first floating-point number includes a first sign, a first exponent and a first mantissa, and the second floating-point number includes a second sign, a second exponent and a second mantissa. The method includes using an arithmetic unit to perform the following steps: comparing the first exponent with an exponent threshold, wherein when the first exponent is not less than the exponent threshold, the first mantissa is multiplied by the second mantissa to generate a mantissa operation result; and a calculated floating-point number is generated according to the mantissa operation result and the exponent operation result of the first exponent and the second exponent. When the value of the exponent field of a floating-point number is less than the threshold, the present invention can discard the mantissa to further save storage space, or merely store the mantissa without decoding and computing it, so as to save data-transmission and computation power; and the threshold is adjustable, so that power is saved and computation is accelerated while the accuracy requirements of the application are met.

Description

Floating-point operation method and related arithmetic unit
Technical Field
The present invention relates to applications of floating-point operations, and in particular to a floating-point operation method and a related arithmetic unit.
Background Art
With the ever-wider adoption of machine learning (Machine Learning) and the enormous amount of floating-point computation it brings, how to compress floating-point data to increase computation speed and reduce power consumption has become a topic of intensive research in this field. Existing floating-point techniques all use uniform encoding and computation, which leads to over-design (over design), wastes storage space by storing unnecessary data, and increases transmission time and computation power consumption.
In summary, a novel floating-point operation method and hardware architecture are needed to remedy the problems of the prior art.
Summary of the Invention
In view of the above needs, the object of the present invention is to disclose an efficient floating-point encoding and operation method that remedies the drawbacks of prior-art floating-point computation without substantially increasing cost, thereby increasing computation speed and reducing power consumption.
An embodiment of the present invention discloses a floating-point operation method applied to a multiplication between a first register and a second register, wherein the first register stores a first floating-point number and the second register stores a second floating-point number; the first register includes a first sign bit (Sign bit), a first exponent bit (Exponent bit) and a first mantissa bit (Mantissa bit), respectively storing a first sign, a first exponent and a first mantissa; the second register includes a second sign bit, a second exponent bit and a second mantissa bit, respectively storing a second sign, a second exponent and a second mantissa. The method includes using an arithmetic unit (Arithmetic Unit) to perform the following steps: comparing the first exponent with an exponent threshold, wherein when the first exponent is not less than the exponent threshold, the first mantissa is multiplied by the second mantissa to generate a mantissa operation result, and when the first exponent is less than the exponent threshold, at least one bit of the first mantissa is discarded (or all of its bits are discarded) before it is multiplied by the second mantissa to generate the mantissa operation result; adding the first exponent and the second exponent to generate an exponent operation result; and generating a calculated floating-point number according to the mantissa operation result and the exponent operation result.
In addition to the above method, another embodiment of the present invention discloses an arithmetic unit coupled to a first register and a second register, wherein the first register stores a first floating-point number and the second register stores a second floating-point number; the first register includes a first exponent bit and a first mantissa bit, respectively storing a first sign, a first exponent and a first mantissa; the second register includes a second exponent bit and a second mantissa bit, respectively storing a second sign, a second exponent and a second mantissa. When processing a multiplication between the first register and the second register, the arithmetic unit performs the following steps: comparing the first exponent with an exponent threshold, wherein when the first exponent is not less than the exponent threshold, the first mantissa is multiplied by the second mantissa to generate a mantissa operation result, and when the first exponent is less than the exponent threshold, at least one bit of the first mantissa is discarded (or all of its bits are discarded) before it is multiplied by the second mantissa to generate the mantissa operation result; adding the first exponent and the second exponent to generate an exponent operation result; and generating a calculated floating-point number according to the mantissa operation result and the exponent operation result.
Another embodiment of the present invention discloses a computing device including an arithmetic unit, a first register and a second register, wherein the arithmetic unit is coupled to the first register and the second register, the first register stores a first floating-point number and the second register stores a second floating-point number; the first register includes a first exponent bit and a first mantissa bit, respectively storing a first sign, a first exponent and a first mantissa; the second register includes a second exponent bit and a second mantissa bit, respectively storing a second sign, a second exponent and a second mantissa. When processing the multiplication between the first register and the second register, the arithmetic unit performs the following steps: comparing the first exponent with an exponent threshold, wherein when the first exponent is not less than the exponent threshold, the first mantissa is multiplied by the second mantissa to generate a mantissa operation result, and when the first exponent is less than the exponent threshold, at least one bit of the first mantissa is discarded (or all of its bits are discarded) before it is multiplied by the second mantissa to generate the mantissa operation result; adding the first exponent and the second exponent to generate an exponent operation result; and generating a calculated floating-point number according to the mantissa operation result and the exponent operation result.
Optionally, according to an embodiment of the present invention, the exponent threshold is stored in a third register, and the arithmetic unit accesses the third register when performing the multiplication between the first register and the second register.
Optionally, according to an embodiment of the present invention, the first register further includes a first sign bit (Sign bit) storing a first sign, and the second register further includes a second sign bit storing a second sign; the floating-point operation method further includes: performing an exclusive-OR (XOR) operation on the first sign and the second sign to generate a sign operation result; and generating a calculated floating-point number according to the mantissa operation result, the sign operation result and the exponent operation result.
Optionally, according to an embodiment of the present invention, when the first exponent is less than the exponent threshold, the first mantissa is only temporarily stored and is not involved in any operation.
Optionally, according to an embodiment of the present invention, the exponent threshold is dynamically adjustable.
Optionally, according to an embodiment of the present invention, the exponent threshold is dynamically adjusted according to the temperature of the arithmetic unit and/or the type of task the arithmetic unit is processing.
Optionally, according to an embodiment of the present invention, the exponent threshold lies within a dynamically adjustable range; the arithmetic unit starts training with an exponent threshold of 1, judges whether the operation accuracy is higher than an accuracy threshold condition and, if the condition is met, increments the exponent threshold until the operation accuracy is no longer higher than the accuracy threshold; the dynamically adjustable range consists of the exponent threshold values that met the condition.
Optionally, according to an embodiment of the present invention, the first register is coupled to a memory that stores the first exponent, and when the first exponent is less than the exponent threshold, at least one bit of the first mantissa is discarded rather than stored in the memory.
Optionally, according to an embodiment of the present invention, when the first exponent is less than the exponent threshold, at least one bit of the first mantissa is in a don't-care state (Don't care).
Optionally, according to an embodiment of the present invention, when the first exponent is less than the exponent threshold, the first floating-point number is decoded as (-1)^Sign1 × 2^Exponent1, where Sign1 represents the first sign and Exponent1 represents the first exponent.
Optionally, according to an embodiment of the present invention, when the second exponent is less than the exponent threshold, the second floating-point number is decoded as (-1)^Sign2 × 2^Exponent2, where Sign2 represents the second sign and Exponent2 represents the second exponent.
Optionally, according to an embodiment of the present invention, the floating-point operation method further includes accessing a memory with the arithmetic unit, the memory storing multiple sets of batch normalization coefficients (Batch Normalization Coefficient) respectively corresponding to multiple candidate thresholds, and the exponent threshold is selected from one of the multiple candidate thresholds. Batch normalization coefficients are the coefficients used in artificial-intelligence computation to adjust the mean and standard deviation of values. Normally, one set of feature map (Feature map) data corresponds to one specific set of batch normalization coefficients; according to this embodiment, one set of feature map data corresponds to multiple sets of batch normalization coefficients, because different exponent thresholds lead to different mantissa omissions during its computation.
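The following minimal Python sketch shows one way such a per-threshold coefficient table could be organized, assuming the patent's idea of keeping one set of batch normalization coefficients per candidate exponent threshold; the threshold values, channel counts and coefficients are invented for illustration and are not taken from the patent.

```python
# Hypothetical per-threshold coefficient table; thresholds and values are made up.
batch_norm_tables = {
    118: {"gamma": [1.02, 0.97], "beta": [0.01, -0.03]},
    120: {"gamma": [1.05, 0.95], "beta": [0.02, -0.04]},
    122: {"gamma": [1.09, 0.92], "beta": [0.03, -0.06]},
}

def batch_norm_coefficients(exponent_threshold: int) -> dict:
    """Return the coefficient set matching the currently selected exponent threshold."""
    return batch_norm_tables[exponent_threshold]

def apply_batch_norm(feature_map, exponent_threshold: int):
    """Scale and shift one (tiny) feature map with the coefficients tied to the active threshold."""
    coeffs = batch_norm_coefficients(exponent_threshold)
    return [g * x + b for x, g, b in zip(feature_map, coeffs["gamma"], coeffs["beta"])]

print(apply_batch_norm([0.5, -1.0], 120))   # uses the coefficient set for threshold 120
```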
In summary, when the value of the exponent field of a floating-point number is less than the threshold, the present invention can discard the mantissa to further save storage space, or merely store the mantissa without decoding and computing it, so as to save data-transmission and computation power. Moreover, through the adjustability of the threshold, the electronic product using the invention can flexibly trade off between a high-performance mode and a low-power mode; in this way, the invention saves power and speeds up computation while meeting the accuracy requirements of the application. To make the above and other objects, features and advantages of the present invention easier to understand, embodiments are described in detail below with reference to the accompanying drawings.
Brief Description of the Drawings
FIG. 1 is a schematic diagram of an arithmetic unit applied to a computing device according to an embodiment of the present invention.
FIG. 2 is a schematic diagram of a register storing a floating-point number in the prior art.
FIG. 3 is a schematic diagram of a register storing a floating-point number according to an embodiment of the present invention.
FIG. 4 is a schematic diagram of the arithmetic-unit architecture of the present invention for multiplying two floating-point numbers.
FIG. 5 is a flowchart of training an artificial-intelligence model by the arithmetic unit according to an embodiment of the present invention.
FIG. 6 is a flowchart of reducing chip power consumption by the computing device according to an embodiment of the present invention.
FIG. 7 is a flowchart of adaptively adjusting chip power consumption while maintaining accuracy by the computing device according to an embodiment of the present invention.
FIG. 8 is a flowchart of a floating-point operation method according to an embodiment of the present invention.
Preferred Embodiments of the Invention
The present invention is described by way of the following examples, which are provided for illustration only, since those skilled in the art may make various modifications and refinements without departing from the spirit and scope of this disclosure; the scope of protection of this disclosure is therefore defined by the appended claims. Throughout the specification and claims, unless the context clearly dictates otherwise, "a/an" and "the" include references to "one or at least one" element or component. Furthermore, as used herein, the singular article also includes references to plural elements or components unless it is obvious from the particular context that the plural is excluded. Moreover, as used in this description and throughout the claims that follow, the meaning of "in" may include "in" and "on" unless the context clearly dictates otherwise. Terms used throughout the specification and claims, unless otherwise noted, ordinarily have the meaning commonly ascribed to each term in this field, in this disclosure and in the particular context. Certain terms used to describe the invention are discussed below, or elsewhere in this specification, to provide practitioners with additional guidance concerning the description of the invention. Examples anywhere in this specification, including examples of any term discussed herein, are illustrative only and in no way limit the scope and meaning of the invention or of any exemplified term. Likewise, the invention is not limited to the embodiments presented in this specification.
As used herein, the words "substantially", "around", "about" or "approximately" shall generally mean within 20% of a given value or range, preferably within 10%. Furthermore, quantities provided herein may be approximate, meaning that, unless otherwise stated, they may be qualified by "about", "approximately" or "around". Where a quantity, concentration or other value or parameter is given as a range, a preferred range, or a list of upper and lower ideal values, this shall be taken to specifically disclose all ranges formed from any pair of upper and lower limits or ideal values, regardless of whether those ranges are separately disclosed. For example, if a length is disclosed as ranging from X centimeters to Y centimeters, this shall be taken to disclose a length of H centimeters where H may be any real number between X and Y.
In addition, the term "electrically coupled" or "electrically connected", if used herein, covers any direct or indirect means of electrical connection. For example, if the text describes a first device as being electrically coupled to a second device, the first device may be connected to the second device directly, or indirectly through other devices or connection means. Furthermore, where the transmission or provision of electrical signals is described, those skilled in the art will understand that the transmission of an electrical signal may be accompanied by attenuation or other non-ideal changes, but unless otherwise specified, the source and the receiving end of the transmitted or provided electrical signal should be regarded as substantially the same signal. For example, when an electrical signal S is transmitted (or provided) from terminal A of an electronic circuit to terminal B of the electronic circuit, a voltage drop may occur across the source and drain of a transistor switch and/or through possible stray capacitance; however, if the design does not deliberately exploit the attenuation or other non-ideal changes arising during transmission (or provision) to achieve some specific technical effect, the electrical signal S at terminal A and terminal B of the electronic circuit should be regarded as effectively the same signal.
It will be understood that the terms "comprising" or "including", "having", "containing", "involving" and the like used herein are open-ended, meaning including but not limited to. In addition, any single embodiment or claim of the present invention need not achieve all of the objects, advantages or features disclosed herein. Moreover, the Abstract and the title are provided only to assist patent-document searching and are not intended to limit the scope of the claims of the present invention.
Please refer to FIG. 1, which is a schematic diagram of an arithmetic unit 110 applied to a computing device 100 according to an embodiment of the present invention. As shown in FIG. 1, the computing device 100 includes the arithmetic unit 110, a first register 111, a second register 112, a third register 113 and a memory 114; the arithmetic unit 110 is coupled to the first register 111, the second register 112 and the third register 113, and the memory 114 is coupled to the first register 111, the second register 112 and the third register 113. It is worth noting that the memory 114 is only a general term for the memory units in the computing device 100; that is, the memory 114 may be an independent memory unit or may refer collectively to all possible memory units in the computing device 100, and the first register 111, the second register 112 and the third register 113 may, for example, each be coupled to different memories. The computing device 100 may be any device with computing capability, such as a central processing unit (CPU), a graphics processing unit (GPU), an artificial-intelligence accelerator (AI Accelerator), a programmable logic array (FPGA), a desktop computer, a notebook computer, a smartphone, a tablet computer, a smart wearable device, and the like. For the mantissas of the floating-point numbers stored in the first register 111 and the second register 112, the present invention may ignore them and not store them in the memory 114, thereby saving memory space. In addition, the memory 114 may store multiple sets of batch normalization coefficients (Batch Normalization Coefficient) respectively corresponding to multiple candidate thresholds, and the above-mentioned exponent threshold is selected from one of the multiple candidate thresholds. Batch normalization coefficients are the coefficients used in artificial-intelligence computation to adjust the mean and standard deviation of values. Normally, one set of feature map (Feature map) data corresponds to one specific set of batch normalization coefficients; according to this embodiment, one set of feature map data corresponds to multiple sets of batch normalization coefficients because different exponent thresholds lead to different mantissa omissions during its computation. The first register 111 is used to store the first floating-point number, the second register 112 is used to store the second floating-point number, and the third register 113 is used to store an exponent threshold; when the first register 111 and the second register 112 are used in an operation, the third register 113 is accessed to read the exponent threshold. For example, please refer to FIG. 2, which is a schematic diagram of a register storing a floating-point number in the prior art. As shown in FIG. 2, a floating-point number is divided into a sign (Sign), an exponent (Exponent) and a mantissa (Mantissa), stored in three different fields of the register, and during decoding it is always decoded as:
(-1)^Sign × 1.Mantissa × 2^Exponent
where Sign represents the sign of the floating-point number and Exponent represents its exponent. In general, the leftmost bit of the register is allocated as the sign bit to store the sign, and the remaining bits (for example, 7 to 63 bits) are allocated as exponent bits and mantissa bits to store the exponent and the mantissa respectively. In the example of FIG. 2, the total number of sign, exponent and mantissa bits may be 8 to 64, but the present invention is not limited thereto; the total may also be fewer than 8 bits, for example 7 bits.
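For readers who want to inspect the three fields concretely, the short Python sketch below splits a value into the sign, exponent and mantissa fields of FIG. 2 and rebuilds it with the prior-art decoding; it only illustrates the standard single-precision format and is not part of the patented method (sub-normal numbers, infinities and NaNs are ignored).

```python
import struct

def float32_fields(x: float):
    """Split a value (treated as IEEE 754 single precision) into its bit fields."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign     = (bits >> 31) & 0x1        # 1 bit
    exponent = (bits >> 23) & 0xFF       # 8 bits (biased by 127)
    mantissa = bits & 0x7FFFFF           # 23 bits
    return sign, exponent, mantissa

def decode(sign: int, exponent: int, mantissa: int) -> float:
    """Prior-art decoding (-1)^Sign x 1.Mantissa x 2^(Exponent-127) for normalized numbers."""
    return (-1) ** sign * (1 + mantissa / 2**23) * 2.0 ** (exponent - 127)

s, e, m = float32_fields(0.3057)
print(bin(e), bin(m), decode(s, e, m))   # exponent field 0b1111101 (125), value ~0.3057
```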
Next, please refer to FIG. 3, which is a schematic diagram of a register storing a floating-point number according to an embodiment of the present invention. The present invention compares the exponent bits of the floating-point number with an exponent threshold, and selects the handling mode for the mantissa of the floating-point number mainly by setting this exponent threshold. As shown in FIG. 3, in the single-precision (Float 32) representation, the decimal value "0.3057" converted into a binary floating-point number is "00111110100111001000010010110110", in which the first (most significant) bit stores "0" to indicate the sign, the second to ninth bits store the exponent, and the remaining bits store the mantissa. When the second to ninth bits "01111101" are higher than the exponent threshold, the mantissa "00111001000010010110110" is regarded as valid and is stored in the 10th to 32nd bits; in this way, when this floating-point number is later operated on with other floating-point numbers, the mantissa part is actually used.
In another example, the decimal value "-0.002" converted into a binary floating-point number is "10111011000000110001001001101111", in which the first (most significant) bit stores "1" to indicate the sign, the second to ninth bits store the exponent, and the remaining bits store the mantissa. When the second to ninth bits "01110110" are less than the exponent threshold, the mantissa "00000110001001001101111" is regarded as invalid and is not stored, so the 10th to 32nd bits are empty; in this way, when this floating-point number is later operated on with other floating-point numbers, the mantissa part does not take part in the calculation. In other words, when the value of the exponent field of a floating-point number is less than the threshold, the floating-point number has a small value, and, with its mantissa ignored, the floating-point number can be decoded as:
(-1)^Sign × 2^Exponent
where all bits of the mantissa need not participate in the calculation and need not be transferred into the register, which saves power and transmission; the mantissa may even not be stored in the memory at all, further saving storage space. In another implementation, at least one bit of the mantissa does not participate in the calculation, is not transferred into the register, and is not even stored in the memory, to further save storage space.
In yet another example, the decimal value "0.003" converted into a binary floating-point number is "00111011010001001001101110100110", in which the first (most significant) bit stores "0" to indicate the sign, the second to ninth bits store the exponent, and the remaining bits store the mantissa. When the second to ninth bits "01110110" are less than the exponent threshold, the mantissa "10001001001101110100110" is regarded as negligible but is still stored in the 10th to 32nd bits and marked as a don't-care state (Don't care); in this way, when this floating-point number is later operated on with other floating-point numbers, the mantissa part does not take part in the calculation. The difference between this example and the previous one is that the mantissa can exist without being decoded and computed, further saving data-transmission and computation power. Likewise, in the example of FIG. 3, the total number of sign, exponent and mantissa bits may be 8 to 64, but the present invention is not limited thereto; the total may also be fewer than 8 bits, for example 7 bits.
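A small Python sketch of the storage decision described in the FIG. 3 examples follows. The concrete threshold value 120 (binary 01111000) is an assumption chosen only so that it lies between the two exponent fields discussed above (01110110 and 01111101); it is not a value given in the patent.

```python
import struct

EXPONENT_THRESHOLD = 0b01111000   # assumed threshold (120), for illustration only

def store_float32(x: float, threshold: int = EXPONENT_THRESHOLD):
    """Return (sign, exponent, mantissa-or-None): the mantissa is kept only when the
    biased exponent field is not below the threshold. The FIG. 3 "don't care" variant
    would instead keep the bits but mark them as not to be decoded."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign, exponent, mantissa = bits >> 31, (bits >> 23) & 0xFF, bits & 0x7FFFFF
    if exponent < threshold:
        mantissa = None            # discarded: not written to memory / not decoded
    return sign, exponent, mantissa

for v in (0.3057, -0.002, 0.003):
    print(v, store_float32(v))     # 0.3057 keeps its mantissa; the other two do not
```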
Please refer to FIG. 4, which is a schematic diagram of the arithmetic-unit architecture of the present invention for multiplying two floating-point numbers. As described above, the first floating-point number can be fetched from the first register 111, the second floating-point number from the second register 112, and the exponent threshold from the third register 113. The first register includes a first sign bit, exponent bits and mantissa bits, respectively storing the first sign (i.e. the sign of the first floating-point number), the first exponent and the first mantissa; the second register includes a second sign bit, exponent bits and mantissa bits, respectively storing the second sign, the second exponent and the second mantissa.
When processing the multiplication between the first register 111 and the second register 112, the arithmetic unit 110 compares the first exponent with the exponent threshold through the comparison logic 144. When the first exponent is not less than the exponent threshold, the first floating-point number is relatively large and the significant digits of its mantissa cannot be ignored, so the first mantissa is multiplied by the second mantissa through the multiplication logic 143 to generate the mantissa operation result (i.e. the output of the comparison logic 144). If the first exponent is less than the exponent threshold, the first floating-point number is relatively small and the significant digits of its mantissa can be ignored, so at least one bit (for example, one or more bits) of the first mantissa is discarded before it is multiplied by the second mantissa to generate the mantissa operation result; this step may include discarding only one or several bits, or discarding all bits (that is, ignoring the entire first mantissa, which is equivalent to generating the mantissa operation result directly from the second mantissa). Preferably, discarding the entire first mantissa reduces power consumption the most, but where accuracy is required, discarding even a single bit still achieves some power reduction. In addition, the XOR operation between the first sign and the second sign can be performed by the XOR logic 141 to generate a sign operation result (i.e. the output of the XOR logic 141), and the first exponent and the second exponent can be added by the addition logic 142 to generate an exponent operation result (i.e. the output of the addition logic 142). Finally, a calculated floating-point number is generated from the mantissa operation result, the sign operation result and the exponent operation result as the final result. When the first exponent is less than the exponent threshold, the first floating-point number is decoded as (-1)^Sign1 × 2^Exponent1, where Sign1 represents the first sign and Exponent1 represents the first exponent. Similarly, in addition to comparing the first exponent with the exponent threshold, this embodiment may further compare the second exponent with the exponent threshold; when the second exponent is less than the exponent threshold, the second floating-point number is decoded as (-1)^Sign2 × 2^Exponent2, where Sign2 represents the second sign and Exponent2 represents the second exponent. The presentation of the XOR logic 141, the addition logic 142, the multiplication logic 143 and the comparison logic 144 in this embodiment is only an example; the exact implementation may be varied according to actual needs and may differ from the aspect shown here, and the present invention covers all such possible detailed adjustments without additional restriction. For example, the multiplication logic 143 of a single-precision (single-precision) floating-point arithmetic unit interprets Mantissa as 1.Mantissa, i.e. the bit to the left of the binary point is 1 and the bits to the right are Mantissa, but the invention is not limited thereto. Likewise, the addition logic 142 of a single-precision floating-point arithmetic unit interprets Exponent as (Exponent-127) before adding, but the invention is not limited thereto either. Although the above mainly simplifies the storage and transmission of the first mantissa, the same concept can also be applied to the second mantissa; for example, the roles of the first and second mantissas in the above examples can be swapped, or the storage and transmission of both the first and second mantissas can be simplified.
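The following Python sketch mirrors the FIG. 4 data path under stated simplifications: normalization, rounding, zeros and sub-normal inputs are ignored, the entire first mantissa is discarded below the threshold, and only the first exponent is compared with the threshold (as in claim 1). It is an illustration under these assumptions, not the patented hardware implementation.

```python
import struct

def unpack_f32(x: float):
    """Split a value into the sign, biased exponent and mantissa fields."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return bits >> 31, (bits >> 23) & 0xFF, bits & 0x7FFFFF

def threshold_multiply(a: float, b: float, exp_threshold: int) -> float:
    """Threshold-gated multiplication in the style of FIG. 4 (normalized inputs only)."""
    s1, e1, m1 = unpack_f32(a)
    s2, e2, m2 = unpack_f32(b)

    sign = s1 ^ s2                          # XOR logic 141
    exponent = (e1 - 127) + (e2 - 127)      # addition logic 142 (Exponent-127 interpretation)

    frac2 = 1 + m2 / 2**23                  # 1.Mantissa interpretation of the second operand
    if e1 >= exp_threshold:
        frac1 = 1 + m1 / 2**23              # multiplication logic 143 uses the first mantissa
    else:
        frac1 = 1.0                         # first mantissa discarded: a acts as (-1)^s1 * 2^(e1-127)

    return (-1) ** sign * frac1 * frac2 * 2.0 ** exponent

print(threshold_multiply(0.3057, 0.003, 120))   # exponent field 125 >= 120: mantissa of 0.3057 used
print(threshold_multiply(0.003, 0.3057, 120))   # exponent field 118 <  120: mantissa of 0.003 skipped
```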
According to different embodiments of the present invention, the exponent threshold may be a fixed value or may be dynamically adjustable. The adjustable-threshold design makes it possible to select the required accuracy of the floating-point operations; for example, if the threshold is high, more mantissas are left undecoded, so the power consumed by data transmission and computation can be greatly reduced. The exponent threshold can be adjusted dynamically according to the temperature of the arithmetic unit 110 and/or the type of task the arithmetic unit 110 is processing. For example, when the current temperature of the computing device 100 is too high and needs to be lowered, the exponent threshold can be raised so that the arithmetic unit 110 operates in a low-power, low-temperature mode. Furthermore, when the computing device 100 is a mobile device in a low-battery state, the exponent threshold can also be raised to extend the standby time of the mobile device. Conversely, when the arithmetic unit 110 needs to perform precise computation, the exponent threshold can be lowered so that more mantissas are decoded, thereby improving accuracy.
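As an illustration of this kind of policy, the sketch below picks an exponent threshold from the device temperature and the task type; every concrete number, task name and threshold value in it is an assumption made for the example, not data from the patent.

```python
def choose_exponent_threshold(temperature_c: float, task: str) -> int:
    """Illustrative threshold policy: raise the threshold under thermal or accuracy-tolerant
    conditions, drop it to the minimum for accuracy-critical work. Values are assumptions."""
    threshold = 120                       # assumed baseline (biased exponent field value)
    if task == "hd_image_processing":     # accuracy-critical workload: decode every mantissa
        return 1                          # minimum threshold (Turbo mode)
    if temperature_c > 85:                # overheating: low-power / low-temperature mode
        threshold += 4
    if task == "image_recognition":       # accuracy-tolerant workload
        threshold += 2
    return threshold

print(choose_exponent_threshold(90.0, "image_recognition"))   # 126
```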
Optionally, according to an embodiment of the present invention, the exponent threshold lies within a dynamically adjustable range: the arithmetic unit 110 starts training with an exponent threshold of 1, judges whether the operation accuracy is higher than an accuracy threshold condition and, if the condition is met, increments the exponent threshold until the operation accuracy is no longer higher than the accuracy threshold; the dynamically adjustable range consists of the exponent threshold values that met the condition. The present invention ignores the mantissa field of floating-point numbers with small values and decodes the mantissa field only for floating-point numbers with large values; therefore, compared with the prior art, the present invention avoids over-design of the hardware architecture (that is, the hardware architecture can be simplified), saving the power and time of data storage and data transmission.
As can be seen from the above embodiments, since the computing device 100 may be used in many different applications, choosing the exponent threshold properly is crucial for obtaining the best trade-off between accuracy, power consumption and processing speed. If the present invention is applied to an artificial-intelligence (AI) model, a suitable exponent threshold can be computed according to the current needs of the computing device 100. Please refer to FIG. 5, which is a flowchart of training an AI model by the arithmetic unit 110 according to an embodiment of the present invention; it can be briefly summarized as follows:
Step S502: set the initial value of the exponent threshold to 1;
Step S504: apply the exponent threshold to the AI model;
Step S506: retrain the AI model according to the exponent threshold (retrain);
Step S508: determine whether the decrease in accuracy of the floating-point operations has reached the maximum acceptable level of the AI model; if yes, go to step S510; if no, go to step S512;
Step S510: raise the exponent threshold;
Step S512: training is complete.
In summary, FIG. 5 illustrates a low-power-mode training scheme: if it is determined in step S508 that the decrease in accuracy of the floating-point operations has not exceeded the maximum acceptable level of the AI model, the current floating-point accuracy is still higher than expected, and the exponent threshold can be raised, as far as the error tolerance permits, to further reduce power consumption and processing time.
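A minimal sketch of this training loop follows, written to the summarized behaviour above (keep raising the threshold while the accuracy drop stays within the acceptable level). The `model`, `retrain`, `accuracy_of` and `max_accuracy_drop` hooks are assumed caller-supplied interfaces, not APIs defined by the patent.

```python
def find_threshold_range(model, retrain, accuracy_of, max_accuracy_drop):
    """Sketch of FIG. 5: start from an exponent threshold of 1 and raise it while the
    accuracy drop relative to the baseline model stays acceptable."""
    baseline = accuracy_of(model)
    threshold = 1                                               # step S502
    usable = []                                                 # the dynamically adjustable range
    while True:
        model = retrain(model, exponent_threshold=threshold)    # steps S504 / S506
        if baseline - accuracy_of(model) >= max_accuracy_drop:  # step S508
            break                                               # step S512: training complete
        usable.append(threshold)
        threshold += 1                                          # step S510: raise the threshold
    return usable
```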
Please refer to FIG. 6, which is a flowchart of reducing chip power consumption by the computing device 100 according to an embodiment of the present invention; it can be briefly summarized as follows:
Step S602: determine whether the chip needs to reduce power consumption; if yes, go to step S604; if no, skip to step S608;
Step S604: determine whether the decrease in accuracy of the floating-point operations has reached the maximum acceptable level of the AI model; if no, go to step S606; if yes, skip to step S608;
Step S606: raise the exponent threshold;
Step S608: end of the flow.
In summary, FIG. 6 illustrates a power-optimization scheme. First, step S602 determines whether there is a need to reduce power consumption. Taking a smartphone as an example, if the phone's battery is sufficient or the phone is in heavy use, power consumption is not reduced; conversely, if the battery is low or the phone is lightly used, power consumption should be reduced. Once it is determined that the chip needs to reduce power consumption, step S604 checks the current accuracy of the floating-point operations: if the decrease in accuracy has not reached the maximum acceptable level of the AI model, the current accuracy is still higher than expected, and the exponent threshold can be raised, as far as the error tolerance permits, to further reduce power consumption and processing time.
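A compact sketch of the FIG. 6 decision is shown below; the `chip` object and its fields are assumptions introduced only to make the flow executable as a function.

```python
def maybe_save_power(chip, max_accuracy_drop: float) -> None:
    """Sketch of FIG. 6: raise the exponent threshold when power must be saved and
    the accuracy drop has not yet reached the acceptable limit."""
    if not chip.needs_power_saving:              # step S602 (e.g. battery low / light use)
        return                                   # step S608
    if chip.accuracy_drop >= max_accuracy_drop:  # step S604: no accuracy headroom left
        return                                   # step S608
    chip.exponent_threshold += 1                 # step S606: raise the threshold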
Please refer to FIG. 7, which is a flowchart of adaptively adjusting chip power consumption while maintaining accuracy by the computing device 100 according to an embodiment of the present invention; it can be briefly summarized as follows:
Step S702: determine whether the chip needs to improve computation accuracy; if yes, go to step S704; if no, skip to step S708;
Step S704: determine whether the exponent threshold is 1 (i.e. the minimum exponent threshold); if no, go to step S706; if yes, skip to step S708;
Step S706: lower the exponent threshold;
Step S708: end of the flow.
In summary, FIG. 7 illustrates a power-adjustment scheme oriented toward floating-point accuracy. First, step S702 determines whether computation accuracy needs to be improved. Taking a smartphone as an example, if the phone is performing high-definition image processing, which has a high accuracy requirement, the chip enters a high-performance (Turbo) mode without regard to power saving; conversely, if the phone is performing image recognition, which requires less accuracy, power can be saved. Then, step S704 checks whether the exponent threshold is the minimum exponent threshold (the present invention takes 1 as an example, but is not limited thereto); if it is not yet the minimum, the threshold continues to be lowered through step S706.
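The complementary FIG. 7 decision can be sketched in the same style; again the `chip` object and its fields are assumptions, and the minimum threshold of 1 follows step S704.

```python
MIN_EXPONENT_THRESHOLD = 1   # the minimum used as an example in the text

def maybe_raise_accuracy(chip) -> None:
    """Sketch of FIG. 7: lower the exponent threshold (decode more mantissas) when the
    current workload needs more precision, stopping at the minimum threshold."""
    if not chip.needs_more_accuracy:                       # step S702 (e.g. HD image processing)
        return                                             # step S708
    if chip.exponent_threshold <= MIN_EXPONENT_THRESHOLD:  # step S704: already at the minimum
        return                                             # step S708
    chip.exponent_threshold -= 1                           # step S706: lower the threshold
```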
Please refer to FIG. 8, which is a flowchart of a floating-point operation method according to an embodiment of the present invention. Note that these steps need not be executed in the order shown in FIG. 8, provided substantially the same result can be obtained. The floating-point operation method shown in FIG. 8 can be adopted by the computing device 100 or the arithmetic unit 110 shown in FIG. 1, and can be briefly summarized as the following steps:
Step S802: compare the first exponent with the exponent threshold; when the first exponent is not less than the exponent threshold, multiply the first mantissa by the second mantissa to generate the mantissa operation result; when the first exponent is less than the exponent threshold, discard at least one bit of the first mantissa and then multiply it by the second mantissa to generate the mantissa operation result;
Step S804: perform an XOR operation on the first sign and the second sign to generate the sign operation result;
Step S806: add the first exponent and the second exponent to generate the exponent operation result;
Step S808: generate a calculated floating-point number according to the mantissa operation result, the sign operation result and the exponent operation result.
Since those skilled in the art can readily understand the details of each step in FIG. 8 after reading the preceding paragraphs, further description is omitted here for brevity.
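As a brief worked example of steps S802 to S808, using the FIG. 3 sample values and an assumed exponent threshold lying between the exponent fields 01110110 and 01111101: multiplying the first floating-point number -0.002 (about -1.024 × 2^-9, exponent field 01110110, below the threshold) by the second floating-point number 0.3057 (about 1.2228 × 2^-2), the sign XOR gives a negative sign, the exponent addition gives 2^-11, and because the first mantissa is discarded the product is approximated as -1.2228 × 2^-11 ≈ -0.000597 instead of the exact -0.0006114; the entire error comes from the discarded mantissa of the small operand.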
In summary, when the value of the exponent field of a floating-point number is less than the threshold (meaning the floating-point number is too small), the present invention can discard the mantissa (i.e. not store it in the memory) to further save storage space, or merely store the mantissa without decoding and computing it, to save data-transmission and computation power. Moreover, through the adjustability of the threshold (see the optimization flows of FIG. 5 to FIG. 7), the electronic product using the invention can flexibly trade off between a high-performance mode and a low-power mode (for example, with a high threshold more mantissas are left undecoded and the power of data transmission and computation is reduced); in this way, the present invention saves power and speeds up computation while meeting the accuracy requirements of the application.

Claims (23)

  1. A floating-point operation method, applied to a multiplication between a first register and a second register, the first register storing a first floating-point number and the second register storing a second floating-point number; the first register including a first exponent bit and a first mantissa bit, respectively storing a first exponent and a first mantissa; the second register including a second exponent bit and a second mantissa bit, respectively storing a second exponent and a second mantissa; the method being characterized by using an arithmetic unit to perform the following steps:
    comparing the first exponent with an exponent threshold, wherein when the first exponent is not less than the exponent threshold, the first mantissa is multiplied by the second mantissa to generate a mantissa operation result, and when the first exponent is less than the exponent threshold, at least one bit of the first mantissa is discarded before it is multiplied by the second mantissa to generate the mantissa operation result;
    adding the first exponent and the second exponent to generate an exponent operation result; and
    generating a calculated floating-point number according to the mantissa operation result and the exponent operation result.
  2. The floating-point operation method of claim 1, wherein the first register further includes a first sign bit storing a first sign, and the second register further includes a second sign bit storing a second sign, the floating-point operation method further comprising:
    performing an exclusive-OR operation on the first sign and the second sign to generate a sign operation result; and
    generating a calculated floating-point number according to the mantissa operation result, the sign operation result and the exponent operation result.
  3. The floating-point operation method of claim 1, wherein the exponent threshold is stored in a third register, and the arithmetic unit accesses the third register when performing the multiplication between the first register and the second register.
  4. The floating-point operation method of claim 1, wherein when the first exponent is less than the exponent threshold, at least one bit of the first mantissa is only temporarily stored and is not involved in the operation.
  5. The floating-point operation method of claim 4, wherein the exponent threshold is dynamically adjustable.
  6. The floating-point operation method of claim 5, wherein the exponent threshold is dynamically adjusted according to the temperature of the arithmetic unit and/or the type of task the arithmetic unit is processing.
  7. The floating-point operation method of claim 4, wherein the exponent threshold lies within a dynamically adjustable range; the arithmetic unit starts training with an exponent threshold of 1, judges whether the operation accuracy is higher than an accuracy threshold condition and, if the condition is met, increments the exponent threshold until the operation accuracy is no longer higher than the accuracy threshold, the dynamically adjustable range consisting of the exponent threshold values that met the condition.
  8. The floating-point operation method of claim 1, wherein the first register is coupled to a memory that stores the first exponent, and when the first exponent is less than the exponent threshold, at least one bit of the first mantissa is discarded rather than stored in the memory.
  9. The floating-point operation method of claim 1, wherein when the first exponent is less than the exponent threshold, at least one bit of the first mantissa is in a don't-care state.
  10. The floating-point operation method of claim 1, wherein when the first exponent is less than the exponent threshold, the first floating-point number is decoded as (-1)^Sign1 × 2^Exponent1, where Sign1 represents the first sign and Exponent1 represents the first exponent.
  11. The floating-point operation method of claim 10, wherein when the second exponent is less than the exponent threshold, the second floating-point number is decoded as (-1)^Sign2 × 2^Exponent2, where Sign2 represents the second sign and Exponent2 represents the second exponent.
  12. The floating-point operation method of claim 1, further comprising accessing a memory with the arithmetic unit, the memory storing multiple sets of batch normalization coefficients respectively corresponding to multiple candidate thresholds, the exponent threshold being selected from one of the multiple candidate thresholds.
  13. An arithmetic unit, coupled to a first register and a second register, the first register storing a first floating-point number and the second register storing a second floating-point number; the first register including a first exponent bit and a first mantissa bit, respectively storing a first exponent and a first mantissa; the second register including a second exponent bit and a second mantissa bit, respectively storing a second exponent and a second mantissa; the arithmetic unit being characterized by performing the following steps when processing a multiplication between the first register and the second register:
    comparing the first exponent with an exponent threshold, wherein when the first exponent is not less than the exponent threshold, the first mantissa is multiplied by the second mantissa to generate a mantissa operation result, and when the first exponent is less than the exponent threshold, at least one bit of the first mantissa is discarded before it is multiplied by the second mantissa to generate the mantissa operation result;
    adding the first exponent and the second exponent to generate an exponent operation result; and
    generating a calculated floating-point number according to the mantissa operation result and the exponent operation result.
  14. The arithmetic unit of claim 13, wherein the first register further includes a first sign bit storing a first sign, and the second register further includes a second sign bit storing a second sign, the arithmetic unit further performing the following steps:
    performing an exclusive-OR operation on the first sign and the second sign to generate a sign operation result; and
    generating a calculated floating-point number according to the mantissa operation result, the sign operation result and the exponent operation result.
  15. The arithmetic unit of claim 13, wherein the exponent threshold is stored in a third register, and the arithmetic unit accesses the third register when performing the multiplication between the first register and the second register.
  16. The arithmetic unit of claim 13, wherein when the first exponent is less than the exponent threshold, at least one bit of the first mantissa is only temporarily stored and is not involved in the operation.
  17. The arithmetic unit of claim 16, wherein the exponent threshold is dynamically adjustable.
  18. The arithmetic unit of claim 17, wherein the exponent threshold is dynamically adjusted according to the temperature of the arithmetic unit and/or the type of task the arithmetic unit is processing.
  19. The arithmetic unit of claim 16, wherein the exponent threshold lies within a dynamically adjustable range; the arithmetic unit starts training with an exponent threshold of 1, judges whether the operation accuracy is higher than an accuracy threshold condition and, if the condition is met, increments the exponent threshold until the operation accuracy is no longer higher than the accuracy threshold, the dynamically adjustable range consisting of the exponent threshold values that met the condition.
  20. The arithmetic unit of claim 13, wherein when the first exponent is less than the exponent threshold, the first floating-point number is decoded as (-1)^Sign1 × 2^Exponent1, where Sign1 represents the first sign and Exponent1 represents the first exponent.
  21. The arithmetic unit of claim 20, wherein when the second exponent is less than the exponent threshold, the second floating-point number is decoded as (-1)^Sign2 × 2^Exponent2, where Sign2 represents the second sign and Exponent2 represents the second exponent.
  22. The arithmetic unit of claim 13, wherein the first register and the second register are coupled to a memory, the memory storing multiple sets of batch normalization coefficients respectively corresponding to multiple candidate thresholds, the exponent threshold being selected from one of the multiple candidate thresholds.
  23. A computing device, comprising a first register, a second register and an arithmetic unit, the arithmetic unit being coupled to the first register and the second register, the first register storing a first floating-point number and the second register storing a second floating-point number; the first register including a first exponent bit and a first mantissa bit, respectively storing a first sign, a first exponent and a first mantissa; the second register including a second exponent bit and a second mantissa bit, respectively storing a second sign, a second exponent and a second mantissa; wherein, when processing a multiplication between the first register and the second register, the arithmetic unit performs the following steps:
    comparing the first exponent with an exponent threshold, wherein when the first exponent is not less than the exponent threshold, the first mantissa is multiplied by the second mantissa to generate a mantissa operation result, and when the first exponent is less than the exponent threshold, at least one bit of the first mantissa is discarded before it is multiplied by the second mantissa to generate the mantissa operation result;
    adding the first exponent and the second exponent to generate an exponent operation result; and
    generating a calculated floating-point number according to the mantissa operation result and the exponent operation result.
PCT/CN2023/074108 2022-02-02 2023-02-01 Floating-point operation method and related arithmetic unit WO2023147770A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US202263305711P 2022-02-02 2022-02-02
US63/305,711 2022-02-02

Publications (1)

Publication Number Publication Date
WO2023147770A1 (zh)

Family

ID=87553164

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/074108 WO2023147770A1 (zh) 2022-02-02 2023-02-01 一种浮点数运算方法以及相关的算术单元

Country Status (2)

Country Link
TW (1) TW202333043A (zh)
WO (1) WO2023147770A1 (zh)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130007076A1 (en) * 2011-06-30 2013-01-03 Samplify Systems, Inc. Computationally efficient compression of floating-point data
US9417839B1 (en) * 2014-11-07 2016-08-16 The United States Of America As Represented By The Secretary Of The Navy Floating point multiply-add-substract implementation
WO2021107995A1 (en) * 2020-05-30 2021-06-03 Futurewei Technologies, Inc. Single-cycle kulisch accumulator
CN112988110A (zh) * 2019-12-17 2021-06-18 深圳市中兴微电子技术有限公司 一种浮点处理装置和数据处理方法
CN113282273A (zh) * 2020-02-19 2021-08-20 脸谱公司 用于多种格式的浮点运算的硬件


Also Published As

Publication number Publication date
TW202333043A (zh) 2023-08-16

Similar Documents

Publication Publication Date Title
US20240028905A1 (en) Artificial neural network training using flexible floating point tensors
JP7244186B2 (ja) 改良された低精度の2進浮動小数点形式設定
US9268528B2 (en) System and method for dynamically reducing power consumption of floating-point logic
US10146680B2 (en) Data processing system and method of operating the same
Melchert et al. SAADI-EC: A quality-configurable approximate divider for energy efficiency
KR20080055985A (ko) 선택가능 준정밀도를 가진 부동―소수점 프로세서
WO2022052625A1 (zh) 一种定点与浮点转换器、处理器、方法以及存储介质
US20210116955A1 (en) Dynamic power monitor monitoring power basted on clock cycle, processor, and system on chip
WO2023147770A1 (zh) 一种浮点数运算方法以及相关的算术单元
JPH05216620A (ja) 浮動小数点を正規化する方法及び回路
KR20060051572A (ko) 임의 정밀도 연산기, 임의 정밀도 연산 방법, 및 전자 기기
CN114818585A (zh) 一种适用于多电平通信的io电路及其控制方法
CN114840734A (zh) 多模态表示模型的训练方法、跨模态检索方法及装置
CN114139693A (zh) 神经网络模型的数据处理方法、介质和电子设备
WO2017185203A1 (zh) 一种用于执行多个浮点数相加的装置及方法
US20230305803A1 (en) Method for Processing Floating Point Number and Related Device
US20230273768A1 (en) Floating-point calculation method and associated arithmetic unit
US20040059769A1 (en) Methods and apparatus for predicting an underflow condition associated with a floating-point multiply-add operation
TWI837000B (zh) 一種浮點數壓縮方法、運算裝置及電腦可讀取儲存媒介
WO2023227064A1 (zh) 一种浮点数压缩方法、运算装置及计算器可读取存储媒介
US20230106651A1 (en) Systems and methods for accelerating the computation of the exponential function
US7272623B2 (en) Methods and apparatus for determining a floating-point exponent associated with an underflow condition or an overflow condition
US20090172054A1 (en) Efficient leading zero anticipator
US9335967B1 (en) Accurate floating-point calculation method and device
CN111313890A (zh) 一种高性能近似全加器门级单元

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23749349

Country of ref document: EP

Kind code of ref document: A1