WO2023227064A1 - Floating-point number compression method, computing device and computer-readable storage medium - Google Patents

Floating-point number compression method, computing device and computer-readable storage medium

Info

Publication number
WO2023227064A1
Authority
WO
WIPO (PCT)
Application number
PCT/CN2023/096302
Other languages
English (en)
French (fr)
Other versions
WO2023227064A9 (zh)
Inventor
罗允辰
吕仁硕
Original Assignee
吕仁硕
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 吕仁硕
Publication of WO2023227064A1
Publication of WO2023227064A9

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/483Computations with numbers represented by a non-linear combination of denominational numbers, e.g. rational numbers, logarithmic number system or floating-point numbers

Definitions

  • the invention relates to applications of floating-point arithmetic, and in particular to a floating-point arithmetic method and related arithmetic units.
  • MSFP (Microsoft Floating Point)
  • the method includes forcibly compressing the multiple exponents of multiple floating-point numbers into only a single exponent to simplify the overall computation.
  • the compression error is too large, which leads to a significant decrease in computational accuracy; the field of machine learning (such as neural-network algorithms) has definite accuracy requirements, so the approach is not ideal for practical applications.
  • one of the purposes of the present invention is to provide an efficient floating-point compression (also known as encoding) and arithmetic method, so as to remedy the defects of floating-point arithmetic in the prior art without significantly increasing cost, thereby speeding up the instruction cycle and reducing power consumption.
  • An embodiment of the present invention provides a floating-point number compression method, which includes using an arithmetic unit to perform the following steps: A) obtain b floating-point numbers f1~fb, where b is a positive integer greater than 1; B) generate k common scaling factors r1~rk for the b floating-point numbers, where k is 1 or a positive integer greater than 1, and the k common scaling factors r1~rk include at least one floating-point number having a mantissa; C) compress each floating-point number fi of the b floating-point numbers into k fixed-point mantissas mi_1~mi_k, producing b×k fixed-point mantissas mi_j in total, where i is a positive integer not greater than b and j is a positive integer not greater than k; and D) output a compression result.
  • the compression result includes the k common scaling factors r1~rk and the b×k fixed-point mantissas mi_j, representing b compressed floating-point numbers cf1~cfb.
  • the value of each compressed floating-point number cfi is: cfi = mi_1×r1 + mi_2×r2 + … + mi_k×rk.
  • before performing step D), the computing device further performs the following steps: generating a quasi-compression result, the quasi-compression result including the k common scaling factors r1~rk and the b×k fixed-point mantissas mi_j; calculating a compression error for the quasi-compression result; setting a threshold; and adjusting the k common scaling factors r1~rk and the b×k fixed-point mantissas mi_j according to the compression error and the threshold.
  • the step of calculating the compression error for the quasi-compression result includes: calculating the compression error Ei for each floating-point number fi of the b floating-point numbers according to Ei = fi − (mi_1×r1 + … + mi_k×rk); calculating the sum of squares SE of the b errors E1~Eb according to SE = E1² + E2² + … + Eb²; and comparing the sum of squares with a threshold. If the sum of squares is not greater than the threshold, the quasi-compression result is used as the compression result.
  • if the compression error is greater than the threshold, steps B) and C) are re-executed.
  • the step of adjusting the k common scaling factors r1~rk and the b×k fixed-point mantissas mi_j is: iteratively processing the k common scaling factors r1~rk and the b×k fixed-point mantissas mi_j by one of a heuristic algorithm, a randomized algorithm, or a brute-force method.
  • the step of setting the threshold includes: generating common scaling factors r1'~rk' for the b floating-point numbers; compressing each floating-point number fi of the b floating-point numbers into k fixed-point mantissas mi_1'~mi_k' to produce b×k fixed-point mantissas mi_j'; calculating the compression error Ei' for each floating-point number fi of the b floating-point numbers according to Ei' = fi − (mi_1'×r1' + … + mi_k'×rk'); calculating the sum of squares SE' of the b errors E1'~Eb' according to SE' = E1'² + … + Eb'²; and setting the threshold to the compression error SE'.
  • the b×k fixed-point mantissas mi_j are all signed numbers.
  • at least one of the b×k fixed-point mantissas mi_j is a signed number, and the numerical range expressed by the signed number is asymmetric with respect to 0.
  • the signed number is a 2's complement number.
  • the floating-point number compression method further includes: storing the b×k fixed-point mantissas mi_j and the k common scaling factors in a memory of a network server, for use in remote download and computation.
  • the floating-point number compression method further includes: storing the b×k fixed-point mantissas mi_j and all of the common scaling factors r1~rk in a memory, while part of the b×k fixed-point mantissas mi_j and part of the common scaling factors r1~rk do not participate in the computation.
  • k is equal to 2, and the common scaling factors r1~rk are all floating-point numbers of no more than 16 bits.
  • step D) includes: calculating a quasi-compression result, the quasi-compression result including the k common scaling factors r1~rk and the b×k fixed-point mantissas mi_j; calculating a compression error for the quasi-compression result; setting a threshold; and adjusting the quasi-compression result according to the compression error and the threshold to serve as the compression result.
  • An embodiment of the present invention provides a computing device, including a first register, a second register and an arithmetic unit.
  • the arithmetic unit includes at least one multiplier and at least one adder.
  • the arithmetic unit is coupled to the first register and the second register, wherein: the first register stores b activation values a1~ab, where b is a positive integer greater than 1; the second register stores b compressed floating-point numbers cf1~cfb; the b compressed floating-point numbers include k common scaling factors r1~rk, where k is 1 or a positive integer greater than 1; each compressed floating-point number cfi of the b compressed floating-point numbers contains k fixed-point mantissas mi_1~mi_k, for a total of b×k fixed-point mantissas mi_j, where i is a positive integer not greater than b and j is a positive integer not greater than k; and the arithmetic unit computes a dot-product result of the b activation values (a1, a2, …, ab) and the b compressed floating-point numbers (cf1, cf2, …, cfb).
  • the computing device performs the following steps: A) obtain b floating-point numbers f1~fb, where b is a positive integer greater than 1; B) generate k common scaling factors r1~rk for the b floating-point numbers, where k is a positive integer greater than 1, and the k common scaling factors r1~rk include at least one floating-point number having a mantissa; C) compress each floating-point number fi of the b floating-point numbers into k fixed-point mantissas mi_1~mi_k to produce b×k fixed-point mantissas mi_j, where i is a positive integer not greater than b and j is a positive integer not greater than k; and D) output a compression result, the compression result including the k common scaling factors r1~rk and the b×k fixed-point mantissas mi_j, representing b compressed floating-point numbers cf1~cfb.
  • the value of each compressed floating-point number cfi is cfi = mi_1×r1 + … + mi_k×rk.
  • before performing step D), the computing device further performs the following steps: calculating a quasi-compression result, the quasi-compression result including the k common scaling factors r1~rk and the b×k fixed-point mantissas mi_j; calculating a compression error for the quasi-compression result; setting a threshold; and adjusting the k common scaling factors r1~rk and the b×k fixed-point mantissas mi_j according to the compression error and the threshold.
  • the step of calculating the compression error for the quasi-compression result includes: calculating the compression error Ei for each floating-point number fi of the b floating-point numbers according to Ei = fi − (mi_1×r1 + … + mi_k×rk); calculating the sum of squares SE of the b errors E1~Eb according to SE = E1² + … + Eb²; and comparing the sum of squares with a threshold; if the sum of squares is not greater than the threshold, the quasi-compression result is used as the compression result.
  • if the compression error is greater than the threshold, steps B) and C) are re-executed.
  • the step of adjusting the k common scaling factors r1~rk and the b×k fixed-point mantissas mi_j is: iteratively processing the k common scaling factors r1~rk and the b×k fixed-point mantissas mi_j by one of a heuristic algorithm, a randomized algorithm, or a brute-force method.
  • the step of setting the threshold includes: generating common scaling factors r1'~rk' for the b floating-point numbers; compressing each floating-point number fi of the b floating-point numbers into k fixed-point mantissas mi_1'~mi_k' to produce b×k fixed-point mantissas mi_j'; calculating the compression error Ei' for each floating-point number fi of the b floating-point numbers according to Ei' = fi − (mi_1'×r1' + … + mi_k'×rk'); calculating the sum of squares SE' of the b errors E1'~Eb' according to SE' = E1'² + … + Eb'²; and setting the threshold to the compression error SE'.
  • the b activation values a1~ab are integers, fixed-point numbers, or mantissas of MSFP block floating-point numbers.
  • in the computing device, all of the b×k fixed-point mantissas mi_j and all of the common scaling factors r1~rk are stored in the second register, but part of the b×k fixed-point mantissas mi_j and part of the common scaling factors r1~rk do not participate in the computation.
  • An embodiment of the present invention provides a computer-readable storage medium that stores computer-readable instructions executable by a computer.
  • when executed by a computer, the computer-readable instructions trigger the computer to perform a procedure that outputs b compressed floating-point numbers, where b is a positive integer greater than 1; the procedure includes the following steps: A) generate k common scaling factors r1~rk, where k is 1 or a positive integer greater than 1, and the k common scaling factors r1~rk include at least one floating-point number having a mantissa, with a scaling-factor exponent and a scaling-factor mantissa; B) generate k fixed-point mantissas mi_1~mi_k per floating-point number, producing b×k fixed-point mantissas mi_j, where i is a positive integer not greater than b and j is a positive integer not greater than k; and C) output the k common scaling factors r1~rk and the b×k fixed-point mantissas mi_j, representing b compressed floating-point numbers cf1~cfb.
  • the block floating-point compression of the present invention can save storage space, reduce power consumption and speed up computation while meeting the accuracy requirements of the application.
  • the electronic products it is paired with can flexibly trade off between a high-performance mode and a low-power mode, giving it wider applicability in products.
  • the floating-point number compression method of the present invention provides optimized computing performance and accuracy, and can therefore save power and speed up computation while meeting the accuracy requirements of applications.
  • Figure 1 is a schematic diagram of floating point numbers in the prior art.
  • FIG. 2 is a schematic diagram of an arithmetic unit applied to an arithmetic device according to an embodiment of the present invention.
  • FIG. 3 is a schematic diagram of compression processing performed by MSFP in the prior art.
  • FIG. 4 is a schematic diagram of compression processing by an arithmetic unit according to an embodiment of the present invention.
  • FIG. 5 is a schematic diagram of compression processing by an arithmetic unit according to another embodiment of the present invention.
  • FIG. 6 is a schematic diagram of the present invention using an arithmetic unit and a register to perform floating-point multiplication of weight values and excitation values.
  • Figure 7 is a flow chart of a floating point number compression method according to an embodiment of the present invention.
  • Figure 8 illustrates the difference between the method of the present invention and the method of MSFP.
  • the words “substantially”, “around”, “about” or “approximately” as used herein shall generally mean within 20% of a given value or range, preferably within 10%.
  • the quantities provided herein may be approximate and may thus be expressed by the words “about”, “around” or “approximately” unless expressly stated otherwise.
  • when a quantity, concentration, or other value or parameter has a specified range, a preferred range, or a list of upper and lower ideal values, this shall be deemed to specifically disclose all ranges formed from any pair of upper and lower limits or ideal values, whether or not those ranges are separately disclosed. For example, if a disclosed range of a length is X centimeters to Y centimeters, it shall be deemed to disclose a length of H centimeters where H can be any real number between X and Y.
  • “electrical coupling” or “electrical connection” as used herein includes any direct and indirect means of electrical connection.
  • if a first device is described as electrically coupled to a second device,
  • the first device can be directly connected to the second device, or indirectly connected to the second device through other devices or connection means.
  • where the transmission or provision of electrical signals is described, those familiar with this art should understand that the transmission of an electrical signal may be accompanied by attenuation or other non-ideal changes; but unless the source and receiving end of the transmitted or provided electrical signal are specifically noted otherwise, they should be regarded as essentially the same signal.
  • when an electrical signal S is transmitted (or provided) from terminal A of an electronic circuit to terminal B of the electronic circuit, a voltage drop may occur across the source and drain terminals of a transistor switch and/or possible stray capacitance.
  • the electrical signal S at terminal A and terminal B of the electronic circuit should nevertheless be regarded as substantially the same signal.
  • Neural-network algorithms involve a large number of floating-point multiplications of weight values (Weight) and activation values (Activation). Therefore, it is very important to compress floating-point numbers as well as possible while meeting accuracy requirements.
  • Figure 1 is a schematic diagram of a floating point number operation method in the prior art.
  • the weight values form an array (or vector) containing 16 words, which can be represented by the floating-point numbers on the right.
  • each floating-point number is divided into a sign (Sign), an exponent (Exponent) and a mantissa (Mantissa), stored in three different register fields, and decoded as: (−1)^Sign × (1.Mantissa) × 2^Exponent
  • Sign represents the sign of this floating-point number
  • Exponent represents the exponent of this floating-point number
  • the mantissa is also called the significand (Significand).
  • the leftmost bit of the register is allocated as a sign bit to store the sign, and the remaining bits (such as 15 to 18 bits) are allocated as exponent bits and mantissa bits to store the exponent and mantissa respectively.
  • the prior-art approach treats each word independently as one floating-point number for computation and storage; therefore the register must store 16 to 19 bits for each word. This is not only computationally time-consuming but also involves more hardware circuitry, reducing product performance and increasing cost and power consumption.
  • the bit counts of the architectures mentioned throughout the text and figures are only for ease of understanding and are not intended to limit the scope of the present invention; in practice, the mentioned bit counts can be increased or decreased according to design requirements.
  • FIG. 2 is a schematic diagram of the arithmetic unit 110 applied to the computing device 100 according to an embodiment of the present invention.
  • the computing device 100 includes an arithmetic unit 110 , a first register 111 , a second register 112 , a third register 113 and a memory 114 .
  • the arithmetic unit 110 is coupled to the first register 111 and the second cache.
  • the memory 114 is coupled to the first register 111 , the second register 112 and the third register 113 .
  • the memory 114 is only a general term for the storage units in the computing device 100; that is, the memory 114 can be an independent storage unit, or refer generally to all possible storage units in the computing device 100. For example, the first register 111, the second register 112 and the third register 113 may each be coupled to different memories.
  • the memory mentioned is only one of various available storage media, and those of ordinary skill in the art will understand that other types of storage media can be used to replace the memory.
  • the computing device 100 can be any device with computing capability, such as a central processing unit (CPU), a graphics processing unit (GPU), an artificial-intelligence accelerator (AI accelerator), a programmable logic array (FPGA), a desktop computer, a notebook computer, a smartphone, a tablet computer, a smart wearable device, and so on.
  • the present invention can ignore the mantissas of the floating point numbers stored in the first register 111 and the second register 112 and not store them in the memory 114, thereby saving memory space.
  • the memory 114 can store computer-readable instructions executable by the computing device 100.
  • when the computer-readable instructions are executed by the computing device 100, the computing device 100 (including the arithmetic unit 110, the first register 111, the second register 112 and the third register 113) performs a floating-point compression method.
  • the memory 114 can also store multiple sets of batch-normalization coefficients (Batch Normalization Coefficient).
  • a batch-normalization coefficient is a coefficient used in artificial-intelligence computation to adjust the mean and standard deviation of values.
  • typically, one piece of feature-map numeric data corresponds to one specific set of batch-normalization coefficients.
  • FIG. 3 is a schematic diagram of MSFP's compression processing.
  • MSFP compresses 16 floating-point numbers as one “block”, extracting a common exponent part for the 16 floating-point numbers (marked as an 8-bit shared exponent in the figure); after extraction, only the sign part and the mantissa part of these floating-point numbers remain.
  • FIG. 4 is a schematic diagram of the arithmetic unit 110 compressing floating-point numbers according to an embodiment of the present invention.
  • in Figure 4, each floating-point number is compressed into two 2-bit two's-complement (2's complement) fixed-point mantissas m1 and m2; two 7-bit floating-point numbers, namely the scales (Scale) r1 and r2, also called scaling factors, are then extracted for each block. Integer operations among m1, m2, r1 and r2 are then performed for each floating-point number so that “m1×r1+m2×r2” has the minimum mean-squared error with respect to that floating-point number.
  • the fixed-point mantissas m1 and m2 can be signed integers (integers with a sign) or unsigned integers (integers without a sign).
  • the present invention does not limit the number of m and r.
  • the arithmetic unit 110 performs the following steps: obtain b floating-point numbers f1~fb, where b is a positive integer greater than 1; generate common scaling factors r1~rk for the b floating-point numbers, where k is a positive integer greater than 1; and compress each floating-point number fi of the b floating-point numbers into k fixed-point mantissas mi_1~mi_k to produce b×k fixed-point mantissas mi_j.
  • FIG. 5 is a schematic diagram of the compression process of the arithmetic unit 110 according to another embodiment of the present invention.
  • the memory 114 of the computing device 100 can store two sets of batch-normalization coefficients, corresponding to two floating-point compression modes.
  • the first mode is the complete computation shown in Figure 4, and the second mode deliberately ignores the (m2×r2) term to reduce computational complexity.
  • the arithmetic unit 110 can determine whether to select the first mode or the second mode according to the current status of the computing device 100 (for example, whether there is overheating or overloading), or can make the selection according to the accuracy requirements of the current application program. For example, when the current temperature of the computing device 100 is too high and needs to be cooled down, the second mode can be selected so that the arithmetic unit 110 can operate in a low-power, low-temperature state.
  • when the computing device 100 is a mobile device with low battery, the second mode can also be selected to extend the standby time of the mobile device.
  • if precision computation is required, the first mode can be selected to further improve the computational accuracy.
  • Figure 6 is a schematic diagram of the present invention using registers and arithmetic units to perform floating-point dot product multiplication operations of weight values (Weight) and activation values (Activation).
  • the first register, the second register and the third register may correspond respectively to the first register 111, the second register 112 and the third register 113 in FIG. 2, and the multipliers and adders correspond to the arithmetic unit 110 in FIG. 2.
  • the second register stores the above-mentioned common scaling factors r1 and r2 together with the two's-complement fixed-point mantissas m1_1, m1_2, and so on, corresponding to each floating-point number, each of which is 2 bits.
  • the first register stores the activation values a1, …, a14, a15, a16.
  • a1 is multiplied by m1_1 and m1_2 respectively, and a2 is multiplied by m2_1 and m2_2 respectively.
  • likewise, a16 is multiplied by m16_1 and m16_2 respectively; these products are summed by adders 601 and 602, then processed by multipliers 611 and 612 and adder 603, with adder 603 outputting the dot-product result.
  • the present invention can simplify the hardware architecture, so it can save power consumption and time of data storage and data transmission.
  • the present invention can check the compression error before generating the compression result, for example, generate a quasi-compression result, the quasi-compression result includes k common magnification factors r1 ⁇ rk and b ⁇ k fixed-point number mantissas mi_j. Then a compression error is calculated for the quasi-compression result, and a threshold is set. Finally, the quasi-compression result is adjusted according to the compression error and the threshold to serve as the compression result.
  • the compression error Ei can be calculated for each floating-point number fi of the b floating-point numbers according to Ei = fi − (mi_1×r1 + … + mi_k×rk), and the sum of squares SE = E1² + … + Eb² is then compared with the threshold: if SE is not greater than the threshold, the quasi-compression result is output as the compression result; otherwise the quasi-compression result is regenerated, for example by iterative processing.
  • the iterative processing can be a heuristic algorithm (Heuristic algorithm), a randomized algorithm (Randomized algorithm), or a brute-force method (Brute-force algorithm). Heuristic algorithms include evolutionary algorithms and simulated annealing algorithms. For example, if an evolutionary algorithm is used, one bit of the common scaling factors r1 and r2 can be changed (mutation).
  • if a simulated annealing algorithm is used, the common scaling factors r1 and r2 can each be increased or decreased by a small value d, producing the four iterated candidate pairs r1+d, r2+d; r1+d, r2−d; r1−d, r2+d; and r1−d, r2−d.
  • if a randomized algorithm is used, for example, a random-number function can be used to generate new common scaling factors r1', r2'.
  • if the brute-force method is used, for example, and r1 and r2 are each 7 bits, then there are 2^14 combinations of r1 and r2 in total, all of which are traversed once.
  • although evolutionary algorithms and simulated annealing algorithms are almost the most general and common heuristic algorithms, there are others, such as the bee colony algorithm (Bee colony algorithm), the ant colony algorithm (Ant colony algorithm), the whale optimization algorithm (Whale optimization algorithm), and so on.
  • besides mutation, evolutionary algorithms also have selection operations and crossover operations, which are not described in detail for brevity. This will be understood by those of ordinary skill in the art, and other types of algorithms can be substituted.
  • the present invention is not limited to any particular method of generating the threshold.
  • besides an absolute threshold, one method is a relative threshold, which can be summarized in the following steps: generate common scaling factors r1'~rk' for the b floating-point numbers; compress each floating-point number fi of the b floating-point numbers into k fixed-point mantissas mi_1'~mi_k' to produce b×k fixed-point mantissas mi_j'; compute the compression error Ei' = fi − (mi_1'×r1' + … + mi_k'×rk') for each floating-point number fi of the b floating-point numbers; compute the sum of squares SE' = E1'² + … + Eb'²; and set the threshold to SE'.
  • this method of generating a threshold can be combined with the aforementioned heuristic algorithms (evolutionary algorithms, simulated annealing algorithms, etc.), random algorithms, exhaustive methods, etc.
  • the step of generating the common scaling factors r1~rk for the b floating-point numbers includes: extracting a common sign for the b floating-point numbers so that the b×k fixed-point mantissas mi_j are unsigned; or not extracting the sign when generating the common scaling factors r1~rk, so that the b×k fixed-point mantissas mi_j are signed.
  • the b×k fixed-point mantissas mi_j may be 2's complement numbers, or may not be 2's complement numbers.
  • the floating-point number compression method further includes: storing part of the b×k fixed-point mantissas mi_j and part of the common scaling factors r1~rk in a register for use in subsequent computation; that is, part of the fixed-point mantissas and/or common scaling factors are discarded, which can further speed up device computation and reduce device power consumption.
  • the floating-point number compression method further includes: storing all of the b×k fixed-point mantissas mi_j and all of the common scaling factors r1~rk in the register, while part of the b×k fixed-point mantissas mi_j and part of the common scaling factors r1~rk do not participate in the computation; that is, not all stored common scaling factors participate in the computation, which can further speed up device computation and reduce device power consumption.
  • FIG. 7 is a flow chart of a floating-point number compression method according to an embodiment of the present invention. Please note that these steps do not necessarily need to be performed in the order shown in Figure 7 if substantially the same result can be obtained.
  • the floating-point number operation method shown in Figure 7 can be adopted by the operation device 100 or the arithmetic unit 110 shown in Figure 2, and can be simply summarized into the following steps:
  • Step S702: Obtain b floating-point numbers f1~fb;
  • Step S704: Generate common scaling factors r1~rk for the b floating-point numbers;
  • Step S706: Compress each floating-point number fi of the b floating-point numbers into k fixed-point mantissas mi_1~mi_k to produce b×k fixed-point mantissas mi_j;
  • Step S708: Output a compression result, which includes the k common scaling factors r1~rk and the b×k fixed-point mantissas mi_j.
  • the present invention proposes a novel floating-point compression method with optimized computational efficiency that provides the advantage of non-uniform quantization, in which the present invention uses the sum of two subword vectors (subword vector) carrying two scales to approximate (approximate) each full-precision weight vector (i.e., the uncompressed floating-point numbers). More specifically, each subword is a low-bit (e.g., 2-bit), signed (2's complement) integer, and each scale is a low-bitwidth floating-point number (LBFP) (e.g., 7-bit).
  • One embodiment of the present invention uses two scales (i.e., r1, r2), and each floating-point number is compressed into two fixed-point mantissas (i.e., m1, m2), where the computation cost of the scales is amortized over 16 weights, and each scale is a low-bitwidth floating-point number (LBFP) involving only low-bit operations.
  • Figure 8 illustrates the difference between the method of the present invention and the MSFP algorithm by comparing the results of compressing a weight vector with the floating-point compression method of the present invention versus the MSFP compression method. It is clear from the figure that the present invention achieves a smaller quantization error with fewer quantization levels (quantization level) than MSFP does with more quantization levels. The advantages of the present invention over MSFP are listed further below.
  • the floating-point number compression method of the present invention uses 2's complement without wasting quantization levels.
  • MSFP uses sign-magnitude (sign magnitude), which wastes an additional quantization level (both positive 0 and negative 0 are 0, so one of them is wasted; for example, 2 bits can only represent the three values −1, 0, 1 rather than 2² = 4 values). When the bit count is low, the impact of wasting one quantization level is significant.
  • the floating-point number compression method of the present invention exploits the property that 2's complement is asymmetric about 0 (for example, the range of a 2-bit 2's complement number is −2, −1, 0, 1), together with the scales, to adapt to the asymmetric weight distribution of the weight vector.
  • MSFP uses sign-magnitude, whose range is symmetric about 0 (for example, 2-bit sign-magnitude represents −1, 0, 1, symmetric about 0), so MSFP's quantization levels are fixed to be symmetric, requiring additional quantization levels to adapt to an asymmetric weight distribution. As shown in Figure 8, where MSFP must use 15 quantization levels (4 bits), the present invention uses only 8 quantization levels (3 bits).
  • the floating-point number compression method of the present invention can provide non-uniform quantization levels by combining the two scales (r1, r2). In comparison, MSFP can only provide uniform quantization levels. In other words, the floating-point number compression method of the present invention is more flexible for compressing non-uniformly distributed weights.
  • the quantization step size (step size) of the floating-point number compression method of the present invention is defined by the two scales (r1, r2), which are low-bitwidth (low bitwidth) floating-point values.
  • the quantization step size of MSFP can only take power-of-two values, such as 0.5, 0.25 or 0.125.
  • the following table shows experimental data comparing the present invention and MSFP on a neural-network image-classification task. Both compress in blocks of 16 floating-point numbers. By comparison, the present invention requires fewer bits per 16 floating-point numbers and achieves higher classification accuracy.
  • in a preferred embodiment, the bit counts of the fixed-point mantissas m1 and m2 of the present invention can be as shown in the following table, but are not limited to it.
  • in a preferred embodiment, the bit counts of the common scales r1 and r2 of the present invention can be as shown in the following table, but are not limited to it.
  • the block floating-point compression performed by the present invention can save power consumption and speed up the instruction cycle while meeting the accuracy requirements of the application.
  • the electronic products it is paired with can flexibly trade off between a high-performance mode and a low-power mode, giving it wider applicability in products.
  • the floating-point number compression method of the present invention provides optimized computing performance and accuracy, and can therefore save power and speed up computation while meeting the accuracy requirements of applications.

Abstract

A floating-point number compression method, comprising using an arithmetic unit to perform the following steps: obtaining a plurality of floating-point numbers; generating common scaling factors for the floating-point numbers; compressing each of the floating-point numbers into a plurality of fixed-point mantissas; and outputting a compression result, the compression result comprising the common scaling factors and the fixed-point mantissas.

Description

Floating-point number compression method, computing device and computer-readable storage medium
Technical Field
The present invention relates to applications of floating-point arithmetic, and in particular to a floating-point arithmetic method and a related arithmetic unit.
Background
With the enormous volume of floating-point computation brought by the ever-widening field of machine learning (Machine Learning), how to compress floating-point data to speed up the instruction cycle and reduce power consumption has become a topic of intense study in this field. Conventional floating-point techniques store and operate on each of multiple floating-point numbers individually and in full; that is, the sign, exponent and mantissa of every floating-point number are stored completely. This not only consumes storage space by keeping a large amount of data, but also increases transmission time and computational power consumption.
Microsoft proposed a floating-point compression method commonly known as MSFP (Microsoft Floating Point), which forcibly compresses the multiple exponents of multiple floating-point numbers so that only a single exponent is kept, simplifying the overall computation; but the compression error is so large that computational accuracy drops considerably, and the machine-learning field (such as neural-network algorithms) has a certain level of accuracy requirements, so it is not ideal in practical applications.
In summary, a novel floating-point arithmetic method and hardware architecture are genuinely needed to improve on the problems of the prior art.
Summary of the Invention
In view of the above needs, one of the purposes of the present invention is to provide an efficient floating-point compression (also called encoding) and arithmetic method, so as to remedy the defects of floating-point arithmetic in the prior art without significantly increasing cost, thereby speeding up the instruction cycle and reducing power consumption.
An embodiment of the present invention provides a floating-point number compression method, comprising using an arithmetic unit to perform the following steps: A) obtain b floating-point numbers f1~fb, where b is a positive integer greater than 1; B) generate k common scaling factors r1~rk for the b floating-point numbers, where k is 1 or a positive integer greater than 1, and the k common scaling factors r1~rk include at least one floating-point number having a mantissa; C) compress each floating-point number fi of the b floating-point numbers into k fixed-point mantissas mi_1~mi_k, producing b×k fixed-point mantissas mi_j in total, where i is a positive integer not greater than b and j is a positive integer not greater than k; and D) output a compression result, the compression result including the k common scaling factors r1~rk and the b×k fixed-point mantissas mi_j, representing b compressed floating-point numbers cf1~cfb, where the value of each compressed floating-point number cfi is: cfi = mi_1×r1 + mi_2×r2 + … + mi_k×rk.
Optionally, according to an embodiment of the present invention, before performing step D), the computing device further performs the following steps: generating a quasi-compression result, the quasi-compression result including the k common scaling factors r1~rk and the b×k fixed-point mantissas mi_j; calculating a compression error for the quasi-compression result; setting a threshold; and adjusting the k common scaling factors r1~rk and the b×k fixed-point mantissas mi_j according to the compression error and the threshold.
Optionally, according to an embodiment of the present invention, the step of calculating the compression error for the quasi-compression result includes: calculating a compression error Ei for each floating-point number fi of the b floating-point numbers according to Ei = fi − (mi_1×r1 + … + mi_k×rk); calculating the sum of squares SE of the b errors E1~Eb according to SE = E1² + E2² + … + Eb²; and comparing the sum of squares with a threshold. If the sum of squares is not greater than the threshold, the quasi-compression result is used as the compression result.
Optionally, according to an embodiment of the present invention, if the compression error is greater than the threshold, steps B) and C) are re-executed.
Optionally, according to an embodiment of the present invention, the step of adjusting the k common scaling factors r1~rk and the b×k fixed-point mantissas mi_j is: iteratively processing the k common scaling factors r1~rk and the b×k fixed-point mantissas mi_j by one of a heuristic algorithm, a randomized algorithm, or a brute-force method.
Optionally, according to an embodiment of the present invention, the step of setting the threshold includes: generating common scaling factors r1'~rk' for the b floating-point numbers; compressing each floating-point number fi of the b floating-point numbers into k fixed-point mantissas mi_1'~mi_k' to produce b×k fixed-point mantissas mi_j'; calculating the compression error Ei' for each floating-point number fi of the b floating-point numbers according to Ei' = fi − (mi_1'×r1' + … + mi_k'×rk'); calculating the sum of squares SE' of the b errors E1'~Eb' according to SE' = E1'² + … + Eb'²; and setting the threshold to the compression error SE'.
Optionally, according to an embodiment of the present invention, the b×k fixed-point mantissas mi_j are all signed numbers.
Optionally, according to an embodiment of the present invention, at least one of the b×k fixed-point mantissas mi_j is a signed number, and the numerical range expressed by the signed number is asymmetric with respect to 0.
Optionally, according to an embodiment of the present invention, the signed number is a 2's complement number.
Optionally, according to an embodiment of the present invention, the floating-point number compression method further includes: storing the b×k fixed-point mantissas mi_j and the k common scaling factors in a memory of a network server for use in remote download and computation.
Optionally, according to an embodiment of the present invention, the floating-point number compression method further includes: storing the b×k fixed-point mantissas mi_j and all of the common scaling factors r1~rk in a memory, while part of the b×k fixed-point mantissas mi_j and part of the common scaling factors r1~rk do not participate in the computation. Optionally, according to an embodiment of the present invention, k is equal to 2 and the common scaling factors r1~rk are all floating-point numbers of no more than 16 bits.
Optionally, according to an embodiment of the present invention, step D) includes: calculating a quasi-compression result, the quasi-compression result including the k common scaling factors r1~rk and the b×k fixed-point mantissas mi_j; calculating a compression error for the quasi-compression result; setting a threshold; and adjusting the quasi-compression result according to the compression error and the threshold to serve as the compression result.
An embodiment of the present invention provides a computing device, comprising a first register, a second register and an arithmetic unit, the arithmetic unit comprising at least one multiplier and at least one adder and being coupled to the first register and the second register, wherein: the first register stores b activation values a1~ab, where b is a positive integer greater than 1; the second register stores b compressed floating-point numbers cf1~cfb; the b compressed floating-point numbers include k common scaling factors r1~rk, where k is 1 or a positive integer greater than 1; each compressed floating-point number cfi of the b compressed floating-point numbers contains k fixed-point mantissas mi_1~mi_k, for a total of b×k fixed-point mantissas mi_j, where i is a positive integer not greater than b and j is a positive integer not greater than k, and the value of each compressed floating-point number cfi is cfi = mi_1×r1 + … + mi_k×rk; and the arithmetic unit computes a dot-product result of the b activation values (a1, a2, …, ab) and the b compressed floating-point numbers (cf1, cf2, …, cfb).
Optionally, according to an embodiment of the present invention, the computing device performs the following steps: A) obtain b floating-point numbers f1~fb, where b is a positive integer greater than 1; B) generate k common scaling factors r1~rk for the b floating-point numbers, where k is a positive integer greater than 1, and the k common scaling factors r1~rk include at least one floating-point number having a mantissa; C) compress each floating-point number fi of the b floating-point numbers into k fixed-point mantissas mi_1~mi_k to produce b×k fixed-point mantissas mi_j, where i is a positive integer not greater than b and j is a positive integer not greater than k; and D) output a compression result, the compression result including the k common scaling factors r1~rk and the b×k fixed-point mantissas mi_j, representing b compressed floating-point numbers cf1~cfb, where the value of each compressed floating-point number cfi is cfi = mi_1×r1 + … + mi_k×rk.
Optionally, according to an embodiment of the present invention, before performing step D), the computing device further performs the following steps: calculating a quasi-compression result, the quasi-compression result including the k common scaling factors r1~rk and the b×k fixed-point mantissas mi_j; calculating a compression error for the quasi-compression result; setting a threshold; and adjusting the k common scaling factors r1~rk and the b×k fixed-point mantissas mi_j according to the compression error and the threshold.
Optionally, according to an embodiment of the present invention, the step of calculating the compression error for the quasi-compression result includes: calculating a compression error Ei for each floating-point number fi of the b floating-point numbers according to Ei = fi − (mi_1×r1 + … + mi_k×rk); calculating the sum of squares SE of the b errors E1~Eb according to SE = E1² + … + Eb²; and comparing the sum of squares with a threshold; if the sum of squares is not greater than the threshold, the quasi-compression result is used as the compression result.
Optionally, according to an embodiment of the present invention, if the compression error is greater than the threshold, steps B) and C) are re-executed.
Optionally, according to an embodiment of the present invention, the step of adjusting the k common scaling factors r1~rk and the b×k fixed-point mantissas mi_j is: iteratively processing the compression result of the k common scaling factors r1~rk and the b×k fixed-point mantissas mi_j by one of a heuristic algorithm, a randomized algorithm, or a brute-force method.
Optionally, according to an embodiment of the present invention, the step of setting the threshold includes: generating common scaling factors r1'~rk' for the b floating-point numbers; compressing each floating-point number fi of the b floating-point numbers into k fixed-point mantissas mi_1'~mi_k' to produce b×k fixed-point mantissas mi_j'; calculating the compression error Ei' = fi − (mi_1'×r1' + … + mi_k'×rk') for each floating-point number fi of the b floating-point numbers; calculating the sum of squares SE' of the b errors E1'~Eb' according to SE' = E1'² + … + Eb'²; and setting the threshold to the compression error SE'.
Optionally, according to an embodiment of the present invention, the b activation values a1~ab are integers, fixed-point numbers, or mantissas of MSFP block floating-point numbers.
Optionally, according to an embodiment of the present invention, in the computing device, all of the b×k fixed-point mantissas mi_j and all of the common scaling factors r1~rk are stored in the second register, but part of the b×k fixed-point mantissas mi_j and part of the common scaling factors r1~rk do not participate in the computation.
An embodiment of the present invention provides a computer-readable storage medium storing computer-readable instructions executable by a computer. When the computer-readable instructions are executed by a computer, they trigger the computer to perform a procedure that outputs b compressed floating-point numbers, where b is a positive integer greater than 1, the procedure comprising the following steps: A) generate k common scaling factors r1~rk, where k is 1 or a positive integer greater than 1, the k common scaling factors r1~rk including at least one floating-point number having a mantissa, with a scaling-factor exponent and a scaling-factor mantissa; B) generate k fixed-point mantissas mi_1~mi_k to produce b×k fixed-point mantissas mi_j, where i is a positive integer not greater than b and j is a positive integer not greater than k; and C) output the k common scaling factors r1~rk and the b×k fixed-point mantissas mi_j, representing b compressed floating-point numbers cf1~cfb, where the value of each compressed floating-point number cfi is cfi = mi_1×r1 + … + mi_k×rk.
In summary, the block floating-point compression performed by the present invention can save storage space, or reduce power consumption and speed up computation, while meeting the accuracy requirements of the application. In addition, thanks to the adjustability between the first mode and the second mode, the electronic products it is paired with can flexibly trade off between a high-performance mode and a low-power mode, giving it wider applicability in products. Furthermore, compared with Microsoft's MSFP and other prior art, the floating-point compression method of the present invention provides optimized computing performance and accuracy, and can therefore save power and speed up computation while meeting the accuracy requirements of applications.
The above description is only an overview of the technical solution of the present invention. In order that the technical means of the present invention may be understood more clearly and implemented according to the contents of the specification, and in order to make the above and other purposes, features and advantages of the present invention more evident, preferred embodiments are described in detail below with reference to the accompanying drawings.
Brief Description of the Drawings
Figure 1 is a schematic diagram of floating-point numbers in the prior art.
Figure 2 is a schematic diagram of an arithmetic unit applied to a computing device according to an embodiment of the present invention.
Figure 3 is a schematic diagram of compression processing performed by MSFP in the prior art.
Figure 4 is a schematic diagram of compression processing by an arithmetic unit according to an embodiment of the present invention.
Figure 5 is a schematic diagram of compression processing by an arithmetic unit according to another embodiment of the present invention.
Figure 6 is a schematic diagram of the present invention using an arithmetic unit and registers to perform floating-point multiplication of weight values and activation values.
Figure 7 is a flow chart of a floating-point number compression method according to an embodiment of the present invention.
Figure 8 illustrates the difference between the method of the present invention and the MSFP method.
Detailed Description
The present invention is described in particular by the following examples, which are for illustration only; those skilled in the art may make various changes and modifications without departing from the spirit and scope of this disclosure, so the protection scope of this disclosure shall be defined by the appended claims. Throughout the specification and claims, unless the context clearly dictates otherwise, the meanings of "a" and "the" include statements containing "one or at least one" of the component or ingredient. Further, as used herein, the singular articles also include descriptions of a plurality of components or ingredients, unless it is apparent from the specific context that the plural is excluded. Moreover, as applied in this description and in all of the claims below, unless the content clearly dictates otherwise, "in" may include "in" and "on". The terms used throughout the specification and claims, unless otherwise noted, generally have the ordinary meaning of each term as used in this field, in the content disclosed herein, and in the specific context. Certain terms used to describe the invention are discussed below, or elsewhere in this specification, to provide practitioners with additional guidance regarding the description of the invention. Examples anywhere throughout this specification, including examples of any terms discussed herein, are illustrative only and of course do not limit the scope or meaning of the invention or of any exemplified term. Likewise, the invention is not limited to the various embodiments presented in this specification.
The words "substantially", "around", "about" or "approximately" as used herein shall generally mean within 20% of a given value or range, preferably within 10%. Furthermore, the quantities provided herein may be approximate, meaning that, unless otherwise stated, they may be expressed with the words "about", "around" or "approximately". When a quantity, concentration, or other value or parameter has a specified range, a preferred range, or a list of upper and lower ideal values, this shall be deemed to specifically disclose all ranges formed from any pair of upper and lower limits or ideal values, whether or not such ranges are separately disclosed. For example, if a disclosed range of a length is X centimeters to Y centimeters, it shall be deemed to disclose a length of H centimeters where H can be any real number between X and Y.
In addition, the term "electrically coupled" or "electrically connected" as used herein includes any direct and indirect means of electrical connection. For example, if the text describes a first device as electrically coupled to a second device, the first device may be directly connected to the second device, or indirectly connected to the second device through other devices or connection means. In addition, where the transmission or provision of electrical signals is described, those skilled in the art should understand that the transmission of an electrical signal may be accompanied by attenuation or other non-ideal changes, but unless the source and receiving end of the transmitted or provided electrical signal are specifically noted otherwise, they shall be regarded as substantially the same signal. For example, when an electrical signal S is transmitted (or provided) from terminal A of an electronic circuit to terminal B of the electronic circuit, a voltage drop may occur across the source and drain terminals of a transistor switch and/or possible stray capacitance; but if the purpose of the design is not to deliberately use the attenuation or other non-ideal changes produced during transmission (or provision) to achieve certain specific technical effects, the electrical signal S at terminal A and terminal B of the electronic circuit should be regarded as substantially the same signal.
It will be understood that the terms "comprising" or "including", "having", "containing", "involving" and the like used herein are open-ended, meaning including but not limited to. In addition, any single embodiment or claim of the present invention need not achieve all of the purposes, advantages or features disclosed herein. Furthermore, the Abstract and Title are provided merely to aid patent-document searching and are not intended to limit the claimed scope of the present invention.
Neural-network algorithms involve a large number of floating-point multiplications of weight values (Weight) and activation values (Activation); therefore, it is very important to compress floating-point numbers as well as possible while meeting accuracy requirements.
Please refer to Figure 1, which is a schematic diagram of a prior-art floating-point arithmetic scheme. As shown in Figure 1, the weight values form an array (or vector) containing 16 words, which can be represented by the floating-point numbers on the right. Each floating-point number is divided into a sign (Sign), an exponent (Exponent) and a mantissa (Mantissa), stored in three different fields of a register, and is decoded during computation as:
(−1)^Sign × (1.Mantissa) × 2^Exponent
where Sign represents the sign of the floating-point number, Exponent represents its exponent, and the mantissa is also called the significand (Significand). When stored in a register, the leftmost bit of the register is allocated as the sign bit to store the sign, and the remaining bits (for example 15 to 18 bits) are allocated as exponent bits and mantissa bits to store the exponent and the mantissa respectively. The prior-art approach treats every word independently as one floating-point number for computation and storage, so the register must store 16 to 19 bits for each word; this is not only computationally time-consuming but also involves more hardware circuitry, reducing product performance and increasing cost and power consumption. Please note that the bit counts of the architectures mentioned throughout the text and figures are only for ease of understanding and are not intended to limit the scope of the present invention; in practice, these bit counts can be increased or decreased according to design requirements.
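The decoding just described can be illustrated with a minimal Python sketch (an illustrative aid only, not part of the disclosure: it assumes IEEE-754 single precision, ignores subnormal and special values, and the helper name decode_float32 is ours):

```python
import struct

def decode_float32(x: float):
    """Split a float32 into its Sign, Exponent and Mantissa fields and
    re-evaluate (-1)^Sign x (1.Mantissa) x 2^Exponent."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = bits >> 31                        # leftmost bit: the sign
    exponent = ((bits >> 23) & 0xFF) - 127   # 8 exponent bits, bias removed
    mantissa = bits & 0x7FFFFF               # 23 fractional mantissa bits
    value = (-1) ** sign * (1 + mantissa / 2**23) * 2.0 ** exponent
    return sign, exponent, mantissa, value

print(decode_float32(-6.5))  # (1, 2, 5242880, -6.5), i.e. -1.625 x 2^2
```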
Please refer to Figure 2, which is a schematic diagram of an arithmetic unit 110 applied to a computing device 100 according to an embodiment of the present invention. As shown in Figure 2, the computing device 100 includes the arithmetic unit 110, a first register 111, a second register 112, a third register 113 and a memory 114. The arithmetic unit 110 is coupled to the first register 111, the second register 112 and the third register 113, and the memory 114 is coupled to the first register 111, the second register 112 and the third register 113. It is worth noting that the memory 114 is only a general term for the storage units in the computing device 100; that is, the memory 114 can be an independent storage unit, or refer generally to all possible storage units in the computing device 100; for example, the first register 111, the second register 112 and the third register 113 may each be coupled to different memories. In addition, in the present invention, the memory mentioned is only one of various usable storage media, and those of ordinary skill in the art will understand that other types of storage media can be substituted for the memory. The computing device 100 can be any device with computing capability, such as a central processing unit (CPU), a graphics processing unit (GPU), an artificial-intelligence accelerator (AI Accelerator), a programmable logic array (FPGA), a desktop computer, a notebook computer, a smartphone, a tablet computer, a smart wearable device, and so on. The mantissas of the floating-point numbers stored in the first register 111 and the second register 112 can be ignored by the present invention and not stored in the memory 114, thereby saving memory space. In addition, the memory 114 can store computer-readable instructions executable by the computing device 100; when the computer-readable instructions are executed by the computing device 100, they cause the computing device 100 (including the arithmetic unit 110, the first register 111, the second register 112 and the third register 113) to perform a floating-point compression method. The memory 114 can also store multiple sets of batch-normalization coefficients (Batch Normalization Coefficient); a batch-normalization coefficient is a coefficient used in artificial-intelligence computation to adjust the mean and standard deviation of values. Typically, one piece of feature-map (Feature map) numeric data corresponds to one specific set of batch-normalization coefficients.
Please refer to Figure 3, a schematic diagram of MSFP compression. As shown in Figure 3, instead of treating each word independently as one floating-point number for computation and storage, MSFP compresses 16 floating-point numbers as one "block", extracting a common exponent part for the 16 floating-point numbers (marked as an 8-bit shared exponent in the figure); after extraction, only the sign part and the mantissa part of these floating-point numbers remain. Please refer to Figure 4, a schematic diagram of the arithmetic unit 110 compressing floating-point numbers according to an embodiment of the present invention. In Figure 4, each floating-point number is compressed into two 2-bit (2-bit) two's-complement (2's complement) fixed-point (fixed-point) mantissas m1 and m2; then two 7-bit (7-bit) floating-point numbers are extracted for each block, namely the scales (Scale) r1 and r2, also called scaling factors (Scaling factor). Then, for each floating-point number, integer operations among m1, m2, r1 and r2 are performed so that "m1×r1+m2×r2" has the minimum mean-squared error with respect to that floating-point number. Note that the fixed-point mantissas m1 and m2 can be signed integers (integers with a sign) or unsigned integers (integers without a sign).
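A minimal Python sketch of such block compression follows (illustrative only, not the disclosed implementation: the two scales are found by plain random search over real values rather than 7-bit quantized scales, and the function names are ours):

```python
import itertools
import random

MANTISSAS = [-2, -1, 0, 1]  # values representable by a 2-bit 2's complement

def fit_mantissas(weights, r1, r2):
    """For fixed scales r1, r2, pick the best (m1, m2) per weight and
    return the summed squared error plus the chosen mantissa pairs."""
    total, ms = 0.0, []
    for f in weights:
        m1, m2 = min(itertools.product(MANTISSAS, MANTISSAS),
                     key=lambda m: (f - (m[0] * r1 + m[1] * r2)) ** 2)
        total += (f - (m1 * r1 + m2 * r2)) ** 2
        ms.append((m1, m2))
    return total, ms

def compress_block(weights, iters=2000, seed=0):
    """Random search over the two shared scales (a stand-in for the
    heuristic/random/brute-force iteration described in the text)."""
    rng = random.Random(seed)
    hi = max(abs(f) for f in weights) or 1.0
    best = (float("inf"), 0.0, 0.0, [])
    for _ in range(iters):
        r1, r2 = rng.uniform(0, hi), rng.uniform(0, hi / 4)  # heuristic ranges
        se, ms = fit_mantissas(weights, r1, r2)
        if se < best[0]:
            best = (se, r1, r2, ms)
    return best  # (SE, r1, r2, [(m1, m2), ...])

se, r1, r2, ms = compress_block([0.8, -1.1, 0.05, 0.33])
print(round(se, 6), round(r1, 4), round(r2, 4), ms)
```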
In addition, the present invention does not limit the number of m's and r's. For example, the arithmetic unit 110 performs the following steps: obtain b floating-point numbers f1~fb, where b is a positive integer greater than 1; generate common scaling factors r1~rk for the b floating-point numbers, where k is a positive integer greater than 1; compress each floating-point number fi of the b floating-point numbers into k fixed-point mantissas mi_1~mi_k to produce b×k fixed-point mantissas mi_j, where i is a positive integer not greater than b and j is a positive integer not greater than k; and output a compression result including the k common scaling factors r1~rk and the b×k fixed-point mantissas mi_j. Referring to Figure 5, a schematic diagram of compression processing by the arithmetic unit 110 according to another embodiment of the present invention, the memory 114 of the computing device 100 can store two sets of batch-normalization coefficients, corresponding to two floating-point compression modes: the first mode is the complete computation shown in Figure 4, while the second mode deliberately ignores the (m2×r2) term to reduce computational complexity. The arithmetic unit 110 can decide whether to select the first mode or the second mode according to the current state of the computing device 100 (for example, whether it is overheating or overloaded), or make the selection according to the accuracy requirements of the current application. For example, when the current temperature of the computing device 100 is too high and it needs to cool down, the second mode can be selected so that the arithmetic unit 110 operates in a low-power, low-temperature state. In addition, when the computing device 100 is a mobile device in a low-battery condition, the second mode can also be selected to extend the standby time of the mobile device. Moreover, if the arithmetic unit 110 is to perform precision computation, the first mode can be selected to further improve computational accuracy.
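The two modes can be summarized in a short sketch (the low_power flag standing in for the device-state or accuracy check described above is a hypothetical placeholder, not part of the disclosure):

```python
def compressed_value(m1, m2, r1, r2, low_power=False):
    """First mode: evaluate the full m1*r1 + m2*r2.
    Second mode: deliberately drop the (m2 x r2) term to trade a little
    accuracy for lower power and less computation."""
    if low_power:                   # e.g., device overheating or battery low
        return m1 * r1
    return m1 * r1 + m2 * r2
```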
Please refer to Figure 6, a schematic diagram of the present invention using registers and the arithmetic unit to perform a floating-point dot-product multiplication of weight values (Weight) and activation values (Activation), in which the first register, the second register and the third register may correspond respectively to the first register 111, the second register 112 and the third register 113 in Figure 2, and the multipliers and adders correspond to the arithmetic unit 110 in Figure 2. As shown in Figure 6, the second register stores the above-mentioned common scaling factors r1 and r2 together with the two's-complement fixed-point mantissas m1_1, m1_2, and so on, corresponding to each floating-point number, each of which is 2 bits. The first register stores the activation values a1, …, a14, a15, a16. Under the architecture of Figure 6, a1 is multiplied by m1_1 and m1_2 respectively, a2 is multiplied by m2_1 and m2_2 respectively, and so on, with a16 multiplied by m16_1 and m16_2 respectively; these products are summed by adders 601 and 602, then processed by multipliers 611 and 612 and adder 603, with adder 603 outputting the dot-product result. Compared with the prior art, the present invention simplifies the hardware architecture and can therefore save the power consumption and time of data storage and data transfer.
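The data path of Figure 6 can be mirrored in a few lines of Python (a sketch under the k = 2 layout described above; the variable names are ours):

```python
def dot_product(activations, ms, r1, r2):
    """sum_i a_i*cf_i = r1*sum_i(a_i*m_i1) + r2*sum_i(a_i*m_i2):
    two integer multiply-accumulate trees (adders 601/602), then two
    scale multiplies and a final add (multipliers 611/612, adder 603)."""
    acc1 = sum(a * m1 for a, (m1, _) in zip(activations, ms))  # adder 601
    acc2 = sum(a * m2 for a, (_, m2) in zip(activations, ms))  # adder 602
    return acc1 * r1 + acc2 * r2                               # 611/612 + 603

print(dot_product([3, -1, 2], [(1, -2), (0, 1), (-1, 1)], 0.5, 0.125))  # -0.125
```

The factorization is what makes the hardware cheap: all per-weight work is low-bit integer arithmetic, and the floating-point scales are applied only once per block.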
Furthermore, to ensure that the required accuracy is maintained after the floating-point numbers are compressed, the present invention can check the compression error before producing the compression result: for example, generate a quasi-compression result, the quasi-compression result including the k common scaling factors r1~rk and the b×k fixed-point mantissas mi_j; then compute a compression error for the quasi-compression result and set a threshold; and finally adjust the quasi-compression result according to the compression error and the threshold to serve as the compression result.
Specifically, the compression error Ei can be computed for each floating-point number fi of the b floating-point numbers according to the following equation: Ei = fi − (mi_1×r1 + mi_2×r2 + … + mi_k×rk).
Next, the sum of squares SE of the b errors E1~Eb is computed according to the following equation: SE = E1² + E2² + … + Eb².
Then the sum of squares is compared with a threshold: if the sum of squares is not greater than the threshold, the compression error is small and the quasi-compression result is output as the compression result; if the sum of squares is greater than the threshold, the quasi-compression result is regenerated, for example by iterating on the compression result. The iterative processing includes a heuristic algorithm (Heuristic algorithm), a randomized algorithm (Randomized algorithm), or a brute-force method (Brute-force algorithm). Heuristic algorithms include the evolutionary algorithm (Evolutionary algorithm) and the simulated annealing algorithm (Simulated annealing algorithm). For example, if an evolutionary algorithm is used, one bit of the common scaling factors r1 and r2 can be changed (mutation). If a simulated annealing algorithm is used, for example, the common scaling factors r1 and r2 can each be increased or decreased by a small value d, producing the four iterated candidate pairs r1+d, r2+d; r1+d, r2−d; r1−d, r2+d; and r1−d, r2−d. If a randomized algorithm is used, for example, a random-number function can be used to generate new common scaling factors r1', r2'. If the brute-force method is used, for example, and r1 and r2 are each 7 bits, then there are 2^14 combinations of r1 and r2 in total, all of which are traversed once. The above algorithms are merely examples and are not intended to limit the scope of the present invention. For instance, although the evolutionary algorithm and the simulated annealing algorithm are nearly the most general and common heuristic algorithms, there are others such as the bee colony algorithm (Bee colony algorithm), the ant colony algorithm (Ant colony algorithm), the whale optimization algorithm (Whale optimization algorithm), and so on. As another example, besides the mutation operation, evolutionary algorithms also include selection (selection) and crossover (crossover) operations, which are not detailed here for brevity. Those of ordinary skill in the art will understand this, and other types of algorithms can be substituted.
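As a concrete illustration of the error check and of one simulated-annealing-style update (a sketch only, under the k = 2 case; a full loop would also re-fit the mantissas mi_j after every scale update):

```python
def squared_error(weights, ms, r1, r2):
    """SE = sum_i (f_i - (m_i1*r1 + m_i2*r2))^2 for the current candidate."""
    return sum((f - (m1 * r1 + m2 * r2)) ** 2
               for f, (m1, m2) in zip(weights, ms))

def anneal_step(weights, ms, r1, r2, d=1e-3):
    """Try the four perturbed scale pairs described above and keep the best."""
    candidates = [(r1 + d, r2 + d), (r1 + d, r2 - d),
                  (r1 - d, r2 + d), (r1 - d, r2 - d)]
    return min(candidates, key=lambda c: squared_error(weights, ms, *c))

# Accept the quasi-compression result once SE is not above the threshold:
# while squared_error(weights, ms, r1, r2) > threshold:
#     r1, r2 = anneal_step(weights, ms, r1, r2)
```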
The present invention does not limit the way the threshold is generated. Besides an absolute threshold, one approach is a relative threshold, which can be summarized in the following steps: generate common scaling factors r1'~rk' for the b floating-point numbers; compress each floating-point number fi of the b floating-point numbers into k fixed-point mantissas mi_1'~mi_k' to produce b×k fixed-point mantissas mi_j'; and compute the compression error Ei' for each floating-point number fi of the b floating-point numbers according to the following equation: Ei' = fi − (mi_1'×r1' + … + mi_k'×rk').
Next, compute the sum of squares SE' of the b errors E1'~Eb' according to the following equation: SE' = E1'² + … + Eb'².
Then set the threshold to the compression error SE'. Those of ordinary skill in the art will understand that this way of generating a threshold can be combined with the aforementioned heuristic algorithms (evolutionary algorithm, simulated annealing algorithm, etc.), randomized algorithms, brute-force methods, and so on.
Optionally, according to an embodiment of the present invention, the step of generating the common scaling factors r1~rk for the b floating-point numbers includes: extracting a common sign for the b floating-point numbers so that the b×k fixed-point mantissas mi_j are unsigned; or not extracting the sign when generating the common scaling factors r1~rk for the b floating-point numbers, so that the b×k fixed-point mantissas mi_j are signed.
Optionally, according to an embodiment of the present invention, the b×k fixed-point mantissas mi_j may or may not be 2's complement numbers.
Optionally, according to an embodiment of the present invention, the floating-point number compression method further includes: storing part of the b×k fixed-point mantissas mi_j and part of the common scaling factors r1~rk in a register for use in subsequent computation; that is, part of the fixed-point mantissas and/or common scaling factors are discarded, which can further speed up device computation and reduce device power consumption.
Optionally, according to an embodiment of the present invention, the floating-point number compression method further includes: storing all of the b×k fixed-point mantissas mi_j and all of the common scaling factors r1~rk in the register, while part of the b×k fixed-point mantissas mi_j and part of the common scaling factors r1~rk do not participate in the computation; that is, not all stored common scaling factors participate in the computation, which can further speed up device computation and reduce device power consumption.
Please refer to Figure 7, a flow chart of a floating-point number compression method according to an embodiment of the present invention. Please note that these steps need not be performed in the order shown in Figure 7, provided substantially the same result can be obtained. The floating-point arithmetic method shown in Figure 7 can be adopted by the computing device 100 or the arithmetic unit 110 shown in Figure 2, and can be briefly summarized in the following steps:
Step S702: Obtain b floating-point numbers f1~fb;
Step S704: Generate common scaling factors r1~rk for the b floating-point numbers;
Step S706: Compress each floating-point number fi of the b floating-point numbers into k fixed-point mantissas mi_1~mi_k to produce b×k fixed-point mantissas mi_j;
Step S708: Output a compression result, which includes the k common scaling factors r1~rk and the b×k fixed-point mantissas mi_j.
Since those skilled in the art can readily understand the details of each step in Figure 7 after reading the above paragraphs, further description is omitted here for brevity.
In summary, the present invention proposes a novel floating-point compression scheme with optimized computational efficiency that offers the advantage of non-uniform quantization, in which the present invention approximates (approximate) each full-precision weight vector (i.e., the uncompressed floating-point numbers) with the sum of two subword vectors (subword vector) carrying two scales. More specifically, each subword is a low-bit (e.g., 2-bit), signed (2's complement) integer, and each scale is a low-bitwidth floating-point number (LBFP) (e.g., 7-bit). The following explains in detail why the present invention outperforms Microsoft's MSFP algorithm.
One embodiment of the present invention uses two scales (i.e., r1, r2), and each floating-point number is compressed into two fixed-point mantissas (i.e., m1, m2), in which the computation cost of the scales is amortized over 16 weights, and each scale is a low-bitwidth floating-point number (LBFP) involving only low-bit operations.
Referring to Figure 8, which illustrates the difference between the method of the present invention and the MSFP algorithm by comparing the results of compressing a weight vector with the floating-point compression method of the present invention versus the MSFP compression method, it is clear from the figure that the present invention achieves a smaller quantization error with fewer quantization levels (quantization level) than MSFP does with more quantization levels. The advantages of the present invention over MSFP are listed further below.
First, no quantization level is wasted: the floating-point compression method of the present invention uses 2's complement and wastes no quantization level. By contrast, MSFP uses sign-magnitude (sign magnitude), which wastes an additional quantization level (both positive 0 and negative 0 are 0, so one of them is wasted; for example, 2 bits can only represent the three values −1, 0, 1 rather than 2² = 4 values). When the bit count is low, the impact of wasting one quantization level is significant.
Second, adapting to skewed distributions: the floating-point compression method of the present invention exploits the property that 2's complement is asymmetric about 0 (for example, the range of a 2-bit 2's complement number is −2, −1, 0, 1), together with the scales, to adapt to the asymmetric weight distribution of the weight vector. By contrast, MSFP uses sign-magnitude (sign magnitude), whose range is symmetric about 0 (for example, 2-bit sign-magnitude represents −1, 0, 1, symmetric about 0), so MSFP's quantization levels are fixed to be symmetric, requiring additional quantization levels to adapt to an asymmetric weight distribution. As shown in Figure 8, where MSFP must use 15 quantization levels (4 bits), the present invention uses only 8 quantization levels (3 bits).
Third, adapting to non-uniform distributions: the floating-point compression method of the present invention can provide non-uniform quantization levels by combining the two scales (r1, r2), whereas MSFP can only provide uniform quantization levels. In other words, the floating-point compression method of the present invention is more flexible for compressing non-uniformly distributed weights.
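The non-uniformity and asymmetry just described are easy to see by enumerating the values reachable by one compressed weight (a sketch with arbitrarily chosen scales, for illustration only):

```python
from itertools import product

MANTISSAS = [-2, -1, 0, 1]   # 2-bit 2's complement: asymmetric about 0

def levels(r1, r2):
    """All values m1*r1 + m2*r2 reachable by a single compressed weight."""
    return sorted({m1 * r1 + m2 * r2 for m1, m2 in product(MANTISSAS, MANTISSAS)})

print(levels(1.0, 0.3))
# 16 asymmetric, unevenly spaced levels from -2.6 to 1.3 -- a non-uniform
# grid, unlike MSFP's uniform, power-of-two-stepped grid.
```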
Fourth, a more flexible quantization step size: the quantization step size (step size) of the floating-point compression method of the present invention is defined by the two scales (r1, r2), which are low-bitwidth (low bitwidth) floating-point values. By contrast, MSFP's quantization step size can only be a power-of-two (power-of-two) value, such as 0.5, 0.25 or 0.125.
The table below shows experimental data comparing the present invention and MSFP on a neural-network image-classification task. Both compress in blocks of 16 floating-point numbers. By comparison, the present invention requires fewer bits per 16 floating-point numbers while achieving higher classification accuracy.
In a preferred embodiment, the bit counts of the fixed-point mantissas m1 and m2 of the present invention may be as shown in the following table, but are not limited to it.
In a preferred embodiment, the bit counts of the common scales r1 and r2 of the present invention may be as shown in the following table, but are not limited to it.
In summary, the block floating-point compression performed by the present invention can save power consumption and speed up the instruction cycle while meeting the accuracy requirements of the application. In addition, thanks to the adjustability between the first mode and the second mode, the electronic products it is paired with can flexibly trade off between a high-performance mode and a low-power mode, giving it wider applicability in products. Furthermore, compared with Microsoft's MSFP and other prior art, the floating-point compression method of the present invention provides optimized computing performance and accuracy, and can therefore save power and speed up computation while meeting the accuracy requirements of applications.
The above is only a preferred embodiment of the present invention and does not limit the present invention in any form. Although the present invention has been disclosed above by way of preferred embodiments, they are not intended to limit the present invention; any person skilled in the art may, without departing from the scope of the technical solution of the present invention, use the methods and technical content disclosed above to make some changes or modifications into equivalent embodiments of equivalent variation. Any simple modification, equivalent variation or modification made to the above embodiments according to the technical essence of the present invention, without departing from the content of the technical solution of the present invention, still falls within the scope of the technical solution of the present invention.

Claims (22)

  1. A floating-point number compression method, characterized by comprising using an arithmetic unit to perform the following steps:
    A) obtaining b floating-point numbers f1~fb, where b is a positive integer greater than 1;
    B) generating k common scaling factors r1~rk for the b floating-point numbers, where k is 1 or a positive integer greater than 1, and the k common scaling factors r1~rk include at least one floating-point number having a mantissa;
    C) compressing each floating-point number fi of the b floating-point numbers into k fixed-point mantissas mi_1~mi_k to produce b×k fixed-point mantissas mi_j in total, where i is a positive integer not greater than b and j is a positive integer not greater than k; and
    D) outputting a compression result, the compression result including the k common scaling factors r1~rk and the b×k fixed-point mantissas mi_j, representing b compressed floating-point numbers cf1~cfb, where the value of each compressed floating-point number cfi is cfi = mi_1×r1 + … + mi_k×rk.
  2. The floating-point number compression method of claim 1, characterized in that before performing step D), the computing device further performs the following steps:
    generating a quasi-compression result, the quasi-compression result including the k common scaling factors r1~rk and the b×k fixed-point mantissas mi_j; calculating a compression error for the quasi-compression result;
    setting a threshold; and
    adjusting the k common scaling factors r1~rk and the b×k fixed-point mantissas mi_j according to the compression error and the threshold.
  3. The floating-point number compression method of claim 2, characterized in that the step of calculating the compression error for the quasi-compression result comprises:
    calculating a compression error Ei for each floating-point number fi of the b floating-point numbers according to the following equation: Ei = fi − (mi_1×r1 + … + mi_k×rk);
    calculating the sum of squares SE of the b errors E1~Eb according to the following equation: SE = E1² + E2² + … + Eb²;
    and comparing the sum of squares with a threshold;
    wherein if the sum of squares is not greater than the threshold, the quasi-compression result is used as the compression result.
  4. The floating-point number compression method of claim 2, characterized in that if the compression error is greater than the threshold, steps B) and C) are re-executed.
  5. The floating-point number compression method of claim 4, characterized in that the step of adjusting the k common scaling factors r1~rk and the b×k fixed-point mantissas mi_j is:
    iteratively processing the k common scaling factors r1~rk and the b×k fixed-point mantissas mi_j by one of a heuristic algorithm, a randomized algorithm, or a brute-force method.
  6. The floating-point number compression method of claim 2, characterized in that the step of setting the threshold comprises:
    generating common scaling factors r1'~rk' for the b floating-point numbers;
    compressing each floating-point number fi of the b floating-point numbers into k fixed-point mantissas mi_1'~mi_k' to produce b×k fixed-point mantissas mi_j';
    calculating a compression error Ei' for each floating-point number fi of the b floating-point numbers according to the following equation: Ei' = fi − (mi_1'×r1' + … + mi_k'×rk');
    calculating the sum of squares SE' of the b errors E1'~Eb' according to the following equation: SE' = E1'² + … + Eb'²;
    and
    setting the threshold to the compression error SE'.
  7. The floating-point number compression method of claim 1, characterized in that the b×k fixed-point mantissas mi_j are all signed numbers.
  8. The floating-point number compression method of claim 1, characterized in that at least one of the b×k fixed-point mantissas mi_j is a signed number, and the numerical range expressed by the signed number is asymmetric with respect to 0.
  9. The floating-point number compression method of claim 8, characterized in that the signed number is a 2's complement number.
  10. The floating-point number compression method of claim 1, characterized in that the floating-point number compression method further comprises:
    storing the b×k fixed-point mantissas mi_j and the k common scaling factors in a memory of a network server for use in remote download and computation.
  11. The floating-point number compression method of claim 1, characterized in that the floating-point number compression method further comprises:
    storing the b×k fixed-point mantissas mi_j and all of the common scaling factors r1~rk in a memory, wherein part of the b×k fixed-point mantissas mi_j and part of the common scaling factors r1~rk do not participate in the computation.
  12. The floating-point number compression method of claim 1, characterized in that k is equal to 2 and the common scaling factors r1~rk are all floating-point numbers of no more than 16 bits.
  13. A computing device, characterized by comprising a first register, a second register and an arithmetic unit, the arithmetic unit comprising at least one multiplier and at least one adder and being coupled to the first register and the second register, wherein:
    the first register stores b activation values a1~ab, where b is a positive integer greater than 1;
    the second register stores b compressed floating-point numbers cf1~cfb;
    the b compressed floating-point numbers include k common scaling factors r1~rk, where k is 1 or a positive integer greater than 1;
    each compressed floating-point number cfi of the b compressed floating-point numbers includes k fixed-point mantissas mi_1~mi_k, for a total of b×k fixed-point mantissas mi_j, where i is a positive integer not greater than b, j is a positive integer not greater than k, and the value of each compressed floating-point number cfi is cfi = mi_1×r1 + … + mi_k×rk; and
    the arithmetic unit computes a dot-product result of the b activation values (a1, a2, …, ab) and the b compressed floating-point numbers (cf1, cf2, …, cfb).
  14. The computing device of claim 13, characterized in that the computing device performs the following steps:
    A) obtaining b floating-point numbers f1~fb, where b is a positive integer greater than 1;
    B) generating k common scaling factors r1~rk for the b floating-point numbers, where k is a positive integer greater than 1, and the k common scaling factors r1~rk include at least one floating-point number having a mantissa;
    C) compressing each floating-point number fi of the b floating-point numbers into k fixed-point mantissas mi_1~mi_k to produce b×k fixed-point mantissas mi_j, where i is a positive integer not greater than b and j is a positive integer not greater than k; and
    D) outputting a compression result, the compression result including the k common scaling factors r1~rk and the b×k fixed-point mantissas mi_j, representing b compressed floating-point numbers cf1~cfb, where the value of each compressed floating-point number cfi is cfi = mi_1×r1 + … + mi_k×rk.
  15. The computing device of claim 14, characterized in that before performing step D), the computing device further performs the following steps:
    calculating a quasi-compression result, the quasi-compression result including the k common scaling factors r1~rk and the b×k fixed-point mantissas mi_j;
    calculating a compression error for the quasi-compression result;
    setting a threshold; and
    adjusting the k common scaling factors r1~rk and the b×k fixed-point mantissas mi_j according to the compression error and the threshold.
  16. The computing device of claim 15, characterized in that the step of calculating the compression error for the quasi-compression result comprises:
    calculating a compression error Ei for each floating-point number fi of the b floating-point numbers according to the following equation: Ei = fi − (mi_1×r1 + … + mi_k×rk);
    calculating the sum of squares SE of the b errors E1~Eb according to the following equation: SE = E1² + E2² + … + Eb²;
    and comparing the sum of squares with a threshold;
    wherein if the sum of squares is not greater than the threshold, the quasi-compression result is used as the compression result.
  17. The computing device of claim 15, characterized in that if the compression error is greater than the threshold, steps B) and C) are re-executed.
  18. The computing device of claim 17, characterized in that the step of adjusting the k common scaling factors r1~rk and the b×k fixed-point mantissas mi_j is:
    iteratively processing the compression result of the k common scaling factors r1~rk and the b×k fixed-point mantissas mi_j by one of a heuristic algorithm, a randomized algorithm, or a brute-force method.
  19. The computing device of claim 14, characterized in that the step of setting the threshold comprises:
    generating common scaling factors r1'~rk' for the b floating-point numbers;
    compressing each floating-point number fi of the b floating-point numbers into k fixed-point mantissas mi_1'~mi_k' to produce b×k fixed-point mantissas mi_j';
    calculating a compression error Ei' for each floating-point number fi of the b floating-point numbers according to the following equation: Ei' = fi − (mi_1'×r1' + … + mi_k'×rk');
    calculating the sum of squares SE' of the b errors E1'~Eb' according to the following equation: SE' = E1'² + … + Eb'²;
    and
    setting the threshold to the compression error SE'.
  20. The computing device of claim 13, characterized in that the b activation values a1~ab are integers, fixed-point numbers, or mantissas of MSFP block floating-point numbers.
  21. The computing device of claim 13, characterized in that:
    all of the b×k fixed-point mantissas mi_j and all of the common scaling factors r1~rk are stored in the second register, but part of the b×k fixed-point mantissas mi_j and part of the common scaling factors r1~rk do not participate in the computation.
  22. A computer-readable storage medium, characterized in that the computer-readable storage medium stores computer-readable instructions executable by a computer, and when the computer-readable instructions are executed by a computer, they trigger the computer to perform a method of outputting b compressed floating-point numbers, where b is a positive integer greater than 1, the method comprising:
    A) generating k common scaling factors r1~rk, where k is 1 or a positive integer greater than 1, and the k common scaling factors r1~rk include at least one floating-point number having a scaling-factor exponent and a scaling-factor mantissa;
    B) generating k fixed-point mantissas mi_1~mi_k to produce b×k fixed-point mantissas mi_j, where i is a positive integer not greater than b and j is a positive integer not greater than k; and
    C) outputting the k common scaling factors r1~rk and the b×k fixed-point mantissas mi_j, representing b compressed floating-point numbers cf1~cfb, where the value of each compressed floating-point number cfi is cfi = mi_1×r1 + … + mi_k×rk.
PCT/CN2023/096302 2022-05-26 2023-05-25 Floating-point number compression method, computing device and computer-readable storage medium WO2023227064A1 (zh)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US202263345918P 2022-05-26 2022-05-26
US63/345,918 2022-05-26
US202263426727P 2022-11-19 2022-11-19
US63/426,727 2022-11-19

Publications (2)

Publication Number Publication Date
WO2023227064A1 true WO2023227064A1 (zh) 2023-11-30
WO2023227064A9 WO2023227064A9 (zh) 2024-01-04

Family

ID=88918577

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/096302 WO2023227064A1 (zh) Floating-point number compression method, computing device and computer-readable storage medium

Country Status (1)

Country Link
WO (1) WO2023227064A1 (zh)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060028482A1 (en) * 2004-08-04 2006-02-09 Nvidia Corporation Filtering unit for floating-point texture data
US20130007076A1 (en) * 2011-06-30 2013-01-03 Samplify Systems, Inc. Computationally efficient compression of floating-point data
CN114341882A (zh) Lossless exponent and lossy mantissa weight compression for training deep neural networks


Also Published As

Publication number Publication date
WO2023227064A9 (zh) 2024-01-04
TW202403539A (zh) 2024-01-16

Similar Documents

Publication Publication Date Title
WO2021036890A1 (zh) Data processing method and apparatus, computer device, and storage medium
WO2021036904A1 (zh) Data processing method and apparatus, computer device, and storage medium
WO2021036908A1 (zh) Data processing method and apparatus, computer device, and storage medium
WO2021036905A1 (zh) Data processing method and apparatus, computer device, and storage medium
Liu et al. Design and analysis of inexact floating-point adders
JP7244186B2 (ja) 改良された低精度の2進浮動小数点形式設定
CN110717585B Neural-network model training method, data processing method, and related products
WO2023029464A1 (zh) Data processing apparatus and method, chip, computer device, and storage medium
Mitschke et al. A fixed-point quantization technique for convolutional neural networks based on weight scaling
CN114677548A Neural-network image classification system and method based on resistive random-access memory
WO2019046722A1 Providing efficient floating-point operations using matrix processors in processor-based systems
WO2023227064A1 (zh) Floating-point number compression method, computing device and computer-readable storage medium
TWI837000B Floating-point number compression method, computing device and computer-readable storage medium
TW202109281A Signed multiword multiplier
CN114115803B Approximate floating-point multiplier based on partial-product probability analysis
CN116127255B Convolution operation circuit and related circuits or devices having the same
CN116795324A Mixed-precision floating-point multiplication apparatus and mixed-precision floating-point processing method
CN112085176A Data processing method and apparatus, computer device, and storage medium
CN111492369A Residual quantization of shift weights in artificial neural networks
WO2023147770A1 (zh) Floating-point arithmetic method and related arithmetic unit
CN112783473B Method for computing integer multiplications in parallel using a single DSP unit
US20210406690A1 (en) Efficient weight clipping for neural networks
Madadum et al. A resource-efficient convolutional neural network accelerator using fine-grained logarithmic quantization
CN103955355A Segmented parallel compression method and system for nonvolatile processors
CN113159296A Method for constructing a binary neural network

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23811134

Country of ref document: EP

Kind code of ref document: A1