WO2023004799A1 - Electronic Device and Neural Network Quantization Method - Google Patents

Electronic Device and Neural Network Quantization Method

Info

Publication number
WO2023004799A1
Authority
WO
WIPO (PCT)
Prior art keywords
point
quantization
quantization coefficient
fixed
data
Prior art date
Application number
PCT/CN2021/109839
Other languages
English (en)
French (fr)
Inventor
肖延南
刘根树
张怡浩
左文明
Original Assignee
华为技术有限公司 (Huawei Technologies Co., Ltd.)
Priority date
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd. (华为技术有限公司)
Priority to CN202180100947.8A (publication CN117813610A)
Priority to PCT/CN2021/109839 (publication WO2023004799A1)
Publication of WO2023004799A1

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/08 - Learning methods

Definitions

  • the present application relates to the technical field of neural networks, and in particular to an electronic device and a neural network quantization method.
  • the parameters and input data of these neural network models are usually in floating-point form, and the operations are likewise performed in floating point.
  • floating-point data usually has a large bit width, such as 32 bits, so storing and operating on floating-point data consumes a large amount of hardware resources.
  • when the floating-point neural network model is large in scale, for example when the amount of parameters or input data is large, the hardware performance requirements are higher, resulting in a large hardware cost for operations based on the neural network model.
  • to address this, the prior art proposes to quantize the floating-point neural network model, converting its parameters or input data into fixed-point parameters or fixed-point input data. after quantization the number of bits is reduced, which reduces the hardware cost of storing and operating on the fixed-point data; accordingly, the hardware cost of neural network operations is also reduced.
  • however, the quantization process for the parameters or input data of the floating-point neural network model itself performs a large number of floating-point operations, so the hardware operation cost required for neural network quantization is high; moreover, the reduction in the number of bits also means a certain loss of precision, which reduces the computational accuracy of the quantized neural network model.
  • in view of this, the embodiments of the present application propose an electronic device and a neural network quantization method that can improve the accuracy of the quantized neural network model at a low hardware cost.
  • in a first aspect, an embodiment of the present application provides an electronic device including a processor and a logic circuit.
  • the processor is configured to: determine a first zero offset and a first quantization coefficient according to the maximum value and the minimum value in the floating-point data as well as the preset maximum value and the preset minimum value of the fixed-point number, where the floating-point data includes at least one of the floating-point parameters of the neural network or the floating-point input data; expand the first quantization coefficient by a multiple according to a preset fixed-point quantization coefficient to obtain a second quantization coefficient; and expand by a multiple and quantize the first zero offset according to the preset fixed-point quantization coefficient and the first quantization coefficient to obtain a second zero offset.
  • the logic circuit is configured to: quantize the data to be quantized by floating-point multiplication and fixed-point addition according to the second quantization coefficient and the second zero offset to obtain a first quantization result; and shift the first quantization result according to the preset fixed-point quantization coefficient to obtain the final quantization result of the data to be quantized.
  • in this way, the processor can determine the first zero offset and the first quantization coefficient from the floating-point data, the preset maximum value of the fixed-point number and the preset minimum value of the fixed-point number, and process them further in combination with the preset fixed-point quantization coefficient to obtain the second zero offset and the second quantization coefficient: the second quantization coefficient is obtained by expanding the first quantization coefficient by a multiple, and the second zero offset is obtained by expanding the first zero offset by a multiple and quantizing it.
  • the logic circuit can then quantize the data to be quantized by floating-point multiplication and fixed-point addition according to the input parameters (the second zero offset and the second quantization coefficient), and shift the quantization result according to the preset fixed-point quantization coefficient to obtain the final quantization result of the data to be quantized, so that the shifted final quantization result falls within the range from the preset minimum value of the fixed-point number to the preset maximum value of the fixed-point number, that is, a final quantization result that meets the quantization requirements is obtained.
  • the logic circuit only performs floating-point multiplication, fixed-point addition and shift operations, so the hardware cost of a logic circuit capable of this quantization is relatively low. therefore, the accuracy of the quantized neural network model can be improved at a lower hardware cost.
  • in a possible implementation, the processor expanding the first quantization coefficient by a multiple according to the preset fixed-point quantization coefficient to obtain the second quantization coefficient includes: the processor determines the expansion factor of the first quantization coefficient according to the preset fixed-point quantization coefficient; the processor obtains the second quantization coefficient from the product of the first quantization coefficient and the expansion factor of the first quantization coefficient.
  • in this way, the precision of the parameters input to the logic circuit can be improved.
  • moreover, the expansion factor of the first quantization coefficient can be adjusted by changing the value of the preset fixed-point quantization coefficient, making the way the precision of the first quantization coefficient is improved more flexible.
  • in a possible implementation, the preset fixed-point quantization coefficient is an integer greater than or equal to 1, and the expansion factor of the first quantization coefficient is equal to 2 raised to the power of the preset fixed-point quantization coefficient.
  • in this way, the effect of the multiple expansion is similar to using the preset fixed-point quantization coefficient as the number of bits to shift, which facilitates determining the number of bits in the subsequent shifting of the first quantization result.
  • in a possible implementation, expanding by a multiple and quantizing the first zero offset according to the preset fixed-point quantization coefficient and the first quantization coefficient to obtain the second zero offset includes: determining the expansion factor of the first zero offset according to the preset fixed-point quantization coefficient; rounding the product of the first quantization coefficient, the expansion factor of the first zero offset, and the first zero offset to obtain the second zero offset.
  • in this way, the expansion factor of the first zero offset can be adjusted by changing the value of the preset fixed-point quantization coefficient, making the way the precision of the first zero offset is improved more flexible.
  • the expansion factor of the first zero offset is equal to the expansion factor of the first quantization coefficient.
  • in this way, the expansion applied to the second zero offset matches the expansion of the quantization result obtained when the data to be quantized is multiplied by the second quantization coefficient, so that the second zero offset and the second quantization coefficient can participate in the arithmetic operations in the logic circuit at a consistent scale.
  • in a fifth possible implementation of the electronic device, quantizing the data to be quantized by floating-point multiplication and fixed-point addition according to the second quantization coefficient and the second zero offset to obtain the first quantization result includes: rounding the product of the second quantization coefficient and the data to be quantized to obtain a second quantization result; obtaining the first quantization result from the sum of the second quantization result and the second zero offset.
  • in this way, the floating-point multiplication yields a multiple-expanded second quantization result, and the fixed-point addition of the second quantization result and the second zero offset yields a multiple-expanded first quantization result, so that the high-precision attributes of the data to be quantized are retained in the first quantization result and the precision of the first quantization result is improved.
  • in a possible implementation, shifting the first quantization result according to the preset fixed-point quantization coefficient to obtain the final quantization result of the data to be quantized includes: the logic circuit shifts the first quantization result to the right, the number of shifted bits being equal to the preset fixed-point quantization coefficient.
  • in this way, the high-precision attributes of the data to be quantized are retained in the first quantization result, so that when the first quantization result is shifted to obtain the final quantization result, the precision of the final quantization result is also improved.
  • the number of bits shifted is equal to the preset fixed-point quantization coefficient, so after shifting, the value range of the final quantization result meets the quantization requirement.
  • in a possible implementation, the electronic device further includes a memory configured to store one or more of: the floating-point data, the preset maximum value of the fixed-point number, the preset minimum value of the fixed-point number, the first fixed-point data, the second zero offset, the second quantization coefficient, the preset fixed-point quantization coefficient, and the final quantization result of the data to be quantized.
  • the logic circuit includes an arithmetic logic unit ALU.
  • in a possible implementation, the data to be quantized includes one or more of: the fixed-point data obtained after the processor quantizes the floating-point data, and intermediate results or final results during neural network processing.
  • the electronic device can improve the accuracy of the data used by the neural network during the processing of the neural network, and further improve the accuracy of the input fixed-point data of the neural network and the intermediate results or final results obtained by the neural network processing.
  • in a second aspect, an embodiment of the present application provides a neural network quantization method, the method including: the processor determines a first zero offset and a first quantization coefficient according to the maximum value and the minimum value in the floating-point data as well as the preset maximum value and the preset minimum value of the fixed-point number, where the floating-point data includes at least one of the floating-point parameters of the neural network or the floating-point input data; the processor expands the first quantization coefficient by a multiple according to a preset fixed-point quantization coefficient to obtain a second quantization coefficient, and expands by a multiple and quantizes the first zero offset according to the preset fixed-point quantization coefficient and the first quantization coefficient to obtain a second zero offset; the logic circuit quantizes the data to be quantized by floating-point multiplication and fixed-point addition according to the second quantization coefficient and the second zero offset to obtain a first quantization result, and shifts the first quantization result according to the preset fixed-point quantization coefficient to obtain the final quantization result of the data to be quantized.
  • in a possible implementation of the neural network quantization method, the processor expanding the first quantization coefficient by a multiple according to the preset fixed-point quantization coefficient to obtain the second quantization coefficient includes: the processor determines the expansion factor of the first quantization coefficient according to the preset fixed-point quantization coefficient; the processor obtains the second quantization coefficient from the product of the first quantization coefficient and the expansion factor of the first quantization coefficient.
  • in a possible implementation, the preset fixed-point quantization coefficient is an integer greater than or equal to 1, and the expansion factor of the first quantization coefficient is equal to 2 raised to the power of the preset fixed-point quantization coefficient.
  • in a possible implementation, expanding by a multiple and quantizing the first zero offset according to the preset fixed-point quantization coefficient and the first quantization coefficient to obtain the second zero offset includes: determining the expansion factor of the first zero offset according to the preset fixed-point quantization coefficient; rounding the product of the first quantization coefficient, the expansion factor of the first zero offset, and the first zero offset to obtain the second zero offset.
  • in a possible implementation, the expansion factor of the first zero offset is equal to the expansion factor of the first quantization coefficient.
  • in a possible implementation, quantizing the data to be quantized by floating-point multiplication and fixed-point addition to obtain the first quantization result includes: rounding the product of the second quantization coefficient and the data to be quantized to obtain the second quantization result; obtaining the first quantization result from the sum of the second quantization result and the second zero offset.
  • in a sixth possible implementation of the neural network quantization method, shifting the first quantization result to obtain the final quantization result of the data to be quantized includes: the logic circuit shifts the first quantization result to the right, the number of shifted bits being equal to the preset fixed-point quantization coefficient, to obtain the final quantization result of the data to be quantized.
  • in a possible implementation, the method further includes storing one or more of: the floating-point data, the preset maximum value of the fixed-point number, the preset minimum value of the fixed-point number, the first fixed-point data, the second zero offset, the second quantization coefficient, the preset fixed-point quantization coefficient, and the final quantization result of the data to be quantized.
  • the logic circuit includes an arithmetic logic unit ALU.
  • in a possible implementation, the data to be quantized includes one or more of: the fixed-point data obtained after the processor quantizes the floating-point data, and intermediate results or final results during neural network processing.
  • in a third aspect, an embodiment of the present application provides a non-volatile computer-readable storage medium on which computer program instructions are stored; when the computer program instructions are executed by a processor, the neural network quantization method of the second aspect above is implemented.
  • in a fourth aspect, an embodiment of the present application provides a computer program product, including computer-readable code or a non-volatile computer-readable storage medium carrying computer-readable code; when the computer-readable code runs in a processor, the processor executes the neural network quantization method of the second aspect above.
  • Fig. 1 shows a structural diagram of an exemplary electronic device according to an embodiment of the present application;
  • Fig. 2 shows a schematic diagram of the floating-point-to-fixed-point conversion method proposed in prior art 1;
  • Fig. 3 shows a schematic diagram of the floating-point-to-fixed-point conversion method proposed in prior art 2;
  • Fig. 4 shows an exemplary working mode of an electronic device according to an embodiment of the present application;
  • Fig. 5 shows an exemplary working mode of an electronic device according to an embodiment of the present application;
  • Fig. 6 shows an exemplary application scenario of an electronic device according to an embodiment of the present application;
  • Fig. 7 shows an exemplary workflow of the neural network quantization method according to an embodiment of the present application.
  • Fig. 1 shows a structural diagram of an exemplary electronic device according to an embodiment of the present application
  • the electronic device may include a processor and a logic circuit.
  • the electronic device may further include a memory, to which the processor and the logic circuit may be connected.
  • the processor and the logic circuit can read data stored in the memory and write data to the memory.
  • the memory can store the preset values required for executing the embodiments of the present application (such as the range information of the preset fixed-point number and the preset fixed-point quantization coefficient), and can also store floating-point data, such as the parameters or input data of the floating-point neural network model, as well as intermediate and final results produced during execution.
  • the processor can process the input parameters of the logic circuit based on the floating-point data, improving the precision of those input parameters by means of multiple expansion.
  • the logic circuit can, for example, obtain the input parameters generated by the processor and the data to be quantized, perform arithmetic and shift operations, and output fixed-point data with improved precision, which can serve as parameters or input data of the fixed-point neural network model and be output to the memory for storage.
  • Fig. 2 shows a schematic diagram of the floating-point-to-fixed-point conversion method proposed in prior art 1.
  • the theoretical linear transformation formula for converting floating point to fixed point in prior art 1 is shown in formula (1):

    Xq = round((X + Z1) * S1)    (1)
  • X represents a floating-point number to be quantized, such as a parameter or input data of a floating-point neural network model
  • Xq represents a quantized fixed-point integer
  • Z1 represents the zero offset
  • S1 represents the quantization coefficient
  • round represents the function of rounding the floating-point number to a fixed-point integer.
  • the value range [Xmin, Xmax] of the floating-point number X may be determined from the multiple floating-point numbers X to be quantized.
  • Xmin represents the minimum value of the floating point number X
  • Xmax represents the maximum value of the floating point number X.
  • the value range [Qmin, Qmax] of the fixed-point integer Xq can be preset according to quantization requirements, where Qmin represents the minimum value of the fixed-point integer Xq, and Qmax represents the maximum value of the fixed-point integer Xq.
  • the zero offset Z1 can be set equal to the minimum value Xmin of the floating point number X.
  • the quantization coefficient S1 can be obtained by dividing the difference between the maximum value Qmax and the minimum value Qmin of the fixed-point integer Xq by the difference between the maximum value Xmax and the minimum value Xmin of the floating-point number X, as shown in formula (2):

    S1 = (Qmax - Qmin) / (Xmax - Xmin)    (2)
  • the floating point number X, the zero offset Z1, and the quantization coefficient S1 are all numerical values in floating point form.
  • the quantization coefficient S1 and the zero offset Z1 can be determined by the processor according to the maximum and minimum values of the floating-point numbers and the preset fixed-point integer maximum and minimum values, And the quantization coefficient S1 and the zero offset Z1 are input to the logic circuit as parameters.
  • to implement formula (1), the logic circuit needs to perform a floating-point addition (or subtraction) operation, that is, X + Z1 in formula (1); a floating-point multiplication operation, that is, (X + Z1) * S1 in formula (1); and rounding, that is, round((X + Z1) * S1).
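  • as a concrete illustration, the following Python sketch implements the prior art 1 flow described above (a minimal sketch only: the function name, the folding of Qmin into the result, and the sign convention Z1 = -Xmin are assumptions made for a runnable example, since the translated formula leaves the sign of the offset ambiguous):

```python
import numpy as np

def quantize_prior_art_1(x, q_min, q_max):
    """Prior art 1: quantize floating-point data x into [q_min, q_max].

    Uses formula (2) for the quantization coefficient S1 and
    formula (1), round((X + Z1) * S1), for the quantization itself.
    """
    x_min, x_max = float(np.min(x)), float(np.max(x))
    s1 = (q_max - q_min) / (x_max - x_min)  # formula (2)
    z1 = -x_min                             # assumed sign convention for the zero offset
    # floating-point addition, floating-point multiplication, then rounding:
    # exactly the three logic-circuit operations listed above
    xq = np.round((x + z1) * s1) + q_min    # "+ q_min" is an illustrative offset fold
    return xq.astype(np.int32)

x = np.array([-1.0, -0.37, 0.0, 0.42, 1.0])
print(quantize_prior_art_1(x, q_min=-128, q_max=127))  # [-128  -48    0   53  127]
```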
  • Fig. 3 shows a schematic diagram of the floating-point-to-fixed-point conversion method proposed in prior art 2.
  • the linear transformation formula for converting floating point to fixed point proposed in prior art 2 is shown in formula (3):

    Xq = round(X * S1) + Zq    (3)
  • X represents a floating-point number to be quantized, such as a parameter or input data of a floating-point neural network model
  • Xq represents a quantized fixed-point integer
  • S1 represents the quantization coefficient
  • the quantization coefficient S1 is obtained as in formula (2) above
  • round represents the function of rounding the floating-point number to a fixed-point integer
  • Zq = round(Z1 * S1) represents the zero offset in fixed-point form; that is, the zero offset Zq is the quantization result of the zero offset Z1.
  • the operation of formula (3) is realized by designing a corresponding logic circuit.
  • the quantization coefficient S1 and the zero offset Z1 can be determined by the processor according to the maximum and minimum values of the floating-point numbers and the preset fixed-point integer maximum and minimum values; combined with the rounding function round, the processor further determines the fixed-point zero offset Zq, which is an integer, and inputs the quantization coefficient S1 and the fixed-point zero offset Zq to the logic circuit as parameters.
  • the logic circuit then performs a floating-point multiplication, that is, X * S1 in formula (3); rounding, that is, round(X * S1); and a fixed-point addition, that is, round(X * S1) + Zq.
  • in this way, compared with the floating-point addition required by prior art 1, the hardware cost of the logic circuit required for the quantization process can be reduced.
  • the disadvantage is that the zero offset Zq input to the logic circuit is obtained by rounding the product of the floating-point zero offset Z1 and the quantization coefficient S1; this second rounding operation further amplifies the precision loss of the quantized neural network model and reduces the accuracy of the neural network's calculation results.
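  • to make the double-rounding loss concrete, the following sketch (same assumed setup as the previous example; the pre-computed zq folds Qmin in for illustration) compares formula (3) with a version that rounds only once:

```python
import numpy as np

x_min, x_max = -1.0, 1.0
q_min, q_max = -128, 127
s1 = (q_max - q_min) / (x_max - x_min)   # formula (2): 127.5
z1 = -x_min
zq = round(z1 * s1) + q_min              # first rounding, done on the processor: 0

x = np.array([-0.37, 0.42])
# prior art 2, formula (3): the second rounding happens in the logic circuit
print(np.round(x * s1) + zq)             # [-47.  54.]
# reference that rounds only once: differs by one quantization step
print(np.round((x + z1) * s1) + q_min)   # [-48.  53.]
```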
  • in view of this, the embodiments of the present application propose an electronic device and a neural network quantization method that can improve the accuracy of the quantized neural network model at a low hardware cost.
  • an electronic device including a processor and a logic circuit.
  • Fig. 4 shows an exemplary working manner of an electronic device according to an embodiment of the present application.
  • the processor is configured to: determine the first zero offset and the first quantization coefficient according to the maximum value and the minimum value in the floating-point data as well as the preset fixed-point maximum value and the preset fixed-point minimum value, where the floating-point data includes at least one of the floating-point parameters of the neural network or the floating-point input data; expand the first quantization coefficient by a multiple according to the preset fixed-point quantization coefficient to obtain the second quantization coefficient; and expand by a multiple and quantize the first zero offset according to the preset fixed-point quantization coefficient and the first quantization coefficient to obtain the second zero offset;
  • the logic circuit is configured to: quantize the data to be quantized by floating-point multiplication and fixed-point addition according to the second quantization coefficient and the second zero offset to obtain the first quantization result; and shift the first quantization result according to the preset fixed-point quantization coefficient to obtain the final quantization result of the data to be quantized.
  • in this way, the processor can determine the first zero offset and the first quantization coefficient from the floating-point data, the preset maximum value of the fixed-point number and the preset minimum value of the fixed-point number, and process them further in combination with the preset fixed-point quantization coefficient to obtain the second zero offset and the second quantization coefficient: the second quantization coefficient is obtained by expanding the first quantization coefficient by a multiple, and the second zero offset is obtained by expanding the first zero offset by a multiple and quantizing it.
  • the logic circuit can then quantize the data to be quantized by floating-point multiplication and fixed-point addition according to the input parameters (the second zero offset and the second quantization coefficient), and shift the quantization result according to the preset fixed-point quantization coefficient to obtain the final quantization result of the data to be quantized, so that the shifted final quantization result falls within the range from the preset minimum value of the fixed-point number to the preset maximum value of the fixed-point number, that is, a final quantization result that meets the quantization requirements is obtained.
  • the logic circuit only performs floating-point multiplication, fixed-point addition and shift operations, so the hardware cost of a logic circuit capable of this quantization is relatively low. therefore, the accuracy of the quantized neural network model can be improved at a lower hardware cost.
  • the data to be quantized can be different types of data; for example, it can be the fixed-point data obtained after the processor quantizes the floating-point data, an intermediate result of neural network processing (such as the output of a convolutional layer), a final result, and so on.
  • the electronic device can improve the accuracy of the data used by the neural network during the processing of the neural network, and further improve the accuracy of the input fixed-point data of the neural network and the intermediate results or final results obtained by the neural network processing.
  • Fig. 5 shows an exemplary working mode of an electronic device according to an embodiment of the present application. taking the data to be quantized as the fixed-point data obtained by the processor quantizing the floating-point data as an example, an exemplary working mode of the electronic device is introduced below.
  • in step S1, the processor determines, from the floating-point data X, the maximum value in the floating-point data, hereinafter also referred to as the floating-point maximum value Xmax, and the minimum value in the floating-point data, hereinafter also referred to as the floating-point minimum value Xmin
  • the floating-point data includes at least one of floating-point parameters of the neural network or floating-point input data
  • the floating-point data may include multiple floating-point numbers.
  • the floating-point data may include multiple floating-point numbers, and each floating-point number may, for example, correspond to a floating-point parameter.
  • the processor can count the maximum and minimum values of all floating-point parameters of the neural network model to be quantized as the maximum value Xmax of the floating-point number and the minimum value Xmin of the floating-point number.
  • the floating-point input data of the neural network model that needs to be quantized usually includes multiple values. in this case, the floating-point data can include multiple floating-point numbers, each floating-point number corresponding, for example, to one value of the input data.
  • the processor can take the maximum and minimum values over all values of the floating-point input data of the neural network model to be quantized as the floating-point maximum value Xmax and the floating-point minimum value Xmin.
  • Step S1 can be implemented based on existing technologies.
  • in step S2, the processor determines the first zero offset Z according to the floating-point minimum value Xmin, and determines the first quantization coefficient S according to the floating-point maximum value Xmax, the floating-point minimum value Xmin, the preset fixed-point maximum value Qmax and the preset fixed-point minimum value Qmin; the preset fixed-point maximum value Qmax and the preset fixed-point minimum value Qmin are determined according to the preset range [Qmin, Qmax] of the final quantization result of the floating-point data.
  • the first zero offset Z can be determined, for example, by using the floating-point minimum value Xmin as the first zero offset Z; the first quantization coefficient S can be determined, for example, with reference to formula (2), that is, by substituting Qmax, Qmin, Xmax and Xmin into formula (2) to obtain the first quantization coefficient S.
  • Step S2 can be implemented based on existing technologies.
  • the first zero offset Z corresponds to the zero offset Z1 in the prior art
  • the first quantization coefficient S corresponds to the quantization coefficient S1 in the prior art.
  • the processor quantizes the floating-point data X to obtain the first fixed-point data X'.
  • the processor may, for example, quantize the floating-point data X according to the first quantization coefficient S and the first zero offset Z to obtain corresponding first fixed-point data X'.
  • the first fixed-point data X' includes, for example, the fixed-point input data of the fixed-point neural network model or the fixed-point parameters of the fixed-point neural network model.
  • the numerical range of the first fixed-point data X' is equal to the preset numerical range [Qmin, Qmax] of the fixed-point number, that is, X' is greater than or equal to the fixed-point minimum value Qmin and less than or equal to the fixed-point maximum value Qmax.
  • the processor can quantize the floating-point data based on related techniques; for example, the first quantization coefficient S and the first zero offset Z can be substituted into formula (1) as S1 and Z1.
  • the processor can send the first fixed-point data X' to the logic circuit, or send it to the memory for the logic circuit to call.
  • in step S4, the processor expands the first quantization coefficient S by a multiple according to the preset fixed-point quantization coefficient S_shift to obtain the second quantization coefficient S' (see formula (4) below), and expands by a multiple and quantizes the first zero offset Z according to the preset fixed-point quantization coefficient S_shift and the first quantization coefficient S to obtain the second zero offset Zq' (see formula (5) below).
  • the processor can send the second quantization coefficient S' and the second zero offset Zq' to the logic circuit, or to the memory for calling by the logic circuit.
  • the second quantization coefficient S' obtained by expanding the first quantization coefficient S by a multiple, and the second zero offset Zq' obtained by expanding by a multiple and quantizing the first zero offset Z, are provided to the logic circuit for its operations. in this way, during the rounding performed in the logic circuit, the values of the high-precision bits (such as decimal places) of the first quantization coefficient S and of the first zero offset Z are preserved, so that the second fixed-point data obtained by the logic circuit using the second quantization coefficient and the second zero offset has higher precision.
  • in step S5, the logic circuit quantizes the first fixed-point data X' by floating-point multiplication and fixed-point addition according to the second quantization coefficient S' and the second zero offset Zq' to obtain the first quantization result, and shifts the first quantization result according to the preset fixed-point quantization coefficient S_shift to obtain the second fixed-point data Xq', which is used as the final quantization result of the floating-point data (see formula (6) below).
  • the first zero offset, the first quantization coefficient, and the second quantization coefficient are in floating-point form, and the preset fixed-point quantization coefficient and the second zero offset are in fixed-point form.
  • steps S1-S3 can be implemented with reference to the prior art above, and will not be repeated here. Exemplary implementations of steps S4 and S5 will be described later.
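  • for orientation before the detailed description, the following Python sketch traces steps S1-S5 end to end (a minimal sketch under the same assumptions as the earlier examples; the helper names and the folding of Qmin into the result are illustrative):

```python
import numpy as np

def processor_params(x_min, x_max, q_min, q_max, s_shift):
    """Processor side (steps S1, S2 and S4): derive the logic circuit's input parameters."""
    s = (q_max - q_min) / (x_max - x_min)   # first quantization coefficient, formula (2)
    z = -x_min                              # first zero offset (assumed sign convention)
    s2 = s * (1 << s_shift)                 # second quantization coefficient, formula (4)
    zq2 = int(round(z * s2))                # second zero offset, formula (5)
    return s2, zq2

def logic_circuit(x, s2, zq2, s_shift):
    """Logic-circuit side (step S5): float multiply, round, fixed-point add, right shift."""
    second_result = np.round(x * s2).astype(np.int64)  # floating-point multiply + round
    first_result = second_result + zq2                 # fixed-point addition
    return first_result >> s_shift                     # arithmetic right shift, formula (6)

q_min, q_max, s_shift = -128, 127, 8
s2, zq2 = processor_params(-1.0, 1.0, q_min, q_max, s_shift)
x = np.array([-1.0, -0.37, 0.42, 1.0])
print(logic_circuit(x, s2, zq2, s_shift) + q_min)  # [-128  -48   53  127]
```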
  • in step S4, the processor expanding the first quantization coefficient S by a multiple according to the preset fixed-point quantization coefficient S_shift to obtain the second quantization coefficient includes:
  • the processor determines the expansion factor of the first quantization coefficient S according to the preset fixed-point quantization coefficient S_shift; the processor obtains the second quantization coefficient S' from the product of the first quantization coefficient S and the expansion factor of the first quantization coefficient S.
  • in a possible implementation, the preset fixed-point quantization coefficient S_shift is an integer greater than or equal to 1, and the expansion factor of the first quantization coefficient S is equal to 2 raised to the power of the preset fixed-point quantization coefficient S_shift, that is, 2^S_shift. the second quantization coefficient is then given by formula (4):

    S' = S * 2^S_shift    (4)
  • in this way, the effect of the multiple expansion is similar to using the preset fixed-point quantization coefficient as the number of bits to shift, which facilitates determining the number of bits in the subsequent shifting of the first quantization result.
  • the expansion factor of the first quantization coefficient S is not limited to the above example of an exponential form with base 2; it can also take other values, as long as, after the second quantization coefficient obtained by expanding the first quantization coefficient S is input into the logic circuit, the impact of the expansion on the numerical range of the logic circuit's output can be eliminated by shifting. this application does not limit the expansion method of the first quantization coefficient S.
  • for given floating-point data and a given fixed-point range, the first quantization coefficient S (S1) is a fixed value, so the precision of the second quantization coefficient S' is determined by the fixed-point quantization coefficient S_shift: the larger the value of the fixed-point quantization coefficient S_shift, the higher the precision of the second quantization coefficient S'.
  • since the second quantization coefficient S' is one of the parameters input to the logic circuit, the larger the value of the fixed-point quantization coefficient S_shift, the higher the quantization accuracy of the logic circuit.
  • a suitable fixed-point quantization coefficient S_shift can be preset according to different quantization precision requirements.
  • the binary form corresponding to the second quantization coefficient S' is equivalent to shifting each bit of the binary form of the first quantization coefficient S to the left by S_shift bits, with the high bits moved out (discarded) and the vacated low bits filled with 0.
  • in this way, the value of the high-precision (for example, fractional) part of the first quantization coefficient S is moved to higher bit positions and is not discarded in the rounding; the resulting second quantization coefficient S' therefore retains the high-precision part of the first quantization coefficient S, and its precision is higher than that of the first quantization coefficient S before the expansion.
  • in this way, the precision of the parameters input to the logic circuit can be improved.
  • moreover, the expansion factor of the first quantization coefficient can be adjusted by changing the value of the preset fixed-point quantization coefficient, making the way the precision of the first quantization coefficient is improved more flexible.
  • also in step S4, expanding by a multiple and quantizing the first zero offset Z according to the preset fixed-point quantization coefficient S_shift and the first quantization coefficient S to obtain the second zero offset Zq' includes:
  • the processor determines the expansion factor of the first zero offset Z according to the preset fixed-point quantization coefficient S_shift; the processor rounds the product of the first quantization coefficient S, the expansion factor of the first zero offset Z, and the first zero offset Z to obtain the second zero offset Zq'.
  • in this way, the expansion factor of the first zero offset can be adjusted by changing the value of the preset fixed-point quantization coefficient, making the way the precision of the first zero offset is improved more flexible.
  • in a possible implementation, the expansion factor of the first zero offset Z is equal to the expansion factor of the first quantization coefficient S, that is, 2^S_shift. rounding the expanded product then gives the second zero offset Zq', as shown in formula (5):

    Zq' = round(Z * S * 2^S_shift) = round(Z * S')    (5)

  • compared with the zero offset Zq of prior-art formula (3), which quantizes the zero offset Z1 using only the quantization coefficient S1, the second zero offset Zq' is thus enlarged by a factor of 2^S_shift.
  • in this way, the expansion applied to the second zero offset matches the expansion of the quantization result obtained when the data to be quantized is multiplied by the second quantization coefficient, so that the second zero offset and the second quantization coefficient can participate in the arithmetic operations in the logic circuit at a consistent scale.
  • for given floating-point data, the first zero offset Z (the zero offset Z1) is a fixed value; as can be seen from formula (2) above, for a fixed fixed-point maximum value Qmax and fixed-point minimum value Qmin, the first quantization coefficient S (S1) is also a fixed value.
  • therefore, according to formula (5), the precision of the second zero offset Zq' is determined by the expansion factor 2^S_shift of the first zero offset Z: the larger the value of the fixed-point quantization coefficient S_shift, the higher the precision of the second zero offset Zq'.
  • since the second zero offset Zq' is one of the parameters input to the logic circuit, the larger the value of the fixed-point quantization coefficient S_shift, the higher the quantization accuracy of the logic circuit.
  • since the product of the first quantization coefficient S and the expansion factor of the first zero offset Z is equal to the second quantization coefficient S' (see formula (4)), the second zero offset Zq' can also be regarded as the rounding result of the product of the first zero offset Z and the second quantization coefficient S'.
  • the binary form corresponding to the product Z * S' of the first zero offset Z and the second quantization coefficient S' is equivalent to shifting each bit of the binary form of the product Z * S of the first zero offset Z and the first quantization coefficient S to the left by S_shift bits, with the high bits moved out (discarded) and the vacated low bits filled with 0. therefore, the precision of the product of the first zero offset Z and the second quantization coefficient S' is higher than that of the product of the first zero offset Z and the first quantization coefficient S before the expansion.
  • accordingly, the second zero offset Zq' obtained by rounding the product of the first zero offset Z and the second quantization coefficient S' (see formula (5)) also has higher precision than the zero offset Zq obtained by rounding the product of the first zero offset Z and the first quantization coefficient S (see formula (3)).
  • in this way, the value of the high-precision (such as fractional) part of the first zero offset Z is moved to higher bit positions and is not discarded in the rounding but retained in the second zero offset Zq'. the second zero offset Zq' therefore retains the high-precision part of the first zero offset Z; its precision is higher than that of the prior-art zero offset Zq, and it is closer to the first zero offset Z. therefore, the precision of the parameters input to the logic circuit can be improved.
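  • a small numeric check of this precision claim (the values of Z, S and S_shift are illustrative):

```python
# compare the effective zero offset of prior art 2 with the proposed scheme
z, s, s_shift = 0.73, 127.5, 8           # illustrative first zero offset and coefficient

zq = round(z * s)                        # prior-art style, as in formula (3): 93
zq2 = round(z * s * (1 << s_shift))      # proposed, formula (5): 23827

print(abs(z * s - zq))                   # prior-art offset error: 0.075
print(abs(z * s - zq2 / (1 << s_shift))) # effective error after expansion: ~0.0008
```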
  • Step S4 may be performed after completing step S2.
  • the embodiment of the present application does not limit the execution sequence of S3 and S4.
  • in step S5, owing to the multiple expansion, the first quantization result obtained when the logic circuit uses the second quantization coefficient is also expanded by a multiple. in this case, the range of the first quantization result cannot meet the preset range of the final quantization result of the floating-point data, so step S5 obtains the second fixed-point data by shifting, such that the value of the second fixed-point data satisfies the preset range of the final quantization result of the floating-point data.
  • the following introduces an exemplary method for the electronic device to determine the first quantization result based on step S5 and determine the final quantization result of the floating-point data based on the first quantization result according to the embodiment of the present application.
  • in step S5, the logic circuit quantizing the first fixed-point data X' by floating-point multiplication and fixed-point addition according to the second quantization coefficient S' and the second zero offset Zq' to obtain the first quantization result includes:
  • the logic circuit rounds the product of the second quantization coefficient S' and the first fixed-point data X' to obtain the second quantization result; the logic circuit obtains the first quantization result from the sum of the second quantization result and the second zero offset Zq'.
  • the second quantization coefficient S' is expanded by 2^S_shift times relative to the first quantization coefficient (the quantization coefficient S1 in the prior art). therefore, the product X' * S' obtained by multiplying the second quantization coefficient S' by the first fixed-point data X' is also enlarged by 2^S_shift times, and the rounding result round(X' * S') of this product (the second quantization result) is a rounding result after multiple expansion.
  • likewise, the second zero offset Zq' is the rounding result of the product of the first zero offset Z and the second quantization coefficient S', so the second zero offset Zq' is also a rounding result after multiple expansion.
  • the second zero offset Zq' can therefore be added to the second quantization result by fixed-point addition to obtain the first quantization result, round(X' * S') + Zq', which is a quantization result expanded by a multiple equal to 2^S_shift.
  • in this way, the floating-point multiplication yields a multiple-expanded second quantization result, and the fixed-point addition of the second quantization result and the second zero offset yields a multiple-expanded first quantization result, so that the high-precision attributes of the data to be quantized are retained in the first quantization result and the precision of the first quantization result is improved.
  • also in step S5, shifting the first quantization result according to the preset fixed-point quantization coefficient S_shift to obtain the second fixed-point data Xq' includes:
  • the logic circuit shifts the first quantization result to the right, the number of shifted bits being equal to the preset fixed-point quantization coefficient S_shift.
  • the relationship between the first quantization result, the preset fixed-point quantization coefficient S_shift and the second fixed-point data Xq' can be written as formula (6):

    Xq' = (round(X' * S') + Zq') >> S_shift    (6)
  • that is, in step S5 the first quantization result is reduced by the corresponding multiple through a shift operation to obtain the second fixed-point data.
  • in this way, the second fixed-point data falls within the preset fixed-point value range [Qmin, Qmax], so that a convolution operation can subsequently be performed on the second fixed-point data.
  • in this way, the high-precision attributes of the data to be quantized are retained in the first quantization result, so that when the first quantization result is shifted to obtain the final quantization result, the precision of the final quantization result is also improved.
  • the number of bits shifted is equal to the preset fixed-point quantization coefficient, so after shifting, the value range of the final quantization result meets the quantization requirement.
  • the shift is a bit operation: a right shift moves all bits of the binary form to the right by the number of shift bits (the fixed-point quantization coefficient S_shift), the low bits are moved out (discarded), and the vacated high bits are filled with the sign bit, that is, with 0 for positive numbers and with 1 for negative numbers.
  • shifting right by S_shift bits is therefore equivalent to dividing the first quantization result by 2^S_shift and rounding down (toward negative infinity).
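  • a quick check of this equivalence in plain Python (the sample values are arbitrary):

```python
import math

s_shift = 8
for v in (20563, 255, -1, -20563):
    # arithmetic right shift == division by 2^S_shift rounded down (floor)
    assert (v >> s_shift) == math.floor(v / (1 << s_shift))
    print(v, "->", v >> s_shift)
```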
  • to sum up, in the logic circuit the product of the first fixed-point data X' and the second quantization coefficient S' is obtained by a floating-point multiplication, the first quantization result is obtained by a fixed-point addition, and the second fixed-point data is obtained by a shift operation.
  • in this way, the logic circuit only needs to perform floating-point multiplication, fixed-point addition and shift operations; its area and power consumption are small, and neural network quantization can be realized at a low hardware cost.
  • a convolution operation may then be performed based on the second fixed-point data with improved precision.
  • Fig. 6 shows an exemplary application scenario of an electronic device according to an embodiment of the present application.
  • the processor executes the above step S3 to complete the processing of the floating-point data and obtain the data to be quantized (first fixed-point data).
  • here the data to be quantized is the fixed-point data obtained after the processor quantizes the floating-point data, and the fixed-point neural network model can be deployed from it: for example, the processor executes the above step S3 to quantize the floating-point input data and the floating-point parameters, obtaining fixed-point input data and fixed-point parameters (first fixed-point data), and thereby obtaining the fixed-point neural network and its fixed-point input data.
  • in step S3, one set of first quantization coefficient and first zero offset can be obtained for the floating-point input data, and multiple sets of first quantization coefficients and first zero offsets can be obtained for the parameters of the multiple layers of the neural network.
  • since the values of the input data and of each layer's parameters may differ, the floating-point maximum and minimum values of the input data and of each layer's parameters may also differ, so the multiple sets of first quantization coefficients and first zero offsets are different.
  • however, on the premise of using the same fixed-point maximum value and fixed-point minimum value, the data to be quantized obtained by the processor in step S3 all lie in the same preset range, that is, their values are between Qmin and Qmax.
  • the processor may then execute the above step S4 to process the first quantization coefficients and the first zero offsets to obtain the second quantization coefficients and the second zero offsets.
  • the multiple sets of first quantization coefficients and first zero offsets correspond respectively to multiple sets of second quantization coefficients, second zero offsets, and fixed-point quantization coefficients.
  • the logic circuit may, for example, obtain the input parameters generated by the processor (a set of second quantization coefficient, second zero offset and fixed-point quantization coefficient) and the first fixed-point data generated by the processor (corresponding to that set of second quantization coefficient, second zero offset and fixed-point quantization coefficient), and perform the arithmetic and shift operations during the operation of the fixed-point neural network model to obtain second fixed-point data that corresponds to the first fixed-point data and has higher precision.
  • in this way, the convolution result of the fixed-point neural network model based on the second fixed-point data is more accurate, and the higher-precision convolution result can be used as the input data of the next convolutional layer for that layer's convolution operation.
  • for example, assume the fixed-point neural network obtained by the processor is used to perform operations on input data (denoted a). when the fixed-point neural network starts to run, it first completes the convolution of the network's input data with the parameters (weights) of the first layer of the neural network.
  • these parameters may be fixed-point-form parameters obtained after the processor quantizes the floating-point parameters of the first layer of the neural network (an example of the first fixed-point data X'), denoted b, for example.
  • the logic circuit can process the input data a and parameter b separately to obtain input data a1 and parameter b1 (second fixed-point data) with higher accuracy.
  • when the logic circuit processes the input data a, the input data a is input to the logic circuit as data to be quantized, and the set of second quantization coefficient and second zero offset corresponding to the input data a (determined according to the maximum and minimum values of the floating-point data corresponding to a and the value range of the fixed-point data corresponding to a) and the fixed-point quantization coefficient are also input to the logic circuit.
  • the logic circuit can then compute and output the data a1.
  • similarly, when the logic circuit processes the parameter b, the parameter b is input to the logic circuit as data to be quantized, and the set of second quantization coefficient and second zero offset corresponding to the parameter b (determined according to the maximum and minimum values of the floating-point data corresponding to b and the value range of the fixed-point data corresponding to b) and the fixed-point quantization coefficient are also input to the logic circuit.
  • the logic circuit can then compute and output the parameter b1.
  • convolution is performed based on the input data a1 and the parameter b1 to obtain the convolution result c1.
  • the convolution result c1 is more accurate than the convolution result of the raw input data a and parameter b. by analogy, whenever the neural network convolves any layer's convolution result with that layer's weights, the convolution result has already been improved in precision and the logic circuit has processed the weights to improve their precision, so the convolution operation of each layer can achieve relatively high accuracy.
  • when the processor inversely quantizes the convolution result of any layer to obtain the corresponding floating-point convolution result, the accuracy of the obtained floating-point convolution result is also higher and closer to the corresponding floating-point convolution result of the original floating-point neural network, as sketched below.
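  • the per-layer flow just described might look roughly as follows (a schematic sketch only: the 1-D convolution, the parameter values and the helper name are illustrative, and one set of per-tensor parameters is assumed for a and for b):

```python
import numpy as np

def requantize(x, s2, zq2, s_shift):
    """Logic-circuit requantization: float multiply, round, fixed-point add, right shift."""
    return (np.round(x * s2).astype(np.int64) + zq2) >> s_shift

s_shift = 8
s2_a, zq2_a = 12.8, 640     # illustrative second coefficient/offset for input data a
s2_b, zq2_b = 204.8, 0      # illustrative second coefficient/offset for parameter b

a = np.array([3, -7, 12, 5], dtype=np.int64)  # fixed-point input data a
b = np.array([2, 1, -1], dtype=np.int64)      # fixed-point first-layer weights b

a1 = requantize(a, s2_a, zq2_a, s_shift)      # higher-precision input a1
b1 = requantize(b, s2_b, zq2_b, s_shift)      # higher-precision weights b1
c1 = np.convolve(a1, b1, mode="valid")        # convolution result c1 for the next layer
print(a1, b1, c1)
```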
  • the working process of the logic circuit can be regarded as first adjusting, by mapping, the numerical range of the data to be quantized to a numerical range larger than the preset fixed-point numerical field, so as to improve precision and perform high-precision calculation, and then restoring the calculation result, by mapping, to the preset fixed-point numerical field to realize quantization.
  • different fixed-point quantization coefficients S_shift can be set for different floating-point data according to different requirements.
  • in this way, the precision of the second fixed-point data is closer to that of the original floating-point data, so that the accuracy of the output data of the quantized neural network can be improved.
  • the logic circuit provided in the embodiments of the present application may be an arithmetic logic unit (ALU), configured to implement the arithmetic and shift operations shown in formula (6).
  • the logic circuit adopts floating-point multiplication, fixed-point addition and shifting, so the cost in area and power consumption is low and the advantage in area and power consumption is obvious. moreover, the precision of the parameters input to the logic circuit is higher, which makes the quantization result of the logic circuit more accurate.
  • in a possible implementation, the electronic device further includes a memory configured to store one or more of: the floating-point data, the preset maximum value of the fixed-point number, the preset minimum value of the fixed-point number, the first fixed-point data, the second zero offset, the second quantization coefficient, the preset fixed-point quantization coefficient, and the final quantization result of the data to be quantized.
  • the data to be quantized may be the fixed-point data obtained after the processor quantizes the floating-point data, or the intermediate or final result of the neural network operation.
  • the logic circuit in the embodiments of the present application quantizes the data to be quantized by floating-point multiplication and fixed-point addition according to the second quantization coefficient and the second zero offset to obtain the first quantization result; when the data to be quantized is the fixed-point data obtained by the processor quantizing the floating-point data, refer to step S5 and the related description above.
  • the logic circuit in the embodiments of the present application then shifts the first quantization result according to the preset fixed-point quantization coefficient to obtain the final quantization result of the floating-point data; that is, the logic circuit shifts the first quantization result to the right by a number of bits equal to the preset fixed-point quantization coefficient. when the data to be quantized is the fixed-point data obtained by the processor quantizing the floating-point data, refer to step S5 and the related description above.
  • FIG. 7 shows an exemplary workflow of the neural network quantification method according to an embodiment of the present application.
  • the method can be applied to an electronic device according to an embodiment of the present application, including:
  • the processor determines the first zero offset and the first quantization coefficient according to the maximum and minimum values in the floating-point data and the preset maximum and minimum values of the fixed-point number, where the floating-point data includes at least one of floating-point parameters or floating-point input data of the neural network;
  • the processor expands the first quantization coefficient by a multiple according to the preset fixed-point quantization coefficient to obtain a second quantization coefficient, and expands and quantizes the first zero offset according to the preset fixed-point quantization coefficient and the first quantization coefficient to obtain a second zero offset;
  • the logic circuit quantizes the data to be quantized by floating-point multiplication and fixed-point addition to obtain a first quantization result, and shifts the first quantization result according to the preset fixed-point quantization coefficient to obtain the final quantization result of the data to be quantized.
  • the processor expanding the first quantization coefficient according to the preset fixed-point quantization coefficient to obtain the second quantization coefficient includes: the processor determining the expansion factor of the first quantization coefficient according to the preset fixed-point quantization coefficient; and the processor obtaining the second quantization coefficient from the product of the first quantization coefficient and the expansion factor of the first quantization coefficient.
  • the preset fixed-point quantization coefficient is an integer greater than or equal to 1, and the expansion factor of the first quantization coefficient is equal to a value whose base is 2 and whose exponent is the preset fixed-point quantization coefficient.
  • expanding and quantizing the first zero offset according to the preset fixed-point quantization coefficient and the first quantization coefficient to obtain the second zero offset includes: determining the expansion factor of the first zero offset according to the preset fixed-point quantization coefficient; and rounding the product of the first quantization coefficient, the expansion factor of the first zero offset, and the first zero offset to obtain the second zero offset.
  • the expansion factor of the first zero offset is equal to the expansion factor of the first quantization coefficient.
  • quantizing the data to be quantized by floating-point multiplication and fixed-point addition to obtain the first quantization result includes: rounding the product of the second quantization coefficient and the data to be quantized to obtain a second quantization result; and obtaining the first quantization result from the sum of the second quantization result and the second zero offset.
  • shifting the first quantization result according to the preset fixed-point quantization coefficient to obtain the final quantization result of the data to be quantized includes: the logic circuit shifting the first quantization result to the right to obtain the final quantization result of the data to be quantized, where the number of shifted bits is equal to the preset fixed-point quantization coefficient.
  • the method further includes: storing one or more of the floating-point data, the preset maximum value of the fixed-point number, the preset minimum value of the fixed-point number, the first fixed-point data, the second zero offset, the second quantization coefficient, the preset fixed-point quantization coefficient, and the final quantization result of the data to be quantized.
  • the logic circuit includes an arithmetic logic unit ALU.
  • the data to be quantized includes one or more of fixed-point data obtained after the processor quantizes the floating-point data, intermediate results or final results during neural network processing.
  • the present application proposes a non-volatile computer-readable storage medium, on which computer program instructions are stored, and when the computer program instructions are executed by a processor, the above neural network quantization method is implemented.
  • the present application proposes a computer program product, including computer-readable code, or a non-volatile computer-readable storage medium bearing the computer-readable code, where when the computer-readable code runs in a processor, the processor executes the above neural network quantization method.
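As a concrete illustration of the three logic-circuit operations named in the points above (formula (6)), the following is a minimal Python sketch; the helper name, the NumPy-based rounding, and the parameter values are assumptions of this sketch, not part of the application:

```python
import numpy as np

def alu_requantize(x_fixed, S2, Zq2, S_shift):
    """Hypothetical helper mirroring formula (6): one floating-point multiply
    with rounding, one fixed-point addition, one arithmetic right shift.
    S2 and Zq2 stand for the second quantization coefficient and the second
    zero offset prepared by the processor."""
    second_result = int(np.round(x_fixed * S2))  # round(X' * S'): float multiply + round
    first_result = second_result + Zq2           # + Zq': fixed-point addition
    return first_result >> S_shift               # >> S_shift restores the preset range
```

Python's right shift on integers is arithmetic (sign-preserving), which matches the shift behavior described for the logic circuit.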

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

Embodiments of the present application provide an electronic device and a neural network quantization method. The electronic device includes a processor and a logic circuit. The processor is configured to determine a first zero offset and a first quantization coefficient according to floating-point data and preset maximum and minimum fixed-point values, to expand the first quantization coefficient by a multiple to obtain a second quantization coefficient, and to expand and quantize the first zero offset to obtain a second zero offset. The logic circuit is configured to quantize the data to be quantized by floating-point multiplication and fixed-point addition according to the second quantization coefficient and the second zero offset to obtain a first quantization result, and to shift the first quantization result according to a preset fixed-point quantization coefficient to obtain the final quantization result of the data to be quantized. The electronic device and neural network quantization method according to the embodiments of the present application can improve the accuracy of a quantized neural network model at low hardware cost.

Description

Electronic device and neural network quantization method
Technical Field
The present application relates to the technical field of neural networks, and in particular to an electronic device and a neural network quantization method.
Background
With the application of deep learning technology, a large number of neural network models based on deep learning have emerged. The parameters or input data of these neural network models are usually in floating-point form, and their operations are likewise performed as floating-point operations. Floating-point data usually has a high bit width, for example 32 bits, so the storage and computation of floating-point data consume considerable hardware resources. When a floating-point neural network model is large, for example when the number of parameters or the amount of input data is large, the hardware performance requirements are even higher, so that computation based on the neural network model incurs a large hardware cost.
To solve the problem of the excessive hardware cost of computing with floating-point neural network models, the prior art proposes to quantize the floating-point neural network model, converting its parameters or input data into fixed-point parameters or fixed-point input data. Quantization reduces the bit width, which lowers the hardware cost of storing and operating on the fixed-point data; accordingly, the hardware cost of neural network computation is also reduced.
However, the quantization process of the parameters or input data of a floating-point neural network model itself involves a large number of floating-point operations, so the hardware cost of neural network quantization is high; moreover, the reduced bit width implies a certain loss of precision, so the computational accuracy of the neural network model decreases.
Summary
In view of this, embodiments of the present application propose an electronic device and a neural network quantization method, which can improve the accuracy of a quantized neural network model at low hardware cost.
In a first aspect, an embodiment of the present application provides an electronic device including a processor and a logic circuit. The processor is configured to: determine a first zero offset and a first quantization coefficient according to the maximum and minimum values in floating-point data and a preset maximum fixed-point value and a preset minimum fixed-point value, where the floating-point data includes at least one of floating-point parameters or floating-point input data of a neural network; expand the first quantization coefficient by a multiple according to a preset fixed-point quantization coefficient to obtain a second quantization coefficient; and expand and quantize the first zero offset according to the preset fixed-point quantization coefficient and the first quantization coefficient to obtain a second zero offset. The logic circuit is configured to quantize the data to be quantized by floating-point multiplication and fixed-point addition according to the second quantization coefficient and the second zero offset to obtain a first quantization result, and to shift the first quantization result according to the preset fixed-point quantization coefficient to obtain the final quantization result of the data to be quantized.
According to the electronic device of the embodiments of the present application, the processor can determine the first zero offset and the first quantization coefficient from the floating-point data and the preset maximum and minimum fixed-point values, and further process them with the preset fixed-point quantization coefficient to obtain the second zero offset and the second quantization coefficient. Because the second quantization coefficient is obtained by expanding the first quantization coefficient by a multiple, and the second zero offset is obtained by expanding and quantizing the first zero offset, the second zero offset and the second quantization coefficient have higher precision, which improves the precision of the input parameters of the logic circuit. The logic circuit can quantize the data to be quantized with floating-point multiplication and fixed-point addition according to the input parameters (the second zero offset and the second quantization coefficient), and shift the quantization result according to the preset fixed-point quantization coefficient, so that the shifted final quantization result falls within the range from the preset minimum fixed-point value to the preset maximum fixed-point value, satisfying the quantization requirement. Moreover, since the logic circuit only performs floating-point multiplication, fixed-point addition, and shifting, the hardware cost of the logic circuit that completes the quantization is low. The accuracy of the quantized neural network model can therefore be improved at a relatively low hardware cost.
According to the first aspect, in a first possible implementation of the electronic device, the processor expanding the first quantization coefficient according to the preset fixed-point quantization coefficient to obtain the second quantization coefficient includes: the processor determining the expansion factor of the first quantization coefficient according to the preset fixed-point quantization coefficient; and the processor obtaining the second quantization coefficient from the product of the first quantization coefficient and the expansion factor of the first quantization coefficient.
In this way, the precision of the parameters input to the logic circuit can be improved. According to the precision requirement, the expansion factor of the first quantization coefficient can be adjusted by changing the value of the preset fixed-point quantization coefficient, making the way of improving the precision of the first quantization coefficient more flexible.
According to the first possible implementation of the first aspect, in a second possible implementation of the electronic device, the preset fixed-point quantization coefficient is an integer greater than or equal to 1, and the expansion factor of the first quantization coefficient is equal to 2 raised to the power of the preset fixed-point quantization coefficient.
In this way, the effect of the expansion approximates that of shifting by the preset fixed-point quantization coefficient, which makes it easy to determine the number of bits by which the first quantization result is subsequently shifted.
According to the second possible implementation of the first aspect, in a third possible implementation of the electronic device, expanding and quantizing the first zero offset according to the preset fixed-point quantization coefficient and the first quantization coefficient to obtain the second zero offset includes: determining the expansion factor of the first zero offset according to the preset fixed-point quantization coefficient; and rounding the product of the first quantization coefficient, the expansion factor of the first zero offset, and the first zero offset to obtain the second zero offset.
In this way, the precision of the parameters input to the logic circuit can be improved. According to the precision requirement, the expansion factor of the first zero offset can be adjusted by changing the value of the preset fixed-point quantization coefficient, making the way of improving the precision of the first zero offset more flexible.
According to the third possible implementation of the first aspect, in a fourth possible implementation of the electronic device, the expansion factor of the first zero offset is equal to the expansion factor of the first quantization coefficient.
Using the same expansion factor as the first quantization coefficient means that, after the first zero offset is expanded and then quantized with the first quantization coefficient, the value range of the resulting second zero offset matches the value range of the quantization result obtained when the data to be quantized is processed with the second quantization coefficient, so that the second zero offset and the second quantization coefficient can participate together in the arithmetic operations of the logic circuit.
According to the first aspect or any one of the above possible implementations of the first aspect, in a fifth possible implementation of the electronic device, quantizing the data to be quantized by floating-point multiplication and fixed-point addition according to the second quantization coefficient and the second zero offset to obtain the first quantization result includes: rounding the product of the second quantization coefficient and the data to be quantized to obtain a second quantization result; and obtaining the first quantization result from the sum of the second quantization result and the second zero offset.
In this way, a multiplied-up second quantization result is obtained from the floating-point multiplication of the second quantization coefficient and the data to be quantized, and a multiplied-up first quantization result is obtained from the fixed-point addition of the second quantization result and the second zero offset, so that the high-precision attributes of the data to be quantized are preserved in the first quantization result and its precision is improved.
According to any one of the second to the fifth possible implementations of the first aspect, in a sixth possible implementation of the electronic device, shifting the first quantization result according to the preset fixed-point quantization coefficient to obtain the final quantization result of the data to be quantized includes: the logic circuit shifting the first quantization result to the right to obtain the final quantization result of the data to be quantized, where the number of shifted bits is equal to the preset fixed-point quantization coefficient.
The first quantization result preserves the high-precision attributes of the data to be quantized, so that when the first quantization result is shifted to obtain the final quantization result, the precision of the final quantization result is also improved. Since the number of shifted bits equals the preset fixed-point quantization coefficient, the value range of the shifted final quantization result satisfies the quantization requirement.
According to the first aspect or any one of the above possible implementations of the first aspect, in a seventh possible implementation of the electronic device, the electronic device further includes a memory, and the memory is configured to store one or more of the floating-point data, the preset maximum fixed-point value, the preset minimum fixed-point value, the first fixed-point data, the second zero offset, the second quantization coefficient, the preset fixed-point quantization coefficient, and the final quantization result of the data to be quantized.
According to the first aspect or any one of the above possible implementations of the first aspect, in an eighth possible implementation of the electronic device, the logic circuit includes an arithmetic logic unit (ALU).
According to the first aspect or any one of the above possible implementations of the first aspect, in a ninth possible implementation of the electronic device, the data to be quantized includes one or more of fixed-point data obtained after the processor quantizes the floating-point data, and intermediate or final results of neural network processing.
In this way, the electronic device can improve the precision of the data used by the neural network during its processing, further improving the accuracy of the fixed-point input data of the neural network as well as the intermediate or final results obtained by the neural network.
In a second aspect, an embodiment of the present application provides a neural network quantization method, including: a processor determining a first zero offset and a first quantization coefficient according to the maximum and minimum values in floating-point data and a preset maximum fixed-point value and a preset minimum fixed-point value, where the floating-point data includes at least one of floating-point parameters or floating-point input data of a neural network; the processor expanding the first quantization coefficient according to a preset fixed-point quantization coefficient to obtain a second quantization coefficient, and expanding and quantizing the first zero offset according to the preset fixed-point quantization coefficient and the first quantization coefficient to obtain a second zero offset; and a logic circuit quantizing the data to be quantized by floating-point multiplication and fixed-point addition according to the second quantization coefficient and the second zero offset to obtain a first quantization result, and shifting the first quantization result according to the preset fixed-point quantization coefficient to obtain the final quantization result of the data to be quantized.
According to the second aspect, in a first possible implementation of the neural network quantization method, the processor expanding the first quantization coefficient according to the preset fixed-point quantization coefficient to obtain the second quantization coefficient includes: the processor determining the expansion factor of the first quantization coefficient according to the preset fixed-point quantization coefficient; and the processor obtaining the second quantization coefficient from the product of the first quantization coefficient and the expansion factor of the first quantization coefficient.
According to the first possible implementation of the second aspect, in a second possible implementation of the neural network quantization method, the preset fixed-point quantization coefficient is an integer greater than or equal to 1, and the expansion factor of the first quantization coefficient is equal to 2 raised to the power of the preset fixed-point quantization coefficient.
According to the second possible implementation of the second aspect, in a third possible implementation of the neural network quantization method, expanding and quantizing the first zero offset according to the preset fixed-point quantization coefficient and the first quantization coefficient to obtain the second zero offset includes: determining the expansion factor of the first zero offset according to the preset fixed-point quantization coefficient; and rounding the product of the first quantization coefficient, the expansion factor of the first zero offset, and the first zero offset to obtain the second zero offset.
According to the third possible implementation of the second aspect, in a fourth possible implementation of the neural network quantization method, the expansion factor of the first zero offset is equal to the expansion factor of the first quantization coefficient.
According to the second aspect or any one of the above possible implementations of the second aspect, in a fifth possible implementation of the neural network quantization method, quantizing the data to be quantized by floating-point multiplication and fixed-point addition according to the second quantization coefficient and the second zero offset to obtain the first quantization result includes: rounding the product of the second quantization coefficient and the data to be quantized to obtain a second quantization result; and obtaining the first quantization result from the sum of the second quantization result and the second zero offset.
According to any one of the second to the fifth possible implementations of the second aspect, in a sixth possible implementation of the neural network quantization method, shifting the first quantization result according to the preset fixed-point quantization coefficient to obtain the final quantization result of the data to be quantized includes: the logic circuit shifting the first quantization result to the right to obtain the final quantization result of the data to be quantized, where the number of shifted bits is equal to the preset fixed-point quantization coefficient.
According to the second aspect or any one of the above possible implementations of the second aspect, in a seventh possible implementation of the neural network quantization method, the method further includes storing one or more of the floating-point data, the preset maximum fixed-point value, the preset minimum fixed-point value, the first fixed-point data, the second zero offset, the second quantization coefficient, the preset fixed-point quantization coefficient, and the final quantization result of the data to be quantized.
According to the second aspect or any one of the above possible implementations of the second aspect, in an eighth possible implementation of the neural network quantization method, the logic circuit includes an arithmetic logic unit (ALU).
According to the second aspect or any one of the above possible implementations of the second aspect, in a ninth possible implementation of the neural network quantization method, the data to be quantized includes one or more of fixed-point data obtained after the processor quantizes the floating-point data, and intermediate or final results of neural network processing.
In a third aspect, an embodiment of the present application provides a non-volatile computer-readable storage medium having computer program instructions stored thereon, which, when executed by a processor, implement the neural network quantization method of the second aspect.
In a fourth aspect, an embodiment of the present application provides a computer program product, including computer-readable code or a non-volatile computer-readable storage medium carrying computer-readable code, where when the computer-readable code runs in a processor, the processor executes the neural network quantization method of the second aspect.
Brief Description of the Drawings
FIG. 1 shows a structural diagram of an exemplary electronic device according to an embodiment of the present application;
FIG. 2 is a schematic diagram of the method for converting floating point to fixed point proposed by prior art 1;
FIG. 3 is a schematic diagram of the method for converting floating point to fixed point proposed by prior art 2;
FIG. 4 shows an exemplary working mode of the electronic device according to an embodiment of the present application;
FIG. 5 shows an exemplary working mode of the electronic device according to an embodiment of the present application;
FIG. 6 shows an exemplary application scenario of the electronic device according to an embodiment of the present application;
FIG. 7 shows an exemplary workflow of the neural network quantization method according to an embodiment of the present application.
Detailed Description
FIG. 1 shows a structural diagram of an exemplary electronic device according to an embodiment of the present application. The electronic device may include a processor and a logic circuit, and may further include a memory, to which the processor and the logic circuit may, for example, be connected. The processor and the logic circuit can read the data stored in the memory and output data to the memory. The memory may store preset values required by embodiments of the present application (for example, the value-range information of the preset fixed-point number and the preset fixed-point quantization coefficient described below), and may also store floating-point data, such as the parameters or input data of a floating-point neural network model, as well as intermediate and final results produced during execution. The processor can process the floating-point data to obtain the input parameters of the logic circuit, and improve the precision of those input parameters by expanding them by a multiple. The logic circuit can, for example, obtain the input parameters generated by the processor and the data to be quantized, perform arithmetic and shift operations, and output fixed-point data of improved precision, which can serve as the parameters or input data of the fixed-point neural network model and be stored in the memory.
The technical principles of quantizing the parameters of a floating-point neural network model are described below with reference to FIG. 2 and FIG. 3.
FIG. 2 is a schematic diagram of the method for converting floating point to fixed point proposed by prior art 1. The theoretical linear transformation formula of prior art 1 is shown in formula (1):
Xq = round((X + Z1) * S1)      (1)
In formula (1), X denotes the floating-point number to be quantized, for example a parameter or input data of the floating-point neural network model, and Xq denotes the quantized fixed-point integer. Z1 denotes the zero offset, S1 denotes the quantization coefficient, and round denotes the function that rounds a floating-point number to the nearest fixed-point integer. Prior art 1 implements the operation of formula (1) by designing a corresponding logic circuit.
There may be multiple floating-point numbers X, from which the value range [Xmin, Xmax] of X can be determined, where Xmin denotes the minimum and Xmax the maximum of the floating-point numbers X. The value range [Qmin, Qmax] of the fixed-point integer Xq can be preset according to the quantization requirement, where Qmin denotes the minimum and Qmax the maximum of the fixed-point integer Xq. The zero offset Z1 can be set equal to the minimum value Xmin of the floating-point numbers X. The quantization coefficient S1 can be obtained by dividing the difference between the maximum Qmax and the minimum Qmin of the fixed-point integer Xq by the difference between the maximum Xmax and the minimum Xmin of the floating-point numbers X, as shown in formula (2):
S1 = (Qmax - Qmin) / (Xmax - Xmin)      (2)
Therefore, the floating-point number X, the zero offset Z1, and the quantization coefficient S1 are all values in floating-point form.
As shown in FIG. 2, in the technical solution of prior art 1, the processor determines the quantization coefficient S1 and the zero offset Z1 from the maximum and minimum of the floating-point numbers and the preset maximum and minimum fixed-point integer values, and inputs the quantization coefficient S1 and the zero offset Z1 to the logic circuit as parameters. The logic circuit performs a floating-point addition (or subtraction) (X+Z1 in formula (1)), a floating-point multiplication ((X+Z1)*S1 in formula (1)), and rounding (round((X+Z1)*S1)) to obtain the fixed-point integer Xq corresponding to the floating-point number X.
By quantizing the parameters or input data of the neural network, the solution of prior art 1 allows the neural network to operate on the quantized fixed-point integers, reducing the hardware cost of computation within the neural network. Its drawback is that, when quantizing the parameters or input data, the logic circuit must implement floating-point multiplication and floating-point subtraction (or addition), making the hardware cost of the quantization logic circuit relatively large. Especially when the neural network model runs at high performance, for example when many floating-point numbers are quantized in parallel and the logic circuit must perform parallel floating-point additions and multiplications, the area and power cost of the logic circuit grows further with the degree of parallelism, which is very unfavorable for controlling hardware cost.
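To make the hardware-cost argument concrete, the following Python sketch models the prior-art-1 computation exactly as formula (1) is written above; the function name and the int8-style default range are illustrative assumptions, not part of any cited document:

```python
import numpy as np

def prior_art_1_quantize(x, x_min, x_max, q_min=-128, q_max=127):
    """Models formula (1): every call needs a floating-point add and a
    floating-point multiply in the logic circuit, which is what makes
    the parallel hardware expensive."""
    S1 = (q_max - q_min) / (x_max - x_min)  # quantization coefficient, formula (2)
    Z1 = x_min                              # zero offset set to the minimum of X
    return int(np.round((x + Z1) * S1))     # round((X + Z1) * S1)
```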
Therefore, on the basis of prior art 1, prior art 2 proposes an improved neural network quantization scheme. FIG. 3 is a schematic diagram of the method for converting floating point to fixed point proposed by prior art 2. Its linear transformation formula is shown in formula (3):
Xq = round(X * S1) + Zq      (3)
In formula (3), X denotes the floating-point number to be quantized, for example a parameter or input data of the floating-point neural network model, and Xq denotes the quantized fixed-point integer. S1 denotes the quantization coefficient, obtained as in formula (2) above, and round denotes the function that rounds a floating-point number to the nearest fixed-point integer. Zq = round(Z1*S1) denotes the zero offset in fixed-point form; that is, the zero offset Zq is the quantization result of the zero offset Z1. Prior art 2 implements the operation of formula (3) by designing a corresponding logic circuit.
As shown in FIG. 3, in the technical solution of prior art 2, the processor determines the quantization coefficient S1 and the zero offset Z1 from the maximum and minimum of the floating-point numbers and the preset maximum and minimum fixed-point integer values, then further determines the fixed-point zero offset Zq using the rounding function round, where Zq is an integer, and inputs the quantization coefficient S1 and the fixed-point zero offset Zq to the logic circuit as parameters. The logic circuit performs a floating-point multiplication (X*S1 in formula (3)), rounding (round(X*S1) in formula (3)), and a fixed-point addition (round(X*S1)+Zq in formula (3)) to obtain the fixed-point integer Xq corresponding to the floating-point number X.
By feeding the zero offset Zq, already quantized by the processor, into the logic circuit, the solution of prior art 2 lets the logic circuit implement only floating-point multiplication and fixed-point addition, which reduces the hardware cost of the quantization logic circuit. Its drawback is that the zero offset Zq input to the logic circuit is obtained by rounding the product of the floating-point zero offset Z1 and the quantization coefficient S1; that is, the quantization process performs two rounding operations in total, one in the processor and one in the logic circuit, which further amplifies the precision loss of the quantized neural network model and lowers the accuracy of the results computed by the neural network.
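The double rounding criticized here can likewise be sketched in Python; the function name is illustrative only:

```python
import numpy as np

def prior_art_2_quantize(x, S1, Z1):
    """Models formula (3): Zq is rounded once in the processor, and the
    product x*S1 is rounded again in the logic circuit, so two independent
    rounding errors enter the result."""
    Zq = int(np.round(Z1 * S1))        # processor-side rounding of the zero offset
    return int(np.round(x * S1)) + Zq  # circuit: float multiply + round + fixed-point add
```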
In view of this, embodiments of the present application propose an electronic device and a neural network quantization method that can improve the accuracy of the quantized neural network model at low hardware cost.
According to an embodiment of the present application, an electronic device including a processor and a logic circuit is proposed. FIG. 4 shows an exemplary working mode of the electronic device according to an embodiment of the present application.
As shown in FIG. 4, the processor is configured to: determine a first zero offset and a first quantization coefficient according to the maximum and minimum values in floating-point data and a preset maximum fixed-point value and a preset minimum fixed-point value, where the floating-point data includes at least one of floating-point parameters or floating-point input data of a neural network; expand the first quantization coefficient by a multiple according to a preset fixed-point quantization coefficient to obtain a second quantization coefficient; and expand and quantize the first zero offset according to the preset fixed-point quantization coefficient and the first quantization coefficient to obtain a second zero offset.
The logic circuit is configured to: quantize the data to be quantized by floating-point multiplication and fixed-point addition according to the second quantization coefficient and the second zero offset to obtain a first quantization result; and shift the first quantization result according to the preset fixed-point quantization coefficient to obtain the final quantization result of the data to be quantized.
According to the electronic device of the embodiments of the present application, the processor can determine the first zero offset and the first quantization coefficient from the floating-point data and the preset maximum and minimum fixed-point values, and further process them with the preset fixed-point quantization coefficient to obtain the second zero offset and the second quantization coefficient. Because the second quantization coefficient is obtained by expanding the first quantization coefficient by a multiple, and the second zero offset is obtained by expanding and quantizing the first zero offset, the second zero offset and the second quantization coefficient have higher precision, which improves the precision of the input parameters of the logic circuit. The logic circuit can quantize the data to be quantized with floating-point multiplication and fixed-point addition according to the input parameters (the second zero offset and the second quantization coefficient), and shift the quantization result according to the preset fixed-point quantization coefficient, so that the shifted final quantization result falls within the range from the preset minimum fixed-point value to the preset maximum fixed-point value, satisfying the quantization requirement. Moreover, since the logic circuit only performs floating-point multiplication, fixed-point addition, and shifting, the hardware cost of the logic circuit that completes the quantization is low. The accuracy of the quantized neural network model can therefore be improved at a relatively low hardware cost.
In different application scenarios, the data to be quantized can be data of different types: for example, it can be the fixed-point data obtained after the processor quantizes the floating-point data, or an intermediate result of neural network processing (for example the output of a convolution layer), or a final result, and so on.
In this way, the electronic device can improve the precision of the data used by the neural network during its processing, further improving the accuracy of the fixed-point input data of the neural network as well as the intermediate or final results obtained by the neural network.
FIG. 5 shows an exemplary working mode of the electronic device according to an embodiment of the present application. An exemplary working mode is described below, taking as an example the case where the data to be quantized is the fixed-point data obtained after the processor quantizes the floating-point data.
As shown in FIG. 5, in S1, the processor determines, from the floating-point data X, the maximum value in the floating-point data, hereinafter also called the floating-point maximum Xmax, and the minimum value in the floating-point data, hereinafter also called the floating-point minimum Xmin. The floating-point data includes at least one of floating-point parameters or floating-point input data of the neural network and may include multiple floating-point numbers. For example, a neural network model to be quantized usually has many floating-point parameters; when the floating-point data includes floating-point parameters, it may include multiple floating-point numbers, each corresponding, for example, to one floating-point parameter. The processor may take the maximum and minimum over all floating-point parameters of the neural network model to be quantized as the floating-point maximum Xmax and the floating-point minimum Xmin. The floating-point input data of the neural network model to be quantized usually includes multiple values; when the floating-point data includes floating-point input data, it may include multiple floating-point numbers, each corresponding, for example, to one value of the floating-point input data. The processor may take the maximum and minimum over all values of the floating-point input data as Xmax and Xmin. Step S1 can be implemented based on the prior art.
In S2, the processor determines the first zero offset Z from the floating-point minimum Xmin, and determines the first quantization coefficient S from the floating-point maximum Xmax, the floating-point minimum Xmin, the preset maximum fixed-point value Qmax, and the preset minimum fixed-point value Qmin, where Qmax and Qmin are determined by the preset range [Qmin, Qmax] of the final quantization result of the floating-point data. The first zero offset Z can, for example, be taken to be the floating-point minimum Xmin; the first quantization coefficient S can, for example, be determined by formula (2), that is, substituting Qmax, Qmin, Xmax, and Xmin into formula (2) yields the first quantization coefficient S. Step S2 can be implemented based on the prior art. The first zero offset Z corresponds to the zero offset Z1 of the prior art, and the first quantization coefficient S corresponds to the quantization coefficient S1 of the prior art.
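A minimal sketch of steps S1 and S2, assuming an int8-style target range [Qmin, Qmax] = [-128, 127]; the function name is hypothetical:

```python
import numpy as np

def determine_offset_and_coefficient(float_data, q_min=-128, q_max=127):
    """Step S1: scan the floating-point data for its extremes.
    Step S2: derive the first zero offset Z and, via formula (2),
    the first quantization coefficient S."""
    x_max = float(np.max(float_data))
    x_min = float(np.min(float_data))
    Z = x_min                              # first zero offset
    S = (q_max - q_min) / (x_max - x_min)  # first quantization coefficient
    return Z, S
```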
In S3, the processor quantizes the floating-point data X to obtain first fixed-point data X'.
The processor may, for example, quantize the floating-point data X according to the first quantization coefficient S and the first zero offset Z to obtain the corresponding first fixed-point data X'; the first fixed-point data X' includes, for example, the fixed-point input data of the fixed-point neural network model or the fixed-point parameters of the fixed-point neural network model. The value range of the first fixed-point data X' equals the preset fixed-point value range [Qmin, Qmax], that is, it lies within the range from the minimum fixed-point value Qmin to the maximum fixed-point value Qmax inclusive. The processor may quantize the floating-point data based on the related art, for example by substituting the first quantization coefficient S and the first zero offset Z into formula (1) as S1 and Z1. The processor may send the first fixed-point data X' to the logic circuit, or to the memory for the logic circuit to fetch.
In S4, the processor expands the first quantization coefficient S by a multiple according to the preset fixed-point quantization coefficient S_shift to obtain the second quantization coefficient S' (see formula (4) below for an example), and expands and quantizes the first zero offset Z according to the preset fixed-point quantization coefficient S_shift and the first quantization coefficient S to obtain the second zero offset Zq' (see formula (5) below for an example). The processor may send the second quantization coefficient S' and the second zero offset Zq' to the logic circuit, or to the memory for the logic circuit to fetch. The second quantization coefficient S' obtained by expanding the first quantization coefficient S, and the second zero offset Zq' obtained by expanding and quantizing the first zero offset Z, are supplied to the logic circuit for its operations, so that during the rounding performed in the logic circuit the values of the high-precision bits (for example fractional bits) of the first quantization coefficient S and the first zero offset Z are preserved; the second fixed-point data produced by the logic circuit using the second quantization coefficient and the second zero offset therefore has higher precision.
In S5, the logic circuit quantizes the first fixed-point data X' by floating-point multiplication and fixed-point addition according to the second quantization coefficient S' and the second zero offset Zq' to obtain a first quantization result, and shifts the first quantization result according to the preset fixed-point quantization coefficient S_shift to obtain second fixed-point data Xq', which serves as the final quantization result of the floating-point data (see formula (6) below for an example).
Here, the first zero offset, the first quantization coefficient, and the second quantization coefficient are in floating-point form, while the preset fixed-point quantization coefficient and the second zero offset are in fixed-point form.
The specific implementations of steps S1-S3 can follow the prior art described above and are not repeated here. Exemplary implementations of steps S4 and S5 are described below.
An exemplary method by which the electronic device of the embodiments of the present application determines the second quantization coefficient S' in step S4 is described below. In a possible implementation, in step S4, the processor expanding the first quantization coefficient S according to the preset fixed-point quantization coefficient S_shift to obtain the second quantization coefficient includes:
the processor determining the expansion factor of the first quantization coefficient S according to the preset fixed-point quantization coefficient S_shift; and the processor obtaining the second quantization coefficient S' from the product of the first quantization coefficient S and the expansion factor of the first quantization coefficient S.
Here, the preset fixed-point quantization coefficient S_shift is an integer greater than or equal to 1, and the expansion factor of the first quantization coefficient S equals 2 raised to the power of S_shift; that is, the expansion factor of S equals 2^S_shift.
In this way, the effect of the expansion approximates that of shifting by the preset fixed-point quantization coefficient as the number of shifted bits, which makes it easy to determine the number of bits by which the first quantization result is subsequently shifted.
Those skilled in the art should understand that the expansion factor of the first quantization coefficient S is not limited to the above power-of-2 example; it may also be another value, as long as the effect on the numerical range of the output of the logic circuit, caused by feeding the expanded second quantization coefficient into the logic circuit, can be removed by shifting. The present application does not restrict the way in which the first quantization coefficient S is expanded.
The relation among the first quantization coefficient S, its expansion factor 2^S_shift, and the second quantization coefficient S' is shown in formula (4):
S' = S * 2^S_shift      (4)
As formula (2) above shows, when the maximum fixed-point value Qmax and the minimum fixed-point value Qmin are fixed, the first quantization coefficient S (S1) is a constant for the same floating-point data. Therefore, by formula (4), the precision of the second quantization coefficient S' is determined by the fixed-point quantization coefficient S_shift: the larger the value of S_shift, the higher the precision of S'. Since S' is one of the parameters input to the logic circuit, the larger the value of S_shift, the higher the quantization precision of the logic circuit. A suitable fixed-point quantization coefficient S_shift can be preset according to different quantization precision requirements.
In this case, the binary form of the second quantization coefficient S' is equivalent to the shift result obtained by moving every digit of the binary form of the first quantization coefficient S to the left by the corresponding number of bits (equal to the fixed-point quantization coefficient S_shift), with high bits shifted out (discarded) and vacated low bits filled with 0. The expansion can thus be seen as moving the high-precision (for example fractional) part of the first quantization coefficient S toward higher bit positions so that it is not discarded during rounding but preserved, yielding the second quantization coefficient S'. The second quantization coefficient S' therefore retains the high-precision part of the first quantization coefficient S, and its precision is higher than that of the first quantization coefficient S before expansion.
In this way, the precision of the parameters input to the logic circuit can be improved. According to the precision requirement, the expansion factor of the first quantization coefficient can be adjusted by changing the value of the preset fixed-point quantization coefficient, making the way of improving the precision of the first quantization coefficient more flexible.
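A small numeric illustration of formula (4), with made-up values for S and S_shift:

```python
S = 0.37               # first quantization coefficient (example value)
S_shift = 8            # preset fixed-point quantization coefficient (assumed)
S2 = S * 2 ** S_shift  # formula (4): S' = 0.37 * 256 = 94.72
# In binary, multiplying by 2**8 moves every bit of S eight places toward the
# integer positions, so fractional bits that a later round() would have
# discarded survive until the final right shift restores the original scale.
```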
An exemplary method by which the electronic device of the embodiments of the present application determines the second zero offset Zq' in step S4 is described below. In a possible implementation, in step S4, expanding and quantizing the first zero offset Z according to the preset fixed-point quantization coefficient S_shift and the first quantization coefficient S to obtain the second zero offset Zq' includes:
the processor determining the expansion factor of the first zero offset Z according to the preset fixed-point quantization coefficient S_shift; and the processor rounding the product of the first quantization coefficient S, the expansion factor of the first zero offset Z, and the first zero offset Z to obtain the second zero offset Zq'.
In this way, the precision of the parameters input to the logic circuit can be improved. According to the precision requirement, the expansion factor of the first zero offset can be adjusted by changing the value of the preset fixed-point quantization coefficient, making the way of improving the precision of the first zero offset more flexible.
Here, the expansion factor of the first zero offset Z equals the expansion factor of the first quantization coefficient S; that is, the expansion factor of the first zero offset Z equals 2^S_shift. In this case, when the first zero offset Z is expanded and quantized using the first quantization coefficient S and the expansion factor of the first zero offset Z, the resulting second zero offset Zq' is likewise 2^S_shift times larger than the zero offset Zq obtained in prior-art formula (3) by quantizing the zero offset Z1 with only the quantization coefficient S1.
Using the same expansion factor as the first quantization coefficient means that, after the first zero offset is expanded and then quantized with the first quantization coefficient, the value range of the resulting second zero offset matches the value range of the quantization result obtained when the data to be quantized is processed with the second quantization coefficient, so that the second zero offset and the second quantization coefficient can participate together in the arithmetic operations of the logic circuit.
The relation among the first quantization coefficient S, the expansion factor 2^S_shift of the first zero offset Z, the first zero offset Z, and the second zero offset Zq' is shown in formula (5):
Zq' = round(Z * S * 2^S_shift) = round(Z * S')      (5)
According to the relevant description of the prior art, for the same floating-point data the first zero offset Z (zero offset Z1) is a constant; and as formula (2) above shows, when the maximum fixed-point value Qmax and the minimum fixed-point value Qmin are fixed, the first quantization coefficient S (S1) is a constant for the same floating-point data. Therefore, by formula (5), the precision of the second zero offset Zq' is determined by the expansion factor 2^S_shift of the first zero offset Z: the larger the value of S_shift, the higher the precision of Zq'. Since Zq' is one of the parameters input to the logic circuit, the larger the value of S_shift, the higher the quantization precision of the logic circuit.
Since the expansion factor of the first zero offset Z equals that of the first quantization coefficient S, the product of the first quantization coefficient S and the expansion factor of the first zero offset Z equals the second quantization coefficient S' (see formula (4)), and the second zero offset Zq' can also be viewed as the rounding result of the product of the first zero offset Z and the second quantization coefficient S'. In this case, the binary form of the product Z*S' of the first zero offset Z and the second quantization coefficient S' is equivalent to the shift result obtained by moving every digit of the binary form of the product Z*S to the left by the corresponding number of bits (equal to the fixed-point quantization coefficient S_shift), with high bits shifted out (discarded) and vacated low bits filled with 0. The precision of the product of Z and S' is therefore higher than that of the product of Z and S before expansion. Consequently, the second zero offset Zq' obtained by rounding the product Z*S' (see formula (5)) also has higher precision than the zero offset Zq obtained by rounding the product Z*S (see formula (3)). The expansion can thus be seen as moving the high-precision (for example fractional) part of the first zero offset Z toward higher bit positions so that it is not discarded during rounding but preserved, yielding the second zero offset Zq'. The second zero offset Zq' therefore retains the high-precision part of the first zero offset Z; its precision is higher than that of the prior-art zero offset Zq and closer to the first zero offset Z. The precision of the parameters input to the logic circuit can thereby be improved.
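The gain in offset precision can be checked numerically; the values below are illustrative only:

```python
import numpy as np

Z, S, S_shift = -0.83, 0.3718, 8
Zq  = int(np.round(Z * S))                 # prior art, formula (3): round(-0.3086) = 0
Zq2 = int(np.round(Z * S * 2 ** S_shift))  # formula (5): round(-79.0001) = -79
# After the final shift by S_shift, Zq2 contributes -79/256, about -0.3086, to
# the result, i.e. almost exactly Z*S, whereas the prior-art Zq contributed 0
# and the fractional information was lost.
```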
Step S4 can be performed after step S2 is completed. The embodiments of the present application do not restrict the execution order of S3 and S4.
Because of the expansion, the first quantization result obtained when the logic circuit executes step S5 with the second quantization coefficient is also expanded by the same multiple. In this case, the range of the first quantization result does not satisfy the preset range of the final quantization result of the floating-point data; step S5 therefore obtains the second fixed-point data by shifting, so that the value of the second fixed-point data satisfies the preset range of the final quantization result of the floating-point data.
An exemplary method by which the electronic device of the embodiments of the present application determines the first quantization result in step S5, and determines the final quantization result of the floating-point data from the first quantization result, is described below.
In a possible implementation, in step S5, the logic circuit quantizing the first fixed-point data X' by floating-point multiplication and fixed-point addition according to the second quantization coefficient S' and the second zero offset Zq' to obtain the first quantization result includes:
the logic circuit rounding the product of the second quantization coefficient S' and the first fixed-point data X' to obtain a second quantization result; and the logic circuit obtaining the first quantization result from the sum of the second quantization result and the second zero offset Zq'.
For example, the second quantization coefficient S' is 2^S_shift times larger than the first quantization coefficient (the quantization coefficient S1 of the prior art); therefore, the product X'*S' of the second quantization coefficient S' and the first fixed-point data X' is also 2^S_shift times larger, and the rounding result round(X'*S') of this product (the second quantization result) is a rounding result after expansion. Moreover, as described above, the second zero offset Zq' is the rounding result of the product of the first zero offset Z and the second quantization coefficient S', so Zq' is also a rounding result after expansion. Since the expansion multiples are the same (2^S_shift), the second zero offset Zq' can be added to the second quantization result by fixed-point addition to obtain the first quantization result (round(X'*S')+Zq'). In this case, the first quantization result is a quantization result expanded by a multiple, and the expansion multiple equals 2^S_shift.
In this way, a multiplied-up second quantization result is obtained from the floating-point multiplication of the second quantization coefficient and the data to be quantized, and a multiplied-up first quantization result is obtained from the fixed-point addition of the second quantization result and the second zero offset, so that the high-precision attributes of the data to be quantized are preserved in the first quantization result and its precision is improved.
In a possible implementation, in step S5, shifting the first quantization result according to the preset fixed-point quantization coefficient S_shift to obtain the second fixed-point data Xq' includes:
the logic circuit shifting the first quantization result to the right, with the number of shifted bits equal to the preset fixed-point quantization coefficient S_shift.
The relation among the first quantization result, the preset fixed-point quantization coefficient S_shift, and the second fixed-point data Xq' can be expressed by formula (6):
Xq' = (round(X' * S') + Zq') >> S_shift      (6)
Here ">>" denotes a right shift. Since an expansion by 2^S_shift was applied when determining the second quantization coefficient S' and the second zero offset Zq', the first quantization result is also expanded by 2^S_shift. On this basis, in step S5 the first quantization result can be scaled back down by the corresponding multiple through the shift operation, yielding the second fixed-point data. In this way, the second fixed-point data lies within the preset fixed-point value domain [Qmin, Qmax], so that convolution operations can subsequently be performed on the second fixed-point data.
The first quantization result preserves the high-precision attributes of the data to be quantized, so that when the first quantization result is shifted to obtain the final quantization result, the precision of the final quantization result is also improved. Since the number of shifted bits equals the preset fixed-point quantization coefficient, the value range of the shifted final quantization result satisfies the quantization requirement.
Shifting is a bitwise operation. A right shift moves all binary digits to the right by the corresponding number of bits (the fixed-point quantization coefficient S_shift), with low bits shifted out (discarded) and vacated high bits filled with the sign bit, that is, 0 for positive numbers and 1 for negative numbers. Shifting right by S_shift bits is equivalent to dividing the first quantization result by 2^S_shift and taking the integer part.
In the logic circuit, the floating-point multiplication produces the product of the first fixed-point data X' and the second quantization coefficient S', the fixed-point addition produces the first quantization result, and the shift operation produces the second fixed-point data. In this way, the logic circuit only needs to perform floating-point multiplication, fixed-point addition, and shifting, so its area and power consumption are small, and neural network quantization can be realized at a low hardware cost.
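Putting formulas (4) to (6) together, the following end-to-end comparison (all values invented for illustration) shows the present scheme landing closer to the exact result than the two-rounding path of formula (3):

```python
import numpy as np

X1, Z, S, S_shift = 29, 1.65, 0.37, 8  # first fixed-point data and example parameters
exact = (X1 + Z) * S                   # ideal target per formula (1): 11.3405

# Prior art 2 (formula (3)): two independent roundings drift to 12
xq_old = int(np.round(X1 * S)) + int(np.round(Z * S))   # 11 + 1 = 12

# Present scheme (formulas (4)-(6)): expand first, round in the scaled
# domain, then shift back; lands on 11, the integer nearest the exact value
S2 = S * 2 ** S_shift                                   # 94.72
Zq2 = int(np.round(Z * S2))                             # 156
xq_new = (int(np.round(X1 * S2)) + Zq2) >> S_shift      # 2903 >> 8 = 11
```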
In this case, the convolution operations in the neural network can be performed on the precision-improved second fixed-point data.
FIG. 6 shows an exemplary application scenario of the electronic device according to an embodiment of the present application. For example, referring to the application scenario of FIG. 6, the processor can finish processing the floating-point data by executing step S3 described above, obtaining the data to be quantized (the first fixed-point data); in this case, the data to be quantized is the fixed-point data obtained after the processor quantizes the floating-point data, and a fixed-point neural network model can be deployed from it. For example, the processor executes step S3 to quantize the floating-point input data and the floating-point parameters into fixed-point input data and fixed-point parameters (first fixed-point data), thereby obtaining the fixed-point neural network and its fixed-point input data. When quantizing the floating-point input data and the floating-point parameters, one group of first quantization coefficient and first zero offset can be obtained for the floating-point input data, and multiple groups of first quantization coefficients and first zero offsets can be obtained for the parameters of the multiple layers of the neural network. Since the values of the input data and of the parameters of each layer may differ, their floating-point maxima and minima may also differ, so the groups of first quantization coefficients and first zero offsets differ. Provided the same maximum and minimum fixed-point values are used when deriving the groups of first quantization coefficients and first zero offsets corresponding to the input data and the multi-layer parameters described above, all data to be quantized obtained by the processor executing step S3 lies within the same preset range, that is, between Qmin and Qmax.
The processor can execute step S4 described above to process the first quantization coefficients and the first zero offsets, obtaining second quantization coefficients and second zero offsets. The multiple groups of first quantization coefficients and first zero offsets correspond respectively to multiple groups of second quantization coefficients, second zero offsets, and fixed-point quantization coefficients. When the second quantization coefficient, second zero offset, and fixed-point quantization coefficient are input to the logic circuit as input parameters, one group of second quantization coefficient, second zero offset, and fixed-point quantization coefficient is input at a time.
The logic circuit can, for example, obtain the input parameters generated by the processor (a group consisting of a second quantization coefficient, a second zero offset, and a fixed-point quantization coefficient) and the first fixed-point data generated by the processor (corresponding to that group of second quantization coefficient, second zero offset, and fixed-point quantization coefficient), and perform arithmetic and shift operations during the operation of the fixed-point neural network model to obtain second fixed-point data that corresponds to the first fixed-point data and has higher accuracy, so that the convolution results output by the fixed-point neural network model based on the second fixed-point data are more accurate. These more accurate convolution results can serve as the input data of the next convolution layer, for use in the convolution operations of the next convolution layer.
For example, when the input data and the per-layer parameters of the fixed-point neural network all lie within the same preset range (for example -128 to 127), and the fixed-point neural network obtained by the processor operates on input data (denoted a), then when the fixed-point neural network starts running, the first computation completed is the convolution of the network's input data with the parameters (weights) of the first layer of the network. The input data may be the fixed-point input data obtained after the processor quantizes the floating-point input data (an example of the first fixed-point data X'), denoted, for example, a; the parameter may be the fixed-point parameter obtained after the processor quantizes the first-layer floating-point parameter (an example of the first fixed-point data X'), denoted, for example, b. The logic circuit can process the input data a and the parameter b separately to obtain input data a1 and parameter b1 of higher accuracy (second fixed-point data). For example, when the logic circuit processes the input data a, the input data a is fed into the logic circuit as the data to be quantized, and the group of second quantization coefficient and second zero offset corresponding to a (determined from the maximum and minimum of the floating-point data corresponding to a and the value domain of the fixed-point data corresponding to a) and the fixed-point quantization coefficient are also input to the logic circuit. In this case, the logic circuit can compute and output the data a1. Likewise, when the logic circuit processes the parameter b, the parameter b is fed in as the data to be quantized, and the group of second quantization coefficient and second zero offset corresponding to b (determined from the maximum and minimum of the floating-point data corresponding to b and the value domain of the fixed-point data corresponding to b) and the fixed-point quantization coefficient are also input to the logic circuit. In this case, the logic circuit can compute and output the parameter b1. In the fixed-point neural network, the convolution is performed on the input data a1 and the parameter b1, yielding the convolution result c1. In this way, the convolution result c1 is more accurate than the convolution of the input data a with the parameter b. By analogy, whenever the network convolves the convolution result of any layer with the weights of the next layer, the convolution result is already an accuracy-improved result, and the logic circuit processes the weights to obtain accuracy-improved weights, so that the result of the convolution at every layer attains relatively high accuracy. When the processor dequantizes the convolution result of any layer to obtain the corresponding floating-point convolution result, the obtained floating-point result is also more accurate and closer to the floating-point convolution result of the corresponding layer of the original floating-point neural network.
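The a/b/c1 flow just described can be sketched as follows; the tensor shapes, the parameter values, and the dot product standing in for the convolution are all assumptions of this sketch:

```python
import numpy as np

def requantize(x, S, Z, S_shift):
    """Logic-circuit step for one tensor, per formulas (4)-(6)."""
    S2 = S * 2 ** S_shift
    Zq2 = int(np.round(Z * S2))
    return (np.round(x * S2).astype(np.int64) + Zq2) >> S_shift

a = np.array([12, -7, 33], dtype=np.int64)   # fixed-point input data of the first layer
b = np.array([5, 2, -9], dtype=np.int64)     # fixed-point weights of the first layer
a1 = requantize(a, S=0.9, Z=0.1, S_shift=8)  # higher-accuracy input data (assumed params)
b1 = requantize(b, S=1.1, Z=-0.2, S_shift=8) # higher-accuracy weights (assumed params)
c1 = int(a1 @ b1)                            # stands in for the convolution result c1
```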
The working process of the logic circuit can be regarded as first mapping the numerical range of the data to be quantized onto a numerical range larger than the preset fixed-point value domain, so as to improve precision and perform high-precision operations, and then mapping the operation result back to the preset fixed-point value domain to realize quantization. In applications, different fixed-point quantization coefficients S_shift can be set for different floating-point data according to different requirements. For example, for floating-point data A one may set S_shift=M, and for floating-point data B one may set S_shift=N; when data A demands a greater precision improvement than data B, one may set M>N. By setting different fixed-point quantization coefficients S_shift, the fixed-point parameters and fixed-point input data in the neural network model can reach the required precision.
Compared with the fixed-point data obtained by prior-art quantization, the precision of the second fixed-point data is closer to the original floating-point data, which improves the accuracy of the output data of the quantized neural network.
In a possible implementation, the logic circuit proposed in the embodiments of the present application may be an arithmetic logic unit (ALU), configured to implement the arithmetic and shift operations shown in formula (6). In this case the logic circuit uses floating-point multiplication, fixed-point addition, and shifting, so its area and power cost are low; especially in parallel quantization processing, the area and power advantage becomes more pronounced as the degree of parallelism increases. Moreover, the parameters input to the logic circuit have higher precision, which makes the quantization results of the logic circuit more accurate.
In a possible implementation, the electronic device according to the embodiments of the present application further includes a memory configured to store one or more of the floating-point data, the preset maximum fixed-point value, the preset minimum fixed-point value, the first fixed-point data, the second zero offset, the second quantization coefficient, the preset fixed-point quantization coefficient, and the final quantization result of the data to be quantized.
In the different application scenarios described above, the data to be quantized can be the fixed-point data obtained after the processor quantizes the floating-point data, or an intermediate or final result of the neural network's operation.
In a possible implementation, the logic circuit of the embodiments of the present application quantizing the data to be quantized by floating-point multiplication and fixed-point addition according to the second quantization coefficient and the second zero offset to obtain the first quantization result includes:
rounding the product of the second quantization coefficient and the data to be quantized to obtain a second quantization result; and obtaining the first quantization result from the sum of the second quantization result and the second zero offset.
When the data to be quantized is the fixed-point data obtained after the processor quantizes the floating-point data, refer to step S5 and the related description above.
In a possible implementation, the logic circuit of the embodiments of the present application shifting the first quantization result according to the preset fixed-point quantization coefficient to obtain the final quantization result of the floating-point data includes:
the logic circuit shifting the first quantization result to the right to obtain the final quantization result of the floating-point data, with the number of shifted bits equal to the preset fixed-point quantization coefficient.
When the data to be quantized is the fixed-point data obtained after the processor quantizes the floating-point data, refer to step S5 and the related description above.
The present application also proposes a neural network quantization method. FIG. 7 shows an exemplary workflow of the neural network quantization method according to an embodiment of the present application. As shown in FIG. 7, the method can be applied to the electronic device according to the embodiments of the present application and includes:
S1101: a processor determines a first zero offset and a first quantization coefficient according to the maximum and minimum values in floating-point data and a preset maximum fixed-point value and a preset minimum fixed-point value, where the floating-point data includes at least one of floating-point parameters or floating-point input data of a neural network;
S1102: the processor expands the first quantization coefficient according to a preset fixed-point quantization coefficient to obtain a second quantization coefficient, and expands and quantizes the first zero offset according to the preset fixed-point quantization coefficient and the first quantization coefficient to obtain a second zero offset;
S1103: a logic circuit quantizes the data to be quantized by floating-point multiplication and fixed-point addition according to the second quantization coefficient and the second zero offset to obtain a first quantization result, and shifts the first quantization result according to the preset fixed-point quantization coefficient to obtain the final quantization result of the data to be quantized.
For exemplary descriptions of this method, see above; they are not repeated here.
In a possible implementation, the processor expanding the first quantization coefficient according to the preset fixed-point quantization coefficient to obtain the second quantization coefficient includes: the processor determining the expansion factor of the first quantization coefficient according to the preset fixed-point quantization coefficient; and the processor obtaining the second quantization coefficient from the product of the first quantization coefficient and the expansion factor of the first quantization coefficient.
In a possible implementation, the preset fixed-point quantization coefficient is an integer greater than or equal to 1, and the expansion factor of the first quantization coefficient equals 2 raised to the power of the preset fixed-point quantization coefficient.
In a possible implementation, expanding and quantizing the first zero offset according to the preset fixed-point quantization coefficient and the first quantization coefficient to obtain the second zero offset includes: determining the expansion factor of the first zero offset according to the preset fixed-point quantization coefficient; and rounding the product of the first quantization coefficient, the expansion factor of the first zero offset, and the first zero offset to obtain the second zero offset.
In a possible implementation, the expansion factor of the first zero offset equals the expansion factor of the first quantization coefficient.
In a possible implementation, quantizing the data to be quantized by floating-point multiplication and fixed-point addition according to the second quantization coefficient and the second zero offset to obtain the first quantization result includes: rounding the product of the second quantization coefficient and the data to be quantized to obtain a second quantization result; and obtaining the first quantization result from the sum of the second quantization result and the second zero offset.
In a possible implementation, shifting the first quantization result according to the preset fixed-point quantization coefficient to obtain the final quantization result of the data to be quantized includes: the logic circuit shifting the first quantization result to the right to obtain the final quantization result of the data to be quantized, with the number of shifted bits equal to the preset fixed-point quantization coefficient.
In a possible implementation, the method further includes: storing one or more of the floating-point data, the preset maximum fixed-point value, the preset minimum fixed-point value, the first fixed-point data, the second zero offset, the second quantization coefficient, the preset fixed-point quantization coefficient, and the final quantization result of the data to be quantized.
In a possible implementation, the logic circuit includes an arithmetic logic unit (ALU).
In a possible implementation, the data to be quantized includes one or more of fixed-point data obtained after the processor quantizes the floating-point data, and intermediate or final results of neural network processing.
For exemplary descriptions of the above method, see above; they are not repeated here.
In a possible implementation, the present application proposes a non-volatile computer-readable storage medium having computer program instructions stored thereon, which, when executed by a processor, implement the above neural network quantization method.
In a possible implementation, the present application proposes a computer program product, including computer-readable code or a non-volatile computer-readable storage medium carrying computer-readable code, where when the computer-readable code runs in a processor, the processor executes the above neural network quantization method.
The above are only specific embodiments of the present invention, but the protection scope of the present invention is not limited thereto. Any person skilled in the art can easily conceive of changes or substitutions within the technical scope disclosed by the present invention, and all of these shall be covered by the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (13)

  1. An electronic device, comprising a processor and a logic circuit, wherein:
    the processor is configured to:
    determine a first zero offset and a first quantization coefficient according to the maximum value and the minimum value in floating-point data as well as a preset maximum fixed-point value and a preset minimum fixed-point value, wherein the floating-point data comprises at least one of floating-point parameters or floating-point input data of a neural network;
    expand the first quantization coefficient by a multiple according to a preset fixed-point quantization coefficient to obtain a second quantization coefficient, and expand and quantize the first zero offset according to the preset fixed-point quantization coefficient and the first quantization coefficient to obtain a second zero offset;
    the logic circuit is configured to quantize data to be quantized through floating-point multiplication and fixed-point addition according to the second quantization coefficient and the second zero offset to obtain a first quantization result, and to shift the first quantization result according to the preset fixed-point quantization coefficient to obtain a final quantization result of the data to be quantized.
  2. The electronic device according to claim 1, wherein the processor expanding the first quantization coefficient according to the preset fixed-point quantization coefficient to obtain the second quantization coefficient comprises:
    the processor determining an expansion factor of the first quantization coefficient according to the preset fixed-point quantization coefficient;
    the processor obtaining the second quantization coefficient from the product of the first quantization coefficient and the expansion factor of the first quantization coefficient.
  3. The electronic device according to claim 2, wherein the preset fixed-point quantization coefficient is an integer greater than or equal to 1, and the expansion factor of the first quantization coefficient equals a value whose base is 2 and whose exponent is the preset fixed-point quantization coefficient.
  4. The electronic device according to claim 3, wherein expanding and quantizing the first zero offset according to the preset fixed-point quantization coefficient and the first quantization coefficient to obtain the second zero offset comprises:
    determining an expansion factor of the first zero offset according to the preset fixed-point quantization coefficient;
    rounding the product of the first quantization coefficient, the expansion factor of the first zero offset, and the first zero offset to obtain the second zero offset.
  5. The electronic device according to claim 4, wherein the expansion factor of the first zero offset equals the expansion factor of the first quantization coefficient.
  6. The electronic device according to any one of claims 1-5, wherein quantizing the data to be quantized through floating-point multiplication and fixed-point addition according to the second quantization coefficient and the second zero offset to obtain the first quantization result comprises:
    rounding the product of the second quantization coefficient and the data to be quantized to obtain a second quantization result;
    obtaining the first quantization result from the sum of the second quantization result and the second zero offset.
  7. The electronic device according to any one of claims 3-6, wherein shifting the first quantization result according to the preset fixed-point quantization coefficient to obtain the final quantization result of the data to be quantized comprises:
    the logic circuit shifting the first quantization result to the right to obtain the final quantization result of the data to be quantized, wherein the number of shifted bits equals the preset fixed-point quantization coefficient.
  8. The electronic device according to any one of claims 1-7, further comprising a memory, wherein the memory is configured to store one or more of the floating-point data, the preset maximum fixed-point value, the preset minimum fixed-point value, the first fixed-point data, the second zero offset, the second quantization coefficient, the preset fixed-point quantization coefficient, and the final quantization result of the data to be quantized.
  9. The electronic device according to any one of claims 1-8, wherein the logic circuit comprises an arithmetic logic unit (ALU).
  10. The electronic device according to any one of claims 1-9, wherein the data to be quantized comprises one or more of fixed-point data obtained after the processor quantizes the floating-point data, and intermediate or final results of neural network processing.
  11. A neural network quantization method, comprising:
    a processor determining a first zero offset and a first quantization coefficient according to the maximum value and the minimum value in floating-point data as well as a preset maximum fixed-point value and a preset minimum fixed-point value, wherein the floating-point data comprises at least one of floating-point parameters or floating-point input data of a neural network;
    the processor expanding the first quantization coefficient by a multiple according to a preset fixed-point quantization coefficient to obtain a second quantization coefficient, and expanding and quantizing the first zero offset according to the preset fixed-point quantization coefficient and the first quantization coefficient to obtain a second zero offset;
    a logic circuit quantizing data to be quantized through floating-point multiplication and fixed-point addition according to the second quantization coefficient and the second zero offset to obtain a first quantization result, and shifting the first quantization result according to the preset fixed-point quantization coefficient to obtain a final quantization result of the data to be quantized.
  12. A non-volatile computer-readable storage medium having computer program instructions stored thereon, wherein the computer program instructions, when executed by a processor, implement the method of claim 11.
  13. A computer program product, comprising computer-readable code, or a non-volatile computer-readable storage medium carrying the computer-readable code, wherein when the computer-readable code runs in a processor, the processor executes the method of claim 11.
PCT/CN2021/109839 2021-07-30 2021-07-30 Electronic device and neural network quantization method WO2023004799A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202180100947.8A 2021-07-30 2021-07-30 Electronic device and neural network quantization method
PCT/CN2021/109839 2021-07-30 2021-07-30 Electronic device and neural network quantization method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2021/109839 2021-07-30 2021-07-30 Electronic device and neural network quantization method

Publications (1)

Publication Number Publication Date
WO2023004799A1 true WO2023004799A1 (zh) 2023-02-02

Family

ID=85087378

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/109839 WO2023004799A1 (zh) 2021-07-30 2021-07-30 Electronic device and neural network quantization method

Country Status (2)

Country Link
CN (1) CN117813610A (zh)
WO (1) WO2023004799A1 (zh)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200134448A1 (en) * 2018-10-31 2020-04-30 Google Llc Quantizing neural networks with batch normalization
CN111612147A (zh) * 2020-06-30 2020-09-01 上海富瀚微电子股份有限公司 Quantization method for deep convolutional networks
CN112446491A (zh) * 2021-01-20 2021-03-05 上海齐感电子信息科技有限公司 Real-time automatic quantization method and real-time automatic quantization system for neural network models
CN113159276A (zh) * 2021-03-09 2021-07-23 北京大学 Model optimization deployment method, system, device, and storage medium

Also Published As

Publication number Publication date
CN117813610A (zh) 2024-04-02

Similar Documents

Publication Publication Date Title
WO2019238029A1 (zh) Convolutional neural network system and method for quantizing a convolutional neural network
CN110852434B (zh) CNN quantization method, forward computation method, and hardware apparatus based on low-precision floating-point numbers
CN107340993B (zh) Arithmetic apparatus and method
KR20190062129A (ko) Low-power hardware acceleration method and system for convolutional neural network computation
JP2018124681A (ja) Arithmetic processing device, information processing device, method, and program
CN106990937A (zh) Floating-point number processing apparatus
US10491239B1 (en) Large-scale computations using an adaptive numerical format
CN110008952B (zh) Target recognition method and device
CN111832719A (zh) Fixed-point quantized convolutional neural network accelerator computation circuit
CN110852416A (zh) CNN accelerated computation method and system based on a low-precision floating-point data representation
CN111240746B (zh) Method and device for dequantizing and quantizing floating-point data
CN109308520B (zh) FPGA circuit and method for implementing softmax function computation
CN110515584A (zh) Floating-point computation method and system
CN111813371B (zh) Floating-point division method, system, and readable medium for digital signal processing
CN109325590B (zh) Apparatus for implementing a neural network processor with variable computational precision
Wu et al. Efficient dynamic fixed-point quantization of CNN inference accelerators for edge devices
WO2023004799A1 (zh) Electronic device and neural network quantization method
CN114418057A (zh) Operation method of a convolutional neural network and related device
CN107220025A (zh) Apparatus and method for processing multiply-add operations
CN115860062A (zh) Neural network quantization method and apparatus suitable for FPGA
CN114860193A (zh) Hardware arithmetic circuit for computing the power function and data processing method
Isobe et al. Low-bit Quantized CNN Acceleration based on Bit-serial Dot Product Unit with Zero-bit Skip
CN116468079B (zh) Method for training a deep neural network model and related products
JP2020067897A (ja) Arithmetic processing device, learning program, and learning method
US20230334117A1 (en) Method and system for calculating dot products

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21951389

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 202180100947.8

Country of ref document: CN

NENP Non-entry into the national phase

Ref country code: DE