CN117813610A - Electronic device and neural network quantization method - Google Patents

Electronic device and neural network quantization method

Info

Publication number
CN117813610A
CN117813610A (application number CN202180100947.8A)
Authority
CN
China
Prior art keywords: point, quantization, quantized, coefficient, data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202180100947.8A
Other languages
Chinese (zh)
Inventor
肖延南
刘根树
张怡浩
左文明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd
Publication of CN117813610A

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/08: Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

Embodiments of this application provide an electronic device and a neural network quantization method. The electronic device includes a processor and a logic circuit. The processor is configured to determine a first zero point offset and a first quantization coefficient according to floating point data and a preset fixed point number maximum value and a preset fixed point number minimum value, to perform multiple expansion on the first quantization coefficient to obtain a second quantization coefficient, and to perform multiple expansion and quantization on the first zero point offset to obtain a second zero point offset. The logic circuit is configured to quantize the data to be quantized through a floating point multiplication operation and a fixed point addition operation according to the second quantization coefficient and the second zero point offset to obtain a first quantization result, and to shift the first quantization result according to a preset fixed-point quantization coefficient to obtain the final quantization result of the data to be quantized. With the electronic device and the neural network quantization method, the precision of the quantized neural network model can be improved at low hardware cost.

Description

Electronic device and neural network quantization method
Technical Field
The present application relates to the technical field of neural networks, and in particular to an electronic device and a neural network quantization method.
Background
With the spread of deep learning techniques, a large number of neural network models based on deep learning have emerged. The parameters and input data of these models are typically in floating point form, and the models also operate on floating point values. Floating point data usually has a high bit width, for example 32 bits, so storing and operating on floating point data consumes considerable hardware resources. When a floating point neural network model is large, for example when it has many parameters or much input data, the hardware performance requirements are correspondingly higher, and operating the neural network model therefore incurs a larger hardware cost.
To address the excessive hardware cost of operating a floating point neural network model, the prior art proposes quantizing the model, converting its floating point parameters or floating point input data into fixed point parameters or fixed point input data. The quantized bit width is smaller, which reduces the hardware cost of storing and operating on the fixed point data, and hence the hardware cost of the neural network operations as a whole.
However, quantizing the parameters or input data of a floating point neural network model itself involves a large number of floating point operations, so the quantization requires substantial hardware; moreover, the reduced bit width implies some loss of precision, which degrades the operation accuracy of the neural network model.
Disclosure of Invention
In view of this, embodiments of the present application provide an electronic device and a neural network quantization method that can improve the precision of the quantized neural network model at relatively low hardware cost.
In a first aspect, an embodiment of the present application proposes an electronic device, including a processor and a logic circuit, where the processor is configured to: determining a first zero offset and a first quantization coefficient according to a maximum value and a minimum value in floating point data, and a preset fixed point number maximum value and a preset fixed point number minimum value, wherein the floating point data comprises at least one of floating point parameters of a neural network or floating point input data; performing multiple expansion on the first quantized coefficient according to a preset fixed-point quantized coefficient to obtain a second quantized coefficient, and performing multiple expansion and quantization on the first zero offset according to the preset fixed-point quantized coefficient and the first quantized coefficient to obtain a second zero offset; the logic circuit is used for quantizing the data to be quantized through floating point multiplication operation and fixed point addition operation according to the second quantization coefficient and the second zero point offset to obtain a first quantization result; and shifting the first quantization result according to the preset fixed-point quantization coefficient to obtain a final quantization result of the data to be quantized.
With the electronic device provided by this embodiment of the application, the processor can determine the first zero point offset and the first quantization coefficient from the floating point data and the preset fixed point number maximum and minimum values, and then process them together with the preset fixed-point quantization coefficient to obtain the second zero point offset and the second quantization coefficient. Because the second quantization coefficient is obtained by multiple expansion of the first quantization coefficient, and the second zero point offset is obtained by multiple expansion and quantization of the first zero point offset, both have higher precision, which improves the precision of the parameters input to the logic circuit. The logic circuit can quantize the data to be quantized by performing a floating point multiplication operation and a fixed point addition operation with these input parameters (the second zero point offset and the second quantization coefficient), and shift the quantization result according to the preset fixed-point quantization coefficient to obtain the final quantization result of the data to be quantized, so that the shifted final quantization result lies between the preset fixed point number minimum value and the preset fixed point number maximum value, meeting the quantization requirement. Moreover, since the logic circuit performs only a floating point multiplication, a fixed point addition, and a shift operation, the hardware cost of the logic circuit that completes the quantization is low. The precision of the quantized neural network model can therefore be improved at low hardware cost.
In a first possible implementation manner of the electronic device according to the first aspect, the processor performs multiple expansion on the first quantization coefficient according to a preset fixed-point quantization coefficient to obtain a second quantization coefficient, including: the processor determines expansion multiples of the first quantization coefficient according to the preset fixed-point quantization coefficient; the processor obtains the second quantized coefficient according to the product of the first quantized coefficient and the expansion multiple of the first quantized coefficient.
In this way, the precision of the parameters input to the logic circuit can be improved. According to the parameter precision requirement, the expansion multiple of the first quantization coefficient can be adjusted by changing the value of the preset fixed-point quantization coefficient, making the way the precision of the first quantization coefficient is improved more flexible.
In a second possible implementation manner of the electronic device according to the first possible implementation manner of the first aspect, the preset fixed-point quantization coefficient is an integer greater than or equal to 1, and the expansion multiple of the first quantization coefficient is equal to 2 raised to the power of the preset fixed-point quantization coefficient.
In this way, the effect of the multiple expansion corresponds to that of a shift whose bit number is the preset fixed-point quantization coefficient, which determines the shift bit number used later when shifting the first quantization result.
In a third possible implementation manner of the electronic device according to the second possible implementation manner of the first aspect, the performing multiple expansion and quantization on the first zero offset according to a preset fixed point quantization coefficient and the first quantization coefficient to obtain a second zero offset includes: determining the expansion multiple of the first zero point offset according to the preset fixed point quantization coefficient; and rounding the product of the first quantization coefficient, the expansion multiple of the first zero point offset and the first zero point offset to obtain the second zero point offset.
In this way, the precision of the parameters input to the logic circuit can be improved. According to the parameter precision requirement, the expansion multiple of the first zero point offset can be adjusted by changing the value of the preset fixed-point quantization coefficient, making the way the precision of the first zero point offset is improved more flexible.
In a fourth possible implementation manner of the electronic device according to the third possible implementation manner of the first aspect, the expansion multiple of the first zero offset is equal to the expansion multiple of the first quantization coefficient.
Using the same expansion multiple as for the first quantization coefficient means that, after the first zero point offset is multiple-expanded and quantized with the first quantization coefficient, the resulting second zero point offset has the same value range as the quantization result obtained when the second quantization coefficient is applied to the data to be quantized, so that the second zero point offset and the second quantization coefficient can participate together in the arithmetic operations in the logic circuit.
In a fifth possible implementation manner of the electronic device according to the first aspect and any one of the possible implementation manners of the first aspect, quantizing the data to be quantized according to the second quantization coefficient and the second zero offset by a floating point multiplication operation and a fixed point addition operation to obtain a first quantization result, where the quantizing includes: rounding the product of the second quantization coefficient and the data to be quantized to obtain a second quantization result; and obtaining the first quantized result according to the sum of the second quantized result and the second zero offset.
In this way, a multiple-expanded second quantization result is obtained through the floating point multiplication of the second quantization coefficient and the data to be quantized, and a multiple-expanded first quantization result is obtained through the fixed point addition of the second quantization result and the second zero point offset, so the high-precision attributes of the data to be quantized are retained in the first quantization result, improving its precision.
According to any one of the second to fifth possible implementation manners of the first aspect, in a sixth possible implementation manner of the electronic device, the shifting the first quantization result according to the preset fixed-point quantization coefficient, to obtain a final quantization result of the data to be quantized includes: and the logic circuit shifts the first quantization result to the right to obtain a final quantization result of the data to be quantized, wherein the number of shifted bits is equal to the preset fixed-point quantization coefficient.
Because the high-precision attributes of the data to be quantized are retained in the first quantization result, the precision of the final quantization result obtained by shifting the first quantization result is improved. And since the shift bit number equals the preset fixed-point quantization coefficient, the value range of the shifted final quantization result meets the quantization requirement.
In a seventh possible implementation manner of the electronic device according to the first aspect, the electronic device further includes a memory, where the memory is configured to store one or more of the floating point data, the preset fixed point number maximum value, the preset fixed point number minimum value, the first fixed point data, the second zero offset, the second quantization coefficient, the preset fixed point quantization coefficient, and a final quantization result of the data to be quantized.
In an eighth possible implementation form of the electronic device according to the first aspect as such or any of the possible implementation forms of the first aspect as such, the logic circuit comprises an arithmetic logic unit ALU.
In a ninth possible implementation manner of the electronic device according to the first aspect and any one of the possible implementation manners of the first aspect, the data to be quantized includes one or more of fixed-point data obtained by quantizing floating-point data by a processor, an intermediate result in a neural network processing process, or a final result.
In this way, the electronic device can improve the precision of the data used by the neural network during its processing, and thereby improve the precision of the fixed point input data of the neural network and of the intermediate or final results obtained by its processing.
In a second aspect, embodiments of the present application provide a neural network quantization method, the method including: the processor determines a first zero offset and a first quantization coefficient according to a maximum value and a minimum value in floating point data, a preset fixed point number maximum value and a preset fixed point number minimum value, wherein the floating point data comprises at least one of floating point parameters of a neural network or floating point input data; the processor performs multiple expansion on the first quantized coefficient according to a preset fixed-point quantized coefficient to obtain a second quantized coefficient, and performs multiple expansion and quantization on the first zero offset according to the preset fixed-point quantized coefficient and the first quantized coefficient to obtain a second zero offset; the logic circuit quantizes the data to be quantized through floating point multiplication operation and fixed point addition operation according to the second quantization coefficient and the second zero offset to obtain a first quantization result; and shifting the first quantization result according to the preset fixed-point quantization coefficient to obtain a final quantization result of the data to be quantized.
According to a second aspect, in a first possible implementation manner of the neural network quantization method, the processor performs multiple expansion on the first quantization coefficient according to a preset fixed-point quantization coefficient to obtain a second quantization coefficient, and includes: the processor determines expansion multiples of the first quantization coefficient according to the preset fixed-point quantization coefficient; the processor obtains the second quantized coefficient according to the product of the first quantized coefficient and the expansion multiple of the first quantized coefficient.
In a second possible implementation manner of the neural network quantization method according to the first possible implementation manner of the second aspect, the preset fixed-point quantization coefficient is an integer greater than or equal to 1, and the expansion multiple of the first quantization coefficient is equal to 2 raised to the power of the preset fixed-point quantization coefficient.
In a third possible implementation manner of the neural network quantization method according to the second possible implementation manner of the second aspect, the performing multiple expansion and quantization on the first zero offset according to a preset fixed-point quantization coefficient and the first quantization coefficient to obtain a second zero offset includes: determining the expansion multiple of the first zero point offset according to the preset fixed point quantization coefficient; and rounding the product of the first quantization coefficient, the expansion multiple of the first zero point offset and the first zero point offset to obtain the second zero point offset.
In a fourth possible implementation manner of the neural network quantization method according to the third possible implementation manner of the second aspect, the expansion multiple of the first zero offset is equal to the expansion multiple of the first quantization coefficient.
In a fifth possible implementation manner of the neural network quantization method according to the second aspect and any one of the possible implementation manners of the second aspect, the quantizing the data to be quantized by a floating point multiplication operation and a fixed point addition operation according to the second quantization coefficient and the second zero point offset to obtain a first quantization result includes: rounding the product of the second quantization coefficient and the data to be quantized to obtain a second quantization result; and obtaining the first quantized result according to the sum of the second quantized result and the second zero offset.
According to any one of the second to fifth possible implementation manners of the second aspect, in a sixth possible implementation manner of the neural network quantization method, the shifting the first quantization result according to the preset fixed-point quantization coefficient, to obtain a final quantization result of the data to be quantized includes: and the logic circuit shifts the first quantization result to the right to obtain a final quantization result of the data to be quantized, wherein the number of shifted bits is equal to the preset fixed-point quantization coefficient.
In a seventh possible implementation manner of the neural network quantization method according to the second aspect and any one of the possible implementation manners of the second aspect, the method further includes storing one or more of the floating point data, the preset fixed point number maximum value, the preset fixed point number minimum value, the first fixed point data, the second zero point offset, the second quantization coefficient, the preset fixed point quantization coefficient, and a final quantization result of the data to be quantized.
In an eighth possible implementation form of the neural network quantization method according to the second aspect as such or any of the possible implementation forms of the second aspect as such, the logic circuit comprises an arithmetic logic unit ALU.
In a ninth possible implementation manner of the neural network quantization method according to the second aspect and any one of the possible implementation manners of the second aspect, the data to be quantized includes one or more of fixed-point data obtained by quantizing floating-point data by a processor, an intermediate result in a neural network processing process, or a final result.
In a third aspect, embodiments of the present application provide a non-transitory computer-readable storage medium having stored thereon computer program instructions that, when executed by a processor, implement the neural network quantization method of the second aspect described above.
In a fourth aspect, embodiments of the present application provide a computer program product comprising computer readable code, or a non-transitory computer readable storage medium carrying computer readable code, which when executed in a processor, performs the neural network quantization method of the second aspect described above.
Drawings
FIG. 1 illustrates a block diagram of an exemplary electronic device, according to an embodiment of the present application;
FIG. 2 is a schematic diagram of a floating-point-to-fixed-point conversion method according to a first prior-art scheme;
FIG. 3 is a schematic diagram of a floating-point-to-fixed-point conversion method according to a second prior-art scheme;
FIG. 4 illustrates an exemplary manner of operation of an electronic device according to an embodiment of the present application;
FIG. 5 illustrates an exemplary manner of operation of an electronic device according to an embodiment of the present application;
FIG. 6 illustrates an exemplary application scenario of an electronic device according to an embodiment of the present application;
FIG. 7 shows an exemplary workflow of a neural network quantization method according to an embodiment of the present application.
Detailed Description
Fig. 1 illustrates a block diagram of an exemplary electronic device according to an embodiment of the present application; the electronic device may include a processor and a logic circuit. The electronic device may further comprise a memory, to which the processor and the logic circuit may be connected. The processor and the logic circuit can read data stored in the memory and write data to the memory. The memory may store preset values required for executing embodiments of the present application (e.g., value range information of the preset fixed point number, the preset fixed-point quantization coefficient, etc.), and may also store floating point data, e.g., parameters or input data of a floating point neural network model, as well as intermediate and final results produced during execution. The processor can process the floating point data to obtain the input parameters of the logic circuit, improving the precision of those input parameters by multiple expansion. The logic circuit may, for example, take the input parameters generated by the processor and the data to be quantized, perform arithmetic operations and shift operations, and output fixed point data with improved precision, which may serve as parameters or input data of the fixed point neural network model and be written to the memory for storage.
The following describes the technical principle of quantifying parameters of a floating point neural network model in connection with fig. 2-3.
Fig. 2 is a schematic diagram of a floating-point-to-fixed-point conversion method according to a first prior-art scheme. The theoretical linear transformation formula for converting floating point to fixed point in this scheme is shown in formula (1):
Xq=round((X+Z1)*S1) (1)
in formula (1), X represents a floating point number to be quantized, such as a parameter of a floating point neural network model or input data, and Xq represents a fixed point integer after quantization. Z1 denotes zero-point offset, S1 denotes quantization coefficient, and round denotes a function of rounding floating point numbers to fixed point integers. In the prior art, the operation of the formula (1) is realized by designing a corresponding logic circuit.
The number of floating point numbers X may be plural, and the value range [Xmin, Xmax] of the floating point numbers X may be determined from them, where Xmin represents the minimum value of the floating point numbers X and Xmax represents the maximum value. The value range [Qmin, Qmax] of the fixed-point integer Xq may be preset according to quantization requirements, where Qmin represents the minimum value of the fixed-point integer Xq and Qmax represents the maximum value. The zero point offset Z1 may be set equal to the minimum value Xmin of the floating point numbers X. The quantization coefficient S1 may be obtained by dividing the difference between the fixed-point integer maximum Qmax and minimum Qmin by the difference between the floating point maximum Xmax and minimum Xmin, as shown in formula (2):
S1=(Qmax-Qmin)/(Xmax-Xmin) (2)
Therefore, the floating point number X, the zero point offset Z1, and the quantization coefficient S1 are all floating point type values.
As shown in fig. 2, in the first prior-art scheme, the quantization coefficient S1 and the zero point offset Z1 may be determined by the processor according to the maximum and minimum values of the floating point numbers and the preset fixed-point integer maximum and minimum values, and the quantization coefficient S1 and the zero point offset Z1 are input as parameters to the logic circuit. A floating point addition (or subtraction) operation (i.e., X+Z1 in formula (1)), a floating point multiplication operation (i.e., (X+Z1)*S1 in formula (1)), and a rounding operation (i.e., round((X+Z1)*S1)) are performed in the logic circuit to obtain the fixed-point integer Xq corresponding to the floating point number X.
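For illustration, the following minimal Python sketch models this scheme (formulas (1) and (2)). The function name and example values are illustrative rather than from this application, and the zero point offset is taken as Z1 = -Xmin, one consistent reading of formula (1) under which X+Z1 maps Xmin to Qmin = 0:

```python
# Sketch of the first prior-art scheme: formula (1) with S1 from formula (2).
def quantize_v1(x, x_min, x_max, q_min, q_max):
    s1 = (q_max - q_min) / (x_max - x_min)  # quantization coefficient, formula (2)
    z1 = -x_min                             # zero point offset derived from Xmin (assumed sign)
    return round((x + z1) * s1)             # float add, float multiply, then round

# Example: quantize 0.5 from [-1, 1] into the unsigned 8-bit range [0, 255]
print(quantize_v1(0.5, -1.0, 1.0, 0, 255))  # round(1.5 * 127.5) = round(191.25) -> 191
```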
According to the first prior-art scheme, the parameters or input data of the neural network are quantized, so that operations in the neural network can be performed on the quantized fixed-point integers, reducing the hardware cost of the neural network operations. The drawback is that quantizing the parameters or input data requires the logic circuit to implement a floating point multiplication and a floating point subtraction (or addition), so the hardware cost of the logic circuit needed for the quantization process is relatively high. In particular, in scenarios where a neural network model operates at high performance, for example quantizing many floating point numbers in parallel, the logic circuit must implement parallel floating point additions and multiplications, and its area and power cost grow with the degree of parallelism, which is very unfavorable for controlling hardware cost.
Therefore, building on the first scheme, a second prior-art scheme proposes an improvement to neural network quantization; fig. 3 shows a schematic diagram of its floating-point-to-fixed-point conversion method. The linear transformation formula for converting floating point to fixed point in the second scheme is shown in formula (3):
Xq=round(X*S1)+Zq (3)
in formula (3), X represents a floating point number to be quantized, such as a parameter or input data of a floating point neural network model, and Xq represents the quantized fixed-point integer. S1 represents the quantization coefficient, obtained as in formula (2) above, and round represents a function that rounds floating point numbers to fixed-point integers. Zq=round(Z1*S1) represents the zero point offset in fixed-point form, that is, the zero point offset Zq is the quantization result of the zero point offset Z1. In the second scheme, the operation of formula (3) is implemented by designing a corresponding logic circuit.
As shown in fig. 3, in the second prior-art scheme, the quantization coefficient S1 and the zero point offset Z1 may be determined by the processor according to the maximum and minimum values of the floating point numbers and the preset fixed-point integer maximum and minimum values, and the fixed-point zero point offset Zq is then determined using the rounding function round, where Zq is an integer; the quantization coefficient S1 and the fixed-point zero point offset Zq are input as parameters to the logic circuit. In the logic circuit, a floating point multiplication (X*S1 in formula (3)), a rounding operation (round(X*S1) in formula (3)), and a fixed point addition (round(X*S1)+Zq in formula (3)) are performed to obtain the fixed-point integer Xq corresponding to the floating point number X.
In the second scheme, the zero point offset Zq quantized by the processor is input to the logic circuit, and the logic circuit implements a floating point multiplication and a fixed point addition, so the hardware cost of the logic circuit needed for the quantization process can be reduced. The drawback is that the zero point offset Zq input to the logic circuit is obtained by rounding the product of the floating-point zero point offset Z1 and the quantization coefficient S1; that is, the quantization process rounds twice, once in the processor and once in the logic circuit, which further amplifies the precision loss of the quantized neural network model and reduces the accuracy of the results of the neural network operations.
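Under the same illustrative assumptions as the earlier sketch, a minimal sketch of the second scheme (formula (3)) makes the double rounding visible:

```python
# Sketch of prior-art formula (3); names and values are illustrative only.
def quantize_v2(x, s1, z1):
    zq = round(z1 * s1)        # processor side: first rounding, fraction of Z1*S1 is lost
    return round(x * s1) + zq  # logic circuit: float multiply + round, fixed point add

s1 = (255 - 0) / (1.0 - (-1.0))   # 127.5, from formula (2)
print(quantize_v2(0.5, s1, 1.0))  # round(63.75) + round(127.5) = 64 + 128 = 192
```

The one-unit difference from the first scheme's 191 arises solely from rounding Z1*S1 = 127.5 separately in the processor.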
In view of this, embodiments of the present application provide an electronic device and a neural network quantization method that can improve the precision of the quantized neural network model at relatively low hardware cost.
According to an embodiment of the application, an electronic device is provided, which includes a processor and a logic circuit. Fig. 4 illustrates an exemplary manner of operation of an electronic device according to an embodiment of the present application.
As shown in fig. 4, the processor is configured to: determining a first zero offset and a first quantization coefficient according to a maximum value and a minimum value in floating point data, and a preset fixed point number maximum value and a preset fixed point number minimum value, wherein the floating point data comprises at least one of floating point parameters of a neural network or floating point input data; performing multiple expansion on the first quantized coefficient according to a preset fixed-point quantized coefficient to obtain a second quantized coefficient, and performing multiple expansion and quantization on the first zero offset according to the preset fixed-point quantized coefficient and the first quantized coefficient to obtain a second zero offset;
the logic circuit is used for: quantizing the data to be quantized through floating point multiplication operation and fixed point addition operation according to the second quantization coefficient and the second zero point offset to obtain a first quantization result; and shifting the first quantization result according to the preset fixed-point quantization coefficient to obtain a final quantization result of the data to be quantized.
With the electronic device provided by this embodiment of the application, the processor can determine the first zero point offset and the first quantization coefficient from the floating point data and the preset fixed point number maximum and minimum values, and then process them together with the preset fixed-point quantization coefficient to obtain the second zero point offset and the second quantization coefficient. Because the second quantization coefficient is obtained by multiple expansion of the first quantization coefficient, and the second zero point offset is obtained by multiple expansion and quantization of the first zero point offset, both have higher precision, which improves the precision of the parameters input to the logic circuit. The logic circuit can quantize the data to be quantized by performing a floating point multiplication operation and a fixed point addition operation with these input parameters (the second zero point offset and the second quantization coefficient), and shift the quantization result according to the preset fixed-point quantization coefficient to obtain the final quantization result of the data to be quantized, so that the shifted final quantization result lies between the preset fixed point number minimum value and the preset fixed point number maximum value, meeting the quantization requirement. Moreover, since the logic circuit performs only a floating point multiplication, a fixed point addition, and a shift operation, the hardware cost of the logic circuit that completes the quantization is low. The precision of the quantized neural network model can therefore be improved at low hardware cost.
In different application scenarios, the data to be quantized may be of different types, for example fixed point data obtained by the processor quantizing floating point data, or an intermediate result (for example, the output of a convolution layer) or a final result of the neural network processing.
In this way, the electronic device can improve the precision of the data used by the neural network during its processing, and thereby improve the precision of the fixed point input data of the neural network and of the intermediate or final results obtained by its processing.
Fig. 5 illustrates an exemplary manner of operation of an electronic device according to an embodiment of the present application. The following description takes as its example the case where the data to be quantized is fixed point data obtained by the processor quantizing floating point data.
As shown in FIG. 5, in S1 the processor determines, from the floating point data X, the maximum value in the floating point data (hereinafter also called the floating point maximum Xmax) and the minimum value in the floating point data (hereinafter also called the floating point minimum Xmin). The floating point data includes at least one of floating point parameters of a neural network or floating point input data and may include a plurality of floating point numbers. For example, a neural network model typically has a plurality of floating point parameters that need to be quantized; when the floating point data includes floating point parameters, each floating point number may, for example, correspond to one floating point parameter, and the processor may take the maximum and minimum values over all floating point parameters of the neural network model that need to be quantized as the floating point maximum Xmax and the floating point minimum Xmin. Likewise, the floating point input data of the neural network model to be quantized typically includes a plurality of values; when the floating point data includes floating point input data, each floating point number may, for example, correspond to one value of the input data, and the processor may take the maximum and minimum values over all values of the floating point input data that need to be quantized as the floating point maximum Xmax and the floating point minimum Xmin. Step S1 may be implemented based on the prior art.
In S2, the processor determines the first zero point offset Z according to the floating point minimum Xmin, and determines the first quantization coefficient S according to the floating point maximum Xmax, the floating point minimum Xmin, the preset fixed point number maximum Qmax, and the preset fixed point number minimum Qmin, where Qmax and Qmin are determined by the preset range [Qmin, Qmax] of the final quantization result of the floating point data. The first zero point offset Z may, for example, be taken as the floating point minimum Xmin; the first quantization coefficient S may be determined, for example, by formula (2), i.e., by substituting Qmax, Qmin, Xmax, and Xmin into formula (2). Step S2 may be implemented based on the prior art. The first zero point offset Z corresponds to the zero point offset Z1 of the prior art, and the first quantization coefficient S corresponds to the quantization coefficient S1 of the prior art.
In S3, the processor quantizes the floating point data X to obtain first fixed point data X'.
The processor may quantize the floating point data X according to the first quantization coefficient S and the first zero point offset Z to obtain the corresponding first fixed point data X', where the first fixed point data X' includes, for example, fixed point input data of a fixed point neural network model or fixed point parameters of the fixed point neural network model. The value range of the first fixed point data X' equals the preset fixed point value range [Qmin, Qmax], i.e., values greater than or equal to the fixed point minimum Qmin and less than or equal to the fixed point maximum Qmax. The processor may quantize the floating point data based on the related art, for example by substituting the first quantization coefficient S and the first zero point offset Z into formula (1) as S1 and Z1. The processor may send the first fixed point data X' to the logic circuit, or to the memory for the logic circuit to fetch.
In S4, the processor performs multiple expansion on the first quantization coefficient S according to the preset fixed-point quantization coefficient S_shift to obtain the second quantization coefficient S' (see formula (4) below for an example), and performs multiple expansion and quantization on the first zero point offset Z according to the preset fixed-point quantization coefficient S_shift and the first quantization coefficient S to obtain the second zero point offset Zq' (see formula (5) below for an example). The processor may send the second quantization coefficient S' and the second zero point offset Zq' to the logic circuit, or to the memory for the logic circuit to fetch. Supplying the logic circuit with the second quantization coefficient S', obtained by multiple expansion of the first quantization coefficient S, and with the second zero point offset Zq', obtained by multiple expansion and quantization of the first zero point offset Z, preserves the high-precision (e.g., fractional) bits of S and Z through the rounding performed in the logic circuit, so the second fixed point data obtained by the logic circuit using the second quantization coefficient and the second zero point offset has higher precision.
In S5, the logic circuit quantizes the first fixed point data X ' to obtain a first quantized result through a floating point multiplication operation and a fixed point addition operation according to the second quantized coefficient S ' and the second zero offset Zq '; the first quantization result is shifted according to a preset fixed-point quantization coefficient s_shift to obtain second fixed-point data Xq', and the second fixed-point data is used as a final quantization result of floating-point data (for an example, see the following formula (6)).
The first zero point offset, the first quantized coefficient and the second quantized coefficient are in floating point form, and the preset fixed point quantized coefficient and the second zero point offset are in fixed point form.
Steps S1-S3 may be implemented with reference to the prior art described above and are not repeated here. Exemplary implementations of steps S4 and S5 are described below.
An exemplary method by which the electronic device of the embodiments of the present application determines the second quantization coefficient S' based on step S4 is described below. In a possible implementation manner, in step S4, the processor performs multiple expansion on the first quantization coefficient S according to a preset fixed-point quantization coefficient s_shift to obtain a second quantization coefficient, including:
the processor determines expansion multiples of a first quantization coefficient S according to a preset fixed-point quantization coefficient S_shift; the processor obtains a second quantized coefficient S' according to the product of the first quantized coefficient S and the expansion multiple of the first quantized coefficient S.
The preset fixed-point quantization coefficient S_shift is an integer greater than or equal to 1, and the expansion multiple of the first quantization coefficient S is equal to 2 raised to the power of the preset fixed-point quantization coefficient S_shift. That is, the expansion multiple of the first quantization coefficient S is equal to 2^S_shift.
In this way, the effect of the multiple expansion corresponds to that of a shift whose bit number is the preset fixed-point quantization coefficient, which determines the shift bit number used later when shifting the first quantization result.
It should be understood by those skilled in the art that the expansion multiple of the first quantization coefficient S is not limited to the power-of-2 form exemplified above; it may take other values, as long as its effect on the value range of the logic circuit's output can be eliminated by shifting.
The relation among the first quantization coefficient S, its expansion multiple 2^S_shift, and the second quantization coefficient S' is shown in formula (4):
S’=S*2^S_shift (4)
as can be seen from formula (2) above, when the fixed point number maximum Qmax and minimum Qmin are unchanged, the first quantization coefficient S (S1) is a fixed value for the same floating point data. Therefore, by formula (4), the precision of the second quantization coefficient S' is determined by the fixed-point quantization coefficient S_shift: the larger the value of S_shift, the higher the precision of S'. And since the second quantization coefficient S' is one of the parameters input to the logic circuit, a larger S_shift gives higher quantization precision in the logic circuit. A suitable fixed-point quantization coefficient S_shift may be preset according to the quantization precision requirement.
In this case, the binary form of the second quantization coefficient S' corresponds to shifting each bit of the binary form of the first quantization coefficient S to the left by S_shift positions (bits shifted out at the high end are discarded), with the vacated low-order positions filled with 0. The multiple expansion can thus be viewed as moving the high-precision (e.g., fractional) part of the first quantization coefficient S to higher bit positions, so that it is not truncated by rounding but retained in the second quantization coefficient S', which therefore has higher precision than the first quantization coefficient S before expansion.
In this way, the precision of the parameters input to the logic circuit can be improved. According to the parameter precision requirement, the expansion multiple of the first quantization coefficient can be adjusted by changing the value of the preset fixed-point quantization coefficient, making the way the precision of the first quantization coefficient is improved more flexible.
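As a small illustration (the helper name is hypothetical, not from this application), formula (4) can be sketched as follows; note that the result is still a floating point value:

```python
# Sketch of formula (4): S' = S * 2^S_shift.
# A larger S_shift moves more fractional bits of S above the later rounding point.
def expand_coefficient(s, s_shift):
    return s * (2 ** s_shift)

print(expand_coefficient(127.5, 8))  # -> 32640.0
```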
An exemplary method of determining the second zero point offset Zq' by the electronic device according to the embodiment of the present application based on step S4 is described below. In one possible implementation manner, in step S4, the multiplying and quantizing the first zero offset Z according to the preset fixed-point quantization coefficient s_shift and the first quantization coefficient S to obtain a second zero offset Zq', including:
The processor determines the expansion multiple of the first zero offset Z according to a preset fixed-point quantization coefficient S_shift; the processor rounds the product of the first quantization coefficient S, the expansion multiple of the first zero offset Z and the first zero offset Z to obtain a second zero offset Zq'.
In this way, the precision of the parameters input to the logic circuit can be improved. According to the parameter precision requirement, the expansion multiple of the first zero point offset can be adjusted by changing the value of the preset fixed-point quantization coefficient, making the way the precision of the first zero point offset is improved more flexible.
Wherein the expansion multiple of the first zero point offset Z is equal to the expansion multiple of the first quantization coefficient S; that is, the expansion multiple of the first zero point offset Z is equal to 2^S_shift. In this case, when the first zero point offset Z is multiple-expanded and quantized using the first quantization coefficient S and this expansion multiple, the resulting second zero point offset Zq' is expanded by 2^S_shift compared with the zero point offset Zq obtained in prior-art formula (3) by quantizing the zero point offset Z1 with the quantization coefficient S1 alone.
Using the same expansion multiple as for the first quantization coefficient means that, after the first zero point offset is multiple-expanded and quantized with the first quantization coefficient, the resulting second zero point offset has the same value range as the quantization result obtained when the second quantization coefficient is applied to the data to be quantized, so that the second zero point offset and the second quantization coefficient can participate together in the arithmetic operations in the logic circuit.
The relation among the first quantization coefficient S, the expansion multiple 2^S_shift of the first zero point offset Z, the first zero point offset Z, and the second zero point offset Zq' is shown in formula (5):
Zq’=round(Z*S*2^S_shift)=round(Z*S’) (5)
according to the description of the prior art above, the first zero point offset Z (zero point offset Z1) is a fixed value for the same floating point data; and as can be seen from formula (2) above, when the fixed point number maximum Qmax and minimum Qmin are unchanged, the first quantization coefficient S (S1) is also a fixed value for the same floating point data. Therefore, by formula (5), the precision of the second zero point offset Zq' is determined by the expansion multiple 2^S_shift of the first zero point offset Z: the larger the value of the fixed-point quantization coefficient S_shift, the higher the precision of Zq'. And since the second zero point offset Zq' is one of the parameters input to the logic circuit, a larger S_shift gives higher quantization precision in the logic circuit.
Since the expansion multiple of the first zero point offset Z equals the expansion multiple of the first quantization coefficient S, the product of the first quantization coefficient S and the expansion multiple equals the second quantization coefficient S' (see formula (4)), so the second zero point offset Zq' may also be regarded as the rounding of the product of the first zero point offset Z and the second quantization coefficient S'. The binary form of the product Z*S' corresponds to shifting each bit of the binary form of the product Z*S to the left by S_shift positions (bits shifted out at the high end are discarded), with the vacated low-order positions filled with 0. The precision of the product Z*S' is therefore higher than that of the pre-expansion product Z*S, and the precision of the second zero point offset Zq' = round(Z*S') (see formula (5)) is higher than that of the zero point offset Zq obtained by rounding the product of the zero point offset and the first quantization coefficient (see formula (3)). In other words, the multiple expansion moves the high-precision (e.g., fractional) part of the first zero point offset Z to higher bit positions, where it is not truncated by rounding but retained; the second zero point offset Zq' thus preserves the high-precision part of Z, has higher precision than the prior-art zero point offset Zq, and is closer to the first zero point offset Z. The parameter precision of the input to the logic circuit can thereby be improved.
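Continuing the illustrative numbers used above, a sketch of formula (5) shows the fractional part of Z*S surviving the rounding:

```python
# Sketch of formula (5): Zq' = round(Z * S * 2^S_shift) = round(Z * S').
# Names and values are illustrative; Z and S as in the earlier snippets.
def expand_zero_offset(z, s, s_shift):
    return round(z * s * (2 ** s_shift))

print(expand_zero_offset(1.0, 127.5, 8))  # -> 32640 exactly, vs round(1.0 * 127.5) = 128
```

Here Z*S = 127.5 loses its fractional half when rounded directly, while the expanded value 32640 is exact.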
Step S4 may be performed after step S2 is completed. The embodiment of the present application does not limit the execution order of S3 and S4.
Because of the multiple expansion, when the logic circuit performs step S5 using the second quantization coefficient, the first quantization result it obtains is also multiple-expanded; its value range therefore cannot satisfy the preset range of the final quantization result of the floating point data. Step S5 accordingly obtains the second fixed point data by shifting, so that the value of the second fixed point data satisfies the preset range of the final quantization result of the floating point data.
An exemplary method by which the electronic device determines the first quantized result based on step S5 and determines the final quantized result of the floating point data based on the first quantized result according to an embodiment of the present application is described below.
In a possible implementation manner, in step S5, the logic circuit quantizing the first fixed point data X' according to the second quantization coefficient S' and the second zero point offset Zq' through a floating point multiplication operation and a fixed point addition operation to obtain the first quantization result includes:
the logic circuit rounding the product of the second quantization coefficient S' and the first fixed point data X' to obtain a second quantization result; and the logic circuit obtaining the first quantization result from the sum of the second quantization result and the second zero point offset Zq'.
For example, the second quantization coefficient S' is expanded by a factor of 2^S_shift relative to the first quantization coefficient (the quantization coefficient S1 of the prior art); the product X'*S' of the second quantization coefficient S' and the first fixed point data X' is therefore also expanded by 2^S_shift, and its rounded value round(X'*S') (the second quantization result) is a rounding of the expanded value. As described above, the second zero point offset Zq' is the rounding of the product of the first zero point offset Z and the second quantization coefficient S', so Zq' is likewise an expanded, rounded value. Because the expansion multiple is the same (2^S_shift), the second zero point offset Zq' can be added to the second quantization result by a fixed point addition, giving the first quantization result round(X'*S')+Zq'. In this case, the first quantization result is a multiple-expanded quantization result, and the expansion multiple is equal to 2^S_shift.
In this way, a multiple-expanded second quantization result is obtained through the floating point multiplication of the second quantization coefficient and the data to be quantized, and a multiple-expanded first quantization result is obtained through the fixed point addition of the second quantization result and the second zero point offset, so the high-precision attributes of the data to be quantized are retained in the first quantization result, improving its precision.
In a possible implementation manner, in step S5, shifting the first quantization result according to the preset fixed-point quantization coefficient S_shift to obtain the second fixed point data Xq' includes:
the logic circuit shifting the first quantization result to the right, where the number of shifted bits is equal to the preset fixed-point quantization coefficient S_shift.
The relationship among the first quantization result, the preset fixed-point quantization coefficient S_shift, and the second fixed point data Xq' may be as shown in formula (6):

Xq' = (round(X'*S') + Zq') >> S_shift    (6)
 >" indicates">
where ">>" indicates a rightward shift. Since expansion by the multiple 2^S_shift is performed when the second quantization coefficient S' and the second zero point offset Zq' are determined, the first quantization result is also expanded by 2^S_shift. Based on this, in step S5 the first quantization result can be scaled down by the corresponding multiple through a shift operation, yielding the second fixed point data. In this way, the second fixed point data falls within the preset fixed point number value range [Qmin, Qmax], so that the convolution operation can subsequently be performed on the second fixed point data.
Because the high-precision attribute of the data to be quantized is retained in the first quantization result, shifting the first quantization result to obtain the final quantization result improves the precision of the final quantization result. Moreover, because the number of shifted bits equals the preset fixed-point quantization coefficient, the value range of the shifted final quantization result meets the quantization requirement.
The shift is a bit operation. A right shift moves all bits of the binary representation rightward by the shift bit number (the fixed-point quantization coefficient) S_shift; the low-order bits shifted out are discarded, and the vacated high-order bits are filled with the sign bit, i.e., zeros for a positive number and ones for a negative number. Shifting right by S_shift bits is thus equivalent to dividing the first quantization result by 2^S_shift and rounding the quotient down.
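The shift semantics can be reproduced directly on Python integers; note that an arithmetic right shift rounds the quotient downward (toward negative infinity), as the hypothetical values below show.

```python
def rescale(first_result: int, s_shift: int) -> int:
    # Arithmetic right shift: equivalent to floor(first_result / 2**s_shift);
    # low-order bits are discarded and the sign is preserved.
    return first_result >> s_shift

print(rescale(93, 4))    # 5  == floor(93 / 16)
print(rescale(-40, 4))   # -3 == floor(-40 / 16), rounded toward negative infinity
```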
In the logic circuit, a floating point multiplication operation is performed to obtain the product of the first fixed point data X' and the second quantization coefficient S', a fixed point addition operation is performed to obtain the first quantization result, and a shift operation is performed to obtain the second fixed point data. In this way, the logic circuit only needs to perform floating point multiplication, fixed point addition and shift operations; its area and power consumption are small, and the neural network quantization can be realized at low hardware cost.
In this case, in the neural network, the convolution operation may be performed based on the second fixed point data, whose accuracy has been improved.
Fig. 6 shows an exemplary application scenario of an electronic device according to an embodiment of the present application. Referring to Fig. 6, the processor performs step S3 above to process the floating point data and obtain the data to be quantized (the first fixed point data); the data to be quantized is fixed point data obtained by the processor quantizing the floating point data, and a fixed point neural network model may be deployed based on it. For example, the processor performs step S3 to quantize the floating point input data and the floating point parameters into fixed point input data and fixed point parameters (the first fixed point data), thereby obtaining the fixed point neural network and its fixed point input data. During this quantization, one group of first quantization coefficient and first zero point offset is obtained for the floating point input data, and multiple groups of first quantization coefficients and first zero point offsets are obtained for the multi-layer parameters of the neural network, one group per layer. Because the values of the input data and of the parameters of each layer may differ, their floating point number maximum and minimum values may also differ, so these groups of first quantization coefficients and first zero point offsets differ from one another. Since the same fixed point number maximum value and fixed point number minimum value are used throughout, the data to be quantized obtained by the processor in step S3 all lie within the same preset range, that is, between Qmin and Qmax.
The processor may perform step S4 above to process the first quantization coefficients and the first zero point offsets, obtaining the second quantization coefficients and the second zero point offsets. Each group of first quantization coefficient and first zero point offset corresponds to a group of second quantization coefficient, second zero point offset, and fixed-point quantization coefficient. When these groups are provided as input parameters to the logic circuit, one group is input at a time.
For example, during operation of the fixed point neural network model, the logic circuit may obtain an input parameter generated by the processor (a group of second quantization coefficient, second zero point offset, and fixed-point quantization coefficient) together with the corresponding first fixed point data, and perform the arithmetic and shift operations to obtain second fixed point data of higher accuracy corresponding to that first fixed point data. The convolution result output by the fixed point neural network model when convolving on the second fixed point data therefore has higher accuracy, and this convolution result may serve as input data of the next convolution layer for use in its convolution operation.
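The per-tensor parameter groups described above might be organized as follows; this container and its field values are hypothetical, not a structure defined by the document.

```python
from dataclasses import dataclass

@dataclass
class QuantParams:
    s2: float       # second quantization coefficient S'
    zq2: int        # second zero point offset Zq'
    s_shift: int    # preset fixed-point quantization coefficient S_shift

# One group per quantized tensor; the logic circuit receives one group at a
# time together with the matching first fixed point data.
params = {
    "input_data":    QuantParams(s2=6.99, zq2=9,  s_shift=4),   # illustrative values
    "layer1_weight": QuantParams(s2=3.41, zq2=-2, s_shift=4),   # illustrative values
}
```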
For example, assume the input data of the fixed point neural network and the parameters of each layer are in the same preset range (for example, -128 to 127), and the fixed point neural network obtained by the processor operates on the input data. When the fixed point neural network starts to operate, the first computation is the convolution of the network's input data with the parameters (weights) of its first layer. Here the input data may be fixed point input data obtained by the processor quantizing the floating point input data (an example of the first fixed point data X'), denoted a, and the parameter may be a fixed point parameter obtained by the processor quantizing the floating point parameters of the first layer (another example of the first fixed point data X'), denoted b. The logic circuit can process the input data a and the parameter b separately to obtain input data a1 and parameter b1 (second fixed point data) of higher accuracy. When the logic circuit processes the input data a, the input data a is input to the logic circuit as data to be quantized, together with the corresponding group of second quantization coefficient, second zero point offset (determined from the maximum and minimum values of the floating point data corresponding to a and the value range of the corresponding fixed point data), and fixed-point quantization coefficient; the logic circuit then computes and outputs the data a1. Similarly, when the logic circuit processes the parameter b, the parameter b is input as data to be quantized together with its own group of second quantization coefficient, second zero point offset, and fixed-point quantization coefficient, and the logic circuit computes and outputs the parameter b1. In the fixed point neural network, convolution is performed on the input data a1 and the parameter b1 to obtain a convolution result c1. The convolution result c1 has higher accuracy than the direct convolution of the input data a with the parameter b. Likewise, whenever the network convolves a previous convolution result with the weights of any layer, that convolution result already has improved accuracy, and the logic circuit processes the weights to obtain weights of improved accuracy, so the convolution result of every layer can reach higher accuracy. When the processor inversely quantizes the convolution result of any layer to obtain the corresponding floating point convolution result, the obtained floating point result is therefore more accurate and closer to the floating point convolution result of the corresponding layer of the original floating point neural network.
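A compact sketch of this layer flow follows. The tensors, coefficient values, and the use of a one-dimensional convolution are all illustrative; the document does not fix layer shapes or parameter values.

```python
import numpy as np

def refine(t, s2, zq2, s_shift):
    # Logic-circuit pass producing the higher-precision second fixed point data.
    return (np.round(t * s2).astype(np.int64) + zq2) >> s_shift

a = np.array([5, -3, 8])   # fixed point input data a (illustrative)
b = np.array([2, 1, -1])   # fixed point layer-1 weights b (illustrative)

a1 = refine(a, s2=6.99, zq2=9,  s_shift=4)   # each tensor uses its own group
b1 = refine(b, s2=3.41, zq2=-2, s_shift=4)
c1 = np.convolve(a1, b1)   # convolution result c1, computed on the refined tensors
```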
The working process of the logic circuit can be regarded as follows: the numerical range of the data to be quantized is first enlarged beyond the preset fixed point number range by mapping, so that precision is improved and a high-precision operation can be performed; the operation result is then mapped back into the preset fixed point number range, realizing quantization. In application, different fixed-point quantization coefficients S_shift can be set for different floating point data according to different requirements. For example, S_shift = M may be used for floating point data A and S_shift = N for floating point data B; when floating point data A has a higher precision requirement than floating point data B, M > N may be chosen. By setting different fixed-point quantization coefficients S_shift, the fixed point parameters and fixed point input data in the neural network model can each meet the required precision.
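As a sketch of this tuning knob, the same data can be pushed through formula (6) with two different S_shift values. The coefficients below are made-up numbers chosen only to show where the expansion enters; a larger S_shift rounds at a finer scale before the final shift.

```python
import numpy as np

x = np.array([3, -7, 12])   # data to be quantized (illustrative)
s1, z = 0.437, 1.3          # illustrative first quantization coefficient S1 and zero offset Z

for s_shift in (4, 12):     # e.g. N for data B and M for data A, with M > N
    s2 = s1 * (1 << s_shift)                 # second quantization coefficient S'
    zq2 = int(np.round(z * s2))              # second zero point offset Zq'
    first = np.round(x * s2).astype(np.int64) + zq2
    print(s_shift, first >> s_shift)         # final quantization result per S_shift
```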
Compared with fixed-point data obtained by quantization in the prior art, the second fixed-point data is closer in precision to the original floating-point data, so the accuracy of the output data of the quantized neural network can be improved.
In one possible implementation, the logic circuit proposed by the embodiments of the present application may be an arithmetic logic unit (arithmetic logical unit, ALU) for implementing the arithmetic and shift operations shown in formula (6). In this case, because the logic circuit uses only floating point multiplication, fixed point addition, and shifting, its area and power consumption cost are low, and these advantages become more pronounced as the degree of parallelism in parallel quantization processing increases; moreover, the higher precision of the parameters input to the logic circuit makes the quantization result of the logic circuit more accurate.
In a possible implementation manner, the electronic device according to an embodiment of the present application further includes a memory, where the memory is configured to store one or more of floating point data, a preset fixed point number maximum value, a preset fixed point number minimum value, first fixed point data, a second zero point offset, a second quantization coefficient, a preset fixed point quantization coefficient, and a final quantization result of data to be quantized.
In the above-mentioned different application scenarios, the data to be quantized may be fixed-point data obtained by quantizing floating-point data by the processor, or an intermediate result or a final result of the neural network operation.
In a possible implementation manner, the logic circuit in the embodiment of the present application quantizing the data to be quantized through a floating point multiplication operation and a fixed point addition operation according to the second quantization coefficient and the second zero point offset to obtain the first quantization result includes:
rounding the product of the second quantization coefficient and the data to be quantized to obtain a second quantization result; and obtaining the first quantized result according to the sum of the second quantized result and the second zero offset.
When the data to be quantized is fixed-point data obtained by quantizing floating-point data by the processor, reference may be made to step S5 and related description above.
In a possible implementation manner, the logic circuit of the embodiment of the present application shifts the first quantization result according to the preset fixed-point quantization coefficient to obtain a final quantization result of the floating point data, where the shifting includes:
and the logic circuit shifts the first quantization result to the right to obtain a final quantization result of the floating point data, wherein the number of shifted bits is equal to the preset fixed point quantization coefficient.
When the data to be quantized is fixed-point data obtained by quantizing floating-point data by the processor, reference may be made to step S5 and related description above.
The present application also proposes a neural network quantization method. Fig. 7 shows an exemplary workflow of the neural network quantization method according to an embodiment of the present application. As shown in Fig. 7, the method may be applied to the electronic device according to an embodiment of the present application and includes:
S1101, the processor determines a first zero point offset and a first quantization coefficient according to a maximum value and a minimum value in floating point data and a preset fixed point number maximum value and a preset fixed point number minimum value, wherein the floating point data comprises at least one of floating point parameters of a neural network or floating point input data;
S1102, the processor performs multiple expansion on the first quantized coefficient according to a preset fixed-point quantized coefficient to obtain a second quantized coefficient, and performs multiple expansion and quantization on the first zero offset according to the preset fixed-point quantized coefficient and the first quantized coefficient to obtain a second zero offset;
S1103, the logic circuit quantizes the data to be quantized through floating point multiplication operation and fixed point addition operation according to the second quantization coefficient and the second zero point offset to obtain a first quantization result, and shifts the first quantization result according to the preset fixed-point quantization coefficient to obtain a final quantization result of the data to be quantized.
Exemplary descriptions of the method can be found above and are not repeated here.
In one possible implementation manner, the processor performing multiple expansion on the first quantization coefficient according to a preset fixed-point quantization coefficient to obtain a second quantization coefficient includes: the processor determines the expansion multiple of the first quantization coefficient according to the preset fixed-point quantization coefficient; the processor obtains the second quantization coefficient according to the product of the first quantization coefficient and the expansion multiple of the first quantization coefficient.
In one possible implementation, the preset fixed-point quantization coefficient is an integer greater than or equal to 1, and the expansion multiple of the first quantization coefficient is equal to 2 raised to the power of the preset fixed-point quantization coefficient, i.e., 2^S_shift.
In one possible implementation manner, the multiplying and quantizing the first zero offset according to the preset fixed-point quantization coefficient and the first quantization coefficient to obtain a second zero offset includes: determining the expansion multiple of the first zero point offset according to a preset fixed-point quantization coefficient; and rounding the product of the first quantization coefficient, the expansion multiple of the first zero point offset and the first zero point offset to obtain the second zero point offset.
In one possible implementation, the expansion multiple of the first zero point offset is equal to the expansion multiple of the first quantization coefficient.
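This processor-side preparation can be sketched as below. The names are illustrative, and round-to-nearest is assumed for the rounding step, which the document does not specify further.

```python
import numpy as np

def prepare_coefficients(s1, z, s_shift):
    """Multiple expansion of S1 and Z (the method's step S1102), as a sketch.

    s1:      first quantization coefficient
    z:       first zero point offset
    s_shift: preset fixed-point quantization coefficient (integer >= 1)
    """
    expand = 1 << s_shift                   # expansion multiple 2**S_shift
    s2 = s1 * expand                        # second quantization coefficient S'
    zq2 = int(np.round(z * s1 * expand))    # second zero point offset Zq'
    return s2, zq2
```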
In one possible implementation manner, according to the second quantization coefficient and the second zero offset, the quantization of the data to be quantized is performed through floating point multiplication operation and fixed point addition operation, so as to obtain a first quantization result, which includes: rounding the product of the second quantization coefficient and the data to be quantized to obtain a second quantization result; and obtaining a first quantization result according to the sum of the second quantization result and the second zero offset.
In one possible implementation manner, the shifting the first quantization result according to the preset fixed-point quantization coefficient to obtain a final quantization result of the data to be quantized includes: the logic circuit shifts the first quantized result to the right to obtain a final quantized result of the data to be quantized, wherein the number of shifted bits is equal to a preset fixed-point quantized coefficient.
In one possible implementation, the method further includes: and storing one or more of the floating point data, a preset fixed point number maximum value, a preset fixed point number minimum value, the first fixed point data, the second zero point offset, the second quantization coefficient, the preset fixed point quantization coefficient and a final quantization result of the data to be quantized.
In one possible implementation, the logic circuit includes an arithmetic logic unit ALU.
In one possible implementation, the data to be quantized includes one or more of fixed-point data obtained by quantizing floating-point data by a processor, intermediate results in a neural network processing process, or final results.
Exemplary descriptions of the above methods may be found above and are not repeated here.
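For concreteness, the whole flow S1101-S1103 can be sketched end to end. The closed forms for the first quantization coefficient S1 and the first zero point offset Z below are an assumption, chosen so that [f_min, f_max] maps onto [Qmin, Qmax] under formula (6); the document states only that S1 and Z are derived from those four values. The data to be quantized is taken here directly as floating point values for illustration.

```python
import numpy as np

QMIN, QMAX = -128, 127   # preset fixed point number minimum / maximum (int8 example)

def quantize_end_to_end(x, f_min, f_max, s_shift):
    # S1101 (processor): first quantization coefficient and zero offset (assumed form).
    s1 = (QMAX - QMIN) / (f_max - f_min)
    z = QMIN / s1 - f_min
    # S1102 (processor): multiple expansion.
    s2 = s1 * (1 << s_shift)       # second quantization coefficient S'
    zq2 = int(np.round(z * s2))    # second zero point offset Zq'
    # S1103 (logic circuit): float multiply, fixed add, right shift.
    first = np.round(x * s2).astype(np.int64) + zq2
    return np.clip(first >> s_shift, QMIN, QMAX)

print(quantize_end_to_end(np.array([0.0, 0.5, 1.0]), 0.0, 1.0, 8))
# output (illustrative): [-128   -1  127]
```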
In one possible implementation, the present application proposes a non-transitory computer-readable storage medium, on which computer program instructions are stored, which when executed by a processor implement the above-described neural network quantization method.
In one possible implementation, the present application proposes a computer program product comprising a computer readable code, or a non-volatile computer readable storage medium carrying computer readable code, which when executed in a processor, performs the above-described neural network quantization method.
The foregoing is merely illustrative of the present invention, and the present invention is not limited thereto, and any person skilled in the art will readily recognize that variations or substitutions are within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (13)

  1. An electronic device, comprising a processor and a logic circuit, wherein:
    the processor is configured to:
    determining a first zero offset and a first quantization coefficient according to a maximum value and a minimum value in floating point data, and a preset fixed point number maximum value and a preset fixed point number minimum value, wherein the floating point data comprises at least one of floating point parameters of a neural network or floating point input data;
    performing multiple expansion on the first quantized coefficient according to a preset fixed-point quantized coefficient to obtain a second quantized coefficient, and performing multiple expansion and quantization on the first zero offset according to the preset fixed-point quantized coefficient and the first quantized coefficient to obtain a second zero offset;
    The logic circuit is used for quantizing the data to be quantized through floating point multiplication operation and fixed point addition operation according to the second quantization coefficient and the second zero point offset to obtain a first quantization result; and shifting the first quantization result according to the preset fixed-point quantization coefficient to obtain a final quantization result of the data to be quantized.
  2. The electronic device of claim 1, wherein the processor performing multiple expansion on the first quantization coefficient according to a preset fixed-point quantization coefficient to obtain a second quantization coefficient comprises:
    the processor determines expansion multiples of the first quantization coefficient according to the preset fixed-point quantization coefficient;
    the processor obtains the second quantized coefficient according to the product of the first quantized coefficient and the expansion multiple of the first quantized coefficient.
  3. The electronic device of claim 2, wherein the preset fixed-point quantization coefficient is an integer greater than or equal to 1, and the expansion multiple of the first quantization coefficient is equal to 2 raised to the power of the preset fixed-point quantization coefficient.
  4. The electronic device of claim 3, wherein multiplying and quantizing the first zero offset according to the preset fixed-point quantization coefficient and the first quantization coefficient to obtain the second zero offset comprises:
    Determining the expansion multiple of the first zero point offset according to the preset fixed point quantization coefficient;
    and rounding the product of the first quantization coefficient, the expansion multiple of the first zero point offset and the first zero point offset to obtain the second zero point offset.
  5. The electronic device of claim 4, wherein the expansion multiple of the first zero point offset is equal to the expansion multiple of the first quantization coefficient.
  6. The electronic device of any of claims 1-5, wherein quantizing the data to be quantized by a floating point multiplication operation and a fixed point addition operation according to the second quantization coefficient and the second zero point offset, to obtain a first quantization result, comprises:
    rounding the product of the second quantization coefficient and the data to be quantized to obtain a second quantization result;
    and obtaining the first quantized result according to the sum of the second quantized result and the second zero offset.
  7. The electronic device of any of claims 3-6, wherein shifting the first quantization result according to the preset fixed point quantization coefficient results in a final quantization result of the data to be quantized, comprises:
    And the logic circuit shifts the first quantization result to the right to obtain a final quantization result of the data to be quantized, wherein the number of shifted bits is equal to the preset fixed-point quantization coefficient.
  8. The electronic device of any of claims 1-7, further comprising a memory configured to store one or more of the floating point data, the preset fixed point number maximum value, the preset fixed point number minimum value, the first fixed point data, the second zero point offset, the second quantization coefficient, the preset fixed-point quantization coefficient, and a final quantization result of the data to be quantized.
  9. The electronic device of any one of claims 1-8, wherein the logic circuit comprises an arithmetic logic unit ALU.
  10. The electronic device of any of claims 1-9, wherein the data to be quantized comprises one or more of fixed-point data obtained by quantizing floating-point data by a processor, intermediate results in a neural network processing process, or final results.
  11. A method for quantifying a neural network, the method comprising:
    The processor determines a first zero offset and a first quantization coefficient according to a maximum value and a minimum value in floating point data, a preset fixed point number maximum value and a preset fixed point number minimum value, wherein the floating point data comprises at least one of floating point parameters of a neural network or floating point input data;
    the processor performs multiple expansion on the first quantized coefficient according to a preset fixed-point quantized coefficient to obtain a second quantized coefficient, and performs multiple expansion and quantization on the first zero offset according to the preset fixed-point quantized coefficient and the first quantized coefficient to obtain a second zero offset;
    the logic circuit quantizes the data to be quantized through floating point multiplication operation and fixed point addition operation according to the second quantization coefficient and the second zero offset to obtain a first quantization result; and shifting the first quantization result according to the preset fixed-point quantization coefficient to obtain a final quantization result of the data to be quantized.
  12. A non-transitory computer readable storage medium having stored thereon computer program instructions, which when executed by a processor, implement the method of claim 11.
  13. A computer program product comprising computer readable code, or a non-transitory computer readable storage medium carrying computer readable code, characterized in that the processor performs the method of claim 11 when the computer readable code is run in the processor.