WO2023060959A1 - Neural network model quantification method, system and device, and computer-readable medium - Google Patents

Neural network model quantification method, system and device, and computer-readable medium Download PDF

Info

Publication number
WO2023060959A1
Authority
WO
WIPO (PCT)
Prior art keywords
model
quantization
neural network
network model
layer
Prior art date
Application number
PCT/CN2022/105317
Other languages
French (fr)
Chinese (zh)
Inventor
陈其宾
李锐
张晖
Original Assignee
山东浪潮科学研究院有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 山东浪潮科学研究院有限公司
Publication of WO2023060959A1 publication Critical patent/WO2023060959A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 Computing arrangements using knowledge-based models
    • G06N5/04 Inference or reasoning models

Definitions

  • The present invention relates to the technical field of neural networks, and in particular to a neural network model quantization method, system, device, and computer-readable medium.
  • In recent years, neural network models have been widely used in many fields and have achieved very good results.
  • However, because of their high complexity and large size, neural network models suffer from low inference efficiency and long inference times, especially when running on low-performance mobile devices and low-power devices. Therefore, how to design a model with low resource consumption that can predict in real time while guaranteeing prediction accuracy has become a practical problem.
  • On low-power devices such as microcontrollers, models with low resource consumption are required.
  • In fields with strict real-time requirements, models that can predict in real time are required.
  • This problem is generally approached by designing an efficient model architecture, designing a model architecture suited to specific hardware, network pruning, knowledge distillation, or model quantization. Among these, model quantization has achieved good results: quantizing a model from floating-point to fixed-point types effectively reduces the model size and improves inference speed.
  • The technical task of the present invention is to address the above deficiencies by providing a neural network model quantization method, system, device, and computer-readable medium that solve the technical problem of how to avoid calculating activation value quantization factors during inference.
  • The neural network model quantization method of the present invention calculates the activation value quantization factor of each layer of the neural network model by minimizing an error equation before the model performs inference; the method includes the following steps (a consolidated sketch of these steps appears at the end of this section):
  • For the target model, compute the maximum absolute value of the model weights and derive the model weight quantization factor from the quantization range;
  • For each layer of the target model, perform model inference with the quantized fixed-point weights and activation values, and dequantize the inference result into an int32 data type;
  • Each operator is quantized by asymmetric quantization: the floating-point model weights are quantized into an int8 data type and the activation values into a uint8 data type, yielding the final quantized model.
  • The model weights are quantized to the int8 type, with a quantization range of [-128, 127].
  • A test data set is obtained, the mean square error between the quantized and unquantized output of each layer of the target model is computed on that data set, and the activation value quantization factor is obtained by minimizing the mean square error.
  • The mean square error formula is:

    $$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2$$

    where $y_i$ denotes the unquantized output and $\hat{y}_i$ the quantized output.
  • The neural network model quantization system of the present invention is used to calculate, before the neural network model performs inference, the activation value quantization factor of each layer by minimizing an error equation; the system includes:
  • a construction and training module, used to construct a neural network model and train it to obtain a floating-point neural network model as the target model;
  • a quantization factor calculation module, applied to the target model and used to compute the maximum absolute value of the model weights and derive the model weight quantization factor from the quantization range;
  • an activation value quantization factor calculation module, applied to each layer of the target model and used to calculate the layer's activation value quantization factor by minimizing the mean square error;
  • an inference dequantization module, applied to each layer of the target model and used to perform model inference with the quantized fixed-point weights and activation values and to dequantize the inference result into an int32 data type;
  • a final quantization module, applied to each layer of the target model and used to quantize each operator by asymmetric quantization, quantizing the floating-point model weights into an int8 data type and the activation values into a uint8 data type to obtain the final quantized model.
  • The model weights are quantized to the int8 type, with a quantization range of [-128, 127].
  • The activation value quantization factor calculation module calculates the activation value quantization factor as follows:
  • obtain the test data set, compute the mean square error between the quantized and unquantized output of each layer of the target model, and obtain the activation value quantization factor by minimizing the mean square error.
  • The mean square error formula is:

    $$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2$$

    where $y_i$ denotes the unquantized output and $\hat{y}_i$ the quantized output.
  • The device of the present invention includes at least one memory and at least one processor;
  • the at least one memory is used to store a machine-readable program;
  • the at least one processor is used to invoke the machine-readable program to execute the method of any one of the first aspect.
  • The computer-readable medium of the present invention stores computer instructions which, when executed by a processor, cause the processor to execute the method of any one of the first aspect.
  • The neural network model quantization method, system, device, and computer-readable medium of the present invention have the following advantage: the activation value quantization factor of each layer is computed in advance of model inference by minimizing the mean square error, which preserves model accuracy while improving inference speed, since no quantization factor has to be computed at inference time.
  • Fig. 1 is a flow chart of the neural network model quantization method of Embodiment 1.
  • Embodiments of the present invention provide a neural network model quantization method, system, device, and computer-readable medium for solving the technical problem of how to avoid calculating activation value quantization factors during inference.
  • The neural network model quantization method of the present invention calculates the activation value quantization factor of each layer of the neural network model by minimizing an error equation; the method includes the following steps:
  • The quantization factor of the model weights is calculated in step S200.
  • The quantization factor of the model weights is derived from the quantization range.
  • The model weights are quantized into the int8 type, so the quantization range is [-128, 127].
  • Step S300 obtains the activation value quantization factor of each layer by minimizing the mean square error: based on a portion of the test data set, the mean square error between each layer's quantized and unquantized output is computed, and the scale that minimizes this error is taken as the activation value quantization factor.
  • The mean square error formula is:

    $$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2$$

    where $y_i$ denotes the unquantized output and $\hat{y}_i$ the quantized output.
  • Each layer of the model repeats steps S400 and S500, and the model output is finally obtained.
  • The neural network model quantization system of the present invention is used to calculate, before the neural network model performs inference, the activation value quantization factor of each layer by minimizing an error equation.
  • The system includes a construction and training module, a quantization factor calculation module, an activation value quantization factor calculation module, an inference dequantization module, and a final quantization module.
  • The construction and training module is used to construct a neural network model and train it to obtain a floating-point neural network model as the target model; the quantization factor calculation module is applied to the target model and computes the maximum absolute value of the model weights to derive the weight quantization factor from the quantization range; the activation value quantization factor calculation module is applied to each layer of the target model and calculates the layer's activation value quantization factor by minimizing the mean square error; the inference dequantization module is applied to each layer of the target model, performs model inference with the quantized fixed-point weights and activation values, and dequantizes the inference result into an int32 data type; the final quantization module is applied to each layer of the target model and quantizes each operator by asymmetric quantization, quantizing the floating-point model weights into an int8 data type and the activation values into a uint8 data type to obtain the final quantized model.
  • The model weights are quantized to the int8 type, with a quantization range of [-128, 127].
  • The activation value quantization factor calculation module calculates the activation value quantization factor as follows: obtain the test data set, compute the mean square error between the quantized and unquantized output of each layer of the target model, and obtain the activation value quantization factor by minimizing the mean square error.
  • The mean square error formula is:

    $$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2$$

    where $y_i$ denotes the unquantized output and $\hat{y}_i$ the quantized output.
  • The device of the present invention includes at least one memory and at least one processor; the at least one memory stores a machine-readable program, and the at least one processor invokes the machine-readable program to execute the method disclosed in Embodiment 1 of the present invention.
  • An embodiment of the present invention also provides a computer-readable medium storing computer instructions which, when executed by a processor, cause the processor to execute the method disclosed in Embodiment 1 of the present invention.
  • A system or device equipped with a storage medium may be provided, the storage medium storing software program code that realizes the functions of any of the above embodiments, with the computer (or the CPU or MPU) of the system or device reading and executing the program code stored in the storage medium.
  • The program code read from the storage medium can itself realize the functions of any of the above embodiments, so the program code and the storage medium storing it form part of the present invention.
  • Examples of storage media for providing the program code include floppy disks, hard disks, magneto-optical disks, optical disks (such as CD-ROM, CD-R, CD-RW, DVD-ROM, DVD-RAM, DVD-RW, and DVD+RW), magnetic tape, non-volatile memory cards, and ROM.
  • The program code can be downloaded from a server computer via a communication network.
  • The program code read from the storage medium may be written into a memory provided on an expansion board inserted into the computer, or into a memory provided in an expansion unit connected to the computer, after which the instructions of the program code cause a CPU mounted on the expansion board or expansion unit to perform part or all of the actual operations, thereby realizing the functions of any of the above embodiments.
  • the hardware unit may be implemented mechanically or electrically.
  • a hardware unit may include permanently dedicated circuitry or logic (such as a dedicated processor, FPGA or ASIC) to perform the corresponding operations.
  • A hardware unit may also include programmable logic or circuits (such as a general-purpose processor or other programmable processor) that can be temporarily configured by software to complete the corresponding operations.
  • The specific implementation (mechanical means, a dedicated permanent circuit, or a temporarily configured circuit) can be determined based on cost and time considerations.
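
To make the five steps listed above concrete, here is a minimal NumPy sketch of the whole flow on a hypothetical single dense layer. The array shapes, the 127 divisor, the per-tensor scales, and the grid search are illustrative assumptions, not details fixed by this publication.

```python
import numpy as np

# Hypothetical one-layer model, used only to walk through the five steps.
rng = np.random.default_rng(0)
W = rng.standard_normal((16, 8)).astype(np.float32)   # trained float weights (step 1)
x = rng.standard_normal((4, 16)).astype(np.float32)   # small calibration batch

# Step 2: weight quantization factor from max |W| over the int8 range [-128, 127].
w_scale = np.abs(W).max() / 127.0
W_q = np.clip(np.round(W / w_scale), -128, 127).astype(np.int8)

# Step 3 (done before inference): pick the activation scale that minimizes the
# MSE between the float layer output and its quantize/dequantize round trip.
y = x @ W
def roundtrip_mse(s):
    zp = np.round(-y.min() / s)
    q = np.clip(np.round(y / s) + zp, 0, 255)          # uint8 range
    return np.mean((y - (q - zp) * s) ** 2)
candidates = np.linspace(np.ptp(y) / 255 * 0.5, np.ptp(y) / 255 * 1.5, 128)
a_scale = min(candidates, key=roundtrip_mse)           # stored for inference time

# Steps 4-5: integer inference with the pre-computed factors; the input is
# quantized asymmetrically to uint8, the matmul accumulates in int32, and the
# int32 result is rescaled back to float.
x_scale = np.ptp(x) / 255.0
x_zp = int(np.round(-x.min() / x_scale))
x_q = np.clip(np.round(x / x_scale) + x_zp, 0, 255).astype(np.uint8)
acc = (x_q.astype(np.int32) - x_zp) @ W_q.astype(np.int32)   # int32 result
y_hat = acc.astype(np.float32) * (x_scale * w_scale)         # dequantize
```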

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The present invention belongs to the technical field of neural networks, and disclosed are a neural network model quantification method, system and device, and a computer-readable medium, wherein the technical problem to be solved is how to avoid calculating activation value quantization factors during inference. The method comprises the following steps: constructing a neural network model and training the neural network model; for a target model, calculating a model weight quantization factor on the basis of a quantization range by calculating the maximum value of an absolute value of a model weight; for each layer of the target model, by minimizing a mean square error, calculating an activation value quantization factor of each layer of the target model; for each layer of the target model, performing model inference by means of a weight and activation value of a quantized fixed-point type, and inversely quantizing the inference result into an int32 data type; and for each layer of the target model, quantizing each operator by means of asymmetric quantization, quantizing a weight of a floating-point type model into an int8 data type, and quantizing an activation value into a uint8 data type.

Description

Neural network model quantization method, system, device and computer-readable medium

Technical Field

The present invention relates to the technical field of neural networks, and in particular to a neural network model quantization method, system, device, and computer-readable medium.

Background Art

In recent years, neural network models have been widely used in many fields and have achieved very good results. However, because of their high complexity and large size, neural network models suffer from low inference efficiency and long inference times, especially when running on low-performance mobile devices and low-power devices. Therefore, how to design a model with low resource consumption that can predict in real time while guaranteeing prediction accuracy has become a practical problem. On low-power devices such as microcontrollers, models with low resource consumption are required. In fields with strict real-time requirements, such as speech recognition and autonomous driving, models that can predict in real time are required. This problem is generally approached by designing an efficient model architecture, designing a model architecture suited to specific hardware, network pruning, knowledge distillation, or model quantization. Among these, model quantization has achieved good results: quantizing a model from floating-point to fixed-point types effectively reduces the model size while improving inference speed.

To improve model inference speed, how to avoid calculating activation value quantization factors during inference is the technical problem that needs to be solved.

Summary of the Invention

The technical task of the present invention is to address the above deficiencies by providing a neural network model quantization method, system, device, and computer-readable medium that solve the technical problem of how to avoid calculating activation value quantization factors during inference.
In a first aspect, the neural network model quantization method of the present invention calculates the activation value quantization factor of each layer of the neural network model by minimizing an error equation before the model performs inference. The method includes the following steps:

constructing a neural network model and training it to obtain a floating-point neural network model as the target model;

for the target model, computing the maximum absolute value of the model weights and deriving the model weight quantization factor from the quantization range;

for each layer of the target model, calculating the layer's activation value quantization factor by minimizing the mean square error;

for each layer of the target model, performing model inference with the quantized fixed-point weights and activation values, and dequantizing the inference result into an int32 data type;

for each layer of the target model, quantizing each operator by asymmetric quantization, quantizing the floating-point model weights into an int8 data type and the activation values into a uint8 data type, to obtain the final quantized model.
Preferably, for the target model, the model weights are quantized to the int8 type, with a quantization range of [-128, 127].
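
As a hedged illustration of this preferred weight scheme, the following NumPy sketch derives the weight quantization factor from the maximum absolute weight and the [-128, 127] range. Dividing by 127 so that the largest magnitude stays representable is a common convention assumed here, not a detail stated in the text, and the weight shape is hypothetical.

```python
import numpy as np

def weight_quant_factor(w: np.ndarray) -> float:
    # Symmetric int8 scheme: the largest |w| is mapped to the edge of [-128, 127].
    return np.abs(w).max() / 127.0

w = np.random.randn(64, 32).astype(np.float32)       # stand-in layer weights
scale = weight_quant_factor(w)
w_q = np.clip(np.round(w / scale), -128, 127).astype(np.int8)
w_restored = w_q.astype(np.float32) * scale          # dequantized approximation
```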
Preferably, a test data set is obtained, the mean square error between the quantized and unquantized output of each layer of the target model is computed on the test data set, and the activation value quantization factor is obtained by minimizing the mean square error.
Preferably, the mean square error formula is:

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2$$

where $y_i$ denotes the unquantized output and $\hat{y}_i$ the quantized output.
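
A minimal sketch of this calibration step, assuming a uint8 target range and a simple grid search over candidate scales (the publication does not fix a particular minimization procedure, so the search strategy and the 0.5x-1.5x search window are assumptions):

```python
import numpy as np

def activation_quant_factor(y_float: np.ndarray, n_candidates: int = 256) -> float:
    """Return the scale whose uint8 quantize/dequantize round trip minimizes
    the mean square error against the unquantized layer output."""
    base = max(np.ptp(y_float), 1e-8) / 255.0        # naive min/max scale
    best_s, best_mse = base, np.inf
    for s in np.linspace(0.5 * base, 1.5 * base, n_candidates):
        zp = np.round(-y_float.min() / s)            # uint8 zero point
        q = np.clip(np.round(y_float / s) + zp, 0, 255)
        mse = np.mean((y_float - (q - zp) * s) ** 2) # the MSE objective above
        if mse < best_mse:
            best_s, best_mse = s, mse
    return float(best_s)
```

In practice the search would run once per layer over outputs collected from the calibration portion of the test data, so that nothing has to be minimized at inference time.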
In a second aspect, the neural network model quantization system of the present invention is used to calculate, before the neural network model performs inference, the activation value quantization factor of each layer by minimizing an error equation. The system includes:

a construction and training module, used to construct a neural network model and train it to obtain a floating-point neural network model as the target model;

a quantization factor calculation module, applied to the target model and used to compute the maximum absolute value of the model weights and derive the model weight quantization factor from the quantization range;

an activation value quantization factor calculation module, applied to each layer of the target model and used to calculate the layer's activation value quantization factor by minimizing the mean square error;

an inference dequantization module, applied to each layer of the target model and used to perform model inference with the quantized fixed-point weights and activation values and to dequantize the inference result into an int32 data type;

a final quantization module, applied to each layer of the target model and used to quantize each operator by asymmetric quantization, quantizing the floating-point model weights into an int8 data type and the activation values into a uint8 data type, to obtain the final quantized model.
Preferably, for the target model, the model weights are quantized to the int8 type, with a quantization range of [-128, 127].

Preferably, the activation value quantization factor calculation module calculates the activation value quantization factor as follows:

obtain the test data set, compute the mean square error between the quantized and unquantized output of each layer of the target model, and obtain the activation value quantization factor by minimizing the mean square error.
Preferably, the mean square error formula is:

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2$$

where $y_i$ denotes the unquantized output and $\hat{y}_i$ the quantized output.
In a third aspect, the device of the present invention includes at least one memory and at least one processor;

the at least one memory is used to store a machine-readable program;

the at least one processor is used to invoke the machine-readable program to execute the method of any one of the first aspect.

In a fourth aspect, the computer-readable medium of the present invention stores computer instructions which, when executed by a processor, cause the processor to execute the method of any one of the first aspect.

The neural network model quantization method, system, device, and computer-readable medium of the present invention have the following advantage: the activation value quantization factor of each layer is computed in advance of model inference by minimizing the mean square error, which preserves model accuracy while improving inference speed, since no quantization factor has to be computed at inference time.
Description of Drawings

To explain the technical solutions in the embodiments of the present invention more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and those of ordinary skill in the art can derive other drawings from them without creative effort.

The present invention is further described below with reference to the drawings.

Fig. 1 is a flow chart of the neural network model quantization method of Embodiment 1.

Detailed Description

The present invention is further described below with reference to the drawings and specific embodiments, so that those skilled in the art can better understand and implement it. The embodiments given do not limit the present invention, and, where no conflict arises, the embodiments of the present invention and the technical features therein may be combined with one another.

Embodiments of the present invention provide a neural network model quantization method, system, device, and computer-readable medium for solving the technical problem of how to avoid calculating activation value quantization factors during inference.

Embodiment 1:
The neural network model quantization method of the present invention calculates the activation value quantization factor of each layer of the neural network model by minimizing an error equation before the model performs inference. The method includes the following steps:

S100: construct a neural network model and train it to obtain a floating-point neural network model as the target model;

S200: for the target model, compute the maximum absolute value of the model weights and derive the model weight quantization factor from the quantization range;

S300: for each layer of the target model, calculate the layer's activation value quantization factor by minimizing the mean square error;

S400: for each layer of the target model, perform model inference with the quantized fixed-point weights and activation values, and dequantize the inference result into an int32 data type;

S500: for each layer of the target model, quantize each operator by asymmetric quantization, quantizing the floating-point model weights into an int8 data type and the activation values into a uint8 data type, to obtain the final quantized model.
In this embodiment, step S200 calculates the quantization factor of the model weights: the maximum absolute value of the weights is computed, and the factor is derived from the quantization range. The weights are quantized to the int8 type, so the quantization range is [-128, 127].
Step S300 obtains the activation value quantization factor of each layer by minimizing the mean square error: based on a portion of the test data set, the mean square error between each layer's quantized output and its unquantized output is computed, and the scale that minimizes this error is taken as the activation value quantization factor. The mean square error formula is:

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2$$

where $y_i$ denotes the unquantized output and $\hat{y}_i$ the quantized output.
Each layer of the model repeats steps S400 and S500, and the model output is finally obtained.
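
A hedged sketch of this per-layer loop (steps S400 and S500) for dense layers, assuming per-tensor factors that were all computed beforehand; the layer structure and names are illustrative, not taken from the publication:

```python
import numpy as np

def quantized_layer(x, w_q, w_scale, a_scale, a_zp):
    """One S400/S500 round: quantize the activations to uint8 with the
    pre-computed factor, multiply against int8 weights with int32
    accumulation, then rescale the int32 result back to float."""
    x_q = np.clip(np.round(x / a_scale) + a_zp, 0, 255).astype(np.uint8)
    acc = (x_q.astype(np.int32) - a_zp) @ w_q.astype(np.int32)  # int32 result
    return acc.astype(np.float32) * (a_scale * w_scale)

def run_quantized_model(x, layers):
    # `layers` holds, per layer, the quantized weights and the factors
    # obtained in S200/S300; nothing is re-estimated during inference.
    for w_q, w_scale, a_scale, a_zp in layers:
        x = quantized_layer(x, w_q, w_scale, a_scale, a_zp)
    return x
```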
Embodiment 2:
The neural network model quantization system of the present invention is used to calculate, before the neural network model performs inference, the activation value quantization factor of each layer by minimizing an error equation. The system includes a construction and training module, a quantization factor calculation module, an activation value quantization factor calculation module, an inference dequantization module, and a final quantization module. The construction and training module is used to construct a neural network model and train it to obtain a floating-point neural network model as the target model. The quantization factor calculation module is applied to the target model and computes the maximum absolute value of the model weights to derive the weight quantization factor from the quantization range. The activation value quantization factor calculation module is applied to each layer of the target model and calculates the layer's activation value quantization factor by minimizing the mean square error. The inference dequantization module is applied to each layer of the target model, performs model inference with the quantized fixed-point weights and activation values, and dequantizes the inference result into an int32 data type. The final quantization module is applied to each layer of the target model, quantizes each operator by asymmetric quantization, quantizing the floating-point model weights into an int8 data type and the activation values into a uint8 data type, and thereby obtains the final quantized model.
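
The final quantization module's asymmetric scheme can be sketched as follows. The min/max-based scale and zero point shown here are one standard way to realize asymmetric uint8 quantization and are an assumption, since the publication does not spell out the formula:

```python
import numpy as np

def asym_quantize_uint8(x: np.ndarray):
    """Asymmetric quantization: scale and zero point cover [x.min(), x.max()]."""
    scale = max(float(x.max() - x.min()), 1e-8) / 255.0
    zero_point = int(np.round(-x.min() / scale))
    q = np.clip(np.round(x / scale) + zero_point, 0, 255).astype(np.uint8)
    return q, scale, zero_point

def asym_dequantize(q, scale, zero_point):
    return (q.astype(np.float32) - zero_point) * scale
```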
For the target model, the model weights are quantized to the int8 type, with a quantization range of [-128, 127].
The activation value quantization factor calculation module calculates the activation value quantization factor as follows: obtain the test data set, compute the mean square error between each layer's quantized and unquantized output, and take the scale that minimizes it as the activation value quantization factor. The mean square error formula is:

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2$$

where $y_i$ denotes the unquantized output and $\hat{y}_i$ the quantized output.
Embodiment 3:

The device of the present invention includes at least one memory and at least one processor; the at least one memory stores a machine-readable program, and the at least one processor invokes the machine-readable program to execute the method disclosed in Embodiment 1 of the present invention.

Embodiment 4:

An embodiment of the present invention also provides a computer-readable medium storing computer instructions which, when executed by a processor, cause the processor to execute the method disclosed in Embodiment 1 of the present invention. Specifically, a system or device equipped with a storage medium may be provided, the storage medium storing software program code that realizes the functions of any of the above embodiments, with the computer (or the CPU or MPU) of the system or device reading and executing the program code stored in the storage medium.
In this case, the program code read from the storage medium can itself realize the functions of any of the above embodiments, so the program code and the storage medium storing it form part of the present invention.

Examples of storage media for providing the program code include floppy disks, hard disks, magneto-optical disks, optical disks (such as CD-ROM, CD-R, CD-RW, DVD-ROM, DVD-RAM, DVD-RW, and DVD+RW), magnetic tape, non-volatile memory cards, and ROM. Alternatively, the program code can be downloaded from a server computer over a communication network.

In addition, it should be clear that the functions of any of the above embodiments can be realized not only by executing the program code read out by a computer, but also by having the operating system running on the computer perform part or all of the actual operations under the instructions of the program code.

Furthermore, the program code read from the storage medium may be written into a memory provided on an expansion board inserted into the computer, or into a memory provided in an expansion unit connected to the computer, after which the instructions of the program code cause a CPU mounted on the expansion board or expansion unit to perform part or all of the actual operations, thereby realizing the functions of any of the above embodiments.

It should be noted that not all of the steps and modules in the above flows and system structure diagrams are necessary; some steps or modules may be omitted according to actual needs. The execution order of the steps is not fixed and may be adjusted as required. The system structures described in the above embodiments may be physical or logical structures; that is, some modules may be realized by the same physical entity, some modules may be realized by several physical entities, and some modules may be realized jointly by components of independent devices.

In the above embodiments, a hardware unit may be implemented mechanically or electrically. For example, a hardware unit may include permanently dedicated circuits or logic (such as a dedicated processor, FPGA, or ASIC) to complete the corresponding operations. A hardware unit may also include programmable logic or circuits (such as a general-purpose processor or other programmable processor) that can be temporarily configured by software to complete the corresponding operations. The specific implementation (mechanical means, a dedicated permanent circuit, or a temporarily configured circuit) can be determined based on cost and time considerations.

The present invention has been shown and described in detail above with reference to the drawings and preferred embodiments; however, the present invention is not limited to the disclosed embodiments. Based on the above embodiments, those skilled in the art will appreciate that further embodiments can be obtained by combining means from the different embodiments, and such embodiments also fall within the protection scope of the present invention.

Claims (10)

  1. A neural network model quantization method, characterized in that, before a neural network model performs inference, the activation value quantization factor of each layer of the neural network model is calculated by minimizing an error equation, the method comprising the following steps:

    constructing a neural network model and training it to obtain a floating-point neural network model as the target model;

    for the target model, computing the maximum absolute value of the model weights and deriving the model weight quantization factor from the quantization range;

    for each layer of the target model, calculating the layer's activation value quantization factor by minimizing the mean square error;

    for each layer of the target model, performing model inference with the quantized fixed-point weights and activation values, and dequantizing the inference result into an int32 data type;

    for each layer of the target model, quantizing each operator by asymmetric quantization, quantizing the floating-point model weights into an int8 data type and the activation values into a uint8 data type, to obtain the final quantized model.
  2. The neural network model quantization method according to claim 1, characterized in that, for the target model, the model weights are quantized to the int8 type, with a quantization range of [-128, 127].
  3. The neural network model quantization method according to claim 1 or 2, characterized in that a test data set is obtained, the mean square error between the quantized and unquantized output of each layer of the target model is computed on the test data set, and the activation value quantization factor is obtained by minimizing the mean square error.
  4. The neural network model quantization method according to claim 3, characterized in that the mean square error formula is:

    $$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2$$

    where $y_i$ denotes the unquantized output and $\hat{y}_i$ the quantized output.
  5. A neural network model quantization system, characterized in that it is used to calculate, before a neural network model performs inference, the activation value quantization factor of each layer of the neural network model by minimizing an error equation, the system comprising:

    a construction and training module, used to construct a neural network model and train it to obtain a floating-point neural network model as the target model;

    a quantization factor calculation module, applied to the target model and used to compute the maximum absolute value of the model weights and derive the model weight quantization factor from the quantization range;

    an activation value quantization factor calculation module, applied to each layer of the target model and used to calculate the layer's activation value quantization factor by minimizing the mean square error;

    an inference dequantization module, applied to each layer of the target model and used to perform model inference with the quantized fixed-point weights and activation values and to dequantize the inference result into an int32 data type;

    a final quantization module, applied to each layer of the target model and used to quantize each operator by asymmetric quantization, quantizing the floating-point model weights into an int8 data type and the activation values into a uint8 data type, to obtain the final quantized model.
  6. The neural network model quantization system according to claim 5, characterized in that, for the target model, the model weights are quantized to the int8 type, with a quantization range of [-128, 127].
  7. The neural network model quantization system according to claim 5 or 6, characterized in that the activation value quantization factor calculation module calculates the activation value quantization factor as follows:

    obtaining the test data set, computing the mean square error between the quantized and unquantized output of each layer of the target model, and obtaining the activation value quantization factor by minimizing the mean square error.
  8. The neural network model quantization system according to claim 7, characterized in that the mean square error formula is:

    $$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2$$

    where $y_i$ denotes the unquantized output and $\hat{y}_i$ the quantized output.
  9. A device, characterized by comprising at least one memory and at least one processor;

    the at least one memory being used to store a machine-readable program;

    the at least one processor being used to invoke the machine-readable program to execute the method of any one of claims 1 to 4.

  10. A computer-readable medium, characterized in that computer instructions are stored on the computer-readable medium, and when executed by a processor, the computer instructions cause the processor to execute the method of any one of claims 1 to 4.
PCT/CN2022/105317 2021-10-13 2022-07-13 Neural network model quantification method, system and device, and computer-readable medium WO2023060959A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111190372.1A CN114021691A (en) 2021-10-13 2021-10-13 Neural network model quantification method, system, device and computer readable medium
CN202111190372.1 2021-10-13

Publications (1)

Publication Number Publication Date
WO2023060959A1 (en)

Family

ID=80056227

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/105317 WO2023060959A1 (en) 2021-10-13 2022-07-13 Neural network model quantification method, system and device, and computer-readable medium

Country Status (2)

Country Link
CN (1) CN114021691A (en)
WO (1) WO2023060959A1 (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114021691A (en) * 2021-10-13 2022-02-08 山东浪潮科学研究院有限公司 Neural network model quantification method, system, device and computer readable medium
CN114492778A (en) * 2022-02-16 2022-05-13 安谋科技(中国)有限公司 Operation method of neural network model, readable medium and electronic device
CN114821660A (en) * 2022-05-12 2022-07-29 山东浪潮科学研究院有限公司 Pedestrian detection inference method based on embedded equipment
CN115357381A (en) * 2022-08-11 2022-11-18 山东浪潮科学研究院有限公司 Memory optimization method and system for deep learning inference of embedded equipment

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111814955A (en) * 2020-06-19 2020-10-23 浙江大华技术股份有限公司 Method and apparatus for quantizing neural network model, and computer storage medium
CN111950716A (en) * 2020-08-25 2020-11-17 云知声智能科技股份有限公司 Quantification method and system for optimizing int8
CN111950715A (en) * 2020-08-24 2020-11-17 云知声智能科技股份有限公司 8-bit integer full-quantization inference method and device based on self-adaptive dynamic shift
US20210004679A1 (en) * 2019-07-01 2021-01-07 Baidu Usa Llc Asymmetric quantization for compression and for acceleration of inference for neural networks
CN112766484A (en) * 2020-12-30 2021-05-07 上海熠知电子科技有限公司 Floating point neural network model quantization system and method
CN114021691A (en) * 2021-10-13 2022-02-08 山东浪潮科学研究院有限公司 Neural network model quantification method, system, device and computer readable medium


Also Published As

Publication number Publication date
CN114021691A (en) 2022-02-08

Similar Documents

Publication Publication Date Title
WO2023060959A1 (en) Neural network model quantification method, system and device, and computer-readable medium
US11580719B2 (en) Dynamic quantization for deep neural network inference system and method
CN110210621B (en) Improved target detection method based on residual error network
CN110379416A (en) A kind of neural network language model training method, device, equipment and storage medium
CN110609474B (en) Data center energy efficiency optimization method based on reinforcement learning
JP2003076585A5 (en)
CN111078853B (en) Question-answering model optimization method, device, computer equipment and storage medium
CN110795235A (en) Method and system for deep learning and cooperation of mobile web
CN115567405A (en) Network flow gray prediction method based on self-adaptive feedback regulation mechanism
WO2022246986A1 (en) Data processing method, apparatus and device, and computer-readable storage medium
JP2013074365A (en) Method, program and system for processing kalman filter
CN112633516B (en) Performance prediction and machine learning compiling optimization method and device
CN112052945A (en) Neural network training method, neural network training device and electronic equipment
CN106971023B (en) Wheel disc special-shaped hole structure design method based on hyperelliptic curve
CN113673532B (en) Target detection method and device based on quantitative model
CN113709249B (en) Safe balanced unloading method and system for driving assisting service
CN108520299A (en) Activation value quantization method and device between grade
TWI763975B (en) System and method for reducing computational complexity of artificial neural network
CN113284102A (en) Fan blade damage intelligent detection method and device based on unmanned aerial vehicle
Puangpontip et al. Energy usage of deep learning in smart cities
CN110399121A (en) Private clound management system health degree design method, equipment and medium based on piecewise function
CN113991638B (en) Prediction method for generating power of new energy station aiming at different places
CN117931211A (en) Model deployment method, device, apparatus, chip and storage medium
CN111105019A (en) Neural network operation device and operation method
CN117648163A (en) Application migration CPU estimation method

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22879906

Country of ref document: EP

Kind code of ref document: A1