WO2023029349A1 - Model quantization method and apparatus, device, storage medium, computer program product, and computer program - Google Patents

Model quantization method and apparatus, device, storage medium, computer program product, and computer program

Info

Publication number
WO2023029349A1
Authority
WO
WIPO (PCT)
Prior art keywords
quantization
network model
layer
batch normalization
parameters
Prior art date
Application number
PCT/CN2022/071377
Other languages
French (fr)
Chinese (zh)
Inventor
李雨杭
沈明珠
马建
任岩
张琦
龚睿昊
余锋伟
Original Assignee
上海商汤智能科技有限公司
Priority date
Filing date
Publication date
Application filed by 上海商汤智能科技有限公司 filed Critical 上海商汤智能科技有限公司
Publication of WO2023029349A1 publication Critical patent/WO2023029349A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/082 Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks

Definitions

  • The embodiments of the present application relate to, but are not limited to, the field of artificial intelligence, and in particular relate to a model quantization method, apparatus, device, storage medium, computer program product, and computer program.
  • Model quantization can quantize the weights and activation values in a neural network from the original floating-point type to low-bit-width (such as 8-bit, 4-bit, 3-bit, 2-bit, etc.) integers. After the model is quantized, the storage space required for the quantized neural network model is reduced, and the calculation form changes from the original floating-point operations to lower-cost operations on low-bit-width integer data.
  • The embodiments of the present application provide a model quantization method, apparatus, device, storage medium, computer program product, and computer program.
  • An embodiment of the present application provides a model quantization method, the method comprising:
  • An embodiment of the present application provides a model quantization device, which includes:
  • the first acquisition part is configured to acquire a first network model to be quantized;
  • the first determining part is configured to determine at least one processing layer to be quantized in the first network model and a quantization parameter for quantizing each processing layer based on set deployment configuration information;
  • the quantization part is configured to perform quantization on each of the processing layers in the first network model according to the quantization parameter to obtain a second network model.
  • An embodiment of the present application provides a computer device, including a memory and a processor.
  • the memory stores a computer program that can run on the processor.
  • when the processor executes the program, some or all of the steps in the above method are implemented.
  • An embodiment of the present application provides a computer-readable storage medium, on which a computer program is stored, and when the computer program is executed by a processor, some or all of the steps in the above method are implemented.
  • An embodiment of the present application provides a computer program, including computer-readable codes;
  • when the computer-readable codes run in a computer device, a processor in the computer device executes some or all of the steps in the above method.
  • An embodiment of the present application provides a computer program product.
  • the computer program product includes a non-transitory computer-readable storage medium storing a computer program.
  • when the computer program is read and executed by a computer, some or all of the steps in the above method are implemented.
  • In the embodiments of the present application, the first network model to be quantized is obtained; based on the set deployment configuration information, at least one processing layer to be quantized in the first network model and the quantization parameters for each processing layer are determined; and each of the processing layers in the first network model is quantized according to the quantization parameters to obtain a second network model.
  • Since the processing layers to be quantized in the first network model and the quantization parameters for each processing layer to be quantized are determined based on the set deployment configuration information, full consideration is given, during model quantization, to the deployment configuration information of the hardware platform on which the model is deployed, so that the obtained second network model is deployable on the corresponding hardware platform.
  • Fig. 1 is a schematic diagram of the implementation flow of a model quantization method provided by an embodiment of the present application;
  • FIG. 2A is a schematic diagram of the implementation flow of a model quantization method provided by an embodiment of the present application;
  • FIG. 2B is a schematic diagram of inserting quantized nodes into a calculation graph of a basic block structure provided by an embodiment of the present application;
  • FIG. 2C is a schematic diagram of inserting quantized nodes into a calculation graph of a basic block structure provided by an embodiment of the present application.
  • FIG. 2D is a schematic diagram of inserting a quantization node into a calculation graph of a basic block structure provided by an embodiment of the present application;
  • FIG. 3A is a schematic diagram of the implementation flow of a model quantization method provided by an embodiment of the present application;
  • FIG. 3B is a schematic diagram of an implementation of a batch normalization layer folding strategy provided by an embodiment of the present application.
  • FIG. 3C is a schematic diagram of an implementation of a batch normalization layer folding strategy provided by an embodiment of the present application.
  • FIG. 3D is a schematic diagram of an implementation of a batch normalization layer folding strategy provided by an embodiment of the present application.
  • FIG. 3E is a schematic diagram of an implementation of a batch normalization layer folding strategy provided by an embodiment of the present application.
  • FIG. 3F is a schematic diagram of an implementation of a batch normalization layer folding strategy provided by an embodiment of the present application.
  • FIG. 4 is a schematic diagram of the implementation flow of a model quantization method provided by an embodiment of the present application;
  • Fig. 5 is a schematic diagram of the implementation flow of a model quantization method provided by an embodiment of the present application;
  • FIG. 6 is a schematic diagram of an application scenario of MQBench provided by the embodiment of the present application.
  • FIG. 7 is a schematic diagram of the composition and structure of a model quantization device provided in the embodiment of the present application.
  • FIG. 8 is a schematic diagram of a hardware entity of a computer device provided by an embodiment of the present application.
  • References to "some embodiments" describe a subset of all possible embodiments, but it should be understood that "some embodiments" may be the same subset or different subsets of all possible embodiments, and may be combined with each other without conflict.
  • The term "first/second/third" is only used to distinguish similar objects and does not represent a specific order of objects. It should be understood that, where permitted, the specific order or sequence of "first/second/third" may be interchanged, so that the embodiments of the application described herein can be practiced in sequences other than those illustrated or described herein.
  • the model quantization solution in the related art is first described.
  • the model quantization scheme often fails to be practically applied and deployed because it ignores the requirements of hardware deployment.
  • The hardware platform usually folds the calculation of the batch normalization (Batch Normalization, BN) layer into the convolutional layer to avoid extra overhead, but in the related art the BN layer is kept intact;
  • In the related art, only the input parameters and weight parameters of the convolutional layers are considered for quantization, but when the model is deployed, the entire calculation graph of the neural network model should be quantized, that is,
  • the input parameters and weight parameters of processing layers other than the convolutional layers also need to be quantized.
  • Therefore, the model quantization scheme in the related art inevitably reduces the deployability of the quantization algorithm.
  • In addition, because different quantization algorithms have different deployability on different hardware platforms, it is difficult in academic research to measure the performance and robustness of different quantization algorithms across different hardware and quantization methods.
  • FIG. 1 is a schematic diagram of the implementation process of a model quantification method provided in the embodiment of the present application. As shown in Figure 1, the method includes:
  • Step S101 acquiring a first network model to be quantized.
  • the first network model can be any suitable neural network model to be quantized, and can be a full-precision neural network model.
  • The first network model can be a neural network model with 32-bit floating-point parameters or 16-bit floating-point parameters; of course, this embodiment does not limit the floating-point precision of the first network model.
  • the first network model may adopt any suitable neural network structure, including but not limited to one or more of ResNet-18, ResNet-50, MobileNetV2, EfficientNet-Lite, RegNet and the like.
  • Step S102 based on the set deployment configuration information, determine at least one processing layer to be quantized in the first network model and a quantization parameter for quantizing each processing layer.
  • the deployment configuration information may include but not limited to one or more of the type of the deployed hardware, the inference engine used by the deployed hardware type, the model of the deployed hardware, the quantized bit width of the network model parameters corresponding to the deployed hardware type, and the like.
  • the deployment configuration information may be preset by the user, or may be default, or may be obtained from a configuration file of the target deployment hardware, which is not limited here.
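  • For illustration only, the deployment configuration information described above could be represented by a simple structure such as the following sketch; the field names (e.g. hardware_type, inference_engine, bit_width) and values are hypothetical and not prescribed by this embodiment:

```python
from dataclasses import dataclass

@dataclass
class DeployConfig:
    """Hypothetical container for the set deployment configuration information."""
    hardware_type: str        # type of the deployment hardware, e.g. a vendor name
    inference_engine: str     # inference engine used by that hardware type, e.g. "TensorRT"
    hardware_model: str = ""  # optional: concrete model of the deployment hardware
    bit_width: int = 8        # quantization bit width of the network model parameters

# Example: a user-set configuration; the values are illustrative only.
config = DeployConfig(hardware_type="gpu_vendor_a",
                      inference_engine="TensorRT",
                      bit_width=8)
```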
  • During implementation, the first network model may include multiple processing layers, such as one or more of an input layer, a convolutional layer, a pooling layer, a downsampling layer, a linear rectification unit, a fully connected layer, and a batch normalization layer. Since different deployment environments may have different support capabilities for model quantization, at least one processing layer to be quantized in the first network model may be determined based on the set deployment configuration information. During implementation, at least one processing layer to be quantized in the first network model may be determined in an appropriate manner based on the set deployment configuration information according to actual conditions, which is not limited in this embodiment of the present application.
  • For example, the correspondence between different deployment configuration information and the processing layers to be quantized can be determined in advance according to the actual situation, and at least one processing layer to be quantized in the first network model can be determined by querying this correspondence with the set deployment configuration information.
  • For example, for a first deployment hardware type or a first inference engine, it can be determined that only the convolutional layers in the first network model are quantized; for a second deployment hardware type or a second inference engine, it can be determined that each convolutional layer, the input layer, and the fully connected layers of the first network model are quantized; for a third inference engine, it can be determined that each convolutional layer, the input layer, the fully connected layers, and the element-wise addition calculation layers in the first network model are quantized.
  • In some embodiments, the parameters to be quantized in each processing layer of the at least one processing layer to be quantized in the first network model may also be determined based on the set deployment configuration information.
  • the quantization parameter for quantizing each processing layer may include, but not limited to, one or more of the preset accuracy of the quantization scale used in the process of quantizing the processing layer, quantization symmetry, quantization bit width, and quantization granularity, etc.
  • the preset precision of the quantization scale may include full precision, power of 2 precision, and the like.
  • Quantization symmetry can be either symmetric quantization or asymmetric quantization.
  • the quantization bit width may include one of 8 bits, 4 bits, 3 bits, 2 bits and so on.
  • Quantization granularity can be hierarchical quantization (that is, tensor-level quantization) or feature-level quantization (that is, channel-level quantization).
  • Based on the set deployment configuration information, the quantization parameter used in the quantization process of each processing layer to be quantized in the first network model can also be determined.
  • During implementation, those skilled in the art may determine the quantization parameter for quantizing each processing layer to be quantized in the first network model based on the set deployment configuration information in an appropriate manner according to the actual situation, which is not limited here.
  • For example, the correspondence between different deployment configuration information and quantization parameters can be determined in advance according to the actual situation, and the quantization parameters for quantizing each processing layer in the first network model can be determined by querying this correspondence with the set deployment configuration information.
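  • As a purely illustrative sketch, the quantization parameters determined for one processing layer might be grouped as follows; the options mirror those listed above (scale precision, symmetry, bit width, granularity), while the names and the engine entries in the lookup table are assumptions rather than part of this embodiment:

```python
from dataclasses import dataclass

@dataclass
class QuantParam:
    """Hypothetical per-layer quantization parameters."""
    scale_precision: str = "full"      # "full" or "power_of_two" precision of the quantization scale
    symmetry: str = "symmetric"        # "symmetric" or "asymmetric" quantization
    bit_width: int = 8                 # 8 / 4 / 3 / 2 bits, etc.
    granularity: str = "per_tensor"    # "per_tensor" (layer-wise) or "per_channel" (feature-level)

# Example: a predetermined correspondence between deployment configuration and
# quantization parameters, queried with the set deployment configuration information.
param_table = {"engine_a": QuantParam(symmetry="symmetric", granularity="per_channel"),
               "engine_b": QuantParam(symmetry="asymmetric", granularity="per_tensor")}
```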
  • Step S103 performing quantization on each of the processing layers in the first network model according to the quantization parameter to obtain a second network model.
  • any suitable quantization algorithm may be used according to the actual situation to quantize each processing layer in the first network model according to the quantization parameter to obtain the quantized second network model.
  • Quantization algorithms may include, but are not limited to, one or more of post-training quantization algorithms, quantization-aware training algorithms, and the like.
  • The post-training quantization algorithm refers to selecting appropriate quantization and calibration operations for a pre-trained network model to minimize the quantization loss; it can be post-training static quantization or post-training dynamic quantization.
  • The quantization-aware training algorithm refers to training the network during the quantization process, so that the network can adapt to the discontinuous distribution of integer values and reduce the loss of accuracy caused by quantization; it may include, but is not limited to, the Learned Step-size Quantization (LSQ) algorithm, the Parameterized Clipping Activation (PACT) algorithm, the Additive Powers-of-Two (APoT) quantization algorithm, the Differentiable Soft Quantization (DSQ) algorithm, the DoReFa-Net training algorithm, the Learned Quantization for Highly Accurate and Compact Deep Neural Networks (LQ-Net) algorithm, and the like.
  • In some embodiments, the calculation graph of the first network model can be extracted based on the network structure of the first network model; by inserting at least one quantization node to quantize at least one processing layer in the first network model, a calculation graph of the second network model is constructed, and quantization processing is performed on each processing layer to be quantized in the calculation graph of the second network model.
  • Here, the quantization parameter adopted by each quantization node is the quantization parameter for quantizing the corresponding processing layer, and the quantized second network model can be obtained based on the calculation graph of the second network model.
  • During implementation, any suitable quantization algorithm and training data can be used, according to the actual situation, to perform parameter training on the calculation graph of the second network model to obtain the trained calculation graph of the second network model, and the trained second network model is obtained based on the trained calculation graph of the second network model.
  • At least one quantization node is inserted into the calculation graph to construct the calculation graph of a suitable quantized neural network (that is, the calculation graph of the second network model).
  • Inserting a quantization node at a position in the calculation graph of the first network model is equivalent to quantizing the processing layer corresponding to the logical node at that position, so that determining the positions at which quantization nodes are inserted in the calculation graph of the first network model is equivalent to determining at least one processing layer to be quantized in the first network model.
  • In some embodiments, the deployment configuration information includes the inference engine used by the deployment hardware type; determining at least one processing layer to be quantized in the first network model based on the set deployment configuration information described in step S102 above may include:
  • Step S111 based on the inference engine, determine the processing layer type to be quantized
  • Step S112 determining at least one processing layer in the first network model that matches the processing layer type as the processing layer to be quantized.
  • the deployment hardware type is the hardware type of the target hardware on which the quantized second network model is deployed, and the reasoning engines used by different deployment hardware types may be the same or different, which is not limited here.
  • Inference engines can include but are not limited to TensorRT, ACL, TVM, SNPE, or FBGEMM, etc.
  • During implementation, the deployment hardware can be classified in an appropriate way according to the actual situation.
  • For example, the hardware can be classified by manufacturer, in which case the deployment hardware type is the manufacturer of the deployment hardware and the inference engine used by the deployment hardware type is the inference engine used by that manufacturer's hardware; the hardware can also be classified by specification and model, in which case the deployment hardware type is the model of the deployment hardware and the inference engine used by the deployment hardware type is the inference engine used by hardware of that model.
  • Different inference engines can support quantization of different types of processing layers.
  • The types of processing layers can include, but are not limited to, one or more of an input layer, a convolutional layer, a pooling layer, a downsampling layer, a linear rectification unit, a fully connected layer, a batch normalization layer, and the like.
  • During implementation, the correspondence between different inference engines and the types of processing layers to be quantized can be determined in advance, and based on this correspondence, the types of processing layers to be quantized corresponding to the inference engine adopted by the deployment hardware type can be determined.
  • each processing layer in the first network model may be matched with the processing layer type, and at least one matched processing layer may be determined as the processing layer to be quantized.
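  • For illustration only, the correspondence described above could be sketched as a lookup table from inference engine to the processing layer types it supports quantizing. The engine names and type lists below are hypothetical placeholders, not a statement about any particular engine:

```python
import torch.nn as nn

# Hypothetical mapping: inference engine -> processing layer types to be quantized.
QUANTIZABLE_TYPES = {
    "engine_a": (nn.Conv2d,),
    "engine_b": (nn.Conv2d, nn.Linear),
    "engine_c": (nn.Conv2d, nn.Linear, nn.AvgPool2d),
}

def select_layers_to_quantize(model: nn.Module, engine: str):
    """Return (name, module) pairs in the model whose type matches the layer
    types supported by the given inference engine."""
    types = QUANTIZABLE_TYPES[engine]
    return [(name, m) for name, m in model.named_modules() if isinstance(m, types)]
```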
  • In the embodiments of the present application, the first network model to be quantized is obtained; based on the set deployment configuration information, at least one processing layer to be quantized in the first network model and the quantization parameters for each processing layer are determined; and each of the processing layers in the first network model is quantized according to the quantization parameters to obtain a second network model.
  • Since the processing layers to be quantized in the first network model and the quantization parameters for each processing layer to be quantized are determined based on the set deployment configuration information, full consideration is given, during model quantization, to the deployment configuration information of the hardware platform on which the model is deployed, so that the obtained second network model is deployable on the corresponding hardware platform.
  • An embodiment of the present application provides a model quantization method, which can be executed by a processor of a computer device. As shown in Figure 2A, the method includes:
  • Step S201 acquiring a first network model to be quantized.
  • the above-mentioned step S201 corresponds to the above-mentioned step S101, and the implementation of the above-mentioned step S101 can be referred to for implementation.
  • Step S202 based on the set deployment configuration information, determine at least one processing layer to be quantized in each of the block structures in the first network model and a quantization parameter for quantizing each of the processing layers.
  • the structure of the neural network model can be divided into multiple stages (stages), each stage can be divided into multiple blocks (blocks), and each block can be divided into multiple processing layers (layers).
  • quantization processing is performed in units of a block structure.
  • the first network model includes at least one block structure, each of said block structures includes at least one processing layer.
  • the processing layers to be quantized in each block structure corresponding to the set deployment configuration information may be determined based on the predetermined correspondence between the deployment configuration information and the processing layers to be quantized in different block structures.
  • During implementation, pseudo-quantization nodes can be inserted into the calculation subgraph corresponding to the block structure in the calculation graph according to the insertion strategy of pseudo-quantization nodes corresponding to the deployment configuration information, thereby determining at least one processing layer to be quantized in the block structure.
  • For example, where the neural network structure adopted by the first network model is ResNet-18/ResNet-34, for the basic block structure in ResNet-18/ResNet-34, under different deployment configuration information, at least one pseudo-quantization node can be inserted into the calculation subgraph corresponding to the basic block structure using the three different insertion strategies shown in FIG. 2B to FIG. 2D.
  • As shown in FIG. 2B, a pseudo-quantization node FakeQuant 20 is inserted at the input of each convolutional layer Conv 10 in the calculation subgraph, where the pseudo-quantization node FakeQuant 20 includes a quantization processing node Quantization 21 and a de-quantization node Dequantization 22. Therefore, the processing layers to be quantized in the basic block structure corresponding to the calculation subgraph are the convolutional layers in the basic block structure.
  • As shown in FIG. 2C, the input of the calculation subgraph is quantized data (that is, the inputs of the convolutional layers Conv 10-1 and Conv 10-2); the pseudo-quantization node FakeQuant 20 is inserted at the input of the convolutional layer Conv 10-3, at one input of the element-wise addition layer elementwise-add 30, and at the output of the calculation subgraph, so that the processing layers to be quantized in the basic block structure corresponding to the calculation subgraph are each convolutional layer and the element-wise addition layer (with only a single input quantized) in the basic block structure, as well as the output layer of the basic block structure.
  • As shown in FIG. 2D, the input of the calculation subgraph is quantized data (that is, the inputs of the convolutional layers Conv 10-1 and Conv 10-2); the pseudo-quantization node FakeQuant 20 is inserted at the input of the convolutional layer Conv 10-3, at each input of the element-wise addition layer elementwise-add 30, and at the output of the calculation subgraph, so that the processing layers to be quantized in the basic block structure corresponding to the calculation subgraph are each convolutional layer and the element-wise addition layer (with both inputs quantized) in the basic block structure, as well as the output layer of the basic block structure.
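  • Purely as an illustration of the FIG. 2D insertion strategy, the forward pass of such a basic block could be sketched as follows. The class and argument names are assumptions, the batch normalization layers are omitted for brevity, and `fake_quant` stands in for the pseudo-quantization node FakeQuant 20 (one possible definition is sketched after formula (3) later in this document); a real implementation would typically use a separate FakeQuant instance per insertion point:

```python
import torch.nn as nn

class QuantBasicBlock(nn.Module):
    """Illustrative ResNet-style basic block with pseudo-quantization nodes inserted
    following the FIG. 2D strategy: the block input is assumed to be quantized already,
    and FakeQuant is applied at the input of Conv 10-3, at both inputs of the
    element-wise addition, and at the block output."""
    def __init__(self, conv10_1, conv10_3, conv10_2, fake_quant):
        super().__init__()
        self.conv10_1 = conv10_1      # first convolution on the main path
        self.conv10_3 = conv10_3      # second convolution on the main path
        self.conv10_2 = conv10_2      # convolution on the shortcut branch
        self.relu = nn.ReLU(inplace=True)
        self.fq = fake_quant          # pseudo-quantization node (quantize + de-quantize)

    def forward(self, x):
        # x: already-quantized block input, fed to both Conv 10-1 and Conv 10-2.
        main = self.relu(self.conv10_1(x))
        main = self.conv10_3(self.fq(main))       # FakeQuant at the input of Conv 10-3
        shortcut = self.conv10_2(x)
        out = self.fq(main) + self.fq(shortcut)   # FakeQuant at both inputs of elementwise-add
        return self.fq(self.relu(out))            # FakeQuant at the block output
```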
  • Step S203 performing quantization on each of the processing layers in the first network model according to the quantization parameter to obtain a second network model.
  • the above-mentioned step S203 corresponds to the above-mentioned step S103, and the implementation of the above-mentioned step S103 can be referred to for implementation.
  • the first network model includes at least one block structure, each block structure includes at least one processing layer, and based on the set deployment configuration information, at least one processing layer to be quantified in each block structure in the first network model is determined and a quantization parameter for quantizing each processing layer, and performing quantization on each processing layer to be quantized in the first network model according to the quantization parameter to obtain a second network model.
  • all block structures in the first network model can be quantized, thereby realizing the quantization of the entire network model.
  • An embodiment of the present application provides a model quantization method, which can be executed by a processor of a computer device. As shown in Figure 3A, the method includes:
  • Step S301 acquiring the first network model to be quantized.
  • Step S302 based on the inference engine used by the set deployment hardware type, determine the processing layer type to be quantized.
  • Step S303 determining at least one processing layer in the first network model that matches the processing layer type as the processing layer to be quantized.
  • Step S304 based on the inference engine, determine quantization parameters for quantizing each of the processing layers.
  • the above-mentioned steps S301 to S304 correspond to the above-mentioned steps S101 to S102, and the specific implementation manners of the above-mentioned steps S101 to S102 can be referred to for implementation.
  • Step S305 determining at least one batch normalization layer in the first network model and the convolutional layer that each batch normalization layer depends on as processing layers to be quantized.
  • the convolutional layer on which the batch normalization layer depends may be the convolutional layer connected to the batch normalization layer before the batch normalization layer.
  • Step S306 obtaining the set batch normalization layer folding strategy.
  • the batch normalization folding strategy refers to the strategy of folding the batch normalization layer in the neural network model into the convolutional layer that the batch normalization layer depends on.
  • batch normalization layers are designed to reduce internal covariate shifts and smooth losses for fast convergence.
  • the batch normalization layer introduces a two-step linear transformation, scaling and translation, to each convolutional layer output.
  • the set batch normalization layer folding strategy may be a preset batch normalization layer folding strategy corresponding to the deployment configuration information.
  • Step S307 based on the batch normalization layer folding strategy, fold each of the batch normalization layers in the first network model into the convolutional layer that the batch normalization layer depends on, to obtain the folded first network model.
  • Step S308 performing quantization on each of the processing layers in the folded first network model according to the quantization parameter to obtain a second network model.
  • the batch normalization layer folding strategy includes batch normalization layer removal status, coefficient update algorithm, statistical parameters to be incorporated into weights, statistical parameters to be incorporated into offsets;
  • the statistical parameters to be incorporated into the weight include the running statistics of the convolutional layer on which the batch normalization layer depends or the statistics of the current batch, and the statistical parameters to be incorporated into the offset include the running statistics of the convolutional layer on which the batch normalization layer depends or the statistics of the current batch.
  • In some embodiments, folding each batch normalization layer in the first network model into the convolutional layer that the batch normalization layer depends on, based on the batch normalization layer folding strategy described in step S307 above, may include:
  • Step S311 determining the scaling coefficient and translation coefficient of each batch normalization layer in at least one batch normalization layer in the first network model
  • the scaling coefficient and translation coefficient of each batch normalization layer may be determined based on parameters of the batch normalization layer.
  • Step S312 based on the coefficient update algorithm, update the scaling coefficient and translation coefficient of each batch normalization layer to obtain the updated scaling coefficient and translation coefficient of each batch normalization layer.
  • the coefficient update algorithm is any suitable algorithm set for updating the scaling coefficient and translation coefficient of the batch normalization layer, which may include but not limited to gradient descent method, simulated annealing method, genetic algorithm, etc. one or more species.
  • the coefficient updating algorithm may also be non-updating, so that the scaling coefficients and translation coefficients of the batch normalization layer may not be updated.
  • Step S313 for each batch normalization layer, obtain the statistical parameters to be incorporated into the weight and the statistical parameters to be incorporated into the offset for the batch normalization layer, merge the updated scaling coefficient of the batch normalization layer and the statistical parameters to be incorporated into the weight into the weight of the convolutional layer on which the batch normalization layer depends, and merge the updated scaling coefficient and translation coefficient of the batch normalization layer and the statistical parameters to be incorporated into the offset into the offset of the convolutional layer.
  • the statistical parameters to be incorporated into the weights may include the running statistics of the convolutional layer on which the batch normalization layer depends or the statistics of the current batch, and the statistical parameters to be incorporated into the offset may also include batch normalization The running statistics of the convolutional layers that the layer depends on or the statistics of the current batch.
  • Running statistical data is statistical data obtained from the output data during the historical operation of the convolutional layer, which may include but not limited to one or more of the mean, variance, and sliding average of the historical output data.
  • the statistical data of the current batch is the statistical data obtained by statistics of the current batch of data in the output data of the convolutional layer, which may include but not limited to one or more of the mean value and variance of the current batch of data.
  • the statistics of the current batch of the convolutional layer can be calculated by performing convolution with full-precision weights in the convolutional layer.
  • In some embodiments, the statistical parameters to be incorporated into the weight may include the variance of the historical output data of the convolutional layer on which the batch normalization layer depends, and the statistical parameters to be incorporated into the offset may include the mean and variance of the historical output data of the convolutional layer.
  • the updated scaling coefficient of the batch normalization layer and the variance of the historical output data of the convolutional layer that the batch normalization layer depends on can be combined into the weights of the convolutional layer that the batch normalization layer depends on, and the batch
  • the updated scaling coefficient and translation coefficient of the normalization layer and the mean and variance of the historical output data of the convolutional layer on which the batch normalization layer depends are combined into the offset of the convolutional layer.
  • In some embodiments, the statistical parameters to be incorporated into the weight may include the variance of the current batch data of the convolutional layer on which the batch normalization layer depends,
  • and the statistical parameters to be incorporated into the offset may include the mean and variance of the current batch data of the convolutional layer.
  • the updated scaling coefficient of the batch normalization layer and the variance of the current batch data of the convolutional layer on which the batch normalization layer depends can be combined into the weights of the convolutional layer on which the batch normalization layer depends
  • the updated scaling coefficient and translation coefficient of the batch normalization layer and the mean value and variance of the current batch data of the convolutional layer on which the batch normalization layer depends are combined into the offset of the convolutional layer.
  • In some embodiments, the statistical parameters to be incorporated into the weight may include the variance of the historical output data of the convolutional layer on which the batch normalization layer depends, and the statistical parameters to be incorporated into the offset may include the mean and variance of the current batch data of the convolutional layer.
  • the updated scaling coefficient of the batch normalization layer and the variance of the historical output data of the convolutional layer that the batch normalization layer depends on can be combined into the weights of the convolutional layer that the batch normalization layer depends on, and the batch
  • the updated scaling coefficient and translation coefficient of the normalization layer and the mean value and variance of the current batch of data of the convolutional layer on which the batch normalization layer depends are combined into the offset of the convolutional layer.
  • Step S314 if the removal state of the batch normalization layer is removed, remove each batch normalization layer from the first network model.
  • During implementation, the scaling coefficient and translation coefficient of the batch normalization layer and the running statistics of the convolutional layer on which the batch normalization layer depends can be combined in the manner shown in formula (1), so that the linear transformation performed by the batch normalization layer is folded into the corresponding convolutional layer:
  • w_fold = γ · w / √(σ² + ε),  b_fold = β + γ · (b − μ) / √(σ² + ε)    (1)
  • where w_fold and b_fold are the folded weight and offset of the convolutional layer, respectively, and w and b are its original weight and offset; μ and σ² are the sliding average and variance obtained from the statistics of the output data during the operation of the convolutional layer; γ and β are the scaling and translation coefficients of the batch normalization layer, respectively; and ε is a very small non-zero value set for numerical stability, which prevents the divisor from being zero. If the convolutional layer is quantized after the batch normalization layer is folded, there will be no extra floating-point operations during inference.
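  • As a minimal sketch of formula (1) using standard PyTorch modules (corresponding most closely to strategy 1 below; the function name is an assumption, and the convolution is assumed to carry its own bias term, taken as zero if absent):

```python
import torch
import torch.nn as nn

@torch.no_grad()
def fold_bn_into_conv(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    """Fold a BatchNorm2d layer into the Conv2d it depends on, per formula (1):
    w_fold = gamma * w / sqrt(var + eps);  b_fold = beta + gamma * (b - mu) / sqrt(var + eps)."""
    gamma, beta = bn.weight, bn.bias                          # scaling / translation coefficients
    mu, var, eps = bn.running_mean, bn.running_var, bn.eps    # running statistics
    scale = gamma / torch.sqrt(var + eps)

    fused = nn.Conv2d(conv.in_channels, conv.out_channels, conv.kernel_size,
                      stride=conv.stride, padding=conv.padding,
                      dilation=conv.dilation, groups=conv.groups, bias=True)
    fused.weight.copy_(conv.weight * scale.reshape(-1, 1, 1, 1))
    b = conv.bias if conv.bias is not None else torch.zeros_like(mu)
    fused.bias.copy_(beta + (b - mu) * scale)
    return fused
```

  • After folding, the fused convolution replaces the original convolution plus batch normalization pair and can then be quantized, which corresponds to removing the batch normalization layer as described in step S314.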
  • batch normalization layer folding strategies may include, but are not limited to, one of the following:
  • Strategy 1 See FIG. 3B.
  • In this strategy, the above formula (1) is used to merge the scaling coefficient and translation coefficient in the batch normalization layer into the weight w_fold and offset b_fold of the convolutional layer Conv 310 that the batch normalization layer depends on, and the batch normalization layer is completely removed;
  • Strategy 2 Refer to FIG. 3C.
  • In this strategy, the above formula (1) is also used to merge the scaling coefficient and translation coefficient in the batch normalization layer into the weight w_fold and offset b_fold of the convolutional layer Conv 310 that the batch normalization layer depends on.
  • Strategy 3 See Figure 3D.
  • In this strategy, the running statistics of the convolutional layer can be updated during the quantization training process; however, the convolution is calculated twice, which causes additional overhead.
  • The first convolution (corresponding to the convolutional layer Conv 320 in the figure) uses full-precision weights to calculate the mean and variance of the current batch; then, using the above formula (1) with the mean and variance of the current batch, the scaling coefficient and translation coefficient in the batch normalization layer are merged into the weight and offset of the convolutional layer Conv 310 that the batch normalization layer depends on, and the batch normalization layer is completely removed.
  • Strategy 4 See Figure 3E, in this strategy, two convolutions are also calculated during the training process.
  • the first convolution (corresponding to the convolutional layer Conv 320 in the figure) is the same as in strategy 3 and estimates the mean and variance of the current batch.
  • However, the weights are folded together with the running statistics: the variance σ² in the running statistics and the scaling coefficient in the batch normalization layer are merged into the weight of the convolutional layer that the batch normalization layer depends on, while the mean and variance of the current batch, together with the scaling and translation coefficients in the batch normalization layer, are merged into the offset of the convolutional layer Conv 310 that the batch normalization layer depends on, and the batch normalization layer is completely removed. In addition, a batch variance factor is used to rescale the output after the second convolution.
  • Strategy 5 See Figure 3F.
  • two convolutions are not used, but a batch normalization layer BN 330 is explicitly added after the quantized convolution (corresponding to the convolutional layer Conv 310 in the figure).
  • One of the benefits brought by this strategy is that the statistics of the current batch are calculated based on quantized weights.
  • the rescaling of convolutional layer outputs can be neutralized by batch normalization layers.
  • During implementation, a batch normalization layer folding strategy can be set from a variety of preset batch normalization layer folding strategies (such as the above-mentioned strategies 1 to 5), and based on the set batch normalization layer folding strategy, at least one batch normalization layer in the first network model is folded to obtain the folded first network model.
  • the above step S306 may include:
  • Step S321 based on the inference engine, determine a target batch normalization layer folding strategy from various set batch normalization layer folding strategies.
  • the set multiple batch normalization layer folding strategies may be determined in advance according to the actual situation, and may include but not limited to any one of the strategies 1 to 5 above.
  • The target batch normalization layer folding strategy is determined based on the inference engine from the multiple set batch normalization layer folding strategies. Different inference engines can support different batch normalization layer folding strategies, or they can support the same batch normalization layer folding strategy.
  • the target batch normalization layer folding strategy can be determined from multiple set batch normalization layer folding strategies according to the inference engine's ability to support the batch normalization layer folding strategy. In this way, the performance of the quantized second network model after being deployed on the deployment hardware using the set inference engine can be further improved.
  • For example, the correspondence between inference engines and batch normalization layer folding strategies can be determined in advance, and by querying this correspondence based on the set inference engine, the target batch normalization layer folding strategy can be determined from the multiple set batch normalization layer folding strategies.
  • In the embodiment of the present application, the set batch normalization layer folding strategy is obtained; based on the batch normalization layer folding strategy, each batch normalization layer in the first network model is folded into the convolutional layer that the batch normalization layer depends on to obtain the folded first network model, and each of the processing layers in the folded first network model is quantized according to the quantization parameter to obtain the second network model.
  • the convolution layer is quantized after the batch normalization layer is folded, and there will be no additional floating-point operations in the inference process, so that the inference speed of the quantized second network model can be further accelerated.
  • An embodiment of the present application provides a model quantization method, which can be executed by a processor of a computer device. As shown in Figure 4, the method includes:
  • Step S401 acquiring the first network model to be quantized.
  • Step S402 based on the set deployment configuration information, determine at least one processing layer to be quantized in the first network model and a quantization parameter for quantizing each processing layer.
  • the above-mentioned steps S401 to S402 correspond to the above-mentioned steps S101 to S102 respectively, and the specific implementation manners of the above-mentioned steps S101 to S102 can be referred to for implementation.
  • Step S403 based on the set quantization algorithm and the first training data set, quantize each of the processing layers in the first network model according to the quantization parameters to obtain a second network model.
  • the quantization algorithm can be a post-training quantization algorithm or a quantization-aware training algorithm, which is not limited here.
  • the first training data set may be an appropriate training data set determined in advance according to the target task of the second network model, and may be an image data set, a point cloud data set, or voice data, etc., which is not limited here.
  • In some embodiments, the quantization algorithm is a post-training quantization algorithm. Based on the post-training quantization algorithm, each of the processing layers in the first network model is quantized according to the quantization parameters to obtain a quantized second network model; based on the first training data set, the model parameters in the quantized second network model are calibrated to obtain the calibrated second network model.
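  • A minimal sketch of one possible calibration step, assuming a simple min/max observer that collects activation statistics on the first training data set and turns them into a scale and zero point; the observer design and names are assumptions, not the calibration procedure fixed by this embodiment:

```python
import torch

class MinMaxObserver:
    """Collect running min/max of a tensor and derive an asymmetric scale/zero-point."""
    def __init__(self, bit_width: int = 8):
        self.qmin, self.qmax = 0, 2 ** bit_width - 1
        self.min_val, self.max_val = float("inf"), float("-inf")

    def observe(self, x: torch.Tensor):
        self.min_val = min(self.min_val, x.min().item())
        self.max_val = max(self.max_val, x.max().item())

    def scale_zero_point(self):
        # Assumes max_val > min_val; a real observer would guard against a zero range.
        scale = (self.max_val - self.min_val) / (self.qmax - self.qmin)
        zero_point = round(self.qmin - self.min_val / scale)
        return scale, int(zero_point)

# Calibration loop (sketch): run a few batches of the first training data set through
# the quantized model so that observers attached to each layer record activation ranges.
# for images, _ in calibration_loader:
#     model(images)
```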
  • In some embodiments, the quantization algorithm is a quantization-aware training algorithm. Based on the quantization-aware training algorithm and the first training data set, the parameters of each of the processing layers in the first network model can be subjected to at least one round of quantization-aware training according to the quantization parameters to obtain the trained quantized second network model.
  • In some embodiments, before quantizing the first network model, the first network model may be pre-trained, and the pre-trained first network model may be used as the first network model to be quantized.
  • each processing layer to be quantized in the first network model is quantized according to quantization parameters to obtain the second network model. In this way, the set quantization algorithm can be effectively reproduced.
  • the quantization algorithm includes a quantization-aware training algorithm, and the above step S403 may also include:
  • Step S411 setting a pseudo-quantizer for each of the processing layers in the first network model according to the quantization parameters to obtain a third network model.
  • the pseudo-quantizer can perform quantization simulation during the quantization-aware training process to facilitate the network to perceive the loss caused by quantization, so that a pseudo-quantizer can be set for each processing layer to be quantized in the first network model.
  • During implementation, the structure of the pseudo-quantizer can be determined based on the quantization parameters; it can be a symmetric quantizer or an asymmetric quantizer, a uniform quantizer or a non-uniform quantizer, a learning-based quantizer or a rule-based quantizer, or a quantizer that directly uses heuristics to calculate the quantization step size, which is not limited here.
  • the first network model in which the pseudo-quantizer is set may be determined as the third network model.
  • Step S412 based on the set quantization-aware training algorithm and the first training data set, perform at least one quantization-aware training on the parameters of each processing layer in the third network model to obtain a second network model.
  • one quantization-aware training algorithm may be set from multiple preset quantization-aware training algorithms.
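  • Only as a hedged sketch, setting a pseudo-quantizer for each processing layer and then running quantization-aware training could look roughly as follows. `FakeQuantize` refers to the module sketched after formula (3) later in this document, the wrapper and its names are assumptions, and the training settings are generic placeholders rather than the specific LSQ/PACT/DSQ procedures named above:

```python
import torch
import torch.nn as nn

def attach_fake_quantizers(model: nn.Module, make_fq):
    """Wrap every Conv2d/Linear with a pseudo-quantizer on its input
    (a simplified stand-in for constructing the third network model)."""
    class Wrapped(nn.Module):
        def __init__(self, layer, fq):
            super().__init__()
            self.layer, self.fq = layer, fq
        def forward(self, x):
            return self.layer(self.fq(x))

    for name, child in model.named_children():
        if isinstance(child, (nn.Conv2d, nn.Linear)):
            setattr(model, name, Wrapped(child, make_fq()))
        else:
            attach_fake_quantizers(child, make_fq)
    return model

# Quantization-aware training loop (sketch): the fake quantizers simulate quantization
# in the forward pass while gradients update the full-precision parameters.
# optimizer = torch.optim.SGD(model.parameters(), lr=0.004, momentum=0.9, nesterov=True)
# for images, labels in train_loader:
#     loss = nn.functional.cross_entropy(model(images), labels)
#     optimizer.zero_grad(); loss.backward(); optimizer.step()
```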
  • the quantization parameters include preset precision of quantization scale, quantization symmetry, quantization bit width and quantization granularity, the quantization symmetry includes symmetric quantization or asymmetric quantization, and the quantization granularity includes hierarchical quantization or Feature-level quantization.
  • the pseudo-quantizer is configured to perform the following steps S421 to S424:
  • Step S421 Determine the quantized value range of the processing layer parameter based on the quantized bit width.
  • the quantization bit width is the bit width of the integer data obtained by quantizing the floating-point parameters during the training process of the parameters of each processing layer to be quantized in the third network model, such as 8 bits, 4 bits , 3 bits, 2 bits, etc.
  • the quantized bit width can be determined according to the set deployment configuration information, or can be set directly by the user. Different processing layers in the third network model may use the same quantization bit width or different quantization bit widths.
  • During implementation, the processing layer parameters can be one or more parameters to be quantized among the weight values, activation values, input data, output data, etc. of the processing layer to be quantized, and the quantized value range of the processing layer parameter is the range of values of the parameter after quantization.
  • the quantized value range of the processing layer parameter can be determined based on the quantization bit width.
  • the processing layer parameter can include a weight value and an activation value.
  • For a quantization bit width of k bits, the weight value can be quantized as signed integer values in the range [-2^(k-1), 2^(k-1) - 1], and the activation value can be quantized as unsigned integer values in the range [0, 2^k - 1]; therefore, the quantized value range of the weight value may be [-2^(k-1), 2^(k-1) - 1], and the quantized value range of the activation value may be [0, 2^k - 1].
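  • For example, a small helper under the assumptions above (the function name is illustrative, and the unsigned upper bound follows the common k-bit convention):

```python
def quantized_value_range(bit_width: int, signed: bool):
    """Return (N_min, N_max) for a k-bit signed (weight) or unsigned (activation) range."""
    if signed:
        return -(2 ** (bit_width - 1)), 2 ** (bit_width - 1) - 1
    return 0, 2 ** bit_width - 1

# quantized_value_range(8, signed=True)  -> (-128, 127)
# quantized_value_range(8, signed=False) -> (0, 255)
```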
  • Step S422 determining a quantization scale that satisfies the preset precision and a quantization zero that satisfies the quantization symmetry.
  • the quantization scale is a coefficient for scaling the full-precision value to be quantized during the quantization process.
  • the preset precision of the quantization scale may include but not limited to one of full precision, power of 2 precision, and the like.
  • Quantization symmetry is used to characterize whether the value range of the full-precision value to be quantized is symmetrical about 0.
  • During quantization, the integer value to which the zero point of the full-precision value is mapped is called the quantization zero point.
  • When the quantization zero point is 0, it means that the value range of the full-precision value to be quantized is symmetric about 0, that is, the uniform quantization is symmetric quantization; when the quantization zero point is not 0, it means that the value range of the full-precision value to be quantized is asymmetric about 0, that is, the uniform quantization is asymmetric quantization.
  • a fixed quantization scale that satisfies the preset accuracy and a fixed quantization zero point that satisfies the quantization symmetry may be set for the pseudo quantizer according to actual conditions. For example, when the preset precision of the quantization scale is full precision, an appropriate full-precision numerical value may be set as the quantization scale for the pseudo quantizer.
  • When the quantization symmetry is symmetric, the quantization zero point can be set to 0; when the quantization symmetry is asymmetric, the quantization zero point can be set to an appropriate non-zero number, such as 1, -2, and so on.
  • In some embodiments, the quantization scale that satisfies the preset precision and the quantization zero point that satisfies the quantization symmetry can also be continuously adjusted during the model training process.
  • Step S423 based on the quantization granularity, within the quantization value range, uniform quantization is performed on the processing layer parameters to be quantized by using the quantization scale and the quantization zero point, to obtain the quantized processing layer parameters.
  • the quantization granularity refers to the granularity of parameters such as the quantization value range, quantization scale, and quantization zero point shared in the quantization network model, which can include hierarchical quantization (that is, tensor-level quantization) or feature-level quantization (that is, channel-level quantization). etc.
  • Layer-wise quantization means that the processing layer parameters to be quantized in the same processing layer share the same quantization value range, quantization scale, quantization zero point, and other parameters; feature-level quantization means that the processing layer parameters to be quantized corresponding to different features (channels) in the same processing layer use different shared quantization value ranges, quantization scales, quantization zero points, and other parameters.
  • Assuming that the quantized value range is [N_min, N_max], where N_min is the smallest quantized value and N_max is the largest quantized value in the range, the quantization scale is s, and the quantization zero point is z, the processing layer parameters to be quantized can be uniformly quantized in the manner shown in the following formula (2):
  • w̄ = clip(round(w / s) + z, N_min, N_max)    (2)
  • where w represents the floating-point value of the processing layer parameter and w̄ is the quantized value of the processing layer parameter; round(·) rounds the input value to the nearest integer; and the function clip(x, N_min, N_max) limits x to between N_min and N_max: when x is greater than N_max the value of the function is N_max, when x is less than N_min the value of the function is N_min, and otherwise the value of the function is x.
  • Step S424 based on the quantization scale and the quantization zero point, perform inverse uniform quantization on the quantized processing layer parameters to obtain the dequantized processing layer parameters.
  • During implementation, the quantized processing layer parameters can be de-quantized (inverse uniform quantization) in the manner shown in the following formula (3):
  • ŵ = s · (w̄ − z)    (3)
  • where ŵ is the de-quantized processing layer parameter.
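  • Putting formulas (2) and (3) together, a pseudo-quantizer can be sketched as a simplified per-tensor quantize-then-de-quantize module with a fixed scale and zero point. The class and parameter names are assumptions; real implementations such as the LSQ-style quantizers mentioned above additionally learn the scale and use a straight-through estimator for the rounding operation:

```python
import torch
import torch.nn as nn

class FakeQuantize(nn.Module):
    """Quantize then de-quantize a tensor per formulas (2) and (3):
    q = clip(round(w / s) + z, N_min, N_max);  w_hat = s * (q - z)."""
    def __init__(self, scale: float, zero_point: int, n_min: int, n_max: int):
        super().__init__()
        self.s, self.z = scale, zero_point
        self.n_min, self.n_max = n_min, n_max

    def forward(self, w: torch.Tensor) -> torch.Tensor:
        q = torch.clamp(torch.round(w / self.s) + self.z, self.n_min, self.n_max)
        return self.s * (q - self.z)

# Example: 8-bit asymmetric fake quantization of an activation tensor.
fq = FakeQuantize(scale=0.02, zero_point=0, n_min=0, n_max=255)
x_hat = fq(torch.randn(2, 3))
```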
  • the quantization parameters for quantizing each processing layer in the first network model can be determined based on the set deployment configuration information, and the quantization parameters include the preset precision of the quantization scale, quantization symmetry, quantization bit width and quantization Granularity, the quantization symmetry includes symmetric quantization or asymmetric quantization, and the quantization granularity includes hierarchical quantization or feature level quantization.
  • the hardware-aware quantizer can be used to perform model quantization according to the configuration of individual deployment hardware, so that the quantized second network model can better meet the deployment requirements of the deployment hardware.
  • multiple types of quantizers can be supported, so that a deployable second network model can be quantized for more types of deployment hardware.
  • the above step S403 may include:
  • Step S431 determining preset training hyperparameters corresponding to the neural network structure adopted by the first network model; wherein, for each piece of deployment configuration information in the preset multiple pieces of deployment configuration information, the training hyperparameters are the same.
  • During implementation, the training hyperparameters may include, but are not limited to, one or more of the fine-tuning duration (number of epochs), the learning rate, the parameter optimization algorithm, the weight decay, and the like.
  • the preset multiple deployment configuration information may include at least two preset deployment configuration information, which is not limited here.
  • That is, the same training hyperparameters are used in the process of quantization training for network models adopting the same neural network structure; for different deployment configuration information, the training hyperparameters used are also the same.
  • During implementation, a set of suitable training hyperparameters for at least one neural network structure can be determined in advance through experiments or analysis, and based on the neural network structure adopted by the first network model, the preset training hyperparameters corresponding to that neural network structure can be determined. Those skilled in the art may determine appropriate training hyperparameters for at least one neural network structure according to actual conditions, which is not limited in this embodiment of the present application.
  • Table 1 provides an example of training hyperparameters preset for the neural network structures ResNet-18, ResNet-50, EffNet, MbV2, and RegNet. For the first network model using ResNet-18, the preset learning rate is 0.004, the weight decay is 10^-4, the batch size is 64, and the number of graphics processing units (GPUs) is 8; for the first network model using ResNet-50, the preset learning rate is 0.004, the weight decay is 10^-4, the batch size is 16, and the number of GPUs is 16; for the first network models using EffNet and MbV2, the same training hyperparameters can be preset: the learning rate is 0.01, the weight decay is 10^-5 *, the batch size is 32, and the number of GPUs is 16; for the first network model using RegNet, the preset learning rate is 0.004, the weight decay is 4×10^-5, the batch size is 32, and the number of GPUs is 16. Among them, * represents that the weight
  • Table 1 Example of training hyperparameters corresponding to different neural network structures

  Network structure | Learning rate | Weight decay | Batch size | Number of GPUs
  ResNet-18         | 0.004         | 10^-4        | 64         | 8
  ResNet-50         | 0.004         | 10^-4        | 16         | 16
  EffNet / MbV2     | 0.01          | 10^-5 *      | 32         | 16
  RegNet            | 0.004         | 4×10^-5      | 32         | 16
  • a unified data preprocessing pipeline can be used for the training data, including random resized cropping to 224 resolution, random horizontal flipping, and color jitter of the image, for example a brightness offset of 0.2, a contrast offset of 0.2, a saturation offset of 0.2, and a hue offset of 0.1.
  • the test data is center-cropped to 224 resolution, and regularization is added using label smoothing of 0.1. All models are trained for 100 epochs (one epoch meaning that all training samples are forward-propagated and back-propagated once), and a linear warm-up is performed in the first epoch.
  • the learning rate is decayed with a cosine annealing strategy. The models are trained using the SGD optimizer and updated with Nesterov momentum, with a momentum parameter of 0.9.
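  • The preprocessing and optimization settings described above can be sketched with standard PyTorch/torchvision APIs (assuming a recent PyTorch; the linear warm-up in the first epoch would be layered on top of the cosine schedule and is omitted here, and the resize to 256 before the center crop is an assumption).

```python
import torch
from torch import nn, optim
from torchvision import transforms

# Training-data preprocessing: random resized crop to 224, random horizontal flip,
# and color jitter (brightness 0.2, contrast 0.2, saturation 0.2, hue 0.1).
train_tf = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2, hue=0.1),
    transforms.ToTensor(),
])

# Test-data preprocessing: center crop to 224 resolution.
test_tf = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
])

def build_training(model: nn.Module, lr: float, weight_decay: float, epochs: int = 100):
    # Label smoothing of 0.1, SGD with Nesterov momentum 0.9, cosine annealing over all epochs.
    criterion = nn.CrossEntropyLoss(label_smoothing=0.1)
    optimizer = optim.SGD(model.parameters(), lr=lr, momentum=0.9,
                          nesterov=True, weight_decay=weight_decay)
    scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)
    return criterion, optimizer, scheduler
```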
  • Step S432, using the set first training data set, quantizing each of the processing layers in the first network model according to the quantization parameters, based on the quantization algorithm and the training hyperparameters, to obtain the second network model.
  • in this way, unified training hyperparameters are used, so that model training techniques can be shared among the various first network models with the same neural network structure and the various quantization algorithms, so that different quantization algorithms can be better reproduced and the accuracy of the quantization algorithms can be improved.
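  • The unified fine-tuning procedure can be sketched as a standard quantization-aware training loop; this is a schematic outline assuming the quantized model already contains fake-quantize nodes, not the library's actual training code.

```python
import torch

def finetune_quantized(qmodel, train_loader, criterion, optimizer, scheduler,
                       epochs: int, device: str = "cuda"):
    """Quantization-aware fine-tuning: the same loop and the same hyperparameters are
    reused for every quantization algorithm applied to a given network structure."""
    qmodel.to(device).train()
    for _ in range(epochs):
        for images, targets in train_loader:
            images, targets = images.to(device), targets.to(device)
            optimizer.zero_grad()
            loss = criterion(qmodel(images), targets)  # forward through fake-quantized layers
            loss.backward()                            # gradients pass the quantizers via STE
            optimizer.step()
        scheduler.step()
    return qmodel
```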
  • An embodiment of the present application provides a model quantization method, which can be executed by a processor of a computer device. As shown in Figure 5, the method includes:
  • Step S501 based on at least one type of deployment configuration information, adjust the processing layers in the set neural network structure to obtain at least one adjusted neural network structure.
  • the preset neural network structure may be preset by the user according to the actual situation, or may be a default, which is not limited here.
  • the at least one piece of deployment configuration information may be one or more pieces of deployment configuration information preset by the user or set by default. Because deployment hardware differs, there are differences in the quantization support capabilities for the different processing layers in a neural network structure. During implementation, for each piece of deployment configuration information, according to the actual quantization support of the deployment hardware corresponding to that deployment configuration information for the different processing layers in the neural network structure, an appropriate method can be used to adjust at least one processing layer in the set neural network structure to obtain an adjusted neural network structure.
  • for example, the squeeze-and-excitation blocks in the network structure can be removed, and the swish activation layer can be replaced with a ReLU6 (Rectified Linear Unit 6) layer, giving the lightweight (Lite) version of EfficientNet, so that better integer-value support can be obtained on the deployment hardware.
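  • A sketch of this kind of hardware-driven structural adjustment is shown below; it only swaps activations that quantize poorly for ReLU6 (removing squeeze-and-excitation blocks requires knowledge of the concrete architecture and is omitted), and the function name is illustrative.

```python
import torch.nn as nn

def adjust_for_deployment(model: nn.Module) -> nn.Module:
    """Replace swish-style activations (SiLU/Hardswish) with ReLU6, mirroring the
    EfficientNet -> EfficientNet-Lite style adjustment described above."""
    for name, child in model.named_children():
        if isinstance(child, (nn.SiLU, nn.Hardswish)):
            setattr(model, name, nn.ReLU6(inplace=True))
        else:
            adjust_for_deployment(child)  # recurse into submodules
    return model
```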
  • Step S502 Create at least one first network model based on at least one adjusted neural network structure.
  • a first network model may be created for each neural network structure in the at least one adjusted neural network structure.
  • those skilled in the art can create an appropriate first network model based on the adjusted neural network structure according to actual business requirements, which is not limited here.
  • Step S503, based on the preset model parameters corresponding to the set neural network structure, initializing the parameters of the at least one first network model to obtain at least one initialized first network model.
  • in this way, the parameters of each first network model can be initialized with unified preset model parameters, obtaining at least one initialized first network model.
  • the preset model parameters may include preset initial values of parameters in the first network model, or may include trained model parameters obtained after pre-training the first network model, which is not limited here.
  • Step S504 based on the set deployment configuration information, determine a first network model to be quantified from the at least one initialized first network model.
  • each type of deployment configuration information may correspond to one initialized first network model; based on the set deployment configuration information, the initialized first network model corresponding to that deployment configuration information can be determined, and this initialized first network model is determined as the first network model to be quantized.
  • Step S505 based on the set deployment configuration information, determine at least one processing layer to be quantized in the first network model and a quantization parameter for quantizing each processing layer.
  • Step S506 performing quantization on each of the processing layers in the first network model according to the quantization parameter to obtain a second network model.
  • the above-mentioned steps S505 to S506 correspond to the above-mentioned steps S102 to S103 respectively, and the specific implementation manners of the above-mentioned steps S102 to S103 can be referred to for implementation.
  • in this way, the processing layers in the set neural network structure are adjusted to obtain at least one adjusted neural network structure; based on the at least one adjusted neural network structure, at least one first network model is created; based on the preset model parameters corresponding to the set neural network structure, the parameters of the at least one first network model are initialized to obtain at least one initialized first network model; and based on the set deployment configuration information, the first network model to be quantized is determined from the at least one initialized first network model.
  • on the one hand, the first network model to be quantized is created based on the set deployment configuration information and on the neural network structure obtained by adjusting the processing layers in the set neural network structure, so that the second network model obtained after quantization can obtain better quantization support after being deployed to the deployment hardware that uses the set deployment configuration information; on the other hand, by using unified preset model parameters to initialize the first network models that use the same neural network structure, the inconsistency of initialization caused by using different initialization methods can be reduced, thereby improving the comparability of the quantization of different neural network models with the same network structure by different quantization algorithms.
  • before the above step S503, the method further includes:
  • Step S511 obtaining a preset pre-training model corresponding to the neural network structure; the structure of the pre-training model before the output layer is the same as the neural network structure.
  • the pre-training model may be any suitable neural network model created in advance based on the neural network structure.
  • Step S512 using the set second training data set to train the parameters of the pre-training model to obtain the trained pre-training model.
  • the second training data set may be a suitable training data set determined in advance according to the target task of the pre-trained model, and may be an image data set, a point cloud data set, or voice data, etc., which is not limited here.
  • Step S513 determining the trained parameters of the pre-training model as the preset model parameters.
  • in this way, a unified pre-training model can be used to pre-train the parameters, and the parameters of the trained pre-training model can be used as the preset model parameters for initializing the parameters of the first network model.
  • the efficiency of model quantization can be improved, and the precision of the quantized second network model can be further improved.
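  • A minimal sketch of this unified initialization is given below; the checkpoint path and helper name are hypothetical, and strict=False only accounts for the output layer possibly differing from the pre-training model.

```python
import copy
import torch

def init_from_pretrained(first_models, pretrain_ckpt_path: str):
    """Initialize every first network model (same structure) from one shared set of
    pre-trained parameters, so that differences between quantization algorithms are
    not caused by different initializations."""
    preset_params = torch.load(pretrain_ckpt_path, map_location="cpu")
    initialized = []
    for model in first_models:
        m = copy.deepcopy(model)
        m.load_state_dict(preset_params, strict=False)  # the output layer may differ
        initialized.append(m)
    return initialized
```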
  • the above step S501 may include:
  • Step S521 determining a target neural network structure from various preset neural network structures.
  • a variety of optional neural network structures can be preset, and the user can determine a suitable target neural network structure from the various preset neural network structures according to actual business needs, which is not limited here.
  • Step S522 based on at least one deployment configuration information, adjust the processing layer in the target neural network structure to obtain at least one adjusted neural network structure.
  • various optional neural network structures can be provided for creating the initial first network model, so that different service requirements of users can be better supported.
  • the embodiment of the present application provides a reproducible and deployable model quantization algorithm library (hereinafter referred to as MQBench), which can be used to evaluate and analyze the reproducibility and deployability of the model quantization algorithm.
  • MQBench reproducible and deployable model quantization algorithm library
  • MQBench provides a variety of deployment hardware types to choose from for deploying quantized models in practical applications, including the central processing unit (CPU), the GPU, the application-specific integrated circuit (ASIC), and the digital signal processor (DSP), and evaluates a large number of state-of-the-art quantization algorithms under a unified training configuration.
  • Users can use MQBench to quantize a trained full-precision network model for tasks such as image classification and object detection, and obtain a quantized network model that can be deployed on the target hardware.
  • the user only needs to provide the corresponding training data set, the deployment configuration information of the target hardware (such as the deployment hardware type, the inference engine used by the deployment hardware type, the quantization bit width corresponding to the deployment hardware type, etc.), and the configuration information of the quantization algorithm (such as the quantization algorithm, the fine-tuning duration, the number of fine-tuning epochs, the training hyperparameters, etc.).
  • MQBench can be implemented using the PyTorch deep learning engine and supports the torch.fx (also known as FX) feature.
  • FX includes a symbolic tracer, an intermediate representation, and Python code generation, allowing deeper metaprogramming.
  • the quantization algorithm and hardware-aware configuration can be implemented in MQBench, and the full-precision network model can be converted into a quantized network model through an application programming interface (Application Programming Interface, API) call.
  • model_qconfig = get_qconfig(**qparams, **backend_params);
  • foldbn_config = get_foldbn_config(foldbn_strategy);
  • qModel = quantize_fx.prepare_qat_fx(model, {"": model_qconfig}, foldbn_config).
  • the quantized network model qModel can then be fine-tuned, calibrated, and optimized.
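  • The flow that such an API call wraps can be sketched with PyTorch's own FX-based quantization tooling; this is only an analogous example using the default qconfig mapping (a hardware-aware quantizer would substitute engine-specific settings), assuming a recent PyTorch and torchvision.

```python
import torch
import torchvision.models as models
from torch.ao.quantization import get_default_qat_qconfig_mapping
from torch.ao.quantization.quantize_fx import prepare_qat_fx, convert_fx

# Full-precision first network model.
model = models.resnet18(weights=None).train()

# Default QAT configuration for the fbgemm (x86 CPU) backend.
qconfig_mapping = get_default_qat_qconfig_mapping("fbgemm")
example_inputs = (torch.randn(1, 3, 224, 224),)

# Trace the model with torch.fx, fuse Conv-BN where applicable, and insert fake-quantize nodes.
qmodel = prepare_qat_fx(model, qconfig_mapping, example_inputs)

# ... fine-tune / calibrate qmodel here ...

# Convert the fake-quantized model into an actually quantized model.
qmodel.eval()
quantized_model = convert_fx(qmodel)
```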
  • MQBench is like a bridge, connecting quantization algorithms and deployment hardware.
  • Figure 6 is a schematic diagram of the application scenario of MQBench provided by the embodiment of the present application.
  • MQBench 60 mainly provides the reproducibility 61 of the quantization algorithm and the deployability 62 of the hardware platform, and the reproducibility of the quantization algorithm 61 can support multiple quantization algorithms 70, including quantization-aware training algorithms 71 and post-training quantization algorithms 72, and the deployability 62 of the hardware platform can support the deployment of quantization algorithms on different deployment hardware 80, including CPU 81, GPU 82, ASIC 83, DSP 84.
  • Hardware-aware quantizer: for different hardware (such as CPUs, GPUs, ASICs, and DSPs), MQBench provides matching support for the computation-graph mode of the inference engine library (such as TVM, TensorRT, ACL, and SNPE) used by the hardware, and can automatically match the insertion positions of the quantization nodes in the computation graph based on the set inference engine library.
  • MQBench supports five general-purpose software libraries (that is, inference engines), including TensorRT for graphics processing unit (GPU) inference, ACL for application-specific integrated circuit (ASIC) inference, SNPE for mobile digital signal processor (DSP) inference, TVM for ARM central processing unit (CPU) inference, and FBGEMM for x86 server-side CPU inference.
  • Each inference engine corresponds to a quantizer. Users can select an appropriate inference engine from these five inference engines for model deployment according to actual application scenarios.
  • Based on the quantizer corresponding to the selected inference engine, MQBench can determine at least one processing layer to be quantized in the full-precision network model and the corresponding hardware-aware quantization parameters.
  • MQBench reproduces various current SOTA (state-of-the-art) quantization algorithms, including the learning-based LSQ, APoT, Quantization Interval Learning (QIL), and PACT algorithms, and the rule-based DSQ, LQ-Net, and DoReFa strategies. Users can select an appropriate quantization algorithm from the multiple quantization algorithms reproduced by MQBench for model quantization according to the actual application scenario. MQBench quantizes the full-precision network model to be quantized according to the selected quantization algorithm.
  • Neural network structure: the neural network structures supported by MQBench include ResNet-18, ResNet-50, MobileNetV2, EfficientNet (the Lite version of EfficientNet is used, with the swish activation replaced by ReLU6 to obtain better integer-value support on the hardware), and RegNetX-600MF with group convolution.
  • Quantization bit width: MQBench supports multiple quantization bit widths such as 8 bits, 4 bits, 3 bits, and 2 bits. In some implementations, a quantization bit width of 8 bits may be used for post-training quantization algorithms, and a quantization bit width of 4 bits may be used for quantization-aware training algorithms.
  • Training settings: in MQBench, fine-tuning is used for parameter training for all quantization algorithms. For full-precision network models using the same neural network structure, a unified pre-training model is used for parameter initialization, which reduces the inconsistency introduced in the initialization stage.
  • MQBench has optimized the deployability of model quantification as follows:
  • BN layer folding: MQBench supports five BN layer folding strategies, and supports folding the parameters of a BN layer into the corresponding convolutional layer according to the configured BN layer folding strategy. Users can choose an appropriate strategy from these five BN layer folding strategies according to the actual application scenario.
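  • The standard form of such folding merges the BN statistics and affine coefficients into the preceding convolution; the sketch below folds the running statistics (strategies that use current-batch statistics or re-estimate BN during training differ in which statistics they take).

```python
import torch
import torch.nn as nn

@torch.no_grad()
def fold_bn_into_conv(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    """Fold a BN layer into the convolutional layer it depends on:
    w_fold = w * gamma / sqrt(var + eps)
    b_fold = beta + (b - mean) * gamma / sqrt(var + eps)"""
    gamma, beta = bn.weight, bn.bias
    mean, var, eps = bn.running_mean, bn.running_var, bn.eps
    std = torch.sqrt(var + eps)

    fused = nn.Conv2d(conv.in_channels, conv.out_channels, conv.kernel_size,
                      conv.stride, conv.padding, conv.dilation, conv.groups, bias=True)
    fused.weight.copy_(conv.weight * (gamma / std).reshape(-1, 1, 1, 1))
    bias = conv.bias if conv.bias is not None else torch.zeros_like(mean)
    fused.bias.copy_(beta + (bias - mean) * gamma / std)
    return fused
```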
  • Computation graph of the block structure: the model quantization schemes in the related art only consider quantizing the inputs and weights of convolutional or fully connected layers.
  • however, a neural network architecture can also include other operations, such as the element-wise addition in the ResNet architecture and the concatenation in the InceptionV3 architecture.
  • in MQBench, different computation-graph optimization levels are considered for different inference engines, and the insertion positions of the quantization nodes in the computation graph are automatically matched based on the set inference engine, so that a quantized neural network computation graph corresponding to the respective computation-graph optimization level is constructed.
  • Using a hardware-aware quantizer can improve the deployability of the quantized network model and its accuracy in actual deployment scenarios.
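  • A schematic residual block with whole-graph quantization is sketched below: both operands of the element-wise addition carry fake-quantize nodes, not just the convolution inputs and weights. The class is illustrative and `fake_quant` stands for any fake-quantizer constructor.

```python
import torch.nn as nn

class QuantizedBasicBlock(nn.Module):
    """Residual block in which fake-quantize nodes cover the whole computation graph."""
    def __init__(self, conv1: nn.Module, conv2: nn.Module, downsample, fake_quant):
        super().__init__()
        self.conv1, self.conv2, self.downsample = conv1, conv2, downsample
        self.relu = nn.ReLU(inplace=True)
        self.fq_branch = fake_quant()    # quantizes the residual-branch output
        self.fq_shortcut = fake_quant()  # quantizes the shortcut before the addition
        self.fq_out = fake_quant()       # quantizes the block output

    def forward(self, x):
        out = self.conv2(self.relu(self.conv1(x)))
        shortcut = self.downsample(x) if self.downsample is not None else x
        out = self.fq_branch(out) + self.fq_shortcut(shortcut)  # both add inputs quantized
        return self.fq_out(self.relu(out))
```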
  • Fig. 7 is a schematic diagram of the composition and structure of a model quantization device provided in the embodiment of the present application.
  • the model quantization device 700 includes: a first acquisition part 710, a first determination part 720 and a quantization part 730, wherein:
  • the first acquiring part 710 is configured to acquire the first network model to be quantified
  • the first determining part 720 is configured to determine at least one processing layer to be quantized in the first network model and a quantization parameter for quantizing each processing layer based on the set deployment configuration information;
  • the quantization part 730 is configured to perform quantization on each of the processing layers in the first network model according to the quantization parameter to obtain a second network model.
  • the first network model includes at least one block structure, and each of the block structures includes at least one processing layer; the first determining part is further configured to: based on the set deployment configuration information, determine At least one processing layer to be quantized in each of the block structures in the first network model and a quantization parameter for quantizing each of the processing layers.
  • the deployment configuration information includes the inference engine used by the deployed hardware type; the first determining part is further configured to: determine the processing layer type to be quantified based on the inference engine; At least one processing layer matching the processing layer type in the network model is determined as the processing layer to be quantized.
  • the processing layer types include a convolutional layer and a batch normalization layer; the first determination part is further configured to: determine at least one batch normalization layer in the first network model and the convolutional layer on which each of the batch normalization layers depends as the processing layers to be quantized; obtain the set batch normalization layer folding strategy; and, based on the batch normalization layer folding strategy, fold each of the batch normalization layers in the first network model into the convolutional layer on which that batch normalization layer depends to obtain the folded first network model;
  • the quantization part is further configured to: quantize each of the processing layers in the folded first network model according to the quantization parameter to obtain the second network model.
  • the batch normalization layer folding strategy includes a batch normalization layer removal status, a coefficient update algorithm, statistical parameters to be incorporated into the weights, and statistical parameters to be incorporated into the offsets;
  • the statistical parameters to be incorporated into the weights include the running statistics of the convolutional layer on which the batch normalization layer depends or the statistics of the current batch, and the statistical parameters to be incorporated into the offsets likewise include the running statistics of the convolutional layer on which the batch normalization layer depends or the statistics of the current batch;
  • the first determination part is further configured to: determine the scaling coefficients and translation coefficients of each of the batch normalization layers in the at least one batch normalization layer in the first network model; update the scaling coefficients and translation coefficients of each of the batch normalization layers based on the coefficient update algorithm to obtain the updated scaling coefficients and translation coefficients of each of the batch normalization layers; and, for each batch normalization layer, obtain the statistical parameters to be incorporated into the weights and the statistical parameters to be incorporated into the offsets in the batch normalization layer, and fold the updated scaling coefficients and translation coefficients of the batch normalization layer, together with these statistical parameters, into the convolutional layer on which the batch normalization layer depends.
  • the first determination part is further configured to: determine a target batch normalization layer folding strategy from multiple set batch normalization layer folding strategies based on the reasoning engine.
  • the quantization part is further configured to: based on the set quantization algorithm and the first training data set, according to the quantization parameters, each of the processing layers in the first network model is Quantify to get the second network model.
  • the quantization parameters include the preset precision of the quantization scale, the quantization symmetry, the quantization bit width, and the quantization granularity, where the quantization symmetry includes symmetric or asymmetric quantization and the quantization granularity includes layer-wise quantization or feature-wise quantization, and the quantization algorithm includes a quantization-aware training algorithm; the quantization part is further configured to: set a pseudo-quantizer for each of the processing layers in the first network model according to the quantization parameters to obtain a third network model; wherein the pseudo-quantizer is configured to: determine the quantization value range of the processing layer parameters based on the quantization bit width; determine a quantization scale satisfying the preset precision and a quantization zero point satisfying the quantization symmetry; and, based on the quantization granularity, perform uniform quantization processing on the processing layer parameters to be quantized within the quantization value range using the quantization scale and the quantization zero point to obtain the quantized processing layer parameters.
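  • A minimal sketch of such a pseudo-quantizer (fake quantizer) as a function is shown below; it follows the steps described above: derive the quantized value range from the bit width, then round, clamp, and de-quantize with the scale and zero point.

```python
import torch

def fake_quantize(x: torch.Tensor, scale: torch.Tensor, zero_point: torch.Tensor,
                  bit_width: int = 8, symmetric: bool = True) -> torch.Tensor:
    """Uniform fake quantization: clamp to the value range implied by the bit width,
    round to integers, then map back to floating point. For feature-level (per-channel)
    granularity, scale and zero_point would be per-channel tensors broadcast over x."""
    if symmetric:
        qmin, qmax = -(2 ** (bit_width - 1)), 2 ** (bit_width - 1) - 1  # e.g. [-128, 127]
    else:
        qmin, qmax = 0, 2 ** bit_width - 1                              # e.g. [0, 255]
    q = torch.clamp(torch.round(x / scale + zero_point), qmin, qmax)
    return (q - zero_point) * scale
```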
  • the quantization part is further configured to: determine preset training hyperparameters corresponding to the neural network structure adopted by the first network model, wherein, for each piece of deployment configuration information among the preset multiple pieces of deployment configuration information, the training hyperparameters are the same; and, using the set first training data set, quantize each of the processing layers in the first network model according to the quantization parameters, based on the quantization algorithm and the training hyperparameters, to obtain the quantized second network model.
  • the first acquisition part is further configured to: adjust the processing layers in the set neural network structure based on at least one piece of deployment configuration information to obtain at least one adjusted neural network structure; create at least one first network model based on the at least one adjusted neural network structure; initialize the parameters of the at least one first network model based on the preset model parameters corresponding to the set neural network structure to obtain at least one initialized first network model; and determine, based on the set deployment configuration information, a first network model to be quantized from the at least one initialized first network model.
  • the device further includes: a second acquisition part configured to acquire a preset pre-training model corresponding to the neural network structure, where the structure of the pre-training model before the output layer is the same as the neural network structure; a pre-training part configured to train the parameters of the pre-training model using the set second training data set to obtain the trained pre-training model; and a second determination part configured to determine the trained parameters of the pre-training model as the preset model parameters.
  • the first acquisition part is further configured to: determine a target neural network structure from a variety of preset neural network structures; and adjust the processing layers in the target neural network structure based on the at least one piece of deployment configuration information to obtain at least one adjusted neural network structure.
  • a "part" may be a part of a circuit, a part of a processor, or a part of a program or software, and of course it may also be a unit; it may be modular or non-modular.
  • if the above-mentioned model quantization method is implemented in the form of a software function module and sold or used as an independent product, it may also be stored in a computer-readable storage medium.
  • the software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the methods described in the embodiments of the present application.
  • the aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a magnetic disk, or an optical disk.
  • embodiments of the present application are not limited to any specific combination of hardware and software.
  • An embodiment of the present application provides a computer device, including a memory and a processor, the memory stores a computer program that can run on the processor, and the processor implements the steps in the above method when executing the program.
  • An embodiment of the present application provides a computer-readable storage medium, on which a computer program is stored, and when the computer program is executed by a processor, the steps in the above method are implemented.
  • the computer readable storage medium may be transitory or non-transitory.
  • An embodiment of the present application provides a computer program, the computer program including computer-readable code, and when the computer-readable code is run in a computer device, a processor in the computer device executes some or all of the steps of the above method.
  • An embodiment of the present application provides a computer program product.
  • the computer program product includes a non-transitory computer-readable storage medium storing a computer program, and when the computer program is read and executed by a computer, some or all of the steps of the above methods are implemented.
  • the computer program product can be specifically realized by means of hardware, software or a combination thereof.
  • the computer program product is embodied as a computer storage medium, and in other embodiments, the computer program product is embodied as a software product, such as a software development kit (Software Development Kit, SDK) and the like.
  • FIG. 8 is a schematic diagram of a hardware entity of a computer device in the embodiment of the present application.
  • the hardware entity of the computer device 800 includes a processor 801, a communication interface 802, and a memory 803, where the processor 801 generally controls the overall operation of the computer device 800.
  • the communication interface 802 enables the computer device to communicate with other terminals or servers over a network.
  • the memory 803 is configured to store instructions and applications executable by the processor 801, and can also cache data to be processed or already processed by the processor 801 and the various modules in the computer device 800 (for example, image data, audio data, voice communication data, and video communication data); it can be implemented by a flash memory (FLASH) or a random access memory (RAM). Data may be transferred among the processor 801, the communication interface 802, and the memory 803 through the bus 804.
  • the disclosed devices and methods may be implemented in other ways.
  • the device embodiments described above are only illustrative.
  • the division of the units is only a logical function division.
  • the coupling, direct coupling, or communication connection between the components shown or discussed may be through some interfaces, and the indirect coupling or communication connection of devices or units may be in electrical, mechanical, or other forms.
  • the units described above as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units; they may be located in one place or distributed over multiple network units; some or all of the units may be selected according to actual needs to achieve the purpose of the solutions of the embodiments.
  • each functional unit in each embodiment of the present application may be integrated into one processing unit, each unit may exist separately as a single unit, or two or more units may be integrated into one unit; the above-mentioned integrated unit can be implemented in the form of hardware or in the form of hardware plus a software functional unit.
  • if the above-mentioned integrated units in the embodiments of the present application are implemented in the form of software function modules and sold or used as independent products, they may also be stored in a computer-readable storage medium.
  • the computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute all or part of the methods described in the various embodiments of the present application.
  • the aforementioned storage medium includes various media capable of storing program codes such as removable storage devices, ROMs, magnetic disks or optical disks.
  • the embodiments of the present application disclose a model quantization method, apparatus, device, storage medium, computer program product, and computer program, wherein the method includes: acquiring a first network model to be quantized; determining, based on set deployment configuration information, at least one processing layer to be quantized in the first network model and quantization parameters for quantizing each of the processing layers; and quantizing each of the processing layers in the first network model according to the quantization parameters to obtain a second network model.
  • the deployment configuration information of the hardware platform on which the model is deployed can be fully considered during the model quantification process of the first network model, so as to obtain the second network model deployable on the corresponding hardware platform.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)
  • Image Analysis (AREA)

Abstract

A model quantization method and apparatus, a device, a storage medium, a computer program product, and a computer program. The method comprises: obtaining a first network model to be quantized (S101); on the basis of set deployment configuration information, determining at least one processing layer to be quantized in the first network model and a quantization parameter for quantizing each processing layer (S102); and quantizing each processing layer in the first network model according to the quantization parameter, so as to obtain a second network model (S103).

Description

Model quantization method, apparatus, device, storage medium, computer program product, and computer program
Cross-Reference to Related Applications
The embodiments of this application are based on the Chinese patent application with application number 202111030764.1, filed on September 3, 2021 and entitled "Model quantization method, apparatus, device, storage medium and computer program product", and claim the priority of that Chinese patent application, the entire content of which is hereby incorporated into this application by reference.
Technical Field
The embodiments of the present application relate to, but are not limited to, the field of artificial intelligence, and in particular relate to a model quantization method, apparatus, device, storage medium, computer program product, and computer program.
Background Art
Modern deep learning techniques pursue higher performance by consuming more memory and computing power. Although large models can be trained in the cloud, directly deploying them on edge devices is very difficult because computing resources (including latency, energy, and memory) are limited. Techniques such as model quantization, pruning, distillation, lightweight network design, and weight matrix factorization can accelerate the inference of deep models. Among them, model quantization quantizes the weights and activation values in a neural network from the original floating-point type to low-bit-width (such as 8-bit, 4-bit, 3-bit, or 2-bit) integers. After quantization, the storage space required by the quantized neural network model is reduced, and the computation changes from the original floating-point operations to cheaper low-bit-width integer operations.
In the related art, model quantization work often cannot be put into practical application, and the obtained quantized neural network models usually cannot be deployed on hardware.
Summary of the Invention
In view of this, embodiments of the present application provide a model quantization method, apparatus, device, storage medium, computer program product, and computer program.
The technical solutions of the embodiments of the present application are implemented as follows:
An embodiment of the present application provides a model quantization method, the method including:
acquiring a first network model to be quantized;
determining, based on set deployment configuration information, at least one processing layer to be quantized in the first network model and a quantization parameter for quantizing each of the processing layers; and
quantizing each of the processing layers in the first network model according to the quantization parameter to obtain a second network model.
An embodiment of the present application provides a model quantization apparatus, the apparatus including:
a first acquisition part configured to acquire a first network model to be quantized;
a first determination part configured to determine, based on set deployment configuration information, at least one processing layer to be quantized in the first network model and a quantization parameter for quantizing each of the processing layers; and
a quantization part configured to quantize each of the processing layers in the first network model according to the quantization parameter to obtain a second network model.
An embodiment of the present application provides a computer device, including a memory and a processor, where the memory stores a computer program executable on the processor, and the processor implements some or all of the steps of the above method when executing the program.
An embodiment of the present application provides a computer-readable storage medium on which a computer program is stored, and when the computer program is executed by a processor, some or all of the steps of the above method are implemented.
An embodiment of the present application provides a computer program, including computer-readable code, and when the computer-readable code runs in a computer device, a processor in the computer device executes some or all of the steps of the above method.
An embodiment of the present application provides a computer program product, the computer program product including a non-transitory computer-readable storage medium storing a computer program, and when the computer program is read and executed by a computer, some or all of the steps of the above method are implemented.
In the embodiments of the present application, a first network model to be quantized is acquired; based on set deployment configuration information, at least one processing layer to be quantized in the first network model and a quantization parameter for quantizing each of the processing layers are determined; and each of the processing layers in the first network model is quantized according to the quantization parameter to obtain a second network model. In this way, since the processing layers to be quantized in the first network model and the quantization parameters for quantizing each processing layer to be quantized are determined based on the set deployment configuration information, the deployment configuration information of the hardware platform on which the model is to be deployed is fully considered during model quantization, so that the obtained second network model is deployable on the corresponding hardware platform.
Brief Description of the Drawings
Fig. 1 is a schematic flowchart of the implementation of a model quantization method provided by an embodiment of the present application;
Fig. 2A is a schematic flowchart of the implementation of a model quantization method provided by an embodiment of the present application;
Fig. 2B is a schematic diagram of inserting quantization nodes into the computation graph of a basic block structure provided by an embodiment of the present application;
Fig. 2C is a schematic diagram of inserting quantization nodes into the computation graph of a basic block structure provided by an embodiment of the present application;
Fig. 2D is a schematic diagram of inserting quantization nodes into the computation graph of a basic block structure provided by an embodiment of the present application;
Fig. 3A is a schematic flowchart of the implementation of a model quantization method provided by an embodiment of the present application;
Fig. 3B is a schematic diagram of the implementation of a batch normalization layer folding strategy provided by an embodiment of the present application;
Fig. 3C is a schematic diagram of the implementation of a batch normalization layer folding strategy provided by an embodiment of the present application;
Fig. 3D is a schematic diagram of the implementation of a batch normalization layer folding strategy provided by an embodiment of the present application;
Fig. 3E is a schematic diagram of the implementation of a batch normalization layer folding strategy provided by an embodiment of the present application;
Fig. 3F is a schematic diagram of the implementation of a batch normalization layer folding strategy provided by an embodiment of the present application;
Fig. 4 is a schematic flowchart of the implementation of a model quantization method provided by an embodiment of the present application;
Fig. 5 is a schematic flowchart of the implementation of a model quantization method provided by an embodiment of the present application;
Fig. 6 is a schematic diagram of an application scenario of MQBench provided by an embodiment of the present application;
Fig. 7 is a schematic diagram of the composition and structure of a model quantization apparatus provided by an embodiment of the present application;
Fig. 8 is a schematic diagram of a hardware entity of a computer device provided by an embodiment of the present application.
Detailed Description
In order to make the purpose, technical solutions, and advantages of this application clearer, the technical solutions of this application are further elaborated below with reference to the accompanying drawings and embodiments. The described embodiments should not be regarded as limiting this application, and all other embodiments obtained by those of ordinary skill in the art without creative work fall within the protection scope of this application.
In the following description, "some embodiments" describes a subset of all possible embodiments; it can be understood that "some embodiments" may be the same subset or different subsets of all possible embodiments, and may be combined with one another without conflict. In the following description, the terms "first/second/third" are only used to distinguish similar objects and do not represent a specific ordering of the objects. It can be understood that, where permitted, the specific order or sequence of "first/second/third" may be interchanged so that the embodiments of the application described herein can be implemented in an order other than that illustrated or described herein.
Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by those skilled in the technical field to which this application belongs. The terms used herein are only for the purpose of describing this application and are not intended to limit this application.
In order to better understand the embodiments of the present application, the model quantization solutions in the related art are first described. In the related art, model quantization solutions often cannot be practically applied and deployed because they ignore the requirements of hardware deployment. On the one hand, after a model is deployed on a hardware platform, the hardware platform usually optimizes the computation of the batch normalization (BN) layer into the convolutional layer to avoid extra overhead, whereas the BN layer in the related art is kept intact. On the other hand, the related art only considers quantizing the input parameters and weight parameters of the convolutional layer, but when the model is deployed, the entire computation graph of the neural network model should be quantized, that is, the input parameters, weight parameters, and so on of processing layers other than the convolutional layer also need to be quantized. Therefore, the model quantization solutions in the related art will inevitably reduce the deployability of the quantization algorithms. In addition, since different quantization algorithms have different deployability on different hardware platforms, it is also impossible in academic research to measure the performance and robustness of different quantization algorithms under different hardware and quantization methods.
An embodiment of the present application provides a model quantization method, which can be executed by a processor of a computer device. The computer device may be a server, a notebook computer, a tablet computer, a desktop computer, a smart TV, a set-top box, a mobile device (such as a mobile phone, a portable video player, a personal digital assistant, a dedicated messaging device, or a portable game device), or another device with data processing capability. Fig. 1 is a schematic flowchart of the implementation of a model quantization method provided by an embodiment of the present application. As shown in Fig. 1, the method includes:
Step S101, acquiring a first network model to be quantized.
Here, the first network model may be any suitable neural network model to be quantized and may be a full-precision neural network model. For example, the first network model may be a neural network model with 32-bit floating-point parameters or 16-bit floating-point parameters; of course, this embodiment does not limit the number of floating-point bits of the first network model. During implementation, the first network model may adopt any suitable neural network structure, including but not limited to one or more of ResNet-18, ResNet-50, MobileNetV2, EfficientNet-Lite, RegNet, and the like.
Step S102, determining, based on set deployment configuration information, at least one processing layer to be quantized in the first network model and a quantization parameter for quantizing each of the processing layers.
Here, the deployment configuration information may include, but is not limited to, one or more of the deployment hardware type, the inference engine used by the deployment hardware type, the model of the deployment hardware, the quantization bit width of the network model parameters corresponding to the deployment hardware type, and the like. During implementation, the deployment configuration information may be preset by the user, may be a default, or may be obtained from a configuration file of the target deployment hardware, which is not limited here.
The first network model may include multiple processing layers, such as one or more of an input layer, a convolutional layer, a pooling layer, a downsampling layer, a linear rectification unit, a fully connected layer, a batch normalization layer, and the like. Since different deployment environments may have different support capabilities for model quantization, at least one processing layer to be quantized in the first network model can be determined based on the set deployment configuration information. During implementation, the at least one processing layer to be quantized in the first network model may be determined in an appropriate manner based on the set deployment configuration information according to the actual situation, which is not limited in this embodiment of the present application. In some implementations, the correspondence between different deployment configuration information and the processing layers to be quantized can be determined in advance according to the actual situation, and the at least one processing layer to be quantized in the first network model can be determined by querying this correspondence with the set deployment configuration information. For example, for a first deployment hardware type or a first inference engine, it may be determined that only the convolutional layers in the first network model are quantized; for a second deployment hardware type or a second inference engine, it may be determined that each convolutional layer, input layer, and fully connected layer in the first network model is quantized; and for a third inference engine, each convolutional layer, input layer, fully connected layer, and element-wise addition computation layer in the first network model may be quantized. In some implementations, the parameters to be quantized in each of the at least one processing layer to be quantized in the first network model may also be determined based on the set deployment configuration information.
The quantization parameters for quantizing each processing layer may include, but are not limited to, one or more of the preset precision of the quantization scale used in quantizing the processing layer, the quantization symmetry, the quantization bit width, the quantization granularity, and the like. For example, the preset precision of the quantization scale may include full precision, power-of-two precision, and so on. The quantization symmetry may be symmetric quantization or asymmetric quantization. The quantization bit width may be one of 8 bits, 4 bits, 3 bits, 2 bits, and so on. The quantization granularity may be layer-wise quantization (that is, tensor-level quantization) or feature-wise quantization (that is, channel-level quantization). Different deployment hardware platforms support or are suited to different quantization parameters such as the precision of the quantization scale, the quantization symmetry, the quantization bit width, and the quantization granularity; based on the set deployment configuration information, the quantization parameters used in quantizing each processing layer to be quantized in the first network model can be determined. During implementation, those skilled in the art may determine, in an appropriate manner according to the actual situation, the quantization parameters for quantizing each processing layer to be quantized in the first network model based on the set deployment configuration information, which is not limited here. In some implementations, the correspondence between different deployment configuration information and quantization parameters can be determined in advance according to the actual situation, and the quantization parameters for quantizing each processing layer to be quantized in the first network model can be determined by querying this correspondence based on the set deployment configuration information.
步骤S103,对所述第一网络模型中的每一所述处理层按照所述量化参数进行量化,得到第二网络模型。Step S103, performing quantization on each of the processing layers in the first network model according to the quantization parameter to obtain a second network model.
这里,可以根据实际情况采用任意合适的量化算法对第一网络模型中的每一处理层按照量化参数进行量化,得到量化后的第二网络模型。量化算法可以包括但不限于训练后量化算法、量化感知训练算法等中的一种或多种。训练后量化算法指的是对预训练后的网络模型选择合适的量化操作和校准操作,以实现量化损失的最小化,可以是训练后静态量化,也可以是训练后动态量化。量化感知训练算法是指在网络的量化过程中进行训练,通过量化感知训练,可以使得网络能够适应整型数值的不连续分布,减少量化过程造成的运算精度损失,可以包括但不限于学习步长量化(Learned Step-size Quantization,LSQ)算法、参数化剪裁激活(PArameterized Clipping acTivation,PACT)算法、加法二次幂量化(Additive Powers-of-Two,APoT)算法、可微软量化(Differentiable Soft Quantization,DSQ)、DoReFa-Net训练算法、高精度紧凑型深层神经网络的学习量化(Learned Quantization for Highly Accurate and Compact Deep Neural Networks,LQ-net)算法等。Here, any suitable quantization algorithm may be used according to the actual situation to quantize each processing layer in the first network model according to the quantization parameter to obtain the quantized second network model. Quantization algorithms may include, but are not limited to, one or more of post-training quantization algorithms, quantization-aware training algorithms, and the like. The post-training quantization algorithm refers to selecting the appropriate quantization operation and calibration operation for the pre-trained network model to minimize the quantization loss. It can be static quantization after training or dynamic quantization after training. The quantization-aware training algorithm refers to training during the quantization process of the network. Through quantization-aware training, the network can adapt to the discontinuous distribution of integer values and reduce the loss of operational accuracy caused by the quantization process, which can include but not limited to the learning step size Quantization (Learned Step-size Quantization, LSQ) algorithm, parameterized clipping activation (PAParameterized Clipping acTivation, PACT) algorithm, additive power of two power quantization (Additive Powers-of-Two, APoT) algorithm, differentiable soft quantization (Differentiable Soft Quantization, DSQ), DoReFa-Net training algorithm, Learning Quantization for Highly Accurate and Compact Deep Neural Networks (LQ-net) algorithm, etc.
在一些实施方式中,在对第一网络模型进行量化的实现过程中,可以基于第一网络模型的网络 结构提取第一网络模型的计算图,通过在第一网络模型的计算图中插入至少一个量化节点,来对第一网络模型中的至少一个处理层进行量化,以构建第二网络模型的计算图,在第二网络模型的计算图中,对每一待量化的处理层进行量化处理的量化节点采用的量化参数即为对该处理层进行量化的量化参数,基于该第二网络模型的计算图可以得到量化后的第二网络模型。在一些实施方式中,还可以根据实际情况采用任意合适的量化算法和训练数据在第二网络模型的计算图上进行参数训练,得到训练后的第二网络模型的计算图,并基于训练后的第二网络模型的计算图,得到训练后的第二网络模型。In some implementations, in the implementation process of quantifying the first network model, the calculation graph of the first network model can be extracted based on the network structure of the first network model, by inserting at least one A quantization node is used to quantify at least one processing layer in the first network model to construct a calculation graph of the second network model, and perform quantization processing on each processing layer to be quantized in the calculation graph of the second network model The quantization parameter adopted by the quantization node is the quantization parameter for quantizing the processing layer, and the quantized second network model can be obtained based on the calculation graph of the second network model. In some embodiments, any suitable quantization algorithm and training data can be used to perform parameter training on the calculation graph of the second network model according to the actual situation, to obtain the calculation graph of the second network model after training, and based on the trained A calculation graph of the second network model to obtain the trained second network model.
Because different deployment hardware considers different levels of graph optimization when the computation graph of a quantized neural network is constructed, different quantization-node insertion strategies can be adopted for different deployment configuration information: at least one quantization node is inserted into the computation graph of the first network model to construct a suitable computation graph for the quantized neural network (that is, the computation graph of the second network model). Inserting a quantization node at a given position in the computation graph of the first network model amounts to quantizing the processing layer corresponding to the logical node at that position, so determining where quantization nodes are inserted in the computation graph of the first network model is equivalent to determining the at least one processing layer to be quantized in the first network model.
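As a toy sketch of this graph rewriting (the Node class, the layer list and the insertion rule below are illustrative assumptions, not the disclosed implementation; real toolchains operate on the framework's own graph IR):

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    name: str
    op: str                       # e.g. "conv", "relu", "fake_quant"
    attrs: dict = field(default_factory=dict)

def insert_fake_quant_nodes(graph: List[Node], quantizable_ops: set,
                            quant_params: dict) -> List[Node]:
    """Build the quantized model's graph by inserting a fake-quant node
    in front of every node whose op type should be quantized."""
    new_graph: List[Node] = []
    for node in graph:
        if node.op in quantizable_ops:
            fq = Node(name=f"{node.name}_input_fq", op="fake_quant",
                      attrs=dict(quant_params))   # per-layer quantization parameters
            new_graph.append(fq)
        new_graph.append(node)
    return new_graph

# Example: quantize only convolution and fully-connected layers.
fp32_graph = [Node("conv1", "conv"), Node("relu1", "relu"), Node("fc", "linear")]
q_graph = insert_fake_quant_nodes(fp32_graph, {"conv", "linear"},
                                  {"bit": 8, "symmetric": True, "per_channel": False})
print([n.op for n in q_graph])   # ['fake_quant', 'conv', 'relu', 'fake_quant', 'linear']
```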
在一些实施例中,所述部署配置信息包括部署硬件类型采用的推理引擎;上述步骤S102中所述的基于设定的部署配置信息,确定所述第一网络模型中待量化的至少一个处理层,可以包括:In some embodiments, the deployment configuration information includes the inference engine used by the deployed hardware type; based on the set deployment configuration information described in step S102 above, determine at least one processing layer to be quantified in the first network model , which can include:
步骤S111,基于所述推理引擎,确定待量化的处理层类型;Step S111, based on the inference engine, determine the processing layer type to be quantized;
步骤S112,将所述第一网络模型中与所述处理层类型匹配的至少一个处理层确定为待量化的处理层。Step S112, determining at least one processing layer in the first network model that matches the processing layer type as the processing layer to be quantized.
Here, the deployment hardware type is the hardware type of the target hardware on which the quantized second network model is to be deployed; the inference engines used by different deployment hardware types may be the same or different, which is not limited here. Inference engines may include, but are not limited to, TensorRT, ACL, TVM, SNPE, FBGEMM, and the like. In implementation, deployment hardware can be classified in any appropriate way according to the actual situation. For example, hardware may be classified by manufacturer, in which case the deployment hardware type is the manufacturer of the deployment hardware, and the inference engine used by that hardware type is the inference engine adopted by that manufacturer. Hardware may also be classified by specification and model, in which case the deployment hardware type is the model of the deployment hardware, and the inference engine used by that hardware type is the inference engine adopted by hardware of that model.
Different inference engines can support quantization of different types of processing layers. Processing layer types may include, but are not limited to, one or more of an input layer, a convolutional layer, a pooling layer, a down-sampling layer, a rectified linear unit, a fully connected layer, a batch normalization layer, and the like. In some implementations, the correspondence between different inference engines and the processing layer types to be quantized can be determined in advance, and based on this correspondence the processing layer types corresponding to the inference engine used by the deployment hardware type can be determined.
在确定待量化的处理层类型之后,可以将第一网络模型中的每一处理层与该处理层类型匹配,并将匹配到的至少一个处理层确定为待量化的处理层。After determining the processing layer type to be quantized, each processing layer in the first network model may be matched with the processing layer type, and at least one matched processing layer may be determined as the processing layer to be quantized.
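One way such an engine-to-layer-type correspondence could be stored and queried is sketched below; the table entries are made-up placeholders, not the actual capabilities of TensorRT, ACL, TVM, SNPE or FBGEMM:

```python
# Hypothetical mapping from inference engine to the layer types it can quantize.
QUANTIZABLE_LAYER_TYPES = {
    "TensorRT": {"conv", "linear", "add"},
    "TVM":      {"conv", "linear"},
    "SNPE":     {"conv", "linear", "pool"},
}

def select_layers_to_quantize(model_layers, engine):
    """model_layers: iterable of (layer_name, layer_type) pairs."""
    supported = QUANTIZABLE_LAYER_TYPES.get(engine, set())
    return [name for name, ltype in model_layers if ltype in supported]

layers = [("conv1", "conv"), ("bn1", "batchnorm"), ("fc", "linear")]
print(select_layers_to_quantize(layers, "TVM"))   # ['conv1', 'fc']
```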
In the embodiments of the present application, the first network model to be quantized is acquired; based on the set deployment configuration information, at least one processing layer to be quantized in the first network model and the quantization parameters for quantizing each such processing layer are determined; and each of these processing layers in the first network model is quantized according to the quantization parameters to obtain the second network model. In this way, since the processing layers to be quantized in the first network model and the quantization parameters for quantizing each of them are determined based on the set deployment configuration information, the deployment configuration information of the hardware platform on which the model will be deployed is fully taken into account during model quantization, so that the resulting second network model is deployable on the corresponding hardware platform.
本申请实施例提供一种模型量化方法,该方法可以由计算机设备的处理器执行。如图2A所示,该方法包括:An embodiment of the present application provides a model quantization method, which can be executed by a processor of a computer device. As shown in Figure 2A, the method includes:
步骤S201,获取待量化的第一网络模型。Step S201, acquiring a first network model to be quantized.
这里,上述步骤S201对应于前述步骤S101,在实施时可以参照前述步骤S101的实施方式。Here, the above-mentioned step S201 corresponds to the above-mentioned step S101, and the implementation of the above-mentioned step S101 can be referred to for implementation.
步骤S202,基于设定的部署配置信息,确定所述第一网络模型中每一所述块结构中待量化的至少一个处理层以及对每一所述处理层进行量化的量化参数。Step S202, based on the set deployment configuration information, determine at least one processing layer to be quantized in each of the block structures in the first network model and a quantization parameter for quantizing each of the processing layers.
这里,神经网络模型的结构可分为多个阶段(stage),每个阶段又可分为多个块(block),每个块又可分为多个处理层(layer)。本实施例中以块(block)结构为单位进行量化处理。第一网络模型包括至少一个块结构,每一所述块结构包括至少一个处理层。Here, the structure of the neural network model can be divided into multiple stages (stages), each stage can be divided into multiple blocks (blocks), and each block can be divided into multiple processing layers (layers). In this embodiment, quantization processing is performed in units of a block structure. The first network model includes at least one block structure, each of said block structures includes at least one processing layer.
在一些实施方式中,可以基于预先确定部署配置信息与不同块结构中待量化的处理层之间的对应关系,确定与设定的部署配置信息对应的每一块结构中的待量化处理层。In some implementations, the processing layers to be quantized in each block structure corresponding to the set deployment configuration information may be determined based on the predetermined correspondence between the deployment configuration information and the processing layers to be quantized in different block structures.
In some implementations, for each block structure in the first network model, an insertion strategy corresponding to the set deployment configuration information may be determined for inserting fake-quantization nodes into the computation subgraph corresponding to that block structure in the computation graph, thereby determining the at least one processing layer to be quantized in that block structure. For example, when the neural network structure adopted by the first network model is ResNet-18/ResNet-34, for the basic block structure of ResNet-18/ResNet-34 and for different deployment configuration information, the three different insertion strategies shown in Figures 2B to 2D may be used to insert at least one fake-quantization node into the computation subgraph corresponding to that basic block. In the insertion strategy shown in Figure 2B, a fake-quantization node FakeQuant 20 is inserted at the input of every convolutional layer Conv 10 in the computation subgraph, where the fake-quantization node FakeQuant 20 consists of a quantization node Quantization 21 and a de-quantization node Dequantization 22; the processing layers to be quantized in the basic block corresponding to this subgraph are therefore all of the convolutional layers in the basic block. In the insertion strategy shown in Figure 2C, the input of the computation subgraph is already-quantized data (that is, the inputs of the convolutional layers Conv 10-1 and Conv 10-2), and fake-quantization nodes FakeQuant 20 are inserted at the input of the convolutional layer Conv 10-3, at one input of the element-wise addition layer elementwise-add 30, and at the output of the computation subgraph; the processing layers to be quantized in the corresponding basic block are therefore every convolutional layer, the element-wise addition layer (with only one of its inputs quantized), and the output layer of the basic block. In the insertion strategy shown in Figure 2D, the input of the computation subgraph is likewise already-quantized data (that is, the inputs of the convolutional layers Conv 10-1 and Conv 10-2), and fake-quantization nodes FakeQuant 20 are inserted at the input of the convolutional layer Conv 10-3, at each input of the element-wise addition layer elementwise-add 30, and at the output of the computation subgraph; the processing layers to be quantized in the corresponding basic block are therefore every convolutional layer, the element-wise addition layer (with both of its inputs quantized), and the output layer of the basic block.
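The three insertion strategies of Figures 2B to 2D can be thought of as different sets of tensors in a ResNet basic block that receive a FakeQuant node; the sketch below simply enumerates them (the strategy names and tensor labels are invented for illustration):

```python
# Hypothetical encodings of the three FakeQuant insertion strategies for a ResNet basic block.
INSERTION_STRATEGIES = {
    # Fig. 2B: quantize the input of every convolution only.
    "conv_inputs_only": {"conv_inputs"},
    # Fig. 2C: also quantize one input of the element-wise add and the block output.
    "add_one_side_and_output": {"conv_inputs", "add_left_input", "block_output"},
    # Fig. 2D: quantize both inputs of the element-wise add and the block output.
    "add_both_sides_and_output": {"conv_inputs", "add_left_input",
                                  "add_right_input", "block_output"},
}

def fake_quant_points(deploy_config: str) -> set:
    """Return the set of tensors to fake-quantize for a given deployment configuration."""
    return INSERTION_STRATEGIES[deploy_config]
```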
步骤S203,对所述第一网络模型中的每一所述处理层按照所述量化参数进行量化,得到第二网络模型。Step S203, performing quantization on each of the processing layers in the first network model according to the quantization parameter to obtain a second network model.
这里,上述步骤S203对应于前述步骤S103,在实施时可以参照前述步骤S103的实施方式。Here, the above-mentioned step S203 corresponds to the above-mentioned step S103, and the implementation of the above-mentioned step S103 can be referred to for implementation.
本申请实施例中,第一网络模型包括至少一个块结构,每一块结构包括至少一个处理层,基于设定的部署配置信息,确定第一网络模型中每一块结构中待量化的至少一个处理层以及对每一处理层进行量化的量化参数,对第一网络模型中的每一待量化的处理层按照该量化参数进行量化,得到第二网络模型。这样,可以对第一网络模型中的全部块结构进行量化,从而可以实现整个网络模型的量化。In the embodiment of the present application, the first network model includes at least one block structure, each block structure includes at least one processing layer, and based on the set deployment configuration information, at least one processing layer to be quantified in each block structure in the first network model is determined and a quantization parameter for quantizing each processing layer, and performing quantization on each processing layer to be quantized in the first network model according to the quantization parameter to obtain a second network model. In this way, all block structures in the first network model can be quantized, thereby realizing the quantization of the entire network model.
本申请实施例提供一种模型量化方法,该方法可以由计算机设备的处理器执行。如图3A所示,该方法包括:An embodiment of the present application provides a model quantization method, which can be executed by a processor of a computer device. As shown in Figure 3A, the method includes:
步骤S301,获取待量化的第一网络模型。Step S301, acquiring the first network model to be quantized.
步骤S302,基于设定的部署硬件类型采用的推理引擎,确定待量化的处理层类型。Step S302, based on the inference engine used by the set deployment hardware type, determine the processing layer type to be quantified.
步骤S303,将所述第一网络模型中与所述处理层类型匹配的至少一个处理层确定为待量化的处理层。Step S303, determining at least one processing layer in the first network model that matches the processing layer type as the processing layer to be quantized.
步骤S304,基于所述推理引擎,确定对每一所述处理层进行量化的量化参数。Step S304, based on the inference engine, determine quantization parameters for quantizing each of the processing layers.
这里,上述步骤S301至步骤S304对应于前述步骤S101至步骤S102,在实施时可以参照前述步骤S101至步骤S102的具体实施方式。Here, the above-mentioned steps S301 to S304 correspond to the above-mentioned steps S101 to S102, and the specific implementation manners of the above-mentioned steps S101 to S102 can be referred to for implementation.
步骤S305,将所述第一网络模型中的至少一个批量归一化层和每一所述批量归一化层依赖的卷积层确定为待量化的处理层。Step S305, determining at least one batch normalization layer in the first network model and the convolutional layer that each batch normalization layer depends on as processing layers to be quantized.
这里,批量归一化层依赖的卷积层可以是该批量归一化层之前与该批量归一化层连接的卷积层。Here, the convolutional layer on which the batch normalization layer depends may be the convolutional layer connected to the batch normalization layer before the batch normalization layer.
步骤S306,获取设定的批量归一化层折叠策略。Step S306, obtaining the set batch normalization layer folding strategy.
这里,批量归一化折叠策略指的是将神经网络模型中的批量归一化层折叠至该批量归一化层依赖的卷积层中的策略。在神经网络模型中,批量归一化层旨在减少内部协变量偏移并平滑损失,以实现快速收敛。批量归一化层为每个卷积层输出引入了两步线性变换,即缩放和平移。在实施时,本领域技术人员可以根据实际情况设定合适的批量归一化层折叠策略,本申请实施例对此并不限定。在一些实施例中,设定的批量归一化层折叠策略可以是预设的与部署配置信息对应的批量归一化层折叠策略。Here, the batch normalization folding strategy refers to the strategy of folding the batch normalization layer in the neural network model into the convolutional layer that the batch normalization layer depends on. In neural network models, batch normalization layers are designed to reduce internal covariate shifts and smooth losses for fast convergence. The batch normalization layer introduces a two-step linear transformation, scaling and translation, to each convolutional layer output. During implementation, those skilled in the art can set an appropriate batch normalization layer folding strategy according to actual conditions, which is not limited in this embodiment of the present application. In some embodiments, the set batch normalization layer folding strategy may be a preset batch normalization layer folding strategy corresponding to the deployment configuration information.
Step S307, based on the batch normalization layer folding strategy, fold each batch normalization layer in the first network model into the convolutional layer on which that batch normalization layer depends, to obtain the folded first network model.
步骤S308,对折叠后的所述第一网络模型中的每一所述处理层按照所述量化参数进行量化, 得到第二网络模型。Step S308, performing quantization on each of the processing layers in the folded first network model according to the quantization parameter to obtain a second network model.
In some embodiments, the batch normalization layer folding strategy includes the removal state of the batch normalization layer, a coefficient update algorithm, the statistical parameters to be merged into the weights, and the statistical parameters to be merged into the offset; the statistical parameters to be merged into the weights include the running statistics of the convolutional layer on which the batch normalization layer depends or the statistics of the current batch, and the statistical parameters to be merged into the offset likewise include the running statistics of the convolutional layer on which the batch normalization layer depends or the statistics of the current batch. Folding each batch normalization layer in the first network model into the convolutional layer on which it depends based on the batch normalization layer folding strategy, as described in step S307 above, may include:
步骤S311,确定第一网络模型中的至少一个批量归一化层中每一所述批量归一化层的缩放系数和平移系数;Step S311, determining the scaling coefficient and translation coefficient of each batch normalization layer in at least one batch normalization layer in the first network model;
这里,可以基于每一批量归一化层的参数确定该批量归一化层的缩放系数和平移系数。Here, the scaling coefficient and translation coefficient of each batch normalization layer may be determined based on parameters of the batch normalization layer.
步骤S312,基于所述系数更新算法,对每一所述批量归一化层的缩放系数和平移系数进行更新,得到每一所述批量归一化层的更新后的缩放系数和平移系数。Step S312 , based on the coefficient update algorithm, update the scaling coefficient and translation coefficient of each batch normalization layer to obtain the updated scaling coefficient and translation coefficient of each batch normalization layer.
Here, the coefficient update algorithm is any suitable algorithm set for updating the scaling coefficients and translation coefficients of the batch normalization layers, and may include, but is not limited to, one or more of gradient descent, simulated annealing, genetic algorithms, and the like. In some implementations, the coefficient update algorithm may also be "no update", in which case the scaling coefficients and translation coefficients of the batch normalization layers are not updated.
Step S313, for each batch normalization layer, obtain the statistical parameters to be merged into the weights and the statistical parameters to be merged into the offset for that batch normalization layer, merge the updated scaling coefficient of the batch normalization layer and the statistical parameters to be merged into the weights into the weights of the convolutional layer on which the batch normalization layer depends, and merge the updated scaling coefficient and translation coefficient of the batch normalization layer together with the statistical parameters to be merged into the offset into the offset of that convolutional layer.
Here, the statistical parameters to be merged into the weights may include the running statistics of the convolutional layer on which the batch normalization layer depends or the statistics of the current batch, and the statistical parameters to be merged into the offset may likewise include the running statistics of that convolutional layer or the statistics of the current batch.
运行统计数据为对卷积层历史运行过程中的输出数据进行统计得到的统计数据,可以包括但不限于历史输出数据的均值、方差、滑动平均值等中的一种或多种。当前批次的统计数据为对卷积层输出数据中当前批次的数据进行统计得到的统计数据,可以包括但不限于当前批次数据的均值、方差等中的一种或多种。在实施时,卷积层的当前批次的统计数据可以通过在该卷积层中使用全精度的权重进行卷积计算得到。Running statistical data is statistical data obtained from the output data during the historical operation of the convolutional layer, which may include but not limited to one or more of the mean, variance, and sliding average of the historical output data. The statistical data of the current batch is the statistical data obtained by statistics of the current batch of data in the output data of the convolutional layer, which may include but not limited to one or more of the mean value and variance of the current batch of data. During implementation, the statistics of the current batch of the convolutional layer can be calculated by performing convolution with full-precision weights in the convolutional layer.
In some implementations, the statistical parameters to be merged into the weights may include the variance of the historical output data of the convolutional layer on which the batch normalization layer depends, and the statistical parameters to be merged into the offset may include the mean and variance of the historical output data of that convolutional layer. The updated scaling coefficient of the batch normalization layer and the variance of the historical output data of the convolutional layer on which it depends may be merged into the weights of that convolutional layer, and the updated scaling coefficient and translation coefficient of the batch normalization layer, together with the mean and variance of the historical output data of that convolutional layer, may be merged into the offset of that convolutional layer.
In some implementations, the statistical parameters to be merged into the weights may include the variance of the current batch of data of the convolutional layer on which the batch normalization layer depends, and the statistical parameters to be merged into the offset may include the mean and variance of the current batch of data of that convolutional layer. The updated scaling coefficient of the batch normalization layer and the variance of the current batch of data of the convolutional layer on which it depends may be merged into the weights of that convolutional layer, and the updated scaling coefficient and translation coefficient of the batch normalization layer, together with the mean and variance of the current batch of data of that convolutional layer, may be merged into the offset of that convolutional layer.
In some implementations, the statistical parameters to be merged into the weights may include the variance of the historical output data of the convolutional layer on which the batch normalization layer depends, and the statistical parameters to be merged into the offset may include the mean and variance of the current batch of data of that convolutional layer. The updated scaling coefficient of the batch normalization layer and the variance of the historical output data of the convolutional layer on which it depends may be merged into the weights of that convolutional layer, and the updated scaling coefficient and translation coefficient of the batch normalization layer, together with the mean and variance of the current batch of data of that convolutional layer, may be merged into the offset of that convolutional layer.
步骤S314,在所述批量归一化层的移除状态为移除的情况下,将每一所述批量归一化层从所述第一网络模型中移除。Step S314, if the removal state of the batch normalization layer is removed, remove each batch normalization layer from the first network model.
In some embodiments, at inference time, the scaling coefficient and translation coefficient of a batch normalization layer and the running statistics of the convolutional layer on which it depends can be merged into the weights and offset of that convolutional layer in the manner shown in formula (1), folding the linear transformation performed by the batch normalization layer into the corresponding convolutional layer:

    w_fold = γ · w / √(σ² + ε),    b_fold = β + γ · (b − μ) / √(σ² + ε)        (1)

where w and b are the weight and offset of the convolutional layer before folding; w_fold and b_fold are the merged weight and offset of the convolutional layer; μ and σ² are the moving-average mean and the variance obtained from statistics of the output data during the operation of the convolutional layer; γ and β are the scaling coefficient and translation coefficient of the batch normalization layer; and ε is a very small non-zero value introduced for numerical stability, which prevents division by zero. If the convolutional layer is quantized after the batch normalization layer has been folded, no extra floating-point operations are needed during inference.
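A minimal NumPy sketch of formula (1), folding a batch normalization layer into the 2D convolution it follows (the function and variable names are ours, and per-output-channel folding is assumed):

```python
import numpy as np

def fold_bn_into_conv(w, b, gamma, beta, mu, var, eps=1e-5):
    """Fold BN scale/shift and running statistics into the conv weight/offset (formula (1)).

    w:           conv weight, shape (out_ch, in_ch, kh, kw)
    b:           conv offset (bias), shape (out_ch,); use zeros if the conv has no bias
    gamma, beta: BN scaling and translation coefficients, shape (out_ch,)
    mu, var:     BN running mean and variance, shape (out_ch,)
    """
    scale = gamma / np.sqrt(var + eps)            # per-channel factor gamma / sqrt(var + eps)
    w_fold = w * scale.reshape(-1, 1, 1, 1)       # w_fold = gamma * w / sqrt(var + eps)
    b_fold = beta + scale * (b - mu)              # b_fold = beta + gamma * (b - mu) / sqrt(var + eps)
    return w_fold, b_fold
```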
在一些实施例中,批量归一化层折叠策略可以包括但不限于以下之一:In some embodiments, batch normalization layer folding strategies may include, but are not limited to, one of the following:
Strategy 1: see Figure 3B. In this strategy, formula (1) above is used to merge the scaling coefficient and translation coefficient of the batch normalization layer into the weight w_fold and offset b_fold of the convolutional layer Conv 310 on which it depends, and the batch normalization layer is removed completely.
Strategy 2: see Figure 3C. In this strategy, formula (1) above is likewise used to merge the scaling coefficient and translation coefficient of the batch normalization layer into the weight w_fold and offset b_fold of the convolutional layer Conv 310 on which it depends, and the batch normalization layer is removed completely; the running statistics of the convolutional layer are not updated during quantization training, but γ and β can be updated with stochastic gradient descent (SGD). With this strategy, even though the statistics are not updated, the loss landscape can still be smoothed, and skipping the statistics computation significantly reduces the quantization training time.
Strategy 3: see Figure 3D. In this strategy, the running statistics of the convolutional layer can be updated during quantization training, and the convolution is computed twice during quantization training, which incurs additional overhead. The first convolution (corresponding to the convolutional layer Conv 320 in the figure) uses the full-precision weights to compute the mean and variance of the current batch. Then, using formula (1) above, the current-batch mean, the current-batch variance, and the scaling coefficient and translation coefficient of the batch normalization layer are merged into the weight and offset of the convolutional layer Conv 310 on which the batch normalization layer depends, and the batch normalization layer is removed completely.
Strategy 4: see Figure 3E. In this strategy, the convolution is also computed twice during training. The first convolution (corresponding to the convolutional layer Conv 320 in the figure), as in strategy 3, estimates the mean and variance of the current batch. In strategy 4, however, the weights are folded together with the running statistics: using formula (1) above, the variance σ² from the running statistics and the scaling coefficient of the batch normalization layer are merged into the weight w_fold of the convolutional layer Conv 310 on which the batch normalization layer depends, so as to avoid unexpected fluctuations in the current-batch statistics, while the current-batch mean, the current-batch variance, and the scaling coefficient and translation coefficient of the batch normalization layer are merged into the offset of the convolutional layer Conv 310, and the batch normalization layer is removed completely. In addition, a batch variance factor is used to rescale the output after the second convolution.
策略5:参见图3F,该策略中不会采用两次卷积,而是在量化卷积(对应图中的卷积层Conv 310)之后明确添加批量归一化层BN 330。这种策略带来的好处之一是当前批次的统计数据是基于量化的权重计算的。在推理过程中,卷积层输出的重新缩放可以被批量归一化层中和。Strategy 5: See Figure 3F. In this strategy, two convolutions are not used, but a batch normalization layer BN 330 is explicitly added after the quantized convolution (corresponding to the convolutional layer Conv 310 in the figure). One of the benefits brought by this strategy is that the statistics of the current batch are calculated based on quantized weights. During inference, the rescaling of convolutional layer outputs can be neutralized by batch normalization layers.
It should be noted that strategies 2 to 5 above can all be converted into strategy 1. In some implementations, one batch normalization folding strategy can be set from among multiple preset batch normalization folding strategies (such as strategies 1 to 5 above), and at least one batch normalization layer in the first network model is folded based on the set batch normalization layer folding strategy to obtain the folded first network model.
在一些实施例中,上述步骤S306可以包括:In some embodiments, the above step S306 may include:
步骤S321,基于所述推理引擎,从设定的多种批量归一化层折叠策略中确定目标的批量归一化层折叠策略。Step S321 , based on the inference engine, determine a target batch normalization layer folding strategy from various set batch normalization layer folding strategies.
这里,设定的多种批量归一化层折叠策略可以是预先根据实际情况确定的,可以包括但不限于上述策略1至5中的任一种。目标的批量归一化层折叠策略是基于推理引擎从设定的多种批量归一化层折叠策略中确定的。不同的推理引擎可以支持不同的批量归一化层折叠策略,也可以支持相同 的批量归一化层折叠策略。在实施时,可以根据推理引擎对批量归一化层折叠策略的支持能力,从设定的多种批量归一化层折叠策略中确定目标的批量归一化层折叠策略。这样,可以进一步提高量化后的第二网络模型在采用设定的推理引擎的部署硬件上部署后的性能。Here, the set multiple batch normalization layer folding strategies may be determined in advance according to the actual situation, and may include but not limited to any one of the strategies 1 to 5 above. The batch normalization layer folding strategy of the target is determined based on the inference engine from the set multiple batch normalization layer folding strategies. Different inference engines can support different batch normalization layer folding strategies, or they can support the same batch normalization layer folding strategy. During implementation, the target batch normalization layer folding strategy can be determined from multiple set batch normalization layer folding strategies according to the inference engine's ability to support the batch normalization layer folding strategy. In this way, the performance of the quantized second network model after being deployed on the deployment hardware using the set inference engine can be further improved.
In some implementations, the correspondence between inference engines and batch normalization layer folding strategies can be determined in advance based on the ability of different inference engines to support different batch normalization layer folding strategies; by querying this correspondence with the set inference engine, the target batch normalization layer folding strategy can be determined from among the multiple set batch normalization layer folding strategies.
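One way such a correspondence could be stored and queried is sketched below; the assignments are made-up examples, not actual engine capabilities, and the strategy numbers refer to strategies 1 to 5 above:

```python
# Hypothetical lookup table from inference engine to supported BN folding strategy.
BN_FOLD_STRATEGY_FOR_ENGINE = {"TensorRT": 4, "SNPE": 2, "FBGEMM": 5}

def target_bn_fold_strategy(engine: str, default: int = 1) -> int:
    """Pick the target folding strategy for the set inference engine, falling back to strategy 1."""
    return BN_FOLD_STRATEGY_FOR_ENGINE.get(engine, default)
```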
In the embodiments of the present application, the set batch normalization layer folding strategy is obtained; based on this folding strategy, each batch normalization layer in the first network model is folded into the convolutional layer on which it depends to obtain the folded first network model, and each of the processing layers in the folded first network model is quantized according to the quantization parameters to obtain the second network model. In this way, the convolutional layers are quantized after the batch normalization layers have been folded, so no extra floating-point operations are needed during inference, which further accelerates the inference of the quantized second network model.
本申请实施例提供一种模型量化方法,该方法可以由计算机设备的处理器执行。如图4所示,该方法包括:An embodiment of the present application provides a model quantization method, which can be executed by a processor of a computer device. As shown in Figure 4, the method includes:
步骤S401,获取待量化的第一网络模型。Step S401, acquiring the first network model to be quantized.
步骤S402,基于设定的部署配置信息,确定所述第一网络模型中待量化的至少一个处理层以及对每一所述处理层进行量化的量化参数。Step S402, based on the set deployment configuration information, determine at least one processing layer to be quantized in the first network model and a quantization parameter for quantizing each processing layer.
这里,上述步骤S401至步骤S402分别对应于前述步骤S101至步骤S102,在实施时可以参照前述步骤S101至步骤S102的具体实施方式。Here, the above-mentioned steps S401 to S402 correspond to the above-mentioned steps S101 to S102 respectively, and the specific implementation manners of the above-mentioned steps S101 to S102 can be referred to for implementation.
步骤S403,基于设定的量化算法和第一训练数据集,按照所述量化参数,对所述第一网络模型中的每一所述处理层进行量化,得到第二网络模型。Step S403, based on the set quantization algorithm and the first training data set, quantize each of the processing layers in the first network model according to the quantization parameters to obtain a second network model.
这里,用户可以根据实际情况设定任意合适的量化算法,量化算法可以是训练后量化算法,也可以是量化感知训练算法,这里并不限定。Here, the user can set any appropriate quantization algorithm according to the actual situation. The quantization algorithm can be a post-training quantization algorithm or a quantization-aware training algorithm, which is not limited here.
第一训练数据集可以是预先根据第二网络模型的目标任务确定的合适的训练数据集,可以是图像数据集、点云数据集或语音数据等,这里并不限定。The first training data set may be an appropriate training data set determined in advance according to the target task of the second network model, and may be an image data set, a point cloud data set, or voice data, etc., which is not limited here.
在一些实施方式中,量化算法为训练后量化算法,基于所述训练后量化算法,按照所述量化参数,对所述第一网络模型中的每一所述处理层进行量化,得到量化后的第二网络模型;基于所述第一训练数据集,对量化后的第二网络模型中的模型参数进行校准,得到校准后的第二网络模型。In some implementations, the quantization algorithm is a post-training quantization algorithm. Based on the post-training quantization algorithm, each of the processing layers in the first network model is quantized according to the quantization parameters to obtain a quantized The second network model: based on the first training data set, calibrate the model parameters in the quantized second network model to obtain the calibrated second network model.
In some implementations, the quantization algorithm is a quantization-aware training algorithm; based on the quantization-aware training algorithm and the first training data set, the parameters of each processing layer to be quantized in the first network model may undergo at least one round of quantization-aware training according to the quantization parameters, so as to obtain the trained, quantized second network model.
在一些实施方式中,可以在对第一网络模型进行量化之前,对该第一网络模型进行预训练,将预训练后的第一网络模型作为待量化的第一网络模型。In some implementation manners, before quantizing the first network model, the first network model may be pre-trained, and the pre-trained first network model may be used as the first network model to be quantized.
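As a rough, generic sketch of the post-training route described above, calibration data can be run through the model while an observer records the value ranges from which quantization parameters are then derived; the MinMaxObserver class and its interface are our own illustration, not the patent's API:

```python
import numpy as np

class MinMaxObserver:
    """Collects the running min/max of a tensor over calibration batches."""
    def __init__(self):
        self.lo, self.hi = np.inf, -np.inf

    def update(self, x: np.ndarray):
        self.lo = min(self.lo, float(x.min()))
        self.hi = max(self.hi, float(x.max()))

    def scale_zero_point(self, n_min: int, n_max: int):
        scale = (self.hi - self.lo) / (n_max - n_min)
        zero_point = int(round(n_min - self.lo / scale))
        return scale, zero_point

# Calibration loop over a few batches of the first training data set (shapes are illustrative).
observer = MinMaxObserver()
for _ in range(8):
    activations = np.random.randn(32, 64)   # stand-in for one layer's activations
    observer.update(activations)
print(observer.scale_zero_point(0, 255))
```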
本申请实施例中,基于设定的量化算法和第一训练数据集,按照量化参数,对第一网络模型中的每一待量化的处理层进行量化,得到第二网络模型。这样,可以有效复现设定的量化算法。In the embodiment of the present application, based on the set quantization algorithm and the first training data set, each processing layer to be quantized in the first network model is quantized according to quantization parameters to obtain the second network model. In this way, the set quantization algorithm can be effectively reproduced.
在一些实施例中,所述量化算法包括量化感知训练算法,上述步骤S403还可以包括:In some embodiments, the quantization algorithm includes a quantization-aware training algorithm, and the above step S403 may also include:
步骤S411,按照所述量化参数,为所述第一网络模型中的每一所述处理层设置一个伪量化器,得到第三网络模型。Step S411, setting a pseudo-quantizer for each of the processing layers in the first network model according to the quantization parameters to obtain a third network model.
这里,伪量化器可以在量化感知训练过程中进行量化模拟,以方便网络感知量化带来的损失,从而可以为第一网络模型中的每一待量化的处理层设置一个伪量化器。伪量化器的结构可以基于量化参数确定,可以是对称量化器也可以是非对称量化器,可以是均匀量化器也可以是非均匀量化器,可以是基于学习的量化器也可以是基于规则的量化器,还可以是直接使用启发式计算量化步长的量化器,这里并不限定。可以将设置了伪量化器的第一网络模型确定为第三网络模型。Here, the pseudo-quantizer can perform quantization simulation during the quantization-aware training process to facilitate the network to perceive the loss caused by quantization, so that a pseudo-quantizer can be set for each processing layer to be quantized in the first network model. The structure of the pseudo-quantizer can be determined based on quantization parameters, it can be a symmetric quantizer or an asymmetric quantizer, it can be a uniform quantizer or a non-uniform quantizer, it can be a learning-based quantizer or a rule-based quantizer , can also be a quantizer that directly uses heuristics to calculate the quantization step size, which is not limited here. The first network model in which the pseudo-quantizer is set may be determined as the third network model.
步骤S412,基于设定的量化感知训练算法和第一训练数据集,对所述第三网络模型中的每一所述处理层的参数进行至少一次量化感知训练,得到第二网络模型。Step S412, based on the set quantization-aware training algorithm and the first training data set, perform at least one quantization-aware training on the parameters of each processing layer in the third network model to obtain a second network model.
这里,本领域技术人员可以在实施时根据实际情况设定合适的量化感知训练算法,例如LSQ算法、PACT算法、APoT算法、DSQ算法、DoReFa-Net训练算法、LQ-net算法等中的一种或多种,这里并不限定。在一些实施方式中,可以从预设的多种量化感知训练算法中设定一种量化感知训练算法。Here, those skilled in the art can set an appropriate quantization-aware training algorithm according to the actual situation during implementation, such as one of LSQ algorithm, PACT algorithm, APoT algorithm, DSQ algorithm, DoReFa-Net training algorithm, LQ-net algorithm, etc. or more, and it is not limited here. In some embodiments, one quantization-aware training algorithm may be set from multiple preset quantization-aware training algorithms.
在一些实施例中,所述量化参数包括量化尺度的预设精度、量化对称性、量化位宽和量化粒度,所述量化对称性包括对称量化或非对称量化,所述量化粒度包括层级量化或特征级量化。所述伪量化器被配置为执行如下步骤S421至步骤S424:In some embodiments, the quantization parameters include preset precision of quantization scale, quantization symmetry, quantization bit width and quantization granularity, the quantization symmetry includes symmetric quantization or asymmetric quantization, and the quantization granularity includes hierarchical quantization or Feature-level quantization. The pseudo-quantizer is configured to perform the following steps S421 to S424:
步骤S421,基于所述量化位宽确定处理层参数的量化值范围。Step S421: Determine the quantized value range of the processing layer parameter based on the quantized bit width.
这里,量化位宽为对第三网络模型中的每一待量化的处理层的参数进行训练的过程中对浮点型参数进行量化得到的整型数据的比特位宽,如8比特、4比特、3比特、2比特等。量化位宽可以是根据设定的部署配置信息确定的,也可以是用户直接设定的。第三网络模型中不同的处理层可以采用相同的量化位宽,也可以采用不同的量化位宽。Here, the quantization bit width is the bit width of the integer data obtained by quantizing the floating-point parameters during the training process of the parameters of each processing layer to be quantized in the third network model, such as 8 bits, 4 bits , 3 bits, 2 bits, etc. The quantized bit width can be determined according to the set deployment configuration information, or can be set directly by the user. Different processing layers in the third network model may use the same quantization bit width or different quantization bit widths.
The processing layer parameters may be one or more of the parameters to be quantized of a processing layer, such as its weight values, activation values, input data, or output data; the quantized value range of a processing layer parameter is the range of values the parameter can take after quantization. In implementation, the quantized value range of a processing layer parameter can be determined based on the quantization bit width. For example, the processing layer parameters may include weight values and activation values; with a quantization bit width of k, the weight values can be quantized to signed integer values in the range [-2^(k-1), 2^(k-1) - 1], and the activation values can be quantized to unsigned integer values in the range [0, 2^k - 1]. Therefore, the quantized value range of the weight values may be [-2^(k-1), 2^(k-1) - 1], and the quantized value range of the activation values may be [0, 2^k - 1].
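A small helper reflecting the ranges just described (a sketch; which tensors are treated as signed is an assumption that depends on the deployment configuration):

```python
def quant_value_range(bit: int, signed: bool):
    """Integer range produced by a k-bit uniform quantizer.

    Signed   (e.g. weights):                 [-2**(k-1), 2**(k-1) - 1]
    Unsigned (e.g. activations after ReLU):  [0, 2**k - 1]
    """
    if signed:
        return -(2 ** (bit - 1)), 2 ** (bit - 1) - 1
    return 0, 2 ** bit - 1

print(quant_value_range(8, signed=True))    # (-128, 127)
print(quant_value_range(8, signed=False))   # (0, 255)
```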
步骤S422,确定满足所述预设精度的量化尺度和满足所述量化对称性的量化零点。Step S422, determining a quantization scale that satisfies the preset precision and a quantization zero that satisfies the quantization symmetry.
这里,量化尺度为量化过程中对待量化的全精度数值进行缩放的系数。量化尺度的预设精度可以包括但不限于全精度、2的次方精度等中的一种。Here, the quantization scale is a coefficient for scaling the full-precision value to be quantized during the quantization process. The preset precision of the quantization scale may include but not limited to one of full precision, power of 2 precision, and the like.
量化对称性用于表征待量化的全精度数值的取值范围是否关于0对称。在均匀量化中,全精度数值的零点会被量化为一个整型的数值,该数值称为量化零点。在量化零点为0的情况下,表示待量化的全精度数值的取值范围是关于0对称,也即该均匀量化为对称量化;在量化零点不为0的情况下,表示待量化的全精度数值的取值范围关于0不对称,也即该均匀量化为非对称量化。Quantization symmetry is used to characterize whether the value range of the full-precision value to be quantized is symmetrical about 0. In uniform quantization, the zero point of the full-precision value is quantized to an integer value, which is called the quantized zero point. When the quantization zero point is 0, it means that the value range of the full-precision value to be quantized is symmetrical about 0, that is, the uniform quantization is symmetrical quantization; when the quantization zero point is not 0, it means the full-precision value to be quantized The range of values is asymmetric about 0, that is, the uniform quantization is asymmetric quantization.
在一些实施方式中,可以根据实际情况为伪量化器设置一个满足预设精度的固定的量化尺度,以及设置一个满足所述量化对称性的固定的量化零点。例如,在量化尺度的预设精度为全精度的情况下,可以为伪量化器设置一个合适的全精度的数值作为量化尺度。在量化对称性为对称的情况下,可以将量化零点设置为0;在量化对称性为非对称的情况下,可以将量化零点设置为一个合适的非零数,如1、-2等。In some implementation manners, a fixed quantization scale that satisfies the preset accuracy and a fixed quantization zero point that satisfies the quantization symmetry may be set for the pseudo quantizer according to actual conditions. For example, when the preset precision of the quantization scale is full precision, an appropriate full-precision numerical value may be set as the quantization scale for the pseudo quantizer. When the quantization symmetry is symmetrical, the quantization zero point can be set to 0; when the quantization symmetry is asymmetrical, the quantization zero point can be set to an appropriate non-zero number, such as 1, -2, and so on.
In some implementations, the value range taken by the full-precision values to be quantized can be collected statistically while the model runs; based on this value range and the corresponding quantized value range, a quantization scale that satisfies the preset precision and a quantization zero point that satisfies the quantization symmetry can be determined. In some implementations, the quantization scale and quantization zero point can also be adjusted continuously during model training.
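A sketch of deriving a quantization scale and zero point from an observed floating-point range, for both symmetric and asymmetric settings (our own helper, not the patent's exact procedure):

```python
def compute_scale_zero_point(x_min, x_max, n_min, n_max, symmetric):
    """Map an observed float range [x_min, x_max] onto the integer range [n_min, n_max]."""
    if symmetric:
        # The float range is treated as symmetric about zero, so the zero point is 0.
        bound = max(abs(x_min), abs(x_max))
        scale = bound / max(abs(n_min), abs(n_max))
        zero_point = 0
    else:
        scale = (x_max - x_min) / (n_max - n_min)
        zero_point = int(round(n_min - x_min / scale))
    return scale, zero_point

print(compute_scale_zero_point(-1.2, 0.8, -128, 127, symmetric=True))
print(compute_scale_zero_point(0.0, 6.0, 0, 255, symmetric=False))
```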
步骤S423,基于所述量化粒度,在所述量化值范围内,采用量化尺度和量化零点对待量化的处理层参数进行均匀量化处理,得到量化后的所述处理层参数。Step S423, based on the quantization granularity, within the quantization value range, uniform quantization is performed on the processing layer parameters to be quantized by using the quantization scale and the quantization zero point, to obtain the quantized processing layer parameters.
Here, the quantization granularity refers to the granularity at which parameters such as the quantized value range, quantization scale, and quantization zero point are shared within the quantized network model, and may include layer-level quantization (i.e., tensor-level quantization) or feature-level quantization (i.e., channel-level quantization), among others. Layer-level quantization means that all processing layer parameters to be quantized within the same processing layer share the same quantized value range, quantization scale, quantization zero point, and so on; feature-level quantization means that the processing layer parameters to be quantized that correspond to different features within the same processing layer use different quantized value ranges, quantization scales, quantization zero points, and so on.
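The difference between the two granularities can be seen in how many scales are kept for a convolution weight tensor; a NumPy sketch under our own naming:

```python
import numpy as np

w = np.random.randn(16, 3, 3, 3)          # (out_channels, in_channels, kh, kw)

# Layer-level (per-tensor) quantization: one scale shared by the whole tensor.
per_tensor_scale = np.abs(w).max() / 127.0

# Feature-level (per-channel) quantization: one scale per output channel.
per_channel_scale = np.abs(w).reshape(16, -1).max(axis=1) / 127.0   # shape (16,)
```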
在一些实施方式中,在量化值范围为[N min,N max],其中,N min为量化值范围中的最小量化值,N max为量化值范围中的最大量化值,量化尺度为s,量化零点为z的情况下,可以采用如下公式(2)所示的方式对待量化的处理层参数进行均匀量化处理: In some implementations, the range of quantized values is [N min , N max ], where N min is the smallest quantized value in the range of quantized values, N max is the largest quantized value in the range of quantized values, and the quantized scale is s, In the case where the quantization zero point is z, the processing layer parameters to be quantized can be uniformly quantized in the manner shown in the following formula (2):
    w_q = clip( round(w / s) + z, N_min, N_max )        (2)

where w denotes the floating-point value of the processing layer parameter and w_q denotes its quantized value; clip(x, N_min, N_max) limits x to the range from N_min to N_max, that is, the function returns N_max when x is greater than N_max, returns N_min when x is less than N_min, and returns x itself when x is not greater than N_max and not less than N_min; and round(·) rounds its input to the nearest integer.
步骤S424,基于所述量化尺度和所述量化零点,对量化后的所述处理层参数进行反均匀量化处理,得到反量化后的所述处理层参数。Step S424, based on the quantization scale and the quantization zero point, perform inverse uniform quantization on the quantized processing layer parameters to obtain the dequantized processing layer parameters.
这里,在一些实施方式中,在量化尺度为s,量化零点为z的情况下,可以采用如下公式(3)所示的方式对量化后的处理层参数进行反均匀量化处理:Here, in some implementations, in the case where the quantization scale is s and the quantization zero point is z, the quantized processing layer parameters can be deuniformly quantized in the manner shown in the following formula (3):
    w_r = s · ( w_q − z )        (3)

where w_q denotes the quantized value of the processing layer parameter obtained with formula (2), and w_r denotes the processing layer parameter after de-quantization.
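Taken together, formulas (2) and (3) describe the quantize/de-quantize round trip that a pseudo-quantizer simulates; a minimal NumPy sketch (function and variable names are ours, not the patent's):

```python
import numpy as np

def fake_quantize(w, scale, zero_point, n_min, n_max):
    """Uniform quantization (formula (2)) followed by de-quantization (formula (3))."""
    w_q = np.clip(np.round(w / scale) + zero_point, n_min, n_max)   # integer-valued array
    return scale * (w_q - zero_point)                               # back to floating point

x = np.array([-1.3, -0.02, 0.4, 2.7])
# With scale 0.02 and an 8-bit signed range, 2.7 is clipped to 127 * 0.02 = 2.54.
print(fake_quantize(x, scale=0.02, zero_point=0, n_min=-128, n_max=127))
```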
In the above embodiments, the quantization parameters for quantizing each processing layer in the first network model can be determined based on the set deployment configuration information; the quantization parameters include the preset precision of the quantization scale, the quantization symmetry, the quantization bit width, and the quantization granularity, where the quantization symmetry includes symmetric or asymmetric quantization and the quantization granularity includes layer-level or feature-level quantization. In this way, a hardware-aware quantizer can be used for model quantization according to the configuration of the deployment hardware, so that the quantized second network model better meets the deployment requirements of that hardware. In addition, multiple types of quantizers can be supported, so that a deployable second network model can be quantized for more types of deployment hardware.
在一些实施例中,上述步骤S403可以包括:In some embodiments, the above step S403 may include:
步骤S431,确定预设的与所述第一网络模型采用的神经网络结构对应的训练超参数;其中,对于预设的多种部署配置信息中的每一所述部署配置信息,所述训练超参数是相同的。Step S431, determining a preset training hyperparameter corresponding to the neural network structure adopted by the first network model; wherein, for each of the deployment configuration information in the preset multiple deployment configuration information, the training hyperparameter The parameters are the same.
这里,对于采用相同的神经网络结构的网络模型,采用统一的训练超参数,训练超参数可以包括但不限于微调时长(代数)、学习率、参数优化算法、权重衰减等中的一种或多种。预设的多种部署配置信息可以包括预先设定的至少两个任意合适的部署配置信息,这里并不限定。对于不同的部署配置信息,在对采用相同的神经网络结构的网络模型进行量化训练的过程中,所采用的训练超参数均相同。在采用不同的量化算法对采用相同的神经网络结构的网络模型进行量化训练的过程中,所采用的训练超参数也均相同。Here, for network models using the same neural network structure, uniform training hyperparameters are adopted, and training hyperparameters may include but not limited to one or more of fine-tuning duration (algebra), learning rate, parameter optimization algorithm, weight decay, etc. kind. The preset multiple deployment configuration information may include at least two preset deployment configuration information, which is not limited here. For different deployment configuration information, the same training hyperparameters are used in the process of quantitative training for network models using the same neural network structure. In the process of using different quantization algorithms to perform quantization training on network models using the same neural network structure, the training hyperparameters used are also the same.
在实施时,可以预先通过实验或分析为至少一种神经网络结构确定一组合适的训练超参数,基于第一网络模型采用的神经网络结构,可以确定预设的与该神经网络结构对应的训练超参数。本领域技术人员可以根据实际情况为至少一种神经网络结构确定合适的训练超参数,本申请实施例对此并不限定。During implementation, a set of suitable training hyperparameters for at least one neural network structure can be determined in advance through experiments or analysis. Based on the neural network structure adopted by the first network model, the preset training corresponding to the neural network structure can be determined. hyperparameters. Those skilled in the art may determine appropriate training hyperparameters for at least one neural network structure according to actual conditions, which is not limited in this embodiment of the present application.
例如,如下表1提供了一种为神经网络结构ResNet-18、ResNet-50、EffNet、MbV2、RegNet分别预设的训练超参数的示例,其中,对于采用ResNet-18的第一网络模型,预设的学习率为0.004、权重衰减为10 -4、批大小为64、图形处理器(Graphics processing unit,GPU)数量为8;对于采用ResNet-50的第一网络模型,预设的学习率为0.004、权重衰减为10 -4、批大小为16、GPU数量为16;对于采用EffNet和MbV2的第一网络模型可以预设相同的训练超参数,预设的学习率为0.01、权重衰减为10 -5*、批大小为32、GPU数量为16;对于采用RegNet的第一网络模型,预设的学习率为0.004、权重衰减为4×10 -5,批大小为32、GPU数量为16。其中,*代表批量归一化层的权重衰减为0。 For example, the following Table 1 provides an example of training hyperparameters preset for the neural network structures ResNet-18, ResNet-50, EffNet, MbV2, and RegNet, wherein, for the first network model using ResNet-18, the preset The set learning rate is 0.004, the weight decay is 10 -4 , the batch size is 64, and the number of graphics processors (Graphics processing unit, GPU) is 8; for the first network model using ResNet-50, the preset learning rate is 0.004, the weight decay is 10 -4 , the batch size is 16, and the number of GPUs is 16; for the first network model using EffNet and MbV2, the same training hyperparameters can be preset, the preset learning rate is 0.01, and the weight decay is 10 -5 *, the batch size is 32, and the number of GPUs is 16; for the first network model using RegNet, the preset learning rate is 0.004, the weight decay is 4×10 -5 , the batch size is 32, and the number of GPUs is 16. Among them, * represents that the weight decay of the batch normalization layer is 0.
Table 1  Example training hyperparameters for different neural network structures

Neural network structure | Learning rate | Weight decay | Batch size | Number of GPUs
ResNet-18                | 0.004         | 10^-4        | 64         | 8
ResNet-50                | 0.004         | 10^-4        | 16         | 16
EffNet                   | 0.01          | 10^-5 *      | 32         | 16
MbV2                     | 0.01          | 10^-5 *      | 32         | 16
RegNet                   | 0.004         | 4×10^-5      | 32         | 16
In some embodiments, unified data preprocessing can be applied to the training data, including random resized cropping to 224 resolution, random horizontal flipping, and color jitter on the images, for example a brightness offset of 0.2, a contrast offset of 0.2, a saturation offset of 0.2, and a hue offset of 0.1. The test data is center-cropped to 224 resolution, and label smoothing of 0.1 is used to add regularization. All models are trained for 100 epochs (one epoch meaning that every training sample has gone through one forward and one backward pass), with a linear warm-up in the first epoch. The learning rate is decayed with a cosine annealing schedule. Training uses the SGD optimizer and is updated with Nesterov momentum, with a momentum parameter of 0.9.
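Below is a sketch of how the shared recipe of Table 1 and the preprocessing just described could be assembled; the values come from the table and the paragraph above, while the code itself, including the assumption of a recent PyTorch/torchvision with label-smoothing support, is illustrative rather than the disclosed implementation:

```python
import torch
from torchvision import transforms

# Shared per-architecture hyperparameters (Table 1); "*" means no weight decay on BN parameters.
HYPERPARAMS = {
    "ResNet-18": dict(lr=0.004, weight_decay=1e-4, batch_size=64, gpus=8),
    "ResNet-50": dict(lr=0.004, weight_decay=1e-4, batch_size=16, gpus=16),
    "EffNet":    dict(lr=0.01,  weight_decay=1e-5, batch_size=32, gpus=16),  # * BN decay = 0
    "MbV2":      dict(lr=0.01,  weight_decay=1e-5, batch_size=32, gpus=16),  # * BN decay = 0
    "RegNet":    dict(lr=0.004, weight_decay=4e-5, batch_size=32, gpus=16),
}

train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2, hue=0.1),
    transforms.ToTensor(),
])
test_transform = transforms.Compose([
    transforms.CenterCrop(224),
    transforms.ToTensor(),
])

def build_training(model, arch="ResNet-18", epochs=100):
    """Create the loss, optimizer and LR schedule shared across quantization algorithms."""
    hp = HYPERPARAMS[arch]
    criterion = torch.nn.CrossEntropyLoss(label_smoothing=0.1)
    optimizer = torch.optim.SGD(model.parameters(), lr=hp["lr"], momentum=0.9,
                                nesterov=True, weight_decay=hp["weight_decay"])
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)
    return criterion, optimizer, scheduler
```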
Step S432: using the set first training data set, based on the quantization algorithm and the training hyperparameters, quantize each of the processing layers in the first network model according to the quantization parameters to obtain a second quantized network model.
In the above embodiment, unified training hyperparameters are used for network models that adopt the same neural network structure. In this way, model training tricks can be shared among multiple first network models with the same neural network structure and among multiple quantization algorithms, so that different quantization algorithms can be reproduced better and the accuracy of the quantization algorithms can be improved.
An embodiment of the present application provides a model quantization method, which can be executed by a processor of a computer device. As shown in Fig. 5, the method includes:
Step S501: based on at least one kind of deployment configuration information, adjust the processing layers in a set neural network structure to obtain at least one adjusted neural network structure.
Here, the set neural network structure may be preset by the user according to the actual situation, or may be a default, which is not limited here.
The at least one kind of deployment configuration information may be one or more kinds of deployment configuration information preset by the user or provided by default. Different deployment hardware differs in its ability to support quantization of different processing layers in a neural network structure. During implementation, for each kind of deployment configuration information, at least one processing layer in the set neural network structure may be adjusted in a suitable way according to the actual quantization support of the corresponding deployment hardware for different processing layers, so as to obtain an adjusted neural network structure. For example, for a deployment hardware type with limited quantization support, where the set neural network structure is EfficientNet, the squeeze-and-excitation blocks in the network structure can be removed and the swish activation layers can be replaced with ReLU6 (rectified linear unit) layers, yielding the lightweight (Lite) version of EfficientNet, so that better integer support can be obtained on the deployment hardware.
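As an illustration of this kind of structural adjustment, the sketch below recursively replaces swish (SiLU) activations with ReLU6 in a given model; the helper name is hypothetical, and removing the squeeze-and-excitation blocks would additionally require architecture-specific code:

```python
import torch.nn as nn

def replace_swish_with_relu6(module: nn.Module) -> None:
    """Recursively swap SiLU (swish) activations for ReLU6, in place."""
    for name, child in module.named_children():
        if isinstance(child, nn.SiLU):
            # ReLU6 is better supported by integer-only deployment hardware.
            setattr(module, name, nn.ReLU6(inplace=True))
        else:
            replace_swish_with_relu6(child)
```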
Step S502: create at least one first network model based on the at least one adjusted neural network structure.
Here, a first network model may be created for each neural network structure in the at least one adjusted neural network structure. During implementation, those skilled in the art can create a suitable first network model based on the adjusted neural network structure according to actual business requirements, which is not limited here.
Step S503: based on preset model parameters corresponding to the set neural network structure, initialize the parameters of the at least one first network model to obtain at least one initialized first network model.
Here, for first network models that adopt the same neural network structure, or that adopt a neural network structure adjusted from the same neural network structure, unified preset model parameters may be used to initialize the parameters of the first network models, obtaining at least one initialized first network model. The preset model parameters may include preset initial values of the parameters in the first network model, or may include trained model parameters obtained after pre-training the first network model, which is not limited here.
Step S504: based on the set deployment configuration information, determine a first network model to be quantized from the at least one initialized first network model.
Here, each kind of deployment configuration information may correspond to one initialized first network model. Based on the set deployment configuration information, the initialized first network model corresponding to that deployment configuration information can be determined, and this initialized first network model is determined as the first network model to be quantized.
Step S505: based on the set deployment configuration information, determine at least one processing layer to be quantized in the first network model and a quantization parameter for quantizing each of the processing layers.
Step S506: quantize each of the processing layers in the first network model according to the quantization parameter to obtain a second network model.
Here, steps S505 to S506 above correspond to steps S102 to S103 described above, respectively, and the specific implementations of steps S102 to S103 can be referred to when implementing them.
In the embodiment of the present application, based on at least one kind of deployment configuration information, the processing layers in the set neural network structure are adjusted to obtain at least one adjusted neural network structure; at least one first network model is created based on the at least one adjusted neural network structure; the parameters of the at least one first network model are initialized based on the preset model parameters corresponding to the set neural network structure to obtain at least one initialized first network model; and the first network model to be quantized is determined from the at least one initialized first network model based on the set deployment configuration information. In this way, on the one hand, the first network model to be quantized is created from a neural network structure obtained by adjusting the processing layers in the set neural network structure based on the set deployment configuration information, so that the second network model obtained after quantization can receive better quantization support after being deployed on deployment hardware that uses the set deployment configuration information. On the other hand, by initializing first network models that adopt the same neural network structure with unified preset model parameters, the initialization inconsistency caused by using different initialization methods can be reduced, thereby improving the comparability of different quantization algorithms when quantizing different neural network models with the same network structure.
In some embodiments, before step S503 above, the method further includes:
Step S511: obtain a preset pre-trained model corresponding to the neural network structure, where the structure of the pre-trained model before the output layer is the same as the neural network structure.
Here, the pre-trained model may be any suitable neural network model created in advance based on the neural network structure.
Step S512: train the parameters of the pre-trained model with a set second training data set to obtain the trained pre-trained model.
Here, the second training data set may be a suitable training data set determined in advance according to the target task of the pre-trained model, and may be an image data set, a point cloud data set, speech data, or the like, which is not limited here.
Step S513: determine the trained parameters of the pre-trained model as the preset model parameters.
Here, for first network models that adopt the same neural network structure, a unified pre-trained model can be used to pre-train the parameters, and the parameters of the trained pre-trained model are used as the preset model parameters for initializing the parameters of the first network models. In this way, in the process of quantizing the first network model, only simple calibration or fine-tuning of the quantized parameters is needed to obtain a quantized second network model with good performance, which improves the efficiency of model quantization and further improves the accuracy of the quantized second network model.
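A minimal sketch of initializing every first network model that shares a structure from one unified pre-trained checkpoint; the checkpoint path and the use of ResNet-18 are illustrative assumptions:

```python
import torch
import torchvision.models as models

# Hypothetical location of the unified pre-trained checkpoint for this structure.
PRETRAINED_PATH = "checkpoints/resnet18_pretrained.pth"

def build_initialized_model() -> torch.nn.Module:
    """Create a first network model and initialize it from the unified checkpoint."""
    model = models.resnet18()
    state_dict = torch.load(PRETRAINED_PATH, map_location="cpu")
    # strict=False tolerates layers that were adjusted for a specific backend.
    model.load_state_dict(state_dict, strict=False)
    return model
```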
In some embodiments, step S501 above may include:
Step S521: determine a target neural network structure from multiple preset neural network structures.
Here, multiple optional neural network structures may be preset, and the user can determine a suitable target neural network structure from the multiple preset neural network structures according to actual business requirements, which is not limited here.
Step S522: based on at least one kind of deployment configuration information, adjust the processing layers in the target neural network structure to obtain at least one adjusted neural network structure.
In the above embodiment, multiple optional neural network structures can be provided for creating the initial first network model, so that different business requirements of users can be better supported.
In the related art, different quantization algorithms oriented toward hardware deployment exhibit a large accuracy gap when deployed and run on the target hardware.
An embodiment of the present application provides a reproducible and deployable model quantization algorithm library (hereinafter referred to as MQBench). The library can be used to evaluate and analyze the reproducibility and deployability of model quantization algorithms, provides multiple selectable deployment hardware types for deploying quantized models in practical applications, including central processing units (CPUs), GPUs, application-specific integrated circuits (ASICs), and digital signal processors (DSPs), and evaluates a large number of state-of-the-art quantization algorithms under a unified training configuration. Users can use MQBench to quantize a trained full-precision network model in tasks such as image classification and object detection, obtaining a quantized network model that can be deployed and run on the target hardware. When using MQBench for model quantization, the user only needs to provide the corresponding training data set, the deployment configuration information of the target hardware (such as the deployment hardware type, the inference engine used by that hardware type, and the quantization bit width corresponding to that hardware type), and the configuration information of the quantization algorithm (such as the quantization algorithm, the fine-tuning duration, the number of fine-tuning training epochs, and the training hyperparameters).
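As a rough illustration, these two kinds of configuration information might be expressed as dictionaries of the following shape; every key and value below is an assumption made for the sake of the example rather than MQBench's actual configuration schema:

```python
# Illustrative deployment configuration of the target hardware (assumed keys).
backend_params = {
    "hardware": "GPU",               # deployment hardware type
    "inference_engine": "TensorRT",  # inference engine used by that hardware type
    "weight_bits": 8,                # quantization bit width for weights
    "activation_bits": 8,            # quantization bit width for activations
}

# Illustrative configuration of the quantization algorithm (assumed keys).
qparams = {
    "algorithm": "LSQ",      # quantization algorithm to apply
    "finetune_epochs": 100,  # fine-tuning duration / number of training epochs
    "lr": 0.004,             # training hyperparameters, cf. Table 1
    "weight_decay": 1e-4,
}
```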
In some implementations, MQBench can be implemented with the PyTorch deep learning engine and supports the torch.fx (also called FX) feature. FX contains a symbolic tracer, an intermediate representation, and Python code generation, which allows deeper metaprogramming. In the embodiments of the present application, quantization algorithms and hardware-aware configurations can be implemented in MQBench, and a full-precision network model can be converted into a quantized network model through a single application programming interface (API) call. For example, the API can be called with code such as the following to convert a trained full-precision network model into a quantized network model:
1) Import the torch.quantization.quantize_fx package:
import torch.quantization.quantize_fx as quantize_fx
2) Create the full-precision network model model based on the set network structure self.config.model, and load the pre-trained parameters to initialize the full-precision network model:
model = model_entry(self.config.model, pretrained=True)
3) Obtain the configuration information qparams of the quantization algorithm and the deployment configuration information backend_params of the target hardware:
model_qconfig = get_qconfig(**qparams, **backend_params)
4) Obtain the set batch normalization layer folding strategy foldbn_strategy:
foldbn_config = get_foldbn_config(foldbn_strategy)
5) Call the model quantization API quantize_fx.prepare_qat_fx to quantize the full-precision network model model and obtain the quantized network model qModel:
qModel = quantize_fx.prepare_qat_fx(model, {"": model_qconfig}, foldbn_config)
In some implementations, after quantize_fx.prepare_qat_fx is called, the quantized network model qModel can be optimized by fine-tuning, calibration, and the like.
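A minimal sketch of such a fine-tuning step on the prepared model qModel, assuming a standard classification data loader; the final conversion call follows the usual torch.fx quantization workflow and may be arranged differently in practice:

```python
import torch
import torch.quantization.quantize_fx as quantize_fx

def finetune_and_convert(qModel, train_loader, epochs=1, lr=0.004):
    """Fine-tune the fake-quantized model, then convert it for deployment."""
    criterion = torch.nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(qModel.parameters(), lr=lr,
                                momentum=0.9, nesterov=True)
    qModel.train()
    for _ in range(epochs):
        for images, targets in train_loader:
            optimizer.zero_grad()
            loss = criterion(qModel(images), targets)
            loss.backward()
            optimizer.step()
    qModel.eval()
    # Convert the fine-tuned fake-quantized model into the deployable quantized model.
    return quantize_fx.convert_fx(qModel)
```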
MQBench acts as a bridge connecting quantization algorithms and deployment hardware. Fig. 6 is a schematic diagram of an application scenario of MQBench provided by an embodiment of the present application. As shown in Fig. 6, MQBench 60 mainly provides a reproducibility capability 61 for quantization algorithms and a deployability capability 62 for hardware platforms. The reproducibility capability 61 can support multiple quantization algorithms 70, including quantization-aware training algorithms 71 and post-training quantization algorithms 72, and the deployability capability 62 can support the deployment of quantization algorithms on different deployment hardware 80, including a CPU 81, a GPU 82, an ASIC 83, and a DSP 84. MQBench is described below from the two aspects of reproducibility and deployability of model quantization.
1) Reproducibility: the reproducibility of model quantization in MQBench is mainly reflected in the following dimensions:
Hardware-aware quantizer: for different hardware (such as CPUs, GPUs, ASICs, and DSPs), MQBench provides matching support for the computation graph patterns of the inference engine libraries used by the hardware (such as TVM, TensorRT, ACL, and SNPE), and can automatically match the insertion positions of quantization nodes in the computation graph based on the set inference engine library, where one hardware type corresponds to one inference engine and different hardware types may correspond to the same inference engine. MQBench supports five general-purpose software libraries (i.e., inference engines), including TensorRT for graphics processing unit (GPU) inference, ACL for application-specific integrated circuit (ASIC) inference, SNPE for mobile digital signal processor (DSP) inference, TVM for ARM central processing unit (CPU) inference, and FBGEMM for X86 server-side CPU inference. Each inference engine corresponds to one quantizer. Users can select a suitable inference engine from these five inference engines for model deployment according to the actual application scenario, and MQBench can determine, based on the selected inference engine, at least one processing layer to be quantized in the full-precision network model and the corresponding hardware-aware quantizer.
Quantization algorithm: MQBench reproduces multiple state-of-the-art (SOTA) quantization algorithms, including the learning-based algorithms LSQ, APoT, Quantization Interval Learning (QIL), and PACT, as well as the rule-based algorithms DSQ, LQ-Net, and DoReFa. Users can select a suitable quantization algorithm from the multiple quantization algorithms reproduced by MQBench according to the actual application scenario, and MQBench can quantize the full-precision network model to be quantized according to the selected quantization algorithm.
Neural network structure: the neural network structures supported by MQBench include ResNet-18, ResNet-50, MobileNetV2, Efficient-Net (using the Lite version of Efficient-Net and replacing the swish activation with ReLU6 to provide better integer support on hardware), and the neural network structure with group convolution, RegNetX-600MF.
Quantization bit width: MQBench supports multiple quantization bit widths such as 8-bit, 4-bit, 3-bit, and 2-bit. In some implementations, a quantization bit width of 8 bits may be used for post-training quantization algorithms and a quantization bit width of 4 bits may be used for quantization-aware training algorithms.
Training settings: in MQBench, parameter training is performed by fine-tuning for all quantization algorithms, and full-precision network models adopting the same neural network structure are all initialized with a unified pre-trained model, which reduces the inconsistency introduced in the initialization stage.
2) Deployability: MQBench makes the following optimizations regarding the deployability of model quantization:
BN layer folding: MQBench supports five BN (batch normalization) layer folding strategies and can fold the parameters of a BN layer into the corresponding convolutional layer according to the configured BN layer folding strategy. Users can select a suitable strategy from these five BN layer folding strategies according to the actual application scenario; a folding sketch is given after this list.
Computation graph with block structures: model quantization schemes in the related art only consider quantizing the inputs and weights of convolutional or fully connected layers. However, a neural network architecture may also include other operations, such as the element-wise addition in the ResNet architecture and the concatenation in the InceptionV3 architecture. MQBench considers different computation graph optimization levels for different inference engines and automatically matches the insertion positions of quantization nodes in the computation graph based on the set inference engine, so as to adapt to different computation graph optimization levels and build the computation graph of the corresponding quantized neural network.
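As a sketch of the arithmetic behind the BN layer folding mentioned above (folding the running statistics of a batch normalization layer into the convolution it depends on), assuming standard PyTorch modules; the fake quantization of the folded weight and the other folding strategies are omitted:

```python
import torch
import torch.nn as nn

@torch.no_grad()
def fold_bn_into_conv(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> None:
    """Fold BN running statistics into the conv weight and bias, in place."""
    std = torch.sqrt(bn.running_var + bn.eps)
    scale = bn.weight / std  # gamma / sqrt(running_var + eps)
    conv.weight.mul_(scale.reshape(-1, 1, 1, 1))  # merge the scale into the weights
    bias = conv.bias if conv.bias is not None else torch.zeros_like(bn.running_mean)
    # Merge the shift into the offset (bias) of the convolution.
    conv.bias = nn.Parameter(bn.bias + scale * (bias - bn.running_mean))
```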
Compared with model quantization open-source libraries in the related art, the reproducible and deployable model quantization algorithm library MQBench provided by the embodiments of the present application has at least the following improvements:
1) For network models adopting the same neural network structure, unified training hyperparameters are used for fine-tuning, so that model training tricks can be shared among multiple first network models with the same neural network structure and among multiple quantization algorithms, improving the accuracy of the quantization algorithms;
2) Full-precision network models adopting the same neural network structure are all initialized with a unified pre-trained model, which can reduce the inconsistency introduced in the initialization stage;
3) Multiple configurable neural network structures are supported;
4) Multiple configurable deployment hardware types and/or inference engines are supported;
5) A hardware-aware quantizer is used, which can improve the deployability of the quantized network model and its accuracy in actual deployment scenarios;
6) Multiple configurable BN layer folding strategies are supported;
7) Different computation graph optimization levels are considered for different inference engines.
Fig. 7 is a schematic diagram of the composition structure of a model quantization apparatus provided by an embodiment of the present application. As shown in Fig. 7, the model quantization apparatus 700 includes a first acquisition part 710, a first determination part 720, and a quantization part 730, where:
the first acquisition part 710 is configured to acquire a first network model to be quantized;
the first determination part 720 is configured to determine, based on set deployment configuration information, at least one processing layer to be quantized in the first network model and a quantization parameter for quantizing each of the processing layers;
the quantization part 730 is configured to quantize each of the processing layers in the first network model according to the quantization parameter to obtain a second network model.
In some embodiments, the first network model includes at least one block structure, and each block structure includes at least one processing layer; the first determination part is further configured to: determine, based on the set deployment configuration information, at least one processing layer to be quantized in each block structure in the first network model and a quantization parameter for quantizing each of the processing layers.
In some embodiments, the deployment configuration information includes an inference engine used by a deployment hardware type; the first determination part is further configured to: determine, based on the inference engine, a processing layer type to be quantized; and determine at least one processing layer in the first network model that matches the processing layer type as a processing layer to be quantized.
In some embodiments, the processing layer type includes a convolutional layer and a batch normalization layer; the first determination part is further configured to: determine at least one batch normalization layer in the first network model and the convolutional layer on which each batch normalization layer depends as processing layers to be quantized; obtain a set batch normalization layer folding strategy; and, based on the batch normalization layer folding strategy, fold each batch normalization layer in the first network model into the convolutional layer on which the batch normalization layer depends to obtain the folded first network model; the quantization part is further configured to: quantize each of the processing layers in the folded first network model according to the quantization parameter to obtain a second network model.
In some embodiments, the batch normalization layer folding strategy includes a removal state of the batch normalization layer, a coefficient update algorithm, statistical parameters to be merged into weights, and statistical parameters to be merged into offsets; the statistical parameters to be merged into weights include running statistics of the convolutional layer on which the batch normalization layer depends or statistics of the current batch, and the statistical parameters to be merged into offsets include running statistics of the convolutional layer on which the batch normalization layer depends or statistics of the current batch; the first determination part is further configured to: determine a scaling coefficient and a translation coefficient of each batch normalization layer in at least one batch normalization layer in the first network model; update the scaling coefficient and the translation coefficient of each batch normalization layer based on the coefficient update algorithm to obtain an updated scaling coefficient and an updated translation coefficient of each batch normalization layer; for each batch normalization layer, obtain the statistical parameters to be merged into weights and the statistical parameters to be merged into offsets of the batch normalization layer, merge the updated scaling coefficient of the batch normalization layer and the statistical parameters to be merged into weights into the weights of the convolutional layer on which the batch normalization layer depends, and merge the updated scaling coefficient and translation coefficient of the batch normalization layer and the statistical parameters to be merged into offsets into the offset of the convolutional layer; and, when the removal state of the batch normalization layer is removal, remove each batch normalization layer from the first network model.
In some embodiments, the first determination part is further configured to: determine, based on the inference engine, a target batch normalization layer folding strategy from multiple set batch normalization layer folding strategies.
In some embodiments, the quantization part is further configured to: quantize each of the processing layers in the first network model according to the quantization parameters, based on a set quantization algorithm and a first training data set, to obtain a second network model.
In some embodiments, the quantization parameters include a preset precision of a quantization scale, quantization symmetry, a quantization bit width, and a quantization granularity, the quantization symmetry includes symmetric quantization or asymmetric quantization, the quantization granularity includes layer-level quantization or feature-level quantization, and the quantization algorithm includes a quantization-aware training algorithm; the quantization part is further configured to: set one pseudo-quantizer for each of the processing layers in the first network model according to the quantization parameters to obtain a third network model, where the pseudo-quantizer is configured to: determine a quantization value range of the processing layer parameters based on the quantization bit width; determine a quantization scale satisfying the preset precision and a quantization zero point satisfying the quantization symmetry; based on the quantization granularity and within the quantization value range, perform uniform quantization on the processing layer parameters to be quantized using the quantization scale and the quantization zero point to obtain quantized processing layer parameters; and perform inverse uniform quantization on the quantized processing layer parameters based on the quantization scale and the quantization zero point to obtain de-quantized processing layer parameters; and train the parameters of each of the processing layers in the third network model based on a set quantization-aware training algorithm and the first training data set to obtain a second network model.
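A minimal sketch of the uniform quantize/de-quantize ("fake quantization") operation performed by such a pseudo-quantizer, for a per-tensor (layer-level) configuration; per-channel (feature-level) handling and the learning of the scale and zero point are omitted:

```python
import torch

def fake_quantize(x: torch.Tensor, scale: float, zero_point: int,
                  bits: int = 8, symmetric: bool = True) -> torch.Tensor:
    """Uniformly quantize x to the integer range given by `bits`, then de-quantize."""
    if symmetric:
        qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1  # e.g. -128..127, zero_point = 0
    else:
        qmin, qmax = 0, 2 ** bits - 1                          # e.g. 0..255
    q = torch.clamp(torch.round(x / scale + zero_point), qmin, qmax)  # uniform quantization
    return (q - zero_point) * scale                                   # inverse uniform quantization
```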
In some embodiments, the quantization part is further configured to: determine preset training hyperparameters corresponding to the neural network structure adopted by the first network model, where the training hyperparameters are the same for each of multiple kinds of preset deployment configuration information; and quantize each of the processing layers in the first network model according to the quantization parameters, using a set first training data set and based on the quantization algorithm and the training hyperparameters, to obtain a second quantized network model.
In some embodiments, the first acquisition part is further configured to: adjust the processing layers in a set neural network structure based on at least one kind of deployment configuration information to obtain at least one adjusted neural network structure; create at least one first network model based on the at least one adjusted neural network structure; initialize the parameters of the at least one first network model based on preset model parameters corresponding to the set neural network structure to obtain at least one initialized first network model; and determine, based on the set deployment configuration information, a first network model to be quantized from the at least one initialized first network model.
In some embodiments, the apparatus further includes: a second acquisition part configured to acquire a preset pre-trained model corresponding to the neural network structure, where the structure of the pre-trained model before the output layer is the same as the neural network structure; a pre-training part configured to train the parameters of the pre-trained model using a set second training data set to obtain the trained pre-trained model; and a second determination part configured to determine the trained parameters of the pre-trained model as the preset model parameters.
In some embodiments, the first acquisition part is further configured to: determine a target neural network structure from multiple preset neural network structures; and adjust the processing layers in the target neural network structure based on at least one kind of deployment configuration information to obtain at least one adjusted neural network structure.
The description of the above apparatus embodiments is similar to the description of the above method embodiments and has beneficial effects similar to those of the method embodiments. For technical details not disclosed in the apparatus embodiments of the present application, please refer to the description of the method embodiments of the present application.
In the embodiments of the present application and other embodiments, a "part" may be part of a circuit, part of a processor, part of a program or software, and so on; it may of course also be a unit, and may be modular or non-modular.
It should be noted that, in the embodiments of the present application, if the above model quantization method is implemented in the form of software function modules and sold or used as an independent product, it may also be stored in a computer-readable storage medium. Based on this understanding, the technical solutions of the embodiments of the present application, in essence or in the part contributing to the related art, may be embodied in the form of a software product. The software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a magnetic disk, or an optical disc. In this way, the embodiments of the present application are not limited to any specific combination of hardware and software.
An embodiment of the present application provides a computer device, including a memory and a processor. The memory stores a computer program executable on the processor, and the processor implements the steps of the above method when executing the program.
An embodiment of the present application provides a computer-readable storage medium on which a computer program is stored, and the computer program, when executed by a processor, implements the steps of the above method. The computer-readable storage medium may be transitory or non-transitory.
An embodiment of the present application provides a computer program, which includes computer-readable code. When the computer-readable code runs in a computer device, a processor in the computer device executes some or all of the steps of the above method.
An embodiment of the present application provides a computer program product. The computer program product includes a non-transitory computer-readable storage medium storing a computer program, and when the computer program is read and executed by a computer, some or all of the steps of the above method are implemented. The computer program product may be implemented by hardware, software, or a combination thereof. In some embodiments, the computer program product is embodied as a computer storage medium; in other embodiments, the computer program product is embodied as a software product, such as a software development kit (SDK).
It should be pointed out here that the descriptions of the above storage medium, computer program product, and device embodiments are similar to the description of the above method embodiments and have beneficial effects similar to those of the method embodiments. For technical details not disclosed in the storage medium, computer program product, computer program, and device embodiments of the present application, please refer to the description of the method embodiments of the present application.
It should be noted that Fig. 8 is a schematic diagram of a hardware entity of a computer device in an embodiment of the present application. As shown in Fig. 8, the hardware entity of the computer device 800 includes a processor 801, a communication interface 802, and a memory 803, where the processor 801 generally controls the overall operation of the computer device 800. The communication interface 802 enables the computer device to communicate with other terminals or servers over a network. The memory 803 is configured to store instructions and applications executable by the processor 801, and can also cache data to be processed or already processed by the processor 801 and the modules in the computer device 800 (for example, image data, audio data, voice communication data, and video communication data); it can be implemented by flash memory (FLASH) or random access memory (RAM). Data can be transferred among the processor 801, the communication interface 802, and the memory 803 through a bus 804.
It should be understood that references throughout the specification to "one embodiment" or "an embodiment" mean that a particular feature, structure, or characteristic related to the embodiment is included in at least one embodiment of the present application. Therefore, appearances of "in one embodiment" or "in an embodiment" in various places throughout the specification do not necessarily refer to the same embodiment. Furthermore, these particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. It should be understood that, in the various embodiments of the present application, the magnitude of the sequence numbers of the above processes does not imply an order of execution; the execution order of the processes should be determined by their functions and internal logic, and should not constitute any limitation on the implementation of the embodiments of the present application. The sequence numbers of the above embodiments of the present application are for description only and do not represent the advantages or disadvantages of the embodiments.
It should be noted that, in this document, the terms "comprise" and "include", or any other variant thereof, are intended to cover a non-exclusive inclusion, so that a process, method, article, or apparatus including a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that includes the element.
In the several embodiments provided in the present application, it should be understood that the disclosed devices and methods may be implemented in other ways. The device embodiments described above are merely illustrative. For example, the division of the units is only a logical functional division, and there may be other division methods in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented. In addition, the coupling, direct coupling, or communication connection between the components shown or discussed may be indirect coupling or communication connection through some interfaces, devices, or units, and may be electrical, mechanical, or in other forms.
The units described above as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units; they may be located in one place or distributed over multiple network units; some or all of the units may be selected according to actual needs to achieve the purpose of the solutions of the embodiments.
In addition, the functional units in the embodiments of the present application may all be integrated into one processing unit, or each unit may serve as a single unit separately, or two or more units may be integrated into one unit; the above integrated unit may be implemented in the form of hardware, or in the form of hardware plus software functional units.
Those of ordinary skill in the art can understand that all or part of the steps for implementing the above method embodiments may be completed by hardware related to program instructions. The aforementioned program may be stored in a computer-readable storage medium, and when executed, the program performs the steps of the above method embodiments. The aforementioned storage medium includes various media that can store program code, such as a removable storage device, a read-only memory (ROM), a magnetic disk, or an optical disc.
Alternatively, if the above integrated unit of the embodiments of the present application is implemented in the form of a software function module and sold or used as an independent product, it may also be stored in a computer-readable storage medium. Based on this understanding, the technical solutions of the embodiments of the present application, in essence or in the part contributing to the related art, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media that can store program code, such as a removable storage device, a ROM, a magnetic disk, or an optical disc.
The above are only implementations of the present application, but the protection scope of the present application is not limited thereto. Any changes or substitutions that a person skilled in the art can readily conceive of within the technical scope disclosed in the present application shall be covered by the protection scope of the present application.
Industrial Applicability
The embodiments of the present application disclose a model quantization method, apparatus, device, storage medium, computer program product, and computer program, where the method includes: acquiring a first network model to be quantized; determining, based on set deployment configuration information, at least one processing layer to be quantized in the first network model and a quantization parameter for quantizing each of the processing layers; and quantizing each of the processing layers in the first network model according to the quantization parameter to obtain a second network model. According to the embodiments of the present application, the deployment configuration information of the hardware platform on which the model is deployed can be fully considered in the process of quantizing the first network model, so as to obtain a second network model deployable on the corresponding hardware platform.

Claims (28)

  1. A model quantization method, the method comprising:
    acquiring a first network model to be quantized;
    determining, based on set deployment configuration information, at least one processing layer to be quantized in the first network model and a quantization parameter for quantizing each of the processing layers;
    quantizing each of the processing layers in the first network model according to the quantization parameter to obtain a second network model.
  2. The method according to claim 1, wherein the first network model comprises at least one block structure, and each of the block structures comprises at least one processing layer;
    the determining, based on the set deployment configuration information, at least one processing layer to be quantized in the first network model and a quantization parameter for quantizing each of the processing layers comprises:
    determining, based on the set deployment configuration information, at least one processing layer to be quantized in each of the block structures in the first network model and a quantization parameter for quantizing each of the processing layers.
  3. The method according to claim 1, wherein the deployment configuration information comprises an inference engine used by a deployment hardware type;
    the determining, based on the set deployment configuration information, at least one processing layer to be quantized in the first network model comprises:
    determining, based on the inference engine, a processing layer type to be quantized;
    determining at least one processing layer in the first network model that matches the processing layer type as a processing layer to be quantized.
  4. The method according to claim 3, wherein the processing layer type comprises a convolutional layer and a batch normalization layer;
    the determining at least one processing layer in the first network model that matches the processing layer type as a processing layer to be quantized comprises:
    determining at least one batch normalization layer in the first network model and the convolutional layer on which each of the batch normalization layers depends as processing layers to be quantized;
    obtaining a set batch normalization layer folding strategy;
    folding, based on the batch normalization layer folding strategy, each of the batch normalization layers in the first network model into the convolutional layer on which the batch normalization layer depends to obtain the folded first network model;
    the quantizing each of the processing layers in the first network model according to the quantization parameter to obtain a second network model comprises:
    quantizing each of the processing layers in the folded first network model according to the quantization parameter to obtain a second network model.
  5. The method according to claim 4, wherein the batch normalization layer folding strategy comprises a removal state of the batch normalization layer, a coefficient update algorithm, statistical parameters to be merged into weights, and statistical parameters to be merged into offsets; the statistical parameters to be merged into weights comprise running statistics of the convolutional layer on which the batch normalization layer depends or statistics of the current batch, and the statistical parameters to be merged into offsets comprise running statistics of the convolutional layer on which the batch normalization layer depends or statistics of the current batch;
    the folding, based on the batch normalization layer folding strategy, each of the batch normalization layers in the first network model into the convolutional layer on which the batch normalization layer depends comprises:
    determining a scaling coefficient and a translation coefficient of each of the batch normalization layers in at least one batch normalization layer in the first network model;
    updating the scaling coefficient and the translation coefficient of each of the batch normalization layers based on the coefficient update algorithm to obtain an updated scaling coefficient and an updated translation coefficient of each of the batch normalization layers;
    for each of the batch normalization layers, obtaining the statistical parameters to be merged into weights and the statistical parameters to be merged into offsets of the batch normalization layer, merging the updated scaling coefficient of the batch normalization layer and the statistical parameters to be merged into weights into the weights of the convolutional layer on which the batch normalization layer depends, and merging the updated scaling coefficient and the updated translation coefficient of the batch normalization layer and the statistical parameters to be merged into offsets into the offset of the convolutional layer;
    removing each of the batch normalization layers from the first network model when the removal state of the batch normalization layer is removal.
  6. The method according to claim 4 or 5, wherein the obtaining a set batch normalization layer folding strategy comprises:
    determining, based on the inference engine, a target batch normalization layer folding strategy from multiple set batch normalization layer folding strategies.
  7. The method according to any one of claims 1 to 6, wherein the quantizing each of the processing layers in the first network model according to the quantization parameter to obtain a second network model comprises:
    quantizing, based on a set quantization algorithm and a first training data set, each of the processing layers in the first network model according to the quantization parameters to obtain a second network model.
  8. The method according to claim 7, wherein the quantization parameters comprise a preset precision of a quantization scale, a quantization symmetry, a quantization bit width and a quantization granularity; the quantization symmetry comprises symmetric quantization or asymmetric quantization, the quantization granularity comprises layer-level quantization or feature-level quantization, and the quantization algorithm comprises a quantization-aware training algorithm;
    the quantizing, based on the set quantization algorithm and the first training data set, each of the processing layers in the first network model according to the quantization parameters to obtain the second network model comprises:
    setting, according to the quantization parameters, a pseudo-quantizer for each of the processing layers in the first network model to obtain a third network model, wherein the pseudo-quantizer is configured to: determine a quantization value range of processing layer parameters based on the quantization bit width; determine a quantization scale satisfying the preset precision and a quantization zero point satisfying the quantization symmetry; perform, based on the quantization granularity and within the quantization value range, uniform quantization on the processing layer parameters to be quantized using the quantization scale and the quantization zero point to obtain quantized processing layer parameters; and perform inverse uniform quantization on the quantized processing layer parameters based on the quantization scale and the quantization zero point to obtain dequantized processing layer parameters; and
    training, based on a set quantization-aware training algorithm and the first training data set, the parameters of each of the processing layers in the third network model to obtain the second network model (see the pseudo-quantizer and quantization-aware training sketches after the claims).
  9. The method according to claim 7 or 8, wherein the quantizing, based on the set quantization algorithm and the first training data set, each of the processing layers in the first network model according to the quantization parameters to obtain the second network model comprises:
    determining preset training hyperparameters corresponding to a neural network structure adopted by the first network model, wherein the training hyperparameters are the same for each piece of deployment configuration information among a plurality of pieces of preset deployment configuration information; and
    quantizing, using a set first training data set and based on the quantization algorithm and the training hyperparameters, each of the processing layers in the first network model according to the quantization parameters to obtain the second network model.
  10. The method according to any one of claims 1 to 9, wherein the acquiring a first network model to be quantized comprises:
    adjusting, based on at least one piece of deployment configuration information, processing layers in a set neural network structure to obtain at least one adjusted neural network structure;
    creating at least one first network model based on the at least one adjusted neural network structure;
    initializing parameters of the at least one first network model based on preset model parameters corresponding to the set neural network structure to obtain at least one initialized first network model; and
    determining, based on the set deployment configuration information, the first network model to be quantized from the at least one initialized first network model.
  11. The method according to claim 10, further comprising:
    acquiring a preset pre-trained model corresponding to the neural network structure, wherein the structure of the pre-trained model before the output layer is the same as the neural network structure;
    training parameters of the pre-trained model using a set second training data set to obtain the trained pre-trained model; and
    determining the parameters of the trained pre-trained model as the preset model parameters.
  12. The method according to claim 10 or 11, wherein the adjusting, based on at least one piece of deployment configuration information, processing layers in the set neural network structure to obtain at least one adjusted neural network structure comprises:
    determining a target neural network structure from a plurality of preset neural network structures; and
    adjusting, based on the at least one piece of deployment configuration information, processing layers in the target neural network structure to obtain the at least one adjusted neural network structure.
  13. A model quantization apparatus, comprising:
    a first acquisition part configured to acquire a first network model to be quantized;
    a first determination part configured to determine, based on set deployment configuration information, at least one processing layer to be quantized in the first network model and quantization parameters for quantizing each of the processing layers; and
    a quantization part configured to quantize each of the processing layers in the first network model according to the quantization parameters to obtain a second network model.
  14. The apparatus according to claim 13, wherein the first network model comprises at least one block structure, and each block structure comprises at least one processing layer; the first determination part is further configured to determine, based on the set deployment configuration information, at least one processing layer to be quantized in each block structure of the first network model and quantization parameters for quantizing each of the processing layers.
  15. The apparatus according to claim 13, wherein the deployment configuration information comprises an inference engine adopted by a deployment hardware type; the first determination part is further configured to: determine, based on the inference engine, a processing layer type to be quantized; and determine at least one processing layer in the first network model that matches the processing layer type as the processing layer to be quantized.
  16. The apparatus according to claim 15, wherein the processing layer type comprises a convolutional layer and a batch normalization layer; the first determination part is further configured to: determine at least one batch normalization layer in the first network model and the convolutional layer on which each batch normalization layer depends as processing layers to be quantized; acquire a set batch normalization layer folding strategy; and fold, based on the batch normalization layer folding strategy, each batch normalization layer in the first network model into the convolutional layer on which the batch normalization layer depends to obtain the folded first network model; and the quantization part is further configured to quantize each of the processing layers in the folded first network model according to the quantization parameters to obtain the second network model.
  17. The apparatus according to claim 16, wherein the batch normalization layer folding strategy comprises a removal state of the batch normalization layer, a coefficient update algorithm, statistical parameters to be merged into the weight, and statistical parameters to be merged into the bias; the statistical parameters to be merged into the weight comprise the running statistics of the convolutional layer on which the batch normalization layer depends or the statistics of the current batch, and the statistical parameters to be merged into the bias comprise the running statistics of the convolutional layer on which the batch normalization layer depends or the statistics of the current batch; the first determination part is further configured to: determine a scaling coefficient and a shift coefficient of each batch normalization layer among at least one batch normalization layer in the first network model; update, based on the coefficient update algorithm, the scaling coefficient and the shift coefficient of each batch normalization layer to obtain an updated scaling coefficient and an updated shift coefficient of each batch normalization layer; for each batch normalization layer, acquire the statistical parameters to be merged into the weight and the statistical parameters to be merged into the bias of the batch normalization layer, merge the updated scaling coefficient of the batch normalization layer and the statistical parameters to be merged into the weight into the weight of the convolutional layer on which the batch normalization layer depends, and merge the updated scaling coefficient and shift coefficient of the batch normalization layer and the statistical parameters to be merged into the bias into the bias of the convolutional layer; and remove each batch normalization layer from the first network model in a case where the removal state of the batch normalization layer is removal.
  18. The apparatus according to claim 16 or 17, wherein the first determination part is further configured to determine, based on the inference engine, a target batch normalization layer folding strategy from a plurality of set batch normalization layer folding strategies.
  19. The apparatus according to any one of claims 13 to 18, wherein the quantization part is further configured to quantize, based on a set quantization algorithm and a first training data set, each of the processing layers in the first network model according to the quantization parameters to obtain the second network model.
  20. The apparatus according to claim 19, wherein the quantization parameters comprise a preset precision of a quantization scale, a quantization symmetry, a quantization bit width and a quantization granularity; the quantization symmetry comprises symmetric quantization or asymmetric quantization, the quantization granularity comprises layer-level quantization or feature-level quantization, and the quantization algorithm comprises a quantization-aware training algorithm; the quantization part is further configured to: set, according to the quantization parameters, a pseudo-quantizer for each of the processing layers in the first network model to obtain a third network model, wherein the pseudo-quantizer is configured to: determine a quantization value range of processing layer parameters based on the quantization bit width; determine a quantization scale satisfying the preset precision and a quantization zero point satisfying the quantization symmetry; perform, based on the quantization granularity and within the quantization value range, uniform quantization on the processing layer parameters to be quantized using the quantization scale and the quantization zero point to obtain quantized processing layer parameters; and perform inverse uniform quantization on the quantized processing layer parameters based on the quantization scale and the quantization zero point to obtain dequantized processing layer parameters; and train, based on a set quantization-aware training algorithm and the first training data set, the parameters of each of the processing layers in the third network model to obtain the second network model.
  21. The apparatus according to claim 19 or 20, wherein the quantization part is further configured to: determine preset training hyperparameters corresponding to a neural network structure adopted by the first network model, wherein the training hyperparameters are the same for each piece of deployment configuration information among a plurality of pieces of preset deployment configuration information; and quantize, using a set first training data set and based on the quantization algorithm and the training hyperparameters, each of the processing layers in the first network model according to the quantization parameters to obtain the second network model.
  22. The apparatus according to any one of claims 13 to 21, wherein the first acquisition part is further configured to: adjust, based on at least one piece of deployment configuration information, processing layers in a set neural network structure to obtain at least one adjusted neural network structure; create at least one first network model based on the at least one adjusted neural network structure; initialize parameters of the at least one first network model based on preset model parameters corresponding to the set neural network structure to obtain at least one initialized first network model; and determine, based on the set deployment configuration information, the first network model to be quantized from the at least one initialized first network model.
  23. The apparatus according to claim 22, further comprising: a second acquisition part configured to acquire a preset pre-trained model corresponding to the neural network structure, wherein the structure of the pre-trained model before the output layer is the same as the neural network structure; a pre-training part configured to train parameters of the pre-trained model using a set second training data set to obtain the trained pre-trained model; and a second determination part configured to determine the parameters of the trained pre-trained model as the preset model parameters.
  24. The apparatus according to claim 22 or 23, wherein the first acquisition part is further configured to: determine a target neural network structure from a plurality of preset neural network structures; and adjust, based on the at least one piece of deployment configuration information, processing layers in the target neural network structure to obtain the at least one adjusted neural network structure.
  25. A computer device, comprising a memory and a processor, wherein the memory stores a computer program executable on the processor, and the processor, when executing the program, implements the steps of the method according to any one of claims 1 to 12.
  26. A computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 12.
  27. A computer program product, comprising a non-transitory computer-readable storage medium storing a computer program, wherein the computer program, when read and executed by a computer, implements the steps of the method according to any one of claims 1 to 12.
  28. A computer program, comprising computer-readable code, wherein, in a case where the computer-readable code runs in a computer device, a processor in the computer device performs the steps of the method according to any one of claims 1 to 12.
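Note on claims 5 and 17: the batch normalization folding can be illustrated with a short sketch. The following Python/PyTorch snippet is only one possible reading of the claim language, not the disclosed implementation; the function name fold_bn_into_conv and the choice of running statistics (rather than current-batch statistics) are assumptions made for illustration.

```python
import torch
import torch.nn as nn

def fold_bn_into_conv(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    """Illustrative folding of a BatchNorm2d layer into the Conv2d it depends on."""
    gamma, beta = bn.weight, bn.bias              # scaling and shift coefficients
    mean, var = bn.running_mean, bn.running_var   # running statistics
    std = torch.sqrt(var + bn.eps)

    fused = nn.Conv2d(conv.in_channels, conv.out_channels, conv.kernel_size,
                      stride=conv.stride, padding=conv.padding,
                      dilation=conv.dilation, groups=conv.groups, bias=True)
    # Merge the updated scaling coefficient and the statistics into the weight.
    fused.weight.data = conv.weight.data * (gamma / std).reshape(-1, 1, 1, 1)
    # Merge the scaling coefficient, shift coefficient and statistics into the bias.
    conv_bias = conv.bias.data if conv.bias is not None else torch.zeros_like(mean)
    fused.bias.data = beta + (conv_bias - mean) * gamma / std
    return fused  # the standalone BN layer can then be removed from the model

# Example usage on freshly built layers:
# fused = fold_bn_into_conv(nn.Conv2d(16, 32, 3), nn.BatchNorm2d(32))
```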
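The pseudo-quantizer of claims 8 and 20 performs uniform quantization followed by inverse uniform quantization (commonly called fake quantization). The sketch below is a hedged illustration under assumed conventions: the function name fake_quantize, the clamping rule and the symmetric/asymmetric range choice are not taken from the disclosure. Layer-level granularity corresponds to a single scale and zero point per tensor; feature-level granularity would pass per-channel scale and zero_point tensors that broadcast against the input.

```python
import torch

def fake_quantize(x: torch.Tensor, scale: torch.Tensor, zero_point: torch.Tensor,
                  bit_width: int = 8, symmetric: bool = True) -> torch.Tensor:
    """Uniformly quantize x and immediately dequantize it (illustrative)."""
    # The quantization bit width fixes the quantized value range.
    if symmetric:
        qmin, qmax = -(2 ** (bit_width - 1)), 2 ** (bit_width - 1) - 1
    else:
        qmin, qmax = 0, 2 ** bit_width - 1
    # Uniform quantization with the quantization scale and zero point,
    # clamped to the value range determined above.
    q = torch.clamp(torch.round(x / scale + zero_point), qmin, qmax)
    # Inverse uniform quantization back to floating point.
    return (q - zero_point) * scale

# Layer-level (per-tensor) example with assumed values:
# y = fake_quantize(torch.randn(4, 8), torch.tensor(0.02), torch.tensor(0.0))
```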
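Claims 8 and 20 then train the network carrying these pseudo-quantizers with a quantization-aware training algorithm. A minimal sketch of how a wrapper might attach weight fake-quantization to a convolutional layer is shown below; it reuses the fake_quantize helper from the previous sketch, and the max-based scale and the straight-through estimator are assumptions rather than details from the disclosure.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QuantConv2d(nn.Module):
    """Convolution with fake-quantized weights, for quantization-aware training."""
    def __init__(self, conv: nn.Conv2d, bit_width: int = 8):
        super().__init__()
        self.conv = conv
        self.bit_width = bit_width

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = self.conv.weight
        # Assumed max-based quantization scale; zero point 0 for symmetric quantization.
        scale = w.detach().abs().max() / (2 ** (self.bit_width - 1) - 1)
        # Straight-through estimator: the forward pass uses the fake-quantized
        # weight, while gradients flow to the full-precision weight.
        w_q = w + (fake_quantize(w, scale, torch.tensor(0.0), self.bit_width) - w).detach()
        return F.conv2d(x, w_q, self.conv.bias, self.conv.stride,
                        self.conv.padding, self.conv.dilation, self.conv.groups)

# Example: wrap a layer, then train the model as usual with a standard optimizer.
# qconv = QuantConv2d(nn.Conv2d(16, 32, 3, padding=1), bit_width=8)
```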
PCT/CN2022/071377 2021-09-03 2022-01-11 Model quantization method and apparatus, device, storage medium, computer program product, and computer program WO2023029349A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111030764.1A CN113780551B (en) 2021-09-03 2021-09-03 Model quantization method, device, equipment, storage medium and computer program product
CN202111030764.1 2021-09-03

Publications (1)

Publication Number Publication Date
WO2023029349A1

Family

ID=78840925

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/071377 WO2023029349A1 (en) 2021-09-03 2022-01-11 Model quantization method and apparatus, device, storage medium, computer program product, and computer program

Country Status (2)

Country Link
CN (1) CN113780551B (en)
WO (1) WO2023029349A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116187420A (en) * 2023-05-04 2023-05-30 上海齐感电子信息科技有限公司 Training method, system, equipment and medium for lightweight deep neural network
CN116739039A (en) * 2023-05-05 2023-09-12 北京百度网讯科技有限公司 Quantization method, device, equipment and medium of distributed deployment model

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113780551B (en) * 2021-09-03 2023-03-24 北京市商汤科技开发有限公司 Model quantization method, device, equipment, storage medium and computer program product
CN114580281A (en) * 2022-03-04 2022-06-03 北京市商汤科技开发有限公司 Model quantization method, apparatus, device, storage medium, and program product
CN114611697B (en) * 2022-05-11 2022-09-09 上海登临科技有限公司 Neural network quantification and deployment method, system, electronic device and storage medium
CN115238873B (en) * 2022-09-22 2023-04-07 深圳市友杰智新科技有限公司 Neural network model deployment method and device, and computer equipment
CN116630632B (en) * 2023-07-25 2023-11-03 腾讯科技(深圳)有限公司 Image segmentation model quantization method, device and equipment and computer storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111898751A (en) * 2020-07-29 2020-11-06 苏州浪潮智能科技有限公司 Data processing method, system, equipment and readable storage medium
US20210174214A1 (en) * 2019-12-10 2021-06-10 The Mathworks, Inc. Systems and methods for quantizing a neural network
CN113282535A (en) * 2021-05-25 2021-08-20 北京市商汤科技开发有限公司 Quantization processing method and device and quantization processing chip
CN113780551A (en) * 2021-09-03 2021-12-10 北京市商汤科技开发有限公司 Model quantization method, device, equipment, storage medium and computer program product

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109460613A (en) * 2018-11-12 2019-03-12 北京迈格威科技有限公司 Model method of cutting out and device
CN110443165B (en) * 2019-07-23 2022-04-29 北京迈格威科技有限公司 Neural network quantization method, image recognition method, device and computer equipment
US20210089925A1 (en) * 2019-09-24 2021-03-25 Vahid PARTOVI NIA Training method for quantizing the weights and inputs of a neural network
CN111783974A (en) * 2020-08-12 2020-10-16 成都佳华物链云科技有限公司 Model construction and image processing method and device, hardware platform and storage medium


Also Published As

Publication number Publication date
CN113780551A (en) 2021-12-10
CN113780551B (en) 2023-03-24

Similar Documents

Publication Publication Date Title
WO2023029349A1 (en) Model quantization method and apparatus, device, storage medium, computer program product, and computer program
US20240104378A1 (en) Dynamic quantization of neural networks
CN110363279B (en) Image processing method and device based on convolutional neural network model
TW201918939A (en) Method and apparatus for learning low-precision neural network
WO2019238029A1 (en) Convolutional neural network system, and method for quantifying convolutional neural network
US11604647B2 (en) Mixed precision capable hardware for tuning a machine learning model
TW201915839A (en) Method and apparatus for quantizing artificial neural network and floating-point neural network
US20200117981A1 (en) Data representation for dynamic precision in neural network cores
CN114341892A (en) Machine learning hardware with reduced precision parameter components for efficient parameter updating
TWI744724B (en) Method of processing convolution neural network
US11704556B2 (en) Optimization methods for quantization of neural network models
WO2023165139A1 (en) Model quantization method and apparatus, device, storage medium and program product
WO2022111002A1 (en) Method and apparatus for training neural network, and computer readable storage medium
WO2023272972A1 (en) Neural network search method and apparatus, and device, storage medium and program product
KR20230076641A (en) Apparatus and method for floating-point operations
CN116472538A (en) Method and system for quantifying neural networks
CN114580625A (en) Method, apparatus, and computer-readable storage medium for training neural network
Liu et al. Block-Wise Dynamic-Precision Neural Network Training Acceleration via Online Quantization Sensitivity Analytics
CN111950689A (en) Neural network training method and device
US20230342613A1 (en) System and method for integer only quantization aware training on edge devices
Naganawa et al. SIMD-Constrained Lookup Table for Accelerating Variable-Weighted Convolution on x86/64 CPUs
WO2024065530A1 (en) Methods and apparatus to perform artificial intelligence-based sparse computation based on hybrid pattern and dynamic encoding
WO2024060727A1 (en) Method and apparatus for training neural network model, and device and system
KR20240077167A (en) Data processing method and computing device for convolution operation
KR20230020856A (en) Device and Method for Quantizing Parameters of Neural Network

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22862500

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE