WO2023029349A1 - Model quantization method and apparatus, device, storage medium, computer program product, and computer program - Google Patents

Model quantization method and apparatus, device, storage medium, computer program product, and computer program

Info

Publication number
WO2023029349A1
Authority
WO
WIPO (PCT)
Prior art keywords
quantization
network model
layer
batch normalization
parameters
Prior art date
Application number
PCT/CN2022/071377
Other languages
French (fr)
Chinese (zh)
Inventor
李雨杭
沈明珠
马建
任岩
张琦
龚睿昊
余锋伟
Original Assignee
上海商汤智能科技有限公司
Priority date
Filing date
Publication date
Application filed by 上海商汤智能科技有限公司 filed Critical 上海商汤智能科技有限公司
Publication of WO2023029349A1 publication Critical patent/WO2023029349A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/082 Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks

Definitions

  • The embodiments of the present application relate to, but are not limited to, the field of artificial intelligence, and in particular relate to a model quantization method, apparatus, device, storage medium, computer program product, and computer program.
  • Model quantization can quantize the weights and activation values in a neural network from the original floating-point type to low-bit-width (such as 8-bit, 4-bit, 3-bit, 2-bit, etc.) integers. After the model is quantized, the storage space required for the quantized neural network model is reduced, and the calculation form changes from the original floating-point operations to lower-cost operations on low-bit-width integer data.
  • The embodiments of the present application provide a model quantization method, apparatus, device, storage medium, computer program product, and computer program.
  • An embodiment of the present application provides a model quantization method, the method comprising:
  • An embodiment of the present application provides a model quantization device, which includes:
  • the first acquisition part is configured to acquire a first network model to be quantized;
  • the first determining part is configured to determine at least one processing layer to be quantized in the first network model and a quantization parameter for quantizing each processing layer based on set deployment configuration information;
  • the quantization part is configured to perform quantization on each of the processing layers in the first network model according to the quantization parameter to obtain a second network model.
  • An embodiment of the present application provides a computer device, including a memory and a processor.
  • the memory stores a computer program that can run on the processor.
  • when the processor executes the program, some or all of the steps in the above method are implemented.
  • An embodiment of the present application provides a computer-readable storage medium, on which a computer program is stored, and when the computer program is executed by a processor, some or all of the steps in the above method are implemented.
  • An embodiment of the present application provides a computer program, including computer-readable codes;
  • when the computer-readable codes run in a computer device, a processor in the computer device executes some or all of the steps in the above method.
  • An embodiment of the present application provides a computer program product.
  • the computer program product includes a non-transitory computer-readable storage medium storing a computer program.
  • when the computer program is read and executed by a computer, some or all of the steps in the above method are implemented.
  • In the embodiments of the present application, the first network model to be quantized is obtained; based on the set deployment configuration information, at least one processing layer to be quantized in the first network model and the quantization parameters for each processing layer are determined; and each of the processing layers in the first network model is quantized according to the quantization parameters to obtain a second network model.
  • Since the processing layers to be quantized in the first network model and the quantization parameters for each processing layer to be quantized are determined based on the set deployment configuration information, full consideration is given, during model quantization, to the deployment configuration information of the hardware platform on which the model is deployed, so that the obtained second network model is deployable on the corresponding hardware platform.
  • Fig. 1 is a schematic diagram of the implementation flow of a model quantization method provided by an embodiment of the present application;
  • FIG. 2A is a schematic diagram of the implementation flow of a model quantization method provided by an embodiment of the present application;
  • FIG. 2B is a schematic diagram of inserting quantized nodes into a calculation graph of a basic block structure provided by an embodiment of the present application;
  • FIG. 2C is a schematic diagram of inserting quantized nodes into a calculation graph of a basic block structure provided by an embodiment of the present application.
  • FIG. 2D is a schematic diagram of inserting a quantization node into a calculation graph of a basic block structure provided by an embodiment of the present application;
  • FIG. 3A is a schematic diagram of the implementation flow of a model quantization method provided by an embodiment of the present application;
  • FIG. 3B is a schematic diagram of an implementation of a batch normalization layer folding strategy provided by an embodiment of the present application.
  • FIG. 3C is a schematic diagram of an implementation of a batch normalization layer folding strategy provided by an embodiment of the present application.
  • FIG. 3D is a schematic diagram of an implementation of a batch normalization layer folding strategy provided by an embodiment of the present application.
  • FIG. 3E is a schematic diagram of an implementation of a batch normalization layer folding strategy provided by an embodiment of the present application.
  • FIG. 3F is a schematic diagram of an implementation of a batch normalization layer folding strategy provided by an embodiment of the present application.
  • FIG. 4 is a schematic diagram of the implementation flow of a model quantization method provided by an embodiment of the present application;
  • Fig. 5 is a schematic diagram of the implementation flow of a model quantization method provided by an embodiment of the present application;
  • FIG. 6 is a schematic diagram of an application scenario of MQBench provided by the embodiment of the present application.
  • FIG. 7 is a schematic diagram of the composition and structure of a model quantization device provided in the embodiment of the present application.
  • FIG. 8 is a schematic diagram of a hardware entity of a computer device provided by an embodiment of the present application.
  • References to "some embodiments" describe a subset of all possible embodiments, but it should be understood that "some embodiments" may be the same subset or different subsets of all possible embodiments, and may be combined with each other without conflict.
  • The term "first/second/third" is only used to distinguish similar objects and does not represent a specific order of objects. It should be understood that, where permitted, the specific order or sequence of "first/second/third" may be interchanged, so that the embodiments of the application described herein can be practiced in sequences other than those illustrated or described herein.
  • the model quantization solution in the related art is first described.
  • the model quantization scheme often fails to be practically applied and deployed because it ignores the requirements of hardware deployment.
  • The hardware platform usually folds the calculation of the batch normalization (Batch Normalization, BN) layer into the convolutional layer to avoid extra overhead, but in the related art the BN layer is kept intact;
  • In the related art, only the input parameters and weight parameters of the convolutional layers are considered for quantization, but when the model is deployed, the entire calculation graph of the neural network model should be quantized, that is,
  • the input parameters and weight parameters of processing layers other than the convolutional layers also need to be quantized.
  • Therefore, the model quantization scheme in the related art inevitably reduces the deployability of the quantization algorithm.
  • In addition, because different quantization algorithms have different deployability on different hardware platforms, it is difficult in academic research to measure the performance and robustness of different quantization algorithms across different hardware and quantization methods.
  • FIG. 1 is a schematic diagram of the implementation process of a model quantification method provided in the embodiment of the present application. As shown in Figure 1, the method includes:
  • Step S101 acquiring a first network model to be quantized.
  • the first network model can be any suitable neural network model to be quantized, and can be a full-precision neural network model.
  • The first network model can be a neural network model with 32-bit floating-point parameters or 16-bit floating-point parameters; of course, this embodiment does not limit the floating-point precision of the first network model.
  • the first network model may adopt any suitable neural network structure, including but not limited to one or more of ResNet-18, ResNet-50, MobileNetV2, EfficientNet-Lite, RegNet and the like.
  • Step S102 based on the set deployment configuration information, determine at least one processing layer to be quantized in the first network model and a quantization parameter for quantizing each processing layer.
  • the deployment configuration information may include but not limited to one or more of the type of the deployed hardware, the inference engine used by the deployed hardware type, the model of the deployed hardware, the quantized bit width of the network model parameters corresponding to the deployed hardware type, and the like.
  • the deployment configuration information may be preset by the user, or may be default, or may be obtained from a configuration file of the target deployment hardware, which is not limited here.
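  • For illustration only, the deployment configuration information described above could be represented by a simple structure such as the following sketch; the field names (e.g. hardware_type, inference_engine, bit_width) and values are hypothetical and not prescribed by this embodiment:

```python
from dataclasses import dataclass

@dataclass
class DeployConfig:
    """Hypothetical container for the set deployment configuration information."""
    hardware_type: str        # type of the deployment hardware, e.g. a vendor name
    inference_engine: str     # inference engine used by that hardware type, e.g. "TensorRT"
    hardware_model: str = ""  # optional: concrete model of the deployment hardware
    bit_width: int = 8        # quantization bit width of the network model parameters

# Example: a user-set configuration; the values are illustrative only.
config = DeployConfig(hardware_type="gpu_vendor_a",
                      inference_engine="TensorRT",
                      bit_width=8)
```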
  • During implementation, the first network model may include multiple processing layers, such as one or more of an input layer, a convolutional layer, a pooling layer, a downsampling layer, a linear rectification unit, a fully connected layer, and a batch normalization layer. Since different deployment environments may have different support capabilities for model quantization, at least one processing layer to be quantized in the first network model may be determined based on the set deployment configuration information. During implementation, at least one processing layer to be quantized in the first network model may be determined in an appropriate manner based on the set deployment configuration information according to actual conditions, which is not limited in this embodiment of the present application.
  • For example, the correspondence between different deployment configuration information and the processing layers to be quantized can be determined in advance according to the actual situation, and at least one processing layer to be quantized in the first network model can be determined by querying this correspondence with the set deployment configuration information.
  • For example, for a first deployment hardware type or a first inference engine, it can be determined that only the convolutional layers in the first network model are quantized; for a second deployment hardware type or a second inference engine, it can be determined that each convolutional layer, the input layer, and the fully connected layers of the first network model are quantized; for a third inference engine, it can be determined that each convolutional layer, the input layer, the fully connected layers, and the element-wise addition calculation layers in the first network model are quantized.
  • In some embodiments, the parameters to be quantized in each processing layer of the at least one processing layer to be quantized in the first network model may also be determined based on the set deployment configuration information.
  • the quantization parameter for quantizing each processing layer may include, but not limited to, one or more of the preset accuracy of the quantization scale used in the process of quantizing the processing layer, quantization symmetry, quantization bit width, and quantization granularity, etc.
  • the preset precision of the quantization scale may include full precision, power of 2 precision, and the like.
  • Quantization symmetry can be either symmetric quantization or asymmetric quantization.
  • the quantization bit width may include one of 8 bits, 4 bits, 3 bits, 2 bits and so on.
  • Quantization granularity can be hierarchical quantization (that is, tensor-level quantization) or feature-level quantization (that is, channel-level quantization).
  • Based on the set deployment configuration information, the quantization parameter used in the quantization process of each processing layer to be quantized in the first network model can also be determined.
  • During implementation, those skilled in the art may determine the quantization parameter for quantizing each processing layer to be quantized in the first network model based on the set deployment configuration information in an appropriate manner according to the actual situation, which is not limited here.
  • For example, the correspondence between different deployment configuration information and quantization parameters can be determined in advance according to the actual situation, and the quantization parameters for quantizing each processing layer in the first network model can be determined by querying this correspondence with the set deployment configuration information.
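  • As a purely illustrative sketch, the quantization parameters determined for one processing layer might be grouped as follows; the options mirror those listed above (scale precision, symmetry, bit width, granularity), while the names and the engine entries in the lookup table are assumptions rather than part of this embodiment:

```python
from dataclasses import dataclass

@dataclass
class QuantParam:
    """Hypothetical per-layer quantization parameters."""
    scale_precision: str = "full"      # "full" or "power_of_two" precision of the quantization scale
    symmetry: str = "symmetric"        # "symmetric" or "asymmetric" quantization
    bit_width: int = 8                 # 8 / 4 / 3 / 2 bits, etc.
    granularity: str = "per_tensor"    # "per_tensor" (layer-wise) or "per_channel" (feature-level)

# Example: a predetermined correspondence between deployment configuration and
# quantization parameters, queried with the set deployment configuration information.
param_table = {"engine_a": QuantParam(symmetry="symmetric", granularity="per_channel"),
               "engine_b": QuantParam(symmetry="asymmetric", granularity="per_tensor")}
```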
  • Step S103 performing quantization on each of the processing layers in the first network model according to the quantization parameter to obtain a second network model.
  • any suitable quantization algorithm may be used according to the actual situation to quantize each processing layer in the first network model according to the quantization parameter to obtain the quantized second network model.
  • Quantization algorithms may include, but are not limited to, one or more of post-training quantization algorithms, quantization-aware training algorithms, and the like.
  • The post-training quantization algorithm refers to selecting appropriate quantization and calibration operations for a pre-trained network model to minimize the quantization loss; it can be post-training static quantization or post-training dynamic quantization.
  • The quantization-aware training algorithm refers to training the network during the quantization process, so that the network can adapt to the discontinuous distribution of integer values and reduce the loss of accuracy caused by quantization; it may include, but is not limited to, the Learned Step-size Quantization (LSQ) algorithm, the Parameterized Clipping Activation (PACT) algorithm, the Additive Powers-of-Two (APoT) quantization algorithm, the Differentiable Soft Quantization (DSQ) algorithm, the DoReFa-Net training algorithm, the Learned Quantization for Highly Accurate and Compact Deep Neural Networks (LQ-Net) algorithm, and the like.
  • In some embodiments, the calculation graph of the first network model can be extracted based on the network structure of the first network model; by inserting at least one quantization node to quantize at least one processing layer in the first network model, a calculation graph of the second network model is constructed, and quantization processing is performed on each processing layer to be quantized in the calculation graph of the second network model.
  • Here, the quantization parameter adopted by each quantization node is the quantization parameter for quantizing the corresponding processing layer, and the quantized second network model can be obtained based on the calculation graph of the second network model.
  • During implementation, any suitable quantization algorithm and training data can be used, according to the actual situation, to perform parameter training on the calculation graph of the second network model to obtain the trained calculation graph of the second network model, and the trained second network model is obtained based on the trained calculation graph of the second network model.
  • At least one quantization node is inserted into the calculation graph to construct the calculation graph of a suitable quantized neural network (that is, the calculation graph of the second network model).
  • Inserting a quantization node at a position in the calculation graph of the first network model is equivalent to quantizing the processing layer corresponding to the logical node at that position, so that determining the positions at which quantization nodes are inserted in the calculation graph of the first network model is equivalent to determining at least one processing layer to be quantized in the first network model.
  • In some embodiments, the deployment configuration information includes the inference engine used by the deployment hardware type; determining at least one processing layer to be quantized in the first network model based on the set deployment configuration information described in step S102 above may include:
  • Step S111 based on the inference engine, determine the processing layer type to be quantized
  • Step S112 determining at least one processing layer in the first network model that matches the processing layer type as the processing layer to be quantized.
  • the deployment hardware type is the hardware type of the target hardware on which the quantized second network model is deployed, and the reasoning engines used by different deployment hardware types may be the same or different, which is not limited here.
  • Inference engines can include but are not limited to TensorRT, ACL, TVM, SNPE, or FBGEMM, etc.
  • During implementation, the deployment hardware can be classified in an appropriate way according to the actual situation.
  • For example, the hardware can be classified by manufacturer, in which case the deployment hardware type is the manufacturer of the deployment hardware and the inference engine used by the deployment hardware type is the inference engine used by that manufacturer's hardware; the hardware can also be classified by specification and model, in which case the deployment hardware type is the model of the deployment hardware and the inference engine used by the deployment hardware type is the inference engine used by hardware of that model.
  • Different inference engines can support quantization of different types of processing layers.
  • The types of processing layers can include, but are not limited to, one or more of an input layer, a convolutional layer, a pooling layer, a downsampling layer, a linear rectification unit, a fully connected layer, a batch normalization layer, and the like.
  • During implementation, the correspondence between different inference engines and the types of processing layers to be quantized can be determined in advance, and based on this correspondence, the types of processing layers to be quantized corresponding to the inference engine adopted by the deployment hardware type can be determined.
  • each processing layer in the first network model may be matched with the processing layer type, and at least one matched processing layer may be determined as the processing layer to be quantized.
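  • For illustration only, the correspondence described above could be sketched as a lookup table from inference engine to the processing layer types it supports quantizing. The engine names and type lists below are hypothetical placeholders, not a statement about any particular engine:

```python
import torch.nn as nn

# Hypothetical mapping: inference engine -> processing layer types to be quantized.
QUANTIZABLE_TYPES = {
    "engine_a": (nn.Conv2d,),
    "engine_b": (nn.Conv2d, nn.Linear),
    "engine_c": (nn.Conv2d, nn.Linear, nn.AvgPool2d),
}

def select_layers_to_quantize(model: nn.Module, engine: str):
    """Return (name, module) pairs in the model whose type matches the layer
    types supported by the given inference engine."""
    types = QUANTIZABLE_TYPES[engine]
    return [(name, m) for name, m in model.named_modules() if isinstance(m, types)]
```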
  • In the embodiments of the present application, the first network model to be quantized is obtained; based on the set deployment configuration information, at least one processing layer to be quantized in the first network model and the quantization parameters for each processing layer are determined; and each of the processing layers in the first network model is quantized according to the quantization parameters to obtain a second network model.
  • Since the processing layers to be quantized in the first network model and the quantization parameters for each processing layer to be quantized are determined based on the set deployment configuration information, full consideration is given, during model quantization, to the deployment configuration information of the hardware platform on which the model is deployed, so that the obtained second network model is deployable on the corresponding hardware platform.
  • An embodiment of the present application provides a model quantization method, which can be executed by a processor of a computer device. As shown in Figure 2A, the method includes:
  • Step S201 acquiring a first network model to be quantized.
  • the above-mentioned step S201 corresponds to the above-mentioned step S101, and the implementation of the above-mentioned step S101 can be referred to for implementation.
  • Step S202 based on the set deployment configuration information, determine at least one processing layer to be quantized in each of the block structures in the first network model and a quantization parameter for quantizing each of the processing layers.
  • the structure of the neural network model can be divided into multiple stages (stages), each stage can be divided into multiple blocks (blocks), and each block can be divided into multiple processing layers (layers).
  • quantization processing is performed in units of a block structure.
  • the first network model includes at least one block structure, each of said block structures includes at least one processing layer.
  • the processing layers to be quantized in each block structure corresponding to the set deployment configuration information may be determined based on the predetermined correspondence between the deployment configuration information and the processing layers to be quantized in different block structures.
  • During implementation, pseudo-quantization nodes can be inserted into the calculation subgraph corresponding to the block structure in the calculation graph according to the insertion strategy of pseudo-quantization nodes corresponding to the deployment configuration information, thereby determining at least one processing layer to be quantized in the block structure.
  • For example, where the neural network structure adopted by the first network model is ResNet-18/ResNet-34, for the basic block structure in ResNet-18/ResNet-34, under different deployment configuration information, at least one pseudo-quantization node can be inserted into the calculation subgraph corresponding to the basic block structure using the three different insertion strategies shown in FIG. 2B to FIG. 2D.
  • As shown in FIG. 2B, a pseudo-quantization node FakeQuant 20 is inserted at the input of each convolutional layer Conv 10 in the calculation subgraph, where the pseudo-quantization node FakeQuant 20 includes a quantization processing node Quantization 21 and a de-quantization node Dequantization 22. Therefore, the processing layers to be quantized in the basic block structure corresponding to the calculation subgraph are the convolutional layers in the basic block structure.
  • As shown in FIG. 2C, the input of the calculation subgraph is quantized data (that is, the inputs of the convolutional layers Conv 10-1 and Conv 10-2); the pseudo-quantization node FakeQuant 20 is inserted at the input of the convolutional layer Conv 10-3, at one input of the element-wise addition layer elementwise-add 30, and at the output of the calculation subgraph, so that the processing layers to be quantized in the basic block structure corresponding to the calculation subgraph are each convolutional layer and the element-wise addition layer (with only a single input quantized) in the basic block structure, as well as the output layer of the basic block structure.
  • As shown in FIG. 2D, the input of the calculation subgraph is quantized data (that is, the inputs of the convolutional layers Conv 10-1 and Conv 10-2); the pseudo-quantization node FakeQuant 20 is inserted at the input of the convolutional layer Conv 10-3, at each input of the element-wise addition layer elementwise-add 30, and at the output of the calculation subgraph, so that the processing layers to be quantized in the basic block structure corresponding to the calculation subgraph are each convolutional layer and the element-wise addition layer (with both inputs quantized) in the basic block structure, as well as the output layer of the basic block structure.
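  • Purely as an illustration of the FIG. 2D insertion strategy, the forward pass of such a basic block could be sketched as follows. The class and argument names are assumptions, the batch normalization layers are omitted for brevity, and `fake_quant` stands in for the pseudo-quantization node FakeQuant 20 (one possible definition is sketched after formula (3) later in this document); a real implementation would typically use a separate FakeQuant instance per insertion point:

```python
import torch.nn as nn

class QuantBasicBlock(nn.Module):
    """Illustrative ResNet-style basic block with pseudo-quantization nodes inserted
    following the FIG. 2D strategy: the block input is assumed to be quantized already,
    and FakeQuant is applied at the input of Conv 10-3, at both inputs of the
    element-wise addition, and at the block output."""
    def __init__(self, conv10_1, conv10_3, conv10_2, fake_quant):
        super().__init__()
        self.conv10_1 = conv10_1      # first convolution on the main path
        self.conv10_3 = conv10_3      # second convolution on the main path
        self.conv10_2 = conv10_2      # convolution on the shortcut branch
        self.relu = nn.ReLU(inplace=True)
        self.fq = fake_quant          # pseudo-quantization node (quantize + de-quantize)

    def forward(self, x):
        # x: already-quantized block input, fed to both Conv 10-1 and Conv 10-2.
        main = self.relu(self.conv10_1(x))
        main = self.conv10_3(self.fq(main))       # FakeQuant at the input of Conv 10-3
        shortcut = self.conv10_2(x)
        out = self.fq(main) + self.fq(shortcut)   # FakeQuant at both inputs of elementwise-add
        return self.fq(self.relu(out))            # FakeQuant at the block output
```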
  • Step S203 performing quantization on each of the processing layers in the first network model according to the quantization parameter to obtain a second network model.
  • the above-mentioned step S203 corresponds to the above-mentioned step S103, and the implementation of the above-mentioned step S103 can be referred to for implementation.
  • the first network model includes at least one block structure, each block structure includes at least one processing layer, and based on the set deployment configuration information, at least one processing layer to be quantified in each block structure in the first network model is determined and a quantization parameter for quantizing each processing layer, and performing quantization on each processing layer to be quantized in the first network model according to the quantization parameter to obtain a second network model.
  • all block structures in the first network model can be quantized, thereby realizing the quantization of the entire network model.
  • An embodiment of the present application provides a model quantization method, which can be executed by a processor of a computer device. As shown in Figure 3A, the method includes:
  • Step S301 acquiring the first network model to be quantized.
  • Step S302 based on the inference engine used by the set deployment hardware type, determine the processing layer type to be quantized.
  • Step S303 determining at least one processing layer in the first network model that matches the processing layer type as the processing layer to be quantized.
  • Step S304 based on the inference engine, determine quantization parameters for quantizing each of the processing layers.
  • the above-mentioned steps S301 to S304 correspond to the above-mentioned steps S101 to S102, and the specific implementation manners of the above-mentioned steps S101 to S102 can be referred to for implementation.
  • Step S305 determining at least one batch normalization layer in the first network model and the convolutional layer that each batch normalization layer depends on as processing layers to be quantized.
  • the convolutional layer on which the batch normalization layer depends may be the convolutional layer connected to the batch normalization layer before the batch normalization layer.
  • Step S306 obtaining the set batch normalization layer folding strategy.
  • the batch normalization folding strategy refers to the strategy of folding the batch normalization layer in the neural network model into the convolutional layer that the batch normalization layer depends on.
  • batch normalization layers are designed to reduce internal covariate shifts and smooth losses for fast convergence.
  • the batch normalization layer introduces a two-step linear transformation, scaling and translation, to each convolutional layer output.
  • the set batch normalization layer folding strategy may be a preset batch normalization layer folding strategy corresponding to the deployment configuration information.
  • Step S307 based on the batch normalization layer folding strategy, fold each of the batch normalization layers in the first network model into the convolutional layer that the batch normalization layer depends on, to obtain the folded first network model.
  • Step S308 performing quantization on each of the processing layers in the folded first network model according to the quantization parameter to obtain a second network model.
  • the batch normalization layer folding strategy includes batch normalization layer removal status, coefficient update algorithm, statistical parameters to be incorporated into weights, statistical parameters to be incorporated into offsets;
  • the statistical parameters to be incorporated into the weight include the running statistics of the convolutional layer on which the batch normalization layer depends or the statistics of the current batch, and the statistical parameters to be incorporated into the offset include the running statistics of the convolutional layer on which the batch normalization layer depends or the statistics of the current batch.
  • In some embodiments, folding each batch normalization layer in the first network model into the convolutional layer that the batch normalization layer depends on, based on the batch normalization layer folding strategy described in step S307 above, may include:
  • Step S311 determining the scaling coefficient and translation coefficient of each batch normalization layer in at least one batch normalization layer in the first network model
  • the scaling coefficient and translation coefficient of each batch normalization layer may be determined based on parameters of the batch normalization layer.
  • Step S312 based on the coefficient update algorithm, update the scaling coefficient and translation coefficient of each batch normalization layer to obtain the updated scaling coefficient and translation coefficient of each batch normalization layer.
  • the coefficient update algorithm is any suitable algorithm set for updating the scaling coefficient and translation coefficient of the batch normalization layer, which may include but not limited to gradient descent method, simulated annealing method, genetic algorithm, etc. one or more species.
  • the coefficient updating algorithm may also be non-updating, so that the scaling coefficients and translation coefficients of the batch normalization layer may not be updated.
  • Step S313 for each batch normalization layer, obtain the statistical parameters to be incorporated into the weight and the statistical parameters to be incorporated into the offset for the batch normalization layer, merge the updated scaling coefficient of the batch normalization layer and the statistical parameters to be incorporated into the weight into the weight of the convolutional layer on which the batch normalization layer depends, and merge the updated scaling coefficient and translation coefficient of the batch normalization layer and the statistical parameters to be incorporated into the offset into the offset of the convolutional layer.
  • the statistical parameters to be incorporated into the weights may include the running statistics of the convolutional layer on which the batch normalization layer depends or the statistics of the current batch, and the statistical parameters to be incorporated into the offset may also include batch normalization The running statistics of the convolutional layers that the layer depends on or the statistics of the current batch.
  • Running statistical data is statistical data obtained from the output data during the historical operation of the convolutional layer, which may include but not limited to one or more of the mean, variance, and sliding average of the historical output data.
  • the statistical data of the current batch is the statistical data obtained by statistics of the current batch of data in the output data of the convolutional layer, which may include but not limited to one or more of the mean value and variance of the current batch of data.
  • the statistics of the current batch of the convolutional layer can be calculated by performing convolution with full-precision weights in the convolutional layer.
  • In some embodiments, the statistical parameters to be incorporated into the weight may include the variance of the historical output data of the convolutional layer on which the batch normalization layer depends, and the statistical parameters to be incorporated into the offset may include the mean and variance of the historical output data of the convolutional layer.
  • the updated scaling coefficient of the batch normalization layer and the variance of the historical output data of the convolutional layer that the batch normalization layer depends on can be combined into the weights of the convolutional layer that the batch normalization layer depends on, and the batch
  • the updated scaling coefficient and translation coefficient of the normalization layer and the mean and variance of the historical output data of the convolutional layer on which the batch normalization layer depends are combined into the offset of the convolutional layer.
  • In some embodiments, the statistical parameters to be incorporated into the weight may include the variance of the current batch data of the convolutional layer on which the batch normalization layer depends,
  • and the statistical parameters to be incorporated into the offset may include the mean and variance of the current batch data of the convolutional layer.
  • the updated scaling coefficient of the batch normalization layer and the variance of the current batch data of the convolutional layer on which the batch normalization layer depends can be combined into the weights of the convolutional layer on which the batch normalization layer depends
  • the updated scaling coefficient and translation coefficient of the batch normalization layer and the mean value and variance of the current batch data of the convolutional layer on which the batch normalization layer depends are combined into the offset of the convolutional layer.
  • In some embodiments, the statistical parameters to be incorporated into the weight may include the variance of the historical output data of the convolutional layer on which the batch normalization layer depends, and the statistical parameters to be incorporated into the offset may include the mean and variance of the current batch data of the convolutional layer.
  • the updated scaling coefficient of the batch normalization layer and the variance of the historical output data of the convolutional layer that the batch normalization layer depends on can be combined into the weights of the convolutional layer that the batch normalization layer depends on, and the batch
  • the updated scaling coefficient and translation coefficient of the normalization layer and the mean value and variance of the current batch of data of the convolutional layer on which the batch normalization layer depends are combined into the offset of the convolutional layer.
  • Step S314 if the removal state of the batch normalization layer is removed, remove each batch normalization layer from the first network model.
  • During implementation, the scaling coefficient and translation coefficient of the batch normalization layer and the running statistics of the convolutional layer on which the batch normalization layer depends can be combined in the manner shown in formula (1), so that the linear transformation performed by the batch normalization layer is folded into the corresponding convolutional layer:
  • w_fold = γ · w / √(σ² + ε),  b_fold = β + γ · (b − μ) / √(σ² + ε)    (1)
  • where w_fold and b_fold are the folded weight and offset of the convolutional layer, respectively, and w and b are its original weight and offset; μ and σ² are the sliding average and variance obtained from the statistics of the output data during the operation of the convolutional layer; γ and β are the scaling and translation coefficients of the batch normalization layer, respectively; and ε is a very small non-zero value set for numerical stability, which prevents the divisor from being zero. If the convolutional layer is quantized after the batch normalization layer is folded, there will be no extra floating-point operations during inference.
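  • As a minimal sketch of formula (1) using standard PyTorch modules (corresponding most closely to strategy 1 below; the function name is an assumption, and the convolution is assumed to carry its own bias term, taken as zero if absent):

```python
import torch
import torch.nn as nn

@torch.no_grad()
def fold_bn_into_conv(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    """Fold a BatchNorm2d layer into the Conv2d it depends on, per formula (1):
    w_fold = gamma * w / sqrt(var + eps);  b_fold = beta + gamma * (b - mu) / sqrt(var + eps)."""
    gamma, beta = bn.weight, bn.bias                          # scaling / translation coefficients
    mu, var, eps = bn.running_mean, bn.running_var, bn.eps    # running statistics
    scale = gamma / torch.sqrt(var + eps)

    fused = nn.Conv2d(conv.in_channels, conv.out_channels, conv.kernel_size,
                      stride=conv.stride, padding=conv.padding,
                      dilation=conv.dilation, groups=conv.groups, bias=True)
    fused.weight.copy_(conv.weight * scale.reshape(-1, 1, 1, 1))
    b = conv.bias if conv.bias is not None else torch.zeros_like(mu)
    fused.bias.copy_(beta + (b - mu) * scale)
    return fused
```

  • After folding, the fused convolution replaces the original convolution plus batch normalization pair and can then be quantized, which corresponds to removing the batch normalization layer as described in step S314.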
  • batch normalization layer folding strategies may include, but are not limited to, one of the following:
  • Strategy 1 See FIG. 3B.
  • In this strategy, the above formula (1) is used to merge the scaling coefficient and translation coefficient in the batch normalization layer into the weight w_fold and offset b_fold of the convolutional layer Conv 310 that the batch normalization layer depends on, and the batch normalization layer is completely removed;
  • Strategy 2 Refer to FIG. 3C.
  • In this strategy, the above formula (1) is also used to merge the scaling coefficient and translation coefficient in the batch normalization layer into the weight w_fold and offset b_fold of the convolutional layer Conv 310 that the batch normalization layer depends on.
  • Strategy 3 See Figure 3D.
  • In this strategy, the running statistics of the convolutional layer can be updated during the quantization training process; however, the convolution is calculated twice, which causes additional overhead.
  • The first convolution (corresponding to the convolutional layer Conv 320 in the figure) uses full-precision weights to calculate the mean and variance of the current batch; then, using the above formula (1) with the mean and variance of the current batch, the scaling coefficient and translation coefficient in the batch normalization layer are merged into the weight and offset of the convolutional layer Conv 310 that the batch normalization layer depends on, and the batch normalization layer is completely removed.
  • Strategy 4 See Figure 3E, in this strategy, two convolutions are also calculated during the training process.
  • the first convolution (corresponding to the convolutional layer Conv 320 in the figure) is the same as in strategy 3 and estimates the mean and variance of the current batch.
  • However, the weights are folded together with the running statistics: the variance σ² in the running statistics and the scaling coefficient in the batch normalization layer are merged into the weight of the convolutional layer that the batch normalization layer depends on, while the mean and variance of the current batch, together with the scaling and translation coefficients in the batch normalization layer, are merged into the offset of the convolutional layer Conv 310 that the batch normalization layer depends on, and the batch normalization layer is completely removed. In addition, a batch variance factor is used to rescale the output after the second convolution.
  • Strategy 5 See Figure 3F.
  • two convolutions are not used, but a batch normalization layer BN 330 is explicitly added after the quantized convolution (corresponding to the convolutional layer Conv 310 in the figure).
  • One of the benefits brought by this strategy is that the statistics of the current batch are calculated based on quantized weights.
  • the rescaling of convolutional layer outputs can be neutralized by batch normalization layers.
  • During implementation, a batch normalization layer folding strategy can be set from a variety of preset batch normalization layer folding strategies (such as the above-mentioned strategies 1 to 5), and based on the set batch normalization layer folding strategy, at least one batch normalization layer in the first network model is folded to obtain the folded first network model.
  • the above step S306 may include:
  • Step S321 based on the inference engine, determine a target batch normalization layer folding strategy from various set batch normalization layer folding strategies.
  • the set multiple batch normalization layer folding strategies may be determined in advance according to the actual situation, and may include but not limited to any one of the strategies 1 to 5 above.
  • The target batch normalization layer folding strategy is determined based on the inference engine from the multiple set batch normalization layer folding strategies. Different inference engines can support different batch normalization layer folding strategies, or they can support the same batch normalization layer folding strategy.
  • the target batch normalization layer folding strategy can be determined from multiple set batch normalization layer folding strategies according to the inference engine's ability to support the batch normalization layer folding strategy. In this way, the performance of the quantized second network model after being deployed on the deployment hardware using the set inference engine can be further improved.
  • For example, the correspondence between inference engines and batch normalization layer folding strategies can be determined in advance, and by querying this correspondence based on the set inference engine, the target batch normalization layer folding strategy can be determined from the multiple set batch normalization layer folding strategies.
  • In the embodiment of the present application, the set batch normalization layer folding strategy is obtained; based on the batch normalization layer folding strategy, each batch normalization layer in the first network model is folded into the convolutional layer that the batch normalization layer depends on to obtain the folded first network model, and each of the processing layers in the folded first network model is quantized according to the quantization parameter to obtain the second network model.
  • the convolution layer is quantized after the batch normalization layer is folded, and there will be no additional floating-point operations in the inference process, so that the inference speed of the quantized second network model can be further accelerated.
  • An embodiment of the present application provides a model quantization method, which can be executed by a processor of a computer device. As shown in Figure 4, the method includes:
  • Step S401 acquiring the first network model to be quantized.
  • Step S402 based on the set deployment configuration information, determine at least one processing layer to be quantized in the first network model and a quantization parameter for quantizing each processing layer.
  • the above-mentioned steps S401 to S402 correspond to the above-mentioned steps S101 to S102 respectively, and the specific implementation manners of the above-mentioned steps S101 to S102 can be referred to for implementation.
  • Step S403 based on the set quantization algorithm and the first training data set, quantize each of the processing layers in the first network model according to the quantization parameters to obtain a second network model.
  • the quantization algorithm can be a post-training quantization algorithm or a quantization-aware training algorithm, which is not limited here.
  • the first training data set may be an appropriate training data set determined in advance according to the target task of the second network model, and may be an image data set, a point cloud data set, or voice data, etc., which is not limited here.
  • In some embodiments, the quantization algorithm is a post-training quantization algorithm. Based on the post-training quantization algorithm, each of the processing layers in the first network model is quantized according to the quantization parameters to obtain a quantized second network model; based on the first training data set, the model parameters in the quantized second network model are calibrated to obtain the calibrated second network model.
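  • A minimal sketch of one possible calibration step, assuming a simple min/max observer that collects activation statistics on the first training data set and turns them into a scale and zero point; the observer design and names are assumptions, not the calibration procedure fixed by this embodiment:

```python
import torch

class MinMaxObserver:
    """Collect running min/max of a tensor and derive an asymmetric scale/zero-point."""
    def __init__(self, bit_width: int = 8):
        self.qmin, self.qmax = 0, 2 ** bit_width - 1
        self.min_val, self.max_val = float("inf"), float("-inf")

    def observe(self, x: torch.Tensor):
        self.min_val = min(self.min_val, x.min().item())
        self.max_val = max(self.max_val, x.max().item())

    def scale_zero_point(self):
        # Assumes max_val > min_val; a real observer would guard against a zero range.
        scale = (self.max_val - self.min_val) / (self.qmax - self.qmin)
        zero_point = round(self.qmin - self.min_val / scale)
        return scale, int(zero_point)

# Calibration loop (sketch): run a few batches of the first training data set through
# the quantized model so that observers attached to each layer record activation ranges.
# for images, _ in calibration_loader:
#     model(images)
```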
  • In some embodiments, the quantization algorithm is a quantization-aware training algorithm. Based on the quantization-aware training algorithm and the first training data set, the parameters of each of the processing layers in the first network model can be subjected to at least one round of quantization-aware training according to the quantization parameters to obtain the trained quantized second network model.
  • In some embodiments, before quantizing the first network model, the first network model may be pre-trained, and the pre-trained first network model may be used as the first network model to be quantized.
  • each processing layer to be quantized in the first network model is quantized according to quantization parameters to obtain the second network model. In this way, the set quantization algorithm can be effectively reproduced.
  • the quantization algorithm includes a quantization-aware training algorithm, and the above step S403 may also include:
  • Step S411 setting a pseudo-quantizer for each of the processing layers in the first network model according to the quantization parameters to obtain a third network model.
  • the pseudo-quantizer can perform quantization simulation during the quantization-aware training process to facilitate the network to perceive the loss caused by quantization, so that a pseudo-quantizer can be set for each processing layer to be quantized in the first network model.
  • During implementation, the structure of the pseudo-quantizer can be determined based on the quantization parameters; it can be a symmetric quantizer or an asymmetric quantizer, a uniform quantizer or a non-uniform quantizer, a learning-based quantizer or a rule-based quantizer, or a quantizer that directly uses heuristics to calculate the quantization step size, which is not limited here.
  • the first network model in which the pseudo-quantizer is set may be determined as the third network model.
  • Step S412 based on the set quantization-aware training algorithm and the first training data set, perform at least one quantization-aware training on the parameters of each processing layer in the third network model to obtain a second network model.
  • one quantization-aware training algorithm may be set from multiple preset quantization-aware training algorithms.
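  • Only as a hedged sketch, setting a pseudo-quantizer for each processing layer and then running quantization-aware training could look roughly as follows. `FakeQuantize` refers to the module sketched after formula (3) later in this document, the wrapper and its names are assumptions, and the training settings are generic placeholders rather than the specific LSQ/PACT/DSQ procedures named above:

```python
import torch
import torch.nn as nn

def attach_fake_quantizers(model: nn.Module, make_fq):
    """Wrap every Conv2d/Linear with a pseudo-quantizer on its input
    (a simplified stand-in for constructing the third network model)."""
    class Wrapped(nn.Module):
        def __init__(self, layer, fq):
            super().__init__()
            self.layer, self.fq = layer, fq
        def forward(self, x):
            return self.layer(self.fq(x))

    for name, child in model.named_children():
        if isinstance(child, (nn.Conv2d, nn.Linear)):
            setattr(model, name, Wrapped(child, make_fq()))
        else:
            attach_fake_quantizers(child, make_fq)
    return model

# Quantization-aware training loop (sketch): the fake quantizers simulate quantization
# in the forward pass while gradients update the full-precision parameters.
# optimizer = torch.optim.SGD(model.parameters(), lr=0.004, momentum=0.9, nesterov=True)
# for images, labels in train_loader:
#     loss = nn.functional.cross_entropy(model(images), labels)
#     optimizer.zero_grad(); loss.backward(); optimizer.step()
```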
  • the quantization parameters include preset precision of quantization scale, quantization symmetry, quantization bit width and quantization granularity, the quantization symmetry includes symmetric quantization or asymmetric quantization, and the quantization granularity includes hierarchical quantization or Feature-level quantization.
  • the pseudo-quantizer is configured to perform the following steps S421 to S424:
  • Step S421 Determine the quantized value range of the processing layer parameter based on the quantized bit width.
  • the quantization bit width is the bit width of the integer data obtained by quantizing the floating-point parameters during the training process of the parameters of each processing layer to be quantized in the third network model, such as 8 bits, 4 bits , 3 bits, 2 bits, etc.
  • the quantized bit width can be determined according to the set deployment configuration information, or can be set directly by the user. Different processing layers in the third network model may use the same quantization bit width or different quantization bit widths.
  • During implementation, the processing layer parameters can be one or more parameters to be quantized among the weight values, activation values, input data, output data, etc. of the processing layer to be quantized, and the quantized value range of the processing layer parameter is the range of values of the parameter after quantization.
  • the quantized value range of the processing layer parameter can be determined based on the quantization bit width.
  • the processing layer parameter can include a weight value and an activation value.
  • For a quantization bit width of k bits, the weight value can be quantized as signed integer values in the range [-2^(k-1), 2^(k-1) - 1], and the activation value can be quantized as unsigned integer values in the range [0, 2^k - 1]; therefore, the quantized value range of the weight value may be [-2^(k-1), 2^(k-1) - 1], and the quantized value range of the activation value may be [0, 2^k - 1].
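  • For example, a small helper under the assumptions above (the function name is illustrative, and the unsigned upper bound follows the common k-bit convention):

```python
def quantized_value_range(bit_width: int, signed: bool):
    """Return (N_min, N_max) for a k-bit signed (weight) or unsigned (activation) range."""
    if signed:
        return -(2 ** (bit_width - 1)), 2 ** (bit_width - 1) - 1
    return 0, 2 ** bit_width - 1

# quantized_value_range(8, signed=True)  -> (-128, 127)
# quantized_value_range(8, signed=False) -> (0, 255)
```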
  • Step S422 determining a quantization scale that satisfies the preset precision and a quantization zero that satisfies the quantization symmetry.
  • the quantization scale is a coefficient for scaling the full-precision value to be quantized during the quantization process.
  • the preset precision of the quantization scale may include but not limited to one of full precision, power of 2 precision, and the like.
  • Quantization symmetry is used to characterize whether the value range of the full-precision value to be quantized is symmetrical about 0.
  • During quantization, the integer value to which the zero point of the full-precision value is mapped is called the quantization zero point.
  • When the quantization zero point is 0, it means that the value range of the full-precision value to be quantized is symmetric about 0, that is, the uniform quantization is symmetric quantization; when the quantization zero point is not 0, it means that the value range of the full-precision value to be quantized is asymmetric about 0, that is, the uniform quantization is asymmetric quantization.
  • a fixed quantization scale that satisfies the preset accuracy and a fixed quantization zero point that satisfies the quantization symmetry may be set for the pseudo quantizer according to actual conditions. For example, when the preset precision of the quantization scale is full precision, an appropriate full-precision numerical value may be set as the quantization scale for the pseudo quantizer.
  • When the quantization symmetry is symmetric, the quantization zero point can be set to 0; when the quantization symmetry is asymmetric, the quantization zero point can be set to an appropriate non-zero number, such as 1, -2, and so on.
  • In some embodiments, the quantization scale that satisfies the preset precision and the quantization zero point that satisfies the quantization symmetry can also be continuously adjusted during the model training process.
  • Step S423 based on the quantization granularity, within the quantization value range, uniform quantization is performed on the processing layer parameters to be quantized by using the quantization scale and the quantization zero point, to obtain the quantized processing layer parameters.
  • the quantization granularity refers to the granularity of parameters such as the quantization value range, quantization scale, and quantization zero point shared in the quantization network model, which can include hierarchical quantization (that is, tensor-level quantization) or feature-level quantization (that is, channel-level quantization). etc.
  • Layer-wise quantization means that the processing layer parameters to be quantized in the same processing layer share the same quantization value range, quantization scale, quantization zero point, and other parameters; feature-level quantization means that the processing layer parameters to be quantized corresponding to different features (channels) in the same processing layer use different shared quantization value ranges, quantization scales, quantization zero points, and other parameters.
  • Assuming that the quantized value range is [N_min, N_max], where N_min is the smallest quantized value and N_max is the largest quantized value in the range, the quantization scale is s, and the quantization zero point is z, the processing layer parameters to be quantized can be uniformly quantized in the manner shown in the following formula (2):
  • w̄ = clip(round(w / s) + z, N_min, N_max)    (2)
  • where w represents the floating-point value of the processing layer parameter and w̄ is the quantized value of the processing layer parameter; round(·) rounds the input value to the nearest integer; and the function clip(x, N_min, N_max) limits x to between N_min and N_max: when x is greater than N_max the value of the function is N_max, when x is less than N_min the value of the function is N_min, and otherwise the value of the function is x.
  • Step S424 based on the quantization scale and the quantization zero point, perform inverse uniform quantization on the quantized processing layer parameters to obtain the dequantized processing layer parameters.
  • During implementation, the quantized processing layer parameters can be de-quantized (inverse uniform quantization) in the manner shown in the following formula (3):
  • ŵ = s · (w̄ − z)    (3)
  • where ŵ is the de-quantized processing layer parameter.
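  • Putting formulas (2) and (3) together, a pseudo-quantizer can be sketched as a simplified per-tensor quantize-then-de-quantize module with a fixed scale and zero point. The class and parameter names are assumptions; real implementations such as the LSQ-style quantizers mentioned above additionally learn the scale and use a straight-through estimator for the rounding operation:

```python
import torch
import torch.nn as nn

class FakeQuantize(nn.Module):
    """Quantize then de-quantize a tensor per formulas (2) and (3):
    q = clip(round(w / s) + z, N_min, N_max);  w_hat = s * (q - z)."""
    def __init__(self, scale: float, zero_point: int, n_min: int, n_max: int):
        super().__init__()
        self.s, self.z = scale, zero_point
        self.n_min, self.n_max = n_min, n_max

    def forward(self, w: torch.Tensor) -> torch.Tensor:
        q = torch.clamp(torch.round(w / self.s) + self.z, self.n_min, self.n_max)
        return self.s * (q - self.z)

# Example: 8-bit asymmetric fake quantization of an activation tensor.
fq = FakeQuantize(scale=0.02, zero_point=0, n_min=0, n_max=255)
x_hat = fq(torch.randn(2, 3))
```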
  • the quantization parameters for quantizing each processing layer in the first network model can be determined based on the set deployment configuration information, and the quantization parameters include the preset precision of the quantization scale, quantization symmetry, quantization bit width and quantization Granularity, the quantization symmetry includes symmetric quantization or asymmetric quantization, and the quantization granularity includes hierarchical quantization or feature level quantization.
  • the hardware-aware quantizer can be used to perform model quantization according to the configuration of individual deployment hardware, so that the quantized second network model can better meet the deployment requirements of the deployment hardware.
  • multiple types of quantizers can be supported, so that a deployable second network model can be quantized for more types of deployment hardware.
  • the above step S403 may include:
  • Step S431 determining preset training hyperparameters corresponding to the neural network structure adopted by the first network model; wherein, for each piece of deployment configuration information in the preset multiple pieces of deployment configuration information, the training hyperparameters are the same.
  • During implementation, the training hyperparameters may include, but are not limited to, one or more of the fine-tuning duration (number of epochs), the learning rate, the parameter optimization algorithm, the weight decay, and the like.
  • the preset multiple deployment configuration information may include at least two preset deployment configuration information, which is not limited here.
  • That is, the same training hyperparameters are used in the process of quantization training for network models adopting the same neural network structure; for different deployment configuration information, the training hyperparameters used are also the same.
  • During implementation, a set of suitable training hyperparameters for at least one neural network structure can be determined in advance through experiments or analysis, and based on the neural network structure adopted by the first network model, the preset training hyperparameters corresponding to that neural network structure can be determined. Those skilled in the art may determine appropriate training hyperparameters for at least one neural network structure according to actual conditions, which is not limited in this embodiment of the present application.
  • Table 1 provides an example of training hyperparameters preset for the neural network structures ResNet-18, ResNet-50, EffNet, MbV2, and RegNet. For the first network model using ResNet-18, the preset learning rate is 0.004, the weight decay is 10^-4, the batch size is 64, and the number of graphics processing units (GPUs) is 8; for the first network model using ResNet-50, the preset learning rate is 0.004, the weight decay is 10^-4, the batch size is 16, and the number of GPUs is 16; for the first network models using EffNet and MbV2, the same training hyperparameters can be preset: the learning rate is 0.01, the weight decay is 10^-5 *, the batch size is 32, and the number of GPUs is 16; for the first network model using RegNet, the preset learning rate is 0.004, the weight decay is 4×10^-5, the batch size is 32, and the number of GPUs is 16. Among them, * represents that the weight
  • Table 1 Example of training hyperparameters corresponding to different neural network structures

  Network structure | Learning rate | Weight decay | Batch size | Number of GPUs
  ResNet-18         | 0.004         | 10^-4        | 64         | 8
  ResNet-50         | 0.004         | 10^-4        | 16         | 16
  EffNet / MbV2     | 0.01          | 10^-5 *      | 32         | 16
  RegNet            | 0.004         | 4×10^-5      | 32         | 16
  • a unified data preprocessing pipeline can be used for the training data, including random resized cropping to 224 resolution, random horizontal flipping, and color jitter of the image, for example a brightness offset of 0.2, a contrast offset of 0.2, a saturation offset of 0.2, and a hue offset of 0.1.
  • the test data is center-cropped to 224 resolution, and regularization is added using label smoothing of 0.1. All models are trained for 100 epochs (one epoch meaning that all training samples are forward-propagated and back-propagated once), and a linear warm-up is performed in the first epoch.
  • the learning rate is decayed with a cosine annealing strategy. The models are trained using the SGD optimizer and updated with Nesterov momentum, with a momentum parameter of 0.9.
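  • The preprocessing and optimization settings described above can be sketched with standard PyTorch/torchvision APIs (assuming a recent PyTorch; the linear warm-up in the first epoch would be layered on top of the cosine schedule and is omitted here, and the resize to 256 before the center crop is an assumption).

```python
import torch
from torch import nn, optim
from torchvision import transforms

# Training-data preprocessing: random resized crop to 224, random horizontal flip,
# and color jitter (brightness 0.2, contrast 0.2, saturation 0.2, hue 0.1).
train_tf = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2, hue=0.1),
    transforms.ToTensor(),
])

# Test-data preprocessing: center crop to 224 resolution.
test_tf = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
])

def build_training(model: nn.Module, lr: float, weight_decay: float, epochs: int = 100):
    # Label smoothing of 0.1, SGD with Nesterov momentum 0.9, cosine annealing over all epochs.
    criterion = nn.CrossEntropyLoss(label_smoothing=0.1)
    optimizer = optim.SGD(model.parameters(), lr=lr, momentum=0.9,
                          nesterov=True, weight_decay=weight_decay)
    scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)
    return criterion, optimizer, scheduler
```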
  • Step S432, using the set first training data set, quantizing each of the processing layers in the first network model according to the quantization parameters, based on the quantization algorithm and the training hyperparameters, to obtain the second network model.
  • in this way, unified training hyperparameters are used, so that model training techniques can be shared among the various first network models with the same neural network structure and the various quantization algorithms, so that different quantization algorithms can be better reproduced and the accuracy of the quantization algorithms can be improved.
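  • The unified fine-tuning procedure can be sketched as a standard quantization-aware training loop; this is a schematic outline assuming the quantized model already contains fake-quantize nodes, not the library's actual training code.

```python
import torch

def finetune_quantized(qmodel, train_loader, criterion, optimizer, scheduler,
                       epochs: int, device: str = "cuda"):
    """Quantization-aware fine-tuning: the same loop and the same hyperparameters are
    reused for every quantization algorithm applied to a given network structure."""
    qmodel.to(device).train()
    for _ in range(epochs):
        for images, targets in train_loader:
            images, targets = images.to(device), targets.to(device)
            optimizer.zero_grad()
            loss = criterion(qmodel(images), targets)  # forward through fake-quantized layers
            loss.backward()                            # gradients pass the quantizers via STE
            optimizer.step()
        scheduler.step()
    return qmodel
```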
  • An embodiment of the present application provides a model quantization method, which can be executed by a processor of a computer device. As shown in Figure 5, the method includes:
  • Step S501 based on at least one type of deployment configuration information, adjust the processing layers in the set neural network structure to obtain at least one adjusted neural network structure.
  • the preset neural network structure may be preset by the user according to the actual situation, or may be a default, which is not limited here.
  • the at least one piece of deployment configuration information may be one or more pieces of deployment configuration information preset by the user or set by default. Because deployment hardware differs, there are differences in the quantization support capabilities for the different processing layers in a neural network structure. During implementation, for each piece of deployment configuration information, according to the actual quantization support of the deployment hardware corresponding to that deployment configuration information for the different processing layers in the neural network structure, an appropriate method can be used to adjust at least one processing layer in the set neural network structure to obtain an adjusted neural network structure.
  • for example, the squeeze-and-excitation blocks in the network structure can be removed, and the swish activation layer can be replaced with a ReLU6 (Rectified Linear Unit 6) layer, giving the lightweight (Lite) version of EfficientNet, so that better integer-value support can be obtained on the deployment hardware.
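  • A sketch of this kind of hardware-driven structural adjustment is shown below; it only swaps activations that quantize poorly for ReLU6 (removing squeeze-and-excitation blocks requires knowledge of the concrete architecture and is omitted), and the function name is illustrative.

```python
import torch.nn as nn

def adjust_for_deployment(model: nn.Module) -> nn.Module:
    """Replace swish-style activations (SiLU/Hardswish) with ReLU6, mirroring the
    EfficientNet -> EfficientNet-Lite style adjustment described above."""
    for name, child in model.named_children():
        if isinstance(child, (nn.SiLU, nn.Hardswish)):
            setattr(model, name, nn.ReLU6(inplace=True))
        else:
            adjust_for_deployment(child)  # recurse into submodules
    return model
```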
  • Step S502 Create at least one first network model based on at least one adjusted neural network structure.
  • a first network model may be created for each neural network structure in the at least one adjusted neural network structure.
  • those skilled in the art can create an appropriate first network model based on the adjusted neural network structure according to actual business requirements, which is not limited here.
  • Step S503, based on the preset model parameters corresponding to the set neural network structure, initializing the parameters of the at least one first network model to obtain at least one initialized first network model.
  • in this way, the parameters of each first network model can be initialized with unified preset model parameters, obtaining at least one initialized first network model.
  • the preset model parameters may include preset initial values of parameters in the first network model, or may include trained model parameters obtained after pre-training the first network model, which is not limited here.
  • Step S504 based on the set deployment configuration information, determine a first network model to be quantified from the at least one initialized first network model.
  • each type of deployment configuration information may correspond to one initialized first network model; based on the set deployment configuration information, the initialized first network model corresponding to that deployment configuration information can be determined, and this initialized first network model is determined as the first network model to be quantized.
  • Step S505 based on the set deployment configuration information, determine at least one processing layer to be quantized in the first network model and a quantization parameter for quantizing each processing layer.
  • Step S506 performing quantization on each of the processing layers in the first network model according to the quantization parameter to obtain a second network model.
  • the above-mentioned steps S505 to S506 correspond to the above-mentioned steps S102 to S103 respectively, and the specific implementation manners of the above-mentioned steps S102 to S103 can be referred to for implementation.
  • in this way, the processing layers in the set neural network structure are adjusted to obtain at least one adjusted neural network structure; based on the at least one adjusted neural network structure, at least one first network model is created; based on the preset model parameters corresponding to the set neural network structure, the parameters of the at least one first network model are initialized to obtain at least one initialized first network model; and based on the set deployment configuration information, the first network model to be quantized is determined from the at least one initialized first network model.
  • on the one hand, the first network model to be quantized is created based on the set deployment configuration information and on the neural network structure obtained by adjusting the processing layers in the set neural network structure, so that the second network model obtained after quantization can obtain better quantization support after being deployed to the deployment hardware that uses the set deployment configuration information; on the other hand, by using unified preset model parameters to initialize the first network models that use the same neural network structure, the inconsistency of initialization caused by using different initialization methods can be reduced, thereby improving the comparability of the quantization of different neural network models with the same network structure by different quantization algorithms.
  • before the above step S503, the method further includes:
  • Step S511 obtaining a preset pre-training model corresponding to the neural network structure; the structure of the pre-training model before the output layer is the same as the neural network structure.
  • the pre-training model may be any suitable neural network model created in advance based on the neural network structure.
  • Step S512 using the set second training data set to train the parameters of the pre-training model to obtain the trained pre-training model.
  • the second training data set may be a suitable training data set determined in advance according to the target task of the pre-trained model, and may be an image data set, a point cloud data set, or voice data, etc., which is not limited here.
  • Step S513 determining the trained parameters of the pre-training model as the preset model parameters.
  • in this way, a unified pre-training model can be used to pre-train the parameters, and the parameters of the trained pre-training model can be used as the preset model parameters for initializing the parameters of the first network model.
  • the efficiency of model quantization can be improved, and the precision of the quantized second network model can be further improved.
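  • A minimal sketch of this unified initialization is given below; the checkpoint path and helper name are hypothetical, and strict=False only accounts for the output layer possibly differing from the pre-training model.

```python
import copy
import torch

def init_from_pretrained(first_models, pretrain_ckpt_path: str):
    """Initialize every first network model (same structure) from one shared set of
    pre-trained parameters, so that differences between quantization algorithms are
    not caused by different initializations."""
    preset_params = torch.load(pretrain_ckpt_path, map_location="cpu")
    initialized = []
    for model in first_models:
        m = copy.deepcopy(model)
        m.load_state_dict(preset_params, strict=False)  # the output layer may differ
        initialized.append(m)
    return initialized
```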
  • the above step S501 may include:
  • Step S521 determining a target neural network structure from various preset neural network structures.
  • a variety of optional neural network structures can be preset, and the user can determine a suitable target neural network structure from the various preset neural network structures according to actual business needs, which is not limited here.
  • Step S522 based on at least one deployment configuration information, adjust the processing layer in the target neural network structure to obtain at least one adjusted neural network structure.
  • various optional neural network structures can be provided for creating the initial first network model, so that different service requirements of users can be better supported.
  • the embodiment of the present application provides a reproducible and deployable model quantization algorithm library (hereinafter referred to as MQBench), which can be used to evaluate and analyze the reproducibility and deployability of the model quantization algorithm.
  • MQBench reproducible and deployable model quantization algorithm library
  • MQBench provides a variety of deployment hardware types to choose from for deploying quantized models in practical applications, including the central processing unit (CPU), the GPU, the application-specific integrated circuit (ASIC), and the digital signal processor (DSP), and evaluates a large number of state-of-the-art quantization algorithms under a unified training configuration.
  • Users can use MQBench to quantize a trained full-precision network model for tasks such as image classification and object detection, and obtain a quantized network model that can be deployed on the target hardware.
  • the user only needs to provide the corresponding training data set, the deployment configuration information of the target hardware (such as the deployment hardware type, the inference engine used by the deployment hardware type, the quantization bit width corresponding to the deployment hardware type, etc.), and the configuration information of the quantization algorithm (such as the quantization algorithm, the fine-tuning duration, the number of fine-tuning epochs, the training hyperparameters, etc.).
  • MQBench can be implemented using the PyTorch deep learning engine and supports the torch.fx (also known as FX) feature.
  • FX includes a symbolic tracer, an intermediate representation, and Python code generation, allowing deeper metaprogramming.
  • the quantization algorithm and hardware-aware configuration can be implemented in MQBench, and the full-precision network model can be converted into a quantized network model through an application programming interface (Application Programming Interface, API) call.
  • model_qconfig = get_qconfig(**qparams, **backend_params);
  • foldbn_config = get_foldbn_config(foldbn_strategy);
  • qModel = quantize_fx.prepare_qat_fx(model, {"": model_qconfig}, foldbn_config).
  • the quantized network model qModel can then be fine-tuned, calibrated, and optimized.
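  • The flow that such an API call wraps can be sketched with PyTorch's own FX-based quantization tooling; this is only an analogous example using the default qconfig mapping (a hardware-aware quantizer would substitute engine-specific settings), assuming a recent PyTorch and torchvision.

```python
import torch
import torchvision.models as models
from torch.ao.quantization import get_default_qat_qconfig_mapping
from torch.ao.quantization.quantize_fx import prepare_qat_fx, convert_fx

# Full-precision first network model.
model = models.resnet18(weights=None).train()

# Default QAT configuration for the fbgemm (x86 CPU) backend.
qconfig_mapping = get_default_qat_qconfig_mapping("fbgemm")
example_inputs = (torch.randn(1, 3, 224, 224),)

# Trace the model with torch.fx, fuse Conv-BN where applicable, and insert fake-quantize nodes.
qmodel = prepare_qat_fx(model, qconfig_mapping, example_inputs)

# ... fine-tune / calibrate qmodel here ...

# Convert the fake-quantized model into an actually quantized model.
qmodel.eval()
quantized_model = convert_fx(qmodel)
```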
  • MQBench is like a bridge, connecting quantization algorithms and deployment hardware.
  • Figure 6 is a schematic diagram of the application scenario of MQBench provided by the embodiment of the present application.
  • MQBench 60 mainly provides the reproducibility 61 of the quantization algorithm and the deployability 62 of the hardware platform, and the reproducibility of the quantization algorithm 61 can support multiple quantization algorithms 70, including quantization-aware training algorithms 71 and post-training quantization algorithms 72, and the deployability 62 of the hardware platform can support the deployment of quantization algorithms on different deployment hardware 80, including CPU 81, GPU 82, ASIC 83, DSP 84.
  • Hardware-aware quantizer: for different hardware (such as CPUs, GPUs, ASICs, and DSPs), MQBench provides matching support for the computation-graph mode of the inference engine library (such as TVM, TensorRT, ACL, and SNPE) used by the hardware, and can automatically match the insertion positions of the quantization nodes in the computation graph based on the set inference engine library.
  • MQBench supports five general-purpose software libraries (that is, inference engines), including TensorRT for graphics processing unit (GPU) inference, ACL for application-specific integrated circuit (ASIC) inference, SNPE for mobile digital signal processor (DSP) inference, TVM for ARM central processing unit (CPU) inference, and FBGEMM for x86 server-side CPU inference.
  • Each inference engine corresponds to a quantizer. Users can select an appropriate inference engine from these five inference engines for model deployment according to actual application scenarios.
  • Based on the quantizer corresponding to the selected inference engine, MQBench can determine at least one processing layer to be quantized in the full-precision network model and the corresponding hardware-aware quantization parameters.
  • MQBench reproduces various current SOTA (state-of-the-art) quantization algorithms, including the learning-based LSQ, APoT, Quantization Interval Learning (QIL), and PACT algorithms, and the rule-based DSQ, LQ-Net, and DoReFa strategies. Users can select an appropriate quantization algorithm from the multiple quantization algorithms reproduced by MQBench for model quantization according to the actual application scenario. MQBench quantizes the full-precision network model to be quantized according to the selected quantization algorithm.
  • Neural network structure: the neural network structures supported by MQBench include ResNet-18, ResNet-50, MobileNetV2, EfficientNet (the Lite version of EfficientNet is used, with the swish activation replaced by ReLU6 to obtain better integer-value support on the hardware), and RegNetX-600MF with group convolution.
  • Quantization bit width: MQBench supports multiple quantization bit widths such as 8 bits, 4 bits, 3 bits, and 2 bits. In some implementations, a quantization bit width of 8 bits may be used for post-training quantization algorithms, and a quantization bit width of 4 bits may be used for quantization-aware training algorithms.
  • Training settings: in MQBench, fine-tuning is used for parameter training for all quantization algorithms. For full-precision network models using the same neural network structure, a unified pre-training model is used for parameter initialization, which reduces the inconsistency introduced in the initialization stage.
  • MQBench has optimized the deployability of model quantification as follows:
  • BN layer folding: MQBench supports five BN layer folding strategies, and supports folding the parameters of a BN layer into the corresponding convolutional layer according to the configured BN layer folding strategy. Users can choose an appropriate strategy from these five BN layer folding strategies according to the actual application scenario.
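  • The standard form of such folding merges the BN statistics and affine coefficients into the preceding convolution; the sketch below folds the running statistics (strategies that use current-batch statistics or re-estimate BN during training differ in which statistics they take).

```python
import torch
import torch.nn as nn

@torch.no_grad()
def fold_bn_into_conv(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    """Fold a BN layer into the convolutional layer it depends on:
    w_fold = w * gamma / sqrt(var + eps)
    b_fold = beta + (b - mean) * gamma / sqrt(var + eps)"""
    gamma, beta = bn.weight, bn.bias
    mean, var, eps = bn.running_mean, bn.running_var, bn.eps
    std = torch.sqrt(var + eps)

    fused = nn.Conv2d(conv.in_channels, conv.out_channels, conv.kernel_size,
                      conv.stride, conv.padding, conv.dilation, conv.groups, bias=True)
    fused.weight.copy_(conv.weight * (gamma / std).reshape(-1, 1, 1, 1))
    bias = conv.bias if conv.bias is not None else torch.zeros_like(mean)
    fused.bias.copy_(beta + (bias - mean) * gamma / std)
    return fused
```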
  • Computation graph of the block structure: the model quantization schemes in the related art only consider quantizing the inputs and weights of convolutional or fully connected layers.
  • however, a neural network architecture can also include other operations, such as the element-wise addition in the ResNet architecture and the concatenation in the InceptionV3 architecture.
  • in MQBench, different computation-graph optimization levels are considered for different inference engines, and the insertion positions of the quantization nodes in the computation graph are automatically matched based on the set inference engine, so that a quantized neural network computation graph corresponding to the respective computation-graph optimization level is constructed.
  • Using a hardware-aware quantizer can improve the deployability of the quantized network model and its accuracy in actual deployment scenarios.
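  • A schematic residual block with whole-graph quantization is sketched below: both operands of the element-wise addition carry fake-quantize nodes, not just the convolution inputs and weights. The class is illustrative and `fake_quant` stands for any fake-quantizer constructor.

```python
import torch.nn as nn

class QuantizedBasicBlock(nn.Module):
    """Residual block in which fake-quantize nodes cover the whole computation graph."""
    def __init__(self, conv1: nn.Module, conv2: nn.Module, downsample, fake_quant):
        super().__init__()
        self.conv1, self.conv2, self.downsample = conv1, conv2, downsample
        self.relu = nn.ReLU(inplace=True)
        self.fq_branch = fake_quant()    # quantizes the residual-branch output
        self.fq_shortcut = fake_quant()  # quantizes the shortcut before the addition
        self.fq_out = fake_quant()       # quantizes the block output

    def forward(self, x):
        out = self.conv2(self.relu(self.conv1(x)))
        shortcut = self.downsample(x) if self.downsample is not None else x
        out = self.fq_branch(out) + self.fq_shortcut(shortcut)  # both add inputs quantized
        return self.fq_out(self.relu(out))
```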
  • Fig. 7 is a schematic diagram of the composition and structure of a model quantization device provided in the embodiment of the present application.
  • the model quantization device 700 includes: a first acquisition part 710, a first determination part 720 and a quantization part 730, wherein:
  • the first acquiring part 710 is configured to acquire the first network model to be quantified
  • the first determining part 720 is configured to determine at least one processing layer to be quantized in the first network model and a quantization parameter for quantizing each processing layer based on the set deployment configuration information;
  • the quantization part 730 is configured to perform quantization on each of the processing layers in the first network model according to the quantization parameter to obtain a second network model.
  • the first network model includes at least one block structure, and each of the block structures includes at least one processing layer; the first determining part is further configured to: based on the set deployment configuration information, determine At least one processing layer to be quantized in each of the block structures in the first network model and a quantization parameter for quantizing each of the processing layers.
  • the deployment configuration information includes the inference engine used by the deployed hardware type; the first determining part is further configured to: determine the processing layer type to be quantified based on the inference engine; At least one processing layer matching the processing layer type in the network model is determined as the processing layer to be quantized.
  • the processing layer types include a convolutional layer and a batch normalization layer; the first determination part is further configured to: determine at least one batch normalization layer in the first network model and the convolutional layer on which each of the batch normalization layers depends as the processing layers to be quantized; obtain the set batch normalization layer folding strategy; and, based on the batch normalization layer folding strategy, fold each of the batch normalization layers in the first network model into the convolutional layer on which that batch normalization layer depends to obtain the folded first network model;
  • the quantization part is further configured to: quantize each of the processing layers in the folded first network model according to the quantization parameter to obtain the second network model.
  • the batch normalization layer folding strategy includes a batch normalization layer removal status, a coefficient update algorithm, statistical parameters to be incorporated into the weights, and statistical parameters to be incorporated into the offsets;
  • the statistical parameters to be incorporated into the weights include the running statistics of the convolutional layer on which the batch normalization layer depends or the statistics of the current batch, and the statistical parameters to be incorporated into the offsets likewise include the running statistics of the convolutional layer on which the batch normalization layer depends or the statistics of the current batch;
  • the first determination part is further configured to: determine the scaling coefficients and translation coefficients of each of the batch normalization layers in the at least one batch normalization layer in the first network model; update the scaling coefficients and translation coefficients of each of the batch normalization layers based on the coefficient update algorithm to obtain the updated scaling coefficients and translation coefficients of each of the batch normalization layers; and, for each batch normalization layer, obtain the statistical parameters to be incorporated into the weights and the statistical parameters to be incorporated into the offsets in the batch normalization layer, and fold the updated scaling coefficients and translation coefficients of the batch normalization layer, together with these statistical parameters, into the convolutional layer on which the batch normalization layer depends.
  • the first determination part is further configured to: determine a target batch normalization layer folding strategy from multiple set batch normalization layer folding strategies based on the reasoning engine.
  • the quantization part is further configured to: based on the set quantization algorithm and the first training data set, according to the quantization parameters, each of the processing layers in the first network model is Quantify to get the second network model.
  • the quantization parameters include the preset precision of the quantization scale, the quantization symmetry, the quantization bit width, and the quantization granularity, where the quantization symmetry includes symmetric or asymmetric quantization and the quantization granularity includes layer-wise quantization or feature-wise quantization, and the quantization algorithm includes a quantization-aware training algorithm; the quantization part is further configured to: set a pseudo-quantizer for each of the processing layers in the first network model according to the quantization parameters to obtain a third network model; wherein the pseudo-quantizer is configured to: determine the quantization value range of the processing layer parameters based on the quantization bit width; determine a quantization scale satisfying the preset precision and a quantization zero point satisfying the quantization symmetry; and, based on the quantization granularity, perform uniform quantization processing on the processing layer parameters to be quantized within the quantization value range using the quantization scale and the quantization zero point to obtain the quantized processing layer parameters.
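  • A minimal sketch of such a pseudo-quantizer (fake quantizer) as a function is shown below; it follows the steps described above: derive the quantized value range from the bit width, then round, clamp, and de-quantize with the scale and zero point.

```python
import torch

def fake_quantize(x: torch.Tensor, scale: torch.Tensor, zero_point: torch.Tensor,
                  bit_width: int = 8, symmetric: bool = True) -> torch.Tensor:
    """Uniform fake quantization: clamp to the value range implied by the bit width,
    round to integers, then map back to floating point. For feature-level (per-channel)
    granularity, scale and zero_point would be per-channel tensors broadcast over x."""
    if symmetric:
        qmin, qmax = -(2 ** (bit_width - 1)), 2 ** (bit_width - 1) - 1  # e.g. [-128, 127]
    else:
        qmin, qmax = 0, 2 ** bit_width - 1                              # e.g. [0, 255]
    q = torch.clamp(torch.round(x / scale + zero_point), qmin, qmax)
    return (q - zero_point) * scale
```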
  • the quantization part is further configured to: determine preset training hyperparameters corresponding to the neural network structure adopted by the first network model, wherein, for each piece of deployment configuration information among the preset multiple pieces of deployment configuration information, the training hyperparameters are the same; and, using the set first training data set, quantize each of the processing layers in the first network model according to the quantization parameters, based on the quantization algorithm and the training hyperparameters, to obtain the quantized second network model.
  • the first acquisition part is further configured to: adjust the processing layers in the set neural network structure based on at least one piece of deployment configuration information to obtain at least one adjusted neural network structure; create at least one first network model based on the at least one adjusted neural network structure; initialize the parameters of the at least one first network model based on the preset model parameters corresponding to the set neural network structure to obtain at least one initialized first network model; and determine, based on the set deployment configuration information, a first network model to be quantized from the at least one initialized first network model.
  • the device further includes: a second acquisition part configured to acquire a preset pre-training model corresponding to the neural network structure, where the structure of the pre-training model before the output layer is the same as the neural network structure; a pre-training part configured to train the parameters of the pre-training model using the set second training data set to obtain the trained pre-training model; and a second determination part configured to determine the trained parameters of the pre-training model as the preset model parameters.
  • the first acquisition part is further configured to: determine a target neural network structure from a variety of preset neural network structures; and adjust the processing layers in the target neural network structure based on the at least one piece of deployment configuration information to obtain at least one adjusted neural network structure.
  • a "part" may be a part of a circuit, a part of a processor, or a part of a program or software, and of course it may also be a unit; it may be modular or non-modular.
  • if the above-mentioned model quantization method is implemented in the form of a software function module and sold or used as an independent product, it may also be stored in a computer-readable storage medium.
  • the software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the methods described in the embodiments of the present application.
  • the aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a magnetic disk, or an optical disk.
  • embodiments of the present application are not limited to any specific combination of hardware and software.
  • An embodiment of the present application provides a computer device, including a memory and a processor, the memory stores a computer program that can run on the processor, and the processor implements the steps in the above method when executing the program.
  • An embodiment of the present application provides a computer-readable storage medium, on which a computer program is stored, and when the computer program is executed by a processor, the steps in the above method are implemented.
  • the computer readable storage medium may be transitory or non-transitory.
  • An embodiment of the present application provides a computer program, the computer program including computer-readable code, and when the computer-readable code is run in a computer device, a processor in the computer device executes some or all of the steps of the above method.
  • An embodiment of the present application provides a computer program product.
  • the computer program product includes a non-transitory computer-readable storage medium storing a computer program, and when the computer program is read and executed by a computer, some or all of the steps of the above methods are implemented.
  • the computer program product can be specifically realized by means of hardware, software or a combination thereof.
  • the computer program product is embodied as a computer storage medium, and in other embodiments, the computer program product is embodied as a software product, such as a software development kit (Software Development Kit, SDK) and the like.
  • FIG. 8 is a schematic diagram of a hardware entity of a computer device in the embodiment of the present application.
  • the hardware entity of the computer device 800 includes a processor 801, a communication interface 802, and a memory 803, where the processor 801 generally controls the overall operation of the computer device 800.
  • the communication interface 802 enables the computer device to communicate with other terminals or servers over a network.
  • the memory 803 is configured to store instructions and applications executable by the processor 801, and can also cache data to be processed or already processed by the processor 801 and the various modules in the computer device 800 (for example, image data, audio data, voice communication data, and video communication data); it can be implemented by a flash memory (FLASH) or a random access memory (RAM). Data may be transferred among the processor 801, the communication interface 802, and the memory 803 through the bus 804.
  • the disclosed devices and methods may be implemented in other ways.
  • the device embodiments described above are only illustrative.
  • the division of the units is only a logical function division.
  • the coupling, direct coupling, or communication connection between the components shown or discussed may be through some interfaces, and the indirect coupling or communication connection of devices or units may be in electrical, mechanical, or other forms.
  • the units described above as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units; they may be located in one place or distributed over multiple network units; some or all of the units may be selected according to actual needs to achieve the purpose of the solutions of the embodiments.
  • each functional unit in each embodiment of the present application may be integrated into one processing unit, each unit may exist separately as a single unit, or two or more units may be integrated into one unit; the above-mentioned integrated unit can be implemented in the form of hardware or in the form of hardware plus a software functional unit.
  • if the above-mentioned integrated units in the embodiments of the present application are implemented in the form of software function modules and sold or used as independent products, they may also be stored in a computer-readable storage medium.
  • the computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute all or part of the methods described in the various embodiments of the present application.
  • the aforementioned storage medium includes various media capable of storing program codes such as removable storage devices, ROMs, magnetic disks or optical disks.
  • the embodiments of the present application disclose a model quantization method, apparatus, device, storage medium, computer program product, and computer program, wherein the method includes: acquiring a first network model to be quantized; determining, based on set deployment configuration information, at least one processing layer to be quantized in the first network model and quantization parameters for quantizing each of the processing layers; and quantizing each of the processing layers in the first network model according to the quantization parameters to obtain a second network model.
  • the deployment configuration information of the hardware platform on which the model is deployed can be fully considered during the model quantification process of the first network model, so as to obtain the second network model deployable on the corresponding hardware platform.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)
  • Image Analysis (AREA)

Abstract

A model quantization method and apparatus, a device, a storage medium, a computer program product, and a computer program. The method comprises: obtaining a first network model to be quantized (S101); on the basis of set deployment configuration information, determining at least one processing layer to be quantized in the first network model and a quantization parameter for quantizing each processing layer (S102); and quantizing each processing layer in the first network model according to the quantization parameter, so as to obtain a second network model (S103).

Description

Model quantization method, apparatus, device, storage medium, computer program product, and computer program
Cross-Reference to Related Applications
The embodiments of this application are based on the Chinese patent application with application number 202111030764.1, filed on September 3, 2021 and entitled "Model quantization method, apparatus, device, storage medium and computer program product", and claim the priority of that Chinese patent application, the entire content of which is hereby incorporated into this application by reference.
Technical Field
The embodiments of the present application relate to, but are not limited to, the field of artificial intelligence, and in particular relate to a model quantization method, apparatus, device, storage medium, computer program product, and computer program.
Background Art
Modern deep learning techniques pursue higher performance by consuming more memory and computing power. Although large models can be trained in the cloud, directly deploying them on edge devices is very difficult because computing resources (including latency, energy, and memory) are limited. Techniques such as model quantization, pruning, distillation, lightweight network design, and weight matrix factorization can accelerate the inference of deep models. Among them, model quantization quantizes the weights and activation values in a neural network from the original floating-point type to low-bit-width (such as 8-bit, 4-bit, 3-bit, or 2-bit) integers. After quantization, the storage space required by the quantized neural network model is reduced, and the computation changes from the original floating-point operations to cheaper low-bit-width integer operations.
In the related art, model quantization work often cannot be put into practical application, and the obtained quantized neural network models usually cannot be deployed on hardware.
Summary of the Invention
In view of this, embodiments of the present application provide a model quantization method, apparatus, device, storage medium, computer program product, and computer program.
The technical solutions of the embodiments of the present application are implemented as follows:
An embodiment of the present application provides a model quantization method, the method including:
acquiring a first network model to be quantized;
determining, based on set deployment configuration information, at least one processing layer to be quantized in the first network model and a quantization parameter for quantizing each of the processing layers; and
quantizing each of the processing layers in the first network model according to the quantization parameter to obtain a second network model.
An embodiment of the present application provides a model quantization apparatus, the apparatus including:
a first acquisition part configured to acquire a first network model to be quantized;
a first determination part configured to determine, based on set deployment configuration information, at least one processing layer to be quantized in the first network model and a quantization parameter for quantizing each of the processing layers; and
a quantization part configured to quantize each of the processing layers in the first network model according to the quantization parameter to obtain a second network model.
An embodiment of the present application provides a computer device, including a memory and a processor, where the memory stores a computer program executable on the processor, and the processor implements some or all of the steps of the above method when executing the program.
An embodiment of the present application provides a computer-readable storage medium on which a computer program is stored, and when the computer program is executed by a processor, some or all of the steps of the above method are implemented.
An embodiment of the present application provides a computer program, including computer-readable code, and when the computer-readable code runs in a computer device, a processor in the computer device executes some or all of the steps of the above method.
An embodiment of the present application provides a computer program product, the computer program product including a non-transitory computer-readable storage medium storing a computer program, and when the computer program is read and executed by a computer, some or all of the steps of the above method are implemented.
In the embodiments of the present application, a first network model to be quantized is acquired; based on set deployment configuration information, at least one processing layer to be quantized in the first network model and a quantization parameter for quantizing each of the processing layers are determined; and each of the processing layers in the first network model is quantized according to the quantization parameter to obtain a second network model. In this way, since the processing layers to be quantized in the first network model and the quantization parameters for quantizing each processing layer to be quantized are determined based on the set deployment configuration information, the deployment configuration information of the hardware platform on which the model is to be deployed is fully considered during model quantization, so that the obtained second network model is deployable on the corresponding hardware platform.
Brief Description of the Drawings
Fig. 1 is a schematic flowchart of the implementation of a model quantization method provided by an embodiment of the present application;
Fig. 2A is a schematic flowchart of the implementation of a model quantization method provided by an embodiment of the present application;
Fig. 2B is a schematic diagram of inserting quantization nodes into the computation graph of a basic block structure provided by an embodiment of the present application;
Fig. 2C is a schematic diagram of inserting quantization nodes into the computation graph of a basic block structure provided by an embodiment of the present application;
Fig. 2D is a schematic diagram of inserting quantization nodes into the computation graph of a basic block structure provided by an embodiment of the present application;
Fig. 3A is a schematic flowchart of the implementation of a model quantization method provided by an embodiment of the present application;
Fig. 3B is a schematic diagram of the implementation of a batch normalization layer folding strategy provided by an embodiment of the present application;
Fig. 3C is a schematic diagram of the implementation of a batch normalization layer folding strategy provided by an embodiment of the present application;
Fig. 3D is a schematic diagram of the implementation of a batch normalization layer folding strategy provided by an embodiment of the present application;
Fig. 3E is a schematic diagram of the implementation of a batch normalization layer folding strategy provided by an embodiment of the present application;
Fig. 3F is a schematic diagram of the implementation of a batch normalization layer folding strategy provided by an embodiment of the present application;
Fig. 4 is a schematic flowchart of the implementation of a model quantization method provided by an embodiment of the present application;
Fig. 5 is a schematic flowchart of the implementation of a model quantization method provided by an embodiment of the present application;
Fig. 6 is a schematic diagram of an application scenario of MQBench provided by an embodiment of the present application;
Fig. 7 is a schematic diagram of the composition and structure of a model quantization apparatus provided by an embodiment of the present application;
Fig. 8 is a schematic diagram of a hardware entity of a computer device provided by an embodiment of the present application.
Detailed Description
In order to make the purpose, technical solutions, and advantages of this application clearer, the technical solutions of this application are further elaborated below with reference to the accompanying drawings and embodiments. The described embodiments should not be regarded as limiting this application, and all other embodiments obtained by those of ordinary skill in the art without creative work fall within the protection scope of this application.
In the following description, "some embodiments" describes a subset of all possible embodiments; it can be understood that "some embodiments" may be the same subset or different subsets of all possible embodiments, and may be combined with one another without conflict. In the following description, the terms "first/second/third" are only used to distinguish similar objects and do not represent a specific ordering of the objects. It can be understood that, where permitted, the specific order or sequence of "first/second/third" may be interchanged so that the embodiments of the application described herein can be implemented in an order other than that illustrated or described herein.
Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by those skilled in the technical field to which this application belongs. The terms used herein are only for the purpose of describing this application and are not intended to limit this application.
In order to better understand the embodiments of the present application, the model quantization solutions in the related art are first described. In the related art, model quantization solutions often cannot be practically applied and deployed because they ignore the requirements of hardware deployment. On the one hand, after a model is deployed on a hardware platform, the hardware platform usually optimizes the computation of the batch normalization (BN) layer into the convolutional layer to avoid extra overhead, whereas the BN layer in the related art is kept intact. On the other hand, the related art only considers quantizing the input parameters and weight parameters of the convolutional layer, but when the model is deployed, the entire computation graph of the neural network model should be quantized, that is, the input parameters, weight parameters, and so on of processing layers other than the convolutional layer also need to be quantized. Therefore, the model quantization solutions in the related art will inevitably reduce the deployability of the quantization algorithms. In addition, since different quantization algorithms have different deployability on different hardware platforms, it is also impossible in academic research to measure the performance and robustness of different quantization algorithms under different hardware and quantization methods.
An embodiment of the present application provides a model quantization method, which can be executed by a processor of a computer device. The computer device may be a server, a notebook computer, a tablet computer, a desktop computer, a smart TV, a set-top box, a mobile device (such as a mobile phone, a portable video player, a personal digital assistant, a dedicated messaging device, or a portable game device), or another device with data processing capability. Fig. 1 is a schematic flowchart of the implementation of a model quantization method provided by an embodiment of the present application. As shown in Fig. 1, the method includes:
Step S101, acquiring a first network model to be quantized.
Here, the first network model may be any suitable neural network model to be quantized and may be a full-precision neural network model. For example, the first network model may be a neural network model with 32-bit floating-point parameters or 16-bit floating-point parameters; of course, this embodiment does not limit the number of floating-point bits of the first network model. During implementation, the first network model may adopt any suitable neural network structure, including but not limited to one or more of ResNet-18, ResNet-50, MobileNetV2, EfficientNet-Lite, RegNet, and the like.
Step S102, determining, based on set deployment configuration information, at least one processing layer to be quantized in the first network model and a quantization parameter for quantizing each of the processing layers.
Here, the deployment configuration information may include, but is not limited to, one or more of the deployment hardware type, the inference engine used by the deployment hardware type, the model of the deployment hardware, the quantization bit width of the network model parameters corresponding to the deployment hardware type, and the like. During implementation, the deployment configuration information may be preset by the user, may be a default, or may be obtained from a configuration file of the target deployment hardware, which is not limited here.
The first network model may include multiple processing layers, such as one or more of an input layer, a convolutional layer, a pooling layer, a downsampling layer, a linear rectification unit, a fully connected layer, a batch normalization layer, and the like. Since different deployment environments may have different support capabilities for model quantization, at least one processing layer to be quantized in the first network model can be determined based on the set deployment configuration information. During implementation, the at least one processing layer to be quantized in the first network model may be determined in an appropriate manner based on the set deployment configuration information according to the actual situation, which is not limited in this embodiment of the present application. In some implementations, the correspondence between different deployment configuration information and the processing layers to be quantized can be determined in advance according to the actual situation, and the at least one processing layer to be quantized in the first network model can be determined by querying this correspondence with the set deployment configuration information. For example, for a first deployment hardware type or a first inference engine, it may be determined that only the convolutional layers in the first network model are quantized; for a second deployment hardware type or a second inference engine, it may be determined that each convolutional layer, input layer, and fully connected layer in the first network model is quantized; and for a third inference engine, each convolutional layer, input layer, fully connected layer, and element-wise addition computation layer in the first network model may be quantized. In some implementations, the parameters to be quantized in each of the at least one processing layer to be quantized in the first network model may also be determined based on the set deployment configuration information.
The quantization parameters for quantizing each processing layer may include, but are not limited to, one or more of the preset precision of the quantization scale used in quantizing the processing layer, the quantization symmetry, the quantization bit width, the quantization granularity, and the like. For example, the preset precision of the quantization scale may include full precision, power-of-two precision, and so on. The quantization symmetry may be symmetric quantization or asymmetric quantization. The quantization bit width may be one of 8 bits, 4 bits, 3 bits, 2 bits, and so on. The quantization granularity may be layer-wise quantization (that is, tensor-level quantization) or feature-wise quantization (that is, channel-level quantization). Different deployment hardware platforms support or are suited to different quantization parameters such as the precision of the quantization scale, the quantization symmetry, the quantization bit width, and the quantization granularity; based on the set deployment configuration information, the quantization parameters used in quantizing each processing layer to be quantized in the first network model can be determined. During implementation, those skilled in the art may determine, in an appropriate manner according to the actual situation, the quantization parameters for quantizing each processing layer to be quantized in the first network model based on the set deployment configuration information, which is not limited here. In some implementations, the correspondence between different deployment configuration information and quantization parameters can be determined in advance according to the actual situation, and the quantization parameters for quantizing each processing layer to be quantized in the first network model can be determined by querying this correspondence based on the set deployment configuration information.
步骤S103,对所述第一网络模型中的每一所述处理层按照所述量化参数进行量化,得到第二网络模型。Step S103, performing quantization on each of the processing layers in the first network model according to the quantization parameter to obtain a second network model.
这里,可以根据实际情况采用任意合适的量化算法对第一网络模型中的每一处理层按照量化参数进行量化,得到量化后的第二网络模型。量化算法可以包括但不限于训练后量化算法、量化感知训练算法等中的一种或多种。训练后量化算法指的是对预训练后的网络模型选择合适的量化操作和校准操作,以实现量化损失的最小化,可以是训练后静态量化,也可以是训练后动态量化。量化感知训练算法是指在网络的量化过程中进行训练,通过量化感知训练,可以使得网络能够适应整型数值的不连续分布,减少量化过程造成的运算精度损失,可以包括但不限于学习步长量化(Learned Step-size Quantization,LSQ)算法、参数化剪裁激活(PArameterized Clipping acTivation,PACT)算法、加法二次幂量化(Additive Powers-of-Two,APoT)算法、可微软量化(Differentiable Soft Quantization,DSQ)、DoReFa-Net训练算法、高精度紧凑型深层神经网络的学习量化(Learned Quantization for Highly Accurate and Compact Deep Neural Networks,LQ-net)算法等。Here, any suitable quantization algorithm may be used according to the actual situation to quantize each processing layer in the first network model according to the quantization parameter to obtain the quantized second network model. Quantization algorithms may include, but are not limited to, one or more of post-training quantization algorithms, quantization-aware training algorithms, and the like. The post-training quantization algorithm refers to selecting the appropriate quantization operation and calibration operation for the pre-trained network model to minimize the quantization loss. It can be static quantization after training or dynamic quantization after training. The quantization-aware training algorithm refers to training during the quantization process of the network. Through quantization-aware training, the network can adapt to the discontinuous distribution of integer values and reduce the loss of operational accuracy caused by the quantization process, which can include but not limited to the learning step size Quantization (Learned Step-size Quantization, LSQ) algorithm, parameterized clipping activation (PAParameterized Clipping acTivation, PACT) algorithm, additive power of two power quantization (Additive Powers-of-Two, APoT) algorithm, differentiable soft quantization (Differentiable Soft Quantization, DSQ), DoReFa-Net training algorithm, Learning Quantization for Highly Accurate and Compact Deep Neural Networks (LQ-net) algorithm, etc.
在一些实施方式中,在对第一网络模型进行量化的实现过程中,可以基于第一网络模型的网络 结构提取第一网络模型的计算图,通过在第一网络模型的计算图中插入至少一个量化节点,来对第一网络模型中的至少一个处理层进行量化,以构建第二网络模型的计算图,在第二网络模型的计算图中,对每一待量化的处理层进行量化处理的量化节点采用的量化参数即为对该处理层进行量化的量化参数,基于该第二网络模型的计算图可以得到量化后的第二网络模型。在一些实施方式中,还可以根据实际情况采用任意合适的量化算法和训练数据在第二网络模型的计算图上进行参数训练,得到训练后的第二网络模型的计算图,并基于训练后的第二网络模型的计算图,得到训练后的第二网络模型。In some implementations, in the implementation process of quantifying the first network model, the calculation graph of the first network model can be extracted based on the network structure of the first network model, by inserting at least one A quantization node is used to quantify at least one processing layer in the first network model to construct a calculation graph of the second network model, and perform quantization processing on each processing layer to be quantized in the calculation graph of the second network model The quantization parameter adopted by the quantization node is the quantization parameter for quantizing the processing layer, and the quantized second network model can be obtained based on the calculation graph of the second network model. In some embodiments, any suitable quantization algorithm and training data can be used to perform parameter training on the calculation graph of the second network model according to the actual situation, to obtain the calculation graph of the second network model after training, and based on the trained A calculation graph of the second network model to obtain the trained second network model.
Because different deployment hardware considers different levels of graph optimization when the computation graph of a quantized neural network is constructed, different quantization-node insertion strategies can be adopted for different deployment configuration information: at least one quantization node is inserted into the computation graph of the first network model to construct a suitable computation graph for the quantized neural network (that is, the computation graph of the second network model). Inserting a quantization node at a given position in the computation graph of the first network model amounts to quantizing the processing layer corresponding to the logical node at that position, so determining where quantization nodes are inserted in the computation graph of the first network model is equivalent to determining the at least one processing layer to be quantized in the first network model.
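As a toy sketch of this graph rewriting (the Node class, the layer list and the insertion rule below are illustrative assumptions, not the disclosed implementation; real toolchains operate on the framework's own graph IR):

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    name: str
    op: str                       # e.g. "conv", "relu", "fake_quant"
    attrs: dict = field(default_factory=dict)

def insert_fake_quant_nodes(graph: List[Node], quantizable_ops: set,
                            quant_params: dict) -> List[Node]:
    """Build the quantized model's graph by inserting a fake-quant node
    in front of every node whose op type should be quantized."""
    new_graph: List[Node] = []
    for node in graph:
        if node.op in quantizable_ops:
            fq = Node(name=f"{node.name}_input_fq", op="fake_quant",
                      attrs=dict(quant_params))   # per-layer quantization parameters
            new_graph.append(fq)
        new_graph.append(node)
    return new_graph

# Example: quantize only convolution and fully-connected layers.
fp32_graph = [Node("conv1", "conv"), Node("relu1", "relu"), Node("fc", "linear")]
q_graph = insert_fake_quant_nodes(fp32_graph, {"conv", "linear"},
                                  {"bit": 8, "symmetric": True, "per_channel": False})
print([n.op for n in q_graph])   # ['fake_quant', 'conv', 'relu', 'fake_quant', 'linear']
```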
在一些实施例中,所述部署配置信息包括部署硬件类型采用的推理引擎;上述步骤S102中所述的基于设定的部署配置信息,确定所述第一网络模型中待量化的至少一个处理层,可以包括:In some embodiments, the deployment configuration information includes the inference engine used by the deployed hardware type; based on the set deployment configuration information described in step S102 above, determine at least one processing layer to be quantified in the first network model , which can include:
步骤S111,基于所述推理引擎,确定待量化的处理层类型;Step S111, based on the inference engine, determine the processing layer type to be quantized;
步骤S112,将所述第一网络模型中与所述处理层类型匹配的至少一个处理层确定为待量化的处理层。Step S112, determining at least one processing layer in the first network model that matches the processing layer type as the processing layer to be quantized.
Here, the deployment hardware type is the hardware type of the target hardware on which the quantized second network model is to be deployed; the inference engines used by different deployment hardware types may be the same or different, which is not limited here. Inference engines may include, but are not limited to, TensorRT, ACL, TVM, SNPE, FBGEMM, and the like. In implementation, deployment hardware can be classified in any appropriate way according to the actual situation. For example, hardware may be classified by manufacturer, in which case the deployment hardware type is the manufacturer of the deployment hardware, and the inference engine used by that hardware type is the inference engine adopted by that manufacturer. Hardware may also be classified by specification and model, in which case the deployment hardware type is the model of the deployment hardware, and the inference engine used by that hardware type is the inference engine adopted by hardware of that model.
Different inference engines can support quantization of different types of processing layers. Processing layer types may include, but are not limited to, one or more of an input layer, a convolutional layer, a pooling layer, a down-sampling layer, a rectified linear unit, a fully connected layer, a batch normalization layer, and the like. In some implementations, the correspondence between different inference engines and the processing layer types to be quantized can be determined in advance, and based on this correspondence the processing layer types corresponding to the inference engine used by the deployment hardware type can be determined.
在确定待量化的处理层类型之后,可以将第一网络模型中的每一处理层与该处理层类型匹配,并将匹配到的至少一个处理层确定为待量化的处理层。After determining the processing layer type to be quantized, each processing layer in the first network model may be matched with the processing layer type, and at least one matched processing layer may be determined as the processing layer to be quantized.
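One way such an engine-to-layer-type correspondence could be stored and queried is sketched below; the table entries are made-up placeholders, not the actual capabilities of TensorRT, ACL, TVM, SNPE or FBGEMM:

```python
# Hypothetical mapping from inference engine to the layer types it can quantize.
QUANTIZABLE_LAYER_TYPES = {
    "TensorRT": {"conv", "linear", "add"},
    "TVM":      {"conv", "linear"},
    "SNPE":     {"conv", "linear", "pool"},
}

def select_layers_to_quantize(model_layers, engine):
    """model_layers: iterable of (layer_name, layer_type) pairs."""
    supported = QUANTIZABLE_LAYER_TYPES.get(engine, set())
    return [name for name, ltype in model_layers if ltype in supported]

layers = [("conv1", "conv"), ("bn1", "batchnorm"), ("fc", "linear")]
print(select_layers_to_quantize(layers, "TVM"))   # ['conv1', 'fc']
```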
In the embodiments of the present application, the first network model to be quantized is acquired; based on the set deployment configuration information, at least one processing layer to be quantized in the first network model and the quantization parameters for quantizing each such processing layer are determined; and each of these processing layers in the first network model is quantized according to the quantization parameters to obtain the second network model. In this way, since the processing layers to be quantized in the first network model and the quantization parameters for quantizing each of them are determined based on the set deployment configuration information, the deployment configuration information of the hardware platform on which the model will be deployed is fully taken into account during model quantization, so that the resulting second network model is deployable on the corresponding hardware platform.
本申请实施例提供一种模型量化方法,该方法可以由计算机设备的处理器执行。如图2A所示,该方法包括:An embodiment of the present application provides a model quantization method, which can be executed by a processor of a computer device. As shown in Figure 2A, the method includes:
步骤S201,获取待量化的第一网络模型。Step S201, acquiring a first network model to be quantized.
这里,上述步骤S201对应于前述步骤S101,在实施时可以参照前述步骤S101的实施方式。Here, the above-mentioned step S201 corresponds to the above-mentioned step S101, and the implementation of the above-mentioned step S101 can be referred to for implementation.
步骤S202,基于设定的部署配置信息,确定所述第一网络模型中每一所述块结构中待量化的至少一个处理层以及对每一所述处理层进行量化的量化参数。Step S202, based on the set deployment configuration information, determine at least one processing layer to be quantized in each of the block structures in the first network model and a quantization parameter for quantizing each of the processing layers.
这里,神经网络模型的结构可分为多个阶段(stage),每个阶段又可分为多个块(block),每个块又可分为多个处理层(layer)。本实施例中以块(block)结构为单位进行量化处理。第一网络模型包括至少一个块结构,每一所述块结构包括至少一个处理层。Here, the structure of the neural network model can be divided into multiple stages (stages), each stage can be divided into multiple blocks (blocks), and each block can be divided into multiple processing layers (layers). In this embodiment, quantization processing is performed in units of a block structure. The first network model includes at least one block structure, each of said block structures includes at least one processing layer.
在一些实施方式中,可以基于预先确定部署配置信息与不同块结构中待量化的处理层之间的对应关系,确定与设定的部署配置信息对应的每一块结构中的待量化处理层。In some implementations, the processing layers to be quantized in each block structure corresponding to the set deployment configuration information may be determined based on the predetermined correspondence between the deployment configuration information and the processing layers to be quantized in different block structures.
In some implementations, for each block structure in the first network model, an insertion strategy corresponding to the set deployment configuration information may be determined for inserting fake-quantization nodes into the computation subgraph corresponding to that block structure in the computation graph, thereby determining the at least one processing layer to be quantized in that block structure. For example, when the neural network structure adopted by the first network model is ResNet-18/ResNet-34, for the basic block structure of ResNet-18/ResNet-34 and for different deployment configuration information, the three different insertion strategies shown in Figures 2B to 2D may be used to insert at least one fake-quantization node into the computation subgraph corresponding to that basic block. In the insertion strategy shown in Figure 2B, a fake-quantization node FakeQuant 20 is inserted at the input of every convolutional layer Conv 10 in the computation subgraph, where the fake-quantization node FakeQuant 20 consists of a quantization node Quantization 21 and a de-quantization node Dequantization 22; the processing layers to be quantized in the basic block corresponding to this subgraph are therefore all of the convolutional layers in the basic block. In the insertion strategy shown in Figure 2C, the input of the computation subgraph is already-quantized data (that is, the inputs of the convolutional layers Conv 10-1 and Conv 10-2), and fake-quantization nodes FakeQuant 20 are inserted at the input of the convolutional layer Conv 10-3, at one input of the element-wise addition layer elementwise-add 30, and at the output of the computation subgraph; the processing layers to be quantized in the corresponding basic block are therefore every convolutional layer, the element-wise addition layer (with only one of its inputs quantized), and the output layer of the basic block. In the insertion strategy shown in Figure 2D, the input of the computation subgraph is likewise already-quantized data (that is, the inputs of the convolutional layers Conv 10-1 and Conv 10-2), and fake-quantization nodes FakeQuant 20 are inserted at the input of the convolutional layer Conv 10-3, at each input of the element-wise addition layer elementwise-add 30, and at the output of the computation subgraph; the processing layers to be quantized in the corresponding basic block are therefore every convolutional layer, the element-wise addition layer (with both of its inputs quantized), and the output layer of the basic block.
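The three insertion strategies of Figures 2B to 2D can be thought of as different sets of tensors in a ResNet basic block that receive a FakeQuant node; the sketch below simply enumerates them (the strategy names and tensor labels are invented for illustration):

```python
# Hypothetical encodings of the three FakeQuant insertion strategies for a ResNet basic block.
INSERTION_STRATEGIES = {
    # Fig. 2B: quantize the input of every convolution only.
    "conv_inputs_only": {"conv_inputs"},
    # Fig. 2C: also quantize one input of the element-wise add and the block output.
    "add_one_side_and_output": {"conv_inputs", "add_left_input", "block_output"},
    # Fig. 2D: quantize both inputs of the element-wise add and the block output.
    "add_both_sides_and_output": {"conv_inputs", "add_left_input",
                                  "add_right_input", "block_output"},
}

def fake_quant_points(deploy_config: str) -> set:
    """Return the set of tensors to fake-quantize for a given deployment configuration."""
    return INSERTION_STRATEGIES[deploy_config]
```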
步骤S203,对所述第一网络模型中的每一所述处理层按照所述量化参数进行量化,得到第二网络模型。Step S203, performing quantization on each of the processing layers in the first network model according to the quantization parameter to obtain a second network model.
这里,上述步骤S203对应于前述步骤S103,在实施时可以参照前述步骤S103的实施方式。Here, the above-mentioned step S203 corresponds to the above-mentioned step S103, and the implementation of the above-mentioned step S103 can be referred to for implementation.
本申请实施例中,第一网络模型包括至少一个块结构,每一块结构包括至少一个处理层,基于设定的部署配置信息,确定第一网络模型中每一块结构中待量化的至少一个处理层以及对每一处理层进行量化的量化参数,对第一网络模型中的每一待量化的处理层按照该量化参数进行量化,得到第二网络模型。这样,可以对第一网络模型中的全部块结构进行量化,从而可以实现整个网络模型的量化。In the embodiment of the present application, the first network model includes at least one block structure, each block structure includes at least one processing layer, and based on the set deployment configuration information, at least one processing layer to be quantified in each block structure in the first network model is determined and a quantization parameter for quantizing each processing layer, and performing quantization on each processing layer to be quantized in the first network model according to the quantization parameter to obtain a second network model. In this way, all block structures in the first network model can be quantized, thereby realizing the quantization of the entire network model.
本申请实施例提供一种模型量化方法,该方法可以由计算机设备的处理器执行。如图3A所示,该方法包括:An embodiment of the present application provides a model quantization method, which can be executed by a processor of a computer device. As shown in Figure 3A, the method includes:
步骤S301,获取待量化的第一网络模型。Step S301, acquiring the first network model to be quantized.
步骤S302,基于设定的部署硬件类型采用的推理引擎,确定待量化的处理层类型。Step S302, based on the inference engine used by the set deployment hardware type, determine the processing layer type to be quantified.
步骤S303,将所述第一网络模型中与所述处理层类型匹配的至少一个处理层确定为待量化的处理层。Step S303, determining at least one processing layer in the first network model that matches the processing layer type as the processing layer to be quantized.
步骤S304,基于所述推理引擎,确定对每一所述处理层进行量化的量化参数。Step S304, based on the inference engine, determine quantization parameters for quantizing each of the processing layers.
这里,上述步骤S301至步骤S304对应于前述步骤S101至步骤S102,在实施时可以参照前述步骤S101至步骤S102的具体实施方式。Here, the above-mentioned steps S301 to S304 correspond to the above-mentioned steps S101 to S102, and the specific implementation manners of the above-mentioned steps S101 to S102 can be referred to for implementation.
步骤S305,将所述第一网络模型中的至少一个批量归一化层和每一所述批量归一化层依赖的卷积层确定为待量化的处理层。Step S305, determining at least one batch normalization layer in the first network model and the convolutional layer that each batch normalization layer depends on as processing layers to be quantized.
这里,批量归一化层依赖的卷积层可以是该批量归一化层之前与该批量归一化层连接的卷积层。Here, the convolutional layer on which the batch normalization layer depends may be the convolutional layer connected to the batch normalization layer before the batch normalization layer.
步骤S306,获取设定的批量归一化层折叠策略。Step S306, obtaining the set batch normalization layer folding strategy.
这里,批量归一化折叠策略指的是将神经网络模型中的批量归一化层折叠至该批量归一化层依赖的卷积层中的策略。在神经网络模型中,批量归一化层旨在减少内部协变量偏移并平滑损失,以实现快速收敛。批量归一化层为每个卷积层输出引入了两步线性变换,即缩放和平移。在实施时,本领域技术人员可以根据实际情况设定合适的批量归一化层折叠策略,本申请实施例对此并不限定。在一些实施例中,设定的批量归一化层折叠策略可以是预设的与部署配置信息对应的批量归一化层折叠策略。Here, the batch normalization folding strategy refers to the strategy of folding the batch normalization layer in the neural network model into the convolutional layer that the batch normalization layer depends on. In neural network models, batch normalization layers are designed to reduce internal covariate shifts and smooth losses for fast convergence. The batch normalization layer introduces a two-step linear transformation, scaling and translation, to each convolutional layer output. During implementation, those skilled in the art can set an appropriate batch normalization layer folding strategy according to actual conditions, which is not limited in this embodiment of the present application. In some embodiments, the set batch normalization layer folding strategy may be a preset batch normalization layer folding strategy corresponding to the deployment configuration information.
Step S307, based on the batch normalization layer folding strategy, fold each batch normalization layer in the first network model into the convolutional layer on which that batch normalization layer depends, to obtain the folded first network model.
步骤S308,对折叠后的所述第一网络模型中的每一所述处理层按照所述量化参数进行量化, 得到第二网络模型。Step S308, performing quantization on each of the processing layers in the folded first network model according to the quantization parameter to obtain a second network model.
In some embodiments, the batch normalization layer folding strategy includes the removal state of the batch normalization layer, a coefficient update algorithm, the statistical parameters to be merged into the weights, and the statistical parameters to be merged into the offset; the statistical parameters to be merged into the weights include the running statistics of the convolutional layer on which the batch normalization layer depends or the statistics of the current batch, and the statistical parameters to be merged into the offset likewise include the running statistics of the convolutional layer on which the batch normalization layer depends or the statistics of the current batch. Folding each batch normalization layer in the first network model into the convolutional layer on which it depends based on the batch normalization layer folding strategy, as described in step S307 above, may include:
步骤S311,确定第一网络模型中的至少一个批量归一化层中每一所述批量归一化层的缩放系数和平移系数;Step S311, determining the scaling coefficient and translation coefficient of each batch normalization layer in at least one batch normalization layer in the first network model;
这里,可以基于每一批量归一化层的参数确定该批量归一化层的缩放系数和平移系数。Here, the scaling coefficient and translation coefficient of each batch normalization layer may be determined based on parameters of the batch normalization layer.
步骤S312,基于所述系数更新算法,对每一所述批量归一化层的缩放系数和平移系数进行更新,得到每一所述批量归一化层的更新后的缩放系数和平移系数。Step S312 , based on the coefficient update algorithm, update the scaling coefficient and translation coefficient of each batch normalization layer to obtain the updated scaling coefficient and translation coefficient of each batch normalization layer.
Here, the coefficient update algorithm is any suitable algorithm set for updating the scaling coefficients and translation coefficients of the batch normalization layers, and may include, but is not limited to, one or more of gradient descent, simulated annealing, genetic algorithms, and the like. In some implementations, the coefficient update algorithm may also be "no update", in which case the scaling coefficients and translation coefficients of the batch normalization layers are not updated.
Step S313, for each batch normalization layer, obtain the statistical parameters to be merged into the weights and the statistical parameters to be merged into the offset for that batch normalization layer, merge the updated scaling coefficient of the batch normalization layer and the statistical parameters to be merged into the weights into the weights of the convolutional layer on which the batch normalization layer depends, and merge the updated scaling coefficient and translation coefficient of the batch normalization layer together with the statistical parameters to be merged into the offset into the offset of that convolutional layer.
Here, the statistical parameters to be merged into the weights may include the running statistics of the convolutional layer on which the batch normalization layer depends or the statistics of the current batch, and the statistical parameters to be merged into the offset may likewise include the running statistics of that convolutional layer or the statistics of the current batch.
运行统计数据为对卷积层历史运行过程中的输出数据进行统计得到的统计数据,可以包括但不限于历史输出数据的均值、方差、滑动平均值等中的一种或多种。当前批次的统计数据为对卷积层输出数据中当前批次的数据进行统计得到的统计数据,可以包括但不限于当前批次数据的均值、方差等中的一种或多种。在实施时,卷积层的当前批次的统计数据可以通过在该卷积层中使用全精度的权重进行卷积计算得到。Running statistical data is statistical data obtained from the output data during the historical operation of the convolutional layer, which may include but not limited to one or more of the mean, variance, and sliding average of the historical output data. The statistical data of the current batch is the statistical data obtained by statistics of the current batch of data in the output data of the convolutional layer, which may include but not limited to one or more of the mean value and variance of the current batch of data. During implementation, the statistics of the current batch of the convolutional layer can be calculated by performing convolution with full-precision weights in the convolutional layer.
In some implementations, the statistical parameters to be merged into the weights may include the variance of the historical output data of the convolutional layer on which the batch normalization layer depends, and the statistical parameters to be merged into the offset may include the mean and variance of the historical output data of that convolutional layer. The updated scaling coefficient of the batch normalization layer and the variance of the historical output data of the convolutional layer on which it depends may be merged into the weights of that convolutional layer, and the updated scaling coefficient and translation coefficient of the batch normalization layer, together with the mean and variance of the historical output data of that convolutional layer, may be merged into the offset of that convolutional layer.
In some implementations, the statistical parameters to be merged into the weights may include the variance of the current batch of data of the convolutional layer on which the batch normalization layer depends, and the statistical parameters to be merged into the offset may include the mean and variance of the current batch of data of that convolutional layer. The updated scaling coefficient of the batch normalization layer and the variance of the current batch of data of the convolutional layer on which it depends may be merged into the weights of that convolutional layer, and the updated scaling coefficient and translation coefficient of the batch normalization layer, together with the mean and variance of the current batch of data of that convolutional layer, may be merged into the offset of that convolutional layer.
In some implementations, the statistical parameters to be merged into the weights may include the variance of the historical output data of the convolutional layer on which the batch normalization layer depends, and the statistical parameters to be merged into the offset may include the mean and variance of the current batch of data of that convolutional layer. The updated scaling coefficient of the batch normalization layer and the variance of the historical output data of the convolutional layer on which it depends may be merged into the weights of that convolutional layer, and the updated scaling coefficient and translation coefficient of the batch normalization layer, together with the mean and variance of the current batch of data of that convolutional layer, may be merged into the offset of that convolutional layer.
步骤S314,在所述批量归一化层的移除状态为移除的情况下,将每一所述批量归一化层从所述第一网络模型中移除。Step S314, if the removal state of the batch normalization layer is removed, remove each batch normalization layer from the first network model.
In some embodiments, at inference time, the scaling coefficient and translation coefficient of a batch normalization layer and the running statistics of the convolutional layer on which it depends can be merged into the weights and offset of that convolutional layer in the manner shown in formula (1), folding the linear transformation performed by the batch normalization layer into the corresponding convolutional layer:

    w_fold = γ · w / √(σ² + ε),    b_fold = β + γ · (b − μ) / √(σ² + ε)        (1)

where w and b are the weight and offset of the convolutional layer before folding; w_fold and b_fold are the merged weight and offset of the convolutional layer; μ and σ² are the moving-average mean and the variance obtained from statistics of the output data during the operation of the convolutional layer; γ and β are the scaling coefficient and translation coefficient of the batch normalization layer; and ε is a very small non-zero value introduced for numerical stability, which prevents division by zero. If the convolutional layer is quantized after the batch normalization layer has been folded, no extra floating-point operations are needed during inference.
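A minimal NumPy sketch of formula (1), folding a batch normalization layer into the 2D convolution it follows (the function and variable names are ours, and per-output-channel folding is assumed):

```python
import numpy as np

def fold_bn_into_conv(w, b, gamma, beta, mu, var, eps=1e-5):
    """Fold BN scale/shift and running statistics into the conv weight/offset (formula (1)).

    w:           conv weight, shape (out_ch, in_ch, kh, kw)
    b:           conv offset (bias), shape (out_ch,); use zeros if the conv has no bias
    gamma, beta: BN scaling and translation coefficients, shape (out_ch,)
    mu, var:     BN running mean and variance, shape (out_ch,)
    """
    scale = gamma / np.sqrt(var + eps)            # per-channel factor gamma / sqrt(var + eps)
    w_fold = w * scale.reshape(-1, 1, 1, 1)       # w_fold = gamma * w / sqrt(var + eps)
    b_fold = beta + scale * (b - mu)              # b_fold = beta + gamma * (b - mu) / sqrt(var + eps)
    return w_fold, b_fold
```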
在一些实施例中,批量归一化层折叠策略可以包括但不限于以下之一:In some embodiments, batch normalization layer folding strategies may include, but are not limited to, one of the following:
Strategy 1: see Figure 3B. In this strategy, formula (1) above is used to merge the scaling coefficient and translation coefficient of the batch normalization layer into the weight w_fold and offset b_fold of the convolutional layer Conv 310 on which it depends, and the batch normalization layer is removed completely.
Strategy 2: see Figure 3C. In this strategy, formula (1) above is likewise used to merge the scaling coefficient and translation coefficient of the batch normalization layer into the weight w_fold and offset b_fold of the convolutional layer Conv 310 on which it depends, and the batch normalization layer is removed completely; the running statistics of the convolutional layer are not updated during quantization training, but γ and β can be updated with stochastic gradient descent (SGD). With this strategy, even though the statistics are not updated, the loss landscape can still be smoothed, and skipping the statistics computation significantly reduces the quantization training time.
Strategy 3: see Figure 3D. In this strategy, the running statistics of the convolutional layer can be updated during quantization training, and the convolution is computed twice during quantization training, which incurs additional overhead. The first convolution (corresponding to the convolutional layer Conv 320 in the figure) uses the full-precision weights to compute the mean and variance of the current batch. Then, using formula (1) above, the current-batch mean, the current-batch variance, and the scaling coefficient and translation coefficient of the batch normalization layer are merged into the weight and offset of the convolutional layer Conv 310 on which the batch normalization layer depends, and the batch normalization layer is removed completely.
Strategy 4: see Figure 3E. In this strategy, the convolution is also computed twice during training. The first convolution (corresponding to the convolutional layer Conv 320 in the figure), as in strategy 3, estimates the mean and variance of the current batch. In strategy 4, however, the weights are folded together with the running statistics: using formula (1) above, the variance σ² from the running statistics and the scaling coefficient of the batch normalization layer are merged into the weight w_fold of the convolutional layer Conv 310 on which the batch normalization layer depends, so as to avoid unexpected fluctuations in the current-batch statistics, while the current-batch mean, the current-batch variance, and the scaling coefficient and translation coefficient of the batch normalization layer are merged into the offset of the convolutional layer Conv 310, and the batch normalization layer is removed completely. In addition, a batch variance factor is used to rescale the output after the second convolution.
策略5:参见图3F,该策略中不会采用两次卷积,而是在量化卷积(对应图中的卷积层Conv 310)之后明确添加批量归一化层BN 330。这种策略带来的好处之一是当前批次的统计数据是基于量化的权重计算的。在推理过程中,卷积层输出的重新缩放可以被批量归一化层中和。Strategy 5: See Figure 3F. In this strategy, two convolutions are not used, but a batch normalization layer BN 330 is explicitly added after the quantized convolution (corresponding to the convolutional layer Conv 310 in the figure). One of the benefits brought by this strategy is that the statistics of the current batch are calculated based on quantized weights. During inference, the rescaling of convolutional layer outputs can be neutralized by batch normalization layers.
It should be noted that strategies 2 to 5 above can all be converted into strategy 1. In some implementations, one batch normalization folding strategy can be set from among multiple preset batch normalization folding strategies (such as strategies 1 to 5 above), and at least one batch normalization layer in the first network model is folded based on the set batch normalization layer folding strategy to obtain the folded first network model.
在一些实施例中,上述步骤S306可以包括:In some embodiments, the above step S306 may include:
步骤S321,基于所述推理引擎,从设定的多种批量归一化层折叠策略中确定目标的批量归一化层折叠策略。Step S321 , based on the inference engine, determine a target batch normalization layer folding strategy from various set batch normalization layer folding strategies.
这里,设定的多种批量归一化层折叠策略可以是预先根据实际情况确定的,可以包括但不限于上述策略1至5中的任一种。目标的批量归一化层折叠策略是基于推理引擎从设定的多种批量归一化层折叠策略中确定的。不同的推理引擎可以支持不同的批量归一化层折叠策略,也可以支持相同 的批量归一化层折叠策略。在实施时,可以根据推理引擎对批量归一化层折叠策略的支持能力,从设定的多种批量归一化层折叠策略中确定目标的批量归一化层折叠策略。这样,可以进一步提高量化后的第二网络模型在采用设定的推理引擎的部署硬件上部署后的性能。Here, the set multiple batch normalization layer folding strategies may be determined in advance according to the actual situation, and may include but not limited to any one of the strategies 1 to 5 above. The batch normalization layer folding strategy of the target is determined based on the inference engine from the set multiple batch normalization layer folding strategies. Different inference engines can support different batch normalization layer folding strategies, or they can support the same batch normalization layer folding strategy. During implementation, the target batch normalization layer folding strategy can be determined from multiple set batch normalization layer folding strategies according to the inference engine's ability to support the batch normalization layer folding strategy. In this way, the performance of the quantized second network model after being deployed on the deployment hardware using the set inference engine can be further improved.
In some implementations, the correspondence between inference engines and batch normalization layer folding strategies can be determined in advance based on the ability of different inference engines to support different batch normalization layer folding strategies; by querying this correspondence with the set inference engine, the target batch normalization layer folding strategy can be determined from among the multiple set batch normalization layer folding strategies.
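One way such a correspondence could be stored and queried is sketched below; the assignments are made-up examples, not actual engine capabilities, and the strategy numbers refer to strategies 1 to 5 above:

```python
# Hypothetical lookup table from inference engine to supported BN folding strategy.
BN_FOLD_STRATEGY_FOR_ENGINE = {"TensorRT": 4, "SNPE": 2, "FBGEMM": 5}

def target_bn_fold_strategy(engine: str, default: int = 1) -> int:
    """Pick the target folding strategy for the set inference engine, falling back to strategy 1."""
    return BN_FOLD_STRATEGY_FOR_ENGINE.get(engine, default)
```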
In the embodiments of the present application, the set batch normalization layer folding strategy is obtained; based on this folding strategy, each batch normalization layer in the first network model is folded into the convolutional layer on which it depends to obtain the folded first network model, and each of the processing layers in the folded first network model is quantized according to the quantization parameters to obtain the second network model. In this way, the convolutional layers are quantized after the batch normalization layers have been folded, so no extra floating-point operations are needed during inference, which further accelerates the inference of the quantized second network model.
本申请实施例提供一种模型量化方法,该方法可以由计算机设备的处理器执行。如图4所示,该方法包括:An embodiment of the present application provides a model quantization method, which can be executed by a processor of a computer device. As shown in Figure 4, the method includes:
步骤S401,获取待量化的第一网络模型。Step S401, acquiring the first network model to be quantized.
步骤S402,基于设定的部署配置信息,确定所述第一网络模型中待量化的至少一个处理层以及对每一所述处理层进行量化的量化参数。Step S402, based on the set deployment configuration information, determine at least one processing layer to be quantized in the first network model and a quantization parameter for quantizing each processing layer.
这里,上述步骤S401至步骤S402分别对应于前述步骤S101至步骤S102,在实施时可以参照前述步骤S101至步骤S102的具体实施方式。Here, the above-mentioned steps S401 to S402 correspond to the above-mentioned steps S101 to S102 respectively, and the specific implementation manners of the above-mentioned steps S101 to S102 can be referred to for implementation.
步骤S403,基于设定的量化算法和第一训练数据集,按照所述量化参数,对所述第一网络模型中的每一所述处理层进行量化,得到第二网络模型。Step S403, based on the set quantization algorithm and the first training data set, quantize each of the processing layers in the first network model according to the quantization parameters to obtain a second network model.
这里,用户可以根据实际情况设定任意合适的量化算法,量化算法可以是训练后量化算法,也可以是量化感知训练算法,这里并不限定。Here, the user can set any appropriate quantization algorithm according to the actual situation. The quantization algorithm can be a post-training quantization algorithm or a quantization-aware training algorithm, which is not limited here.
第一训练数据集可以是预先根据第二网络模型的目标任务确定的合适的训练数据集,可以是图像数据集、点云数据集或语音数据等,这里并不限定。The first training data set may be an appropriate training data set determined in advance according to the target task of the second network model, and may be an image data set, a point cloud data set, or voice data, etc., which is not limited here.
在一些实施方式中,量化算法为训练后量化算法,基于所述训练后量化算法,按照所述量化参数,对所述第一网络模型中的每一所述处理层进行量化,得到量化后的第二网络模型;基于所述第一训练数据集,对量化后的第二网络模型中的模型参数进行校准,得到校准后的第二网络模型。In some implementations, the quantization algorithm is a post-training quantization algorithm. Based on the post-training quantization algorithm, each of the processing layers in the first network model is quantized according to the quantization parameters to obtain a quantized The second network model: based on the first training data set, calibrate the model parameters in the quantized second network model to obtain the calibrated second network model.
In some implementations, the quantization algorithm is a quantization-aware training algorithm; based on the quantization-aware training algorithm and the first training data set, the parameters of each processing layer to be quantized in the first network model may undergo at least one round of quantization-aware training according to the quantization parameters, so as to obtain the trained, quantized second network model.
在一些实施方式中,可以在对第一网络模型进行量化之前,对该第一网络模型进行预训练,将预训练后的第一网络模型作为待量化的第一网络模型。In some implementation manners, before quantizing the first network model, the first network model may be pre-trained, and the pre-trained first network model may be used as the first network model to be quantized.
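As a rough, generic sketch of the post-training route described above, calibration data can be run through the model while an observer records the value ranges from which quantization parameters are then derived; the MinMaxObserver class and its interface are our own illustration, not the patent's API:

```python
import numpy as np

class MinMaxObserver:
    """Collects the running min/max of a tensor over calibration batches."""
    def __init__(self):
        self.lo, self.hi = np.inf, -np.inf

    def update(self, x: np.ndarray):
        self.lo = min(self.lo, float(x.min()))
        self.hi = max(self.hi, float(x.max()))

    def scale_zero_point(self, n_min: int, n_max: int):
        scale = (self.hi - self.lo) / (n_max - n_min)
        zero_point = int(round(n_min - self.lo / scale))
        return scale, zero_point

# Calibration loop over a few batches of the first training data set (shapes are illustrative).
observer = MinMaxObserver()
for _ in range(8):
    activations = np.random.randn(32, 64)   # stand-in for one layer's activations
    observer.update(activations)
print(observer.scale_zero_point(0, 255))
```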
本申请实施例中,基于设定的量化算法和第一训练数据集,按照量化参数,对第一网络模型中的每一待量化的处理层进行量化,得到第二网络模型。这样,可以有效复现设定的量化算法。In the embodiment of the present application, based on the set quantization algorithm and the first training data set, each processing layer to be quantized in the first network model is quantized according to quantization parameters to obtain the second network model. In this way, the set quantization algorithm can be effectively reproduced.
在一些实施例中,所述量化算法包括量化感知训练算法,上述步骤S403还可以包括:In some embodiments, the quantization algorithm includes a quantization-aware training algorithm, and the above step S403 may also include:
步骤S411,按照所述量化参数,为所述第一网络模型中的每一所述处理层设置一个伪量化器,得到第三网络模型。Step S411, setting a pseudo-quantizer for each of the processing layers in the first network model according to the quantization parameters to obtain a third network model.
这里,伪量化器可以在量化感知训练过程中进行量化模拟,以方便网络感知量化带来的损失,从而可以为第一网络模型中的每一待量化的处理层设置一个伪量化器。伪量化器的结构可以基于量化参数确定,可以是对称量化器也可以是非对称量化器,可以是均匀量化器也可以是非均匀量化器,可以是基于学习的量化器也可以是基于规则的量化器,还可以是直接使用启发式计算量化步长的量化器,这里并不限定。可以将设置了伪量化器的第一网络模型确定为第三网络模型。Here, the pseudo-quantizer can perform quantization simulation during the quantization-aware training process to facilitate the network to perceive the loss caused by quantization, so that a pseudo-quantizer can be set for each processing layer to be quantized in the first network model. The structure of the pseudo-quantizer can be determined based on quantization parameters, it can be a symmetric quantizer or an asymmetric quantizer, it can be a uniform quantizer or a non-uniform quantizer, it can be a learning-based quantizer or a rule-based quantizer , can also be a quantizer that directly uses heuristics to calculate the quantization step size, which is not limited here. The first network model in which the pseudo-quantizer is set may be determined as the third network model.
步骤S412,基于设定的量化感知训练算法和第一训练数据集,对所述第三网络模型中的每一所述处理层的参数进行至少一次量化感知训练,得到第二网络模型。Step S412, based on the set quantization-aware training algorithm and the first training data set, perform at least one quantization-aware training on the parameters of each processing layer in the third network model to obtain a second network model.
这里,本领域技术人员可以在实施时根据实际情况设定合适的量化感知训练算法,例如LSQ算法、PACT算法、APoT算法、DSQ算法、DoReFa-Net训练算法、LQ-net算法等中的一种或多种,这里并不限定。在一些实施方式中,可以从预设的多种量化感知训练算法中设定一种量化感知训练算法。Here, those skilled in the art can set an appropriate quantization-aware training algorithm according to the actual situation during implementation, such as one of LSQ algorithm, PACT algorithm, APoT algorithm, DSQ algorithm, DoReFa-Net training algorithm, LQ-net algorithm, etc. or more, and it is not limited here. In some embodiments, one quantization-aware training algorithm may be set from multiple preset quantization-aware training algorithms.
在一些实施例中,所述量化参数包括量化尺度的预设精度、量化对称性、量化位宽和量化粒度,所述量化对称性包括对称量化或非对称量化,所述量化粒度包括层级量化或特征级量化。所述伪量化器被配置为执行如下步骤S421至步骤S424:In some embodiments, the quantization parameters include preset precision of quantization scale, quantization symmetry, quantization bit width and quantization granularity, the quantization symmetry includes symmetric quantization or asymmetric quantization, and the quantization granularity includes hierarchical quantization or Feature-level quantization. The pseudo-quantizer is configured to perform the following steps S421 to S424:
步骤S421,基于所述量化位宽确定处理层参数的量化值范围。Step S421: Determine the quantized value range of the processing layer parameter based on the quantized bit width.
这里,量化位宽为对第三网络模型中的每一待量化的处理层的参数进行训练的过程中对浮点型参数进行量化得到的整型数据的比特位宽,如8比特、4比特、3比特、2比特等。量化位宽可以是根据设定的部署配置信息确定的,也可以是用户直接设定的。第三网络模型中不同的处理层可以采用相同的量化位宽,也可以采用不同的量化位宽。Here, the quantization bit width is the bit width of the integer data obtained by quantizing the floating-point parameters during the training process of the parameters of each processing layer to be quantized in the third network model, such as 8 bits, 4 bits , 3 bits, 2 bits, etc. The quantized bit width can be determined according to the set deployment configuration information, or can be set directly by the user. Different processing layers in the third network model may use the same quantization bit width or different quantization bit widths.
The processing layer parameters may be one or more of the parameters to be quantized of a processing layer, such as its weight values, activation values, input data, or output data; the quantized value range of a processing layer parameter is the range of values the parameter can take after quantization. In implementation, the quantized value range of a processing layer parameter can be determined based on the quantization bit width. For example, the processing layer parameters may include weight values and activation values; with a quantization bit width of k, the weight values can be quantized to signed integer values in the range [-2^(k-1), 2^(k-1) - 1], and the activation values can be quantized to unsigned integer values in the range [0, 2^k - 1]. Therefore, the quantized value range of the weight values may be [-2^(k-1), 2^(k-1) - 1], and the quantized value range of the activation values may be [0, 2^k - 1].
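A small helper reflecting the ranges just described (a sketch; which tensors are treated as signed is an assumption that depends on the deployment configuration):

```python
def quant_value_range(bit: int, signed: bool):
    """Integer range produced by a k-bit uniform quantizer.

    Signed   (e.g. weights):                 [-2**(k-1), 2**(k-1) - 1]
    Unsigned (e.g. activations after ReLU):  [0, 2**k - 1]
    """
    if signed:
        return -(2 ** (bit - 1)), 2 ** (bit - 1) - 1
    return 0, 2 ** bit - 1

print(quant_value_range(8, signed=True))    # (-128, 127)
print(quant_value_range(8, signed=False))   # (0, 255)
```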
步骤S422,确定满足所述预设精度的量化尺度和满足所述量化对称性的量化零点。Step S422, determining a quantization scale that satisfies the preset precision and a quantization zero that satisfies the quantization symmetry.
这里,量化尺度为量化过程中对待量化的全精度数值进行缩放的系数。量化尺度的预设精度可以包括但不限于全精度、2的次方精度等中的一种。Here, the quantization scale is a coefficient for scaling the full-precision value to be quantized during the quantization process. The preset precision of the quantization scale may include but not limited to one of full precision, power of 2 precision, and the like.
量化对称性用于表征待量化的全精度数值的取值范围是否关于0对称。在均匀量化中,全精度数值的零点会被量化为一个整型的数值,该数值称为量化零点。在量化零点为0的情况下,表示待量化的全精度数值的取值范围是关于0对称,也即该均匀量化为对称量化;在量化零点不为0的情况下,表示待量化的全精度数值的取值范围关于0不对称,也即该均匀量化为非对称量化。Quantization symmetry is used to characterize whether the value range of the full-precision value to be quantized is symmetrical about 0. In uniform quantization, the zero point of the full-precision value is quantized to an integer value, which is called the quantized zero point. When the quantization zero point is 0, it means that the value range of the full-precision value to be quantized is symmetrical about 0, that is, the uniform quantization is symmetrical quantization; when the quantization zero point is not 0, it means the full-precision value to be quantized The range of values is asymmetric about 0, that is, the uniform quantization is asymmetric quantization.
在一些实施方式中,可以根据实际情况为伪量化器设置一个满足预设精度的固定的量化尺度,以及设置一个满足所述量化对称性的固定的量化零点。例如,在量化尺度的预设精度为全精度的情况下,可以为伪量化器设置一个合适的全精度的数值作为量化尺度。在量化对称性为对称的情况下,可以将量化零点设置为0;在量化对称性为非对称的情况下,可以将量化零点设置为一个合适的非零数,如1、-2等。In some implementation manners, a fixed quantization scale that satisfies the preset accuracy and a fixed quantization zero point that satisfies the quantization symmetry may be set for the pseudo quantizer according to actual conditions. For example, when the preset precision of the quantization scale is full precision, an appropriate full-precision numerical value may be set as the quantization scale for the pseudo quantizer. When the quantization symmetry is symmetrical, the quantization zero point can be set to 0; when the quantization symmetry is asymmetrical, the quantization zero point can be set to an appropriate non-zero number, such as 1, -2, and so on.
In some implementations, the value range taken by the full-precision values to be quantized can be collected statistically while the model runs; based on this value range and the corresponding quantized value range, a quantization scale that satisfies the preset precision and a quantization zero point that satisfies the quantization symmetry can be determined. In some implementations, the quantization scale and quantization zero point can also be adjusted continuously during model training.
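A sketch of deriving a quantization scale and zero point from an observed floating-point range, for both symmetric and asymmetric settings (our own helper, not the patent's exact procedure):

```python
def compute_scale_zero_point(x_min, x_max, n_min, n_max, symmetric):
    """Map an observed float range [x_min, x_max] onto the integer range [n_min, n_max]."""
    if symmetric:
        # The float range is treated as symmetric about zero, so the zero point is 0.
        bound = max(abs(x_min), abs(x_max))
        scale = bound / max(abs(n_min), abs(n_max))
        zero_point = 0
    else:
        scale = (x_max - x_min) / (n_max - n_min)
        zero_point = int(round(n_min - x_min / scale))
    return scale, zero_point

print(compute_scale_zero_point(-1.2, 0.8, -128, 127, symmetric=True))
print(compute_scale_zero_point(0.0, 6.0, 0, 255, symmetric=False))
```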
步骤S423,基于所述量化粒度,在所述量化值范围内,采用量化尺度和量化零点对待量化的处理层参数进行均匀量化处理,得到量化后的所述处理层参数。Step S423, based on the quantization granularity, within the quantization value range, uniform quantization is performed on the processing layer parameters to be quantized by using the quantization scale and the quantization zero point, to obtain the quantized processing layer parameters.
Here, the quantization granularity refers to the granularity at which parameters such as the quantized value range, quantization scale, and quantization zero point are shared within the quantized network model, and may include layer-level quantization (i.e., tensor-level quantization) or feature-level quantization (i.e., channel-level quantization), among others. Layer-level quantization means that all processing layer parameters to be quantized within the same processing layer share the same quantized value range, quantization scale, quantization zero point, and so on; feature-level quantization means that the processing layer parameters to be quantized that correspond to different features within the same processing layer use different quantized value ranges, quantization scales, quantization zero points, and so on.
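The difference between the two granularities can be seen in how many scales are kept for a convolution weight tensor; a NumPy sketch under our own naming:

```python
import numpy as np

w = np.random.randn(16, 3, 3, 3)          # (out_channels, in_channels, kh, kw)

# Layer-level (per-tensor) quantization: one scale shared by the whole tensor.
per_tensor_scale = np.abs(w).max() / 127.0

# Feature-level (per-channel) quantization: one scale per output channel.
per_channel_scale = np.abs(w).reshape(16, -1).max(axis=1) / 127.0   # shape (16,)
```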
在一些实施方式中,在量化值范围为[N min,N max],其中,N min为量化值范围中的最小量化值,N max为量化值范围中的最大量化值,量化尺度为s,量化零点为z的情况下,可以采用如下公式(2)所示的方式对待量化的处理层参数进行均匀量化处理: In some implementations, the range of quantized values is [N min , N max ], where N min is the smallest quantized value in the range of quantized values, N max is the largest quantized value in the range of quantized values, and the quantized scale is s, In the case where the quantization zero point is z, the processing layer parameters to be quantized can be uniformly quantized in the manner shown in the following formula (2):
    w_q = clip( round(w / s) + z, N_min, N_max )        (2)

where w denotes the floating-point value of the processing layer parameter and w_q denotes its quantized value; clip(x, N_min, N_max) limits x to the range from N_min to N_max, that is, the function returns N_max when x is greater than N_max, returns N_min when x is less than N_min, and returns x itself when x is not greater than N_max and not less than N_min; and round(·) rounds its input to the nearest integer.
步骤S424,基于所述量化尺度和所述量化零点,对量化后的所述处理层参数进行反均匀量化处理,得到反量化后的所述处理层参数。Step S424, based on the quantization scale and the quantization zero point, perform inverse uniform quantization on the quantized processing layer parameters to obtain the dequantized processing layer parameters.
这里,在一些实施方式中,在量化尺度为s,量化零点为z的情况下,可以采用如下公式(3)所示的方式对量化后的处理层参数进行反均匀量化处理:Here, in some implementations, in the case where the quantization scale is s and the quantization zero point is z, the quantized processing layer parameters can be deuniformly quantized in the manner shown in the following formula (3):
    w_r = s · ( w_q − z )        (3)

where w_q denotes the quantized value of the processing layer parameter obtained with formula (2), and w_r denotes the processing layer parameter after de-quantization.
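Taken together, formulas (2) and (3) describe the quantize/de-quantize round trip that a pseudo-quantizer simulates; a minimal NumPy sketch (function and variable names are ours, not the patent's):

```python
import numpy as np

def fake_quantize(w, scale, zero_point, n_min, n_max):
    """Uniform quantization (formula (2)) followed by de-quantization (formula (3))."""
    w_q = np.clip(np.round(w / scale) + zero_point, n_min, n_max)   # integer-valued array
    return scale * (w_q - zero_point)                               # back to floating point

x = np.array([-1.3, -0.02, 0.4, 2.7])
# With scale 0.02 and an 8-bit signed range, 2.7 is clipped to 127 * 0.02 = 2.54.
print(fake_quantize(x, scale=0.02, zero_point=0, n_min=-128, n_max=127))
```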
In the above embodiments, the quantization parameters for quantizing each processing layer in the first network model can be determined based on the set deployment configuration information; the quantization parameters include the preset precision of the quantization scale, the quantization symmetry, the quantization bit width, and the quantization granularity, where the quantization symmetry includes symmetric or asymmetric quantization and the quantization granularity includes layer-level or feature-level quantization. In this way, a hardware-aware quantizer can be used for model quantization according to the configuration of the deployment hardware, so that the quantized second network model better meets the deployment requirements of that hardware. In addition, multiple types of quantizers can be supported, so that a deployable second network model can be quantized for more types of deployment hardware.
在一些实施例中,上述步骤S403可以包括:In some embodiments, the above step S403 may include:
步骤S431,确定预设的与所述第一网络模型采用的神经网络结构对应的训练超参数;其中,对于预设的多种部署配置信息中的每一所述部署配置信息,所述训练超参数是相同的。Step S431, determining a preset training hyperparameter corresponding to the neural network structure adopted by the first network model; wherein, for each of the deployment configuration information in the preset multiple deployment configuration information, the training hyperparameter The parameters are the same.
这里,对于采用相同的神经网络结构的网络模型,采用统一的训练超参数,训练超参数可以包括但不限于微调时长(代数)、学习率、参数优化算法、权重衰减等中的一种或多种。预设的多种部署配置信息可以包括预先设定的至少两个任意合适的部署配置信息,这里并不限定。对于不同的部署配置信息,在对采用相同的神经网络结构的网络模型进行量化训练的过程中,所采用的训练超参数均相同。在采用不同的量化算法对采用相同的神经网络结构的网络模型进行量化训练的过程中,所采用的训练超参数也均相同。Here, for network models using the same neural network structure, uniform training hyperparameters are adopted, and training hyperparameters may include but not limited to one or more of fine-tuning duration (algebra), learning rate, parameter optimization algorithm, weight decay, etc. kind. The preset multiple deployment configuration information may include at least two preset deployment configuration information, which is not limited here. For different deployment configuration information, the same training hyperparameters are used in the process of quantitative training for network models using the same neural network structure. In the process of using different quantization algorithms to perform quantization training on network models using the same neural network structure, the training hyperparameters used are also the same.
在实施时,可以预先通过实验或分析为至少一种神经网络结构确定一组合适的训练超参数,基于第一网络模型采用的神经网络结构,可以确定预设的与该神经网络结构对应的训练超参数。本领域技术人员可以根据实际情况为至少一种神经网络结构确定合适的训练超参数,本申请实施例对此并不限定。During implementation, a set of suitable training hyperparameters for at least one neural network structure can be determined in advance through experiments or analysis. Based on the neural network structure adopted by the first network model, the preset training corresponding to the neural network structure can be determined. hyperparameters. Those skilled in the art may determine appropriate training hyperparameters for at least one neural network structure according to actual conditions, which is not limited in this embodiment of the present application.
例如,如下表1提供了一种为神经网络结构ResNet-18、ResNet-50、EffNet、MbV2、RegNet分别预设的训练超参数的示例,其中,对于采用ResNet-18的第一网络模型,预设的学习率为0.004、权重衰减为10 -4、批大小为64、图形处理器(Graphics processing unit,GPU)数量为8;对于采用ResNet-50的第一网络模型,预设的学习率为0.004、权重衰减为10 -4、批大小为16、GPU数量为16;对于采用EffNet和MbV2的第一网络模型可以预设相同的训练超参数,预设的学习率为0.01、权重衰减为10 -5*、批大小为32、GPU数量为16;对于采用RegNet的第一网络模型,预设的学习率为0.004、权重衰减为4×10 -5,批大小为32、GPU数量为16。其中,*代表批量归一化层的权重衰减为0。 For example, the following Table 1 provides an example of training hyperparameters preset for the neural network structures ResNet-18, ResNet-50, EffNet, MbV2, and RegNet, wherein, for the first network model using ResNet-18, the preset The set learning rate is 0.004, the weight decay is 10 -4 , the batch size is 64, and the number of graphics processors (Graphics processing unit, GPU) is 8; for the first network model using ResNet-50, the preset learning rate is 0.004, the weight decay is 10 -4 , the batch size is 16, and the number of GPUs is 16; for the first network model using EffNet and MbV2, the same training hyperparameters can be preset, the preset learning rate is 0.01, and the weight decay is 10 -5 *, the batch size is 32, and the number of GPUs is 16; for the first network model using RegNet, the preset learning rate is 0.004, the weight decay is 4×10 -5 , the batch size is 32, and the number of GPUs is 16. Among them, * represents that the weight decay of the batch normalization layer is 0.
Table 1  Example training hyperparameters for different neural network structures

Neural network structure | Learning rate | Weight decay | Batch size | Number of GPUs
ResNet-18                | 0.004         | 10^-4        | 64         | 8
ResNet-50                | 0.004         | 10^-4        | 16         | 16
EffNet                   | 0.01          | 10^-5 *      | 32         | 16
MbV2                     | 0.01          | 10^-5 *      | 32         | 16
RegNet                   | 0.004         | 4×10^-5      | 32         | 16
In some embodiments, unified data preprocessing can be applied to the training data, including random resized cropping to 224 resolution, random horizontal flipping, and color jitter on the images, for example a brightness offset of 0.2, a contrast offset of 0.2, a saturation offset of 0.2, and a hue offset of 0.1. The test data is center-cropped to 224 resolution, and label smoothing of 0.1 is used to add regularization. All models are trained for 100 epochs (one epoch meaning that every training sample has gone through one forward and one backward pass), with a linear warm-up in the first epoch. The learning rate is decayed with a cosine annealing schedule. Training uses the SGD optimizer and is updated with Nesterov momentum, with a momentum parameter of 0.9.
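Below is a sketch of how the shared recipe of Table 1 and the preprocessing just described could be assembled; the values come from the table and the paragraph above, while the code itself, including the assumption of a recent PyTorch/torchvision with label-smoothing support, is illustrative rather than the disclosed implementation:

```python
import torch
from torchvision import transforms

# Shared per-architecture hyperparameters (Table 1); "*" means no weight decay on BN parameters.
HYPERPARAMS = {
    "ResNet-18": dict(lr=0.004, weight_decay=1e-4, batch_size=64, gpus=8),
    "ResNet-50": dict(lr=0.004, weight_decay=1e-4, batch_size=16, gpus=16),
    "EffNet":    dict(lr=0.01,  weight_decay=1e-5, batch_size=32, gpus=16),  # * BN decay = 0
    "MbV2":      dict(lr=0.01,  weight_decay=1e-5, batch_size=32, gpus=16),  # * BN decay = 0
    "RegNet":    dict(lr=0.004, weight_decay=4e-5, batch_size=32, gpus=16),
}

train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2, hue=0.1),
    transforms.ToTensor(),
])
test_transform = transforms.Compose([
    transforms.CenterCrop(224),
    transforms.ToTensor(),
])

def build_training(model, arch="ResNet-18", epochs=100):
    """Create the loss, optimizer and LR schedule shared across quantization algorithms."""
    hp = HYPERPARAMS[arch]
    criterion = torch.nn.CrossEntropyLoss(label_smoothing=0.1)
    optimizer = torch.optim.SGD(model.parameters(), lr=hp["lr"], momentum=0.9,
                                nesterov=True, weight_decay=hp["weight_decay"])
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)
    return criterion, optimizer, scheduler
```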
Step S432: using the set first training data set, based on the quantization algorithm and the training hyperparameters, quantize each of the processing layers in the first network model according to the quantization parameters to obtain a second quantized network model.
In the above embodiment, unified training hyperparameters are used for network models that adopt the same neural network structure. In this way, model training tricks can be shared among multiple first network models with the same neural network structure and among multiple quantization algorithms, so that different quantization algorithms can be reproduced better and the accuracy of the quantization algorithms can be improved.
An embodiment of the present application provides a model quantization method, which can be executed by a processor of a computer device. As shown in Fig. 5, the method includes:
Step S501: based on at least one kind of deployment configuration information, adjust the processing layers in a set neural network structure to obtain at least one adjusted neural network structure.
Here, the set neural network structure may be preset by the user according to the actual situation, or may be a default, which is not limited here.
The at least one kind of deployment configuration information may be one or more kinds of deployment configuration information preset by the user or provided by default. Different deployment hardware differs in its ability to support quantization of different processing layers in a neural network structure. During implementation, for each kind of deployment configuration information, at least one processing layer in the set neural network structure may be adjusted in a suitable way according to the actual quantization support of the corresponding deployment hardware for different processing layers, so as to obtain an adjusted neural network structure. For example, for a deployment hardware type with limited quantization support, where the set neural network structure is EfficientNet, the squeeze-and-excitation blocks in the network structure can be removed and the swish activation layers can be replaced with ReLU6 (rectified linear unit) layers, yielding the lightweight (Lite) version of EfficientNet, so that better integer support can be obtained on the deployment hardware.
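As an illustration of this kind of structural adjustment, the sketch below recursively replaces swish (SiLU) activations with ReLU6 in a given model; the helper name is hypothetical, and removing the squeeze-and-excitation blocks would additionally require architecture-specific code:

```python
import torch.nn as nn

def replace_swish_with_relu6(module: nn.Module) -> None:
    """Recursively swap SiLU (swish) activations for ReLU6, in place."""
    for name, child in module.named_children():
        if isinstance(child, nn.SiLU):
            # ReLU6 is better supported by integer-only deployment hardware.
            setattr(module, name, nn.ReLU6(inplace=True))
        else:
            replace_swish_with_relu6(child)
```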
Step S502: create at least one first network model based on the at least one adjusted neural network structure.
Here, a first network model may be created for each neural network structure in the at least one adjusted neural network structure. During implementation, those skilled in the art can create a suitable first network model based on the adjusted neural network structure according to actual business requirements, which is not limited here.
Step S503: based on preset model parameters corresponding to the set neural network structure, initialize the parameters of the at least one first network model to obtain at least one initialized first network model.
Here, for first network models that adopt the same neural network structure, or that adopt a neural network structure adjusted from the same neural network structure, unified preset model parameters may be used to initialize the parameters of the first network models, obtaining at least one initialized first network model. The preset model parameters may include preset initial values of the parameters in the first network model, or may include trained model parameters obtained after pre-training the first network model, which is not limited here.
Step S504: based on the set deployment configuration information, determine a first network model to be quantized from the at least one initialized first network model.
Here, each kind of deployment configuration information may correspond to one initialized first network model. Based on the set deployment configuration information, the initialized first network model corresponding to that deployment configuration information can be determined, and this initialized first network model is determined as the first network model to be quantized.
Step S505: based on the set deployment configuration information, determine at least one processing layer to be quantized in the first network model and a quantization parameter for quantizing each of the processing layers.
Step S506: quantize each of the processing layers in the first network model according to the quantization parameter to obtain a second network model.
Here, steps S505 to S506 above correspond to steps S102 to S103 described above, respectively, and the specific implementations of steps S102 to S103 can be referred to when implementing them.
In the embodiment of the present application, based on at least one kind of deployment configuration information, the processing layers in the set neural network structure are adjusted to obtain at least one adjusted neural network structure; at least one first network model is created based on the at least one adjusted neural network structure; the parameters of the at least one first network model are initialized based on the preset model parameters corresponding to the set neural network structure to obtain at least one initialized first network model; and the first network model to be quantized is determined from the at least one initialized first network model based on the set deployment configuration information. In this way, on the one hand, the first network model to be quantized is created from a neural network structure obtained by adjusting the processing layers in the set neural network structure based on the set deployment configuration information, so that the second network model obtained after quantization can receive better quantization support after being deployed on deployment hardware that uses the set deployment configuration information. On the other hand, by initializing first network models that adopt the same neural network structure with unified preset model parameters, the initialization inconsistency caused by using different initialization methods can be reduced, thereby improving the comparability of different quantization algorithms when quantizing different neural network models with the same network structure.
In some embodiments, before step S503 above, the method further includes:
Step S511: obtain a preset pre-trained model corresponding to the neural network structure, where the structure of the pre-trained model before the output layer is the same as the neural network structure.
Here, the pre-trained model may be any suitable neural network model created in advance based on the neural network structure.
Step S512: train the parameters of the pre-trained model with a set second training data set to obtain the trained pre-trained model.
Here, the second training data set may be a suitable training data set determined in advance according to the target task of the pre-trained model, and may be an image data set, a point cloud data set, speech data, or the like, which is not limited here.
Step S513: determine the trained parameters of the pre-trained model as the preset model parameters.
Here, for first network models that adopt the same neural network structure, a unified pre-trained model can be used to pre-train the parameters, and the parameters of the trained pre-trained model are used as the preset model parameters for initializing the parameters of the first network models. In this way, in the process of quantizing the first network model, only simple calibration or fine-tuning of the quantized parameters is needed to obtain a quantized second network model with good performance, which improves the efficiency of model quantization and further improves the accuracy of the quantized second network model.
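A minimal sketch of initializing every first network model that shares a structure from one unified pre-trained checkpoint; the checkpoint path and the use of ResNet-18 are illustrative assumptions:

```python
import torch
import torchvision.models as models

# Hypothetical location of the unified pre-trained checkpoint for this structure.
PRETRAINED_PATH = "checkpoints/resnet18_pretrained.pth"

def build_initialized_model() -> torch.nn.Module:
    """Create a first network model and initialize it from the unified checkpoint."""
    model = models.resnet18()
    state_dict = torch.load(PRETRAINED_PATH, map_location="cpu")
    # strict=False tolerates layers that were adjusted for a specific backend.
    model.load_state_dict(state_dict, strict=False)
    return model
```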
In some embodiments, step S501 above may include:
Step S521: determine a target neural network structure from multiple preset neural network structures.
Here, multiple optional neural network structures may be preset, and the user can determine a suitable target neural network structure from the multiple preset neural network structures according to actual business requirements, which is not limited here.
Step S522: based on at least one kind of deployment configuration information, adjust the processing layers in the target neural network structure to obtain at least one adjusted neural network structure.
In the above embodiment, multiple optional neural network structures can be provided for creating the initial first network model, so that different business requirements of users can be better supported.
In the related art, different quantization algorithms oriented toward hardware deployment exhibit a large accuracy gap when deployed and run on the target hardware.
An embodiment of the present application provides a reproducible and deployable model quantization algorithm library (hereinafter referred to as MQBench). The library can be used to evaluate and analyze the reproducibility and deployability of model quantization algorithms, provides multiple selectable deployment hardware types for deploying quantized models in practical applications, including central processing units (CPUs), GPUs, application-specific integrated circuits (ASICs), and digital signal processors (DSPs), and evaluates a large number of state-of-the-art quantization algorithms under a unified training configuration. Users can use MQBench to quantize a trained full-precision network model in tasks such as image classification and object detection, obtaining a quantized network model that can be deployed and run on the target hardware. When using MQBench for model quantization, the user only needs to provide the corresponding training data set, the deployment configuration information of the target hardware (such as the deployment hardware type, the inference engine used by that hardware type, and the quantization bit width corresponding to that hardware type), and the configuration information of the quantization algorithm (such as the quantization algorithm, the fine-tuning duration, the number of fine-tuning training epochs, and the training hyperparameters).
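As a rough illustration, these two kinds of configuration information might be expressed as dictionaries of the following shape; every key and value below is an assumption made for the sake of the example rather than MQBench's actual configuration schema:

```python
# Illustrative deployment configuration of the target hardware (assumed keys).
backend_params = {
    "hardware": "GPU",               # deployment hardware type
    "inference_engine": "TensorRT",  # inference engine used by that hardware type
    "weight_bits": 8,                # quantization bit width for weights
    "activation_bits": 8,            # quantization bit width for activations
}

# Illustrative configuration of the quantization algorithm (assumed keys).
qparams = {
    "algorithm": "LSQ",      # quantization algorithm to apply
    "finetune_epochs": 100,  # fine-tuning duration / number of training epochs
    "lr": 0.004,             # training hyperparameters, cf. Table 1
    "weight_decay": 1e-4,
}
```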
In some implementations, MQBench can be implemented with the PyTorch deep learning engine and supports the torch.fx (also called FX) feature. FX contains a symbolic tracer, an intermediate representation, and Python code generation, which allows deeper metaprogramming. In the embodiments of the present application, quantization algorithms and hardware-aware configurations can be implemented in MQBench, and a full-precision network model can be converted into a quantized network model through a single application programming interface (API) call. For example, the API can be called with code such as the following to convert a trained full-precision network model into a quantized network model:
1) Import the torch.quantization.quantize_fx package:
import torch.quantization.quantize_fx as quantize_fx
2) Create the full-precision network model model based on the set network structure self.config.model, and load the pre-trained parameters to initialize the full-precision network model:
model = model_entry(self.config.model, pretrained=True)
3) Obtain the configuration information qparams of the quantization algorithm and the deployment configuration information backend_params of the target hardware:
model_qconfig = get_qconfig(**qparams, **backend_params)
4) Obtain the set batch normalization layer folding strategy foldbn_strategy:
foldbn_config = get_foldbn_config(foldbn_strategy)
5) Call the model quantization API quantize_fx.prepare_qat_fx to quantize the full-precision network model model and obtain the quantized network model qModel:
qModel = quantize_fx.prepare_qat_fx(model, {"": model_qconfig}, foldbn_config)
In some implementations, after quantize_fx.prepare_qat_fx is called, the quantized network model qModel can be optimized by fine-tuning, calibration, and the like.
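A minimal sketch of such a fine-tuning step on the prepared model qModel, assuming a standard classification data loader; the final conversion call follows the usual torch.fx quantization workflow and may be arranged differently in practice:

```python
import torch
import torch.quantization.quantize_fx as quantize_fx

def finetune_and_convert(qModel, train_loader, epochs=1, lr=0.004):
    """Fine-tune the fake-quantized model, then convert it for deployment."""
    criterion = torch.nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(qModel.parameters(), lr=lr,
                                momentum=0.9, nesterov=True)
    qModel.train()
    for _ in range(epochs):
        for images, targets in train_loader:
            optimizer.zero_grad()
            loss = criterion(qModel(images), targets)
            loss.backward()
            optimizer.step()
    qModel.eval()
    # Convert the fine-tuned fake-quantized model into the deployable quantized model.
    return quantize_fx.convert_fx(qModel)
```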
MQBench acts as a bridge connecting quantization algorithms and deployment hardware. Fig. 6 is a schematic diagram of an application scenario of MQBench provided by an embodiment of the present application. As shown in Fig. 6, MQBench 60 mainly provides a reproducibility capability 61 for quantization algorithms and a deployability capability 62 for hardware platforms. The reproducibility capability 61 can support multiple quantization algorithms 70, including quantization-aware training algorithms 71 and post-training quantization algorithms 72, and the deployability capability 62 can support the deployment of quantization algorithms on different deployment hardware 80, including a CPU 81, a GPU 82, an ASIC 83, and a DSP 84. MQBench is described below from the two aspects of reproducibility and deployability of model quantization.
1) Reproducibility: the reproducibility of model quantization in MQBench is mainly reflected in the following dimensions:
Hardware-aware quantizer: for different hardware (such as CPUs, GPUs, ASICs, and DSPs), MQBench provides matching support for the computation graph patterns of the inference engine libraries used by the hardware (such as TVM, TensorRT, ACL, and SNPE), and can automatically match the insertion positions of quantization nodes in the computation graph based on the set inference engine library, where one hardware type corresponds to one inference engine and different hardware types may correspond to the same inference engine. MQBench supports five general-purpose software libraries (i.e., inference engines), including TensorRT for graphics processing unit (GPU) inference, ACL for application-specific integrated circuit (ASIC) inference, SNPE for mobile digital signal processor (DSP) inference, TVM for ARM central processing unit (CPU) inference, and FBGEMM for X86 server-side CPU inference. Each inference engine corresponds to one quantizer. Users can select a suitable inference engine from these five inference engines for model deployment according to the actual application scenario, and MQBench can determine, based on the selected inference engine, at least one processing layer to be quantized in the full-precision network model and the corresponding hardware-aware quantizer.
Quantization algorithm: MQBench reproduces multiple state-of-the-art (SOTA) quantization algorithms, including the learning-based algorithms LSQ, APoT, Quantization Interval Learning (QIL), and PACT, as well as the rule-based algorithms DSQ, LQ-Net, and DoReFa. Users can select a suitable quantization algorithm from the multiple quantization algorithms reproduced by MQBench according to the actual application scenario, and MQBench can quantize the full-precision network model to be quantized according to the selected quantization algorithm.
Neural network structure: the neural network structures supported by MQBench include ResNet-18, ResNet-50, MobileNetV2, Efficient-Net (using the Lite version of Efficient-Net and replacing the swish activation with ReLU6 to provide better integer support on hardware), and the neural network structure with group convolution, RegNetX-600MF.
Quantization bit width: MQBench supports multiple quantization bit widths such as 8-bit, 4-bit, 3-bit, and 2-bit. In some implementations, a quantization bit width of 8 bits may be used for post-training quantization algorithms and a quantization bit width of 4 bits may be used for quantization-aware training algorithms.
Training settings: in MQBench, parameter training is performed by fine-tuning for all quantization algorithms, and full-precision network models adopting the same neural network structure are all initialized with a unified pre-trained model, which reduces the inconsistency introduced in the initialization stage.
2) Deployability: MQBench makes the following optimizations regarding the deployability of model quantization:
BN layer folding: MQBench supports five BN (batch normalization) layer folding strategies and can fold the parameters of a BN layer into the corresponding convolutional layer according to the configured BN layer folding strategy. Users can select a suitable strategy from these five BN layer folding strategies according to the actual application scenario; a folding sketch is given after this list.
Computation graph with block structures: model quantization schemes in the related art only consider quantizing the inputs and weights of convolutional or fully connected layers. However, a neural network architecture may also include other operations, such as the element-wise addition in the ResNet architecture and the concatenation in the InceptionV3 architecture. MQBench considers different computation graph optimization levels for different inference engines and automatically matches the insertion positions of quantization nodes in the computation graph based on the set inference engine, so as to adapt to different computation graph optimization levels and build the computation graph of the corresponding quantized neural network.
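As a sketch of the arithmetic behind the BN layer folding mentioned above (folding the running statistics of a batch normalization layer into the convolution it depends on), assuming standard PyTorch modules; the fake quantization of the folded weight and the other folding strategies are omitted:

```python
import torch
import torch.nn as nn

@torch.no_grad()
def fold_bn_into_conv(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> None:
    """Fold BN running statistics into the conv weight and bias, in place."""
    std = torch.sqrt(bn.running_var + bn.eps)
    scale = bn.weight / std  # gamma / sqrt(running_var + eps)
    conv.weight.mul_(scale.reshape(-1, 1, 1, 1))  # merge the scale into the weights
    bias = conv.bias if conv.bias is not None else torch.zeros_like(bn.running_mean)
    # Merge the shift into the offset (bias) of the convolution.
    conv.bias = nn.Parameter(bn.bias + scale * (bias - bn.running_mean))
```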
Compared with model quantization open-source libraries in the related art, the reproducible and deployable model quantization algorithm library MQBench provided by the embodiments of the present application has at least the following improvements:
1) For network models adopting the same neural network structure, unified training hyperparameters are used for fine-tuning, so that model training tricks can be shared among multiple first network models with the same neural network structure and among multiple quantization algorithms, improving the accuracy of the quantization algorithms;
2) Full-precision network models adopting the same neural network structure are all initialized with a unified pre-trained model, which can reduce the inconsistency introduced in the initialization stage;
3) Multiple configurable neural network structures are supported;
4) Multiple configurable deployment hardware types and/or inference engines are supported;
5) A hardware-aware quantizer is used, which can improve the deployability of the quantized network model and its accuracy in actual deployment scenarios;
6) Multiple configurable BN layer folding strategies are supported;
7) Different computation graph optimization levels are considered for different inference engines.
Fig. 7 is a schematic diagram of the composition structure of a model quantization apparatus provided by an embodiment of the present application. As shown in Fig. 7, the model quantization apparatus 700 includes a first acquisition part 710, a first determination part 720, and a quantization part 730, where:
the first acquisition part 710 is configured to acquire a first network model to be quantized;
the first determination part 720 is configured to determine, based on set deployment configuration information, at least one processing layer to be quantized in the first network model and a quantization parameter for quantizing each of the processing layers;
the quantization part 730 is configured to quantize each of the processing layers in the first network model according to the quantization parameter to obtain a second network model.
In some embodiments, the first network model includes at least one block structure, and each block structure includes at least one processing layer; the first determination part is further configured to: determine, based on the set deployment configuration information, at least one processing layer to be quantized in each block structure in the first network model and a quantization parameter for quantizing each of the processing layers.
In some embodiments, the deployment configuration information includes an inference engine used by a deployment hardware type; the first determination part is further configured to: determine, based on the inference engine, a processing layer type to be quantized; and determine at least one processing layer in the first network model that matches the processing layer type as a processing layer to be quantized.
In some embodiments, the processing layer type includes a convolutional layer and a batch normalization layer; the first determination part is further configured to: determine at least one batch normalization layer in the first network model and the convolutional layer on which each batch normalization layer depends as processing layers to be quantized; obtain a set batch normalization layer folding strategy; and, based on the batch normalization layer folding strategy, fold each batch normalization layer in the first network model into the convolutional layer on which the batch normalization layer depends to obtain the folded first network model; the quantization part is further configured to: quantize each of the processing layers in the folded first network model according to the quantization parameter to obtain a second network model.
In some embodiments, the batch normalization layer folding strategy includes a removal state of the batch normalization layer, a coefficient update algorithm, statistical parameters to be merged into weights, and statistical parameters to be merged into offsets; the statistical parameters to be merged into weights include running statistics of the convolutional layer on which the batch normalization layer depends or statistics of the current batch, and the statistical parameters to be merged into offsets include running statistics of the convolutional layer on which the batch normalization layer depends or statistics of the current batch; the first determination part is further configured to: determine a scaling coefficient and a translation coefficient of each batch normalization layer in at least one batch normalization layer in the first network model; update the scaling coefficient and the translation coefficient of each batch normalization layer based on the coefficient update algorithm to obtain an updated scaling coefficient and an updated translation coefficient of each batch normalization layer; for each batch normalization layer, obtain the statistical parameters to be merged into weights and the statistical parameters to be merged into offsets of the batch normalization layer, merge the updated scaling coefficient of the batch normalization layer and the statistical parameters to be merged into weights into the weights of the convolutional layer on which the batch normalization layer depends, and merge the updated scaling coefficient and translation coefficient of the batch normalization layer and the statistical parameters to be merged into offsets into the offset of the convolutional layer; and, when the removal state of the batch normalization layer is removal, remove each batch normalization layer from the first network model.
In some embodiments, the first determination part is further configured to: determine, based on the inference engine, a target batch normalization layer folding strategy from multiple set batch normalization layer folding strategies.
In some embodiments, the quantization part is further configured to: quantize each of the processing layers in the first network model according to the quantization parameters, based on a set quantization algorithm and a first training data set, to obtain a second network model.
In some embodiments, the quantization parameters include a preset precision of a quantization scale, quantization symmetry, a quantization bit width, and a quantization granularity, the quantization symmetry includes symmetric quantization or asymmetric quantization, the quantization granularity includes layer-level quantization or feature-level quantization, and the quantization algorithm includes a quantization-aware training algorithm; the quantization part is further configured to: set one pseudo-quantizer for each of the processing layers in the first network model according to the quantization parameters to obtain a third network model, where the pseudo-quantizer is configured to: determine a quantization value range of the processing layer parameters based on the quantization bit width; determine a quantization scale satisfying the preset precision and a quantization zero point satisfying the quantization symmetry; based on the quantization granularity and within the quantization value range, perform uniform quantization on the processing layer parameters to be quantized using the quantization scale and the quantization zero point to obtain quantized processing layer parameters; and perform inverse uniform quantization on the quantized processing layer parameters based on the quantization scale and the quantization zero point to obtain de-quantized processing layer parameters; and train the parameters of each of the processing layers in the third network model based on a set quantization-aware training algorithm and the first training data set to obtain a second network model.
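A minimal sketch of the uniform quantize/de-quantize ("fake quantization") operation performed by such a pseudo-quantizer, for a per-tensor (layer-level) configuration; per-channel (feature-level) handling and the learning of the scale and zero point are omitted:

```python
import torch

def fake_quantize(x: torch.Tensor, scale: float, zero_point: int,
                  bits: int = 8, symmetric: bool = True) -> torch.Tensor:
    """Uniformly quantize x to the integer range given by `bits`, then de-quantize."""
    if symmetric:
        qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1  # e.g. -128..127, zero_point = 0
    else:
        qmin, qmax = 0, 2 ** bits - 1                          # e.g. 0..255
    q = torch.clamp(torch.round(x / scale + zero_point), qmin, qmax)  # uniform quantization
    return (q - zero_point) * scale                                   # inverse uniform quantization
```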
In some embodiments, the quantization part is further configured to: determine preset training hyperparameters corresponding to the neural network structure adopted by the first network model, where the training hyperparameters are the same for each of multiple kinds of preset deployment configuration information; and quantize each of the processing layers in the first network model according to the quantization parameters, using a set first training data set and based on the quantization algorithm and the training hyperparameters, to obtain a second quantized network model.
In some embodiments, the first acquisition part is further configured to: adjust the processing layers in a set neural network structure based on at least one kind of deployment configuration information to obtain at least one adjusted neural network structure; create at least one first network model based on the at least one adjusted neural network structure; initialize the parameters of the at least one first network model based on preset model parameters corresponding to the set neural network structure to obtain at least one initialized first network model; and determine, based on the set deployment configuration information, a first network model to be quantized from the at least one initialized first network model.
In some embodiments, the apparatus further includes: a second acquisition part configured to acquire a preset pre-trained model corresponding to the neural network structure, where the structure of the pre-trained model before the output layer is the same as the neural network structure; a pre-training part configured to train the parameters of the pre-trained model using a set second training data set to obtain the trained pre-trained model; and a second determination part configured to determine the trained parameters of the pre-trained model as the preset model parameters.
In some embodiments, the first acquisition part is further configured to: determine a target neural network structure from multiple preset neural network structures; and adjust the processing layers in the target neural network structure based on at least one kind of deployment configuration information to obtain at least one adjusted neural network structure.
The description of the above apparatus embodiments is similar to the description of the above method embodiments and has beneficial effects similar to those of the method embodiments. For technical details not disclosed in the apparatus embodiments of the present application, please refer to the description of the method embodiments of the present application.
In the embodiments of the present application and other embodiments, a "part" may be part of a circuit, part of a processor, part of a program or software, and so on; it may of course also be a unit, and may be modular or non-modular.
It should be noted that, in the embodiments of the present application, if the above model quantization method is implemented in the form of software function modules and sold or used as an independent product, it may also be stored in a computer-readable storage medium. Based on this understanding, the technical solutions of the embodiments of the present application, in essence or in the part contributing to the related art, may be embodied in the form of a software product. The software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a magnetic disk, or an optical disc. In this way, the embodiments of the present application are not limited to any specific combination of hardware and software.
An embodiment of the present application provides a computer device, including a memory and a processor. The memory stores a computer program executable on the processor, and the processor implements the steps of the above method when executing the program.
An embodiment of the present application provides a computer-readable storage medium on which a computer program is stored, and the computer program, when executed by a processor, implements the steps of the above method. The computer-readable storage medium may be transitory or non-transitory.
An embodiment of the present application provides a computer program, which includes computer-readable code. When the computer-readable code runs in a computer device, a processor in the computer device executes some or all of the steps of the above method.
An embodiment of the present application provides a computer program product. The computer program product includes a non-transitory computer-readable storage medium storing a computer program, and when the computer program is read and executed by a computer, some or all of the steps of the above method are implemented. The computer program product may be implemented by hardware, software, or a combination thereof. In some embodiments, the computer program product is embodied as a computer storage medium; in other embodiments, the computer program product is embodied as a software product, such as a software development kit (SDK).
It should be pointed out here that the descriptions of the above storage medium, computer program product, and device embodiments are similar to the description of the above method embodiments and have beneficial effects similar to those of the method embodiments. For technical details not disclosed in the storage medium, computer program product, computer program, and device embodiments of the present application, please refer to the description of the method embodiments of the present application.
It should be noted that Fig. 8 is a schematic diagram of a hardware entity of a computer device in an embodiment of the present application. As shown in Fig. 8, the hardware entity of the computer device 800 includes a processor 801, a communication interface 802, and a memory 803, where the processor 801 generally controls the overall operation of the computer device 800. The communication interface 802 enables the computer device to communicate with other terminals or servers over a network. The memory 803 is configured to store instructions and applications executable by the processor 801, and can also cache data to be processed or already processed by the processor 801 and the modules in the computer device 800 (for example, image data, audio data, voice communication data, and video communication data); it can be implemented by flash memory (FLASH) or random access memory (RAM). Data can be transferred among the processor 801, the communication interface 802, and the memory 803 through a bus 804.
It should be understood that references throughout the specification to "one embodiment" or "an embodiment" mean that a particular feature, structure, or characteristic related to the embodiment is included in at least one embodiment of the present application. Therefore, appearances of "in one embodiment" or "in an embodiment" in various places throughout the specification do not necessarily refer to the same embodiment. Furthermore, these particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. It should be understood that, in the various embodiments of the present application, the magnitude of the sequence numbers of the above processes does not imply an order of execution; the execution order of the processes should be determined by their functions and internal logic, and should not constitute any limitation on the implementation of the embodiments of the present application. The sequence numbers of the above embodiments of the present application are for description only and do not represent the advantages or disadvantages of the embodiments.
It should be noted that, in this document, the terms "comprise" and "include", or any other variant thereof, are intended to cover a non-exclusive inclusion, so that a process, method, article, or apparatus including a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that includes the element.
In the several embodiments provided in the present application, it should be understood that the disclosed devices and methods may be implemented in other ways. The device embodiments described above are merely illustrative. For example, the division of the units is only a logical functional division, and there may be other division methods in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented. In addition, the coupling, direct coupling, or communication connection between the components shown or discussed may be indirect coupling or communication connection through some interfaces, devices, or units, and may be electrical, mechanical, or in other forms.
The units described above as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units; they may be located in one place or distributed over multiple network units; some or all of the units may be selected according to actual needs to achieve the purpose of the solutions of the embodiments.
In addition, the functional units in the embodiments of the present application may all be integrated into one processing unit, or each unit may serve as a single unit separately, or two or more units may be integrated into one unit; the above integrated unit may be implemented in the form of hardware, or in the form of hardware plus software functional units.
Those of ordinary skill in the art can understand that all or part of the steps for implementing the above method embodiments may be completed by hardware related to program instructions. The aforementioned program may be stored in a computer-readable storage medium, and when executed, the program performs the steps of the above method embodiments. The aforementioned storage medium includes various media that can store program code, such as a removable storage device, a read-only memory (ROM), a magnetic disk, or an optical disc.
Alternatively, if the above integrated unit of the embodiments of the present application is implemented in the form of a software function module and sold or used as an independent product, it may also be stored in a computer-readable storage medium. Based on this understanding, the technical solutions of the embodiments of the present application, in essence or in the part contributing to the related art, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media that can store program code, such as a removable storage device, a ROM, a magnetic disk, or an optical disc.
The above are only implementations of the present application, but the protection scope of the present application is not limited thereto. Any changes or substitutions that a person skilled in the art can readily conceive of within the technical scope disclosed in the present application shall be covered by the protection scope of the present application.
Industrial Applicability
The embodiments of the present application disclose a model quantization method, apparatus, device, storage medium, computer program product, and computer program, where the method includes: acquiring a first network model to be quantized; determining, based on set deployment configuration information, at least one processing layer to be quantized in the first network model and a quantization parameter for quantizing each of the processing layers; and quantizing each of the processing layers in the first network model according to the quantization parameter to obtain a second network model. According to the embodiments of the present application, the deployment configuration information of the hardware platform on which the model is deployed can be fully considered in the process of quantizing the first network model, so as to obtain a second network model deployable on the corresponding hardware platform.

Claims (28)

  1. A model quantization method, the method comprising:
    acquiring a first network model to be quantized;
    determining, based on set deployment configuration information, at least one processing layer to be quantized in the first network model and a quantization parameter for quantizing each of the processing layers;
    quantizing each of the processing layers in the first network model according to the quantization parameter to obtain a second network model.
  2. The method according to claim 1, wherein the first network model comprises at least one block structure, and each of the block structures comprises at least one processing layer;
    the determining, based on the set deployment configuration information, at least one processing layer to be quantized in the first network model and a quantization parameter for quantizing each of the processing layers comprises:
    determining, based on the set deployment configuration information, at least one processing layer to be quantized in each of the block structures in the first network model and a quantization parameter for quantizing each of the processing layers.
  3. The method according to claim 1, wherein the deployment configuration information comprises an inference engine used by a deployment hardware type;
    the determining, based on the set deployment configuration information, at least one processing layer to be quantized in the first network model comprises:
    determining, based on the inference engine, a processing layer type to be quantized;
    determining at least one processing layer in the first network model that matches the processing layer type as a processing layer to be quantized.
  4. The method according to claim 3, wherein the processing layer type comprises a convolutional layer and a batch normalization layer;
    the determining at least one processing layer in the first network model that matches the processing layer type as a processing layer to be quantized comprises:
    determining at least one batch normalization layer in the first network model and the convolutional layer on which each of the batch normalization layers depends as processing layers to be quantized;
    obtaining a set batch normalization layer folding strategy;
    folding, based on the batch normalization layer folding strategy, each of the batch normalization layers in the first network model into the convolutional layer on which the batch normalization layer depends to obtain the folded first network model;
    the quantizing each of the processing layers in the first network model according to the quantization parameter to obtain a second network model comprises:
    quantizing each of the processing layers in the folded first network model according to the quantization parameter to obtain a second network model.
  5. The method according to claim 4, wherein the batch normalization layer folding strategy comprises a removal state of the batch normalization layer, a coefficient update algorithm, statistical parameters to be merged into weights, and statistical parameters to be merged into offsets; the statistical parameters to be merged into weights comprise running statistics of the convolutional layer on which the batch normalization layer depends or statistics of the current batch, and the statistical parameters to be merged into offsets comprise running statistics of the convolutional layer on which the batch normalization layer depends or statistics of the current batch;
    the folding, based on the batch normalization layer folding strategy, each of the batch normalization layers in the first network model into the convolutional layer on which the batch normalization layer depends comprises:
    determining a scaling coefficient and a translation coefficient of each of the batch normalization layers in at least one batch normalization layer in the first network model;
    updating the scaling coefficient and the translation coefficient of each of the batch normalization layers based on the coefficient update algorithm to obtain an updated scaling coefficient and an updated translation coefficient of each of the batch normalization layers;
    for each of the batch normalization layers, obtaining the statistical parameters to be merged into weights and the statistical parameters to be merged into offsets of the batch normalization layer, merging the updated scaling coefficient of the batch normalization layer and the statistical parameters to be merged into weights into the weights of the convolutional layer on which the batch normalization layer depends, and merging the updated scaling coefficient and the updated translation coefficient of the batch normalization layer and the statistical parameters to be merged into offsets into the offset of the convolutional layer;
    removing each of the batch normalization layers from the first network model when the removal state of the batch normalization layer is removal.
  6. The method according to claim 4 or 5, wherein the obtaining a set batch normalization layer folding strategy comprises:
    determining, based on the inference engine, a target batch normalization layer folding strategy from multiple set batch normalization layer folding strategies.
  7. The method according to any one of claims 1 to 6, wherein the quantizing each of the processing layers in the first network model according to the quantization parameter to obtain a second network model comprises:
    quantizing, based on a set quantization algorithm and a first training data set, each of the processing layers in the first network model according to the quantization parameters to obtain a second network model.
  8. The method according to claim 7, wherein the quantization parameters comprise a preset precision of a quantization scale, a quantization symmetry, a quantization bit width and a quantization granularity; the quantization symmetry comprises symmetric quantization or asymmetric quantization, the quantization granularity comprises layer-level quantization or feature-level quantization, and the quantization algorithm comprises a quantization-aware training algorithm;
    the quantizing, based on the set quantization algorithm and the first training data set, each of the processing layers in the first network model according to the quantization parameters to obtain the second network model comprises:
    setting, according to the quantization parameters, a pseudo-quantizer for each of the processing layers in the first network model to obtain a third network model, wherein the pseudo-quantizer is configured to: determine a quantization value range of processing layer parameters based on the quantization bit width; determine a quantization scale satisfying the preset precision and a quantization zero point satisfying the quantization symmetry; perform, based on the quantization granularity and within the quantization value range, uniform quantization on the processing layer parameters to be quantized using the quantization scale and the quantization zero point to obtain quantized processing layer parameters; and perform inverse uniform quantization on the quantized processing layer parameters based on the quantization scale and the quantization zero point to obtain dequantized processing layer parameters; and
    training, based on a set quantization-aware training algorithm and the first training data set, the parameters of each of the processing layers in the third network model to obtain the second network model (see the pseudo-quantizer and quantization-aware training sketches after the claims).
  9. The method according to claim 7 or 8, wherein the quantizing, based on the set quantization algorithm and the first training data set, each of the processing layers in the first network model according to the quantization parameters to obtain the second network model comprises:
    determining preset training hyperparameters corresponding to a neural network structure adopted by the first network model, wherein the training hyperparameters are the same for each piece of deployment configuration information among a plurality of pieces of preset deployment configuration information; and
    quantizing, using a set first training data set and based on the quantization algorithm and the training hyperparameters, each of the processing layers in the first network model according to the quantization parameters to obtain the second network model.
  10. The method according to any one of claims 1 to 9, wherein the acquiring a first network model to be quantized comprises:
    adjusting, based on at least one piece of deployment configuration information, processing layers in a set neural network structure to obtain at least one adjusted neural network structure;
    creating at least one first network model based on the at least one adjusted neural network structure;
    initializing parameters of the at least one first network model based on preset model parameters corresponding to the set neural network structure to obtain at least one initialized first network model; and
    determining, based on the set deployment configuration information, the first network model to be quantized from the at least one initialized first network model.
  11. The method according to claim 10, further comprising:
    acquiring a preset pre-trained model corresponding to the neural network structure, wherein the structure of the pre-trained model before the output layer is the same as the neural network structure;
    training parameters of the pre-trained model using a set second training data set to obtain the trained pre-trained model; and
    determining the parameters of the trained pre-trained model as the preset model parameters.
  12. The method according to claim 10 or 11, wherein the adjusting, based on at least one piece of deployment configuration information, processing layers in the set neural network structure to obtain at least one adjusted neural network structure comprises:
    determining a target neural network structure from a plurality of preset neural network structures; and
    adjusting, based on the at least one piece of deployment configuration information, processing layers in the target neural network structure to obtain the at least one adjusted neural network structure.
  13. A model quantization apparatus, comprising:
    a first acquisition part configured to acquire a first network model to be quantized;
    a first determination part configured to determine, based on set deployment configuration information, at least one processing layer to be quantized in the first network model and quantization parameters for quantizing each of the processing layers; and
    a quantization part configured to quantize each of the processing layers in the first network model according to the quantization parameters to obtain a second network model.
  14. The apparatus according to claim 13, wherein the first network model comprises at least one block structure, and each block structure comprises at least one processing layer; the first determination part is further configured to determine, based on the set deployment configuration information, at least one processing layer to be quantized in each block structure of the first network model and quantization parameters for quantizing each of the processing layers.
  15. The apparatus according to claim 13, wherein the deployment configuration information comprises an inference engine adopted by a deployment hardware type; the first determination part is further configured to: determine, based on the inference engine, a processing layer type to be quantized; and determine at least one processing layer in the first network model that matches the processing layer type as the processing layer to be quantized.
  16. The apparatus according to claim 15, wherein the processing layer type comprises a convolutional layer and a batch normalization layer; the first determination part is further configured to: determine at least one batch normalization layer in the first network model and the convolutional layer on which each batch normalization layer depends as processing layers to be quantized; acquire a set batch normalization layer folding strategy; and fold, based on the batch normalization layer folding strategy, each batch normalization layer in the first network model into the convolutional layer on which the batch normalization layer depends to obtain the folded first network model; and the quantization part is further configured to quantize each of the processing layers in the folded first network model according to the quantization parameters to obtain the second network model.
  17. The apparatus according to claim 16, wherein the batch normalization layer folding strategy comprises a removal state of the batch normalization layer, a coefficient update algorithm, statistical parameters to be merged into the weight, and statistical parameters to be merged into the bias; the statistical parameters to be merged into the weight comprise the running statistics of the convolutional layer on which the batch normalization layer depends or the statistics of the current batch, and the statistical parameters to be merged into the bias comprise the running statistics of the convolutional layer on which the batch normalization layer depends or the statistics of the current batch; the first determination part is further configured to: determine a scaling coefficient and a shift coefficient of each batch normalization layer among at least one batch normalization layer in the first network model; update, based on the coefficient update algorithm, the scaling coefficient and the shift coefficient of each batch normalization layer to obtain an updated scaling coefficient and an updated shift coefficient of each batch normalization layer; for each batch normalization layer, acquire the statistical parameters to be merged into the weight and the statistical parameters to be merged into the bias of the batch normalization layer, merge the updated scaling coefficient of the batch normalization layer and the statistical parameters to be merged into the weight into the weight of the convolutional layer on which the batch normalization layer depends, and merge the updated scaling coefficient and shift coefficient of the batch normalization layer and the statistical parameters to be merged into the bias into the bias of the convolutional layer; and remove each batch normalization layer from the first network model in a case where the removal state of the batch normalization layer is removal.
  18. The apparatus according to claim 16 or 17, wherein the first determination part is further configured to determine, based on the inference engine, a target batch normalization layer folding strategy from a plurality of set batch normalization layer folding strategies.
  19. The apparatus according to any one of claims 13 to 18, wherein the quantization part is further configured to quantize, based on a set quantization algorithm and a first training data set, each of the processing layers in the first network model according to the quantization parameters to obtain the second network model.
  20. The apparatus according to claim 19, wherein the quantization parameters comprise a preset precision of a quantization scale, a quantization symmetry, a quantization bit width and a quantization granularity; the quantization symmetry comprises symmetric quantization or asymmetric quantization, the quantization granularity comprises layer-level quantization or feature-level quantization, and the quantization algorithm comprises a quantization-aware training algorithm; the quantization part is further configured to: set, according to the quantization parameters, a pseudo-quantizer for each of the processing layers in the first network model to obtain a third network model, wherein the pseudo-quantizer is configured to: determine a quantization value range of processing layer parameters based on the quantization bit width; determine a quantization scale satisfying the preset precision and a quantization zero point satisfying the quantization symmetry; perform, based on the quantization granularity and within the quantization value range, uniform quantization on the processing layer parameters to be quantized using the quantization scale and the quantization zero point to obtain quantized processing layer parameters; and perform inverse uniform quantization on the quantized processing layer parameters based on the quantization scale and the quantization zero point to obtain dequantized processing layer parameters; and train, based on a set quantization-aware training algorithm and the first training data set, the parameters of each of the processing layers in the third network model to obtain the second network model.
  21. The apparatus according to claim 19 or 20, wherein the quantization part is further configured to: determine preset training hyperparameters corresponding to a neural network structure adopted by the first network model, wherein the training hyperparameters are the same for each piece of deployment configuration information among a plurality of pieces of preset deployment configuration information; and quantize, using a set first training data set and based on the quantization algorithm and the training hyperparameters, each of the processing layers in the first network model according to the quantization parameters to obtain the second network model.
  22. The apparatus according to any one of claims 13 to 21, wherein the first acquisition part is further configured to: adjust, based on at least one piece of deployment configuration information, processing layers in a set neural network structure to obtain at least one adjusted neural network structure; create at least one first network model based on the at least one adjusted neural network structure; initialize parameters of the at least one first network model based on preset model parameters corresponding to the set neural network structure to obtain at least one initialized first network model; and determine, based on the set deployment configuration information, the first network model to be quantized from the at least one initialized first network model.
  23. The apparatus according to claim 22, further comprising: a second acquisition part configured to acquire a preset pre-trained model corresponding to the neural network structure, wherein the structure of the pre-trained model before the output layer is the same as the neural network structure; a pre-training part configured to train parameters of the pre-trained model using a set second training data set to obtain the trained pre-trained model; and a second determination part configured to determine the parameters of the trained pre-trained model as the preset model parameters.
  24. The apparatus according to claim 22 or 23, wherein the first acquisition part is further configured to: determine a target neural network structure from a plurality of preset neural network structures; and adjust, based on the at least one piece of deployment configuration information, processing layers in the target neural network structure to obtain the at least one adjusted neural network structure.
  25. A computer device, comprising a memory and a processor, wherein the memory stores a computer program executable on the processor, and the processor, when executing the program, implements the steps of the method according to any one of claims 1 to 12.
  26. A computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 12.
  27. A computer program product, comprising a non-transitory computer-readable storage medium storing a computer program, wherein the computer program, when read and executed by a computer, implements the steps of the method according to any one of claims 1 to 12.
  28. A computer program, comprising computer-readable code, wherein, in a case where the computer-readable code runs in a computer device, a processor in the computer device performs the steps of the method according to any one of claims 1 to 12.
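Note on claims 5 and 17: the batch normalization folding can be illustrated with a short sketch. The following Python/PyTorch snippet is only one possible reading of the claim language, not the disclosed implementation; the function name fold_bn_into_conv and the choice of running statistics (rather than current-batch statistics) are assumptions made for illustration.

```python
import torch
import torch.nn as nn

def fold_bn_into_conv(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    """Illustrative folding of a BatchNorm2d layer into the Conv2d it depends on."""
    gamma, beta = bn.weight, bn.bias              # scaling and shift coefficients
    mean, var = bn.running_mean, bn.running_var   # running statistics
    std = torch.sqrt(var + bn.eps)

    fused = nn.Conv2d(conv.in_channels, conv.out_channels, conv.kernel_size,
                      stride=conv.stride, padding=conv.padding,
                      dilation=conv.dilation, groups=conv.groups, bias=True)
    # Merge the updated scaling coefficient and the statistics into the weight.
    fused.weight.data = conv.weight.data * (gamma / std).reshape(-1, 1, 1, 1)
    # Merge the scaling coefficient, shift coefficient and statistics into the bias.
    conv_bias = conv.bias.data if conv.bias is not None else torch.zeros_like(mean)
    fused.bias.data = beta + (conv_bias - mean) * gamma / std
    return fused  # the standalone BN layer can then be removed from the model

# Example usage on freshly built layers:
# fused = fold_bn_into_conv(nn.Conv2d(16, 32, 3), nn.BatchNorm2d(32))
```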
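The pseudo-quantizer of claims 8 and 20 performs uniform quantization followed by inverse uniform quantization (commonly called fake quantization). The sketch below is a hedged illustration under assumed conventions: the function name fake_quantize, the clamping rule and the symmetric/asymmetric range choice are not taken from the disclosure. Layer-level granularity corresponds to a single scale and zero point per tensor; feature-level granularity would pass per-channel scale and zero_point tensors that broadcast against the input.

```python
import torch

def fake_quantize(x: torch.Tensor, scale: torch.Tensor, zero_point: torch.Tensor,
                  bit_width: int = 8, symmetric: bool = True) -> torch.Tensor:
    """Uniformly quantize x and immediately dequantize it (illustrative)."""
    # The quantization bit width fixes the quantized value range.
    if symmetric:
        qmin, qmax = -(2 ** (bit_width - 1)), 2 ** (bit_width - 1) - 1
    else:
        qmin, qmax = 0, 2 ** bit_width - 1
    # Uniform quantization with the quantization scale and zero point,
    # clamped to the value range determined above.
    q = torch.clamp(torch.round(x / scale + zero_point), qmin, qmax)
    # Inverse uniform quantization back to floating point.
    return (q - zero_point) * scale

# Layer-level (per-tensor) example with assumed values:
# y = fake_quantize(torch.randn(4, 8), torch.tensor(0.02), torch.tensor(0.0))
```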
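Claims 8 and 20 then train the network carrying these pseudo-quantizers with a quantization-aware training algorithm. A minimal sketch of how a wrapper might attach weight fake-quantization to a convolutional layer is shown below; it reuses the fake_quantize helper from the previous sketch, and the max-based scale and the straight-through estimator are assumptions rather than details from the disclosure.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QuantConv2d(nn.Module):
    """Convolution with fake-quantized weights, for quantization-aware training."""
    def __init__(self, conv: nn.Conv2d, bit_width: int = 8):
        super().__init__()
        self.conv = conv
        self.bit_width = bit_width

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = self.conv.weight
        # Assumed max-based quantization scale; zero point 0 for symmetric quantization.
        scale = w.detach().abs().max() / (2 ** (self.bit_width - 1) - 1)
        # Straight-through estimator: the forward pass uses the fake-quantized
        # weight, while gradients flow to the full-precision weight.
        w_q = w + (fake_quantize(w, scale, torch.tensor(0.0), self.bit_width) - w).detach()
        return F.conv2d(x, w_q, self.conv.bias, self.conv.stride,
                        self.conv.padding, self.conv.dilation, self.conv.groups)

# Example: wrap a layer, then train the model as usual with a standard optimizer.
# qconv = QuantConv2d(nn.Conv2d(16, 32, 3, padding=1), bit_width=8)
```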
PCT/CN2022/071377 2021-09-03 2022-01-11 Model quantization method and apparatus, device, storage medium, computer program product, and computer program WO2023029349A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111030764.1A CN113780551B (en) 2021-09-03 2021-09-03 Model quantization method, device, equipment, storage medium and computer program product
CN202111030764.1 2021-09-03

Publications (1)

Publication Number Publication Date
WO2023029349A1

Family

ID=78840925

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/071377 WO2023029349A1 (en) 2021-09-03 2022-01-11 Model quantization method and apparatus, device, storage medium, computer program product, and computer program

Country Status (2)

Country Link
CN (1) CN113780551B (en)
WO (1) WO2023029349A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116187420A (en) * 2023-05-04 2023-05-30 上海齐感电子信息科技有限公司 Training method, system, equipment and medium for lightweight deep neural network
CN116739039A (en) * 2023-05-05 2023-09-12 北京百度网讯科技有限公司 Quantization method, device, equipment and medium of distributed deployment model

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113780551B (en) * 2021-09-03 2023-03-24 北京市商汤科技开发有限公司 Model quantization method, device, equipment, storage medium and computer program product
CN114580281A (en) * 2022-03-04 2022-06-03 北京市商汤科技开发有限公司 Model quantization method, apparatus, device, storage medium, and program product
CN114611697B (en) * 2022-05-11 2022-09-09 上海登临科技有限公司 Neural network quantification and deployment method, system, electronic device and storage medium
CN115238873B (en) * 2022-09-22 2023-04-07 深圳市友杰智新科技有限公司 Neural network model deployment method and device, and computer equipment
CN116630632B (en) * 2023-07-25 2023-11-03 腾讯科技(深圳)有限公司 Image segmentation model quantization method, device and equipment and computer storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111898751A (en) * 2020-07-29 2020-11-06 苏州浪潮智能科技有限公司 Data processing method, system, equipment and readable storage medium
US20210174214A1 (en) * 2019-12-10 2021-06-10 The Mathworks, Inc. Systems and methods for quantizing a neural network
CN113282535A (en) * 2021-05-25 2021-08-20 北京市商汤科技开发有限公司 Quantization processing method and device and quantization processing chip
CN113780551A (en) * 2021-09-03 2021-12-10 北京市商汤科技开发有限公司 Model quantization method, device, equipment, storage medium and computer program product

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109460613A (en) * 2018-11-12 2019-03-12 北京迈格威科技有限公司 Model method of cutting out and device
CN110443165B (en) * 2019-07-23 2022-04-29 北京迈格威科技有限公司 Neural network quantization method, image recognition method, device and computer equipment
US20210089925A1 (en) * 2019-09-24 2021-03-25 Vahid PARTOVI NIA Training method for quantizing the weights and inputs of a neural network
CN111783974A (en) * 2020-08-12 2020-10-16 成都佳华物链云科技有限公司 Model construction and image processing method and device, hardware platform and storage medium


Also Published As

Publication number Publication date
CN113780551A (en) 2021-12-10
CN113780551B (en) 2023-03-24

Similar Documents

Publication Publication Date Title
WO2023029349A1 (en) Model quantization method and apparatus, device, storage medium, computer program product, and computer program
US20240104378A1 (en) Dynamic quantization of neural networks
CN110363279B (en) Image processing method and device based on convolutional neural network model
TW201918939A (en) Method and apparatus for learning low-precision neural network
WO2019238029A1 (en) Convolutional neural network system, and method for quantifying convolutional neural network
US11604647B2 (en) Mixed precision capable hardware for tuning a machine learning model
TW201915839A (en) Method and apparatus for quantizing artificial neural network and floating-point neural network
US20200117981A1 (en) Data representation for dynamic precision in neural network cores
CN114341892A (en) Machine learning hardware with reduced precision parameter components for efficient parameter updating
TWI744724B (en) Method of processing convolution neural network
US11704556B2 (en) Optimization methods for quantization of neural network models
WO2023165139A1 (en) Model quantization method and apparatus, device, storage medium and program product
WO2022111002A1 (en) Method and apparatus for training neural network, and computer readable storage medium
WO2023272972A1 (en) Neural network search method and apparatus, and device, storage medium and program product
KR20230076641A (en) Apparatus and method for floating-point operations
CN116472538A (en) Method and system for quantifying neural networks
CN114580625A (en) Method, apparatus, and computer-readable storage medium for training neural network
Liu et al. Block-Wise Dynamic-Precision Neural Network Training Acceleration via Online Quantization Sensitivity Analytics
CN111950689A (en) Neural network training method and device
US20230342613A1 (en) System and method for integer only quantization aware training on edge devices
Naganawa et al. SIMD-Constrained Lookup Table for Accelerating Variable-Weighted Convolution on x86/64 CPUs
WO2024065530A1 (en) Methods and apparatus to perform artificial intelligence-based sparse computation based on hybrid pattern and dynamic encoding
WO2024060727A1 (en) Method and apparatus for training neural network model, and device and system
KR20240077167A (en) Data processing method and computing device for convolution operation
KR20230020856A (en) Device and Method for Quantizing Parameters of Neural Network

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22862500

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE