CN114386469A - Method and device for quantizing convolutional neural network model and electronic equipment - Google Patents

Method and device for quantizing convolutional neural network model and electronic equipment

Info

Publication number
CN114386469A
CN114386469A
Authority
CN
China
Prior art keywords
layer
coefficient
quantized
quantization
block
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011139056.7A
Other languages
Chinese (zh)
Inventor
吕朦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Gree Electric Appliances Inc of Zhuhai
Zhuhai Zero Boundary Integrated Circuit Co Ltd
Original Assignee
Gree Electric Appliances Inc of Zhuhai
Zhuhai Zero Boundary Integrated Circuit Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Gree Electric Appliances Inc of Zhuhai, Zhuhai Zero Boundary Integrated Circuit Co Ltd filed Critical Gree Electric Appliances Inc of Zhuhai
Priority to CN202011139056.7A priority Critical patent/CN114386469A/en
Publication of CN114386469A publication Critical patent/CN114386469A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • G06N20/20Ensemble learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Medical Informatics (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The invention relates to a method and a device for quantizing a convolutional neural network model, and to electronic equipment. For each block, the input data is already quantized integer data; the quantized weight is obtained by multiplying the weight of the convolutional layer by the scaling coefficient γ′ of the BN layer and then quantizing. In the convolutional layer, the bias is multiplied by γ′, added to the translation coefficient β′ of the BN layer, and quantized to obtain the quantized bias; the quantized input data is multiplied by the quantized weight, the quantized bias is added, and the resulting integer data is input to the normalization layer. The normalization layer divides by the first quantization coefficient used in quantization and multiplies by the second quantization coefficient of the next block's input data, and the resulting integer data is input to the activation layer. After the activation layer applies the activation function, the resulting integer data is input to the next block. The invention can complete end-to-end integer data path calculation and improve the calculation speed of the model.

Description

Method and device for quantizing convolutional neural network model and electronic equipment
Technical Field
The present invention relates to the field of neural network technologies, and in particular, to a method and an apparatus for quantizing a convolutional neural network model, and an electronic device.
Background
Convolutional neural networks are currently the mainstream algorithm in the field of artificial intelligence, but such models have a large number of parameters, the required hardware is costly, and deployment in products is difficult. In general, the hardware resources of embedded devices in smart household appliances are very limited and subject to strict power-consumption requirements, and the storage and memory space of the processor is small, which makes implementing a convolutional neural network at the edge difficult. Model quantization can effectively reduce the memory required by the model at a low cost in precision, and is an important means of overcoming this difficulty.
On embedded platforms, resources are scarce: a neural network algorithm requires a large amount of computation, floating-point computation on a CPU (Central Processing Unit) consumes considerable resources, and the calculation speed is slow. Accelerating integer computation on an NPU (Neural-network Processing Unit) is therefore the practical industrial goal.
A survey of the quantization schemes for convolutional neural networks currently on the market shows that the mature offline quantization approach is the TensorRT quantization method. This method provides high-throughput, low-latency inference on NVIDIA platforms, and the implementation of each layer is relatively simple. However, for each computing unit block, after the input and the weight of the convolutional layer of the current block are quantized, the resulting integer result must be restored to a floating-point real number, so the computation result has to be moved out of the NPU, restored to floating point, and handed to the CPU before it can be passed to the next layer; an end-to-end operation path therefore cannot be constructed.
Disclosure of Invention
The invention provides a method, a device and electronic equipment for quantizing a convolutional neural network model, which are used for solving the problem that an end-to-end operation path cannot be constructed when a convolutional neural network model is quantized today. The technical scheme of the invention is as follows:
according to a first aspect of embodiments of the present invention, there is provided a method for quantizing a convolutional neural network model, where the convolutional neural network model includes a plurality of computing unit blocks, the method including:
for each block, the input data is quantized integer data; the weight of the convolutional layer is multiplied by the scaling coefficient γ′ of the normalization (BN) layer and then quantized to obtain the quantized weight;
in the convolutional layer, the bias of the block is multiplied by the scaling coefficient γ′, added to the translation coefficient β′ of the BN layer, and then quantized to obtain the quantized bias; the input data is multiplied by the quantized weight, the quantized bias is added, and the resulting integer data is input to the normalization layer;
in the normalization layer, the input integer data is divided by the first quantization coefficient used in quantization and multiplied by the second quantization coefficient of the next block's input data, and the resulting integer data is input to the activation layer;
and after the activation layer operates on the input integer data through the activation function, the resulting integer data is input to the next block.
As a possible implementation, the method further includes:
and acquiring the input data of the current block, and, if the input data of the current block is floating-point data, obtaining quantized integer data by using the second quantization coefficient of the input data.
As a possible implementation, the scaling coefficient γ′ of the BN layer is set to the second quantization coefficient of the next block's input data divided by the first quantization coefficient, and the translation coefficient β′ is set to zero.
As a possible implementation, the first quantization coefficient is the product of the second quantization coefficient and the third quantization coefficient divided by the shift used during multiplication, where the third quantization coefficient is the quantization coefficient corresponding to the weight.
As a possible implementation, multiplying the bias of the block by the scaling coefficient γ′ in the convolutional layer, adding the translation coefficient β′ of the BN layer and quantizing includes:
in the convolutional layer, multiplying the bias of the block by the scaling coefficient γ′, adding the resulting value to the translation coefficient β′ of the BN layer, and then quantizing the sum using the first quantization coefficient.
As a possible implementation manner, when the input data of the convolutional neural network model is sample data, the method further includes:
determining the loss precision of the convolutional neural network model according to the sample data and the output data of the quantized convolutional neural network model;
and when the loss precision is determined to exceed the set model threshold, increasing the number of bits the current convolutional neural network model uses for weight quantization.
As a possible implementation, when the loss precision is determined to exceed the set threshold, increasing the number of quantization bits of the current convolutional neural network model includes:
calculating the layer loss precision of each block in the convolutional neural network model;
and increasing the number of bits of weight quantization for the block with the highest layer loss precision.
As a possible implementation, the method further comprises:
and determining the second quantization coefficient used by each block by means of KL divergence, and determining the third quantization coefficient used by each block by means of unsaturated symmetric quantization.
According to a second aspect of the embodiments of the present invention, there is provided an apparatus for quantizing a convolutional neural network model, including:
the weight quantization module is configured, for each block whose input data is quantized integer data, to multiply the weight of the convolutional layer by the scaling coefficient γ′ of the normalization (BN) layer and then quantize it to obtain the quantized weight;
the bias quantization module is configured to multiply the bias of the block by the scaling coefficient γ′ in the convolutional layer, add the translation coefficient β′ of the BN layer, and quantize the sum to obtain the quantized bias, then multiply the input data by the quantized weight, add the quantized bias, and input the resulting integer data to the normalization layer;
the normalization module is configured to divide the input integer data by the first quantization coefficient used in quantization in the normalization layer, multiply by the second quantization coefficient of the next block's input data, and input the resulting integer data to the activation layer;
and the activation operation module is configured to input the resulting integer data to the next block after the activation layer operates on the input integer data through the activation function.
As a possible implementation, the weight quantization module is specifically further configured to:
and acquiring the input data of the current block, and, if the input data of the current block is floating-point data, obtaining quantized integer data by using the second quantization coefficient of the input data.
As a possible implementation, the scaling coefficient γ′ of the BN layer is set to the second quantization coefficient of the next block's input data divided by the first quantization coefficient, and the translation coefficient β′ is set to zero.
As a possible implementation, the first quantization coefficient is the product of the second quantization coefficient and the third quantization coefficient divided by the shift used during multiplication, where the third quantization coefficient is the quantization coefficient corresponding to the weight.
As a possible implementation, the bias quantization module multiplying the bias of the block by the scaling coefficient γ′ in the convolutional layer, adding the translation coefficient β′ of the BN layer and quantizing includes:
in the convolutional layer, multiplying the bias of the block by the scaling coefficient γ′, adding the resulting value to the translation coefficient β′ of the BN layer, and then quantizing the sum using the first quantization coefficient.
As a possible implementation manner, when the input data of the convolutional neural network model is sample data, the apparatus further includes:
the precision adjusting module is configured to determine the loss precision of the convolutional neural network model according to the sample data and the output data of the quantized convolutional neural network model, and, when the loss precision is determined to exceed the set model threshold, to increase the number of bits the current convolutional neural network model uses for weight quantization.
As a possible implementation, when the loss precision is determined to exceed the set threshold, increasing the number of quantization bits of the current convolutional neural network model includes:
calculating the layer loss precision of each block in the convolutional neural network model;
and increasing the number of bits of weight quantization for the block with the highest layer loss precision.
As a possible implementation, the apparatus further comprises:
and the coefficient determining module is configured to determine the second quantization coefficient used by each block by means of KL divergence, and to determine the third quantization coefficient used by each block by means of unsaturated symmetric quantization.
According to a third aspect of embodiments of the present invention, there is provided an electronic apparatus, including: a memory for storing executable instructions;
a processor for reading and executing the executable instructions stored in the memory to perform the method steps of:
for each block, the input data is quantized integer data; the weight of the convolutional layer is multiplied by the scaling coefficient γ′ of the normalization (BN) layer and then quantized to obtain the quantized weight;
in the convolutional layer, the bias of the block is multiplied by the scaling coefficient γ′, added to the translation coefficient β′ of the BN layer, and then quantized to obtain the quantized bias; the input data is multiplied by the quantized weight, the quantized bias is added, and the resulting integer data is input to the normalization layer;
in the normalization layer, the input integer data is divided by the first quantization coefficient used in quantization and multiplied by the second quantization coefficient of the next block's input data, and the resulting integer data is input to the activation layer;
and after the activation layer operates on the input integer data through the activation function, the resulting integer data is input to the next block.
As a possible implementation, the processor is further configured to:
and acquiring the input data of the current block, and, if the input data of the current block is floating-point data, obtaining quantized integer data by using the second quantization coefficient of the input data.
As a possible implementation, the scaling coefficient γ′ of the BN layer is set to the second quantization coefficient of the next block's input data divided by the first quantization coefficient, and the translation coefficient β′ is set to zero.
As a possible implementation, the first quantization coefficient is the product of the second quantization coefficient and the third quantization coefficient divided by the shift used during multiplication, where the third quantization coefficient is the quantization coefficient corresponding to the weight.
As a possible implementation, the processor multiplying the bias of the block by the scaling coefficient γ′ in the convolutional layer, adding the translation coefficient β′ of the BN layer and quantizing includes:
in the convolutional layer, multiplying the bias of the block by the scaling coefficient γ′, adding the resulting value to the translation coefficient β′ of the BN layer, and then quantizing the sum using the first quantization coefficient.
As a possible implementation manner, when the input data of the convolutional neural network model is sample data, the processor is further configured to:
determining the loss precision of the convolutional neural network model according to the sample data and the output data of the quantized convolutional neural network model;
and when the loss precision is determined to exceed the set model threshold, increasing the number of bits the current convolutional neural network model uses for weight quantization.
As a possible implementation, when the loss precision is determined to exceed the set threshold, increasing the number of quantization bits of the current convolutional neural network model includes:
calculating the layer loss precision of each block in the convolutional neural network model;
and increasing the number of bits of weight quantization for the block with the highest layer loss precision.
As a possible implementation, the processor is further configured to:
and determining the second quantization coefficient used by each block by means of KL divergence, and determining the third quantization coefficient used by each block by means of unsaturated symmetric quantization.
According to a fourth aspect of embodiments of the present invention, there is provided a computer storage medium having instructions that, when executed by a processor of an electronic device, enable the electronic device to perform the method for quantizing a convolutional neural network model as provided in the first aspect above.
The technical scheme provided by the embodiment of the invention at least has the following beneficial effects:
in the method for quantizing a convolutional neural network model provided by the embodiment of the invention, in order to complete end-to-end integer path calculation without floating-point computation, the quantization coefficient of the next block's convolutional-layer input data is multiplied in advance during the normalization-layer calculation, so that the integer result obtained is input directly to the next block. End-to-end integer data path calculation can thus be completed, the computation of the model is reduced, and the calculation speed of the model is improved.
Drawings
Fig. 1 is a schematic diagram of the quantization process of each block when a convolutional neural network model is quantized according to an embodiment of the present invention;
FIG. 2 is a schematic diagram illustrating a process of adjusting quantization precision when quantizing a convolutional neural network model according to an embodiment of the present invention;
FIG. 3 is a diagram illustrating a process for quantizing a convolutional neural network model when multi-precision quantization is employed, according to an exemplary embodiment;
FIG. 4 is a diagram illustrating a first block quantization process in a convolutional neural network model, according to an exemplary embodiment;
FIG. 5 is a schematic diagram illustrating an apparatus for quantizing a convolutional neural network model in accordance with an exemplary embodiment;
FIG. 6 is a block diagram illustrating an electronic device in accordance with an example embodiment.
Detailed Description
In order to make those skilled in the art better understand the technical solution of the present invention, the technical solution in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present invention. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the invention, as detailed in the appended claims.
Hereinafter, some terms in the embodiments of the present invention are explained to facilitate understanding by those skilled in the art.
(1) The term "and/or" in the embodiments of the present invention describes an association relationship between associated objects and indicates that three relationships may exist; for example, A and/or B may indicate: A exists alone, A and B exist simultaneously, or B exists alone. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship.
(2) The term "electronic device" in embodiments of the present invention may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, a fitness device, a personal digital assistant, or the like.
(3) The term "TensorRT quantization" in the embodiments of the present invention: TensorRT reconstructs the network structure, merges operations that can be combined, and optimizes for the characteristics of the GPU. Most deep learning frameworks today are not performance-optimized for GPUs. In an unoptimized deep learning model, for example, a convolutional layer, a bias layer and an activation layer each require a separate call to the corresponding cuDNN API, even though the three layers could be fused completely; TensorRT merges such combinable layers.
The present invention provides a method for quantizing a convolutional neural network model, which can be applied to a terminal device, and can also be applied to a server, where the terminal device can be any suitable electronic device that can be used for network access, including but not limited to a computer, a notebook computer, a smart phone, a tablet computer, or other types of terminals. The server is any server that can provide information required for the interactive service through a network. The terminal device can realize information transmission and reception with the server via the network.
TensorRT can merge layers that can be combined; the layers from one convolution up to the next convolution are regarded as one computing unit block. A convolutional neural network model comprises a plurality of blocks, and each block comprises a convolutional layer, a normalization BN layer and an activation layer.
When the network computes forward (using float type), the output of each computing unit block can be represented by the following formula:
output = f((input * wt + bias) * γ′ + β′)
In the above formula, f represents the activation function of the activation layer, γ′ is the scaling coefficient of the BN layer, β′ is the translation coefficient of the BN layer (γ′ and β′ are fixed values), wt is the weight used by the convolutional layer, and bias is the bias used by the convolutional layer.
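For illustration, a minimal Python sketch of this floating-point forward pass (the ReLU activation and the use of a matmul in place of a full convolution are assumptions for brevity, not part of the patent):

```python
import numpy as np

def block_forward_float(inp, wt, bias, gamma, beta):
    """Float forward pass of one computing unit block:
    conv (a matmul here for brevity) -> BN with fixed gamma'/beta' -> activation."""
    relu = lambda x: np.maximum(x, 0.0)  # assumed activation; the patent leaves f generic
    return relu((inp @ wt + bias) * gamma + beta)
```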
Currently, when TensorRT quantizes a convolutional neural network model, it quantizes the input with its corresponding quantization coefficient, quantizes wt with its corresponding quantization coefficient, multiplies the quantized input data by the quantized weight, restores the resulting integer data to floating-point data, and then performs the subsequent operations.
Therefore, the TensorRT quantization scheme is not pure fixed-point calculation: it cannot complete an end-to-end operation path, and it cannot fully exploit hardware performance when implemented on an embedded device. Without end-to-end calculation, intermediate results are moved in and out of the NPU and the CPU during actual computation, which wastes resources and time and hurts the performance of the algorithm.
The embodiment of the invention improves the TensorRT quantization method into a quantization scheme that completes an end-to-end integer operation path on hardware (the CNP) that supports such operations. The embodiment of the invention provides a method for quantizing a convolutional neural network model, where the convolutional neural network model comprises a plurality of computing unit blocks, and each block is processed as follows:
convolutional layer (conv) -> normalization layer (BN) -> activation layer (activation).
As shown in fig. 1, for each block the input data is quantized integer data, and the block is processed in the following manner:
Step 101, multiplying the weight of the convolutional layer by the scaling coefficient γ′ of the normalization BN layer and then quantizing to obtain the quantized weight;
the embodiment of the invention quantizes a convolutional neural network model, belongs to an off-line quantizing scheme, and is characterized in that after the convolutional neural network training is completed, the quantizing process is divided into two parts, namely training and derivation, and the training is generally performed in a float mode. The training of the model is part of the neural network algorithm design, and the training of the network model can be performed in the existing mode when the offline quantitative scheme is designed, and is not detailed here.
After the conventional convolutional neural network model is trained, the trained convolutional neural network model and input data of the convolutional neural network model are obtained.
For the first block of the convolutional neural network model, the input data is floating-point data, and the quantized input data is obtained by using the second quantization coefficient of the input data.
For the other blocks of the convolutional neural network model, the input data is already quantized integer data.
The quantization coefficient for quantizing the input data is referred to as a second quantization coefficient in the present embodiment.
The quantization coefficient for quantizing the weight data is referred to as a third quantization coefficient in the present embodiment.
Step 102, in the convolutional layer, multiplying the bias of the block by the scaling coefficient γ′, adding the translation coefficient β′ of the BN layer and quantizing the result to obtain the quantized bias; multiplying the quantized input data by the quantized weight, adding the quantized bias, and inputting the resulting integer data to the normalization layer;
the above-mentioned scaling factor gammaAnd the translation coefficient beta' is a parameter of the BN layer, and the value thereof is a fixed value, and the embodiment of the present invention multiplies the parameter values in advance, that is, the parameter of the BN layer is calculated first and then quantized. This way of quantizing the floating point calculation results first and then later can help reduce the loss of precision compared to a way of quantizing the parameters first and then the quantized parameters.
Step 103, in the normalization layer, dividing the input integer data by the first quantization coefficient used in the quantization of the convolutional layer, multiplying by the second quantization coefficient of the next block's input data, and inputting the resulting integer data to the activation layer;
According to the output formula above, the quantization of the convolutional layer comprises two parts: one quantizes the product of the input data and the weight, and the other quantizes the bias. The quantization coefficients used by the two parts are equal, and both are the first quantization coefficient.
As an optional implementation, the first quantization coefficient used in the quantization is the product of the second quantization coefficient and the third quantization coefficient divided by the shift used during multiplication, where the third quantization coefficient is the quantization coefficient corresponding to the weight.
As an alternative embodiment, multiplying the bias of the block by the scaling coefficient γ′ in the convolutional layer, adding the translation coefficient β′ of the BN layer and quantizing includes:
in the convolutional layer, multiplying the bias of the block by the scaling coefficient γ′, adding the resulting value to the translation coefficient β′ of the BN layer, and then quantizing the sum using the first quantization coefficient.
After the convolution operation is performed in the convolutional layer, the data needs to be divided by the first quantization coefficient in the normalization layer to restore the floating-point value.
Meanwhile, the next block receives integer data as its input, so that input no longer needs to be multiplied by the second quantization coefficient for quantization; only the third quantization coefficient is needed to quantize the weight, and the first quantization coefficient to quantize the bias. This reduces the computation of the model without affecting the accuracy of the calculated result.
Step 104, after the activation layer operates on the input integer data through the activation function, inputting the resulting integer data to the next block.
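Putting steps 101 to 104 together, a sketch of the resulting integer-only data path (ReLU and the matmul-for-conv simplification are assumptions; the coefficients si, sw, si_next and shift are defined in the derivation below):

```python
import numpy as np

def block_forward_int(inp_q, wt_q, bias_q, gamma_bn, shift):
    """Integer data path of one quantized block (steps 101-104), as a sketch.

    inp_q   : integer input, already carrying its coefficient si from the previous block
    wt_q    : integer weight, folded with gamma' and quantized by sw
    bias_q  : integer bias, folded with gamma'/beta' and quantized by si*sw/shift
    gamma_bn: BN scaling, set to si_next / (si * sw / shift); beta' is zero
    """
    # Convolutional layer: the hardware's built-in shift divides the product,
    # so x carries the factor si * sw / shift relative to the float value.
    x = (inp_q @ wt_q) // shift + bias_q
    # Normalization layer: a single multiply both removes that factor and
    # pre-quantizes the next block's input by si_next.
    y = np.rint(x * gamma_bn).astype(np.int64)
    # Activation layer: ReLU keeps integer data integer.
    return np.maximum(y, 0)
```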
In the method for quantizing a convolutional neural network model provided by the embodiment of the invention, in order to complete end-to-end integer path calculation without floating-point computation, the quantization coefficient of the next block's convolutional-layer input data is multiplied in advance during the normalization-layer calculation, so that the integer result obtained is input directly to the next block. End-to-end integer data path calculation can thus be completed, the computation of the model is reduced, and the calculation speed of the model is improved.
The following describes the principle of the implementation of the present invention with reference to the calculation process of the convolutional neural network model:
As mentioned above, when the network computes forward (float type), the output of each computing unit block can be expressed by the following equation:
output = f((input * wt + bias) * γ′ + β′)
The expression inside the activation function can be converted into:
x = (input * wt * γ′ + bias * γ′ + β′) * 1 + 0
x = (input * wt′ + bias′) * 1 + 0
where wt′ = wt * γ′ and bias′ = bias * γ′ + β′.
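The conversion is a pure algebraic identity, which a tiny numeric check (with assumed random values) confirms:

```python
import numpy as np

rng = np.random.default_rng(0)
inp = rng.normal(size=(4, 8))
wt = rng.normal(size=(8, 3))
bias = rng.normal(size=3)
gamma, beta = 1.7, 0.3                          # assumed fixed BN parameters

lhs = (inp @ wt + bias) * gamma + beta          # original conv followed by BN
wt_f, bias_f = wt * gamma, bias * gamma + beta  # folded wt', bias'
rhs = inp @ wt_f + bias_f                       # folded form
assert np.allclose(lhs, rhs)                    # identical up to float rounding
```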
It can be seen from the above formulas that the parameters of the BN layer are multiplied into the convolutional layer and computed first, and the result is quantized afterwards: specifically, the weight of the block is multiplied by the scaling coefficient γ′ of the normalization BN layer and then quantized, and the bias of the block is multiplied by the scaling coefficient γ′, added to the translation coefficient β′ of the BN layer, and then quantized. Quantizing the floating-point computation result in this way helps reduce the loss of precision compared with quantizing the parameters separately and then computing with the quantized parameters.
Therefore, when quantization is applied in the convolutional layer of a block of the convolutional neural network model, the following quantized calculation result is obtained:
x′ = (input * si * wt′ * sw / shift + bias′ * si * sw / shift) * shift / shift + 0
si is the second quantization coefficient of the input data of the current block. Note that for the first block the input needs to be multiplied by this second quantization coefficient, while for subsequent blocks the previous block has already multiplied the coefficient in advance, so no further multiplication is needed;
sw is the third quantization coefficient of the weight wt′ of the current block;
shift is the operation built into the CNP when it performs multiplication.
In the quantization process, the two parts are quantized using the quantization coefficient si * sw together with a shift operation, i.e., they are multiplied by the first quantization coefficient si * sw / shift.
x′ differs from the true value x by the factor si * sw / shift. Therefore, this embodiment needs to remove this coefficient in the normalization layer to obtain the true calculated value:
x = x′ / (si * sw / shift)
However, the true calculated value obtained this way is floating point and cannot be represented as an integer, so the input conversion coefficient si_next of the next block can be multiplied in advance, quantizing the next input ahead of time to obtain an integer result. That is, the output y of this layer is:
y = x * si_next = [x′ / (si * sw / shift)] * si_next
In order to complete the end-to-end calculation, according to the hardware characteristics of the CNP, in this embodiment the scaling coefficient γ′ of the BN layer is set to the second quantization coefficient of the next block's input data divided by the first quantization coefficient, and the translation coefficient β′ is set to zero; that is, the coefficient si_next / (si * sw / shift) is assigned to γ′ of the BN layer, so no software processing is required. The activation layer is then computed, completing the calculation of the whole block.
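A short sketch of this offline parameter assignment:

```python
def bn_params_for_block(si, sw, si_next, shift):
    """Offline choice of the BN parameters for one block (a sketch of this
    embodiment): gamma' removes the conv's si*sw/shift factor and pre-multiplies
    the next block's input coefficient si_next; beta' is zero."""
    gamma_bn = si_next / (si * sw / shift)  # = si_next * shift / (si * sw)
    return gamma_bn, 0.0
```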
As an optional implementation, in the embodiment of the present invention the quantization coefficients of different block layers are independent; that is, different block layers may be quantized with different numbers of bits.
The existing TensorRT quantization method uses single-precision quantization, which can greatly reduce the accuracy of some algorithm models. An embodiment of the present invention provides a method for quantizing a convolutional neural network model in which the input data of the convolutional neural network model is sample data; as shown in fig. 2, the method includes:
Step 201, acquiring the trained convolutional neural network model and sample data;
Step 202, inputting the sample data into the convolutional neural network model, each block in the model performing the above steps 101 to 104, to obtain the output data of the quantized convolutional neural network model;
Step 203, determining the loss precision of the convolutional neural network model according to the sample data and the output data of the quantized convolutional neural network model;
the loss accuracy of the convolutional neural network model is determined by adopting the existing mode, and the overall loss accuracy of the model is reflected.
Step 204, when the loss precision is determined to exceed the set model threshold, increasing the number of bits the current convolutional neural network model uses for weight quantization.
As an optional implementation, in the embodiment of the present invention different blocks may be quantized with different precisions, i.e., with quantization coefficients of different bit numbers. When the loss of the model is determined to exceed the set model threshold, a block with low precision is selected and its quantization precision is improved, specifically in the following manner:
calculating the layer loss precision of each block in the convolutional neural network model;
and increasing the number of bits of weight quantization for the block with the highest layer loss precision.
Fig. 3 is the overall schematic diagram of precision adjustment in the embodiment of the present invention. Low-precision quantization may first be performed on the original convolutional neural network model; low-precision quantization in this embodiment means 8-bit quantization of each block. The precision loss of the model is then calculated: if the precision loss is below the model threshold, quantization is complete; if the loss is above the model threshold, the layer precision loss of each block layer is calculated (the larger the loss, the lower the layer precision and the less the quantization precision requirement is met), and the block layer with the largest loss is found. The quantization precision of the block with the largest precision loss is then increased, and the model is quantized again with multiple precisions, until the target loss precision of the model is finally reached. Through this process, different block modules of the model can be quantized with different precisions.
The layer precision loss of a block layer in this embodiment may be calculated in, but is not limited to, the following manner:
Given the sample data as input, each computing unit takes the floating-point output of the previous computing unit as its input and compares its own output with the floating-point output of the same layer to compute the precision loss of that layer. The only variable in the computing unit is the quantized weight, so the loss caused by weight quantization is measured well, which is used to judge whether the quantization precision meets the requirement.
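A minimal sketch of the whole adjustment loop from fig. 3 (the helpers quantize_model, model_loss and layer_losses, and the model.blocks attribute, are hypothetical placeholders standing in for the steps above):

```python
def adjust_precision(model, samples, model_threshold, bit_ladder=(8, 10, 16)):
    """Multi-precision adjustment loop (a sketch of this embodiment)."""
    bits = {blk: bit_ladder[0] for blk in model.blocks}   # start at low precision
    while True:
        q_model = quantize_model(model, bits)             # steps 101-104 per block
        if model_loss(q_model, samples) <= model_threshold:
            return q_model, bits                          # target precision reached
        # Per-layer loss: each unit takes the float output of the previous unit
        # as input and is compared against its own float output.
        losses = layer_losses(model, q_model, samples)
        worst = max(losses, key=losses.get)               # block with largest loss
        higher = [b for b in bit_ladder if b > bits[worst]]
        if not higher:                                    # cannot raise further
            return q_model, bits
        bits[worst] = higher[0]                           # raise its weight bits
```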
As shown in fig. 4, taking the calculation of the first block of the convolutional neural network model as an example, the block quantization calculation process includes:
quantizing the floating-point input data (float) with the second quantization coefficient si in the convolutional layer to obtain 16-bit input data;
multiplying the floating-point weight (float) by the scaling coefficient γ′ of the BN layer in the convolutional layer to obtain the floating-point weight′ (float), and quantizing weight′ (float) with the third quantization coefficient sw to obtain the quantized weight′ (k bit); during quantization, the CNP performs its built-in /shift operation as part of the multiplication;
multiplying the floating-point bias (float) by the scaling coefficient γ′ of the BN layer in the convolutional layer and adding the translation coefficient β′ to obtain the floating-point bias′ (float), and quantizing bias′ (float) with the first quantization coefficient si * sw to obtain the quantized 16-bit bias′ (16 bit); during quantization, the CNP performs its built-in /shift operation as part of the multiplication;
multiplying the quantized 16-bit input data by the quantized weight′ (k bit), adding the quantized 16-bit bias′ (16 bit), and inputting the resulting integer data to the normalization layer;
in the normalization layer, multiplying the input integer data by the scaling coefficient γ′ and adding the translation coefficient β′ to obtain integer 16-bit data, which is input to the activation layer;
the activation layer performs its operation with the activation function to obtain 16-bit output data, which serves as the input data of the next block.
In the above process, the value of the scaling coefficient γ′ of the BN layer is si_next / (si * sw / shift), i.e., (si_next * shift) / (si * sw), and the value of the translation coefficient β′ is zero.
Here, weight′ (k bit) indicates that the number of bits used to quantize weight′ is not fixed; the bit numbers of the weights differ between blocks according to the actual model. Specifically, the number of bits of weight′ (k bit) in a block may be optimized through the model's multi-precision quantization process; the value of k may be 8, 10 or 16, and may also be another integer value.
Example 2
An embodiment of the present invention further provides a device for quantizing a convolutional neural network model, as shown in fig. 5, including:
a weight quantization module 501, configured, for each block whose input data is quantized integer data, to multiply the weight of the convolutional layer by the scaling coefficient γ′ of the normalized BN layer and then quantize it to obtain the quantized weight;
a bias quantization module 502, configured to multiply the bias of the block by the scaling coefficient γ′ in the convolutional layer, add the translation coefficient β′ of the BN layer, and quantize the sum to obtain the quantized bias, then multiply the quantized input data by the quantized weight, add the quantized bias, and input the resulting integer data to the normalization layer;
a normalization module 503, configured to divide the input integer data by the first quantization coefficient used in quantization in the normalization layer, multiply by the second quantization coefficient of the next block's input data, and input the resulting integer data to the activation layer;
and an activation operation module 504, configured to input the resulting integer data to the next block after the activation layer operates on the input integer data through the activation function.
As a possible implementation, the weight quantization module is specifically configured to:
and acquiring the input data of the current block, and, if the input data of the current block is floating-point data, obtaining quantized integer data by using the second quantization coefficient of the input data.
As a possible implementation, the scaling coefficient γ′ of the BN layer is set to the second quantization coefficient of the next block's input data divided by the first quantization coefficient, and the translation coefficient β′ is set to zero.
As a possible implementation, the first quantization coefficient is the product of the second quantization coefficient and the third quantization coefficient divided by the shift used during multiplication, where the third quantization coefficient is the quantization coefficient corresponding to the weight.
As a possible implementation, the bias quantization module multiplying the bias of the block by the scaling coefficient γ′ in the convolutional layer, adding the translation coefficient β′ of the BN layer and quantizing includes:
in the convolutional layer, multiplying the bias of the block by the scaling coefficient γ′, adding the resulting value to the translation coefficient β′ of the BN layer, and then quantizing the sum using the first quantization coefficient.
As a possible implementation manner, when the input data of the convolutional neural network model is sample data, the apparatus further includes:
the precision adjusting module is configured to determine the loss precision of the convolutional neural network model according to the sample data and the output data of the quantized convolutional neural network model, and, when the loss precision is determined to exceed the set model threshold, to increase the number of bits the current convolutional neural network model uses for weight quantization.
As a possible implementation, when the loss precision is determined to exceed the set threshold, increasing the number of quantization bits of the current convolutional neural network model includes:
calculating the layer loss precision of each block in the convolutional neural network model;
and increasing the number of bits of weight quantization for the block with the highest layer loss precision.
As a possible implementation, the apparatus further comprises:
and the coefficient determining module is configured to determine the second quantization coefficient used by each block by means of KL divergence, and to determine the third quantization coefficient used by each block by means of unsaturated symmetric quantization.
According to a third aspect of embodiments of the present invention, there is provided an electronic device 600, as shown in fig. 6, comprising: a memory 610 for storing executable instructions;
a processor 620, configured to read and execute the executable instructions stored in the memory, so as to implement the following method steps:
for each block, the input data is quantized integer data; the weight of the convolutional layer is multiplied by the scaling coefficient γ′ of the normalization (BN) layer and then quantized to obtain the quantized weight;
in the convolutional layer, the bias of the block is multiplied by the scaling coefficient γ′, added to the translation coefficient β′ of the BN layer, and then quantized to obtain the quantized bias; the quantized input data is multiplied by the quantized weight, the quantized bias is added, and the resulting integer data is input to the normalization layer;
in the normalization layer, the input integer data is divided by the first quantization coefficient used in quantization and multiplied by the second quantization coefficient of the next block's input data, and the resulting integer data is input to the activation layer;
and after the activation layer operates on the input integer data through the activation function, the resulting integer data is input to the next block.
As a possible implementation, the processor acquiring the quantized input data at the convolutional layer includes:
acquiring the input data of the current block at the convolutional layer, and, if the input data of the current block is floating-point data, obtaining the quantized input data by using the second quantization coefficient of the input data.
As a possible implementation, the scaling coefficient γ′ of the BN layer is set to the second quantization coefficient of the next block's input data divided by the first quantization coefficient, and the translation coefficient β′ is set to zero.
As a possible implementation, the first quantization coefficient is the product of the second quantization coefficient and the third quantization coefficient divided by the shift used during multiplication, where the third quantization coefficient is the quantization coefficient corresponding to the weight.
As a possible implementation, the processor multiplying the bias of the block by the scaling coefficient γ′ in the convolutional layer, adding the translation coefficient β′ of the BN layer and quantizing includes:
in the convolutional layer, multiplying the bias of the block by the scaling coefficient γ′, adding the resulting value to the translation coefficient β′ of the BN layer, and then quantizing the sum using the first quantization coefficient.
As a possible implementation manner, when the input data of the convolutional neural network model is sample data, the processor is further configured to:
determining the loss precision of the convolutional neural network model according to the sample data and the output data of the quantized convolutional neural network model;
and when the loss precision is determined to exceed the set model threshold, increasing the number of bits the current convolutional neural network model uses for weight quantization.
As a possible implementation, when the loss precision is determined to exceed the set threshold, increasing the number of quantization bits of the current convolutional neural network model includes:
calculating the layer loss precision of each block in the convolutional neural network model;
and increasing the number of bits of weight quantization for the block with the highest layer loss precision.
As a possible implementation, the processor is further configured to:
and determining the second quantization coefficient used by each block by means of KL divergence, and determining the third quantization coefficient used by each block by means of unsaturated symmetric quantization.
There is also provided, in accordance with an embodiment of the present invention, a computer storage medium having instructions which, when executed by a processor of an electronic device, enable the electronic device to perform the method for quantizing a convolutional neural network model as provided in embodiment 1 above.
Alternatively, the storage medium may be a non-transitory computer readable storage medium, which may be, for example, a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
Other embodiments of the invention will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This invention is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the invention and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the invention being indicated by the following claims.
It will be understood that the invention is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the invention is limited only by the appended claims.

Claims (10)

1. A method for quantizing a convolutional neural network model, the convolutional neural network model comprising a plurality of computing unit blocks, the method comprising:
for each block, the input data is quantized integer data; the weight of the convolutional layer is multiplied by the scaling coefficient γ′ of the normalization (BN) layer and then quantized to obtain the quantized weight;
in the convolutional layer, the bias of the block is multiplied by the scaling coefficient γ′, added to the translation coefficient β′ of the BN layer, and then quantized to obtain the quantized bias; the input data is multiplied by the quantized weight, the quantized bias is added, and the resulting integer data is input to the normalization layer;
in the normalization layer, the input integer data is divided by the first quantization coefficient used in quantization and multiplied by the second quantization coefficient of the next block's input data, and the resulting integer data is input to the activation layer;
and after the activation layer operates on the input integer data through the activation function, the resulting integer data is input to the next block.
2. The method of claim 1, further comprising:
and acquiring the input data of the current block, and, if the input data of the current block is floating-point data, obtaining quantized integer data by using the second quantization coefficient of the input data.
3. The method according to claim 1, wherein the scaling coefficient γ′ of the BN layer is set to the second quantization coefficient of the next block's input data divided by the first quantization coefficient, and the translation coefficient β′ is set to zero.
4. The method of claim 1, wherein the first quantization coefficient is the product of the second quantization coefficient and the third quantization coefficient divided by the shift used during multiplication, and the third quantization coefficient is the quantization coefficient corresponding to the weight.
5. The method according to any one of claims 1 to 4, wherein multiplying the bias of the block by the scaling coefficient γ′ in the convolutional layer, adding the translation coefficient β′ of the BN layer and quantizing comprises:
in the convolutional layer, multiplying the bias of the block by the scaling coefficient γ′, adding the resulting value to the translation coefficient β′ of the BN layer, and then quantizing the sum using the first quantization coefficient.
6. The method of claim 1, further comprising, when the input data of the convolutional neural network model is sample data:
determining the loss precision of the convolutional neural network model according to the sample data and the output data of the quantized convolutional neural network model;
and when the loss precision is determined to exceed the set model threshold, increasing the number of bits the current convolutional neural network model uses for weight quantization.
7. The method of claim 6, wherein, when the loss precision is determined to exceed the set threshold, increasing the number of quantization bits of the current convolutional neural network model comprises:
calculating the layer loss precision of each block in the convolutional neural network model;
and increasing the number of bits of weight quantization for the block with the highest layer loss precision.
8. The method of claim 4, further comprising:
and determining the second quantization coefficient used by each block by means of KL divergence, and determining the third quantization coefficient used by each block by means of unsaturated symmetric quantization.
9. An apparatus for quantizing a convolutional neural network model, comprising:
the weight quantization module is configured, for each block whose input data is quantized integer data, to multiply the weight of the convolutional layer by the scaling coefficient γ′ of the normalization (BN) layer and then quantize it to obtain the quantized weight;
the bias quantization module is configured to multiply the bias of the block by the scaling coefficient γ′ in the convolutional layer, add the translation coefficient β′ of the BN layer, and quantize the sum to obtain the quantized bias, then multiply the input data by the quantized weight, add the quantized bias, and input the resulting integer data to the normalization layer;
the normalization module is configured to divide the input integer data by the first quantization coefficient used in quantization in the normalization layer, multiply by the second quantization coefficient of the next block's input data, and input the resulting integer data to the activation layer;
and the activation operation module is configured to input the resulting integer data to the next block after the activation layer operates on the input integer data through the activation function.
10. An electronic device, comprising:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to execute the instructions to implement the method for quantizing a convolutional neural network model as defined in any one of claims 1 to 8.
CN202011139056.7A 2020-10-22 2020-10-22 Method and device for quantizing convolutional neural network model and electronic equipment Pending CN114386469A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011139056.7A CN114386469A (en) 2020-10-22 2020-10-22 Method and device for quantizing convolutional neural network model and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011139056.7A CN114386469A (en) 2020-10-22 2020-10-22 Method and device for quantizing convolutional neural network model and electronic equipment

Publications (1)

Publication Number Publication Date
CN114386469A true CN114386469A (en) 2022-04-22

Family

ID=81193728

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011139056.7A Pending CN114386469A (en) 2020-10-22 2020-10-22 Method and device for quantizing convolutional neural network model and electronic equipment

Country Status (1)

Country Link
CN (1) CN114386469A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114648101A (en) * 2022-05-13 2022-06-21 杭州研极微电子有限公司 Transformer structure-based softmax function quantization realization method and device


Similar Documents

Publication Publication Date Title
CN110222821B (en) Weight distribution-based convolutional neural network low bit width quantization method
CN110852416B (en) CNN hardware acceleration computing method and system based on low-precision floating point data representation form
CN107340993B (en) Arithmetic device and method
TW201918939A (en) Method and apparatus for learning low-precision neural network
KR20200004700A (en) Method and apparatus for processing parameter in neural network
CN110555450A (en) Face recognition neural network adjusting method and device
CN110008952B (en) Target identification method and device
CN110598839A (en) Convolutional neural network system and method for quantizing convolutional neural network
EP4087239A1 (en) Image compression method and apparatus
CN112686382B (en) Convolution model lightweight method and system
CN110738315A (en) neural network precision adjusting method and device
CN111950715A (en) 8-bit integer full-quantization inference method and device based on self-adaptive dynamic shift
CN114386469A (en) Method and device for quantizing convolutional neural network model and electronic equipment
CN111937011A (en) Method and equipment for determining weight parameters of neural network model
CN112183726A (en) Neural network full-quantization method and system
CN113157453B (en) Task complexity-based high-energy-efficiency target detection task dynamic scheduling method
CN111614358B (en) Feature extraction method, system, equipment and storage medium based on multichannel quantization
CN114418057A (en) Operation method of convolutional neural network and related equipment
CN116579400B (en) Quantization method, data processing method and device of deep learning model
CN112561050A (en) Neural network model training method and device
Zhen et al. A Secure and Effective Energy-Aware Fixed-Point Quantization Scheme for Asynchronous Federated Learning.
CN116702861B (en) Compression method, training method, processing method and device of deep learning model
CN117829222A (en) Model quantization method, apparatus, electronic device, and computer-readable storage medium
JP7371499B2 (en) Arithmetic processing unit, control method for the arithmetic processing unit, and arithmetic processing program
US20220019891A1 (en) Electronic device and learning method for learning of low complexity artificial intelligence model based on selecting dynamic prediction confidence threshold

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination