CN111814955A - Method and apparatus for quantizing neural network model, and computer storage medium - Google Patents

Method and apparatus for quantizing neural network model, and computer storage medium

Info

Publication number
CN111814955A
CN111814955A (application CN202010568807.0A)
Authority
CN
China
Prior art keywords
layer
neural network
quantization
network model
quantization factor
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010568807.0A
Other languages
Chinese (zh)
Inventor
周旭亚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Dahua Technology Co Ltd
Original Assignee
Zhejiang Dahua Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Dahua Technology Co Ltd filed Critical Zhejiang Dahua Technology Co Ltd
Priority to CN202010568807.0A priority Critical patent/CN111814955A/en
Publication of CN111814955A publication Critical patent/CN111814955A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Abstract

The application provides a quantization method and device for a neural network model, and a computer storage medium. The method comprises the following steps: inputting a training picture into the neural network model, and performing computation in a first data type to obtain first input data of each computation layer in the neural network model; obtaining at least two initial quantization factors for each computation layer according to at least two algorithms; obtaining at least two pieces of second input data quantized for each computation layer based on the at least two initial quantization factors; comparing the correlation between the first input data and each piece of second input data in each computation layer; taking the initial quantization factor corresponding to the second input data with the largest correlation as the final quantization factor of that computation layer; and inputting the final quantization factor into the neural network model. Because the quantization factor of each computation layer is computed by at least two algorithms and the optimal one is selected by comparison, the quantization accuracy of the whole neural network model is improved.

Description

Method and apparatus for quantizing neural network model, and computer storage medium
Technical Field
The present application relates to the field of neural network technologies, and in particular, to a method and an apparatus for quantizing a neural network model, and a computer storage medium.
Background
At present, a common neural network quantization method quantizes the input activation values of all convolutional layers and fully-connected layers with the same algorithm. However, because the input activation values vary flexibly, a single algorithm can produce a large error in a certain layer of the neural network, and because of the feed-forward nature and complexity of the network structure, the error grows larger and larger during inference, so that the quantization accuracy of the neural network model is poor.
Disclosure of Invention
The application provides a quantization method and device for a neural network model, and a computer storage medium, and mainly solves the technical problem of how to improve the quantization accuracy of a neural network model.
In order to solve the above technical problem, the present application provides a method for quantizing a neural network model, the method including:
inputting a training picture into the neural network model, and performing computation in a first data type to obtain first input data of each computation layer in the neural network model;
obtaining at least two initial quantization factors for each computation layer according to at least two algorithms;
obtaining at least two pieces of second input data quantized for each computation layer based on the at least two initial quantization factors;
comparing the correlation between the first input data and each piece of second input data in each computation layer;
taking the initial quantization factor corresponding to the second input data with the largest correlation as the final quantization factor of the computation layer; and
inputting the final quantization factor into the neural network model.
According to an embodiment provided by the present application, the method further includes:
merging the data normalization layer following the computation layer into the computation layer for computation.
According to an embodiment provided by the present application, the computation layer includes a convolutional layer and a fully-connected layer; the method further comprises:
setting the output data type of the layer preceding the convolutional layer and the fully-connected layer to a second data type.
According to an embodiment provided by the present application, the method further includes:
setting the output data type of the layer preceding a non-computation layer to the second data type.
According to an embodiment provided by the present application, the quantization factor includes a weight quantization factor and an input quantization factor; the inputting the final quantization factor into the neural network model comprises:
transmitting the input quantization factor of the computation layer to the previous layer, such that the output data of the previous layer is of the second data type.
According to an embodiment provided by the present application, the first data type is a floating point type, and the second data type is a fixed point type.
According to an embodiment provided by the present application, the inputting the final quantization factor into the neural network model further includes:
calculating a quantized weight value according to the weight quantization factor, and inputting the quantized weight value into the neural network model.
According to an embodiment provided by the present application, the inputting the final quantization factor into the neural network model includes:
converting the bias value of the computation layer into the output data type of the computation layer according to the quantization factor.
To solve the above technical problem, the present application provides a terminal device, which includes a memory and a processor coupled to the memory;
the memory is used for storing program data, and the processor is used for executing the program data to implement the quantization method of the neural network model described above.
To solve the above technical problem, the present application further provides a computer storage medium for storing program data, which when executed by a processor, is used to implement the method for quantizing a neural network model as described above.
The beneficial effects of the present application are as follows: a training picture is input into the neural network model, and computation is performed in a first data type to obtain first input data of each computation layer in the neural network model; at least two initial quantization factors of each computation layer are obtained according to at least two algorithms; at least two pieces of second input data quantized for each computation layer are obtained based on the at least two initial quantization factors; the correlation between the first input data and each piece of second input data is compared in each computation layer; the initial quantization factor corresponding to the second input data with the largest correlation is taken as the final quantization factor of the computation layer; and the final quantization factor is input into the neural network model. In this quantization method, at least two initial quantization factors of each computation layer are computed by at least two algorithms, the second input data produced under each of them is compared, by correlation, against the first input data of the layer, and the initial quantization factor corresponding to the second input data with the largest correlation is taken as the final quantization factor, that is, the quantization factor with the best precision. Inputting this final quantization factor into the neural network model improves the quantization accuracy of the whole model.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed in the description of the embodiments are briefly introduced below. The drawings in the following description are obviously only some embodiments of the present application; for those skilled in the art, other drawings can be obtained from them without creative effort. Wherein:
FIG. 1 is a schematic flow chart of an embodiment of the method for quantizing a neural network model provided by the present application;
FIG. 2 is a schematic flow chart of prior art convolutional layer and fully-connected layer quantization operations;
fig. 3 is a schematic structural diagram of an embodiment of a terminal device provided in the present application;
FIG. 4 is a schematic structural diagram of an embodiment of a computer storage medium provided in the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
In the prior art, a single algorithm is usually adopted to quantize the input activation values of all convolutional layers and fully-connected layers in the neural network model in order to improve its accuracy. However, when the input activation values of all convolutional and fully-connected layers are quantized by the same algorithm, the flexible variability of the input activation values can make the error of a certain layer large, and because of the feed-forward nature and complexity of the neural network model, the error grows larger and larger during inference, so that the quantization accuracy of the neural network model ends up poor.
In order to solve the above technical problem, the present application provides a method for quantizing a neural network model; specifically, refer to fig. 1, which is a schematic flow chart of an embodiment of the method for quantizing a neural network model provided in the present application. The quantization method of this embodiment may be applied to a terminal device for quantizing the neural network model, and may also be applied to a server having data processing capability. The method specifically comprises the following steps:
s101: and inputting a training picture into the neural network model, and calculating a first data type to obtain first input data of each calculation layer in the neural network model.
In order to train the model quickly and obtain a neural network model with optimal accuracy, a number of images, for example 100, may be randomly selected from the training pictures and input into the neural network model, and computation is performed in the first data type to obtain the first input data of each computation layer in the neural network model. The first data type is a floating-point type; floating-point numbers are used in a computer to approximately represent arbitrary real numbers. The computation layer may be a convolutional layer or a fully-connected layer. The first input data are the maximum absolute value of the input data of each convolutional or fully-connected layer and the maximum absolute value of the weight data of that layer.
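As a rough illustration of this statistics-gathering pass, the following Python sketch runs a float32 forward pass over the sampled pictures and records each layer's maximum absolute input and weight values. It is not from the patent: the DenseLayer class, function names, and tensor shapes are invented for illustration.

```python
import numpy as np

class DenseLayer:
    # Minimal stand-in for a computation layer (conv or fully-connected);
    # the class and its forward pass are illustrative, not from the patent.
    def __init__(self, in_dim, out_dim, rng):
        self.weight = rng.standard_normal((in_dim, out_dim)).astype(np.float32)

    def forward(self, x):
        return np.maximum(x @ self.weight, 0.0)   # linear + ReLU

def collect_first_input_data(layers, pictures):
    # Float32 ('first data type') forward pass over the sampled training
    # pictures, recording per layer the max |input| and max |weight|.
    act_max = np.zeros(len(layers), dtype=np.float32)
    for x in pictures:                      # e.g. ~100 randomly chosen images
        x = x.astype(np.float32)
        for i, layer in enumerate(layers):
            act_max[i] = max(act_max[i], float(np.abs(x).max()))
            x = layer.forward(x)
    w_max = np.array([np.abs(l.weight).max() for l in layers], np.float32)
    return act_max, w_max

rng = np.random.default_rng(0)
net = [DenseLayer(64, 32, rng), DenseLayer(32, 10, rng)]
a_max, w_max = collect_first_input_data(net, rng.standard_normal((100, 64)))
```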
S102: at least two initial quantization factors for each calculation layer are obtained according to at least two algorithms.
In this embodiment, at least two algorithms are used to obtain the quantization factors; specifically, two or three algorithms may be used, and the number of algorithms is not limited. The kind of algorithm can be chosen by those skilled in the art according to the actual situation. For example, two algorithms, say a global-cycle algorithm and a kl-divergence algorithm, may be used to obtain two initial quantization factors for each computation layer.
In practical application, when the two initial quantization factors of each computation layer are obtained by the global-cycle algorithm and the kl-divergence algorithm, an initial quantization factor M of each computation layer is obtained by the global-cycle algorithm, and an initial quantization factor N of each computation layer is obtained by the kl-divergence algorithm.
The quantization factor includes a weight quantization factor and an input quantization factor. Specifically, the global-cycle algorithm can be adopted to obtain a weight quantization factor M1 and an input quantization factor M2 of each computation layer, and the kl-divergence algorithm to obtain a weight quantization factor N1 and an input quantization factor N2 of each computation layer.
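The patent names but does not define the global-cycle algorithm, so the sketch below substitutes a plain max-abs scale as a stand-in for it, paired with a simplified KL-divergence threshold sweep in the spirit of common int8 calibration. The int8 assumption, the multiplier convention (factor = 127 / threshold), and all function names are illustrative assumptions.

```python
import numpy as np

def maxabs_factor(max_val, n_bits=8):
    # Stand-in for the patent's 'global-cycle' algorithm (details not given):
    # scale so that the largest magnitude maps onto the int8 extreme.
    return (2 ** (n_bits - 1) - 1) / max_val

def kl_factor(samples, n_bits=8, bins=2048):
    # Simplified KL-divergence calibration: sweep clipping thresholds and keep
    # the one whose clipped, requantized distribution stays closest to the
    # original activation distribution.
    levels = 2 ** (n_bits - 1)                      # 128 for int8
    hist, edges = np.histogram(np.abs(samples), bins=bins)
    hist = hist.astype(np.float64)
    best_t, best_kl = edges[-1], np.inf
    for stop in range(levels, bins + 1, levels):    # thresholds at bin edges
        p = hist[:stop].copy()
        p[-1] += hist[stop:].sum()                  # fold outliers into last bin
        q = np.repeat(p.reshape(levels, -1).sum(axis=1), stop // levels)
        pn, qn = p / p.sum(), q / q.sum()
        mask = pn > 0
        kl = np.sum(pn[mask] * np.log(pn[mask] / np.maximum(qn[mask], 1e-12)))
        if kl < best_kl:
            best_kl, best_t = kl, edges[stop]
    return (levels - 1) / best_t

# Two candidate input quantization factors for one layer's activations:
acts = np.random.default_rng(0).standard_normal(100_000)
m2 = maxabs_factor(np.abs(acts).max())   # 'global' style factor (M2)
n2 = kl_factor(acts)                     # KL-divergence factor (N2)
```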
S103: at least two second input data quantized for each computation layer are obtained based on the at least two initial quantization factors.
Based on the at least two initial quantization factors of each computation layer obtained in S102, the initial quantization factors are input into each computation layer of the convolutional neural network to obtain the at least two pieces of second input data quantized for each layer. Specifically, the input quantization factor and the weight quantization factor under each algorithm are input into each computation layer, and the second input data quantized by that layer under each algorithm is obtained; that is, the second input data is the input data of each layer computed after the weight quantization factor and the input quantization factor of each algorithm are applied to the convolutional neural network. For example, in practical application, the weight quantization factor M1 and the input quantization factor M2 obtained by the global-cycle algorithm are input into the convolutional neural network to obtain the second input data under the global-cycle algorithm, and the weight quantization factor N1 and the input quantization factor N2 obtained by the kl-divergence algorithm are input into the convolutional neural network to obtain the second input data under the kl-divergence algorithm.
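A minimal sketch of this step, continuing the DenseLayer example above and assuming symmetric int8 quantization: each layer's input and weights are passed through a quantize/dequantize round trip, and the quantized inputs are recorded as the second input data.

```python
import numpy as np

def fake_quantize(x, factor, n_bits=8):
    # Quantize to the int8 grid with the given factor, then map back to float;
    # running the network on such tensors yields the 'second input data'.
    qmax = 2 ** (n_bits - 1) - 1
    return np.clip(np.round(x * factor), -qmax - 1, qmax) / factor

def collect_second_input_data(layers, batch, in_factors, w_factors):
    # Forward pass with fake-quantized inputs and weights (per-layer factors
    # from one algorithm), recording each layer's quantized input tensor.
    second = []
    x = batch.astype(np.float32)
    for layer, s_in, s_w in zip(layers, in_factors, w_factors):
        x_q = fake_quantize(x, s_in)
        second.append(x_q)
        w_q = fake_quantize(layer.weight, s_w)
        x = np.maximum(x_q @ w_q, 0.0)   # same linear + ReLU as the float pass
    return second
```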
S104: and comparing the correlation of the first input data and each second input data in each calculation layer.
Based on the first input data of each computation layer acquired in S101 and the second input data under each algorithm acquired in S103, the correlation between the first input data and each piece of second input data is compared. The correlation represents the degree to which the first input data and the second input data are related; the higher the correlation of the second input data with the first input data, the more accurate the neural network model quantized with it.
For example, in practical application, the correlation between the second input data acquired by the global-cycle algorithm and the first input data is calculated and denoted C; the correlation between the second input data acquired by the kl-divergence algorithm and the first input data is calculated and denoted C'; and C is compared with C'.
S105: and taking the initial quantization factor corresponding to the second input data with the maximum correlation as the final quantization factor of the calculation layer.
Based on the magnitudes of the correlations compared in S104, the initial quantization factor corresponding to the second input data with the largest correlation is taken as the final quantization factor of the computation layer. For example, when the correlation C is greater than the correlation C', the initial quantization factor M corresponding to the second input data of correlation C is taken as the final quantization factor of the computation layer; that is, the weight quantization factor M1 and the input quantization factor M2 corresponding to the second input data of correlation C are taken as the final quantization factors.
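Steps S104 and S105 together might look like the following sketch. It assumes the float input tensors of each layer were also recorded during the S101 pass, and uses the Pearson correlation coefficient as the correlation measure; the patent does not fix a particular measure, so this choice is an assumption.

```python
import numpy as np

def pick_final_factors(first_inputs, second_per_algo, factors_per_algo):
    # first_inputs[i]: float input tensor of layer i (recorded in S101).
    # second_per_algo[a][i]: layer i's fake-quantized input under algorithm a.
    # factors_per_algo[a][i]: that algorithm's (weight, input) factor pair.
    final = []
    for i, ref in enumerate(first_inputs):
        corrs = [float(np.corrcoef(ref.ravel(), algo[i].ravel())[0, 1])
                 for algo in second_per_algo]      # e.g. C vs C'
        best = int(np.argmax(corrs))               # largest correlation wins
        final.append(factors_per_algo[best][i])
    return final
```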
S106: and inputting the final quantization factor into the neural network model.
Based on the final quantization factor obtained in S105, the final quantization factor is input into the neural network model to obtain a neural network model with an accelerated inference process. For example, the weight quantization factor M1 and the input quantization factor M2 corresponding to the second input data of correlation C are input into the neural network model.
In this embodiment, a training picture is input into the neural network model, and computation is performed in a first data type to obtain first input data of each computation layer; at least two initial quantization factors of each computation layer are obtained according to at least two algorithms; at least two pieces of second input data quantized for each computation layer are obtained based on those factors; the correlation between the first input data and each piece of second input data is compared in each computation layer; the initial quantization factor corresponding to the second input data with the largest correlation is taken as the final quantization factor of the layer; and the final quantization factor is input into the neural network model. By computing at least two initial quantization factors for each computation layer with at least two algorithms and comparing the correlation between the resulting second input data and the first input data of each layer, the quantization factor with the best precision can be obtained, so that the quantization accuracy of the whole neural network model is improved.
Further, when the neural network model is trained, in order to effectively avoid problems such as vanishing or exploding gradients, a data normalization layer following the computation layer is merged into the computation layer for computation. The data normalization layer may be a BN (batch normalization) layer or a scale layer. The BN layer is generally placed after the convolutional layer and can accelerate network convergence and control overfitting; however, during inference, the operations of the BN or scale layer degrade the performance of the neural network model and occupy too much memory or video memory.
In a specific embodiment, to avoid the above problems caused by computing the BN layer or the scale layer separately, when the neural network is actually trained, if network segments such as conv + bn + scale or conv + bn exist, the weight values of the BN layer and the scale layer may be folded into the weights of the convolutional layer, which reduces per-layer data computation during training of the neural network and, at the same time, keeps the data normalization layer from occupying too much memory or video memory. This embodiment does not specifically limit the manner in which the data normalization layer is merged.
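A sketch of such BN folding under the usual batch-normalization formula; the patent does not spell out the arithmetic, so the per-output-channel scaling and parameter names below are assumptions.

```python
import numpy as np

def fold_bn_into_conv(w, b, gamma, beta, mean, var, eps=1e-5):
    # BN after a conv computes: y = gamma * (conv(x) + b - mean) / std + beta,
    # with std = sqrt(var + eps). Folding gives an equivalent conv with:
    #   w' = w * (gamma / std)      (scaled per output channel)
    #   b' = beta + (b - mean) * (gamma / std)
    # `w` is assumed to have shape (out_channels, ...).
    std = np.sqrt(var + eps)
    scale = gamma / std                          # one factor per out channel
    w_folded = w * scale.reshape(-1, *([1] * (w.ndim - 1)))
    b_folded = beta + (b - mean) * scale
    return w_folded, b_folded
```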
In order to make the output data type seen by the convolutional layer, the fully-connected layer, or the non-computation layer of the convolutional neural network model be the second data type, and thereby save one data type conversion, the output data type of the layer preceding the convolutional layer and the fully-connected layer is set to the second data type, and the output data type of the layer preceding a non-computation layer is likewise set to the second data type. A non-computation layer is a layer that involves no computation in the neural network model, such as a permute layer or a concat layer. The second data type is a fixed-point type.
Specifically, in order to make the output data of the layer preceding a convolutional or fully-connected layer of the neural network model also be of the second data type, this embodiment transmits the input quantization factor of the convolutional or fully-connected layer to the layer preceding it, so that the output data of that preceding layer is of the second data type, that is, fixed-point.
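A minimal sketch of this handover, assuming symmetric int8 quantization and the multiplier convention used in the sketches above (both the function name and the int8 choice are illustrative):

```python
import numpy as np

def handover_fixed_point(prev_output_f32, next_input_factor, n_bits=8):
    # The preceding layer applies the next computation layer's input
    # quantization factor to its own output, so the tensor crossing the
    # layer boundary is already int8 (the second data type).
    qmax = 2 ** (n_bits - 1) - 1
    q = np.clip(np.round(prev_output_f32 * next_input_factor), -qmax - 1, qmax)
    return q.astype(np.int8)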
As for inputting the weight quantization factor into the neural network model, in this embodiment a quantized weight value is calculated from the weight quantization factor, and the quantized weight value is then input into the neural network model, thereby improving the quantization accuracy of the model.
Referring to fig. 2, fig. 2 is a schematic flow chart of the quantization operation of convolutional and fully-connected layers in the prior art. In the prior art, the intermediate data is dequantized so that its data type matches the data type of the bias value, but this increases the amount of computation and occupies too much memory. In order to keep the precision uniform during the operation and to facilitate data conversion, when the convolutional or fully-connected layer has a bias value, the bias value of the computation layer is converted into the output data type of the computation layer according to the quantization factor. Specifically, by converting the float32 bias value into the int32 output data type, the int32 intermediate data can be directly summed with the int32 bias value, which avoids dequantizing the int32 intermediate data, reduces data operations, and lowers the conversion overhead.
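A sketch of the bias conversion, assuming symmetric quantization with the multiplier-style factors above, so that the int32 accumulator of an int8 layer carries the combined scale input_factor * weight_factor:

```python
import numpy as np

def bias_to_int32(bias_f32, input_factor, weight_factor):
    # An int8 conv/fully-connected layer accumulates x_int8 @ w_int8 in int32;
    # that accumulator carries the scale input_factor * weight_factor, so the
    # float32 bias converted with the same combined scale adds in directly.
    return np.round(bias_f32 * input_factor * weight_factor).astype(np.int32)

# Usage idea: no dequantization of the int32 intermediate data is needed:
# acc = x_int8.astype(np.int32) @ w_int8.astype(np.int32)
# acc += bias_to_int32(bias, s_in, s_w)
```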
In this embodiment, a training picture is input into the neural network model, and computation is performed in a first data type to obtain first input data of each computation layer in the neural network model; at least two initial quantization factors of each computation layer are obtained according to at least two algorithms; at least two pieces of second input data quantized for each computation layer are obtained based on the at least two initial quantization factors; the correlation between the first input data and each piece of second input data is compared in each computation layer; the initial quantization factor corresponding to the second input data with the largest correlation is taken as the final quantization factor of the computation layer; and the final quantization factor is input into the neural network model. According to the method, at least two algorithms are adopted to compute the quantization factors, the correlation between the second input data produced under each algorithm and the first input data of each computation layer is compared, and the quantization factor with the best precision is obtained from the correlation, so that the quantization accuracy of the whole neural network model is improved. Furthermore, the data normalization layer is merged into the computation layer for computation, which reduces per-layer data computation and keeps the data normalization layer from occupying too much memory or video memory; the output data types of the layers preceding the convolutional layer, the fully-connected layer, and the non-computation layer are set to the second data type, so that the output data type of each layer of the network is set according to the characteristics of the following layer and the data transfer overhead between layers is reduced; and the bias value of the computation layer is converted into the output data type of the computation layer, which avoids dequantizing the intermediate data, reduces data operations, lowers the conversion overhead, and improves the inference speed of the neural network model.
To implement the quantization method of the neural network model in the foregoing embodiments, the present application further provides a terminal device; specifically, refer to fig. 3, which is a schematic structural diagram of an embodiment of the terminal device provided in the present application.
The device 300 comprises a memory 31 and a processor 32, wherein the memory 31 and the processor 32 are coupled.
The memory 31 is used for storing program data, and the processor 32 is used for executing the program data to implement the quantization method of the neural network model of the above-mentioned embodiment.
In the present embodiment, the processor 32 may also be referred to as a CPU (Central Processing Unit). The processor 32 may be an integrated circuit chip having signal processing capabilities. The processor 32 may also be a general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components. A general purpose processor may be a microprocessor or the processor 32 may be any conventional processor or the like.
The present application further provides a computer storage medium 400, as shown in fig. 4, the computer storage medium 400 is used for storing program data 41, and the program data 41 is used for implementing the quantization method of the neural network model as described in the method embodiment of the present application when being executed by a processor.
When the method involved in the embodiments of the quantization method for a neural network model of the present application is implemented in the form of a software functional unit and sold or used as a standalone product, it may be stored in a device, for example a computer-readable storage medium. Based on such understanding, the technical solution of the present application, in essence or in the part contributing to the prior art, may be embodied in whole or in part as a software product stored in a storage medium, which includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) or a processor to execute all or part of the steps of the methods of the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
The above description is only for the purpose of illustrating embodiments of the present application and is not intended to limit the scope of the present application, and all modifications of equivalent structures and equivalent processes, which are made by the contents of the specification and the drawings of the present application or are directly or indirectly applied to other related technical fields, are also included in the scope of the present application.

Claims (10)

1. A method of quantizing a neural network model, the method comprising:
inputting a training picture into the neural network model, and performing computation in a first data type to obtain first input data of each computation layer in the neural network model;
obtaining at least two initial quantization factors for each computation layer according to at least two algorithms;
obtaining at least two pieces of second input data quantized for each computation layer based on the at least two initial quantization factors;
comparing the correlation between the first input data and each piece of second input data in each computation layer;
taking the initial quantization factor corresponding to the second input data with the largest correlation as the final quantization factor of the computation layer; and
inputting the final quantization factor into the neural network model.
2. The quantization method of claim 1, further comprising:
merging a data normalization layer following the computation layer into the computation layer for computation.
3. The quantization method of claim 1, wherein the computation layer includes a convolutional layer and a fully-connected layer; the method further comprising:
setting the output data type of the layer preceding the convolutional layer and the fully-connected layer to a second data type.
4. The quantization method of claim 3, further comprising:
the output data type of a layer preceding the non-computation layer is set to the second data type.
5. The quantization method of claim 3, wherein the quantization factor comprises a weight quantization factor and an input quantization factor; the inputting the final quantization factor into the neural network model comprises:
transmitting the input quantization factor of the computation layer to the previous layer such that the output data of the previous layer is of the second data type.
6. The quantization method of claim 5, wherein the first data type is a floating point type and the second data type is a fixed point type.
7. The quantization method of claim 5, wherein inputting the final quantization factor into the neural network model further comprises:
calculating a quantized weight value according to the weight quantization factor, and inputting the quantized weight value into the neural network model.
8. The method of claim 1, wherein said inputting the final quantization factor into the neural network model comprises:
converting the bias value of the computation layer into the output data type of the computation layer according to the quantization factor.
9. A terminal device, comprising a memory and a processor coupled to the memory;
wherein the memory is configured to store program data and the processor is configured to execute the program data to implement the method of quantizing a neural network model according to any one of claims 1 to 8.
10. A computer storage medium for storing program data which, when executed by a processor, implements the method of quantizing a neural network model according to any one of claims 1 to 8.
CN202010568807.0A 2020-06-19 2020-06-19 Method and apparatus for quantizing neural network model, and computer storage medium Pending CN111814955A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010568807.0A CN111814955A (en) 2020-06-19 2020-06-19 Method and apparatus for quantizing neural network model, and computer storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010568807.0A CN111814955A (en) 2020-06-19 2020-06-19 Method and apparatus for quantizing neural network model, and computer storage medium

Publications (1)

Publication Number Publication Date
CN111814955A 2020-10-23

Family

ID=72845327

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010568807.0A Pending CN111814955A (en) 2020-06-19 2020-06-19 Method and apparatus for quantizing neural network model, and computer storage medium

Country Status (1)

Country Link
CN (1) CN111814955A (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102800316A (en) * 2012-08-30 2012-11-28 重庆大学 Optimal codebook design method for voiceprint recognition system based on nerve network
CN110020717A (en) * 2017-12-08 2019-07-16 三星电子株式会社 Method and apparatus for generating fixed point neural network
US20190279072A1 (en) * 2018-03-09 2019-09-12 Canon Kabushiki Kaisha Method and apparatus for optimizing and applying multilayer neural network model, and storage medium
US20190347550A1 (en) * 2018-05-14 2019-11-14 Samsung Electronics Co., Ltd. Method and apparatus with neural network parameter quantization
CN111260022A (en) * 2019-11-22 2020-06-09 中国电子科技集团公司第五十二研究所 Method for fixed-point quantization of complete INT8 of convolutional neural network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
耿丽丽 (Geng Lili): "A Survey of Deep Neural Network Model Compression" (深度神经网络模型压缩综述), Journal of Frontiers of Computer Science and Technology (计算机科学与探索), vol. 14, no. 9, pages 1441-1454 *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023060959A1 (en) * 2021-10-13 2023-04-20 山东浪潮科学研究院有限公司 Neural network model quantification method, system and device, and computer-readable medium

Similar Documents

Publication Title
WO2018068421A1 (en) Method and device for optimizing neural network
CN111898764A (en) Method, device and chip for federal learning
EP3816866A1 (en) Operation method and apparatus for network layer in deep neural network
CN112149797B (en) Neural network structure optimization method and device and electronic equipment
CN113222102B (en) Optimization method for neural network model quantization
US11625583B2 (en) Quality monitoring and hidden quantization in artificial neural network computations
US11853897B2 (en) Neural network training with decreased memory consumption and processor utilization
CN110874627A (en) Data processing method, data processing apparatus, and computer readable medium
US11704555B2 (en) Batch normalization layer fusion and quantization method for model inference in AI neural network engine
CN111814955A (en) Method and apparatus for quantizing neural network model, and computer storage medium
Zhong et al. Neural networks for partially linear quantile regression
CN111937011A (en) Method and equipment for determining weight parameters of neural network model
US20200380346A1 (en) Optimization apparatus and optimization method
US20220405561A1 (en) Electronic device and controlling method of electronic device
US20220207346A1 (en) Data processing method and device used in neural network
US20220391761A1 (en) Machine learning device, information processing method, and recording medium
US20200372363A1 (en) Method of Training Artificial Neural Network Using Sparse Connectivity Learning
CN116472538A (en) Method and system for quantifying neural networks
CN114580625A (en) Method, apparatus, and computer-readable storage medium for training neural network
CN111582444A (en) Matrix data processing device, electronic equipment and storage medium
US20240086678A1 (en) Method and information processing apparatus for performing transfer learning while suppressing occurrence of catastrophic forgetting
JP2022187708A (en) Learning device, learning method, and program
JP2019133627A (en) Information processing method and information processing system
WO2023145164A1 (en) Image classification device, image classification method, and image classification program
CN116992946B (en) Model compression method, apparatus, storage medium, and program product

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination