CN114611665A

CN114611665A - Multi-precision hierarchical quantization method and device based on weight oscillation influence degree

Info

Publication number: CN114611665A
Application number: CN202210217282.5A
Authority: CN
Inventors: 宋萍; 刘宏博; 郄有田; 李一凡
Original assignee: Beijing Institute of Technology BIT
Current assignee: Beijing Institute of Technology BIT
Priority date: 2022-03-07
Filing date: 2022-03-07
Publication date: 2022-06-10

Abstract

The multi-precision layered quantization method and device based on the weight oscillation influence degree can guarantee the accuracy of the neural network, solve the problems of structural redundancy and complex parameters of the trained neural network, reduce the calculation amount and memory consumption and realize the on-chip operation of the neural network. The method comprises the following steps: performing global feature extraction on the feature map, acquiring a channel activation value and determining a weight oscillation coefficient; adding a weight oscillation value in the trained neural network to calculate the influence on the accuracy of the neural network; sequencing the weights according to the oscillation influence degree, and setting a quantization scale; quantizing the neural network; retraining the unquantized parameters to obtain the accuracy of the neural network; and (5) setting a quantization scale, repeating the steps (4) and (5) until the lowest accuracy threshold of the neural network is reached, and finishing quantization.

Description

Multi-precision hierarchical quantization method and device based on weight oscillation influence degree

Technical Field

The invention belongs to the technical field of neural network model compression, and particularly relates to a multi-precision hierarchical quantization method based on weight oscillation influence degree and a multi-precision hierarchical quantization device based on weight oscillation influence degree.

Background

The parameters in the neural network are numerous, and a large amount of memory and computing resources are required to be occupied during operation, so that the neural network is difficult to deploy on a mobile device with limited resources.

In order to solve the above problems, the compression technology of the neural network is widely focused, mainly including pruning, low rank approximation, knowledge distillation, quantization, and the like. The pruning technology needs to set a threshold value, and removes the connection of the parameter value between the neurons lower than the threshold value to achieve the pruning purpose, but the pruning aiming at the neuron connection can generate a sparse matrix, and the sparse matrix can only realize the compression acceleration effect by a specific bottom library and hardware. The traditional quantization theory adopts uniform quantization threshold values and bit widths for the whole method, and although the neural network has certain robustness, the precision of the quantized neural network is lost, and the execution efficiency and the accuracy of the network are reduced.

Disclosure of Invention

In order to overcome the defects of the prior art, the technical problem to be solved by the invention is to provide a multi-precision hierarchical quantization method based on the weight oscillation influence degree, which can ensure the accuracy of a neural network, solve the problems of structural redundancy and complex parameters of the trained neural network, reduce the operation amount and memory consumption and realize the on-chip operation of the neural network.

The technical scheme of the invention is as follows: the multi-precision hierarchical quantization method based on the weight oscillation influence degree comprises the following steps:

(1) belongs to R through convolution kernel W ∈ R^I×w×h×CThe characteristic diagram of the output after operation is X ∈ R^W×H×CWherein I represents the number of channels of the input feature diagram, W and H represent the width and height of the convolution kernel respectively, W and H represent the width and height of the feature diagram respectively, and C is the number of output channels, and global feature extraction, channel activation value acquisition and weight oscillation coefficient phi determination are carried out on the feature diagram;

(2) adding a weight oscillation value in the trained neural network to calculate the influence on the accuracy of the neural network;

(3) the weights are sequenced according to the oscillation influence degree eta, and the quantization scale is set to be p₁；

(4) Quantizing the neural network;

(5) retraining the unquantized parameters to obtain the accuracy of the neural network;

(6) setting the quantization scale to p₂And (5) repeating the steps (4) and (5) until the minimum accuracy threshold of the neural network is reached, and finishing the quantification.

According to the method, a branch quantization strategy is adopted, weights are sequenced and quantized in a layered mode according to the weight oscillation influence degree, and the accuracy of a neural network is guaranteed through a method of retraining unquantized weights; firstly, adding a global average pooling layer, a full-link layer and an activation layer to obtain activation values of a feature map, and representing the weight oscillation coefficients of corresponding filters by the activation values of different channels of the feature map; adding a weight oscillation value in the trained neural network to calculate the influence on the accuracy of the neural network; determining quantization parameters by a fine tuning fast training method, quantizing the weights according to different quantization precisions, and connecting the weights with a loss function through a straight-through estimator to obtain an optimal quantization model; therefore, the accuracy of the neural network can be ensured, the problems of structural redundancy and complex parameters of the neural network after training are solved, the calculation amount and the memory consumption are reduced, and the on-chip operation of the neural network is realized.

There is also provided a multi-precision hierarchical quantization apparatus based on a weighted oscillation influence degree, the apparatus including:

the weight oscillation coefficient acquisition module is configured to extract global features aiming at the feature map, acquire a channel activation value and determine a weight oscillation coefficient phi;

an adding module configured to add a weighted oscillation value to the trained neural network to calculate an influence on the accuracy of the neural network;

a ranking module configured to rank the weights according to an oscillation influence degree eta, setting a quantization scale to p₁；

A quantization module configured to quantize a neural network;

a training module configured to retrain unquantized parameters to obtain an accuracy of the neural network;

a reset parameter module configured to set the quantization scale to p₂Repeatedly performAnd the quantization module and the training module finish quantization until reaching the lowest threshold of the accuracy of the neural network.

Drawings

Fig. 1 is a flowchart of a multi-precision hierarchical quantization method based on the influence of weight oscillation according to the present invention.

FIG. 2 is a diagram illustrating fine tune quantization fast training according to the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail with reference to the accompanying drawings and specific embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and do not limit the invention.

In order to make the description of the present disclosure more complete and complete, the following description is given for illustrative purposes with respect to the embodiments and examples of the present invention; it is not intended to be the only form in which the embodiments of the invention may be practiced or utilized. The embodiments are intended to cover the features of the various embodiments as well as the method steps and sequences for constructing and operating the embodiments. However, other embodiments may be utilized to achieve the same or equivalent functions and step sequences.

The multi-precision hierarchical quantization method based on the weight oscillation influence degree comprises the following steps:

(1) w belongs to R through convolution kernel^I×w×h×CThe characteristic diagram of the output after operation is X ∈ R^W×H×CWherein I represents the number of channels of an input feature map, W and H represent the width and height of a convolution kernel respectively, W and H represent the width and height of the feature map respectively, and C is the number of output channels, and global feature extraction, channel activation value acquisition and weight oscillation coefficient phi determination are performed on the feature map;

(4) Quantizing the neural network;

(6) setting the quantization scale to p₂And (5) repeating the steps (4) and (5) until the lowest accuracy threshold of the neural network is reached, and finishing the quantification.

According to the method, a branch quantization strategy is adopted, weights are sequenced and quantized in a layered mode according to the weight oscillation influence degree, and the accuracy of a neural network is guaranteed through a method of retraining unquantized weights; firstly, acquiring an activation value of a feature map by adding a global average pooling layer, a full connection layer and an activation layer, and representing a weight oscillation coefficient of a corresponding filter by using the activation values of different channels of the feature map; adding a weight oscillation value in the trained neural network to calculate the influence on the accuracy of the neural network; determining quantization parameters by a fine tuning fast training method, quantizing the weights according to different quantization precisions, and connecting the weights with a loss function through a straight-through estimator to obtain an optimal quantization model; therefore, the accuracy of the neural network can be ensured, the problems of structural redundancy and complex parameters of the neural network after training are solved, the calculation amount and the memory consumption are reduced, and the on-chip operation of the neural network is realized.

Preferably, the step (1) comprises the following substeps:

(1.1) mapping function F by compression_sq() Performing compression mapping extraction on all channels, taking the extraction result as a description mark of the whole channel, using global averaging to obtain an average value of all characteristic parameters in the channel, and obtaining a characteristic diagram X belonging to R after global averaging^W×H×CBecomes Z ∈ R^1×1×CBy compressing the mapping function F_sq() Is a formula (1)

(1.2) obtaining channel activation value

By an excitation function F_ex() Calculating a weight value according to the result of the compression mapping, and recalibrating the weight W by using the feature Z extracted in the step (1), wherein an excitation function is a formula (2):

s＝F_ex(Z,W)＝σ(W₂δ(W₁Z)) (2)

where the activation function σ is ReLU, δ is sigmoid, weight

Obtaining a one-dimensional excitation weight W through training and learning to activate all layers, wherein s is an activation value of a channel;

(1.3) determining the weighted oscillation coefficient phi

According to the corresponding relation of the significance of the channel and the filter, the weighted oscillation coefficient phi in the filter is represented by the activation value s of the channel, the oscillation coefficients of different filters are different, and the calculation formula is as follows:

φ_c＝s_c (3)

wherein phi is_cFor weighted oscillation coefficients of different filters, s_cLearned activation values for the different channels.

Preferably, in the step (2), the trained neural network parameters are all determined, and the weights in different channels are determined

By using

Instead, other weight parameters in the neural network are kept unchanged, the network is reconstructed, and the weight oscillation value is as follows:

judging the influence degree of the neural network by adding the change of the accuracy of the neural network before and after the weight oscillation value, inputting the test set which is the same as that in the network training, and obtaining the accuracy of the neural network as

The degree of weight oscillation influence

Comprises the following steps:

wherein the original accuracy of the neural network is I.

Preferably, in the step (3), the weights with low influence degree are quantized with high precision, and the weights with high influence degree are subjected to a low-precision quantization or non-quantization strategy, so that different quantization standards are determined for different weights with accuracy as a target.

Preferably, in the step (4),

let r denote the floating-point real number and q denote the quantized fixed-point integer, the conversion formula is as follows:

r＝S(Q-Z) (7)

wherein, S represents a scaling coefficient between a floating-point real number and a fixed-point integer, Z represents an integer corresponding to 0 in the floating-point real number after quantization, and a calculation formula of S and Z is as follows:

when the floating-point real number is quantized into the fixed-point integer, the quantized value needs to be cut off, b represents the integer number of bits to be quantized by the floating-point real number, and is represented by Q:

the calculation formula of S and Z is then:

preferably, the step (4) comprises the following substeps:

(4.1) forward propagation process: simulating and quantizing the process in the forward propagation process, quantizing the weight and the activation value, then inversely quantizing the floating point number with the error, and extracting data characteristics by using the floating point number with the error;

(4.2) a back propagation process: the model utilizes the floating point number of inverse quantization back to act on the data to calculate loss function loss (L), the obtained gradient is the gradient of the weight value after analog quantization, a straight-through estimator (STE) is used for approximating the pseudo quantization of the gradient, the weight value before quantization is updated by the gradient, and the floating point model containing quantization error is obtained;

(4.3) calculating a weight final quantization value w_i；

(4.4) reference value S from the scaling factor by means of an iterative update₁、S₂、S₃And selecting, performing characteristic measurement on the output characteristics obtained by using the quantized algorithm model after acting on input data and the output characteristics obtained by using the original floating point model, taking the search result with the highest characteristic similarity as a final S value, and superposing the result of the previous layer to the next layer during the characteristic measurement so as to ensure that certain correlation exists between the layers.

Preferably, in said step (4.1),

suppose the convolution kernel W ∈ R^I×w×h×CHas a quantization scale of p₁The number of weights to be quantized is p₁xIxw xhxC; selecting the weight with low influence degree for quantization according to the sequence of the influence degree of the weight oscillation in the neural network in the step (3)

Wherein: w is a_floatFloating point weight, w_iWeighting the quantized fixed point integer;

w_iinverse quantization to obtain a floating point value w 'with quantization error'_f：

w′_f＝S(w_i-Z) (14)

W 'of'_fIs a floating point value with quantization error.

Preferably, in said step (4.3),

it will be understood by those skilled in the art that all or part of the steps in the method of the above embodiments may be implemented by hardware instructions related to a program, the program may be stored in a computer-readable storage medium, and when executed, the program includes the steps of the method of the above embodiments, and the storage medium may be: ROM/RAM, magnetic disks, optical disks, memory cards, and the like. Therefore, in accordance with the method of the present invention, the present invention also includes a multi-precision hierarchical quantization apparatus based on the influence of weighted oscillation, which is generally expressed in the form of functional blocks corresponding to the steps of the method. The device includes:

a ranking module configured to rank the weights according to an oscillation influence magnitude η, setting a quantization scale to p₁；

A quantization module configured to quantize a neural network;

a reset parameter module configured to set the quantization scale to p₂And repeatedly executing the quantization module and the training module until the lowest threshold of the accuracy of the neural network is reached, and finishing the quantization.

The invention is described in detail below by way of example with reference to the accompanying drawings.

As shown in fig. 1, a multi-precision hierarchical quantization method based on the influence degree of weight oscillation includes the following steps:

step one, determining a weight oscillation coefficient phi;

suppose W is belonged to R through a certain convolution kernel^I×w×h×CThe characteristic diagram of the output after operation is X ∈ R^W×H×CWherein I represents the number of channels of the input feature map, W and H represent the width and height of the convolution kernel respectively, W and H represent the width and height of the feature map respectively, and C is the number of output channels;

step (1) of global feature extraction

Compression mapping function F_sq() All channels can be compressed, mapped and extracted, the extraction result is used as a description mark of the whole channel, the average value of all characteristic parameters in the channel is obtained by using global averaging, and the characteristic diagram X belongs to R after global averaging^W×H×CBecomes Z ∈ R^1×1×CBy compressing the mapping function F_sq() Comprises the following steps:

step (2) obtaining channel activation value

Excitation function F_ex() The weight values can be calculated according to the result of compression mapping, the degree of the model result to each channel is learned, the channels are independent after being calibrated for better fitting nonlinear characteristics, the calibration between the channels is nonlinear, and a sigmoid activation function is needed to realize the conditions. Recalibrating the weight W using the features Z extracted in step (1), the mathematical expression of the excitation function being：

s＝F_ex(Z,W)＝σ(W₂δ(W₁Z))

Where the activation function σ is ReLU and δ is sigmoid. Weight of

A one-dimensional excitation weight W is obtained through training and learning to activate all layers, and s is an activation value of a channel

Step (3) determining a weight oscillation coefficient phi

φ_c＝s_c

Determining the weight oscillation influence degree eta;

the trained neural network parameters are all determined, and the weights in different channels are determined

By using

as can be seen from the formula,

both by the weight itself and by the filter in which it is located.

Judging the influence degree and the output of the neural network by the change of the accuracy of the neural network before and after adding the weight oscillation valueEntering the same test set as the training network to obtain the neural network with the accuracy of

The degree of weight oscillation influence

Comprises the following steps:

wherein the original accuracy of the neural network is I.

Thirdly, sorting the weights according to the oscillation influence degree eta, and setting the quantization scale as p₁；

Aiming at the problem that the execution precision and efficiency of the algorithm are reduced due to the fact that the traditional quantization theory adopts a uniform quantization threshold value and bit width for the whole algorithm, a layered quantization strategy is provided, the weight with low influence degree is quantized with high precision, the weight with high influence degree is quantized with low precision or is not quantized, and different quantization standards are determined for different weights by taking accuracy as a target.

Step four, quantifying the neural network;

r＝S(Q-Z)

when the floating-point real number is quantized into the fixed-point integer, the quantized value needs to be cut off, b represents the integer bit number to be quantized by the floating-point real number, and is represented by Q:

the calculation formula of S and Z can be expressed as:

the parameter values of the model are approximately in Gaussian distribution, high-frequency values with small contribution degree exist at the edges of the numerical values, irrelevant high-frequency details are removed, and the quantization precision can be improved to a certain extent. Therefore, combining the above formula analysis, obtaining an appropriate S value can reduce the loss caused by quantization, and further analyzing to determine an appropriate r in the floating-point real number_max、r_minAnd the quantization bit width b is a key to guarantee quantization performance.

The fine tuning quantization fast training method is shown in fig. 2, and comprises the following specific steps:

step (1), forward propagation process

The process of analog quantization is carried out in the forward propagation process, the weight value and the activation value are quantized and then inversely quantized to return an error floating point number, and data features are extracted by using the error floating point number.

Suppose the convolution kernel W ∈ R^I×w×h×CHas a quantization scale of p₁The number of weights to be quantized is p₁xIxw xhxC; and selecting the weight with low influence degree for quantization according to the sequence of the influence degree of the weight oscillation in the neural network in the third step.

w′_f＝S(w_i-Z)

W 'of'_fIs a floating point value with quantization error.

Step (2), back propagation process

In the quantization process, due to operations such as truncation and rounding, quantization errors are generated, and the loss function loss (L) is calculated on data by the model through the action of floating point numbers subjected to inverse quantization.

The gradient obtained is the gradient of the weight after the analog quantization, and since the quantized function is a piecewise function, there are cases where the gradient is absent and the gradient is zero, the pseudo-quantization of the gradient is approximated by a pass-through estimator (STE). And updating the weight value before quantization, namely the parameters of the original floating point model by using the gradient to finally obtain the floating point model containing the quantization error.

Step (3) calculating the final weight quantization value

Step (4) determination of scaling factor S

By means of iterative updating, from the scaling factor reference value S₁、S₂、S₃And selecting, performing characteristic measurement on the output characteristics obtained by using the quantized algorithm model after acting on input data and the output characteristics obtained by using the original floating point model, taking the search result with the highest characteristic similarity as a final S value, and superposing the results of the previous layer to the next layer during the characteristic measurement, so that certain correlation between the layers is ensured.

Step five, retraining the unquantized parameters to obtain the accuracy of the neural network

Step six, setting the quantization scale as p₂And repeating the fourth step and the fifth step until the lowest accuracy threshold of the neural network is reached, and finishing the quantification.

The above description is only a preferred embodiment of the present invention, and is not intended to limit the present invention in any way, and all simple modifications, equivalent variations and modifications made to the above embodiment according to the technical spirit of the present invention still belong to the protection scope of the technical solution of the present invention.

Claims

1. The multi-precision hierarchical quantization method based on the weight oscillation influence degree is characterized by comprising the following steps: which comprises the following steps:

(4) Quantizing the neural network;

2. The method of claim 1, wherein the method comprises: the step (1) comprises the following sub-steps:

(1.1) mapping function F by compression_sq() All channels are compressed, mapped and extracted, and the extraction result is used as the description of the whole channelMarking, using global average to obtain average value of all characteristic parameters in channel, after global average making characteristic diagram X be element R^W×H×CBecomes Z ∈ R^1×1×CBy compressing the mapping function F_sq() Is a formula (1)

(1.2) obtaining channel activation value

By means of an excitation function F_ex() Calculating a weight value according to the result of the compression mapping, and recalibrating the weight W by using the feature Z extracted in the step (1), wherein an excitation function is a formula (2):

s＝F_ex(Z,W)＝σ(W₂δ(W₁Z)) (2)

where the activation function σ is ReLU, δ is sigmoid, weight

(1.3) determining the weighted oscillation coefficient phi

φ_c＝s_c (3)

wherein phi_cFor weighted oscillation coefficients of different filters, s_cLearned activation values for different channels.

3. The method according to claim 2, wherein the method comprises: in the step (2), the trained neural network parameters are all determined, and the weights in different channels are used

By using

judging the influence degree of the neural network accuracy by adding the change of the neural network accuracy before and after the weight oscillation value, inputting the test set which is the same as that of the training network, and obtaining the neural network accuracy of

The degree of weight oscillation influence

Comprises the following steps:

wherein the original accuracy of the neural network is I.

4. The method of claim 3, wherein the method comprises: in the step (3), the weight with low influence degree is quantized with high precision, the weight with high influence degree adopts a low-precision quantization or non-quantization strategy, and different quantization standards are determined for different weights by taking accuracy as a target.

5. The method of claim 4, wherein the method comprises: in the step (4), the step (C) is carried out,

r＝S(Q-Z) (7)

the calculation formula of S and Z is then:

6. the method of claim 5, wherein the method comprises: the step (4) comprises the following sub-steps:

(4.3) calculating a weight final quantization value w_i；

7. The method of claim 6, wherein: in the step (4.1), the step (C),

w′_f＝S(w_i-Z) (14)

W 'of'_fFloating point values with quantization errors.

8. The method of claim 7, wherein the method comprises: in the step (4.3), the step of,

9. the multi-precision layered quantization device based on the weight oscillation influence degree is characterized in that: the device includes:

the weight oscillation coefficient acquisition module is configured to perform global feature extraction on the feature map, acquire a channel activation value and determine a weight oscillation coefficient phi;

A quantization module configured to quantize a neural network;