CN114756517A - Visual Transformer compression method and system based on micro-quantization training - Google Patents

Visual Transformer compression method and system based on micro-quantization training

Info

Publication number
CN114756517A
Authority
CN
China
Prior art keywords
quantization
micro
training
layer
alpha
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210295189.6A
Other languages
Chinese (zh)
Inventor
李哲鑫
张一帆
王培松
程健
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhongke Nanjing Artificial Intelligence Innovation Research Institute
Institute of Automation of Chinese Academy of Science
Original Assignee
Zhongke Nanjing Artificial Intelligence Innovation Research Institute
Institute of Automation of Chinese Academy of Science
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.) 2022-03-24
Filing date 2022-03-24
Publication date 2022-07-15
Application filed by Zhongke Nanjing Artificial Intelligence Innovation Research Institute, Institute of Automation of Chinese Academy of Science filed Critical Zhongke Nanjing Artificial Intelligence Innovation Research Institute
Priority to CN202210295189.6A priority Critical patent/CN114756517A/en
Publication of CN114756517A publication Critical patent/CN114756517A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/17 Details of further file system functions
    • G06F16/174 Redundancy elimination performed by the file system
    • G06F16/1744 Redundancy elimination performed by the file system using compression, e.g. sparse files
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/16 Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/048 Activation functions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/10 Segmentation; Edge detection
    • G06T7/11 Region-based segmentation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Pure & Applied Mathematics (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Optimization (AREA)
  • Mathematical Analysis (AREA)
  • Computational Mathematics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Algebra (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

The invention discloses a visual Transformer compression method and system based on micro-quantization (differentiable quantization) training, and belongs to the technical field of artificial intelligence. The method comprises the following steps: step one, an input picture is partitioned into blocks and converted into a corresponding picture sequence through linear mapping; step two, the picture sequence undergoes M rounds of alternating quantized processing of global information and local information to obtain a compressed picture sequence; step three, the compressed picture sequence is classified and a predicted probability value is output. A micro-quantization step size training method is introduced throughout steps one to three to improve how well each quantization step size matches the image data; in addition, in step two, a micro-quantization bias training method is introduced when local information is quantized, so that an optimal quantization interval is learned automatically and the information in the negative activation region is preserved. Performance loss caused by quantization is thereby reduced and quantization accuracy is improved.

Description

Visual Transformer compression method and system based on micro-quantization training
Technical Field
The invention belongs to the technical field of artificial intelligence, and particularly relates to a visual Transformer compression method and system based on micro-quantization training.
Background
In recent years, models based on the Transformer architecture have achieved remarkable results on a wide range of natural language processing tasks. In the field of computer vision, work based on the visual Transformer (vision Transformer) has likewise reached, and in some cases surpassed, the performance of traditional convolutional neural networks on a variety of visual tasks, including classification, detection, segmentation, super-resolution, and denoising. However, the visual Transformer has a very large number of parameters, and its computation grows quadratically with the input resolution, which leads to high memory occupation and high latency at inference time and makes deployment difficult on computationally limited devices such as mobile terminals and autonomous-driving chips. It is therefore crucial to explore suitable compression techniques that greatly reduce the model size and inference latency of the visual Transformer while keeping the performance loss low.
Quantization has been used extensively in convolutional neural networks as an efficient compression technique. Whether in a convolutional neural network or a visual Transformer model, the core operation is matrix multiplication. By quantizing both the weights and the features from the original 32-bit floating-point numbers into low-bit fixed-point numbers, the original floating-point matrix multiplications can be replaced by low-bit fixed-point matrix multiplications, compressing the model size while accelerating inference. Depending on whether fine-tuning is performed after quantization, methods divide into post-training quantization (post-quantization) and quantization-aware training. For visual Transformers, existing work based on post-quantization incurs a large performance penalty, while traditional quantization-aware training methods do not fully account for the characteristics of the visual Transformer, so their performance at low bit-widths is not ideal.
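To make the substitution concrete, here is a minimal NumPy sketch, not taken from the patent: per-tensor quantization of the weight and feature matrices turns one floating-point matmul into one integer matmul plus a single floating-point rescale. The max-abs step sizes below are a simple stand-in for the learned step sizes discussed later.

```python
import numpy as np

def quantize(t, step, q_lo, q_hi):
    """q = clip(round(t / step), q_lo, q_hi): float tensor -> low-bit integers."""
    return np.clip(np.round(t / step), q_lo, q_hi).astype(np.int32)

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 64)).astype(np.float32)   # full-precision weights
x = rng.standard_normal((64, 64)).astype(np.float32)   # full-precision features

# Illustrative per-tensor step sizes; the method below learns these instead.
s_w, s_x = np.abs(w).max() / 127, np.abs(x).max() / 127
q_w = quantize(w, s_w, -128, 127)                      # 8-bit signed
q_x = quantize(x, s_x, -128, 127)

y_int = q_w @ q_x                   # pure integer matrix multiplication
y_hat = (s_w * s_x) * y_int         # one float rescale recovers the scale
y_ref = w @ x
print(np.abs(y_hat - y_ref).max())  # small approximation error
```

On hardware with int8 arithmetic, the integer matmul is where the inference speedup comes from; the rescale is a single scalar multiply per output tensor.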
Disclosure of Invention
The invention provides a visual Transformer compression method and system based on micro-quantization training, aiming to solve the technical problems described in the background above.
The invention adopts the following technical scheme: a visual Transformer compression method based on micro-quantization training comprises the following steps:
step one, partitioning an input picture into blocks and converting it into a corresponding picture sequence through linear mapping;
step two, subjecting the picture sequence to M rounds of alternating quantized processing of global information and local information to obtain a compressed picture sequence;
step three, classifying the compressed picture sequence and outputting a predicted probability value;
a micro-quantization step size training method is introduced throughout steps one to three, improving how well each quantization step size matches the image data; in addition, in step two, a micro-quantization bias training method is introduced when local information is quantized, so that an optimal quantization interval is learned automatically and the information in the negative activation region is preserved.
In a further embodiment, when the micro-quantization step size training method and/or the micro-quantization bias training method is performed, quantization parameter initialization based on minimizing the mean square error is also included.
In a further embodiment, the micro-quantization step size training method applies to both image feature quantization and image weight quantization;
the micro-quantization step size training method comprises the following procedure:
define the full-precision weight as w and the quantized fixed-point weight as q; the quantization operation is expressed as:

$$ q = \mathrm{clip}\left(\mathrm{round}\left(\frac{w}{\alpha}\right), -q_{\min}, q_{\max}\right) $$

where clip(z, a, b) sets elements of the matrix z smaller than a to a and elements larger than b to b; the round operation denotes rounding to the nearest integer; α denotes the differentiable quantization step size; and $-q_{\min}$ and $q_{\max}$ denote the minimum and maximum of the quantization range, respectively;
the floating-point weight corresponding to the fixed-point weight q is recovered through the dequantization operation:

$$ \hat{w} = \alpha \cdot q $$
In a further embodiment, the micro-quantization bias training method comprises the following procedure:
define the full-precision weight as w and the quantized fixed-point weight as q; the quantization operation is expressed as:

$$ q = \mathrm{clip}\left(\mathrm{round}\left(\frac{w - \beta}{\alpha}\right), -q_{\min}, q_{\max}\right) $$

where clip(z, a, b) sets elements of the matrix z smaller than a to a and elements larger than b to b; the round operation denotes rounding to the nearest integer; α denotes the differentiable quantization step size; and $-q_{\min}$ and $q_{\max}$ denote the minimum and maximum of the quantization range, respectively;
the floating-point weight corresponding to the fixed-point weight q is recovered through the dequantization operation:

$$ \hat{w} = \alpha \cdot q + \beta $$

where β is the introduced differentiable bias.
In a further embodiment, the values of $-q_{\min}$ and $q_{\max}$ are as follows: given a quantization bit-width b,
for signed quantization, $q_{\min} = 2^{b-1}$ and $q_{\max} = 2^{b-1} - 1$;
for unsigned quantization, $q_{\min} = 0$ and $q_{\max} = 2^{b} - 1$.
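A minimal NumPy sketch of the two quantizers defined above (forward math only; function names are illustrative, and the bias-enabled variant assumes the β-shifted form given for the micro-quantization bias method):

```python
import numpy as np

def quant_range(b, signed=True):
    """(q_min, q_max) for bit-width b; signed clips to [-q_min, q_max]."""
    if signed:
        return 2 ** (b - 1), 2 ** (b - 1) - 1
    return 0, 2 ** b - 1

def fake_quant(w, alpha, b=8, signed=True, beta=None):
    """Quantize then dequantize; returns the float values the network sees."""
    q_min, q_max = quant_range(b, signed)
    lo = -q_min if signed else q_min
    shifted = w if beta is None else w - beta          # bias variant shifts first
    q = np.clip(np.round(shifted / alpha), lo, q_max)  # fixed-point values q
    w_hat = alpha * q                                  # dequantization
    return w_hat if beta is None else w_hat + beta

w = np.random.randn(4, 4)
print(fake_quant(w, alpha=0.05))                                   # step size only
print(fake_quant(w, alpha=0.05, b=8, signed=False, beta=w.min()))  # with bias
```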
In a further embodiment, the dequantization operation handles the gradient with a straight-through estimator: when α is updated, its gradient is additionally scaled by the factor

$$ g = \frac{1}{\sqrt{N_w \cdot q_{\max}}} $$

where $N_w$ is the number of elements of the full-precision weight w.
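In code, the straight-through estimator and the gradient scaling can both be realized with the detach trick used by public learned-step-size (LSQ-style) implementations; a PyTorch sketch under that assumption, not the patent's own code:

```python
import torch

def grad_scale(x, scale):
    """Identity in the forward pass; multiplies the gradient by `scale`."""
    return (x - x * scale).detach() + x * scale

def round_pass(x):
    """round() in the forward pass; straight-through (identity) gradient."""
    return (x.round() - x).detach() + x

def lsq_fake_quant(w, alpha, q_min, q_max):
    g = 1.0 / (w.numel() * q_max) ** 0.5        # g = 1 / sqrt(N_w * q_max)
    alpha = grad_scale(alpha, g)                # scale the step-size gradient
    q = torch.clamp(round_pass(w / alpha), -q_min, q_max)
    return alpha * q                            # dequantized output

w = torch.randn(128, 128)
alpha = torch.tensor(0.1, requires_grad=True)
out = lsq_fake_quant(w, alpha, q_min=128, q_max=127)   # 8-bit signed
out.sum().backward()
print(alpha.grad)   # gradient reaches the step size despite the round()
```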
In a further embodiment, the quantization parameter initialization based on minimizing the mean square error specifically comprises the following procedure:
for a layer with only the micro-quantization step size α and no bias, initialization solves

$$ \alpha^{*} = \arg\min_{\alpha} \lVert w - \alpha \cdot q \rVert_F^2 $$

assuming α is known, q is solved as $q = \mathrm{clip}(\mathrm{round}(w/\alpha), -q_{\min}, q_{\max})$;
assuming q is known, α is solved in closed form as $\alpha^{*} = \frac{w^{\top} q}{q^{\top} q}$;
q and α are solved alternately in this way until α converges; the converged value is taken as the initial value of α, which is then updated by gradient descent.
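A NumPy sketch of this alternation; the update $\alpha = w^{\top}q / q^{\top}q$ is the least-squares solution for fixed q, and the starting value below is an assumption for illustration:

```python
import numpy as np

def init_alpha_mse(w, q_min, q_max, iters=100, tol=1e-7):
    """Alternately solve q and alpha to minimize ||w - alpha * q||^2."""
    alpha = 2.0 * np.abs(w).mean() / np.sqrt(q_max)      # rough starting point
    for _ in range(iters):
        q = np.clip(np.round(w / alpha), -q_min, q_max)  # alpha fixed, solve q
        new_alpha = float((w * q).sum() / ((q * q).sum() + 1e-12))  # q fixed
        if abs(new_alpha - alpha) < tol:
            break
        alpha = new_alpha
    return alpha

w = np.random.randn(4096)
print(init_alpha_mse(w, q_min=8, q_max=7))   # 4-bit signed layer
```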
In a further embodiment, for a layer with an additional bias β, initialization solves

$$ \alpha^{*}, \beta^{*} = \arg\min_{\alpha, \beta} \lVert w - (\alpha \cdot q + \beta) \rVert_F^2 $$

with q fixed, the least-squares solutions are $\alpha^{*} = \frac{(w - \beta)^{\top} q}{q^{\top} q}$ and

$$ \beta^{*} = E(w - \alpha \cdot q) $$

where E(z) denotes the mean of all elements of the vector z; q, α, and β are solved alternately and iteratively until α and β converge; the converged values are taken as the initial values of α and β, which are then updated by gradient descent.
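The bias-enabled variant alternates over all three unknowns; a sketch under the same assumptions, here applied to an unsigned range as would be used after an activation:

```python
import numpy as np

def init_alpha_beta_mse(x, q_min, q_max, iters=100, tol=1e-7):
    """Alternately solve q, alpha, beta to minimize ||x - (alpha*q + beta)||^2."""
    alpha = 2.0 * np.abs(x).mean() / np.sqrt(q_max)
    beta = float(x.mean())
    for _ in range(iters):
        q = np.clip(np.round((x - beta) / alpha), q_min, q_max)
        new_alpha = float(((x - beta) * q).sum() / ((q * q).sum() + 1e-12))
        new_beta = float((x - new_alpha * q).mean())   # beta* = E(x - alpha*q)
        if abs(new_alpha - alpha) < tol and abs(new_beta - beta) < tol:
            break
        alpha, beta = new_alpha, new_beta
    return alpha, beta

x = np.random.randn(4096) * 0.5 - 0.1   # stand-in for GELU activations
print(init_alpha_beta_mse(x, q_min=0, q_max=15))   # 4-bit unsigned layer
```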
A visual Transformer compression system based on micro-quantization training comprises:
a quantization processing layer configured to partition an input picture into blocks and convert it into a corresponding picture sequence through linear mapping;
a self-attention layer configured to perform global information quantization processing on the picture sequence;
a feedforward layer configured to perform local information quantization processing on the picture sequence, the feedforward layer comprising an activation layer, wherein the self-attention layer and the feedforward layer alternate M times;
a classification processing layer configured to classify the compressed picture sequence and output a predicted probability value;
the system further comprises: a micro-quantization step size training module embedded in turn in the quantization processing layer, the self-attention layer, the feedforward layer, and the classification processing layer, the module configured to improve how well each quantization step size matches the image data;
and a micro-quantization bias training module embedded in the activation layer, the module configured to learn an optimal quantization interval automatically and preserve the information in the negative activation region.
In a further embodiment, the system further comprises a quantization parameter initialization module connected to both the micro-quantization step size training module and the micro-quantization bias training module.
The invention has the following beneficial effects: a micro-quantization step size training method is introduced into the compression process so that the quantizer step size better matches the distribution of the data, greatly reducing the quantization error. At the same time, a micro-quantization bias training method is introduced when quantizing local information, so that the information in the negative activation region is preserved. When the micro-quantization step size training method and the micro-quantization bias training method are run, quantization parameter initialization based on minimizing the mean square error is used, which ensures the convergence speed of the model and avoids degrading the performance of the quantized model through slow convergence.
Drawings
Fig. 1 is a diagram of self-attention layer quantization.
Fig. 2 is an activation comparison diagram.
Detailed Description
The invention is further described with reference to the drawings and the specific embodiments in the following description.
Example 1
This embodiment discloses a visual Transformer compression method based on micro-quantization training, comprising the following steps:
step one, partitioning an input picture into blocks and converting it into a corresponding picture sequence through linear mapping; in this embodiment, performance close to the full-precision visual Transformer model can be achieved with either 8-bit quantization (4x compression) or 4-bit quantization (8x compression).
step two, subjecting the picture sequence to M rounds of alternating quantized processing of global information and local information to obtain a compressed picture sequence, where M is an integer; the M rounds of alternating quantization improve the quality of the quantized model while still guaranteeing fast compression. M is 12 in this embodiment.
step three, classifying the compressed picture sequence and outputting a predicted probability value; in this embodiment, 8-bit quantization (4x compression) may be used.
A micro-quantization step size training method is introduced throughout steps one to three, improving how well each quantization step size matches the image data; in other words, the micro-quantization step size training method is applied every time a quantization operation is performed.
At the same time, a micro-quantization bias training method is introduced when local information is quantized, so that an optimal quantization interval is learned automatically and the information in the negative activation region is preserved; in other words, the micro-quantization bias training method is applied every time the activation layer is executed.
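Where the two quantizers sit inside one of the M alternating units can be sketched as follows. A standard DeiT-style block is assumed; only the bias quantizer after the GELU is shown explicitly, the STE is omitted, and all module and parameter names are illustrative rather than taken from the patent:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def fq(t, alpha, lo, hi, beta=0.0):
    """Fake-quantize (forward math only; the method's STE is omitted here)."""
    return alpha * torch.clamp(torch.round((t - beta) / alpha), lo, hi) + beta

class QuantViTBlock(nn.Module):
    """One of the M alternating units: self-attention carries the global
    information, the feed-forward layer the local information."""
    def __init__(self, dim=192, heads=3):
        super().__init__()
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.fc1, self.fc2 = nn.Linear(dim, 4 * dim), nn.Linear(4 * dim, dim)
        self.alpha = nn.Parameter(torch.tensor(0.05))  # learned step size
        self.beta = nn.Parameter(torch.tensor(-0.1))   # learned bias after GELU

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h)[0]             # global information
        h = F.gelu(self.fc1(self.norm2(x)))       # local information
        h = fq(h, self.alpha, 0, 255, self.beta)  # 8-bit unsigned + bias keeps
        return x + self.fc2(h)                    # the negative GELU region

x = torch.randn(2, 197, 192)   # (batch, tokens, dim), DeiT-Tiny-like shapes
print(QuantViTBlock()(x).shape)
```

In the full method every weight matrix and intermediate feature in the block is wrapped the same way, each with its own learned step size.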
In a further embodiment, the micro-quantization step size training method applies to both image feature quantization and image weight quantization; that is, the quantization strategies for the features and the weights of an image are the same. Taking weight quantization as an example, define the full-precision weight as w and the quantized fixed-point weight as q; the quantization operation is expressed as:

$$ q = \mathrm{clip}\left(\mathrm{round}\left(\frac{w}{\alpha}\right), -q_{\min}, q_{\max}\right) $$

where clip(z, a, b) sets elements of the matrix z smaller than a to a and elements larger than b to b; the round operation denotes rounding to the nearest integer; α denotes the differentiable quantization step size; and $-q_{\min}$ and $q_{\max}$ denote the minimum and maximum of the quantization range, respectively;
the floating-point weight corresponding to the fixed-point weight q is recovered through the dequantization operation:

$$ \hat{w} = \alpha \cdot q $$
In another embodiment, using the GELU activation function in the local information quantization process improves the performance of the model. Compared with the ReLU activation function, GELU introduces negative activation values. That is, as shown in Fig. 2, the GELU activation layer cannot directly use unsigned quantization the way ReLU can, since doing so would lose the information contained in the negative activation values.
Therefore, to solve this problem, the micro-quantization bias training method is introduced for local information quantization; the method comprises the following procedure:
define the full-precision weight as w and the quantized fixed-point weight as q; the quantization operation is expressed as:

$$ q = \mathrm{clip}\left(\mathrm{round}\left(\frac{w - \beta}{\alpha}\right), -q_{\min}, q_{\max}\right) $$

where clip(z, a, b) sets elements of the matrix z smaller than a to a and elements larger than b to b; the round operation denotes rounding to the nearest integer; α denotes the differentiable quantization step size; and $-q_{\min}$ and $q_{\max}$ denote the minimum and maximum of the quantization range, respectively;
the floating-point weight corresponding to the fixed-point weight q is recovered through the dequantization operation:

$$ \hat{w} = \alpha \cdot q + \beta $$

where β is the introduced differentiable bias.
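A small numeric illustration of why the bias matters for GELU outputs; α and β are set by hand here purely for illustration, whereas the method learns both:

```python
import numpy as np

def gelu(x):   # tanh approximation of GELU
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x ** 3)))

a = gelu(np.linspace(-3.0, 3.0, 7))
alpha = a.max() / 15                      # 4-bit unsigned, q_max = 15

relu_style = alpha * np.clip(np.round(a / alpha), 0, 15)
print(relu_style)                         # every negative activation becomes 0

beta = a.min()                            # shift so negatives fit in [0, 15]
with_bias = alpha * np.clip(np.round((a - beta) / alpha), 0, 15) + beta
print(with_bias)                          # negative region survives quantization
```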
When the micro-quantization step size training method and the micro-quantization bias training method perform the dequantization operation, the round operation suffers from vanishing gradients during backpropagation, so the STE (straight-through estimator) is used to handle the gradient; that is, the round operation is ignored during backpropagation. When α is updated, its gradient is additionally scaled by the factor

$$ g = \frac{1}{\sqrt{N_w \cdot q_{\max}}} $$

where $N_w$ is the number of elements of the full-precision weight w.
This technique prevents α from changing so drastically that the model fails to converge. Compared with traditional fixed-step-size quantization training, the micro-quantization step size training method introduced in this embodiment therefore lets the quantizer step size better match the distribution of the data, greatly reducing the quantization error.
In a further embodiment, the values of $-q_{\min}$ and $q_{\max}$ are as follows: given a quantization bit-width b,
for signed quantization, $q_{\min} = 2^{b-1}$ and $q_{\max} = 2^{b-1} - 1$;
for unsigned quantization, $q_{\min} = 0$ and $q_{\max} = 2^{b} - 1$.
In another embodiment, although the differentiable quantization step size and bias are learnable parameters, choosing an appropriate quantization parameter initialization is still important: a poorly chosen initialization slows model convergence and hurts the performance of the model obtained by quantization training.
Therefore, when the micro-quantization step size training method and/or the micro-quantization bias training method is performed, quantization parameter initialization based on minimizing the mean square error is also included, specifically comprising the following procedure:
for a layer with only the micro-quantization step size α and no bias, initialization solves

$$ \alpha^{*} = \arg\min_{\alpha} \lVert w - \alpha \cdot q \rVert_F^2 $$

assuming α is known, q is solved as $q = \mathrm{clip}(\mathrm{round}(w/\alpha), -q_{\min}, q_{\max})$;
assuming q is known, α is solved in closed form as $\alpha^{*} = \frac{w^{\top} q}{q^{\top} q}$;
q and α are solved alternately in this way until α converges; the converged value is taken as the initial value of α, which is then updated by gradient descent.
Similarly, in the dequantization operation the gradient is handled with the straight-through estimator: when α is updated, its gradient is additionally scaled by the factor $g = 1/\sqrt{N_w \cdot q_{\max}}$, where $N_w$ is the number of elements of the full-precision weight w.
Based on this method, tests were carried out on DeiT-Tiny and DeiT-Small with the ImageNet 2012 dataset; accuracy is the Top-1 result on its validation set. Table 1-1 shows the compression ratios and Top-1 classification accuracy of the different models at different bit-widths, where FP32 denotes the model represented in 32-bit floating point, i.e., the full-precision model, and Int8 and Int4 denote the 8-bit and 4-bit quantized models, respectively. For 8-bit quantization, fine-tuning for only 1 epoch suffices, whereas 4-bit quantization requires fine-tuning for 300 epochs. As the table shows, the accuracy loss of the quantized model is within 0.5% for both Int8 and Int4.
[Table 1-1 is reproduced only as an image in the original; it lists compression ratios and Top-1 accuracy for DeiT-Tiny and DeiT-Small at FP32, Int8, and Int4.]
Table 1-1. Visual Transformer quantization training experiment results
Example 2
To implement the visual Transformer compression method described in Embodiment 1, this embodiment discloses a visual Transformer compression system based on micro-quantization training, comprising:
a quantization processing layer configured to partition an input picture into blocks and convert it into a corresponding picture sequence through linear mapping; in this embodiment, performance close to the full-precision visual Transformer model can be achieved with either 8-bit quantization (4x compression) or 4-bit quantization (8x compression).
a self-attention layer configured to perform global information quantization processing on the picture sequence; by quantizing the weights and features in the self-attention layer into fixed-point numbers, all operations in the self-attention layer are realized as fixed-point matrix multiplications. For the attention weights (attention scores), this embodiment uses unsigned quantization because their values are always positive; all other weights and features use signed quantization. The scheme is shown in Fig. 1, and the symbols used in Fig. 1 are explained in Table 2; a sketch of the quantized self-attention computation follows Table 2.
Table 2. Interpretation of the symbols in Fig. 1 [reproduced only as an image in the original; contents not recoverable from the text]
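Since Fig. 1 and Table 2 survive only as images, the following single-head sketch approximates what the figure depicts: signed 8-bit quantization for weights and features, unsigned quantization for the attention scores. One shared step size `a` is used for brevity, whereas the method learns a separate step size per quantizer; all names are illustrative.

```python
import torch
import torch.nn.functional as F

def fq(t, alpha, lo, hi):
    """Fake-quantize (forward math only)."""
    return alpha * torch.clamp(torch.round(t / alpha), lo, hi)

def quant_self_attention(x, w_qkv, w_out, a=0.02):
    d = x.shape[-1]
    qkv = fq(x, a, -128, 127) @ fq(w_qkv, a, -128, 127)   # signed 8-bit matmul
    q, k, v = qkv.chunk(3, dim=-1)
    scores = F.softmax(q @ k.transpose(-2, -1) / d ** 0.5, dim=-1)
    scores = fq(scores, 1 / 255, 0, 255)                  # unsigned: scores > 0
    out = fq(scores @ fq(v, a, -128, 127), a, -128, 127)  # signed features
    return out @ fq(w_out, a, -128, 127)

x = torch.randn(1, 16, 64)                                # (batch, tokens, dim)
w_qkv, w_out = torch.randn(64, 192), torch.randn(64, 64)
print(quant_self_attention(x, w_qkv, w_out).shape)        # torch.Size([1, 16, 64])
```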
a feedforward layer configured to perform local information quantization processing on the picture sequence, the feedforward layer comprising an activation layer, wherein the self-attention layer and the feedforward layer alternate M times;
a classification processing layer configured to classify the compressed picture sequence and output a predicted probability value; in this embodiment, 8-bit quantization (4x compression) may be used.
The system further comprises: a micro-quantization step size training module embedded in turn in the quantization processing layer, the self-attention layer, the feedforward layer, and the classification processing layer, the module configured to improve how well each quantization step size matches the image data;
and a micro-quantization bias training module embedded in the activation layer, the module configured to learn an optimal quantization interval automatically and preserve the information in the negative activation region.

Claims (10)

1. A visual Transformer compression method based on micro-quantization training, characterized by comprising the following steps:
step one, partitioning an input picture into blocks and converting it into a corresponding picture sequence through linear mapping;
step two, subjecting the picture sequence to M rounds of alternating quantized processing of global information and local information to obtain a compressed picture sequence, where M is an integer;
step three, classifying the compressed picture sequence and outputting a predicted probability value; a micro-quantization step size training method is introduced throughout steps one to three, improving how well each quantization step size matches the image data; at the same time, in step two, a micro-quantization bias training method is introduced when local information is quantized, so that an optimal quantization interval is learned automatically and the information in the negative activation region is preserved.
2. The visual Transformer compression method based on micro-quantization training of claim 1, characterized by further comprising quantization parameter initialization based on minimizing the mean square error when the micro-quantization step size training method and/or the micro-quantization bias training method is performed.
3. The visual Transformer compression method based on micro-quantization training of claim 1, characterized in that the micro-quantization step size training method applies to both image feature quantization and image weight quantization;
the micro-quantization step size training method comprises the following procedure:
define the full-precision weight as w and the quantized fixed-point weight as q; the quantization operation is expressed as:

$$ q = \mathrm{clip}\left(\mathrm{round}\left(\frac{w}{\alpha}\right), -q_{\min}, q_{\max}\right) $$

where clip(z, a, b) sets elements of the matrix z smaller than a to a and elements larger than b to b; the round operation denotes rounding to the nearest integer; α denotes the differentiable quantization step size; and $-q_{\min}$ and $q_{\max}$ denote the minimum and maximum of the quantization range, respectively;
the floating-point weight corresponding to the fixed-point weight q is recovered through the dequantization operation:

$$ \hat{w} = \alpha \cdot q $$
4. The visual Transformer compression method based on micro-quantization training of claim 1, characterized in that the micro-quantization bias training method comprises the following procedure:
define the full-precision weight as w and the quantized fixed-point weight as q; the quantization operation is expressed as:

$$ q = \mathrm{clip}\left(\mathrm{round}\left(\frac{w - \beta}{\alpha}\right), -q_{\min}, q_{\max}\right) $$

where clip(z, a, b) sets elements of the matrix z smaller than a to a and elements larger than b to b; the round operation denotes rounding to the nearest integer; α denotes the differentiable quantization step size; and $-q_{\min}$ and $q_{\max}$ denote the minimum and maximum of the quantization range, respectively;
the floating-point weight corresponding to the fixed-point weight q is recovered through the dequantization operation:

$$ \hat{w} = \alpha \cdot q + \beta $$

where β is the introduced differentiable bias.
5. The visual Transformer compression method based on micro-quantization training of claim 3 or 4, characterized in that the values of $-q_{\min}$ and $q_{\max}$ are as follows: given a quantization bit-width b,
for signed quantization, $q_{\min} = 2^{b-1}$ and $q_{\max} = 2^{b-1} - 1$;
for unsigned quantization, $q_{\min} = 0$ and $q_{\max} = 2^{b} - 1$.
6. The visual Transformer compression method based on micro-quantization training of claim 3 or 4, characterized in that,
in the dequantization operation, the gradient is handled with a straight-through estimator: when α is updated, its gradient is additionally scaled by the factor

$$ g = \frac{1}{\sqrt{N_w \cdot q_{\max}}} $$

where $N_w$ is the number of elements of the full-precision weight w.
7. The visual Transformer compression method based on micro-quantization training of claim 2, characterized in that the quantization parameter initialization based on minimizing the mean square error specifically comprises the following procedure:
for a layer with only the micro-quantization step size α and no bias, initialization solves

$$ \alpha^{*} = \arg\min_{\alpha} \lVert w - \alpha \cdot q \rVert_F^2 $$

assuming α is known, q is solved as $q = \mathrm{clip}(\mathrm{round}(w/\alpha), -q_{\min}, q_{\max})$;
assuming q is known, α is solved in closed form as $\alpha^{*} = \frac{w^{\top} q}{q^{\top} q}$;
q and α are solved alternately in this way until α converges; the converged value is taken as the initial value of α, which is then updated by gradient descent.
8. The visual Transformer compression method based on micro-quantization training of claim 2, characterized in that, for a layer with an additional bias β, initialization solves

$$ \alpha^{*}, \beta^{*} = \arg\min_{\alpha, \beta} \lVert w - (\alpha \cdot q + \beta) \rVert_F^2 $$

with q fixed, the least-squares solutions are $\alpha^{*} = \frac{(w - \beta)^{\top} q}{q^{\top} q}$ and

$$ \beta^{*} = E(w - \alpha \cdot q) $$

where E(z) denotes the mean of all elements of the vector z; q, α, and β are solved alternately and iteratively until α and β converge; the converged values are taken as the initial values of α and β, which are then updated by gradient descent.
9. A visual Transformer compression system based on micro-quantization training, characterized by comprising:
a quantization processing layer configured to partition an input picture into blocks and convert it into a corresponding picture sequence through linear mapping;
a self-attention layer configured to perform global information quantization processing on the picture sequence;
a feedforward layer configured to perform local information quantization processing on the picture sequence, the feedforward layer comprising an activation layer, wherein the self-attention layer and the feedforward layer alternate M times;
a classification processing layer configured to classify the compressed picture sequence and output a predicted probability value;
further comprising: a micro-quantization step size training module embedded in turn in the quantization processing layer, the self-attention layer, the feedforward layer, and the classification processing layer, the module configured to improve how well each quantization step size matches the image data;
and a micro-quantization bias training module embedded in the activation layer, the module configured to learn an optimal quantization interval automatically and preserve the information in the negative activation region.
10. The visual Transformer compression system based on micro-quantization training of claim 9, characterized by further comprising: a quantization parameter initialization module connected to both the micro-quantization step size training module and the micro-quantization bias training module.
CN202210295189.6A 2022-03-24 2022-03-24 Visual Transformer compression method and system based on micro-quantization training Pending CN114756517A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210295189.6A CN114756517A (en) 2022-03-24 2022-03-24 Visual Transformer compression method and system based on micro-quantization training

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210295189.6A CN114756517A (en) 2022-03-24 2022-03-24 Visual Transformer compression method and system based on micro-quantization training

Publications (1)

Publication Number Publication Date
CN114756517A 2022-07-15

Family

ID=82327804

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210295189.6A Pending CN114756517A (en) 2022-03-24 2022-03-24 Visual Transformer compression method and system based on micro-quantization training

Country Status (1)

Country Link
CN (1) CN114756517A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116152117A (en) * 2023-04-18 2023-05-23 煤炭科学研究总院有限公司 Underground low-light image enhancement method based on Transformer
CN117689044A (en) * 2024-02-01 2024-03-12 厦门大学 Quantification method suitable for vision self-attention model


Similar Documents

Publication Title
CN108510067B (en) Convolutional neural network quantification method based on engineering realization
CN111079781B (en) Lightweight convolutional neural network image recognition method based on low rank and sparse decomposition
US20170270408A1 (en) Method and System for Bit-Depth Reduction in Artificial Neural Networks
CN114756517A (en) Visual Transformer compression method and system based on micro-quantization training
CN113011571B (en) INT8 offline quantization and integer inference method based on Transformer model
CN107516129A (en) The depth Web compression method decomposed based on the adaptive Tucker of dimension
CN114402596B (en) Neural network model decoding method, device, system and medium
CN111160524A (en) Two-stage convolutional neural network model compression method
CN110619392A (en) Deep neural network compression method for embedded mobile equipment
CN115238893B (en) Neural network model quantification method and device for natural language processing
CN111310888A (en) Method for processing convolutional neural network
CN113918882A (en) Data processing acceleration method of dynamic sparse attention mechanism capable of being realized by hardware
CN114139683A (en) Neural network accelerator model quantization method
CN113610227A (en) Efficient deep convolutional neural network pruning method
CN118153715A (en) Large language model fine tuning method based on low-rank matrix decomposition
CN113204640B (en) Text classification method based on attention mechanism
Gray et al. Vector quantization and density estimation
CN112989843B (en) Intention recognition method, device, computing equipment and storage medium
CN117725435A (en) Multi-mode large model adaptation method and storage medium
CN112257466A (en) Model compression method applied to small machine translation equipment
CN112418388A (en) Method and device for realizing deep convolutional neural network processing
CN115860062A (en) Neural network quantization method and device suitable for FPGA
CN114065913A (en) Model quantization method and device and terminal equipment
CN115965062A (en) FPGA (field programmable Gate array) acceleration method for BERT (binary offset Transmission) middle-layer normalized nonlinear function
CN111985613B (en) Normalization method of convolutional neural network circuit based on L1 norm group normalization

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination