CN117689044A - Quantization method suitable for vision self-attention model - Google Patents

Quantization method suitable for vision self-attention model

Info

Publication number
CN117689044A
Application number
CN202410142459.9A
Authority
CN
Other languages
Chinese (zh)
Prior art keywords
quantization, quantizer, layer, log2, uniform
Legal status
Pending
Filing date
2024-02-01
Publication date
2024-03-12
Inventors
纪荣嵘, 胡佳伟, 钟云山, 林明宝, 陈锰钊
Current Assignee
Xiamen University
Application filed by Xiamen University

Abstract

The invention provides a quantization method suitable for visual self-attention models (ViTs) and relates to the compression and acceleration of artificial neural networks. A shift-uniform-log2 quantizer is proposed, which introduces an initial shift bias on the log2 function input and then uniformly quantizes the output. A three-stage smooth optimization strategy is also provided, which fully exploits a smooth, low-magnitude loss landscape for optimization while maintaining the efficiency of layer-wise activation quantization. The method of the invention is simple in concept, saves computational cost, and greatly improves performance at extremely low bit widths; simply by applying the quantizer designed by the invention, a quantized model can be obtained directly in a post-training manner with better performance.

Description

Quantization method suitable for vision self-attention model
Technical Field
The invention relates to the compression and acceleration of artificial neural networks, in particular to a quantization method suitable for visual self-attention models (ViTs).
Background
In the evolving field of computer vision, the recently emerged Vision Transformer (ViT) stands out as an excellent architecture for capturing long-distance relationships between image patches with its multi-head self-attention mechanism (MHSA). However, as the number of image patches n increases, the MHSA operation incurs O(n^2) computational complexity, an unacceptable overhead. To enable better practical application of ViT-series models, a model compression method for the ViT series is designed and proposed.
To accommodate structures unique to visual recognition models, such as LayerNorm and the self-attention mechanism, current work on post-training quantization (PTQ) of ViTs typically introduces specialized quantizers and quantization schemes to preserve the original performance of the ViTs. For example, FQ-ViT and PTQ4ViT introduce a log2 quantizer and a twin uniform quantizer, respectively, for post-Softmax activations, while RepQ-ViT first applies a channel-wise quantizer to the post-LayerNorm activations, whose distribution has large inter-channel variance, and then reparameterizes it into a layer-wise quantizer. In the 4-bit case, RepQ-ViT suffers a 10.82% accuracy drop on ImageNet relative to full-precision DeiT-S; in the 3-bit case the drop is far more pronounced, reaching 74.48%. Recently, optimization-based PTQ methods have shown their potential in quantizing convolutional neural networks (CNNs). However, their application to Vision Transformers remains underexplored; as shown in FIG. 4, they tend to overfit at higher bit widths and suffer significant performance degradation at ultra-low bit widths, limiting their use for the ViT architecture.
In view of this, the present application proposes a quantization method for the visual self-attention model that maintains high performance even at very low bit widths.
Disclosure of Invention
The technical problem the invention aims to solve is to provide a quantization method suitable for visual self-attention models (ViTs): when current ViTs undergo post-training quantization, applying the quantizer designed by the invention allows a quantized model to be obtained directly in a post-training manner while retaining high performance at very low bit widths.
The invention provides a quantization method suitable for a visual self-attention model, which comprises the following steps:
in the initial stage, the model is fine-tuned while the weights are kept at full precision; a channel-wise quantizer is used for the post-LayerNorm activations, a shift-uniform-log2 quantizer is used for the post-Softmax activations, and layer-wise quantizers are used for the other activations;
in the second stage, the channel-wise quantizer is smoothly transitioned to its corresponding layer-wise form using the scale reparameterization technique, so that the post-LayerNorm activations switch from the channel-wise quantizer to a layer-wise quantizer;
in the third stage, the model is fine-tuned using the loss function while both activations and weights are quantized, wherein the post-Softmax activations use the shift-uniform-log2 quantizer and the other activations use layer-wise quantizers.
Further, the shift-uniform-log2 quantizer is as follows: an initial shift bias is introduced on the log2 function input, and the output is then uniformly quantized. The design is specifically:
a shift bias η is introduced before the full-precision activation input is fed to the log2 transformation, and the result is then processed with a uniform quantizer, wherein the quantization process formula is:

$$\bar{x} = \mathrm{quant}\big(\log_2(x + \eta)\big)$$

the inverse quantization process formula is:

$$\hat{x} = 2^{\,\mathrm{dequant}(\bar{x})} - \eta$$

wherein x is the activation value input, x̂ is the de-quantized result, x̄ is the quantized integer value, and quant(·), dequant(·) respectively denote the quantization and inverse quantization calculation processes of uniform quantization, as follows:

$$\mathrm{quant}(y) = \mathrm{clip}\Big(\Big\lfloor \frac{y}{s} \Big\rceil + z,\ 0,\ 2^{b}-1\Big), \qquad \mathrm{dequant}(\bar{x}) = s\,(\bar{x} - z)$$

wherein b represents the bit width, s represents the quantization scale, and z represents the zero point;
further, the quantization of the channel level is smoothly transited to the corresponding hierarchical form by using the scale re-parameterization technology, and the parameters are calculated by adopting the following formula:
wherein the method comprises the steps of,/>Is a parameter of the original LayerNorm layer, < >>,/>Is La after scale heavy parameterParameters of the yerNorm layer, +.>,/>Is a scale heavy parameter calculation parameter, < >> Is the original weight parameter,/->,/>Is the post-scale weight parameter.
Further, the loss function is:

$$\mathcal{L} = \left\| \mathbf{O}_l^{fp} - \mathbf{O}_l^{q} \right\|_2^{2}$$

wherein O_l^{fp} represents the output of the l-th module of the full-precision visual self-attention model, and O_l^{q} represents the output of the l-th module of the quantized visual self-attention model.
The invention has the following technical effects or advantages:
1. The present invention proposes a shift-uniform-log2 quantizer (SULQ) that achieves full coverage of the input domain and an accurate approximation of its distribution by introducing a shift bias before the log2 transformation and then uniformly quantizing its output;
2. The invention provides a three-stage smooth optimization strategy (SOS) that fully exploits a smooth, low-magnitude loss landscape for optimization while maintaining the efficiency of layer-wise activation quantization;
3. The method is simple and easy to implement, saves computational cost, and greatly improves performance, exceeding various mainstream post-training quantization methods; the advantage is more pronounced at lower bit widths.
Drawings
The invention will be further described with reference to example embodiments and the accompanying drawings.
FIG. 1 is a flowchart illustrating a quantization method for a visual self-attention model according to a first embodiment of the present invention;
FIG. 2 is a schematic block diagram of the principle logic of a conventional quantizer;
FIG. 3 is a schematic block diagram of the principle logic of the shift uniform log2 quantizer of the present invention;
FIG. 4 is a first diagram comparing the effect of an embodiment of the present application with other methods;
FIG. 5 is a second diagram comparing the effect of an embodiment of the present application with other methods.
Detailed Description
The invention provides a quantization method suitable for the visual self-attention model, adopting the following technical scheme: the invention proposes a shift-uniform-log2 quantizer (SULQ), which introduces an initial shift bias on the log2 function input and then uniformly quantizes the output; meanwhile, a three-stage smooth optimization strategy (SOS) is provided, which fully exploits a smooth, low-magnitude loss landscape for optimization while maintaining the efficiency of layer-wise activation quantization.
In order to better understand the technical scheme of the present invention, the following detailed description will be given with reference to the accompanying drawings and specific embodiments.
Referring to fig. 1, an embodiment of the present invention provides a quantization method applicable to a visual self-attention model, the method including:
in the initial stage, the model is fine-tuned while the weights are kept at full precision; a channel-wise quantizer is used for the post-LayerNorm activations, a shift-uniform-log2 quantizer is used for the post-Softmax activations, and layer-wise quantizers are used for the other activations;
in the second stage, the channel-wise quantizer is smoothly transitioned to its corresponding layer-wise form using the scale reparameterization technique, so that the post-LayerNorm activations switch from the channel-wise quantizer to a layer-wise quantizer;
in the third stage, the model is fine-tuned using the loss function while both activations and weights are quantized, wherein the post-Softmax activations use the shift-uniform-log2 quantizer and the other activations use layer-wise quantizers.
Preferably, the shift-uniform-log2 quantizer is as follows: an initial shift bias is introduced on the log2 function input, and the output is then uniformly quantized. The design is specifically:
a shift bias η is introduced before the full-precision activation input is fed to the log2 transformation, and the result is then processed with a uniform quantizer, wherein the quantization process formula is:

$$\bar{x} = \mathrm{quant}\big(\log_2(x + \eta)\big)$$

the inverse quantization process formula is:

$$\hat{x} = 2^{\,\mathrm{dequant}(\bar{x})} - \eta$$

wherein x is the activation value input, x̂ is the de-quantized result, x̄ is the quantized integer value, and quant(·), dequant(·) respectively denote the quantization and inverse quantization calculation processes of uniform quantization, as follows:

$$\mathrm{quant}(y) = \mathrm{clip}\Big(\Big\lfloor \frac{y}{s} \Big\rceil + z,\ 0,\ 2^{b}-1\Big), \qquad \mathrm{dequant}(\bar{x}) = s\,(\bar{x} - z)$$

wherein b represents the bit width, s represents the quantization scale, and z represents the zero point;
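For illustration only, the following is a minimal PyTorch sketch of the shift-uniform-log2 quantizer described above; the function names (sulq_quant, sulq_dequant) and the simple min/max calibration of the uniform scale are assumptions made for this example and are not prescribed by the invention.

```python
import torch

def uniform_params(y: torch.Tensor, bits: int):
    # Assumed min/max calibration of the uniform quantizer in the log2 domain.
    s = (y.max() - y.min()) / (2 ** bits - 1)        # quantization scale s
    z = torch.round(-y.min() / s)                    # zero point z
    return s, z

def sulq_quant(x: torch.Tensor, eta: float, bits: int):
    """Quantize: shift by eta, take log2, then apply uniform quantization."""
    y = torch.log2(x + eta)                          # shifted log2 transform
    s, z = uniform_params(y, bits)
    x_int = torch.clamp(torch.round(y / s) + z, 0, 2 ** bits - 1)
    return x_int, s, z

def sulq_dequant(x_int: torch.Tensor, s, z, eta: float):
    """Dequantize: uniform dequantization, then invert the shifted log2."""
    return 2 ** (s * (x_int - z)) - eta

# Usage on a post-Softmax activation (values in (0, 1]).
attn = torch.softmax(torch.randn(4, 197, 197), dim=-1)
x_int, s, z = sulq_quant(attn, eta=1e-2, bits=3)
x_hat = sulq_dequant(x_int, s, z, eta=1e-2)
print((attn - x_hat).abs().mean())
```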
preferably, the quantization of the channel level is smoothly transited to the hierarchical form corresponding to the quantization by using the scale re-parameterization technique, and the parameters are calculated by adopting the following formula:
wherein the method comprises the steps of,/>Is a parameter of the original LayerNorm layer, < >>,/>Is a parameter of LayerNorm layer after scale heavy parameter, +.>,/>Is a scale heavy parameter calculation parameter, < >> Is the original weight parameter,/->,/>Is the post-scale weight parameter.
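As an illustration of the scale reparameterization step, the following sketch absorbs the channel-wise quantization parameters into the LayerNorm affine parameters and the following linear layer, consistent with the formulas above; the function name and the use of the mean as the layer-wise scale and zero point are assumptions made for this example.

```python
import torch

@torch.no_grad()
def reparameterize(ln: torch.nn.LayerNorm, fc: torch.nn.Linear,
                   s: torch.Tensor, z: torch.Tensor):
    """Turn channel-wise activation quantization params (s, z) into layer-wise ones
    by rescaling the LayerNorm affine params and the next linear layer."""
    s_tilde, z_tilde = s.mean(), z.mean()        # assumed layer-wise scale / zero point
    r1 = s / s_tilde                             # per-channel variation factors
    r2 = z - z_tilde
    # LayerNorm: gamma~ = gamma / r1, beta~ = (beta + r2 * s) / r1
    ln.weight.div_(r1)
    ln.bias.add_(r2 * s).div_(r1)
    # Next linear layer: bias must be corrected with the original weight first,
    # b~ = b - W (s * r2); then W~[:, c] = r1[c] * W[:, c]
    fc.bias.sub_(fc.weight @ (s * r2))
    fc.weight.mul_(r1)                           # broadcasts over input channels
    return s_tilde, z_tilde
```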
Preferably, the loss function is:

$$\mathcal{L} = \left\| \mathbf{O}_l^{fp} - \mathbf{O}_l^{q} \right\|_2^{2}$$

wherein O_l^{fp} represents the output of the l-th module of the full-precision visual self-attention model, and O_l^{q} represents the output of the l-th module of the quantized visual self-attention model.
Some symbols in this application are explained below:
1. general structure of the ViTs:
Assume an image I is split by the embedding layer into N two-dimensional patches; the image is then represented as the token matrix

$$X \in \mathbb{R}^{N \times D}$$

The tokens X are then fed into L ViT blocks, each comprising a multi-head self-attention operation (MHSA) and a multi-layer perceptron (MLP). For the l-th block, the computation can be expressed as:

$$Z_l = X_{l-1} + \mathrm{MHSA}(\mathrm{LN}(X_{l-1})), \qquad X_l = Z_l + \mathrm{MLP}(\mathrm{LN}(Z_l))$$

where Z_l denotes the output of the MHSA branch and X_l the output of the MLP branch. MHSA consists of H self-attention heads. Let X̂ = LN(X_{l-1}). For the h-th self-attention head with input X̂, the computation is:

$$[Q_h, K_h, V_h] = \hat{X} W_h^{qkv} + b_h^{qkv}, \qquad A_h = \mathrm{Softmax}\!\left(\frac{Q_h K_h^{\top}}{\sqrt{D_h}}\right)$$

wherein A_h and [Q_h, K_h, V_h] are intermediate results, D_h denotes the dimension of a self-attention head, W_h^{qkv} denotes the QKV linear-layer weights, and b_h^{qkv} denotes the QKV linear-layer bias.

The output of the MHSA may be expressed as:

$$\mathrm{MHSA}(\hat{X}) = [A_1 V_1, A_2 V_2, \ldots, A_H V_H]\, W_0 + b_0$$

The MLP consists of two fully connected layers and an activation function (GELU); when the activation Y is fed into the MLP, the computation is:

$$\mathrm{MLP}(Y) = \mathrm{GELU}(Y W_1 + b_1)\, W_2 + b_2$$

wherein b_0, b_1, b_2, W_0, W_1, W_2 are weight parameters.
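For reference, a compact PyTorch sketch of one such ViT block is given below; the dimensions, module names, and layout are illustrative assumptions and not part of the invention.

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    def __init__(self, d=384, heads=6, mlp_ratio=4):
        super().__init__()
        self.ln1, self.ln2 = nn.LayerNorm(d), nn.LayerNorm(d)
        self.qkv = nn.Linear(d, 3 * d)           # W^qkv, b^qkv for all heads
        self.proj = nn.Linear(d, d)              # W_0, b_0
        self.fc1 = nn.Linear(d, mlp_ratio * d)   # W_1, b_1
        self.fc2 = nn.Linear(mlp_ratio * d, d)   # W_2, b_2
        self.h, self.dh = heads, d // heads

    def forward(self, x):                        # x: [B, N, D]
        B, N, D = x.shape
        q, k, v = self.qkv(self.ln1(x)).chunk(3, dim=-1)
        q, k, v = (t.view(B, N, self.h, self.dh).transpose(1, 2) for t in (q, k, v))
        a = torch.softmax(q @ k.transpose(-2, -1) / self.dh ** 0.5, dim=-1)   # A_h
        z = x + self.proj((a @ v).transpose(1, 2).reshape(B, N, D))           # Z_l
        return z + self.fc2(nn.functional.gelu(self.fc1(self.ln2(z))))        # X_l
```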
2. Analysis of existing log2 quantizer
FIG. 2 plots the relationship between the full-precision activation value x and its quantized counterpart x̂ when a uniform quantizer and a log2 quantizer are used. Compared with the uniform quantizer, the log2 quantizer allocates more quantization levels to the region near zero, which is advantageous for the long-tail distributions prevalent in post-Softmax activations. However, the log2 quantizer also has a serious quantization-efficiency problem. Consider a post-Softmax activation X whose inputs range over [1.08e-8, 0.868]: after rounding, -log2(X) spans from 0 up to 26. A 3-bit quantizer only covers [0, 7], so the rounded values in [8, 26] are all clamped to 7; with 4-bit quantization, the rounded values in [16, 26] are clamped to 15. A quantization-efficiency problem therefore arises, because many values are clamped to a single remote level, and post-Softmax activations contain a large number of values near zero. This inefficiency produces large quantization errors and thus degrades model performance.
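The clamping effect described above can be reproduced with a few lines of Python (the concrete range [1.08e-8, 0.868] follows the example above; the rest is illustrative):

```python
import math

x_min, x_max = 1.08e-8, 0.868
# Rounded -log2 span of the post-Softmax activations: 0 .. 26
print(round(-math.log2(x_max)), round(-math.log2(x_min)))

for bits in (3, 4):
    top = 2 ** bits - 1                           # 3 bit -> 7, 4 bit -> 15
    print(f"{bits}-bit log2 quantizer clamps rounded values in [{top + 1}, 26] to {top}")
```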
FIG. 3 compares the quantization positions produced by a 3-bit uniform quantizer, a log2 quantizer, and the shift-uniform-log2 quantizer (SULQ) of the present invention. Compared with the existing log2 quantizer, SULQ realizes different quantization effects depending on the shift value; by adjusting the shift value, different input distributions can be quantized more faithfully, so SULQ achieves a more flexible quantization than the existing log2 quantizer.
3. Description of the quantization method
The quantization method of the invention comprises the following steps:
1) A shift uniform log2 quantizer (SULQ) that introduces an initial shift bias on the log2 function input and then uniformly quantizes the output;
2) A three-stage smooth optimization strategy (SOS) that leverages a smooth, low-magnitude loss landscape for optimization while maintaining the efficiency of layer-wise activation quantization.
The existing log2 quantizer suffers from low quantization efficiency when processing post-Softmax activations, i.e., its quantization range cannot cover the whole input domain. To solve this problem, a shift-uniform-log2 quantizer (SULQ) is proposed herein, which achieves full coverage of the input domain and an accurate approximation of its distribution by introducing a shift bias before the log2 transformation and then uniformly quantizing its output, as shown in FIG. 3.
Specifically, in 1), the shift-uniform-log2 quantizer (SULQ) introduces a shift bias η before the full-precision input x is fed to the log2 transformation, and then processes the result with a uniform quantizer. Quantization process:

$$\bar{x} = \mathrm{quant}\big(\log_2(x + \eta)\big)$$

Inverse quantization process:

$$\hat{x} = 2^{\,\mathrm{dequant}(\bar{x})} - \eta$$

wherein quant(·) and dequant(·) denote the uniform quantization and inverse quantization calculation processes, which can be calculated as follows:

$$\mathrm{quant}(y) = \mathrm{clip}\Big(\Big\lfloor \frac{y}{s} \Big\rceil + z,\ 0,\ 2^{b}-1\Big), \qquad \mathrm{dequant}(\bar{x}) = s\,(\bar{x} - z)$$
in 2), the three-stage Smoothing Optimization Strategy (SOS) is specifically:
in the initial stage, the model is finely adjusted, and full-precision weight and LayerNorm based on a channel are used for quantization after activation, and other activation adopts a layer-by-layer quantizer; in the second stage, smoothly transitioning the quantizer of the channel level to a hierarchical form corresponding to the quantizer by skillfully utilizing a scale re-parameterization technique;
in the third stage, the model is finely tuned, and meanwhile, the activation and the weight are quantized, so that performance degradation caused by weight quantization is compensated, and performance of the quantized model is improved.
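The three stages can be summarized in the following sketch; the routine names, the optimizer choice, the hypothetical hooks quantize_weights and reparameterize_layernorm_quantizer, and the block-wise reconstruction loop are assumptions made purely for illustration.

```python
import torch

def reconstruct_block(fp_block, q_block, calib_loader, steps=1000, lr=1e-4):
    """Block-wise tuning: minimize || O_l^fp - O_l^q ||^2 on calibration data."""
    opt = torch.optim.Adam(q_block.parameters(), lr=lr)
    for _, x in zip(range(steps), calib_loader):
        with torch.no_grad():
            target = fp_block(x)                  # O_l^fp from the full-precision model
        loss = (target - q_block(x)).pow(2).mean()
        opt.zero_grad(); loss.backward(); opt.step()

def smooth_optimization(fp_blocks, q_blocks, calib_loader):
    # Stage 1: fine-tune with full-precision weights (channel-wise quantizer after
    # LayerNorm, SULQ after Softmax, layer-wise quantizers elsewhere).
    for fp_b, q_b in zip(fp_blocks, q_blocks):
        q_b.quantize_weights = False              # hypothetical flag
        reconstruct_block(fp_b, q_b, calib_loader)
    # Stage 2: fold channel-wise params into layer-wise ones (scale reparameterization).
    for q_b in q_blocks:
        q_b.reparameterize_layernorm_quantizer()  # hypothetical hook
    # Stage 3: fine-tune again with both activations and weights quantized.
    for fp_b, q_b in zip(fp_blocks, q_blocks):
        q_b.quantize_weights = True               # hypothetical flag
        reconstruct_block(fp_b, q_b, calib_loader)
```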
Extensive experiments verify that the quantization method for visual self-attention models (ViTs) provided by the invention is simple in concept, saves computational cost, and greatly improves performance, exceeding various mainstream post-training quantization methods, with a more pronounced advantage at lower bit widths. For ViT-B in the 3-bit case, the method herein improves accuracy by 50.68% over the existing PTQ method.
4. Implementation details
All algorithms are implemented on the PyTorch framework (Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. PyTorch: An imperative style, high-performance deep learning library. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), pages 8026-8037, 2019). Uniform quantization is used for all weights and activation values, except for the post-Softmax activations, which use the shift-uniform-log2 quantizer. A straight-through estimator (STE) is used for gradient estimation of the rounding operation in the quantization process. All experimental configurations use 1024 calibration images, drawn from the ImageNet and COCO datasets.
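The straight-through estimator mentioned above can be realized in PyTorch with the common idiom below (shown for completeness; not specific to this invention):

```python
import torch

def ste_round(x: torch.Tensor) -> torch.Tensor:
    # Forward: round(x); backward: identity gradient (straight-through estimator).
    return x + (torch.round(x) - x).detach()
```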
The quantized models comprise the ViT, DeiT, and Swin series. The SOTA PTQ competitors, quantized to 6-bit, 4-bit, and 3-bit, respectively, are: FQ-ViT (Yang Lin, Tianyu Zhang, Peiqin Sun, Zheng Li, and Shuang Zhou. FQ-ViT: Post-training quantization for fully quantized vision transformer. In Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence (IJCAI), pages 1173-1179, 2022), PSAQ-ViT (Zhikai Li, Liping Ma, Mengjuan Chen, Junrui Xiao, and Qingyi Gu. Patch similarity aware data-free quantization for vision transformers. In Proceedings of the European Conference on Computer Vision (ECCV), pages 154-170. Springer, 2022), Ranking-ViT (Zhenhua Liu, Yunhe Wang, Kai Han, Wei Zhang, Siwei Ma, and Wen Gao. Post-training quantization for vision transformer. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), pages 28092-28103, 2021), EasyQuant (Di Wu, Qingming Tang, Yongle Zhao, Ming Zhang, Ying Fu, and Debing Zhang. EasyQuant: Post-training quantization via scale optimization. CoRR, abs/2006.16669, 2020), PTQ4ViT (Zhihang Yuan, Chenhao Xue, Yiqi Chen, Qiang Wu, and Guangyu Sun. PTQ4ViT: Post-training quantization for vision transformers with twin uniform quantization. In Proceedings of the European Conference on Computer Vision (ECCV), pages 191-207. Springer, 2022), APQ-ViT (Yifu Ding, Haotong Qin, Qinghua Yan, Zhenhua Chai, Junjie Liu, Xiaolin Wei, and Xianglong Liu. Towards accurate post-training quantization for vision transformer. In Proceedings of the 30th ACM International Conference on Multimedia (ACM MM), pages 5380-5388, 2022), NoisyQuant (Yijiang Liu, Huanrui Yang, Zhen Dong, Kurt Keutzer, Li Du, and Shanghang Zhang. NoisyQuant: Noisy bias-enhanced post-training activation quantization for vision transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 20321-20330, 2023), BRECQ (Yuhang Li, Ruihao Gong, Xu Tan, Yang Yang, Peng Hu, Qi Zhang, Fengwei Yu, Wei Wang, and Shi Gu. BRECQ: Pushing the limit of post-training quantization by block reconstruction. In Proceedings of the International Conference on Learning Representations (ICLR), 2021), QDrop (Xiuying Wei, Ruihao Gong, Yuhang Li, Xianglong Liu, and Fengwei Yu. QDrop: Randomly dropping quantization for extremely low-bit post-training quantization. In Proceedings of the International Conference on Learning Representations (ICLR), 2022), PD-Quant (Jiawei Liu, Lin Niu, Zhihang Yuan, Dawei Yang, Xinggang Wang, and Wenyu Liu. PD-Quant: Post-training quantization based on prediction difference metric. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 24427-24437, 2023), Bit-Shrinking (Chen Lin, Bo Peng, Zheyang Li, Wenming Tan, Ye Ren, Jun Xiao, and Shiliang Pu. Bit-shrinking: Limiting instantaneous sharpness for improving post-training quantization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 16196-16205, 2023), and RepQ-ViT (Zhikai Li, Junrui Xiao, Lianwei Yang, and Qingyi Gu. RepQ-ViT: Scale reparameterization for post-training quantization of vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 17227-17236, 2023).
5. Application effects
The method can be applied to the quantization of visual self-attention models (ViTs); by compressing and accelerating the ViTs, it improves the efficiency and accuracy with which the model processes images.
The tables of FIG. 4 and FIG. 5 show the quantization results of the method of the invention and the existing quantization methods on different datasets. It can be seen that, in most cases, the method of the invention is superior to all comparison methods on these models at different bit widths.
FIG. 4 shows the quantization results on ImageNet. It can be seen that the method of the present invention achieves the best performance under different bit settings; the advantage holds at all bit widths and is especially pronounced at low bit widths. As shown in FIG. 4, whether optimization-free or optimization-based, existing approaches suffer a very significant performance degradation in the ultra-low-bit case. For example, in the 3-bit case, PTQ4ViT collapses on all ViT models, while RepQ-ViT has limited accuracy: RepQ-ViT reaches only 0.97%, 4.37% and 4.84% on DeiT-T, DeiT-S and DeiT-B, respectively. Optimization-based approaches provide better results but exhibit unstable performance across different ViT models; for example, BRECQ collapses on ViT-S and Swin-B. In contrast, the proposed method exhibits stable and significantly improved performance over the ViT variants. In particular, the method of the present invention achieves encouraging improvements of 40.72% and 50.68% over previous methods on ViT-S and ViT-B quantization, respectively.
For DeiT-T, DeiT-S and DeiT-B, the method of the present invention achieves 41.52%, 55.78% and 73.30% accuracy, respectively, corresponding to increases of 1.55%, 26.45% and 27.01%. On Swin-S and Swin-B, the method reports increases of 4.53% and 4.98%, respectively. In the 4-bit case, the optimization-free RepQ-ViT outperforms the optimization-based approaches on most ViT variants, indicating that previous optimization-based PTQ approaches have an overfitting problem. The proposed method improves significantly over RepQ-ViT on all ViT variants; in particular, it achieves substantial improvements of 9.82% and 11.59% on ViT-S and ViT-B, respectively, and provides notable accuracy gains of 3.28%, 4.6% and 4.36% when quantizing DeiT-T, DeiT-S and DeiT-B, respectively. On Swin-S and Swin-B, the method exhibits performance gains of 1.72% and 1.48%, respectively. At 6 bits, RepQ-ViT again outperforms the optimization-based approaches on most ViT variants, indicating that the optimization-based approaches suffer from the same over-fitting problem as at 4 bits. Similar to the 3-bit and 4-bit results, the method of the invention shows improved and satisfactory performance; for example, when quantizing DeiT-B, Swin-S and Swin-B, it reaches accuracies of 81.68%, 82.89% and 84.94%, respectively, only 0.12%, 0.34% and 0.33% below the full-precision model.
Quantization results for the COCO dataset are shown in FIG. 5, which reports results for object detection and instance segmentation with all networks quantized to 4 bits. It can be seen that in most cases the method of the invention achieves better performance.
Specifically, when Mask R-CNN uses Swin-T as its backbone, the method of the present invention increases the box AP and mask AP by 1.4 and 0.6 points, respectively. Likewise, in Cascade Mask R-CNN with Swin-T as the backbone, the method increases box AP by 1.2 and mask AP by 0.6; with Swin-S as the backbone, box AP rises by 1.0 and mask AP by 0.5.
The technical scheme provided in the embodiments of the present application has at least the following technical effects or advantages: the present invention proposes a shift-uniform-log2 quantizer (SULQ) that achieves full coverage of the input domain and an accurate approximation of its distribution by introducing a shift bias before the log2 transformation and then uniformly quantizing its output; the invention also provides a three-stage smooth optimization strategy (SOS) that fully exploits a smooth, low-magnitude loss landscape for optimization while maintaining the efficiency of layer-wise activation quantization. By compressing and accelerating the visual self-attention model (ViTs), the method improves the efficiency and accuracy with which the model processes images; it is simple to implement, saves computational cost, and greatly improves performance, exceeding various mainstream post-training quantization methods, with the advantage being more pronounced at lower bit widths, e.g., for ViT-B in the 3-bit case the method improves by 50.68% over the existing PTQ method.
While specific embodiments of the invention have been described above, it will be appreciated by those skilled in the art that the specific embodiments described are illustrative only and not intended to limit the scope of the invention, and that equivalent modifications and variations of the invention in light of the spirit of the invention will be covered by the claims of the present invention.

Claims (4)

1. A quantization method suitable for a visual self-attention model, characterized in that the method comprises the following steps:
in the initial stage, the model is fine-tuned while the weights are kept at full precision; a channel-wise quantizer is used for the post-LayerNorm activations, a shift-uniform-log2 quantizer is used for the post-Softmax activations, and layer-wise quantizers are used for the other activations;
in the second stage, the channel-wise quantizer is smoothly transitioned to its corresponding layer-wise form using the scale reparameterization technique, so that the post-LayerNorm activations switch from the channel-wise quantizer to a layer-wise quantizer;
in the third stage, the model is fine-tuned using the loss function while both activations and weights are quantized, wherein the post-Softmax activations use the shift-uniform-log2 quantizer and the other activations use layer-wise quantizers.
2. The quantization method suitable for a visual self-attention model according to claim 1, characterized in that the shift-uniform-log2 quantizer is as follows: an initial shift bias is introduced on the log2 function input, and the output is then uniformly quantized, specifically designed as follows:
a shift bias η is introduced before the full-precision activation input is fed to the log2 transformation, and the result is then processed with a uniform quantizer, wherein the quantization process formula is:

$$\bar{x} = \mathrm{quant}\big(\log_2(x + \eta)\big)$$

the inverse quantization process formula is:

$$\hat{x} = 2^{\,\mathrm{dequant}(\bar{x})} - \eta$$

wherein x is the activation value input, x̂ is the de-quantized result, x̄ is the quantized integer value, and quant(·), dequant(·) respectively represent the quantization and inverse quantization calculation processes of uniform quantization, as follows:

$$\mathrm{quant}(y) = \mathrm{clip}\Big(\Big\lfloor \frac{y}{s} \Big\rceil + z,\ 0,\ 2^{b}-1\Big); \qquad \mathrm{dequant}(\bar{x}) = s\,(\bar{x} - z)$$

wherein b represents the bit width, s represents the quantization scale, z represents the zero point, clip represents the clipping function onto the given interval [0, 2^b - 1], and ⌊·⌉ represents the rounding operation.
3. The quantization method suitable for a visual self-attention model according to claim 1, characterized in that, when the channel-wise quantizer is smoothly transitioned to its corresponding layer-wise form using the scale reparameterization technique, the parameters are calculated by the following formulas:

$$\mathbf{r}_1 = \frac{\mathbf{s}}{\tilde{s}}, \qquad \mathbf{r}_2 = \mathbf{z} - \tilde{z}$$

$$\tilde{\boldsymbol{\gamma}} = \frac{\boldsymbol{\gamma}}{\mathbf{r}_1}, \qquad \tilde{\boldsymbol{\beta}} = \frac{\boldsymbol{\beta} + \mathbf{r}_2 \odot \mathbf{s}}{\mathbf{r}_1}$$

$$\tilde{\mathbf{W}}_{:,c} = r_{1,c}\,\mathbf{W}_{:,c}, \qquad \tilde{\mathbf{b}} = \mathbf{b} - \mathbf{W}\,(\mathbf{r}_2 \odot \mathbf{s})$$

wherein γ, β are the parameters of the original LayerNorm layer; γ̃, β̃ are the parameters of the LayerNorm layer after scale reparameterization; r₁, r₂ are the scale reparameterization factors, computed from the channel-wise quantization scale s and zero point z and their layer-wise counterparts s̃, z̃; W, b are the original weight parameters of the following linear layer; and W̃, b̃ are the weight parameters after scale reparameterization.
4. The quantization method suitable for a visual self-attention model according to claim 1, characterized in that the loss function is:

$$\mathcal{L} = \left\| \mathbf{O}_l^{fp} - \mathbf{O}_l^{q} \right\|_2^{2}$$

wherein O_l^{fp} represents the output of the l-th module of the full-precision visual self-attention model, and O_l^{q} represents the output of the l-th module of the quantized visual self-attention model.



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination