CN115983322A - Compression method of visual self-attention model based on multi-granularity reasoning - Google Patents
- Publication number: CN115983322A (application CN202310039838.0A)
- Authority
- CN
- China
- Prior art keywords: reasoning, grained, coarse, granularity, stage
- Prior art date: 2023-01-12
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Image Analysis (AREA)
Abstract
A compression method of a visual self-attention model based on multi-granularity reasoning relates to the compression and acceleration of artificial neural networks. The method comprises the following steps: 1) providing a two-stage reasoning framework, performing coarse-grained patch cutting on the whole picture in the first stage, and performing further fine-grained cutting on regions of the picture with high information content in the second stage; 2) designing important-region identification based on the global class attention map; 3) designing a feature reuse module so that the second stage reuses the features extracted in the first stage; 4) designing the training paradigm of the two-stage inference framework so that the model introduces no additional parameters. The model complexity can be adaptively adjusted according to sample difficulty.
Description
Technical Field
The invention relates to compression and acceleration of an artificial neural network, in particular to a compression method of a visual self-attention model based on multi-granularity reasoning.
Background
Attention models (Transformers) have had great success in natural language processing, which motivated their migration into the field of computer vision and the proposal of the visual self-attention model (ViT). ViT has spread widely in computer vision and is fast becoming one of the most common and promising architectures for a variety of vision tasks, such as image classification (Graham B, El-Nouby A, Touvron H, et al. LeViT: a Vision Transformer in ConvNet's Clothing for Faster Inference [C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021: 12259-12269.) and object detection (Carion N, Massa F, Synnaeve G, et al. End-to-End Object Detection with Transformers [C]//European Conference on Computer Vision. Springer, Cham, 2020: 213-229.). The basic idea of ViT is to cut an image into a series of patches and convert these patches into input tokens using a linear transformation. The advantage of ViT is that it can capture long-distance relationships between different parts of the image through the Multi-Head Self-Attention (MHSA) mechanism.
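The patch-to-token idea just described can be sketched in a few lines of Python. This is illustrative only; `patchify`, `embed`, and the toy weight matrix are hypothetical names, not the patent's code.

```python
def patchify(image, patch):
    """Split an H x W single-channel image (list of lists) into
    non-overlapping patch x patch blocks, each flattened to a vector."""
    H, W = len(image), len(image[0])
    patches = []
    for top in range(0, H, patch):
        for left in range(0, W, patch):
            block = [image[top + i][left + j]
                     for i in range(patch) for j in range(patch)]
            patches.append(block)
    return patches

def embed(patches, weight):
    """Linearly project each flattened patch to a token (matrix-vector product)."""
    return [[sum(p[i] * weight[i][d] for i in range(len(p)))
             for d in range(len(weight[0]))] for p in patches]

# A 4x4 image with 2x2 patches yields (4/2)*(4/2) = 4 tokens.
img = [[r * 4 + c for c in range(4)] for r in range(4)]
tokens = embed(patchify(img, 2), [[1, 0], [0, 1], [1, 0], [0, 1]])
```

In a real ViT the projection weight is learned and a class token plus position embeddings are added, as described later in this document.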
The computation of ViT is proportional to the square of the number of input tokens, and this high computational cost seriously hinders the practical deployment of ViT. The most intuitive remedy is to reduce the number of tokens during reasoning so as to reduce the amount of computation, since images are often filled with redundant areas such as background. This observation has inspired many researchers to progressively discard tokens with low information content during the forward pass. PS-ViT (Tang Y, Han K, Wang Y, et al. Patch slimming for efficient vision transformers [J]. arXiv preprint arXiv:2106.02852, 2021.) introduces a top-down token pruning paradigm. DynamicViT (Rao Y, Zhao W, Liu B, et al. DynamicViT: Efficient vision transformers with dynamic token sparsification [J]. Advances in Neural Information Processing Systems, 2021, 34.) learns to score each token with a learnable prediction module, whereas EViT (Liang Y, Ge C, Tong Z, et al. Not all patches are what you need: Expediting vision transformers via token reorganizations [J]. arXiv preprint arXiv:2202.07800, 2022.) measures the significance of a token by the ready-made class attention. DVT (Wang Y, Huang R, Song S, et al. Not all images are worth 16x16 words: Dynamic transformers for efficient image recognition [J]. Advances in Neural Information Processing Systems, 2021, 34.) cascades multiple ViT models with increasing token numbers and adaptively decides how many tokens each image needs. Although these token-discarding methods reduce computational cost, they suffer from two drawbacks: 1) they sacrifice recognition accuracy — for example, PS-ViT and EViT save 1.6-2.0G FLOPs on DeiT-S (Touvron H, Cord M, Douze M, et al. Training data-efficient image transformers & distillation through attention [C]//International Conference on Machine Learning. PMLR, 2021.) at the cost of a drop in accuracy; 2) they introduce an additional training burden — for example, DynamicViT requires training an extra token-importance prediction module, and DVT roughly triples the training parameters.
The spatial redundancy of images can be divided into two categories: inter-image spatial redundancy and intra-image spatial redundancy. The former means that different images have different difficulty: a simple picture can be successfully recognized with a low-computation model, while a difficult picture can only be correctly recognized with a high-computation model. The latter means that the interior of an image can be divided into important and unimportant areas, and whether the model recognizes the image correctly depends mainly on the important areas. For example, the work of DVT, which directly adapts the number of tokens encoding the whole picture, addresses inter-image spatial redundancy, while the token-pruning work of PS-ViT, DynamicViT, etc. addresses intra-image spatial redundancy.
Existing approaches therefore attend to only a single type of spatial redundancy when performing ViT model compression training. The invention attends to both the inter-image and the intra-image spatial redundancy, and provides a multi-granularity reasoning framework for the visual self-attention model based on both.
Disclosure of Invention
The invention aims to provide a compression method of a visual self-attention model based on multi-granularity reasoning, addressing the technical problems of the prior art in ViT model compression. Using only the training scheme designed by the invention, a ViT model can be trained directly from scratch, and the model can adaptively adjust its complexity according to the difficulty of the input picture; compared with an ordinary model, the method achieves better performance while compressing the computational complexity.
The invention comprises the following steps:
1) Cutting the image into coarse-grained patches, encoding the coarse-grained patches into tokens, inputting the tokens into the model for first-stage reasoning, and obtaining a coarse-grained reasoning result;
2) Calculating the confidence of the coarse-grained reasoning result, stopping reasoning when the confidence exceeds a threshold, and otherwise performing second-stage reasoning;
3) Selecting important coarse-grained patches according to the global class attention for further fine-grained cutting, and obtaining token encodings of the fine-grained patches;
4) Adding the features extracted by the coarse-grained reasoning to the fine-grained token encodings through a linear transformation, and inputting the result into the model for second-stage reasoning, obtaining a fine-grained reasoning result;
5) During training, the coarse-grained reasoning result is supervised by the fine-grained reasoning result, and the fine-grained reasoning result is supervised by the ground-truth label.
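The five steps above reduce to a simple early-exit control flow, sketched below. The two model stages are stand-in stubs (`coarse_stage`, `fine_stage`, and their probability outputs are invented for illustration); only the threshold logic mirrors the method.

```python
def coarse_stage(image):
    # Stub: pretend the coarse pass returns a class probability distribution.
    return [0.7, 0.2, 0.1] if image == "easy" else [0.4, 0.35, 0.25]

def fine_stage(image):
    # Stub for the fine-grained pass on informative regions.
    return [0.1, 0.85, 0.05]

def two_stage_infer(image, eta=0.6):
    """Run coarse reasoning; exit early if confident, else run fine reasoning."""
    p_c = coarse_stage(image)
    conf = max(p_c)
    if conf > eta:                      # confident: stop after stage one
        return p_c.index(conf), "coarse"
    p_f = fine_stage(image)             # otherwise run fine-grained reasoning
    return p_f.index(max(p_f)), "fine"
```

Easy samples terminate after the cheap coarse pass, which is where the computational savings come from.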
In step 1), cutting the image into coarse-grained patches and encoding them into tokens is specifically:
inputting the tokens corresponding to the coarse-grained patches into the model and acquiring coarse-grained encoded features:
obtaining a coarse-grained reasoning result according to the coarse-grained encoded features:
in step 2), the confidence of the coarse-grained reasoning result is specifically calculated as follows:
setting a threshold η: if the maximum class probability of the coarse-grained prediction exceeds η, reasoning stops; otherwise, second-stage reasoning is carried out.
In step 3), the global class attention is:
the mark code for the fine-grained cut patch is:
in step 4), the model inputs for the second stage inference are:
the fine-grained reasoning result is as follows:
in step 5), the training loss is:
loss = CE(p_f, y) + KL(p_c, p_f)
the invention provides a two-stage reasoning framework, wherein coarse-grained patch cutting is carried out on the whole picture in the first stage, and further fine-grained cutting is carried out on the region with high information content in the picture in the second stage; designing important region identification based on the global class attention diagram; designing a feature multiplexing module to enable the second stage to multiplex the features extracted in the first stage; the training paradigm of the two-stage inference framework is designed such that no additional parameters can be introduced by the model. The model complexity can be adaptively adjusted according to the sample difficulty. The method can be applied to the ViT model in the field of image classification, and compared with the conventional mark discarding research which sacrifices the recognition precision, the two-stage reasoning framework provided by the invention provides better recognition capability and greatly improves the performance of the model. Compared with the existing research of determining the number of labels in a multi-model cascade self-adaption mode, the two-stage reasoning framework provided by the invention not only provides better recognition capability, but also greatly reduces the number of parameters and training overhead.
Drawings
FIG. 1 is a schematic flow chart of the method of the present invention.
Detailed Description
The following examples further illustrate the invention in conjunction with the drawings.
The invention provides a two-stage multi-granularity reasoning framework addressing the performance loss caused by current ViT compression. With this framework, lossless compression can be achieved, and the model adaptively adjusts its computation according to the difficulty of the input picture. The method relates to the compression and acceleration of artificial neural networks and is referred to as CF-ViT for short.
A method block diagram of an embodiment of the invention is shown in fig. 1.
Description of the symbols:
the ViT segments the two-dimensional image into two-dimensional patches and maps the patches to the labels using linear projection. An extra [ class ] flag (class flag) is also appended, representing the global image information. In addition, all tags are embedded with the addition of a learnable position. Thus, the input tag sequence for the ViT model is:
a ViT model contains K sequentially stacked encoders, each consisting of a self-attention (SA) module and feed-forward network (FFN). In SA of the kth encoder, a marker sequence X k-1 Is projected to a query matrixIn (1). A key matrix->And a value matrix>Self-attention matrix>Is calculated as:
The first row of Attn_k, referred to as class attention, reflects the interaction between the class token and the other tokens. The output A_k = Attn_k V_k of the SA module is sent to the FFN, consisting of two fully-connected layers, to derive the updated tokens X_k. After a series of SA-FFN transformations, the class token from the K-th encoder is fed into a classifier to predict the class of the input.
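As a concrete illustration of the SA formula above, here is a single-head, pure-Python sketch with toy dimensions; the weight matrices and function names are arbitrary stand-ins, not the patent's implementation.

```python
import math

def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax_rows(M):
    """Row-wise softmax with the usual max-shift for numerical stability."""
    out = []
    for row in M:
        m = max(row)
        e = [math.exp(v - m) for v in row]
        s = sum(e)
        out.append([v / s for v in e])
    return out

def self_attention(X, Wq, Wk, Wv):
    """One SA step: Attn = Softmax(Q K^T / sqrt(d)); output = Attn V."""
    Q, K, V = matmul(X, Wq), matmul(X, Wk), matmul(X, Wv)
    d = len(Q[0])
    KT = [list(col) for col in zip(*K)]
    scores = [[v / math.sqrt(d) for v in row] for row in matmul(Q, KT)]
    attn = softmax_rows(scores)   # row 0 would be the class-attention row
    return matmul(attn, V), attn

X = [[1.0, 0.0], [0.0, 1.0]]
I2 = [[1.0, 0.0], [0.0, 1.0]]
out, attn = self_attention(X, I2, I2, I2)
```

Each row of `attn` is a probability distribution over tokens, which is why the class-token row can later be reused as an importance score.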
1. The first stage is as follows: coarse grain reasoning
CF-ViT first performs a coarse-grained split to recognize "easy" images. At the same time, it locates informative regions for efficient reasoning when "difficult" samples are encountered. In the coarse stage, the input to the CF-ViT model v is:
X_c = [x_class; x_1; …; x_{N_c}] + E_pos,
where N_c is the number of coarse-grained patches. Assuming v contains K encoders, after multiple SA-FFN transformations the output token sequence of v is:
class labelsIs sent to the classifier>To obtain a class prediction distribution p of the coarse phase c :
where n indexes the categories. The predicted category of the input is then obtained as j = argmax_n p_c^(n).
it is desirable to have a largeSince it is used as the prediction confidence score in the coarse inference phase due to the number of patches N c Is small and the calculation cost is very cheap. A threshold η is introduced to achieve a trade-off between performance and computation. If it is notThe inference will terminate, attributing the input to category j. Otherwise, fine-grained reasoning needs to be performed on the samples.
For better fine-grained reasoning, a global class attention score is also accumulated during the coarse-grained reasoning stage. The class attention scores of the different layers are aggregated using an exponential weighted average:
A_glb^k = β · A_glb^(k-1) + (1 − β) · A_cls^k,
where β is a weighting factor and A_cls^k is the class attention of the k-th encoder. Based on this, the aggregate at the last encoder, A_glb^K, serves as the global class attention.
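The aggregation can be sketched as below. The exact recurrence is an assumption modeled on the text ("exponential weighted average" with weight β on the running value); the function name is hypothetical.

```python
def global_class_attention(per_layer_scores, beta=0.5):
    """per_layer_scores: one entry per encoder layer; each entry is the class
    token's attention over the patch tokens. Returns the running exponential
    weighted average after the last layer (the global class attention)."""
    acc = per_layer_scores[0]
    for scores in per_layer_scores[1:]:
        acc = [beta * a + (1 - beta) * s for a, s in zip(acc, scores)]
    return acc
```

Averaging across layers makes the importance estimate more stable than the attention of any single layer.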
2. And a second stage: fine-grained reasoning
Informative coarse-grained patches are selected based on the global class attention: those whose attention scores rank in the top αN_c among the N_c coarse-grained patches, where α ∈ [0,1] denotes the ratio of informative coarse-grained patches. Each informative coarse-grained patch is then further split into 2×2 fine-grained patches for recognition at a finer granularity. The number of patches produced by the fine-grained split is therefore 4αN_c.
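An illustrative sketch of the selection rule (function and variable names are hypothetical; the handling of the unselected patches is not shown):

```python
def select_informative(attention, alpha):
    """Return indices of the top alpha*N_c coarse patches by attention score."""
    n_c = len(attention)
    k = int(alpha * n_c)
    ranked = sorted(range(n_c), key=lambda i: attention[i], reverse=True)
    return sorted(ranked[:k])

# Toy global class attention over N_c = 6 coarse patches.
attn = [0.05, 0.40, 0.10, 0.30, 0.05, 0.10]
picked = select_informative(attn, 0.5)   # top 3 of 6 patches
n_fine = 4 * len(picked)                 # each selected patch splits into 2x2
```

With α = 0.5 and N_c = 6, three patches are selected and the fine-grained split yields 4·αN_c = 12 fine patches.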
encoding the patch after fine-grained segmentation:
in order to better improve the performance of a fine-grained reasoning stage, a feature multiplexing module is designed. The method corresponds the marked feature dimension of coarse granularity to the feature mark of fine granularity:
adding the aligned features directly with the fine-grained feature marks, and inputting the feature marks into a model to execute fine-grained reasoning:
class labelsIs sent to the same classifier>In the method, a class prediction distribution p of a fine stage is obtained f :
3. Training an objective function
In the training process of CF-ViT, the confidence threshold is set to η = 1, which means the fine-grained reasoning stage is always executed for every input image. On the one hand, it is desirable that the fine-grained predictions fit the ground-truth label y well, so that the input is predicted accurately. On the other hand, it is desirable that the coarse-grained predictions have outputs similar to the fine-grained ones, so that most inputs can already be recognized well during the coarse-grained reasoning stage. Thus, the training loss of CF-ViT is as follows:
loss = CE(p_f, y) + KL(p_c, p_f),
where CE(·) and KL(·) denote the cross-entropy loss and the Kullback-Leibler divergence, respectively.
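A minimal numeric sketch of this objective in plain Python. The convention KL(p, q) = Σ p·log(p/q) is assumed for KL(p_c, p_f); function names are illustrative.

```python
import math

def cross_entropy(p, y):
    """CE between a predicted distribution p and integer label y."""
    return -math.log(p[y])

def kl_div(p, q):
    """KL(p || q) = sum p_i * log(p_i / q_i), skipping zero-probability terms."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def cf_vit_loss(p_f, p_c, y):
    """loss = CE(p_f, y) + KL(p_c, p_f): fine output supervised by the label,
    coarse output pulled toward the fine output."""
    return cross_entropy(p_f, y) + kl_div(p_c, p_f)
```

When the coarse and fine distributions coincide, the KL term vanishes and only the cross-entropy on the fine prediction remains.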
4. Implementation details
All models were trained on ImageNet (Deng J, Dong W, Socher R, et al. ImageNet: A large-scale hierarchical image database [C]//2009 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2009.), which contains 1000 classes, 1.2 million training images and 50,000 validation images. DeiT-S (Touvron H, Cord M, Douze M, et al. Training data-efficient image transformers & distillation through attention [C]//International Conference on Machine Learning. PMLR, 2021.) serves as a backbone network. Data augmentation includes Mixup (Zhang H, Cisse M, Dauphin Y N, et al. mixup: Beyond empirical risk minimization [J]. arXiv preprint arXiv:1710.09412, 2017.) and CutMix (Yun S, Han D, Oh S J, et al. CutMix: Regularization strategy to train strong classifiers with localizable features [C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. 2019: 6023-6032.).
5. Field of application
The method can be applied to the classified ViT model to realize compression and acceleration of the classified ViT model.
Table 1 shows a performance comparison between models obtained with the training method of the present invention and the underlying backbone ViT models. The invention (CF-ViT) has advantages in both speed and accuracy, and obtains actual inference acceleration on a GPU.
TABLE 1
The model throughput in Table 1 is the number of images processed per second on a single A100 GPU; each model is run for 50 iterations with a batch size of 512 for an accurate throughput estimate, and the average is taken as the actual throughput. Table 1 shows that CF-ViT balances accuracy and efficiency: CF-ViT reduces the FLOPs of DeiT-S by 61% and of LV-ViT-S by 53% while maintaining the same accuracy as the backbone networks. Accordingly, CF-ViT processes images much faster, increasing throughput by 1.88x over DeiT-S and 2.01x over LV-ViT-S. In addition, at a larger confidence threshold, CF-ViT shows not only favorable FLOPs and throughput but also better top-1 accuracy. In particular, when η = 1, meaning that all inputs are sent to the fine inference stage, CF-ViT improves the performance of DeiT-S by 1.0% and of LV-ViT-S by 0.3%. These results demonstrate that CF-ViT maintains a good trade-off between model performance and model efficiency.
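The throughput protocol just described can be sketched as follows. The model is replaced by a trivial stand-in workload (`dummy_model`); only the timing and images-per-second arithmetic mirror the description.

```python
import time

def measure_throughput(model_fn, batch_size, iterations):
    """Time `iterations` forward passes and return images processed per second."""
    start = time.perf_counter()
    for _ in range(iterations):
        model_fn(batch_size)          # one forward pass on a batch
    elapsed = time.perf_counter() - start
    return batch_size * iterations / elapsed

def dummy_model(batch_size):
    # Stand-in workload: a little arithmetic in place of a real forward pass.
    return sum(i * i for i in range(batch_size * 10))

# The document uses 50 iterations at batch size 512 on an A100 GPU.
ips = measure_throughput(dummy_model, batch_size=512, iterations=50)
```

In practice one would also run a few warm-up batches and synchronize the GPU before and after timing.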
TABLE 2
Table 2 compares the present invention with recent compression methods, including IA-RED² (Pan B, Panda R, Jiang Y, et al. IA-RED²: Interpretability-Aware Redundancy Reduction for Vision Transformers [J]. Advances in Neural Information Processing Systems, 2021, 34.), Evo-ViT (Xu Y, Zhang Z, Zhang M, et al. Evo-ViT: Slow-fast token evolution for dynamic vision transformer [J]. arXiv preprint arXiv:2108.01390, 2021.), EViT (Liang Y, Ge C, Tong Z, et al. Not all patches are what you need: Expediting vision transformers via token reorganizations [J]. arXiv preprint arXiv:2202.07800, 2022.), and PS-ViT (Tang Y, Han K, Wang Y, et al. Patch slimming for efficient vision transformers [J]. arXiv preprint arXiv:2106.02852, 2021.). Table 2 shows that CF-ViT is superior to these existing ViT compression methods in both accuracy and FLOPs reduction. For example, CF-ViT reduces the FLOPs of DeiT-S to 1.8G without any loss of accuracy, whereas the recent Evo-ViT reaches only 79.4% accuracy at a heavier cost of 3.0G FLOPs. Similar results are observed when LV-ViT-S is used as the backbone.
Claims (6)
1. The compression method of the visual self-attention model based on the multi-granularity reasoning is characterized by comprising the following steps of:
1) Cutting the image into coarse-grained patches, encoding the coarse-grained patches into tokens, inputting the tokens into the model for first-stage reasoning, and obtaining a coarse-grained reasoning result;
2) Calculating the confidence of the coarse-grained reasoning result, stopping reasoning if the confidence exceeds a threshold, and otherwise performing second-stage reasoning;
3) Selecting important coarse-grained patches according to the global class attention for further fine-grained cutting, and obtaining token encodings of the fine-grained patches;
4) Adding the features extracted by coarse-grained reasoning to the fine-grained token encodings through a linear transformation, and inputting the result into the model for second-stage reasoning, obtaining a fine-grained reasoning result;
5) During training, the coarse-grained reasoning result is supervised by the fine-grained reasoning result, and the fine-grained reasoning result is supervised by the ground-truth label.
2. The method for compressing the visual self-attention model based on multi-granularity reasoning according to claim 1, wherein in step 1), cutting the image into coarse-grained patches and encoding them into tokens specifically comprises:
inputting the tokens corresponding to the coarse-grained patches into the model and acquiring coarse-grained encoded features:
obtaining a coarse-grained reasoning result according to the coarse-grained encoded features:
3. The method for compressing the visual self-attention model based on multi-granularity inference as claimed in claim 1, wherein in step 2), the confidence of the coarse-grained inference result is calculated by:
setting a threshold η: if the maximum class probability of the coarse-grained prediction exceeds η, inference stops; otherwise, second-stage inference is performed.
6. The method for compressing a visual self-attention model based on multi-granularity inference as claimed in claim 1, wherein in step 5), the training loss is:
loss = CE(p_f, y) + KL(p_c, p_f).
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310039838.0A CN115983322A (en) | 2023-01-12 | 2023-01-12 | Compression method of visual self-attention model based on multi-granularity reasoning |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115983322A true CN115983322A (en) | 2023-04-18 |
Family
ID=85959576
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310039838.0A Pending CN115983322A (en) | 2023-01-12 | 2023-01-12 | Compression method of visual self-attention model based on multi-granularity reasoning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115983322A (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title
---|---|---|---|---
CN116884003A (en) * | 2023-07-18 | 2023-10-13 | 南京领行科技股份有限公司 | Picture automatic labeling method and device, electronic equipment and storage medium
CN116884003B (en) * | 2023-07-18 | 2024-03-22 | 南京领行科技股份有限公司 | Picture automatic labeling method and device, electronic equipment and storage medium
CN117689044A (en) * | 2024-02-01 | 2024-03-12 | 厦门大学 | Quantification method suitable for vision self-attention model
CN117994587A (en) * | 2024-02-26 | 2024-05-07 | 昆明理工大学 | Pathological image classification method based on deep learning two-stage reasoning network
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN115983322A (en) | Compression method of visual self-attention model based on multi-granularity reasoning | |
CN110209823B (en) | Multi-label text classification method and system | |
CN111524525A (en) | Original voice voiceprint recognition method, device, equipment and storage medium | |
CN109871885A (en) | A kind of plants identification method based on deep learning and Plant Taxonomy | |
CN109871749B (en) | Pedestrian re-identification method and device based on deep hash and computer system | |
CN113255597B (en) | Transformer-based behavior analysis method and device and terminal equipment thereof | |
CN112331170A (en) | Method, device and equipment for analyzing similarity of Buddha music melody and storage medium | |
CN110569823A (en) | sign language identification and skeleton generation method based on RNN | |
Wei et al. | Compact MQDF classifiers using sparse coding for handwritten Chinese character recognition | |
CN116578699A (en) | Sequence classification prediction method and system based on Transformer | |
Seo et al. | Mobilenet using coordinate attention and fusions for low-complexity acoustic scene classification with multiple devices | |
CN117236335A (en) | Two-stage named entity recognition method based on prompt learning | |
CN108985517A (en) | Short-term traffic flow forecast method based on linear regression | |
CN111144462A (en) | Unknown individual identification method and device for radar signals | |
CN103955711A (en) | Mode recognition method in imaging spectrum object recognition analysis | |
CN117830711A (en) | Automatic image content auditing method based on deep learning | |
CN109508698A (en) | A kind of Human bodys' response method based on binary tree | |
CN113011444A (en) | Image identification method based on neural network frequency domain attention mechanism | |
CN109033413B (en) | Neural network-based demand document and service document matching method | |
Ferreira et al. | Learning signer-invariant representations with adversarial training. | |
CN115953506A (en) | Industrial part defect image generation method and system based on image generation model | |
CN112465838B (en) | Ceramic crystal grain image segmentation method, system, storage medium and computer equipment | |
CN115965026A (en) | Model pre-training method and device, text analysis method and device and storage medium | |
CN113707213A (en) | Protein-ligand binding site prediction method based on deep learning | |
CN112487231B (en) | Automatic image labeling method based on double-image regularization constraint and dictionary learning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||