CN115983322A - Compression method of visual self-attention model based on multi-granularity reasoning - Google Patents

Compression method of visual self-attention model based on multi-granularity reasoning

Info

Publication number
CN115983322A
Authority
CN
China
Prior art keywords
reasoning
grained
coarse
granularity
stage
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310039838.0A
Other languages
Chinese (zh)
Inventor
纪荣嵘 (Rongrong Ji)
陈锰钊 (Mengzhao Chen)
林明宝 (Mingbao Lin)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xiamen University
Original Assignee
Xiamen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xiamen University
Priority to CN202310039838.0A
Publication of CN115983322A
Legal status: Pending

Landscapes

  • Image Analysis (AREA)

Abstract

A compression method of a visual self-attention model based on multi-granularity reasoning, relating to the compression and acceleration of artificial neural networks. The method comprises the following steps: 1) providing a two-stage reasoning framework, in which the whole picture is split into coarse-grained patches in the first stage, and regions of the picture with high information content are further split at fine granularity in the second stage; 2) designing important-region identification based on a global class attention map; 3) designing a feature reuse module so that the second stage reuses the features extracted in the first stage; 4) designing a training paradigm for the two-stage inference framework so that the model introduces no additional parameters. The model complexity can be adaptively adjusted according to the sample difficulty.

Description

Compression method of visual self-attention model based on multi-granularity reasoning
Technical Field
The invention relates to compression and acceleration of an artificial neural network, in particular to a compression method of a visual self-attention model based on multi-granularity reasoning.
Background
Attention models (Transformers) have achieved great success in natural language processing, which motivated their migration into the field of computer vision and led to the visual self-attention model (Vision Transformer, ViT). ViT has spread widely in computer vision and is quickly becoming one of the most common and promising architectures for a variety of vision tasks, such as image classification (Graham B, El-Nouby A, Touvron H, et al. LeViT: a Vision Transformer in ConvNet's Clothing for Faster Inference [C]// Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021: 12259-12269.) and object detection (Carion N, Massa F, Synnaeve G, et al. End-to-end object detection with transformers [C]// European Conference on Computer Vision. Springer, Cham, 2020: 213-229.). The basic idea of ViT is to split an image into a series of patches and convert these patches into input tokens using a linear transformation. The advantage of ViT is that it can capture long-distance relationships between different parts of the image through the Multi-Head Self-Attention (MHSA) mechanism.
The amount of computation of ViT is proportional to the square of the number of input tokens, and the high computational cost seriously hinders the practical deployment of ViT. The most intuitive remedy is to reduce the number of tokens during inference so as to reduce the amount of computation, since images are often filled with redundant areas such as background. This observation has inspired many researchers to progressively discard less informative tokens during the forward pass of the network. PS-ViT (Tang Y, Han K, Wang Y, et al. Patch slimming for efficient vision transformers [J]. arXiv preprint arXiv:2106.02852, 2021.) introduces a top-down token pruning paradigm. DynamicViT (Rao Y, Zhao W, Liu B, et al. DynamicViT: Efficient vision transformers with dynamic token sparsification [J]. Advances in Neural Information Processing Systems, 2021, 34.) learns to score each token with a learnable prediction module, whereas EViT (Liang Y, Ge C, Tong Z, et al. Not all patches are what you need: Expediting vision transformers via token reorganizations [J]. arXiv preprint arXiv:2202.07800, 2022.) measures the importance of a token by the ready-made class attention. DVT (Wang Y, Huang R, Song S, et al. Not all images are worth 16x16 words: Dynamic transformers for efficient image recognition [J]. Advances in Neural Information Processing Systems, 2021, 34.) cascades several models with increasing token numbers and adaptively decides how many tokens each image needs. Although these token discarding methods reduce the computational cost, they suffer from two drawbacks: 1) they sacrifice recognition accuracy; for example, although PS-ViT and EViT save 1.6-2.0G FLOPs on DeiT-S (Touvron H, Cord M, Douze M, et al. Training data-efficient image transformers & distillation through attention [C]// International Conference on Machine Learning. PMLR, 2021.), they do so at the expense of accuracy; 2) they introduce an additional training burden; for example, DynamicViT requires training an additional token-importance prediction module, and DVT roughly triples the training parameters.
The spatial redundancy of images can be divided into two categories: inter-image spatial redundancy and intra-image spatial redundancy. The former means that different images have different difficulty; for example, a simple picture can be successfully recognized with a low-computation model, while a difficult picture can be correctly recognized only with a high-computation model. The latter means that the interior of an image can be divided into important regions and unimportant regions, and whether the model recognizes the image correctly depends mainly on the important regions. For example, DVT, which directly adapts the number of token encodings for the whole picture, focuses on the spatial redundancy between images, while works such as PS-ViT and DynamicViT, which reduce the number of tokens by pruning, focus on the spatial redundancy within an image.
Therefore, existing approaches focus on a single type of spatial redundancy when compressing ViT models. The present invention attends to both the spatial redundancy between images and the spatial redundancy within an image, and provides a multi-granularity reasoning framework for the visual self-attention model based on both types of spatial redundancy.
Disclosure of Invention
Aiming at the technical problems in the prior art, the invention provides a compression method of a visual self-attention model based on multi-granularity reasoning for ViT model compression. Using only the training scheme designed by the invention, a ViT model can be trained directly from scratch, and the model can adaptively adjust its complexity according to the difficulty of the input picture; compared with an ordinary model, better performance can be achieved while the computational complexity is compressed.
The invention comprises the following steps:
1) Splitting the image into coarse-grained patches, encoding the coarse-grained patches into tokens, inputting the tokens into a model for first-stage reasoning, and obtaining a coarse-grained reasoning result;
2) Calculating the confidence of the coarse-grained reasoning result, stopping reasoning when the confidence exceeds a threshold, and otherwise performing second-stage reasoning;
3) Selecting important coarse-grained patches according to the global class attention for further fine-grained splitting, and obtaining token encodings of the fine-grained patches;
4) Adding the features extracted by the coarse-grained reasoning to the fine-grained token encodings through a linear transformation, and inputting the result into the model for second-stage reasoning to obtain a fine-grained reasoning result;
5) During training, supervising the coarse-grained reasoning result with the fine-grained reasoning result, and supervising the fine-grained reasoning result with the ground-truth label.
In step 1), the splitting of the image into coarse-grained patches and the encoding of the coarse-grained patches into tokens are specifically:

X_0^c = [x_cls; x_1^c; x_2^c; ...; x_{N_c}^c] + E_pos^c,

where x_i^c is the linear embedding of the i-th coarse-grained patch, x_cls is the class token, E_pos^c is the learnable position embedding and N_c is the number of coarse-grained patches; the tokens corresponding to the coarse-grained patches are input into the model v to obtain the coarse-grained encoded features:

X_K^c = v(X_0^c),

where K is the number of encoders in the model; and the coarse-grained reasoning result is obtained from the coarse-grained encoded features:

p_c = Softmax(FC(x_cls,K^c)),

where x_cls,K^c is the class token output by the K-th encoder and FC is the classifier.

In step 2), the confidence of the coarse-grained reasoning result is specifically calculated as:

j = argmax_i p_c^(i), with p_c^(j) taken as the confidence;

a threshold η is set; if p_c^(j) > η, reasoning stops, otherwise second-stage reasoning is performed.

In step 3), the global class attention is obtained by exponentially weighted averaging of the class attention of the encoders:

A_cls,k = β · A_cls,k-1 + (1 - β) · a_cls,k, k = 1, ..., K,

where a_cls,k is the class attention of the k-th encoder and β is a weighting factor, the global class attention being A_cls,K; and the token encoding of the fine-grained patches is:

X_0^f = [x_cls; x_1^f; x_2^f; ...; x_{N_f}^f] + E_pos^f,

where x_i^f is the linear embedding of the i-th fine-grained patch and N_f is the number of fine-grained patches.

In step 4), the model input of the second-stage reasoning is:

X_r^c = FR(X_K^c),

X_K^f = v(X_0^f + X_r^c),

where FR is the feature reuse module, a linear transformation that aligns the coarse-grained token features with the fine-grained tokens; the fine-grained reasoning result is:

p_f = Softmax(FC(x_cls,K^f)).

In step 5), the training loss is:

loss = CE(p_f, y) + KL(p_c, p_f),

where y is the ground-truth label, CE is the cross-entropy loss and KL is the Kullback-Leibler divergence.
the invention provides a two-stage reasoning framework, wherein coarse-grained patch cutting is carried out on the whole picture in the first stage, and further fine-grained cutting is carried out on the region with high information content in the picture in the second stage; designing important region identification based on the global class attention diagram; designing a feature multiplexing module to enable the second stage to multiplex the features extracted in the first stage; the training paradigm of the two-stage inference framework is designed such that no additional parameters can be introduced by the model. The model complexity can be adaptively adjusted according to the sample difficulty. The method can be applied to the ViT model in the field of image classification, and compared with the conventional mark discarding research which sacrifices the recognition precision, the two-stage reasoning framework provided by the invention provides better recognition capability and greatly improves the performance of the model. Compared with the existing research of determining the number of labels in a multi-model cascade self-adaption mode, the two-stage reasoning framework provided by the invention not only provides better recognition capability, but also greatly reduces the number of parameters and training overhead.
Drawings
FIG. 1 is a schematic flow chart of the method of the present invention.
Detailed Description
The following examples further illustrate the invention in conjunction with the drawings.
Aiming at the performance loss caused by current ViT compression, the invention provides a two-stage multi-granularity reasoning framework. With this multi-granularity reasoning framework, lossless compression can be achieved, and the model can adaptively adjust its computational complexity according to the difficulty of the input picture. The invention relates to the compression and acceleration of artificial neural networks. The method is referred to as CF-ViT for short.
A method block diagram of an embodiment of the invention is shown in fig. 1.
Description of the symbols:
ViT splits the two-dimensional image into two-dimensional patches and maps the patches to tokens using a linear projection. An extra [class] token is appended, representing the global image information. In addition, a learnable position embedding is added to all tokens. Thus, the input token sequence of the ViT model is:

X_0 = [x_cls; x_1; x_2; ...; x_N] + E_pos,

where x_i is the token of the i-th patch, x_cls is the class token, E_pos is the position embedding and N is the number of patches.
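As an illustration of this token construction, the following sketch builds the input sequence X_0 with a strided convolution as the linear projection; the 224x224 input size, 32x32 coarse patch size (a 7x7 grid) and embedding dimension are assumptions, not values fixed by the invention.

```python
# A minimal sketch of the patch/token embedding described by the formula above.
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    def __init__(self, img_size=224, patch_size=32, dim=384):
        super().__init__()
        n = (img_size // patch_size) ** 2                    # number of patches N
        self.proj = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))        # x_cls
        self.pos_embed = nn.Parameter(torch.zeros(1, n + 1, dim))    # E_pos

    def forward(self, img):                                  # img: [B, 3, H, W]
        x = self.proj(img).flatten(2).transpose(1, 2)        # [B, N, dim] patch tokens
        cls = self.cls_token.expand(x.shape[0], -1, -1)
        return torch.cat([cls, x], dim=1) + self.pos_embed   # X_0 = [x_cls; x_1; ...; x_N] + E_pos
```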
A ViT model contains K sequentially stacked encoders, each consisting of a self-attention (SA) module and a feed-forward network (FFN). In the SA of the k-th encoder, the token sequence X_{k-1} is projected into a query matrix Q_k, a key matrix K_k and a value matrix V_k. The self-attention matrix A_k is computed as:

A_k = Softmax(Q_k K_k^T / sqrt(d)),

where d is the embedding dimension. The first row of A_k (averaged over the attention heads), denoted a_cls,k, is referred to as the class attention and reflects the interaction between the class token and the other tokens. The output of the SA module is sent to the FFN, which consists of two fully-connected layers, to obtain the updated token sequence X_k. After the series of SA-FFN transformations, the class token x_cls,K from the K-th encoder is fed into a classifier to predict the class of the input.
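A single SA-FFN encoder of this kind, returning the class-attention row used later for informative-region identification, can be sketched as follows; the pre-norm layout, head count and dimensions are assumptions.

```python
# A minimal sketch of one SA-FFN encoder that also exposes the class attention a_cls.
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, dim=384, heads=6):
        super().__init__()
        self.heads, self.scale = heads, (dim // heads) ** -0.5
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        self.ffn = nn.Sequential(nn.Linear(dim, dim * 4), nn.GELU(), nn.Linear(dim * 4, dim))

    def forward(self, x):                                    # x: [B, 1+N, dim]
        b, n, d = x.shape
        qkv = self.qkv(self.norm1(x)).reshape(b, n, 3, self.heads, d // self.heads)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)                 # each: [B, heads, 1+N, d/heads]
        attn = (q @ k.transpose(-2, -1) * self.scale).softmax(dim=-1)  # A = Softmax(QK^T/sqrt(d))
        cls_attn = attn[:, :, 0].mean(dim=1)                 # a_cls: class-token row, head-averaged
        x = x + self.proj((attn @ v).transpose(1, 2).reshape(b, n, d))  # SA output + residual
        x = x + self.ffn(self.norm2(x))                      # FFN with two fully-connected layers
        return x, cls_attn
```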
1. First stage: coarse-grained reasoning
CF-ViT first performs coarse-grained splitting to recognize "simple" images. At the same time, it locates informative regions for efficient reasoning when "difficult" samples are encountered. In the coarse stage, the input to the CF-ViT model v is:

X_0^c = [x_cls; x_1^c; x_2^c; ...; x_{N_c}^c] + E_pos^c,

where N_c is the number of coarse-grained patches. Assuming that v contains K encoders, after the SA-FFN transformations the output token sequence of v is:

X_K^c = v(X_0^c).
class labels
Figure BDA0004050554580000053
Is sent to the classifier>
Figure BDA0004050554580000054
To obtain a class prediction distribution p of the coarse phase c
Figure BDA0004050554580000055
Where n represents a category number. So far, the input prediction categories can be obtained as:
Figure BDA0004050554580000056
it is desirable to have a large
Figure BDA0004050554580000057
Since it is used as the prediction confidence score in the coarse inference phase due to the number of patches N c Is small and the calculation cost is very cheap. A threshold η is introduced to achieve a trade-off between performance and computation. If it is not
Figure BDA0004050554580000058
The inference will terminate, attributing the input to category j. Otherwise, fine-grained reasoning needs to be performed on the samples.
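For a batch of inputs, the confidence-based early exit amounts to splitting the batch by the top-1 coarse probability; the sketch below is illustrative and its variable names are assumptions.

```python
# A sketch of the confidence-based early-exit decision for a batch: only samples whose
# top-1 coarse probability does not exceed eta are forwarded to the fine-grained stage.
import torch

def split_by_confidence(p_c: torch.Tensor, eta: float):
    """p_c: [B, n] coarse-stage class distribution; returns (exit_idx, refine_idx)."""
    conf, _ = p_c.max(dim=-1)                             # p_c^(j), the confidence per sample
    exit_idx = (conf > eta).nonzero(as_tuple=True)[0]     # terminate: predict argmax p_c
    refine_idx = (conf <= eta).nonzero(as_tuple=True)[0]  # forward to fine-grained stage
    return exit_idx, refine_idx

# Example: with eta = 0.7, a sample with p_c = [0.05, 0.85, 0.10] exits early
# with class 1, while a sample with p_c = [0.40, 0.35, 0.25] is refined.
```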
For better fine-grained reasoning, a global class attention score is also accumulated during the coarse-grained reasoning stage. The class attention scores of the different layers are aggregated using an exponentially weighted average:

A_cls,k = β · A_cls,k-1 + (1 - β) · a_cls,k, k = 1, ..., K,

where β is a weighting factor and a_cls,k is the class attention of the k-th encoder. The aggregate at the output of the last encoder, A_cls,K, is taken as the global class attention.
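The exponentially weighted aggregation can be sketched as follows; the value of β is a hyper-parameter and the 0.9 default below is only an assumption.

```python
# A sketch of aggregating per-encoder class attention into the global class attention.
import torch

def global_class_attention(cls_attn_per_layer, beta=0.9):
    """cls_attn_per_layer: list of K tensors [B, 1+N] (class-token attention rows)."""
    global_attn = cls_attn_per_layer[0]
    for a_cls in cls_attn_per_layer[1:]:
        global_attn = beta * global_attn + (1.0 - beta) * a_cls  # A_k = beta*A_{k-1} + (1-beta)*a_k
    return global_attn   # aggregate at the last encoder = global class attention
```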
2. Second stage: fine-grained reasoning
Informative coarse-grained patches are selected based on the global class attention: a coarse-grained patch is selected if its attention score ranks in the top αN_c among the N_c coarse-grained patches, where α ∈ [0,1] denotes the ratio of informative coarse-grained patches. Each informative coarse-grained patch is then further split into 2 x 2 fine-grained patches for better performance at finer granularity. Therefore, the number of patches after fine-grained splitting is:

N_f = 4αN_c.

The patches after fine-grained splitting are encoded as:

X_0^f = [x_cls; x_1^f; x_2^f; ...; x_{N_f}^f] + E_pos^f.
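The selection of informative coarse-grained patches and their 2 x 2 re-splitting can be sketched as follows; the 7x7 coarse grid and the per-sample loop are assumptions for illustration.

```python
# A sketch of selecting the top alpha*Nc coarse patches by global class attention and
# re-splitting each selected image region into 2x2 fine-grained patches.
import torch

def select_and_resplit(image, global_attn, alpha=0.5, coarse_grid=7):
    """image: [B, 3, H, W]; global_attn: [B, 1+Nc] with the class token at index 0."""
    b, _, h, w = image.shape
    n_c = coarse_grid * coarse_grid
    scores = global_attn[:, 1:]                              # drop the class-token entry
    keep = scores.topk(int(alpha * n_c), dim=-1).indices     # indices of informative patches
    ph, pw = h // coarse_grid, w // coarse_grid              # coarse patch size
    fine_patches = []
    for bi in range(b):
        for idx in keep[bi].tolist():
            r, c = divmod(idx, coarse_grid)
            patch = image[bi, :, r * ph:(r + 1) * ph, c * pw:(c + 1) * pw]
            # split the informative coarse patch into 2 x 2 fine-grained patches
            for rr in range(2):
                for cc in range(2):
                    fine_patches.append(patch[:, rr * ph // 2:(rr + 1) * ph // 2,
                                                 cc * pw // 2:(cc + 1) * pw // 2])
    return torch.stack(fine_patches)                         # [B * 4*alpha*Nc, 3, ph/2, pw/2]
```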
In order to further improve the performance of the fine-grained reasoning stage, a feature reuse module is designed. It aligns the coarse-grained token features with the fine-grained tokens:

X_r^c = FR(X_K^c),

where FR is implemented as a linear transformation that maps each coarse-grained token feature to its corresponding fine-grained tokens. The aligned features are added directly to the fine-grained tokens, and the result is input into the model to execute fine-grained reasoning:

X_K^f = v(X_0^f + X_r^c).
class labels
Figure BDA0004050554580000063
Is sent to the same classifier>
Figure BDA0004050554580000065
In the method, a class prediction distribution p of a fine stage is obtained f
Figure BDA0004050554580000064
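A possible form of the feature reuse module consistent with the description above is sketched below; the exact alignment scheme (a single linear layer, broadcasting each informative coarse token to its four fine-grained children, and zero reuse for the class token) is an assumption.

```python
# A sketch of the feature reuse module FR: project coarse-stage token features and
# broadcast each informative coarse token onto its 2x2 fine-grained child tokens.
import torch
import torch.nn as nn

class FeatureReuse(nn.Module):
    def __init__(self, dim=384):
        super().__init__()
        self.align = nn.Linear(dim, dim)   # linear transformation aligning feature dimensions

    def forward(self, coarse_tokens, keep_idx):
        """coarse_tokens: [B, 1+Nc, D] stage-1 output; keep_idx: [B, alpha*Nc] informative indices."""
        feats = self.align(coarse_tokens[:, 1:])                      # [B, Nc, D]
        kept = torch.gather(feats, 1,
                            keep_idx.unsqueeze(-1).expand(-1, -1, feats.shape[-1]))  # [B, aNc, D]
        reused = kept.repeat_interleave(4, dim=1)                     # broadcast to 2x2 children
        cls = torch.zeros_like(coarse_tokens[:, :1])                  # no reuse for the class token
        return torch.cat([cls, reused], dim=1)                        # added to X_0^f by the caller
```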
3. Training objective function
In the training process of CF-ViT, the confidence threshold is set to η = 1, which means that the fine-grained reasoning stage is always performed on each input image. On the one hand, it is desirable that the fine-grained prediction fits the ground-truth label y well, so that the input is predicted accurately. On the other hand, it is desirable that the coarse-grained prediction has an output similar to the fine-grained prediction, so that most inputs can already be recognized well during the coarse-grained reasoning stage. Thus, the training loss of CF-ViT is as follows:

loss = CE(p_f, y) + KL(p_c, p_f),

where CE(·,·) and KL(·,·) denote the cross-entropy loss and the Kullback-Leibler divergence, respectively.
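The loss can be written directly in PyTorch as below; the direction of the KL term follows the notation KL(p_c, p_f) above, and whether a stop-gradient is applied to the fine prediction is left as an assumption.

```python
# A sketch of the training loss: cross-entropy on the fine prediction plus a KL term
# that encourages the coarse distribution to match the fine one.
import torch
import torch.nn.functional as F

def cf_vit_loss(logits_coarse, logits_fine, target):
    ce = F.cross_entropy(logits_fine, target)               # CE(p_f, y)
    kl = F.kl_div(F.log_softmax(logits_fine, dim=-1),       # log p_f
                  F.softmax(logits_coarse, dim=-1),         # p_c
                  reduction="batchmean")                    # = KL(p_c, p_f)
    return ce + kl
```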
4. Implementation details
All models are trained on the ImageNet dataset (Deng J, Dong W, Socher R, et al. ImageNet: A large-scale hierarchical image database [C]// 2009 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2009.), which contains 1000 classes, 1.2 million training images and 50,000 validation images. DeiT-S (Touvron H, Cord M, Douze M, et al. Training data-efficient image transformers & distillation through attention [C]// International Conference on Machine Learning. PMLR, 2021.) and LV-ViT-S are used as backbone networks. Data augmentations include Mixup (Zhang H, Cisse M, Dauphin Y N, et al. mixup: Beyond empirical risk minimization [J]. arXiv preprint arXiv:1710.09412, 2017.) and CutMix (Yun S, Han D, Oh S J, et al. CutMix: Regularization strategy to train strong classifiers with localizable features [C]// Proceedings of the IEEE/CVF International Conference on Computer Vision. 2019: 6023-6032.).
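The Mixup and CutMix augmentations are commonly applied through the timm library's Mixup helper, as sketched below; the alpha values, label smoothing and class count are illustrative assumptions rather than the patent's exact settings.

```python
# A sketch of the Mixup/CutMix augmentation setup using timm.
from timm.data import Mixup

mixup_fn = Mixup(mixup_alpha=0.8, cutmix_alpha=1.0,
                 label_smoothing=0.1, num_classes=1000)

# inside the training loop (images: [B, 3, 224, 224], labels: [B]):
# images, soft_labels = mixup_fn(images, labels)   # soft_labels: [B, 1000]
```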
5. Field of application
The method can be applied to ViT models for image classification to realize their compression and acceleration.
Table 1 shows the performance comparison between models obtained with the training method of the present invention and the basic backbone ViT models. It can be found that the present invention (CF-ViT) has advantages in both speed and accuracy and obtains actual inference acceleration on a GPU.
TABLE 1
[Table 1: top-1 accuracy, FLOPs and throughput of CF-ViT versus the DeiT-S and LV-ViT-S backbones; the table is provided as an image in the original publication and is not reproduced here.]
The model throughput in Table 1 is the number of images processed per second on a single A100 GPU; for an accurate estimate, each model is run for 50 iterations with a batch size of 512 and the average is taken as the actual throughput. It can be observed from Table 1 that CF-ViT balances accuracy and efficiency. CF-ViT greatly reduces model FLOPs, by 61% for DeiT-S and by 53% for LV-ViT-S, while maintaining the same accuracy as the backbone networks; accordingly, the CF-ViT of the present invention processes images far faster, yielding a throughput increase of 1.88x over DeiT-S and 2.01x over LV-ViT-S. In addition, at a larger confidence threshold, the CF-ViT of the invention not only still reduces FLOPs and improves throughput but also achieves better top-1 accuracy. In particular, when η = 1, meaning that all inputs are sent to the fine inference stage, the CF-ViT of the present invention improves the performance of DeiT-S by 1.0% and of LV-ViT-S by 0.3%. These results demonstrate that the CF-ViT of the present invention maintains a good trade-off between model performance and model efficiency.
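The throughput measurement described above can be reproduced with a sketch of the following form; the warm-up iterations, input resolution and use of torch.cuda.synchronize are assumptions.

```python
# A sketch of throughput measurement: images per second, averaged over 50 iterations
# with batch size 512 on a single GPU.
import time
import torch

@torch.no_grad()
def measure_throughput(model, batch_size=512, iters=50, device="cuda"):
    model.eval().to(device)
    x = torch.randn(batch_size, 3, 224, 224, device=device)
    for _ in range(5):                       # warm-up iterations
        model(x)
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(iters):
        model(x)
    torch.cuda.synchronize()
    return batch_size * iters / (time.time() - start)   # images per second
```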
TABLE 2
[Table 2: top-1 accuracy and FLOPs of CF-ViT compared with recent ViT compression methods (IA-RED², Evo-ViT, EViT, PS-ViT) on DeiT-S and LV-ViT-S; the table is provided as an image in the original publication and is not reproduced here.]
Table 2 compares the present invention with recent compression methods, including IA-RED² (Pan B, Panda R, Jiang Y, et al. IA-RED²: Interpretability-Aware Redundancy Reduction for Vision Transformers [J]. Advances in Neural Information Processing Systems, 2021, 34.), Evo-ViT (Xu Y, Zhang Z, Zhang M, et al. Evo-ViT: Slow-fast token evolution for dynamic vision transformer [J]. arXiv preprint arXiv:2108.01390, 2021.), EViT (Liang Y, Ge C, Tong Z, et al. Not all patches are what you need: Expediting vision transformers via token reorganizations [J]. arXiv preprint arXiv:2202.07800, 2022.) and PS-ViT (Tang Y, Han K, Wang Y, et al. Patch slimming for efficient vision transformers [J]. arXiv preprint arXiv:2106.02852, 2021.). Table 2 shows that, compared with existing ViT compression methods, the CF-ViT of the present invention is superior in both accuracy and FLOPs reduction. For example, CF-ViT reduces the FLOPs of DeiT-S to 1.8G without any loss of accuracy, whereas the recent Evo-ViT reaches only 79.4% accuracy with a FLOPs burden as heavy as 3.0G. Similar results are observed when LV-ViT-S is used as the backbone.

Claims (6)

1. A compression method of a visual self-attention model based on multi-granularity reasoning, characterized by comprising the following steps:
1) splitting the image into coarse-grained patches, encoding the coarse-grained patches into tokens, inputting the tokens into a model for first-stage reasoning, and obtaining a coarse-grained reasoning result;
2) calculating the confidence of the coarse-grained reasoning result, stopping reasoning if the confidence exceeds a threshold, and otherwise performing second-stage reasoning;
3) selecting important coarse-grained patches according to the global class attention for further fine-grained splitting, and obtaining token encodings of the fine-grained patches;
4) adding the features extracted by the coarse-grained reasoning to the fine-grained token encodings through a linear transformation, and inputting the result into the model for second-stage reasoning to obtain a fine-grained reasoning result;
5) during training, supervising the coarse-grained reasoning result with the fine-grained reasoning result, and supervising the fine-grained reasoning result with the ground-truth label.
2. The compression method of a visual self-attention model based on multi-granularity reasoning according to claim 1, characterized in that in step 1), the splitting of the image into coarse-grained patches and the encoding of the coarse-grained patches into tokens are specifically:

X_0^c = [x_cls; x_1^c; x_2^c; ...; x_{N_c}^c] + E_pos^c;

the tokens corresponding to the coarse-grained patches are input into the model v to obtain the coarse-grained encoded features:

X_K^c = v(X_0^c);

and the coarse-grained reasoning result is obtained from the coarse-grained encoded features:

p_c = Softmax(FC(x_cls,K^c)).
3. The compression method of a visual self-attention model based on multi-granularity reasoning according to claim 1, characterized in that in step 2), the confidence of the coarse-grained reasoning result is calculated as:

j = argmax_i p_c^(i), with p_c^(j) taken as the confidence;

a threshold η is set; if p_c^(j) > η, reasoning stops, otherwise second-stage reasoning is performed.
4. The compression method of a visual self-attention model based on multi-granularity reasoning according to claim 1, characterized in that in step 3), the global class attention is:

A_cls,k = β · A_cls,k-1 + (1 - β) · a_cls,k, k = 1, ..., K, with A_cls,K taken as the global class attention;

and the token encoding of the fine-grained patches is:

X_0^f = [x_cls; x_1^f; x_2^f; ...; x_{N_f}^f] + E_pos^f.
5. The compression method of a visual self-attention model based on multi-granularity reasoning according to claim 1, characterized in that in step 4), the model input of the second-stage reasoning is:

X_r^c = FR(X_K^c),

X_K^f = v(X_0^f + X_r^c);

and the fine-grained reasoning result is:

p_f = Softmax(FC(x_cls,K^f)).
6. The compression method of a visual self-attention model based on multi-granularity reasoning according to claim 1, characterized in that in step 5), the training loss is:

loss = CE(p_f, y) + KL(p_c, p_f).
CN202310039838.0A 2023-01-12 2023-01-12 Compression method of visual self-attention model based on multi-granularity reasoning Pending CN115983322A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310039838.0A CN115983322A (en) 2023-01-12 2023-01-12 Compression method of visual self-attention model based on multi-granularity reasoning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310039838.0A CN115983322A (en) 2023-01-12 2023-01-12 Compression method of visual self-attention model based on multi-granularity reasoning

Publications (1)

Publication Number Publication Date
CN115983322A true CN115983322A (en) 2023-04-18

Family

ID=85959576

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310039838.0A Pending CN115983322A (en) 2023-01-12 2023-01-12 Compression method of visual self-attention model based on multi-granularity reasoning

Country Status (1)

Country Link
CN (1) CN115983322A (en)


Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116884003A (en) * 2023-07-18 2023-10-13 南京领行科技股份有限公司 Picture automatic labeling method and device, electronic equipment and storage medium
CN116884003B (en) * 2023-07-18 2024-03-22 南京领行科技股份有限公司 Picture automatic labeling method and device, electronic equipment and storage medium
CN117689044A (en) * 2024-02-01 2024-03-12 厦门大学 Quantification method suitable for vision self-attention model
CN117994587A (en) * 2024-02-26 2024-05-07 昆明理工大学 Pathological image classification method based on deep learning two-stage reasoning network


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination