CN115310607A - Vision Transformer model pruning method based on attention map - Google Patents
Vision Transformer model pruning method based on attention map
- Publication number
- CN115310607A (application number CN202211239440.3A)
- Authority
- CN
- China
- Prior art keywords
- attention
- model
- vit
- head
- visual
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/082—Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- General Health & Medical Sciences (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Evolutionary Computation (AREA)
- Artificial Intelligence (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Health & Medical Sciences (AREA)
- Image Processing (AREA)
Abstract
The invention discloses an attention-map-based visual Transformer (ViT) model pruning method applied to a machine vision reasoning system, comprising the following steps: in the machine vision reasoning system, perform several rounds of initial training on a ViT model over the training data to generate complete attention maps; calculate the information entropy of each attention map and prune attention heads according to the calculated entropy; remove the weight parameters related to the pruned attention heads to obtain a new ViT model; and fine-tune the parameters of the new ViT model. By pruning the multi-head attention module and deleting high-uncertainty attention maps together with their corresponding attention heads, the computational complexity, parameter count, and size of the ViT model are reduced, finally achieving a lightweight ViT model with limited performance loss.
Description
Technical Field
The invention belongs to the technical field of neural network lightweighting, and particularly relates to an attention-map-based visual Transformer model pruning method.
Background
The Transformer is a deep neural network based mainly on the self-attention mechanism and was first applied in the field of natural language processing; the visual Transformer model is abbreviated as the ViT model. With its strong ability to model long-range dependencies, the Transformer has achieved remarkable success in various visual tasks. However, the huge computation and memory consumption of Transformer models are inherent problems that prevent them from being deployed on resource-limited edge computing devices. Pruning is a common method for effectively reducing the inference cost of neural networks and is widely used in computer vision and natural language processing applications.
An attention-map-based model pruning method makes it possible to deploy neural network models in embedded machine vision reasoning systems with low power consumption and limited computing resources. Such a system comprises an embedded computing board accelerated by a graphics processor and a neural network processor, and can generally provide less than 20% of the computing resources of a high-performance GPU.
Pruning operations can generally be divided into two categories: unstructured pruning and structured pruning. Unstructured pruning deletes individual unimportant weights under a specific criterion; it is a fine-grained paradigm that causes little damage to accuracy but requires special hardware design for actual acceleration. Structured pruning removes whole substructures of a model, such as channels and attention heads. Some work has pruned ViT by reducing the number of image coding blocks: Tang et al. developed a top-down image block pruning method that removes redundant image blocks based on the reconstruction errors of a pre-trained model, and Xu et al. exploited the full spatial structure through image coding block selection with a structure-preserving slow-fast combined update strategy. Although these methods save computation, they cannot reduce inference complexity or model size; the attention-map-based visual Transformer model pruning method is therefore proposed.
Disclosure of Invention
The invention aims to provide an attention-map-based visual Transformer model pruning method to solve the problems raised in the background art.
In order to achieve this purpose, the invention provides the following technical scheme: an attention-map-based visual Transformer model pruning method, applied to a machine vision reasoning system, comprising the following steps:
step A, in the machine vision reasoning system, performing several rounds of initial training on a ViT model over the training data to generate complete attention maps;
step B, calculating the information entropy of each attention map, which measures its uncertainty, and pruning attention heads according to the calculated entropy;
step C, removing all weight parameters related to the pruned attention heads to obtain a new ViT model;
step D, fine-tuning the parameters of the new ViT model.
Preferably, in step A, the ViT model splits the input image into N image blocks, attaches a class code to each image block, and then feeds the N image blocks with their attached class codes into an encoder similar to a common Transformer, forming N image coding blocks.
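The patch-splitting step above can be sketched as follows; the function name and NumPy layout are illustrative assumptions, and the class-code attachment and linear projection are omitted:

```python
import numpy as np

def split_into_patches(image, patch):
    """Split an H x W x C image into N = (H/patch) * (W/patch) flattened
    image blocks, as in the ViT input step described above. Illustrative
    sketch only; class codes and the patch embedding are not included."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0
    gh, gw = h // patch, w // patch
    blocks = image.reshape(gh, patch, gw, patch, c)
    blocks = blocks.transpose(0, 2, 1, 3, 4).reshape(gh * gw, patch * patch * c)
    return blocks

img = np.arange(224 * 224 * 3, dtype=np.float32).reshape(224, 224, 3)
patches = split_into_patches(img, 16)
print(patches.shape)  # (196, 768): N = 14 * 14 image blocks
```

With a 224x224 RGB image and 16x16 patches this yields the familiar N = 196 blocks of a ViT-Base-style input.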
Preferably, step A includes the following stages:
A1, in the initial stage of ViT model training, the model has not yet learned useful information; at this point the attention maps are disordered and have large information entropy;
A2, after several rounds of initial training, the ViT model has learned basic information and the attention maps begin to exhibit a certain pattern;
A3, in the final stage of training, when the ViT model converges, each attention head yields an attention map in which important image coding blocks receive high attention, so the information entropy decreases; all attention maps are averaged over one training round.
Preferably, in step B, after the ViT model has undergone several rounds of initial training, an attention head that has learned more useful information focuses on particular image coding blocks, so its information entropy decreases and its attention map becomes more deterministic; an attention head that has learned less useful information attends almost uniformly to all blocks, so its information entropy increases, producing large uncertainty. The information entropy is therefore used to measure the uncertainty of an attention map.
Preferably, in step B, for a Transformer block, the multi-head self-attention (MSA) and the multi-layer perceptron (MLP) are the main consumers of computing resources.

Let $X^L \in \mathbb{R}^{N \times D}$ denote the input of the L-th layer. The attention computation of attention head h is given by equation (1):

$$Z_h^L = A_h^L V_h^L, \qquad A_h^L = \mathrm{Softmax}\!\left(\frac{Q_h^L (K_h^L)^{\top}}{\sqrt{d}}\right) \tag{1}$$

where Q, K and V denote the "query", "key" and "value" in the multi-head attention mechanism, respectively; for the h-th attention head module in the L-th layer, which participates in generating the attention map $A_h^L$, the computed "query", "key" and "value" are $Q_h^L = X^L W_{Q,h}^L$, $K_h^L = X^L W_{K,h}^L$ and $V_h^L = X^L W_{V,h}^L$, respectively;

d represents the attention head embedding dimension;

N represents the number of image blocks input into the ViT model;

T represents a visual Transformer network with H attention heads.

The computation of the multi-head self-attention MSA is given by equation (2):

$$\mathrm{MSA}(X^L) = \mathrm{Concat}\!\left(Z_1^L, \ldots, Z_H^L\right) W_O^L \tag{2}$$

where H denotes the number of attention heads and $W_O^L$ is the output projection matrix.
Preferably, the computational complexity implied by the parameters in equations (1) and (2) is given by equation (3):

$$C = 4NDHd + 2N^2Hd \tag{3}$$

where C represents the parameter computation complexity, $4NDHd$ is the sum of the projection computations (the query, key, value and output projections), and $2N^2Hd$ is the cost of the attention matrix products. Meanwhile, the parameter quantity is given by equation (4):

$$P = 4DHd \tag{4}$$

where P represents the number of parameters and D represents the embedding dimension; D = Hd when the ViT model has not been pruned.
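A small sketch of equations (3) and (4), assuming the reconstructed forms C = 4NDHd + 2N²Hd and P = 4DHd (biases and the MLP are deliberately ignored):

```python
def msa_cost(N, D, H, d):
    """Computation and parameter counts of one multi-head self-attention
    block, per the reconstructed equations (3) and (4). The 4NDHd term is
    the sum of the projection computations; 2N^2Hd covers the attention
    matrix products. Biases and the MLP are ignored in this sketch."""
    C = 4 * N * D * H * d + 2 * N * N * H * d  # equation (3)
    P = 4 * D * H * d                          # equation (4)
    return C, P

# ViT-Base-like setting: N = 197 blocks, H = 12 heads, d = 64, D = Hd = 768
C, P = msa_cost(197, 768, 12, 64)
print(C, P)
```

The example numbers are a common ViT-Base configuration, used here only to give the formulas a concrete scale.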
Preferably, when the input of the visual Transformer is a long sequence, the $2N^2Hd$ term dominates and the computational complexity of self-attention is $O(N^2Hd)$; when the sequence length is not long enough to dominate the complexity of the multi-head attention modules, the projection term dominates and the computational complexity of self-attention is $O(NDHd)$.
Preferably, after the ViT model is pruned, with the number of attention heads pruned from H to $H'$, the complexity after pruning is given by equation (5):

$$C' = 4NDH'd + 2N^2H'd \tag{5}$$

Meanwhile, the parameter quantity is given by equation (6):

$$P' = 4DH'd \tag{6}$$
Preferably, in step B, let $A_h^L$ denote the attention map of attention head h in the L-th layer; the information entropy of the attention map is given by equation (7):

$$E(A_h^L) = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{N} A_{h,ij}^L \log A_{h,ij}^L \tag{7}$$

where, because a Softmax operation is performed in the attention computation for the i-th query image block, the i-th row $A_{h,i}^L$ represents the probability distribution of the key image blocks with respect to the i-th query coding block.
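The entropy of equation (7) can be sketched numerically: a uniform attention map reaches the maximum entropy log N, while a near one-hot map approaches zero. The eps guard and the averaging over rows are assumptions of this sketch:

```python
import numpy as np

def attention_entropy(A, eps=1e-12):
    """Information entropy of an N x N attention map A per equation (7):
    E = -(1/N) * sum_i sum_j A[i, j] * log(A[i, j]),
    where row i is the Softmax distribution over key blocks for query i.
    eps guards against log(0) on exactly-zero attention weights."""
    return float(-(A * np.log(A + eps)).sum() / A.shape[0])

N = 8
uniform = np.full((N, N), 1.0 / N)             # maximally uncertain head
peaked = np.eye(N) * (1 - 1e-9) + 1e-9 / N     # near one-hot rows: certain head
print(attention_entropy(uniform))  # ~log(8) = 2.079...
print(attention_entropy(peaked))   # ~0
```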
Compared with the prior art, the invention has the beneficial effects that:
1. Analysis of the visual Transformer model structure shows that the multi-head attention modules occupy a large share of the computing resources. By pruning the multi-head attention module and deleting high-uncertainty feature maps together with their corresponding attention heads, the computational complexity, parameter count, and size of the ViT model are reduced without greatly affecting its accuracy, finally achieving a lightweight ViT model with limited performance loss;
2. compared with traditional pruning methods, the invention guides the pruning of attention heads with the attention map rather than the traditional Taylor criterion, providing a new approach to pruning decisions;
3. the importance of each attention head is measured through the information entropy, and a pruning decision is guided.
Drawings
FIG. 1 is a schematic view of the attention head pruning process of the present invention;
FIG. 2 is a schematic flow chart of the method of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1-2, the attention-map-based visual Transformer model pruning method provided by the present invention is applied to a machine vision reasoning system and includes the following steps:
step A, in the machine vision reasoning system, performing several rounds of initial training on a ViT model over the training data to generate complete attention maps;
in step A, the ViT model splits the input image into N image blocks, attaches a class code to each image block, and then feeds the N image blocks with their attached class codes into the encoder, forming N image coding blocks;
step A includes the following stages:
A1, in the initial stage of ViT model training, the model has not yet learned useful information; at this point the attention maps are disordered and have large information entropy;
A2, after several rounds of initial training, the ViT model has learned basic information and the attention maps begin to exhibit a certain pattern;
A3, in the final stage of training, when the ViT model converges, each attention head yields an attention map in which important image coding blocks receive high attention, so the information entropy decreases; all attention maps are averaged over one training round;
step B, calculating the information entropy of each attention map, which measures its uncertainty, and pruning attention heads according to the calculated entropy;
in step B, after the ViT model has undergone several rounds of initial training, an attention head that has learned more useful information focuses on particular image coding blocks, so its information entropy decreases and its attention map becomes more deterministic; an attention head that has learned less useful information attends almost uniformly to all blocks, so its information entropy increases, producing large uncertainty; in this process the information entropy is used to measure the uncertainty of an attention map;
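The entropy-driven pruning decision of step B can be sketched as follows; the 50% prune ratio is an illustrative assumption of this sketch, not a value fixed by the method:

```python
import numpy as np

def select_heads_to_prune(head_entropies, prune_ratio=0.5):
    """Rank attention heads by the information entropy of their attention
    maps and return the indices of the most uncertain (highest-entropy)
    heads to prune. prune_ratio is an illustrative assumption."""
    order = np.argsort(head_entropies)[::-1]   # highest entropy first
    k = int(len(head_entropies) * prune_ratio)
    return sorted(order[:k].tolist())

entropies = [2.1, 0.4, 1.9, 0.3, 2.0, 0.5]     # toy per-head entropies
print(select_heads_to_prune(entropies))        # [0, 2, 4]
```

Heads 0, 2 and 4 have the highest entropy (most uniform attention) and are selected for removal, while the sharply focused heads 1, 3 and 5 are kept.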
in step B, for the Transformer block, the multi-head self-attention (MSA) and the multi-layer perceptron (MLP) are the main consumers of computing resources;

let $X^L \in \mathbb{R}^{N \times D}$ denote the input of the L-th layer; the attention computation of attention head h is given by equation (1):

$$Z_h^L = A_h^L V_h^L, \qquad A_h^L = \mathrm{Softmax}\!\left(\frac{Q_h^L (K_h^L)^{\top}}{\sqrt{d}}\right) \tag{1}$$

where Q, K and V denote the "query", "key" and "value" in the multi-head attention mechanism, respectively; for the h-th attention head module in the L-th layer, which participates in generating the attention map $A_h^L$, the computed "query", "key" and "value" are $Q_h^L = X^L W_{Q,h}^L$, $K_h^L = X^L W_{K,h}^L$ and $V_h^L = X^L W_{V,h}^L$, respectively;

d represents the attention head embedding dimension;

N represents the number of image blocks input into the ViT model;

T represents a visual Transformer network with H attention heads;

the computation of the multi-head self-attention MSA is given by equation (2):

$$\mathrm{MSA}(X^L) = \mathrm{Concat}\!\left(Z_1^L, \ldots, Z_H^L\right) W_O^L \tag{2}$$

where H represents the number of attention heads and $W_O^L$ is the output projection matrix;
the computational complexity implied by the parameters in equations (1) and (2) is given by equation (3):

$$C = 4NDHd + 2N^2Hd \tag{3}$$

where C represents the parameter computation complexity, $4NDHd$ is the sum of the projection computations, and $2N^2Hd$ is the cost of the attention matrix products; meanwhile, the parameter quantity is given by equation (4):

$$P = 4DHd \tag{4}$$

where P represents the number of parameters and D represents the embedding dimension; D = Hd when the ViT model has not been pruned.
When the input of the visual Transformer is a long sequence, the $2N^2Hd$ term dominates and the computational complexity of self-attention is $O(N^2Hd)$; when the sequence length is not long enough to dominate the complexity of the multi-head attention modules, the projection term dominates and the computational complexity of self-attention is $O(NDHd)$.
In this embodiment, as shown in FIG. 1, after the ViT model is pruned, the number of attention heads is pruned toThen the complexity after pruning is shown in equation (5):
the simultaneous parameter quantity is shown in formula (6):
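Since equations (5) and (6) are linear in the head count, the relative saving from pruning H down to H' heads is simply 1 − H'/H; a quick check under the reconstructed formulas:

```python
def pruning_savings(N, D, H, d, H_pruned):
    """Relative reduction of MSA computation after pruning heads from H to
    H', using the reconstructed equations (3) and (5). Both complexity
    terms are linear in the head count, so the saving equals 1 - H'/H."""
    C  = 4 * N * D * H * d + 2 * N * N * H * d                # equation (3)
    Cp = 4 * N * D * H_pruned * d + 2 * N * N * H_pruned * d  # equation (5)
    return 1 - Cp / C

print(pruning_savings(197, 768, 12, 64, 6))  # 0.5: half the heads, half the cost
```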
in step B, let $A_h^L$ denote the attention map of attention head h in the L-th layer; the information entropy of the attention map is given by equation (7):

$$E(A_h^L) = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{N} A_{h,ij}^L \log A_{h,ij}^L \tag{7}$$

where, because a Softmax operation is performed in the attention computation for the i-th query image block, the i-th row $A_{h,i}^L$ represents the probability distribution of the key image blocks with respect to the i-th query coding block;
step C, removing all weight parameters related to the pruned attention heads to obtain a new ViT model;
step D, fine-tuning the parameters of the new ViT model;
the ViT model pruning method is shown in table 1 below:
table 1:
The attention-map-based ViT model pruning method can be used to deploy neural network models in embedded machine vision reasoning systems with low power consumption and limited computing resources. Such a system comprises an embedded computing board accelerated by a graphics processor and a neural network processor, and can only provide less than 20% of the computing resources of a high-performance GPU; the limited storage and computing resources make it very difficult to deploy a visual Transformer model on it. After the pruning task is completed, the requirements on storage, data bandwidth and computing resources fall within the computing capacity of the embedded machine vision reasoning system, so edge deployment of the visual Transformer model can be achieved smoothly;
The method treats the features along the key dimension in the multi-head self-attention module of the visual Transformer model as probability distributions and computes their information entropy to expose the uncertainty of attention; it then deletes the feature maps with high uncertainty and the corresponding attention heads, reducing the computational complexity and parameter count of the ViT model and finally achieving a lightweight ViT model with limited loss of performance.
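A toy end-to-end sketch of steps A–D on simulated attention maps; the shapes, random seed, and keep-half rule are all illustrative assumptions (real use would train an actual ViT, remove the pruned heads' weights, and fine-tune):

```python
import numpy as np

rng = np.random.default_rng(0)
L_layers, H, N = 2, 4, 8          # toy sizes: layers, heads, image blocks

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def entropy(A):
    # Equation (7): row-averaged entropy of an N x N attention map
    return float(-(A * np.log(A + 1e-12)).sum() / A.shape[0])

# Step A stand-in: "averaged" attention maps after a few warm-up epochs;
# head 0 of every layer is made sharply peaked (a "certain" head).
logits = rng.normal(size=(L_layers, H, N, N))
logits[:, 0] *= 8.0
maps = softmax(logits)

# Steps B-C: score each head by entropy, keep the lower-entropy half per
# layer (step D, fine-tuning the pruned model, is omitted in this sketch).
keep = []
for layer in maps:
    ents = [entropy(A) for A in layer]
    keep.append(sorted(np.argsort(ents)[: H // 2].tolist()))
print(keep)  # the sharply peaked head 0 survives in every layer
```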
Although embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that changes, modifications, substitutions and alterations can be made in these embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.
Claims (9)
1. An attention-map-based visual Transformer model pruning method, applied to a machine vision reasoning system, characterized by comprising the following steps:
step A, in the machine vision reasoning system, performing several rounds of initial training on a ViT model over the training data to generate complete attention maps;
step B, calculating the information entropy of each attention map, which measures its uncertainty, and pruning attention heads according to the calculated entropy;
step C, removing all weight parameters related to the pruned attention heads to obtain a new ViT model;
step D, fine-tuning the parameters of the new ViT model.
2. The attention-map-based visual Transformer model pruning method according to claim 1, characterized in that: in step A, the ViT model splits the input image into N image blocks, attaches a class code to each image block, and then feeds the N image blocks with their attached class codes into an encoder similar to a common Transformer, forming N image coding blocks.
3. The attention-map-based visual Transformer model pruning method according to claim 2, characterized in that step A includes the following stages:
A1, in the initial stage of ViT model training, the model has not yet learned useful information; at this point the attention maps are disordered and have large information entropy;
A2, after several rounds of initial training, the ViT model has learned basic information and the attention maps begin to exhibit a certain pattern;
A3, in the final stage of training, when the ViT model converges, each attention head yields an attention map in which important image coding blocks receive high attention, so the information entropy decreases; all attention maps are averaged over one training round.
4. The attention-map-based visual Transformer model pruning method according to claim 1, characterized in that: in step B, after the ViT model has undergone several rounds of initial training, an attention head that has learned more useful information focuses on particular image coding blocks, so its information entropy decreases and its attention map becomes more deterministic; an attention head that has learned less useful information attends almost uniformly to all blocks, so its information entropy increases, producing large uncertainty; the information entropy is used to measure the uncertainty of an attention map.
5. The attention-map-based visual Transformer model pruning method according to claim 4, characterized in that: in step B, for the Transformer block, the multi-head self-attention (MSA) and the multi-layer perceptron (MLP) are the main consumers of computing resources;

let $X^L \in \mathbb{R}^{N \times D}$ denote the input of the L-th layer; the attention computation of attention head h is given by equation (1):

$$Z_h^L = A_h^L V_h^L, \qquad A_h^L = \mathrm{Softmax}\!\left(\frac{Q_h^L (K_h^L)^{\top}}{\sqrt{d}}\right) \tag{1}$$

where Q, K and V denote the "query", "key" and "value" in the multi-head attention mechanism, respectively; for the h-th attention head module in the L-th layer, which participates in generating the attention map $A_h^L$, the computed "query", "key" and "value" are $Q_h^L = X^L W_{Q,h}^L$, $K_h^L = X^L W_{K,h}^L$ and $V_h^L = X^L W_{V,h}^L$, respectively;

d represents the attention head embedding dimension;

N represents the number of image blocks input into the ViT model;

T represents a visual Transformer network with H attention heads;

the computation of the multi-head self-attention MSA is given by equation (2):

$$\mathrm{MSA}(X^L) = \mathrm{Concat}\!\left(Z_1^L, \ldots, Z_H^L\right) W_O^L \tag{2}$$

where H denotes the number of attention heads and $W_O^L$ is the output projection matrix.
6. The attention-map-based visual Transformer model pruning method according to claim 5, characterized in that: the computational complexity implied by the parameters in equations (1) and (2) is given by equation (3):

$$C = 4NDHd + 2N^2Hd \tag{3}$$

where C represents the parameter computation complexity, $4NDHd$ is the sum of the projection computations, and $2N^2Hd$ is the cost of the attention matrix products; meanwhile, the parameter quantity is given by equation (4):

$$P = 4DHd \tag{4}$$

where P represents the number of parameters and D represents the embedding dimension, with D = Hd when the ViT model has not been pruned.
7. The attention-map-based visual Transformer model pruning method according to claim 6, characterized in that: when the input sequence of the visual Transformer is a long sequence, the computational complexity of self-attention is $O(N^2Hd)$.
8. The attention-map-based visual Transformer model pruning method according to claim 7, characterized in that: after the ViT model is pruned, with the number of attention heads pruned from H to $H'$, the complexity after pruning is given by equation (5):

$$C' = 4NDH'd + 2N^2H'd \tag{5}$$

meanwhile, the parameter quantity is given by equation (6):

$$P' = 4DH'd \tag{6}$$
9. The attention-map-based visual Transformer model pruning method according to claim 8, characterized in that: in step B, let $A_h^L$ denote the attention map of attention head h in the L-th layer; the information entropy of the attention map is given by equation (7):

$$E(A_h^L) = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{N} A_{h,ij}^L \log A_{h,ij}^L \tag{7}$$
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211239440.3A CN115310607A (en) | 2022-10-11 | 2022-10-11 | Vision transform model pruning method based on attention diagram |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211239440.3A CN115310607A (en) | 2022-10-11 | 2022-10-11 | Vision transform model pruning method based on attention diagram |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115310607A true CN115310607A (en) | 2022-11-08 |
Family
ID=83868361
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211239440.3A Pending CN115310607A (en) | 2022-10-11 | 2022-10-11 | Vision transform model pruning method based on attention diagram |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115310607A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117689044A (en) * | 2024-02-01 | 2024-03-12 | 厦门大学 | Quantification method suitable for vision self-attention model |
- 2022-10-11: application CN202211239440.3A filed in China; published as CN115310607A, status Pending
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111242282B (en) | Deep learning model training acceleration method based on end edge cloud cooperation | |
Tang et al. | Patch slimming for efficient vision transformers | |
WO2021004366A1 (en) | Neural network accelerator based on structured pruning and low-bit quantization, and method | |
CN111242180B (en) | Image identification method and system based on lightweight convolutional neural network | |
CN113595993B (en) | Vehicle-mounted sensing equipment joint learning method for model structure optimization under edge calculation | |
CN115310607A (en) | Vision transform model pruning method based on attention diagram | |
CN114580636A (en) | Neural network lightweight deployment method based on three-target joint optimization | |
CN113722980A (en) | Ocean wave height prediction method, system, computer equipment, storage medium and terminal | |
CN112036564A (en) | Pruning method, device and equipment of neural network and storage medium | |
CN108268950A (en) | Iterative neural network quantization method and system based on vector quantization | |
CN116977763A (en) | Model training method, device, computer readable storage medium and computer equipment | |
CN114861907A (en) | Data calculation method, device, storage medium and equipment | |
CN112132219A (en) | General deployment scheme of deep learning detection model based on mobile terminal | |
CN116820762A (en) | Bian Yun cooperative computing method based on power edge chip | |
CN115049786B (en) | Task-oriented point cloud data downsampling method and system | |
CN114492847B (en) | Efficient personalized federal learning system and method | |
CN114550277A (en) | Lightweight face recognition method and system | |
CN114372565A (en) | Target detection network compression method for edge device | |
CN117808083B (en) | Distributed training communication method, device, system, equipment and storage medium | |
CN113298248B (en) | Processing method and device for neural network model and electronic equipment | |
Xu et al. | LPViT: Low-Power Semi-structured Pruning for Vision Transformers | |
CN117892219A (en) | Photovoltaic output day-ahead prediction method and device based on classification learning | |
CN118194929A (en) | Self-adaptive optimization method and device of artificial intelligent model, electronic equipment and product | |
CN115687930A (en) | Edge calculation model training method based on automatic machine learning | |
Li et al. | Implementation and Applications of Neural Networks Based on FPGA |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||