CN116051948B - Fine granularity image recognition method based on attention interaction and anti-facts attention - Google Patents
- Publication number: CN116051948B
- Application number: CN202310212744.9A
- Authority
- CN
- China
- Prior art keywords: attention, feature, map, channel, matrix
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/80—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
- G06V10/806—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/774—Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Abstract
The invention belongs to the technical field of image processing and discloses a fine-grained image recognition method based on attention interaction and counterfactual attention. After image features are extracted, the spatial distribution of each part of the object is learned through a spatial attention mechanism; complementary features are captured through a self-channel feature interaction fusion module and fused with the key features to improve recognition performance; a counterfactual region is located through an enhanced counterfactual attention mechanism module, the difference between the prediction results of the key discriminative region and of the counterfactual region is taken, and this difference serves as a strong attention supervision signal, improving the network's ability to learn effective attention. The method provided by the invention effectively improves the recognition accuracy of fine-grained images.
Description
Technical Field
The invention belongs to the technical field of image processing, relates to deep learning and fine-grained image recognition technology, and particularly relates to a fine-grained image recognition method based on attention interaction and counterfactual attention.
Background
Fine-grained image recognition, also referred to as sub-category image recognition, differs from traditional image recognition in that it aims to distinguish between different sub-categories belonging to one category. The sub-categories are often highly similar, and because of interfering factors such as pose, illumination, occlusion and background, fine-grained images share similar appearance and shape, exhibiting small inter-class differences and large intra-class differences. Given the high accuracy demanded of image recognition in practice, fine-grained image recognition has become an important research direction in computer vision.
Early fine-grained image recognition methods addressed this problem with human-annotated bounding boxes/region annotations (e.g., bird head, body) for region-based feature representation. However, the labeling process requires specialized knowledge and a large amount of annotation time, so such strongly supervised approaches are not optimal for practical fine-grained image recognition tasks. To address this, research has focused on weakly supervised methods that use only class labels and learn discriminative features by localizing different parts. Current research methods for fine-grained image recognition focus on enlarging and cropping locally distinguishable regions. Specifically, an attention-mechanism branch network is added to the feature extraction network to learn attention weights; after the feature extraction network extracts features from the input image, the feature map is fed into the attention branch to obtain an attention feature map, which is fused with the original feature map to strengthen the key features; the key regions are then enlarged and cropped, enhancing the fine-grained features most useful to the recognition task.
This common approach of enlarging and cropping critical regions with an attention mechanism, while achieving some results, still has several key issues. Specifically, existing fine-grained image recognition methods mainly attach weights to the features of different channels through an attention mechanism, strengthening the most discriminative channels to locate key regions while ignoring the complementarity among channels. Moreover, the attention mechanism module is supervised only by the loss function; it lacks a strong supervision signal to guide the learning process and ignores the causal relationship between the prediction result and the attention.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a fine-grained image recognition method based on attention interaction and counterfactual attention, which optimizes the attention mechanism by maximizing the difference between counterfactual attention and factual attention, and makes effective joint use of discriminative features and complementary information in recognition to improve accuracy. Specifically: (1) Since prior methods ignore finer complementary information and do not make joint use of discriminative features and complementary information, a self-channel feature interaction fusion module is proposed. The module models the interaction between different channels of the image and can capture, for each channel, the features complementary to it; the complementary features are then fused with the key features to obtain fusion features. A ranking loss function is further introduced so that the key features and the fusion features jointly participate in recognition, improving accuracy. (2) Since the attention mechanism lacks a strong supervision signal to guide the learning process and ignores the causal relationship between the prediction result and the attention, the invention designs an enhanced counterfactual attention mechanism module, which quantifies the quality of the attention by comparing the influence of facts (the learned attention) and counterfactuals (irrelevant attention) on the final prediction result. Maximizing this difference drives the network to learn more effective attention, reduces one-sided influence of the training set, and improves recognition accuracy.
In order to solve the technical problems, the invention adopts the following technical scheme:
the fine-grained image recognition method based on attention interaction and counterfactual attention comprises the following steps:
step 1: feature extraction:
inputting the image I into a feature extraction network to obtain a feature map F ∈ R^(C×H×W), where C, H, W are the number of channels, height and width of the feature map, respectively.
Step 2: the spatial distribution of each part of the object is learned through a spatial attention mechanism:
the feature map F obtained in step 1 is passed through a spatial attention mechanism to learn the spatial distribution of each part of the object, expressed as A ∈ R^(M×H×W), where M is the number of attention maps; the attention map A is calculated as: A = {A_1, A_2, ..., A_M} = S(F);
where A_i ∈ R^(H×W) is the attention map covering one local area, and S(·) denotes the spatial attention mechanism, consisting of a convolution layer and a ReLU activation function.
Step 3: capturing complementary features through a self-channel feature interaction fusion module and fusing the complementary features with key features:
inputting the attention map A obtained in the step 2 into a self-channel feature interaction fusion module, extracting complementary features by exploring channel correlation in the image, and fusing the complementary features with key features; the specific method comprises the following steps:
the attention map A is first compressed into a feature matrix A′:
A′ = reshape(A, (M, l));
where reshape denotes converting the data into the specified number of rows and columns without changing it, i.e. A′ ∈ R^(M×l), l = H×W.
A bilinear operation is then performed on A′ and its transpose A′^T to obtain the bilinear matrix A′A′^T; negating this matrix and applying the softmax function yields the weight matrix W = softmax(-A′A′^T) ∈ R^(M×M);
where A′^T denotes the transpose of A′ and (A′A′^T)_ij represents the spatial relationship between channel i and channel j.
Multiplying the weight matrix W by the feature matrix A′ yields the feature matrix A_com ∈ R^(M×l) containing the complementary features:
A_com = W × A′.
The feature matrix A_com is converted into an attention map φ(A_com) ∈ R^(M×H×W) containing the complementary features and fused with the attention map A to obtain A_fuse ∈ R^(M×H×W):
A_fuse = φ(A_com) + A;
where the fusion attention map A_fuse contains both the key features and the complementary features.
Step 4: from the attention map A obtained in step 2, construct the counterfactual attention map A_counter: the key areas in the attention map A are masked to obtain the mask map A_mask, in which the positions of the key regions are blocked; the counterfactual attention map A_counter is then constructed from A_mask.
Step 5: converting the feature map into feature vectors:
converting the attention map, the fusion attention map and the counterfactual attention map obtained in steps 2, 3 and 4 into feature matrices respectively; the feature matrices are then converted into feature vectors through a fully connected layer.
Step 6: calculating loss:
the loss is calculated from the feature vectors obtained in step 5, and the model is optimized.
Steps 2 to 6 are repeated during training.
Further, in step 2, the feature map F obtained in step 1 is input into an attention mechanism module to obtain the attention map; the attention mechanism module comprises a channel attention mechanism module and a spatial attention mechanism module, and the specific steps are as follows:
first, the feature map F is input into the channel attention mechanism module to obtain the channel attention map A_channel. The feature map is squeezed by global average pooling: z_c = (1/(H×W)) Σ_i Σ_j F_c(i,j);
where F_c(i,j) is the feature map of the c-th channel, z_c is the feature vector of the c-th channel, and z ∈ R^C denotes the feature vector of all channels.
Weighting the feature vector z to obtain a weight vector s:
s = σ(T_2 σ(T_1 z));
where σ denotes the ReLU activation function, and T_1 ∈ R^((C/r)×C) and T_2 ∈ R^(C×(C/r)) are learnable parameters, r being the channel-reduction hyper-parameter.
After the weight vector s is obtained, the feature map F and the weight vector s are fused to obtain the channel attention map A_channel:
A_channel = F_scale(F, s);
where F_scale(F, s) denotes channel-wise multiplication of the weight vector s with the feature map F to obtain the channel attention map.
The channel attention map A_channel is input into the spatial attention module to capture attention in the spatial dimension and obtain the attention map A:
A = F_spatial(A_channel);
where F_spatial comprises a 1×1 convolution kernel, a normalization layer and a ReLU activation function; through F_spatial(A_channel), an attention map A covering both the channel and spatial dimensions is obtained.
Further, in step 4, the specific steps of constructing the counterfactual attention map are as follows:
masking the key areas in the attention map A to obtain the mask map A_mask:
A_mask(a,b) = α·A(a,b), if A(a,b) > θ; A_mask(a,b) = A(a,b), otherwise;
where A(a,b) denotes the value of the attention map A at spatial position index (a,b) and θ is a set threshold: if the value A(a,b) is greater than the threshold θ, the value at that position is multiplied by the suppression factor α (a hyper-parameter) for masking; if A(a,b) is less than or equal to θ, the value at that position is unchanged.
Through the above, the mask map A_mask is obtained, in which the positions of the key regions are blocked; the counterfactual attention map A_counter is constructed from A_mask:
random_map = random(A);
where random(A) denotes generating a random feature map of the same shape as the attention map A; random_map denotes this random feature map, in which key and non-key regions are random.
After the random feature map random_map is obtained, multiplying random_map by A_mask yields the counterfactual attention map A_counter:
A_counter = random_map × A_mask;
in the counterfactual attention map A_counter, the key area is blocked by A_mask, so random_map acts only on the non-key area; the key area of A_counter is thus an irrelevant region.
Further, in step 5, the specific steps of converting the feature maps into feature vectors are as follows:
converting the attention map, the fusion attention map and the counterfactual attention map obtained in steps 2, 3 and 4 into feature matrices respectively:
feature_matrix = normal(einsum(A, F));
feature_comple_matrix = normal(einsum(A_fuse, F));
feature_counter_matrix = normal(einsum(A_counter, F));
where feature_matrix is the feature matrix of the attention map A, feature_comple_matrix is the feature matrix of the fusion attention map A_fuse, and feature_counter_matrix is the feature matrix of the counterfactual attention map A_counter; normal() denotes the normalization operation, and einsum() denotes multiplying the attention map A, the fusion attention map A_fuse or the counterfactual attention map A_counter with the feature map F and converting the result into a feature matrix.
After the corresponding feature matrices are obtained, they are converted into feature vectors through a fully connected layer:
p = fc(feature_matrix);
p_fuse = fc(feature_comple_matrix);
p_effect = p - fc(feature_counter_matrix);
where p is the feature vector of the attention map, p_fuse is the feature vector of the fusion attention map, and p_effect is the feature vector of the difference between factual and counterfactual attention.
In step 6, the loss function is divided into two parts to optimize the model. First, a ranking loss function is introduced so that the key features and the fusion features jointly and effectively participate in recognition; the calculation formula is:
L_rank = max(0, p - p_fuse + ε);
where ε is a hyper-parameter and L_rank denotes the ranking loss; through L_rank, the model promotes the priority of p_fuse, i.e. p_fuse > p + ε.
Secondly, a cross entropy loss function is introduced to optimize the model, and a calculation formula is as follows:
L_p = L_ce(y, p) = -y^T log p;
L_fuse = L_ce(y, p_fuse) = -y^T log p_fuse;
L_effect = L_ce(y, p_effect) = -y^T log p_effect;
where L_ce denotes the cross-entropy loss function, L_p is the loss of the attention map, L_fuse is the loss of the fusion attention map, L_effect is the loss of the difference between factual and counterfactual attention, and y^T denotes the transpose of the ground-truth label vector.
Combining the above-mentioned loss functions into an overall loss function L:
L = L_rank + L_p + L_fuse + L_effect.
compared with the prior art, the invention has the advantages that:
(1) The invention uses the attention mechanism to attend to the key regions and designs the self-channel feature interaction fusion module to attend to the complementary regions; the module models the correlation among the channels of the feature map to extract complementary features and fuses them with the key features, obtaining a fusion attention map that contains both key and complementary features and improves recognition performance;
(2) The invention designs an enhanced counterfactual attention mechanism module to locate the counterfactual region, takes the difference between the prediction results of the key discriminative region and of the counterfactual region, and uses this difference as a strong attention supervision signal; this signal guides the network model to learn more effective attention, improving both the ability to learn effective attention and the recognition accuracy, which the prior art does not consider.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of the method of the present invention;
FIG. 2 is a system architecture diagram of the present invention;
FIG. 3 is a schematic diagram of a self-channel feature interaction fusion module according to the present invention;
FIG. 4 is a schematic diagram of the enhanced counterfactual attention mechanism of the present invention.
Detailed Description
The invention will be further described with reference to the accompanying drawings and specific examples.
Referring to FIG. 1 and FIG. 2, the present embodiment provides a fine-grained image recognition method based on attention interaction and counterfactual attention, which comprises the following steps:
step 1: feature extraction:
inputting the image I into a feature extraction network to obtain a feature map F ∈ R^(C×H×W), where C, H, W are the number of channels, height and width of the feature map, respectively.
Step 2: the spatial distribution of each part of the object is learned through a spatial attention mechanism:
the feature map F obtained in step 1 is passed through a spatial attention mechanism to learn the spatial distribution of each part of the object, expressed as A ∈ R^(M×H×W), where M is the number of attention maps; the attention map A is calculated as:
A = {A_1, A_2, ..., A_M} = S(F);
where A_i ∈ R^(H×W) is the attention map covering one local area, and S(·) denotes the spatial attention mechanism, consisting of a convolution layer and a ReLU activation function.
In a preferred embodiment, in step 2, the feature map F obtained in step 1 is input into an attention mechanism module to obtain the attention map; the attention mechanism module comprises a channel attention mechanism module and a spatial attention mechanism module, and the specific steps are as follows:
first, the feature map F is input into the channel attention mechanism module to obtain the channel attention map A_channel. The feature map is squeezed by global average pooling: z_c = (1/(H×W)) Σ_i Σ_j F_c(i,j);
where F_c(i,j) is the feature map of the c-th channel, z_c is the feature vector of the c-th channel, and z ∈ R^C denotes the feature vector of all channels.
Weighting the feature vector z to obtain a weight vector s:
s = σ(T_2 σ(T_1 z));
where σ denotes the ReLU activation function, and T_1 ∈ R^((C/r)×C) and T_2 ∈ R^(C×(C/r)) are learnable parameters, r being the channel-reduction hyper-parameter.
After the weight vector s is obtained, the feature map F and the weight vector s are fused to obtain the channel attention map A_channel:
A_channel = F_scale(F, s);
where F_scale(F, s) denotes channel-wise multiplication of the weight vector s with the feature map F to obtain the channel attention map.
The channel attention map A_channel is input into the spatial attention module to capture attention in the spatial dimension and obtain the attention map A:
A = F_spatial(A_channel);
where F_spatial comprises a 1×1 convolution kernel, a normalization layer and a ReLU activation function; through F_spatial(A_channel), an attention map A covering both the channel and spatial dimensions is obtained.
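The channel and spatial attention steps above can be sketched in NumPy. This is a minimal illustration, not the patent's implementation: σ is read as ReLU per the text, the learned parameters T_1, T_2 and the 1×1 convolution weights (here a plain matrix `W_spatial`) are random placeholders, and normalization is omitted for brevity.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def channel_spatial_attention(F, T1, T2, W_spatial):
    """Sketch of the channel + spatial attention module.

    F: feature map of shape (C, H, W); T1 (C/r, C) and T2 (C, C/r) stand in
    for the learned excitation parameters; W_spatial (M, C) stands in for the
    1x1 convolution mapping C channels to M attention maps.
    """
    # Squeeze: global average pooling gives one descriptor z_c per channel.
    z = F.mean(axis=(1, 2))                       # z, shape (C,)
    # Excitation: s = sigma(T2 sigma(T1 z)), sigma being ReLU as in the text.
    s = relu(T2 @ relu(T1 @ z))                   # s, shape (C,)
    # F_scale: channel-wise multiplication of the weight vector s with F.
    A_channel = F * s[:, None, None]              # (C, H, W)
    # F_spatial: a 1x1 convolution is a matrix product over the channel axis
    # at every pixel, followed by ReLU, yielding M attention maps.
    A = relu(np.einsum('mc,chw->mhw', W_spatial, A_channel))
    return A
```

A 1×1 convolution is expressed as an `einsum` over the channel axis, which is exactly what such a kernel computes at each spatial position.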
Step 3: capturing complementary features through a self-channel feature interaction fusion module and fusing the complementary features with key features:
the attention map A obtained in step 2 is input into the self-channel feature interaction fusion module, which explores the channel correlation within the image to extract fine complementary features and fuses them with the key features.
In combination with the self-channel feature interaction fusion module shown in fig. 3, the specific method is as follows:
the attention map A is first compressed into a feature matrix A′:
A′ = reshape(A, (M, l));
where reshape denotes converting the data into the specified number of rows and columns without changing it, i.e. A′ ∈ R^(M×l), l = H×W.
A bilinear operation is then performed on A′ and its transpose A′^T to obtain the bilinear matrix A′A′^T; negating this matrix and applying the softmax function yields the weight matrix W = softmax(-A′A′^T) ∈ R^(M×M);
where A′^T denotes the transpose of A′ and (A′A′^T)_ij represents the spatial relationship between channel i and channel j. By the definition of the weight matrix W, channels with larger weights tend to be semantically complementary to A′_i. For example, if A′_i focuses on the bird's head, a channel that highlights a complementary part, such as the bird's wings, receives a larger weight, while a channel that also highlights the bird's head receives a smaller weight.
Multiplying the weight matrix W by the feature matrix A′ yields the feature matrix A_com ∈ R^(M×l) containing the complementary features:
A_com = W × A′.
The feature matrix A_com is converted into an attention map φ(A_com) ∈ R^(M×H×W) containing the complementary features and fused with the attention map A to obtain A_fuse ∈ R^(M×H×W):
A_fuse = φ(A_com) + A;
where the fusion attention map A_fuse contains both the key features and the complementary features.
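The self-channel feature interaction fusion above can be sketched as follows. This is an illustrative NumPy rendering of the formulas (reshape, negated bilinear matrix, row-wise softmax, fusion); the row-wise application of softmax is an assumption, and φ is taken as the inverse reshape.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_channel_fusion(A):
    """Sketch of the self-channel feature interaction fusion module.

    A: attention map of shape (M, H, W). Returns A_fuse of the same shape.
    """
    M, H, W = A.shape
    A_flat = A.reshape(M, H * W)          # A' = reshape(A, (M, l)), l = H*W
    bilinear = A_flat @ A_flat.T          # bilinear matrix A'A'^T, shape (M, M)
    W_mat = softmax(-bilinear, axis=1)    # negate, then softmax: complementary
                                          # (dissimilar) channels get large weight
    A_com = W_mat @ A_flat                # complementary features, A_com = W x A'
    A_fuse = A_com.reshape(M, H, W) + A   # phi(A_com) + A
    return A_fuse
```

The negation before the softmax is the key design choice: channels with a small inner product against channel i (i.e. spatially dissimilar, hence complementary) receive the largest weights.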
Step 4: from the attention map A obtained in step 2, construct the counterfactual attention map A_counter: the key areas in the attention map A are masked to obtain the mask map A_mask, in which the positions of the key regions are blocked; the counterfactual attention map A_counter is then constructed from A_mask.
With reference to the enhanced counterfactual attention mechanism module shown in FIG. 4, the specific steps are as follows:
masking the key areas in the attention map A to obtain the mask map A_mask:
A_mask(a,b) = α·A(a,b), if A(a,b) > θ; A_mask(a,b) = A(a,b), otherwise;
where A(a,b) denotes the value of the attention map A at spatial position index (a,b) and θ is a set threshold: if the value A(a,b) is greater than the threshold θ, the value at that position is multiplied by the suppression factor α (a hyper-parameter) for masking; if A(a,b) is less than or equal to θ, the value at that position is unchanged.
Through the above, the mask map A_mask is obtained, in which the positions of the key regions have been blocked; the counterfactual attention map A_counter is constructed from A_mask:
random_map = random(A);
where random(A) denotes generating a random feature map of the same shape as the attention map A; random_map denotes this random feature map, in which key and non-key regions are random.
After the random feature map random_map is obtained, multiplying random_map by A_mask yields the counterfactual attention map A_counter:
A_counter = random_map × A_mask;
in the counterfactual attention map A_counter, the key area is blocked by A_mask, so random_map acts only on the non-key area; the key area of A_counter is thus an irrelevant region.
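The counterfactual construction above can be sketched in a few lines of NumPy. This is an illustrative rendering only; the threshold θ, suppression factor α and the uniform random map are example choices standing in for the patent's hyper-parameters.

```python
import numpy as np

def counterfactual_attention(A, theta=0.5, alpha=0.1, seed=0):
    """Sketch of the enhanced counterfactual attention construction.

    A: attention map (any shape). Values above the threshold theta (the key
    regions) are suppressed by the factor alpha; a same-shape random map is
    then overlaid so that only non-key regions carry (irrelevant) attention.
    """
    rng = np.random.default_rng(seed)
    # A_mask: block the key regions by multiplying them with alpha.
    A_mask = np.where(A > theta, alpha * A, A)
    # random(A): a random feature map of the same shape as A.
    random_map = rng.random(A.shape)
    # Element-wise product: random values survive mainly in non-key regions.
    A_counter = random_map * A_mask
    return A_mask, A_counter
```

Because the suppressed positions in A_mask are near zero, the random map contributes almost nothing there, which is exactly the "key area becomes an irrelevant area" behaviour described in the text.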
Step 5: converting the feature map into feature vectors:
converting the attention map, the fusion attention map and the counterfactual attention map obtained in steps 2, 3 and 4 into feature matrices respectively:
feature_matrix = normal(einsum(A, F));
feature_comple_matrix = normal(einsum(A_fuse, F));
feature_counter_matrix = normal(einsum(A_counter, F));
where feature_matrix is the feature matrix of the attention map A, feature_comple_matrix is the feature matrix of the fusion attention map A_fuse, and feature_counter_matrix is the feature matrix of the counterfactual attention map A_counter; normal() denotes the normalization operation, and einsum() denotes multiplying the attention map A, the fusion attention map A_fuse or the counterfactual attention map A_counter with the feature map F and converting the result into a feature matrix.
After the corresponding feature matrices are obtained, they are converted into feature vectors through a fully connected layer:
p = fc(feature_matrix);
p_fuse = fc(feature_comple_matrix);
p_effect = p - fc(feature_counter_matrix);
where p is the feature vector of the attention map, p_fuse is the feature vector of the fusion attention map, and p_effect is the feature vector of the difference between factual and counterfactual attention.
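The einsum() pooling and fc() projection above can be sketched as follows. This is a minimal NumPy illustration under assumptions: the contraction is over the spatial dimensions, normal() is taken as L2 normalization, and `W_fc` is a random placeholder for the learned fully connected weights.

```python
import numpy as np

def attention_pool(A, F):
    """einsum(A, F): contract an attention map A (M, H, W) against the
    feature map F (C, H, W) over the spatial axes to get an (M, C) matrix,
    then normalise it (normal() is interpreted here as L2 normalisation)."""
    mat = np.einsum('mhw,chw->mc', A, F)
    return mat / (np.linalg.norm(mat) + 1e-12)

def fc(mat, W_fc):
    """Fully connected layer: flatten the (M, C) feature matrix and project
    it to class scores with the weight matrix W_fc of shape (K, M*C)."""
    return W_fc @ mat.reshape(-1)
```

With these helpers, the three vectors of the text become `p = fc(attention_pool(A, F), W_fc)`, `p_fuse = fc(attention_pool(A_fuse, F), W_fc)` and `p_effect = p - fc(attention_pool(A_counter, F), W_fc)`.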
Step 6: calculating loss:
the loss is calculated from the feature vectors obtained in step 5, and the model is optimized. The loss function is divided into two parts. First, a ranking loss function is introduced so that the key features and the fusion features jointly and effectively participate in recognition; the calculation formula is:
L_rank = max(0, p - p_fuse + ε);
where ε is a hyper-parameter and L_rank denotes the ranking loss; through L_rank, the model promotes the priority of p_fuse, i.e. p_fuse > p + ε. The purpose of this design is to force the fused attention to produce more accurate predictions, taking the predictions produced by the attention map as a reference. With this regularization method, the network learns to identify fine-grained images by adaptively considering feature priorities.
Secondly, a cross entropy loss function is introduced to optimize the model, and a calculation formula is as follows:
L_p = L_ce(y, p) = -y^T log p;
L_fuse = L_ce(y, p_fuse) = -y^T log p_fuse;
L_effect = L_ce(y, p_effect) = -y^T log p_effect;
where L_ce denotes the cross-entropy loss function, L_p is the loss of the attention map, L_fuse is the loss of the fusion attention map, L_effect is the loss of the difference between factual and counterfactual attention, and y^T denotes the transpose of the ground-truth label vector.
Combining the above-mentioned loss functions into an overall loss function L:
L = L_rank + L_p + L_fuse + L_effect.
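The overall loss can be sketched as below. This is an interpretive NumPy illustration, not the patent's exact formulation: cross entropy is computed via log-softmax of the score vectors, and the ranking margin is assumed to compare the two scores assigned to the true class (one common reading of max(0, p - p_fuse + ε) for vector-valued p).

```python
import numpy as np

def log_softmax(x):
    x = x - x.max()
    return x - np.log(np.exp(x).sum())

def total_loss(p, p_fuse, p_effect, y, eps=0.05):
    """Sketch of L = L_rank + L_p + L_fuse + L_effect.

    p, p_fuse, p_effect: class-score vectors; y: one-hot label vector;
    eps: the ranking margin (the hyper-parameter epsilon in the text).
    """
    # L_ce(y, s) = -y^T log softmax(s): cross entropy against the one-hot label.
    ce = lambda s: -float(y @ log_softmax(s))
    t = int(np.argmax(y))                            # index of the true class
    # L_rank = max(0, p - p_fuse + eps) on the true-class scores (assumption).
    L_rank = max(0.0, float(p[t] - p_fuse[t]) + eps)
    return L_rank + ce(p) + ce(p_fuse) + ce(p_effect)
```

The hinge term is zero once the fused prediction leads the plain attention prediction by at least ε on the true class, which matches the stated goal p_fuse > p + ε.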
Steps 2 to 6 are repeated during training.
After model training is completed, the image to be identified is input, so that high-accuracy identification can be realized.
It should be understood that the above description is not intended to limit the invention to the particular embodiments disclosed; various changes, modifications, additions and substitutions can be made by those skilled in the art without departing from the spirit and scope of the invention.
Claims (5)
1. A fine-grained image recognition method based on attention interaction and counterfactual attention, characterized by comprising the following steps:
step 1: feature extraction:
inputting the image I into a feature extraction network to obtain a feature map F ∈ R^(C×H×W), wherein C, H, W are the number of channels, height and width of the feature map, respectively;
step 2: the spatial distribution of each part of the object is learned through a spatial attention mechanism:
the feature map F obtained in step 1 is passed through a spatial attention mechanism to learn the spatial distribution of each part of the object, expressed as A ∈ R^(M×H×W), wherein M is the number of attention maps; the attention map A is calculated as:
A = {A_1, A_2, ..., A_M} = S(F);
wherein A_i ∈ R^(H×W) is the attention map covering one local area, and S(·) denotes the spatial attention mechanism, consisting of a convolution layer and a ReLU activation function;
step 3: capturing complementary features through a self-channel feature interaction fusion module and fusing the complementary features with key features:
inputting the attention map A obtained in the step 2 into a self-channel feature interaction fusion module, extracting complementary features by exploring channel correlation in the image, and fusing the complementary features with key features; the specific method comprises the following steps:
the attention map A is first compressed into a feature matrix A′:
A′ = reshape(A, (M, l));
wherein reshape denotes converting the data into the specified number of rows and columns without changing it, i.e. A′ ∈ R^(M×l), l = H×W;
a bilinear operation is then performed on A′ and its transpose A′^T to obtain the bilinear matrix A′A′^T; negating this matrix and applying the softmax function yields the weight matrix W = softmax(-A′A′^T) ∈ R^(M×M);
wherein A′^T denotes the transpose of A′ and (A′A′^T)_ij represents the spatial relationship between channel i and channel j;
the weight matrix W is multiplied with the feature matrix A' to obtain a feature matrix A_com containing complementary features:
A_com = W × A';
the feature matrix A_com is converted into an attention map containing complementary features, φ(A_com), and fused with the attention map A to obtain the fused attention map A_fuse:
A_fuse = φ(A_com) + A;
wherein A_fuse denotes a fused attention map containing both key features and complementary features;
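The self-channel interaction fusion of step 3 can be sketched as below. Assumptions are flagged in comments: the patent does not define φ(·) beyond "conversion back to an attention map", so a plain reshape stands in for it here.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_channel_fusion(A):
    """Step 3: compress A to A' (M, l), form the bilinear matrix A'A'^T,
    negate it, apply softmax to get the weight matrix W, mix channels to
    obtain complementary features A_com, and fuse them back with A."""
    M, H, Wd = A.shape
    A_flat = A.reshape(M, H * Wd)        # A' = reshape(A, (M, l)), l = HW
    bilinear = A_flat @ A_flat.T         # (M, M) channel-correlation matrix
    Wmat = softmax(-bilinear, axis=-1)   # W = softmax(-A'A'^T)
    A_com = Wmat @ A_flat                # complementary features A_com = W x A'
    # phi(.) is approximated by a reshape here; the patent leaves it abstract.
    return A_com.reshape(M, H, Wd) + A   # A_fuse = phi(A_com) + A

A = np.abs(np.random.default_rng(1).standard_normal((3, 4, 4)))
A_fuse = self_channel_fusion(A)
print(A_fuse.shape)                      # (3, 4, 4)
```

The negated softmax gives low weights to strongly correlated channel pairs, so each channel aggregates information mainly from channels attending to *different* regions, which is what makes the mixed features complementary.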
step 4: constructing the counterfactual attention map A_counter from the attention map A obtained in step 2:
the key areas in the attention map A are masked to obtain a mask map A_mask, in which the positions of the key regions are blocked; A_mask is then used to construct the counterfactual attention map A_counter;
Step 5: converting the feature map into feature vectors:
the attention map, the fused attention map, and the counterfactual attention map obtained in steps 2, 3, and 4 are converted into feature matrices, respectively; the corresponding feature matrices are then converted into feature vectors through a fully connected layer;
step 6: calculating loss:
the loss is calculated from the feature vectors obtained in step 5 and the model is optimized;
training steps 2-6 are repeated.
2. The fine-grained image recognition method based on attention interaction and counterfactual attention according to claim 1, wherein in step 2 the feature map F obtained in step 1 is input into an attention mechanism module to obtain the attention map; the attention mechanism module comprises a channel attention mechanism module and a spatial attention mechanism module, and the specific steps are as follows:
first, the feature map F is input into the channel attention mechanism module to obtain the channel attention map A_channel; each channel is squeezed by global average pooling:
z_c = (1 / (H × W)) Σ_(i=1..H) Σ_(j=1..W) F_c(i, j);
wherein F_c(i, j) denotes the value of the feature map of the c-th channel at spatial position (i, j), z_c denotes the descriptor of the c-th channel, and z denotes the vector of descriptors of all channels;
the feature vector z is then transformed to obtain a weight vector s:
s=σ(T 2 σ(T 1 z));
wherein σ denotes the ReLU activation function, and T_1 ∈ R^((C/r)×C), T_2 ∈ R^(C×(C/r)) are learnable parameters, with r the channel-reduction hyper-parameter;
after the weight vector s is obtained, the feature map F and the weight vector s are fused to obtain the channel attention map A_channel:
A channel =F scale (F,s);
wherein F_scale(F, s) denotes channel-level multiplication of the weight vector s with the feature map F, yielding the channel attention map; the channel attention map A_channel is then input to the spatial attention module, which captures attention in the spatial dimension to obtain the attention map A:
A=F spatial (A channel );
wherein F_spatial(A_channel) comprises a 1 × 1 convolution kernel, a normalization layer, and a ReLU activation function; through F_spatial(A_channel), an attention map A covering both the channel and spatial dimensions is obtained.
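The squeeze-and-excite style channel attention of claim 2 can be sketched as follows. The matrix shapes for T_1 and T_2 and the toy sizes are assumptions consistent with the reduction hyper-parameter r described above.

```python
import numpy as np

def channel_attention(F, T1, T2):
    """Claim 2 channel attention: squeeze each channel by global average
    pooling (z_c), excite via s = ReLU(T2 ReLU(T1 z)), then rescale F
    channel-wise, i.e. A_channel = F_scale(F, s)."""
    C, H, W = F.shape
    z = F.reshape(C, -1).mean(axis=1)      # squeeze: (C,) channel descriptors
    relu = lambda x: np.maximum(x, 0.0)
    s = relu(T2 @ relu(T1 @ z))            # excitation weight vector (C,)
    return F * s[:, None, None]            # channel-level multiplication

rng = np.random.default_rng(2)
C, r = 8, 2                                # r: channel-reduction hyper-parameter
F = rng.standard_normal((C, 4, 4))
T1 = rng.standard_normal((C // r, C))      # T_1: reduce C -> C/r
T2 = rng.standard_normal((C, C // r))      # T_2: expand C/r -> C
A_channel = channel_attention(F, T1, T2)
print(A_channel.shape)                     # (8, 4, 4)
```

The subsequent F_spatial stage (1×1 convolution, normalization, ReLU) would then map A_channel to the M attention maps; it is omitted here for brevity.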
3. The fine-grained image recognition method based on attention interaction and counterfactual attention according to claim 1, wherein in step 4 the specific steps of constructing the counterfactual attention map are as follows:
the key areas in the attention map A are masked to obtain the mask map A_mask:
A_mask(a, b) = α × A(a, b), if A(a, b) > θ; A_mask(a, b) = A(a, b), otherwise;
wherein A(a, b) denotes the value of the attention map A at spatial position (a, b), and θ is a set threshold: if A(a, b) is greater than θ, the value at that position is multiplied by a suppression factor α (a hyper-parameter) for masking; if A(a, b) is less than or equal to θ, the value remains unchanged;
the mask map A_mask thus obtained blocks the positions of the key regions; A_mask is then used to construct the counterfactual attention map A_counter:
random_map=random(A)
wherein random(A) denotes generating a random feature map of the same shape as the attention map A, and random_map denotes that random feature map, in which key and non-key areas are random;
after the random feature map random_map is obtained, random_map is multiplied with A_mask to obtain the counterfactual attention map A_counter:
A_counter = random_map × A_mask;
in the counterfactual attention map A_counter, the key areas are blocked by A_mask, so random_map takes effect only in the non-key areas; the key areas of A_counter are thus irrelevant areas.
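The masking and counterfactual construction of claim 3 can be sketched directly; the threshold θ and suppression factor α values below are arbitrary placeholders for the hyper-parameters.

```python
import numpy as np

def counterfactual_attention(A, theta=0.5, alpha=0.1, rng=None):
    """Claim 3: values above threshold theta are multiplied by the
    suppression factor alpha to block key regions (A_mask), then the
    mask is multiplied element-wise by a random map of the same shape."""
    rng = rng or np.random.default_rng()
    A_mask = np.where(A > theta, alpha * A, A)   # block key regions
    random_map = rng.random(A.shape)             # random(A)
    return random_map * A_mask                   # A_counter

A = np.random.default_rng(3).random((3, 4, 4))   # toy attention map in [0, 1)
A_counter = counterfactual_attention(A, theta=0.5, alpha=0.1)
print(A_counter.shape)                           # (3, 4, 4)
```

Because α < 1 suppresses exactly the positions the model attends to most, A_counter carries attention only over irrelevant regions, which is what makes it a counterfactual for the effect computation in claim 4.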
4. The fine-grained image recognition method based on attention interaction and counterfactual attention according to claim 3, wherein in step 5 the specific steps of converting the feature map into feature vectors are as follows:
the attention map, the fused attention map, and the counterfactual attention map obtained in steps 2, 3, and 4 are converted into feature matrices, respectively:
feature_matrix = normal(einsum(A, F));
feature_comple_matrix = normal(einsum(A_fuse, F));
feature_counter_matrix = normal(einsum(A_counter, F));
wherein feature_matrix denotes the feature matrix of the attention map A, feature_comple_matrix that of the fused attention map A_fuse, and feature_counter_matrix that of the counterfactual attention map A_counter; normal(·) denotes the normalization operation, and einsum(·) multiplies each of the attention map A, the fused attention map A_fuse, and the counterfactual attention map A_counter with the feature map F and converts the result into a feature matrix;
after the corresponding feature matrices are obtained, they are converted into feature vectors through a fully connected layer:
p = fc(feature_matrix);
p_fuse = fc(feature_comple_matrix);
p_effect = p - fc(feature_counter_matrix);
wherein p denotes the feature vector of the attention map, p_fuse the feature vector of the fused attention map, and p_effect the feature vector of the difference between the attention map and the counterfactual attention map.
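The pooling and projection of claim 4 can be sketched as below. The einsum subscripts, the L2 normalization standing in for normal(·), and the output dimension of the fully connected layer are all assumptions, since the patent leaves them unspecified.

```python
import numpy as np

def attention_pool(A, F):
    """einsum(A, F): pool the feature map F (C, H, W) with each of the
    M attention maps, then L2-normalize rows -> feature matrix (M, C)."""
    mat = np.einsum('mhw,chw->mc', A, F)
    return mat / (np.linalg.norm(mat, axis=-1, keepdims=True) + 1e-12)

rng = np.random.default_rng(4)
A = rng.random((3, 4, 4))                  # M=3 attention maps
F = rng.standard_normal((8, 4, 4))         # C=8 feature map
W_fc = rng.standard_normal((5, 3 * 8))     # fc to 5 classes (assumed size)

feature_matrix = attention_pool(A, F)      # normal(einsum(A, F))
p = W_fc @ feature_matrix.reshape(-1)      # p = fc(feature_matrix)
print(feature_matrix.shape, p.shape)       # (3, 8) (5,)
```

The same pooling and fc would be applied to A_fuse and A_counter to obtain p_fuse and p_effect = p - fc(feature_counter_matrix).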
5. The fine-grained image recognition method based on attention interaction and counterfactual attention according to claim 4, wherein in step 6 the loss function is divided into two parts to optimize the model; first, a ranking loss function is introduced so that the key features and the fused features jointly participate in recognition, calculated as:
L_rank = max(0, p - p_fuse + ε);
wherein ε is a hyper-parameter and L_rank denotes the ranking loss; through L_rank, the model is encouraged to give p_fuse priority, i.e. p_fuse > p + ε;
Secondly, a cross entropy loss function is introduced to optimize the model, and a calculation formula is as follows:
L_p = L_ce(y^T log p);
L_fuse = L_ce(y^T log p_fuse);
L_effect = L_ce(y^T log p_effect);
wherein L_ce denotes the cross-entropy loss function, L_p the loss of the attention map, L_fuse the loss of the fused attention map, L_effect the loss of the difference between the attention map and the counterfactual attention map, and y^T the transpose of the ground-truth label vector;
the above loss functions are combined into an overall loss function L:
L = L_rank + L_p + L_fuse + L_effect.
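The combined objective of claim 5 can be sketched as follows. Two details are assumptions: the hinge-style ranking term is averaged over classes (the patent applies max(0, ·) to vectors without stating the reduction), and a softmax is applied inside the cross-entropy since the logit vectors are unnormalized.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def total_loss(p, p_fuse, p_effect, y, eps=0.05):
    """Claim 5: L = L_rank + L_p + L_fuse + L_effect for a one-hot label y.
    eps is the ranking-margin hyper-parameter epsilon."""
    ce = lambda logits: -float(y @ np.log(softmax(logits) + 1e-12))
    # Ranking loss encouraging p_fuse > p + eps (mean reduction assumed).
    L_rank = float(np.maximum(0.0, p - p_fuse + eps).mean())
    return L_rank + ce(p) + ce(p_fuse) + ce(p_effect)

rng = np.random.default_rng(5)
p, p_fuse, p_effect = (rng.standard_normal(5) for _ in range(3))
y = np.eye(5)[2]                                   # one-hot ground-truth label
loss = total_loss(p, p_fuse, p_effect, y)
```

Training on p_effect rather than p alone forces the network to be rewarded only for attention that outperforms its counterfactual, which is the core of the counterfactual-attention idea.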
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310212744.9A CN116051948B (en) | 2023-03-08 | 2023-03-08 | Fine granularity image recognition method based on attention interaction and anti-facts attention |
Publications (2)
Publication Number | Publication Date |
---|---|
CN116051948A CN116051948A (en) | 2023-05-02 |
CN116051948B true CN116051948B (en) | 2023-06-23 |
Family
ID=86123960
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116228749B (en) * | 2023-05-04 | 2023-10-27 | 昆山润石智能科技有限公司 | Wafer defect detection method and system based on inverse fact interpretation |
CN116665019B (en) * | 2023-07-31 | 2023-09-29 | 山东交通学院 | Multi-axis interaction multi-dimensional attention network for vehicle re-identification |
CN117078920B (en) * | 2023-10-16 | 2024-01-23 | 昆明理工大学 | Infrared-visible light target detection method based on deformable attention mechanism |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111325237A (en) * | 2020-01-21 | 2020-06-23 | 中国科学院深圳先进技术研究院 | Image identification method based on attention interaction mechanism |
CN113592023A (en) * | 2021-08-11 | 2021-11-02 | 杭州电子科技大学 | High-efficiency fine-grained image classification model based on depth model framework |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110807465B (en) * | 2019-11-05 | 2020-06-30 | 北京邮电大学 | Fine-grained image identification method based on channel loss function |
CN113642571B (en) * | 2021-07-12 | 2023-10-10 | 中国海洋大学 | Fine granularity image recognition method based on salient attention mechanism |
CN114882534B (en) * | 2022-05-31 | 2024-03-26 | 合肥工业大学 | Pedestrian re-recognition method, system and medium based on anti-facts attention learning |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN116051948B (en) | Fine granularity image recognition method based on attention interaction and anti-facts attention | |
CN109949317B (en) | Semi-supervised image example segmentation method based on gradual confrontation learning | |
CN110263912B (en) | Image question-answering method based on multi-target association depth reasoning | |
Li et al. | Adaptive deep convolutional neural networks for scene-specific object detection | |
Ren et al. | Salient object detection by fusing local and global contexts | |
CN111340123A (en) | Image score label prediction method based on deep convolutional neural network | |
CN113609896B (en) | Object-level remote sensing change detection method and system based on dual-related attention | |
Tan et al. | Fine-grained classification via hierarchical bilinear pooling with aggregated slack mask | |
CN110175248B (en) | Face image retrieval method and device based on deep learning and Hash coding | |
Wang et al. | Multiscale deep alternative neural network for large-scale video classification | |
Li et al. | Detection-friendly dehazing: Object detection in real-world hazy scenes | |
Xu et al. | Scene graph inference via multi-scale context modeling | |
Zhu et al. | Supplement and suppression: Both boundary and nonboundary are helpful for salient object detection | |
Xue et al. | NLWSNet: a weakly supervised network for visual sentiment analysis in mislabeled web images | |
Li et al. | Multi-scale global context feature pyramid network for object detector | |
CN112668662B (en) | Outdoor mountain forest environment target detection method based on improved YOLOv3 network | |
Jiang et al. | Cross-level reinforced attention network for person re-identification | |
CN111368637B (en) | Transfer robot target identification method based on multi-mask convolutional neural network | |
Wang et al. | Single shot multibox detector with deconvolutional region magnification procedure | |
CN114627312B (en) | Zero sample image classification method, system, equipment and storage medium | |
Yuan et al. | CTIF-Net: A CNN-Transformer Iterative Fusion Network for Salient Object Detection | |
Xu et al. | Recursive multi-relational graph convolutional network for automatic photo selection | |
CN117649582B (en) | Single-flow single-stage network target tracking method and system based on cascade attention | |
Yang et al. | An Effective and Lightweight Hybrid Network for Object Detection in Remote Sensing Images | |
Wang et al. | Visual tracking using transformer with a combination of convolution and attention |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||