CN116580289A

CN116580289A - Fine granularity image recognition method based on attention

Info

Publication number: CN116580289A
Application number: CN202310678774.9A
Authority: CN
Inventors: 李兰英; 林成承
Original assignee: Harbin University of Science and Technology
Current assignee: Harbin University of Science and Technology
Priority date: 2023-06-08
Filing date: 2023-06-08
Publication date: 2023-08-11

Abstract

A fine granularity image recognition method based on attention belongs to the technical field of image classification, and builds a network model through a spatial depth module, a multi-scale feature extraction module, a context attention perception module and a multi-head attention module. The feature extraction capability of the model is enhanced through the spatial depth module, and the loss of a discrimination area caused by downsampling is reduced; extracting multi-scale features based on the salient regions by a multi-scale feature extraction module so as to enhance the recognition accuracy of the model; learning local relations among all scale features through a context awareness module; learning global and long-term links of multi-scale features through a multi-head attention module; and finally, adopting a cross entropy loss function and a center loss function as loss functions of the network, and reducing the intra-class distance by expanding the inter-class distance between samples so as to reduce the influence of the confusable area on the model identification precision. The method can well solve the problems of low-level information loss caused by deepening of network layers and low recognition accuracy caused by neglecting the relation among multi-scale features in fine-grained image recognition.

Description

Fine granularity image recognition method based on attention

Technical Field

The invention belongs to the technical field of fine-granularity image processing, and particularly relates to a fine-granularity image recognition method based on attention.

Background

As an important research direction in the field of computer vision, image recognition is the most basic task, and is also the basis for other various visual tasks. As an important branch extending from the field of image recognition, fine-grained image recognition is different from conventional image recognition. Fine-grained image recognition is the division of various subcategories under the same meta-category, e.g., from among a wide variety of cats. The fine-grained image recognition can be classified into fine-grained image recognition based on strong supervision and fine-grained image recognition based on weak supervision, wherein the fine-grained image recognition uses annotation points and annotation frames to assist learning during model training, and the fine-grained image recognition uses only image labels to learn. The fine-grained image recognition based on weak supervised learning mainly comprises three methods of region-positioning sub-network based, high-order feature coding based and additional information assisted recognition based.

The current fine-grained image recognition method is mainly a region-positioning sub-network-based method, which mainly locates the regions with discriminant features through an attention mechanism and learns features from the regions. Although this approach achieves good results, it has the following disadvantages: the existing method ignores the effect of low-level information, and can cause the loss of the low-level information in a small discriminant area along with the increase of the network layer number; furthermore, these methods only find critical areas by spatial and channel attention, ignoring the links between them.

Disclosure of Invention

Aiming at the defects existing in the prior art, the invention provides a fine granularity image recognition method based on attention, which comprises the following steps:

s1, constructing a fine-grained image recognition network model: the system specifically comprises a feature extraction network, a spatial depth convolution module, a multi-scale feature extraction module, a context attention sensing module, a multi-head self-attention module and a classifier;

s2, optimizing an initial network by using the pre-training parameters;

s3, dividing a data set and preprocessing a sample image;

s4, inputting the sample image into a feature extraction network to obtain a feature map and an attention thermodynamic diagram;

s5, simultaneously inputting the extracted feature map and thermodynamic diagram into a multi-scale feature module to obtain a multi-scale feature map;

s6, inputting the multi-scale feature map into a context attention sensing module, so that the model learns multi-scale context information of the salient region;

s7, inputting multi-scale context information into a multi-head self-attention module, so that the model learns the long-term dependency relationship of each scale characteristic;

and S8, training the network model according to the loss function, and repeating the steps S4 to S7 until the loss function converges.

And finally, inputting the fine-grained images to be identified into a trained model for classification identification.

The characteristic extraction network adopts ConvNeXt convolutional neural network as backbone network.

Further, the backbone network includes:

and adding a spatial depth convolution module to replace the original downsampling part in each Stage, so as to enhance the identification capability of the model for judging the key region. For a size of sxsxc ₁ The feature map X of (2) is divided into sub-maps, and the formula is as follows:

f _s-1,s-1 ＝X[s-1:S:s,s-1:S:s]

where f is the sub-feature map and s is the scale factor. Connecting sub-feature maps in the channel dimension to convert feature map X into a new intermediate feature map

Then using non-stride convolutionFeature conversion, adding a C after feature mapping X ₂ Convolutional layer, where C ₂ <s ² C ₁ Will beConversion to->So as to keep the discrimination information of the micro-area as much as possible.

Further, for a given feature map X ε R ^C×H×W Wherein C, H, W respectively represents the number, height and width of channels, the multi-scale feature module captures regions with different scales on a feature map X through rectangular regions with different sizes, and for a response region r (i, j, [ delta ] X, [ delta ] y), wherein i and j are central positions of the response region, and [ delta ] X, [ delta ] y are the width and the height. By varying the width and height of the regions, a set of regions, r=r (i, j, mΔx, nΔy), where m, n=1, 2,3, …, is obtained; and i<i+m△x≤W,j<j+m delta y is less than or equal to H, and rich context information of subtle changes of response areas is captured step by step, so that a group of area sets R= { R } are obtained.

Further, for a plurality of areas r=r (i, j, m Δx, n Δy) with different sizes, feature vectors with fixed sizes are generated by bilinear pooling and bilinear interpolation to represent the areas, and the areas are represented in target coordinatesTransformed image +.>The formula is as follows:

wherein R (L) _ψ (y)) represents that a feature vector with the region coordinate of y is obtained from the original image; l (L) _ψ (y) represents a transformation of the coordinate y, wherein ψ is a learnable parameter; k is a kernel function whenAnd L _ψ (y) when not directly adjacent, < >>

Further, the context awareness module is used for capturing the relation among the multi-scale features, so that the model can selectively pay attention to more relevant areas to generate overall context information, and a specific formula among the multi-scale features is obtained as follows:

v in _r Note for the context that the feature vector,feature map, alpha, representing other scales associated with the current scale _r,r' Representing the correlation between the current scale feature and other neighboring scale features, the formula is as follows:

m in the formula _α B is a nonlinear combination of weight matrix _α 、b _β Representing the deviation;representing a query vector->The formula for representing the key vector is shown below:

m in the formula _β And M _β' The weight matrix is represented by a matrix of weights,a feature map representing a current scale;

further, for context vector v= { V _r Global average pooling of r=1..|r| } and the resulting contextual features f are pooled _r As the input of the multi-head self-attention module, the spatial arrangement information of the learning region and the long-term dependency relationship are studied, and the calculation formula of the multi-head self-attention is as follows:

A＝Concat(A ₁ ,A ₂ ,...,A _|R| )W ₀

q, K, V is query vector, key vector and value vector, W ₀ Is a weight matrix.

The attention-based fine-grained image recognition method according to claim 1, wherein a network of models is trained using a combination of cross entropy loss functions and center loss functions, the loss function formulas of the models being as follows:

L＝L _CE +λL _cent

wherein lambda is a weight coefficient, the influence of a center loss function on the total loss is measured, N is the number of categories, y _i As a true value tag, p _i Predicting labels for the models; w is the number of samples, x _i In order to train the sample,the center vector is represented by a vector of the center,||·|| ₂ representing the Euclidean distance;

and carrying out optimization training on the network model according to the total loss L, thereby obtaining an optimally trained network model.

The fine granularity image recognition method based on attention provided by the invention has the following advantages:

(1) According to the method, the space depth convolution module is designed, so that the model keeps low-level information which is lost with the deepening of the layer number of the convolution network, the diversity of model learning characteristics is enhanced, and the recognition accuracy is improved.

(2) According to the method, not only is the key area considered, but also the multi-scale characteristics adjacent to the key area are obtained by designing the multi-scale characteristic module, so that the robustness and the recognition capability of the model are enhanced.

(3) According to the method, the local relation and the global relation among the scale features are obtained by designing the contextual attention and the multi-head attention features, and the local relation and the global relation are fused to obtain rich feature representations, so that the recognition performance of the model is further improved.

Drawings

In order to more clearly illustrate the embodiments of the invention or the technical solutions of the prior art, the drawings which are used in the description of the embodiments or the prior art will be briefly described, it being obvious that the drawings in the description below are only some embodiments of the invention, and that other drawings can be obtained according to these drawings without inventive faculty for a person skilled in the art.

FIG. 1 is a flow chart of the method of the present invention;

FIG. 2 is a schematic diagram of a modified ConvNeXt network structure according to the present invention

FIG. 3 is a system configuration diagram of the present invention

Detailed Description

The objects, technical solutions and advantages of the present invention will become more apparent by the following detailed description of the present invention with reference to the accompanying drawings. In addition, in the following description, descriptions of well-known structures and techniques are omitted so as not to unnecessarily obscure the present invention.

As shown in fig. 1, the present invention provides a fine-grained image learning method based on attention, which comprises the following steps:

step 1, inputting an image to be classified into a feature extraction network to obtain a feature map:

as shown in fig. 2, the feature extraction network is formed by using a ConvNeXt convolution network as a basic network and adding a spatial depth convolution module thereon, wherein the network is mainly divided into four stages, namely four stages, each Stage comprises a downsampling layer and a plurality of convolution layers except for the first Stage, and the downsampling layer in the Stage is replaced by the spatial depth convolution module to enhance the identification capability of the model on a micro-discrimination key region. For a size of sxsxc ₁ The feature map X of (2) is divided into sub-maps, and the formula is as follows:

f _s-1,s-1 ＝X[s-1:S:s,s-1:S:s]

Then adopting non-stride convolution to make feature conversion, adding a C after feature mapping X ₂ Convolutional layer, where C ₂ <s ² C ₁ Will beConversion to->The space size of the feature map is reduced to half of the original space size after the input image passes through one Stage, and the channel data is doubled, so that the discrimination information of the micro area is kept as much as possible. Here, the feature map after Stage 4 is acquired, and an attention thermodynamic diagram is obtained through CAM (Class Activation Mapping).

Step 2, acquiring multi-scale features through a multi-scale feature module:

as shown in FIG. 3, the model's multiscale feature module, X ε R, for a given feature map ^C×H×W Wherein C, H, W respectively represents the number, height and width of channels, the multi-scale feature module captures regions with different scales on a feature map X through rectangular regions with different sizes, and for a key region r (i, j, deltax, deltay), i and j are central positions of response regions, deltax and Deltay are widths and heights. By varying the width and height of the regions, a set of regions, r=r (i, j, mΔx, nΔy), where m, n=1, 2,3, …, is obtained; and i<i+m△x≤W,j<j+m delta y is less than or equal to H, and rich context information of subtle changes of response areas is captured step by step, so that a group of area sets R= { R } are obtained.

The regions are then represented using bilinear pooling, bilinear interpolation to generate feature vectors of fixed size for the set of regions r=r (i, j, mΔx, nΔy), at the target coordinatesTransformed image +.>The formula is as follows:

wherein R (L) _ψ (y)) represents that a feature vector with the region coordinate of y is obtained from the original image; l (L) _ψ (y) represents a transformation of the coordinate y, wherein ψ is a learnable parameter; k is a kernel function whenAnd L _ψ (y) when not directly adjacent, < >>Through the module, multi-scale features are obtained from the feature map, and the features with different scales are integratedAnd feature vectors with the same size are combined, so that the subsequent calculation of the model is facilitated.

Step 3, obtaining local connection through the context attention:

as shown in fig. 3, the model's contextual attention module is used to capture local relationships between multi-scale features, enabling the model to selectively focus on more relevant regions to generate overall contextual information. After receiving the multi-scale features, a specific formula for obtaining the relation between the multi-scale features is as follows:

m in the formula _β And M _β' The weight matrix is represented by a matrix of weights,a feature map representing the current scale.

Step 4, acquiring global connection through a multi-head attention module:

as shown in fig. 3, the multi-head attention module of the model first pairs the context vector v= { V _r R=1..|r| } is globally averaged pooled and the resulting contextual feature f is used _r As the input of the multi-head self-attention module, the spatial arrangement information of the learning region and the long-term dependency relationship are studied, and the calculation formula of the multi-head self-attention is as follows:

A＝Concat(A ₁ ,A ₂ ,...,A _|R| )W ₀

q, K, V is query vector, key vector and value vector, W ₀ Is a weight matrix.

Step 5, combining the local features and the global features to obtain a final classification result:

as shown in fig. 3, the contextual attention derived features and the multi-head attention derived features are stitched together through the FC layer as the basis for the final classification. In the training stage, a model network is trained by adopting a cross entropy loss function and a center loss function in a combined mode, and the loss function formula of the model is as follows:

L＝L _CE +λL _cent

wherein lambda is a weight coefficient, the influence of a center loss function on the total loss is measured, N is the number of categories, y _i Is true toValue tag, p _i Predicting labels for the models; w is the number of samples, x _i In order to train the sample,the center vector is represented by a vector of the center, I.I ₂ Representing the Euclidean distance;

and carrying out optimization training on the network model according to the total loss L, and continuously repeating the steps until the loss function converges, so as to finally obtain the network model with optimized training. After training is completed, fine-grained images are input, and the model can realize high-accuracy recognition.

Briefly, the present embodiment provides a fine-granularity image recognition method based on attention, which is used for classifying fine-granularity images, and designs a fine-granularity recognition network model based on fine-granularity recognition, and the fine-granularity recognition network model mainly comprises a spatial depth convolution module, a feature extraction network, a multi-scale feature module, a context attention module, a multi-head attention module and a classifier. On one hand, the problem of low-level information loss of a tiny judging area is considered, and on the other hand, the problem of connection between the judging area and other areas is also considered.

Finally, it is to be understood that the above-described embodiments of the present invention are merely illustrative of or explanation of the principles of the present invention and are in no way limiting of the invention. Accordingly, any modification, equivalent replacement, improvement, etc. made without departing from the spirit and scope of the present invention should be included in the scope of the present invention.

Claims

1. A fine-grained image recognition method based on an attention mechanism, the method comprising the steps of:

s1, constructing a fine-grained image recognition network model: the system specifically comprises a feature extraction network, a spatial depth convolution module, a multi-scale feature extraction module, a context attention sensing module, a multi-head attention module and a classifier;

s2, optimizing an initial network by using the pre-training parameters;

s3, dividing a data set and preprocessing a sample image;

s5, inputting the extracted feature map and thermodynamic diagram into a multi-scale feature extraction module to obtain a multi-scale feature map;

s7, inputting multi-scale context information into a multi-head attention module, so that the model learns long-term dependency of each scale characteristic;

2. The attention-based fine-granularity image recognition method according to claim 1, wherein the feature extraction network adopts a ConvNeXt convolutional neural network as a backbone network.

3. The attention-based fine-granularity image recognition method according to claim 2, wherein a spatial depth convolution module is added in each Stage to replace an original downsampling part, so that the recognition capability of the model on a tiny discrimination key region is enhanced. For a size of sxsxc ₁ The feature map X of (2) is divided into sub-maps, and the formula is as follows:

f _s-1,s-1 ＝X[s-1:S:s,s-1:S:s]

Then adopting non-stride convolution to make feature conversion, adding a C after feature mapping X ₂ A convolution layer, whichMiddle C ₂ <s ² C ₁ Will beConversion to->So as to keep the discrimination information of the micro-area as much as possible.

4. The attention-based fine-grained image recognition method according to claim 1, wherein for a given feature map X e R ^C×H×W Wherein C, H, W respectively represent the number, height and width of channels, the multi-scale feature extraction module captures regions of different scales on the feature map X through rectangular regions of different sizes, and for the response regions r (i, j, [ delta ] X, [ delta ] y), wherein i and j are central positions of the response regions, and [ delta ] X, [ delta ] y are the width and the height. By varying the width and height of the regions, a set of regions, r=r (i, j, mΔx, nΔy), where m, n=1, 2,3, …, is obtained; and i<i+m△x≤W,j<j+m delta y is less than or equal to H, and rich context information of subtle changes of response areas is captured step by step, so that a group of area sets R= { R } are obtained.

5. The attention-based fine-granularity image recognition method according to claim 4, wherein for a number of different-sized regions r=r (i, j, m Δx, n Δy), a bilinear pooling, bilinear interpolation is used to generate feature vectors of fixed size to represent the regions at target coordinatesTransformed image +.>The formula is as follows:

6. The attention-based fine-grained image recognition method according to claim 1, wherein the context attention-aware module is used to capture the relationships between the multi-scale features, enabling the model to selectively focus on more relevant regions to generate overall context information, resulting in the following specific formulas for the relationships between the multi-scale features:

7. The attention-based fine granularity image recognition method of claim 1, wherein for the context vector v= { V _r Global average pooling of r=1..|r| } and the resulting contextual features f are pooled _r As the input of the multi-head attention module, the spatial arrangement information of the learning region and the long-term dependency relationship are studied, and the calculation formula of the multi-head attention is as follows:

A＝Concat(A ₁ ,A ₂ ,...,A _|R| )W ₀

q, K, V is query vector, key vector and value vector, W ₀ Is a weight matrix.

8. The attention-based fine-grained image recognition method according to claim 1, wherein a network of models is trained using a combination of cross entropy loss functions and center loss functions, the loss function formulas of the models being as follows:

L＝L _CE +λL _cent

wherein lambda is a weight coefficient, the influence of a center loss function on the total loss is measured, N is the number of categories, y _i As a true value tag, p _i Predicting labels for the models; w is the number of samples, x _i In order to train the sample,the center vector is represented by a vector of the center, I.I ₂ Representing the Euclidean distance;