CN115631369A - Fine-grained image classification method based on convolutional neural network - Google Patents

Fine-grained image classification method based on convolutional neural network

Info

Publication number
CN115631369A
Authority
CN
China
Prior art keywords
feature
channel
classification
fine
stage
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211224648.8A
Other languages
Chinese (zh)
Inventor
王坤
王延江
刘宝弟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China University of Petroleum East China
Original Assignee
China University of Petroleum East China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China University of Petroleum East China filed Critical China University of Petroleum East China
Priority to CN202211224648.8A priority Critical patent/CN115631369A/en
Publication of CN115631369A publication Critical patent/CN115631369A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/764: Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77: Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/774: Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82: Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a fine-grained image classification method based on a convolutional neural network, belonging to the technical field of fine-grained image processing. A classification network model is first constructed by fusing a channel feature re-attention module and a spatial multi-region feature attention module; a contrastive learning loss term in the loss function is then designed following the idea of contrastive learning; finally, the classification network model is used to classify images acquired in real time. The method specifically comprises the following steps: constructing a classification network model, which comprises a feature extraction network, a channel feature re-attention module, a spatial multi-region feature attention module and a classifier; constructing a training data set and carrying out model training; and acquiring images to be classified in real time and sending them to the trained classification network model to obtain the classification result of the current image. The invention effectively reduces the difficulty of fine-grained image classification and overcomes the limitations of attention mechanisms in this field.

Description

Fine-grained image classification method based on convolutional neural network
Technical Field
The invention belongs to the technical field of fine-grained image processing, and particularly relates to a fine-grained image classification method based on a convolutional neural network.
Background
In recent years, with the rapid development of deep learning, the research focus of object image classification has shifted from coarse-grained to fine-grained image classification. The fine-grained image classification problem is to identify subclasses within a base class, such as distinguishing different species of birds or vehicles of different brands. Compared with coarse-grained image classification, the differences between fine-grained categories are much subtler, and accurate discrimination is possible only by means of small local differences. Compared with object-level classification tasks such as face recognition, fine-grained classification must also cope with uncertain factors such as pose, occlusion and background interference, which makes the task very challenging. Current work mainly concerns identifying different types of birds, dogs, flowers, cars, planes, and so on.
Fine-grained image classification models have seen broad commercial demand and application scenarios in industry and everyday life in recent years: photograph-and-identify functions in mobile applications recognize flower species and car models, and in ecological protection, effective identification of different kinds of organisms is an important prerequisite for ecological research. Therefore, realizing low-cost, high-accuracy fine-grained image recognition and classification by means of computer vision technology is of great significance to both academia and industry.
Research shows that currently existing fine-grained image classification methods can be divided into methods that use only visual information and methods that add additional information. The former rely entirely on visual information to solve the classification problem, while the latter attempt to bring extra information into the classification.
Methods that use only visual information can be roughly divided into two types: methods based on localization-classification subnetworks and methods based on high-order feature coding. Methods based on localization-classification subnetworks detect and locate the discriminative parts of the object and build corresponding local feature representations. Early work used part annotations as strong supervision to focus the network on subtle differences between classes, but part annotation information is expensive to obtain. Therefore, most current mainstream methods adopt weak supervision, i.e., only image-level labels are used for classification. Methods based on high-order feature coding perform high-order integration of the features produced by the neural network to obtain more discriminative features. However, both kinds of method have limitations: most localization-classification subnetwork methods focus on the most salient parts of the object while ignoring parts that are less salient but still distinctive, so the features are not discriminative enough; methods based on high-order feature coding consume large amounts of computing resources when the channel dimension of the feature map is high, and lack interpretability.
Methods that add additional information build a joint feature representation by introducing extra information (such as web data or multimodal data, where multimodal data includes sound, textual descriptions of objects, and so on). By combining rich additional information with a deep neural network architecture, these methods achieve effective classification of fine-grained images. Their limitation is that they are designed around specific prior knowledge and cannot be applied freely to other auxiliary information.
Disclosure of Invention
In order to solve the problems in the prior art, namely the difficulty of fine-grained image classification and the limited applicability of attention mechanisms in this field, the invention provides a fine-grained image classification method based on a convolutional neural network, using a convolutional neural network that fuses a channel feature re-attention module and a spatial multi-region feature attention module to classify fine-grained images.
The technical scheme of the invention is as follows:
A fine-grained image classification method based on a convolutional neural network: first, a classification network model is constructed by fusing a channel feature re-attention module and a spatial multi-region feature attention module; then a contrastive learning loss term in the loss function is designed following the idea of contrastive learning; finally, the classification network model is used to classify images acquired in real time. The method specifically comprises the following steps:
step 1, constructing a classification network model;
the classification network model comprises a feature extraction network, a channel feature re-attention module, a spatial multi-region feature attention module and a classifier;
step 2, constructing a training data set and carrying out model training;
and step 3, acquiring the images to be classified in real time and sending them to the trained classification network model to obtain the classification result of the current image.
Further, a convolutional neural network whose last three stages provide the outputs is adopted as the feature extraction network. The feature extraction network is built from a basic convolutional network such as ResNet50, ResNet101 or DenseNet161; each such network consists of several stages, and each stage contains convolutional layers. When an image passes through the feature extraction network, each stage halves the spatial size of the feature map and doubles the number of channels. The feature maps X_l output by the several stages of the feature extraction network serve as the output features of the feature extraction network.
Further, the channel feature re-attention module first aggregates channel information using average pooling and maximum pooling, and obtains the weight of each channel in the feature map with a SoftMax function; an enhancement mask matrix E is obtained from this weight distribution, the highly weighted channels are then suppressed, and a suppression mask matrix S is obtained through a suppression function F(x); the input feature map X_l is multiplied with the enhancement mask matrix E and the suppression mask matrix S respectively to obtain the output feature maps X_l^E and X_l^S; wherein,
the SoftMax function is represented by:
Z_i = exp(x_i) / Σ_{c=1}^{C} exp(x_c)    (1)
wherein Z_i is the output value of channel i after the SoftMax function and C is the total number of output channels; the channel weight information is obtained through the SoftMax function;
the enhancement mask matrix E is calculated by:
E = SoftMax(AvgPool(X_l) + MaxPool(X_l))    (2)
wherein AvgPool(·) denotes average pooling and MaxPool(·) denotes maximum pooling;
the suppression function F(x) is represented by the following piecewise form:
F(Z_i) = ω, if Z_i ≥ δ·Z_max; F(Z_i) = 1, otherwise    (3)
wherein Z_max is the maximum output value over the channels, and ω and δ are hyper-parameters denoting, respectively, how strongly the corresponding channel is suppressed and the degree to which a channel needs to be suppressed;
the output feature maps X_l^E and X_l^S of the current stage are obtained by the following formula:
X_l^E = X_l ⊙ E, X_l^S = X_l ⊙ S    (4)
wherein ⊙ denotes element-wise multiplication;
the X_l^E of the several stages are unified in channel dimensionality by a convolutional layer Conv and then serve as the outputs of the corresponding stages; unifying the channels ensures the balance of low-level and high-level information; X_l^S is input to the subsequent stage, forcing the network to mine potential channel features containing fine-grained knowledge.
Further, the spatial multi-region feature attention module employs downsampling convolutions, a 1×1 convolution, a SoftMax function and a CCMP (cross-channel max pooling) module, wherein the downsampling convolutions keep the spatial scale of the X_l^E of the several stages consistent with the feature map X_L^E of the last stage of the network, the 1×1 convolution is used to simplify the calculation, and the SoftMax function and the CCMP module are used to calculate the similarity among the X_l^E of the several stages and to obtain a diversity learning loss L_div; L_div is negatively correlated with the similarity, and reducing the diversity loss through training makes the X_l^E of the several stages focus spatially on different discriminative parts of the object;
assume the feature maps obtained by the channel feature re-attention module in the last three stages of the feature extraction network are X_{L-2}^E ∈ R^{C_t×W_{L-2}×H_{L-2}}, X_{L-1}^E ∈ R^{C_t×W_{L-1}×H_{L-1}} and X_L^E ∈ R^{C_t×W_L×H_L}, wherein C_t denotes the unified channel dimensionality, W_{L-2} and H_{L-2} denote the width and height of the feature map of stage L-2, W_{L-1} and H_{L-1} denote the width and height of the feature map of stage L-1, and W_L and H_L denote the width and height of the feature map of stage L;
in order to reduce the amount of calculation, the feature maps are preprocessed by the following formula:
X̃_l = φ(Conv_block_l(X_l^E))    (5)
wherein φ(·) denotes the 1×1 convolution, Conv_block_l(·) denotes the downsampling convolution, and l denotes the stage the feature map comes from;
after the feature maps of the three stages, now with the same spatial size and a channel number of 1, are obtained, the SoftMax function is applied to obtain the weight of each spatial position, and the maps are then concatenated along the channel dimension to obtain X_concat, which is input into the CCMP module; CCMP responds to the peak of X_concat in the channel dimension, and the resulting elements are summed and averaged by the operation h(·) to obtain the similarity value S_i:
S_i = h(CCMP(X_concat)) = (1/k) Σ_{m=1}^{k} max_{j=1..ε} X_concat(j, m)    (6)
wherein k denotes the size of the spatial dimension of X_concat, j indexes the channels of X_concat, and ε denotes the number of channels of X_concat; the spatial multi-region feature attention module thus yields the value S_i representing the similarity between the feature maps of the stages;
finally, the diversity learning loss L_div is obtained from the similarity S_i as follows:
L_div = (1 − S_i)/ε    (7)
wherein ε denotes the number of stages of the feature extraction network used as outputs, i.e., the number of channels of X_concat.
Further, the classifier employs a SoftMax classifier, which is applied in a multi-classification task to map the outputs of a plurality of neurons into a (0, 1) interval.
Further, the total loss function L_total of the classification network model is defined as follows:
L_total = αL_cls + βL_div + γL_con    (8)
wherein L_cls denotes the cross-entropy loss, L_div the diversity learning loss and L_con the contrastive learning loss, and α, β and γ are balance parameters used to weight the respective loss terms; wherein,
the cross-entropy loss L_cls is composed of the classification loss of each stage and the classification loss of the overall representation obtained by concatenating the features of the stages, and is calculated as:
L_cls = −θ_1 Σ_l y·log(Z_{f_l}) − θ_2 y·log(Z_{f_concat})    (9)
wherein y is the ground-truth label of the input image, represented as a one-hot vector; θ_1 and θ_2 are balance parameters; the SoftMax function is used to compute the predicted label values of the neural network; cls_l(·) denotes a classifier, and Z_{f_l} = SoftMax(cls_l(f_l)) is the label prediction for the output feature f_l of stage l; cls_concat(·) denotes the classifier for the overall feature representation, and Z_{f_concat} = SoftMax(cls_concat(f_concat)) is the label prediction for the overall feature representation f_concat;
the contrastive learning loss L_con is:
L_con = (1/N²) Σ_{i,j} [ 1(y_i = y_j)·(1 − sim(z_i, z_j)) + 1(y_i ≠ y_j)·max(sim(z_i, z_j) − η, 0) ]    (10)
wherein N is the size of the input image batch; z_i and z_j are l2-normalized representations of input images within the same batch; y_i and y_j are their label values; sim(z_i, z_j) is the cosine similarity between z_i and z_j; i and j index different samples of the same batch; and η is a threshold meaning that only pairs of different classes whose similarity exceeds η contribute to the loss L_con.
Further, the specific process of step 2 is as follows:
Step 2.1, adopting the CUB_200_2011 data set as the training data set, carrying out data preprocessing on the acquired original images by horizontal flipping and center cropping to realize data expansion, and constructing the training data set;
Step 2.2, sending the fine-grained images of the training data set into the classification network model, and training and optimizing the learnable parameters in the classification network model, so that the channel feature re-attention module in the model mines the potential fine-grained knowledge in the feature maps to the greatest extent and the spatial multi-region feature attention module greatly reduces the similarity between the feature maps of different stages; when the whole model is trained to convergence, the trained classification network model is obtained.
Further, the specific process of step 3 is as follows:
First, the fine-grained image to be classified is sent into the feature extraction network with L stages, and is then input into the channel feature re-attention module to obtain the channel-enhanced feature map X_l^E and the channel-suppressed feature map X_l^S. The channel-enhanced feature map serves as the output of the current stage of the network, and the channel-suppressed feature map is sent to the subsequent stage to force the network to pay attention to information-impoverished channels that still contain fine-grained knowledge. The model training process uses the spatial multi-region feature attention module so that the channel-enhanced feature maps X_l^E output by the several stages focus on different discriminative parts of the object in the spatial dimension. The model thus obtains several output features that are discriminative in both space and channels, and the output features of the several stages are taken together as the feature representation of the image; the classification result of the current image is finally obtained through the SoftMax classifier.
The invention has the following beneficial technical effects:
the present invention greatly improves the limitations of attention mechanisms and convolutional neural network-based methods on fine-grained image classification. Through the multi-stage feature extraction network, the aggregation capability of the classification network on feature information is improved, low-level information and high-level semantic information are included, and the robustness of the extracted features is improved; through the channel characteristic re-attention module, the classification network is effectively helped to extract the channel characteristics which are ignored originally but are helpful for fine-grained classification, so that the obtained characteristics are more comprehensively represented; through the spatial multi-region feature attention module, the features output by multiple stages of the classification network respectively pay attention to different discriminative parts of the object in the spatial dimension, so that the discriminative performance of the final feature representation is improved; by fusing the loss terms of the comparison learning idea, different types of fine-grained images are treated differently, and the difference between the types is increased. In the comparison learning loss item, the idea of comparison learning is fused, different types of training images in the same input batch are set as negative samples, the same type of training images are set as positive samples, the distance between the positive samples is pulled in through the setting of a loss function, and the distance between the negative samples is pulled out, so that the classification effect of the classification network is further optimized in the training process.
Drawings
FIG. 1 is a flow chart of a fine-grained image classification method based on a convolutional neural network according to the present invention;
FIG. 2 is a schematic diagram of the overall structure of the classification network model of the present invention;
FIG. 3 is a schematic diagram of a classification network model channel feature re-attention module according to the present invention;
FIG. 4 is a schematic diagram of the spatial multi-region feature attention module of the classification network model according to the present invention.
Detailed Description
The invention is described in further detail below with reference to the following figures and detailed description:
research shows that in a plurality of fine-grained image classification methods, a convolutional neural network which integrates a channel feature attention module and a spatial multi-region feature attention module is a reliable classification idea, belongs to a weak supervision method, and can obtain more comprehensive and abundant features by taking a multi-stage convolutional network as a feature extraction network. Because the multi-stage features contain both low-level information (color, edge connection points, etc.) and high-level semantic information, the low-level information remains unchanged when the pose and background of the object change, reducing intra-class variance. Although the classification method based on deep learning and attention mechanism improves the effect of classifying fine-grained images to some extent, there are some disadvantages. For a fine-grained image classification network, in addition to extracting features which are significant and easy to distinguish, the method also helps a neural network to learn more knowledge which is helpful for fine-grained classification in the dimensions of channels and spaces of object features, namely, the method can use a channel feature re-attention module to force the network to mine knowledge in channel features with poor information content, and use a spatial multi-region feature attention module to enable multi-stage features to respectively focus on different discriminative portions of an object. A more discriminative representation of the features in channel and spatial dimensions is finally obtained.
Therefore, the invention provides a fine-grained image classification method based on a convolutional neural network, a classification network model is constructed by fusing a channel feature re-attention module and a spatial multi-region feature attention module, a contrast learning loss item in a loss function is designed by adopting a contrast learning idea, and finally, the classification network model is adopted to classify images acquired in real time. As shown in fig. 1 and fig. 2, the method specifically includes the following steps:
step 1, constructing a classification network model;
the classification network model comprises a feature extraction network, a channel feature re-attention module, a spatial multi-region feature attention module and a classifier.
The feature extraction network is built from a basic convolutional network such as ResNet50, ResNet101 or DenseNet161. These convolutional networks have similar structures and consist of multiple stages, each containing convolutional layers; when an image is input into the feature extraction network, each stage halves the spatial size of the feature map and doubles the number of channels. The feature maps X_l output by the last three stages of the network are used as the outputs of the feature extraction network.
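For illustration only (this sketch is not part of the patent disclosure), the multi-stage feature extraction described above might be realized in PyTorch as follows; the wrapper name MultiStageBackbone and the use of torchvision's ResNet-50 layer2/layer3/layer4 as the last three stages are assumptions.

```python
import torch.nn as nn
from torchvision.models import resnet50

class MultiStageBackbone(nn.Module):
    """Returns the feature maps of the last three stages of ResNet-50.

    As described above, each stage halves the spatial size of the
    feature map and doubles the number of channels."""
    def __init__(self):
        super().__init__()
        net = resnet50(weights="IMAGENET1K_V1")
        self.stem = nn.Sequential(net.conv1, net.bn1, net.relu, net.maxpool)
        self.stage1 = net.layer1                 # 256 channels
        self.stage2 = net.layer2                 # 512 channels,  stage L-2
        self.stage3 = net.layer3                 # 1024 channels, stage L-1
        self.stage4 = net.layer4                 # 2048 channels, stage L

    def forward(self, x):
        x = self.stage1(self.stem(x))
        x_l2 = self.stage2(x)                    # X_{L-2}
        x_l1 = self.stage3(x_l2)                 # X_{L-1}
        x_l = self.stage4(x_l1)                  # X_L
        return [x_l2, x_l1, x_l]
```

Note that in the full model the channel feature re-attention module sits between the stages, so that the suppressed map rather than the raw stage output feeds the next stage; the plain wrapper above only shows the baseline data flow.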
As shown in fig. 3, the channel feature re-attention module first aggregates channel information using average pooling and maximum pooling operations, and obtains the weight of each channel in the feature map with a SoftMax function; an enhancement mask matrix E is obtained from this weight distribution, the highly weighted channels are then suppressed, and a suppression mask matrix S is obtained through a suppression function F(x). The input feature map X_l is multiplied with the enhancement mask matrix E and the suppression mask matrix S respectively to obtain the output feature maps X_l^E and X_l^S; wherein,
the SoftMax function may be represented by:
Z_i = exp(x_i) / Σ_{c=1}^{C} exp(x_c)    (1)
wherein Z_i is the output value of channel i after the SoftMax function and C is the total number of output channels; the channel weight information can be obtained through the SoftMax function.
The enhancement mask matrix E may be calculated by:
E = SoftMax(AvgPool(X_l) + MaxPool(X_l))    (2)
wherein AvgPool(·) denotes average pooling and MaxPool(·) denotes maximum pooling.
The suppression function F(x) may be represented by the following piecewise form:
F(Z_i) = ω, if Z_i ≥ δ·Z_max; F(Z_i) = 1, otherwise    (3)
wherein Z_i is the output value of each channel after the SoftMax function, Z_max is the maximum output value over the channels, and ω and δ are hyper-parameters denoting, respectively, how strongly the corresponding channel is suppressed and the degree to which a channel needs to be suppressed.
The output feature maps X_l^E and X_l^S of the current stage can be obtained by the following formula:
X_l^E = X_l ⊙ E, X_l^S = X_l ⊙ S    (4)
wherein ⊙ denotes element-wise multiplication.
The X_l^E of the several stages are unified in channel dimensionality by a convolutional layer Conv and then serve as the outputs of the corresponding stages; unifying the channels ensures the balance of low-level and high-level information. X_l^S is input to the subsequent stage, forcing the network to mine potential channel features that contain fine-grained knowledge.
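As a concrete illustration of formulas (1)-(4), a minimal PyTorch sketch of the channel feature re-attention step follows; since the patent gives the suppression function F only descriptively, its piecewise form here (suppress to ω any channel whose weight reaches δ·Z_max) and the names ChannelReAttention, omega and delta are assumptions.

```python
import torch
import torch.nn as nn

class ChannelReAttention(nn.Module):
    """Channel feature re-attention sketch following formulas (1)-(4).

    Returns the enhanced map X^E (output of the current stage) and the
    suppressed map X^S (input to the subsequent stage)."""
    def __init__(self, omega=0.1, delta=0.7):
        super().__init__()
        self.omega = omega        # suppression strength for strong channels
        self.delta = delta        # fraction of Z_max that triggers suppression

    def forward(self, x):                          # x: (B, C, H, W)
        b, c, _, _ = x.shape
        avg = x.mean(dim=(2, 3))                   # AvgPool over space, (B, C)
        mx = x.amax(dim=(2, 3))                    # MaxPool over space, (B, C)
        e = torch.softmax(avg + mx, dim=1)         # enhancement mask E, (2)
        z_max = e.amax(dim=1, keepdim=True)        # Z_max per sample
        s = torch.where(e >= self.delta * z_max,   # suppression mask S, (3)
                        torch.full_like(e, self.omega),
                        torch.ones_like(e))
        x_e = x * e.view(b, c, 1, 1)               # X^E = X ⊙ E, formula (4)
        x_s = x * s.view(b, c, 1, 1)               # X^S = X ⊙ S
        return x_e, x_s
```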
As shown in FIG. 4, the spatial multi-region feature attention module employs downsampling convolutions, a 1×1 convolution, a SoftMax function and a CCMP (cross-channel max pooling) module, wherein the downsampling convolutions keep the spatial scale of the X_l^E of the several stages consistent with the feature map X_L^E of the last stage of the network, the 1×1 convolution is used to simplify the calculation, and the SoftMax function and the CCMP module are used to calculate the similarity among the X_l^E of the several stages and to obtain a diversity learning loss L_div. L_div is negatively correlated with the similarity; reducing the diversity loss through training makes the X_l^E of the several stages focus spatially on different discriminative parts of the object.
Assume the feature maps obtained by the channel feature re-attention module in the last three stages of the feature extraction network are X_{L-2}^E ∈ R^{C_t×W_{L-2}×H_{L-2}}, X_{L-1}^E ∈ R^{C_t×W_{L-1}×H_{L-1}} and X_L^E ∈ R^{C_t×W_L×H_L}, wherein C_t denotes the unified channel dimensionality (equal to 1 in the present invention), W_{L-2} and H_{L-2} denote the width and height of the feature map of stage L-2, W_{L-1} and H_{L-1} denote the width and height of the feature map of stage L-1, and W_L and H_L denote the width and height of the feature map of stage L.
In order to reduce the amount of calculation, the feature maps are preprocessed by the following formula:
X̃_l = φ(Conv_block_l(X_l^E))    (5)
wherein φ(·) denotes the 1×1 convolution, Conv_block_l(·) denotes the downsampling convolution, and l denotes the stage the feature map comes from.
This yields feature maps from the three stages with the same spatial size and a channel number of 1. In order to explore the similarity of the three stages' feature maps in the spatial dimension, the SoftMax function is applied to obtain the weight of each spatial position, and the maps are then concatenated along the channel dimension to obtain X_concat, which is input into the CCMP module. CCMP is cross-channel max pooling; it responds to the peak of X_concat in the channel dimension, and the resulting elements are summed and averaged by the operation h(·) to obtain the similarity value S_i:
S_i = h(CCMP(X_concat)) = (1/k) Σ_{m=1}^{k} max_{j=1..ε} X_concat(j, m)    (6)
wherein k denotes the size of the spatial dimension of X_concat, j indexes the channels of X_concat, and ε denotes the number of channels of X_concat; the spatial multi-region feature attention module thus yields the value S_i representing the similarity between the feature maps of the stages. The larger the value of S_i, the higher the similarity between the feature maps; to make the classification model focus on several different parts of the object, the similarity between the feature maps, i.e. S_i, is reduced during training.
Finally, the diversity learning loss L_div is obtained from the similarity S_i as follows:
L_div = (1 − S_i)/ε    (7)
wherein ε denotes the number of stages of the feature extraction network used as outputs, i.e., the number of channels of X_concat, which is 3 in the present invention.
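The following sketch shows one plausible realization of the preprocessing of formula (5), the CCMP similarity of formula (6) and the diversity loss of formula (7); the stride choices of the downsampling convolutions (matched to stage feature maps whose sides differ by factors of 4, 2 and 1) and the module name are assumptions.

```python
import torch
import torch.nn as nn

class SpatialMultiRegionAttention(nn.Module):
    """Computes the diversity loss L_div over the last three stages' X^E.

    Each map is downsampled to the spatial size of stage L, projected to one
    channel (formula (5)), softmax-normalised over space, concatenated, and
    scored by cross-channel max pooling plus spatial averaging (formula (6));
    L_div = (1 - S_i) / eps (formula (7))."""
    def __init__(self, channels=(512, 1024, 2048), strides=(4, 2, 1)):
        super().__init__()
        self.down = nn.ModuleList([
            nn.Conv2d(c, c, kernel_size=3, stride=s, padding=1)  # Conv_block_l
            for c, s in zip(channels, strides)])
        self.proj = nn.ModuleList([
            nn.Conv2d(c, 1, kernel_size=1) for c in channels])   # phi(.)

    def forward(self, feats):                 # [X_{L-2}^E, X_{L-1}^E, X_L^E]
        maps = []
        for f, down, proj in zip(feats, self.down, self.proj):
            m = proj(down(f))                 # formula (5): one-channel map
            b, _, h, w = m.shape              # softmax over spatial positions
            maps.append(torch.softmax(m.view(b, -1), dim=1).view(b, 1, h, w))
        x_concat = torch.cat(maps, dim=1)     # (B, eps, H_L, W_L), eps = 3
        peak = x_concat.amax(dim=1)           # CCMP: max across the channels
        s_i = peak.mean(dim=(1, 2))           # h(.): average over k positions
        return (1.0 - s_i).mean() / x_concat.shape[1]   # formula (7)
```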
The classifier adopts a SoftMax classifier, which is used in multi-classification tasks and maps the outputs of a plurality of neurons into the (0, 1) interval; these outputs can be understood as probabilities, enabling multi-class classification.
In addition, the total loss function L_total of the classification network model is defined as follows:
L_total = αL_cls + βL_div + γL_con    (8)
wherein L_cls denotes the cross-entropy loss, L_div the diversity learning loss and L_con the contrastive learning loss, and α, β and γ are balance parameters used to weight the respective loss terms; wherein,
the cross-entropy loss L_cls is composed of the classification loss of each stage and the classification loss of the overall representation obtained by concatenating the features of the stages, and is calculated as:
L_cls = −θ_1 Σ_l y·log(Z_{f_l}) − θ_2 y·log(Z_{f_concat})    (9)
wherein y is the ground-truth label of the input image, represented as a one-hot vector; θ_1 and θ_2 are also balance parameters; the SoftMax function is used to compute the predicted label values of the neural network; cls_l(·) denotes a classifier, and Z_{f_l} = SoftMax(cls_l(f_l)) is the label prediction for the output feature f_l of stage l; cls_concat(·) denotes the classifier for the overall feature representation, and Z_{f_concat} = SoftMax(cls_concat(f_concat)) is the label prediction for the overall feature representation f_concat.
The diversity learning loss L_div is calculated as in formula (7) above.
The contrastive learning loss L_con is:
L_con = (1/N²) Σ_{i,j} [ 1(y_i = y_j)·(1 − sim(z_i, z_j)) + 1(y_i ≠ y_j)·max(sim(z_i, z_j) − η, 0) ]    (10)
wherein N is the size of the input image batch; z_i and z_j are l2-normalized representations of input images within the same batch; y_i and y_j are their label values; sim(z_i, z_j) is the cosine similarity between z_i and z_j; i and j index different samples of the same batch; and η is a threshold meaning that only pairs of different classes whose similarity exceeds η contribute to the loss L_con.
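Under the reconstructions of formulas (8)-(10) given above, the total training loss could be sketched as follows; the exact pairwise form of L_con is inferred from the description (same-class pairs are pulled together, different-class pairs with similarity above η are pushed apart) and should be read as an assumption rather than the patent's definitive formula.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(z, y, eta=0.3):
    """Sketch of L_con (formula (10)). z: (N, D) batch of features,
    y: (N,) labels, eta: similarity threshold for negative pairs."""
    z = F.normalize(z, dim=1)                    # l2 regularisation
    sim = z @ z.t()                              # cosine similarity matrix
    pos = (y.unsqueeze(0) == y.unsqueeze(1)).float()
    pos.fill_diagonal_(0)                        # ignore self-pairs
    neg = 1.0 - pos
    neg.fill_diagonal_(0)
    loss = pos * (1.0 - sim) + neg * torch.clamp(sim - eta, min=0.0)
    return loss.sum() / (z.shape[0] ** 2)

def total_loss(stage_logits, concat_logits, target, l_div, z, y,
               alpha=1.0, beta=1.0, gamma=1.0, theta1=1.0, theta2=1.0):
    """L_total = alpha*L_cls + beta*L_div + gamma*L_con (formula (8)),
    with L_cls combining the per-stage and concatenated-feature
    cross-entropy terms of formula (9)."""
    l_cls = theta1 * sum(F.cross_entropy(s, target) for s in stage_logits)
    l_cls = l_cls + theta2 * F.cross_entropy(concat_logits, target)
    return alpha * l_cls + beta * l_div + gamma * contrastive_loss(z, y)
```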
Step 2, constructing a training data set and carrying out model training; the specific process is as follows:
Step 2.1, adopting the CUB_200_2011 data set as the training data set, carrying out data preprocessing on the acquired original images by means of horizontal flipping, center cropping and the like to realize data expansion, and constructing the training data set;
the CUB _200_2011 dataset was a fine-grained dataset proposed by the california institute of technology in 2010, which is also the baseline image dataset for current fine-grained classification and identification studies. It has 11788 bird images, including 200 bird species, where the training data set has 5994 images and the test set has 5794 images, each of which provides image-class tagging information.
Step 2.2, sending the fine-grained images of the training data set into the classification network model, and training and optimizing the learnable parameters in the classification network model, so that the channel feature re-attention module in the model mines the potential fine-grained knowledge in the feature maps to the greatest extent and the spatial multi-region feature attention module greatly reduces the similarity between the feature maps of different stages; when the whole model is trained to convergence, the trained classification network model is obtained.
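A typical torchvision preprocessing pipeline matching step 2.1 (horizontal flipping and center cropping) and a bare training step might look like the sketch below; the 448-pixel crop, the optimizer settings and the directory layout are illustrative assumptions, not values stated in the patent.

```python
import torch
from torchvision import datasets, transforms

# Step 2.1: data preprocessing / expansion by flipping and center cropping.
train_tf = transforms.Compose([
    transforms.Resize(512),
    transforms.CenterCrop(448),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])
train_set = datasets.ImageFolder("CUB_200_2011/train", transform=train_tf)
loader = torch.utils.data.DataLoader(train_set, batch_size=16, shuffle=True)

# Step 2.2: optimise the learnable parameters until convergence.
# `model` and `criterion` stand for the classification network and the
# total loss sketched above (hypothetical helpers, named here for clarity).
# optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
# for images, labels in loader:
#     loss = criterion(model(images), labels)
#     optimizer.zero_grad(); loss.backward(); optimizer.step()
```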
Step 3, acquiring the images to be classified in real time and sending them to the trained classification network model to obtain the classification result of the current image. The specific process is as follows:
First, the fine-grained image to be classified is sent into the feature extraction network with L stages, and is then input into the channel feature re-attention module to obtain the channel-enhanced feature map X_l^E and the channel-suppressed feature map X_l^S. The channel-enhanced feature map serves as the output of the current stage of the network, and the channel-suppressed feature map is sent to the subsequent stage to force the network to pay attention to information-impoverished channels that still contain fine-grained knowledge. Model training has already used the spatial multi-region feature attention module to make the channel-enhanced feature maps X_l^E output by the several stages focus on different discriminative parts of the object in the spatial dimension. The model therefore obtains several output features that are discriminative in both space and channels, and the output features of the several stages are taken together as the feature representation of the image; the classification result of the current image is finally obtained through the SoftMax classifier.
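Correspondingly, step 3 reduces to a single forward pass; a minimal sketch, assuming the trained model returns the logits of the concatenated multi-stage representation, is:

```python
import torch

@torch.no_grad()
def classify(model, image):
    """Step 3 sketch: classify one preprocessed image tensor (C, H, W)."""
    model.eval()
    logits = model(image.unsqueeze(0))      # add the batch dimension
    probs = torch.softmax(logits, dim=1)    # SoftMax classifier output
    return probs.argmax(dim=1).item()       # predicted class index
```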
The invention provides a fine-grained image classification method based on a convolutional neural network that fuses a channel feature re-attention module and a spatial multi-region feature attention module, combining convolutional neural networks from deep learning with improved attention modules to classify fine-grained images. The method remedies, to the greatest extent, the shortcomings of attention mechanisms in this task and strengthens the feature extraction capability of the basic convolutional network. In the classification network model, the proposed channel feature re-attention module improves the utilization of features within the network while adding almost no learnable parameters to the original network, better learns the potential fine-grained knowledge contained in channel features that benefit fine-grained classification, and helps control overfitting on tasks with smaller training sets (such as the CUB_200_2011 bird data set used in the invention). The spatial multi-region feature attention module makes the feature maps output by the several stages of the classification network attend to different discriminative parts of the object in space rather than all concentrating on its most salient part. A contrastive learning loss term designed into the loss function fuses the idea of contrastive learning and improves the classification performance of the network model. The invention solves the problems that context cannot be fully utilized when deep networks extract features in fine-grained image classification tasks, and that attention mechanisms attend only to the most salient channel and spatial features of an object.
It is to be understood that the above description is not intended to limit the present invention, and the present invention is not limited to the above examples, and those skilled in the art may make modifications, alterations, additions or substitutions within the spirit and scope of the present invention.

Claims (8)

1. A fine-grained image classification method based on a convolutional neural network is characterized by comprising the steps of firstly constructing a classification network model by fusing a channel feature re-attention module and a spatial multi-region feature attention module, then designing a contrast learning loss item in a loss function by adopting a contrast learning idea, and finally classifying images acquired in real time by adopting the classification network model; the method specifically comprises the following steps:
step 1, constructing a classification network model;
the classification network model comprises a feature extraction network, a channel feature re-attention module, a spatial multi-region feature attention module and a classifier;
step 2, constructing a training data set and carrying out model training;
and step 3, acquiring the images to be classified in real time and sending them to the trained classification network model to obtain the classification result of the current image.
2. The fine-grained image classification method based on a convolutional neural network according to claim 1, characterized in that a convolutional neural network whose last three stages provide the outputs is used as the feature extraction network; the feature extraction network is built from a basic convolutional network such as ResNet50, ResNet101 or DenseNet161, each convolutional network structure consists of several stages, and each stage contains convolutional layers; when an image is input into the feature extraction network, each stage halves the spatial size of the feature map and doubles the number of channels, and the feature maps X_l output by the several stages of the feature extraction network serve as the output features of the feature extraction network.
3. The fine-grained image classification method based on a convolutional neural network according to claim 1, characterized in that the channel feature re-attention module first aggregates channel information using average pooling and maximum pooling operations, and obtains the weight of each channel in the feature map with a SoftMax function; an enhancement mask matrix E is obtained from the weight distribution, the highly weighted channels are then suppressed, and a suppression mask matrix S is obtained through a suppression function F(x); the input feature map X_l is multiplied with the enhancement mask matrix E and the suppression mask matrix S respectively to obtain the output feature maps X_l^E and X_l^S; wherein,
the SoftMax function is represented by:
Z_i = exp(x_i) / Σ_{c=1}^{C} exp(x_c)    (1)
wherein Z_i is the output value of channel i after the SoftMax function and C is the total number of output channels; the channel weight information is obtained through the SoftMax function;
the enhancement mask matrix E is calculated by:
E = SoftMax(AvgPool(X_l) + MaxPool(X_l))    (2)
wherein AvgPool(·) denotes average pooling and MaxPool(·) denotes maximum pooling;
the suppression function F(x) is represented by the following piecewise form:
F(Z_i) = ω, if Z_i ≥ δ·Z_max; F(Z_i) = 1, otherwise    (3)
wherein Z_max is the maximum output value over the channels, and ω and δ are hyper-parameters denoting, respectively, how strongly the corresponding channel is suppressed and the degree to which a channel needs to be suppressed;
the output feature maps X_l^E and X_l^S of the current stage are obtained by the following formula:
X_l^E = X_l ⊙ E, X_l^S = X_l ⊙ S    (4)
wherein ⊙ denotes element-wise multiplication;
the X_l^E of the several stages are unified in channel dimensionality by a convolutional layer Conv and then serve as the outputs of the corresponding stages, and unifying the channels ensures the balance of low-level and high-level information; X_l^S is input to the subsequent stage, forcing the network to mine potential channel features containing fine-grained knowledge.
4. The fine-grained image classification method based on a convolutional neural network according to claim 1, characterized in that the spatial multi-region feature attention module employs downsampling convolutions, a 1×1 convolution, a SoftMax function and a CCMP module, wherein the downsampling convolutions keep the spatial scale of the X_l^E of the several stages consistent with the feature map X_L^E of the last stage of the network, the 1×1 convolution is used to simplify the calculation, and the SoftMax function and the CCMP module are used to calculate the similarity among the X_l^E of the several stages and to obtain a diversity learning loss L_div; L_div is negatively correlated with the similarity, and reducing the diversity loss through training makes the X_l^E of the several stages focus spatially on different discriminative parts of the object;
assume the feature maps obtained by the channel feature re-attention module in the last three stages of the feature extraction network are X_{L-2}^E ∈ R^{C_t×W_{L-2}×H_{L-2}}, X_{L-1}^E ∈ R^{C_t×W_{L-1}×H_{L-1}} and X_L^E ∈ R^{C_t×W_L×H_L}, wherein C_t denotes the unified channel dimensionality, W_{L-2} and H_{L-2} denote the width and height of the feature map of stage L-2, W_{L-1} and H_{L-1} denote the width and height of the feature map of stage L-1, and W_L and H_L denote the width and height of the feature map of stage L;
in order to reduce the amount of calculation, the feature maps are preprocessed by the following formula:
X̃_l = φ(Conv_block_l(X_l^E))    (5)
wherein φ(·) denotes the 1×1 convolution, Conv_block_l(·) denotes the downsampling convolution, and l denotes the stage the feature map comes from;
after the feature maps of the three stages, now with the same spatial size and a channel number of 1, are obtained, the SoftMax function is applied to obtain the weight of each spatial position, and the maps are then concatenated along the channel dimension to obtain X_concat, which is input into the CCMP module; CCMP responds to the peak of X_concat in the channel dimension, and the resulting elements are summed and averaged by the operation h(·) to obtain the similarity value S_i:
S_i = h(CCMP(X_concat)) = (1/k) Σ_{m=1}^{k} max_{j=1..ε} X_concat(j, m)    (6)
wherein k denotes the size of the spatial dimension of X_concat, j indexes the channels of X_concat, and ε denotes the number of channels of X_concat; the spatial multi-region feature attention module thus yields the value S_i representing the similarity between the feature maps of the stages;
finally, the diversity learning loss L_div is obtained from the similarity S_i as follows,
L_div = (1 − S_i)/ε    (7)
wherein ε denotes the number of stages of the feature extraction network used as outputs, i.e., the number of channels of X_concat.
5. The fine-grained image classification method based on convolutional neural network of claim 1, wherein the classifier adopts a SoftMax classifier and is applied in a multi-classification task to map the output of a plurality of neurons into a (0, 1) interval.
6. The fine-grained image classification method based on a convolutional neural network according to claim 1, characterized in that the total loss function L_total of the classification network model is defined as follows:
L_total = αL_cls + βL_div + γL_con    (8)
wherein L_cls denotes the cross-entropy loss, L_div the diversity learning loss and L_con the contrastive learning loss, and α, β and γ are balance parameters used to weight the respective loss terms; wherein,
the cross-entropy loss L_cls is composed of the classification loss of each stage and the classification loss of the overall representation obtained by concatenating the features of the stages, and is calculated as:
L_cls = −θ_1 Σ_l y·log(Z_{f_l}) − θ_2 y·log(Z_{f_concat})    (9)
wherein y is the ground-truth label of the input image, represented as a one-hot vector; θ_1 and θ_2 are balance parameters; the SoftMax function is used to compute the predicted label values of the neural network; cls_l(·) denotes a classifier, and Z_{f_l} = SoftMax(cls_l(f_l)) is the label prediction for the output feature f_l of stage l; cls_concat(·) denotes the classifier for the overall feature representation, and Z_{f_concat} = SoftMax(cls_concat(f_concat)) is the label prediction for the overall feature representation f_concat;
the contrastive learning loss L_con is:
L_con = (1/N²) Σ_{i,j} [ 1(y_i = y_j)·(1 − sim(z_i, z_j)) + 1(y_i ≠ y_j)·max(sim(z_i, z_j) − η, 0) ]    (10)
wherein N is the size of the input image batch; z_i and z_j are l2-normalized representations of input images within the same batch; y_i and y_j are their label values; sim(z_i, z_j) is the cosine similarity between z_i and z_j; i and j index different samples of the same batch; and η is a threshold meaning that only pairs of different classes whose similarity exceeds η contribute to the loss L_con.
7. The fine-grained image classification method based on the convolutional neural network as claimed in claim 1, wherein the specific process of the step 2 is as follows:
Step 2.1, adopting the CUB_200_2011 data set as the training data set, carrying out data preprocessing on the acquired original images by horizontal flipping and center cropping to realize data expansion, and constructing the training data set;
Step 2.2, sending the fine-grained images of the training data set into the classification network model, and training and optimizing the learnable parameters in the classification network model, so that the channel feature re-attention module in the model mines the potential fine-grained knowledge in the feature maps to the greatest extent and the spatial multi-region feature attention module greatly reduces the similarity between the feature maps of different stages; when the whole model is trained to convergence, the trained classification network model is obtained.
8. The fine-grained image classification method based on the convolutional neural network as claimed in claim 1, wherein the specific process of step 3 is as follows:
first, the fine-grained image to be classified is sent into the feature extraction network with L stages, and is then input into the channel feature re-attention module to obtain the channel-enhanced feature map X_l^E and the channel-suppressed feature map X_l^S; the channel-enhanced feature map serves as the output of the current stage of the network, and the channel-suppressed feature map is sent to the subsequent stage to force the network to pay attention to information-impoverished channels containing fine-grained knowledge; the model training process uses the spatial multi-region feature attention module to make the channel-enhanced feature maps X_l^E output by the several stages focus on different discriminative parts of the object in the spatial dimension; the model therefore obtains several output features that are discriminative in both space and channels, and the output features of the several stages are finally taken as the feature representation of the image; the classification result of the current image is finally obtained through the SoftMax classifier.
CN202211224648.8A 2022-10-09 2022-10-09 Fine-grained image classification method based on convolutional neural network Pending CN115631369A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211224648.8A CN115631369A (en) 2022-10-09 2022-10-09 Fine-grained image classification method based on convolutional neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211224648.8A CN115631369A (en) 2022-10-09 2022-10-09 Fine-grained image classification method based on convolutional neural network

Publications (1)

Publication Number Publication Date
CN115631369A true CN115631369A (en) 2023-01-20

Family

ID=84904512

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211224648.8A Pending CN115631369A (en) 2022-10-09 2022-10-09 Fine-grained image classification method based on convolutional neural network

Country Status (1)

Country Link
CN (1) CN115631369A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116664911A (en) * 2023-04-17 2023-08-29 山东第一医科大学附属肿瘤医院(山东省肿瘤防治研究院、山东省肿瘤医院) Breast tumor image classification method based on interpretable deep learning
CN116452896A (en) * 2023-06-16 2023-07-18 中国科学技术大学 Method, system, device and medium for improving fine-grained image classification performance
CN116452896B (en) * 2023-06-16 2023-10-20 中国科学技术大学 Method, system, device and medium for improving fine-grained image classification performance
CN116994032A (en) * 2023-06-28 2023-11-03 河北大学 Rectal polyp multi-classification method based on deep learning
CN116994032B (en) * 2023-06-28 2024-02-27 河北大学 Rectal polyp multi-classification method based on deep learning
CN117011718A (en) * 2023-10-08 2023-11-07 之江实验室 Plant leaf fine granularity identification method and system based on multiple loss fusion
CN117011718B (en) * 2023-10-08 2024-02-02 之江实验室 Plant leaf fine granularity identification method and system based on multiple loss fusion

Similar Documents

Publication Publication Date Title
Bouti et al. A robust system for road sign detection and classification using LeNet architecture based on convolutional neural network
CN110532920B (en) Face recognition method for small-quantity data set based on FaceNet method
Yuan et al. Gated CNN: Integrating multi-scale feature layers for object detection
CN115631369A (en) Fine-grained image classification method based on convolutional neural network
CN111639564B (en) Video pedestrian re-identification method based on multi-attention heterogeneous network
CN107977661B (en) Region-of-interest detection method based on FCN and low-rank sparse decomposition
Roecker et al. Automatic vehicle type classification with convolutional neural networks
CN110287798B (en) Vector network pedestrian detection method based on feature modularization and context fusion
Vaidya et al. Deep learning architectures for object detection and classification
Manssor et al. Real-time human detection in thermal infrared imaging at night using enhanced Tiny-yolov3 network
CN116798070A (en) Cross-mode pedestrian re-recognition method based on spectrum sensing and attention mechanism
Yu et al. WaterHRNet: A multibranch hierarchical attentive network for water body extraction with remote sensing images
US20220301311A1 (en) Efficient self-attention for video processing
Ajagbe et al. Performance investigation of two-stage detection techniques using traffic light detection dataset
Wang et al. Pedestrian detection in infrared image based on depth transfer learning
Singh et al. CNN based approach for traffic sign recognition system
Kustikova et al. A survey of deep learning methods and software for image classification and object detection
Sabater et al. Event Transformer+. A multi-purpose solution for efficient event data processing
Akanksha et al. A Feature Extraction Approach for Multi-Object Detection Using HoG and LTP.
CN117372853A (en) Underwater target detection algorithm based on image enhancement and attention mechanism
Vijayalakshmi K et al. Copy-paste forgery detection using deep learning with error level analysis
Li A deep learning-based text detection and recognition approach for natural scenes
Zhou et al. Semantic image segmentation using low-level features and contextual cues
CN116797821A (en) Generalized zero sample image classification method based on fusion visual information
CN112668643B (en) Semi-supervised significance detection method based on lattice tower rule

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination