CN115631369A - Fine-grained image classification method based on convolutional neural network - Google Patents
Fine-grained image classification method based on convolutional neural network
- Publication number
- CN115631369A (application CN202211224648.8A)
- Authority
- CN
- China
- Prior art keywords
- feature
- channel
- classification
- fine
- stage
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/764—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/774—Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Evolutionary Computation (AREA)
- Physics & Mathematics (AREA)
- Artificial Intelligence (AREA)
- General Physics & Mathematics (AREA)
- Computing Systems (AREA)
- Health & Medical Sciences (AREA)
- General Health & Medical Sciences (AREA)
- Software Systems (AREA)
- Databases & Information Systems (AREA)
- Medical Informatics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Multimedia (AREA)
- Life Sciences & Earth Sciences (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Molecular Biology (AREA)
- General Engineering & Computer Science (AREA)
- Mathematical Physics (AREA)
- Image Analysis (AREA)
Abstract
The invention discloses a fine-grained image classification method based on a convolutional neural network, belonging to the technical field of fine-grained image processing. A classification network model is first constructed by fusing a channel feature re-attention module and a spatial multi-region feature attention module; a contrastive learning loss term is then designed in the loss function following the idea of contrastive learning; finally, the classification network model is used to classify images acquired in real time. The method specifically comprises the following steps: constructing a classification network model, which comprises a feature extraction network, a channel feature re-attention module, a spatial multi-region feature attention module and a classifier; constructing a training data set and training the model; and acquiring images to be classified in real time and sending them to the trained classification network model to obtain the classification result of the current image. The invention effectively reduces the difficulty of classifying fine-grained images and addresses the limitations of attention mechanisms in this field.
Description
Technical Field
The invention belongs to the technical field of fine-grained image processing, and particularly relates to a fine-grained image classification method based on a convolutional neural network.
Background
In recent years, deep learning has developed rapidly, and the research focus of object image classification has shifted from coarse-grained to fine-grained image classification. Fine-grained image classification aims to identify subclasses under a base class, such as distinguishing different species of birds or vehicles of different brands. Compared with coarse-grained image classification, the differences between fine-grained image categories are subtler, and accurate discrimination is possible only by relying on small local differences. Compared with object-level classification tasks such as face recognition, the differences within fine-grained image categories are also subtle, and various uncertain factors such as pose, occlusion and background interference are present, so the task is very challenging. Current research subjects mainly include identifying different kinds of birds, dogs, flowers, cars, planes, etc.
In recent years, fine-grained image classification neural network models have found wide business demand and application scenarios in industry and in daily life. Examples include the photograph-and-identify functions of mobile phone apps for recognizing flowers and cars; in ecological protection, the effective identification of different kinds of organisms is an important prerequisite for ecological research. Therefore, achieving low-cost, high-accuracy fine-grained image recognition and classification by means of computer vision technology is of great significance to both academia and industry.
Research shows that existing fine-grained image classification methods can be divided into methods that use only visual information and methods that add additional information. The former rely entirely on visual information to solve the classification problem, while the latter attempt to bring additional information into the classification.
Methods that use only visual information can be roughly divided into two types: methods based on localization-classification subnetworks and methods based on high-order feature encoding. Methods based on localization-classification subnetworks detect and localize the discriminative parts of the object and build corresponding local feature representations. Early work used part annotations as strong supervision to focus the network on subtle differences between classes, but part annotations are expensive to obtain. The current mainstream methods therefore mostly adopt weak supervision, that is, they use only image-level labels for classification. Methods based on high-order feature encoding perform high-order integration of the features generated by the neural network to obtain more discriminative features. However, both types have their own limitations: most methods based on localization-classification subnetworks focus on the most salient parts of the object while ignoring parts that are inconspicuous but distinctive, which makes the features insufficiently discriminative. Methods based on high-order feature encoding consume a large amount of computing resources when the channel dimension of the feature map is high, and they lack interpretability.
Methods that add additional information build a joint feature representation by adding additional information (such as network data, multimodal data, etc.), wherein the multimodal data in turn comprises sound, text descriptions of objects, etc. By combining rich additional information with a deep neural network architecture, these methods achieve effective classification of fine-grained images. Their limitation is that they are designed for specific prior knowledge and cannot be applied arbitrarily to other auxiliary information.
Disclosure of Invention
In order to solve the prior-art problems that fine-grained images are difficult to classify and that the application of attention mechanisms in this field is limited, the invention provides a fine-grained image classification method based on a convolutional neural network, in which a convolutional neural network integrating a channel feature re-attention module and a spatial multi-region feature attention module is used to classify fine-grained images.
The technical scheme of the invention is as follows:
A fine-grained image classification method based on a convolutional neural network, wherein a classification network model is first constructed by fusing a channel feature re-attention module and a spatial multi-region feature attention module, a contrastive learning loss term in the loss function is then designed following the idea of contrastive learning, and the classification network model is finally used to classify images acquired in real time; the method specifically comprises the following steps:
step 1, constructing a classification network model;
the classification network model comprises a feature extraction network, a channel feature re-attention module, a spatial multi-region feature attention module and a classifier;
step 2, constructing a training data set and carrying out model training;
step 3, acquiring the images to be classified in real time and sending them to the trained classification network model to obtain the classification result of the current image.
Further, a convolutional neural network whose last three stages provide outputs is adopted as the feature extraction network. The feature extraction network is built on a basic convolutional network selected from ResNet50, ResNet101 and DenseNet161; each such network consists of a plurality of stages, and each stage comprises a convolutional layer. When an image is input into the feature extraction network, the spatial size of the feature map is halved and the number of channels is doubled after each stage, and the feature maps X_l output by the plurality of stages serve as the output features of the feature extraction network.
Further, the channel feature re-attention module first integrates feature channel information using average pooling and max pooling, and obtains the weight of each channel in the feature map using a SoftMax function; an enhanced mask matrix E is obtained from the weight distribution; channels with high weights are then suppressed, a suppressed mask matrix S being obtained through a suppression function F(x); the input feature map X_l is multiplied by the enhanced mask matrix E and by the suppressed mask matrix S to obtain a channel-enhanced output feature map and a channel-suppressed output feature map, respectively; wherein,
the SoftMax function is represented by:
$$Z_i = \frac{e^{x_i}}{\sum_{c=1}^{C} e^{x_c}} \tag{1}$$
wherein x_i is the value of the i-th channel before the SoftMax function, Z_i is the output value of each channel after the SoftMax function, and C is the total number of output channels; the channel weight information is obtained through the SoftMax function;
the enhanced mask matrix E is calculated by:
$$E = \mathrm{SoftMax}(\mathrm{AvgPool}(X_l) + \mathrm{MaxPool}(X_l)) \tag{2}$$
wherein AvgPool(·) represents average pooling and MaxPool(·) represents max pooling;
the suppression function F(x) is given by formula (3), wherein Z_max is the maximum channel output value, and ω and δ are hyper-parameters that respectively represent the degree to which the corresponding channel is suppressed and the degree to which channels need to be suppressed;
the output feature maps are obtained by multiplying the input feature map X_l by E and by S respectively, as in formula (4), wherein ⊗ represents element-wise multiplication;
the channel-enhanced feature maps of the plurality of stages pass through a convolutional layer Conv that unifies their channel dimension and are then used as the outputs of the corresponding stages, the channel unification ensuring a balance between low-level and high-level information; the channel-suppressed feature map is input to the subsequent stage, forcing the network to mine potential channel features containing fine-grained knowledge.
Further, the spatial multi-region feature attention module employs a downsampling convolution, a 1 × 1 convolution, a SoftMax function and a CCMP module, wherein the downsampling convolution brings the channel-enhanced feature maps of the plurality of stages to the same spatial scale as the feature map of the last stage of the network, the 1 × 1 convolution simplifies the calculation, and the SoftMax function and the CCMP module are used to calculate the similarity between the feature maps of the plurality of stages and to obtain a diversity learning loss L_div; L_div is negatively correlated with the similarity, and the diversity loss is reduced through training so that the feature maps of the plurality of stages focus spatially on different discriminative parts of the object;
assume that the last three stages of the feature extraction network yield, through the channel re-attention module, three feature maps, wherein C_t represents the normalized channel dimension; W_{L-2} and H_{L-2} represent the width and height of the feature map of the (L-2)-th stage; W_{L-1} and H_{L-1} represent the width and height of the feature map of the (L-1)-th stage; and W_L and H_L represent the width and height of the feature map of the L-th stage;
in order to reduce the amount of calculation, the feature maps are preprocessed according to formula (5), wherein φ(·) represents a 1 × 1 convolution, Conv_block_l(·) represents a downsampling convolution, and l denotes the stage of the feature map;
after feature maps from the three stages with the same spatial size and a channel number of 1 are obtained, a SoftMax function is applied to obtain the weight of each spatial position, and the resulting maps are concatenated along the channel dimension to obtain the feature map X_concat, which is input into the CCMP module; CCMP responds to the peaks of X_concat along the channel dimension, and the elements of X_concat are then summed and averaged by the operation h(·) to obtain the similarity value S_i, as in formula (6),
wherein k represents the size of the spatial dimension of X_concat, j represents the channel index of X_concat, and ε denotes the number of channels of X_concat; the spatial multi-region feature attention module thus yields the value S_i representing the similarity between the feature maps of the stages;
finally, the diversity learning loss L_div is obtained from the similarity S_i as follows:
$$L_{div} = \frac{1 - S_i}{\varepsilon} \tag{7}$$
where ε denotes the number of stages of the feature extraction network used as outputs, i.e. the number of channels of X_concat.
Further, the classifier employs a SoftMax classifier, which is applied in a multi-classification task to map the outputs of a plurality of neurons into a (0, 1) interval.
Further, the total loss function L_total of the classification network model is defined as follows:
$$L_{total} = \alpha L_{cls} + \beta L_{div} + \gamma L_{con} \tag{8}$$
wherein L_cls represents the cross-entropy loss, L_div represents the diversity learning loss, L_con represents the contrastive learning loss, and α, β and γ are balance parameters used to weight the individual loss terms; wherein,
the cross-entropy loss L_cls consists of the classification loss of each stage and the classification loss of the overall representation formed by concatenating the features of all stages, and is calculated by formula (9), wherein y is the ground-truth label of the input image, represented by a one-hot vector; θ_1 and θ_2 are balance parameters; the SoftMax function is used to calculate the predicted label values of the neural network; cls_l(·) represents the classifier of the l-th stage and Z_{f_l} represents the predicted label value for the output feature f_l of the l-th stage; cls_concat(·) represents the classifier for the overall feature representation and Z_{f_concat} represents the predicted label value for the overall feature representation f_concat;
the contrastive learning loss L_con is given by formula (10), where N is the batch size of the input images; z_i and z_j are l_2-normalized inputs within the same batch, and y_i and y_j are their label values; sim(z_i, z_j) is the cosine similarity between z_i and z_j; i and j index different samples of the same batch; and η is a threshold such that only inputs of different classes whose similarity is greater than η contribute to the loss L_con.
Further, the specific process of step 2 is as follows:
step 2.1, adopting the CUB_200_2011 dataset as the training data set, preprocessing the acquired original images by horizontal flipping and center cropping to achieve data augmentation, and constructing the training data set;
step 2.2, sending the fine-grained images of the training data set into the classification network model, and training and optimizing the learnable parameters of the model, so that the channel feature re-attention module mines the potential fine-grained knowledge in the feature maps to the greatest extent and the spatial multi-region feature attention module greatly reduces the similarity between the feature maps of different stages; when the whole model has been trained to convergence, the trained classification network model is obtained.
Further, the specific process of step 3 is as follows:
first, the fine-grained images to be classified are sent into the feature extraction network with L stages and then input into the channel feature re-attention module to obtain a channel-enhanced feature map and a channel-suppressed feature map; the channel-enhanced feature map is used as the output of the current stage of the network, and the channel-suppressed feature map is sent to the subsequent stage to force the network to attend to channels that are poor in information yet contain fine-grained knowledge; during model training, the spatial multi-region feature attention module makes the channel-enhanced feature maps output by the plurality of stages focus on different discriminative parts of the object in the spatial dimension; the model thus obtains a plurality of output features that are discriminative in both space and channel, and the output features of the plurality of stages are finally taken together as the feature representation of the image; the classification result of the current image is finally obtained through the SoftMax classifier.
The invention has the following beneficial technical effects:
The invention greatly alleviates the limitations of attention mechanisms and of convolutional-neural-network-based methods in fine-grained image classification. The multi-stage feature extraction network improves the ability of the classification network to aggregate feature information, covering both low-level information and high-level semantic information, and improves the robustness of the extracted features. The channel feature re-attention module effectively helps the classification network extract channel features that were originally ignored but are helpful for fine-grained classification, so that the resulting feature representation is more comprehensive. The spatial multi-region feature attention module makes the features output by the multiple stages of the classification network attend to different discriminative parts of the object in the spatial dimension, which improves the discriminability of the final feature representation. The contrastive learning loss term treats fine-grained images of different classes differently and increases the difference between classes: within the same input batch, training images of different classes are set as negative samples and training images of the same class as positive samples, and the loss function pulls positive samples closer together and pushes negative samples further apart, so that the classification performance of the network is further optimized during training.
Drawings
FIG. 1 is a flow chart of a fine-grained image classification method based on a convolutional neural network according to the present invention;
FIG. 2 is a schematic diagram of the overall structure of the classification network model of the present invention;
FIG. 3 is a schematic diagram of the channel feature re-attention module of the classification network model according to the present invention;
FIG. 4 is a schematic diagram of the spatial multi-region feature attention module of the classification network model according to the present invention.
Detailed Description
The invention is described in further detail below with reference to the following figures and detailed description:
Research shows that, among the many fine-grained image classification methods, a convolutional neural network that integrates a channel feature re-attention module and a spatial multi-region feature attention module is a reliable approach. It is a weakly supervised method, and by using a multi-stage convolutional network as the feature extraction network it can obtain more comprehensive and richer features. Because the multi-stage features contain both low-level information (color, edge junctions, etc.) and high-level semantic information, and the low-level information remains unchanged when the pose and background of the object change, the intra-class variance is reduced. Although classification methods based on deep learning and attention mechanisms improve fine-grained image classification to some extent, they have some shortcomings. A fine-grained image classification network should, in addition to extracting salient and easily distinguishable features, help the neural network learn more knowledge useful for fine-grained classification in the channel and spatial dimensions of the object features. That is, a channel feature re-attention module can be used to force the network to mine knowledge from channel features with poor information content, and a spatial multi-region feature attention module can be used to make the multi-stage features focus on different discriminative parts of the object. A feature representation that is more discriminative in both the channel and spatial dimensions is finally obtained.
Therefore, the invention provides a fine-grained image classification method based on a convolutional neural network: a classification network model is constructed by fusing a channel feature re-attention module and a spatial multi-region feature attention module, a contrastive learning loss term in the loss function is designed following the idea of contrastive learning, and the classification network model is finally used to classify images acquired in real time. As shown in FIG. 1 and FIG. 2, the method specifically includes the following steps:
step 1, constructing a classification network model;
the classification network model comprises a feature extraction network, a channel feature re-attention module, a spatial multi-region feature attention module and a classifier.
The feature extraction network is built on a basic convolutional network such as ResNet50, ResNet101 or DenseNet161. These networks have similar structures and consist of multiple stages, each of which comprises a convolutional layer. When an image is input into the feature extraction network, the spatial size of the feature map is halved and the number of channels is doubled after each stage, and the feature maps X_l output by the multiple stages serve as the output features of the feature extraction network. The invention adopts a convolutional neural network whose last three stages provide outputs as the feature extraction network.
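The halving and doubling rule stated above can be illustrated with a short sketch. It only applies the rule as written; the starting size of 64 channels at 112 × 112 and the count of four stages are hypothetical values chosen for illustration, and real backbones such as ResNet50 may deviate from the rule in individual stages.

```python
def stage_shapes(height, width, channels, num_stages):
    """Return (channels, height, width) after each stage, assuming every
    stage halves the spatial size and doubles the channel count."""
    shapes = []
    for _ in range(num_stages):
        height, width, channels = height // 2, width // 2, channels * 2
        shapes.append((channels, height, width))
    return shapes


# Hypothetical input to the first stage: 64 channels at 112 x 112, 4 stages.
shapes = stage_shapes(112, 112, 64, 4)

# Only the last three stages are used as outputs of the feature extractor.
last_three = shapes[-3:]
for channels, height, width in last_three:
    print(f"{channels} channels, {height} x {width}")
```

Taking outputs from the last three stages gives feature maps at three spatial resolutions, which is why the later modules have to unify channel and spatial dimensions before comparing them.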
As shown in FIG. 3, the channel feature re-attention module first integrates feature channel information using average pooling and max pooling operations, and obtains the weight of each channel in the feature map using a SoftMax function. An enhanced mask matrix E is obtained from the weight distribution; channels with high weights are then suppressed, a suppressed mask matrix S being obtained through a suppression function F(x). The input feature map X_l is multiplied by the enhanced mask matrix E and by the suppressed mask matrix S to obtain a channel-enhanced output feature map and a channel-suppressed output feature map, respectively, wherein:
The SoftMax function may be represented by:
$$Z_i = \frac{e^{x_i}}{\sum_{c=1}^{C} e^{x_c}} \tag{1}$$
wherein x_i is the value of the i-th channel before the SoftMax function, Z_i is the output value of each channel after the SoftMax function, and C is the total number of output channels. The channel weight information is obtained through the SoftMax function.
The enhanced mask matrix E may be calculated by:
$$E = \mathrm{SoftMax}(\mathrm{AvgPool}(X_l) + \mathrm{MaxPool}(X_l)) \tag{2}$$
wherein AvgPool(·) represents average pooling and MaxPool(·) represents max pooling.
The suppression function F(x) is given by formula (3), wherein Z_i is the output value of each channel after the SoftMax function, Z_max is the maximum channel output value, and ω and δ are hyper-parameters that respectively represent the degree to which the corresponding channel is suppressed and the degree to which channels need to be suppressed.
The output feature maps are obtained by multiplying the input feature map X_l by E and by S respectively, as in formula (4), wherein ⊗ represents element-wise multiplication.
The channel-enhanced feature maps of the multiple stages pass through a convolutional layer Conv that unifies their channel dimension and are then used as the outputs of the corresponding stages; the channel unification ensures a balance between low-level and high-level information. The channel-suppressed feature map is input to the subsequent stage, forcing the network to mine potential channel features that contain fine-grained knowledge.
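A minimal NumPy sketch of the channel feature re-attention step may help make the data flow concrete. The enhanced mask follows formula (2), with pooling assumed to be global over the spatial axes. The form of the suppression function is an assumption made purely for illustration, since the text above does not give it: channels whose weight exceeds δ times the maximum weight are scaled by ω, and all others are left unchanged. The function names and the default values of `omega` and `delta` are likewise hypothetical.

```python
import numpy as np


def softmax(x):
    """Numerically stable SoftMax over a 1-D vector, as in formula (1)."""
    exp = np.exp(x - np.max(x))
    return exp / exp.sum()


def enhanced_mask(feature_map):
    """Formula (2): E = SoftMax(AvgPool(X_l) + MaxPool(X_l)).

    feature_map has shape (C, H, W). Pooling is ASSUMED to be global over
    the spatial axes, which gives one score per channel.
    """
    avg = feature_map.mean(axis=(1, 2))
    mx = feature_map.max(axis=(1, 2))
    return softmax(avg + mx)


def suppression_mask(weights, omega=0.5, delta=0.8):
    """ASSUMED threshold form, not the patent's formula (3): channels whose
    weight exceeds delta * max(weights) are scaled by omega, others by 1."""
    return np.where(weights > delta * weights.max(), omega, 1.0)


rng = np.random.default_rng(0)
x_l = rng.random((8, 4, 4))                # hypothetical 8-channel feature map

E = enhanced_mask(x_l)                     # per-channel weights, sum to 1
S = suppression_mask(E)                    # damp the highest-weighted channels

x_enhanced = x_l * E[:, None, None]        # output of the current stage
x_suppressed = x_l * S[:, None, None]      # input to the subsequent stage
```

Broadcasting the per-channel masks over the spatial axes implements the element-wise multiplication channel by channel; under the assumed threshold form, the channel with the largest weight is always among those suppressed.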
As shown in FIG. 4, the spatial multi-region feature attention module employs a downsampling convolution, a 1 × 1 convolution, a SoftMax function and a CCMP (cross-channel max pooling) module, wherein the downsampling convolution brings the channel-enhanced feature maps of the multiple stages to the same spatial scale as the feature map of the last stage of the network, the 1 × 1 convolution simplifies the calculation, and the SoftMax function and the CCMP module are used to calculate the similarity between the feature maps of the multiple stages and to obtain a diversity learning loss L_div. L_div is negatively correlated with the similarity, and the diversity loss is reduced through training so that the feature maps of the multiple stages focus spatially on different discriminative parts of the object.
Assume that the last three stages of the feature extraction network yield, through the channel re-attention module, three feature maps, wherein C_t represents the normalized channel dimension, whose value is equal to 1 in the present invention; W_{L-2} and H_{L-2} represent the width and height of the feature map of the (L-2)-th stage; W_{L-1} and H_{L-1} represent the width and height of the feature map of the (L-1)-th stage; and W_L and H_L represent the width and height of the feature map of the L-th stage.
In order to reduce the amount of calculation, the feature maps are preprocessed according to formula (5), where φ(·) represents a 1 × 1 convolution, Conv_block_l(·) represents a downsampling convolution, and l denotes the stage of the feature map.
This yields feature maps from the three stages with the same spatial size and a channel number of 1. To explore the similarity of the three stages' feature maps in the spatial dimension, a SoftMax function is applied to obtain the weight of each spatial position, and the resulting maps are concatenated along the channel dimension to obtain the feature map X_concat, which is input into the CCMP module. CCMP is cross-channel max pooling: it responds to the peaks of X_concat along the channel dimension. The elements of X_concat are then summed and averaged by the operation h(·) to obtain the similarity value S_i, as in formula (6),
wherein k represents the size of the spatial dimension of X_concat, j represents the channel index of X_concat, and ε denotes the number of channels of X_concat. The spatial multi-region feature attention module thus yields the value S_i representing the similarity between the feature maps of the stages. The larger the value of S_i, the higher the similarity between the feature maps. To make the classification model focus on several different parts of the object, the similarity between the feature maps is reduced during training, i.e. S_i is reduced.
Finally, the diversity learning loss L_div is obtained from the similarity S_i as follows:
$$L_{div} = \frac{1 - S_i}{\varepsilon} \tag{7}$$
where ε denotes the number of stages of the feature extraction network used as outputs, i.e. the number of channels of X_concat, which is 3 in the present invention.
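The spatial SoftMax, concatenation and cross-channel max pooling steps described above can be sketched as follows. The sketch starts from three single-channel maps that are assumed to have already been brought to the same spatial size (the downsampling and 1 × 1 convolutions are not modelled), and it stops at the CCMP response: the summing-and-averaging operation h(·) is left out because its exact normalisation is not stated in the text. The 7 × 7 size and the function names are hypothetical.

```python
import numpy as np


def spatial_softmax(fmap):
    """SoftMax over all spatial positions of a single-channel (H, W) map,
    so the resulting position weights sum to 1."""
    flat = fmap.reshape(-1)
    exp = np.exp(flat - flat.max())
    return (exp / exp.sum()).reshape(fmap.shape)


def ccmp(x_concat):
    """Cross-channel max pooling: the maximum over the channel axis at each
    spatial position of a (channels, H, W) array."""
    return x_concat.max(axis=0)


rng = np.random.default_rng(0)

# Three hypothetical single-channel stage maps at a common 7 x 7 resolution.
stage_maps = [rng.random((7, 7)) for _ in range(3)]

# Weight every spatial position, then concatenate along the channel axis.
x_concat = np.stack([spatial_softmax(m) for m in stage_maps], axis=0)

# Peak response across the three stages at each spatial position.
peaks = ccmp(x_concat)
```

Because each channel of `x_concat` is a distribution over positions, the CCMP response shows, position by position, how strongly the most attentive stage responds there.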
The classifier is a SoftMax classifier, which is used in multi-class tasks: it maps the outputs of a plurality of neurons into the interval (0, 1), where they can be interpreted as probabilities, thereby performing multi-class classification.
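As a small illustration of this mapping, the sketch below turns a vector of raw neuron outputs into probabilities and picks the most probable class. The three logit values are made up for the example.

```python
import numpy as np


def softmax_classify(logits):
    """Map raw outputs to probabilities in (0, 1) that sum to 1 and return
    them together with the index of the most probable class."""
    exp = np.exp(logits - logits.max())   # subtract the max for stability
    probs = exp / exp.sum()
    return probs, int(np.argmax(probs))


logits = np.array([2.0, 0.5, -1.0])       # hypothetical outputs for 3 classes
probs, predicted = softmax_classify(logits)
```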
In addition, the total loss function L_total of the classification network model is defined as follows:
$$L_{total} = \alpha L_{cls} + \beta L_{div} + \gamma L_{con} \tag{8}$$
wherein L_cls represents the cross-entropy loss, L_div represents the diversity learning loss, L_con represents the contrastive learning loss, and α, β and γ are balance parameters used to weight the individual loss terms, wherein:
The cross-entropy loss L_cls consists of the classification loss of each stage and the classification loss of the overall representation formed by concatenating the features of all stages, and is calculated by formula (9),
where y is the ground-truth label of the input image, represented by a one-hot vector; θ_1 and θ_2 are also balance parameters; and the SoftMax function is used to calculate the predicted label values of the neural network. cls_l(·) represents the classifier of the l-th stage, and Z_{f_l} represents the predicted label value for the output feature f_l of the l-th stage. cls_concat(·) represents the classifier for the overall feature representation, and Z_{f_concat} represents the predicted label value for the overall feature representation f_concat.
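One way to read this description is sketched below. The way the two parts are combined is an assumption made for illustration, not the patent's formula (9): the per-stage cross-entropy terms are summed and weighted by θ_1, and the cross-entropy of the concatenated representation is weighted by θ_2. The logits stand in for the classifier outputs cls_l(f_l) and cls_concat(f_concat), and all numeric values are hypothetical.

```python
import numpy as np


def log_softmax(logits):
    """Numerically stable logarithm of the SoftMax probabilities."""
    shifted = logits - logits.max()
    return shifted - np.log(np.exp(shifted).sum())


def cross_entropy(logits, one_hot):
    """Cross entropy between a one-hot label and SoftMax(logits)."""
    return float(-(one_hot * log_softmax(logits)).sum())


def classification_loss(stage_logits, concat_logits, one_hot,
                        theta1=1.0, theta2=1.0):
    """ASSUMED combination: theta1 * (sum of per-stage cross entropies)
    + theta2 * (cross entropy of the concatenated representation)."""
    stage_term = sum(cross_entropy(l, one_hot) for l in stage_logits)
    return theta1 * stage_term + theta2 * cross_entropy(concat_logits, one_hot)


y = np.array([0.0, 1.0, 0.0])                    # one-hot label, 3 classes
stage_logits = [np.zeros(3) for _ in range(3)]   # three uninformative stages
concat_logits = np.zeros(3)                      # uninformative overall head

loss = classification_loss(stage_logits, concat_logits, y)
```

With all logits equal, every classifier predicts a uniform distribution, so each of the four terms equals ln 3.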
The diversity learning loss L_div is calculated from the similarity S_i of formula (6), as given in formula (7).
The contrastive learning loss L_con is given by formula (10),
where N is the batch size of the input images; z_i and z_j are l_2-normalized inputs within the same batch, and y_i and y_j are their label values; sim(z_i, z_j) is the cosine similarity between z_i and z_j; i and j index different samples of the same batch; and η is a threshold such that only inputs of different classes whose similarity is greater than η contribute to the loss L_con.
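The behaviour described for η can be illustrated with a margin-style batch loss. The exact form used here is an assumption, not the patent's formula (10): same-class pairs contribute 1 − sim(z_i, z_j), different-class pairs contribute max(sim(z_i, z_j) − η, 0), and the sum is divided by N². The feature vectors and the default `eta=0.4` are hypothetical.

```python
import numpy as np


def contrastive_loss(features, labels, eta=0.4):
    """ASSUMED margin form of a batch contrastive loss.

    features: (N, D) array, one row per sample; rows are l2-normalized here.
    labels:   (N,) integer class labels.
    """
    z = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = z @ z.T                                  # cosine similarities
    same = labels[:, None] == labels[None, :]      # positive-pair mask

    # Positive pairs are pulled together: penalise low similarity.
    pos = np.where(same, 1.0 - sim, 0.0)
    # Negative pairs only count when their similarity exceeds eta.
    neg = np.where(~same, np.clip(sim - eta, 0.0, None), 0.0)

    n = len(labels)
    return float((pos + neg).sum() / n ** 2)


# Two identical features with different labels: a hard negative pair.
hard = contrastive_loss(np.array([[1.0, 0.0], [1.0, 0.0]]), np.array([0, 1]))

# Same-class features aligned and the other class orthogonal: nothing to fix.
easy = contrastive_loss(np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]),
                        np.array([0, 0, 1]))
```

In this form, negative pairs that are already less similar than η produce no gradient, so training effort concentrates on different-class samples that are still confusable.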
Step 2, constructing a training data set and carrying out model training; the specific process is as follows:
Step 2.1, adopting the CUB_200_2011 data set as the training data set, preprocessing the acquired original images by horizontal flipping, center cropping and similar operations to achieve data augmentation, and constructing the training data set;
The CUB_200_2011 data set is a fine-grained data set proposed by the California Institute of Technology in 2010 and is the benchmark image data set for current research on fine-grained classification and recognition. It contains 11788 bird images covering 200 bird species; the training set has 5994 images and the test set has 5794 images, and every image is provided with image-level class label information.
Step 2.2, feeding the fine-grained images of the training data set into the classification network model and training and optimizing its learnable parameters, so that the channel feature re-attention module can mine the potential fine-grained knowledge in the feature maps to the greatest extent and the spatial multi-region feature attention module can greatly reduce the similarity between the feature maps of different stages; when the whole model has been trained to convergence, the trained classification network model is obtained.
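The preprocessing of step 2.1 can be illustrated with a minimal sketch of horizontal flipping and center cropping on an image stored as a nested list of pixel rows. The function names, the 4 × 4 toy image and the crop size are illustrative assumptions; a practical pipeline would normally use an image library.

```python
def horizontal_flip(image):
    """Mirror every pixel row from left to right."""
    return [row[::-1] for row in image]

def center_crop(image, crop_h, crop_w):
    """Cut a crop_h x crop_w window from the middle of the image."""
    h, w = len(image), len(image[0])
    top = (h - crop_h) // 2
    left = (w - crop_w) // 2
    return [row[left:left + crop_w] for row in image[top:top + crop_h]]

# Toy 4 x 4 single-channel image with pixel values 0..15.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
flipped = horizontal_flip(image)
cropped = center_crop(image, 2, 2)
```

Each original image together with its flipped and cropped variants enlarges the effective training set without requiring new labels.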
Step 3, acquiring the images to be classified in real time and feeding them into the trained classification network model to obtain the classification result of the current image. The specific process is as follows:
First, the fine-grained image to be classified is fed into the feature extraction network with L stages, and its features are then input into the channel feature re-attention module, which yields a channel-enhanced feature map and a channel-suppressed feature map. The channel-enhanced feature map serves as the output of the current stage of the network, and the channel-suppressed feature map is passed to the subsequent stage to force the network to attend to information-poor channels that still contain fine-grained knowledge. During model training, the spatial multi-region feature attention module has already caused the channel-enhanced feature maps output by the multiple stages to focus on different discriminative parts of the object in the spatial dimension. The model can therefore obtain multiple output features that are discriminative in both the spatial and the channel dimensions, and the output features of the multiple stages are taken together as the feature representation of the image. Finally, the classification result of the current image is obtained through the SoftMax classifier.
The invention provides a fine-grained image classification method based on a convolutional neural network that integrates a channel feature re-attention module and a spatial multi-region feature attention module; it combines convolutional neural networks from deep learning with improved attention modules to classify fine-grained images. The method mitigates the shortcomings of attention mechanisms in this task as far as possible and enhances the feature extraction capability of the basic convolutional network. In the classification network model, the proposed channel feature re-attention module improves the utilization of features in the network while adding almost no learnable parameters to the original network, so that the potential fine-grained knowledge contained in channel features useful for fine-grained classification is learned better and overfitting on tasks with small training sets (such as the CUB_200_2011 bird data set used in the invention) can be controlled. The spatial multi-region feature attention module causes the feature maps output by multiple stages of the classification network to focus on different discriminative parts of the object in space, rather than all focusing on its most salient part. A contrastive learning loss term is designed into the loss function, incorporating the idea of contrastive learning and improving the classification performance of the network model. The invention thus addresses two problems in fine-grained image classification: context cannot be fully utilized when deep networks are used to extract features, and only the most salient channel and spatial features of an object receive attention when an attention mechanism is applied.
It is to be understood that the above description is not intended to limit the present invention, and the present invention is not limited to the above examples, and those skilled in the art may make modifications, alterations, additions or substitutions within the spirit and scope of the present invention.
Claims (8)
1. A fine-grained image classification method based on a convolutional neural network, characterized by first constructing a classification network model that fuses a channel feature re-attention module and a spatial multi-region feature attention module, then designing a contrastive learning loss term in the loss function based on the idea of contrastive learning, and finally classifying images acquired in real time with the classification network model; the method specifically comprises the following steps:
step 1, constructing a classification network model;
the classification network model comprises a feature extraction network, a channel feature re-attention module, a spatial multi-region feature attention module and a classifier;
step 2, constructing a training data set and carrying out model training;
step 3, acquiring the images to be classified in real time and feeding them into the trained classification network model to obtain the classification result of the current image.
2. The fine-grained image classification method based on a convolutional neural network according to claim 1, characterized in that a convolutional neural network whose last three stages provide outputs is used as the feature extraction network; the feature extraction network adopts the ResNet50, ResNet101 and DenseNet161 basic convolutional networks; each convolutional network structure is composed of a plurality of stages, and each stage comprises convolutional layers; when an image is input into the feature extraction network, each stage it passes through halves the spatial size of the feature map and doubles the number of channels; and the output feature maps $X_l$ of multiple stages of the network are extracted as the output features of the feature extraction network.
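The halving of the spatial size and doubling of the channel number per stage can be traced with a short sketch. The starting shape of 256 channels at 56 × 56 is an illustrative assumption (it corresponds to the first residual stage of a ResNet50 on a 224 × 224 input), and the function name is hypothetical.

```python
def stage_shapes(height, width, channels, num_stages):
    """List (channels, height, width) of the feature map at each stage."""
    shapes = []
    for _ in range(num_stages):
        shapes.append((channels, height, width))
        # Moving to the next stage halves the spatial size
        # and doubles the number of channels.
        height, width, channels = height // 2, width // 2, channels * 2
    return shapes

shapes = stage_shapes(56, 56, 256, 4)
# The last three stages are the ones used as outputs.
last_three = shapes[-3:]
```

Because the three output stages differ in both resolution and channel count, they have to be brought to a common channel dimension and spatial scale before they can be compared or concatenated.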
3. The fine-grained image classification method based on a convolutional neural network according to claim 1, characterized in that the channel feature re-attention module first integrates feature channel information using average pooling and max pooling operations and obtains the weight information of each channel in the feature map using a SoftMax function; an enhanced mask matrix E is obtained from the weight distribution, channels with high weights are suppressed, and a suppressed mask matrix S is obtained through a suppression function F(x); the input feature map $X_l$ is multiplied by the enhanced mask matrix E and by the suppressed mask matrix S, respectively, to obtain the output feature maps, namely a channel-enhanced feature map and a channel-suppressed feature map; wherein,
the SoftMax function is represented by:
$$Z_i = \frac{e^{x_i}}{\sum_{c=1}^{C} e^{x_c}} \qquad (1)$$
wherein $Z_i$ is the output value of the $i$-th channel after the SoftMax function, $x_i$ is the corresponding input value, and $C$ is the total number of output channels; the weight information of the channels is obtained through the SoftMax function;
the enhanced mask matrix E is calculated by:
$$E = \mathrm{SoftMax}(\mathrm{AvgPool}(X_l) + \mathrm{MaxPool}(X_l)) \qquad (2)$$
wherein AvgPool(·) denotes average pooling and MaxPool(·) denotes max pooling;
the suppression function F(x) is defined in terms of the following quantities: $Z_{max}$ is the maximum channel output value, and $\omega$ and $\delta$ are hyper-parameters that control which channels are suppressed and the degree to which they are suppressed;
the multiplication of the input feature map with the mask matrices is an element-by-element multiplication;
the channel-enhanced feature maps of the plurality of stages are passed through a Conv convolutional layer to unify their channel dimension and then serve as the outputs of the corresponding stages, the channels being unified to ensure a balance between low-level and high-level information; the channel-suppressed feature maps serve as the inputs to the subsequent stages, forcing the network to mine potential channel features containing fine-grained knowledge.
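The channel feature re-attention module of claim 3 can be sketched as follows. The enhanced mask follows formula (2) under the assumption that both pooling operations are global over the spatial positions of each channel. The form of the suppression function F(x) is not reproduced in the text, so the threshold rule used here (channels whose weight reaches ω·Z_max are scaled by δ, all others by 1) is purely a hypothetical stand-in, as are all function names and example values.

```python
import math

def softmax(values):
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def enhanced_mask(feature_map):
    """E = SoftMax(AvgPool(X) + MaxPool(X)), pooling globally per channel."""
    pooled = []
    for channel in feature_map:
        flat = [v for row in channel for v in row]
        pooled.append(sum(flat) / len(flat) + max(flat))
    return softmax(pooled)

def suppression_mask(weights, omega, delta):
    # Hypothetical F(x): damp channels whose weight is close to the maximum.
    z_max = max(weights)
    return [delta if w >= omega * z_max else 1.0 for w in weights]

def scale_channels(feature_map, mask):
    """Multiply every element of channel c by mask[c]."""
    return [[[v * m for v in row] for row in channel]
            for channel, m in zip(feature_map, mask)]

# Toy feature map with C = 3 channels of size 2 x 2.
x = [
    [[1.0, 2.0], [3.0, 4.0]],   # dominant channel
    [[0.0, 1.0], [1.0, 0.0]],
    [[0.5, 0.5], [0.5, 0.5]],
]
E = enhanced_mask(x)
S = suppression_mask(E, omega=0.9, delta=0.1)
x_enhanced = scale_channels(x, E)     # output of the current stage
x_suppressed = scale_channels(x, S)   # input to the subsequent stage
```

In this toy example the dominant channel receives almost all of the enhancement weight and is the only one damped in the suppressed map, which leaves the weaker channels intact for the next stage.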
4. The fine-grained image classification method based on a convolutional neural network according to claim 1, characterized in that the spatial multi-region feature attention module uses a downsampling convolution, a 1 × 1 convolution, a SoftMax function and a CCMP module, wherein the downsampling convolution keeps the spatial scale of the channel-enhanced feature maps of the plurality of stages consistent with that of the feature map of the last stage of the network, the 1 × 1 convolution is used to simplify the calculation, and the SoftMax function and the CCMP module are used to calculate the similarity between the feature maps of the plurality of stages and to obtain the diversity learning loss $L_{div}$; $L_{div}$ is negatively correlated with the similarity, and reducing the diversity loss through training causes the feature maps of the plurality of stages to focus spatially on different discriminative parts of the object;
the feature maps obtained by the channel re-attention module in the last three stages of the feature extraction network are considered, wherein $C_t$ denotes the unified channel dimension; $W_{L-2}$ and $H_{L-2}$ denote the width and height of the feature map of stage $L-2$; $W_{L-1}$ and $H_{L-1}$ denote the width and height of the feature map of stage $L-1$; and $W_L$ and $H_L$ denote the width and height of the feature map of stage $L$;
in order to reduce the amount of calculation, the feature maps are preprocessed with a 1 × 1 convolution $\varphi(\cdot)$ and a downsampling convolution $Conv\_block_l(\cdot)$, where $l$ denotes the $l$-th stage of the feature map;
after the feature maps of the three stages have been brought to the same spatial size with a channel number of 1, the weight information of each spatial position is obtained with the SoftMax function, and the maps are then concatenated along the channel dimension to obtain the feature map $X_{concat}$, which is input into the CCMP module; CCMP takes the peak response of $X_{concat}$ along the channel dimension and applies the sum-and-average operation $h(\cdot)$ to the elements of $X_{concat}$ to obtain the similarity value $S_i$;
wherein $k$ indexes the positions of $X_{concat}$, $j$ indexes the channels of $X_{concat}$, and $\varepsilon$ denotes the number of channels of $X_{concat}$; the spatial multi-region feature attention module thus yields the value $S_i$ representing the similarity between the feature maps of the stages;
finally, the diversity learning loss $L_{div}$ is obtained from the similarity $S_i$ as follows:
$$L_{div} = (1 - S_i)/\varepsilon \qquad (7)$$
where $\varepsilon$ is the number of feature-extraction stages used as outputs, which is also the number of channels of $X_{concat}$.
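The computation in claim 4 can be sketched as follows. The formula for $S_i$ is not reproduced in the text, so the sketch assumes that CCMP takes the maximum over the ε channels at every spatial position and that h(·) sums these peak responses and divides by ε; formula (7) is then applied as stated. Each stage map is assumed to have already been reduced to one channel at a common spatial size, and all names and example values are hypothetical.

```python
import math

def spatial_softmax(feature_map):
    """SoftMax over all spatial positions of a single-channel map."""
    flat = [v for row in feature_map for v in row]
    m = max(flat)
    exps = [math.exp(v - m) for v in flat]
    total = sum(exps)
    return [e / total for e in exps]

def diversity_loss(stage_maps):
    """Return (S, L_div) for a list of single-channel stage maps."""
    # Concatenating the per-stage weight maps gives epsilon channels.
    x_concat = [spatial_softmax(fm) for fm in stage_maps]
    epsilon = len(x_concat)
    # Assumed CCMP: peak response over channels at each position k.
    peaks = [max(channel[k] for channel in x_concat)
             for k in range(len(x_concat[0]))]
    # Assumed h(.): sum the peaks and average over the channels.
    s = sum(peaks) / epsilon
    l_div = (1.0 - s) / epsilon   # formula (7)
    return s, l_div

# Three stages attending to the same position.
same = [[[10.0, 0.0], [0.0, 0.0]]] * 3
# Three stages attending to three different positions.
distinct = [
    [[10.0, 0.0], [0.0, 0.0]],
    [[0.0, 10.0], [0.0, 0.0]],
    [[0.0, 0.0], [10.0, 0.0]],
]
s_same, l_div_same = diversity_loss(same)
s_distinct, l_div_distinct = diversity_loss(distinct)
```

Under these assumptions the loss is close to zero when the stages attend to different positions and is largest when they all attend to the same one, which is the behaviour that training on $L_{div}$ is meant to encourage.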
5. The fine-grained image classification method based on convolutional neural network of claim 1, wherein the classifier adopts a SoftMax classifier and is applied in a multi-classification task to map the output of a plurality of neurons into a (0, 1) interval.
6. The fine-grained image classification method based on a convolutional neural network according to claim 1, characterized in that the total loss function $L_{total}$ of the classification network model is defined as follows:
$$L_{total} = \alpha L_{cls} + \beta L_{div} + \gamma L_{con} \qquad (8)$$
wherein $L_{cls}$ denotes the cross-entropy loss, $L_{div}$ denotes the diversity learning loss, $L_{con}$ denotes the contrastive learning loss, and $\alpha$, $\beta$, $\gamma$ are balance parameters used to weight the individual loss terms; wherein,
the cross-entropy loss $L_{cls}$ consists of the classification loss of each stage and the classification loss of the overall representation obtained by concatenating the features of all stages; in its calculation formula, $y$ is the ground-truth label of the input image, represented as a one-hot vector; $\theta_1$ and $\theta_2$ are balance parameters; the SoftMax function is used to compute the label values predicted by the neural network; $cls_l(\cdot)$ denotes the classifier of stage $l$, and $Z_{f_l}$ denotes the label prediction for the output feature $f_l$ of stage $l$; $cls_{concat}(\cdot)$ denotes the classifier for the overall feature representation, and $Z_{f_{concat}}$ denotes the label prediction for the overall feature representation $f_{concat}$;
the contrastive learning loss $L_{con}$ is defined in terms of the following quantities: $N$ is the size of the input image batch; $z_i$, $z_j$ are the $\ell_2$-normalized representations of input images of different classes within the same batch; $y_i$, $y_j$ are the label values of those input images; $sim(z_i, z_j)$ is the cosine similarity between $z_i$ and $z_j$; $i$, $j$ index different samples of the same batch; and $\eta$ is a threshold, meaning that only inputs of different classes whose similarity is greater than $\eta$ contribute to the loss $L_{con}$.
7. The fine-grained image classification method based on a convolutional neural network according to claim 1, characterized in that the specific process of step 2 is as follows:
step 2.1, adopting the CUB_200_2011 data set as the training data set, preprocessing the acquired original images by horizontal flipping and center cropping to achieve data augmentation, and constructing the training data set;
step 2.2, feeding the fine-grained images of the training data set into the classification network model and training and optimizing its learnable parameters, so that the channel feature re-attention module can mine the potential fine-grained knowledge in the feature maps to the greatest extent and the spatial multi-region feature attention module can greatly reduce the similarity between the feature maps of different stages; when the whole model has been trained to convergence, the trained classification network model is obtained.
8. The fine-grained image classification method based on a convolutional neural network according to claim 1, characterized in that the specific process of step 3 is as follows:
first, the fine-grained image to be classified is fed into the feature extraction network with L stages, and its features are then input into the channel feature re-attention module to obtain a channel-enhanced feature map and a channel-suppressed feature map; the channel-enhanced feature map serves as the output of the current stage of the network, and the channel-suppressed feature map is passed to the subsequent stage to force the network to attend to information-poor channels that still contain fine-grained knowledge; the model training process uses the spatial multi-region feature attention module to cause the channel-enhanced feature maps output by the multiple stages to focus on different discriminative parts of the object in the spatial dimension; the model can therefore obtain multiple output features that are discriminative in both the spatial and the channel dimensions, and the output features of the multiple stages are taken together as the feature representation of the image; finally, the classification result of the current image is obtained through the SoftMax classifier.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211224648.8A CN115631369A (en) | 2022-10-09 | 2022-10-09 | Fine-grained image classification method based on convolutional neural network |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115631369A true CN115631369A (en) | 2023-01-20 |
Family
ID=84904512
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211224648.8A Pending CN115631369A (en) | 2022-10-09 | 2022-10-09 | Fine-grained image classification method based on convolutional neural network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115631369A (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116452896A (en) * | 2023-06-16 | 2023-07-18 | 中国科学技术大学 | Method, system, device and medium for improving fine-grained image classification performance |
CN116664911A (en) * | 2023-04-17 | 2023-08-29 | 山东第一医科大学附属肿瘤医院(山东省肿瘤防治研究院、山东省肿瘤医院) | Breast tumor image classification method based on interpretable deep learning |
CN116994032A (en) * | 2023-06-28 | 2023-11-03 | 河北大学 | Rectal polyp multi-classification method based on deep learning |
CN117011718A (en) * | 2023-10-08 | 2023-11-07 | 之江实验室 | Plant leaf fine granularity identification method and system based on multiple loss fusion |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116664911A (en) * | 2023-04-17 | 2023-08-29 | 山东第一医科大学附属肿瘤医院(山东省肿瘤防治研究院、山东省肿瘤医院) | Breast tumor image classification method based on interpretable deep learning |
CN116452896A (en) * | 2023-06-16 | 2023-07-18 | 中国科学技术大学 | Method, system, device and medium for improving fine-grained image classification performance |
CN116452896B (en) * | 2023-06-16 | 2023-10-20 | 中国科学技术大学 | Method, system, device and medium for improving fine-grained image classification performance |
CN116994032A (en) * | 2023-06-28 | 2023-11-03 | 河北大学 | Rectal polyp multi-classification method based on deep learning |
CN116994032B (en) * | 2023-06-28 | 2024-02-27 | 河北大学 | Rectal polyp multi-classification method based on deep learning |
CN117011718A (en) * | 2023-10-08 | 2023-11-07 | 之江实验室 | Plant leaf fine granularity identification method and system based on multiple loss fusion |
CN117011718B (en) * | 2023-10-08 | 2024-02-02 | 之江实验室 | Plant leaf fine granularity identification method and system based on multiple loss fusion |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||