CN115631369A - Fine-grained image classification method based on convolutional neural network - Google Patents

Fine-grained image classification method based on convolutional neural network

Info

Publication number
CN115631369A
Authority
CN
China
Prior art keywords
feature
channel
classification
fine
stage
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211224648.8A
Other languages
Chinese (zh)
Inventor
王坤
王延江
刘宝弟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China University of Petroleum East China
Original Assignee
China University of Petroleum East China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China University of Petroleum East China filed Critical China University of Petroleum East China
Priority to CN202211224648.8A priority Critical patent/CN115631369A/en
Publication of CN115631369A publication Critical patent/CN115631369A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/764: Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77: Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/774: Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82: Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a fine-grained image classification method based on a convolutional neural network, belonging to the technical field of fine-grained image processing. A classification network model is first constructed by fusing a channel feature re-attention module and a spatial multi-region feature attention module; a contrastive learning loss term in the loss function is then designed following the idea of contrastive learning; finally, the classification network model is used to classify images acquired in real time. The method specifically comprises the following steps: constructing a classification network model, which comprises a feature extraction network, a channel feature re-attention module, a spatial multi-region feature attention module and a classifier; constructing a training data set and carrying out model training; and acquiring images to be classified in real time and sending them to the trained classification network model to obtain the classification result of the current image. The invention effectively reduces the difficulty of fine-grained image classification and overcomes the limitations of attention mechanisms in this field.

Description

Fine-grained image classification method based on convolutional neural network
Technical Field
The invention belongs to the technical field of fine-grained image processing, and particularly relates to a fine-grained image classification method based on a convolutional neural network.
Background
In recent years, with the rapid development of deep learning, the research focus of object image classification has shifted from coarse-grained to fine-grained image classification. The fine-grained image classification problem is to identify subclasses within a base class, such as distinguishing different species of birds or vehicles of different brands. Compared with coarse-grained image classification, the differences between fine-grained categories are much subtler, and accurate discrimination is possible only by means of small local differences. Compared with object-level classification tasks such as face recognition, fine-grained classification must also cope with uncertain factors such as pose, occlusion and background interference, which makes the task very challenging. Current work mainly concerns identifying different types of birds, dogs, flowers, cars, planes, and so on.
Fine-grained image classification models have seen broad commercial demand and application scenarios in industry and everyday life in recent years: photograph-and-identify functions in mobile applications recognize flower species and car models, and in ecological protection, effective identification of different kinds of organisms is an important prerequisite for ecological research. Therefore, realizing low-cost, high-accuracy fine-grained image recognition and classification by means of computer vision technology is of great significance to both academia and industry.
Research shows that currently existing fine-grained image classification methods can be divided into methods that use only visual information and methods that add additional information. The former rely entirely on visual information to solve the classification problem, while the latter attempt to bring extra information into the classification.
Methods that use only visual information can be roughly divided into two types: methods based on localization-classification subnetworks and methods based on high-order feature coding. Methods based on localization-classification subnetworks detect and locate the discriminative parts of the object and build corresponding local feature representations. Early work used part annotations as strong supervision to focus the network on subtle differences between classes, but part annotation information is expensive to obtain. Therefore, most current mainstream methods adopt weak supervision, i.e., only image-level labels are used for classification. Methods based on high-order feature coding perform high-order integration of the features produced by the neural network to obtain more discriminative features. However, both kinds of method have limitations: most localization-classification subnetwork methods focus on the most salient parts of the object while ignoring parts that are less salient but still distinctive, so the features are not discriminative enough; methods based on high-order feature coding consume large amounts of computing resources when the channel dimension of the feature map is high, and lack interpretability.
Methods that add additional information build a joint feature representation by introducing extra information (such as web data or multimodal data, where multimodal data includes sound, textual descriptions of objects, and so on). By combining rich additional information with a deep neural network architecture, these methods achieve effective classification of fine-grained images. Their limitation is that they are designed around specific prior knowledge and cannot be applied freely to other auxiliary information.
Disclosure of Invention
In order to solve the problems in the prior art, namely the difficulty of fine-grained image classification and the limited applicability of attention mechanisms in this field, the invention provides a fine-grained image classification method based on a convolutional neural network, using a convolutional neural network that fuses a channel feature re-attention module and a spatial multi-region feature attention module to classify fine-grained images.
The technical scheme of the invention is as follows:
A fine-grained image classification method based on a convolutional neural network: first, a classification network model is constructed by fusing a channel feature re-attention module and a spatial multi-region feature attention module; then a contrastive learning loss term in the loss function is designed following the idea of contrastive learning; finally, the classification network model is used to classify images acquired in real time. The method specifically comprises the following steps:
step 1, constructing a classification network model;
the classification network model comprises a feature extraction network, a channel feature re-attention module, a spatial multi-region feature attention module and a classifier;
step 2, constructing a training data set and carrying out model training;
and step 3, acquiring the images to be classified in real time and sending them to the trained classification network model to obtain the classification result of the current image.
Further, a convolutional neural network whose last three stages provide the outputs is adopted as the feature extraction network. The feature extraction network is built from a basic convolutional network such as ResNet50, ResNet101 or DenseNet161; each such network consists of several stages, and each stage contains convolutional layers. When an image passes through the feature extraction network, each stage halves the spatial size of the feature map and doubles the number of channels. The feature maps X_l output by the several stages of the feature extraction network serve as the output features of the feature extraction network.
Further, the channel feature re-attention module first aggregates channel information using average pooling and maximum pooling, and obtains the weight of each channel in the feature map with a SoftMax function; an enhancement mask matrix E is obtained from this weight distribution, the highly weighted channels are then suppressed, and a suppression mask matrix S is obtained through a suppression function F(x); the input feature map X_l is multiplied with the enhancement mask matrix E and the suppression mask matrix S respectively to obtain the output feature maps X_l^E and X_l^S; wherein,
the SoftMax function is represented by:
Z_i = exp(x_i) / Σ_{c=1}^{C} exp(x_c)    (1)
wherein Z_i is the output value of channel i after the SoftMax function and C is the total number of output channels; the channel weight information is obtained through the SoftMax function;
the enhancement mask matrix E is calculated by:
E = SoftMax(AvgPool(X_l) + MaxPool(X_l))    (2)
wherein AvgPool(·) denotes average pooling and MaxPool(·) denotes maximum pooling;
the suppression function F(x) is represented by the following piecewise form:
F(Z_i) = ω, if Z_i ≥ δ·Z_max; F(Z_i) = 1, otherwise    (3)
wherein Z_max is the maximum output value over the channels, and ω and δ are hyper-parameters denoting, respectively, how strongly the corresponding channel is suppressed and the degree to which a channel needs to be suppressed;
the output feature maps X_l^E and X_l^S of the current stage are obtained by the following formula:
X_l^E = X_l ⊙ E, X_l^S = X_l ⊙ S    (4)
wherein ⊙ denotes element-wise multiplication;
the X_l^E of the several stages are unified in channel dimensionality by a convolutional layer Conv and then serve as the outputs of the corresponding stages; unifying the channels ensures the balance of low-level and high-level information; X_l^S is input to the subsequent stage, forcing the network to mine potential channel features containing fine-grained knowledge.
Further, the spatial multi-region feature attention module employs downsampling convolutions, a 1×1 convolution, a SoftMax function and a CCMP (cross-channel max pooling) module, wherein the downsampling convolutions keep the spatial scale of the X_l^E of the several stages consistent with the feature map X_L^E of the last stage of the network, the 1×1 convolution is used to simplify the calculation, and the SoftMax function and the CCMP module are used to calculate the similarity among the X_l^E of the several stages and to obtain a diversity learning loss L_div; L_div is negatively correlated with the similarity, and reducing the diversity loss through training makes the X_l^E of the several stages focus spatially on different discriminative parts of the object;
assume the feature maps obtained by the channel feature re-attention module in the last three stages of the feature extraction network are X_{L-2}^E ∈ R^{C_t×W_{L-2}×H_{L-2}}, X_{L-1}^E ∈ R^{C_t×W_{L-1}×H_{L-1}} and X_L^E ∈ R^{C_t×W_L×H_L}, wherein C_t denotes the unified channel dimensionality, W_{L-2} and H_{L-2} denote the width and height of the feature map of stage L-2, W_{L-1} and H_{L-1} denote the width and height of the feature map of stage L-1, and W_L and H_L denote the width and height of the feature map of stage L;
in order to reduce the amount of calculation, the feature maps are preprocessed by the following formula:
X̃_l = φ(Conv_block_l(X_l^E))    (5)
wherein φ(·) denotes the 1×1 convolution, Conv_block_l(·) denotes the downsampling convolution, and l denotes the stage the feature map comes from;
after the feature maps of the three stages, now with the same spatial size and a channel number of 1, are obtained, the SoftMax function is applied to obtain the weight of each spatial position, and the maps are then concatenated along the channel dimension to obtain X_concat, which is input into the CCMP module; CCMP responds to the peak of X_concat in the channel dimension, and the resulting elements are summed and averaged by the operation h(·) to obtain the similarity value S_i:
S_i = h(CCMP(X_concat)) = (1/k) Σ_{m=1}^{k} max_{j=1..ε} X_concat(j, m)    (6)
wherein k denotes the size of the spatial dimension of X_concat, j indexes the channels of X_concat, and ε denotes the number of channels of X_concat; the spatial multi-region feature attention module thus yields the value S_i representing the similarity between the feature maps of the stages;
finally, the diversity learning loss L_div is obtained from the similarity S_i as follows:
L_div = (1 − S_i)/ε    (7)
wherein ε denotes the number of stages of the feature extraction network used as outputs, i.e., the number of channels of X_concat.
Further, the classifier employs a SoftMax classifier, which is applied in a multi-classification task to map the outputs of a plurality of neurons into a (0, 1) interval.
Further, the total loss function L_total of the classification network model is defined as follows:
L_total = αL_cls + βL_div + γL_con    (8)
wherein L_cls denotes the cross-entropy loss, L_div the diversity learning loss and L_con the contrastive learning loss, and α, β and γ are balance parameters used to weight the respective loss terms; wherein,
the cross-entropy loss L_cls is composed of the classification loss of each stage and the classification loss of the overall representation obtained by concatenating the features of the stages, and is calculated as:
L_cls = −θ_1 Σ_l y·log(Z_{f_l}) − θ_2 y·log(Z_{f_concat})    (9)
wherein y is the ground-truth label of the input image, represented as a one-hot vector; θ_1 and θ_2 are balance parameters; the SoftMax function is used to compute the predicted label values of the neural network; cls_l(·) denotes a classifier, and Z_{f_l} = SoftMax(cls_l(f_l)) is the label prediction for the output feature f_l of stage l; cls_concat(·) denotes the classifier for the overall feature representation, and Z_{f_concat} = SoftMax(cls_concat(f_concat)) is the label prediction for the overall feature representation f_concat;
the contrastive learning loss L_con is:
L_con = (1/N²) Σ_{i,j} [ 1(y_i = y_j)·(1 − sim(z_i, z_j)) + 1(y_i ≠ y_j)·max(sim(z_i, z_j) − η, 0) ]    (10)
wherein N is the size of the input image batch; z_i and z_j are l2-normalized representations of input images within the same batch; y_i and y_j are their label values; sim(z_i, z_j) is the cosine similarity between z_i and z_j; i and j index different samples of the same batch; and η is a threshold meaning that only pairs of different classes whose similarity exceeds η contribute to the loss L_con.
Further, the specific process of step 2 is as follows:
Step 2.1, adopting the CUB_200_2011 data set as the training data set, carrying out data preprocessing on the acquired original images by horizontal flipping and center cropping to realize data expansion, and constructing the training data set;
Step 2.2, sending the fine-grained images of the training data set into the classification network model, and training and optimizing the learnable parameters in the classification network model, so that the channel feature re-attention module in the model mines the potential fine-grained knowledge in the feature maps to the greatest extent and the spatial multi-region feature attention module greatly reduces the similarity between the feature maps of different stages; when the whole model is trained to convergence, the trained classification network model is obtained.
Further, the specific process of step 3 is as follows:
First, the fine-grained image to be classified is sent into the feature extraction network with L stages, and is then input into the channel feature re-attention module to obtain the channel-enhanced feature map X_l^E and the channel-suppressed feature map X_l^S. The channel-enhanced feature map serves as the output of the current stage of the network, and the channel-suppressed feature map is sent to the subsequent stage to force the network to pay attention to information-impoverished channels that still contain fine-grained knowledge. The model training process uses the spatial multi-region feature attention module so that the channel-enhanced feature maps X_l^E output by the several stages focus on different discriminative parts of the object in the spatial dimension. The model thus obtains several output features that are discriminative in both space and channels, and the output features of the several stages are taken together as the feature representation of the image; the classification result of the current image is finally obtained through the SoftMax classifier.
The invention has the following beneficial technical effects:
the present invention greatly improves the limitations of attention mechanisms and convolutional neural network-based methods on fine-grained image classification. Through the multi-stage feature extraction network, the aggregation capability of the classification network on feature information is improved, low-level information and high-level semantic information are included, and the robustness of the extracted features is improved; through the channel characteristic re-attention module, the classification network is effectively helped to extract the channel characteristics which are ignored originally but are helpful for fine-grained classification, so that the obtained characteristics are more comprehensively represented; through the spatial multi-region feature attention module, the features output by multiple stages of the classification network respectively pay attention to different discriminative parts of the object in the spatial dimension, so that the discriminative performance of the final feature representation is improved; by fusing the loss terms of the comparison learning idea, different types of fine-grained images are treated differently, and the difference between the types is increased. In the comparison learning loss item, the idea of comparison learning is fused, different types of training images in the same input batch are set as negative samples, the same type of training images are set as positive samples, the distance between the positive samples is pulled in through the setting of a loss function, and the distance between the negative samples is pulled out, so that the classification effect of the classification network is further optimized in the training process.
Drawings
FIG. 1 is a flow chart of a fine-grained image classification method based on a convolutional neural network according to the present invention;
FIG. 2 is a schematic diagram of the overall structure of the classification network model of the present invention;
FIG. 3 is a schematic diagram of a classification network model channel feature re-attention module according to the present invention;
FIG. 4 is a schematic diagram of the spatial multi-region feature attention module of the classification network model according to the present invention.
Detailed Description
The invention is described in further detail below with reference to the following figures and detailed description:
research shows that in a plurality of fine-grained image classification methods, a convolutional neural network which integrates a channel feature attention module and a spatial multi-region feature attention module is a reliable classification idea, belongs to a weak supervision method, and can obtain more comprehensive and abundant features by taking a multi-stage convolutional network as a feature extraction network. Because the multi-stage features contain both low-level information (color, edge connection points, etc.) and high-level semantic information, the low-level information remains unchanged when the pose and background of the object change, reducing intra-class variance. Although the classification method based on deep learning and attention mechanism improves the effect of classifying fine-grained images to some extent, there are some disadvantages. For a fine-grained image classification network, in addition to extracting features which are significant and easy to distinguish, the method also helps a neural network to learn more knowledge which is helpful for fine-grained classification in the dimensions of channels and spaces of object features, namely, the method can use a channel feature re-attention module to force the network to mine knowledge in channel features with poor information content, and use a spatial multi-region feature attention module to enable multi-stage features to respectively focus on different discriminative portions of an object. A more discriminative representation of the features in channel and spatial dimensions is finally obtained.
Therefore, the invention provides a fine-grained image classification method based on a convolutional neural network, a classification network model is constructed by fusing a channel feature re-attention module and a spatial multi-region feature attention module, a contrast learning loss item in a loss function is designed by adopting a contrast learning idea, and finally, the classification network model is adopted to classify images acquired in real time. As shown in fig. 1 and fig. 2, the method specifically includes the following steps:
step 1, constructing a classification network model;
the classification network model comprises a feature extraction network, a channel feature re-attention module, a spatial multi-region feature attention module and a classifier.
The feature extraction network is built from a basic convolutional network such as ResNet50, ResNet101 or DenseNet161. These convolutional networks have similar structures and consist of multiple stages, each containing convolutional layers; when an image is input into the feature extraction network, each stage halves the spatial size of the feature map and doubles the number of channels. The feature maps X_l output by the last three stages of the network are used as the outputs of the feature extraction network.
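For illustration only (this sketch is not part of the patent disclosure), the multi-stage feature extraction described above might be realized in PyTorch as follows; the wrapper name MultiStageBackbone and the use of torchvision's ResNet-50 layer2/layer3/layer4 as the last three stages are assumptions.

```python
import torch.nn as nn
from torchvision.models import resnet50

class MultiStageBackbone(nn.Module):
    """Returns the feature maps of the last three stages of ResNet-50.

    As described above, each stage halves the spatial size of the
    feature map and doubles the number of channels."""
    def __init__(self):
        super().__init__()
        net = resnet50(weights="IMAGENET1K_V1")
        self.stem = nn.Sequential(net.conv1, net.bn1, net.relu, net.maxpool)
        self.stage1 = net.layer1                 # 256 channels
        self.stage2 = net.layer2                 # 512 channels,  stage L-2
        self.stage3 = net.layer3                 # 1024 channels, stage L-1
        self.stage4 = net.layer4                 # 2048 channels, stage L

    def forward(self, x):
        x = self.stage1(self.stem(x))
        x_l2 = self.stage2(x)                    # X_{L-2}
        x_l1 = self.stage3(x_l2)                 # X_{L-1}
        x_l = self.stage4(x_l1)                  # X_L
        return [x_l2, x_l1, x_l]
```

Note that in the full model the channel feature re-attention module sits between the stages, so that the suppressed map rather than the raw stage output feeds the next stage; the plain wrapper above only shows the baseline data flow.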
As shown in fig. 3, the channel feature re-attention module first aggregates channel information using average pooling and maximum pooling operations, and obtains the weight of each channel in the feature map with a SoftMax function; an enhancement mask matrix E is obtained from this weight distribution, the highly weighted channels are then suppressed, and a suppression mask matrix S is obtained through a suppression function F(x). The input feature map X_l is multiplied with the enhancement mask matrix E and the suppression mask matrix S respectively to obtain the output feature maps X_l^E and X_l^S; wherein,
the SoftMax function may be represented by:
Z_i = exp(x_i) / Σ_{c=1}^{C} exp(x_c)    (1)
wherein Z_i is the output value of channel i after the SoftMax function and C is the total number of output channels; the channel weight information can be obtained through the SoftMax function.
The enhancement mask matrix E may be calculated by:
E = SoftMax(AvgPool(X_l) + MaxPool(X_l))    (2)
wherein AvgPool(·) denotes average pooling and MaxPool(·) denotes maximum pooling.
The suppression function F(x) may be represented by the following piecewise form:
F(Z_i) = ω, if Z_i ≥ δ·Z_max; F(Z_i) = 1, otherwise    (3)
wherein Z_i is the output value of each channel after the SoftMax function, Z_max is the maximum output value over the channels, and ω and δ are hyper-parameters denoting, respectively, how strongly the corresponding channel is suppressed and the degree to which a channel needs to be suppressed.
The output feature maps X_l^E and X_l^S of the current stage can be obtained by the following formula:
X_l^E = X_l ⊙ E, X_l^S = X_l ⊙ S    (4)
wherein ⊙ denotes element-wise multiplication.
The X_l^E of the several stages are unified in channel dimensionality by a convolutional layer Conv and then serve as the outputs of the corresponding stages; unifying the channels ensures the balance of low-level and high-level information. X_l^S is input to the subsequent stage, forcing the network to mine potential channel features that contain fine-grained knowledge.
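As a concrete illustration of formulas (1)-(4), a minimal PyTorch sketch of the channel feature re-attention step follows; since the patent gives the suppression function F only descriptively, its piecewise form here (suppress to ω any channel whose weight reaches δ·Z_max) and the names ChannelReAttention, omega and delta are assumptions.

```python
import torch
import torch.nn as nn

class ChannelReAttention(nn.Module):
    """Channel feature re-attention sketch following formulas (1)-(4).

    Returns the enhanced map X^E (output of the current stage) and the
    suppressed map X^S (input to the subsequent stage)."""
    def __init__(self, omega=0.1, delta=0.7):
        super().__init__()
        self.omega = omega        # suppression strength for strong channels
        self.delta = delta        # fraction of Z_max that triggers suppression

    def forward(self, x):                          # x: (B, C, H, W)
        b, c, _, _ = x.shape
        avg = x.mean(dim=(2, 3))                   # AvgPool over space, (B, C)
        mx = x.amax(dim=(2, 3))                    # MaxPool over space, (B, C)
        e = torch.softmax(avg + mx, dim=1)         # enhancement mask E, (2)
        z_max = e.amax(dim=1, keepdim=True)        # Z_max per sample
        s = torch.where(e >= self.delta * z_max,   # suppression mask S, (3)
                        torch.full_like(e, self.omega),
                        torch.ones_like(e))
        x_e = x * e.view(b, c, 1, 1)               # X^E = X ⊙ E, formula (4)
        x_s = x * s.view(b, c, 1, 1)               # X^S = X ⊙ S
        return x_e, x_s
```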
As shown in FIG. 4, the spatial multi-region feature attention module employs downsampling convolutions, a 1×1 convolution, a SoftMax function and a CCMP (cross-channel max pooling) module, wherein the downsampling convolutions keep the spatial scale of the X_l^E of the several stages consistent with the feature map X_L^E of the last stage of the network, the 1×1 convolution is used to simplify the calculation, and the SoftMax function and the CCMP module are used to calculate the similarity among the X_l^E of the several stages and to obtain a diversity learning loss L_div. L_div is negatively correlated with the similarity; reducing the diversity loss through training makes the X_l^E of the several stages focus spatially on different discriminative parts of the object.
Assume the feature maps obtained by the channel feature re-attention module in the last three stages of the feature extraction network are X_{L-2}^E ∈ R^{C_t×W_{L-2}×H_{L-2}}, X_{L-1}^E ∈ R^{C_t×W_{L-1}×H_{L-1}} and X_L^E ∈ R^{C_t×W_L×H_L}, wherein C_t denotes the unified channel dimensionality (equal to 1 in the present invention), W_{L-2} and H_{L-2} denote the width and height of the feature map of stage L-2, W_{L-1} and H_{L-1} denote the width and height of the feature map of stage L-1, and W_L and H_L denote the width and height of the feature map of stage L.
In order to reduce the amount of calculation, the feature maps are preprocessed by the following formula:
X̃_l = φ(Conv_block_l(X_l^E))    (5)
wherein φ(·) denotes the 1×1 convolution, Conv_block_l(·) denotes the downsampling convolution, and l denotes the stage the feature map comes from.
This yields feature maps from the three stages with the same spatial size and a channel number of 1. In order to explore the similarity of the three stages' feature maps in the spatial dimension, the SoftMax function is applied to obtain the weight of each spatial position, and the maps are then concatenated along the channel dimension to obtain X_concat, which is input into the CCMP module. CCMP is cross-channel max pooling; it responds to the peak of X_concat in the channel dimension, and the resulting elements are summed and averaged by the operation h(·) to obtain the similarity value S_i:
S_i = h(CCMP(X_concat)) = (1/k) Σ_{m=1}^{k} max_{j=1..ε} X_concat(j, m)    (6)
wherein k denotes the size of the spatial dimension of X_concat, j indexes the channels of X_concat, and ε denotes the number of channels of X_concat; the spatial multi-region feature attention module thus yields the value S_i representing the similarity between the feature maps of the stages. The larger the value of S_i, the higher the similarity between the feature maps; to make the classification model focus on several different parts of the object, the similarity between the feature maps, i.e. S_i, is reduced during training.
Finally, the diversity learning loss L_div is obtained from the similarity S_i as follows:
L_div = (1 − S_i)/ε    (7)
wherein ε denotes the number of stages of the feature extraction network used as outputs, i.e., the number of channels of X_concat, which is 3 in the present invention.
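The following sketch shows one plausible realization of the preprocessing of formula (5), the CCMP similarity of formula (6) and the diversity loss of formula (7); the stride choices of the downsampling convolutions (matched to stage feature maps whose sides differ by factors of 4, 2 and 1) and the module name are assumptions.

```python
import torch
import torch.nn as nn

class SpatialMultiRegionAttention(nn.Module):
    """Computes the diversity loss L_div over the last three stages' X^E.

    Each map is downsampled to the spatial size of stage L, projected to one
    channel (formula (5)), softmax-normalised over space, concatenated, and
    scored by cross-channel max pooling plus spatial averaging (formula (6));
    L_div = (1 - S_i) / eps (formula (7))."""
    def __init__(self, channels=(512, 1024, 2048), strides=(4, 2, 1)):
        super().__init__()
        self.down = nn.ModuleList([
            nn.Conv2d(c, c, kernel_size=3, stride=s, padding=1)  # Conv_block_l
            for c, s in zip(channels, strides)])
        self.proj = nn.ModuleList([
            nn.Conv2d(c, 1, kernel_size=1) for c in channels])   # phi(.)

    def forward(self, feats):                 # [X_{L-2}^E, X_{L-1}^E, X_L^E]
        maps = []
        for f, down, proj in zip(feats, self.down, self.proj):
            m = proj(down(f))                 # formula (5): one-channel map
            b, _, h, w = m.shape              # softmax over spatial positions
            maps.append(torch.softmax(m.view(b, -1), dim=1).view(b, 1, h, w))
        x_concat = torch.cat(maps, dim=1)     # (B, eps, H_L, W_L), eps = 3
        peak = x_concat.amax(dim=1)           # CCMP: max across the channels
        s_i = peak.mean(dim=(1, 2))           # h(.): average over k positions
        return (1.0 - s_i).mean() / x_concat.shape[1]   # formula (7)
```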
The classifier adopts a SoftMax classifier, which is used in multi-classification tasks and maps the outputs of a plurality of neurons into the (0, 1) interval; these outputs can be understood as probabilities, enabling multi-class classification.
In addition, the total loss function L_total of the classification network model is defined as follows:
L_total = αL_cls + βL_div + γL_con    (8)
wherein L_cls denotes the cross-entropy loss, L_div the diversity learning loss and L_con the contrastive learning loss, and α, β and γ are balance parameters used to weight the respective loss terms; wherein,
the cross-entropy loss L_cls is composed of the classification loss of each stage and the classification loss of the overall representation obtained by concatenating the features of the stages, and is calculated as:
L_cls = −θ_1 Σ_l y·log(Z_{f_l}) − θ_2 y·log(Z_{f_concat})    (9)
wherein y is the ground-truth label of the input image, represented as a one-hot vector; θ_1 and θ_2 are also balance parameters; the SoftMax function is used to compute the predicted label values of the neural network; cls_l(·) denotes a classifier, and Z_{f_l} = SoftMax(cls_l(f_l)) is the label prediction for the output feature f_l of stage l; cls_concat(·) denotes the classifier for the overall feature representation, and Z_{f_concat} = SoftMax(cls_concat(f_concat)) is the label prediction for the overall feature representation f_concat.
The diversity learning loss L_div is calculated as in formula (7) above.
The contrastive learning loss L_con is:
L_con = (1/N²) Σ_{i,j} [ 1(y_i = y_j)·(1 − sim(z_i, z_j)) + 1(y_i ≠ y_j)·max(sim(z_i, z_j) − η, 0) ]    (10)
wherein N is the size of the input image batch; z_i and z_j are l2-normalized representations of input images within the same batch; y_i and y_j are their label values; sim(z_i, z_j) is the cosine similarity between z_i and z_j; i and j index different samples of the same batch; and η is a threshold meaning that only pairs of different classes whose similarity exceeds η contribute to the loss L_con.
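Under the reconstructions of formulas (8)-(10) given above, the total training loss could be sketched as follows; the exact pairwise form of L_con is inferred from the description (same-class pairs are pulled together, different-class pairs with similarity above η are pushed apart) and should be read as an assumption rather than the patent's definitive formula.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(z, y, eta=0.3):
    """Sketch of L_con (formula (10)). z: (N, D) batch of features,
    y: (N,) labels, eta: similarity threshold for negative pairs."""
    z = F.normalize(z, dim=1)                    # l2 regularisation
    sim = z @ z.t()                              # cosine similarity matrix
    pos = (y.unsqueeze(0) == y.unsqueeze(1)).float()
    pos.fill_diagonal_(0)                        # ignore self-pairs
    neg = 1.0 - pos
    neg.fill_diagonal_(0)
    loss = pos * (1.0 - sim) + neg * torch.clamp(sim - eta, min=0.0)
    return loss.sum() / (z.shape[0] ** 2)

def total_loss(stage_logits, concat_logits, target, l_div, z, y,
               alpha=1.0, beta=1.0, gamma=1.0, theta1=1.0, theta2=1.0):
    """L_total = alpha*L_cls + beta*L_div + gamma*L_con (formula (8)),
    with L_cls combining the per-stage and concatenated-feature
    cross-entropy terms of formula (9)."""
    l_cls = theta1 * sum(F.cross_entropy(s, target) for s in stage_logits)
    l_cls = l_cls + theta2 * F.cross_entropy(concat_logits, target)
    return alpha * l_cls + beta * l_div + gamma * contrastive_loss(z, y)
```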
Step 2, constructing a training data set and carrying out model training; the specific process is as follows:
Step 2.1, adopting the CUB_200_2011 data set as the training data set, carrying out data preprocessing on the acquired original images by means of horizontal flipping, center cropping and the like to realize data expansion, and constructing the training data set;
the CUB _200_2011 dataset was a fine-grained dataset proposed by the california institute of technology in 2010, which is also the baseline image dataset for current fine-grained classification and identification studies. It has 11788 bird images, including 200 bird species, where the training data set has 5994 images and the test set has 5794 images, each of which provides image-class tagging information.
Step 2.2, sending the fine-grained images of the training data set into the classification network model, and training and optimizing the learnable parameters in the classification network model, so that the channel feature re-attention module in the model mines the potential fine-grained knowledge in the feature maps to the greatest extent and the spatial multi-region feature attention module greatly reduces the similarity between the feature maps of different stages; when the whole model is trained to convergence, the trained classification network model is obtained.
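A typical torchvision preprocessing pipeline matching step 2.1 (horizontal flipping and center cropping) and a bare training step might look like the sketch below; the 448-pixel crop, the optimizer settings and the directory layout are illustrative assumptions, not values stated in the patent.

```python
import torch
from torchvision import datasets, transforms

# Step 2.1: data preprocessing / expansion by flipping and center cropping.
train_tf = transforms.Compose([
    transforms.Resize(512),
    transforms.CenterCrop(448),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])
train_set = datasets.ImageFolder("CUB_200_2011/train", transform=train_tf)
loader = torch.utils.data.DataLoader(train_set, batch_size=16, shuffle=True)

# Step 2.2: optimise the learnable parameters until convergence.
# `model` and `criterion` stand for the classification network and the
# total loss sketched above (hypothetical helpers, named here for clarity).
# optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
# for images, labels in loader:
#     loss = criterion(model(images), labels)
#     optimizer.zero_grad(); loss.backward(); optimizer.step()
```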
Step 3, acquiring the images to be classified in real time and sending them to the trained classification network model to obtain the classification result of the current image. The specific process is as follows:
First, the fine-grained image to be classified is sent into the feature extraction network with L stages, and is then input into the channel feature re-attention module to obtain the channel-enhanced feature map X_l^E and the channel-suppressed feature map X_l^S. The channel-enhanced feature map serves as the output of the current stage of the network, and the channel-suppressed feature map is sent to the subsequent stage to force the network to pay attention to information-impoverished channels that still contain fine-grained knowledge. Model training has already used the spatial multi-region feature attention module to make the channel-enhanced feature maps X_l^E output by the several stages focus on different discriminative parts of the object in the spatial dimension. The model therefore obtains several output features that are discriminative in both space and channels, and the output features of the several stages are taken together as the feature representation of the image; the classification result of the current image is finally obtained through the SoftMax classifier.
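Correspondingly, step 3 reduces to a single forward pass; a minimal sketch, assuming the trained model returns the logits of the concatenated multi-stage representation, is:

```python
import torch

@torch.no_grad()
def classify(model, image):
    """Step 3 sketch: classify one preprocessed image tensor (C, H, W)."""
    model.eval()
    logits = model(image.unsqueeze(0))      # add the batch dimension
    probs = torch.softmax(logits, dim=1)    # SoftMax classifier output
    return probs.argmax(dim=1).item()       # predicted class index
```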
The invention provides a fine-grained image classification method based on a convolutional neural network that fuses a channel feature re-attention module and a spatial multi-region feature attention module, combining convolutional neural networks from deep learning with improved attention modules to classify fine-grained images. The method remedies, to the greatest extent, the shortcomings of attention mechanisms in this task and strengthens the feature extraction capability of the basic convolutional network. In the classification network model, the proposed channel feature re-attention module improves the utilization of features within the network while adding almost no learnable parameters to the original network, better learns the potential fine-grained knowledge contained in channel features that benefit fine-grained classification, and helps control overfitting on tasks with smaller training sets (such as the CUB_200_2011 bird data set used in the invention). The spatial multi-region feature attention module makes the feature maps output by the several stages of the classification network attend to different discriminative parts of the object in space rather than all concentrating on its most salient part. A contrastive learning loss term designed into the loss function fuses the idea of contrastive learning and improves the classification performance of the network model. The invention solves the problems that context cannot be fully utilized when deep networks extract features in fine-grained image classification tasks, and that attention mechanisms attend only to the most salient channel and spatial features of an object.
It is to be understood that the above description is not intended to limit the present invention, and the present invention is not limited to the above examples, and those skilled in the art may make modifications, alterations, additions or substitutions within the spirit and scope of the present invention.

Claims (8)

1. A fine-grained image classification method based on a convolutional neural network is characterized by comprising the steps of firstly constructing a classification network model by fusing a channel feature re-attention module and a spatial multi-region feature attention module, then designing a contrast learning loss item in a loss function by adopting a contrast learning idea, and finally classifying images acquired in real time by adopting the classification network model; the method specifically comprises the following steps:
step 1, constructing a classification network model;
the classification network model comprises a feature extraction network, a channel feature re-attention module, a spatial multi-region feature attention module and a classifier;
step 2, constructing a training data set and carrying out model training;
and step 3, acquiring the images to be classified in real time and sending them to the trained classification network model to obtain the classification result of the current image.
2. The fine-grained image classification method based on a convolutional neural network according to claim 1, characterized in that a convolutional neural network whose last three stages provide the outputs is used as the feature extraction network; the feature extraction network is built from a basic convolutional network such as ResNet50, ResNet101 or DenseNet161, each convolutional network structure consists of several stages, and each stage contains convolutional layers; when an image is input into the feature extraction network, each stage halves the spatial size of the feature map and doubles the number of channels, and the feature maps X_l output by the several stages of the feature extraction network serve as the output features of the feature extraction network.
3. The fine-grained image classification method based on a convolutional neural network according to claim 1, characterized in that the channel feature re-attention module first aggregates channel information using average pooling and maximum pooling operations, and obtains the weight of each channel in the feature map with a SoftMax function; an enhancement mask matrix E is obtained from the weight distribution, the highly weighted channels are then suppressed, and a suppression mask matrix S is obtained through a suppression function F(x); the input feature map X_l is multiplied with the enhancement mask matrix E and the suppression mask matrix S respectively to obtain the output feature maps X_l^E and X_l^S; wherein,
the SoftMax function is represented by:
Z_i = exp(x_i) / Σ_{c=1}^{C} exp(x_c)    (1)
wherein Z_i is the output value of channel i after the SoftMax function and C is the total number of output channels; the channel weight information is obtained through the SoftMax function;
the enhancement mask matrix E is calculated by:
E = SoftMax(AvgPool(X_l) + MaxPool(X_l))    (2)
wherein AvgPool(·) denotes average pooling and MaxPool(·) denotes maximum pooling;
the suppression function F(x) is represented by the following piecewise form:
F(Z_i) = ω, if Z_i ≥ δ·Z_max; F(Z_i) = 1, otherwise    (3)
wherein Z_max is the maximum output value over the channels, and ω and δ are hyper-parameters denoting, respectively, how strongly the corresponding channel is suppressed and the degree to which a channel needs to be suppressed;
the output feature maps X_l^E and X_l^S of the current stage are obtained by the following formula:
X_l^E = X_l ⊙ E, X_l^S = X_l ⊙ S    (4)
wherein ⊙ denotes element-wise multiplication;
the X_l^E of the several stages are unified in channel dimensionality by a convolutional layer Conv and then serve as the outputs of the corresponding stages, and unifying the channels ensures the balance of low-level and high-level information; X_l^S is input to the subsequent stage, forcing the network to mine potential channel features containing fine-grained knowledge.
4. The fine-grained image classification method based on a convolutional neural network according to claim 1, characterized in that the spatial multi-region feature attention module employs downsampling convolutions, a 1×1 convolution, a SoftMax function and a CCMP module, wherein the downsampling convolutions keep the spatial scale of the X_l^E of the several stages consistent with the feature map X_L^E of the last stage of the network, the 1×1 convolution is used to simplify the calculation, and the SoftMax function and the CCMP module are used to calculate the similarity among the X_l^E of the several stages and to obtain a diversity learning loss L_div; L_div is negatively correlated with the similarity, and reducing the diversity loss through training makes the X_l^E of the several stages focus spatially on different discriminative parts of the object;
assume the feature maps obtained by the channel feature re-attention module in the last three stages of the feature extraction network are X_{L-2}^E ∈ R^{C_t×W_{L-2}×H_{L-2}}, X_{L-1}^E ∈ R^{C_t×W_{L-1}×H_{L-1}} and X_L^E ∈ R^{C_t×W_L×H_L}, wherein C_t denotes the unified channel dimensionality, W_{L-2} and H_{L-2} denote the width and height of the feature map of stage L-2, W_{L-1} and H_{L-1} denote the width and height of the feature map of stage L-1, and W_L and H_L denote the width and height of the feature map of stage L;
in order to reduce the amount of calculation, the feature maps are preprocessed by the following formula:
X̃_l = φ(Conv_block_l(X_l^E))    (5)
wherein φ(·) denotes the 1×1 convolution, Conv_block_l(·) denotes the downsampling convolution, and l denotes the stage the feature map comes from;
after the feature maps of the three stages, now with the same spatial size and a channel number of 1, are obtained, the SoftMax function is applied to obtain the weight of each spatial position, and the maps are then concatenated along the channel dimension to obtain X_concat, which is input into the CCMP module; CCMP responds to the peak of X_concat in the channel dimension, and the resulting elements are summed and averaged by the operation h(·) to obtain the similarity value S_i:
S_i = h(CCMP(X_concat)) = (1/k) Σ_{m=1}^{k} max_{j=1..ε} X_concat(j, m)    (6)
wherein k denotes the size of the spatial dimension of X_concat, j indexes the channels of X_concat, and ε denotes the number of channels of X_concat; the spatial multi-region feature attention module thus yields the value S_i representing the similarity between the feature maps of the stages;
finally, the diversity learning loss L_div is obtained from the similarity S_i as follows,
L_div = (1 − S_i)/ε    (7)
wherein ε denotes the number of stages of the feature extraction network used as outputs, i.e., the number of channels of X_concat.
5. The fine-grained image classification method based on convolutional neural network of claim 1, wherein the classifier adopts a SoftMax classifier and is applied in a multi-classification task to map the output of a plurality of neurons into a (0, 1) interval.
6. The fine-grained image classification method based on a convolutional neural network according to claim 1, characterized in that the total loss function L_total of the classification network model is defined as follows:
L_total = αL_cls + βL_div + γL_con    (8)
wherein L_cls denotes the cross-entropy loss, L_div the diversity learning loss and L_con the contrastive learning loss, and α, β and γ are balance parameters used to weight the respective loss terms; wherein,
the cross-entropy loss L_cls is composed of the classification loss of each stage and the classification loss of the overall representation obtained by concatenating the features of the stages, and is calculated as:
L_cls = −θ_1 Σ_l y·log(Z_{f_l}) − θ_2 y·log(Z_{f_concat})    (9)
wherein y is the ground-truth label of the input image, represented as a one-hot vector; θ_1 and θ_2 are balance parameters; the SoftMax function is used to compute the predicted label values of the neural network; cls_l(·) denotes a classifier, and Z_{f_l} = SoftMax(cls_l(f_l)) is the label prediction for the output feature f_l of stage l; cls_concat(·) denotes the classifier for the overall feature representation, and Z_{f_concat} = SoftMax(cls_concat(f_concat)) is the label prediction for the overall feature representation f_concat;
the contrastive learning loss L_con is:
L_con = (1/N²) Σ_{i,j} [ 1(y_i = y_j)·(1 − sim(z_i, z_j)) + 1(y_i ≠ y_j)·max(sim(z_i, z_j) − η, 0) ]    (10)
wherein N is the size of the input image batch; z_i and z_j are l2-normalized representations of input images within the same batch; y_i and y_j are their label values; sim(z_i, z_j) is the cosine similarity between z_i and z_j; i and j index different samples of the same batch; and η is a threshold meaning that only pairs of different classes whose similarity exceeds η contribute to the loss L_con.
7. The fine-grained image classification method based on the convolutional neural network as claimed in claim 1, wherein the specific process of the step 2 is as follows:
Step 2.1, adopting the CUB_200_2011 data set as the training data set, carrying out data preprocessing on the acquired original images by horizontal flipping and center cropping to realize data expansion, and constructing the training data set;
Step 2.2, sending the fine-grained images of the training data set into the classification network model, and training and optimizing the learnable parameters in the classification network model, so that the channel feature re-attention module in the model mines the potential fine-grained knowledge in the feature maps to the greatest extent and the spatial multi-region feature attention module greatly reduces the similarity between the feature maps of different stages; when the whole model is trained to convergence, the trained classification network model is obtained.
8. The fine-grained image classification method based on the convolutional neural network as claimed in claim 1, wherein the specific process of step 3 is as follows:
first, the fine-grained image to be classified is sent into the feature extraction network with L stages, and is then input into the channel feature re-attention module to obtain the channel-enhanced feature map X_l^E and the channel-suppressed feature map X_l^S; the channel-enhanced feature map serves as the output of the current stage of the network, and the channel-suppressed feature map is sent to the subsequent stage to force the network to pay attention to information-impoverished channels containing fine-grained knowledge; the model training process uses the spatial multi-region feature attention module to make the channel-enhanced feature maps X_l^E output by the several stages focus on different discriminative parts of the object in the spatial dimension; the model therefore obtains several output features that are discriminative in both space and channels, and the output features of the several stages are finally taken as the feature representation of the image; the classification result of the current image is finally obtained through the SoftMax classifier.
CN202211224648.8A 2022-10-09 2022-10-09 Fine-grained image classification method based on convolutional neural network Pending CN115631369A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211224648.8A CN115631369A (en) 2022-10-09 2022-10-09 Fine-grained image classification method based on convolutional neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211224648.8A CN115631369A (en) 2022-10-09 2022-10-09 Fine-grained image classification method based on convolutional neural network

Publications (1)

Publication Number Publication Date
CN115631369A true CN115631369A (en) 2023-01-20

Family

ID=84904512

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211224648.8A Pending CN115631369A (en) 2022-10-09 2022-10-09 Fine-grained image classification method based on convolutional neural network

Country Status (1)

Country Link
CN (1) CN115631369A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116664911A (en) * 2023-04-17 2023-08-29 山东第一医科大学附属肿瘤医院(山东省肿瘤防治研究院、山东省肿瘤医院) Breast tumor image classification method based on interpretable deep learning
CN116452896A (en) * 2023-06-16 2023-07-18 中国科学技术大学 Method, system, device and medium for improving fine-grained image classification performance
CN116452896B (en) * 2023-06-16 2023-10-20 中国科学技术大学 Method, system, device and medium for improving fine-grained image classification performance
CN116994032A (en) * 2023-06-28 2023-11-03 河北大学 Rectal polyp multi-classification method based on deep learning
CN116994032B (en) * 2023-06-28 2024-02-27 河北大学 Rectal polyp multi-classification method based on deep learning
CN117011718A (en) * 2023-10-08 2023-11-07 之江实验室 Plant leaf fine granularity identification method and system based on multiple loss fusion
CN117011718B (en) * 2023-10-08 2024-02-02 之江实验室 Plant leaf fine granularity identification method and system based on multiple loss fusion

Similar Documents

Publication Publication Date Title
Bouti et al. A robust system for road sign detection and classification using LeNet architecture based on convolutional neural network
CN110532920B (en) Face recognition method for small-quantity data set based on FaceNet method
Yuan et al. Gated CNN: Integrating multi-scale feature layers for object detection
CN115631369A (en) Fine-grained image classification method based on convolutional neural network
CN111639564B (en) Video pedestrian re-identification method based on multi-attention heterogeneous network
CN107977661B (en) Region-of-interest detection method based on FCN and low-rank sparse decomposition
Roecker et al. Automatic vehicle type classification with convolutional neural networks
CN110287798B (en) Vector network pedestrian detection method based on feature modularization and context fusion
Vaidya et al. Deep learning architectures for object detection and classification
Manssor et al. Real-time human detection in thermal infrared imaging at night using enhanced Tiny-yolov3 network
CN116798070A (en) Cross-mode pedestrian re-recognition method based on spectrum sensing and attention mechanism
Yu et al. WaterHRNet: A multibranch hierarchical attentive network for water body extraction with remote sensing images
US20220301311A1 (en) Efficient self-attention for video processing
Ajagbe et al. Performance investigation of two-stage detection techniques using traffic light detection dataset
Wang et al. Pedestrian detection in infrared image based on depth transfer learning
Singh et al. CNN based approach for traffic sign recognition system
Kustikova et al. A survey of deep learning methods and software for image classification and object detection
Sabater et al. Event Transformer+. A multi-purpose solution for efficient event data processing
Akanksha et al. A Feature Extraction Approach for Multi-Object Detection Using HoG and LTP.
CN117372853A (en) Underwater target detection algorithm based on image enhancement and attention mechanism
Vijayalakshmi K et al. Copy-paste forgery detection using deep learning with error level analysis
Li A deep learning-based text detection and recognition approach for natural scenes
Zhou et al. Semantic image segmentation using low-level features and contextual cues
CN116797821A (en) Generalized zero sample image classification method based on fusion visual information
CN112668643B (en) Semi-supervised significance detection method based on lattice tower rule

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination