CN114692750A - Fine-grained image classification method and device, electronic equipment and storage medium - Google Patents

Fine-grained image classification method and device, electronic equipment and storage medium Download PDF

Info

Publication number
CN114692750A
CN114692750A (application CN202210318057.0A)
Authority
CN
China
Prior art keywords
image
fine
module
features
inputting
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210318057.0A
Other languages
Chinese (zh)
Inventor
余松森
陈建华
梁军
黄志机
朱海文
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China Normal University
Original Assignee
South China Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China Normal University filed Critical South China Normal University
Priority to CN202210318057.0A priority Critical patent/CN114692750A/en
Publication of CN114692750A publication Critical patent/CN114692750A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a fine-grained image classification method and device, electronic equipment and a storage medium. The fine-grained image classification method comprises the following steps: acquiring an image to be classified; inputting the image to be classified into a trained feature extraction network to obtain multi-scale features of the image, where the feature extraction network is a ResNet101 neural network with an inserted PCFN module; and inputting the multi-scale features of the image into a classification network to obtain a fine-grained classification result for the image. The method learns more discriminative features by associating cross-layer features, and obtains more expressive multi-scale features by combining the characteristics of features at different layers, thereby improving the classification effect.

Description

Fine-grained image classification method and device, electronic equipment and storage medium
Technical Field
The present invention relates to the field of computer vision technologies, and in particular, to a fine-grained image classification method and apparatus, an electronic device, and a storage medium.
Background
Fine-grained image classification is one of the popular research directions in the field of image recognition: it distinguishes sub-categories within a single broad category of images. Compared with traditional image classification, fine-grained classification is characterized by small inter-class differences and large intra-class differences. It has long been considered a challenging task because it requires both locating discriminative regions and learning features that identify them.
There are many excellent fine-grained image classification algorithms, but the following problems still exist:
(1) Convolutional neural networks can use manual annotation information during training to improve classification performance, but producing such annotations consumes a large amount of manpower, material resources and time. How to achieve good classification performance using only image-level category labels is therefore of significant research interest.
(2) An image usually contains a great deal of background information, while the discriminative information that determines the classification tends to lie in small local regions. The dimensionality of the features extracted by a convolutional neural network is fixed; the more discriminative information is extracted, the more discriminative the final feature vector, and the higher the classification accuracy. How to make the feature vector carry more effective information is therefore one of the key problems in improving the accuracy of fine-grained classification.
(3) Existing methods do not fully consider the channel and spatial pixel relationships between features at different layers, nor the distinct characteristics of those layers. How to better exploit multi-scale features to obtain better classification performance is therefore of significant research interest.
Disclosure of Invention
Based on this, the present invention provides a fine-grained image classification method, apparatus, electronic device and storage medium, which can learn more discriminative features by associating cross-layer features, and obtain multi-scale features with more expressive power by combining the characteristics of different layer features, thereby improving the classification effect.
In a first aspect, the present invention provides a fine-grained image classification method, including the following steps:
acquiring an image to be classified;
inputting the image to be classified into a trained feature extraction network to obtain the multi-scale features of the image; the feature extraction network is a ResNet101 neural network inserted with a PCFN module;
and inputting the multi-scale features of the image into a classification network to obtain a fine-grained classification result of the image to be classified.
Further, inputting the image into a trained feature extraction network to obtain the multi-scale features of the image, including:
inputting the image into the first layer of the ResNet101 neural network, and passing it sequentially through layer1, layer2, layer3 and layer4 for convolution processing;
respectively inputting the output features of layer2 and layer3 into the PCFN modules that follow them, to obtain semantically weighted output features;
respectively inputting the semantically weighted output features of layer2 and layer3 into a dimension transformation module to obtain output features consistent with the output feature space and channel dimension of layer 4;
and performing feature fusion on the output features of the dimension transformation module and the output features of layer4 to obtain the multi-scale features of the image.
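As a hedged illustration of this wiring (not the patent's actual implementation), the sketch below uses tiny stand-in convolution blocks in place of ResNet101's real stages, and simple strided convolutions in place of the PCFN and dimension-transformation modules; it shows only how the layer2 and layer3 outputs are brought to layer4's spatial and channel dimensions and fused by element-wise addition:

```python
import torch
import torch.nn as nn

class MultiScaleExtractor(nn.Module):
    """Sketch of the multi-scale feature extraction wiring (all blocks are
    illustrative stand-ins, not the real ResNet101 or PCFN modules)."""

    def __init__(self):
        super().__init__()
        self.stem   = nn.Conv2d(3, 8, 7, stride=4, padding=3)    # first layer
        self.layer1 = nn.Conv2d(8, 8, 3, padding=1)
        self.layer2 = nn.Conv2d(8, 16, 3, stride=2, padding=1)
        self.layer3 = nn.Conv2d(16, 32, 3, stride=2, padding=1)
        self.layer4 = nn.Conv2d(32, 64, 3, stride=2, padding=1)
        # stand-ins for PCFN + dimension transform: map f2/f3 to f4's shape
        self.t2 = nn.Conv2d(16, 64, 3, stride=4, padding=1)
        self.t3 = nn.Conv2d(32, 64, 3, stride=2, padding=1)

    def forward(self, x):
        x = self.stem(x)
        f1 = self.layer1(x)
        f2 = self.layer2(f1)   # response layer for a PCFN module
        f3 = self.layer3(f2)   # response layer for a PCFN module
        f4 = self.layer4(f3)   # query layer for the PCFN modules
        # element-wise add fusion of the dimension-matched features
        return f4 + self.t2(f2) + self.t3(f3)

features = MultiScaleExtractor()(torch.zeros(1, 3, 448, 448))
```

With a 448x448 input, the stand-in stages yield a 14x14, 64-channel fused map; in the real network the same roles are played by ResNet101's stages, the PCFN modules, and the dimension transformation module.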
Further, inputting the output features of layer2 and layer3 respectively into the PCFN modules that follow them, to obtain the semantically weighted output features, includes:
for either of the output features of layer2 and layer3, inputting those output features into the channel-association branch and the spatial-association branch of the PCFN module, as the response layer of the PCFN module;
performing an upsampling operation on the output features of layer4 and inputting them into the channel-association branch and the spatial-association branch, as the query layer of the PCFN module;
performing weighting operations in the channel-association branch and the spatial-association branch to obtain the semantic weighting information that the layer4 output features apply, in the channel and spatial dimensions, to the layer2 or layer3 output features;
and outputting the weighted information to obtain the semantically weighted output features corresponding to the layer2 or layer3 output features.
Further, feature fusing the output features of the dimension transformation module and the output features of layer4 includes:
and performing feature fusion on the output features of the dimension transformation module and the output features of layer4 in an element-wise add mode.
Further, the training step of the feature extraction network comprises:
acquiring a training data set of the feature extraction network, and dividing the training data set into a training set and a test set;
training the feature extraction network with the training set, using an SGD optimizer for learning; as errors are back-propagated, the loss function decreases and the training accuracy increases, and a pre-trained feature extraction network is obtained when the loss function converges and no longer decreases;
and testing the pre-trained feature extraction network by using the test set, and when the accuracy of the test result reaches a preset threshold value, storing the pre-trained feature extraction network to obtain the trained feature extraction network.
Further, training the parameters of the feature extraction network using the training set comprises:
training 150 epochs on each data set, with a gradual warm-up strategy used for the first ten epochs of training;
setting the batch size to 32 and the base learning rate to 0.01;
and adopting a fixed-step learning rate decay strategy, reducing the learning rate to one tenth of its value every 20 epochs.
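The schedule described in these steps can be sketched as a small function. The linear form of the warm-up and the point at which the step decay begins are assumptions, since the text only states that warm-up covers the first ten epochs and that the rate drops tenfold every 20 epochs:

```python
def learning_rate(epoch, base_lr=0.01, warmup_epochs=10, step=20, gamma=0.1):
    """Sketch of the training schedule: gradual warm-up for the first ten
    epochs (linear form assumed), then fixed-step decay by 10x every 20 epochs."""
    if epoch < warmup_epochs:
        return base_lr * (epoch + 1) / warmup_epochs   # assumed linear warm-up
    return base_lr * gamma ** (epoch // step)           # one-tenth every 20 epochs

# 150 training epochs per data set
schedule = [learning_rate(e) for e in range(150)]
```

In a real training loop this value would be assigned to the SGD optimizer's learning rate at the start of each epoch.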
Further, acquiring a training data set of the feature extraction network further includes:
preprocessing the training data set to obtain a data-enhanced training data set;
wherein the preprocessing operation comprises: translation, zoom, rotation, and flip.
In a second aspect, the present invention further provides a fine-grained image classification device, including:
the image acquisition module is used for acquiring images to be classified;
the characteristic extraction module is used for inputting the images to be classified into a trained characteristic extraction network to obtain the multi-scale characteristics of the images; the feature extraction network is a ResNet101 neural network inserted with a PCFN module;
and the classification module is used for inputting the multi-scale features of the image into a classification network to obtain a fine-grained classification result of the image to be classified.
In a third aspect, the present invention provides an electronic device, including:
at least one memory and at least one processor;
the memory for storing one or more programs;
when executed by the at least one processor, cause the at least one processor to implement the steps of a fine-grained image classification method according to any one of the first aspect of the invention.
In a fourth aspect, the present invention also provides a computer-readable storage medium, characterized in that:
the computer readable storage medium stores a computer program which, when executed by a processor, implements the steps of a fine-grained image classification method according to any one of the first aspect of the invention.
According to the fine-grained image classification method, apparatus, electronic device and storage medium, dependency relationships are established across the spatial and channel dimensions of features at multiple levels, and more discriminative features are learned, so that features with better discriminability and expressive power are obtained and classification performance is improved. The invention provides a plug-and-play module, the Polarized Cross-layer Fusion Network (PCFN for short), which models cross-layer spatial and channel dependencies at high resolution with lower complexity than other methods while achieving state-of-the-art performance. The PCFN adaptively attends to each part at multiple scales; by establishing high-level/low-level feature dependencies on a CNN backbone and extracting multi-scale features under the guidance of high-level feature semantics, it captures both high-level semantics and low-level detail, enabling better classification and more accurate localization.
For a better understanding and practice, the invention is described in detail below with reference to the accompanying drawings.
Drawings
Fig. 1 is a schematic flow chart of a fine-grained image classification method according to the present invention;
FIG. 2 is a schematic diagram of a feature extraction network used in the present invention;
FIG. 3 is a schematic diagram of a PCFN module used in the present invention;
fig. 4 is a schematic structural diagram of a fine-grained image classification apparatus provided by the present invention.
Detailed Description
To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
It should be understood that the embodiments described are only some embodiments of the present application, and not all embodiments. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in the present application without any creative effort belong to the protection scope of the embodiments in the present application.
The terminology used in the embodiments of the present application is for the purpose of describing particular embodiments only and is not intended to be limiting of the embodiments of the present application. As used in the examples of this application and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.
When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present application. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the application, as detailed in the appended claims. In the description of the present application, it is to be understood that the terms "first," "second," "third," and the like are used solely to distinguish one from another and are not necessarily used to describe a particular order or sequence, nor are they to be construed as indicating or implying relative importance. The specific meaning of the above terms in the present application can be understood by those of ordinary skill in the art as appropriate.
Further, in the description of the present application, "a plurality" means two or more unless otherwise specified. "and/or" describes the association relationship of the associated objects, meaning that there may be three relationships, e.g., a and/or B, which may mean: a exists alone, A and B exist simultaneously, and B exists alone. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship.
To solve the problems in the background art, an embodiment of the present application provides a fine-grained image classification method, as shown in fig. 1, the method includes the following steps:
s01: and acquiring an image to be classified.
In a preferred embodiment, the image is a color picture and scaled to the input size of the convolutional neural network.
S02: inputting the image to be classified into a trained feature extraction network to obtain the multi-scale features of the image; the feature extraction network is a ResNet101 neural network inserted with a PCFN module.
In a preferred embodiment, the structure of the feature extraction network is as shown in fig. 2: ResNet101 is used as the baseline, a PCFN module is inserted into the backbone network, the feature of the last convolutional layer of ResNet serves as the query layer of the PCFN module, and the features of earlier ResNet layers serve as response layers of the PCFN. The shallow ResNet features undergo high-level semantic enhancement through the PCFN module, and the semantically enhanced shallow features are fused with the high-level features after spatial dimensionality-reduction and channel dimension-raising operations.
The PCFN module can realize the modeling of the dependency relationship between the high-level feature and the shallow-level feature in the channel and the space. The module contains two basic components: a query layer and a response layer. One query layer is associated with multiple response layers based on observations that different layers of neurons have different receptive fields.
S03: and inputting the multi-scale features of the image into a classification network to obtain a fine-grained classification result of the image to be classified.
In a particular embodiment, the classification network may be a softmax function layer. In other examples, a classification network such as a support vector machine, a decision tree, etc. may also be used.
According to the fine-grained image classification method provided by the invention, more discriminative features can be learned by associating cross-layer features, and the multi-scale features with higher expression capability can be obtained by combining the characteristics of different layer features, so that the classification effect is improved.
The working process and working principle of the fine-grained image classification method provided by the invention are explained below with reference to a specific embodiment, starting from the construction and training of the feature extraction network.
First, a convolutional neural network is established. The method takes ResNet101 as the baseline and applies the PCFN module for multi-scale feature fusion to realize the classification model. The preprocessed image is fed into the convolutional layers of the ResNet network, and the last fully connected layer of ResNet is replaced by a fully connected layer whose output size matches the number of categories in the data set. The invention establishes associations between the features of stages 2 and 3 of ResNet and the features of stage 4: the features of stages 2 and 3 act as the response layers of the PCFN, and the features of stage 4 act as the query layer. As shown in fig. 3, the PCFN module comprises a channel-association branch and a spatial-association branch; through these two branches, the information with which the query layer weights the response layer in the channel and spatial dimensions is obtained, respectively. Adding this module to the baseline yields semantically weighted shallow features. The semantically weighted shallow features then undergo spatial dimensionality reduction, via max pooling and channel-by-channel convolution, followed by channel dimension raising, so that they are consistent in dimension with the deep features; deep and shallow features are then fused by an element-wise add operation. Based on this module, more discriminative multi-scale features can be obtained.
For a well-established feature extraction network model, a data set is required to be used for training the model.
The data set is established as follows:
The CUB-200-2011, Stanford Cars and FGVC-Aircraft data sets are adopted as external data sets. Each data set is first divided into a training set and a test set that are mutually exclusive, i.e., the two subsets have no intersection. Each picture carries a category label indicating the category to which it belongs. For example, the CUB-200-2011 data set has 5994 training images and 5794 test images. In addition, all pictures are scaled to the input size of the convolutional neural network.
In order to enlarge the data of the data set and reduce the overfitting, in a preferred embodiment, data enhanced data preprocessing of the data set is also required. The enhancement method generally comprises the following four methods: translation, zooming, rotation and flipping.
Specifically: images in the training set are randomly cropped to 448x448 with a random area ratio between 0.08 and 1.25, and the PIL image is horizontally flipped with a default probability of 0.5. In addition, the scaled image is converted from the input format HxWxC with pixel values in [0, 255] to the tensor format CxHxW with pixel values in [0, 1], where C denotes the number of channels and H and W denote the image height and width. The images of the CUB-200-2011 data set are RGB three-channel images, so C is 3. The pixel values are then normalized before being input into the network; the normalization formula is (channel - mean)/std, with per-channel means (0.485, 0.456, 0.406) and standard deviations (0.229, 0.224, 0.225). These preprocessing operations achieve data enhancement, making the network more robust to object features in different forms.
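The format conversion and normalization described above can be sketched as a minimal NumPy function (the random cropping and flipping augmentation steps are omitted here):

```python
import numpy as np

# Per-channel means and standard deviations from the text
MEAN = np.array([0.485, 0.456, 0.406])
STD = np.array([0.229, 0.224, 0.225])

def preprocess(img_hwc_uint8):
    """Convert an HxWxC uint8 image (pixel values 0-255) to a normalized
    CxHxW float array, applying (channel - mean) / std per channel."""
    x = img_hwc_uint8.astype(np.float64) / 255.0   # [0, 255] -> [0, 1]
    x = (x - MEAN) / STD                            # per-channel normalization
    return x.transpose(2, 0, 1)                     # HxWxC -> CxHxW

tensor = preprocess(np.full((448, 448, 3), 255, dtype=np.uint8))
```

In practice the same pipeline is typically expressed with torchvision transforms; the NumPy form above only makes the arithmetic explicit.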
Based on the established feature extraction network and the established data set, the feature extraction network is trained by using a training set, and a training result is tested and verified by using a test set.
Preferably, 150 epochs are trained on each data set, with a gradual warm-up strategy used for the first ten epochs of training. The batch size is set to 32 and the base learning rate to 0.01; a fixed-step learning rate decay strategy is adopted, reducing the learning rate to one tenth of its value every 20 epochs. Training uses an SGD optimizer; when the loss function converges and no longer decreases, the convolutional neural network model is saved, yielding a pre-trained convolutional neural network ready for testing.
The model performance is evaluated by the accuracy on the test set. The overall classification accuracy is Accuracy = TP/Total, where TP is the number of correctly classified pictures and Total is the total number of pictures.
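A minimal sketch of this accuracy computation over predicted and ground-truth labels:

```python
def accuracy(predictions, labels):
    """Accuracy = TP / Total, where TP is the number of correctly classified
    pictures and Total is the total number of pictures."""
    tp = sum(p == y for p, y in zip(predictions, labels))
    return tp / len(labels)
```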
Based on the trained feature extraction network, the data processing flow for classifying an image to be classified comprises the following steps:
step 1, receiving an image, and inputting the image into a first layer of a Resnet101 neural network as input data of the neural network.
Step 2: the output features of layer2 are input, as the response layer, into the PCFN module that follows it, i.e., into the channel-association branch and the spatial-association branch shown in fig. 3.
At the same time, the output features of layer4 are input into the same PCFN module as the query layer.
By associating the high-level and low-level features, the PCFN module semantically enhances the layer2 features and produces semantically enhanced output features.
Specifically, the PCFN obtains the semantically enhanced features as follows:
The PCFN has two input sources: the high-level feature and the low-level feature. The high-level (query) feature is denoted X_q and the low-level (response) feature is denoted X_r. The output feature X_p of the PCFN is:
X_p = Z_ch + Z_sp = A_ch(X_q, X_r) ⊙_ch X_r + A_sp(X_q, X_r) ⊙_sp X_r    (1)
wherein:
A_ch(X_q, X_r) = F_SG[ W_z|θ( (σ_1(W_v(X_r))) × F_SM(σ_2(W_q(F_US(X_q)))) ) ]
A_sp(X_q, X_r) = F_SG[ σ_3( F_SM(σ_1(F_GP(W_q(F_US(X_q))))) × σ_2(W_v(X_r)) ) ]
wherein W_q, W_v and W_z are 1x1 convolutional layers; σ_1, σ_2 and σ_3 are tensor reshaping operators; F_SM(·) is the SoftMax operator; "×" is the matrix dot-product operation; F_US is the upsampling operation applied to X_q so that it has the same spatial dimensions as X_r; F_GP(·) is the global pooling operator, F_GP(X) = (1/(H×W)) Σ_{i=1..H} Σ_{j=1..W} X(:, i, j); and F_SG(·) is the Sigmoid function.
A_ch(X_q, X_r) associates the channel operations of the query layer and the response layer to obtain the channel semantic-enhancement weights. ⊙_ch is a channel-level multiplication operator, and A_ch(X_q, X_r) ⊙_ch X_r yields the shallow features with enhanced channel semantics.
A_sp(X_q, X_r) associates the spatial operations of the query layer and the response layer; ⊙_sp is a spatial-level multiplication operator, and A_sp(X_q, X_r) ⊙_sp X_r yields the shallow features with enhanced spatial semantics.
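A numerical sketch of the channel and spatial branches above, assuming nearest-neighbour upsampling for F_US and treating the 1x1 convolutions W_q, W_v, W_z as plain channel-mixing matrices (both are simplifying assumptions, chosen to make the tensor algebra explicit):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def conv1x1(x, w):
    # A 1x1 convolution is a per-pixel linear map over channels.
    # x: (C_in, H, W), w: (C_out, C_in) -> (C_out, H, W)
    return np.einsum("oc,chw->ohw", w, x)

def upsample_nearest(x, h, w):
    # F_US: nearest-neighbour upsampling of X_q to X_r's spatial size
    c, h0, w0 = x.shape
    return x[:, np.arange(h) * h0 // h][:, :, np.arange(w) * w0 // w]

def pcfn(xq, xr, Wq_ch, Wv_ch, Wz, Wq_sp, Wv_sp):
    c, h, w = xr.shape
    xq = upsample_nearest(xq, h, w)                     # F_US(X_q)
    # --- channel branch: A_ch(X_q, X_r) applied channel-wise to X_r ---
    q = softmax(conv1x1(xq, Wq_ch).reshape(1, h * w))   # F_SM(sigma2(W_q(...))): 1xHW
    v = conv1x1(xr, Wv_ch).reshape(-1, h * w)           # sigma1(W_v(X_r)): C'xHW
    a_ch = sigmoid(Wz @ (v @ q.T)).reshape(c, 1, 1)     # F_SG(W_z(...)): Cx1x1
    # --- spatial branch: A_sp(X_q, X_r) applied spatially to X_r ---
    g = softmax(conv1x1(xq, Wq_sp).mean(axis=(1, 2)))   # F_SM(sigma1(F_GP(...))): C'
    v2 = conv1x1(xr, Wv_sp).reshape(-1, h * w)          # sigma2(W_v(X_r)): C'xHW
    a_sp = sigmoid(g[None, :] @ v2).reshape(1, h, w)    # F_SG(sigma3(...)): 1xHxW
    return a_ch * xr + a_sp * xr                        # X_p = Z_ch + Z_sp

C, Cp = 8, 4                               # response channels and inner channels
xq = rng.standard_normal((C, 2, 2))        # high-level (query) feature
xr = rng.standard_normal((C, 4, 4))        # low-level (response) feature
weights = [rng.standard_normal(s)
           for s in [(1, C), (Cp, C), (C, Cp), (Cp, C), (Cp, C)]]
xp = pcfn(xq, xr, *weights)
```

Because both attention maps pass through the Sigmoid, each output element is the response feature scaled by a factor in (0, 2), which keeps the semantic weighting bounded.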
And 3, inputting the semantically weighted output features of the step 2 into a dimension conversion module behind the PCFN module to obtain output features consistent with the output feature space and channel dimensions of layer 4.
And 4, performing feature fusion on the output features subjected to the dimension change operation in the step 3 and the layer4 output features in an element-wise add mode.
PCFN_bi(X_q, X_r) = X_q + W_z(F_MP(X_p) + W_d(X_p))    (2)
X_p is the feature of the response layer after the channel and spatial association with the query layer. F_MP(·) and W_z denote the spatial dimensionality-reduction and subsequent channel dimension-raising operations performed on this feature. The spatial reduction uses max pooling, which retains the key information of the response-layer features after semantic weighting by the query layer; to avoid neglecting local context, the channel-by-channel (depthwise) convolution W_d captures local context information.
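A sketch of equation (2)'s fusion step, assuming a single stride-2 reduction stage for illustration (the actual stride depends on which response layer is being fused) and random weights for the 1x1 convolution W_z and the depthwise kernel W_d:

```python
import numpy as np

rng = np.random.default_rng(1)

def maxpool2x2(x):
    # F_MP: 2x2 max pooling with stride 2 (spatial dimensionality reduction)
    c, h, w = x.shape
    return x.reshape(c, h // 2, 2, w // 2, 2).max(axis=(2, 4))

def depthwise3x3_stride2(x, k):
    # W_d: stride-2 depthwise 3x3 convolution (padding 1), capturing local context
    c, h, w = x.shape
    xp = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.empty((c, h // 2, w // 2))
    for i in range(h // 2):
        for j in range(w // 2):
            out[:, i, j] = (xp[:, 2*i:2*i+3, 2*j:2*j+3] * k).sum(axis=(1, 2))
    return out

def fuse(xq_deep, xp_feat, Wz, kd):
    # Eq. (2): PCFN_bi(X_q, X_r) = X_q + W_z(F_MP(X_p) + W_d(X_p))
    y = maxpool2x2(xp_feat) + depthwise3x3_stride2(xp_feat, kd)
    return xq_deep + np.einsum("oc,chw->ohw", Wz, y)   # W_z raises channels

xp_feat = rng.standard_normal((16, 8, 8))   # PCFN output X_p (response channels)
xq_deep = rng.standard_normal((64, 4, 4))   # layer4 (query) feature
fused = fuse(xq_deep, xp_feat,
             rng.standard_normal((64, 16)),      # W_z: 16 -> 64 channels
             rng.standard_normal((16, 3, 3)))    # W_d: depthwise 3x3 kernel
```

The max-pooling path and the depthwise path reduce X_p to the query feature's spatial size, W_z matches its channel count, and the residual addition of X_q completes the element-wise fusion.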
And 5, repeating the operation of the steps 2-4 on the output characteristics of the layer 3.
And 6, inputting the output characteristics of the last layer into a classification network.
And 7: and outputting the output characteristics of the classification network as a prediction result.
An embodiment of the present application further provides a fine-grained image classification apparatus, as shown in fig. 4, the fine-grained image classification apparatus 400 includes:
an image obtaining module 401, configured to obtain an image to be classified;
a feature extraction module 402, configured to input the image to be classified into a trained feature extraction network, so as to obtain a multi-scale feature of the image; the feature extraction network is a ResNet101 neural network inserted with a PCFN module;
the classification module 403 is configured to input the multi-scale features of the image into a classification network, so as to obtain a fine-grained classification result of the image to be classified.
For the device embodiments, since they substantially correspond to the method embodiments, reference may be made to the partial description of the method embodiments for relevant points. The above-described embodiments of the apparatus are merely illustrative, wherein the units described as separate parts may or may not be physically separate, and the parts shown as units may or may not be physical units. It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
An embodiment of the present application further provides an electronic device, including:
at least one memory and at least one processor;
the memory for storing one or more programs;
when executed by the at least one processor, cause the at least one processor to implement the steps of a fine-grained image classification method as described above.
For the apparatus embodiment, since it basically corresponds to the method embodiment, reference may be made to the partial description of the method embodiment for relevant points. The above-described device embodiments are merely illustrative, wherein the components described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the disclosed solution. One of ordinary skill in the art can understand and implement it without inventive effort.
Embodiments of the present application also provide a computer-readable storage medium,
the computer-readable storage medium stores a computer program which, when executed by a processor, implements the steps of a fine-grained image classification method as described above.
Computer-usable storage media include permanent and non-permanent, removable and non-removable media, and information storage may be implemented by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to: phase-change memory (PRAM), static random-access memory (SRAM), dynamic random-access memory (DRAM), other types of random-access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that may be used to store information accessible by a computing device.
The fine-grained image classification method, device, electronic equipment and storage medium described above learn more discriminative features by associating cross-layer features, combining the characteristics of features at different layers to obtain multi-scale features with greater expressive ability and thereby improving the classification effect. By building dependency relationships across the spatial and channel dimensions of multiple hierarchical features, more distinguishing features are learned, yielding features with better discriminability and expressiveness and improving classification performance. The invention provides a plug-and-play module, the Polarized Cross-layer Fusion Network (PCFN), which models high-resolution spatial and channel dependencies across layers with lower complexity than other methods while achieving state-of-the-art performance. The PCFN can adaptively attend to each part at multiple scales; by establishing high-level/low-level feature dependencies on a CNN backbone and extracting multi-scale features under the guidance of high-level semantics, it captures both high-level semantic and low-level detail information, enabling better classification and more accurate localization.
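The cross-layer weighting idea described above can be sketched as follows. This is a hypothetical minimal implementation, not the patented module itself: the internal projections, the choice of softmax/sigmoid, and the reduction ratio are all assumptions; only the overall scheme (a high-level query feature re-weighting a low-level response feature along both the channel and spatial dimensions) follows the description.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PCFNSketch(nn.Module):
    """Hypothetical sketch of a polarized cross-layer fusion module:
    a high-level (query) feature re-weights a low-level (response)
    feature along the channel and the spatial dimension."""
    def __init__(self, low_ch, high_ch, inner_ch=None):
        super().__init__()
        inner_ch = inner_ch or low_ch // 2  # reduction ratio is assumed
        # channel branch: 1x1 projections, softmax over spatial positions
        self.ch_q = nn.Conv2d(high_ch, 1, 1)
        self.ch_v = nn.Conv2d(low_ch, inner_ch, 1)
        self.ch_up = nn.Conv2d(inner_ch, low_ch, 1)
        # spatial branch: softmax over channels of the pooled query
        self.sp_q = nn.Conv2d(high_ch, inner_ch, 1)
        self.sp_v = nn.Conv2d(low_ch, inner_ch, 1)

    def forward(self, low, high):
        # upsample the high-level query to the low-level resolution
        high = F.interpolate(high, size=low.shape[-2:], mode='bilinear',
                             align_corners=False)
        b, c, h, w = low.shape
        # --- channel weighting ---
        q = self.ch_q(high).view(b, 1, h * w).softmax(dim=-1)     # (b,1,hw)
        v = self.ch_v(low).view(b, -1, h * w)                     # (b,ci,hw)
        ch_att = torch.bmm(v, q.transpose(1, 2)).unsqueeze(-1)    # (b,ci,1,1)
        ch_att = torch.sigmoid(self.ch_up(ch_att))                # (b,c,1,1)
        # --- spatial weighting ---
        q = self.sp_q(high).mean(dim=(2, 3))                      # (b,ci)
        q = q.softmax(dim=-1).unsqueeze(1)                        # (b,1,ci)
        v = self.sp_v(low).view(b, -1, h * w)                     # (b,ci,hw)
        sp_att = torch.sigmoid(torch.bmm(q, v)).view(b, 1, h, w)  # (b,1,h,w)
        # apply both weightings to the low-level feature
        return low * ch_att * sp_att
```

Both attention maps are broadcast over the low-level feature, so the module is shape-preserving and can be inserted after any backbone stage, which is what makes it plug-and-play.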
The above-mentioned embodiments express only several embodiments of the present invention, and their description is relatively specific and detailed, but this should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the inventive concept, all of which fall within the scope of the present invention.

Claims (10)

1. A fine-grained image classification method is characterized by comprising the following steps:
acquiring an image to be classified;
inputting the image to be classified into a trained feature extraction network to obtain the multi-scale features of the image; the feature extraction network is a ResNet101 neural network with a PCFN module inserted;
and inputting the multi-scale features of the image into a classification network to obtain a fine-grained classification result of the image to be classified.
2. The fine-grained image classification method according to claim 1, wherein inputting the image into the trained feature extraction network to obtain the multi-scale features of the image comprises:
inputting the image into the first layer of the ResNet101 neural network, and passing it sequentially through layer1, layer2, layer3 and layer4 for convolution processing;
inputting the output features of layer2 and layer3 into the PCFN module, respectively, to obtain semantically weighted output features;
inputting the semantically weighted output features of layer2 and layer3 into a dimension transformation module, respectively, to obtain output features consistent with the spatial and channel dimensions of the output features of layer 4;
and performing feature fusion on the output features of the dimension transformation module and the output features of layer4 to obtain the multi-scale features of the image.
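The forward pass of claim 2 can be sketched as follows. Stage channel widths and spatial sizes follow ResNet-101 on a 224x224 input (layer2: 512x28x28, layer3: 1024x14x14, layer4: 2048x7x7); `semantic_weight` is a crude stand-in for the PCFN weighting, and the 1x1-conv-plus-pooling dimension transform is an assumption about how the claimed module is built.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DimTransform(nn.Module):
    """Bring a mid-level feature to layer4's channel count and
    spatial size (1x1 conv + pooling); internals are assumed."""
    def __init__(self, in_ch, out_ch=2048, out_hw=7):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, out_ch, 1)
        self.out_hw = out_hw

    def forward(self, x):
        return F.adaptive_avg_pool2d(self.proj(x), self.out_hw)

# assumed stage outputs for one 224x224 image
f2 = torch.randn(1, 512, 28, 28)   # layer2
f3 = torch.randn(1, 1024, 14, 14)  # layer3
f4 = torch.randn(1, 2048, 7, 7)    # layer4

def semantic_weight(low, high):
    """Placeholder for the PCFN semantic weighting of a low-level
    feature by the layer4 feature (not the patented module)."""
    w = torch.sigmoid(F.interpolate(high.mean(1, keepdim=True),
                                    size=low.shape[-2:], mode='bilinear',
                                    align_corners=False))
    return low * w

t2 = DimTransform(512)(semantic_weight(f2, f4))
t3 = DimTransform(1024)(semantic_weight(f3, f4))
fused = t2 + t3 + f4   # element-wise add fusion, as in claim 4
print(fused.shape)     # torch.Size([1, 2048, 7, 7])
```

Because all three tensors are brought to layer4's shape first, the element-wise add keeps the fused multi-scale feature at 2048x7x7, ready for a standard classification head.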
3. The fine-grained image classification method according to claim 2, wherein inputting the output features of layer2 and layer3 into the PCFN module to obtain the semantically weighted output features comprises:
inputting the output feature of either one of layer2 and layer3, as the response layer of the PCFN module, into the channel-associated branch and the space-associated branch of the PCFN module;
performing an upsampling operation on the output feature of layer4 and inputting it, as the query layer of the PCFN module, into the channel-associated branch and the space-associated branch of the PCFN module;
performing a weighting operation on the channel-associated branch and the space-associated branch to obtain the semantic weighting information, in both the channel and spatial dimensions, of the output feature of layer4 with respect to the output feature of either one of layer2 and layer3;
and outputting the weighted information to obtain the semantically weighted output feature corresponding to either one of the output features of layer2 and layer 3.
4. The fine-grained image classification method according to claim 3, wherein the feature fusion of the output features of the dimension transformation module and the output features of layer4 comprises:
and performing feature fusion on the output features of the dimension transformation module and the output features of layer4 in an element-wise add mode.
5. The fine-grained image classification method according to claim 1, wherein the training of the feature extraction network comprises:
acquiring a training data set of the feature extraction network, and dividing the training data set into a training set and a test set;
training the feature extraction network using the training set, learning with an SGD optimizer; as errors back-propagate, the loss function decreases and the training accuracy increases, and a pre-trained feature extraction network is obtained when the loss function has converged and no longer decreases;
and testing the pre-trained feature extraction network by using the test set, and when the accuracy of the test result reaches a preset threshold value, storing the pre-trained feature extraction network to obtain the trained feature extraction network.
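The train/test procedure of claim 5 can be sketched as below. The dataset and model here are toy stand-ins (a random tensor dataset and a linear layer, so the snippet is self-contained); the SGD momentum value and the accuracy threshold are hypothetical, as the patent does not specify them.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset, random_split

# toy stand-ins for the real fine-grained dataset and network
data = TensorDataset(torch.randn(64, 16), torch.randint(0, 4, (64,)))
train_set, test_set = random_split(data, [48, 16])
model = nn.Linear(16, 4)

opt = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(3):  # the patent trains 150 epochs
    for x, y in DataLoader(train_set, batch_size=32, shuffle=True):
        opt.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()   # errors back-propagate; the loss decreases
        opt.step()

# test the pre-trained network on the held-out split
model.eval()
correct = total = 0
with torch.no_grad():
    for x, y in DataLoader(test_set, batch_size=32):
        correct += (model(x).argmax(1) == y).sum().item()
        total += y.numel()
acc = correct / total
# save when accuracy clears the preset threshold (value hypothetical):
# if acc >= threshold: torch.save(model.state_dict(), "extractor.pt")
```

In the real setting, `model` would be the ResNet101-with-PCFN network and `data` the fine-grained image dataset; the control flow is unchanged.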
6. The fine-grained image classification method according to claim 5, wherein training the parameters of the feature extraction network using the training set comprises:
training 150 epochs on each data set, with a gradual warm-up strategy used for the first ten epochs of training;
setting the batch size to 32 and the base learning rate to 0.01;
and adopting a fixed-step learning rate decay strategy, reducing the learning rate to one tenth of its value every 20 epochs.
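The schedule of claim 6 can be written as a per-epoch learning-rate function. The linear shape of the warm-up and the point at which the 20-epoch decay counter starts (here, after warm-up ends) are assumptions, since the claim fixes only the warm-up length, the base rate, and the decay factor.

```python
def learning_rate(epoch, base_lr=0.01, warmup_epochs=10, step=20):
    """LR for a given epoch: linear warm-up over the first ten
    epochs, then divide by 10 every 20 epochs (claim 6)."""
    if epoch < warmup_epochs:
        # ramp linearly from base_lr/warmup_epochs up to base_lr
        return base_lr * (epoch + 1) / warmup_epochs
    # fixed-step decay, counted from the end of warm-up (assumed)
    return base_lr * 0.1 ** ((epoch - warmup_epochs) // step)
```

For example, epochs 0-9 ramp from 0.001 to 0.01, epochs 10-29 run at 0.01, and epoch 30 drops to 0.001.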
7. The fine-grained image classification method according to claim 5, wherein obtaining the training data set of the feature extraction network further comprises:
preprocessing the training data set to obtain a data-enhanced training data set;
wherein the preprocessing operations comprise: translation, zooming, rotation, and flipping.
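The four preprocessing operations of claim 7 can be illustrated with the toy function below, which works on a (C, H, W) tensor using only basic tensor ops. The probabilities, shift magnitudes, and crop ratio are all hypothetical; a real pipeline would use a library such as torchvision, and a true small-angle rotation rather than the 90-degree stand-in used here.

```python
import random
import torch

def augment(img):
    """Toy versions of the four preprocessing ops in claim 7
    (flip, rotation, translation, zoom) on a square (C,H,W) tensor."""
    # flip: mirror horizontally half the time
    if random.random() < 0.5:
        img = torch.flip(img, dims=[2])
    # rotation: 90-degree turn as a stand-in for small-angle rotation
    if random.random() < 0.5:
        img = torch.rot90(img, 1, dims=[1, 2])
    # translation: roll pixels by a few positions in each axis
    img = torch.roll(img, shifts=(random.randint(-4, 4),
                                  random.randint(-4, 4)), dims=(1, 2))
    # zoom: center-crop, then resize back to the original size
    c, h, w = img.shape
    crop = img[:, h // 8: h - h // 8, w // 8: w - w // 8]
    img = torch.nn.functional.interpolate(
        crop.unsqueeze(0), size=(h, w), mode='bilinear',
        align_corners=False).squeeze(0)
    return img
```

Each call produces a differently perturbed copy of the input, which is the data-enhancement effect the claim relies on to enlarge the training set.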
8. A fine-grained image classification device characterized by comprising:
the image acquisition module is used for acquiring images to be classified;
the feature extraction module is used for inputting the image to be classified into a trained feature extraction network to obtain the multi-scale features of the image; the feature extraction network is a ResNet101 neural network with a PCFN module inserted;
and the classification module is used for inputting the multi-scale features of the image into a classification network to obtain a fine-grained classification result of the image to be classified.
9. An electronic device, comprising:
at least one memory and at least one processor;
the memory to store one or more programs;
when the one or more programs are executed by the at least one processor, the at least one processor is caused to perform the steps of the fine-grained image classification method as claimed in any one of claims 1 to 7.
10. A computer-readable storage medium characterized by:
the computer-readable storage medium stores a computer program which, when executed by a processor, implements the steps of a fine-grained image classification method as claimed in any one of claims 1 to 7.
CN202210318057.0A 2022-03-29 2022-03-29 Fine-grained image classification method and device, electronic equipment and storage medium Pending CN114692750A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210318057.0A CN114692750A (en) 2022-03-29 2022-03-29 Fine-grained image classification method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114692750A true CN114692750A (en) 2022-07-01

Family

ID=82140503

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210318057.0A Pending CN114692750A (en) 2022-03-29 2022-03-29 Fine-grained image classification method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114692750A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116452896A (en) * 2023-06-16 2023-07-18 中国科学技术大学 Method, system, device and medium for improving fine-grained image classification performance
CN116452896B (en) * 2023-06-16 2023-10-20 中国科学技术大学 Method, system, device and medium for improving fine-grained image classification performance
CN116608866A (en) * 2023-07-20 2023-08-18 华南理工大学 Picture navigation method, device and medium based on multi-scale fine granularity feature fusion
CN116608866B (en) * 2023-07-20 2023-09-26 华南理工大学 Picture navigation method, device and medium based on multi-scale fine granularity feature fusion

Similar Documents

Publication Publication Date Title
CN110555399B (en) Finger vein identification method and device, computer equipment and readable storage medium
CN111027576B (en) Cooperative significance detection method based on cooperative significance generation type countermeasure network
CN110503076B (en) Video classification method, device, equipment and medium based on artificial intelligence
Goh et al. Food-image Classification Using Neural Network Model
CN113177559B (en) Image recognition method, system, equipment and medium combining breadth and dense convolutional neural network
CN114692750A (en) Fine-grained image classification method and device, electronic equipment and storage medium
CN110321537A (en) A kind of official documents and correspondence generation method and device
CN110619347A (en) Image generation method based on machine learning and method thereof
CN114897136B (en) Multi-scale attention mechanism method and module and image processing method and device
CN112927209A (en) CNN-based significance detection system and method
CN115761366A (en) Zero sample picture classification method, system, device and medium for supplementing missing features
CN112749737A (en) Image classification method and device, electronic equipment and storage medium
CN117557872B (en) Unsupervised anomaly detection method and device for optimizing storage mode
CN117235605B (en) Sensitive information classification method and device based on multi-mode attention fusion
Dong et al. Training inter-related classifiers for automatic image classification and annotation
CN110197213A (en) Image matching method, device and equipment neural network based
CN116578738B (en) Graph-text retrieval method and device based on graph attention and generating countermeasure network
CN116883726B (en) Hyperspectral image classification method and system based on multi-branch and improved Dense2Net
CN116257609A (en) Cross-modal retrieval method and system based on multi-scale text alignment
CN117011219A (en) Method, apparatus, device, storage medium and program product for detecting quality of article
CN113343019B (en) Small sample silk fabric image retrieval method combining shallow layer and deep layer features
EP4288910A1 (en) Continual learning neural network system training for classification type tasks
CN111708745A (en) Cross-media data sharing representation method and user behavior analysis method and system
Kumar et al. Image classification in python using Keras
CN116797811A (en) Image detection method, device, equipment and computer storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination