CN110619369B - Fine-grained image classification method based on feature pyramid and global average pooling - Google Patents

Fine-grained image classification method based on feature pyramid and global average pooling

Info

Publication number
CN110619369B
CN110619369B
Authority
CN
China
Prior art keywords: feature, local, average pooling, global average, fine
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910899445.0A
Other languages
Chinese (zh)
Other versions
CN110619369A (en)
Inventor
龚声蓉
周少雄
王朝晖
应文豪
李菊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangsu Yiyou Huiyun Software Co.,Ltd.
Original Assignee
Changshu Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Changshu Institute of Technology filed Critical Changshu Institute of Technology
Priority to CN201910899445.0A
Publication of CN110619369A
Application granted
Publication of CN110619369B
Legal status: Active
Anticipated expiration

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/46 Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V10/462 Salient features, e.g. scale invariant feature transforms [SIFT]

Abstract

The invention discloses a fine-grained image classification method based on a feature pyramid and global average pooling, which comprises the following steps: step 1, inputting an image into the convolution layers of a pre-trained convolutional neural network to obtain a multi-channel feature map; step 2, passing the multi-channel feature map through a global average pooling layer to obtain a saliency map of the input image and extract the position information of the target; step 3, extracting features from the multi-channel feature map with a feature pyramid network and predicting the K local regions carrying the largest amount of information; and step 4, aggregating the local features of the K local regions with the global features obtained from the input image by the convolutional neural network to predict and output the final recognition category. The method reduces the influence of background noise, enhances the robustness of local region selection, and improves recognition accuracy.

Description

Fine-grained image classification method based on feature pyramid and global average pooling
Technical Field
The invention relates to a fine-grained image classification method, in particular to a fine-grained image classification method based on a feature pyramid and global average pooling.
Background
Fine-grained image recognition is a concept in the field of image processing. Conventional image recognition can generally only identify the broad class to which an object in an image belongs, which is called coarse-grained image recognition. A broad class usually contains many subcategories, and conventional image recognition methods cannot determine the specific subcategory to which the target belongs. Fine-grained image recognition classifies the targets in an image at a finer granularity and determines the specific subcategory within the broad class, meeting the higher image recognition requirements of different scenarios.
Early fine-grained classification methods generally relied on manual experience to design features by hand: local features such as SIFT or HOG are extracted from the image, encoded with models such as VLAD or Fisher Vector to obtain the required representation, and then classified with a shallow neural network or an SVM. The generalization ability of such models is poor.
Fine-grained image classification methods based on deep learning can be divided into two categories, strongly supervised and weakly supervised, the difference being whether manual annotation information such as bounding boxes or part annotations is used. Such methods generally proceed in three steps: first, the foreground object and several local regions in the image are obtained using annotation information or visual attention; then a deep convolutional network extracts convolutional features from each of them; finally, the features of all local regions are integrated to classify the target. Strongly supervised classification methods have poor practicability because manual annotation is costly to obtain, making it difficult to meet the demands of real applications.
Most existing fine-grained recognition methods work under weak supervision, i.e., without relying on manual annotation, but accurately locating objects and their local regions in images then becomes difficult. In real scenes the target is not necessarily centered, the surroundings may occlude it or resemble it in color, and different shooting angles or changes in the posture of the target object can cause large visual differences between images of the same category. Two problems in particular exist:
1. The selected local regions contain too much background noise. Targets are generally embedded in complex environments; in a bird recognition task, for example, the bird is usually perched among branches and heavily occluded, or leaves and trunks have an appearance similar to the target's, easily causing strong interference. Most existing methods feed the whole image directly into the model to extract features, but visual experiments show that the local regions obtained this way generally contain considerable background noise. Features extracted from these noisy regions do not belong to the target, so the classification result is often affected and the model's fine-grained recognition performance is reduced. Some methods instead extract several highly distinctive regions from the original image in an unsupervised way, e.g., with Selective Search, and then feed them into the network model for training and feature extraction.
2. The features are not robust enough. Compared with ordinary image recognition, fine-grained recognition is special: its subcategories typically have small inter-class differences, and these differences usually exist in small local regions. Current methods do not extract sufficiently robust features from the target object. Traditional hand-crafted features require expert experience to design, are unstable, and struggle to express the discriminative information in an image effectively; such methods generalize poorly, and their performance drops sharply when the operating domain changes, greatly reducing practicability. Features produced by existing deep-learning methods are mostly not targeted enough for this task: they usually extract features directly with deep networks such as VGGNet or ResNet, which works well for the target's global features but represents detailed information poorly. Since the differences between fine-grained images often lie in tiny details, the recognition performance suffers; and when the size of the target in the image varies greatly, robust features cannot be extracted adaptively, so good results cannot be achieved.
Disclosure of Invention
In view of the above defects in the prior art, the present invention provides a fine-grained image classification method based on a feature pyramid and global average pooling that reduces the noise in the target localization region at low computational overhead and improves the robustness of the features extracted from the target object.
The technical scheme of the invention is as follows: a fine-grained image classification method based on a feature pyramid and global average pooling comprises the following steps:
step 1, inputting an image into the convolution layers of a pre-trained convolutional neural network to obtain a multi-channel feature map;
step 2, passing the multi-channel feature map through a global average pooling layer to obtain a saliency map of the input image and extract the position information of the target;
step 3, extracting features from the multi-channel feature map with a feature pyramid network and predicting the K local regions carrying the largest amount of information;
and step 4, aggregating the local features of the K local regions with the global features obtained from the input image by the convolutional neural network to predict and output the final recognition category.
Further, the step 2 comprises the following steps: step 2.1, the global average pooling layer maps each feature map into a neuron, which is connected to a softmax layer for training, and the category is predicted; step 2.2, after training is finished, the weights of the most probable category corresponding to the neurons are multiplied with the respective channels of the multi-channel feature map and accumulated to obtain the saliency map.
Further, the step 3 comprises the following steps: step 3.1, inputting the feature map into a feature pyramid network to generate feature maps at N scales, where N is a natural number not less than 3; step 3.2, upsampling each upper-layer feature map obtained in step 3.1 and fusing it with the lower-layer feature map after a convolution kernel, obtaining fused feature maps at N scales; and step 3.3, selecting candidate regions of different sizes on the N-scale fused feature maps, filtering them with the bounding box generated in step 2, predicting the candidate regions, and ranking them by the activation value of the bounding box to obtain the local regions, wherein the target bounding box is generated by taking the maximum connected region in the saliency map and applying a threshold to obtain the specific position of the target.
Further, the K local regions with the largest amount of information predicted in step 3 are optimized with the ranking consistency loss, so that the local region classification prediction results and the activation values obtained by the feature pyramid network have the same ranking.
Further, the optimization with the ranking consistency loss uses a hinge loss function. Let the K local regions be R = {R_1, R_2, ..., R_K}, ranked from high to low by activation value; let the activation values predicted for them by the feature pyramid network be S = {S_1, S_2, ..., S_K}; and let the probabilities predicted for them by the convolutional neural network be P = {P_1, P_2, ..., P_K}. The ranking loss is defined as follows:

L_rank(S, P) = Σ_{(i,j): S_i > S_j} f(P_i - P_j)

where S_i and S_j are activation values, and the hinge loss function f(x) is f(x) = max{1 - x, 0}.
Compared with the prior art, the invention has the advantages that:
the invention reserves all convolution layers of the convolution neural network and replaces the last full connection layer with a full local average pooling layer (GAP), so that the network obtains excellent target positioning capability. And mapping each feature map of the last convolution layer into a neuron after the feature maps pass through the GAP, connecting the neurons with a softmax classification layer to obtain the output probability of each category, and adding the convolution layer feature maps according to the weights of the neurons of the corresponding categories to obtain the saliency maps corresponding to each category. After the saliency map is obtained, a saliency threshold is set to generate a bounding box of the target. Local region candidates for the target are then performed within this bounding box, which greatly reduces the interference of background noise on feature extraction and model classification. And the proposed method shares the convolution layer with the original feature extraction network, only one GAP layer is added, and only little calculation expense is added.
Features are extracted with a feature pyramid network. The constructed feature pyramid fuses the high-level features of low resolution but high semantics with the low-level features of high resolution but low semantics, obtaining feature maps that are both highly semantic and high-resolution; prediction is performed on the fused feature maps at multiple scales. The model thereby greatly strengthens its handling of small targets in the image at essentially no increase in computation, further improving the precision of fine-grained image recognition.
Drawings
FIG. 1 is a schematic flow chart of the overall framework of the method of the present invention.
FIG. 2 is a flow diagram illustrating how the target saliency map is obtained using global average pooling.
Fig. 3 is a schematic diagram of a feature pyramid structure.
FIG. 4 is a diagram illustrating the target localization results of the present invention on the CUB-200-2011 dataset.
Detailed Description
The present invention is further illustrated by the following examples, which are not to be construed as limiting the invention thereto.
The general framework of the fine-grained image classification method based on the feature pyramid and the global average pooling according to the present embodiment is shown in fig. 1. The method comprises the following specific steps:
step 1, inputting an image into the convolution layers of a pre-trained convolutional neural network to obtain a multi-channel feature map;
step 2, passing the multi-channel feature map through a global average pooling layer to obtain a saliency map of the input image and extract the position information of the target;
step 3, extracting features from the multi-channel feature map with a feature pyramid network and predicting the K local regions carrying the largest amount of information;
and step 4, aggregating the local features of the K local regions with the global features obtained from the input image by the convolutional neural network to predict and output the final recognition category.
In step 2, the fully connected layer of the base network ResNet-50 is replaced by a global average pooling layer and all convolution layers are retained; the class of the image is preliminarily predicted over the ImageNet-1k categories, and a saliency map is obtained by the class activation mapping method. The saliency map displays the position of the target in the image in the form of activation values: the higher the activation value, the more likely the target is contained there. The specific position of the target is obtained by applying a threshold to the maximum connected region in the saliency map, generating the target bounding box; the resulting bounding-box region contains little background noise. The bounding box obtained in this way is further used in step 3 to filter the candidate local regions. The procedure for obtaining the saliency map is shown in fig. 2, and minimal sketches of the mapping and of the box extraction are given after the sub-steps below. Step 2 may further comprise the steps of:
step 2.1, the multi-channel feature map passes through the global average pooling layer, which maps each feature map into a neuron; the neurons are connected to a softmax layer for training, and the category is predicted;
and step 2.2, after training is finished, the weights of the most probable category corresponding to the neurons are multiplied with the respective channels of the multi-channel feature map and accumulated to obtain the saliency map of the target.
The feature pyramid exploits the pyramidal feature hierarchy of the convolutional network: the low-resolution, high-semantic features at the top of the pyramid are fused with the high-resolution, low-semantic features at the bottom, yielding features that carry high semantic information while retaining more detail, and local regions are predicted independently on the feature maps of the different scales. The structure of the feature pyramid in step 3 is shown in fig. 3, and step 3 further comprises the following steps:
step 3.1, inputting the feature map into the feature pyramid network to generate feature maps at three scales;
step 3.2, upsampling each upper-layer feature map by a factor of two and fusing it with the lower-layer feature map after a 1 × 1 convolution kernel, obtaining fused feature maps at three scales (see the fusion sketch after this list);
and step 3.3, selecting candidate regions of different sizes on the three-scale fused feature maps, filtering them with the bounding box generated in step 2, predicting them, and ranking them by activation value.
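A sketch of the top-down fusion of steps 3.1-3.2, assuming the three feature maps are the C3-C5 stage outputs of ResNet-50 (512, 1024 and 2048 channels); the 256-channel width and nearest-neighbor upsampling are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopDownFusion(nn.Module):
    """Fuse low-resolution/high-semantic maps with high-resolution/low-semantic
    maps: each upper map is upsampled x2 and added to the next lower map after
    a 1 x 1 convolution (steps 3.1-3.2); prediction then runs per scale."""
    def __init__(self, in_channels=(512, 1024, 2048), width=256):
        super().__init__()
        self.lateral = nn.ModuleList(
            [nn.Conv2d(c, width, kernel_size=1) for c in in_channels])

    def forward(self, c3, c4, c5):
        p5 = self.lateral[2](c5)
        p4 = self.lateral[1](c4) + F.interpolate(p5, scale_factor=2, mode='nearest')
        p3 = self.lateral[0](c3) + F.interpolate(p4, scale_factor=2, mode='nearest')
        return p3, p4, p5      # fused feature maps at three scales
```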
At this point, each image has a set of candidate local regions and their activation values, obtained by filtering with the feature pyramid network and the saliency extraction network. The K local regions with the highest activation values are selected, scaled to 224 × 224, fed into the ResNet-50 model again for feature extraction, and finally classified by a fully connected layer (a sketch of this selection follows). To optimize the selected local regions, the method uses the ranking consistency loss, so that the classification predictions of the local regions have the same ranking as the activation values obtained from the feature pyramid network, making the selected regions maximally discriminative. A hinge loss function is introduced to optimize the model parameters and select the optimal local regions.
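A hedged sketch of this top-K selection and rescaling, assuming candidate boxes in (x1, y1, x2, y2) pixel coordinates and K = 4; the patent text fixes neither the box format nor K here.

```python
import torch
import torch.nn.functional as F

def extract_topk_regions(image, boxes, scores, k=4, size=224):
    """Select the K candidate regions with the highest activation values and
    rescale each crop to size x size for the second ResNet-50 pass.
    image: (3, H, W); boxes: (N, 4); scores: (N,) activation values."""
    topk = scores.topk(min(k, scores.numel())).indices
    crops = []
    for x1, y1, x2, y2 in boxes[topk].round().long().tolist():
        crop = image[:, y1:y2, x1:x2].unsqueeze(0)     # (1, 3, h, w) crop
        crops.append(F.interpolate(crop, size=(size, size),
                                   mode='bilinear', align_corners=False))
    return torch.cat(crops)                            # (K, 3, 224, 224)
```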
Let the K local regions be R = {R_1, R_2, ..., R_K}, ranked from high to low by activation value, and let the activation values obtained by feature pyramid network prediction be S = {S_1, S_2, ..., S_K}. The probabilities obtained for the K local regions by the ResNet-50 network are P = {P_1, P_2, ..., P_K}. The hinge loss function can be viewed as a pairwise ranking loss: it requires that for elements S_i and S_j of S, if S_i > S_j then the same order P_i > P_j also holds in P, otherwise a penalty is imposed. The ranking loss in this method is defined as follows:
L_rank(S, P) = Σ_{(i,j): S_i > S_j} f(P_i - P_j)
wherein the hinge loss function f (x) is defined as:
f(x)=max{1-x,0}
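The ranking loss above can be realized, for instance, as the following PyTorch sketch over all pairs, assuming S and P are length-K tensors; the pairwise-matrix formulation is one possible implementation, not necessarily the patent's own code.

```python
import torch

def ranking_consistency_loss(S: torch.Tensor, P: torch.Tensor) -> torch.Tensor:
    """For every pair with S_i > S_j, penalize f(P_i - P_j) with the margin-1
    hinge f(x) = max{1 - x, 0}, so that P preserves the ordering of S.
    S: (K,) activation values from the feature pyramid network.
    P: (K,) probabilities from the ResNet-50 pass over the local regions."""
    diff_P = P.unsqueeze(1) - P.unsqueeze(0)    # (K, K) matrix of P_i - P_j
    higher = S.unsqueeze(1) > S.unsqueeze(0)    # mask of pairs with S_i > S_j
    hinge = torch.clamp(1.0 - diff_P, min=0.0)  # f(x) = max{1 - x, 0}
    return hinge[higher].sum()
```

During training this term is added to the classification losses described next.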
For model training, the model parameters are optimized using as the total loss the sum of the ranking loss of the K local regions, the classification loss of the K local regions on ResNet-50, and the classification loss of the input image. ResNet-50 is used as the base network throughout, with shared parameters. At test time, the predicted category of each input image is obtained from the classification results of the input image and of the K local regions on ResNet-50.
The demonstration experiments of the invention use the following datasets: CUB-200-2011 and Stanford Cars.
CUB-200-2011 is a bird dataset and currently the most common and classical dataset in the field of fine-grained image recognition. It contains 11788 bird images in 200 categories: 5994 training images and 5794 test images, i.e., approximately 30 training images and 11-30 test images per bird species.
Stanford Cars is a vehicle dataset from Professor Fei-Fei Li's group at Stanford University, USA, and is one of the most commonly used datasets for fine-grained image recognition. It contains 16185 vehicle images divided into 196 categories by brand, year and model: 8144 training images and 8041 test images, with each vehicle type on average comprising 24-81 training images and 24-83 test images. The details of the above datasets are given in the following table:
Dataset         Categories   Training images   Test images   Total images
CUB-200-2011    200          5994              5794          11788
Stanford Cars   196          8144              8041          16185
In addition, the experimental hardware environment: Ubuntu 16.04, Tesla P100 graphics card with 12 GB of video memory, Core(TM) i7 processor at 3.4 GHz, 16 GB of RAM.
The code running environment is as follows: deep learning framework PyTorch 0.4.1, Python 3.5.
The experimental results are as follows:
Accuracy was selected as the evaluation index. The different methods compared are trained and evaluated under the same experimental environment.
The deeper ResNet-50 network is used as the backbone. It is pre-trained on the ImageNet-1k dataset, which saves a large amount of initial parameter training time and reduces model overfitting. During training, SGD is used as the optimizer and the learning rate follows a multi-step schedule: the initial learning rate is 0.001 and drops to 1/10 of its value after the 60th and 100th iterations. The weight decay is set to 10^-4, the momentum to 0.9, and the training batch size to 16. Cross-entropy loss is used as the classification loss function. Images in the dataset are pre-cropped to 448 × 448.
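A sketch of this training configuration, written against a current PyTorch/torchvision API rather than the PyTorch 0.4.1 used in the experiments; treating the 60/100 milestones as epochs is an assumption where the text says iterations.

```python
import torch.nn as nn
import torch.optim as optim
from torchvision.models import resnet50

model = resnet50(pretrained=True)                     # pre-trained on ImageNet-1k
criterion = nn.CrossEntropyLoss()                     # cross-entropy classification loss
optimizer = optim.SGD(model.parameters(), lr=0.001,   # initial learning rate 0.001
                      momentum=0.9, weight_decay=1e-4)
# multi-step schedule: learning rate drops to 1/10 after milestones 60 and 100
scheduler = optim.lr_scheduler.MultiStepLR(optimizer, milestones=[60, 100], gamma=0.1)
# training batch size 16; inputs pre-cropped to 448 x 448 as stated above
```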
To verify the effectiveness of the target localization method presented herein, experiments were first performed on the CUB-200-2011 bird dataset. This dataset was chosen because the environments of bird targets are generally complex: besides the targets being small, the birds appear in varied postures such as flying, perching on trees, or swimming, so occlusion, posture change, and similar-colored backgrounds cause strong interference, making accurate localization harder than on the vehicle dataset Stanford Cars. The target localization results obtained by the method of the invention are shown in fig. 4. The first row shows the original images resized to 448 × 448, the second row the obtained saliency maps, and the last row the generated target bounding boxes. In the first column the target is located among a large number of branches; in the third column the color of the tree is very similar to that of the target's body; both cause strong interference. It can be seen that the target saliency maps and bounding boxes obtained by the method of the invention are accurate.
In addition, the method of the present invention was verified on the CUB-200-2011 and Stanford Cars datasets. The ResNet-50 convolutional neural network is used as the base network of the model. ResNet-50 has 50 convolutional layers; its residual modules adopt a bottleneck structure with skip connections, giving stronger feature extraction than VGGNet. Because fine-grained image datasets are generally small, direct training easily overfits and degrades performance, so the model is pre-trained on the large-scale ImageNet-1k dataset, which accelerates early training and makes the model less likely to fall into a poor local optimum.
To improve practicability, the method uses no additional annotation information: target localization is achieved under weak supervision by global average pooling, yielding the bounding box of the target object. To improve the model's representation of local detail, a feature pyramid network fuses the feature maps output by the ResNet-50 network. After the activation values of the candidate regions are obtained, the K most activated regions are selected and fed into the ResNet-50 network again for category prediction. Redundant local regions are then removed with an NMS algorithm, the ranking consistency loss of the local regions is computed to optimize their selection, and the selected local regions are finally combined with the whole-image classification result for prediction. The results on the CUB-200-2011 and Stanford Cars datasets are shown in Table 1. The recognition accuracy of the method on both datasets is higher than that of several currently popular methods, and on the CUB-200-2011 dataset in particular it has a clear advantage over other methods.
Table 1. Results on the CUB-200-2011 and Stanford Cars datasets
With the global average pooling method, the target saliency map is obtained well and the target position is determined, so the local regions selected in the next step contain less background noise and the computational cost is reduced. More robust features are then extracted with the feature pyramid network: this module hierarchically fuses the multi-scale features, strengthening the semantics of the low-level features so that the network model can capture more detailed information and find more discriminative local regions, which finally improves the recognition performance of the model. The quantitative experimental results on the CUB-200-2011 and Stanford Cars datasets demonstrate the effectiveness of the method.

Claims (4)

1. A fine-grained image classification method based on a feature pyramid and global average pooling is characterized by comprising the following steps:
step 1, inputting an image into the convolution layers of a pre-trained convolutional neural network to obtain a multi-channel feature map;
step 2, passing the multi-channel feature map through a global average pooling layer to obtain a saliency map of the input image and extract the position information of the target;
step 3, extracting features from the multi-channel feature map with a feature pyramid network and predicting the K local regions carrying the largest amount of information; the step 3 comprises the following steps: step 3.1, inputting the feature map into a feature pyramid network to generate feature maps at N scales, where N is a natural number not less than 3; step 3.2, upsampling each upper-layer feature map obtained in step 3.1 and fusing it with the lower-layer feature map after a convolution kernel, obtaining fused feature maps at N scales; step 3.3, selecting candidate regions of different sizes on the N-scale fused feature maps, filtering them with the bounding box generated in step 2, predicting them and ranking them by the activation value of the bounding box to obtain the local regions, wherein the bounding box is generated by taking the maximum connected region in the saliency map and applying a threshold to obtain the specific position of the target;
and step 4, aggregating the local features of the K local regions with the global features obtained from the input image by the convolutional neural network to predict and output the final recognition category.
2. The fine-grained image classification method based on the feature pyramid and global average pooling according to claim 1, wherein the step 2 comprises the steps of: step 2.1, the global average pooling layer maps each feature map into a neuron, which is connected to a softmax layer for training, and the category is predicted; step 2.2, after training is finished, the weights of the most probable category corresponding to the neurons are multiplied with the respective channels of the multi-channel feature map and accumulated to obtain the saliency map.
3. The fine-grained image classification method based on the feature pyramid and global average pooling according to claim 1, wherein the K local regions with the largest amount of information predicted in step 3 are optimized with the ranking consistency loss, so that the local region classification prediction results and the activation values obtained by the feature pyramid network have the same ranking.
4. The fine-grained image classification method based on the feature pyramid and global average pooling according to claim 3, wherein the optimization with the ranking consistency loss uses a hinge loss function: the K local regions are R = {R_1, R_2, ..., R_K}, ranked from high to low by activation value; the activation values obtained for the K local regions by feature pyramid network prediction are S = {S_1, S_2, ..., S_K}; and the probabilities obtained for the K local regions by convolutional neural network prediction are P = {P_1, P_2, ..., P_K}; the ranking loss is defined as follows:

L_rank(S, P) = Σ_{(i,j): S_i > S_j} f(P_i - P_j)

where S_i and S_j are activation values, and the hinge loss function f(x) is f(x) = max{1 - x, 0}.
CN201910899445.0A 2019-09-23 2019-09-23 Fine-grained image classification method based on feature pyramid and global average pooling Active CN110619369B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910899445.0A CN110619369B (en) 2019-09-23 2019-09-23 Fine-grained image classification method based on feature pyramid and global average pooling

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910899445.0A CN110619369B (en) 2019-09-23 2019-09-23 Fine-grained image classification method based on feature pyramid and global average pooling

Publications (2)

Publication Number Publication Date
CN110619369A CN110619369A (en) 2019-12-27
CN110619369B (en) 2020-12-11

Family

ID=68923922

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910899445.0A Active CN110619369B (en) 2019-09-23 2019-09-23 Fine-grained image classification method based on feature pyramid and global average pooling

Country Status (1)

Country Link
CN (1) CN110619369B (en)

Families Citing this family (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111291767B (en) * 2020-02-12 2023-04-28 中山大学 Fine granularity identification method, terminal equipment and computer readable storage medium
CN111340046A (en) * 2020-02-18 2020-06-26 上海理工大学 Visual saliency detection method based on feature pyramid network and channel attention
CN111291819B (en) * 2020-02-19 2023-09-15 腾讯科技(深圳)有限公司 Image recognition method, device, electronic equipment and storage medium
CN113361529A (en) * 2020-03-03 2021-09-07 北京四维图新科技股份有限公司 Image semantic segmentation method and device, electronic equipment and storage medium
CN111461181B (en) * 2020-03-16 2021-09-07 北京邮电大学 Vehicle fine-grained classification method and device
CN111507215B (en) * 2020-04-08 2022-01-28 常熟理工学院 Video target segmentation method based on space-time convolution cyclic neural network and cavity convolution
CN111428689B (en) * 2020-04-20 2022-07-01 重庆邮电大学 Face image feature extraction method based on multi-pool information fusion
CN111832573B (en) * 2020-06-12 2022-04-15 桂林电子科技大学 Image emotion classification method based on class activation mapping and visual saliency
CN112016617B (en) * 2020-08-27 2023-12-01 中国平安财产保险股份有限公司 Fine granularity classification method, apparatus and computer readable storage medium
CN111985572B (en) * 2020-08-27 2022-03-25 中国科学院自动化研究所 Fine-grained image identification method of channel attention mechanism based on feature comparison
CN112215239A (en) * 2020-09-15 2021-01-12 浙江工业大学 Retinal lesion fine-grained grading method and device
CN112257758A (en) * 2020-09-27 2021-01-22 浙江大华技术股份有限公司 Fine-grained image recognition method, convolutional neural network and training method thereof
CN112528058B (en) * 2020-11-23 2022-09-02 西北工业大学 Fine-grained image classification method based on image attribute active learning
CN112508910A (en) * 2020-12-02 2021-03-16 创新奇智(深圳)技术有限公司 Defect extraction method and device for multi-classification defect detection
CN112446354A (en) * 2020-12-14 2021-03-05 浙江工商大学 Fine-grained image classification method based on multi-scale saliency map positioning
CN112686242B (en) * 2020-12-29 2023-04-18 昆明理工大学 Fine-grained image classification method based on multilayer focusing attention network
CN115393453A (en) * 2021-05-10 2022-11-25 京东科技控股股份有限公司 Image processing method and device and electronic equipment
CN113378883B (en) * 2021-05-12 2024-01-23 山东科技大学 Fine-grained vehicle classification method based on channel grouping attention model
CN113536973B (en) * 2021-06-28 2023-08-18 杭州电子科技大学 Traffic sign detection method based on saliency
CN113807362B (en) * 2021-09-03 2024-02-27 西安电子科技大学 Image classification method based on interlayer semantic information fusion depth convolution network
CN115661429B (en) * 2022-11-11 2023-03-10 四川川锅环保工程有限公司 System and method for identifying defects of boiler water wall pipe and storage medium


Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9818048B2 (en) * 2015-01-19 2017-11-14 Ebay Inc. Fine-grained categorization
CN108764133B (en) * 2018-05-25 2020-10-20 北京旷视科技有限公司 Image recognition method, device and system

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109215034A (en) * 2018-07-06 2019-01-15 成都图必优科技有限公司 A kind of Weakly supervised image, semantic dividing method for covering pond based on spatial pyramid
CN109376576A (en) * 2018-08-21 2019-02-22 中国海洋大学 The object detection method for training network from zero based on the intensive connection of alternately update
CN110009679A (en) * 2019-02-28 2019-07-12 江南大学 A kind of object localization method based on Analysis On Multi-scale Features convolutional neural networks

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Automatic building recognition from high-resolution imagery combining dilated convolutional residual networks and pyramid pooling representation; Qiao Wenfan et al.; Geography and Geo-Information Science; 2018-09-30; Vol. 34, No. 5; pp. 56-62 *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP4209937A1 (en) * 2022-01-10 2023-07-12 Samsung Electronics Co., Ltd. Method and apparatus with object recognition

Also Published As

Publication number Publication date
CN110619369A (en) 2019-12-27

Similar Documents

Publication Publication Date Title
CN110619369B (en) Fine-grained image classification method based on feature pyramid and global average pooling
CN110428428B (en) Image semantic segmentation method, electronic equipment and readable storage medium
CN109934293B (en) Image recognition method, device, medium and confusion perception convolutional neural network
CN109447034B (en) Traffic sign detection method in automatic driving based on YOLOv3 network
CN107609601B (en) Ship target identification method based on multilayer convolutional neural network
CN110837836B (en) Semi-supervised semantic segmentation method based on maximized confidence
CN108304873B (en) Target detection method and system based on high-resolution optical satellite remote sensing image
CN110414377B (en) Remote sensing image scene classification method based on scale attention network
CN111027493B (en) Pedestrian detection method based on deep learning multi-network soft fusion
Endres et al. Category-independent object proposals with diverse ranking
CN107209873B (en) Hyper-parameter selection for deep convolutional networks
CN108509978A (en) The multi-class targets detection method and model of multi-stage characteristics fusion based on CNN
CN110263786B (en) Road multi-target identification system and method based on feature dimension fusion
CN112131978B (en) Video classification method and device, electronic equipment and storage medium
CN107683469A (en) A kind of product classification method and device based on deep learning
CN111652317B (en) Super-parameter image segmentation method based on Bayes deep learning
CN110276248B (en) Facial expression recognition method based on sample weight distribution and deep learning
CN105184298A (en) Image classification method through fast and locality-constrained low-rank coding process
CN110008899B (en) Method for extracting and classifying candidate targets of visible light remote sensing image
CN109903339B (en) Video group figure positioning detection method based on multi-dimensional fusion features
CN110892409A (en) Method and apparatus for analyzing images
CN110633727A (en) Deep neural network ship target fine-grained identification method based on selective search
CN109165658B (en) Strong negative sample underwater target detection method based on fast-RCNN
CN112861917A (en) Weak supervision target detection method based on image attribute learning
CN110852327A (en) Image processing method, image processing device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20210423

Address after: 215000 Building No. 1, Kechuang Park, Taihu New Town, Wujiang, No. 18, Suzhou River Road, Wujiang District, Jiangsu Province

Patentee after: Jiangsu Yiyou Huiyun Software Co.,Ltd.

Address before: 215500, No. 99, South Third Ring Road, Changshu City, Suzhou, Jiangsu

Patentee before: CHANGSHU INSTITUTE OF TECHNOLOGY
