CN112861970A - Fine-grained image classification method based on feature fusion - Google Patents

Fine-grained image classification method based on feature fusion

Info

Publication number
CN112861970A
CN112861970A
Authority
CN
China
Prior art keywords
image
feature map
feature
network
resnet
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110179265.2A
Other languages
Chinese (zh)
Other versions
CN112861970B (en)
Inventor
初妍
王丽娜
莫世奇
李思纯
李松
时洁
胡博
苗晓晨
赵佳昕
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Engineering University
Original Assignee
Harbin Engineering University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin Engineering University filed Critical Harbin Engineering University
Priority to CN202110179265.2A priority Critical patent/CN112861970B/en
Publication of CN112861970A publication Critical patent/CN112861970A/en
Application granted granted Critical
Publication of CN112861970B publication Critical patent/CN112861970B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/048 Activation functions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/25 Determination of region of interest [ROI] or a volume of interest [VOI]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/26 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V10/267 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention belongs to the technical field of image recognition in computer vision, and particularly relates to a fine-grained image classification method based on feature fusion. The invention extracts local detail features of fine-grained images for the classification task and accurately locates the target region of interest, addressing the difficulty that sub-categories in fine-grained images differ only slightly from one another. An improved non-maximum suppression (soft-NMS) is used to optimize the region proposal network (RPN) so that the target object is acquired without interference from background information. The bilinear convolutional neural networks (B-CNNs) are improved through the attention module SCA and applied to the fine-grained classification task so as to obtain attention features of different dimensions. Compared with existing classification methods, the method localizes the discriminative key parts and achieves higher accuracy.

Description

Fine-grained image classification method based on feature fusion
Technical Field
The invention belongs to the technical field of image recognition in computer vision, and particularly relates to a fine-grained image classification method based on feature fusion.
Background
The traditional classification task mostly refers to coarse-grained classification, for example distinguishing cats from dogs. Because such categories differ in many obvious features, the task is relatively easier than fine-grained image classification. Fine-grained image classification is a subtask of image classification that mainly identifies hundreds of sub-categories under the same basic category, such as hundreds of sub-categories of birds, cars, pets, flowers, or airplanes. Unlike the general classification task, fine-grained image classification is characterized by very small differences between sub-categories, and these subtle, local differences are the key to fine-grained image classification.
Because the differences between sub-classes are slight, different sub-classes can often be distinguished only by subtle local differences. Fine-grained classification methods fall mainly into two categories. The first is strongly supervised classification models, which, in order to obtain better classification accuracy, require additional information beyond the class labels of the images, such as manually annotated object bounding boxes and part annotation points. For example, the Part R-CNN algorithm uses a region-based convolutional neural network to detect objects and local regions in an image. Because such annotation information is very expensive to obtain, the practicability of these algorithms is greatly limited. The second is weakly supervised classification models, which rely only on class labels and need no additional part annotation information to achieve good classification. For example, the two-level attention algorithm completes fine-grained image classification using only the class label, without relying on additional annotation information. Although the extracted features have a certain expressive ability, how to effectively extract features of the discriminative parts of the key attention regions, given only category labels, remains challenging.
Disclosure of Invention
The invention aims to extract local detail features of fine-grained images for the classification task and to accurately locate the target region of interest, and provides a fine-grained image classification method based on feature fusion.
The purpose of the invention is realized by the following technical scheme: the method comprises the following steps:
step 1: acquiring an image data set to be classified, taking partial image data to construct a training set, and forming a test set by the rest data; labeling the images in the training set to obtain class labels corresponding to the images;
step 2: extracting a feature map of each image in the training set by using a VGG-19 convolutional neural network, and obtaining a feature vector of each image in the training set through sliding window operation on the final conv5-3 feature map;
step 3: inputting the feature vector of each image in the training set into a regression layer and a classification layer to obtain a region candidate detection frame set of each image in the training set; calculating a confidence score f_i for each detection frame in the region candidate detection frame set; and selecting the detection frame with the highest confidence to cut the image, obtaining a cut image training set;
step 4: inputting the cut image training set into an SC-B-CNNs model for training;
the SC-B-CNNs model comprises a first ResNet-50 network, a second ResNet-50 network and a softmax classifier; the first ResNet-50 network is a ResNet-50 network pre-trained on ImageNet with the last fully connected layer removed, and an attention module SCA is added between the conv2 and conv3 convolution blocks of this ResNet-50 network; the second ResNet-50 network is not pre-trained, and an attention module SCA is added between its conv4 and conv5 convolution blocks;
step 4.1: respectively inputting the cut image training set into the first ResNet-50 network and the second ResNet-50 network, wherein the first ResNet-50 network outputs a first weighted feature map f_A of each image and the second ResNet-50 network outputs a second weighted feature map f_B of each image;
step 4.2: subjecting the first weighted feature map f_A and the second weighted feature map f_B of each image in the cut image training set to a bilinear pooling operation to obtain a bilinear feature vector of each image in the cut image training set;
step 4.3: inputting the bilinear feature vector of each image in the cut image training set into a softmax classifier to obtain the category of the image;
step 5: inputting the test set into the trained SC-B-CNNs model to obtain the classification result of the image data set to be classified.
The present invention may further comprise:
the attention module SCA is used for extracting a feature map F with weight distribution of an input feature map GscThe method comprises the following specific steps:
step 4.1.1: generating a feature map F by a 1 × 1 convolution from the feature map G input to the attention module SCA;
step 4.1.2: reducing the dimensionality of the feature map F using global average pooling and assigning weights through a fully connected layer with parameter W_fc, then compressing the feature map along the channel direction into a single channel (w × h × 1) through a convolution operation, and generating the spatial attention map A_s using a sigmoid activation function:
A_s = σ(f^{7×7}(W_fc(GAP(F))))
where G ∈ R^{w×h×c}, w is the length of the feature map G, h is the width of the feature map G, and w × h represents the two-dimensional spatial size of the feature map G; c represents the number of channels; f^{7×7} represents the size of the convolution kernel; σ() represents the sigmoid activation function;
step 4.1.3: fusing the spatial attention map A_s with the feature map F by element-wise dot multiplication to obtain the spatial attention feature F_s:
F_s = A_s ⊙ F
Step 4.1.4: feature spatial attention FsCompressing according to the spatial dimension w multiplied by h to generate a global compressed feature vector z of the current feature mapc
z_c = f_sq(u_c) = (1/(w × h)) Σ_{i=1}^{w} Σ_{j=1}^{h} u_c(i, j)
where f_sq() represents the compression operation; u_c represents the feature map of the c-th channel;
step 4.1.5: obtaining the weight value of each channel of the feature map through two fully connected layers, and obtaining the feature map F_sc with weight distribution using sigmoid activation:
F_sc = A ⊙ u_c
A = σ(W_s2 × tanh(W_s1 × z_c))
where σ() represents the sigmoid activation function and tanh() represents the tanh activation function; A is the feature vector of the weight distribution; W_s1 is the weight of the first fully connected layer; W_s2 is the weight of the second fully connected layer; u_c represents the feature map of the c-th channel; ⊙ represents element-wise dot multiplication.
The invention has the beneficial effects that:
the invention realizes the extraction of local detail characteristics of the fine-grained images on the classification task, accurately positions the fine-grained images in the concerned target area, solves the difficulty of small intra-class difference of the fine-grained images on the classification task, utilizes the improved non-maximum value to inhibit the soft-NMS optimization area to suggest the RPN to acquire the target object, and avoids the interference of background information. According to the invention, the bilinear convolutional neural network B-CNNs are improved through the attention module SCA and used for a fine-grained classification task so as to obtain attention characteristics with different dimensions. Compared with the existing classification method, the method is positioned in the key part of the distinction, and has higher accuracy.
Drawings
Fig. 1 is a frame diagram of the fine-grained image classification method based on feature fusion according to the present invention.
Fig. 2 is a specific flowchart of the RPN network according to the present invention.
FIG. 3 is a schematic diagram of the framework of the B-CNNs based on SCA in the invention.
FIG. 4 is a schematic diagram of the attention module SCA of the present invention.
Fig. 5 is a specific algorithm code diagram of the SCA-based bilinear CNNs in the present invention.
FIG. 6 is a table of the results of comparative experiments performed on three datasets CUB-200, Stanford cars and Oxford flowers.
FIG. 7 is a table of the results of comparative experiments performed on the CUB-200 dataset.
Detailed Description
The invention is further described below with reference to the accompanying drawings.
The invention aims to extract local detail features of fine-grained images for the classification task and to accurately locate the target region of interest, and provides a weakly supervised fine-grained image classification method based on feature fusion. An attention module SCA (Spatial-Channel Attention) is designed to improve the bilinear convolutional neural networks (B-CNNs) for the fine-grained classification task, so as to obtain attention features of different dimensions. Compared with existing classification methods, the method localizes the discriminative key parts and achieves higher accuracy.
Step 1, inputting images in a data set and corresponding class labels, and extracting a feature map of each image by using a VGG-19 convolutional neural network;
step 2, obtaining a 256-dimensional feature vector through 3 × 3 sliding window operation on the final conv5-3 feature map;
step 3, inputting the 256-dimensional feature vectors into two fully connected layers, namely a boundary regression layer and a classification layer, to obtain the region candidate frame set;
step 4, selecting a detection frame with the highest confidence level in the frames to be detected by using an improved soft-NMS algorithm;
step 5, cutting and dividing the detected target area with the highest confidence coefficient;
step 6, inputting the cut image;
step 7, extracting convolution features from the input image using two ResNet-50 networks, each with the last fully connected layer removed;
step 8, the first network uses ResNet-50 pre-trained on ImageNet and adds the designed attention module SCA between the conv2 and conv3 convolution blocks to obtain a weighted feature map;
step 9, the second sub-network uses ResNet-50 without pre-training and adds the designed attention module SCA between the conv4 and conv5 convolution blocks to obtain a weighted feature map (steps 7 to 11 are illustrated by the code sketch following this list);
step 10, obtaining bilinear feature vectors by bilinear pooling operation on the weighted feature maps in the steps 8 and 9;
step 11, inputting the bilinear feature vectors into a softmax classifier to obtain the category of the image;
step 12 inputs the test data set and calculates the accuracy of the model classification.
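As an illustration only and not part of the original disclosure, steps 7 to 11 can be sketched roughly as follows in PyTorch. The class name SCBCNN, the constructor argument sca_module_cls, and the mapping of torchvision's resnet50 stages layer1 to layer4 onto the conv2_x to conv5_x blocks are assumptions made for this sketch; the SCA module itself is sketched later in this description.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class SCBCNN(nn.Module):
    """Sketch of the two-branch SC-B-CNNs classifier (steps 7 to 11).
    sca_module_cls is assumed to build a spatial-channel attention module
    for a given number of channels (see the SCA sketch later on)."""
    def __init__(self, num_classes, sca_module_cls):
        super().__init__()
        # Branch A: ResNet-50 pre-trained on ImageNet, final FC layer dropped,
        # SCA inserted between the conv2 and conv3 blocks (layer1/layer2 here).
        rA = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
        self.branchA = nn.Sequential(
            rA.conv1, rA.bn1, rA.relu, rA.maxpool,
            rA.layer1, sca_module_cls(256),    # SCA after conv2_x (256 channels)
            rA.layer2, rA.layer3, rA.layer4)
        # Branch B: ResNet-50 without pre-training, SCA between conv4 and conv5 blocks.
        rB = models.resnet50(weights=None)
        self.branchB = nn.Sequential(
            rB.conv1, rB.bn1, rB.relu, rB.maxpool,
            rB.layer1, rB.layer2,
            rB.layer3, sca_module_cls(1024),   # SCA after conv4_x (1024 channels)
            rB.layer4)
        self.fc = nn.Linear(2048 * 2048, num_classes)

    def forward(self, x):
        fA = self.branchA(x)                   # weighted feature map f_A: (N, 2048, h, w)
        fB = self.branchB(x)                   # weighted feature map f_B: (N, 2048, h, w)
        n, _, h, w = fA.shape
        # Bilinear pooling: outer product of f_A and f_B averaged over all locations.
        b = torch.einsum('nchw,ndhw->ncd', fA, fB) / (h * w)
        logits = self.fc(b.reshape(n, -1))
        return torch.softmax(logits, dim=1)    # class probabilities (step 11)
```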
The invention extracts image features through the RPN (Region Proposal Network) and completes the selection of candidate boxes. The picture is taken as input, VGG-19 is used to extract coarse features of the image to be detected, and the output of the RPN is the set of regions of interest obtained by convolving the feature map. To prevent overfitting, the RPN is optimized using the improved soft-NMS, selecting the regions in which higher-confidence targets are located. For the preset regions, anchors with 3 scales and 3 aspect ratios are selected, i.e. 9 kinds of anchors are generated; at each sliding-window position the classification layer outputs 18 confidence values and the regression layer outputs the position information of 36 target regions of interest, yielding more accurate candidate regions. The target is parameterized according to the boundary coordinates with the following formulas:
t_x = (x - x_a)/w_a,  t_y = (y - y_a)/h_a
t_w = log(w/w_a),  t_h = log(h/h_a)
t*_x = (x* - x_a)/w_a,  t*_y = (y* - y_a)/h_a
t*_w = log(w*/w_a),  t*_h = log(h*/h_a)
where x, y, w, h respectively denote the center coordinates and the width and height of the predicted bounding box; t_i denotes the parameterization of the object boundary coordinates; t*_i denotes the annotation information associated with the positive anchor; x_a, y_a, w_a, h_a respectively denote the center coordinates and the width and height of the anchor box; and x*, y*, w*, h* respectively denote the center coordinates and the width and height of the labeled ground-truth box.
All detection boxes are sorted according to their scores (the score produced by the classifier is a probability value representing the probability that the current detection box contains the target to be detected). The detection box A with the highest score is selected and a threshold b is set. The IoU (Intersection over Union) between box A and each remaining detection box is calculated; boxes whose IoU with A exceeds the threshold b have a high overlap rate and are deleted. Boxes that do not overlap the current box, or whose overlap area is very small (IoU less than the threshold b), are kept; the unprocessed boxes are then re-sorted, the box with the largest score is again selected, the IoU values between the remaining boxes and this largest box are calculated, and boxes whose IoU exceeds the threshold are deleted again. This process is iterated until all boxes have been processed, and the final detection result is output.
The candidate boxes extracted by the RPN are highly overlapping. To reduce redundancy, the improved soft-NMS is used for optimization based on the classification scores of the detection boxes. When the score of a detection box is larger than the threshold t, the detection box is put into the final detection result set. When regions overlap, the score of the detection box is multiplied by a decay function, which effectively reduces the error probability and improves the detection accuracy. The specific calculation is as follows:
f_i ← f_i · d(IoU(A, b_i)) when detection box b_i overlaps the selected box A; otherwise f_i is kept unchanged,
where f_i is the score corresponding to the i-th detection box b_i, d(·) is the decay function, and t is the threshold.
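A minimal Python sketch of this procedure is given below. The linear decay (1 - IoU) is assumed for the decay function d(·), the (x1, y1, x2, y2) box format and the score threshold are illustrative, and the function is a simplification rather than the patent's exact implementation.

```python
def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def soft_nms(boxes, scores, t=0.5, score_thresh=0.001):
    """Soft-NMS: keep the highest-scoring box, decay the scores of overlapping
    boxes instead of removing them outright, and repeat until no boxes remain."""
    dets = list(zip(boxes, scores))
    keep = []
    while dets:
        m = max(range(len(dets)), key=lambda i: dets[i][1])  # highest current score
        best_box, best_score = dets.pop(m)
        keep.append((best_box, best_score))
        remaining = []
        for box, s in dets:
            o = iou(best_box, box)
            if o >= t:
                s = s * (1.0 - o)           # assumed linear decay of the score
            if s > score_thresh:             # drop boxes whose score falls too low
                remaining.append((box, s))
        dets = remaining
    return keep
```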
The SC-B-CNNs network architecture provided by the invention can be represented by a quadruple B = (f_A, f_B, P, C). Bilinear features are obtained by a bilinear combination through the outer-product operation, and the calculation formula is:
b = f_A^T · f_B
where f_A and f_B are the feature functions containing the added attention block SCA, P is the pooling function, and C is the classification function.
The feature outputs at each location are combined using bilinear pooling. The bilinear pooling operation of the input image I at location l is defined as:
bilinear(l, I, f_A, f_B) = f_A(l, I)^T f_B(l, I)
where f_A and f_B are the outputs of the two feature extraction functions of the B-CNNs.
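For illustration, the pooled bilinear feature of one image can be computed from the two feature maps as in the following PyTorch sketch; the helper name and the averaging over locations are choices made for the sketch.

```python
import torch

def bilinear_pool(fA, fB):
    """Bilinear pooling of two feature maps of shape (C_A, H, W) and (C_B, H, W):
    the outer product f_A(l)^T f_B(l) is taken at every location l and averaged
    over all locations, giving a C_A x C_B bilinear feature that is then flattened."""
    cA, h, w = fA.shape
    cB = fB.shape[0]
    a = fA.reshape(cA, h * w)        # each column is the descriptor at one location
    b = fB.reshape(cB, h * w)
    pooled = a @ b.t() / (h * w)     # sum of per-location outer products
    return pooled.flatten()          # bilinear feature vector for the classifier
```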
First, the feature map extracted by the feature function is taken as the original input G, with G ∈ R^{w×h×c}, where w × h denotes the two-dimensional spatial size of G and c denotes the number of channels. A feature map F is generated by a 1 × 1 convolution; F is dimensionality-reduced using Global Average Pooling and assigned weights by a fully connected layer with parameter W_fc; the feature map is then compressed along the channel direction into a single channel (w × h × 1) through a convolution operation, and a sigmoid activation function is adopted to generate the spatial attention map A_s, A_s ∈ R^{w×h×1}. The process of spatial attention extraction is expressed as:
A_s = σ(f^{7×7}(W_fc(GAP(F))))
where f^{7×7} represents the size of the convolution kernel, σ() represents the sigmoid activation function, and W_fc represents the fully connected layer with parameter W_fc.
Then, the spatial attention map A_s is fused with the original input F by element-wise dot multiplication to obtain the spatial attention feature F_s:
F_s = A_s ⊙ F
The global spatial information is then compressed into channel-wise descriptive feature information. The feature map F_s is compressed along the spatial dimension w × h to generate the global compressed feature vector z_c of the current feature map; the specific calculation formula is:
z_c = f_sq(u_c) = (1/(w × h)) Σ_{i=1}^{w} Σ_{j=1}^{h} u_c(i, j)
where f_sq() denotes the compression operation and u_c denotes the feature map of the c-th channel.
Then an excitation operation is carried out: by learning the weight parameters, the nonlinear correlations between channels are found. The weight value of each channel of the feature map is obtained through two fully connected layers, and the weighted feature map is taken as the input of the next network layer. The weight assignment of the channels is calculated as:
A_c = f_eq(z, W) = σ(W_s2 × tanh(W_s1 × z_c))
where f_eq() represents the excitation operation, z represents the global compressed feature vector, σ() represents the sigmoid activation function, and tanh() represents the tanh activation function.
After the weight distribution vector of the feature map is obtained by the above operations, a simple gating with sigmoid activation is used to obtain the feature map F_sc with weight distribution; the calculation process is:
F_sc = A_c ⊙ u_c
where A_c is the feature vector of the weight distribution, u_c denotes the feature map of the c-th channel, and ⊙ denotes element-wise dot multiplication.
The purpose of using two fully connected layers is to ensure the consistency of input and output. The first fully connected layer reduces the channel dimension to 1/16 of the original; after the tanh activation function, a second fully connected layer restores the original input dimension.
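Putting the spatial and channel steps together, an SCA-style module can be sketched in PyTorch as below. This is an interpretation of the formulas above rather than the authors' code: the 7 × 7 convolution, the tanh and sigmoid activations, and the 1/16 channel reduction follow the text, while the channel-wise mean used to form the single-channel input of the 7 × 7 convolution is a simplification of the W_fc and global-average-pooling weighting described earlier.

```python
import torch
import torch.nn as nn

class SCA(nn.Module):
    """Spatial-Channel Attention sketch: a 1x1 convolution produces F, a spatial
    attention map A_s reweights F, and an SE-style squeeze/excitation
    (FC -> tanh -> FC -> sigmoid, reduction 16) reweights the channels to give F_sc."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.conv1x1 = nn.Conv2d(channels, channels, kernel_size=1)
        self.conv7x7 = nn.Conv2d(1, 1, kernel_size=7, padding=3)   # spatial branch
        self.fc1 = nn.Linear(channels, channels // reduction)       # reduce to 1/16
        self.fc2 = nn.Linear(channels // reduction, channels)       # restore dimension

    def forward(self, g):                       # g: (N, C, H, W), the input feature map G
        f = self.conv1x1(g)                     # feature map F
        # Spatial attention A_s: compress channels to one map, 7x7 conv, sigmoid.
        a_s = torch.sigmoid(self.conv7x7(f.mean(dim=1, keepdim=True)))
        f_s = a_s * f                           # F_s = A_s (element-wise product) F
        # Squeeze: average over the spatial dimension w x h to obtain z_c.
        z = f_s.mean(dim=(2, 3))                # shape (N, C)
        # Excitation: A = sigmoid(W_s2 * tanh(W_s1 * z_c)).
        a_c = torch.sigmoid(self.fc2(torch.tanh(self.fc1(z))))
        # Reweight the channels: F_sc = A_c (element-wise product) u_c.
        return f_s * a_c.view(a_c.size(0), -1, 1, 1)
```

Used together with the earlier sketch, SCA itself can be passed as sca_module_cls, e.g. SCBCNN(num_classes=200, sca_module_cls=SCA).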
The specific algorithm of the SCA-based bilinear CNNs is shown in FIG. 5. To demonstrate the effectiveness of the proposed method, comparative experiments were performed on three datasets, CUB-200, Stanford cars and Oxford flowers, respectively, and the results of the experiments are shown in FIG. 6. To further verify the validity and accuracy of the improved RPN network and SCA, comparative experiments were performed on the CUB-200 dataset, with the results shown in fig. 7.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (2)

1. A fine-grained image classification method based on feature fusion is characterized by comprising the following steps:
step 1: acquiring an image data set to be classified, taking partial image data to construct a training set, and forming a test set by the rest data; labeling the images in the training set to obtain class labels corresponding to the images;
step 2: extracting a feature map of each image in the training set by using a VGG-19 convolutional neural network, and obtaining a feature vector of each image in the training set through sliding window operation on the final conv5-3 feature map;
step 3: inputting the feature vector of each image in the training set into a regression layer and a classification layer to obtain a region candidate detection frame set of each image in the training set; calculating a confidence score f_i for each detection frame in the region candidate detection frame set; and selecting the detection frame with the highest confidence to cut the image, obtaining a cut image training set;
step 4: inputting the cut image training set into an SC-B-CNNs model for training;
the SC-B-CNNs model comprises a first ResNet-50 network, a second ResNet-50 network and a softmax classifier; the first ResNet-50 network is a ResNet-50 network pre-trained on ImageNet with the last fully connected layer removed, and an attention module SCA is added between the conv2 and conv3 convolution blocks of this ResNet-50 network; the second ResNet-50 network is not pre-trained, and an attention module SCA is added between its conv4 and conv5 convolution blocks;
step 4.1: respectively inputting the cut image training set into the first ResNet-50 network and the second ResNet-50 network, wherein the first ResNet-50 network outputs a first weighted feature map f_A of each image and the second ResNet-50 network outputs a second weighted feature map f_B of each image;
step 4.2: subjecting the first weighted feature map f_A and the second weighted feature map f_B of each image in the cut image training set to a bilinear pooling operation to obtain a bilinear feature vector of each image in the cut image training set;
step 4.3: inputting the bilinear feature vector of each image in the cut image training set into a softmax classifier to obtain the category of the image;
step 5: inputting the test set into the trained SC-B-CNNs model to obtain the classification result of the image data set to be classified.
2. The fine-grained image classification method based on feature fusion according to claim 1, characterized in that: the attention module SCA is used for extracting a feature map F_sc with weight distribution from an input feature map G, and comprises the following specific steps:
step 4.1.1: generating a feature map F by a 1 × 1 convolution from the feature map G input to the attention module SCA;
step 4.1.2: reducing the dimensionality of the feature map F using global average pooling and assigning weights through a fully connected layer with parameter W_fc, then compressing the feature map along the channel direction into a single channel (w × h × 1) through a convolution operation, and generating the spatial attention map A_s using a sigmoid activation function:
A_s = σ(f^{7×7}(W_fc(GAP(F))))
where G ∈ R^{w×h×c}, w is the length of the feature map G, h is the width of the feature map G, and w × h represents the two-dimensional spatial size of the feature map G; c represents the number of channels; f^{7×7} represents the size of the convolution kernel; σ() represents the sigmoid activation function;
step 4.1.3: fusing the spatial attention map A_s with the feature map F by element-wise dot multiplication to obtain the spatial attention feature F_s:
F_s = A_s ⊙ F
step 4.1.4: compressing the spatial attention feature F_s along the spatial dimension w × h to generate the global compressed feature vector z_c of the current feature map:
z_c = f_sq(u_c) = (1/(w × h)) Σ_{i=1}^{w} Σ_{j=1}^{h} u_c(i, j)
where f_sq() represents the compression operation; u_c represents the feature map of the c-th channel;
step 4.1.5: obtaining the weight value of each channel of the feature map through two fully connected layers, and obtaining the feature map F_sc with weight distribution using sigmoid activation:
F_sc = A ⊙ u_c
A = σ(W_s2 × tanh(W_s1 × z_c))
where σ() represents the sigmoid activation function and tanh() represents the tanh activation function; A is the feature vector of the weight distribution; W_s1 is the weight of the first fully connected layer; W_s2 is the weight of the second fully connected layer; u_c represents the feature map of the c-th channel; ⊙ represents element-wise dot multiplication.
CN202110179265.2A 2021-02-09 2021-02-09 Fine-grained image classification method based on feature fusion Active CN112861970B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110179265.2A CN112861970B (en) 2021-02-09 2021-02-09 Fine-grained image classification method based on feature fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110179265.2A CN112861970B (en) 2021-02-09 2021-02-09 Fine-grained image classification method based on feature fusion

Publications (2)

Publication Number Publication Date
CN112861970A true CN112861970A (en) 2021-05-28
CN112861970B CN112861970B (en) 2023-01-03

Family

ID=75989506

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110179265.2A Active CN112861970B (en) 2021-02-09 2021-02-09 Fine-grained image classification method based on feature fusion

Country Status (1)

Country Link
CN (1) CN112861970B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113393371A (en) * 2021-06-28 2021-09-14 北京百度网讯科技有限公司 Image processing method and device and electronic equipment
CN113744292A (en) * 2021-09-16 2021-12-03 安徽世绿环保科技有限公司 Garbage classification station garbage throwing scanning system
CN113869347A (en) * 2021-07-20 2021-12-31 西安理工大学 Fine-grained classification method for severe weather image
CN114067316A (en) * 2021-11-23 2022-02-18 燕山大学 Rapid identification method based on fine-grained image classification

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140050391A1 (en) * 2012-08-17 2014-02-20 Nec Laboratories America, Inc. Image segmentation for large-scale fine-grained recognition
CN108898137A (en) * 2018-05-25 2018-11-27 黄凯 A kind of natural image character identifying method and system based on deep neural network
CN109086792A (en) * 2018-06-26 2018-12-25 上海理工大学 Based on the fine granularity image classification method for detecting and identifying the network architecture
CN110826558A (en) * 2019-10-28 2020-02-21 桂林电子科技大学 Image classification method, computer device, and storage medium
CN110866907A (en) * 2019-11-12 2020-03-06 中原工学院 Full convolution network fabric defect detection method based on attention mechanism
CN111210907A (en) * 2020-01-14 2020-05-29 西北工业大学 Pain intensity estimation method based on space-time attention mechanism
CN111709265A (en) * 2019-12-11 2020-09-25 深学科技(杭州)有限公司 Camera monitoring state classification method based on attention mechanism residual error network
WO2020252924A1 (en) * 2019-06-19 2020-12-24 平安科技(深圳)有限公司 Method and apparatus for detecting pedestrian in video, and server and storage medium

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140050391A1 (en) * 2012-08-17 2014-02-20 Nec Laboratories America, Inc. Image segmentation for large-scale fine-grained recognition
CN108898137A (en) * 2018-05-25 2018-11-27 黄凯 A kind of natural image character identifying method and system based on deep neural network
CN109086792A (en) * 2018-06-26 2018-12-25 上海理工大学 Based on the fine granularity image classification method for detecting and identifying the network architecture
WO2020252924A1 (en) * 2019-06-19 2020-12-24 平安科技(深圳)有限公司 Method and apparatus for detecting pedestrian in video, and server and storage medium
CN110826558A (en) * 2019-10-28 2020-02-21 桂林电子科技大学 Image classification method, computer device, and storage medium
CN110866907A (en) * 2019-11-12 2020-03-06 中原工学院 Full convolution network fabric defect detection method based on attention mechanism
CN111709265A (en) * 2019-12-11 2020-09-25 深学科技(杭州)有限公司 Camera monitoring state classification method based on attention mechanism residual error network
CN111210907A (en) * 2020-01-14 2020-05-29 西北工业大学 Pain intensity estimation method based on space-time attention mechanism

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
TRINH LE BA KHANH 等: "Enhancing U-Net with Spatial-Channel Attention Gate for Abnormal Tissue Segmentation in Medical Imaging", 《APPLIED SCIENCES》 *
李旭: "Research on Fine-Grained Image Classification Methods Based on Attention Mechanism", China Masters' Theses Full-text Database, Information Science and Technology
杨贞: "Image Feature Processing Technology and Applications", 31 August 2020
王亚南: "Research on Pedestrian Detection Methods Based on the RPN Network", China Masters' Theses Full-text Database, Information Science and Technology
赵浩如 等: "Research on Fine-Grained Image Classification Algorithm Based on RPN and B-CNN", Computer Applications and Software

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113393371A (en) * 2021-06-28 2021-09-14 北京百度网讯科技有限公司 Image processing method and device and electronic equipment
CN113393371B (en) * 2021-06-28 2024-02-27 北京百度网讯科技有限公司 Image processing method and device and electronic equipment
CN113869347A (en) * 2021-07-20 2021-12-31 西安理工大学 Fine-grained classification method for severe weather image
CN113744292A (en) * 2021-09-16 2021-12-03 安徽世绿环保科技有限公司 Garbage classification station garbage throwing scanning system
CN114067316A (en) * 2021-11-23 2022-02-18 燕山大学 Rapid identification method based on fine-grained image classification
CN114067316B (en) * 2021-11-23 2024-05-03 燕山大学 Rapid identification method based on fine-granularity image classification

Also Published As

Publication number Publication date
CN112861970B (en) 2023-01-03

Similar Documents

Publication Publication Date Title
CN109584248B (en) Infrared target instance segmentation method based on feature fusion and dense connection network
CN110728200B (en) Real-time pedestrian detection method and system based on deep learning
CN112861970B (en) Fine-grained image classification method based on feature fusion
CN107563372B (en) License plate positioning method based on deep learning SSD frame
CN111461039B (en) Landmark identification method based on multi-scale feature fusion
CN109583483B (en) Target detection method and system based on convolutional neural network
CN114758383A (en) Expression recognition method based on attention modulation context spatial information
CN107784288A (en) A kind of iteration positioning formula method for detecting human face based on deep neural network
CN113269054B (en) Aerial video analysis method based on space-time 2D convolutional neural network
CN112861917B (en) Weak supervision target detection method based on image attribute learning
CN114758288A (en) Power distribution network engineering safety control detection method and device
CN111898432A (en) Pedestrian detection system and method based on improved YOLOv3 algorithm
CN109670555B (en) Instance-level pedestrian detection and pedestrian re-recognition system based on deep learning
CN111768415A (en) Image instance segmentation method without quantization pooling
CN112861785B (en) Instance segmentation and image restoration-based pedestrian re-identification method with shielding function
CN111652273A (en) Deep learning-based RGB-D image classification method
CN115497122A (en) Method, device and equipment for re-identifying blocked pedestrian and computer-storable medium
CN112396036A (en) Method for re-identifying blocked pedestrians by combining space transformation network and multi-scale feature extraction
CN112329771A (en) Building material sample identification method based on deep learning
CN116596966A (en) Segmentation and tracking method based on attention and feature fusion
CN114743126A (en) Lane line sign segmentation method based on graph attention machine mechanism network
CN114494786A (en) Fine-grained image classification method based on multilayer coordination convolutional neural network
CN116935249A (en) Small target detection method for three-dimensional feature enhancement under unmanned airport scene
CN116796248A (en) Forest health environment assessment system and method thereof
CN115170662A (en) Multi-target positioning method based on yolov3 and convolutional neural network

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant