CN110188816B - Image fine granularity identification method based on multi-stream multi-scale cross bilinear features - Google Patents

Image fine granularity identification method based on multi-stream multi-scale cross bilinear features

Info

Publication number
CN110188816B
CN110188816B (application CN201910450570.3A)
Authority
CN
China
Prior art keywords
stream
features
bilinear
image
cross
Prior art date
Legal status
Active
Application number
CN201910450570.3A
Other languages
Chinese (zh)
Other versions
CN110188816A (en)
Inventor
李春国
邓亭强
杨绿溪
徐琴珍
俞菲
Current Assignee
Southeast University
Original Assignee
Southeast University
Priority date
Filing date
Publication date
Application filed by Southeast University filed Critical Southeast University
Priority to CN201910450570.3A priority Critical patent/CN110188816B/en
Publication of CN110188816A publication Critical patent/CN110188816A/en
Application granted granted Critical
Publication of CN110188816B publication Critical patent/CN110188816B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention provides an image fine-grained recognition method based on multi-stream multi-scale cross bilinear features. To address insufficient extraction and insufficient utilization of fine-grained image features, the method uses a multi-stream network to extract cross bilinear features, which characterize finer local details of the image and remedy the insufficient feature extraction; random image blending enhancement and fusion of multi-scale bottom-layer bilinear features remedy the insufficient feature utilization. Experiments show that the recognition accuracy of the proposed multi-stream multi-scale cross bilinear method on the public CUB-200-2011 dataset is markedly higher than that of existing methods, reaching state-of-the-art fine-grained recognition accuracy.

Description

Image fine granularity identification method based on multi-stream multi-scale cross bilinear features
Technical Field
The invention relates to the fields of computer vision, artificial intelligence and multimedia signal processing, in particular to an image fine granularity recognition method based on multi-stream multi-scale cross bilinear features.
Background
With the continuous development of deep convolutional neural networks, deep learning has steadily improved the accuracy and inference efficiency of computer vision tasks such as object detection, semantic segmentation, object tracking and image classification. This progress is mainly attributable to the powerful nonlinear modeling capability of convolutional neural networks, the availability of massive data, and the computing power of modern hardware, and it has also driven rapid development in fine-grained image recognition. Methods for generic image classification are now relatively mature, as reflected by the high recognition accuracy achieved on the ImageNet dataset, whereas fine-grained image recognition, which must distinguish subclasses that are inherently harder to tell apart, still offers broader room for development and more valuable applications.
Fine-grained image recognition is defined relative to coarse-grained recognition. Coarse-grained recognition distinguishes broad categories with large differences, such as people, chairs, cars and cats, whereas fine-grained recognition identifies subclasses within one broad category, for example the 200 bird species in the Caltech-UCSD Birds dataset (CUB-200-2011) or the 196 car classes in the Stanford Cars dataset proposed by Stanford University. The fine-grained task is therefore characterized by small inter-class variance and large intra-class variance: compared with coarse-grained recognition, fine-grained subclasses are easily confused, the discriminative regions are few, and the subclasses share many similar features, all of which increase the difficulty of fine-grained image recognition.
Disclosure of Invention
For the task of fine-grained recognition of image target subclasses, the invention provides an image fine-grained recognition method based on multi-stream multi-scale cross bilinear features. The method uses a multi-stream network to extract fine-grained image features, computes cross bilinear features, and predicts the fine-grained category from the fused cross features. The method comprises the following steps:
(1) Data augmentation is performed on the input image;
(2) Extracting image features by using a multi-stream basic network, and calculating cross bilinear features and bottom bilinear features;
(3) And predicting the fine granularity category by using the fused characteristics.
As a further improvement of the invention, the image is augmented in step (1); the specific steps are as follows:
step 2.1: the data are enhanced by offline and online rotation, wherein offline rotation rotates the dataset in 10-degree steps over [0°, 359°] and online rotation rotates each picture fed into the network by a random angle; in addition, brightness enhancement and random cropping are used for data enhancement;
step 2.2: data augmentation by random image blending enhancement: let U(ε) be the uniform distribution on [0,1]; at each step sample ε ~ U(ε), and for two training samples x1 and x2 combine them randomly according to this distribution to obtain the blended sample εx1 + (1−ε)x2 with corresponding label εh1 + (1−ε)h2; this completes the random image blending enhancement.
As a further improvement of the present invention, in step (2) image features are extracted with the multi-stream base network and the cross bilinear features are calculated:
step 3.1: the features of the augmented image are extracted with a multi-stream network; the augmented pictures are fed into a K-way convolutional neural network in which Stream 1, Stream 2 and Stream 3 adopt a ResNet-34, a ResNet-50 and a VGG-16 network respectively, and these serve as base feature extraction networks, yielding the features of the fine-grained image;
step 3.2: the cross bilinear features of the multi-stream network are calculated by extracting the bilinear features of Stream 1 and Stream 2, of Stream 1 and Stream 3, and of Stream 2 and Stream 3, thereby obtaining the cross bilinear features of the K-way convolutional neural network; a bilinear feature is calculated as follows: the inputs are the feature maps A and B of two streams, A is transposed and multiplied by B, and the result is normalized and L2-regularized;
step 3.3: the bottom-layer bilinear features are calculated by second-order bilinear pooling of each bottom layer with itself, where the bottom layers are the ResNet-5a layer of Stream 1, namely the first layer of the fifth bottleneck block, the ResNet-5a layer of Stream 2, namely the first layer of the fifth bottleneck block, and the Conv5_1 layer of Stream 3, namely the first layer of the fifth convolution block; the bottom-layer bilinear features are fused with the high-level cross bilinear features.
As a further improvement of the present invention, in step (3) the fine-grained category is predicted from the fused features:
step 4.1: the cross bilinear features and the bottom-layer bilinear features are fused using one of two fusion modes, concatenation or element-wise addition; finally the fused features are fed to a fully connected layer for classification, and a softmax vector is computed to obtain the predicted result;
wherein the loss function is the cross entropy loss, which guides the training and learning process:

L = -\sum_{i=1}^{C} y_i \log \hat{y}_i

where y_i denotes the true category label, \hat{y}_i denotes the category label predicted by the network, and C is the total number of categories on the training dataset.
So far, the image fine granularity identification method based on the multi-stream multi-scale cross bilinear features is completed.
The invention provides an image fine-grained recognition method based on multi-stream multi-scale cross bilinear features. To address insufficient extraction and insufficient utilization of fine-grained image features, the method uses a multi-stream network to extract cross bilinear features, which characterize finer local details of the image and remedy the insufficient feature extraction; random image blending enhancement and fusion of multi-scale bottom-layer bilinear features remedy the insufficient feature utilization. Experiments show that the recognition accuracy of the proposed multi-stream multi-scale cross bilinear method on the public CUB-200-2011 dataset is markedly higher than that of existing methods, reaching state-of-the-art fine-grained recognition accuracy.
Drawings
FIG. 1 is a fine particle data augmentation schematic of the present invention.
FIG. 2 is a diagram of an image fine granularity recognition method based on multi-stream multi-scale cross bilinear features of the present invention.
FIG. 3 is a graph showing how accuracy changes with the number of training epochs on the CUB-200-2011 test dataset.
FIG. 4 shows partial test samples from the CUB-200-2011 dataset (the upper left corner of each image gives the category predicted by the invention).
Detailed Description
The invention is described in further detail below with reference to the attached drawings and detailed description:
the invention provides an image fine granularity recognition method based on multi-stream multi-scale cross bilinear features, which uses a multi-stream network to extract fine granularity image features, calculates cross bilinear features, and predicts fine granularity categories by utilizing the fused cross features.
A specific embodiment of the image fine-grained recognition method based on multi-stream multi-scale cross bilinear features is described in detail below, taking a public fine-grained dataset as an example and referring to the accompanying drawings. The invention uses a multi-stream network to extract fine-grained image features, computes cross bilinear features, and predicts the fine-grained category from the fused cross features. The method comprises the following steps:
(1) The input image is first data augmented.
Step 1.1: the data are enhanced by offline and online rotation. Offline rotation rotates the dataset in 10-degree steps over [0°, 359°]; online rotation rotates each picture fed into the network by a random angle. In addition, brightness enhancement and random cropping are used for data enhancement.
Step 1.2: data augmentation by random image blending enhancement, as shown in FIG. 1. Let U(ε) be the uniform distribution on [0,1]. At each step sample ε ~ U(ε); for two training samples x1 and x2, combine them randomly according to this distribution to obtain the blended sample εx1 + (1−ε)x2 with corresponding label εh1 + (1−ε)h2. This completes the random image blending enhancement.
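The random image blending of Step 1.2 can be sketched as follows. This is a minimal numpy sketch under the assumption that images and labels are arrays of matching shapes; the function name is illustrative, not from the patent:

```python
import numpy as np

def random_image_blending(x1, h1, x2, h2, rng=None):
    """Blend two training samples, following Step 1.2.

    x1, x2 : image arrays of identical shape.
    h1, h2 : label vectors (e.g. one-hot) of identical shape.
    eps is drawn from the uniform distribution U on [0, 1].
    """
    rng = np.random.default_rng() if rng is None else rng
    eps = float(rng.uniform(0.0, 1.0))        # eps ~ U(0, 1)
    x = eps * x1 + (1.0 - eps) * x2           # blended image
    h = eps * h1 + (1.0 - eps) * h2           # blended soft label
    return x, h, eps
```

Because the label is the same convex combination as the image, a standard cross entropy loss applied to the blended label accommodates the augmentation without any change to the training loop.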
(2) Image features are extracted with the multi-stream base network, and the cross bilinear features and bottom-layer bilinear features are calculated. The specific steps are as follows:
Step 2.1: the features of the augmented image are extracted with a multi-stream network. The augmented pictures are fed into a K-way convolutional neural network in which Stream 1, Stream 2 and Stream 3 adopt a ResNet-34, a ResNet-50 and a VGG-16 network respectively; these serve as base feature extraction networks. As shown in fig. 2, the features of the fine-grained image are thus obtained. Here K takes the value 3.
Step 2.2: the cross bilinear features of the multi-stream network are calculated. The bilinear features of Stream 1 and Stream 2, of Stream 1 and Stream 3, and of Stream 2 and Stream 3 are extracted, yielding the cross bilinear features of the K-way convolutional neural network. A bilinear feature is calculated as follows: the inputs are the feature maps A and B of two streams; A is transposed and multiplied by B, and the result is normalized and L2-regularized:
Z = A^\top B, \qquad z \leftarrow \operatorname{sign}(z)\sqrt{|z|}, \qquad z \leftarrow z / \|z\|_2
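The bilinear-feature computation of Step 2.2 can be sketched as follows. The signed square root is an assumed choice for the unspecified "normalization operation" (it is the common choice for bilinear pooling), and all function names are illustrative:

```python
import itertools
import numpy as np

def bilinear_feature(feat_a, feat_b):
    """Bilinear feature of two feature maps, following Step 2.2.

    feat_a : (h*w, c1) spatially flattened feature map A.
    feat_b : (h*w, c2) spatially flattened feature map B.
    Computes A^T B, then a signed-square-root normalization (an
    assumed choice) and L2 normalization.
    """
    z = (feat_a.T @ feat_b).reshape(-1)       # transpose A, multiply by B
    z = np.sign(z) * np.sqrt(np.abs(z))       # assumed normalization step
    norm = np.linalg.norm(z)
    return z / norm if norm > 0 else z        # L2 normalization

def cross_bilinear_features(stream_feats):
    """Bilinear feature for every unordered pair of streams
    (Stream 1 & 2, 1 & 3, 2 & 3), given name -> feature-map dict."""
    return {
        (na, nb): bilinear_feature(fa, fb)
        for (na, fa), (nb, fb) in itertools.combinations(stream_feats.items(), 2)
    }
```

With three streams this produces exactly the three cross bilinear features enumerated in the text.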
Step 2.3: the bottom-layer bilinear features are calculated by second-order bilinear pooling of each bottom layer with itself. The bottom layers are the ResNet-5a layer of Stream 1 (the first layer of the fifth bottleneck block), the ResNet-5a layer of Stream 2 (the first layer of the fifth bottleneck block) and the Conv5_1 layer of Stream 3 (the first layer of the fifth convolution block). These bottom-layer bilinear features are fused with the high-level cross bilinear features.
(3) The fine-grained category is predicted from the fused features. The specific steps are as follows:
Step 3.1: the cross bilinear features and the bottom-layer bilinear features are fused using one of two fusion modes, concatenation or element-wise addition. The fused features are then fed to a fully connected layer for classification, and a softmax vector is computed to obtain the predicted result. The overall flow is shown in Algorithm 2.
(Algorithm 2: overall flow of the method; figure not reproduced in the text.)
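The two fusion modes and the softmax prediction of Step 3.1 can be sketched as follows; the fully connected layer is omitted for brevity, and the function names are illustrative:

```python
import numpy as np

def fuse_features(features, mode="concat"):
    """Fuse a list of bilinear feature vectors, following Step 3.1.

    mode='concat' splices the vectors end to end; mode='add' sums
    them element-wise (they must then share a common length).
    """
    if mode == "concat":
        return np.concatenate(features)
    if mode == "add":
        return np.sum(features, axis=0)
    raise ValueError("mode must be 'concat' or 'add'")

def softmax(logits):
    """Numerically stable softmax over the class logits produced
    by the fully connected layer."""
    shifted = logits - np.max(logits)   # subtract max for stability
    e = np.exp(shifted)
    return e / e.sum()
```

The predicted category is then the argmax of the softmax vector.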
The loss function of the present invention is a cross entropy loss function to guide the training and learning process.
L = -\sum_{i=1}^{C} y_i \log \hat{y}_i
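The cross entropy loss can be written directly from its formula; this minimal sketch assumes a one-hot (or blended, per Step 1.2) label vector and a softmax probability vector:

```python
import numpy as np

def cross_entropy_loss(y_true, y_pred, eps=1e-12):
    """Cross entropy L = -sum_i y_i * log(yhat_i) over the C classes.

    y_true : true label vector (one-hot, or soft labels from the
             random image blending augmentation).
    y_pred : softmax probability vector; eps guards against log(0).
    """
    return float(-np.sum(y_true * np.log(y_pred + eps)))
```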
The experimental platform for the model is as follows: a CentOS 7 system with an Intel Xeon E5 processor and an NVIDIA Tesla P100 graphics card. Training uses a joint cross entropy loss and a ranking-consistency loss; the optimizer is stochastic gradient descent (SGD) with initial learning rate lr = 0.01 and batch_size = 16, iterated for 100 epochs to obtain the trained model, which is tested on the CUB-200-2011 dataset proposed by the California Institute of Technology. The hyper-parameters of model training are not limited to the following:
(Hyper-parameter table: optimizer SGD, initial learning rate 0.01, batch size 16, 100 training epochs; the full table is not reproduced in the text.)
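A single update of the stated SGD optimizer (lr = 0.01) can be sketched as follows; momentum and weight decay, if used in the original setup, are omitted, and the function name is illustrative:

```python
import numpy as np

def sgd_step(params, grads, lr=0.01):
    """One vanilla SGD update with the stated initial learning
    rate lr = 0.01: p <- p - lr * grad for each parameter array."""
    return [p - lr * g for p, g in zip(params, grads)]
```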
The test curve of the invention on the dataset is shown in fig. 3. (The table of test results in the original specification is not reproduced in the text.)
FIG. 4 shows the prediction results for some test samples from the CUB-200-2011 dataset; it can be seen that the invention predicts the fine-grained category of the images well.
The above description is only of the preferred embodiment of the present invention, and is not intended to limit the present invention in any other way, but is intended to cover any modifications or equivalent variations according to the technical spirit of the present invention, which fall within the scope of the present invention as defined by the appended claims.

Claims (1)

1. The image fine-granularity recognition method based on the multi-stream multi-scale cross bilinear features is characterized by extracting fine-granularity image features by using a multi-stream network, calculating the cross bilinear features, and predicting fine-granularity categories by utilizing the fused cross features, and comprises the following steps:
(1) Data augmentation is performed on the input image;
in step (1) the image is augmented; the specific steps are as follows:
step 2.1: the data are enhanced by offline and online rotation, wherein offline rotation rotates the dataset in 10-degree steps over [0°, 359°] and online rotation rotates each picture fed into the network by a random angle; in addition, brightness enhancement and random cropping are used for data enhancement;
step 2.2: data augmentation by random image blending enhancement: let U(ε) be the uniform distribution on [0,1]; at each step sample ε ~ U(ε), and for two training samples x1 and x2 combine them randomly according to this distribution to obtain the blended sample εx1 + (1−ε)x2 with corresponding label εh1 + (1−ε)h2; this completes the random image blending enhancement;
(2) Extracting image features by using a multi-stream basic network, and calculating cross bilinear features and bottom bilinear features;
in step (2), image features are extracted with the multi-stream base network and the cross bilinear features are calculated:
step 3.1: the features of the augmented image are extracted with a multi-stream network; the augmented pictures are fed into a K-way convolutional neural network in which Stream 1, Stream 2 and Stream 3 adopt a ResNet-34, a ResNet-50 and a VGG-16 network respectively, and these serve as base feature extraction networks, yielding the features of the fine-grained image;
step 3.2: the cross bilinear features of the multi-stream network are calculated by extracting the bilinear features of Stream 1 and Stream 2, of Stream 1 and Stream 3, and of Stream 2 and Stream 3, thereby obtaining the cross bilinear features of the K-way convolutional neural network; a bilinear feature is calculated as follows: the inputs are the feature maps A and B of two streams, A is transposed and multiplied by B, and the result is normalized and L2-regularized;
step 3.3: the bottom-layer bilinear features are calculated by second-order bilinear pooling of each bottom layer with itself, where the bottom layers are the ResNet-5a layer of Stream 1, namely the first layer of the fifth bottleneck block, the ResNet-5a layer of Stream 2, namely the first layer of the fifth bottleneck block, and the Conv5_1 layer of Stream 3, namely the first layer of the fifth convolution block; the bottom-layer bilinear features are fused with the high-level cross bilinear features;
(3) Predicting the fine granularity category by utilizing the fused characteristics;
in step (3), the fused features are used to predict the fine-grained category:
step 4.1: the cross bilinear features and the bottom-layer bilinear features are fused using one of two fusion modes, concatenation or element-wise addition; finally the fused features are fed to a fully connected layer for classification, and a softmax vector is computed to obtain the predicted result;
wherein the loss function is the cross entropy loss, which guides the training and learning process:

L = -\sum_{i=1}^{C} y_i \log \hat{y}_i

where y_i denotes the true category label, \hat{y}_i denotes the category label predicted by the network, and C is the total number of categories on the training dataset;
so far, the image fine granularity identification method based on the multi-stream multi-scale cross bilinear features is completed.
CN201910450570.3A 2019-05-28 2019-05-28 Image fine granularity identification method based on multi-stream multi-scale cross bilinear features Active CN110188816B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910450570.3A CN110188816B (en) 2019-05-28 2019-05-28 Image fine granularity identification method based on multi-stream multi-scale cross bilinear features


Publications (2)

Publication Number Publication Date
CN110188816A CN110188816A (en) 2019-08-30
CN110188816B true CN110188816B (en) 2023-05-02

Family

ID=67718218

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910450570.3A Active CN110188816B (en) 2019-05-28 2019-05-28 Image fine granularity identification method based on multi-stream multi-scale cross bilinear features

Country Status (1)

Country Link
CN (1) CN110188816B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110519485B (en) * 2019-09-09 2021-08-31 Oppo广东移动通信有限公司 Image processing method, image processing device, storage medium and electronic equipment
CN111401122B (en) * 2019-12-27 2023-09-26 航天信息股份有限公司 Knowledge classification-based complex target asymptotic identification method and device
CN111325221B (en) * 2020-02-25 2023-06-23 青岛海洋科技中心 Image feature extraction method based on image depth information
CN111091585B (en) * 2020-03-19 2020-07-17 腾讯科技(深圳)有限公司 Target tracking method, device and storage medium
CN111476144B (en) * 2020-04-02 2023-06-09 深圳力维智联技术有限公司 Pedestrian attribute identification model determining method and device and computer readable storage medium
CN112418358A (en) * 2021-01-14 2021-02-26 苏州博宇鑫交通科技有限公司 Vehicle multi-attribute classification method for strengthening deep fusion network

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108875674B (en) * 2018-06-29 2021-11-16 东南大学 Driver behavior identification method based on multi-column fusion convolutional neural network
CN109685115B (en) * 2018-11-30 2022-10-14 西北大学 Fine-grained conceptual model with bilinear feature fusion and learning method

Also Published As

Publication number Publication date
CN110188816A (en) 2019-08-30


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant