CN111860068A

CN111860068A - Fine-grained bird identification method based on cross-layer simplified bilinear network

Info

Publication number: CN111860068A
Application number: CN201910360985.1A
Authority: CN
Inventors: 何小海; 蓝洁; 滕奇志; 卿粼波; 任超; 吴小强; 吴晓红
Original assignee: Sichuan University
Current assignee: Sichuan University
Priority date: 2019-04-30
Filing date: 2019-04-30
Publication date: 2020-10-30

Abstract

The invention discloses a fine-grained bird identification method based on a cross-layer simplified bilinear network. The method comprises the following steps: the 5994 training pictures and 5794 test pictures of the CUB-200-2011 data set are preprocessed, and the processed images are input into a VGG-16 convolutional neural network to extract feature maps of the bird images. To take inter-layer feature interaction into account, three groups of simplified bilinear feature representations are extracted from the feature maps of different high-level convolutional layers, normalized, concatenated and sent to a softmax classifier. Finally, the whole network is optimized with a cross-entropy loss assisted by a pairwise confusion loss. The identification method has the advantages of low feature dimension, low computational cost, high recognition rate and strong robustness, has practical value in the specific field of fine-grained image classification, and can be applied in practice.

Description

Fine-grained bird identification method based on cross-layer simplified bilinear network

Technical Field

The invention provides a fine-grained bird identification method based on a cross-layer simplified bilinear network, and relates to deep learning and fine-grained image classification.

Background

Fine-grained classification aims to distinguish the numerous sub-categories under the same basic category, such as different species of birds or flowers. Compared with coarse-grained images, fine-grained images show slight differences between classes and pronounced differences within a class, so fine-grained features are usually harder to obtain; the many parameters of the model must be determined from image annotations, while the overfitting caused by a small amount of data must be avoided as far as possible. Early fine-grained identification methods rely on manually annotated local information to train the classification model with strong supervision. Local annotation usually has to be done by experts in the corresponding field, so these methods require a high degree of manual involvement. In recent years, weakly supervised learning methods that need only image-level class labels have become a research focus.

There are two mainstream types of fine-grained classification methods based on weak supervision. The first type uses a "localization" sub-network to assist the main "classification" network, enhancing the learning capability of the classification network with the local information (e.g., part locations or segmentation masks) provided by the localization network. Such approaches require a trade-off between localization and recognition capability, which may degrade the performance of a single network. This trade-off is also reflected in practice: training usually involves either alternating optimization of the two networks, or training the two networks separately and then fine-tuning them jointly. The second type is end-to-end feature encoding, which enhances the learning capability of convolutional neural networks by encoding higher-order statistics of the convolutional feature maps. Such methods seek a robust representation of the image; conventional representations include VLAD and Fisher vectors built on SIFT features. Such models capture local feature interactions in a translation-invariant manner, which is particularly useful for texture and fine-grained recognition tasks.

Building on the end-to-end encoding approach of the simplified bilinear network, the invention provides a fine-grained bird identification method based on a cross-layer simplified bilinear network, which makes full use of the inter-layer feature correlation and the interaction between feature maps from different convolutional layers, and regularizes the cross-entropy loss function with pairwise confusion. The method compensates for the insufficiency of bilinear features obtained from a single convolutional layer, has lower dimensionality and less computation than the features of the bilinear convolutional neural network (BCNN), and achieves a recognition rate of 86.6% on the CUB-200-2011 data set.

Disclosure of Invention

The invention realizes the purpose through the following technical scheme, which comprises the following steps:

(1) Bird image feature extraction. The 5994 training pictures and 5794 test pictures of the CUB-200-2011 data set are preprocessed, and the processed images are input into a convolutional neural network to extract deep feature representations of the images.

(2) Cross-layer simplified bilinear feature fusion. To take inter-layer feature interaction into account, the simplified bilinear operation is applied to the feature maps of different high-level convolutional layers obtained in step (1), yielding three groups of bilinear feature representations, which are normalized, concatenated and sent to a softmax classifier.

(3) Network optimization with cross-entropy loss assisted by pairwise confusion loss. The samples in a training batch are randomly divided into two groups that form picture pairs; if a picture pair has the same label, the cross-entropy loss is calculated directly; if a picture pair has different labels, a pairwise Euclidean loss is added to the cross-entropy loss as a regularization term.

Drawings

FIG. 1 is the bird image feature extraction network

FIG. 2 is a schematic diagram of the activation responses of different high-level convolutional layers

FIG. 3 is a diagram of the simplified bilinear operation

FIG. 4 is a block diagram of the cross-layer simplified bilinear network

FIG. 5 is the network training method with pairwise confusion loss

Detailed Description

The invention is further described below with reference to the accompanying drawings:

FIG. 1 shows the VGG-16-based bird image feature extraction network. VGG-16 is selected as the image feature extractor, with the fifth pooling layer pool5 and the three fully connected layers fc6, fc7 and fc8 removed. The data set pictures are first preprocessed by scaling them to 512 × S in proportion to their length and width. In the training stage, the pictures are shuffled, horizontally flipped and randomly cropped, giving an input size of 448 × 448; in the testing stage, only center cropping is applied to the pictures.

FIG. 2 shows the activation responses of different high-level convolutional layers in the feature extraction network. As shown in FIG. 2, different convolutional layers discriminate the parts of the input image differently. In the first row of pictures in FIG. 2, for example, conv5_1 responds strongly to the tail, head and wings of the black-legged geoduck, while conv5_3 retains only the activation response to the head. Inspired by this observation, and in order to better capture the feature relations between layers, the invention provides a cross-layer simplified bilinear pooling method. The method takes inter-layer feature interaction into account, integrates several cross-layer bilinear features and fuses them before the final classification to enhance the representation capability of the features, without introducing additional training parameters. In contrast to BCNN, which uses features from a single convolutional layer only, the present method treats each convolutional layer of the convolutional neural network as a part-attribute extractor and exploits part feature interactions across multiple layers.

FIG. 3 is a simplified bilinear operation diagram. The concrete implementation steps are as follows:

(1) First, the Count Sketch function $\Psi$ maps the feature vector $f_k \in \mathbb{R}^{c}$ to a $d$-dimensional feature space, $k = 1, 2$. Two vectors $s_k \in \{-1, 1\}^{c}$ and $h_k \in \{1, \ldots, d\}^{c}$ are defined; they are initialized from uniform distributions and stay fixed in subsequent operations. $h_k$ is used to find the index $j = h_k(i)$ in the feature space that corresponds to the $i$-th element $f_k(i)$ of $f_k$, so that

$$\Psi(f_k, h_k, s_k) = \{Q_1, Q_2, \ldots, Q_d\}, \qquad Q_j = \sum_{i:\, h_k(i) = j} s_k(i)\, f_k(i)$$

where $i \in \{1, \ldots, c\}$ and $j \in \{1, \ldots, d\}$.

(2) The Tensor Sketch algorithm shows that the Count Sketch of the outer product of two vectors can be obtained by convolving the Count Sketches of the two feature vectors, which can be expressed as

$$\Psi(f_1 \otimes f_2, h, s) = \Psi(f_1, h_1, s_1) * \Psi(f_2, h_2, s_2)$$

where $*$ denotes the convolution operation. The convolution theorem states that convolution in the time domain is equivalent to a product in the frequency domain. Thus, the above formula can be written as

$$\Psi(f_1 \otimes f_2, h, s) = F^{-1}\big(F(\Psi(f_1, h_1, s_1)) \circ F(\Psi(f_2, h_2, s_2))\big)$$

where $F$ denotes the fast Fourier transform, $F^{-1}$ denotes the inverse Fourier transform, and $\circ$ denotes element-wise multiplication.

(3) The three groups of bilinear feature vectors obtained in the preceding steps are normalized. First, the signed square root is applied to the bilinear feature $x$ output by $\Psi$:

$$y \leftarrow \operatorname{sign}(x)\sqrt{|x|}$$

Then $\ell_2$ normalization is carried out:

$$z \leftarrow y / \lVert y \rVert_2$$
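The three steps of the simplified bilinear operation can be illustrated with a short NumPy sketch. It is a minimal example under stated assumptions, not the patented implementation: indices are 0-based (so $h_k \in \{0, \ldots, d-1\}$), the feature vectors are random stand-ins, the channel count 512 is that of the VGG-16 conv5 layers, d = 8192 is the value given in claim 3, and the helper names `make_hash`, `count_sketch`, `tensor_sketch` and `normalize` are hypothetical.

```python
import numpy as np

def make_hash(c, d, rng):
    """Draw the fixed vectors h (bucket index) and s (sign) uniformly at random."""
    h = rng.integers(0, d, size=c)        # 0-based bucket indices
    s = rng.choice([-1.0, 1.0], size=c)
    return h, s

def count_sketch(f, h, s, d):
    """Q_j = sum of s(i) * f(i) over all i with h(i) == j."""
    q = np.zeros(d)
    np.add.at(q, h, s * f)                # unbuffered add handles repeated indices
    return q

def tensor_sketch(f1, f2, h1, s1, h2, s2, d):
    """Count Sketch of the outer product f1 (x) f2 via FFT (circular convolution)."""
    q1 = count_sketch(f1, h1, s1, d)
    q2 = count_sketch(f2, h2, s2, d)
    return np.real(np.fft.ifft(np.fft.fft(q1) * np.fft.fft(q2)))

def normalize(x, eps=1e-12):
    """Signed square root followed by l2 normalization."""
    y = np.sign(x) * np.sqrt(np.abs(x))
    return y / (np.linalg.norm(y) + eps)

rng = np.random.default_rng(0)
c, d = 512, 8192                          # conv5 channels of VGG-16; d from claim 3
f1, f2 = rng.normal(size=c), rng.normal(size=c)   # stand-in feature vectors
h1, s1 = make_hash(c, d, rng)
h2, s2 = make_hash(c, d, rng)

x = tensor_sketch(f1, f2, h1, s1, h2, s2, d)      # d-dimensional bilinear feature
z = normalize(x)
```

The point of the FFT route is that the full c × c outer product (262,144 entries for c = 512) is never formed; only two d-dimensional sketches are multiplied in the frequency domain.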

FIG. 4 is a block diagram of the cross-layer simplified bilinear network. The specific implementation steps are as follows:

(1) VGG-16 is selected as the feature extractor. The output feature maps of different high-level convolutional layers are obtained from the bird image feature extraction network and denoted $f_1(x, y)$, $f_2(x, y)$, $f_3(x, y)$, where $f_1$, $f_2$ and $f_3$ correspond to the outputs of the layers conv5_1, conv5_2 and conv5_3 in the fifth convolutional group of VGG-16, respectively.

(2) Following the method of FIG. 3, the simplified bilinear operation is applied to the output feature map $f_A$ of one layer and the feature map $f_B$ of another layer, yielding three groups of bilinear feature vectors.

(3) The normalized feature vectors are concatenated and sent to a softmax classifier for classification.
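The fusion steps above can be sketched as follows. Several details are assumptions made only for illustration: the three layer pairs in `PAIRS` are hypothetical (the text does not spell out the combination at this point), the bilinear feature is sum-pooled over all spatial locations, the feature maps are small random stand-ins (with a 448 × 448 input the real conv5 maps would be 28 × 28 × 512), and the softmax classifier uses untrained random weights for the 200 classes of CUB-200-2011.

```python
import numpy as np

def sketch_matrix(c, d, rng):
    """Dense (c, d) form of a Count Sketch: row i holds s(i) at column h(i)."""
    P = np.zeros((c, d))
    P[np.arange(c), rng.integers(0, d, size=c)] = rng.choice([-1.0, 1.0], size=c)
    return P

def simplified_bilinear(fa, fb, Pa, Pb):
    """Tensor Sketch of fa(x, y) (x) fb(x, y), summed over all locations."""
    qa = fa.reshape(-1, fa.shape[-1]) @ Pa        # (H*W, d) sketches of layer A
    qb = fb.reshape(-1, fb.shape[-1]) @ Pb        # (H*W, d) sketches of layer B
    prod = np.fft.fft(qa, axis=1) * np.fft.fft(qb, axis=1)
    return np.real(np.fft.ifft(prod, axis=1)).sum(axis=0)

def normalize(x, eps=1e-12):
    """Signed square root followed by l2 normalization."""
    y = np.sign(x) * np.sqrt(np.abs(x))
    return y / (np.linalg.norm(y) + eps)

def softmax(logits):
    e = np.exp(logits - logits.max())
    return e / e.sum()

rng = np.random.default_rng(0)
H, W, C, D, NUM_CLASSES = 7, 7, 32, 64, 200      # small demo sizes, not the real ones
PAIRS = [(0, 1), (0, 2), (1, 2)]                 # hypothetical layer combination

# Stand-ins for the conv5_1, conv5_2 and conv5_3 feature maps.
fmaps = [rng.random(size=(H, W, C)) for _ in range(3)]
# Sketch matrices are drawn once and then kept fixed.
projs = [(sketch_matrix(C, D, rng), sketch_matrix(C, D, rng)) for _ in PAIRS]

parts = [normalize(simplified_bilinear(fmaps[a], fmaps[b], Pa, Pb))
         for (a, b), (Pa, Pb) in zip(PAIRS, projs)]
z = np.concatenate(parts)                        # fused feature of length 3 * D

W_cls = rng.normal(scale=0.01, size=(3 * D, NUM_CLASSES))   # untrained classifier
probs = softmax(z @ W_cls)
```

Because the sketch matrices are fixed rather than learned, the fusion itself adds no trainable parameters; only the classifier weights (and the backbone) are trained.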

FIG. 5 shows the network training method with pairwise confusion loss. The core idea of pairwise confusion is: the samples in a training batch are randomly divided into two groups that form picture pairs; if a picture pair has the same label, the cross-entropy loss is calculated directly; if a picture pair has different labels, a pairwise Euclidean loss is added to the cross-entropy loss as a regularization term. The main steps are as follows:

(1) The samples in a training batch are randomly divided into two groups $(x_1, y_1)$ and $(x_2, y_2)$.

(2) The class label vectors $\mathrm{label}(x_1)$ and $\mathrm{label}(x_2)$ of the two groups of samples are obtained.

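(3) If the two groups of samples have the same label, the cross-entropy loss is calculated directly:

$$L = L_{CE}$$

If the two groups of samples have different labels, the Euclidean pairwise confusion loss is added to the cross-entropy loss as a regularization term, namely

$$L = L_{CE} + \lambda\, D_{EC}\big(p_\theta(y \mid x_1),\, p_\theta(y \mid x_2)\big)$$

where $D_{EC}$ denotes the Euclidean distance, $L_{CE}$ denotes the cross-entropy loss, $\lambda$ is the weight of the regularization term, and $p_\theta(y \mid x_i)$ is the probability vector output by the softmax classifier.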

(4) Back-propagating the losses and updating the network parameters.

(5) Enter the next batch and jump to step (1).
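The loss computation in steps (1) to (3) can be sketched in NumPy as below. This is an illustrative sketch only: the squared Euclidean distance between the two softmax outputs is used for the confusion term, and the term is averaged over the pairs, both of which are assumptions; the default weights 20 and 1 follow claim 5; the function names are hypothetical, and back-propagation (step (4)) is left to whatever framework trains the network.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(probs, labels):
    """Mean negative log-probability of the true class."""
    return -np.log(probs[np.arange(len(labels)), labels] + 1e-12).mean()

def pairwise_confusion_loss(logits, labels, rng, lam=20.0, ce_weight=1.0):
    """Cross entropy plus a Euclidean confusion term for differently labelled pairs."""
    perm = rng.permutation(len(labels))          # step (1): random split into pairs
    half = len(labels) // 2
    probs = softmax(logits[perm])
    y = labels[perm]
    p1, p2 = probs[:half], probs[half:2 * half]
    y1, y2 = y[:half], y[half:2 * half]          # step (2): labels of both groups
    ce = cross_entropy(probs, y)
    d_ec = np.sum((p1 - p2) ** 2, axis=1)        # squared distance (assumption)
    confusion = np.mean((y1 != y2) * d_ec)       # only pairs with different labels
    return ce_weight * ce + lam * confusion      # step (3)

rng = np.random.default_rng(0)
logits = rng.normal(size=(8, 5))                 # stand-in classifier outputs
labels = np.array([0, 1, 2, 3, 4, 0, 1, 2])      # stand-in class labels
loss = pairwise_confusion_loss(logits, labels, rng)
```

When every pair shares a label the confusion term vanishes and the function reduces to the plain cross-entropy loss, which matches the first case of step (3).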

Claims

1. A fine-grained bird identification method based on a cross-layer simplified bilinear network is characterized by comprising the following steps:

(1) firstly, 5994 training pictures and 5794 testing pictures in a CUB-200-2011 data set are preprocessed, and then the processed images are input into a convolutional neural network VGG-16 to extract a feature map of a bird image;

(2) in order to take inter-layer feature interaction into account, three groups of simplified bilinear feature vectors are extracted in a cross-layer manner from the feature maps of different high-level convolutional layers obtained in step (1), normalized, concatenated and sent to a softmax classifier;

(3) the network is optimized with a cross-entropy loss assisted by a pairwise confusion loss.

2. The cross-layer bilinear feature extraction of claim 1, wherein the selected convolutional layers are the fifth group of convolutions of VGG-16, and the specific combination is

Wherein

Defined as a reduced bilinear operation.

3. The three groups of simplified bilinear feature vectors of claim 1, wherein the dimension of the output feature vector is 8192.

4. The network optimization using cross-entropy loss according to claim 1, wherein the samples in a training batch are randomly divided into two groups that form picture pairs; if a picture pair has the same label, the cross-entropy loss is calculated directly; if a picture pair has different labels, a pairwise Euclidean loss is added to the cross-entropy loss as a regularization term.

5. The network optimization of claim 4, wherein the weight of the Euclidean loss is 20 and the weight of the cross-entropy loss is 1.