CN107766890B - Improved method for discriminative image-block learning in fine-grained recognition
- Publication number: CN107766890B (application CN201711040828.XA)
- Authority: CN (China)
- Legal status: Expired - Fee Related (the legal status is an assumption, not a legal conclusion; Google has not performed a legal analysis and makes no representation as to its accuracy)
Classifications
- G06F18/241 — Physics; Computing; Electric digital data processing; Pattern recognition; Analysing; Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/214 — Physics; Computing; Electric digital data processing; Pattern recognition; Analysing; Design or setup of recognition systems or techniques; Generating training patterns; Bootstrap methods, e.g. bagging or boosting
- G06N3/045 — Physics; Computing arrangements based on specific computational models; Computing arrangements based on biological models; Neural networks; Architecture, e.g. interconnection topology; Combinations of networks
Abstract
An improved method for discriminative image-block learning in fine-grained recognition comprises the following steps. First, discriminative image blocks are extracted from the original image: a feature map is obtained by passing the original image through the convolution and pooling layers of a convolutional neural network, and the vector at each spatial position of the feature map is regarded as a detector for the image block at the corresponding position in the original image; assuming a detector with the highest response on a discriminative region has been learned, this detector is convolved with the feature map to obtain a new response map, and the position with the maximum value in the new response map identifies a discriminative image block. Second, features of the discriminative image blocks are learned and used for classification: a local saliency map is computed from the discriminative image blocks, and the image is encoded with a Fisher vector spatially weighted by this map. The method learns discriminative features better suited to the fine-grained recognition task and reduces the interference of background information within the discriminative image blocks, thereby improving classification accuracy.
Description
Technical Field
The invention relates to discriminative image-block learning in fine-grained recognition, and in particular to an improved method in which image descriptors are spatially weighted according to a response map to obtain spatially weighted Fisher vectors.
Background
In recent years, fine-grained recognition has attracted increasing attention in the field of object recognition. Its goal is to distinguish subclasses within a broad category, such as flowers, birds, dogs or cars, whose instances generally share the same overall structure; learning discriminative features from images is therefore the main task of fine-grained recognition.
In past research, fine-grained recognition has mainly involved two tasks: part localization and feature description. In addition to image-level category labels, fine-grained datasets usually provide extra annotations of object bounding boxes and parts, and many earlier methods rely to some extent on these annotations. However, fine-grained classification usually requires expert-level knowledge that ordinary annotators lack, which makes such manual labeling expensive. More recent work therefore focuses on methods that need no additional annotation. The method of the present invention requires only image category labels, without any part annotation, and learns discriminative local features in a weakly supervised manner.
For image feature description, CNN features have achieved breakthroughs on many benchmarks. Traditional methods encode local information and then fuse it into a global feature representation; CNN features, by contrast, can be learned globally and directly, without hand-designed feature extractors. Current fine-grained recognition methods are therefore based on CNNs and use additional algorithms to learn the fine, distinctive features of an image.
Disclosure of Invention
The technical problem addressed by the invention is to provide an improved method for learning discriminative image blocks in fine-grained recognition that accurately learns detailed features and discards cluttered background information inside small image blocks, thereby improving classification accuracy without relying on global features as an auxiliary cue.
The technical scheme adopted by the invention is as follows: an improved method for discriminative image-block learning in fine-grained recognition comprises the following steps:
1) extracting discriminative image blocks from the original image, comprising:
(1) obtaining a feature map of size C × H × W from the original image through the convolution and pooling layers of a convolutional neural network, where C is the number of channels, H the height and W the width; the C × 1 × 1 vector at each spatial position of the feature map is regarded as a detector for the image block at the corresponding position in the original image;
(2) assuming a detector with the highest response on a discriminative region of the original image has been learned, convolving this C × 1 × 1 detector with the C × H × W feature map to obtain a new response map of size H × W;
(3) selecting the position with the maximum value in the new H × W response map; this 1 × 1 position corresponds to a discriminative image block in the original image (a code sketch of this step is given after this list);
2) learning the features of the discriminative image blocks and using them for classification, comprising:
(1) obtaining a local saliency map from the discriminative image blocks;
(2) encoding the image with a Fisher vector spatially weighted by the local saliency map.
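As an illustration of step 1), the following sketch treats a C × 1 × 1 detector as a 1 × 1 convolution kernel over the C × H × W feature map and reads off the position of the strongest response. PyTorch and all function and variable names are assumptions of this illustration; the patent itself specifies no code.

```python
import torch
import torch.nn.functional as F

def locate_discriminative_patch(feature_map, detector):
    """feature_map: (C, H, W) tensor from the convolution/pooling layers.
    detector: (C,) vector, i.e. one C x 1 x 1 patch detector.
    Returns the (row, col) of the strongest response and the H x W response map."""
    C, H, W = feature_map.shape
    # A C x 1 x 1 convolution is a per-position dot product with the detector,
    # producing the new H x W response map of step (2).
    response = F.conv2d(feature_map.unsqueeze(0),             # (1, C, H, W)
                        detector.view(1, C, 1, 1)).squeeze()  # (H, W)
    flat_idx = torch.argmax(response).item()                  # step (3): max position
    row, col = divmod(flat_idx, W)
    return row, col, response

# toy usage with random data standing in for real conv4-3 features
fmap = torch.randn(512, 28, 28)
det = torch.randn(512)
row, col, _ = locate_discriminative_patch(fmap, det)
```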
In step (2) of step 1), the detector with the highest response on a discriminative region of the original image is learned as follows:
(1) assuming n discriminative-patch detectors per class and M classes in total, nM detectors are required;
(2) convolving each of the nM C × 1 × 1 detectors with the C × H × W feature map to obtain nM new response maps, and applying global max pooling to them to obtain an nM-dimensional feature vector;
(3) averaging, within each class, the n entries of the nM-dimensional feature vector to obtain an M-dimensional vector;
(4) feeding the M-dimensional vector into a Softmax loss function and training the C × 1 × 1 detectors with the back-propagation algorithm; after training, detectors with the highest response on the discriminative regions of the original image are obtained (a code sketch of this training procedure follows).
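A minimal training sketch of steps (1)–(4), assuming a PyTorch implementation in which a frozen backbone supplies the C × H × W feature maps; the class and variable names below are illustrative, not from the patent.

```python
import torch
import torch.nn as nn

class PatchDetectorHead(nn.Module):
    """n detectors per class, M classes: n*M detectors, each of size C x 1 x 1."""
    def __init__(self, channels, n_per_class, num_classes):
        super().__init__()
        self.n, self.M = n_per_class, num_classes
        # the nM detectors are implemented as one 1 x 1 convolution without bias
        self.detectors = nn.Conv2d(channels, n_per_class * num_classes,
                                   kernel_size=1, bias=False)

    def forward(self, feature_map):                 # (B, C, H, W)
        response = self.detectors(feature_map)      # step (2): (B, nM, H, W) response maps
        pooled = response.amax(dim=(2, 3))          # global max pooling -> (B, nM)
        # step (3): average the n responses of each class (channels assumed class-major)
        return pooled.view(-1, self.M, self.n).mean(dim=2)   # (B, M)

head = PatchDetectorHead(channels=512, n_per_class=10, num_classes=200)
criterion = nn.CrossEntropyLoss()                   # softmax loss of step (4)
optimizer = torch.optim.SGD(head.parameters(), lr=1e-3)

fmap = torch.randn(4, 512, 28, 28)                  # stand-in for backbone features
labels = torch.randint(0, 200, (4,))
loss = criterion(head(fmap), labels)
loss.backward()                                     # back-propagation trains the detectors
optimizer.step()
```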
step 2) the step (1) of obtaining the local saliency map is as follows: a local saliency map Q is computed from a saliency map S derived from the original image, as follows:
where p is the pixel of the discriminative tile, i is the detection site, and when the ith detection site contains pixel p, then Di(p) 1, otherwise Di(p) 0, s (p) the saliency map of the whole image, q (p) the local saliency map, Z a normalized constant such that maxq (p) 1.
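A small NumPy sketch of equation (1), under the assumption that each detected discriminative patch is given as a binary mask over the image pixels; function and variable names are illustrative.

```python
import numpy as np

def local_saliency_map(S, patch_masks):
    """S: (H, W) saliency map of the whole image.
    patch_masks: list of (H, W) binary masks, D_i(p) = 1 inside the
                 i-th detected discriminative patch, 0 elsewhere.
    Returns Q of equation (1), normalized so that max(Q) == 1."""
    D = np.sum(patch_masks, axis=0)      # sum_i D_i(p)
    Q = S * D                            # keep saliency only inside detected patches
    Z = Q.max()
    return Q / Z if Z > 0 else Q
```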
Step (2) of step 2) comprises the following:

Suppose I = (z_1, …, z_N) is a set of D-dimensional local feature vectors extracted from an image. The Fisher vector encoding of image I, Φ(I) = (u_1, v_1, …, u_K, v_K), stacks the accumulated mean deviations u_k and covariance deviations v_k, whose components are written as

    u_jk = (1 / (N·√π_k)) · Σ_{i=1..N} q_ik · (z_ji − μ_jk) / σ_jk                      (2)
    v_jk = (1 / (N·√(2·π_k))) · Σ_{i=1..N} q_ik · [((z_ji − μ_jk) / σ_jk)² − 1]         (3)

where j = 1, …, D indexes the vector dimensions, (μ_k, σ_k, π_k), k = 1, …, K, are the parameters of a Gaussian mixture model, and q_ik is the posterior probability (soft assignment) of vector z_i to mixture component k, i = 1, …, N.

For each vector z_i a spatial weighting term Q(p_i) is introduced, and the weighted versions of u_jk and v_jk are expressed as

    û_jk = Σ_{i=1..N} Q(p_i) · u_ijk      (4)
    v̂_jk = Σ_{i=1..N} Q(p_i) · v_ijk      (5)

where Q(p_i) is the local saliency value at the location p_i of z_i, and u_ijk, v_ijk are the contributions of the i-th vector to the sums in equations (2) and (3), respectively. Introducing these spatial weights allows the important features to be learned.
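The weighted encoding of equations (2)–(5) can be sketched in NumPy as follows; the GMM parameters and posteriors q_ik are assumed to be supplied by a separately fitted mixture model, and all names are illustrative rather than the patent's own.

```python
import numpy as np

def weighted_fisher_vector(Z_feats, Q_weights, means, sigmas, priors, posteriors):
    """Z_feats:    (N, D) local descriptors z_i
    Q_weights:  (N,)   local saliency values Q(p_i) at each descriptor location
    means, sigmas: (K, D) GMM means mu_k and standard deviations sigma_k
    priors:     (K,)   mixture weights pi_k
    posteriors: (N, K) soft assignments q_ik
    Returns the 2*K*D-dimensional spatially weighted Fisher vector."""
    N, D = Z_feats.shape
    K = means.shape[0]
    u = np.zeros((K, D))
    v = np.zeros((K, D))
    for k in range(K):
        diff = (Z_feats - means[k]) / sigmas[k]           # (N, D), standardized deviations
        w = (Q_weights * posteriors[:, k])[:, None]       # spatial weight times q_ik
        u[k] = (w * diff).sum(axis=0) / (N * np.sqrt(priors[k]))
        v[k] = (w * (diff ** 2 - 1)).sum(axis=0) / (N * np.sqrt(2 * priors[k]))
    return np.concatenate([u.ravel(), v.ravel()])
```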
The improved method for discriminative image-block learning in fine-grained recognition combines CNN features with Fisher vectors to learn discriminative features better suited to the fine-grained recognition task, and reduces the interference of background information within the discriminative image blocks, thereby improving classification accuracy. The method mainly improves on existing discriminative-patch learning: the discriminative regions detected by a learned patch detector often contain redundant background in addition to the discriminative features of the target object, so the method uses the local saliency map and Fisher vector encoding to overcome this limitation and to exploit the discriminative regions fully for the classification task. The advantages are mainly reflected in the following:
1) Novelty: CNNs are currently the most effective and popular feature representation, and the invention combines Fisher vector encoding with CNN features for this specific problem. Because learning discriminative features is the key issue in fine-grained recognition and the backgrounds in such datasets are usually very similar, introducing the Fisher vector in this way effectively reduces the interference of background information in the discriminative image blocks.
2) Effectiveness: compared with the original method, the proposed scheme based on the local saliency map and Fisher vector encoding learns local features effectively. A conventional CNN usually requires a fixed-size rectangular input that contains invalid background information; the proposed method reduces the interference of background noise, so the learned local features are more discriminative and classification accuracy is improved.
3) Practicability: the method is simple and feasible, forms an end-to-end network, and can be used effectively for fine-grained recognition.
Drawings
FIG. 1 is a flow chart of the improved method for discriminative image-block learning in fine-grained recognition according to the invention.
Detailed Description
1. An improved method for discriminative image-block learning in fine-grained recognition, characterized by comprising the following steps:
1) extracting discriminative image blocks from the original image, comprising:
(1) obtaining a feature map of size C × H × W from the original image through the convolution and pooling layers of a convolutional neural network, where C is the number of channels, H the height and W the width; the C × 1 × 1 vector at each spatial position of the feature map is regarded as a detector for the image block at the corresponding position in the original image;
(2) assuming a detector with the highest response on a discriminative region of the original image has been learned, convolving this C × 1 × 1 detector with the C × H × W feature map to obtain a new response map of size H × W;
(3) selecting the position with the maximum value in the new H × W response map; this 1 × 1 position corresponds to a discriminative image block in the original image;
2) learning the features of the discriminative image blocks and using them for classification, comprising:
(1) obtaining a local saliency map from the discriminative image blocks;
(2) encoding the image with a Fisher vector spatially weighted by the local saliency map.
2. The improved method for discriminative image-block learning in fine-grained recognition as claimed in claim 1, wherein the detector with the highest response on a discriminative region of the original image in step (2) of step 1) is learned as follows:
(1) assuming n discriminative-patch detectors per class and M classes in total, nM detectors are required;
(2) convolving each of the nM C × 1 × 1 detectors with the C × H × W feature map to obtain nM new response maps, and applying global max pooling to them to obtain an nM-dimensional feature vector;
(3) averaging, within each class, the n entries of the nM-dimensional feature vector to obtain an M-dimensional vector;
(4) feeding the M-dimensional vector into a Softmax loss function and training the C × 1 × 1 detectors with the back-propagation algorithm; after training, detectors with the highest response on the discriminative regions of the original image are obtained.
3. The improved method for discriminative image-block learning in fine-grained recognition as claimed in claim 1, wherein the local saliency map in step (1) of step 2) is obtained as follows: a local saliency map Q is computed from a saliency map S of the original image as

    Q(p) = (1/Z) · S(p) · Σ_i D_i(p)      (1)

where p is a pixel of the discriminative image block, i indexes the detection positions, D_i(p) = 1 when the i-th detection position contains pixel p and D_i(p) = 0 otherwise, S(p) is the saliency map of the whole image, Q(p) is the local saliency map, and Z is a normalization constant chosen so that max_p Q(p) = 1.
4. The improved method for discriminative image-block learning in fine-grained recognition as claimed in claim 1, wherein step (2) of step 2) comprises:

Suppose I = (z_1, …, z_N) is a set of D-dimensional local feature vectors extracted from an image. The Fisher vector encoding of image I, Φ(I) = (u_1, v_1, …, u_K, v_K), stacks the accumulated mean deviations u_k and covariance deviations v_k, whose components are written as

    u_jk = (1 / (N·√π_k)) · Σ_{i=1..N} q_ik · (z_ji − μ_jk) / σ_jk                      (2)
    v_jk = (1 / (N·√(2·π_k))) · Σ_{i=1..N} q_ik · [((z_ji − μ_jk) / σ_jk)² − 1]         (3)

where j = 1, …, D indexes the vector dimensions, (μ_k, σ_k, π_k), k = 1, …, K, are the parameters of a Gaussian mixture model, and q_ik is the posterior probability of vector z_i belonging to mixture component k, i = 1, …, N.

For each vector z_i a spatial weighting term Q(p_i) is introduced, and the weighted versions of u_jk and v_jk are expressed as

    û_jk = Σ_{i=1..N} Q(p_i) · u_ijk      (4)
    v̂_jk = Σ_{i=1..N} Q(p_i) · v_ijk      (5)

where Q(p_i) is the local saliency value at the location p_i of z_i, and u_ijk, v_ijk are the contributions of the i-th vector to the sums in equations (2) and (3), respectively; introducing these spatial weights allows the important features to be learned.
A specific example is given below with reference to FIG. 1.
FIG. 1 is a flow diagram of the architecture of the invention, which mainly comprises three parts. The method is based on the VGG-16 model, which has 16 layers. The implementation is divided into two stages: a training stage and a testing stage.
In the training stage, the parameters of the detectors are learned; the process is illustrated in FIG. 1.
(1) First, the input image is passed through the pre-trained convolutional neural network VGG-16; layer conv4-3 outputs a feature map of size 512 × 28 × 28, so each detector has size 512 × 1 × 1. With 10 detectors per class, there are 2000 detectors for the CUB200-2011 dataset;
(2) each detector is convolved with the 512 × 28 × 28 feature map to obtain a response map of size 28 × 28;
(3) after global max pooling of the response maps, a 2000-dimensional feature vector is obtained;
(4) the 10 entries belonging to each class in the 2000-dimensional vector are averaged to obtain a 200-dimensional vector, which is fed into a Softmax loss function and trained by back-propagation; detectors that extract discriminative image blocks for each class are thereby obtained (a minimal sketch of this averaging step follows this list).
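With the concrete numbers above, the class-wise averaging of step (4) amounts to a single reshape-and-mean; the PyTorch snippet below is a sketch that assumes the 2000 detector responses are ordered class by class.

```python
import torch

pooled = torch.randn(8, 2000)                  # 8 images; 200 classes x 10 detectors each
logits = pooled.view(8, 200, 10).mean(dim=2)   # average the 10 responses per class -> (8, 200)
```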
In the testing stage, also illustrated in FIG. 1, steps (1)–(3) of the training stage are repeated with the trained detectors, so that each response map is reduced to a 1 × 1 response whose position identifies the discriminative part of each picture. The local saliency map of the image is then computed from two parts, the detected local region and the saliency map of the whole image: the discriminative image block extracted from the original image is multiplied with the global saliency map according to formula (1). The local saliency map indicates the probability that a pixel belongs to the foreground, which effectively reduces background interference. Weights are then designed for the Fisher vectors of the image according to the local saliency map to obtain spatially weighted Fisher vectors; introducing these weights lets the important features of the fine-grained recognition task be learned, and the fine-grained images are finally classified.
Claims (1)
1. An improved method for discriminative image-block learning in fine-grained recognition, characterized by comprising the following steps:
1) extracting discriminative image blocks from the original image, comprising:
(1) obtaining a feature map of size C × H × W from the original image through the convolution and pooling layers of a convolutional neural network, where C is the number of channels, H the height and W the width;
(2) assuming a detector with the highest response on a discriminative region of the original image has been learned, convolving this C × 1 × 1 detector with the C × H × W feature map to obtain a new response map of size H × W; the detector with the highest response on a discriminative region of the original image is learned as follows:
(2.1) assuming n discriminative-patch detectors per class and M classes in total, nM detectors are required;
(2.2) convolving each of the nM C × 1 × 1 detectors with the C × H × W feature map to obtain nM new response maps, and applying global max pooling to them to obtain an nM-dimensional feature vector;
(2.3) averaging, within each class, the n entries of the nM-dimensional feature vector to obtain an M-dimensional vector;
(2.4) feeding the M-dimensional vector into a Softmax loss function and training the C × 1 × 1 detectors with the back-propagation algorithm; after training, detectors with the highest response on the discriminative regions of the original image are obtained;
(3) selecting the position with the maximum value in the new H × W response map; this 1 × 1 position corresponds to a discriminative image block in the original image;
2) learning the features of the discriminative image blocks and using them for classification, comprising:
(1) obtaining a local saliency map from the discriminative image blocks; the local saliency map is obtained as follows: a local saliency map Q is computed from a saliency map S of the original image as

    Q(p) = (1/Z) · S(p) · Σ_m D_m(p)      (1)

where p is a pixel of the discriminative image block, m indexes the detection positions, D_m(p) = 1 when the m-th detection position contains pixel p and D_m(p) = 0 otherwise, S(p) is the saliency map of the whole image, Q(p) is the local saliency map, and Z is a normalization constant chosen so that max_p Q(p) = 1;
(2) encoding the image with a Fisher vector spatially weighted by the local saliency map, comprising:

suppose I = (z_1, …, z_N) is a set of D-dimensional local feature vectors extracted from an image; the Fisher vector encoding of image I, Φ(I) = (u_1, v_1, …, u_K, v_K), stacks the accumulated mean deviations u_k and covariance deviations v_k, whose components are written as

    u_jk = (1 / (N·√π_k)) · Σ_{i=1..N} q_ik · (z_ji − μ_jk) / σ_jk                      (2)
    v_jk = (1 / (N·√(2·π_k))) · Σ_{i=1..N} q_ik · [((z_ji − μ_jk) / σ_jk)² − 1]         (3)

where j = 1, …, D indexes the vector dimensions, (μ_k, σ_k, π_k), k = 1, …, K, are the parameters of a Gaussian mixture model, and q_ik is the posterior probability of vector z_i belonging to mixture component k, i = 1, …, N;

for each vector z_i a spatial weighting term Q(p_i) is introduced, and the weighted versions of u_jk and v_jk are expressed as

    û_jk = Σ_{i=1..N} Q(p_i) · u_ijk      (4)
    v̂_jk = Σ_{i=1..N} Q(p_i) · v_ijk      (5)

where Q(p_i) is the local saliency value at the location p_i of z_i, and u_ijk, v_ijk are the contributions of the i-th vector to the sums in equations (2) and (3), respectively; introducing these spatial weights allows the important features to be learned.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711040828.XA CN107766890B (en) | 2017-10-31 | 2017-10-31 | Improved method for discriminant graph block learning in fine-grained identification |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711040828.XA CN107766890B (en) | 2017-10-31 | 2017-10-31 | Improved method for discriminant graph block learning in fine-grained identification |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107766890A CN107766890A (en) | 2018-03-06 |
CN107766890B true CN107766890B (en) | 2021-09-14 |
Family
ID=61271840
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201711040828.XA Expired - Fee Related CN107766890B (en) | 2017-10-31 | 2017-10-31 | Improved method for discriminant graph block learning in fine-grained identification |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107766890B (en) |
Families Citing this family (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR102102161B1 (en) * | 2018-05-18 | 2020-04-20 | 오드컨셉 주식회사 | Method, apparatus and computer program for extracting representative feature of object in image |
CN110019915B (en) * | 2018-07-25 | 2022-04-12 | 北京京东尚科信息技术有限公司 | Method and device for detecting picture and computer readable storage medium |
CN109409384A (en) * | 2018-09-30 | 2019-03-01 | 内蒙古科技大学 | Image-recognizing method, device, medium and equipment based on fine granularity image |
CN109815973A (en) * | 2018-12-07 | 2019-05-28 | 天津大学 | A kind of deep learning method suitable for the identification of fish fine granularity |
CN109948628B (en) * | 2019-03-15 | 2023-01-03 | 中山大学 | Target detection method based on discriminant region mining |
CN110197202A (en) * | 2019-04-30 | 2019-09-03 | 杰创智能科技股份有限公司 | A kind of local feature fine granularity algorithm of target detection |
CN110309858B (en) * | 2019-06-05 | 2022-07-01 | 大连理工大学 | Fine-grained image classification method based on discriminant learning |
CN110363233B (en) * | 2019-06-28 | 2021-05-28 | 西安交通大学 | Fine-grained image recognition method and system of convolutional neural network based on block detector and feature fusion |
CN110796183A (en) * | 2019-10-17 | 2020-02-14 | 大连理工大学 | Weak supervision fine-grained image classification algorithm based on relevance-guided discriminant learning |
CN111062438B (en) * | 2019-12-17 | 2023-06-16 | 大连理工大学 | Image propagation weak supervision fine granularity image classification algorithm based on correlation learning |
CN112927221B (en) * | 2020-12-09 | 2022-03-29 | 广州市玄武无线科技股份有限公司 | Image fine-grained feature-based reproduction detection method and system |
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104573744A (en) * | 2015-01-19 | 2015-04-29 | 上海交通大学 | Fine granularity classification recognition method and object part location and feature extraction method thereof |
WO2016168235A1 (en) * | 2015-04-17 | 2016-10-20 | Nec Laboratories America, Inc. | Fine-grained image classification by exploring bipartite-graph labels |
CN106778804A (en) * | 2016-11-18 | 2017-05-31 | 天津大学 | The zero sample image sorting technique based on category attribute transfer learning |
CN106778807A (en) * | 2016-11-22 | 2017-05-31 | 天津大学 | The fine granularity image classification method of dictionary pair is relied on based on public dictionary pair and class |
CN106778705A (en) * | 2017-02-04 | 2017-05-31 | 中国科学院自动化研究所 | A kind of pedestrian's individuality dividing method and device |
CN106951872A (en) * | 2017-03-24 | 2017-07-14 | 江苏大学 | A kind of recognition methods again of the pedestrian based on unsupervised depth model and hierarchy attributes |
Non-Patent Citations (3)
Title |
---|
Fully Convolutional Attention Localization Networks: Efficient Attention Localization for Fine-Grained Recognition; Xiao Liu et al.; arXiv; 2016-04-04; pp. 1-10 * |
Picking Deep Filter Responses for Fine-grained Image Recognition; Xiaopeng Zhang et al.; 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR); 2016-12-12; sections 1-5 * |
Fine-grained image classification with top-down attention map segmentation; 冯语姗 et al.; Journal of Image and Graphics (中国图象图形学报); 2016-09-30; pp. 1147-1154 * |
Also Published As
Publication number | Publication date |
---|---|
CN107766890A (en) | 2018-03-06 |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| GR01 | Patent grant | |
| CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20210914 |