CN112418351B - Zero sample learning image classification method based on global and local context sensing - Google Patents

Zero sample learning image classification method based on global and local context sensing

Info

Publication number
CN112418351B
CN112418351B (application CN202011460544.8A / CN202011460544A)
Authority
CN
China
Prior art keywords
global
feature
local
feature map
layer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011460544.8A
Other languages
Chinese (zh)
Other versions
CN112418351A (en)
Inventor
王国威
陶文源
管乃洋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University
Original Assignee
Tianjin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University filed Critical Tianjin University
Priority to CN202011460544.8A priority Critical patent/CN112418351B/en
Publication of CN112418351A publication Critical patent/CN112418351A/en
Application granted granted Critical
Publication of CN112418351B publication Critical patent/CN112418351B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/047Probabilistic or stochastic networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/44Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a zero sample learning image classification method based on global and local context sensing, which comprises the following steps: extracting features from the image with a deep neural network to obtain multi-layer feature maps; applying global attention to a feature map of any layer to obtain a feature map containing global information; applying local attention to the feature map of the same layer to obtain a feature vector representing local information; passing the last-layer global feature map through a fully connected layer to obtain a global feature vector; adding the multiple groups of local feature vectors element by element to obtain the complete local feature vector; concatenating the complete local feature vector with the global feature vector, projecting the result into the semantic space and the latent feature space, and performing parameter optimization with a softmax loss and a triplet loss respectively; and repeating the above steps for a number of training epochs to obtain a zero sample learning model with strong representation capability, and classifying images with the trained zero sample learning model.

Description

Zero sample learning image classification method based on global and local context sensing
Technical Field
The invention relates to the field of image classification, in particular to a zero sample learning image classification method based on global and local context sensing.
Background
Deep learning has developed rapidly, and its applications have been deployed in many fields (computer vision, natural language processing, etc.), because deep learning can exploit massive amounts of data for model training and thereby achieve strong recognition capability. However, the training samples may not cover all classes. In particular, real-world data inherently follows a long-tail distribution: only a few common classes provide a large number of samples, while for most uncommon classes only a very limited number of samples can be collected. Reflected in deep learning, this means that a model can reach satisfactory recognition accuracy for common classes, for which training samples are abundant, while for uncommon classes its recognition ability is fundamentally weaker; for classes with no training samples at all, the recognition ability is zero. In practice, however, a model must not only obtain strong recognition ability from the collected data but also be able to recognize entirely new categories for which no training samples exist. New categories such as new species and new models of electronic devices appear every day, so the ability to recognize unseen categories is a key turning point in the development of deep learning systems, and this task of recognizing unseen categories can be addressed by zero sample learning.
Zero sample learning is a deep learning technique that mimics the recognition ability of the human brain. Lampert notes that humans can recognize roughly 30,000 basic classes, as well as fine-grained subclasses of these classes. Beyond recognizing categories they have seen and using that knowledge to identify fine-grained subcategories, humans can also recognize entirely new categories or concepts; for example, a person who has never seen a zebra can accurately identify one on first sight from the description "looks like a horse with black and white stripes".
In the zero sample learning image classification task, the model can only use images from known (seen) classes during training, yet it must identify the classes of images from unknown (unseen) classes. This is possible because high-level semantic descriptions of object characteristics, such as attributes, are used, and the assumption that seen and unseen classes share the same set of attributes links the two. In general, zero sample learning proceeds as follows: in the training phase, the model learns a visual-semantic mapping; in the inference phase, an image of an unseen class is first converted into a semantic vector using the learned mapping, this vector is then compared with the ground-truth attribute vectors of the unseen classes, and the closest class is selected as the prediction.
Existing zero sample learning algorithms can be divided into two categories according to whether new training data is generated during the training phase: generative (model-based) algorithms and compatibility-based algorithms. The first type generates images or features from the semantic descriptions of unseen classes and trains on them together with the existing seen-class images in the conventional supervised manner. However, such methods have several shortcomings: the generated unseen-class images cannot restore fine details, the generated unseen-class features lack interpretability, and these methods ignore the importance of information-rich visual regions in the image. The second category directly uses semantic knowledge to learn a visual-semantic mapping by aligning the visual space with the semantic space. Most compatibility-based models focus on how to mine the discriminative local information of the object itself and how to better align the two spaces, but they ignore the positive contribution of global information to the zero sample learning task.
Disclosure of Invention
The invention provides a zero sample learning image classification method based on global and local context sensing, which considers global features and local features simultaneously, enhances the expressive power of the learned mapping, and thereby improves the performance of the zero sample learning model, as described in detail below:
a zero sample learning image classification method based on global and local context sensing comprises the following steps:
performing feature extraction on the image by using a deep neural network to obtain a multilayer feature map;
applying global attention to a feature map of any layer to obtain a feature map containing global information; applying local attention to the feature map of the same layer to obtain a feature vector representing local information;
obtaining a global feature vector from the last-layer global feature map through a fully connected layer; adding the multiple groups of local feature vectors element by element to obtain the complete local feature vector;
concatenating the complete local feature vector with the global feature vector, projecting the result into the semantic space and the latent feature space, and performing parameter optimization with a softmax loss and a triplet loss respectively;
and repeating the above steps for a number of training epochs to obtain a zero sample learning model with strong representation capability, and classifying images through the trained zero sample learning model.
The step of applying global attention to a feature map of any layer to obtain the feature map containing global information specifically includes:
obtaining the spatial self-attention weight matrix $A$, and using it to weight the re-dimensioned value feature $\tilde{V}$ to obtain the weighted feature $\tilde{X}_{w}$; adding the re-dimensioned feature $\tilde{X}$ to the weighted feature through a residual link to obtain $\tilde{X}_{out}$; re-dimensioning the obtained $\tilde{X}_{out}$ back to the size of the original feature map to obtain $X'$; and inputting $X'$ as a new feature map into the next layer of the neural network, applying the same operation to the multi-layer feature maps so that the global context information is transmitted to the last layer.
Further, the spatial self-attention module weight matrix is specifically:
$$A = \mathrm{softmax}_{col}\big(\tilde{Q}^{T}\tilde{K}\big)$$

wherein $\mathbb{R}$ denotes the dimension information of a variable, $\mathrm{softmax}_{col}$ computes the softmax score column by column, $\tilde{Q} \in \mathbb{R}^{C' \times L}$ is the re-dimensioned query feature, $\tilde{K} \in \mathbb{R}^{C' \times L}$ is the re-dimensioned key feature, $T$ denotes transposition, $L = H \times W$ is the product of the length and the width of the feature map, and $\tilde{X} \in \mathbb{R}^{C \times L}$ is the re-dimensioned feature map.
Wherein the weighted values
$\tilde{X}_{w}$ is obtained as follows:

$$\tilde{X}_{w} = \alpha \tilde{V} A$$

$$\tilde{X}_{out} = \tilde{X}_{w} + \tilde{X}$$

wherein $\alpha$ is a balance factor, $C$ is the number of channels of the feature map, and $\tilde{V} \in \mathbb{R}^{C \times L}$ is the re-dimensioned value feature map.
Further, the step of applying local attention to the feature map of the same layer to obtain the feature vector representing local information specifically includes:
computing a transformation with the spatial transformer and performing matrix multiplication with the original feature map to obtain a plurality of corresponding regions $R_{s}$, and extracting features from each region $R_{s}$ with an Inception block:
processing the extracted features $IR$ with global max pooling and global average pooling; processing the $IR_{l}$ obtained from the plurality of regions by element-wise addition to obtain the feature that finally represents the local regions; and respectively learning the visual-semantic mapping and the visual-latent mapping, followed by concatenation.
The technical scheme provided by the invention has the following beneficial effects:
1. by training directly on the original image samples, the method makes the model better adapted to the zero sample learning classification task;
2. the invention uses a global attention module to extract global context information from the original feature map and generate a feature map containing global information, so that the extracted global features have strong expressive power and the model's global understanding of the object is enhanced;
3. the method uses a local attention module to extract local context information from the original feature map to obtain local feature vectors, applies the same steps to multiple feature maps, and finally sums the local feature vectors to obtain the complete local feature vector, enhancing the model's local understanding of the object;
4. the method obtains a complete feature representation by feature concatenation, taking both global and local information into account, which greatly improves the representation capability and accuracy of the model;
5. the method projects the image features into the semantic space and the latent space simultaneously, and uses a softmax loss and a triplet loss respectively to optimize and update the parameters.
Drawings
FIG. 1 is a flow chart of the zero sample learning image classification method based on global and local context sensing;
FIG. 2 is a schematic diagram of a global attention module;
FIG. 3 is a schematic diagram of the spatial transformer;
FIG. 4 is a schematic diagram of the Inception network.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention are described in further detail below.
Example 1
A zero sample learning image classification method based on global and local context sensing, referring to FIG. 1, comprises the following steps:
101: carrying out feature extraction on the image by using a deep neural network to obtain a multilayer feature map;
102: calculating any layer of feature map by using global attention to obtain a feature map containing global information;
103: calculating the feature map of the same layer with local attention to obtain a feature vector representing local information;
104: repeating the operations of steps 102 and 103 for multiple layers to obtain a plurality of global feature maps and local feature vectors;
105: obtaining a global feature vector from the last layer of global feature map through a full connection layer; performing element-by-element addition on the multiple groups of local feature vectors to obtain complete local feature vectors;
106: concatenating the complete local feature vector with the global feature vector, projecting the result into the semantic (attribute) space and the latent feature space simultaneously, and performing parameter optimization with a softmax loss and a triplet loss respectively;
107: repeating the above steps for a number of training epochs to finally obtain a zero sample learning model with strong representation capability, and classifying images through the trained zero sample learning model.
In summary, in the embodiment of the present invention, the feature maps extracted from the image by the deep neural network are processed with global attention to obtain new feature maps containing global information, and local features are obtained by computing local attention on each feature map; multiple groups of feature maps are processed in this way, the resulting features are fused, and the fused features are projected into the semantic (attribute) space and the latent feature space simultaneously. In this way the learned features are enhanced, the expressive power of the learned mapping is improved, and the classification accuracy of the model is increased.
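For illustration only, the following PyTorch-style sketch shows how the forward pass of steps 101-106 could be wired together. The class and module names (GLCANet, GlobalAttention, LocalAttention, the backbone split into stages) and the use of lazy linear layers are assumptions made for this sketch, not the reference implementation of the patent.

```python
import torch
import torch.nn as nn

class GLCANet(nn.Module):
    """Sketch of the global-and-local context pipeline (steps 101-106)."""
    def __init__(self, backbone_stages, global_attns, local_attns, attr_dim, latent_dim):
        super().__init__()
        self.stages = nn.ModuleList(backbone_stages)      # multi-layer feature maps (step 101)
        self.global_attns = nn.ModuleList(global_attns)   # one global-attention module per stage (step 102)
        self.local_attns = nn.ModuleList(local_attns)     # one local-attention module per stage (step 103)
        self.fc_global = nn.LazyLinear(attr_dim)          # global feature vector from the last feature map (step 105)
        self.to_semantic = nn.LazyLinear(attr_dim)        # projection into the semantic (attribute) space (step 106)
        self.to_latent = nn.LazyLinear(latent_dim)        # projection into the latent feature space (step 106)

    def forward(self, x):
        local_vecs = []
        for stage, g_attn, l_attn in zip(self.stages, self.global_attns, self.local_attns):
            x = stage(x)                    # feature map of this stage
            local_vecs.append(l_attn(x))    # local feature vector of this stage (step 103)
            x = g_attn(x)                   # feature map enriched with global context, fed to the next stage (step 102)
        z_local = torch.stack(local_vecs).sum(dim=0)      # element-wise sum -> complete local feature vector (step 105)
        z_global = self.fc_global(x.flatten(1))           # global feature vector from the last-layer feature map
        z = torch.cat([z_local, z_global], dim=1)         # concatenation of local and global features (step 106)
        return self.to_semantic(z), self.to_latent(z)     # inputs to the softmax and triplet losses
```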
Example 2
The scheme of example 1 is further described below with reference to specific calculation formulas and examples, which are described in detail below:
first, the basic setup is introduced:
The training set $D_{s} = \{(x_{i}^{s}, y_{i}^{s})\}_{i=1}^{N_{s}}$ contains $N_{s}$ samples, where $x_{i}^{s}$ is the $i$-th image of a known (seen) class and $y_{i}^{s} \in Y_{s}$ is its corresponding class label. The test set $D_{u} = \{(x_{j}^{u}, y_{j}^{u})\}_{j=1}^{N_{u}}$ contains $N_{u}$ samples, where $x_{j}^{u}$ is the $j$-th image of an unknown (unseen) class and $y_{j}^{u} \in Y_{u}$ is its corresponding class label. The semantic features of the seen and unseen classes are denoted $A_{s}$ and $A_{u}$ respectively. The seen and unseen classes are disjoint: $Y_{s} \cap Y_{u} = \emptyset$ and $Y_{s} \cup Y_{u} = Y$. The projection of a visual feature into the semantic space is written $\phi(x) = \theta(x)^{T} W$, where $\theta(x)$ is the visual feature extracted by the deep neural network, $W$ is a transformation matrix, and $T$ denotes transposition. $\sigma(x)$ denotes the projection of the visual feature into the latent space.
In zero sample learning, only seen-class images and their semantic features (attributes) are available in the training phase, and the model must acquire the ability to predict unseen classes by learning a visual-semantic mapping or a visual-latent feature mapping.
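As a minimal illustration of this setup (all names below are assumptions for the sketch, not the patent's code), the two projections $\phi(x) = \theta(x)^{T} W$ and $\sigma(x)$ can be realised as linear heads over the backbone features, with disjoint seen and unseen label sets:

```python
import torch
import torch.nn as nn

class Projections(nn.Module):
    """phi(x) = theta(x)^T W into the semantic space, sigma(x) into the latent space."""
    def __init__(self, visual_dim: int, attr_dim: int, latent_dim: int):
        super().__init__()
        self.W = nn.Linear(visual_dim, attr_dim, bias=False)        # transformation matrix W
        self.W_lat = nn.Linear(visual_dim, latent_dim, bias=False)  # latent-space projection

    def forward(self, theta_x):                        # theta_x: visual features from the deep neural network
        return self.W(theta_x), self.W_lat(theta_x)    # phi(x), sigma(x)

# seen (training) and unseen (test) label sets are disjoint: Y_s ∩ Y_u = ∅
seen_classes, unseen_classes = {0, 1, 2}, {3, 4}
assert seen_classes.isdisjoint(unseen_classes)
```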
1. Global context information extraction
Convolutional layers are a core component of deep neural networks, but they are limited by the size of their convolution kernels, so the features extracted by a deep neural network inevitably contain mostly local information. For computer vision tasks such as image classification, image segmentation and object detection, however, extracting more global features is key to improving the representation capability of the model. Introducing global information into some layers can alleviate the limitation imposed by the convolution kernel size and improve the performance of the deep neural network. The key is therefore to be able to extract global information from the image.
The global self-attention module was first used in natural language processing tasks and has subsequently been widely applied in computer vision. Specifically, global self-attention can be computed as follows:
For an input feature map $X \in \mathbb{R}^{C \times H \times W}$, a set of convolutions with kernel size 1 × 1 first generates the query feature $Q$, the key feature $K$, the value feature $V$ and a re-dimensioned feature $\tilde{X} \in \mathbb{R}^{C \times L}$, where $Q, K \in \mathbb{R}^{C' \times H \times W}$, $C'$ denotes the reduced number of feature-map channels, $L = H \times W$, $\mathbb{R}$ denotes the dimension information of a variable, $C$ is the number of channels of the feature map, $H$ its length and $W$ its width.

$Q$ and $K$ are then re-dimensioned to obtain $\tilde{Q}, \tilde{K} \in \mathbb{R}^{C' \times L}$, and $V$ is re-dimensioned to obtain $\tilde{V} \in \mathbb{R}^{C \times L}$.
Then the spatial self-attention module weight matrix at this time can be expressed as:
$$A = \mathrm{softmax}_{col}\big(\tilde{Q}^{T}\tilde{K}\big) \qquad (1)$$
The obtained weight matrix is then used to weight the value feature $\tilde{V}$:

$$\tilde{X}_{w} = \alpha \tilde{V} A \qquad (2)$$
Wherein α is a balance factor.
To prevent loss of the original information, a residual link is adopted and the re-dimensioned feature $\tilde{X}$ is added to the weighted feature to obtain:

$$\tilde{X}_{out} = \tilde{X}_{w} + \tilde{X} \qquad (3)$$
Finally, the obtained $\tilde{X}_{out}$ is re-dimensioned back to the same size as the original feature map to give $X' \in \mathbb{R}^{C \times H \times W}$, and $X'$ is input as a new feature map into the next layer of the neural network. By applying the same operation at multiple layers of feature maps, the global context information can be transferred to the last layer.
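A minimal sketch of this global self-attention module, following formulas (1)-(3), is given below; the 1 × 1 convolutions and the balance factor come from the description above, while the exact initialisation and channel-reduction ratio are assumptions for illustration.

```python
import torch
import torch.nn as nn

class GlobalAttention(nn.Module):
    """Spatial self-attention of formulas (1)-(3): weight, residual link, reshape."""
    def __init__(self, channels: int, reduced: int):
        super().__init__()
        self.query = nn.Conv2d(channels, reduced, kernel_size=1)   # Q with C' channels
        self.key = nn.Conv2d(channels, reduced, kernel_size=1)     # K with C' channels
        self.value = nn.Conv2d(channels, channels, kernel_size=1)  # V with C channels
        self.alpha = nn.Parameter(torch.zeros(1))                  # balance factor alpha

    def forward(self, x):
        b, c, h, w = x.shape
        L = h * w
        q = self.query(x).view(b, -1, L)                     # re-dimensioned query (B, C', L)
        k = self.key(x).view(b, -1, L)                       # re-dimensioned key   (B, C', L)
        v = self.value(x).view(b, c, L)                      # re-dimensioned value (B, C,  L)
        attn = torch.softmax(q.transpose(1, 2) @ k, dim=1)   # formula (1): column-wise softmax of Q^T K
        weighted = self.alpha * (v @ attn)                    # formula (2): weight the value feature
        out = weighted + x.view(b, c, L)                      # formula (3): residual link with the re-dimensioned input
        return out.view(b, c, h, w)                           # reshape back to the original feature-map size
```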
2. Local context information extraction
The local attention module also takes a layer's feature map $X \in \mathbb{R}^{C \times H \times W}$ as input and outputs a local feature vector $Z \in \mathbb{R}^{k \times 1}$, where $k$ matches the dimension of the attribute feature. The module consists of three sub-modules: a spatial transformer, an Inception block, and global max/average pooling. The spatial transformer can be represented as a function $ST(\cdot)$ whose role is to help the network explicitly learn spatial invariance and translation invariance and to extend its range to affine and more general transformations. This means that the spatial transformer can learn a transformation that rectifies an object which has undergone an affine transformation:

$$\theta_{l} = \begin{bmatrix} r_{h} & 0 & t_{x} \\ 0 & r_{w} & t_{y} \end{bmatrix} \qquad (4)$$

where $(t_{x}, t_{y})$ are the two-dimensional spatial (translation) coordinates, $(r_{h}, r_{w})$ are the scale transformation factors, and $l$ corresponds to the feature map of the $l$-th layer. A plurality of corresponding regions is obtained by computing the spatial transformer output and matrix-multiplying it with the original feature map:
$$R_{s} = ST_{l}(X) \qquad (5)$$
For each extracted region $R_{s}$, features are extracted with the Inception block:
$$IR = \mathrm{Inception}(R_{s}) \qquad (6)$$
The extracted features $IR$ are then processed with global average pooling (GAP) and global max pooling (GMP) respectively:
$$IR_{l} = GAP(IR) + GMP(IR) \qquad (7)$$
The features obtained at this point encode the important information of the local region. The $IR_{l}$ obtained from the multiple regions are processed by element-wise addition to obtain the feature that finally represents the local regions:

$$Z = \sum_{l} IR_{l} \qquad (8)$$
The model needs to learn two mappings, a visual-semantic mapping and a visual-latent mapping, corresponding to the two mapping matrices $W_{a}$ and $W_{b}$ respectively. For computational convenience, $Z$ is concatenated with itself so that its dimension becomes $2k$.
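A sketch of this local-attention branch, following formulas (4)-(7), is shown below. The localisation head that predicts $(r_{h}, r_{w}, t_{x}, t_{y})$ and the simplified Inception-style block are assumptions for illustration, and the warped region is obtained with an affine grid rather than an explicit matrix multiplication.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocalAttention(nn.Module):
    """Spatial transformer + Inception-style block + GAP/GMP, formulas (4)-(7)."""
    def __init__(self, channels: int, k: int):
        super().__init__()
        # localisation network predicting the parameters of formula (4)
        self.loc = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(channels, 4))
        # stand-in "Inception" block: parallel 1x1 / 3x3 / 5x5 convolutions, concatenated
        self.branch1 = nn.Conv2d(channels, k // 2, kernel_size=1)
        self.branch3 = nn.Conv2d(channels, k // 4, kernel_size=3, padding=1)
        self.branch5 = nn.Conv2d(channels, k - k // 2 - k // 4, kernel_size=5, padding=2)

    def forward(self, x):
        b = x.size(0)
        params = self.loc(x)
        r_h, r_w = params[:, 0].sigmoid(), params[:, 1].sigmoid()   # scale factors in (0, 1)
        t_x, t_y = params[:, 2].tanh(), params[:, 3].tanh()         # translations in (-1, 1)
        theta = torch.zeros(b, 2, 3, device=x.device)               # formula (4): scale + translation affine matrix
        theta[:, 0, 0], theta[:, 0, 2] = r_h, t_x
        theta[:, 1, 1], theta[:, 1, 2] = r_w, t_y
        grid = F.affine_grid(theta, x.size(), align_corners=False)
        region = F.grid_sample(x, grid, align_corners=False)        # formula (5): region R_s cropped from X
        ir = torch.cat([self.branch1(region), self.branch3(region), self.branch5(region)], dim=1)  # formula (6)
        gap = F.adaptive_avg_pool2d(ir, 1).flatten(1)
        gmp = F.adaptive_max_pool2d(ir, 1).flatten(1)
        return gap + gmp                                             # formula (7): per-layer local vector IR_l
```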
3. Visual-semantic mapping and visual-latent mapping
The deep neural network is divided into multiple stages of feature maps according to their receptive field sizes. The global attention module extracts global context information from each feature map to obtain a new feature map that replaces the original one as the input of the next stage of the network, so the feature vector obtained at the last layer contains the global context information. This last-layer feature vector is then projected into the semantic space and the latent space through fully connected layers, producing two mappings: the visual-semantic mapping and the visual-latent mapping. The visual-semantic mapping is optimized with a softmax loss function, and the visual-latent mapping with a triplet loss function. The advantage is that the interpretability of the attributes is preserved while the discriminability of the latent attributes is also exploited.
For the visual-semantic mapping, let $a_{y}$ denote the semantic feature of category $y$; the compatibility score can then be expressed as:

$$s(x, y) = \theta(x)^{T} W_{a}\, a_{y} \qquad (9)$$
where $\theta(x)$ represents the visual feature and $W_{a}$ the visual-semantic mapping matrix to be learned. Treating the compatibility scores $s$ as the logits of a softmax, the softmax loss can be expressed as:

$$L_{att} = -\frac{1}{N_{s}} \sum_{i=1}^{N_{s}} \log p\big(y_{i} \mid x_{i}\big) \qquad (10)$$

where

$$p\big(y_{i} \mid x_{i}\big) = \frac{\exp\big(s(x_{i}, y_{i})\big)}{\sum_{y \in Y_{s}} \exp\big(s(x_{i}, y)\big)} \qquad (11)$$
for visual-implicit mapping, triple loss is adopted to minimize intra-class distance and maximize inter-class distance, so as to obtain implicit features with discriminativity:
Figure BDA0002831416280000076
where $x_{i}, x_{j}, x_{k}$ denote the anchor, positive and negative samples respectively, and $mrg$ is the margin, set to 1.0. Combining the visual-semantic mapping loss and the visual-latent mapping loss, the overall loss function can be expressed as:
$$L = L_{att} + \alpha L_{lat} \qquad (13)$$
where α is a balance factor and is set to 1.0.
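The two objectives of formulas (9)-(13) can be sketched as follows; the in-batch triplet construction shown here (the first same-class and different-class samples found in the batch) is an assumption, since the mining strategy is not specified above.

```python
import torch
import torch.nn.functional as F

def compatibility_scores(theta_x, W_a, seen_attrs):
    """s(x, y) = theta(x)^T W_a a_y for every seen class y (formula (9))."""
    return theta_x @ W_a @ seen_attrs.t()                 # (batch, num_seen_classes)

def total_loss(theta_x, sigma_x, labels, W_a, seen_attrs, alpha=1.0, margin=1.0):
    scores = compatibility_scores(theta_x, W_a, seen_attrs)
    l_att = F.cross_entropy(scores, labels)               # softmax loss, formulas (10)-(11)
    # naive in-batch triplets: first positive / negative found for each anchor
    anchors, positives, negatives = [], [], []
    for i in range(len(labels)):
        same = (labels == labels[i]).nonzero().flatten()
        diff = (labels != labels[i]).nonzero().flatten()
        pos = same[same != i]
        if len(pos) == 0 or len(diff) == 0:
            continue
        anchors.append(sigma_x[i])
        positives.append(sigma_x[pos[0]])
        negatives.append(sigma_x[diff[0]])
    l_lat = torch.zeros((), device=theta_x.device)
    if anchors:                                            # triplet loss, formula (12)
        l_lat = F.triplet_margin_loss(torch.stack(anchors), torch.stack(positives),
                                      torch.stack(negatives), margin=margin)
    return l_att + alpha * l_lat                           # overall loss, formula (13)
```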
4. Zero sample learning prediction
Since the visual-semantic mapping and the visual-latent feature mapping are learned simultaneously in the training phase, the testing phase treats them correspondingly. For the visual-semantic mapping, given a test image $x$ whose projection in the semantic space is $\phi(x)$, the goal is to assign it a class label:

$$\hat{y} = \arg\max_{u \in Y_{u}} \phi(x)^{T} a_{u} \qquad (14)$$
For the visual-latent feature mapping, given a test image $x$ whose projection in the latent space is $\sigma(x)$, the prototype of each seen class $s$ is taken as the mean of the latent features of its training samples:

$$c_{s} = \frac{1}{\left|\{i : y_{i} = s\}\right|} \sum_{i :\, y_{i} = s} \sigma\big(x_{i}\big) \qquad (15)$$
For an unseen class $u$, its relationship to all the seen classes in the semantic space is first computed, i.e. the coefficients $\beta_{us}$ that express its semantic feature in terms of the seen-class semantic features:

$$a_{u} = \sum_{s \in Y_{s}} \beta_{us}\, a_{s} \qquad (16)$$
It is assumed that the unseen class $u$ shares the same relationship in the latent space as in the semantic space:

$$c_{u} = \sum_{s \in Y_{s}} \beta_{us}\, c_{s} \qquad (17)$$
The prediction of the entire hybrid model can then be expressed as:

$$\hat{y} = \arg\max_{u \in Y_{u}} \Big[ s\big(\phi(x), a_{u}\big) + s\big(\sigma(x), c_{u}\big) \Big] \qquad (18)$$
where s (·, ·) is a compatibility function.
in the embodiment of the present invention, except for the specific description of the model of each device, the model of other devices is not limited, as long as the device can perform the above functions.
Those skilled in the art will appreciate that the drawings are only schematic illustrations of preferred embodiments, and that the serial numbers of the above-described embodiments of the present invention are for description only and do not indicate the relative merits of the embodiments.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims (3)

1. A zero sample learning image classification method based on global and local context sensing is characterized by comprising the following steps:
1) Carrying out feature extraction on the image by using a deep neural network to obtain a multilayer feature map;
2) Calculating any layer of feature map by using global attention to obtain a feature map containing global information;
3) Calculating the feature map of the same layer by using local attention to obtain a feature vector representing local information;
4) Repeating the operations of the step 2) and the step 3) for multiple layers to obtain a plurality of global feature maps and local feature vectors;
5) Obtaining a global feature vector from the last layer of global feature map through a full connection layer; performing element-by-element addition on the multiple groups of local feature vectors to obtain complete local feature vectors;
concatenating the complete local feature vector and the global feature vector, projecting the result to a semantic space and a latent feature space, and performing parameter optimization by respectively adopting a softmax loss and a triplet loss;
repeating the above steps for a plurality of training epochs to obtain a zero sample learning model with strong representation capability, and classifying the images through the trained zero sample learning model;
the calculating any layer of feature map by using global attention to obtain the feature map containing global information specifically comprises:
obtaining the spatial self-attention weight matrix $A$, and weighting the re-dimensioned value feature $\tilde{V}$ with the obtained weight matrix to obtain the weighted feature $\tilde{X}_{w}$; adding the re-dimensioned feature $\tilde{X}$ to the weighted feature through a residual link to obtain $\tilde{X}_{out}$; re-dimensioning the obtained $\tilde{X}_{out}$ back to the size of the original feature map to obtain $X'$; inputting $X'$ as a new feature map into the next layer of the neural network, adopting the same operation in the plurality of layers of feature maps, and transmitting the global context information to the last layer;
the method for calculating the feature map of the same layer by using local attention to obtain the feature vector representing the local information specifically comprises the following steps:
computing a transformation with the spatial transformer and performing matrix multiplication with the original feature map to obtain a plurality of corresponding regions $R_{s}$, and extracting features from each region $R_{s}$ with an Inception block;
processing the extracted features $IR$ with global max pooling and global average pooling; processing the $IR_{l}$ obtained from the plurality of regions by element-wise addition to obtain the feature that finally represents the local regions; and respectively learning a visual-semantic mapping and a visual-latent mapping, followed by concatenation.
2. The method according to claim 1, wherein the spatial self-attention module weight matrix is specifically:
$$A = \mathrm{softmax}_{col}\big(\tilde{Q}^{T}\tilde{K}\big)$$

wherein $\mathbb{R}$ denotes the dimension information of a variable, $\mathrm{softmax}_{col}$ computes the softmax score column by column over the matrix, $\tilde{Q} \in \mathbb{R}^{C' \times L}$ is the re-dimensioned query feature, $\tilde{K} \in \mathbb{R}^{C' \times L}$ is the re-dimensioned key feature, $T$ denotes transposition, and $L = H \times W$ is the product of the length and the width of the feature map.
3. The zero sample learning image classification method based on global and local context sensing according to claim 2, wherein the weighted feature $\tilde{X}_{w}$ is obtained as follows:

$$\tilde{X}_{w} = \alpha \tilde{V} A$$

$$\tilde{X}_{out} = \tilde{X}_{w} + \tilde{X}$$

wherein $\alpha$ is a balance factor, $C$ is the number of channels of the feature map, and $\tilde{V} \in \mathbb{R}^{C \times L}$ is the re-dimensioned value feature map.
CN202011460544.8A 2020-12-11 2020-12-11 Zero sample learning image classification method based on global and local context sensing Active CN112418351B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011460544.8A CN112418351B (en) 2020-12-11 2020-12-11 Zero sample learning image classification method based on global and local context sensing

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011460544.8A CN112418351B (en) 2020-12-11 2020-12-11 Zero sample learning image classification method based on global and local context sensing

Publications (2)

Publication Number Publication Date
CN112418351A CN112418351A (en) 2021-02-26
CN112418351B true CN112418351B (en) 2023-04-07

Family

ID=74775587

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011460544.8A Active CN112418351B (en) 2020-12-11 2020-12-11 Zero sample learning image classification method based on global and local context sensing

Country Status (1)

Country Link
CN (1) CN112418351B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113298091A (en) * 2021-05-25 2021-08-24 SenseTime Group Ltd. Image processing method and device, electronic equipment and storage medium
CN113435531B (en) * 2021-07-07 2022-06-21 National University of Defense Technology Zero sample image classification method and system, electronic equipment and storage medium
CN113486981B (en) * 2021-07-30 2023-02-07 Xidian University RGB image classification method based on multi-scale feature attention fusion network
CN113673599B (en) * 2021-08-20 2024-04-12 Dalian Maritime University Hyperspectral image classification method based on correction prototype learning
CN116842329A (en) * 2023-07-10 2023-10-03 Hubei University Motor imagery task classification method and system based on electroencephalogram signals and deep learning


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10366166B2 (en) * 2017-09-07 2019-07-30 Baidu Usa Llc Deep compositional frameworks for human-like language acquisition in virtual environments

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109447115A (en) * 2018-09-25 2019-03-08 Tianjin University Fine-grained zero sample classification method based on multilayer semantic supervised attention model
CN109582960A (en) * 2018-11-27 2019-04-05 Shanghai Jiao Tong University Zero-shot example learning method based on structured association semantic embedding
CN110443273A (en) * 2019-06-25 2019-11-12 Wuhan University Adversarial zero sample learning method for cross-class recognition of natural images
CN111222471A (en) * 2020-01-09 2020-06-02 University of Science and Technology of China Zero sample training and related classification method based on self-supervised domain perception network
CN111598155A (en) * 2020-05-13 2020-08-28 Beijing University of Technology Fine-grained image weakly supervised target localization method based on deep learning
CN111881262A (en) * 2020-08-06 2020-11-03 Chongqing University of Posts and Telecommunications Text emotion analysis method based on multi-channel neural network

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
"Semantic-Guided Multi-Attention Localization for Zero-Shot Learning";Yizhe Zhu;《arXiv》;20191202;1-11页 *
"零样本学习中的细粒度图像分类研究";魏杰;《中国优秀硕士学位论文全文数据库信息科技辑》;20200215;正文18-42页 *

Also Published As

Publication number Publication date
CN112418351A (en) 2021-02-26

Similar Documents

Publication Publication Date Title
CN112418351B (en) Zero sample learning image classification method based on global and local context sensing
CN111476294B (en) Zero sample image identification method and system based on generation countermeasure network
CN108764063B (en) Remote sensing image time-sensitive target identification system and method based on characteristic pyramid
CN110059741B (en) Image recognition method based on semantic capsule fusion network
CN114067107B (en) Multi-scale fine-grained image recognition method and system based on multi-grained attention
CN111126482B (en) Remote sensing image automatic classification method based on multi-classifier cascade model
CN105825511A (en) Image background definition detection method based on deep learning
CN110633708A (en) Deep network significance detection method based on global model and local optimization
CN109583483A (en) A kind of object detection method and system based on convolutional neural networks
CN110991532B (en) Scene graph generation method based on relational visual attention mechanism
CN114332578A (en) Image anomaly detection model training method, image anomaly detection method and device
CN112861970B (en) Fine-grained image classification method based on feature fusion
CN111461213A (en) Training method of target detection model and target rapid detection method
CN115937774A (en) Security inspection contraband detection method based on feature fusion and semantic interaction
CN113159067A (en) Fine-grained image identification method and device based on multi-grained local feature soft association aggregation
CN114926693A (en) SAR image small sample identification method and device based on weighted distance
Lechgar et al. Detection of cities vehicle fleet using YOLO V2 and aerial images
CN114780767A (en) Large-scale image retrieval method and system based on deep convolutional neural network
CN113642602A (en) Multi-label image classification method based on global and local label relation
CN117173147A (en) Surface treatment equipment and method for steel strip processing
CN113688864B (en) Human-object interaction relation classification method based on split attention
CN114283289A (en) Image classification method based on multi-model fusion
CN111753915A (en) Image processing device, method, equipment and medium
CN110689071A (en) Target detection system and method based on structured high-order features
Yang et al. YOLOX with CBAM for insulator detection in transmission lines

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant