CN112418351A - Zero sample learning image classification method based on global and local context sensing - Google Patents

Zero sample learning image classification method based on global and local context sensing

Info

Publication number
CN112418351A
CN112418351A
Authority
CN
China
Prior art keywords
global
feature
local
feature map
layer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011460544.8A
Other languages
Chinese (zh)
Other versions
CN112418351B (en)
Inventor
王国威
陶文源
管乃洋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University
Original Assignee
Tianjin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University filed Critical Tianjin University
Priority to CN202011460544.8A priority Critical patent/CN112418351B/en
Publication of CN112418351A publication Critical patent/CN112418351A/en
Application granted granted Critical
Publication of CN112418351B publication Critical patent/CN112418351B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/047 Probabilistic or stochastic networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a zero-sample learning image classification method based on global and local context sensing, which comprises the following steps: performing feature extraction on an image with a deep neural network to obtain multiple layers of feature maps; computing on a feature map of any layer with global attention to obtain a feature map containing global information; computing on the feature map of the same layer with local attention to obtain a feature vector representing local information; passing the last-layer global feature map through a fully connected layer to obtain a global feature vector; adding the multiple groups of local feature vectors element by element to obtain a complete local feature vector; concatenating the complete local feature vector with the global feature vector, projecting the result into a semantic space and a hidden (implicit) feature space, and optimizing the parameters with a softmax loss and a triplet loss respectively; and repeating the above steps for a number of training epochs to obtain a zero-sample learning model with strong representation capability, and classifying images with the trained zero-sample learning model.

Description

Zero sample learning image classification method based on global and local context sensing
Technical Field
The invention relates to the field of image classification, in particular to a zero sample learning image classification method based on global and local context sensing.
Background
Deep learning techniques have developed rapidly, and related applications have been deployed in many fields (computer vision, natural language processing, etc.), because deep learning can use massive data for model training and thereby obtain powerful recognition capability. However, the training samples may not cover all classes. In particular, existing data inherently follow a long-tail distribution, which means that only a few common classes can provide a large number of samples, while most uncommon classes can only supply a very limited number of samples. Reflected in deep learning, this means that a deep learning model can reach the desired recognition accuracy for common classes, for which training samples are abundant, but for uncommon classes its recognition capability is fundamentally worse; for classes for which no training samples have been collected at all, the recognition capability is zero. In real applications, however, a model not only needs to obtain strong recognition capability from the collected data, but also needs to be able to recognize brand-new categories for which no training samples exist. New categories, such as new species and new models of electronic devices, appear in the world every day; being able to recognize unseen categories is therefore a key turning point in the development of deep learning systems, and this task of recognizing unseen categories can be addressed by zero-sample learning.
Zero-sample learning is a deep learning technique that mimics the recognition ability of the human brain. Lampert notes that humans can recognize roughly 30,000 fundamental classes, as well as fine-grained subclasses of these classes. Besides recognizing seen categories and using this knowledge to recognize fine-grained subcategories, humans can also recognize entirely new categories or concepts; for example, a person who has never seen a zebra can accurately recognize one from the description "looks similar to a horse, with black and white stripes".
In the zero-sample learning image classification task, the model can only be trained with images from known classes, yet it must recognize the classes to which images from unknown classes belong. This is possible because a high-level semantic description of object characteristics, such as attributes, is used, and the unknown classes are linked to the known classes by assuming that both share the same set of attributes. Generally, zero-sample learning proceeds as follows: in the training phase, the model learns a visual-semantic mapping; in the inference phase, an image of an unknown class is first converted into a semantic vector using the learned mapping, the semantic vector is then compared with the ground-truth attribute vectors of the unknown classes, and the closest class is selected as the prediction result.
Existing zero-sample learning algorithms can be divided into two categories, depending on whether new training data are generated during the training phase: generative (model-based) algorithms and compatibility-based algorithms. The first kind generates images of unknown classes from their semantic descriptions and trains them together with the existing images of known classes in the conventional deep-learning manner. However, existing methods of this kind have several shortcomings: the generated unknown-class images cannot reproduce details well, the generated unknown-class features lack interpretability, and such methods ignore the importance of information-rich visual regions in the image. The second kind of methods directly uses semantic knowledge to learn a visual-semantic mapping by aligning the visual space with the semantic space. Most compatibility-based models focus on how to mine the discriminative local information of the object itself and how to better align the two spaces, but they ignore the positive contribution of global information to the zero-sample learning task.
Disclosure of Invention
The invention provides a zero sample learning image classification method based on global and local context sensing, which considers global features and local features at the same time, enhances the learned mapping expression capability, and further improves the performance of a zero sample learning model, as described in detail in the following:
A zero-sample learning image classification method based on global and local context sensing comprises the following steps:
performing feature extraction on an image with a deep neural network to obtain multiple layers of feature maps;
computing on a feature map of any layer with global attention to obtain a feature map containing global information; computing on the feature map of the same layer with local attention to obtain a feature vector representing local information;
passing the last-layer global feature map through a fully connected layer to obtain a global feature vector; adding the multiple groups of local feature vectors element by element to obtain a complete local feature vector;
concatenating the complete local feature vector with the global feature vector, projecting the result into a semantic space and a hidden (implicit) feature space, and optimizing the parameters with a softmax loss and a triplet loss respectively;
and repeating the above steps for a number of training epochs to obtain a zero-sample learning model with strong representation capability, and classifying images with the trained zero-sample learning model.
The computing on a feature map of any layer with global attention to obtain the feature map containing global information specifically comprises:
obtaining a spatial self-attention module weight matrix, and weighting the re-dimensioned value feature V_r with the obtained weight matrix to obtain the weighted feature X_w;
adding the re-dimensioned feature map X_r to the weighted feature X_w through a residual connection to obtain X_g;
re-dimensioning the obtained X_g to the same size as the original feature map, X_g ∈ R^(C×H×W);
and inputting X_g as the new feature map into the next layer of the neural network; the same operation is applied at multiple layers of feature maps, so that the global context information is transferred to the last layer.
Further, the spatial self-attention module weight matrix is specifically:

A = softmax_col(Q_r^T K_r) ∈ R^(L×L)

wherein R denotes the dimension information of a variable, softmax_col computes the softmax score of the matrix column by column, Q_r^T is the transpose of the re-dimensioned query feature, K_r is the re-dimensioned key feature, T denotes transposition, and L = H × W is the product of the length and width of the feature map.

The weighted value X_w is:

X_w = α · V_r · A,  with X_w, V_r ∈ R^(C×L)

wherein α is a balance factor, C is the number of channels of the feature map, and V_r is the re-dimensioned value feature.
Further, the computing on the feature map of the same layer with local attention to obtain the feature vector representing the local information specifically comprises:
computing a sampling transform with a spatial transformer and applying it to the original feature map through matrix multiplication to obtain the corresponding regions R_s, and extracting features IR from each region R_s with an Inception network;
processing the extracted features with global maximum pooling and global average pooling; summing the IR_l obtained from the multiple regions element by element to obtain the feature that finally represents the local regions; and learning the visual-semantic mapping and the visual-hidden mapping respectively, and concatenating them.
The technical scheme provided by the invention has the following beneficial effects:
1. By training directly on the original image samples, the method makes the model better adapted to the zero-sample learning classification task;
2. The method uses a global attention module to extract global context information from the original feature maps and generate feature maps containing global information, so that the global features extracted by the model have strong expressive power, enhancing the model's global understanding of the object;
3. The method uses a local attention module to extract local context information from the original feature maps to obtain local feature vectors; the same steps are applied to several feature maps, and the local feature vectors are finally summed to obtain the complete local feature vector, enhancing the model's local understanding of the object;
4. The method obtains a complete feature representation by feature concatenation, taking both global and local information into account, which greatly improves the representation capability and hence the accuracy of the model;
5. The method projects the image features into the semantic space and the hidden space simultaneously, and uses a softmax loss and a triplet loss respectively to optimize and update the parameters.
Drawings
FIG. 1 is a flow chart of a zero sample learning image classification method based on global and local context sensing;
FIG. 2 is a schematic diagram of a global attention module;
FIG. 3 is a schematic diagram of a spatial transformer;
FIG. 4 is a schematic diagram of an Inception network.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention are described in further detail below.
Example 1
A zero sample learning image classification method based on global and local context sensing, referring to FIG. 1, the method comprises the following steps:
101: carrying out feature extraction on the image by using a deep neural network to obtain a multilayer feature map;
102: calculating any layer of feature map by using global attention to obtain a feature map containing global information;
103: calculating the characteristic diagram of the same layer by using local attention to obtain a characteristic vector representing local information;
104: repeating the operations of steps 102 and 103 for multiple layers to obtain a plurality of global feature maps and local feature vectors;
105: obtaining a global feature vector from the last layer of global feature map through a full connection layer; performing element-by-element addition on the multiple groups of local feature vectors to obtain complete local feature vectors;
106: splicing the complete local feature vector and the global feature vector, projecting the complete local feature vector and the global feature vector to a semantic (attribute) space and an implicit feature space simultaneously, and performing parameter optimization by respectively adopting softmax loss and triple loss;
107: and repeating the steps, setting a plurality of periods for training, finally obtaining a zero sample learning model with strong representation capability, and classifying the images through the trained zero sample learning model.
In summary, in the embodiment of the present invention, a deep neural network extracts feature maps from the image; global attention is computed on these feature maps to obtain new feature maps containing global information, and local attention is computed on each feature map to obtain local features; the computation is carried out on several groups of feature maps, the features are finally fused, and the fused features are projected into a semantic (attribute) space and a hidden feature space simultaneously. In this way the learned features are enhanced, the expressive power of the learned mapping is improved, and the classification accuracy of the model is increased.
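To make the overall flow of steps 101 to 107 concrete, the following is a minimal training-loop sketch, not the inventors' reference implementation: it assumes the data loader already yields image triplets (anchor, positive, negative) together with the anchor's seen-class label, and that the model internally performs steps 101 to 106 and returns the combined softmax-plus-triplet loss described in step 106.

```python
import torch

def train_zero_shot(model, loader, optimizer, epochs: int = 30):
    """Schematic training loop for steps 101-107 (assumed data format: triplets + labels)."""
    model.train()
    for epoch in range(epochs):                       # step 107: several training epochs
        for anchor, positive, negative, label in loader:
            optimizer.zero_grad()
            # steps 101-106 happen inside the model: feature extraction, global/local
            # attention, feature fusion, projection to semantic and hidden spaces, losses
            loss = model(anchor, positive, negative, label)
            loss.backward()
            optimizer.step()                          # step 106: parameter update
        print(f"epoch {epoch}: last batch loss {loss.item():.4f}")

# usage (illustrative): optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
#                       train_zero_shot(model, loader, optimizer, epochs=30)
```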
Example 2
The scheme of example 1 is further described below with reference to specific calculation formulas and examples, which are described in detail below:
First, the basic setup is introduced:
The training set D_s = {(x_i^s, y_i^s)} contains N_s samples, where x_i^s denotes the i-th image of a known (seen) class and y_i^s ∈ Y_s is its corresponding class label. The test set D_u = {(x_j^u, y_j^u)} contains N_u samples, where x_j^u denotes the j-th image of an unknown (unseen) class and y_j^u ∈ Y_u is its corresponding class label. The semantic features of the known and unknown classes can be represented as {φ_y}_{y∈Y_s} and {φ_y}_{y∈Y_u}, respectively. The known classes and the unknown classes are disjoint: Y_s ∩ Y_u = ∅ and Y_s ∪ Y_u = Y. φ(x) = θ(x)^T W denotes the projection of the visual features into the semantic space, where θ(x) is the visual feature extracted by the deep neural network, W is a transformation matrix and T denotes transposition. σ(x) denotes the projection of the visual features into the hidden space.
In zero-sample learning, the training phase can only use known class images and semantic features (attributes), and the model needs to obtain the capability of predicting unknown classes by learning visual-semantic mapping or visual-implicit feature mapping.
1. Global context information extraction
The convolutional layer is an important component of the deep neural network, but it is limited by the size of the convolution kernel, so the features it extracts inevitably contain only local information. For computer vision tasks such as image classification, image segmentation and object detection, however, extracting more global features is key to improving the representation capability of the model. If global information can be introduced into some layers, the limitation imposed by the convolution kernel size can be relieved and the performance of the deep neural network improved. The key is therefore being able to extract global information from the image.
The global self-attention module was first used in natural language processing tasks and has subsequently been widely used in computer vision tasks. Specifically, global self-attention is computed as follows:

For an input feature map X ∈ R^(C×H×W), a set of convolution operations with kernel size 1×1 is first applied to generate a query feature Q, a key feature K and a value feature V, where Q, K ∈ R^(C'×H×W), V ∈ R^(C×H×W), and C' denotes the reduced number of feature map channels. Here R denotes the dimension information of a variable, C the number of channels of the feature map, H its length, W its width, and L = H × W.

Q, K, V and the original feature map X are then re-dimensioned (reshaped) to obtain Q_r, K_r ∈ R^(C'×L) and V_r, X_r ∈ R^(C×L). The spatial self-attention module weight matrix can then be expressed as:

A = softmax_col(Q_r^T K_r) ∈ R^(L×L)    (1)

where softmax_col computes the softmax score column by column and T denotes transposition. The obtained weight matrix is then used to weight the value feature V_r:

X_w = α · V_r · A    (2)

wherein α is a balance factor.

To prevent the loss of original information, a residual connection is adopted: the re-dimensioned original feature X_r is added to the weighted feature X_w, obtaining:

X_g = X_w + X_r    (3)

Finally, the obtained X_g is re-dimensioned back to the same size as the original feature map, X_g ∈ R^(C×H×W), and fed into the next layer of the neural network as the new feature map. By applying the same operation at multiple layers of feature maps, the global context information is propagated to the last layer.
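As a concrete illustration of equations (1) to (3), the following PyTorch sketch implements the spatial self-attention step described above. It is a minimal reading of the text, not the inventors' reference code; the module name, the channel-reduction ratio for C', and treating the balance factor α as a learnable scalar are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GlobalSpatialAttention(nn.Module):
    """Spatial self-attention following eqs. (1)-(3):
    A = softmax_col(Q_r^T K_r), X_w = alpha * V_r A, X_g = X_w + X_r."""
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        reduced = max(channels // reduction, 1)          # C' < C (assumed ratio)
        self.query = nn.Conv2d(channels, reduced, kernel_size=1)
        self.key = nn.Conv2d(channels, reduced, kernel_size=1)
        self.value = nn.Conv2d(channels, channels, kernel_size=1)
        self.alpha = nn.Parameter(torch.zeros(1))        # balance factor (assumed learnable)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        l = h * w
        q = self.query(x).view(b, -1, l)                 # (B, C', L)
        k = self.key(x).view(b, -1, l)                   # (B, C', L)
        v = self.value(x).view(b, c, l)                  # (B, C,  L)
        x_r = x.view(b, c, l)                            # re-dimensioned input feature
        # eq. (1): attention weights, softmax over each column of Q_r^T K_r
        attn = F.softmax(torch.bmm(q.transpose(1, 2), k), dim=1)   # (B, L, L)
        # eq. (2): weighted value feature
        x_w = self.alpha * torch.bmm(v, attn)            # (B, C, L)
        # eq. (3): residual connection with the re-dimensioned input
        x_g = x_w + x_r
        return x_g.view(b, c, h, w)                      # back to (B, C, H, W)

# usage: replace an intermediate feature map with its globally attended version
feat = torch.randn(2, 256, 14, 14)
print(GlobalSpatialAttention(256)(feat).shape)           # torch.Size([2, 256, 14, 14])
```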
2. Local context information extraction
The local attention module also takes a layer of feature map X ∈ R^(C×H×W) as input and outputs a local feature vector Z ∈ R^(k×1), where k is consistent with the dimension of the attribute features. The module consists of three sub-modules: a spatial transformer, an Inception network, and global max/average pooling. The spatial transformer can be represented as a function ST(·); its role is to help the network learn spatial invariance and translation invariance and extend this to affine and even non-affine transformations. This means the spatial transformer can learn a transformation that rectifies an object that has undergone an affine transformation:

M_l = [ r_h^l   0      t_x^l
        0       r_w^l  t_y^l ]    (4)

wherein (t_x, t_y) are the two-dimensional spatial (translation) coordinates, (r_h, r_w) are the scale transformation factors, and l corresponds to the feature map of the l-th layer. The corresponding regions are obtained by computing the spatial transformer and multiplying it with the original feature map:

R_s = ST_l(X)    (5)

For each extracted region R_s, features are extracted with an Inception network:

IR = Inception(R_s)    (6)

The extracted features IR are then processed with global average pooling and global maximum pooling:

IR_l = GAP(IR) + GMP(IR)    (7)

The features obtained at this point encode the important information of the local region. The IR_l obtained from the multiple regions are combined by element-wise addition to obtain the feature that finally represents the local regions:

Z = Σ_l IR_l    (8)

The model needs to learn two mappings, the visual-semantic mapping and the visual-hidden mapping, corresponding to the two mapping matrices W_a and W_b respectively. For computational convenience, Z is concatenated with itself so that its dimension becomes 2k.
3. Visual-semantic mapping and visual-hidden mapping
The deep neural network is divided into several levels of feature maps according to their different receptive field sizes. The global attention module extracts global context information from these feature maps to obtain new feature maps, which replace the original ones as input to the next layer of the network, so the feature vector obtained at the last layer contains the global context information. The last-layer feature vector is then projected through fully connected layers into the semantic space and the hidden space, producing two mappings: the visual-semantic mapping and the visual-hidden mapping. The visual-semantic mapping is optimized with a softmax loss function, and the visual-hidden mapping with a triplet loss function. The advantage is that the interpretability of the attributes is preserved while the discriminability of the hidden attributes is also taken into account.

For the visual-semantic mapping, let φ_y be the semantic feature of category y; its compatibility score can be expressed as:

s(x, y) = θ(x)^T W_a φ_y    (9)

wherein θ(x) denotes the visual feature and W_a the visual-semantic mapping matrix to be learned. Treating the compatibility score s as the logits of a softmax, the softmax loss can be expressed as:

L_att = -(1/N) Σ_i log p_{y_i}(x_i)    (10)

wherein

p_y(x) = exp(s(x, y)) / Σ_{y'∈Y_s} exp(s(x, y'))    (11)

For the visual-hidden mapping, a triplet loss is adopted to minimize the intra-class distance and maximize the inter-class distance, so as to obtain discriminative hidden features:

L_lat = Σ max(0, ||σ(x_i) - σ(x_j)||² - ||σ(x_i) - σ(x_k)||² + mrg)    (12)

wherein x_i, x_j, x_k are the anchor, the positive-class sample and the negative-class sample respectively, and mrg denotes the separation margin, set to 1.0. Combining the loss functions of the visual-semantic mapping, the visual-hidden mapping and the cropping (spatial transformer) network, the overall loss function can be expressed as:

L = L_att + αL_lat    (13)

where α is a balance factor and is set to 1.0.
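The two training objectives of equations (9) to (13) can be sketched as below: a softmax cross-entropy over compatibility scores for the visual-semantic mapping and a triplet loss for the visual-hidden mapping. The feature dimensions and the way triplets are passed in are assumptions; the margin mrg and the balance factor α are set to 1.0 as stated in the text.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ZeroShotHeads(nn.Module):
    """Visual-semantic and visual-hidden projections with the losses of eqs. (9)-(13)."""
    def __init__(self, feat_dim: int, attr_dim: int, hidden_dim: int,
                 class_attributes: torch.Tensor, mrg: float = 1.0, alpha: float = 1.0):
        super().__init__()
        self.W_a = nn.Linear(feat_dim, attr_dim, bias=False)     # visual-semantic mapping
        self.W_b = nn.Linear(feat_dim, hidden_dim, bias=False)   # visual-hidden mapping
        self.register_buffer("phi", class_attributes)            # (num_seen_classes, attr_dim)
        self.mrg, self.alpha = mrg, alpha

    def forward(self, feats, labels, pos_feats, neg_feats):
        # eq. (9): compatibility scores s(x, y) = theta(x)^T W_a phi_y for all seen classes
        scores = self.W_a(feats) @ self.phi.t()
        # eqs. (10)-(11): softmax (cross-entropy) loss over the compatibility scores
        l_att = F.cross_entropy(scores, labels)
        # eq. (12): triplet loss in the hidden space (anchor, positive, negative)
        anc, pos, neg = self.W_b(feats), self.W_b(pos_feats), self.W_b(neg_feats)
        l_lat = F.triplet_margin_loss(anc, pos, neg, margin=self.mrg)
        # eq. (13): overall loss
        return l_att + self.alpha * l_lat

# usage: 20 seen classes with 85-dim attributes, 2048-dim fused visual features (assumed sizes)
attrs = torch.randn(20, 85)
head = ZeroShotHeads(2048, 85, 128, attrs)
x, xp, xn = torch.randn(8, 2048), torch.randn(8, 2048), torch.randn(8, 2048)
loss = head(x, torch.randint(0, 20, (8,)), xp, xn)
loss.backward()
```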
4. Zero-sample learning prediction
Since the visual-semantic mapping and the visual-hidden feature mapping are learned simultaneously in the training phase, the testing phase treats them correspondingly. For the visual-semantic mapping, given a test image x whose projection in the semantic space is φ(x), the goal is to assign it a class label:

ŷ = argmax_{y∈Y_u} s(φ(x), φ_y)    (14)

For the visual-hidden feature mapping, given a test image x whose projection in the hidden space is σ(x), the hidden-feature prototype of each known class c is taken as the mean of the hidden features of its samples:

h_c = (1/N_c) Σ_{y_i^s = c} σ(x_i^s)    (15)

For an unseen class u, its relationship to all the seen classes is first computed in the semantic space, as a set of coefficients normalized over the seen classes and derived from the similarity of the class semantic features:

β_{uc} = r(φ_u, φ_c), normalized over all c ∈ Y_s    (16)

The unseen class u is assumed to share in the hidden space the same relationship it has in the semantic space, so its hidden prototype is:

h_u = Σ_{c∈Y_s} β_{uc} h_c    (17)

The final fused prediction can then be expressed as:

ŷ = argmax_{y∈Y_u} [ s(φ(x), φ_y) + s(σ(x), h_y) ]    (18)
where s (·, ·) is a compatibility function.
The parameters and the meanings of the English abbreviations used above are as follows:
X: input feature map; C, H, W: number of channels, length and width of the feature map; L = H × W;
Q, K, V: query, key and value features; C': reduced number of channels; Q_r, K_r, V_r, X_r: re-dimensioned query, key, value features and feature map; A: spatial self-attention weight matrix; α: balance factor;
ST(·): spatial transformer; (t_x, t_y): two-dimensional translation coordinates; (r_h, r_w): scale transformation factors;
Inception: Inception feature-extraction sub-network; GAP / GMP: global average pooling / global maximum pooling; IR, IR_l: region features extracted by the Inception sub-network; Z: complete local feature vector; k: dimension of the attribute (semantic) features;
θ(x): visual feature; φ(x): projection into the semantic space; σ(x): projection into the hidden space; W_a, W_b: visual-semantic and visual-hidden mapping matrices; s(·, ·): compatibility function; mrg: triplet-loss margin; L_att, L_lat: softmax loss and triplet loss.
in the embodiment of the present invention, except for the specific description of the model of each device, the model of other devices is not limited, as long as the device can perform the above functions.
Those skilled in the art will appreciate that the drawings are only schematic illustrations of preferred embodiments, and the above-described embodiments of the present invention are merely provided for description and do not represent the merits of the embodiments.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims (5)

1. A zero-sample learning image classification method based on global and local context sensing, characterized by comprising the following steps:
performing feature extraction on an image with a deep neural network to obtain multiple layers of feature maps;
computing on a feature map of any layer with global attention to obtain a feature map containing global information; computing on the feature map of the same layer with local attention to obtain a feature vector representing local information;
passing the last-layer global feature map through a fully connected layer to obtain a global feature vector; adding the multiple groups of local feature vectors element by element to obtain a complete local feature vector;
concatenating the complete local feature vector with the global feature vector, projecting the result into a semantic space and a hidden (implicit) feature space, and optimizing the parameters with a softmax loss and a triplet loss respectively;
and repeating the above steps for a number of training epochs to obtain a zero-sample learning model with strong representation capability, and classifying images with the trained zero-sample learning model.
2. The zero-sample learning image classification method based on global and local context awareness according to claim 1, wherein the computing on a feature map of any layer with global attention to obtain the feature map containing global information specifically comprises:
obtaining a spatial self-attention module weight matrix, and weighting the re-dimensioned value feature V_r with the obtained weight matrix to obtain the weighted feature X_w;
adding the re-dimensioned feature map X_r to the weighted feature X_w through a residual connection to obtain X_g;
re-dimensioning the obtained X_g to the same size as the original feature map, X_g ∈ R^(C×H×W);
and inputting X_g as the new feature map into the next layer of the neural network; the same operation is applied at multiple layers of feature maps, so that the global context information is transferred to the last layer.
3. The method according to claim 2, wherein the spatial self-attention module weight matrix is specifically:

A = softmax_col(Q_r^T K_r) ∈ R^(L×L)

wherein R denotes the dimension information of a variable, softmax_col computes the softmax score of the matrix column by column, Q_r^T is the transpose of the re-dimensioned query feature, K_r is the re-dimensioned key feature, T denotes transposition, and L = H × W is the product of the length and width of the feature map.
4. The zero-sample learning image classification method based on global and local context awareness according to claim 3, wherein the weighted value X_w is:

X_w = α · V_r · A,  with X_w, V_r ∈ R^(C×L)

wherein α is a balance factor, C is the number of channels of the feature map, and V_r is the re-dimensioned value feature.
5. The method according to claim 1, wherein the computing on the feature map of the same layer with local attention to obtain the feature vector representing the local information specifically comprises:
computing a sampling transform with a spatial transformer and applying it to the original feature map through matrix multiplication to obtain the corresponding regions R_s, and extracting features IR from each region R_s with an Inception network;
processing the extracted features with global maximum pooling and global average pooling; summing the IR_l obtained from the multiple regions element by element to obtain the feature that finally represents the local regions; and learning the visual-semantic mapping and the visual-hidden mapping respectively, and concatenating them.
CN202011460544.8A 2020-12-11 2020-12-11 Zero sample learning image classification method based on global and local context sensing Active CN112418351B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011460544.8A CN112418351B (en) 2020-12-11 2020-12-11 Zero sample learning image classification method based on global and local context sensing

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011460544.8A CN112418351B (en) 2020-12-11 2020-12-11 Zero sample learning image classification method based on global and local context sensing

Publications (2)

Publication Number Publication Date
CN112418351A true CN112418351A (en) 2021-02-26
CN112418351B CN112418351B (en) 2023-04-07

Family

ID=74775587

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011460544.8A Active CN112418351B (en) 2020-12-11 2020-12-11 Zero sample learning image classification method based on global and local context sensing

Country Status (1)

Country Link
CN (1) CN112418351B (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113298091A (en) * 2021-05-25 2021-08-24 商汤集团有限公司 Image processing method and device, electronic equipment and storage medium
CN113435531A (en) * 2021-07-07 2021-09-24 中国人民解放军国防科技大学 Zero sample image classification method and system, electronic equipment and storage medium
CN113486981A (en) * 2021-07-30 2021-10-08 西安电子科技大学 RGB image classification method based on multi-scale feature attention fusion network
CN113673599A (en) * 2021-08-20 2021-11-19 大连海事大学 Hyperspectral image classification method based on correction prototype learning
CN116842329A (en) * 2023-07-10 2023-10-03 湖北大学 Motor imagery task classification method and system based on electroencephalogram signals and deep learning

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190073353A1 (en) * 2017-09-07 2019-03-07 Baidu Usa Llc Deep compositional frameworks for human-like language acquisition in virtual environments
CN109447115A (en) * 2018-09-25 2019-03-08 天津大学 Zero sample classification method of fine granularity based on multilayer semanteme supervised attention model
CN109582960A (en) * 2018-11-27 2019-04-05 上海交通大学 The zero learn-by-example method based on structured asso- ciation semantic embedding
CN110443273A (en) * 2019-06-25 2019-11-12 武汉大学 A kind of zero sample learning method of confrontation identified for natural image across class
CN111222471A (en) * 2020-01-09 2020-06-02 中国科学技术大学 Zero sample training and related classification method based on self-supervision domain perception network
CN111598155A (en) * 2020-05-13 2020-08-28 北京工业大学 Fine-grained image weak supervision target positioning method based on deep learning
CN111881262A (en) * 2020-08-06 2020-11-03 重庆邮电大学 Text emotion analysis method based on multi-channel neural network

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190073353A1 (en) * 2017-09-07 2019-03-07 Baidu Usa Llc Deep compositional frameworks for human-like language acquisition in virtual environments
CN109447115A (en) * 2018-09-25 2019-03-08 天津大学 Zero sample classification method of fine granularity based on multilayer semanteme supervised attention model
CN109582960A (en) * 2018-11-27 2019-04-05 上海交通大学 The zero learn-by-example method based on structured asso- ciation semantic embedding
CN110443273A (en) * 2019-06-25 2019-11-12 武汉大学 A kind of zero sample learning method of confrontation identified for natural image across class
CN111222471A (en) * 2020-01-09 2020-06-02 中国科学技术大学 Zero sample training and related classification method based on self-supervision domain perception network
CN111598155A (en) * 2020-05-13 2020-08-28 北京工业大学 Fine-grained image weak supervision target positioning method based on deep learning
CN111881262A (en) * 2020-08-06 2020-11-03 重庆邮电大学 Text emotion analysis method based on multi-channel neural network

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
YIZHE ZHU: "Semantic-Guided Multi-Attention Localization for Zero-Shot Learning", arXiv *
WEI Jie (魏杰): "Research on Fine-Grained Image Classification in Zero-Shot Learning" (零样本学习中的细粒度图像分类研究), China Master's Theses Full-text Database, Information Science and Technology *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113298091A (en) * 2021-05-25 2021-08-24 商汤集团有限公司 Image processing method and device, electronic equipment and storage medium
WO2022247128A1 (en) * 2021-05-25 2022-12-01 上海商汤智能科技有限公司 Image processing method and apparatus, electronic device, and storage medium
CN113435531A (en) * 2021-07-07 2021-09-24 中国人民解放军国防科技大学 Zero sample image classification method and system, electronic equipment and storage medium
CN113435531B (en) * 2021-07-07 2022-06-21 中国人民解放军国防科技大学 Zero sample image classification method and system, electronic equipment and storage medium
CN113486981A (en) * 2021-07-30 2021-10-08 西安电子科技大学 RGB image classification method based on multi-scale feature attention fusion network
CN113673599A (en) * 2021-08-20 2021-11-19 大连海事大学 Hyperspectral image classification method based on correction prototype learning
CN113673599B (en) * 2021-08-20 2024-04-12 大连海事大学 Hyperspectral image classification method based on correction prototype learning
CN116842329A (en) * 2023-07-10 2023-10-03 湖北大学 Motor imagery task classification method and system based on electroencephalogram signals and deep learning

Also Published As

Publication number Publication date
CN112418351B (en) 2023-04-07

Similar Documents

Publication Publication Date Title
CN112418351B (en) Zero sample learning image classification method based on global and local context sensing
CN110188705B (en) Remote traffic sign detection and identification method suitable for vehicle-mounted system
CN110532920B (en) Face recognition method for small-quantity data set based on FaceNet method
CN111126482B (en) Remote sensing image automatic classification method based on multi-classifier cascade model
CN110569738B (en) Natural scene text detection method, equipment and medium based on densely connected network
CN110633708A (en) Deep network significance detection method based on global model and local optimization
CN105825511A (en) Image background definition detection method based on deep learning
CN109583483A (en) A kind of object detection method and system based on convolutional neural networks
CN111325318B (en) Neural network training method, neural network training device and electronic equipment
CN111325243B (en) Visual relationship detection method based on regional attention learning mechanism
CN113609896A (en) Object-level remote sensing change detection method and system based on dual-correlation attention
CN112784782B (en) Three-dimensional object identification method based on multi-view double-attention network
Nguyen et al. Satellite image classification using convolutional learning
CN111461213A (en) Training method of target detection model and target rapid detection method
CN109766752B (en) Target matching and positioning method and system based on deep learning and computer
CN107301643A (en) Well-marked target detection method based on robust rarefaction representation Yu Laplce's regular terms
CN112861970A (en) Fine-grained image classification method based on feature fusion
CN115937774A (en) Security inspection contraband detection method based on feature fusion and semantic interaction
CN115147607A (en) Anti-noise zero-sample image classification method based on convex optimization theory
CN110705384A (en) Vehicle re-identification method based on cross-domain migration enhanced representation
CN112668662B (en) Outdoor mountain forest environment target detection method based on improved YOLOv3 network
CN113642602A (en) Multi-label image classification method based on global and local label relation
CN114627312B (en) Zero sample image classification method, system, equipment and storage medium
CN113688864B (en) Human-object interaction relation classification method based on split attention
CN115049833A (en) Point cloud component segmentation method based on local feature enhancement and similarity measurement

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant