CN110163258B - Zero sample learning method and system based on semantic attribute attention redistribution mechanism - Google Patents

Zero sample learning method and system based on semantic attribute attention redistribution mechanism

Info

Publication number
CN110163258B
CN110163258B (application CN201910335801.6A)
Authority
CN
China
Prior art keywords
semantic
space
image
hidden layer
attention
Prior art date
Legal status: Active
Application number
CN201910335801.6A
Other languages
Chinese (zh)
Other versions
CN110163258A (en)
Inventor
刘洋 (Liu Yang)
蔡登 (Cai Deng)
Current Assignee
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date
Filing date
Publication date
Application filed by Zhejiang University ZJU
Priority to CN201910335801.6A
Publication of CN110163258A
Application granted
Publication of CN110163258B

Classifications

    • G PHYSICS → G06 COMPUTING; CALCULATING OR COUNTING → G06F ELECTRIC DIGITAL DATA PROCESSING → G06F18/00 Pattern recognition → G06F18/20 Analysing → G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation → G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS → G06 COMPUTING; CALCULATING OR COUNTING → G06F ELECTRIC DIGITAL DATA PROCESSING → G06F18/00 Pattern recognition → G06F18/20 Analysing → G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS → G06 COMPUTING; CALCULATING OR COUNTING → G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS → G06N3/00 Computing arrangements based on biological models → G06N3/02 Neural networks → G06N3/04 Architecture, e.g. interconnection topology → G06N3/045 Combinations of networks

Abstract

The invention discloses a zero sample learning method and system based on a semantic attribute attention redistribution mechanism. The method comprises the following steps: (1) establishing a neural network model based on a semantic attribute attention redistribution mechanism; (2) redistributing the weights among the semantic features using the attention of the semantic attribute space; (3) training the neural network model using a labeled image data set; (4) calculating the similarity between the weighted semantic features of a test image and the semantic prototypes of the unknown classes, calculating the similarity between its hidden layer features and the hidden layer feature prototypes of the unknown classes, and adding the two similarities to obtain the similarity between the test image and each unknown class; (5) sorting the classes by their similarity to the image and selecting the class with the maximum similarity as the class prediction of the image. The method couples the semantic space and the hidden layer space more tightly during training, so that the result of joint classification over the two spaces is more robust.

Description

Zero sample learning method and system based on semantic attribute attention redistribution mechanism
Technical Field
The invention relates to the field of zero sample learning classification systems, in particular to a zero sample learning method and a zero sample learning system based on a semantic attribute attention redistribution mechanism.
Background
In recent years, object classification, an important branch of computer vision, has received wide attention from researchers in industry and academia. Supervised object classification has advanced greatly with the rapid development of deep learning, but supervised training also has limitations. In supervised classification, each class requires enough labeled training samples, and the learned classifier can only classify instances of the classes covered by the training data, lacking the ability to handle previously unseen classes. In practical applications, a class may not have enough training samples, and classes not covered during training may appear among the test samples. Zero sample learning aims at classifying instances of classes not covered during training; it has become a rapidly developing direction in machine learning, with wide applications in computer vision, natural language processing and ubiquitous computing.
At present, mainstream zero sample learning methods mainly adopt attribute-based two-stage derivation to predict the label of an image: given an input image, the model predicts the attributes of the image in the first stage and infers the class label in the second stage by searching for the class with the most similar attribute set. For example, the DAP model proposed by Lampert et al. in "Attribute-based classification for zero-shot visual object categorization", published in 2013 in the IEEE Transactions on Pattern Analysis and Machine Intelligence, estimates the posterior probability of each attribute of an image by learning probabilistic attribute classifiers, and then infers the class label of the image via the class posterior and its maximum a posteriori estimate. The article "Recovering the missing link: predicting class-attribute associations for unsupervised zero-shot learning", included in the 2016 Conference on Computer Vision and Pattern Recognition, first learns a probabilistic classifier for each attribute and then classifies with a random forest, which can handle unreliable attributes. This two-stage approach suffers from a domain shift problem: for example, while the target task is to predict the class labels of images, the intermediate task of DAP is to learn classifiers for image attributes.
More recent advances in zero sample learning directly learn a mapping from the image feature space to the attribute semantic space. The ALE model proposed in "Label-embedding for image classification", published in 2016 in the IEEE Transactions on Pattern Analysis and Machine Intelligence, learns a bilinear compatibility function between the image and attribute spaces using a ranking-based loss function. "Semantic autoencoder for zero-shot learning", included in the 2017 Conference on Computer Vision and Pattern Recognition, proposed a semantic autoencoder that projects image features into a semantic space from which the image can be reconstructed. The article "Predicting visual exemplars of unseen classes for zero-shot learning" at the 2017 International Conference on Computer Vision proposed projecting class semantic representations into the visual feature space and performing nearest neighbor classification among these projections.
In addition to the commonly used semantic attribute space, some recent work performs joint class inference over multiple spaces. For example, "Learning discriminative latent attributes for zero-shot classification" at the 2017 International Conference on Computer Vision proposed the LAD model, which uses dictionary learning to obtain a latent feature space that is discriminative yet retains semantic information. "Discriminative Learning of Latent Features for Zero-Shot Recognition" at the 2018 Conference on Computer Vision and Pattern Recognition proposed a new hidden feature space that jointly maximizes the inter-class distance and minimizes the intra-class distance, and performs joint inference over the semantic space and the hidden feature space. "Learning Class Prototypes via Structure Alignment for Zero-Shot Recognition" at the 2018 European Conference on Computer Vision proposed the CDL model, which aligns class structures in the visual and semantic spaces simultaneously. However, these methods essentially treat all attributes as equally important during classification and neglect that the attributes have different distributions, variances and information entropies across classes; this easily causes misclassification on some challenging images.
Disclosure of Invention
The invention provides a zero sample learning method and system based on a semantic attribute attention redistribution mechanism, which computes an attention redistribution over the attribute predictions of each image and redistributes the importance of each attribute in image classification, thereby achieving a better zero sample learning effect.
The technical scheme of the invention is as follows:
A zero sample learning method based on a semantic attribute attention re-allocation mechanism comprises the following steps:
(1) establishing a neural network model based on a semantic attribute attention redistribution mechanism, wherein the neural network model comprises a visual-semantic space mapping branch, a visual-hidden layer space mapping branch and an attention branch; forward derivation of an image through the three branches yields, respectively, the semantic features of the image in the semantic attribute space, its hidden layer features in the hidden layer space, and its attention in the semantic attribute space;
(2) redistributing the weights among the semantic features using the attention of the semantic attribute space;
(3) training the neural network model using a labeled image data set;
(4) inputting an image to be tested, calculating the similarity between the weighted semantic features of the image and the semantic prototypes of the unknown classes, calculating the similarity between its hidden layer features and the hidden layer feature prototypes of the unknown classes, and adding the two similarities to obtain the similarity between the test image and each unknown class;
(5) sorting the classes by their similarity to the image and selecting the class with the maximum similarity as the class prediction of the image.
The zero sample learning method based on semantic attribute attention redistribution is an improved algorithm for joint inference over the semantic attribute space and the hidden layer space. Compared with prior algorithms, the two spaces are coupled more tightly: 1) the hidden layer space provides class-information guidance to the semantic attribute space, so that the neural network generates correct and reasonable attention; 2) the semantic attribute space provides an initialization for constructing the prototypes of the unknown classes in the hidden layer space, reducing the drawbacks caused by domain shift. Meanwhile, the model performs inference by combining the attention-reweighted semantic attribute space with the hidden layer space, which greatly improves the stability of the model predictions.
In the step (1), the visual-semantic space mapping branch and the visual-hidden layer space mapping branch use a VGG19 network structure as a shared shallow network, and use separate fully-connected layers for the feature mappings into the different spaces;
the attention branch extracts features from the feature maps of different layers of the VGG19 network using single-layer convolutional neural networks with convolution kernel size 3 and separate parameters, and computes the attention of the semantic attribute space corresponding to each layer's VGG19 feature map using a feature fusion method.
In the step (1), the specific process of obtaining the semantic features of the image in the semantic attribute space, the hidden layer features in the hidden layer space and the attention in the semantic attribute space is as follows:
a pre-trained deep convolutional neural network is used to extract the deep visual feature θi of the image input xi, and fully-connected networks map the deep visual feature into the semantic space and the hidden layer space respectively:

si = FC1(θi)

σi = FC2(θi)

wherein si is the vector representation of image i in the semantic space, σi is the vector representation of image i in the hidden layer space, FC1 is the mapping function from the visual space to the semantic space, and FC2 is the mapping function from the visual space to the hidden layer space;
an intermediate feature map representation φi,l of image i is selected for layer l of the deep convolutional neural network and combined with the hidden layer space vector σi; the semantic attribute attention of image i at visual depth l is computed as

pi,l = softmax(Wl zi,l + bl)

wherein Wl and bl are the parameters of a single-layer fully-connected network and zi,l is the feature fusion representation of the hidden layer vector and the visual features at depth l, computed as

zi,l = Fsq(φ̂i,l) ⊕ σi

wherein Fsq is a matrix transform function that converts a three-dimensional matrix representation of size C × H × W into a two-dimensional matrix representation of size C × HW, ⊕ denotes summation by channel, φ̂i,l is the result of passing the feature map φi,l through a series of convolutions, and k, the number of channels after feature fusion, is kept consistent with the length of the semantic vector representation and the hidden layer vector representation; finally, the layer depths l ∈ lB are selected and the attention of image i in the semantic attribute space is computed as

pi = Σl∈lB pi,l

wherein pi,l is the semantic attribute attention of image i at visual depth l.
In the step (2), the attention of the semantic attribute space is used to redistribute the weights among the semantic features:

ŝi = diag(pi) · si

wherein diag(pi) is a k × k diagonal matrix whose diagonal values are pi, and ŝi denotes the weighted semantic features.
The specific process of the step (3) is as follows:
(3-1) during data preparation, the original training data set D is divided in advance into a set D̂ consisting of triples (xi, xi+, xi−), wherein for any triple, xi and xi+ are different images from the same class yi, and xi− is an image from a class yi− different from yi;
(3-2) during training, for each triple (xi, xi+, xi−) the neural network model is trained using a mixed loss function L:

L = LF + LA

wherein LF is the loss function defined in the hidden layer space and LA is the loss function defined in the semantic attribute space;
the hidden layer space loss function uses a triplet loss to simultaneously maximize the inter-class distance and minimize the intra-class distance:

LF = Σi max(0, Δ + ||σi − σi+||² − ||σi − σi−||²)

wherein Δ is the margin and σi, σi+, σi− are the hidden layer representations of xi, xi+, xi−;
the loss function of the semantic attribute space uses a cross-entropy loss to maximize the probability of correct classification in the semantic space:

LA = −Σi log( exp(ŝi · syi) / Σy∈Y exp(ŝi · sy) )

where Y is the set of all training classes and syi is the semantic attribute prototype of the known class yi.
The specific steps of the step (4) are as follows:
(4-1) for an input image xi, the trained model predicts its semantic vector representation si (weighted by attention to ŝi), its hidden layer vector representation σi and its semantic attribute attention pi;
(4-2) for any class yu ∈ Yu, where Yu denotes the classes not covered by training, the cosine similarities between image xi and the class semantic prototype syu and the class hidden layer prototype σyu are computed in the semantic attribute space and the hidden layer space respectively:

sims(xi, yu) = (ŝi · syu) / (||ŝi|| ||syu||)

simσ(xi, yu) = (σi · σyu) / (||σi|| ||σyu||)

the cosine similarities of the two spaces are summed to obtain the similarity between image xi and class yu:

sim(xi, yu) = sims(xi, yu) + simσ(xi, yu)
the specific steps of the step (5) are as follows:
computing image x using a nearest neighbor search algorithmiIn class set YuClass prediction of
Figure BDA0002039097550000071
The calculation formula is as follows:
Figure BDA0002039097550000072
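The joint inference of steps (4) and (5) can be sketched as follows, with assumed 4-dimensional spaces, two assumed unseen classes and hand-picked toy prototypes:

```python
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def predict(s_hat, sigma, sem_protos, hid_protos, unseen_classes):
    """Sum the cosine similarities of the two spaces per class, take the argmax class."""
    sims = [cosine(s_hat, sem_protos[y]) + cosine(sigma, hid_protos[y])
            for y in unseen_classes]
    return unseen_classes[int(np.argmax(sims))]

# toy unseen-class prototypes (assumed values, illustration only)
sem_protos = {"zebra": np.array([1.0, 0.0, 1.0, 0.0]),
              "pig":   np.array([0.0, 1.0, 0.0, 1.0])}
hid_protos = {"zebra": np.array([1.0, 1.0, 0.0, 0.0]),
              "pig":   np.array([0.0, 0.0, 1.0, 1.0])}

s_hat = np.array([0.9, 0.1, 0.8, 0.2])   # weighted semantic features of the test image
sigma = np.array([0.8, 0.7, 0.1, 0.1])   # hidden layer features of the test image

label = predict(s_hat, sigma, sem_protos, hid_protos, ["zebra", "pig"])
```

Because both spaces vote, an image that is ambiguous in one space can still be resolved by the other.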
The zero sample learning algorithm based on semantic attribute attention redistribution retains all the advantages of zero sample learning and can correctly distinguish difficult, semantically ambiguous samples such as speckled pigs and spotted dogs. In practice, the proposed algorithm yields a much lower variance of the attribute prediction values in the semantic space than previous zero sample learning algorithms, so that the final retrieval and classification are influenced by many semantic attributes rather than dominated by one or a few prominent attribute predictions. Building on joint inference over the semantic space and the hidden layer space, the algorithm couples the two spaces more tightly and avoids the situation where an image is classified correctly in one feature space but incorrectly in the other.
The invention also provides a zero sample learning system based on a semantic attribute attention re-allocation mechanism, comprising a computer memory, a computer processor and a computer program stored in the computer memory and executable on the computer processor, wherein the computer memory stores the following modules:
the visual feature module, which captures the deep visual features of the input image using a deep convolutional neural network;
the visual-semantic mapping module, which maps the visual features into the semantic attribute space using a fully-connected neural network;
the visual-hidden layer mapping module, which maps the visual features into the hidden layer space using a fully-connected neural network;
the semantic attention module, which generates the attribute attention of the semantic space using the shallow visual features of the image and the class information of the hidden layer space;
the classification retrieval module, which classifies the images using the semantic attribute space representation, the hidden layer space representation and the semantic space attention of the images;
the classification generation module, which outputs the classification result after the model classification is finished.
Compared with the prior art, the invention has the following beneficial effects:
1. In the semantic attribute attention redistribution algorithm provided by the invention, the attention mechanism introduces a competitive relationship among the semantic attribute predictions, so that the classification result is determined by more semantic attributes instead of depending only on a few prominent ones, avoiding misclassification of difficult, semantically ambiguous samples.
2. The method avoids the domain shift problem when the semantic space and the hidden layer space jointly infer the classification, and therefore avoids the common zero sample learning problem of classification results being biased towards the classes covered by training.
3. Extensive experiments demonstrate that the model outperforms the other baseline algorithms.
Drawings
FIG. 1 is a schematic overall framework diagram of a zero sample learning method based on a semantic attribute attention redistribution mechanism according to an embodiment of the present invention;
FIG. 2 is a diagram illustrating operation of a semantic attention module of the method according to an embodiment of the present invention;
FIG. 3 is a schematic overall structure diagram of a zero-sample learning system based on a semantic attribute attention re-allocation mechanism according to an embodiment of the present invention;
FIG. 4 is a diagram illustrating a predicted distribution of semantic space obtained by using different attention mechanisms according to an embodiment of the present invention.
Detailed Description
The invention will be described in further detail below with reference to the drawings and examples, which are intended to facilitate the understanding of the invention without limiting it in any way.
As shown in FIG. 1, the main model of the present invention consists of a visual feature module and three branch modules that produce three outputs for each input image; the three branches are optimized synchronously with the whole model. The specific steps are as follows:
(a) The visual feature module learns the deep visual feature θi of the input image xi during zero sample learning training. The basic steps are:
(a-1) initialize the network model parameters using the pre-trained large neural network ResNet101; a center-cropped image x'i of size 224 × 224 of the input image xi is used as the actual input of the network;
(a-2) take the feature vector of the last non-classification layer of the neural network as the deep visual feature θi of image x'i; the length of this feature vector is denoted V.
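Step (a-1)'s image-side preprocessing can be sketched in numpy; only the 224 × 224 center crop is stated by the patent, and the backbone (ResNet101 in this embodiment, with V = 2048) then supplies θi:

```python
import numpy as np

def center_crop(img, size=224):
    """Center-crop an H x W x 3 image array to size x size, as in step (a-1)."""
    h, w = img.shape[:2]
    top, left = (h - size) // 2, (w - size) // 2
    return img[top:top + size, left:left + size]

rng = np.random.default_rng(0)
img = rng.random((256, 320, 3))   # dummy input image x_i
x_crop = center_crop(img)         # x'_i, the actual input of the backbone
# the backbone's last non-classification layer then yields theta_i of length V
# (V = 2048 for ResNet101)
```

In practice the crop would be preceded by resizing and normalization as required by the backbone; those details are not specified here.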
(b) The visual-semantic mapping module provides the mapping from the deep visual space to the semantic space for the zero sample learning process. The basic steps are:
(b-1) initialize the model parameters: the spatial mapping matrix W1 ∈ Rk×V and bias b1 ∈ Rk;
(b-2) map the visual feature θi into the semantic space to obtain the semantic space representation si:

si = W1θi + b1
(c) The visual-hidden layer mapping module provides the mapping from the deep visual space to the hidden layer space for the zero sample learning process. The basic steps are:
(c-1) similar to step (b-1), initialize W2 and b2;
(c-2) similar to step (b-2), map the visual feature:

σi = W2θi + b2
(d) The semantic-attention module uses partial class information of the hidden layer space and the shallow visual information of the image to redistribute the weights of the semantic space outputs in zero sample learning, as shown in fig. 2. The basic steps are:
(d-1) select the shallow visual feature map φi,l of a specific layer l and the hidden layer feature representation σi ∈ Rk as the inputs of the semantic-attention module; initialize the parameters of the convolutional neural network Bl and the parameters W3 and b3 of the fully-connected network FC3;
(d-2) pass the shallow visual feature map φi,l through a series of convolution transformations Bl to obtain the feature map φ̂i,l, whose spatial size is H' × W' and whose number of channels k is kept consistent with the length of the hidden layer feature σi;
(d-3) add the feature map φ̂i,l and the hidden layer representation σi ∈ Rk of the image channel by channel, and convert the resulting matrix channel by channel into a column vector to obtain the hidden variable zi,l ∈ RkH'W';
(d-4) pass the hidden variable through the fully-connected neural network FC3 to obtain the attention representation pi,l of the image at visual depth l in the semantic space, which fuses the shallow visual feature φi,l and the hidden layer feature σi:

pi,l = softmax(W3 zi,l + b3)

(d-5) select four specific network depths l ∈ lB, repeat steps (d-1)–(d-4), and accumulate the obtained attentions to obtain the overall semantic attribute attention of the image in the semantic attribute space:

pi = Σl∈lB pi,l
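Steps (d-1)–(d-5) can be sketched shape by shape in numpy. The learned convolution stack Bl is stood in for by a random 1 × 1 projection, biases are set to zero, and the four layer shapes are assumed; this is a sketch of the data flow, not the trained module:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_at_depth(feat_map, sigma, rng):
    """One pass of (d-2)-(d-4) for a C x H x W shallow feature map."""
    C, H, W = feat_map.shape
    k = sigma.shape[0]
    conv = rng.normal(size=(k, C)) * 0.1              # stand-in for the conv stack B_l (1x1 here)
    fm = np.tensordot(conv, feat_map, axes=1)         # phi_hat: k x H x W, channels match sigma
    z = (fm.reshape(k, -1) + sigma[:, None]).ravel()  # F_sq, channel-wise add, then flatten
    W3 = rng.normal(size=(k, z.size)) * 0.01          # single-layer FC (W_3; b_3 taken as 0)
    return softmax(W3 @ z)                            # attention p_{i,l} at this depth

rng = np.random.default_rng(0)
k = 85
sigma = rng.normal(size=k)                            # hidden layer representation sigma_i
depths = [(64, 28, 28), (128, 14, 14), (256, 14, 14), (512, 7, 7)]  # four assumed layer shapes
p = sum(attention_at_depth(rng.normal(size=d), sigma, rng) for d in depths)  # p_i, step (d-5)
```

Each per-depth attention is a distribution over the k attributes, so the accumulated p sums to the number of selected depths.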
the training steps of the zero sample learning method based on semantic attribute attention redistribution are as follows:
1. initializing a training data set D { (x)i,yi) In which xiRepresenting an input image, yie.Y represents the class label of the input image, Y represents the set of classes covered by the training set, Y for each classs∈Y,
Figure BDA0002039097550000109
Is the prototype vector of the class in semantic space. Sorting a dataset into sets of triples
Figure BDA00020390975500001010
Wherein
Figure BDA00020390975500001011
And
Figure BDA00020390975500001012
are from the same class
Figure BDA00020390975500001013
Of the different images of (a) the image,
Figure BDA00020390975500001014
is from and
Figure BDA00020390975500001015
classes of different images
Figure BDA00020390975500001016
The image of (2).
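The triple construction of step 1 can be sketched as follows. Random sampling is one assumed strategy; the patent only requires a same-class positive and a different-class negative per anchor:

```python
import random
from collections import defaultdict

def make_triples(dataset, n_triples, seed=0):
    """dataset: list of (image_id, class_label). Returns (anchor, positive, negative) triples."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for img, label in dataset:
        by_class[label].append(img)
    labels = [c for c, imgs in by_class.items() if len(imgs) >= 2]  # need 2 images for a pair
    triples = []
    for _ in range(n_triples):
        c = rng.choice(labels)
        anchor, pos = rng.sample(by_class[c], 2)          # two different images, same class
        c_neg = rng.choice([c2 for c2 in by_class if c2 != c])
        neg = rng.choice(by_class[c_neg])                 # image from a different class
        triples.append(((anchor, c), (pos, c), (neg, c_neg)))
    return triples

data = [(f"img{i}", i % 3) for i in range(12)]   # toy labeled data set with 3 classes
triples = make_triples(data, n_triples=5)
```

Harder mining schemes (e.g. choosing the most confusable negative) would drop in at the `c_neg` selection without changing the triple format.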
2. Select a triple (xi, xi+, xi−) as the input of the network model to obtain the vector representations of each image in the semantic space and the hidden layer space and its semantic attribute attention.
3. Use a triplet loss function to simultaneously maximize the inter-class distance and minimize the intra-class distance in the hidden layer space:

LF = Σi max(0, Δ + ||σi − σi+||² − ||σi − σi−||²)

Use a cross-entropy loss function to maximize the probability of correct classification in the semantic space:

LA = −Σi log( exp(ŝi · syi) / Σy∈Y exp(ŝi · sy) )
4. Repeat steps 2–3 using gradient descent to train the parameters of each module.
5. Use the average hidden layer space representation of the training pictures as the class hidden layer prototype of each training-covered class:

σys = (1/Ns) Σyi=ys σi

wherein Ns is the number of training samples of class ys.
6. Use ridge regression to compute the hidden layer vector representations σyu of the classes not covered by training:

W* = argminW ||H − SW||² + γ||W||²

σyu = syu W*

wherein S stacks the semantic prototypes of the training classes, H stacks their class hidden layer prototypes, and γ is the regularization coefficient.
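The ridge regression of step 6 has the usual closed form W* = (SᵀS + γI)⁻¹SᵀH. A sketch with assumed dimensions and an assumed γ, where the hidden prototypes are generated exactly linearly so the fit can be checked:

```python
import numpy as np

def ridge_fit(S, H, gamma=1e-6):
    """Closed-form solution of argmin_W ||H - S W||^2 + gamma ||W||^2."""
    k = S.shape[1]
    return np.linalg.solve(S.T @ S + gamma * np.eye(k), S.T @ H)

rng = np.random.default_rng(0)
n_seen, k = 40, 85
S = rng.normal(size=(n_seen, k))      # semantic prototypes of the training-covered classes
W_true = rng.normal(size=(k, k))
H = S @ W_true                        # their class hidden layer prototypes (exactly linear here)

W = ridge_fit(S, H)
s_unseen = rng.normal(size=k)         # semantic prototype of a class not covered by training
sigma_unseen = s_unseen @ W           # its estimated hidden layer prototype
```

The regularizer γ keeps the normal equations well-posed even though there are fewer seen classes (40) than attribute dimensions (85); the value 1e-6 is illustrative.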
The sample classification steps of the zero sample learning method based on semantic attribute attention redistribution are as follows:
1. For an input image xi, use the trained model to predict its semantic vector representation si (weighted by attention to ŝi), its hidden layer vector representation σi and its semantic attribute attention pi.
2. For any class yu ∈ Yu, where Yu denotes the classes not covered by training, compute the cosine similarities between image xi and the class semantic prototype syu and the class hidden layer prototype σyu in the semantic space and the hidden layer space:

sims(xi, yu) = (ŝi · syu) / (||ŝi|| ||syu||)

simσ(xi, yu) = (σi · σyu) / (||σi|| ||σyu||)

Sum the cosine similarities of the two spaces to obtain the similarity between image xi and class yu:

sim(xi, yu) = sims(xi, yu) + simσ(xi, yu)

3. Use a nearest neighbor search algorithm to compute the class prediction ŷi of image xi within the class set Yu:

ŷi = argmax yu∈Yu sim(xi, yu)
as shown in fig. 3, a zero sample classification system based on semantic attribute attention redistribution is divided into six modules, which are a visual feature module, a visual-semantic mapping module, a visual-hidden layer mapping module, a semantic-attention module, a classification retrieval module, and a classification generation module.
The method is applied to the following embodiments to achieve the technical effects of the present invention, and detailed steps in the embodiments are not described again.
This embodiment is compared with other current leading zero sample learning methods on three large public data sets: AwA2, CUB and SUN. AwA2 is a coarse-grained, medium-sized data set of 37322 images from 50 animal categories with 85 user-defined attributes. CUB is a fine-grained data set consisting of 11788 images of 200 different birds, with 312 user-defined attributes. SUN is another fine-grained data set comprising 14340 images of 717 different scenes, providing 102 user-defined attributes. Each data set is divided into a training set and a test set, with different splits on different data sets: on AwA2, 40 animal classes are used for training and 10 for testing; similarly, CUB uses 150 training classes and 50 test classes, and SUN uses 645 training classes and 72 test classes. The evaluation index of this embodiment is the class-average recognition accuracy, and 5 current mainstream zero sample recognition algorithms are compared in total; the overall comparison results are shown in Table 1.
TABLE 1
(Table 1, rendered as an image in the original document, reports the class-average accuracy of the proposed method and the compared methods on AwA2, CUB and SUN.)
As can be seen from Table 1, the zero sample learning framework based on semantic attribute attention redistribution provided by the invention achieves the best results under every evaluation index, fully demonstrating the superiority of the algorithm of the invention.
To further show that the proposed algorithm does suppress the more prominent predicted values in the semantic space, the invention compares three variants on the CUB dataset: no attention mechanism, an attention mechanism based on the sigmoid function, and an attention mechanism based on the softmax function. The experimental results are shown in Table 2.
TABLE 2
Method                 | Class-average accuracy (%) | Variance of semantic predictions (×10⁻³)
w/o Attention          | 62.1                       | 2.48
w/ Sigmoid Attention   | 73.5                       | 1.75
w/ Softmax Attention   | 81.1                       | 0.86
As can be seen from Table 2, the attention mechanism based on the softmax function achieves the best experimental result: as the variance of the semantic-space predicted values decreases, prominent predicted values are suppressed more strongly and the model performs better, which fully demonstrates the effectiveness of the attention mechanism in zero-sample learning.
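The variance gap between the two attention variants can be illustrated with a toy numpy sketch (not the patented network): softmax couples the weights so they sum to 1, which compresses the reweighted semantic vector relative to independent sigmoid gates and lowers its variance. The choice of attention logits below is purely illustrative.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy semantic predictions with one prominent value; for illustration the
# attention logits are taken to be the predictions themselves.
semantic = np.array([2.0, -1.0, 0.5, 1.5, -0.5, 0.0])
logits = semantic.copy()

for name, p in (("sigmoid", sigmoid(logits)), ("softmax", softmax(logits))):
    reweighted = p * semantic          # equivalent to diag(p) @ semantic
    print(f"{name:7s} reweighted variance: {reweighted.var():.4f}")
```

On this toy input the softmax-reweighted vector has the smaller variance, mirroring the trend reported in Table 2.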
In addition, the kernel density estimate of the semantic predicted values shown in fig. 4 also reflects that the attention mechanism confines the prediction distribution of the semantic space within a specific range, which fully shows that the proposed algorithm suppresses abnormal semantic predicted values by redistributing the attribute weights of the semantic space.
The embodiments described above are intended to illustrate the technical solutions and advantages of the present invention, and it should be understood that the above-mentioned embodiments are only specific embodiments of the present invention, and are not intended to limit the present invention, and any modifications, additions and equivalents made within the scope of the principles of the present invention should be included in the scope of the present invention.

Claims (5)

1. A zero sample learning method based on a semantic attribute attention re-allocation mechanism is characterized by comprising the following steps:
(1) establishing a neural network model based on a semantic attribute attention redistribution mechanism, wherein the neural network model comprises a visual-to-semantic-attribute-space mapping branch, a visual-to-hidden-layer-space mapping branch and an attention branch, so that when an image is forward-propagated through the network, its semantic features in the semantic attribute space, its hidden layer features in the hidden layer space, and its attention over the semantic attribute space are obtained respectively;
(2) re-assigning weights between semantic features using the attention of the semantic attribute space; the calculation formula is:
ã_i = diag(p_i) · a_i
wherein diag(p_i) is a diagonal matrix of k × k whose diagonal entries are the components of p_i, and a_i represents the vector representation of the image i in the semantic space;
(3) training the neural network model using a labeled image dataset; the specific process is as follows:
(3-1) in the data preparation process, the original training dataset D is divided in advance into a set T composed of a plurality of triples:
T = {(x_i^a, x_i^p, x_i^n)}
wherein, for any triplet (x_i^a, x_i^p, x_i^n), x_i^a and x_i^p are different images from the same class y_i, and x_i^n is an image from a class y_j different from y_i;
(3-2) in the training process, for each triplet (x_i^a, x_i^p, x_i^n), the neural network model is trained with a mixed loss function L; the specific calculation formula of the loss function L is:
L = L_F + L_A
wherein L_F is a loss function defined in the hidden layer space and L_A is a loss function defined in the semantic attribute space;
the hidden layer space loss function uses a triplet loss to simultaneously maximize the inter-class distance and minimize the intra-class distance; the specific calculation formula of the hidden layer loss function is:
L_F = max(0, ‖σ_i^a − σ_i^p‖² − ‖σ_i^a − σ_i^n‖² + m)
wherein σ_i^a, σ_i^p and σ_i^n are the hidden layer vector representations of the triplet images and m is the margin;
the loss function of the semantic attribute space maximizes the classification probability in the semantic space using a cross-entropy-based loss; the specific calculation formula of the semantic loss function is:
L_A = −log( exp(ã_i · a_{y_i}) / Σ_{y∈Y} exp(ã_i · a_y) )
wherein Y is the set of all training classes, ã_i is the attention-weighted semantic representation of step (2), and a_{y_i} is the known semantic attribute prototype of class y_i;
(4) inputting an image to be tested, calculating the similarity between the weighted semantic features of the image and semantic prototypes of unknown classes, calculating the similarity between the hidden layer features and hidden layer feature prototypes of the unknown classes, and adding the two similarities to obtain the similarity between the test image and each unknown class;
(5) sorting the unknown classes according to their similarity to the image, and selecting the class with the maximum similarity as the class prediction for the image.
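The mixed objective of claim 1 (a triplet loss in the hidden layer space plus an attention-weighted cross-entropy in the semantic attribute space) can be sketched in numpy. The margin value, the unweighted sum of the two losses, and all toy dimensions are assumptions for illustration, not the patented implementation:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def triplet_loss(sa, sp, sn, margin=0.2):
    """L_F: pull same-class hidden vectors together, push a different-class
    vector at least `margin` further away (margin value assumed)."""
    return max(0.0, np.sum((sa - sp) ** 2) - np.sum((sa - sn) ** 2) + margin)

def semantic_ce_loss(a_weighted, prototypes, y):
    """L_A: cross-entropy over class semantic-attribute prototypes."""
    scores = prototypes @ a_weighted       # one score per training class
    return -np.log(softmax(scores)[y])

# Toy sizes: k = 4 attributes, 3 training classes, d = 5 hidden dims.
rng = np.random.default_rng(1)
p = softmax(rng.normal(size=4))            # attention weights p_i
a = rng.normal(size=4)                     # raw semantic prediction a_i
a_tilde = np.diag(p) @ a                   # step (2): diag(p_i) · a_i
prototypes = rng.normal(size=(3, 4))       # class attribute prototypes
sa, sp, sn = rng.normal(size=(3, 5))       # hidden vectors of one triplet

L = triplet_loss(sa, sp, sn) + semantic_ce_loss(a_tilde, prototypes, y=0)
print(f"mixed loss L = {L:.3f}")
```

Both terms are non-negative here (the hinge is clipped at zero and the cross-entropy of a probability below one is positive), so the mixed loss is always a valid minimization target.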
2. The zero-sample learning method based on the semantic attribute attention re-allocation mechanism as claimed in claim 1, wherein in the step (1), the visual-semantic space mapping branch and the visual-hidden layer space mapping branch use a VGG19 backbone network as a shared shallow network, and use different fully connected layers for the feature mapping of the different spaces respectively;
the attention branch performs feature extraction on feature maps of different layers of the VGG19 backbone network using single-layer convolutional neural networks with a convolution kernel size of 3 and separate parameters, and calculates the attention of the semantic attribute space corresponding to each layer's VGG19 feature map using a feature fusion method.
3. The zero-sample learning method based on the semantic attribute attention re-allocation mechanism as claimed in claim 1, wherein in the step (1), the specific processes of obtaining the semantic features of the image in the semantic attribute space, the hidden layer features in the hidden layer space and the attention in the semantic attribute space are as follows:
extracting the deep visual feature θ_i of the image input x_i using a pre-trained deep convolutional neural network, and respectively mapping the deep visual feature of the image to the semantic space and the hidden layer space using fully-connected neural networks; the calculation formulas of the semantic space and the hidden layer space are:
a_i = FC_1(θ_i)
σ_i = FC_2(θ_i)
wherein a_i represents the vector representation of the image i in the semantic space, σ_i represents the vector representation of the image i in the hidden layer space, FC_1 represents the mapping function from the visual space to the semantic space, and FC_2 represents the mapping function from the visual space to the hidden layer space;
selecting, for a layer l in the deep convolutional neural network, the intermediate feature map representation F_i^l of the image i, and calculating, together with the hidden layer vector σ_i, the semantic attribute attention of the image i at visual depth l as:
p_{i,l} = softmax(W_l · f_i^l + b_l)
wherein W_l and b_l are the parameters of a single-layer fully-connected network and f_i^l is the fusion representation of the hidden layer vector and the visual features at depth l; the calculation formula of the feature fusion representation is:
f_i^l = σ_i ⊙ Σ_channel F_sq(Conv_l(F_i^l))
wherein F_sq is a transform function of a matrix, converting a three-dimensional matrix representation of size K × H × W into a two-dimensional matrix representation of K × HW, Σ_channel denotes summing the matrix per channel over its HW dimension, Conv_l(F_i^l) is the result of passing the feature map F_i^l through a series of convolutions, and K represents the number of channels after feature fusion, the channel length being consistent with the lengths of the semantic vector representation and the hidden layer vector representation; finally, selecting the layer numbers l ∈ l_B, the attention of the image i in the semantic attribute space is calculated as:
p_i = (1/|l_B|) Σ_{l∈l_B} p_{i,l}
wherein p_{i,l} is the semantic attribute attention of the image i at visual depth l.
4. The zero-sample learning method based on the semantic attribute attention re-allocation mechanism as claimed in claim 1, wherein the step (4) comprises the following specific steps:
(4-1) for an input image x_i, predicting its semantic vector representation a_i, its hidden layer vector representation σ_i and its semantic attribute attention p_i using the trained model;
(4-2) for any class y_u ∈ Y_u, wherein Y_u represents the classes not covered by training, respectively calculating the cosine similarity between the image x_i and the class semantic prototype a_{y_u} in the semantic attribute space, and between the image x_i and the class hidden layer prototype σ_{y_u} in the hidden layer space; the calculation formula of the cosine similarity of the semantic attribute space is:
s_A(x_i, y_u) = ⟨diag(p_i) · a_i, diag(p_i) · a_{y_u}⟩ / (‖diag(p_i) · a_i‖ · ‖diag(p_i) · a_{y_u}‖)
wherein diag(p_i) is a diagonal matrix of k × k whose diagonal entries are the components of p_i;
the calculation formula of the cosine similarity of the hidden layer space is:
s_F(x_i, y_u) = ⟨σ_i, σ_{y_u}⟩ / (‖σ_i‖ · ‖σ_{y_u}‖)
summing the cosine similarities of the two spaces gives the similarity between the image x_i and the class y_u; the calculation formula is:
s(x_i, y_u) = s_A(x_i, y_u) + s_F(x_i, y_u)
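The two-space scoring of claim 4, followed by the argmax of claim 5, can be sketched in numpy; the random prototypes and dimensions are hypothetical stand-ins for the learned class prototypes:

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def score(a, sigma, p, proto_a, proto_sigma):
    """Similarity of an image to one unseen class: attention-weighted
    cosine in semantic space plus cosine in hidden layer space."""
    w = np.diag(p)                         # diag(p_i)
    return cosine(w @ a, w @ proto_a) + cosine(sigma, proto_sigma)

rng = np.random.default_rng(3)
k, d = 5, 8                                # attribute and hidden dimensions
a, sigma = rng.normal(size=k), rng.normal(size=d)
p = np.abs(rng.normal(size=k))             # non-negative attention (toy)
# Hypothetical (semantic, hidden) prototypes for two unseen classes:
protos = [(rng.normal(size=k), rng.normal(size=d)) for _ in range(2)]
scores = [score(a, sigma, p, pa, ps) for pa, ps in protos]
pred = int(np.argmax(scores))              # claim 5: nearest-neighbor class
print(pred, scores)
```

Each cosine lies in [−1, 1], so every class score lies in [−2, 2]; the prediction is simply the unseen class with the largest combined score.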
5. The zero-sample learning method based on the semantic attribute attention re-allocation mechanism as claimed in claim 4, wherein the specific process of the step (5) is as follows:
computing the class prediction ŷ_i of the image x_i within the class set Y_u using a nearest-neighbor search algorithm; the calculation formula is:
ŷ_i = argmax_{y_u ∈ Y_u} s(x_i, y_u)
CN201910335801.6A 2019-04-24 2019-04-24 Zero sample learning method and system based on semantic attribute attention redistribution mechanism Active CN110163258B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910335801.6A CN110163258B (en) 2019-04-24 2019-04-24 Zero sample learning method and system based on semantic attribute attention redistribution mechanism


Publications (2)

Publication Number Publication Date
CN110163258A CN110163258A (en) 2019-08-23
CN110163258B true CN110163258B (en) 2021-04-09

Family

ID=67639900

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910335801.6A Active CN110163258B (en) 2019-04-24 2019-04-24 Zero sample learning method and system based on semantic attribute attention redistribution mechanism

Country Status (1)

Country Link
CN (1) CN110163258B (en)

Families Citing this family (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110866140B (en) * 2019-11-26 2024-02-02 腾讯科技(深圳)有限公司 Image feature extraction model training method, image searching method and computer equipment
CN111222471B (en) * 2020-01-09 2022-07-15 中国科学技术大学 Zero sample training and related classification method based on self-supervision domain perception network
CN111428733B (en) * 2020-03-12 2023-05-23 山东大学 Zero sample target detection method and system based on semantic feature space conversion
CN111461025B (en) * 2020-04-02 2022-07-05 同济大学 Signal identification method for self-evolving zero-sample learning
CN111738313B (en) * 2020-06-08 2022-11-11 大连理工大学 Zero sample learning algorithm based on multi-network cooperation
CN112100380B (en) * 2020-09-16 2022-07-12 浙江大学 Generation type zero sample prediction method based on knowledge graph
CN112257808B (en) * 2020-11-02 2022-11-11 郑州大学 Integrated collaborative training method and device for zero sample classification and terminal equipment
CN112633382B (en) * 2020-12-25 2024-02-13 浙江大学 Method and system for classifying few sample images based on mutual neighbor
CN112686318B (en) * 2020-12-31 2023-08-29 广东石油化工学院 Zero sample learning mechanism based on sphere embedding, sphere alignment and sphere calibration
CN113077427B (en) * 2021-03-29 2023-04-25 北京深睿博联科技有限责任公司 Method and device for generating class prediction model
CN113326892A (en) * 2021-06-22 2021-08-31 浙江大学 Relation network-based few-sample image classification method
CN113627470B (en) * 2021-07-01 2023-09-05 汕头大学 Zero-order learning-based unknown event classification method for optical fiber early warning system
CN113435531B (en) * 2021-07-07 2022-06-21 中国人民解放军国防科技大学 Zero sample image classification method and system, electronic equipment and storage medium
CN113343941B (en) * 2021-07-20 2023-07-25 中国人民大学 Zero sample action recognition method and system based on mutual information similarity
CN113642621A (en) * 2021-08-03 2021-11-12 南京邮电大学 Zero sample image classification method based on generation countermeasure network
CN114627312B (en) * 2022-05-17 2022-09-06 中国科学技术大学 Zero sample image classification method, system, equipment and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107679556A (en) * 2017-09-18 2018-02-09 天津大学 The zero sample image sorting technique based on variation autocoder
CN108846413A (en) * 2018-05-21 2018-11-20 复旦大学 A kind of zero sample learning method based on global semantic congruence network
CN109447115A (en) * 2018-09-25 2019-03-08 天津大学 Zero sample classification method of fine granularity based on multilayer semanteme supervised attention model




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant