CN112364894B

CN112364894B - Zero sample image classification method of countermeasure network based on meta-learning

Info

Publication number: CN112364894B
Application number: CN202011147848.9A
Authority: CN
Inventors: 冀中; 崔碧莹
Original assignee: Tianjin University
Current assignee: Tianjin University
Priority date: 2020-10-23
Filing date: 2020-10-23
Publication date: 2022-07-08
Anticipated expiration: 2040-10-23
Also published as: CN112364894A

Abstract

The invention belongs to the technical field of image classification, and particularly relates to a zero sample image classification method of a countermeasure network based on meta-learning. The method can make the generalized zero sample image classification capability more prominent, improve the generalization capability of the model and relieve the field offset problem commonly existing in zero sample learning.

Description

Zero sample image classification method of countermeasure network based on meta-learning

Technical Field

The invention belongs to the technical field of image classification, and particularly relates to a zero sample image classification method of a countermeasure network based on meta-learning.

Background

In recent years, machine learning has been widely applied in the fields of natural language processing, computer vision, speech recognition, etc., while in the field of computer vision, the task of image classification is one of the most concerned and most widely applied tasks, and various classification techniques are developed endlessly and the performance is continuously improved. In a machine learning task, a supervised learning method for realizing classification through a large number of artificially labeled images is a traditional method for image classification, and is well applied to real life. However, it is not easy to collect enough samples and label for each category of image in practice, and a lot of labor is consumed. It is easy to understand that species distribution in nature presents a long tail effect, only a few classes of species have enough image samples for supervised learning to train a classification model, and many classes of species have few samples and difficult label labeling, which makes supervised learning a huge challenge. Therefore, zero sample learning arises in order to solve the problem of sample label loss.

The zero sample image classification is an important direction of zero sample learning and is used for solving the classification problem of difficult image labeling, in the traditional zero sample image classification setting, a visible image sample and a label training model thereof are utilized, an unseen image sample test model is utilized, and the classes of the test image and the classes of the training image are not intersected under the setting; whereas in the generalized zero-sample image classification setting, the test image sample includes both images of the visible class and images of the unseen class. The zero sample learning referred to in this patent includes the two settings described above. Currently, the main research methods for zero sample image classification can be roughly divided into two types: the method is based on mapping, and the visual characteristic and the semantic characteristic are mapped through mapping between a visual characteristic space and a semantic characteristic space or mapping from the visual characteristic space and the semantic characteristic space to a public space, so that a better classification result is obtained; and the other is a generation-based method, which utilizes generation models such as a generation countermeasure network and a variational self-encoder to generate pseudo features of the test sample, and determines the category of the test sample by comparing the similarity between the generated pseudo features and the real features.

In order to complete the prediction of the test sample class, the zero sample image classification technology achieves the effect of knowledge migration by utilizing the semantic information of a visible class and an invisible class. The experimental setup is as follows: in the training phase, a labeled sample of a visible class is given

Where n is the number of samples of the visible class,

is the visual characteristic of the ith sample,

indicating its corresponding category label and, in addition,

representing its corresponding class level semantic prototype. The traditional zero-sample image classification is a semantic feature A of a given unseen class_UTo test a sample x_tClassified into the unseen class Y_UIn, and

the generalized zero-sample image classification is to classify a test sample x according to the semantic features of a visible class and an invisible class_tThe classification is visible and invisible. In summary, the zero-sample image classification is to train a model by using the relevant features of the visible class samples, and predict the class label y of the test sample by using the model_t。

The feature representation can be incomplete by learning a simple mapping relation between a visual space and a semantic space, and meanwhile, a low-dimensional pivot point problem can be generated. Simple mapping from a high-dimensional visual space to a low-dimensional semantic space by learning can cause a pivot point phenomenon that samples of different classes in the high-dimensional are compressed to the same class of semantics in the low-dimensional, and similar problems can also occur with simple mapping from the low-dimensional space to the high-dimensional space. In recent years, the generation of countermeasure networks has gained attention by researchers, and in combination with zero sample learning, the accuracy of classification is improved by generating a large number of pseudo features. However, the essential disadvantage of generating the countermeasure network is that the training process is unstable, and the problem of mode collapse is easily caused. Yet another generation-based approach introduces a variational auto-encoder (VAE) that generates pseudo-visual features by inputting VAEs conditioned on semantic information. VAEs tend to distort the visual features generated due to the introduction of lower bounds of variation.

Disclosure of Invention

The invention aims to: aiming at the defects of the prior art, the zero sample image classification method of the countermeasure network based on the meta-learning is provided, and the zero sample image classification accuracy can be improved.

In order to achieve the purpose, the invention adopts the following technical scheme:

a zero sample image classification method of a countermeasure network based on meta-learning comprises the following steps:

1) randomly selecting M categories from the visible category as a training set of one epicode, and using the rest categories in the visible category as a test category of the epicode, thereby obtaining the training set

It can be known that

Wherein n is_trFor the number of training set samples, x, in each epamode_iVisual characteristics of the ith training sample, y_iFor the corresponding class label of the ith training sample, a_i∈A_trA semantic prototype of the class of the ith training sample, and a_te∈A_teDefining two memory modules m for semantic prototype of test class in an epicode₁、m₂；

2) Visual feature x of a training sample_iRandomly selecting data x of a set batch, and inputting the data x into a encoder E₁And decoder D₁In a constituent variational autocoder, pseudo-visual features similar to those of real visual samples are generated

The reconstruction constraints are as follows:

wherein the content of the first and second substances,

is expressed by a 2 norm;

3) after passing through the variational self-encoder, calculating a variational self-encoder loss function L_VAE；

4) Inputting the generated pseudo-visual features into a softmax classifier after passing through a dimension reduction matrix W, obtaining the probability that one-hot classification results represent each class, and calculating the classification loss according to the real labels as follows:

wherein f represents a softmax classifier, W is a classifier parameter, and the function is to reduce the dimension of the generated features to M dimension and compare the dimension with a real label y, and define W as a classifier of a visual mode;

5) the visual feature x of the training sample and the generated pseudo visual feature

Input into a discriminator D, and the countermeasure loss is L_D：

；

6) Calculating the distillation loss L of this epsicode visual modality training process_kd-wAnd L_kd-v；

7) Setting a target function as the sum of the loss functions, and carrying out multiple iteration training on the variational self-encoder of the visual mode:

wherein λ is₁、λ₂For the weight coefficients characterizing the reconstruction loss and the loss of the variational autocoder, the encoder E of the trained variational autocoder₁And decoder D₁The parameters are respectively stored in the two memory modules;

8) the category semantic prototype a in the training class_trAs an input to an auto-encoder, corresponding visual prototypes are generated

At the same time handle

Classifiers defined as semantic modalities, using

Classifying the reconstructed features and calculating the classification loss L_cls2：

；

9) Classifier for constraining semantic modality by classifier W of visual modality

So as to obtain the distillation constraint of vision to semantics and calculate the distillation loss L_kd2

；

10) The objective function of the self-encoder for training the semantic modalities is as follows:

L_a＝L_cls2+λ₃L_sup+λ₄L_kd2

wherein L is_supSupervision of decoders of semantic modalities for decoders of visual modalities, λ₃And λ₄Weight coefficients for the supervision loss and distillation loss, respectively;

11) procedure for testing of the epicode: semantic prototype a of test set_teInput to trained encoder E₂And decoder D₂In order to obtain a corresponding visual prototype

12) Will be provided with

And

the classifiers are spliced together to obtain all visible classes

At this time, the sorter C is reused_SClassifying all visible samples, calculating classification loss, and finely adjusting the previously learned parameters:

；

13) semantic features a of test samples of visible classes and invisible classes_tInputting into semantic encoder and decoder, and adding the generated visual feature prototype and x_tComparing, and obtaining a classification result by using a nearest neighbor method;

14) and repeating the steps 1) to 13) to finish the meta-training process of a plurality of epicodes until the optimal classification performance is obtained.

As an improvement of the zero sample image classification method based on the meta-learning confrontation network, the step 2) generates the pseudo-visual characteristics

And step 3) calculating L_VAEThe working process comprises the following steps:

(2.1) training visual characteristics x of the sample_iRandomly selecting data x of set batch, and inputting the data x into an encoder E₁The probability distribution of the latent variable z is obtained as follows:

p(z|x)＝N(μ,Σ)

wherein p (z | x) represents the distribution of the latent variable z, μ, Σ represent the mean and variance of the latent variable z, respectively, and N represents a normal distribution;

(2.2) input z to decoder D₁In generating pseudo-visual features

Wherein, w₁、v₁Are respectively an encoder E₁And decoder D₁The parameters of (a);

(2.3) calculating the variational autocoder loss function L_VAE：

Wherein L is_VAERepresenting the variational self-encoder loss function,

representing the calculation of the expectation over the distribution of the underlying variable z, p (x | z) representing the distribution of the visual features generated by the underlying variable z, q (z | x) representing the conditional distribution of the underlying variable z, p (z) representing the prior distribution of the underlying variable z, set to a normal distribution, log being a logarithmic operation, D_KLCalculated for KL divergence.

As an improvement of the zero sample image classification method based on the meta-learning confrontation network, the distillation loss L of the step 6) is calculated_kd-wAnd L_kd-vThe working process comprises the following steps:

using encoders E stored in memory modules₁And decoder D₁Parameters calculation distillation loss:

wherein, w_1-beforeAnd v_1-beforeRespectively representing the code stored in the previous epsilon of the two memory modulesDevice E₁Parameter of and decoder D₁When epicode is 1, w_1-before＝v_1-before＝0。

As an improvement of the zero sample image classification method based on meta-learning confrontation network of the invention, the step 8) of generating visual prototype

The working process comprises the following steps:

(4.1) class semantic prototype a in training class_trAs an encoder E₂Is input of a_trMapping to a hidden space with the same dimension as z to obtain z_a：

z_a＝E₂(a_tr,w₂)

Wherein w₂For an encoder E₂The parameters of (1);

(4.2) adding z_aInput to a decoder D₂In (2), generating corresponding visual prototypes

And is

With true visual features x_iThe dimensions are the same:

wherein v is₂Is a decoder D₂The parameter (c) of (c).

As an improvement of the zero sample image classification method based on the meta-learning confrontation network, the step 10) calculates L_supThe working process comprises the following steps:

wherein v is₁、v₂Are respectively a decoder D₁And D₂The 2 norm algorithm is used to make the decoder of the semantic mode and the decoder of the visual mode similar, so that the generated visual prototype is closer to the real visual prototype.

The invention has the advantages that the invention completes an epicode meta-training process by utilizing a method of generating a network by two paths, so that a semantic classifier learns the visual classifier, and the zero sample learning performance is improved more intuitively and efficiently by utilizing the confrontation of a generator and a discriminator and the knowledge distillation of the characteristics between the front epicode and the back epicode. The training mode of meta-learning is used in a zero sample classification task, the visual characteristic and the semantic characteristic are input into a network in sequence, the learning task for zero sample image classification is simulated in the training stage, the generation process of the visual characteristic is finished, the alignment relation of different classifiers is guaranteed, and meanwhile, the knowledge obtained by each epsilon task is fully utilized, so that the semantic classifier is trained better under the supervision of the visual classifier, the visual characteristic and the semantic characteristic which are closer to real distribution are synthesized, and a zero sample image classification technology suitable for the real situation is designed. Therefore, the method can make the generalized zero sample image classification capability more prominent, improve the generalization capability of the model, and relieve the field offset problem commonly existing in zero sample learning, so that the classification task in a more real scene can be realized, the zero sample learning can be promoted to be applied to the production and life practice, and the deep learning algorithm can be accelerated to be developed to be practical.

Drawings

Features, advantages and technical effects of exemplary embodiments of the present invention will be described below with reference to the accompanying drawings.

Fig. 1 is a schematic structural diagram of meta learning in the present invention.

Detailed Description

As used in the specification and in the claims, certain terms are used to refer to particular components. As one skilled in the art will appreciate, manufacturers may refer to a component by different names. This specification and claims do not intend to distinguish between components that differ in name but not function. In the following description and in the claims, the terms "include" and "comprise" are used in an open-ended fashion, and thus should be interpreted to mean "include, but not limited to. "substantially" means within an acceptable error range, that a person skilled in the art can solve the technical problem within a certain error range, and that a technical effect is substantially achieved.

Furthermore, the terms "first," "second," and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.

In the present invention, unless otherwise expressly specified or limited, the terms "mounted," "connected," "secured," and the like are to be construed broadly and can, for example, be fixedly connected, detachably connected, or integrally connected; can be mechanically or electrically connected; they may be connected directly or indirectly through intervening media, or they may be interconnected between two elements. The specific meanings of the above terms in the present invention can be understood by those skilled in the art according to specific situations.

The present invention will be described in further detail with reference to fig. 1, but the present invention is not limited thereto.

The invention discloses a zero sample image classification method of a meta-learning based confrontation network, which is based on the basic idea that each subtask is used for simulating the whole generalized zero sample image classification process, and a knowledge distillation method is adopted among the tasks to enhance the memory and generalization capability of a model. In each epsilon task, a plurality of classes are randomly selected from all visible classes to serve as the visible classes in each task and are used for simulating generalized zero sample learning, after a visual classifier is learned by using a variational self-encoder, the visual classifier is used for guiding the semantics and learning to a semantic classifier. In the learning process of each epicode, the related parameters are stored in the memory module to supervise the learning of the related parameters of the next epicode, so that the function of knowledge distillation is achieved. Meanwhile, the supervision of the semantic classifier by the visual classifier can also be regarded as the operation of knowledge distillation. In the testing process after each epsilon training, the nearest neighbor is used for classifying the testing samples, and the zero sample image classification technology is realized.

In zero-sample image classification, the currently common training mode is to train a model with a visible class in a single round of multiple iterations, and then predict the class of a test sample, where the test sample includes both the visible class sample and the unseen class sample. In recent years, meta learning has been widely used in learning with few samples and has achieved excellent performance. Among training methods for meta-learning, meta-training methods based on sets (epicodes) are widely used. In the training mode, each epicode updates the model by using different training data in the training process, so that the previous knowledge and experience are fully utilized to guide the learning of a new task.

The invention discloses a zero sample image classification method of a countermeasure network based on meta-learning, which comprises the steps of firstly dividing an image data set into a visible class and an unseen class, then randomly selecting M classes from the visible class as a training set of an epicode, and using the rest classes in the visible class as a test class of the epicode. Given training set

It can be known that

Wherein n is_trFor the number of training set samples, x, in each epamode_iFor the visual characteristics of the i-th training sample, y_iFor the corresponding class label of the ith training sample, a_i∈A_trA semantic prototype of the class of the ith training sample, and a_te∈A_teSemantic prototypes of the test classes in each epicode. Given x_tTo test the visual characteristics of the sample, a_tThe category semantic features of the test sample are obtained. As shown in fig. 1, the following steps are performed:

1) m categories are randomly selected from the visible category as a training set of one epicode, and the remaining categories in the visible category are used as a test category of the epicode. Encoder E in visual modal variational auto-encoder is initialized respectively₁And decoder D₁Semantic moduleEncoder E in state self-encoder₂And decoder D₂And parameter w of discriminator D₁、v₁、w₂、v₂And r, defined to store the parameter w₁、v₁The two memory modules are m₁、m₂；

2) In this epicode, the visual characteristics x of the sample will be trained_iRandomly selecting data x of a set batch as an encoder E₁The input of (1);

3) generating a pseudo visual feature formula according to the following steps to obtain the generated pseudo visual feature

Wherein, the encoder E₁Is a latent variable, denoted by z, the probability distribution of z is expressed as follows:

p(z|x)＝N(μ,Σ) (2)

wherein p (z | x) represents the distribution of the latent variable z, μ, Σ represent the mean and variance of the latent variable z, respectively, and N represents the normal distribution;

4) after passing through the variational self-encoder, the pseudo visual features expected to be generated are close to real features, and a feature reconstruction loss function and a variational self-encoder loss function are respectively calculated:

wherein L is_rec1A function representing the loss of reconstruction is represented,

is represented by a 2 norm, L_VAERepresenting a variational autocoder loss function, E_PE(zx)Representing the calculation of the expectation over the distribution of the underlying variable z, p (x | z) representing the distribution of the visual features generated by the underlying variable z, q (z | x) representing the conditional distribution of the underlying variable z, p (z) representing the prior distribution of the underlying variable z, set to a normal distribution, log being a logarithmic operation, D_KLCalculating KL divergence;

5) inputting the generated pseudo-visual features into a softmax classifier after passing through a dimension reduction matrix W, obtaining the probability that one-hot classification results represent each class, and calculating the classification loss according to the real labels as follows:

wherein f represents a softmax classifier, W is a classifier parameter and is used for reducing the dimension of the generated features to M dimension and then comparing with the real label y. W is defined herein as the classifier of the visual modality.

6) The visual feature x of the training sample and the generated pseudo visual feature

Inputting the data into a discriminator D, training the discriminator D by using a countering loss function formula, and reserving a parameter r which enables the performance of the discriminator D to be the best, wherein the countering loss function formula is as follows:

wherein L is_DAs a function of the penalty of the discriminator D, E_xTo calculate the expectation over the distribution of the visual features x of the training sample,

for pseudo-visual features being generated

Over-distribution calculation period ofInspection;

7) the distillation loss for this epsicode was calculated as follows:

wherein w_1-beforeAnd v_1-beforeRespectively representing the encoders E stored in the two memory modules in the immediately preceding epamode₁Parameter of and decoder D₁When epicode is 1, w_1-before＝v_1-before＝0；

8) E in the visual variation self-encoder is trained by adding the loss functions of the formulas (3) to (8)₁And D₁Updating the memory module;

wherein λ is₁、λ₂Weight coefficients characterizing the reconstruction loss and the variational self-coder loss.

9) In the epicode, the category semantic prototype a in the training class is further processed_trAs an input to an autocoder, of which the encoder E₂Mapping the category semantic prototype into a hidden space with the same dimension as z, and passing through a decoder D₂Reconstructing the features of the hidden space into the visual space to generate corresponding visual prototypes, where the decoder uses D₁And (4) supervision and constraint:

wherein, the first and the second end of the pipe are connected with each other,

is a visual prototype generated from a category semantic prototype, defined as a classifier of semantic modalities, L_supRepresentation decoder D₁To D ₂2 norm constraints of;

10) meanwhile, the visual prototype features also need to be constrained by a dimensionality reduction matrix W, namely, a classifier of a visual mode is used for constraining a classifier of a semantic mode, so that distillation constraint of vision on semantics and distillation loss L are obtained_kd2The following were used:

11) by using

Classifying the features, and calculating classification loss:

12) the loss functions in equations (11) to (13) are added to train the encoder E₂And decoder D₂：

L_a＝L_cls2+λ₃L_sup+λ₄L_kd2 (14)

Wherein λ is₃And λ₄Weight coefficients for the supervision loss and distillation loss, respectively;

13) semantic prototype a of the testing set of the epicode_teInput to trained encoder E₂And decoder D₂Obtaining a corresponding visual prototype:

14) can be utilizedTo

And

the classifiers are spliced together to obtain all visible classes

At this point, the sorter C is reused_SClassifying all visible samples, calculating classification loss, and calculating parameter w₁、v₁、w₂、v₂And r for fine tuning:

15) semantic features a of test samples of visible classes and invisible classes_tInputting the visual feature prototype and x into semantic encoder and decoder_tAnd (5) comparing, and obtaining a classification result by using a nearest neighbor method.

16) And repeating the steps 1) to 15), and finishing the meta-training process of a plurality of epicodes until the optimal classification performance is obtained.

Variations and modifications to the above-described embodiments may also occur to those skilled in the art, which fall within the scope of the invention as disclosed and taught herein. Therefore, the present invention is not limited to the above-mentioned embodiments, and any obvious improvement, replacement or modification made by those skilled in the art based on the present invention is within the protection scope of the present invention. Furthermore, although specific terms are employed herein, they are used in a generic and descriptive sense only and not for purposes of limitation.

Claims

1. A zero sample image classification method of a countermeasure network based on meta-learning is characterized by comprising the following steps:

1) randomly selecting M classes from the visible classes as a training set for an epsode,the remaining classes in the visible class serve as the test class for this epsilon, training the set

It can be known that

Wherein n is_trFor the number of training set samples, x, in each epamode_iFor the visual characteristics of the i-th training sample, y_iFor the corresponding class label of the ith training sample, a_i∈A_trA semantic prototype of the class of the ith training sample, and a_te∈A_teDefining two memory modules m for semantic prototype of test class in an epicode₁、m₂；

The reconstruction constraints are as follows:

wherein the content of the first and second substances,

is expressed by a 2 norm;

3) after passing through the variational autocoder, calculating a variational autocoder loss function L_VAE；

4) Inputting the generated pseudo-visual features into a softmax classifier after passing through a dimensionality reduction matrix W, obtaining the probability that one-hot classification results represent each class, and calculating the classification loss according to the real labels as follows:

Input into a discriminator D, and the countermeasure loss is L_D：

At the same time handle

Classifier defined as a semantic modalityBy using

；

；

L_a＝L_cls2+λ₃L_sup+λ₄L_kd2

11) test procedure of the epicode: semantic prototype a of test set_teInput to trained encoder E₂And decoder D₂In order to obtain a corresponding visual prototype

12) Will be provided with

And

are spliced togetherTo obtain all visible classes

；

13) semantic features a of test samples of visible classes and invisible classes_tInputting into semantic encoder and decoder, and adding the generated visual feature prototype and x_tBy contrast, where x_tObtaining a classification result by using a nearest neighbor method for testing the visual characteristics of the sample;

2. The method for zero-sample image classification of countermeasure network based on meta-learning as claimed in claim 1, wherein the step 2) of generating pseudo-visual features

And step 3) calculating L_VAEThe working process comprises the following steps:

p(z|x)＝N(μ,Σ)

(2.2) input z to decoder D₁In generating pseudo visual features

Wherein, w₁、v₁Are respectively an encoder E₁And decoder D₁The parameters of (1);

(2.3) calculating the variational autocoder loss function L_VAE：

Wherein L is_VAERepresenting the variational self-encoder loss function,

3. The method for zero-sample image classification based on meta-learning confrontation network as claimed in claim 1, wherein the step 6) of calculating distillation loss L_kd-wAnd L_kd-vThe working process comprises the following steps:

wherein w_1-beforeAnd v_1-beforeRespectively representing the encoders E stored in the two memory modules in the immediately preceding epamode₁Parameter of and decoder D₁When epicode is 1, w_1-before＝v_1-before＝0。

4. The method for zero-sample image classification of countermeasure network based on meta-learning as claimed in claim 1, wherein the step 8) of generating visual prototype

The working process comprises the following steps:

z_a＝E₂(a_tr,w₂)

Wherein, w₂For an encoder E₂The parameters of (1);

And is

With true visual features x_iThe dimensions are the same:

wherein v is₂Is a decoder D₂The parameter (c) of (c).

5. The method for zero-sample image classification of countermeasure network based on meta-learning as claimed in claim 1, wherein the calculation L of step 10) is_supThe working process comprises the following steps: