CN111476294B - Zero-shot image recognition method and system based on a generative adversarial network - Google Patents

Zero-shot image recognition method and system based on a generative adversarial network

Info

Publication number
CN111476294B
Authority
CN
China
Prior art keywords
semantic
visual
discriminator
features
loss function
Prior art date
Legal status
Expired - Fee Related
Application number
CN202010263452.4A
Other languages
Chinese (zh)
Other versions
CN111476294A (en)
Inventor
张桂梅
龙邦耀
Current Assignee
Nanchang Hangkong University
Original Assignee
Nanchang Hangkong University
Priority date
Filing date
Publication date
Application filed by Nanchang Hangkong University filed Critical Nanchang Hangkong University
Priority to CN202010263452.4A
Publication of CN111476294A
Application granted
Publication of CN111476294B


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent


Abstract

The invention discloses a zero-shot image recognition method and system based on a generative adversarial network. The method comprises the following steps: acquiring training image samples with annotation information and test image samples without annotation information; constructing a generative adversarial network model comprising a semantic feature generator, a visual feature generator, a semantic discriminator and a visual discriminator; constructing a multi-objective loss function comprising a cycle consistency loss function, an adversarial loss function of the semantic discriminator, an adversarial loss function of the visual discriminator and a classification loss function of the semantic discriminator; taking the training image samples as the input of the generative adversarial network model and iteratively training the model based on the multi-objective loss function to obtain a trained generative adversarial network model; and inputting the test image samples into the trained generative adversarial network model to obtain a recognition result. The invention can recognize sketches without annotation information and achieves high zero-shot recognition accuracy.

Description

Zero-shot image recognition method and system based on a generative adversarial network
Technical Field
The invention relates to the field of weakly/semi-supervised image recognition, and in particular to a zero-shot image recognition method and system based on a generative adversarial network.
Background
The concept of Zero-Shot Learning (ZSL) was first proposed by H. Larochelle et al. in 2008. It is mainly used to solve the problem of how to correctly classify and recognize unknown new objects when the labeled training samples do not sufficiently cover all object classes. If, following traditional supervised learning, a classifier is learned on the training set and then applied to the test sample set, the classification performance is poor because the sample distributions of the two domains differ. This image recognition problem is called zero-shot recognition.
Zero-shot recognition requires only labeled samples of known classes to predict unknown classes. The main idea is to introduce category semantic information as a middle-layer feature and to link visual features with semantic features. At the feature level, the key problems in realizing zero-shot recognition are therefore: 1) finding visual features that can fully express the visual information of an image and semantic features that can fully represent its semantics; 2) how to relate the visual features to the category semantic information.
For key problem 1), finding visual features that can sufficiently express the visual information of an image is one of the challenges of zero-shot recognition. With the rise of deep learning, researchers extract discriminative image features with deep convolutional neural networks. Zero-shot image recognition requires not only the visual features of an image but also semantic features that represent the semantics of image classes, so as to link known classes to unknown classes. The most widely used semantic features at present are attribute features and text features. Because attribute features are annotated manually, their accuracy is poor. In recent years, with the development of natural language processing techniques, research using text description features instead of attribute features has received much attention. Text description features can be extracted directly from a corpus, and each class corresponds to a vector in the text description space. Compared with attribute features, text description features can obtain the text vector of any word from an unlabeled text corpus through natural language processing, and therefore scale better. A commonly used text vector extraction method is Word2Vec.
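As an illustration of this Word2Vec step, a minimal gensim sketch follows; the toy corpus, the 300-dimensional vector size and the class names are placeholders, not data from the patent:

```python
from gensim.models import Word2Vec

# toy corpus: in practice, sentences come from a large unlabeled text corpus
corpus = [["zebra", "has", "black", "and", "white", "stripes"],
          ["dolphin", "lives", "in", "the", "water"]]
model = Word2Vec(sentences=corpus, vector_size=300, window=5, min_count=1)
zebra_vec = model.wv["zebra"]   # 300-d text vector for the class "zebra"
```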
Existing semantic feature spaces can be divided into three categories: (1) attribute-based semantic feature spaces; (2) text-based semantic feature spaces; (3) common semantic feature spaces. After the semantic feature space is selected, how to establish the mapping relationship between visual features and semantic features is the second key problem of zero-shot recognition.
For key problem 2), after the semantic features of the known and unknown classes are extracted in a given semantic space, the semantic correlation between classes can be obtained from the similarity between semantic features. However, sample images are represented by visual features in the visual space, and because of the semantic gap they cannot be linked directly to the semantic features of the semantic space. Most existing methods learn a mapping function from the visual space to the semantic space using the visual features of known-class pictures and the semantic features of the corresponding labels. The visual features of a test image are then mapped into the semantic space through this mapping function, giving predicted semantic features. Finally, the closest unknown-class semantic features are found to determine the class to which the image belongs.
In zero-shot image recognition, since the known and unknown classes are disjoint, directly applying the model learned on the training sample set to the test set causes a large deviation between the mapping of the test set samples in the semantic space and the real class semantics; this is called domain shift. Recently, many methods have been proposed to alleviate the domain shift problem in zero-shot learning, such as data enhancement, self-training, and pivot correction.
Zero-shot recognition has received wide attention from scholars in recent years, and related algorithms have begun to be applied in practice. Previous zero-shot learning methods mainly recognize targets under the conventional zero-shot setting, i.e., the test images are limited to the target classes, whereas in real scenarios the test images come not only from the target classes but possibly also from the source classes. In this case, data from both the source and target classes should be taken into account, so the generalized zero-shot setting has been introduced in recent years. However, recognition accuracy under generalized zero-shot learning is much lower than under conventional zero-shot learning. Conventional generalized zero-shot recognition methods therefore suffer from low recognition accuracy.
Disclosure of Invention
Based on the above, there is a need for a zero-shot image recognition method and system based on a generative adversarial network that can recognize test images from both the target classes and the source classes with high accuracy.
In order to achieve the purpose, the invention provides the following scheme:
a zero sample image identification method based on a generation countermeasure network comprises the following steps:
acquiring a training image sample and a test image sample; the training image sample is a sample image with marking information, and the test image sample is a sample image without marking information;
constructing and generating a confrontation network model; the generation countermeasure network model comprises a semantic feature generator, a visual feature generator, a semantic discriminator and a visual discriminator; the semantic feature generator is used for generating pseudo-semantic features according to the real visual features; the visual feature generator is used for generating a pseudo visual feature according to the pseudo semantic feature; the semantic discriminator is used for discriminating the real semantic features and the pseudo semantic features; the visual discriminator is used for discriminating the real visual features and the pseudo visual features;
constructing a multi-target loss function; the multi-target loss function comprises a cycle consistency loss function of a real visual feature and a pseudo visual feature, an antagonistic loss function of a semantic discriminator, an antagonistic loss function of the visual discriminator and a classification loss function of the semantic discriminator;
taking the training image sample as the input of the generated countermeasure network model, and performing iterative training on the generated countermeasure network model based on the multi-target loss function to obtain a trained generated countermeasure network model;
and inputting the test image sample into the trained generated confrontation network model to obtain a recognition result.
The invention also provides a zero-shot image recognition system based on a generative adversarial network, comprising:
a sample acquisition module, configured to acquire a training image sample and a test image sample; the training image sample is a sample image with annotation information, and the test image sample is a sample image without annotation information;
a network model construction module, configured to construct the generative adversarial network model; the generative adversarial network model comprises a semantic feature generator, a visual feature generator, a semantic discriminator and a visual discriminator; the semantic feature generator is used for generating pseudo-semantic features from the real visual features; the visual feature generator is used for generating pseudo-visual features from the pseudo-semantic features; the semantic discriminator is used for discriminating between the real semantic features and the pseudo-semantic features; the visual discriminator is used for discriminating between the real visual features and the pseudo-visual features;
a loss function construction module, configured to construct the multi-objective loss function; the multi-objective loss function comprises a cycle consistency loss function of the real visual features and the pseudo-visual features, an adversarial loss function of the semantic discriminator, an adversarial loss function of the visual discriminator and a classification loss function of the semantic discriminator;
a training module, configured to take the training image sample as the input of the generative adversarial network model and to iteratively train the generative adversarial network model based on the multi-objective loss function to obtain a trained generative adversarial network model;
and a test recognition module, configured to input the test image sample into the trained generative adversarial network model to obtain a recognition result.
Compared with the prior art, the invention has the following beneficial effects:
The invention provides a zero-shot image recognition method and system based on a generative adversarial network. The method constructs a generative adversarial network model comprising a semantic feature generator, a visual feature generator, a semantic discriminator and a visual discriminator; constructs a multi-objective loss function comprising a cycle consistency loss function, an adversarial loss function of the semantic discriminator, an adversarial loss function of the visual discriminator and a classification loss function of the semantic discriminator; takes the training image samples as the input of the generative adversarial network model and iteratively trains the model based on the multi-objective loss function to obtain the trained generative adversarial network model; and inputs the test image samples into the trained model to obtain the recognition result. The method can recognize sketches without annotation information, improves zero-shot recognition accuracy, and improves the generalization ability of the model.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed in the embodiments are briefly described below. Obviously, the drawings in the following description are only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without inventive effort.
FIG. 1 is a flow chart of the zero-shot image recognition method based on a generative adversarial network according to an embodiment of the present invention;
FIG. 2 shows the network structure of the semantic feature generator G_1 according to an embodiment of the present invention;
FIG. 3 shows the network structure of the visual feature generator G_2 according to an embodiment of the present invention;
FIG. 4 shows the network structure of the semantic discriminator D_1 according to an embodiment of the present invention;
FIG. 5 shows the network structure of the visual discriminator D_2 according to an embodiment of the present invention;
FIG. 6 is a structural diagram of the trained generative adversarial network model according to an embodiment of the present invention;
fig. 7 is a schematic structural diagram of the zero-shot image recognition system based on a generative adversarial network according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
In order to improve generalized zero-shot recognition accuracy, the following two problems need to be solved: on the one hand, mapping visual information to the semantic space currently requires aligned image pairs or relies on inefficient feature fusion; on the other hand, when an autoencoder is used to extract semantic information from Wikipedia, redundant noisy text is present, which degrades the recognition result.
Fig. 1 is a flowchart of the zero-shot image recognition method based on a generative adversarial network according to an embodiment of the present invention. Referring to fig. 1, the method includes:
step 101: training image samples and test image samples are obtained.
The training image sample is a sample image with annotation information, and the test image sample is a sample image without annotation information.
Step 102: constructing the generative adversarial network model; the generative adversarial network model comprises a semantic feature generator, a visual feature generator, a semantic discriminator and a visual discriminator.
The semantic feature generator is used for generating pseudo-semantic features from the real visual features; the visual feature generator is used for generating pseudo-visual features from the pseudo-semantic features; the semantic discriminator is used for discriminating between the real semantic features and the pseudo-semantic features; the visual discriminator is used for discriminating between the real visual features and the pseudo-visual features.
Before this step is performed, two preprocessing steps are also needed: 1) input the Wikipedia texts into a hierarchical model to obtain the useful information of the texts, then input this useful information into an autoencoder to obtain the real semantic features; 2) input the training image samples into an attention-based CNN model to obtain the real visual features.
Step 103: constructing the multi-objective loss function; the multi-objective loss function comprises a cycle consistency loss function of the real visual features and the pseudo-visual features, an adversarial loss function of the semantic discriminator, an adversarial loss function of the visual discriminator and a classification loss function of the semantic discriminator.
Step 104: taking the training image samples as the input of the generative adversarial network model, and iteratively training the generative adversarial network model based on the multi-objective loss function to obtain the trained generative adversarial network model.
Step 105: inputting the test image samples into the trained generative adversarial network model to obtain the recognition result.
Step 101 is the initial training stage of this embodiment; this stage of the recognition model is completed under the TensorFlow deep learning framework. The specific procedure for obtaining training and test image samples is as follows:
the training image samples and the test image samples in this embodiment may be selected from Sketchy and TU-Berlin. Sketchy and TU-Berlin are two common and popular sketch datasets.
The Sketchy dataset is a large sketch collection. The dataset consists of 125 different categories, each with 100 photos. Sketches of the objects appearing in these 12,500 photos were collected by crowdsourcing, resulting in 75,471 sketches. The dataset also contains fine-grained correspondences (alignments) between particular images and sketches, as well as various data augmentations for deep learning based methods. The dataset was later expanded by adding 60,502 photos, yielding 73,002 images in total. We randomly draw the sketches of 25 classes as the unseen test set for zero-shot recognition (without using their annotation information), and the data of the remaining 100 classes are used for training (with annotation information).
The TU-Berlin dataset (extended) contains 250 categories with 20,000 sketches, later extended with natural images corresponding to the sketch classes, 204,489 images in total. The sketches of 30 randomly selected classes serve as the test set (without using their annotation information); the remaining 220 classes are used for training (with annotation information).
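The random class split described above can be sketched as follows; this is a hypothetical numpy helper, with class indices standing in for the actual category names:

```python
import numpy as np

rng = np.random.default_rng(seed=0)
all_classes = np.arange(250)                # the 250 TU-Berlin categories
# 30 classes become the unseen zero-shot test set
unseen = rng.choice(all_classes, size=30, replace=False)
seen = np.setdiff1d(all_classes, unseen)    # 220 classes kept for training
```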
Step 102 is the model construction stage of this embodiment, namely constructing the structure of the generative adversarial network model, which includes the semantic feature generator G_1, the visual feature generator G_2, the semantic discriminator D_1 and the visual discriminator D_2. The specific construction process is as follows:
1) Construction of the generator network:
Two generator networks are constructed: the semantic feature generator G_1 and the visual feature generator G_2. As shown in FIG. 2, the semantic feature generator G_1 comprises 2 groups of convolution modules and 2 groups of fully connected modules. Each convolution module consists of a convolution layer (Conv), a max pooling layer (MaxPool) and a normalization layer; each fully connected module consists of a fully connected layer (FC) and a Leaky ReLU. As shown in FIG. 3, the visual feature generator G_2 comprises two groups of fully connected modules, three 4096-dimensional fully connected layers (FC 4096), a resampling layer (Reshape) and 5 groups of up-sampling modules. Each fully connected module consists of a fully connected layer and a Leaky ReLU; each up-sampling module consists of two up-sampling layers (Upconv) and two Leaky ReLUs connected alternately. The input of G_2 is the semantic features output by G_1.
In particular, the semantic feature generator G_1 comprises 2 groups of convolution modules and 2 groups of fully connected modules. After an image is input into the generator, it first passes through a convolution module whose convolution layer has kernel size 11 and stride 4; max pooling with pool size 3 and stride 2 reduces the mean-square-error deviation left by the convolution layer's parameter errors, and the subsequent normalization normalizes the dimensions of the input data. A convolution with kernel size 5 and stride 1 follows, again with max pooling of size 3 and stride 2 and normalization, after which the data is fed into a 1024-unit fully connected module. Finally, the input visual features are turned into semantic features by two fully connected modules of the same size.
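As an illustration, a minimal sketch of G_1 with the TensorFlow Keras API used by the embodiment; the filter counts, the 128x128 RGB input size and the 300-dimensional semantic output are assumptions not fixed by the text:

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_G1(sem_dim=300):
    img = layers.Input(shape=(128, 128, 3))
    # convolution module 1: Conv(k=11, s=4) -> MaxPool(3, s=2) -> normalization
    x = layers.Conv2D(64, 11, strides=4, padding='same')(img)
    x = layers.MaxPooling2D(pool_size=3, strides=2)(x)
    x = layers.BatchNormalization()(x)
    # convolution module 2: Conv(k=5, s=1) -> MaxPool(3, s=2) -> normalization
    x = layers.Conv2D(256, 5, strides=1, padding='same')(x)
    x = layers.MaxPooling2D(pool_size=3, strides=2)(x)
    x = layers.BatchNormalization()(x)
    x = layers.Flatten()(x)
    # 1024-unit fully connected module (FC + Leaky ReLU) ...
    x = layers.LeakyReLU(0.2)(layers.Dense(1024)(x))
    # ... followed by two fully connected modules of the same size
    for _ in range(2):
        x = layers.LeakyReLU(0.2)(layers.Dense(1024)(x))
    # project to the semantic feature dimension (an assumption; the text
    # does not fix the dimensionality of the semantic space)
    return tf.keras.Model(img, layers.Dense(sem_dim)(x), name='G1')
```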
In particular, the visual feature generator G_2 comprises two groups of fully connected modules, three 4096-dimensional fully connected layers, a resampling layer and 5 groups of up-sampling modules. The semantic features generated by the semantic feature generator are input into the visual feature generator and first pass through two 1024-unit fully connected modules; three 4096-dimensional fully connected layers then extract a 4096-dimensional feature vector from the input data; the resampling layer reshapes the input feature vector to 4 x 4 x 256; finally, 5 up-sampling modules with kernel size 4 and stride 2 up-sample the feature vector, with an activation function applied after every up-sampling to prevent vanishing gradients, and the feature vector is output.
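A matching sketch of G_2 follows; the transposed-convolution channel widths and the 3-channel 128x128 output are assumptions, while the two 1024-unit FC modules, the three 4096-d layers, the 4x4x256 reshape and the five kernel-4 stride-2 up-sampling stages follow the text:

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_G2(sem_dim=300):
    sem = layers.Input(shape=(sem_dim,))   # pseudo-semantic features from G1
    x = sem
    # two 1024-unit fully connected modules (FC + Leaky ReLU)
    for _ in range(2):
        x = layers.LeakyReLU(0.2)(layers.Dense(1024)(x))
    # three 4096-dimensional fully connected layers
    for _ in range(3):
        x = layers.Dense(4096)(x)
    # resample the 4096-d vector into a 4 x 4 x 256 feature map
    x = layers.Reshape((4, 4, 256))(x)
    # five up-sampling stages (kernel 4, stride 2), Leaky ReLU after each
    for filters in (256, 128, 64, 32, 3):
        x = layers.Conv2DTranspose(filters, 4, strides=2, padding='same')(x)
        x = layers.LeakyReLU(0.2)(x)
    return tf.keras.Model(sem, x, name='G2')   # 128x128x3 pseudo-visual output
```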
2) Construction of the discriminator network:
Two discriminator networks are constructed: the semantic discriminator D_1 and the visual discriminator D_2. D_1 comprises two branches: one branch for 0/1 (real/fake) classification and the other for classifying the input label category. The network structure of the first branch comprises a group of fully connected modules and a two-way fully connected layer; the fully connected module consists of a fully connected layer and a Leaky ReLU. The network structure of the other branch comprises a group of fully connected modules and an n-way fully connected layer, the fully connected module again consisting of a fully connected layer and a Leaky ReLU. D_2 comprises a group of fully connected modules and a fully connected layer, where the fully connected module consists of a fully connected layer and a Leaky ReLU. The last fully connected layer of each of the two discriminators D_1 and D_2 serves as a classifier within the overall convolutional neural network.
As shown in FIG. 4, the semantic discriminator D_1 specifically comprises two branches, one for the 0/1 binary classification and the other for class-label classification. It receives the real semantic features extracted by the autoencoder and the pseudo-semantic features generated by the semantic feature generator G_1. In the binary-classification branch, features are first extracted by a group of 1024-unit fully connected modules, the gradient is then stabilized with an activation function, and finally a fully connected layer performs the 0/1 binary classification to judge whether the input features are real; in the other, n-way branch, the last fully connected layer performs n-way classification of the input data.
As shown in FIG. 5, the visual discriminator D_2 discriminates the authenticity of features between the pseudo-visual features generated by the visual feature generator G_2 and the real visual features extracted by the CNN. The generated pseudo-visual features are input into the discriminator network D_2; features are first extracted with a 1024-unit fully connected layer, an activation function is then applied to prevent vanishing gradients, and finally a fully connected layer performs binary classification of the data to judge whether the input features are real.
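A minimal Keras sketch of the two discriminators follows; the 1024-unit modules follow the text, while the input dimensions, the number of classes and the scalar real/fake head (standing in for the two-way layer so that it can double as the WGAN-style critic used below) are assumptions:

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_D1(sem_dim=300, n_classes=100):
    """Semantic discriminator: shared input, two branches."""
    sem = layers.Input(shape=(sem_dim,))     # real or pseudo-semantic features
    # branch 1: FC module + output layer for the real/fake decision
    h1 = layers.LeakyReLU(0.2)(layers.Dense(1024)(sem))
    real_fake = layers.Dense(1)(h1)          # scalar critic score
    # branch 2: FC module + n-way layer for class-label classification
    h2 = layers.LeakyReLU(0.2)(layers.Dense(1024)(sem))
    class_logits = layers.Dense(n_classes)(h2)
    return tf.keras.Model(sem, [real_fake, class_logits], name='D1')

def build_D2(feat_shape=(128, 128, 3)):
    """Visual discriminator: FC module + binary output layer."""
    vis = layers.Input(shape=feat_shape)     # real or pseudo-visual features
    h = layers.Flatten()(vis)
    h = layers.LeakyReLU(0.2)(layers.Dense(1024)(h))
    out = layers.Dense(1)(h)                 # real/fake logit
    return tf.keras.Model(vis, out, name='D2')
```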
In step 103, the multi-objective loss function is constructed. The purpose of constructing the loss function is as follows: according to the convergence of the loss value, the corresponding parameters in the zero-shot recognition network model can be better updated and optimized, the optimized generative adversarial network model is finally obtained, and the images to be recognized in the real dataset are recognized more accurately. Specifically:
The adversarial loss function is divided into two parts: first, the adversarial loss of CTGAN, which evaluates the synthesized semantic features; the CTGAN adversarial loss imposes a corresponding constraint on the gradient penalty to improve the quality of the synthesized features. Second, the adversarial loss of an ordinary GAN, which evaluates the synthesized pseudo-visual features; the ordinary adversarial mechanism reduces the domain difference well.
The cycle consistency loss function records well the degree of match between the visual features extracted by the attention-based CNN and the generated pseudo-visual features.
The classifier is attached to the semantic discriminator D_1, so the classifier can effectively classify the class-label data and thereby fulfil the zero-shot image recognition task. The adversarial loss function of the semantic discriminator D_1 in the generative adversarial network model is as follows:

L_CTGAN = E_{x~P_f}[D_1(G_1(x))] - E_{a~P_r}[D_1(a)] + λ_1 E_{x̂~P_{r,f}}[(||∇_x̂ D_1(x̂)||_2 - 1)^2] + λ_2 CT|_{x',x''}

where x represents the real visual features, a represents the real semantic features, G_1(x) represents the semantic generator whose input visual features are x, D_1(G_1(x)) represents the semantic discriminator whose input is G_1(x), D_1(a) represents the semantic discriminator whose input semantic features are a, P_f represents the prior distribution of the real visual features, P_r represents the prior distribution of the real semantic features, x̂ represents a linear interpolation between features, and P_{r,f} represents the prior distribution obeying the real visual features and the real semantic features. The first term E_{x~P_f}[D_1(G_1(x))] represents the expectation of the pseudo-feature distribution; the second term E_{a~P_r}[D_1(a)] represents the expectation of the real feature distribution; the difference between the first and second terms is the Wasserstein distance between the feature distributions. E_{x̂~P_{r,f}}[(||∇_x̂ D_1(x̂)||_2 - 1)^2] denotes the gradient penalty enforcing the Lipschitz constraint, and λ_2 CT|_{x',x''} is the consistency or continuity term added to the constrained gradient penalty; λ_1 is the weight of the gradient penalty and λ_2 is the weight of the consistency or continuity term. Here,

CT|_{x',x''} = E[max(0, ||D_1(x') - D_1(x'')|| / ||x' - x''|| - c)]

where x' and x'' both represent perturbation data near the real visual features (perturbation data drawn arbitrarily in the neighbourhood of the real samples), c is a fixed constant, D(x') represents the semantic discriminator whose input is x', D(x'') represents the semantic discriminator whose input is x'', ||D(x') - D(x'')|| represents the distance between the two discriminator values, and ||x' - x''|| represents the distance between the two perturbation data features. The consistency term uses ||D(x') - D(x'')|| / ||x' - x''|| to approximate the gradient and limit it to be below c.
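A minimal TensorFlow sketch of this loss under the reconstruction above; `d1_score` is assumed to be a callable returning the scalar critic value of D_1, `lambda1`, `lambda2` and `c` follow the notation in the text, and their default values as well as the Gaussian perturbation scale 0.01 are assumptions (the text only says "perturbation data near the real samples"):

```python
import tensorflow as tf

def ctgan_loss(d1_score, a_real, a_fake, lambda1=10.0, lambda2=2.0, c=1.0):
    # Wasserstein term: E[D1(G1(x))] - E[D1(a)]
    w = tf.reduce_mean(d1_score(a_fake)) - tf.reduce_mean(d1_score(a_real))
    # gradient penalty on linear interpolations between real and fake features
    eps = tf.random.uniform([tf.shape(a_real)[0], 1], 0.0, 1.0)
    a_hat = eps * a_real + (1.0 - eps) * a_fake
    with tf.GradientTape() as tape:
        tape.watch(a_hat)
        d_hat = d1_score(a_hat)
    grads = tape.gradient(d_hat, a_hat)
    gp = tf.reduce_mean((tf.norm(grads, axis=1) - 1.0) ** 2)
    # consistency term: draw two perturbations x', x'' near the real features
    # and bound the finite-difference slope ||D(x')-D(x'')|| / ||x'-x''|| by c
    x1 = a_real + 0.01 * tf.random.normal(tf.shape(a_real))
    x2 = a_real + 0.01 * tf.random.normal(tf.shape(a_real))
    slope = tf.abs(d1_score(x1) - d1_score(x2)) / (
        tf.norm(x1 - x2, axis=1, keepdims=True) + 1e-8)
    ct = tf.reduce_mean(tf.nn.relu(slope - c))
    return w + lambda1 * gp + lambda2 * ct
```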
The adversarial loss function of the visual discriminator is constructed as an ordinary GAN loss:

L_adv = E_x[log D_2(x)] + E_ã[log(1 - D_2(G_2(ã)))]

where ã = G_1(x) represents the pseudo-semantic features, D_2(x) represents the visual discriminator whose input visual features are x, and G_2(ã) represents the visual feature generator whose input pseudo-semantic features are ã; x̃ = G_2(ã) denotes the generated pseudo-visual features. The network is continuously optimized through this loss function so that the generated pseudo-visual features x̃ get closer and closer to the real visual features x.
The adversarial loss analyzes the real feature distribution and the generated feature distribution as a whole, outputs a feedback signal to the generator network, and adjusts and optimizes the network parameters.
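A short sketch of this objective with the TensorFlow Keras API, assuming `d2` returns a logit; the split into a discriminator part and a generator part is the usual way the ordinary GAN loss is optimized in practice:

```python
import tensorflow as tf

bce = tf.keras.losses.BinaryCrossentropy(from_logits=True)

def adv_loss_d2(d2, x_real, x_fake):
    # D2 is trained to score real visual features as 1 and pseudo ones as 0
    return (bce(tf.ones_like(d2(x_real)), d2(x_real)) +
            bce(tf.zeros_like(d2(x_fake)), d2(x_fake)))

def adv_loss_g2(d2, x_fake):
    # G2 is rewarded when D2 mistakes its pseudo-visual features for real ones
    return bce(tf.ones_like(d2(x_fake)), d2(x_fake))
```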
The cycle consistency loss function of the real visual features and the pseudo-visual features is constructed as:

L_cyc = E[||G_2(G_1(x)) - x||_1] + E[||G_1(G_2(a)) - a||_1]

where E[||G_2(G_1(x)) - x||_1] represents the distribution expectation of the two visual features measured by cycle consistency, and E[||G_1(G_2(a)) - a||_1] represents the distribution expectation of the two semantic features measured by cycle consistency. The cycle consistency loss L_cyc is used to optimize the network parameters so that the real visual features x and the pseudo-semantic features ã match better.
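A direct sketch of this loss; writing the L1 norms as element-wise means is a common convention not fixed by the text:

```python
import tensorflow as tf

def cycle_loss(g1, g2, x_real, a_real):
    vis_cyc = tf.reduce_mean(tf.abs(g2(g1(x_real)) - x_real))  # E[||G2(G1(x)) - x||_1]
    sem_cyc = tf.reduce_mean(tf.abs(g1(g2(a_real)) - a_real))  # E[||G1(G2(a)) - a||_1]
    return vis_cyc + sem_cyc
```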
The classification loss function of the semantic discriminator is constructed as:

L_cls = -E[log P(b|G_1(a); θ)]

where P(b|G_1(a); θ) represents the class conditional probability of the class label, G_1(a) represents the semantic generator whose input semantic features are a, θ is a parameter of the classification network, and b is the class label of a. The classification accuracy of class labels is improved by minimizing the classification loss of the generated features.
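Since the classifier is the n-way branch of D_1, this negative log-likelihood is ordinary softmax cross-entropy; a minimal sketch, with `class_logits` standing for that branch's output:

```python
import tensorflow as tf

ce = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

def cls_loss(class_logits, labels):
    # labels are the integer class labels b; minimizing the negative
    # log-likelihood -E[log P(b | .; theta)] is softmax cross-entropy
    return ce(labels, class_logits)
```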
In step 104, the constructed generative adversarial network model is iteratively trained, and the parameters of the network model are updated and optimized to obtain the trained generative adversarial network model. Specifically, the training image samples are used as the input of the semantic feature generator, and the semantic feature generator, the visual feature generator, the semantic discriminator and the visual discriminator are jointly trained by back propagation according to the multi-objective loss function, so that the parameters in these four networks are continuously updated and optimized, giving the trained generative adversarial network model. Fig. 6 shows the structure of the trained generative adversarial network model according to an embodiment of the present invention. The specific iterative training steps are as follows:
Training sample data from the Sketchy and TU-Berlin datasets are input into the attention-based CNN to extract the visual feature information of the training samples, which is then input into the semantic feature generator G_1 to generate the pseudo-semantic features ã.
The pseudo-semantic features obtained in the previous step are input into the visual feature generator G_2 to generate the pseudo-visual features x̃.
To better measure the similarity between a sketch and a real image during training, the cycle consistency loss constraint of Cycle-GAN is introduced; Cycle-GAN consists of two generators and two discriminators. Taking the semantic features and the visual features as data information of two different domains, the semantic feature generator G_1 generates the pseudo-semantic features ã from the real visual features x, and the visual feature generator G_2 generates the pseudo-visual features x̃ back from the obtained pseudo-semantic features ã. The cycle consistency loss L_cyc is then used to measure the similarity between the real visual features and the pseudo-visual features.
The Wikipedia texts are input into the hierarchical model to obtain the useful information of the texts, which is then input into the autoencoder to extract the real semantic information of the Wikipedia texts. The real semantic information a serves as the input of the discriminator D_1 and is used for adversarial learning against the pseudo-semantic features generated by G_1.
CTGAN, a variant of WGAN, is used as the discriminator D_1 to improve the accuracy of zero-shot image recognition. The gradient penalty of WGAN is not applied everywhere: if the real sample distribution and the generated pseudo-sample distribution are far apart, the gradient penalty often cannot reach the region near the real samples, i.e., the discriminator may break Lipschitz continuity there. On the basis of WGAN, CTGAN adds a consistency term that constrains the gradient over the real sample distribution, thereby enforcing Lipschitz continuity near the data sample distribution.
The pseudo-visual features x̃ generated by the visual feature generator G_2 and the real visual features x serve as the input of the visual discriminator D_2. G_2 and D_2 judge the authenticity of the visual features to produce the adversarial loss, and the network parameters are updated and optimized through the loss function so that the pseudo-visual features x̃ get closer and closer to the real visual features x.
The adversarial loss function L_CTGAN of the discriminator D_1 and the adversarial loss function L_adv of the discriminator D_2 are constructed from the feature information of the Wikipedia texts and the sketches; the cycle consistency loss function L_cyc is constructed from the real visual features and the pseudo-visual features of the sketches; finally, the loss function L_cls for classifying the label categories is constructed.
The specific update and optimization process is as follows: the generator network parameters are fixed while the discriminator network is trained, giving a trained discriminator network model; the trained discriminator network parameters are then fixed while the generator network is trained by back propagation, giving an optimized generator network model; these steps are repeated to obtain the optimal generative adversarial network model.
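This alternating scheme can be sketched as follows, reusing the loss helpers from the earlier sketches; the optimizers, learning rates and the weight 10.0 on the cycle loss are assumptions, and `x_real`, `a_real`, `labels` denote one batch of real visual features, real semantic features and class labels:

```python
import tensorflow as tf

opt_d = tf.keras.optimizers.Adam(1e-4, beta_1=0.5)
opt_g = tf.keras.optimizers.Adam(1e-4, beta_1=0.5)

def train_step(x_real, a_real, labels, G1, G2, D1, D2):
    d1_score = lambda s: D1(s)[0]   # scalar real/fake branch of D1
    # 1) discriminator step: generator parameters stay fixed, only the
    #    discriminator variables receive gradients
    with tf.GradientTape() as tape:
        a_fake = G1(x_real)
        x_fake = G2(a_fake)
        _, cls_real = D1(a_real)
        d_loss = (ctgan_loss(d1_score, a_real, a_fake)
                  + adv_loss_d2(D2, x_real, x_fake)
                  + cls_loss(cls_real, labels))
    d_vars = D1.trainable_variables + D2.trainable_variables
    opt_d.apply_gradients(zip(tape.gradient(d_loss, d_vars), d_vars))
    # 2) generator step: discriminator parameters stay fixed
    with tf.GradientTape() as tape:
        a_fake = G1(x_real)
        x_fake = G2(a_fake)
        g_loss = (-tf.reduce_mean(d1_score(a_fake))   # fool D1
                  + adv_loss_g2(D2, x_fake)           # fool D2
                  + 10.0 * cycle_loss(G1, G2, x_real, a_real))
    g_vars = G1.trainable_variables + G2.trainable_variables
    opt_g.apply_gradients(zip(tape.gradient(g_loss, g_vars), g_vars))
    return d_loss, g_loss
```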
The zero-shot image recognition method based on a generative adversarial network in this embodiment has the following advantages. A semantically aligned cycle consistency loss constraint is introduced into the generative model to solve the problem that common semantic knowledge cannot be exploited between training and test images in real scenarios, and to measure the correlation between visual features and semantic features; a classification network parallel to the discriminator is added at the discriminator's output to correctly classify class labels. The CTGAN variant of WGAN is used for adversarial learning between real and synthesized features, adding a consistency term on top of WGAN to constrain the gradient over the real feature distribution. Zero-shot learning over the whole feature-based attribute set suffers from high training cost and complexity; the proposed scheme, based on Wikipedia texts and a hierarchical structure, extracts features of attribute subsets with an autoencoder, uses the hierarchy to partition the subsets, screens the useful information, and extracts the important feature information from the text, thereby reducing training cost and complexity. Recognition over attribute subsets is more effective than over the whole attribute set.
The method in this embodiment uses a generative adversarial network to realize zero-shot recognition, can recognize sketches without annotation information, can improve zero-shot recognition accuracy, and improves the generalization ability of the model.
Fig. 7 is a schematic structural diagram of the zero-shot image recognition system based on a generative adversarial network according to an embodiment of the present invention. Referring to fig. 7, the system includes:
a sample obtaining module 201, configured to obtain a training image sample and a test image sample; the training image sample is a sample image with marking information, and the test image sample is a sample image without marking information.
A network model construction module 202, configured to construct and generate a confrontation network model; the generation countermeasure network model comprises a semantic feature generator, a visual feature generator, a semantic discriminator and a visual discriminator; the semantic feature generator is used for generating pseudo-semantic features according to the real visual features; the visual feature generator is used for generating a pseudo visual feature according to the pseudo semantic feature; the semantic discriminator is used for discriminating the real semantic features and the pseudo semantic features; the visual discriminator is used for discriminating the real visual features and the pseudo visual features.
A loss function constructing module 203, configured to construct a multi-objective loss function; the multi-target loss function comprises a cycle consistency loss function of a real visual feature and a pseudo visual feature, a confrontation loss function of a semantic discriminator, a confrontation loss function of the visual discriminator and a classification loss function of the semantic discriminator.
The training module 204 is configured to use the training image sample as an input of the generated confrontation network model, and perform iterative training on the generated confrontation network model based on the multi-objective loss function to obtain a trained generated confrontation network model.
And the test recognition module 205 is configured to input the test image sample into the trained generated confrontation network model to obtain a recognition result.
As an optional implementation, the zero-shot image recognition system based on a generative adversarial network further includes:
a real semantic feature acquisition module, configured to input the Wikipedia texts into the hierarchical model to obtain the useful information of the texts and to input the useful information of the texts into the autoencoder to obtain the real semantic features;
and a real visual feature acquisition module, configured to input the training image samples into the attention-based CNN model to obtain the real visual features.
As an optional implementation manner, the network model building module 202 specifically includes:
the first generator constructing unit is used for constructing a semantic feature generator; the semantic feature generator comprises two groups of convolution modules and two groups of full-connection modules; the convolution module comprises a convolution layer, a maximum pooling layer and a normalization layer which are connected in sequence; the full-connection module comprises a full-connection layer and a Leaky ReLU layer.
A second generator building unit for building a visual feature generator; the visual feature generator comprises two groups of fully-connected modules, three 4096-dimensional fully-connected layers, a resampling layer and five groups of upsampling modules which are sequentially connected; the up-sampling module comprises two up-sampling layers and two Leaky ReLU layers; and the upsampling layer in the upsampling module and the Leaky ReLU layer are alternately connected.
The first discriminator constructing unit is used for constructing a semantic discriminator; the semantic discriminator comprises a group of full-connection modules, a two-path full-connection layer, an n-path full-connection layer, two classifiers and an input label classifier.
A second discriminator establishing unit for establishing a visual discriminator; the visual discriminator comprises a group of fully connected modules, a fully connected layer and a classifier.
As an optional implementation manner, the loss function constructing module 203 specifically includes:
a first loss function construction unit for constructing a countermeasure loss function of the semantic discriminator
Figure BDA0002440299930000131
Where x represents the true visual features, a represents the true semantic features, G1(x) Semantic Generator, D, representing an input visual feature as x1(G1(x) Represents input G)1(x) Semantic discriminator of, D1(a) Semantic discriminator expressing input semantic features as a, PfA priori distribution, P, representing true visual featuresrA prior distribution representing the true semantic features,
Figure BDA0002440299930000132
representing linear interpolation between features, Pr,fRepresenting a prior distribution of obedient real visual features and real semantic features;
Figure BDA0002440299930000133
representing a desire for a pseudo feature distribution;
Figure BDA0002440299930000134
an expectation representing a true feature distribution;
Figure BDA0002440299930000135
denotes the gradient penalty, λ, of performing the Lipschitz constraint2CT|x',x”A consistency or continuity term representing an increased constraint gradient penalty; lambda [ alpha ]1A weight representing a gradient penalty; lambda [ alpha ]2Weights representing consistency or continuity terms; wherein,
Figure BDA0002440299930000136
x' and x "both represent perturbation data in the vicinity of the true visual feature; c is a fixed constant; d (x ') represents the semantic arbiter input as x', D (x ") represents the semantic arbiter input as x", | D (x ') -D (x ") | | represents the distance between two arbiter values, | | x' -x" | | represents the distance between two perturbation data features.
A second loss function constructing unit for constructing a countering loss function of the visual discriminator
Figure BDA0002440299930000141
Wherein,
Figure BDA0002440299930000142
representing pseudo-semantic features, D2(x) A visual discriminator representing the input visual feature x,
Figure BDA0002440299930000143
representing input pseudo-semantic features
Figure BDA0002440299930000144
The visual sense generator of (a) is,
Figure BDA0002440299930000145
presentation input
Figure BDA0002440299930000146
The visual characteristics generator of (1).
A third loss function constructing unit for constructing a circular consistency loss function of the real visual features and the pseudo visual features
Figure BDA0002440299930000147
E[||G2(G1(x))-x||1]Representing a distribution expectation of two visual features measured by cyclic consistency;
Figure BDA0002440299930000148
representing the expectation of distribution of two semantic features measured by circular consistency.
A fourth loss function construction unit for constructing the classification loss function of the semantic discriminator
Lcls=-E[logP(b|G1(a);θ)];
Wherein, P (c | G)1(a) (ii) a θ) represents the class conditional probability of the class label, G1(a) And a semantic generator which represents the input semantic features as a, theta is a parameter of the classification network, and b is a class label of a.
As an optional implementation manner, the training module 204 specifically includes:
and the training unit is used for taking the training image sample as the input of the semantic feature generator, and performing combined training on the semantic feature generator, the visual feature generator, the semantic discriminator and the visual discriminator in a back propagation mode according to the multi-target loss function, so that parameters in the semantic feature generator, the visual feature generator, the semantic discriminator and the visual discriminator are continuously updated and optimized, and a trained generated confrontation network model is obtained.
The zero-shot image recognition system based on a generative adversarial network in this embodiment uses a generative adversarial network to realize zero-shot recognition, can recognize sketches without annotation information, can improve zero-shot recognition accuracy, and improves the generalization ability of the model.
The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. For the system disclosed by the embodiment, the description is relatively simple because the system corresponds to the method disclosed by the embodiment, and the relevant points can be referred to the method part for description.
The principles and embodiments of the present invention have been described herein using specific examples, which are provided only to help understand the method and the core concept of the present invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, the specific embodiments and the application range may be changed. In view of the above, the present disclosure should not be construed as limiting the invention.

Claims (8)

1. A zero-shot image recognition method based on a generative adversarial network, characterized by comprising the following steps:
acquiring a training image sample and a test image sample; the training image sample is a sample image with annotation information, and the test image sample is a sample image without annotation information;
constructing a generative adversarial network model; the generative adversarial network model comprises a semantic feature generator, a visual feature generator, a semantic discriminator and a visual discriminator; the semantic feature generator is used for generating pseudo-semantic features from real visual features; the visual feature generator is used for generating pseudo-visual features from the pseudo-semantic features; the semantic discriminator is used for discriminating between real semantic features and pseudo-semantic features; the visual discriminator is used for discriminating between real visual features and pseudo-visual features;
constructing a multi-objective loss function; the multi-objective loss function comprises a cycle consistency loss function of the real visual features and the pseudo-visual features, an adversarial loss function of the semantic discriminator, an adversarial loss function of the visual discriminator and a classification loss function of the semantic discriminator;
taking the training image sample as the input of the generative adversarial network model, and iteratively training the generative adversarial network model based on the multi-objective loss function to obtain a trained generative adversarial network model;
inputting the test image sample into the trained generative adversarial network model to obtain a recognition result;
the constructing of the multi-objective loss function specifically includes:
constructing the adversarial loss function of the semantic discriminator

L_CTGAN = E_{x~P_f}[D_1(G_1(x))] - E_{a~P_r}[D_1(a)] + λ_1 E_{x̂~P_{r,f}}[(||∇_x̂ D_1(x̂)||_2 - 1)^2] + λ_2 CT|_{x',x''}

wherein x represents the real visual features, a represents the real semantic features, G_1(x) represents the semantic generator whose input visual features are x, D_1(G_1(x)) represents the semantic discriminator whose input is G_1(x), D_1(a) represents the semantic discriminator whose input semantic features are a, P_f represents the prior distribution of the real visual features, P_r represents the prior distribution of the real semantic features, x̂ represents a linear interpolation between features, and P_{r,f} represents the prior distribution obeying the real visual features and the real semantic features; E_{x~P_f}[D_1(G_1(x))] represents the expectation of the pseudo-feature distribution; E_{a~P_r}[D_1(a)] represents the expectation of the real feature distribution; E_{x̂~P_{r,f}}[(||∇_x̂ D_1(x̂)||_2 - 1)^2] represents the gradient penalty enforcing the Lipschitz constraint, and λ_2 CT|_{x',x''} represents the consistency or continuity term added to the constrained gradient penalty; λ_1 represents the weight of the gradient penalty; λ_2 represents the weight of the consistency or continuity term; wherein

CT|_{x',x''} = E[max(0, ||D_1(x') - D_1(x'')|| / ||x' - x''|| - c)]

x' and x'' both represent perturbation data in the vicinity of the real visual features; c is a fixed constant; D(x') represents the semantic discriminator whose input is x', D(x'') represents the semantic discriminator whose input is x'', ||D(x') - D(x'')|| represents the distance between the two discriminator values, and ||x' - x''|| represents the distance between the two perturbation data features;
constructing the adversarial loss function of the visual discriminator

L_adv = E_x[log D_2(x)] + E_ã[log(1 - D_2(G_2(ã)))]

wherein ã represents the pseudo-semantic features, D_2(x) represents the visual discriminator whose input visual features are x, and G_2(ã) represents the visual feature generator whose input pseudo-semantic features are ã;
constructing the cycle consistency loss function of the real visual features and the pseudo-visual features

L_cyc = E[||G_2(G_1(x)) - x||_1] + E[||G_1(G_2(a)) - a||_1]

wherein E[||G_2(G_1(x)) - x||_1] represents the distribution expectation of the two visual features measured by cycle consistency, and E[||G_1(G_2(a)) - a||_1] represents the distribution expectation of the two semantic features measured by cycle consistency;
constructing the classification loss function of the semantic discriminator

L_cls = -E[log P(b|G_1(a); θ)]

wherein P(b|G_1(a); θ) represents the class conditional probability of the class label, G_1(a) represents the semantic generator whose input semantic features are a, θ is a parameter of the classification network, and b is the class label of a.
2. The zero-shot image recognition method based on a generative adversarial network according to claim 1, further comprising, before the constructing of the generative adversarial network model:
inputting the Wikipedia texts into a hierarchical model to obtain useful information of the texts, and inputting the useful information of the texts into an autoencoder to obtain the real semantic features;
and inputting the training image samples into an attention-based CNN model to obtain the real visual features.
3. The zero-shot image recognition method based on a generative adversarial network according to claim 1, wherein the constructing of the generative adversarial network model specifically includes:
constructing a semantic feature generator; the semantic feature generator comprises two groups of convolution modules and two groups of full-connection modules; the convolution module comprises a convolution layer, a maximum pooling layer and a normalization layer which are connected in sequence; the full-connection module comprises a full-connection layer and a Leaky ReLU layer;
constructing a visual feature generator; the visual feature generator comprises two groups of fully-connected modules, three 4096-dimensional fully-connected layers, a resampling layer and five groups of upsampling modules which are sequentially connected; the up-sampling module comprises two up-sampling layers and two Leaky ReLU layers; the upsampling layer in the upsampling module is alternately connected with the Leaky ReLU layer;
constructing a semantic discriminator; the semantic discriminator comprises a group of full-connection modules, a two-path full-connection layer, an n-path full-connection layer, two classifiers and an input label classifier;
constructing a visual discriminator; the visual discriminator comprises a group of fully connected modules, a fully connected layer and a classifier.
4. The method as claimed in claim 1, wherein the step of taking the training image sample as the input of the generative adversarial network model and iteratively training the generative adversarial network model based on the multi-objective loss function to obtain the trained generative adversarial network model specifically comprises:
taking the training image sample as the input of the semantic feature generator, and jointly training the semantic feature generator, the visual feature generator, the semantic discriminator and the visual discriminator by back propagation according to the multi-objective loss function, so that the parameters in the semantic feature generator, the visual feature generator, the semantic discriminator and the visual discriminator are continuously updated and optimized to obtain the trained generative adversarial network model.
5. A zero-sample image recognition system based on a generative adversarial network, comprising:

a sample acquisition module, configured to acquire training image samples and test image samples; the training image samples are sample images with annotation information, and the test image samples are sample images without annotation information;

a network model construction module, configured to construct a generative adversarial network model; the generative adversarial network model comprises a semantic feature generator, a visual feature generator, a semantic discriminator and a visual discriminator; the semantic feature generator is configured to generate pseudo-semantic features from the real visual features; the visual feature generator is configured to generate pseudo-visual features from the pseudo-semantic features; the semantic discriminator is configured to discriminate between the real semantic features and the pseudo-semantic features; the visual discriminator is configured to discriminate between the real visual features and the pseudo-visual features;

a loss function construction module, configured to construct a multi-objective loss function; the multi-objective loss function comprises a cycle consistency loss function of the real visual features and the pseudo visual features, an adversarial loss function of the semantic discriminator, an adversarial loss function of the visual discriminator, and a classification loss function of the semantic discriminator;

a training module, configured to take the training image samples as the input of the generative adversarial network model, and to iteratively train the generative adversarial network model based on the multi-objective loss function to obtain a trained generative adversarial network model;

a test recognition module, configured to input the test image samples into the trained generative adversarial network model to obtain a recognition result;
wherein the loss function construction module specifically comprises:

a first loss function construction unit, configured to construct the adversarial loss function of the semantic discriminator:

Ladv1 = Ε_{x~Pf}[D1(G1(x))] - Ε_{a~Pr}[D1(a)] + λ1Ε_{x̂~Pr,f}[(||∇x̂D1(x̂)||2 - 1)^2] + λ2CT|x',x'';

wherein x represents the real visual features and a represents the real semantic features; G1(x) represents the semantic generator with input visual feature x, D1(G1(x)) represents the semantic discriminator with input G1(x), and D1(a) represents the semantic discriminator with input semantic feature a; Pf represents the prior distribution of the real visual features and Pr represents the prior distribution of the real semantic features; x̂ represents a linear interpolation between features and obeys the joint prior distribution Pr,f of the real visual features and the real semantic features; Ε_{x~Pf}[D1(G1(x))] represents the expectation over the pseudo feature distribution; Ε_{a~Pr}[D1(a)] represents the expectation over the real feature distribution; λ1Ε_{x̂~Pr,f}[(||∇x̂D1(x̂)||2 - 1)^2] denotes the gradient penalty enforcing the Lipschitz constraint, and λ2CT|x',x'' is a consistency or continuity term added to the gradient penalty constraint; λ1 represents the weight of the gradient penalty; λ2 represents the weight of the consistency or continuity term; wherein,
CT|x',x'' = Ε[max(0, ||D(x')-D(x'')||/||x'-x''|| - c)];

wherein x' and x'' both represent perturbed data in the vicinity of the real visual features; c is a fixed constant; D(x') represents the semantic discriminator with input x', and D(x'') represents the semantic discriminator with input x''; ||D(x')-D(x'')|| represents the distance between the two discriminator outputs, and ||x'-x''|| represents the distance between the two perturbed data features;
a second loss function construction unit, configured to construct the adversarial loss function of the visual discriminator:

Ladv2 = Ε[D2(G2(ã))] - Ε[D2(x)];

wherein ã = G1(x) represents the pseudo-semantic features; D2(x) represents the visual discriminator with input visual feature x; G2(ã) represents the visual feature generator with input pseudo-semantic feature ã; and D2(G2(ã)) represents the visual discriminator with input G2(ã);
a third loss function construction unit, configured to construct the cycle consistency loss function of the real visual features and the pseudo visual features:

Lcyc = Ε[||G2(G1(x))-x||1] + Ε[||G1(G2(a))-a||1];

wherein Ε[||G2(G1(x))-x||1] represents the expected distance between the two visual feature distributions measured by cycle consistency, and Ε[||G1(G2(a))-a||1] represents the expected distance between the two semantic feature distributions measured by cycle consistency;
a fourth loss function construction unit, configured to construct the classification loss function of the semantic discriminator:

Lcls = -Ε[log P(b|G1(a);θ)];

wherein P(b|G1(a);θ) represents the class conditional probability of the class label, G1(a) represents the semantic generator with input semantic feature a, θ is the parameter of the classification network, and b is the class label of a. (A hedged sketch of the gradient penalty and consistency term in the first loss function construction unit follows this claim.)
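For illustration only, a hedged PyTorch sketch of the gradient penalty and consistency term appearing in the first loss function construction unit. The linear interpolation scheme, the Gaussian perturbations standing in for "perturbed data in the vicinity of the real features", the ratio form of CT, and all constants are reconstructions from the claim wording, not verified against the original filing.

import torch

def gradient_penalty(D1, real, fake, lambda1=10.0):
    # lambda1 * E[(||grad_xhat D1(x_hat)||2 - 1)^2], with x_hat sampled on the
    # line between a real and a generated feature (Lipschitz constraint).
    eps = torch.rand(real.size(0), 1, device=real.device)
    x_hat = (eps * real + (1 - eps) * fake).requires_grad_(True)
    grad = torch.autograd.grad(D1(x_hat).sum(), x_hat, create_graph=True)[0]
    return lambda1 * ((grad.norm(2, dim=1) - 1) ** 2).mean()

def consistency_term(D1, real, c=0.1, sigma=0.01, lambda2=2.0):
    # lambda2 * E[max(0, ||D1(x') - D1(x'')|| / ||x' - x''|| - c)] for two
    # perturbations x', x'' drawn near the real features.
    x1 = real + sigma * torch.randn_like(real)
    x2 = real + sigma * torch.randn_like(real)
    ratio = (D1(x1) - D1(x2)).abs().view(-1) / (x1 - x2).norm(2, dim=1)
    return lambda2 * torch.clamp(ratio - c, min=0).mean()

In a full training step these two terms would simply be added to the discriminator loss of the sketch after claim 4.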
6. The system according to claim 5, further comprising:

a real semantic feature acquisition module, configured to input text from Wikipedia into a hierarchical model to obtain the useful information of the text, and to input the useful information of the text into an auto-encoder to obtain the real semantic features;

a real visual feature acquisition module, configured to input the training image samples into an attention-mechanism-based CNN model to obtain the real visual features.
7. The zero-sample image recognition system based on a generative adversarial network according to claim 5, wherein the network model construction module specifically comprises:

a first generator construction unit, configured to construct the semantic feature generator; the semantic feature generator comprises two groups of convolution modules and two groups of fully-connected modules; each convolution module comprises a convolution layer, a max-pooling layer and a normalization layer which are connected in sequence; each fully-connected module comprises a fully-connected layer and a Leaky ReLU layer;

a second generator construction unit, configured to construct the visual feature generator; the visual feature generator comprises two groups of fully-connected modules, three 4096-dimensional fully-connected layers, a resampling layer and five groups of up-sampling modules which are connected in sequence; each up-sampling module comprises two up-sampling layers and two Leaky ReLU layers; the up-sampling layers in the up-sampling module are connected alternately with the Leaky ReLU layers;

a first discriminator construction unit, configured to construct the semantic discriminator; the semantic discriminator comprises a group of fully-connected modules, a two-way fully-connected layer, an n-way fully-connected layer, two classifiers and an input-label classifier;

a second discriminator construction unit, configured to construct the visual discriminator; the visual discriminator comprises a group of fully-connected modules, a fully-connected layer and a classifier.
8. The system according to claim 5, wherein the training module specifically comprises:

a training unit, configured to take the training image samples as the input of the semantic feature generator, and to jointly train the semantic feature generator, the visual feature generator, the semantic discriminator and the visual discriminator by back propagation according to the multi-objective loss function, so that the parameters of the semantic feature generator, the visual feature generator, the semantic discriminator and the visual discriminator are continuously updated and optimized, thereby obtaining the trained generative adversarial network model.
CN202010263452.4A 2020-04-07 2020-04-07 Zero sample image identification method and system based on generation countermeasure network Expired - Fee Related CN111476294B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010263452.4A CN111476294B (en) 2020-04-07 2020-04-07 Zero sample image identification method and system based on generation countermeasure network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010263452.4A CN111476294B (en) 2020-04-07 2020-04-07 Zero sample image identification method and system based on generation countermeasure network

Publications (2)

Publication Number Publication Date
CN111476294A (en) 2020-07-31
CN111476294B (en) 2022-03-22

Family

ID=71749908

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010263452.4A Expired - Fee Related CN111476294B (en) 2020-04-07 2020-04-07 Zero sample image identification method and system based on generation countermeasure network

Country Status (1)

Country Link
CN (1) CN111476294B (en)

Families Citing this family (66)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111950619B (en) * 2020-08-05 2022-09-09 东北林业大学 Active learning method based on dual-generation countermeasure network
CN112069397B (en) * 2020-08-21 2023-08-04 三峡大学 Rumor detection method combining self-attention mechanism and generation of countermeasure network
CN112001122B (en) * 2020-08-26 2023-09-26 合肥工业大学 Non-contact physiological signal measurement method based on end-to-end generation countermeasure network
CN112199479B (en) * 2020-09-15 2024-08-02 北京捷通华声科技股份有限公司 Method, device, equipment and storage medium for optimizing language semantic understanding model
CN112132197B (en) * 2020-09-15 2024-07-09 腾讯科技(深圳)有限公司 Model training, image processing method, device, computer equipment and storage medium
CN112149802B (en) * 2020-09-17 2022-08-09 广西大学 Image content conversion method with consistent semantic structure
CN112101470B (en) * 2020-09-18 2023-04-11 上海电力大学 Guide zero sample identification method based on multi-channel Gauss GAN
CN112199637B (en) * 2020-09-21 2024-04-12 浙江大学 Regression modeling method for generating contrast network data enhancement based on regression attention
CN112308113A (en) * 2020-09-23 2021-02-02 济南浪潮高新科技投资发展有限公司 Target identification method, device and medium based on semi-supervision
CN112232378A (en) * 2020-09-23 2021-01-15 中国人民解放军战略支援部队信息工程大学 Zero-order learning method for fMRI visual classification
CN112364138A (en) * 2020-10-12 2021-02-12 上海交通大学 Visual question-answer data enhancement method and device based on anti-attack technology
CN112287779B (en) * 2020-10-19 2022-03-25 华南农业大学 Low-illuminance image natural illuminance reinforcing method and application
CN112364894B (en) * 2020-10-23 2022-07-08 天津大学 Zero sample image classification method of countermeasure network based on meta-learning
CN112415514B (en) * 2020-11-16 2023-05-02 北京环境特性研究所 Target SAR image generation method and device
CN113191381B (en) * 2020-12-04 2022-10-11 云南大学 Image zero-order classification model based on cross knowledge and classification method thereof
CN112560034B (en) * 2020-12-11 2024-03-29 宿迁学院 Malicious code sample synthesis method and device based on feedback type deep countermeasure network
CN112667496B (en) * 2020-12-14 2022-11-18 清华大学 Black box countermeasure test sample generation method and device based on multiple prior
CN112580722B (en) * 2020-12-20 2024-06-14 大连理工大学人工智能大连研究院 Generalized zero sample image recognition method based on conditional countermeasure automatic encoder
CN112731327B (en) * 2020-12-25 2023-05-23 南昌航空大学 HRRP radar target identification method based on CN-LSGAN, STFT and CNN
CN112700408B (en) * 2020-12-28 2023-09-08 中国银联股份有限公司 Model training method, image quality evaluation method and device
CN112767505B (en) * 2020-12-31 2023-12-22 深圳市联影高端医疗装备创新研究院 Image processing method, training device, electronic terminal and storage medium
CN112767507B (en) * 2021-01-15 2022-11-18 大连理工大学 Cartoon sketch coloring method based on dynamic memory module and generation confrontation network
CN112766366A (en) * 2021-01-18 2021-05-07 深圳前海微众银行股份有限公司 Training method for resisting generation network and image processing method and device thereof
CN112766386B (en) * 2021-01-25 2022-09-20 大连理工大学 Generalized zero sample learning method based on multi-input multi-output fusion network
CN112818995B (en) * 2021-01-27 2024-05-21 北京达佳互联信息技术有限公司 Image classification method, device, electronic equipment and storage medium
CN113283423B (en) * 2021-01-29 2022-08-16 南京理工大学 Natural scene distortion text image correction method and system based on generation network
CN113221948B (en) * 2021-04-13 2022-08-05 复旦大学 Digital slice image classification method based on countermeasure generation network and weak supervised learning
CN113222002B (en) * 2021-05-07 2024-04-05 西安交通大学 Zero sample classification method based on generative discriminative contrast optimization
CN113140020B (en) * 2021-05-13 2022-10-14 电子科技大学 Method for generating image based on text of countermeasure network generated by accompanying supervision
CN113269274B (en) * 2021-06-18 2022-04-19 南昌航空大学 Zero sample identification method and system based on cycle consistency
CN113726545B (en) * 2021-06-23 2022-12-23 清华大学 Network traffic generation method and device for generating countermeasure network based on knowledge enhancement
CN113378959B (en) * 2021-06-24 2022-03-15 中国矿业大学 Zero sample learning method for generating countermeasure network based on semantic error correction
CN113706645A (en) * 2021-06-30 2021-11-26 酷栈(宁波)创意科技有限公司 Information processing method for landscape painting
CN113609569B (en) * 2021-07-01 2023-06-09 湖州师范学院 Distinguishing type generalized zero sample learning fault diagnosis method
CN113361646A (en) * 2021-07-01 2021-09-07 中国科学技术大学 Generalized zero sample image identification method and model based on semantic information retention
CN113537322B (en) * 2021-07-02 2023-04-18 电子科技大学 Zero sample visual classification method for cross-modal semantic enhancement generation countermeasure network
CN113505845A (en) * 2021-07-23 2021-10-15 黑龙江省博雅智睿科技发展有限责任公司 Deep learning training set image generation method based on language
CN113706379B (en) * 2021-07-29 2023-05-26 山东财经大学 Interlayer interpolation method and system based on medical image processing
CN113642621B (en) * 2021-08-03 2024-06-28 南京邮电大学 Zero sample image classification method based on generation countermeasure network
CN113657272B (en) * 2021-08-17 2022-06-28 山东建筑大学 Micro video classification method and system based on missing data completion
CN113746087B (en) * 2021-08-19 2023-03-21 浙江大学 Power grid transient stability sample controllable generation and evaluation method and system based on CTGAN
CN113763442B (en) * 2021-09-07 2023-06-13 南昌航空大学 Deformable medical image registration method and system
CN113762180B (en) * 2021-09-13 2023-09-01 中国科学技术大学 Training method and system for human body activity imaging based on millimeter wave radar signals
CN113806584B (en) * 2021-09-17 2022-10-14 河海大学 Self-supervision cross-modal perception loss-based method for generating command actions of band
CN114154550B (en) * 2021-10-12 2024-10-18 清华大学 Domain name countermeasure sample generation method and device
CN114067195B (en) * 2021-10-20 2024-08-13 北京航天自动控制研究所 Target detector learning method based on generated countermeasure
CN114373077A (en) * 2021-12-07 2022-04-19 燕山大学 Sketch identification method based on double-layer structure
CN114359659B (en) * 2021-12-17 2024-09-06 华南理工大学 Attention disturbance-based automatic image labeling method, system and medium
CN114176549B (en) * 2021-12-23 2024-04-16 杭州电子科技大学 Fetal heart rate signal data enhancement method and device based on generation type countermeasure network
CN114387444B (en) * 2021-12-24 2024-10-15 大连理工大学 Zero sample classification method based on negative boundary triplet loss and data enhancement
CN114005005B (en) * 2021-12-30 2022-03-22 深圳佑驾创新科技有限公司 Double-batch standardized zero-instance image classification method
CN114511737B (en) * 2022-01-24 2022-09-09 北京建筑大学 Training method of image recognition domain generalization model
CN114519118A (en) * 2022-02-21 2022-05-20 安徽大学 Zero sample sketch retrieval method based on multiple times of GAN and semantic cycle consistency
CN114998124B (en) * 2022-05-23 2024-06-18 北京航空航天大学 Image sharpening processing method for target detection
CN115187467B (en) * 2022-05-31 2024-07-02 北京昭衍新药研究中心股份有限公司 Enhanced virtual image data generation method based on generation countermeasure network
CN114723611B (en) * 2022-06-10 2022-09-30 季华实验室 Image reconstruction model training method, reconstruction method, device, equipment and medium
CN114757342B (en) * 2022-06-14 2022-09-09 南昌大学 Electronic data information evidence-obtaining method based on confrontation training
CN115314254B (en) * 2022-07-07 2023-06-23 中国人民解放军战略支援部队信息工程大学 Semi-supervised malicious traffic detection method based on improved WGAN-GP
CN115308705A (en) * 2022-08-05 2022-11-08 北京理工大学 Multi-pose extremely narrow pulse echo generation method based on generation countermeasure network
CN115222752B (en) * 2022-09-19 2023-01-24 之江实验室 Pathological image feature extractor training method and device based on feature decoupling
CN115424119B (en) * 2022-11-04 2023-03-24 之江实验室 Image generation training method and device capable of explaining GAN based on semantic fractal
CN115527216B (en) * 2022-11-09 2023-05-23 中国矿业大学(北京) Text image generation method based on modulation fusion and antagonism network generation
CN116579414B (en) * 2023-03-24 2024-04-02 浙江医准智能科技有限公司 Model training method, MRI thin layer data reconstruction method, device and equipment
CN117541883B (en) * 2024-01-09 2024-04-09 四川见山科技有限责任公司 Image generation model training, image generation method, system and electronic equipment
CN117610614B (en) * 2024-01-11 2024-03-22 四川大学 Attention-guided generation countermeasure network zero sample nuclear power seal detection method
CN117934930B (en) * 2024-01-12 2024-09-10 西南计算机有限责任公司 Target identification method based on unmanned platform


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10810767B2 (en) * 2018-06-12 2020-10-20 Siemens Healthcare Gmbh Machine-learned network for Fourier transform in reconstruction for medical imaging

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108875818A (en) * 2018-06-06 2018-11-23 西安交通大学 Based on variation from code machine and confrontation network integration zero sample image classification method
CN109460814A (en) * 2018-09-28 2019-03-12 浙江工业大学 A kind of deep learning classification method for attacking resisting sample function with defence
CN109816032A (en) * 2019-01-30 2019-05-28 中科人工智能创新技术研究院(青岛)有限公司 Zero sample classification method and apparatus of unbiased mapping based on production confrontation network
CN110334781A (en) * 2019-06-10 2019-10-15 大连理工大学 A kind of zero sample learning algorithm based on Res-Gan
CN110490946A (en) * 2019-07-15 2019-11-22 同济大学 Text generation image method based on cross-module state similarity and generation confrontation network
CN110443293A (en) * 2019-07-25 2019-11-12 天津大学 Based on double zero sample image classification methods for differentiating and generating confrontation network text and reconstructing
CN110795585A (en) * 2019-11-12 2020-02-14 福州大学 Zero sample image classification model based on generation countermeasure network and method thereof

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
"Unpaired Image-to-Image Translation Using Cycle-Consistent Adversarial Networks";Jun-Yan Zhu等;《2017 IEEE International Conference on Computer Vision (ICCV)》;20171225;第2242-2251页 *
"基于去冗余特征和语义关系约束的零样本属性识别";张桂梅等;《模式识别与人工智能》;20210930;第 34 卷(第 9 期);第809-823页 *
"结合迁移引导和双向循环结构 GAN 的零样本文本识别";张桂梅等;《模式识别与人工智能 》;20201231;第 33 卷(第 12 期);第1083-1096页 *

Also Published As

Publication number Publication date
CN111476294A (en) 2020-07-31

Similar Documents

Publication Publication Date Title
CN111476294B (en) Zero sample image identification method and system based on generation countermeasure network
CN110147457B (en) Image-text matching method, device, storage medium and equipment
CN108875818B (en) Zero sample image classification method based on combination of variational self-coding machine and antagonistic network
CN110059217B (en) Image text cross-media retrieval method for two-stage network
CN111581405A (en) Cross-modal generalization zero sample retrieval method for generating confrontation network based on dual learning
CN112966127A (en) Cross-modal retrieval method based on multilayer semantic alignment
CN110232395B (en) Power system fault diagnosis method based on fault Chinese text
Chen Model reprogramming: Resource-efficient cross-domain machine learning
CN112232053B (en) Text similarity computing system, method and storage medium based on multi-keyword pair matching
CN112732916A (en) BERT-based multi-feature fusion fuzzy text classification model
CN110287354A (en) A kind of high score remote sensing images semantic understanding method based on multi-modal neural network
Wang et al. Advanced Multimodal Deep Learning Architecture for Image-Text Matching
CN117725261A (en) Cross-modal retrieval method, device, equipment and medium for video text
Tang et al. Class-level prototype guided multiscale feature learning for remote sensing scene classification with limited labels
CN115204171A (en) Document-level event extraction method and system based on hypergraph neural network
Nijhawan et al. VTnet+ Handcrafted based approach for food cuisines classification
CN113222002A (en) Zero sample classification method based on generative discriminative contrast optimization
Li et al. Evaluating BERT on cloud-edge time series forecasting and sentiment analysis via prompt learning
Fang Detection of white blood cells using YOLOV3 network
Xie et al. Full-view salient feature mining and alignment for text-based person search
CN115640418A (en) Cross-domain multi-view target website retrieval method and device based on residual semantic consistency
Li et al. ViT2CMH: Vision Transformer Cross-Modal Hashing for Fine-Grained Vision-Text Retrieval.
CN113723111B (en) Small sample intention recognition method, device, equipment and storage medium
Singh et al. Visual content generation from textual description using improved adversarial network
Wang et al. Contrastive embedding-based feature generation for generalized zero-shot learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee
Granted publication date: 20220322