CN114386534A - Image augmentation model training method and image classification method based on variational self-encoder and countermeasure generation network - Google Patents

Image augmentation model training method and image classification method based on variational self-encoder and countermeasure generation network

Info

Publication number
CN114386534A
Authority
CN
China
Prior art keywords
visual
encoder
semantic
feature
pseudo
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210111331.7A
Other languages
Chinese (zh)
Inventor
饶元
苏仕芳
江朝晖
金�秀
张武
梁惠
李绍稳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Anhui Agricultural University AHAU
Original Assignee
Anhui Agricultural University AHAU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Anhui Agricultural University AHAU filed Critical Anhui Agricultural University AHAU
Priority to CN202210111331.7A priority Critical patent/CN114386534A/en
Publication of CN114386534A publication Critical patent/CN114386534A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G06F18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/25 - Fusion techniques
    • G06F18/253 - Fusion techniques of extracted features
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an image augmentation model training method and an image classification method based on a variational self-encoder and a generative adversarial network. For zero-sample image classification, a model trained on the visible classes generates pseudo visual features of unseen-class training images, and a classifier trained on these features together with their class labels classifies the unseen-class images. The method effectively fuses the visual and semantic information of images, generates high-quality visible-class and unseen-class samples that lie closer to the real data distribution, and improves the accuracy of zero-sample image classification.

Description

Image augmentation model training method and image classification method based on variational self-encoder and countermeasure generation network
Technical Field
The invention belongs to the technical field of image recognition, and particularly relates to an image augmentation model training method and an image classification method based on a variational self-encoder and a generative adversarial network.
Background
Traditional image classification not only requires a large amount of labeled image data, but also performs poorly when the classes in the training set and the test set are inconsistent. For example, to recognize a picture that has never been seen or that belongs to no class in the training set, new samples must be collected and labeled, and enough training samples gathered to retrain the model before it can recognize such pictures. This process is costly and slow, and in practice the acquisition and labeling of large numbers of labeled images is complex and uncertain. Zero-shot learning (ZSL) is therefore proposed to address the problem of missing unseen-class samples.
Zero-sample learning is a special scenario of transfer learning, aimed at recognizing samples of classes that do not appear in the training samples. In general, zero-sample learning lets a model imitate human reasoning and identify things it has never seen. Labeled samples in the feature space belong to the visible classes, and unlabeled samples in the feature space belong to the unseen classes. Traditional zero-sample learning seeks the mapping between the visual features and the semantic features of images from the given visible-class pictures, and generalizes this mapping to unseen-class pictures so that they can be recognized, thereby accomplishing the zero-sample image recognition task. For example, after a zero-sample recognition model is trained with image data of cauliflowers, inputting the semantic relation "broccoli is a green cauliflower" enables the model to recognize and classify pictures of broccoli.
Zero-sample learning first establishes two basic spaces: the feature space and the semantic space of the categories. The elements of the feature space are the visual features of the pictures; the semantic space of the categories describes the attributes of the picture labels and is usually expressed as a semantic attribute space or a semantic word-vector space. What zero-sample learning must do is learn the mapping between the feature space and the semantic space. The visual features in the feature space are generally extracted with a deep convolutional neural network and have a high dimensionality, whereas the semantic space has a low dimensionality, so the mapping splits into two directions: from the feature space (high-dimensional) to the semantic space (low-dimensional), and from the semantic space (low-dimensional) to the feature space (high-dimensional). In the first direction, for any picture the learned mapping projects its features from the feature space into the semantic space, where a nearest neighbor is searched to recognize and classify the picture. In the second direction, given only the semantic description of an unseen class, a word-vector model produces its low-dimensional semantic features, the learned mapping generates the corresponding image features, and these features are fed into a classifier to obtain the category. However, because the data distributions of the visible and unseen classes differ, directly mapping between the visual space and the semantic space biases the recognition of unseen classes toward the visible classes. To alleviate this, generative models for zero-sample learning have been proposed: the visual and semantic features are fed into a generative model to synthesize training samples of the unseen classes, a classifier is trained on them directly, and zero-sample classification is thus converted into classical supervised learning.
Generative models for zero-sample learning are mainly based on the generative adversarial network (GAN) and the variational self-encoder (VAE). Samples generated by a GAN are clearer and more vivid, but GAN training is unstable, so some generated samples deviate severely from the real distribution and the model is prone to collapse. In contrast, the training process of a variational self-encoder is relatively stable and directly compares the reconstructed picture with the original picture; however, because it uses the mean squared error between the two as the loss function, the generated pictures are of low quality.
Disclosure of Invention
1. Problems to be solved
Aiming at the zero-sample image classification problem, the invention combines the advantages of the variational self-encoder and the generative adversarial network to fully fuse the visual and semantic information of images and generate more effective samples, and provides an image augmentation model training method based on a variational self-encoder and an adversarial generation network for effectively synthesizing the visual features of unseen-class images. The method combines the variational self-encoder, whose training process is stable, with the generative adversarial network, whose generated samples are clear; it inputs the visual features and the semantic features of the image samples at the same time, effectively matches the visual and semantic information of the images, and improves the quality of the generated data. This effectively solves the problem of missing unseen-class images in zero-sample learning; the generated pseudo samples are used to train a classifier, converting zero-sample learning into classical supervised learning and thereby improving the accuracy of zero-sample image classification.
2. Technical scheme
In order to solve the problems, the invention adopts the following technical scheme:
the invention provides a method for training an image augmentation model based on a variational self-encoder and a confrontation generation network, which is characterized by comprising the following steps:
s110: acquiring a visible training image, and extracting visual features and semantic features of the visible training image;
s120: an image augmentation model is configured in advance, and the image augmentation model comprises a visual modal variation self-encoder, a semantic modal variation self-encoder and a generator configured according to a generated countermeasure network;
s130: respectively inputting the visual characteristic and the semantic characteristic into a visual modal variation self-encoder and a semantic modal variation self-encoder to generate a first pseudo visual characteristic and a pseudo semantic characteristic;
s140: inputting the first pseudo-visual feature and the pseudo-semantic feature into a pre-configured generator, and fusing to generate a second pseudo-visual feature;
s150: and performing back propagation optimization parameters according to the loss function of the image augmentation model until the overall loss function is converged, and storing the model parameters to obtain the trained image augmentation model.
As one example, the loss function includes a countermeasure loss function, and the countermeasure loss function obtaining step includes:
configuring a visual feature discriminator and a semantic feature discriminator;
inputting the visual feature and the second pseudo visual feature into a visual feature discriminator to obtain first discrimination information;
inputting the semantic features and the pseudo-semantic features into a semantic feature discriminator to obtain second discrimination information;
respectively determining a countermeasure loss function according to the first discrimination information and the second discrimination information, and updating parameters of a visual feature discriminator and a semantic feature discriminator by adopting an Adam gradient descent algorithm;
the loss function further comprises the total loss function L_VAE of the variational self-encoders: the reconstruction loss L_recon^V of the visual-modality variational self-encoder together with its KL divergence loss, and the reconstruction loss L_recon^S of the semantic-modality variational self-encoder together with its KL divergence loss.
As an example, in the step S110:
extracting visual features of the visible training images by using a visual feature extraction model, wherein the visual feature extraction model uses a convolutional neural network and a Transformer encoder as a feature extraction network;
inputting the visible training images into a convolutional neural network to obtain a characteristic diagram;
dividing the feature map into multi-dimensional feature vector blocks, and mapping each feature vector block into a one-dimensional vector through linear mapping to obtain a plurality of feature vectors;
and carrying out position coding on the feature vector, embedding the feature vector into the Transformer encoder, repeatedly stacking encoder blocks in the encoder for L times, outputting a second-dimension feature vector, and recombining the second-dimension feature vector into visual features with a preset size.
As an example, in the step S110:
and extracting semantic features of the visible training images by using a semantic feature extraction model, taking a continuous bag-of-words model obtained through unsupervised training in a text corpus as the semantic feature extraction model, extracting semantic feature vectors of the visible images by using the semantic feature extraction model, and converting the semantic feature vectors into semantic features with preset sizes through a dimension transformation network.
As an example, the visual-modality variational self-encoder in step S120 comprises an encoder network E1 and a decoder network D1, wherein the encoder network E1 is a full convolution network with n convolution layers whose number of filter channels increases layer by layer so as to learn deep features; the output of the last convolution layer of the full convolution network is two n-dimensional vectors, a mean vector and a variance vector;
the encoder network E1 maps the visual features to an interval vector represented by the probability distribution N(μ, Σ), and samples this interval vector to obtain the hidden variable Z_1, where μ is the mean vector and Σ is the variance vector; the probability distribution of the hidden variable Z_1 is then:
q_1(Z_1 | x) = N(Z_1 | μ_1, Σ_1),  p(Z_1) = N(Z_1 | 0, I)
where q_1(Z_1 | x) denotes the probability distribution obeyed by the hidden variable Z_1, p(Z_1) denotes the prior distribution of Z_1, here the unit Gaussian distribution, μ_1 and Σ_1 denote the mean and variance of Z_1, and N denotes a normal distribution.
It should be noted that the encoder part of a variational self-encoder maps the feature data into another, hidden variable space described by a statistical distribution whose parameters are a mean and a variance. Since a neural network can in theory fit any function, the encoder network maps the input features to the mean and variance vectors through the full convolution network; the variational self-encoder then randomly samples an element from the hidden space defined by that mean and variance and decodes it back toward the original input.
As an example, the semantic-modality variational self-encoder comprises an encoder network E2 and a decoder network D2, both of which use two fully connected layers for encoding and decoding. The semantic features are input into the encoder network E2 to obtain the hidden variable Z_2, and the decoder network D2 restores, from the probability distribution of Z_2, an approximation of the original data distribution, i.e. generates a pseudo-semantic feature â similar to the semantic feature, while the parameters e_2 and d_2 of the encoder network E2 and the decoder network D2 are updated and the reconstruction loss is calculated:
q_2(Z_2 | a) = N(Z_2 | μ_2, Σ_2),  p(Z_2) = N(Z_2 | 0, I)
â = D_2(Z_2),  Z_2 ~ q_2(Z_2 | a)
L_recon^S = || a - â ||_2^2
where q_2(Z_2 | a) denotes the probability distribution obeyed by the hidden variable Z_2, p(Z_2) denotes the prior distribution of Z_2, here the unit Gaussian distribution, μ_2 and Σ_2 denote the mean and variance of Z_2, N denotes a normal distribution, L_recon^S denotes the reconstruction loss of the semantic-modality variational self-encoder, and || · ||_2^2 denotes the squared L2 norm.
As an example, in step S130 the total loss function L_VAE of the visual-modality and semantic-modality variational self-encoders is calculated. The total loss function L_VAE comprises the total reconstruction loss and the KL divergence losses of the visual-modality and semantic-modality variational self-encoders, the total reconstruction loss measuring how similar the second pseudo-visual feature is to the visual feature data, as shown in the following formula:
L_VAE = λ (L_recon^V + L_recon^S) + β (D_KL(q_1(Z_1 | x) || p(Z_1)) + D_KL(q_2(Z_2 | a) || p(Z_2)))
where L_VAE is the sum of the losses of the visual-modality and semantic-modality variational self-encoders, L_recon^V denotes the reconstruction loss of the visual-modality variational self-encoder, L_recon^S denotes the reconstruction loss of the semantic-modality variational self-encoder, q_1(Z_1 | x) and q_2(Z_2 | a) denote the probability distributions obeyed by the hidden variables Z_1 and Z_2, p(Z_1) and p(Z_2) denote the prior distributions of Z_1 and Z_2, D_KL is the KL divergence loss, λ is the weight of the reconstruction loss term, used to reduce the difference between the generated features and the real features, and β is the weight of the KL divergence loss term, used to encourage the network to learn a broader distribution; given a hidden variable space of dimension n, the KL divergence loss is defined as:
L_KL = (1/2) Σ_{i=1}^{n} (μ_i^2 + σ_i^2 - log σ_i^2 - 1)
where L_KL denotes the KL divergence loss, μ_i denotes the mean of spatial dimension i, and σ_i denotes the variance of spatial dimension i.
As an example, the generator network in step S140 is a multi-layer perceptron network with two fully connected hidden layers. Global average pooling is applied to the first pseudo-visual feature to obtain a pooled pseudo-visual feature of a preset size; the pooled pseudo-visual feature and the pseudo-semantic feature are input into the generator network to generate a new feature, the generated feature is transformed with a reshape function to obtain the second pseudo-visual feature, and the parameters of the generator are updated with the Adam gradient descent algorithm, the loss function of the generator being:
L_G = E_{x̂,â}[log(1 - D_v(x̃))],  x̃ = G(x̂, â)
where D_v denotes the visual feature discriminator configured above, and x̃ denotes the second pseudo-visual feature generated after the first pseudo-visual feature x̂ and the pseudo-semantic feature â pass through the generator G.
As an example, the adversarial loss functions of the visual feature discriminator D_v and the semantic feature discriminator D_s are as follows:
L_Dv = E_x[log D_v(x)] + E_{x̂,â}[log(1 - D_v(x̃))]
L_Ds = E_a[log D_s(a)] + E_â[log(1 - D_s(â))]
where L_Dv denotes the adversarial loss function of the visual feature discriminator, L_Ds denotes the adversarial loss function of the semantic feature discriminator, E_x denotes the expectation over the visual feature x, E_{x̂,â} denotes the expectation over the first pseudo-visual feature x̂ and the pseudo-semantic feature â, E_a denotes the expectation over the semantic feature a, E_â denotes the expectation over the pseudo-semantic feature â, D_v(x) denotes the output of the visual feature discriminator for the input visual feature x, x̃ = G(x̂, â) denotes the second pseudo-visual feature generated from the input first pseudo-visual feature x̂ and pseudo-semantic feature â, and D_v(x̃) denotes the output of the visual feature discriminator for the input second pseudo-visual feature x̃.
A second aspect of the invention provides an image classification method based on the image augmentation model of the variational self-encoder and the countermeasure generation network, comprising the following steps:
inputting the visual features and semantic features of unseen-class training images into the image augmentation model to generate pseudo-visual features of the unseen-class training images;
and training an image classifier with the generated pseudo-visual features, and inputting the visual features of the unseen-class test images to be recognized into the classifier for classification to obtain the classification result.
A third aspect of the present invention provides an electronic device, including a processor, an input device, an output device, and a memory, where the processor, the input device, the output device, and the memory are connected in sequence, the memory is used for storing a computer program, the computer program includes program instructions, and the processor is configured to call the program instructions to execute the above method.
A fourth aspect of the invention provides a readable storage medium, storing a computer program comprising program instructions, which when executed by a processor, cause the processor to perform the method as described above.
3. Advantageous effects
Compared with the prior art, the invention has the beneficial effects that:
(1) The embodiment of the invention provides an image augmentation model training method based on a variational self-encoder and a generative adversarial network. Aiming at the zero-sample image classification problem, it combines the advantages of the variational self-encoder and the generative adversarial network, fully fusing the visual and semantic information of images to generate more effective samples and thereby effectively synthesizing the visual features of unseen-class images. The method combines the variational self-encoder, whose training process is stable, with the generative adversarial network, whose generated samples are clear; the visual features and semantic features of the image samples are input at the same time, the visual and semantic information of the images is effectively matched, and the quality of the generated data is improved. This effectively solves the problem of missing unseen-class images in zero-sample learning; the generated pseudo samples are used to train a classifier, converting zero-sample learning into classical supervised learning and thereby improving the accuracy of zero-sample image classification.
(2) Features of different modalities are encoded and decoded by variational self-encoders: the encoder constructs an exclusive probability distribution for each sample, the distribution is sampled, and the decoder reconstructs the data, so that the generated sample features are closer to the real data distribution and the method is more robust.
(3) In the embodiment of the invention, the outputs of the visual-modality and semantic-modality variational self-encoders are fed in series into the generator network, fully fusing the visual and semantic information of the image; the association between visual and semantic features can thus be mined more effectively, visual features are synthesized more effectively, and the influence of imbalance in the generated data on the model is reduced. The embodiment effectively combines the advantages of the variational self-encoder and of the generative adversarial network: the variational self-encoder directly compares the difference between the generated data and the original data through encoding and decoding and its training process is stable, while the adversarial generation network produces clear samples, so the stability and the discrimination capability of the generative model are improved.
(4) The invention constructs a network model consisting of two modality variational self-encoders, a generative adversarial network generator and two discriminator networks, effectively synthesizing visual and semantic features that are closer to the real data distribution, ensuring the alignment between different modalities and synthesizing more effective visual features. In addition, a multi-module loss function is constructed, comprising the variational self-encoder loss functions, the generator loss function and the adversarial loss functions of the visual and semantic feature discriminators; training the variational self-encoder adversarial generation network model on this multi-module loss effectively alleviates problems such as gradient explosion and model collapse, and improves the performance of the model.
(5) The invention uses the variational self-encoder adversarial generation network model to generate unseen-class pseudo samples, effectively solving the problem of missing unseen-class samples; the generated pseudo samples are used to train a classifier that classifies unseen-class test images, increasing the generalization capability of the model.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings required in the embodiments are briefly described below.
FIG. 1 is a flowchart of an image augmentation model training method based on a variational auto-encoder and a confrontation generation network according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of an image augmentation model according to an embodiment of the present invention;
FIG. 3 is a network architecture diagram of a visual feature extraction model in an embodiment of the invention;
FIG. 4 is a network architecture diagram of a semantic feature extraction model in an embodiment of the present invention;
FIG. 5 is a network architecture diagram of a visual modality variation autoencoder in an embodiment of the present invention;
FIG. 6 is a network structure diagram of a semantic modality variation self-encoder according to an embodiment of the present invention;
fig. 7 is a network configuration diagram of the generator G in the embodiment of the present invention;
FIG. 8 is a network architecture diagram of a visual feature discriminator D in an embodiment of the present invention;
fig. 9 is a network structure diagram of the semantic feature discriminator D according to the embodiment of the present invention;
FIG. 10 is a flowchart of an embodiment of an image augmentation model for zero-sample image classification.
Detailed Description
In order to make those skilled in the art better understand the technical solution of the present invention, the following describes in detail an image augmentation model training method and an image classification method based on a variational self-encoder and an antagonistic generation network according to the present invention with reference to the accompanying drawings.
As shown in fig. 1 and 2, the present example provides an image augmentation model training method based on a variational self-encoder and a countermeasure generation network, the method including the steps of:
s110: and acquiring a visible training image, and extracting visual features and semantic features of the visible training image.
Specifically, the visible-class training images in this example are the labeled samples in the feature space; the unlabeled samples in the feature space belong to the unseen classes. The visual features of the visible-class training images are obtained with a pre-configured visual feature extraction model, which is trained with a convolutional neural network and a Transformer encoder. The semantic features of the visible-class training images are obtained with a pre-configured semantic feature extraction model, which is a continuous bag-of-words model obtained through unsupervised training. The visual and semantic features of the visible-class training images are defined as x and a respectively, the visual and semantic features of the unseen-class training images as x_ut and a_ut respectively, and the visual features of the unseen-class test images as x_t.
It should be noted that this example uses AWA2 as the image data set. The data set comprises text and picture files: the text records the animal categories contained in the data set and the attribute labels of each category, and the picture files contain 37322 pictures of 50 kinds of animals, with 30337 pictures of 40 training classes and 6985 pictures of 10 test classes. The data selected from AWA2 in this example use 75 category attributes, 40 visible classes and 10 unseen classes, with 23337 visible-class samples and 7265 unseen-class samples. It should be understood that other data sets may be selected here, and this should not be construed as limiting the invention.
In one embodiment, as shown in fig. 3, the visual feature extraction module consists of a VGGNet16 model and a Transformer encoder. The image is input into the convolutional neural network VGGNet16, which outputs a feature map of size 16 × 16 × 1024. The feature map is divided into 256 feature vector blocks of 1024 dimensions; each block is mapped into a one-dimensional vector through linear mapping, giving 256 vectors (usually called tokens) of length 512. The vectors are position-coded and embedded into the Transformer encoder, in which encoder blocks are repeatedly stacked L times; 256 feature vectors of 512 dimensions are output and recombined into the visual features of the visible-class image, of size 16 × 16 × 512. The value of L in this example is 5.
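For concreteness, the following is a minimal PyTorch sketch of such a CNN-plus-Transformer extractor. The truncation point of VGGNet16, the number of attention heads and the learned position embedding are assumptions of this sketch rather than details fixed by the embodiment; only the stated sizes (a 16 × 16 × 1024 feature map, 256 tokens of dimension 512, L = 5 encoder blocks) follow the text above.

import torch
import torch.nn as nn
from torchvision.models import vgg16

class VisualFeatureExtractor(nn.Module):
    """CNN + Transformer-encoder visual feature extractor (illustrative sketch)."""
    def __init__(self, num_layers=5, embed_dim=512, num_tokens=256, cnn_channels=1024):
        super().__init__()
        # Truncated VGGNet16 backbone (through the 4th pooling stage, an assumption),
        # followed by a 1x1 convolution to reach 1024 channels; a 256x256 input yields 16x16.
        self.backbone = nn.Sequential(*list(vgg16(weights=None).features.children())[:24],
                                      nn.Conv2d(512, cnn_channels, kernel_size=1))
        self.proj = nn.Linear(cnn_channels, embed_dim)                        # linear mapping per block
        self.pos_embed = nn.Parameter(torch.zeros(1, num_tokens, embed_dim))  # position coding
        encoder_layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)  # stacked L times

    def forward(self, img):                         # img: (B, 3, 256, 256)
        fmap = self.backbone(img)                   # (B, 1024, 16, 16) feature map
        tokens = fmap.flatten(2).transpose(1, 2)    # (B, 256, 1024) feature vector blocks
        tokens = self.proj(tokens) + self.pos_embed # (B, 256, 512) position-coded tokens
        tokens = self.encoder(tokens)               # (B, 256, 512) after L encoder blocks
        return tokens.transpose(1, 2).reshape(-1, 512, 16, 16)  # visual feature x, 16 x 16 x 512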
As shown in fig. 4, in one embodiment the semantic feature extraction model is obtained through unsupervised training: a Continuous Bag-of-Words model (CBOW) trained unsupervised on a large-scale text corpus is obtained in advance; the category semantic label information of the visible-class image is input into this model to obtain the semantic feature vector of the visible-class image, and a dimension transformation network with only one hidden layer converts the semantic feature vector into a semantic feature of dimension 1 × 1 × 512.
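A hedged sketch of this step is shown below, assuming the CBOW word vectors come from a gensim word2vec model; the model file name "cbow_corpus.model", the 300-dimensional word vectors, the query word and the hidden width of the dimension transformation network are illustrative assumptions.

import numpy as np
import torch
import torch.nn as nn
from gensim.models import Word2Vec

class DimensionTransform(nn.Module):
    """Single-hidden-layer network mapping a CBOW word vector to a 512-d semantic feature."""
    def __init__(self, in_dim=300, hidden_dim=512, out_dim=512):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, hidden_dim), nn.ReLU(),
                                 nn.Linear(hidden_dim, out_dim))
    def forward(self, w):
        return self.net(w)                          # semantic feature a, shape (B, 512)

# Hypothetical usage: the file name and the class word "zebra" are placeholders.
cbow = Word2Vec.load("cbow_corpus.model")           # CBOW model trained unsupervised on a corpus
vec = torch.tensor(np.asarray(cbow.wv["zebra"]), dtype=torch.float32).unsqueeze(0)
semantic_feature = DimensionTransform(in_dim=vec.shape[1])(vec)   # 1 x 512 semantic feature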
S120: an image augmentation model is configured in advance, and the image augmentation model comprises a visual modal variation self-encoder, a semantic modal variation self-encoder and a generator for generating countermeasure network configuration.
Specifically, the image augmentation model in this example is preconfigured, wherein the visual modality variational self-encoder comprises an encoder network E1 and a decoder network D1, and the semantic modality variational self-encoder comprises an encoder network E2 and a decoder network D2.
As shown in fig. 5, the encoder network E1 is a Full Convolutional Network (FCN) containing n convolution layers; the number of filter channels increases layer by layer in order to learn deep features, and the output of the last convolution in the full convolution network is two n-dimensional vectors: the mean vector and the variance vector.
The encoder network E1 maps the visual features to an interval vector represented by the probability distribution N(μ, Σ), and samples this interval vector to obtain the hidden variable Z_1, where μ is the mean vector and Σ is the variance vector; the mean and variance contain the structural information of the input features. The probability distribution of the hidden variable Z_1 is then:
q_1(Z_1 | x) = N(Z_1 | μ_1, Σ_1),  p(Z_1) = N(Z_1 | 0, I)    (1)
where q_1(Z_1 | x) denotes the probability distribution obeyed by the hidden variable Z_1, p(Z_1) denotes the prior distribution of Z_1, here the unit Gaussian distribution, μ_1 and Σ_1 denote the mean and variance of Z_1, and N denotes a normal distribution. The encoder network E1 in this example is a 3-layer convolution with filter sizes of 32, 64 and 128 respectively; the last convolution outputs the two vectors, and the final output of the decoder network is a pseudo-visual feature with the same dimensions as the true visual feature, namely 16 × 16 × 512. The filters of a convolution layer identify certain specific features of the image, and each filter slides over the feature map of the previous layer; the shallow layers of a convolutional neural network generally detect primary features such as edges and colors, and as the number of convolution layers increases the filters convolve these features into various new features, so that deeper convolution layers extract deeper features.
Further, the structures of the decoder network D1 and the encoder network E1 are essentially symmetrical. Since the image feature dimensions shrink after the encoder's full convolution network, the decoder uses upsampling layers to gradually enlarge the feature dimensions, while the number of filter channels gradually decreases, until the original feature dimensions are restored. The decoder network D1 generates the first pseudo-visual feature x̂, similar to the original visual feature of the image, the parameters e_1 and d_1 of the encoder E1 and the decoder D1 are updated, and the reconstruction loss of the decoder D1 is calculated:
x̂ = D_1(Z_1),  Z_1 ~ q_1(Z_1 | x)    (2)
L_recon^V = || x - x̂ ||_2^2    (3)
where the visual feature x is obtained by a convolution operation of the Transformer feature and the semantic feature, namely x = t × a, L_recon^V denotes the reconstruction loss of the visual-modality variational self-encoder, and || · ||_2^2 denotes the squared L2 norm.
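A compact PyTorch sketch of this visual-modality variational self-encoder follows. The kernel sizes, strides, latent dimension and the use of log-variance with the reparameterization trick are assumptions of the sketch; only the 32/64/128 filter progression, the mean and variance outputs, and the 16 × 16 × 512 input and output size follow the description above.

import torch
import torch.nn as nn

class VisualVAE(nn.Module):
    """Visual-modality variational self-encoder E1/D1 (illustrative sketch)."""
    def __init__(self, in_ch=512, latent_dim=128):
        super().__init__()
        # E1: full-convolution encoder, filter channels 32 -> 64 -> 128 (as in the embodiment)
        self.enc = nn.Sequential(
            nn.Conv2d(in_ch, 32, 3, stride=2, padding=1), nn.ReLU(),   # 16x16 -> 8x8
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),      # 8x8 -> 4x4
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU())     # 4x4 -> 2x2
        self.to_mu = nn.Conv2d(128, latent_dim, 2)       # mean vector mu_1
        self.to_logvar = nn.Conv2d(128, latent_dim, 2)   # (log-)variance vector
        # D1: upsampling decoder, channel count shrinking back to the original 512
        self.dec = nn.Sequential(
            nn.Upsample(scale_factor=2), nn.Conv2d(latent_dim, 128, 3, padding=1), nn.ReLU(),
            nn.Upsample(scale_factor=2), nn.Conv2d(128, 64, 3, padding=1), nn.ReLU(),
            nn.Upsample(scale_factor=2), nn.Conv2d(64, 32, 3, padding=1), nn.ReLU(),
            nn.Upsample(scale_factor=2), nn.Conv2d(32, in_ch, 3, padding=1))

    def forward(self, x):                                 # x: (B, 512, 16, 16)
        h = self.enc(x)
        mu, logvar = self.to_mu(h), self.to_logvar(h)     # (B, latent_dim, 1, 1)
        z1 = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterized sample of Z_1
        x_hat = self.dec(z1)                              # first pseudo-visual feature x^, (B, 512, 16, 16)
        return x_hat, mu, logvar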
As shown in fig. 6, the semantic-modality variational self-encoder comprises an encoder network E2 and a decoder network D2, both of which use two fully connected layers for encoding and decoding. The semantic features are input into the encoder network E2 to obtain the hidden variable Z_2, and the decoder network D2 restores, from the probability distribution of the hidden variable, an approximation of the original data distribution, i.e. generates a pseudo-semantic feature â similar to the original semantic feature, while the parameters e_2 and d_2 of the encoder network E2 and the decoder network D2 are updated and the reconstruction loss is calculated:
q_2(Z_2 | a) = N(Z_2 | μ_2, Σ_2),  p(Z_2) = N(Z_2 | 0, I)    (4)
â = D_2(Z_2),  Z_2 ~ q_2(Z_2 | a)    (5)
L_recon^S = || a - â ||_2^2    (6)
where q_2(Z_2 | a) denotes the probability distribution obeyed by the hidden variable Z_2, p(Z_2) denotes the prior distribution of Z_2, here the unit Gaussian distribution, μ_2 and Σ_2 denote the mean and variance of Z_2, N denotes a normal distribution, L_recon^S denotes the reconstruction loss of the semantic-modality variational self-encoder, and || · ||_2^2 denotes the squared L2 norm;
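A corresponding sketch of the semantic-modality variational self-encoder, with two fully connected layers on each side, is shown below; the hidden and latent dimensions are assumptions of the sketch.

import torch
import torch.nn as nn

class SemanticVAE(nn.Module):
    """Semantic-modality variational self-encoder E2/D2: two fully connected layers each."""
    def __init__(self, in_dim=512, hidden_dim=256, latent_dim=64):
        super().__init__()
        self.enc_fc = nn.Sequential(nn.Linear(in_dim, hidden_dim), nn.ReLU())   # E2, layer 1
        self.to_mu = nn.Linear(hidden_dim, latent_dim)        # E2, layer 2 -> mean mu_2
        self.to_logvar = nn.Linear(hidden_dim, latent_dim)    # E2, layer 2 -> (log-)variance
        self.dec = nn.Sequential(nn.Linear(latent_dim, hidden_dim), nn.ReLU(),  # D2
                                 nn.Linear(hidden_dim, in_dim))

    def forward(self, a):                                     # a: (B, 512) semantic feature
        h = self.enc_fc(a)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z2 = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # sample hidden variable Z_2
        a_hat = self.dec(z2)                                  # pseudo-semantic feature a^
        recon_loss = ((a - a_hat) ** 2).sum(dim=1).mean()     # squared L2 reconstruction loss
        return a_hat, mu, logvar, recon_loss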
s130: and respectively inputting the visual characteristic and the semantic characteristic into a visual mode variational self-encoder and a semantic mode variational self-encoder to generate a first pseudo visual characteristic and a pseudo semantic characteristic.
Specifically, the visual features obtained in step S110 are input into a visual mode variational self-encoder, and encoded to obtain a first pseudo-visual feature; and (5) inputting the semantic features obtained in the step (S110) into a semantic mode variational self-encoder, and encoding to obtain pseudo-semantic features. The pseudo-semantic feature size finally output in this example is 1 × 1 × 512, and the pseudo-visual feature finally output has the same dimension as the real visual feature, namely 16 × 16 × 512.
S140: and inputting the first pseudo-visual feature and the pseudo-semantic feature into a pre-configured generator, and fusing to generate a second pseudo-visual feature.
Specifically, the generator network in this example is a Multi-Layer Perceptron (MLP) network with two fully connected hidden layers, whose inputs are the first pseudo-visual feature x̂ and the pseudo-semantic feature â. Global average pooling is applied to the first pseudo-visual feature to obtain a pooled pseudo-visual feature of a preset size, 1 × 1 × C; the pooled feature and the pseudo-semantic feature are input into the generator network to generate a new feature, which is converted with a reshape function into the second pseudo-visual feature of size H × W × C, and the parameters of the generator are updated with the Adam gradient descent algorithm, the loss function of the generator being:
L_G = E_{x̂,â}[log(1 - D_v(x̃))],  x̃ = G(x̂, â)
where D_v denotes the visual feature discriminator described below, and x̃ denotes the second pseudo-visual feature generated by the generator G from the first pseudo-visual feature x̂ and the pseudo-semantic feature â;
in the present example, the first pseudo-visual feature and the pseudo-semantic feature, which carry the visual information and the semantic information respectively, are input in series into the generator network, fully fusing the visual and semantic information of the image to generate a second pseudo-visual feature of size 16 × 16 × 512; this feature may belong to either a visible class or an unseen class.
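The fusion step can be sketched as follows; the hidden width of the two fully connected layers is an assumption, while the global average pooling, the serial (concatenated) input and the reshape to 16 × 16 × 512 follow the description above.

import torch
import torch.nn as nn

class Generator(nn.Module):
    """MLP generator G with two fully connected hidden layers fusing the two modalities."""
    def __init__(self, feat_ch=512, sem_dim=512, hidden_dim=1024, out_hw=16):
        super().__init__()
        self.out_hw, self.feat_ch = out_hw, feat_ch
        self.pool = nn.AdaptiveAvgPool2d(1)                 # global average pooling -> 1 x 1 x C
        self.mlp = nn.Sequential(
            nn.Linear(feat_ch + sem_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, feat_ch * out_hw * out_hw))

    def forward(self, x_hat, a_hat):                        # x_hat: (B, 512, 16, 16), a_hat: (B, 512)
        pooled = self.pool(x_hat).flatten(1)                # (B, 512) pooled pseudo-visual feature
        fused = torch.cat([pooled, a_hat], dim=1)           # serial (concatenated) input to G
        out = self.mlp(fused)
        # reshape back to H x W x C, i.e. the second pseudo-visual feature
        return out.view(-1, self.feat_ch, self.out_hw, self.out_hw)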
S150: and performing back propagation optimization parameters according to each loss function of the image augmentation model until the overall loss function is converged, and storing the model parameters to obtain the trained image augmentation model.
Specifically, the respective loss functions of the image augmentation model in this example include a countervailing loss function, a visual modality variational self-encoder loss function, a semantic modality variational self-encoder loss function, and a generator loss function.
In one embodiment, the total loss function L_VAE comprises the total reconstruction loss and the KL (Kullback-Leibler) divergence losses of the visual-modality and semantic-modality variational self-encoders; the total reconstruction loss measures how similar the generated feature data are to the original feature data, and the KL divergence loss measures, in terms of the mean μ and the variance Σ, how far the learned distribution is from the prior, as shown in the following formula:
L_VAE = λ (L_recon^V + L_recon^S) + β (D_KL(q_1(Z_1 | x) || p(Z_1)) + D_KL(q_2(Z_2 | a) || p(Z_2)))
where L_VAE is the sum of the losses of the visual-modality and semantic-modality variational self-encoders of the image, L_recon^V denotes the reconstruction loss of the visual-modality variational self-encoder, L_recon^S denotes the reconstruction loss of the semantic-modality variational self-encoder, q_1(Z_1 | x) and q_2(Z_2 | a) denote the probability distributions obeyed by the hidden variables Z_1 and Z_2, p(Z_1) and p(Z_2) denote the prior distributions of Z_1 and Z_2, D_KL is the KL divergence loss, λ is the weight of the reconstruction loss term, used to reduce the difference between the generated features and the real features, and β is the weight of the KL divergence loss term, used to encourage the network to learn a broader distribution.
Given a hidden variable space of dimension n, the KL divergence loss is defined as:
L_KL = (1/2) Σ_{i=1}^{n} (μ_i^2 + σ_i^2 - log σ_i^2 - 1)
where L_KL denotes the KL divergence loss, μ_i denotes the mean of spatial dimension i, and σ_i denotes the variance of spatial dimension i.
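The combined VAE loss can be written as a short function; the per-sample summation and batch-averaging convention, the log-variance parameterization and the default weight values are assumptions of this sketch.

import torch

def vae_total_loss(x, x_tilde, a, a_hat, mu1, logvar1, mu2, logvar2, lam=1.0, beta=1.0):
    """L_VAE = lam * (visual + semantic reconstruction loss) + beta * (KL losses).
    lam and beta are the reconstruction and KL weights; the values here are placeholders."""
    recon_v = ((x - x_tilde) ** 2).flatten(1).sum(dim=1).mean()   # visual reconstruction loss
    recon_s = ((a - a_hat) ** 2).sum(dim=1).mean()                # semantic reconstruction loss
    # KL(N(mu, sigma^2) || N(0, I)) summed over the n hidden dimensions
    kl1 = 0.5 * (mu1.pow(2) + logvar1.exp() - logvar1 - 1).flatten(1).sum(dim=1).mean()
    kl2 = 0.5 * (mu2.pow(2) + logvar2.exp() - logvar2 - 1).flatten(1).sum(dim=1).mean()
    return lam * (recon_v + recon_s) + beta * (kl1 + kl2)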
In one embodiment, the adversarial loss functions are obtained as follows:
first, a visual feature discriminator D_v and a semantic feature discriminator D_s are constructed; the visual features of the visible-class images and the second pseudo-visual features from the steps above are input into the discriminator D_v to discriminate real from fake, giving the first discrimination information;
the semantic features and the pseudo-semantic features of the visible-class images are input into the discriminator D_s to discriminate real from fake, giving the second discrimination information, and the adversarial loss functions are determined from the first and second discrimination information respectively.
Specifically, as shown in figs. 8 and 9, in this example the visual feature discriminator D_v comprises a group of fully connected layers and a binary Sigmoid classifier, and the semantic feature discriminator D_s comprises one fully connected hidden layer and a binary Sigmoid classifier. The final outputs of the two discriminators are 0 or 1, where 0 indicates that the feature is fake and 1 indicates that the feature is real; the parameters of the discriminators are updated with the Adam gradient descent algorithm, and the adversarial loss functions of the visual feature discriminator D_v and the semantic feature discriminator D_s are as follows:
L_Dv = E_x[log D_v(x)] + E_{x̂,â}[log(1 - D_v(x̃))]
L_Ds = E_a[log D_s(a)] + E_â[log(1 - D_s(â))]
where L_Dv denotes the adversarial loss function of the visual feature discriminator, L_Ds denotes the adversarial loss function of the semantic feature discriminator, E_x denotes the expectation over the image visual feature x, E_{x̂,â} denotes the expectation over the first pseudo-visual feature x̂ and the pseudo-semantic feature â, E_a denotes the expectation over the image semantic feature a, E_â denotes the expectation over the pseudo-semantic feature â, D_v(x) denotes the output of the visual feature discriminator for the input visual feature x, x̃ = G(x̂, â) denotes the second pseudo-visual feature generated from the input first pseudo-visual feature x̂ and pseudo-semantic feature â, and D_v(x̃) denotes the output of the visual feature discriminator for the input second pseudo-visual feature x̃.
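A sketch of the two discriminators and their adversarial losses is given below; the hidden layer widths are assumptions, while the fully connected layers with a Sigmoid binary output and the log-likelihood form of the losses follow the description above.

import torch
import torch.nn as nn

class VisualDiscriminator(nn.Module):
    """Visual feature discriminator D_v: fully connected layers with a Sigmoid binary output."""
    def __init__(self, feat_ch=512, hw=16, hidden_dim=1024):
        super().__init__()
        self.net = nn.Sequential(nn.Flatten(),
                                 nn.Linear(feat_ch * hw * hw, hidden_dim), nn.ReLU(),
                                 nn.Linear(hidden_dim, 1), nn.Sigmoid())
    def forward(self, x):
        return self.net(x)          # probability that the visual feature is real (1) vs fake (0)

class SemanticDiscriminator(nn.Module):
    """Semantic feature discriminator D_s: one fully connected hidden layer + Sigmoid output."""
    def __init__(self, sem_dim=512, hidden_dim=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(sem_dim, hidden_dim), nn.ReLU(),
                                 nn.Linear(hidden_dim, 1), nn.Sigmoid())
    def forward(self, a):
        return self.net(a)

def adversarial_losses(d_v, d_s, x, x_tilde, a, a_hat, eps=1e-8):
    """L_Dv and L_Ds as log-likelihood adversarial losses (maximized by the discriminators)."""
    loss_dv = torch.log(d_v(x) + eps).mean() + torch.log(1 - d_v(x_tilde) + eps).mean()
    loss_ds = torch.log(d_s(a) + eps).mean() + torch.log(1 - d_s(a_hat) + eps).mean()
    return loss_dv, loss_ds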
The above steps are repeated: the visual and semantic features of the visible-class training images serve as the input of the image augmentation model, the model is trained by back-propagation based on each loss function, and the parameters of the visual-modality variational self-encoder, the semantic-modality variational self-encoder, the generator and the visual and semantic feature discriminators are continuously updated and optimized until the total loss function converges; the trained variational self-encoder adversarial generation network model is obtained, the model parameters are stored, and the training of the image augmentation model is completed. In addition, when training the model of this example, the number of images per batch (batch_size) is 32, the Adam optimizer is used with a learning rate (learning_rate) of 0.0001, the activation function is ReLU, the Dropout rejection rate is 0.5, and the maximum number of training rounds is 100000.
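One possible training step combining the modules sketched above is outlined below, using the stated Adam optimizer and 0.0001 learning rate; the alternating update order and the non-saturating form of the generator term are assumptions of this sketch.

import itertools
import torch

# Assumed to reuse the classes and functions sketched in the preceding examples.
vae_v, vae_s, gen = VisualVAE(), SemanticVAE(), Generator()
d_v, d_s = VisualDiscriminator(), SemanticDiscriminator()

opt_g = torch.optim.Adam(itertools.chain(vae_v.parameters(), vae_s.parameters(),
                                          gen.parameters()), lr=1e-4)
opt_d = torch.optim.Adam(itertools.chain(d_v.parameters(), d_s.parameters()), lr=1e-4)

def train_step(x, a):                          # x: (32, 512, 16, 16) visual, a: (32, 512) semantic
    x_hat, mu1, lv1 = vae_v(x)                 # first pseudo-visual feature
    a_hat, mu2, lv2, _ = vae_s(a)              # pseudo-semantic feature
    x_tilde = gen(x_hat, a_hat)                # second pseudo-visual feature

    # Discriminator update: maximize L_Dv + L_Ds (i.e. minimize the negative)
    loss_dv, loss_ds = adversarial_losses(d_v, d_s, x, x_tilde.detach(), a, a_hat.detach())
    opt_d.zero_grad(); (-(loss_dv + loss_ds)).backward(); opt_d.step()

    # Generator / VAE update: reconstruction + KL losses plus fooling both discriminators
    loss_vae = vae_total_loss(x, x_tilde, a, a_hat, mu1, lv1, mu2, lv2)
    loss_g = -torch.log(d_v(x_tilde) + 1e-8).mean() - torch.log(d_s(a_hat) + 1e-8).mean()
    opt_g.zero_grad(); (loss_vae + loss_g).backward(); opt_g.step()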
The method combines the advantages of the variational self-encoder and the generative adversarial network, fully fusing the visual and semantic information of the image to generate more effective samples; it effectively matches the visual and semantic features of the image, improves the quality of the generated data, and effectively solves the problem of missing unseen-class images in zero-sample learning. The generated pseudo samples are used to train a classifier, converting zero-sample learning into classical supervised learning and thereby improving the accuracy of zero-sample image classification.
The present example also provides an image classification method based on a variational self-encoder and a countermeasure generation network, comprising the steps of:
s210: and inputting the visual features and semantic features of the unseen training images into the image augmentation model to generate the pseudo visual features of the unseen training images.
Specifically, as shown in fig. 10, the visual features x_ut and the semantic features a_ut of the unseen-class training images are input into the trained image augmentation model to generate the unseen-class pseudo-visual features x̃_ut.
S220: and training an image classifier by using the generated pseudo visual features, and inputting the visual features of the unseen test image to be recognized into the classifier for classification to obtain a classification result.
Specifically, a Softmax classifier is trained with the unseen-class pseudo-visual features and class labels, and the visual features x_t of the unseen-class test images are input into the trained classifier to obtain the classification accuracy.
In this example the unseen-class pseudo-visual features and class labels are used to train a Softmax classifier for zero-sample image classification. The training set of the Softmax classifier is
{(x̃_i, t_i)},  i = 1, …, m
where x̃_i is a given classifier input, m is the number of training samples, t_i is the number of the class to which the sample belongs, t_i ∈ {1, 2, …, C}, and C is the total number of image classes. The Softmax classifier is defined as
h_θ(x̃_i) = [ p(t_i = k | x̃_i; θ) ]_{k=1,…,C} = (1 / Σ_{j=1}^{C} exp(θ_j^T x̃_i)) · [ exp(θ_1^T x̃_i), …, exp(θ_C^T x̃_i) ]^T
where p(t_i = k | x̃_i; θ) denotes the probability that, for the given input x̃_i, the input data belong to class k, k = 1, …, C; h_θ(x̃_i), the output of the classifier, is a column vector of C rows and 1 column, each row representing the probability that the current input is recognized as class k, with the sum of all row elements equal to 1; θ_1, …, θ_C are the parameters to be estimated by the Softmax classifier and form the parameter matrix θ.
When the Softmax classifier is used to classify unseen-class pictures, for any feature vector x̃ the class k with the maximum probability is selected as the classification result of the current picture; this result is compared with the labeled ground truth, and the classification is correct if they are consistent and wrong otherwise;
and the model parameters are optimized with a gradient descent method, the loss function of the Softmax classifier being:
J(θ) = -(1/m) Σ_{i=1}^{m} Σ_{k=1}^{C} 1{t_i = k} log p(t_i = k | x̃_i; θ)
where 1{·} is the indicator function. J(θ) is solved by a conjugate-gradient algorithm as an unconstrained optimization problem, and the classification accuracy serves as the evaluation index of the test. The visual features x_t of the unseen-class test images are input into the trained Softmax classifier, and the class label corresponding to the maximum output probability of the classifier is taken as the prediction result of the classifier.
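A minimal sketch of this final stage is given below, with plain gradient descent standing in for the conjugate-gradient optimization described above; the feature shapes, epoch count and learning rate are placeholders.

import torch
import torch.nn as nn

def train_softmax_classifier(pseudo_feats, labels, num_classes, epochs=100, lr=1e-3):
    # pseudo_feats: (m, 512, 16, 16) generated unseen-class features; labels: (m,) indices in [0, C)
    clf = nn.Sequential(nn.Flatten(), nn.Linear(512 * 16 * 16, num_classes))
    opt = torch.optim.SGD(clf.parameters(), lr=lr)      # gradient descent on the Softmax loss
    ce = nn.CrossEntropyLoss()                          # log-softmax + negative log-likelihood
    for _ in range(epochs):
        opt.zero_grad()
        ce(clf(pseudo_feats), labels).backward()
        opt.step()
    return clf

def classify(clf, test_feats):
    # pick the class with the maximum predicted probability for each unseen-class test feature
    return clf(test_feats).argmax(dim=1)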

Claims (10)

1. An image augmentation model training method based on a variational self-encoder and a confrontation generation network is characterized by comprising the following steps:
s110: acquiring a visible training image, and extracting visual features and semantic features of the visible training image;
s120: an image augmentation model is configured in advance, and the image augmentation model comprises a visual modal variation self-encoder, a semantic modal variation self-encoder and a generator configured according to a generated countermeasure network;
s130: respectively inputting the visual features and the semantic features into a visual modal variation self-encoder and a semantic modal variation self-encoder to generate first pseudo visual features and pseudo semantic features;
s140: inputting the first pseudo-visual feature and the pseudo-semantic feature into a pre-configured generator, and fusing to generate a second pseudo-visual feature;
s150: and performing back propagation optimization parameters according to the loss function of the image augmentation model until the overall loss function is converged, and storing the model parameters to obtain the trained image augmentation model.
2. The method of claim 1, wherein the loss function comprises a countering loss function, and the countering loss function obtaining step comprises:
configuring a visual feature discriminator and a semantic feature discriminator;
inputting the visual feature and the second pseudo visual feature into a visual feature discriminator to obtain first discrimination information;
inputting the semantic features and the pseudo-semantic features into a semantic feature discriminator to obtain second discrimination information;
respectively determining a countermeasure loss function according to the first discrimination information and the second discrimination information, and updating parameters of a visual feature discriminator and a semantic feature discriminator by adopting an Adam gradient descent algorithm;
the loss function further comprises the total loss function L_VAE of the variational self-encoders: the reconstruction loss L_recon^V of the visual-modality variational self-encoder together with its KL divergence loss, and the reconstruction loss L_recon^S of the semantic-modality variational self-encoder together with its KL divergence loss.
3. The method for training an image augmentation model based on a variational self-encoder and a countermeasure generation network according to claim 1, wherein in the step S110:
extracting visual features of the visible training images by using a visual feature extraction model, wherein the visual feature extraction model uses a convolutional neural network and a Transformer encoder as a feature extraction network;
inputting the visible training images into a convolutional neural network to obtain a characteristic diagram;
dividing the feature map into multi-dimensional feature vector blocks, and mapping each feature vector block into a one-dimensional vector through linear mapping to obtain a plurality of feature vectors;
and carrying out position coding on the feature vector, embedding the feature vector into the Transformer encoder, repeatedly stacking encoder blocks in the encoder for L times, outputting a second-dimension feature vector, and recombining the second-dimension feature vector into visual features with a preset size.
4. The method of claim 3, wherein in the step S110:
and extracting semantic features of the visible training images by using a semantic feature extraction model, taking a continuous bag-of-words model obtained through unsupervised training in a text corpus as the semantic feature extraction model, extracting semantic feature vectors of the visible images by using the semantic feature extraction model, and converting the semantic feature vectors into semantic features with preset sizes through a dimension transformation network.
5. The method for training an image augmentation model based on a variational self-encoder and an antagonism generation network according to claim 1, wherein the visual-modality variational self-encoder in step S120 comprises an encoder network E1 and a decoder network D1, wherein the encoder network E1 is a full convolution network with n convolution layers whose number of filter channels increases layer by layer so as to learn deep features; the output of the last convolution layer of the full convolution network is two n-dimensional vectors, a mean vector and a variance vector;
the encoder network E1 maps the visual features to an interval vector represented by the probability distribution N(μ, Σ), and samples this interval vector to obtain the hidden variable Z_1, where μ is the mean vector and Σ is the variance vector; the probability distribution of the hidden variable Z_1 is then:
q_1(Z_1 | x) = N(Z_1 | μ_1, Σ_1),  p(Z_1) = N(Z_1 | 0, I)
where q_1(Z_1 | x) denotes the probability distribution obeyed by the hidden variable Z_1, p(Z_1) denotes the prior distribution of Z_1, here the unit Gaussian distribution, μ_1 and Σ_1 denote the mean and variance of Z_1, and N denotes a normal distribution.
6. The method for training an image augmentation model based on a variational self-encoder and an adversarial generation network according to claim 1, wherein the semantic-modality variational self-encoder comprises an encoder network E2 and a decoder network D2, both of which use two fully connected layers for encoding and decoding; the semantic features are input into the encoder network E2 to obtain a hidden variable Z2, and the decoder network D2 restores the probability distribution of the hidden variable Z2 to an approximation of the original data distribution, i.e. generates a pseudo-semantic feature ã similar to the semantic feature, updates the parameters e2 and d2 of the encoder network E2 and the decoder network D2, and calculates the reconstruction loss:
q2(Z2|a) = N(Z2|μ2, Σ2),  p(Z2) = N(Z2|0, I)
ã = D2(Z2),  L_rec^sem = ||a − ã||₂²
wherein q2(Z2|a) denotes the probability distribution obeyed by the hidden variable Z2, p(Z2) denotes the prior distribution of Z2, here a unit Gaussian distribution, μ2 and Σ2 denote the mean and variance of the hidden variable Z2, N denotes a normal distribution, L_rec^sem denotes the reconstruction loss of the semantic-modality variational self-encoder, and ||·||₂² denotes the squared L2 norm.
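A compact sketch of the semantic-modality variational self-encoder as described, with two fully connected layers in both E2 and D2 and the squared-L2 reconstruction loss; the dimensions (256-d semantic feature, 128-d hidden layer, 64-d hidden variable) are assumed for illustration:

import torch
import torch.nn as nn

class SemanticVAE(nn.Module):
    """Encoder E2 and decoder D2, each built from two fully connected layers."""
    def __init__(self, sem_dim=256, hidden=128, latent_dim=64):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(sem_dim, hidden), nn.ReLU(), nn.Linear(hidden, 2 * latent_dim))
        self.dec = nn.Sequential(nn.Linear(latent_dim, hidden), nn.ReLU(), nn.Linear(hidden, sem_dim))

    def forward(self, a):
        mu, logvar = self.enc(a).chunk(2, dim=1)
        z2 = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)   # sample the hidden variable Z2
        a_tilde = self.dec(z2)                                     # pseudo-semantic feature
        rec_loss = ((a - a_tilde) ** 2).sum(dim=1).mean()          # squared-L2 reconstruction loss
        return a_tilde, rec_loss, mu, logvar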
7. The method of claim 5, wherein in step S130 the total loss function L_VAE of the visual-modality variational self-encoder and the semantic-modality variational self-encoder is calculated; the total loss function L_VAE comprises the total reconstruction loss and the KL divergence loss of the visual-modality and semantic-modality variational self-encoders, the total reconstruction loss measuring the degree of similarity between the second pseudo-visual feature and the visual feature data, as in the following formula:
L_VAE = α (L_rec^vis + L_rec^sem) + β (KL(q1(Z1|x) ‖ p(Z1)) + KL(q2(Z2|a) ‖ p(Z2)))
wherein L_VAE is the sum of the losses of the visual-modality and semantic-modality variational self-encoders, L_rec^vis denotes the reconstruction loss of the visual-modality variational self-encoder, L_rec^sem denotes the reconstruction loss of the semantic-modality variational self-encoder, q1(Z1|x) and q2(Z2|a) respectively denote the probability distributions obeyed by the hidden variables Z1 and Z2, p(Z1) and p(Z2) respectively denote the prior distributions of Z1 and Z2, L_KL is the KL divergence loss, α is the weight of the reconstruction loss term and is used to reduce the difference between the generated features and the real features, and β is the weight of the KL divergence loss term and is used to encourage the network to learn a more widely spread distribution; given the hidden variable space dimension n, the KL divergence loss is defined as:
L_KL = −(1/2) Σ_{i=1..n} (1 + log σi² − μi² − σi²)
wherein L_KL denotes the KL divergence loss, μi denotes the mean in spatial dimension i, and σi denotes the variance in spatial dimension i.
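The weighted combination of reconstruction and KL terms can be sketched as below; the closed-form KL term follows the standard diagonal-Gaussian expression implied by the claim, and the default weights alpha and beta are placeholders:

import torch

def kl_divergence(mu, logvar):
    # Closed-form KL(N(mu, sigma^2) || N(0, I)) for a diagonal Gaussian of dimension n.
    return -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=1).mean()

def total_vae_loss(rec_vis, rec_sem, mu1, logvar1, mu2, logvar2, alpha=1.0, beta=0.5):
    # alpha weights the reconstruction terms, beta weights the KL terms (values are illustrative).
    l_kl = kl_divergence(mu1, logvar1) + kl_divergence(mu2, logvar2)
    return alpha * (rec_vis + rec_sem) + beta * l_kl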
8. The method of claim 1, wherein the generator network in step S140 is a multi-layer perceptron network with two fully connected hidden layers; global average pooling is performed on the first pseudo-visual feature to obtain a pseudo-visual feature of a preset size, the pooled pseudo-visual feature and the pseudo-semantic feature are input into the generator network to generate a new feature, the generated feature is transformed with a reshape function to obtain the second pseudo-visual feature, and the parameters of the generator are updated with the Adam gradient descent algorithm, wherein the loss function of the generator is:
[generator loss L_G; formula image in the original]
wherein x̃ = G(x̂, ã) denotes the second pseudo-visual feature generated by the generator G from the first pseudo-visual feature x̂ and the pseudo-semantic feature ã.
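A sketch of the generator step, assuming a 256-d pooled pseudo-visual feature, a 256-d pseudo-semantic feature, two 512-unit hidden layers and a 196x256 output shape for the second pseudo-visual feature (all illustrative sizes):

import torch
import torch.nn as nn

class Generator(nn.Module):
    """Multi-layer perceptron with two fully connected hidden layers."""
    def __init__(self, pseudo_vis_dim=256, pseudo_sem_dim=256, hidden=512, out_shape=(196, 256)):
        super().__init__()
        self.out_shape = out_shape
        out_dim = out_shape[0] * out_shape[1]
        self.mlp = nn.Sequential(
            nn.Linear(pseudo_vis_dim + pseudo_sem_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, out_dim),
        )

    def forward(self, first_pseudo_visual, pseudo_semantic):
        # Global average pooling reduces the first pseudo-visual feature map to a fixed-size vector.
        pooled = first_pseudo_visual.mean(dim=(2, 3)) if first_pseudo_visual.dim() == 4 else first_pseudo_visual
        new_feat = self.mlp(torch.cat([pooled, pseudo_semantic], dim=1))
        # Reshape the generated vector into the second pseudo-visual feature of preset size.
        return new_feat.view(-1, *self.out_shape)

g = Generator()
opt_g = torch.optim.Adam(g.parameters(), lr=1e-4)   # generator parameters updated with Adam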
9. The method according to claim 3, wherein the adversarial loss functions of the visual feature discriminator D and the semantic feature discriminator D̃ are as follows:
[adversarial losses L_D and L_D̃; formula images in the original]
wherein L_D denotes the adversarial loss function of the visual feature discriminator D, L_D̃ denotes the adversarial loss function of the semantic feature discriminator D̃, E_x denotes the expectation over the visual feature x, E_{x̂,ã} denotes the joint expectation over the first pseudo-visual feature x̂ and the pseudo-semantic feature ã, E_a denotes the expectation over the semantic feature a, E_ã denotes the expectation over the pseudo-semantic feature ã, D(x) denotes the output of the visual feature discriminator for the input visual feature x, x̃ = G(x̂, ã) denotes the second pseudo-visual feature generated from the first pseudo-visual feature x̂ and the pseudo-semantic feature ã, and D(x̃) denotes the output of the visual feature discriminator for the input second pseudo-visual feature x̃.
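Because the adversarial loss formulas are published only as images, the sketch below uses a standard real-versus-generated binary cross-entropy form for both discriminators as an assumption rather than the filed definition; D_vis and D_sem are assumed to return raw logits:

import torch
import torch.nn as nn

bce = nn.BCEWithLogitsLoss()

def discriminator_losses(D_vis, D_sem, x, a, x_tilde, a_tilde):
    # Visual discriminator: real visual feature x versus generated second pseudo-visual feature x_tilde.
    real_v, fake_v = D_vis(x), D_vis(x_tilde.detach())
    loss_d_vis = bce(real_v, torch.ones_like(real_v)) + bce(fake_v, torch.zeros_like(fake_v))
    # Semantic discriminator: real semantic feature a versus pseudo-semantic feature a_tilde.
    real_s, fake_s = D_sem(a), D_sem(a_tilde.detach())
    loss_d_sem = bce(real_s, torch.ones_like(real_s)) + bce(fake_s, torch.zeros_like(fake_s))
    return loss_d_vis, loss_d_sem

def generator_adversarial_loss(D_vis, x_tilde):
    # The generator is pushed to make the second pseudo-visual feature look real to D_vis.
    fake = D_vis(x_tilde)
    return bce(fake, torch.ones_like(fake))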
10. An image classification method based on the image augmentation model of the variational self-encoder and adversarial generation network, comprising:
inputting the visual features and semantic features of the unseen training images into the image augmentation model trained by the method according to any one of claims 1 to 8, so as to generate pseudo-visual features of the unseen training images;
and training an image classifier with the generated pseudo-visual features, and inputting the visual features of the unseen test images to be recognized into the classifier for classification so as to obtain a classification result.
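A sketch of the classification stage: a linear softmax classifier is trained on generated pseudo-visual features labeled by the unseen class whose semantic feature produced them, then applied to real visual features at test time; the feature dimension, class count and random stand-in data are illustrative:

import torch
import torch.nn as nn

num_unseen_classes, feat_dim = 10, 256
classifier = nn.Linear(feat_dim, num_unseen_classes)
opt = torch.optim.Adam(classifier.parameters(), lr=1e-3)
ce = nn.CrossEntropyLoss()

pseudo_feats = torch.randn(640, feat_dim)                 # stand-in for generator output
labels = torch.randint(0, num_unseen_classes, (640,))     # class of the semantic feature used for each sample
for _ in range(20):
    opt.zero_grad()
    loss = ce(classifier(pseudo_feats), labels)
    loss.backward()
    opt.step()

# At test time, real visual features of unseen images are classified directly.
test_feat = torch.randn(1, feat_dim)
pred = classifier(test_feat).argmax(dim=1)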
CN202210111331.7A 2022-01-29 2022-01-29 Image augmentation model training method and image classification method based on variational self-encoder and countermeasure generation network Pending CN114386534A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210111331.7A CN114386534A (en) 2022-01-29 2022-01-29 Image augmentation model training method and image classification method based on variational self-encoder and countermeasure generation network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210111331.7A CN114386534A (en) 2022-01-29 2022-01-29 Image augmentation model training method and image classification method based on variational self-encoder and countermeasure generation network

Publications (1)

Publication Number Publication Date
CN114386534A true CN114386534A (en) 2022-04-22

Family

ID=81203509

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210111331.7A Pending CN114386534A (en) 2022-01-29 2022-01-29 Image augmentation model training method and image classification method based on variational self-encoder and countermeasure generation network

Country Status (1)

Country Link
CN (1) CN114386534A (en)

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115601553A (en) * 2022-08-15 2023-01-13 杭州联汇科技股份有限公司(Cn) Visual model pre-training method based on multi-level picture description data
CN115601553B (en) * 2022-08-15 2023-08-18 杭州联汇科技股份有限公司 Visual model pre-training method based on multi-level picture description data
CN115131347A (en) * 2022-08-29 2022-09-30 江苏茂融智能科技有限公司 Intelligent control method for processing zinc alloy parts
CN115588436A (en) * 2022-09-29 2023-01-10 沈阳新松机器人自动化股份有限公司 Voice enhancement method for generating countermeasure network based on variational self-encoder
CN115758159A (en) * 2022-11-29 2023-03-07 东北林业大学 Zero sample text position detection method based on mixed contrast learning and generation type data enhancement
CN115758159B (en) * 2022-11-29 2023-07-21 东北林业大学 Zero sample text position detection method based on mixed contrast learning and generation type data enhancement
CN116051909A (en) * 2023-03-06 2023-05-02 中国科学技术大学 Direct push zero-order learning unseen picture classification method, device and medium
CN116109877A (en) * 2023-04-07 2023-05-12 中国科学技术大学 Combined zero-sample image classification method, system, equipment and storage medium
CN116109877B (en) * 2023-04-07 2023-06-20 中国科学技术大学 Combined zero-sample image classification method, system, equipment and storage medium
CN117972440A (en) * 2024-04-01 2024-05-03 长春理工大学 Unbalanced heart rate data set processing method and system based on generation countermeasure network

Similar Documents

Publication Publication Date Title
CN114386534A (en) Image augmentation model training method and image classification method based on variational self-encoder and countermeasure generation network
Wang et al. Deep visual domain adaptation: A survey
CN113936339B (en) Fighting identification method and device based on double-channel cross attention mechanism
CN110163258B (en) Zero sample learning method and system based on semantic attribute attention redistribution mechanism
CN104866810A (en) Face recognition method of deep convolutional neural network
CN112949647B (en) Three-dimensional scene description method and device, electronic equipment and storage medium
CN109800768B (en) Hash feature representation learning method of semi-supervised GAN
CN110188827A (en) A kind of scene recognition method based on convolutional neural networks and recurrence autocoder model
Jha et al. Extracting low‐dimensional psychological representations from convolutional neural networks
CN112926675B (en) Depth incomplete multi-view multi-label classification method under double visual angle and label missing
CN113159067A (en) Fine-grained image identification method and device based on multi-grained local feature soft association aggregation
CN112115967A (en) Image increment learning method based on data protection
CN117746260B (en) Remote sensing data intelligent analysis method and system
Feng et al. Deep image set hashing
CN117494051A (en) Classification processing method, model training method and related device
Wu et al. Semisupervised feature learning by deep entropy-sparsity subspace clustering
CN113240033A (en) Visual relation detection method and device based on scene graph high-order semantic structure
CN117315070A (en) Image generation method, apparatus, electronic device, storage medium, and program product
JP6886120B2 (en) Signal search device, method, and program
US20230186600A1 (en) Method of clustering using encoder-decoder model based on attention mechanism and storage medium for image recognition
CN115392474B (en) Local perception graph representation learning method based on iterative optimization
Liu et al. Multi-digit recognition with convolutional neural network and long short-term memory
CN115640418A (en) Cross-domain multi-view target website retrieval method and device based on residual semantic consistency
CN115527064A (en) Toxic mushroom fine-grained image classification method based on multi-stage ViT and contrast learning
Sassi et al. Neural approach for context scene image classification based on geometric, texture and color information

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination