CN114386534A - Image augmentation model training method and image classification method based on variational self-encoder and countermeasure generation network - Google Patents

Image augmentation model training method and image classification method based on variational self-encoder and countermeasure generation network

Info

Publication number
CN114386534A
Authority
CN
China
Prior art keywords
visual
encoder
semantic
feature
pseudo
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210111331.7A
Other languages
Chinese (zh)
Inventor
饶元
苏仕芳
江朝晖
金�秀
张武
梁惠
李绍稳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Anhui Agricultural University AHAU
Original Assignee
Anhui Agricultural University AHAU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Anhui Agricultural University AHAU filed Critical Anhui Agricultural University AHAU
Priority to CN202210111331.7A priority Critical patent/CN114386534A/en
Publication of CN114386534A publication Critical patent/CN114386534A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G06F18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/25 - Fusion techniques
    • G06F18/253 - Fusion techniques of extracted features
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an image augmentation model training method and an image classification method based on a variational self-encoder and a generative adversarial network. For zero-sample image classification, a model trained on the visible classes generates pseudo visual features of unseen-class training images, and a classifier trained on these features together with their class labels classifies the unseen-class images. The method effectively fuses the visual and semantic information of images, generates high-quality visible-class and unseen-class samples that lie closer to the real data distribution, and improves the accuracy of zero-sample image classification.

Description

Image augmentation model training method and image classification method based on variational self-encoder and countermeasure generation network
Technical Field
The invention belongs to the technical field of image recognition, and particularly relates to an image augmentation model training method and an image classification method based on a variational self-encoder and a generative adversarial network.
Background
Traditional image classification not only requires a large amount of labeled image data, but also performs poorly when the classes in the training set and the test set are inconsistent. For example, to recognize a picture that has never been seen or that belongs to no class in the training set, new samples must be collected and labeled, and enough training samples gathered to retrain the model before it can recognize such pictures. This process is costly and slow, and in practice the acquisition and labeling of large numbers of labeled images is complex and uncertain. Zero-shot learning (ZSL) is therefore proposed to address the problem of missing unseen-class samples.
Zero-sample learning is a special scenario of transfer learning, aimed at recognizing samples of classes that do not appear in the training samples. In general, zero-sample learning lets a model imitate human reasoning and identify things it has never seen. Labeled samples in the feature space belong to the visible classes, and unlabeled samples in the feature space belong to the unseen classes. Traditional zero-sample learning seeks the mapping between the visual features and the semantic features of images from the given visible-class pictures, and generalizes this mapping to unseen-class pictures so that they can be recognized, thereby accomplishing the zero-sample image recognition task. For example, after a zero-sample recognition model is trained with image data of cauliflowers, inputting the semantic relation "broccoli is a green cauliflower" enables the model to recognize and classify pictures of broccoli.
Zero-sample learning first establishes two basic spaces: the feature space and the semantic space of the categories. The elements of the feature space are the visual features of the pictures; the semantic space of the categories describes the attributes of the picture labels and is usually expressed as a semantic attribute space or a semantic word-vector space. What zero-sample learning must do is learn the mapping between the feature space and the semantic space. The visual features in the feature space are generally extracted with a deep convolutional neural network and have a high dimensionality, whereas the semantic space has a low dimensionality, so the mapping splits into two directions: from the feature space (high-dimensional) to the semantic space (low-dimensional), and from the semantic space (low-dimensional) to the feature space (high-dimensional). In the first direction, for any picture the learned mapping projects its features from the feature space into the semantic space, where a nearest neighbor is searched to recognize and classify the picture. In the second direction, given only the semantic description of an unseen class, a word-vector model produces its low-dimensional semantic features, the learned mapping generates the corresponding image features, and these features are fed into a classifier to obtain the category. However, because the data distributions of the visible and unseen classes differ, directly mapping between the visual space and the semantic space biases the recognition of unseen classes toward the visible classes. To alleviate this, generative models for zero-sample learning have been proposed: the visual and semantic features are fed into a generative model to synthesize training samples of the unseen classes, a classifier is trained on them directly, and zero-sample classification is thus converted into classical supervised learning.
Generative models for zero-sample learning are mainly based on the generative adversarial network (GAN) and the variational self-encoder (VAE). Samples generated by a GAN are clearer and more vivid, but GAN training is unstable, so some generated samples deviate severely from the real distribution and the model is prone to collapse. In contrast, the training process of a variational self-encoder is relatively stable and directly compares the reconstructed picture with the original picture; however, because it uses the mean squared error between the two as the loss function, the generated pictures are of low quality.
Disclosure of Invention
1. Problems to be solved
Aiming at the zero-sample image classification problem, the invention combines the advantages of the variational self-encoder and the generative adversarial network to fully fuse the visual and semantic information of images and generate more effective samples, and provides an image augmentation model training method based on a variational self-encoder and an adversarial generation network for effectively synthesizing the visual features of unseen-class images. The method combines the variational self-encoder, whose training process is stable, with the generative adversarial network, whose generated samples are clear; it inputs the visual features and the semantic features of the image samples at the same time, effectively matches the visual and semantic information of the images, and improves the quality of the generated data. This effectively solves the problem of missing unseen-class images in zero-sample learning; the generated pseudo samples are used to train a classifier, converting zero-sample learning into classical supervised learning and thereby improving the accuracy of zero-sample image classification.
2. Technical scheme
In order to solve the problems, the invention adopts the following technical scheme:
the invention provides a method for training an image augmentation model based on a variational self-encoder and a confrontation generation network, which is characterized by comprising the following steps:
s110: acquiring a visible training image, and extracting visual features and semantic features of the visible training image;
s120: an image augmentation model is configured in advance, and the image augmentation model comprises a visual modal variation self-encoder, a semantic modal variation self-encoder and a generator configured according to a generated countermeasure network;
s130: respectively inputting the visual characteristic and the semantic characteristic into a visual modal variation self-encoder and a semantic modal variation self-encoder to generate a first pseudo visual characteristic and a pseudo semantic characteristic;
s140: inputting the first pseudo-visual feature and the pseudo-semantic feature into a pre-configured generator, and fusing to generate a second pseudo-visual feature;
s150: and performing back propagation optimization parameters according to the loss function of the image augmentation model until the overall loss function is converged, and storing the model parameters to obtain the trained image augmentation model.
As one example, the loss function includes a countermeasure loss function, and the countermeasure loss function obtaining step includes:
configuring a visual feature discriminator and a semantic feature discriminator;
inputting the visual feature and the second pseudo visual feature into a visual feature discriminator to obtain first discrimination information;
inputting the semantic features and the pseudo-semantic features into a semantic feature discriminator to obtain second discrimination information;
respectively determining a countermeasure loss function according to the first discrimination information and the second discrimination information, and updating parameters of a visual feature discriminator and a semantic feature discriminator by adopting an Adam gradient descent algorithm;
the loss function further comprises the total loss function L_VAE of the variational self-encoders: the reconstruction loss L_recon^V of the visual-modality variational self-encoder together with its KL divergence loss, and the reconstruction loss L_recon^S of the semantic-modality variational self-encoder together with its KL divergence loss.
As an example, in the step S110:
extracting visual features of the visible training images by using a visual feature extraction model, wherein the visual feature extraction model uses a convolutional neural network and a Transformer encoder as a feature extraction network;
inputting the visible training images into a convolutional neural network to obtain a characteristic diagram;
dividing the feature map into multi-dimensional feature vector blocks, and mapping each feature vector block into a one-dimensional vector through linear mapping to obtain a plurality of feature vectors;
and carrying out position coding on the feature vector, embedding the feature vector into the Transformer encoder, repeatedly stacking encoder blocks in the encoder for L times, outputting a second-dimension feature vector, and recombining the second-dimension feature vector into visual features with a preset size.
As an example, in the step S110:
and extracting semantic features of the visible training images by using a semantic feature extraction model, taking a continuous bag-of-words model obtained through unsupervised training in a text corpus as the semantic feature extraction model, extracting semantic feature vectors of the visible images by using the semantic feature extraction model, and converting the semantic feature vectors into semantic features with preset sizes through a dimension transformation network.
As an example, the visual-modality variational self-encoder in step S120 comprises an encoder network E1 and a decoder network D1, wherein the encoder network E1 is a full convolution network with n convolution layers whose number of filter channels increases layer by layer so as to learn deep features; the output of the last convolution layer of the full convolution network is two n-dimensional vectors, a mean vector and a variance vector;
the encoder network E1 maps the visual features to an interval vector represented by the probability distribution N(μ, Σ), and samples this interval vector to obtain the hidden variable Z_1, where μ is the mean vector and Σ is the variance vector; the probability distribution of the hidden variable Z_1 is then:
q_1(Z_1 | x) = N(Z_1 | μ_1, Σ_1),  p(Z_1) = N(Z_1 | 0, I)
where q_1(Z_1 | x) denotes the probability distribution obeyed by the hidden variable Z_1, p(Z_1) denotes the prior distribution of Z_1, here the unit Gaussian distribution, μ_1 and Σ_1 denote the mean and variance of Z_1, and N denotes a normal distribution.
It should be noted that the encoder part of a variational self-encoder maps the feature data into another, hidden variable space described by a statistical distribution whose parameters are a mean and a variance. Since a neural network can in theory fit any function, the encoder network maps the input features to the mean and variance vectors through the full convolution network; the variational self-encoder then randomly samples an element from the hidden space defined by that mean and variance and decodes it back toward the original input.
As an example, the semantic-modality variational self-encoder comprises an encoder network E2 and a decoder network D2, both of which use two fully connected layers for encoding and decoding. The semantic features are input into the encoder network E2 to obtain the hidden variable Z_2, and the decoder network D2 restores, from the probability distribution of Z_2, an approximation of the original data distribution, i.e. generates a pseudo-semantic feature â similar to the semantic feature, while the parameters e_2 and d_2 of the encoder network E2 and the decoder network D2 are updated and the reconstruction loss is calculated:
q_2(Z_2 | a) = N(Z_2 | μ_2, Σ_2),  p(Z_2) = N(Z_2 | 0, I)
â = D_2(Z_2),  Z_2 ~ q_2(Z_2 | a)
L_recon^S = || a - â ||_2^2
where q_2(Z_2 | a) denotes the probability distribution obeyed by the hidden variable Z_2, p(Z_2) denotes the prior distribution of Z_2, here the unit Gaussian distribution, μ_2 and Σ_2 denote the mean and variance of Z_2, N denotes a normal distribution, L_recon^S denotes the reconstruction loss of the semantic-modality variational self-encoder, and || · ||_2^2 denotes the squared L2 norm.
As an example, in step S130 the total loss function L_VAE of the visual-modality and semantic-modality variational self-encoders is calculated. The total loss function L_VAE comprises the total reconstruction loss and the KL divergence losses of the visual-modality and semantic-modality variational self-encoders, the total reconstruction loss measuring how similar the second pseudo-visual feature is to the visual feature data, as shown in the following formula:
L_VAE = λ (L_recon^V + L_recon^S) + β (D_KL(q_1(Z_1 | x) || p(Z_1)) + D_KL(q_2(Z_2 | a) || p(Z_2)))
where L_VAE is the sum of the losses of the visual-modality and semantic-modality variational self-encoders, L_recon^V denotes the reconstruction loss of the visual-modality variational self-encoder, L_recon^S denotes the reconstruction loss of the semantic-modality variational self-encoder, q_1(Z_1 | x) and q_2(Z_2 | a) denote the probability distributions obeyed by the hidden variables Z_1 and Z_2, p(Z_1) and p(Z_2) denote the prior distributions of Z_1 and Z_2, D_KL is the KL divergence loss, λ is the weight of the reconstruction loss term, used to reduce the difference between the generated features and the real features, and β is the weight of the KL divergence loss term, used to encourage the network to learn a broader distribution; given a hidden variable space of dimension n, the KL divergence loss is defined as:
L_KL = (1/2) Σ_{i=1}^{n} (μ_i^2 + σ_i^2 - log σ_i^2 - 1)
where L_KL denotes the KL divergence loss, μ_i denotes the mean of spatial dimension i, and σ_i denotes the variance of spatial dimension i.
As an example, the generator network in step S140 is a multi-layer perceptron network with two fully connected hidden layers. Global average pooling is applied to the first pseudo-visual feature to obtain a pooled pseudo-visual feature of a preset size; the pooled pseudo-visual feature and the pseudo-semantic feature are input into the generator network to generate a new feature, the generated feature is transformed with a reshape function to obtain the second pseudo-visual feature, and the parameters of the generator are updated with the Adam gradient descent algorithm, the loss function of the generator being:
L_G = E_{x̂,â}[log(1 - D_v(x̃))],  x̃ = G(x̂, â)
where D_v denotes the visual feature discriminator configured above, and x̃ denotes the second pseudo-visual feature generated after the first pseudo-visual feature x̂ and the pseudo-semantic feature â pass through the generator G.
As an example, the adversarial loss functions of the visual feature discriminator D_v and the semantic feature discriminator D_s are as follows:
L_Dv = E_x[log D_v(x)] + E_{x̂,â}[log(1 - D_v(x̃))]
L_Ds = E_a[log D_s(a)] + E_â[log(1 - D_s(â))]
where L_Dv denotes the adversarial loss function of the visual feature discriminator, L_Ds denotes the adversarial loss function of the semantic feature discriminator, E_x denotes the expectation over the visual feature x, E_{x̂,â} denotes the expectation over the first pseudo-visual feature x̂ and the pseudo-semantic feature â, E_a denotes the expectation over the semantic feature a, E_â denotes the expectation over the pseudo-semantic feature â, D_v(x) denotes the output of the visual feature discriminator for the input visual feature x, x̃ = G(x̂, â) denotes the second pseudo-visual feature generated from the input first pseudo-visual feature x̂ and pseudo-semantic feature â, and D_v(x̃) denotes the output of the visual feature discriminator for the input second pseudo-visual feature x̃.
A second aspect of the invention provides an image classification method based on the image augmentation model of the variational self-encoder and the countermeasure generation network, comprising the following steps:
inputting the visual features and semantic features of unseen-class training images into the image augmentation model to generate pseudo-visual features of the unseen-class training images;
and training an image classifier with the generated pseudo-visual features, and inputting the visual features of the unseen-class test images to be recognized into the classifier for classification to obtain the classification result.
A third aspect of the present invention provides an electronic device, including a processor, an input device, an output device, and a memory, where the processor, the input device, the output device, and the memory are connected in sequence, the memory is used for storing a computer program, the computer program includes program instructions, and the processor is configured to call the program instructions to execute the above method.
A fourth aspect of the invention provides a readable storage medium, storing a computer program comprising program instructions, which when executed by a processor, cause the processor to perform the method as described above.
3. Advantageous effects
Compared with the prior art, the invention has the beneficial effects that:
(1) The embodiment of the invention provides an image augmentation model training method based on a variational self-encoder and a generative adversarial network. Aiming at the zero-sample image classification problem, it combines the advantages of the variational self-encoder and the generative adversarial network, fully fusing the visual and semantic information of images to generate more effective samples and thereby effectively synthesizing the visual features of unseen-class images. The method combines the variational self-encoder, whose training process is stable, with the generative adversarial network, whose generated samples are clear; the visual features and semantic features of the image samples are input at the same time, the visual and semantic information of the images is effectively matched, and the quality of the generated data is improved. This effectively solves the problem of missing unseen-class images in zero-sample learning; the generated pseudo samples are used to train a classifier, converting zero-sample learning into classical supervised learning and thereby improving the accuracy of zero-sample image classification.
(2) Features of different modalities are encoded and decoded by variational self-encoders: the encoder constructs an exclusive probability distribution for each sample, the distribution is sampled, and the decoder reconstructs the data, so that the generated sample features are closer to the real data distribution and the method is more robust.
(3) In the embodiment of the invention, the outputs of the visual-modality and semantic-modality variational self-encoders are fed in series into the generator network, fully fusing the visual and semantic information of the image; the association between visual and semantic features can thus be mined more effectively, visual features are synthesized more effectively, and the influence of imbalance in the generated data on the model is reduced. The embodiment effectively combines the advantages of the variational self-encoder and of the generative adversarial network: the variational self-encoder directly compares the difference between the generated data and the original data through encoding and decoding and its training process is stable, while the adversarial generation network produces clear samples, so the stability and the discrimination capability of the generative model are improved.
(4) The invention constructs a network model consisting of two modality variational self-encoders, a generative adversarial network generator and two discriminator networks, effectively synthesizing visual and semantic features that are closer to the real data distribution, ensuring the alignment between different modalities and synthesizing more effective visual features. In addition, a multi-module loss function is constructed, comprising the variational self-encoder loss functions, the generator loss function and the adversarial loss functions of the visual and semantic feature discriminators; training the variational self-encoder adversarial generation network model on this multi-module loss effectively alleviates problems such as gradient explosion and model collapse, and improves the performance of the model.
(5) The invention uses the variational self-encoder adversarial generation network model to generate unseen-class pseudo samples, effectively solving the problem of missing unseen-class samples; the generated pseudo samples are used to train a classifier that classifies unseen-class test images, increasing the generalization capability of the model.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings required in the embodiments are briefly described below.
FIG. 1 is a flowchart of an image augmentation model training method based on a variational auto-encoder and a confrontation generation network according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of an image augmentation model according to an embodiment of the present invention;
FIG. 3 is a network architecture diagram of a visual feature extraction model in an embodiment of the invention;
FIG. 4 is a network architecture diagram of a semantic feature extraction model in an embodiment of the present invention;
FIG. 5 is a network architecture diagram of a visual modality variation autoencoder in an embodiment of the present invention;
FIG. 6 is a network structure diagram of a semantic modality variation self-encoder according to an embodiment of the present invention;
fig. 7 is a network configuration diagram of the generator G in the embodiment of the present invention;
FIG. 8 is a network architecture diagram of a visual feature discriminator D in an embodiment of the present invention;
fig. 9 is a network structure diagram of the semantic feature discriminator D according to the embodiment of the present invention;
FIG. 10 is a flowchart of an embodiment of an image augmentation model for zero-sample image classification.
Detailed Description
In order to make those skilled in the art better understand the technical solution of the present invention, the following describes in detail an image augmentation model training method and an image classification method based on a variational self-encoder and an antagonistic generation network according to the present invention with reference to the accompanying drawings.
As shown in fig. 1 and 2, the present example provides an image augmentation model training method based on a variational self-encoder and a countermeasure generation network, the method including the steps of:
s110: and acquiring a visible training image, and extracting visual features and semantic features of the visible training image.
Specifically, the visible-class training images in this example are the labeled samples in the feature space; the unlabeled samples in the feature space belong to the unseen classes. The visual features of the visible-class training images are obtained with a pre-configured visual feature extraction model, which is trained with a convolutional neural network and a Transformer encoder. The semantic features of the visible-class training images are obtained with a pre-configured semantic feature extraction model, which is a continuous bag-of-words model obtained through unsupervised training. The visual and semantic features of the visible-class training images are defined as x and a respectively, the visual and semantic features of the unseen-class training images as x_ut and a_ut respectively, and the visual features of the unseen-class test images as x_t.
It should be noted that this example uses AWA2 as the image data set. The data set comprises text and picture files: the text records the animal categories contained in the data set and the attribute labels of each category, and the picture files contain 37322 pictures of 50 kinds of animals, with 30337 pictures of 40 training classes and 6985 pictures of 10 test classes. The data selected from AWA2 in this example use 75 category attributes, 40 visible classes and 10 unseen classes, with 23337 visible-class samples and 7265 unseen-class samples. It should be understood that other data sets may be selected here, and this should not be construed as limiting the invention.
In one embodiment, as shown in fig. 3, the visual feature extraction module consists of a VGGNet16 model and a Transformer encoder. The image is input into the convolutional neural network VGGNet16, which outputs a feature map of size 16 × 16 × 1024. The feature map is divided into 256 feature vector blocks of 1024 dimensions; each block is mapped into a one-dimensional vector through linear mapping, giving 256 vectors (usually called tokens) of length 512. The vectors are position-coded and embedded into the Transformer encoder, in which encoder blocks are repeatedly stacked L times; 256 feature vectors of 512 dimensions are output and recombined into the visual features of the visible-class image, of size 16 × 16 × 512. The value of L in this example is 5.
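For concreteness, the following is a minimal PyTorch sketch of such a CNN-plus-Transformer extractor. The truncation point of VGGNet16, the number of attention heads and the learned position embedding are assumptions of this sketch rather than details fixed by the embodiment; only the stated sizes (a 16 × 16 × 1024 feature map, 256 tokens of dimension 512, L = 5 encoder blocks) follow the text above.

import torch
import torch.nn as nn
from torchvision.models import vgg16

class VisualFeatureExtractor(nn.Module):
    """CNN + Transformer-encoder visual feature extractor (illustrative sketch)."""
    def __init__(self, num_layers=5, embed_dim=512, num_tokens=256, cnn_channels=1024):
        super().__init__()
        # Truncated VGGNet16 backbone (through the 4th pooling stage, an assumption),
        # followed by a 1x1 convolution to reach 1024 channels; a 256x256 input yields 16x16.
        self.backbone = nn.Sequential(*list(vgg16(weights=None).features.children())[:24],
                                      nn.Conv2d(512, cnn_channels, kernel_size=1))
        self.proj = nn.Linear(cnn_channels, embed_dim)                        # linear mapping per block
        self.pos_embed = nn.Parameter(torch.zeros(1, num_tokens, embed_dim))  # position coding
        encoder_layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)  # stacked L times

    def forward(self, img):                         # img: (B, 3, 256, 256)
        fmap = self.backbone(img)                   # (B, 1024, 16, 16) feature map
        tokens = fmap.flatten(2).transpose(1, 2)    # (B, 256, 1024) feature vector blocks
        tokens = self.proj(tokens) + self.pos_embed # (B, 256, 512) position-coded tokens
        tokens = self.encoder(tokens)               # (B, 256, 512) after L encoder blocks
        return tokens.transpose(1, 2).reshape(-1, 512, 16, 16)  # visual feature x, 16 x 16 x 512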
As shown in fig. 4, in one embodiment the semantic feature extraction model is obtained through unsupervised training: a Continuous Bag-of-Words model (CBOW) trained unsupervised on a large-scale text corpus is obtained in advance; the category semantic label information of the visible-class image is input into this model to obtain the semantic feature vector of the visible-class image, and a dimension transformation network with only one hidden layer converts the semantic feature vector into a semantic feature of dimension 1 × 1 × 512.
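A hedged sketch of this step is shown below, assuming the CBOW word vectors come from a gensim word2vec model; the model file name "cbow_corpus.model", the 300-dimensional word vectors, the query word and the hidden width of the dimension transformation network are illustrative assumptions.

import numpy as np
import torch
import torch.nn as nn
from gensim.models import Word2Vec

class DimensionTransform(nn.Module):
    """Single-hidden-layer network mapping a CBOW word vector to a 512-d semantic feature."""
    def __init__(self, in_dim=300, hidden_dim=512, out_dim=512):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, hidden_dim), nn.ReLU(),
                                 nn.Linear(hidden_dim, out_dim))
    def forward(self, w):
        return self.net(w)                          # semantic feature a, shape (B, 512)

# Hypothetical usage: the file name and the class word "zebra" are placeholders.
cbow = Word2Vec.load("cbow_corpus.model")           # CBOW model trained unsupervised on a corpus
vec = torch.tensor(np.asarray(cbow.wv["zebra"]), dtype=torch.float32).unsqueeze(0)
semantic_feature = DimensionTransform(in_dim=vec.shape[1])(vec)   # 1 x 512 semantic feature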
S120: an image augmentation model is configured in advance, and the image augmentation model comprises a visual modal variation self-encoder, a semantic modal variation self-encoder and a generator for generating countermeasure network configuration.
Specifically, the image augmentation model in this example is preconfigured, wherein the visual modality variational self-encoder comprises an encoder network E1 and a decoder network D1, and the semantic modality variational self-encoder comprises an encoder network E2 and a decoder network D2.
As shown in fig. 5, the encoder network E1 is a Full Convolutional Network (FCN) containing n convolution layers; the number of filter channels increases layer by layer in order to learn deep features, and the output of the last convolution in the full convolution network is two n-dimensional vectors: the mean vector and the variance vector.
The encoder network E1 maps the visual features to an interval vector represented by the probability distribution N(μ, Σ), and samples this interval vector to obtain the hidden variable Z_1, where μ is the mean vector and Σ is the variance vector; the mean and variance contain the structural information of the input features. The probability distribution of the hidden variable Z_1 is then:
q_1(Z_1 | x) = N(Z_1 | μ_1, Σ_1),  p(Z_1) = N(Z_1 | 0, I)    (1)
where q_1(Z_1 | x) denotes the probability distribution obeyed by the hidden variable Z_1, p(Z_1) denotes the prior distribution of Z_1, here the unit Gaussian distribution, μ_1 and Σ_1 denote the mean and variance of Z_1, and N denotes a normal distribution. The encoder network E1 in this example is a 3-layer convolution with filter sizes of 32, 64 and 128 respectively; the last convolution outputs the two vectors, and the final output of the decoder network is a pseudo-visual feature with the same dimensions as the true visual feature, namely 16 × 16 × 512. The filters of a convolution layer identify certain specific features of the image, and each filter slides over the feature map of the previous layer; the shallow layers of a convolutional neural network generally detect primary features such as edges and colors, and as the number of convolution layers increases the filters convolve these features into various new features, so that deeper convolution layers extract deeper features.
Further, the structures of the decoder network D1 and the encoder network E1 are essentially symmetrical. Since the image feature dimensions shrink after the encoder's full convolution network, the decoder uses upsampling layers to gradually enlarge the feature dimensions, while the number of filter channels gradually decreases, until the original feature dimensions are restored. The decoder network D1 generates the first pseudo-visual feature x̂, similar to the original visual feature of the image, the parameters e_1 and d_1 of the encoder E1 and the decoder D1 are updated, and the reconstruction loss of the decoder D1 is calculated:
x̂ = D_1(Z_1),  Z_1 ~ q_1(Z_1 | x)    (2)
L_recon^V = || x - x̂ ||_2^2    (3)
where the visual feature x is obtained by a convolution operation of the Transformer feature and the semantic feature, namely x = t × a, L_recon^V denotes the reconstruction loss of the visual-modality variational self-encoder, and || · ||_2^2 denotes the squared L2 norm.
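A compact PyTorch sketch of this visual-modality variational self-encoder follows. The kernel sizes, strides, latent dimension and the use of log-variance with the reparameterization trick are assumptions of the sketch; only the 32/64/128 filter progression, the mean and variance outputs, and the 16 × 16 × 512 input and output size follow the description above.

import torch
import torch.nn as nn

class VisualVAE(nn.Module):
    """Visual-modality variational self-encoder E1/D1 (illustrative sketch)."""
    def __init__(self, in_ch=512, latent_dim=128):
        super().__init__()
        # E1: full-convolution encoder, filter channels 32 -> 64 -> 128 (as in the embodiment)
        self.enc = nn.Sequential(
            nn.Conv2d(in_ch, 32, 3, stride=2, padding=1), nn.ReLU(),   # 16x16 -> 8x8
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),      # 8x8 -> 4x4
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU())     # 4x4 -> 2x2
        self.to_mu = nn.Conv2d(128, latent_dim, 2)       # mean vector mu_1
        self.to_logvar = nn.Conv2d(128, latent_dim, 2)   # (log-)variance vector
        # D1: upsampling decoder, channel count shrinking back to the original 512
        self.dec = nn.Sequential(
            nn.Upsample(scale_factor=2), nn.Conv2d(latent_dim, 128, 3, padding=1), nn.ReLU(),
            nn.Upsample(scale_factor=2), nn.Conv2d(128, 64, 3, padding=1), nn.ReLU(),
            nn.Upsample(scale_factor=2), nn.Conv2d(64, 32, 3, padding=1), nn.ReLU(),
            nn.Upsample(scale_factor=2), nn.Conv2d(32, in_ch, 3, padding=1))

    def forward(self, x):                                 # x: (B, 512, 16, 16)
        h = self.enc(x)
        mu, logvar = self.to_mu(h), self.to_logvar(h)     # (B, latent_dim, 1, 1)
        z1 = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterized sample of Z_1
        x_hat = self.dec(z1)                              # first pseudo-visual feature x^, (B, 512, 16, 16)
        return x_hat, mu, logvar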
As shown in fig. 6, the semantic-modality variational self-encoder comprises an encoder network E2 and a decoder network D2, both of which use two fully connected layers for encoding and decoding. The semantic features are input into the encoder network E2 to obtain the hidden variable Z_2, and the decoder network D2 restores, from the probability distribution of the hidden variable, an approximation of the original data distribution, i.e. generates a pseudo-semantic feature â similar to the original semantic feature, while the parameters e_2 and d_2 of the encoder network E2 and the decoder network D2 are updated and the reconstruction loss is calculated:
q_2(Z_2 | a) = N(Z_2 | μ_2, Σ_2),  p(Z_2) = N(Z_2 | 0, I)    (4)
â = D_2(Z_2),  Z_2 ~ q_2(Z_2 | a)    (5)
L_recon^S = || a - â ||_2^2    (6)
where q_2(Z_2 | a) denotes the probability distribution obeyed by the hidden variable Z_2, p(Z_2) denotes the prior distribution of Z_2, here the unit Gaussian distribution, μ_2 and Σ_2 denote the mean and variance of Z_2, N denotes a normal distribution, L_recon^S denotes the reconstruction loss of the semantic-modality variational self-encoder, and || · ||_2^2 denotes the squared L2 norm;
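A corresponding sketch of the semantic-modality variational self-encoder, with two fully connected layers on each side, is shown below; the hidden and latent dimensions are assumptions of the sketch.

import torch
import torch.nn as nn

class SemanticVAE(nn.Module):
    """Semantic-modality variational self-encoder E2/D2: two fully connected layers each."""
    def __init__(self, in_dim=512, hidden_dim=256, latent_dim=64):
        super().__init__()
        self.enc_fc = nn.Sequential(nn.Linear(in_dim, hidden_dim), nn.ReLU())   # E2, layer 1
        self.to_mu = nn.Linear(hidden_dim, latent_dim)        # E2, layer 2 -> mean mu_2
        self.to_logvar = nn.Linear(hidden_dim, latent_dim)    # E2, layer 2 -> (log-)variance
        self.dec = nn.Sequential(nn.Linear(latent_dim, hidden_dim), nn.ReLU(),  # D2
                                 nn.Linear(hidden_dim, in_dim))

    def forward(self, a):                                     # a: (B, 512) semantic feature
        h = self.enc_fc(a)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z2 = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # sample hidden variable Z_2
        a_hat = self.dec(z2)                                  # pseudo-semantic feature a^
        recon_loss = ((a - a_hat) ** 2).sum(dim=1).mean()     # squared L2 reconstruction loss
        return a_hat, mu, logvar, recon_loss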
s130: and respectively inputting the visual characteristic and the semantic characteristic into a visual mode variational self-encoder and a semantic mode variational self-encoder to generate a first pseudo visual characteristic and a pseudo semantic characteristic.
Specifically, the visual features obtained in step S110 are input into a visual mode variational self-encoder, and encoded to obtain a first pseudo-visual feature; and (5) inputting the semantic features obtained in the step (S110) into a semantic mode variational self-encoder, and encoding to obtain pseudo-semantic features. The pseudo-semantic feature size finally output in this example is 1 × 1 × 512, and the pseudo-visual feature finally output has the same dimension as the real visual feature, namely 16 × 16 × 512.
S140: and inputting the first pseudo-visual feature and the pseudo-semantic feature into a pre-configured generator, and fusing to generate a second pseudo-visual feature.
Specifically, the generator network in this example is a Multi-Layer Perceptron (MLP) network with two fully connected hidden layers, whose inputs are the first pseudo-visual feature x̂ and the pseudo-semantic feature â. Global average pooling is applied to the first pseudo-visual feature to obtain a pooled pseudo-visual feature of a preset size, 1 × 1 × C; the pooled feature and the pseudo-semantic feature are input into the generator network to generate a new feature, which is converted with a reshape function into the second pseudo-visual feature of size H × W × C, and the parameters of the generator are updated with the Adam gradient descent algorithm, the loss function of the generator being:
L_G = E_{x̂,â}[log(1 - D_v(x̃))],  x̃ = G(x̂, â)
where D_v denotes the visual feature discriminator described below, and x̃ denotes the second pseudo-visual feature generated by the generator G from the first pseudo-visual feature x̂ and the pseudo-semantic feature â;
in the present example, the first pseudo-visual feature and the pseudo-semantic feature, which carry the visual information and the semantic information respectively, are input in series into the generator network, fully fusing the visual and semantic information of the image to generate a second pseudo-visual feature of size 16 × 16 × 512; this feature may belong to either a visible class or an unseen class.
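The fusion step can be sketched as follows; the hidden width of the two fully connected layers is an assumption, while the global average pooling, the serial (concatenated) input and the reshape to 16 × 16 × 512 follow the description above.

import torch
import torch.nn as nn

class Generator(nn.Module):
    """MLP generator G with two fully connected hidden layers fusing the two modalities."""
    def __init__(self, feat_ch=512, sem_dim=512, hidden_dim=1024, out_hw=16):
        super().__init__()
        self.out_hw, self.feat_ch = out_hw, feat_ch
        self.pool = nn.AdaptiveAvgPool2d(1)                 # global average pooling -> 1 x 1 x C
        self.mlp = nn.Sequential(
            nn.Linear(feat_ch + sem_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, feat_ch * out_hw * out_hw))

    def forward(self, x_hat, a_hat):                        # x_hat: (B, 512, 16, 16), a_hat: (B, 512)
        pooled = self.pool(x_hat).flatten(1)                # (B, 512) pooled pseudo-visual feature
        fused = torch.cat([pooled, a_hat], dim=1)           # serial (concatenated) input to G
        out = self.mlp(fused)
        # reshape back to H x W x C, i.e. the second pseudo-visual feature
        return out.view(-1, self.feat_ch, self.out_hw, self.out_hw)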
S150: and performing back propagation optimization parameters according to each loss function of the image augmentation model until the overall loss function is converged, and storing the model parameters to obtain the trained image augmentation model.
Specifically, the respective loss functions of the image augmentation model in this example include a countervailing loss function, a visual modality variational self-encoder loss function, a semantic modality variational self-encoder loss function, and a generator loss function.
In one embodiment, the total loss function L_VAE comprises the total reconstruction loss and the KL (Kullback-Leibler) divergence losses of the visual-modality and semantic-modality variational self-encoders; the total reconstruction loss measures how similar the generated feature data are to the original feature data, and the KL divergence loss measures, in terms of the mean μ and the variance Σ, how far the learned distribution is from the prior, as shown in the following formula:
L_VAE = λ (L_recon^V + L_recon^S) + β (D_KL(q_1(Z_1 | x) || p(Z_1)) + D_KL(q_2(Z_2 | a) || p(Z_2)))
where L_VAE is the sum of the losses of the visual-modality and semantic-modality variational self-encoders of the image, L_recon^V denotes the reconstruction loss of the visual-modality variational self-encoder, L_recon^S denotes the reconstruction loss of the semantic-modality variational self-encoder, q_1(Z_1 | x) and q_2(Z_2 | a) denote the probability distributions obeyed by the hidden variables Z_1 and Z_2, p(Z_1) and p(Z_2) denote the prior distributions of Z_1 and Z_2, D_KL is the KL divergence loss, λ is the weight of the reconstruction loss term, used to reduce the difference between the generated features and the real features, and β is the weight of the KL divergence loss term, used to encourage the network to learn a broader distribution.
Given a hidden variable space of dimension n, the KL divergence loss is defined as:
L_KL = (1/2) Σ_{i=1}^{n} (μ_i^2 + σ_i^2 - log σ_i^2 - 1)
where L_KL denotes the KL divergence loss, μ_i denotes the mean of spatial dimension i, and σ_i denotes the variance of spatial dimension i.
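The combined VAE loss can be written as a short function; the per-sample summation and batch-averaging convention, the log-variance parameterization and the default weight values are assumptions of this sketch.

import torch

def vae_total_loss(x, x_tilde, a, a_hat, mu1, logvar1, mu2, logvar2, lam=1.0, beta=1.0):
    """L_VAE = lam * (visual + semantic reconstruction loss) + beta * (KL losses).
    lam and beta are the reconstruction and KL weights; the values here are placeholders."""
    recon_v = ((x - x_tilde) ** 2).flatten(1).sum(dim=1).mean()   # visual reconstruction loss
    recon_s = ((a - a_hat) ** 2).sum(dim=1).mean()                # semantic reconstruction loss
    # KL(N(mu, sigma^2) || N(0, I)) summed over the n hidden dimensions
    kl1 = 0.5 * (mu1.pow(2) + logvar1.exp() - logvar1 - 1).flatten(1).sum(dim=1).mean()
    kl2 = 0.5 * (mu2.pow(2) + logvar2.exp() - logvar2 - 1).flatten(1).sum(dim=1).mean()
    return lam * (recon_v + recon_s) + beta * (kl1 + kl2)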
In one embodiment, the adversarial loss functions are obtained as follows:
first, a visual feature discriminator D_v and a semantic feature discriminator D_s are constructed; the visual features of the visible-class images and the second pseudo-visual features from the steps above are input into the discriminator D_v to discriminate real from fake, giving the first discrimination information;
the semantic features and the pseudo-semantic features of the visible-class images are input into the discriminator D_s to discriminate real from fake, giving the second discrimination information, and the adversarial loss functions are determined from the first and second discrimination information respectively.
Specifically, as shown in figs. 8 and 9, in this example the visual feature discriminator D_v comprises a group of fully connected layers and a binary Sigmoid classifier, and the semantic feature discriminator D_s comprises one fully connected hidden layer and a binary Sigmoid classifier. The final outputs of the two discriminators are 0 or 1, where 0 indicates that the feature is fake and 1 indicates that the feature is real; the parameters of the discriminators are updated with the Adam gradient descent algorithm, and the adversarial loss functions of the visual feature discriminator D_v and the semantic feature discriminator D_s are as follows:
L_Dv = E_x[log D_v(x)] + E_{x̂,â}[log(1 - D_v(x̃))]
L_Ds = E_a[log D_s(a)] + E_â[log(1 - D_s(â))]
where L_Dv denotes the adversarial loss function of the visual feature discriminator, L_Ds denotes the adversarial loss function of the semantic feature discriminator, E_x denotes the expectation over the image visual feature x, E_{x̂,â} denotes the expectation over the first pseudo-visual feature x̂ and the pseudo-semantic feature â, E_a denotes the expectation over the image semantic feature a, E_â denotes the expectation over the pseudo-semantic feature â, D_v(x) denotes the output of the visual feature discriminator for the input visual feature x, x̃ = G(x̂, â) denotes the second pseudo-visual feature generated from the input first pseudo-visual feature x̂ and pseudo-semantic feature â, and D_v(x̃) denotes the output of the visual feature discriminator for the input second pseudo-visual feature x̃.
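A sketch of the two discriminators and their adversarial losses is given below; the hidden layer widths are assumptions, while the fully connected layers with a Sigmoid binary output and the log-likelihood form of the losses follow the description above.

import torch
import torch.nn as nn

class VisualDiscriminator(nn.Module):
    """Visual feature discriminator D_v: fully connected layers with a Sigmoid binary output."""
    def __init__(self, feat_ch=512, hw=16, hidden_dim=1024):
        super().__init__()
        self.net = nn.Sequential(nn.Flatten(),
                                 nn.Linear(feat_ch * hw * hw, hidden_dim), nn.ReLU(),
                                 nn.Linear(hidden_dim, 1), nn.Sigmoid())
    def forward(self, x):
        return self.net(x)          # probability that the visual feature is real (1) vs fake (0)

class SemanticDiscriminator(nn.Module):
    """Semantic feature discriminator D_s: one fully connected hidden layer + Sigmoid output."""
    def __init__(self, sem_dim=512, hidden_dim=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(sem_dim, hidden_dim), nn.ReLU(),
                                 nn.Linear(hidden_dim, 1), nn.Sigmoid())
    def forward(self, a):
        return self.net(a)

def adversarial_losses(d_v, d_s, x, x_tilde, a, a_hat, eps=1e-8):
    """L_Dv and L_Ds as log-likelihood adversarial losses (maximized by the discriminators)."""
    loss_dv = torch.log(d_v(x) + eps).mean() + torch.log(1 - d_v(x_tilde) + eps).mean()
    loss_ds = torch.log(d_s(a) + eps).mean() + torch.log(1 - d_s(a_hat) + eps).mean()
    return loss_dv, loss_ds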
The above steps are repeated: the visual and semantic features of the visible-class training images serve as the input of the image augmentation model, the model is trained by back-propagation based on each loss function, and the parameters of the visual-modality variational self-encoder, the semantic-modality variational self-encoder, the generator and the visual and semantic feature discriminators are continuously updated and optimized until the total loss function converges; the trained variational self-encoder adversarial generation network model is obtained, the model parameters are stored, and the training of the image augmentation model is completed. In addition, when training the model of this example, the number of images per batch (batch_size) is 32, the Adam optimizer is used with a learning rate (learning_rate) of 0.0001, the activation function is ReLU, the Dropout rejection rate is 0.5, and the maximum number of training rounds is 100000.
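One possible training step combining the modules sketched above is outlined below, using the stated Adam optimizer and 0.0001 learning rate; the alternating update order and the non-saturating form of the generator term are assumptions of this sketch.

import itertools
import torch

# Assumed to reuse the classes and functions sketched in the preceding examples.
vae_v, vae_s, gen = VisualVAE(), SemanticVAE(), Generator()
d_v, d_s = VisualDiscriminator(), SemanticDiscriminator()

opt_g = torch.optim.Adam(itertools.chain(vae_v.parameters(), vae_s.parameters(),
                                          gen.parameters()), lr=1e-4)
opt_d = torch.optim.Adam(itertools.chain(d_v.parameters(), d_s.parameters()), lr=1e-4)

def train_step(x, a):                          # x: (32, 512, 16, 16) visual, a: (32, 512) semantic
    x_hat, mu1, lv1 = vae_v(x)                 # first pseudo-visual feature
    a_hat, mu2, lv2, _ = vae_s(a)              # pseudo-semantic feature
    x_tilde = gen(x_hat, a_hat)                # second pseudo-visual feature

    # Discriminator update: maximize L_Dv + L_Ds (i.e. minimize the negative)
    loss_dv, loss_ds = adversarial_losses(d_v, d_s, x, x_tilde.detach(), a, a_hat.detach())
    opt_d.zero_grad(); (-(loss_dv + loss_ds)).backward(); opt_d.step()

    # Generator / VAE update: reconstruction + KL losses plus fooling both discriminators
    loss_vae = vae_total_loss(x, x_tilde, a, a_hat, mu1, lv1, mu2, lv2)
    loss_g = -torch.log(d_v(x_tilde) + 1e-8).mean() - torch.log(d_s(a_hat) + 1e-8).mean()
    opt_g.zero_grad(); (loss_vae + loss_g).backward(); opt_g.step()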
The method combines the advantages of the variational self-encoder and the generative adversarial network, fully fusing the visual and semantic information of the image to generate more effective samples; it effectively matches the visual and semantic features of the image, improves the quality of the generated data, and effectively solves the problem of missing unseen-class images in zero-sample learning. The generated pseudo samples are used to train a classifier, converting zero-sample learning into classical supervised learning and thereby improving the accuracy of zero-sample image classification.
The present example also provides an image classification method based on a variational self-encoder and a countermeasure generation network, comprising the steps of:
s210: and inputting the visual features and semantic features of the unseen training images into the image augmentation model to generate the pseudo visual features of the unseen training images.
Specifically, as shown in fig. 10, the visual features x_ut and the semantic features a_ut of the unseen-class training images are input into the trained image augmentation model to generate the unseen-class pseudo-visual features x̃_ut.
S220: and training an image classifier by using the generated pseudo visual features, and inputting the visual features of the unseen test image to be recognized into the classifier for classification to obtain a classification result.
Specifically, a Softmax classifier is trained with the unseen-class pseudo-visual features and class labels, and the visual features x_t of the unseen-class test images are input into the trained classifier to obtain the classification accuracy.
In this example the unseen-class pseudo-visual features and class labels are used to train a Softmax classifier for zero-sample image classification. The training set of the Softmax classifier is
{(x̃_i, t_i)},  i = 1, …, m
where x̃_i is a given classifier input, m is the number of training samples, t_i is the number of the class to which the sample belongs, t_i ∈ {1, 2, …, C}, and C is the total number of image classes. The Softmax classifier is defined as
h_θ(x̃_i) = [ p(t_i = k | x̃_i; θ) ]_{k=1,…,C} = (1 / Σ_{j=1}^{C} exp(θ_j^T x̃_i)) · [ exp(θ_1^T x̃_i), …, exp(θ_C^T x̃_i) ]^T
where p(t_i = k | x̃_i; θ) denotes the probability that, for the given input x̃_i, the input data belong to class k, k = 1, …, C; h_θ(x̃_i), the output of the classifier, is a column vector of C rows and 1 column, each row representing the probability that the current input is recognized as class k, with the sum of all row elements equal to 1; θ_1, …, θ_C are the parameters to be estimated by the Softmax classifier and form the parameter matrix θ.
When the Softmax classifier is used to classify unseen-class pictures, for any feature vector x̃ the class k with the maximum probability is selected as the classification result of the current picture; this result is compared with the labeled ground truth, and the classification is correct if they are consistent and wrong otherwise;
and the model parameters are optimized with a gradient descent method, the loss function of the Softmax classifier being:
J(θ) = -(1/m) Σ_{i=1}^{m} Σ_{k=1}^{C} 1{t_i = k} log p(t_i = k | x̃_i; θ)
where 1{·} is the indicator function. J(θ) is solved by a conjugate-gradient algorithm as an unconstrained optimization problem, and the classification accuracy serves as the evaluation index of the test. The visual features x_t of the unseen-class test images are input into the trained Softmax classifier, and the class label corresponding to the maximum output probability of the classifier is taken as the prediction result of the classifier.
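A minimal sketch of this final stage is given below, with plain gradient descent standing in for the conjugate-gradient optimization described above; the feature shapes, epoch count and learning rate are placeholders.

import torch
import torch.nn as nn

def train_softmax_classifier(pseudo_feats, labels, num_classes, epochs=100, lr=1e-3):
    # pseudo_feats: (m, 512, 16, 16) generated unseen-class features; labels: (m,) indices in [0, C)
    clf = nn.Sequential(nn.Flatten(), nn.Linear(512 * 16 * 16, num_classes))
    opt = torch.optim.SGD(clf.parameters(), lr=lr)      # gradient descent on the Softmax loss
    ce = nn.CrossEntropyLoss()                          # log-softmax + negative log-likelihood
    for _ in range(epochs):
        opt.zero_grad()
        ce(clf(pseudo_feats), labels).backward()
        opt.step()
    return clf

def classify(clf, test_feats):
    # pick the class with the maximum predicted probability for each unseen-class test feature
    return clf(test_feats).argmax(dim=1)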

Claims (10)

1. An image augmentation model training method based on a variational self-encoder and a confrontation generation network is characterized by comprising the following steps:
s110: acquiring a visible training image, and extracting visual features and semantic features of the visible training image;
s120: an image augmentation model is configured in advance, and the image augmentation model comprises a visual modal variation self-encoder, a semantic modal variation self-encoder and a generator configured according to a generated countermeasure network;
s130: respectively inputting the visual features and the semantic features into a visual modal variation self-encoder and a semantic modal variation self-encoder to generate first pseudo visual features and pseudo semantic features;
s140: inputting the first pseudo-visual feature and the pseudo-semantic feature into a pre-configured generator, and fusing to generate a second pseudo-visual feature;
s150: and performing back propagation optimization parameters according to the loss function of the image augmentation model until the overall loss function is converged, and storing the model parameters to obtain the trained image augmentation model.
2. The method of claim 1, wherein the loss function comprises a countering loss function, and the countering loss function obtaining step comprises:
configuring a visual feature discriminator and a semantic feature discriminator;
inputting the visual feature and the second pseudo visual feature into a visual feature discriminator to obtain first discrimination information;
inputting the semantic features and the pseudo-semantic features into a semantic feature discriminator to obtain second discrimination information;
respectively determining a countermeasure loss function according to the first discrimination information and the second discrimination information, and updating parameters of a visual feature discriminator and a semantic feature discriminator by adopting an Adam gradient descent algorithm;
the loss function further comprises the total loss function L_VAE of the variational self-encoders: the reconstruction loss L_recon^V of the visual-modality variational self-encoder together with its KL divergence loss, and the reconstruction loss L_recon^S of the semantic-modality variational self-encoder together with its KL divergence loss.
3. The method for training an image augmentation model based on a variational self-encoder and a countermeasure generation network according to claim 1, wherein in the step S110:
extracting visual features of the visible training images by using a visual feature extraction model, wherein the visual feature extraction model uses a convolutional neural network and a Transformer encoder as a feature extraction network;
inputting the visible training images into a convolutional neural network to obtain a characteristic diagram;
dividing the feature map into multi-dimensional feature vector blocks, and mapping each feature vector block into a one-dimensional vector through linear mapping to obtain a plurality of feature vectors;
and carrying out position coding on the feature vector, embedding the feature vector into the Transformer encoder, repeatedly stacking encoder blocks in the encoder for L times, outputting a second-dimension feature vector, and recombining the second-dimension feature vector into visual features with a preset size.
4. The method of claim 3, wherein in the step S110:
and extracting semantic features of the visible training images by using a semantic feature extraction model, taking a continuous bag-of-words model obtained through unsupervised training in a text corpus as the semantic feature extraction model, extracting semantic feature vectors of the visible images by using the semantic feature extraction model, and converting the semantic feature vectors into semantic features with preset sizes through a dimension transformation network.
5. The method for training an image augmentation model based on a variational self-encoder and an antagonism generation network according to claim 1, wherein the visual-modality variational self-encoder in step S120 comprises an encoder network E1 and a decoder network D1, wherein the encoder network E1 is a full convolution network with n convolution layers whose number of filter channels increases layer by layer so as to learn deep features; the output of the last convolution layer of the full convolution network is two n-dimensional vectors, a mean vector and a variance vector;
the encoder network E1 maps the visual features to an interval vector represented by the probability distribution N(μ, Σ), and samples this interval vector to obtain the hidden variable Z_1, where μ is the mean vector and Σ is the variance vector; the probability distribution of the hidden variable Z_1 is then:
q_1(Z_1 | x) = N(Z_1 | μ_1, Σ_1),  p(Z_1) = N(Z_1 | 0, I)
where q_1(Z_1 | x) denotes the probability distribution obeyed by the hidden variable Z_1, p(Z_1) denotes the prior distribution of Z_1, here the unit Gaussian distribution, μ_1 and Σ_1 denote the mean and variance of Z_1, and N denotes a normal distribution.
6. The method for training an image augmentation model based on a variational self-encoder and an adversarial generation network according to claim 1, wherein the semantic-modality variational self-encoder comprises an encoder network E2 and a decoder network D2, both of which use two fully connected layers for encoding and decoding; the semantic features are input into the encoder network E2 to obtain a hidden variable Z2, and the decoder network D2 restores the probability distribution of the hidden variable Z2 to an approximation of the original data distribution, i.e. generates a pseudo-semantic feature ã similar to the semantic feature, updates the parameters e2 and d2 of the encoder network E2 and the decoder network D2, and calculates the reconstruction loss:
q2(Z2|a) = N(Z2|μ2, Σ2),  p(Z2) = N(Z2|0, I)
ã = D2(Z2),  L_rec^sem = ||a − ã||₂²
wherein q2(Z2|a) denotes the probability distribution obeyed by the hidden variable Z2, p(Z2) denotes the prior distribution of Z2, here a unit Gaussian distribution, μ2 and Σ2 denote the mean and variance of the hidden variable Z2, N denotes a normal distribution, L_rec^sem denotes the reconstruction loss of the semantic-modality variational self-encoder, and ||·||₂² denotes the squared L2 norm.
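A compact sketch of the semantic-modality variational self-encoder as described, with two fully connected layers in both E2 and D2 and the squared-L2 reconstruction loss; the dimensions (256-d semantic feature, 128-d hidden layer, 64-d hidden variable) are assumed for illustration:

import torch
import torch.nn as nn

class SemanticVAE(nn.Module):
    """Encoder E2 and decoder D2, each built from two fully connected layers."""
    def __init__(self, sem_dim=256, hidden=128, latent_dim=64):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(sem_dim, hidden), nn.ReLU(), nn.Linear(hidden, 2 * latent_dim))
        self.dec = nn.Sequential(nn.Linear(latent_dim, hidden), nn.ReLU(), nn.Linear(hidden, sem_dim))

    def forward(self, a):
        mu, logvar = self.enc(a).chunk(2, dim=1)
        z2 = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)   # sample the hidden variable Z2
        a_tilde = self.dec(z2)                                     # pseudo-semantic feature
        rec_loss = ((a - a_tilde) ** 2).sum(dim=1).mean()          # squared-L2 reconstruction loss
        return a_tilde, rec_loss, mu, logvar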
7. The method of claim 5, wherein in step S130 the total loss function L_VAE of the visual-modality variational self-encoder and the semantic-modality variational self-encoder is calculated; the total loss function L_VAE comprises the total reconstruction loss and the KL divergence loss of the visual-modality and semantic-modality variational self-encoders, the total reconstruction loss measuring the degree of similarity between the second pseudo-visual feature and the visual feature data, as in the following formula:
L_VAE = α (L_rec^vis + L_rec^sem) + β (KL(q1(Z1|x) ‖ p(Z1)) + KL(q2(Z2|a) ‖ p(Z2)))
wherein L_VAE is the sum of the losses of the visual-modality and semantic-modality variational self-encoders, L_rec^vis denotes the reconstruction loss of the visual-modality variational self-encoder, L_rec^sem denotes the reconstruction loss of the semantic-modality variational self-encoder, q1(Z1|x) and q2(Z2|a) respectively denote the probability distributions obeyed by the hidden variables Z1 and Z2, p(Z1) and p(Z2) respectively denote the prior distributions of Z1 and Z2, L_KL is the KL divergence loss, α is the weight of the reconstruction loss term and is used to reduce the difference between the generated features and the real features, and β is the weight of the KL divergence loss term and is used to encourage the network to learn a more widely spread distribution; given the hidden variable space dimension n, the KL divergence loss is defined as:
L_KL = −(1/2) Σ_{i=1..n} (1 + log σi² − μi² − σi²)
wherein L_KL denotes the KL divergence loss, μi denotes the mean in spatial dimension i, and σi denotes the variance in spatial dimension i.
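The weighted combination of reconstruction and KL terms can be sketched as below; the closed-form KL term follows the standard diagonal-Gaussian expression implied by the claim, and the default weights alpha and beta are placeholders:

import torch

def kl_divergence(mu, logvar):
    # Closed-form KL(N(mu, sigma^2) || N(0, I)) for a diagonal Gaussian of dimension n.
    return -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=1).mean()

def total_vae_loss(rec_vis, rec_sem, mu1, logvar1, mu2, logvar2, alpha=1.0, beta=0.5):
    # alpha weights the reconstruction terms, beta weights the KL terms (values are illustrative).
    l_kl = kl_divergence(mu1, logvar1) + kl_divergence(mu2, logvar2)
    return alpha * (rec_vis + rec_sem) + beta * l_kl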
8. The method of claim 1, wherein the generator network in step S140 is a multi-layer perceptron network with two fully connected hidden layers; global average pooling is performed on the first pseudo-visual feature to obtain a pseudo-visual feature of a preset size, the pooled pseudo-visual feature and the pseudo-semantic feature are input into the generator network to generate a new feature, the generated feature is transformed with a reshape function to obtain the second pseudo-visual feature, and the parameters of the generator are updated with the Adam gradient descent algorithm, wherein the loss function of the generator is:
[generator loss L_G; formula image in the original]
wherein x̃ = G(x̂, ã) denotes the second pseudo-visual feature generated by the generator G from the first pseudo-visual feature x̂ and the pseudo-semantic feature ã.
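A sketch of the generator step, assuming a 256-d pooled pseudo-visual feature, a 256-d pseudo-semantic feature, two 512-unit hidden layers and a 196x256 output shape for the second pseudo-visual feature (all illustrative sizes):

import torch
import torch.nn as nn

class Generator(nn.Module):
    """Multi-layer perceptron with two fully connected hidden layers."""
    def __init__(self, pseudo_vis_dim=256, pseudo_sem_dim=256, hidden=512, out_shape=(196, 256)):
        super().__init__()
        self.out_shape = out_shape
        out_dim = out_shape[0] * out_shape[1]
        self.mlp = nn.Sequential(
            nn.Linear(pseudo_vis_dim + pseudo_sem_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, out_dim),
        )

    def forward(self, first_pseudo_visual, pseudo_semantic):
        # Global average pooling reduces the first pseudo-visual feature map to a fixed-size vector.
        pooled = first_pseudo_visual.mean(dim=(2, 3)) if first_pseudo_visual.dim() == 4 else first_pseudo_visual
        new_feat = self.mlp(torch.cat([pooled, pseudo_semantic], dim=1))
        # Reshape the generated vector into the second pseudo-visual feature of preset size.
        return new_feat.view(-1, *self.out_shape)

g = Generator()
opt_g = torch.optim.Adam(g.parameters(), lr=1e-4)   # generator parameters updated with Adam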
9. The method according to claim 3, wherein the adversarial loss functions of the visual feature discriminator D and the semantic feature discriminator D̃ are as follows:
[adversarial losses L_D and L_D̃; formula images in the original]
wherein L_D denotes the adversarial loss function of the visual feature discriminator D, L_D̃ denotes the adversarial loss function of the semantic feature discriminator D̃, E_x denotes the expectation over the visual feature x, E_{x̂,ã} denotes the joint expectation over the first pseudo-visual feature x̂ and the pseudo-semantic feature ã, E_a denotes the expectation over the semantic feature a, E_ã denotes the expectation over the pseudo-semantic feature ã, D(x) denotes the output of the visual feature discriminator for the input visual feature x, x̃ = G(x̂, ã) denotes the second pseudo-visual feature generated from the first pseudo-visual feature x̂ and the pseudo-semantic feature ã, and D(x̃) denotes the output of the visual feature discriminator for the input second pseudo-visual feature x̃.
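Because the adversarial loss formulas are published only as images, the sketch below uses a standard real-versus-generated binary cross-entropy form for both discriminators as an assumption rather than the filed definition; D_vis and D_sem are assumed to return raw logits:

import torch
import torch.nn as nn

bce = nn.BCEWithLogitsLoss()

def discriminator_losses(D_vis, D_sem, x, a, x_tilde, a_tilde):
    # Visual discriminator: real visual feature x versus generated second pseudo-visual feature x_tilde.
    real_v, fake_v = D_vis(x), D_vis(x_tilde.detach())
    loss_d_vis = bce(real_v, torch.ones_like(real_v)) + bce(fake_v, torch.zeros_like(fake_v))
    # Semantic discriminator: real semantic feature a versus pseudo-semantic feature a_tilde.
    real_s, fake_s = D_sem(a), D_sem(a_tilde.detach())
    loss_d_sem = bce(real_s, torch.ones_like(real_s)) + bce(fake_s, torch.zeros_like(fake_s))
    return loss_d_vis, loss_d_sem

def generator_adversarial_loss(D_vis, x_tilde):
    # The generator is pushed to make the second pseudo-visual feature look real to D_vis.
    fake = D_vis(x_tilde)
    return bce(fake, torch.ones_like(fake))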
10. An image classification method based on the image augmentation model of the variational self-encoder and adversarial generation network, comprising:
inputting the visual features and semantic features of the unseen training images into the image augmentation model trained by the method according to any one of claims 1 to 8, so as to generate pseudo-visual features of the unseen training images;
and training an image classifier with the generated pseudo-visual features, and inputting the visual features of the unseen test images to be recognized into the classifier for classification so as to obtain a classification result.
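A sketch of the classification stage: a linear softmax classifier is trained on generated pseudo-visual features labeled by the unseen class whose semantic feature produced them, then applied to real visual features at test time; the feature dimension, class count and random stand-in data are illustrative:

import torch
import torch.nn as nn

num_unseen_classes, feat_dim = 10, 256
classifier = nn.Linear(feat_dim, num_unseen_classes)
opt = torch.optim.Adam(classifier.parameters(), lr=1e-3)
ce = nn.CrossEntropyLoss()

pseudo_feats = torch.randn(640, feat_dim)                 # stand-in for generator output
labels = torch.randint(0, num_unseen_classes, (640,))     # class of the semantic feature used for each sample
for _ in range(20):
    opt.zero_grad()
    loss = ce(classifier(pseudo_feats), labels)
    loss.backward()
    opt.step()

# At test time, real visual features of unseen images are classified directly.
test_feat = torch.randn(1, feat_dim)
pred = classifier(test_feat).argmax(dim=1)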
CN202210111331.7A 2022-01-29 2022-01-29 Image augmentation model training method and image classification method based on variational self-encoder and countermeasure generation network Pending CN114386534A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210111331.7A CN114386534A (en) 2022-01-29 2022-01-29 Image augmentation model training method and image classification method based on variational self-encoder and countermeasure generation network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210111331.7A CN114386534A (en) 2022-01-29 2022-01-29 Image augmentation model training method and image classification method based on variational self-encoder and countermeasure generation network

Publications (1)

Publication Number Publication Date
CN114386534A true CN114386534A (en) 2022-04-22

Family

ID=81203509

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210111331.7A Pending CN114386534A (en) 2022-01-29 2022-01-29 Image augmentation model training method and image classification method based on variational self-encoder and countermeasure generation network

Country Status (1)

Country Link
CN (1) CN114386534A (en)

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115601553A (en) * 2022-08-15 2023-01-13 杭州联汇科技股份有限公司(Cn) Visual model pre-training method based on multi-level picture description data
CN115601553B (en) * 2022-08-15 2023-08-18 杭州联汇科技股份有限公司 Visual model pre-training method based on multi-level picture description data
CN115131347A (en) * 2022-08-29 2022-09-30 江苏茂融智能科技有限公司 Intelligent control method for processing zinc alloy parts
CN115588436A (en) * 2022-09-29 2023-01-10 沈阳新松机器人自动化股份有限公司 Voice enhancement method for generating countermeasure network based on variational self-encoder
CN115758159A (en) * 2022-11-29 2023-03-07 东北林业大学 Zero sample text position detection method based on mixed contrast learning and generation type data enhancement
CN115758159B (en) * 2022-11-29 2023-07-21 东北林业大学 Zero sample text position detection method based on mixed contrast learning and generation type data enhancement
CN116051909A (en) * 2023-03-06 2023-05-02 中国科学技术大学 Direct push zero-order learning unseen picture classification method, device and medium
CN116109877A (en) * 2023-04-07 2023-05-12 中国科学技术大学 Combined zero-sample image classification method, system, equipment and storage medium
CN116109877B (en) * 2023-04-07 2023-06-20 中国科学技术大学 Combined zero-sample image classification method, system, equipment and storage medium
CN117972440A (en) * 2024-04-01 2024-05-03 长春理工大学 Unbalanced heart rate data set processing method and system based on generation countermeasure network

Similar Documents

Publication Publication Date Title
CN114386534A (en) Image augmentation model training method and image classification method based on variational self-encoder and countermeasure generation network
Wang et al. Deep visual domain adaptation: A survey
CN113936339B (en) Fighting identification method and device based on double-channel cross attention mechanism
CN110163258B (en) Zero sample learning method and system based on semantic attribute attention redistribution mechanism
CN104866810A (en) Face recognition method of deep convolutional neural network
CN112949647B (en) Three-dimensional scene description method and device, electronic equipment and storage medium
CN109800768B (en) Hash feature representation learning method of semi-supervised GAN
CN110188827A (en) A kind of scene recognition method based on convolutional neural networks and recurrence autocoder model
Jha et al. Extracting low‐dimensional psychological representations from convolutional neural networks
CN112926675B (en) Depth incomplete multi-view multi-label classification method under double visual angle and label missing
CN113159067A (en) Fine-grained image identification method and device based on multi-grained local feature soft association aggregation
CN112115967A (en) Image increment learning method based on data protection
CN117746260B (en) Remote sensing data intelligent analysis method and system
Feng et al. Deep image set hashing
CN117494051A (en) Classification processing method, model training method and related device
Wu et al. Semisupervised feature learning by deep entropy-sparsity subspace clustering
CN113240033A (en) Visual relation detection method and device based on scene graph high-order semantic structure
CN117315070A (en) Image generation method, apparatus, electronic device, storage medium, and program product
JP6886120B2 (en) Signal search device, method, and program
US20230186600A1 (en) Method of clustering using encoder-decoder model based on attention mechanism and storage medium for image recognition
CN115392474B (en) Local perception graph representation learning method based on iterative optimization
Liu et al. Multi-digit recognition with convolutional neural network and long short-term memory
CN115640418A (en) Cross-domain multi-view target website retrieval method and device based on residual semantic consistency
CN115527064A (en) Toxic mushroom fine-grained image classification method based on multi-stage ViT and contrast learning
Sassi et al. Neural approach for context scene image classification based on geometric, texture and color information

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination