CN112765316A - Method and device for text-to-image generation introducing a capsule network - Google Patents

Method and device for text-to-image generation introducing a capsule network

Info

Publication number
CN112765316A
CN112765316A (application CN202110069525.0A)
Authority
CN
China
Prior art keywords
text
image
class information
capsule
hidden
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110069525.0A
Other languages
Chinese (zh)
Inventor
周德宇 (Zhou Deyu)
孙凯 (Sun Kai)
胡名起 (Hu Mingqi)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southeast University
Original Assignee
Southeast University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southeast University filed Critical Southeast University
Priority to CN202110069525.0A priority Critical patent/CN112765316A/en
Publication of CN112765316A publication Critical patent/CN112765316A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3344Query execution using natural language analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/5866Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using information manually generated, e.g. tags, keywords, comments, manually generated location and time information
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/049Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00Geometric image transformations in the plane of the image
    • G06T3/40Scaling of whole images or parts thereof, e.g. expanding or contracting
    • G06T3/4053Scaling of whole images or parts thereof, e.g. expanding or contracting based on super-resolution, i.e. the output image resolution being higher than the sensor resolution

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Databases & Information Systems (AREA)
  • Library & Information Science (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses a method and a device for text-to-image generation introducing a capsule network, comprising a training stage and a testing stage. In the training stage, an image generation model introducing a capsule network is trained on text descriptions, class labels and real images; the model comprises a multi-stage image generator and an image discriminator that scores the generated images. In the testing stage, a text and its class label are input and the image generator produces the corresponding image. The capsule network is introduced into the text-to-image generation process so that the entity information in the natural language text and in the corresponding class labels is learned jointly, strengthening the correlation between the generated image and the text. The beneficial effects of the invention are: the capsule network enhances the learning of entity information; the text information and the class information are fused in a lower-dimensional hidden space, reducing the number of training parameters and strengthening their interaction; and the multi-stage generation process reduces the difficulty of directly generating a high-resolution image during training.

Description

Method and device for text-to-image generation introducing a capsule network
Technical Field
The invention relates to the technical field of deep learning generative models, in particular to a method and a device for text-to-image generation introducing a capsule network.
Background
Text-to-image generation is an important problem with wide applications, such as computer-aided design and illustration generation.
Research on text-to-image generation is mainly based on two generative models: the Conditional Variational Auto-Encoder (CVAE) and the Conditional Generative Adversarial Network (CGAN). Pictures generated by CVAE-based methods often suffer from blurring, so the current mainstream methods are based on CGAN models.
Because of the instability of GAN training, it is very difficult to directly generate high-resolution images from text descriptions. The stacked generative adversarial network (StackGAN) therefore proposed the strategy of first generating a low-resolution image from the text and then progressively generating higher-resolution images from it, a strategy widely adopted in later work.
A plain sentence-level text embedding carries only sentence-level information and lacks the fine-grained detail needed to correspond to image regions. The attentional generative adversarial network (AttnGAN) maps specific words in the text to sub-regions of the generated picture through an attention mechanism, improving generation quality.
In conventional CGAN architectures, the initially input condition vector is typically mapped to an elongated three-dimensional initial image feature representation through a fully connected layer. However, the semantic space generally has a low dimension while the image feature space has a high dimension, and performing this dimension conversion directly with a fully connected layer may lose information.
The capsule network (Capsule Network) is built from capsules, a new type of neuron whose input and output are vectors: the activation vector represents the instantiation parameters of a specific type of entity, and its norm can represent the probability that the entity is present.
Conventional text-to-image networks use only the text information and ignore the class information attached to the text itself. Yet class information also helps generation: objects of the same class often share certain similarities, so introducing the class information of the text can compensate for the one-sidedness of a single text description while tightening the correlation between the generated image and the text.
For text-to-image generation, the Inception Score (IS) is a widely used evaluation index. The IS evaluates the quality of the generated images by relating the generated image distribution to the real image distribution; the higher the IS, the clearer and easier to identify the entities contained in the generated images.
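For reference, the IS is commonly computed from the label distribution p(y|x) that a pre-trained classifier assigns to each generated image x, as in the following standard formula (stated here for clarity; the patent text itself only describes the score qualitatively):

$$\mathrm{IS} = \exp\Big(\mathbb{E}_{x \sim p_g}\, D_{\mathrm{KL}}\big(p(y \mid x)\,\|\,p(y)\big)\Big)$$

where p_g is the distribution of generated images and p(y) is the marginal label distribution over them; sharper and more recognizable entities raise the KL term and hence the score.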
Disclosure of Invention
The technical problem to be solved by the present invention is to provide a method and a device for text-to-image generation introducing a capsule network, which can incorporate the class information of the text during image generation, use that class information to constrain images generated from texts of the same class to be correlated, and thereby address the possible one-sidedness of a single text description.
In order to solve the above technical problem, the invention adopts the following technical scheme: a text-to-image generation method and device introducing a capsule network, comprising the following steps:
step 1, encoding a natural language text for describing an image to obtain text semantic embedded representation;
step 2, encoding the class label of the text to obtain class information semantic embedded representation;
step 3, mixing the text semantic embedded representation obtained in step 1 with random noise, reading the mixture with a recurrent neural network, and outputting an overall hidden code of the text and a hidden code for each word in the text;
step 4, mixing the class information semantic embedded representation obtained in step 2 with noise, and obtaining an overall hidden code of the class information through variational inference;
step 5, fusing the text hidden codes and the class information hidden codes obtained in the step 3 and the step 4 to obtain fused hidden codes containing the text information and the class information;
step 6, transcoding the fused hidden codes obtained in the step 5 by utilizing a capsule network to obtain image characteristics;
step 7, decoding the image characteristics obtained in the step 6 and outputting an image with a target size;
step 8, carrying out adversarial training between the generated image and the corresponding real image;
step 9, fusing the image features obtained in step 6 with the hidden code of each word in the text obtained in step 3 using an attention model, taking the fused image features as the input of the next stage, and repeating steps 6-8 to progressively generate higher-resolution images;
Step 10, inputting the natural language text and the labels thereof in the testing stage, and generating corresponding images in stages according to the steps 1-7.
Further, the capsule network in step 6 is composed of capsule neurons whose inputs and outputs are vectors, and the norm of the activation vector can represent the probability that an entity is present. Exploiting the capsule network's ability to represent entities, the joint coding of the text and class information is transcoded in the generator, an image decoder then produces an image of the corresponding dimension, and the class information is evaluated in the discriminator, strengthening the discriminator's recognition of entity information in the generated image.
Further, the method for encoding the natural language text describing the image in step 1 is as follows: the natural language text is segmented into a word sequence of length d, p = (w_1, w_2, ..., w_d), wherein each word w_i, i = 1, ..., d, is represented by a pre-trained word vector, and the text is encoded with the obtained word vectors;
further, if each text-image data only belongs to one class in step 2, the class information is encoded in a one-bit efficient coding (one-hot) manner; if the text-image data belongs to multiple classes, the class information is encoded using a multi-bit efficient coding (multi-hot) scheme.
Further, the recurrent neural network in step 3 is a bidirectional long short-term memory network, which reads the text semantic embedding and the hidden state of the previous step and outputs a hidden code at each step; the hidden code of each step serves as the word-level feature s_i of the corresponding word, and the hidden code s of the last step serves as the sentence-level feature, i.e., the overall hidden code of the text.
Further, the text semantic embedded representation and the noise in step 3 are mixed by direct concatenation, the noise being Gaussian noise z ~ N(0, I); the mixture of the text semantic embedding s and z is z_s = [s; z]. The mixing of the class information semantic embedding with noise in step 4 is variational inference: given noise ε ~ N(0, I) and class information c, a variational encoder estimates the hidden attribute distribution q(z_c | c) of the class information, and the class information embedded representation z_c is sampled from this distribution.
Further, in step 5 the text hidden code and the class information hidden code are fused by direct concatenation to obtain the fused hidden code z = (z_s, z_c).
Further, in the step 6, the capsule network is used for transcoding the fused hidden codes containing the class information into the initial image representation, and then the upsampling network is used for obtaining the image features of the corresponding dimension.
Further, the adversarial training method in step 8 is as follows: implicit image representations of the generated image and of the real image are obtained through a convolutional neural network; the corresponding text and class information are input at the same time, and scores are output for the realism of the image, the matching degree between image and text, and the matching degree between image and class information.
Furthermore, the class information is judged by a classification capsule layer constructed from the capsule network, and the norm of the output vector is normalized for judging the class information, which can improve the classification of entities in the generated image.
Further, in step 9 a staged image generation method progressively produces higher-resolution pictures. Taking two-stage generation as an example: in the first stage, a low-resolution picture is generated from the fused hidden code of the text and class information; in the second stage, an attention mechanism computes attention scores between each sub-region of the image feature h obtained in the first stage and each word of the word-level text coding obtained in step 3, the results are fused into a high-dimensional text-image mixed feature that is input to the high-resolution image generator as the fused hidden code, and steps 7-8 are repeated. This multi-stage scheme generates the high-resolution image while reducing the difficulty of generation.
Further, in the staged image generation network, the resolution of the images generated in the first stage is 64×64, that of the second stage is 128×128, and that of the third stage is 256×256; the model may continue to be stacked in this way.
Further, the input of the testing stage in the step 10 is text and its class labels, and the high-resolution image is generated in stages through the generator model obtained in the training stage.
A text-to-image generation device introducing a capsule network, characterized in that the device comprises:
the text encoder is used for encoding the text describing the image to obtain text semantic embedded representation;
the class information encoder is used for encoding class information of a text describing an image to obtain semantic embedded representation of the class information;
the generator, comprising a recurrent neural network transcoder, a variational inference transcoder, a capsule network transcoder and an image decoder, wherein the recurrent neural network transcoder reads the semantic embedding of the text and the transcoder's previous hidden state and outputs the corresponding text hidden code; the variational inference transcoder reads the semantic embedding of the input class information and outputs the corresponding class information hidden code; the capsule network transcoder fuses the text information and class information hidden codes and transcodes them into image features; and the image decoder decodes the image features to generate an image;
the discriminator, comprising an image semantic discriminator, a text semantic discriminator and a class information discriminator introducing a capsule network, wherein the image semantic discriminator judges the correlation between the generated image and the real image; the text semantic discriminator judges the correlation between the generated image and the corresponding text; and the class information discriminator replaces the fully connected layer of the original discriminator with a capsule network, better judging the correlation between the generated image and the class information.
The invention constructs a CGAN-based text-to-image model introducing class information. The generation process adopts a recurrent neural network, a capsule network, variational inference and an attention mechanism: information is extracted from the text and from the class label respectively to produce a text semantic embedding and a class information semantic embedding, and the two hidden representations are fused in the semantic space to obtain the fused hidden code used as input. In the generator, the fused hidden code is converted into image features by the capsule network and an image decoder then generates the image; the generated image and the real image undergo adversarial training in the discriminator.
Compared with the prior art, the invention has the following beneficial effects:
the method introduces additional class information in the process of generating the image by the text, constrains the similar image in training, enhances the correlation between the generated image and the text, obtains the hidden code of the class information by a variation inference method, fully excavates the entity information behind the class label, fuses the class information and the text information in a semantic space with lower dimensionality, enhances the interaction between the class information and the text information, reduces the parameter amount of the training, transcodes the fused hidden code through a capsule network, better learns the entity information in the text and the class label, scores the class information by using the capsule network in a discriminator, enhances the identification of the entity information in the generated image, gradually generates the high-resolution image by a staged generation method, and reduces the difficulty of generating the high-resolution image.
Drawings
Fig. 1 is a flow chart of a generator method of the first stage of the present invention (generating a low resolution image).
FIG. 2 is a flow chart of a method of the discriminator of the present invention.
Fig. 3 is a flow chart of a generator method (generating a high resolution image) at a subsequent stage in the invention.
Detailed Description
The invention will be further elucidated with reference to the drawings and specific embodiments, it being understood that these examples are intended only to illustrate the invention, not to limit its scope; various equivalent modifications of the invention falling within the scope of the appended claims will occur to persons skilled in the art upon reading this disclosure. A text-to-image generation method introducing class information, based on conditional generative adversarial networks, comprises the following steps, as shown in fig. 1:
Step 1, constructing a text encoder that takes a natural language text as input and outputs an embedded representation of the text. The natural language text here is English; after stop words are removed, a word sequence of length d is obtained, and each word is represented by a pre-trained word vector.
For example, for the input natural language text "Colorful dishes holding meat, vegetables, fruit, and bread", removing stop words yields the word sequence [colorful, dishes, holding, meat, vegetables, fruit, bread], so d = 7. We set the maximum value of d to 18; shorter sequences are padded and longer ones are truncated.
The goal of the text encoder is to extract the high-level semantic features of the natural language text; this role is served by a pre-trained bidirectional long short-term memory network (Bi-LSTM). For the input text sequence, the hidden state h_i output at each word serves as that word's word-level feature s_i, and the temporal average of the hidden states over all steps is output as the semantic embedding of the text, i.e. s = (1/d) Σ_i h_i. This is only a preferred way of encoding the text; other reasonable encoding methods can also be used.
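As an illustration, a minimal PyTorch sketch of such an encoder is given below; the vocabulary size and dimensions are assumptions rather than values mandated by the patent, and a real system would copy pre-trained word vectors into the embedding table.

```python
# A minimal PyTorch sketch of the Bi-LSTM text encoder described above.
# Vocabulary size and dimensions are illustrative assumptions.
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    def __init__(self, vocab_size=10000, emb_dim=300, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # one direction gets hidden_dim // 2 units so each s_i has hidden_dim total
        self.bilstm = nn.LSTM(emb_dim, hidden_dim // 2,
                              batch_first=True, bidirectional=True)

    def forward(self, tokens):            # tokens: (batch, d) word indices
        e = self.embed(tokens)            # (batch, d, emb_dim)
        word_feats, _ = self.bilstm(e)    # word-level features s_i
        sent_feat = word_feats.mean(1)    # temporal average -> sentence feature s
        return word_feats, sent_feat
```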
Step 2, constructing a class information encoder that takes the class information corresponding to the text as input and outputs an embedded representation of the class information. The class information of a text has two cases, single-class and multi-class. If each text has only one class label, the class information is encoded in one-hot form; for example, a sentence describing a bird carries a single class label indicating the bird's species, and with 20 different classes of birds in the data set the class information is encoded as a 20-dimensional one-hot vector [1,0,…,0,0]. If the text itself has multiple class attributes, the class information is encoded in multi-hot form; for example, the sentence "Colorful dishes holding meat, vegetables, fruit, and bread" carries the class labels Dishes, Meat, Vegetables, Fruit and Bread and is encoded as [1,1,1,1,1,0,…,0,0], indicating that the text carries the first to fifth class labels.
After the class information is encoded into a class vector c, it is converted into the class information embedding by variational inference. The encoder takes the class vector c and the noise data ε ~ N(0, I) as conditions and performs posterior inference of the hidden variable z_c. We assume the posterior distribution q(z_c | c) of the hidden variable follows a multivariate diagonal Gaussian whose mean and variance are learned by the encoder; a three-layer linear neural network is used here for the inference, although more complex encoding schemes based on the class distribution of the data could also be adopted.
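A hedged sketch of such a variational class encoder follows: a small MLP predicts the mean and log-variance of q(z_c | c) and z_c is drawn with the reparameterization trick. The layer sizes are illustrative assumptions.

```python
# A hedged sketch of the variational class-information encoder: a three-layer
# MLP (as in the embodiment) predicts the mean and log-variance of q(z_c | c),
# and z_c is sampled with the reparameterization trick. Sizes are assumptions.
import torch
import torch.nn as nn

class ClassEncoder(nn.Module):
    def __init__(self, num_classes=20, hidden=128, latent=128):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(num_classes, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.mu = nn.Linear(hidden, latent)       # mean of q(z_c | c)
        self.logvar = nn.Linear(hidden, latent)   # log-variance of q(z_c | c)

    def forward(self, c):                 # c: (batch, num_classes) one/multi-hot
        h = self.trunk(c)
        mu, logvar = self.mu(h), self.logvar(h)
        eps = torch.randn_like(mu)        # epsilon ~ N(0, I)
        z_c = mu + eps * (0.5 * logvar).exp()   # reparameterized sample
        return z_c, mu, logvar
```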
Step 3, constructing the text information and class information fusion and transcoding module. The semantic embedding of the whole text obtained in step 1 serves as the text information code z_s, and the sample drawn from the class information distribution in step 2 serves as the class information code z_c; the two codes are fused in the hidden space by direct concatenation, giving the fused hidden code z = (z_s, z_c).
Step 4, constructing the conditional generative adversarial network. The generator consists of a capsule network and a convolutional neural network; the discriminator consists of an image encoder introducing a capsule network, an image discriminator, a text semantic discriminator and a class information discriminator. The generator replaces the usual fully connected layer of a CGAN model with a capsule network: the capsule is a new type of neuron whose activation vector has a norm that can represent the probability of an entity appearing in the picture, and class information is often related to entity information. The inputs and outputs of the capsule network are vectors; a capsule layer with 1024 capsule units is used, taking 16 input vectors of length 8 and outputting 1024 vectors of length 16, which are reshaped into a 1024×4×4 three-dimensional image feature h fed into the convolutional neural network. In the convolutional neural network, convolutional layers that keep the spatial scale unchanged decode the image, converting the fused image features into the final generated image. In the discriminator, the image encoder introducing a capsule network encodes the generated image and the real image, and the codes are input to three subsequent discriminator branches: the image discriminator scores the realism of the generated image, the text semantic discriminator evaluates the correlation between the generated image and the original text, and the class information discriminator builds a classification capsule layer from the capsule network and scores the class information matching degree of the generated image after normalizing the output vectors.
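The following sketch shows one plausible form of such a capsule transcoding layer, with squashing and dynamic routing by agreement in the style of Sabour et al.; the dimensions mirror this embodiment, but the routing details are an assumption, not the patent's prescribed algorithm.

```python
# One plausible form of the capsule transcoding layer: 16 input capsules of
# length 8 are routed to 1024 output capsules of length 16, whose values are
# reshaped into a 1024 x 4 x 4 feature map. Routing follows Sabour et al.'s
# dynamic routing by agreement (an assumption, not the patent's algorithm).
import torch
import torch.nn as nn

def squash(s, dim=-1, eps=1e-8):
    n2 = (s ** 2).sum(dim=dim, keepdim=True)
    return (n2 / (1.0 + n2)) * s / (n2.sqrt() + eps)  # keeps norm in [0, 1)

class CapsuleTranscoder(nn.Module):
    def __init__(self, in_caps=16, in_dim=8, out_caps=1024, out_dim=16, iters=3):
        super().__init__()
        self.W = nn.Parameter(0.01 * torch.randn(in_caps, out_caps, in_dim, out_dim))
        self.iters = iters

    def forward(self, u):                          # u: (batch, in_caps, in_dim)
        u_hat = torch.einsum('bip,iopq->bioq', u, self.W)  # predictions u_hat[j|i]
        b = torch.zeros(u.size(0), u.size(1), self.W.size(1), device=u.device)
        for _ in range(self.iters):                # routing by agreement
            c = b.softmax(dim=2)                   # coupling coefficients
            v = squash((c.unsqueeze(-1) * u_hat).sum(dim=1))  # (batch, out_caps, out_dim)
            b = b + (u_hat * v.unsqueeze(1)).sum(dim=-1)      # agreement update
        return v.reshape(u.size(0), 1024, 4, 4)    # 1024*16 values as a 4x4 map
```

With a 128-dimensional fused hidden code z, the call would be roughly caps(z.view(-1, 16, 8)), since 16 × 8 = 128.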
Step 5, respectively inputting the natural language text describing the image and the corresponding class information into a text encoder and a class information encoder to obtain text semantic embedded representation and class information embedded representation;
Step 6, feeding the text semantic embedded representation and the class information embedded representation into the text information and class information fusion module to obtain the fused hidden code containing both kinds of information;
Step 7, inputting the fused hidden code into the image generator to generate a lower-resolution picture, with the resolution set to 64×64, and inputting the corresponding real pictures, natural language texts and class information into the discriminator for adversarial training. The loss functions of the discriminator and the generator during adversarial training are respectively:
L_D = -(E_{x~P}[log D(x)_r] + E_{x~Q}[log(1 - D(x)_r)])
    - (E_{x~P}[log D(x)_c] + E_{x~Q}[log(1 - D(x)_c)])
    - (E_{x~P}[log D(x, s)] + E_{x~Q}[log(1 - D(x, s))])

L_G = -(E_{x~Q}[log D(x)_r] + E_{x~Q}[log D(x)_c] + E_{x~Q}[log D(x, s)]) + KL(q(z_s) || p(z_s)) + KL(q(z_c) || p(z_c))

In the formulas, P is the real data distribution and Q the generated data distribution; D(x)_r denotes the probability that the image x is real, D(x)_c the probability that the generated image belongs to the correct class label, and D(x, s) the probability that the generated image matches the descriptive text s.
Two KL divergence terms are added to the loss function of the generator as regularization losses constraining the two hidden variables z_c and z_s. During training, the discriminator D is first optimized by the loss L_D with the generator fixed, and the generator G is then optimized by the loss L_G with the discriminator fixed; the two steps alternate under mini-batch stochastic gradient descent. In each iteration of the adversarial training, the discriminator is trained once and the generator once. The network is trained with the Adam solver, where β1 = 0.5, β2 = 0.999 and the learning rate α = 0.0002.
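A compact sketch of this alternating schedule is shown below; G, D, loader, d_loss and g_loss are placeholders for the models, the data pipeline and the L_D / adversarial L_G terms above, none of which the patent names explicitly, and the generator is assumed to return the image together with the statistics used for the KL regularizer.

```python
# A compact sketch of the alternating schedule (one D step, then one G step,
# per iteration; Adam with beta1 = 0.5, beta2 = 0.999, lr = 0.0002). G, D,
# loader, d_loss and g_loss are assumed placeholders, not names from the patent.
import torch

def train(G, D, loader, d_loss, g_loss):
    opt_d = torch.optim.Adam(D.parameters(), lr=2e-4, betas=(0.5, 0.999))
    opt_g = torch.optim.Adam(G.parameters(), lr=2e-4, betas=(0.5, 0.999))
    for real_img, text_emb, class_vec in loader:
        # generator assumed to return the image plus the statistics of
        # q(z_c | c) needed for the KL regularizer
        fake_img, mu_c, logvar_c = G(text_emb, class_vec)

        # discriminator step (generator fixed): minimize L_D
        ld = d_loss(D, real_img, fake_img.detach(), text_emb, class_vec)
        opt_d.zero_grad(); ld.backward(); opt_d.step()

        # generator step (discriminator fixed): adversarial terms plus KL
        kl = -0.5 * (1 + logvar_c - mu_c.pow(2) - logvar_c.exp()).sum(1).mean()
        lg = g_loss(D, fake_img, text_emb, class_vec) + kl
        opt_g.zero_grad(); lg.backward(); opt_g.step()
```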
Step 8, splicing the image features h generated in the first stage with the word-level semantic features s_i obtained in step 3 through an attention network. The attention mechanism obtains image-text features by computing the correlation between each word and each sub-region of the image features; the text-image feature of the j-th sub-region is computed as

sen-img_j = Σ_i β_{j,i} s'_i

where s'_i is the feature representation obtained from s_i through a new activation function (a fine-grained fusion of the image features and the word-level text features), and β_{j,i}, given by the attention model, is the relevance of the j-th sub-region to the i-th word. We upsample the image-text features sen-img to 128×128 and repeat the adversarial training process of step 7 at this higher dimension.
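A sketch of this word-to-sub-region attention is given below; the projection proj that maps word features into the image channel space is an assumed component, and all dimensions are illustrative.

```python
# A sketch of the word-to-sub-region attention: beta[j, i] is the softmax-
# normalized affinity between sub-region j and word i, and each sub-region
# receives a weighted sum of projected word features. proj and all dimensions
# are illustrative assumptions.
import torch
import torch.nn as nn

def word_region_attention(h, s, proj):
    # h: (batch, C, H, W) stage-one image features; s: (batch, d, Ds) word features
    b, C, H, W = h.shape
    q = h.reshape(b, C, H * W).transpose(1, 2)          # sub-regions: (batch, HW, C)
    s_p = proj(s)                                       # s'_i: (batch, d, C)
    beta = torch.softmax(q @ s_p.transpose(1, 2), -1)   # (batch, HW, d), over words
    sen_img = beta @ s_p                                # text-image feature per region
    return sen_img.transpose(1, 2).reshape(b, C, H, W)

# e.g. proj = nn.Linear(256, 128) for 256-d word features and 128 image channels
```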
When training the network, normalization techniques such as Batch Normalization and Spectral Normalization can be added to the generator and the discriminator to stabilize training and further improve generation quality.
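For instance, a discriminator downsampling block might combine these normalizations as follows (an illustrative pattern, not the patent's specified architecture):

```python
# e.g. a discriminator downsampling block combining the two normalizations
# (an illustrative pattern, not the patent's specified architecture)
import torch.nn as nn

block = nn.Sequential(
    nn.utils.spectral_norm(nn.Conv2d(64, 128, kernel_size=4, stride=2, padding=1)),
    nn.BatchNorm2d(128),
    nn.LeakyReLU(0.2, inplace=True),
)
```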
In the experiments, our model builds on a reference model based on AttnGAN; the dimensions of the hidden variable and the noise variable are set to 128. In the adversarial training the discriminator is trained once and the generator once per iteration; the network is trained with the Adam solver, where β1 = 0.5, β2 = 0.999 and the learning rate α = 0.0002.
On the CUB data set the IS improves from 4.36 ± 0.02 to 4.67 ± 0.05; both the image generation quality and the clarity of the entities in the generated images surpass the reference model. In the ablation experiment we compared introducing the capsule network in the generator and in the discriminator separately: the IS is 4.53 ± 0.05 with the capsule network only in the generator and 4.46 ± 0.02 with it only in the discriminator, demonstrating the effectiveness of the capsule network.
In summary, compared with previous methods, the text-to-image method introducing a capsule network disclosed by the invention adds a module that encodes the class information and fuses it with the text information: the class label of the text is introduced and the class of the generated image is constrained in the discriminator, improving the correlation between the generated image and the text. Text information and class information are fused in a low-dimensional hidden space, reducing the number of training parameters. The generator transcodes the hidden code containing class information with a capsule network, learning the entity information in the input better; the discriminator scores the class information of the generated image with a capsule network, strengthening the recognition of entity information in the generated image. The overall architecture adopts a multi-stage generation scheme with an attention mechanism, which reduces the difficulty of generating high-resolution images and strengthens the supervision of the text over the image generation process.
The above examples are only preferred embodiments of the present invention, and the embodiments of the invention are not limited to them; any modification, alteration, combination or simplification made without departing from the spirit of the invention is an equivalent substitution and falls within the scope of protection of the invention.

Claims (14)

1. A text-to-image generation method and device introducing a capsule network, characterized in that the method comprises the following steps:
step 1, encoding a natural language text for describing an image to obtain text semantic embedded representation;
step 2, encoding the class label of the text to obtain class information semantic embedded representation;
step 3, mixing the text semantic embedded representation obtained in step 1 with random noise, reading the mixture with a recurrent neural network, and outputting an overall hidden code of the text and a hidden code for each word in the text;
step 4, mixing the class information semantic embedded representation obtained in step 2 with noise, and obtaining an overall hidden code of the class information through variational inference;
step 5, fusing the text hidden codes and the class information hidden codes obtained in the step 3 and the step 4 to obtain fused hidden codes containing the text information and the class information;
step 6, transcoding the fused hidden codes obtained in the step 5 by utilizing a capsule network to obtain image characteristics;
step 7, decoding the image characteristics obtained in the step 6 and outputting an image with a target size;
step 8, carrying out adversarial training between the generated image and the corresponding real image;
step 9, fusing the image features obtained in step 6 with the hidden code of each word in the text obtained in step 3 using an attention model, taking the fused image features as the input of the next stage, and repeating steps 6-8 to progressively generate higher-resolution images;
Step 10, inputting the natural language text and the labels thereof in the testing stage, and generating corresponding images in stages according to the steps 1-7.
2. The text-to-image generation method and device introducing a capsule network as claimed in claim 1, wherein: the capsule network in step 6 is composed of capsule neurons whose inputs and outputs are vectors, and the norm of the activation vector can represent the probability that an entity is present; exploiting the capsule network's ability to represent entities, the joint coding of the text and class information is transcoded in the generator, an image decoder then produces an image of the corresponding dimension, and the class information is evaluated in the discriminator, strengthening the discriminator's recognition of entity information in the generated image.
3. The text-to-image generation method and device introducing a capsule network as claimed in claim 1, wherein: the method for encoding the natural language text describing the image in step 1 is as follows: the natural language text is segmented into a word sequence of length d, p = (w_1, w_2, ..., w_d), wherein each word w_i, i = 1, ..., d, is represented by a pre-trained word vector, and the text is encoded with the obtained word vectors.
4. The text-to-image generation method and device introducing a capsule network as claimed in claim 1, wherein: if in step 2 each text-image data item belongs to only one class, the class information is encoded in a one-hot manner; if the text-image data belongs to multiple classes, the class information is encoded in a multi-hot manner.
5. The text-to-image generation method and device introducing a capsule network as claimed in claim 1, wherein: the recurrent neural network in step 3 is a bidirectional long short-term memory network, which reads the text semantic embedding and the hidden state of the previous step and outputs a hidden code at each step; the hidden code of each step serves as the word-level feature s_i of the corresponding word, and the hidden code s of the last step serves as the sentence-level feature, i.e., the overall hidden code of the text.
6. The method and apparatus for generating images of text introduced into capsule networks as claimed in claim 1, wherein: in the step 3, the text semantic embedded expression and the noise are mixed in a direct connection mode, and the adopted noise is Gaussian noise
Figure FDA0002905266880000022
The mixed result of text semantic embedding s and z is
Figure FDA0002905266880000021
Semantic embedding of class information and noise in the step 4The mixing being a variational inference, i.e. the variational coder giving noise
Figure FDA0002905266880000023
And class information c, estimating the hidden attribute distribution of the class information
Figure FDA0002905266880000024
Semantically embedding a class information representation z sampled from the distributionc
7. The text-to-image generation method and device introducing a capsule network as claimed in claim 1, wherein: in step 5 the text hidden code and the class information hidden code are fused by direct concatenation to obtain the fused hidden code z = (z_s, z_c).
8. The text-to-image generation method and device introducing a capsule network as claimed in claim 1, wherein: in step 6 the capsule network transcodes the fused hidden code containing the class information into an initial image representation, and an upsampling network then produces image features of the corresponding dimension.
9. The text-to-image generation method and device introducing a capsule network as claimed in claim 1, wherein: the adversarial training method in step 8 is as follows: implicit image representations of the generated image and of the real image are obtained through a convolutional neural network; the corresponding text and class information are input at the same time, and scores are output for the realism of the image, the matching degree between image and text, and the matching degree between image and class information.
10. The text-to-image generation method and device introducing a capsule network as claimed in claim 9, wherein: the class information is judged by a classification capsule layer constructed from the capsule network, and the norm of the output vector is normalized for judging the class information, which can improve the classification of entities in the generated image.
11. The text-to-image generation method and device introducing a capsule network as claimed in claim 1, wherein: in step 9 a staged image generation method progressively produces higher-resolution pictures; taking two-stage generation as an example, in the first stage a low-resolution picture is generated from the fused hidden code of the text and class information; in the second stage an attention mechanism computes attention scores between each sub-region of the image feature h obtained in the first stage and each word of the word-level text coding obtained in step 3, the results are fused into a high-dimensional text-image mixed feature that is input to the high-resolution image generator as the fused hidden code, and steps 7-8 are repeated; this multi-stage scheme generates the high-resolution image while reducing the difficulty of generation.
12. The text-to-image generation method and device introducing a capsule network as claimed in claim 11, wherein: in the staged image generation network, the resolution of the images generated in the first stage is 64×64, that of the second stage is 128×128, and that of the third stage is 256×256; the model may continue to be stacked in this way.
13. The text-to-image generation method and device introducing a capsule network as claimed in claim 1, wherein: in step 10 the input of the testing stage is a text and its class labels, and the high-resolution image is generated in stages by the generator model obtained in the training stage.
14. A text-to-image generation device introducing a capsule network, characterized in that the device comprises:
the text encoder is used for encoding the text describing the image to obtain text semantic embedded representation;
the class information encoder is used for encoding class information of a text describing an image to obtain semantic embedded representation of the class information;
the generator, comprising a recurrent neural network transcoder, a variational inference transcoder, a capsule network transcoder and an image decoder, wherein the recurrent neural network transcoder reads the semantic embedding of the text and the transcoder's previous hidden state and outputs the corresponding text hidden code; the variational inference transcoder reads the semantic embedding of the input class information and outputs the corresponding class information hidden code; the capsule network transcoder fuses the text information and class information hidden codes and transcodes them into image features; and the image decoder decodes the image features to generate an image;
the discriminator, comprising an image semantic discriminator, a text semantic discriminator and a class information discriminator introducing a capsule network, wherein the image semantic discriminator judges the correlation between the generated image and the real image; the text semantic discriminator judges the correlation between the generated image and the text; and the class information discriminator replaces the fully connected layer of the original discriminator with a capsule network, better judging the correlation between the generated image and the class information.
CN202110069525.0A 2021-01-19 2021-01-19 Method and device for text-to-image generation introducing a capsule network Pending CN112765316A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110069525.0A CN112765316A (en) 2021-01-19 2021-01-19 Method and device for text-to-image generation introducing a capsule network


Publications (1)

Publication Number Publication Date
CN112765316A true CN112765316A (en) 2021-05-07

Family

ID=75703187

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110069525.0A Pending CN112765316A (en) Method and device for text-to-image generation introducing a capsule network

Country Status (1)

Country Link
CN (1) CN112765316A (en)



Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108830334A (en) * 2018-06-25 2018-11-16 江西师范大学 A kind of fine granularity target-recognition method based on confrontation type transfer learning
CN109299216A (en) * 2018-10-29 2019-02-01 山东师范大学 A kind of cross-module state Hash search method and system merging supervision message
CN109543159A (en) * 2018-11-12 2019-03-29 南京德磐信息科技有限公司 A kind of text generation image method and device
CN109671125A (en) * 2018-12-17 2019-04-23 电子科技大学 A kind of GAN network model that height merges and the method for realizing text generation image
CN110751698A (en) * 2019-09-27 2020-02-04 太原理工大学 Text-to-image generation method based on hybrid network model

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113240115A (en) * 2021-06-08 2021-08-10 深圳数联天下智能科技有限公司 Training method for generating face change image model and related device
CN113240115B (en) * 2021-06-08 2023-06-06 深圳数联天下智能科技有限公司 Training method for generating face change image model and related device
CN113537487A (en) * 2021-06-25 2021-10-22 北京百度网讯科技有限公司 Model training method, picture generating method and device
CN113537487B (en) * 2021-06-25 2023-08-04 北京百度网讯科技有限公司 Model training method, picture generating method and device
CN113434918A (en) * 2021-06-28 2021-09-24 北京理工大学 Text-based three-dimensional voxel model generation method
CN113434918B (en) * 2021-06-28 2022-12-02 北京理工大学 Text-based three-dimensional voxel model generation method
WO2023030348A1 (en) * 2021-08-31 2023-03-09 北京字跳网络技术有限公司 Image generation method and apparatus, and device and storage medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination