CN112765317A - Method and device for generating image by introducing text of class information - Google Patents
Method and device for generating image by introducing text of class information
- Publication number
- CN112765317A (application number CN202110071013.8A)
- Authority
- CN
- China
- Prior art keywords
- image
- text
- class information
- class
- information
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3344—Query execution using natural language analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/50—Information retrieval; Database structures therefor; File system structures therefor of still image data
- G06F16/58—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/5866—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using information manually generated, e.g. tags, keywords, comments, manually generated location and time information
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T3/00—Geometric image transformation in the plane of the image
- G06T3/40—Scaling the whole image or part thereof
- G06T3/4053—Super resolution, i.e. output image resolution higher than sensor resolution
Abstract
The invention discloses a method and a device for generating an image from a text with introduced class information. The method comprises a training stage and a testing stage. The training stage is based on a generative adversarial network and trains a generator and a discriminator using natural language texts describing images, the class labels of the texts, and the corresponding real images; in the testing stage, a text and its class label are fed to the generator to produce the corresponding image. The advantages of the invention are as follows: from the text encoding and the class-information encoding, text-level semantic image features and class-level image features are generated by transcoding respectively; the two levels of image features are then fused and decoded to generate an image. Introducing the corresponding class information during image generation strengthens the correlation between the generated image and the text, while the multi-stage generation process during training produces images of progressively higher resolution, reducing the difficulty of directly generating high-resolution images.
Description
Technical Field
The invention relates to the technical field of deep learning generative models, in particular to a method and a device for generating an image from a text with introduced class information.
Background
Text-to-image generation is an important problem with wide applications, such as computer-aided medicine and news photo generation.
Research on text-to-image generation is mainly based on two generative models: the Conditional Variational Auto-Encoder (CVAE) and the Conditional Generative Adversarial Network (CGAN). Pictures generated by CVAE methods often suffer from blurring, so the current mainstream methods are all based on CGAN models.
Because of the instability of GAN training, it is very difficult to generate high-resolution images directly from text descriptions. The stacked generative adversarial network (StackGAN) therefore proposed the strategy of first generating a low-resolution image from the text and then progressively generating higher-resolution images from it, which has been widely adopted in later work.
The problem of generating images from text can be divided into two sub-problems: 1) how to capture a semantic representation of the text — generally, a text encoder extracts the text semantic embedding and the visually relevant information in the text description; 2) how the generator uses the semantic representation from 1) to generate a vivid image that fits the text, such that humans could mistake it for a real related picture.
Conventional text-to-image networks use only the text information and ignore the class information of the text itself. Yet class information also helps generation: objects of the same class often share certain similarities, so introducing the class information of the text can help remedy the one-sidedness of a single text description and, in addition, tighten the correlation between the generated image and the text. For text-to-image methods, the Inception Score (IS) is widely used for evaluation. The IS evaluates the quality of the generated images by measuring the correlation between the generated image distribution and the real image distribution; the higher the IS value, the clearer and more easily recognizable the entities contained in the generated images.
Disclosure of Invention
Aiming at the defects of existing image generation methods, the invention provides a method and a device for generating an image from a text with introduced class information, which introduce the class information to which the text belongs during image generation, use the class information to constrain the relevance among pictures generated from texts of the same class, and address the problem that a single text description is not comprehensive.
In order to solve the technical problems, the technical scheme adopted by the invention is as follows:
A method for generating an image from a text with introduced class information, characterized in that the method comprises the following steps:
step 1, encoding a natural language text for describing an image to obtain text semantic embedded representation;
step 2, encoding the class label of the text to obtain class information semantic embedded representation;
step 3, mixing the text semantic embedded representation obtained in step 1 with random noise, reading the mixture with a recurrent neural network, and outputting the latent code of the text;
step 4, mixing the class-information semantic embedded representation obtained in step 2 with noise, and obtaining the latent code of the class information through variational inference;
step 5, respectively decoding the text hidden codes and the class information hidden codes obtained in the step 3 and the step 4 to obtain image characteristics of a text level and image characteristics of a class level;
step 6, decoding the image characteristics of the fusion text level and the image characteristics of the class level obtained in the step 5 to generate an image;
step 7, performing countermeasure training on the generated image obtained in the step 6 and the corresponding real image;
step 8, respectively up-sampling the image characteristics obtained in the step 5 to obtain image characteristics with different dimensions, and repeating the steps 6-7 to gradually generate images with higher resolution;
and 9, inputting the text and the class labels thereof in the testing stage, repeating the steps 1-6, and generating a high-resolution image in the image generator through multiple stages.
Further, in step 1, the method for encoding the natural language text describing the image is as follows: the text is segmented to obtain a word sequence p = (w1, w2, …, wd) of length d, where each word wi, i = 1…d, is represented by a pre-trained word vector, and the text is encoded using the obtained word vectors.
further, in the step 2, if each text-image data only belongs to one class, the class information is encoded in a one-bit efficient coding (one-hot) manner; if the text-image data term is multiple classes, the class information is encoded using a multi-bit-efficient coding (multi-hot) approach.
Further, in step 3, the recurrent neural network adopts a long short-term memory network.
Further, in step 3, the text semantic embedded representation and the noise are mixed by direct concatenation; the noise is Gaussian, z ~ N(0, I), and the result of mixing the text semantic embedding s with z is z_s = (s, z).
Furthermore, the mixing of class-information semantic embedding and noise in step 4 is variational inference, that is, a variational encoder infers the latent attribute distribution q(z_c | c, z) of the class information given the noise z and the class information c, and the class-information semantic embedding z_c is sampled from this distribution.
Further, in step 5, an up-sampling operation decodes the text latent code and the class-information latent code to obtain the image features.
Further, in step 6, the image feature h_c generated from the text and the image feature h_r generated from the class information are fused by point-wise multiplication, and the fused image feature can be denoted h1 = h_c ⊙ h_r; a convolutional neural network decodes the fused image feature to generate an image.
Further, the adversarial training method in step 7 is as follows: latent image representations of the generated image and the real image are obtained through a convolutional neural network; the corresponding text and class information are input at the same time; and scores for image realism, image-text matching, and image-class-information matching are output.
Further, in step 8, a staged image generation method generates progressively higher-resolution pictures. Taking two-stage generation as an example, the first stage generates a low-resolution picture from the fused image features; the second stage up-samples the text image feature h_c and the class-information image feature h_r obtained in the first stage to obtain higher-dimensional image and text features, and then generates a higher-resolution picture.
Further, in the two-stage image generation network, the resolution of the image generated in the first stage is 64 × 64 and that of the second stage is 128 × 128; the model may be stacked further.
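The stage-to-stage resolution doubling can be illustrated with a minimal numpy sketch (nearest-neighbour up-sampling is used here purely as a stand-in for the learned up-sampling network; the channel count of 32 is invented):

```python
import numpy as np

# stage-1 fused feature map: (channels, 64, 64); values are placeholders
h1 = np.random.default_rng(0).standard_normal((32, 64, 64))

# nearest-neighbour up-sampling doubles the spatial resolution for stage 2
h2 = h1.repeat(2, axis=1).repeat(2, axis=2)   # -> (32, 128, 128)
```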
Further, the input of the test stage in step 9 is text and its class labels, and the high-resolution image is generated in stages through the generator model obtained in the training stage.
A device for generating an image from a text with introduced class information, characterized in that the device comprises:
the text encoder is used for encoding the text describing the image to obtain text semantic embedded representation;
the class information encoder is used for encoding class information of a text describing an image to obtain semantic embedded representation of the class information;
the generator comprises a recurrent neural network transcoder, a variational inference transcoder, an image feature fusion device and an image decoder; the recurrent neural network transcoder reads the text semantic embedding and the transcoder's previous hidden state and outputs the corresponding text image features; the variational inference transcoder reads the input class-information semantic embedding and outputs the corresponding class-information image features; the image feature fusion device fuses the text image features and the class-information image features generated by the two transcoders; and the image decoder decodes the input fused image features to generate an image;
the discriminator comprises an image semantic discriminator, a text semantic discriminator and a class-information discriminator; the image semantic discriminator judges the correlation between the generated image and the real image, the text semantic discriminator judges the correlation between the generated image and the corresponding text, and the class-information discriminator judges the correlation between the generated image and the class information.
Compared with the prior art, the invention has the following beneficial effects:
the invention introduces extra class information in the process of generating the image by the text, such as class labels of birds and different object labels contained in the picture, the class information obtains hidden codes by a variation inference method, entity information behind the class labels can be fully mined, image features generated by the text and image features generated by the class information are fused in an image space, and the training difficulty is reduced.
Drawings
FIG. 1 is a flow chart of a generator method of the present invention.
FIG. 2 is a flow chart of a method of the discriminator of the present invention.
Detailed Description
The invention will be further elucidated with reference to the drawings and specific embodiments, it being understood that these examples are intended only to illustrate the invention and not to limit its scope. Various equivalent modifications of the invention falling within the scope of the appended claims will occur to persons skilled in the art upon reading this disclosure. A method for generating an image from a text with introduced class information, based on a conditional generative adversarial network, comprises the following steps, as shown in FIGS. 1-2:
step 1, constructing a text encoding unit comprising a text encoder and a recurrent neural network transcoder, as described in S1 in fig. 1. A natural language text is input into a text coder and an embedded representation of the text is output. The natural language text is an English text, a word sequence with the length d is obtained after stop words are removed, and each word is represented by a pre-trained word vector.
For example, for the input natural language text "This little bird is mostly blue with black primaries and secondaries", removing stop words yields the final word sequence [little, bird, mostly, blue, black, primaries, secondaries], d = 7. We set the maximum value of d to 18; shorter sequences are padded and longer ones are truncated.
The goal of the recurrent neural network transcoder is to extract high-level semantic features of the natural language text; this role is served by a pre-trained bidirectional long short-term memory network (Bi-LSTM). Given the input text sequence, the hidden state h_i output for each word serves as that word's word-level feature, and the time average of the output hidden states over all steps serves as the text semantic embedding, namely s = (h_1 + h_2 + … + h_d)/d.
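The time-averaging step can be sketched as follows (random matrices stand in for the real Bi-LSTM hidden states; the 256-dimensional hidden size is an invented example value):

```python
import numpy as np

rng = np.random.default_rng(0)
d, hidden_dim = 7, 256                  # 7 words; assumed hidden size per word

# h_1..h_d: word-level features (random stand-ins for Bi-LSTM hidden states)
H = rng.standard_normal((d, hidden_dim))

# sentence embedding s: time average of the hidden states over all steps
s = H.mean(axis=0)
```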
This is only one preferred mode of the text encoding unit; other reasonable encoding modes may also be adopted.
Step 2, construct a class-information encoding unit comprising a class-information encoder and a variational inference encoder, as described in S2 in FIG. 1. The class-information encoder takes the class information corresponding to the text as input and outputs an embedded representation of the class information. The class information of a text has two cases, single-class and multi-class. If each text has only one class label, the class information is encoded in one-hot form: for example, "This little bird is mostly blue with black primaries and secondaries" has only one class label, representing the type of bird the text describes; with 20 different bird classes in the dataset, the class information is encoded as a 20-dimensional one-hot vector [1, 0, …, 0, 0]. If the text has multiple class attributes, it is encoded in multi-hot form; for example, the encoding [1, 0, 0, 1, 0] indicates that the text carries the class labels of the first and fourth classes.
After the class information is encoded into a class vector c, it is converted into the class-information embedding by variational inference. Conditioned on the class vector c and Gaussian noise z, the encoder performs posterior inference of the latent variable z_c. We assume the posterior distribution q(z_c | c, z) of the latent variable to be a multivariate diagonal Gaussian and use a three-layer linear neural network for inference, where the mean and variance of the latent distribution are learned by the encoder; more complex encoding schemes based on the distribution of classes in the data may also be used.
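A minimal numpy sketch of this variational encoder follows; the weights are random and untrained (the real encoder's weights are learned), and all layer sizes except the class count are invented for the example. The reparameterisation trick produces a sample z_c from the diagonal Gaussian q(z_c | c, z):

```python
import numpy as np

rng = np.random.default_rng(0)
num_classes, noise_dim, latent_dim, hid = 20, 100, 128, 128

c = np.zeros(num_classes, dtype=np.float64); c[4] = 1.0  # one-hot class vector
z = rng.standard_normal(noise_dim)                        # Gaussian noise

def relu(x):
    return np.maximum(x, 0.0)

# three-layer linear network with ReLU between layers (random weights here)
x = np.concatenate([c, z])
W1 = rng.standard_normal((x.size, hid)) * 0.05
W2 = rng.standard_normal((hid, hid)) * 0.05
W3 = rng.standard_normal((hid, 2 * latent_dim)) * 0.05
out = relu(relu(x @ W1) @ W2) @ W3

mu, logvar = out[:latent_dim], out[latent_dim:]  # mean and log-variance

# reparameterisation: sample z_c ~ q(z_c | c, z) = N(mu, diag(sigma^2))
eps = rng.standard_normal(latent_dim)
z_c = mu + np.exp(0.5 * logvar) * eps

# KL(q || N(0, I)) regulariser, later added to the generator loss
kl = 0.5 * np.sum(mu ** 2 + np.exp(logvar) - logvar - 1.0)
```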
Step 3, construct the text-information and class-information fusion module, as described in S3 in FIG. 1. The module fuses the image features generated from the text with those generated from the class information; it consists of up-sampling networks that extract image features, together with the combination of the text image features and the class-information image features. For the text information, the up-sampling network takes the joint vector z_s of the text semantic embedding s and the noise as input and up-samples it to the image feature h_s of the corresponding dimensionality; for the class information, the up-sampling network takes z_c, sampled from the posterior latent distribution predicted by the class-information encoder, and obtains the class-information image feature h_c of the same dimensionality; finally, point-wise multiplication of the text image features and the class-information image features yields the fused image feature h.
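The point-wise fusion of the two feature maps can be sketched as follows (random arrays stand in for the up-sampled features h_s and h_c; the feature-map shape is an invented example):

```python
import numpy as np

rng = np.random.default_rng(0)
h_s = rng.standard_normal((32, 4, 4))  # text image features (channels, H, W)
h_c = rng.standard_normal((32, 4, 4))  # class-information image features

# point-wise (Hadamard) fusion: h = h_s ⊙ h_c
h = h_s * h_c
```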
Step 4, construct the conditional generative adversarial network, consisting of a generator and a discriminator, as described in S4 in FIGS. 1 and 2. The generator is composed of convolutional neural networks; the discriminator is composed of an image discriminator, a text semantic discriminator and a class-information discriminator. The generator decodes with scale-invariant convolution layers, converting the fused image features into the final generated image; the image discriminator scores the realism of the generated image, the text semantic discriminator evaluates the relation between the generated image and the original text, and the class-information discriminator scores the match between the generated image and its class information.
Step 5, respectively inputting the natural language text describing the image and the corresponding class information into a text encoder and a class information encoder to obtain text semantic embedded representation and class information embedded representation;
Step 6, feed the generated text semantic embedded representation and class-information embedded representation into the text-information and class-information fusion module to obtain the image features fusing the two kinds of information;
Step 7, input the fused image features into the image generator to generate a lower-resolution picture, the set resolution being 64 × 64; input the corresponding real pictures, natural language texts and class information into the discriminator for adversarial training. The loss functions of the generator and the discriminator during adversarial training are as follows:
L_D = -(E_{x~P}[log D(x)_r] + E_{x~Q}[log(1 - D(x)_r)])
      -(E_{x~P}[log D(x)_c] + E_{x~Q}[log(1 - D(x)_c)])
      -(E_{x~P}[log D(x, s)] + E_{x~Q}[log(1 - D(x, s))])
where P is the real data distribution, Q is the generated data distribution, D(x)_r denotes the probability that the image x is real, D(x)_c denotes the probability that the generated image belongs to the correct class label, and D(x, s) denotes the probability of a match between the generated image and the descriptive text.
Two KL divergence terms are added to the generator loss as regularization constraining the two latent variables z_c and z_s. During training, the discriminator D is first optimized by the loss L_D with the generator fixed; then, with the discriminator fixed, the generator G is optimized by the loss L_G. The two steps are trained alternately by mini-batch stochastic gradient descent.
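The structure of the three-headed discriminator loss and the KL regularizer can be sketched in numpy; the discriminator scores below are invented toy values (in a real run they come from the three discriminator heads), so only the shape of the computation is meaningful:

```python
import numpy as np

def adv_term(d_real, d_fake):
    # -(E_{x~P}[log D(x)] + E_{x~Q}[log(1 - D(x))])
    return -(np.log(d_real).mean() + np.log(1.0 - d_fake).mean())

# toy discriminator outputs on a real batch (P) and a generated batch (Q)
real_r, fake_r = np.array([0.9, 0.8]), np.array([0.2, 0.1])    # realism head
real_c, fake_c = np.array([0.7, 0.9]), np.array([0.3, 0.2])    # class head
real_s, fake_s = np.array([0.8, 0.7]), np.array([0.25, 0.15])  # text head

# L_D sums the three adversarial terms, mirroring the formula above
L_D = (adv_term(real_r, fake_r)
       + adv_term(real_c, fake_c)
       + adv_term(real_s, fake_s))

def kl_to_standard_normal(mu, logvar):
    # KL(N(mu, sigma^2) || N(0, I)) regulariser on the latents z_c and z_s
    return 0.5 * np.sum(mu ** 2 + np.exp(logvar) - logvar - 1.0)
```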
Step 8, up-sample the text image features and the class-information image features generated in the first stage to 128 × 128 dimensionality respectively, and repeat the adversarial training process of step 7 at the higher dimension to generate higher-resolution images.
When the network is trained, Normalization techniques such as Batch Normalization and Spectral Normalization can be added into the generator and the discriminator to stabilize the training, and the generation quality is further improved.
In summary, compared with previous methods, the method for generating an image from a text with introduced class information disclosed by the invention adds modules for encoding the class information and for fusing the class information with the text information. The method introduces the class label of the text, constrains the class of the generated image in the discriminator, and improves the correlation between the generated image and the text by introducing class information.
In the experiments, the baseline model is based on StackGAN; the dimensions of the latent variable and the noise variable are both set to 128; in each iteration of adversarial training, the discriminator is trained once and the generator is trained once; the network is trained using the Adam solver, with β1 = 0.5, β2 = 0.999, and learning rate α = 0.0002.
The IS improves from 3.35 ± 0.02 to 3.74 ± 0.03 on the CUB dataset and from 7.34 ± 0.17 to 7.46 ± 0.30 on the COCO dataset; both the image generation quality and the clarity of entities in the generated images exceed those of the baseline model.
The above examples are only preferred embodiments of the present invention, and the embodiments of the invention are not limited thereto. Any modification, alteration, combination, simplification or equivalent substitution that does not depart from the spirit of the invention falls within the scope of protection of the invention.
Claims (13)
1. A method for generating an image from a text with introduced class information, characterized in that the method comprises the following steps:
step 1, encoding a natural language text for describing an image to obtain text semantic embedded representation;
step 2, encoding the class label of the text to obtain class information semantic embedded representation;
step 3, mixing the text semantic embedded representation obtained in step 1 with random noise, reading the mixture with a recurrent neural network, and outputting the latent code of the text;
step 4, mixing the class-information semantic embedded representation obtained in step 2 with noise, and obtaining the latent code of the class information through variational inference;
step 5, respectively decoding the text hidden codes and the class information hidden codes obtained in the step 3 and the step 4 to obtain image characteristics of a text level and image characteristics of a class level;
step 6, decoding the image characteristics of the fusion text level and the image characteristics of the class level obtained in the step 5 to generate an image;
step 7, performing countermeasure training on the generated image obtained in the step 6 and the corresponding real image;
step 8, respectively up-sampling the image characteristics obtained in the step 5 to obtain image characteristics with different dimensions, and repeating the steps 6-7 to gradually generate images with higher resolution;
and 9, inputting the text and the class labels thereof in the testing stage, repeating the steps 1-6, and generating a high-resolution image in the image generator through multiple stages.
2. The method for generating an image from a text with introduced class information according to claim 1, wherein: in step 1, the natural language text describing the image is encoded as follows: the text is segmented to obtain a word sequence p = (w1, w2, …, wd) of length d, where each word wi, i = 1…d, is represented by a pre-trained word vector, and the text is encoded using the obtained word vectors.
3. The method for generating an image from a text with introduced class information according to claim 1, wherein: in step 2, if each text-image datum belongs to only one class, the class information is encoded in one-hot form; if it belongs to multiple classes, the class information is encoded in multi-hot form.
4. The method for generating an image from a text with introduced class information according to claim 1, wherein: in step 3, the recurrent neural network adopts a bidirectional long short-term memory network.
5. The method for generating an image from a text with introduced class information according to claim 1, wherein: in step 3, the text semantic embedded representation and the noise are mixed by direct concatenation; the noise is Gaussian, z ~ N(0, I), and the result of mixing the text semantic embedding s with z is z_s = (s, z).
6. The method for generating an image from a text with introduced class information according to claim 1, wherein: the mixing of class-information semantic embedding and noise in step 4 is variational inference, that is, the variational encoder infers the latent attribute distribution q(z_c | c, z) of the class information given the noise z and the class information c, and the class-information semantic embedding z_c is sampled from this distribution.
7. The method for generating an image from a text with introduced class information according to claim 1, wherein: in step 5, an up-sampling operation decodes the text latent code and the class-information latent code to obtain the image features.
8. The method for generating an image from a text with introduced class information according to claim 1, wherein, in step 6, the image feature h_c generated from the text and the image feature h_r generated from the class information are fused by point-wise multiplication, the fused image feature being expressible as h1 = h_c ⊙ h_r; a convolutional neural network decodes the fused image feature to generate an image.
9. The method for generating an image from text with introduced class information as claimed in claim 1, wherein the adversarial training method in step 7 is: latent image representations of the generated image and of the real image are obtained through a convolutional neural network; the corresponding text and class information are input at the same time, and scores for image realism, image-text matching degree, and image-class-information matching degree are output.
10. The method for generating an image from text with introduced class information as claimed in claim 1, wherein step 8 employs a staged image generation method to generate higher-resolution pictures step by step. For example, in two-stage image generation, the first stage generates a lower-resolution picture from the fused image features; the second stage upsamples the text image feature h_s and the class-information image feature h_c obtained in the first stage to obtain higher-dimensional image and text features, and then generates a higher-resolution picture.
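The two-stage flow can be sketched as follows (illustrative only, not part of the claims; a channel mean stands in for the convolutional decoders, and the 64×64 / 128×128 sizes follow claim 11):

```python
import numpy as np

def upsample2x(feat):
    """Nearest-neighbour 2x upsampling of a (C, H, W) feature map."""
    return feat.repeat(2, axis=1).repeat(2, axis=2)

def two_stage_generate(h_fused):
    """Stage 1 decodes the fused features at low resolution; stage 2
    upsamples the features and decodes again at double the resolution.
    (A channel mean stands in for the convolutional decoders.)"""
    low_res = h_fused.mean(axis=0)               # first stage, e.g. 64x64
    high_res = upsample2x(h_fused).mean(axis=0)  # second stage, e.g. 128x128
    return low_res, high_res
```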
11. The method and apparatus as claimed in claim 10, wherein the model is stacked into a two-stage image generation network, the resolution of the image generated in the first stage being 64×64 and that of the image generated in the second stage being 128×128.
12. The method for generating an image from text with introduced class information as claimed in claim 1, wherein the input of the test stage in step 9 is a text and its class label, and a high-resolution image is generated in stages by the generator model obtained in the training stage.
13. A device for generating an image from text with introduced class information, characterized in that the device comprises:
the text encoder is used for encoding the text describing the image to obtain text semantic embedded representation;
the class information encoder is used for encoding class information of a text describing an image to obtain semantic embedded representation of the class information;
the generator comprises a recurrent neural network transcoder, a variational inference transcoder, an image feature fuser, and an image decoder; the recurrent neural network transcoder reads the text semantic embedding and the transcoder's hidden state from the previous step, and outputs the corresponding text image feature; the variational inference transcoder reads the input class-information semantic embedding and outputs the corresponding class-information image feature; the image feature fuser fuses the text image features and class-information image features generated by the recurrent neural network transcoder and the variational inference transcoder; the image decoder decodes the input fused image features to generate an image;
the discriminator comprises an image semantic discriminator, a text semantic discriminator and a class information discriminator, and the image semantic discriminator judges the correlation between the generated image and the real image; the text semantic discriminator judges the correlation between the generated image and the corresponding text; the class information discriminator judges the correlation between the generated image and the class information.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110071013.8A CN112765317A (en) | 2021-01-19 | 2021-01-19 | Method and device for generating image by introducing text of class information |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112765317A true CN112765317A (en) | 2021-05-07 |
Family
ID=75703285
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110071013.8A Pending CN112765317A (en) | 2021-01-19 | 2021-01-19 | Method and device for generating image by introducing text of class information |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112765317A (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109543159A (en) * | 2018-11-12 | 2019-03-29 | 南京德磐信息科技有限公司 | A kind of text generation image method and device |
US20190380657A1 (en) * | 2015-10-23 | 2019-12-19 | Siemens Medical Solutions Usa, Inc. | Generating natural language representations of mental content from functional brain images |
CN111968193A (en) * | 2020-07-28 | 2020-11-20 | 西安工程大学 | Text image generation method based on StackGAN network |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113254694A (en) * | 2021-05-21 | 2021-08-13 | 中国科学技术大学 | Text-to-image method and device |
CN113254694B (en) * | 2021-05-21 | 2022-07-15 | 中国科学技术大学 | Text-to-image method and device |
WO2023030348A1 (en) * | 2021-08-31 | 2023-03-09 | 北京字跳网络技术有限公司 | Image generation method and apparatus, and device and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109543159B (en) | Text image generation method and device | |
CN110795556B (en) | Abstract generation method based on fine-grained plug-in decoding | |
JP5128629B2 (en) | Part-of-speech tagging system, part-of-speech tagging model training apparatus and method | |
CN111897908A (en) | Event extraction method and system fusing dependency information and pre-training language model | |
CN112765316A (en) | Method and device for generating image by introducing text of capsule network | |
CN111444367B (en) | Image title generation method based on global and local attention mechanism | |
CN111078866B (en) | Chinese text abstract generation method based on sequence-to-sequence model | |
CN113283244B (en) | Pre-training model-based bidding data named entity identification method | |
CN110032638B (en) | Encoder-decoder-based generative abstract extraction method | |
CN112765317A (en) | Method and device for generating image by introducing text of class information | |
CN110390049B (en) | Automatic answer generation method for software development questions | |
CN113140020B (en) | Method for generating image based on text of countermeasure network generated by accompanying supervision | |
CN111402365B (en) | Method for generating picture from characters based on bidirectional architecture confrontation generation network | |
CN113961736A (en) | Method and device for generating image by text, computer equipment and storage medium | |
CN107463928A (en) | Word sequence error correction algorithm, system and its equipment based on OCR and two-way LSTM | |
CN114529903A (en) | Text refinement network | |
CN113140023A (en) | Text-to-image generation method and system based on space attention | |
US20220300708A1 (en) | Method and device for presenting prompt information and storage medium | |
Bie et al. | RenAIssance: A Survey into AI Text-to-Image Generation in the Era of Large Model | |
CN110750669B (en) | Method and system for generating image captions | |
CN117034951A (en) | Digital person with specific language style based on large language model | |
CN117093864A (en) | Text generation model training method and device | |
CN116521857A (en) | Method and device for abstracting multi-text answer abstract of question driven abstraction based on graphic enhancement | |
CN115496134A (en) | Traffic scene video description generation method and device based on multi-modal feature fusion | |
CN115331073A (en) | Image self-supervision learning method based on TransUnnet architecture |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||