CN110751698B - Text-to-image generation method based on hybrid network model - Google Patents
Text-to-image generation method based on hybrid network model
- Publication number
- CN110751698B (application CN201910923354.6A)
- Authority
- CN
- China
- Prior art keywords
- image
- text
- title
- training
- model
- Prior art date
- Legal status
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T11/00—2D [Two Dimensional] image generation
- G06T11/001—Texturing; Colouring; Generation of texture or colour
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/0002—Inspection of images, e.g. flaw detection
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/30—Subject of image; Context of image processing
- G06T2207/30168—Image quality inspection
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Biophysics (AREA)
- Evolutionary Computation (AREA)
- Health & Medical Sciences (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- Biomedical Technology (AREA)
- Quality & Reliability (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Computer Vision & Pattern Recognition (AREA)
- General Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- General Engineering & Computer Science (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Image Analysis (AREA)
Abstract
The invention discloses a text-to-image generation method based on a hybrid network model. The method comprises: forward training of the text-to-image mapping; forward training of the generator and the discriminator of the text-to-image generation model; feeding three types of input pairs to the discriminator to train it; forward test training of the text-to-image generative adversarial model; training the model to obtain the loss-function information; and evaluating the quality of the generated images with an image evaluation module. The quality of the generated images is clearly higher than that of images generated by a conventional text-conditioned GAN, so the method has greater application value. It overcomes the shortcomings of existing research on text-to-image models based on generative adversarial networks and is better suited to text-to-image generation. With the method and device, high-quality, sharp images can be output, and good generalization can be obtained from a small amount of training data.
Description
Technical Field
The invention belongs to the technical field of image processing, and relates to a text-to-image generation method based on a hybrid network model.
Background
With the rapid development of artificial intelligence, generating images from text has attracted a great deal of interest. In recent years, recurrent neural network architectures have been used to learn text feature representations, and deep convolutional generative adversarial networks can generate high-quality, sharp images of specific categories, such as faces and rooms.
In a conventional text-conditioned generative adversarial network model, a convolutional network is used in the discriminator to extract image features, and a recurrent neural network is used to extract features of the sentence sequence. However, the conventional convolutional network used for image feature extraction needs a large number of images for training, which is a serious limitation, whereas a capsule network can generalize from far less training data. Likewise, a conventional convolutional network does not handle image ambiguity well, while a capsule network does. In addition, in conventional text-to-image models based on generative adversarial networks, the convolutional layers used for image feature extraction are followed by fully connected layers, and these fully connected layers account for a large share of the parameters in the network. As a result, model training is relatively slow and overfitting is relatively severe.
Based on the above, a new way of extracting image features is needed to overcome the problems that arise when a conventional convolutional network is used for this purpose.
Disclosure of Invention
The invention provides a text-to-image generation method based on a hybrid network model, which outputs high-quality, sharp images and achieves good generalization with a small amount of training data.
The technical scheme adopted by the invention is a text-to-image generation method based on a hybrid network model, which specifically comprises the following steps:
step S1: loading the data required by a text-image adversarial model based on a generative adversarial network;
step S2: defining the text-image adversarial model, including defining the real image, the wrong image, the real title, the wrong title and the noise variable;
step S3: in the text-image adversarial model, performing forward training of the text-image mapping;
step S4: forward training the generator in the text-image adversarial model: encoding the correct title with a recurrent neural network and adding noise to the encoded vector to train the generator, obtaining a trained generator and a forged image;
step S5: feeding three types of input pairs to the discriminator of the text-image adversarial model (a pairing-aware discriminator is used here, an improvement on the discriminator of the standard text-conditional DCGAN framework: besides judging whether a generated image is realistic, it also indicates whether a rejected sample fails because the image is unrealistic or because the image and the text do not match): (forged image vector, real title vector), (real image vector, forged title vector) and (real image vector, real title vector), to train the discriminator and obtain a trained discriminator;
step S6: performing forward test training of the text-image adversarial model: encoding a real title with the recurrent neural network and adding random noise to the encoded vector to test the generator, in order to check whether the generator can output the expected result;
step S7: parameter definition, specifically including the learning rate, the learning decay rate, and the optimizers of the generator and the discriminator;
step S8: training the text-image adversarial model: loading the most recent checkpoint, acquiring the seeds, the noise and the sentence subscripts, acquiring the matched text, the real images, the wrong titles, the wrong images and the noise, updating the text-image mapping, updating the discriminator and the generator, and finally obtaining the round number and loss-function information;
step S9: saving the trained text-image adversarial model, and evaluating the image quality of the images it generates.
Further, in step S1, loading the relevant data of the text-image adversarial model includes: loading the title set and storing the processed title set in a corresponding dictionary; building the related vocabulary and recording the number of words; storing the subscripts of the associated titles in a list; randomly checking the list of title subscripts; loading the related images and resizing them; acquiring the number of images in the image training set and the image test set, and the number of titles in the title training set and the title test set; and storing, in binary form, the vocabulary, the image training set, the image test set, the number of training titles, the number of test titles, the number of titles per image, the number of test images, the number of training images, the training subscript set and the test subscript set.
Further, in step S2, the definitions of the correct image, the wrong image, the correct title and the wrong title specifically include the definitions of the name, type and size of the real and fake images, and of the name, type and size of the real and fake titles.
Further, in the forward training of the text-image mapping, the real and fake images are encoded by the capsule network, and the specific process is as follows:
(1) a real or fake image is first fed to the input layer of the capsule network in the form of a vector group;
(2) after the input layer performs simple processing, the image passes in turn through two capsule layers, which extract its high-order features, and is then normalized by a batch normalization layer; feature extraction and normalization are repeated with a further capsule layer and batch normalization layer, and the same capsule layer and batch normalization layer process the image once more; a network layer then compresses the image tensor into a vector; finally, a fully connected layer, each of whose nodes is connected to all nodes of the previous layer, integrates the extracted features and outputs the final overall feature.
Further, in step S3, in the forward training of the text-image mapping, the recurrent neural network is used to extract a feature vector from the title, according to the following steps:
(1) the title sequence is fed into an Embedding layer (the Embedding layer maps words from the semantic space to a vector space while preserving their relations in the semantic space as far as possible, i.e. words with similar semantics are mapped to nearby vectors, and vice versa); after processing, a three-dimensional tensor containing the batch size and the embedding dimension is output;
(2) the output of the Embedding layer is fed into the dynamic recurrent layer (Dynamic RNN layer) of the recurrent neural network and processed to obtain the final network output tensor;
(3) the loss function of the recurrent neural network is calculated.
Further, in step S4, the forward training of the generator is as follows:
(1) the correct title is encoded with the recurrent neural network;
(2) after the correct title is encoded, corresponding noise is added, and the generator processes the result to produce a forged image; the generator update formulas are as follows:
L_G ← log(S_f)    (1)
G ← G − α·∂L_G/∂G    (2)
where S_f is the discrimination probability of the pair (forged image, real text vector), and α is a constant factor.
Further, in step S5, the training process of the discriminator is as follows:
(1) three types of input pairs are fed to the discriminator: (forged image vector, real title vector), (real image vector, forged title vector) and (real image vector, real title vector), and the discriminator is trained on them to obtain a trained discriminator;
(2) when updating the discriminator, the relevant formulas are as follows:
L_D ← log(S_r) + (log(1 − S_w) + log(1 − S_f))/2    (3)
D ← D − α·∂L_D/∂D    (4)
where S_r, S_w and S_f denote, respectively, the discrimination probability of the real image with the real title, of the real image with the wrong text, and of the forged image with the real text;
in the discriminator, when matching the image vector against the title vector, the similarity of the two probability distributions is measured with the KL divergence formula:
KL(P‖Q) = Σ_x P(x)·log(P(x)/Q(x))
where P and Q are the two probability distributions and P(x) and Q(x) are the corresponding probability densities; the KL divergence is non-negative and asymmetric.
Further, in step S6, the specific process of forward test training of the text-to-image generative adversarial model is as follows: the real title is encoded and noise is then added for training, giving a trained generator; the loss functions of the generator and the discriminator are obtained through the cross-entropy formula, which is of the form
L = −Σ_i y_i·log(ŷ_i)
where y comprises 3 values — the log probability of discriminating the forged image with the real title vector, the log probability of discriminating the real image with the real title, and the log probability of discriminating the real image with the fake title — and ŷ is the corresponding unitized (normalized) log probability.
Further, in step S7, the parameters are defined as follows: the learning rate, the learning decay rate, the number of decay rounds and the bias are defined; the names of the related variables are acquired; the optimizers of the generator and the discriminator are defined, and the optimizer of the recurrent neural network is defined.
Further, in step S8, training of the model begins; the preparation before training is as follows: the session is configured with the ConfigProto function of TensorFlow and initialized with the global initialization function; the latest checkpoint is loaded; the title description sentences to be input to the model and their subscripts are acquired, the training batch size and the sample sentences are acquired, and the sample sentences are preprocessed; the learning rate is updated; the matched text, the correct images, the wrong titles, the wrong images and the corresponding noise are acquired.
Further, in step S8, the training process is as follows: the text-to-image mapping is updated; while the number of training rounds is less than 50, a dictionary containing the real and fake images and the real and fake titles is acquired, and the error of the recurrent neural network is obtained from the vector formed by its loss function and optimization function; from the 50th round onward, this error is set to 0 and the discriminator and the generator are updated; the elapsed time is printed every fixed number of rounds; a dictionary of sample sentences and sample seeds is acquired, and the vector group formed by the output of the generator network and the output of the recurrent neural network yields the generated image and the recurrent-network output; a layer of attention is added to the image generation process, so that while a region of the image is being generated it is associated, with a certain probability, with the corresponding sentence, which improves the generation quality; the generated images are saved to a specified directory; the model is saved, the updated checkpoint is stored every 10 rounds, and in the last round, i.e. the 100th round, the latest checkpoint is saved and the corresponding name is updated.
Further, in step S9, the image quality of the images generated by the trained text-image adversarial model is evaluated as follows: a corresponding scoring module is built; one of the image quality evaluation methods, the FID (Fréchet Inception Distance) score, is adopted: the generated images are embedded into the feature space given by a specific layer of the Inception Net, this space is treated as a continuous multivariate Gaussian distribution, the mean and covariance of the generated data and of the real data are computed, and the resulting distance is returned as the criterion of image quality.
Different from the prior art, the text-to-image generation method based on the hybrid network model performs forward training of the text-to-image mapping, forward training of the generator and the discriminator of the text-to-image generation model, feeds three types of input pairs to the discriminator to train it, performs forward test training of the text-to-image generative adversarial model, trains the model to obtain the loss-function information, and evaluates the quality of the generated images with an image evaluation module; the quality of the generated images is clearly higher than that of images generated by a conventional text-conditioned GAN, so the method has greater application value. It overcomes the shortcomings of existing research on text-to-image models based on generative adversarial networks and is better suited to text-to-image generation. With the method and device, high-quality, sharp images can be output, and good generalization can be obtained from a small amount of training data.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic flowchart of the text-to-image generation method based on a hybrid network model provided by the present invention.
Fig. 2 shows the structure and workflow of the capsule network, taking handwritten digits as an example, in the text-to-image generation method based on a hybrid network model provided by the present invention.
Fig. 3 shows the workflow of the attention mechanism module during image generation in the text-to-image generation method based on a hybrid network model provided by the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
As shown in fig. 1, the present invention provides a text-to-image generation method based on a hybrid network model, which specifically comprises the following steps:
step S1: loading the data required by a text-image adversarial model based on a generative adversarial network;
step S2: defining the text-image adversarial model, including defining the real image, the wrong image, the real title, the wrong title and the noise variable;
step S3: in the text-image adversarial model, performing forward training of the text-image mapping;
step S4: forward training the generator in the text-image adversarial model: encoding the correct title with a recurrent neural network and adding noise to the encoded vector to train the generator, obtaining a trained generator and a forged image;
step S5: feeding three types of input pairs to the discriminator of the text-image adversarial model (a pairing-aware discriminator is used here, an improvement on the discriminator of the standard text-conditional DCGAN framework: besides judging whether a generated image is realistic, it also indicates whether a rejected sample fails because the image is unrealistic or because the image and the text do not match): (forged image vector, real title vector), (real image vector, forged title vector) and (real image vector, real title vector), to train the discriminator and obtain a trained discriminator;
step S6: performing forward test training of the text-image adversarial model: encoding a real title with the recurrent neural network and adding random noise to the encoded vector to test the generator, in order to check whether the generator can output the expected result;
step S7: parameter definition, specifically including the learning rate, the learning decay rate, and the optimizers of the generator and the discriminator;
step S8: training the text-image adversarial model: loading the most recent checkpoint, acquiring the seeds, the noise and the sentence subscripts, acquiring the matched text, the real images, the wrong titles, the wrong images and the noise, updating the text-image mapping, updating the discriminator and the generator, and finally obtaining the round number and loss-function information;
step S9: saving the trained text-image adversarial model, and evaluating the image quality of the images it generates.
Further, in step S1, loading the relevant data of the text-image adversarial model includes: loading the title set and storing the processed title set in a corresponding dictionary; building the related vocabulary and recording the number of words; storing the subscripts of the associated titles in a list; randomly checking the list of title subscripts; loading the related images and resizing them; acquiring the number of images in the image training set and the image test set, and the number of titles in the title training set and the title test set; and storing, in binary form, the vocabulary, the image training set, the image test set, the number of training titles, the number of test titles, the number of titles per image, the number of test images, the number of training images, the training subscript set and the test subscript set.
Further, in step S2, the definitions of the correct image, the wrong image, the correct title and the wrong title specifically include the definitions of the name, type and size of the real and fake images, and of the name, type and size of the real and fake titles.
Further, in the forward training of the text-image mapping, the real and fake images are encoded by the capsule network; the structure and workflow of the capsule network, taking handwritten digits as an example, are shown in fig. 2. The specific process is as follows (a simplified code sketch is given after the two steps below):
(1) a real or fake image is first fed to the input layer of the capsule network in the form of a vector group;
(2) after the input layer performs simple processing, the image passes in turn through two capsule layers, which extract its high-order features, and is then normalized by a batch normalization layer; feature extraction and normalization are repeated with a further capsule layer and batch normalization layer, and the same capsule layer and batch normalization layer process the image once more; a network layer then compresses the image tensor into a vector; finally, a fully connected layer, each of whose nodes is connected to all nodes of the previous layer, integrates the extracted features and outputs the final overall feature.
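The following is a minimal, illustrative sketch of such a capsule-based image encoder in TensorFlow/Keras. It is not the exact network of the embodiment: the simplified ConvCapsule layer (channel groups with a squash non-linearity and no dynamic routing), the number of layers and the layer sizes are assumptions made only for the example.

```python
import tensorflow as tf

def squash(v, axis=-1, eps=1e-7):
    # Capsule squashing non-linearity: short vectors shrink toward 0,
    # long vectors approach unit length, preserving direction.
    sq = tf.reduce_sum(tf.square(v), axis=axis, keepdims=True)
    return (sq / (1.0 + sq)) * v / tf.sqrt(sq + eps)

class ConvCapsule(tf.keras.layers.Layer):
    """Simplified convolutional capsule layer (no routing): the channel axis is
    split into groups of `caps_dim` and each group is squashed as one capsule."""
    def __init__(self, n_caps, caps_dim, kernel_size=3, strides=1, **kwargs):
        super().__init__(**kwargs)
        self.n_caps, self.caps_dim = n_caps, caps_dim
        self.conv = tf.keras.layers.Conv2D(n_caps * caps_dim, kernel_size,
                                           strides=strides, padding="same")
    def call(self, x):
        x = self.conv(x)
        h, w = x.shape[1], x.shape[2]
        caps = tf.reshape(x, [-1, h, w, self.n_caps, self.caps_dim])
        caps = squash(caps)                              # squash each capsule vector
        return tf.reshape(caps, [-1, h, w, self.n_caps * self.caps_dim])

def build_capsule_image_encoder(img_size=64, feat_dim=128):
    """Input layer -> capsule layers interleaved with batch normalization ->
    flatten the image tensor into a vector -> fully connected overall feature."""
    inp = tf.keras.Input((img_size, img_size, 3))
    x = tf.keras.layers.Conv2D(64, 5, strides=2, padding="same", activation="relu")(inp)
    x = ConvCapsule(8, 8)(x)                             # first capsule layer
    x = ConvCapsule(8, 8, strides=2)(x)                  # second capsule layer
    x = tf.keras.layers.BatchNormalization()(x)          # batch normalization
    x = ConvCapsule(16, 8, strides=2)(x)                 # further capsule layer
    x = tf.keras.layers.BatchNormalization()(x)
    x = tf.keras.layers.Flatten()(x)                     # compress tensor into a vector
    out = tf.keras.layers.Dense(feat_dim)(x)             # fully connected feature output
    return tf.keras.Model(inp, out)
```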
Further, in step S3, in the forward training of the text-image mapping, the recurrent neural network is used to extract a feature vector from the title; the workflow of the attention mechanism module during image generation is shown in fig. 3. The steps are as follows (a sketch of the title encoder follows this list):
(1) the title sequence is fed into an Embedding layer (the Embedding layer maps words from the semantic space to a vector space while preserving their relations in the semantic space as far as possible, i.e. words with similar semantics are mapped to nearby vectors, and vice versa); after processing, a three-dimensional tensor containing the batch size and the embedding dimension is output;
(2) the output of the Embedding layer is fed into the dynamic recurrent layer (Dynamic RNN layer) of the recurrent neural network and processed to obtain the final network output tensor;
(3) the loss function of the recurrent neural network is calculated.
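A minimal sketch of the title encoder of steps (1) and (2), assuming a Keras Embedding layer followed by an LSTM in place of the TF1 Dynamic RNN layer; the vocabulary size, embedding dimension and hidden size are placeholder values.

```python
import tensorflow as tf

def build_title_encoder(vocab_size=8000, embed_dim=256, hidden_dim=128):
    """Title sequence -> Embedding (semantic space -> vector space) -> recurrent layer.
    The Embedding output is a 3-D tensor [batch, time, embed_dim]; the recurrent layer
    consumes it step by step and returns the final text feature vector."""
    titles = tf.keras.Input(shape=(None,), dtype="int32")     # padded word-index sequences
    emb = tf.keras.layers.Embedding(vocab_size, embed_dim, mask_zero=True)(titles)
    text_vec = tf.keras.layers.LSTM(hidden_dim)(emb)          # variable-length RNN encoding
    return tf.keras.Model(titles, text_vec)

# Usage: encode a batch of two tokenized titles (index 0 is padding).
encoder = build_title_encoder()
batch = tf.constant([[4, 12, 7, 0, 0], [9, 3, 25, 6, 2]], dtype=tf.int32)
features = encoder(batch)        # shape (2, 128)
```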
Further, in step S4, the forward training of the generator is as follows:
(1) the correct title is encoded with the recurrent neural network;
(2) after the correct title is encoded, corresponding noise is added, and the generator processes the result to produce a forged image; the generator update formulas are as follows (a code sketch of this update step is given after the formulas):
L_G ← log(S_f)    (1)
G ← G − α·∂L_G/∂G    (2)
where S_f is the discrimination probability of the pair (forged image, real text vector), and α is a constant factor.
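A sketch of this generator update, following formulas (1) and (2) literally as a gradient-descent step with TensorFlow's GradientTape; the generator and discriminator call signatures are placeholder assumptions.

```python
import tensorflow as tf

def generator_step(generator, discriminator, text_vec, noise, optimizer):
    """One generator update: L_G = log(S_f), where S_f is the discriminator's
    probability for the (forged image, real text vector) pair, followed by a
    gradient-descent step G <- G - a * dL_G/dG (formulas (1)-(2) as stated)."""
    with tf.GradientTape() as tape:
        fake_img = generator([text_vec, noise], training=True)
        s_f = discriminator([fake_img, text_vec], training=False)   # S_f in (0, 1)
        loss_g = tf.reduce_mean(tf.math.log(s_f + 1e-8))            # L_G <- log(S_f)
    grads = tape.gradient(loss_g, generator.trainable_variables)
    optimizer.apply_gradients(zip(grads, generator.trainable_variables))
    return loss_g
```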
Further, in step S5, the training process of the discriminator is as follows:
(1) three types of input pairs are fed to the discriminator: (forged image vector, real title vector), (real image vector, forged title vector) and (real image vector, real title vector), and the discriminator is trained on them to obtain a trained discriminator;
(2) when updating the discriminator, the relevant formulas are as follows (a code sketch of this update follows):
L_D ← log(S_r) + (log(1 − S_w) + log(1 − S_f))/2    (3)
D ← D − α·∂L_D/∂D    (4)
where S_r, S_w and S_f denote, respectively, the discrimination probability of the real image with the real title, of the real image with the wrong text, and of the forged image with the real text;
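A sketch of this discriminator update, following formulas (3) and (4) literally as a gradient-descent step, together with a discrete KL-divergence helper of the kind used for the distribution-matching check described below; the model call signatures and the discrete form of the divergence are illustrative assumptions.

```python
import tensorflow as tf

def discriminator_step(discriminator, real_img, fake_img, real_txt, wrong_txt, optimizer):
    """L_D = log(S_r) + (log(1 - S_w) + log(1 - S_f)) / 2, with S_r, S_w, S_f the
    probabilities for (real image, real title), (real image, wrong title) and
    (forged image, real title); one gradient step D <- D - a * dL_D/dD."""
    eps = 1e-8
    with tf.GradientTape() as tape:
        s_r = discriminator([real_img, real_txt], training=True)
        s_w = discriminator([real_img, wrong_txt], training=True)
        s_f = discriminator([fake_img, real_txt], training=True)
        loss_d = tf.reduce_mean(tf.math.log(s_r + eps)
                                + 0.5 * (tf.math.log(1.0 - s_w + eps)
                                         + tf.math.log(1.0 - s_f + eps)))
    grads = tape.gradient(loss_d, discriminator.trainable_variables)
    optimizer.apply_gradients(zip(grads, discriminator.trainable_variables))
    return loss_d

def kl_divergence(p, q, eps=1e-8):
    # KL(P || Q) = sum_x P(x) * log(P(x) / Q(x)); non-negative and asymmetric.
    p = p / tf.reduce_sum(p)
    q = q / tf.reduce_sum(q)
    return tf.reduce_sum(p * tf.math.log((p + eps) / (q + eps)))
```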
in the discriminator, when matching the image vector against the title vector, the similarity of the two probability distributions is measured with the KL divergence formula:
KL(P‖Q) = Σ_x P(x)·log(P(x)/Q(x))
where P and Q are the two probability distributions and P(x) and Q(x) are the corresponding probability densities; the KL divergence is non-negative and asymmetric.
Further, in step S6, the specific process of forward test training of the text-to-image generative adversarial model is as follows: the real title is encoded and noise is then added for training, giving a trained generator; the loss functions of the generator and the discriminator are obtained through the cross-entropy formula, which is of the form
L = −Σ_i y_i·log(ŷ_i)
where y comprises 3 values — the log probability of discriminating the forged image with the real title vector, the log probability of discriminating the real image with the real title, and the log probability of discriminating the real image with the fake title — and ŷ is the corresponding unitized (normalized) log probability.
Further, in step S7, the parameters are defined as follows: the learning rate, the learning decay rate, the number of decay rounds and the bias are defined; the names of the related variables are acquired; the optimizers of the generator and the discriminator are defined, and the optimizer of the recurrent neural network is defined.
Further, in step S8, training of the model begins; the preparation before training is as follows (a TensorFlow 1.x-style sketch of this preparation follows): the session is configured with the ConfigProto function of TensorFlow and initialized with the global initialization function; the latest checkpoint is loaded; the title description sentences to be input to the model and their subscripts are acquired, the training batch size and the sample sentences are acquired, and the sample sentences are preprocessed; the learning rate is updated; the matched text, the correct images, the wrong titles, the wrong images and the corresponding noise are acquired.
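A minimal TensorFlow 1.x-style sketch of this preparation step, written with the tf.compat.v1 API; the checkpoint directory and the specific ConfigProto options are assumptions, and the model graph is assumed to have been built beforehand.

```python
import tensorflow.compat.v1 as tf1
tf1.disable_eager_execution()

def prepare_session(checkpoint_dir="./checkpoints"):
    """Configure and initialize a TF1-style session and restore the latest checkpoint,
    assuming the model graph (generator, discriminator, RNN) has already been built."""
    config = tf1.ConfigProto(allow_soft_placement=True)
    config.gpu_options.allow_growth = True            # let GPU memory grow as needed
    sess = tf1.Session(config=config)
    sess.run(tf1.global_variables_initializer())      # global initialization of variables
    saver = tf1.train.Saver(max_to_keep=5)
    ckpt = tf1.train.latest_checkpoint(checkpoint_dir)
    if ckpt is not None:                               # load the most recent checkpoint, if any
        saver.restore(sess, ckpt)
    return sess, saver
```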
Further, in step S8, the training process is as follows: the text-to-image mapping is updated; while the number of training rounds is less than 50, a dictionary containing the real and fake images and the real and fake titles is acquired, and the error of the recurrent neural network is obtained from the vector formed by its loss function and optimization function; from the 50th round onward, this error is set to 0; the discriminator and the generator are updated; the elapsed time is printed every fixed number of rounds; a dictionary of sample sentences and sample seeds is acquired, and the vector group formed by the output of the generator network and the output of the recurrent neural network yields the generated image and the recurrent-network output; a layer of attention is added to the image generation process, so that while a region of the image is being generated it is associated, with a certain probability, with the corresponding sentence, which improves the generation quality; the generated images are saved to a specified directory; the model is saved, the updated checkpoint is stored every 10 rounds, and in the last round, i.e. the 100th round, the latest checkpoint is saved and the corresponding name is updated.
Further, in step S9, the image quality of the images generated by the trained text-image adversarial model is evaluated as follows: a corresponding scoring module is built; one of the image quality evaluation methods, the FID (Fréchet Inception Distance) score, is adopted: the generated images are embedded into the feature space given by a specific layer of the Inception Net, this space is treated as a continuous multivariate Gaussian distribution, the mean and covariance of the generated data and of the real data are computed, and the resulting distance is returned as the criterion of image quality (a code sketch of this score follows).
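A sketch of the FID score described above, assuming the Inception-Net features of the real and generated images have already been extracted into two NumPy arrays:

```python
import numpy as np
from scipy import linalg

def fid_score(real_feats, gen_feats, eps=1e-6):
    """Frechet Inception Distance between two feature sets (rows = images,
    columns = feature dimensions), each modelled as a multivariate Gaussian:
    FID = ||mu_r - mu_g||^2 + Tr(C_r + C_g - 2*(C_r C_g)^(1/2))."""
    mu_r, mu_g = real_feats.mean(axis=0), gen_feats.mean(axis=0)
    cov_r = np.cov(real_feats, rowvar=False)
    cov_g = np.cov(gen_feats, rowvar=False)
    covmean, _ = linalg.sqrtm(cov_r @ cov_g, disp=False)   # matrix square root
    if np.iscomplexobj(covmean):                           # drop tiny imaginary noise
        covmean = covmean.real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))
```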
Different from the prior art, the text-to-image generation method based on the hybrid network model performs forward training of the text-to-image mapping, forward training of the generator and the discriminator of the text-to-image generation model, feeds three types of input pairs to the discriminator to train it, performs forward test training of the text-to-image generative adversarial model, trains the model to obtain the loss-function information, and evaluates the quality of the generated images with an image evaluation module; the quality of the generated images is clearly higher than that of images generated by a conventional text-conditioned GAN, so the method has greater application value. It overcomes the shortcomings of existing research on text-to-image models based on generative adversarial networks and is better suited to text-to-image generation. With the method and device, high-quality, sharp images can be output, and good generalization can be obtained from a small amount of training data.
The above description is only for the preferred embodiment of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the present invention.
Claims (8)
1. A method for text-to-image generation based on a hybrid network model, comprising the steps of:
step S1: loading the data required by a text-image adversarial model based on a generative adversarial network;
step S2: defining the text-image adversarial model, including defining the real image, the wrong image, the real title, the wrong title and the noise variable;
step S3: in the text-image adversarial model, performing forward training of the text-image mapping; in this forward training, the real and fake images are encoded by the capsule network, and the specific process is as follows:
(1) a real or fake image is first fed to the input layer of the capsule network in the form of a vector group;
(2) after the input layer performs simple processing, the image passes in turn through two capsule layers, which extract its high-order features, and is then normalized by a batch normalization layer; feature extraction and normalization are repeated with a further capsule layer and batch normalization layer, and the same capsule layer and batch normalization layer process the image once more; a network layer then compresses the image tensor into a vector; finally, a fully connected layer, each of whose nodes is connected to all nodes of the previous layer, integrates the extracted features and outputs the final overall feature;
step S4: forward training the generator in the text-image adversarial model: encoding the correct title with a recurrent neural network and adding noise to the encoded vector to train the generator, obtaining a trained generator and a forged image;
step S5: feeding three types of input pairs to the discriminator of the text-image adversarial model, using a pairing-aware discriminator that improves on the discriminator of the standard text-conditional DCGAN framework and that, besides judging whether the output image is realistic, also indicates whether a rejected sample fails because the generated image is unrealistic or because the image and the text do not match; the pairs (forged image vector, real title vector), (real image vector, forged title vector) and (real image vector, real title vector) are used to train the discriminator and obtain a trained discriminator;
step S6: performing forward test training of the text-image adversarial model: encoding a real title with the recurrent neural network and adding random noise to the encoded vector to test the generator, in order to check whether the generator can output the expected result;
step S7: parameter definition, specifically including the learning rate, the learning decay rate, and the optimizers of the generator and the discriminator;
step S8: training the text-image adversarial model: loading the most recent checkpoint, acquiring the seeds, the noise and the sentence subscripts, acquiring the matched text, the real images, the wrong titles, the wrong images and the noise, updating the text-image mapping, updating the discriminator and the generator, and finally obtaining the round number and loss-function information;
the training of the model proceeds as follows: the text-to-image mapping is updated; while the number of training rounds is less than 50, a dictionary containing the real and fake images and the real and fake titles is acquired, and the error of the recurrent neural network is obtained from the vector formed by its loss function and optimization function; from the 50th round onward, this error is set to 0; the discriminator and the generator are updated; the elapsed time is printed every fixed number of rounds; a dictionary of sample sentences and sample seeds is acquired, and the vector group formed by the output of the generator network and the output of the recurrent neural network yields the generated image; a layer of attention is added to the image generation process, so that while a region of the image is being generated it is associated, with a certain probability, with the corresponding sentence, which improves the generation quality; the generated images are saved to a specified directory; the model is saved, the checkpoint is stored every 10 rounds, and in the last round, i.e. the 100th round, the latest checkpoint is saved and the corresponding name is updated;
step S9: saving the trained text-image adversarial model, and evaluating the image quality of the images it generates, the evaluation process being as follows: a corresponding scoring module is built; one of the image quality evaluation methods, the FID (Fréchet Inception Distance) score, is adopted: the generated images are embedded into the feature space given by a specific layer of the Inception Net, this space is treated as a continuous multivariate Gaussian distribution, the mean and covariance of the generated data and of the real data are computed, and the resulting distance is returned as the criterion of image quality.
2. The method for text-to-image generation based on a hybrid network model according to claim 1, wherein in step S1, loading the relevant data of the text-image adversarial model comprises: loading the title set and storing the processed title set in a corresponding dictionary; building the related vocabulary and recording the number of words; storing the subscripts of the associated titles in a list; randomly checking the list of title subscripts; loading the related images and resizing them; acquiring the number of images in the image training set and the image test set, and the number of titles in the title training set and the title test set; and storing, in binary form, the vocabulary, the image training set, the image test set, the number of training titles, the number of test titles, the number of titles per image, the number of test images, the number of training images, the training subscript set and the test subscript set.
3. The method for text-to-image generation based on a hybrid network model according to claim 1, wherein in step S2, the definitions of the correct image, the wrong image, the correct title and the wrong title specifically include the definitions of the name, type and size of the real and fake images, and of the name, type and size of the real and fake titles.
4. The method for text-to-image generation based on a hybrid network model according to claim 1, wherein in step S3, during the forward training of the text-image mapping, the recurrent neural network is used to extract a feature vector from the title, according to the following steps:
(1) the title sequence is fed into an Embedding layer for processing, and a three-dimensional tensor containing the batch size and the embedding dimension is output;
(2) the output of the Embedding layer is fed into the dynamic recurrent layer (Dynamic RNN layer) of the recurrent neural network and processed to obtain the final network output tensor;
(3) the loss function of the recurrent neural network is calculated.
5. The method for text-to-image generation based on a hybrid network model according to claim 1, wherein in step S4, the forward training of the generator is as follows:
(1) the correct title is encoded with the recurrent neural network;
(2) after the correct title is encoded, corresponding noise is added, and the generator processes the result to produce a forged image; the generator update formulas are as follows:
L_G ← log(S_f)    (1)
G ← G − α·∂L_G/∂G    (2)
where S_f is the discrimination probability of the pair (forged image, real text vector), and α is a constant factor.
6. The method for text-to-image generation based on a hybrid network model according to claim 1, wherein in step S5, the training of the discriminator is as follows:
(1) three types of input pairs are fed to the discriminator: (forged image vector, real title vector), (real image vector, forged title vector) and (real image vector, real title vector), and the discriminator is trained on them to obtain a trained discriminator;
(2) when updating the discriminator, the relevant formulas are as follows:
L_D ← log(S_r) + (log(1 − S_w) + log(1 − S_f))/2    (3)
D ← D − α·∂L_D/∂D    (4)
where S_r, S_w and S_f denote, respectively, the discrimination probability of the real image with the real title, of the real image with the wrong text, and of the forged image with the real text;
in the discriminator, when matching the image vector against the title vector, the similarity of the two probability distributions is measured with the KL divergence formula:
KL(P‖Q) = Σ_x P(x)·log(P(x)/Q(x))
where P and Q are the two probability distributions and P(x) and Q(x) are the corresponding probability densities; the KL divergence is non-negative and asymmetric.
7. The method for text-to-image generation based on a hybrid network model according to claim 1, wherein the forward test training of the text-to-image generative adversarial model in step S6 is as follows: the real title is encoded and noise is then added for training, giving a trained generator; the loss functions of the generator and the discriminator are obtained through the cross-entropy formula, which is of the form
L = −Σ_i y_i·log(ŷ_i)
where y comprises 3 values — the log probability of discriminating the forged image with the real title vector, the log probability of discriminating the real image with the real title, and the log probability of discriminating the real image with the fake title — and ŷ is the corresponding unitized (normalized) log probability.
8. The method for text-to-image generation based on a hybrid network model according to claim 1, wherein in step S7, the parameters are defined as follows: the learning rate, the learning decay rate, the number of decay rounds and the bias are defined; the names of the related variables are acquired; the optimizers of the generator and the discriminator are defined, and the optimizer of the recurrent neural network is defined; in step S8, training of the model begins, and the preparation before training is as follows: the session is configured with the ConfigProto function of TensorFlow and initialized with the global initialization function; the latest checkpoint is loaded; the title description sentences to be input to the model and their subscripts are acquired, the training batch size and the sample sentences are acquired, and the sample sentences are preprocessed; the learning rate is updated; the matched text, the correct images, the wrong titles, the wrong images and the corresponding noise are acquired.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910923354.6A CN110751698B (en) | 2019-09-27 | 2019-09-27 | Text-to-image generation method based on hybrid network model |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910923354.6A CN110751698B (en) | 2019-09-27 | 2019-09-27 | Text-to-image generation method based on hybrid network model |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110751698A CN110751698A (en) | 2020-02-04 |
CN110751698B true CN110751698B (en) | 2022-05-17 |
Family
ID=69277252
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910923354.6A Active CN110751698B (en) | 2019-09-27 | 2019-09-27 | Text-to-image generation method based on hybrid network model |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110751698B (en) |
Families Citing this family (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111339734B (en) * | 2020-02-20 | 2023-06-30 | 青岛联合创智科技有限公司 | Method for generating image based on text |
CN111402365B (en) * | 2020-03-17 | 2023-02-10 | 湖南大学 | Method for generating picture from characters based on bidirectional architecture confrontation generation network |
CN111860507B (en) * | 2020-07-20 | 2022-09-20 | 中国科学院重庆绿色智能技术研究院 | Compound image molecular structural formula extraction method based on counterstudy |
CN111968193B (en) * | 2020-07-28 | 2023-11-21 | 西安工程大学 | Text image generation method based on StackGAN (secure gas network) |
CN112215868B (en) * | 2020-09-10 | 2023-12-26 | 湖北医药学院 | Method for removing gesture image background based on generation of countermeasure network |
CN114359423B (en) * | 2020-10-13 | 2023-09-12 | 四川大学 | Text generation face method based on deep countermeasure generation network |
WO2022145525A1 (en) * | 2020-12-29 | 2022-07-07 | 주식회사 디자이노블 | Method and apparatus for generating design based on learned condition |
CN112765316B (en) * | 2021-01-19 | 2024-08-02 | 东南大学 | Method and device for generating image by text introduced into capsule network |
CN113052784B (en) * | 2021-03-22 | 2024-03-08 | 大连理工大学 | Image generation method based on multiple auxiliary information |
CN113298895B (en) * | 2021-06-18 | 2023-05-12 | 上海交通大学 | Automatic encoding method and system for unsupervised bidirectional generation oriented to convergence guarantee |
CN114021558B (en) * | 2021-11-10 | 2022-05-10 | 北京航空航天大学杭州创新研究院 | Intelligent evaluation method for consistency of graph and text meaning based on layering |
CN114648681B (en) * | 2022-05-20 | 2022-10-28 | 浪潮电子信息产业股份有限公司 | Image generation method, device, equipment and medium |
CN115018954B (en) * | 2022-08-08 | 2022-10-28 | 中国科学院自动化研究所 | Image generation method, device, electronic equipment and medium |
CN115546848B (en) * | 2022-10-26 | 2024-02-02 | 南京航空航天大学 | Challenge generation network training method, cross-equipment palmprint recognition method and system |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107862377A (en) * | 2017-11-14 | 2018-03-30 | 华南理工大学 | A kind of packet convolution method that confrontation network model is generated based on text image |
CN109003678A (en) * | 2018-06-12 | 2018-12-14 | 清华大学 | A kind of generation method and system emulating text case history |
CN109543200A (en) * | 2018-11-30 | 2019-03-29 | 腾讯科技(深圳)有限公司 | A kind of text interpretation method and device |
CN109584337A (en) * | 2018-11-09 | 2019-04-05 | 暨南大学 | A kind of image generating method generating confrontation network based on condition capsule |
CN109871888A (en) * | 2019-01-30 | 2019-06-11 | 中国地质大学(武汉) | A kind of image generating method and system based on capsule network |
- 2019-09-27: CN application CN201910923354.6A, patent CN110751698B, status Active
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107862377A (en) * | 2017-11-14 | 2018-03-30 | 华南理工大学 | A kind of packet convolution method that confrontation network model is generated based on text image |
CN109003678A (en) * | 2018-06-12 | 2018-12-14 | 清华大学 | A kind of generation method and system emulating text case history |
CN109584337A (en) * | 2018-11-09 | 2019-04-05 | 暨南大学 | A kind of image generating method generating confrontation network based on condition capsule |
CN109543200A (en) * | 2018-11-30 | 2019-03-29 | 腾讯科技(深圳)有限公司 | A kind of text interpretation method and device |
CN109871888A (en) * | 2019-01-30 | 2019-06-11 | 中国地质大学(武汉) | A kind of image generating method and system based on capsule network |
Non-Patent Citations (1)
Title |
---|
Review of the application of deep learning in image recognition; Zheng Yuanpan et al.; Computer Engineering and Applications; 2019-04-19; Vol. 55, No. 12; pp. 20-36 *
Also Published As
Publication number | Publication date |
---|---|
CN110751698A (en) | 2020-02-04 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110751698B (en) | Text-to-image generation method based on hybrid network model | |
CN108875807B (en) | Image description method based on multiple attention and multiple scales | |
CN111061843B (en) | Knowledge-graph-guided false news detection method | |
CN110427461B (en) | Intelligent question and answer information processing method, electronic equipment and computer readable storage medium | |
CN108830287A (en) | The Chinese image, semantic of Inception network integration multilayer GRU based on residual error connection describes method | |
CN106650789A (en) | Image description generation method based on depth LSTM network | |
CN109413028A (en) | SQL injection detection method based on convolutional neural networks algorithm | |
CN114120041B (en) | Small sample classification method based on double-countermeasure variable self-encoder | |
CN111160452A (en) | Multi-modal network rumor detection method based on pre-training language model | |
CN112784929B (en) | Small sample image classification method and device based on double-element group expansion | |
CN109711465A (en) | Image method for generating captions based on MLL and ASCA-FR | |
CN111966812A (en) | Automatic question answering method based on dynamic word vector and storage medium | |
CN113642621A (en) | Zero sample image classification method based on generation countermeasure network | |
CN112784031B (en) | Method and system for classifying customer service conversation texts based on small sample learning | |
CN111506709A (en) | Entity linking method and device, electronic equipment and storage medium | |
CN110909174B (en) | Knowledge graph-based method for improving entity link in simple question answering | |
CN109766918A (en) | Conspicuousness object detecting method based on the fusion of multi-level contextual information | |
CN109101984B (en) | Image identification method and device based on convolutional neural network | |
CN114332565A (en) | Method for generating image by generating confrontation network text based on distribution estimation condition | |
CN117094325B (en) | Named entity identification method in rice pest field | |
CN116822633B (en) | Model reasoning method and device based on self-cognition and electronic equipment | |
CN108829675A (en) | document representing method and device | |
CN115588487B (en) | Medical image data set manufacturing method based on federal learning and antagonism network generation | |
CN113901820A (en) | Chinese triplet extraction method based on BERT model | |
CN115588486A (en) | Traditional Chinese medicine diagnosis generating device based on Transformer and application thereof |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant | ||
EE01 | Entry into force of recordation of patent licensing contract |
Application publication date: 20200204 Assignee: Shanxi Shiji Beibo Information Technology Co.,Ltd. Assignor: Taiyuan University of Technology Contract record no.: X2023140000006 Denomination of invention: A method of text-to-image generation based on hybrid network model Granted publication date: 20220517 License type: Common License Record date: 20230110 |
EE01 | Entry into force of recordation of patent licensing contract |