CN110751698B - Text-to-image generation method based on hybrid network model - Google Patents


Info

Publication number
CN110751698B
CN110751698B (application CN201910923354.6A)
Authority
CN
China
Prior art keywords
image
text
title
training
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910923354.6A
Other languages
Chinese (zh)
Other versions
CN110751698A (en)
Inventor
张玲
李钢
黄晓琪
杨子固
刘剑超
王莉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Taiyuan University of Technology
Original Assignee
Taiyuan University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Taiyuan University of Technology filed Critical Taiyuan University of Technology
Priority to CN201910923354.6A priority Critical patent/CN110751698B/en
Publication of CN110751698A publication Critical patent/CN110751698A/en
Application granted granted Critical
Publication of CN110751698B publication Critical patent/CN110751698B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 11/00 - 2D [Two Dimensional] image generation
    • G06T 11/001 - Texturing; Colouring; Generation of texture or colour
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 - Image analysis
    • G06T 7/0002 - Inspection of images, e.g. flaw detection
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T 2207/30 - Subject of image; Context of image processing
    • G06T 2207/30168 - Image quality inspection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Quality & Reliability (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a text-to-image generation method based on a hybrid network model. The method forward trains the text-to-image mapping, forward trains the generator and the discriminator of the text-to-image generation model, trains the discriminator on three types of input, performs forward test training of the text-to-image generative adversarial model, trains the model to obtain its loss information, and evaluates the quality of the generated images with an image evaluation module. The quality of the generated images is markedly higher than that of images generated by a conventional text-conditioned GAN, so the method has greater practical value. It overcomes the shortcomings of existing research on text-to-image generation models based on generative adversarial networks and is better suited to text-to-image generation. The method can output high-quality, sharp images and achieves good generalization from a small amount of training data.

Description

Text-to-image generation method based on hybrid network model
Technical Field
The invention belongs to the technical field of image processing, and relates to a text-to-image generation method based on a hybrid network model.
Background
With the rapid development of artificial intelligence, generating images from text has attracted a great deal of interest. In recent years, recurrent neural network architectures have been used to learn text feature representations, and deep convolutional generative adversarial networks can generate high-quality, sharp images of specific categories, such as faces and rooms.
In the conventional text-conditioned model based on a generative adversarial network, a convolutional network is used in the discriminator to extract image features and a recurrent neural network is used to extract features of the sentence sequence. A conventional convolutional network, however, needs a large number of images for training, which is a significant limitation, whereas a capsule network can generalize from far less training data. A conventional convolutional network also copes poorly with image ambiguity, while a capsule network handles it well. In addition, in the conventional text-to-image model based on a generative adversarial network, the convolutional layers used for image feature extraction are followed by a fully connected layer, which accounts for a large share of the parameters in the network; as a result, model training is relatively slow and overfitting is relatively severe.
Based on the above, a brand-new image feature extraction method is needed to solve the problems that arise when a conventional convolutional network is used to extract image features.
Disclosure of Invention
The invention provides a text-to-image generation method based on a hybrid network model, which outputs high-quality, sharp images and achieves generalization with a small amount of training data.
The technical solution adopted by the invention is a text-to-image generation method based on a hybrid network model, comprising the following steps:
Step S1: loading the data required by the text-image adversarial model, which is based on a generative adversarial network;
Step S2: defining the text-image adversarial model, including defining the real image, the wrong image, the real title, the wrong title and the noise variable;
Step S3: in the text-image adversarial model, forward training the mapping between text and image;
Step S4: forward training the generator in the text-image adversarial model: encoding a correct title with a recurrent neural network and adding noise to the encoded vector to train the generator, obtaining a trained generator and a forged image;
Step S5: feeding three types of input to the discriminator of the text-image adversarial model (a matching-aware discriminator is used here, an improvement on the discriminator of the standard text-conditional DCGAN framework: besides judging whether a generated image is real, it also distinguishes whether an image that fails to pass is unrealistic or merely mismatched with its text): the forged-image vector with the real-title vector, the real-image vector with the forged-title vector, and the real-image vector with the real-title vector; training the discriminator on these inputs yields a trained discriminator;
Step S6: forward test training of the text-image adversarial model: encoding a real title with the recurrent neural network and adding random noise to the encoded vector to test whether the generator outputs the expected result;
Step S7: parameter definition, specifically the learning rate, the learning-rate decay, and the optimizers of the generator and the discriminator;
Step S8: training the text-image adversarial model: loading the latest checkpoint; obtaining the seeds, the noise and the sentence subscripts; obtaining the matched text, the real images, the wrong titles, the wrong images and the noise; updating the text-image mapping; updating the discriminator and the generator; and finally obtaining the round number and the loss-function information;
Step S9: saving the trained text-image adversarial model and evaluating the quality of the images it generates.
Further, in step S1, loading the data of the text-image adversarial model comprises: loading the title set and storing the processed titles in the corresponding dictionary; building the related vocabulary and recording the number of words; storing the subscripts of the associated titles in a list; randomly inspecting the list of title subscripts; loading the related images and resizing them; obtaining the number of images in the image training and test sets and the number of titles in the title training and test sets; and storing, in binary form, the vocabulary, the image training set, the image test set, the number of training titles, the number of test titles, the number of titles per image, the number of test images, the number of training images, the training subscript set and the test subscript set.
Further, in step S2, the definitions of the correct image, the wrong image, the correct title and the wrong title specifically comprise the name, type and size of the true and false images, and the name, type and size of the true and false titles.
Further, in the forward training of the text-image mapping, the real and forged images are encoded by the capsule network as follows (a minimal sketch is given after this list):
(1) the real or forged image is first fed, as a group of vectors, into the input layer of the capsule network;
(2) after simple processing in the input layer, the image passes in turn through two capsule layers, which extract its high-order features, and is then normalized by a batch-normalization layer; feature extraction and normalization are then repeated through a further capsule layer and batch-normalization layer, and once more through an identical capsule layer and batch-normalization layer; a network layer then compresses the image tensor into a vector; finally, the vector passes through a fully connected layer, each node of which is connected to all nodes of the previous layer, integrating the extracted features and outputting the final overall feature.
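For illustration only, a minimal NumPy sketch of the encoding path just described, using a highly simplified capsule layer (no dynamic routing); the layer sizes, weight shapes and function names are assumptions, not taken from the patent.

```python
import numpy as np

def squash(v, axis=-1, eps=1e-8):
    # Capsule squashing: keeps the vector's orientation, maps its length into [0, 1)
    sq_norm = np.sum(v ** 2, axis=axis, keepdims=True)
    return (sq_norm / (1.0 + sq_norm)) * v / np.sqrt(sq_norm + eps)

def capsule_layer(x, W):
    # Simplified capsule layer: one linear vote per capsule, then squash.
    # x: (batch, n_caps, in_dim); W: (n_caps, in_dim, out_dim)
    votes = np.einsum('bid,ido->bio', x, W)
    return squash(votes)

def batch_norm(x, eps=1e-5):
    mean = x.mean(axis=0, keepdims=True)
    var = x.var(axis=0, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def encode_image(img_vectors, weights):
    # img_vectors: the image reshaped into a group of vectors (batch, n_caps, dim)
    h = img_vectors
    for W in weights:                       # capsule layer + batch normalization, repeated
        h = batch_norm(capsule_layer(h, W))
    return h.reshape(h.shape[0], -1)        # compress the tensor into one feature vector
                                            # (a final fully connected layer would follow)

rng = np.random.default_rng(0)
imgs = rng.normal(size=(2, 36, 8))                                    # 2 images as 36 eight-dim vectors
ws = [0.1 * rng.normal(size=(36, 8, 16)), 0.1 * rng.normal(size=(36, 16, 16))]
print(encode_image(imgs, ws).shape)                                   # (2, 576)
```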
Further, in step S3, in the forward training of the text-image mapping, the recurrent neural network extracts a feature vector from the title in the following steps (an illustrative sketch follows the list):
(1) the title sequence is fed into an Embedding layer (the Embedding layer maps words from the semantic space to a vector space while preserving, as far as possible, their relations in the semantic space: words with similar semantics are mapped to nearby vectors, and vice versa); after processing, a three-dimensional tensor containing the batch size and embedding-dimension information is output;
(2) the output of the Embedding layer is fed into the dynamic recurrent layer (Dynamic RNN layer) of the recurrent neural network, which processes it to obtain the final network output tensor;
(3) the loss function of the recurrent neural network is calculated.
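An illustrative TensorFlow 1.x-style sketch of this title encoder (Embedding layer followed by a dynamic RNN); the vocabulary size, dimensions and variable names are assumed, not taken from the patent.

```python
import tensorflow as tf  # TensorFlow 1.x API, consistent with the ConfigProto usage in step S8

vocab_size, embed_dim, rnn_units = 8000, 256, 128      # illustrative sizes

title_ids = tf.placeholder(tf.int32, [None, None], name="title_ids")   # (batch, seq_len)

# Embedding layer: word index -> dense vector, preserving semantic proximity
embedding = tf.get_variable("embedding", [vocab_size, embed_dim])
embedded = tf.nn.embedding_lookup(embedding, title_ids)                # (batch, seq_len, embed_dim)

# Dynamic RNN layer: processes the variable-length title sequence
cell = tf.nn.rnn_cell.LSTMCell(rnn_units)
outputs, state = tf.nn.dynamic_rnn(cell, embedded, dtype=tf.float32)

title_vector = state.h                     # final hidden state used as the title feature vector
```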
further, in step S4, the forward training of the generator is as follows:
(1) encoding the correct header using a recurrent neural network;
(2) after the correct title is coded, adding corresponding noise, and processing by a generator to obtain a generated forged image; the generator updates the formula as follows:
LG←log(Sf) (1)
G←G-αδLG/δG (2)
wherein S isfThe probability of discrimination between the forged image and the real text vector is shown, and alpha is a constant factor.
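A minimal NumPy sketch of the update in equations (1) and (2); the parameter and gradient lists and the value of α are hypothetical, and in practice the optimizer defined in step S7 would perform this step.

```python
import numpy as np

def generator_update(G_params, grads, s_f, alpha=2e-4):
    # Eq. (1): L_G = log(S_f), with S_f = D(forged image, real title vector)
    # Eq. (2): G <- G - alpha * dL_G/dG, applied per parameter array
    loss_g = float(np.log(s_f + 1e-8))
    new_params = [g - alpha * dg for g, dg in zip(G_params, grads)]
    return loss_g, new_params

# toy usage with one 2x2 parameter matrix and its (hypothetical) gradient
params = [np.zeros((2, 2))]
grads = [np.ones((2, 2))]
print(generator_update(params, grads, s_f=0.3))
```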
Further, in step S5, the discriminator is trained as follows:
(1) three types of input are fed to the discriminator: the forged-image vector with the real-title vector, the real-image vector with the forged-title vector, and the real-image vector with the real-title vector; training on these yields a trained discriminator;
(2) when updating the discriminator, the relevant formulas are:
L_D ← log(S_τ) + (log(1 − S_w) + log(1 − S_f)) / 2    (3)
D ← D − α · ∂L_D/∂D    (4)
where S_τ, S_w and S_f are, respectively, the discrimination probability of the real image with the real title, of the real image with the wrong text, and of the forged image with the real text.
In the discriminator, when matching the image vector against the title vector, the similarity of two probability distributions is measured with the KL-divergence formula (a sketch is given below):
D_KL(P ‖ Q) = Σ_x P(x) · log( P(x) / Q(x) )
where P and Q are the two probability distributions and P(x) and Q(x) are the corresponding probability densities; the KL divergence is non-negative and asymmetric.
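A short NumPy sketch of the KL-divergence computation described above; the small epsilon added for numerical stability is an implementation assumption.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    # D_KL(P || Q) = sum_x P(x) * log(P(x) / Q(x)); non-negative and asymmetric
    p = np.asarray(p, dtype=np.float64)
    q = np.asarray(q, dtype=np.float64)
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.5, 0.3, 0.2])
print(kl_divergence(p, q), kl_divergence(q, p))   # two different values: the divergence is asymmetric
```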
Further, in step S6, the forward test training of the text-to-image generative adversarial model proceeds as follows: the real title is encoded and noise is added for training, yielding a trained generator; the loss functions of the generator and the discriminator are then obtained with the cross-entropy formula (a sketch is given below):
H(y, ŷ) = −Σ_i y_i · log(ŷ_i)
where y takes 3 values, namely the log probability of the discrimination of the forged image with the real-title vector, of the real image with the real title, and of the real image with the false title, and ŷ is the corresponding normalized log probability.
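One common realization of these losses, sketched in NumPy with binary cross-entropy on the three (image, title) pairings; the target labels (1 for the matching real pair, 0 for the other two) are an assumption consistent with equation (3), not stated explicitly in the patent.

```python
import numpy as np

def binary_cross_entropy(y, y_hat, eps=1e-12):
    # H(y, y_hat) = -(y * log(y_hat) + (1 - y) * log(1 - y_hat))
    y_hat = np.clip(y_hat, eps, 1.0 - eps)
    return float(-(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat)))

# The three discriminator outputs (probabilities) with their assumed targets:
s_r = 0.9   # real image + real title   -> target 1
s_w = 0.2   # real image + wrong title  -> target 0
s_f = 0.3   # forged image + real title -> target 0

d_loss = binary_cross_entropy(1, s_r) + (binary_cross_entropy(0, s_w)
                                         + binary_cross_entropy(0, s_f)) / 2
g_loss = binary_cross_entropy(1, s_f)   # the generator wants the forged pair judged as matching
print(d_loss, g_loss)
```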
Further, in step S7, the parameters are defined as follows (sketched below): define the learning rate, the learning-rate decay, the number of decay rounds and the bias; obtain the names of the related variables; define the optimizers of the generator and the discriminator, and the optimizer of the recurrent neural network.
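An illustrative TensorFlow 1.x sketch of these parameter definitions; the numeric values (learning rate, decay, Adam beta1) are assumptions.

```python
import tensorflow as tf  # TensorFlow 1.x style

lr, decay_rate, decay_rounds = 2e-4, 0.5, 100      # assumed values
beta1 = 0.5                                        # assumed Adam bias term

global_step = tf.Variable(0, trainable=False, name="global_step")
learning_rate = tf.train.exponential_decay(lr, global_step,
                                            decay_rounds, decay_rate, staircase=True)

g_optimizer = tf.train.AdamOptimizer(learning_rate, beta1=beta1)     # generator optimizer
d_optimizer = tf.train.AdamOptimizer(learning_rate, beta1=beta1)     # discriminator optimizer
rnn_optimizer = tf.train.AdamOptimizer(learning_rate, beta1=beta1)   # recurrent text-encoder optimizer
```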
Further, in step S8, training of the model begins; the preparation before training is as follows (sketched below): configure the session parameters with the ConfigProto function in TensorFlow and initialize the session with the global initialization function; load the latest checkpoint; obtain the title description sentences to be fed to the model and their subscripts; obtain the training batch size and the sample sentences and preprocess the sample sentences; update the learning rate; obtain the matched text, the correct images, the wrong titles, the wrong images and the corresponding noise.
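A sketch of this preparation step in TensorFlow 1.x, assuming the model graph (and therefore its variables) has already been built; the checkpoint directory is a hypothetical path.

```python
import tensorflow as tf  # TensorFlow 1.x style

# Session configuration via ConfigProto, then global initialization
config = tf.ConfigProto(allow_soft_placement=True)
config.gpu_options.allow_growth = True
sess = tf.Session(config=config)
sess.run(tf.global_variables_initializer())

# Load the latest checkpoint, if one exists
saver = tf.train.Saver()                              # requires the model variables to exist
ckpt = tf.train.latest_checkpoint("./checkpoint")     # hypothetical directory
if ckpt is not None:
    saver.restore(sess, ckpt)
```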
Further, in step S8, the training itself proceeds as follows (a schematic outline is given below): the text-to-image mapping is updated; while the number of training rounds is below 50, a dictionary containing the true and false images and the true and false titles is obtained, and the error of the recurrent neural network is obtained from the vector formed by its loss function and optimization function; from round 50 onward the error is set to 0 and the discriminator and generator are updated; the time is printed every fixed number of rounds; a dictionary of sample sentences and sample seeds is obtained, and the outputs of the generator network and of the recurrent neural network are combined into a vector group to obtain the generated image and the recurrent-network output; a layer of attention is added during image generation so that, when a region of the image is generated, it is associated with the corresponding sentence with a certain probability, improving generation quality; the generated images are saved to a specified directory; the model is saved, with an updated checkpoint every 10 rounds and the latest checkpoint saved and its name updated in the final round (round 100).
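A schematic outline of this staged training; every function below is a placeholder stub standing in for the real update, sampling and saving routines, so the structure (50 warm-up rounds for the text-image mapping, checkpoints every 10 rounds, 100 rounds total) is the only part taken from the description.

```python
import random
import time

# Placeholder stubs for the real routines (all hypothetical):
iterate_batches      = lambda: [(None, None, None, None, None)]
update_rnn_mapping   = lambda *a: random.random()
update_discriminator = lambda *a: random.random()
update_generator     = lambda *a: random.random()
save_sample_images   = lambda epoch: None
save_checkpoint      = lambda epoch: None

N_ROUNDS, RNN_WARMUP, SAVE_EVERY = 100, 50, 10

for epoch in range(1, N_ROUNDS + 1):
    for real_img, real_cap, wrong_img, wrong_cap, noise in iterate_batches():
        if epoch < RNN_WARMUP:
            # update the text-image mapping (recurrent encoder) against true/false images and titles
            rnn_err = update_rnn_mapping(real_img, real_cap, wrong_img, wrong_cap)
        else:
            rnn_err = 0.0                    # mapping error fixed to 0 from round 50 onward
        d_loss = update_discriminator(real_img, real_cap, wrong_img, wrong_cap, noise)
        g_loss = update_generator(real_cap, noise)

    print(f"round {epoch}  rnn_err={rnn_err:.4f}  d_loss={d_loss:.4f}  "
          f"g_loss={g_loss:.4f}  time={time.strftime('%H:%M:%S')}")
    save_sample_images(epoch)                # generated samples go to a specified directory
    if epoch % SAVE_EVERY == 0 or epoch == N_ROUNDS:
        save_checkpoint(epoch)               # latest checkpoint kept at the final round
```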
Further, in step S9, the quality of the images generated by the trained text-image adversarial model is evaluated as follows (a sketch is given below): a corresponding scoring module is built, using one of the image quality evaluation methods, the FID (Fréchet Inception Distance) score: the generated images are embedded into the feature space given by a specific layer of the Inception Net, the space is treated as a continuous multivariate Gaussian distribution, the mean and covariance of the generated data and of the real data are computed, and the resulting distance is returned as the measure of image quality.
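A sketch of the FID computation on already-extracted Inception-Net features; it assumes SciPy for the matrix square root, and the random toy data at the end merely stands in for real feature matrices.

```python
import numpy as np
from scipy import linalg

def fid_score(feats_real, feats_gen):
    # Frechet Inception Distance between two Gaussians fitted to Inception-layer features
    # feats_*: (n_samples, feature_dim) activations from the chosen Inception Net layer
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_g = np.cov(feats_gen, rowvar=False)
    covmean = linalg.sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):
        covmean = covmean.real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))

rng = np.random.default_rng(0)
print(fid_score(rng.normal(size=(64, 16)), rng.normal(loc=0.5, size=(64, 16))))
```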
Unlike the prior art, the text-to-image generation method based on a hybrid network model forward trains the text-to-image mapping, forward trains the generator and the discriminator of the text-to-image generation model, trains the discriminator on three types of input, performs forward test training of the text-to-image generative adversarial model, trains the model to obtain its loss information, and evaluates image quality with an image evaluation module. The quality of the generated images is markedly higher than that of images generated by a conventional text-conditioned GAN, so the method has greater practical value. It overcomes the shortcomings of existing research on text-to-image generation models based on generative adversarial networks and is better suited to text-to-image generation. The method can output high-quality, sharp images and achieves good generalization from a small amount of training data.
Drawings
In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings required for describing the embodiments or the prior art are briefly introduced below. It is obvious that the drawings described below show only some embodiments of the present invention, and that those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic flowchart of a text-to-image generation method based on a hybrid network model according to the present invention.
Fig. 2 shows the structure and workflow of the capsule network used in the text-to-image generation method based on a hybrid network model provided by the invention, taking handwritten digits as an example.
Fig. 3 shows the workflow of the attention-mechanism module used during image generation in the text-to-image generation method based on a hybrid network model provided by the invention.
Detailed Description
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to those embodiments. It is obvious that the described embodiments are only a part of the embodiments of the present invention, not all of them. All other embodiments obtained by a person skilled in the art from these embodiments without creative effort fall within the protection scope of the present invention.
As shown in fig. 1, the present invention provides a text-to-image generation method based on a hybrid network model, which specifically comprises the following steps:
Step S1: loading the data required by the text-image adversarial model, which is based on a generative adversarial network;
Step S2: defining the text-image adversarial model, including defining the real image, the wrong image, the real title, the wrong title and the noise variable;
Step S3: in the text-image adversarial model, forward training the mapping between text and image;
Step S4: forward training the generator in the text-image adversarial model: encoding a correct title with a recurrent neural network and adding noise to the encoded vector to train the generator, obtaining a trained generator and a forged image;
Step S5: feeding three types of input to the discriminator of the text-image adversarial model (a matching-aware discriminator is used here, an improvement on the discriminator of the standard text-conditional DCGAN framework: besides judging whether a generated image is real, it also distinguishes whether an image that fails to pass is unrealistic or merely mismatched with its text): the forged-image vector with the real-title vector, the real-image vector with the forged-title vector, and the real-image vector with the real-title vector; training the discriminator on these inputs yields a trained discriminator;
Step S6: forward test training of the text-image adversarial model: encoding a real title with the recurrent neural network and adding random noise to the encoded vector to test whether the generator outputs the expected result;
Step S7: parameter definition, specifically the learning rate, the learning-rate decay, and the optimizers of the generator and the discriminator;
Step S8: training the text-image adversarial model: loading the latest checkpoint; obtaining the seeds, the noise and the sentence subscripts; obtaining the matched text, the real images, the wrong titles, the wrong images and the noise; updating the text-image mapping; updating the discriminator and the generator; and finally obtaining the round number and the loss-function information;
Step S9: saving the trained text-image adversarial model and evaluating the quality of the images it generates.
Further, in step S1, loading the data of the text-image adversarial model comprises: loading the title set and storing the processed titles in the corresponding dictionary; building the related vocabulary and recording the number of words; storing the subscripts of the associated titles in a list; randomly inspecting the list of title subscripts; loading the related images and resizing them; obtaining the number of images in the image training and test sets and the number of titles in the title training and test sets; and storing, in binary form, the vocabulary, the image training set, the image test set, the number of training titles, the number of test titles, the number of titles per image, the number of test images, the number of training images, the training subscript set and the test subscript set.
Further, in step S2, the definitions of the correct image, the wrong image, the correct title and the wrong title specifically comprise the name, type and size of the true and false images, and the name, type and size of the true and false titles.
Further, in the forward training of the text-image mapping, the real and forged images are encoded by the capsule network; the structure and workflow of the capsule network, taking handwritten digits as an example, are shown in fig. 2. The specific process is as follows:
(1) the real or forged image is first fed, as a group of vectors, into the input layer of the capsule network;
(2) after simple processing in the input layer, the image passes in turn through two capsule layers, which extract its high-order features, and is then normalized by a batch-normalization layer; feature extraction and normalization are then repeated through a further capsule layer and batch-normalization layer, and once more through an identical capsule layer and batch-normalization layer; a network layer then compresses the image tensor into a vector; finally, the vector passes through a fully connected layer, each node of which is connected to all nodes of the previous layer, integrating the extracted features and outputting the final overall feature.
Further, in step S3, in the forward training of the text-image mapping, the recurrent neural network extracts a feature vector from the title; the workflow of the attention-mechanism module in image generation is shown in fig. 3. The steps are as follows:
(1) the title sequence is fed into an Embedding layer (the Embedding layer maps words from the semantic space to a vector space while preserving, as far as possible, their relations in the semantic space: words with similar semantics are mapped to nearby vectors, and vice versa); after processing, a three-dimensional tensor containing the batch size and embedding-dimension information is output;
(2) the output of the Embedding layer is fed into the dynamic recurrent layer (Dynamic RNN layer) of the recurrent neural network, which processes it to obtain the final network output tensor;
(3) the loss function of the recurrent neural network is calculated.
further, in step S4, the forward training of the generator is as follows:
(1) encoding the correct header using a recurrent neural network;
(2) after the correct title is coded, adding corresponding noise, and processing by a generator to obtain a generated forged image; the generator updates the formula as follows:
LG←log(Sf) (1)
G←G-αδLG/δG (2)
wherein S isfThe probability of discrimination between the forged image and the real text vector is shown, and alpha is a constant factor.
Further, in step S5, the discriminator is trained as follows:
(1) three types of input are fed to the discriminator: the forged-image vector with the real-title vector, the real-image vector with the forged-title vector, and the real-image vector with the real-title vector; training on these yields a trained discriminator;
(2) when updating the discriminator, the relevant formulas are:
L_D ← log(S_τ) + (log(1 − S_w) + log(1 − S_f)) / 2    (3)
D ← D − α · ∂L_D/∂D    (4)
where S_τ, S_w and S_f are, respectively, the discrimination probability of the real image with the real title, of the real image with the wrong text, and of the forged image with the real text.
In the discriminator, when matching the image vector against the title vector, the similarity of two probability distributions is measured with the KL-divergence formula:
D_KL(P ‖ Q) = Σ_x P(x) · log( P(x) / Q(x) )
where P and Q are the two probability distributions and P(x) and Q(x) are the corresponding probability densities; the KL divergence is non-negative and asymmetric.
Further, in step S6, the forward test training of the text-to-image generative adversarial model proceeds as follows: the real title is encoded and noise is added for training, yielding a trained generator; the loss functions of the generator and the discriminator are then obtained with the cross-entropy formula:
H(y, ŷ) = −Σ_i y_i · log(ŷ_i)
where y takes 3 values, namely the log probability of the discrimination of the forged image with the real-title vector, of the real image with the real title, and of the real image with the false title, and ŷ is the corresponding normalized log probability.
Further, in step S7, the parameters are defined as follows: define the learning rate, the learning-rate decay, the number of decay rounds and the bias; obtain the names of the related variables; define the optimizers of the generator and the discriminator, and the optimizer of the recurrent neural network.
Further, in step S8, training of the model begins; the preparation before training is as follows: configure the session parameters with the ConfigProto function in TensorFlow and initialize the session with the global initialization function; load the latest checkpoint; obtain the title description sentences to be fed to the model and their subscripts; obtain the training batch size and the sample sentences and preprocess the sample sentences; update the learning rate; obtain the matched text, the correct images, the wrong titles, the wrong images and the corresponding noise.
Further, in step S8, the training itself proceeds as follows: the text-to-image mapping is updated; while the number of training rounds is below 50, a dictionary containing the true and false images and the true and false titles is obtained, and the error of the recurrent neural network is obtained from the vector formed by its loss function and optimization function; from round 50 onward the error is set to 0; the discriminator and the generator are updated; the time is printed every fixed number of rounds; a dictionary of sample sentences and sample seeds is obtained, and the outputs of the generator network and of the recurrent neural network are combined into a vector group to obtain the generated image and the recurrent-network output; a layer of attention is added during image generation so that, when a region of the image is generated, it is associated with the corresponding sentence with a certain probability, improving generation quality (a sketch of this attention step is given below); the generated images are saved to a specified directory; the model is saved, with an updated checkpoint every 10 rounds and the latest checkpoint saved and its name updated in the final round (round 100).
Further, in step S9, the quality of the images generated by the trained text-image adversarial model is evaluated as follows: a corresponding scoring module is built, using one of the image quality evaluation methods, the FID (Fréchet Inception Distance) score: the generated images are embedded into the feature space given by a specific layer of the Inception Net, the space is treated as a continuous multivariate Gaussian distribution, the mean and covariance of the generated data and of the real data are computed, and the resulting distance is returned as the measure of image quality.
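A minimal NumPy sketch of the word-to-region attention described above (cf. fig. 3): each generated image region attends over the word vectors of the title; the dot-product scoring and the dimensions are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def word_region_attention(region_feats, word_feats):
    # region_feats: (n_regions, d) image-region features; word_feats: (n_words, d) title word vectors.
    # Returns, per region, the probability of corresponding to each word and the word context vector.
    scores = region_feats @ word_feats.T            # (n_regions, n_words) similarities
    attn = softmax(scores, axis=1)                  # per-region distribution over the sentence's words
    context = attn @ word_feats                     # (n_regions, d) word context for each region
    return context, attn

regions = np.random.rand(4, 8)                      # e.g. 4 image regions with 8-dim features
words = np.random.rand(6, 8)                        # 6 word vectors from the title encoder
ctx, attn = word_region_attention(regions, words)
print(attn.shape, ctx.shape)                        # (4, 6) attention weights, (4, 8) contexts
```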
Unlike the prior art, the text-to-image generation method based on a hybrid network model forward trains the text-to-image mapping, forward trains the generator and the discriminator of the text-to-image generation model, trains the discriminator on three types of input, performs forward test training of the text-to-image generative adversarial model, trains the model to obtain its loss information, and evaluates image quality with an image evaluation module. The quality of the generated images is markedly higher than that of images generated by a conventional text-conditioned GAN, so the method has greater practical value. It overcomes the shortcomings of existing research on text-to-image generation models based on generative adversarial networks and is better suited to text-to-image generation. The method can output high-quality, sharp images and achieves good generalization from a small amount of training data.
The above description is only for the preferred embodiment of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the present invention.

Claims (8)

1. A text-to-image generation method based on a hybrid network model, comprising the following steps:
Step S1: loading the data required by the text-image adversarial model, which is based on a generative adversarial network;
Step S2: defining the text-image adversarial model, including defining the real image, the wrong image, the real title, the wrong title and the noise variable;
Step S3: in the text-image adversarial model, forward training the mapping between text and image; in this forward training, the real and forged images are encoded by the capsule network as follows:
(1) the real or forged image is first fed, as a group of vectors, into the input layer of the capsule network;
(2) after simple processing in the input layer, the image passes in turn through two capsule layers, which extract its high-order features, and is then normalized by a batch-normalization layer; feature extraction and normalization are then repeated through a further capsule layer and batch-normalization layer, and once more through an identical capsule layer and batch-normalization layer; a network layer then compresses the image tensor into a vector; finally, the vector passes through a fully connected layer, each node of which is connected to all nodes of the previous layer, integrating the extracted features and outputting the final overall feature;
Step S4: forward training the generator in the text-image adversarial model: encoding a correct title with a recurrent neural network and adding noise to the encoded vector to train the generator, obtaining a trained generator and a forged image;
Step S5: feeding three types of input to the discriminator of the text-image adversarial model, using a matching-aware discriminator that improves on the discriminator of the standard text-conditional DCGAN framework and, besides judging whether an output image is real, also distinguishes whether a generated image that fails is unrealistic or merely mismatched with its text; the three inputs are the forged-image vector with the real-title vector, the real-image vector with the forged-title vector, and the real-image vector with the real-title vector, and training the discriminator on them yields a trained discriminator;
Step S6: forward test training of the text-image adversarial model: encoding a real title with the recurrent neural network and adding random noise to the encoded vector to test whether the generator outputs the expected result;
Step S7: parameter definition, specifically the learning rate, the learning-rate decay, and the optimizers of the generator and the discriminator;
Step S8: training the text-image adversarial model: loading the latest checkpoint; obtaining the seeds, the noise and the sentence subscripts; obtaining the matched text, the real images, the wrong titles, the wrong images and the noise; updating the text-image mapping; updating the discriminator and the generator; and finally obtaining the round number and the loss-function information;
the training of the model proceeds as follows: the text-to-image mapping is updated; while the number of training rounds is below 50, a dictionary containing the true and false images and the true and false titles is obtained, and the error of the recurrent neural network is obtained from the vector formed by its loss function and optimization function; from round 50 onward the error is set to 0; the discriminator and the generator are updated; the time is printed every fixed number of rounds; a dictionary of sample sentences and sample seeds is obtained, and the outputs of the generator network and of the recurrent neural network are combined into a vector group to obtain the generated image; a layer of attention is added during image generation so that, when a region of the image is generated, it is associated with the corresponding sentence with a certain probability, improving generation quality; the generated images are saved to a specified directory; the model is saved, with a checkpoint every 10 rounds and the latest checkpoint saved and its name updated in the final round (round 100);
Step S9: saving the trained text-image adversarial model and evaluating the quality of the images it generates, as follows: a corresponding scoring module is built, using one of the image quality evaluation methods, the FID score: the generated images are embedded into the feature space given by a specific layer of the Inception Net, the space is treated as a continuous multivariate Gaussian distribution, the mean and covariance of the generated data and of the real data are computed, and the resulting distance is returned as the measure of image quality.
2. The text-to-image generation method based on a hybrid network model according to claim 1, wherein in step S1, loading the data of the text-image adversarial model comprises: loading the title set and storing the processed titles in the corresponding dictionary; building the related vocabulary and recording the number of words; storing the subscripts of the associated titles in a list; randomly inspecting the list of title subscripts; loading the related images and resizing them; obtaining the number of images in the image training and test sets and the number of titles in the title training and test sets; and storing, in binary form, the vocabulary, the image training set, the image test set, the number of training titles, the number of test titles, the number of titles per image, the number of test images, the number of training images, the training subscript set and the test subscript set.
3. The text-to-image generation method based on a hybrid network model according to claim 1, wherein in step S2, the definitions of the correct image, the wrong image, the correct title and the wrong title specifically comprise the name, type and size of the true and false images, and the name, type and size of the true and false titles.
4. The text-to-image generation method based on a hybrid network model according to claim 1, wherein in step S3, during the forward training of the text-image mapping, the recurrent neural network extracts a feature vector from the title in the following steps:
(1) the title sequence is fed into an Embedding layer for processing, finally outputting a three-dimensional tensor containing the batch size and embedding-dimension information;
(2) the output of the Embedding layer is fed into the dynamic recurrent layer (Dynamic RNN layer) of the recurrent neural network, which processes it to obtain the final network output tensor;
(3) the loss function of the recurrent neural network is calculated.
5. The text-to-image generation method based on a hybrid network model according to claim 1, wherein in step S4, the forward training of the generator proceeds as follows:
(1) the correct title is encoded with the recurrent neural network;
(2) after the correct title has been encoded, the corresponding noise is added and the generator produces a forged image; the generator update formulas are:
L_G ← log(S_f)    (1)
G ← G − α · ∂L_G/∂G    (2)
where S_f is the discrimination probability of the forged image paired with the real text vector, and α is a constant factor.
6. The text-to-image generation method based on a hybrid network model according to claim 1, wherein in step S5, the discriminator is trained as follows:
(1) three types of input are fed to the discriminator: the forged-image vector with the real-title vector, the real-image vector with the forged-title vector, and the real-image vector with the real-title vector; training on these yields a trained discriminator;
(2) when updating the discriminator, the relevant formulas are:
L_D ← log(S_τ) + (log(1 − S_w) + log(1 − S_f)) / 2    (3)
D ← D − α · ∂L_D/∂D    (4)
where S_τ, S_w and S_f are, respectively, the discrimination probability of the real image with the real title, of the real image with the wrong text, and of the forged image with the real text;
in the discriminator, when matching the image vector against the title vector, the similarity of two probability distributions is measured with the KL-divergence formula:
D_KL(P ‖ Q) = Σ_x P(x) · log( P(x) / Q(x) )
where P and Q are the two probability distributions and P(x) and Q(x) are the corresponding probability densities; the KL divergence is non-negative and asymmetric.
7. The text-to-image generation method based on a hybrid network model according to claim 1, wherein in step S6, the forward test training of the text-to-image generative adversarial model proceeds as follows: the real title is encoded and noise is added for training, yielding a trained generator; the loss functions of the generator and the discriminator are then obtained with the cross-entropy formula:
H(y, ŷ) = −Σ_i y_i · log(ŷ_i)
where y takes 3 values, namely the log probability of the discrimination of the forged image with the real-title vector, of the real image with the real title, and of the real image with the false title, and ŷ is the corresponding normalized log probability.
8. The text-to-image generation method based on a hybrid network model according to claim 1, wherein in step S7, the parameters are defined as follows: define the learning rate, the learning-rate decay, the number of decay rounds and the bias; obtain the names of the related variables; define the optimizers of the generator and the discriminator, and the optimizer of the recurrent neural network; and in step S8, the preparation before training is as follows: configure the session parameters with the ConfigProto function in TensorFlow and initialize the session with the global initialization function; load the latest checkpoint; obtain the title description sentences to be fed to the model and their subscripts; obtain the training batch size and the sample sentences and preprocess the sample sentences; update the learning rate; obtain the matched text, the correct images, the wrong titles, the wrong images and the corresponding noise.
CN201910923354.6A 2019-09-27 2019-09-27 Text-to-image generation method based on hybrid network model Active CN110751698B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910923354.6A CN110751698B (en) 2019-09-27 2019-09-27 Text-to-image generation method based on hybrid network model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910923354.6A CN110751698B (en) 2019-09-27 2019-09-27 Text-to-image generation method based on hybrid network model

Publications (2)

Publication Number Publication Date
CN110751698A CN110751698A (en) 2020-02-04
CN110751698B true CN110751698B (en) 2022-05-17

Family

ID=69277252

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910923354.6A Active CN110751698B (en) 2019-09-27 2019-09-27 Text-to-image generation method based on hybrid network model

Country Status (1)

Country Link
CN (1) CN110751698B (en)

Families Citing this family (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111339734B (en) * 2020-02-20 2023-06-30 青岛联合创智科技有限公司 Method for generating image based on text
CN111402365B (en) * 2020-03-17 2023-02-10 湖南大学 Method for generating picture from characters based on bidirectional architecture confrontation generation network
CN111860507B (en) * 2020-07-20 2022-09-20 中国科学院重庆绿色智能技术研究院 Compound image molecular structural formula extraction method based on counterstudy
CN111968193B (en) * 2020-07-28 2023-11-21 西安工程大学 Text image generation method based on StackGAN (secure gas network)
CN112215868B (en) * 2020-09-10 2023-12-26 湖北医药学院 Method for removing gesture image background based on generation of countermeasure network
CN114359423B (en) * 2020-10-13 2023-09-12 四川大学 Text generation face method based on deep countermeasure generation network
WO2022145525A1 (en) * 2020-12-29 2022-07-07 주식회사 디자이노블 Method and apparatus for generating design based on learned condition
CN112765316B (en) * 2021-01-19 2024-08-02 东南大学 Method and device for generating image by text introduced into capsule network
CN113052784B (en) * 2021-03-22 2024-03-08 大连理工大学 Image generation method based on multiple auxiliary information
CN113298895B (en) * 2021-06-18 2023-05-12 上海交通大学 Automatic encoding method and system for unsupervised bidirectional generation oriented to convergence guarantee
CN114021558B (en) * 2021-11-10 2022-05-10 北京航空航天大学杭州创新研究院 Intelligent evaluation method for consistency of graph and text meaning based on layering
CN114648681B (en) * 2022-05-20 2022-10-28 浪潮电子信息产业股份有限公司 Image generation method, device, equipment and medium
CN115018954B (en) * 2022-08-08 2022-10-28 中国科学院自动化研究所 Image generation method, device, electronic equipment and medium
CN115546848B (en) * 2022-10-26 2024-02-02 南京航空航天大学 Challenge generation network training method, cross-equipment palmprint recognition method and system

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107862377A (en) * 2017-11-14 2018-03-30 华南理工大学 A kind of packet convolution method that confrontation network model is generated based on text image
CN109003678A (en) * 2018-06-12 2018-12-14 清华大学 A kind of generation method and system emulating text case history
CN109543200A (en) * 2018-11-30 2019-03-29 腾讯科技(深圳)有限公司 A kind of text interpretation method and device
CN109584337A (en) * 2018-11-09 2019-04-05 暨南大学 A kind of image generating method generating confrontation network based on condition capsule
CN109871888A (en) * 2019-01-30 2019-06-11 中国地质大学(武汉) A kind of image generating method and system based on capsule network

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107862377A (en) * 2017-11-14 2018-03-30 华南理工大学 A kind of packet convolution method that confrontation network model is generated based on text image
CN109003678A (en) * 2018-06-12 2018-12-14 清华大学 A kind of generation method and system emulating text case history
CN109584337A (en) * 2018-11-09 2019-04-05 暨南大学 A kind of image generating method generating confrontation network based on condition capsule
CN109543200A (en) * 2018-11-30 2019-03-29 腾讯科技(深圳)有限公司 A kind of text interpretation method and device
CN109871888A (en) * 2019-01-30 2019-06-11 中国地质大学(武汉) A kind of image generating method and system based on capsule network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
A survey of deep learning applications in image recognition (深度学习在图像识别中的应用研究综述); Zheng Yuanpan et al.; Computer Engineering and Applications (计算机工程与应用); 2019-04-19; Vol. 55, No. 12; pp. 20-36 *

Also Published As

Publication number Publication date
CN110751698A (en) 2020-02-04

Similar Documents

Publication Publication Date Title
CN110751698B (en) Text-to-image generation method based on hybrid network model
CN108875807B (en) Image description method based on multiple attention and multiple scales
CN111061843B (en) Knowledge-graph-guided false news detection method
CN110427461B (en) Intelligent question and answer information processing method, electronic equipment and computer readable storage medium
CN108830287A (en) The Chinese image, semantic of Inception network integration multilayer GRU based on residual error connection describes method
CN106650789A (en) Image description generation method based on depth LSTM network
CN109413028A (en) SQL injection detection method based on convolutional neural networks algorithm
CN114120041B (en) Small sample classification method based on double-countermeasure variable self-encoder
CN111160452A (en) Multi-modal network rumor detection method based on pre-training language model
CN112784929B (en) Small sample image classification method and device based on double-element group expansion
CN109711465A (en) Image method for generating captions based on MLL and ASCA-FR
CN111966812A (en) Automatic question answering method based on dynamic word vector and storage medium
CN113642621A (en) Zero sample image classification method based on generation countermeasure network
CN112784031B (en) Method and system for classifying customer service conversation texts based on small sample learning
CN111506709A (en) Entity linking method and device, electronic equipment and storage medium
CN110909174B (en) Knowledge graph-based method for improving entity link in simple question answering
CN109766918A (en) Conspicuousness object detecting method based on the fusion of multi-level contextual information
CN109101984B (en) Image identification method and device based on convolutional neural network
CN114332565A (en) Method for generating image by generating confrontation network text based on distribution estimation condition
CN117094325B (en) Named entity identification method in rice pest field
CN116822633B (en) Model reasoning method and device based on self-cognition and electronic equipment
CN108829675A (en) document representing method and device
CN115588487B (en) Medical image data set manufacturing method based on federal learning and antagonism network generation
CN113901820A (en) Chinese triplet extraction method based on BERT model
CN115588486A (en) Traditional Chinese medicine diagnosis generating device based on Transformer and application thereof

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
EE01 Entry into force of recordation of patent licensing contract

Application publication date: 20200204

Assignee: Shanxi Shiji Beibo Information Technology Co.,Ltd.

Assignor: Taiyuan University of Technology

Contract record no.: X2023140000006

Denomination of invention: A method of text-to-image generation based on hybrid network model

Granted publication date: 20220517

License type: Common License

Record date: 20230110

EE01 Entry into force of recordation of patent licensing contract