CN113191375B - Text-to-multi-object image generation method based on joint embedding - Google Patents
- Publication number: CN113191375B
- Application number: CN202110642098.0A
- Authority
- CN
- China
- Prior art keywords
- text
- image
- features
- word
- level
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G06V10/44 — Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections
- G06F18/2132 — Feature extraction based on discrimination criteria, e.g. discriminant analysis
- G06F18/22 — Matching criteria, e.g. proximity measures
- G06N3/044 — Recurrent networks, e.g. Hopfield networks
- G06N3/045 — Combinations of networks
- G06N3/08 — Learning methods
- G06V30/153 — Segmentation of character regions using recognition of characters or words
- Y02T10/40 — Engine management systems
Abstract
The invention discloses a text-to-multi-object image generation method based on joint embedding, belonging to the field of cross-modal text-to-image generation. The method is implemented as follows: the text description is input into a semantic encoder to obtain joint semantic features of the text and the image, and into a spatial layout encoder to obtain joint spatial features of the text and the segmentation map; both the joint semantic features and the joint spatial features comprise sentence-level and word-level components. A dynamic fusion module fuses the word-level features and the sentence-level features respectively. The fused sentence-level features are fed into the initial generator of a generative adversarial network to produce a low-resolution image, and the fused word-level features are fed into the subsequent generators to produce fine high-resolution images. A cascaded generative adversarial network consisting of several generator-discriminator pairs is constructed and trained on a dataset with a designed loss function, and the trained generators are used to generate the corresponding images.
Description
Technical Field
The invention relates to a text-to-multi-object image generation method based on joint embedding, belonging to the field of cross-modal text-to-image generation.
Background
In daily life, a single form of data is often insufficient to convey information, and we frequently need to express it through multi-modal data; for example, when describing a thing, we usually state it with images assisted by text. However, such paired data is expensive and laborious to collect. Generation differs from retrieval: retrieval can only return existing data, whereas generation creates new data. Since collecting corresponding text and images is not an easy task, research on text-to-image generation helps address this problem. Obscure passages in textbooks are often difficult for students lacking imagination; by means of deep learning, matching images or three-dimensional scenes can be configured for such texts, and combining the text with its corresponding images and scenes helps students understand the knowledge more deeply. Generating corresponding images from textual descriptions is therefore a challenging and meaningful line of research.
For cross-modal generation problems, the key issues are the extraction of joint features and the design of the generative model. In the text-to-image task, text and images are data of two different modalities; how to derive joint text-image features from the input, and how to design a reasonable model to generate an image, are central to solving the problem. The goal of generating images from text is to produce high-quality images that are reasonable in shape, color, and layout and that conform to the text description.
Previous studies simplify the text-to-image generation task into two stages: extract joint text-image semantic features from the text, then feed the resulting feature vector into a generative network model to obtain the corresponding image. However, visual space is high-dimensional and structured, covering visual features at different levels, including high-level abstract semantics and layout features as well as low-level texture and color features. Previous methods achieve good results on text-to-single-object image generation, but they are not suited to multi-object image generation from complex text: for multi-object scenes, mapping directly from semantic features into a visual space with a reasonable layout is a very difficult challenge.
Disclosure of Invention
Aiming at the problem that generated multi-object images lack a reasonable spatial layout, the invention discloses a text-to-multi-object image generation method based on joint embedding. It provides a network framework that can generate corresponding multi-object images from textual descriptions, consisting of a semantic encoder, a spatial layout encoder, a dynamic feature fusion module, and a cascaded generative adversarial network with attention modules. The method extracts joint semantic features of text and image from the text through the semantic encoder, extracts joint spatial features of text and segmentation map from the text through the spatial layout encoder, fuses the semantic and spatial features through the dynamic fusion module, and feeds the fused features into the generative adversarial network to generate an image that conforms to the text description and has a reasonable layout. The method is convenient, widely applicable, and achieves good generation quality. The images generated from text can be applied in the cross-modal generation field to solve related engineering problems.
Such applications include multimedia education resource construction, image editing, and computer-assisted teaching.
In order to achieve the above purpose, the present invention adopts the following technical scheme.
The invention discloses a method for generating a text-to-multi-object image based on joint embedding. A dynamic fusion module fuses the word-level and sentence-level features respectively. The fused sentence-level features are fed into the initial generator of a generative adversarial network to produce a low-resolution image, and the fused word-level features are fed into the subsequent generators to produce fine high-resolution images. A cascaded generative adversarial network consisting of several generator-discriminator pairs is constructed and trained on a dataset with a designed loss function; the trained generators then generate the corresponding images. The images generated from text can be applied in the cross-modal generation field to solve related engineering problems.
The invention discloses a text-to-multi-object image generation method based on joint embedding, which comprises the following steps:
Step 1: the text description is input into a semantic encoder to obtain joint semantic features of the text and the image; the semantic encoder is implemented by a text encoder pre-trained within the semantic encoder architecture.
The semantic encoder architecture of step 1 includes a text encoder and an image encoder. Within this architecture, the pre-trained image encoder guides the training of the text encoder, and the trained text encoder can extract joint text-image semantic features from the text.
The trained text encoder extracts joint text-image semantic features from the text as follows. The image encoder is pre-trained on the ImageNet dataset, after which it can obtain semantic features from an image. By minimizing the distance between the text features and the image semantic features, the text encoder is forced to extract joint text-image semantic features from the text. Since an image encoder pre-trained on ImageNet is not fully suited to other datasets, part of the image encoder's network layers are fine-tuned while the text encoder is optimized: the two encoders are trained and optimized synchronously by minimizing the distance between the text features and the image semantic features, until joint text-image semantic features can be extracted from the text. The text features include a sentence-level feature e_I and word-level features w_I; the image semantic features likewise include a global feature g_I and local features v_I. The distance between text features and image semantic features is therefore measured at both the sentence level and the word level.
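The training signal of this step — pulling the text features toward the pre-trained image encoder's features — can be sketched minimally as follows. The mean-squared distance here is an illustrative stand-in chosen for the sketch; the patent's actual word- and sentence-level losses are defined in the following paragraphs:

```python
import numpy as np

def joint_embedding_distance(e_text, g_image):
    # Illustrative distance between a paired sentence-level text feature
    # e_I and a global image semantic feature g_I; minimising it (while the
    # image encoder is frozen or only partly fine-tuned) forces the text
    # encoder into the joint text-image embedding space.
    e, g = np.asarray(e_text, dtype=float), np.asarray(g_image, dtype=float)
    return float(np.mean((e - g) ** 2))
```

In training, this distance would be minimized over mini-batches with respect to the text encoder's parameters (and the fine-tuned layers of the image encoder).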
A word-level loss L_w and a sentence-level loss L_s are used to optimize the semantic encoder architecture, whose loss function L_I is defined as L_I = L_w + L_s.
The sentence-level loss L_s in the semantic encoder architecture is computed from the dot product between the paired image global semantic feature g_I and the text sentence feature e_I, i.e. from the matching score g_I · e_I.
To calculate the word-level loss L_w, the similarity between the text features and the image semantic features at the word level is computed following the similarity calculation of DAMSM (Deep Attentional Multimodal Similarity Model), implemented as follows:
The similarity matrix s between the words of the text and the image sub-regions is computed as s = w_I^T v_I, where w_I denotes the word features of the text and v_I denotes the sub-region features of the image.
From the similarity matrix s, the degree of similarity between each word and each image block is obtained: s_{i,j} represents the similarity of the i-th word to the j-th image block, where T denotes the number of words in the text.
By introducing an attention module, a region context vector c_i is computed for each word in the text. The region context vector c_i is a dynamic representation of the image sub-regions associated with the i-th word in the sentence, i.e. the weighted sum of the visual features of all image sub-regions:

c_i = sum_{j=1}^{N} alpha_{i,j} v_j,

where N denotes the number of image sub-region blocks, v_j denotes the j-th sub-region feature of the image, and alpha_{i,j} is the attention weight of the i-th word on the j-th sub-region (a softmax over the sub-regions of the normalized similarities s_{i,j}). The similarity between the text and the local image features is then computed from the region context vectors and the word vectors as the cosine similarity

R(c_i, w_i) = (c_i · w_i) / (||c_i|| ||w_i||)

between the region context vector c_i of the image sub-regions associated with the i-th word and the word feature w_i of the i-th word. The trained text encoder can thus extract joint text-image semantic features from the text.
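The word-level computation above (similarity matrix, attention weights, region context vectors, cosine match) can be sketched in NumPy as follows. The softmax temperatures gamma1/gamma2 and the exact normalization order follow the published DAMSM formulation and are assumptions, not taken verbatim from the patent:

```python
import numpy as np

def word_level_similarity(w, v, gamma1=5.0, gamma2=5.0):
    """Sketch of DAMSM-style word-level text-image similarity.

    w : (T, D) word features w_I, one row per word.
    v : (N, D) image sub-region features v_I, one row per region.
    Returns the region context vectors c (T, D) and a pooled
    word-level match score R between the text and the image.
    """
    # Similarity matrix: s[i, j] = dot product of word i and region j.
    s = w @ v.T                                            # (T, N)
    # Normalise over words, then soften over regions (DAMSM convention).
    s_bar = np.exp(s) / np.exp(s).sum(axis=0, keepdims=True)
    alpha = np.exp(gamma1 * s_bar)
    alpha = alpha / alpha.sum(axis=1, keepdims=True)       # rows sum to 1
    # Region context vector per word: weighted sum of region features.
    c = alpha @ v                                          # (T, D)
    # Cosine similarity R(c_i, w_i) per word, pooled by log-sum-exp.
    cos = (c * w).sum(axis=1) / (
        np.linalg.norm(c, axis=1) * np.linalg.norm(w, axis=1) + 1e-8)
    R = np.log(np.exp(gamma2 * cos).sum()) / gamma2
    return c, R
```

The same function applies unchanged in step 2, with segmentation-map sub-region features v_S in place of the image features.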
Step 2: the text is input into a spatial layout encoder to obtain joint spatial features of the text and the segmentation map; the spatial layout encoder is implemented by a text encoder pre-trained within the spatial encoder architecture.
The spatial encoder architecture of step 2 consists of a text encoder and a segmentation map encoder. Within this architecture, the pre-trained segmentation map encoder guides the training of the text encoder, and the trained text encoder can extract joint spatial layout features of the text and the segmentation map from the text.
The trained text encoder extracts joint spatial layout features of the text and the segmentation map from the text as follows. The segmentation map encoder is pre-trained on the ImageNet dataset, after which it can obtain spatial layout features from a segmentation map. By minimizing the distance between the text features and the segmentation map features, the text encoder is forced to extract joint text-segmentation-map spatial layout features from the text. Since a segmentation map encoder pre-trained on ImageNet is not fully suited to other datasets, part of the segmentation map encoder's network layers are fine-tuned while the text encoder is optimized: the two encoders are trained and optimized synchronously by minimizing the distance between the text features and the segmentation map features, until joint spatial layout features can be extracted from the text. The text features include a sentence-level feature e_S and word-level features w_S; the spatial layout features likewise include a global feature g_S and local features v_S. The distance between text features and spatial layout features therefore includes a sentence-level distance and a word-level distance.
A word-level loss and a sentence-level loss are used to optimize the spatial layout encoder architecture, whose loss function L_S is defined as L_S = L_w + L_s.
The sentence-level loss in the spatial encoder architecture is computed from the dot product between the paired global spatial feature g_S of the segmentation map and the text sentence feature e_S.
To calculate the word-level loss, the similarity between the text features and the spatial features of the segmentation map at the word level is computed following the similarity calculation of DAMSM (Deep Attentional Multimodal Similarity Model), implemented as follows:
The similarity matrix s between the words of the text and the segmentation-map sub-regions is computed as s = w_S^T v_S, where w_S denotes the word features of the text and v_S denotes the sub-region features of the segmentation map.
From the similarity matrix s, s_{i,j} represents the similarity of the i-th word to the j-th segmentation-map block, where T denotes the number of words in the text.
By introducing an attention module, a region context vector c_i is computed for each word in the text. The region context vector c_i is a dynamic representation of the segmentation-map sub-regions associated with the i-th word in the sentence, i.e. the weighted sum of the visual features of all segmentation-map sub-regions:

c_i = sum_{j=1}^{N} alpha_{i,j} v_j,

where N denotes the number of segmentation-map sub-region blocks. The similarity between the text and the local segmentation-map features is then computed from the region context vectors and the word vectors as the cosine similarity R(c_i, w_i) = (c_i · w_i) / (||c_i|| ||w_i||) between the region context vector c_i of the segmentation-map sub-regions associated with the i-th word and the word feature w_i of the i-th word. The trained text encoder can thus extract joint spatial features of the text and the segmentation map from the text.
Step 3: the semantic features from step 1 and the spatial features from step 2, each comprising sentence-level and word-level features, are fused by the dynamic fusion module. The fused sentence-level features are fed into the initial generator of the generative adversarial network to generate a low-resolution image. The fused word-level features are then used in the subsequent generators of the generative adversarial network to constrain the generation of the high-resolution images.
In step 3, a concat operation fuses the sentence-level semantic feature e_I and spatial feature e_S as the input of the initial-stage generator:

e_{I-S} = concat(e_I, e_S, z)

where z denotes a random noise vector.
In step 3, the word-level semantic features w_I and spatial features w_S are fused by dynamic fusion, and the resulting features are fed into the higher-stage generators to generate high-resolution images. The implementation is as follows:
The word-level semantic features w_I and spatial features w_S are each dynamically fused with the image features h_{k-1} obtained in the previous stage, and the two resulting features are re-fused by dynamic fusion to serve as the semantic-spatial constraint of the subsequent-stage generator:

r_I = DF(h_{k-1}, w_I),   r_S = DF(h_{k-1}, w_S),
w_{I-S} = r_I * w_I + r_S * w_S,

where DF(·) denotes a multilayer perceptron. DF(·) first derives a fusion ratio r_I from the previous-stage image features h_{k-1} and the semantic features w_I, then a fusion ratio r_S from h_{k-1} and the spatial features w_S; finally the semantic and spatial features of the k-th stage are fused with these ratios. The fused word-level features ensure that high-resolution image generation in the higher-stage generators is constrained in both the semantic and the spatial-layout aspects.
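Under the description above, the dynamic fusion DF(·) can be sketched as a gating network. The single learned projection per branch and the weighted-sum re-fusion are assumptions for illustration; the patent only states that DF(·) is a multilayer perceptron producing fusion ratios:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dynamic_fuse(h_prev, w_sem, w_spa, P_sem, P_spa):
    # DF(.) reduced to one learned projection P per branch (assumption):
    # each fusion ratio is derived from the previous-stage image feature
    # h_{k-1} together with one of the word features, then the two gated
    # word features are re-fused by a weighted sum.
    r_sem = sigmoid(np.concatenate([h_prev, w_sem]) @ P_sem)  # ratio for w_I
    r_spa = sigmoid(np.concatenate([h_prev, w_spa]) @ P_spa)  # ratio for w_S
    return r_sem * w_sem + r_spa * w_spa
```

In a trained network, P_sem and P_spa (hypothetical names) would be learned jointly with the generators, letting each stage decide how much semantic versus spatial guidance it takes.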
Step 4: the sentence-level feature e_{I-S} fused in step 3 is fed into the initial generator of the cascaded generative adversarial network to generate a low-resolution image. An attention module processes the word-level semantic features w_I and spatial features w_S obtained in step 2 to yield word-level attention semantic features c_I and attention spatial features c_S; their dynamic fusion gives the word-level attention semantic-spatial constraint c_{I-S}. The image features h_{k-1} produced by the previous-stage generator and the constraint c_{I-S} are fed together into the subsequent generator to generate a finer, higher-resolution image.
The word-level attention semantic-spatial constraint c_{I-S} of step 4 is generated as follows: the word-level semantic features w_I and spatial features w_S are each dynamically fused with the previous-stage image features h_{k-1}, and the fused features are used as the value vectors of the attention module. The attention module then generates the word-level attention semantic features c_I and attention spatial features c_S, where the key vectors equal the value vectors and the query vectors are the previous-stage image features h_{k-1}. Finally, a dynamic fusion module fuses the two into the word-level attention semantic-spatial constraint c_{I-S}.
Using the attention module proposed by AttnGAN, the image features h generated in the previous stage serve as the query vectors, and the word-level spatial features w_S serve as the key and value vectors. An attention spatial feature is generated for each image block from the word-level spatial features:

c_j = sum_{i=1}^{T} beta_{j,i} w_i^S,   beta_{j,i} = softmax_i(h_j · w_i^S),

where beta_{j,i} represents the degree of attention of the j-th image block to the i-th word-level spatial feature; the attention spatial feature c_S for the image is synthesized by this weighted sum of the word-level spatial features. Substituting the word-level semantic features w_I for the key and value vectors w_S in Attn(h, w_S) gives Attn(h, w_I), which likewise synthesizes an attention semantic feature c_I for the image. The attention spatial feature c_S and attention semantic feature c_I are then fused with the word-level fusion of step 3:

c_{I-S} = r_I * c_I + r_S * c_S.
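A minimal NumPy sketch of this AttnGAN-style attention step, with image-block features as queries and word-level features as keys and values; the dot-product score and the softmax over words are the standard formulation and are assumed here:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attn(h, w):
    # h: (num_blocks, D) image features from the previous stage (queries)
    # w: (T, D) word-level features, w_S or w_I (keys and values)
    # beta[j, i]: attention of image block j to word i
    beta = softmax(h @ w.T, axis=1)
    # one attention feature per image block: weighted sum of word features
    return beta @ w
```

Calling attn(h, w_S) gives the attention spatial feature c_S and attn(h, w_I) the attention semantic feature c_I, which are then fused as in step 3.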
In step 4, the sentence-level feature e_{I-S} fused in step 3 is the input of the initial-stage generator, which produces a low-resolution image:

h_0 = F_0(e_{I-S})
The image features obtained in the previous stage and the dynamically fused word-level attention semantic-spatial constraint c_{I-S} are the inputs of the subsequent generator, which produces a high-resolution image. Since c_{I-S} belongs to the high-level features of the image, while low-stage generation may attend more to low-level features such as color, texture, and structure, the two kinds of high-level information (semantics and spatial layout) cannot be guaranteed to provide good guidance at every stage; the network therefore decides whether the current generation stage needs higher-level or middle/low-level information. The image features h_{k-1} of the previous stage are fused with the constraint information of the k-th stage and sent to the module F_k to produce the image features h_k; finally, the image features are sent to the corresponding generator G_k to generate the high-resolution image:

h_k = F_k(h_{k-1}, c_{I-S}),   x_k = G_k(h_k),

where F_k(·) denotes the image-feature generation module of the k-th stage and G_k(·) denotes the k-th generator.
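The staged generation h_k = F_k(h_{k-1}, c_{I-S}) followed by G_k(h_k) can be sketched as a plain loop; F0, Fs, and Gs below are placeholders for the feature and generator modules (assumptions for illustration, not the patent's networks):

```python
def cascade_generate(e_IS, constraints, F0, Fs, Gs):
    # e_IS: fused sentence-level feature e_{I-S} for the initial stage
    # constraints: word-level semantic-spatial constraints c_{I-S},
    #              one per later stage
    # F0: initial feature module; Fs: later feature modules;
    # Gs: one generator per stage
    h = F0(e_IS)                 # h_0 = F_0(e_{I-S})
    images = [Gs[0](h)]          # coarse, low-resolution image
    for F_k, G_k, c in zip(Fs, Gs[1:], constraints):
        h = F_k(h, c)            # h_k = F_k(h_{k-1}, c_{I-S})
        images.append(G_k(h))    # finer, higher-resolution image
    return images
```

Each later stage refines the previous feature map under its constraint, so every generator in the cascade emits an image at its own resolution.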
Step 5: a cascaded generative adversarial network consisting of several generator-discriminator pairs is constructed, and the network is trained on a dataset with a designed loss function, i.e. trained by minimizing the loss function.
To optimize the overall neural network model, training alternates between the generator loss L_G and the discriminator loss L_D, defined as:

L_G = L_advG + λ_1 L_DAMSM,   L_D = L_advD.
Here L_advG is defined as

L_advG = -(1/2) E_{x̂ ~ p_G}[log D(x̂)] - (1/2) E_{x̂ ~ p_G}[log D(x̂, s)],

where p_G denotes the distribution of generated images, s the input text description, and D(·) the discriminator of the generative adversarial network. The first term drives the generated images to be as realistic as possible, and the second term drives them to match the input text. The DAMSM loss L_DAMSM of step 2 computes the local and global similarity of the input text and the generated image; minimizing it keeps the generated image semantically consistent with the input text.
L_advD is defined as

L_advD = -(1/2) E_{x ~ p_data}[log D(x)] - (1/2) E_{x̂ ~ p_G}[log(1 - D(x̂))]
         - (1/2) E_{x ~ p_data}[log D(x, s)] - (1/2) E_{x̂ ~ p_G}[log(1 - D(x̂, s))],

where p_data denotes the distribution of real images. The unconditional terms distinguish generated images from real ones, and the conditional terms judge whether the input image and text match.
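Assuming the discriminator outputs probabilities in (0, 1), the alternating losses above can be sketched in NumPy as follows — a sketch under the standard conditional-GAN reading of the unconditional and conditional terms, not the patent's exact code:

```python
import numpy as np

def generator_loss(d_fake_uncond, d_fake_cond, l_damsm, lam1=1.0):
    # L_G = L_advG + lambda_1 * L_DAMSM: the unconditional term pushes
    # generated images to look real, the conditional term to match the text.
    l_advG = (-0.5 * np.mean(np.log(d_fake_uncond))
              - 0.5 * np.mean(np.log(d_fake_cond)))
    return l_advG + lam1 * l_damsm

def discriminator_loss(d_real_uncond, d_fake_uncond, d_real_cond, d_fake_cond):
    # L_advD: distinguish real from generated images (unconditional terms)
    # and matched from mismatched image-text pairs (conditional terms).
    return (-0.5 * np.mean(np.log(d_real_uncond))
            - 0.5 * np.mean(np.log(1.0 - d_fake_uncond))
            - 0.5 * np.mean(np.log(d_real_cond))
            - 0.5 * np.mean(np.log(1.0 - d_fake_cond)))
```

Training would alternate: one step minimizing discriminator_loss over the discriminator's parameters, one step minimizing generator_loss over the generators'.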
Step 6: the word features and sentence features obtained in steps 1 and 2 are fused as in step 3, fed into the cascaded generative adversarial network as described in step 4, and the generators are trained as described in step 5; the trained generators produce images at different resolutions.
The method further comprises step 7: applying the images generated in step 6, which conform to the text description, in the cross-modal generation field to solve related engineering problems.
The related engineering problems of step 7 include multimedia education resource construction, image editing, and computer-assisted teaching.
Beneficial effects:
1. The disclosed text-to-multi-object image generation method based on joint embedding provides a network framework that generates corresponding multi-object images from textual descriptions, consisting of a semantic encoder, a spatial layout encoder, a dynamic feature fusion module, and a cascaded generative adversarial network with attention modules. Joint text-image semantic features and joint text-segmentation-map spatial features are extracted from the text by the two encoders, fused by the dynamic fusion module, and fed into the generative adversarial network to generate an image conforming to the text description. Because generation of the high-resolution image is constrained in both semantics and spatial layout, the generated images conform to the text description, have a reasonable spatial layout, and look visually realistic.
2. Existing ways of obtaining images paired with text mostly rely on retrieval, but existing images often lack matching text descriptions, and manually creating a corresponding image requires professional image-editing software. With the disclosed method, one simply inputs a text description and the trained network quickly generates the corresponding image, which is convenient and fast.
3. Visual space is high-dimensional and structured, covering visual features at different levels, including high-level abstract semantics and layout features as well as low-level texture and color features. Existing research simplifies text-to-image generation into two stages: extract joint text-image semantic features from the text, then feed the feature vector into a generative network model to obtain the corresponding image. Mapping directly from semantic features into a visual space with a reasonable layout is very difficult. The disclosed method additionally extracts spatial layout features from the text and constrains image generation with both semantic and spatial features, overcoming the unreasonable layouts of previous methods and achieving good generation quality.
4. The disclosed method applies the generated multi-object images, which conform to the text description, in the cross-modal generation field to solve related engineering problems, for example multimedia education resource construction, image editing, and computer-assisted teaching.
Drawings
FIG. 1 is a flow chart of an implementation of the joint embedding-based text-to-multi-object image generation method of the present invention;
FIG. 2 is a diagram of the overall network architecture of the present invention;
FIG. 3 is a block diagram of an encoder architecture according to the present invention, where FIG. 3 (a) is a semantic encoder architecture and FIG. 3 (b) is a spatial encoder architecture;
FIG. 4 is a block diagram of a dynamic fusion module according to the present invention;
FIG. 5 is an exemplary graph of the results of the generation on the MSCOCO data set in the present invention;
FIG. 6 is an exemplary view of an image generated on an MSCOCO data set according to the present invention.
Detailed Description
Embodiments of the present invention will be described below with reference to the drawings.
As shown in fig. 1, the joint-embedding-based text-to-multi-object image generation method disclosed in this embodiment can support education-related applications on the MSCOCO dataset: for example, given an input text description, the model generates a corresponding image, and the knowledge contained in the text can be understood by combining the multi-object image with the text. The training and image generation flow of this embodiment is shown in fig. 1.
Step 1: the text description is input into a semantic encoder to obtain the joint semantic features of the text and the image; the semantic encoder is implemented by the text encoder pre-trained within the semantic encoder architecture.
The semantic encoder architecture described in step 1 includes a text encoder and an image encoder, as shown in fig. 3 (a). Within the semantic encoder architecture, the pre-trained image encoder guides the training of the text encoder, and the trained text encoder can extract the joint semantic features of the text and the image from the text.
The text encoder is implemented with a bidirectional long short-term memory network (LSTM): the final hidden-layer output of the model represents the text sentence feature $e_I$, while the intermediate hidden-layer outputs represent the text word features $w_I$. The image encoder is implemented with an Inception-v3 model pre-trained on the ImageNet dataset: the last layer of the model learns the global image feature $g_I$, and an intermediate convolutional layer learns the sub-region features $v_I$ of the image.
The two encoders are continuously optimized and trained by minimizing the distance between text features and image features, until the joint semantic features of text and images can be extracted from the text. The distance between the text features and the image semantic features is measured at both the sentence level and the word level, as shown in fig. 3 (a).
The semantic encoder architecture is optimized using the word-level loss $L_w^I$ and the sentence-level loss $L_s^I$; the loss function $L^I$ of the semantic encoder architecture is defined as:

$L^I = L_w^I + L_s^I$
The sentence-level loss is computed from the dot product between the paired image global semantic feature $g_I$ and text sentence feature $e_I$; the sentence-level loss $L_s^I$ in the semantic encoder architecture is defined as:

$L_s^I = -\log \dfrac{\exp(g_I \cdot e_I)}{\sum_{e'} \exp(g_I \cdot e')}$

where the sum runs over the sentence features $e'$ in the training batch.
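A batch-contrastive loss over dot products of this kind can be sketched as follows (a minimal NumPy sketch; the batch construction and the absence of a temperature factor are assumptions, since the patent's equation image is not reproduced in the text):

```python
import numpy as np

def sentence_level_loss(g, e):
    """Batch-contrastive sentence-level loss from paired global image
    features g (B, D) and sentence features e (B, D): for each image,
    the paired sentence should score highest under the dot product."""
    logits = g @ e.T                                     # (B, B) pairwise dot products
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    return -np.log(np.diag(probs)).mean()                # matched pairs on the diagonal

rng = np.random.default_rng(0)
g = rng.normal(size=(4, 8))
g /= np.linalg.norm(g, axis=1, keepdims=True)            # unit-norm features
aligned = sentence_level_loss(g, g)                      # correctly paired batch
shuffled = sentence_level_loss(g, np.roll(g, 1, axis=0)) # mismatched pairs
print(aligned < shuffled)  # True
```

Minimizing such a loss pulls paired text and image features together while pushing mismatched pairs apart, which is the training signal described for the two encoders.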
To calculate the word-level loss $L_w^I$, the method of DAMSM (Deep Attentional Multimodal Similarity Model) for computing text-image similarity is borrowed to calculate the similarity between text features and image features at the word level. The specific implementation is as follows:
The similarity matrix $s$ between the words of the text and the image sub-regions is calculated as:

$s = w_I^{\top} v_I$
where $w_I$ denotes the word features of the text and $v_I$ denotes the sub-region features of the image; $s_{i,j}$ denotes the similarity between the $i$-th word and the $j$-th image sub-region, and $T$ denotes the number of words in the text. By introducing an attention module, a region context vector $c_i$ is calculated for each word in the text. The region context vector $c_i$ is a dynamic representation of the image sub-regions associated with the $i$-th word of the sentence, i.e., the weighted sum of the visual features of all image sub-regions:

$c_i = \sum_{j=1}^{N} \alpha_{i,j} v_j, \qquad \alpha_{i,j} = \dfrac{\exp(s_{i,j})}{\sum_{k=1}^{N} \exp(s_{i,k})}$
where $N$ denotes the number of image sub-region blocks and $v_j$ is the feature of the $j$-th sub-region. The similarity between the text and the local image features is then calculated using the region context vectors and the word vectors:

$R(c_i, w_i) = \dfrac{c_i \cdot w_i}{\lVert c_i \rVert \, \lVert w_i \rVert}$

where $R(c_i, w_i)$ computes the cosine similarity of $c_i$ and $w_i$. The trained text encoder can extract the joint semantic features of the text and the image from the text.
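The word-level computation above (similarity matrix, attention-weighted region context vectors, and cosine similarity) can be sketched in NumPy as follows. This is a minimal sketch under assumed shapes ($T$ words, $N$ image sub-regions, feature dimension $D$); the softmax temperature factors used by the original DAMSM are omitted for clarity:

```python
import numpy as np

def word_region_similarity(w, v):
    """Word-level text-image similarity in the DAMSM style.

    w: (T, D) word features, v: (N, D) image sub-region features.
    Returns per-word cosine similarities R(c_i, w_i) between each
    word and its attention-weighted region context vector c_i.
    """
    s = w @ v.T                                # (T, N) similarity matrix
    s = s - s.max(axis=1, keepdims=True)       # numerical stability
    alpha = np.exp(s) / np.exp(s).sum(axis=1, keepdims=True)  # attention over regions
    c = alpha @ v                              # (T, D) region context vectors
    num = (c * w).sum(axis=1)
    den = np.linalg.norm(c, axis=1) * np.linalg.norm(w, axis=1)
    return num / den                           # cosine similarity per word

rng = np.random.default_rng(1)
T, N, D = 5, 9, 16
w = rng.normal(size=(T, D))                    # word features (illustrative)
v = rng.normal(size=(N, D))                    # image sub-region features (illustrative)
r = word_region_similarity(w, v)
print(r.shape)  # (5,)
```

The per-word similarities $R(c_i, w_i)$ would then be aggregated into the word-level loss $L_w^I$; that aggregation step is not reproduced in the patent text and is therefore omitted here.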
Step 2: the text is input into a spatial layout encoder to obtain the joint spatial features of the text and the segmentation map; the spatial layout encoder is implemented by the text encoder pre-trained within the spatial encoder architecture.
The spatial encoder architecture described in step 2 consists of a text encoder and a segmentation map encoder, as shown in fig. 3 (b). Within the spatial encoder architecture, the pre-trained segmentation map encoder guides the training of the text encoder, and the trained text encoder can extract the joint spatial layout features of the text and the segmentation map from the text.
The text encoder is implemented with a bidirectional long short-term memory network (LSTM): the final hidden-layer output of the model represents the text sentence feature $e_S$, while the intermediate hidden-layer outputs represent the text word features $w_S$. The segmentation map encoder is implemented with an Inception-v3 model pre-trained on the ImageNet dataset; considering the differences between segmentation maps and natural images, only the first few layers of the model are used here to extract segmentation features. The last used layer of the model learns the global spatial feature $g_S$ of the segmentation map, and an intermediate convolutional layer learns the sub-region features $v_S$ of the segmentation map.
The segmentation map encoder and the text encoder are continuously trained and optimized by minimizing the distance between the text features and the segmentation map features until the joint spatial layout features of the text and the segmentation map can be extracted from the text. The distance between the text feature and the spatial layout feature comprises sentence level and word level.
The spatial layout encoder architecture is optimized using the word-level loss $L_w^S$ and the sentence-level loss $L_s^S$; the loss function $L^S$ of the spatial encoder architecture is defined as:

$L^S = L_w^S + L_s^S$
The sentence-level loss is computed from the dot product between the paired segmentation map global spatial feature $g_S$ and text sentence feature $e_S$; the sentence-level loss $L_s^S$ in the spatial encoder architecture is defined as:

$L_s^S = -\log \dfrac{\exp(g_S \cdot e_S)}{\sum_{e'} \exp(g_S \cdot e')}$

where the sum runs over the sentence features $e'$ in the training batch.
To calculate the word-level loss $L_w^S$, the method of DAMSM (Deep Attentional Multimodal Similarity Model) for computing text-image similarity is borrowed to calculate the similarity between the text features and the spatial features of the segmentation map at the word level. The specific implementation is as follows:
The similarity matrix $s$ between the words of the text and the segmentation map sub-regions is calculated as:

$s = w_S^{\top} v_S$
where $w_S$ denotes the word features of the text and $v_S$ denotes the sub-region features of the segmentation map; $s_{i,j}$ denotes the similarity between the $i$-th word and the $j$-th segmentation map block, and $T$ denotes the number of words in the text. By introducing an attention module, a region context vector $c_i$ is calculated for each word in the text. The region context vector $c_i$ is a dynamic representation of the segmentation map sub-regions associated with the $i$-th word of the sentence, i.e., the weighted sum of the visual features of all segmentation map sub-regions:

$c_i = \sum_{j=1}^{N} \alpha_{i,j} v_j, \qquad \alpha_{i,j} = \dfrac{\exp(s_{i,j})}{\sum_{k=1}^{N} \exp(s_{i,k})}$
where $N$ denotes the number of segmentation map sub-region blocks. The similarity between the text and the local segmentation map features is then calculated using the region context vectors and the word vectors:

$R(c_i, w_i) = \dfrac{c_i \cdot w_i}{\lVert c_i \rVert \, \lVert w_i \rVert}$

where $R(c_i, w_i)$ computes the cosine similarity of $c_i$ and $w_i$. The trained text encoder can extract the joint spatial features of the text and the segmentation map from the text.
Step 3: the semantic features obtained in step 1 and the spatial features obtained in step 2, both comprising sentence-level and word-level features, are fused through a dynamic fusion module. The fused sentence-level features are fed into the initial generator of the generative adversarial network to generate a low-resolution image. The fused word-level features are used in the subsequent generators of the generative adversarial network to constrain the generation of high-resolution images.
In step 3, a concat operation is adopted to fuse the sentence-level semantic feature $e_I$ and spatial feature $e_S$ as the input to the initial-stage generator, as shown in fig. 4. The specific implementation formula is:

$e_{I-S} = \mathrm{concat}(e_I, e_S, z)$
where z represents a random noise vector.
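The sentence-level fusion above is a plain concatenation; a one-line NumPy sketch (the feature and noise dimensions are illustrative assumptions, not values stated in the patent):

```python
import numpy as np

rng = np.random.default_rng(2)
e_i = rng.normal(size=256)   # sentence-level semantic feature e_I (assumed dim)
e_s = rng.normal(size=256)   # sentence-level spatial feature e_S (assumed dim)
z = rng.normal(size=100)     # random noise vector z

# e_{I-S} = concat(e_I, e_S, z): the input to the initial-stage generator
e_is = np.concatenate([e_i, e_s, z])
print(e_is.shape)  # (612,)
```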
In step 3, the word-level semantic features $w_I$ and spatial features $w_S$ are fused in a dynamic fusion manner, and the resulting features are fed into the high-stage generators to generate high-resolution images. The specific implementation method is as follows:
The word-level semantic features $w_I$ and spatial features $w_S$ are each dynamically fused with the image features $h_{k-1}$ obtained at the previous stage, and the two features obtained in this way are fused again by dynamic fusion to serve as the semantic-spatial constraint for the generator of the subsequent stage. The formula is:

$\gamma_I = DF(h_{k-1}, w_I), \qquad \gamma_S = DF(h_{k-1}, w_S), \qquad w_{I-S} = \gamma_I \odot w_I + \gamma_S \odot w_S$

where $DF(\cdot)$ denotes a multilayer perceptron. First, $DF(\cdot)$ obtains a fusion ratio $\gamma_I$ from the previous-stage image features $h_{k-1}$ and the semantic features $w_I$; then $DF(\cdot)$ determines a fusion ratio $\gamma_S$ from the previous-stage image features $h_{k-1}$ and the spatial features $w_S$; finally, the fused feature $w_{I-S}$ combining the semantic and spatial features of the $k$-th stage is obtained. The fused word-level feature $w_{I-S}$ ensures that, in the high-stage generators, high-resolution image generation is constrained from both the semantic and the spatial layout aspects.
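The dynamic fusion $DF(\cdot)$ described above can be sketched as a small perceptron that predicts a fusion ratio from the previous-stage image features and each word-level feature. The gating form below (a sigmoid ratio followed by a gated sum) is an assumption consistent with the description, not the patent's exact layer configuration:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fusion_ratio(h, w, W1, W2):
    """DF(.): a one-hidden-layer perceptron mapping (h, w) to a
    per-dimension fusion ratio in (0, 1)."""
    x = np.concatenate([h, w])
    return sigmoid(W2 @ np.tanh(W1 @ x))

def dynamic_fuse(h, w_i, w_s, params):
    """Fuse word-level semantic (w_i) and spatial (w_s) features,
    each conditioned on the previous-stage image features h."""
    gamma_i = fusion_ratio(h, w_i, *params["sem"])
    gamma_s = fusion_ratio(h, w_s, *params["spa"])
    # re-fuse the two gated features as the semantic-spatial constraint
    return gamma_i * w_i + gamma_s * w_s

rng = np.random.default_rng(3)
D = 32  # assumed feature dimension
make = lambda: (rng.normal(size=(64, 2 * D)) * 0.1, rng.normal(size=(D, 64)) * 0.1)
params = {"sem": make(), "spa": make()}
h, w_i, w_s = (rng.normal(size=D) for _ in range(3))
w_is = dynamic_fuse(h, w_i, w_s, params)
print(w_is.shape)  # (32,)
```

Because the ratios are recomputed from $h_{k-1}$ at every stage, the fusion is "dynamic": each stage can weight semantic versus spatial information differently.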
Step 4: the sentence-level feature $e_{I-S}$ obtained by fusion in step 3 is fed into the initial generator of the cascaded generative adversarial network to generate a low-resolution image. An attention module processes the word-level semantic features $w_I$ and spatial features $w_S$ obtained in steps 1 and 2 to obtain the word-level attention semantic features $c_I$ and attention spatial features $c_S$, and the fusion method of step 3 is used to calculate the word-level attention semantic-spatial constraint $c_{I-S}$. The image features $h_{k-1}$ generated by the previous-stage generator and the word-level attention semantic-spatial constraint $c_{I-S}$ are fed together into the subsequent generator to generate a finer high-resolution image.
The specific implementation of generating the word-level attention semantic-spatial constraint $c_{I-S}$ with the attention module in step 4 is as follows: the word-level semantic features $w_I$ and spatial features $w_S$ are each dynamically fused with the previous-stage image features $h_{k-1}$, and the fused features are used as the value vectors in the attention module. The attention module generates the word-level attention semantic features and attention spatial features separately, with the value vectors identical to the key vectors and the previous-stage image features $h_{k-1}$ used as the query vectors. Finally, the dynamic fusion module fuses the two to obtain the word-level attention semantic-spatial constraint $c_{I-S}$.
Using the attention module proposed by AttnGAN, the image features $h$ generated at the previous stage are regarded as the query vectors, and the word-level spatial features $w_S$ as the key and value vectors. An attention spatial feature is generated for each image sub-region from the word-level spatial features, $c_S = Attn(h, w_S)$, where the $j$-th component is defined as:

$c_j^S = \sum_{i=1}^{T} \beta_{j,i} w_i^S, \qquad \beta_{j,i} = \dfrac{\exp(h_j^{\top} w_i^S)}{\sum_{k=1}^{T} \exp(h_j^{\top} w_k^S)}$

where $\beta_{j,i}$ represents the degree of attention of the $j$-th image sub-region to the $i$-th word-level spatial feature; the attention spatial feature $c_S$ for the image is synthesized by the weighted sum of the word-level spatial features. Replacing the key and value vectors $w_S$ in $Attn(h, w_S)$ with the word-level semantic features $w_I$ yields $Attn(h, w_I)$, which likewise synthesizes an attention semantic feature $c_I$ for the image. The word-level feature fusion manner of step 3 is then adopted to fuse the attention spatial feature $c_S$ and the attention semantic feature $c_I$:

$c_{I-S} = DF(h, c_I) \odot c_I + DF(h, c_S) \odot c_S$
as shown in fig. 2, the sentence-level feature e obtained by the fusion in step 3 is used in step 4 I-S As an input to the initial stage generator, a low resolution image is generated, concretely implemented as:
h 0 =F 0 (e I-S )
The image features obtained at the previous stage and the word-level attention semantic-spatial constraint $c_{I-S}$ obtained by dynamic fusion are used as the input to the subsequent generators to generate high-resolution images. Considering that $c_{I-S}$ belongs to the high-level features of the image, generation at low stages may pay more attention to low-level features such as color, texture, and structure, so the two kinds of high-level information, semantics and spatial layout, cannot necessarily provide good guidance at every stage. As shown in fig. 4, the network itself decides whether the current generation stage needs higher-level or mid-/low-level information, obtaining the constraint information $\hat{c}_{I-S}^{k}$ of the $k$-th stage; the image features $h_{k-1}$ obtained at the previous stage are fused with $\hat{c}_{I-S}^{k}$ and sent into the $F_k$ module to generate the image features $h_k$. Finally, the image features are sent to the corresponding generator $G_k$ to generate the image, as shown in fig. 2. The specific implementation formula is:

$h_k = F_k(h_{k-1}, \hat{c}_{I-S}^{k}), \qquad x_k = G_k(h_k)$

where $F_k(\cdot)$ denotes the image-feature generation module of the $k$-th stage and $G_k(\cdot)$ denotes the $k$-th generator.
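The cascaded loop $h_k = F_k(h_{k-1}, c), \; x_k = G_k(h_k)$ can be sketched as a stage-by-stage pipeline. The stub modules below (simple nonlinear projections, with illustrative resolutions) are placeholders for the patent's actual $F_k$/$G_k$ architectures, which are not specified in the text:

```python
import numpy as np

rng = np.random.default_rng(5)
D, C = 64, 32  # assumed stage-feature and constraint dimensions

def make_stage(d_in, d_out):
    """A stub F_k: fuse (h_{k-1}, constraint) into new stage features."""
    W = rng.normal(size=(d_out, d_in)) * 0.05
    return lambda h, c: np.tanh(W @ np.concatenate([h, c]))

def make_generator(d_in, n_pix):
    """A stub G_k: project stage features to an 'image' of n_pix values."""
    W = rng.normal(size=(n_pix, d_in)) * 0.05
    return lambda h: np.tanh(W @ h)

e_is = rng.normal(size=D)              # stands in for concat(e_I, e_S, z)
W0 = rng.normal(size=(D, D)) * 0.05
h = np.tanh(W0 @ e_is)                 # h_0 = F_0(e_{I-S})

stages = [make_stage(D + C, D) for _ in range(2)]
gens = [make_generator(D, n) for n in (16 * 16, 32 * 32, 64 * 64)]  # toy resolutions
c_is = rng.normal(size=C)              # word-level attention semantic-spatial constraint

images = [gens[0](h)]                  # low-resolution image from the initial stage
for F, G in zip(stages, gens[1:]):
    h = F(h, c_is)                     # h_k = F_k(h_{k-1}, c_{I-S})
    images.append(G(h))                # x_k = G_k(h_k): progressively finer images
print([img.size for img in images])  # [256, 1024, 4096]
```

Each later stage refines the previous stage's features under the same semantic-spatial constraint and emits an image at a higher resolution, matching the cascade in fig. 2.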
Step 5: a cascaded generative adversarial network consisting of multiple pairs of generators and discriminators is constructed, and the generative adversarial network is trained on the dataset with a designed loss function, i.e., trained to minimize the loss function.
To optimize the overall neural network model, the generator loss $L_G$ and the discriminator loss $L_D$ are trained alternately. The generator loss function $L_G$ and the discriminator loss function $L_D$ are defined as follows:
$L_G = L_{advG} + \lambda_1 L_{DAMSM}, \qquad L_D = L_{advD}$
where $L_{advG}$ is defined as:

$L_{advG} = -\tfrac{1}{2}\,\mathbb{E}_{\hat{x} \sim p_G}[\log D(\hat{x})] - \tfrac{1}{2}\,\mathbb{E}_{\hat{x} \sim p_G}[\log D(\hat{x}, s)]$
$p_G$ represents the data distribution of the generated images, $s$ represents the input text description, and $D(\cdot)$ represents the discriminator in the generative adversarial network. The first term ensures that the generated image is as realistic as possible, and the second term ensures that the generated image matches the input text. The DAMSM loss $L_{DAMSM}$ from step 2 is used to calculate the local and global similarity between the input text and the generated image, and minimizing this loss keeps the generated image semantically consistent with the input text.
$L_{advD}$ is defined as:

$L_{advD} = -\tfrac{1}{2}\,\mathbb{E}_{x \sim p_{data}}[\log D(x)] - \tfrac{1}{2}\,\mathbb{E}_{\hat{x} \sim p_G}[\log(1 - D(\hat{x}))] - \tfrac{1}{2}\,\mathbb{E}_{x \sim p_{data}}[\log D(x, s)] - \tfrac{1}{2}\,\mathbb{E}_{\hat{x} \sim p_G}[\log(1 - D(\hat{x}, s))]$
$p_{data}$ represents the data distribution of the real images; the first (unconditional) terms in the formula are used to distinguish generated images from real images, and the second (conditional) terms judge whether the input image and text match.
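The adversarial terms can be illustrated numerically: the generator term rewards the discriminator scoring generated images (and image-text pairs) as real, while the discriminator term separates real from generated and matched from mismatched. The sketch below uses scalar discriminator probabilities and omits the DAMSM term; the $\tfrac{1}{2}$ weighting between unconditional and conditional terms follows the formulas above:

```python
import numpy as np

def generator_adv_loss(d_fake, d_fake_cond):
    """L_advG: encourage D to score the generated image (and the
    image-text pair) as real. Inputs are probabilities in (0, 1)."""
    return -0.5 * np.log(d_fake) - 0.5 * np.log(d_fake_cond)

def discriminator_adv_loss(d_real, d_fake, d_real_cond, d_fake_cond):
    """L_advD: the unconditional terms distinguish real from generated
    images; the conditional terms judge whether image and text match."""
    uncond = -0.5 * np.log(d_real) - 0.5 * np.log(1.0 - d_fake)
    cond = -0.5 * np.log(d_real_cond) - 0.5 * np.log(1.0 - d_fake_cond)
    return uncond + cond

# a discriminator that is confident and correct incurs low loss
good = discriminator_adv_loss(0.95, 0.05, 0.95, 0.05)
# an undecided discriminator incurs higher loss
bad = discriminator_adv_loss(0.5, 0.5, 0.5, 0.5)
print(good < bad)  # True
```

During alternating training, the generator minimizes its loss while the discriminator minimizes its own, driving the adversarial competition described above.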
In the training process of the invention, 210 epochs are trained in total, and the whole network model is not trained end-to-end. First, the text encoder and the image encoder are trained synchronously, both with a learning rate of 0.002; then the other text encoder and the segmentation map encoder are trained synchronously, also both with a learning rate of 0.002. The training of the generative adversarial network is a process of continuous optimization and competition between the generators and the discriminators, whose learning rate is set to 0.0002.
Step 6: the word features and sentence features obtained in steps 1 and 2 are fused as in step 3, fed into the cascaded generative adversarial network in the manner described in step 4, and the generators are trained in the manner described in step 5; the trained generators then produce images of different resolutions.
In step 6, this embodiment obtains good generation results on the public MSCOCO dataset. MSCOCO is a large and rich object detection, segmentation, and captioning dataset. It targets scene understanding, with images mainly captured from complex everyday scenes. MSCOCO provides very detailed annotations for each image, including a text description of the entire image, bounding-box information for the objects in the image, class labels, segmentation maps, and so on. The images contain 91 object classes, 82 of which have more than 5,000 instance objects each. An example of generating a corresponding image from a text description is shown in fig. 5, and examples of images generated by the network are shown in fig. 6.
In summary, in this embodiment the text description is input into the generative adversarial network, and the generative adversarial network model is trained to obtain a well-trained generator, which can then generate multi-object images conforming to the text description. The method can solve the problems of traditional approaches, in which generation incurs high time and labor costs and the quality of the result cannot be guaranteed.
While the foregoing is directed to embodiments of the present invention, other and further embodiments of the invention may be devised without departing from the basic scope thereof, and the scope thereof is determined by the claims that follow.
Claims (7)
1. A text-to-multi-object image generation method based on joint embedding, characterized by comprising the following steps:
step 1: inputting the text description into a semantic encoder to obtain joint semantic features of the text and the image, wherein the semantic encoder is realized by a text encoder which is obtained by pre-training in a semantic encoder architecture;
the semantic encoder architecture described in step 1 includes a text encoder and an image encoder; guiding a training text encoder by using an image encoder obtained by pre-training in a semantic encoder architecture, wherein the text encoder obtained by training can extract joint semantic features of a text and an image from the text;
step 2: inputting the text into a space layout encoder to obtain joint space characteristics of the text and the segmentation map, wherein the space layout encoder is realized by a text encoder which is obtained by pre-training in a space encoder architecture;
The space encoder architecture described in step 2 consists of a text encoder and a segmentation map encoder; the text encoder is guided to be trained by using a segmentation map encoder obtained through pre-training in a space encoder framework, and the text encoder obtained through training can extract joint space layout characteristics of the text and the segmentation map from the text;
step 3: the semantic features obtained in step 1 and the spatial features obtained in step 2 are fused through a dynamic fusion module, wherein the semantic features and the spatial features comprise sentence-level and word-level features; the fused sentence-level features are fed into the initial generator of a cascaded generative adversarial network to generate a low-resolution image; the fused word-level features are used in the subsequent generators of the cascaded generative adversarial network to constrain the generation of high-resolution images;
step 4: the sentence-level features obtained by fusion in step 3 are fed into the initial generator of the cascaded generative adversarial network to generate a low-resolution image; an attention module processes the word-level semantic features obtained in step 1 and the spatial features obtained in step 2 to obtain word-level attention semantic features and attention spatial features; the word-level attention semantic-spatial constraint is calculated using the fusion method of step 3; the image features generated by the previous-stage generator and the word-level attention semantic-spatial constraint are fed together into the subsequent generator to generate a finer high-resolution image;
step 5: a cascaded generative adversarial network consisting of multiple pairs of generators and discriminators is constructed, and the generative adversarial network is trained on the dataset with a designed loss function, i.e., trained to obtain a generative adversarial network that minimizes the loss function;
step 6: the word-level features and sentence-level features obtained in steps 1 and 2 are fused, fed into the cascaded generative adversarial network in the manner described in step 4, and the generators are trained in the manner described in step 5; the trained generators generate images of different resolutions.
2. A method for generating a text-to-multi-object image based on joint embedding as recited in claim 1, wherein: the method also comprises a step 7 of applying the image which is generated in the step 6 and accords with the text description to the cross-mode generation field to solve the technical problem of related engineering;
the related engineering technical problems in the step 7 comprise multimedia education resource construction, image editing and computer teaching assistance.
3. A method of generating a text-to-multi-object image based on joint embedding as claimed in claim 1 or 2, wherein: the text encoder obtained by training in the semantic encoder architecture can extract the joint semantic features of the text and the image from the text, implemented as follows: an image encoder is trained in advance on the ImageNet dataset, and the trained image encoder can obtain image semantic features from images; by minimizing the distance between the text features and the semantic features, the text encoder can extract the joint semantic features of the text and the image from the text; while optimizing the text encoder, part of the network layers of the image encoder are continuously optimized by minimizing the distance between the text features and the image semantic features, and the two encoders are trained and optimized synchronously until the joint semantic features of the text and the image can be extracted from the text; the text features include the sentence-level feature $e_I$ and the word-level features $w_I$, and the image semantic features include the global feature $g_I$ and the local features $v_I$; thus, the distance between the text features and the semantic features also includes a sentence-level distance and a word-level distance;
the semantic encoder architecture is optimized using the word-level loss $L_w^I$ and the sentence-level loss $L_s^I$; the loss function $L^I$ of the semantic encoder architecture is defined as:

$L^I = L_w^I + L_s^I$
the sentence-level loss is computed from the dot product between the paired image global semantic feature $g_I$ and text sentence feature $e_I$; the sentence-level loss $L_s^I$ in the semantic encoder architecture is defined as:

$L_s^I = -\log \dfrac{\exp(g_I \cdot e_I)}{\sum_{e'} \exp(g_I \cdot e')}$

where the sum runs over the sentence features $e'$ in the training batch;
to calculate the word-level loss $L_w^I$, the method of DAMSM (Deep Attentional Multimodal Similarity Model) for computing text-image similarity is borrowed to calculate the similarity between the text features and the image semantic features at the word level; the specific implementation is as follows:
the similarity matrix $s$ between the words of the text and the image sub-regions is calculated as:

$s = w_I^{\top} v_I$

where $w_I$ denotes the word features of the text and $v_I$ denotes the sub-region features of the image; the similarity between a word and an image block is obtained from the similarity matrix $s$: $s_{i,j}$ represents the similarity between the $i$-th word and the $j$-th image block, and $T$ represents the number of words in the text; by introducing an attention module, a region context vector $c_i$ is calculated for each word in the text; the region context vector $c_i$ is a dynamic representation of the image sub-regions associated with the $i$-th word of the sentence, i.e., the weighted sum of the visual features of all image sub-regions:

$c_i = \sum_{j=1}^{N} \alpha_{i,j} v_j, \qquad \alpha_{i,j} = \dfrac{\exp(s_{i,j})}{\sum_{k=1}^{N} \exp(s_{i,k})}$

where $N$ represents the number of image sub-region blocks and $v_j$ represents the $j$-th sub-region feature of the image; the similarity between the text and the local image features is calculated using the region context vectors and the word vectors:

$R(c_i, w_i) = \dfrac{c_i \cdot w_i}{\lVert c_i \rVert \, \lVert w_i \rVert}$
4. A method for generating a text-to-multi-object image based on joint embedding as recited in claim 3, wherein: the text encoder obtained by training in the spatial encoder architecture can extract the joint spatial layout features of the text and the segmentation map from the text, implemented as follows: a segmentation map encoder is trained in advance on the ImageNet dataset, and the trained segmentation map encoder can obtain spatial layout features from the segmentation map; by minimizing the distance between the text features and the segmentation map features, the text encoder can extract the joint spatial layout features of the text and the segmentation map from the text; while optimizing the text encoder, part of the network layers of the segmentation map encoder are continuously optimized by minimizing the distance between the text features and the segmentation map features, and the two encoders are trained and optimized synchronously until the joint spatial layout features of the text and the segmentation map can be extracted from the text; the text features include the sentence-level feature $e_S$ and the word-level features $w_S$, and the spatial layout features include the global feature $g_S$ and the local features $v_S$; thus, the distance between the text features and the spatial layout features also includes a sentence-level distance and a word-level distance;
the spatial layout encoder architecture is optimized using the word-level loss $L_w^S$ and the sentence-level loss $L_s^S$; the loss function $L^S$ of the spatial encoder architecture is defined as:

$L^S = L_w^S + L_s^S$
the sentence-level loss is computed from the dot product between the paired segmentation map global spatial feature $g_S$ and text sentence feature $e_S$; the sentence-level loss $L_s^S$ in the spatial encoder architecture is defined as:

$L_s^S = -\log \dfrac{\exp(g_S \cdot e_S)}{\sum_{e'} \exp(g_S \cdot e')}$

where the sum runs over the sentence features $e'$ in the training batch;
to calculate the word-level loss $L_w^S$, the method of DAMSM (Deep Attentional Multimodal Similarity Model) for computing text-image similarity is borrowed to calculate the similarity between the text features and the spatial features of the segmentation map at the word level; the specific implementation is as follows:
the similarity matrix $s$ between the words of the text and the segmentation map sub-regions is calculated as:

$s = w_S^{\top} v_S$

where $w_S$ denotes the word features of the text and $v_S$ denotes the sub-region features of the segmentation map; the similarity between a word and a segmentation map block is obtained from the similarity matrix $s$: $s_{i,j}$ represents the similarity between the $i$-th word and the $j$-th segmentation map block, and $T$ represents the number of words in the text; by introducing an attention module, a region context vector $c_i$ is calculated for each word in the text; the region context vector $c_i$ is a dynamic representation of the segmentation map sub-regions associated with the $i$-th word of the sentence, i.e., the weighted sum of the visual features of all segmentation map sub-regions:

$c_i = \sum_{j=1}^{N} \alpha_{i,j} v_j, \qquad \alpha_{i,j} = \dfrac{\exp(s_{i,j})}{\sum_{k=1}^{N} \exp(s_{i,k})}$

where $N$ represents the number of segmentation map sub-region blocks; the similarity between the text and the local segmentation map features is calculated using the region context vectors and the word vectors:

$R(c_i, w_i) = \dfrac{c_i \cdot w_i}{\lVert c_i \rVert \, \lVert w_i \rVert}$
5. A method for generating a text-to-multi-object image based on joint embedding as recited in claim 4, wherein: in step 3, a concat operation is adopted to fuse the sentence-level semantic feature $e_I$ and spatial feature $e_S$ as the input to the initial-stage generator; the specific implementation formula is:

$e_{I-S} = \mathrm{concat}(e_I, e_S, z)$
wherein z represents a random noise vector;
in step 3, the word-level semantic features $w_I$ and spatial features $w_S$ are fused in a dynamic fusion manner, and the resulting features are fed into a high-stage generator to generate a high-resolution image; the specific implementation method is as follows:
the word-level semantic features $w_I$ and spatial features $w_S$ are each dynamically fused with the image features $h_{k-1}$ obtained at the previous stage, and the two features obtained in this way are fused again by dynamic fusion to serve as the semantic-spatial constraint for the generator of the subsequent stage; the formula is:

$\gamma_I = DF(h_{k-1}, w_I), \qquad \gamma_S = DF(h_{k-1}, w_S), \qquad w_{I-S} = \gamma_I \odot w_I + \gamma_S \odot w_S$

where $DF(\cdot)$ denotes a multilayer perceptron; first, $DF(\cdot)$ obtains a fusion ratio $\gamma_I$ from the previous-stage image features $h_{k-1}$ and the semantic features $w_I$; then $DF(\cdot)$ determines a fusion ratio $\gamma_S$ from the previous-stage image features $h_{k-1}$ and the spatial features $w_S$; finally, the fused feature $w_{I-S}$ combining the semantic and spatial features of the $k$-th stage is obtained; the fused word-level features ensure that, in the high-stage generators, high-resolution image generation is constrained from both the semantic and the spatial layout aspects;
step 4: the sentence-level feature $e_{I-S}$ obtained by fusion in step 3 is fed into the initial generator of the cascaded generative adversarial network to generate a low-resolution image; an attention module processes the word-level semantic features $w_I$ obtained in step 1 and the spatial features $w_S$ obtained in step 2 to obtain the word-level attention semantic features $c_I$ and attention spatial features $c_S$; the word-level attention semantic-spatial constraint $c_{I-S}$ is calculated using the fusion method of step 3; the image features $h_{k-1}$ generated by the previous-stage generator and the word-level attention semantic-spatial constraint $c_{I-S}$ are fed together into the subsequent generator to generate a finer high-resolution image.
6. The method for generating a text-to-multi-object image based on joint embedding of claim 5, wherein:
the specific implementation of generating the word-level attention semantic-spatial constraint $c_{I-S}$ in step 4 is as follows: the word-level semantic features $w_I$ and spatial features $w_S$ are each dynamically fused with the previous-stage image features $h_{k-1}$, and the fused features are used as the value vectors in the attention module; the attention module generates the word-level attention semantic features $c_I$ and attention spatial features $c_S$ separately, with the value vectors identical to the key vectors and the previous-stage image features $h_{k-1}$ used as the query vectors; finally, the dynamic fusion module fuses the two to obtain the word-level attention semantic-spatial constraint $c_{I-S}$.
Using the attention module proposed by AttnGAN to make the image feature h generated in the last stage k-1 Regarded as a query vector, the word-level spatial features w S Treating as a keyword vector and a value vector; generating an attention space feature for each image block using word-level space features wherein />Is defined as:
where β_{j,i} denotes the degree of attention the j-th image region pays to the i-th word-level spatial feature; the attention spatial feature c^S for the image is synthesized as the weighted sum of the word-level spatial features. Replacing the key and value vectors w^S in Attn(h, w^S) with the word-level semantic features w^I yields Attn(h, w^I), which likewise synthesizes the attention semantic feature c^I for the image. The attention spatial feature c^S and the attention semantic feature c^I are then fused using the word-level fusion of step 3; the formula is defined as:

γ^I = DF(h_{k-1}, c^I),   γ^S = DF(h_{k-1}, c^S),   c^{I-S} = γ^I·c^I + γ^S·c^S
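The attention computation above can be sketched in NumPy as follows. The region count N, word count T, and feature dimension d are illustrative; the softmax attention mirrors the AttnGAN-style module where the word features serve as both keys and values:

```python
import numpy as np

def attn(h, w):
    """AttnGAN-style attention: h (N x d) are image-region queries,
    w (T x d) serves as both keys and values; returns one attended
    feature per image region."""
    scores = h @ w.T                            # s_{j,i} = h_j . w_i
    scores -= scores.max(axis=1, keepdims=True) # numerical stability
    beta = np.exp(scores)
    beta /= beta.sum(axis=1, keepdims=True)     # attention weights beta_{j,i}
    return beta @ w                             # c_j = sum_i beta_{j,i} w_i

rng = np.random.default_rng(1)
N, T, d = 4, 6, 8                  # regions, words, feature dim (hypothetical)
h = rng.normal(size=(N, d))        # image features from the previous stage
w_spa = rng.normal(size=(T, d))    # word-level spatial features w^S
w_sem = rng.normal(size=(T, d))    # word-level semantic features w^I

c_spa = attn(h, w_spa)             # attention spatial features c^S
c_sem = attn(h, w_sem)             # attention semantic features c^I
```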
In step 4, the sentence-level feature e^{I-S} obtained by fusion in step 3 is used as the input of the initial-stage generator to generate a low-resolution image, implemented as:
h_0 = F_0(e^{I-S})
The image features obtained in the previous stage and the word-level attention semantic-spatial constraint c^{I-S} obtained by dynamic fusion are used as the input of the subsequent generators to generate high-resolution images. The network decides whether the current generation stage needs higher-level or mid-/low-level information: the image feature h_{k-1} obtained in the previous stage is fused with the constraint information c^{I-S}_k of the k-th stage and fed into the F_k module to generate the image feature h_k; finally, the image feature is fed into the corresponding generator G_k to generate a high-resolution image. The implementation formula is:

h_k = F_k(h_{k-1}, c^{I-S}_k),   x_k = G_k(h_k)
where F_k(·) denotes the image-feature generation module of the k-th stage and G_k(·) denotes the k-th generator.
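The cascaded refinement loop can be sketched as follows; the stand-in modules, the 2x upsampling per stage, and the constant placeholder constraint are all illustrative assumptions rather than the patent's networks:

```python
import numpy as np

def f_k(h_prev, c):
    """Stand-in for the stage-k feature module F_k: fuse the previous
    image feature with the stage constraint and upsample 2x."""
    fused = h_prev + c                          # naive fusion for illustration
    return np.repeat(np.repeat(fused, 2, axis=0), 2, axis=1)

def g_k(h):
    """Stand-in generator G_k: map features to a bounded 'image'."""
    return np.tanh(h)

h = np.zeros((4, 4))                            # h_0 from the initial stage F_0
for k in range(1, 3):                           # two refinement stages
    c = np.ones_like(h) * 0.1                   # constraint c^{I-S}_k (placeholder)
    h = f_k(h, c)                               # h_k = F_k(h_{k-1}, c^{I-S}_k)
    img = g_k(h)                                # x_k = G_k(h_k)
# resolution doubles per stage: 4x4 -> 8x8 -> 16x16
```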
7. The method for generating a text-to-multi-object image based on joint embedding of claim 6, wherein: to optimize the whole neural network model, the generator loss L_G and the discriminator loss L_D are trained alternately; the loss function L_G of the generator and the loss function L_D of the discriminator are defined as follows:
L_G = L_advG + λ_1·L_DAMSM,   L_D = L_advD
where L_advG is defined as:

L_advG = -(1/2)·E_{x̂∼p_G}[log D(x̂)] - (1/2)·E_{x̂∼p_G}[log D(x̂, s)]
p_G denotes the data distribution of generated images, s the input text description, and D(·) the discriminator of the cascaded generative adversarial network. The first term ensures that the generated image is realistic, and the second term ensures that the generated image matches the input text. The DAMSM loss L_DAMSM of step 2 computes the local and global similarity between the input text and the generated image; minimizing this loss keeps the generated image semantically consistent with the input text.
L_advD is defined as:

L_advD = -(1/2)·( E_{x∼p_data}[log D(x)] + E_{x̂∼p_G}[log(1-D(x̂))] ) - (1/2)·( E_{x∼p_data}[log D(x, s)] + E_{x̂∼p_G}[log(1-D(x̂, s))] )
where p_data denotes the data distribution of real images; the first term in the formula is used to distinguish the generated image from the real image, and the second term judges whether the input image and text match.
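A minimal NumPy sketch of these loss terms, assuming the discriminator outputs probabilities in (0, 1); the function name, batch values, and the λ_1 weight are hypothetical:

```python
import numpy as np

def adv_losses(d_real_u, d_real_c, d_fake_u, d_fake_c, eps=1e-8):
    """Adversarial losses with an unconditional (u) and a conditional (c)
    part, averaged over the batch; eps guards against log(0)."""
    l_adv_d = (-0.5 * (np.log(d_real_u + eps).mean()
                       + np.log(1 - d_fake_u + eps).mean())
               - 0.5 * (np.log(d_real_c + eps).mean()
                        + np.log(1 - d_fake_c + eps).mean()))
    l_adv_g = -0.5 * (np.log(d_fake_u + eps).mean()
                      + np.log(d_fake_c + eps).mean())
    return l_adv_g, l_adv_d

# toy discriminator outputs for a batch of 3 images
p = np.full(3, 0.5)
l_adv_g, l_adv_d = adv_losses(p, p, p, p)

lambda_1 = 5.0        # DAMSM weight (hypothetical value)
l_damsm = 0.0         # placeholder for the DAMSM loss of step 2
l_g = l_adv_g + lambda_1 * l_damsm   # L_G = L_advG + lambda_1 * L_DAMSM
l_d = l_adv_d                        # L_D = L_advD
```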
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110642098.0A CN113191375B (en) | 2021-06-09 | 2021-06-09 | Text-to-multi-object image generation method based on joint embedding |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113191375A CN113191375A (en) | 2021-07-30 |
CN113191375B true CN113191375B (en) | 2023-05-09 |
Family
ID=76976242
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110642098.0A Active CN113191375B (en) | 2021-06-09 | 2021-06-09 | Text-to-multi-object image generation method based on joint embedding |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113191375B (en) |
Families Citing this family (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113850779A (en) * | 2021-09-24 | 2021-12-28 | 深圳闪回科技有限公司 | Automatic grading algorithm for second-hand mobile phone based on variational multi-instance image recognition |
CN113869007B (en) * | 2021-10-11 | 2024-04-23 | 大连理工大学 | Text generation image learning method based on deep learning |
CN115293109B (en) * | 2022-08-03 | 2024-03-19 | 合肥工业大学 | Text image generation method and system based on fine granularity semantic fusion |
CN115512368B (en) * | 2022-08-22 | 2024-05-10 | 华中农业大学 | Cross-modal semantic generation image model and method |
CN115797495B (en) * | 2023-02-07 | 2023-04-25 | 武汉理工大学 | Method for generating image by sentence-character semantic space fusion perceived text |
CN116030048B (en) * | 2023-03-27 | 2023-07-18 | 山东鹰眼机械科技有限公司 | Lamp inspection machine and method thereof |
CN116863032B (en) * | 2023-06-27 | 2024-04-09 | 河海大学 | Flood disaster scene generation method based on generation countermeasure network |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111260740A (en) * | 2020-01-16 | 2020-06-09 | 华南理工大学 | Text-to-image generation method based on generation countermeasure network |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107330100B (en) * | 2017-07-06 | 2020-04-03 | 北京大学深圳研究生院 | Image-text bidirectional retrieval method based on multi-view joint embedding space |
CN110263203B (en) * | 2019-04-26 | 2021-09-24 | 桂林电子科技大学 | Text-to-image generation method combined with Pearson reconstruction |
CN110866958B (en) * | 2019-10-28 | 2023-04-18 | 清华大学深圳国际研究生院 | Method for text to image |
CN111340907A (en) * | 2020-03-03 | 2020-06-26 | 曲阜师范大学 | Text-to-image generation method of self-adaptive attribute and instance mask embedded graph |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||