CN113191375B - Text-to-multi-object image generation method based on joint embedding - Google Patents

Text-to-multi-object image generation method based on joint embedding

Info

Publication number
CN113191375B
CN113191375B
Authority
CN
China
Prior art keywords
text
image
features
word
level
Prior art date
Legal status
Active
Application number
CN202110642098.0A
Other languages
Chinese (zh)
Other versions
CN113191375A (en)
Inventor
余月
王孟岚
杨越
Current Assignee
Beijing Institute of Technology BIT
Original Assignee
Beijing Institute of Technology BIT
Priority date
Filing date
Publication date
Application filed by Beijing Institute of Technology BIT filed Critical Beijing Institute of Technology BIT
Priority to CN202110642098.0A
Publication of CN113191375A
Application granted
Publication of CN113191375B

Classifications

    • G06V 10/44: Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; connectivity analysis, e.g. of connected components
    • G06F 18/2132: Feature extraction, e.g. by transforming the feature space; summarisation; mappings, e.g. subspace methods, based on discrimination criteria, e.g. discriminant analysis
    • G06F 18/22: Matching criteria, e.g. proximity measures
    • G06N 3/044: Recurrent networks, e.g. Hopfield networks
    • G06N 3/045: Combinations of networks
    • G06N 3/08: Learning methods
    • G06V 30/153: Segmentation of character regions using recognition of characters or words
    • Y02T 10/40: Engine management systems


Abstract

The invention discloses a text-to-multi-object image generation method based on joint embedding, belonging to the field of cross-modal text-to-image generation. The method is implemented as follows: the text description is input into a semantic encoder to obtain joint semantic features of the text and the image, and into a spatial layout encoder to obtain joint spatial features of the text and the segmentation map; both the joint semantic features and the joint spatial features include sentence-level and word-level components. A dynamic fusion module fuses the word-level features and the sentence-level features respectively. The fused sentence-level features are fed into the initial generator of a generative adversarial network to generate a low-resolution image, and the fused word-level features are fed into the subsequent generators to generate fine high-resolution images. A cascaded generative adversarial network consisting of several generator-discriminator pairs is constructed, trained on a dataset with designed loss functions, and the trained generators are used to generate the corresponding images.

Description

Text-to-multi-object image generation method based on joint embedding
Technical Field
The invention relates to a text-to-multi-object image generation method based on joint embedding, and belongs to the field of cross-modal generation from text to image.
Background
In daily life, a single form of data is often insufficient to convey information, and multi-modal data are frequently combined to express it; for example, when describing a thing, we usually accompany the text with an image. However, collecting such paired data requires considerable money and effort. Generation differs from retrieval: retrieval returns existing data, whereas generation tends to create new data. Since collecting corresponding text and images is not an easy task, research on text-to-image generation helps address these problems. Obscure wording in textbooks is often painful for students who lack imagination; with deep learning methods we hope to configure matching images or three-dimensional scenes for such texts, and to combine the text, the corresponding images and the three-dimensional scenes to help students understand the knowledge more deeply. Generating corresponding images from textual descriptions is therefore a challenging and meaningful study.
For cross-modal generation problems, the key lies in extracting joint features and designing the generative model. In the text-to-image task, text and images are data of two different modalities; how to derive joint text-image features from the input and how to design a reasonable model to generate an image are the keys to solving this problem. The purpose of generating images from text is to produce high-quality images that are reasonable in shape, color, layout, etc., and that conform to the text description.
Previous studies simplify the text-to-image generation task into two parts: extracting joint text-image semantic features from the text, and feeding the obtained feature vector into a generative network model to obtain the corresponding image. However, the visual space is high-dimensional and structured, covering visual features of different aspects, including high-level abstract semantics and layout features as well as low-level texture and color features. Previous methods have achieved good results on text-to-single-object image generation tasks, but they are not suitable for the multi-object image generation that corresponds to complex text. For multi-object image generation, mapping directly from semantic features into a visual space with a reasonable layout is a very difficult challenge.
Disclosure of Invention
Aiming at the problem that generated multi-object images lack a reasonable spatial layout, the invention discloses a text-to-multi-object image generation method based on joint embedding. The technical problem to be solved is to provide a network framework capable of generating corresponding multi-object images from textual descriptions, consisting mainly of a semantic encoder, a spatial layout encoder, a dynamic feature fusion module, and a cascaded generative adversarial network with attention modules. Joint semantic features of the text and the image are extracted from the text by the semantic encoder, joint spatial features of the text and the segmentation map are extracted from the text by the spatial layout encoder, the semantic and spatial features are fused by the dynamic fusion module, and the fused features are fed into the generative adversarial network to generate an image that conforms to the text description and has a reasonable layout. The invention has the advantages of convenience, wide applicability and good generation effect. The images generated from text are applied in the field of cross-modal generation to solve related engineering problems, including multimedia education resource construction, image editing and computer-assisted teaching.
In order to achieve the above purpose, the present invention adopts the following technical scheme.
The invention discloses a method for generating multi-object images from text based on joint embedding. The text description is input into a semantic encoder to obtain joint semantic features of the text and the image, and into a spatial layout encoder to obtain joint spatial features of the text and the segmentation map; both include sentence-level and word-level components. A dynamic fusion module fuses the word-level features and the sentence-level features respectively. The fused sentence-level features are fed into the initial generator of a generative adversarial network to generate a low-resolution image, and the fused word-level features are fed into the subsequent generators to generate fine high-resolution images. A cascaded generative adversarial network consisting of several generator-discriminator pairs is constructed, trained on a dataset with designed loss functions, and the trained generators are used to generate the corresponding images. The images generated from text are applied in the field of cross-modal generation to solve related engineering problems.
The invention discloses a text-to-multi-object image generation method based on joint embedding, which comprises the following steps:
Step 1: the text description is input into a semantic encoder to obtain joint semantic features of the text and the image; the semantic encoder is realized by the text encoder obtained through pre-training within a semantic encoder architecture.
The semantic encoder architecture described in step 1 includes a text encoder and an image encoder. The pre-trained image encoder is used within the semantic encoder architecture to guide the training of the text encoder, and the trained text encoder can extract the joint semantic features of the text and the image from the text.
The trained text encoder can extract the joint semantic features of the text and the image from the text; this is realized as follows. The image encoder is pre-trained on the ImageNet dataset, and the trained image encoder obtains image semantic features from the image. By minimizing the distance between the text features and the image semantic features, the text encoder is forced to extract joint text-image semantic features from the text. Considering that an image encoder pre-trained on ImageNet is not fully applicable to other datasets, while the text encoder is optimized, part of the network layers of the image encoder are also continuously optimized by minimizing the distance between the text features and the image semantic features; the two encoders are trained and optimized synchronously until the joint semantic features of the text and the image can be extracted from the text. The text features include sentence-level features e_I and word-level features w_I, and the image semantic features include global features g_I and local features v_I. The distance between the text features and the semantic features therefore also has sentence-level and word-level components.
The semantic encoder architecture is optimized with a word-level loss L_w^I and a sentence-level loss L_s^I; the semantic encoder architecture loss function L^I is defined as follows:

L^I = L_w^I + L_s^I

The sentence-level loss is computed from the dot product between the paired image global semantic feature g_I and the text sentence feature e_I. The sentence-level loss in the semantic encoder architecture is defined as:

L_s^I = - Σ log [ exp(g_I · e_I) / Σ_{e'} exp(g_I · e') ]

where the outer sum runs over the paired samples in a batch and e' ranges over all sentence features in the batch.

To calculate the word-level loss L_w^I, the text-image similarity computation of DAMSM (Deep Attentional Multimodal Similarity Model) is adopted to measure the similarity between the text features and the image semantic features at the word level. The concrete implementation is as follows.

A similarity matrix s between the words in the text and the image sub-regions is computed:

s = w_I^T v_I

where w_I denotes the word features of the text and v_I denotes the sub-region features of the image. The similarity between a word and an image block is obtained from the similarity matrix s; s̄_{i,j} denotes the similarity of the i-th word to the j-th image block, where T denotes the number of words in the text:

s̄_{i,j} = exp(s_{i,j}) / Σ_{k=1}^{T} exp(s_{k,j})

By introducing an attention module, a region context vector c_i^I is computed for each word in the text. The region context vector c_i^I is a dynamic representation of the image sub-regions associated with the i-th word in the sentence, i.e. the weighted sum of the visual features of all image sub-regions:

c_i^I = Σ_{j=1}^{N} α_{i,j} v_j^I,   α_{i,j} = exp(γ_1 s̄_{i,j}) / Σ_{k=1}^{N} exp(γ_1 s̄_{i,k})

where N denotes the number of image sub-region blocks, v_j^I denotes the j-th sub-region feature of the image, and γ_1 is a sharpening factor. The similarity between the text and the local image features is then computed from the region context vectors and the word vectors:

R_w(text, image) = (1/γ_2) log Σ_{i=1}^{T} exp(γ_2 R(c_i^I, w_i^I))

where R(c_i^I, w_i^I) is the cosine similarity between the region context vector c_i^I of the image sub-regions associated with the i-th word and the word feature w_i^I of the i-th word, and γ_2 is an aggregation factor; the word-level loss L_w^I is computed in the same way as the sentence-level loss, with this word-level matching score in place of the dot product g_I · e_I. The trained text encoder can extract the joint semantic features of the text and the image from the text.
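For illustration, the word-level matching described above can be sketched in a few lines of PyTorch. This is a minimal sketch, not the code of the patent: the tensor shapes, the function name and the values of the factors γ_1 and γ_2 are assumptions.

```python
import torch
import torch.nn.functional as F

def word_level_match(words, regions, gamma1=4.0, gamma2=5.0):
    """DAMSM-style word-level matching score between a text and an image.

    words:   (T, D) word features w_I
    regions: (N, D) image sub-region features v_I
    """
    s = words @ regions.t()                    # similarity matrix s: (T, N)
    s_bar = F.softmax(s, dim=0)                # normalize over words for each sub-region
    alpha = F.softmax(gamma1 * s_bar, dim=1)   # attention over sub-regions for each word
    c = alpha @ regions                        # region context vector c_i per word: (T, D)
    r = F.cosine_similarity(c, words, dim=1)   # R(c_i, w_i) for every word
    return torch.logsumexp(gamma2 * r, dim=0) / gamma2  # aggregated word-level score

# Example: 12 words, 289 image sub-regions (17x17 grid), 256-dimensional features
score = word_level_match(torch.randn(12, 256), torch.randn(289, 256))
```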
Step 2: the text is input into a spatial layout encoder to obtain joint spatial features of the text and the segmentation map; the spatial layout encoder is realized by the text encoder obtained through pre-training within a spatial encoder architecture.
The spatial encoder architecture described in step 2 consists of a text encoder and a segmentation map encoder. The pre-trained segmentation map encoder is used within the spatial encoder architecture to guide the training of the text encoder, and the trained text encoder can extract joint spatial layout features of the text and the segmentation map from the text.
The trained text encoder can extract joint spatial layout features of the text and the segmentation map from the text; this is realized as follows. The segmentation map encoder is pre-trained on the ImageNet dataset, and the trained segmentation map encoder obtains spatial layout features from the segmentation map. By minimizing the distance between the text features and the segmentation map features, the text encoder is forced to extract joint spatial layout features of the text and the segmentation map from the text. Considering that a segmentation map encoder pre-trained on ImageNet is not fully applicable to other datasets, while the text encoder is optimized, part of the network layers of the segmentation map encoder are also continuously optimized by minimizing the distance between the text features and the segmentation map features; the two encoders are trained and optimized synchronously until the joint spatial layout features of the text and the segmentation map can be extracted from the text. The text features include sentence-level features e_S and word-level features w_S, and the spatial layout features include global features g_S and local features v_S. The distance between the text features and the spatial layout features therefore also has sentence-level and word-level components.
The spatial layout encoder architecture is optimized with a word-level loss L_w^S and a sentence-level loss L_s^S; the spatial encoder architecture loss function L^S is defined as follows:

L^S = L_w^S + L_s^S

The sentence-level loss is computed from the dot product between the paired segmentation map global spatial feature g_S and the text sentence feature e_S. The sentence-level loss in the spatial encoder architecture is defined as:

L_s^S = - Σ log [ exp(g_S · e_S) / Σ_{e'} exp(g_S · e') ]

where the outer sum runs over the paired samples in a batch and e' ranges over all sentence features in the batch.

To calculate the word-level loss L_w^S, the text-image similarity computation of DAMSM (Deep Attentional Multimodal Similarity Model) is adopted to measure the similarity between the text features and the spatial features of the segmentation map at the word level. The concrete implementation is as follows.

A similarity matrix s between the words in the text and the segmentation map sub-regions is computed:

s = w_S^T v_S

where w_S denotes the word features of the text and v_S denotes the sub-region features of the segmentation map. The similarity between a word and a segmentation block is obtained from the similarity matrix s; s̄_{i,j} denotes the similarity of the i-th word to the j-th segmentation block, where T denotes the number of words in the text:

s̄_{i,j} = exp(s_{i,j}) / Σ_{k=1}^{T} exp(s_{k,j})

By introducing an attention module, a region context vector c_i^S is computed for each word in the text. The region context vector c_i^S is a dynamic representation of the segmentation map sub-regions associated with the i-th word in the sentence, i.e. the weighted sum of the visual features of all segmentation map sub-regions:

c_i^S = Σ_{j=1}^{N} α_{i,j} v_j^S,   α_{i,j} = exp(γ_1 s̄_{i,j}) / Σ_{k=1}^{N} exp(γ_1 s̄_{i,k})

where N denotes the number of segmentation map sub-region blocks. The similarity between the text and the local segmentation map features is then computed from the region context vectors and the word vectors:

R_w(text, segmentation map) = (1/γ_2) log Σ_{i=1}^{T} exp(γ_2 R(c_i^S, w_i^S))

where R(c_i^S, w_i^S) is the cosine similarity between the region context vector c_i^S of the segmentation map sub-regions associated with the i-th word and the word feature w_i^S of the i-th word. The word-level loss L_w^S is computed analogously to the sentence-level loss, with this word-level matching score in place of the dot product g_S · e_S. The trained text encoder can extract joint spatial features of the text and the segmentation map from the text.
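The sentence-level term used by both encoder architectures can be illustrated with the following sketch (a minimal example with assumed shapes, batch-wise normalization and a symmetric term; it is not the patent's implementation):

```python
import torch
import torch.nn.functional as F

def sentence_level_loss(global_feats, sent_feats, gamma3=10.0):
    """Sentence-level loss from dot products of paired global and sentence features.

    global_feats: (M, D) image or segmentation-map global features g
    sent_feats:   (M, D) text sentence features e; row i is paired with row i
    """
    scores = gamma3 * global_feats @ sent_feats.t()   # (M, M) batch similarities
    labels = torch.arange(scores.size(0))
    # Negative log posterior of matching each visual feature with its own sentence,
    # plus the symmetric term matching each sentence with its own visual feature.
    return F.cross_entropy(scores, labels) + F.cross_entropy(scores.t(), labels)

loss = sentence_level_loss(torch.randn(8, 256), torch.randn(8, 256))
```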
Step 3: and (3) fusing the semantic features obtained in the step (1) and the spatial features obtained in the step (2) through a dynamic fusion module, wherein the semantic features and the spatial features comprise sentence-level features and word-level features. The fused sentence-level features are fed into an initial generator in a generative antagonism network to generate a low resolution image. The fused word-level features are then used in subsequent generators in the generative antagonism network to constrain the generation of the high resolution image.
In step 3, a concat operation is adopted to fuse the sentence-level semantic features e_I and the spatial features e_S as the input of the initial-stage generator. The concrete implementation is:

e_{I-S} = concat(e_I, e_S, z)

where z denotes a random noise vector.
In step 3, a dynamic fusion mode is adopted to fuse the word-level semantic features w_I and the spatial features w_S, and the obtained feature w_{I-S}^k is fed into the higher-stage generators to generate high-resolution images. The concrete implementation is as follows.

The word-level semantic features w_I and the spatial features w_S are each dynamically fused with the image features h_{k-1} obtained at the previous stage, and the two resulting features are then re-fused, again by dynamic fusion, to serve as the semantic-spatial constraint of the generator at the subsequent stage. The formulas are as follows:

γ_I^k = DF(h_{k-1}, w_I)
w̃_I^k = γ_I^k ⊙ w_I + (1 - γ_I^k) ⊙ h_{k-1}
γ_S^k = DF(h_{k-1}, w_S)
w̃_S^k = γ_S^k ⊙ w_S + (1 - γ_S^k) ⊙ h_{k-1}
γ^k = DF(w̃_I^k, w̃_S^k)
w_{I-S}^k = γ^k ⊙ w̃_I^k + (1 - γ^k) ⊙ w̃_S^k

where DF(·) denotes a multilayer perceptron and ⊙ denotes element-wise multiplication. First, DF(·) is applied to the image features h_{k-1} obtained at the previous stage and the semantic features w_I to obtain a fusion ratio γ_I^k; then DF(·) is applied to h_{k-1} and the spatial features w_S to obtain a fusion ratio γ_S^k; finally γ^k is obtained to fuse the semantic and spatial features at the k-th stage. The fused word-level features ensure that, in the higher-stage generators, high-resolution image generation is constrained in terms of both semantics and spatial layout.
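One possible reading of the dynamic fusion module described above is sketched below in PyTorch. The gating form (a convex combination weighted by the predicted fusion ratio), the layer sizes, and the assumption that the word-level features are already aligned with the image positions are all assumptions, not details taken from the patent.

```python
import torch
import torch.nn as nn

class DynamicFusion(nn.Module):
    """Predicts a fusion ratio from two feature maps and blends them."""
    def __init__(self, dim):
        super().__init__()
        # DF(.): a small multilayer perceptron applied per position
        self.df = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(inplace=True),
            nn.Linear(dim, dim), nn.Sigmoid())

    def forward(self, a, b):
        # a, b: (B, L, dim); gamma in (0, 1) is the per-element fusion ratio
        gamma = self.df(torch.cat([a, b], dim=-1))
        return gamma * a + (1.0 - gamma) * b

# Word-level semantic/spatial features fused with previous-stage image features,
# then re-fused into a single semantic-spatial constraint.
dim = 48
fuse = DynamicFusion(dim)
h_prev = torch.randn(2, 64 * 64, dim)                 # flattened image features h_{k-1}
w_sem, w_spa = torch.randn(2, 64 * 64, dim), torch.randn(2, 64 * 64, dim)
w_tilde_sem = fuse(w_sem, h_prev)
w_tilde_spa = fuse(w_spa, h_prev)
w_sem_spa = fuse(w_tilde_sem, w_tilde_spa)            # word-level semantic-spatial constraint
```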
Step 4: the sentence-level feature e_{I-S} obtained by fusion in step 3 is fed into the initial generator of the cascaded generative adversarial network to generate a low-resolution image. The word-level semantic features w_I obtained in step 1 and the spatial features w_S obtained in step 2 are processed with an attention module to obtain the word-level attention semantic features c^I and the attention spatial features c^S; the word-level attention semantic-spatial constraint c_{I-S} is then calculated using the fusion method of step 3. The image features h_{k-1} generated by the generator at the previous stage and the word-level attention semantic-spatial constraint c_{I-S} are fed together into the subsequent generator to generate a finer high-resolution image.
The word-level attention semantic-spatial constraint c_{I-S} in step 4 is generated as follows.

The word-level semantic features w_I and the spatial features w_S are each dynamically fused with the image features h_{k-1} obtained at the previous stage, and the fused features are used as the value vectors in the attention module. The attention module is then used to generate the word-level attention semantic features c^I and the attention spatial features c^S respectively; the value vectors of the attention module are the same as the key vectors, and the query vector uses the image features h_{k-1} obtained at the previous stage. Finally, a dynamic fusion module is adopted to fuse them into the word-level attention semantic-spatial constraint c_{I-S}.

Using the attention module proposed by AttnGAN, the image features h generated at the previous stage are taken as the query vectors, and the word-level spatial features w_S are taken as the key and value vectors. An attention spatial feature is generated for each image sub-region from the word-level spatial features:

c_j^S = Attn(h, w_S)_j = Σ_{i=1}^{T} β_{j,i} w_i^S

where β_{j,i} is defined as:

β_{j,i} = exp(h_j^T w_i^S) / Σ_{k=1}^{T} exp(h_j^T w_k^S)

Here β_{j,i} represents the degree of attention of the j-th image sub-region to the i-th word-level spatial feature, and an attention spatial feature c^S is synthesized for the image by the weighted sum of the word-level spatial features. Replacing the key and value vectors w_S in Attn(h, w_S) with the word-level semantic features w_I gives Attn(h, w_I), which likewise synthesizes an attention semantic feature c^I for the image. The attention spatial features c^S and the attention semantic features c^I are fused using the word-level feature fusion method of step 3. The specific formulas are defined as:

γ_I^k = DF(h_{k-1}, c^I)
c̃_I^k = γ_I^k ⊙ c^I + (1 - γ_I^k) ⊙ h_{k-1}
γ_S^k = DF(h_{k-1}, c^S)
c̃_S^k = γ_S^k ⊙ c^S + (1 - γ_S^k) ⊙ h_{k-1}
γ^k = DF(c̃_I^k, c̃_S^k)
c_{I-S}^k = γ^k ⊙ c̃_I^k + (1 - γ^k) ⊙ c̃_S^k
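The AttnGAN-style attention referenced above can be sketched as follows (a minimal batched version with assumed shapes; projections of the word features to the image feature dimension are omitted):

```python
import torch
import torch.nn.functional as F

def word_attention(h, words):
    """AttnGAN-style word attention.

    h:     (B, N, D) image features from the previous stage (queries)
    words: (B, T, D) word-level features (keys and values)
    Returns (B, N, D): one context vector per image sub-region.
    """
    logits = torch.bmm(h, words.transpose(1, 2))   # beta logits: (B, N, T)
    beta = F.softmax(logits, dim=-1)               # attention of sub-region j to word i
    return torch.bmm(beta, words)                  # weighted sum of word features

h_prev = torch.randn(2, 64 * 64, 48)
w_spa, w_sem = torch.randn(2, 12, 48), torch.randn(2, 12, 48)
c_spatial = word_attention(h_prev, w_spa)    # attention spatial features c^S
c_semantic = word_attention(h_prev, w_sem)   # attention semantic features c^I
```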
In step 4, the sentence-level feature e_{I-S} obtained by fusion in step 3 is used as the input of the initial-stage generator to generate a low-resolution image, concretely implemented as:

h_0 = F_0(e_{I-S})
x̂_0 = G_0(h_0)

The image features obtained at the previous stage and the word-level attention semantic-spatial constraint c_{I-S} obtained by dynamic fusion are used as the input of the subsequent generators to generate high-resolution images. Considering that c_{I-S} belongs to the high-level features of the image, and that low-level features such as color, texture and structure may matter more when the image is generated at a low stage, the two kinds of high-level information (semantics and spatial layout) cannot necessarily provide good guidance at every stage. The network therefore decides by itself whether the current generation stage needs higher-level or middle/low-level information, obtaining c_{I-S}^k. The image features h_{k-1} obtained at the previous stage are fused with the constraint information c_{I-S}^k of the k-th stage and sent to the F_k module to generate the image features h_k; finally, the image features are sent to the corresponding generator G_k to generate a high-resolution image. The concrete implementation is:

h_k = F_k(h_{k-1}, c_{I-S}^k)
x̂_k = G_k(h_k)

where F_k(·) denotes the image feature generation module of the k-th stage and G_k(·) denotes the k-th generator.
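A schematic sketch of how the cascaded stages F_k and G_k could be wired together is given below. The block internals (convolutions, upsampling, channel sizes) and the random tensor standing in for the fused constraint c_{I-S}^k are placeholders, not the architecture claimed in the patent.

```python
import torch
import torch.nn as nn

class Stage(nn.Module):
    """One F_k block followed by its generator head G_k (toy placeholder layers)."""
    def __init__(self, dim):
        super().__init__()
        self.refine = nn.Sequential(                    # F_k: consumes h_{k-1} and the constraint
            nn.Conv2d(2 * dim, dim, 3, padding=1), nn.ReLU(inplace=True),
            nn.Upsample(scale_factor=2, mode='nearest'),
            nn.Conv2d(dim, dim, 3, padding=1), nn.ReLU(inplace=True))
        self.to_img = nn.Conv2d(dim, 3, 3, padding=1)   # G_k: image head

    def forward(self, h_prev, constraint):
        h = self.refine(torch.cat([h_prev, constraint], dim=1))
        return h, torch.tanh(self.to_img(h))

dim = 48
f0 = nn.Sequential(nn.Linear(356, dim * 4 * 4))   # F_0 over e_{I-S} = [e_I; e_S; z]
g0 = nn.Conv2d(dim, 3, 3, padding=1)              # G_0
stages = nn.ModuleList([Stage(dim), Stage(dim)])

e_is = torch.randn(2, 356)                        # fused sentence feature plus noise
h = f0(e_is).view(2, dim, 4, 4)
images = [torch.tanh(g0(h))]                      # low-resolution image
for stage in stages:
    c_is = torch.randn_like(h)                    # stands in for the fused constraint c^k_{I-S}
    h, img = stage(h, c_is)
    images.append(img)                            # progressively higher resolution
```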
Step 5: and constructing a cascade generation type countermeasure network consisting of a plurality of pairs of generators and discriminators, and training the generation countermeasure network on a data set through designing a loss function, namely training the generation countermeasure network with minimized loss function.
To optimize the overall neural network model, generator penalty L is employed G Sum discriminator loss L D To alternate training. Loss function L of generator G And loss function L of the arbiter D Is defined as follows:
L G =L advG1 L DAMSM L D =L advD
wherein LadvG Is defined as follows:
Figure BDA0003108323670000071
p G representing the data distribution of the generated image, s represents the text description of the input, and D (·) represents the discriminant in the generated countermeasure network. Wherein the first term ensures that the generated image is as realistic as possible, and the second term ensures that the generated image matches the input text. Loss L using DAMSM in step 2 DAMSM To calculate the local and global similarity of the input text and the generated image, and to ensure that the generated image and the input text maintain semantic consistency by minimizing the loss.
L advD Is defined as:
Figure BDA0003108323670000072
p data the data distribution representing the real image, the first term in the formula being used to distinguish the generated image from the real image, the second term determining whether the input image and text match.
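The alternating optimization of L_D and L_G can be illustrated with the following sketch. The interfaces (generator(noise, sent_emb, word_emb), the discriminator's loss_real/loss_fake/loss_gen helpers and the damsm_loss callable) and the value of λ_1 are hypothetical placeholders, not the patent's API.

```python
import torch

def train_step(generator, discriminator, damsm_loss, g_opt, d_opt,
               real_images, sent_emb, word_emb, lambda1=5.0):
    """One alternating update: discriminator first, then generator."""
    noise = torch.randn(real_images.size(0), 100)
    fake_images = generator(noise, sent_emb, word_emb)

    # Discriminator: unconditional (real vs. fake) and conditional (image-text match) terms
    d_opt.zero_grad()
    d_loss = (discriminator.loss_real(real_images, sent_emb)
              + discriminator.loss_fake(fake_images.detach(), sent_emb))
    d_loss.backward()
    d_opt.step()

    # Generator: adversarial term plus the DAMSM term weighted by lambda1
    g_opt.zero_grad()
    g_loss = (discriminator.loss_gen(fake_images, sent_emb)
              + lambda1 * damsm_loss(fake_images, word_emb, sent_emb))
    g_loss.backward()
    g_opt.step()
    return d_loss.item(), g_loss.item()
```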
Step 6: and 3, fusing the word characteristics and the sentence characteristics obtained in the step 1 and the step 2, feeding the word characteristics and the sentence characteristics into a cascade generation type countermeasure network according to the mode described in the step 4, training a generator according to the mode described in the step 5, and generating images with different resolutions by the trained generator.
The method further comprises step 7: the images generated in step 6 that conform to the text description are applied in the field of cross-modal generation to solve related engineering problems.
The related engineering problems in step 7 include multimedia education resource construction, image editing and computer-assisted teaching.
The beneficial effects are that:
1. The text-to-multi-object image generation method based on joint embedding disclosed by the invention provides a network framework capable of generating corresponding multi-object images from textual descriptions, consisting mainly of a semantic encoder, a spatial layout encoder, a dynamic feature fusion module, and a cascaded generative adversarial network with attention modules. Joint semantic features of the text and the image are extracted from the text by the semantic encoder, joint spatial features of the text and the segmentation map are extracted from the text by the spatial layout encoder, the semantic and spatial features are fused by the dynamic fusion module, and the fused features are fed into the generative adversarial network to generate an image conforming to the text description. Because image generation is constrained in terms of both semantics and spatial layout when the high-resolution image is generated, the generated images conform to the text description, have a reasonable spatial layout and look visually realistic.
2. Existing ways of obtaining images paired with text are mostly realized through retrieval; when no existing image matches the desired text description, professional image editing software is needed to create one manually. The text-to-multi-object image generation method based on joint embedding disclosed by the invention only requires inputting a simple text description, and the network obtained through training can quickly generate the corresponding image, which is convenient and fast.
3. The visual space is high-dimensional and structured, covering visual features of different aspects, including high-level abstract semantics and layout features as well as low-level texture and color features. Existing research simplifies the text-to-image generation task into two parts: extracting joint text-image semantic features from the text, and feeding the obtained feature vector into a generative network model to obtain the corresponding image. Mapping directly from semantic features into a visual space with a reasonable layout is a very difficult challenge. The text-to-multi-object image generation method based on joint embedding disclosed by the invention additionally extracts spatial layout features from the text when generating the image and uses both semantic and spatial features to constrain image generation, which alleviates the unreasonable layouts produced by previous methods and yields a good generation effect.
4. The text-to-multi-object image generation method based on joint embedding disclosed by the invention applies the generated multi-object images that conform to the text description to the field of cross-modal generation to solve related engineering problems, for example multimedia education resource construction, image editing and computer-assisted teaching.
Drawings
FIG. 1 is a flow chart of an implementation of the joint embedding-based text-to-multi-object image generation method of the present invention;
FIG. 2 is a diagram of the overall network architecture of the present invention;
FIG. 3 is a block diagram of an encoder architecture according to the present invention, where FIG. 3 (a) is a semantic encoder architecture and FIG. 3 (b) is a spatial encoder architecture;
FIG. 4 is a block diagram of a dynamic fusion module according to the present invention;
FIG. 5 is an exemplary graph of the results of the generation on the MSCOCO data set in the present invention;
FIG. 6 is an exemplary view of an image generated on an MSCOCO data set according to the present invention.
Detailed Description
Embodiments of the present invention will be described below with reference to the drawings.
As shown in FIG. 1, the text-to-multi-object image generation method based on joint embedding disclosed in this embodiment can be used for education-related applications on the MSCOCO dataset: for example, given an input text description, the model generates a corresponding image, and the knowledge contained in the text can be understood by combining the multi-object image with the text. The training and image generation flow of this embodiment is shown in FIG. 1.
Step 1: the text description is input into a semantic encoder to obtain joint semantic features of the text and the image; the semantic encoder is realized by the text encoder obtained through pre-training within a semantic encoder architecture.
The semantic encoder architecture described in step 1 includes a text encoder and an image encoder, as shown in FIG. 3(a). The pre-trained image encoder is used within the semantic encoder architecture to guide the training of the text encoder, and the trained text encoder can extract the joint semantic features of the text and the image from the text.
The text encoder is realized by a bidirectional long short-term memory network (Bi-LSTM); the final hidden state of the model represents the text sentence feature e_I, and the intermediate hidden states represent the text word features w_I. The image encoder is implemented with an Inception-v3 model pre-trained on the ImageNet dataset; the last layer of the model learns the global feature g_I of the image, and an intermediate convolutional layer learns the sub-region features v_I of the image.
The two encoders are continuously optimized and trained by minimizing the distance between the text features and the image features until the joint semantic features of the text and the image can be extracted from the text. As shown in FIG. 3(a), the distance between the text features and the image semantic features includes sentence-level and word-level components.
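A minimal sketch of such a Bi-LSTM text encoder is shown below (embedding size, hidden size and vocabulary size are assumptions; the Inception-v3 image encoder is omitted):

```python
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    """Bi-LSTM text encoder producing word-level and sentence-level features."""
    def __init__(self, vocab_size, embed_dim=300, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # bidirectional: each direction contributes hidden_dim features
        self.rnn = nn.LSTM(embed_dim, hidden_dim, batch_first=True,
                           bidirectional=True)

    def forward(self, tokens):
        # tokens: (B, T) word indices
        out, (h_n, _) = self.rnn(self.embed(tokens))
        words = out                                    # (B, T, 2*hidden_dim): word features w
        sentence = torch.cat([h_n[0], h_n[1]], dim=1)  # (B, 2*hidden_dim): sentence feature e
        return words, sentence

enc = TextEncoder(vocab_size=5000)
w, e = enc(torch.randint(0, 5000, (2, 12)))
```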
The semantic encoder architecture is optimized with a word-level loss L_w^I and a sentence-level loss L_s^I; the semantic encoder architecture loss function L^I is defined as follows:

L^I = L_w^I + L_s^I

The sentence-level loss is computed from the dot product between the paired image global semantic feature g_I and the text sentence feature e_I. The sentence-level loss in the semantic encoder architecture is defined as:

L_s^I = - Σ log [ exp(g_I · e_I) / Σ_{e'} exp(g_I · e') ]

where the outer sum runs over the paired samples in a batch and e' ranges over all sentence features in the batch.

To calculate the word-level loss L_w^I, the text-image similarity computation of DAMSM (Deep Attentional Multimodal Similarity Model) is adopted to measure the similarity between the text features and the image features at the word level. The concrete implementation is as follows.

A similarity matrix s between the words in the text and the image sub-regions is computed:

s = w_I^T v_I

where w_I denotes the word features of the text and v_I denotes the sub-region features of the image; s̄_{i,j} denotes the similarity of the i-th word to the j-th image block, and T denotes the number of words in the text:

s̄_{i,j} = exp(s_{i,j}) / Σ_{k=1}^{T} exp(s_{k,j})

By introducing an attention module, a region context vector c_i^I is computed for each word in the text. The region context vector c_i^I is a dynamic representation of the image sub-regions associated with the i-th word in the sentence, i.e. the weighted sum of the visual features of all image sub-regions:

c_i^I = Σ_{j=1}^{N} α_{i,j} v_j^I,   α_{i,j} = exp(γ_1 s̄_{i,j}) / Σ_{k=1}^{N} exp(γ_1 s̄_{i,k})

where N denotes the number of image sub-region blocks. The similarity between the text and the local image features is then computed from the region context vectors and the word vectors:

R_w(text, image) = (1/γ_2) log Σ_{i=1}^{T} exp(γ_2 R(c_i^I, w_i^I))

where R(c_i^I, w_i^I) is the cosine similarity between c_i^I and w_i^I. The trained text encoder can extract the joint semantic features of the text and the image from the text.
Step 2: the text is input into a spatial layout encoder to obtain joint spatial features of the text and the segmentation map; the spatial layout encoder is realized by the text encoder obtained through pre-training within a spatial encoder architecture.
The spatial encoder architecture described in step 2 consists of a text encoder and a segmentation map encoder, as shown in FIG. 3(b). The pre-trained segmentation map encoder is used within the spatial encoder architecture to guide the training of the text encoder, and the trained text encoder can extract the joint spatial layout features of the text and the segmentation map from the text.
The text encoder is realized by a bidirectional long short-term memory network (Bi-LSTM); the final hidden state of the model represents the text sentence feature e_S, and the intermediate hidden states represent the text word features w_S. The segmentation map encoder is implemented with an Inception-v3 model pre-trained on the ImageNet dataset; considering the differences between segmentation maps and natural images, only the first few layers of the model are used here to extract segmentation features. The last used layer of the model learns the global spatial feature g_S of the segmentation map, and an intermediate convolutional layer learns the sub-region features v_S of the segmentation map.
The segmentation map encoder and the text encoder are continuously trained and optimized by minimizing the distance between the text features and the segmentation map features until the joint spatial layout features of the text and the segmentation map can be extracted from the text. The distance between the text features and the spatial layout features includes sentence-level and word-level components.
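Reusing only the early layers of a pre-trained backbone for the segmentation map encoder could look like the following sketch; the torchvision usage, the cut-off index and the pooling of the global feature are assumptions rather than details from the patent.

```python
import torch
import torch.nn as nn
from torchvision import models

# Backbone whose early convolutional blocks are reused; load ImageNet weights in practice.
backbone = models.inception_v3(weights=None, aux_logits=True)
early_layers = nn.Sequential(*list(backbone.children())[:7])   # assumed cut-off point

seg_maps = torch.randn(2, 3, 299, 299)        # segmentation maps rendered as 3-channel input
local_feats = early_layers(seg_maps)          # sub-region spatial features v_S
global_feat = local_feats.mean(dim=(2, 3))    # pooled global spatial feature g_S
```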
The spatial layout encoder architecture is optimized with a word-level loss L_w^S and a sentence-level loss L_s^S; the spatial encoder architecture loss function L^S is defined as follows:

L^S = L_w^S + L_s^S

The sentence-level loss is computed from the dot product between the paired segmentation map global spatial feature g_S and the text sentence feature e_S. The sentence-level loss in the spatial encoder architecture is defined as:

L_s^S = - Σ log [ exp(g_S · e_S) / Σ_{e'} exp(g_S · e') ]

where the outer sum runs over the paired samples in a batch and e' ranges over all sentence features in the batch.

To calculate the word-level loss L_w^S, the text-image similarity computation of DAMSM (Deep Attentional Multimodal Similarity Model) is adopted to measure the similarity between the text features and the spatial features of the segmentation map at the word level. The concrete implementation is as follows.

A similarity matrix s between the words in the text and the segmentation map sub-regions is computed:

s = w_S^T v_S

where w_S denotes the word features of the text and v_S denotes the sub-region features of the segmentation map; s̄_{i,j} denotes the similarity of the i-th word to the j-th segmentation block, and T denotes the number of words in the text:

s̄_{i,j} = exp(s_{i,j}) / Σ_{k=1}^{T} exp(s_{k,j})

By introducing an attention module, a region context vector c_i^S is computed for each word in the text. The region context vector c_i^S is a dynamic representation of the segmentation map sub-regions associated with the i-th word in the sentence, i.e. the weighted sum of the visual features of all segmentation map sub-regions:

c_i^S = Σ_{j=1}^{N} α_{i,j} v_j^S,   α_{i,j} = exp(γ_1 s̄_{i,j}) / Σ_{k=1}^{N} exp(γ_1 s̄_{i,k})

where N denotes the number of segmentation map sub-region blocks. The similarity between the text and the local segmentation map features is then computed from the region context vectors and the word vectors:

R_w(text, segmentation map) = (1/γ_2) log Σ_{i=1}^{T} exp(γ_2 R(c_i^S, w_i^S))

where R(c_i^S, w_i^S) is the cosine similarity between c_i^S and w_i^S. The trained text encoder can extract the joint spatial features of the text and the segmentation map from the text.
Step 3: and (3) fusing the semantic features obtained in the step (1) and the spatial features obtained in the step (2) through a dynamic fusion module, wherein the semantic features and the spatial features comprise sentence levels and word levels. The fused sentence-level features are fed into an initial generator in a generative antagonism network to generate a low resolution image. The fused word-level features are then used in subsequent generators in the generative antagonism network to constrain the generation of the high resolution image.
In the step 3, concat operation is adopted, and sentence-level semantic features e are fused I And spatial feature e S As input to the initial stage generator, as shown in fig. 4. The specific implementation formula is as follows:
e I-S =concat(e I ,e S ,z)
where z represents a random noise vector.
In step 3, a dynamic fusion mode is adopted to fuse the word-level semantic features w_I and the spatial features w_S, and the obtained feature is fed into the higher-stage generators to generate high-resolution images. The concrete implementation is as follows.

The word-level semantic features w_I and the spatial features w_S are each dynamically fused with the image features h_{k-1} obtained at the previous stage, and the two resulting features are then re-fused, again by dynamic fusion, to serve as the semantic-spatial constraint of the generator at the subsequent stage. The formulas are as follows:

γ_I^k = DF(h_{k-1}, w_I)
w̃_I^k = γ_I^k ⊙ w_I + (1 - γ_I^k) ⊙ h_{k-1}
γ_S^k = DF(h_{k-1}, w_S)
w̃_S^k = γ_S^k ⊙ w_S + (1 - γ_S^k) ⊙ h_{k-1}
γ^k = DF(w̃_I^k, w̃_S^k)
w_{I-S}^k = γ^k ⊙ w̃_I^k + (1 - γ^k) ⊙ w̃_S^k

where DF(·) denotes a multilayer perceptron and ⊙ denotes element-wise multiplication. First, DF(·) is applied to the image features h_{k-1} obtained at the previous stage and the semantic features w_I to obtain a fusion ratio γ_I^k; then DF(·) is applied to h_{k-1} and the spatial features w_S to obtain a fusion ratio γ_S^k; finally γ^k is obtained to fuse the semantic and spatial features at the k-th stage. The fused word-level features w_{I-S}^k ensure that, in the higher-stage generators, high-resolution image generation is constrained in terms of both semantics and spatial layout.
Step 4: the sentence-level feature e_{I-S} obtained by fusion in step 3 is fed into the initial generator of the cascaded generative adversarial network to generate a low-resolution image. The word-level semantic features w_I obtained in step 1 and the spatial features w_S obtained in step 2 are processed with an attention module to obtain the word-level attention semantic features c^I and the attention spatial features c^S; the word-level attention semantic-spatial constraint c_{I-S} is then calculated using the fusion method of step 3. The image features h_{k-1} generated by the generator at the previous stage and the word-level attention semantic-spatial constraint c_{I-S} are fed together into the subsequent generator to generate a finer high-resolution image.
The word-level attention semantic-spatial constraint c_{I-S} is generated with the attention module in step 4 as follows.

The word-level semantic features w_I and the spatial features w_S are each dynamically fused with the image features h_{k-1} obtained at the previous stage, and the fused features are used as the value vectors in the attention module. The attention module is then used to generate the word-level attention semantic features and the attention spatial features respectively; the value vectors of the attention module are the same as the key vectors, and the query vector uses the image features h_{k-1} obtained at the previous stage. Finally, a dynamic fusion module is adopted to fuse them into the word-level attention semantic-spatial constraint c_{I-S}.

Using the attention module proposed by AttnGAN, the image features h generated at the previous stage are taken as the query vectors, and the word-level spatial features w_S are taken as the key and value vectors. An attention spatial feature is generated for each image sub-region from the word-level spatial features:

c_j^S = Attn(h, w_S)_j = Σ_{i=1}^{T} β_{j,i} w_i^S

where β_{j,i} is defined as:

β_{j,i} = exp(h_j^T w_i^S) / Σ_{k=1}^{T} exp(h_j^T w_k^S)

Here β_{j,i} represents the degree of attention of the j-th image sub-region to the i-th word-level spatial feature, and an attention spatial feature c^S is synthesized for the image by the weighted sum of the word-level spatial features. Replacing the key and value vectors w_S in Attn(h, w_S) with the word-level semantic features w_I gives Attn(h, w_I), which likewise synthesizes an attention semantic feature c^I for the image. The attention spatial features c^S and the attention semantic features c^I are fused using the word-level feature fusion method of step 3. The specific formulas are defined as:

γ_I^k = DF(h_{k-1}, c^I)
c̃_I^k = γ_I^k ⊙ c^I + (1 - γ_I^k) ⊙ h_{k-1}
γ_S^k = DF(h_{k-1}, c^S)
c̃_S^k = γ_S^k ⊙ c^S + (1 - γ_S^k) ⊙ h_{k-1}
γ^k = DF(c̃_I^k, c̃_S^k)
c_{I-S}^k = γ^k ⊙ c̃_I^k + (1 - γ^k) ⊙ c̃_S^k
as shown in fig. 2, the sentence-level feature e obtained by the fusion in step 3 is used in step 4 I-S As an input to the initial stage generator, a low resolution image is generated, concretely implemented as:
h 0 =F 0 (e I-S )
Figure BDA00031083236700001215
word-level attention semantic-space constraint c obtained by using image features obtained in the previous stage and dynamic fusion I-S As an input to the subsequent generator,a high resolution image is generated. Consider c I-S The method belongs to high-level features of the image, and low-level features such as colors, textures, structures and the like may be more concerned when the image is generated at a low stage, so that two high-level information of semantic and spatial layout cannot necessarily play a good role in guiding. As shown in FIG. 4, the network determines whether the current generation stage requires higher-level information or middle-lower-level information by itself to obtain
Figure BDA00031083236700001216
The image characteristic h obtained in the previous stage is obtained k-1 Constraint information with the kth phase->
Figure BDA00031083236700001217
Fusion, send to F k In the module, image feature h is generated k . Finally, the image features are sent to corresponding generators G k As shown in fig. 2. The specific implementation formula is as follows:
Figure BDA0003108323670000131
Figure BDA0003108323670000132
Figure BDA0003108323670000133
wherein Fk (. Cndot.) image feature generation Module representing the kth stage, G k (·) represents the kth generator.
Step 5: and constructing a cascade generation type countermeasure network consisting of a plurality of pairs of generators and discriminators, and training the generation countermeasure network on a data set through designing a loss function, namely training the generation countermeasure network with minimized loss function.
To optimize the overall neural network model, generator penalty L is employed G Sum discriminator loss L D Alternate training. Loss function L of generator G And loss function L of the arbiter D Is defined as follows:
L G =L advG1 L DAMSM L D =L advD
wherein LadvG Is defined as follows:
Figure BDA0003108323670000134
p G representing the data distribution of the generated image, s represents the text description of the input, and D (·) represents the discriminant in the generated countermeasure network. Wherein the first term ensures that the generated image is as realistic as possible, and the second term ensures that the generated image matches the input text. Loss L using DAMSM in step 2 DAMSM To calculate the local and global similarity of the input text and the generated image, and to ensure that the generated image and the input text maintain semantic consistency by minimizing the loss.
L advD Is defined as:
Figure BDA0003108323670000135
p data the data distribution representing the real image, the first term in the formula being used to distinguish the generated image from the real image, the second term determining whether the input image and text match.
During training, a total of 210 epochs are trained. The whole network model is not trained end-to-end: first, the text encoder and the image encoder are trained synchronously, both with a learning rate of 0.002; the other text encoder and the segmentation map encoder are then trained synchronously, also with a learning rate of 0.002. The training of the generative adversarial network is a process of continual optimization and competition between the generators and the discriminators, and their learning rate is set to 0.0002.
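The learning rates described above can be wired up as follows (a sketch with placeholder modules; the choice of the Adam optimizer and its betas are assumptions, only the learning rates 0.002 and 0.0002 come from the text):

```python
import torch
import torch.nn as nn

# Placeholder modules standing in for the real sub-networks
text_encoder_sem, image_encoder = nn.LSTM(300, 128), nn.Linear(2048, 256)
text_encoder_spa, seg_encoder = nn.LSTM(300, 128), nn.Linear(2048, 256)
generator, discriminator = nn.Linear(356, 3), nn.Linear(3, 1)

# Encoder pre-training: each encoder pair is optimized synchronously at lr=0.002
enc_opt = torch.optim.Adam(
    list(text_encoder_sem.parameters()) + list(image_encoder.parameters()), lr=0.002)
seg_opt = torch.optim.Adam(
    list(text_encoder_spa.parameters()) + list(seg_encoder.parameters()), lr=0.002)

# Adversarial training: generators and discriminators use lr=0.0002
g_opt = torch.optim.Adam(generator.parameters(), lr=0.0002, betas=(0.5, 0.999))
d_opt = torch.optim.Adam(discriminator.parameters(), lr=0.0002, betas=(0.5, 0.999))
```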
Step 6: and 3, fusing the word characteristics and the sentence characteristics obtained in the step 1 and the step 2, feeding the word characteristics and the sentence characteristics into a cascade generation type countermeasure network according to the mode described in the step 4, training a generator according to the mode described in the step 5, and generating images with different resolutions by the trained generator.
In step 6, this embodiment obtains good generation results on the public MSCOCO dataset. MSCOCO is a large and rich object detection, segmentation and captioning dataset. The dataset targets scene understanding, with images mainly captured from complex everyday scenes. MSCOCO provides very detailed annotations for each image, including a text description of the whole image, bounding boxes of the objects in the image, class labels, segmentation maps, etc. The images contain 91 object classes, 82 of which have more than 5,000 instance objects each. An example of generating a corresponding image from a text description is shown in FIG. 5, and examples of images generated by the network are shown in FIG. 6.
In summary, in this embodiment the text description is input into the generative adversarial network and the network model is trained to obtain well-trained generators; the generators can then generate multi-object images that conform to the text description. The method can address the problems that traditional approaches incur high time and labor costs and cannot guarantee the effect.
While the foregoing is directed to embodiments of the present invention, other and further embodiments of the invention may be devised without departing from the basic scope thereof, and the scope thereof is determined by the claims that follow.

Claims (7)

1. A text-to-multi-object image generation method based on joint embedding, characterized by comprising the following steps:
step 1: inputting the text description into a semantic encoder to obtain joint semantic features of the text and the image, wherein the semantic encoder is realized by a text encoder which is obtained by pre-training in a semantic encoder architecture;
the semantic encoder architecture described in step 1 includes a text encoder and an image encoder; guiding a training text encoder by using an image encoder obtained by pre-training in a semantic encoder architecture, wherein the text encoder obtained by training can extract joint semantic features of a text and an image from the text;
step 2: inputting the text into a space layout encoder to obtain joint space characteristics of the text and the segmentation map, wherein the space layout encoder is realized by a text encoder which is obtained by pre-training in a space encoder architecture;
The space encoder architecture described in step 2 consists of a text encoder and a segmentation map encoder; the text encoder is guided to be trained by using a segmentation map encoder obtained through pre-training in a space encoder framework, and the text encoder obtained through training can extract joint space layout characteristics of the text and the segmentation map from the text;
step 3: the semantic features obtained in the step 1 and the spatial features obtained in the step 2 are fused through a dynamic fusion module, wherein the semantic features and the spatial features comprise sentence-level features and word-level features; the sentence-level features obtained by fusion are sent to an initial generator in a cascade generation type countermeasure network to generate a low-resolution image; the word-level features obtained by fusion are used in a subsequent generator in a cascade generation type countermeasure network to restrict the generation of high-resolution images;
step 4: the sentence-level features obtained by fusion in the step 3 are sent to an initial generator in a cascade generation type countermeasure network, and a low-resolution image is generated; processing the word-level semantic features obtained in the step 1 and the space features obtained in the step 2 by using an attention module to obtain word-level attention semantic features and attention space features; calculating word-level attention semantic-space constraints using the method of step 3 calculation; the image features generated by the generator at the previous stage and the word-level attention semantic-space constraint are fed into the subsequent generator together to generate a finer high-resolution image;
Step 5: constructing a cascade generation type countermeasure network consisting of a plurality of pairs of generators and discriminators, and training on a data set through designing a loss function to generate the countermeasure network, namely training to obtain a generated countermeasure network with minimized loss function;
step 6: and 3, fusing the word-level features and the sentence-level features obtained in the step 1 and the step 2, sending the word-level features and the sentence-level features into a cascade generation type countermeasure network in a manner described in the step 4, training a generator in a manner described in the step 5, and generating images with different resolutions by the trained generator.
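As a non-limiting illustration of the data flow in claim 1, the following PyTorch-style sketch traces steps 1 to 3 with stand-in modules; all module names, dimensions, and the GRU/mean-pooling text encoder are assumptions made for exposition, not the claimed implementation.

```python
# Illustrative sketch of the claim-1 data flow (hypothetical modules, not the patented code).
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    """Stand-in bidirectional GRU text encoder returning word- and sentence-level features."""
    def __init__(self, vocab_size=5000, emb_dim=256, hid_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hid_dim, batch_first=True, bidirectional=True)

    def forward(self, tokens):                        # tokens: (B, T) integer ids
        h, _ = self.rnn(self.embed(tokens))           # (B, T, 2*hid_dim)
        return h, h.mean(dim=1)                       # word-level, sentence-level features

# Step 1 / step 2: two separately trained text encoders give the joint semantic and
# the joint spatial features of the same description.
semantic_enc, spatial_enc = TextEncoder(), TextEncoder()
tokens = torch.randint(0, 5000, (4, 12))              # toy batch of 4 captions, 12 tokens each
w_I, e_I = semantic_enc(tokens)                        # word / sentence semantic features
w_S, e_S = spatial_enc(tokens)                         # word / sentence spatial features

# Step 3: the fused sentence-level feature (with noise) feeds the initial generator,
# while the word-level features w_I, w_S later constrain the higher-resolution stages.
z = torch.randn(4, 100)
e_IS = torch.cat([e_I, e_S, z], dim=1)                 # e_{I-S} = concat(e_I, e_S, z)
```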
2. A method for generating a text-to-multi-object image based on joint embedding as recited in claim 1, wherein: the method further comprises a step 7 of applying the images generated in step 6, which conform to the text description, to the cross-modal generation field to solve related engineering problems;
the related engineering problems in step 7 include multimedia educational resource construction, image editing, and computer-assisted teaching.
3. A method of generating a text-to-multi-object image based on joint embedding as claimed in claim 1 or 2, wherein: the text encoder trained within the semantic encoder architecture can extract joint semantic features of the text and the image from the text, implemented as follows: an image encoder is pre-trained on the ImageNet dataset, and the trained image encoder can obtain image semantic features from images; by minimizing the distance between the text features and the image semantic features, the text encoder is made to extract joint semantic features of the text and the image from the text; while the text encoder is optimized, part of the network layers of the image encoder are continuously optimized by minimizing the distance between the text features and the image semantic features, and the two encoders are trained and optimized synchronously until joint semantic features of the text and the image can be extracted from the text; the text features include sentence-level features e_I and word-level features w_I, and the image semantic features include global features g_I and local features v_I; the distance between the text features and the semantic features therefore also comprises a sentence-level distance and a word-level distance;
the semantic encoder architecture is optimized with a word-level loss L_w^I and a sentence-level loss L_s^I, and the semantic encoder architecture loss function L_I is defined as:
L_I = L_w^I + L_s^I
the sentence-level loss is computed from the dot product between the paired image global semantic feature g_I and text sentence feature e_I; the sentence-level loss L_s^I of the semantic encoder architecture is defined as the negative log-likelihood of matching each sentence to its paired image under a softmax over these dot products:
L_s^I = -Σ_i log [ exp(e_I^(i) · g_I^(i)) / Σ_j exp(e_I^(i) · g_I^(j)) ]
to compute the word-level loss L_w^I, the similarity between the text features and the image semantic features is computed at the word level following the text-image similarity method of DAMSM (Deep Attentional Multimodal Similarity Model), implemented as follows:
a similarity matrix s between the words of the text and the image sub-regions is computed:
s = w_I^T v_I
where w_I denotes the word features of the text and v_I denotes the sub-region features of the image;
the similarity between each word and each image block is obtained from the similarity matrix s, with s̄_{i,j} denoting the normalized similarity between the i-th word and the j-th image block and T denoting the number of words in the text:
s̄_{i,j} = exp(s_{i,j}) / Σ_{k=1..T} exp(s_{k,j})
by introducing an attention module, a region context vector c_i is computed for each word in the text; the region context vector c_i is a dynamic representation of the image sub-regions associated with the i-th word of the sentence, i.e. a weighted sum of the visual features of all image sub-regions:
c_i = Σ_{j=1..N} α_{i,j} v_I^j,  α_{i,j} = exp(s̄_{i,j}) / Σ_{k=1..N} exp(s̄_{i,k})
where N denotes the number of image sub-region blocks and v_I^j denotes the j-th sub-region feature of the image; the similarity between the text and the local image features is then computed from the region context vectors and the word vectors:
R(c_i, w_I^i) = (c_i · w_I^i) / (‖c_i‖ ‖w_I^i‖)
where R(c_i, w_I^i) is the cosine similarity between the region context vector c_i of the image sub-regions associated with the i-th word and the word feature w_I^i of the i-th word; the word-level loss L_w^I aggregates these word-region similarities over the sentence as in DAMSM; the trained text encoder can then extract joint semantic features of the text and the image from the text.
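As a non-limiting illustration of the word-level similarity described in claim 3, the sketch below computes the word-region similarity matrix, the attention-weighted region context vectors, and the per-word cosine similarities; the simple mean-based loss at the end and all tensor shapes are illustrative assumptions rather than the exact DAMSM aggregation.

```python
# Sketch of the claim-3 word-level text-image similarity (DAMSM-style); hypothetical shapes/names.
import torch
import torch.nn.functional as F

def word_region_similarity(w_I, v_I):
    """w_I: (T, D) word features; v_I: (N, D) image sub-region features.
    Returns per-word cosine similarities R(c_i, w_i) between each word and its
    attended region context vector c_i (a weighted sum of sub-region features)."""
    s = w_I @ v_I.t()                           # (T, N) similarity matrix s_{i,j} = w_i . v_j
    s_bar = F.softmax(s, dim=0)                 # normalize over the T words for each region
    alpha = F.softmax(s_bar, dim=1)             # attention of each word over the N regions
    c = alpha @ v_I                             # (T, D) region context vector per word
    return F.cosine_similarity(c, w_I, dim=1)   # (T,) R(c_i, w_i)

# Toy usage: 12 words, 49 image sub-regions (e.g. a 7x7 feature map), 256-dim features.
T, N, D = 12, 49, 256
rel = word_region_similarity(torch.randn(T, D), torch.randn(N, D))
word_level_loss = -rel.mean()                   # one simple way to turn similarities into a loss
```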
4. A method for generating a text-to-multi-object image based on joint embedding as recited in claim 3, wherein: the text encoder trained within the spatial encoder architecture can extract joint spatial layout features of the text and the segmentation map from the text, implemented as follows: a segmentation map encoder is pre-trained on the ImageNet dataset, and the trained segmentation map encoder can obtain spatial layout features from the segmentation map; by minimizing the distance between the text features and the segmentation map features, the text encoder is made to extract joint spatial layout features of the text and the segmentation map from the text; while the text encoder is optimized, part of the network layers of the segmentation map encoder are continuously optimized by minimizing the distance between the text features and the segmentation map features, and the two encoders are trained and optimized synchronously until joint spatial layout features of the text and the segmentation map can be extracted from the text; the text features include sentence-level features e_S and word-level features w_S, and the spatial layout features include global features g_S and local features v_S; the distance between the text features and the spatial layout features therefore also comprises a sentence-level distance and a word-level distance;
the spatial layout encoder architecture is optimized with a word-level loss L_w^S and a sentence-level loss L_s^S, and the spatial encoder architecture loss function L_S is defined as:
L_S = L_w^S + L_s^S
the sentence-level loss is computed from the dot product between the paired segmentation-map global spatial feature g_S and text sentence feature e_S; the sentence-level loss L_s^S of the spatial encoder architecture is defined as the negative log-likelihood of matching each sentence to its paired segmentation map under a softmax over these dot products:
L_s^S = -Σ_i log [ exp(e_S^(i) · g_S^(i)) / Σ_j exp(e_S^(i) · g_S^(j)) ]
to compute the word-level loss L_w^S, the similarity between the text features and the segmentation-map spatial features is computed at the word level following the text-image similarity method of DAMSM (Deep Attentional Multimodal Similarity Model), implemented as follows:
a similarity matrix s between the words of the text and the segmentation-map sub-regions is computed:
s = w_S^T v_S
where w_S denotes the word features of the text and v_S denotes the sub-region features of the segmentation map;
the similarity between each word and each segmentation block is obtained from the similarity matrix s, with s̄_{i,j} denoting the normalized similarity between the i-th word and the j-th segmentation block and T denoting the number of words in the text:
s̄_{i,j} = exp(s_{i,j}) / Σ_{k=1..T} exp(s_{k,j})
by introducing an attention module, a region context vector c_i is computed for each word in the text; the region context vector c_i is a dynamic representation of the segmentation-map sub-regions associated with the i-th word of the sentence, i.e. a weighted sum of the visual features of all segmentation-map sub-regions:
c_i = Σ_{j=1..N} α_{i,j} v_S^j,  α_{i,j} = exp(s̄_{i,j}) / Σ_{k=1..N} exp(s̄_{i,k})
where N denotes the number of segmentation-map sub-region blocks; the similarity between the text and the local segmentation-map features is then computed from the region context vectors and the word vectors:
R(c_i, w_S^i) = (c_i · w_S^i) / (‖c_i‖ ‖w_S^i‖)
where R(c_i, w_S^i) is the cosine similarity between the region context vector c_i of the segmentation-map sub-regions associated with the i-th word and the word feature w_S^i of the i-th word; the trained text encoder can then extract joint spatial features of the text and the segmentation map from the text.
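Claims 3 and 4 both train the text encoder against a pre-trained visual encoder while fine-tuning only part of that encoder's layers; the sketch below shows one way such a joint optimizer could be set up, with an untrained ResNet-18 standing in for the pre-trained segmentation-map (or image) encoder, and with all layer and hyperparameter choices being illustrative assumptions.

```python
# Sketch of the claim-3/4 training setup: the text encoder is trained while only part of
# the pre-trained visual encoder is fine-tuned (hypothetical stand-in modules).
import torch
import torch.nn as nn
import torchvision.models as models

text_encoder = nn.GRU(input_size=256, hidden_size=128, batch_first=True, bidirectional=True)
seg_encoder = models.resnet18()          # stand-in for the pre-trained segmentation-map encoder

# Freeze the visual encoder except its last block and projection head, so only "part of
# the network layers" is optimized together with the text encoder.
for p in seg_encoder.parameters():
    p.requires_grad = False
for p in seg_encoder.layer4.parameters():
    p.requires_grad = True
for p in seg_encoder.fc.parameters():
    p.requires_grad = True

# Both encoders are optimized together by minimizing the text-to-map feature distance
# (the distance/loss itself is computed as in the word- and sentence-level losses above).
trainable = list(text_encoder.parameters()) + [p for p in seg_encoder.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(trainable, lr=2e-4)
```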
5. A method for generating a text-to-multi-object image based on joint embedding as recited in claim 4, wherein: in step 3, a concat operation is used to fuse the sentence-level semantic features e_I and spatial features e_S as the input of the initial-stage generator, implemented as:
e_{I-S} = concat(e_I, e_S, z)
where z denotes a random noise vector;
in step 3, the word-level semantic features w_I and spatial features w_S are fused in a dynamic fusion manner, and the resulting features w_{I-S}^k are fed into the higher-stage generators to generate high-resolution images, implemented as follows:
the word-level semantic features w_I and spatial features w_S are each dynamically fused with the image features h_{k-1} obtained in the previous stage, and the two resulting features are fused again by dynamic fusion to serve as the semantic-space constraint of the subsequent-stage generator; the formulas are:
γ_I^k = DF(h_{k-1}, w_I)
f_I^k = γ_I^k ⊙ w_I + (1 - γ_I^k) ⊙ h_{k-1}
γ_S^k = DF(h_{k-1}, w_S)
f_S^k = γ_S^k ⊙ w_S + (1 - γ_S^k) ⊙ h_{k-1}
γ^k = DF(f_I^k, f_S^k)
w_{I-S}^k = γ^k ⊙ f_I^k + (1 - γ^k) ⊙ f_S^k
where DF(·) denotes a multi-layer perceptron; first, DF(·) is applied to the previous-stage image features h_{k-1} and the semantic features w_I to obtain a fusion ratio γ_I^k; then DF(·) is applied to the previous-stage image features h_{k-1} and the spatial features w_S to obtain a fusion ratio γ_S^k; finally, w_{I-S}^k, the fused semantic and spatial features of the k-th stage, is obtained; the fused word-level features ensure that high-resolution image generation in the higher-stage generators is constrained both semantically and in spatial layout;
step 4: the sentence-level features e_{I-S} fused in step 3 are fed into the initial generator of the cascaded generative adversarial network to generate a low-resolution image; the word-level semantic features w_I obtained in step 1 and the spatial features w_S obtained in step 2 are processed with an attention module to obtain word-level attention semantic features c^I and attention spatial features c^S; the word-level attention semantic-space constraint c_{I-S} is computed by fusing c^I and c^S with the method of step 3; the image features h_{k-1} generated by the previous-stage generator and the word-level attention semantic-space constraint c_{I-S} are fed together into the subsequent generator to generate a finer, higher-resolution image.
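As a non-limiting illustration of the dynamic fusion of claim 5, the sketch below interprets the fusion ratio produced by DF(·) as a sigmoid gate over a weighted sum of two feature streams; the gate form, the MLP size, and the feature dimensions are assumptions made for exposition, not the claimed module.

```python
# Sketch of the claim-5 dynamic fusion: an MLP DF(.) predicts a fusion ratio that gates a
# weighted combination of two feature streams (interpreting "fusion ratio" as a sigmoid gate).
import torch
import torch.nn as nn

class DynamicFusion(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.df = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, a, b):
        gamma = torch.sigmoid(self.df(torch.cat([a, b], dim=-1)))  # fusion ratio in (0, 1)
        return gamma * a + (1.0 - gamma) * b                       # gated combination

# Toy usage: fuse word-level semantic/spatial features with the previous-stage image feature
# h_{k-1}, then fuse the two results again into the semantic-space constraint w_{I-S}^k.
dim = 64
fuse = DynamicFusion(dim)
h_prev, w_I, w_S = torch.randn(8, dim), torch.randn(8, dim), torch.randn(8, dim)
f_I = fuse(w_I, h_prev)
f_S = fuse(w_S, h_prev)
w_IS = fuse(f_I, f_S)                                              # stage-k constraint
```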
6. The method for generating a text-to-multi-object image based on joint embedding of claim 5, wherein:
the word-level attention semantic-space constraint c_{I-S} in step 4 is generated as follows:
the word-level semantic features w_I and spatial features w_S are each dynamically fused with the image features h_{k-1} obtained in the previous stage, and the fused features serve as the value vectors of the attention module; the attention module is then used to generate the word-level attention semantic features c^I and the attention spatial features c^S respectively; the value vectors of the attention module are the same as the key vectors, and the query vectors use the image features h_{k-1} obtained in the previous stage; finally, the dynamic fusion module is used to fuse them into the word-level attention semantic-space constraint c_{I-S};
using the attention module proposed by AttnGAN, the image features h_{k-1} generated in the previous stage are treated as the query vectors, and the word-level spatial features w_S are treated as the key and value vectors; an attention spatial feature is generated for each image block from the word-level spatial features:
c_j^S = Σ_{i=1..T} β_{j,i} w_S^i
where β_{j,i} is defined as:
β_{j,i} = exp(h_{k-1}^j · w_S^i) / Σ_{l=1..T} exp(h_{k-1}^j · w_S^l)
and β_{j,i} denotes the degree of attention paid by the j-th image region to the i-th word-level spatial feature; the attention spatial feature c^S of the image is synthesized by this weighted sum of the word-level spatial features; replacing the key and value vectors w_S in the above formula Attn(h, w_S) with the word-level semantic features w_I gives Attn(h, w_I), which likewise synthesizes the attention semantic feature c^I of the image; the attention spatial feature c^S and the attention semantic feature c^I are fused using the word-level feature fusion of step 3, defined as:
γ_I^k = DF(h_{k-1}, c^I)
f_I^k = γ_I^k ⊙ c^I + (1 - γ_I^k) ⊙ h_{k-1}
γ_S^k = DF(h_{k-1}, c^S)
f_S^k = γ_S^k ⊙ c^S + (1 - γ_S^k) ⊙ h_{k-1}
γ^k = DF(f_I^k, f_S^k)
c_{I-S} = γ^k ⊙ f_I^k + (1 - γ^k) ⊙ f_S^k
in step 4, the sentence-level features e_{I-S} fused in step 3 are used as the input of the initial-stage generator to generate a low-resolution image, implemented as:
h_0 = F_0(e_{I-S})
x_0 = G_0(h_0)
the image features obtained in the previous stage and the word-level attention semantic-space constraint c_{I-S} obtained by dynamic fusion are used as the input of the subsequent generators to generate high-resolution images; the network decides whether the current generation stage needs higher-level or middle/low-level information to obtain c_{I-S}^k; the image features h_{k-1} obtained in the previous stage are fused with the constraint information c_{I-S}^k of the k-th stage and fed into the F_k module to generate the image features h_k; finally, the image features are fed into the corresponding generator G_k to generate a high-resolution image; implemented as:
h_k = F_k(h_{k-1}, c_{I-S}^k)
x_k = G_k(h_k), k = 1, 2, …
where F_k(·) denotes the image feature generation module of the k-th stage and G_k(·) denotes the k-th generator.
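As a non-limiting illustration of the attention step of claim 6, the sketch below lets the previous-stage image sub-region features query the word-level features, which serve as keys and values; the dot-product scoring and all shapes are illustrative assumptions in the spirit of AttnGAN, not the claimed module.

```python
# Sketch of the claim-6 attention step: previous-stage image features query the word-level
# spatial (or semantic) features, which act as keys and values (hypothetical shapes).
import torch
import torch.nn.functional as F

def word_attention(h_prev, w):
    """h_prev: (N, D) image sub-region features of stage k-1 (queries);
    w: (T, D) word-level features (keys and values).
    Returns c: (N, D), an attended word feature for every image sub-region."""
    scores = h_prev @ w.t()                 # (N, T) query-key dot products
    beta = F.softmax(scores, dim=1)         # attention of region j over the T words
    return beta @ w                         # (N, D) weighted sum of word features

N, T, D = 49, 12, 64
h_prev = torch.randn(N, D)
c_S = word_attention(h_prev, torch.randn(T, D))   # attention spatial feature c^S
c_I = word_attention(h_prev, torch.randn(T, D))   # attention semantic feature c^I
# c^S and c^I would then be fused by the dynamic fusion module into c_{I-S}
# and fed, together with h_prev, to the next F_k / G_k pair of the cascade.
```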
7. The method for generating a text-to-multi-object image based on joint embedding of claim 6, wherein: to optimize the overall neural network model, the generator loss L_G and the discriminator loss L_D are trained alternately; the generator loss function L_G and the discriminator loss function L_D are defined as:
L_G = L_advG + λ_1 L_DAMSM,  L_D = L_advD
where L_advG is defined as:
L_advG = -(1/2) E_{x̂∼p_G}[log D(x̂)] - (1/2) E_{x̂∼p_G}[log D(x̂, s)]
p_G denotes the data distribution of the generated images, s denotes the input text description, and D(·) denotes a discriminator of the cascaded generative adversarial network; the first term ensures that the generated image is realistic, and the second term ensures that the generated image matches the input text; the DAMSM loss L_DAMSM used in step 2 computes the local and global similarity between the input text and the generated image, and minimizing this loss keeps the generated image semantically consistent with the input text;
L_advD is defined as:
L_advD = -(1/2) E_{x∼p_data}[log D(x)] - (1/2) E_{x̂∼p_G}[log(1 - D(x̂))] - (1/2) E_{x∼p_data}[log D(x, s)] - (1/2) E_{x̂∼p_G}[log(1 - D(x̂, s))]
where p_data denotes the data distribution of the real images; the first (unconditional) terms are used to distinguish generated images from real images, and the second (conditional) terms determine whether the input image and text match.
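As a non-limiting illustration of the losses of claim 7, the sketch below pairs an unconditional (realism) term with a conditional (text-matching) term for both the generator and the discriminator and adds the DAMSM term weighted by λ1; the cross-entropy form and the example weight are assumptions made for exposition, not the claimed loss.

```python
# Sketch of the claim-7 losses: each discriminator scores both realism (unconditional term)
# and text-image matching (conditional term); the exact weighting is illustrative.
import torch
import torch.nn.functional as F

def generator_loss(d_fake_uncond, d_fake_cond, damsm_loss, lambda1=5.0):
    # First term: fake images should look real; second term: they should match the text.
    adv = F.binary_cross_entropy_with_logits(d_fake_uncond, torch.ones_like(d_fake_uncond)) \
        + F.binary_cross_entropy_with_logits(d_fake_cond, torch.ones_like(d_fake_cond))
    return adv + lambda1 * damsm_loss            # L_G = L_advG + lambda_1 * L_DAMSM (lambda1 is only an example weight)

def discriminator_loss(d_real_uncond, d_fake_uncond, d_real_cond, d_fake_cond):
    real = F.binary_cross_entropy_with_logits(d_real_uncond, torch.ones_like(d_real_uncond)) \
         + F.binary_cross_entropy_with_logits(d_real_cond, torch.ones_like(d_real_cond))
    fake = F.binary_cross_entropy_with_logits(d_fake_uncond, torch.zeros_like(d_fake_uncond)) \
         + F.binary_cross_entropy_with_logits(d_fake_cond, torch.zeros_like(d_fake_cond))
    return real + fake                           # L_D = L_advD

# Toy usage with random discriminator logits for a batch of 8 images.
logits = lambda: torch.randn(8, 1)
L_G = generator_loss(logits(), logits(), damsm_loss=torch.tensor(0.3))
L_D = discriminator_loss(logits(), logits(), logits(), logits())
```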
CN202110642098.0A 2021-06-09 2021-06-09 Text-to-multi-object image generation method based on joint embedding Active CN113191375B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110642098.0A CN113191375B (en) 2021-06-09 2021-06-09 Text-to-multi-object image generation method based on joint embedding

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110642098.0A CN113191375B (en) 2021-06-09 2021-06-09 Text-to-multi-object image generation method based on joint embedding

Publications (2)

Publication Number Publication Date
CN113191375A CN113191375A (en) 2021-07-30
CN113191375B true CN113191375B (en) 2023-05-09

Family

ID=76976242

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110642098.0A Active CN113191375B (en) 2021-06-09 2021-06-09 Text-to-multi-object image generation method based on joint embedding

Country Status (1)

Country Link
CN (1) CN113191375B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113850779A (en) * 2021-09-24 2021-12-28 深圳闪回科技有限公司 Automatic grading algorithm for second-hand mobile phone based on variational multi-instance image recognition
CN113869007B (en) * 2021-10-11 2024-04-23 大连理工大学 Text generation image learning method based on deep learning
CN115293109B (en) * 2022-08-03 2024-03-19 合肥工业大学 Text image generation method and system based on fine granularity semantic fusion
CN115512368B (en) * 2022-08-22 2024-05-10 华中农业大学 Cross-modal semantic generation image model and method
CN115797495B (en) * 2023-02-07 2023-04-25 武汉理工大学 Method for generating image by sentence-character semantic space fusion perceived text
CN116030048B (en) * 2023-03-27 2023-07-18 山东鹰眼机械科技有限公司 Lamp inspection machine and method thereof
CN116863032B (en) * 2023-06-27 2024-04-09 河海大学 Flood disaster scene generation method based on generation countermeasure network

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111260740A (en) * 2020-01-16 2020-06-09 华南理工大学 Text-to-image generation method based on generation countermeasure network

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107330100B (en) * 2017-07-06 2020-04-03 北京大学深圳研究生院 Image-text bidirectional retrieval method based on multi-view joint embedding space
CN110263203B (en) * 2019-04-26 2021-09-24 桂林电子科技大学 Text-to-image generation method combined with Pearson reconstruction
CN110866958B (en) * 2019-10-28 2023-04-18 清华大学深圳国际研究生院 Method for text to image
CN111340907A (en) * 2020-03-03 2020-06-26 曲阜师范大学 Text-to-image generation method of self-adaptive attribute and instance mask embedded graph

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111260740A (en) * 2020-01-16 2020-06-09 华南理工大学 Text-to-image generation method based on generation countermeasure network

Also Published As

Publication number Publication date
CN113191375A (en) 2021-07-30

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant