CN113448477A - Interactive image editing method and device, readable storage medium and electronic equipment - Google Patents

Interactive image editing method and device, readable storage medium and electronic equipment

Info

Publication number
CN113448477A
CN113448477A (application number CN202111008172.XA)
Authority
CN
China
Prior art keywords
image
text
features
attribute
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111008172.XA
Other languages
Chinese (zh)
Other versions
CN113448477B (en)
Inventor
李波
林枭
刘彬
刘奋成
赵旭
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanchang Hangkong University
Lenovo New Vision Nanchang Artificial Intelligence Industrial Research Institute Co Ltd
Original Assignee
Nanchang Hangkong University
Lenovo New Vision Nanchang Artificial Intelligence Industrial Research Institute Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanchang Hangkong University and Lenovo New Vision Nanchang Artificial Intelligence Industrial Research Institute Co Ltd
Priority to CN202111008172.XA
Publication of CN113448477A
Application granted
Publication of CN113448477B
Active legal status
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/01 Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F 3/048 Interaction techniques based on graphical user interfaces [GUI]
    • G06F 3/0484 Interaction techniques based on graphical user interfaces [GUI] for the control of specific functions or operations, e.g. selecting or manipulating an object, an image or a displayed text element, setting a parameter value or selecting a range
    • G06F 3/04845 Interaction techniques based on graphical user interfaces [GUI] for the control of specific functions or operations, for image manipulation, e.g. dragging, rotation, expansion or change of colour
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G06F 18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/044 Recurrent networks, e.g. Hopfield networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Human Computer Interaction (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Processing Or Creating Images (AREA)

Abstract

An interactive image editing method, an interactive image editing apparatus, a readable storage medium and an electronic device are provided. The method comprises the following steps: extracting attribute features from an original image to obtain image attribute features; performing word embedding and context-semantic encoding on the descriptive text corresponding to the original image to obtain text features; fusing the image attribute features and the text features to obtain fused features; extracting overall structural features of the original image; performing spatial-attention fusion on the overall structural features and the fused features to obtain corrected structural features of the edited region; performing structural-feature completion of the non-edited region on the corrected structural features of the edited region to obtain corrected overall structural features; and inputting the corrected overall structural features into a generator, so that the generator generates an image matching the descriptive text under the guidance of the fused features.

Description

Interactive image editing method and device, readable storage medium and electronic equipment
Technical Field
The present invention relates to the field of image editing, and in particular, to an interactive image editing method, an interactive image editing apparatus, a readable storage medium, and an electronic device.
Background
Interactive image editing based on text descriptions aims to edit an image interactively according to a textual description. Natural language is one of the most important and common modes of human communication, and using textual descriptions to interactively edit images is an important research direction of modern artificial intelligence in the field of image processing.
Although existing methods have made some progress on text-based interactive image editing and can preliminarily understand the editing intention expressed in the text, ensuring the joint consistency of the spatial and textual attention of the edit while keeping the non-edited region decoupled remains the main difficulty.
Existing text-based image editing methods mainly encode the text information and the image data into a latent semantic manifold space through encoders, realize text-guided interactive editing in this high-level semantic manifold space by combining and operating on the text encoding and the image semantic attribute encoding, and finally generate the editing result through a decoder. Such methods are essentially extensions of text-to-image generation; they lack an explicit definition of and constraint on the edited and non-edited regions, so most results show obvious changes in the non-edited region, and the quality of the edited image is therefore low.
Disclosure of Invention
In view of the above, it is necessary to provide an interactive image editing method, an interactive image editing apparatus, a readable storage medium and an electronic device, to solve the problem that the quality of the edited image is low in prior-art text-based image editing methods.
An interactive image editing method comprising:
extracting attribute features of the original image to obtain image attribute features;
performing word embedding and encoding of context semantics on the descriptive text corresponding to the original image to obtain text characteristics;
fusing the image attribute features and the text features to obtain fused features;
extracting the integral structural features of the original image;
performing space attention fusion processing on the overall structural feature and the fusion feature to obtain a corrected structural feature of the edited region;
performing structural feature completion of a non-edited region on the corrected structural feature of the edited region to obtain a corrected overall structural feature;
inputting the corrected overall structural features into a generator so that the generator generates an image matched with the descriptive text based on the fusion feature guide.
Further, in the above interactive image editing method, the step of extracting the attribute features of the original image to obtain the image attribute features includes:
inputting the original image into an image attribute encoder, so that the image attribute encoder encodes it with Inception-v3 and outputs the last-layer vector to obtain global attribute features;
taking the global attribute features as input, using a set of multi-layer perceptrons defined according to a hyper-parameter to estimate a Gaussian mixture distribution whose dimensionality corresponds to the input image, and obtaining the image attribute features from this distribution.
Further, in the above interactive image editing method, the step of performing context semantic word embedding and encoding on the descriptive text corresponding to the original image to obtain the text feature includes:
mapping the descriptive text corresponding to the original image through a vocabulary to obtain a group of word indices, and embedding them to obtain word vectors of the descriptive text length;
and inputting the word vectors of the descriptive text length into a text encoder, and acquiring the output vector of each time-step node to obtain the text features.
Further, in the above interactive image editing method, the step of fusing the image attribute features and the text features to obtain fused features includes:
concatenating the image attribute features with each word vector in the text features along the column direction to obtain concatenated features;
inputting the concatenated features into a Bi-LSTM model, and taking the output of each time-step node of the Bi-LSTM model as the fused features of the corresponding word and the image attribute distribution;
and taking the hidden-layer output vector of the last node of the Bi-LSTM model as the image attribute-text fusion code, and passing the image attribute-text fusion code through a group of multi-layer perceptrons to decouple the parameter vector group corresponding to the fused image attribute distribution.
Further, in the above interactive image editing method, the step of inputting the corrected overall structural features into a generator so that the generator generates an image matching the descriptive text under the guidance of the fused features includes:
transferring the parameter vector group to the currently generated image as variable parameters in the generator structure;
inputting the corrected overall structural features into the generator, processing them through multiple upsampling and convolution blocks, and outputting an image matching the descriptive text.
Further, in the above interactive image editing method, before the step of extracting the attribute features of the original image, the method further includes:
constructing an interactive image editing model from an image attribute encoder, a text encoder, a content encoder, a fuser and a generator;
and training the constructed interactive image editing model in a cross-cycle manner.
Further, in the above interactive image editing method, before the step of extracting the attribute features of the original image, the method further includes:
and pre-training the image attribute encoder and the text encoder for mapping space alignment by adopting a DAMSM algorithm.
The invention also discloses an interactive image editing device, comprising:
the image attribute feature extraction module is used for extracting attribute features from the original image to obtain image attribute features;
the text feature coding module is used for performing word embedding and context-semantic encoding on the descriptive text corresponding to the original image to obtain text features;
the fusion module is used for fusing the image attribute features and the text features to obtain fused features;
the overall structure extraction module is used for extracting overall structural features of the original image;
the fusion processing module is used for performing spatial-attention fusion on the overall structural features and the fused features to obtain corrected structural features of the edited region;
the structure completion module is used for performing structural-feature completion of the non-edited region on the corrected structural features of the edited region to obtain corrected overall structural features;
and the input module is used for inputting the corrected overall structural features into the generator, so that the generator generates an image matching the descriptive text under the guidance of the fused features.
Further, the interactive image editing apparatus further includes:
the model building module is used for building an interactive image editing model from an image attribute encoder, a text encoder, a content encoder, a fuser and a generator;
and the model training module is used for training the constructed interactive image editing model in a cross-cycle manner.
Further, the interactive image editing apparatus further includes:
and the pre-training module is used for pre-training the alignment of the mapping space of the image attribute encoder and the text encoder by adopting a DAMSM algorithm.
The invention also discloses a readable storage medium on which a program is stored, which program, when executed by a processor, performs any of the methods described above.
The invention also discloses an electronic device comprising a memory, a processor and a program stored on the memory and executable on the processor, the processor implementing the method according to any one of claims 1 to 7 when executing the program.
The method can separate the content features and the attribute features of an image. By integrating text semantic features into the image attribute features, it realizes truly text-constrained image editing, avoids the complexity and uncontrollability of regenerating an image from text, better preserves the regions irrelevant to the text description while modifying only the described object, and edits high-quality images at higher speed.
Drawings
FIG. 1 is a schematic diagram of an interactive image editing model according to an embodiment of the present invention;
FIG. 2 is a flowchart of an interactive image editing method according to an embodiment of the present invention;
FIG. 3 is a schematic structural diagram of the fuser according to an embodiment of the present invention;
FIG. 4 shows the results of a comparison experiment on image editing quality according to an embodiment of the present invention;
FIG. 5 is a visualization of an ablation experiment on the cycle-consistency training mode in an embodiment of the present invention;
FIG. 6 is a visualization of the effect of decoupling image attributes and content in an embodiment of the present invention;
fig. 7 is a block diagram of an interactive image editing apparatus according to an embodiment of the present invention.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the accompanying drawings are illustrative only for the purpose of explaining the present invention, and are not to be construed as limiting the present invention.
These and other aspects of embodiments of the invention will be apparent with reference to the following description and attached drawings. In the description and drawings, particular embodiments of the invention have been disclosed in detail as being indicative of some of the ways in which the principles of the embodiments of the invention may be practiced, but it is understood that the scope of the embodiments of the invention is not limited correspondingly. On the contrary, the embodiments of the invention include all changes, modifications and equivalents coming within the spirit and terms of the claims appended hereto.
The method of the invention requires that the model be trained on a data set of a certain object class, comprising single-object images of that class and a set of texts describing each image. There is no strict requirement on the input image size; a resolution of 256x256 is optimal, and the object to be edited should be salient in the image. The input text should be an English character string without any specific description format. The interactive image editing method in the embodiment of the present invention may be implemented with the interactive image editing model shown in fig. 1, where the model comprises an image attribute encoder EA, a text encoder ET, a content encoder EC, a fuser and a generator G; Lattr in the figure is the KL loss constraining the distribution of the edited attributes, and AdaIN (Adaptive Instance Normalization) denotes adaptive instance normalization.
Referring to fig. 2, an interactive image editing method according to an embodiment of the present invention includes steps S11 to S17.
In step S11, attribute features are extracted from the original image to obtain the image attribute features.
In order to better extract the attribute features of the image, an Inception-v3 network can be used as the core structure of the image attribute encoder; after the local and global features of the image are extracted, several different multi-layer perceptrons are used to estimate the distribution of the current features. The specific implementation steps are as follows:
S111: the original image is used as the input of the encoder and is encoded with Inception-v3 to obtain the local attribute features and the global attribute features of the image, which lie in real spaces whose dimensions are determined by the number of feature channels and the size of the local features, the original image itself being described by its number of channels and its image size;
S112: the global attribute features are taken as the input of a set of multi-layer perceptrons defined according to a hyper-parameter, namely the maximum number of attribute categories of the processed images; these perceptrons estimate a Gaussian mixture distribution whose dimensionality corresponds to the input image, and the resulting image attribute features are expressed as a set of output parameter vectors, one per attribute component.
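As an illustration of step S11, the following is a minimal PyTorch-style sketch of an image attribute encoder, assuming torchvision's Inception-v3 supplies the pooled global feature and a set of multi-layer perceptrons predicts Gaussian parameters for each attribute component. The class name, the number of attribute components, the hidden sizes and the frozen backbone are illustrative assumptions, not values fixed by the patent.

```python
import torch
import torch.nn as nn
from torchvision import models

class ImageAttributeEncoder(nn.Module):
    """Sketch of the image attribute encoder EA (step S11):
    Inception-v3 features feed one small MLP per attribute component,
    each predicting the mean and log-variance of a Gaussian."""

    def __init__(self, n_attr: int = 10, attr_dim: int = 64):
        super().__init__()
        backbone = models.inception_v3(weights=None, aux_logits=True)
        backbone.fc = nn.Identity()          # keep the 2048-d pooled vector
        self.backbone = backbone
        self.heads = nn.ModuleList([
            nn.Sequential(nn.Linear(2048, 256), nn.ReLU(),
                          nn.Linear(256, 2 * attr_dim))
            for _ in range(n_attr)
        ])
        self.attr_dim = attr_dim

    def forward(self, image: torch.Tensor):
        # image: (B, 3, 299, 299) as expected by Inception-v3
        self.backbone.eval()                 # frozen backbone, illustrative simplification
        with torch.no_grad():
            global_feat = self.backbone(image)       # (B, 2048)
        params = []
        for head in self.heads:
            mu, log_var = head(global_feat).chunk(2, dim=-1)   # Gaussian parameters
            params.append((mu, log_var))
        return params                                # one (mu, log_var) per component


if __name__ == "__main__":
    enc = ImageAttributeEncoder()
    attrs = enc(torch.randn(2, 3, 299, 299))
    print(len(attrs), attrs[0][0].shape)             # 10 torch.Size([2, 64])
```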
In step S12, word embedding and context-semantic encoding are performed on the descriptive text corresponding to the original image to obtain the text features.
In specific implementation, the text is first given a primary word embedding, and then a recurrent neural network processes this primary embedding to obtain both a context-semantic word embedding of the text and a sentence embedding vector (global encoding) of the text. The specific implementation steps are as follows:
S121: the descriptive text is mapped through the vocabulary to obtain a group of word indices, which are embedded to obtain word vectors of the descriptive text length, whose dimension is the word vector dimension;
S122: a bidirectional long short-term memory model (Bi-LSTM) is adopted as the context-based text encoder; the word vectors are taken as input, and the output vector of each time-step node is taken as the context-dependent embedding of the word input at that node, giving the final word embedding of the text, namely the text features W;
S123: the hidden-layer output of the last time-step node of the Bi-LSTM is taken as the sentence encoding of the text and is used as a self-supervision variable in the DAMSM algorithm.
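As an illustration of step S12, the following is a minimal sketch of the text encoder ET: a word embedding layer followed by a bidirectional LSTM, whose per-step outputs serve as the contextual text features W and whose final hidden states serve as the sentence encoding. The vocabulary size, embedding dimension and hidden dimension are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    """Sketch of the text encoder ET (step S12): embedding + Bi-LSTM."""

    def __init__(self, vocab_size: int = 5000, emb_dim: int = 300, hid_dim: int = 128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.bilstm = nn.LSTM(emb_dim, hid_dim, batch_first=True, bidirectional=True)

    def forward(self, word_ids: torch.Tensor):
        # word_ids: (B, T) indices obtained by vocabulary lookup
        vectors = self.embedding(word_ids)              # (B, T, emb_dim)
        word_feats, (h_n, _) = self.bilstm(vectors)     # (B, T, 2*hid_dim)
        # concatenate the last hidden states of both directions as the sentence code
        sent_code = torch.cat([h_n[0], h_n[1]], dim=-1) # (B, 2*hid_dim)
        return word_feats, sent_code                    # text features W, sentence encoding


if __name__ == "__main__":
    enc = TextEncoder()
    W, s = enc(torch.randint(0, 5000, (2, 12)))
    print(W.shape, s.shape)    # torch.Size([2, 12, 256]) torch.Size([2, 256])
```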
In step S13, the image attribute features and the text features are fused to obtain the fused features.
In specific implementation, the image attribute features obtained in step S11 and the text features obtained in step S12 are concatenated, a sequence-dependent fusion operation is performed by a recurrent network, the result of each time-step node and the hidden-layer result of the tail node are output, and the hidden-layer result is further passed through multi-layer perceptrons to obtain the fused image attribute distribution. The specific implementation process is as follows:
S131: the core structure of the fuser is shown in fig. 3, where MLP denotes a group of multi-layer perceptrons and LSTM a long short-term memory model; the image attribute features are concatenated column-wise with each word vector of the text features W (the i-th word vector being the vector of the i-th word of the text) to obtain the concatenated features;
S132: the image and text features are fused by a Bi-LSTM that takes the concatenated features as input; the hidden-layer state of the first time-step node is initialized with random noise to enhance the diversity of editing, and the output of each time-step node is taken as the fused feature of the corresponding word and the image attribute features;
S133: the hidden-layer output vector of the last node of the Bi-LSTM is taken as the image attribute-text fusion code, and this code is passed through a group of different multi-layer perceptrons to decouple the parameter vector group corresponding to the fused image attribute distribution.
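For steps S131-S133, the following is a minimal sketch of the fuser: the attribute code is concatenated with every word vector, the sequence is passed through a Bi-LSTM whose initial hidden state is random noise, and the final hidden state is decoupled into a parameter vector group by a set of multi-layer perceptrons. All dimensions and the number of parameter vectors are illustrative assumptions.

```python
import torch
import torch.nn as nn

class Fuser(nn.Module):
    """Sketch of the fuser (step S13): concatenation + Bi-LSTM + MLP heads."""

    def __init__(self, attr_dim: int = 64, word_dim: int = 256,
                 hid_dim: int = 128, n_params: int = 10):
        super().__init__()
        self.bilstm = nn.LSTM(attr_dim + word_dim, hid_dim,
                              batch_first=True, bidirectional=True)
        self.hid_dim = hid_dim
        # one MLP per decoupled parameter vector of the fused attribute distribution
        self.heads = nn.ModuleList([
            nn.Sequential(nn.Linear(2 * hid_dim, hid_dim), nn.ReLU(),
                          nn.Linear(hid_dim, attr_dim))
            for _ in range(n_params)
        ])

    def forward(self, attr_code: torch.Tensor, word_feats: torch.Tensor):
        # attr_code: (B, attr_dim); word_feats: (B, T, word_dim)
        B, T, _ = word_feats.shape
        # column-wise concatenation of the attribute code with every word vector
        tiled = attr_code.unsqueeze(1).expand(B, T, -1)
        seq = torch.cat([tiled, word_feats], dim=-1)          # (B, T, attr+word)
        # random-noise initial hidden state to diversify the editing
        h0 = torch.randn(2, B, self.hid_dim, device=seq.device)
        c0 = torch.zeros(2, B, self.hid_dim, device=seq.device)
        fused_seq, (h_n, _) = self.bilstm(seq, (h0, c0))      # per-word fused features
        fusion_code = torch.cat([h_n[0], h_n[1]], dim=-1)     # attribute-text fusion code
        params = [head(fusion_code) for head in self.heads]   # parameter vector group
        return fused_seq, params


if __name__ == "__main__":
    fused, params = Fuser()(torch.randn(2, 64), torch.randn(2, 12, 256))
    print(fused.shape, len(params), params[0].shape)
```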
In step S14, a CNN with a residual structure is adopted as the content encoder to extract the overall structural features of the original image.
In step S15, spatial-attention fusion is performed on the overall structural features and the fused features to obtain the corrected structural features of the edited region.
In step S16, structural-feature completion of the non-edited region is performed on the corrected structural features of the edited region to obtain the corrected overall structural features.
After the original image is encoded by the content encoder, the result is combined with the fused features obtained in step S13 by spatial-attention processing, which locates the positions on the content encoding that correspond to the fused features, while the positions not related to the fusion are recovered through a skip connection. The specific implementation process is as follows:
the original image is used as the input of the content encoder, whose output gives the overall structural features of the image;
the overall structural features and the image attribute-text fused features are subjected to spatial-attention fusion to obtain the corrected structural features of the edited region;
a skip-connection structure is adopted to complete the structural features of the regions unrelated to the editing on the corrected structural features of the edited region, which yields the corrected overall structural features.
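For steps S14-S16, the following is a minimal sketch of a residual content encoder combined with spatial-attention fusion and skip-connection completion. The concrete attention form (a sigmoid gate computed from the fused features) and all dimensions are illustrative assumptions; the patent only specifies that spatial-attention fusion selects the edited region and a skip connection completes the non-edited region.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    def __init__(self, ch: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(ch, ch, 3, padding=1))

    def forward(self, x):
        return x + self.body(x)               # residual structure

class ContentEncoder(nn.Module):
    """Sketch of the content encoder EC (S14), spatial-attention fusion (S15)
    and skip-connection completion (S16)."""

    def __init__(self, feat_ch: int = 64, fuse_dim: int = 256):
        super().__init__()
        self.encode = nn.Sequential(
            nn.Conv2d(3, feat_ch, 4, stride=2, padding=1), nn.ReLU(),
            ResBlock(feat_ch), ResBlock(feat_ch))
        # project fused text-attribute features so they can gate spatial positions
        self.fuse_proj = nn.Linear(fuse_dim, feat_ch)

    def forward(self, image: torch.Tensor, fused: torch.Tensor):
        struct = self.encode(image)                           # overall structural features
        key = self.fuse_proj(fused)                           # (B, feat_ch)
        # spatial attention: similarity of every position to the fused feature
        attn = torch.einsum("bchw,bc->bhw", struct, key)
        attn = torch.sigmoid(attn).unsqueeze(1)               # (B, 1, H, W) edit mask
        # corrected structure of the edited region: inject the fused features there
        injected = key.unsqueeze(-1).unsqueeze(-1)            # (B, feat_ch, 1, 1)
        edited = attn * (struct + injected)
        # skip connection: the non-edited region keeps the original structure
        completed = edited + (1.0 - attn) * struct
        return completed, attn


if __name__ == "__main__":
    out, mask = ContentEncoder()(torch.randn(2, 3, 256, 256), torch.randn(2, 256))
    print(out.shape, mask.shape)
```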
In step S17, the corrected overall structural features are input into the generator, so that the generator generates an image matching the descriptive text under the guidance of the fused features.
The generator reprocesses the content encoding of the original image under the guidance of the fused features to generate the edited image. The specific implementation steps are as follows:
S171: Adaptive Instance Normalization (AdaIN) is adopted as the main normalization method of the generator, so that the parameter vector group is transferred, in a manner similar to style transfer, onto the currently generated image; the parameter vector group is adjusted by an affine transformation to the dimensions accepted by the generator and is used as variable parameters in the generator structure;
S172: the generator takes the corrected overall structural features as input and, after processing by multiple upsampling and convolution blocks, outputs the edited image matching the descriptive text.
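For step S17, the following is a minimal sketch of a generator built around adaptive instance normalization: the decoupled parameter vectors are mapped by an affine layer to per-channel scale and bias applied after instance normalization, and upsampling plus convolution blocks produce the output image. The number of blocks and the channel widths are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaIN(nn.Module):
    """Adaptive instance normalization: style parameters come from an affine
    transform of the attribute-text parameter vectors (step S171)."""

    def __init__(self, style_dim: int, ch: int):
        super().__init__()
        self.norm = nn.InstanceNorm2d(ch, affine=False)
        self.affine = nn.Linear(style_dim, 2 * ch)     # -> (scale, bias)

    def forward(self, x, style):
        scale, bias = self.affine(style).chunk(2, dim=-1)
        x = self.norm(x)
        return x * (1 + scale.unsqueeze(-1).unsqueeze(-1)) + bias.unsqueeze(-1).unsqueeze(-1)

class UpBlock(nn.Module):
    def __init__(self, in_ch, out_ch, style_dim):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, 3, padding=1)
        self.adain = AdaIN(style_dim, out_ch)

    def forward(self, x, style):
        x = F.interpolate(x, scale_factor=2, mode="nearest")   # upsampling (S172)
        return F.relu(self.adain(self.conv(x), style))

class Generator(nn.Module):
    """Sketch of the generator G: corrected structural features in, edited image out."""

    def __init__(self, struct_ch: int = 64, style_dim: int = 64):
        super().__init__()
        self.blocks = nn.ModuleList([
            UpBlock(struct_ch, 64, style_dim),
            UpBlock(64, 32, style_dim),
        ])
        self.to_rgb = nn.Conv2d(32, 3, 3, padding=1)

    def forward(self, struct, style):
        x = struct
        for blk in self.blocks:
            x = blk(x, style)
        return torch.tanh(self.to_rgb(x))


if __name__ == "__main__":
    img = Generator()(torch.randn(2, 64, 64, 64), torch.randn(2, 64))
    print(img.shape)    # torch.Size([2, 3, 256, 256])
```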
It will be appreciated that the model needs to be trained prior to interactive image editing.
First, the image attribute encoder and the text encoder are jointly pre-trained; then the model is pre-trained according to steps S11-S17 for initialization; finally, the model is trained by the cross-cycle method. The specific implementation steps are as follows:
M1: A Deep Attentional Multimodal Similarity Model (DAMSM) is adopted to pre-train the alignment of the mapping spaces of the image attribute encoder (Attribute Encoder) and the text encoder (Text Encoder). The DAMSM algorithm specifically comprises the following steps:
N1: the text features and the image attribute features are multiplied and normalized with softmax along the word-embedding dimension, i.e. $s = W^{T} v$, where $W^{T}$ is the transpose of the text features $W$ and $v$ denotes the local image attribute features; the component of $s$ at position $(i, j)$ represents the similarity between the i-th word and the j-th image region, and $\bar{s}_{i,j} = \exp(s_{i,j}) / \sum_{k} \exp(s_{k,j})$ is the normalization of this similarity along the text direction (sentence length).
N2: the content vector of the joint region is computed, so that the relevance between each local region and each word of the text is obtained dynamically: $c_i = \sum_{j} \alpha_{i,j} v_j$ with $\alpha_{i,j} = \exp(\gamma_1 \bar{s}_{i,j}) / \sum_{k} \exp(\gamma_1 \bar{s}_{i,k})$, where the region content vector $c_i$ dynamically represents the relevance of the i-th word to each region of the image; $v_j$ is the j-th column of the local feature matrix, i.e., the feature of the j-th image region; $\alpha_{i,j}$ normalizes the result of the j-th region and the i-th word along the image spatial direction; and the hyper-parameter $\gamma_1$ determines the weight of the locally relevant sub-region features in the computation of the region content vector.
N3: the matching score of the image regions and the text words is computed from the relevance obtained in N2: $R(c_i, w_i) = c_i^{T} w_i / (\lVert c_i \rVert\,\lVert w_i \rVert)$ and $R(Q, D) = \log\bigl(\sum_{i} \exp(\gamma_2 R(c_i, w_i))\bigr)^{1/\gamma_2}$, where $w_i$ is the i-th word vector of the text and the hyper-parameter $\gamma_2$ amplifies the influence of text-image region pairs with high correlation on the score.
N4: with the score computation of N3, the conditional probability that each known image in a batch matches its text is computed over all sample pairs, and the conditional probability that each known text matches its image is obtained in the same way: $P(D_i \mid Q_i) = \exp(\gamma_3 R(Q_i, D_i)) / \sum_{j} \exp(\gamma_3 R(Q_i, D_j))$, where $Q_i$ and $D_i$ are the i-th image and the i-th text of the batch, and the hyper-parameter $\gamma_3$ smooths the computation and is set according to experiments.
N5: the loss is computed from these distributions: $\mathcal{L}_1^{w} = -\sum_{i} \log P(D_i \mid Q_i)$ and $\mathcal{L}_2^{w} = -\sum_{i} \log P(Q_i \mid D_i)$; the sentence-level losses $\mathcal{L}_1^{s}$ and $\mathcal{L}_2^{s}$ are obtained by replacing the text word embeddings and the local image features in all the related formulas with the text sentence embedding and the global image features.
The consistency of the image attribute encoder and the text encoder in the mapping encoding space is trained through the DAMSM algorithm.
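The word-level part of the DAMSM matching described in N1-N5 can be sketched as follows, assuming the standard DAMSM formulation; the feature dimensions and the values of the hyper-parameters gamma1, gamma2 and gamma3 are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def damsm_word_loss(words, regions, gamma1=5.0, gamma2=5.0, gamma3=10.0):
    """Word-level DAMSM matching loss sketch (steps N1-N5).

    words:   (B, T, D) contextual word features W from the text encoder
    regions: (B, D, N) local image attribute features (N image regions)
    """
    B = words.shape[0]
    mat = torch.zeros(B, B)
    for i in range(B):            # image i
        for j in range(B):        # text j
            w = words[j]                          # (T, D)
            v = regions[i]                        # (D, N)
            s = F.softmax(w @ v, dim=0)           # N1: word-region similarity,
                                                  #     normalized along the words
            alpha = F.softmax(gamma1 * s, dim=1)  # N2: attention over regions
            c = alpha @ v.t()                     # (T, D) region content vectors
            r = F.cosine_similarity(c, w, dim=-1) # N3: per-word match score
            mat[i, j] = torch.logsumexp(gamma2 * r, dim=0) / gamma2

    # N4: posteriors over the batch; N5: negative log-likelihood in both directions
    labels = torch.arange(B)
    loss_img2txt = F.cross_entropy(gamma3 * mat, labels)
    loss_txt2img = F.cross_entropy(gamma3 * mat.t(), labels)
    return loss_img2txt + loss_txt2img


if __name__ == "__main__":
    loss = damsm_word_loss(torch.randn(4, 12, 64), torch.randn(4, 64, 49))
    print(float(loss))
```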
M2: and pre-training all modules by taking training sample data as input of the model and taking reconstructed original images as targets according to the steps S11-S17 to initialize model parameters, wherein the training sample data comprises a plurality of images used for training and corresponding texts.
M3: and training the whole model by adopting a cross-cycle reconstruction mode. The model input is one batch comprising n tuples at a time, one image within each tuple
Figure 857939DEST_PATH_IMAGE046
And a corresponding text
Figure 407869DEST_PATH_IMAGE085
(ii) a Taking the reverse text and the sequential image in each batch to form a new tuple, namely each image
Figure 961341DEST_PATH_IMAGE046
Corresponding to a non-matching text
Figure 520499DEST_PATH_IMAGE086
Inputting the model according to the steps S11-S17 to obtain the text
Figure 108606DEST_PATH_IMAGE087
Matched edited images
Figure 95016DEST_PATH_IMAGE089
(ii) a All in each batch
Figure 899899DEST_PATH_IMAGE091
As a new input image and sequential text, a new tuple is composed, i.e. each image
Figure 731589DEST_PATH_IMAGE089
Corresponding to a non-matching text
Figure 298836DEST_PATH_IMAGE085
This text
Figure 331514DEST_PATH_IMAGE085
For editing the pre-image
Figure 984213DEST_PATH_IMAGE046
So that the restored image is assumed to be obtained after inputting the model according to the steps S11-S17
Figure 760539DEST_PATH_IMAGE092
Thus, assume an image
Figure 447872DEST_PATH_IMAGE092
Should approximate the original image as closely as possible
Figure 651451DEST_PATH_IMAGE094
. And adopting reconstruction of a matched text, cross reconstruction of a non-matched text and the matched text and similarity of attribute distribution of the image before and after editing as main self-supervision information to construct a loss function, and realizing training optimization of the model.
The objective function of the generator in the cross-training process is a combination, weighted by hyper-parameters, of the following terms: the cyclic reconstruction loss of the image; the reconstruction loss of the output image when the image is edited with its matching text; the reconstruction loss of the image itself after encoding and decoding; the KL distance between the attribute distribution of the edited image and the target attribute distribution; the adversarial loss of the generator; and the reconstruction loss of the attribute distribution after cyclic editing.
In these terms, the edited images are produced by the generator from each image paired with the text data arranged in reverse order along the batch dimension; the KL distance is computed between pairs of distributions, the distributions involved being the attribute distribution of the original image, the attribute distribution of the edited image and the reconstructed attribute distribution of the image after cyclic editing; the adversarial loss uses both the conditional and the unconditional discrimination results of the discriminator, with the reverse-order arrangement of the text T appearing in the conditional terms; expectations are taken over the data; C is the channel, W and H are the width and the height of the image, respectively, and CHW is the product of the channel, the width and the height of the image.
The corresponding discriminator objective function is built from the same conditional and unconditional discrimination results of the discriminator.
a circular cross training mode is adopted, so that the problem that the model is unsupervised during training in the editing task is solved.
Further, the following related experiments were performed on the optimized model.
The model in the embodiment of the invention was compared quantitatively with existing open-source work. The comparison was carried out against two methods, ManiGAN and TAGAN, on the Caltech-UCSD Birds 200 (CUB) data set. The CUB data set contains 8855 training images and 2933 test images. The quantitative metrics adopted by the invention are the Inception Score (IS), the text-image similarity (sim), the L1 pixel difference (diff) and the Manipulation Precision (MP). IS measures the quality and authenticity of the edited image, sim measures the similarity between the edited image and the input text, diff represents the pixel-level difference between the edited image and the original input image, and MP measures the overall editing effect and is defined in terms of sim and diff. According to the average scores of the three methods on the 2933 test images, the invention is superior to the existing methods on all four quantitative evaluation metrics. The highest MP value shows that the invention achieves the best text-image editing consistency, and the IS value reflects that the editing results of the invention are more realistic and natural.
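The exact MP formula is given in the original only as an expression in sim and diff; a common definition in the text-guided editing literature (e.g. ManiGAN) is MP = (1 - diff) x sim, which the short helper below assumes.

```python
def manipulation_precision(sim: float, diff: float) -> float:
    """Manipulation Precision under the assumed definition MP = (1 - diff) * sim,
    where sim is the text-image similarity of the edited image and diff is the
    L1 pixel difference to the original input (both in [0, 1])."""
    return (1.0 - diff) * sim


if __name__ == "__main__":
    # e.g. a well-edited image: high text similarity, low pixel change elsewhere
    print(manipulation_precision(sim=0.82, diff=0.10))   # 0.738
```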
In addition, a subjective user study of the editing results of the three models was designed. Fifty users aged 15 to 50 were invited to take part in a subjective visual quality survey; pairs of editing results from the three methods were displayed to the users randomly and alternately, and the users clicked on the better editing result. The experimental results show that more users preferred the editing results of the model of the invention.
The editing quality was also compared with that of existing open-source work, as shown in fig. 4. From visual observation, the invention achieves better results on target edits related to the semantics of the text description. In addition, since the algorithm of the invention does not generate images from scratch but only modifies the content referred to by the text description, the output does not change the overall form of the edited image, and regions irrelevant to the text description, especially the background, are better preserved.
To verify the effectiveness of the cycle-consistency training method of the present invention, an ablation experiment was performed on the cyclic training, as shown in fig. 5. The figure shows editing results output during model training, where n-ep denotes the number of iterations of the model over the whole training set, w/ cyc denotes the training mode with cycle consistency, and w/o cyc denotes the mode without cyclic training. According to the results, the model without the cycle-consistency constraint cannot edit effectively; owing to the loss of supervision, its output oscillates between a rough editing effect and a reconstruction of the original image, making the model difficult to converge.
The visualization of decoupling image content from attributes is shown in fig. 6. The experimental results verify that the invention can indeed separate information such as the appearance and the background of the image from attributes such as the colors described by the text, effectively decomposes the content features of the image, and demonstrates the effectiveness of the decoupling model.
The method encodes the text and the image attribute information into a latent-variable manifold space through encoders, and then uses a recurrent neural network to operate on the image attribute distribution through the text encoding, obtaining a text-conditioned image attribute distribution; an additional encoder encodes the content of the image, and an attention constraint based on the text-image attribute fused features is added to decouple the image structure of the edited and non-edited regions; the edited image is restored in a manner similar to style transfer by using Adaptive Instance Normalization as the main normalization in the generator structure; through cross-cycle training, the similarity of the attribute distributions of corresponding images before and after the cross is constrained to achieve the editing goal, while the reconstruction results of the images before and after the cycle are constrained to maintain the quality of the output images. Because a one-stage direct generation scheme is used and the image content is separated from the attributes, the method edits quickly, achieves a marked editing effect, decouples the editing target from the non-edited regions well, and is suitable for editing colour images containing a single object with natural-language text.
Referring to fig. 7, an interactive image editing apparatus according to an embodiment of the present invention includes:
the image attribute feature extraction module 10 is used for extracting attribute features from the original image to obtain image attribute features;
the text feature coding module 20 is used for performing word embedding and context-semantic encoding on the descriptive text corresponding to the original image to obtain text features;
the fusion module 30 is used for fusing the image attribute features and the text features to obtain fused features;
the overall structure extraction module 40 is used for extracting overall structural features of the original image;
the fusion processing module 50 is used for performing spatial-attention fusion on the overall structural features and the fused features to obtain corrected structural features of the edited region;
the structure completion module 60 is used for performing structural-feature completion of the non-edited region on the corrected structural features of the edited region to obtain corrected overall structural features;
and the input module 70 is used for inputting the corrected overall structural features into the generator, so that the generator generates an image matching the descriptive text under the guidance of the fused features.
Further, the interactive image editing apparatus further includes:
the model building module is used for building an interactive image editing model from an image attribute encoder, a text encoder, a content encoder, a fuser and a generator;
and the model training module is used for training the constructed interactive image editing model in a cross-cycle manner.
Further, the interactive image editing apparatus further includes:
and the pre-training module is used for pre-training the alignment of the mapping space of the image attribute encoder and the text encoder by adopting a DAMSM algorithm.
The implementation principle and the technical effects of the interactive image editing apparatus provided by the embodiment of the present invention are the same as those of the foregoing method embodiment; for the sake of brevity, details not mentioned in the apparatus embodiment can be found in the corresponding contents of the foregoing method embodiment.
The invention also proposes a computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, implements the interactive image editing method as described above.
The invention also discloses an electronic device, which comprises a memory, a processor and a program which is stored on the memory and can run on the processor, wherein the processor realizes the interactive image editing method when executing the program.
Those of skill in the art will understand that the logic and/or steps represented in the flowcharts or otherwise described herein, e.g., an ordered listing of executable instructions that can be viewed as implementing logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CDROM). Additionally, the computer-readable medium could even be paper or another suitable medium upon which the program is printed, as the program can be electronically captured, via for instance optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner if necessary, and then stored in a computer memory.
It should be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
The above-mentioned embodiments only express several embodiments of the present invention, and the description thereof is more specific and detailed, but not construed as limiting the scope of the present invention. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the inventive concept, which falls within the scope of the present invention. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims (10)

1. An interactive image editing method, comprising:
extracting attribute features from an original image to obtain image attribute features;
performing word embedding and context-semantic encoding on the descriptive text corresponding to the original image to obtain text features;
fusing the image attribute features and the text features to obtain fused features;
extracting overall structural features of the original image;
performing spatial-attention fusion on the overall structural features and the fused features to obtain corrected structural features of the edited region;
performing structural-feature completion of the non-edited region on the corrected structural features of the edited region to obtain corrected overall structural features;
inputting the corrected overall structural features into a generator, so that the generator generates an image matching the descriptive text under the guidance of the fused features.
2. The interactive image editing method of claim 1, wherein said step of extracting the attribute features of the original image to obtain the image attribute features comprises:
inputting the original image into an image attribute encoder, so that the image attribute encoder encodes it with Inception-v3 and outputs the last-layer vector to obtain global attribute features;
taking the global attribute features as input, using a set of multi-layer perceptrons defined according to a hyper-parameter to estimate a Gaussian mixture distribution whose dimensionality corresponds to the input image, and obtaining the image attribute features from this distribution.
3. The interactive image editing method of claim 2, wherein said step of context-semantic word embedding and encoding the descriptive text corresponding to the original image to obtain text features comprises:
mapping the descriptive text corresponding to the original image through a vocabulary to obtain a group of word indices, and embedding them to obtain word vectors of the descriptive text length;
and inputting the word vectors of the descriptive text length into a text encoder, and acquiring the output vector of each time-step node to obtain the text features.
4. The interactive image editing method of claim 3, wherein said step of fusing said image attribute features with text features to obtain fused features comprises:
concatenating the image attribute features with each word vector in the text features along the column direction to obtain concatenated features;
inputting the concatenated features into a Bi-LSTM model, and taking the output of each time-step node of the Bi-LSTM model as the fused features of the corresponding word and the image attribute distribution;
and taking the hidden-layer output vector of the last node of the Bi-LSTM model as the image attribute-text fusion code, and passing the image attribute-text fusion code through a group of multi-layer perceptrons to decouple the parameter vector group corresponding to the fused image attribute distribution.
5. The interactive image editing method of claim 4, wherein the step of inputting the revised overall structural features into a generator to cause the generator to generate an image matching the descriptive text based on a fused feature guide comprises:
transferring the parameter vector group to the currently generated image as variable parameters in the generator structure;
inputting the corrected overall structural features into the generator, processing them through multiple upsampling and convolution blocks, and outputting an image matching the descriptive text.
6. The interactive image editing method of claim 2, wherein before the step of extracting the attribute features of the original image, the method further comprises:
constructing an interactive image editing model from an image attribute encoder, a text encoder, a content encoder, a fuser and a generator;
and training the constructed interactive image editing model in a cross-cycle manner.
7. The interactive image editing method of claim 3, wherein before the step of extracting the attribute features of the original image, the method further comprises:
pre-training the image attribute encoder and the text encoder with the DAMSM algorithm to align their mapping spaces.
8. An interactive image editing apparatus, comprising:
the image attribute feature extraction module is used for extracting attribute features from the original image to obtain image attribute features;
the text feature coding module is used for performing word embedding and context-semantic encoding on the descriptive text corresponding to the original image to obtain text features;
the fusion module is used for fusing the image attribute features and the text features to obtain fused features;
the overall structure extraction module is used for extracting overall structural features of the original image;
the fusion processing module is used for performing spatial-attention fusion on the overall structural features and the fused features to obtain corrected structural features of the edited region;
the structure completion module is used for performing structural-feature completion of the non-edited region on the corrected structural features of the edited region to obtain corrected overall structural features;
and the input module is used for inputting the corrected overall structural features into the generator, so that the generator generates an image matching the descriptive text under the guidance of the fused features.
9. A readable storage medium on which a program is stored, which program, when executed by a processor, carries out the method according to any one of claims 1 to 7.
10. An electronic device comprising a memory, a processor, and a program stored on the memory and executable on the processor, wherein the processor implements the method of any one of claims 1-7 when executing the program.
CN202111008172.XA 2021-08-31 2021-08-31 Interactive image editing method and device, readable storage medium and electronic equipment Active CN113448477B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111008172.XA CN113448477B (en) 2021-08-31 2021-08-31 Interactive image editing method and device, readable storage medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111008172.XA CN113448477B (en) 2021-08-31 2021-08-31 Interactive image editing method and device, readable storage medium and electronic equipment

Publications (2)

Publication Number Publication Date
CN113448477A true CN113448477A (en) 2021-09-28
CN113448477B CN113448477B (en) 2021-11-23

Family

ID=77819123

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111008172.XA Active CN113448477B (en) 2021-08-31 2021-08-31 Interactive image editing method and device, readable storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN113448477B (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115375601A (en) * 2022-10-25 2022-11-22 四川大学 Decoupling expression traditional Chinese painting generation method based on attention mechanism
WO2023060434A1 (en) * 2021-10-12 2023-04-20 中国科学院深圳先进技术研究院 Text-based image editing method, and electronic device
CN116580127A (en) * 2023-07-13 2023-08-11 科大讯飞股份有限公司 Image generation method, device, electronic equipment and computer readable storage medium
CN116630480A (en) * 2023-07-14 2023-08-22 之江实验室 Interactive text-driven image editing method and device and electronic equipment
CN116704079A (en) * 2023-08-03 2023-09-05 腾讯科技(深圳)有限公司 Image generation method, device, equipment and storage medium
CN117726908A (en) * 2024-02-07 2024-03-19 青岛海尔科技有限公司 Training method and device for picture generation model, storage medium and electronic device

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110019812A (en) * 2018-02-27 2019-07-16 中国科学院计算技术研究所 A kind of user is from production content detection algorithm and system
CN110297890A (en) * 2018-03-21 2019-10-01 国际商业机器公司 It is obtained using the image that interactive natural language is talked with
CN110443863A (en) * 2019-07-23 2019-11-12 中国科学院深圳先进技术研究院 Method, electronic equipment and the storage medium of text generation image
CN111445545A (en) * 2020-02-27 2020-07-24 北京大米未来科技有限公司 Text-to-map method, device, storage medium and electronic equipment
CN111652093A (en) * 2020-05-21 2020-09-11 中国工商银行股份有限公司 Text image processing method and device
CN112132150A (en) * 2020-09-15 2020-12-25 上海高德威智能交通系统有限公司 Text string identification method and device and electronic equipment
US20210042474A1 (en) * 2019-03-29 2021-02-11 Beijing Sensetime Technology Development Co., Ltd. Method for text recognition, electronic device and storage medium
CN113158630A (en) * 2021-03-15 2021-07-23 苏州科技大学 Text editing image method, storage medium, electronic device and system

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110019812A (en) * 2018-02-27 2019-07-16 中国科学院计算技术研究所 A kind of user is from production content detection algorithm and system
CN110297890A (en) * 2018-03-21 2019-10-01 国际商业机器公司 It is obtained using the image that interactive natural language is talked with
US20210042474A1 (en) * 2019-03-29 2021-02-11 Beijing Sensetime Technology Development Co., Ltd. Method for text recognition, electronic device and storage medium
CN110443863A (en) * 2019-07-23 2019-11-12 中国科学院深圳先进技术研究院 Method, electronic equipment and the storage medium of text generation image
CN111445545A (en) * 2020-02-27 2020-07-24 北京大米未来科技有限公司 Text-to-map method, device, storage medium and electronic equipment
CN111652093A (en) * 2020-05-21 2020-09-11 中国工商银行股份有限公司 Text image processing method and device
CN112132150A (en) * 2020-09-15 2020-12-25 上海高德威智能交通系统有限公司 Text string identification method and device and electronic equipment
CN113158630A (en) * 2021-03-15 2021-07-23 苏州科技大学 Text editing image method, storage medium, electronic device and system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ANUBHAV KUMAR: "An efficient algorithm for text localization and extraction in complex video text images", 《2013 2ND INTERNATIONAL CONFERENCE ON INFORMATION MANAGEMENT IN THE KNOWLEDGE ECONOMY》 *
MA LONGLONG: "A survey of image captioning methods", 《JOURNAL OF CHINESE INFORMATION PROCESSING》 *

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023060434A1 (en) * 2021-10-12 2023-04-20 中国科学院深圳先进技术研究院 Text-based image editing method, and electronic device
CN115375601A (en) * 2022-10-25 2022-11-22 四川大学 Decoupling expression traditional Chinese painting generation method based on attention mechanism
CN115375601B (en) * 2022-10-25 2023-02-28 四川大学 Decoupling expression traditional Chinese painting generation method based on attention mechanism
CN116580127A (en) * 2023-07-13 2023-08-11 科大讯飞股份有限公司 Image generation method, device, electronic equipment and computer readable storage medium
CN116580127B (en) * 2023-07-13 2023-12-01 科大讯飞股份有限公司 Image generation method, device, electronic equipment and computer readable storage medium
CN116630480A (en) * 2023-07-14 2023-08-22 之江实验室 Interactive text-driven image editing method and device and electronic equipment
CN116630480B (en) * 2023-07-14 2023-09-26 之江实验室 Interactive text-driven image editing method and device and electronic equipment
CN116704079A (en) * 2023-08-03 2023-09-05 腾讯科技(深圳)有限公司 Image generation method, device, equipment and storage medium
CN116704079B (en) * 2023-08-03 2023-09-29 腾讯科技(深圳)有限公司 Image generation method, device, equipment and storage medium
CN117726908A (en) * 2024-02-07 2024-03-19 青岛海尔科技有限公司 Training method and device for picture generation model, storage medium and electronic device
CN117726908B (en) * 2024-02-07 2024-05-24 青岛海尔科技有限公司 Training method and device for picture generation model, storage medium and electronic device

Also Published As

Publication number Publication date
CN113448477B (en) 2021-11-23

Similar Documents

Publication Publication Date Title
CN113448477B (en) Interactive image editing method and device, readable storage medium and electronic equipment
Frolov et al. Adversarial text-to-image synthesis: A review
Li et al. Multimodal foundation models: From specialists to general-purpose assistants
Jiang et al. Transferability in deep learning: A survey
US11657230B2 (en) Referring image segmentation
CN108416065B (en) Hierarchical neural network-based image-sentence description generation system and method
Awais et al. Foundational models defining a new era in vision: A survey and outlook
Sun et al. Efficient multimodal transformer with dual-level feature restoration for robust multimodal sentiment analysis
CN109214006B (en) Natural language reasoning method for image enhanced hierarchical semantic representation
CN113657124A (en) Multi-modal Mongolian Chinese translation method based on circulation common attention Transformer
CN111783457B (en) Semantic visual positioning method and device based on multi-modal graph convolutional network
CN114528898A (en) Scene graph modification based on natural language commands
CN112200664A (en) Repayment prediction method based on ERNIE model and DCNN model
He et al. Dynamic invariant-specific representation fusion network for multimodal sentiment analysis
CN114817564A (en) Attribute extraction method and device and storage medium
Valle Hands-On Generative Adversarial Networks with Keras: Your guide to implementing next-generation generative adversarial networks
CN116933051A (en) Multi-mode emotion recognition method and system for modal missing scene
CN116258147A (en) Multimode comment emotion analysis method and system based on heterogram convolution
CN115858756A (en) Shared emotion man-machine conversation system based on perception emotional tendency
CN115169472A (en) Music matching method and device for multimedia data and computer equipment
Robertson et al. A Self-Adaptive Architecture for Image Understanding
Nasr et al. SemGAN: Text to Image Synthesis from Text Semantics using Attentional Generative Adversarial Networks
Li et al. Lightweight text-driven image editing with disentangled content and attributes
Wang et al. Entity-level text-guided image manipulation
Huang et al. Flexible entity marks and a fine-grained style control for knowledge based natural answer generation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant