CN110706302A - System and method for text synthesis image - Google Patents

System and method for text synthesis image

Info

Publication number
CN110706302A
Authority
CN
China
Prior art keywords
image
features
network
text
attention
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910962728.5A
Other languages
Chinese (zh)
Other versions
CN110706302B (en)
Inventor
王晓茹
蔡雅丽
余志洪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhongshan Yidi Technology Co Ltd
Original Assignee
Zhongshan Yidi Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhongshan Yidi Technology Co Ltd filed Critical Zhongshan Yidi Technology Co Ltd
Priority to CN201910962728.5A priority Critical patent/CN110706302B/en
Publication of CN110706302A publication Critical patent/CN110706302A/en
Application granted granted Critical
Publication of CN110706302B publication Critical patent/CN110706302B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 11/00 2D [Two Dimensional] image generation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses a system and a method for synthesizing images from text. The system comprises: a generation network, used for generating, according to a target text in a sample data set, an initial image that meets preset image feature information, wherein the preset image feature information comprises image features that conform to the semantics of the target text and have a preset global structure; and a discrimination network, used for generating the difference degree between the initial image and a preset image in the sample data set and feeding the difference degree back to the generation network. The generation network is further used for adjusting the control parameters of the generation network according to the difference degree so as to obtain target control parameters, and for generating a target image matched with the target text based on the target control parameters. The invention improves the quality of the synthesized image and meets the real requirements of the user.

Description

System and method for text synthesis image
Technical Field
The present invention relates to the field of image processing technologies, and in particular, to a system and a method for synthesizing an image from text.
Background
The main task of text-to-image synthesis is to generate a clear and realistic image that conforms to the text semantics from a natural language description. It has a wide range of applications and can be used mainly in fields such as cultural relic restoration, security, and artistic creation.
Currently, the mainstream approach to text-to-image synthesis is generative modeling based on the generative adversarial network (GAN). A generative adversarial network includes a generation network and a discrimination network: the generation network fits the sample image distribution by receiving random noise and a text vector, and the discrimination network determines whether an input image is consistent with the text semantics. Although text-to-image synthesis can be achieved with existing generative adversarial networks, there are some drawbacks. For example, there is a huge semantic gap between high-level text semantic concepts and visual information at the pixel level: because the mapping between the text space and the image space is highly sparse, changing a single word may cause the pixels of many sub-regions of the image to change. In addition, under-fitted expression in the generation network leads to poor quality of the synthesized image, with problems such as global structure distortion and edge blurring. Meanwhile, an incomplete text description lacks much potential conditional constraint information, and the visual feature expression capability of the network is limited. As a result, the quality of the synthesized image is poor and cannot meet the real needs of users.
Disclosure of Invention
In view of the above problems, the present invention provides a system and a method for text-to-image synthesis, which achieve the purpose of improving the quality of synthesized images and meeting the real requirements of users.
In order to achieve the purpose, the invention provides the following technical scheme:
A system for synthesizing an image from text, the system comprising:
the generating network is used for generating an initial image meeting preset image characteristic information according to a target text in the sample data set, wherein the preset image characteristic information comprises image characteristics which accord with the semantics of the target text and have a preset global structure;
the discrimination network is used for generating the difference degree between the initial image and the preset image in the sample data set and feeding back the difference degree to the generation network;
and the generation network is also used for adjusting the control parameters of the generation network according to the difference degree so as to obtain target control parameters, and generating a target image matched with the target text based on the target control parameters.
Optionally, the generating network comprises a dual attention module, wherein,
the double-attention module is used for guiding the synthesis of the local semantic details of the initial image and coordinating the global structure of the initial image according to the image characteristics of the initial image and the text characteristic information of the target text, so that the initial image meets the preset image characteristic information.
Optionally, the dual attention module comprises:
the text attention module is used for acquiring the association relationship between the initial image and the target text, guiding the synthesis of the local semantic details of the initial image based on the association relationship, and obtaining text attention features;
the visual attention module is used for acquiring image features of the initial image, modeling the image features on channel and space dimensions, coordinating the global structure of the initial image by using the modeled features and acquiring visual attention features;
and the attention embedding module is used for carrying out feature fusion on the image features of the initial image, the text attention features and the visual attention features to obtain fusion features, and the fusion features are used for realizing multi-dimensional expression of the image features.
Optionally, the text attention module is specifically configured to:
acquiring a word feature vector of a target text and a visual feature vector of the initial image;
converting the word feature vector and the visual feature vector to a target semantic space dimension;
calculating the association relationship between the word feature vector and the visual feature vector, and obtaining the word attention weight according to the association relationship;
obtaining a word context vector for each sub-region of the initial image;
and calculating to obtain a word context matrix based on the word attention weight and the word context vector, and converting the matrix into a feature representation space, wherein the feature representation space is used for guiding the synthesis of local semantic details of the initial image to obtain text attention features.
Optionally, the visual attention module comprises:
the channel attention module is used for learning the global maximum pooling visual feature and the global average pooling visual feature of the channel dimension through a multilayer perceptron network, and performing attention weighting on the visual feature vector on a channel by using a learning result to obtain a first feature map;
the spatial attention module is used for learning the association relation between the feature of each pixel in the initial image and pixels at preset positions in the spatial dimension, coordinating the visual feature expression over space based on the association relation, and obtaining a second feature map that is attention-weighted over space; wherein the first feature map and the second feature map are used to express the visual attention features.
Optionally, the generating network adopts an inverted residual network structure, wherein,
the generation network is further used for reconstructing high-resolution image features from the low-resolution image through the inverted residual network structure and generating a high-resolution image based on the high-resolution image features.
Optionally, the number of the discrimination networks matches the number of resolution categories of the initial images with different resolutions generated by the generation network.
Optionally, the discriminant network is specifically configured to:
extracting vector features of the initial image and the target text;
and discriminating the authenticity of the initial image based on the vector features in a conditionally constrained discrimination mode and/or an unconditionally constrained discrimination mode, so as to obtain the difference degree between the initial image and the preset image.
Optionally, the discriminant network is further configured to:
and judging the fine-grained matching degree between the image with the preset resolution in the generating network and the target text.
A method for synthesizing an image from text, the method comprising:
generating an initial image meeting preset image characteristic information according to a target text in a sample data set, wherein the preset image characteristic information comprises image characteristics which accord with target text semantics and have a preset global structure;
generating a difference degree between the initial image and a preset image in the sample data set;
and adjusting control parameters according to the difference degree to obtain target control parameters, and generating a target image matched with the target text based on the target control parameters.
Compared with the prior art, the present invention provides a system and a method for synthesizing images from text, in which a generation network generates, from a target text in a sample data set, an initial image that satisfies preset image feature information, the preset image feature information comprising image features that conform to the semantics of the target text and have a preset global structure; a discrimination network generates the difference degree between the initial image and a preset image in the sample data set and feeds the difference degree back to the generation network; and the generation network further adjusts its control parameters according to the difference degree to obtain target control parameters, and generates a target image matched with the target text based on the target control parameters. Because the initial image contains image features that conform to the semantics of the target text and have a preset global structure, the association between vision and language is explored to guide the expression of local visual features and to generate local details that conform to the text semantics, while the global structural features improve the image quality, so that the target image synthesized from the target text better meets the real requirements of the user.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.
Fig. 1 is a schematic structural diagram of a system for text-to-image synthesis according to an embodiment of the present invention;
fig. 2 is a schematic structural diagram of a dual-attention module according to an embodiment of the present invention;
fig. 3 is a flowchart illustrating a method for synthesizing an image with a text according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The terms "first" and "second," and the like in the description and claims of the present invention and the above-described drawings are used for distinguishing between different objects and not for describing a particular order. Furthermore, the terms "comprising" and "having," as well as any variations thereof, are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not set forth for a listed step or element but may include steps or elements not listed.
In an embodiment of the present invention, a system for text synthesis of an image is provided, and referring to fig. 1, the system includes: a generating network 101 and a discriminating network 102.
The generating network 101 is used for generating an initial image meeting preset image characteristic information according to a target text in a sample data set, wherein the preset image characteristic information comprises image characteristics which accord with target text semantics and have a preset global structure;
The discrimination network 102 is used for generating the difference degree between the initial image and the preset image in the sample data set and feeding back the difference degree to the generation network;
and the generating network 101 is further configured to adjust the control parameter of the generating network according to the difference degree so as to obtain a target control parameter, and generate a target image matched with the target text based on the target control parameter.
The preset image is the accurate image corresponding to the target text, that is, the preset image accurately reflects the content described by the target text, thereby providing a data basis for the subsequent generation and testing stages of the generative adversarial network.
It should be noted that the text-to-image synthesis process of the present invention applies the generative adversarial network technique, that is, it utilizes the generation network 101 and the discrimination network 102. The difference from the prior art is that the initial image generated by the generation network satisfies the preset image feature information, where the preset image feature information includes image features that conform to the semantics of the target text and have a preset global structure. This function is mainly realized by the dual attention module in the generation network.
The double-attention module is used for guiding the synthesis of the local semantic details of the initial image and coordinating the global structure of the initial image according to the image features of the initial image and the text feature information of the target text, so that the initial image meets the preset image feature information.
Specifically, the dual attention module enables the generation network to generate an image that conforms to the target text semantics and has a preset global structure; by jointly considering the constraints of text semantic drive and global visual semantics, an image with a good overall structure and local details can be generated. The dual attention module guides the expression of local visual features by exploring the association between vision and language and attending to relevant text features, so as to generate local details that conform to the text semantics.
The generative adversarial network of the present application will be explained below. It is an improvement over the stacked generative adversarial network: the coarse network of the first stage generates images at a resolution of 64 x 64, while the fine networks of the second and third stages generate images at resolutions of 128 x 128 and 256 x 256 using the dual attention module and the inverted residual structure. The improved generative adversarial network framework of the present application is composed of multiple generators and multiple discriminators; its infrastructure follows the StackGAN-v2 tree structure and adopts the same input and output configuration as AttnGAN in training and testing, using the multiple generators to iteratively generate coarse-to-fine multi-scale images.
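To make the coarse-to-fine pipeline easier to follow, the sketch below (PyTorch-style Python) shows how multiple generators could be chained; all module names (stage0, refiners, to_rgb) are illustrative placeholders rather than the patented implementation, and the resolutions follow the description above (64, 128, 256).

```python
import torch

def generate_coarse_to_fine(sentence_vec, word_vecs, noise, stage0, refiners, to_rgb):
    """Illustrative sketch of iterative multi-scale generation.

    stage0:   module producing coarse 64x64 feature maps from sentence + noise
    refiners: modules refining features to 128x128 and 256x256 resolution,
              assumed to use the dual attention module and inverted residual blocks
    to_rgb:   per-stage heads that map feature maps to RGB images
    """
    h = stage0(torch.cat([sentence_vec, noise], dim=1))   # coarse features
    images = [to_rgb[0](h)]                               # 64x64 image
    for refiner, head in zip(refiners, to_rgb[1:]):
        h = refiner(h, word_vecs)                         # attention-guided refinement
        images.append(head(h))                            # 128x128, then 256x256 image
    return images                                         # one image per discriminator scale
```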
The discrimination network is used for discriminating the images generated by the generation network, so that, through repeated adversarial training between the generation network and the discrimination network and continuous adjustment of the generation network, the images generated by the generation network can satisfy the content described by the target text; the training process of the generation network is mainly realized by adjusting the model parameters of the generation network.
The generation network adopts an inverted residual network structure, wherein,
the generation network is further used for reconstructing high-resolution image features from the low-resolution image through the inverted residual network structure and generating a high-resolution image based on the high-resolution image features.
The performance of the generation network can be improved by using the inverted residual network structure: by expanding and compressing the dimensions of the feature representation, latent features can be modeled.
Specifically, the generation network generates images stage by stage in a pipelined structure: the first stage generates 64 × 64 low-resolution images with correct colors and a coarse structure through simple convolution and up-sampling operations, while the second and third stages generate 128 × 128 and 256 × 256 high-resolution images respectively by focusing on detail information. An inverted residual structure is introduced into the network, which models latent features by expanding and compressing the dimensions of the feature representation. To improve the intrinsic feature expression capability of the network, the residual structure first uses an expansion layer to expand the dimension of the features from the previous layer (where the features of the previous layer refer to the image features of the initial image), enlarging the feature expression space of each pixel so that more latent expressions can be learned; after a ReLU activation function, a compression layer compresses the feature dimension and removes redundant information, yielding a more representative feature expression; finally, the input and output features are fused through a shortcut connection, which also prevents gradients from vanishing. To reduce model parameters and speed up training, the gated linear unit (GLU) in the original network structure is replaced with a simple ReLU activation function; meanwhile, to stabilize GAN training, spectral normalization is introduced into both the generation network and the discrimination network.
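As a concrete illustration of the expand-activate-compress structure just described, the sketch below (PyTorch) is a minimal inverted residual block; the expansion ratio of 4 and the use of 1 × 1 convolutions are assumptions made for illustration only.

```python
import torch
import torch.nn as nn

class InvertedResidual(nn.Module):
    """Sketch of an expand -> ReLU -> compress block with a shortcut connection.

    The expansion ratio (4) and the 1x1 convolutions are illustrative choices;
    the description only states that features are dimension-expanded, activated
    with ReLU, compressed, and fused with the input through a shortcut.
    """
    def __init__(self, channels: int, expansion: int = 4):
        super().__init__()
        hidden = channels * expansion
        self.expand = nn.Conv2d(channels, hidden, kernel_size=1)    # dimension expansion
        self.act = nn.ReLU(inplace=True)
        self.compress = nn.Conv2d(hidden, channels, kernel_size=1)  # remove redundant information

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        out = self.compress(self.act(self.expand(h)))
        return out + h  # shortcut connection keeps gradients flowing
```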
In another embodiment of the present invention, a dual attention module comprises:
the text attention module is used for acquiring the association relationship between the initial image and the target text, guiding the synthesis of the local semantic details of the initial image based on the association relationship, and obtaining text attention features;
the visual attention module is used for acquiring image features of the initial image, modeling the image features on channel and space dimensions, coordinating the global structure of the initial image by using the modeled features and acquiring visual attention features;
and the attention embedding module is used for carrying out feature fusion on the image features of the initial image, the text attention features and the visual attention features to obtain fusion features, and the fusion features are used for realizing multi-dimensional expression of the image features.
The generative adversarial network further comprises a dual attention module, which enhances the ability of the generation network to express local details and the global structure by jointly attending to relevant text features and long-range image features, thereby improving the overall quality and local details of the synthesized image. Fig. 2 is a schematic structural diagram of a dual attention module according to an embodiment of the present invention, which includes a text attention module 201, a visual attention module 202, and an attention embedding module 203.
Specifically, in order to generate fine-grained visual detail information that conforms to the text semantics, the present application adopts a text attention module to obtain the mapping relationship between the word features and the visual features of the image produced by the previous layer of the network. First, the word feature vector e and the visual feature vector h are converted into a consistent semantic space dimension (i.e., the target semantic space dimension) through a transformation network. Then, a dot product between the word feature vectors and the visual feature vectors followed by softmax normalization computes the association between them and yields the word attention weights. To obtain the word context vector c_i for a single sub-region of the image, a dot product is taken between the obtained word attention weights and the word feature vectors e. Finally, the resulting word context matrix {c_0, c_1, ..., c_i, ...} is warped into the feature representation space of a two-dimensional image, and the result is input to the attention embedding module for feature fusion.
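A minimal sketch of this word-level attention step is given below (PyTorch); the tensor shapes and the single 1 × 1 convolution used as the transformation network are assumptions, but the flow (project, dot product, softmax, weighted sum, reshape) follows the description above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextAttention(nn.Module):
    """Sketch of word-level attention (shapes are assumptions).

    e: word features, shape (B, D, T)  - T words of dimension D
    h: visual features, shape (B, C, H, W)
    Returns a word-context tensor with the same spatial layout as h.
    """
    def __init__(self, word_dim: int, visual_dim: int):
        super().__init__()
        # map word features into the visual feature space (the "consistent
        # semantic space dimension" mentioned in the text)
        self.proj = nn.Conv1d(word_dim, visual_dim, kernel_size=1)

    def forward(self, e: torch.Tensor, h: torch.Tensor) -> torch.Tensor:
        b, c, height, width = h.shape
        n = height * width
        h_flat = h.view(b, c, n)                              # (B, C, N) sub-region features
        e_proj = self.proj(e)                                 # (B, C, T)
        # dot product between every sub-region and every word
        scores = torch.bmm(h_flat.transpose(1, 2), e_proj)   # (B, N, T)
        attn = F.softmax(scores, dim=-1)                      # word attention weights
        # word context vector c_i for each sub-region
        context = torch.bmm(e_proj, attn.transpose(1, 2))    # (B, C, N)
        return context.view(b, c, height, width)             # warp back to a 2D feature map
```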
The visual attention module aims to model the image features in the channel and spatial dimensions to obtain global structural features and achieve a better representation of the visual features. By learning to use global visual information, important features are selectively enhanced and unimportant features are suppressed. The module learns which layers of features are more important along the channel dimension, and learns the association between each pixel feature and distant locations along the spatial dimension, coordinating feature expression over space. In the embodiment of the invention, a visual attention model in channel-space serial form is preferably used: for the feature h given by the previous hidden layer of the network, the channel attention module first produces a feature map h' that is attention-weighted on the channels, and the spatial attention module is then applied to h'; the visual attention module can thus be expressed as the composition of the channel attention module followed by the spatial attention module.
the specific visual attention module may in turn include a channel attention module and a spatial attention module. In order to extract image feature layers with significant information, a channel attention module is proposed to mine the relationship of image features between channels. The module fuses the average and maximum values of each layer feature simultaneously, taking into account the overall background and texture information.
For the input image features h, the spatial features are first compressed and aggregated into two one-dimensional vectors using global average pooling and global max pooling, where each value summarizes the spatial features of one channel; a shared multilayer perceptron then produces the channel attention vector, which is multiplied with the image features h to obtain features h' in which the different channel layers are attention-weighted. The channel attention module may be defined as follows:

h' = σ( MLP(AvgPool(h)) + MLP(MaxPool(h)) ) ⊗ h

where σ denotes the sigmoid activation function, MLP denotes the shared FC-ReLU-FC multilayer perceptron structure, and ⊗ denotes element-by-element multiplication.
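A compact sketch of this channel attention computation (PyTorch) follows; the reduction ratio of 16 inside the shared perceptron is an assumed hyper-parameter not stated in the description.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Sketch of the channel attention step described above."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        # shared FC-ReLU-FC multilayer perceptron
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = h.shape
        avg = torch.mean(h, dim=(2, 3))                        # global average pooling -> (B, C)
        mx = torch.amax(h, dim=(2, 3))                         # global max pooling -> (B, C)
        weight = torch.sigmoid(self.mlp(avg) + self.mlp(mx))   # channel attention vector
        return h * weight.view(b, c, 1, 1)                     # attention-weighted features h'
```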
Context is crucial for generating the detailed information of high-resolution images; the goal here is to capture global long-range correlations, not only local ones. Owing to the local-correlation principle of convolutional neural networks, a shallow network can hardly capture global correlations, while simply increasing the depth to enlarge the receptive field introduces substantial information redundancy and still makes it difficult to build correlations between positions. Therefore, in an embodiment of the present invention, a spatial attention module is proposed to enhance the local feature representation by encoding rich contextual information. Based on the principle of local similarity in images, the feature map h' produced by the channel attention module is first down-sampled using average pooling and max pooling to obtain effective features of local blocks, so that spatial attention can be computed more efficiently; then a 1 × 1 convolutional network compresses the channel dimension and a transformation layer performs a feature-space transformation to obtain θ, φ and g, where N denotes the number of feature points. The association between positions is calculated by dot product to obtain the spatial attention map β:

β(j, i) = exp(θ_i · φ_j) / Σ_k exp(θ_k · φ_j),  k = 1, ..., N

where β(j, i) means that, when the j-th region is synthesized, the higher the degree of association between the i-th region and the j-th region, the more attention the model pays to the i-th region. Note that the synthesis of every sub-region takes all other sub-regions into account.

Matrix multiplication of the image features g with the transpose of the spatial attention map β yields the spatially attention-weighted features, which are warped back into a two-dimensional image representation and multiplied by a scale factor η, so that the network can adaptively learn how strongly the spatially attention-weighted features influence the image feature expression:

o = η · reshape(g βᵀ)

Finally, the feature map o, which represents the local area blocks, is up-sampled to obtain spatially attention-weighted image features whose dimensions are consistent with those of the original image features.
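The following sketch (PyTorch) mirrors the spatial attention computation above; the down-sampling by average pooling only, the channel reduction ratio, and initializing the scale factor η to zero are assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialAttention(nn.Module):
    """Illustrative non-local spatial attention over down-sampled local blocks."""
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        inner = channels // reduction
        self.theta = nn.Conv2d(channels, inner, kernel_size=1)  # 1x1 conv: compress channels
        self.phi = nn.Conv2d(channels, inner, kernel_size=1)
        self.g = nn.Conv2d(channels, inner, kernel_size=1)
        self.out = nn.Conv2d(inner, channels, kernel_size=1)
        self.eta = nn.Parameter(torch.zeros(1))                 # learnable scale factor eta

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        b, c, height, width = h.shape
        hd = F.avg_pool2d(h, kernel_size=2)        # down-sample to local blocks (assumed avg pooling)
        theta = self.theta(hd).flatten(2)          # (B, C', N)
        phi = self.phi(hd).flatten(2)              # (B, C', N)
        g = self.g(hd).flatten(2)                  # (B, C', N)
        # scores[i, j] = theta_i . phi_j; softmax over i gives beta(j, i)
        beta = F.softmax(torch.bmm(theta.transpose(1, 2), phi), dim=1)   # (B, N, N)
        o = torch.bmm(g, beta)                     # g multiplied by the transpose of beta
        o = o.view(b, -1, hd.size(2), hd.size(3))  # warp back to a 2D feature map
        o = self.eta * self.out(o)                 # scale by eta
        return F.interpolate(o, size=(height, width))   # up-sample to the original resolution
```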
The attention embedding module in the dual attention module performs efficient feature fusion on the multi-path features. To prevent gradients from vanishing and to let the network learn features in a residual manner, the input features h are added to the output of each attention module through skip connections, so that refinement learning is performed while the low-resolution image features are preserved. Finally, the obtained feature channels are concatenated along the channel dimension and input to the next layer of the network.
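A minimal sketch of this fusion step (PyTorch-style) is shown below; whether the input features themselves are also concatenated is not specified, so the exact layout here is an assumption.

```python
import torch

def attention_embedding(h, text_attn, vis_attn):
    """Skip-connect the input features onto each attention branch, then
    concatenate the resulting feature maps along the channel dimension."""
    fused_text = h + text_attn        # residual-style connection for the text branch
    fused_vis = h + vis_attn          # residual-style connection for the visual branch
    return torch.cat([fused_text, fused_vis], dim=1)
```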
The number of discrimination networks provided by the invention can match the number of initial images generated by the generation network. For example, if the generation network generates three images of different sizes, three discrimination networks can be used to discriminate the images at the three scales. By extracting features from the input initial images and the corresponding target text vectors, overall real/fake discrimination of the images is performed with conditional and unconditional constraints; at the same time, deep multi-modal similarity regularization is applied to the image generated in the last stage, and the fine-grained matching degree between the image and the text is calculated.
Conditionally constrained discrimination refers to discrimination given both the target text and the image, while unconditionally constrained discrimination refers to discrimination given the image alone.
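As an illustration of how conditionally and unconditionally constrained discrimination can coexist in one discriminator, the sketch below (PyTorch) computes both a real/fake score from the image features alone and a score from the image features concatenated with the sentence vector; all layer shapes are assumptions.

```python
import torch
import torch.nn as nn

class JointDiscriminatorHead(nn.Module):
    """Sketch of conditional + unconditional discrimination heads."""
    def __init__(self, feat_ch: int, sent_dim: int):
        super().__init__()
        self.uncond = nn.Conv2d(feat_ch, 1, kernel_size=4)           # real/fake from image alone
        self.cond = nn.Conv2d(feat_ch + sent_dim, 1, kernel_size=4)  # real/fake given the sentence

    def forward(self, img_feat: torch.Tensor, sent_vec: torch.Tensor):
        # img_feat: (B, C, 4, 4) down-sampled image features; sent_vec: (B, D)
        b, d = sent_vec.shape
        s = sent_vec.view(b, d, 1, 1).expand(-1, -1, img_feat.size(2), img_feat.size(3))
        uncond_logit = self.uncond(img_feat).view(b, -1)
        cond_logit = self.cond(torch.cat([img_feat, s], dim=1)).view(b, -1)
        return uncond_logit, cond_logit
```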
The discrimination network is further operable to: and judging the fine-grained matching degree between the image with the preset resolution in the generating network and the target text. Specifically, the depth multi-modal similarity regularization can be performed on the image generated at the last stage in the generation network, and the fine-grained matching degree between the image and the text can be calculated.
The technical features of the present invention will be described below with reference to a specific example of a text-synthesized image.
In the training stage:
Step 1: the text description is encoded with a bidirectional long short-term memory network to obtain a sentence vector and word vectors.
Step 2: in the first stage of the generation network, the sentence vector is spliced with a noise vector, passed through a fully connected layer and reshaped to obtain image feature vectors, and four rounds of up-sampling and convolution produce 64 × 64 resolution image features containing the basic shapes and colors. In the second and third stages, the image features obtained in the previous stage and the word vectors are first input to the dual attention module, which combines text-semantic-driven and global visual semantic constraints to generate an image with a good overall structure and local details: the text attention module explores the association between vision and language and attends to relevant text features, guiding the expression of local visual features and generating local details that conform to the text semantics; the visual attention module models the image features in the channel and spatial dimensions to obtain global structural features, improving the overall image quality by extracting useful features on the channels and coordinating feature expression over space; and the attention embedding module fuses the multi-path features based on the idea of residual learning, improving the realism of the image. The attention-weighted and fused image features are then passed through two inverted residual structures and an up-sampling plus convolution structure to reconstruct high-resolution image features at 128 × 128 and 256 × 256 resolution respectively.
Step 3: the image features obtained in the three stages are each compressed into an RGB three-channel image through three convolution networks.
Step 4: the three discrimination networks receive the images (generated images and real images) and sentence vectors at three different scales (64 × 64, 128 × 128 and 256 × 256) and perform overall real/fake discrimination of the images with conditional and unconditional constraints; meanwhile, deep multi-modal similarity regularization is applied to the image generated in the last stage, and the fine-grained matching degree between the image and the text is calculated.
Step 5: the generation network and the discrimination networks play against each other through alternating training until the discrimination networks cannot distinguish whether a generated image is real. To stabilize GAN training, spectral normalization is introduced here, i.e., spectral normalization is added after each convolutional layer.
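Spectral normalization is mentioned in step 5 above; the snippet below (PyTorch) shows the usual way it would be attached to a convolutional layer when building the generation and discrimination networks. This is a generic usage sketch, not the patent's exact configuration.

```python
import torch.nn as nn
from torch.nn.utils import spectral_norm

def conv3x3_sn(in_ch: int, out_ch: int) -> nn.Module:
    """3x3 convolution wrapped with the spectral-norm reparameterisation,
    used to stabilise GAN training as described above."""
    return spectral_norm(nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1))
```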
In the testing stage, 64 × 64, 128 × 128 and 256 × 256 resolution images can be obtained through steps 1, 2 and 3, with the 256 × 256 resolution image as the final result.
Based on the above embodiments, the dual-attention generative adversarial network provided by the present application realizes text-to-image synthesis, synthesizing high-resolution images with a high-quality overall structure and realistic local semantic details through progressive global visual reconstruction and local semantic constraints. The dual attention module combines text-semantic-driven and global visual semantic constraints to generate images with a good overall structure and local details. The text attention module explores the association between vision and language and attends to relevant text features, guiding the expression of local visual features and generating local details that conform to the text semantics; the visual attention module models the image features in the channel and spatial dimensions to obtain global structural features, improving the overall image quality by extracting useful features on the channels and coordinating feature expression over space; and the attention embedding module fuses the multi-path features based on the idea of residual learning, improving the realism of the image. In the embodiments of the present application, the inverted residual structure models the non-linear relationships between hidden layers to improve the feature expression capability of the convolutional neural network, and spectral normalization is applied to stabilize the training of the generative adversarial network.
There is also provided in an embodiment of the present application a method of text synthesis of an image, see fig. 3, the method comprising:
s301, generating an initial image meeting preset image characteristic information according to the target text in the sample data set;
the preset image feature information comprises image features which accord with the semantics of a target text and have a preset global structure;
s302, generating the difference degree between the initial image and a preset image in the sample data set;
s303, adjusting control parameters according to the difference degree to obtain target control parameters, and generating a target image matched with the target text based on the target control parameters.
In the method for synthesizing an image from text provided in this embodiment, the initial image includes image features that conform to the semantics of the target text and have a preset global structure; the association between vision and language is explored to guide the expression of local visual features and to generate local details that conform to the target text semantics, and the global structural features improve the quality of the image, so that the target image synthesized from the target text can better meet the real requirements of the user.
On the basis of the above embodiment, the method further includes:
and guiding the synthesis of the local semantic details of the initial image and coordinating the global structure of the initial image according to the image characteristics of the initial image and the text characteristic information of the target text, so that the initial image meets the preset image characteristic information.
On the basis of the above embodiment, the guiding, according to the image feature of the initial image and the text feature information of the target text, the synthesis of the local semantic details of the initial image and the coordination of the global structure of the initial image includes:
acquiring an association relationship between an initial image and a target text, and guiding the synthesis of local semantic details of the initial image based on the association relationship to obtain text attention features;
acquiring image characteristics of the initial image, modeling the image characteristics on channel and space dimensions, and coordinating the global structure of the initial image by using the modeled characteristics to obtain visual attention characteristics;
and performing feature fusion on the image features of the initial image, the text attention features and the visual attention features to obtain fusion features, wherein the fusion features are used for realizing multi-dimensional expression of the image features.
On the basis of the above embodiment, the obtaining of the association relationship between the initial image and the target text, and the guiding of the synthesis of the local semantic details of the initial image based on the association relationship, to obtain the text attention feature includes:
acquiring a word feature vector of a target text and a visual feature vector of the initial image;
converting the word feature vector and the visual feature vector to a target semantic space dimension;
calculating the association relationship between the word feature vector and the visual feature vector, and obtaining the word attention weight according to the association relationship;
obtaining a word context vector for each sub-region of the initial image;
and calculating to obtain a word context matrix based on the word attention weight and the word context vector, and converting the matrix into a feature representation space, wherein the feature representation space is used for guiding the synthesis of local semantic details of the initial image to obtain text attention features.
On the basis of the above embodiment, the acquiring image features of the initial image, modeling the image features in channel and spatial dimensions, and coordinating a global structure of the initial image by using the modeled features to obtain visual attention features includes:
learning global maximum pooling visual features and global average pooling visual features of channel dimensions through a multilayer perceptron network, and performing attention weighting on the visual feature vectors on channels by using learning results to obtain a first feature map;
learning the association relation between the feature of each pixel in the initial image and the pixel at the preset position on the basis of the spatial dimension, coordinating the visual feature expression on the space on the basis of the association relation, and obtaining a second feature map weighted by attention on the space; wherein the first feature map and the second feature map are used to express visual attention features.
On the basis of the above embodiment, the method further includes:
and reconstructing high-resolution image features from the low-resolution image through the inverted residual network structure, and generating a high-resolution image based on the high-resolution image features.
On the basis of the foregoing embodiment, the generating a difference degree between the initial image and a preset image in the sample data set includes:
extracting vector features of the initial image and the target text;
and discriminating the authenticity of the initial image based on the vector features in a conditionally constrained discrimination mode and/or an unconditionally constrained discrimination mode, so as to obtain the difference degree between the initial image and the preset image.
On the basis of the above embodiment, the method further includes:
and judging the fine-grained matching degree between the image with the preset resolution and the target text.
The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. The device disclosed by the embodiment corresponds to the method disclosed by the embodiment, so that the description is simple, and the relevant points can be referred to the method part for description.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A system for synthesizing an image from text, the system comprising:
the generating network is used for generating an initial image meeting preset image characteristic information according to a target text in the sample data set, wherein the preset image characteristic information comprises image characteristics which accord with the semantics of the target text and have a preset global structure;
the discrimination network is used for generating the difference degree between the initial image and the preset image in the sample data set and feeding back the difference degree to the generating network;
and the generation network is also used for adjusting the control parameters of the generation network according to the difference degree so as to obtain target control parameters, and generating a target image matched with the target text based on the target control parameters.
2. The system of claim 1, wherein the generation network comprises a dual attention module, wherein,
the double-attention module is used for guiding the synthesis of the local semantic details of the initial image and coordinating the global structure of the initial image according to the image characteristics of the initial image and the text characteristic information of the target text, so that the initial image meets the preset image characteristic information.
3. The system of claim 2, wherein the dual attention module comprises:
the text attention module is used for acquiring the association relationship between the initial image and the target text, guiding the synthesis of the local semantic details of the initial image based on the association relationship, and obtaining text attention features;
the visual attention module is used for acquiring image features of the initial image, modeling the image features on channel and space dimensions, coordinating the global structure of the initial image by using the modeled features and acquiring visual attention features;
and the attention embedding module is used for carrying out feature fusion on the image features of the initial image, the text attention features and the visual attention features to obtain fusion features, and the fusion features are used for realizing multi-dimensional expression of the image features.
4. The system of claim 3, wherein the text attention module is specifically configured to:
acquiring a word feature vector of a target text and a visual feature vector of the initial image;
converting the word feature vector and the visual feature vector to a target semantic space dimension;
calculating the association relationship between the word feature vector and the visual feature vector, and obtaining the word attention weight according to the association relationship;
obtaining a word context vector for each sub-region of the initial image;
and calculating to obtain a word context matrix based on the word attention weight and the word context vector, and converting the matrix into a feature representation space, wherein the feature representation space is used for guiding the synthesis of local semantic details of the initial image to obtain text attention features.
5. The system of claim 3, wherein the visual attention module comprises:
the channel attention module is used for learning the global maximum pooling visual feature and the global average pooling visual feature of the channel dimension through a multilayer perceptron network, and performing attention weighting on the visual feature vector on a channel by using a learning result to obtain a first feature map;
the spatial attention module is used for learning the association relation between the feature of each pixel in the initial image and pixels at preset positions in the spatial dimension, coordinating the visual feature expression over space based on the association relation, and obtaining a second feature map that is attention-weighted over space; wherein the first feature map and the second feature map are used to express the visual attention features.
6. The system of claim 1, wherein the generation network employs an inverted residual network structure, wherein,
the generation network is further used for reconstructing high-resolution image features from the low-resolution image through the inverted residual network structure and generating a high-resolution image based on the high-resolution image features.
7. The system of claim 6, wherein the number of discriminating networks matches the number of resolution categories of initial images generated by the generating network having different resolutions.
8. The system of claim 7, wherein the discrimination network is specifically configured to:
extracting vector features of the initial image and the target text;
and discriminating the authenticity of the initial image based on the vector features in a conditionally constrained discrimination mode and/or an unconditionally constrained discrimination mode, so as to obtain the difference degree between the initial image and the preset image.
9. The system of claim 8, wherein the discrimination network is further configured to:
and judging the fine-grained matching degree between the image with the preset resolution in the generating network and the target text.
10. A method for synthesizing an image from text, the method comprising:
generating an initial image meeting preset image characteristic information according to a target text in a sample data set, wherein the preset image characteristic information comprises image characteristics which accord with target text semantics and have a preset global structure;
generating a difference degree between the initial image and a preset image in the sample data set;
and adjusting control parameters according to the difference degree to obtain target control parameters, and generating a target image matched with the target text based on the target control parameters.
CN201910962728.5A 2019-10-11 2019-10-11 System and method for synthesizing images by text Active CN110706302B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910962728.5A CN110706302B (en) 2019-10-11 2019-10-11 System and method for synthesizing images by text

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910962728.5A CN110706302B (en) 2019-10-11 2019-10-11 System and method for synthesizing images by text

Publications (2)

Publication Number Publication Date
CN110706302A true CN110706302A (en) 2020-01-17
CN110706302B CN110706302B (en) 2023-05-19

Family

ID=69199264

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910962728.5A Active CN110706302B (en) 2019-10-11 2019-10-11 System and method for synthesizing images by text

Country Status (1)

Country Link
CN (1) CN110706302B (en)

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111461323A (en) * 2020-03-13 2020-07-28 中国科学技术大学 Image identification method and device
CN111918071A (en) * 2020-06-29 2020-11-10 北京大学 Data compression method, device, equipment and storage medium
CN112101330A (en) * 2020-11-20 2020-12-18 北京沃东天骏信息技术有限公司 Image processing method, image processing apparatus, electronic device, and storage medium
CN112348911A (en) * 2020-10-28 2021-02-09 山东师范大学 Semantic constraint-based method and system for generating fine-grained image by stacking texts
CN112364946A (en) * 2021-01-13 2021-02-12 长沙海信智能系统研究院有限公司 Training method of image determination model, and method, device and equipment for image determination
CN112597278A (en) * 2020-12-25 2021-04-02 北京知因智慧科技有限公司 Semantic information fusion method and device, electronic equipment and storage medium
CN112669215A (en) * 2021-01-05 2021-04-16 北京金山云网络技术有限公司 Training text image generation model, text image generation method and device
CN112668655A (en) * 2020-12-30 2021-04-16 中山大学 Method for detecting out-of-distribution image based on generation of confrontation network uncertainty attention enhancement
CN113177562A (en) * 2021-04-29 2021-07-27 京东数字科技控股股份有限公司 Vector determination method and device based on self-attention mechanism fusion context information
CN113421314A (en) * 2021-06-09 2021-09-21 湖南大学 Multi-scale bimodal text image generation method based on generation countermeasure network
CN114078172A (en) * 2020-08-19 2022-02-22 四川大学 Text image generation method for progressively generating confrontation network based on resolution
CN114091662A (en) * 2021-11-26 2022-02-25 广东伊莱特电器有限公司 Text image generation method and device and electronic equipment
CN114140368A (en) * 2021-12-03 2022-03-04 天津大学 Multi-modal medical image synthesis method based on generating type countermeasure network
CN114359435A (en) * 2022-03-17 2022-04-15 阿里巴巴(中国)有限公司 Image generation method, model generation method and equipment
CN114863450A (en) * 2022-05-19 2022-08-05 北京百度网讯科技有限公司 Image processing method, image processing device, electronic equipment and storage medium
CN115512368A (en) * 2022-08-22 2022-12-23 华中农业大学 Cross-modal semantic image generation model and method

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107608943A (en) * 2017-09-08 2018-01-19 中国石油大学(华东) Merge visual attention and the image method for generating captions and system of semantic notice
US10043109B1 (en) * 2017-01-23 2018-08-07 A9.Com, Inc. Attribute similarity-based search
EP3404586A1 (en) * 2017-05-18 2018-11-21 INTEL Corporation Novelty detection using discriminator of generative adversarial network
CN109783657A (en) * 2019-01-07 2019-05-21 北京大学深圳研究生院 Multistep based on limited text space is from attention cross-media retrieval method and system
CN110111399A (en) * 2019-04-24 2019-08-09 上海理工大学 A kind of image text generation method of view-based access control model attention

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10043109B1 (en) * 2017-01-23 2018-08-07 A9.Com, Inc. Attribute similarity-based search
EP3404586A1 (en) * 2017-05-18 2018-11-21 INTEL Corporation Novelty detection using discriminator of generative adversarial network
CN107608943A (en) * 2017-09-08 2018-01-19 中国石油大学(华东) Merge visual attention and the image method for generating captions and system of semantic notice
CN109783657A (en) * 2019-01-07 2019-05-21 北京大学深圳研究生院 Multistep based on limited text space is from attention cross-media retrieval method and system
CN110111399A (en) * 2019-04-24 2019-08-09 上海理工大学 A kind of image text generation method of view-based access control model attention

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
TAO XU et al.: "AttnGAN: Fine-Grained Text to Image Generation with Attentional Generative Adversarial Networks", 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition *

Cited By (26)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111461323A (en) * 2020-03-13 2020-07-28 中国科学技术大学 Image identification method and device
CN111461323B (en) * 2020-03-13 2022-07-29 中国科学技术大学 Image identification method and device
CN111918071A (en) * 2020-06-29 2020-11-10 北京大学 Data compression method, device, equipment and storage medium
CN114078172A (en) * 2020-08-19 2022-02-22 四川大学 Text image generation method for progressively generating confrontation network based on resolution
CN114078172B (en) * 2020-08-19 2023-04-07 四川大学 Text image generation method for progressively generating confrontation network based on resolution
CN112348911A (en) * 2020-10-28 2021-02-09 山东师范大学 Semantic constraint-based method and system for generating fine-grained image by stacking texts
CN112348911B (en) * 2020-10-28 2023-04-18 山东师范大学 Semantic constraint-based method and system for generating fine-grained image by stacking texts
CN112101330A (en) * 2020-11-20 2020-12-18 北京沃东天骏信息技术有限公司 Image processing method, image processing apparatus, electronic device, and storage medium
CN112101330B (en) * 2020-11-20 2021-04-30 北京沃东天骏信息技术有限公司 Image processing method, image processing apparatus, electronic device, and storage medium
CN112597278A (en) * 2020-12-25 2021-04-02 北京知因智慧科技有限公司 Semantic information fusion method and device, electronic equipment and storage medium
CN112668655B (en) * 2020-12-30 2023-08-29 中山大学 Out-of-distribution image detection method based on generating attention enhancement against network uncertainty
CN112668655A (en) * 2020-12-30 2021-04-16 中山大学 Method for detecting out-of-distribution image based on generation of confrontation network uncertainty attention enhancement
CN112669215A (en) * 2021-01-05 2021-04-16 北京金山云网络技术有限公司 Training text image generation model, text image generation method and device
CN112364946B (en) * 2021-01-13 2021-05-28 长沙海信智能系统研究院有限公司 Training method of image determination model, and method, device and equipment for image determination
CN112364946A (en) * 2021-01-13 2021-02-12 长沙海信智能系统研究院有限公司 Training method of image determination model, and method, device and equipment for image determination
CN113177562B (en) * 2021-04-29 2024-02-06 京东科技控股股份有限公司 Vector determination method and device for merging context information based on self-attention mechanism
CN113177562A (en) * 2021-04-29 2021-07-27 京东数字科技控股股份有限公司 Vector determination method and device based on self-attention mechanism fusion context information
CN113421314A (en) * 2021-06-09 2021-09-21 湖南大学 Multi-scale bimodal text image generation method based on generation countermeasure network
CN114091662A (en) * 2021-11-26 2022-02-25 广东伊莱特电器有限公司 Text image generation method and device and electronic equipment
CN114091662B (en) * 2021-11-26 2024-05-14 广东伊莱特生活电器有限公司 Text image generation method and device and electronic equipment
CN114140368A (en) * 2021-12-03 2022-03-04 天津大学 Multi-modal medical image synthesis method based on generating type countermeasure network
CN114140368B (en) * 2021-12-03 2024-04-23 天津大学 Multi-mode medical image synthesis method based on generation type countermeasure network
CN114359435A (en) * 2022-03-17 2022-04-15 阿里巴巴(中国)有限公司 Image generation method, model generation method and equipment
CN114863450A (en) * 2022-05-19 2022-08-05 北京百度网讯科技有限公司 Image processing method, image processing device, electronic equipment and storage medium
CN115512368A (en) * 2022-08-22 2022-12-23 华中农业大学 Cross-modal semantic image generation model and method
CN115512368B (en) * 2022-08-22 2024-05-10 华中农业大学 Cross-modal semantic generation image model and method

Also Published As

Publication number Publication date
CN110706302B (en) 2023-05-19

Similar Documents

Publication Publication Date Title
CN110706302B (en) System and method for synthesizing images by text
CN110490946B (en) Text image generation method based on cross-modal similarity and antagonism network generation
CN111047548B (en) Attitude transformation data processing method and device, computer equipment and storage medium
CN110555458A (en) Multi-band image feature level fusion method for generating countermeasure network based on attention mechanism
US20220230276A1 (en) Generative Adversarial Networks with Temporal and Spatial Discriminators for Efficient Video Generation
CN111160343A (en) Off-line mathematical formula symbol identification method based on Self-Attention
CN113901894A (en) Video generation method, device, server and storage medium
CN111325660B (en) Remote sensing image style conversion method based on text data
CN113361251A (en) Text image generation method and system based on multi-stage generation countermeasure network
CN113792641B (en) High-resolution lightweight human body posture estimation method combined with multispectral attention mechanism
CN115222998B (en) Image classification method
CN115953582B (en) Image semantic segmentation method and system
CN117597703A (en) Multi-scale converter for image analysis
CN117475216A (en) Hyperspectral and laser radar data fusion classification method based on AGLT network
CN116912367B (en) Method and system for generating image based on lightweight dynamic refinement text
CN111339734B (en) Method for generating image based on text
CN115512368B (en) Cross-modal semantic generation image model and method
CN114240811A (en) Method for generating new image based on multiple images
CN114299218A (en) System for searching real human face based on hand-drawing sketch
CN114494387A (en) Data set network generation model and fog map generation method
Wang et al. An approach based on Transformer and deformable convolution for realistic handwriting samples generation
CN115115667A (en) Accurate target tracking method based on target transformation regression network
CN115909045B (en) Two-stage landslide map feature intelligent recognition method based on contrast learning
Ma et al. MHGAN: A Multi-Headed Generative Adversarial Network for Underwater Sonar Image Super-Resolution
Kaddoura Real-World Applications

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant