CN110706302A - System and method for text synthesis image - Google Patents

System and method for text synthesis image

Info

Publication number
CN110706302A
Authority
CN
China
Prior art keywords
image
features
network
text
attention
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910962728.5A
Other languages
Chinese (zh)
Other versions
CN110706302B (en)
Inventor
王晓茹
蔡雅丽
余志洪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhongshan Yidi Technology Co Ltd
Original Assignee
Zhongshan Yidi Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhongshan Yidi Technology Co Ltd filed Critical Zhongshan Yidi Technology Co Ltd
Priority to CN201910962728.5A priority Critical patent/CN110706302B/en
Publication of CN110706302A publication Critical patent/CN110706302A/en
Application granted granted Critical
Publication of CN110706302B publication Critical patent/CN110706302B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 11/00 2D [Two Dimensional] image generation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses a system and a method for synthesizing images from text. The system comprises: a generation network, used for generating, according to a target text in a sample data set, an initial image that meets preset image feature information, wherein the preset image feature information comprises image features that conform to the semantics of the target text and have a preset global structure; and a discrimination network, used for generating the difference degree between the initial image and a preset image in the sample data set and feeding the difference degree back to the generation network. The generation network is further used for adjusting the control parameters of the generation network according to the difference degree so as to obtain target control parameters, and for generating a target image matched with the target text based on the target control parameters. The invention improves the quality of the synthesized image and meets the real requirements of the user.

Description

System and method for text synthesis image
Technical Field
The present invention relates to the field of image processing technologies, and in particular, to a system and a method for synthesizing an image from text.
Background
The main task of text-to-image synthesis is to generate a clear and realistic image that conforms to the text semantics from a natural language description. It has a wide range of applications and can be used mainly in fields such as cultural relic restoration, security, and artistic creation.
Currently, the mainstream approach to text-to-image synthesis is generative modeling based on the generative adversarial network (GAN). A generative adversarial network includes a generation network and a discrimination network: the generation network fits the sample image distribution by receiving random noise and a text vector, and the discrimination network determines whether an input image is consistent with the text semantics. Although text-to-image synthesis can be achieved with existing generative adversarial networks, there are some drawbacks. For example, there is a huge semantic gap between high-level text semantic concepts and visual information at the pixel level: because the mapping between the text space and the image space is highly sparse, changing a single word may cause the pixels of many sub-regions of the image to change. In addition, under-fitted expression in the generation network leads to poor quality of the synthesized image, with problems such as global structure distortion and edge blurring. Meanwhile, an incomplete text description lacks much potential conditional constraint information, and the visual feature expression capability of the network is limited. As a result, the quality of the synthesized image is poor and cannot meet the real needs of users.
Disclosure of Invention
In view of the above problems, the present invention provides a system and a method for text-to-image synthesis, which achieve the purpose of improving the quality of synthesized images and meeting the real requirements of users.
In order to achieve the purpose, the invention provides the following technical scheme:
A system for synthesizing an image from text, the system comprising:
the generating network is used for generating an initial image meeting preset image characteristic information according to a target text in the sample data set, wherein the preset image characteristic information comprises image characteristics which accord with the semantics of the target text and have a preset global structure;
the discrimination network is used for generating the difference degree between the initial image and the preset image in the sample data set and feeding back the difference degree to the generation network;
and the generation network is also used for adjusting the control parameters of the generation network according to the difference degree so as to obtain target control parameters, and generating a target image matched with the target text based on the target control parameters.
Optionally, the generating network comprises a dual attention module, wherein,
the double-attention module is used for guiding the synthesis of the local semantic details of the initial image and coordinating the global structure of the initial image according to the image characteristics of the initial image and the text characteristic information of the target text, so that the initial image meets the preset image characteristic information.
Optionally, the dual attention module comprises:
the text attention module is used for acquiring the association relationship between the initial image and the target text, guiding the synthesis of the local semantic details of the initial image based on the association relationship, and obtaining text attention features;
the visual attention module is used for acquiring image features of the initial image, modeling the image features on channel and space dimensions, coordinating the global structure of the initial image by using the modeled features and acquiring visual attention features;
and the attention embedding module is used for carrying out feature fusion on the image features of the initial image, the text attention features and the visual attention features to obtain fusion features, and the fusion features are used for realizing multi-dimensional expression of the image features.
Optionally, the text attention module is specifically configured to:
acquiring a word feature vector of a target text and a visual feature vector of the initial image;
converting the word feature vector and the visual feature vector to a target semantic space dimension;
calculating the association relationship between the word feature vector and the visual feature vector, and obtaining the word attention weight according to the association relationship;
obtaining a word context vector for each sub-region of the initial image;
and calculating to obtain a word context matrix based on the word attention weight and the word context vector, and converting the matrix into a feature representation space, wherein the feature representation space is used for guiding the synthesis of local semantic details of the initial image to obtain text attention features.
Optionally, the visual attention module comprises:
the channel attention module is used for learning the global maximum pooling visual feature and the global average pooling visual feature of the channel dimension through a multilayer perceptron network, and performing attention weighting on the visual feature vector on a channel by using a learning result to obtain a first feature map;
the spatial attention module is used for learning the association relation between the feature of each pixel in the initial image and pixels at preset positions in the spatial dimension, coordinating the visual feature expression over space based on the association relation, and obtaining a second feature map that is attention-weighted over space; wherein the first feature map and the second feature map are used to express the visual attention features.
Optionally, the generating network adopts an inverted residual network structure, wherein,
the generation network is further used for reconstructing high-resolution image features from the low-resolution image through the inverted residual network structure and generating a high-resolution image based on the high-resolution image features.
Optionally, the number of the discrimination networks matches the number of resolution categories of the initial images with different resolutions generated by the generation network.
Optionally, the discriminant network is specifically configured to:
extracting vector features of the initial image and the target text;
and discriminating the authenticity of the initial image based on the vector features in a conditionally constrained discrimination mode and/or an unconditionally constrained discrimination mode, so as to obtain the difference degree between the initial image and the preset image.
Optionally, the discriminant network is further configured to:
and judging the fine-grained matching degree between the image with the preset resolution in the generating network and the target text.
A method for synthesizing an image from text, the method comprising:
generating an initial image meeting preset image characteristic information according to a target text in a sample data set, wherein the preset image characteristic information comprises image characteristics which accord with target text semantics and have a preset global structure;
generating a difference degree between the initial image and a preset image in the sample data set;
and adjusting control parameters according to the difference degree to obtain target control parameters, and generating a target image matched with the target text based on the target control parameters.
Compared with the prior art, the present invention provides a system and a method for synthesizing images from text, in which a generation network generates, from a target text in a sample data set, an initial image that satisfies preset image feature information, the preset image feature information comprising image features that conform to the semantics of the target text and have a preset global structure; a discrimination network generates the difference degree between the initial image and a preset image in the sample data set and feeds the difference degree back to the generation network; and the generation network further adjusts its control parameters according to the difference degree to obtain target control parameters, and generates a target image matched with the target text based on the target control parameters. Because the initial image contains image features that conform to the semantics of the target text and have a preset global structure, the association between vision and language is explored to guide the expression of local visual features and to generate local details that conform to the text semantics, while the global structural features improve the image quality, so that the target image synthesized from the target text better meets the real requirements of the user.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.
Fig. 1 is a schematic structural diagram of a system for text-to-image synthesis according to an embodiment of the present invention;
fig. 2 is a schematic structural diagram of a dual-attention module according to an embodiment of the present invention;
fig. 3 is a flowchart illustrating a method for synthesizing an image with a text according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The terms "first" and "second," and the like in the description and claims of the present invention and the above-described drawings are used for distinguishing between different objects and not for describing a particular order. Furthermore, the terms "comprising" and "having," as well as any variations thereof, are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not set forth for a listed step or element but may include steps or elements not listed.
In an embodiment of the present invention, a system for text synthesis of an image is provided, and referring to fig. 1, the system includes: a generating network 101 and a discriminating network 102.
The generating network 101 is used for generating an initial image meeting preset image characteristic information according to a target text in a sample data set, wherein the preset image characteristic information comprises image characteristics which accord with target text semantics and have a preset global structure;
The discrimination network 102 is used for generating the difference degree between the initial image and the preset image in the sample data set and feeding back the difference degree to the generation network;
and the generating network 101 is further configured to adjust the control parameter of the generating network according to the difference degree so as to obtain a target control parameter, and generate a target image matched with the target text based on the target control parameter.
The preset image is the accurate image corresponding to the target text, that is, the preset image accurately reflects the content described by the target text, thereby providing a data basis for the subsequent generation and testing stages of the generative adversarial network.
It should be noted that the text-to-image synthesis process of the present invention applies the generative adversarial network technique, that is, it utilizes the generation network 101 and the discrimination network 102. The difference from the prior art is that the initial image generated by the generation network satisfies the preset image feature information, where the preset image feature information includes image features that conform to the semantics of the target text and have a preset global structure. This function is mainly realized by the dual attention module in the generation network.
The double-attention module is used for guiding the synthesis of the local semantic details of the initial image and coordinating the global structure of the initial image according to the image features of the initial image and the text feature information of the target text, so that the initial image meets the preset image feature information.
Specifically, the dual attention module enables the generation network to generate an image that conforms to the target text semantics and has a preset global structure; by jointly considering the constraints of text semantic drive and global visual semantics, an image with a good overall structure and local details can be generated. The dual attention module guides the expression of local visual features by exploring the association between vision and language and attending to relevant text features, so as to generate local details that conform to the text semantics.
The generative adversarial network of the present application will be explained below. It is an improvement over the stacked generative adversarial network: the coarse network of the first stage generates images at a resolution of 64 x 64, while the fine networks of the second and third stages generate images at resolutions of 128 x 128 and 256 x 256 using the dual attention module and the inverted residual structure. The improved generative adversarial network framework of the present application is composed of multiple generators and multiple discriminators; its infrastructure follows the StackGAN-v2 tree structure and adopts the same input and output configuration as AttnGAN in training and testing, using the multiple generators to iteratively generate coarse-to-fine multi-scale images.
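To make the coarse-to-fine pipeline easier to follow, the sketch below (PyTorch-style Python) shows how multiple generators could be chained; all module names (stage0, refiners, to_rgb) are illustrative placeholders rather than the patented implementation, and the resolutions follow the description above (64, 128, 256).

```python
import torch

def generate_coarse_to_fine(sentence_vec, word_vecs, noise, stage0, refiners, to_rgb):
    """Illustrative sketch of iterative multi-scale generation.

    stage0:   module producing coarse 64x64 feature maps from sentence + noise
    refiners: modules refining features to 128x128 and 256x256 resolution,
              assumed to use the dual attention module and inverted residual blocks
    to_rgb:   per-stage heads that map feature maps to RGB images
    """
    h = stage0(torch.cat([sentence_vec, noise], dim=1))   # coarse features
    images = [to_rgb[0](h)]                               # 64x64 image
    for refiner, head in zip(refiners, to_rgb[1:]):
        h = refiner(h, word_vecs)                         # attention-guided refinement
        images.append(head(h))                            # 128x128, then 256x256 image
    return images                                         # one image per discriminator scale
```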
The discrimination network is used for discriminating the images generated by the generation network, so that, through repeated adversarial training between the generation network and the discrimination network and continuous adjustment of the generation network, the images generated by the generation network can satisfy the content described by the target text; the training process of the generation network is mainly realized by adjusting the model parameters of the generation network.
The generation network adopts an inverted residual network structure, wherein,
the generation network is further used for reconstructing high-resolution image features from the low-resolution image through the inverted residual network structure and generating a high-resolution image based on the high-resolution image features.
The performance of the generation network can be improved by using the inverted residual network structure: by expanding and compressing the dimensions of the feature representation, latent features can be modeled.
Specifically, the generation network generates images stage by stage in a pipelined structure: the first stage generates 64 × 64 low-resolution images with correct colors and a coarse structure through simple convolution and up-sampling operations, while the second and third stages generate 128 × 128 and 256 × 256 high-resolution images respectively by focusing on detail information. An inverted residual structure is introduced into the network, which models latent features by expanding and compressing the dimensions of the feature representation. To improve the intrinsic feature expression capability of the network, the residual structure first uses an expansion layer to expand the dimension of the features from the previous layer (where the features of the previous layer refer to the image features of the initial image), enlarging the feature expression space of each pixel so that more latent expressions can be learned; after a ReLU activation function, a compression layer compresses the feature dimension and removes redundant information, yielding a more representative feature expression; finally, the input and output features are fused through a shortcut connection, which also prevents gradients from vanishing. To reduce model parameters and speed up training, the gated linear unit (GLU) in the original network structure is replaced with a simple ReLU activation function; meanwhile, to stabilize GAN training, spectral normalization is introduced into both the generation network and the discrimination network.
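As a concrete illustration of the expand-activate-compress structure just described, the sketch below (PyTorch) is a minimal inverted residual block; the expansion ratio of 4 and the use of 1 × 1 convolutions are assumptions made for illustration only.

```python
import torch
import torch.nn as nn

class InvertedResidual(nn.Module):
    """Sketch of an expand -> ReLU -> compress block with a shortcut connection.

    The expansion ratio (4) and the 1x1 convolutions are illustrative choices;
    the description only states that features are dimension-expanded, activated
    with ReLU, compressed, and fused with the input through a shortcut.
    """
    def __init__(self, channels: int, expansion: int = 4):
        super().__init__()
        hidden = channels * expansion
        self.expand = nn.Conv2d(channels, hidden, kernel_size=1)    # dimension expansion
        self.act = nn.ReLU(inplace=True)
        self.compress = nn.Conv2d(hidden, channels, kernel_size=1)  # remove redundant information

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        out = self.compress(self.act(self.expand(h)))
        return out + h  # shortcut connection keeps gradients flowing
```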
In another embodiment of the present invention, a dual attention module comprises:
the text attention module is used for acquiring the association relationship between the initial image and the target text, guiding the synthesis of the local semantic details of the initial image based on the association relationship, and obtaining text attention features;
the visual attention module is used for acquiring image features of the initial image, modeling the image features on channel and space dimensions, coordinating the global structure of the initial image by using the modeled features and acquiring visual attention features;
and the attention embedding module is used for carrying out feature fusion on the image features of the initial image, the text attention features and the visual attention features to obtain fusion features, and the fusion features are used for realizing multi-dimensional expression of the image features.
The generative adversarial network further comprises a dual attention module, which enhances the ability of the generation network to express local details and the global structure by jointly attending to relevant text features and long-range image features, thereby improving the overall quality and local details of the synthesized image. Fig. 2 is a schematic structural diagram of a dual attention module according to an embodiment of the present invention, which includes a text attention module 201, a visual attention module 202, and an attention embedding module 203.
Specifically, in order to generate fine-grained visual detail information that conforms to the text semantics, the present application adopts a text attention module to obtain the mapping relationship between the word features and the visual features of the image produced by the previous layer of the network. First, the word feature vector e and the visual feature vector h are converted into a consistent semantic space dimension (i.e., the target semantic space dimension) through a transformation network. Then, a dot product between the word feature vectors and the visual feature vectors followed by softmax normalization computes the association between them and yields the word attention weights. To obtain the word context vector c_i for a single sub-region of the image, a dot product is taken between the obtained word attention weights and the word feature vectors e. Finally, the resulting word context matrix {c_0, c_1, ..., c_i, ...} is warped into the feature representation space of a two-dimensional image, and the result is input to the attention embedding module for feature fusion.
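A minimal sketch of this word-level attention step is given below (PyTorch); the tensor shapes and the single 1 × 1 convolution used as the transformation network are assumptions, but the flow (project, dot product, softmax, weighted sum, reshape) follows the description above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextAttention(nn.Module):
    """Sketch of word-level attention (shapes are assumptions).

    e: word features, shape (B, D, T)  - T words of dimension D
    h: visual features, shape (B, C, H, W)
    Returns a word-context tensor with the same spatial layout as h.
    """
    def __init__(self, word_dim: int, visual_dim: int):
        super().__init__()
        # map word features into the visual feature space (the "consistent
        # semantic space dimension" mentioned in the text)
        self.proj = nn.Conv1d(word_dim, visual_dim, kernel_size=1)

    def forward(self, e: torch.Tensor, h: torch.Tensor) -> torch.Tensor:
        b, c, height, width = h.shape
        n = height * width
        h_flat = h.view(b, c, n)                              # (B, C, N) sub-region features
        e_proj = self.proj(e)                                 # (B, C, T)
        # dot product between every sub-region and every word
        scores = torch.bmm(h_flat.transpose(1, 2), e_proj)   # (B, N, T)
        attn = F.softmax(scores, dim=-1)                      # word attention weights
        # word context vector c_i for each sub-region
        context = torch.bmm(e_proj, attn.transpose(1, 2))    # (B, C, N)
        return context.view(b, c, height, width)             # warp back to a 2D feature map
```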
The visual attention module aims to model the image features in the channel and spatial dimensions to obtain global structural features and achieve a better representation of the visual features. By learning to use global visual information, important features are selectively enhanced and unimportant features are suppressed. The module learns which layers of features are more important along the channel dimension, and learns the association between each pixel feature and distant locations along the spatial dimension, coordinating feature expression over space. In the embodiment of the invention, a visual attention model in channel-space serial form is preferably used: for the feature h given by the previous hidden layer of the network, the channel attention module first produces a feature map h' that is attention-weighted on the channels, and the spatial attention module is then applied to h'; the visual attention module can thus be expressed as the composition of the channel attention module followed by the spatial attention module.
the specific visual attention module may in turn include a channel attention module and a spatial attention module. In order to extract image feature layers with significant information, a channel attention module is proposed to mine the relationship of image features between channels. The module fuses the average and maximum values of each layer feature simultaneously, taking into account the overall background and texture information.
For the input image features h, the spatial features are first compressed and aggregated into two one-dimensional vectors using global average pooling and global max pooling, where each value summarizes the spatial features of one channel; a shared multilayer perceptron then produces the channel attention vector, which is multiplied with the image features h to obtain features h' in which the different channel layers are attention-weighted. The channel attention module may be defined as follows:

h' = σ( MLP(AvgPool(h)) + MLP(MaxPool(h)) ) ⊗ h

where σ denotes the sigmoid activation function, MLP denotes the shared FC-ReLU-FC multilayer perceptron structure, and ⊗ denotes element-by-element multiplication.
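A compact sketch of this channel attention computation (PyTorch) follows; the reduction ratio of 16 inside the shared perceptron is an assumed hyper-parameter not stated in the description.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Sketch of the channel attention step described above."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        # shared FC-ReLU-FC multilayer perceptron
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = h.shape
        avg = torch.mean(h, dim=(2, 3))                        # global average pooling -> (B, C)
        mx = torch.amax(h, dim=(2, 3))                         # global max pooling -> (B, C)
        weight = torch.sigmoid(self.mlp(avg) + self.mlp(mx))   # channel attention vector
        return h * weight.view(b, c, 1, 1)                     # attention-weighted features h'
```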
Context is crucial for generating the detailed information of high-resolution images; the goal here is to capture global long-range correlations, not only local ones. Owing to the local-correlation principle of convolutional neural networks, a shallow network can hardly capture global correlations, while simply increasing the depth to enlarge the receptive field introduces substantial information redundancy and still makes it difficult to build correlations between positions. Therefore, in an embodiment of the present invention, a spatial attention module is proposed to enhance the local feature representation by encoding rich contextual information. Based on the principle of local similarity in images, the feature map h' produced by the channel attention module is first down-sampled using average pooling and max pooling to obtain effective features of local blocks, so that spatial attention can be computed more efficiently; then a 1 × 1 convolutional network compresses the channel dimension and a transformation layer performs a feature-space transformation to obtain θ, φ and g, where N denotes the number of feature points. The association between positions is calculated by dot product to obtain the spatial attention map β:

β(j, i) = exp(θ_i · φ_j) / Σ_k exp(θ_k · φ_j),  k = 1, ..., N

where β(j, i) means that, when the j-th region is synthesized, the higher the degree of association between the i-th region and the j-th region, the more attention the model pays to the i-th region. Note that the synthesis of every sub-region takes all other sub-regions into account.

Matrix multiplication of the image features g with the transpose of the spatial attention map β yields the spatially attention-weighted features, which are warped back into a two-dimensional image representation and multiplied by a scale factor η, so that the network can adaptively learn how strongly the spatially attention-weighted features influence the image feature expression:

o = η · reshape(g βᵀ)

Finally, the feature map o, which represents the local area blocks, is up-sampled to obtain spatially attention-weighted image features whose dimensions are consistent with those of the original image features.
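The following sketch (PyTorch) mirrors the spatial attention computation above; the down-sampling by average pooling only, the channel reduction ratio, and initializing the scale factor η to zero are assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialAttention(nn.Module):
    """Illustrative non-local spatial attention over down-sampled local blocks."""
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        inner = channels // reduction
        self.theta = nn.Conv2d(channels, inner, kernel_size=1)  # 1x1 conv: compress channels
        self.phi = nn.Conv2d(channels, inner, kernel_size=1)
        self.g = nn.Conv2d(channels, inner, kernel_size=1)
        self.out = nn.Conv2d(inner, channels, kernel_size=1)
        self.eta = nn.Parameter(torch.zeros(1))                 # learnable scale factor eta

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        b, c, height, width = h.shape
        hd = F.avg_pool2d(h, kernel_size=2)        # down-sample to local blocks (assumed avg pooling)
        theta = self.theta(hd).flatten(2)          # (B, C', N)
        phi = self.phi(hd).flatten(2)              # (B, C', N)
        g = self.g(hd).flatten(2)                  # (B, C', N)
        # scores[i, j] = theta_i . phi_j; softmax over i gives beta(j, i)
        beta = F.softmax(torch.bmm(theta.transpose(1, 2), phi), dim=1)   # (B, N, N)
        o = torch.bmm(g, beta)                     # g multiplied by the transpose of beta
        o = o.view(b, -1, hd.size(2), hd.size(3))  # warp back to a 2D feature map
        o = self.eta * self.out(o)                 # scale by eta
        return F.interpolate(o, size=(height, width))   # up-sample to the original resolution
```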
The attention embedding module in the dual attention module performs efficient feature fusion on the multi-path features. To prevent gradients from vanishing and to let the network learn features in a residual manner, the input features h are added to the output of each attention module through skip connections, so that refinement learning is performed while the low-resolution image features are preserved. Finally, the obtained feature channels are concatenated along the channel dimension and input to the next layer of the network.
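A minimal sketch of this fusion step (PyTorch-style) is shown below; whether the input features themselves are also concatenated is not specified, so the exact layout here is an assumption.

```python
import torch

def attention_embedding(h, text_attn, vis_attn):
    """Skip-connect the input features onto each attention branch, then
    concatenate the resulting feature maps along the channel dimension."""
    fused_text = h + text_attn        # residual-style connection for the text branch
    fused_vis = h + vis_attn          # residual-style connection for the visual branch
    return torch.cat([fused_text, fused_vis], dim=1)
```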
The number of discrimination networks provided by the invention can match the number of initial images generated by the generation network. For example, if the generation network generates three images of different sizes, three discrimination networks can be used to discriminate the images at the three scales. By extracting features from the input initial images and the corresponding target text vectors, overall real/fake discrimination of the images is performed with conditional and unconditional constraints; at the same time, deep multi-modal similarity regularization is applied to the image generated in the last stage, and the fine-grained matching degree between the image and the text is calculated.
Conditionally constrained discrimination refers to discrimination given both the target text and the image, while unconditionally constrained discrimination refers to discrimination given the image alone.
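As an illustration of how conditionally and unconditionally constrained discrimination can coexist in one discriminator, the sketch below (PyTorch) computes both a real/fake score from the image features alone and a score from the image features concatenated with the sentence vector; all layer shapes are assumptions.

```python
import torch
import torch.nn as nn

class JointDiscriminatorHead(nn.Module):
    """Sketch of conditional + unconditional discrimination heads."""
    def __init__(self, feat_ch: int, sent_dim: int):
        super().__init__()
        self.uncond = nn.Conv2d(feat_ch, 1, kernel_size=4)           # real/fake from image alone
        self.cond = nn.Conv2d(feat_ch + sent_dim, 1, kernel_size=4)  # real/fake given the sentence

    def forward(self, img_feat: torch.Tensor, sent_vec: torch.Tensor):
        # img_feat: (B, C, 4, 4) down-sampled image features; sent_vec: (B, D)
        b, d = sent_vec.shape
        s = sent_vec.view(b, d, 1, 1).expand(-1, -1, img_feat.size(2), img_feat.size(3))
        uncond_logit = self.uncond(img_feat).view(b, -1)
        cond_logit = self.cond(torch.cat([img_feat, s], dim=1)).view(b, -1)
        return uncond_logit, cond_logit
```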
The discrimination network is further operable to: and judging the fine-grained matching degree between the image with the preset resolution in the generating network and the target text. Specifically, the depth multi-modal similarity regularization can be performed on the image generated at the last stage in the generation network, and the fine-grained matching degree between the image and the text can be calculated.
The technical features of the present invention will be described below with reference to a specific example of a text-synthesized image.
In the training stage:
Step 1: the text description is encoded with a bidirectional long short-term memory network to obtain a sentence vector and word vectors.
Step 2: in the first stage of the generation network, the sentence vector is spliced with a noise vector, passed through a fully connected layer and reshaped to obtain image feature vectors, and four rounds of up-sampling and convolution produce 64 × 64 resolution image features containing the basic shapes and colors. In the second and third stages, the image features obtained in the previous stage and the word vectors are first input to the dual attention module, which combines text-semantic-driven and global visual semantic constraints to generate an image with a good overall structure and local details: the text attention module explores the association between vision and language and attends to relevant text features, guiding the expression of local visual features and generating local details that conform to the text semantics; the visual attention module models the image features in the channel and spatial dimensions to obtain global structural features, improving the overall image quality by extracting useful features on the channels and coordinating feature expression over space; and the attention embedding module fuses the multi-path features based on the idea of residual learning, improving the realism of the image. The attention-weighted and fused image features are then passed through two inverted residual structures and an up-sampling plus convolution structure to reconstruct high-resolution image features at 128 × 128 and 256 × 256 resolution respectively.
Step 3: the image features obtained in the three stages are each compressed into an RGB three-channel image through three convolution networks.
Step 4: the three discrimination networks receive the images (generated images and real images) and sentence vectors at three different scales (64 × 64, 128 × 128 and 256 × 256) and perform overall real/fake discrimination of the images with conditional and unconditional constraints; meanwhile, deep multi-modal similarity regularization is applied to the image generated in the last stage, and the fine-grained matching degree between the image and the text is calculated.
Step 5: the generation network and the discrimination networks play against each other through alternating training until the discrimination networks cannot distinguish whether a generated image is real. To stabilize GAN training, spectral normalization is introduced here, i.e., spectral normalization is added after each convolutional layer.
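Spectral normalization is mentioned in step 5 above; the snippet below (PyTorch) shows the usual way it would be attached to a convolutional layer when building the generation and discrimination networks. This is a generic usage sketch, not the patent's exact configuration.

```python
import torch.nn as nn
from torch.nn.utils import spectral_norm

def conv3x3_sn(in_ch: int, out_ch: int) -> nn.Module:
    """3x3 convolution wrapped with the spectral-norm reparameterisation,
    used to stabilise GAN training as described above."""
    return spectral_norm(nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1))
```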
In the testing stage, 64 × 64, 128 × 128 and 256 × 256 resolution images can be obtained through steps 1, 2 and 3, with the 256 × 256 resolution image as the final result.
Based on the above embodiments, the dual-attention generative adversarial network provided by the present application realizes text-to-image synthesis, synthesizing high-resolution images with a high-quality overall structure and realistic local semantic details through progressive global visual reconstruction and local semantic constraints. The dual attention module combines text-semantic-driven and global visual semantic constraints to generate images with a good overall structure and local details. The text attention module explores the association between vision and language and attends to relevant text features, guiding the expression of local visual features and generating local details that conform to the text semantics; the visual attention module models the image features in the channel and spatial dimensions to obtain global structural features, improving the overall image quality by extracting useful features on the channels and coordinating feature expression over space; and the attention embedding module fuses the multi-path features based on the idea of residual learning, improving the realism of the image. In the embodiments of the present application, the inverted residual structure models the non-linear relationships between hidden layers to improve the feature expression capability of the convolutional neural network, and spectral normalization is applied to stabilize the training of the generative adversarial network.
There is also provided in an embodiment of the present application a method of text synthesis of an image, see fig. 3, the method comprising:
s301, generating an initial image meeting preset image characteristic information according to the target text in the sample data set;
the preset image feature information comprises image features which accord with the semantics of a target text and have a preset global structure;
s302, generating the difference degree between the initial image and a preset image in the sample data set;
s303, adjusting control parameters according to the difference degree to obtain target control parameters, and generating a target image matched with the target text based on the target control parameters.
In the method for synthesizing an image from text provided in this embodiment, the initial image includes image features that conform to the semantics of the target text and have a preset global structure; the association between vision and language is explored to guide the expression of local visual features and to generate local details that conform to the target text semantics, and the global structural features improve the quality of the image, so that the target image synthesized from the target text can better meet the real requirements of the user.
On the basis of the above embodiment, the method further includes:
and guiding the synthesis of the local semantic details of the initial image and coordinating the global structure of the initial image according to the image characteristics of the initial image and the text characteristic information of the target text, so that the initial image meets the preset image characteristic information.
On the basis of the above embodiment, the guiding, according to the image feature of the initial image and the text feature information of the target text, the synthesis of the local semantic details of the initial image and the coordination of the global structure of the initial image includes:
acquiring an association relationship between an initial image and a target text, and guiding the synthesis of local semantic details of the initial image based on the association relationship to obtain text attention features;
acquiring image characteristics of the initial image, modeling the image characteristics on channel and space dimensions, and coordinating the global structure of the initial image by using the modeled characteristics to obtain visual attention characteristics;
and performing feature fusion on the image features of the initial image, the text attention features and the visual attention features to obtain fusion features, wherein the fusion features are used for realizing multi-dimensional expression of the image features.
On the basis of the above embodiment, the obtaining of the association relationship between the initial image and the target text, and the guiding of the synthesis of the local semantic details of the initial image based on the association relationship, to obtain the text attention feature includes:
acquiring a word feature vector of a target text and a visual feature vector of the initial image;
converting the word feature vector and the visual feature vector to a target semantic space dimension;
calculating the association relationship between the word feature vector and the visual feature vector, and obtaining the word attention weight according to the association relationship;
obtaining a word context vector for each sub-region of the initial image;
and calculating to obtain a word context matrix based on the word attention weight and the word context vector, and converting the matrix into a feature representation space, wherein the feature representation space is used for guiding the synthesis of local semantic details of the initial image to obtain text attention features.
On the basis of the above embodiment, the acquiring image features of the initial image, modeling the image features in channel and spatial dimensions, and coordinating a global structure of the initial image by using the modeled features to obtain visual attention features includes:
learning global maximum pooling visual features and global average pooling visual features of channel dimensions through a multilayer perceptron network, and performing attention weighting on the visual feature vectors on channels by using learning results to obtain a first feature map;
learning the association relation between the feature of each pixel in the initial image and the pixel at the preset position on the basis of the spatial dimension, coordinating the visual feature expression on the space on the basis of the association relation, and obtaining a second feature map weighted by attention on the space; wherein the first feature map and the second feature map are used to express visual attention features.
On the basis of the above embodiment, the method further includes:
and reconstructing high-resolution image features from the low-resolution image through the inverted residual network structure, and generating a high-resolution image based on the high-resolution image features.
On the basis of the foregoing embodiment, the generating a difference degree between the initial image and a preset image in the sample data set includes:
extracting vector features of the initial image and the target text;
and discriminating the authenticity of the initial image based on the vector features in a conditionally constrained discrimination mode and/or an unconditionally constrained discrimination mode, so as to obtain the difference degree between the initial image and the preset image.
On the basis of the above embodiment, the method further includes:
and judging the fine-grained matching degree between the image with the preset resolution and the target text.
The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. The device disclosed by the embodiment corresponds to the method disclosed by the embodiment, so that the description is simple, and the relevant points can be referred to the method part for description.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A system for synthesizing an image from text, the system comprising:
the generating network is used for generating an initial image meeting preset image characteristic information according to a target text in the sample data set, wherein the preset image characteristic information comprises image characteristics which accord with the semantics of the target text and have a preset global structure;
the discrimination network is used for generating the difference degree between the initial image and the preset image in the sample data set and feeding back the difference degree to the generating network;
and the generation network is also used for adjusting the control parameters of the generation network according to the difference degree so as to obtain target control parameters, and generating a target image matched with the target text based on the target control parameters.
2. The system of claim 1, wherein the generation network comprises a dual attention module, wherein,
the double-attention module is used for guiding the synthesis of the local semantic details of the initial image and coordinating the global structure of the initial image according to the image characteristics of the initial image and the text characteristic information of the target text, so that the initial image meets the preset image characteristic information.
3. The system of claim 2, wherein the dual attention module comprises:
the text attention module is used for acquiring the association relationship between the initial image and the target text, guiding the synthesis of the local semantic details of the initial image based on the association relationship, and obtaining text attention features;
the visual attention module is used for acquiring image features of the initial image, modeling the image features on channel and space dimensions, coordinating the global structure of the initial image by using the modeled features and acquiring visual attention features;
and the attention embedding module is used for carrying out feature fusion on the image features of the initial image, the text attention features and the visual attention features to obtain fusion features, and the fusion features are used for realizing multi-dimensional expression of the image features.
4. The system of claim 3, wherein the text attention module is specifically configured to:
acquiring a word feature vector of a target text and a visual feature vector of the initial image;
converting the word feature vector and the visual feature vector to a target semantic space dimension;
calculating the association relationship between the word feature vector and the visual feature vector, and obtaining the word attention weight according to the association relationship;
obtaining a word context vector for each sub-region of the initial image;
and calculating to obtain a word context matrix based on the word attention weight and the word context vector, and converting the matrix into a feature representation space, wherein the feature representation space is used for guiding the synthesis of local semantic details of the initial image to obtain text attention features.
5. The system of claim 3, wherein the visual attention module comprises:
the channel attention module is used for learning the global maximum pooling visual feature and the global average pooling visual feature of the channel dimension through a multilayer perceptron network, and performing attention weighting on the visual feature vector on a channel by using a learning result to obtain a first feature map;
the spatial attention module is used for learning the association relation between the feature of each pixel in the initial image and pixels at preset positions in the spatial dimension, coordinating the visual feature expression over space based on the association relation, and obtaining a second feature map that is attention-weighted over space; wherein the first feature map and the second feature map are used to express the visual attention features.
6. The system of claim 1, wherein the generation network employs an inverted residual network structure, wherein,
the generation network is further used for reconstructing high-resolution image features from the low-resolution image through the inverted residual network structure and generating a high-resolution image based on the high-resolution image features.
7. The system of claim 6, wherein the number of discriminating networks matches the number of resolution categories of initial images generated by the generating network having different resolutions.
8. The system of claim 7, wherein the discrimination network is specifically configured to:
extracting vector features of the initial image and the target text;
and discriminating the authenticity of the initial image based on the vector features in a conditionally constrained discrimination mode and/or an unconditionally constrained discrimination mode, so as to obtain the difference degree between the initial image and the preset image.
9. The system of claim 8, wherein the discrimination network is further configured to:
and judging the fine-grained matching degree between the image with the preset resolution in the generating network and the target text.
10. A method for synthesizing an image from text, the method comprising:
generating an initial image meeting preset image characteristic information according to a target text in a sample data set, wherein the preset image characteristic information comprises image characteristics which accord with target text semantics and have a preset global structure;
generating a difference degree between the initial image and a preset image in the sample data set;
and adjusting control parameters according to the difference degree to obtain target control parameters, and generating a target image matched with the target text based on the target control parameters.
CN201910962728.5A 2019-10-11 2019-10-11 System and method for synthesizing images by text Active CN110706302B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910962728.5A CN110706302B (en) 2019-10-11 2019-10-11 System and method for synthesizing images by text

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910962728.5A CN110706302B (en) 2019-10-11 2019-10-11 System and method for synthesizing images by text

Publications (2)

Publication Number Publication Date
CN110706302A true CN110706302A (en) 2020-01-17
CN110706302B CN110706302B (en) 2023-05-19

Family

ID=69199264

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910962728.5A Active CN110706302B (en) 2019-10-11 2019-10-11 System and method for synthesizing images by text

Country Status (1)

Country Link
CN (1) CN110706302B (en)

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111461323A (en) * 2020-03-13 2020-07-28 中国科学技术大学 Image identification method and device
CN111918071A (en) * 2020-06-29 2020-11-10 北京大学 Data compression method, device, equipment and storage medium
CN112101330A (en) * 2020-11-20 2020-12-18 北京沃东天骏信息技术有限公司 Image processing method, image processing apparatus, electronic device, and storage medium
CN112348911A (en) * 2020-10-28 2021-02-09 山东师范大学 Semantic constraint-based method and system for generating fine-grained image by stacking texts
CN112364946A (en) * 2021-01-13 2021-02-12 长沙海信智能系统研究院有限公司 Training method of image determination model, and method, device and equipment for image determination
CN112597278A (en) * 2020-12-25 2021-04-02 北京知因智慧科技有限公司 Semantic information fusion method and device, electronic equipment and storage medium
CN112669215A (en) * 2021-01-05 2021-04-16 北京金山云网络技术有限公司 Training text image generation model, text image generation method and device
CN112668655A (en) * 2020-12-30 2021-04-16 中山大学 Method for detecting out-of-distribution image based on generation of confrontation network uncertainty attention enhancement
CN113177562A (en) * 2021-04-29 2021-07-27 京东数字科技控股股份有限公司 Vector determination method and device based on self-attention mechanism fusion context information
CN113421314A (en) * 2021-06-09 2021-09-21 湖南大学 Multi-scale bimodal text image generation method based on generation countermeasure network
CN114078172A (en) * 2020-08-19 2022-02-22 四川大学 Text image generation method for progressively generating confrontation network based on resolution
CN114091662A (en) * 2021-11-26 2022-02-25 广东伊莱特电器有限公司 Text image generation method and device and electronic equipment
CN114140368A (en) * 2021-12-03 2022-03-04 天津大学 Multi-modal medical image synthesis method based on generating type countermeasure network
CN114359435A (en) * 2022-03-17 2022-04-15 阿里巴巴(中国)有限公司 Image generation method, model generation method and equipment
CN114863450A (en) * 2022-05-19 2022-08-05 北京百度网讯科技有限公司 Image processing method, image processing device, electronic equipment and storage medium
CN115512368A (en) * 2022-08-22 2022-12-23 华中农业大学 Cross-modal semantic image generation model and method

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107608943A (en) * 2017-09-08 2018-01-19 中国石油大学(华东) Merge visual attention and the image method for generating captions and system of semantic notice
US10043109B1 (en) * 2017-01-23 2018-08-07 A9.Com, Inc. Attribute similarity-based search
EP3404586A1 (en) * 2017-05-18 2018-11-21 INTEL Corporation Novelty detection using discriminator of generative adversarial network
CN109783657A (en) * 2019-01-07 2019-05-21 北京大学深圳研究生院 Multistep based on limited text space is from attention cross-media retrieval method and system
CN110111399A (en) * 2019-04-24 2019-08-09 上海理工大学 A kind of image text generation method of view-based access control model attention

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10043109B1 (en) * 2017-01-23 2018-08-07 A9.Com, Inc. Attribute similarity-based search
EP3404586A1 (en) * 2017-05-18 2018-11-21 INTEL Corporation Novelty detection using discriminator of generative adversarial network
CN107608943A (en) * 2017-09-08 2018-01-19 中国石油大学(华东) Merge visual attention and the image method for generating captions and system of semantic notice
CN109783657A (en) * 2019-01-07 2019-05-21 北京大学深圳研究生院 Multistep based on limited text space is from attention cross-media retrieval method and system
CN110111399A (en) * 2019-04-24 2019-08-09 上海理工大学 A kind of image text generation method of view-based access control model attention

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
TAO XU et al.: "AttnGAN: Fine-Grained Text to Image Generation with Attentional Generative Adversarial Networks", 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition *

Cited By (26)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111461323A (en) * 2020-03-13 2020-07-28 中国科学技术大学 Image identification method and device
CN111461323B (en) * 2020-03-13 2022-07-29 中国科学技术大学 Image identification method and device
CN111918071A (en) * 2020-06-29 2020-11-10 北京大学 Data compression method, device, equipment and storage medium
CN114078172A (en) * 2020-08-19 2022-02-22 四川大学 Text image generation method for progressively generating confrontation network based on resolution
CN114078172B (en) * 2020-08-19 2023-04-07 四川大学 Text image generation method for progressively generating confrontation network based on resolution
CN112348911A (en) * 2020-10-28 2021-02-09 山东师范大学 Semantic constraint-based method and system for generating fine-grained image by stacking texts
CN112348911B (en) * 2020-10-28 2023-04-18 山东师范大学 Semantic constraint-based method and system for generating fine-grained image by stacking texts
CN112101330A (en) * 2020-11-20 2020-12-18 北京沃东天骏信息技术有限公司 Image processing method, image processing apparatus, electronic device, and storage medium
CN112101330B (en) * 2020-11-20 2021-04-30 北京沃东天骏信息技术有限公司 Image processing method, image processing apparatus, electronic device, and storage medium
CN112597278A (en) * 2020-12-25 2021-04-02 北京知因智慧科技有限公司 Semantic information fusion method and device, electronic equipment and storage medium
CN112668655B (en) * 2020-12-30 2023-08-29 中山大学 Out-of-distribution image detection method based on generating attention enhancement against network uncertainty
CN112668655A (en) * 2020-12-30 2021-04-16 中山大学 Method for detecting out-of-distribution image based on generation of confrontation network uncertainty attention enhancement
CN112669215A (en) * 2021-01-05 2021-04-16 北京金山云网络技术有限公司 Training text image generation model, text image generation method and device
CN112364946B (en) * 2021-01-13 2021-05-28 长沙海信智能系统研究院有限公司 Training method of image determination model, and method, device and equipment for image determination
CN112364946A (en) * 2021-01-13 2021-02-12 长沙海信智能系统研究院有限公司 Training method of image determination model, and method, device and equipment for image determination
CN113177562B (en) * 2021-04-29 2024-02-06 京东科技控股股份有限公司 Vector determination method and device for merging context information based on self-attention mechanism
CN113177562A (en) * 2021-04-29 2021-07-27 京东数字科技控股股份有限公司 Vector determination method and device based on self-attention mechanism fusion context information
CN113421314A (en) * 2021-06-09 2021-09-21 湖南大学 Multi-scale bimodal text image generation method based on generation countermeasure network
CN114091662A (en) * 2021-11-26 2022-02-25 广东伊莱特电器有限公司 Text image generation method and device and electronic equipment
CN114091662B (en) * 2021-11-26 2024-05-14 广东伊莱特生活电器有限公司 Text image generation method and device and electronic equipment
CN114140368A (en) * 2021-12-03 2022-03-04 天津大学 Multi-modal medical image synthesis method based on generating type countermeasure network
CN114140368B (en) * 2021-12-03 2024-04-23 天津大学 Multi-mode medical image synthesis method based on generation type countermeasure network
CN114359435A (en) * 2022-03-17 2022-04-15 阿里巴巴(中国)有限公司 Image generation method, model generation method and equipment
CN114863450A (en) * 2022-05-19 2022-08-05 北京百度网讯科技有限公司 Image processing method, image processing device, electronic equipment and storage medium
CN115512368A (en) * 2022-08-22 2022-12-23 华中农业大学 Cross-modal semantic image generation model and method
CN115512368B (en) * 2022-08-22 2024-05-10 华中农业大学 Cross-modal semantic generation image model and method

Also Published As

Publication number Publication date
CN110706302B (en) 2023-05-19

Similar Documents

Publication Publication Date Title
CN110706302B (en) System and method for synthesizing images by text
CN110490946B (en) Text image generation method based on cross-modal similarity and antagonism network generation
CN111047548B (en) Attitude transformation data processing method and device, computer equipment and storage medium
CN110555458A (en) Multi-band image feature level fusion method for generating countermeasure network based on attention mechanism
US20220230276A1 (en) Generative Adversarial Networks with Temporal and Spatial Discriminators for Efficient Video Generation
CN111160343A (en) Off-line mathematical formula symbol identification method based on Self-Attention
CN113901894A (en) Video generation method, device, server and storage medium
CN111325660B (en) Remote sensing image style conversion method based on text data
CN113361251A (en) Text image generation method and system based on multi-stage generation countermeasure network
CN113792641B (en) High-resolution lightweight human body posture estimation method combined with multispectral attention mechanism
CN115222998B (en) Image classification method
CN115953582B (en) Image semantic segmentation method and system
CN117597703A (en) Multi-scale converter for image analysis
CN117475216A (en) Hyperspectral and laser radar data fusion classification method based on AGLT network
CN116912367B (en) Method and system for generating image based on lightweight dynamic refinement text
CN111339734B (en) Method for generating image based on text
CN115512368B (en) Cross-modal semantic generation image model and method
CN114240811A (en) Method for generating new image based on multiple images
CN114299218A (en) System for searching real human face based on hand-drawing sketch
CN114494387A (en) Data set network generation model and fog map generation method
Wang et al. An approach based on Transformer and deformable convolution for realistic handwriting samples generation
CN115115667A (en) Accurate target tracking method based on target transformation regression network
CN115909045B (en) Two-stage landslide map feature intelligent recognition method based on contrast learning
Ma et al. MHGAN: A Multi-Headed Generative Adversarial Network for Underwater Sonar Image Super-Resolution
Kaddoura Real-World Applications

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant