CN116912367B - Method and system for generating image based on lightweight dynamic refinement text - Google Patents

Method and system for generating image based on lightweight dynamic refinement text Download PDF

Info

Publication number
CN116912367B
Authority
CN
China
Prior art keywords
image
text
dynamic
block
refinement
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202311127041.2A
Other languages
Chinese (zh)
Other versions
CN116912367A (en)
Inventor
杨文姬
安航
赵应丁
杨红云
谢丽萍
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangxi Agricultural University
Original Assignee
Jiangxi Agricultural University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangxi Agricultural University filed Critical Jiangxi Agricultural University
Priority to CN202311127041.2A priority Critical patent/CN116912367B/en
Publication of CN116912367A publication Critical patent/CN116912367A/en
Application granted granted Critical
Publication of CN116912367B publication Critical patent/CN116912367B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 11/00: 2D [Two Dimensional] image generation
    • G06T 11/60: Editing figures and text; Combining figures or text
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/044: Recurrent networks, e.g. Hopfield networks
    • G06N 3/0442: Recurrent networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • G06N 3/045: Combinations of networks
    • G06N 3/0455: Auto-encoder networks; Encoder-decoder networks
    • G06N 3/0464: Convolutional networks [CNN, ConvNet]
    • G06N 3/08: Learning methods
    • G06T 3/00: Geometric image transformation in the plane of the image
    • G06T 3/40: Scaling the whole image or part thereof
    • G06T 3/4038: Scaling the whole image or part thereof for image mosaicing, i.e. plane images composed of plane sub-images
    • G06T 9/00: Image coding
    • G06T 9/002: Image coding using neural networks
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77: Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/80: Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V 10/806: Fusion of extracted features
    • G06V 10/82: Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Abstract

The invention belongs to the technical field of image information processing and discloses a method and a system for generating an image based on lightweight dynamic refinement of text. The invention encodes the input text into text features with a text encoder; the text features and a noise vector sampled from a Gaussian distribution are input into a generator. The noise vector is fed to a fully connected layer and reshaped to obtain initial image features, and the text features are fused into the initial image features through a plurality of dynamic text-image fusion refinement blocks to obtain refined image features; finally, the refined image features are converted into an image by a convolution layer and an activation layer. The invention can better realize the adaptation, refinement and fusion steps of text-to-image generation, thereby enhancing the realism and expressiveness of the generated image.

Description

Method and system for generating image based on lightweight dynamic refinement text
Technical Field
The invention belongs to the technical field of image information processing, and particularly relates to a method and a system for generating an image based on a lightweight dynamic refined text.
Background
Text-to-image generation can be applied in scenarios such as virtual design, architecture and floral design. In text-to-image technology, text information is usually extracted with natural language processing techniques, and the resulting text features are then used as constraints for image generation: in a generative adversarial network (GAN), a generator synthesizes pictures conditioned on the text features, and a discriminator evaluates the generation quality. For example, Chinese patent application publication No. CN114078172A discloses a text-to-image method based on a resolution-progressive generative adversarial network, and Chinese patent application publication No. CN116451649A discloses a text-to-image method based on affine transformation. These text-to-image methods mainly propose different ways of optimizing the generator and the discriminator.
Existing image generation techniques generally require substantial computing resources and computing power; large amounts of GPU computation and memory are consumed during training and generation, so the requirements on hardware are high. Training an effective image generation model typically also requires a large amount of training data that must be labeled and preprocessed, a process that likewise consumes significant resources. Although existing techniques have made great progress in generating high-quality images, the generated image quality is still unstable under low computing power and small batch sizes, and problems such as blurring and distortion may appear; moreover, the details of the generated images are not sufficiently rich. Therefore, with limited hardware resources, it is difficult for existing image generation techniques to synthesize images that are more realistic and more faithful to the text.
Disclosure of Invention
In order to solve the problems that existing text-to-image models understand the text poorly and produce images with insufficient detail, the invention provides a text-to-image method and system based on lightweight dynamic refinement, which better realize the adaptation, refinement and fusion steps of text-to-image generation, thereby enhancing the realism and expressiveness of the generated image and meeting stricter requirements on text-to-image generation.
The invention is realized by the following technical scheme. In a method for generating an image based on lightweight dynamic refinement of text, the input text is encoded into text features by a text encoder, and the text features together with a noise vector sampled from a Gaussian distribution are input into a generator. The noise vector is fed to a fully connected layer and reshaped to obtain initial image features, and the text features are fused into the initial image features through a plurality of dynamic text-image fusion refinement blocks (DFRBlock) to obtain refined image features; finally, the refined image features are converted into an image through a convolution layer and an activation layer. Each dynamic text-image fusion refinement block consists of a text-image fusion block (DFBlock), a dynamic adaptive convolution block (DACBlock), a semantic decoder (Semantic Decoder) and a dynamic image feature refinement block (DRBlock). The dynamic text-image fusion refinement block has two inputs, a text feature and an image feature: the text feature is fed to the text-image fusion block, the dynamic adaptive convolution block and the dynamic image feature refinement block, and the image feature is fed to the text-image fusion block, the semantic decoder and the dynamic image feature refinement block. The image features output by the text-image fusion block are input into the dynamic image feature refinement block, the semantic information output by the semantic decoder is input into the dynamic adaptive convolution block, the attention map output by the dynamic adaptive convolution block is input into the dynamic image feature refinement block, and finally the dynamic image feature refinement block outputs the refined image features.
Further preferably, a cross-scale excitation module is used to connect the outputs of the different-scale dynamic text image fusion refinement blocks.
Further preferably, a trained encoder and decoder are obtained by training on real images with a reconstruction loss function. The generated image and the real image are each encoded by the encoder in the discriminator to obtain image features; the text features and the image features are then concatenated to judge the authenticity of the image. The generator and the discriminator are subsequently trained together with an adversarial loss: the discrimination results drive the generator to improve the quality of its generated images, and in turn the improved image quality pushes the discriminator to improve, so the two improve together until the generator finally produces images that are both visually realistic and semantically consistent.
Further preferably, the reconstruction loss function is:

$$\mathcal{L}_{rec} = \mathbb{E}_{f \sim D_{encode}(x),\, x \sim I_{real}} \big[\, \| \mathcal{G}(f) - \mathcal{T}(x) \| \,\big]$$

where x represents the input image, f represents the image features extracted by the encoder, $\mathcal{G}(\cdot)$ represents the function that processes the image features extracted by the encoder, $\mathcal{T}(\cdot)$ represents the function that processes the real image, $D_{encode}(x)$ represents the image feature set and $I_{real}$ represents the set of real images; the expectation is taken over the image features f output by $D_{encode}(x)$ for real images x drawn from $I_{real}$.
Further preferably, the adversarial loss is expressed as:

$$\mathcal{L}_{D} = - \mathbb{E}_{x \sim P_{r}}\big[\min(0, -1 + D(x, e))\big] - \tfrac{1}{2}\,\mathbb{E}_{G(z) \sim P_{g}}\big[\min(0, -1 - D(G(z), e))\big] - \tfrac{1}{2}\,\mathbb{E}_{x \sim P_{mis}}\big[\min(0, -1 - D(x, e))\big] + k\,\mathbb{E}_{x \sim P_{r}}\big[\big(\|\nabla_{x} D(x, e)\| + \|\nabla_{e} D(x, e)\|\big)^{p}\big]$$

$$\mathcal{L}_{G} = - \mathbb{E}_{G(z) \sim P_{g}}\big[D(G(z), e)\big]$$

where $\mathcal{L}_{D}$ denotes the loss of the discriminator, $\mathcal{L}_{G}$ denotes the loss of the generator, $D(\cdot)$ denotes the output of the discriminator, z is the noise vector sampled from the Gaussian distribution, G(z) denotes the synthetic data output by the generator from the input noise vector z, and e is the text feature; $P_{g}$ denotes the synthetic data distribution, $P_{r}$ the real data distribution and $P_{mis}$ the mismatched data distribution; $\mathbb{E}_{x \sim P_{r}}$, $\mathbb{E}_{G(z) \sim P_{g}}$ and $\mathbb{E}_{x \sim P_{mis}}$ denote expectations taken under the real, synthetic and mismatched data distributions respectively; min represents the minimum function, $\nabla_{x}$ represents the gradient with respect to the variable x and $\nabla_{e}$ represents the gradient with respect to the text feature e; k and p are two hyperparameters.
The invention further provides a text-to-image system based on lightweight dynamic refinement, which consists of a generator, a discriminator and a pre-trained text encoder. The generator comprises a fully connected layer, six dynamic text-image fusion refinement blocks (DFRBlock), a cross-scale excitation module, a convolution layer and an activation layer. The noise vector sampled from a Gaussian distribution is fed into the fully connected layer and reshaped to obtain initial image features, and the text features are fused into the initial image features through the dynamic text-image fusion refinement blocks (DFRBlock) to obtain refined image features; finally, the refined image features are converted into an image through the convolution layer and the activation layer;
the six dynamic text image fusion refinement blocks are sequentially a first dynamic text image fusion refinement block, a second dynamic text image fusion refinement block, a third dynamic text image fusion refinement block, a fourth dynamic text image fusion refinement block, a fifth dynamic text image fusion refinement block and a sixth dynamic text image fusion refinement block, the initial image features are sequentially fused with text features in different scales in the first dynamic text image fusion refinement block, the second dynamic text image fusion refinement block, the third dynamic text image fusion refinement block, the fourth dynamic text image fusion refinement block, the fifth dynamic text image fusion refinement block and the text features, and the image features output by the first dynamic text image fusion refinement block and the image features output by the fifth dynamic text image fusion refinement block are connected through a trans-scale excitation module and then input into the sixth dynamic text image fusion refinement block and fused with the text features; and connecting the image features output by the second dynamic text image fusion refinement block with the image features output by the sixth dynamic text image fusion refinement block through a trans-scale excitation module to obtain refined image features.
Still preferably, the dynamic text-image fusion refinement block is composed of a text-image fusion block (DFBlock), a dynamic adaptive convolution block (DACBlock), a semantic decoder (Semantic Decoder) and a dynamic image feature refinement block (DRBlock). The dynamic text-image fusion refinement block has two inputs, a text feature and an image feature: the text feature is fed to the text-image fusion block, the dynamic adaptive convolution block and the dynamic image feature refinement block, and the image feature is fed to the text-image fusion block, the semantic decoder and the dynamic image feature refinement block. The image features output by the text-image fusion block are input into the dynamic image feature refinement block, the semantic information output by the semantic decoder is input into the dynamic adaptive convolution block, the attention map output by the dynamic adaptive convolution block is input into the dynamic image feature refinement block, and finally the dynamic image feature refinement block outputs the refined image features.
Further preferably, the text-image fusion block comprises several affine transformations and ReLU activation layers. The image features, after being upsampled by an upsampling layer, are affine-transformed in the affine transformation layer with text features obtained through a multi-layer perceptron (MLP); each affine transformation is followed by a ReLU activation layer and a convolution layer; after several such affine transformations, the result is concatenated with the input image features.
Further preferably, a plurality of dynamic convolution kernels are provided in the dynamic adaptive convolution block. Each dynamic convolution kernel predicts an attention feature map from part of the text semantics, and the attention feature maps predicted by the individual kernels are stacked along the channel direction into a whole attention feature map, which is used to predict the spatial affine parameters of the dynamic image feature refinement block. The weights of the convolution kernels in the dynamic adaptive convolution block are adjusted according to the given text: scaling and shifting operations are applied to each candidate convolution kernel, and the scaling and shift parameters of the kernels are predicted by two multi-layer perceptrons.
Further preferably, the dynamic image feature refinement block is composed of an upsampling layer, a convolution layer, a channel affine layer (C-Affine) and a spatial affine layer (S-Affine). The dynamic image feature refinement block operates on the spatial and channel dimensions: the fused image features are input into the upsampling layer and then, through the convolution layer, into the channel affine layer and the spatial affine layer respectively, and the text features are fed into two different multi-layer perceptrons to predict the channel scaling parameter $\gamma_{c}$ and the channel shift parameter $\theta_{c}$ of each channel:

$$\gamma_{c} = MLP_{1}(e); \qquad \theta_{c} = MLP_{2}(e)$$

where e represents the text features and MLP represents a multi-layer perceptron;

the channel affine transformation is expressed as follows:

$$\text{C-Affine}(x_{i} \mid e) = \gamma_{c,i} \cdot x_{i} + \theta_{c,i}$$

where $x_{i}$ is the ith channel of the image feature;

for each pixel in the image feature, the dynamic adaptive convolution block and two convolution layers predict the spatial scaling parameter $\gamma_{s}$ and the spatial shift parameter $\theta_{s}$:

$$\gamma_{s} = Conv_{\gamma}(f); \qquad \theta_{s} = Conv_{\theta}(f)$$

where $Conv_{\gamma}$ represents one convolution layer, $Conv_{\theta}$ represents the other convolution layer, and f represents the attention feature map obtained by the dynamic adaptive convolution block;

the spatial affine transformation is expressed as follows:

$$\text{S-Affine}(x_{j} \mid e) = \gamma_{s,j} \cdot x_{j} + \theta_{s,j}$$

the dynamic image feature refinement block dynamically combines these two operations with the combination weight predicted by the detail predictor:

$$DR_{n}(x) = \pi_{n} \odot \text{S-Affine}(x \mid e) + (1 - \pi_{n}) \odot \text{C-Affine}(x \mid e)$$

where $DR_{n}$ denotes the operation of the dynamic image feature refinement block and $\pi_{n}$ denotes the spatial and channel refinement combination weight of the nth dynamic text-image fusion refinement block.
Further preferably, each dynamic image feature refinement block comprises a detail predictor. The image feature output by the preceding dynamic text-image fusion refinement block is input and mapped to a visual vector V through a convolution layer; the text feature is mapped to a text vector T by a multi-layer perceptron (MLP). The visual vector V and the text vector T are then concatenated along the channel dimension and passed through a multi-layer perceptron (MLP) to obtain the mapped visual vector V′ and text vector T′; the difference V′ − T′ and the element-wise product V ⊙ T are computed and concatenated with the visual vector V and the text vector T; finally, a multi-layer perceptron (MLP) and a Sigmoid activation layer predict the spatial and channel refinement combination weights of the N dynamic text-image fusion refinement blocks.
Further preferably, the discriminator includes an encoder and a decoder. The encoder encodes the image through 6 convolution layers; the image features output by the 4th convolution layer are randomly cropped to a feature map of 1/2 the width and height and passed through a decoder to obtain the layer-4 reconstruction of the cropped region, the image features output by the 5th convolution layer are passed through a decoder to obtain the layer-5 reconstruction, and the image features output by the 6th convolution layer are merged with the text features in order to judge whether the image is real or fake.
Further preferably, the decoder of the discriminator is composed of an upsampling layer, a convolution layer, a normalization layer and a GLU activation layer in that order.
Further preferably, the cross-scale excitation module is defined as:

$$y = F(x_{low}, \{W_{i}\}) \cdot x_{high}$$

where $x_{low}$ represents the low-scale image features (8 × 8 resolution), $x_{high}$ represents the high-scale image features (128 × 128 resolution), y represents the fused image features, F represents the operation on the low-scale image features, and $W_{i}$ represents the module weights to be learned. The cross-scale excitation module downsamples the low-scale image features $x_{low}$ to 4 × 4 along the spatial dimension through an adaptive pooling layer, then transforms them to 1 × 1 with a convolution layer and models the nonlinearity with a LeakyReLU activation layer; a 1 × 1 convolution layer adjusts the output of the LeakyReLU activation layer to the same number of channels as the high-scale image features $x_{high}$; finally, a gating operation is performed through a Sigmoid activation layer, and the gated output is multiplied with $x_{high}$ along the channel dimension to produce high-scale image features of the same shape as $x_{high}$.
Under limited hardware resource conditions, conventional image generation algorithms may not operate efficiently. The technical scheme of the invention is optimized for low-power environments and designed as a lightweight algorithm that can perform the image generation task with limited hardware resources, making high-quality image generation possible on resource-constrained devices.
Existing text-to-image models may not understand the text well, resulting in differences between the generated image and the text description. By optimizing the link between text understanding and image generation, the technical scheme of the invention understands the text content better and refines the image features accordingly, so that images more consistent with the text description are generated.
Drawings
Fig. 1 is a schematic diagram of a generator.
Fig. 2 is a schematic diagram of a discriminator.
Fig. 3 is a schematic diagram of a dynamic text image fusion refinement block.
FIG. 4 is a schematic diagram of a cross-scale excitation module.
Fig. 5 is a schematic diagram of a decoder.
Fig. 6 is a schematic diagram of a text image fusion block.
Fig. 7 is a schematic diagram of a dynamic adaptive convolution block.
Fig. 8 is a schematic diagram of the dynamic image feature refinement block.
Fig. 9 is a detail predictor schematic.
Detailed Description
The invention is further described in detail below with reference to the drawings and examples.
A text-to-image method based on lightweight dynamic refinement is performed by a text-to-image system consisting of a generator, a discriminator and a pre-trained text encoder. As shown in Fig. 1, the text encoder encodes the input text into text features. The generator has two inputs: the text features encoded by the text encoder, and a noise vector sampled from a Gaussian distribution to ensure the diversity of the generated images. The noise vector is fed to a fully connected layer and reshaped to obtain initial image features, and the text features are fused into the initial image features through six dynamic text-image fusion refinement blocks (DFRBlock) to obtain refined image features; finally, the refined image features are converted into an image by the convolution layer and the activation layer.
In this embodiment, cross-scale excitation modules are used to connect the outputs of dynamic text-image fusion refinement blocks at different scales, which refines the features while improving training efficiency. The six dynamic text-image fusion refinement blocks are, in order, a first, a second, a third, a fourth, a fifth and a sixth dynamic text-image fusion refinement block. The initial image features are fused with the text features at different scales successively in the first, second, third, fourth and fifth blocks; the image features output by the first block and the image features output by the fifth block are connected through a cross-scale excitation module and then input into the sixth block, where they are fused with the text features; and the image features output by the second block are connected with the image features output by the sixth block through a cross-scale excitation module to obtain the refined image features.
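To make this data flow concrete, the following PyTorch sketch wires six fusion-refinement blocks and the two cross-scale excitation connections in the order just described. It is a minimal sketch rather than the embodiment itself: the stub classes `_DFRBlockStub` and `_CSEStub`, the channel width `ch` and the noise/text dimensions are assumptions standing in for the full DFRBlock and cross-scale excitation module detailed later.

```python
import torch
import torch.nn as nn

class _DFRBlockStub(nn.Module):
    """Stand-in for a dynamic text-image fusion refinement block: upsample + text-conditioned affine."""
    def __init__(self, ch, text_dim):
        super().__init__()
        self.up = nn.Upsample(scale_factor=2, mode="nearest")
        self.conv = nn.Conv2d(ch, ch, 3, padding=1)
        self.gamma = nn.Linear(text_dim, ch)   # channel-wise scale from the text feature
        self.theta = nn.Linear(text_dim, ch)   # channel-wise shift from the text feature

    def forward(self, x, e):
        x = self.conv(self.up(x))
        return x * self.gamma(e)[..., None, None] + self.theta(e)[..., None, None]

class _CSEStub(nn.Module):
    """Stand-in for the cross-scale excitation module: channel gate derived from low-scale features."""
    def __init__(self, ch_low, ch_high):
        super().__init__()
        self.gate = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Conv2d(ch_low, ch_high, 1), nn.Sigmoid())

    def forward(self, x_low, x_high):
        return x_high * self.gate(x_low)

class GeneratorSketch(nn.Module):
    """noise -> FC -> reshape to 4x4 -> six DFRBlocks -> two cross-scale connections -> conv + tanh."""
    def __init__(self, z_dim=100, text_dim=256, ch=64):
        super().__init__()
        self.ch = ch
        self.fc = nn.Linear(z_dim, ch * 4 * 4)
        self.blocks = nn.ModuleList([_DFRBlockStub(ch, text_dim) for _ in range(6)])
        self.cse_a = _CSEStub(ch, ch)   # connects block-1 output with block-5 output
        self.cse_b = _CSEStub(ch, ch)   # connects block-2 output with block-6 output
        self.to_img = nn.Sequential(nn.Conv2d(ch, 3, 3, padding=1), nn.Tanh())

    def forward(self, z, e):
        x = self.fc(z).view(z.size(0), self.ch, 4, 4)     # initial image features
        outs = []
        for i, blk in enumerate(self.blocks):
            if i == 5:
                x = self.cse_a(outs[0], x)                # excite block-5 output with block-1 output
            x = blk(x, e)                                 # fuse the text feature at this scale
            outs.append(x)
        x = self.cse_b(outs[1], outs[5])                  # excite block-6 output with block-2 output
        return self.to_img(x)                             # convolution + activation -> image

# usage: GeneratorSketch()(torch.randn(2, 100), torch.randn(2, 256)) -> tensor of shape (2, 3, 256, 256)
```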
As shown in Fig. 2, the discriminator includes an encoder and a decoder: the image features are extracted by the encoder, the image is reconstructed from the feature space by the decoder, and the reconstruction error is minimized by self-encoding training of the encoder and decoder, which enables the encoder to extract image features suitable for the subsequent real/fake discrimination. The encoder encodes the image through 6 convolution layers; the image features output by the 4th convolution layer are randomly cropped to a feature map of 1/2 the width and height and passed through a decoder to obtain the layer-4 reconstruction of the cropped region, the image features output by the 5th convolution layer are passed through a decoder to obtain the layer-5 reconstruction, and the image features output by the 6th convolution layer are merged with the text features to judge whether the image is real or fake, where the text features are obtained by encoding the text with a bidirectional long short-term memory (Bi-LSTM) network.
Training on the real image using the reconstruction loss function results in a trained encoder and decoder.
The reconstruction loss function is:

$$\mathcal{L}_{rec} = \mathbb{E}_{f \sim D_{encode}(x),\, x \sim I_{real}} \big[\, \| \mathcal{G}(f) - \mathcal{T}(x) \| \,\big]$$

where x represents the input image, f represents the image features extracted by the encoder, $\mathcal{G}(\cdot)$ represents the function that processes the image features extracted by the encoder (the decoder), $\mathcal{T}(\cdot)$ represents the function that processes the real image, $D_{encode}(x)$ represents the image feature set and $I_{real}$ represents the set of real images; the expectation is taken over the image features f output by $D_{encode}(x)$ for real images x drawn from $I_{real}$. Such reconstruction training helps ensure that the discriminator extracts a comprehensive feature representation from its input, covering both the overall composition and detailed textures.
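A hedged sketch of how this reconstruction objective could be computed in PyTorch follows. The encoder interface (a dict of layer-4 and layer-5 features), the two decoder heads `decoder_crop` and `decoder_full`, the L1 norm and the crop/resize helpers are illustrative assumptions rather than the exact configuration of the embodiment.

```python
import torch
import torch.nn.functional as F

def reconstruction_loss(encoder, decoder_crop, decoder_full, real_img):
    """Self-supervised reconstruction loss for the discriminator: decode intermediate
    discriminator features back to (parts of) the real image and compare."""
    feats = encoder(real_img)                  # assumed to return {"layer4": ..., "layer5": ...}
    f4, f5 = feats["layer4"], feats["layer5"]

    # random crop of the layer-4 feature map to 1/2 width and height, plus the matching image region
    _, _, h, w = f4.shape
    top = torch.randint(0, h - h // 2 + 1, (1,)).item()
    left = torch.randint(0, w - w // 2 + 1, (1,)).item()
    f4_crop = f4[:, :, top:top + h // 2, left:left + w // 2]

    img_h, img_w = real_img.shape[-2:]
    sh, sw = img_h // h, img_w // w            # spatial stride between image and layer-4 features
    img_crop = real_img[:, :, top * sh:(top + h // 2) * sh, left * sw:(left + w // 2) * sw]

    rec_crop = decoder_crop(f4_crop)           # reconstruction of the cropped region
    rec_full = decoder_full(f5)                # reconstruction of the whole (downsampled) image

    target_crop = F.interpolate(img_crop, size=rec_crop.shape[-2:], mode="bilinear", align_corners=False)
    target_full = F.interpolate(real_img, size=rec_full.shape[-2:], mode="bilinear", align_corners=False)
    return F.l1_loss(rec_crop, target_crop) + F.l1_loss(rec_full, target_full)
```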
The generated image and the real image are each encoded by the encoder in the discriminator to obtain image features; the text features and the image features are then concatenated to judge the authenticity of the image. The generator and the discriminator are then trained together with an adversarial loss: the discrimination results drive the generator to improve the quality of its generated images, and in turn the improved image quality pushes the discriminator to improve, so the two improve together until the generator finally produces images that are both visually realistic and semantically consistent. The adversarial loss is expressed as:

$$\mathcal{L}_{D} = - \mathbb{E}_{x \sim P_{r}}\big[\min(0, -1 + D(x, e))\big] - \tfrac{1}{2}\,\mathbb{E}_{G(z) \sim P_{g}}\big[\min(0, -1 - D(G(z), e))\big] - \tfrac{1}{2}\,\mathbb{E}_{x \sim P_{mis}}\big[\min(0, -1 - D(x, e))\big] + k\,\mathbb{E}_{x \sim P_{r}}\big[\big(\|\nabla_{x} D(x, e)\| + \|\nabla_{e} D(x, e)\|\big)^{p}\big]$$

$$\mathcal{L}_{G} = - \mathbb{E}_{G(z) \sim P_{g}}\big[D(G(z), e)\big]$$

where $\mathcal{L}_{D}$ denotes the loss of the discriminator, $\mathcal{L}_{G}$ denotes the loss of the generator, $D(\cdot)$ denotes the output of the discriminator, z is the noise vector sampled from the Gaussian distribution, G(z) denotes the synthetic data output by the generator from the input noise vector z, and e is the text feature; $P_{g}$ denotes the synthetic data distribution, $P_{r}$ the real data distribution and $P_{mis}$ the mismatched data distribution; $\mathbb{E}_{x \sim P_{r}}$, $\mathbb{E}_{G(z) \sim P_{g}}$ and $\mathbb{E}_{x \sim P_{mis}}$ denote expectations taken under the real, synthetic and mismatched data distributions respectively; min represents the minimum function, $\nabla_{x}$ represents the gradient with respect to the variable x and $\nabla_{e}$ represents the gradient with respect to the text feature e; k and p are two hyperparameters used to balance the effectiveness of the gradient penalty.
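The sketch below shows one way these hinge losses with the matching-aware gradient penalty could be realized in PyTorch; the discriminator interface `D(image, text)`, the use of mismatched text for the third term and the default values of the hyperparameters k and p are assumptions, not prescriptions of the embodiment.

```python
import torch

def discriminator_loss(D, real_img, fake_img, text, mis_text, k=2.0, p=6.0):
    """Hinge losses over real / fake / mismatched pairs plus the matching-aware
    gradient penalty computed on the (real image, matching text) pair."""
    real_img = real_img.requires_grad_(True)
    text = text.requires_grad_(True)

    d_real = D(real_img, text)
    d_fake = D(fake_img.detach(), text)
    d_mis = D(real_img, mis_text)

    loss = (torch.relu(1.0 - d_real).mean()
            + 0.5 * torch.relu(1.0 + d_fake).mean()
            + 0.5 * torch.relu(1.0 + d_mis).mean())

    # gradient penalty with respect to both the real image and the matching text feature
    grads = torch.autograd.grad(outputs=d_real.sum(), inputs=(real_img, text), create_graph=True)
    grad_norm = torch.sqrt(grads[0].pow(2).sum(dim=(1, 2, 3)) + grads[1].pow(2).sum(dim=1))
    return loss + k * grad_norm.pow(p).mean()

def generator_loss(D, fake_img, text):
    """The generator maximises the discriminator score of the (generated image, text) pair."""
    return -D(fake_img, text).mean()
```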
Unlike a residual block (ResBlock), which implements the skip connection as an element-wise addition between activations of different convolution layers and therefore requires the spatial dimensions of those activations to be the same, the invention redefines the skip connection through two distinctive designs, forming the cross-scale excitation module (CSEM): channel-wise multiplication is applied between activations, which removes the repeated convolution computation, and the skip connections are made between resolutions that are far apart (e.g. 8 × 8 and 128 × 128, or 16 × 16 and 256 × 256). These two designs let the cross-scale excitation module inherit the advantages of the residual block, with fast gradient flow and no extra computational burden. The cross-scale excitation module is defined as:

$$y = F(x_{low}, \{W_{i}\}) \cdot x_{high}$$

where $x_{low}$ represents the low-scale image features (8 × 8 resolution), $x_{high}$ represents the high-scale image features (128 × 128 resolution), y represents the fused image features, F represents the operation on the low-scale image features, and $W_{i}$ represents the module weights to be learned. As shown in Fig. 4, the cross-scale excitation module downsamples the low-scale image features $x_{low}$ to 4 × 4 along the spatial dimension through an adaptive pooling layer, then transforms them to 1 × 1 with a convolution layer and models the nonlinearity with a LeakyReLU activation layer; a 1 × 1 convolution layer then adjusts the output of the LeakyReLU activation layer to the same number of channels as the high-scale image features $x_{high}$. Finally, a gating operation is performed through a Sigmoid activation layer, and the gated output is multiplied with $x_{high}$ along the channel dimension to produce high-scale image features of the same shape as $x_{high}$.
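A minimal PyTorch sketch of the cross-scale excitation module as just described (adaptive pooling to 4 × 4, a 4 × 4 convolution down to 1 × 1, LeakyReLU, a 1 × 1 convolution, a Sigmoid gate and channel-wise multiplication) follows; the channel counts in the usage line are placeholders rather than the embodiment's configuration.

```python
import torch
import torch.nn as nn

class CrossScaleExcitation(nn.Module):
    """y = F(x_low, {W_i}) * x_high: gate the high-scale features with a channel
    descriptor squeezed out of the low-scale features."""
    def __init__(self, ch_low, ch_high):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(4),                    # downsample low-scale features to 4x4
            nn.Conv2d(ch_low, ch_high, kernel_size=4),  # 4x4 convolution -> 1x1 spatial size
            nn.LeakyReLU(0.1, inplace=True),            # nonlinearity
            nn.Conv2d(ch_high, ch_high, kernel_size=1), # 1x1 convolution at the high-scale channel count
            nn.Sigmoid(),                               # gating in (0, 1)
        )

    def forward(self, x_low, x_high):
        return x_high * self.gate(x_low)                # channel-wise multiplication

# usage: gate 128x128 features (64 channels) with 8x8 features (512 channels)
cse = CrossScaleExcitation(ch_low=512, ch_high=64)
y = cse(torch.randn(1, 512, 8, 8), torch.randn(1, 64, 128, 128))   # -> (1, 64, 128, 128)
```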
The dynamic text-image fusion refinement block fuses text features and image features during image generation and guides the generation of image details with semantic information to obtain refined image features. As shown in Fig. 3, it consists of a text-image fusion block (DFBlock), a dynamic adaptive convolution block (DACBlock), a semantic decoder (Semantic Decoder) and a dynamic image feature refinement block (DRBlock). The block has two inputs, a text feature and an image feature: the text feature is fed to the text-image fusion block, the dynamic adaptive convolution block and the dynamic image feature refinement block, and the image feature is fed to the text-image fusion block, the semantic decoder and the dynamic image feature refinement block. The image features output by the text-image fusion block are input into the dynamic image feature refinement block, the semantic information output by the semantic decoder is input into the dynamic adaptive convolution block, the attention map output by the dynamic adaptive convolution block is input into the dynamic image feature refinement block, and finally the dynamic image feature refinement block outputs the refined image features.
As shown in Fig. 5, the decoder of the discriminator consists, in order, of an upsampling layer, a convolution layer, a normalization layer and a GLU activation layer, used to reconstruct the image from the hidden space; in this embodiment four groups of upsampling, convolution, normalization and GLU activation layers are used.
To efficiently fuse the text features and the image features, a text-image fusion block is adopted. As shown in Fig. 6, the text-image fusion block comprises several affine transformations and ReLU activation layers: the image features, after being upsampled by the upsampling layer, are affine-transformed in the affine transformation layer with text features obtained through a multi-layer perceptron (MLP) and then passed through a ReLU activation layer; each affine transformation is followed by a convolution layer, and after several such affine transformations the image features are concatenated with the input image features to obtain the fused image features. For the affine transformation, two multi-layer perceptrons (MLP) predict the scaling parameter $\gamma$ and the shift parameter $\theta$ from the text features:

$$\gamma = MLP_{\gamma}(e); \qquad \theta = MLP_{\theta}(e)$$

where e represents the text feature. For a given input image feature, a channel scaling operation is first performed with the scaling parameter $\gamma$, and a channel shift operation is then performed with the shift parameter $\theta$. This process can be expressed as:

$$\text{Affine}(x_{i} \mid e) = \gamma_{i} \cdot x_{i} + \theta_{i}$$

where Affine represents the affine transformation and $x_{i}$ is the ith channel of the image feature.
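For concreteness, a sketch of one affine stage of the text-image fusion block following the equations above: two MLPs predict γ and θ from the text feature and scale/shift each channel of the image feature, followed by ReLU and a convolution. The hidden width, the layer names and the number of stages are assumptions.

```python
import torch
import torch.nn as nn

class TextConditionedAffine(nn.Module):
    """Affine(x_i | e) = gamma_i * x_i + theta_i, with gamma and theta predicted from the text feature."""
    def __init__(self, text_dim, num_channels, hidden=256):
        super().__init__()
        self.to_gamma = nn.Sequential(nn.Linear(text_dim, hidden), nn.ReLU(), nn.Linear(hidden, num_channels))
        self.to_theta = nn.Sequential(nn.Linear(text_dim, hidden), nn.ReLU(), nn.Linear(hidden, num_channels))

    def forward(self, x, e):                                   # x: (B, C, H, W), e: (B, text_dim)
        gamma = self.to_gamma(e).unsqueeze(-1).unsqueeze(-1)   # (B, C, 1, 1) channel scaling
        theta = self.to_theta(e).unsqueeze(-1).unsqueeze(-1)   # (B, C, 1, 1) channel shift
        return gamma * x + theta

class DFBlockStage(nn.Module):
    """One affine -> ReLU -> convolution stage of the text-image fusion block."""
    def __init__(self, text_dim, channels):
        super().__init__()
        self.affine = TextConditionedAffine(text_dim, channels)
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x, e):
        return self.conv(torch.relu(self.affine(x, e)))
```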
To improve the refinement accuracy of the dynamic image feature refinement block, this embodiment proposes a dynamic adaptive convolution block as a refinement localization module. As shown in Fig. 7, the weights of the convolution kernels in the dynamic adaptive convolution block are adjusted for the given text. Directly generating all parameters of a convolution kernel, however, is not only computationally expensive but also prone to overfitting the training text descriptions. To solve this problem, scaling and shifting operations are applied to each candidate convolution kernel: two multi-layer perceptrons (MLP) first predict the scaling parameter γ and the shift parameter θ of the convolution kernel, and the modulation of one convolution kernel can be formally expressed as:

$$\tilde{K} = \gamma \cdot K + \theta, \qquad K \in \mathbb{R}^{c \times h \times w}$$

where K is a convolution kernel and c, h, w denote the input channel size, kernel height and kernel width respectively. Through this text-guided convolution kernel adaptation, dynamic convolution operations can be queried from the text description. However, the semantic information carried by a single convolution kernel is limited, and it is difficult to merge the entire text semantics into one kernel. The invention therefore places multiple dynamic convolution kernels in the dynamic adaptive convolution block, which allows the model to attend to different image features and different text representation subspaces simultaneously. Each dynamic convolution kernel predicts an attention feature map from part of the text semantics; these attention feature maps are then stacked along the channel direction into a whole attention feature map, which is used to predict the spatial affine parameters of the dynamic image feature refinement block. The dynamic adaptive convolution block enables the method to distinguish the detail regions required by the text from those the image does not need, so the dynamic image feature refinement block can refine the image features more accurately.
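A hedged sketch of the dynamic adaptive convolution block follows: a small set of candidate kernels is scaled and shifted by text-predicted parameters, each modulated kernel produces one attention map, and the maps are stacked along the channel dimension. The per-kernel scalar modulation, the kernel count and the Sigmoid on the attention maps are simplifying assumptions rather than the embodiment's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicAdaptiveConvBlock(nn.Module):
    """Text-modulated candidate kernels, each producing one attention map; the maps
    are stacked along the channel direction."""
    def __init__(self, in_channels, text_dim, num_kernels=4, ksize=3):
        super().__init__()
        # candidate kernels, each mapping in_channels -> 1 attention channel
        self.kernels = nn.Parameter(torch.randn(num_kernels, 1, in_channels, ksize, ksize) * 0.02)
        self.to_gamma = nn.Linear(text_dim, num_kernels)   # per-kernel scaling from the text
        self.to_theta = nn.Linear(text_dim, num_kernels)   # per-kernel shift from the text
        self.ksize = ksize

    def forward(self, x, e):                               # x: (B, C, H, W), e: (B, text_dim)
        b = x.size(0)
        gamma, theta = self.to_gamma(e), self.to_theta(e)  # (B, K) each
        maps = []
        for i in range(self.kernels.size(0)):
            # K_tilde = gamma * K + theta (one modulation per sample and kernel)
            k = self.kernels[i].unsqueeze(0) * gamma[:, i].view(b, 1, 1, 1, 1) \
                + theta[:, i].view(b, 1, 1, 1, 1)          # (B, 1, C, ks, ks)
            # grouped-convolution trick: fold the batch into the channel dimension
            out = F.conv2d(x.reshape(1, -1, *x.shape[2:]),
                           k.reshape(b, -1, self.ksize, self.ksize),
                           padding=self.ksize // 2, groups=b)
            maps.append(out.reshape(b, 1, *x.shape[2:]))
        return torch.sigmoid(torch.cat(maps, dim=1))       # (B, K, H, W) stacked attention maps
```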
As described above, after a conventional text-image fusion block fuses the text and image features, the image features may not match the text features well and the synthesized image may lack detail. To solve these problems, the visual feature refinement process is first decomposed into spatial refinement and channel refinement; the invention then proposes a novel dynamic image feature refinement block that reassembles an appropriate refinement process for each input pair through a dynamic combination of spatial and channel affine transformations.
As shown in Fig. 8, the dynamic image feature refinement block is composed of an upsampling layer, a convolution layer and two affine layers: a channel affine layer (C-Affine) and a spatial affine layer (S-Affine). To fully refine the image features, the dynamic image feature refinement block operates on both the spatial and channel dimensions and dynamically combines the two operations with the combination weight predicted by the detail predictor. Specifically, the fused image features are first input into the upsampling layer and then, through the convolution layer, into the channel affine layer and the spatial affine layer respectively. In the channel affine layer, the text features are passed into two different multi-layer perceptrons (MLP) to predict the channel scaling parameter $\gamma_{c}$ and the channel shift parameter $\theta_{c}$ of each channel:

$$\gamma_{c} = MLP_{1}(e); \qquad \theta_{c} = MLP_{2}(e)$$

The channel affine transformation can be expressed as follows:

$$\text{C-Affine}(x_{i} \mid e) = \gamma_{c,i} \cdot x_{i} + \theta_{c,i}$$

Unlike C-Affine, S-Affine applies the affine transformation in the spatial dimensions of the image features. For each pixel in the image feature, the dynamic adaptive convolution block and two convolution layers predict the spatial scaling parameter $\gamma_{s}$ and the spatial shift parameter $\theta_{s}$:

$$\gamma_{s} = Conv_{\gamma}(f); \qquad \theta_{s} = Conv_{\theta}(f)$$

where $Conv_{\gamma}$ represents one convolution layer, $Conv_{\theta}$ represents the other convolution layer, and f represents the attention feature map obtained by the dynamic adaptive convolution block. The spatial affine transformation is therefore expressed as follows:

$$\text{S-Affine}(x_{j} \mid e) = \gamma_{s,j} \cdot x_{j} + \theta_{s,j}$$

In order to operate fully on the image features, C-Affine and S-Affine are integrated in the dynamic image feature refinement block, which dynamically combines the two operations with the combination weight predicted by the detail predictor:

$$DR_{n}(x) = \pi_{n} \odot \text{S-Affine}(x \mid e) + (1 - \pi_{n}) \odot \text{C-Affine}(x \mid e)$$

where $DR_{n}$ denotes the operation of the dynamic image feature refinement block and $\pi_{n}$ denotes the spatial and channel refinement combination weight of the nth dynamic text-image fusion refinement block.
The dynamic image feature refinement block substantially refines the image features from both the spatial and channel dimensions and accounts for the different contributions of these two dimensions at different image scales.
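The sketch below assembles the pieces above into one dynamic image feature refinement block: upsample and convolve, apply the channel affine and spatial affine refinements, and blend them with the weight π predicted by the detail predictor. The convex-combination form, the single-channel spatial parameters and all layer sizes are assumptions consistent with the equations, not the exact embodiment.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicRefineBlock(nn.Module):
    """Upsample, then blend channel-affine and spatial-affine refinements of the image features."""
    def __init__(self, channels, text_dim, attn_channels=4):
        super().__init__()
        self.up = nn.Upsample(scale_factor=2, mode="nearest")
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)
        # C-Affine: per-channel gamma_c, theta_c predicted from the text feature
        self.c_gamma = nn.Linear(text_dim, channels)
        self.c_theta = nn.Linear(text_dim, channels)
        # S-Affine: per-pixel gamma_s, theta_s predicted from the DACBlock attention maps
        self.s_gamma = nn.Conv2d(attn_channels, 1, 3, padding=1)
        self.s_theta = nn.Conv2d(attn_channels, 1, 3, padding=1)

    def forward(self, x, e, attn, pi):
        # x: (B, C, H, W), e: (B, T), attn: (B, K, h, w), pi: (B, C, 1, 1) weight from the detail predictor
        x = self.conv(self.up(x))
        gc = self.c_gamma(e).unsqueeze(-1).unsqueeze(-1)
        tc = self.c_theta(e).unsqueeze(-1).unsqueeze(-1)
        c_ref = gc * x + tc                                    # channel affine refinement
        attn = F.interpolate(attn, size=x.shape[-2:], mode="bilinear", align_corners=False)
        s_ref = self.s_gamma(attn) * x + self.s_theta(attn)    # spatial affine refinement
        return pi * s_ref + (1.0 - pi) * c_ref                 # dynamic combination with weight pi
```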
A detail predictor is proposed so that the image detail synthesis process can be guided by the current text. Each dynamic image feature refinement block contains a detail predictor, as shown in Fig. 9. The input is the 4 × 4 image feature output by the preceding dynamic text-image fusion refinement block, which a convolution layer maps to the visual vector V; the text feature is mapped to the text vector T by a multi-layer perceptron (MLP). The visual vector V and the text vector T are then concatenated along the channel dimension and passed through a multi-layer perceptron (MLP) to obtain the mapped visual vector V′ and text vector T′, reducing the gap between the visual and textual domains. The difference V′ − T′ and the element-wise product V ⊙ T are computed and concatenated with the visual vector V and the text vector T to strengthen the interaction between the textual and visual information. Finally, a multi-layer perceptron (MLP) and a Sigmoid activation layer predict the spatial and channel refinement combination weights of the N dynamic text-image fusion refinement blocks. Each combination weight is a vector whose length is the channel size of the visual features after the convolution layer in the corresponding dynamic text-image fusion refinement block: the combination weight of the 1st block is $\pi_{1}$, that of the nth block is $\pi_{n}$, and that of the Nth block is $\pi_{N}$.
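A hedged sketch of the detail predictor as described: the 4 × 4 image feature is mapped to a visual vector V, the text feature to a vector T, the two are concatenated and re-mapped to V′ and T′, the difference and element-wise product are concatenated back with V and T, and an MLP with a Sigmoid outputs the per-channel combination weight. All dimensions and the layer names are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DetailPredictor(nn.Module):
    """Predicts the per-channel combination weight pi from a 4x4 image feature and the text feature."""
    def __init__(self, img_channels, text_dim, embed_dim=128, out_channels=256):
        super().__init__()
        self.to_v = nn.Conv2d(img_channels, embed_dim, kernel_size=4)        # 4x4 feature -> visual vector V
        self.to_t = nn.Sequential(nn.Linear(text_dim, embed_dim), nn.ReLU()) # text feature -> text vector T
        self.fuse = nn.Sequential(nn.Linear(2 * embed_dim, 2 * embed_dim), nn.ReLU())    # joint mapping -> V', T'
        self.head = nn.Sequential(nn.Linear(4 * embed_dim, out_channels), nn.Sigmoid())  # -> pi in (0, 1)

    def forward(self, img_feat_4x4, e):            # img_feat_4x4: (B, C, 4, 4), e: (B, text_dim)
        v = self.to_v(img_feat_4x4).flatten(1)     # visual vector V
        t = self.to_t(e)                           # text vector T
        vt = self.fuse(torch.cat([v, t], dim=1))   # mapped concatenation
        v_p, t_p = vt.chunk(2, dim=1)              # mapped V' and T'
        fused = torch.cat([v_p - t_p, v * t, v, t], dim=1)   # difference, product, and the originals
        return self.head(fused)                    # combination weight, one value per visual channel
```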
The present embodiment provides an electronic device comprising: one or more processors; and a memory for storing one or more programs; when the one or more programs are executed by the one or more processors, the one or more processors implement the above text-to-image method based on lightweight dynamic refinement.
The present embodiment provides a computer-readable storage medium on which computer instructions are stored; when executed by a processor, the instructions implement the above text-to-image method based on lightweight dynamic refinement.
The above description is merely representative of embodiments of the invention and should not be construed as limiting its scope, nor as limiting the structure of the embodiments in any way. It will be apparent to those skilled in the art that various changes and modifications can be made without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (8)

1. A method for generating an image based on lightweight dynamic refinement of text, characterized in that input text is encoded into text features by a text encoder, and the text features encoded by the text encoder and a noise vector sampled from a Gaussian distribution are input into a generator; the noise vector is fed to a fully connected layer and reshaped to obtain initial image features, and the text features are fused into the initial image features through a plurality of dynamic text-image fusion refinement blocks to obtain refined image features; finally, the refined image features are converted into an image through a convolution layer and an activation layer; the dynamic text-image fusion refinement block consists of a text-image fusion block, a dynamic adaptive convolution block, a semantic decoder and a dynamic image feature refinement block; the dynamic text-image fusion refinement block has two inputs, namely a text feature and an image feature, the text feature being input into the text-image fusion block, the dynamic adaptive convolution block and the dynamic image feature refinement block respectively, and the image feature being input into the text-image fusion block, the semantic decoder and the dynamic image feature refinement block respectively; the image features output by the text-image fusion block are input into the dynamic image feature refinement block, the semantic information output by the semantic decoder is input into the dynamic adaptive convolution block, the attention map output by the dynamic adaptive convolution block is input into the dynamic image feature refinement block, and finally the dynamic image feature refinement block outputs the refined image features; and a cross-scale excitation module is used to connect the outputs of the dynamic text-image fusion refinement blocks at different scales.
2. The method for generating an image based on lightweight dynamic refinement of text according to claim 1, characterized in that a trained encoder and decoder are obtained by training on real images with a reconstruction loss function; the generated image and the real image are each encoded by the encoder in the discriminator to obtain image features, the text features and the image features are then concatenated to judge the authenticity of the image, and the generator and the discriminator are then trained together with an adversarial loss: the discrimination results drive the generator to improve the quality of its generated images, and in turn the improved image quality pushes the discriminator to improve, so the two improve together until the generator finally produces images that are both visually realistic and semantically consistent.
3. The method for generating an image based on lightweight dynamic refinement of text according to claim 2, characterized in that the reconstruction loss function is:

$$\mathcal{L}_{rec} = \mathbb{E}_{f \sim D_{encode}(x),\, x \sim I_{real}} \big[\, \| \mathcal{G}(f) - \mathcal{T}(x) \| \,\big]$$

where x represents the input image, f represents the image features extracted by the encoder, $\mathcal{G}(\cdot)$ represents the function that processes the image features extracted by the encoder, $\mathcal{T}(\cdot)$ represents the function that processes the real image, $D_{encode}(x)$ represents the image feature set and $I_{real}$ represents the set of real images; the expectation is taken over the image features f output by $D_{encode}(x)$ for real images x drawn from $I_{real}$;

the adversarial loss is expressed as:

$$\mathcal{L}_{D} = - \mathbb{E}_{x \sim P_{r}}\big[\min(0, -1 + D(x, e))\big] - \tfrac{1}{2}\,\mathbb{E}_{G(z) \sim P_{g}}\big[\min(0, -1 - D(G(z), e))\big] - \tfrac{1}{2}\,\mathbb{E}_{x \sim P_{mis}}\big[\min(0, -1 - D(x, e))\big] + k\,\mathbb{E}_{x \sim P_{r}}\big[\big(\|\nabla_{x} D(x, e)\| + \|\nabla_{e} D(x, e)\|\big)^{p}\big]$$

$$\mathcal{L}_{G} = - \mathbb{E}_{G(z) \sim P_{g}}\big[D(G(z), e)\big]$$

where $\mathcal{L}_{D}$ denotes the loss of the discriminator, $\mathcal{L}_{G}$ denotes the loss of the generator, $D(\cdot)$ denotes the output of the discriminator, z is the noise vector sampled from the Gaussian distribution, G(z) denotes the synthetic data output by the generator from the input noise vector z, and e is the text feature; $P_{g}$ denotes the synthetic data distribution, $P_{r}$ the real data distribution and $P_{mis}$ the mismatched data distribution; $\mathbb{E}_{x \sim P_{r}}$, $\mathbb{E}_{G(z) \sim P_{g}}$ and $\mathbb{E}_{x \sim P_{mis}}$ denote expectations taken under the real, synthetic and mismatched data distributions respectively; min represents the minimum function, $\nabla_{x}$ represents the gradient with respect to the variable x and $\nabla_{e}$ represents the gradient with respect to the text feature e; and k and p are two hyperparameters.
4. A text-to-image system based on lightweight dynamic refinement, consisting of a generator, a discriminator and a pre-trained text encoder, characterized in that the generator comprises a fully connected layer, six dynamic text-image fusion refinement blocks, a cross-scale excitation module, a convolution layer and an activation layer; a noise vector sampled from a Gaussian distribution is fed into the fully connected layer and reshaped to obtain initial image features, and the text features are fused into the initial image features through the dynamic text-image fusion refinement blocks to obtain refined image features; finally, the refined image features are converted into an image through the convolution layer and the activation layer;
the six dynamic text image fusion refinement blocks are sequentially a first dynamic text image fusion refinement block, a second dynamic text image fusion refinement block, a third dynamic text image fusion refinement block, a fourth dynamic text image fusion refinement block, a fifth dynamic text image fusion refinement block and a sixth dynamic text image fusion refinement block, the initial image features are sequentially fused with text features in different scales in the first dynamic text image fusion refinement block, the second dynamic text image fusion refinement block, the third dynamic text image fusion refinement block, the fourth dynamic text image fusion refinement block, the fifth dynamic text image fusion refinement block and the text features, and the image features output by the first dynamic text image fusion refinement block and the image features output by the fifth dynamic text image fusion refinement block are connected through a trans-scale excitation module and then input into the sixth dynamic text image fusion refinement block and fused with the text features; the image features output by the second dynamic text image fusion refinement block are connected with the image features output by the sixth dynamic text image fusion refinement block through a trans-scale excitation module to obtain refined image features;
the dynamic text image fusion refinement block consists of a text image fusion block, a dynamic self-adaptive convolution block, a semantic decoder and a dynamic image feature refinement block, wherein the dynamic text image fusion refinement block is provided with two inputs, namely a text feature and an image feature, the text feature is respectively input into the text image fusion block, the dynamic self-adaptive convolution block and the dynamic image feature refinement block, and the image feature is respectively input into the text image fusion block, the semantic decoder and the dynamic image feature refinement block; the image features output by the text image fusion block are input into the dynamic image feature refinement block, the semantic information output by the semantic decoder is input into the dynamic self-adaptive convolution block, the attention striving force output by the dynamic self-adaptive convolution block is input into the dynamic image feature refinement block, and finally the dynamic image feature refinement block outputs refined image features.
5. The text-to-image system based on lightweight dynamic refinement according to claim 4, characterized in that the text-image fusion block comprises several affine transformations and ReLU activation layers; the image features, after being upsampled by an upsampling layer, are affine-transformed in the affine transformation layer with text features obtained through a multi-layer perceptron, each affine transformation is followed by a ReLU activation layer and a convolution layer, and after several such affine transformations the image features are concatenated with the input image features to obtain the fused image features.
6. The text-to-image system based on lightweight dynamic refinement according to claim 4, characterized in that a plurality of dynamic convolution kernels are provided in the dynamic adaptive convolution block; each dynamic convolution kernel predicts an attention feature map from part of the text semantics, and the attention feature maps predicted by the individual kernels are stacked along the channel direction into a whole attention feature map, which is used to predict the spatial affine parameters of the dynamic image feature refinement block; the weights of the convolution kernels in the dynamic adaptive convolution block are adjusted according to the given text, scaling and shifting operations are applied to each candidate convolution kernel, and the scaling and shift parameters of the kernels are predicted by two multi-layer perceptrons.
7. The text-to-image system based on lightweight dynamic refinement according to claim 4, characterized in that the dynamic image feature refinement block consists of an upsampling layer, a convolution layer, a channel affine layer and a spatial affine layer; the dynamic image feature refinement block operates on the spatial and channel dimensions, the fused image features are input into the upsampling layer and then, through the convolution layer, into the channel affine layer and the spatial affine layer respectively, and the text features are fed into two different multi-layer perceptrons to predict the channel scaling parameter $\gamma_{c}$ and the channel shift parameter $\theta_{c}$ of each channel:

$$\gamma_{c} = MLP_{1}(e); \qquad \theta_{c} = MLP_{2}(e)$$

where e represents the text features and MLP represents a multi-layer perceptron;

the channel affine transformation is expressed as follows:

$$\text{C-Affine}(x_{i} \mid e) = \gamma_{c,i} \cdot x_{i} + \theta_{c,i}$$

where $x_{i}$ is the ith channel of the image feature;

for each pixel in the image feature, the dynamic adaptive convolution block and two convolution layers predict the spatial scaling parameter $\gamma_{s}$ and the spatial shift parameter $\theta_{s}$:

$$\gamma_{s} = Conv_{\gamma}(f); \qquad \theta_{s} = Conv_{\theta}(f)$$

where $Conv_{\gamma}$ represents one convolution layer, $Conv_{\theta}$ represents the other convolution layer, and f represents the attention feature map obtained by the dynamic adaptive convolution block;

the spatial affine transformation is expressed as follows:

$$\text{S-Affine}(x_{j} \mid e) = \gamma_{s,j} \cdot x_{j} + \theta_{s,j}$$

the dynamic image feature refinement block dynamically combines these two operations with the combination weight predicted by the detail predictor:

$$DR_{n}(x) = \pi_{n} \odot \text{S-Affine}(x \mid e) + (1 - \pi_{n}) \odot \text{C-Affine}(x \mid e)$$

where $DR_{n}$ denotes the operation of the dynamic image feature refinement block and $\pi_{n}$ denotes the spatial and channel refinement combination weight of the nth dynamic text-image fusion refinement block.
8. The text-to-image system based on lightweight dynamic refinement according to claim 4, characterized in that each dynamic image feature refinement block comprises a detail predictor; the image feature output by the preceding dynamic text-image fusion refinement block is input and mapped to a visual vector V through a convolution layer; the text feature is mapped to a text vector T by a multi-layer perceptron; the visual vector V and the text vector T are then concatenated along the channel dimension and passed through a multi-layer perceptron to obtain the mapped visual vector V′ and text vector T′, the difference V′ − T′ and the element-wise product V ⊙ T are computed and concatenated with the visual vector V and the text vector T, and finally a multi-layer perceptron and a Sigmoid activation layer predict the spatial and channel refinement combination weights of the N dynamic text-image fusion refinement blocks.
CN202311127041.2A 2023-09-04 2023-09-04 Method and system for generating image based on lightweight dynamic refinement text Active CN116912367B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311127041.2A CN116912367B (en) 2023-09-04 2023-09-04 Method and system for generating image based on lightweight dynamic refinement text

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311127041.2A CN116912367B (en) 2023-09-04 2023-09-04 Method and system for generating image based on lightweight dynamic refinement text

Publications (2)

Publication Number Publication Date
CN116912367A CN116912367A (en) 2023-10-20
CN116912367B true CN116912367B (en) 2023-12-19

Family

ID=88360478

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311127041.2A Active CN116912367B (en) 2023-09-04 2023-09-04 Method and system for generating image based on lightweight dynamic refinement text

Country Status (1)

Country Link
CN (1) CN116912367B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117688936B (en) * 2024-02-04 2024-04-19 Jiangxi Agricultural University Low-rank multi-mode fusion emotion analysis method for graphic fusion

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113361251A (en) * 2021-05-13 2021-09-07 山东师范大学 Text image generation method and system based on multi-stage generation countermeasure network
CN114387366A (en) * 2022-01-14 2022-04-22 湖南大学 Method for generating image by sensing combined space attention text

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220005235A1 (en) * 2020-07-06 2022-01-06 Ping An Technology (Shenzhen) Co., Ltd. Method and device for text-based image generation

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113361251A (en) * 2021-05-13 2021-09-07 山东师范大学 Text image generation method and system based on multi-stage generation countermeasure network
CN114387366A (en) * 2022-01-14 2022-04-22 湖南大学 Method for generating image by sensing combined space attention text

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Image Superresolution Using Densely Connected Residual Networks; Ran Wen; IEEE Signal Processing Letters; entire document *
Saliency detection based on a multi-feature fusion convolutional neural network; 赵应丁 et al.; 《计算机工程与科学》; entire document *
Text-to-image generation algorithm based on generative adversarial networks; 段亚茹 et al.; 《计算机系统应用》; entire document *

Also Published As

Publication number Publication date
CN116912367A (en) 2023-10-20

Similar Documents

Publication Publication Date Title
CN110706302B (en) System and method for synthesizing images by text
CN111047548B (en) Attitude transformation data processing method and device, computer equipment and storage medium
Turhan et al. Recent trends in deep generative models: a review
Wan et al. Crossing nets: Dual generative models with a shared latent space for hand pose estimation
CN111489287A (en) Image conversion method, image conversion device, computer equipment and storage medium
CN113673307A (en) Light-weight video motion recognition method
CN116912367B (en) Method and system for generating image based on lightweight dynamic refinement text
Li et al. Learning face image super-resolution through facial semantic attribute transformation and self-attentive structure enhancement
CN113344869A (en) Driving environment real-time stereo matching method and device based on candidate parallax
CN116071553A (en) Weak supervision semantic segmentation method and device based on naive VisionTransformer
US20220245910A1 (en) Mixture of volumetric primitives for efficient neural rendering
CN117499711A (en) Training method, device, equipment and storage medium of video generation model
CN117478978A (en) Method, system and equipment for generating movie video clips through texts
CN116912268A (en) Skin lesion image segmentation method, device, equipment and storage medium
CN116563399A (en) Image generation method based on diffusion model and generation countermeasure network
CN111339734A (en) Method for generating image based on text
US20230073175A1 (en) Method and system for processing image based on weighted multiple kernels
CN115222998B (en) Image classification method
Alhussain et al. Hardware-Efficient Deconvolution-Based GAN for Edge Computing
CN115565039A (en) Monocular input dynamic scene new view synthesis method based on self-attention mechanism
Mehmood et al. VTM-GAN: video-text matcher based generative adversarial network for generating videos from textual description
CN113269815A (en) Deep learning-based medical image registration method and terminal
RU2770132C1 (en) Image generators with conditionally independent pixel synthesis
Fang A Survey of Data-Driven 2D Diffusion Models for Generating Images from Text
Yan et al. Optimized single-image super-resolution reconstruction: A multimodal approach based on reversible guidance and cyclical knowledge distillation

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant