CN114937191A - Text image generation method and device and computer equipment - Google Patents

Text image generation method and device and computer equipment

Info

Publication number
CN114937191A
Authority
CN
China
Prior art keywords
image
image features
features
memory
inputting
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210620986.7A
Other languages
Chinese (zh)
Inventor
杨有
吴春燕
潘龙越
向若愚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing Normal University
Original Assignee
Chongqing Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing Normal University
Priority to CN202210620986.7A
Publication of CN114937191A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V 10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/048 Activation functions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/20 Image preprocessing
    • G06V 10/34 Smoothing or thinning of the pattern; Morphological operations; Skeletonisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/70 Labelling scene content, e.g. deriving syntactic or semantic representations

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • General Engineering & Computer Science (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Biophysics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Character Input (AREA)

Abstract

The present application relates to the field of word processing technologies, and in particular to a text-to-image generation method and apparatus, and a computer device. A plurality of initial image features and a plurality of word features are first input into a dynamic memory attention model to enhance the visual features of the initial image features; the first image features obtained from the initial image features are then input into a channel attention residual block model, and the output first refined image features are fused with the initial image features generated in the previous stage. The secondary memory method enhances the visual representation capability of the initial image features in the spatial dimension and further strengthens the semantic consistency between the word level and the feature map, while the channel attention in the channel attention residual block enhances the representation capability of the feature map in the channel dimension so as to better guide image generation. The image generation method therefore produces images that are not only of high quality but also of better semantic consistency.

Description

Text image generation method and device and computer equipment
Technical Field
The present application relates to the field of word processing technologies, and in particular, to a method and an apparatus for generating an image from a text, and a computer device.
Background
Generative adversarial networks (GANs) are widely applied in conditional image generation, i.e., generating a corresponding realistic image according to given conditions (text, sketch, semantic segmentation map, etc.) while ensuring the quality and diversity of the generated images. Text-to-image synthesis (T2I) is one of the more challenging of these tasks: it aims to generate a realistic image corresponding to a given text description. With the rise of cross-modal tasks, T2I has attracted researchers from both natural language processing and computer vision. The task also has important application value in fields such as artistic creation, image editing and advertisement design.
In the prior art, a dynamic memory network is introduced into the model to reduce the model's dependence on the image generated at the initial stage. The dynamic memory network mainly refines the details in the image through the semantic correlation between the word sequence and the initially generated image; although this relieves the dependence on the quality of the initial image during generation, it cannot make full use of the channel semantic information contained in the image. A Deep Attentional Multimodal Similarity Model (DAMSM) can improve the realism of the generated image and prompt the generator to better learn the key information in the text description so as to optimize the quality of the whole image, but it ignores the representation capability of the visual and channel features of the image, so the generated image lacks detail.
Therefore, in the process of generating an image from text, the generated image tends to lose detail and the channel feature information is not sufficiently utilized.
Disclosure of Invention
The present application aims to solve the technical problems in the prior art that the generated image lacks detail and the channel feature information is insufficiently utilized.
The present application provides a method for generating an image from text, comprising the following steps:
acquiring a plurality of text description sentences, and inputting the text description sentences into a text encoder for encoding to obtain a plurality of sentence characteristics and a plurality of word characteristics;
acquiring a plurality of random sampling noises, and inputting the plurality of random sampling noises and the plurality of sentence characteristics into an initial generator for fusion to obtain a plurality of initial image characteristics and a plurality of initial images;
inputting a plurality of initial image features and a plurality of word features into a dynamic memory attention model and outputting a plurality of first image features, wherein the dynamic memory attention model is used for enhancing visual features of the initial image features;
inputting the plurality of first image features into a channel attention residual block model, outputting a plurality of first refined image features, and performing convolution on the plurality of first refined image features to obtain a plurality of first refined images, wherein the channel attention residual block model is used for enhancing the channel features of the first image features;
taking a plurality of first refined image features as initial image features, inputting the initial image features and a plurality of word features into a dynamic memory attention model, and outputting a plurality of second refined image features, wherein the dynamic memory attention model is used for enhancing visual features of the first refined image features;
inputting a plurality of second refined image features into a channel attention residual block model, outputting a plurality of third refined image features, and performing convolution on the plurality of third refined image features to obtain a plurality of third refined images, wherein the channel attention residual block model is used for enhancing the channel features of the second refined image features.
Preferably, the step of acquiring a plurality of random sampling noises, and inputting the plurality of random sampling noises and the plurality of sentence features into the initial generator for fusion to obtain a plurality of initial image features and a plurality of initial images comprises:
respectively inputting the plurality of random sampling noises and the plurality of sentence features into a fully connected layer for preliminary feature fusion, and outputting a plurality of preliminary fusion image features;
respectively inputting the plurality of preliminary fusion image features into a first upsampling block to perform batch normalization processing on the plurality of preliminary fusion image features and output a plurality of second fusion image features, wherein the first upsampling block comprises at least three blocks arranged consecutively;
inputting the plurality of second fusion image features into a second upsampling block so as to perform example normalization processing on the plurality of second fusion image features to obtain a plurality of third fusion image features;
and outputting the plurality of third fusion image characteristics as initial image characteristics, and performing convolution operation on the plurality of initial image characteristics to obtain a plurality of initial images.
Preferably, the step of inputting a plurality of the preliminary fusion image features into a first upsampling block to perform batch normalization processing on the plurality of preliminary fusion image features and output a plurality of second fusion image features includes:
acquiring the batch value used for batch normalization;
acquiring a first height value H and a first width value W of each preliminary fusion image feature;
acquiring the scaling factor and the translation factor learned autonomously by the first upsampling block during training;
acquiring the feature value of the preliminary fusion image feature currently being normalized;
calculating the mean value of all the preliminary fusion image features according to the batch value, the first height value and the first width value, wherein the calculation formula is as follows:
μ_c = (1 / (N × H × W)) Σ_n Σ_h Σ_w x_nchw
wherein μ_c represents the mean of all the preliminary fusion image features, N represents the batch value, H represents the first height value, W represents the first width value, and x_nchw represents the feature value of the preliminary fusion image feature currently being normalized;
calculating the variance of all the preliminary fusion image features according to the batch value, the first height value and the first width value, wherein the calculation formula is as follows:
σ_c² = (1 / (N × H × W)) Σ_n Σ_h Σ_w (x_nchw − μ_c)²
wherein σ_c² represents the variance of all the preliminary fusion image features;
calculating the sample distribution of all the preliminary fusion image features after batch normalization according to the variance and the mean, wherein the calculation formula is as follows:
x' = (x_i − μ_c) / √(σ_c² + ε)
wherein x' represents the sample distribution of the x-th preliminary fusion image feature after batch normalization, x_i represents the i-th preliminary fusion image feature, and ε represents a non-zero constant;
generating each of the second fused image features from the sample distribution, wherein the generation function is:
BN(x)=γ×x'+β;
wherein BN(x) represents the x-th second fused image feature, γ represents the scaling factor, and β represents the translation factor.
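For illustration only, the batch normalization described above can be sketched in a few lines of PyTorch-style Python; the (N, C, H, W) tensor layout, the concrete sizes in the usage lines and the value of ε are assumptions rather than values fixed by this application.

```python
import torch

def batch_norm(x, gamma, beta, eps=1e-5):
    # x: preliminary fusion image features with shape (N, C, H, W), where N is the
    # batch value, H the first height value and W the first width value.
    mu_c = x.mean(dim=(0, 2, 3), keepdim=True)                   # per-channel mean over N, H, W
    var_c = x.var(dim=(0, 2, 3), unbiased=False, keepdim=True)   # per-channel variance
    x_prime = (x - mu_c) / torch.sqrt(var_c + eps)               # normalized sample distribution x'
    return gamma * x_prime + beta                                # BN(x) = gamma * x' + beta

# Usage with assumed sizes (8 samples, 64 channels, 4x4 feature maps).
x = torch.randn(8, 64, 4, 4)
gamma = torch.ones(1, 64, 1, 1)    # scaling factor learned during training
beta = torch.zeros(1, 64, 1, 1)    # translation factor learned during training
y = batch_norm(x, gamma, beta)
```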
Preferably, the step of inputting a plurality of the initial image features and a plurality of the word features into a dynamic memory attention model and outputting a plurality of first image features includes:
calculating a plurality of weight matrixes according to the plurality of initial image features and the plurality of word features;
storing the plurality of weight matrixes into a dynamic memory slot as a plurality of dynamic memories;
putting a plurality of dynamic memories in a dynamic memory groove into a secondary memory feature enhancement unit to refine image features in the dynamic memories to obtain a plurality of memory image features;
inputting a plurality of memory image features into a memory response gate to enhance insignificant areas in the plurality of memory image features to obtain a plurality of first image features.
Preferably, the step of putting the plurality of dynamic memories in the dynamic memory slot into a secondary memory feature enhancement unit to refine the image features in the plurality of dynamic memories to obtain a plurality of memory image features includes:
taking the dynamic memory and the initial image characteristics as the input of a secondary memory characteristic enhancement unit, and performing primary memory characteristic enhancement to obtain first memory image characteristics;
and performing secondary memory enhancement on the first memory image characteristics to obtain memory image characteristics.
Preferably, the step of obtaining a first memory image feature by using the dynamic memory and the initial image feature as the input of a secondary memory feature enhancing unit and performing primary memory feature enhancement includes:
performing convolution processing on the dynamic memory to obtain a key vector and a value vector;
changing the dimensionality of the initial image features according to the key vectors and the value vectors so as to enable the dimensionality of the initial image features to be the same as the dimensionality of the key vectors and the dimensionality of the value vectors;
performing dot product processing on the initial image features and the key vectors to obtain a spatial weight matrix;
and calculating the weighted sum of the value vector according to the spatial weight matrix to obtain the first memory image characteristic.
Preferably, the step of inputting a plurality of the first image features into a channel attention residual block model and outputting a plurality of first refined image features includes:
performing convolution and batch normalization processing on the plurality of first image features, and inputting the first image features into a gated linear activation function GLU to obtain first intermediate features;
performing convolution and batch normalization processing on the first intermediate features to obtain second intermediate features;
splicing the first intermediate feature and the second intermediate feature in the channel direction to obtain a third intermediate feature;
taking the third intermediate feature as the input of the whole attention unit, and performing global average pooling on the third intermediate feature to obtain a fourth intermediate feature;
performing convolution operation on the fourth intermediate feature, and inputting the convolution operation into an activation function ReLU to obtain a fifth intermediate feature;
performing convolution operation on the fifth intermediate feature, and inputting the fifth intermediate feature into a second activation function Sigmoid to obtain a channel semantic weight;
performing element-by-element multiplication processing on the channel semantic weight and the second intermediate feature to obtain a sixth intermediate feature;
fusing the sixth intermediate feature and the first intermediate feature to obtain a seventh intermediate feature;
and residual error connection is carried out on the seventh intermediate feature and the first image feature to obtain a first refined image feature.
The present application further provides a device for generating an image from text, comprising:
a first acquisition module, used for acquiring a plurality of text description sentences and inputting the text description sentences into a text encoder for encoding to obtain a plurality of sentence characteristics and a plurality of word characteristics;
the second acquisition module is used for acquiring a plurality of random sampling noises, and inputting the random sampling noises and the sentence characteristics into the initial generator for fusion to obtain a plurality of initial image characteristics and a plurality of initial images;
the dynamic memory attention model is used for inputting a plurality of initial image features and a plurality of word features into the dynamic memory attention model and outputting a plurality of first image features, wherein the dynamic memory attention module is used for enhancing the visual features of the initial image features;
the channel attention residual block model is used for inputting a plurality of first image features into the channel attention residual block model, outputting a plurality of first refined image features, and performing convolution on the plurality of first refined image features to obtain a plurality of first refined images, wherein the channel attention residual block model is used for enhancing the channel features of the first image features;
the dynamic memory attention model is further used for inputting a plurality of first refined image features serving as initial image features and a plurality of word features into the dynamic memory attention model and outputting a plurality of second refined image features, wherein the dynamic memory attention model is used for enhancing visual features of the first refined image features;
and the channel attention residual block model is further used for inputting a plurality of second refined image features into the channel attention residual block model, outputting a plurality of third refined image features, and performing convolution on the plurality of third refined image features to obtain a plurality of third refined images, wherein the channel attention residual block model is used for enhancing the channel features of the second refined image features.
The application also provides computer equipment which comprises a memory and a processor, wherein the memory stores a computer program, and the processor realizes the steps of the text image generation method when executing the computer program.
The present application also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the above-described text-to-image method.
The beneficial effects of the present application are as follows: a plurality of initial image features and a plurality of word features are first input into a dynamic memory attention model to enhance the visual features of the initial image features; the first image features obtained from the initial image features are then input into a channel attention residual block model, and the output first refined image features are fused with the initial image features generated in the previous stage, so that a secondary enhancement of the visual features of the image is realized. That is, the secondary memory method enhances the visual representation capability of the initial image features in the spatial dimension and further strengthens the semantic consistency between the word level and the feature map, while the channel attention in the channel attention residual block enhances the representation capability of the feature map in the channel dimension so as to better guide image generation. The method and the device can therefore generate not only high-quality images but also images with better semantic consistency.
Drawings
Fig. 1 is a schematic flowchart of a text-to-image method according to an embodiment of the present application.
Fig. 2 is a schematic structural diagram of a text image generating apparatus according to an embodiment of the present application.
Fig. 3 is a schematic diagram of an internal structure of a computer device according to an embodiment of the present application.
The implementation, functional features and advantages of the object of the present application will be further explained with reference to the embodiments, and with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
As shown in fig. 1 to fig. 3, the present application provides a method for generating an image from a text, including:
s1, acquiring a plurality of text description sentences, and inputting the text description sentences into a text encoder for encoding to obtain a plurality of sentence characteristics and a plurality of word characteristics;
s2, acquiring a plurality of random sampling noises, and inputting the plurality of random sampling noises and the plurality of sentence characteristics into an initial generator for fusion to obtain a plurality of initial image characteristics and a plurality of initial images;
s3, inputting the plurality of initial image features and the plurality of word features into a dynamic memory attention model and outputting a plurality of first image features, wherein the dynamic memory attention model is used for enhancing the visual features of the plurality of initial image features;
s4, inputting the first image features into a channel attention residual block model, outputting a plurality of first refined image features, and performing convolution on the first refined image features to obtain a plurality of first refined images, wherein the channel attention residual block model is used for enhancing the channel features of the first image features;
s5, taking a plurality of first refined image features as initial image features, inputting the initial image features and a plurality of word features into a dynamic memory attention model, and outputting a plurality of second refined image features, wherein the dynamic memory attention model is used for enhancing visual features of the first refined image features;
s6, inputting the second refined image features into a channel attention residual block model, outputting a plurality of third refined image features, and performing convolution on the third refined image features to obtain a plurality of third refined images, wherein the channel attention residual block model is used for enhancing the channel features of the second refined image features.
As described in the above steps S1-S6, the CUB-200-2011 and Oxford-102 datasets are used to train and test the feature-enhancement generative adversarial network model. In both CUB-200-2011 and Oxford-102, each picture corresponds to 10 text description sentences that contain the visual attributes of the target and a small amount of background information. The Oxford-102 dataset consists of 7034 training pictures covering 82 flower classes and 1155 test pictures from another 20 flower classes. The sentence features and word features of each text description sentence are obtained by inputting the sentence into a text encoder for encoding; random sampling noise is then acquired and fused with the sentence features to obtain the initial image features and the initial image. Because the quality of the initial image features is poor, the generated initial image seriously lacks detail, and the channel feature information cannot be fully utilized, a secondary memory enhancement scheme is provided for the initial image features: first, a plurality of initial image features and a plurality of word features are input into the dynamic memory attention model to enhance the visual features of the initial image features; then the first image features obtained from the initial image features are input into the channel attention residual block model, and the output first refined image features are fused with the initial image features generated in the previous stage, realizing a secondary enhancement of the visual features of the image. This improves the visual representation capability of the image and further reduces the dependence on the quality of the initial image generated in the initial stage. By introducing the channel attention residual block model, image semantics at different levels can be obtained (where the image semantics are used for information interaction between feature channels) and the relevance between similar channel semantics can be enhanced, so that the overall representation capability of the channel features is improved. In this way the lack of detail in the generated second refined image features is reduced and the channel feature information can be fully utilized.
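The two-round refinement flow of steps S1-S6 can be summarized in the following hedged sketch; the callable names (text_encoder, init_generator, dma, carb, to_image) are placeholders for the components described above, not an implementation taken from this application.

```python
def generate_images(text_encoder, init_generator, dma, carb, to_image, sentences, noise):
    """Hedged sketch of steps S1-S6: encode the text, build initial image features,
    then refine them twice with dynamic memory attention (spatial/visual enhancement)
    followed by a channel attention residual block (channel enhancement)."""
    sent_feat, word_feats = text_encoder(sentences)   # S1: sentence and word features
    init_feat = init_generator(noise, sent_feat)      # S2: initial image features
    images = [to_image(init_feat)]                    # initial image via convolution

    feat = init_feat
    for _ in range(2):                                # S3-S4 and S5-S6: two refinement rounds
        feat = dma(feat, word_feats)                  # enhance visual features in the spatial dimension
        feat = carb(feat)                             # enhance channel features
        images.append(to_image(feat))                 # refined image of this stage
    return images
```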
In one embodiment, the step S2 of acquiring a plurality of random sampling noises and inputting the plurality of random sampling noises and a plurality of sentence features into the initial generator for fusion to obtain a plurality of initial image features and a plurality of initial images includes:
s21, respectively inputting the plurality of random sampling noises and the plurality of sentence features into a fully connected layer for preliminary feature fusion, and outputting a plurality of preliminary fusion image features;
s22, respectively inputting the plurality of preliminary fusion image features into a first upsampling block to perform batch normalization processing on the plurality of preliminary fusion image features and output a plurality of second fusion image features, wherein the first upsampling block comprises at least three blocks arranged consecutively;
s23, inputting the second fusion image features into a second upsampling block so as to carry out example normalization processing on the second fusion image features to obtain third fusion image features;
and S24, outputting the third fusion image characteristics as initial image characteristics, and performing convolution operation on the initial image characteristics to obtain a plurality of initial images.
As described above in steps S21-S24, adaptive semantic instance normalization can be used to establish a semantic relationship between the generated second fused image features and the given text description and to ensure the individual independence of each generated second fused image feature; the sentence features can be used as affine transformation parameters of the image features after instance normalization, thereby promoting semantic consistency between text-image pairs. In the prior art, batch normalization (BN) is used in most classical text-to-image models to keep training stable, but batch normalization ignores the differences between individual samples. On this basis, in the feature-enhancement generative adversarial network model, the second upsampling block applies instance normalization, which can increase the style diversity of the third fused image features and reduce the influence of a small training batch on the generation result. Combining batch normalization with instance normalization can further improve the performance of the whole generation network.
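A possible arrangement of the up-sampling stack suggested by steps S21-S24 is sketched below, using batch normalization in the first three up-sampling blocks and instance normalization in the second up-sampling block; the channel sizes, the nearest-neighbour up-sampling, the ReLU activations and the tanh output are illustrative assumptions.

```python
import torch
import torch.nn as nn

def up_block(in_ch, out_ch, norm="batch"):
    # One up-sampling block: 2x nearest-neighbour upsample, 3x3 convolution,
    # then batch or instance normalization as described for the first/second blocks.
    norm_layer = nn.BatchNorm2d(out_ch) if norm == "batch" else nn.InstanceNorm2d(out_ch, affine=True)
    return nn.Sequential(
        nn.Upsample(scale_factor=2, mode="nearest"),
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        norm_layer,
        nn.ReLU(inplace=True),
    )

class InitialGenerator(nn.Module):
    """Hedged sketch: fully connected fusion of noise + sentence feature, three
    batch-normalized up-sampling blocks, one instance-normalized block, and a
    3x3 convolution that turns the initial image features into an initial image."""
    def __init__(self, noise_dim=100, sent_dim=256, base_ch=64):
        super().__init__()
        self.fc = nn.Linear(noise_dim + sent_dim, base_ch * 8 * 4 * 4)
        self.first_blocks = nn.Sequential(                         # batch-normalized blocks
            up_block(base_ch * 8, base_ch * 4, "batch"),
            up_block(base_ch * 4, base_ch * 2, "batch"),
            up_block(base_ch * 2, base_ch, "batch"),
        )
        self.second_block = up_block(base_ch, base_ch, "instance")  # instance-normalized block
        self.to_image = nn.Conv2d(base_ch, 3, kernel_size=3, padding=1)

    def forward(self, noise, sent_feat):
        h = self.fc(torch.cat([noise, sent_feat], dim=1)).view(noise.size(0), -1, 4, 4)
        h = self.second_block(self.first_blocks(h))                 # initial image features
        return h, torch.tanh(self.to_image(h))                      # features and initial image
```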
In one embodiment, the step S22 of inputting a plurality of the preliminary fused image features into a first upsampling block to perform batch normalization on the plurality of the preliminary fused image features and outputting a plurality of second fused image features includes:
s221, acquiring the batch value used for batch normalization;
s222, acquiring a first height value H and a first width value W of each preliminary fusion image feature;
s223, acquiring the scaling factor and the translation factor learned autonomously by the first upsampling block during training;
s224, acquiring the feature value of the preliminary fusion image feature currently being normalized;
s225, calculating the mean value of all the preliminary fusion image features according to the batch value, the first height value and the first width value, wherein the calculation formula is as follows:
μ_c = (1 / (N × H × W)) Σ_n Σ_h Σ_w x_nchw
wherein μ_c represents the mean of all the preliminary fusion image features, N represents the batch value, H represents the first height value, W represents the first width value, and x_nchw represents the feature value of the preliminary fusion image feature currently being normalized;
s226, calculating the variance of all the preliminary fusion image features according to the batch value, the first height value and the first width value, wherein the calculation formula is as follows:
σ_c² = (1 / (N × H × W)) Σ_n Σ_h Σ_w (x_nchw − μ_c)²
wherein σ_c² represents the variance of all the preliminary fusion image features;
s227, calculating the sample distribution of all the preliminary fusion image features after batch normalization according to the variance and the mean, wherein the calculation formula is as follows:
x' = (x_i − μ_c) / √(σ_c² + ε)
wherein x' represents the sample distribution of the x-th preliminary fusion image feature after batch normalization, x_i represents the i-th preliminary fusion image feature, and ε represents a non-zero constant;
s228, generating each second fusion image feature according to the sample distribution, wherein a generating function is as follows:
BN(x)=γ×x'+β;
wherein BN(x) represents the x-th second fused image feature, γ represents the scaling factor, and β represents the translation factor.
In one embodiment, the step S3 of inputting a plurality of the initial image features and a plurality of the word features into a dynamic memory attention model and outputting a plurality of first image features includes:
s31, calculating a plurality of weight matrixes according to the initial image characteristics and the word characteristics;
s32, storing the weight matrixes as a plurality of dynamic memories into a dynamic memory slot;
s33, putting the dynamic memories in the dynamic memory groove into a secondary memory feature enhancement unit to refine the image features in the dynamic memories to obtain a plurality of memory image features;
and S34, inputting the memory image characteristics into memory response gating so as to enhance the non-significant areas in the memory image characteristics and obtain a plurality of first image characteristics.
As described in the above steps S31-S34, in order to reduce the influence of the imperfect expression of the initial image features on the generated result, a weight matrix is calculated from the initial image features and the word features and stored in the dynamic memory slot; the dynamic memories are then put into the secondary memory feature enhancement unit to refine the image features in the plurality of dynamic memories and obtain a plurality of memory image features. The memory image features are input into the memory response gate to enhance the insignificant areas in the memory image features while preserving the useful portions, resulting in the first image features.
Specifically, the mathematical expression of the dynamic memory attention model is as follows:
I_R = RG(MoM(WG(I_{i-1})))
wherein I_R represents the memory image features, RG represents the memory response gating, MoM represents the secondary memory feature enhancement unit, WG represents the memory write gating, and I_{i-1} represents the initial image features generated in the previous stage;
g_i^w = σ(A·w_i + B·I^avg_{i-1})
wherein g_i^w represents the memory write gating, σ represents the activation function Sigmoid, w_i represents the feature of the i-th word, A denotes a 1x256 matrix, B denotes a 1x64 matrix, and I^avg_{i-1} represents the initial image features obtained after global average pooling. (The importance of each word feature to the initial image features can be calculated by the above formula, and I_{i-1} is then updated to obtain the weight matrix.)
I_{M0} = W_w(w_i)·g_i^w + W_m(I^avg_{i-1})·(1 − g_i^w)
wherein W_w represents a 1x1 convolution with a first weight value and W_m represents a 1x1 convolution with a second weight value, the first weight value being different from the second weight value;
I_{M2} = MoM(I_{M0})
wherein I_{M2} represents the memory image features and I_{M0} represents the dynamic memory;
g^r = σ(W·[I_{M2}; I_{i-1}] + b)
I_R = I_{M2}·g^r + I_{i-1}·(1 − g^r)
wherein g^r represents the memory response gating, W represents the weight in the linear function, and b represents the offset in the linear function, namely the intercept.
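The gating flow around the secondary memory feature enhancement unit could be sketched as follows; the 256-dimensional word features, the 64-dimensional pooled image features and the exact form of the gated combination are assumptions based on the formulas above, not definitions taken from this application.

```python
import torch
import torch.nn as nn

class MemoryGating(nn.Module):
    """Hedged sketch of memory write gating and memory response gating."""
    def __init__(self, word_dim=256, img_dim=64):
        super().__init__()
        self.A = nn.Linear(word_dim, 1)    # 1x256 matrix A applied to a word feature
        self.B = nn.Linear(img_dim, 1)     # 1x64 matrix B applied to the pooled image feature
        self.response = nn.Linear(2 * img_dim, img_dim)  # linear map with offset b for the response gate

    def write_gate(self, word_feat, img_feat):
        # g_w = sigmoid(A * w_i + B * GAP(I_{i-1})): importance of one word to the image.
        # word_feat: (N, word_dim); img_feat: (N, img_dim, H, W).
        pooled = img_feat.mean(dim=(2, 3))                     # global average pooling
        return torch.sigmoid(self.A(word_feat) + self.B(pooled))

    def response_gate(self, memory_feat, img_feat):
        # g_r = sigmoid(W [I_M2 ; I_{i-1}] + b); output fuses memory and previous-stage features.
        n, c, h, w = img_feat.shape
        m = memory_feat.view(n, c, -1).mean(dim=2)
        i = img_feat.view(n, c, -1).mean(dim=2)
        g = torch.sigmoid(self.response(torch.cat([m, i], dim=1))).view(n, c, 1, 1)
        return memory_feat * g + img_feat * (1 - g)
```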
in one embodiment, the step S33 of putting the plurality of dynamic memories in the dynamic memory slot into the secondary memory feature enhancing unit to refine the image features in the plurality of dynamic memories to obtain a plurality of memory image features includes:
s331, taking the dynamic memory and the initial image characteristics as the input of a secondary memory characteristic enhancement unit, and performing primary memory characteristic enhancement to obtain first memory image characteristics;
s332, performing secondary memory enhancement on the first memory image characteristic to obtain a memory image characteristic.
As described in the above steps S331-S332, by putting the dynamic memories into the secondary memory feature enhancement unit, a secondary visual feature enhancement of the memories is realized, which further reduces the visual deviation of the image and strengthens the semantic relation between words and the image. Specifically, two visual feature enhancement operations are performed on the dynamic memory: the dynamic memory and the initial image features generated in the previous stage are used as the input of the secondary memory feature enhancement unit, and the initial image features undergo a first memory feature enhancement through an attention operation to obtain the first memory image features, which supplement the local semantic information lacking in the initial image features; a secondary visual feature enhancement is then performed on the first memory image features to obtain the memory image features.
In one embodiment, the step S331 of taking the dynamic memory and the initial image feature as input of a secondary memory feature enhancing unit and performing a primary memory feature enhancement to obtain a first memory image feature includes:
s3311, carrying out convolution processing on the dynamic memory to obtain a key vector and a value vector;
s3312, changing the dimensionality of the initial image features according to the key vectors and the value vectors so as to enable the dimensionality of the initial image features to be the same as the dimensionality of the key vectors and the dimensionality of the value vectors;
s3313, the dot product processing is carried out on the initial image characteristics and the key vectors to obtain a space weight matrix;
s3314, calculating a weighted sum of the value vectors according to the spatial weight matrix to obtain the first memory image feature.
As described in the above steps S3311-S3314, the first visual feature enhancement finds the important memories associated with the initial image features so as to improve the visual semantic expression of the whole image. The dynamic memory is processed with convolution operations instead of the linear operations of traditional attention to obtain the key vectors and value vectors. The dimensions of the initial image features are transformed from c×h×w to (h×w)×c so that they match the dimensions of the key vectors and value vectors. To calculate the importance of each memory slot in the dynamic memory to the initial image features, a dot product is taken between the initial image features and the key vectors to obtain the spatial weight matrix. A weighted sum of the value vectors is then calculated with the spatial weight matrix to obtain the first memory image features. The mathematical expression is as follows:
Key = φ_K(I_{M0}), Value = φ_V(I_{M0})
w = σ(Key^T ⊙ I_{i-1})
I_{M1} = w ⊙ Value
wherein Key represents the key vector, Value represents the value vector, ⊙ indicates a dot-product operation, σ indicates the activation function Sigmoid, and φ_K and φ_V are 1x1 convolution operations used to obtain the key vector and the value vector for the first memory image feature.
The secondary visual feature enhancement further strengthens the correlation between the first memory image features and the initial image features, fuses the important semantic information between them, and discards irrelevant semantic information so that attention is paid to the most important features. The first memory image features and the initial image features are then re-fused in the channel direction, and the dimensions of the fused feature map are converted from (h×w)×c back to the original c×h×w to obtain the memory image features. The whole fusion process is mathematically expressed as follows:
I_{M2} = φ([I_{M1}; I_{i-1}])
wherein [;] denotes a splicing operation and φ denotes a 1×1 convolution operation.
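The primary memory feature enhancement and the channel-direction fusion described above could be sketched as follows; treating the dynamic memory as an (N, C, H, W) tensor and using 1x1 convolution layers for φ_K, φ_V and φ are illustrative assumptions.

```python
import torch
import torch.nn as nn

class PrimaryMemoryEnhancement(nn.Module):
    """Hedged sketch: 1x1 convolutions produce key/value vectors from the dynamic
    memory, a dot product with the reshaped image features yields a spatial weight
    matrix, and the weighted value vectors are fused back with the image features."""
    def __init__(self, channels):
        super().__init__()
        self.to_key = nn.Conv2d(channels, channels, kernel_size=1)    # phi_K
        self.to_value = nn.Conv2d(channels, channels, kernel_size=1)  # phi_V
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=1)  # phi for channel-wise fusion

    def forward(self, memory, img_feat):
        n, c, h, w = img_feat.shape
        key = self.to_key(memory).view(n, c, -1)              # (N, C, HW)
        value = self.to_value(memory).view(n, c, -1)           # (N, C, HW)
        img = img_feat.view(n, c, -1).permute(0, 2, 1)          # reshape c*h*w -> (HW, C)

        weights = torch.sigmoid(torch.bmm(img, key))            # spatial weight matrix, (N, HW, HW)
        i_m1 = torch.bmm(weights, value.permute(0, 2, 1))        # weighted sum of value vectors, (N, HW, C)
        i_m1 = i_m1.permute(0, 2, 1).view(n, c, h, w)            # back to the original dimensions

        # I_M2 = phi([I_M1 ; I_{i-1}]): splice along the channel direction and fuse.
        return self.fuse(torch.cat([i_m1, img_feat], dim=1))
```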
In one embodiment, the step of inputting a plurality of the first image features into a channel attention residual block model and outputting a plurality of first refined image features comprises:
s41, performing convolution and batch normalization processing on the plurality of first image features, and inputting the processed first image features into a gated linear activation function GLU to obtain first intermediate features;
s42, performing convolution and batch normalization processing on the first intermediate features to obtain second intermediate features;
s43, splicing the first intermediate feature and the second intermediate feature in the channel direction to obtain a third intermediate feature;
s44, taking the third intermediate feature as the input of the whole attention unit, and performing global average pooling on the third intermediate feature to obtain a fourth intermediate feature;
s45, performing convolution operation on the fourth intermediate feature, and inputting the convolution operation into an activation function ReLU to obtain a fifth intermediate feature;
s46, performing convolution operation on the fifth intermediate feature, and inputting the convolution operation into a second activation function Sigmoid to obtain a channel semantic weight;
s47, multiplying the channel semantic weight and the second intermediate feature element by element to obtain a sixth intermediate feature;
s48, fusing the sixth intermediate feature and the first intermediate feature to obtain a seventh intermediate feature;
and S49, performing residual error connection on the seventh intermediate feature and the first image feature to obtain a first refined image feature.
As described in the above steps S41-S49, in the whole channel attention residual block model the first intermediate feature and the second intermediate feature contain semantic information at different levels, and the channel map of each feature can be regarded as the response to a specific attribute. The channel semantic information of the first and second intermediate features is used to guide the feature expression in similar semantic regions of the second intermediate feature, so as to enhance the feature representation capability of the whole image. In the channel attention unit, the first intermediate feature and the second intermediate feature are spliced in the channel direction to obtain the third intermediate feature, which is used as the input of the whole attention unit. A global average pooling operation is then performed on the third intermediate feature so as to focus on its channel information. The mathematical definition is as follows:
h_avg = GAP([O_1; O_2])
wherein h_avg denotes the result of global average pooling over the third intermediate feature, O_1 denotes the first intermediate feature, O_2 denotes the second intermediate feature, [;] denotes the splicing operation, and GAP denotes global average pooling.
Then two consecutive convolution layers are used to carry out information interaction on the pooled feature in the channel direction, clarifying the correlation between channels and enhancing the important channel features. Finally, an activation function is used to judge the importance of each feature channel to the whole feature map, yielding the channel semantic weight. The mathematical expression is as follows:
w = σ(W_1(W_2(h_avg)))
wherein w denotes the channel semantic weight, σ denotes the Sigmoid function, and W_1 and W_2 both denote 1×1 convolution operations. The first of the two convolutions fuses and interacts the channel information between the third intermediate feature and the fourth intermediate feature; the second performs a channel-dimension transformation so that the number of channels matches that of the second intermediate feature for the subsequent channel feature enhancement. The obtained channel semantic weight is then multiplied element by element with the second intermediate feature; this operation both enhances the channel semantic expression within the similar semantic regions of the second intermediate feature and strengthens its channel features. To further enrich the semantic information in the second intermediate feature, the first intermediate feature is fused with the enhanced feature, and a residual connection is made with the initially input first image feature so as to keep the training of the whole model stable. The mathematical expression is as follows:
O_3 = w × O_2
h' = O_3 + O_1 + h
wherein O_3 denotes the sixth intermediate feature, h denotes the first image feature, and h' denotes the output first refined image feature.
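Steps S41-S49 and the formulas above could be realized along the lines of the following sketch; the reduction ratio in the two 1×1 convolutions and the concrete channel counts are assumptions.

```python
import torch
import torch.nn as nn

class ChannelAttentionResBlock(nn.Module):
    """Hedged sketch of the channel attention residual block: two conv/BN stages
    produce O1 and O2, a channel attention unit computes w from GAP([O1; O2]),
    and the output is h' = w*O2 + O1 + h (residual connection to the input h)."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        # Conv + BN + GLU (the GLU halves the doubled channels back to `channels`) -> O1.
        self.conv1 = nn.Sequential(
            nn.Conv2d(channels, channels * 2, 3, padding=1), nn.BatchNorm2d(channels * 2), nn.GLU(dim=1))
        # Conv + BN -> O2.
        self.conv2 = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.BatchNorm2d(channels))
        # Channel attention unit: GAP -> 1x1 conv -> ReLU -> 1x1 conv -> Sigmoid.
        self.attn = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels * 2, channels * 2 // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels * 2 // reduction, channels, 1), nn.Sigmoid())

    def forward(self, h):
        o1 = self.conv1(h)                                   # first intermediate feature
        o2 = self.conv2(o1)                                   # second intermediate feature
        w = self.attn(torch.cat([o1, o2], dim=1))             # channel semantic weight from [O1; O2]
        o3 = w * o2                                           # element-wise channel re-weighting
        return o3 + o1 + h                                    # fuse with O1 and residual-connect to h
```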
In addition, the loss of the entire feature-enhancement generative adversarial network model is the linear sum of the generator loss, the conditional enhancement loss and the DAMSM loss, as follows:
L = Σ_i L_{G_i} + λ_1·L_CA + λ_2·L_DAMSM
wherein L_{G_i} represents the generator loss, L_CA represents the conditional enhancement loss, and L_DAMSM represents the deep attentional multimodal similarity model loss. CA denotes mapping the input sentence vector to an independent Gaussian distribution and resampling the sentence vector from that distribution, which augments the training data and avoids overfitting; in this process the similarity between the two distributions is calculated and taken as the loss. The DAMSM loss is the text-image matching loss computed at the word level, which discriminates how well the generated image semantically matches the text description. The overall network loss function is defined as follows:
L_{G_i} = −(1/2) E_{x∼p_{G_i}}[log D_i(x)] − (1/2) E_{x∼p_{G_i}}[log D_i(x, s)]   (1)
L_CA = D_KL(N(μ(s), σ²(s)) ‖ N(0, I))   (2)
wherein λ_1 and λ_2 are hyper-parameters and are both set to 1. The first half of expression (1) only emphasizes the quality of the generated image, while the second half judges the generated image according to its correlation with the text description. μ(s) and σ²(s) in formula (2) represent the mean and variance of the sentence vector, respectively. Meanwhile, all discriminator loss functions are defined as follows:
L_{D_i} = −(1/2) E_{x∼p_data}[log D_i(x)] − (1/2) E_{x̂∼p_{G_i}}[log(1 − D_i(x̂))] − (1/2) E_{x∼p_data}[log D_i(x, s)] − (1/2) E_{x̂∼p_{G_i}}[log(1 − D_i(x̂, s))]
wherein the unconditional loss is used to distinguish the real image from the generated image, and the conditional loss further judges the degree of matching between the generated image and the text description. The purpose of the discriminator is to ensure that the generated image is consistent with the text description while guaranteeing the quality of the generated image.
As shown in fig. 3, the present application also provides a computer device, which may be a server and whose internal structure may be as shown in fig. 3. The computer device comprises a processor, a memory, a display screen, an input device, a network interface and a database which are connected through a system bus. The processor of the computer device is used to provide computation and control capabilities. The memory of the computer device comprises a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program and a database. The internal memory provides an environment for the operation of the operating system and the computer program in the non-volatile storage medium. The database of the computer device is used to store all data required by the text-to-image generation method. The network interface of the computer device is used to communicate with an external terminal through a network connection. The computer program is executed by the processor to implement the text-to-image generation method.
The present application further provides an apparatus for generating an image from a text, comprising:
the system comprises a first acquisition module 1, a second acquisition module and a third acquisition module, wherein the first acquisition module is used for acquiring a plurality of text description sentences and inputting the text description sentences into a text encoder for encoding to obtain a plurality of sentence characteristics and a plurality of word characteristics;
the second obtaining module 2 is configured to obtain a plurality of random sampling noises, and input the plurality of random sampling noises and the plurality of sentence features into the initial generator to be fused, so as to obtain a plurality of initial image features and a plurality of initial images;
a dynamic memory attention model 3 for inputting a plurality of initial image features and a plurality of word features into the dynamic memory attention model and outputting a plurality of first image features, wherein the dynamic memory attention module is used for enhancing visual features of the initial image features;
a channel attention residual block model 4, configured to input the plurality of first image features into a channel attention residual block model, output a plurality of first refined image features, and perform convolution on the plurality of first refined image features to obtain a plurality of first refined images, where the channel attention residual block model is used to enhance the channel features of the first image features;
the dynamic memory attention model is further used for inputting a plurality of first refined image features as initial image features and a plurality of word features into the dynamic memory attention model and outputting a plurality of second refined image features, wherein the dynamic memory attention module is used for enhancing visual features of the first refined image features;
and the channel attention residual block model is further used for inputting a plurality of second refined image features into the channel attention residual block model, outputting a plurality of third refined image features, and performing convolution on the plurality of third refined image features to obtain a plurality of third refined images, wherein the channel attention residual block model is used for enhancing the channel features of the second refined image features.
In one embodiment, the second obtaining module 2 includes:
the preliminary fusion unit is used for respectively inputting the random sampling noises and the sentence characteristics into a full-connection layer to perform preliminary feature fusion and outputting a plurality of first fusion image characteristics;
the first upsampling block is used for inputting the plurality of preliminary fusion images into the first upsampling block respectively so as to carry out batch normalization processing on the plurality of preliminary fusion image characteristics and output a plurality of second fusion image characteristics, wherein the first upsampling block at least comprises three blocks which are continuously arranged;
the second upsampling block is used for inputting the plurality of second fused image features into the second upsampling block so as to perform example normalization processing on the plurality of second fused image features to obtain a plurality of third fused image features;
and the convolution unit is used for outputting the plurality of third fusion image characteristics as initial image characteristics and performing convolution operation on the plurality of initial image characteristics to obtain a plurality of initial images.
In one embodiment, a first upsampling block, comprising:
the first acquisition unit is used for acquiring batch values of batch normalization;
a second obtaining unit, configured to obtain a first height value H and a first width value of each of the preliminary fusion image features;
a third obtaining unit, configured to obtain a scaling factor and a translation factor that are obtained by the first upsampling block through autonomous learning during training;
the fourth acquisition unit is used for acquiring the characteristic value of the primary fusion image characteristic currently subjected to normalization processing;
a first calculating unit, configured to calculate the mean value of all the preliminary fusion image features according to the batch value, the first height value and the first width value, wherein the calculation formula is:
μ_c = (1 / (N × H × W)) Σ_n Σ_h Σ_w x_nchw
wherein μ_c represents the mean of all the preliminary fusion image features, N represents the batch value, H represents the first height value, W represents the first width value, and x_nchw represents the feature value of the preliminary fusion image feature currently being normalized;
a second calculating unit, configured to calculate the variance of all the preliminary fusion image features according to the batch value, the first height value and the first width value, wherein the calculation formula is:
σ_c² = (1 / (N × H × W)) Σ_n Σ_h Σ_w (x_nchw − μ_c)²
wherein σ_c² represents the variance of all the preliminary fusion image features;
a third calculating unit, configured to calculate the sample distributions of all the preliminary fusion image features after batch normalization according to the variance and the mean, wherein the calculation formula is:
x' = (x_i − μ_c) / √(σ_c² + ε)
wherein x' represents the sample distribution of the x-th preliminary fusion image feature after batch normalization, x_i represents the i-th preliminary fusion image feature, and ε represents a non-zero constant;
a generating unit, configured to generate each of the second fused image features according to the sample distribution, where a generating function is:
BN(x)=γ×x'+β;
wherein BN(x) represents the x-th second fused image feature, γ represents the scaling factor, and β represents the translation factor.
In one embodiment, the dynamic memory attention model 3 includes:
a weight matrix unit, configured to calculate a plurality of weight matrices according to the plurality of initial image features and the plurality of word features;
the storing unit is used for storing the weight matrixes into a dynamic memory slot as a plurality of dynamic memories;
the secondary memory characteristic enhancement unit is used for putting a plurality of dynamic memories in the dynamic memory groove into the secondary memory characteristic enhancement unit so as to refine the image characteristics in the dynamic memories to obtain a plurality of memory image characteristics;
and the memory response gating is used for inputting a plurality of memory image characteristics into the memory response gating so as to enhance the unnoticeable areas in the plurality of memory image characteristics and obtain a plurality of first image characteristics.
In one embodiment, the secondary memory feature enhancement unit includes:
the first memory feature enhancement unit is used for taking the dynamic memory and the initial image features as the input of the second memory feature enhancement unit and carrying out first memory feature enhancement to obtain first memory image features;
and the secondary memory enhancement unit is used for performing secondary memory enhancement on the first memory image characteristics to obtain the memory image characteristics.
In one embodiment, the first-time memory feature enhancing unit includes:
the convolution processing unit is used for carrying out convolution processing on the dynamic memory to obtain a key vector and a value vector;
the dimension changing unit is used for changing the dimension of the initial image feature according to the key vector and the value vector so as to enable the dimension of the initial image feature to be the same as the dimension of the key vector and the dimension of the value vector;
the dot product processing unit is used for performing dot product processing on the initial image features and the key vectors to obtain a spatial weight matrix;
and the weighted sum unit is used for calculating the weighted sum of the value vectors according to the spatial weight matrix to obtain the first memory image feature.
In one embodiment, the channel attention residual block model comprises:
the first intermediate feature unit is used for performing convolution and batch normalization processing on the plurality of first image features and inputting the result into a gated linear unit (GLU) activation function to obtain first intermediate features;
the second intermediate feature unit is used for performing convolution and batch normalization processing on the first intermediate features to obtain second intermediate features;
the splicing unit is used for splicing the first intermediate feature and the second intermediate feature in the channel direction to obtain a third intermediate feature;
a global average pooling unit, which takes the third intermediate feature as the input of the whole attention unit and performs global average pooling on the third intermediate feature to obtain a fourth intermediate feature;
the fifth intermediate feature unit is used for performing a convolution operation on the fourth intermediate feature and inputting the result into a ReLU activation function to obtain a fifth intermediate feature;
the channel semantic weight unit is used for performing a convolution operation on the fifth intermediate feature and inputting the result into a Sigmoid activation function to obtain a channel semantic weight;
the element-by-element multiplication processing unit is used for carrying out element-by-element multiplication processing on the channel semantic weight and the second intermediate feature to obtain a sixth intermediate feature;
a seventh intermediate feature unit, configured to fuse the sixth intermediate feature and the first intermediate feature to obtain a seventh intermediate feature;
and the residual error connecting unit is used for performing residual error connection on the seventh intermediate feature and the first image feature to obtain a first refined image feature.
Those skilled in the art will appreciate that the architecture shown in fig. 3 is only a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects may be applied.
An embodiment of the present application further provides a computer-readable storage medium on which a computer program is stored; when executed by a processor, the computer program implements any one of the above methods for generating an image from a text.
It will be understood by those skilled in the art that all or part of the processes of the methods in the above embodiments may be implemented by instructing the relevant hardware through a computer program, which may be stored in a non-volatile computer-readable storage medium; when executed, the program may include the processes of the above method embodiments. Any reference to memory, storage, a database or another medium provided herein and used in the examples may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link (Synchlink) DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, apparatus, article or method that comprises a list of elements not only includes those elements but may also include other elements not expressly listed, or elements inherent to such a process, apparatus, article or method. Without further limitation, an element defined by the phrase "comprising a/an ..." does not exclude the presence of other like elements in a process, apparatus, article or method that includes the element.
The above description is only a preferred embodiment of the present application, and not intended to limit the scope of the present application, and all modifications of equivalent structures and equivalent processes, which are made by the contents of the specification and the drawings of the present application, or which are directly or indirectly applied to other related technical fields, are also included in the scope of the present application.

Claims (10)

1. A method for generating an image from a text, comprising:
acquiring a plurality of text description sentences, and inputting the text description sentences into a text encoder for encoding to obtain a plurality of sentence characteristics and a plurality of word characteristics;
acquiring a plurality of random sampling noises, and inputting the plurality of random sampling noises and the plurality of sentence characteristics into an initial generator for fusion to obtain a plurality of initial image characteristics and a plurality of initial images;
inputting a plurality of initial image features and a plurality of word features into a dynamic memory attention model and outputting a plurality of first image features, wherein the dynamic memory attention model is used for enhancing visual features of the initial image features;
inputting a plurality of first image features into a channel attention residual block model, outputting a plurality of first refined image features, and performing convolution on the plurality of first refined image features to obtain a plurality of first refined images, wherein the channel attention residual block model is used for enhancing the channel features of the first image features;
taking a plurality of first refined image features as initial image features, inputting the initial image features and a plurality of word features into a dynamic memory attention model, and outputting a plurality of second refined image features, wherein the dynamic memory attention model is used for enhancing the visual features of the first refined image features;
inputting a plurality of second refined image features into a channel attention residual block model, outputting a plurality of third refined image features, and performing convolution on the plurality of third refined image features to obtain a plurality of third refined images, wherein the channel attention residual block model is used for enhancing the channel features of the second refined image features.
2. The method of claim 1, wherein the step of obtaining a plurality of randomly sampled noises and inputting the plurality of randomly sampled noises and the plurality of sentence features into an initial generator for fusion to obtain a plurality of initial image features and a plurality of initial images comprises:
respectively inputting a plurality of random sampling noises and a plurality of sentence features into a fully connected layer for preliminary feature fusion, and outputting a plurality of preliminary fusion image features;
respectively inputting the plurality of preliminary fusion image features into a first upsampling block so as to perform batch normalization processing on the plurality of preliminary fusion image features and output a plurality of second fusion image features, wherein the first upsampling block comprises at least three consecutively arranged blocks;
inputting the plurality of second fusion image features into a second upsampling block so as to perform example normalization processing on the plurality of second fusion image features to obtain a plurality of third fusion image features;
and outputting the plurality of third fusion image features as initial image features, and performing convolution operation on the plurality of initial image features to obtain a plurality of initial images.
3. The method for generating text-based images according to claim 2, wherein the step of inputting a plurality of the preliminary fusion image features into a first upsampling block to perform batch normalization on the plurality of preliminary fusion image features and outputting a plurality of second fusion image features comprises:
obtaining a batch value for batch normalization;
acquiring a first height value and a first width value of each preliminary fusion image feature;
obtaining a scaling factor and a translation factor which are obtained by the first up-sampling block through autonomous learning during training;
acquiring a characteristic value of the characteristic of the primary fusion image subjected to normalization processing;
calculating the mean value of all the preliminary fusion image features according to the batch value, the first height value and the first width value, wherein the calculation formula is as follows:
μ_c = (1 / (N × H × W)) × Σ_{n=1}^{N} Σ_{h=1}^{H} Σ_{w=1}^{W} x_{nchw}
wherein μ_c represents the mean of all the preliminary fusion image features, N represents the batch value, H represents the first height value, W represents the first width value, and x_{nchw} represents the characteristic value of the preliminary fusion image feature currently subjected to the normalization processing;
calculating the variance of all the preliminary fusion image features according to the batch value, the first height value and the first width value, wherein the calculation formula is as follows:
σ_c² = (1 / (N × H × W)) × Σ_{n=1}^{N} Σ_{h=1}^{H} Σ_{w=1}^{W} (x_{nchw} − μ_c)²
wherein σ_c² represents the variance of all the preliminary fusion image features;
calculating the sample distribution of all the preliminary fusion image features after batch normalization according to the variance and the mean, wherein the calculation formula is as follows:
x' = (x_i − μ_c) / √(σ_c² + ε)
wherein x' represents the sample distribution of the i-th preliminary fusion image feature after batch normalization, x_i represents the i-th preliminary fusion image feature, and ε represents a small non-zero constant that keeps the denominator from vanishing;
generating each of the second fused image features from the sample distribution, wherein the generation function is:
BN(x)=γ×x'+β;
wherein bn (x) represents the xth second fused image feature, γ represents the scaling factor, and β represents the translation factor.
4. The method of generating a text image according to claim 1, wherein the step of inputting a plurality of the initial image features and a plurality of the word features into a dynamic memory attention model and outputting a plurality of first image features comprises:
calculating a plurality of weight matrixes according to the plurality of initial image features and the plurality of word features;
storing the plurality of weight matrixes into a dynamic memory slot as a plurality of dynamic memories;
putting the plurality of dynamic memories in the dynamic memory slot into a secondary memory feature enhancement unit to refine the image features in the dynamic memories and obtain a plurality of memory image features;
inputting a plurality of memory image features into a memory response gate to enhance insignificant areas in the plurality of memory image features to obtain a plurality of first image features.
5. The method of claim 4, wherein the step of placing the plurality of dynamic memories in the dynamic memory slot into a secondary memory feature enhancement unit to refine the image features in the plurality of dynamic memories to obtain a plurality of memory image features comprises:
taking the dynamic memory and the initial image characteristics as the input of a secondary memory characteristic enhancement unit, and performing primary memory characteristic enhancement to obtain first memory image characteristics;
and performing secondary memory enhancement on the first memory image characteristics to obtain memory image characteristics.
6. The method of claim 5, wherein the step of obtaining a first memory image feature by using the dynamic memory and the initial image feature as input of a secondary memory feature enhancement unit and performing a primary memory feature enhancement comprises:
performing convolution processing on the dynamic memory to obtain a key vector and a value vector;
changing the dimensionality of the initial image features according to the key vectors and the value vectors so as to enable the dimensionality of the initial image features to be the same as the dimensionality of the key vectors and the dimensionality of the value vectors;
performing dot product processing on the initial image features and the key vectors to obtain a spatial weight matrix;
and calculating the weighted sum of the value vector according to the spatial weight matrix to obtain the first memory image characteristic.
7. The method of generating a text image according to claim 1, wherein the step of inputting a plurality of the first image features into a channel attention residual block model and outputting a plurality of first refined image features comprises:
performing convolution and batch normalization processing on the plurality of first image features, and inputting the result into a gated linear unit (GLU) activation function to obtain first intermediate features;
performing convolution and batch normalization processing on the first intermediate features to obtain second intermediate features;
splicing the first intermediate feature and the second intermediate feature in the channel direction to obtain a third intermediate feature;
taking the third intermediate feature as the input of the whole attention unit, and performing global average pooling on the third intermediate feature to obtain a fourth intermediate feature;
performing a convolution operation on the fourth intermediate feature, and inputting the result into a ReLU activation function to obtain a fifth intermediate feature;
performing a convolution operation on the fifth intermediate feature, and inputting the result into a Sigmoid activation function to obtain a channel semantic weight;
performing element-by-element multiplication processing on the channel semantic weight and the second intermediate feature to obtain a sixth intermediate feature;
fusing the sixth intermediate feature and the first intermediate feature to obtain a seventh intermediate feature;
and residual error connection is carried out on the seventh intermediate feature and the first image feature to obtain a first refined image feature.
8. A text-generating image apparatus, comprising:
the system comprises a first acquisition module, a second acquisition module and a third acquisition module, wherein the first acquisition module is used for acquiring a plurality of text description sentences and inputting the text description sentences into a text encoder for encoding to obtain a plurality of sentence characteristics and a plurality of word characteristics;
the second acquisition module is used for acquiring a plurality of random sampling noises, and inputting the random sampling noises and the sentence characteristics into the initial generator for fusion to obtain a plurality of initial image characteristics and a plurality of initial images;
the dynamic memory attention model is used for inputting a plurality of initial image features and a plurality of word features into the dynamic memory attention model and outputting a plurality of first image features, wherein the dynamic memory attention model is used for enhancing the visual features of the initial image features;
the channel attention residual block model is used for inputting a plurality of first image features into the channel attention residual block model, outputting a plurality of first refined image features, and performing convolution on the plurality of first refined image features to obtain a plurality of first refined images, wherein the channel attention residual block model is used for enhancing the channel features of the first image features;
the dynamic memory attention model is further used for inputting a plurality of first refined image features serving as initial image features and a plurality of word features into the dynamic memory attention model and outputting a plurality of second refined image features, wherein the dynamic memory attention model is used for enhancing visual features of the first refined image features;
and the channel attention residual block model is further used for inputting a plurality of second refined image features into the channel attention residual block model, outputting a plurality of third refined image features, and performing convolution on the plurality of third refined image features to obtain a plurality of third refined images, wherein the channel attention residual block model is used for enhancing the channel features of the second refined image features.
9. A computer device comprising a memory and a processor, the memory storing a computer program, wherein the processor, when executing the computer program, implements the steps of the method for generating an image from a text of any one of claims 1 to 7.
10. A computer-readable storage medium, on which a computer program is stored, which computer program, when being executed by a processor, carries out the steps of a method for generating an image from a text as claimed in any one of claims 1 to 7.
CN202210620986.7A 2022-06-01 2022-06-01 Text image generation method and device and computer equipment Pending CN114937191A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210620986.7A CN114937191A (en) 2022-06-01 2022-06-01 Text image generation method and device and computer equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210620986.7A CN114937191A (en) 2022-06-01 2022-06-01 Text image generation method and device and computer equipment

Publications (1)

Publication Number Publication Date
CN114937191A true CN114937191A (en) 2022-08-23

Family

ID=82866899

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210620986.7A Pending CN114937191A (en) 2022-06-01 2022-06-01 Text image generation method and device and computer equipment

Country Status (1)

Country Link
CN (1) CN114937191A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116883528A (en) * 2023-06-12 2023-10-13 阿里巴巴(中国)有限公司 Image generation method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination