CN117788629A - Image generation method, device and storage medium with style personalization - Google Patents


Info

Publication number: CN117788629A (application CN202410217621.9A); granted as CN117788629B
Authority: CN; other languages: Chinese (zh)
Prior art keywords: image, style, noise, prediction, network
Legal status: Granted; Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Inventors: Xu Xiaolong (徐小龙), Xu Yifei (许逸非)
Assignee (current and original): Nanjing University of Posts and Telecommunications
Application filed by Nanjing University of Posts and Telecommunications; priority to CN202410217621.9A

Landscapes

  • Image Processing (AREA)
Abstract

The invention discloses an image generation method, device and storage medium with style personalization in the technical field of artificial intelligence, aiming at the technical problems of quality and accuracy in image generation. The method comprises the following steps: selecting a style image and inputting it into a pre-constructed VGG19_f3 network model to obtain feature maps of different sizes; calculating the Gram matrices of the feature maps to extract their style information; inputting the text encoding obtained by a text encoder, a noise image, and the style information into a pre-constructed noise prediction network with a style guidance module for noise prediction to obtain predicted noise; repeatedly denoising the noise image to obtain a latent space image; and decoding the latent space image with an image decoder to obtain the finally generated style personalized image. The method and device enable the model to generate style personalized images end to end, give the model stronger style personalization capability, and at the same time guarantee the quality and accuracy of the generated images.

Description

Image generation method, device and storage medium with style personalization
Technical Field
The invention relates to an image generation method, device and storage medium with style personalization, belonging to the technical field of artificial intelligence.
Background
Intelligent illustration generation for electronic publications is essentially a process of generating images from text. Existing large-scale image generation models can produce high-quality and diverse images from natural-language text prompts. However, to attract more readers, illustration generation for electronic publications requires the model to generate illustrations in a personalized style according to the user's personal style preference. Against this background, existing large-scale image generation models lack style personalization capability and cannot meet the application requirements.
For the image generation problem, existing research is usually built on pre-trained large image generation models to fully exploit their generation capability. Most such solutions must train the model in several stages or modules, or fine-tune the pre-trained large model in several steps, so their complexity is high compared with an end-to-end model; moreover, the model weights must be re-tuned for every new style, which is time- and memory-consuming. The scalability and practicality of these methods are therefore greatly limited.
Disclosure of Invention
The invention aims to overcome the above defects in the prior art by providing an image generation method, device and storage medium with style personalization, which guide the image generation process with style features and do not require retraining the model for each new style; at the same time, the constructed noise prediction network with a style guidance module gives the model stronger style personalization capability while guaranteeing the quality and accuracy of the generated images.
In order to achieve the above purpose, the invention is realized by adopting the following technical scheme:
in a first aspect, the present invention provides an image generation method with style personalization, including:
selecting a style image, and inputting the style image into a pre-constructed VGG19_f3 network model to obtain feature images with different sizes;
calculating a Gram matrix of the feature map to extract style information of the feature map;
inputting the text encoding obtained by the text encoder, a randomly sampled noise image, and the style information into a pre-constructed noise prediction network with a style guidance module for noise prediction to obtain predicted noise;
carrying out repeated denoising operation on the noise image by utilizing the predicted noise to obtain a latent space image;
and decoding the latent space image through an image decoder in a large image generation model to obtain a finally generated style personalized image.
With reference to the first aspect, further, the obtaining feature maps with different sizes includes:
setting the pixel size of the style image to a set size;
the first three downsampling blocks of the VGG19 network model are utilized to form a VGG19_f3 network model, the style image is input into the VGG19_f3 network model, and a first feature map with a large designated size is outputSecond characteristic map->And third characteristic diagram->
With reference to the first aspect, the extracting style information of the feature map includes:
calculating the first feature mapSecond characteristic map->And third characteristic diagram->Is used for obtaining the first style characteristic +.>Second windLattice characterization->And third style characteristics->
Characterizing the first styleSecond style characteristics->After the maximum pooling operation, the third style characteristic is +.>And adding to obtain style information S.
With reference to the first aspect, the Gram matrix of a feature map is computed as follows:

$$G_{ij} = \frac{1}{HW} \sum_{k=1}^{HW} F_{i,k}\, F_{j,k}$$

wherein $G_{ij}$ is the element in row $i$, column $j$ of the Gram matrix of the feature map $F$; $F_{i,k}$ is the $k$-th element of channel $i$ of $F$; $F_{j,k}$ is the $k$-th element of channel $j$ of $F$; $H$ is the height of the feature map $F$; $W$ is the width of the feature map $F$.
With reference to the first aspect, the construction process of the noise prediction network with the style guidance module is as follows:
and copying a downsampling block and an intermediate block of a noise prediction network in the large image generation model to obtain a downsampling network, wherein the downsampling network and the noise prediction network in the large image generation model form the noise prediction network with the style guiding module.
With reference to the first aspect, the obtaining the prediction noise includes:
inputting the style information into the downsampling network to obtain first style information $s_1$, second style information $s_2$, third style information $s_3$, fourth style information $s_4$ and fifth style information $s_5$;
inputting the text encoding $c$ obtained by the text encoder and the randomly sampled noise image $x_T$ sequentially through the downsampling blocks and the intermediate block of the noise prediction network to obtain a first prediction noise;
adding the fifth style information $s_5$, fourth style information $s_4$, third style information $s_3$, second style information $s_2$ and first style information $s_1$ in sequence to the first prediction noise to output the final predicted noise $\epsilon_\theta$.
With reference to the first aspect, the obtaining a latent space image includes:
performing a first denoising operation on the noise image using the predicted noise to obtain a first denoised image $x_{T-1}$; inputting the first denoised image $x_{T-1}$, the style information $S$ and the text encoding $c$ into the pre-constructed noise prediction network with the style guidance module, and outputting a first denoising prediction result; performing a second denoising operation on the first denoised image $x_{T-1}$ using the first denoising prediction result to obtain a second denoised image $x_{T-2}$;
repeating the denoising operation a set number of times to obtain the latent space image $x_0$.
In combination with the first aspect, the expression of the denoising operation is as follows:

$$x_{t-1} = \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{1 - \alpha_t}{\sqrt{1 - \bar{\alpha}_t}}\, \epsilon_\theta(x_t, t, c, S) \right) + \sqrt{\frac{\beta_t \left(1 - \bar{\alpha}_{t-1}\right)}{1 - \bar{\alpha}_t}}\; z$$

wherein $\beta_t$ is the added noise variance; $\alpha_t = 1 - \beta_t$ is the decreasing sequence computed from $\beta_t$; $\bar{\alpha}_t$ is the product of $\alpha_1$ to $\alpha_t$; $\bar{\alpha}_{t-1}$ is the product of $\alpha_1$ to $\alpha_{t-1}$; $z$ is a noise image randomly sampled from the standard normal distribution; $x_{t-1}$ is the noise image after one denoising step; $x_t$ is the noise image at step $t$; $c$ is the encoded text information; $t$ is the step index; $S$ is the style information; $\epsilon_\theta$ is the predicted noise.
In a second aspect, an image generation apparatus with style personalization, the apparatus comprising:
the image input module is used for selecting a style image, inputting the style image into a pre-constructed VGG19_f3 network model, and obtaining feature images with different sizes;
the style information extracting module is used for calculating a Gram matrix of the feature map so as to extract style information of the feature map;
the noise prediction module is used for inputting the text encoding obtained by the text encoder, the randomly sampled noise image, and the style information into a pre-constructed noise prediction network with a style guidance module for noise prediction to obtain predicted noise;
the noise removing module is used for carrying out repeated noise removing operation on the noise image by utilizing the predicted noise to obtain a latent space image;
and the image decoding module is used for decoding the latent space image through an image decoder in the large-scale image generation model to obtain a finally generated style personalized image.
In a third aspect, the present invention provides a computer readable storage medium having stored thereon a computer program which when executed by a processor performs the steps of any of the methods described in the preceding claims.
Compared with the prior art, the invention has the beneficial effects that:
according to the invention, the VGG19-f3 network model is utilized to extract the feature images with different pixel sizes, the Gram matrix is utilized to calculate the style characteristics of each layer of the feature images so as to extract the style information, the style information is utilized to guide the generation process of the images, and the content characteristics of the images do not participate in the guide, so that the model is not required to be trained by using the images corresponding to the content.
Drawings
FIG. 1 is a schematic diagram of an image generation process provided by an embodiment of the present invention;
FIG. 2 is stylistic image data provided by an embodiment of the present invention;
FIG. 3 is a truth image provided by an embodiment of the present invention;
FIG. 4 is an image with style personalization provided by an embodiment of the present invention;
fig. 5 is an image generated by a conditional image generation model ControlNet of the current mainstream provided in the embodiment of the present invention.
Detailed Description
The technical solutions of the present invention are described in detail below with the accompanying drawings and specific embodiments. It should be understood that the specific features of the embodiments are detailed descriptions of the technical solutions of the invention rather than limitations on them, and that the embodiments and their technical features may be combined with each other without conflict.
The term "and/or" in the present invention is merely an association relation describing the association object, and indicates that three kinds of relations may exist, for example, a and/or B may indicate: a exists alone, A and B exist together, and B exists alone. In addition, the character "/" herein generally indicates that the front and rear associated objects are an "or" relationship.
Example 1
Fig. 1 is a flowchart of an image generation method with style personalization in embodiment 1 of the present invention. The flow chart merely shows the logical sequence of the method according to the present embodiment, and the steps shown or described may be performed in a different order than shown in fig. 1 in other possible embodiments of the invention without mutual conflict.
Referring to fig. 1, the method of the present embodiment specifically includes the following steps:
s1, selecting a style image, and inputting the style image into a pre-constructed VGG19_f3 network model to obtain feature images with different sizes;
It should be noted that the style images of the present invention are 10,000 images randomly selected from the Pinterest open-source image dataset.
Specifically, referring to fig. 2, to meet the network model's input-size requirement, the pixel size of the selected style image data is set to 512×512. The 512×512 style image data is input into the VGG19_f3 network model, composed of the first three downsampling blocks of the VGG19 network model, to extract a first feature map $F_1$, a second feature map $F_2$ and a third feature map $F_3$ with sizes 128×128×64, 64×64×128 and 32×32×256, respectively. The three feature maps are stored to facilitate the subsequent extraction of style information.
For the composition and parameters of the VGG19_f3 network model, refer to the network structure and feature map sizes of the first three downsampling blocks in Table 1:
TABLE 1 composition and parameters of VGG19_f3 network model
S2, calculating a Gram matrix of the feature map to extract style information of the feature map;
Specifically, the style information of the style image data is extracted by calculating the Gram matrices of the first feature map $F_1$, the second feature map $F_2$ and the third feature map $F_3$; the computational expression of the Gram matrix is as follows:

$$G_{ij} = \frac{1}{HW} \sum_{k=1}^{HW} F_{i,k}\, F_{j,k}$$

wherein $G_{ij}$ is the element in row $i$, column $j$ of the Gram matrix of the feature map $F$; $F_{i,k}$ is the $k$-th element of channel $i$ of $F$; $F_{j,k}$ is the $k$-th element of channel $j$ of $F$; $H$ is the height of the feature map $F$; $W$ is the width of the feature map $F$.
Further, the style features of each layer, $G_1$, $G_2$ and $G_3$, are obtained with the above expression; $G_1$ and $G_2$ are max-pooled and then added to $G_3$ to obtain the style information $S$.
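The Gram-matrix computation of step S2 can be sketched as follows. This is a hedged reconstruction: the $1/(HW)$ normalization matches the $H$ and $W$ terms that survive in the formula's definitions, but the original normalization is not fully recoverable from the patent text:

```python
import numpy as np

def gram_matrix(feat):
    """Gram matrix of a (C, H, W) feature map: G[i, j] = (1/HW) * sum_k F[i,k] * F[j,k]."""
    c, h, w = feat.shape
    f = feat.reshape(c, h * w)   # flatten each channel into a row of length H*W
    return f @ f.T / (h * w)     # (C, C) channel-correlation matrix
```

The resulting $C \times C$ matrix depends only on channel correlations, not on where features occur spatially, which is why it captures style rather than content.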
S3, inputting the codes of the texts obtained through the text coder, the noise images obtained through random sampling and style information into a pre-constructed noise prediction network with a style guide module for noise prediction to obtain prediction noise;
Specifically, image generation is realized with a large image generation model (Stable Diffusion, SD), and the noise prediction network with the style guidance module is designed according to the SD model structure.
The noise prediction network in SD comprises four downsampling blocks $D_1$, $D_2$, $D_3$ and $D_4$, one middle block $M$, and four upsampling blocks $U_1$, $U_2$, $U_3$ and $U_4$.
Further, the downsampling blocks and the middle block of the noise prediction network are replicated to obtain downsampling block copies $D_1'$, $D_2'$, $D_3'$ and $D_4'$ and a middle block copy $M'$; these five network blocks together form a downsampling network, and this downsampling network together with the noise prediction network of SD forms the noise prediction network with the style guidance module.
It should be noted that the image feature sizes of the downsampling block copies $D_1'$, $D_2'$, $D_3'$ and $D_4'$ and the middle block copy $M'$ are 64×64, 32×32, 16×16, 8×8 and 8×8, respectively.
As can be seen from fig. 1, the final noise prediction is obtained by adding the outputs of the downsampling network to the corresponding outputs in the noise prediction network, as follows:
When the style information $S$ obtained in step S2 is input into the downsampling network, the output $s_1$ of $D_1'$, the output $s_2$ of $D_2'$, the output $s_3$ of $D_3'$, the output $s_4$ of $D_4'$ and the output $s_5$ of $M'$ are obtained. At the same time, the text encoding $c$ of the input text prompt and a randomly sampled noise image $x_T$ are input into downsampling block $D_1$; the output of $D_1$ is input into $D_2$; the output of $D_2$ into $D_3$; the output of $D_3$ into $D_4$; and the output of $D_4$ into the middle block $M$.
Next, the output $s_5$ of $M'$ is added to the output of the middle block $M$ and used as the input of upsampling block $U_1$; the feature map $s_4$ is added to the output of $U_1$ and used as the input of $U_2$; $s_3$ is added to the output of $U_2$ and used as the input of $U_3$; $s_2$ is added to the output of $U_3$ and used as the input of $U_4$; finally $s_1$ is added to the output of $U_4$, and the final predicted noise $\epsilon_\theta$ is output.
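The copy-and-inject wiring above can be sketched with stand-in blocks. Several things here are assumptions: the `Block` modules are hypothetical placeholders for SD's real ResNet/attention blocks, the channel widths are invented, the style input is assumed to be projected to the latent resolution, and the style outputs $s_5 \ldots s_1$ are injected together with the U-Net skip connections (as in ControlNet) so that the tensor shapes line up, since the exact injection points are not fully recoverable from the text:

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

class Block(nn.Module):
    """Stand-in for an SD down/middle/up block (the real blocks are ResNet/attention stacks)."""
    def __init__(self, cin, cout, mode):
        super().__init__()
        self.conv = nn.Conv2d(cin, cout, 3, padding=1)
        self.mode = mode  # "down" halves, "up" doubles, "same" keeps the spatial size

    def forward(self, x):
        x = torch.relu(self.conv(x))
        if self.mode == "down":
            return F.avg_pool2d(x, 2)
        if self.mode == "up":
            return F.interpolate(x, scale_factor=2.0)
        return x

class StyleGuidedUNet(nn.Module):
    def __init__(self):
        super().__init__()
        chans = [4, 64, 128, 256, 256]
        self.down = nn.ModuleList(
            Block(chans[i], chans[i + 1], "down") for i in range(4))
        self.mid = Block(256, 256, "same")
        # style guidance branch: copies of the down path and the middle block
        self.style_down = copy.deepcopy(self.down)
        self.style_mid = copy.deepcopy(self.mid)
        up_specs = [(256, 256), (256, 128), (128, 64), (64, 4)]
        self.up = nn.ModuleList(Block(ci, co, "up") for ci, co in up_specs)

    def forward(self, x, style):
        s, s_outs = style, []
        for blk in self.style_down:          # s1..s4 from the copied down blocks
            s = blk(s)
            s_outs.append(s)
        s_outs.append(self.style_mid(s))     # s5 from the copied middle block
        h, skips = x, []
        for blk in self.down:
            h = blk(h)
            skips.append(h)
        h = self.mid(h) + s_outs[4]          # inject s5 at the middle block output
        for i, blk in enumerate(self.up):    # inject s4..s1 with the matching skips
            h = blk(h + skips[3 - i] + s_outs[3 - i])
        return h                             # predicted noise, same shape as x
```

Because the guidance branch only copies the down path and middle block, its parameter count is roughly half that of a second U-Net, which is the design motivation behind this ControlNet-style construction.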
S4, carrying out repeated denoising operation on the noise image by using the predicted noise to obtain a latent space image;
Specifically, a first denoising operation is performed on the noise image $x_T$ using the predicted noise $\epsilon_\theta$ to obtain a first denoised image $x_{T-1}$; the first denoised image $x_{T-1}$, the style information $S$ and the text encoding $c$ are then input into the pre-constructed noise prediction network with the style guidance module, which outputs a first denoising prediction result; a second denoising operation is performed on $x_{T-1}$ using this result to obtain a second denoised image $x_{T-2}$; the second denoised image $x_{T-2}$, the style information $S$ and the text encoding $c$ are again input into the network, which outputs a second denoising prediction result. The above denoising operation is repeated 50 times until the latent space image $x_0$ is obtained.
The noise image is denoised using the final predicted noise; the denoising expression is as follows:

$$x_{t-1} = \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{1 - \alpha_t}{\sqrt{1 - \bar{\alpha}_t}}\, \epsilon_\theta(x_t, t, c, S) \right) + \sqrt{\frac{\beta_t \left(1 - \bar{\alpha}_{t-1}\right)}{1 - \bar{\alpha}_t}}\; z$$

wherein $\beta_t$ is the added noise variance, a hyperparameter that increases linearly with $t$; $\alpha_t = 1 - \beta_t$ is the decreasing sequence computed from $\beta_t$; $\bar{\alpha}_t$ is the product of $\alpha_1$ to $\alpha_t$; $\bar{\alpha}_{t-1}$ is the product of $\alpha_1$ to $\alpha_{t-1}$; $z$ is noise randomly sampled from the standard normal distribution, i.e. $z \sim \mathcal{N}(0, I)$; one denoising step takes the noise image $x_t$ to the noise image $x_{t-1}$.
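The 50-step sampling loop can be sketched numerically. This is a standard DDPM update consistent with the symbols above; the linear $\beta$ schedule endpoints $10^{-4}$ and $0.02$ are the common defaults, assumed here because the patent does not state them:

```python
import numpy as np

T = 50                                    # the patent denoises for 50 steps
betas = np.linspace(1e-4, 0.02, T)        # linearly increasing noise variances beta_t
alphas = 1.0 - betas                      # decreasing sequence alpha_t
alpha_bars = np.cumprod(alphas)           # running products alpha_bar_t

def ddpm_step(x_t, eps_pred, t, rng):
    """One denoising step x_t -> x_{t-1} given the predicted noise eps_pred."""
    a_t, ab_t = alphas[t], alpha_bars[t]
    ab_prev = alpha_bars[t - 1] if t > 0 else 1.0
    mean = (x_t - (1.0 - a_t) / np.sqrt(1.0 - ab_t) * eps_pred) / np.sqrt(a_t)
    if t == 0:
        return mean                       # no fresh noise is added at the final step
    sigma = np.sqrt(betas[t] * (1.0 - ab_prev) / (1.0 - ab_t))
    return mean + sigma * rng.standard_normal(x_t.shape)
```

In the full method, `eps_pred` would come from the style-guided noise prediction network evaluated at $(x_t, t, c, S)$.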
Further, the training method of the noise prediction network with the style guidance module is as follows:
for a selected original image, noise is continuously added to it; the noised image $x_t$, the corresponding text encoding $c$ and the style information $S$ are input into the noise prediction network with the style guidance module $\epsilon_\theta$; the network is optimized with the mean square error between the calculated predicted noise $\epsilon_\theta(x_t, t, c, S)$ and the added real noise $\epsilon$ as the loss function $L$, whose expression is:

$$L = \mathbb{E}\left[ \left\| \epsilon - \epsilon_\theta(x_t, t, c, S) \right\|^2 \right]$$
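The training objective (noise the image with the forward process, then regress the added noise) can be sketched as follows; `alpha_bars` is the same running product $\bar{\alpha}_t$ used in the sampling expression:

```python
import numpy as np

def forward_noise(x0, t, eps, alpha_bars):
    """Forward diffusion q(x_t | x_0): mix the clean image with Gaussian noise eps."""
    ab = alpha_bars[t]
    return np.sqrt(ab) * x0 + np.sqrt(1.0 - ab) * eps

def diffusion_loss(eps_true, eps_pred):
    """Mean squared error between the added noise and the network's prediction."""
    return np.mean((eps_true - eps_pred) ** 2)
```

At each training step one would sample $t$, sample $\epsilon \sim \mathcal{N}(0, I)$, form $x_t$ with `forward_noise`, and backpropagate `diffusion_loss` through the style-guided network.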
Step S5: the latent space image $x_0$ is decoded using the image decoder in SD to obtain the final style personalized image.
Example 2
In this embodiment, the method and model of Embodiment 1 are compared with ControlNet, the current mainstream conditional image generation model, on a test set composed of 2000 items of user data. The experiments compute the indices of the model-generated images against the ground-truth images and against the style image set, respectively; the overall performance comparison results are shown in Table 2 below. FID (Fréchet Inception Distance) is an important index in the image generation field for measuring the quality of generated images; KID (Kernel Inception Distance) is an important index for measuring the diversity of generated images; ClipSim is an important index for measuring the consistency between a generated image and its text; StySim (Style Similarity) is an index for measuring the similarity of image styles, and is calculated as follows:
for two imagesAnd->Extracting their features using the first four network blocks of VGG19 network, each layer outputting the result +.>And +.>Calculate their +.>Matrix arrayObtain->And->Calculating the mean square error of a pair of Gram matrixes of each layer, and finally summing the calculation results of the four mean square errors to obtain the style similarity,/V>Is calculated by the formula of (2)
Wherein,for the number of elements of the Gram matrix in each layer, the smaller the value of the index is, the more similar the styles of the two images are, and the stronger the style individuation performance of the model is; />Is->Gram matrix of (a); />Is->Gram matrix of (a);and (5) measuring the Style Similarity of the images.
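The StySim metric reduces to per-layer Gram-matrix mean squared errors summed over four layers, and can be sketched as follows (the feature-map lists are assumed to come from the first four VGG19 blocks; `gram_matrix` recomputes the Gram matrix of a (C, H, W) feature map):

```python
import numpy as np

def gram_matrix(feat):
    c, h, w = feat.shape
    f = feat.reshape(c, h * w)
    return f @ f.T / (h * w)

def stysim(feats1, feats2):
    """Sum over layers of the mean squared error between paired Gram matrices.

    feats1, feats2: lists of four (C, H, W) feature maps, one per VGG19 block.
    Lower values mean the two images are closer in style.
    """
    total = 0.0
    for f1, f2 in zip(feats1, feats2):
        g1, g2 = gram_matrix(f1), gram_matrix(f2)
        total += np.mean((g1 - g2) ** 2)   # (1/N_l) * sum of squared differences
    return total
```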
Table 2 results of performance comparisons
As can be seen from Table 2, the invention is comparable to ControlNet on the quality index FID. On the diversity index KID the invention performs below ControlNet, because constraining the style of the generated images reduces their diversity, which indirectly confirms the stronger style personalization of the invention. On the style similarity index StySim, the invention improves on ControlNet by 4.3% ((1.85−1.77)/1.85×100%) on the ground-truth image set and by 7.5% ((2.14−1.98)/2.14×100%) on the style image set. The invention is comparable to ControlNet on the text-image consistency index ClipSim.
The generation results of the different methods under one set of style conditions are shown in fig. 3, fig. 4 and fig. 5. The lion image in fig. 3 is the ground truth, i.e. the original image; the lion image in fig. 4 is the style personalized image obtained by the invention; the lion in fig. 5 is the image generated by ControlNet, the current mainstream conditional image generation model. Compared with ControlNet, the style personalization effect of the invention is more obvious: the result is closer to the style images and to the true flat-illustration style.
Example 3
An image generation apparatus with style personalization, the apparatus comprising:
the image input module is used for selecting a style image, inputting the style image into a pre-constructed VGG19_f3 network model, and obtaining feature images with different sizes;
the style information extracting module is used for calculating a Gram matrix of the feature map so as to extract style information of the feature map;
the noise prediction module is used for inputting the text encoding obtained by the text encoder, the randomly sampled noise image, and the style information into a pre-constructed noise prediction network with a style guidance module for noise prediction to obtain predicted noise;
the noise removing module is used for carrying out repeated noise removing operation on the noise image by utilizing the predicted noise to obtain a latent space image;
and the image decoding module is used for decoding the latent space image through an image decoder in the large-scale image generation model to obtain a finally generated style personalized image.
Example 4
The embodiment of the present invention also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the method of Embodiment 1.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The foregoing is merely a preferred embodiment of the present invention, and it should be noted that modifications and variations could be made by those skilled in the art without departing from the technical principles of the present invention, and such modifications and variations should also be regarded as being within the scope of the invention.

Claims (10)

1. An image generation method with style personalization, comprising:
selecting a style image, and inputting the style image into a pre-constructed VGG19_f3 network model to obtain feature images with different sizes;
calculating a Gram matrix of the feature map to extract style information of the feature map;
inputting the text encoding obtained by the text encoder, a randomly sampled noise image, and the style information into a pre-constructed noise prediction network with a style guidance module for noise prediction to obtain predicted noise;
carrying out repeated denoising operation on the noise image by utilizing the predicted noise to obtain a latent space image;
and decoding the latent space image through an image decoder in a large image generation model to obtain a finally generated style personalized image.
2. The method for generating an image with style personalization according to claim 1, wherein the obtaining feature maps of different sizes comprises:
setting the pixel size of the style image to a set size;
forming the VGG19_f3 network model from the first three downsampling blocks of the VGG19 network model, inputting the style image into the VGG19_f3 network model, and outputting a first feature map $F_1$, a second feature map $F_2$ and a third feature map $F_3$ of specified sizes.
3. The method for generating an image with style personalization according to claim 2, wherein extracting style information of the feature map comprises:
calculating the first feature mapSecond characteristic map->And third characteristic diagram->Obtain a first style characteristic of Gram matrix of (2)Second style characteristics->And third style characteristics->
Characterizing the first styleSecond style characteristics->After the maximum pooling operation, the third style characteristic is +.>And adding to obtain style information S.
4. The method of generating an image with style personalization of claim 3, wherein the Gram matrices of the first feature map $F_1$, the second feature map $F_2$ and the third feature map $F_3$ are computed as follows:

$$G_{ij} = \frac{1}{HW} \sum_{k=1}^{HW} F_{i,k}\, F_{j,k}$$

wherein $G_{ij}$ is the element in row $i$, column $j$ of the Gram matrix of the feature map $F$; $F_{i,k}$ is the $k$-th element of channel $i$ of $F$; $F_{j,k}$ is the $k$-th element of channel $j$ of $F$; $H$ is the height of the feature map $F$; $W$ is the width of the feature map $F$.
5. The method for generating an image with style personalization according to claim 1, wherein the noise prediction network with the style guidance module is constructed as follows:
and copying a downsampling block and an intermediate block of a noise prediction network in the large image generation model to obtain a downsampling network, wherein the downsampling network and the noise prediction network in the large image generation model form the noise prediction network with the style guiding module.
6. The method for generating an image with style personalization of claim 5, wherein said obtaining a prediction noise comprises:
inputting the style information into the downsampling network to obtain first style information $s_1$, second style information $s_2$, third style information $s_3$, fourth style information $s_4$ and fifth style information $s_5$;
inputting the text encoding $c$ obtained by the text encoder and the randomly sampled noise image $x_T$ sequentially through the downsampling blocks and the intermediate block of the noise prediction network to obtain a first prediction noise;
adding the fifth style information $s_5$, fourth style information $s_4$, third style information $s_3$, second style information $s_2$ and first style information $s_1$ in sequence to the first prediction noise to output the final predicted noise $\epsilon_\theta$.
7. The method of generating an image with style personalization of claim 6, wherein the obtaining a latent space image comprises:
performing a first denoising operation on the noise image by using the predicted noise to obtain a first denoising operationImage processing apparatusThe first denoising image is +.>Style information S, coding of text +.>Then input the first noise removal prediction result to a pre-constructed noise prediction network with a style guiding module, and output the first noise removal prediction result +.>Using the first denoising prediction result to denoise the first denoising image +.>Performing a second denoising operation to obtain a second denoised image +.>
repeating the denoising operation a set number of times to obtain a latent space image z_0.
8. The method for generating an image with style personalization according to claim 7, wherein the expression of the denoising operation is as follows:
z_{t−1} = (1/√(α_t)) · (z_t − ((1 − α_t)/√(1 − ᾱ_t)) · ε_θ(z_t, t, c, S)) + σ_t · ε

wherein σ_t² = ((1 − ᾱ_{t−1})/(1 − ᾱ_t)) · β_t is the added noise variance; α_t = 1 − β_t is a decreasing sequence calculated from β_t; ᾱ_t is the product of α_1 through α_t; ᾱ_{t−1} is the product of α_1 through α_{t−1}; ε is a noise image randomly sampled from the standard normal distribution; z_{t−1} is the image after one denoising step; z_t is the noise image at step t; c is the encoded text information; t is the time step; S is the style information; ε_θ is the prediction noise.
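One denoising step of claim 8 can be sketched with NumPy under the standard DDPM posterior. The function name, the linear beta schedule in the example, and the choice σ_t² = ((1 − ᾱ_{t−1})/(1 − ᾱ_t)) · β_t are assumptions consistent with the symbols the claim lists, not the patent's exact implementation.

```python
import numpy as np

def ddpm_step(z_t, eps_pred, t, betas, rng):
    """One denoising step z_t -> z_{t-1} (standard DDPM form).

    eps_pred stands in for the style-guided network's prediction
    eps_theta(z_t, t, c, S); betas is the noise schedule."""
    alphas = 1.0 - betas                 # alpha_t = 1 - beta_t
    alpha_bar = np.cumprod(alphas)       # product alpha_1 .. alpha_t
    a_t = alphas[t]
    ab_t = alpha_bar[t]
    ab_prev = alpha_bar[t - 1] if t > 0 else 1.0
    # Added noise variance sigma_t^2 (assumed posterior-variance choice).
    sigma2 = (1.0 - ab_prev) / (1.0 - ab_t) * betas[t]
    mean = (z_t - (1.0 - a_t) / np.sqrt(1.0 - ab_t) * eps_pred) / np.sqrt(a_t)
    noise = rng.standard_normal(z_t.shape) if t > 0 else 0.0
    return mean + np.sqrt(sigma2) * noise

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 10)      # hypothetical short schedule
z = rng.standard_normal((2, 2))
out = ddpm_step(z, np.zeros((2, 2)), 5, betas, rng)
```

Iterating this step from t = T − 1 down to t = 0 yields the latent space image z_0 of claim 7.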
9. An image generation apparatus with style personalization, the apparatus comprising:
the image input module is used for selecting a style image, inputting the style image into a pre-constructed VGG19_f3 network model, and obtaining feature images with different sizes;
the style information extracting module is used for calculating a Gram matrix of the feature map so as to extract style information of the feature map;
the noise prediction module is used for inputting the text encoding obtained by the text encoder, the randomly sampled noise image and the style information into the pre-constructed noise prediction network with the style guidance module for noise prediction, so as to obtain prediction noise;
the denoising module is used for performing repeated denoising operations on the noise image by using the prediction noise to obtain a latent space image;
and the image decoding module is used for decoding the latent space image through an image decoder in the large-scale image generation model to obtain a finally generated style personalized image.
10. A computer readable storage medium, on which a computer program is stored, characterized in that the program, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 8.
CN202410217621.9A 2024-02-28 2024-02-28 Image generation method, device and storage medium with style personalization Active CN117788629B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410217621.9A CN117788629B (en) 2024-02-28 2024-02-28 Image generation method, device and storage medium with style personalization

Publications (2)

Publication Number Publication Date
CN117788629A true CN117788629A (en) 2024-03-29
CN117788629B CN117788629B (en) 2024-05-10

Family

ID=90385321

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410217621.9A Active CN117788629B (en) 2024-02-28 2024-02-28 Image generation method, device and storage medium with style personalization

Country Status (1)

Country Link
CN (1) CN117788629B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110210410A * 2019-06-04 2019-09-06 Nanjing University of Posts and Telecommunications Handwritten digit recognition method based on image features
CN111325681A * 2020-01-20 2020-06-23 Nanjing University of Posts and Telecommunications Image style migration method combining meta-learning mechanism and feature fusion
CN114692733A * 2022-03-11 2022-07-01 South China University of Technology End-to-end video style migration method, system and storage medium for suppressing temporal noise amplification
CN116664719A * 2023-07-28 2023-08-29 Tencent Technology (Shenzhen) Co., Ltd. Image redrawing model training method, image redrawing method and device


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ZHANG, L. et al.: "New Image Processing: VGG Image Style Transfer with Gram Matrix Style Features", 2023 5th International Conference on Artificial Intelligence and Computer Applications (ICAICA), 23 February 2024 (2024-02-23), pages 468-472 *
GAO, Xuan: "Research on Controllable Image Style Transfer Theory and Methods Based on Attribute Decomposition", China Master's Theses Full-text Database, no. 1, 31 January 2023 (2023-01-31), pages 1-84 *


Similar Documents

Publication Publication Date Title
CN107392973B (en) Pixel-level handwritten Chinese character automatic generation method, storage device and processing device
CN112734634B (en) Face changing method and device, electronic equipment and storage medium
CN111242841B (en) Image background style migration method based on semantic segmentation and deep learning
US20210397945A1 (en) Deep hierarchical variational autoencoder
CN109087258A (en) Image rain removal method and device based on deep learning
CN116704079B (en) Image generation method, device, equipment and storage medium
CN116721334B (en) Training method, device, equipment and storage medium of image generation model
CN113449787B (en) Chinese character stroke structure-based font library completion method and system
CN112184582B (en) Attention mechanism-based image completion method and device
CN111402365A (en) Method for generating picture from characters based on bidirectional architecture confrontation generation network
CN113140020A (en) Method for generating image based on text of countermeasure network generated by accompanying supervision
CN114330736A (en) Latent variable generative model with noise contrast prior
CN115526223A (en) Score-based generative modeling in a potential space
CN117635418B (en) Training method for generating countermeasure network, bidirectional image style conversion method and device
CN113962192B (en) Method and device for generating Chinese character font generation model and Chinese character font generation method and device
CN115049556A (en) StyleGAN-based face image restoration method
CN114529785A (en) Model training method, video generation method and device, equipment and medium
CN117058276B (en) Image generation method, device, equipment and storage medium
CN116958324A (en) Training method, device, equipment and storage medium of image generation model
CN117788629B (en) Image generation method, device and storage medium with style personalization
Luhman et al. High fidelity image synthesis with deep vaes in latent space
CN114119923B (en) Three-dimensional face reconstruction method and device and electronic equipment
CN114494387A (en) Data set network generation model and fog map generation method
CN113111906B (en) Method for generating confrontation network model based on condition of single pair image training
CN114897884A (en) No-reference screen content image quality evaluation method based on multi-scale edge feature fusion

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant