CN117522675A - Diffusion model construction method and device - Google Patents

Diffusion model construction method and device

Info

Publication number
CN117522675A
CN117522675A
Authority
CN
China
Prior art keywords
image
features
fusion
noise
style
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311598508.1A
Other languages
Chinese (zh)
Inventor
肖海鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ant Blockchain Technology Shanghai Co Ltd
Original Assignee
Ant Blockchain Technology Shanghai Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ant Blockchain Technology Shanghai Co Ltd filed Critical Ant Blockchain Technology Shanghai Co Ltd
Priority to CN202311598508.1A
Publication of CN117522675A
Legal status: Pending


Classifications

    • G PHYSICS
        • G06 COMPUTING; CALCULATING OR COUNTING
            • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
                • G06T 11/00 2D [Two Dimensional] image generation
                    • G06T 11/001 Texturing; Colouring; Generation of texture or colour
                • G06T 2207/00 Indexing scheme for image analysis or image enhancement
                    • G06T 2207/20 Special algorithmic details
                        • G06T 2207/20081 Training; Learning
                        • G06T 2207/20084 Artificial neural networks [ANN]
                        • G06T 2207/20212 Image combination
                            • G06T 2207/20221 Image fusion; Image merging
            • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
                • G06N 3/00 Computing arrangements based on biological models
                    • G06N 3/02 Neural networks
                        • G06N 3/04 Architecture, e.g. interconnection topology
                            • G06N 3/045 Combinations of networks
                                • G06N 3/0455 Auto-encoder networks; Encoder-decoder networks
                            • G06N 3/0464 Convolutional networks [CNN, ConvNet]

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The specification discloses a method and a device for constructing a diffusion model. An image training sample and a prompt text corresponding to the image training sample are acquired. A noise image corresponding to the image training sample is then generated, and the prompt text, the noise image and the image training sample are input into a diffusion model to be trained. The diffusion model extracts image features of the noise image, text features of the prompt text and style features of the image training sample, performs feature fusion on the image features, the text features and the style features, and calculates the predicted noise of the noise image based on the fusion features obtained by the feature fusion; the diffusion model is trained with the error between the predicted noise and the actual noise of the noise image as the optimization target. Because the method does not damage the sharpness of the image content of the original image, the sharpness of the generated image is not reduced compared with the original image, and the stylization effect is improved.

Description

Diffusion model construction method and device
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a method and an apparatus for constructing a diffusion model.
Background
With the development of image processing technology, a technology of transferring the visual features (style) of one image to another image, called image stylization, has emerged. For example, in practical application, a content image A and a style image B may be input into a style migration model to obtain a stylized image C that retains the image content of the content image A and merges in the image style of the style image B. That is, the stylized image C is essentially an image whose style conforms to the style image B and whose image content conforms to the content image A.
At present, a common CNN-based style migration algorithm migrates only the style and colour of an image without changing the image content, and the style migration model is generally trained with a style loss and a content loss. However, because such an algorithm does not change the content of the content image and only performs a hard migration of the image style and colour, the sharpness of the image content of the original image is damaged, the generated stylized image has low sharpness, the stylization effect is poor, and the result lacks the realism and expressiveness of an artwork.
Disclosure of Invention
The specification provides a method and an apparatus for constructing a diffusion model, a storage medium and an electronic device, so as to avoid damaging the sharpness of the image content of the original image and thereby improve the stylization effect.
The technical scheme adopted in the specification is as follows:
the specification provides a method for constructing a diffusion model, which comprises the following steps:
acquiring an image training sample and a prompt text corresponding to the image training sample;
generating a noise image corresponding to the image training sample, inputting the prompt text, the noise image and the image training sample into a diffusion model to be trained, extracting image features of the noise image, text features of the prompt text and style features of the image training sample by the diffusion model, carrying out feature fusion on the image features, the text features and the style features, calculating prediction noise of the noise image based on fusion features obtained by feature fusion, and training the diffusion model by taking errors between the calculated prediction noise and actual noise of the noise image as optimization targets; the diffusion model is used for generating a target image which is the same as the image style of the image sample and the same as the image content described by the prompt text based on the input prompt text and the image sample.
Optionally, the method further comprises:
acquiring other image training samples with the same image style and different image contents as the image training samples;
inputting the prompt text, the noise image and the other image training samples into the diffusion model, extracting image features of the noise image, text features of the prompt text and style features of the other image training samples by the diffusion model, carrying out feature fusion on the image features, the text features and the style features of the other training images, calculating the prediction noise of the noise image based on fusion features obtained by feature fusion, and further carrying out optimization training on the diffusion model by taking errors between the calculated prediction noise and actual noise of the noise image as optimization targets.
Optionally, acquiring other image training samples with the same image style and different image content as the image training samples includes:
and inputting a prompt text for describing the image style of the image training sample into a preset text to generate an image model, and generating other image training samples which are the same as the image style of the image training sample and have different image contents.
Optionally, the preset text-generating image model includes a LoRA model.
Optionally, acquiring the prompt text corresponding to the image training sample includes:
inputting the image training sample into a preset image generation text model to generate a content description text corresponding to the image training sample;
inputting the image training sample and a preset content description text set into a pre-trained image-text matching model to obtain a content description text matched with the image training sample;
and combining the content description text generated by the image generation text model with the content description text matched with the image training sample to obtain a prompt text corresponding to the image training sample.
Optionally, the preset image generation text model includes a BLIP model; the image-text matching model includes a CLIP model.
Optionally, generating a noise image corresponding to the image training sample includes:
acquiring a gray level image of the image training sample, wherein the gray level image is used for representing the image content of the image training sample;
and adding noise to the image training sample, and superposing the image training sample added with the noise and a gray level diagram of the image training sample to obtain a noise image corresponding to the image training sample.
Optionally, extracting style features of the image training sample includes:
and extracting image characteristics of the image training sample, and calculating a gram matrix corresponding to the image training sample according to the image characteristics of the image training sample, wherein the gram matrix is used for representing style characteristics of the image training sample.
Optionally, the diffusion model includes:
the content feature extraction layer is used for extracting image features of the noise image and text features of the prompt text;
the style characteristic extraction layer is used for extracting style characteristics of the image training sample;
and the fusion layer is used for carrying out feature fusion on the image features, the text features and the style features, and calculating the prediction noise of the noise image based on the fusion features obtained by the feature fusion.
Optionally, the fusion layer includes: a plurality of network sublayers and a plurality of fusion sublayers corresponding to the network sublayers except the network sublayers of the last layer one by one; wherein, the fusion sublayer corresponding to each network sublayer is positioned behind the network sublayer;
performing feature fusion on the image features, the text features and the style features, and calculating the prediction noise of the noise image based on the fusion features obtained by the feature fusion, wherein the method comprises the following steps:
Inputting the image features and the text features into a first network sub-layer of the plurality of network sub-layers for feature fusion;
and inputting the fusion characteristics output by the first network sub-layer and the gram matrix corresponding to the image training sample into a first fusion sub-layer corresponding to the first network sub-layer for characteristic fusion, inputting the style fusion characteristics output by the first fusion sub-layer corresponding to the first network sub-layer and the text characteristics into a second network sub-layer of a next layer for executing the same actions, and the like until the network sub-layer of the last layer, and calculating the prediction noise of the noise image.
Optionally, inputting the fusion feature output by the first network sublayer and the gram matrix corresponding to the image training sample into a first fusion sublayer corresponding to the first network sublayer to perform feature fusion, including:
calculating a mean prediction matrix and a variance prediction matrix according to the gram matrix corresponding to the image training sample;
and inputting the fusion characteristics output by the first network sub-layer, the mean prediction matrix and the variance prediction matrix into a first fusion sub-layer corresponding to the first network sub-layer for characteristic fusion, multiplying the fusion characteristics output by the first network sub-layer by the variance prediction matrix to obtain a multiplication result, and adding the multiplication result to the mean prediction matrix to obtain the style fusion characteristics output by the first fusion sub-layer.
Optionally, the plurality of network sublayers are network sublayers contained in a U-Net neural network; the fusion layer is a U-Net neural network which is added with the fusion sublayers corresponding to the network sublayers one by one after each network sublayer except the network sublayer of the last layer is included.
The present specification provides an image generation method, the method comprising:
acquiring a content image, a prompt text corresponding to the content image and a style image;
generating a noise image corresponding to the content image, inputting the noise image, a prompt text corresponding to the content image and the style image into a trained diffusion model, extracting image features of the noise image, text features of the prompt text, style features of the style image by the diffusion model, carrying out feature fusion on the image features, the text features and the style features, calculating the prediction noise of the noise image based on the fusion features obtained by feature fusion, and generating a target image which is identical to the image style of the style image and identical to the image content described by the prompt text based on the prediction noise, wherein the diffusion model is obtained by the construction method of the diffusion model.
The present specification provides a diffusion model comprising:
the content feature extraction layer is used for acquiring an input image sample and a prompt text corresponding to the image sample, extracting image features corresponding to the image sample, extracting text features corresponding to the prompt text and inputting the image features and the text features into the fusion layer;
the style characteristic extraction layer is used for acquiring an input image sample, extracting style characteristics of the image sample and inputting the style characteristics into the fusion layer;
and the fusion layer is used for carrying out feature fusion on the input image features, the text features and the style features, calculating the prediction noise of the noise image based on the fusion features obtained by the feature fusion, and generating a target image which has the same image style as the image sample and the same image content as the prompt text based on the prediction noise.
The present specification provides an image generation apparatus, the apparatus comprising:
the acquisition module is used for acquiring the content image, the prompt text corresponding to the content image and the style image;
The generation module is used for generating a noise image corresponding to the content image, inputting the noise image, a prompt text corresponding to the content image and the style image into a trained diffusion model, extracting image features of the noise image, text features of the prompt text, style features of the style image and carrying out feature fusion on the image features, the text features and the style features, calculating prediction noise of the noise image based on fusion features obtained by feature fusion, generating a target image which is identical to the image style of the style image and identical to image content described by the prompt text based on the prediction noise, wherein the diffusion model is obtained by the construction method of the diffusion model.
The specification provides an electronic device, which comprises a communication interface, a processor, a memory and a bus, wherein the communication interface, the processor and the memory are connected with each other through the bus;
the memory stores machine readable instructions, and the processor executes the method of constructing the diffusion model by invoking the machine readable instructions.
The present specification provides a machine-readable storage medium storing machine-readable instructions that, when invoked and executed by a processor, implement the method of constructing a diffusion model described above.
The above-mentioned at least one technical scheme that this specification adopted can reach following beneficial effect:
according to the method, a prompt text, a noise image and an image training sample are input into a diffusion model to be trained, the diffusion model is used for extracting image features of the noise image, text features of the prompt text and style features of the image training sample, feature fusion is carried out on the image features, the text features and the style features, prediction noise of the noise image is calculated based on fusion features obtained by the feature fusion, and the diffusion model is trained by taking the error between the calculated prediction noise and actual noise of the noise image as an optimization target. Thus, through the input prompt text, the target image which has the same image style as the image sample and the same image content as described by the prompt text is generated, so that the style migration of the image is realized. Compared with a style migration algorithm realized based on CNN, the style migration algorithm does not carry out hard migration on the style of the image, does not damage the definition of the image content of the original image, does not reduce the definition of the generated image compared with the original image, and further improves the stylization effect.
Drawings
The accompanying drawings, which are included to provide a further understanding of the specification, illustrate and explain the exemplary embodiments of the present specification and their description, are not intended to limit the specification unduly. In the drawings:
FIG. 1 is a model structure diagram of a stable diffusion model in the related art;
FIG. 2 is a model block diagram of a diffusion model shown in an exemplary embodiment;
FIG. 3 is a flow chart illustrating a method of constructing a diffusion model in accordance with an exemplary embodiment;
FIG. 4 is a schematic diagram illustrating a style migration process based on a CNN implementation in accordance with an exemplary embodiment;
FIG. 5 is a schematic diagram illustrating a style migration process implemented based on a diffusion model in accordance with an exemplary embodiment;
FIG. 6 is a schematic diagram of a feature fusion shown in an exemplary embodiment;
FIG. 7 is a flowchart illustrating an image generation method in accordance with an exemplary embodiment;
FIG. 8 is a block diagram of an electronic device with a diffusion model according to an exemplary embodiment;
FIG. 9 is a block diagram of a diffusion model shown in an exemplary embodiment;
fig. 10 is a block diagram of an image generating apparatus shown in an exemplary embodiment.
Detailed Description
In order to make the technical solutions in the present specification better understood by those skilled in the art, the technical solutions in the embodiments of the present specification will be clearly and completely described below with reference to the drawings in the embodiments of the present specification, and it is obvious that the described embodiments are only some embodiments of the present specification, not all embodiments. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are intended to be within the scope of the present disclosure.
It should be noted that: in other embodiments, the steps of the corresponding method are not necessarily performed in the order shown and described in this specification. In some other embodiments, the method may include more or fewer steps than described in this specification. Furthermore, individual steps described in this specification, in other embodiments, may be described as being split into multiple steps; while various steps described in this specification may be combined into a single step in other embodiments.
In order to make the technical solution in the embodiments of the present specification better understood by those skilled in the art, the related art related to the embodiments of the present specification will be briefly described below.
Image stylization: a technique for combining the image content of one content image A with the image style of another style image B to generate a picture having the image content of the content image A and the image style of the style image B.
Diffusion model: a generative model capable of synthesizing images. A diffusion model typically adds noise (e.g., Gaussian noise) to an image and is trained to remove that noise, with minimizing the noise prediction error as the training objective. For a trained diffusion model, a prompt text may be input into the diffusion model, and an image conforming to what the prompt text indicates is generated from the prompt text. For example, assuming the prompt text is "a Persian cat with blue eyes and white hair," the diffusion model may generate an image of a Persian cat with blue eyes and white hair. Diffusion models generally include: Stable Diffusion (SD), Latent Diffusion Models (LDMs), and the like. For convenience of description, the method for constructing a diffusion model provided in the present specification is described below taking a stable diffusion model as an example. Referring to fig. 1, fig. 1 is a schematic diagram of a stable diffusion model in the related art.
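To make the noise-adding and noise-prediction mechanism concrete, the following is a minimal PyTorch sketch of the forward noising step that such models rely on; it is illustrative only, and the number of rounds and the noise schedule are assumed values rather than anything specified in this document.

    import torch

    T = 1000                                    # number of noise-adding rounds (assumed)
    betas = torch.linspace(1e-4, 0.02, T)       # linear noise schedule (assumed)
    alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

    def add_noise(x0: torch.Tensor, t: torch.Tensor):
        """Mix Gaussian noise into the clean image x0 at round t (forward process)."""
        noise = torch.randn_like(x0)
        a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)
        x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise
        return x_t, noise                       # the added noise is what the model learns to predict

    # usage: x0 has shape (B, C, H, W); t = torch.randint(0, T, (x0.shape[0],))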
In practical applications, CNN-based style migration algorithms typically use a style loss and a content loss to train the style migration model. Because such an algorithm does not change the image content and only performs a hard migration of the image style and colour, the sharpness of the image content of the original image can be damaged, the generated stylized image has low sharpness, and the stylization effect is poor.
Moreover, a stable diffusion model, although better at image generation, is generally not capable of performing the image stylization task. Based on this, the present specification proposes a technical scheme of adding a style feature extraction layer and a fusion layer on the basis of a diffusion model, so as to construct a diffusion model applicable to the image stylization task.
In the present embodiment, the diffusion model includes: a content feature extraction layer, a style feature extraction layer and a fusion layer. The fusion layer comprises: and a plurality of network sublayers, and a fusion sublayer corresponding to the network sublayers except the network sublayers of the last layer one by one. The convergence sublayer corresponding to each network sublayer is located behind the network sublayer, and the network sublayer comprises an encoding layer and a decoding layer.
The network sublayers can be network sublayers contained in a U-Net neural network. The fusion layer is a U-Net neural network which is added with the fusion sublayers corresponding to the network sublayers one by one after each network sublayer except the network sublayer of the last layer is included.
Referring to fig. 2, fig. 2 is a model structure diagram of a diffusion model according to an exemplary embodiment.
In fig. 2, the content feature extraction layer is configured to obtain an input noise image and a prompt text corresponding to the noise image, extract image features of the noise image, extract text features corresponding to the prompt text, and input the image features and the text features to the fusion layer.
The style characteristic extraction layer is used for acquiring an input image sample, extracting style characteristics of the image sample and inputting the style characteristics into the fusion layer. The style characteristics may be represented by a gram matrix.
And the fusion layer is used for carrying out feature fusion on the input image features, text features and style features, calculating the prediction noise of the noise image based on the fusion features obtained by the feature fusion, and generating a target image which has the same image style as the image sample and the same image content as the prompt text based on the prediction noise.
Wherein, the fusion layer includes: a first coding layer, a first fusion sub-layer, a second coding layer, a second fusion sub-layer, ..., an Nth coding layer, an Nth fusion sub-layer, a first decoding layer, an (N+1)th fusion sub-layer, a second decoding layer, an (N+2)th fusion sub-layer, ..., a (2N-1)th fusion sub-layer, and an Nth decoding layer.
It should be noted that each coding layer in the U-Net neural network may perform a downsampling operation to halve the resolution of the feature map and extract shallow information of the features. Each decoding layer may perform an upsampling operation and concatenate a convolutional layer to double the feature map resolution and extract the deep information of the features. Then, the encoder and decoder of the same hierarchy are connected by skip connection, thereby combining the shallow information and the deep information to generate a better target image.
The following describes in detail the technical solutions provided by the embodiments of the present specification with reference to the accompanying drawings.
FIG. 3 is a flow chart illustrating a method of constructing a diffusion model, which may specifically employ the model structure shown in FIG. 2, in accordance with an exemplary embodiment; the method specifically comprises the following steps:
s300: and acquiring an image training sample and a prompt text corresponding to the image training sample.
In the embodiment of the present specification, an image training sample and a prompt text corresponding to the image training sample may be obtained, where the prompt text may refer to a text describing content in an image.
In practical applications, the source of the prompt text corresponding to the image training sample may be a label or comment obtained from the internet. However, the prompt text obtained by the method may be wrong or inaccurate, so that the trained diffusion model is poor in effect.
Based on this, the prompt text corresponding to the image training sample is obtained through a preset image generation text model, so that wrong or inaccurate prompt text is avoided.
In the embodiment of the present specification, an image training sample is input into a preset image generation text model, and a content description text corresponding to the image training sample is generated as a prompt text corresponding to the image training sample.
In practice, the image-generated text model typically describes only the main content in the image training sample, which can result in the content description text losing detail content in the image training sample.
Based on this, a content description text matched with the image training sample is determined from a preset content description text set through a preset image-text matching model, and the content description text generated by the image generation text model is combined with the content description text matched with the image training sample, so that the prompt text corresponding to the image training sample is more accurate.
In the embodiment of the present specification, an image training sample is input into a preset image generation text model, and a content description text corresponding to the image training sample is generated.
And then, inputting the image training sample and a preset content description text set into a pre-trained image-text matching model to obtain a content description text matched with the image training sample.
And finally, merging the content description text generated by the image generation text model and the content description text matched with the image training sample to obtain a prompt text corresponding to the image training sample.
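A minimal sketch of this prompt-assembly step is given below; the helper functions caption_model and match_model are hypothetical stand-ins for the image captioning model and the image-text matching model, since the specification does not fix concrete interfaces for them.

    def build_prompt(image, candidate_descriptions, caption_model, match_model):
        """Combine a generated caption with the best-matching preset description.

        caption_model(image) -> str               # e.g. a BLIP-style captioner (assumed interface)
        match_model(image, candidates) -> str     # e.g. a CLIP-style matcher (assumed interface)
        """
        generated = caption_model(image)                      # main content of the image
        matched = match_model(image, candidate_descriptions)  # recovers detail content
        return generated + ", " + matched                     # merged prompt text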
Of course, there are various methods for obtaining the prompt text corresponding to the image training sample. The present specification does not limit the method of acquiring the prompt text corresponding to the image training sample.
It should be noted that the preset image generation text model may be a BLIP model (Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation). The image-text matching model may be a CLIP model (Contrastive Language-Image Pre-training). Of course, the present specification does not specifically limit the image generation text model and the image-text matching model.
S302: generating a noise image corresponding to the image training sample, inputting the prompt text, the noise image and the image training sample into a diffusion model to be trained, extracting image features of the noise image, text features of the prompt text and style features of the image training sample by the diffusion model, carrying out feature fusion on the image features, the text features and the style features, calculating the prediction noise of the noise image based on the fusion features obtained by feature fusion, and training the diffusion model by taking the error between the calculated prediction noise and the actual noise of the noise image as the optimization target; the diffusion model is used for generating, based on the input prompt text and image sample, a target image which has the same image style as the image sample and the same image content as described by the prompt text.
In practical application, the prompt text and a piece of randomly generated noise image are input into a stable diffusion model, so that the stable diffusion model calculates the prediction noise of the noise image based on the text characteristics of the prompt text, and the noise image is denoised based on the prediction noise to generate an image corresponding to the prompt text. For example, if the prompt text is "a Persian cat having blue eyes and white hair," an image of the Persian cat having blue eyes and white hair is generated, but the Persian cat may be positioned at different locations in the image and have different poses. It can thus be seen that this approach does not meet the need to preserve the image content in the content image while performing the image stylization task.
In order to make the image content of the generated target image more similar to the image content of the image training sample, a noise image corresponding to the image training sample may be generated on the basis of the image content of the image training sample for subsequent input into the diffusion model, so that the image content of the generated target image is more similar to the image content of the image training sample.
In the embodiment of the present specification, a gray-scale map of the image training sample is acquired. The gray-scale map mentioned here is used to represent the image content of the image training sample. Noise is added to the image training sample, and the noised image training sample is superposed with the gray-scale map of the image training sample to obtain the noise image corresponding to the image training sample. The noise mentioned here may be Gaussian noise, i.e. a class of noise whose probability density function follows a Gaussian (normal) distribution.
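One plausible way to construct such a noise image is sketched below in PyTorch; the equal-weight blending of the noised sample and the gray-scale map is an assumption, since the text only states that the two are superposed.

    import torch

    def make_noise_image(x0: torch.Tensor, a_bar_t: float):
        """x0: image training sample, shape (B, 3, H, W); a_bar_t: cumulative schedule value at round t."""
        # gray-scale map carrying the image content (ITU-R BT.601 weights)
        gray = 0.299 * x0[:, 0:1] + 0.587 * x0[:, 1:2] + 0.114 * x0[:, 2:3]
        gray = gray.expand_as(x0)

        # add Gaussian noise to the image training sample
        noise = torch.randn_like(x0)
        noised = (a_bar_t ** 0.5) * x0 + ((1.0 - a_bar_t) ** 0.5) * noise

        # superpose the noised sample with its gray-scale map (equal weighting assumed)
        noise_image = 0.5 * noised + 0.5 * gray
        return noise_image, noise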
It can be seen that the noise image is used for the image stylization task, so that more image content in the image training sample can be reserved, and the stylization effect is improved.
In practical application, the CNN-based style migration algorithm does not change the content of the content image and only performs a hard migration of the image style and colour. As a result, part of the image information of the content image is lost, details in the content image become blurred, the generated stylized image has low sharpness, the stylization effect is poor, and the result lacks the realism and expressiveness of an artwork. Referring specifically to fig. 4, fig. 4 is a schematic diagram illustrating a style migration process based on a CNN implementation according to an exemplary embodiment.
A stable diffusion model, although better at image generation, has no style feature extraction layer: it can only generate an image according to the prompt text, cannot obtain style information from a style image, and therefore cannot perform the image stylization task.
Based on the above, the improved diffusion model in the present specification can perform feature fusion on the image features of the noise image, the text features of the prompt text and the style features of the image training sample, so as to generate a target image which has the same image style as the style image and the same image content as described by the prompt text. Compared with a CNN-based style migration algorithm, the sharpness of the image content of the original image is not damaged, so the sharpness of the generated image is not reduced compared with the original image, and the stylization effect is improved.
In the embodiment of the specification, a prompt text, a noise image and an image training sample are input into a diffusion model to be trained, image features of the noise image, text features of the prompt text and style features of the image training sample are extracted by the diffusion model, feature fusion is carried out on the image features, the text features and the style features, prediction noise of the noise image is calculated based on fusion features obtained by the feature fusion, and error between the calculated prediction noise and actual noise of the noise image is minimized to serve as an optimization target to train the diffusion model; the diffusion model is used for generating a target image which is the same as the image style of the image sample and the same as the image content described by the prompt text based on the input prompt text and the image sample. Referring specifically to fig. 5, fig. 5 is a schematic diagram illustrating a style migration process implemented based on a diffusion model according to an exemplary embodiment.
There are various ways to perform feature fusion on the image features, the text features and the style features, for example, a neural network for feature fusion, a multi-modal deep learning model, and the like. As an example, the image features, the text features and the style features are concatenated and input into a Transformer-based neural network, and the network outputs the fused features. Through multiple Transformer layers, the image features, the text features and the style features can be fused into a whole to obtain the fusion features, so that the diffusion model can calculate the prediction noise of the noise image based on the fusion features.
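As one concrete (but not prescribed) realization of this idea, the three feature sequences can be concatenated along the token dimension and passed through a small Transformer encoder:

    import torch
    import torch.nn as nn

    class SimpleFusion(nn.Module):
        """Fuse image, text and style tokens with a Transformer encoder (illustrative only)."""
        def __init__(self, dim: int = 512, layers: int = 4, heads: int = 8):
            super().__init__()
            layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, num_layers=layers)

        def forward(self, img_feat, txt_feat, style_feat):
            # each input: (B, N_i, dim); concatenate along the token axis
            tokens = torch.cat([img_feat, txt_feat, style_feat], dim=1)
            return self.encoder(tokens)         # fused features, shape (B, N_total, dim)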
Further, in the process of obtaining the prompt text corresponding to the image training sample, the image encoding layer (Image Encoder) in the CLIP model may be applied to extract image features of an image, and the text encoding layer (Text Encoder) may be applied to extract text features of a text, so as to obtain a content description text matching the image training sample according to the matching relationship between an image and its corresponding text description, and further obtain the prompt text corresponding to the image training sample.
In order to avoid the problem that differences occur between the image features of the extracted noise image and the text features of the prompt text due to the application of different image coding layers and text coding layers in the diffusion model, the content feature extraction layer may apply the image coding layer of the CLIP model to extract the image features of the noise image and apply the text coding layer of the CLIP model to extract the text features of the prompt text, so as to improve the accuracy of the image content of the target image generated later.
The style feature extraction layer may extract image features of the image training sample using the encoder of a variational autoencoder (Variational Autoencoder, VAE), or using the encoder of a convolutional neural network (Convolutional Neural Network, CNN). The present specification does not limit the method of extracting image features. Then, a Gram matrix corresponding to the image training sample is calculated based on the image features of the image training sample. The Gram matrix mentioned here may be used to represent the style features of the image training sample.
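The Gram matrix itself is a standard construction and can be computed from a feature map as follows (a minimal sketch):

    import torch

    def gram_matrix(features: torch.Tensor) -> torch.Tensor:
        """features: (B, C, H, W) -> Gram matrix of shape (B, C, C), normalised by H*W."""
        b, c, h, w = features.shape
        f = features.view(b, c, h * w)
        return torch.bmm(f, f.transpose(1, 2)) / (h * w)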
In the embodiment of the present specification, the fusion layer includes: and a plurality of network sublayers, and a fusion sublayer corresponding to the network sublayers except the network sublayers of the last layer one by one. Wherein the convergence sublayer corresponding to each network sublayer is located after the network sublayer.
The image features and the text features are input to a first network sub-layer of the network sub-layers for feature fusion.
Then, inputting the fusion characteristics output by the first network sub-layer and the gram matrix corresponding to the image training sample into the first fusion sub-layer corresponding to the first network sub-layer for characteristic fusion, further inputting the style fusion characteristics and the text characteristics output by the first fusion sub-layer corresponding to the first network sub-layer into the second network sub-layer of the next layer for characteristic fusion, then inputting the fusion characteristics output by the second network sub-layer and the gram matrix corresponding to the image training sample into the second fusion sub-layer corresponding to the second network sub-layer for characteristic fusion, further inputting the style fusion characteristics and the text characteristics output by the second fusion sub-layer corresponding to the second network sub-layer into the second network sub-layer of the next layer for executing the same action, and so on until the network sub-layer of the last layer calculates the prediction noise of the noise image.
Next, a model structure of the diffusion model will be described with reference to the model structure shown in fig. 2.
First, the noise image and the prompt text are input into the content feature extraction layer, image features of the noise image and text features of the prompt text are extracted, and the image features and the text features are input into the fusion layer.
And secondly, inputting the style image into a style feature extraction layer, extracting style features corresponding to the style image, and inputting the style features into a fusion layer. The style characteristics may be represented by a gram matrix.
And then, inputting the image features and the text features into the first coding layer for feature fusion to obtain fusion features output by the first coding layer. And inputting the fusion characteristics and the gram matrix output by the first coding layer into a first fusion sublayer corresponding to the first coding layer to perform characteristic fusion, so as to obtain style fusion characteristics output by the first fusion sublayer.
And then, further inputting the style fusion characteristics and the text characteristics output by the first fusion sub-layer corresponding to the first coding layer into the second coding layer to execute the same actions, and the like until the Nth coding layer.
And then, inputting the style fusion characteristics output by the Nth fusion sub-layer corresponding to the Nth coding layer into the first decoding layer for noise prediction to obtain the fusion characteristics output by the first decoding layer. And inputting the fusion characteristics and the gram matrix output by the first decoding layer into an N+1th fusion sub-layer corresponding to the first decoding layer to perform characteristic fusion, so as to obtain style fusion characteristics output by the N+1th fusion sub-layer.
And finally, the style fusion features output by the (N+1)th fusion sub-layer corresponding to the first decoding layer and the text features are further input into the second decoding layer to execute the same actions, and so on until the Nth decoding layer, where the prediction noise of the noise image is calculated.
It can be seen that text features are fused in each network sub-layer, and more text features are added on the basis of image features, so that the image content of the target image generated by the diffusion model is closer to the image content of the image training sample. And each fusion sub-layer fuses the style characteristics, and more style characteristics are added on the basis of the image characteristics, so that the image style of the target image generated by the diffusion model is more similar to that of the image training sample.
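The alternation described above can be summarised in a short sketch that treats every network sub-layer and fusion sub-layer as a black box; it is a simplification that omits the U-Net skip connections and the exact sub-layer interfaces, which are not spelled out in this document.

    def fusion_layer_forward(img_feat, txt_feat, gram, net_sublayers, fuse_sublayers):
        """net_sublayers: list of network sub-layers; fuse_sublayers: one fewer fusion sub-layers."""
        x = img_feat
        for i, net in enumerate(net_sublayers):
            x = net(x, txt_feat)                # text features are fused in every network sub-layer
            if i < len(fuse_sublayers):         # no fusion sub-layer after the last network sub-layer
                x = fuse_sublayers[i](x, gram)  # style features (Gram matrix) are fused here
        return x                                # used to calculate the prediction noise of the noise image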
In practical application, the gram matrix can be used for representing the style characteristics of the image, in the image stylizing task, the gram matrix of the image characteristics of the content image and the gram matrix of the image characteristics of the style image are usually calculated respectively, the deviation between the gram matrix of the content image and the gram matrix of the style image is minimized as an optimization target, the content image is continuously adjusted, the style of the content image is continuously close to the style image, and therefore the image stylizing task is completed. It can be seen that the gram matrix cannot be directly fused with the image features to accomplish the image stylization task.
Based on this, in the present embodiment, the mean prediction matrix and the variance prediction matrix are calculated from the Gram matrix corresponding to the image training sample.
And then, inputting the fusion characteristics, the mean value prediction matrix and the variance prediction matrix which are output by the first network sub-layer into the first fusion sub-layer corresponding to the first network sub-layer for characteristic fusion, multiplying the fusion characteristics which are output by the first network sub-layer by the variance prediction matrix to obtain a multiplication result, and adding the multiplication result and the mean value prediction matrix to obtain the style fusion characteristics which are output by the first fusion sub-layer. And by analogy, inputting the gram matrix into each fusion sublayer for feature fusion, so that the image style of the target image generated by the diffusion model is more similar to that of the image training sample. As shown in particular in fig. 6.
FIG. 6 is a schematic diagram illustrating a feature fusion in accordance with an exemplary embodiment.
In fig. 6, a mean prediction matrix and a variance prediction matrix are calculated from the Gram matrix. The fusion feature output by the preceding layer is multiplied by the variance prediction matrix to obtain a multiplication result, and the multiplication result is added to the mean prediction matrix to obtain an addition result. Then, the addition result is input into a convolution layer to obtain the style fusion feature.
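A minimal sketch of such a fusion sub-layer is given below; the way the Gram matrix is reduced to per-channel statistics and the sizes of the learned heads are assumptions, since only the multiply-then-add structure and the final convolution are described above.

    import torch
    import torch.nn as nn

    class StyleFusionSublayer(nn.Module):
        def __init__(self, channels: int):
            super().__init__()
            # learned heads mapping the Gram matrix to the mean and variance prediction matrices (assumed design)
            self.to_mean = nn.Linear(channels, channels)
            self.to_var = nn.Linear(channels, channels)
            self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

        def forward(self, fused: torch.Tensor, gram: torch.Tensor) -> torch.Tensor:
            # fused: (B, C, H, W), output of the preceding network sub-layer
            # gram:  (B, C, C), style representation of the style image
            stats = gram.mean(dim=2)                               # (B, C) summary of the Gram matrix
            mean = self.to_mean(stats).view(-1, fused.size(1), 1, 1)
            var = self.to_var(stats).view(-1, fused.size(1), 1, 1)
            out = fused * var + mean                               # multiply, then add, as in Fig. 6
            return self.conv(out)                                  # convolution yields the style fusion feature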
After the diffusion model is trained in this way, it has the capability of taking image content and an image style as input and generating a target image. However, since the image content and the image style input in this way both come from the same image training sample, the trained diffusion model can complete the image stylization task, but its generalization capability as a text-to-image model is poor.
Based on the method, other image training samples which are the same as the image training samples in image style and different in image content are obtained, and the diffusion model is further optimized and trained, so that the diffusion model has the capability of generating a target image based on the image content and the image style from different image training samples, and the generalization capability of the diffusion model is improved.
In the embodiment of the present specification, other image training samples having the same image style as the image training sample and different image contents are acquired.
Then, the prompt text, the noise image and other image training samples are input into a diffusion model, the image characteristics of the noise image, the text characteristics of the prompt text and the style characteristics of other image training samples are extracted by the diffusion model, the image characteristics, the text characteristics and the style characteristics of other training images are subjected to characteristic fusion, the prediction noise of the noise image is calculated based on the fusion characteristics obtained by the characteristic fusion, and the error between the calculated prediction noise and the actual noise of the noise image is minimized to serve as an optimization target, so that the diffusion model is further optimally trained.
Specifically, a prompt text describing the image style of the image training sample is input into a preset text-generating image model to generate other image training samples which have the same image style as the image training sample but different image content. The preset text-generating image model mentioned here may be a LoRA model (Low-Rank Adaptation of Large Language Models, i.e. low-rank adaptation of a large model). The LoRA model freezes the weight parameters of a pre-trained stable diffusion model and adds additional network layers to the stable diffusion model. Because the number of parameters of the newly added network layers is small, training only the newly added network layer parameters can reduce the cost of model training while obtaining an effect similar to fine-tuning the full model.
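The low-rank adaptation idea itself can be illustrated on a single linear layer: the pretrained weight is frozen and only a small low-rank update is trained. This is a generic sketch of that principle, not the concrete configuration used in this specification.

    import torch
    import torch.nn as nn

    class LoRALinear(nn.Module):
        def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 1.0):
            super().__init__()
            self.base = base
            for p in self.base.parameters():
                p.requires_grad_(False)          # freeze the pretrained weights
            self.down = nn.Linear(base.in_features, rank, bias=False)   # small number of new parameters
            self.up = nn.Linear(rank, base.out_features, bias=False)
            nn.init.zeros_(self.up.weight)       # the update starts at zero, so behaviour is unchanged at first
            self.scale = alpha / rank

        def forward(self, x):
            return self.base(x) + self.scale * self.up(self.down(x))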
The process of further optimizing the diffusion model based on the other image training samples having the same image style and different image contents as the image training samples is the same as the process of training the diffusion model based on the image training samples described above, except that the style features are derived from different image training samples.
In the embodiment of the present specification, a specific formula for determining an error between the calculated predicted noise of the noise image and the actual noise of the noise image is as follows:
L = E_{x_0, ε, t} [ ‖ ε − z_θ(x_t, t) ‖² ]
In the above formula, ε represents the actual noise added to obtain the noise image, z_θ represents the fusion layer of the diffusion model, which outputs the predicted noise (conditioned on the text features and style features described above), x_0 represents the image training sample, x_t represents the noise image obtained from x_0 at round t, and t represents the round at which noise is added.
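A compact training-step sketch corresponding to this objective is shown below; model, optimizer and add_noise are placeholders for the components described above (add_noise as in the earlier forward-noising sketch), and the signature of model is an assumption.

    import torch
    import torch.nn.functional as F

    def training_step(model, optimizer, x0, prompt_tokens, style_image, add_noise, T=1000):
        t = torch.randint(0, T, (x0.shape[0],), device=x0.device)    # round at which noise is added
        x_t, actual_noise = add_noise(x0, t)                          # noise image and its actual noise
        predicted_noise = model(x_t, prompt_tokens, style_image, t)   # z_theta, conditioned on text and style
        loss = F.mse_loss(predicted_noise, actual_noise)              # error between predicted and actual noise
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()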
Fig. 7 is a flowchart of an image generation method according to an exemplary embodiment, specifically including the following steps:
s700: and acquiring a content image, a prompt text corresponding to the content image and a style image.
S702: generating a noise image corresponding to the content image, inputting the noise image, the prompt text corresponding to the content image and the style image into the trained diffusion model, extracting image features of the noise image, text features of the prompt text and style features of the style image by the diffusion model, carrying out feature fusion on the image features, the text features and the style features, calculating the prediction noise of the noise image based on the fusion features obtained by feature fusion, and generating, based on the prediction noise, a target image which has the same image style as the style image and the same image content as described by the prompt text.
In the embodiment of the present specification, after training of the text generation image model is completed, a content image, a presentation text corresponding to the content image, and a style image are acquired.
Then, a noise image corresponding to the content image is generated, the noise image, a prompt text corresponding to the content image and a style image are input into a trained diffusion model, so that image features of the noise image, text features of the prompt text, style features of the style image are extracted by the diffusion model, feature fusion is performed on the image features, the text features and the style features, prediction noise of the noise image is calculated based on the fusion features obtained by the feature fusion, and a target image which is identical to the image style of the style image and identical to the image content described by the prompt text is generated based on the prediction noise. The specific formula is as follows:
in the above formula, z θ May be used to represent a fusion layer of the diffusion model. X is x t May be used to represent noisy images. X is x t-1 May be used to represent the denoised noise image of the t-1 th pass. It can be seen that the noise image can be step by step denoised by the above formula to generate the target image.
According to the method, the prompt text, the noise image and the image training sample are input into the diffusion model to be trained; the diffusion model extracts the image features of the noise image, the text features of the prompt text and the style features of the image training sample, performs feature fusion on the image features, the text features and the style features, calculates the prediction noise of the noise image based on the fusion features obtained by the feature fusion, and is trained with the error between the calculated prediction noise and the actual noise of the noise image as the optimization target. In this way, a target image which has the same image style as the image sample and the same image content as described by the input prompt text is generated, so that style migration of the image is realized. Compared with a CNN-based style migration algorithm, the method does not perform a hard migration of the image style and does not damage the sharpness of the image content of the original image, so the sharpness of the generated image is not reduced compared with the original image, and the stylization effect is improved.
Corresponding to the embodiment of the method for constructing the diffusion model, the present specification also provides an embodiment of the diffusion model.
Referring to fig. 8, fig. 8 is a block diagram of an electronic device with a diffusion model according to an exemplary embodiment. At the hardware level, the device includes a processor 802, an internal bus 804, a network interface 806, memory 808, and non-volatile storage 810, although other hardware requirements are possible. One or more embodiments of the present description may be implemented in a software-based manner, such as by the processor 802 reading a corresponding computer program from the non-volatile memory 810 into the memory 808 and then running. Of course, in addition to software implementation, one or more embodiments of the present disclosure do not exclude other implementation manners, such as a logic device or a combination of software and hardware, etc., that is, the execution subject of the following processing flow is not limited to each logic unit, but may also be hardware or a logic device.
Referring to fig. 9, fig. 9 is a block diagram of a diffusion model according to an exemplary embodiment. The diffusion model can be applied to the electronic device shown in fig. 8 to realize the technical scheme of the specification. Wherein the diffusion model may include:
A content feature extraction layer 900, configured to obtain an input image sample and a prompt text corresponding to the image sample, extract an image feature corresponding to the image sample, extract a text feature corresponding to the prompt text, and input the image feature and the text feature to a fusion layer;
a style feature extraction layer 902, configured to obtain an input image sample, extract style features of the image sample, and input the style features to the fusion layer;
and a fusion layer 904, configured to perform feature fusion on the input image feature, the text feature, and the style feature, calculate a prediction noise of the noise image based on the fusion feature obtained by the feature fusion, and generate a target image having the same image style as the image sample and the same image content as the prompt text based on the prediction noise.
Optionally, the fusion layer 904 is specifically further configured to obtain other image training samples having the same image style and different image contents as the image training samples, input the prompt text, the noise image, and the other image training samples into the diffusion model, extract, from the diffusion model, image features of the noise image, text features of the prompt text, style features of the other image training samples, perform feature fusion on the image features, the text features, and style features of the other training images, calculate prediction noise of the noise image based on the fusion features obtained by feature fusion, and further perform optimization training on the diffusion model with the calculated error between the prediction noise and actual noise of the noise image as an optimization target.
Optionally, the fusion layer 904 is specifically further configured to input a prompt text for describing an image style of the image training sample to a preset text to generate an image model, so as to generate other image training samples with the same image style and different image content as the image training sample.
Optionally, the preset text-generating image model includes a LoRA model.
Optionally, the content feature extraction layer 900 is specifically configured to input the image training sample into a preset image generation text model, generate a content description text corresponding to the image training sample, input the image training sample and a preset content description text set into a pre-trained text matching model, obtain a content description text matched with the image training sample, combine the content description text generated by the image generation text model with the content description text matched with the image training sample, and obtain a prompt text corresponding to the image training sample.
Optionally, the preset image-to-text generation model includes a BLIP model; the image-text matching model includes a CLIP model.
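For illustration, the following sketch combines a BLIP-generated caption with a CLIP-matched description to form the prompt text; the model checkpoints and the candidate description set are placeholder assumptions.

```python
import torch
from PIL import Image
from transformers import (BlipProcessor, BlipForConditionalGeneration,
                          CLIPProcessor, CLIPModel)

image = Image.open("training_sample.png").convert("RGB")

# 1. BLIP generates a free-form content description of the training sample.
blip_proc = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
blip = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")
blip_inputs = blip_proc(image, return_tensors="pt")
caption = blip_proc.decode(blip.generate(**blip_inputs)[0], skip_special_tokens=True)

# 2. CLIP scores a preset set of candidate descriptions against the sample
#    and keeps the best match.
candidates = ["a portrait", "a landscape", "an animal", "a building"]  # preset set
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_inputs = clip_proc(text=candidates, images=image, return_tensors="pt", padding=True)
scores = clip(**clip_inputs).logits_per_image[0]
matched = candidates[int(torch.argmax(scores))]

# 3. Combining the two texts gives the prompt text for the sample.
prompt_text = f"{caption}, {matched}"
```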
Optionally, the content feature extraction layer 900 is specifically configured to obtain a grayscale map of the image training sample, where the grayscale map is used to represent the image content of the image training sample; add noise to the image training sample; and superimpose the noise-added image training sample with the grayscale map of the image training sample to obtain the noise image corresponding to the image training sample.
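A minimal sketch of this noise-image construction follows; the luminance weights for the grayscale map, the Gaussian noise, and the plain additive superposition are assumptions, since the specification does not fix the exact weighting.

```python
import torch

def make_noise_image(sample: torch.Tensor, noise_scale: float = 1.0):
    """Build the noise image from a training sample of shape (C, H, W) in [0, 1].

    The grayscale map carries the image content; the noised sample and the
    grayscale map are superimposed to give the noise image.
    """
    # Grayscale map representing the image content (standard luminance weights).
    r, g, b = sample[0], sample[1], sample[2]
    gray = (0.299 * r + 0.587 * g + 0.114 * b).unsqueeze(0).expand_as(sample)

    # Add Gaussian noise to the training sample.
    actual_noise = torch.randn_like(sample) * noise_scale
    noised_sample = sample + actual_noise

    # Superimpose the noise-added sample with the grayscale map.
    noise_image = noised_sample + gray
    return noise_image, actual_noise
```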
Optionally, the style feature extraction layer 902 is specifically configured to extract image features of the image training sample, and calculate a gram matrix corresponding to the image training sample according to the image features of the image training sample, where the gram matrix is used to represent style features of the image training sample.
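The gram matrix computation can be sketched as follows; the normalization factor is an assumption.

```python
import torch

def gram_matrix(features: torch.Tensor) -> torch.Tensor:
    """Gram matrix of image features: (B, C, H, W) -> (B, C, C).

    Each entry is the inner product between a pair of channel feature maps,
    which captures style (texture and colour statistics) rather than content layout.
    """
    b, c, h, w = features.shape
    flat = features.view(b, c, h * w)
    gram = torch.bmm(flat, flat.transpose(1, 2)) / (c * h * w)  # normalized
    return gram
```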
Optionally, the diffusion model includes:
the content feature extraction layer is used for extracting the image features of the noise image and the text features of the prompt text;
the style feature extraction layer is used for extracting the style features of the image training sample;
and the fusion layer is used for carrying out feature fusion on the image features, the text features and the style features, and calculating the prediction noise of the noise image based on the fusion features obtained by the feature fusion.
Optionally, the fusion layer includes: a plurality of network sublayers and a plurality of fusion sublayers in one-to-one correspondence with the network sublayers other than the last network sublayer, wherein the fusion sublayer corresponding to each network sublayer is located after that network sublayer;
the fusion layer 904 is specifically configured to input the image features and the text features into the first of the plurality of network sublayers for feature fusion; input the fusion features output by the first network sublayer, together with the gram matrix corresponding to the image training sample, into the first fusion sublayer corresponding to the first network sublayer for feature fusion; then input the style fusion features output by that first fusion sublayer, together with the text features, into the second network sublayer of the next layer to perform the same actions, and so on until the last network sublayer, so as to calculate the prediction noise of the noise image.
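This chaining of network sublayers and fusion sublayers can be sketched as follows; the sublayer interfaces and the final noise head are illustrative assumptions.

```python
import torch.nn as nn

class FusionLayer(nn.Module):
    """Sketch of the chained network sublayers and fusion sublayers.

    Every network sublayer except the last is followed by a fusion sublayer that
    injects the style (gram matrix) into the features before the next sublayer.
    """

    def __init__(self, network_sublayers, fusion_sublayers, noise_head):
        super().__init__()
        assert len(fusion_sublayers) == len(network_sublayers) - 1
        self.network_sublayers = nn.ModuleList(network_sublayers)
        self.fusion_sublayers = nn.ModuleList(fusion_sublayers)
        self.noise_head = noise_head  # maps the final features to the predicted noise

    def forward(self, image_feat, text_feat, gram):
        feat = image_feat
        for i, net in enumerate(self.network_sublayers):
            feat = net(feat, text_feat)                      # fuse image and text features
            if i < len(self.fusion_sublayers):               # the last layer has no fusion sublayer
                feat = self.fusion_sublayers[i](feat, gram)  # fuse the style features
        return self.noise_head(feat)                         # predicted noise of the noise image
```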
Optionally, the fusion layer 904 is specifically configured to calculate a mean prediction matrix and a variance prediction matrix from the gram matrix corresponding to the image training sample, and input the fusion features output by the first network sublayer, the mean prediction matrix, and the variance prediction matrix into the first fusion sublayer corresponding to the first network sublayer for feature fusion: the fusion features output by the first network sublayer are multiplied by the variance prediction matrix to obtain a multiplication result, and the multiplication result is added to the mean prediction matrix to obtain the style fusion features output by the first fusion sublayer.
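A minimal sketch of one such fusion sublayer is given below. Mapping the gram matrix to per-channel mean and variance predictions with linear layers is an assumption; the multiply-then-add modulation follows the description above.

```python
import torch
import torch.nn as nn

class StyleFusionSublayer(nn.Module):
    """The gram matrix is mapped to a mean prediction and a variance prediction,
    which shift and rescale the features from the preceding network sublayer
    (an AdaIN-like modulation)."""

    def __init__(self, gram_dim: int, feature_channels: int):
        super().__init__()
        # Illustrative projections from the flattened gram matrix to per-channel
        # mean / variance predictions; the real mapping is not specified here.
        self.to_mean = nn.Linear(gram_dim * gram_dim, feature_channels)
        self.to_var = nn.Linear(gram_dim * gram_dim, feature_channels)

    def forward(self, fused_features: torch.Tensor, gram: torch.Tensor) -> torch.Tensor:
        # fused_features: (B, C, H, W); gram: (B, gram_dim, gram_dim)
        flat_gram = gram.flatten(1)
        mean_pred = self.to_mean(flat_gram)[:, :, None, None]  # mean prediction matrix
        var_pred = self.to_var(flat_gram)[:, :, None, None]    # variance prediction matrix
        # Multiply by the variance prediction, then add the mean prediction.
        return fused_features * var_pred + mean_pred
```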
Optionally, the plurality of network sublayers are the network sublayers contained in a U-Net neural network; the fusion layer is a U-Net neural network in which a corresponding fusion sublayer is added after each of its network sublayers except the last one.
The present specification also provides an embodiment of an image generating apparatus corresponding to the embodiment of the image generating method described above.
Referring to fig. 10, fig. 10 is a block diagram of an image generating apparatus according to an exemplary embodiment. The image generating apparatus may be applied to an electronic device as shown in fig. 8 to implement the technical solution of the present specification. Wherein the image generating apparatus may include:
An acquiring module 1000, configured to acquire a content image, a prompt text corresponding to the content image, and a style image;
a generating module 1002, configured to generate a noise image corresponding to the content image, and input the noise image, the prompt text corresponding to the content image, and the style image into a trained diffusion model, so as to extract, by the diffusion model, image features of the noise image, text features of the prompt text, and style features of the style image, perform feature fusion on the image features, the text features, and the style features, calculate a prediction noise of the noise image based on the fusion features obtained by the feature fusion, and generate, based on the prediction noise, a target image having the same image style as the style image and the same image content as that described by the prompt text, where the diffusion model is obtained by the above-described diffusion model construction method.
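For illustration, the denoising loop at inference time might look like the following sketch; it assumes a diffusers-style scheduler and reuses the model and noise-image helper sketched earlier, and the step count and interfaces are assumptions rather than values from this specification.

```python
import torch

@torch.no_grad()
def generate(model, scheduler, prompt_tokens, content_image, style_image, steps=50):
    """Iteratively remove the predicted noise to obtain the target image.

    Assumes a diffusers-style scheduler exposing `set_timesteps`, `timesteps`,
    and `step(...)`, and the `make_noise_image` helper from the earlier sketch.
    """
    noise_image, _ = make_noise_image(content_image)  # noise image for the content image
    scheduler.set_timesteps(steps)
    sample = noise_image.unsqueeze(0)
    for t in scheduler.timesteps:
        predicted_noise = model(sample, prompt_tokens, style_image.unsqueeze(0), t)
        sample = scheduler.step(predicted_noise, t, sample).prev_sample
    # Target image: style of the style image, content described by the prompt text.
    return sample.squeeze(0)
```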
For the implementation process of the functions and roles of each unit in the above apparatus, reference may be made to the implementation process of the corresponding steps in the above method; details are not repeated here.
Since the apparatus embodiments essentially correspond to the method embodiments, reference may be made to the description of the method embodiments for the relevant parts. The apparatus embodiments described above are merely illustrative: the units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purposes of this specification. Those of ordinary skill in the art can understand and implement them without undue burden.
The system, apparatus, module or unit set forth in the above embodiments may be implemented in particular by a computer chip or entity, or by a product having a certain function. A typical implementation device is a computer, which may be in the form of a personal computer, laptop computer, cellular telephone, camera phone, smart phone, personal digital assistant, media player, navigation device, email device, game console, tablet computer, wearable device, or a combination of any of these devices.
In a typical configuration, a computer includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include a volatile memory in a computer-readable medium, such as a random access memory (RAM), and/or a non-volatile memory, such as a read-only memory (ROM) or a flash memory (flash RAM). The memory is an example of the computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic disk storage, quantum memory, graphene-based storage media, or other magnetic storage devices, or any other non-transmission medium, which can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transmission media), such as modulated data signals and carrier waves.
User information (including but not limited to user equipment information, user personal information, etc.) and data (including but not limited to data for analysis, stored data, displayed data, etc.) referred to in this specification are information and data authorized by the user or fully authorized by all parties. The collection, use, and processing of the relevant data must comply with the relevant laws, regulations, and standards of the relevant countries and regions, and corresponding operation portals are provided for the user to choose to authorize or deny.
It should also be noted that the terms "comprise," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, so that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
The foregoing describes specific embodiments of the present disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims can be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
The terminology used in the one or more embodiments of the specification is for the purpose of describing particular embodiments only and is not intended to be limiting of the one or more embodiments of the specification. As used in this specification, one or more embodiments and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any or all possible combinations of one or more of the associated listed items.
It should be understood that although the terms first, second, third, etc. may be used in one or more embodiments of this specification to describe various information, the information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information and, similarly, second information may also be referred to as first information, without departing from the scope of one or more embodiments of this specification. Depending on the context, the word "if" as used herein may be interpreted as "when" or "upon" or "in response to determining".
The foregoing description is merely of preferred embodiments of one or more embodiments of this specification and is not intended to limit the embodiments to the particular forms described.

Claims (17)

1. A method of constructing a diffusion model, the method comprising:
acquiring an image training sample and a prompt text corresponding to the image training sample;
generating a noise image corresponding to the image training sample, inputting the prompt text, the noise image and the image training sample into a diffusion model to be trained, extracting image features of the noise image, text features of the prompt text and style features of the image training sample by the diffusion model, carrying out feature fusion on the image features, the text features and the style features, calculating prediction noise of the noise image based on fusion features obtained by feature fusion, and training the diffusion model by taking errors between the calculated prediction noise and actual noise of the noise image as optimization targets; the diffusion model is used for generating a target image which is the same as the image style of the image sample and the same as the image content described by the prompt text based on the input prompt text and the image sample.
2. The method of claim 1, the method further comprising:
acquiring other image training samples with the same image style and different image contents as the image training samples;
inputting the prompt text, the noise image and the other image training samples into the diffusion model, extracting image features of the noise image, text features of the prompt text and style features of the other image training samples by the diffusion model, carrying out feature fusion on the image features, the text features and the style features of the other training images, calculating the prediction noise of the noise image based on fusion features obtained by feature fusion, and further carrying out optimization training on the diffusion model by taking errors between the calculated prediction noise and actual noise of the noise image as optimization targets.
3. The method of claim 2, wherein obtaining other image training samples having the same image style as and different image content from the image training sample comprises:
inputting a prompt text describing the image style of the image training sample into a preset text-to-image generation model to generate other image training samples having the same image style as the image training sample and different image content.
4. The method of claim 3, wherein the preset text-to-image generation model comprises a LoRA model.
5. The method of claim 1, obtaining a prompt text corresponding to the image training sample, comprising:
inputting the image training sample into a preset image generation text model to generate a content description text corresponding to the image training sample;
inputting the image training sample and a preset content description text set into a pre-trained image-text matching model to obtain a content description text matched with the image training sample;
and combining the content description text generated by the image generation text model with the content description text matched with the image training sample to obtain a prompt text corresponding to the image training sample.
6. The method of claim 5, wherein the preset image-to-text generation model comprises a BLIP model; the image-text matching model comprises a CLIP model.
7. The method of claim 1, generating a noise image corresponding to the image training sample, comprising:
acquiring a gray level image of the image training sample, wherein the gray level image is used for representing the image content of the image training sample;
And adding noise to the image training sample, and superposing the image training sample added with the noise and a gray level diagram of the image training sample to obtain a noise image corresponding to the image training sample.
8. The method of claim 1, extracting style features of the image training sample, comprising:
and extracting image characteristics of the image training sample, and calculating a gram matrix corresponding to the image training sample according to the image characteristics of the image training sample, wherein the gram matrix is used for representing style characteristics of the image training sample.
9. The method of claim 1, the diffusion model comprising:
the content feature extraction layer is used for extracting image features of the noise image and text features of the prompt text;
the style characteristic extraction layer is used for extracting style characteristics of the image training sample;
and the fusion layer is used for carrying out feature fusion on the image features, the text features and the style features, and calculating the prediction noise of the noise image based on the fusion features obtained by the feature fusion.
10. The method of claim 9, the fusion layer comprising: a plurality of network sublayers and a plurality of fusion sublayers corresponding to the network sublayers except the network sublayers of the last layer one by one; wherein, the fusion sublayer corresponding to each network sublayer is positioned behind the network sublayer;
Performing feature fusion on the image features, the text features and the style features, and calculating the prediction noise of the noise image based on the fusion features obtained by the feature fusion, wherein the method comprises the following steps:
inputting the image features and the text features into a first network sub-layer of the plurality of network sub-layers for feature fusion;
and inputting the fusion characteristics output by the first network sub-layer and the gram matrix corresponding to the image training sample into a first fusion sub-layer corresponding to the first network sub-layer for characteristic fusion, inputting the style fusion characteristics output by the first fusion sub-layer corresponding to the first network sub-layer and the text characteristics into a second network sub-layer of a next layer for executing the same actions, and the like until the network sub-layer of the last layer, and calculating the prediction noise of the noise image.
11. The method of claim 10, wherein inputting the fusion features output by the first network sub-layer and the gram matrix corresponding to the image training sample into the first fusion sub-layer corresponding to the first network sub-layer for feature fusion comprises:
calculating a mean prediction matrix and a variance prediction matrix according to the gram matrix corresponding to the image training sample;
And inputting the fusion characteristics output by the first network sub-layer, the mean prediction matrix and the variance prediction matrix into a first fusion sub-layer corresponding to the first network sub-layer for characteristic fusion, multiplying the fusion characteristics output by the first network sub-layer by the variance prediction matrix to obtain a multiplication result, and adding the multiplication result to the mean prediction matrix to obtain the style fusion characteristics output by the first fusion sub-layer.
12. The method of claim 10, wherein the plurality of network sublayers are the network sublayers comprised in a U-Net neural network; the fusion layer is a U-Net neural network in which a corresponding fusion sublayer is added after each of its network sublayers except the last one.
13. An image generation method, the method comprising:
acquiring a content image, a prompt text corresponding to the content image and a style image;
generating a noise image corresponding to the content image, inputting the noise image, a prompt text corresponding to the content image and the style image into a trained diffusion model to extract image features of the noise image, text features of the prompt text, style features of the style image from the diffusion model, and carrying out feature fusion on the image features, the text features and the style features, calculating prediction noise of the noise image based on the fusion features obtained by the feature fusion, and generating a target image which is identical to the image style of the style image and identical to the image content described by the prompt text based on the prediction noise, wherein the diffusion model is obtained by the construction method of the diffusion model according to any one of claims 1 to 12.
14. A diffusion model comprising:
the content feature extraction layer is used for acquiring an input image sample and a prompt text corresponding to the image sample, extracting image features corresponding to the image sample, extracting text features corresponding to the prompt text and inputting the image features and the text features into the fusion layer;
the style characteristic extraction layer is used for acquiring an input image sample, extracting style characteristics of the image sample and inputting the style characteristics into the fusion layer;
and the fusion layer is used for carrying out feature fusion on the input image features, the text features and the style features, calculating the prediction noise of the noise image based on the fusion features obtained by the feature fusion, and generating a target image which has the same image style as the image sample and the same image content as the prompt text based on the prediction noise.
15. An image generation apparatus, the apparatus comprising:
the acquisition module is used for acquiring the content image, the prompt text corresponding to the content image and the style image;
a generating module, configured to generate a noise image corresponding to the content image, and input the noise image, a prompt text corresponding to the content image, and the style image into a diffusion model after training, so as to extract, from the diffusion model, an image feature of the noise image, a text feature of the prompt text, a style feature of the style image, and perform feature fusion on the image feature, the text feature, and the style feature, and calculate a prediction noise of the noise image based on a fusion feature obtained by feature fusion, generate, based on the prediction noise, a target image having the same image style as the style image and the same image content as the image content described by the prompt text, the diffusion model being obtained by the method for constructing the diffusion model according to any one of claims 1 to 12.
16. An electronic device, comprising a communication interface, a processor, a memory, and a bus, wherein the communication interface, the processor, and the memory are connected to each other through the bus;
the memory stores machine readable instructions, the processor executing the method of any of claims 1 to 13 by invoking the machine readable instructions.
17. A machine-readable storage medium storing machine-readable instructions which, when invoked and executed by a processor, implement the method of any one of claims 1 to 13.
CN202311598508.1A 2023-11-27 2023-11-27 Diffusion model construction method and device Pending CN117522675A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311598508.1A CN117522675A (en) 2023-11-27 2023-11-27 Diffusion model construction method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311598508.1A CN117522675A (en) 2023-11-27 2023-11-27 Diffusion model construction method and device

Publications (1)

Publication Number Publication Date
CN117522675A true CN117522675A (en) 2024-02-06

Family

ID=89756528

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311598508.1A Pending CN117522675A (en) 2023-11-27 2023-11-27 Diffusion model construction method and device

Country Status (1)

Country Link
CN (1) CN117522675A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination