CN118114124A - Text-guided controllable portrait generation method, system and equipment based on diffusion model - Google Patents

Text-guided controllable portrait generation method, system and equipment based on diffusion model

Info

Publication number
CN118114124A
Authority
CN
China
Prior art keywords
layer
text
image
encoder
input
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410511490.5A
Other languages
Chinese (zh)
Inventor
叶茫
王同鑫
张桑绮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan University WHU
Original Assignee
Wuhan University WHU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan University WHU filed Critical Wuhan University WHU
Priority to CN202410511490.5A priority Critical patent/CN118114124A/en
Publication of CN118114124A publication Critical patent/CN118114124A/en
Pending legal-status Critical Current

Landscapes

  • Image Processing (AREA)

Abstract

The invention discloses a text-guided controllable portrait generation method, system and equipment based on a diffusion model. First, a prompt text is conditionally encoded into a vector representation; then an editing-region mask specified by the semantic condition is generated based on the encoded prompt text and the source image x_0 to be processed; finally, the source image, the encoded prompt text and the editing-region mask are jointly input into a diffusion-model-based image generation network to generate an image meeting the requirements. The invention can effectively improve the controllability and quality of the generated image, reduce local blurring and enhance image fidelity.

Description

Text-guided controllable portrait generation method, system and equipment based on diffusion model
Technical Field
The invention belongs to the technical field of image generation, relates to a text-guided portrait generation method, a system and equipment, and in particular relates to a text-guided controllable portrait generation method, a system and equipment based on a diffusion model.
Background
In recent years, image generation techniques have been significantly developed. Text-guided portrait generation is a task of generating a portrait corresponding to a text description by understanding semantics in the text description. The task combines natural language processing and computer vision technology, and provides innovative possibilities for the fields of virtual character design, visual effect production, high-quality virtual data set enhancement, personalized user experience and the like.
With the progress of deep learning, significant breakthroughs have been made in the field of image generation (Creswell A, White T, Dumoulin V, et al. Generative adversarial networks: An overview. IEEE Signal Processing Magazine, 2018, 35(1): 53-65.), particularly with generative adversarial networks (GANs). In recent years, emerging techniques such as diffusion models have offered a new perspective (Ho J, Jain A, Abbeel P. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 2020, 33: 6840-6851.): by gradually converting a noise image into a real image, they generate high-quality images, are relatively stable, offer better training controllability, and are gradually replacing GAN-based models as the mainstream. However, conventional diffusion models still have problems in image generation, such as insufficient authenticity and diversity of the generated images, local detail distortion and insufficient controllability. Recently, some work in the fashion domain has explored the use of skeleton maps for controllable portrait generation (Bhunia A K, Khan S, Cholakkal H, et al. Person image synthesis via denoising diffusion model. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023: 5968-5976; Ju X, Zeng A, Zhao C, et al. HumanSD: A Native Skeleton-Guided Diffusion Model for Human Image Generation. arXiv preprint arXiv:2304.04269, 2023.) and achieved a certain effect; however, skeleton maps only provide limited posture guidance and still fall short of portrait generation with controllable details. Therefore, how to improve the controllability of the portrait generation process is an urgent problem in current research.
At present, diffusion models have attracted attention in the field of text-guided portrait generation and provide a new way to realize finely controlled portrait generation. The basic training and testing steps for text-guided portrait generation using a diffusion model are as follows:
Data preparation: prepare training and testing datasets containing text descriptions and the corresponding portrait images.
Text encoding module construction: build a text encoding module that encodes the text and establishes the association between the text description and the image.
Diffusion model construction: build a diffusion model, which typically consists of an encoder and a decoder; the encoder converts the portrait image into a latent-space representation, and the decoder associates the latent representation with the text description and generates the edited portrait image.
Diffusion model training: train the diffusion model on the training dataset; during training, the model learns the association between the text description and portrait generation and minimizes the difference between the generated editing result and the target editing.
Saving the optimal model: during training, the best-performing diffusion model is saved for later testing.
Testing: the saved optimal model is used to edit new portrait images according to text descriptions; the model generates an editing result from the text description and converts it into a realistic portrait image.
In text-guided portrait generation, the key to using a diffusion model is to build a link between the text description and the portrait and to realize fine editing control through the diffusion process. Diffusion models have better stability and training controllability, and have made remarkable progress in image generation quality and editing control. However, text-guided portrait generation with diffusion models still faces challenges such as collapse of local details and insufficient editing controllability. These problems require further research to improve the performance and effectiveness of diffusion models in text-guided portrait generation. Therefore, how to improve the quality of the generated images and enhance the controllability of the generation process is a technical difficulty to be solved urgently.
Disclosure of Invention
In order to solve the technical problems in the prior art, the invention provides a text-guided controllable portrait generation method, a system and equipment based on a diffusion model, which can conveniently realize local editing of images, improve the image generation quality and enhance the controllability in the image generation process.
The technical scheme adopted by the method is as follows: a text-guided controllable portrait generation method based on a diffusion model comprises the following steps:
Step 1: performing conditional encoding on the prompt text to obtain a vector representation to obtain text characteristics;
Step 2: generating a semantic condition appointed editing region mask based on the coded prompt text and the source image x 0 to be processed;
step 3: and commonly inputting the source image, the coded prompt text and the editing region mask into an image generation network based on a diffusion model to generate an image meeting the requirements.
Preferably, in step 1, a CLIP text encoder is used to conditionally encode the prompt text into a vector representation;
The CLIP text encoder includes 12 stacked residual attention modules;
Each residual attention module comprises a first normalization layer, a multi-head attention layer, a second normalization layer and a multi-layer perceptron layer, connected in sequence; the first and second normalization layers adopt Layer Normalization, and the multi-head attention layer has 8 heads; the multi-layer perceptron layer comprises a 512-to-2048-dimensional up-projection linear layer, a GELU activation layer and a 2048-to-512-dimensional down-projection linear layer. After a 512-dimensional text input enters the residual attention module, it passes through the first normalization layer and the multi-head attention layer and is added to the original input through a residual connection to obtain a 512-dimensional intermediate feature; this feature then passes through the second normalization layer and the multi-layer perceptron layer, where the dimension is raised from 512 to 2048 and reduced back to 512, and is finally added to its input through another residual connection to produce the final 512-dimensional text feature output.
Preferably, in step 2, the editing-region mask specified by the semantic condition is generated using a region localization network ERLM;
The region localization network ERLM includes an encoder and a decoder;
The encoder consists of three convolution layers, each with a 3×3 kernel, a stride of 1 and a padding of 0; the input of each convolution layer is the concatenation of the previous convolution layer's output feature and the text feature, and the input of the first convolution layer is the concatenation of the source image and the text feature;
The decoder consists of three deconvolution layers and upsamples the features obtained by the encoder, i.e., decodes them back to the spatial size of the source image; the input of each deconvolution layer is the residual connection of the output of the previous deconvolution layer and the output of the corresponding encoder convolution layer, and the input of the first deconvolution layer is the output of the last convolution layer of the encoder.
Preferably, the region localization network ERLM is a pre-trained network;
For the obtained predicted editing region M, the mask image is represented as x_m = x_0 ⊙ (1 − M), where ⊙ denotes the element-wise multiplication operator and x_0 denotes the source image;
During training, to complete the generation task under the given predicted editing region M, the conditional diffusion model ε_θ(z_t, t, c) is extended to ε_θ(ẑ_t, t, c), where the extended latent variable ẑ_t = concat(z_t, m, z_m) is obtained by concatenating z_t, m and z_m along the channel dimension; m, of size h×w, is the mask region downsampled from M, and z_m is the latent embedding feature of the mask image x_m. Here z_t denotes the latent embedding feature at step t, t denotes the sampling step, h and w denote the height and width of the latent space, and ε_θ is a general conditional diffusion model;
The loss function adopted in the training process is:
L = E_{z_0, P, ε~N(0,1), t} [ ||ε − ε_θ(ẑ_t, t, τ_θ(P))||² ], where τ_θ denotes the pre-trained CLIP text encoder, which takes the prompt text P as input to obtain the condition c = τ_θ(P); E denotes the expectation, z_0 denotes the latent embedding feature of the source image x_0, and ε denotes the noise added to obtain z_t.
Preferably, the diffusion-model-based image generation network in step 3 comprises an encoder module and a decoder module;
The encoder module consists of three convolution layers, each with a 3×3 kernel, a stride of 1 and a padding of 0; the input of each convolution layer is the output feature of the previous convolution layer, and the input of the first convolution layer is the extended latent ẑ_t;
The decoder module consists of three deconvolution layers and upsamples the features obtained by the encoder module; the input of each deconvolution layer is the residual connection of the output of the previous deconvolution layer and the output of the corresponding encoder layer, and the input of the first deconvolution layer is the output of the last layer of the encoder module;
A text cross-attention module is added to each layer of the encoder module and the decoder module to acquire the corresponding information from the text features.
Preferably, the diffusion-model-based image generation network in step 3 generates an image meeting the requirements. In the inference process, a classifier-free guidance technique is adopted, in which the noise prediction at each step is a weighted combination of the unconditional and conditional predictions. Let ∅ = τ_θ("") be the unconditional embedding; the noise prediction ε̃_θ at each inference step is calculated by the following formula: ε̃_θ(ẑ_t, t, c) = ε_θ(ẑ_t, t, ∅) + w·(ε_θ(ẑ_t, t, c) − ε_θ(ẑ_t, t, ∅));
where ∅ denotes the text feature obtained by feeding the empty string "" into the CLIP text encoder, and w denotes the guidance scale; a higher guidance scale encourages generating images closely associated with the prompt text P;
In each cycle of the inference process, the noisy latent embedding feature z_t is denoised using the denoising formula to obtain z_{t−1}; after the final step, the latent z_0 is passed through the decoder D to generate the edited image x̂;
The final output image meeting the requirements is x_output = x̂ ⊙ M + x_0 ⊙ (1 − M), where x̂ is the edited image generated by the decoder D after the inference process.
The system of the invention adopts the technical proposal that: a text-guided controllable portrait generation system based on a diffusion model comprises the following modules:
The encoding module is used to conditionally encode the prompt text into a vector representation;
The editing-region mask generation module is used to generate an editing-region mask specified by the semantic condition based on the encoded prompt text and the source image x_0 to be processed;
The image generation module is used to jointly input the source image, the encoded prompt text and the editing-region mask into the diffusion-model-based image generation network to generate an image meeting the requirements.
The technical scheme adopted by the equipment is as follows: a diffusion model-based text-guided controllable portrait generation device, comprising:
One or more processors;
And a storage device for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the diffusion model-based text-guided controllable portrait generation method.
Compared with the prior art, the invention has the following beneficial effects:
(1) The invention considers that controlling image generation with easily obtained text information helps lower the application threshold of the technology; it therefore focuses on enhancing the model's ability to learn cross-modal (text and image) information, and adopts a text cross-attention module in the diffusion-model-based image generation network to help align text and image information so that the generated image conforms to the semantics described by the text;
(2) The invention can use various base models as the feature learning module and is easy to integrate into existing base models to further improve performance;
(3) The invention designs ERLM to extract the editing region; in view of the lack of locally controllable editing in existing methods, the proposed method effectively improves the model's ability to control the editing region;
(4) The invention adopts a diffusion model, which is easier to train, as the main image generation network, effectively improving the image generation quality, simplifying the training process and facilitating the popularization and application of the technology.
Drawings
The following examples and specific embodiments are used, together with the accompanying drawings, to further illustrate the technical solutions of the present invention. For a person skilled in the art, other figures and applications of the present invention can be derived from these drawings without inventive effort.
FIG. 1 is a flow chart of a method according to an embodiment of the present invention;
FIG. 2 is a block diagram of a CLIP text encoder in accordance with an embodiment of the present invention;
FIG. 3 is a block diagram of the region localization network ERLM in an embodiment of the present invention;
fig. 4 is a diagram of an image generation network based on a diffusion model in an embodiment of the present invention.
Detailed Description
To facilitate the understanding and practice of the invention by those of ordinary skill in the art, the invention is described in further detail below with reference to the drawings and examples. It should be understood that the examples described herein are for illustration and explanation only and are not intended to limit the invention.
This embodiment can be divided into two stages:
In the first stage, the hidden editing-region information is located and discovered using the region localization network ERLM. Given a text prompt P describing a local garment in a fashion image x_0, the goal is to obtain a region mask M corresponding to this description. First, the text prompt P is fed into a pre-trained CLIP text encoder to obtain the text embedding e_P. The region localization network ERLM, consisting of an encoder E and a decoder D, then takes the fashion image x_0 and e_P as input.
In the second stage, after the predicted editing region is obtained, the predicted editing region and the text prompt are used as the generation conditions of a latent diffusion model (LDM), which carries out the second-stage fashion image editing and precisely edits the visual content inside the editing region.
Referring to fig. 1, the text-guided controllable portrait generating method based on the diffusion model provided in this embodiment includes the following steps:
Step 1: conditionally encoding the hint text into a vector representation;
In one embodiment, referring to fig. 2, a CLIP text encoder is used to conditionally encode the prompt text into a vector representation, i.e., the text feature e_P.
The CLIP text encoder includes 12 stacked residual attention modules;
Each residual attention module comprises a first normalization layer, a multi-head attention layer, a second normalization layer and a multi-layer perceptron layer, connected in sequence; the first and second normalization layers adopt Layer Normalization, and the multi-head attention layer has 8 heads; the multi-layer perceptron layer comprises a 512-to-2048-dimensional up-projection linear layer, a GELU activation layer and a 2048-to-512-dimensional down-projection linear layer. After a 512-dimensional text input enters the residual attention module, it passes through the first normalization layer and the multi-head attention layer and is added to the original input through a residual connection to obtain a 512-dimensional intermediate feature; this feature then passes through the second normalization layer and the multi-layer perceptron layer, where the dimension is raised from 512 to 2048 and reduced back to 512, and is finally added to its input through another residual connection to produce the final 512-dimensional text feature output.
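For illustration, a minimal PyTorch sketch of one such residual attention block is given below; the class and parameter names (ResidualAttentionBlock, d_model, n_head, d_mlp) are illustrative assumptions rather than the patent's own code, and details of the real CLIP text encoder such as the causal attention mask are omitted for brevity.

import torch
import torch.nn as nn

class ResidualAttentionBlock(nn.Module):
    """One of the 12 stacked blocks of the CLIP text encoder described above."""
    def __init__(self, d_model: int = 512, n_head: int = 8, d_mlp: int = 2048):
        super().__init__()
        self.ln_1 = nn.LayerNorm(d_model)                     # first normalization layer
        self.attn = nn.MultiheadAttention(d_model, n_head, batch_first=True)
        self.ln_2 = nn.LayerNorm(d_model)                     # second normalization layer
        self.mlp = nn.Sequential(                             # 512 -> 2048 -> 512 perceptron
            nn.Linear(d_model, d_mlp),
            nn.GELU(),
            nn.Linear(d_mlp, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, 512) token embeddings of the prompt text
        h = self.ln_1(x)
        h, _ = self.attn(h, h, h, need_weights=False)
        x = x + h                                             # first residual connection
        x = x + self.mlp(self.ln_2(x))                        # second residual connection
        return x

# Toy example: a 12-layer stack applied to a 77-token prompt embedding
encoder = nn.Sequential(*[ResidualAttentionBlock() for _ in range(12)])
tokens = torch.randn(1, 77, 512)
text_features = encoder(tokens)   # (1, 77, 512)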
Step 2: generating a semantic condition appointed editing region mask based on the coded prompt text and the source image x 0 to be processed;
In one embodiment, referring to fig. 3, the editing-region mask specified by the semantic condition is generated using the region localization network ERLM;
The region localization network ERLM includes an encoder and a decoder;
The encoder consists of three convolution layers, each with a 3×3 kernel, a stride of 1 and a padding of 0; the input of each convolution layer is the concatenation of the previous convolution layer's output feature and the text feature, and the input of the first convolution layer is the concatenation of the source image and the text feature;
The decoder consists of three deconvolution layers and upsamples the features obtained by the encoder, i.e., decodes them back to the spatial size of the source image; the input of each deconvolution layer is the residual connection of the output of the previous deconvolution layer and the output of the corresponding encoder convolution layer, and the input of the first deconvolution layer is the output of the last convolution layer of the encoder.
In one embodiment, given an input picture x_0 and the text prompt embedding e_P, the pre-trained region localization network ERLM is applied to extract the editing-region mask M, of size H×W, specified by the prompt text as the guiding condition;
where H denotes the image height and W denotes the image width.
The ERLM, consisting of the encoder E and the decoder D, takes the source image x_0 and e_P as input. The i-th layer function f_E^i of the encoder E can be described as f_E^i = E^i(concat(f_E^{i−1}, Broadcast(e_P))),
where the Broadcast operation spatially broadcasts e_P so that it has the same spatial size as the output f_E^{i−1} of encoder layer i−1, and f_E^{i−1} is set to x_0 when i = 1. The i-th layer function f_D^i of the decoder D is f_D^1 = D^1(f_E^3) and f_D^i = D^i(f_D^{i−1} + f_E^{4−i}) for i > 1,
and the output f_D^3 of the last layer of the decoder D is input into a full convolution layer to predict the editing region M.
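A minimal sketch of such an ERLM structure is given below, assuming a concrete channel width and ReLU/sigmoid activations that the patent does not specify; the class name ERLM and all variable names are illustrative, while the broadcast-concatenation of the text feature e_P and the residual skip connections follow the layer functions described above.

import torch
import torch.nn as nn

class ERLM(nn.Module):
    """Sketch of the region localization network: three 3x3 conv layers (stride 1, padding 0)
    with the text feature broadcast-concatenated at every encoder layer, three deconv layers
    with residual skip connections, and a final convolution predicting the editing-region mask."""
    def __init__(self, img_ch: int = 3, txt_dim: int = 512, ch: int = 64):
        super().__init__()
        self.enc = nn.ModuleList([
            nn.Conv2d(img_ch + txt_dim, ch, 3, stride=1, padding=0),
            nn.Conv2d(ch + txt_dim, ch, 3, stride=1, padding=0),
            nn.Conv2d(ch + txt_dim, ch, 3, stride=1, padding=0),
        ])
        self.dec = nn.ModuleList([
            nn.ConvTranspose2d(ch, ch, 3, stride=1, padding=0) for _ in range(3)
        ])
        self.out = nn.Conv2d(ch, 1, 1)   # full-convolution layer predicting the mask

    def forward(self, x0: torch.Tensor, e_p: torch.Tensor) -> torch.Tensor:
        feats, f = [], x0
        for enc_i in self.enc:
            # Broadcast(e_P): tile the text feature to the spatial size of the current input
            e = e_p[:, :, None, None].expand(-1, -1, f.shape[2], f.shape[3])
            f = torch.relu(enc_i(torch.cat([f, e], dim=1)))
            feats.append(f)
        d = feats[-1]                      # first deconv layer takes the last encoder output
        for i, dec_i in enumerate(self.dec):
            if i > 0:                      # residual connection with the mirrored encoder layer
                d = d + feats[len(feats) - 1 - i]
            d = torch.relu(dec_i(d))
        return torch.sigmoid(self.out(d))  # predicted editing-region mask M

# Toy forward pass at the 512x256 resolution used in the experiments
mask = ERLM()(torch.randn(1, 3, 512, 256), torch.randn(1, 512))   # (1, 1, 512, 256)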
In one embodiment, the region localization network ERLM is a pre-trained network. For the obtained predicted editing region M, the mask image is represented as x_m = x_0 ⊙ (1 − M), where ⊙ denotes the element-wise multiplication operator and x_0 denotes the source image;
During training, to complete the generation task under the given predicted editing region M, the diffusion model ε_θ(z_t, t, c) that predicts the noise is extended to ε_θ(ẑ_t, t, c), where the extended latent variable ẑ_t = concat(z_t, m, z_m) is obtained by concatenating z_t, m and z_m along the channel dimension; m, of size h×w, is the mask region downsampled from M, and z_m is the latent embedding feature of the mask image x_m. Here z_t denotes the latent embedding feature at step t, t denotes the sampling step, h and w denote the height and width of the latent space, and ε_θ is a general conditional diffusion model;
The loss function adopted in the training process is:
L = E_{z_0, P, ε~N(0,1), t} [ ||ε − ε_θ(ẑ_t, t, τ_θ(P))||² ], where τ_θ denotes the pre-trained CLIP text encoder, which takes the prompt text P as input to obtain the condition c = τ_θ(P); E denotes the expectation, z_0 denotes the latent embedding feature of the source image x_0, and ε denotes the noise added to obtain z_t.
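The following sketch illustrates one training step of the mask-conditioned latent diffusion model under the loss described above, assuming a standard DDPM forward-noising schedule; the callables unet, vae_encode and clip_text stand in for the noise-prediction network ε_θ, the latent encoder and the CLIP text encoder τ_θ, and are assumptions rather than the patent's implementation.

import torch
import torch.nn.functional as F

def training_loss(unet, vae_encode, clip_text, x0, x_m, M, P, alphas_cumprod):
    """One training step's loss for the mask-conditioned latent diffusion model.
    unet(z_hat_t, t, c) predicts the noise; vae_encode maps images to latent features;
    M is the (B, 1, H, W) editing-region mask as a float tensor."""
    z0  = vae_encode(x0)                                   # latent embedding of the source image
    z_m = vae_encode(x_m)                                  # latent embedding of the mask image x_m
    m   = F.interpolate(M, size=z0.shape[-2:])             # mask downsampled to the latent size h x w
    c   = clip_text(P)                                     # condition c = tau_theta(P)

    t   = torch.randint(0, alphas_cumprod.shape[0], (z0.shape[0],), device=z0.device)
    eps = torch.randn_like(z0)                             # noise added to obtain z_t
    a_t = alphas_cumprod[t].view(-1, 1, 1, 1)
    z_t = a_t.sqrt() * z0 + (1 - a_t).sqrt() * eps         # forward diffusion of z0

    z_hat_t = torch.cat([z_t, m, z_m], dim=1)              # extended latent: channel-wise concat
    return F.mse_loss(unet(z_hat_t, t, c), eps)            # || eps - eps_theta(z_hat_t, t, c) ||^2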
Step 3: jointly input the source image, the encoded prompt text and the editing-region mask into the diffusion-model-based image generation network to generate an image meeting the requirements.
In one embodiment, referring to fig. 4, the diffusion-model-based image generation network includes an encoder module and a decoder module;
The encoder module consists of three convolution layers, each with a 3×3 kernel, a stride of 1 and a padding of 0; the input of each convolution layer is the output feature of the previous convolution layer, and the input of the first convolution layer is the extended latent ẑ_t;
The decoder module consists of three deconvolution layers and upsamples the features obtained by the encoder module; the input of each deconvolution layer is the residual connection of the output of the previous deconvolution layer and the output of the corresponding encoder layer, and the input of the first deconvolution layer is the output of the last layer of the encoder module;
A text cross-attention module is added to each layer of the encoder module and the decoder module to acquire the corresponding information from the text features.
In one embodiment, the diffusion-model-based image generation network generates an image meeting the requirements, and a classifier-free guidance technique is adopted in the inference process, in which the noise prediction at each step is a weighted combination of the unconditional and conditional predictions. Let ∅ = τ_θ("") be the unconditional embedding; the noise prediction ε̃_θ at each inference step is calculated by the following formula: ε̃_θ(ẑ_t, t, c) = ε_θ(ẑ_t, t, ∅) + w·(ε_θ(ẑ_t, t, c) − ε_θ(ẑ_t, t, ∅)),
where ∅ denotes the text feature obtained by feeding the empty string "" into the CLIP text encoder, and w denotes the guidance scale; a higher guidance scale encourages generating images closely associated with the prompt text P.
In each cycle of the inference process, the noisy latent embedding feature z_t is denoised using the denoising formula to obtain z_{t−1}; after the final step, the latent z_0 is passed through the decoder D to generate the edited image x̂.
The final output image meeting the requirements is x_output = x̂ ⊙ M + x_0 ⊙ (1 − M), where x̂ is the edited image generated by the decoder D after the inference process.
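A sketch of the classifier-free-guided inference loop and the final mask compositing is given below, using the diffusers PNDMScheduler mentioned later in the embodiment; the 50 steps and guidance scale 7.5 follow the embodiment, while unet, vae_encode, vae_decode and clip_text are illustrative placeholders for the trained components rather than the patent's actual code.

import torch
from diffusers import PNDMScheduler

@torch.no_grad()
def edit_image(unet, vae_encode, vae_decode, clip_text, x0, x_m, M, P,
               steps: int = 50, w: float = 7.5):
    """Classifier-free-guided denoising loop followed by mask compositing."""
    scheduler = PNDMScheduler(num_train_timesteps=1000)
    scheduler.set_timesteps(steps)

    c, null = clip_text(P), clip_text("")                    # conditional and unconditional embeddings
    z_m = vae_encode(x_m)
    m = torch.nn.functional.interpolate(M, size=z_m.shape[-2:])
    z_t = torch.randn_like(z_m)                              # start from Gaussian noise in latent space

    for t in scheduler.timesteps:
        z_hat = torch.cat([z_t, m, z_m], dim=1)              # extended latent
        eps_c = unet(z_hat, t, c)                            # conditional noise prediction
        eps_u = unet(z_hat, t, null)                         # unconditional noise prediction
        eps = eps_u + w * (eps_c - eps_u)                    # classifier-free guidance
        z_t = scheduler.step(eps, t, z_t).prev_sample        # denoise z_t -> z_{t-1}

    x_hat = vae_decode(z_t)                                  # decoder D generates the edited image
    return x_hat * M + x0 * (1 - M)                          # keep the source outside the edit region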
In one embodiment, the diffusion model-based image generation network is a trained network; in the training process:
(1) Data preparation: training is performed on the DFMM-Spotlight dataset.
First, for each image, the dataset provides a manually parsed annotation comprising 24 semantic tags for clothing, body parts and accessories. Meanwhile, each picture is also marked with clothes shape, texture attribute and text description.
Five semantic tags were then selected, namely, coat, pants, outerwear, dress and accessories. Some pixels in the image that match the selected semantic label will be set to 1 and the remaining pixels will be set to 0, thereby generating a region hint image.
Finally, the attributes are combined into prompt texts: for each region hint extracted in the previous step, the length attribute, color attribute, fabric attribute and category text are combined into one prompt text.
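As an illustration of this data preparation, the sketch below builds a binary region hint from a parsing map and assembles a prompt text from the attributes; the label ids in SELECTED_LABELS are hypothetical placeholders, since the actual ids depend on the DFMM-Spotlight annotation scheme.

import numpy as np

# Hypothetical ids of the five selected semantic labels (top, pants, outerwear, dress,
# accessories) in the 24-class parsing map; the real ids depend on the dataset's labeling.
SELECTED_LABELS = {1, 2, 3, 4, 5}

def region_hint(parsing_map: np.ndarray, label_id: int) -> np.ndarray:
    """Pixels matching the selected semantic label are set to 1, the rest to 0."""
    return (parsing_map == label_id).astype(np.uint8)

def build_prompt(length_attr: str, color_attr: str, fabric_attr: str, category: str) -> str:
    """Combine the length, color, fabric and category attributes into one prompt text."""
    return f"{length_attr} {color_attr} {fabric_attr} {category}"

# Example usage: one region hint per selected label present in the parsing map, e.g.
# hints = {lid: region_hint(parsing, lid) for lid in SELECTED_LABELS if (parsing == lid).any()}
# paired with a prompt such as build_prompt("long-sleeve", "blue", "denim", "outerwear").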
(2) Inputting the training pictures into an image generation network based on a diffusion model for training;
First, the CLIP text encoder is trained to encode the text description into a vector representation. Secondly, the region locating network ERLM is trained to accurately locate the region to be edited in the image according to the text description. Finally, the diffusion model-based image generation network is trained to learn how to generate high quality images based on conditional inputs. In the whole training process, parameter tuning, model evaluation and verification are required to be carried out so as to ensure that the method can obtain good effects in fashion image editing tasks.
This example uses the AdamW optimizer (Loshchilov, I.; and Hutter, F. 2018. Decoupled Weight Decay Regularization. In ICLR.) to fine-tune the model on the DFMM-Spotlight dataset for 140k steps with a fixed learning rate. To save memory, a strategy of mixed precision (Micikevicius, P.; Narang, S.; Alben, J.; et al. 2018. Mixed Precision Training. In ICLR.) and gradient accumulation is employed, where the gradient-accumulation step size is set to 4 and the batch size is set to 1. In the inference phase, the PNDM scheduler (Liu, L.; Ren, Y.; Lin, Z.; and Zhao, Z. 2021. Pseudo Numerical Methods for Diffusion Models on Manifolds. In ICLR.) with 50 iterative steps is used, and the classifier-free guidance scale w is set to 7.5. For a fair comparison, this embodiment employs a Stable Diffusion v1.4 backbone network, with SDEdit and DiffEdit fine-tuned on DeepFashion-MultiModal; the fine-tuning hyper-parameters of the backbone are consistent with the second-stage model of this embodiment.
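A sketch of this fine-tuning configuration is shown below; since the embodiment's learning-rate value is not reproduced in the text, a placeholder value is used, and model, dataloader and compute_loss are illustrative arguments rather than the patent's actual code.

import torch

def finetune(model, dataloader, compute_loss,
             lr: float = 1e-5,          # placeholder: the actual learning rate is not reproduced in the text
             accum_steps: int = 4,      # gradient-accumulation step size from the embodiment
             total_updates: int = 140_000):
    """AdamW fine-tuning with mixed precision and gradient accumulation (batch size 1)."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    scaler = torch.cuda.amp.GradScaler()   # mixed-precision training to save memory
    updates = 0
    while updates < total_updates:
        for step, batch in enumerate(dataloader):
            with torch.cuda.amp.autocast():
                loss = compute_loss(model, batch) / accum_steps
            scaler.scale(loss).backward()
            if (step + 1) % accum_steps == 0:   # accumulate gradients over 4 micro-batches
                scaler.step(optimizer)
                scaler.update()
                optimizer.zero_grad()
                updates += 1
                if updates >= total_updates:
                    break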
(3) Network optimization and parameter updating;
The update includes both forward propagation and backward propagation. In forward propagation, the network computes the output and the value of the loss function. In backward propagation, the loss gradient is propagated back and the network is updated through a stochastic gradient descent optimization strategy.
(4) Generating a test of the network based on the image of the diffusion model;
In the test stage, no training or parameter updating is carried out, and the trained model is used to generate images. This example uses the Fréchet Inception Distance (FID) (Heusel et al. 2017) and the Learned Perceptual Image Patch Similarity (LPIPS) (Zhang et al. 2018b) to quantitatively evaluate the sample fidelity of the generated fashion images. In addition, to evaluate whether the edited fashion image matches the input text prompt, this embodiment adopts the CLIP score (CLIP-S) (Hessel et al. 2021). The CLIP score evaluates the correlation between the text description and the actual content of the generated image, and this embodiment finds that the CLIP-S measurement is highly consistent with human perceptual ratings. The embodiment fills the rest of the image outside the editing region with white pixels and then calculates CLIP-S.
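The following sketch illustrates this CLIP-S style evaluation with the non-edited area filled with white pixels, using the Hugging Face transformers CLIP model as an assumed stand-in for the CLIP scorer; the checkpoint name is an assumption, and the raw cosine similarity is returned rather than the exact CLIP-S scaling of Hessel et al.

import numpy as np
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")       # assumed checkpoint
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(image: Image.Image, mask: np.ndarray, prompt: str) -> float:
    """Fill everything outside the editing region with white pixels, then measure
    the cosine similarity between the image features and the prompt text features."""
    img = np.array(image)
    img[mask == 0] = 255                                    # white-fill the non-edited area
    inputs = processor(text=[prompt], images=Image.fromarray(img),
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        img_feat = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt_feat = model.get_text_features(input_ids=inputs["input_ids"],
                                           attention_mask=inputs["attention_mask"])
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
    return (img_feat @ txt_feat.T).item()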
The embodiment also provides a text-guided controllable portrait generation system based on a diffusion model, which comprises the following modules:
The encoding module is used to conditionally encode the prompt text into a vector representation;
The editing-region mask generation module is used to generate an editing-region mask specified by the semantic condition based on the encoded prompt text and the source image x_0 to be processed;
The image generation module is used to jointly input the source image, the encoded prompt text and the editing-region mask into the diffusion-model-based image generation network to generate an image meeting the requirements.
The embodiment also provides a text-guided controllable portrait generating device based on the diffusion model, which comprises:
One or more processors;
And a storage device for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the diffusion model-based text-guided controllable portrait generation method.
The invention is further illustrated by the following specific experiments;
DFMM-Spotlight was used as the training set in this experiment. To evaluate the fashion image editing model introduced in the second stage, since the DFMM-Spotlight test set contains only 2379 pairs, the experiment extended the test set. Specifically, for each text prompt in the dataset, the experiment searched for multiple text descriptions describing the same clothing category (e.g., vest, T-shirt, shorts, pants). After expansion, the experiment obtained an extended test set of 10845 image-region-text pairs.
In the experiment, all images were downsampled to 512×256 resolution. The ERLM module was trained on DFMM-Spotlight for 100 epochs with a batch size of 8, using the Adam optimizer with a fixed learning rate. The AdamW optimizer was used to fine-tune the second-stage model for 140k steps on the DFMM-Spotlight dataset with a fixed learning rate. To save memory, the experiment employed mixed precision and gradient accumulation, where the gradient-accumulation step size was set to 4 and the batch size was set to 1. In the inference phase, the experiment used the PNDM scheduler (Liu, L.; Ren, Y.; Lin, Z.; and Zhao, Z. 2021. Pseudo Numerical Methods for Diffusion Models on Manifolds. In ICLR.) with 50 iteration steps and set the classifier-free guidance scale w to 7.5. For a fair comparison, the experiment used a Stable Diffusion v1.4 backbone with SDEdit and DiffEdit fine-tuned on DeepFashion-MultiModal, and the fine-tuning hyper-parameters of the backbone are consistent with the second-stage model of the present invention.
Training phase: the experiment was performed on the DFMM-Spotlight dataset using the PyTorch deep learning framework. Since the DFMM-Spotlight test set contains only 2379 image-region-text pairs, the experiment extended the test set to evaluate the second-stage fashion image editing model: multiple text descriptions describing the same clothing category (e.g., vest, T-shirt, shorts, pants) were searched for each text prompt in the dataset, yielding an extended test set containing 10845 image-region-text pairs. The losses are computed by forward propagation, the network parameters are updated by backward propagation, and the final network model is obtained through multiple iterations.
Testing: an image is generated using the trained network model and FID, LPIPS, CLIP-S indices are calculated.
To verify the effectiveness of the present invention, the experiment compares it with existing text-guided portrait generation methods. As comparison baselines, three Stable Diffusion-based image editing methods were selected: SDEdit (Meng, C.; He, Y.; Song, Y.; Song, J.; Wu, J.; Zhu, J.-Y.; and Ermon, S. 2021. SDEdit: Guided Image Synthesis and Editing with Stochastic Differential Equations. In ICLR.), SD-Inpaint (Zhang L, Rao A, Agrawala M. Adding conditional control to text-to-image diffusion models. Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023: 3836-3847.) and DiffEdit (Couairon, G.; Verbeek, J.; Schwenk, H.; and Cord, M. 2022. DiffEdit: Diffusion-based semantic image editing with mask guidance. In ICLR.). SDEdit edits by adding noise to a partial region of the input image and then denoising it; the experiment adopts the SDEdit editing technique in the Img2Img function of Stable Diffusion and sets the strength parameter to 0.8, consistent with the original paper. SD-Inpaint is developed on the basis of Stable Diffusion with an additional inpainting function that can be used to repair missing parts of images. DiffEdit is an editing method that does not require manually adding a mask, similar to the method presented in this experiment: DiffEdit generates an automatically computed mask by comparing the prediction noise guided by the source text prompt and the edit text prompt.
The experimental results are as follows:
From the experimental results, the FID and LPIPS indicators of the proposed method are better than those of the comparison methods, showing better image generation quality; on CLIP-S the method reaches the leading level among similar methods and can generate target images that conform to the text semantics.
Compared with the existing portrait generation technology, the invention has the following advantages:
(1) The invention provides a text-driven fashion image editing method using a diffusion model, which can achieve realistic fashion image generation using only text as the initial generation condition.
(2) The invention provides an editing-region location model based on text prompts, which explicitly locates the editing region.
(3) The invention also creates a new DFMM-Spotlight dataset, which is an image-region-text dataset that can implement fine-grained text-guided local image editing.
It should be understood that the embodiments described above are only some, but not all, embodiments of the invention. In addition, the technical features of the embodiments provided by the invention may be combined with each other arbitrarily to form feasible technical solutions, without being limited by the order of steps or the structural composition, provided that such a combination can be realized by a person of ordinary skill in the art; when the technical solutions are contradictory or cannot be realized, such a combination should be considered not to exist and not to fall within the protection scope claimed by the invention.
It should be understood that the foregoing detailed description of the preferred embodiments is not intended to limit the scope of protection of the invention, which shall be defined by the appended claims; those skilled in the art may make substitutions or modifications without departing from the scope of protection of the claims of the invention.

Claims (7)

1. The text-guided controllable portrait generation method based on the diffusion model is characterized by comprising the following steps of:
Step 1: performing conditional encoding on the prompt text to obtain a vector representation to obtain text characteristics;
the method comprises the steps of performing conditional encoding on prompt texts into vector representations by using a CLIP text encoder;
The CLIP text encoder includes a 12-layer stacked residual attention module;
the residual attention module comprises a first normalization layer, a multi-head attention layer, a second normalization layer and a multi-layer perceptron layer which are sequentially connected; the first normalization Layer and the second normalization Layer adopt Layer normalization, and the head number of the multi-head attention Layer is 8; the multi-layer sensor layer comprises a 512-dimensional to 2048-dimensional ascending linear layer, a GELU activation function layer and a 2048-dimensional to 512-dimensional descending linear layer; after 512-dimensional text input enters a residual attention module, the residual attention module is connected with original input through a first normalization layer and a multi-head attention layer to obtain 512-dimensional intermediate features, the dimension is increased from 512 to 2048 and then reduced to 512 after passing through a second normalization layer and a multi-layer perceptron layer, and finally the residual attention module is connected with the second normalization layer to obtain 512-dimensional text features as output;
step 2: generating an editing area mask specified by semantic conditions based on the coded prompt text and the source image x 0 to be processed;
step 3: and commonly inputting the source image, the coded prompt text and the editing region mask into an image generation network based on a diffusion model to generate an image meeting the requirements.
2. The diffusion model-based text-guided controllable portrait generation method according to claim 1, wherein: in step 2, the editing-region mask specified by the semantic condition is generated using a region localization network ERLM, which is implemented as follows,
the region localization network ERLM includes an encoder and a decoder;
the encoder consists of three convolution layers, each with a 3×3 kernel, a stride of 1 and a padding of 0; the input of each convolution layer is the concatenation of the previous convolution layer's output feature and the text feature, and the input of the first convolution layer is the concatenation of the source image and the text feature;
the decoder consists of three deconvolution layers and upsamples the features obtained by the encoder, i.e., decodes them back to the spatial size of the source image; the input of each deconvolution layer is the residual connection of the output of the previous deconvolution layer and the output of the corresponding encoder convolution layer, and the input of the first deconvolution layer is the output of the last convolution layer of the encoder.
3. The diffusion model-based text-guided controllable portrait generation method according to claim 2, wherein: the region localization network ERLM is a pre-trained network;
for the obtained predicted editing region M, the mask image is represented as x_m = x_0 ⊙ (1 − M), where ⊙ denotes the element-wise multiplication operator and x_0 denotes the source image;
during training, to complete the generation task under the given predicted editing region M, the conditional diffusion model ε_θ(z_t, t, c) is extended to ε_θ(ẑ_t, t, c), where the extended latent variable ẑ_t = concat(z_t, m, z_m) is obtained by concatenating z_t, m and z_m along the channel dimension; m, of size h×w, is the mask region downsampled from M, and z_m is the latent embedding feature of the mask image x_m; here z_t denotes the latent embedding feature at step t, t denotes the sampling step, h and w denote the height and width of the latent space, and ε_θ is a general conditional diffusion model;
the loss function adopted in the training process is:
L = E_{z_0, P, ε~N(0,1), t} [ ||ε − ε_θ(ẑ_t, t, τ_θ(P))||² ], where τ_θ denotes the pre-trained CLIP text encoder, which takes the prompt text P as input to obtain the condition c = τ_θ(P); E denotes the expectation, z_0 denotes the latent embedding feature of the source image x_0, and ε denotes the noise added to obtain z_t.
4. The diffusion model-based text-guided controllable portrait generation method according to claim 3, wherein: the diffusion-model-based image generation network in step 3 comprises an encoder and a decoder;
the encoder consists of three convolution layers, each with a 3×3 kernel, a stride of 1 and a padding of 0; the input of each convolution layer is the output feature of the previous convolution layer, and the input of the first convolution layer is the extended latent ẑ_t;
the decoder consists of three deconvolution layers and upsamples the features obtained by the encoder; the input of each deconvolution layer is the residual connection of the output of the previous deconvolution layer and the output of the corresponding encoder layer, and the input of the first deconvolution layer is the output of the last layer of the encoder;
a text cross-attention module is added to each layer of the encoder and the decoder to acquire the corresponding information from the text features.
5. The diffusion model-based text-guided controllable portrait generation method according to claim 3, wherein: the diffusion-model-based image generation network in step 3 generates an image meeting the requirements; in the inference process, a classifier-free guidance technique is adopted, in which the noise prediction at each step is a weighted combination of the unconditional and conditional predictions; let ∅ = τ_θ("") be the unconditional embedding; the noise prediction ε̃_θ at each inference step is calculated by the following formula: ε̃_θ(ẑ_t, t, c) = ε_θ(ẑ_t, t, ∅) + w·(ε_θ(ẑ_t, t, c) − ε_θ(ẑ_t, t, ∅)),
where ∅ denotes the text feature obtained by feeding the empty string "" into the CLIP text encoder, and w denotes the guidance scale; a higher guidance scale encourages generating images closely associated with the prompt text P;
in each cycle of the inference process, the noisy latent embedding feature z_t is denoised using the denoising formula to obtain z_{t−1}; after the final step, the latent z_0 is passed through the decoder D to generate the edited image x̂;
the final output image meeting the requirements is x_output = x̂ ⊙ M + x_0 ⊙ (1 − M), where x̂ is the edited image generated by the decoder D after the inference process.
6. A diffusion model-based text-guided controllable portrait generation system, comprising the following modules:
The encoding module is used to conditionally encode the prompt text into a vector representation;
The editing-region mask generation module is used to generate an editing-region mask specified by the semantic condition based on the encoded prompt text and the source image x_0 to be processed;
The image generation module is used to jointly input the source image, the encoded prompt text and the editing-region mask into the diffusion-model-based image generation network to generate an image meeting the requirements.
7. A diffusion model-based text-guided controllable portrait generating device, comprising:
One or more processors;
Storage means for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the diffusion model-based text-guided controllable portrait generation method according to any one of claims 1 to 5.
CN202410511490.5A 2024-04-26 2024-04-26 Text-guided controllable portrait generation method, system and equipment based on diffusion model Pending CN118114124A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410511490.5A CN118114124A (en) 2024-04-26 2024-04-26 Text-guided controllable portrait generation method, system and equipment based on diffusion model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410511490.5A CN118114124A (en) 2024-04-26 2024-04-26 Text-guided controllable portrait generation method, system and equipment based on diffusion model

Publications (1)

Publication Number Publication Date
CN118114124A true CN118114124A (en) 2024-05-31

Family

ID=91219621

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410511490.5A Pending CN118114124A (en) 2024-04-26 2024-04-26 Text-guided controllable portrait generation method, system and equipment based on diffusion model

Country Status (1)

Country Link
CN (1) CN118114124A (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116383671A (en) * 2023-03-27 2023-07-04 武汉大学 Text image cross-mode pedestrian retrieval method and system with implicit relation reasoning alignment
CN116680343A (en) * 2023-06-01 2023-09-01 北京理工大学 Link prediction method based on entity and relation expression fusing multi-mode information
CN117495662A (en) * 2023-11-15 2024-02-02 南京邮电大学 Cartoon image style migration method and system based on Stable diffration

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
TONGXIN WANG et al.: "TexFit: Text-Driven Fashion Image Editing with Diffusion Models", Proceedings of the AAAI Conference on Artificial Intelligence, vol. 38, no. 9, 24 March 2024 (2024-03-24), pages 10198-10206 *
WEIXIN_47228643: "Understanding the CLIP image-text multimodal model in one article" (in Chinese), https://blog.csdn.net/weixin_47228643/article/details/136690837, 15 March 2024 (2024-03-15), pages 1-4 *

Similar Documents

Publication Publication Date Title
CN108415977B (en) Deep neural network and reinforcement learning-based generative machine reading understanding method
Kim et al. Flame: Free-form language-based motion synthesis & editing
Yang et al. Multi-sentence auxiliary adversarial networks for fine-grained text-to-image synthesis
CN109685724B (en) Symmetric perception face image completion method based on deep learning
CN111859912A (en) PCNN model-based remote supervision relationship extraction method with entity perception
CN112819013A (en) Image description method based on intra-layer and inter-layer joint global representation
CN115964467A (en) Visual situation fused rich semantic dialogue generation method
CN113343705A (en) Text semantic based detail preservation image generation method and system
CN113837229B (en) Knowledge-driven text-to-image generation method
CN117421591A (en) Multi-modal characterization learning method based on text-guided image block screening
CN114579707B (en) Aspect-level emotion analysis method based on BERT neural network and multi-semantic learning
CN115858756A (en) Shared emotion man-machine conversation system based on perception emotional tendency
CN116188621A (en) Text supervision-based bidirectional data stream generation countermeasure network image generation method
CN114743217A (en) Pedestrian identification method based on local feature perception image-text cross-modal model and model training method
Zeng et al. Expression-tailored talking face generation with adaptive cross-modal weighting
Yao et al. Dance with you: The diversity controllable dancer generation via diffusion models
CN118114124A (en) Text-guided controllable portrait generation method, system and equipment based on diffusion model
US20230262293A1 (en) Video synthesis via multimodal conditioning
WO2023154192A1 (en) Video synthesis via multimodal conditioning
CN114677569B (en) Character-image pair generation method and device based on feature decoupling
CN114065769B (en) Method, device, equipment and medium for training emotion reason pair extraction model
CN116258147A (en) Multimode comment emotion analysis method and system based on heterogram convolution
CN115965836A (en) Human behavior posture video data amplification system and method with controllable semantics
Jiang et al. FILM: How can Few-Shot Image Classification Benefit from Pre-Trained Language Models?
CN113112572B (en) Hidden space search-based image editing method guided by hand-drawn sketch

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination