CN117315417A - Diffusion model-based garment pattern fusion method and system - Google Patents

Diffusion model-based garment pattern fusion method and system

Info

Publication number
CN117315417A
CN117315417A (application number CN202311128437.9A)
Authority
CN
China
Prior art keywords
style
model
clothing
garment
stylized
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311128437.9A
Other languages
Chinese (zh)
Inventor
汤程杰
汤永川
张欣隆
何永兴
林城誉
孙凌云
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU filed Critical Zhejiang University ZJU
Priority to CN202311128437.9A
Publication of CN117315417A
Legal status: Pending


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/126 Character encoding
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/0455 Auto-encoder networks; Encoder-decoder networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00 Geometric image transformation in the plane of the image
    • G06T3/40 Scaling the whole image or part thereof
    • G06T3/4007 Interpolation-based scaling, e.g. bilinear interpolation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00 Geometric image transformation in the plane of the image
    • G06T3/40 Scaling the whole image or part thereof
    • G06T3/4046 Scaling the whole image or part thereof using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00 Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/30 Computing systems specially adapted for manufacturing

Abstract

The invention discloses a diffusion model-based garment style fusion method and system. The method comprises the following steps: constructing a large garment-generation model by performing LoRA fine-tuning on a Stable Diffusion model with a garment image-text dataset; training a stylized text encoder; training a stylized ControlNet model; and using the parameter-optimized garment-generation model, stylized text encoder, and stylized ControlNet model together as a garment style fusion model for realizing garment style fusion. A final network structure is formed through two-step stylized fine-tuning, so the resulting garment style fusion model is not limited to transferring a garment's overall color tone and painting style but also transfers detailed garment styles. It can fuse garment design details, improving design efficiency and diversity while reducing cost.

Description

Diffusion model-based garment pattern fusion method and system
Technical Field
The invention belongs to the technical field of clothing design, and particularly relates to a clothing style fusion method and system based on a diffusion model.
Background
Demand for personalized and customized garments is increasing, yet the conventional garment design process typically requires a designer to draw sample garments manually, which is time-consuming, laborious, and costly. To increase the efficiency and diversity of garment design while lowering its threshold, attempts have been made to assist the design work with large neural network models for clothing generation.
One such design-assistance algorithm is garment style fusion, which generates entirely new garment designs by fusing or crossing features of different garment styles. Garment style fusion is essentially a style transfer technique: given a style image and a content image, style and texture information is collected from the style image and transferred onto the content image, so that the content image takes on the style of the style image while keeping its overall structure. Most existing style transfer techniques with good results are based on GANs (generative adversarial networks) or diffusion models. A GAN is a neural network framework composed of a generator and a discriminator trained against each other; a diffusion model is a framework that generates a target picture by having a neural network denoise a noise picture.
Earlier style transfer techniques are mainly GAN-based. The Swapping Autoencoder (Swap AE) model, for example, encodes the content image and the style image into a structure code and a style code through two independent encoders; generation uses StyleGAN2 as the backbone, with the structure code fed in as a mask input and the style code injected into each convolution layer of StyleGAN2 to achieve style transfer. Likewise, the garment style transfer method disclosed in patent application CN 115810060A and the deep-learning-based garment style transfer method and system disclosed in patent application CN 114445268A train a GAN with a structure loss and a style loss to obtain a style transfer model; after the user uploads a garment and selects a pattern, style transfer generates a new garment with the specified pattern.
However, style transfer models built on GANs suffer from the following problems: (1) owing to the limitations of the GAN model, high-resolution pictures cannot be generated directly, and raising the resolution with an auxiliary interpolation algorithm degrades image quality, whereas the garment design field places high demands on the resolution and quality of design drafts; (2) the style information a GAN learns is the global information of the input style map, and garment details are hard to learn; yet a garment's design concept is embodied mainly in its details rather than its global style, so a generic style transfer model can hardly meet a designer's garment style fusion needs.
Style transfer based on a diffusion model is mainly realized by combining a Stable Diffusion model with a ControlNet model. The ControlNet model is a neural network structure for controlling a diffusion model; it realizes control by adding extra condition inputs. It copies the neural network weights of the original model into a trainable replica, in which the extra condition inputs used to control the model are learned. There are two ways to accomplish style transfer with ControlNet: 1. extract a line-draft or depth-map representation of the content image as structure information, describe the style information with text, and feed both into the large model to generate the content image in the described style; 2. extract a line-draft or depth-map representation of the content image as structure information, shuffle (scramble and recombine) the style image to obtain an image carrying only style information, and feed both into the large model to obtain the content image in the corresponding style.
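For illustration, the first of these two ways can be sketched with the open-source diffusers library. The checkpoint names, the Canny edge extraction standing in for a line-draft, and the prompt are assumptions for demonstration, not the models used in this application:

```python
import cv2
import numpy as np
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

# Structure: a Canny edge map of the content garment, a line-draft-like representation.
gray = np.array(Image.open("content_garment.png").convert("L"))
edges = cv2.Canny(gray, 100, 200)
control_image = Image.fromarray(np.stack([edges] * 3, axis=-1))

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet,
    torch_dtype=torch.float16).to("cuda")

# Style: carried entirely by the text prompt; structure: carried by the edge map.
image = pipe("a trench coat in a baroque brocade style",
             image=control_image, num_inference_steps=30).images[0]
image.save("styled_garment.png")
```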
Thanks to its strong generative capacity, the diffusion model can meet the resolution requirements that garment design places on generated images. However, the original large model is trained on large-scale datasets of uneven garment quality, so the quality of garments it generates is unstable. Likewise, style control in ControlNet-based style transfer is realized through text or a structurally shuffled style map, and the style information it carries is only the global information of the clothing picture; it cannot be refined down to detailed garment features.
There is therefore an urgent need to perform dedicated high-quality fine-tuning of diffusion models, and on that basis to develop a garment fusion method capable of fusing garment details.
Disclosure of Invention
In view of the foregoing, an object of the present invention is to provide a garment style fusion method and system based on a diffusion model, in which two garment pictures are input, one as a style reference and the other as a structure reference, to generate multiple new garments carrying the fused characteristics of both.
In order to achieve the above object, an embodiment of the present invention provides a garment style fusion method based on a diffusion model, including the following steps:
constructing a large garment-generation model: performing LoRA fine-tuning on a Stable Diffusion model with a garment image-text dataset to obtain the large garment-generation model;
training a stylized text encoder: adding a style indicator word simultaneously to the dictionary of the stylized text encoder and to the text description corresponding to the style map, encoding the text description carrying the style indicator word with the stylized text encoder to obtain an embedded vector with style information, feeding that embedded vector to the garment-generation model as a condition input, fixing the parameters of the garment-generation model and the irrelevant embedded vectors in the text encoder, and optimizing the embedded vector of the style indicator word with the style map as supervision;
training a stylized ControlNet model: at each time step, taking the garment structure diagram corresponding to the style map and the time as inputs of the stylized ControlNet model, taking the embedded vector of the style indicator word output by the stylized text encoder as the condition input of the stylized ControlNet model, fixing the parameters of the garment-generation model and the stylized text encoder, and optimizing only the stylized ControlNet model with the style map as supervision, so that the stylized ControlNet model aligns style semantics with structure;
and using the parameter-optimized garment-generation model, stylized text encoder, and stylized ControlNet model together as the garment style fusion model for realizing garment style fusion.
Preferably, the Stable Diffusion model comprises a VQ encoder, a VQ decoder, a condition encoder, and a denoising network. The VQ encoder maps the original garment image into a hidden space; forward diffusion over time steps 0 to T then yields the latent feature $z_T$ at time step T. The condition encoder encodes the condition text, and the denoising network performs reverse diffusion over time steps T to 0, conditioned on the encoded text and starting from the latent feature $z_T$, to realize denoising and obtain the denoised latent $z_0$ at time 0. The VQ decoder decodes $z_0$ to obtain the garment generation result.

The loss function $L_{SD}$ adopted when training the Stable Diffusion model is:

$$L_{SD} = \mathbb{E}_{z_t \sim VQ(x),\, y,\, \epsilon \sim \mathcal{N}(0,1),\, t}\left[\left\| \epsilon - \epsilon_\theta\left(z_t, t, c_\theta(y)\right) \right\|_2^2\right]$$

where $\epsilon$ denotes the real noise sample, $z_t$ the latent noise at time $t$, $c_\theta(y)$ the encoding of the condition text $y$ by the condition encoder $c_\theta$, $\epsilon_\theta(z_t, t, c_\theta(y))$ the denoising result of the denoising network given $z_t$, $t$, and $c_\theta(y)$, $\|\cdot\|_2^2$ the squared L2 norm, $z_t \sim VQ(x)$ that $z_t$ follows the latent feature $VQ(x)$ obtained by feeding the original garment image $x$ into the VQ encoder, $\mathbb{E}$ the expectation, and $\epsilon \sim \mathcal{N}(0,1)$ that $\epsilon$ follows a Gaussian distribution with mean 0 and variance 1.
Preferably, the process of performing LoRA fine-tuning on the Stable Diffusion model comprises:

initializing a dimension-reducing matrix A with a random Gaussian distribution and a dimension-raising matrix B with a zero matrix, and multiplying B with A to obtain the bypass low-rank decomposition matrix BA, which is thereby guaranteed to still be the zero matrix at the start of training; fixing all parameters of the original Stable Diffusion model during training and optimizing only the matrices A and B; and adding the parameter-optimized bypass low-rank decomposition matrix BA onto the original parameters of the Stable Diffusion model to obtain the large garment-generation model.
Preferably, the objective function adopted when training the stylized text encoder to update the embedded vector of the style indicator word is:

$$v^* = \arg\min_{v} \mathbb{E}_{z_t \sim VQ(x),\, y',\, \epsilon \sim \mathcal{N}(0,1),\, t}\left[\left\| \epsilon - \epsilon_\theta\left(z_t, t, c'_\theta(y')\right) \right\|_2^2\right]$$

where $v$ denotes the embedded vector of the style indicator word, $\epsilon$ the real noise sample, $z_t$ the latent noise at time $t$, $c'_\theta(y')$ the embedding produced by the stylized text encoder $c'_\theta$ for the style indicator word in the text description $y'$ carrying the style indicator word, $\epsilon_\theta(z_t, t, c'_\theta(y'))$ the denoising result of the denoising network given $z_t$, $t$, and $c'_\theta(y')$, $\|\cdot\|_2^2$ the squared L2 norm, $z_t \sim VQ(x)$ that $z_t$ follows the latent feature $VQ(x)$ obtained by feeding the original garment image $x$ into the VQ encoder, $\mathbb{E}$ the expectation, and $\epsilon \sim \mathcal{N}(0,1)$ that $\epsilon$ follows a Gaussian distribution with mean 0 and variance 1.
Preferably, the loss function $L_{CN}$ adopted when training the stylized ControlNet model is:

$$L_{CN} = \mathbb{E}_{z_t \sim VQ(x),\, y',\, s,\, \epsilon \sim \mathcal{N}(0,1),\, t}\left[\left\| \epsilon - \epsilon_\theta\left(z_t, t, c'_\theta(y'), c_N(s)\right) \right\|_2^2\right]$$

where $\epsilon$ denotes the real noise sample, $z_t$ the latent noise at time $t$, $c'_\theta(y')$ the embedding produced by the parameter-optimized stylized text encoder $c'_\theta$ for the style indicator word in the text description $y'$, $c_N(s)$ the encoding by the stylized ControlNet model $c_N$ of the garment structure diagram $s$ corresponding to the style map, $\epsilon_\theta(z_t, t, c'_\theta(y'), c_N(s))$ the denoising result of the denoising network given $z_t$, $t$, $c'_\theta(y')$, and $c_N(s)$, $\|\cdot\|_2^2$ the squared L2 norm, $z_t \sim VQ(x)$ that $z_t$ follows the latent feature $VQ(x)$ obtained by feeding the original garment image $x$ into the VQ encoder, $\mathbb{E}$ the expectation, and $\epsilon \sim \mathcal{N}(0,1)$ that $\epsilon$ follows a Gaussian distribution with mean 0 and variance 1.
Preferably, realizing garment style fusion with the garment style fusion model comprises the following process:

encoding the text description carrying the style indicator word with the stylized text encoder to obtain the embedded vector containing the style information corresponding to the style indicator word, fed as a condition input to the garment-generation model and the stylized ControlNet model;

aligning the style semantics of the input indicator word's embedded vector with the garment structure diagram using the stylized ControlNet model to obtain style information, fed as a condition input to the garment-generation model;

and performing garment style fusion with the garment-generation model based on the embedded vector of the input indicator word, the style information output by the ControlNet model, and random Gaussian noise, to generate a new garment image.
Preferably, during garment style fusion, the randomness of the new garment image is controlled through the random seed and the initial Gaussian noise, and the blending effect of the garment structure diagram and the style prompt words is adjusted by tuning the structure-control strength and the text-control strength.

Preferably, the garment structure diagram comprises a depth map or a line-draft map.
To achieve the above object, an embodiment further provides a diffusion model-based garment style fusion system, which comprises a garment structure diagram module, a style module, a garment style fusion module, and a visualization module;

the garment structure diagram module is used for providing garment structure diagrams and supports selecting a garment structure diagram;

the style module is used for providing style data, the style data comprising a style map or a style description text, and supports selecting style data;

the garment style fusion module performs garment style fusion based on the selected garment structure diagram and style data using the garment style fusion method described above to generate a new garment image;

the visualization module is used for visualizing the generated new garment image.
Preferably, the system further comprises a key parameter setting module that provides configuration of key parameters, the key parameters comprising the number of new garment images to generate, the structure-control strength, the text-control strength, the number of generation steps, and the random seed; controlled adjustment of the new garment image is realized by modulating these key parameters.
Compared with the prior art, the beneficial effects of the invention at least include the following:
1. Compared with the original Stable Diffusion model, the large garment-generation model obtained by fine-tuning on a large garment image-text dataset produces results of more stable quality in the garment domain, with richer patterns and more accurate text control.
2. The stylized text encoder module obtains a high-dimensional embedded vector of the style information, while the stylized ControlNet module aligns style semantic information with structure and, through additional parameters, increases the garment-generation model's capacity to generate style images.
3. Based on the training processes of the stylized text encoder and the stylized ControlNet model, a final network structure is formed through two-step stylized fine-tuning. The resulting garment style fusion model is not limited to transferring a garment's overall color tone and painting style; it transfers detailed garment styles and can fuse garment design details. Compared with direct design by a designer, generating new garment images that fuse garment styles with the fusion model improves design efficiency and diversity and reduces cost.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a diffusion model-based garment pattern fusion method provided by an embodiment;
FIG. 2 is a training flowchart of the garment-generation model, the stylized text encoder, and the stylized ControlNet model provided by the embodiment;
FIG. 3 is a new garment image generated by the garment pattern fusion model provided by the embodiments;
FIG. 4 is a schematic structural diagram of the diffusion model-based garment style fusion system according to an embodiment.
Detailed Description
The present invention will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the detailed description is presented by way of example only and is not intended to limit the scope of the invention.
The diffusion model-based garment style fusion method provided by the embodiment completes garment design with a large neural network model, improving design efficiency and diversity, reducing the manpower and material cost of design, and lowering the threshold of garment design. Compared with traditional style transfer models, the style-transfer-based garment style fusion model in this embodiment generates images at higher resolution, with better quality and more stable results. A dedicated garment-generation model is trained, which performs better on garment generation than Stable Diffusion. Developed specifically for the garment design field, the fusion model's style transfer capability is not limited to transferring a garment's overall color tone and painting style; it transfers detailed garment styles.
As shown in fig. 1, the garment style fusion method based on the diffusion model provided in the embodiment includes the following steps:
s110, constructing a garment generation large model.
In the embodiment, LoRA fine-tuning is performed on a Stable Diffusion model with a garment image-text dataset to obtain the large garment-generation model. This specifically comprises constructing the garment image-text dataset and training the Stable Diffusion model.
To construct the garment image-text dataset, 600,000 garment images of various kinds, with corresponding text descriptions, were first collected and organized from e-commerce websites and public clothing datasets, then cleaned and filtered. Specifically: pictures with resolution below 512 x 512 were removed; garment images containing people were removed using the OpenPose skeleton-point extraction algorithm; front-view garment images were manually screened and classified; and the garment foreground was extracted with Photoshop batch processing, keeping only the garment region. The filtered images were further processed to meet model training requirements: each image was padded with transparent pixels to make its width and height equal and scaled to 1024 x 1024 resolution with a bilinear interpolation algorithm, and for the 350,000 garment images whose descriptions were missing, text descriptions were generated with the BLIP image-captioning model. This finally yielded 550,000 high-quality original-garment-image/descriptive-text data pairs, forming a garment image-text dataset containing the two modalities of image and text.
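For illustration, the padding-and-rescaling step can be sketched as below, assuming RGBA foreground cut-outs as input; file names are placeholders:

```python
from PIL import Image

def pad_and_resize(path: str, size: int = 1024) -> Image.Image:
    img = Image.open(path).convert("RGBA")
    side = max(img.size)
    canvas = Image.new("RGBA", (side, side), (0, 0, 0, 0))  # transparent padding
    canvas.paste(img, ((side - img.width) // 2, (side - img.height) // 2))
    return canvas.resize((size, size), resample=Image.BILINEAR)  # bilinear scaling

pad_and_resize("garment_foreground.png").save("garment_1024.png")
```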
For the Stable Diffusion model, LoRA fine-tuning is performed with the garment image-text dataset. Specifically, the descriptive text and random Gaussian noise are taken as input, the original garment image is taken as supervision, and fine-tuning training is performed on Stable Diffusion.
The Stable Diffusion model comprises a VQ encoder, a VQ decoder D, a condition encoder, and a denoising network. The VQ encoder maps the original garment image into a hidden space; forward diffusion over time steps 0 to T yields the latent feature $z_T$ at time step T. The condition encoder encodes the condition text, and the denoising network performs reverse diffusion over time steps T to 0, conditioned on the encoded text and starting from $z_T$, to realize denoising and obtain the denoised latent $z_0$ at time 0, which the VQ decoder decodes into the garment generation result. The loss function $L_{SD}$ adopted when training the Stable Diffusion model is:

$$L_{SD} = \mathbb{E}_{z_t \sim VQ(x),\, y,\, \epsilon \sim \mathcal{N}(0,1),\, t}\left[\left\| \epsilon - \epsilon_\theta\left(z_t, t, c_\theta(y)\right) \right\|_2^2\right]$$

where $\epsilon$ denotes the real noise sample, $z_t$ the latent noise at time $t$, $c_\theta(y)$ the encoding of the condition text $y$ by the condition encoder $c_\theta$, $\epsilon_\theta(z_t, t, c_\theta(y))$ the denoising result of the denoising network, $\|\cdot\|_2^2$ the squared L2 norm, $z_t \sim VQ(x)$ that $z_t$ follows the latent feature $VQ(x)$ obtained by feeding the original garment image $x$ into the VQ encoder, $\mathbb{E}$ the expectation, and $\epsilon \sim \mathcal{N}(0,1)$ that $\epsilon$ follows a Gaussian distribution with mean 0 and variance 1.
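For illustration, a single training step minimizing $L_{SD}$ can be sketched as follows; `vq_encoder`, `text_encoder`, `unet`, and `scheduler` are assumed stand-ins for the components defined above, not this application's exact interfaces:

```python
import torch
import torch.nn.functional as F

def sd_training_step(x, y_tokens, vq_encoder, text_encoder, unet, scheduler):
    z0 = vq_encoder(x)                            # latent feature VQ(x)
    eps = torch.randn_like(z0)                    # real noise sample, N(0, 1)
    t = torch.randint(0, scheduler.num_train_timesteps,
                      (z0.shape[0],), device=z0.device)
    zt = scheduler.add_noise(z0, eps, t)          # forward diffusion to z_t
    cond = text_encoder(y_tokens)                 # condition encoding c_theta(y)
    eps_pred = unet(zt, t, cond)                  # epsilon_theta(z_t, t, c_theta(y))
    return F.mse_loss(eps_pred, eps)              # squared L2 error against the true noise
```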
In the embodiment, the training process optimizes the Stable Diffusion parameters with the LoRA method. LoRA builds on the intrinsic low-rank property of large models: a bypass low-rank decomposition matrix BA is added to approximate full-parameter fine-tuning, achieving lightweight fine-tuning. Suppose the original parameter matrix of the Stable Diffusion model is $W_0 \in \mathbb{R}^{d \times k}$; its update can be expressed as:

$$W_0 + \Delta W = W_0 + BA, \quad B \in \mathbb{R}^{d \times r}, \quad A \in \mathbb{R}^{r \times k}$$

where the dimension-raising matrix B and the dimension-reducing matrix A multiply to give the bypass low-rank decomposition matrix BA, the rank $r \ll \min(d, k)$, d and k are the dimensions of the original parameter matrix, and $\Delta W = BA$ is the added matrix.
The dimension-reducing matrix A is initialized with a random Gaussian distribution and the dimension-raising matrix B with a zero matrix, ensuring that the bypass low-rank decomposition matrix BA is still the zero matrix at the start of training. During training all parameters of the original Stable Diffusion model are frozen, a trainable low-rank decomposition matrix is injected into the cross-attention layers of the Transformer blocks in each VQ encoder and VQ decoder, and no gradients are computed for the frozen original parameters during optimization, so only the bypass low-rank decomposition matrix BA is optimized. Adding the parameter-optimized bypass matrix BA onto the original parameters of the Stable Diffusion model yields the large garment-generation model.
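For illustration, a minimal sketch of a LoRA-wrapped linear layer realizing $W_0 + BA$ follows; the rank and the Gaussian initialization scale are assumptions:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                          # freeze the original W0
        d, k = base.out_features, base.in_features
        self.A = nn.Parameter(torch.randn(rank, k) * 0.01)   # down-projection, Gaussian init
        self.B = nn.Parameter(torch.zeros(d, rank))          # up-projection, zero init: BA = 0

    def forward(self, x):
        # (W0 + BA) x: the frozen path plus the trainable low-rank bypass
        return self.base(x) + x @ (self.B @ self.A).T
```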
Given a garment-related text prompt, the large garment-generation model can generate garments of high quality and high text fidelity, outperforming the original Stable Diffusion model. This in turn raises the overall quality of the garments obtained by the subsequent garment style fusion algorithm.
S120, training a stylized text encoder.
In the embodiment, once the style map to be used as a reference is fed into the model, the stylized text encoder module and the stylized ControlNet module can be trained in turn. The stylized text encoder module aims to provide text input containing style information at generation time; the stylized ControlNet module aims to align style semantics with structure and, through additional parameters, to increase the garment-generation model's capacity for style images.
The stylized text encoder converts each word of the input text into an index in a predefined dictionary and links each index to a unique embedded vector retrievable by index lookup; these embedded vectors are normally learned as part of the text encoder. After a style map is input as a reference, a style indicator word <Style> is used to denote the style it represents. The new style indicator word is added to the dictionary of the stylized text encoder and its high-dimensional embedded vector is initialized, carrying no semantic information at this point. The stylized text encoder then encodes the text description carrying the style indicator word to obtain the indicator word's embedded vector, which is fed to the garment-generation model as a condition input. With the parameters of the garment-generation model and the irrelevant embedded vectors in the text encoder fixed, the embedded vector of the style indicator word is optimized under the supervision of the style map.
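For illustration, registering such a style indicator word and restricting optimization to its embedding can be sketched with a Hugging Face CLIP tokenizer/encoder pair; the checkpoint name and the gradient-masking helper are assumptions, not this application's components:

```python
import torch
from transformers import CLIPTextModel, CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

# Add the new style indicator word to the dictionary and give it a fresh
# embedding row (no semantic information yet).
tokenizer.add_tokens(["<Style>"])
text_encoder.resize_token_embeddings(len(tokenizer))
style_id = tokenizer.convert_tokens_to_ids("<Style>")

# Freeze the whole encoder, then re-enable gradients only on the embedding
# table; after each backward pass, zero the gradients of every row except
# the <Style> row so only that embedded vector is optimized.
for p in text_encoder.parameters():
    p.requires_grad_(False)
emb = text_encoder.get_input_embeddings().weight
emb.requires_grad_(True)

def mask_irrelevant_grads():
    if emb.grad is not None:
        mask = torch.zeros_like(emb.grad)
        mask[style_id] = 1.0
        emb.grad.mul_(mask)
```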
As shown in FIG. 2, during training, so that the embedded vector of the style indicator word learns more detail information, the garment picture is randomly sliced to enrich the training samples: five independent 256 x 256 slices are taken from the original 1024 x 1024 picture and upscaled to 1024 x 1024 resolution with a bilinear interpolation algorithm. The original garment picture and the sliced pictures then serve as style supervision (the original picture is sampled with probability 0.5 and each slice with probability 0.1), while descriptive text containing the style indicator word <Style> and random Gaussian noise serve as input for training the stylized text encoder (see the sketch after the objective below). The UNet structure and VQ encoder in the Stable Diffusion model and the irrelevant embedded vectors in the stylized text encoder are fixed; training optimizes only the embedded vector of the style indicator word, so that it learns the style information. The specific optimization objective is defined as:
$$v^* = \arg\min_{v} \mathbb{E}_{z_t \sim VQ(x),\, y',\, \epsilon \sim \mathcal{N}(0,1),\, t}\left[\left\| \epsilon - \epsilon_\theta\left(z_t, t, c'_\theta(y')\right) \right\|_2^2\right]$$

where $v$ denotes the embedded vector of the style indicator word and $c'_\theta(y')$ the embedding produced by the stylized text encoder $c'_\theta$ for the style indicator word in the text description $y'$ carrying it. The updated embedded vector comes to contain both the global style information and the detail style information of the style reference picture.
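For illustration, the slicing augmentation and weighted sampling described above can be sketched as follows; the pooling interface is an assumption:

```python
import random
from PIL import Image

def build_style_pool(style_img: Image.Image, n_crops: int = 5, crop: int = 256):
    """Original image plus five random 256x256 slices upscaled by bilinear interpolation."""
    w, h = style_img.size                          # assumed 1024 x 1024
    pool, weights = [style_img], [0.5]             # original sampled with probability 0.5
    for _ in range(n_crops):
        x, y = random.randint(0, w - crop), random.randint(0, h - crop)
        piece = style_img.crop((x, y, x + crop, y + crop))
        pool.append(piece.resize((w, h), resample=Image.BILINEAR))
        weights.append(0.1)                        # each slice sampled with probability 0.1
    return pool, weights

def sample_style_supervision(pool, weights):
    return random.choices(pool, weights=weights, k=1)[0]
```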
S130, training the stylized ControlNet model.
Because the style indicator word is a brand-new embedded vector, the original ControlNet model cannot acquire its semantic information. The original ControlNet must therefore be fine-tuned to align the style semantic embedding with the control structure, while also letting the ControlNet model learn the style information of the style reference picture.
The ControlNet model is an end-to-end neural network architecture for controlling a diffusion model; it realizes control by adding an extra condition s. It copies the neural network weights of the original model into a trainable replica, in which the extra condition inputs used to control the model are learned. The trainable replica and the original neural network blocks are connected by convolution layers with fixed parameters so as to exert control on the model at each generation step, at which point the noise-prediction formula becomes $\epsilon_\theta(z_t, t, c_\theta(y), c_N(s))$.
As shown in FIG. 2, when training the stylized ControlNet module, at each time step the garment structure diagram corresponding to the style map and the time are taken as inputs of the stylized ControlNet model, and the embedded vector of the style indicator word output by the stylized text encoder is taken as its condition input. The parameters of the garment-generation model and of the stylized text encoder are fixed, and with the style map as supervision, only the stylized ControlNet model is parameter-optimized, so that it aligns style semantics with structure. The garment structure diagram comprises at least one of a line-draft map and a depth map, and the loss function $L_{CN}$ adopted during training is:

$$L_{CN} = \mathbb{E}_{z_t \sim VQ(x),\, y',\, s,\, \epsilon \sim \mathcal{N}(0,1),\, t}\left[\left\| \epsilon - \epsilon_\theta\left(z_t, t, c'_\theta(y'), c_N(s)\right) \right\|_2^2\right]$$

where $\epsilon$ denotes the real noise sample, $z_t$ the latent noise at time $t$, $c'_\theta(y')$ the embedding produced by the parameter-optimized stylized text encoder $c'_\theta$ for the style indicator word in the text description $y'$, $c_N(s)$ the encoding by the stylized ControlNet model $c_N$ of the garment structure diagram $s$ corresponding to the style map, and $\epsilon_\theta(z_t, t, c'_\theta(y'), c_N(s))$ the denoising result of the denoising network. The trained stylized ControlNet module learns the style information of the style reference map while aligning the style semantic embedding with the control structure.
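For illustration, the trainable-replica coupling can be sketched as follows. The zero-initialized 1x1 convolution follows the common ControlNet design and is an assumption here, since the application only specifies convolutional connection layers:

```python
import copy
import torch.nn as nn

class ControlledBlock(nn.Module):
    def __init__(self, frozen_block: nn.Module, channels: int):
        super().__init__()
        self.frozen = frozen_block
        for p in self.frozen.parameters():
            p.requires_grad_(False)                 # original weights stay fixed
        self.replica = copy.deepcopy(frozen_block)  # trainable copy learning the condition
        self.link = nn.Conv2d(channels, channels, kernel_size=1)
        nn.init.zeros_(self.link.weight)            # zero init: no influence at training start
        nn.init.zeros_(self.link.bias)

    def forward(self, x, condition):
        # The condition s enters through the replica; the link convolution
        # feeds its contribution back into the frozen path.
        return self.frozen(x) + self.link(self.replica(x + condition))
```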
S140, using the parameter-optimized garment-generation model, stylized text encoder, and stylized ControlNet model as the garment style fusion model for realizing garment style fusion.
After training, the parameter-optimized garment-generation model, stylized text encoder, and stylized ControlNet model together form the garment style fusion model, with which garment style fusion can be realized. The specific process comprises:
encoding the input text description carrying the style indicator word with the stylized text encoder to obtain the embedded vector containing the style information, fed as a condition input to the garment-generation model and the stylized ControlNet model; aligning the style semantics of the indicator word's embedded vector with the garment structure diagram using the stylized ControlNet model to obtain style information, fed as a condition input to the garment-generation model; and performing garment style fusion with the garment-generation model based on the embedded vector of the input indicator word, the style information output by the ControlNet model, and random Gaussian noise, to generate a new garment image.
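For illustration, the overall fusion inference flow can be sketched as below; every component name and the scheduler interface are assumed stand-ins, and the denoising loop is reduced to its conditioning structure:

```python
import torch

def fuse_styles(prompt_tokens, structure_map, text_encoder, controlnet, unet,
                vq_decoder, scheduler, steps=30, seed=0):
    g = torch.Generator().manual_seed(seed)
    z = torch.randn(1, 4, 128, 128, generator=g)     # random Gaussian latent
    cond = text_encoder(prompt_tokens)               # embedding carrying <Style> info
    for t in scheduler.timesteps(steps):             # reverse diffusion, T -> 0
        control = controlnet(structure_map, t, cond) # style semantics + structure
        eps = unet(z, t, cond, control)              # conditioned noise prediction
        z = scheduler.step(eps, t, z)                # one denoising update
    return vq_decoder(z)                             # decode to the new garment image
```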
In the embodiment, the generation effect of garment style fusion is influenced by the structure-control strength, the text-control strength, and the number of generation steps input by the user. The structure-control strength represents how strongly the stylized ControlNet model preserves the structure of the structure reference diagram; concretely, changing the parameter-mixing proportion between the stylized ControlNet model and the large garment model at each generation step affects how faithfully the structure is preserved. The text-control strength represents how strongly the stylized text encoding influences the garment-generation model; concretely, changing the scaling of the stylized embedded vector fed to the large model affects how strongly the style encoding shapes the generation result. The number of generation steps is the number of sampling iterations of the generation process: the larger it is, the better the generation effect, but the longer it takes.
Specifically, during garment style fusion, the randomness of the new garment image is controlled through the random seed and the initial Gaussian noise, and the blending effect of the garment structure diagram and the style prompt words is adjusted by tuning the structure-control strength and the text-control strength; FIG. 3 shows different new garment images, fusing structure and style, generated by such adjustment.
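For illustration, one plausible per-step realization of the two strengths is sketched below, under the stated assumptions that the text-control strength scales the stylized embedding and the structure-control strength blends the ControlNet contribution into each denoising step:

```python
def conditioned_noise(z, t, cond, control, unet,
                      text_strength=1.0, struct_strength=1.0):
    scaled_cond = cond * text_strength                   # scale the stylized embedding
    eps_free = unet(z, t, scaled_cond, control=None)     # prediction without structure
    eps_ctrl = unet(z, t, scaled_cond, control=control)  # ControlNet-conditioned prediction
    # Blend the structural contribution into the step by the given proportion.
    return eps_free + struct_strength * (eps_ctrl - eps_free)
```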
Based on the same inventive concept, the embodiment further provides a diffusion model-based garment style fusion system, as shown in FIG. 4, comprising a garment structure diagram module 410, a style module 420, a garment style fusion module 430, a visualization module 440, and a key parameter setting module 450.
The garment structure diagram module 410 provides garment structure diagrams and supports selecting one; the style module 420 provides style data, comprising a style map or a style description text, and supports selecting style data; the garment style fusion module 430 performs garment style fusion based on the selected garment structure diagram and style data using the garment style fusion method described above to generate a new garment image; the visualization module 440 visualizes the generated new garment image; and the key parameter setting module 450 provides configuration of the key parameters, which comprise the number of new garment images to generate, the structure-control strength, the text-control strength, the number of generation steps, and the random seed, with controlled adjustment of the new garment image realized by modulating these key parameters. The number of generated pictures controls how many pictures a single run generates; the structure-control strength represents how similar the generated garment's structure is to the structure reference diagram; the text-control strength represents how similar the generated garment's style is to the style reference map; and the number of generation steps is the number of sampling iterations, with a larger number giving a better generation effect. Random seeds govern the randomness of the fusion result: different seeds lead the random number generators used during generation to produce different results.
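For illustration, the key parameters could be grouped in a configuration record such as the following; field names and defaults are illustrative assumptions:

```python
from dataclasses import dataclass

@dataclass
class FusionKeyParams:
    num_images: int = 4              # number of new garment images per run
    structure_strength: float = 1.0  # similarity to the structure reference diagram
    text_strength: float = 1.0       # similarity to the style reference map
    num_steps: int = 30              # sampling steps: more steps, better quality, slower
    seed: int = 42                   # random seed governing result diversity
```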
When a user applies the garment style fusion system to garment design, the following steps are executed: the user selects a garment structure diagram and a style map as references through the garment structure diagram module 410 and the style module 420; the user adjusts the key parameters through the key parameter setting module 450; and the garment style fusion module 430 fine-tunes the garment style fusion model according to the set key parameters, generates a new garment image with the fine-tuned model, and visualizes it.
After the fused garment is generated, the user scores the new garment image, and high-scoring images are saved together with their generation parameters and style parameters. The user can browse saved images in the history and directly reuse the trained stylized text encoder and stylized ControlNet from a historical result to fuse with a new garment structure reference map.
In the garment style fusion method and system, the randomness of the generation effect rests on the random seed and the initial Gaussian noise; in theory, fusing two garments can produce unlimited results, inspiring the assisted designer. The structure-and-style blending of the result can be adjusted through the structure-control strength and the text-control strength, so that the silhouette and style of the fused garment tend toward, or away from, the structure reference garment and the style reference garment, opening up more design possibilities.
The preferred embodiments and advantages described in detail above are merely illustrative of the presently preferred embodiments of the invention; all changes, additions, substitutions, and equivalents made within the spirit and principles of the invention are intended to fall within its scope.

Claims (10)

1. A diffusion model-based garment style fusion method, characterized by comprising the following steps:
constructing a large garment-generation model: performing LoRA fine-tuning on a Stable Diffusion model with a garment image-text dataset to obtain the large garment-generation model;
training a stylized text encoder: adding a style indicator word simultaneously to the dictionary of the stylized text encoder and to the text description corresponding to the style map, encoding the text description carrying the style indicator word with the stylized text encoder to obtain an embedded vector containing style information, feeding that embedded vector to the garment-generation model as a condition input, fixing the parameters of the garment-generation model and the irrelevant embedded vectors in the text encoder, and optimizing the embedded vector of the style indicator word with the style map as supervision;
training a stylized ControlNet model: at each time step, taking the garment structure diagram corresponding to the style map and the time as inputs of the stylized ControlNet model, taking the embedded vector of the style indicator word output by the stylized text encoder as the condition input of the stylized ControlNet model, fixing the parameters of the garment-generation model and the stylized text encoder, and optimizing only the stylized ControlNet model with the style map as supervision, so that the stylized ControlNet model aligns style semantics with structure;
and using the parameter-optimized garment-generation model, stylized text encoder, and stylized ControlNet model together as the garment style fusion model for realizing garment style fusion.
2. The diffusion model-based garment style fusion method of claim 1, characterized in that the Stable Diffusion model comprises a VQ encoder, a VQ decoder, a condition encoder, and a denoising network, wherein the VQ encoder maps the original garment image into a hidden space, forward diffusion over time steps 0 to T yielding the latent feature $z_T$ at time step T; the condition encoder encodes the condition text; the denoising network performs reverse diffusion over time steps T to 0, conditioned on the encoded text and starting from $z_T$, to realize denoising and obtain the denoised latent $z_0$ at time 0; and the VQ decoder decodes $z_0$ to obtain the garment generation result;

the loss function $L_{SD}$ adopted when training the Stable Diffusion model is:

$$L_{SD} = \mathbb{E}_{z_t \sim VQ(x),\, y,\, \epsilon \sim \mathcal{N}(0,1),\, t}\left[\left\| \epsilon - \epsilon_\theta\left(z_t, t, c_\theta(y)\right) \right\|_2^2\right]$$

where $\epsilon$ denotes the real noise sample, $z_t$ the latent noise at time $t$, $c_\theta(y)$ the encoding of the condition text $y$ by the condition encoder $c_\theta$, $\epsilon_\theta(z_t, t, c_\theta(y))$ the denoising result of the denoising network, $\|\cdot\|_2^2$ the squared L2 norm, $z_t \sim VQ(x)$ that $z_t$ follows the latent feature $VQ(x)$ obtained by feeding the original garment image $x$ into the VQ encoder, $\mathbb{E}$ the expectation, and $\epsilon \sim \mathcal{N}(0,1)$ that $\epsilon$ follows a Gaussian distribution with mean 0 and variance 1.
3. The diffusion model-based garment style fusion method of claim 1, characterized in that the process of performing LoRA fine-tuning on the Stable Diffusion model comprises:

initializing a dimension-reducing matrix A with a random Gaussian distribution and a dimension-raising matrix B with a zero matrix, and multiplying B with A to obtain the bypass low-rank decomposition matrix BA, which is thereby guaranteed to still be the zero matrix at the start of training; fixing all parameters of the original Stable Diffusion model during training and optimizing only the matrices A and B; and adding the parameter-optimized bypass low-rank decomposition matrix BA onto the original parameters of the Stable Diffusion model to obtain the large garment-generation model.
4. The diffusion model-based garment style fusion method of claim 2, characterized in that the objective function adopted when training the stylized text encoder to update the embedded vector of the style indicator word is:

$$v^* = \arg\min_{v} \mathbb{E}_{z_t \sim VQ(x),\, y',\, \epsilon \sim \mathcal{N}(0,1),\, t}\left[\left\| \epsilon - \epsilon_\theta\left(z_t, t, c'_\theta(y')\right) \right\|_2^2\right]$$

where $v$ denotes the embedded vector of the style indicator word, $\epsilon$ the real noise sample, $z_t$ the latent noise at time $t$, $c'_\theta(y')$ the embedding produced by the stylized text encoder $c'_\theta$ for the style indicator word in the text description $y'$ carrying the style indicator word, $\epsilon_\theta(z_t, t, c'_\theta(y'))$ the denoising result of the denoising network, $\|\cdot\|_2^2$ the squared L2 norm, $z_t \sim VQ(x)$ that $z_t$ follows the latent feature $VQ(x)$ obtained by feeding the original garment image $x$ into the VQ encoder, $\mathbb{E}$ the expectation, and $\epsilon \sim \mathcal{N}(0,1)$ that $\epsilon$ follows a Gaussian distribution with mean 0 and variance 1.
5. The diffusion model-based garment style fusion method of claim 2, characterized in that the loss function $L_{CN}$ adopted when training the stylized ControlNet model is:

$$L_{CN} = \mathbb{E}_{z_t \sim VQ(x),\, y',\, s,\, \epsilon \sim \mathcal{N}(0,1),\, t}\left[\left\| \epsilon - \epsilon_\theta\left(z_t, t, c'_\theta(y'), c_N(s)\right) \right\|_2^2\right]$$

where $\epsilon$ denotes the real noise sample, $z_t$ the latent noise at time $t$, $c'_\theta(y')$ the embedding produced by the parameter-optimized stylized text encoder $c'_\theta$ for the style indicator word in the text description $y'$, $c_N(s)$ the encoding by the stylized ControlNet model $c_N$ of the garment structure diagram $s$ corresponding to the style map, $\epsilon_\theta(z_t, t, c'_\theta(y'), c_N(s))$ the denoising result of the denoising network, $\|\cdot\|_2^2$ the squared L2 norm, $z_t \sim VQ(x)$ that $z_t$ follows the latent feature $VQ(x)$ obtained by feeding the original garment image $x$ into the VQ encoder, $\mathbb{E}$ the expectation, and $\epsilon \sim \mathcal{N}(0,1)$ that $\epsilon$ follows a Gaussian distribution with mean 0 and variance 1.
6. The diffusion model-based garment style fusion method of claim 1, characterized in that realizing garment style fusion with the garment style fusion model comprises the following process:

encoding the input text description carrying the style indicator word with the stylized text encoder to obtain the embedded vector containing the style information corresponding to the style indicator word, fed as a condition input to the garment-generation model and the stylized ControlNet model;

aligning the style semantics of the input indicator word's embedded vector with the garment structure diagram using the stylized ControlNet model to obtain style information, fed as a condition input to the garment-generation model;

and performing garment style fusion with the garment-generation model based on the embedded vector of the input indicator word, the style information output by the ControlNet model, and random Gaussian noise, to generate a new garment image.
7. The diffusion model-based garment style fusion method of claim 1, characterized in that during garment style fusion, the randomness of the new garment image is controlled through the random seed and the initial Gaussian noise, and the blending effect of the garment structure diagram and the style prompt words is adjusted by tuning the structure-control strength and the text-control strength.
8. The diffusion model-based garment style fusion method of claim 1, characterized in that the garment structure diagram comprises a depth map or a line-draft map.
9. A diffusion model-based garment style fusion system, characterized by comprising a garment structure diagram module, a style module, a garment style fusion module, and a visualization module;

the garment structure diagram module is used for providing garment structure diagrams and supports selecting a garment structure diagram;

the style module is used for providing style data, the style data comprising a style map or a style description text, and supports selecting style data;

the garment style fusion module performs garment style fusion based on the selected garment structure diagram and style data using the garment style fusion method of any one of claims 1 to 8 to generate a new garment image;

the visualization module is used for visualizing the generated new garment image.
10. The diffusion model-based garment style fusion system of claim 9, characterized by further comprising a key parameter setting module, the key parameter setting module providing configuration of key parameters, the key parameters comprising the number of new garment images to generate, the structure-control strength, the text-control strength, the number of generation steps, and the random seed, controlled adjustment of the new garment image being realized by modulating the key parameters.
CN202311128437.9A 2023-09-04 2023-09-04 Diffusion model-based garment pattern fusion method and system Pending CN117315417A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311128437.9A CN117315417A (en) 2023-09-04 2023-09-04 Diffusion model-based garment pattern fusion method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311128437.9A CN117315417A (en) 2023-09-04 2023-09-04 Diffusion model-based garment pattern fusion method and system

Publications (1)

Publication Number Publication Date
CN117315417A 2023-12-29

Family

ID=89285667

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311128437.9A Pending CN117315417A (en) 2023-09-04 2023-09-04 Diffusion model-based garment pattern fusion method and system

Country Status (1)

Country Link
CN (1) CN117315417A (en)


Patent Citations (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200082249A1 (en) * 2016-12-16 2020-03-12 Microsoft Technology Licensing, Llc Image stylization based on learning network
US20200151938A1 (en) * 2018-11-08 2020-05-14 Adobe Inc. Generating stylized-stroke images from source images utilizing style-transfer-neural networks with non-photorealistic-rendering
CN109712068A (en) * 2018-12-21 2019-05-03 云南大学 Image Style Transfer and analogy method for cucurbit pyrography
US20220253202A1 (en) * 2019-05-13 2022-08-11 Microsoft Technology Licensing, Llc Automatic generation of stylized icons
US20210166088A1 (en) * 2019-09-29 2021-06-03 Tencent Technology (Shenzhen) Company Limited Training method and apparatus for image fusion processing model, device, and storage medium
US20220222872A1 (en) * 2021-01-14 2022-07-14 Apple Inc. Personalized Machine Learning System to Edit Images Based on a Provided Style
US20220398836A1 (en) * 2021-06-09 2022-12-15 Baidu Usa Llc Training energy-based models from a single image for internal learning and inference using trained models
US20230095092A1 (en) * 2021-09-30 2023-03-30 Nvidia Corporation Denoising diffusion generative adversarial networks
WO2023061169A1 (en) * 2021-10-11 2023-04-20 北京字节跳动网络技术有限公司 Image style migration method and apparatus, image style migration model training method and apparatus, and device and medium
CN115294427A (en) * 2022-04-14 2022-11-04 北京理工大学 Stylized image description generation method based on transfer learning
CN116385848A (en) * 2023-03-27 2023-07-04 重庆理工大学 AR display device image quality improvement and intelligent interaction method based on stable diffusion model
CN116597048A (en) * 2023-04-18 2023-08-15 阿里巴巴(中国)有限公司 Image file generation method, device, equipment and program product
CN116524299A (en) * 2023-05-04 2023-08-01 中国兵器装备集团自动化研究所有限公司 Image sample generation method, device, equipment and storage medium
CN116563094A (en) * 2023-05-16 2023-08-08 上海芯赛云计算科技有限公司 Method and system for generating style image
CN116416342A (en) * 2023-06-12 2023-07-11 腾讯科技(深圳)有限公司 Image processing method, apparatus, computer device, and computer-readable storage medium
CN116630464A (en) * 2023-07-21 2023-08-22 北京蔚领时代科技有限公司 Image style migration method and device based on stable diffusion
CN116664719A (en) * 2023-07-28 2023-08-29 腾讯科技(深圳)有限公司 Image redrawing model training method, image redrawing method and device

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
SONIA PECENAKOVA; NOUR KARESSLI; REZA SHIRVANY: "FitGAN: Fit- and Shape-Realistic Generative Adversarial Networks for Fashion", 2022 26th International Conference on Pattern Recognition (ICPR), 25 August 2022, pages 3097-3104 *
YOU WEITAO; JIANG HAO; YANG ZHIYUAN; YANG CHANGYUAN; SUN LINGYUN: "Automatic generation of print advertising images for specific styles (in English)", Frontiers of Information Technology & Electronic Engineering, vol. 21, no. 10, 3 October 2020, pages 1455-1467 *
WANG SHENGHUI: "Research on deep-learning-based garment image generation methods", Wuhan Textile University, 30 August 2023, pages 46-73 *
ZHAO HAIYING; HUI WEN; XU GUANGMEI: "A new method for generating decorative patterns", Computer Systems & Applications, no. 03, 15 March 2011, pages 87-91 *


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination