CN117315417A - Diffusion model-based garment pattern fusion method and system - Google Patents

Diffusion model-based garment pattern fusion method and system

Info

Publication number
CN117315417A
CN117315417A (application number CN202311128437.9A)
Authority
CN
China
Prior art keywords
style
model
clothing
garment
stylized
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311128437.9A
Other languages
Chinese (zh)
Inventor
汤程杰
汤永川
张欣隆
何永兴
林城誉
孙凌云
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU filed Critical Zhejiang University ZJU
Priority to CN202311128437.9A
Publication of CN117315417A
Legal status: Pending


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/126 Character encoding
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/0455 Auto-encoder networks; Encoder-decoder networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00 Geometric image transformation in the plane of the image
    • G06T3/40 Scaling the whole image or part thereof
    • G06T3/4007 Interpolation-based scaling, e.g. bilinear interpolation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00 Geometric image transformation in the plane of the image
    • G06T3/40 Scaling the whole image or part thereof
    • G06T3/4046 Scaling the whole image or part thereof using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00 Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/30 Computing systems specially adapted for manufacturing

Abstract

The invention discloses a diffusion model-based garment style fusion method and system. The method comprises the following steps: constructing a large garment-generation model by performing LoRA fine-tuning on a Stable Diffusion model with a garment image-text dataset; training a stylized text encoder; training a stylized ControlNet model; and using the parameter-optimized garment-generation model, stylized text encoder, and stylized ControlNet model together as a garment style fusion model for realizing garment style fusion. A final network structure is formed through two-step stylized fine-tuning, so the resulting garment style fusion model is not limited to transferring a garment's overall color tone and painting style but also transfers detailed garment styles. It can fuse garment design details, improving design efficiency and diversity while reducing cost.

Description

Diffusion model-based garment pattern fusion method and system
Technical Field
The invention belongs to the technical field of clothing design, and particularly relates to a clothing style fusion method and system based on a diffusion model.
Background
Demand for personalized and customized garments is increasing, yet the conventional garment design process typically requires a designer to draw sample garments manually, which is time-consuming, laborious, and costly. To increase the efficiency and diversity of garment design while lowering its threshold, attempts have been made to assist the design work with large neural network models for clothing generation.
One such design-assistance algorithm is garment style fusion, which generates entirely new garment designs by fusing or crossing features of different garment styles. Garment style fusion is essentially a style transfer technique: given a style image and a content image, style and texture information is collected from the style image and transferred onto the content image, so that the content image takes on the style of the style image while keeping its overall structure. Most existing style transfer techniques with good results are based on GANs (generative adversarial networks) or diffusion models. A GAN is a neural network framework composed of a generator and a discriminator trained against each other; a diffusion model is a framework that generates a target picture by having a neural network denoise a noise picture.
Earlier style transfer techniques are mainly GAN-based. The Swapping Autoencoder (Swap AE) model, for example, encodes the content image and the style image into a structure code and a style code through two independent encoders; generation uses StyleGAN2 as the backbone, with the structure code fed in as a mask input and the style code injected into each convolution layer of StyleGAN2 to achieve style transfer. Likewise, the garment style transfer method disclosed in patent application CN 115810060A and the deep-learning-based garment style transfer method and system disclosed in patent application CN 114445268A train a GAN with a structure loss and a style loss to obtain a style transfer model; after the user uploads a garment and selects a pattern, style transfer generates a new garment with the specified pattern.
However, style transfer models built on GANs suffer from the following problems: (1) owing to the limitations of the GAN model, high-resolution pictures cannot be generated directly, and raising the resolution with an auxiliary interpolation algorithm degrades image quality, whereas the garment design field places high demands on the resolution and quality of design drafts; (2) the style information a GAN learns is the global information of the input style map, and garment details are hard to learn; yet a garment's design concept is embodied mainly in its details rather than its global style, so a generic style transfer model can hardly meet a designer's garment style fusion needs.
Style transfer based on a diffusion model is mainly realized by combining a Stable Diffusion model with a ControlNet model. The ControlNet model is a neural network structure for controlling a diffusion model; it realizes control by adding extra condition inputs. It copies the neural network weights of the original model into a trainable replica, in which the extra condition inputs used to control the model are learned. There are two ways to accomplish style transfer with ControlNet: 1. extract a line-draft or depth-map representation of the content image as structure information, describe the style information with text, and feed both into the large model to generate the content image in the described style; 2. extract a line-draft or depth-map representation of the content image as structure information, shuffle (scramble and recombine) the style image to obtain an image carrying only style information, and feed both into the large model to obtain the content image in the corresponding style.
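For illustration, the first of these two ways can be sketched with the open-source diffusers library. The checkpoint names, the Canny edge extraction standing in for a line-draft, and the prompt are assumptions for demonstration, not the models used in this application:

```python
import cv2
import numpy as np
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

# Structure: a Canny edge map of the content garment, a line-draft-like representation.
gray = np.array(Image.open("content_garment.png").convert("L"))
edges = cv2.Canny(gray, 100, 200)
control_image = Image.fromarray(np.stack([edges] * 3, axis=-1))

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet,
    torch_dtype=torch.float16).to("cuda")

# Style: carried entirely by the text prompt; structure: carried by the edge map.
image = pipe("a trench coat in a baroque brocade style",
             image=control_image, num_inference_steps=30).images[0]
image.save("styled_garment.png")
```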
Thanks to its strong generative capacity, the diffusion model can meet the resolution requirements that garment design places on generated images. However, the original large model is trained on large-scale datasets of uneven garment quality, so the quality of garments it generates is unstable. Likewise, style control in ControlNet-based style transfer is realized through text or a structurally shuffled style map, and the style information it carries is only the global information of the clothing picture; it cannot be refined down to detailed garment features.
There is therefore an urgent need to perform dedicated high-quality fine-tuning of diffusion models, and on that basis to develop a garment fusion method capable of fusing garment details.
Disclosure of Invention
In view of the foregoing, an object of the present invention is to provide a garment style fusion method and system based on a diffusion model, in which two garment pictures are input, one as a style reference and the other as a structure reference, to generate multiple new garments carrying the fused characteristics of both.
In order to achieve the above object, an embodiment of the present invention provides a garment style fusion method based on a diffusion model, including the following steps:
constructing a large garment-generation model: performing LoRA fine-tuning on a Stable Diffusion model with a garment image-text dataset to obtain the large garment-generation model;
training a stylized text encoder: adding a style indicator word simultaneously to the dictionary of the stylized text encoder and to the text description corresponding to the style map, encoding the text description carrying the style indicator word with the stylized text encoder to obtain an embedded vector with style information, feeding that embedded vector to the garment-generation model as a condition input, fixing the parameters of the garment-generation model and the irrelevant embedded vectors in the text encoder, and optimizing the embedded vector of the style indicator word with the style map as supervision;
training a stylized ControlNet model: at each time step, taking the garment structure diagram corresponding to the style map and the time as inputs of the stylized ControlNet model, taking the embedded vector of the style indicator word output by the stylized text encoder as the condition input of the stylized ControlNet model, fixing the parameters of the garment-generation model and the stylized text encoder, and optimizing only the stylized ControlNet model with the style map as supervision, so that the stylized ControlNet model aligns style semantics with structure;
and using the parameter-optimized garment-generation model, stylized text encoder, and stylized ControlNet model together as the garment style fusion model for realizing garment style fusion.
Preferably, the Stable Diffusion model comprises a VQ encoder, a VQ decoder, a condition encoder, and a denoising network. The VQ encoder maps the original garment image into a hidden space; forward diffusion over time steps 0 to T then yields the latent feature $z_T$ at time step T. The condition encoder encodes the condition text, and the denoising network performs reverse diffusion over time steps T to 0, conditioned on the encoded text and starting from the latent feature $z_T$, to realize denoising and obtain the denoised latent $z_0$ at time 0. The VQ decoder decodes $z_0$ to obtain the garment generation result.

The loss function $L_{SD}$ adopted when training the Stable Diffusion model is:

$$L_{SD} = \mathbb{E}_{z_t \sim VQ(x),\, y,\, \epsilon \sim \mathcal{N}(0,1),\, t}\left[\left\| \epsilon - \epsilon_\theta\left(z_t, t, c_\theta(y)\right) \right\|_2^2\right]$$

where $\epsilon$ denotes the real noise sample, $z_t$ the latent noise at time $t$, $c_\theta(y)$ the encoding of the condition text $y$ by the condition encoder $c_\theta$, $\epsilon_\theta(z_t, t, c_\theta(y))$ the denoising result of the denoising network given $z_t$, $t$, and $c_\theta(y)$, $\|\cdot\|_2^2$ the squared L2 norm, $z_t \sim VQ(x)$ that $z_t$ follows the latent feature $VQ(x)$ obtained by feeding the original garment image $x$ into the VQ encoder, $\mathbb{E}$ the expectation, and $\epsilon \sim \mathcal{N}(0,1)$ that $\epsilon$ follows a Gaussian distribution with mean 0 and variance 1.
Preferably, the process of performing LoRA fine-tuning on the Stable Diffusion model comprises:

initializing a dimension-reducing matrix A with a random Gaussian distribution and a dimension-raising matrix B with a zero matrix, and multiplying B with A to obtain the bypass low-rank decomposition matrix BA, which is thereby guaranteed to still be the zero matrix at the start of training; fixing all parameters of the original Stable Diffusion model during training and optimizing only the matrices A and B; and adding the parameter-optimized bypass low-rank decomposition matrix BA onto the original parameters of the Stable Diffusion model to obtain the large garment-generation model.
Preferably, the objective function adopted when training the stylized text encoder to update the embedded vector of the style indicator word is:

$$v^* = \arg\min_{v} \mathbb{E}_{z_t \sim VQ(x),\, y',\, \epsilon \sim \mathcal{N}(0,1),\, t}\left[\left\| \epsilon - \epsilon_\theta\left(z_t, t, c'_\theta(y')\right) \right\|_2^2\right]$$

where $v$ denotes the embedded vector of the style indicator word, $\epsilon$ the real noise sample, $z_t$ the latent noise at time $t$, $c'_\theta(y')$ the embedding produced by the stylized text encoder $c'_\theta$ for the style indicator word in the text description $y'$ carrying the style indicator word, $\epsilon_\theta(z_t, t, c'_\theta(y'))$ the denoising result of the denoising network given $z_t$, $t$, and $c'_\theta(y')$, $\|\cdot\|_2^2$ the squared L2 norm, $z_t \sim VQ(x)$ that $z_t$ follows the latent feature $VQ(x)$ obtained by feeding the original garment image $x$ into the VQ encoder, $\mathbb{E}$ the expectation, and $\epsilon \sim \mathcal{N}(0,1)$ that $\epsilon$ follows a Gaussian distribution with mean 0 and variance 1.
Preferably, the loss function $L_{CN}$ adopted when training the stylized ControlNet model is:

$$L_{CN} = \mathbb{E}_{z_t \sim VQ(x),\, y',\, s,\, \epsilon \sim \mathcal{N}(0,1),\, t}\left[\left\| \epsilon - \epsilon_\theta\left(z_t, t, c'_\theta(y'), c_N(s)\right) \right\|_2^2\right]$$

where $\epsilon$ denotes the real noise sample, $z_t$ the latent noise at time $t$, $c'_\theta(y')$ the embedding produced by the parameter-optimized stylized text encoder $c'_\theta$ for the style indicator word in the text description $y'$, $c_N(s)$ the encoding by the stylized ControlNet model $c_N$ of the garment structure diagram $s$ corresponding to the style map, $\epsilon_\theta(z_t, t, c'_\theta(y'), c_N(s))$ the denoising result of the denoising network given $z_t$, $t$, $c'_\theta(y')$, and $c_N(s)$, $\|\cdot\|_2^2$ the squared L2 norm, $z_t \sim VQ(x)$ that $z_t$ follows the latent feature $VQ(x)$ obtained by feeding the original garment image $x$ into the VQ encoder, $\mathbb{E}$ the expectation, and $\epsilon \sim \mathcal{N}(0,1)$ that $\epsilon$ follows a Gaussian distribution with mean 0 and variance 1.
Preferably, realizing garment style fusion with the garment style fusion model comprises the following process:

encoding the text description carrying the style indicator word with the stylized text encoder to obtain the embedded vector containing the style information corresponding to the style indicator word, fed as a condition input to the garment-generation model and the stylized ControlNet model;

aligning the style semantics of the input indicator word's embedded vector with the garment structure diagram using the stylized ControlNet model to obtain style information, fed as a condition input to the garment-generation model;

and performing garment style fusion with the garment-generation model based on the embedded vector of the input indicator word, the style information output by the ControlNet model, and random Gaussian noise, to generate a new garment image.
Preferably, during garment style fusion, the randomness of the new garment image is controlled through the random seed and the initial Gaussian noise, and the blending effect of the garment structure diagram and the style prompt words is adjusted by tuning the structure-control strength and the text-control strength.

Preferably, the garment structure diagram comprises a depth map or a line-draft map.
To achieve the above object, an embodiment further provides a diffusion model-based garment style fusion system, which comprises a garment structure diagram module, a style module, a garment style fusion module, and a visualization module;

the garment structure diagram module is used for providing garment structure diagrams and supports selecting a garment structure diagram;

the style module is used for providing style data, the style data comprising a style map or a style description text, and supports selecting style data;

the garment style fusion module performs garment style fusion based on the selected garment structure diagram and style data using the garment style fusion method described above to generate a new garment image;

the visualization module is used for visualizing the generated new garment image.
Preferably, the system further comprises a key parameter setting module that provides configuration of key parameters, the key parameters comprising the number of new garment images to generate, the structure-control strength, the text-control strength, the number of generation steps, and the random seed; controlled adjustment of the new garment image is realized by modulating these key parameters.
Compared with the prior art, the beneficial effects of the invention at least include the following:
1. Compared with the original Stable Diffusion model, the large garment-generation model obtained by fine-tuning on a large garment image-text dataset produces results of more stable quality in the garment domain, with richer patterns and more accurate text control.
2. The stylized text encoder module obtains a high-dimensional embedded vector of the style information, while the stylized ControlNet module aligns style semantic information with structure and, through additional parameters, increases the garment-generation model's capacity to generate style images.
3. Based on the training processes of the stylized text encoder and the stylized ControlNet model, a final network structure is formed through two-step stylized fine-tuning. The resulting garment style fusion model is not limited to transferring a garment's overall color tone and painting style; it transfers detailed garment styles and can fuse garment design details. Compared with direct design by a designer, generating new garment images that fuse garment styles with the fusion model improves design efficiency and diversity and reduces cost.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a diffusion model-based garment pattern fusion method provided by an embodiment;
FIG. 2 is a training flowchart of the garment-generation model, the stylized text encoder, and the stylized ControlNet model provided by the embodiment;
FIG. 3 is a new garment image generated by the garment pattern fusion model provided by the embodiments;
FIG. 4 is a schematic structural diagram of the diffusion model-based garment style fusion system according to an embodiment.
Detailed Description
The present invention will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the detailed description is presented by way of example only and is not intended to limit the scope of the invention.
The diffusion model-based garment style fusion method provided by the embodiment completes garment design with a large neural network model, improving design efficiency and diversity, reducing the manpower and material cost of design, and lowering the threshold of garment design. Compared with traditional style transfer models, the style-transfer-based garment style fusion model in this embodiment generates images at higher resolution, with better quality and more stable results. A dedicated garment-generation model is trained, which performs better on garment generation than Stable Diffusion. Developed specifically for the garment design field, the fusion model's style transfer capability is not limited to transferring a garment's overall color tone and painting style; it transfers detailed garment styles.
As shown in fig. 1, the garment style fusion method based on the diffusion model provided in the embodiment includes the following steps:
s110, constructing a garment generation large model.
In the embodiment, LoRA fine-tuning is performed on a Stable Diffusion model with a garment image-text dataset to obtain the large garment-generation model. This specifically comprises constructing the garment image-text dataset and training the Stable Diffusion model.
To construct the garment image-text dataset, 600,000 garment images of various kinds, with corresponding text descriptions, were first collected and organized from e-commerce websites and public clothing datasets, then cleaned and filtered. Specifically: pictures with resolution below 512 x 512 were removed; garment images containing people were removed using the OpenPose skeleton-point extraction algorithm; front-view garment images were manually screened and classified; and the garment foreground was extracted with Photoshop batch processing, keeping only the garment region. The filtered images were further processed to meet model training requirements: each image was padded with transparent pixels to make its width and height equal and scaled to 1024 x 1024 resolution with a bilinear interpolation algorithm, and for the 350,000 garment images whose descriptions were missing, text descriptions were generated with the BLIP image-captioning model. This finally yielded 550,000 high-quality original-garment-image/descriptive-text data pairs, forming a garment image-text dataset containing the two modalities of image and text.
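For illustration, the padding-and-rescaling step can be sketched as below, assuming RGBA foreground cut-outs as input; file names are placeholders:

```python
from PIL import Image

def pad_and_resize(path: str, size: int = 1024) -> Image.Image:
    img = Image.open(path).convert("RGBA")
    side = max(img.size)
    canvas = Image.new("RGBA", (side, side), (0, 0, 0, 0))  # transparent padding
    canvas.paste(img, ((side - img.width) // 2, (side - img.height) // 2))
    return canvas.resize((size, size), resample=Image.BILINEAR)  # bilinear scaling

pad_and_resize("garment_foreground.png").save("garment_1024.png")
```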
For the Stable Diffusion model, LoRA fine-tuning is performed with the garment image-text dataset. Specifically, the descriptive text and random Gaussian noise are taken as input, the original garment image is taken as supervision, and fine-tuning training is performed on Stable Diffusion.
The Stable Diffusion model comprises a VQ encoder, a VQ decoder D, a condition encoder, and a denoising network. The VQ encoder maps the original garment image into a hidden space; forward diffusion over time steps 0 to T yields the latent feature $z_T$ at time step T. The condition encoder encodes the condition text, and the denoising network performs reverse diffusion over time steps T to 0, conditioned on the encoded text and starting from $z_T$, to realize denoising and obtain the denoised latent $z_0$ at time 0, which the VQ decoder decodes into the garment generation result. The loss function $L_{SD}$ adopted when training the Stable Diffusion model is:

$$L_{SD} = \mathbb{E}_{z_t \sim VQ(x),\, y,\, \epsilon \sim \mathcal{N}(0,1),\, t}\left[\left\| \epsilon - \epsilon_\theta\left(z_t, t, c_\theta(y)\right) \right\|_2^2\right]$$

where $\epsilon$ denotes the real noise sample, $z_t$ the latent noise at time $t$, $c_\theta(y)$ the encoding of the condition text $y$ by the condition encoder $c_\theta$, $\epsilon_\theta(z_t, t, c_\theta(y))$ the denoising result of the denoising network, $\|\cdot\|_2^2$ the squared L2 norm, $z_t \sim VQ(x)$ that $z_t$ follows the latent feature $VQ(x)$ obtained by feeding the original garment image $x$ into the VQ encoder, $\mathbb{E}$ the expectation, and $\epsilon \sim \mathcal{N}(0,1)$ that $\epsilon$ follows a Gaussian distribution with mean 0 and variance 1.
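For illustration, a single training step minimizing $L_{SD}$ can be sketched as follows; `vq_encoder`, `text_encoder`, `unet`, and `scheduler` are assumed stand-ins for the components defined above, not this application's exact interfaces:

```python
import torch
import torch.nn.functional as F

def sd_training_step(x, y_tokens, vq_encoder, text_encoder, unet, scheduler):
    z0 = vq_encoder(x)                            # latent feature VQ(x)
    eps = torch.randn_like(z0)                    # real noise sample, N(0, 1)
    t = torch.randint(0, scheduler.num_train_timesteps,
                      (z0.shape[0],), device=z0.device)
    zt = scheduler.add_noise(z0, eps, t)          # forward diffusion to z_t
    cond = text_encoder(y_tokens)                 # condition encoding c_theta(y)
    eps_pred = unet(zt, t, cond)                  # epsilon_theta(z_t, t, c_theta(y))
    return F.mse_loss(eps_pred, eps)              # squared L2 error against the true noise
```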
In the embodiment, the training process optimizes the Stable Diffusion parameters with the LoRA method. LoRA builds on the intrinsic low-rank property of large models: a bypass low-rank decomposition matrix BA is added to approximate full-parameter fine-tuning, achieving lightweight fine-tuning. Suppose the original parameter matrix of the Stable Diffusion model is $W_0 \in \mathbb{R}^{d \times k}$; its update can be expressed as:

$$W_0 + \Delta W = W_0 + BA, \quad B \in \mathbb{R}^{d \times r}, \quad A \in \mathbb{R}^{r \times k}$$

where the dimension-raising matrix B and the dimension-reducing matrix A multiply to give the bypass low-rank decomposition matrix BA, the rank $r \ll \min(d, k)$, d and k are the dimensions of the original parameter matrix, and $\Delta W = BA$ is the added matrix.
The dimension-reducing matrix A is initialized with a random Gaussian distribution and the dimension-raising matrix B with a zero matrix, ensuring that the bypass low-rank decomposition matrix BA is still the zero matrix at the start of training. During training all parameters of the original Stable Diffusion model are frozen, a trainable low-rank decomposition matrix is injected into the cross-attention layers of the Transformer blocks in each VQ encoder and VQ decoder, and no gradients are computed for the frozen original parameters during optimization, so only the bypass low-rank decomposition matrix BA is optimized. Adding the parameter-optimized bypass matrix BA onto the original parameters of the Stable Diffusion model yields the large garment-generation model.
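For illustration, a minimal sketch of a LoRA-wrapped linear layer realizing $W_0 + BA$ follows; the rank and the Gaussian initialization scale are assumptions:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                          # freeze the original W0
        d, k = base.out_features, base.in_features
        self.A = nn.Parameter(torch.randn(rank, k) * 0.01)   # down-projection, Gaussian init
        self.B = nn.Parameter(torch.zeros(d, rank))          # up-projection, zero init: BA = 0

    def forward(self, x):
        # (W0 + BA) x: the frozen path plus the trainable low-rank bypass
        return self.base(x) + x @ (self.B @ self.A).T
```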
Given a garment-related text prompt, the large garment-generation model can generate garments of high quality and high text fidelity, outperforming the original Stable Diffusion model. This in turn raises the overall quality of the garments obtained by the subsequent garment style fusion algorithm.
S120, training a stylized text encoder.
In the embodiment, once the style map to be used as a reference is fed into the model, the stylized text encoder module and the stylized ControlNet module can be trained in turn. The stylized text encoder module aims to provide text input containing style information at generation time; the stylized ControlNet module aims to align style semantics with structure and, through additional parameters, to increase the garment-generation model's capacity for style images.
The stylized text encoder converts each word of the input text into an index in a predefined dictionary and links each index to a unique embedded vector retrievable by index lookup; these embedded vectors are normally learned as part of the text encoder. After a style map is input as a reference, a style indicator word <Style> is used to denote the style it represents. The new style indicator word is added to the dictionary of the stylized text encoder and its high-dimensional embedded vector is initialized, carrying no semantic information at this point. The stylized text encoder then encodes the text description carrying the style indicator word to obtain the indicator word's embedded vector, which is fed to the garment-generation model as a condition input. With the parameters of the garment-generation model and the irrelevant embedded vectors in the text encoder fixed, the embedded vector of the style indicator word is optimized under the supervision of the style map.
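For illustration, registering such a style indicator word and restricting optimization to its embedding can be sketched with a Hugging Face CLIP tokenizer/encoder pair; the checkpoint name and the gradient-masking helper are assumptions, not this application's components:

```python
import torch
from transformers import CLIPTextModel, CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

# Add the new style indicator word to the dictionary and give it a fresh
# embedding row (no semantic information yet).
tokenizer.add_tokens(["<Style>"])
text_encoder.resize_token_embeddings(len(tokenizer))
style_id = tokenizer.convert_tokens_to_ids("<Style>")

# Freeze the whole encoder, then re-enable gradients only on the embedding
# table; after each backward pass, zero the gradients of every row except
# the <Style> row so only that embedded vector is optimized.
for p in text_encoder.parameters():
    p.requires_grad_(False)
emb = text_encoder.get_input_embeddings().weight
emb.requires_grad_(True)

def mask_irrelevant_grads():
    if emb.grad is not None:
        mask = torch.zeros_like(emb.grad)
        mask[style_id] = 1.0
        emb.grad.mul_(mask)
```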
As shown in FIG. 2, during training, so that the embedded vector of the style indicator word learns more detail information, the garment picture is randomly sliced to enrich the training samples: five independent 256 x 256 slices are taken from the original 1024 x 1024 picture and upscaled to 1024 x 1024 resolution with a bilinear interpolation algorithm. The original garment picture and the sliced pictures then serve as style supervision (the original picture is sampled with probability 0.5 and each slice with probability 0.1), while descriptive text containing the style indicator word <Style> and random Gaussian noise serve as input for training the stylized text encoder (see the sketch after the objective below). The UNet structure and VQ encoder in the Stable Diffusion model and the irrelevant embedded vectors in the stylized text encoder are fixed; training optimizes only the embedded vector of the style indicator word, so that it learns the style information. The specific optimization objective is defined as:
$$v^* = \arg\min_{v} \mathbb{E}_{z_t \sim VQ(x),\, y',\, \epsilon \sim \mathcal{N}(0,1),\, t}\left[\left\| \epsilon - \epsilon_\theta\left(z_t, t, c'_\theta(y')\right) \right\|_2^2\right]$$

where $v$ denotes the embedded vector of the style indicator word and $c'_\theta(y')$ the embedding produced by the stylized text encoder $c'_\theta$ for the style indicator word in the text description $y'$ carrying it. The updated embedded vector comes to contain both the global style information and the detail style information of the style reference picture.
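For illustration, the slicing augmentation and weighted sampling described above can be sketched as follows; the pooling interface is an assumption:

```python
import random
from PIL import Image

def build_style_pool(style_img: Image.Image, n_crops: int = 5, crop: int = 256):
    """Original image plus five random 256x256 slices upscaled by bilinear interpolation."""
    w, h = style_img.size                          # assumed 1024 x 1024
    pool, weights = [style_img], [0.5]             # original sampled with probability 0.5
    for _ in range(n_crops):
        x, y = random.randint(0, w - crop), random.randint(0, h - crop)
        piece = style_img.crop((x, y, x + crop, y + crop))
        pool.append(piece.resize((w, h), resample=Image.BILINEAR))
        weights.append(0.1)                        # each slice sampled with probability 0.1
    return pool, weights

def sample_style_supervision(pool, weights):
    return random.choices(pool, weights=weights, k=1)[0]
```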
S130, training the stylized ControlNet model.
Because the style indicator word is a brand-new embedded vector, the original ControlNet model cannot acquire its semantic information. The original ControlNet must therefore be fine-tuned to align the style semantic embedding with the control structure, while also letting the ControlNet model learn the style information of the style reference picture.
The ControlNet model is an end-to-end neural network architecture for controlling a diffusion model; it realizes control by adding an extra condition s. It copies the neural network weights of the original model into a trainable replica, in which the extra condition inputs used to control the model are learned. The trainable replica and the original neural network blocks are connected by convolution layers with fixed parameters so as to exert control on the model at each generation step, at which point the noise-prediction formula becomes $\epsilon_\theta(z_t, t, c_\theta(y), c_N(s))$.
As shown in FIG. 2, when training the stylized ControlNet module, at each time step the garment structure diagram corresponding to the style map and the time are taken as inputs of the stylized ControlNet model, and the embedded vector of the style indicator word output by the stylized text encoder is taken as its condition input. The parameters of the garment-generation model and of the stylized text encoder are fixed, and with the style map as supervision, only the stylized ControlNet model is parameter-optimized, so that it aligns style semantics with structure. The garment structure diagram comprises at least one of a line-draft map and a depth map, and the loss function $L_{CN}$ adopted during training is:

$$L_{CN} = \mathbb{E}_{z_t \sim VQ(x),\, y',\, s,\, \epsilon \sim \mathcal{N}(0,1),\, t}\left[\left\| \epsilon - \epsilon_\theta\left(z_t, t, c'_\theta(y'), c_N(s)\right) \right\|_2^2\right]$$

where $\epsilon$ denotes the real noise sample, $z_t$ the latent noise at time $t$, $c'_\theta(y')$ the embedding produced by the parameter-optimized stylized text encoder $c'_\theta$ for the style indicator word in the text description $y'$, $c_N(s)$ the encoding by the stylized ControlNet model $c_N$ of the garment structure diagram $s$ corresponding to the style map, and $\epsilon_\theta(z_t, t, c'_\theta(y'), c_N(s))$ the denoising result of the denoising network. The trained stylized ControlNet module learns the style information of the style reference map while aligning the style semantic embedding with the control structure.
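For illustration, the trainable-replica coupling can be sketched as follows. The zero-initialized 1x1 convolution follows the common ControlNet design and is an assumption here, since the application only specifies convolutional connection layers:

```python
import copy
import torch.nn as nn

class ControlledBlock(nn.Module):
    def __init__(self, frozen_block: nn.Module, channels: int):
        super().__init__()
        self.frozen = frozen_block
        for p in self.frozen.parameters():
            p.requires_grad_(False)                 # original weights stay fixed
        self.replica = copy.deepcopy(frozen_block)  # trainable copy learning the condition
        self.link = nn.Conv2d(channels, channels, kernel_size=1)
        nn.init.zeros_(self.link.weight)            # zero init: no influence at training start
        nn.init.zeros_(self.link.bias)

    def forward(self, x, condition):
        # The condition s enters through the replica; the link convolution
        # feeds its contribution back into the frozen path.
        return self.frozen(x) + self.link(self.replica(x + condition))
```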
S140, using the parameter-optimized garment-generation model, stylized text encoder, and stylized ControlNet model as the garment style fusion model for realizing garment style fusion.
After training, the parameter-optimized garment-generation model, stylized text encoder, and stylized ControlNet model together form the garment style fusion model, with which garment style fusion can be realized. The specific process comprises:
encoding the input text description carrying the style indicator word with the stylized text encoder to obtain the embedded vector containing the style information, fed as a condition input to the garment-generation model and the stylized ControlNet model; aligning the style semantics of the indicator word's embedded vector with the garment structure diagram using the stylized ControlNet model to obtain style information, fed as a condition input to the garment-generation model; and performing garment style fusion with the garment-generation model based on the embedded vector of the input indicator word, the style information output by the ControlNet model, and random Gaussian noise, to generate a new garment image.
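For illustration, the overall fusion inference flow can be sketched as below; every component name and the scheduler interface are assumed stand-ins, and the denoising loop is reduced to its conditioning structure:

```python
import torch

def fuse_styles(prompt_tokens, structure_map, text_encoder, controlnet, unet,
                vq_decoder, scheduler, steps=30, seed=0):
    g = torch.Generator().manual_seed(seed)
    z = torch.randn(1, 4, 128, 128, generator=g)     # random Gaussian latent
    cond = text_encoder(prompt_tokens)               # embedding carrying <Style> info
    for t in scheduler.timesteps(steps):             # reverse diffusion, T -> 0
        control = controlnet(structure_map, t, cond) # style semantics + structure
        eps = unet(z, t, cond, control)              # conditioned noise prediction
        z = scheduler.step(eps, t, z)                # one denoising update
    return vq_decoder(z)                             # decode to the new garment image
```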
In the embodiment, the generation effect of garment style fusion is influenced by the structure-control strength, the text-control strength, and the number of generation steps input by the user. The structure-control strength represents how strongly the stylized ControlNet model preserves the structure of the structure reference diagram; concretely, changing the parameter-mixing proportion between the stylized ControlNet model and the large garment model at each generation step affects how faithfully the structure is preserved. The text-control strength represents how strongly the stylized text encoding influences the garment-generation model; concretely, changing the scaling of the stylized embedded vector fed to the large model affects how strongly the style encoding shapes the generation result. The number of generation steps is the number of sampling iterations of the generation process: the larger it is, the better the generation effect, but the longer it takes.
Specifically, during garment style fusion, the randomness of the new garment image is controlled through the random seed and the initial Gaussian noise, and the blending effect of the garment structure diagram and the style prompt words is adjusted by tuning the structure-control strength and the text-control strength; FIG. 3 shows different new garment images, fusing structure and style, generated by such adjustment.
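For illustration, one plausible per-step realization of the two strengths is sketched below, under the stated assumptions that the text-control strength scales the stylized embedding and the structure-control strength blends the ControlNet contribution into each denoising step:

```python
def conditioned_noise(z, t, cond, control, unet,
                      text_strength=1.0, struct_strength=1.0):
    scaled_cond = cond * text_strength                   # scale the stylized embedding
    eps_free = unet(z, t, scaled_cond, control=None)     # prediction without structure
    eps_ctrl = unet(z, t, scaled_cond, control=control)  # ControlNet-conditioned prediction
    # Blend the structural contribution into the step by the given proportion.
    return eps_free + struct_strength * (eps_ctrl - eps_free)
```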
Based on the same inventive concept, the embodiment further provides a diffusion model-based garment style fusion system, as shown in FIG. 4, comprising a garment structure diagram module 410, a style module 420, a garment style fusion module 430, a visualization module 440, and a key parameter setting module 450.
The garment structure diagram module 410 provides garment structure diagrams and supports selecting one; the style module 420 provides style data, comprising a style map or a style description text, and supports selecting style data; the garment style fusion module 430 performs garment style fusion based on the selected garment structure diagram and style data using the garment style fusion method described above to generate a new garment image; the visualization module 440 visualizes the generated new garment image; and the key parameter setting module 450 provides configuration of the key parameters, which comprise the number of new garment images to generate, the structure-control strength, the text-control strength, the number of generation steps, and the random seed, with controlled adjustment of the new garment image realized by modulating these key parameters. The number of generated pictures controls how many pictures a single run generates; the structure-control strength represents how similar the generated garment's structure is to the structure reference diagram; the text-control strength represents how similar the generated garment's style is to the style reference map; and the number of generation steps is the number of sampling iterations, with a larger number giving a better generation effect. Random seeds govern the randomness of the fusion result: different seeds lead the random number generators used during generation to produce different results.
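For illustration, the key parameters could be grouped in a configuration record such as the following; field names and defaults are illustrative assumptions:

```python
from dataclasses import dataclass

@dataclass
class FusionKeyParams:
    num_images: int = 4              # number of new garment images per run
    structure_strength: float = 1.0  # similarity to the structure reference diagram
    text_strength: float = 1.0       # similarity to the style reference map
    num_steps: int = 30              # sampling steps: more steps, better quality, slower
    seed: int = 42                   # random seed governing result diversity
```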
When a user applies the garment style fusion system to garment design, the following steps are executed: the user selects a garment structure diagram and a style map as references through the garment structure diagram module 410 and the style module 420; the user adjusts the key parameters through the key parameter setting module 450; and the garment style fusion module 430 fine-tunes the garment style fusion model according to the set key parameters, generates a new garment image with the fine-tuned model, and visualizes it.
After the fused garment is generated, the user scores the new garment image, and high-scoring images are saved together with their generation parameters and style parameters. The user can browse saved images in the history and directly reuse the trained stylized text encoder and stylized ControlNet from a historical result to fuse with a new garment structure reference map.
In the garment style fusion method and system, the randomness of the generation effect rests on the random seed and the initial Gaussian noise; in theory, fusing two garments can produce unlimited results, inspiring the assisted designer. The structure-and-style blending of the result can be adjusted through the structure-control strength and the text-control strength, so that the silhouette and style of the fused garment tend toward, or away from, the structure reference garment and the style reference garment, opening up more design possibilities.
The preferred embodiments and advantages described in detail above are merely illustrative of the presently preferred embodiments of the invention; all changes, additions, substitutions, and equivalents made within the spirit and principles of the invention are intended to fall within its scope.

Claims (10)

1. A diffusion model-based garment style fusion method, characterized by comprising the following steps:
constructing a large garment-generation model: performing LoRA fine-tuning on a Stable Diffusion model with a garment image-text dataset to obtain the large garment-generation model;
training a stylized text encoder: adding a style indicator word simultaneously to the dictionary of the stylized text encoder and to the text description corresponding to the style map, encoding the text description carrying the style indicator word with the stylized text encoder to obtain an embedded vector containing style information, feeding that embedded vector to the garment-generation model as a condition input, fixing the parameters of the garment-generation model and the irrelevant embedded vectors in the text encoder, and optimizing the embedded vector of the style indicator word with the style map as supervision;
training a stylized ControlNet model: at each time step, taking the garment structure diagram corresponding to the style map and the time as inputs of the stylized ControlNet model, taking the embedded vector of the style indicator word output by the stylized text encoder as the condition input of the stylized ControlNet model, fixing the parameters of the garment-generation model and the stylized text encoder, and optimizing only the stylized ControlNet model with the style map as supervision, so that the stylized ControlNet model aligns style semantics with structure;
and using the parameter-optimized garment-generation model, stylized text encoder, and stylized ControlNet model together as the garment style fusion model for realizing garment style fusion.
2. The diffusion model-based garment style fusion method of claim 1, characterized in that the Stable Diffusion model comprises a VQ encoder, a VQ decoder, a condition encoder, and a denoising network, wherein the VQ encoder maps the original garment image into a hidden space, forward diffusion over time steps 0 to T yielding the latent feature $z_T$ at time step T; the condition encoder encodes the condition text; the denoising network performs reverse diffusion over time steps T to 0, conditioned on the encoded text and starting from $z_T$, to realize denoising and obtain the denoised latent $z_0$ at time 0; and the VQ decoder decodes $z_0$ to obtain the garment generation result;

the loss function $L_{SD}$ adopted when training the Stable Diffusion model is:

$$L_{SD} = \mathbb{E}_{z_t \sim VQ(x),\, y,\, \epsilon \sim \mathcal{N}(0,1),\, t}\left[\left\| \epsilon - \epsilon_\theta\left(z_t, t, c_\theta(y)\right) \right\|_2^2\right]$$

where $\epsilon$ denotes the real noise sample, $z_t$ the latent noise at time $t$, $c_\theta(y)$ the encoding of the condition text $y$ by the condition encoder $c_\theta$, $\epsilon_\theta(z_t, t, c_\theta(y))$ the denoising result of the denoising network, $\|\cdot\|_2^2$ the squared L2 norm, $z_t \sim VQ(x)$ that $z_t$ follows the latent feature $VQ(x)$ obtained by feeding the original garment image $x$ into the VQ encoder, $\mathbb{E}$ the expectation, and $\epsilon \sim \mathcal{N}(0,1)$ that $\epsilon$ follows a Gaussian distribution with mean 0 and variance 1.
3. The diffusion model-based garment style fusion method of claim 1, characterized in that the process of performing LoRA fine-tuning on the Stable Diffusion model comprises:

initializing a dimension-reducing matrix A with a random Gaussian distribution and a dimension-raising matrix B with a zero matrix, and multiplying B with A to obtain the bypass low-rank decomposition matrix BA, which is thereby guaranteed to still be the zero matrix at the start of training; fixing all parameters of the original Stable Diffusion model during training and optimizing only the matrices A and B; and adding the parameter-optimized bypass low-rank decomposition matrix BA onto the original parameters of the Stable Diffusion model to obtain the large garment-generation model.
4. The diffusion model-based garment style fusion method of claim 2, characterized in that the objective function adopted when training the stylized text encoder to update the embedded vector of the style indicator word is:

$$v^* = \arg\min_{v} \mathbb{E}_{z_t \sim VQ(x),\, y',\, \epsilon \sim \mathcal{N}(0,1),\, t}\left[\left\| \epsilon - \epsilon_\theta\left(z_t, t, c'_\theta(y')\right) \right\|_2^2\right]$$

where $v$ denotes the embedded vector of the style indicator word, $\epsilon$ the real noise sample, $z_t$ the latent noise at time $t$, $c'_\theta(y')$ the embedding produced by the stylized text encoder $c'_\theta$ for the style indicator word in the text description $y'$ carrying the style indicator word, $\epsilon_\theta(z_t, t, c'_\theta(y'))$ the denoising result of the denoising network, $\|\cdot\|_2^2$ the squared L2 norm, $z_t \sim VQ(x)$ that $z_t$ follows the latent feature $VQ(x)$ obtained by feeding the original garment image $x$ into the VQ encoder, $\mathbb{E}$ the expectation, and $\epsilon \sim \mathcal{N}(0,1)$ that $\epsilon$ follows a Gaussian distribution with mean 0 and variance 1.
5. The diffusion model-based garment style fusion method of claim 2, characterized in that the loss function $L_{CN}$ adopted when training the stylized ControlNet model is:

$$L_{CN} = \mathbb{E}_{z_t \sim VQ(x),\, y',\, s,\, \epsilon \sim \mathcal{N}(0,1),\, t}\left[\left\| \epsilon - \epsilon_\theta\left(z_t, t, c'_\theta(y'), c_N(s)\right) \right\|_2^2\right]$$

where $\epsilon$ denotes the real noise sample, $z_t$ the latent noise at time $t$, $c'_\theta(y')$ the embedding produced by the parameter-optimized stylized text encoder $c'_\theta$ for the style indicator word in the text description $y'$, $c_N(s)$ the encoding by the stylized ControlNet model $c_N$ of the garment structure diagram $s$ corresponding to the style map, $\epsilon_\theta(z_t, t, c'_\theta(y'), c_N(s))$ the denoising result of the denoising network, $\|\cdot\|_2^2$ the squared L2 norm, $z_t \sim VQ(x)$ that $z_t$ follows the latent feature $VQ(x)$ obtained by feeding the original garment image $x$ into the VQ encoder, $\mathbb{E}$ the expectation, and $\epsilon \sim \mathcal{N}(0,1)$ that $\epsilon$ follows a Gaussian distribution with mean 0 and variance 1.
6. The diffusion model-based garment style fusion method of claim 1, characterized in that realizing garment style fusion with the garment style fusion model comprises the following process:

encoding the input text description carrying the style indicator word with the stylized text encoder to obtain the embedded vector containing the style information corresponding to the style indicator word, fed as a condition input to the garment-generation model and the stylized ControlNet model;

aligning the style semantics of the input indicator word's embedded vector with the garment structure diagram using the stylized ControlNet model to obtain style information, fed as a condition input to the garment-generation model;

and performing garment style fusion with the garment-generation model based on the embedded vector of the input indicator word, the style information output by the ControlNet model, and random Gaussian noise, to generate a new garment image.
7. The diffusion model-based garment style fusion method of claim 1, characterized in that during garment style fusion, the randomness of the new garment image is controlled through the random seed and the initial Gaussian noise, and the blending effect of the garment structure diagram and the style prompt words is adjusted by tuning the structure-control strength and the text-control strength.
8. The diffusion model-based garment style fusion method of claim 1, characterized in that the garment structure diagram comprises a depth map or a line-draft map.
9. A diffusion model-based garment style fusion system, characterized by comprising a garment structure diagram module, a style module, a garment style fusion module, and a visualization module;

the garment structure diagram module is used for providing garment structure diagrams and supports selecting a garment structure diagram;

the style module is used for providing style data, the style data comprising a style map or a style description text, and supports selecting style data;

the garment style fusion module performs garment style fusion based on the selected garment structure diagram and style data using the garment style fusion method of any one of claims 1 to 8 to generate a new garment image;

the visualization module is used for visualizing the generated new garment image.
10. The diffusion model-based garment style fusion system of claim 9, characterized by further comprising a key parameter setting module, the key parameter setting module providing configuration of key parameters, the key parameters comprising the number of new garment images to generate, the structure-control strength, the text-control strength, the number of generation steps, and the random seed, controlled adjustment of the new garment image being realized by modulating the key parameters.
CN202311128437.9A 2023-09-04 2023-09-04 Diffusion model-based garment pattern fusion method and system Pending CN117315417A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311128437.9A CN117315417A (en) 2023-09-04 2023-09-04 Diffusion model-based garment pattern fusion method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311128437.9A CN117315417A (en) 2023-09-04 2023-09-04 Diffusion model-based garment pattern fusion method and system

Publications (1)

Publication Number Publication Date
CN117315417A 2023-12-29

Family

ID=89285667

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311128437.9A Pending CN117315417A (en) 2023-09-04 2023-09-04 Diffusion model-based garment pattern fusion method and system

Country Status (1)

Country Link
CN (1) CN117315417A (en)


Patent Citations (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200082249A1 (en) * 2016-12-16 2020-03-12 Microsoft Technology Licensing, Llc Image stylization based on learning network
US20200151938A1 (en) * 2018-11-08 2020-05-14 Adobe Inc. Generating stylized-stroke images from source images utilizing style-transfer-neural networks with non-photorealistic-rendering
CN109712068A (en) * 2018-12-21 2019-05-03 云南大学 Image Style Transfer and analogy method for cucurbit pyrography
US20220253202A1 (en) * 2019-05-13 2022-08-11 Microsoft Technology Licensing, Llc Automatic generation of stylized icons
US20210166088A1 (en) * 2019-09-29 2021-06-03 Tencent Technology (Shenzhen) Company Limited Training method and apparatus for image fusion processing model, device, and storage medium
US20220222872A1 (en) * 2021-01-14 2022-07-14 Apple Inc. Personalized Machine Learning System to Edit Images Based on a Provided Style
US20220398836A1 (en) * 2021-06-09 2022-12-15 Baidu Usa Llc Training energy-based models from a single image for internal learning and inference using trained models
US20230095092A1 (en) * 2021-09-30 2023-03-30 Nvidia Corporation Denoising diffusion generative adversarial networks
WO2023061169A1 (en) * 2021-10-11 2023-04-20 北京字节跳动网络技术有限公司 Image style migration method and apparatus, image style migration model training method and apparatus, and device and medium
CN115294427A (en) * 2022-04-14 2022-11-04 北京理工大学 Stylized image description generation method based on transfer learning
CN116385848A (en) * 2023-03-27 2023-07-04 重庆理工大学 AR display device image quality improvement and intelligent interaction method based on stable diffusion model
CN116597048A (en) * 2023-04-18 2023-08-15 阿里巴巴(中国)有限公司 Image file generation method, device, equipment and program product
CN116524299A (en) * 2023-05-04 2023-08-01 中国兵器装备集团自动化研究所有限公司 Image sample generation method, device, equipment and storage medium
CN116563094A (en) * 2023-05-16 2023-08-08 上海芯赛云计算科技有限公司 Method and system for generating style image
CN116416342A (en) * 2023-06-12 2023-07-11 腾讯科技(深圳)有限公司 Image processing method, apparatus, computer device, and computer-readable storage medium
CN116630464A (en) * 2023-07-21 2023-08-22 北京蔚领时代科技有限公司 Image style migration method and device based on stable diffusion
CN116664719A (en) * 2023-07-28 2023-08-29 腾讯科技(深圳)有限公司 Image redrawing model training method, image redrawing method and device

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
SONIA PECENAKOVA; NOUR KARESSLI; REZA SHIRVANY: "FitGAN: Fit- and Shape-Realistic Generative Adversarial Networks for Fashion", 2022 26th International Conference on Pattern Recognition (ICPR), 25 August 2022, pages 3097-3104 *
YOU WEITAO; JIANG HAO; YANG ZHIYUAN; YANG CHANGYUAN; SUN LINGYUN: "Automatic generation of print advertising images for specific styles (in English)", Frontiers of Information Technology & Electronic Engineering, vol. 21, no. 10, 3 October 2020, pages 1455-1467 *
WANG SHENGHUI: "Research on deep-learning-based garment image generation methods", Wuhan Textile University, 30 August 2023, pages 46-73 *
ZHAO HAIYING; HUI WEN; XU GUANGMEI: "A new method for generating decorative patterns", Computer Systems & Applications, no. 03, 15 March 2011, pages 87-91 *


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination