CN117495662A - Cartoon image style migration method and system based on Stable Diffusion - Google Patents

Cartoon image style migration method and system based on Stable Diffusion

Info

Publication number
CN117495662A
CN117495662A CN202311519506.9A
Authority
CN
China
Prior art keywords
image
style
embedded vector
cartoon
stable
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311519506.9A
Other languages
Chinese (zh)
Inventor
周宁宁
张政
王瑞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Posts and Telecommunications filed Critical Nanjing University of Posts and Telecommunications
Priority to CN202311519506.9A priority Critical patent/CN117495662A/en
Publication of CN117495662A publication Critical patent/CN117495662A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00Image enhancement or restoration
    • G06T5/50Image enhancement or restoration using two or more images, e.g. averaging or subtraction
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • G06N3/0455Auto-encoder networks; Encoder-decoder networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/809Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of classification results, e.g. where the classifiers operate on the same input data
    • G06V10/811Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of classification results, e.g. where the classifiers operate on the same input data the classifiers operating on different input data, e.g. multi-modal recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Multimedia (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The invention discloses a cartoon image style migration method and system based on Stable Diffusion, relating to the technical field of computer vision. The method comprises: acquiring a dataset containing cartoon-style images and real scene images, and acquiring a description text corresponding to each picture in the real image dataset; improving the diffusion model backbone network U-Net and optimizing the additional input conditions; training the diffusion model by combining the description texts, the real images and the cartoon-style images to obtain a trained cartoon style migration model; and inputting the real image to be migrated and the cartoon-style image into the trained diffusion model to obtain the corresponding generated cartoon-style image. The invention can quickly, effectively and reliably synthesize high-quality cartoon-style images, improves the realism and visual quality of the synthesized images, and broadens the range of applications and scenarios, which is of far-reaching significance for promoting the development of science, technology, business and art.

Description

Cartoon image style migration method and system based on Stable Diffusion
Technical Field
The invention relates to the technical field of computer vision, in particular to a cartoon image style migration method and system based on Stable Diffusion.
Background
Image style migration is an important research topic in the field of computer vision; it aims to give a real image a specific artistic style by learning and applying the style characteristics of artistic images.
Cartoons are a very popular form of artistic expression and are widely used in film and television works, advertisements and other fields, but cartoon production usually has to be completed by professionals and requires considerable time and labor. With the rapid development of deep learning, good results have been obtained by performing style migration with the abstract style features of images learned by generative adversarial networks (GAN), but training GAN models suffers from difficult convergence.
The recent diffusion models solve this problem well. The diffusion model is a deep generative model, and Stable Diffusion is a specific application of the diffusion model. Its working principle is to iteratively add noise to an image and then train a neural network to learn the noise and recover the image. In recent years Stable Diffusion has also become increasingly popular in the field of computer vision, known for its high-quality and diverse image generation, and it has shown striking results in visual tasks, particularly in the generation of artwork.
Disclosure of Invention
The present invention has been made in view of the above-described problems in training the GAN model.
Accordingly, the present invention is directed to a method and system for improving the realism and visual quality of a composite image using a diffusion model.
In order to solve the technical problems, the invention provides the following technical scheme:
in a first aspect, an embodiment of the present invention provides a Stable Diffusion-based cartoon image style migration method, which includes: acquiring a dataset containing cartoon-style images and real scene images, and acquiring a description text corresponding to each picture in the real image dataset; improving the diffusion model backbone network U-Net and optimizing the additional input conditions; training the diffusion model by combining the description texts, the real images and the cartoon-style images to obtain a trained cartoon style migration model; and inputting the real image to be migrated and the cartoon-style image into the trained diffusion model to obtain the corresponding generated cartoon-style image.
As a preferable scheme of the cartoon image style migration method based on Stable Diffusion, the invention comprises the following steps: improving the diffusion model backbone network U-Net and optimizing the additional input conditions comprises the following steps: obtaining the style concept embedded vector corresponding to each picture in the style image dataset, namely the style embedded vector, and the embedded vector corresponding to the description text of each picture in the real image dataset, namely the text embedded vector; combining the obtained text embedded vector and style embedded vector and sending them into the diffusion model backbone network U-Net; removing the style information contained in the real image to be migrated during the forward diffusion process while preserving the content structure of the real image; adding a channel cross attention mechanism into the backbone network U-Net to improve the original jump connection structure; and using depth separable convolution to replace the 2D convolution operation.
As a preferable scheme of the cartoon image style migration method based on Stable Diffusion, the invention comprises the following steps: the process of obtaining the style concept embedded vector corresponding to each picture in the style image dataset is as follows: the image encoder of CLIP is used to obtain the embedded vector of the style image, and a 3-layer self-attention mechanism is used to process the style embedded vector to obtain the style information in the style image, specifically:
image_embedding = CLIPImageEncoder(I_s)
where I_s is the style image, CLIPImageEncoder denotes the image encoder of CLIP, and image_embedding is the embedded vector corresponding to the style image obtained after the style image passes through the image encoder, denoted τ_θ(I_s). For the embedded vector v_0 = τ_θ(I_s), a 3-layer self-attention mechanism is performed, where
Q_i = W_Q·v_i, K = W_K·τ_θ(I_s), V = W_V·τ_θ(I_s)
v_{i+1} = Attention(Q_i, K, V) = softmax(Q_i·K^T/√d_k)·V
where Attention() denotes the self-attention mechanism; Q, K and V denote the Query, Key and Value respectively; · denotes matrix multiplication; T denotes the matrix transposition operation; d_k denotes the dimension of K; softmax() denotes the normalization operation; W_Q, W_K and W_V are three trainable parameter matrices; and Q_i denotes the Query after the i-th self-attention operation.
The embedded vector produced by the self-attention mechanism is passed through the CLIP text encoder to obtain a style embedded vector representation suited to the diffusion model.
As a preferable scheme of the cartoon image style migration method based on Stable Diffusion, the invention comprises the following steps: the process of combining the text embedded vector and the style embedded vector is as follows: the embedded vector corresponding to the description text of the real picture is linearly transformed to obtain the Query vector Q and the Key-Value pair vectors K1 and V1 of the description text; the style embedded vector is linearly transformed to obtain the Key-Value pair vectors K2 and V2 of the style image; K1 and V1 of the description text are concatenated with K2 and V2 of the style image to obtain the spliced Key-Value pair vectors K and V, and the attention operation of the following formula is then performed with the Query Q provided by the description text:
K = K1 ⊕ K2, V = V1 ⊕ V2
where ⊕ denotes concatenation along the sequence-length dimension.
As a preferable scheme of the cartoon image style migration method based on Stable Diffusion, the invention comprises the following steps: the process of removing the style information contained in the real image to be migrated while preserving the content structure of the real image includes: converting the real image into a grayscale image using Image.open().convert(), namely:
image=Image.open(path).convert(mode)
where path denotes the file path of the real image, Image.open opens and loads the picture, and mode denotes the image type after conversion; in the process of continuously adding noise during forward diffusion, a Gabor filter obtained by superposition of a trigonometric function and a Gaussian function is used to extract the texture features of the real image, and the two-dimensional Gabor kernel function is expressed as
g(x, y; λ, θ, ψ, σ, γ) = exp(−(x′² + γ²y′²)/(2σ²))·cos(2πx′/λ + ψ)
x′ = x·cosθ + y·sinθ, y′ = −x·sinθ + y·cosθ
where (x, y) are the original coordinates, (x′, y′) are the rotated coordinates, λ is the wavelength of the filter, θ is the orientation (inclination angle) of the Gabor kernel, ψ is the phase offset, σ is the standard deviation of the Gaussian function, and γ is the aspect ratio determining the ellipticity of the kernel.
As a preferable scheme of the cartoon image style migration method based on Stable Diffusion, the invention comprises the following steps: the method for improving the original jump connection structure by adding the channel cross attention mechanism into the backbone network U-Net comprises the following steps: channel cross attention: for the image features extracted by the last convolution layer in each stage of the U-Net encoder, sending the image features to a channel cross attention module to acquire a global channel dependency relationship after feature preprocessing, and generating enhanced image feature representation; spatial cross attention: on the basis of the image features obtained by the channel cross attention, performing space cross attention on the image features, and capturing a space-crossing space dependency relationship; the channel cross attention and the space cross attention are combined to form a unified module, and the image characteristics obtained in each stage of the U-Net encoder are processed by the module to obtain enhanced image characteristic representation, and the enhanced image characteristic representation is connected to the decoder stage corresponding to the encoder stage.
As a preferable scheme of the cartoon image style migration method based on Stable Diffusion, the invention comprises the following steps: the process of replacing the 2D convolution with a depth separable convolution includes: assuming the input feature map size is [C_in, H, W] and the output feature map size is [C_out, H, W], the parameter count of a conventional 2D convolution is ks×ks×C_in×C_out, while the parameter count of the depth separable convolution is ks×ks×C_in + C_in×C_out, where C_in denotes the number of input channels, C_out denotes the number of output channels, H and W denote the height and width of the image respectively, and ks denotes the convolution kernel size.
In a second aspect, in order to further solve the problems in training GAN models, an embodiment of the present invention provides a Stable Diffusion-based cartoon image style migration system, which includes: a data preprocessing module, used for reading the pictures in the dataset and performing preprocessing operations before model training, including resizing the pictures, random cropping, random horizontal flipping, random rotation and normalization; a training module, used for performing style removal, concept extraction and BLIP text inversion operations on the description text corresponding to the real image, the style image and the real image, sending them into the improved U-Net network, and computing the mean square error loss between the noise predicted by the U-Net and the actually added noise; and a picture generation module, used for inputting the original image to be migrated into a cartoon style and the cartoon-style image into the improved diffusion model after network training is completed, so as to obtain an output image that conforms to the content of the original image and matches the cartoon style.
In a third aspect, embodiments of the present invention provide a computer apparatus comprising a memory and a processor, the memory storing a computer program, wherein: the computer program, when executed by the processor, implements any step of the Stable Diffusion-based cartoon image style migration method according to the first aspect of the present invention.
In a fourth aspect, embodiments of the present invention provide a computer-readable storage medium having a computer program stored thereon, wherein: the computer program, when executed by a processor, implements any step of the Stable Diffusion-based cartoon image style migration method according to the first aspect of the present invention.
The invention has the beneficial effects that data enhancement is used to expand the cartoon style dataset, the CLIP encoder and the BLIP text inversion technique are then used to obtain and combine the embedded vector representations of the images, the style information of the real image is removed, and the backbone network U-Net structure of the diffusion model is improved, so that the network model can fully exploit the style information of the cartoon images during iterative training and training is accelerated, finally generating images of high fidelity that conform to the cartoon style; the invention can quickly, effectively and reliably synthesize high-quality cartoon-style images, improves the realism and visual quality of the synthesized images, and broadens the range of applications and scenarios, which is of far-reaching significance for promoting the development of science, technology, business and art.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art. Wherein:
Fig. 1 is a flowchart of the Stable Diffusion-based cartoon image style migration method and system in embodiment 1.
Fig. 2 is a diagram of the basic Stable Diffusion model in embodiment 1.
Fig. 3 is a U-Net network structure after the modified jump connection in embodiment 1.
Detailed Description
In order that the above-recited objects, features and advantages of the present invention will become more readily apparent, a more particular description of the invention will be rendered by reference to specific embodiments thereof which are illustrated in the appended drawings.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention, but the present invention may be practiced in other ways other than those described herein, and persons skilled in the art will readily appreciate that the present invention is not limited to the specific embodiments disclosed below.
Further, reference herein to "one embodiment" or "an embodiment" means that a particular feature, structure, or characteristic can be included in at least one implementation of the invention. The appearances of the phrase "in one embodiment" in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments.
Example 1
Referring to fig. 1 to 3, a first embodiment of the present invention provides a Stable Diffusion-based cartoon image style migration method, as shown in fig. 1, including the following steps:
s1: and acquiring a dataset containing the cartoon style image and the real scene image, and acquiring a description text corresponding to each picture in the real image dataset.
S2: and (3) improving a diffusion model backbone model U-Net and optimizing additional input conditions.
Further, the method comprises the following steps: obtaining a style concept embedded vector corresponding to each picture in the style image data set and obtaining an embedded vector corresponding to a description text of each picture in the reality image data set; combining the two embedded vectors and sending the combined embedded vectors into a diffusion model backbone network U-Net;
removing style information contained in a real image to be migrated in a forward diffusion process, and simultaneously reserving a content structure of the real image;
a channel cross attention mechanism is added into a backbone network U-Net, and an original jump connection structure is improved, so that shallow structure and deep semantic information are fused better;
the depth separable convolution is used for replacing 2D convolution operation, so that the parameter number can be obviously reduced and the efficiency can be improved on the premise of not sacrificing the performance of the model;
the size of the picture is adjusted using a transform.
image=Resize(H,W)
Where Resize represents the Resize function transform, resize, image is the resized input image, and H and W represent the width and height of the output image, respectively.
The input image is randomly cropped using transforms.RandomCrop, namely:
image=Crop(size,padding,pad_if_needed,fill,mode)
where Crop denotes the random cropping function transforms.RandomCrop, size denotes the expected output size after random cropping, padding denotes the boundary padding value, pad_if_needed is a boolean value that avoids going out of bounds, fill denotes the padding fill value, mode denotes the padding mode, and image denotes the final output picture.
The input picture is randomly flipped horizontally using transforms.RandomHorizontalFlip, namely:
image=Flip(P)
where image denotes the horizontally flipped image, Flip denotes the random horizontal flipping function transforms.RandomHorizontalFlip, and P denotes the probability that the picture is flipped horizontally.
The input picture is randomly rotated using transforms.RandomRotation, namely:
image=Rotation(degrees,expand,center,fill,resample)
where image denotes the randomly rotated image, Rotation denotes the random rotation function transforms.RandomRotation, degrees denotes the rotation angle range, expand denotes whether to expand the output to contain the whole rotated image, center denotes the rotation center, fill denotes the fill value for the area outside the rotated image, and resample denotes the resampling method.
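The preprocessing steps above can be composed into a single pipeline. The following is a minimal torchvision sketch; the target size, crop padding, flip probability and rotation range are illustrative values, not values fixed by the invention:

import torchvision.transforms as transforms

preprocess = transforms.Compose([
    transforms.Resize((512, 512)),                                # adjust picture size (H, W)
    transforms.RandomCrop(512, padding=4, pad_if_needed=True),    # random cropping
    transforms.RandomHorizontalFlip(p=0.5),                       # random horizontal flipping
    transforms.RandomRotation(degrees=10),                        # random rotation
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),  # normalization
])
# usage: tensor = preprocess(pil_image)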
The process for obtaining the description text corresponding to each picture in the real image dataset is as follows: install stable-diffusion-webui and configure the basic environment: select a suitable Python version and install the deep learning framework PyTorch according to the local graphics card configuration.
Clone the stable-diffusion-webui repository of AUTOMATIC1111, download the Stable Diffusion base model, and after all required dependency packages are installed, run stable-diffusion-webui and open it in a browser.
Download the automatic labeling plug-in and generate the description texts for the real pictures in batches: select the Extensions tab in the webui interface, enter the address of the automatic labeling plug-in repository stable-diffusion-webui-wd14-tagger, click the Install button, and wait for the installation to complete.
After the installation is completed, select the Tagger tab and the Batch from directory option in the webui interface in turn, enter the path of the dataset folder to be automatically labeled and the save location of the label files, set the output size of the pictures, and start the automatic labeling, thereby obtaining the description text corresponding to each picture in the real image dataset.
The process of obtaining the embedded vector corresponding to the description text of each real image is as follows: the tokenizer CLIPTokenizer and the text encoder CLIPTextModel are imported from the transformers library, and the specific tokenizer and text encoder are obtained from OpenAI's ViT-L/14 model.
The description text corresponding to the real picture is first converted by the tokenizer into a sequence of tokens, each word or sub-word corresponding to an index in a predefined vocabulary, and the tokens are then encoded by the text encoder into an embedded vector representation of a specific dimension.
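A minimal sketch of this text-encoding step, assuming the Hugging Face transformers library and the openai/clip-vit-large-patch14 checkpoint (i.e. the ViT-L/14 model mentioned above); the caption string is a placeholder:

from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

caption = "a city street on a rainy evening"      # description text of one real image
tokens = tokenizer(caption, padding="max_length",
                   max_length=tokenizer.model_max_length,
                   truncation=True, return_tensors="pt")
text_embedding = text_encoder(tokens.input_ids).last_hidden_state   # text embedded vector, [1, 77, 768]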
Obtaining a style concept embedded vector representation corresponding to each style image, wherein the process comprises the following steps: firstly, an image encoder of a CLIP is used for obtaining an embedded vector of a style image, and then a 3-layer self-attention mechanism is used for processing the style embedded vector to obtain style information in the style image, which comprises the following steps:
image_embedding = CLIPImageEncoder(I_s)
where I_s is the style image, CLIPImageEncoder denotes the image encoder of CLIP, and image_embedding denotes the embedded vector corresponding to the style image obtained after the style image passes through the image encoder, denoted τ_θ(I_s).
A 3-layer self-attention mechanism is then performed on the embedded vector v_0 = τ_θ(I_s), where
Q_i = W_Q·v_i, K = W_K·τ_θ(I_s), V = W_V·τ_θ(I_s)
v_{i+1} = Attention(Q_i, K, V) = softmax(Q_i·K^T/√d_k)·V
where Attention() denotes the self-attention mechanism; Q, K and V denote the Query, Key and Value respectively; · denotes matrix multiplication; T denotes the matrix transposition operation; d_k denotes the dimension of K; softmax() denotes the normalization operation; W_Q, W_K and W_V are three trainable parameter matrices; and Q_i denotes the Query after the i-th self-attention operation.
The embedded vector obtained by the self-attention mechanism is then passed through the CLIP text encoder to obtain a style embedded vector representation suitable for the diffusion model.
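A minimal sketch of this style-embedding step, assuming the transformers CLIP vision model as the image encoder; the attention width and head count are illustrative, and the final projection through the CLIP text encoder is omitted:

import torch
import torch.nn as nn
from transformers import CLIPVisionModel, CLIPImageProcessor

vision = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-large-patch14")

class StyleConceptExtractor(nn.Module):
    # 3 stacked attention layers: the Query evolves while Key/Value stay fixed to τ_θ(I_s)
    def __init__(self, dim=1024, heads=8, layers=3):
        super().__init__()
        self.layers = nn.ModuleList(
            [nn.MultiheadAttention(dim, heads, batch_first=True) for _ in range(layers)])

    def forward(self, tau):
        q = tau                          # v_0 = τ_θ(I_s)
        for attn in self.layers:
            q, _ = attn(q, tau, tau)     # v_{i+1} = Attention(Q_i, K, V)
        return q

# usage (style_pil is a PIL cartoon-style image):
# pixels = processor(images=style_pil, return_tensors="pt").pixel_values
# tau = vision(pixels).last_hidden_state        # τ_θ(I_s), shape [1, 257, 1024]
# style_embedding = StyleConceptExtractor()(tau)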
Combining the text-embedded vector and the style-embedded vector, the process comprising: the combination of the embedded vectors is realized by using a cross attention mechanism, firstly, the embedded vectors corresponding to the description text of the real picture are subjected to linear transformation to obtain Query vector Q, key-Value Key Value pair vectors K1 and V1 of the description text, and then, the Key-Value Key Value pair vectors K2 and V2 of the style image are obtained by subjecting the style embedded vectors to linear transformation.
In order to better fuse the information provided by the text and the image, we first connect K1 and V1 describing the text with K2 and V2 of the style image to obtain the Key-Value Key Value pair vector K and V after splicing, and then perform self-attention operation corresponding to the following formula with Q describing the text:
K = K1 ⊕ K2, V = V1 ⊕ V2
where ⊕ denotes concatenation (splicing) along the sequence-length dimension.
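A minimal sketch of this Key/Value splicing and the subsequent attention, assuming an illustrative embedding width of 768 and that text_emb and style_emb come from the encoders described above:

import torch
import torch.nn as nn
import torch.nn.functional as F

dim = 768                                          # illustrative width of the condition embeddings
to_q = nn.Linear(dim, dim, bias=False)             # Query from the text embedded vector
to_k1, to_v1 = nn.Linear(dim, dim, bias=False), nn.Linear(dim, dim, bias=False)
to_k2, to_v2 = nn.Linear(dim, dim, bias=False), nn.Linear(dim, dim, bias=False)

def fuse_text_and_style(text_emb, style_emb):
    # text_emb: [B, Lt, dim]; style_emb: [B, Ls, dim]
    Q = to_q(text_emb)
    K1, V1 = to_k1(text_emb), to_v1(text_emb)
    K2, V2 = to_k2(style_emb), to_v2(style_emb)
    K = torch.cat([K1, K2], dim=1)                 # K = K1 ⊕ K2 (splice along sequence length)
    V = torch.cat([V1, V2], dim=1)                 # V = V1 ⊕ V2
    attn = F.softmax(Q @ K.transpose(1, 2) / dim ** 0.5, dim=-1)
    return attn @ V                                # fused condition sent into the U-Net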
Removing the style information contained in the real image to be migrated while preserving the content structure of the real image comprises the following steps: first, the real image is converted into a grayscale image using Image.open().convert(), namely:
image=Image.open(path).convert(mode)
where path denotes the file path of the real image, Image.open opens and loads the picture, and mode denotes the image type after conversion; mode="L" converts the image into a grayscale image.
Then, in the process of continuously adding noise during forward diffusion, the texture features of the real image are extracted using a Gabor filter obtained by superposition of a trigonometric function and a Gaussian function. In the first 30% of the noise-adding steps (300 steps), for each time step t, the noised real image minus the texture features extracted by the Gabor filter is used as a new noised image to replace the original one, and the next noise-adding and texture-removal step is performed in a loop, so that the texture features of the real image are gradually removed. The two-dimensional Gabor kernel function is expressed as
g(x, y; λ, θ, ψ, σ, γ) = exp(−(x′² + γ²y′²)/(2σ²))·cos(2πx′/λ + ψ)
x′ = x·cosθ + y·sinθ, y′ = −x·sinθ + y·cosθ
where (x, y) are the original coordinates, (x′, y′) are the rotated coordinates, λ is the wavelength of the filter, θ is the orientation (inclination angle) of the Gabor kernel, ψ is the phase offset, σ is the standard deviation of the Gaussian function, and γ is the aspect ratio determining the ellipticity of the kernel.
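A minimal sketch of the grayscale conversion and Gabor texture extraction using OpenCV's cv2.getGaborKernel; the image path, kernel size and filter parameter values below are illustrative, not values fixed by the invention:

import cv2
import numpy as np
from PIL import Image

gray = np.array(Image.open("real.jpg").convert("L"), dtype=np.float32)   # Image.open(path).convert("L")

# Gabor kernel parameters: sigma=σ, theta=θ, lambd=λ, gamma=γ, psi=ψ (illustrative values)
kernel = cv2.getGaborKernel(ksize=(31, 31), sigma=4.0, theta=np.pi / 4,
                            lambd=10.0, gamma=0.5, psi=0.0)
texture = cv2.filter2D(gray, cv2.CV_32F, kernel)   # texture features of the real image

# during the first 30% of the forward-diffusion steps, the noised image minus these
# texture features would replace the noised image, as described above
detextured = gray - texture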
Preferably, a channel cross attention mechanism is added into the backbone network U-Net, and the improvement of the original jump connection structure comprises the following steps: Channel cross attention: for the image features extracted by the last convolution layer in each stage of the U-Net encoder, the features are sent, after feature preprocessing, to a channel cross attention module to acquire global channel dependency relationships and generate an enhanced image feature representation; Spatial cross attention: on the basis of the image features obtained by the channel cross attention, spatial cross attention is performed on the image features to capture cross-spatial dependency relationships.
Channel cross attention and spatial cross attention differ in how the Query, Key and Value are set, and their dimensions also differ at different stages.
Further, replacing the 2D convolution with a depth separable convolution, the process comprising: the core idea of the depth separable convolution is to decompose a complete convolution operation into two steps, depthWise Convolution (depth-wise convolution) and PointWise Convolution (point-wise convolution), respectively.
Assume the input feature map size is [C_in, H, W] and the output feature map size is [C_out, H, W]; the parameter count of a conventional 2D convolution is ks×ks×C_in×C_out, while the parameter count of the depth separable convolution is ks×ks×C_in + C_in×C_out, where C_in denotes the number of input channels, C_out denotes the number of output channels, H and W denote the height and width of the image respectively, and ks denotes the convolution kernel size.
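A minimal PyTorch sketch of a depth separable convolution replacing a plain 2D convolution; the channel counts in the usage example are illustrative:

import torch
import torch.nn as nn

class DepthSeparableConv(nn.Module):
    def __init__(self, c_in, c_out, ks=3):
        super().__init__()
        # depth-wise convolution: one ks×ks filter per input channel -> ks*ks*C_in parameters
        self.depthwise = nn.Conv2d(c_in, c_in, ks, padding=ks // 2, groups=c_in, bias=False)
        # point-wise convolution: 1×1 filters mixing channels -> C_in*C_out parameters
        self.pointwise = nn.Conv2d(c_in, c_out, 1, bias=False)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

# e.g. ks=3, C_in=320, C_out=640: a plain Conv2d has 3*3*320*640 = 1,843,200 parameters,
# while the separable version has 3*3*320 + 320*640 = 207,680 parameters
y = DepthSeparableConv(320, 640)(torch.randn(1, 320, 64, 64))   # -> [1, 640, 64, 64]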
The channel cross attention and the space cross attention are combined to form a unified module, and the image characteristics obtained in each stage of the U-Net encoder are processed by the module to obtain enhanced image characteristic representation, and the enhanced image characteristic representation is connected to the decoder stage corresponding to the encoder stage.
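A minimal sketch of one way such a unified channel-plus-spatial attention block could sit on a skip connection; the exact cross attention formulation used by the invention is not spelled out here, so this channel-then-spatial gating and the layer sizes are assumptions:

import torch
import torch.nn as nn

class ChannelSpatialAttention(nn.Module):
    # enhances encoder skip features before they are passed to the matching decoder stage
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.channel_mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels))
        self.spatial_conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, feat):
        b, c, _, _ = feat.shape
        # channel attention: global channel dependencies from pooled features
        w_c = torch.sigmoid(self.channel_mlp(feat.mean(dim=(2, 3)))).view(b, c, 1, 1)
        feat = feat * w_c
        # spatial attention: cross-spatial dependencies from mean/max channel maps
        pooled = torch.cat([feat.mean(1, keepdim=True), feat.amax(1, keepdim=True)], dim=1)
        w_s = torch.sigmoid(self.spatial_conv(pooled))
        return feat * w_s

# usage on a skip connection: skip = ChannelSpatialAttention(320)(encoder_feat);
# the decoder then concatenates skip with its upsampled features, as in a standard U-Net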
S3: training the diffusion model by combining the description texts, the real images and the cartoon-style images to obtain a trained cartoon style migration model.
S4: inputting the real image to be migrated and the cartoon-style image into the trained diffusion model to obtain the corresponding generated cartoon-style image.
The embodiment also provides a Stable Diffusion-based cartoon image style migration system, which comprises a data preprocessing module, used for reading the pictures in the dataset and performing preprocessing operations before model training starts, including resizing the pictures, random cropping, random horizontal flipping, random rotation and normalization; a training module, used for performing style removal, concept extraction and BLIP text inversion operations on the description text corresponding to the real image, the style image and the real image, then sending them into the improved U-Net network, and computing the mean square error loss between the noise predicted by the U-Net and the actually added noise; and a picture generation module, used for inputting the original image to be migrated into a cartoon style and the cartoon-style image into the improved diffusion model after network training is completed, so as to obtain an output image that conforms to the content of the original image and matches the cartoon style.
The embodiment also provides a computer device, applicable to the case of a Stable Diffusion-based cartoon image style migration method, comprising: a memory and a processor; the memory is used for storing computer-executable instructions, and the processor is used for executing the computer-executable instructions to implement the Stable Diffusion-based cartoon image style migration method proposed by the above embodiment.
The computer device may be a terminal comprising a processor, a memory, a communication interface, a display screen and input means connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The communication interface of the computer device is used for carrying out wired or wireless communication with an external terminal, and the wireless mode can be realized through WIFI, an operator network, NFC (near field communication) or other technologies. The display screen of the computer equipment can be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer equipment can be a touch layer covered on the display screen, can also be keys, a track ball or a touch pad arranged on the shell of the computer equipment, and can also be an external keyboard, a touch pad or a mouse and the like.
The present embodiment also provides a storage medium having stored thereon a computer program which, when executed by a processor, implements a Stable diffration-based animation image style migration method as proposed in the above embodiment; the storage medium may be implemented by any type or combination of volatile or nonvolatile Memory devices, such as static random access Memory (Static Random Access Memory, SRAM), electrically erasable Programmable Read-Only Memory (Electrically Erasable Programmable Read-Only Memory, EEPROM), erasable Programmable Read-Only Memory (Erasable Programmable Read Only Memory, EPROM), programmable Read-Only Memory (PROM), read-Only Memory (ROM), magnetic Memory, flash Memory, magnetic disk, or optical disk.
In summary, the invention uses data enhancement to expand the cartoon style dataset, then uses the CLIP encoder and the BLIP text inversion technique to obtain and combine the embedded vector representations of the images, removes the style information of the real image and improves the backbone network U-Net structure of the diffusion model, so that the network model can fully exploit the style information of the cartoon images during iterative training and training is accelerated, finally generating images of high fidelity that conform to the cartoon style; the invention can quickly, effectively and reliably synthesize high-quality cartoon-style images, improves the realism and visual quality of the synthesized images, and broadens the range of applications and scenarios, which is of far-reaching significance for promoting the development of science, technology, business and art.
Example 2
In order to verify the beneficial effects of the invention, a second embodiment of the invention provides, on the basis of the first embodiment, experimental simulation data for the Stable Diffusion-based cartoon image style migration method.
Data preparation: a cartoon image dataset A containing 1000 cartoon images of different cartoon styles and a real image dataset B containing 250000 daily life scene images are collected.
A piece of description text is generated for each image in dataset B using the automatic annotation plug-in of stable-diffusion-webui, obtaining the text description corresponding to each image in dataset B.
And extracting a style concept embedded vector from each animation style image in the data set A by using the CLIP image encoder to obtain a style concept vector set C.
The text descriptions in dataset B are encoded using CLIPTokenizer and CLIPTextModel, resulting in a set of text description vectors D.
Model construction: using the U-Net in Stable Diffusion as the backbone network, the network structure is improved: channel cross attention and spatial cross attention modules are added at the encoder layer outputs; depth separable convolutions replace ordinary convolutions to reduce the parameter count; and the jump connections perform channel and spatial dual-attention feature fusion.
Model training: the training dataset is constructed with inputs consisting of a random image from dataset B, the style concept vectors C and the text description vectors D, and the output is a cartoon image.
Training parameters: the learning rate was set to 1e-4, the training batch_size was 4, and 1000 epochs were trained.
And updating network parameters by adopting an L2 loss function to obtain a trained style migration model M.
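A minimal sketch of one training step with these settings (learning rate 1e-4, batch size 4, L2/MSE loss between predicted and added noise), assuming a diffusers-style UNet2DConditionModel and DDPMScheduler; the variable names and the way the condition is formed are illustrative:

import torch
import torch.nn.functional as F

def training_step(unet, scheduler, optimizer, latents, condition):
    # latents: encoded real images, shape [4, C, H, W] (batch size 4)
    # condition: fused text + style embedded vectors for these images
    noise = torch.randn_like(latents)
    t = torch.randint(0, scheduler.config.num_train_timesteps,
                      (latents.shape[0],), device=latents.device)
    noisy = scheduler.add_noise(latents, noise, t)               # forward diffusion
    pred = unet(noisy, t, encoder_hidden_states=condition).sample
    loss = F.mse_loss(pred, noise)                               # L2 loss on the predicted noise
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss

# optimizer would be e.g. torch.optim.Adam(unet.parameters(), lr=1e-4)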
And extracting a corresponding description text of a to-be-migrated real image, encoding to obtain a text description vector, and inputting the text description vector and a style concept vector of the selected cartoon style image into a model M to generate a cartoon image after style migration.
According to the method, through improving the network structure, semantic information of texts and images is effectively utilized, and high-quality animation style migration images can be rapidly generated. The method is simple to operate, high in training speed and good in generating effect, and can be widely applied to the fields of animation creation, game design and the like.
It should be noted that the above embodiments are only for illustrating the technical solution of the present invention and not for limiting the same, and although the present invention has been described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that the technical solution of the present invention may be modified or substituted without departing from the spirit and scope of the technical solution of the present invention, which is intended to be covered in the scope of the claims of the present invention.

Claims (10)

1. A cartoon image style migration method based on Stable Diffusion is characterized in that: comprising the following steps:
acquiring a dataset containing cartoon-style images and real scene images, and acquiring a description text corresponding to each picture in the real image dataset;
improving the diffusion model backbone network U-Net, and optimizing the additional input conditions;
training the diffusion model by combining the description texts, the real images and the cartoon-style images to obtain a trained cartoon style migration model;
inputting the real image to be migrated and the cartoon-style image into the trained diffusion model to obtain the corresponding generated cartoon-style image.
2. The Stable Diffusion-based cartoon image style migration method of claim 1, wherein improving the diffusion model backbone network U-Net and optimizing the additional input conditions comprises the following steps:
obtaining a style concept embedded vector corresponding to each picture in the style image data set, namely a style embedded vector, and an embedded vector corresponding to a description text of each picture in the reality image data set, namely a text embedded vector;
combining the two embedded vectors of the obtained text embedded vector and the style embedded vector, and sending the two embedded vectors into a diffusion model backbone network U-Net;
removing style information contained in a real image to be migrated in a forward diffusion process, and simultaneously reserving a content structure of the real image;
a channel cross attention mechanism is added into a backbone network U-Net, so that the original jump connection structure is improved;
depth separable convolution is used to replace the 2D convolution operation.
3. The Stable Diffusion-based cartoon image style migration method of claim 2, wherein: the process of obtaining the style concept embedded vector corresponding to each picture in the style image dataset is as follows:
the image encoder using the CLIP obtains an embedded vector of a style image, and the style embedded vector is processed by using a 3-layer self-attention mechanism to obtain style information in the style image, which specifically comprises the following steps:
image_embedding = CLIPImageEncoder(I_s)
where I_s is the style image, CLIPImageEncoder denotes the image encoder of CLIP, and image_embedding is the embedded vector corresponding to the style image obtained after the style image passes through the image encoder, denoted τ_θ(I_s);
for the embedded vector v_0 = τ_θ(I_s), a 3-layer self-attention mechanism is performed, where
Q_i = W_Q·v_i, K = W_K·τ_θ(I_s), V = W_V·τ_θ(I_s)
v_{i+1} = Attention(Q_i, K, V) = softmax(Q_i·K^T/√d_k)·V
where Attention() denotes the self-attention mechanism; Q, K and V denote the Query, Key and Value respectively; · denotes matrix multiplication; T denotes the matrix transposition operation; d_k denotes the dimension of K; softmax() denotes the normalization operation; W_Q, W_K and W_V are three trainable parameter matrices; and Q_i denotes the Query after the i-th self-attention operation;
the embedded vector from the self-attention mechanism is passed through the CLIP text encoder to obtain a style embedded vector representation that fits the diffusion model.
4. The Stable Diffusion-based cartoon image style migration method of claim 3, wherein: the process of combining the text embedded vector and the style embedded vector is as follows:
linearly transforming the embedded vector corresponding to the description text of the real picture to obtain a Query vector Q, key-Value key Value pair vector K1 and V1 of the description text;
linearly transforming the style embedded vector to obtain Key-Value Key Value pair vectors K2 and V2 of the style image;
connecting K1 and V1 of the description text with K2 and V2 of the style image to obtain the spliced Key-Value pair vectors K and V, and performing the attention operation of the following formula with the Query Q provided by the description text:
K = K1 ⊕ K2, V = V1 ⊕ V2
where ⊕ denotes concatenation along the sequence-length dimension.
5. The Stable Diffusion-based cartoon image style migration method of claim 4, wherein: the process of removing the style information contained in the real image to be migrated while preserving the content structure of the real image comprises:
converting the real image into a grayscale image using Image.open().convert(), namely:
image=Image.open(path).convert(mode)
where path denotes the file path of the real image, Image.open opens and loads the picture, and mode denotes the image type after conversion;
in the process of continuously adding noise in forward diffusion, a Gabor filter obtained by superposition of a trigonometric function and a Gaussian function is used for extracting texture features of a real image, and the mathematical expression of a two-dimensional Gabor kernel function is as follows:
g(x, y; λ, θ, ψ, σ, γ) = exp(−(x′² + γ²y′²)/(2σ²))·cos(2πx′/λ + ψ)
x′ = x·cosθ + y·sinθ, y′ = −x·sinθ + y·cosθ
where (x, y) are the original coordinates, (x′, y′) are the rotated coordinates, λ is the wavelength of the filter, θ is the orientation (inclination angle) of the Gabor kernel, ψ is the phase offset, σ is the standard deviation of the Gaussian function, and γ is the aspect ratio determining the ellipticity of the kernel.
6. The Stable Diffusion-based cartoon image style migration method of claim 5, wherein: the method of adding a channel cross attention mechanism into the backbone network U-Net and improving the original jump connection structure comprises the following steps:
channel cross attention: for the image features extracted by the last convolution layer in each stage of the U-Net encoder, sending the image features to a channel cross attention module to acquire a global channel dependency relationship after feature preprocessing, and generating enhanced image feature representation;
spatial cross attention: on the basis of the image features obtained by the channel cross attention, performing space cross attention on the image features, and capturing a space-crossing space dependency relationship;
the channel cross attention and the space cross attention are combined to form a unified module, and the image characteristics obtained in each stage of the U-Net encoder are processed by the module to obtain enhanced image characteristic representation, and the enhanced image characteristic representation is connected to the decoder stage corresponding to the encoder stage.
7. The Stable Diffusion-based cartoon image style migration method of claim 6, wherein: the process of replacing the 2D convolution with a depth separable convolution includes:
assume the input feature map size is [C_in, H, W] and the output feature map size is [C_out, H, W]; the parameter count of a conventional 2D convolution is ks×ks×C_in×C_out, while the parameter count of the depth separable convolution is ks×ks×C_in + C_in×C_out;
where C_in denotes the number of input channels, C_out denotes the number of output channels, H and W denote the height and width of the image respectively, and ks denotes the convolution kernel size.
8. A Stable Diffusion-based cartoon image style migration system, based on the Stable Diffusion-based cartoon image style migration method according to any one of claims 1 to 7, characterized in that it comprises:
the data preprocessing module is used for reading pictures in the data set and performing preprocessing operation before model training, and comprises the steps of adjusting the size of the pictures, randomly cutting, randomly horizontally overturning, randomly rotating and normalizing;
the training module is used for performing style removal, concept extraction and BLIP text inversion operations on the description text corresponding to the real image, the style image and the real image, then sending them into the improved U-Net network, and computing the mean square error loss between the noise predicted by the U-Net and the actually added noise;
and the picture generation module is used for inputting the original image to be migrated into a cartoon style and the cartoon-style image into the improved diffusion model after network training is completed, so as to obtain an output image that conforms to the content of the original image and matches the cartoon style.
9. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that: the processor, when executing the computer program, implements the steps of the Stable Diffusion-based cartoon image style migration method according to any one of claims 1 to 7.
10. A computer-readable storage medium having a computer program stored thereon, characterized in that: the computer program, when executed by a processor, implements the steps of the Stable Diffusion-based cartoon image style migration method according to any one of claims 1 to 7.
CN202311519506.9A 2023-11-15 2023-11-15 Cartoon image style migration method and system based on Stable Diffusion Pending CN117495662A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311519506.9A CN117495662A (en) 2023-11-15 2023-11-15 Cartoon image style migration method and system based on Stable diffration

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311519506.9A CN117495662A (en) 2023-11-15 2023-11-15 Cartoon image style migration method and system based on Stable diffration

Publications (1)

Publication Number Publication Date
CN117495662A true CN117495662A (en) 2024-02-02

Family

ID=89684645

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311519506.9A Pending CN117495662A (en) 2023-11-15 2023-11-15 Cartoon image style migration method and system based on Stable diffration

Country Status (1)

Country Link
CN (1) CN117495662A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118114124A (en) * 2024-04-26 2024-05-31 武汉大学 Text-guided controllable portrait generation method, system and equipment based on diffusion model

Similar Documents

Publication Publication Date Title
US11328523B2 (en) Image composites using a generative neural network
CN110660037B (en) Method, apparatus, system and computer program product for face exchange between images
JP7373554B2 (en) Cross-domain image transformation
US10922860B2 (en) Line drawing generation
Barnes et al. PatchMatch: A randomized correspondence algorithm for structural image editing
US8923392B2 (en) Methods and apparatus for face fitting and editing applications
Zhao et al. Guided image inpainting: Replacing an image region by pulling content from another image
US11393158B2 (en) Utilizing voxel feature transformations for deep novel view synthesis
US11367163B2 (en) Enhanced image processing techniques for deep neural networks
WO2022135108A1 (en) Image signal processing method, apparatus, electronic device, and computer-readable storage medium
Bai et al. High-fidelity gan inversion with padding space
CN117495662A (en) Cartoon image style migration method and system based on Stable Diffusion
US20240020810A1 (en) Universal style transfer using multi-scale feature transform and user controls
CN115187706B (en) Lightweight method and system for face style migration, storage medium and electronic equipment
CN114529574A (en) Image matting method and device based on image segmentation, computer equipment and medium
JP2023545052A (en) Image processing model training method and device, image processing method and device, electronic equipment, and computer program
Li et al. Robust alignment for panoramic stitching via an exact rank constraint
Guérin et al. Gradient terrain authoring
CN117392293A (en) Image processing method, device, electronic equipment and storage medium
Li et al. Neural style transfer based on deep feature synthesis
US20240169479A1 (en) Video generation with latent diffusion models
US20240161362A1 (en) Target-augmented material maps
US20230316474A1 (en) Enhancing detailed segments in latent code-based edited digital images
CN117808857B (en) Self-supervision 360-degree depth estimation method, device, equipment and medium
US20240193412A1 (en) Multi-dimensional generative framework for video generation

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination