WO2022048182A1 - Image style transfer method and apparatus, and image style transfer model training method and apparatus - Google Patents

Image style transfer method and apparatus, and image style transfer model training method and apparatus

Info

Publication number
WO2022048182A1
Authority
WO
WIPO (PCT)
Prior art keywords
style
image
instance
content
encoder
Prior art date
Application number
PCT/CN2021/093432
Other languages
French (fr)
Chinese (zh)
Inventor
傅慧源
马华东
余艇
张宇
Original Assignee
北京邮电大学
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京邮电大学
Publication of WO2022048182A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 3/00 Geometric image transformations in the plane of the image
    • G06T 3/04 Context-preserving transformations, e.g. by using an importance map
    • G06T 7/00 Image analysis
    • G06T 7/10 Segmentation; Edge detection
    • G06T 7/11 Region-based segmentation
    • G06T 9/00 Image coding
    • G06T 9/002 Image coding using neural networks
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 Image acquisition modality
    • G06T 2207/10024 Color image
    • G06T 2207/10048 Infrared image
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20081 Training; Learning
    • G06T 2207/20084 Artificial neural networks [ANN]
    • G06T 2207/20112 Image segmentation details
    • G06T 2207/20132 Image cropping

Definitions

  • the present invention relates to the technical field of image processing, in particular to a method and device for image style conversion and model training.
  • Image style transfer is an important topic in the field of image processing.
  • Image style conversion technology can be used to enhance images; for example, nighttime images can be enhanced to obtain clearer images, greatly improving image visibility, which is of great significance for video surveillance.
  • With the continuing development of digital image processing, pattern recognition, and deep learning techniques, image style transfer methods are also evolving.
  • Image style transfer methods based on traditional image processing operate directly on the image itself, do not exploit high-level image features, and adapt poorly to new scenes.
  • Image style transfer based on generative adversarial networks is more widely used, but many technical problems remain: images are still blurred after style transfer, instances in the image, such as cars and people, are transferred poorly, and paired image data of different styles are difficult to obtain.
  • In view of this, the purpose of the present invention is to propose a method and device for image style conversion and model training, so as to improve the adaptability of image style conversion to a variety of different scenarios, ameliorate image blur and poor instance effects after style conversion, and achieve high-quality image style transfer at both coarse and fine granularity.
  • the present invention provides a method for image style conversion, comprising:
  • the first-style image to be style-converted and the second-style image serving as the reference image are respectively input into the content and style encoders in the encoder network, and the content- and style-encoding feature images are respectively extracted;
  • the style- and content-encoding feature images are respectively input into the multi-layer perceptron module and the residual convolution module in the decoder network, where the perceptron operation and the residual convolution operation are respectively performed to obtain the parameters of the adaptive instance normalization module and the intermediate process feature image; the obtained parameters are shared with the adaptive instance normalization module of the decoder network;
  • the intermediate process feature image is input into the adaptive instance normalization module for instance normalization, and the instance-normalized feature image is input into the upsampling layer of the decoder network to obtain a target image converted from the first style to the second style;
  • the image style conversion model composed of the encoder and decoder networks is pre-trained by image training samples including a plurality of images of the first and second styles and instance images cropped from the image training samples.
  • the image style conversion model is specifically obtained by pre-training according to the following method:
  • the image generation model includes: a global style transfer model and a local style transfer model; the global style transfer model includes a global encoder network and a global decoder network; the local style transfer model includes a local encoder network and a local decoder network;
  • an iterative training process includes:
  • the instance images cropped from the first- and second-style images are input into the local style transfer model, content features and style features are decoupled twice through the local encoder and decoder networks, and the content and style features are decoded to obtain reconstructed instance images;
  • the style-encoding feature image of the first/second-style image and the content-encoding feature image of the second/first-style instance image are input into the global decoder to obtain a generated first/second-style instance content image;
  • the local encoder and decoder network and the global encoder and decoder network are jointly adjusted.
  • the method further includes:
  • the first- and second-style images in the image training samples are input into the global style transfer model, content features and style features are decoupled twice through the global encoder and decoder networks, and the content and style features are decoded to obtain reconstructed first- and second-style images, including:
  • the first-style image in the image training sample is input into the global encoder network, and the content encoder and the style encoder in the global encoder network perform multi-layer convolution operations to decouple the content-encoding and style-encoding feature images of the first-style image;
  • the second-style image in the image training sample is input into the global encoder network, and the content encoder and the style encoder in the global encoder network perform multi-layer convolution operations to decouple the content-encoding and style-encoding feature images of the second-style image;
  • the style-encoding feature image of the first-style image and the content-encoding feature image of the second-style image are respectively input into the multi-layer perceptron module and the residual convolution module in the global decoder network, and the decoder network obtains a first generated image that fuses the style of the first-style image and the content of the second-style image;
  • the style-encoding feature image of the second-style image and the content-encoding feature image of the first-style image are respectively input into the multi-layer perceptron module and the residual convolution module in the global decoder network, obtaining a second generated image that fuses the style of the second-style image and the content of the first-style image;
  • the first generated image is input into the global encoder network, and the content encoder and style encoder of the global encoder network perform multi-layer convolution operations to decouple the content-encoding and style-encoding feature images of the first generated image;
  • the second generated image is input into the global encoder network, and the content encoder and style encoder of the global encoder network perform multi-layer convolution operations to decouple the content-encoding and style-encoding feature images of the second generated image;
  • the style-encoding feature image of the first generated image and the content-encoding feature image of the second generated image are respectively input into the multi-layer perceptron module and the residual convolution module in the global decoder network, finally obtaining the reconstructed first-style image;
  • the style-encoding feature image of the second generated image and the content-encoding feature image of the first generated image are respectively input into the multi-layer perceptron module and the residual convolution module in the global decoder network, finally obtaining the reconstructed second-style image.
  • the instance images cropped from the first- and second-style images are input into the local style transfer model, content features and style features are decoupled twice through the local encoder and decoder networks, and the content and style features are decoded to obtain reconstructed instance images, including:
  • the first-style instance images are input into the local encoder network, and the content encoder and the style encoder in the local encoder network perform multi-layer convolution operations to decouple the content-encoding and style-encoding feature images of the first-style instance images;
  • the style-encoding feature image of the first generated instance image and the content-encoding feature image of the second generated instance image are respectively input into the multi-layer perceptron module and the residual convolution module in the local decoder network, finally obtaining the reconstructed first-style instance image.
  • the instance image of the first/second style is an instance image cropped from the image of the first/second style.
  • the present invention also provides a device for image style conversion, comprising: an image style conversion model trained by the above-mentioned method, which is used to convert an input first-style image to be style-converted into a second-style target image according to an input second-style image serving as the reference image.
  • the present invention also provides a training method for an image style conversion model, comprising:
  • the image generation model includes: a global style transfer model and a local style transfer model; the global style transfer model includes a global encoder network and a global decoder network; the local style transfer model includes a local encoder network and a local decoder network;
  • an iterative training process includes:
  • the instance images cropped from the first- and second-style images are input into the local style transfer model, content features and style features are decoupled twice through the local encoder and decoder networks, and the content and style features are decoded to obtain reconstructed instance images;
  • the style-encoding feature image of the first/second-style image and the content-encoding feature image of the second/first-style instance image are input into the global decoder to obtain a generated first/second-style instance content image;
  • the local encoder and decoder network and the global encoder and decoder network are jointly adjusted.
  • the present invention also provides a training device for an image style conversion model, comprising:
  • An image generation model building module used for building an image generation model; wherein, the image generation model includes: a global style transfer model and a local style transfer model; wherein, the global style transfer model includes: a global encoder network, a decoder network; the local style transfer model includes: a local encoder and a decoder network;
  • the image generation model training module is used to perform multiple iterations of training on the image generation model; after the number of training iterations reaches the first preset number, the global style transfer model is used as the trained image style conversion model; wherein one iteration of training includes: inputting the first- and second-style images in the image training samples into the global style transfer model, decoupling content features and style features twice through the global encoder and decoder networks, and decoding the content and style features to obtain the reconstructed first- and second-style images; inputting the instance images cropped from the first- and second-style images into the local style transfer model, decoupling content features and style features twice through the local encoder and decoder networks, and decoding the content and style features to obtain the reconstructed instance images; inputting the style-encoding feature image of the first/second-style image and the content-encoding feature image of the second/first-style instance image into the global decoder to obtain the generated first/second-style instance content image; inputting the generated first/second-style instance content image into the content encoder and style encoder in the local encoder network for multi-layer convolution operations that decouple the style- and content-encoding feature images of the instance content image; inputting the decoupled style-encoding feature image of the second-style instance content image and the content-encoding feature image of the second-style image into the global decoder network to obtain the reconstructed cross-granularity second-style image; inputting the decoupled style-encoding feature image of the first-style image and the content-encoding feature image of the first-style instance content image into the global decoder to obtain the reconstructed cross-granularity instance first-style image; adjusting the parameters of the global encoder and decoder networks according to the distance between each reconstructed first/second-style image and the corresponding first/second-style image in the image training samples; adjusting the parameters of the local encoder and decoder networks according to the distance between each reconstructed instance image and the instance image cropped from the corresponding first/second-style image in the image training samples; and jointly adjusting the local encoder and decoder networks and the global encoder and decoder networks according to the distance between the reconstructed cross-granularity second-style image and the corresponding second-style image in the image training samples and the distance between the reconstructed cross-granularity instance first-style image and the corresponding first-style instance image.
  • the first-style image to be style-converted and the second-style image serving as the reference image are respectively input into the content and style encoders in the encoder network, and the content- and style-encoding feature images are respectively extracted; the style- and content-encoding feature images are respectively input into the multi-layer perceptron module and the residual convolution module in the decoder network to obtain the parameters of the adaptive instance normalization module and the intermediate process feature image; and the obtained parameters are shared with the adaptive instance normalization module of the decoder network;
  • the intermediate process feature image is input into the adaptive instance normalization module for instance normalization, and the instance-normalized feature image is input into the upsampling layer of the decoder network to obtain the target image converted from the first style to the second style;
  • the image style transfer model composed of the encoder and decoder networks is pre-trained with image training samples comprising a plurality of first- and second-style images and instance images cropped from the image training samples.
  • the technical solution of the present invention uses coarse-grained first- and second-style images together with fine-grained instance images when training the image style transfer model, thereby introducing cross-granularity learning into the style transfer model, which enhances the style transfer quality of fine-grained instances while ensuring the style transfer quality of coarse-grained global images, and ameliorates the blur and distortion of local instance images after style transfer.
  • FIG. 1 is a flowchart of a method for image style conversion provided by an embodiment of the present invention;
  • FIG. 2 is a schematic diagram of the internal structure of an image style conversion model provided by an embodiment of the present invention;
  • FIG. 3 is a flowchart of a training method for an image style conversion model provided by an embodiment of the present invention;
  • FIG. 4 is a schematic diagram of the internal structure of an image generation model provided by an embodiment of the present invention;
  • FIG. 5 is a flowchart of a method for iterative training of an image style transfer model provided by an embodiment of the present invention;
  • FIGS. 6a-6h are schematic diagrams of reconstructing first- and second-style images based on the global encoder and decoder networks according to an embodiment of the present invention;
  • FIGS. 7a-7h are schematic diagrams of reconstructing first- and second-style instance images based on the local encoder and decoder networks according to an embodiment of the present invention;
  • FIGS. 8a and 8b are schematic diagrams of generated first- and second-style instance content images based on the global decoder network according to an embodiment of the present invention;
  • FIG. 8c is a schematic diagram of the style and content features of an instance content image decoupled based on the local encoder network according to an embodiment of the present invention;
  • FIG. 8d is a schematic diagram of a cross-granularity second-style image reconstructed based on the global decoder network according to an embodiment of the present invention;
  • FIG. 8e is a schematic diagram of a cross-granularity instance first-style image reconstructed based on the global decoder network according to an embodiment of the present invention;
  • FIG. 8f is a schematic diagram of the internal structure of an adversarial training model provided by an embodiment of the present invention;
  • FIG. 9 is a block diagram of the internal structure of an apparatus for training an image style transfer model according to an embodiment of the present invention.
  • A method for image style conversion provided by an embodiment of the present invention is shown in FIG. 1 and includes the following steps:
  • Step S101 Input the image of the first style to be style-converted and the image of the second style as a reference image into the content and style encoders in the image style conversion model respectively, and extract the content and style encoding feature images respectively.
  • the internal structure of the image style conversion model of the present invention may include: an encoder network 201 and a decoder network 202; the encoder network 201 may include a content encoder 211 and a style encoder 212; the decoder network 202 may include a multi-layer perceptron module 221, a residual convolution module 222, an adaptive instance normalization module 223, and an upsampling layer 224.
  • the first-style image and the second-style image are images of different scenes; for example, the first-style image may be a nighttime image captured by a surveillance camera device, and the second-style image may be a non-nighttime image;
  • the image of the first style may be a grayscale image captured by an infrared camera device;
  • the image of the second style may be a color image;
  • the image of the first style to be style-converted and the image of the second style as the reference image are respectively input into the content encoder 211 and the style encoder 212 in the encoder network 201;
  • the content encoder 211 performs a first preset convolution operation on the input first-style image to extract high-level feature information of the input image; the style encoder 212 performs a second preset convolution operation on the input second-style image to extract high-level feature information of the input image;
  • the content encoder 211 and the style encoder 212 respectively output the extracted content coding feature image and style coding feature image.
  • the content encoder 211 and the style encoder 212 may be lightweight feature-extraction convolutional neural networks, such as UNet (a U-shaped convolutional neural network). It can be understood that a feature-extraction convolutional neural network continuously expands its receptive field through successive local convolution operations and extracts high-level feature information of the input image.
  • the content encoder 211 includes multiple convolutional layers; each convolutional layer continues the convolution operation on the encoded features output by the previous convolutional layer, and the encoded features output by the last convolutional layer are the content-encoding feature image of the input image; the convolution kernel size and stride of each convolutional layer in the content encoder 211 can be set according to the specific scenario, for example, a kernel size of (7×7) and a stride of (1×1).
  • the style encoder 212 includes multiple convolutional layers, a global pooling layer, and a fully connected layer; each convolutional layer continues the convolution operation on the encoded features output by the previous convolutional layer, the encoded features output by the last convolutional layer are globally pooled by the global pooling layer and then enter the fully connected layer, and the fully connected layer maps the globally pooled encoded features into the style feature space to obtain the style-encoding feature image.
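  • The patent does not disclose layer counts or channel widths. The following is a minimal PyTorch sketch of the two encoders described above: a content encoder of stacked convolutions (e.g. a (7×7) kernel with stride (1×1) in the first layer, as suggested above) and a style encoder of convolutions followed by global pooling and a fully connected mapping into the style space. All class names and dimensions are illustrative assumptions, not taken from the patent.

```python
import torch
import torch.nn as nn

class ContentEncoder(nn.Module):
    """Stacked convolutional layers; each layer continues the convolution on the
    previous layer's output, and the last layer's output is the content-encoding
    feature image."""
    def __init__(self, in_ch=3, base_ch=64, n_layers=3):
        super().__init__()
        layers = [nn.Conv2d(in_ch, base_ch, kernel_size=7, stride=1, padding=3),
                  nn.InstanceNorm2d(base_ch), nn.ReLU(inplace=True)]
        ch = base_ch
        for _ in range(n_layers - 1):  # further conv layers (downsampling assumed)
            layers += [nn.Conv2d(ch, ch * 2, kernel_size=4, stride=2, padding=1),
                       nn.InstanceNorm2d(ch * 2), nn.ReLU(inplace=True)]
            ch *= 2
        self.net = nn.Sequential(*layers)

    def forward(self, x):          # x: (B, 3, H, W)
        return self.net(x)         # content-encoding feature image

class StyleEncoder(nn.Module):
    """Convolutional layers, then a global pooling layer, then a fully connected
    layer that maps the pooled features into the style feature space."""
    def __init__(self, in_ch=3, base_ch=64, n_layers=4, style_dim=8):
        super().__init__()
        layers = [nn.Conv2d(in_ch, base_ch, 7, 1, 3), nn.ReLU(inplace=True)]
        ch = base_ch
        for _ in range(n_layers - 1):
            layers += [nn.Conv2d(ch, ch * 2, 4, 2, 1), nn.ReLU(inplace=True)]
            ch *= 2
        self.net = nn.Sequential(*layers)
        self.pool = nn.AdaptiveAvgPool2d(1)   # global pooling layer
        self.fc = nn.Linear(ch, style_dim)    # fully connected layer to style space

    def forward(self, x):
        h = self.pool(self.net(x)).flatten(1)
        return self.fc(h)          # style-encoding feature (style code)
```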
  • Step S102 Input the style and content coding feature images into the multi-layer perceptron module and the residual convolution module in the decoder network, respectively, to obtain the parameters of the adaptive instance normalization module and the intermediate process feature images, respectively.
  • the style-encoding feature image and the content-encoding feature image are respectively input into the multi-layer perceptron module 221 and the residual convolution module 222 in the decoder network 202, where the perceptron operation and the residual convolution operation are respectively performed to obtain the parameters of the adaptive instance normalization module and the intermediate process feature image;
  • the style encoding feature image and the content encoding feature image are respectively input into the multi-layer perceptron module 221 and the residual convolution module 222 in the decoder network 202;
  • the multi-layer perceptron module 221 performs a first preset perceptron operation on the input style-encoding feature image to obtain the parameters of the adaptive instance normalization module;
  • the residual convolution module 222 includes a multi-layer residual convolution layer, and the residual convolution module 222 performs a first preset residual convolution operation on the input content coding feature image to obtain an intermediate process feature image.
  • Step S103 Share the obtained parameters to the adaptive instance normalization module of the decoder network.
  • the parameters of the adaptive instance normalization module obtained by the multilayer perceptron module 221 are shared with the adaptive instance normalization module 223 in the decoder network 202 .
  • Step S104 Input the intermediate process feature image into the adaptive instance normalization module for instance normalization.
  • the intermediate process feature image obtained by the residual convolution module 222 is input to the adaptive instance normalization module 223 for instance normalization.
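  • The patent does not spell out the normalization formula; in the standard formulation of adaptive instance normalization, the per-channel statistics of the intermediate process feature image are replaced by the scale and shift parameters predicted by the multi-layer perceptron module 221:

```latex
\mathrm{AdaIN}(x;\gamma,\beta) = \gamma \cdot \frac{x - \mu(x)}{\sigma(x)} + \beta
```

  • Here μ(x) and σ(x) are the mean and standard deviation of the feature image x computed per channel and per instance, and γ, β are the parameters shared from the multi-layer perceptron module 221.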
  • Step S105 Input the instance-normalized feature image into the upsampling layer of the decoder network to obtain the target image converted from the first style to the second style.
  • the feature image instance-normalized by the adaptive instance normalization module 223 is input into the upsampling layer 224 of the decoder network 202, finally generating a target image that has the same size as the input first-style image to be style-converted and is converted from the first style to the second style, that is, a style-transferred image.
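  • A minimal PyTorch sketch of the decoder path just described, under the same illustrative assumptions as the encoder sketch above: the multi-layer perceptron maps the style code to AdaIN parameters, residual blocks produce the intermediate process feature image, AdaIN injects the style, and upsampling restores the input resolution.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def adain(x, gamma, beta, eps=1e-5):
    """Adaptive instance normalization: replace per-channel, per-instance
    statistics of x with the (gamma, beta) predicted from the style code."""
    mu = x.mean(dim=(2, 3), keepdim=True)
    sigma = x.std(dim=(2, 3), keepdim=True) + eps
    return gamma[..., None, None] * (x - mu) / sigma + beta[..., None, None]

class ResBlock(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.conv1 = nn.Conv2d(ch, ch, 3, 1, 1)
        self.conv2 = nn.Conv2d(ch, ch, 3, 1, 1)

    def forward(self, x):
        return x + self.conv2(F.relu(self.conv1(x)))

class Decoder(nn.Module):
    def __init__(self, ch=256, style_dim=8, out_ch=3, n_res=2, n_up=2):
        super().__init__()
        # multi-layer perceptron module: style code -> AdaIN (gamma, beta)
        self.mlp = nn.Sequential(nn.Linear(style_dim, ch), nn.ReLU(inplace=True),
                                 nn.Linear(ch, 2 * ch))
        # residual convolution module
        self.res = nn.Sequential(*[ResBlock(ch) for _ in range(n_res)])
        ups = []
        for _ in range(n_up):              # upsampling layers
            ups += [nn.Upsample(scale_factor=2, mode='nearest'),
                    nn.Conv2d(ch, ch // 2, 5, 1, 2), nn.ReLU(inplace=True)]
            ch //= 2
        ups += [nn.Conv2d(ch, out_ch, 7, 1, 3), nn.Tanh()]
        self.up = nn.Sequential(*ups)

    def forward(self, content_feat, style_code):
        gamma, beta = self.mlp(style_code).chunk(2, dim=1)  # shared AdaIN params
        h = self.res(content_feat)     # intermediate process feature image
        h = adain(h, gamma, beta)      # adaptive instance normalization module
        return self.up(h)              # target image in the second style
```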
  • the encoder network 201 is connected with the decoder network 202 in a decoupled manner to perform feature space fusion, and to perform style transfer with high quality while preserving the content.
  • the above-mentioned image style transfer model is pre-trained with image training samples comprising a plurality of first- and second-style images and instance images cropped from the image training samples; because coarse-grained first- and second-style images and fine-grained instance images are both used, cross-granularity learning is introduced into the style transfer model, enhancing the style transfer quality of fine-grained instances while ensuring the style transfer quality of coarse-grained global images.
  • the instances mainly include vehicles, pedestrians, and traffic signs; this improves the image style conversion effect and ameliorates the blur and distortion of local instance images after style conversion.
  • the image style transfer model is an unsupervised learning model that does not require paired reference training data of the same scene to be converted, which gives the model strong generalization ability, greatly reduces the difficulty of data acquisition, and improves the style transfer adaptability to different monitoring scenarios.
  • A training method for an image style conversion model provided by an embodiment of the present invention is shown in FIG. 3 and includes the following steps:
  • Step S300 Obtain image training samples.
  • a plurality of first-style images and second-style images of arbitrary scenes can be obtained from real monitoring data as image training samples; to ensure the style transfer effect of the trained image style transfer model, a large number of images from different monitoring scenarios can be selected as image training samples, and the first-style images and second-style images in the obtained training samples can be unpaired; object detection boxes annotated in the data can be used to extract fine-grained instances from the images, and instance images can be cropped from the image training samples to obtain unpaired instance training samples for the cross-granularity learning in subsequent steps.
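  • A sketch of building the unpaired instance training set from the annotated detection boxes, assuming boxes are given as (x1, y1, x2, y2) pixel coordinates; the function name and size threshold are hypothetical.

```python
def crop_instances(image, boxes, min_size=32):
    """Crop fine-grained instance images (e.g. vehicles, pedestrians, traffic
    signs) from a training image using its annotated detection boxes.
    image: array/tensor of shape (C, H, W); boxes: iterable of (x1, y1, x2, y2)."""
    _, H, W = image.shape
    crops = []
    for x1, y1, x2, y2 in boxes:
        x1, y1 = max(0, int(x1)), max(0, int(y1))
        x2, y2 = min(W, int(x2)), min(H, int(y2))
        if x2 - x1 >= min_size and y2 - y1 >= min_size:  # skip tiny instances
            crops.append(image[:, y1:y2, x1:x2])
    return crops
```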
  • Step S301 Build an image generation model.
  • the constructed image generation model includes: a global style transfer model 401 and a local style transfer model 402; wherein, the global style transfer model 401 includes: a global encoder network 411 and a global decoder network 412 ; the local style transfer model 402 includes: a local encoder network 421 and a local decoder network 422 .
  • the structures of the global encoder network 411 and the local encoder network 421 are the same, and the structures of the global decoder network 412 and the local decoder network 422 are the same.
  • the structure of the global encoder network 411 can be the same as the structure of the encoder network 201 in the above image style transfer model, and the structure of the global decoder network 412 can be the same as the structure of the decoder network 202 in the above image style transfer model .
  • Step S302 Perform multiple iterative training on the image generation model, and obtain a trained image style conversion model after the number of iterative training times reaches a first preset number of times.
  • the image generation model is iteratively trained multiple times; after the number of training iterations reaches the first preset number, the trained image style conversion model is constructed from the global encoder network 411 and the global decoder network 412, that is, the global encoder network 411 and the global decoder network 412 are used as the encoder network 201 and the decoder network 202 of the trained image style transfer model.
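  • A high-level training-loop skeleton for Step S302, assuming PyTorch-style modules and a single optimizer over all four networks; `model.iteration` is a hypothetical method standing in for sub-steps S501-S509 detailed below.

```python
def train(model, loader, opt, first_preset_iters=20000):
    """model bundles the global/local encoder and decoder networks; loader
    yields unpaired (first-style, second-style) image batches together with
    their cropped instance images."""
    it = 0
    while it < first_preset_iters:
        for img_a, img_b, inst_a, inst_b in loader:
            losses = model.iteration(img_a, img_b, inst_a, inst_b)  # S501-S509
            opt.zero_grad()
            sum(losses).backward()
            opt.step()
            it += 1
            if it >= first_preset_iters:
                break
    # the global networks become the encoder 201 and decoder 202 of the
    # trained image style transfer model
    return model.global_encoder, model.global_decoder
```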
  • Sub-step S501 Input the images of the first and second styles in the image training sample into the global style transfer model to obtain the reconstructed images of the first and second styles;
  • the first- and second-style images in the image training samples are input into the global style transfer model; content features and style features are decoupled twice through the global encoder and decoder networks, and the decoded content and style features yield the reconstructed first- and second-style images;
  • the first-style image in the image training sample is input into the global encoder network 411, and the multi-layer convolution operations performed by the content encoder and the style encoder in the global encoder network 411 decouple the content-encoding feature image and the style-encoding feature image of the first-style image;
  • the second-style image in the image training sample is input into the global encoder network 411, and the multi-layer convolution operations performed by the content encoder and the style encoder in the global encoder network 411 decouple the content-encoding feature image and the style-encoding feature image of the second-style image;
  • the style-encoding feature image of the first-style image and the content-encoding feature image of the second-style image are respectively input into the multi-layer perceptron module and the residual convolution module in the global decoder network 412; the multi-layer perceptron module in the global decoder network 412 operates on the input style-encoding feature image of the first-style image, and the output parameters are shared as the parameters of the adaptive instance normalization module of the global decoder network 412;
  • the residual convolution module in the global decoder network 412 performs a residual convolution operation on the input content-encoding feature image of the second-style image and outputs the intermediate process feature image to the adaptive instance normalization module of the global decoder network 412, so as to obtain an instance-normalized feature image that fuses the content features of the second-style image and the style features of the first-style image; the instance-normalized feature image is input into the upsampling layer of the global decoder network 412, obtaining a first generated image that fuses the style of the first-style image and the content of the second-style image.
  • the style-encoding feature image of the second-style image and the content-encoding feature image of the first-style image are respectively input into the multi-layer perceptron module and the residual convolution module in the global decoder network 412, obtaining a second generated image that fuses the style of the second-style image and the content of the first-style image.
  • the first generated image is input into the global encoder network 411, and the multi-layer convolution operations performed by the content encoder and the style encoder of the global encoder network 411 decouple the content-encoding feature image and the style-encoding feature image of the first generated image;
  • the second generated image is input into the global encoder network 411, and the multi-layer convolution operations performed by the content encoder and the style encoder of the global encoder network 411 decouple the content-encoding feature image and the style-encoding feature image of the second generated image;
  • the style encoding feature image of the first generated image and the content encoding feature image of the second generated image are respectively input into the multilayer perceptron module and the residual convolution module in the global decoder network 412, The reconstructed first style image is finally obtained.
  • the style encoding feature image of the second generated image and the content encoding feature image of the first generated image are respectively input to the multilayer perceptron module and the residual convolution module in the global decoder network 412, Finally a reconstructed second style image is obtained.
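  • A condensed sketch of the two swap rounds just described, writing the content and style encoders of the global encoder network 411 as `enc_c`/`enc_s` and the global decoder network 412 as `dec(content, style)`; the handles are illustrative.

```python
def global_reconstruction(enc_c, enc_s, dec, img_a, img_b):
    """Decouple twice and decode: swap styles to obtain the first and second
    generated images, then swap back to reconstruct both input images."""
    c_a, s_a = enc_c(img_a), enc_s(img_a)   # decouple the first-style image
    c_b, s_b = enc_c(img_b), enc_s(img_b)   # decouple the second-style image
    gen_1 = dec(c_b, s_a)   # first generated image: first style + second content
    gen_2 = dec(c_a, s_b)   # second generated image: second style + first content
    c_1, s_1 = enc_c(gen_1), enc_s(gen_1)   # decouple the generated images again
    c_2, s_2 = enc_c(gen_2), enc_s(gen_2)
    rec_a = dec(c_2, s_1)   # reconstructed first-style image
    rec_b = dec(c_1, s_2)   # reconstructed second-style image
    return rec_a, rec_b
```

  • The local reconstruction in sub-step S502 follows the same pattern, with the local encoder network 421 and local decoder network 422 applied to the cropped instance images.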
  • Sub-step S502 Input the instance images cropped from the first- and second-style images into the local style transfer model to obtain the reconstructed first- and second-style instance images;
  • the instance images cropped from the first- and second-style images are input into the local style transfer model; content features and style features are decoupled twice through the local encoder and decoder networks, and the decoded content and style features yield the reconstructed instance images; for ease of description, the instance images cropped from the first-style images in the image training samples are referred to as first-style instance images, and the instance images cropped from the second-style images in the image training samples are referred to as second-style instance images;
  • the first-style instance image is input into the local encoder network 421, and the multi-layer convolution operations performed by the content encoder and the style encoder in the local encoder network 421 decouple the content-encoding feature image and the style-encoding feature image of the first-style instance image;
  • the second-style instance image is input into the local encoder network 421, and the multi-layer convolution operations performed by the content encoder and the style encoder in the local encoder network 421 decouple the content-encoding feature image and the style-encoding feature image of the second-style instance image;
  • the style-encoding feature image of the first-style instance image and the content-encoding feature image of the second-style instance image are respectively input into the multi-layer perceptron module and the residual convolution module in the local decoder network 422; the multi-layer perceptron module in the local decoder network 422 operates on the input style-encoding feature image of the first-style instance image, and the output parameters are shared as the parameters of the adaptive instance normalization module of the local decoder network 422; the residual convolution module in the local decoder network 422 performs a residual convolution operation on the input content-encoding feature image of the second-style instance image and outputs the intermediate process feature image to the adaptive instance normalization module of the local decoder network 422, so as to obtain an instance-normalized feature image that fuses the content features of the second-style instance image and the style features of the first-style instance image; the instance-normalized feature image is input into the upsampling layer of the local decoder network 422, resulting in a first generated instance image that fuses the style of the first-style instance image and the content of the second-style instance image.
  • the style-encoding feature image of the second-style instance image and the content-encoding feature image of the first-style instance image are respectively input into the multi-layer perceptron module and the residual convolution module in the local decoder network 422 to obtain a second generated instance image that fuses the style of the second-style instance image and the content of the first-style instance image.
  • the first generated instance image is input into the local encoder network 421, and the multi-layer convolution operations performed by the content encoder and the style encoder of the local encoder network 421 decouple the content-encoding feature image and the style-encoding feature image of the first generated instance image;
  • the second generated instance image is input into the local encoder network 421, and the multi-layer convolution operations performed by the content encoder and the style encoder of the local encoder network 421 decouple the content-encoding feature image and the style-encoding feature image of the second generated instance image;
  • the style-encoding feature image of the first generated instance image and the content-encoding feature image of the second generated instance image are respectively input into the multi-layer perceptron module and the residual convolution module in the local decoder network 422, finally obtaining the reconstructed first-style instance image.
  • the style-encoding feature image of the second generated instance image and the content-encoding feature image of the first generated instance image are respectively input into the multi-layer perceptron module and the residual convolution module in the local decoder network 422, finally obtaining the reconstructed second-style instance image.
  • Sub-step S503 Input the style-encoding feature image of the first/second-style image and the content-encoding feature image of the second/first-style instance image into the global decoder network to obtain the generated first/second-style instance content image;
  • the style-encoding feature image of the second-style image and the content-encoding feature image of the first-style instance image are respectively input into the multi-layer perceptron module and the residual convolution module in the global decoder network 412, obtaining an image that fuses the style of the second-style image and the content of the first-style instance image, that is, the generated second-style instance content image;
  • the style-encoding feature image of the first-style image and the content-encoding feature image of the second-style instance image are respectively input into the multi-layer perceptron module and the residual convolution module in the global decoder network 412, obtaining an image that fuses the style of the first-style image and the content of the second-style instance image, that is, the generated first-style instance content image.
  • Sub-step S504 Input the generated second-style instance content image into the content encoder and the style encoder in the local encoder network for multi-layer convolution operations to decouple the style- and content-encoding feature images of the instance content image;
  • the generated second-style instance content image is input into the content encoder and the style encoder in the local encoder network 421 for multi-layer convolution operations, decoupling the style-encoding feature image and the content-encoding feature image of the second-style instance content image.
  • Sub-step S505 Input the decoupled style coding feature image of the second style instance content image and the content coding feature image of the second style image into the global decoder network to obtain a reconstructed cross-granularity second style image ;
  • the decoupled style-encoding feature image of the second-style instance content image and the content-encoding feature image of the second-style image are respectively input into the multi-layer perceptron module and the residual convolution module in the global decoder network 412 to obtain the reconstructed cross-granularity second-style image.
  • Sub-step S506 Input the decoupled style-encoding feature image of the instance image of the first style and the content-encoding feature image of the instance content image of the second style into the local decoder network to obtain the reconstructed cross-granularity instance first style image;
  • the decoupled style-encoding feature image of the first-style instance image and the content-encoding feature image of the second-style instance content image are respectively input into the multi-layer perceptron module and the residual convolution module in the local decoder network 422 to obtain the reconstructed cross-granularity instance first-style image.
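  • A condensed sketch of sub-steps S503-S506 under the same conventions (`g_enc_c`/`g_enc_s`/`g_dec` for the global networks 411/412, `l_enc_c`/`l_enc_s`/`l_dec` for the local networks 421/422); this is an interpretation of the text above, not code from the patent, and it shows only the second-style branch.

```python
def cross_granularity_reconstruction(g_enc_c, g_enc_s, g_dec,
                                     l_enc_c, l_enc_s, l_dec,
                                     img_b, inst_a):
    """S503: global second-style style + first-style instance content.
    S504: decouple the generated instance content image with the local encoders.
    S505/S506: decode back to cross-granularity images at each granularity."""
    s_b = g_enc_s(img_b)          # style of the second-style image
    c_inst_a = l_enc_c(inst_a)    # content of a first-style instance image
    gen_inst_b = g_dec(c_inst_a, s_b)  # S503: generated second-style instance content image
    s_gen = l_enc_s(gen_inst_b)   # S504: decoupled style encoding
    c_gen = l_enc_c(gen_inst_b)   # S504: decoupled content encoding
    rec_cross_b = g_dec(g_enc_c(img_b), s_gen)        # S505: cross-granularity second-style image
    rec_cross_inst_a = l_dec(c_gen, l_enc_s(inst_a))  # S506: cross-granularity instance first-style image
    return rec_cross_b, rec_cross_inst_a
```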
  • Sub-step S507 Adjust the parameters of the global encoder and decoder network according to the distance between the reconstructed first/second style image and the corresponding first/second style image in the image training sample;
  • the parameters of the global encoder and decoder networks are adjusted according to the distances between the reconstructed first- and second-style images and the corresponding first- and second-style images in the image training samples.
  • Sub-step S508 Adjust the parameters of the local encoder and decoder networks according to the distance between the reconstructed instance image and the instance image cropped from the corresponding first/second style image in the image training sample ;
  • the parameters of the local encoder and decoder networks are adjusted according to the distances between the reconstructed first- and second-style instance images and the instance images cropped from the corresponding first- and second-style images in the image training samples.
  • Sub-step S509 According to the distance between the reconstructed cross-granularity second-style image and the corresponding second-style image in the image training sample, and the distance between the reconstructed cross-granularity instance first-style image and the corresponding first-style instance image, the local encoder and decoder networks and the global encoder and decoder networks are jointly adjusted.
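  • A sketch of how the three distance terms of sub-steps S507-S509 could be combined, assuming L1 pixel distances; the patent only requires some measure of image difference, so the loss choice is an assumption.

```python
import torch.nn.functional as F

def iteration_losses(rec_a, rec_b, img_a, img_b,
                     rec_inst_a, rec_inst_b, inst_a, inst_b,
                     rec_cross_b, rec_cross_inst_a):
    # S507: global reconstruction distances adjust the global encoder/decoder
    loss_global = F.l1_loss(rec_a, img_a) + F.l1_loss(rec_b, img_b)
    # S508: instance reconstruction distances adjust the local encoder/decoder
    loss_local = F.l1_loss(rec_inst_a, inst_a) + F.l1_loss(rec_inst_b, inst_b)
    # S509: cross-granularity distances jointly adjust both network pairs
    loss_cross = F.l1_loss(rec_cross_b, img_b) + F.l1_loss(rec_cross_inst_a, inst_a)
    return loss_global, loss_local, loss_cross
```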
  • the first preset number of times may be 10,000, 20,000, 50,000, etc., which is not specifically limited.
  • the adversarial learning method can also be adopted based on the discriminator to continue training:
  • Step S303 Perform multiple iterations of adversarial training on the image generation model based on the discriminators; when the number of adversarial training iterations reaches a second preset number, use the global style transfer model as the final trained image style transfer model.
  • the adversarial training model based on the image generation model and the discriminators includes: the image generation model, a first discriminator, and a second discriminator; the input of the first discriminator is connected to the output of the global decoder network in the image generation model, and the input of the second discriminator is connected to the output of the local decoder network in the image generation model;
  • the first- and second-style images in the image training samples are input into the global style transfer model of the adversarial training model, and the global decoder network of the adversarial training model outputs the reconstructed first- and second-style images to the first discriminator; the first discriminator discriminates the authenticity of the input images; according to the discrimination results of the first discriminator, the parameters of the first discriminator are adjusted to enhance its discrimination ability; according to the distance between each reconstructed first/second-style image and the corresponding first/second-style image in the image training samples, the parameters of the global encoder and decoder networks are adjusted;
  • the second discriminator discriminates the authenticity of the input image; according to the discriminant result of the second discriminator, adjust the parameters of the second discriminator to enhance the discriminant ability of the second discriminator;
  • the cross-granularity second-style image reconstructed by the global decoder network of the adversarial training model can also be output to the first discriminator; the first discriminator discriminates the authenticity of the input image; according to its discrimination result, the parameters of the first discriminator are adjusted to enhance its discrimination ability; according to the distance between the reconstructed cross-granularity second-style image and the corresponding second-style image in the image training sample, the parameters of the global encoder and decoder networks are adjusted; the method for generating the reconstructed cross-granularity second-style image can be the same as that described in sub-steps S503, S504, and S505 above, and is not repeated here;
  • the cross-granularity instance first-style image reconstructed by the global decoder network of the adversarial training model can also be output to the first discriminator; the first discriminator discriminates the authenticity of the input image; according to its discrimination result, the parameters of the first discriminator are adjusted to enhance its discrimination ability; according to the distance between the reconstructed cross-granularity instance first-style image and the corresponding first-style instance image, the parameters of the global encoder and decoder networks are adjusted; the method for generating the reconstructed cross-granularity instance first-style image can be the same as that described in sub-steps S503, S504, and S506 above, and is not repeated here;
  • the distance between the above-mentioned images reflects the difference between them and is used to adjust the parameters of the image generation model; the difference can be the pixel difference between the generated image and the real image, or any other measure that can represent the difference between the two, and the specific method of determination is not limited.
  • after the adversarial training, the image generation model can generally generate style-converted images with high authenticity, and the first and second discriminators generally cannot distinguish real images from generated images.
  • the parameter adjustment of the image generation model and the first and second discriminators can be stopped to obtain the final image generation model and the first and second discriminators.
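  • A minimal sketch of one adversarial update under the usual GAN objective, assuming binary real/fake discriminators; the patent leaves the exact adversarial loss open, so the cross-entropy formulation and the L1 term weight are assumptions.

```python
import torch
import torch.nn.functional as F

def discriminator_step(disc, real, fake, opt_d):
    """Enhance the discriminator's ability to tell real images from generated
    or reconstructed ones (applies to the first and second discriminators alike)."""
    opt_d.zero_grad()
    real_logits = disc(real)
    fake_logits = disc(fake.detach())   # detach: do not update the generator here
    loss_d = (F.binary_cross_entropy_with_logits(real_logits, torch.ones_like(real_logits))
              + F.binary_cross_entropy_with_logits(fake_logits, torch.zeros_like(fake_logits)))
    loss_d.backward()
    opt_d.step()

def generator_step(disc, fake, target, opt_g, lam=10.0):
    """Adjust the generation model so its output fools the discriminator while
    staying close (here: L1 distance) to the corresponding training image."""
    opt_g.zero_grad()
    fake_logits = disc(fake)
    loss_g = (F.binary_cross_entropy_with_logits(fake_logits, torch.ones_like(fake_logits))
              + lam * F.l1_loss(fake, target))
    loss_g.backward()
    opt_g.step()
```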
  • the second preset number of times may be 10,000, 20,000, 50,000, etc., which is not specifically limited here.
  • the global encoder network 411 and the global decoder network 412 in the image generation model are used as the encoder network 201 and the decoder network 202 in the final image style transfer model obtained by training.
  • An apparatus for image style conversion provided by an embodiment of the present invention includes an image style conversion model trained by the above-mentioned method, which is used to convert an input first-style image to be style-converted into a second-style target image according to an input second-style image serving as the reference image.
  • an apparatus for training an image style transfer model provided by an embodiment of the present invention has a structure as shown in Figure 9, including: an image generation model building module 901 and an image generation model training module 902.
  • the image generation model building module 901 is used to construct an image generation model; wherein, the image generation model includes: a global style transfer model and a local style transfer model; wherein, the global style transfer model includes: a global encoder network and decoder network; the local style transfer model includes: local encoder and decoder network;
  • the image generation model training module 902 is used to perform multiple iterations of training on the image generation model; after the number of training iterations reaches the first preset number, the global style transfer model is used as the trained image style conversion model; wherein one iteration of training includes: inputting the first- and second-style images in the image training samples into the global style transfer model, decoupling content features and style features twice through the global encoder and decoder networks, and decoding the content and style features to obtain the reconstructed first- and second-style images; inputting the instance images cropped from the first- and second-style images into the local style transfer model, decoupling content features and style features twice through the local encoder and decoder networks, and decoding the content and style features to obtain the reconstructed instance images; inputting the style-encoding feature image of the first/second-style image and the content-encoding feature image of the second/first-style instance image into the global decoder to obtain the generated first/second-style instance content image; inputting the generated first/second-style instance content image into the content encoder and style encoder in the local encoder network for multi-layer convolution operations that decouple the style- and content-encoding feature images of the instance content image; inputting the decoupled style-encoding feature image of the second-style instance content image and the content-encoding feature image of the second-style image into the global decoder network to obtain the reconstructed cross-granularity second-style image; inputting the decoupled style-encoding feature image of the first-style image and the content-encoding feature image of the first-style instance content image into the global decoder to obtain the reconstructed cross-granularity instance first-style image; adjusting the parameters of the global encoder and decoder networks according to the distance between each reconstructed first/second-style image and the corresponding first/second-style image in the image training samples; adjusting the parameters of the local encoder and decoder networks according to the distance between each reconstructed instance image and the instance image cropped from the corresponding first/second-style image in the image training samples; and jointly adjusting the local encoder and decoder networks and the global encoder and decoder networks according to the distance between the reconstructed cross-granularity second-style image and the corresponding second-style image in the image training samples and the distance between the reconstructed cross-granularity instance first-style image and the corresponding first-style instance image.
  • the apparatus for training an image style transfer model may further include: an adversarial training module 903.
  • the adversarial training module 903 is used to perform multiple iterations of adversarial training on the image generation model based on the discriminators; after the number of adversarial training iterations reaches the second preset number, the global style transfer model in the image generation model is used as the final trained image style transfer model.
  • the first-style image to be style-converted and the second-style image serving as the reference image are respectively input into the content and style encoders in the encoder network, and the content- and style-encoding feature images are respectively extracted; the style- and content-encoding feature images are respectively input into the multi-layer perceptron module and the residual convolution module in the decoder network to obtain the parameters of the adaptive instance normalization module and the intermediate process feature image; and the obtained parameters are shared with the adaptive instance normalization module of the decoder network;
  • the intermediate process feature image is input into the adaptive instance normalization module for instance normalization, and the instance-normalized feature image is input into the upsampling layer of the decoder network to obtain the target image converted from the first style to the second style;
  • the image style transfer model composed of the encoder and decoder networks is pre-trained with image training samples comprising a plurality of first- and second-style images and instance images cropped from the image training samples.
  • the technical solution of the present invention uses coarse-grained first- and second-style images together with fine-grained instance images when training the image style transfer model, thereby introducing cross-granularity learning into the style transfer model, which enhances the style transfer quality of fine-grained instances while ensuring the style transfer quality of coarse-grained global images, and ameliorates the blur and distortion of local instance images after style transfer.
  • the image style conversion model of the present invention is an unsupervised learning model that does not require paired reference training data of the same scene to be converted, which gives the model strong generalization ability, greatly reduces the difficulty of data acquisition, and improves the style transfer adaptability to different monitoring scenarios.
  • the computer readable medium of this embodiment includes both permanent and non-permanent, removable and non-removable media and can be implemented by any method or technology for information storage.
  • Information may be computer readable instructions, data structures, modules of programs, or other data.
  • Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic tape cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)
  • Image Processing (AREA)

Abstract

Disclosed are an image style transfer method and apparatus, and an image style transfer model training method and apparatus. The image style transfer method comprises: respectively inputting first-style images and second-style images into a content encoder and a style encoder in an encoder network, and respectively extracting content encoded feature images and style encoded feature images; and respectively inputting the style encoded feature images and the content encoded feature images into a decoder network, so as to obtain target images that are transferred from a first style to a second style, wherein an image style transfer model composed of the encoder network and the decoder network is obtained by means of performing pre-training on the basis of image training samples that comprise a plurality of first-style images and second-style images, and instance images cropped from the image training samples. By using the present invention, the style transfer adaptability to a plurality of different scenarios can be improved, the problems of image blur, and poor instance effects in an image after style transfer can be ameliorated, and coarse-grained and fine-grained high-quality image style transfer can also be achieved.

Description

A method and device for image style conversion and model training

Technical Field
The present invention relates to the technical field of image processing, and in particular to a method and device for image style conversion and model training.
Background Art
Image style transfer is an important topic in the field of image processing. With social and economic development, demand for audio and video processing keeps growing, concentrated in particular on face editing/generation, image enhancement, and image style conversion; apps and cameras that apply these technologies readily attract attention and create economic and social value. Image style conversion can be used for image enhancement: nighttime images can be enhanced into images carrying clearer information, greatly improving visibility, which is of great significance for video surveillance. With the continuing development and refinement of digital image processing, pattern recognition, and deep learning, image style transfer methods keep evolving as well.
Image style methods based on traditional image processing operate directly on the image itself, make no use of high-level image features, and adapt poorly to new scenes. In the prior art, image style transfer based on generative adversarial networks is more widely applied, but many technical problems remain: images are blurred after style transfer, instances in the image such as cars and people transfer poorly, and obtaining paired image data of different styles is difficult.
Summary of the Invention
In view of this, the purpose of the present invention is to propose a method and device for image style conversion and model training, so as to improve the adaptability of image style conversion to a variety of different scenarios, ameliorate image blur and poor instance rendering after style conversion, and achieve high-quality image style transfer at both coarse and fine granularity.
Based on the above purpose, the present invention provides a method for image style conversion, comprising:
inputting a first-style image to be converted and a second-style image serving as a reference image into the content encoder and the style encoder of an encoder network, respectively, to extract a content-encoded feature image and a style-encoded feature image;
inputting the style-encoded feature image and the content-encoded feature image into a multi-layer perceptron module and a residual convolution module of a decoder network, respectively, and performing a perceptron operation and a residual convolution operation to obtain, respectively, the parameters of an adaptive instance normalization module and an intermediate-process feature image; and sharing the obtained parameters with the adaptive instance normalization module of the decoder network;
inputting the intermediate-process feature image into the adaptive instance normalization module for instance normalization, and inputting the instance-normalized feature image into the upsampling layer of the decoder network to obtain a target image converted from the first style to the second style;
wherein the image style conversion model composed of the encoder and decoder networks is pre-trained with image training samples comprising a plurality of first-style and second-style images, together with instance images cropped from the image training samples.
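To make the data flow of the claimed method concrete, the following is a minimal PyTorch sketch of one style-transfer call. The module classes, layer counts, channel widths, and style-code dimension are illustrative assumptions, not the patent's exact architecture; only the overall flow (content/style encoding, MLP-derived AdaIN parameters, residual convolution, AdaIN, upsampling) follows the text.

    import torch
    import torch.nn as nn

    class ContentEncoder(nn.Module):
        # Stacked convolutions extracting the content-encoded feature image.
        def __init__(self):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv2d(3, 64, 7, 1, 3), nn.ReLU(True),
                nn.Conv2d(64, 128, 4, 2, 1), nn.ReLU(True))
        def forward(self, x):
            return self.net(x)                      # (B, 128, H/2, W/2)

    class StyleEncoder(nn.Module):
        # Convolutions + global pooling + fully connected layer -> style code.
        def __init__(self, style_dim=8):
            super().__init__()
            self.conv = nn.Sequential(
                nn.Conv2d(3, 64, 7, 1, 3), nn.ReLU(True),
                nn.AdaptiveAvgPool2d(1), nn.Flatten())
            self.fc = nn.Linear(64, style_dim)
        def forward(self, x):
            return self.fc(self.conv(x))            # (B, style_dim)

    class Decoder(nn.Module):
        # MLP -> AdaIN parameters; residual conv -> intermediate feature;
        # AdaIN; upsampling back to image space.
        def __init__(self, style_dim=8, ch=128):
            super().__init__()
            self.mlp = nn.Sequential(nn.Linear(style_dim, 256), nn.ReLU(True),
                                     nn.Linear(256, 2 * ch))
            self.res = nn.Conv2d(ch, ch, 3, 1, 1)
            self.up = nn.Sequential(nn.Upsample(scale_factor=2),
                                    nn.Conv2d(ch, 3, 5, 1, 2), nn.Tanh())
        def forward(self, content_feat, style_code):
            h = content_feat + self.res(content_feat)      # intermediate-process feature
            gamma, beta = self.mlp(style_code).chunk(2, dim=1)
            mu = h.mean(dim=(2, 3), keepdim=True)
            sigma = h.std(dim=(2, 3), keepdim=True) + 1e-5
            h = gamma[..., None, None] * (h - mu) / sigma + beta[..., None, None]
            return self.up(h)

    # One call: content from x_a (first style), style from x_b (second style).
    enc_c, enc_s, dec = ContentEncoder(), StyleEncoder(), Decoder()
    x_a, x_b = torch.rand(1, 3, 128, 128), torch.rand(1, 3, 128, 128)
    y_ab = dec(enc_c(x_a), enc_s(x_b))  # target image in the second style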
Preferably, the image style conversion model is pre-trained according to the following method:
constructing an image generation model, wherein the image generation model comprises a global style transfer model and a local style transfer model; the global style transfer model comprises a global encoder network and a global decoder network, and the local style transfer model comprises a local encoder network and a local decoder network;
performing multiple training iterations on the image generation model, and after the number of training iterations reaches a first preset number, taking the global style transfer model as the trained image style conversion model;
wherein one training iteration comprises:
inputting the first-style and second-style images of the image training samples into the global style transfer model, and obtaining reconstructed first-style and second-style images through two rounds of decoupling content features and style features and decoding them with the global encoder and decoder networks;
inputting the instance images cropped from the first-style and second-style images into the local style transfer model, and obtaining reconstructed instance images through two rounds of decoupling content features and style features and decoding them with the local encoder and decoder networks;
inputting the style-encoded feature image of the first/second-style image and the content-encoded feature image of the second/first-style instance image into the global decoder, to obtain a generated first/second-style instance content image;
inputting the generated first/second-style instance content image into the content encoder and the style encoder of the local encoder network, which decouple the style-encoded and content-encoded feature images of the instance content image through multi-layer convolution operations;
inputting the decoupled style-encoded feature image of the second-style instance content image and the content-encoded feature image of the second-style image into the global decoder network, to obtain a reconstructed cross-granularity second-style image;
inputting the decoupled style-encoded feature image of the first-style image and the content-encoded feature image of the first-style instance content image into the global decoder network, to obtain a reconstructed cross-granularity first-style instance image;
adjusting the parameters of the global encoder and decoder networks according to the distance between the reconstructed first/second-style image and the corresponding first/second-style image in the image training samples;
adjusting the parameters of the local encoder and decoder networks according to the distance between the reconstructed instance image and the instance image cropped from the corresponding first/second-style image in the image training samples;
jointly adjusting the local encoder and decoder networks and the global encoder and decoder networks according to the distance between the reconstructed cross-granularity second-style image and the corresponding second-style image in the image training samples, and the distance between the reconstructed cross-granularity first-style instance image and the corresponding first-style instance image.
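The patent specifies these three adjustment signals only as "distances" between reconstructions and their references. The sketch below shows one plausible instantiation using L1 distances; the choice of L1 and the equal weighting are assumptions, and the tensors are placeholders for images produced earlier in the iteration.

    import torch
    import torch.nn.functional as F

    x_a     = torch.rand(1, 3, 128, 128)  # first-style training image
    x_a_rec = torch.rand(1, 3, 128, 128)  # reconstructed first-style image (global path)
    p_a     = torch.rand(1, 3, 64, 64)    # instance image cropped from x_a
    p_a_rec = torch.rand(1, 3, 64, 64)    # reconstructed instance image (local path)
    x_b     = torch.rand(1, 3, 128, 128)  # second-style training image
    x_b_cg  = torch.rand(1, 3, 128, 128)  # reconstructed cross-granularity second-style image
    p_a_cg  = torch.rand(1, 3, 64, 64)    # reconstructed cross-granularity first-style instance

    loss_global = F.l1_loss(x_a_rec, x_a)  # adjusts the global encoder/decoder
    loss_local  = F.l1_loss(p_a_rec, p_a)  # adjusts the local encoder/decoder
    loss_cross  = F.l1_loss(x_b_cg, x_b) + F.l1_loss(p_a_cg, p_a)  # joint adjustment
    total_loss  = loss_global + loss_local + loss_cross  # equal weights: an assumption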
Preferably, after the number of iterations reaches the first preset number, the method further comprises:
performing multiple iterations of adversarial training on the image generation model with a discriminator, and when the number of adversarial training iterations reaches a second preset number, taking the global style transfer model within the image generation model as the final trained image style conversion model.
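The discriminator's architecture and the form of the adversarial loss are not specified at this point in the text (the adversarial model is shown in FIG. 8f). The sketch below assumes a small PatchGAN-style discriminator with a least-squares GAN loss, purely for illustration.

    import torch
    import torch.nn as nn

    class Discriminator(nn.Module):
        # Small patch discriminator: per-patch real/fake scores.
        def __init__(self):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv2d(3, 64, 4, 2, 1), nn.LeakyReLU(0.2, True),
                nn.Conv2d(64, 128, 4, 2, 1), nn.LeakyReLU(0.2, True),
                nn.Conv2d(128, 1, 4, 1, 1))
        def forward(self, x):
            return self.net(x)

    D = Discriminator()
    real = torch.rand(1, 3, 128, 128)  # second-style training image
    fake = torch.rand(1, 3, 128, 128)  # image generated by the style transfer model
    d_loss = ((D(real) - 1) ** 2).mean() + (D(fake.detach()) ** 2).mean()
    g_loss = ((D(fake) - 1) ** 2).mean()  # drives the generator toward realistic output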
Wherein, inputting the first-style and second-style images of the image training samples into the global style transfer model, and obtaining reconstructed first-style and second-style images through two rounds of decoupling content features and style features and decoding them with the global encoder and decoder networks, specifically comprises:

inputting the first-style image of the image training samples into the global encoder network, where the content encoder and the style encoder of the global encoder network decouple, through multi-layer convolution operations, the content-encoded feature image and the style-encoded feature image of the first-style image;

inputting the second-style image of the image training samples into the global encoder network, where the content encoder and the style encoder of the global encoder network decouple, through multi-layer convolution operations, the content-encoded feature image and the style-encoded feature image of the second-style image;

inputting the style-encoded feature image of the first-style image and the content-encoded feature image of the second-style image into the multi-layer perceptron module and the residual convolution module of the global decoder network, respectively, the global decoder network producing a first generated image that fuses the style of the first-style image with the content of the second-style image;

inputting the style-encoded feature image of the second-style image and the content-encoded feature image of the first-style image into the multi-layer perceptron module and the residual convolution module of the global decoder network, respectively, to obtain a second generated image that fuses the style of the second-style image with the content of the first-style image;

inputting the first generated image into the global encoder network, where the content encoder and the style encoder of the global encoder network decouple, through multi-layer convolution operations, the content-encoded feature image and the style-encoded feature image of the first generated image;

inputting the second generated image into the global encoder network, where the content encoder and the style encoder of the global encoder network decouple, through multi-layer convolution operations, the content-encoded feature image and the style-encoded feature image of the second generated image;

inputting the style-encoded feature image of the first generated image and the content-encoded feature image of the second generated image into the multi-layer perceptron module and the residual convolution module of the global decoder network, respectively, finally obtaining the reconstructed first-style image;

inputting the style-encoded feature image of the second generated image and the content-encoded feature image of the first generated image into the multi-layer perceptron module and the residual convolution module of the global decoder network, respectively, finally obtaining the reconstructed second-style image.
Wherein, inputting the instance images cropped from the first-style and second-style images into the local style transfer model, and obtaining reconstructed instance images through two rounds of decoupling content features and style features and decoding them with the local encoder and decoder networks, specifically comprises:

inputting the first-style instance image into the local encoder network, where the content encoder and the style encoder of the local encoder network decouple, through multi-layer convolution operations, the content-encoded feature image and the style-encoded feature image of the first-style instance image;

inputting the second-style instance image into the local encoder network, where the content encoder and the style encoder of the local encoder network decouple, through multi-layer convolution operations, the content-encoded feature image and the style-encoded feature image of the second-style instance image;

inputting the style-encoded feature image of the first-style instance image and the content-encoded feature image of the second-style instance image into the multi-layer perceptron module and the residual convolution module of the local decoder network, respectively, to obtain a first generated instance image that fuses the style of the first-style instance image with the content of the second-style instance image;

inputting the style-encoded feature image of the second-style instance image and the content-encoded feature image of the first-style instance image into the multi-layer perceptron module and the residual convolution module of the local decoder network, respectively, to obtain a second generated instance image that fuses the style of the second-style instance image with the content of the first-style instance image;

inputting the first generated instance image into the local encoder network, where the content encoder and the style encoder of the local encoder network decouple, through multi-layer convolution operations, the content-encoded feature image and the style-encoded feature image of the first generated instance image;

inputting the second generated instance image into the local encoder network, where the content encoder and the style encoder of the local encoder network decouple, through multi-layer convolution operations, the content-encoded feature image and the style-encoded feature image of the second generated instance image;

inputting the style-encoded feature image of the first generated instance image and the content-encoded feature image of the second generated instance image into the multi-layer perceptron module and the residual convolution module of the local decoder network, respectively, finally obtaining the reconstructed first-style instance image;

inputting the style-encoded feature image of the second generated instance image and the content-encoded feature image of the first generated instance image into the multi-layer perceptron module and the residual convolution module of the local decoder network, respectively, finally obtaining the reconstructed second-style instance image;

wherein a first/second-style instance image is an instance image cropped from the first/second-style image.
The present invention further provides a device for image style conversion, comprising an image style conversion model trained by the method described above, configured to convert an input first-style image to be converted, according to an input second-style image serving as a reference image, into a second-style target image.
The present invention further provides a training method for an image style conversion model, comprising:

constructing an image generation model, wherein the image generation model comprises a global style transfer model and a local style transfer model; the global style transfer model comprises a global encoder network and a global decoder network, and the local style transfer model comprises a local encoder network and a local decoder network;

performing multiple training iterations on the image generation model, and after the number of training iterations reaches a first preset number, taking the global style transfer model as the trained image style conversion model;

wherein one training iteration comprises:

inputting the first-style and second-style images of the image training samples into the global style transfer model, and obtaining reconstructed first-style and second-style images through two rounds of decoupling content features and style features and decoding them with the global encoder and decoder networks;

inputting the instance images cropped from the first-style and second-style images into the local style transfer model, and obtaining reconstructed instance images through two rounds of decoupling content features and style features and decoding them with the local encoder and decoder networks;

inputting the style-encoded feature image of the first/second-style image and the content-encoded feature image of the second/first-style instance image into the global decoder, to obtain a generated first/second-style instance content image;

inputting the generated first/second-style instance content image into the content encoder and the style encoder of the local encoder network, which decouple the style-encoded and content-encoded feature images of the instance content image through multi-layer convolution operations;

inputting the decoupled style-encoded feature image of the second-style instance content image and the content-encoded feature image of the second-style image into the global decoder network, to obtain a reconstructed cross-granularity second-style image;

inputting the decoupled style-encoded feature image of the first-style image and the content-encoded feature image of the first-style instance content image into the global decoder network, to obtain a reconstructed cross-granularity first-style instance image;

adjusting the parameters of the global encoder and decoder networks according to the distance between the reconstructed first/second-style image and the corresponding first/second-style image in the image training samples;

adjusting the parameters of the local encoder and decoder networks according to the distance between the reconstructed instance image and the instance image cropped from the corresponding first/second-style image in the image training samples;

jointly adjusting the local encoder and decoder networks and the global encoder and decoder networks according to the distance between the reconstructed cross-granularity second-style image and the corresponding second-style image in the image training samples, and the distance between the reconstructed cross-granularity first-style instance image and the corresponding first-style instance image.
The present invention further provides a training device for an image style conversion model, comprising:

an image generation model construction module, configured to construct an image generation model, wherein the image generation model comprises a global style transfer model and a local style transfer model; the global style transfer model comprises a global encoder network and a global decoder network, and the local style transfer model comprises a local encoder network and a local decoder network;

an image generation model training module, configured to perform multiple training iterations on the image generation model and, after the number of training iterations reaches a first preset number, to take the global style transfer model as the trained image style conversion model; wherein one training iteration comprises: inputting the first-style and second-style images of the image training samples into the global style transfer model, and obtaining reconstructed first-style and second-style images through two rounds of decoupling content features and style features and decoding them with the global encoder and decoder networks; inputting the instance images cropped from the first-style and second-style images into the local style transfer model, and obtaining reconstructed instance images through two rounds of decoupling content features and style features and decoding them with the local encoder and decoder networks; inputting the style-encoded feature image of the first/second-style image and the content-encoded feature image of the second/first-style instance image into the global decoder, to obtain a generated first/second-style instance content image; inputting the generated first/second-style instance content image into the content encoder and the style encoder of the local encoder network, which decouple the style-encoded and content-encoded feature images of the instance content image through multi-layer convolution operations; inputting the decoupled style-encoded feature image of the second-style instance content image and the content-encoded feature image of the second-style image into the global decoder network, to obtain a reconstructed cross-granularity second-style image; inputting the decoupled style-encoded feature image of the first-style image and the content-encoded feature image of the first-style instance content image into the global decoder network, to obtain a reconstructed cross-granularity first-style instance image; adjusting the parameters of the global encoder and decoder networks according to the distance between the reconstructed first/second-style image and the corresponding first/second-style image in the image training samples; adjusting the parameters of the local encoder and decoder networks according to the distance between the reconstructed instance image and the instance image cropped from the corresponding first/second-style image in the image training samples; and jointly adjusting the local encoder and decoder networks and the global encoder and decoder networks according to the distance between the reconstructed cross-granularity second-style image and the corresponding second-style image in the image training samples, and the distance between the reconstructed cross-granularity first-style instance image and the corresponding first-style instance image.
In the technical solution of the present invention, a first-style image to be converted and a second-style image serving as a reference image are input into the content encoder and the style encoder of an encoder network, respectively, and a content-encoded feature image and a style-encoded feature image are extracted; the style-encoded and content-encoded feature images are input into the multi-layer perceptron module and the residual convolution module of a decoder network, respectively, and a perceptron operation and a residual convolution operation are performed to obtain, respectively, the parameters of the adaptive instance normalization module and an intermediate-process feature image; the obtained parameters are shared with the adaptive instance normalization module of the decoder network; the intermediate-process feature image is input into the adaptive instance normalization module for instance normalization, and the instance-normalized feature image is input into the upsampling layer of the decoder network to obtain a target image converted from the first style to the second style; wherein the image style conversion model composed of the encoder and decoder networks is pre-trained with image training samples comprising a plurality of first-style and second-style images, together with instance images cropped from the image training samples. Compared with the prior art, the technical solution of the present invention uses coarse-grained first-style and second-style images as well as fine-grained instance images when training the image style conversion model, thereby introducing cross-granularity learning into the style transfer model; this strengthens the style transfer quality of fine-grained instances while preserving the style transfer quality of the coarse-grained global image, and alleviates the blurring and distortion of local instance images after style transfer.
Brief Description of the Drawings

In order to explain the embodiments of the present invention or the technical solutions in the prior art more clearly, the accompanying drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative effort.
FIG. 1 is a flowchart of a method for image style conversion provided by an embodiment of the present invention;

FIG. 2 is a schematic diagram of the internal structure of an image style conversion model provided by an embodiment of the present invention;

FIG. 3 is a flowchart of a training method for an image style conversion model provided by an embodiment of the present invention;

FIG. 4 is a schematic diagram of the internal structure of an image generation model provided by an embodiment of the present invention;

FIG. 5 is a flowchart of one training iteration of an image style conversion model provided by an embodiment of the present invention;

FIGS. 6a-6h are schematic diagrams of reconstructing first-style and second-style images based on the global encoder and decoder networks according to an embodiment of the present invention;

FIGS. 7a-7h are schematic diagrams of reconstructing first-style and second-style instance images based on the local encoder and decoder networks according to an embodiment of the present invention;

FIGS. 8a and 8b are schematic diagrams of generating first-style and second-style instance content images based on the global decoder network according to an embodiment of the present invention;

FIG. 8c is a schematic diagram of decoupling the style and content features of an instance content image based on the local encoder network according to an embodiment of the present invention;

FIG. 8d is a schematic diagram of a reconstructed cross-granularity second-style image obtained based on the global decoder network according to an embodiment of the present invention;

FIG. 8e is a schematic diagram of a reconstructed cross-granularity first-style instance image obtained based on the global decoder network according to an embodiment of the present invention;

FIG. 8f is a schematic diagram of the internal structure of an adversarial training model provided by an embodiment of the present invention;

FIG. 9 is a block diagram of the internal structure of a training device for an image style conversion model according to an embodiment of the present invention.
Detailed Description

In order to make the objectives, technical solutions, and advantages of the present invention clearer, the present invention is further described in detail below with reference to specific embodiments and the accompanying drawings.

It should be noted that, unless otherwise defined, the technical or scientific terms used in the embodiments of the present invention shall have the ordinary meanings understood by persons with ordinary skill in the art to which this disclosure belongs. "First", "second", and similar words used in this disclosure do not denote any order, quantity, or importance, but are merely used to distinguish different components. "Comprises", "comprising", and similar words mean that the element or item preceding the word encompasses the elements or items listed after the word and their equivalents, without excluding other elements or items. Words such as "connected" are not limited to physical or mechanical connections, but may include electrical connections, whether direct or indirect. "Up", "down", "left", "right", and the like are only used to indicate relative positional relationships; when the absolute position of the described object changes, the relative positional relationship may change accordingly.

The technical solutions of the embodiments of the present invention are described in detail below with reference to the accompanying drawings.
An embodiment of the present invention provides a method for image style conversion; the specific flow, as shown in FIG. 1, comprises the following steps:

Step S101: Input the first-style image to be converted and the second-style image serving as a reference image into the content encoder and the style encoder of the image style conversion model, respectively, to extract a content-encoded feature image and a style-encoded feature image.

Specifically, the internal structure of the image style conversion model of the present invention, as shown in FIG. 2, may include an encoder network 201 and a decoder network 202; the encoder network 201 may include a content encoder 211 and a style encoder 212; the decoder network 202 may include a multi-layer perceptron module 221, a residual convolution module 222, an adaptive instance normalization module 223, and an upsampling layer 224.
The first-style image and the second-style image are images of different scenes; for example, the first-style image may be a nighttime image captured by a surveillance camera, and the second-style image may be an image without night-time darkness;

alternatively, the first-style image may be a grayscale image captured by an infrared camera, and the second-style image may be a color image.
In this step, the first-style image to be converted and the second-style reference image are input into the content encoder 211 and the style encoder 212 of the encoder network 201, respectively;

the content encoder 211 performs a first preset convolution operation on the input first-style image to extract its high-level feature information, and the style encoder 212 performs a second preset convolution operation on the input second-style image to extract its high-level feature information;

the content encoder 211 and the style encoder 212 then output the extracted content-encoded feature image and style-encoded feature image, respectively.
In a specific embodiment, the content encoder 211 and the style encoder 212 may be lightweight feature-extraction convolutional neural networks, such as UNet (a U-shaped convolutional neural network). It can be understood that a feature-extraction convolutional neural network continuously expands its receptive field through successive local convolution operations and thereby extracts high-level feature information of the input image.

The content encoder 211 comprises multiple convolution layers; each convolution layer continues the convolution operation on the encoded features output by the previous layer, and the encoded features output by the last convolution layer form the content-encoded feature image of the input image. The kernel size and stride of each convolution layer in the content encoder 211 can be set for the specific scenario; for example, the kernel size may be set to (7×7) with a stride of (1×1).

The style encoder 212 comprises multiple convolution layers, a global pooling layer, and a fully connected layer; each convolution layer continues the convolution operation on the encoded features output by the previous layer, the encoded features output by the last convolution layer pass through the global pooling layer, and the fully connected layer then maps the globally pooled encoded features into the style feature space to obtain the style-encoded feature image. A sketch of both encoders follows.
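Expanding on the pipeline sketch earlier, the following shows the two encoder structures as just described: a content encoder built from stacked convolution layers (with the example 7×7 kernel / 1×1 stride for the first layer) and a style encoder ending in global pooling plus a fully connected layer. Layer counts, channel widths, and the style-code dimension remain assumptions.

    import torch
    import torch.nn as nn

    # Content encoder: stacked convolution layers; the last layer's output is
    # the content-encoded feature image.
    content_encoder = nn.Sequential(
        nn.Conv2d(3, 64, kernel_size=7, stride=1, padding=3), nn.ReLU(True),
        nn.Conv2d(64, 128, kernel_size=4, stride=2, padding=1), nn.ReLU(True),
        nn.Conv2d(128, 256, kernel_size=4, stride=2, padding=1), nn.ReLU(True))

    class StyleEncoder(nn.Module):
        # Convolution layers, then global pooling, then a fully connected
        # layer mapping into the style feature space.
        def __init__(self, style_dim=8):
            super().__init__()
            self.conv = nn.Sequential(
                nn.Conv2d(3, 64, 7, 1, 3), nn.ReLU(True),
                nn.Conv2d(64, 128, 4, 2, 1), nn.ReLU(True))
            self.pool = nn.AdaptiveAvgPool2d(1)   # global pooling layer
            self.fc = nn.Linear(128, style_dim)   # fully connected layer
        def forward(self, x):
            return self.fc(self.pool(self.conv(x)).flatten(1))

    x = torch.rand(1, 3, 256, 256)
    c = content_encoder(x)   # (1, 256, 64, 64) content-encoded feature image
    s = StyleEncoder()(x)    # (1, 8) style-encoded feature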
Step S102: Input the style-encoded and content-encoded feature images into the multi-layer perceptron module and the residual convolution module of the decoder network, respectively, to obtain the parameters of the adaptive instance normalization module and the intermediate-process feature image, respectively.

In this step, the style-encoded feature image and the content-encoded feature image are input into the multi-layer perceptron module 221 and the residual convolution module 222 of the decoder network 202, respectively, where a perceptron operation and a residual convolution operation are performed to obtain the parameters of the adaptive instance normalization module and the intermediate-process feature image, respectively;

specifically, the multi-layer perceptron module 221 performs a first preset perceptron operation on the input style-encoded feature image to obtain the parameters of the adaptive instance normalization module;

the residual convolution module 222 comprises multiple residual convolution layers and performs a first preset residual convolution operation on the input content-encoded feature image to obtain the intermediate-process feature image, as sketched below.
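The following sketch shows the two decoder entry modules of step S102: the multi-layer perceptron that maps the style code to AdaIN parameters, and a residual convolution module built from residual blocks that produces the intermediate-process feature image. Depths and widths are assumptions.

    import torch
    import torch.nn as nn

    class ResBlock(nn.Module):
        # One residual convolution layer: conv-ReLU-conv plus skip connection.
        def __init__(self, ch):
            super().__init__()
            self.body = nn.Sequential(
                nn.Conv2d(ch, ch, 3, 1, 1), nn.ReLU(True),
                nn.Conv2d(ch, ch, 3, 1, 1))
        def forward(self, x):
            return x + self.body(x)

    style_dim, ch = 8, 256
    mlp = nn.Sequential(                     # multi-layer perceptron module 221
        nn.Linear(style_dim, 256), nn.ReLU(True),
        nn.Linear(256, 2 * ch))              # one (gamma, beta) pair per channel
    res_blocks = nn.Sequential(ResBlock(ch), ResBlock(ch))  # residual conv module 222

    style_code = torch.rand(1, style_dim)
    content_feat = torch.rand(1, ch, 64, 64)
    adain_params = mlp(style_code)           # shared with the AdaIN module (step S103)
    gamma, beta = adain_params.chunk(2, dim=1)
    intermediate = res_blocks(content_feat)  # intermediate-process feature image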
Step S103: Share the obtained parameters with the adaptive instance normalization module of the decoder network.

In this step, the adaptive-instance-normalization parameters obtained by the multi-layer perceptron module 221 are shared with the adaptive instance normalization module 223 of the decoder network 202.

Step S104: Input the intermediate-process feature image into the adaptive instance normalization module for instance normalization.

In this step, the intermediate-process feature image obtained by the residual convolution module 222 is input into the adaptive instance normalization module 223 for instance normalization.

Step S105: Input the instance-normalized feature image into the upsampling layer of the decoder network to obtain the target image converted from the first style to the second style.

In this step, the feature image that has been instance-normalized by the adaptive instance normalization module 223 is input into the upsampling layer 224 of the decoder network 202, finally producing a target image of the same size as the input first-style image, converted from the first style to the second style, i.e., the generated style-transferred image.
Thus, the encoder network 201 is connected to the decoder network 202 in a decoupled manner, fusing the feature spaces so that style transfer is performed with high quality while the content is preserved.

The above image style conversion model is pre-trained with image training samples comprising a plurality of first-style and second-style images, together with instance images cropped from the image training samples. Because training uses coarse-grained first-style and second-style images as well as fine-grained instance images, cross-granularity learning is introduced into the style transfer model, strengthening the style transfer quality of fine-grained instances while ensuring the style transfer quality of the coarse-grained global image; the instances mainly include vehicles, pedestrians, and traffic signs. This improves the image style conversion effect and alleviates the blurring and distortion of local instance images after style conversion.

On the other hand, the image style conversion model is an unsupervised learning model and does not require paired reference training data of the same scene; the model therefore generalizes well, data acquisition is greatly simplified, and adaptability to style transfer in different surveillance scenarios is improved.
An embodiment of the present invention provides a training method for an image style conversion model; the flow, as shown in FIG. 3, comprises the following steps:

Step S300: Obtain image training samples.

In this step, a plurality of first-style and second-style images of arbitrary scenes may be obtained from real surveillance data as image training samples. To ensure the style-conversion quality of the trained model, a large number of images from different surveillance scenarios can be selected as training samples; the first-style and second-style images in the obtained training samples may be unpaired. In addition, to learn a fine-grained, high-quality style-conversion effect, the object detection boxes annotated in the data are used to extract the fine-grained instances in each image: instance images are cropped from the image training samples, yielding unpaired instance training samples for the cross-granularity learning of the subsequent steps, as sketched below.
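A minimal sketch of the instance-cropping step, assuming (x1, y1, x2, y2) detection boxes and PIL-based image loading; the file name and box values in the usage comment are hypothetical.

    from PIL import Image

    def crop_instances(image_path, boxes):
        # Crop fine-grained instances (vehicles, pedestrians, traffic signs)
        # out of one coarse-grained training image.
        img = Image.open(image_path).convert("RGB")
        return [img.crop(box) for box in boxes]  # one instance image per box

    # Hypothetical usage; boxes come from the dataset's annotated detection boxes.
    # instances = crop_instances("night_0001.jpg", [(120, 80, 260, 300)])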
Step S301: Construct an image generation model.

As shown in FIG. 4, the constructed image generation model comprises a global style transfer model 401 and a local style transfer model 402; the global style transfer model 401 comprises a global encoder network 411 and a global decoder network 412, and the local style transfer model 402 comprises a local encoder network 421 and a local decoder network 422.

The global encoder network 411 and the local encoder network 421 have the same structure, and the global decoder network 412 and the local decoder network 422 have the same structure. The structure of the global encoder network 411 may be the same as that of the encoder network 201 in the above image style conversion model, and the structure of the global decoder network 412 may be the same as that of the decoder network 202 in the above image style conversion model.

Step S302: Perform multiple training iterations on the image generation model, and obtain the trained image style conversion model after the number of training iterations reaches the first preset number.

Specifically, the image generation model is trained over multiple iterations; after the number of training iterations reaches the first preset number, the trained image style conversion model is constructed from the global encoder network 411 and the global decoder network 412, that is, the global encoder network 411 and the global decoder network 412 serve as the encoder network 201 and the decoder network 202 of the trained image style conversion model.
One training iteration of this step, whose specific flow is shown in FIG. 5, comprises the following sub-steps:

Sub-step S501: Input the first-style and second-style images of the image training samples into the global style transfer model to obtain reconstructed first-style and second-style images.

In this sub-step, the first-style and second-style images of the image training samples are input into the global style transfer model, and the reconstructed first-style and second-style images are obtained through two rounds of decoupling content features and style features and decoding them with the global encoder and decoder networks.

Specifically, as shown in FIG. 6a, the first-style image of the image training samples is input into the global encoder network 411, where the content encoder and the style encoder of the global encoder network 411 decouple, through multi-layer convolution operations, the content-encoded feature image and the style-encoded feature image of the first-style image.

As shown in FIG. 6b, the second-style image of the image training samples is input into the global encoder network 411, where the content encoder and the style encoder of the global encoder network 411 decouple, through multi-layer convolution operations, the content-encoded feature image and the style-encoded feature image of the second-style image.

As shown in FIG. 6c, the style-encoded feature image of the first-style image and the content-encoded feature image of the second-style image are input into the multi-layer perceptron module and the residual convolution module of the global decoder network 412, respectively. The multi-layer perceptron module of the global decoder network 412 operates on the style-encoded feature image of the input first-style image, and its output parameters are shared as the parameters of the adaptive instance normalization module of the global decoder network 412; the residual convolution module of the global decoder network 412 performs a residual convolution operation on the content-encoded feature image of the input second-style image and outputs the intermediate-process feature image into the adaptive instance normalization module of the global decoder network 412, thereby producing an instance-normalized feature image that fuses the content features of the second-style image with the style features of the first-style image. The instance-normalized feature image is input into the upsampling layer of the global decoder network 412, yielding a first generated image that fuses the style of the first-style image with the content of the second-style image.

Similarly, as shown in FIG. 6d, the style-encoded feature image of the second-style image and the content-encoded feature image of the first-style image are input into the multi-layer perceptron module and the residual convolution module of the global decoder network 412, respectively, yielding a second generated image that fuses the style of the second-style image with the content of the first-style image.

As shown in FIG. 6e, the first generated image is input into the global encoder network 411, where the content encoder and the style encoder of the global encoder network 411 decouple, through multi-layer convolution operations, the content-encoded feature image and the style-encoded feature image of the first generated image.

As shown in FIG. 6f, the second generated image is input into the global encoder network 411, where the content encoder and the style encoder of the global encoder network 411 decouple, through multi-layer convolution operations, the content-encoded feature image and the style-encoded feature image of the second generated image.

As shown in FIG. 6g, the style-encoded feature image of the first generated image and the content-encoded feature image of the second generated image are input into the multi-layer perceptron module and the residual convolution module of the global decoder network 412, respectively, finally yielding the reconstructed first-style image.

As shown in FIG. 6h, the style-encoded feature image of the second generated image and the content-encoded feature image of the first generated image are input into the multi-layer perceptron module and the residual convolution module of the global decoder network 412, respectively, finally yielding the reconstructed second-style image. The whole 6a-6h round trip is condensed in the sketch below.
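The 6a-6h flow condenses into a single round trip: encode, swap codes to generate, re-encode, swap back to reconstruct. The sketch below reuses the enc_c / enc_s / dec stand-ins from the earlier pipeline sketch; the function name is illustrative.

    def global_round_trip(enc_c, enc_s, dec, x_a, x_b):
        # enc_c/enc_s/dec: the global content encoder, style encoder, decoder.
        c_a, s_a = enc_c(x_a), enc_s(x_a)      # Fig. 6a: decouple the first-style image
        c_b, s_b = enc_c(x_b), enc_s(x_b)      # Fig. 6b: decouple the second-style image
        gen_1 = dec(c_b, s_a)                  # Fig. 6c: style of x_a, content of x_b
        gen_2 = dec(c_a, s_b)                  # Fig. 6d: style of x_b, content of x_a
        c_1, s_1 = enc_c(gen_1), enc_s(gen_1)  # Fig. 6e: decouple first generated image
        c_2, s_2 = enc_c(gen_2), enc_s(gen_2)  # Fig. 6f: decouple second generated image
        x_a_rec = dec(c_2, s_1)                # Fig. 6g: reconstructed first-style image
        x_b_rec = dec(c_1, s_2)                # Fig. 6h: reconstructed second-style image
        return x_a_rec, x_b_rec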
子步骤S502:将从第一、二风格的图像中裁剪出的实例图像输入到所述 局部风格迁移模型,得到重建的第一、二风格的实例图像;Sub-step S502: input the instance image cut out from the images of the first and second styles into the local style transfer model, obtain the instance images of the first and second styles of reconstruction;
本子步骤中,将从第一、二风格的图像中裁剪出的实例图像输入到所述局部风格迁移模型,经过所述局部的编码器网络、局部的解码器网络的两次解耦内容特征和风格特征、解码内容特征和风格特征得到重建的实例图像;为便于描述,本文中将从图像训练样本中的第一风格的图像中裁剪出的实例图像称为第一风格的实例图像,将将从图像训练样本中的第二风格的图像中裁剪出的实例图像称为第二风格的实例图像;In this sub-step, the instance images cropped from the first and second style images are input into the local style transfer model, and the two decoupled content features and Style features, decoded content features, and style features are reconstructed instance images; for ease of description, the instance images cropped from the images of the first style in the image training samples are referred to as instance images of the first style. The instance image cropped from the image of the second style in the image training sample is called the instance image of the second style;
具体地,如图7a所示,将第一风格的实例图像输入到局部的编码器网络421,通过局部的编码器网络421中的内容编码器和风格编码器进行多层卷积操作解耦出第一风格的实例图像的内容编码特征图像和风格编码特征图像;Specifically, as shown in Fig. 7a, the instance image of the first style is input into the local encoder network 421, and the content encoder and the style encoder in the local encoder network 421 perform multi-layer convolution operations to decouple the output. a content-encoding feature image and a style-encoding feature image of the instance image of the first style;
如图7b所示,将第二风格的实例图像输入到局部的编码器网络421,通过局部的编码器网络421中的内容编码器和风格编码器进行多层卷积操作解耦出第二风格的实例图像的内容编码特征图像和风格编码特征图像;As shown in Fig. 7b, the instance image of the second style is input into the local encoder network 421, and the second style is decoupled through the multi-layer convolution operation performed by the content encoder and the style encoder in the local encoder network 421 The content-encoding feature image and the style-encoding feature image of the instance image;
如图7c所示,将第一风格的实例图像的风格编码特征图像和第二风格的实例图像的内容编码特征图像,分别输入到局部的解码器网络422中的多层感知机模块和残差卷积模块;局部的解码器网络422中的多层感知机模块对输入的第一风格图像的风格编码特征图像进行操作,输出的参数共享为局部的解码器网络422的自适应实例归一化模块的参数;局部的解码器网络422中的残差卷积模块对输入第二风格的实例图像的内容编码特征图像进行残差卷积操作,输出中间过程特征图像到局部的解码器网络422中的自适应实例归一化模块,从而得到融合第二风格的实例图像的内容特征和第一风格的实例图像的风格特征的实例归一化的特征图像,将得到的实例归一化的特征图像输入到局部的解码器网络422的上采样层中,得到融合了第一风格的实例图像的风格、第二风格的实例图像的内容的第一生成实例图像。As shown in Figure 7c, the style-encoded feature images of the first style instance images and the content-encoded feature images of the second style instance images are input to the multi-layer perceptron module and the residual in the local decoder network 422, respectively. Convolution module; the multi-layer perceptron module in the local decoder network 422 operates on the style encoding feature image of the input first style image, and the output parameters are shared as the adaptive instance normalization of the local decoder network 422 The parameters of the module; the residual convolution module in the local decoder network 422 performs a residual convolution operation on the content-encoded feature image of the input second style instance image, and outputs the intermediate process feature image to the local decoder network 422. The adaptive instance normalization module of , so as to obtain an instance-normalized feature image that fuses the content features of the second-style instance images and the style features of the first-style instance images, and the obtained instance-normalized feature images Input to the upsampling layer of the local decoder network 422, resulting in a first generated instance image that combines the style of the instance image of the first style and the content of the instance image of the second style.
同理,如图7d所示,将第二风格的实例图像的风格编码特征图像和第一风格的实例图像的内容编码特征图像,分别输入到局部的解码器网络422中的多层感知机模块和残差卷积模块,得到融合了第二风格的实例图像的风格、第一风格的实例图像的内容的第二生成实例图像。Similarly, as shown in FIG. 7d, the style encoding feature image of the instance image of the second style and the content encoding feature image of the instance image of the first style are respectively input into the multi-layer perceptron module in the local decoder network 422. and the residual convolution module to obtain a second generated instance image that combines the style of the instance image of the second style and the content of the instance image of the first style.
As shown in Fig. 7e, the first generated instance image is input into the local encoder network 421, whose content encoder and style encoder perform multi-layer convolution operations to decouple the content-encoding feature image and the style-encoding feature image of the first generated instance image;
As shown in Fig. 7f, the second generated instance image is input into the local encoder network 421, whose content encoder and style encoder perform multi-layer convolution operations to decouple the content-encoding feature image and the style-encoding feature image of the second generated instance image;
As shown in Fig. 7g, the style-encoding feature image of the first generated instance image and the content-encoding feature image of the second generated instance image are input to the multi-layer perceptron module and the residual convolution module of the local decoder network 422, respectively, finally yielding the reconstructed first-style instance image.
As shown in Fig. 7h, the style-encoding feature image of the second generated instance image and the content-encoding feature image of the first generated instance image are input to the multi-layer perceptron module and the residual convolution module of the local decoder network 422, respectively, finally yielding the reconstructed second-style instance image.
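Taken together, Figs. 7a-7h describe one swap-and-reconstruct cycle. A compact sketch of that cycle, reusing the assumed encoder and decoder interfaces above (the function and argument names are illustrative, not part of the disclosure), is:

    # A hedged sketch of the Figs. 7a-7h cycle. enc_content / enc_style stand
    # for the content and style encoders of the local encoder network 421.
    def cross_reconstruct(x1, x2, enc_content, enc_style, decoder):
        # Decouple content and style of both instance images (Figs. 7a, 7b).
        c1, s1 = enc_content(x1), enc_style(x1)
        c2, s2 = enc_content(x2), enc_style(x2)
        # Swap: style of one image with content of the other (Figs. 7c, 7d).
        gen1 = decoder(c2, s1)   # first-style appearance, second-style content
        gen2 = decoder(c1, s2)   # second-style appearance, first-style content
        # Decouple the generated images again (Figs. 7e, 7f).
        c_g1, s_g1 = enc_content(gen1), enc_style(gen1)
        c_g2, s_g2 = enc_content(gen2), enc_style(gen2)
        # Swap back to reconstruct the original-style images (Figs. 7g, 7h).
        rec1 = decoder(c_g2, s_g1)  # reconstructed first-style instance image
        rec2 = decoder(c_g1, s_g2)  # reconstructed second-style instance image
        return rec1, rec2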
Sub-step S503: the style-encoding feature image of the first/second-style image and the content-encoding feature image of the second/first-style instance image are input into the global decoder network, yielding a generated first/second-style instance content image;
Specifically, as shown in Fig. 8a, the style-encoding feature image of the second-style image and the content-encoding feature image of the first-style instance image are input to the multi-layer perceptron module and the residual convolution module of the global decoder network 412, respectively, yielding an image that fuses the style of the second-style image with the content of the first-style instance image, i.e., the generated second-style instance content image;
As shown in Fig. 8b, the style-encoding feature image of the first-style image and the content-encoding feature image of the second-style instance image are input to the multi-layer perceptron module and the residual convolution module of the global decoder network 412, respectively, yielding an image that fuses the style of the first-style image with the content of the second-style instance image, i.e., the generated first-style instance content image.
Sub-step S504: the generated second-style instance content image is input to the content encoder and the style encoder of the local encoder network, which perform multi-layer convolution operations to decouple the style- and content-encoding feature images of the instance content image;
Specifically, as shown in Fig. 8c, the generated second-style instance content image is input to the content encoder and the style encoder of the local encoder network 421, which perform multi-layer convolution operations to decouple the style-encoding feature image and the content-encoding feature image of the second-style instance content image.
Sub-step S505: the decoupled style-encoding feature image of the second-style instance content image and the content-encoding feature image of the second-style image are input into the global decoder network, yielding the reconstructed cross-granularity second-style image;
Specifically, as shown in Fig. 8d, the decoupled style-encoding feature image of the second-style instance content image and the content-encoding feature image of the second-style image are input to the multi-layer perceptron module and the residual convolution module of the global decoder network 412, respectively, yielding the reconstructed cross-granularity second-style image.
Sub-step S506: the decoupled style-encoding feature image of the first-style instance image and the content-encoding feature image of the second-style instance content image are input into the local decoder network, yielding the reconstructed cross-granularity first-style instance image;
Specifically, as shown in Fig. 8e, the decoupled style-encoding feature image of the first-style instance image and the content-encoding feature image of the second-style instance content image are input to the multi-layer perceptron module and the residual convolution module of the local decoder network 422, respectively, yielding the reconstructed cross-granularity first-style instance image.
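Sub-steps S503-S506 together form the cross-granularity path. Under the same assumed interfaces as the sketches above (all names illustrative), the flow could be sketched as:

    # A hedged sketch of sub-steps S503-S506. g_* are the global encoders
    # and decoder; l_* are the local ones.
    def cross_granularity(img2, inst1,
                          g_enc_c, g_enc_s, g_dec, l_enc_c, l_enc_s, l_dec):
        # S503: style code of the second-style image + content code of the
        # first-style instance image -> generated second-style instance
        # content image (global decoder, Fig. 8a).
        inst_content2 = g_dec(l_enc_c(inst1), g_enc_s(img2))
        # S504: decouple the generated instance content image locally (Fig. 8c).
        c_ic2, s_ic2 = l_enc_c(inst_content2), l_enc_s(inst_content2)
        # S505: its style code + the second-style image's content code ->
        # reconstructed cross-granularity second-style image (Fig. 8d).
        rec_cross_img2 = g_dec(g_enc_c(img2), s_ic2)
        # S506: first-style instance style code + the instance content
        # image's content code -> reconstructed cross-granularity first-style
        # instance image (local decoder, Fig. 8e).
        rec_cross_inst1 = l_dec(c_ic2, l_enc_s(inst1))
        return rec_cross_img2, rec_cross_inst1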
Sub-step S507: the parameters of the global encoder and decoder networks are adjusted according to the distance between the reconstructed first/second-style image and the corresponding first/second-style image in the image training samples;
Specifically, the parameters of the global encoder and decoder networks are adjusted according to the distance between the reconstructed first-style image and the corresponding first-style image in the image training samples;
and the parameters of the global encoder and decoder networks are adjusted according to the distance between the reconstructed second-style image and the corresponding second-style image in the image training samples.
Sub-step S508: the parameters of the local encoder and decoder networks are adjusted according to the distance between the reconstructed instance image and the instance image cropped from the corresponding first/second-style image in the image training samples;
Specifically, the parameters of the local encoder and decoder networks are adjusted according to the distance between the reconstructed first-style instance image and the instance image cropped from the corresponding first-style image in the image training samples;
and the parameters of the local encoder and decoder networks are adjusted according to the distance between the reconstructed second-style instance image and the instance image cropped from the corresponding second-style image in the image training samples.
Sub-step S509: the local encoder and decoder networks and the global encoder and decoder networks are jointly adjusted according to the distance between the reconstructed cross-granularity second-style image and the corresponding second-style image in the image training samples, and the distance between the reconstructed cross-granularity first-style instance image and the corresponding first-style instance image.
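Sub-steps S507-S509 each adjust parameters according to a distance between a reconstruction and its target. As a hedged illustration, taking that distance to be an L1 pixel distance (the disclosure leaves the concrete metric open) and assuming the batch field names below, one optimization step might look like:

    # A hedged sketch of the S507-S509 objectives; for brevity the three
    # terms are summed into one backward pass, whereas the disclosure
    # describes the adjustments term by term.
    import torch.nn.functional as F

    def training_step(batch, opt_global, opt_local):
        loss_global = (F.l1_loss(batch["rec_style1"], batch["img_style1"]) +
                       F.l1_loss(batch["rec_style2"], batch["img_style2"]))      # S507
        loss_local = (F.l1_loss(batch["rec_inst1"], batch["inst_style1"]) +
                      F.l1_loss(batch["rec_inst2"], batch["inst_style2"]))       # S508
        loss_cross = (F.l1_loss(batch["rec_cross_style2"], batch["img_style2"]) +
                      F.l1_loss(batch["rec_cross_inst1"], batch["inst_style1"])) # S509
        loss = loss_global + loss_local + loss_cross
        opt_global.zero_grad(); opt_local.zero_grad()
        loss.backward()
        opt_global.step(); opt_local.step()
        return loss.item()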
When the number of iterations reaches the first preset number, the parameters of the encoder and decoder networks have been adjusted the first preset number of times, and the image generation model already has good feature extraction and feature recovery capabilities; adjustment of the parameters of the initial image generation model can therefore stop, yielding the final image generation model. The first preset number may be 10,000, 20,000, 50,000, or the like, and is not specifically limited.
As a further preferred implementation, training may continue in the subsequent step S303 by adversarial learning based on discriminators:
Step S303: performing multiple iterations of adversarial-learning training on the image generation model based on discriminators; when the number of adversarial-learning iterations reaches a second preset number, the global style transfer model is taken as the finally trained image style conversion model.
The adversarial training model composed of the image generation model and the discriminators, shown in Fig. 8f, includes the image generation model, a first discriminator, and a second discriminator; the input of the first discriminator is connected to the output of the global decoder network of the image generation model, and the input of the second discriminator is connected to the output of the local decoder network of the image generation model.
One iteration of adversarial-learning training with the adversarial training model may include the following:
The first- and second-style images in the image training samples are input into the global style transfer model of the adversarial training model; the global decoder network of the adversarial training model outputs the reconstructed first- and second-style images to the first discriminator; the first discriminator judges whether its input image is real or fake; according to the discrimination result of the first discriminator, the parameters of the first discriminator are adjusted to strengthen its discrimination ability; and the parameters of the global encoder and decoder networks are adjusted according to the distance between the reconstructed first/second-style image and the corresponding first/second-style image in the image training samples;
The first- and second-style instance images are input into the local style transfer model of the adversarial training model; the local decoder network of the adversarial training model outputs the reconstructed first- and second-style instance images to the second discriminator; the second discriminator judges whether its input image is real or fake; according to the discrimination result of the second discriminator, the parameters of the second discriminator are adjusted to strengthen its discrimination ability; and the parameters of the local encoder and decoder networks are adjusted according to the distance between the reconstructed first- and second-style instance images and the instance images cropped from the corresponding first/second-style images in the image training samples;
In addition, the cross-granularity second-style image reconstructed by the global decoder network of the adversarial training model may also be output to the first discriminator; the first discriminator judges whether the input image is real or fake; according to the discrimination result of the first discriminator, the parameters of the first discriminator are adjusted to strengthen its discrimination ability; and the parameters of the global encoder and decoder networks are adjusted according to the distance between the reconstructed cross-granularity second-style image and the corresponding second-style image in the image training samples. The reconstructed cross-granularity second-style image may be generated in the same way as described in sub-steps S503, S504, and S505 above, which is not repeated here;
Likewise, the cross-granularity first-style instance image reconstructed by the global decoder network of the adversarial training model may also be output to the first discriminator; the first discriminator judges whether the input image is real or fake; according to the discrimination result of the first discriminator, the parameters of the first discriminator are adjusted to strengthen its discrimination ability; and the parameters of the global encoder and decoder networks are adjusted according to the distance between the reconstructed cross-granularity first-style instance image and the corresponding first-style instance image. The reconstructed cross-granularity first-style instance image may be generated in the same way as described in sub-steps S503, S504, and S506 above, which is not repeated here;
The distances between the above images reflect the differences between them and are used to adjust the parameters of the image generation model; the difference may be any quantity able to express it, such as the pixel difference between a generated image and a real image, and the specific way of determining it is not limited.
When the number of adversarial-learning iterations reaches the second preset number, the parameters of the image generation model and of the first and second discriminators have been adjusted a sufficient number of times: the image generation model can generally produce style-converted images of high realism, while the first and second discriminators generally can no longer distinguish real images from generated ones. Parameter adjustment of the image generation model and the first and second discriminators can then stop, yielding the final image generation model and the final first and second discriminators. The second preset number may be 10,000, 20,000, 50,000, or the like, and is not specifically limited here. The global encoder network 411 and the global decoder network 412 of the image generation model serve as the encoder network 201 and the decoder network 202 of the finally trained image style conversion model.
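One way to realize the discriminator updates described above is a standard GAN objective; the following sketch assumes a non-saturating binary cross-entropy loss, which is an assumption rather than a choice fixed by the disclosure, and illustrative function names.

    # A hedged sketch of one adversarial update for the first discriminator
    # and the global encoder/decoder ("generator").
    import torch
    import torch.nn.functional as F

    def discriminator_step(d1, real_imgs, fake_imgs, opt_d1):
        # The first discriminator judges whether its input is a real image
        # or an image reconstructed by the global decoder network.
        logits_real = d1(real_imgs)
        logits_fake = d1(fake_imgs.detach())
        loss_d = (F.binary_cross_entropy_with_logits(
                      logits_real, torch.ones_like(logits_real)) +
                  F.binary_cross_entropy_with_logits(
                      logits_fake, torch.zeros_like(logits_fake)))
        opt_d1.zero_grad(); loss_d.backward(); opt_d1.step()
        return loss_d.item()

    def generator_step(d1, fake_imgs, rec_imgs, real_imgs, opt_g):
        # The generator is pushed to fool the discriminator while keeping
        # the reconstruction distance to the training sample small.
        logits_fake = d1(fake_imgs)
        loss_adv = F.binary_cross_entropy_with_logits(
            logits_fake, torch.ones_like(logits_fake))
        loss_rec = F.l1_loss(rec_imgs, real_imgs)
        loss_g = loss_adv + loss_rec
        opt_g.zero_grad(); loss_g.backward(); opt_g.step()
        return loss_g.item()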
An apparatus for image style conversion provided by an embodiment of the present invention includes an image style conversion model trained by the method described above, and is used to convert an input first-style image to be style-converted into a second-style target image according to an input second-style image serving as a reference image.
Based on the above training method for an image style conversion model, an embodiment of the present invention provides a training apparatus for an image style conversion model, whose structure, shown in Fig. 9, includes an image generation model construction module 901 and an image generation model training module 902.
The image generation model construction module 901 is configured to construct an image generation model, which includes a global style transfer model and a local style transfer model; the global style transfer model includes a global encoder network and a global decoder network, and the local style transfer model includes a local encoder network and a local decoder network.
The image generation model training module 902 is configured to perform multiple iterations of training on the image generation model and, after the number of training iterations reaches the first preset number, to take the global style transfer model as the trained image style conversion model. One training iteration includes: inputting the first- and second-style images in the image training samples into the global style transfer model, where two passes of decoupling content and style features and decoding them through the global encoder and decoder networks yield the reconstructed first- and second-style images; inputting the instance images cropped from the first- and second-style images into the local style transfer model, where two passes of decoupling content and style features and decoding them through the local encoder and decoder networks yield the reconstructed instance images; inputting the style-encoding feature image of the first/second-style image and the content-encoding feature image of the second/first-style instance image into the global decoder to obtain the generated first/second-style instance content image; inputting the generated first/second-style instance content image into the content encoder and the style encoder of the local encoder network, which perform multi-layer convolution operations to decouple the style- and content-encoding feature images of the instance content image; inputting the decoupled style-encoding feature image of the second-style instance content image and the content-encoding feature image of the second-style image into the global decoder network to obtain the reconstructed cross-granularity second-style image; inputting the decoupled style-encoding feature image of the first-style image and the content-encoding feature image of the first-style instance content image into the global decoder network to obtain the reconstructed cross-granularity first-style instance image; adjusting the parameters of the global encoder and decoder networks according to the distance between the reconstructed first/second-style image and the corresponding first/second-style image in the image training samples; adjusting the parameters of the local encoder and decoder networks according to the distance between the reconstructed instance image and the instance image cropped from the corresponding first/second-style image in the image training samples; and jointly adjusting the local encoder and decoder networks and the global encoder and decoder networks according to the distance between the reconstructed cross-granularity second-style image and the corresponding second-style image in the image training samples, and the distance between the reconstructed cross-granularity first-style instance image and the corresponding first-style instance image.
Further, the training apparatus for an image style conversion model provided by the embodiment of the present invention may also include an adversarial training module 903.
The adversarial training module 903 is configured to perform multiple iterations of adversarial-learning training on the image generation model based on discriminators and, when the number of adversarial-learning iterations reaches the second preset number, to take the global style transfer model in the image generation model as the finally trained image style conversion model. For the specific iterative training method by which the adversarial training module 903 performs adversarial learning on the image generation model based on the discriminators, reference may be made to the method in step S303 above, which is not repeated here.
For the specific implementation of the functions of each module of the training apparatus for an image style conversion model provided by the embodiment of the present invention, reference may be made to the methods of the steps shown in Fig. 3 above, which are not repeated here.
In the technical solution of the present invention, the first-style image to be style-converted and the second-style image serving as a reference image are respectively input into the content and style encoders of the encoder network, which extract the content- and style-encoding feature images; the style- and content-encoding feature images are respectively input into the multi-layer perceptron module and the residual convolution module of the decoder network, which perform a perceptron operation and a residual convolution operation to produce, respectively, the parameters of the adaptive instance normalization module and an intermediate-process feature image; the obtained parameters are shared with the adaptive instance normalization module of the decoder network; and the intermediate-process feature image is input into the adaptive instance normalization module for instance normalization, after which the instance-normalized feature image is input into the upsampling layer of the decoder network, yielding the target image converted from the first style to the second style. The image style conversion model composed of the encoder and decoder networks is pre-trained with image training samples comprising multiple first- and second-style images and with instance images cropped from those training samples. Compared with the prior art, the technical solution of the present invention trains the image style conversion model with coarse-grained first- and second-style images as well as fine-grained instance images, thereby introducing cross-granularity learning into the style conversion model; this strengthens the style conversion quality of fine-grained instances while guaranteeing that of coarse-grained global images, improving the blur and distortion of local instance images after style conversion.
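For completeness, inference with the trained global model reduces to one encode-decode pass; under the assumed interfaces of the sketches above (all names illustrative), it could be sketched as:

    # A hedged sketch of inference with the trained global model.
    import torch

    @torch.no_grad()
    def stylize(img_to_convert, reference_img, enc_content, enc_style, decoder):
        # Content code of the first-style image to be converted, style code
        # of the second-style reference image, decoded into the target image.
        content = enc_content(img_to_convert)
        style = enc_style(reference_img)
        return decoder(content, style)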
On the other hand, the image style conversion model of the present invention is an unsupervised learning model and does not require paired reference and to-be-converted training data of the same scene, which gives the model strong generalization ability, greatly reduces the difficulty of data acquisition, and thus improves adaptability to style conversion in different surveillance scenarios.
The computer-readable medium of this embodiment includes permanent and non-permanent, removable and non-removable media, and information storage may be implemented by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random-access memory (SRAM), dynamic random-access memory (DRAM), other types of random-access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device.
Those of ordinary skill in the art should understand that the discussion of any of the above embodiments is merely exemplary and is not intended to imply that the scope of the present disclosure (including the claims) is limited to these examples; within the spirit of the present invention, technical features of the above embodiments or of different embodiments may also be combined, steps may be carried out in any order, and many other variations of the different aspects of the invention as described above exist that are not provided in detail for the sake of brevity.
In addition, to simplify the description and discussion, and so as not to obscure the present invention, well-known power/ground connections to integrated circuit (IC) chips and other components may or may not be shown in the provided drawings. Furthermore, apparatuses may be shown in block-diagram form to avoid obscuring the present invention, which also takes into account the fact that details of the implementation of such block-diagram apparatuses are highly dependent on the platform on which the invention is to be implemented (i.e., such details should be fully within the understanding of those skilled in the art). Where specific details (e.g., circuits) are set forth to describe exemplary embodiments of the invention, it will be apparent to those skilled in the art that the invention may be practiced without these specific details or with variations of them. Accordingly, these descriptions are to be regarded as illustrative rather than restrictive.
Although the present invention has been described in conjunction with specific embodiments thereof, many alternatives, modifications, and variations of these embodiments will be apparent to those of ordinary skill in the art from the foregoing description. For example, the discussed embodiments may be used with other memory architectures (e.g., dynamic RAM (DRAM)).
The embodiments of the present invention are intended to cover all such alternatives, modifications, and variations that fall within the broad scope of the appended claims. Therefore, any omission, modification, equivalent replacement, improvement, and the like made within the spirit and principles of the present invention shall be included within its protection scope.

Claims (10)

  1. A method for image style conversion, characterized by comprising:
    inputting a first-style image to be style-converted and a second-style image serving as a reference image into the content encoder and the style encoder of an encoder network, respectively, to extract content- and style-encoding feature images;
    inputting the style- and content-encoding feature images into the multi-layer perceptron module and the residual convolution module of a decoder network, respectively, performing a perceptron operation and a residual convolution operation to obtain, respectively, the parameters of an adaptive instance normalization module and an intermediate-process feature image, and sharing the obtained parameters with the adaptive instance normalization module of the decoder network;
    inputting the intermediate-process feature image into the adaptive instance normalization module for instance normalization, and inputting the instance-normalized feature image into the upsampling layer of the decoder network to obtain a target image converted from the first style to the second style;
    wherein an image style conversion model composed of the encoder and decoder networks is pre-trained with image training samples comprising multiple first- and second-style images and with instance images cropped from the image training samples.
  2. The method according to claim 1, characterized in that the image style conversion model is pre-trained specifically according to the following method:
    constructing an image generation model, the image generation model comprising a global style transfer model and a local style transfer model, wherein the global style transfer model comprises a global encoder network and a global decoder network, and the local style transfer model comprises a local encoder network and a local decoder network;
    performing multiple iterations of training on the image generation model and, after the number of training iterations reaches a first preset number, taking the global style transfer model as the trained image style conversion model;
    wherein one training iteration comprises:
    inputting the first- and second-style images in the image training samples into the global style transfer model, where two passes of decoupling content and style features and decoding them through the global encoder and decoder networks yield reconstructed first- and second-style images;
    inputting the instance images cropped from the first- and second-style images into the local style transfer model, where two passes of decoupling content and style features and decoding them through the local encoder and decoder networks yield reconstructed instance images;
    inputting the style-encoding feature image of the first/second-style image and the content-encoding feature image of the second/first-style instance image into the global decoder to obtain a generated first/second-style instance content image;
    inputting the generated first/second-style instance content image into the content encoder and the style encoder of the local encoder network, which perform multi-layer convolution operations to decouple the style- and content-encoding feature images of the instance content image;
    inputting the decoupled style-encoding feature image of the second-style instance content image and the content-encoding feature image of the second-style image into the global decoder network to obtain a reconstructed cross-granularity second-style image;
    inputting the decoupled style-encoding feature image of the first-style image and the content-encoding feature image of the first-style instance content image into the global decoder network to obtain a reconstructed cross-granularity first-style instance image;
    adjusting the parameters of the global encoder and decoder networks according to the distance between the reconstructed first/second-style image and the corresponding first/second-style image in the image training samples;
    adjusting the parameters of the local encoder and decoder networks according to the distance between the reconstructed instance image and the instance image cropped from the corresponding first/second-style image in the image training samples;
    jointly adjusting the local encoder and decoder networks and the global encoder and decoder networks according to the distance between the reconstructed cross-granularity second-style image and the corresponding second-style image in the image training samples, and the distance between the reconstructed cross-granularity first-style instance image and the corresponding first-style instance image.
  3. The method according to claim 1, characterized in that, after the number of iterations reaches the first preset number, the method further comprises:
    performing multiple iterations of adversarial-learning training on the image generation model based on discriminators, and, when the number of adversarial-learning iterations reaches a second preset number, taking the global style transfer model in the image generation model as the finally trained image style conversion model.
  4. The method according to claim 2 or 3, characterized in that inputting the first- and second-style images in the image training samples into the global style transfer model and obtaining the reconstructed first- and second-style images through two passes of decoupling content and style features and decoding them in the global encoder and decoder networks specifically comprises:
    inputting the first-style image in the image training samples into the global encoder network, where the content encoder and the style encoder of the global encoder network perform multi-layer convolution operations to decouple the content- and style-encoding feature images of the first-style image;
    inputting the second-style image in the image training samples into the global encoder network, where the content encoder and the style encoder of the global encoder network perform multi-layer convolution operations to decouple the content- and style-encoding feature images of the second-style image;
    inputting the style-encoding feature image of the first-style image and the content-encoding feature image of the second-style image into the multi-layer perceptron module and the residual convolution module of the global decoder network, respectively, whereby the global decoder network obtains a first generated image that fuses the style of the first-style image with the content of the second-style image;
    inputting the style-encoding feature image of the second-style image and the content-encoding feature image of the first-style image into the multi-layer perceptron module and the residual convolution module of the global decoder network, respectively, obtaining a second generated image that fuses the style of the second-style image with the content of the first-style image;
    inputting the first generated image into the global encoder network, where the content encoder and the style encoder perform multi-layer convolution operations to decouple the content- and style-encoding feature images of the first generated image;
    inputting the second generated image into the global encoder network, where the content encoder and the style encoder perform multi-layer convolution operations to decouple the content- and style-encoding feature images of the second generated image;
    inputting the style-encoding feature image of the first generated image and the content-encoding feature image of the second generated image into the multi-layer perceptron module and the residual convolution module of the global decoder network, respectively, finally obtaining the reconstructed first-style image;
    inputting the style-encoding feature image of the second generated image and the content-encoding feature image of the first generated image into the multi-layer perceptron module and the residual convolution module of the global decoder network, respectively, finally obtaining the reconstructed second-style image.
  5. The method according to claim 2 or 3, characterized in that inputting the instance images cropped from the first- and second-style images into the local style transfer model and obtaining the reconstructed instance images through two passes of decoupling content and style features and decoding them in the local encoder and decoder networks specifically comprises:
    inputting the first-style instance image into the local encoder network, where the content encoder and the style encoder of the local encoder network perform multi-layer convolution operations to decouple the content- and style-encoding feature images of the first-style instance image;
    inputting the second-style instance image into the local encoder network, where the content encoder and the style encoder of the local encoder network perform multi-layer convolution operations to decouple the content- and style-encoding feature images of the second-style instance image;
    inputting the style-encoding feature image of the first-style instance image and the content-encoding feature image of the second-style instance image into the multi-layer perceptron module and the residual convolution module of the local decoder network, respectively, obtaining a first generated instance image that fuses the style of the first-style instance image with the content of the second-style instance image;
    inputting the style-encoding feature image of the second-style instance image and the content-encoding feature image of the first-style instance image into the multi-layer perceptron module and the residual convolution module of the local decoder network, respectively, obtaining a second generated instance image that fuses the style of the second-style instance image with the content of the first-style instance image;
    inputting the first generated instance image into the local encoder network, where the content encoder and the style encoder perform multi-layer convolution operations to decouple the content- and style-encoding feature images of the first generated instance image;
    inputting the second generated instance image into the local encoder network, where the content encoder and the style encoder perform multi-layer convolution operations to decouple the content- and style-encoding feature images of the second generated instance image;
    inputting the style-encoding feature image of the first generated instance image and the content-encoding feature image of the second generated instance image into the multi-layer perceptron module and the residual convolution module of the local decoder network, respectively, finally obtaining the reconstructed first-style instance image;
    inputting the style-encoding feature image of the second generated instance image and the content-encoding feature image of the first generated instance image into the multi-layer perceptron module and the residual convolution module of the local decoder network, respectively, finally obtaining the reconstructed second-style instance image;
    wherein the first/second-style instance images are the instance images cropped from the first/second-style images.
  6. An apparatus for image style conversion, characterized by comprising an image style conversion model trained by the method according to any one of claims 1-5, configured to convert an input first-style image to be style-converted into a second-style target image according to an input second-style image serving as a reference image.
  7. A training method for an image style conversion model, characterized by comprising:
    constructing an image generation model, the image generation model comprising a global style transfer model and a local style transfer model, wherein the global style transfer model comprises a global encoder network and a global decoder network, and the local style transfer model comprises a local encoder network and a local decoder network;
    performing multiple iterations of training on the image generation model and, after the number of training iterations reaches a first preset number, taking the global style transfer model as the trained image style conversion model;
    wherein one training iteration comprises:
    inputting the first- and second-style images in the image training samples into the global style transfer model, where two passes of decoupling content and style features and decoding them through the global encoder and decoder networks yield reconstructed first- and second-style images;
    inputting the instance images cropped from the first- and second-style images into the local style transfer model, where two passes of decoupling content and style features and decoding them through the local encoder and decoder networks yield reconstructed instance images;
    inputting the style-encoding feature image of the first/second-style image and the content-encoding feature image of the second/first-style instance image into the global decoder to obtain a generated first/second-style instance content image;
    inputting the generated first/second-style instance content image into the content encoder and the style encoder of the local encoder network, which perform multi-layer convolution operations to decouple the style- and content-encoding feature images of the instance content image;
    inputting the decoupled style-encoding feature image of the second-style instance content image and the content-encoding feature image of the second-style image into the global decoder network to obtain a reconstructed cross-granularity second-style image;
    inputting the decoupled style-encoding feature image of the first-style image and the content-encoding feature image of the first-style instance content image into the global decoder network to obtain a reconstructed cross-granularity first-style instance image;
    adjusting the parameters of the global encoder and decoder networks according to the distance between the reconstructed first/second-style image and the corresponding first/second-style image in the image training samples;
    adjusting the parameters of the local encoder and decoder networks according to the distance between the reconstructed instance image and the instance image cropped from the corresponding first/second-style image in the image training samples;
    jointly adjusting the local encoder and decoder networks and the global encoder and decoder networks according to the distance between the reconstructed cross-granularity second-style image and the corresponding second-style image in the image training samples, and the distance between the reconstructed cross-granularity first-style instance image and the corresponding first-style instance image.
  8. The method according to claim 7, characterized in that, after the number of iterations reaches the first preset number, the method further comprises:
    performing multiple iterations of adversarial-learning training based on discriminators, and, when the number of adversarial-learning iterations reaches a second preset number, taking the global style transfer model as the finally trained image style conversion model.
  9. A training apparatus for an image style conversion model, characterized by comprising:
    an image generation model construction module, configured to construct an image generation model, the image generation model comprising a global style transfer model and a local style transfer model, wherein the global style transfer model comprises a global encoder network and a global decoder network, and the local style transfer model comprises a local encoder network and a local decoder network;
    an image generation model training module, configured to perform multiple iterations of training on the image generation model and, after the number of training iterations reaches a first preset number, to take the global style transfer model as the trained image style conversion model; wherein one training iteration comprises: inputting the first- and second-style images in the image training samples into the global style transfer model, where two passes of decoupling content and style features and decoding them through the global encoder and decoder networks yield reconstructed first- and second-style images; inputting the instance images cropped from the first- and second-style images into the local style transfer model, where two passes of decoupling content and style features and decoding them through the local encoder and decoder networks yield reconstructed instance images; inputting the style-encoding feature image of the first/second-style image and the content-encoding feature image of the second/first-style instance image into the global decoder to obtain a generated first/second-style instance content image; inputting the generated first/second-style instance content image into the content encoder and the style encoder of the local encoder network, which perform multi-layer convolution operations to decouple the style- and content-encoding feature images of the instance content image; inputting the decoupled style-encoding feature image of the second-style instance content image and the content-encoding feature image of the second-style image into the global decoder network to obtain a reconstructed cross-granularity second-style image; inputting the decoupled style-encoding feature image of the first-style image and the content-encoding feature image of the first-style instance content image into the global decoder network to obtain a reconstructed cross-granularity first-style instance image; adjusting the parameters of the global encoder and decoder networks according to the distance between the reconstructed first/second-style image and the corresponding first/second-style image in the image training samples; adjusting the parameters of the local encoder and decoder networks according to the distance between the reconstructed instance image and the instance image cropped from the corresponding first/second-style image in the image training samples; and jointly adjusting the local encoder and decoder networks and the global encoder and decoder networks according to the distance between the reconstructed cross-granularity second-style image and the corresponding second-style image in the image training samples, and the distance between the reconstructed cross-granularity first-style instance image and the corresponding first-style instance image.
  10. The apparatus according to claim 9, characterized by further comprising:
    an adversarial training module, configured to perform multiple iterations of adversarial-learning training on the image generation model based on discriminators and, when the number of adversarial-learning iterations reaches a second preset number, to take the global style transfer model in the image generation model as the finally trained image style conversion model.
PCT/CN2021/093432 2020-09-02 2021-05-12 Image style transfer method and apparatus, and image style transfer model training method and apparatus WO2022048182A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010907304.1 2020-09-02
CN202010907304.1A CN111815509B (en) 2020-09-02 2020-09-02 Image style conversion and model training method and device

Publications (1)

Publication Number Publication Date
WO2022048182A1 (en)

Family

ID=72860716

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/093432 WO2022048182A1 (en) 2020-09-02 2021-05-12 Image style transfer method and apparatus, and image style transfer model training method and apparatus

Country Status (2)

Country Link
CN (1) CN111815509B (en)
WO (1) WO2022048182A1 (en)

Families Citing this family (4)

Publication number Priority date Publication date Assignee Title
CN111815509B (en) * 2020-09-02 2021-01-01 北京邮电大学 Image style conversion and model training method and device
CN112883806B (en) * 2021-01-21 2024-03-22 杭州广电云网络科技有限公司 Video style migration method and device based on neural network, computer equipment and storage medium
CN113160042B (en) * 2021-05-21 2023-02-17 北京邮电大学 Image style migration model training method and device and electronic equipment
CN113343878A (en) * 2021-06-18 2021-09-03 北京邮电大学 High-fidelity face privacy protection method and system based on generation countermeasure network

Citations (5)

Publication number Priority date Publication date Assignee Title
CN110263865A (en) * 2019-06-24 2019-09-20 北方民族大学 A kind of semi-supervised multi-modal multi-class image interpretation method
US10657676B1 (en) * 2018-06-28 2020-05-19 Snap Inc. Encoding and decoding a stylized custom graphic
CN111179215A (en) * 2019-11-29 2020-05-19 北京航空航天大学合肥创新研究院 Method and system for analyzing internal structure of cell based on cell bright field picture
CN111583165A (en) * 2019-02-19 2020-08-25 京东方科技集团股份有限公司 Image processing method, device, equipment and storage medium
CN111815509A (en) * 2020-09-02 2020-10-23 北京邮电大学 Image style conversion and model training method and device

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10832387B2 (en) * 2017-07-19 2020-11-10 Petuum Inc. Real-time intelligent image manipulation system
CN109829353B (en) * 2018-11-21 2023-04-18 东南大学 Face image stylizing method based on space constraint
CN111445476B (en) * 2020-02-27 2023-05-26 上海交通大学 Monocular depth estimation method based on multi-mode unsupervised image content decoupling
CN111539896B (en) * 2020-04-30 2022-05-27 华中科技大学 Domain-adaptive-based image defogging method and system

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10657676B1 (en) * 2018-06-28 2020-05-19 Snap Inc. Encoding and decoding a stylized custom graphic
CN111583165A (en) * 2019-02-19 2020-08-25 京东方科技集团股份有限公司 Image processing method, device, equipment and storage medium
CN110263865A (en) * 2019-06-24 2019-09-20 北方民族大学 Semi-supervised multi-modal multi-class image interpretation method
CN111179215A (en) * 2019-11-29 2020-05-19 北京航空航天大学合肥创新研究院 Method and system for analyzing internal structure of cell based on cell bright field picture
CN111815509A (en) * 2020-09-02 2020-10-23 北京邮电大学 Image style conversion and model training method and device

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2612775A (en) * 2021-11-10 2023-05-17 Sony Interactive Entertainment Inc System and method for generating assets
CN114610935A (en) * 2022-05-12 2022-06-10 之江实验室 Method and system for synthesizing semantic image of text control image style
CN116402067A (en) * 2023-04-06 2023-07-07 哈尔滨工业大学 Cross-language self-supervision generation method for multi-language character style retention
CN116402067B (en) * 2023-04-06 2024-01-30 哈尔滨工业大学 Cross-language self-supervision generation method for multi-language character style retention

Also Published As

Publication number Publication date
CN111815509A (en) 2020-10-23
CN111815509B (en) 2021-01-01

Similar Documents

Publication Publication Date Title
WO2022048182A1 (en) Image style transfer method and apparatus, and image style transfer model training method and apparatus
Sheng et al. Temporal context mining for learned video compression
US11159790B2 (en) Methods, apparatuses, and systems for transcoding a video
TW202247650A (en) Implicit image and video compression using machine learning systems
CN113837938B (en) Super-resolution method for reconstructing potential image based on dynamic vision sensor
CN111488932B (en) Self-supervision video time-space characterization learning method based on frame rate perception
KR20200114436A (en) Apparatus and method for performing scalable video decoing
CN114936605A (en) Knowledge distillation-based neural network training method, device and storage medium
CN112381716B (en) Image enhancement method based on generation type countermeasure network
CN116803079A (en) Scalable coding of video and related features
Liu et al. End-to-end neural video coding using a compound spatiotemporal representation
Hamdi et al. A New Image Enhancement and Super Resolution technique for license plate recognition
CN116957931A (en) Method for improving image quality of camera image based on nerve radiation field
CN112651911A (en) High dynamic range imaging generation method based on polarization image
Jia et al. Event-based semantic segmentation with posterior attention
CN112418127B (en) Video sequence coding and decoding method for video pedestrian re-identification
CN112884636B (en) Style migration method for automatically generating stylized video
KR20220043912A (en) Method and Apparatus for Coding Feature Map Based on Deep Learning in Multitasking System for Machine Vision
Kim et al. End-to-end learnable multi-scale feature compression for vcm
CN113706572B (en) End-to-end panoramic image segmentation method based on query vector
Wang et al. Automatic model-based dataset generation for high-level vision tasks of autonomous driving in haze weather
Que et al. Residual dense U-Net for abnormal exposure restoration from single images
CN116962657B (en) Color video generation method, device, electronic equipment and storage medium
Xu et al. VQ-NeRV: A Vector Quantized Neural Representation for Videos
WO2023165487A1 (en) Feature domain optical flow determination method and related device

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 21863252; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 21863252; Country of ref document: EP; Kind code of ref document: A1)