CN111815509B - Image style conversion and model training method and device - Google Patents
- Publication number
- CN111815509B CN111815509B CN202010907304.1A CN202010907304A CN111815509B CN 111815509 B CN111815509 B CN 111815509B CN 202010907304 A CN202010907304 A CN 202010907304A CN 111815509 B CN111815509 B CN 111815509B
- Authority
- CN
- China
- Prior art keywords
- style
- image
- content
- encoder
- global
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS; G06—COMPUTING; CALCULATING OR COUNTING; G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T3/00—Geometric image transformations in the plane of the image; G06T3/04—Context-preserving transformations, e.g. by using an importance map
- G06T7/00—Image analysis; G06T7/10—Segmentation; Edge detection; G06T7/11—Region-based segmentation
- G06T9/00—Image coding; G06T9/002—Image coding using neural networks
- G06T2207/00—Indexing scheme for image analysis or image enhancement; G06T2207/10—Image acquisition modality; G06T2207/10024—Color image; G06T2207/10048—Infrared image
- G06T2207/20—Special algorithmic details; G06T2207/20081—Training; Learning; G06T2207/20084—Artificial neural networks [ANN]; G06T2207/20112—Image segmentation details; G06T2207/20132—Image cropping
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- Multimedia (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Compression Or Coding Systems Of Tv Signals (AREA)
- Image Processing (AREA)
Abstract
The invention discloses a method and a device for image style conversion and model training, wherein the method comprises the following steps: respectively inputting an image of a first style and an image of a second style into a content encoder and a style encoder in an encoder network, and respectively extracting a content coding feature image and a style coding feature image; respectively inputting the style and content coding feature images into a decoder network to obtain a target image converted from the first style to the second style; the image style conversion model formed by the encoder and decoder networks is obtained by pre-training on image training samples comprising a plurality of images of the first and second styles and example images cropped from the image training samples. The method and the device can improve the adaptability of style conversion in various different scenes, alleviate the problems of image blur and poor instance conversion effect in the image after style conversion, and simultaneously realize high-quality image style migration at both coarse granularity and fine granularity.
Description
Technical Field
The present invention relates to the field of image processing technologies, and in particular, to a method and an apparatus for image style conversion and model training.
Background
Image style conversion is an important topic in the field of image processing. With social and economic development, demand for audio and video processing keeps growing; face editing/generation, image enhancement and image style conversion in particular attract wide attention, and apps/cameras applying the related technologies readily draw public interest and create economic and social value. Image style conversion technology can be used for image enhancement: a night-time image can be enhanced into an image with clearer information, greatly improving its visibility, which is of great significance for video monitoring. With the continuous development of digital image processing, pattern recognition and deep learning technologies, image style conversion methods also keep evolving.
Image style conversion methods based on traditional image processing operate on the image directly, do not exploit high-level image features, and adapt poorly to new scenes. In the prior art, image style migration technology based on generative adversarial networks is more widely applied, but it still suffers from blurred images after style conversion, poor style conversion of instances in the image such as vehicles and persons, and the difficulty of acquiring paired image data of different styles.
Disclosure of Invention
In view of this, the present invention aims to provide a method and an apparatus for image style conversion and model training, so as to improve the adaptability of image style conversion in various scenes, alleviate image blur and the poor conversion effect of instances in the image after style conversion, and simultaneously achieve high-quality image style migration at both coarse granularity and fine granularity.
Based on the above object, the present invention provides a method for converting image style, comprising:
respectively inputting an image of a first style to be subjected to style conversion and an image of a second style serving as a reference image into a content encoder and a style encoder in an encoder network, and respectively extracting a content encoding characteristic image and a style encoding characteristic image;
respectively inputting the style and content coding characteristic images into a multilayer perceptron module and a residual convolution module in a decoder network, and respectively carrying out perceptron operation and residual convolution operation to respectively obtain parameters of an adaptive instance normalization module and an intermediate process characteristic image; sharing the obtained parameters to an adaptive instance normalization module of the decoder network;
inputting the intermediate process characteristic image into the self-adaptive example normalization module for example normalization, and inputting the example normalized characteristic image into an upsampling layer of the decoder network to obtain a target image converted from a first style to a second style;
the image style conversion model formed by the encoder network and the decoder network is obtained by pre-training on image training samples comprising a plurality of images of the first and second styles and example images cropped from the image training samples.
Preferably, the image style conversion model is obtained by pre-training according to the following method:
constructing an image generation model; wherein the image generation model comprises: a global style migration model and a local style migration model; the global style migration model comprises: a global encoder network and a global decoder network; the local style migration model comprises: a local encoder network and a local decoder network;
performing iterative training on the image generation model for multiple times, and after the number of iterations reaches a first preset number, taking the global style migration model as the image style conversion model obtained by training;
wherein one iterative training process comprises the following steps:
inputting the images of the first and second styles in the image training sample into the global style migration model, and obtaining reconstructed images of the first and second styles through twice decoupling content features and style features, decoding content features and style features of the global encoder network and decoder network;
inputting example images cut from the images of the first and second styles into the local style migration model, and obtaining reconstructed example images through twice decoupling content features and style features, decoding content features and style features of the local encoder network and the local decoder network;
inputting the style encoding characteristic image of the first/second style image and the content encoding characteristic image of the example image of the second/first style into a global decoder to obtain the generated example content image of the first/second style;
inputting the generated example content images of the first style and the second style into a content encoder and a style encoder in a local encoder to perform multilayer convolution operation to decouple the style and the content encoding characteristic images of the example content images;
inputting the style coding characteristic image of the decoupled example content image of the second style and the content coding characteristic image of the second style into a global decoder network to obtain a reconstructed image of a cross-granularity second style;
inputting the decoupled style coding characteristic image of the first style image and the content coding characteristic image of the example content image of the first style into a global decoder network to obtain a reconstructed cross-granularity example first style image;
adjusting parameters of the global encoder and decoder network according to a distance between the reconstructed first/second-style images and corresponding first/second-style images in the image training sample;
adjusting parameters of the local encoder and decoder network according to a distance between a reconstructed example image and an example image cropped from a corresponding first/second style image in the image training sample;
jointly adjusting the local encoder and decoder networks and the global encoder and decoder networks according to a distance between a reconstructed cross-granularity second-style image and a corresponding second-style image in the image training sample, and a distance between a reconstructed cross-granularity example first-style image and a corresponding first-style example image.
Preferably, after the number of iterations reaches a first preset number, the method further includes:
and performing iterative training of multi-time antagonistic learning on the image generation model based on the discriminator, and when the iterative training times of the antagonistic learning reaches a second preset time, taking the global style migration model in the image generation model as the final image style conversion model obtained by training.
The inputting the first and second styles of images in the image training sample into the global style migration model, and obtaining the reconstructed first and second styles of images through twice decoupling of the content features and the style features, decoding of the content features and the style features of the global encoder network and decoder network specifically includes:
inputting the images of the first style in the image training samples into the global encoder network, and decoupling the content coding characteristic images and the style coding characteristic images of the first style by carrying out multilayer convolution operation through a content encoder and a style encoder in the global encoder network;
inputting the images of the second style in the image training samples into the global encoder network, and decoupling the content coding characteristic images and the style coding characteristic images of the second style by carrying out multilayer convolution operation through a content encoder and a style encoder in the global encoder network;
respectively inputting the style coding characteristic image of the first style image and the content coding characteristic image of the second style image into a multilayer perceptron module and a residual convolution module in the global decoder network, wherein the global decoder network obtains a first generated image fused with the styles of the first style image and the content of the second style image;
respectively inputting the style coding characteristic image of the second style image and the content coding characteristic image of the first style image into a multilayer perceptron module and a residual convolution module in the global decoder network to obtain a second generated image fusing the style of the second style image and the content of the first style image;
inputting the first generated image into the global encoder network, and decoupling the content coding characteristic image and the style coding characteristic image of the first generated image by performing multilayer convolution operation on a content encoder and a style encoder of the global encoder network;
inputting a second generated image into the global encoder network, and decoupling a content coding characteristic image and a style coding characteristic image of the second generated image by performing multilayer convolution operation through a content encoder and a style encoder of the global encoder network;
respectively inputting the style coding characteristic image of the first generated image and the content coding characteristic image of the second generated image into a multilayer perceptron module and a residual convolution module in the global decoder network to finally obtain a reconstructed first style image;
and respectively inputting the style coding characteristic image of the second generated image and the content coding characteristic image of the first generated image into a multilayer perceptron module and a residual convolution module in the global decoder network to finally obtain a reconstructed second style image.
Wherein, the example image cut out from the first and second style images is input into the local style migration model, and a reconstructed example image is obtained by twice decoupling the content feature and the style feature, and decoding the content feature and the style feature of the local encoder network and the local decoder network, and specifically comprises:
inputting the example image of the first style into the local encoder network, and decoupling the content coding characteristic image and the style coding characteristic image of the example image of the first style by carrying out multilayer convolution operation through a content encoder and a style encoder in the local encoder network;
inputting the example image of the second style into the local encoder network, and decoupling the content coding characteristic image and the style coding characteristic image of the example image of the second style by carrying out multilayer convolution operation through a content encoder and a style encoder in the local encoder network;
respectively inputting the style coding characteristic image of the example image of the first style and the content coding characteristic image of the example image of the second style into a multilayer perceptron module and a residual convolution module in the local decoder network to obtain a first generated example image fusing the style of the example image of the first style and the content of the example image of the second style;
respectively inputting the style coding characteristic image of the example image of the second style and the content coding characteristic image of the example image of the first style into a multilayer perceptron module and a residual convolution module in the local decoder network to obtain a second generation example image fusing the style of the example image of the second style and the content of the example image of the first style;
inputting the first generation example image into the local encoder network, and decoupling the content coding characteristic image and the style coding characteristic image of the first generation example image by carrying out multilayer convolution operation through a content encoder and a style encoder of the local encoder network;
inputting a second generation example image into the local encoder network, and decoupling a content coding characteristic image and a style coding characteristic image of the second generation example image by performing multilayer convolution operation on a content encoder and a style encoder of the local encoder network;
respectively inputting the style coding characteristic image of the first generated example image and the content coding characteristic image of the second generated example image into a multilayer perceptron module and a residual convolution module in the local decoder network to finally obtain a reconstructed example image of a first style;
respectively inputting the style coding characteristic image of the second generated example image and the content coding characteristic image of the first generated example image into a multilayer perceptron module and a residual convolution module in the local decoder network to finally obtain a reconstructed example image of a second style;
wherein the example image of the first/second style is an example image cut out from the image of the first/second style.
The invention also provides an image style conversion device, comprising the image style conversion model trained by the above method, which is used for converting an input image of a first style to be style-converted into a target image of a second style according to an input image of the second style serving as a reference image.
The invention also provides a training method of the image style conversion model, which comprises the following steps:
constructing an image generation model; wherein the image generation model comprises: a global style migration model and a local style migration model; the global style migration model comprises: a global encoder network and a global decoder network; the local style migration model comprises: a local encoder network and a local decoder network;
performing iterative training on the image generation model for multiple times, and taking the global style migration model as an image style conversion model obtained by training after the iterative training times reach a first preset time;
wherein, the one-time iterative training process comprises the following steps:
inputting the images of the first and second styles in the image training sample into the global style migration model, and obtaining reconstructed images of the first and second styles through twice decoupling content features and style features, decoding content features and style features of the global encoder network and decoder network;
inputting example images cut from the images of the first and second styles into the local style migration model, and obtaining reconstructed example images through twice decoupling content features and style features, decoding content features and style features of the local encoder network and the local decoder network;
inputting the style encoding characteristic image of the first/second style image and the content encoding characteristic image of the example image of the second/first style into a global decoder to obtain the generated example content image of the first/second style;
inputting the generated example content images of the first style and the second style into a content encoder and a style encoder in a local encoder to perform multilayer convolution operation to decouple the style and the content encoding characteristic images of the example content images;
inputting the style coding characteristic image of the decoupled example content image of the second style and the content coding characteristic image of the second style into a global decoder network to obtain a reconstructed image of a cross-granularity second style;
inputting the decoupled style coding characteristic image of the example image of the first style and the content coding characteristic image of the example content image of the second style into a global decoder network to obtain a reconstructed cross-granularity example first style image;
adjusting parameters of the global encoder and decoder network according to a distance between the reconstructed first/second-style images and corresponding first/second-style images in the image training sample;
adjusting parameters of the local encoder and decoder network according to a distance between a reconstructed example image and an example image cropped from a corresponding first/second style image in the image training sample;
jointly adjusting the local encoder and decoder networks and the global encoder and decoder networks according to a distance between a reconstructed cross-granularity second-style image and a corresponding second-style image in the image training sample, and a distance between a reconstructed cross-granularity example first-style image and a corresponding first-style example image.
The invention also provides a training device of the image style conversion model, which comprises:
the image generation model building module is used for building an image generation model; wherein the image generation model comprises: a global style migration model and a local style migration model; the global style migration model comprises: a global encoder network and a global decoder network; the local style migration model comprises: a local encoder network and a local decoder network;
the image generation model training module is used for carrying out iterative training on the image generation model for multiple times, and after the number of iterations reaches a first preset number, the global style migration model is used as the image style conversion model obtained through training; wherein one iterative training process comprises the following steps: inputting the images of the first and second styles in the image training sample into the global style migration model, and obtaining reconstructed images of the first and second styles through twice decoupling content features and style features, decoding content features and style features of the global encoder network and decoder network; inputting example images cut from the images of the first and second styles into the local style migration model, and obtaining reconstructed example images through twice decoupling content features and style features, decoding content features and style features of the local encoder network and the local decoder network; inputting the style encoding characteristic image of the first/second style image and the content encoding characteristic image of the example image of the second/first style into a global decoder to obtain the generated example content image of the first/second style; inputting the generated example content images of the first style and the second style into a content encoder and a style encoder in a local encoder to perform multilayer convolution operation to decouple the style and the content encoding characteristic images of the example content images; inputting the style coding characteristic image of the decoupled example content image of the second style and the content coding characteristic image of the second style into a global decoder network to obtain a reconstructed image of a cross-granularity second style; inputting the decoupled style coding characteristic image of the first style image and the content coding characteristic image of the example content image of the first style into a global decoder network to obtain a reconstructed cross-granularity example first style image; adjusting parameters of the global encoder and decoder network according to a distance between the reconstructed first/second-style images and corresponding first/second-style images in the image training sample; adjusting parameters of the local encoder and decoder network according to a distance between a reconstructed example image and an example image cropped from a corresponding first/second style image in the image training sample; jointly adjusting the local encoder and decoder networks and the global encoder and decoder networks according to a distance between a reconstructed cross-granularity second-style image and a corresponding second-style image in the image training sample, and a distance between a reconstructed cross-granularity example first-style image and a corresponding first-style example image.
In the technical scheme of the invention, an image of a first style to be subjected to style conversion and an image of a second style serving as a reference image are respectively input into a content encoder and a style encoder in an encoder network, and a content coding feature image and a style coding feature image are respectively extracted; the style and content coding feature images are respectively input into a multilayer perceptron module and a residual convolution module in a decoder network, where a perceptron operation and a residual convolution operation are respectively carried out to obtain parameters of an adaptive instance normalization module and an intermediate process feature image; the obtained parameters are shared with the adaptive instance normalization module of the decoder network; the intermediate process feature image is input into the adaptive instance normalization module for instance normalization, and the instance-normalized feature image is input into an upsampling layer of the decoder network to obtain a target image converted from the first style to the second style; the image style conversion model formed by the encoder and decoder networks is obtained by pre-training on image training samples comprising a plurality of images of the first and second styles and example images cropped from the image training samples. Compared with the prior art, the technical scheme of the invention adopts coarse-grained images of the first and second styles and fine-grained example images when training the image style conversion model, thereby introducing cross-granularity learning into the style conversion model, enhancing the style conversion quality of fine-grained examples while ensuring the style conversion quality of the coarse-grained global image, and alleviating the blurring and distortion of local example images after style conversion.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description are only some embodiments of the present invention; for those skilled in the art, other drawings can be obtained from these drawings without creative effort.
FIG. 1 is a flowchart of a method for image style conversion according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of the internal structure of an image style conversion model according to an embodiment of the present invention;
FIG. 3 is a flowchart of a training method of an image style conversion model according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of an internal structure of an image generation model according to an embodiment of the present invention;
FIG. 5 is a flowchart of a method for one-time iterative training of an image style conversion model according to an embodiment of the present invention;
FIGS. 6a to 6h are schematic diagrams of reconstructing a first-style image and a second-style image based on a global codec network according to an embodiment of the present invention;
FIGS. 7a to 7h are schematic diagrams of reconstructing example images of the first style and the second style based on a local codec network according to an embodiment of the present invention;
FIGS. 8a and 8b are schematic diagrams of generating example content images of the first style and the second style based on a global decoder network according to an embodiment of the present invention;
FIG. 8c is a schematic diagram of decoupling the style and content features of an example content image based on a local encoder network according to an embodiment of the present invention;
FIG. 8d is a schematic diagram of obtaining a reconstructed cross-granularity second-style image based on a global decoder network according to an embodiment of the present invention;
FIG. 8e is a schematic diagram of reconstructing a cross-granularity example first-style image according to an embodiment of the present invention;
FIG. 8f is a schematic diagram of the internal structure of an adversarial training model according to an embodiment of the present invention;
FIG. 9 is a block diagram of the internal structure of a training apparatus for an image style conversion model according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to specific embodiments and the accompanying drawings.
It is to be noted that technical terms or scientific terms used in the embodiments of the present invention should have the ordinary meanings as understood by those having ordinary skill in the art to which the present disclosure belongs, unless otherwise defined. The use of "first," "second," and similar terms in this disclosure is not intended to indicate any order, quantity, or importance, but rather is used to distinguish one element from another. The word "comprising" or "comprises", and the like, means that the element or item listed before the word covers the element or item listed after the word and its equivalents, but does not exclude other elements or items. The terms "connected" or "coupled" and the like are not restricted to physical or mechanical connections, but may include electrical connections, whether direct or indirect. "upper", "lower", "left", "right", and the like are used merely to indicate relative positional relationships, and when the absolute position of the object being described is changed, the relative positional relationships may also be changed accordingly.
The technical solution of the embodiments of the present invention is described in detail below with reference to the accompanying drawings.
The method for converting the image style provided by the embodiment of the invention has the specific flow as shown in fig. 1, and comprises the following steps:
step S101: and respectively inputting the image of the first style to be subjected to style conversion and the image of the second style serving as a reference image into a content and style encoder in an image style conversion model, and respectively extracting a content and style encoding characteristic image.
Specifically, the internal structure of the image style conversion model of the present invention, as shown in fig. 2, may include: an encoder network 201 and a decoder network 202; the encoder network 201 may include: a content encoder 211 and a style encoder 212; the decoder network 202 may include: a multi-layer perceptron module 221, a residual convolution module 222, an adaptive instance normalization module 223, and an upsampling layer 224.
The first-style image and the second-style image are images in different scenes; for example, the first style image may be a nighttime image captured by a monitoring camera device; the second style image may be a non-nighttime image;
alternatively, the first-style image may be a grayscale image taken by an infrared camera; the second style image may be a color image;
in this step, an image of a first genre to be subjected to genre conversion and an image of a second genre as a reference image are input to a content encoder 211 and a format encoder 212 in the encoder network 201, respectively;
the content encoder 211 performs a first preset convolution operation on the input image of the first style, and extracts high-level feature information of the input image; the style encoder 212 performs a second preset convolution operation on the input second style image to extract high-level feature information of the input image;
the content encoder 211 and the style encoder 212 output the extracted content encoding characteristic image and the extracted style encoding characteristic image, respectively.
In one specific embodiment, the content encoder 211 and the style encoder 212 may be lightweight feature-extraction convolutional neural networks, such as UNet (a U-shaped convolutional neural network). It can be understood that such a feature-extraction convolutional neural network continuously enlarges its receptive field through successive local convolution operations and extracts high-level feature information of the input image.
The content encoder 211 comprises a plurality of convolutional layers, the convolutional layer of the next layer continuously performs convolutional operation on the encoding characteristic output by the convolutional layer of the previous layer, and the encoding characteristic output by the convolutional layer of the last layer is the content encoding characteristic image of the input image; the convolution kernel size and convolution step size of each convolution layer in the content encoder 211 may be set according to a specific scenario, for example, the convolution kernel size may be set to (7 × 7), and the step size may be set to (1 × 1).
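For concreteness, a minimal PyTorch sketch of such a content/style encoder pair is given below. It is illustrative only: the layer counts, channel widths and the global pooling in the style encoder are assumptions, and only the 7 × 7 first convolution with stride 1 follows the example above.

```python
import torch
import torch.nn as nn

class ContentEncoder(nn.Module):
    """Stacked convolutions that keep spatial structure (content coding feature image). Hypothetical widths."""
    def __init__(self, in_ch=3, base=64, n_down=2):
        super().__init__()
        layers = [nn.Conv2d(in_ch, base, kernel_size=7, stride=1, padding=3), nn.ReLU(inplace=True)]
        ch = base
        for _ in range(n_down):  # downsampling convolutions keep enlarging the receptive field
            layers += [nn.Conv2d(ch, ch * 2, 4, stride=2, padding=1), nn.ReLU(inplace=True)]
            ch *= 2
        self.net = nn.Sequential(*layers)
        self.out_ch = ch

    def forward(self, x):
        return self.net(x)  # content coding feature image

class StyleEncoder(nn.Module):
    """Convolutions followed by global pooling to a compact style code. Hypothetical design."""
    def __init__(self, in_ch=3, base=64, style_dim=8, n_down=4):
        super().__init__()
        layers = [nn.Conv2d(in_ch, base, 7, 1, 3), nn.ReLU(inplace=True)]
        ch = base
        for _ in range(n_down):
            layers += [nn.Conv2d(ch, min(ch * 2, 256), 4, 2, 1), nn.ReLU(inplace=True)]
            ch = min(ch * 2, 256)
        layers += [nn.AdaptiveAvgPool2d(1), nn.Conv2d(ch, style_dim, 1)]
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)  # style coding feature of shape (B, style_dim, 1, 1)
```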
Step S102: and respectively inputting the style and content coding characteristic images into a multilayer perceptron module and a residual convolution module in a decoder network to respectively obtain parameters of a self-adaptive example normalization module and an intermediate process characteristic image.
In this step, the style coding feature image and the content coding feature image are respectively input to a multilayer perceptron module 221 and a residual convolution module 222 in the decoder network 202, and perceptron operation and residual convolution operation are respectively performed to respectively obtain parameters of an adaptive instance normalization module and an intermediate process feature image;
specifically, the style coding feature image and the content coding feature image are respectively input to the multi-layer perceptron module 221 and the residual convolution module 222 in the decoder network 202;
the multi-layer perceptron module 221 performs a first preset perceptron operation on the input style coding feature image to obtain parameters of the adaptive instance normalization module;
the residual convolution module 222 includes a plurality of layers of residual convolution layers, and the residual convolution module 222 performs a first preset residual convolution operation on the input content coding feature image to obtain an intermediate process feature image.
Step S103: sharing the obtained parameters to an adaptive instance normalization module of the decoder network.
In this step, the parameters of the adaptive instance normalization module obtained by the multi-layer perceptron module 221 are shared with the adaptive instance normalization module 223 in the decoder network 202.
Step S104: and inputting the intermediate process characteristic image into the self-adaptive example normalization module for example normalization.
In this step, the intermediate process feature image obtained by the residual convolution module 222 is input to the adaptive instance normalization module 223 for instance normalization.
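As a sketch of this operation (assuming the usual formulation of adaptive instance normalization, which the text does not spell out), each channel of the intermediate process feature image is normalized by its own per-instance mean and standard deviation and then rescaled with the scale and bias parameters produced by the multi-layer perceptron:

```python
import torch

def adaptive_instance_norm(x, gamma, beta, eps=1e-5):
    """x: (B, C, H, W) intermediate process feature image.
    gamma, beta: (B, C) parameters produced by the multi-layer perceptron (sketch)."""
    mean = x.mean(dim=(2, 3), keepdim=True)
    std = x.var(dim=(2, 3), keepdim=True).add(eps).sqrt()
    x_norm = (x - mean) / std  # per-instance, per-channel normalization
    return gamma.unsqueeze(-1).unsqueeze(-1) * x_norm + beta.unsqueeze(-1).unsqueeze(-1)
```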
Step S105: the example normalized feature images are input into an upsampling layer of the decoder network, resulting in a target image that is converted from a first style to a second style.
In this step, the feature image subjected to the example normalization by the adaptive example normalization module 223 is input to the upsampling layer 224 of the decoder network 202, and finally, a target image which has the same size as the input image of the first style to be subjected to the style conversion and is converted from the first style to the second style is obtained, that is, a style conversion image is generated.
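Combining the pieces, a decoder of this kind might look roughly as follows; the number of residual blocks and upsampling stages, the channel widths and the final Tanh are assumptions, and adaptive_instance_norm refers to the sketch above.

```python
import torch
import torch.nn as nn

class Decoder(nn.Module):
    """MLP -> AdaIN parameters, residual blocks -> intermediate features,
    AdaIN + upsampling -> style-converted image. Hypothetical sizes."""
    def __init__(self, content_ch=256, style_dim=8, n_res=2, n_up=2, out_ch=3):
        super().__init__()
        self.res_blocks = nn.ModuleList(
            [nn.Sequential(nn.Conv2d(content_ch, content_ch, 3, 1, 1), nn.ReLU(inplace=True),
                           nn.Conv2d(content_ch, content_ch, 3, 1, 1)) for _ in range(n_res)])
        # the multi-layer perceptron maps the style code to one (gamma, beta) pair per channel
        self.mlp = nn.Sequential(nn.Linear(style_dim, 256), nn.ReLU(inplace=True),
                                 nn.Linear(256, 2 * content_ch))
        up, ch = [], content_ch
        for _ in range(n_up):
            up += [nn.Upsample(scale_factor=2), nn.Conv2d(ch, ch // 2, 5, 1, 2), nn.ReLU(inplace=True)]
            ch //= 2
        up += [nn.Conv2d(ch, out_ch, 7, 1, 3), nn.Tanh()]
        self.upsample = nn.Sequential(*up)

    def forward(self, content_code, style_code):
        h = content_code
        for blk in self.res_blocks:
            h = h + blk(h)  # residual convolution -> intermediate process feature image
        gamma, beta = self.mlp(style_code.flatten(1)).chunk(2, dim=1)  # shared AdaIN parameters
        h = adaptive_instance_norm(h, gamma, beta)  # instance normalization with the shared parameters
        return self.upsample(h)  # target image in the reference style
```

With these illustrative modules, converting a first-style image x_a toward the style of a second-style reference x_b is roughly decoder(content_encoder(x_a), style_encoder(x_b)).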
In this way, the encoder network 201 and the decoder network 202 are connected in a decoupled manner to perform fusion in the feature space, achieving high-quality style migration while preserving content.
The image style conversion model is obtained by pre-training on image training samples comprising a plurality of images of the first and second styles and example images cropped from the image training samples. Because coarse-grained images of the first and second styles and fine-grained example images are used when training the image style conversion model, cross-granularity learning is introduced into the style conversion model, which enhances the style conversion quality of fine-grained examples (mainly vehicles, pedestrians and traffic signs) while ensuring the style conversion quality of the coarse-grained global image; this improves the overall conversion effect and alleviates the blurring and distortion of local example images after style conversion.
On the other hand, the image style conversion model belongs to an unsupervised learning model, and does not need to refer to training data to be converted in the same scene in pairs, so that the generalization capability of the model is strong, the difficulty in data acquisition is greatly reduced, and the style conversion adaptability to different monitoring scenes is improved.
The method for training the image style conversion model provided by the embodiment of the invention has the flow shown in fig. 3, and comprises the following steps:
step S300: and acquiring an image training sample.
In this step, a plurality of images of the first style and images of the second style from any scene in the monitoring data can be acquired in real monitoring scenes and used as image training samples. To ensure the style conversion effect of the trained image style conversion model, a large number of images from different monitoring scenes can be selected as image training samples; the first-style images and the second-style images in the acquired image training samples may be unpaired. In addition, in order to learn a high-quality style conversion effect at fine granularity, the target detection boxes annotated in the data can be used to extract fine-grained examples from the images: example images are cropped from the image training samples, yielding unpaired example training samples for cross-granularity learning in the subsequent steps.
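As an illustration of the cropping step only (the box format, file name and coordinates below are hypothetical), fine-grained example images can be cut from a training image using its annotated detection boxes:

```python
from PIL import Image

def crop_instances(image_path, boxes):
    """boxes: list of (x1, y1, x2, y2) detection boxes in pixel coordinates (assumed format).
    Returns the fine-grained example images cropped from the coarse-grained training image."""
    img = Image.open(image_path).convert("RGB")
    return [img.crop(box) for box in boxes]

# e.g. an unpaired night-time image with annotated vehicles/pedestrians (hypothetical path and boxes)
examples = crop_instances("night_0001.jpg", [(120, 80, 260, 210), (300, 150, 380, 320)])
```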
Step S301: and constructing an image generation model.
As shown in fig. 4, the constructed image generation model includes: global style migration model 401 and local style migration model 402; the global style migration model 401 includes: a global encoder network 411 and a global decoder network 412; the local style migration model 402 includes: a local encoder network 421 and a local decoder network 422.
The global encoder network 411 and the local encoder network 421 have the same structure, and the global decoder network 412 and the local decoder network 422 have the same structure. The structure of the global encoder network 411 may be the same as the structure of the encoder network 201 in the image style conversion model, and the structure of the global decoder network 412 may be the same as the structure of the decoder network 202 in the image style conversion model.
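Because the two branches share one architecture, an image generation model along these lines can simply instantiate the same encoder/decoder classes twice; the sketch below reuses the illustrative ContentEncoder, StyleEncoder and Decoder defined earlier and is not a definitive implementation.

```python
import torch.nn as nn

class StyleMigrationModel(nn.Module):
    """One encoder/decoder branch; the global and the local branch use the same structure."""
    def __init__(self):
        super().__init__()
        self.content_enc = ContentEncoder()
        self.style_enc = StyleEncoder()
        self.decoder = Decoder()

    def decouple(self, x):             # split an image into content and style codes
        return self.content_enc(x), self.style_enc(x)

    def decode(self, content, style):  # fuse a content code with a style code
        return self.decoder(content, style)

global_model = StyleMigrationModel()   # global style migration model (coarse granularity)
local_model = StyleMigrationModel()    # local style migration model (fine granularity), separate weights
```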
Step S302: and carrying out iterative training on the image generation model for multiple times, and obtaining a trained image style conversion model after the iterative training times reach a first preset time.
Specifically, the image generation model is subjected to multiple iterative training, and after the iterative training times reach a first preset time, a trained image style conversion model is constructed according to the global encoder network 411 and the global decoder network 412, that is, the global encoder network 411 and the global decoder network 412 serve as the encoder network 201 and the decoder network 202 in the trained image style conversion model.
The specific flow of the one-time iterative training process in this step is shown in fig. 5, and includes the following sub-steps:
substep S501: inputting the first and second styles of images in the image training sample into the global style migration model to obtain reconstructed first and second styles of images;
in this sub-step, the images of the first and second styles in the image training sample are input into the global style migration model, and the reconstructed images of the first and second styles are obtained through the twice decoupling content features and style features of the global encoder network and the global decoder network, and the decoding content features and style features;
specifically, as shown in fig. 6a, the image of the first style in the image training sample is input to the global encoder network 411, and the content encoding feature image and the style encoding feature image of the first style are decoupled by performing a multi-layer convolution operation through the content encoder and the style encoder in the global encoder network 411;
as shown in fig. 6b, the images of the second style in the image training sample are input to the global encoder network 411, and the content encoding feature images and the style encoding feature images of the second style are decoupled by performing a multi-layer convolution operation through the content encoder and the style encoder in the global encoder network 411;
as shown in fig. 6c, the style-coded feature image of the first-style image and the content-coded feature image of the second-style image are respectively input to the multi-layer perceptron module and the residual convolution module in the global decoder network 412; the multi-layer perceptron module in the global decoder network 412 operates on the style-coded feature images of the input first style images, and the output parameters are shared as the parameters of the adaptive instance normalization module of the global decoder network 412; the residual convolution module in the global decoder network 412 performs residual convolution operation on the content coding feature image of the input second-style image, and outputs the intermediate process feature image to the adaptive instance normalization module in the global decoder network 412, so as to obtain an instance normalized feature image fusing the content features of the second-style image and the style features of the first-style image, and the obtained instance normalized feature image is input to the upsampling layer of the global decoder network 412, so as to obtain a first generated image fusing the style of the first-style image and the content of the second-style image.
Similarly, as shown in fig. 6d, the style-coded feature image of the second-style image and the content-coded feature image of the first-style image are input to the multi-layer perceptron module and the residual convolution module in the global decoder network 412, respectively, to obtain a second generated image in which the style of the second-style image and the content of the first-style image are fused.
As shown in fig. 6e, the first generated image is input into the global encoder network 411, and the content encoding feature image and the style encoding feature image of the first generated image are decoupled by performing a multi-layer convolution operation on the content encoder and the style encoder of the global encoder network 411;
as shown in fig. 6f, the second generated image is input to the global encoder network 411, and the content encoding feature image and the style encoding feature image of the second generated image are decoupled by performing a multi-layer convolution operation through the content encoder and the style encoder of the global encoder network 411;
as shown in fig. 6g, the style encoding characteristic image of the first generated image and the content encoding characteristic image of the second generated image are respectively input to the multi-layer perceptron module and the residual convolution module in the global decoder network 412, and finally the reconstructed image of the first style is obtained.
As shown in fig. 6h, the style encoding feature image of the second generated image and the content encoding feature image of the first generated image are respectively input to the multi-layer perceptron module and the residual convolution module in the global decoder network 412, and finally, a reconstructed image of the second style is obtained.
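The two-pass decouple/decode cycle of sub-step S501 can be summarized, purely as a sketch built on the illustrative model above, as:

```python
def cycle_reconstruct(model, img_a, img_b):
    """Two-pass decouple/decode cycle described above (sketch):
    swap styles once to obtain generated images, then swap back to reconstruct the originals."""
    c_a, s_a = model.decouple(img_a)      # first-style content / style codes
    c_b, s_b = model.decouple(img_b)      # second-style content / style codes
    gen_1 = model.decode(c_b, s_a)        # first generated image: style of A, content of B
    gen_2 = model.decode(c_a, s_b)        # second generated image: style of B, content of A
    c_g1, s_g1 = model.decouple(gen_1)    # decouple the generated images again
    c_g2, s_g2 = model.decouple(gen_2)
    rec_a = model.decode(c_g2, s_g1)      # reconstructed first-style image
    rec_b = model.decode(c_g1, s_g2)      # reconstructed second-style image
    return rec_a, rec_b
```

The same function applied to the local style migration model and the cropped example images gives the reconstruction of sub-step S502.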
Substep S502: inputting example images cut from the first and second styles of images into the local style migration model to obtain reconstructed first and second styles of example images;
in the sub-step, the example images cut from the images of the first style and the second style are input into the local style migration model, and the reconstructed example images are obtained through twice decoupling content characteristics and style characteristics, decoding content characteristics and style characteristics of the local encoder network and the local decoder network; for convenience of description, an example image cut out from an image of a first style in an image training sample is referred to herein as an example image of the first style, and an example image cut out from an image of a second style in the image training sample is referred to herein as an example image of the second style;
specifically, as shown in fig. 7a, the example image of the first style is input into the local encoder network 421, and the content encoding feature image and the style encoding feature image of the example image of the first style are decoupled by performing a multi-layer convolution operation through the content encoder and the style encoder in the local encoder network 421;
as shown in fig. 7b, the example image of the second style is input into the local encoder network 421, and the content encoding feature image and the style encoding feature image of the example image of the second style are decoupled by performing a multi-layer convolution operation through the content encoder and the style encoder in the local encoder network 421;
as shown in fig. 7c, the style-coded feature image of the example image of the first style and the content-coded feature image of the example image of the second style are respectively input to the multi-layer perceptron module and the residual convolution module in the local decoder network 422; the multi-layer perceptron module in the local decoder network 422 operates on the style coding feature image of the input first-style example image, and the output parameters are shared as the parameters of the adaptive instance normalization module of the local decoder network 422; the residual convolution module in the local decoder network 422 performs residual convolution operation on the content coding feature image of the input second-style example image, and outputs the intermediate process feature image to the adaptive example normalization module in the local decoder network 422, so as to obtain an example normalized feature image fusing the content feature of the second-style example image and the style feature of the first-style example image, and the obtained example normalized feature image is input into the upsampling layer of the local decoder network 422, so as to obtain a first generated example image fusing the style of the first-style example image and the content of the second-style example image.
Similarly, as shown in fig. 7d, the style-coded feature image of the second style example image and the content-coded feature image of the first style example image are input to the multi-layer perceptron module and the residual convolution module in the local decoder network 422, respectively, to obtain a second generated example image in which the style of the second style example image and the content of the first style example image are fused.
As shown in fig. 7e, the first generation example image is input into the local encoder network 421, and the content encoding characteristic image and the style encoding characteristic image of the first generation example image are decoupled by performing a multi-layer convolution operation through the content encoder and the style encoder of the local encoder network 421;
as shown in fig. 7f, the second generation example image is input into the local encoder network 421, and the content encoding feature image and the style encoding feature image of the second generation example image are decoupled by performing a multi-layer convolution operation through the content encoder and the style encoder of the local encoder network 421;
as shown in fig. 7g, the style encoding characteristic image of the first generated example image and the content encoding characteristic image of the second generated example image are respectively input to the multi-layer perceptron module and the residual convolution module in the local decoder network 422, and finally, the reconstructed example image of the first style is obtained.
As shown in fig. 7h, the style encoding characteristic image of the second generated example image and the content encoding characteristic image of the first generated example image are respectively input to the multi-layer perceptron module and the residual convolution module in the local decoder network 422, and finally, a reconstructed example image of the second style is obtained.
Substep S503: inputting the style coding characteristic image of the first/second style image and the content coding characteristic image of the example image of the second/first style into a global decoder network to obtain the generated example content image of the first/second style;
specifically, as shown in fig. 8a, the style-encoding feature image of the second style image and the content-encoding feature image of the first style example image are input to the multi-layer perceptron module and the residual convolution module in the global decoder network 412, respectively, to obtain an image in which the style of the second style image and the content of the first style example image are fused, that is, the generated second style example content image;
as shown in fig. 8b, the style-coded feature image of the first-style image and the content-coded feature image of the second-style example image are input to the multi-layer perceptron module and the residual convolution module in the global decoder network 412, respectively, to obtain an image in which the style of the first-style image and the content of the second-style example image are fused, that is, the generated first-style example content image.
Substep S504: inputting the generated example content image of the second style into a content encoder and a style encoder in a local encoder network to perform multilayer convolution operation so as to decouple the style and the content encoding characteristic image of the example content image;
specifically, as shown in fig. 8c, the generated example content image of the second style is input into the content encoder and the style encoder in the local encoder network 421, which perform a multi-layer convolution operation to decouple the style coding feature image and the content coding feature image of the example content image of the second style.
Substep S505: inputting the style coding characteristic image of the decoupled example content image of the second style and the content coding characteristic image of the second style into a global decoder network to obtain a reconstructed image of a cross-granularity second style;
specifically, as shown in fig. 8d, the style-encoding characteristic image of the example content image in the second style and the content-encoding characteristic image of the image in the second style that are decoupled are respectively input to the multi-layer perceptron module and the residual convolution module in the global decoder network 412, so as to obtain the reconstructed image in the second style across the granularity.
Substep S506: inputting the decoupled style coding characteristic image of the example image of the first style and the content coding characteristic image of the example content image of the second style into a local decoder network to obtain a reconstructed cross-granularity example first style image;
specifically, as shown in fig. 8e, the style encoding characteristic image of the decoupled example image of the first style and the content encoding characteristic image of the example content image of the second style are respectively input to the multi-layer perceptron module and the residual convolution module in the local decoder network 422, so as to obtain the reconstructed cross-granularity example first style image.
Substep S507: adjusting parameters of the global encoder and decoder network according to a distance between the reconstructed first/second-style images and corresponding first/second-style images in the image training sample;
specifically, parameters of the global encoder and decoder network are adjusted according to the distance between the reconstructed images of the first style and the corresponding images of the first style in the image training sample;
and adjusting parameters of the global encoder and decoder network according to the distance between the reconstructed images of the second style and the corresponding images of the second style in the image training sample.
Substep S508: adjusting parameters of the local encoder and decoder network according to a distance between a reconstructed example image and an example image cropped from the corresponding first/second-style image in the image training sample;
specifically, parameters of the local encoder and decoder network are adjusted according to the distance between the reconstructed example image of the first style and the example image cut out from the corresponding image of the first style in the image training sample;
and adjusting parameters of the local encoder and decoder network according to the distance between the reconstructed example image of the second style and the example image cut out from the corresponding image of the second style in the image training sample.
Substep S509: jointly adjusting the local encoder and decoder network and the global encoder and decoder network according to a distance between a reconstructed cross-granularity second-style image and the corresponding second-style image in the image training sample, and a distance between a reconstructed cross-granularity example first-style image and the corresponding first-style example image.
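One way to read the parameter adjustments in substeps S507–S509 is as L1 reconstruction distances that drive optimizer steps on the two networks; the sketch below is an assumed PyTorch training-loop fragment in which all tensor and optimizer arguments are placeholders, and, for brevity, the three distance terms are summed into a single backward pass rather than applied as the three separate adjustments the substeps describe:

```python
import torch.nn.functional as F

def adjust_parameters(global_recons, global_targets, local_recons, local_targets,
                      cross_recons, cross_targets, opt_global, opt_local):
    """Illustrative parameter adjustment for substeps S507-S509.

    global_recons/global_targets: reconstructed first-/second-style images and their originals;
    local_recons/local_targets:   reconstructed example images and the cropped example images;
    cross_recons/cross_targets:   cross-granularity reconstructions and their reference images.
    """
    # S507: distance between the reconstructed global images and the originals
    loss_global = sum(F.l1_loss(r, t) for r, t in zip(global_recons, global_targets))
    # S508: distance between the reconstructed example images and the cropped example images
    loss_local = sum(F.l1_loss(r, t) for r, t in zip(local_recons, local_targets))
    # S509: cross-granularity distances jointly adjust the global and local networks
    loss_cross = sum(F.l1_loss(r, t) for r, t in zip(cross_recons, cross_targets))

    opt_global.zero_grad()
    opt_local.zero_grad()
    (loss_global + loss_local + loss_cross).backward()
    opt_global.step()
    opt_local.step()
    return loss_global.item(), loss_local.item(), loss_cross.item()
```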
When the number of iterations reaches the first preset number, the parameters of the encoder network and the decoder network have been adjusted the first preset number of times; at this point the image generation model has good feature extraction and feature restoration capability, so adjustment of the parameters of the initial image generation model can be stopped, and the final image generation model is obtained. The first preset number may be, for example, 10,000, 20,000, or 50,000, and is not particularly limited.
As a further preferred embodiment, training can be continued in the following step S303 by adopting adversarial learning based on discriminators:
Step S303: performing multiple iterations of adversarial-learning training on the image generation model based on the discriminators, and, when the number of adversarial-learning training iterations reaches a second preset number, taking the global style migration model as the image style conversion model obtained by final training.
As shown in fig. 8f, the adversarial training model formed on the basis of the image generation model and the discriminators includes: the image generation model, a first discriminator and a second discriminator; wherein the input of the first discriminator is connected to the output of the global decoder network in the image generation model, and the input of the second discriminator is connected to the output of the local decoder network in the image generation model;
in the iterative training process of adversarial learning using the adversarial training model, the following may be included:
inputting the first- and second-style images in the image training sample into the global style migration model of the adversarial training model, and outputting the reconstructed first- and second-style images from the global decoder network in the adversarial training model to the first discriminator; the first discriminator discriminates whether the input image is real or generated; adjusting parameters of the first discriminator according to its discrimination result, thereby enhancing its discrimination capability; and adjusting parameters of the global encoder and decoder network according to a distance between the reconstructed first/second-style images and the corresponding first/second-style images in the image training sample;
inputting the example images of the first and second styles into the local style migration model of the adversarial training model, and outputting the reconstructed example images of the first and second styles from the local decoder network in the adversarial training model to the second discriminator; the second discriminator discriminates whether the input image is real or generated; adjusting parameters of the second discriminator according to its discrimination result, thereby enhancing its discrimination capability; and adjusting parameters of the local encoder and decoder network according to a distance between the reconstructed first- and second-style example images and the example images cropped from the corresponding first/second-style images in the image training sample;
in addition, the cross-granularity image of the second style reconstructed by the global decoder network of the adversarial training model can be output to the first discriminator; the first discriminator discriminates whether the input image is real or generated; adjusting parameters of the first discriminator according to its discrimination result, thereby enhancing its discrimination capability; and adjusting parameters of the global encoder and decoder network according to a distance between the reconstructed cross-granularity second-style images and the corresponding second-style images in the image training samples; the method for generating the reconstructed cross-granularity images of the second style may be the same as that described in the foregoing substeps S503, S504 and S505, and is not described here again;
in addition, the cross-granularity example first-style image reconstructed by the global decoder network of the adversarial training model can be output to the first discriminator; the first discriminator discriminates whether the input image is real or generated; adjusting parameters of the first discriminator according to its discrimination result, thereby enhancing its discrimination capability; and adjusting parameters of the global encoder and decoder network according to a distance between the reconstructed cross-granularity example first-style images and the corresponding example images of the first style; the method for generating the reconstructed cross-granularity example first-style image may be the same as that described in the foregoing substeps S503, S504 and S506, and is not described here again;
the distance between images reflects the difference between them and is used for adjusting the parameters of the image generation model; the difference may be any quantity that indicates the discrepancy between a generated image and a real image, such as the pixel-wise difference between the two images, and the specific way of determining it is not limited.
When the number of adversarial-learning training iterations reaches the second preset number, the parameters of the image generation model and of the first and second discriminators have been adjusted a sufficient number of times; at this point the image generation model can generally generate style-converted images with high fidelity, and the first and second discriminators can generally no longer distinguish real images from generated images, so the parameter adjustment of the image generation model and of the first and second discriminators can be stopped, yielding the final image generation model and the final first and second discriminators. The second preset number may be, for example, 10,000, 20,000, or 50,000, and is not specifically limited here. The global encoder network 411 and the global decoder network 412 in the image generation model serve as the encoder network 201 and the decoder network 202 in the finally trained image style conversion model.
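A single adversarial-learning iteration for one discriminator can be sketched as follows; this is an assumed PyTorch implementation in which the `Discriminator` architecture and the binary cross-entropy adversarial loss are illustrative choices — the patent does not fix a particular discriminator structure or loss form:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Discriminator(nn.Module):
    """Illustrative patch-style discriminator judging whether an image is real or generated."""
    def __init__(self, img_ch=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(img_ch, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(128, 1, 4, stride=1, padding=1))   # patch map of real/fake scores

    def forward(self, x):
        return self.net(x)

def adversarial_step(disc, opt_disc, opt_gen, real_img, fake_img, target_img):
    """One illustrative adversarial iteration: sharpen the discriminator, then adjust the generator side."""
    # 1) adjust the discriminator so it better separates real images from generated images
    opt_disc.zero_grad()
    real_score, fake_score = disc(real_img), disc(fake_img.detach())
    d_loss = (F.binary_cross_entropy_with_logits(real_score, torch.ones_like(real_score)) +
              F.binary_cross_entropy_with_logits(fake_score, torch.zeros_like(fake_score)))
    d_loss.backward()
    opt_disc.step()
    # 2) adjust the encoder/decoder networks: fool the discriminator and stay close to the reference image
    opt_gen.zero_grad()
    gen_score = disc(fake_img)
    g_loss = (F.binary_cross_entropy_with_logits(gen_score, torch.ones_like(gen_score)) +
              F.l1_loss(fake_img, target_img))
    g_loss.backward()
    opt_gen.step()
    return d_loss.item(), g_loss.item()
```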
The image style conversion device provided by the embodiment of the invention comprises the image style conversion model trained by the above method, and is used for converting an input image of the first style to be style-converted into a target image of the second style according to an input image of the second style serving as the reference image.
Based on the above training method for the image style conversion model, a training apparatus for the image style conversion model provided by the embodiment of the present invention has a structure as shown in fig. 9, and includes: an image generation model building module 901 and an image generation model training module 902.
The image generation model constructing module 901 is used for constructing an image generation model; wherein the image generation model comprises: a global style migration model and a local style migration model; the global style migration model comprises a global encoder network and a global decoder network, and the local style migration model comprises a local encoder network and a local decoder network;
the image generation model training module 902 is configured to perform iterative training on the image generation model multiple times and, after the number of iterations reaches a first preset number, use the global style migration model as the image style conversion model obtained by training; wherein one iterative training process comprises the following steps: inputting the images of the first and second styles in the image training sample into the global style migration model, and obtaining reconstructed images of the first and second styles through twice decoupling content features and style features and decoding content features and style features via the global encoder network and decoder network; inputting example images cut from the images of the first and second styles into the local style migration model, and obtaining reconstructed example images through twice decoupling content features and style features and decoding content features and style features via the local encoder network and local decoder network; inputting the style-encoding characteristic image of the first/second-style image and the content-encoding characteristic image of the example image of the second/first style into the global decoder to obtain the generated example content image of the first/second style; inputting the generated example content images of the first and second styles into the content encoder and the style encoder in the local encoder network to perform a multilayer convolution operation so as to decouple the style- and content-encoding characteristic images of the example content images; inputting the decoupled style-encoding characteristic image of the example content image of the second style and the content-encoding characteristic image of the second-style image into the global decoder network to obtain a reconstructed cross-granularity image of the second style; inputting the decoupled style-encoding characteristic image of the first-style image and the content-encoding characteristic image of the example content image of the first style into the global decoder network to obtain a reconstructed cross-granularity example first-style image; adjusting parameters of the global encoder and decoder network according to a distance between the reconstructed first/second-style images and the corresponding first/second-style images in the image training sample; adjusting parameters of the local encoder and decoder network according to a distance between a reconstructed example image and an example image cropped from the corresponding first/second-style image in the image training sample; and jointly adjusting the local encoder and decoder network and the global encoder and decoder network according to a distance between a reconstructed cross-granularity second-style image and the corresponding second-style image in the image training sample, and a distance between a reconstructed cross-granularity example first-style image and the corresponding first-style example image.
Further, the training apparatus for an image style conversion model provided in the embodiment of the present invention may further include: an adversarial training module 903.
The adversarial training module 903 is configured to perform multiple iterations of adversarial-learning training on the image generation model based on the discriminators, and, when the number of adversarial-learning training iterations reaches a second preset number, take the global style migration model in the image generation model as the final image style conversion model obtained through training. For the specific method by which the adversarial training module 903 performs the iterative adversarial-learning training of the image generation model based on the discriminators, reference may be made to the method in step S303, which is not repeated here.
The specific implementation method for the functions of each module in the training device for the image style conversion model provided by the embodiment of the present invention may refer to the method in each step shown in fig. 3, and is not described herein again.
In the technical scheme of the invention, an image of a first style to be subjected to style conversion and an image of a second style serving as a reference image are respectively input into a content encoder and a style encoder in an encoder network, and a content-encoding characteristic image and a style-encoding characteristic image are respectively extracted; the style- and content-encoding feature images are respectively input into a multilayer perceptron module and a residual convolution module in a decoder network, and a perceptron operation and a residual convolution operation are respectively carried out to obtain the parameters of an adaptive instance normalization module and an intermediate-process feature image; the obtained parameters are shared with the adaptive instance normalization module of the decoder network; the intermediate-process feature image is input into the adaptive instance normalization module for instance normalization, and the instance-normalized feature image is input into an upsampling layer of the decoder network to obtain a target image converted from the first style to the second style; the image style conversion model formed by the encoder and decoder networks is obtained by pre-training with an image training sample comprising a plurality of first- and second-style images and with example images cut from the image training sample. Compared with the prior art, the technical scheme of the invention uses both the coarse-grained first- and second-style images and the fine-grained example images when training the image style conversion model, thereby introducing cross-granularity learning into the style conversion model; this enhances the style conversion quality of fine-grained examples while preserving the style conversion quality of the coarse-grained global image, and alleviates the blurring and distortion of local example images after style conversion.
On the other hand, the image style conversion model is an unsupervised learning model that does not require paired training data of the same scene in the styles to be converted; the model therefore generalizes well, the difficulty of data acquisition is greatly reduced, and the adaptability of style conversion to different monitoring scenes is improved.
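At inference time, the trained encoder and decoder networks are therefore used as a simple feed-forward pipeline. A minimal sketch of that usage follows, assuming the illustrative `ContentEncoder`, `StyleEncoder` and `AdaInDecoder` modules from the earlier sketches; the real model would instead use the trained global encoder network 411 and global decoder network 412:

```python
import torch
import torch.nn as nn

def convert_style(content_encoder: nn.Module, style_encoder: nn.Module, decoder: nn.Module,
                  first_style_img: torch.Tensor, second_style_img: torch.Tensor) -> torch.Tensor:
    """Convert a first-style image into the style of the second-style reference image (illustrative)."""
    with torch.no_grad():
        content_code = content_encoder(first_style_img)    # content-encoding feature image of the input
        style_code = style_encoder(second_style_img)       # style-encoding feature image of the reference
        return decoder(content_code, style_code)           # fused target image in the second style
```

With the illustrative untrained modules above, `convert_style(ContentEncoder(), StyleEncoder(), AdaInDecoder(), x_a, x_b)` returns an image-shaped tensor of the same spatial size as `x_a`; meaningful style conversion of course requires the trained model parameters.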
Computer-readable media of the present embodiments include both permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device.
Those of ordinary skill in the art will understand that the discussion of any embodiment above is meant to be exemplary only, and is not intended to imply that the scope of the disclosure, including the claims, is limited to these examples; within the idea of the invention, features in the above embodiments or in different embodiments may also be combined, steps may be implemented in any order, and there are many other variations of the different aspects of the invention as described above, which are not provided in detail for the sake of brevity.
In addition, well known power/ground connections to Integrated Circuit (IC) chips and other components may or may not be shown within the provided figures for simplicity of illustration and discussion, and so as not to obscure the invention. Furthermore, devices may be shown in block diagram form in order to avoid obscuring the invention, and also in view of the fact that specifics with respect to implementation of such block diagram devices are highly dependent upon the platform within which the present invention is to be implemented (i.e., specifics should be well within purview of one skilled in the art). Where specific details (e.g., circuits) are set forth in order to describe example embodiments of the invention, it should be apparent to one skilled in the art that the invention can be practiced without, or with variation of, these specific details. Accordingly, the description is to be regarded as illustrative instead of restrictive.
While the present invention has been described in conjunction with specific embodiments thereof, many alternatives, modifications, and variations of these embodiments will be apparent to those of ordinary skill in the art in light of the foregoing description. For example, other memory architectures (e.g., dynamic RAM (DRAM)) may use the discussed embodiments.
The embodiments of the invention are intended to embrace all such alternatives, modifications and variations as fall within the broad scope of the appended claims. Therefore, any omissions, modifications, substitutions, improvements and the like made without departing from the spirit and principles of the invention are intended to be included within the scope of the invention.
Claims (10)
1. A method of image style conversion, comprising:
respectively inputting an image of a first style to be subjected to style conversion and an image of a second style serving as a reference image into a content encoder and a style encoder in an encoder network, and respectively extracting a content encoding characteristic image and a style encoding characteristic image;
respectively inputting the style and content coding characteristic images into a multilayer perceptron module and a residual convolution module in a decoder network, and respectively carrying out perceptron operation and residual convolution operation to respectively obtain parameters of an adaptive instance normalization module and an intermediate process characteristic image; sharing the obtained parameters to an adaptive instance normalization module of the decoder network;
inputting the intermediate process characteristic image into the self-adaptive example normalization module for example normalization, and inputting the example normalized characteristic image into an upsampling layer of the decoder network to obtain a target image converted from a first style to a second style;
the image style conversion model formed by the coder and the decoder network is obtained by pre-training an image training sample and an example image cut from the image training sample:
constructing an image generation model; wherein the image generation model comprises: a global style migration model and a local style migration model; the global style migration model comprises a global encoder network and a global decoder network, and the local style migration model comprises a local encoder network and a local decoder network;
performing iterative training on the image generation model multiple times, and taking the global style migration model as the image style conversion model obtained by training after the number of iterations reaches a first preset number;
wherein one iterative training process comprises the following steps:
inputting the images of the first and second styles in the image training sample into the global style migration model, and obtaining reconstructed images of the first and second styles through twice decoupling content features and style features, decoding content features and style features of the global encoder network and decoder network;
inputting example images cut from the images of the first and second styles into the local style migration model, and obtaining reconstructed example images through twice decoupling content features and style features, decoding content features and style features of the local encoder network and the local decoder network;
inputting the style encoding characteristic image of the first/second style image and the content encoding characteristic image of the example image of the second/first style into a global decoder to obtain the generated example content image of the first/second style;
inputting the generated example content images of the first style and the second style into a content encoder and a style encoder in a local encoder to perform multilayer convolution operation to decouple the style and the content encoding characteristic images of the example content images;
inputting the style coding characteristic image of the decoupled example content image of the second style and the content coding characteristic image of the second style into a global decoder network to obtain a reconstructed image of a cross-granularity second style;
inputting the decoupled style coding characteristic image of the example image of the first style and the content coding characteristic image of the example content image of the second style into a global decoder network to obtain a reconstructed cross-granularity example first style image;
adjusting parameters of the global encoder and decoder network according to a distance between the reconstructed first/second-style images and corresponding first/second-style images in the image training sample;
parameters of the local encoder and decoder network are adjusted based on a distance between the reconstructed example image and an example image cropped from a corresponding first/second style image in the image training sample.
2. The method of claim 1, wherein the method for training the image style conversion model further comprises:
jointly adjusting the local encoder and decoder network, the global encoder and decoder network according to a distance between a reconstructed cross-granular second-style image and a corresponding second-style image in the image training sample, and a distance between a reconstructed cross-granular instance first-style image and a corresponding first-style instance image.
3. The method of claim 2, further comprising, after the iterative training number reaches a first preset number:
and performing iterative training of adversarial learning on the image generation model multiple times based on the discriminator, and when the number of iterations of the adversarial-learning training reaches a second preset number, taking the global style migration model in the image generation model as the final image style conversion model obtained by training.
4. The method according to claim 2 or 3, wherein the inputting the first and second styles of images in the image training samples into the global style migration model, and obtaining the reconstructed first and second styles of images by twice decoupling the content features and the style features and decoding the content features and the style features through the global encoder network and decoder network, specifically comprises:
inputting the images of the first style in the image training samples into the global encoder network, and decoupling the content coding characteristic images and the style coding characteristic images of the first style by carrying out multilayer convolution operation through a content encoder and a style encoder in the global encoder network;
inputting the images of the second style in the image training samples into the global encoder network, and decoupling the content coding characteristic images and the style coding characteristic images of the second style by carrying out multilayer convolution operation through a content encoder and a style encoder in the global encoder network;
respectively inputting the style coding characteristic image of the first style image and the content coding characteristic image of the second style image into a multilayer perceptron module and a residual convolution module in the global decoder network, so that the global decoder network obtains a first generated image fusing the style of the first style image and the content of the second style image;
respectively inputting the style coding characteristic image of the second style image and the content coding characteristic image of the first style image into a multilayer perceptron module and a residual convolution module in the global decoder network to obtain a second generated image fusing the style of the second style image and the content of the first style image;
inputting the first generated image into the global encoder network, and decoupling the content coding characteristic image and the style coding characteristic image of the first generated image by performing multilayer convolution operation on a content encoder and a style encoder of the global encoder network;
inputting a second generated image into the global encoder network, and decoupling a content coding characteristic image and a style coding characteristic image of the second generated image by performing multilayer convolution operation through a content encoder and a style encoder of the global encoder network;
respectively inputting the style coding characteristic image of the first generated image and the content coding characteristic image of the second generated image into a multilayer perceptron module and a residual convolution module in the global decoder network to finally obtain a reconstructed first style image;
and respectively inputting the style coding characteristic image of the second generated image and the content coding characteristic image of the first generated image into a multilayer perceptron module and a residual convolution module in the global decoder network to finally obtain a reconstructed second style image.
5. The method according to claim 2 or 3, wherein the inputting the example image cut out from the first and second styles of images into the local style migration model, and the twice decoupling of the content feature and the style feature, the decoding content feature and the style feature via the local encoder network and the decoder network to obtain the reconstructed example image specifically comprises:
inputting the example image of the first style into the local encoder network, and decoupling the content coding characteristic image and the style coding characteristic image of the example image of the first style by carrying out multilayer convolution operation through a content encoder and a style encoder in the local encoder network;
inputting the example image of the second style into the local encoder network, and decoupling the content coding characteristic image and the style coding characteristic image of the example image of the second style by carrying out multilayer convolution operation through a content encoder and a style encoder in the local encoder network;
respectively inputting the style coding characteristic image of the example image of the first style and the content coding characteristic image of the example image of the second style into a multilayer perceptron module and a residual convolution module in the local decoder network to obtain a first generated example image fusing the style of the example image of the first style and the content of the example image of the second style;
respectively inputting the style coding characteristic image of the example image of the second style and the content coding characteristic image of the example image of the first style into a multilayer perceptron module and a residual convolution module in the local decoder network to obtain a second generation example image fusing the style of the example image of the second style and the content of the example image of the first style;
inputting the first generation example image into the local encoder network, and decoupling the content coding characteristic image and the style coding characteristic image of the first generation example image by carrying out multilayer convolution operation through a content encoder and a style encoder of the local encoder network;
inputting a second generation example image into the local encoder network, and decoupling a content coding characteristic image and a style coding characteristic image of the second generation example image by performing multilayer convolution operation on a content encoder and a style encoder of the local encoder network;
respectively inputting the style coding characteristic image of the first generated example image and the content coding characteristic image of the second generated example image into a multilayer perceptron module and a residual convolution module in the local decoder network to finally obtain a reconstructed example image of a first style;
respectively inputting the style coding characteristic image of the second generated example image and the content coding characteristic image of the first generated example image into a multilayer perceptron module and a residual convolution module in the local decoder network to finally obtain a reconstructed example image of a second style;
wherein the example image of the first/second style is an example image cut out from the image of the first/second style.
6. An apparatus for image style conversion, comprising: the image style conversion model trained by the method according to any one of claims 1-5, for converting the input image of the first style to be style-converted into the target image of the second style according to the input image of the second style as the reference image.
7. A training method of an image style conversion model is characterized by comprising the following steps:
constructing an image generation model; wherein the image generation model comprises: a global style migration model and a local style migration model; the global style migration model comprises a global encoder network and a global decoder network, and the local style migration model comprises a local encoder network and a local decoder network;
performing iterative training on the image generation model multiple times, and taking the global style migration model as the image style conversion model obtained by training after the number of iterations reaches a first preset number;
wherein one iterative training process comprises the following steps:
inputting the images of the first and second styles in the image training sample into the global style migration model, and obtaining reconstructed images of the first and second styles through twice decoupling content features and style features, decoding content features and style features of the global encoder network and decoder network;
inputting example images cut from the images of the first and second styles into the local style migration model, and obtaining reconstructed example images through twice decoupling content features and style features, decoding content features and style features of the local encoder network and the local decoder network;
inputting the style encoding characteristic image of the first/second style image and the content encoding characteristic image of the example image of the second/first style into a global decoder to obtain the generated example content image of the first/second style;
inputting the generated example content images of the first style and the second style into a content encoder and a style encoder in a local encoder to perform multilayer convolution operation to decouple the style and the content encoding characteristic images of the example content images;
inputting the style coding characteristic image of the decoupled example content image of the second style and the content coding characteristic image of the second style into a global decoder network to obtain a reconstructed image of a cross-granularity second style;
inputting the decoupled style coding characteristic image of the first style image and the content coding characteristic image of the example content image of the first style into a global decoder network to obtain a reconstructed cross-granularity example first style image;
adjusting parameters of the global encoder and decoder network according to a distance between the reconstructed first/second-style images and corresponding first/second-style images in the image training sample;
adjusting parameters of the local encoder and decoder network according to a distance between a reconstructed example image and an example image cropped from a corresponding first/second style image in the image training sample;
jointly adjusting the local encoder and decoder network, the global encoder and decoder network according to a distance between a reconstructed cross-granular second-style image and a corresponding second-style image in the image training sample, and a distance between a reconstructed cross-granular instance first-style image and a corresponding first-style instance image.
8. The method of claim 7, further comprising, after the iterative training number reaches a first preset number:
and performing iterative training of adversarial learning multiple times based on the discriminator, and when the number of iterations of the adversarial-learning training reaches a second preset number, taking the global style migration model as the image style conversion model obtained by final training.
9. An apparatus for training an image style conversion model, comprising:
the image generation model building module is used for building an image generation model; wherein the image generation model comprises: a global style migration model and a local style migration model; the global style migration model comprises a global encoder network and a global decoder network, and the local style migration model comprises a local encoder network and a local decoder network;
the image generation model training module is used for carrying out iterative training on the image generation model multiple times, and after the number of iterations reaches a first preset number, the global style migration model is used as an image style conversion model obtained through training; wherein one iterative training process comprises the following steps: inputting the images of the first and second styles in the image training sample into the global style migration model, and obtaining reconstructed images of the first and second styles through twice decoupling content features and style features, decoding content features and style features of the global encoder network and decoder network; inputting example images cut from the images of the first and second styles into the local style migration model, and obtaining reconstructed example images through twice decoupling content features and style features, decoding content features and style features of the local encoder network and the local decoder network; inputting the style encoding characteristic image of the first/second style image and the content encoding characteristic image of the example image of the second/first style into a global decoder to obtain the generated example content image of the first/second style; inputting the generated example content images of the first style and the second style into a content encoder and a style encoder in a local encoder to perform multilayer convolution operation to decouple the style and the content encoding characteristic images of the example content images; inputting the style coding characteristic image of the decoupled example content image of the second style and the content coding characteristic image of the second style into a global decoder network to obtain a reconstructed image of a cross-granularity second style; inputting the decoupled style coding characteristic image of the first style image and the content coding characteristic image of the example content image of the first style into a global decoder network to obtain a reconstructed cross-granularity example first style image; adjusting parameters of the global encoder and decoder network according to a distance between the reconstructed first/second-style images and corresponding first/second-style images in the image training sample; adjusting parameters of the local encoder and decoder network according to a distance between a reconstructed example image and an example image cropped from a corresponding first/second style image in the image training sample; jointly adjusting the local encoder and decoder network, the global encoder and decoder network according to a distance between a reconstructed cross-granular second-style image and a corresponding second-style image in the image training sample, and a distance between a reconstructed cross-granular instance first-style image and a corresponding first-style instance image.
10. The apparatus of claim 9, further comprising:
and the adversarial training module is used for carrying out iterative training of adversarial learning on the image generation model multiple times based on the discriminator, and when the number of iterations of the adversarial-learning training reaches a second preset number, the global style migration model in the image generation model is used as the image style conversion model obtained by final training.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010907304.1A CN111815509B (en) | 2020-09-02 | 2020-09-02 | Image style conversion and model training method and device |
PCT/CN2021/093432 WO2022048182A1 (en) | 2020-09-02 | 2021-05-12 | Image style transfer method and apparatus, and image style transfer model training method and apparatus |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010907304.1A CN111815509B (en) | 2020-09-02 | 2020-09-02 | Image style conversion and model training method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111815509A CN111815509A (en) | 2020-10-23 |
CN111815509B true CN111815509B (en) | 2021-01-01 |
Family
ID=72860716
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010907304.1A Active CN111815509B (en) | 2020-09-02 | 2020-09-02 | Image style conversion and model training method and device |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN111815509B (en) |
WO (1) | WO2022048182A1 (en) |
Also Published As
Publication number | Publication date |
---|---|
CN111815509A (en) | 2020-10-23 |
WO2022048182A1 (en) | 2022-03-10 |
Legal Events
Date | Code | Title | Description
---|---|---|---
 | PB01 | Publication |
 | SE01 | Entry into force of request for substantive examination |
 | GR01 | Patent grant |