CN117611434B - Model training method, image style conversion method and device and electronic equipment

Info

Publication number: CN117611434B
Authority: CN (China)
Prior art keywords: image, feature, style, sample, content
Legal status: Active (as listed; not a legal conclusion)
Application number: CN202410068980.2A
Other languages: Chinese (zh)
Other versions: CN117611434A
Inventors: 周洲, 樊艳波, 伍洋, 孙钟前
Current and original assignee: Tencent Technology Shenzhen Co Ltd
Events: application filed by Tencent Technology Shenzhen Co Ltd, with priority to CN202410068980.2A; publication of CN117611434A; application granted; publication of CN117611434B

Classifications

    • G06N3/0475: Generative networks (G06N3: computing arrangements based on biological models; neural networks; architecture)
    • G06N3/045: Combinations of networks
    • G06N3/0464: Convolutional networks [CNN, ConvNet]
    • G06N3/048: Activation functions
    • G06N3/094: Adversarial learning (G06N3/08: learning methods)
    • G06V10/44: Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; connectivity analysis, e.g. of connected components
    • G06V10/764: Image or video recognition or understanding using pattern recognition or machine learning, using classification, e.g. of video objects
    • G06V10/806: Fusion of extracted features, i.e. combining data from various sources at the sensor, preprocessing, feature extraction or classification level
    • G06V10/82: Image or video recognition or understanding using neural networks
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Image Analysis (AREA)

Abstract

The embodiment of the application provides a model training method, an image style conversion method and apparatus, an electronic device, and a computer-readable storage medium, and relates to the technical fields of artificial intelligence and image processing. The method comprises the following steps: determining a first sample set comprising a plurality of sample content images and a plurality of sample style images; performing adversarial training on a first initial model according to the first sample set to obtain a trained first initial model, wherein the first initial model comprises a first generator; and obtaining a style conversion model according to the trained first initial model, wherein the style conversion model comprises the first generator in the trained first initial model. The embodiment of the application can balance global style migration and retain local details of the content image.

Description

Model training method, image style conversion method and device and electronic equipment
Technical Field
The application relates to the technical fields of image processing and artificial intelligence, and in particular to a model training method, an image style conversion method and apparatus, an electronic device, and a storage medium.
Background
With the rapid development of communication and computer technology, image processing technology based on computers and communication has also developed rapidly and been applied in various fields. For example, images can be style-converted using image processing techniques.
Style conversion of an image means converting the style of the original image into the style of a reference image while preserving the content of the original image as much as possible.
The related art handles image details poorly when performing style conversion on an image.
Disclosure of Invention
Embodiments of the present application provide a model training method, an image style conversion method and apparatus, an electronic device, a computer-readable storage medium, and a computer program product, which can solve the above-mentioned problems of the prior art. The technical scheme is as follows:
according to a first aspect of an embodiment of the present application, there is provided a training method of a style conversion model, including:
determining a first sample set comprising a plurality of sample content images and a plurality of sample style images;
Performing adversarial training on a first initial model according to the first sample set to obtain a trained first initial model, wherein the first initial model comprises a first generator, and the first generator is used for performing global feature and local feature enhancement on an input original image and a reference image, and fusing the feature-enhanced original image and reference image to obtain a converted image;
obtaining a style conversion model according to the trained first initial model, wherein the style conversion model comprises a first generator in the trained first initial model;
When the original image is a sample content image, the reference image is a sample style image, and the converted image is a stylized content image, wherein the stylized content image refers to an image obtained by shifting the style of the sample content image to the style of the sample style image.
According to a second aspect of the embodiment of the present application, there is provided an image style conversion method, including:
Determining a content image and a style image;
and taking the content image as an original image and the style image as a reference image, inputting them into a style conversion model trained based on the method provided in the first aspect, to obtain a stylized content image that is output by the style conversion model and matches the content image and the style image.
According to a third aspect of the embodiment of the present application, there is provided a training apparatus for a style conversion model, the apparatus including:
A first sample set determination module for determining a first sample set comprising a plurality of sample content images and a plurality of sample style images;
The adversarial training module is used for performing adversarial training on the first initial model according to the first sample set to obtain a trained first initial model, wherein the first initial model comprises a first generator, and the first generator is used for performing global feature and local feature enhancement on an input original image and a reference image, and fusing the feature-enhanced original image and reference image to obtain a converted image;
The model construction module is used for obtaining a style conversion model according to the trained first initial model, and the style conversion model comprises a first generator in the trained first initial model;
When the original image is a sample content image, the reference image is a sample style image, and the converted image is a stylized content image, wherein the stylized content image refers to an image obtained by shifting the style of the sample content image to the style of the sample style image.
According to a fourth aspect of an embodiment of the present application, there is provided an image style conversion apparatus including:
an image determining module for determining a content image and a style image;
And the reasoning module is used for taking the content image as an original image and the style image as a reference image, inputting them into a style conversion model trained based on the method provided in the first aspect, and obtaining a stylized content image that is output by the style conversion model and matches the content image and the style image.
According to a fifth aspect of embodiments of the present application, there is provided an electronic device comprising a memory, a processor and a computer program stored on the memory, the processor executing the computer program to carry out the steps of the method provided in the first or second aspect above.
According to a sixth aspect of embodiments of the present application, there is provided a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the method provided in the first or second aspect described above.
According to a seventh aspect of embodiments of the present application, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the steps of the method provided in the first or second aspect described above.
The technical scheme provided by the embodiment of the application has the beneficial effects that:
By performing adversarial training on the first initial model with the first sample set, the first generator enhances the global features and local features of the original image and the reference image and fuses the feature-enhanced original image with the reference image to obtain a converted image; when the original image is a sample content image, the converted image is a stylized content image obtained through style migration. In this way, global style migration can be balanced with the retention of the local details of the content image.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings that are required to be used in the description of the embodiments of the present application will be briefly described below.
FIG. 1 is a schematic diagram of an implementation environment provided by an embodiment of the present application;
FIG. 2 is a schematic flow chart of a model training method according to an embodiment of the present application;
FIG. 3 is a schematic structural diagram of a style conversion model according to an embodiment of the present application;
FIG. 4 is a schematic diagram of a process flow of a first generator according to an embodiment of the present application;
FIG. 5 is a schematic diagram of a training process of a first initial model according to an embodiment of the present application;
FIG. 6 is a schematic diagram of a training process of a first initial model according to an embodiment of the present application;
FIG. 7 is a schematic diagram of a training process of a second initial model according to an embodiment of the present application;
FIG. 8 is a flowchart of a training method of a style conversion model according to an embodiment of the present application;
Fig. 9 is a flowchart of an image style conversion method according to an embodiment of the present application;
fig. 10 is a schematic diagram of an application scenario provided in an embodiment of the present application;
FIG. 11 is a schematic structural diagram of a training device for a style conversion model according to an embodiment of the present application;
fig. 12 is a schematic structural diagram of an image style conversion device according to an embodiment of the present application;
Fig. 13 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
Embodiments of the present application are described below with reference to the drawings in the present application. It should be understood that the embodiments described below with reference to the drawings are exemplary descriptions for explaining the technical solutions of the embodiments of the present application, and the technical solutions of the embodiments of the present application are not limited.
As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless expressly stated otherwise, as understood by those skilled in the art. It will be further understood that the terms "comprises" and "comprising," when used in this specification, specify the presence of stated features, information, data, steps, operations, elements, and/or components, but do not preclude the presence or addition of other features, information, data, steps, operations, elements, components, and/or groups thereof, all of which may be included in the present specification. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element or intervening elements may be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or wirelessly coupled. The term "and/or" as used herein indicates that at least one of the items defined by the term, e.g., "a and/or B" may be implemented as "a", or as "B", or as "a and B".
For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the embodiments of the present application will be described in further detail with reference to the accompanying drawings.
First, several terms related to the present application are described and explained:
Artificial intelligence (AI): a theory, method, technology, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce new intelligent machines that can react in a way similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning, and decision-making. Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, including both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, and mechatronics. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, and machine learning/deep learning. The embodiment of the application relates to the field of image processing in artificial intelligence technology.
Computer vision (CV) technology: a science that studies how to make machines "see"; more specifically, it uses cameras and computers instead of human eyes to recognize and measure targets and perform other machine vision tasks, and further performs graphic processing so that the computer produces images more suitable for human observation or for transmission to instruments for detection. As a scientific discipline, computer vision studies related theories and technologies and attempts to build artificial intelligence systems that can acquire information from images or multidimensional data. Computer vision techniques typically include image processing, image detection, image semantic understanding, image retrieval, OCR (Optical Character Recognition), video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D (three-dimensional) techniques, virtual reality, augmented reality, and map construction, as well as biometric recognition techniques. The embodiment of the application can be applied to artistic creation scenes by migrating artistic styles through style conversion of content images, automatically generating images of different styles from artistic drafts; it can also be applied to the enhancement of synthetic data, adding more details to images that are not fine enough, such as the rapid generation of game assets and the rapid generation of portraits in different styles.
Generative adversarial network (GAN): a generative learning method for fitting a data distribution. The technique mainly comprises two neural networks, a generator and a discriminator: the generator is used to fit the data distribution, and the discriminator is used to distinguish whether the current data comes from the fitted distribution or from the real data distribution. Through adversarial learning between the generator and the discriminator, the generator gradually acquires the ability to generate data realistic enough to pass for real data.
Some related-art schemes over-emphasize local content details, which prevents sufficient style information from being migrated, while other schemes model only global features and ignore local details.
The application provides a model training method, an image style conversion method and apparatus, an electronic device, a computer-readable storage medium, and a computer program product, and aims to solve the above technical problems in the prior art.
The technical solutions of the embodiments of the present application and technical effects produced by the technical solutions of the present application are described below by describing several exemplary embodiments. It should be noted that the following embodiments may be referred to, or combined with each other, and the description will not be repeated for the same terms, similar features, similar implementation steps, and the like in different embodiments.
FIG. 1 shows a schematic diagram of an implementation environment provided by an embodiment of the present application. The implementation environment comprises: a terminal 11 and a server 12.
The training method of the style conversion model provided by the embodiment of the application may be executed by the terminal 11, by the server 12, or jointly by the terminal 11 and the server 12, which is not limited in the embodiment of the application. For the case in which the training method of the style conversion model provided by the embodiment of the application is jointly executed by the terminal 11 and the server 12, the server 12 undertakes the primary computing work and the terminal 11 undertakes the secondary computing work; or the server 12 undertakes the secondary computing work and the terminal 11 undertakes the primary computing work; or the server 12 and the terminal 11 perform cooperative computing using a distributed computing architecture.
The image style conversion method provided by the embodiment of the present application may be executed by the terminal 11, may be executed by the server 12, or may be executed by both the terminal 11 and the server 12, which is not limited in the embodiment of the present application. For the case that the image style conversion method provided by the embodiment of the application is jointly executed by the terminal 11 and the server 12, the server 12 bears the primary computing work, and the terminal 11 bears the secondary computing work; or the server 12 takes on secondary computing work and the terminal 11 takes on primary computing work; or the server 12 and the terminal 11 perform cooperative computing by adopting a distributed computing architecture.
The device for executing the training method of the style conversion model and the device for executing the image style conversion method may be the same or different, which is not limited in the embodiment of the present application.
In one possible implementation, the terminal 11 may be any electronic product that can perform human-machine interaction with a user through one or more of a keyboard, a touch pad, a touch screen, a remote controller, voice interaction, or a handwriting device, such as a PC (Personal Computer), a mobile phone, a smart phone, a PDA (Personal Digital Assistant), a wearable device, a PPC (Pocket PC), a tablet computer, a smart in-vehicle device, a smart television, a smart speaker, a smart voice interaction device, a smart home appliance, a vehicle-mounted terminal, and the like. The server 12 may be a single server, a server cluster comprising a plurality of servers, or a cloud computing service center. The terminal 11 establishes a communication connection with the server 12 through a wired or wireless network.
Those skilled in the art will appreciate that the above-described terminal 11 and server 12 are only examples, and that other terminals or servers that may be present in the present application or in the future are applicable and within the scope of the present application and are incorporated herein by reference.
The image style conversion method provided by the application can be applied to artistic creation scenes and game making scenes.
For example, in an artistic creation scenario, the method provided by the present application may perform style conversion processing on an artistic draft to process the artistic draft into an image of a specified artistic style. For example, the art draft may refer to a photographed photo or a pattern drawn by a drawing program.
In a game-making scene, the method provided by the application can perform style conversion processing on game resources so as to convert the game resources into different styles, thereby realizing rapid generation of game resources. For example, a game resource may be an avatar or an outfit of a virtual object, or a virtual environment.
The image style conversion method can be applied to scenes such as old photo restoration, movie restoration, cartoon restoration, video call quality improvement and the like.
Video calls, short-video content production, and the like are becoming increasingly important in people's daily lives. However, in some cases the ambient lighting is poor, and the video shot during a video call tends to be dim, which degrades the user experience. The image style conversion method of the embodiment of the application can therefore be applied to improve the image quality and brightness of dim pictures in a video call.
In the video restoration process, each video frame in the video is used as an image to be processed, and the restoration of the video is realized by performing style conversion processing on each video frame.
The embodiment of the application provides a model training method, as shown in fig. 2, which comprises S101-S103, specifically:
s101, determining a first sample set, the first sample set including a plurality of sample content images and a plurality of sample style images.
The content in each sample content image in the embodiment of the present application is different, and the embodiment of the present application does not specifically limit the content in the content image, and may be, for example, characters, figures, scenery, animals, robots, equipment, buildings, and the like.
The styles of the sample style images in the embodiment of the application are different. A style expresses the artistic form of an image, such as watercolor, black-and-white, or Impressionist, and the embodiment of the application does not specifically limit the styles of the sample style images.
S102, performing adversarial training on a first initial model according to the first sample set to obtain a trained first initial model, wherein the first initial model comprises a first generator, the first generator is used for performing global feature and local feature enhancement on an input original image and a reference image, and fusing the feature-enhanced original image and reference image to obtain a converted image.
In the embodiment of the application, when the first initial model is adversarially trained according to the first sample set, one or more image pairs are input into the first initial model each time, and each image pair comprises an original image and a reference image.
It should be noted that the original image and the reference image in an image pair are randomly selected; the content of the original image and the reference image need not be the same, but their styles must be different. Also, when the original image is a sample content image, the reference image must be a sample style image. By combining different sample content images with different sample style images, different image pairs are formed to adversarially train the first initial model, so that the first initial model can fully learn the content features in the content images and the style features in the style images.
The training mode of the first initial model in the embodiment of the application is adversarial training, in which a generator (Generator) and a discriminator (Discriminator) compete with each other to learn the data distribution: the generator is responsible for learning to generate data similar to real data from random noise, and the discriminator is used to distinguish the generated data from the real data.
The first initial model comprises a first generator. To address the problem that some prior-art schemes over-emphasize local content details and thus prevent sufficient style information from being migrated, and the problem that other schemes model only global features and ignore local details, the first generator performs global feature and local feature enhancement on the input original image and reference image and fuses the feature-enhanced original image with the reference image to obtain the converted image. When the original image is a sample content image, the converted image is a stylized content image, i.e., an image obtained by migrating the style of the sample content image into the style of the sample style image.
S103, obtaining a style conversion model according to the trained first initial model, wherein the style conversion model comprises a first generator in the trained first initial model.
According to the embodiment of the application, the first generator in the trained first initial model can be used as the style conversion model, so that, when the style conversion model is used for inference, a content image and a style image are input into the style conversion model and the stylized content image output by the style conversion model is used as the style conversion result.
According to the embodiment of the application, the first initial model is adversarially trained with the first sample set; the first generator enhances the global features and local features of the original image and the reference image and fuses the feature-enhanced original image with the reference image to obtain the converted image, and when the original image is a sample content image, the converted image is a stylized content image after style migration. In this way, global style migration can be balanced with the retention of the local details of the content image.
Based on the above embodiments, the method according to the embodiment of the present application obtains a style conversion model according to the trained first initial model, and further includes:
s201, determining a second sample set including a plurality of sample content images, and a sample color conversion image in which each sample content image is color-converted.
The second sample set determined by the application differs from the first sample set in two respects. On the one hand, the amount of data is greatly reduced, that is, the size of the second sample set is smaller, or far smaller, than that of the first sample set, so that the second sample set is used to adjust the details of style conversion with a small number of training samples rather than requiring many training samples. On the other hand, the sample images in the second sample set have a correspondence: one sample content image corresponds to at least one sample color conversion image, and a sample color conversion image is an image obtained by performing color conversion on the corresponding sample content image. Color conversion means adjusting some or all pixel values in the sample content image according to a certain rule; the adjustment rule is not limited and may, for example, be conversion to a grayscale image. Since a sample content image and its sample color conversion image correspond to each other and differ only in the colors of the pixels, the trained second initial model can learn pixel-level detail information through training.
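For illustration only, the following minimal sketch builds a sample color conversion image with one possible rule (conversion to grayscale); the helper name and the use of PIL/torchvision are assumptions, and any other pixel-value adjustment rule could be used instead.

```python
# Minimal sketch, assuming one possible color-conversion rule (grayscale).
# The helper name and the use of PIL/torchvision are illustrative assumptions,
# not part of the patented method.
from PIL import Image
from torchvision import transforms

def make_color_converted_sample(content_image: Image.Image) -> Image.Image:
    """Build a sample color conversion image for a sample content image."""
    # Grayscale changes only pixel colors while keeping the content identical,
    # which is what lets the second model learn pixel-level detail information.
    to_gray = transforms.Grayscale(num_output_channels=3)
    return to_gray(content_image.convert("RGB"))
```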
S202, performing stylization processing on each sample content image in the second sample set through a first generator in the trained first initial model, and obtaining stylized content images corresponding to each sample content image.
According to the embodiment of the application, the first generator in the first initial model trained by the embodiment is utilized to carry out style migration on each sample content image in the second sample set, so that stylized content images are obtained.
S203, training a second initial model through the stylized content images and the sample color conversion images corresponding to the sample content images in the second sample set to obtain a trained second initial model, wherein the second initial model is used for extracting image characteristics of the sample color conversion images and the stylized content images, and performing color conversion on the stylized content images according to the image characteristics to obtain color-converted content images;
According to the embodiment of the application, the second initial model is used for carrying out feature extraction and feature fusion on the stylized content images and the sample color conversion images of each sample content image, so that the second initial model can adaptively adjust the colors according to the style images, and the finely adjusted stylized content images are obtained.
S204, obtaining a style conversion model according to the first generator in the trained first initial model and the trained second initial model.
After training the second initial model, the embodiment of the application can construct an image style conversion model. Please refer to fig. 3, which exemplarily shows a schematic structural diagram of the style conversion model provided by the embodiment of the application. As shown in the figure, the style conversion model includes the first generator in the first initial model and the trained second initial model. During inference, the content image and the style image to be processed are input into the first generator to obtain the stylized content image output by the first generator; the stylized content image and the style image are then input into the second initial model, which extracts their image features, performs color conversion on the stylized content image accordingly, and finally outputs the color-converted stylized content image.
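As a rough sketch of how the two trained components of Fig. 3 might be chained at inference time (the wrapper class and the interfaces of the two sub-modules are hypothetical placeholders; the patent does not prescribe an API):

```python
# Hedged sketch of the two-stage inference pipeline of Fig. 3. The module
# interfaces generator(content, style) and second_model(stylized, style) are
# assumed placeholders for the trained first generator and the trained
# second initial model.
import torch
import torch.nn as nn

class StyleConversionModel(nn.Module):
    def __init__(self, first_generator: nn.Module, second_model: nn.Module):
        super().__init__()
        self.first_generator = first_generator  # trained first generator
        self.second_model = second_model        # trained second initial model

    @torch.no_grad()
    def forward(self, content: torch.Tensor, style: torch.Tensor) -> torch.Tensor:
        # Stage 1: style migration with global/local feature enhancement and fusion
        stylized = self.first_generator(content, style)
        # Stage 2: extract features of the stylized image and the style image,
        # then color-convert the stylized content image accordingly
        return self.second_model(stylized, style)
```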
On the basis of the above embodiments, as an alternative embodiment, the first generator is specifically configured to execute the following steps S301 to S305:
S301, for any one of an input original image and a reference image, encoding the image through a preset encoder, extracting the output of a plurality of convolution layers, and obtaining a plurality of groups of convolution characteristics of the image.
The type of encoder is not particularly limited in the embodiments of the present application; for example, a pre-trained Visual Geometry Group (VGG) model may be used. The VGG model has 5 convolution layer blocks, and the convolution features output by different convolution layers have different channel widths. Specifically:
Convolution layer block 1: consists of two convolution layers and a pooling layer. The kernel size of convolution layer 1 is 3x3 with a channel width of 64, the kernel size of convolution layer 2 is 3x3 with a channel width of 64, and the pooling layer has a size of 2x2 and a stride of 2.
Convolution layer block 2: consists of two convolution layers and a pooling layer. The kernel size of convolution layer 1 is 3x3 with a channel width of 128, the kernel size of convolution layer 2 is 3x3 with a channel width of 128, and the pooling layer has a size of 2x2 and a stride of 2.
Convolution layer block 3: consists of three convolution layers and a pooling layer. The kernel size of convolution layer 1 is 3x3 with a channel width of 256, the kernel size of convolution layer 2 is 3x3 with a channel width of 256, the kernel size of convolution layer 3 is 3x3 with a channel width of 256, and the pooling layer has a size of 2x2 and a stride of 2.
Convolution layer block 4: consists of three convolution layers and a pooling layer. The kernel size of convolution layer 1 is 3x3 with a channel width of 512, the kernel size of convolution layer 2 is 3x3 with a channel width of 512, the kernel size of convolution layer 3 is 3x3 with a channel width of 512, and the pooling layer has a size of 2x2 and a stride of 2.
Convolution layer block 5: consists of three convolution layers and a pooling layer. The kernel size of convolution layer 1 is 3x3 with a channel width of 512, the kernel size of convolution layer 2 is 3x3 with a channel width of 512, the kernel size of convolution layer 3 is 3x3 with a channel width of 512, and the pooling layer has a size of 2x2 and a stride of 2.
For example, the embodiment of the application may select 3 convolution layers, such as the first convolution layer in each of convolution layer blocks 2-4, and use their outputs to form the multiple groups of convolution features of the image, as sketched below.
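As an illustration of S301, the sketch below extracts the three groups of convolution features with a pre-trained VGG16 from torchvision; the layer indices 6, 11 and 18 correspond to relu2_1, relu3_1 and relu4_1 in torchvision's VGG16 layout and, like the use of torchvision itself, are assumptions rather than requirements of the patent.

```python
# Hedged sketch of S301: encode an image with a pre-trained VGG and collect the
# outputs of the first convolution layer (after ReLU) of blocks 2-4.
# Indices 6/11/18 assume torchvision's VGG16 "features" layout; the channel
# widths of the tapped features are 128, 256 and 512 respectively.
import torch
import torch.nn as nn
from torchvision.models import vgg16, VGG16_Weights

class VGGFeatureExtractor(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = vgg16(weights=VGG16_Weights.IMAGENET1K_V1).features.eval()
        for p in self.features.parameters():
            p.requires_grad_(False)
        self.taps = {6: "relu2_1", 11: "relu3_1", 18: "relu4_1"}

    def forward(self, x: torch.Tensor) -> dict:
        feats = {}
        for idx, layer in enumerate(self.features):
            x = layer(x)
            if idx in self.taps:
                feats[self.taps[idx]] = x  # Fc1, Fc2, Fc3 in the patent's notation
            if idx == max(self.taps):
                break
        return feats
```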
S302, taking a group of convolution features with the largest channel width in each group of convolution features as reference convolution features, and aligning the channel width of each group of convolution features to the channel width of the reference convolution features.
Taking the outputs of the first convolution layer of convolution layer blocks 2-4 as an example: since the output of the first convolution layer of convolution layer block 4 has the largest channel width, namely 512, the channel widths of the groups of convolution features are aligned by up-sampling the convolution features from convolution layer blocks 2 and 3.
S303, performing dense representation on all the aligned groups of convolution features to obtain a first dense feature, and performing dense representation on the reference convolution features to obtain a second dense feature.
That is, the first dense feature of the present application is obtained based on all sets of registered convolution features, while the second dense feature is obtained based only on the reference convolution features.
The embodiment of the application can carry out Dense representation based on a full-connection (Dense) layer, and can flatten all the aligned groups of convolution features by carrying out Dense representation, namely, the high-dimensional feature representation is converted into a low-dimensional vector, and the low-dimensional vector can be used for a final classification task. Each connection of the Dense layer has a weight and the model learns the mapping from input features to output categories by adjusting these weights during the training process. These weights enable the model to learn the appropriate feature representation and classification decisions on the training data.
S304, respectively performing feature enhancement on the first dense feature and the second dense feature; obtaining a first point multiplication feature according to the first dense feature and the enhancement feature obtained by feature enhancement of the second dense feature; and obtaining a second point multiplication feature according to the second dense feature and the enhancement feature obtained by feature enhancement of the first dense feature.
According to the embodiment of the application, the dense features are enhanced, the understanding capability of image details can be improved, and the learning capability of feature information with different channel widths can be improved by fusing one dense feature with the enhanced feature of the other dense feature.
S305, carrying out feature fusion on the first dense features, the second dense features, the first dot product features and the second dot product features of the input original image and the reference image, and obtaining the converted image according to a feature fusion result.
After the feature fusion result is obtained by the first generator, the feature fusion result can be decoded by a decoder in the first generator, and then a converted image is obtained.
On the basis of the above embodiments, step S304 further includes:
S3041, respectively performing nonlinear processing on the first dense feature and the second dense feature through a preset activation function to obtain respective activation values of the first dense feature and the second dense feature;
According to the embodiment of the application, a nonlinear activation function is applied after the Dense layer to perform nonlinear processing on the first dense feature and the second dense feature, which increases the representational capacity of the model and introduces nonlinear factors, so that more complex classification problems can be handled.
S3042, carrying out global feature enhancement and local feature enhancement on the respective activation values of the first dense feature and the second dense feature according to the weights of the global attention mechanism and the local attention mechanism, and obtaining respective enhancement features of the first dense feature and the second dense feature.
The cross-gating mechanism is typically applied in low-level vision, a branch of computer vision that focuses on improving the overall viewing experience of an image (i.e., visual enhancement). Whereas mid- and high-level vision focus on enabling a computer to understand the content of an image, low-level vision aims to address image quality issues such as sharpness, color, and temporal consistency. The embodiment of the application enhances the activation values of the first dense feature and the second dense feature through the global attention mechanism and the local attention mechanism of the cross-gating mechanism and the weights of the global attention mechanism and the local attention mechanism, so as to obtain the enhancement features of the first dense feature and the second dense feature.
S3043, performing point multiplication on the activation value of the first dense feature and the enhancement feature of the second dense feature to obtain a first point multiplication feature; and carrying out point multiplication on the activation value of the second dense feature and the enhancement feature of the first dense feature to obtain a second point multiplication feature.
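The patent does not spell out the internal structure of the cross-gate module or of its attention branches. The sketch below is one plausible reading, with a pooled channel gate as the global attention branch, a depthwise convolution as the local attention branch, and learnable weights mixing the two; all of these concrete choices are assumptions.

```python
# Hedged sketch of a cross-gate module G (S3042). The patent only states that
# the activation value is enhanced globally and locally according to the
# weights of a global and a local attention mechanism; the concrete branches
# below (pooled channel gate + depthwise spatial gate, mixed by learnable
# scalars) are an illustrative assumption, not the patented design.
import torch
import torch.nn as nn

class CrossGate(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        # Global attention branch: channel gate from globally pooled statistics
        self.global_gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels, kernel_size=1),
            nn.Sigmoid(),
        )
        # Local attention branch: depthwise convolution over local neighbourhoods
        self.local_gate = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, groups=channels),
            nn.Sigmoid(),
        )
        # Learnable weights of the global and local attention mechanisms
        self.w_global = nn.Parameter(torch.tensor(0.5))
        self.w_local = nn.Parameter(torch.tensor(0.5))

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        g = self.global_gate(f) * f   # global feature enhancement
        l = self.local_gate(f) * f    # local feature enhancement
        return self.w_global * g + self.w_local * l  # enhanced feature G(f)
```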
Referring to fig. 4, a process flow diagram of a first generator of the present application is schematically shown, and includes:
For any one of the input original image and reference image, the image is encoded by a VGG model, and the convolution features output by the relu2_1, relu3_1 and relu4_1 layers are extracted, denoted Fc1, Fc2 and Fc3 respectively;
The convolution features of the relu2_1 and relu3_1 layers, namely Fc1 and Fc2, are upsampled (Upsample) to obtain the upsampled results U1(Fc1) and U2(Fc2), where U1 and U2 denote different upsampling functions, so that the channel widths of the convolution features of the relu2_1 and relu3_1 layers are aligned to the channel width of the convolution feature Fc3 of the relu4_1 layer. The three aligned groups of convolution features are then input into the first Dense layer for dense representation to obtain the first dense feature F1, and the convolution feature of the relu4_1 layer is input into the first Dense layer for dense representation to obtain the second dense feature F2. The calculation formulas may be expressed as:
F1 = Dense1[U1(Fc1), U2(Fc2), Fc3];
F2 = Dense1[Fc3];
where Dense1 denotes the first Dense function;
The first dense feature F1 and the second dense feature F2 are respectively input into the second Dense layer for weighting, and the weighted results are nonlinearly processed by the Gaussian Error Linear Unit (GELU) to obtain the activation value F1' of the first dense feature and the activation value F2' of the second dense feature. The calculation formulas may be expressed as:
F1' = σ(W1(F1))
F2' = σ(W2(F2))
where σ denotes the GELU function, and W1 and W2 respectively denote the convolution weights corresponding to Dense2;
Global feature enhancement and local feature enhancement are performed on the respective activation values of the first dense feature and the second dense feature according to the weights of the global attention mechanism and the local attention mechanism, obtaining the enhancement feature G(F1') of the first dense feature and the enhancement feature G(F2') of the second dense feature, where G denotes the cross-gate module;
The activation value of the first dense feature is point-multiplied with the enhancement feature of the second dense feature to obtain the first point multiplication feature F1''; the activation value of the second dense feature is point-multiplied with the enhancement feature of the first dense feature to obtain the second point multiplication feature F2''. The calculation formulas may be expressed as:
F1'' = F1' ⊙ (G(F2'));
F2'' = F2' ⊙ (G(F1'));
where ⊙ denotes point (element-wise) multiplication;
The first point multiplication feature F1'' and the second point multiplication feature F2'' are respectively input into the third Dense layer for weighting to obtain the weighted results W3(F1'') and W4(F2'');
The first dense feature F1, the second dense feature F2, and the weighted results W3(F1'') and W4(F2'') of each of the original image and the reference image are feature-fused to obtain the image feature F of the converted image. The calculation formula may be expressed as:
F = AdaIN(F1 + F2 + W3(F1'') + W4(F2''))
where AdaIN is a feature fusion function implemented by instance normalization.
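Pulling the formulas above together, the following sketch mirrors the data flow of Fig. 4. The Dense layers are modelled as 1x1 convolutions, U1/U2 as bilinear resizing plus a 1x1 channel projection, AdaIN as instance normalization of the summed features, and the cross-gate module G reuses the CrossGate sketch given after step S3043; where the patent leaves details open, these choices are assumptions.

```python
# Hedged sketch of the first generator's feature-fusion path (Fig. 4).
# 1x1 convolutions stand in for the Dense layers, bilinear resizing plus a 1x1
# projection for U1/U2, and instance normalization of the sum for AdaIN; the
# spatial resizing to Fc3's resolution is an assumption needed for concatenation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossGatedFusion(nn.Module):
    def __init__(self, c1: int = 128, c2: int = 256, c3: int = 512):
        super().__init__()
        self.u1 = nn.Conv2d(c1, c3, 1)              # channel part of U1 (relu2_1 -> 512)
        self.u2 = nn.Conv2d(c2, c3, 1)              # channel part of U2 (relu3_1 -> 512)
        self.dense1_cat = nn.Conv2d(3 * c3, c3, 1)  # Dense1 over [U1(Fc1), U2(Fc2), Fc3]
        self.dense1_ref = nn.Conv2d(c3, c3, 1)      # Dense1 over Fc3 alone
        self.w1 = nn.Conv2d(c3, c3, 1)              # Dense2 weight W1
        self.w2 = nn.Conv2d(c3, c3, 1)              # Dense2 weight W2
        self.gate = CrossGate(c3)                   # cross-gate module G (see earlier sketch)
        self.w3 = nn.Conv2d(c3, c3, 1)              # Dense3 weight W3
        self.w4 = nn.Conv2d(c3, c3, 1)              # Dense3 weight W4
        self.adain = nn.InstanceNorm2d(c3, affine=True)

    def forward(self, fc1: torch.Tensor, fc2: torch.Tensor, fc3: torch.Tensor) -> torch.Tensor:
        size = fc3.shape[-2:]
        a1 = self.u1(F.interpolate(fc1, size=size, mode="bilinear", align_corners=False))
        a2 = self.u2(F.interpolate(fc2, size=size, mode="bilinear", align_corners=False))
        f1 = self.dense1_cat(torch.cat([a1, a2, fc3], dim=1))  # F1 = Dense1[U1(Fc1), U2(Fc2), Fc3]
        f2 = self.dense1_ref(fc3)                              # F2 = Dense1[Fc3]
        f1p = F.gelu(self.w1(f1))                              # F1' = sigma(W1(F1))
        f2p = F.gelu(self.w2(f2))                              # F2' = sigma(W2(F2))
        f1pp = f1p * self.gate(f2p)                            # F1'' = F1' (.) G(F2')
        f2pp = f2p * self.gate(f1p)                            # F2'' = F2' (.) G(F1')
        # F = AdaIN(F1 + F2 + W3(F1'') + W4(F2''))
        return self.adain(f1 + f2 + self.w3(f1pp) + self.w4(f2pp))
```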
Based on the foregoing embodiments, as an alternative embodiment, performing the countermeasure training on the first initial model by using the first sample set to obtain a trained first initial model, please refer to fig. 5, including:
Determining a first initial model, wherein the first initial model comprises a first generator and a first discriminator, and performing first-stage multi-round iterative training on the first initial model according to a first sample set until a first iteration stop condition is met;
Replacing, when the first iteration stop condition is met, the first discriminator with a second discriminator having stronger discrimination capability to obtain an updated first initial model;
And performing multi-round iterative training of a second stage on the updated first initial model according to the first sample set until a second iterative stopping condition is met, so as to obtain the trained first initial model.
The first initial model of the embodiment of the application includes a first generator and a first discriminator. The function of the first generator is described in the above embodiment and is not repeated here; the first discriminator is used to discriminate whether the converted image is a non-original image and whether the reference image is an original image. Through adversarial training between the first generator and the first discriminator, the first generator learns to generate converted images that are as realistic as possible. When the first iteration stop condition is reached, the first discriminator is replaced with a second discriminator having stronger discrimination capability, so that the second discriminator is used to further improve the conversion capability of the first generator. When the original image is a sample content image, the conversion capability of the first generator refers to its capability to migrate the style of the content image so that it is consistent with the style of the style image. By performing the second stage of multi-round iterative training on the updated first initial model, the trained first initial model obtained in this way has stronger style migration capability and fidelity.
In some embodiments, the second discriminator may be the discriminator of a StyleGAN network. StyleGAN is characterized by its ability to generate high-quality, realistic images with good diversity and controllability. This is achieved by introducing two new mechanisms, style conversion and noise injection. The style conversion mechanism allows the generator to generate images of different styles from an input style vector, and the noise injection mechanism makes the generated images more diverse by introducing noise in the generation process.
The first iteration stop condition of the embodiment of the application may be that the loss function of the first stage converges or that the number of training iterations reaches a preset threshold, and the second iteration stop condition may be that the loss function of the second stage converges or that the number of training iterations reaches a preset threshold.
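The two-stage schedule can be sketched as below; the alternating generator/discriminator updates, the use of BCEWithLogitsLoss as a numerically stable stand-in for the log-based adversarial terms given later, the fixed step counts in place of a convergence test, and the data-loader interface are all illustrative assumptions.

```python
# Hedged skeleton of the two-stage adversarial training of Fig. 5.
# Optimizers, step counts, and the loss formulation via BCEWithLogitsLoss are
# illustrative assumptions; the loss terms themselves follow Lerc and Ladv below.
import itertools
import torch
import torch.nn as nn

def train_stage(generator, restorer, discriminator, loader, opt_g, opt_d, num_steps):
    bce = nn.BCEWithLogitsLoss()
    for _, (original, reference) in zip(range(num_steps), itertools.cycle(loader)):
        # Discriminator step: the reference image is "original" (real), the converted image is not.
        with torch.no_grad():
            converted = generator(original, reference)
        real_logits = discriminator(reference)
        fake_logits = discriminator(converted)
        d_loss = bce(real_logits, torch.ones_like(real_logits)) + \
                 bce(fake_logits, torch.zeros_like(fake_logits))
        opt_d.zero_grad(); d_loss.backward(); opt_d.step()

        # Generator step: adversarial term plus the first difference Lerc.
        converted = generator(original, reference)
        restored = restorer(converted)                        # second generator
        gen_logits = discriminator(converted)
        g_adv = bce(gen_logits, torch.ones_like(gen_logits))  # try to fool the discriminator
        l_erc = (restored - original).abs().mean()            # restoration (cycle) term
        opt_g.zero_grad(); (g_adv + l_erc).backward(); opt_g.step()

def train_first_initial_model(generator, restorer, disc1, disc2, loader,
                              opt_g, opt_d1, opt_d2, steps_stage1, steps_stage2):
    # Stage 1 with the first discriminator, then swap in the stronger second
    # discriminator (e.g. a StyleGAN-style discriminator) for stage 2.
    train_stage(generator, restorer, disc1, loader, opt_g, opt_d1, steps_stage1)
    train_stage(generator, restorer, disc2, loader, opt_g, opt_d2, steps_stage2)
    return generator
```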
On the basis of the above embodiments, as an alternative embodiment, each round of iterative training for any one of the first phase and the second phase includes:
Inputting the original image and the reference image in the first sample set into a first generator to obtain a converted image output by the first generator;
Inputting the converted image into a second generator, obtaining a restored original image output by the second generator, and determining a first difference between the restored original image and the original image;
Discriminating the converted image and the reference image by the discriminator of the corresponding stage to obtain a first probability that the converted image is identified as a non-original image and a second probability that the reference image is identified as an original image;
And determining a first loss function value according to the first probability, the second probability and the first difference, and adjusting parameters of the first initial model according to the first loss function value.
In the embodiment of the application, the first initial model comprises two generators and a discriminator, wherein the first generator is used for carrying out global feature enhancement and local feature enhancement on an input original image and a reference image, fusing the original image with the reference image after feature enhancement to obtain a converted image, the second generator is used for restoring the converted image to obtain a restored original image, and the performance of the first generator is optimized based on the restoring effect.
The first difference Lerc between the restored original image and the original image can be expressed by the following formula:
Lerc = E[||F(G(I_original)) - I_original||1]
where I_original denotes the original image, G(I_original) denotes the converted image generated by the first generator from the original image, and F(G(I_original)) denotes the restored original image recovered by the second generator from the converted image;
The embodiment of the application can obtain the adversarial loss (Adversarial Loss) from the first probability and the second probability, and the adversarial loss can be expressed by the following formula:
Ladv = E[log D(I_reference)] + E[log(1 - D(G(I_original)))]
where I_reference denotes the reference image, 1 - D(G(I_original)) denotes the first probability, and D(I_reference) denotes the second probability.
Combining the above adversarial loss and cycle-consistency loss, the expression of the first loss function value of the application is obtained:
L1 = Lerc + Ladv
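A direct, hedged transcription of these formulas into code follows; the eps guard and the assumption that the discriminator outputs probabilities in (0, 1) are illustrative, and in training the discriminator maximizes Ladv while the generator minimizes L1, as in the loop sketched above.

```python
# Hedged transcription of the first-stage objective, in the notation above.
import torch

def l_erc(restored_original: torch.Tensor, original: torch.Tensor) -> torch.Tensor:
    # Lerc = E[ || F(G(I_original)) - I_original ||_1 ]
    return (restored_original - original).abs().mean()

def l_adv(d_reference: torch.Tensor, d_converted: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # Ladv = E[ log D(I_reference) ] + E[ log(1 - D(G(I_original))) ]
    # d_reference / d_converted: discriminator probabilities in (0, 1).
    return torch.log(d_reference + eps).mean() + torch.log(1.0 - d_converted + eps).mean()

def first_loss(restored_original, original, d_reference, d_converted) -> torch.Tensor:
    # L1 = Lerc + Ladv
    return l_erc(restored_original, original) + l_adv(d_reference, d_converted)
```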
Referring to fig. 6, a schematic diagram of a training process of a first initial model according to an embodiment of the present application is shown, where the training process includes:
Taking the sample content image as the original image I_original and the sample style image as the reference image I_reference;
During the first-stage training, the original image I_original and the reference image I_reference are input into the first generator to obtain the converted image G(I_original) output by the first generator; the converted image G(I_original) is input into the second generator to obtain the restored original image F(G(I_original)) output by the second generator, and the first difference Lerc between the restored original image F(G(I_original)) and the original image I_original is determined; the converted image G(I_original) and the reference image I_reference are discriminated by the first discriminator to obtain the first probability [1 - D(G(I_original))] that the converted image is identified as a non-original image and the second probability D(I_reference) that the reference image is identified as an original image; the first loss function value is determined according to the first probability, the second probability and the first difference, and the parameters of the first initial model are adjusted according to the first loss function value; when the first iteration stop condition is met, the first discriminator is replaced with the second discriminator to obtain the updated first initial model;
During the second-stage training, the original image I_original and the reference image I_reference are input into the first generator to obtain the converted image G(I_original) output by the first generator; the converted image G(I_original) is input into the second generator to obtain the restored original image F(G(I_original)) output by the second generator, and the first difference Lerc between the restored original image F(G(I_original)) and the original image I_original is determined; the converted image and the reference image are discriminated by the second discriminator to obtain the first probability [1 - D(G(I_original))] that the converted image is identified as a non-original image and the second probability D(I_reference) that the reference image is identified as an original image; the first loss function value is determined according to the first probability, the second probability and the first difference, and the parameters of the updated first initial model are adjusted according to the first loss function value until the second iteration stop condition is met, thereby obtaining the trained first initial model.
On the basis of the above embodiments, as an alternative embodiment, in the second stage, the original image further includes a sample-style image;
When the original image is a sample style image, the reference image is a sample content image, and the converted image is a content style image, where the content style image refers to an image obtained by migrating the content of the sample content image into the sample style image.
In the embodiment of the application, when the first initial model is trained, the sample style image is also used as the original image and the sample content image as the reference image; in this case the converted image is a content-style image, i.e., an image obtained by transferring the content of the sample content image onto the sample style image. In this way the first generator also learns content migration, which enables it to understand the content of the content image more accurately, further reduces the rendering loss of the content image during style migration, and preserves more details in the resulting stylized content image.
In the embodiment of the application, when one sample content image and one sample style image are input to a first initial model, the first initial model executes two training threads in parallel, wherein one training thread takes the sample content image as an original image, takes the sample style image as a reference image, and the other training thread takes the sample style image as the original image and takes the sample content image as the reference image.
The first difference L_rec between the restored original image and the original image according to the embodiment of the present application may be expressed by the following formula:
L_rec = E[||F(G(I_content)) - I_content||_1] + E[||F(G(I_style)) - I_style||_1]
where I_content denotes the sample content image, G(I_content) denotes the stylized content image generated by the first generator from the sample content image, F(G(I_content)) denotes the restored content image recovered by the second generator from the stylized content image, I_style denotes the sample style image, G(I_style) denotes the content-style image generated by the first generator from the sample style image, and F(G(I_style)) denotes the restored style image recovered by the second generator from the content-style image.
On the basis of the above embodiments, in the second training stage, the second discriminator includes a first sub-discriminator and a second sub-discriminator;
the first sub-discriminator is used for discriminating between the stylized content image and the sample style image;
the second sub-discriminator is used for discriminating between the content-style image and the sample content image.
The embodiment of the application can obtain the adversarial loss according to the first probability and the second probability, and the adversarial loss can be expressed by the following formula:
L_adv = E[log D(I_content)] + E[log(1 - D(G(I_style)))] + E[log D(I_style)] + E[log(1 - D(G(I_content)))]
where D(I_content) represents the probability that the discriminator recognizes the sample content image as an original image, and 1 - D(G(I_style)) represents the probability that the discriminator recognizes the content-style image as a non-original image, both for the direction in which the original image is the sample style image;
D(I_style) represents the probability that the discriminator recognizes the sample style image as an original image, and 1 - D(G(I_content)) represents the probability that the discriminator recognizes the stylized content image as a non-original image, both for the direction in which the original image is the sample content image.
On the basis of the foregoing embodiments, as an optional embodiment, training a second initial model by using a stylized content image and a sample color conversion image corresponding to each sample content image in the second sample set, to obtain a trained second initial model, including:
Carrying out multiple rounds of iterative training on the second initial model through the stylized content images and the sample color conversion images corresponding to the sample content images in the second sample set until a third iteration-stop condition is met, where each round of iterative training is as shown in fig. 7 and comprises the following steps (an illustrative sketch of one such round is given after the steps):
Inputting the stylized content image and the sample color conversion image corresponding to the sample content image into a feature extraction module of the second initial model to perform feature extraction to obtain a first feature of the stylized content image and a second feature of the color conversion image;
Inputting the first feature and the second feature into a color conversion module of the second initial model, and performing color conversion on the stylized content image to obtain a stylized content image with the converted colors;
Determining a second difference between the sample color conversion image and the color-converted stylized content image;
And determining a second loss function value according to the second difference, and adjusting parameters of the feature extraction module and the color conversion module according to the second loss function value.
Referring to fig. 8, a flow chart of a training method of a style conversion model according to an embodiment of the present application is shown. As shown in the drawing, the method includes the following steps (a consolidated code sketch of the whole flow is given after the enumeration):
s401, determining a first initial model, wherein the first initial model comprises the first generator, a second generator and a first discriminator;
S402, performing a first-stage multi-round iterative training on a first initial model according to a first sample set until a first iterative stopping condition is met, wherein each round of iterative training comprises:
S4021, taking a sample content image as the original image and a sample style image as the reference image, and inputting them into the first generator to obtain a converted image output by the first generator;
S4022, inputting the converted image into a second generator, obtaining a restored original image output by the second generator, and determining a first difference between the restored original image and the original image;
S4023, discriminating the converted image and the reference image through the first discriminator to obtain a first probability of identifying the converted image as a non-original image and a second probability of identifying the reference image as an original image;
s4024, determining a first loss function value according to the first probability, the second probability and the first difference, and adjusting parameters of the first initial model according to the first loss function value;
S403, when the first iteration stop condition is met, replacing the first discriminator with a second discriminator with stronger discrimination capability, where the second discriminator comprises a first sub-discriminator and a second sub-discriminator, to obtain an updated first initial model;
S404, performing a second-stage multi-round iterative training on the updated first initial model according to the first sample set until a second iterative stopping condition is met, wherein each round of iterative training comprises:
S4041, taking the sample content image as an original image, and taking the sample style image as a reference image to be input into a first generator to obtain a first converted image output by the first generator;
taking the sample style image as an original image, and taking the sample content image as a reference image to be input into a first generator to obtain a second conversion image output by the first generator;
S4042, inputting the first converted image into a second generator, obtaining a restored sample content image output by the second generator, and determining a first sub-difference between the restored sample content image and the sample content image;
Inputting a second converted image into a second generator to obtain a restored sample style image output by the second generator, and determining a second sub-difference between the restored sample style image and the sample style image;
s4043, obtaining a first difference according to the first sub-difference and the second sub-difference;
S4044, discriminating the stylized content image and the style image through the first sub-discriminator to obtain a first sub-probability of identifying the stylized content image as a non-original image and a second sub-probability of identifying the style image as an original image;
Discriminating the content-style image and the content image through the second sub-discriminator to obtain a third sub-probability of identifying the content-style image as a non-original image and a fourth sub-probability of identifying the content image as an original image;
S4045, obtaining a first probability according to the first sub-probability and the third sub-probability, and obtaining a second probability according to the second sub-probability and the fourth sub-probability;
s4046, determining a first loss function value according to the first probability, the second probability and the first difference, and adjusting parameters of the first initial model according to the first loss function value;
S405, determining a second sample set, wherein the second sample set comprises a plurality of sample content images and sample color conversion images of which each sample content image is subjected to color conversion;
s406, performing stylization processing on each sample content image in the second sample set through a first generator in the trained first initial model to obtain a stylized content image corresponding to each sample content image;
s407, performing iterative training on the second initial model through the stylized content images and the sample color conversion images corresponding to the sample content images in the second sample set until a third iterative stopping condition is met, wherein each round of iterative training comprises:
S4071, inputting the stylized content image and the sample color conversion image corresponding to the sample content image into a feature extraction module of the second initial model to perform feature extraction, and obtaining a first feature of the stylized content image and a second feature of the color conversion image;
S4072, inputting the first feature and the second feature into a color conversion module of the second initial model, and performing color conversion on the stylized content image to obtain a stylized content image with the color converted;
S4073, determining a second difference between the sample color conversion image and the color-converted stylized content image;
S4074, determining a second loss function value according to the second difference, and adjusting parameters of the feature extraction module and the color conversion module according to the second loss function value.
The embodiment of the application also provides an image style conversion method, as shown in fig. 9, comprising the following steps:
s501, determining a content image and a style image;
S502, taking the content image as the original image and the style image as the reference image, inputting them into a style conversion model trained by the training method of the style conversion model of the above embodiments, and obtaining a style content image which is output by the style conversion model and matches the content image and the style image.
It should be appreciated that if the style conversion model includes only the first generator, the style content image output by the style conversion model is a stylized content image.
If the style conversion model comprises a first generator and a trained second initial model, the style content image output by the style conversion model is a stylized content image subjected to color conversion.
Fig. 10 is a schematic diagram of an application scenario provided in an embodiment of the present application, where the application scenario is a video call scenario. As shown in the drawing, an operation and maintenance person 21 collects a first sample set and a second sample set. The first sample set includes a plurality of sample content images and a plurality of sample style images, and the second sample set includes a plurality of sample content images and a sample color conversion image obtained by color-converting each sample content image. The sample content images are self-portraits of a person captured in a dim environment, the sample style images are images with higher brightness, and the sample color conversion images are images obtained by adjusting the brightness of the sample content images.
It should be understood that, when the embodiments of the present application involve the collection and processing of relevant data, the informed consent or separate consent of the personal information subject should be obtained strictly in accordance with the requirements of relevant national laws and regulations, and subsequent data use and processing should be carried out within the scope authorized by laws, regulations and the personal information subject.
The first server 22 obtains the style conversion model through training by using the first sample set and the second sample set according to the training method of the style conversion model provided by the above embodiments of the present application.
The first user 23 logs in to an application program capable of making video calls on the terminal 24 and initiates a video call to the second user 26 through the application program. The terminal 24 sends the video frames collected in real time to a background server 25 of the application program; the background server detects the brightness of the video frames and finds that their quality is poor because the frames are dark, so the background server 25 calls the style conversion model in the first server 22 to perform style conversion on the video frames, obtains video frames with a bright style, and sends them to the terminal 27 of the second user 26.
The embodiment of the application provides a training device for a style conversion model. As shown in fig. 11, the training device for the style conversion model may include: a first sample set determination module 101 and a countermeasure training module 102, wherein,
A first sample set determining module 101 for determining a first sample set comprising a plurality of sample content images and a plurality of sample style images;
The countermeasure training module 102 is configured to perform countermeasure training on a first initial model according to a first sample set to obtain a trained first initial model, where the first initial model includes a first generator, and the first generator is configured to perform global feature enhancement and local feature enhancement on both an input original image and a reference image, and fuse the original image and the reference image after feature enhancement to obtain a converted image;
When the original image is a sample content image, the reference image is a sample style image, and the converted image is a stylized content image, wherein the stylized content image refers to an image obtained by shifting the style of the sample content image to the style of the sample style image.
The device of the embodiment of the present application may execute the training method of the style conversion model provided by the embodiment of the present application, and its implementation principle is similar, and actions executed by each module in the device of each embodiment of the present application correspond to steps in the training method of the style conversion model of each embodiment of the present application, and detailed functional descriptions of each module of the training device of the style conversion model may be specifically referred to descriptions in the corresponding methods shown in the foregoing, and will not be repeated herein.
An embodiment of the present application provides an image style conversion device, as shown in fig. 12, which may include: an image determination module 201, and an inference module 202, wherein,
An image determination module 201 for determining a content image and a style image;
The inference module 202 is configured to take the content image as the original image and the style image as the reference image, input them into a style conversion model trained based on the method described in the above embodiments, and obtain a style content image that is output by the style conversion model and matches the content image and the style image.
The embodiment of the application provides an electronic device, which includes a memory, a processor and a computer program stored on the memory, where the processor executes the computer program to implement the steps of the training method of the style conversion model or of the image style conversion method. Compared with the related art, the following can be achieved: the first initial model is trained adversarially with the first sample set; the first generator performs global feature enhancement and local feature enhancement on the original image and the reference image, and fuses the feature-enhanced original image with the reference image to obtain a converted image; when the original image is a sample content image, the converted image is the stylized content image obtained through style migration.
In an alternative embodiment, there is provided an electronic device, as shown in fig. 13, the electronic device 4000 shown in fig. 13 includes: a processor 4001 and a memory 4003. Wherein the processor 4001 is coupled to the memory 4003, such as via a bus 4002. Optionally, the electronic device 4000 may further comprise a transceiver 4004, the transceiver 4004 may be used for data interaction between the electronic device and other electronic devices, such as transmission of data and/or reception of data, etc. It should be noted that, in practical applications, the transceiver 4004 is not limited to one, and the structure of the electronic device 4000 is not limited to the embodiment of the present application.
The processor 4001 may be a CPU (Central Processing Unit), a general-purpose processor, a DSP (Digital Signal Processor), an ASIC (Application Specific Integrated Circuit), an FPGA (Field Programmable Gate Array) or other programmable logic device, a transistor logic device, a hardware component, or any combination thereof. It may implement or perform the various exemplary logic blocks, modules and circuits described in connection with this disclosure. The processor 4001 may also be a combination that implements computing functionality, for example, a combination of one or more microprocessors, or a combination of a DSP and a microprocessor.
Bus 4002 may include a path to transfer information between the aforementioned components. Bus 4002 may be a PCI (Peripheral Component Interconnect) bus, an EISA (Extended Industry Standard Architecture) bus, or the like. The bus 4002 can be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, only one thick line is shown in fig. 13, but this does not mean that there is only one bus or only one type of bus.
Memory 4003 may be, but is not limited to, a ROM (Read-Only Memory) or other type of static storage device that can store static information and instructions, a RAM (Random Access Memory) or other type of dynamic storage device that can store information and instructions, an EEPROM (Electrically Erasable Programmable Read-Only Memory), a CD-ROM (Compact Disc Read-Only Memory) or other optical disc storage, optical disc storage (including compact discs, laser discs, optical discs, digital versatile discs, Blu-ray discs, etc.), magnetic disk storage media, other magnetic storage devices, or any other medium that can be used to carry or store a computer program and that can be read by a computer.
The memory 4003 is used for storing a computer program for executing an embodiment of the present application, and is controlled to be executed by the processor 4001. The processor 4001 is configured to execute a computer program stored in the memory 4003 to realize the steps shown in the foregoing method embodiment.
Embodiments of the present application provide a computer readable storage medium having a computer program stored thereon, which when executed by a processor, implements the steps of the foregoing method embodiments and corresponding content.
The embodiment of the application also provides a computer program product, which comprises a computer program, wherein the computer program can realize the steps and corresponding contents of the embodiment of the method when being executed by a processor.
The terms "first," "second," "third," "fourth," "1," "2," and the like in the description and in the claims and in the above figures, if any, are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate, such that the embodiments of the application described herein may be implemented in other sequences than those illustrated or otherwise described.
It should be understood that, although various operation steps are indicated by arrows in the flowcharts of the embodiments of the present application, the order in which these steps are implemented is not limited to the order indicated by the arrows. In some implementations of embodiments of the application, the implementation steps in the flowcharts may be performed in other orders as desired, unless explicitly stated herein. Furthermore, some or all of the steps in the flowcharts may include multiple sub-steps or multiple stages based on the actual implementation scenario. Some or all of these sub-steps or phases may be performed at the same time, or each of these sub-steps or phases may be performed at different times, respectively. In the case of different execution time, the execution sequence of the sub-steps or stages can be flexibly configured according to the requirement, which is not limited by the embodiment of the present application.
The foregoing is merely an optional implementation of some application scenarios of the present application. It should be noted that other similar implementations adopted by those skilled in the art based on the technical idea of the present application, without departing from it, also fall within the protection scope of the embodiments of the present application.

Claims (14)

1. A method for training a style conversion model, comprising:
determining a first sample set comprising a plurality of sample content images and a plurality of sample style images;
Performing countermeasure training on a first initial model according to a first sample set to obtain a trained first initial model, wherein the first initial model comprises a first generator, and the first generator is used for performing global feature and local feature enhancement on an input original image and a reference image, and fusing the original image with the enhanced features and the reference image to obtain a converted image;
obtaining a style conversion model according to the trained first initial model, wherein the style conversion model comprises a first generator in the trained first initial model;
when the original image is a sample content image, the reference image is a sample style image, the converted image is a stylized content image, and the stylized content image refers to an image obtained by migrating the style of the sample content image into the style of the sample style image;
The first generator is specifically configured to:
For any one image of an input original image and a reference image, encoding the image through a preset encoder, extracting the output of a plurality of convolution layers, and obtaining a plurality of groups of convolution characteristics of the image; the convolution features output by different convolution layers have different channel widths;
Taking a group of convolution features with the largest channel width in the groups of convolution features as reference convolution features, and aligning the channel width of each group of convolution features to the channel width of the reference convolution features;
performing dense representation on all the aligned groups of convolution features to obtain a first dense feature, and performing dense representation on the reference convolution features to obtain a second dense feature;
respectively carrying out feature enhancement on the first dense feature and the second dense feature, and obtaining a first point multiplication feature according to the enhancement feature after the feature enhancement is carried out on the first dense feature and the second dense feature; obtaining a second point multiplication feature according to the enhancement feature obtained by carrying out feature enhancement on the second dense feature and the first dense feature;
And carrying out feature fusion on the first dense feature, the second dense feature, the first dot product feature and the second dot product feature of the input original image and the reference image, and obtaining the converted image according to a feature fusion result.
2. The method of claim 1, wherein obtaining a style conversion model from the trained first initial model comprises:
Determining a second sample set comprising a plurality of sample content images and a sample color conversion image obtained by color-converting each sample content image;
performing stylization processing on each sample content image in the second sample set through a first generator in the trained first initial model to obtain a stylized content image corresponding to each sample content image;
Training a second initial model through the stylized content images and the sample color conversion images corresponding to the sample content images in the second sample set to obtain a trained second initial model, wherein the second initial model is used for extracting image features of the sample color conversion images and the stylized content images, and performing color conversion on the stylized content images according to the image features to obtain color-converted content images;
And obtaining the style conversion model according to the first generator in the trained first initial model and the trained second initial model.
3. The method according to claim 1, wherein the feature enhancement is performed on the first dense feature and the second dense feature, respectively, and a first point multiplication feature is obtained according to the enhancement feature after the feature enhancement is performed on the first dense feature and the second dense feature; and obtaining a second point multiplication feature according to the enhanced feature obtained by enhancing the second dense feature and the first dense feature, wherein the second point multiplication feature comprises:
respectively carrying out nonlinear processing on the first dense feature and the second dense feature through a preset activation function to obtain respective activation values of the first dense feature and the second dense feature;
Performing global feature enhancement and local feature enhancement on the respective activation values of the first dense feature and the second dense feature according to weights of a global attention mechanism and a local attention mechanism to obtain respective enhancement features of the first dense feature and the second dense feature;
performing point multiplication on the activation value of the first dense feature and the enhancement feature of the second dense feature to obtain a first point multiplication feature; and carrying out point multiplication on the activation value of the second dense feature and the enhancement feature of the first dense feature to obtain a second point multiplication feature.
4. A method according to any one of claims 1-3, wherein the performing the countermeasure training on the first initial model according to the first sample set to obtain a trained first initial model comprises:
Determining a first initial model, wherein the first initial model comprises a first generator and a first discriminator, and performing first-stage multi-round iterative training on the first initial model according to a first sample set until a first iteration stop condition is met;
Replacing the first discriminator which meets the first iteration stop condition with a second discriminator with stronger discrimination capability to obtain an updated first initial model;
And performing multi-round iterative training of a second stage on the updated first initial model according to the first sample set until a second iterative stopping condition is met, so as to obtain the trained first initial model.
5. The method of claim 4, wherein each round of iterative training for any one of the first phase and the second phase comprises:
Inputting the original image and the reference image in the first sample set into a first generator to obtain a converted image output by the first generator;
Inputting the converted image into a second generator, obtaining a restored original image output by the second generator, and determining a first difference between the restored original image and the original image;
judging the converted image and the reference image by a corresponding stage judging device to obtain a first probability of identifying the converted image as a non-original image and a second probability of identifying the reference image as an original image;
And determining a first loss function value according to the first probability, the second probability and the first difference, and adjusting parameters of the first initial model according to the first loss function value.
6. The method of claim 5, wherein in a second stage, the original image further comprises a sample-style image;
When the original image is a sample style image, the reference image is a sample content image, and the converted image is a content-style image, where the content-style image refers to an image obtained by transferring the content of the sample content image onto the sample style image.
7. The method of claim 6, wherein the second arbiter comprises a first sub-arbiter and a second sub-arbiter;
the first sub-discriminator is used for discriminating between the stylized content image and the sample style image;
the second sub-discriminator is used for discriminating between the content-style image and the sample content image.
8. The method according to claim 2, wherein the training the second initial model through the stylized content image and the sample color conversion image corresponding to each sample content image in the second sample set to obtain a trained second initial model includes:
And performing multiple rounds of iterative training on the second initial model through the stylized content images and the sample color conversion images corresponding to the sample content images in the second sample set until a third iterative stopping condition is met, wherein each round of iterative training comprises:
Inputting the stylized content image and the sample color conversion image corresponding to the sample content image into a feature extraction module of the second initial model to perform feature extraction to obtain a first feature of the stylized content image and a second feature of the color conversion image;
Inputting the first feature and the second feature into a color conversion module of the second initial model, and performing color conversion on the stylized content image to obtain a stylized content image with the converted colors;
Determining a second difference between the sample color conversion image and the color-converted stylized content image;
And determining a second loss function value according to the second difference, and adjusting parameters of the feature extraction module and the color conversion module according to the second loss function value.
9. An image style conversion method, comprising:
Determining a content image and a style image;
and taking the content image as an original image, inputting the style image as a reference image into a style conversion model trained based on the method of any one of claims 1-8, and obtaining a style content image which is output by the style conversion model and is matched with the content image and the style image.
10. A training device for a style conversion model, comprising:
A first sample set determination module for determining a first sample set comprising a plurality of sample content images and a plurality of sample style images;
The countermeasure training module is used for performing countermeasure training on the first initial model according to the first sample set to obtain a trained first initial model, wherein the first initial model comprises a first generator, and the first generator is used for performing global feature and local feature enhancement on an input original image and a reference image, and fusing the original image with the enhanced features and the reference image to obtain a converted image;
when the original image is a sample content image, the reference image is a sample style image, the converted image is a stylized content image, and the stylized content image refers to an image obtained by migrating the style of the sample content image into the style of the sample style image;
The first generator is specifically configured to:
For any one image of an input original image and a reference image, encoding the image through a preset encoder, extracting the output of a plurality of convolution layers, and obtaining a plurality of groups of convolution characteristics of the image; the convolution features output by different convolution layers have different channel widths;
Taking a group of convolution features with the largest channel width in the groups of convolution features as reference convolution features, and aligning the channel width of each group of convolution features to the channel width of the reference convolution features;
performing dense representation on all the aligned groups of convolution features to obtain a first dense feature, and performing dense representation on the reference convolution features to obtain a second dense feature;
respectively carrying out feature enhancement on the first dense feature and the second dense feature, and obtaining a first point multiplication feature according to the enhancement feature after the feature enhancement is carried out on the first dense feature and the second dense feature; obtaining a second point multiplication feature according to the enhancement feature obtained by carrying out feature enhancement on the second dense feature and the first dense feature;
And carrying out feature fusion on the first dense feature, the second dense feature, the first dot product feature and the second dot product feature of the input original image and the reference image, and obtaining the converted image according to a feature fusion result.
11. An image style conversion device, comprising:
an image determining module for determining a content image and a style image;
The reasoning module is used for taking the content image as an original image, inputting the style image as a reference image into a style conversion model trained based on the method of any one of claims 1-8, and obtaining a style content image which is output by the style conversion model and is matched with the content image and the style image.
12. An electronic device comprising a memory, a processor and a computer program stored on the memory, characterized in that the processor executes the computer program to carry out the steps of the method according to any one of claims 1-9.
13. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method according to any one of claims 1-9.
14. A computer program product comprising a computer program, characterized in that the computer program, when being executed by a processor, implements the steps of the method according to any one of claims 1-9.
CN202410068980.2A 2024-01-17 2024-01-17 Model training method, image style conversion method and device and electronic equipment Active CN117611434B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410068980.2A CN117611434B (en) 2024-01-17 2024-01-17 Model training method, image style conversion method and device and electronic equipment


Publications (2)

Publication Number Publication Date
CN117611434A CN117611434A (en) 2024-02-27
CN117611434B true CN117611434B (en) 2024-05-07

Family

ID=89958143

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410068980.2A Active CN117611434B (en) 2024-01-17 2024-01-17 Model training method, image style conversion method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN117611434B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113538224A (en) * 2021-09-14 2021-10-22 深圳市安软科技股份有限公司 Image style migration method and device based on generation countermeasure network and related equipment
CN115171023A (en) * 2022-07-20 2022-10-11 广州虎牙科技有限公司 Style migration model training method, video processing method and related device
KR20230068062A (en) * 2021-11-10 2023-05-17 서울과학기술대학교 산학협력단 Device for generating style image
WO2023125374A1 (en) * 2021-12-29 2023-07-06 北京字跳网络技术有限公司 Image processing method and apparatus, electronic device, and storage medium
CN116862760A (en) * 2023-07-10 2023-10-10 中国人民解放军海军工程大学 Image conversion model processing method and device, electronic equipment and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220156987A1 (en) * 2020-11-16 2022-05-19 Disney Enterprises, Inc. Adaptive convolutions in neural networks




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant