CN117611434A - Model training method, image style conversion method and device and electronic equipment - Google Patents

Model training method, image style conversion method and device and electronic equipment

Info

Publication number
CN117611434A
CN117611434A CN202410068980.2A
Authority
CN
China
Prior art keywords
image
feature
style
sample
content
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202410068980.2A
Other languages
Chinese (zh)
Other versions
CN117611434B (en)
Inventor
周洲
樊艳波
伍洋
孙钟前
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202410068980.2A priority Critical patent/CN117611434B/en
Publication of CN117611434A publication Critical patent/CN117611434A/en
Application granted granted Critical
Publication of CN117611434B publication Critical patent/CN117611434B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0475Generative networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/094Adversarial learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/44Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Image Analysis (AREA)

Abstract

The embodiments of the present application provide a model training method, an image style conversion method and apparatus, an electronic device, and a computer-readable storage medium, and relate to the technical fields of artificial intelligence and image processing. The method comprises the following steps: determining a first sample set comprising a plurality of sample content images and a plurality of sample style images; performing adversarial training on a first initial model according to the first sample set to obtain a trained first initial model, wherein the first initial model comprises a first generator; and obtaining a style conversion model according to the trained first initial model, wherein the style conversion model comprises the first generator of the trained first initial model. The embodiments of the present application can balance global style migration with the retention of local details of the content image.

Description

Model training method, image style conversion method and device and electronic equipment
Technical Field
The present application relates to the technical fields of image processing and artificial intelligence, and in particular to a model training method, an image style conversion method and apparatus, an electronic device, and a storage medium.
Background
With the rapid development of communication and computer technology, image processing technology based on computers and communication has also developed rapidly and has been applied in various fields. For example, images can be style-converted using image processing techniques.
Style conversion of an image means preserving the content of the original image as much as possible while converting its style into the style of a reference image.
The related art handles image details poorly when performing style conversion on an image.
Disclosure of Invention
Embodiments of the present application provide a model training method, an image style conversion method, an apparatus, an electronic device, a computer readable storage medium, and a computer program product, which can solve the above-mentioned problems in the prior art. The technical scheme is as follows:
according to a first aspect of an embodiment of the present application, there is provided a training method of a style conversion model, including:
determining a first sample set comprising a plurality of sample content images and a plurality of sample style images;
performing adversarial training on a first initial model according to a first sample set to obtain a trained first initial model, wherein the first initial model comprises a first generator, and the first generator is used for performing global feature enhancement and local feature enhancement on both an input original image and a reference image, and fusing the feature-enhanced original image and reference image to obtain a converted image;
Obtaining a style conversion model according to the trained first initial model, wherein the style conversion model comprises a first generator in the trained first initial model;
when the original image is a sample content image, the reference image is a sample style image, and the converted image is a stylized content image, wherein the stylized content image refers to an image obtained by shifting the style of the sample content image to the style of the sample style image.
According to a second aspect of the embodiments of the present application, there is provided an image style conversion method, including:
determining a content image and a style image;
and taking the content image as an original image and the style image as a reference image, inputting them into a style conversion model trained based on the method provided in the first aspect, and obtaining a stylized content image which is output by the style conversion model and matches the content image and the style image.
According to a third aspect of embodiments of the present application, there is provided a training apparatus for a style conversion model, the apparatus including:
a first sample set determination module for determining a first sample set comprising a plurality of sample content images and a plurality of sample style images;
The adversarial training module is used for performing adversarial training on the first initial model according to the first sample set to obtain a trained first initial model, wherein the first initial model comprises a first generator, and the first generator is used for performing global feature enhancement and local feature enhancement on both an input original image and a reference image, and fusing the feature-enhanced original image and reference image to obtain a converted image;
the model construction module is used for obtaining a style conversion model according to the trained first initial model, and the style conversion model comprises a first generator in the trained first initial model;
when the original image is a sample content image, the reference image is a sample style image, and the converted image is a stylized content image, wherein the stylized content image refers to an image obtained by shifting the style of the sample content image to the style of the sample style image.
According to a fourth aspect of embodiments of the present application, there is provided an image style conversion apparatus, including:
an image determining module for determining a content image and a style image;
and the reasoning module is used for taking the content image as an original image and the style image as a reference image, inputting them into a style conversion model trained based on the method provided in the first aspect, and obtaining the stylized content image which is output by the style conversion model and matches the content image and the style image.
According to a fifth aspect of embodiments of the present application, there is provided an electronic device comprising a memory, a processor and a computer program stored on the memory, the processor executing the computer program to implement the steps of the method provided in the first or second aspect above.
According to a sixth aspect of embodiments of the present application, there is provided a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the method provided in the first or second aspect described above.
According to a seventh aspect of embodiments of the present application, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the steps of the method provided in the first or second aspect described above.
The beneficial effects that technical scheme that this application embodiment provided brought are:
the first initial model is adversarially trained on the first sample set; the first generator performs global feature enhancement and local feature enhancement on both the original image and the reference image, and the feature-enhanced original image and reference image are fused to obtain a converted image; when the original image is a sample content image, the converted image is a stylized content image obtained through style migration, so that global style migration is balanced with the retention of local details of the content image.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings that are required to be used in the description of the embodiments of the present application will be briefly described below.
FIG. 1 is a schematic diagram of an implementation environment provided by an embodiment of the present application;
FIG. 2 is a schematic flow chart of a model training method according to an embodiment of the present application;
fig. 3 is a schematic structural diagram of a style conversion model according to an embodiment of the present application;
fig. 4 is a schematic process flow diagram of a first generator according to an embodiment of the present application;
fig. 5 is a schematic diagram of a training flow of a first initial model according to an embodiment of the present application;
FIG. 6 is a schematic diagram of a training process of a first initial model according to an embodiment of the present application;
FIG. 7 is a schematic diagram of a training process of a second initial model according to an embodiment of the present application;
fig. 8 is a flow chart of a training method of a style conversion model according to an embodiment of the present application;
fig. 9 is a flowchart of an image style conversion method according to an embodiment of the present application;
fig. 10 is a schematic diagram of an application scenario provided in an embodiment of the present application;
FIG. 11 is a schematic structural diagram of a training device for a style conversion model according to an embodiment of the present application;
Fig. 12 is a schematic structural diagram of an image style conversion device according to an embodiment of the present application;
fig. 13 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
Embodiments of the present application are described below with reference to the drawings in the present application. It should be understood that the embodiments described below with reference to the drawings are exemplary descriptions for explaining the technical solutions of the embodiments of the present application, and the technical solutions of the embodiments of the present application are not limited.
As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless expressly stated otherwise, as understood by those skilled in the art. It will be further understood that the terms "comprises" and "comprising," when used in this application, specify the presence of stated features, information, data, steps, operations, elements, and/or components, but do not preclude the presence or addition of other features, information, data, steps, operations, elements, components, and/or groups thereof, all of which may be included in the present application. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element or intervening elements may be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or wirelessly coupled. The term "and/or" as used herein indicates that at least one of the items defined by the term, e.g., "a and/or B" may be implemented as "a", or as "B", or as "a and B".
For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
Several terms which are referred to in this application are first introduced and explained:
artificial intelligence (Artificial Intelligence, AI): the system is a theory, a method, a technology and an application system which simulate, extend and extend human intelligence by using a digital computer or a machine controlled by the digital computer, sense environment, acquire knowledge and acquire an optimal result by using the knowledge. In other words, artificial intelligence is an integrated technology of computer science that attempts to understand the essence of intelligence and to produce a new intelligent machine that can react in a similar way to human intelligence. Artificial intelligence, i.e. research on design principles and implementation methods of various intelligent machines, enables the machines to have functions of sensing, reasoning and decision. The artificial intelligence technology is a comprehensive subject, and relates to the technology with wide fields, namely the technology with a hardware level and the technology with a software level. Artificial intelligence infrastructure technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and other directions. Embodiments of the present application relate to the field of image processing in artificial intelligence technology.
Computer Vision technology (CV): the method is a science for researching how to make the machine "look at", and further means that a camera and a computer are used to replace human eyes to recognize and measure targets and other machine vision, and further graphic processing is performed, so that the computer is used to process images which are more suitable for human eyes to observe or transmit to an instrument to detect. As a scientific discipline, computer vision research-related theory and technology has attempted to build artificial intelligence systems that can acquire information from images or multidimensional data. Computer vision techniques typically include image processing, image detection, image semantic understanding, image retrieval, OCR (Optical Character Recognition ), video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D (three Dimensional, three-dimensional) techniques, virtual reality, augmented reality, and map construction, among others, as well as biometric recognition techniques. The embodiment of the application can be applied to the migration of artistic styles to artistic creation scenes by carrying out style conversion on the content images, automatically generate images of different styles according to artistic drafts, and can also be applied to the augmentation of synthetic data, add more details to images which are not fine enough, such as the rapid generation of game assets and the rapid generation of portraits of different styles.
Generative adversarial network (GAN): a generative learning method for fitting a data distribution. The technique mainly comprises two neural networks, a generator and a discriminator; the generator is used to fit the data distribution, and the discriminator is used to distinguish whether the current data comes from the fitted distribution or the real data distribution. Through the adversarial learning between the generator and the discriminator, the generator gradually evolves until the data it generates can pass for real data.
Some schemes in the related art overemphasize local content details, which may prevent sufficient style information from being migrated, while others model only global features and ignore local details.
The present application provides a model training method, an image style conversion method, an apparatus, an electronic device, a computer readable storage medium, and a computer program product, which aim to solve the above technical problems in the prior art.
The technical solutions of the embodiments of the present application and technical effects produced by the technical solutions of the present application are described below by describing several exemplary embodiments. It should be noted that the following embodiments may be referred to, or combined with each other, and the description will not be repeated for the same terms, similar features, similar implementation steps, and the like in different embodiments.
Fig. 1 shows a schematic diagram of an implementation environment provided by an embodiment of the present application. The implementation environment comprises: a terminal 11 and a server 12.
The training method of the style conversion model provided in the embodiment of the present application may be executed by the terminal 11, may be executed by the server 12, or may be executed by both the terminal 11 and the server 12, which is not limited in the embodiment of the present application. For the case that the training method of the style conversion model provided in the embodiment of the present application is executed by the terminal 11 and the server 12 together, the server 12 takes on the primary computing work, and the terminal 11 takes on the secondary computing work; alternatively, the server 12 takes on secondary computing work and the terminal 11 takes on primary computing work; alternatively, a distributed computing architecture is used for collaborative computing between the server 12 and the terminal 11.
The image style conversion method provided in the embodiment of the present application may be executed by the terminal 11, may be executed by the server 12, or may be executed by both the terminal 11 and the server 12, which is not limited in the embodiment of the present application. For the case where the image style conversion method provided in the embodiment of the present application is executed by the terminal 11 and the server 12 together, the server 12 takes on primary computing work, and the terminal 11 takes on secondary computing work; alternatively, the server 12 takes on secondary computing work and the terminal 11 takes on primary computing work; alternatively, a distributed computing architecture is used for collaborative computing between the server 12 and the terminal 11.
The execution device of the training method of the style conversion model and the execution device of the image style conversion method may be the same or different, which is not limited in the embodiment of the present application.
In one possible implementation, the terminal 11 may be any electronic product that can perform man-machine interaction with a user through one or more modes of a keyboard, a touch pad, a touch screen, a remote controller, a voice interaction or handwriting device, such as a PC (Personal Computer ), a mobile phone, a smart phone, a PDA (Personal Digital Assistant, a personal digital assistant), a wearable device, a PPC (Pocket PC), a tablet computer, a smart car machine, a smart television, a smart speaker, a smart voice interaction device, a smart home appliance, a car terminal, etc. The server 12 may be a server, a server cluster comprising a plurality of servers, or a cloud computing service center. The terminal 11 establishes a communication connection with the server 12 through a wired or wireless network.
Those skilled in the art will appreciate that the above-described terminal 11 and server 12 are by way of example only, and that other terminals or servers, either now present or later, may be suitable for use in the present application, and are intended to be within the scope of the present application and are incorporated herein by reference.
The image style conversion method provided by the application can be applied to artistic creation scenes and game making scenes.
For example, in an artistic creation scenario, the methods provided herein may perform style conversion processing on an artistic draft to process the artistic draft into an image of a specified artistic style. For example, the art draft may refer to a photographed photo or a pattern drawn by a drawing program.
In a game making scene, the method provided by the application can perform style conversion processing on the game resources so as to convert the game resources into different styles, and further realize rapid generation of the game resources. For example, the game resource may refer to a avatar or a dressing of a virtual object, or may refer to a virtual environment.
The image style conversion method can be applied to scenes such as old photo restoration, movie restoration, cartoon restoration, video call quality improvement and the like.
Video calls, short-video content production and the like are playing an increasingly important role in people's daily lives. However, in some cases the ambient lighting is poor, and the video captured during a call tends to be dim, which degrades the user experience. Therefore, the image style conversion method of the embodiments of the present application can be applied to improve the image quality and brightness of dim pictures in a video call.
In the video restoration process, each video frame in the video is used as an image to be processed, and the restoration of the video is realized by performing style conversion processing on each video frame.
The embodiment of the application provides a model training method, as shown in fig. 2, which includes S101 to S103, specifically:
s101, determining a first sample set, the first sample set including a plurality of sample content images and a plurality of sample style images.
The content in each sample content image in the embodiment of the present application is different, and the content in the content image in the embodiment of the present application is not particularly limited, and may be, for example, a text, a character, a landscape, an animal, a robot, equipment, a building, and the like.
The styles of the sample style images in the embodiments of the present application are different. A style expresses the artistic form of an image, such as watercolor, black-and-white, or Impressionism; the embodiments of the present application do not specifically limit the styles of the sample style images.
S102, performing adversarial training on a first initial model according to the first sample set to obtain a trained first initial model, wherein the first initial model comprises a first generator, and the first generator is used for performing global feature enhancement and local feature enhancement on both an input original image and a reference image, and fusing the feature-enhanced original image and reference image to obtain a converted image.
In the embodiments of the present application, when the first initial model is adversarially trained according to the first sample set, one or more image pairs are input into the first initial model each time, and each image pair comprises an original image and a reference image. Because the scheme performs style conversion on a content image at inference time, the original image can be a sample content image during training, in which case the reference image is a sample style image.
It should be noted that the original image and the reference image in an image pair are randomly selected; the content of the original image and the reference image need not be the same, but their styles must be different. Moreover, when the original image is a sample content image, the reference image must be a sample style image. By pairing different sample content images with different sample style images, different image pairs are formed to adversarially train the first initial model, so that the first initial model can fully learn the content features of the content images and the style features of the style images.
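For illustration only, the following is a minimal sketch of such random pairing (PyTorch-style; the class name, arguments and data layout are assumptions, not part of the application):

```python
import random
from torch.utils.data import Dataset

class ContentStylePairs(Dataset):
    """Randomly pairs sample content images (originals) with sample style images (references)."""

    def __init__(self, content_images, style_images, transform=None):
        self.content_images = content_images   # e.g. a list of PIL images or tensors
        self.style_images = style_images
        self.transform = transform

    def __len__(self):
        return len(self.content_images)

    def __getitem__(self, idx):
        original = self.content_images[idx]           # original image: a sample content image
        reference = random.choice(self.style_images)  # reference image: a randomly chosen sample style image
        if self.transform is not None:
            original = self.transform(original)
            reference = self.transform(reference)
        return original, reference
```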
The training mode of the first initial model in this embodiment is adversarial training, which means that a Generator and a Discriminator compete with each other to learn the data distribution: the generator is responsible for learning to generate data similar to real data from random noise, and the discriminator is responsible for distinguishing generated data from real data.
The first initial model comprises a first generator. Aiming at the problem that some prior-art schemes overemphasize local content details and thereby prevent sufficient style information from migrating, and the problem that other schemes model only global features and ignore local details, the first generator performs global feature enhancement and local feature enhancement on both the input original image and the input reference image, and fuses the feature-enhanced original image and reference image to obtain a converted image. When the original image is a sample content image, the converted image is a stylized content image, that is, an image obtained by migrating the style of the sample content image to the style of the sample style image.
S103, obtaining a style conversion model according to the trained first initial model, wherein the style conversion model comprises a first generator in the trained first initial model.
In the embodiments of the present application, the first generator of the trained first initial model can be used as the style conversion model, so that at inference time the content image and the style image are input into the style conversion model, and the stylized content image output by the style conversion model is taken as the result of the style conversion.
According to the above method, the first initial model is adversarially trained on the first sample set, the first generator enhances the global features and local features of the original image and the reference image, and the feature-enhanced original image and reference image are fused to obtain the converted image; when the original image is a sample content image, the converted image is a stylized content image after style migration, which balances global style migration with the retention of local details of the content image.
Based on the foregoing embodiments, in the embodiments of the present application, obtaining a style conversion model according to the trained first initial model further comprises:
s201, determining a second sample set including a plurality of sample content images, and a sample color conversion image in which each sample content image is color-converted.
Compared with the first sample set, the second sample set determined in the present application differs in two respects. On the one hand, the amount of data is greatly reduced, that is, the size of the second sample set is smaller, or far smaller, than that of the first sample set, so that only a small number of training samples are used to fine-tune the details of style conversion, and too many training samples are not needed. On the other hand, the sample images in the second sample set have a correspondence: each sample content image corresponds to at least one sample color conversion image, and the sample color conversion image is an image obtained by color-converting the corresponding sample content image. Color conversion adjusts some or all of the pixel values of the sample content image according to a certain rule; the embodiments of the present application do not limit this rule, which may, for example, convert the image to a grayscale image. Because a sample content image and its sample color conversion image correspond to each other and differ only in pixel color, the trained second initial model can learn pixel-level detail information through training.
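As one possible way of building this correspondence, the sketch below uses grayscale conversion as the color-conversion rule; the helper name, directory layout and file naming are hypothetical:

```python
from pathlib import Path
from PIL import Image

def build_color_converted_samples(content_dir: str, out_dir: str) -> None:
    """Creates one sample color conversion image per sample content image.

    Grayscale conversion is only the example rule mentioned in the text;
    any pixel-value mapping could be substituted.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for path in sorted(Path(content_dir).glob("*.png")):
        image = Image.open(path).convert("RGB")
        converted = image.convert("L").convert("RGB")  # grayscale, kept as 3 channels
        converted.save(out / path.name)                # same file name preserves the correspondence
```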
S202, performing stylization processing on each sample content image in the second sample set through a first generator in the trained first initial model, and obtaining stylized content images corresponding to each sample content image.
In the embodiments of the present application, the first generator of the first initial model trained in the above embodiments is used to perform style migration on each sample content image in the second sample set, so as to obtain the corresponding stylized content images.
S203, training a second initial model through the stylized content images and the sample color conversion images corresponding to the sample content images in the second sample set to obtain a trained second initial model, wherein the second initial model is used for extracting image characteristics of the sample color conversion images and the stylized content images, and performing color conversion on the stylized content images according to the image characteristics to obtain color-converted content images;
in the embodiments of the present application, the second initial model performs feature extraction and feature fusion on the stylized content image and the sample color conversion image of each sample content image, so that the second initial model can adaptively adjust colors according to the style image and obtain a fine-tuned stylized content image.
S204, obtaining a style conversion model according to the first generator in the trained first initial model and the trained second initial model.
After the second initial model is trained, an image style conversion model can be constructed in the embodiments of the present application. Referring to fig. 3, which schematically shows the structure of the style conversion model provided in the embodiments of the present application, the style conversion model comprises the first generator of the first initial model and the trained second initial model. At inference time, the content image and the style image to be processed are input into the first generator to obtain the stylized content image output by the first generator; the stylized content image and the style image are then input into the second initial model, which performs color conversion on the stylized content image according to the extracted image features and finally outputs the color-converted stylized content image.
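The composition just described could be expressed roughly as follows; the module names and call signatures are illustrative assumptions rather than the application's actual interfaces:

```python
import torch
import torch.nn as nn

class StyleConversionModel(nn.Module):
    """Sketch of the composed style conversion model: first generator followed by
    the trained second (color-refinement) model. Both sub-modules are assumed to
    be already trained; their internal structure is not reproduced here."""

    def __init__(self, first_generator: nn.Module, second_model: nn.Module):
        super().__init__()
        self.first_generator = first_generator
        self.second_model = second_model

    @torch.no_grad()
    def forward(self, content_image: torch.Tensor, style_image: torch.Tensor) -> torch.Tensor:
        stylized = self.first_generator(content_image, style_image)  # style migration
        refined = self.second_model(stylized, style_image)           # color refinement
        return refined
```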
On the basis of the above embodiments, as an alternative embodiment, the first generator is specifically configured to execute the following steps S301 to S305:
s301, for any one of an input original image and a reference image, encoding the image through a preset encoder, extracting the output of a plurality of convolution layers, and obtaining a plurality of groups of convolution characteristics of the image.
The type of encoder is not particularly limited in the embodiments of the present application, and for example, a visual geometry group (Visual Geometry Group, VGG) pre-training model may be employed. The VGG model has 5 convolutional layer blocks, the convolutional features of different convolutional layer outputs having different channel widths, specifically:
Convolution layer block 1: consists of two convolution layers and a pooling layer. Convolution layers 1 and 2 each have a kernel size of 3x3 and a channel width of 64; the pooling layer has a size of 2x2 and a stride of 2.
Convolution layer block 2: consists of two convolution layers and a pooling layer. Convolution layers 1 and 2 each have a kernel size of 3x3 and a channel width of 128; the pooling layer has a size of 2x2 and a stride of 2.
Convolution layer block 3: consists of three convolution layers and a pooling layer. Convolution layers 1, 2 and 3 each have a kernel size of 3x3 and a channel width of 256; the pooling layer has a size of 2x2 and a stride of 2.
Convolution layer block 4: consists of three convolution layers and a pooling layer. Convolution layers 1, 2 and 3 each have a kernel size of 3x3 and a channel width of 512; the pooling layer has a size of 2x2 and a stride of 2.
Convolution layer block 5: consists of three convolution layers and a pooling layer. Convolution layers 1, 2 and 3 each have a kernel size of 3x3 and a channel width of 512; the pooling layer has a size of 2x2 and a stride of 2.
In the embodiments of the present application, 3 convolution layers can be selected, for example the first convolution layer of each of convolution layer blocks 2-4, and their outputs form the multiple groups of convolution features of the image.
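For illustration, a minimal sketch of this multi-level feature extraction with a pretrained VGG backbone; the torchvision VGG16 layout and slice indices are assumptions that would need to be checked against the encoder actually used:

```python
import torch
import torch.nn as nn
from torchvision.models import vgg16

class MultiLevelVGGEncoder(nn.Module):
    """Extracts the relu2_1, relu3_1 and relu4_1 activations of a pretrained VGG16.

    The slice indices below assume torchvision's VGG16 feature layout.
    """

    def __init__(self):
        super().__init__()
        feats = vgg16(weights="IMAGENET1K_V1").features.eval()
        for p in feats.parameters():
            p.requires_grad_(False)
        self.to_relu2_1 = feats[:7]    # conv1_1 ... relu2_1 (channel width 128)
        self.to_relu3_1 = feats[7:12]  # ... relu3_1 (channel width 256)
        self.to_relu4_1 = feats[12:19] # ... relu4_1 (channel width 512)

    def forward(self, x: torch.Tensor):
        f_c1 = self.to_relu2_1(x)      # F_c1
        f_c2 = self.to_relu3_1(f_c1)   # F_c2
        f_c3 = self.to_relu4_1(f_c2)   # F_c3 (reference convolution feature)
        return f_c1, f_c2, f_c3
```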
S302, taking a group of convolution features with the largest channel width in each group of convolution features as reference convolution features, and aligning the channel width of each group of convolution features to the channel width of the reference convolution features.
Taking the output of the first convolution layer of convolution layer blocks 2-4 as an example, since the output of the first convolution layer of convolution layer block 4 has the largest channel width, 512, the channel widths of the groups of convolution features are aligned by upsampling the convolution features of convolution layer blocks 2 and 3.
S303, performing dense representation on all the aligned groups of convolution features to obtain a first dense feature, and performing dense representation on the reference convolution features to obtain a second dense feature.
That is, the first dense feature of the present application is obtained based on all the aligned groups of convolution features, while the second dense feature is obtained from the reference convolution feature only.
The embodiments of the present application can perform the dense representation based on a fully connected (Dense) layer, which flattens all the aligned convolution features, that is, converts the high-dimensional feature representation into a low-dimensional vector that can be used for the final classification task. Each connection of the Dense layer has a weight, and the model learns the mapping from input features to output categories by adjusting these weights during training; these weights enable the model to learn appropriate feature representations and classification decisions on the training data.
S304, performing feature enhancement on the first dense feature and the second dense feature respectively; obtaining a first point multiplication feature according to the first dense feature and the enhancement feature obtained by feature-enhancing the second dense feature; and obtaining a second point multiplication feature according to the second dense feature and the enhancement feature obtained by feature-enhancing the first dense feature.
In the embodiments of the present application, performing feature enhancement on the dense features improves the model's understanding of image details, and fusing one dense feature with the enhancement feature of the other dense feature improves the ability to learn feature information of different channel widths.
S305, carrying out feature fusion on the first dense features, the second dense features, the first dot product features and the second dot product features of the input original image and the reference image, and obtaining the converted image according to a feature fusion result.
After the feature fusion result is obtained by the first generator, the feature fusion result can be decoded by a decoder in the first generator, and then a converted image is obtained.
On the basis of the above embodiments, step S304 further includes:
s3041, respectively performing nonlinear processing on the first dense feature and the second dense feature through a preset activation function to obtain respective activation values of the first dense feature and the second dense feature;
in the embodiments of the present application, a nonlinear activation function is applied after the Dense layer to perform nonlinear processing on the first dense feature and the second dense feature, which increases the representation capacity of the model and introduces nonlinear factors, enabling more complex classification problems to be solved.
S3042, carrying out global feature enhancement and local feature enhancement on the respective activation values of the first dense feature and the second dense feature according to the weights of the global attention mechanism and the local attention mechanism, and obtaining respective enhancement features of the first dense feature and the second dense feature.
A cross-gating mechanism is typically applied to low-level vision, which is a branch of computer vision that focuses on improving the overall viewing experience of an image (i.e., visual enhancement). Whereas "mid- and high-level vision" focuses on enabling the computer to understand the content of an image, low-level vision aims to solve image quality problems such as sharpness, color and temporal consistency. In the embodiments of the present application, through the global attention mechanism and the local attention mechanism of the cross-gating mechanism, the activation values of the first dense feature and the second dense feature are enhanced with the weights of the global attention mechanism and the local attention mechanism, so as to obtain the enhancement features of the first dense feature and the second dense feature.
S3043, performing point multiplication on the activation value of the first dense feature and the enhancement feature of the second dense feature to obtain a first point multiplication feature; and carrying out point multiplication on the activation value of the second dense feature and the enhancement feature of the first dense feature to obtain a second point multiplication feature.
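A rough sketch of these three sub-steps is given below; the internal structure chosen for the global and local attention is an assumption, since the text does not specify it:

```python
import torch
import torch.nn as nn

class CrossGateBlock(nn.Module):
    """Rough sketch of steps S3041-S3043 on two dense feature maps of shape (B, C, H, W).

    Global attention is approximated here by channel attention over pooled statistics
    and local attention by a depth-wise convolution; both choices are assumptions.
    """

    def __init__(self, dim: int):
        super().__init__()
        self.w1 = nn.Conv2d(dim, dim, kernel_size=1)   # Dense weighting of the first dense feature
        self.w2 = nn.Conv2d(dim, dim, kernel_size=1)   # Dense weighting of the second dense feature
        self.act = nn.GELU()                           # preset activation function (GELU)
        self.global_attn = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Conv2d(dim, dim, 1), nn.Sigmoid())
        self.local_attn = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)

    def gate(self, f: torch.Tensor) -> torch.Tensor:
        # G(.): global-feature enhancement plus local-feature enhancement
        return self.global_attn(f) * f + self.local_attn(f)

    def forward(self, f1: torch.Tensor, f2: torch.Tensor):
        # S3041: activation values of the two dense features
        f1_act, f2_act = self.act(self.w1(f1)), self.act(self.w2(f2))
        # S3042/S3043: each activation is point-multiplied with the other feature's enhancement
        f1_pm = f1_act * self.gate(f2_act)   # first point multiplication feature
        f2_pm = f2_act * self.gate(f1_act)   # second point multiplication feature
        return f1_pm, f2_pm
```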
Referring to fig. 4, a schematic process flow diagram of a first generator of the present application is schematically shown, and includes:
For either one of the input original image and the reference image, the image is encoded by the VGG model, and the convolution features output by the relu2_1, relu3_1 and relu4_1 layers are extracted, denoted F_c1, F_c2 and F_c3, respectively.
The convolution features of the relu2_1 and relu3_1 layers, i.e. F_c1 and F_c2, are upsampled to obtain the upsampling results U_1(F_c1) and U_2(F_c2), where U_1 and U_2 denote different upsampling functions, so that the channel widths of the relu2_1 and relu3_1 convolution features are aligned to the channel width of the relu4_1 convolution feature F_c3. The three aligned groups of convolution features are then input into the first Dense layer for dense representation to obtain the first dense feature F_1, and the convolution feature of the relu4_1 layer is input into the first Dense layer for dense representation to obtain the second dense feature F_2. The calculation can be expressed as:
F_1 = Dense_1[U_1(F_c1), U_2(F_c2), F_c3];
F_2 = Dense_1[F_c3];
where Dense_1 denotes the first Dense function.
The first dense feature F_1 and the second dense feature F_2 are each input into the second Dense layer for weighting, and the weighted results are nonlinearly processed by Gaussian Error Linear Units (GELU) to obtain the activation value F_1' of the first dense feature and the activation value F_2' of the second dense feature:
F_1' = σ(W_1(F_1));
F_2' = σ(W_2(F_2));
where σ denotes the GELU function, and W_1 and W_2 denote the weights of the second Dense layer (Dense_2) applied to F_1 and F_2, respectively.
The activation values of the first dense feature and the second dense feature are enhanced with the weights of the global attention mechanism and the local attention mechanism to obtain the enhancement feature G(F_1') of the first dense feature and the enhancement feature G(F_2') of the second dense feature, where G denotes the cross-gate module.
The activation value of the first dense feature is point-multiplied with the enhancement feature of the second dense feature to obtain the first point multiplication feature F_1''; the activation value of the second dense feature is point-multiplied with the enhancement feature of the first dense feature to obtain the second point multiplication feature F_2'':
F_1'' = F_1' ⊙ G(F_2');
F_2'' = F_2' ⊙ G(F_1');
where ⊙ denotes point (element-wise) multiplication.
The first point multiplication feature F_1'' and the second point multiplication feature F_2'' are each input into the third Dense layer for weighting to obtain the weighted results W_3(F_1'') and W_4(F_2'').
The first dense feature F_1, the second dense feature F_2 and the weighted results W_3(F_1'') and W_4(F_2'') of each of the original image and the reference image are feature-fused to obtain the image feature F of the converted image:
F = AdaIN(F_1 + F_2 + W_3(F_1'') + W_4(F_2''));
where AdaIN is a feature fusion function implemented by instance normalization.
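Putting the pieces together, the fusion described by the last formula might be sketched as follows for a single image branch; the decoder and the combination of the original-image and reference-image branches are omitted, and the layer choices and names are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def adain_norm(x: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    # Instance-normalisation stand-in for the AdaIN fusion function named in the text.
    mean = x.mean(dim=(2, 3), keepdim=True)
    std = x.std(dim=(2, 3), keepdim=True) + eps
    return (x - mean) / std

class GeneratorFeatureHead(nn.Module):
    """Sketch of F = AdaIN(F1 + F2 + W3(F1'') + W4(F2'')) for one image branch.

    `encoder` is assumed to return (F_c1, F_c2, F_c3) as in the VGG sketch above,
    and `cross_gate` to return the two point multiplication features as in the
    cross-gating sketch; these interfaces are assumptions.
    """

    def __init__(self, encoder: nn.Module, cross_gate: nn.Module, dim: int = 512):
        super().__init__()
        self.encoder = encoder
        self.cross_gate = cross_gate
        self.u1 = nn.Conv2d(128, dim, kernel_size=1)              # U_1: aligns F_c1's channel width to 512
        self.u2 = nn.Conv2d(256, dim, kernel_size=1)              # U_2: aligns F_c2's channel width to 512
        self.dense1_all = nn.Conv2d(3 * dim, dim, kernel_size=1)  # Dense_1 over all aligned features -> F1
        self.dense1_ref = nn.Conv2d(dim, dim, kernel_size=1)      # Dense_1 over F_c3 only -> F2
        self.w3 = nn.Conv2d(dim, dim, kernel_size=1)              # third Dense weighting of F1''
        self.w4 = nn.Conv2d(dim, dim, kernel_size=1)              # third Dense weighting of F2''

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        f_c1, f_c2, f_c3 = self.encoder(image)
        size = f_c3.shape[2:]
        # spatial interpolation (added so the features can be concatenated) plus
        # 1x1 convolutions for the channel-width alignment described in the text
        a1 = self.u1(F.interpolate(f_c1, size=size, mode="bilinear", align_corners=False))
        a2 = self.u2(F.interpolate(f_c2, size=size, mode="bilinear", align_corners=False))
        f1 = self.dense1_all(torch.cat([a1, a2, f_c3], dim=1))  # first dense feature F1
        f2 = self.dense1_ref(f_c3)                              # second dense feature F2
        f1_pm, f2_pm = self.cross_gate(f1, f2)                  # F1'', F2''
        return adain_norm(f1 + f2 + self.w3(f1_pm) + self.w4(f2_pm))
```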
Based on the foregoing embodiments, as an alternative embodiment, performing adversarial training on the first initial model by using the first sample set to obtain a trained first initial model, referring to fig. 5, includes:
determining a first initial model, wherein the first initial model comprises a first generator and a first discriminator, and performing first-stage multi-round iterative training on the first initial model according to a first sample set until a first iteration stop condition is met;
when the first iteration stop condition is met, replacing the first discriminator with a second discriminator having stronger discrimination capability to obtain an updated first initial model;
and performing multi-round iterative training of a second stage on the updated first initial model according to the first sample set until a second iterative stopping condition is met, so as to obtain the trained first initial model.
The first initial model of the embodiments of the present application includes a first generator and a first discriminator. The function of the first generator is described in the above embodiments and is not repeated here. The first discriminator is used to judge whether the converted image is a non-original image and whether the reference image is an original image; through the adversarial training between the first generator and the first discriminator, the first generator learns to generate converted images that are as realistic as possible. When the first iteration stop condition is reached, the first discriminator is replaced with the second discriminator, which has stronger discrimination capability, so that the conversion capability of the first generator is further improved by the second discriminator. When the original image is a sample content image, the conversion capability of the first generator can be understood as its ability to migrate the style of the content image so as to be consistent with the style of the style image. After the multi-round iterative training of the second stage is performed on the updated first initial model, the trained first initial model therefore has stronger style migration capability and fidelity.
In some embodiments, the second discriminator may be the discriminator of a StyleGAN network. StyleGAN is characterized by its ability to generate high-quality, realistic images with better diversity and controllability, which is achieved by introducing two mechanisms: style conversion and noise injection. The style conversion mechanism allows the generator to generate images of different styles from the input style vector, and the noise injection mechanism diversifies the generated images by introducing noise during the generation process.
The first iteration stop condition in the embodiment of the present application may be that the loss function in the first stage converges or the number of iterative training reaches a preset threshold, and the second iteration stop condition in the embodiment of the present application may be that the loss function in the second stage converges or the number of iterative training reaches a preset threshold.
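The two-stage schedule with the discriminator swap could be organized roughly as follows; the helper callables are hypothetical placeholders, not interfaces defined by the application:

```python
import itertools

def train_two_stage(first_initial_model, first_discriminator, second_discriminator,
                    sample_loader, train_one_round, stop_condition):
    """Two-stage adversarial training sketch with a mid-training discriminator swap.

    `train_one_round` stands in for one round of iterative training (see the loss
    sketch further below) and `stop_condition` for the iteration stop conditions
    (loss convergence or an iteration-count threshold).
    """
    # Stage 1: multi-round iterative training against the first discriminator.
    discriminator = first_discriminator
    for round_idx in itertools.count():
        stats = train_one_round(first_initial_model, discriminator, sample_loader)
        if stop_condition(stage=1, round_idx=round_idx, stats=stats):
            break

    # Swap in the stronger discriminator (e.g. a StyleGAN discriminator).
    discriminator = second_discriminator

    # Stage 2: continue multi-round iterative training until the second stop condition is met.
    for round_idx in itertools.count():
        stats = train_one_round(first_initial_model, discriminator, sample_loader)
        if stop_condition(stage=2, round_idx=round_idx, stats=stats):
            break

    return first_initial_model
```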
On the basis of the above embodiments, as an alternative embodiment, each round of iterative training for any one of the first phase and the second phase includes:
inputting the original image and the reference image in the first sample set into a first generator to obtain a converted image output by the first generator;
inputting the converted image into a second generator, obtaining a restored original image output by the second generator, and determining a first difference between the restored original image and the original image;
discriminating the converted image and the reference image by the discriminator of the corresponding stage to obtain a first probability of identifying the converted image as a non-original image and a second probability of identifying the reference image as an original image;
and determining a first loss function value according to the first probability, the second probability and the first difference, and adjusting parameters of the first initial model according to the first loss function value.
In this embodiment of the present application, the first initial model includes two generators and a discriminator. The first generator performs global feature enhancement and local feature enhancement on both the input original image and the input reference image and fuses the feature-enhanced original image and reference image to obtain a converted image; the second generator restores the converted image to obtain a restored original image, and the performance of the first generator is optimized based on the restoration effect.
The first difference L_rec between the restored original image and the original image can be expressed as follows:
L_rec = E[|| F(G(I_original)) - I_original ||_1]
where I_original denotes the original image, G(I_original) denotes the converted image generated by the first generator from the original image, and F(G(I_original)) denotes the restored original image generated by the second generator from the converted image.
The embodiments of the present application obtain the adversarial loss according to the first probability and the second probability, and the adversarial loss can be expressed by the following formula:
L_adv = E[log D(I_reference)] + E[log(1 - D(G(I_original)))]
where I_reference denotes the reference image, 1 - D(G(I_original)) denotes the first probability, and D(I_reference) denotes the second probability.
Combining the above formulas of the adversarial loss and the cycle-consistency loss yields the expression of the first loss function value of the present application:
L_1 = L_rec + L_adv
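A literal, hedged sketch of computing this first loss function value; the probability outputs and the split of the adversarial term between generator and discriminator updates (the usual minimax alternation) are assumptions left to the surrounding training code:

```python
import torch

def first_loss_value(first_generator, second_generator, discriminator,
                     original, reference, eps: float = 1e-6) -> torch.Tensor:
    """Literal sketch of L_1 = L_rec + L_adv from the formulas above."""
    converted = first_generator(original, reference)   # G(I_original)
    restored = second_generator(converted)             # F(G(I_original))

    # L_rec = E[||F(G(I_original)) - I_original||_1]
    l_rec = (restored - original).abs().mean()

    d_real = discriminator(reference)                  # D(I_reference): the second probability
    d_fake = discriminator(converted)                  # D(G(I_original)); the first probability is 1 - d_fake
    # L_adv = E[log D(I_reference)] + E[log(1 - D(G(I_original)))]
    l_adv = torch.log(d_real + eps).mean() + torch.log(1.0 - d_fake + eps).mean()

    return l_rec + l_adv
```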
referring to fig. 6, a schematic diagram of a training process of a first initial model provided in an embodiment of the present application is shown, where the training process includes:
taking the sample content image as an original imageI Original, original Sample style image as reference imageI Reference to
During the first training stage, the original image is displayedI Original, original Reference imageI Reference to Inputting into a first generator to obtain a converted image output by the first generatorG(I Original, original ) Will transform the imageG(I Original, original ) Inputting the second generator to obtain the restored original image output by the second generatorF(G(I Original, original ) Determining the restored original imageF(G(I Original, original ) And the original imageI Original, original First difference betweenL erc And the first discriminator is used for converting the imageG(I Original, original ) Reference imageI Reference to Discrimination is performed to obtain a first probability [1 ]D(G(I Original, original )]And a second probability of identifying the reference image as the original imageD(I Reference to ) Determining a first loss function value according to the first probability, the second probability and the first difference, adjusting parameters of the first initial model according to the first loss function value, and replacing the first discriminator with the second discriminator when a first iteration stop condition is met, so as to obtain an updated first initial model;
during the second stage training, the original image is displayedI Original, original Inputting into a first generator to obtain a converted image output by the first generatorG(I Original, original ) Will transform the imageG(I Original, original ) Inputting the second generator to obtain the restored original image output by the second generatorF(G(I Original, original ) Determining the restored original imageF(G(I Original, original ) And the original imageI Original, original First difference betweenL erc And discriminating the converted image and the reference image by a first discriminator to obtain a first probability [1 ]D(G(I Original, original )]And a second probability of identifying the reference image as the original imageD(I Reference to ) And determining a first loss function value according to the first probability, the second probability and the first difference, and adjusting the parameters of the updated first initial model according to the first loss function value until a second iteration stop condition is met, so as to obtain the trained first initial model.
On the basis of the above embodiments, as an alternative embodiment, in the second stage, the original image further includes a sample-style image;
when the original image is a sample style image, the reference image is a sample content image and the converted image is a content style image, where the content style image refers to the image obtained after migrating the content of the sample content image to the sample style image.
When the first initial model is trained, the sample style image is also used as the original image and the sample content image as the reference image; in this case the converted image is a content style image, i.e. the image obtained after migrating the content of the sample content image to the sample style image. In this way the first generator also learns content migration, so it understands the content in the content image more accurately, which further reduces the loss of content when the first generator performs style migration and better preserves the details of the resulting stylized content image.
In the embodiment of the application, when one sample content image and one sample style image are input to the first initial model, the first initial model executes two training threads in parallel, one training thread takes the sample content image as an original image, takes the sample style image as a reference image, and the other training thread takes the sample style image as the original image and takes the sample content image as the reference image.
The first difference L_erc between the restored original image and the original image can be expressed as follows:
L_erc = E[||F(G(I_content)) - I_content||_1] + E[||F(G(I_style)) - I_style||_1]
where I_content denotes the sample content image, G(I_content) denotes the stylized content image generated by the first generator from the sample content image, F(G(I_content)) denotes the restored content image recovered by the second generator from the stylized content image, I_style denotes the sample style image, G(I_style) denotes the content style image generated by the first generator from the sample style image, and F(G(I_style)) denotes the restored style image recovered by the second generator from the content style image.
Based on the above embodiments, in the second training stage, the second discriminator includes a first sub-discriminator and a second sub-discriminator;
the first sub-discriminator is used for discriminating the stylized content image and the sample style image;
the second sub-discriminator is used for discriminating the content style image and the sample content image.
The embodiment of the application can obtain the adversarial loss according to the first probability and the second probability, where the adversarial loss can be expressed by the following formula:
L_adv = E[log D(I_content)] + E[log(1 - D(G(I_style)))] + E[log D(I_style)] + E[log(1 - D(G(I_content)))]
where D(I_content) denotes the probability that the second sub-discriminator recognizes the sample content image as an original image, and 1 - D(G(I_style)) denotes the probability that it recognizes the content style image as a non-original image;
D(I_style) denotes the probability that the first sub-discriminator recognizes the sample style image as an original image, and 1 - D(G(I_content)) denotes the probability that it recognizes the stylized content image as a non-original image.
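Under the same assumptions as the earlier sketches, this second-stage adversarial term with two sub-discriminators might be computed as follows; disc_style and disc_content are hypothetical names for the first and second sub-discriminators.

```python
# Sketch of the second-stage adversarial term with two sub-discriminators.
# disc_style judges stylized content images against sample style images;
# disc_content judges content style images against sample content images.
import torch


def second_stage_adversarial_loss(gen_g, disc_style, disc_content,
                                  i_content, i_style, eps=1e-7):
    stylized_content = gen_g(i_content, i_style)   # G(I_content)
    content_style = gen_g(i_style, i_content)      # G(I_style)

    def term(disc, real, fake):
        d_real = disc(real).clamp(eps, 1 - eps)
        d_fake = disc(fake).clamp(eps, 1 - eps)
        return torch.log(d_real).mean() + torch.log(1 - d_fake).mean()

    # Second sub-discriminator: sample content image (real) vs content style image (fake).
    # First sub-discriminator: sample style image (real) vs stylized content image (fake).
    return (term(disc_content, i_content, content_style) +
            term(disc_style, i_style, stylized_content))
```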
On the basis of the foregoing embodiments, as an optional embodiment, training a second initial model by using a stylized content image and a sample color conversion image corresponding to each sample content image in the second sample set, to obtain a trained second initial model, including:
and carrying out multiple rounds of iterative training on the second initial model through the stylized content images and the sample color conversion images corresponding to the sample content images in the second sample set until a third iterative stopping condition is met, wherein each round of iterative training is as shown in fig. 7 and comprises the following steps:
Inputting the stylized content image and the sample color conversion image corresponding to the sample content image into a feature extraction module of the second initial model to perform feature extraction to obtain a first feature of the stylized content image and a second feature of the color conversion image;
inputting the first feature and the second feature into a color conversion module of the second initial model, and performing color conversion on the stylized content image to obtain a stylized content image with the converted colors;
determining a second difference between the color converted image and the color converted stylized content image;
and determining a second loss function value according to the second difference, and adjusting parameters of the feature extraction module and the color conversion module according to the second loss function value.
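One iteration of this training procedure might look like the following sketch; feature_extractor, color_converter and their call signatures are assumptions, the optimizer is assumed to cover the parameters of both modules, and the second difference is realised here as an L1 distance because the embodiment does not fix a particular distance measure.

```python
# Illustrative training iteration for the second initial model (assumed interfaces).
import torch
import torch.nn.functional as F


def second_model_step(feature_extractor, color_converter, optimizer,
                      stylized_content, sample_color_converted):
    # Feature extraction: first feature (stylized content image) and
    # second feature (sample color conversion image).
    first_feature = feature_extractor(stylized_content)
    second_feature = feature_extractor(sample_color_converted)

    # Color conversion of the stylized content image guided by both features
    # (the call signature of the color conversion module is an assumption).
    recolored = color_converter(stylized_content, first_feature, second_feature)

    # Second difference and second loss function value (L1 chosen for illustration).
    loss = F.l1_loss(recolored, sample_color_converted)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```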
Referring to fig. 8, a flow chart of a training method of a style conversion model according to an embodiment of the present application is shown, and as shown in the drawing, the method includes:
s401, determining a first initial model, wherein the first initial model comprises the first generator, a second generator and a first discriminator;
s402, performing a first-stage multi-round iterative training on a first initial model according to a first sample set until a first iterative stopping condition is met, wherein each round of iterative training comprises:
S4021, taking the sample content image as an original image and the sample style image as a reference image, and inputting them into the first generator to obtain the converted image output by the first generator;
s4022, inputting the converted image into a second generator, obtaining a restored original image output by the second generator, and determining a first difference between the restored original image and the original image;
s4023, discriminating the converted image and the reference image through the first discriminator to obtain a first probability of identifying the converted image as a non-original image and a second probability of identifying the reference image as an original image;
s4024, determining a first loss function value according to the first probability, the second probability and the first difference, and adjusting parameters of the first initial model according to the first loss function value;
s403, when the first iteration stop condition is met, replacing the first discriminator with a second discriminator with stronger discrimination capability, wherein the second discriminator comprises a first sub-discriminator and a second sub-discriminator, to obtain an updated first initial model;
s404, performing a second-stage multi-round iterative training on the updated first initial model according to the first sample set until a second iterative stopping condition is met, wherein each round of iterative training comprises:
S4041, taking the sample content image as an original image, and taking the sample style image as a reference image to be input into a first generator to obtain a first converted image output by the first generator;
taking the sample style image as an original image, and taking the sample content image as a reference image to be input into the first generator to obtain a second converted image output by the first generator;
s4042, inputting the first converted image into a second generator, obtaining a restored sample content image output by the second generator, and determining a first sub-difference between the restored sample content image and the sample content image;
inputting a second converted image into a second generator to obtain a restored sample style image output by the second generator, and determining a second sub-difference between the restored sample style image and the sample style image;
s4043, obtaining a first difference according to the first sub-difference and the second sub-difference;
s4044, discriminating the stylized content image and the sample style image through the first sub-discriminator to obtain a first sub-probability of identifying the stylized content image as a non-original image and a second sub-probability of identifying the sample style image as an original image;
discriminating the content style image and the sample content image through the second sub-discriminator to obtain a third sub-probability of identifying the content style image as a non-original image and a fourth sub-probability of identifying the sample content image as an original image;
S4045, obtaining a first probability according to the first sub-probability and the third sub-probability, and obtaining a second probability according to the second sub-probability and the fourth sub-probability;
s4046, determining a first loss function value according to the first probability, the second probability and the first difference, and adjusting parameters of the first initial model according to the first loss function value;
s405, determining a second sample set, wherein the second sample set comprises a plurality of sample content images and sample color conversion images of which each sample content image is subjected to color conversion;
s406, performing stylization processing on each sample content image in the second sample set through a first generator in the trained first initial model to obtain a stylized content image corresponding to each sample content image;
s407, performing iterative training on the second initial model through the stylized content images and the sample color conversion images corresponding to the sample content images in the second sample set until a third iterative stopping condition is met, wherein each round of iterative training comprises:
s4071, inputting the stylized content image and the sample color conversion image corresponding to the sample content image into a feature extraction module of the second initial model to perform feature extraction, and obtaining a first feature of the stylized content image and a second feature of the color conversion image;
S4072, inputting the first feature and the second feature into a color conversion module of the second initial model, and performing color conversion on the stylized content image to obtain a stylized content image with the color converted;
s4073, determining a second difference between the color conversion image and the color converted stylized content image;
s4074, determining a second loss function value according to the second difference, and adjusting parameters of the feature extraction module and the color conversion module according to the second loss function value.
The embodiment of the application also provides an image style conversion method, as shown in fig. 9, including:
s501, determining a content image and a style image;
s502, taking the content image as an original image, taking the style image as a reference image, inputting the style image into a style conversion model trained by the training method based on the style conversion model of each embodiment, and obtaining a style content image which is output by the style conversion model and is matched with the content image and the style image.
It should be appreciated that if the style conversion model includes only the first generator, the style content image output by the style conversion model is a stylized content image.
If the style conversion model comprises a first generator and a trained second initial model, the style content image output by the style conversion model is a stylized content image subjected to color conversion.
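An illustrative inference wrapper reflecting these two cases is sketched below; the module names, the call signatures and in particular the choice of image that supplies the second feature at inference time are assumptions, since the text above only specifies the training-time pairing.

```python
# Illustrative inference wrapper (module names and signatures are assumptions).
import torch


@torch.no_grad()
def convert_style(gen_g, content_image, style_image,
                  feature_extractor=None, color_converter=None,
                  color_reference=None):
    # First generator: content image as original image, style image as reference image.
    stylized = gen_g(content_image, style_image)
    if feature_extractor is None or color_converter is None:
        return stylized          # style conversion model contains the first generator only
    # Optional colour conversion by the trained second initial model; which image
    # supplies the second feature at inference time is an assumption here.
    reference = color_reference if color_reference is not None else content_image
    first_feature = feature_extractor(stylized)
    second_feature = feature_extractor(reference)
    return color_converter(stylized, first_feature, second_feature)
```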
Fig. 10 is a schematic diagram of an application scenario provided in an embodiment of the present application; the application scenario is a video call scenario. As shown in the figure, an operation and maintenance person 21 collects a first sample set and a second sample set, where the first sample set includes a plurality of sample content images and a plurality of sample style images, and the second sample set includes a plurality of sample content images and a sample color conversion image obtained by color-converting each sample content image. The sample content images are self-portraits of a person captured in a dim environment, the sample style images are images with higher brightness, and the sample color conversion images are the sample content images after brightness adjustment.
It should be understood that, in the embodiment of the present application, the relevant data collection process should obtain the informed consent or the individual consent of the personal information body strictly according to the requirements of the relevant national laws and regulations, and develop the subsequent data use and processing actions within the authorized range of the laws and regulations and the personal information body.
The first server 22 obtains the style conversion model through training by using the first sample set and the second sample set according to the training method of the style conversion model provided in the above embodiments of the present application.
The first user 23 logs in to an application program capable of making video calls on the terminal 24 and initiates a video call to the second user 26 through the application program. The terminal 24 sends video frames collected in real time to a background server 25 of the application program; the background server detects the brightness of the video frames and finds that their quality is poor because the frames are dark, so the background server 25 calls the style conversion model in the first server 22 to perform style conversion on the video frames, obtains video frames with a bright style, and sends them to the terminal 27 of the second user 26.
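The brightness check performed by the background server is not detailed in the embodiment; a simple mean-luma threshold such as the following (with an illustrative threshold value) would be one possible way to decide whether a frame needs style conversion.

```python
# One possible brightness gate for video frames (illustrative values only).
import numpy as np


def needs_brightening(frame_rgb: np.ndarray, threshold: float = 60.0) -> bool:
    """frame_rgb: H x W x 3 uint8 array; returns True if the frame is dark."""
    # Rec. 601 luma approximation.
    luma = (0.299 * frame_rgb[..., 0] +
            0.587 * frame_rgb[..., 1] +
            0.114 * frame_rgb[..., 2])
    return float(luma.mean()) < threshold
```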
The embodiment of the application provides a training device for a style conversion model. As shown in fig. 11, the training device for a style conversion model may include: a first sample set determination module 101 and an adversarial training module 102, wherein,
a first sample set determining module 101 for determining a first sample set comprising a plurality of sample content images and a plurality of sample style images;
the countermeasure training module 102 is configured to perform countermeasure training on a first initial model according to a first sample set to obtain a trained first initial model, where the first initial model includes a first generator, and the first generator is configured to perform global feature enhancement and local feature enhancement on both an input original image and a reference image, and fuse the original image and the reference image after feature enhancement to obtain a converted image;
When the original image is a sample content image, the reference image is a sample style image, and the converted image is a stylized content image, wherein the stylized content image refers to an image obtained by shifting the style of the sample content image to the style of the sample style image.
The device of the embodiment of the present application may execute the training method of the style conversion model provided by the embodiment of the present application, and its implementation principle is similar, and actions executed by each module in the device of each embodiment of the present application correspond to steps in the training method of the style conversion model of each embodiment of the present application, and detailed functional descriptions of each module of the training device of the style conversion model may be specifically referred to descriptions in the corresponding methods shown in the foregoing, and will not be repeated herein.
An embodiment of the present application provides an image style conversion device, as shown in fig. 12, the image style conversion device may include: an image determination module 201, and an inference module 202, wherein,
an image determination module 201 for determining a content image and a style image;
the inference module 202 is configured to take the content image as an original image and the style image as a reference image, input them into a style conversion model trained based on the training method of the style conversion model of the above embodiments, and obtain a style content image output by the style conversion model and matching the content image and the style image.
The embodiment of the application provides an electronic device, which includes a memory, a processor and a computer program stored on the memory, where the processor executes the computer program to implement the steps of the training method of the style conversion model or of the image style conversion method. Compared with the related art, the following can be realized: adversarial training is performed on the first initial model through the first sample set; the first generator performs global feature enhancement and local feature enhancement on the original image and the reference image, and fuses the feature-enhanced original image and reference image to obtain a converted image; when the original image is a sample content image, the converted image is a stylized content image obtained by migrating the style of the sample content image to the style of the sample style image.
In an alternative embodiment, there is provided an electronic device, as shown in fig. 13, the electronic device 4000 shown in fig. 13 includes: a processor 4001 and a memory 4003. Wherein the processor 4001 is coupled to the memory 4003, such as via a bus 4002. Optionally, the electronic device 4000 may further comprise a transceiver 4004, the transceiver 4004 may be used for data interaction between the electronic device and other electronic devices, such as transmission of data and/or reception of data, etc. It should be noted that, in practical applications, the transceiver 4004 is not limited to one, and the structure of the electronic device 4000 is not limited to the embodiment of the present application.
The processor 4001 may be a CPU (Central Processing Unit), a general-purpose processor, a DSP (Digital Signal Processor), an ASIC (Application Specific Integrated Circuit), an FPGA (Field Programmable Gate Array) or other programmable logic device, a transistor logic device, a hardware component, or any combination thereof, and may implement or perform the various exemplary logic blocks, modules, and circuits described in connection with this disclosure. The processor 4001 may also be a combination that implements computing functionality, for example a combination of one or more microprocessors, or a combination of a DSP and a microprocessor.
Bus 4002 may include a path to transfer information between the aforementioned components. Bus 4002 may be a PCI (Peripheral Component Interconnect) bus or an EISA (Extended Industry Standard Architecture) bus, among others. The bus 4002 can be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one thick line is shown in fig. 13, but this does not mean that there is only one bus or one type of bus.
Memory 4003 may be, but is not limited to, a ROM (Read Only Memory) or other type of static storage device that can store static information and instructions, a RAM (Random Access Memory) or other type of dynamic storage device that can store information and instructions, an EEPROM (Electrically Erasable Programmable Read Only Memory), a CD-ROM (Compact Disc Read Only Memory) or other optical disc storage, optical disc storage (including compact discs, laser discs, optical discs, digital versatile discs, Blu-ray discs, etc.), magnetic disk storage media, other magnetic storage devices, or any other medium that can be used to carry or store a computer program and that can be read by a computer.
The memory 4003 is used for storing a computer program that executes an embodiment of the present application, and is controlled to be executed by the processor 4001. The processor 4001 is configured to execute a computer program stored in the memory 4003 to realize the steps shown in the foregoing method embodiment.
Embodiments of the present application provide a computer readable storage medium having a computer program stored thereon, where the computer program, when executed by a processor, may implement the steps and corresponding content of the foregoing method embodiments.
The embodiments of the present application also provide a computer program product, which includes a computer program, where the computer program can implement the steps of the foregoing method embodiments and corresponding content when executed by a processor.
The terms "first," "second," "third," "fourth," "1," "2," and the like in the description and in the claims of this application and in the above-described figures, if any, are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the present application described herein may be implemented in other sequences than those illustrated or otherwise described.
It should be understood that, although the flowcharts of the embodiments of the present application indicate the respective operation steps by arrows, the order of implementation of these steps is not limited to the order indicated by the arrows. In some implementations of embodiments of the present application, the implementation steps in the flowcharts may be performed in other orders as desired, unless explicitly stated herein. Furthermore, some or all of the steps in the flowcharts may include multiple sub-steps or multiple stages based on the actual implementation scenario. Some or all of these sub-steps or phases may be performed at the same time, or each of these sub-steps or phases may be performed at different times, respectively. In the case of different execution time, the execution sequence of the sub-steps or stages may be flexibly configured according to the requirement, which is not limited in the embodiment of the present application.
The foregoing is merely an optional implementation manner of the implementation scenario of the application, and it should be noted that, for those skilled in the art, other similar implementation manners based on the technical ideas of the application are adopted without departing from the technical ideas of the application, and also belong to the protection scope of the embodiments of the application.

Claims (15)

1. A method for training a style conversion model, comprising:
determining a first sample set comprising a plurality of sample content images and a plurality of sample style images;
performing adversarial training on a first initial model according to a first sample set to obtain a trained first initial model, wherein the first initial model comprises a first generator, and the first generator is used for performing global feature enhancement and local feature enhancement on an input original image and a reference image, and fusing the feature-enhanced original image and the reference image to obtain a converted image;
obtaining a style conversion model according to the trained first initial model, wherein the style conversion model comprises a first generator in the trained first initial model;
when the original image is a sample content image, the reference image is a sample style image, and the converted image is a stylized content image, wherein the stylized content image refers to an image obtained by shifting the style of the sample content image to the style of the sample style image.
2. The method of claim 1, wherein obtaining a style conversion model from the trained first initial model comprises:
determining a second sample set comprising a plurality of sample content images and a sample color conversion image obtained by color-converting each sample content image;
performing stylization processing on each sample content image in the second sample set through a first generator in the trained first initial model to obtain a stylized content image corresponding to each sample content image;
training a second initial model through the stylized content images and the sample color conversion images corresponding to the sample content images in the second sample set to obtain a trained second initial model, wherein the second initial model is used for extracting image features of the sample color conversion images and the stylized content images, and performing color conversion on the stylized content images according to the image features to obtain color-converted content images;
and obtaining the style conversion model according to the first generator in the trained first initial model and the trained second initial model.
3. The method according to claim 1, wherein the first generator is specifically configured to:
for any one image of an input original image and a reference image, encoding the image through a preset encoder, extracting the output of a plurality of convolution layers, and obtaining a plurality of groups of convolution characteristics of the image; the convolution features output by different convolution layers have different channel widths;
taking a group of convolution features with the largest channel width in the groups of convolution features as reference convolution features, and aligning the channel width of each group of convolution features to the channel width of the reference convolution features;
performing dense representation on all the aligned groups of convolution features to obtain a first dense feature, and performing dense representation on the reference convolution features to obtain a second dense feature;
respectively carrying out feature enhancement on the first dense feature and the second dense feature, and obtaining a first point multiplication feature according to the enhancement feature after the feature enhancement is carried out on the first dense feature and the second dense feature; obtaining a second point multiplication feature according to the enhancement feature obtained by carrying out feature enhancement on the second dense feature and the first dense feature;
and carrying out feature fusion on the first dense feature, the second dense feature, the first dot product feature and the second dot product feature of the input original image and the reference image, and obtaining the converted image according to a feature fusion result.
4. The method according to claim 3, wherein the performing feature enhancement on the first dense feature and the second dense feature respectively, obtaining a first point multiplication feature according to the enhancement features of the first dense feature and the second dense feature after feature enhancement, and obtaining a second point multiplication feature according to the enhancement feature obtained by enhancing the second dense feature and the first dense feature, comprises:
respectively carrying out nonlinear processing on the first dense feature and the second dense feature through a preset activation function to obtain respective activation values of the first dense feature and the second dense feature;
performing global feature enhancement and local feature enhancement on the respective activation values of the first dense feature and the second dense feature according to weights of a global attention mechanism and a local attention mechanism to obtain respective enhancement features of the first dense feature and the second dense feature;
performing point multiplication on the activation value of the first dense feature and the enhancement feature of the second dense feature to obtain a first point multiplication feature; and carrying out point multiplication on the activation value of the second dense feature and the enhancement feature of the first dense feature to obtain a second point multiplication feature.
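For orientation only, the per-image feature pipeline recited in claims 3 and 4 can be pictured with the toy sketch below; every module, channel width and attention operator here is an assumption chosen for brevity, and the final fusion of the four features of both images into the converted image is omitted.

```python
# Highly simplified sketch of the per-image feature pipeline of claims 3 and 4:
# multi-scale conv features -> channel alignment -> dense representations ->
# activation -> global/local enhancement -> the two point-multiplication features.
import torch
import torch.nn as nn
import torch.nn.functional as F


class FeatureBranch(nn.Module):
    """Toy stand-in for the 'preset encoder' and the channel alignment step."""

    def __init__(self, widths=(64, 128, 256)):
        super().__init__()
        convs, in_ch = [], 3
        for w in widths:                       # encoder with several conv layers
            convs.append(nn.Conv2d(in_ch, w, 3, stride=2, padding=1))
            in_ch = w
        self.convs = nn.ModuleList(convs)
        # 1x1 convolutions aligning every scale to the largest channel width.
        self.align = nn.ModuleList([nn.Conv2d(w, widths[-1], 1) for w in widths])

    def forward(self, image):
        feats, x = [], image
        for conv in self.convs:
            x = F.relu(conv(x))
            feats.append(x)                    # multi-scale convolution features
        ref = feats[-1]                        # reference feature (largest channel width)
        size = ref.shape[-2:]
        aligned = [F.interpolate(a(f), size=size, mode="bilinear", align_corners=False)
                   for a, f in zip(self.align, feats)]
        first_dense = torch.stack(aligned).sum(dim=0)   # dense rep. of all aligned features
        second_dense = self.align[-1](ref)              # dense rep. of the reference feature
        return first_dense, second_dense


def enhance_and_cross(first_dense, second_dense, w_global=0.5, w_local=0.5):
    """Toy global/local enhancement followed by the two point-multiplication features."""
    act_1, act_2 = torch.sigmoid(first_dense), torch.sigmoid(second_dense)

    def enhance(x):
        g = x.mean(dim=(2, 3), keepdim=True)         # crude global-attention proxy
        l = F.avg_pool2d(x, 3, stride=1, padding=1)  # crude local-attention proxy
        return w_global * g * x + w_local * l * x

    enh_1, enh_2 = enhance(act_1), enhance(act_2)
    first_dot = act_1 * enh_2    # activation of first dense feature times enhancement of second
    second_dot = act_2 * enh_1   # activation of second dense feature times enhancement of first
    return first_dot, second_dot
```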
5. The method according to any one of claims 1-4, wherein the performing adversarial training on the first initial model according to the first sample set to obtain a trained first initial model includes:
Determining a first initial model, wherein the first initial model comprises a first generator and a first discriminator, and performing first-stage multi-round iterative training on the first initial model according to a first sample set until a first iteration stop condition is met;
when the first iteration stop condition is met, replacing the first discriminator with a second discriminator with stronger discrimination capability to obtain an updated first initial model;
and performing multi-round iterative training of a second stage on the updated first initial model according to the first sample set until a second iterative stopping condition is met, so as to obtain the trained first initial model.
6. The method of claim 5, wherein each round of iterative training for any one of the first phase and the second phase comprises:
inputting the original image and the reference image in the first sample set into a first generator to obtain a converted image output by the first generator;
inputting the converted image into a second generator, obtaining a restored original image output by the second generator, and determining a first difference between the restored original image and the original image;
discriminating the converted image and the reference image by the discriminator of the corresponding stage to obtain a first probability of identifying the converted image as a non-original image and a second probability of identifying the reference image as an original image;
And determining a first loss function value according to the first probability, the second probability and the first difference, and adjusting parameters of the first initial model according to the first loss function value.
7. The method of claim 6, wherein in a second stage, the original image further comprises a sample-style image;
when the original image is a sample style image, the reference image is a sample content image and the converted image is a content style image, and the content style image refers to an image obtained by migrating the content of the sample content image to the sample style image.
8. The method of claim 7, wherein the second discriminator comprises a first sub-discriminator and a second sub-discriminator;
the first sub-discriminator is used for discriminating the stylized content image and the sample style image;
the second sub-discriminator is used for discriminating the content style image and the sample content image.
9. The method according to claim 2, wherein the training the second initial model through the stylized content image and the sample color conversion image corresponding to each sample content image in the second sample set to obtain a trained second initial model includes:
And performing multiple rounds of iterative training on the second initial model through the stylized content images and the sample color conversion images corresponding to the sample content images in the second sample set until a third iterative stopping condition is met, wherein each round of iterative training comprises:
inputting the stylized content image and the sample color conversion image corresponding to the sample content image into a feature extraction module of the second initial model to perform feature extraction to obtain a first feature of the stylized content image and a second feature of the color conversion image;
inputting the first feature and the second feature into a color conversion module of the second initial model, and performing color conversion on the stylized content image to obtain a stylized content image with the converted colors;
determining a second difference between the color converted image and the color converted stylized content image;
and determining a second loss function value according to the second difference, and adjusting parameters of the feature extraction module and the color conversion module according to the second loss function value.
10. An image style conversion method, comprising:
determining a content image and a style image;
and taking the content image as an original image, inputting the style image as a reference image into a style conversion model trained based on the method of any one of claims 1-9, and obtaining a style content image which is output by the style conversion model and is matched with the content image and the style image.
11. A training device for a style conversion model, comprising:
a first sample set determination module for determining a first sample set comprising a plurality of sample content images and a plurality of sample style images;
the adversarial training module is used for performing adversarial training on the first initial model according to the first sample set to obtain a trained first initial model, wherein the first initial model comprises a first generator, and the first generator is used for performing global feature enhancement and local feature enhancement on an input original image and a reference image, and fusing the feature-enhanced original image and the reference image to obtain a converted image;
when the original image is a sample content image, the reference image is a sample style image, and the converted image is a stylized content image, wherein the stylized content image refers to an image obtained by shifting the style of the sample content image to the style of the sample style image.
12. An image style conversion device, comprising:
an image determining module for determining a content image and a style image;
the reasoning module is used for taking the content image as an original image, inputting the style image as a reference image into a style conversion model trained based on the method of any one of claims 1-9, and obtaining a style content image which is output by the style conversion model and is matched with the content image and the style image.
13. An electronic device comprising a memory, a processor and a computer program stored on the memory, characterized in that the processor executes the computer program to carry out the steps of the method according to any one of claims 1-10.
14. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method according to any of claims 1-10.
15. A computer program product comprising a computer program, characterized in that the computer program, when being executed by a processor, implements the steps of the method according to any of claims 1-10.
CN202410068980.2A 2024-01-17 2024-01-17 Model training method, image style conversion method and device and electronic equipment Active CN117611434B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410068980.2A CN117611434B (en) 2024-01-17 2024-01-17 Model training method, image style conversion method and device and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410068980.2A CN117611434B (en) 2024-01-17 2024-01-17 Model training method, image style conversion method and device and electronic equipment

Publications (2)

Publication Number Publication Date
CN117611434A true CN117611434A (en) 2024-02-27
CN117611434B CN117611434B (en) 2024-05-07

Family

ID=89958143

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410068980.2A Active CN117611434B (en) 2024-01-17 2024-01-17 Model training method, image style conversion method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN117611434B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113538224A (en) * 2021-09-14 2021-10-22 深圳市安软科技股份有限公司 Image style migration method and device based on generation countermeasure network and related equipment
US20220156987A1 (en) * 2020-11-16 2022-05-19 Disney Enterprises, Inc. Adaptive convolutions in neural networks
CN115171023A (en) * 2022-07-20 2022-10-11 广州虎牙科技有限公司 Style migration model training method, video processing method and related device
KR20230068062A (en) * 2021-11-10 2023-05-17 서울과학기술대학교 산학협력단 Device for generating style image
WO2023125374A1 (en) * 2021-12-29 2023-07-06 北京字跳网络技术有限公司 Image processing method and apparatus, electronic device, and storage medium
CN116862760A (en) * 2023-07-10 2023-10-10 中国人民解放军海军工程大学 Image conversion model processing method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN117611434B (en) 2024-05-07


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant