CN113705302A - Training method and device for image generation model, computer equipment and storage medium

Training method and device for image generation model, computer equipment and storage medium

Info

Publication number
CN113705302A
Authority
CN
China
Prior art keywords
style
image
error
generation model
image generation
Prior art date
Legal status
Pending
Application number
CN202110287832.6A
Other languages
Chinese (zh)
Inventor
刘亚辉
陈雅静
张浩贤
暴林超
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202110287832.6A
Publication of CN113705302A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F 18/22 Matching criteria, e.g. proximity measures
    • G06F 18/23 Clustering techniques
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/084 Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses a training method and apparatus for an image generation model, a computer device, and a storage medium, belonging to the technical field of image processing. The method obtains a plurality of style features belonging to different styles through the image generation model, determines a first error and a second error based on the spatial distribution of these style features, and performs model training with the first error and the second error. Applying the first error constrains the style features obtained by the image generation model to a compact space, so that style features belonging to different styles can transition smoothly; applying the second error keeps style features belonging to different styles separable. With this training approach, the style features remain both distinguishable and smoothly transitional, so that when the image generation model is used to convert an image's style, the original image and the newly generated image transition well, improving the realism and image quality of the newly generated image.

Description

Training method and device for image generation model, computer equipment and storage medium
Technical Field
The present application relates to the field of image processing technologies, and in particular, to a method and an apparatus for training an image generation model, a computer device, and a storage medium.
Background
In the field of image processing technology, computer equipment can transform the style of an image by applying a neural network model to generate a new image. Taking the example of performing the style conversion on the face image, the computer device can perform gender conversion, makeup conversion, and the like on the face image through the image generation model.
However, when a target application uses an image generation model to convert an image's style and generate a new image, the image generation model has difficulty achieving smooth transitions between different image styles, so the newly generated image has a low degree of realism and poor image quality. Therefore, how to train the image generation model, improve its performance, and improve the realism of the images it generates is an important research direction.
Disclosure of Invention
The embodiments of this application provide a training method and apparatus for an image generation model, a computer device, and a storage medium, which can improve the performance of the image generation model on image style conversion tasks and improve the realism and image quality of the images it generates. The technical solution is as follows:
in one aspect, a method for training an image generation model is provided, the method including:
acquiring an image generation model to be trained, wherein the image generation model is used for converting any style information into corresponding style characteristics, and converting the style of any image based on the style characteristics, and the style information is information used for describing the corresponding style;
acquiring at least two groups of style information, wherein the style information of the same group corresponds to the same style;
converting each style information into a corresponding style characteristic through the image generation model;
obtaining a first error and a second error based on the plurality of style features obtained by conversion, distribution information of the style features and target distribution information, wherein the target distribution information is used for limiting the spatial distribution area of the style features, the first error is used for indicating the error between the distribution information of the style features and the target distribution information, and the second error is used for indicating the difference of the style features corresponding to different styles in the spatial distribution;
based on the first error and the second error, parameters of the image generation model are adjusted.
In one aspect, an apparatus for training an image generation model is provided, the apparatus including:
the model acquisition module is used for acquiring an image generation model to be trained, the image generation model is used for converting any style information into corresponding style characteristics, the style of any image is converted based on the style characteristics, and the style information is information used for describing the corresponding style;
the information acquisition module is used for acquiring at least two groups of style information, and the style information of the same group corresponds to the same style;
the information conversion module is used for converting each style information into a corresponding style characteristic through the image generation model;
an error obtaining module, configured to obtain a first error and a second error based on the plurality of style features obtained through conversion, distribution information of the style features, and target distribution information, where the target distribution information is used to define a spatial distribution area of the style features, the first error is used to indicate an error between the distribution information of the style features and the target distribution information, and the second error is used to indicate a difference in spatial distribution of the style features corresponding to different styles;
and the parameter adjusting module is used for adjusting the parameters of the image generation model based on the first error and the second error.
In one possible implementation, the information conversion module is configured to perform any one of:
in response to the style information comprising an image, extracting style features of the image through a feature extraction network in the image generation model;
in response to the style information including a style label, the style label is mapped to a corresponding style feature through a feature mapping network in the image generation model.
In one possible implementation, the error obtaining module is configured to:
Acquiring KL divergence between the distribution information of the style characteristics and the target distribution information;
determining the first error based on the KL divergence, the first error being positively correlated with the KL divergence;
acquiring a first distance between any two style features belonging to the same style and a second distance between any two style features belonging to different styles;
determining the second error based on a difference between the first distance and the second distance corresponding to the same style feature.
In one possible implementation, the apparatus further includes:
and the image acquisition module is used for converting the style of the initial image to be processed based on any style characteristic through a generation network in the image generation model to obtain a target image, and the style of the target image is consistent with the style indicated by any style characteristic.
In one possible implementation, the apparatus further includes:
an error determination module for determining a third error based on a similarity between the target image and the initial image;
the parameter adjusting module is configured to adjust a parameter of the image generation model based on the first error, the second error, and the third error.
In one possible implementation, the error determination module is configured to:
respectively extracting features of different scales from the target image and the initial image through feature extraction layers of at least two scales to obtain at least two first features corresponding to the target image and at least two second features corresponding to the initial image;
respectively acquiring third distances between the first feature and the second feature with the same scale;
the third error is determined based on the acquired at least two third distances.
In one possible implementation, the apparatus further includes:
the weighting processing module is used for weighting the first error, the second error and the third error to obtain a fourth error;
and the parameter adjusting module is used for adjusting the parameters of the image generation model based on the fourth error.
In one aspect, a computer device is provided that includes one or more processors and one or more memories having stored therein at least one computer program that is loaded and executed by the one or more processors to perform operations performed by a training method of the image generation model.
In one aspect, a computer-readable storage medium is provided, in which at least one computer program is stored, the at least one computer program being loaded and executed by a processor to perform operations performed by a training method of the image generation model.
In one aspect, a computer program product is provided that includes at least one computer program stored in a computer readable storage medium. The at least one computer program is read by a processor of the computer device from the computer-readable storage medium, and the at least one computer program is executed by the processor to cause the computer device to implement the operations performed by the training method of the image generation model.
According to the technical solution provided by the embodiments of this application, a plurality of style features belonging to different styles are obtained through the image generation model, a first error and a second error are determined based on the spatial distribution of these style features, and the image generation model is trained with the first error and the second error. The first error constrains the spatial distribution of the style features obtained by the image generation model so that they lie in a compact space, allowing style features belonging to different styles to transition smoothly; the second error keeps style features belonging to different styles separable. That is, with this model training approach, style features belonging to different styles remain distinguishable while still transitioning smoothly, so that when the image generation model performs a conversion with a large style span on an initial image, the input image and the newly generated image transition well, and the realism and image quality of the newly generated image are improved.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
FIG. 1 is a schematic diagram of an implementation environment of a training method for an image generation model according to an embodiment of the present disclosure;
FIG. 2 is a flowchart of a training method for an image generation model according to an embodiment of the present disclosure;
FIG. 3 is a schematic diagram of an image generation model provided by an embodiment of the present application;
FIG. 4 is a flowchart of a training method for an image generation model according to an embodiment of the present disclosure;
FIG. 5 is a schematic diagram of a distribution of style characteristics provided in an embodiment of the application;
FIG. 6 is a schematic diagram illustrating comparison of performances of image generation models trained by different model training methods according to an embodiment of the present disclosure;
FIG. 7 is a schematic structural diagram of an apparatus for training an image generation model according to an embodiment of the present disclosure;
fig. 8 is a schematic structural diagram of a terminal according to an embodiment of the present application;
fig. 9 is a schematic structural diagram of a server according to an embodiment of the present application.
Detailed Description
To make the purpose, technical solutions and advantages of the present application clearer, the following will describe embodiments of the present application in further detail with reference to the accompanying drawings, and it is obvious that the described embodiments are some, but not all embodiments of the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The terms "first," "second," and the like in this application are used for distinguishing between similar items and items that have substantially the same function or similar functionality, and it should be understood that "first," "second," and "nth" do not have any logical or temporal dependency or limitation on the number or order of execution.
Artificial Intelligence (AI) is a theory, method, technology, and application system that uses digital computers, or machines controlled by digital computers, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive branch of computer science that attempts to understand the essence of intelligence and to produce new intelligent machines that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines so that machines have the capabilities of perception, reasoning, and decision-making. Artificial intelligence is a comprehensive discipline involving a wide range of fields, covering both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, and mechatronics. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, and machine learning/deep learning. The embodiments of this application relate to the computer vision and deep learning technologies within artificial intelligence.
Computer Vision (CV) is the science of how to make machines "see"; more specifically, it uses cameras and computers instead of human eyes to identify, track, and measure targets, and further processes the results into images that are more suitable for human observation or for transmission to instruments for detection. As a scientific discipline, computer vision studies theories and techniques that attempt to build artificial intelligence systems capable of obtaining information from images or multidimensional data. Computer vision technologies generally include image processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D technologies, virtual reality, augmented reality, and simultaneous localization and mapping, and also include common biometric technologies such as face recognition and fingerprint recognition.
Machine Learning (ML) is a multi-disciplinary field involving probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory, and other disciplines. It studies how computers can simulate or implement human learning behavior in order to acquire new knowledge or skills and reorganize existing knowledge structures to continuously improve their performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and it is applied in all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from teaching.
In order to facilitate understanding of the technical processes of the embodiments of the present application, some terms referred to in the embodiments of the present application are explained below:
Image Domain: a collection of images that share the same style or attributes. For example, female face images belong to one image domain, male face images belong to another image domain, and sketch-style images belong to yet another image domain.
Unsupervised Image-to-Image Translation: training a neural network model to perform style conversion on an input image and generate a new image without any real target result serving as supervision data.
Latent Style Space: a lower-dimensional vector space in which the style of an image is characterized.
Style Vector (Style Code): the vector corresponding to an image in the latent style space. In the embodiments of this application, the style feature of an image is represented in the form of a style vector.
Fig. 1 is a schematic diagram of an implementation environment of a training method for an image generation model according to an embodiment of the present application, and referring to fig. 1, the implementation environment includes a terminal 110 and a server 140.
The terminal 110 is installed and operated with an application program supporting image style conversion, for example, the application program is an image processing application program, an image capturing application program, a social contact application program, and the like, which is not limited in this embodiment of the present application. For example, in an image processing application, a user inputs an initial image to be processed and then inputs a reference image with a target style, where the style of the initial image is different from that of the reference image, and the image processing application can perform style transformation on the initial image based on the image style of the reference image to generate a new image, where the newly generated image has the image style of the reference image, that is, the newly generated image also has the target style. Optionally, the terminal 110 is a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, a smart watch, or the like, and the device type of the terminal 110 is not limited in this embodiment of the application. Illustratively, the terminal 110 is a terminal used by a user, and an application running in the terminal 110 is logged with a user account. The terminal 110 generally refers to one of a plurality of terminals, and the embodiment is only illustrated by the terminal 110.
In one possible implementation, the server 140 is at least one of a single server, multiple servers, a cloud computing platform, or a virtualization center. The server 140 provides background services for the application program that performs image style transformation. Optionally, the server 140 undertakes the primary image processing work and the terminal 110 undertakes the secondary image processing work; or the server 140 undertakes the secondary image processing work and the terminal 110 undertakes the primary image processing work; or the server 140 or the terminal 110 undertakes the image processing work alone. Optionally, the server 140 includes an access server, an image processing server, and a database. The access server provides access services for the terminal 110. The image processing server provides background services related to image style conversion; illustratively, the image processing server may be equipped with a graphics processing unit (GPU) and support multithreaded parallel computing on the GPU. Illustratively, there are one or more image processing servers. When there are multiple image processing servers, at least two image processing servers provide different services, and/or at least two image processing servers provide the same service, for example in a load-balancing manner, which is not limited in the embodiments of this application. The image processing server can be provided with an image generation model and can carry a GPU that supports parallel computation during model training and application. For example, the server is an independent physical server, or a server cluster or distributed system formed by multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, web services, cloud communication, middleware services, domain name services, security services, CDN (Content Delivery Network), and big data and artificial intelligence platforms. The number of servers and the type of devices are not limited in the embodiments of this application.
The embodiments of this application provide a training method for an image generation model. The image generation model involved in the embodiments of this application can be combined with various application scenarios and deployed in various application programs, giving those application programs an image style conversion function. For example, combining this solution with an image processing application allows a user to re-edit the style of an image through the image processing application; combining this solution with an image capturing application allows a user to freely select an image style when shooting and directly obtain a photo in that style. In these application scenarios, applying the model training method provided by the embodiments of this application improves the performance of the image generation model on image style conversion tasks: when the initial image undergoes a conversion with a large style span, the initial image and the newly generated target image transition well, and the realism and image quality of the newly generated target image are improved.
Fig. 2 is a flowchart of a training method for an image generation model according to an embodiment of this application. The method can be applied to the terminal or the server, and both the terminal and the server can be regarded as a computer device. In this application, the training method of the image generation model is described with the computer device as the execution subject. Referring to Fig. 2, in one possible implementation, the embodiment includes the following steps:
201. the computer equipment acquires an image generation model to be trained, the image generation model is used for converting any style information into corresponding style characteristics, the style of any image is converted based on the style characteristics, and the style information is information used for describing the corresponding style.
In an embodiment of the application, the style information includes an image or a style label. Taking an image as an example, the style information is a style presented by the image, for example, the style information is a sketch style, an oil painting style, or the like, or the style information is a male face image style, a female face image style, or the like; taking the style label as an example, the style information is information indicated by the style label, and it should be noted that the specific content of the style information is not limited in this application embodiment.
In one possible implementation, the image generation model is a model constructed based on a GAN (Generative Adversarial Network); for example, the image generation model is the StarGAN v2 model. Illustratively, the image generation model includes a feature encoding network and a generation network. The feature encoding network obtains the corresponding style feature based on the style information; optionally, the feature encoding network includes two sub-networks, a feature extraction network and a feature mapping network, where the feature extraction network extracts features from an image to obtain a style feature and the feature mapping network maps a style label to the corresponding style feature. The generation network performs style conversion on an image based on the obtained style feature to generate a new image. The above description of the structure of the image generation model is only exemplary; the embodiments of this application do not limit the structure of the image generation model or the method by which it converts the style of an image.
202. The computer device obtains at least two sets of style information, the style information of the same set corresponding to the same style.
In one possible implementation, taking the example that the style information includes images, the computer device acquires at least two sets of images, each set of images corresponding to a style, e.g., the computer device acquires a set of male face images, a set of female face images, etc. Taking the example that the style information includes style labels, the computer device obtains at least two groups of style labels, each group of style labels corresponding to a style, for example, the computer device obtains a group of labels for describing male face features, a group of labels for describing female face features, and the like.
203. The computer device converts each style information into a corresponding style feature through the image generation model.
In one possible implementation, the computer device is capable of determining an acquisition method of the style feature based on the acquired style information. Illustratively, if the style information comprises an image, the computer device performs feature extraction on the image through a feature extraction network in the image generation model, and takes the extracted image features as style features corresponding to the style information; if the style information comprises a style label, the computer equipment encodes the style label through a feature mapping network in the image generation model to obtain the style feature. The style features are expressed in a vector or matrix form, which is not limited in the embodiment of the present application. The method for acquiring the style characteristics is not limited in the embodiments of the present application.
204. The computer device obtains a first error and a second error based on the plurality of style features obtained by conversion, distribution information of the plurality of style features, and target distribution information, the target distribution information being used for defining a spatial distribution area of the plurality of style features, the first error being used for indicating an error between the distribution information of the plurality of style features and the target distribution information, and the second error being used for indicating a difference in spatial distribution of style features corresponding to different styles.
The target distribution information is set by a developer. Illustratively, the target distribution information is the prior Gaussian distribution N(0, I), i.e., a normal distribution with mean 0 and variance I, and the distribution space it indicates is a compact distribution space, i.e., a distribution space covering a relatively small area. In the embodiments of this application, by setting a compact-space constraint, that is, by obtaining the first error between the distribution information of the plurality of style features and the target distribution information and applying this first error in the subsequent model training step, the spatial distribution of the style features can be constrained to a compact space, so that smooth transitions between different styles can be achieved.
Wherein the second error is indicative of a spatial distribution of the style features corresponding to the different styles. In the embodiment of the application, by setting the separability constraint of the style feature clusters, that is, acquiring the second error between style features belonging to different styles, and applying the second error to the subsequent model training step, the separability between the style feature clusters belonging to different styles can be ensured.
205. A computer device adjusts parameters of the image generation model based on the first error and the second error.
In one possible implementation, the computer device back-propagates the first error and the second error to the image generation model and solves the parameters of each operation layer in the image generation model, stopping model training once the image generation model meets the model convergence condition to obtain the trained image generation model. The model convergence condition is set by a developer and is not limited in the embodiments of this application.
According to the technical solution provided by the embodiments of this application, a plurality of style features belonging to different styles are obtained through the image generation model, a first error and a second error are determined based on the spatial distribution of these style features, and the image generation model is trained with the first error and the second error. The first error constrains the spatial distribution of the style features obtained by the image generation model so that they lie in a compact space, allowing style features belonging to different styles to transition smoothly; the second error keeps style features belonging to different styles separable. That is, with this model training approach, style features belonging to different styles remain distinguishable while still transitioning smoothly, so that when the image generation model performs a conversion with a large style span on an initial image, the input image and the newly generated image transition well, and the realism and image quality of the newly generated image are improved.
The foregoing embodiment is a brief introduction to the mode of the embodiment of the present application, and fig. 3 is a schematic diagram of an image generation model provided in the embodiment of the present application, and referring to fig. 3, the image generation model includes a feature extraction network 301, a feature mapping network 302, and a generation network 303, and optionally, the image generation model further includes a discrimination network 304, and the discrimination network 304 is configured to classify an image output by the generation network 303, and determine whether the image belongs to a real image or a model generated image, so as to achieve an effect of detecting image quality of an image output by the generation network 303. In the embodiment of the present application, an unsupervised training mode is adopted for training the image generation model, that is, a correct image that should be generated by the image generation model is not provided in the model training process, and no real target result is used as the supervision data. Fig. 4 is a flowchart of a training method for an image generation model provided in an embodiment of the present application, and taking the image generation model shown in fig. 3 as an example, a training process of the image generation model is described with reference to fig. 4, in a possible implementation manner, the embodiment includes the following steps:
401. the computer equipment acquires an image generation model to be trained, at least two groups of style information and an initial image to be processed, and inputs the at least two groups of style information and the initial image into the image generation model.
In one possible implementation, in response to a model training instruction, the computer device acquires the image generation model to be trained and the training data, where the training data are the at least two sets of style information and the initial image to be processed. In the embodiments of this application, the style information indicates the style of an image and can be represented in various forms, for example by an image or by a style label. In one possible implementation, if the style information includes images, the computer device randomly selects one or more of the acquired images as the initial image; of course, the computer device may also acquire the initial image separately, which is not limited in the embodiments of this application. In one possible implementation, if the style information is a style label, the computer device acquires at least two sets of style labels and the initial image to be processed. In some embodiments, a style label includes a style index value d, and different style index values distinguish different styles. Optionally, the style label further includes a random Gaussian noise z, where z follows the distribution N(0, I) and the value of I is set by a developer. In the subsequent process of obtaining style features based on the style information, the computer device can encode the random Gaussian noise z based on the style index value to obtain a style feature. In this case, the style features mapped from random Gaussian noise with the same style index value belong to the same style, but differ from one another because of the random noise. Illustratively, the three data pairs (z1, j), (z2, j), and (z3, j) share the same style index value j and therefore all belong to the same style, but because z1, z2, and z3 are independently sampled from the Gaussian noise z and are different from one another, the image generation model obtains three different style features belonging to the same style and in turn generates three images of the same style that differ in detail. For example, if the style index value j represents "a male face", then when the image generation model performs style conversion on the initial image based on (z1, j), (z2, j), and (z3, j), three different male face images can be obtained, for example with different skin colors and different hair styles.
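To make the (z, d) style labels above concrete, the following minimal sketch (in PyTorch) samples three random Gaussian noise vectors for one shared style index; the noise dimension and the index value are illustrative assumptions, not values fixed by this application.

```python
import torch

latent_dim = 16    # assumed dimension of the random Gaussian noise z
style_index_j = 1  # hypothetical style index value j, e.g. "a male face"

# Three (z, j) pairs: the same style index j with different noise samples z1, z2, z3,
# so the mapped style features belong to the same style but differ in detail.
style_labels = [(torch.randn(latent_dim), style_index_j) for _ in range(3)]
```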
It should be noted that, in the embodiment of the present application, the number of the initial images acquired by the computer device is not limited, and in the embodiment of the present application, only one initial image is taken as an example for description.
In a possible implementation manner, before the computer device inputs the style information and the initial image into the image generation model, the style information and the initial image are preprocessed, for example, the image is scaled according to the actual situation to adjust the image to a reference size, and the reference size is set by a developer, which is not limited by the embodiment of the present application. Or, the computer device performs data enhancement on the style information, for example, if a plurality of the style information includes an image, the image is rotated, noise is added, and the like. It should be noted that the above description of the method for preprocessing the style information and the initial image is only an exemplary description of one possible implementation manner, and the data and the processing method are not limited in the embodiment of the present application.
402. The computer device converts each style information into a corresponding style feature through the image generation model.
In one possible implementation, in response to the stylistic information comprising an image, the computer device extracts stylistic features of the image through a feature extraction network in the image generation model. In one possible implementation, the feature extraction network includes a plurality of convolutional layers for extracting image features of an image, and includes K output branches, where K represents the number of styles corresponding to an image generation model, and one output branch can output a feature representation of the image in one style, and optionally, the feature extraction network employs a style encoder (style encoder) in the StarGAN v2 model. In a possible implementation manner, the computer device sequentially performs feature extraction on the style information through a plurality of convolution layers in the feature extraction network, that is, performs convolution operation on the image sequentially through the plurality of convolution layers to obtain an image feature, each output branch in the feature extraction network outputs one sub-feature, one sub-feature is a feature representation of the image in a certain style, and the computer device obtains the sub-features output by each branch to form a style feature corresponding to the image. Illustratively, if each sub-feature is represented as a numerical value, the style feature is represented in the form of a style vector; if each sub-feature is represented as a vector, the style feature is represented in the form of a style matrix. The present embodiment does not limit the structure of the feature extraction network and the method for extracting the style features by the feature extraction network.
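As one possible reading of the feature extraction network described above, the following hedged PyTorch sketch shows a style encoder with a shared convolutional trunk and K output branches whose sub-features are stacked into the style feature; the layer sizes and channel counts are assumptions for illustration, not the structure mandated by this application.

```python
import torch
import torch.nn as nn

class StyleEncoder(nn.Module):
    """Sketch of a feature extraction network with K output branches."""
    def __init__(self, style_dim=64, num_styles=2):  # num_styles corresponds to K
        super().__init__()
        # Shared convolutional layers that extract image features.
        self.trunk = nn.Sequential(
            nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # One output branch per style; each branch outputs one sub-feature.
        self.branches = nn.ModuleList(
            [nn.Linear(128, style_dim) for _ in range(num_styles)]
        )

    def forward(self, image):
        h = self.trunk(image)
        # Stack the sub-features from all branches to form the style feature
        # (a style matrix; selecting one row gives the style vector for one style).
        return torch.stack([branch(h) for branch in self.branches], dim=1)
```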
In one possible implementation, in response to the style information including a style label, the computer device maps the style label to the corresponding style feature through a feature mapping network in the image generation model. In one possible implementation, the feature mapping network includes an MLP (Multi-Layer Perceptron) with K output branches; optionally, the feature mapping network employs the Mapping Network in the StarGAN v2 model. In one possible implementation, the computer device encodes the style information through the MLP in the feature mapping network, that is, it encodes the random Gaussian noise z based on the style index value d to obtain the style feature. It should be noted that the method by which the feature mapping network obtains the style feature is not limited in the embodiments of this application.
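A corresponding hedged sketch of the feature mapping network: an MLP with K output branches that encodes the random Gaussian noise z into the style feature selected by the style index d. The layer widths are assumed for illustration only.

```python
import torch
import torch.nn as nn

class MappingNetwork(nn.Module):
    """Sketch of a feature mapping network: an MLP with K output branches."""
    def __init__(self, latent_dim=16, style_dim=64, num_styles=2):
        super().__init__()
        # Shared MLP layers that encode the random Gaussian noise z.
        self.shared = nn.Sequential(
            nn.Linear(latent_dim, 512), nn.ReLU(),
            nn.Linear(512, 512), nn.ReLU(),
        )
        # One output branch per style index d.
        self.branches = nn.ModuleList(
            [nn.Linear(512, style_dim) for _ in range(num_styles)]
        )

    def forward(self, z, d):
        # Encode z, then select the branch for style index d to get the style feature.
        return self.branches[d](self.shared(z))
```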
403. And the computer equipment converts the style of the initial image to be processed based on any style characteristic through a generation network in the image generation model to obtain a target image.
The generation network generates a target image G(x, s) from an input initial image x, where s denotes any style feature, and the style of the target image is consistent with the style indicated by that style feature. In one possible implementation, the generation network employs the Generator in the StarGAN v2 model, which includes four downsampling layers, four intermediate layers, and four upsampling layers, each of which may be a ResBlock (residual network module). In one possible implementation, the computer device applies the Adaptive Instance Normalization (AdaIN) algorithm: the style feature is input into the generation network, and the style of the initial image is transformed based on the style feature through the generation network's operation layers. Illustratively, the generation network interpolates between the style feature and the image features of the initial image to obtain a new image feature, and then generates the target image based on the new image feature. The structure of the generation network and the method of generating the target image are not limited in the embodiments of this application.
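The following hedged sketch illustrates how an Adaptive Instance Normalization (AdaIN) layer can inject a style feature into the generation network: the style vector is projected to per-channel scale and shift parameters that modulate instance-normalized image features. It is one possible realization, not the exact generator of this application.

```python
import torch
import torch.nn as nn

class AdaIN(nn.Module):
    def __init__(self, style_dim, num_channels):
        super().__init__()
        self.norm = nn.InstanceNorm2d(num_channels, affine=False)
        # Project the style vector to per-channel scale (gamma) and shift (beta).
        self.affine = nn.Linear(style_dim, num_channels * 2)

    def forward(self, x, s):
        gamma, beta = self.affine(s).chunk(2, dim=1)
        gamma = gamma.unsqueeze(-1).unsqueeze(-1)
        beta = beta.unsqueeze(-1).unsqueeze(-1)
        # Normalize the image features, then re-style them with gamma and beta.
        return (1 + gamma) * self.norm(x) + beta
```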
404. The computer device obtains a first error and a second error based on the plurality of style features obtained by conversion, the distribution information of the plurality of style features, and the target distribution information.
In one possible implementation, the computer device obtains the KL divergence (Kullback-Leibler divergence) between the distribution information of the style features and the target distribution information, and determines the first error based on the KL divergence. In the embodiments of this application, the target distribution information is the Gaussian prior distribution N(0, I), that is, the style features are expected to be distributed in a compact space so that style features belonging to different styles can transition well. The KL divergence measures the difference between the distribution information of the plurality of style features and the target distribution information: the larger the KL divergence, the larger this difference; the smaller the KL divergence, the smaller the difference. In the embodiments of this application, the first error is positively correlated with the KL divergence. Illustratively, the first error is obtained by the following formula (1):
L_kl = E_s[ D_kl( p(s) || N(0, I) ) ]   (1)
where L_kl is the first error; D_kl(p || q) denotes the KL divergence between p and q; s denotes a style feature; p(s) denotes the distribution information of the style feature s; N(0, I) denotes the target distribution information, where I can be set by a developer; E_s denotes the expectation.
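A hedged sketch of the first error in formula (1), under the additional assumption that the model predicts a diagonal Gaussian (mean and log-variance) for each style feature, so that the KL divergence to the prior N(0, I) has a closed form; the original text does not specify how p(s) is parameterized, so this is only one possible realization.

```python
import torch

def first_error(mu: torch.Tensor, logvar: torch.Tensor) -> torch.Tensor:
    """Closed-form KL divergence between N(mu, diag(exp(logvar))) and N(0, I),
    averaged over the batch of style features."""
    kl_per_sample = 0.5 * torch.sum(mu ** 2 + logvar.exp() - 1.0 - logvar, dim=1)
    return kl_per_sample.mean()
```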
In one possible implementation manner, the computer device obtains a first distance between any two style features belonging to the same style and a second distance between any two style features belonging to different styles, and determines the second error based on a difference value between the first distance and the second distance corresponding to the same style feature. In one possible implementation, the second error is obtained by the following equation (2):
L_tri = E_(s_a, s_p, s_n)[ max( ||s_a - s_p|| - ||s_a - s_n|| + α, 0 ) ]   (2)
where L_tri is the second error; s_a and s_p are style features belonging to the same style, while s_a and s_n are style features belonging to different styles; ||s_a - s_p|| denotes the first distance and ||s_a - s_n|| denotes the second distance; α is a constant whose value is set by the developer; E_(s_a, s_p, s_n) denotes the expectation.
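A hedged sketch of the second error in formula (2): a standard triplet margin loss over style features, where the anchor and positive belong to the same style and the negative belongs to a different style. The margin value below is only an example; the application leaves α to the developer.

```python
import torch

def second_error(s_a, s_p, s_n, alpha: float = 0.2) -> torch.Tensor:
    """Triplet loss: first distance ||s_a - s_p|| minus second distance
    ||s_a - s_n||, plus margin alpha, clamped at zero and averaged."""
    d_same = torch.norm(s_a - s_p, dim=1)  # first distance (same style)
    d_diff = torch.norm(s_a - s_n, dim=1)  # second distance (different styles)
    return torch.clamp(d_same - d_diff + alpha, min=0).mean()
```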
It should be noted that the above description of the first error and the second error obtaining method is only an exemplary description of one possible implementation manner, and the embodiment of the present application does not limit which method is used to obtain the first error and the second error. In the embodiment of the application, by obtaining the first error and the second error, that is, by setting two constraint conditions of a compact space constraint and a style feature clustering separability constraint, on one hand, a plurality of style features are compressed in a compact distribution space, so that style features belonging to different styles can be smoothly transited, and on the other hand, the distinctiveness between style features belonging to different styles can be maintained.
405. The computer device determines a third error based on a similarity between the target image and the initial image.
In one possible implementation, the computer device determines the similarity between the target image and the initial image by the LPIPS (Learned Perceptual Image Patch Similarity) method, and from it determines the third error. For example, the computer device first performs feature extraction on the target image and the initial image at different scales through feature extraction layers of at least two scales, obtaining at least two first features corresponding to the target image and at least two second features corresponding to the initial image; that is, for the target image the computer device obtains first features at multiple scales, and for the initial image it obtains second features at multiple scales. Then, the computer device obtains the third distance between the first feature and the second feature of the same scale; optionally, the third distance is the L2 distance between the first feature and the second feature. Finally, the computer device determines the third error based on the obtained at least two third distances; for example, the computer device averages the at least two third distances to obtain the third error. In one possible implementation, the above method for obtaining the third error is expressed by the following formula (3):
L_cont = E_(s, x)[ ψ( x, G(x, s) ) ]   (3)
where L_cont is the third error; x denotes the initial image, G(x, s) denotes the target image, and s denotes a style feature; ψ(x, G(x, s)) denotes the similarity in visual perception between the target image and the initial image, which in the embodiments of this application is obtained by the LPIPS method; E_(s, x) denotes the expectation.
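A hedged sketch of the third error in formula (3): features are extracted from the initial and target images at several scales, an L2 distance is taken per scale, and the distances are averaged. A fixed pretrained VGG16 is used here as the multi-scale feature extractor purely for illustration; the application does not prescribe a specific backbone, and the chosen layer indices are assumptions.

```python
import torch
import torchvision.models as models

# Fixed pretrained VGG16 used purely as a multi-scale feature extractor (an assumption).
vgg = models.vgg16(weights="IMAGENET1K_V1").features.eval()
for p in vgg.parameters():
    p.requires_grad_(False)

scale_layers = {3, 8, 15, 22}  # assumed layer indices giving features at different scales

def multi_scale_features(img):
    feats, h = [], img
    for i, layer in enumerate(vgg):
        h = layer(h)
        if i in scale_layers:
            feats.append(h)
    return feats

def third_error(initial_image, target_image) -> torch.Tensor:
    first_feats = multi_scale_features(target_image)    # first features (target image)
    second_feats = multi_scale_features(initial_image)  # second features (initial image)
    # Third distance per scale: L2 distance between same-scale features, then averaged.
    dists = [torch.norm(f1 - f2) for f1, f2 in zip(first_feats, second_feats)]
    return torch.stack(dists).mean()
```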
It should be noted that the above description of the method for acquiring the third error is merely an exemplary description of one possible implementation manner, and the embodiment of the present application does not limit which method is used to acquire the third error. In the embodiment of the present application, by setting an image content similarity keeping constraint condition, that is, by obtaining a third error, which indicates a similarity of the initial image and the target image generated by the model in the visual sense, in the subsequent model training step, the trained image generation model can keep identity characteristics of the input image unchanged in the image generation process, that is, the initial image and the target image are kept consistent in the image content.
It should be noted that the step 405 of obtaining the third error is an optional step, and in some embodiments, the image generation model may be trained by applying only the first error and the second error.
406. A computer device adjusts parameters of the image generation model based on the first error, the second error, and the third error.
In one possible implementation, the computer device performs weighting processing on the first error, the second error, and the third error to obtain a fourth error, and adjusts parameters of the image generation model based on the fourth error. Illustratively, the fourth error is obtained by the following equation (4):
L_smooth = λ_kl·L_kl + λ_tri·L_tri + λ_cont·L_cont   (4)
where L_smooth is the fourth error, L_kl is the first error, L_tri is the second error, and L_cont is the third error; λ_kl, λ_tri, and λ_cont are hyperparameters used to adjust the weight of each error, and their values can be set by the developer. In one possible implementation, the computer device applies a back-propagation method to update the parameters of the image generation model; for example, the computer device solves the parameters of the image generation model with a gradient-descent method based on the Adam (adaptive moment estimation) algorithm. The embodiments of this application do not limit the specific method for updating the parameters of the image generation model. In one possible implementation, after the computer device finishes updating the parameters of the image generation model, if the image generation model satisfies the model convergence condition, the trained image generation model is obtained; if the convergence condition is not satisfied, the next batch of training data is read and the above steps 401 to 406 are executed again. The model convergence condition is set by a developer and is not limited in the embodiments of this application.
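A hedged sketch of formula (4) and the parameter update: the three errors are combined into the fourth error by a weighted sum and back-propagated with an Adam optimizer. The weight values and the name image_generation_model are placeholders for whatever model and hyperparameters the developer chooses.

```python
import torch

# Hypothetical weights for the three errors; the application leaves their values to the developer.
lambda_kl, lambda_tri, lambda_cont = 1.0, 1.0, 1.0

# image_generation_model is assumed to be the model under training (e.g. the networks sketched above).
optimizer = torch.optim.Adam(image_generation_model.parameters(), lr=1e-4)

def training_step(l_kl, l_tri, l_cont):
    # Fourth error (formula 4): weighted sum of the first, second, and third errors.
    l_smooth = lambda_kl * l_kl + lambda_tri * l_tri + lambda_cont * l_cont
    optimizer.zero_grad()
    l_smooth.backward()  # back-propagate the fourth error to the image generation model
    optimizer.step()     # Adam-based gradient-descent update of the model parameters
    return l_smooth.detach()
```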
It should be noted that, in some embodiments, the computer device may also perform model training by directly propagating the first error, the second error, and the third error back to the image generation model without performing a weighting operation on the first error, the second error, and the third error, which is not limited in this application.
In some embodiments, in addition to the loss terms corresponding to the above errors, the developer may set loss functions L_exist of other forms to obtain other errors, combine those other errors with the fourth error to obtain a fifth error, and perform model training based on the fifth error, that is, back-propagate the fifth error to the image generation model and re-solve the parameters of each operation layer in the image generation model. The embodiments of this application do not limit the specific form of the loss function L_exist. In one possible implementation, the fifth error is expressed by the following formula (5):
L_new = L_smooth + L_exist   (5)
where L_new is the fifth error, L_smooth is the fourth error, and L_exist represents the other errors.
According to the technical solution provided by the embodiments of this application, a plurality of style features belonging to different styles are obtained through the image generation model, a first error and a second error are determined based on the spatial distribution of these style features, and the image generation model is trained with the first error and the second error. The first error constrains the spatial distribution of the style features obtained by the image generation model so that they lie in a compact space, allowing style features belonging to different styles to transition smoothly; the second error keeps style features belonging to different styles separable. That is, with this model training approach, style features belonging to different styles remain distinguishable while still transitioning smoothly, so that when the image generation model performs a conversion with a large style span on an initial image, the input image and the newly generated image transition well, and the realism and image quality of the newly generated image are improved.
Fig. 5 is a schematic diagram of the distribution of style features provided in an embodiment of this application. In the general case, as shown in part (a) of Fig. 5, style features 501 and 502 corresponding to different styles lack a good transition in their spatial distribution; when the image generation model converts an initial image 503 into the style indicated by a reference image 504, the generated target image 505 has a low degree of realism and poor image quality. By applying the image generation model training method provided in the embodiments of this application, that is, training the model under both the compact space constraint and the style feature clustering separability constraint, style features 506 and 507 corresponding to different styles transition smoothly in their spatial distribution while remaining distinguishable; as shown in part (b) of Fig. 5, the target image 508 generated by the image generation model has a high degree of realism and high image quality. Fig. 6 is a schematic diagram comparing the performance of image generation models trained with different model training methods according to an embodiment of this application. The target images generated by the image generation models obtained with training methods (a), (b), (c), and (d) are shown in the first, second, third, and fourth rows of Fig. 6, respectively, where training method (d) is the model training method provided in the embodiments of this application. As shown in the fourth row of Fig. 6, the image generation model trained with the model training method provided in the embodiments of this application achieves good model performance in image style conversion: each generated target image has a high degree of realism and high image quality, and the content consistency among the images is maintained.
The above embodiments describe a method for training an image generation model; an image generation model trained with this method can be deployed in various types of application programs and combined with various application scenarios. Taking the deployment of the image generation model in an image processing application as an example, in one possible implementation the process of generating a new image by converting an image style with the image generation model includes the following steps.
Step one: in response to an image generation operation, the terminal obtains a first image to be processed and style transformation requirement information, and sends the first image and the style transformation requirement information to a server.
The terminal is a device used by a user; it installs and runs a target application program that provides an image style conversion function. The server is a background server of the target application program and hosts a trained image generation model obtained with the above training method. The style transformation requirement information indicates the target of the style transformation of the first image; in this embodiment it can be represented in various ways, for example as an image or as a style label, which is not limited here.
In a possible implementation, the terminal displays an image editing interface of the target application program. The image editing interface includes a first upload entry for the image to be processed, through which the user can upload the first image to be processed. Optionally, the image editing interface includes a second upload entry for a reference image, where the reference image represents the image style conversion requirement, that is, the style presented by the reference image is the style conversion target of the first image. Optionally, the image editing interface includes a plurality of style selection controls, each corresponding to one style; the style selection controls may be presented in the form of an image or a text label. Optionally, the image editing interface includes an information input area in which the user can input style transformation requirements. The embodiment of the present application does not limit how the style conversion requirement is acquired.
Step two: the server calls the trained image generation model and performs style transformation on the first image based on the style transformation requirement information to generate a second image.
In the embodiment of the present application, the image style conversion process is described by taking the StarGAN v2 model as the image generation model. In one possible implementation, in response to receiving the first image and the style transformation requirement information, the server determines the style feature extraction manner based on how the requirement information is represented. For example, if the style transformation requirement information is an image, the server performs feature extraction on that image through a feature extraction network in the image generation model to obtain a style feature; if it is a style label, the server encodes the style label into a style feature through a feature mapping network in the image generation model. After obtaining the style feature, the server inputs the style feature and the first image into a generation network in the image generation model, which transforms the style of the first image into the style indicated by the style feature; for example, the generation network performs interpolation processing on the style feature and the image feature of the first image to obtain a new image feature and then generates a second image based on the new image feature, that is, it outputs the image obtained after style transformation of the first image. It should be noted that the above description of generating the second image is only an example of one possible implementation; the embodiment of the present application does not limit which method is specifically used to generate the second image.
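A minimal Python sketch of this inference flow is given below, assuming three sub-networks named style_encoder (the feature extraction network), mapping_network (the feature mapping network) and generator (the generation network); these names, signatures and the latent-code dimension are illustrative assumptions and are not the actual StarGAN v2 API.

    import torch

    def stylize(first_image: torch.Tensor, requirement,
                style_encoder, mapping_network, generator) -> torch.Tensor:
        if isinstance(requirement, torch.Tensor):
            # Requirement is a reference image: extract its style feature.
            style_feature = style_encoder(requirement)
        else:
            # Requirement is a style label: encode it into a style feature
            # (a random latent code of assumed dimension 16 is drawn here).
            latent = torch.randn(1, 16)
            label = torch.tensor([requirement])
            style_feature = mapping_network(latent, label)
        # The generation network fuses the style feature with the first image
        # and outputs the style-transformed second image.
        second_image = generator(first_image, style_feature)
        return second_image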
Step three: the server sends the second image to the terminal, and the terminal displays the second image.
In a possible implementation manner, the terminal displays an image display interface of the target application program, and after receiving the second image, the terminal displays the second image on the image display interface.
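Purely as an illustration of steps one to three, the sketch below shows what the terminal-side request might look like if the background server exposed a hypothetical HTTP endpoint; the URL, field names and response format are assumptions and are not described in this application.

    import requests

    def request_style_transfer(image_path: str, style_label: str,
                               server_url: str = "https://example.com/style-transfer") -> bytes:
        # Upload the first image and the style transformation requirement information.
        with open(image_path, "rb") as f:
            resp = requests.post(server_url,
                                 files={"first_image": f},
                                 data={"style": style_label})
        resp.raise_for_status()
        # Bytes of the generated second image, which the terminal then displays.
        return resp.content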
All the above optional technical solutions may be combined arbitrarily to form optional embodiments of the present application, and are not described herein again.
Fig. 7 is a schematic structural diagram of an image generation model training apparatus provided in an embodiment of the present application, and referring to fig. 7, the apparatus includes:
a model obtaining module 701, configured to obtain an image generation model to be trained, where the image generation model is configured to convert any style information into a corresponding style feature, and convert the style of any image based on the style feature, where the style information is information used to describe a corresponding style;
an information obtaining module 702, configured to obtain at least two groups of style information, where the style information of a same group corresponds to a same style;
an information conversion module 703, configured to convert each style information into a corresponding style feature through the image generation model;
an error obtaining module 704, configured to obtain a first error and a second error based on the plurality of style features obtained through the conversion, distribution information of the style features, and target distribution information, where the target distribution information is used to define a spatial distribution area of the style features, the first error is used to indicate an error between the distribution information of the style features and the target distribution information, and the second error is used to indicate a difference in spatial distribution of the style features corresponding to different styles;
a parameter adjusting module 705, configured to adjust a parameter of the image generation model based on the first error and the second error.
In one possible implementation, the information conversion module 703 is configured to perform any one of the following:
in response to the style information comprising an image, extracting style features of the image through a feature extraction network in the image generation model;
in response to the style information including a style label, the style label is mapped to a corresponding style feature through a feature mapping network in the image generation model.
In one possible implementation, the error obtaining module 704 is configured to perform the following (a simplified sketch is given after this list):
Acquiring KL divergence between the distribution information of the style characteristics and the target distribution information;
determining the first error based on the KL divergence, the first error being positively correlated with the KL divergence;
acquiring a first distance between any two style features belonging to the same style and a second distance between any two style features belonging to different styles;
determining the second error based on a difference between the first distance and the second distance corresponding to the same style feature.
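The sketch below illustrates, under simplifying assumptions, how the first and second errors described by the error obtaining module 704 might be computed in Python: the style features are assumed to be parameterized as Gaussians (mean and log-variance) so that the KL divergence to a standard-normal target distribution has a closed form, and the second error is taken as a margin-based contrast between intra-style and inter-style distances. The Gaussian parameterization, the margin value and all function names are assumptions for illustration.

    import torch

    def first_error(mu: torch.Tensor, logvar: torch.Tensor) -> torch.Tensor:
        # KL divergence between N(mu, sigma^2) and the standard-normal target
        # distribution; the first error grows as the KL divergence grows.
        return 0.5 * torch.mean(torch.sum(mu.pow(2) + logvar.exp() - 1.0 - logvar, dim=-1))

    def second_error(features: torch.Tensor, styles: torch.Tensor, margin: float = 1.0) -> torch.Tensor:
        # features: (N, D) style features; styles: (N,) style ids.
        # At least two features per style are assumed, otherwise the mean below is undefined.
        dist = torch.cdist(features, features)                     # pairwise distances
        same = styles.unsqueeze(0) == styles.unsqueeze(1)          # pairs of the same style
        not_self = ~torch.eye(len(styles), dtype=torch.bool)
        first_dist = dist[same & not_self].mean()                  # first distance: same style
        second_dist = dist[~same].mean()                           # second distance: different styles
        # The difference between the two distances drives the second error.
        return torch.relu(first_dist - second_dist + margin)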
In one possible implementation, the apparatus further includes:
and the image acquisition module is used for converting the style of the initial image to be processed based on any style characteristic through a generation network in the image generation model to obtain a target image, and the style of the target image is consistent with the style indicated by any style characteristic.
In one possible implementation, the apparatus further includes:
an error determination module for determining a third error based on a similarity between the target image and the initial image;
the parameter adjusting module 705 is configured to adjust a parameter of the image generation model based on the first error, the second error, and the third error.
In one possible implementation, the error determination module is configured to perform the following (a simplified sketch is given after this list):
respectively extracting features of different scales from the target image and the initial image through feature extraction layers of at least two scales to obtain at least two first features corresponding to the target image and at least two second features corresponding to the initial image;
respectively acquiring third distances between the first feature and the second feature with the same scale;
the third error is determined based on the acquired at least two third distances.
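A possible form of this third error is a multi-scale perceptual distance. The sketch below assumes a pretrained VGG16 backbone whose intermediate activations play the role of the feature extraction layers of at least two scales; the choice of VGG16, the layer indices and the L1 distance are assumptions for illustration, not the specific networks used in this application.

    import torch
    import torchvision

    class ThirdError(torch.nn.Module):
        def __init__(self, layer_ids=(3, 8, 15)):
            super().__init__()
            # Pretrained VGG16 features serve as frozen feature extraction layers (assumed).
            self.vgg = torchvision.models.vgg16(weights="IMAGENET1K_V1").features.eval()
            self.layer_ids = set(layer_ids)
            for p in self.vgg.parameters():
                p.requires_grad_(False)

        def _features(self, x):
            feats = []
            for i, layer in enumerate(self.vgg):
                x = layer(x)
                if i in self.layer_ids:
                    feats.append(x)
            return feats

        def forward(self, target_image, initial_image):
            # First features (target image) and second features (initial image),
            # extracted at the same set of scales; inputs are assumed to be
            # normalized 3-channel image batches.
            firsts = self._features(target_image)
            seconds = self._features(initial_image)
            # Third distances between features of the same scale, combined into the third error.
            return sum(torch.nn.functional.l1_loss(a, b) for a, b in zip(firsts, seconds))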
In one possible implementation, the apparatus further includes:
the weighting processing module is used for weighting the first error, the second error and the third error to obtain a fourth error;
the parameter adjusting module 705 is configured to adjust a parameter of the image generation model based on the fourth error.
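As a one-line illustration of the weighting just described, with hypothetical weight values that a developer would tune:

    def fourth_error(first, second, third, w1=1.0, w2=1.0, w3=10.0):
        # Weighted combination of the first, second and third errors (the fourth error).
        return w1 * first + w2 * second + w3 * third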
The device provided by the embodiment of the application obtains a plurality of style features belonging to different styles through the image generation model, determines a first error and a second error based on the spatial distribution of these style features, and trains the image generation model with both errors. The first error constrains the spatial distribution of the style features obtained by the image generation model so that they lie in a compact space, allowing style features belonging to different styles to transition smoothly; the second error keeps style features belonging to different styles separable. In other words, with this training device the style features of different styles remain both distinguishable and smoothly connected, so that when the image generation model performs a conversion with a large style span on an initial image, a good transition between the input image and the newly generated image is guaranteed, and the realism and image quality of the newly generated image are improved.
It should be noted that: in the training apparatus for an image generation model provided in the above embodiment, when the image generation model is trained, only the division of the above functional modules is illustrated, and in practical applications, the above function distribution may be completed by different functional modules according to needs, that is, the internal structure of the apparatus may be divided into different functional modules to complete all or part of the above described functions. In addition, the training apparatus for the image generation model and the training method for the image generation model provided in the above embodiments belong to the same concept, and specific implementation processes thereof are described in detail in the method embodiments and are not described herein again.
The computer device provided by the above technical solution can be implemented as a terminal or a server. For example, Fig. 8 is a schematic structural diagram of a terminal provided in the embodiment of the present application. The terminal 800 may be a smart phone, a tablet computer, an MP3 player (Moving Picture Experts Group Audio Layer III), an MP4 player (Moving Picture Experts Group Audio Layer IV), a notebook computer, or a desktop computer. The terminal 800 may also be referred to by other names such as user equipment, portable terminal, laptop terminal, or desktop terminal.
In general, the terminal 800 includes: one or more processors 801 and one or more memories 802.
The processor 801 may include one or more processing cores, such as a 4-core processor, an 8-core processor, and so forth. The processor 801 may be implemented in at least one hardware form of a DSP (Digital Signal Processing), an FPGA (Field-Programmable Gate Array), and a PLA (Programmable Logic Array). The processor 801 may also include a main processor and a coprocessor, where the main processor is a processor for Processing data in an awake state, and is also called a Central Processing Unit (CPU); a coprocessor is a low power processor for processing data in a standby state. In some embodiments, the processor 801 may be integrated with a GPU (Graphics Processing Unit), which is responsible for rendering and drawing the content required to be displayed on the display screen. In some embodiments, the processor 801 may further include an AI (Artificial Intelligence) processor for processing computing operations related to machine learning.
Memory 802 may include one or more computer-readable storage media, which may be non-transitory. Memory 802 may also include high speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In some embodiments, a non-transitory computer readable storage medium in the memory 802 is used to store at least one computer program for execution by the processor 801 to implement the training method of the image generation model provided by the method embodiments of the present application.
In some embodiments, the terminal 800 may further include: a peripheral interface 803 and at least one peripheral. The processor 801, memory 802 and peripheral interface 803 may be connected by bus or signal lines. Various peripheral devices may be connected to peripheral interface 803 by a bus, signal line, or circuit board. Specifically, the peripheral device includes: at least one of a radio frequency circuit 804, a display screen 805, a camera assembly 806, an audio circuit 807, a positioning assembly 808, and a power supply 809.
The peripheral interface 803 may be used to connect at least one peripheral related to I/O (Input/Output) to the processor 801 and the memory 802. In some embodiments, the processor 801, memory 802, and peripheral interface 803 are integrated on the same chip or circuit board; in some other embodiments, any one or two of the processor 801, the memory 802, and the peripheral interface 803 may be implemented on separate chips or circuit boards, which are not limited by this embodiment.
The Radio Frequency circuit 804 is used for receiving and transmitting RF (Radio Frequency) signals, also called electromagnetic signals. The radio frequency circuitry 804 communicates with communication networks and other communication devices via electromagnetic signals. The rf circuit 804 converts an electrical signal into an electromagnetic signal to be transmitted, or converts a received electromagnetic signal into an electrical signal. Optionally, the radio frequency circuit 804 includes: an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a subscriber identity module card, and so forth. The radio frequency circuit 804 may communicate with other terminals via at least one wireless communication protocol. The wireless communication protocols include, but are not limited to: metropolitan area networks, various generation mobile communication networks (2G, 3G, 4G, and 5G), Wireless local area networks, and/or WiFi (Wireless Fidelity) networks. In some embodiments, the radio frequency circuit 804 may further include NFC (Near Field Communication) related circuits, which are not limited in this application.
The display screen 805 is used to display a UI (user interface). The UI may include graphics, text, icons, video, and any combination thereof. When the display 805 is a touch display, the display 805 also has the ability to capture touch signals on or above the surface of the display 805. The touch signal may be input to the processor 801 as a control signal for processing. At this point, the display 805 may also be used to provide virtual buttons and/or a virtual keyboard, also referred to as soft buttons and/or a soft keyboard. In some embodiments, the display 805 may be one, providing the front panel of the terminal 800; in other embodiments, the display 805 may be at least two, respectively disposed on different surfaces of the terminal 800 or in a folded design; in some embodiments, display 805 may be a flexible display disposed on a curved surface or a folded surface of terminal 800. Even further, the display 805 may be arranged in a non-rectangular irregular pattern, i.e., a shaped screen. The Display 805 can be made of LCD (Liquid Crystal Display), OLED (Organic Light-Emitting Diode), and other materials.
The camera assembly 806 is used to capture images or video. Optionally, camera assembly 806 includes a front camera and a rear camera. Generally, a front camera is disposed at a front panel of the terminal, and a rear camera is disposed at a rear surface of the terminal. In some embodiments, the number of the rear cameras is at least two, and each rear camera is any one of a main camera, a depth-of-field camera, a wide-angle camera and a telephoto camera, so that the main camera and the depth-of-field camera are fused to realize a background blurring function, and the main camera and the wide-angle camera are fused to realize panoramic shooting and VR (Virtual Reality) shooting functions or other fusion shooting functions. In some embodiments, camera assembly 806 may also include a flash. The flash lamp can be a monochrome temperature flash lamp or a bicolor temperature flash lamp. The double-color-temperature flash lamp is a combination of a warm-light flash lamp and a cold-light flash lamp, and can be used for light compensation at different color temperatures.
The audio circuit 807 may include a microphone and a speaker. The microphone is used for collecting sound waves of a user and the environment, converting the sound waves into electric signals, and inputting the electric signals to the processor 801 for processing or inputting the electric signals to the radio frequency circuit 804 to realize voice communication. For the purpose of stereo sound collection or noise reduction, a plurality of microphones may be provided at different portions of the terminal 800. The microphone may also be an array microphone or an omni-directional pick-up microphone. The speaker is used to convert electrical signals from the processor 801 or the radio frequency circuit 804 into sound waves. The loudspeaker can be a traditional film loudspeaker or a piezoelectric ceramic loudspeaker. When the speaker is a piezoelectric ceramic speaker, the speaker can be used for purposes such as converting an electric signal into a sound wave audible to a human being, or converting an electric signal into a sound wave inaudible to a human being to measure a distance. In some embodiments, the audio circuitry 807 may also include a headphone jack.
The positioning component 808 is used to locate the current geographic position of the terminal 800 for navigation or LBS (Location Based Service). The positioning component 808 may be based on the GPS (Global Positioning System) of the United States, the BeiDou system of China, the GLONASS system of Russia, or the Galileo system of the European Union.
Power supply 809 is used to provide power to various components in terminal 800. The power supply 809 can be ac, dc, disposable or rechargeable. When the power source 809 comprises a rechargeable battery, the rechargeable battery may support wired or wireless charging. The rechargeable battery may also be used to support fast charge technology.
In some embodiments, terminal 800 also includes one or more sensors 810. The one or more sensors 810 include, but are not limited to: acceleration sensor 811, gyro sensor 812, pressure sensor 813, fingerprint sensor 814, optical sensor 815 and proximity sensor 816.
The acceleration sensor 811 may detect the magnitude of acceleration in three coordinate axes of the coordinate system established with the terminal 800. For example, the acceleration sensor 811 may be used to detect the components of the gravitational acceleration in three coordinate axes. The processor 801 may control the display 805 to display the user interface in a landscape view or a portrait view according to the gravitational acceleration signal collected by the acceleration sensor 811. The acceleration sensor 811 may also be used for acquisition of motion data of a game or a user.
The gyro sensor 812 may detect a body direction and a rotation angle of the terminal 800, and the gyro sensor 812 may cooperate with the acceleration sensor 811 to acquire a 3D motion of the user with respect to the terminal 800. From the data collected by the gyro sensor 812, the processor 801 may implement the following functions: motion sensing (such as changing the UI according to a user's tilting operation), image stabilization at the time of photographing, game control, and inertial navigation.
Pressure sensors 813 may be disposed on the side frames of terminal 800 and/or underneath display 805. When the pressure sensor 813 is disposed on the side frame of the terminal 800, the holding signal of the user to the terminal 800 can be detected, and the processor 801 performs left-right hand recognition or shortcut operation according to the holding signal collected by the pressure sensor 813. When the pressure sensor 813 is disposed at a lower layer of the display screen 805, the processor 801 controls the operability control on the UI interface according to the pressure operation of the user on the display screen 805. The operability control comprises at least one of a button control, a scroll bar control, an icon control and a menu control.
The fingerprint sensor 814 is used for collecting a fingerprint of the user, and the processor 801 identifies the identity of the user according to the fingerprint collected by the fingerprint sensor 814, or the fingerprint sensor 814 identifies the identity of the user according to the collected fingerprint. Upon identifying that the user's identity is a trusted identity, the processor 801 authorizes the user to perform relevant sensitive operations including unlocking a screen, viewing encrypted information, downloading software, paying for and changing settings, etc. Fingerprint sensor 814 may be disposed on the front, back, or side of terminal 800. When a physical button or a vendor Logo is provided on the terminal 800, the fingerprint sensor 814 may be integrated with the physical button or the vendor Logo.
The optical sensor 815 is used to collect the ambient light intensity. In one embodiment, processor 801 may control the display brightness of display 805 based on the ambient light intensity collected by optical sensor 815. Specifically, when the ambient light intensity is high, the display brightness of the display screen 805 is increased; when the ambient light intensity is low, the display brightness of the display 805 is reduced. In another embodiment, the processor 801 may also dynamically adjust the shooting parameters of the camera assembly 806 based on the ambient light intensity collected by the optical sensor 815.
A proximity sensor 816, also known as a distance sensor, is typically provided on the front panel of the terminal 800. The proximity sensor 816 is used to collect the distance between the user and the front surface of the terminal 800. In one embodiment, when the proximity sensor 816 detects that the distance between the user and the front surface of the terminal 800 gradually decreases, the processor 801 controls the display 805 to switch from the bright-screen state to the screen-off state; when the proximity sensor 816 detects that the distance between the user and the front surface of the terminal 800 gradually increases, the processor 801 controls the display 805 to switch from the screen-off state to the bright-screen state.
Those skilled in the art will appreciate that the configuration shown in fig. 8 is not intended to be limiting of terminal 800 and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components may be used.
Fig. 9 is a schematic structural diagram of a server provided in the embodiment of the present application. The server 900 may vary greatly in configuration or performance and may include one or more processors (CPUs) 901 and one or more memories 902, where the one or more memories 902 store at least one computer program that is loaded and executed by the one or more processors 901 to implement the methods provided by the foregoing method embodiments. Of course, the server 900 may also have components such as a wired or wireless network interface, a keyboard, and an input/output interface for input and output, and the server 900 may also include other components for implementing device functions, which are not described herein again.
In an exemplary embodiment, a computer-readable storage medium, such as a memory including at least one computer program, executable by a processor, is also provided to perform the training method of the image generation model in the above embodiments. For example, the computer-readable storage medium may be a Read-Only Memory (ROM), a Random Access Memory (RAM), a Compact Disc Read-Only Memory (CD-ROM), a magnetic tape, a floppy disk, an optical data storage device, and the like.
In an exemplary embodiment, a computer program product is also provided, the computer program product comprising at least one computer program, the at least one computer program being stored in a computer readable storage medium. The at least one computer program is read by a processor of the computer device from the computer-readable storage medium, and the at least one computer program is executed by the processor to cause the computer device to implement the operations performed by the training method of the image generation model.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, and the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
The above description is only exemplary of the present application and should not be taken as limiting, as any modification, equivalent replacement, or improvement made within the spirit and principle of the present application should be included in the protection scope of the present application.

Claims (10)

1. A method of training an image generation model, the method comprising:
acquiring an image generation model to be trained, wherein the image generation model is used for converting any style information into corresponding style characteristics, and converting the style of any image based on the style characteristics, and the style information is information used for describing the corresponding style;
acquiring at least two groups of style information, wherein the style information of the same group corresponds to the same style;
converting each style information into a corresponding style characteristic through the image generation model;
obtaining a first error and a second error based on the plurality of style features obtained by conversion, distribution information of the style features and target distribution information, wherein the target distribution information is used for limiting the spatial distribution area of the style features, the first error is used for indicating the error between the distribution information of the style features and the target distribution information, and the second error is used for indicating the difference of the style features corresponding to different styles in spatial distribution;
adjusting parameters of the image generation model based on the first error and the second error.
2. The method of claim 1, wherein the converting each style information into a corresponding style feature by the image generation model comprises any one of:
in response to the style information comprising an image, extracting style features of the image through a feature extraction network in the image generation model;
and in response to the style information comprising a style label, mapping the style label to a corresponding style feature through a feature mapping network in the image generation model.
3. The method according to claim 1, wherein the obtaining a first error and a second error based on the plurality of style features obtained by the conversion, the distribution information of the style features, and the target distribution information comprises:
acquiring KL divergence between the distribution information of the style characteristics and the target distribution information;
determining the first error based on the KL divergence, the first error being positively correlated with the KL divergence;
acquiring a first distance between any two style features belonging to the same style and a second distance between any two style features belonging to different styles;
determining the second error based on a difference between the first distance and the second distance corresponding to the same style feature.
4. The method of claim 1, wherein after converting each style information into a corresponding style feature by the image generation model, the method further comprises:
and converting the style of the initial image to be processed based on any style characteristic through a generating network in the image generating model to obtain a target image, wherein the style of the target image is consistent with the style indicated by any style characteristic.
5. The method according to claim 4, wherein after the style of the initial image to be processed is converted based on any style feature through a generation network in the image generation model to obtain the target image, the method further comprises:
determining a third error based on a similarity between the target image and the initial image;
the adjusting parameters of the image generation model based on the first error and the second error comprises:
adjusting parameters of the image generation model based on the first error, the second error, and the third error.
6. The method of claim 5, wherein determining a third error based on the similarity between the target image and the initial image comprises:
respectively extracting features of different scales from the target image and the initial image through feature extraction layers of at least two scales to obtain at least two first features corresponding to the target image and at least two second features corresponding to the initial image;
respectively acquiring a third distance between the first feature and the second feature of the same scale;
determining the third error based on the acquired at least two third distances.
7. The method of claim 5, wherein after determining a third error based on the similarity between the target image and the initial image, the method further comprises:
weighting the first error, the second error and the third error to obtain a fourth error;
the adjusting parameters of the image generation model based on the first error, the second error, and the third error comprises:
adjusting parameters of the image generation model based on the fourth error.
8. An apparatus for training an image generation model, the apparatus comprising:
the model acquisition module is used for acquiring an image generation model to be trained, the image generation model is used for converting any style information into corresponding style characteristics, the style of any image is converted based on the style characteristics, and the style information is information used for describing the corresponding style;
the information acquisition module is used for acquiring at least two groups of style information, and the style information of the same group corresponds to the same style;
the information conversion module is used for converting each style information into a corresponding style characteristic through the image generation model;
an error obtaining module, configured to obtain a first error and a second error based on the plurality of style features obtained through conversion, distribution information of the style features, and target distribution information, where the target distribution information is used to define a spatial distribution area of the style features, the first error is used to indicate an error between the distribution information of the style features and the target distribution information, and the second error is used to indicate a difference in spatial distribution of the style features corresponding to different styles;
a parameter adjustment module for adjusting a parameter of the image generation model based on the first error and the second error.
9. A computer device, characterized in that the computer device comprises one or more processors and one or more memories, in which at least one computer program is stored, which is loaded and executed by the one or more processors to implement the operations performed by the training method of an image generation model according to any one of claims 1 to 7.
10. A computer-readable storage medium, having at least one computer program stored therein, the at least one computer program being loaded and executed by a processor to perform operations performed by a training method for an image generation model according to any one of claims 1 to 7.