CN113628103A - High-fine-granularity cartoon face generation method based on multi-level loss and related components thereof - Google Patents

Publication number: CN113628103A
Authority: CN (China)
Prior art keywords: feature map, face, decoding module, loss, decoding
Legal status: Granted
Application number: CN202110990302.8A
Other languages: Chinese (zh)
Other versions: CN113628103B (en)
Inventor: 王健
Current assignee: Shenzhen Wondershare Software Co Ltd
Original assignee: Shenzhen Wondershare Software Co Ltd
Application filed by Shenzhen Wondershare Software Co Ltd
Priority: CN202110990302.8A
Publication of CN113628103A
Application granted
Publication of CN113628103B
Current legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00 Geometric image transformations in the plane of the image
    • G06T3/04 Context-preserving transformations, e.g. by using an importance map

Landscapes

  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses a method for generating a high-fine-granularity cartoon face based on multi-level loss, together with related components thereof. The method comprises the following steps: inputting a face sample picture feature map into a plurality of continuous decoding modules of a face generation model for decoding, and obtaining the feature mapping map decoded by each decoding module; calculating the loss of each decoding module based on the face sample picture feature map and the feature mapping map decoded by each decoding module, and combining the losses of the decoding modules to obtain a multi-level loss; and calculating and updating the weight of the face generation model by utilizing the multi-level loss, and performing cartoon face conversion on the face target picture feature map by utilizing the updated face generation model to obtain a cartoon face picture. Because the method updates the weight of the face generation model with the multi-level loss, the multi-level loss acts as a supervision signal that shapes the cartoon face generation effect, improving the fine-grained generation quality of the face generation model.

Description

High-fine-granularity cartoon face generation method based on multi-level loss and related components thereof
Technical Field
The invention relates to the technical field of image conversion, in particular to a high-fine-granularity cartoon face generation method based on multi-level loss and a related component thereof.
Background
At present, high-traffic video creation software at home and abroad provides a cartoon face generation function. For example, video editing applications ship with cartoon-style templates so that a user can upload a picture and generate a face in a specific cartoon style, and the Snapchat application in the Apple App Store likewise offers cartoon filters that generate an exclusive cartoon face from an input face image. Mainstream cartoon face generation methods are implemented on the CycleGAN network structure with an unpaired training data set: real face images serve as the source domain and cartoon faces of a specific style serve as the target domain, so that during training the network can "learn" the key differences between the two domains and, given a source-domain image, generate a cartoon face in the style of the target domain. Although existing cartoon face generation algorithms can accurately generate faces in the corresponding cartoon style, they leave room for improvement in two respects: 1. Most are based on the CycleGAN network structure, whose generator and discriminator mainly adopt an encoding-decoding structure; however, the generator makes little use of the feature maps produced during decoding, so the network loses some semantic edge information during up-sampling, and the finally generated cartoon face is handled poorly at fine granularity. 2. During model training, feature information in the deep network is not used as a supervision signal to guide the model in completing the conversion from the source domain to the target domain.
Disclosure of Invention
The embodiment of the invention provides a high-fine-granularity cartoon face generation method based on multi-level loss and related components thereof, and aims to solve the problems that the fine granularity of a cartoon face picture is insufficient and feature information cannot be sufficiently utilized for model guidance in the prior art.
In a first aspect, an embodiment of the present invention provides a method for generating a high-fine-granularity cartoon face based on multi-level loss, including:
inputting the face sample picture feature map into a plurality of continuous decoding modules of a face generation model for decoding, and acquiring a feature mapping map decoded by each decoding module;
calculating the loss of each decoding module based on the face sample picture feature map and the feature mapping map decoded by each decoding module, and combining the loss of each decoding module to obtain multi-level loss;
and calculating and updating the weight of the face generation model by utilizing the multi-level loss to obtain an updated face generation model, and performing cartoon face conversion on the face target picture feature map by utilizing the updated face generation model to obtain a cartoon face picture.
In a second aspect, an embodiment of the present invention provides a high-fine-granularity cartoon face generation system based on multi-level loss, including:
the feature map decoding unit is used for inputting the face sample picture feature map into a plurality of continuous decoding modules of the face generation model for decoding, and acquiring the feature mapping map decoded by each decoding module;
the multi-level loss calculation unit is used for calculating the loss of each decoding module based on the face sample picture feature map and the feature mapping map decoded by each decoding module, and combining the losses of the decoding modules to obtain a multi-level loss;
and the cartoon face picture acquisition unit is used for calculating and updating the weight of the face generation model by utilizing the multi-level loss to obtain an updated face generation model, and performing cartoon face conversion on the face target picture feature map by utilizing the updated face generation model to obtain a cartoon face picture.
In a third aspect, an embodiment of the present invention further provides a computer device, which includes a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor, when executing the computer program, implements the method for generating a high-fine-granularity cartoon face based on multi-level loss according to the first aspect.
In a fourth aspect, the present invention further provides a computer-readable storage medium, where the computer-readable storage medium stores a computer program, and the computer program, when executed by a processor, causes the processor to execute the method for generating a high-fine-granularity cartoon face based on multi-level loss according to the first aspect.
The embodiment of the invention provides a high-fine-granularity cartoon face generation method based on multi-level loss and related components thereof, wherein the method comprises the following steps: inputting the face sample picture feature map into a plurality of continuous decoding modules of a face generation model for decoding, and acquiring a feature mapping map decoded by each decoding module; calculating the loss of each decoding module based on the face sample picture feature map and the feature mapping map decoded by each decoding module, and combining the loss of each decoding module to obtain multi-level loss; and calculating and updating the weight of the face generation model by utilizing the multistage loss to obtain an updated face generation model, and performing cartoon face conversion on the face target picture characteristic picture by utilizing the updated face generation model to obtain a cartoon face picture. The embodiment of the invention updates the weight of the face generation model by utilizing the multi-level loss, thereby enabling the multi-level loss to have the function of influencing the generation effect of the cartoon face as a supervision signal and improving the generation effect of the face generation model on the fine granularity.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
Fig. 1 is a schematic flow chart of a high-fine-granularity cartoon face generation method based on multi-level loss according to an embodiment of the present invention;
fig. 2 is a decoding flow chart of a decoding module of the high-fine-granularity cartoon face generation method based on multi-level loss according to the embodiment of the present invention;
fig. 3 is a multi-level loss calculation flowchart of the high-fine-granularity cartoon face generation method based on multi-level loss according to the embodiment of the present invention;
fig. 4 is a schematic block diagram of a high-fine-granularity cartoon face generation system based on multi-level loss according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It will be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in the specification of the present invention and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should be further understood that the term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.
Referring to fig. 1, fig. 1 is a schematic flow chart of a method for generating a high-fine-grained cartoon face based on multi-level loss according to an embodiment of the present invention, where the method includes steps S101 to S103.
S101, inputting a human face sample picture feature map into a plurality of continuous decoding modules of a human face generation model for decoding, and acquiring a feature mapping map decoded by each decoding module;
s102, calculating the loss of each decoding module based on the face sample picture feature map and the feature mapping map decoded by each decoding module, and combining the loss of each decoding module to obtain multi-level loss;
s103, calculating and updating the weight of the face generation model by using the multilevel loss to obtain an updated face generation model, and performing cartoon face conversion on the face target picture feature picture by using the updated face generation model to obtain a cartoon face picture.
In this embodiment, a face sample picture feature map is input into a plurality of consecutive decoding modules for decoding to obtain the feature mapping map corresponding to each decoding module; the loss of each decoding module is then calculated from the face sample picture feature map and that module's feature mapping map, and the losses of all decoding modules are combined to obtain a multi-level loss. The multi-level loss is used to update the weight of the face generation model and thereby guide the direction of its training, and the trained face generation model finally performs cartoon face conversion on the face target picture feature map to obtain a cartoon face picture. In this embodiment, the loss of each decoding module may be calculated as an L1 loss; after the L1 loss of each decoding module is calculated, all L1 losses are combined into the multi-level loss, and the weight of the face generation model is updated with it, so that the model achieves a better fine-grained generation effect and the converted cartoon face picture is finer and more natural. Before the face sample picture feature map is received, its size may be set according to the size of the convolution kernels in the decoding modules, so as to ensure that the two are adapted to each other.
In an embodiment, the inputting the feature map of the face sample picture into a plurality of consecutive decoding modules of the face generation model for decoding, and obtaining the feature map decoded by each decoding module includes:
taking the first decoding module as a current decoding module, and taking the face sample picture feature map as a current feature map;
and (3) decoding: inputting the current feature map into the current decoding module for decoding, and outputting a corresponding feature map;
and taking the feature mapping map output by the current decoding module as the current feature map, taking the next decoding module after the current decoding module as the current decoding module, and returning to the decoding step until the last decoding module completes the decoding operation and outputs the corresponding final feature mapping map.
In this embodiment, a face sample picture feature map is input into a plurality of consecutive decoding modules for decoding, a first decoding module is used as a current decoding module, the face sample picture feature map is used as a current feature map, the current feature map is input into the current decoding module for decoding, a corresponding feature map is output, the output of the current decoding module (i.e., the feature map corresponding to the current decoding module) is used as the input of a next decoding module, the corresponding feature map is output again, and the process is repeated until the last decoding module completes decoding operation and outputs a final feature map. In this embodiment, 4 consecutive decoding modules (i.e., a first decoding module, a second decoding module, a third decoding module, and a fourth decoding module) are used for decoding, and the face sample picture feature map is decoded 4 times and a result of each decoding operation is output, that is, the first decoding module outputs a corresponding first feature map, the second decoding module outputs a corresponding second feature map, the third decoding module outputs a corresponding third feature map, and the fourth decoding module outputs a corresponding final feature map.
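The sequential decoding described above can be sketched as follows. This is a minimal illustration of the data flow only: the `run_decoders` helper and the toy stand-in decoder functions are assumptions for the sketch, not the patent's actual network modules.

```python
import numpy as np

def run_decoders(sample_feature_map, decoders):
    """Feed the feature map through consecutive decoding modules,
    recording each module's input (its "current feature map") and
    its output (its "feature mapping map")."""
    inputs, outputs = [], []
    current = sample_feature_map
    for decode in decoders:
        inputs.append(current)
        current = decode(current)  # each output feeds the next module
        outputs.append(current)
    return inputs, outputs  # outputs[-1] is the final feature mapping map

# Four toy "decoders" stand in for the real residual decoding modules.
decoders = [lambda x: x + 1.0] * 4
inputs, outputs = run_decoders(np.zeros((8, 8)), decoders)
```

With four modules, `inputs` holds the four current feature maps d0..d3 and `outputs[-1]` is the final feature mapping map, matching the first/second/third/final feature mapping maps described above.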
In an embodiment, the inputting the current feature map into the current decoding module for decoding and outputting a corresponding feature map includes:
inputting the current feature map into a residual error part in the current decoding module to carry out convolution operation for multiple times to obtain an initial feature map;
and performing fusion processing based on the initial feature map and the current feature map, and outputting a corresponding feature map.
In this embodiment, each decoding module adopts a residual network structure, where the residual network structure has two branches, one branch is to input the current feature map into the residual part for performing convolution operation for multiple times and output an initial feature map, and the other branch is to directly map the current feature map into an output, and then perform fusion processing on the outputs of the two branches to obtain a feature map corresponding to the current decoding module.
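The two-branch residual structure above can be sketched like this. Fusing the branches by elementwise addition is an assumption (the patent only says "fusion processing", and addition is the standard residual fusion); the toy elementwise "convolutions" are likewise illustrative only.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_decode(current, conv1, conv2):
    """One decoding module: the residual branch applies
    conv1 -> ReLU -> conv2 to produce the initial feature mapping map,
    the identity branch passes the input through unchanged, and the
    two branch outputs are fused (here by addition)."""
    initial = conv2(relu(conv1(current)))  # residual-branch output
    return initial + current               # fuse with identity branch

# Toy elementwise "convolutions" keep the example tiny.
out = residual_decode(np.ones((4, 4)), lambda x: 2.0 * x, lambda x: 0.5 * x)
```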
In an embodiment, referring to fig. 2, the inputting of the current feature map into the residual part of the current decoding module for multiple convolution operations to obtain an initial feature map includes:
inputting the current feature map into a first convolution unit of the residual part for convolution processing to obtain a first convolution result;
and accelerating convergence of the first convolution result by utilizing a ReLU activation function, and inputting the convergence result into a second convolution unit of the residual part for convolution processing to obtain an initial feature mapping map.
In this embodiment, the current feature map is subjected to a plurality of convolution operations, first the current feature map is input to a first convolution unit for convolution processing to obtain a first convolution result, then a ReLU activation function is used to perform accelerated convergence on the first convolution result, and finally the convergence result of the ReLU activation function is input to a second convolution unit for convolution processing to output the initial feature map.
In an embodiment, as shown in fig. 2, the inputting of the current feature map into the first convolution unit of the residual part for convolution processing to obtain a first convolution result includes:
carrying out boundary expansion processing on the current feature map;
and inputting the current feature map subjected to the boundary expansion processing into a first convolution layer for convolution operation, and performing self-adaptive regularization processing on a convolution result to obtain a first convolution result.
In this embodiment, when the first convolution unit performs convolution operation, the boundary expansion processing is performed on the current feature map, then the current feature map after the boundary expansion is input into the first convolution layer to perform convolution operation, and finally the self-adaptive regularization processing is performed on the convolution result, and finally the first convolution result is output. Since the size of the convolution kernel affects the size of the learnable spatial feature value of the generator network, the size of the convolution kernel of the first convolution layer needs to be set correspondingly according to the size of the input human face sample picture feature map.
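The boundary expansion step can be illustrated with plain NumPy: padding the map by one pixel on each side lets a 3x3 "valid" convolution return a map of the original spatial size. The reflect padding mode and the 3x3 averaging kernel are illustrative assumptions; the patent does not specify them.

```python
import numpy as np

x = np.arange(16, dtype=float).reshape(4, 4)   # current feature map
padded = np.pad(x, 1, mode="reflect")          # boundary expansion: 4x4 -> 6x6
kernel = np.ones((3, 3)) / 9.0                 # toy 3x3 averaging kernel

# "Valid" convolution over the padded map restores the 4x4 size,
# so no spatial information is cropped away at the borders.
out = np.array([[(padded[i:i+3, j:j+3] * kernel).sum()
                 for j in range(4)] for i in range(4)])
```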
In an embodiment, as shown in fig. 2, the accelerating convergence of the first convolution result by using the ReLU activation function, and inputting the convergence result to the second convolution unit of the residual part for convolution processing to obtain the initial feature map includes:
accelerating convergence is carried out on the first convolution result by utilizing a ReLU activation function, and a convergence characteristic diagram is obtained;
and performing boundary expansion processing on the convergence characteristic diagram, inputting the convergence characteristic diagram after the boundary expansion processing into a second convolution layer for convolution operation, and performing self-adaptive regularization processing on a convolution result to obtain an initial characteristic mapping diagram.
In this embodiment, after the first convolution result is accelerated and converged by the ReLU activation function, the convergence result is subjected to boundary expansion, and then the boundary expansion result is input into the second convolution layer to be subjected to convolution operation, and then the convolution result is subjected to adaptive regularization processing, so as to output an initial feature map. In the generator, since the size of the convolution kernel affects the size of the learnable spatial feature value of the generator network, the size of the convolution kernel of the second convolution layer needs to be set correspondingly according to the size of the input human face sample picture feature map.
In an embodiment, the calculating the loss of each decoding module based on the face sample picture feature map and the feature map decoded by each decoding module, and combining the losses of each decoding module to obtain a multi-level loss includes:
calculating a loss between the current feature map and the final feature map for each of the decoding modules;
and calculating the sum of losses of all the decoding modules, and taking the calculation result as a multi-stage loss.
In this embodiment, the loss between the current feature map and the final feature mapping map of each decoding module is calculated, and the sum of these losses gives the multi-level loss. The loss may be an L1 loss, but other losses (e.g., cross-entropy loss) may also be used. As shown in fig. 3, when the loss is an L1 loss, taking the input of the first decoding module as d0 and the final feature mapping map as d4, the two inputs for calculating the L1 loss are d0 and d4, and loss1 = |d0 - d4|; that is, the L1 loss is the absolute value of the difference between each decoding module's input and the final output. In this embodiment there are 4 decoding modules, and the input of each decoding module (i.e., its current feature map) is compared against the final feature mapping map (i.e., the fourth feature mapping map output by the fourth decoding module) to obtain a first loss loss1, a second loss loss2, a third loss loss3, and a fourth loss loss4, whereupon the multi-level loss is total = loss1 + loss2 + loss3 + loss4.
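The multi-level L1 loss described above can be sketched as follows. Averaging |d_i - d4| over the map's elements is an assumption made so that map size does not scale the loss; the patent only specifies the absolute difference.

```python
import numpy as np

def multilevel_loss(decoder_inputs, final_map):
    """L1 loss between each decoding module's input d_i and the final
    feature mapping map d4, summed over all modules:
    total = loss1 + loss2 + loss3 + loss4."""
    losses = [np.abs(d - final_map).mean() for d in decoder_inputs]
    return losses, sum(losses)

# Toy inputs d0..d3 for the four decoding modules and final map d4.
d0, d1, d2, d3 = (np.full((4, 4), v) for v in (0.0, 1.0, 2.0, 3.0))
d4 = np.full((4, 4), 4.0)
losses, total = multilevel_loss([d0, d1, d2, d3], d4)
```

The returned `total` would then drive the weight update of the face generation model as the supervision signal.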
Referring to fig. 4, fig. 4 is a schematic block diagram of a high-fine-granularity cartoon face generation system based on multi-level loss according to an embodiment of the present invention, where the high-fine-granularity cartoon face generation system 200 based on multi-level loss includes:
the feature map decoding unit 201 is configured to input the face sample picture feature map into a plurality of consecutive decoding modules of the face generation model for decoding, and obtain a feature map decoded by each decoding module;
a multi-level loss calculation unit 202, configured to calculate a loss of each decoding module based on the face sample picture feature map and a feature map obtained after decoding by each decoding module, and combine the losses of each decoding module to obtain a multi-level loss;
and the cartoon face picture acquisition unit 203 is configured to calculate and update the weight of the face generation model by using the multilevel loss to obtain an updated face generation model, and perform cartoon face conversion on the face target picture feature map by using the updated face generation model to obtain a cartoon face picture.
In one embodiment, the feature map decoding unit 201 includes:
the decoding module and the feature map defining unit are used for taking the first decoding module as a current decoding module and taking the face sample picture feature map as a current feature map;
a feature map output unit for decoding: inputting the current feature map into the current decoding module for decoding, and outputting a corresponding feature map;
and a decoding step repeating unit, configured to use the feature map output by the current decoding module as a current feature map, use a next decoding module of the current decoding module as a current decoding module, and return to the step of performing decoding until the last decoding module completes the decoding operation and outputs a corresponding final feature map.
In one embodiment, the feature map output unit includes:
an initial feature map obtaining unit, configured to input the current feature map into the residual part of the current decoding module for multiple convolution operations to obtain an initial feature map;
and the feature map fusion unit is used for performing fusion processing on the basis of the initial feature map and the current feature map and outputting a corresponding feature map.
In an embodiment, the initial feature map obtaining unit includes:
the first convolution processing unit is used for inputting the current feature map into the first convolution unit of the residual error part for convolution processing to obtain a first convolution result;
and the second convolution processing unit is used for accelerating convergence of the first convolution result by utilizing the ReLU activation function, and inputting the convergence result into the second convolution unit of the residual error part for convolution processing to obtain an initial feature mapping chart.
In one embodiment, the first convolution processing unit includes:
the boundary expansion unit is used for carrying out boundary expansion processing on the current feature map;
and the first convolution result acquisition unit is used for inputting the current feature map subjected to the boundary expansion processing into the first convolution layer for convolution operation, and performing self-adaptive regularization processing on the convolution result to obtain a first convolution result.
In one embodiment, the second convolution processing unit includes:
the convergence characteristic diagram obtaining unit is used for carrying out accelerated convergence on the first convolution result by utilizing a ReLU activation function to obtain a convergence characteristic diagram;
and the second convolution result acquisition unit is used for performing boundary expansion processing on the convergence characteristic diagram, inputting the convergence characteristic diagram after the boundary expansion processing into a second convolution layer for convolution operation, and performing self-adaptive regularization processing on a convolution result to obtain an initial characteristic mapping diagram.
In one embodiment, the multi-level loss calculation unit 202 includes:
a loss calculation unit for calculating a loss between the current feature map and the final feature map of each of the decoding modules;
and the loss statistics unit is used for calculating the sum of the losses of all the decoding modules and taking the calculation result as the multi-level loss.
The embodiment of the present invention further provides a computer device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and when the processor executes the computer program, the method for generating a high-fine-grained cartoon face based on multi-level loss as described above is implemented.
An embodiment of the present invention further provides a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the method for generating a high-fine-grained cartoon face based on multi-level loss as described above is implemented.
The embodiments are described in a progressive manner in the specification, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. For the system disclosed by the embodiment, the description is relatively simple because the system corresponds to the method disclosed by the embodiment, and the relevant points can be referred to the method part for description. It should be noted that, for those skilled in the art, it is possible to make various improvements and modifications to the present invention without departing from the principle of the present invention, and those improvements and modifications also fall within the scope of the claims of the present invention.
It is further noted that, in the present specification, relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.

Claims (10)

1. A high-fine-granularity cartoon face generation method based on multi-level loss is characterized by comprising the following steps:
inputting the face sample picture feature map into a plurality of continuous decoding modules of a face generation model for decoding, and acquiring a feature mapping map decoded by each decoding module;
calculating the loss of each decoding module based on the face sample picture feature map and the feature mapping map decoded by each decoding module, and combining the loss of each decoding module to obtain multi-level loss;
and calculating and updating the weight of the face generation model by utilizing the multi-level loss to obtain an updated face generation model, and performing cartoon face conversion on the face target picture feature map by utilizing the updated face generation model to obtain a cartoon face picture.
2. The high-fine-granularity cartoon face generation method based on multi-level loss according to claim 1, characterized in that the inputting a face sample picture feature map into a plurality of consecutive decoding modules of a face generation model for decoding and acquiring the feature mapping map decoded by each decoding module comprises:
taking the first decoding module as the current decoding module, and taking the face sample picture feature map as the current feature map;
a decoding step: inputting the current feature map into the current decoding module for decoding, and outputting a corresponding feature mapping map;
and taking the feature mapping map output by the current decoding module as the current feature map, taking the decoding module following the current decoding module as the current decoding module, and returning to the decoding step until the last decoding module completes its decoding operation and outputs the corresponding final feature mapping map.
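The sequential decoding loop of claim 2 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the toy "decoding modules" are stand-in callables, and all names are chosen for exposition only.

```python
import numpy as np

def decode_sequentially(feature_map, decoding_modules):
    """Pass a feature map through consecutive decoding modules,
    keeping the feature mapping map produced by each one (claim 2).
    maps[-1] is the final feature mapping map of the last module."""
    maps = []
    current = feature_map
    for module in decoding_modules:
        current = module(current)  # output becomes the next module's input
        maps.append(current)
    return maps

# Toy stand-ins for decoding modules: each simply scales its input.
modules = [lambda x, k=k: x * k for k in (1.0, 2.0, 0.5)]
maps = decode_sequentially(np.ones((2, 2)), modules)
```

In a real model each module would be a residual decoding block as described in claims 3 to 6, but the bookkeeping — one retained feature mapping map per module — is the same.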
3. The high-fine-granularity cartoon face generation method based on multi-level loss according to claim 2, characterized in that the inputting the current feature map into the current decoding module for decoding and outputting a corresponding feature mapping map comprises:
inputting the current feature map into a residual part of the current decoding module for multiple convolution operations to obtain an initial feature mapping map;
and performing fusion processing based on the initial feature mapping map and the current feature map, and outputting the corresponding feature mapping map.
4. The high-fine-granularity cartoon face generation method based on multi-level loss according to claim 3, characterized in that the inputting the current feature map into a residual part of the current decoding module for multiple convolution operations to obtain an initial feature mapping map comprises:
inputting the current feature map into a first convolution unit of the residual part for convolution processing to obtain a first convolution result;
and accelerating convergence of the first convolution result using a ReLU activation function, and inputting the convergence result into a second convolution unit of the residual part for convolution processing to obtain the initial feature mapping map.
5. The high-fine-granularity cartoon face generation method based on multi-level loss according to claim 4, characterized in that the inputting the current feature map into a first convolution unit of the residual part for convolution processing to obtain a first convolution result comprises:
performing boundary expansion processing on the current feature map;
and inputting the boundary-expanded current feature map into a first convolution layer for a convolution operation, and performing adaptive regularization processing on the convolution result to obtain the first convolution result.
6. The high-fine-granularity cartoon face generation method based on multi-level loss according to claim 4, characterized in that the accelerating convergence of the first convolution result using a ReLU activation function and inputting the convergence result into a second convolution unit of the residual part for convolution processing to obtain the initial feature mapping map comprises:
accelerating convergence of the first convolution result using a ReLU activation function to obtain a convergence feature map;
and performing boundary expansion processing on the convergence feature map, inputting the boundary-expanded convergence feature map into a second convolution layer for a convolution operation, and performing adaptive regularization processing on the convolution result to obtain the initial feature mapping map.
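The residual decoding block of claims 3 to 6 can be sketched in NumPy as below. This is an illustrative single-channel approximation under stated assumptions: boundary expansion is taken as reflection padding, and the patent's "adaptive regularization" is stood in for by a plain instance normalization; all function names are invented for the sketch and do not come from the patent.

```python
import numpy as np

def conv3x3(x, kernel):
    """Valid 3x3 convolution on a single-channel map."""
    h, w = x.shape
    out = np.zeros((h - 2, w - 2))
    for i in range(h - 2):
        for j in range(w - 2):
            out[i, j] = np.sum(x[i:i + 3, j:j + 3] * kernel)
    return out

def instance_norm(x, eps=1e-5):
    """Stand-in for the adaptive regularization step."""
    return (x - x.mean()) / np.sqrt(x.var() + eps)

def residual_decoding_block(x, k1, k2):
    # First convolution unit: boundary expansion -> conv -> regularization (claim 5)
    y = instance_norm(conv3x3(np.pad(x, 1, mode="reflect"), k1))
    y = np.maximum(y, 0.0)  # ReLU accelerating convergence (claim 4)
    # Second convolution unit: boundary expansion -> conv -> regularization (claim 6)
    y = instance_norm(conv3x3(np.pad(y, 1, mode="reflect"), k2))
    return x + y            # fusion of initial map and input (claim 3)

x = np.arange(16.0).reshape(4, 4)
identity = np.zeros((3, 3)); identity[1, 1] = 1.0
out = residual_decoding_block(x, identity, identity)
```

Because the padding restores the border lost by the valid convolution, the output keeps the input's spatial size, which is what allows the elementwise fusion (skip connection) in the final step.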
7. The high-fine-granularity cartoon face generation method based on multi-level loss according to claim 2, characterized in that the calculating the loss of each decoding module based on the face sample picture feature map and the feature mapping map decoded by each decoding module and combining the losses of the decoding modules to obtain a multi-level loss comprises:
calculating, for each decoding module, a loss between its current feature map and the final feature mapping map;
and calculating the sum of the losses of all the decoding modules, and taking the calculation result as the multi-level loss.
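The multi-level loss of claim 7 can be sketched as a sum of per-module losses against the final feature mapping map. This is a hedged illustration: the per-module loss is taken to be mean absolute error, and the maps are assumed to share one spatial size (a real model whose intermediate maps differ in resolution would first resize them to a common size).

```python
import numpy as np

def multilevel_loss(feature_maps):
    """Sum over decoding modules of the L1 distance between each
    intermediate feature mapping map and the final one (claim 7)."""
    final = feature_maps[-1]
    losses = [np.mean(np.abs(m - final)) for m in feature_maps[:-1]]
    return float(sum(losses))

# Three same-sized maps from three decoding modules:
maps = [np.full((2, 2), v) for v in (1.0, 3.0, 2.0)]
loss = multilevel_loss(maps)  # |1-2| + |3-2| = 2.0
```

The summed scalar is then the quantity used to update the weights of the face generation model in claim 1.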
8. A high-fine-granularity cartoon face generation system based on multi-level loss, characterized by comprising:
a feature map decoding unit, configured to input a face sample picture feature map into a plurality of consecutive decoding modules of a face generation model for decoding, and to acquire the feature mapping map decoded by each decoding module;
a multi-level loss calculation unit, configured to calculate the loss of each decoding module based on the face sample picture feature map and the feature mapping map decoded by each decoding module, and to combine the losses of the decoding modules to obtain a multi-level loss;
and a cartoon face picture acquisition unit, configured to calculate and update the weights of the face generation model using the multi-level loss to obtain an updated face generation model, and to perform cartoon face conversion on a face target picture feature map using the updated face generation model to obtain a cartoon face picture.
9. A computer device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor, when executing the computer program, implements the high-fine-granularity cartoon face generation method based on multi-level loss according to any one of claims 1 to 7.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which, when executed by a processor, causes the processor to execute the high-fine-granularity cartoon face generation method based on multi-level loss according to any one of claims 1 to 7.
CN202110990302.8A 2021-08-26 2021-08-26 High-granularity cartoon face generation method based on multistage loss and related components thereof Active CN113628103B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110990302.8A CN113628103B (en) 2021-08-26 2021-08-26 High-granularity cartoon face generation method based on multistage loss and related components thereof


Publications (2)

Publication Number Publication Date
CN113628103A true CN113628103A (en) 2021-11-09
CN113628103B CN113628103B (en) 2023-09-29

Family

ID=78387898

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110990302.8A Active CN113628103B (en) 2021-08-26 2021-08-26 High-granularity cartoon face generation method based on multistage loss and related components thereof

Country Status (1)

Country Link
CN (1) CN113628103B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110335193A (en) * 2019-06-14 2019-10-15 大连理工大学 A kind of unsupervised image conversion method based on the aiming field guiding for generating confrontation network
CN111860041A (en) * 2019-04-26 2020-10-30 北京陌陌信息技术有限公司 Face conversion model training method, device, equipment and medium
CN112357030A (en) * 2020-11-16 2021-02-12 江苏科技大学 A water quality monitoring machine fish for ocean or inland river lake
CN112633154A (en) * 2020-12-22 2021-04-09 云南翼飞视科技有限公司 Method and system for converting heterogeneous face feature vectors
WO2021073418A1 (en) * 2019-10-18 2021-04-22 平安科技(深圳)有限公司 Face recognition method and apparatus, device, and storage medium
CN113269028A (en) * 2021-04-07 2021-08-17 南方科技大学 Water body change detection method and system based on deep convolutional neural network


Also Published As

Publication number Publication date
CN113628103B (en) 2023-09-29


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant