CN113657190A - Driving method of face picture, training method of related model and related device - Google Patents


Info

Publication number
CN113657190A
CN113657190A
Authority
CN
China
Prior art keywords
image
unstructured
portrait
network
generation network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110845959.5A
Other languages
Chinese (zh)
Inventor
韩欣彤
林哲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Huya Technology Co Ltd
Original Assignee
Guangzhou Huya Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Huya Technology Co Ltd filed Critical Guangzhou Huya Technology Co Ltd
Priority to CN202110845959.5A priority Critical patent/CN113657190A/en
Publication of CN113657190A publication Critical patent/CN113657190A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods


Abstract

The application discloses a driving method for a face picture, a training method for a related model, and a related device. The image generation network model includes an image segmentation network, a structured generation network and an unstructured generation network. The training method of the image generation network model includes: acquiring a portrait image training set that includes a first portrait image and a second portrait image of the same target; segmenting the first portrait image and the second portrait image with the image segmentation network to obtain the structured area image and the unstructured area image corresponding to each; generating a prediction target image with the structured generation network and the unstructured generation network based on the structured area images and the unstructured area images of the first and second portrait images; and adjusting network parameters in the image generation network model based on a loss function between the second portrait image and the prediction target image. With this scheme, the driving effect of the face picture can be improved.

Description

Driving method of face picture, training method of related model and related device
Technical Field
The present application relates to the field of image processing technologies, and in particular, to a method for driving a face image, a method for training a related model, and a related device.
Background
Face animation is widely used in many fields, such as the computer animation industry, the game industry, teleconferencing, agents and avatars, and has been a research hot spot at home and abroad in recent years. Face animation based on a single image applies geometric transformations to a given image, can produce interesting visual effects, and is commonly used to generate special effects in film and television entertainment and in advertisement design.
Because such face animation is generated by deforming a real face picture, it has a strong sense of realism. However, the images produced by existing face-picture driving methods perform poorly in unstructured areas.
Disclosure of Invention
The technical problem mainly solved by the application is to provide a driving method for a face picture, a training method for a related model and a related device that can improve the driving effect of the face picture.
In order to solve the above problem, a first aspect of the present application provides a training method for an image generation network model, where the image generation network model includes an image segmentation network, a structured generation network and an unstructured generation network. The training method of the image generation network model includes: acquiring a portrait image training set, where the training set includes a first portrait image and a second portrait image of the same target; segmenting the first portrait image and the second portrait image with the image segmentation network to obtain a first structured area image and a first unstructured area image corresponding to the first portrait image, and a second structured area image and a second unstructured area image corresponding to the second portrait image; generating a preliminary background image with the unstructured generation network based on the first unstructured area image and the second unstructured area image; generating a preliminary face image with the structured generation network based on the first structured area image and the second structured area image; generating a prediction target image based on the preliminary face image and the preliminary background image; obtaining a loss function between the second portrait image and the prediction target image; and adjusting network parameters in the image generation network model based on the loss function.
In order to solve the above problem, a second aspect of the present application provides a method for driving a face picture, where the method includes: acquiring a source map and a driving map; inputting the source map and the driving map into an image generation network model to obtain a target image; wherein the image generation network model is obtained by training with the training method of the image generation network model of the first aspect.
In order to solve the above problem, a third aspect of the present application provides a training apparatus for an image generation network model, where the image generation network model includes an image segmentation network, a structured generation network and an unstructured generation network. The training apparatus includes: a sample acquisition module, configured to acquire a portrait image training set, where the training set includes a first portrait image and a second portrait image of the same target; an image prediction module, configured to segment the first portrait image and the second portrait image with the image segmentation network to obtain a first structured area image and a first unstructured area image corresponding to the first portrait image, and a second structured area image and a second unstructured area image corresponding to the second portrait image, generate a preliminary background image with the unstructured generation network based on the first unstructured area image and the second unstructured area image, generate a preliminary face image with the structured generation network based on the first structured area image and the second structured area image, and generate a prediction target image based on the preliminary face image and the preliminary background image; a loss function determination module, configured to obtain a loss function between the second portrait image and the prediction target image; and a parameter adjustment module, configured to adjust network parameters in the image generation network model based on the loss function.
In order to solve the above problem, a fourth aspect of the present application provides a driving apparatus for a human face picture, including: an image acquisition module for acquiring a source map and a driving map; and an image generation module for inputting the source map and the driving map into an image generation network model to obtain a target image; wherein the image generation network model is obtained by training with the training method of the image generation network model of the first aspect.
In order to solve the above problem, a fifth aspect of the present application provides an electronic device, which includes a memory and a processor coupled to each other, where the processor is configured to execute program instructions stored in the memory to implement the method for training an image generation network model according to the first aspect or the method for driving a face picture according to the second aspect.
In order to solve the above problem, a sixth aspect of the present application provides a computer-readable storage medium, on which program instructions are stored, which when executed by a processor, implement the training method of the image generation network model of the above first aspect, or the driving method of the face picture of the above second aspect.
The invention has the following beneficial effects. Different from the prior art, the image generation network model includes an image segmentation network, a structured generation network and an unstructured generation network. A portrait image training set is acquired, where the training set includes a first portrait image and a second portrait image of the same target. The first portrait image and the second portrait image are then segmented by the image segmentation network to obtain a first structured area image and a first unstructured area image corresponding to the first portrait image, and a second structured area image and a second unstructured area image corresponding to the second portrait image. Based on the first unstructured area image and the second unstructured area image, a preliminary background image is generated with the unstructured generation network; based on the first structured area image and the second structured area image, a preliminary face image is generated with the structured generation network. A prediction target image is then generated based on the preliminary face image and the preliminary background image, a loss function between the second portrait image and the prediction target image is obtained, and the network parameters in the image generation network model are adjusted based on the loss function. Because the face part and the background part of the prediction target image are processed separately, each branch can learn a better motion representation without interfering with the other. The driving effect in the face key point region can therefore be controlled accurately, and at the same time the areas outside the face key points are covered, so that a better driving effect can also be obtained in the unstructured key point areas.
Drawings
FIG. 1 is a schematic flow chart diagram illustrating an embodiment of a training method for an image generation network model according to the present application;
FIG. 2 is a flowchart illustrating an embodiment of step S13 in FIG. 1;
FIG. 3 is a flowchart illustrating an embodiment of step S14 in FIG. 1;
FIG. 4 is a schematic flowchart of an embodiment of a method for driving a face picture according to the present application;
FIG. 5 is a flowchart illustrating an embodiment of step S52 in FIG. 4;
FIG. 6 is a block diagram of an embodiment of a training apparatus for image generation network model according to the present application;
FIG. 7 is a schematic diagram of a frame of an embodiment of a driving apparatus for human face pictures according to the present application;
FIG. 8 is a block diagram of an embodiment of an electronic device of the present application;
FIG. 9 is a block diagram of an embodiment of a computer-readable storage medium of the present application.
Detailed Description
The following describes in detail the embodiments of the present application with reference to the drawings attached hereto.
In the following description, for purposes of explanation and not limitation, specific details are set forth such as particular system structures, interfaces, techniques, etc. in order to provide a thorough understanding of the present application.
The terms "system" and "network" are often used interchangeably herein. The term "and/or" herein is merely an association describing an associated object, meaning that three relationships may exist, e.g., a and/or B, may mean: a exists alone, A and B exist simultaneously, and B exists alone. In addition, the character "/" herein generally indicates that the former and latter related objects are in an "or" relationship. Further, the term "plurality" herein means two or more than two.
Referring to fig. 1, fig. 1 is a schematic flowchart illustrating an embodiment of a training method for an image generation network model according to the present application. The image generation network model comprises an image segmentation network, a structured generation network and an unstructured generation network; specifically, the training method of the image generation network model in this embodiment may include the following steps:
step S11: acquiring a portrait image training set; wherein the training set of portrait images includes a first portrait image and a second portrait image of the same target.
Specifically, two different frames containing the portrait area of the target may be taken from a video of the same target and used as the first portrait image and the second portrait image. The first portrait image contains the face of the target with the source expression before migration, and the second portrait image also contains the face of the target, with the target expression after migration. By using the image generation network model, the expression of the target in the first portrait image can be switched from the source expression to the target expression of the second portrait image.
Step S12: and utilizing an image segmentation network to respectively segment the first portrait image and the second portrait image to obtain a first structured area image and a first unstructured area image corresponding to the first portrait image, and a second structured area image and a second unstructured area image corresponding to the second portrait image.
The image segmentation network may divide the image into several specific regions with unique properties and pick out objects of interest. For example, different regions have different gray levels and their boundaries usually show obvious edges, and the image can be segmented by exploiting this characteristic: in an edge-based segmentation method, edge detection finds the positions in the image where the gray level or structure changes abruptly, indicating the end of one region and the beginning of another. Thus, after the first portrait image is input into the image segmentation network, the first structured area image and the first unstructured area image corresponding to the first portrait image are output; after the second portrait image is input into the image segmentation network, the second structured area image and the second unstructured area image corresponding to the second portrait image are output.
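As a toy illustration of the edge-based idea described above (the segmentation network of this application is a learned model, not a fixed filter), the hedged sketch below computes a gradient-magnitude edge map with Sobel filters; PyTorch is assumed here and in the later sketches.

```python
import torch
import torch.nn.functional as F

def sobel_edges(gray):
    # gray: (B, 1, H, W) grayscale image in [0, 1].
    # Horizontal Sobel kernel; the vertical kernel is its transpose.
    kx = torch.tensor([[-1., 0., 1.],
                       [-2., 0., 2.],
                       [-1., 0., 1.]]).view(1, 1, 3, 3)
    ky = kx.transpose(2, 3)
    gx = F.conv2d(gray, kx.to(gray), padding=1)
    gy = F.conv2d(gray, ky.to(gray), padding=1)
    # Gradient magnitude: large values mark abrupt gray-level changes,
    # i.e. candidate region boundaries.
    return torch.sqrt(gx ** 2 + gy ** 2)
```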
Step S13: generating a preliminary background image using an unstructured formation network based on the first unstructured region image and the second unstructured region image.
The unstructured generation network is obtained through self-supervised training. Self-supervised learning mainly uses auxiliary tasks to mine supervisory signals from large-scale unlabeled data and trains the network with this constructed supervision, so that representations valuable to downstream tasks can be learned. In other words, the supervisory signal in self-supervised learning is not manually annotated; the algorithm automatically constructs it from large-scale unlabeled data and then performs supervised learning or training.
Referring to fig. 2, fig. 2 is a schematic flowchart illustrating an embodiment of step S13 in fig. 1. In an embodiment, the unstructured generation network comprises a resnet-tiny sub-network and an optical flow prediction sub-network; the step S13 may specifically include:
step S131: and respectively detecting the first unstructured regional image and the second unstructured regional image by using the resnet-tiny sub-network to obtain a hotspot graph of the first unstructured key point and a hotspot graph of the second unstructured key point.
The resnet-tiny sub-network detects the first unstructured area image and the second unstructured area image to obtain the heatmap of the first unstructured key points and the heatmap of the second unstructured key points. A heatmap marks and presents areas of an image or page according to how much attention they receive, typically through color depth, density of points or presentation weight. The heatmaps of the first and second unstructured key points represent the position information of the unstructured key points in the first portrait image and the second portrait image. Optionally, the unstructured key points cover areas other than the facial features of the human face, such as the hair area. The network structure of the resnet-tiny sub-network may be chosen according to the number of key points to be predicted; for example, when only 3 key points need to be predicted, a smaller network may be used, and the resnet-tiny sub-network may consist of 3 downsampling blocks and 3 upsampling blocks, each with 128 channels.
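A minimal sketch of such a small heatmap predictor is given below, assuming PyTorch and plain Conv-BN-ReLU blocks; only the 3-downsampling/3-upsampling structure with 128 channels comes from the description, and the block design and the spatial softmax normalization are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_block(in_c, out_c):
    # Plain Conv-BN-ReLU block standing in for the unspecified resnet-tiny blocks (assumption).
    return nn.Sequential(nn.Conv2d(in_c, out_c, 3, padding=1),
                         nn.BatchNorm2d(out_c),
                         nn.ReLU(inplace=True))

class TinyHeatmapNet(nn.Module):
    # 3 downsampling and 3 upsampling blocks with 128 channels each, followed by
    # a 1x1 head that predicts one heatmap per unstructured key point.
    def __init__(self, num_keypoints=3, channels=128):
        super().__init__()
        self.down = nn.ModuleList([conv_block(3, channels)] +
                                  [conv_block(channels, channels) for _ in range(2)])
        self.up = nn.ModuleList([conv_block(channels, channels) for _ in range(3)])
        self.head = nn.Conv2d(channels, num_keypoints, kernel_size=1)

    def forward(self, x):
        for blk in self.down:
            x = F.avg_pool2d(blk(x), 2)                                   # downsample by 2
        for blk in self.up:
            x = F.interpolate(blk(x), scale_factor=2,
                              mode="bilinear", align_corners=False)       # upsample by 2
        logits = self.head(x)
        b, k, h, w = logits.shape
        # Normalize each heatmap over spatial positions (an assumption, not stated in the text).
        return torch.softmax(logits.view(b, k, -1), dim=-1).view(b, k, h, w)
```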
Step S132: and inputting the hotspot graph of the first unstructured key point and the hotspot graph of the second unstructured key point into the optical flow prediction sub-network to generate optical flow data.
The optical flow prediction sub-network takes the heatmap of the first unstructured key points and the heatmap of the second unstructured key points as input and estimates the optical flow between the first unstructured key points of the first portrait image and the second unstructured key points of the second portrait image, yielding optical flow data from the first portrait image to the second portrait image. Specifically, the optical flow prediction sub-network may consist of an encoder and a decoder; it takes the heatmaps of 6 key points as input (3 heatmaps for the first unstructured key points and 3 heatmaps for the second unstructured key points) and finally outputs optical flow data with the shape (64, 64, 2).
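A minimal encoder-decoder sketch of this flow predictor follows; only the 6-heatmap input and the (64, 64, 2) output follow the description, while the layer widths and counts are assumptions.

```python
import torch
import torch.nn as nn

class FlowPredictor(nn.Module):
    # Maps 6 key-point heatmaps (3 source + 3 driving) to a dense flow field.
    def __init__(self, in_channels=6, base=64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(in_channels, base, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(base, base * 2, 3, stride=2, padding=1), nn.ReLU(inplace=True))
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(base * 2, base, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(base, 2, 4, stride=2, padding=1))   # 2 channels: (dx, dy)

    def forward(self, heatmaps_src, heatmaps_drv):
        x = torch.cat([heatmaps_src, heatmaps_drv], dim=1)   # (B, 6, 64, 64)
        flow = self.decoder(self.encoder(x))                  # (B, 2, 64, 64)
        return flow.permute(0, 2, 3, 1)                        # (B, 64, 64, 2)
```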
Step S133: and carrying out deformation processing on the first unstructured area image by using the optical flow data to obtain a deformed first unstructured area image serving as a preliminary background image.
Specifically, the optical flow data from the first portrait image to the second portrait image is mapped onto the first unstructured area image corresponding to the first portrait image, so that the resulting deformed first unstructured area image already incorporates the motion described by the optical flow data. When the expression of the second portrait image is subsequently transferred to the first portrait image, the generated prediction target image can therefore avoid problems such as distortion, abnormal texture and blurring, and the overall quality of the prediction target image is improved.
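The deformation (warping) step can be sketched with grid sampling as below; interpreting the predicted flow as pixel offsets added to an identity sampling grid is an assumption of this sketch, not a detail from the application.

```python
import torch
import torch.nn.functional as F

def warp_with_flow(image, flow):
    # image: (B, C, H, W); flow: (B, H, W, 2) pixel offsets (dx, dy) from the flow sub-network.
    b, _, h, w = image.shape
    # Identity sampling grid in normalized [-1, 1] coordinates, (x, y) order as grid_sample expects.
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, h), torch.linspace(-1, 1, w), indexing="ij")
    base_grid = torch.stack((xs, ys), dim=-1).unsqueeze(0).expand(b, -1, -1, -1).to(image)
    # Convert pixel offsets to normalized offsets and sample the source image at the shifted positions.
    norm_flow = torch.stack((flow[..., 0] * 2.0 / max(w - 1, 1),
                             flow[..., 1] * 2.0 / max(h - 1, 1)), dim=-1)
    return F.grid_sample(image, base_grid + norm_flow, align_corners=True)
```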
Further, the target loss function used to train the optical flow prediction sub-network includes a total variation loss function. Specifically, when training the optical flow prediction sub-network, a Total Variation loss (TV loss) may be applied to the optical flow data. During image fusion, even a little noise on the image may strongly affect the generated result, so regularization terms may be added to the optical flow prediction sub-network to keep the result smooth. TV loss is a commonly used regularization term; it helps to prevent disordered optical flow data from producing a disordered preliminary background image in the subsequent warping step.
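One common form of such a total variation regularizer, applied here to a flow field of shape (B, H, W, 2), is sketched below; the exact TV variant used in the application is not specified.

```python
def total_variation_loss(flow):
    # flow: (B, H, W, 2) optical flow field.
    # Penalize differences between neighbouring flow vectors so the field stays smooth.
    dh = (flow[:, 1:, :, :] - flow[:, :-1, :, :]).abs().mean()
    dw = (flow[:, :, 1:, :] - flow[:, :, :-1, :]).abs().mean()
    return dh + dw
```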
Further, the target loss function used to train the unstructured generation network includes at least one of a distance loss function and an equivariance constraint loss function.
Specifically, regarding the training of the unstructured generation network: at the initial stage of training, the network parameters are randomly initialized, which may cause the extracted key points to gather at the center; or, at a later stage of training, the output of the unstructured generation network may become the same as that of the structured generation network, which defeats the purpose of detecting unstructured key points. Thus, a distance loss function L_sep may be added, namely:
[Equation image in the original publication: distance loss function L_sep]
wherein σ is the standard deviation of the distances between the face key points of the images in the training set; this prior distance standard deviation is given to prevent the distance loss function L_sep from becoming too large or too small. By setting the distance loss function L_sep, each unstructured key point can be compared with all key points (including the face key points and the unstructured key points) so that no key points coincide, which prevents the unstructured key points detected by the unstructured generation network from gathering together or from falling on the face key points.
In addition, during the self-supervised training of the unstructured generation network, learning can easily become disordered, and some key points unrelated to the input image may be output. Therefore, an equivariance constraint loss function (equivariance constraint) can be added. Its central idea is that the key points detected from an affine-transformed picture, after the inverse affine transformation, should coincide with the key points detected from the original picture; by adding the equivariance constraint loss, a key point should still be located correctly when its coordinates are transformed into another image.
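The hedged sketch below shows one common way to implement such an equivariance constraint; the detector interface, the use of normalized coordinates and the L1 comparison are assumptions rather than details from the application.

```python
import torch
import torch.nn.functional as F

def equivariance_loss(detector, image, theta):
    # detector: returns key-point coordinates in normalized [-1, 1] space, shape (B, K, 2).
    # theta: random affine parameters of shape (B, 2, 3), as used by F.affine_grid.
    grid = F.affine_grid(theta, image.shape, align_corners=False)
    warped = F.grid_sample(image, grid, align_corners=False)

    kp_orig = detector(image)    # key points of the original image
    kp_warp = detector(warped)   # key points of the affine-transformed image

    # With affine_grid semantics, a point p in the warped image corresponds to theta * p
    # in the original, so mapping the warped key points through theta should recover
    # the key points of the original image.
    A, t = theta[:, :, :2], theta[:, :, 2:]
    kp_back = torch.bmm(kp_warp, A.transpose(1, 2)) + t.transpose(1, 2)
    return torch.mean(torch.abs(kp_orig - kp_back))
```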
Step S14: and generating a preliminary face image by using a structural generation network based on the first structured area image and the second structured area image.
The structured generation network is obtained through supervised training. Specifically, the structured generation network is a deep neural network; a large number of labeled face pictures are used to train it, a loss is computed between the predictions of the structured generation network and the real labels, and back-propagation is performed. Through continuous learning, the structured generation network finally acquires the ability to generate a preliminary face image based on the first structured area image and the second structured area image.
Referring to fig. 3, fig. 3 is a schematic flowchart illustrating an embodiment of step S14 in fig. 1. In an embodiment, the structured generation network comprises a face key point sub-network and an HRNet sub-network; the step S14 may specifically include:
step S141: and respectively acquiring a first face key point and a Jacobian matrix thereof corresponding to the first structured area image and a second face key point and a Jacobian matrix thereof corresponding to the second structured area image by using the face key point sub-network and the HRNet sub-network.
The face key point sub-network detects the first structured area image and the second structured area image to obtain the first face key points corresponding to the first structured area image and the second face key points corresponding to the second structured area image, and the first and second face key points are rasterized; rasterization converts geometric data, after a series of transformations, into pixels so that they can be displayed on a display device. In addition, the Jacobian matrix of the key points can be predicted by the HRNet sub-network; the Jacobian matrix provides direction information for the key points. Using the face key point sub-network and the HRNet sub-network, the first face key points Keypoint_src corresponding to the first structured area image and their Jacobian matrices Jaco_src, as well as the second face key points Keypoint_driver corresponding to the second structured area image and their Jacobian matrices Jaco_driver, can thus be obtained.
Step S142: and generating a face key point connecting line graph based on the first face key point and the Jacobian matrix thereof and the second face key point and the Jacobian matrix thereof.
Step S143: and generating the preliminary face image based on the first portrait image and the face key point connecting line graph.
In practical applications of the image generation network model, the face in the second portrait image may not match the first portrait image, and directly using the second face key points of the second portrait image could deform the generated preliminary face image. The Jacobian matrices are therefore used to adjust the motion information of the second portrait image before transferring it to the first face key points of the first portrait image, which reduces the deformation of the generated preliminary face image. The key points are then connected and drawn into a face key point connecting line graph; that is, the graph is generated based on the first face key points Keypoint_src and their Jacobian matrices Jaco_src and the second face key points Keypoint_driver and their Jacobian matrices Jaco_driver, with the formula:
kp_new = (Keypoint_src - Keypoint_driver)(Jaco_src · Jaco_driver^(-1)) + Keypoint_src.
The first portrait image and the face key point connecting line graph are then fed together into an encoder and a decoder to generate the preliminary face image.
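The relation above can be written directly in code; the sketch below assumes row-vector key points and per-key-point 2x2 Jacobian matrices, since the exact matrix-vector convention is not spelled out in the text.

```python
import torch

def transfer_keypoints(kp_src, kp_drv, jaco_src, jaco_drv):
    # kp_src, kp_drv: (B, K, 2) key-point coordinates.
    # jaco_src, jaco_drv: (B, K, 2, 2) per-key-point Jacobian matrices.
    j = torch.matmul(jaco_src, torch.inverse(jaco_drv))   # Jaco_src · Jaco_driver^(-1)
    delta = (kp_src - kp_drv).unsqueeze(-2)               # (B, K, 1, 2)
    moved = torch.matmul(delta, j).squeeze(-2)            # (Keypoint_src - Keypoint_driver) · J
    return moved + kp_src                                  # kp_new
```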
Step S15: and generating a prediction target image based on the preliminary face image and the preliminary background image.
After the preliminary face image and the preliminary background image are obtained, the preliminary face image and the preliminary background image can be fused, so that a prediction target image can be obtained.
In an embodiment, the image generation network model further comprises a residual generation network. In this case, after the face key point connecting line graph is generated, a residual image may be generated by the residual generation network based on the first portrait image and the face key point connecting line graph. The step S15 then specifically includes: fusing the preliminary face image, the preliminary background image and the residual image to obtain the prediction target image.
It can be understood that, because the image is divided into the face part of the structured area and the background part of the unstructured area for separate processing, some areas are missing during fusion; the residual image produced by the residual generation network can fill in these missing parts and adjust the fused boundary. Specifically, the first portrait image and the face key point connecting line graph are fed as input into an encoder and a decoder to generate the residual image; the preliminary face image, the preliminary background image and the residual image are then fused to obtain the final prediction target image.
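The exact fusion operator is not spelled out in the text; one simple possibility, sketched below, is to composite the face over the background with a segmentation mask and then add the residual image to patch missing and boundary regions.

```python
def fuse_outputs(face, background, residual, face_mask):
    # face, background, residual: (B, 3, H, W) tensors; face_mask: (B, 1, H, W) in [0, 1].
    # Composite the structured (face) branch over the unstructured (background) branch,
    # then add the residual image to fill missing regions and soften the boundary.
    composite = face * face_mask + background * (1.0 - face_mask)
    return composite + residual
```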
Further, the image generation network model further includes a discriminator network, and after the step S15, the method further includes: optimizing the prediction target image with the discriminator network so that the discriminator network cannot distinguish between the prediction target image and the second portrait image.
The discriminator network is used to judge whether its input is the real second portrait image or a prediction target image generated by the image generation network model. Specifically, in the training stage, the whole image generation network model takes the real second portrait image as its target and generates a prediction target image similar to it, while the discriminator network judges whether the generated prediction target image is real by comparing it with the real second portrait image. The generator and the discriminator play a game against each other and finally reach an equilibrium, in which the image generation network model can produce a realistic prediction target image and the discriminator network can no longer tell real from generated.
Step S16: obtaining a loss function between the second portrait image and the prediction target image.
Step S17: adjusting network parameters in the image-generating network model based on the loss function.
The loss function measures the deviation of the model's prediction from the ground truth, and the objective is usually to minimize it. Specifically, after the loss function is obtained, whether the current loss function has converged can be judged; if so, training of the image generation network model is stopped and the trained image generation network model is obtained; if not, the network parameters in the image generation network model are adjusted with the back-propagation algorithm, and training continues until the loss function converges.
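A bare-bones version of this training loop is sketched below; the Adam optimizer, learning rate and convergence test are placeholder assumptions, and model(first_img, second_img) stands in for the whole segmentation and generation pipeline described above.

```python
import torch

def train(model, loader, loss_fn, lr=1e-4, tol=1e-4, max_epochs=100):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    prev_total = float("inf")
    for epoch in range(max_epochs):
        total = 0.0
        for first_img, second_img in loader:
            pred = model(first_img, second_img)   # prediction target image
            loss = loss_fn(pred, second_img)      # loss against the second portrait image
            optimizer.zero_grad()
            loss.backward()                        # back-propagation
            optimizer.step()                       # adjust network parameters
            total += loss.item()
        if abs(prev_total - total) < tol:          # simple convergence check
            break
        prev_total = total
    return model
```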
In particular, the loss function may include an L1 loss function, a perceptual loss function, and a style loss function.
The L1 loss function (L1 Loss), also called Mean Absolute Error (MAE), measures the average magnitude of the error between the predicted value and the true value, with a range from 0 to positive infinity; the perceptual loss function (Perceptual loss) measures the mean square error of the distance between the predicted value and the true value. By minimizing the L1 loss function and the perceptual loss function during training, the prediction target image output by the image generation network model can achieve a better driving effect.
The style loss function (Style loss) is the key to whether the style of one image can be accurately and effectively transferred to another, and is generally determined by the difference between the style features of the images. It can be understood that, because the camera capturing the video may move, the background regions of the first portrait image and the second portrait image may change in addition to the portrait region. Such background changes are hard for the image generation network model to learn: the background region is too complex to be represented by key points, so the model would not know how the background should be generated. A style loss function is therefore added; its optimization direction is that, when the background area changes, the generated prediction target image should remain consistent with the style of the first portrait image.
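The three terms can be combined as sketched below; features is assumed to be a frozen feature extractor (for example a pretrained VGG) returning a list of feature maps, and the loss weights are illustrative rather than values from the application.

```python
import torch
import torch.nn.functional as F

def gram(feat):
    # Gram matrix of a feature map (B, C, H, W), used for the style loss.
    b, c, h, w = feat.shape
    f = feat.view(b, c, h * w)
    return torch.bmm(f, f.transpose(1, 2)) / (c * h * w)

def total_loss(pred, target, source, features, w_l1=1.0, w_perc=1.0, w_style=1.0):
    # pred: prediction target image, target: second portrait image, source: first portrait image.
    l1 = F.l1_loss(pred, target)
    # Perceptual loss: distance between feature maps of the prediction and the target.
    perc = sum(F.mse_loss(fp, ft) for fp, ft in zip(features(pred), features(target)))
    # Style loss: pull the generated image toward the style of the first portrait image.
    style = sum(F.mse_loss(gram(fp), gram(fs)) for fp, fs in zip(features(pred), features(source)))
    return w_l1 * l1 + w_perc * perc + w_style * style
```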
In this scheme, the first portrait image and the second portrait image are segmented by the image segmentation network to obtain the first structured area image and the first unstructured area image corresponding to the first portrait image, and the second structured area image and the second unstructured area image corresponding to the second portrait image. The preliminary face image is then generated by the structured generation network, the preliminary background image is generated by the unstructured generation network, and the prediction target image is generated from them. Because the face part and the background part are processed separately, each branch learns a better motion representation without interfering with the other; the driving effect in the face key point region can be controlled accurately, the areas outside the face key points are also covered, and a better driving effect is obtained in the unstructured key point areas.
Referring to fig. 4, fig. 4 is a schematic flowchart illustrating an embodiment of a method for driving a face picture according to the present application. Specifically, the driving method for a face picture in the embodiment may include the following steps:
step S41: a source map and a driver map are acquired.
Step S42: inputting the source map and the driver map into an image generation network model to obtain a target image.
The image generation network model is obtained by training through the training method of the image generation network model.
Specifically, the face in the source map and the face in the driving map may have the same identity, that is, they are faces of the same person; alternatively, they may have different identities, that is, they are not faces of the same person. The image generation network model of the application can output accurate target images in both cases.
The source map contains the face of a target object whose expression is the source expression before migration, and the driving map contains the face of a driving object whose expression is the target expression after migration; the target object and the driving object may be the same or different. The target image is generated with the image generation network model, and the expression of the target object in the source map is switched from the source expression to the target expression of the driving map. It can be understood that the face in the target image generated by the image generation network model is still the face of the target object; only the facial expression of the target object is switched from the source expression to the target expression of the driving map. For example, if the facial expression of the target object in the source map is a crying expression and the facial expression in the driving map is a laughing expression, the target image generated by the image generation network model is an image of the target object with a laughing expression.
Further, the image generation network model comprises an image segmentation network, a structured generation network and an unstructured generation network, wherein the unstructured generation network comprises a resnet-tiny sub-network and an optical flow prediction sub-network, and the structured generation network comprises a face key point sub-network and an HRNet sub-network. Referring to fig. 5, fig. 5 is a flowchart illustrating an embodiment of step S42 in fig. 4. In an embodiment, the step S42 may specifically include:
step S421: and utilizing an image segmentation network to respectively segment the source map and the driving map to obtain a first structured area image and a first unstructured area image corresponding to the source map, and a second structured area image and a second unstructured area image corresponding to the driving map.
Step S422: and respectively detecting the first unstructured regional image and the second unstructured regional image by using the resnet-tiny sub-network to obtain a hotspot graph of the first unstructured key point and a hotspot graph of the second unstructured key point.
Step S423: and inputting the hotspot graph of the first unstructured key point and the hotspot graph of the second unstructured key point into the optical flow prediction sub-network to generate optical flow data.
Step S424: and carrying out deformation processing on the first unstructured area image by using the optical flow data to obtain a deformed first unstructured area image serving as a preliminary background image.
Step S425: and respectively acquiring a first face key point and a Jacobian matrix thereof corresponding to the first structured area image and a second face key point and a Jacobian matrix thereof corresponding to the second structured area image by using the face key point sub-network and the HRNet sub-network.
Step S426: and generating a face key point connecting line graph based on the first face key point and the Jacobian matrix thereof and the second face key point and the Jacobian matrix thereof.
Step S427: and generating the preliminary face image based on the source image and the face key point connecting line image.
Step S428: and generating a residual image by using the residual generation network based on the source image and the face key point connecting line image.
Step S429: and fusing the preliminary face image, the preliminary background image and the residual image to obtain the prediction target image.
With this driving method for a face picture, the source map and the driving map are processed so that, beyond the face key points learned by the structured generation network, more unstructured key points are learned by the unstructured generation network. The generated target image is produced by handling the face part and the background part separately, which helps each branch learn a better motion representation without mutual interference; the driving effect in the face key point region can be controlled accurately, the areas outside the face key points are also covered, and a better driving effect is obtained in the unstructured key point areas.
Referring to fig. 6, fig. 6 is a schematic diagram of a framework of an embodiment of a training apparatus for an image generation network model according to the present application. The image generation network model comprises an image segmentation network, a structured generation network and an unstructured generation network. The training device 60 for the image generation network model includes: a sample obtaining module 600, where the sample obtaining module 600 is used for obtaining a portrait image training set, and the training set of portrait images includes a first portrait image and a second portrait image of the same target; an image prediction module 602, where the image prediction module 602 is configured to separately segment the first portrait image and the second portrait image by using the image segmentation network to obtain a first structured area image and a first unstructured area image corresponding to the first portrait image, and a second structured area image and a second unstructured area image corresponding to the second portrait image, generate a preliminary background image with the unstructured generation network based on the first unstructured area image and the second unstructured area image, generate a preliminary face image with the structured generation network based on the first structured area image and the second structured area image, and generate a prediction target image based on the preliminary face image and the preliminary background image; a loss function determination module 604, where the loss function determination module 604 is configured to obtain a loss function between the second portrait image and the prediction target image; and a parameter adjustment module 606, where the parameter adjustment module 606 is configured to adjust network parameters in the image generation network model based on the loss function.
In some embodiments, the unstructured generation network comprises a resnet-tiny sub-network and an optical flow prediction sub-network, and the image prediction module 602 performs the step of generating a preliminary background image using the unstructured generation network based on the first unstructured region image and the second unstructured region image by: respectively detecting the first unstructured region image and the second unstructured region image by using the resnet-tiny sub-network to obtain a heatmap of the first unstructured key points and a heatmap of the second unstructured key points; inputting the heatmap of the first unstructured key points and the heatmap of the second unstructured key points into the optical flow prediction sub-network to generate optical flow data; and carrying out deformation processing on the first unstructured area image by using the optical flow data to obtain a deformed first unstructured area image serving as the preliminary background image.
In some embodiments, the structured generation network comprises a face key point sub-network and an HRNet sub-network; the image prediction module 602 performs the step of generating a preliminary face image using the structured generation network based on the first structured region image and the second structured region image by: respectively utilizing the face key point sub-network and the HRNet sub-network to obtain a first face key point and a Jacobian matrix thereof corresponding to the first structured area image and a second face key point and a Jacobian matrix thereof corresponding to the second structured area image; generating a face key point connecting line graph based on the first face key point and the Jacobian matrix thereof, and the second face key point and the Jacobian matrix thereof; and generating the preliminary face image based on the first portrait image and the face key point connecting line graph.
In some embodiments, the image generation network model further comprises a residual generation network; before performing the step of generating a prediction target image based on the preliminary face image and the preliminary background image, the image prediction module 602 is further configured to generate a residual image using the residual generation network based on the first portrait image and the face key point connecting line graph. In this case, the step of generating the prediction target image based on the preliminary face image and the preliminary background image performed by the image prediction module 602 specifically includes: fusing the preliminary face image, the preliminary background image and the residual image to obtain the prediction target image.
In some embodiments, the image generation network model further comprises a discriminator network; after the image prediction module 602 performs the step of fusing the preliminary face image, the preliminary background image and the residual image to obtain the prediction target image, the image prediction module 602 is further configured to optimize the prediction target image by using the discriminator network, so that the discriminator network cannot distinguish the prediction target image from the second portrait image.
Referring to fig. 7, fig. 7 is a schematic frame diagram of an embodiment of a driving device for a human face picture according to the present application. The driving device 70 for a face picture includes: an image acquisition module 700, where the image acquisition module 700 is used for acquiring a source map and a driving map; and an image generation module 702, where the image generation module 702 is configured to input the source map and the driving map into an image generation network model to obtain a target image.
The image generation network model is obtained by training through the training method of the image generation network model.
Referring to fig. 8, fig. 8 is a schematic frame diagram of an embodiment of an electronic device according to the present application. The electronic device 80 includes a memory 801 and a processor 802 coupled to each other, and the processor 802 is configured to execute program instructions stored in the memory 801 to implement any one of the above-mentioned training methods for an image generation network model, or the steps of the driving method embodiment of a face picture. In one particular implementation scenario, the electronic device 80 may include, but is not limited to: microcomputer, server.
Specifically, the processor 802 is configured to control itself and the memory 801 to implement the steps of any of the above embodiments of the training method for an image generation network model or of the driving method for a face picture. The processor 802 may also be referred to as a CPU (Central Processing Unit). The processor 802 may be an integrated circuit chip having signal processing capabilities. The Processor 802 may also be a general purpose Processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other Programmable logic device, discrete Gate or transistor logic, or discrete hardware components. A general purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. In addition, the processor 802 may be jointly implemented by a plurality of integrated circuit chips.
Referring to fig. 9, fig. 9 is a block diagram illustrating an embodiment of a computer-readable storage medium according to the present application. The computer-readable storage medium 90 stores program instructions 900 capable of being executed by the processor, where the program instructions 900 are used to implement any one of the above-mentioned training methods for an image generation network model, or the steps of the driving method embodiment of a face picture.
In the several embodiments provided in the present application, it should be understood that the disclosed method and apparatus may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, a division of a module or a unit is merely one type of logical division, and an actual implementation may have another division, for example, a unit or a component may be combined or integrated with another system, or some features may be omitted, or not implemented. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of devices or units through some interfaces, and may be in an electrical, mechanical or other form.
Units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on network elements. Some or all of the units can be selected according to actual needs to achieve the purpose of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application may be substantially implemented or contributed to by the prior art, or all or part of the technical solution may be embodied in a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) or a processor (processor) to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.

Claims (13)

1. A training method for an image generation network model, wherein the image generation network model comprises an image segmentation network, a structured generation network and an unstructured generation network;
the training method of the image generation network model comprises the following steps:
acquiring a portrait image training set; wherein the training set of portrait images includes a first portrait image and a second portrait image of the same target;
utilizing an image segmentation network to respectively segment the first portrait image and the second portrait image to obtain a first structured area image and a first unstructured area image corresponding to the first portrait image, and a second structured area image and a second unstructured area image corresponding to the second portrait image;
generating a preliminary background image with the unstructured generation network based on the first unstructured region image and the second unstructured region image;
generating a preliminary face image with the structured generation network based on the first structured region image and the second structured region image;
generating a prediction target image based on the preliminary face image and the preliminary background image;
obtaining a loss function between the second portrait image and the prediction target image;
adjusting network parameters in the image-generating network model based on the loss function.
2. The method of claim 1, wherein the unstructured generation network is trained in a self-supervised manner, and the target loss function of the unstructured generation network training comprises at least one of a distance loss function and an equivariance constraint loss function.
3. The method of claim 1, wherein the unstructured generation network comprises a resnet-tiny sub-network and an optical flow prediction sub-network;
the generating a preliminary background image with the unstructured generation network based on the first unstructured region image and the second unstructured region image comprises:
respectively detecting the first unstructured region image and the second unstructured region image by using the resnet-tiny sub-network to obtain a heatmap of the first unstructured key points and a heatmap of the second unstructured key points;
inputting the heatmap of the first unstructured key points and the heatmap of the second unstructured key points into the optical flow prediction sub-network to generate optical flow data;
and carrying out deformation processing on the first unstructured area image by using the optical flow data to obtain a deformed first unstructured area image serving as a preliminary background image.
4. The method of claim 3, wherein the objective loss function of the optical flow prediction subnetwork training comprises a total variation loss function.
5. A method for training an image generation network model according to claim 1, wherein the structured generation network comprises a face key point sub-network and an HRNet sub-network;
the generating a preliminary face image using the structured generation network based on the first structured region image and the second structured region image comprises:
respectively utilizing the face key point sub-network and the HRNet sub-network to obtain a first face key point and a Jacobian matrix thereof corresponding to the first structured area image and a second face key point and a Jacobian matrix thereof corresponding to the second structured area image;
generating a face key point connecting line graph based on the first face key point and a Jacobian matrix thereof, and the second face key point and a Jacobian matrix thereof;
and generating the preliminary face image based on the first portrait image and the face key point connecting line graph.
6. The method of claim 5, wherein the image generation network model further comprises a residual generation network;
before the generating of the prediction target image based on the preliminary face image and the preliminary background image, the method further includes:
generating a residual image by using the residual generation network based on the first portrait image and the face key point connecting line graph;
generating a prediction target image based on the preliminary face image and the preliminary background image, including:
and fusing the preliminary face image, the preliminary background image and the residual image to obtain the prediction target image.
7. The method of training an image generation network model according to claim 6, wherein the image generation network model further comprises a discriminator network;
after the fusing the preliminary face image, the preliminary background image, and the residual image to obtain the prediction target image, the method further includes:
optimizing the prediction target image with the discriminator network so that the discriminator network cannot distinguish between the prediction target image and the second portrait image.
8. The method of training an image generation network model according to any one of claims 1 to 7, wherein the loss function includes an L1 loss function, a perceptual loss function, and a style loss function.
9. A method for driving a face picture, the method comprising:
acquiring a source map and a driving map;
inputting the source map and the driving map into an image generation network model to obtain a target image;
wherein the image generation network model is trained by the training method of the image generation network model according to any one of claims 1 to 8.
10. The training device for the image generation network model is characterized in that the image generation network model comprises an image segmentation network, a structured generation network and an unstructured generation network; the training device for the image generation network model comprises:
the system comprises a sample acquisition module, a storage module and a display module, wherein the sample acquisition module is used for acquiring a portrait image training set; wherein the training set of portrait images includes a first portrait image and a second portrait image of the same target;
an image prediction module configured to segment the first portrait image and the second portrait image respectively with the image segmentation network to obtain a first structured region image and a first unstructured region image corresponding to the first portrait image, and a second structured region image and a second unstructured region image corresponding to the second portrait image; generate a preliminary background image with the unstructured generation network based on the first unstructured region image and the second unstructured region image; generate a preliminary face image with the structured generation network based on the first structured region image and the second structured region image; and generate a prediction target image based on the preliminary face image and the preliminary background image;
a loss function determination module configured to obtain a loss function between the second portrait image and the prediction target image;
a parameter adjustment module configured to adjust network parameters in the image generation network model based on the loss function.
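A hedged training-loop skeleton showing how the modules of claim 10 could interact; every network object and the plain additive fusion below are placeholders, not the patent's code (claim 6 adds a residual image to the fusion, omitted here for brevity):

```python
import torch

def train_step(seg_net, structured_gen, unstructured_gen,
               first_img, second_img, loss_fn, optimizer):
    """One illustrative optimisation step on a (first, second) portrait pair."""
    # Image prediction: split both portraits into structured / unstructured regions.
    s1, u1 = seg_net(first_img)
    s2, u2 = seg_net(second_img)

    background = unstructured_gen(u1, u2)   # preliminary background image
    face = structured_gen(s1, s2)           # preliminary face image
    prediction = face + background          # placeholder fusion

    # Loss determination and parameter adjustment.
    loss = loss_fn(prediction, second_img)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```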
11. A driving device for a face picture, comprising:
an image acquisition module configured to acquire a source image and a driving image;
an image generation module configured to input the source image and the driving image into an image generation network model to obtain a target image;
wherein the image generation network model is trained by the training method of the image generation network model according to any one of claims 1 to 8.
12. An electronic device, comprising a memory and a processor coupled to each other, wherein the processor is configured to execute program instructions stored in the memory to implement the method for training an image generation network model according to any one of claims 1 to 8, or the method for driving a face picture according to claim 9.
13. A computer-readable storage medium having stored thereon program instructions, which when executed by a processor, implement the method for training an image generation network model according to any one of claims 1 to 8, or the method for driving a face picture according to claim 9.
CN202110845959.5A 2021-07-26 2021-07-26 Driving method of face picture, training method of related model and related device Pending CN113657190A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110845959.5A CN113657190A (en) 2021-07-26 2021-07-26 Driving method of face picture, training method of related model and related device

Publications (1)

Publication Number Publication Date
CN113657190A true CN113657190A (en) 2021-11-16

Family

ID=78478714

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110845959.5A Pending CN113657190A (en) 2021-07-26 2021-07-26 Driving method of face picture, training method of related model and related device

Country Status (1)

Country Link
CN (1) CN113657190A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114092712A (en) * 2021-11-29 2022-02-25 北京字节跳动网络技术有限公司 Image generation method and device, readable medium and electronic equipment

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190295302A1 (en) * 2018-03-22 2019-09-26 Northeastern University Segmentation Guided Image Generation With Adversarial Networks
WO2020216033A1 (en) * 2019-04-26 2020-10-29 腾讯科技(深圳)有限公司 Data processing method and device for facial image generation, and medium
CN110580680A (en) * 2019-09-09 2019-12-17 武汉工程大学 face super-resolution method and device based on combined learning
CN111275778A (en) * 2020-01-08 2020-06-12 浙江省北大信息技术高等研究院 Face sketch generating method and device
CN111652121A (en) * 2020-06-01 2020-09-11 腾讯科技(深圳)有限公司 Training method of expression migration model, and expression migration method and device

Similar Documents

Publication Publication Date Title
US11501118B2 (en) Digital model repair system and method
JP7476428B2 (en) Image line of sight correction method, device, electronic device, computer-readable storage medium, and computer program
CN110827193B (en) Panoramic video significance detection method based on multichannel characteristics
Cao et al. Semi-automatic 2D-to-3D conversion using disparity propagation
US20180192026A1 (en) Method and System for Real-Time Rendering Displaying High Resolution Virtual Reality (VR) Video
CN111667400B (en) Human face contour feature stylization generation method based on unsupervised learning
JP7562927B2 (en) Facial image processing method, device, and computer program
WO2014187223A1 (en) Method and apparatus for identifying facial features
KR100560464B1 (en) Multi-view display system with viewpoint adaptation
CN108605119B (en) 2D to 3D video frame conversion
CN105894443A (en) Method for splicing videos in real time based on SURF (Speeded UP Robust Features) algorithm
Zhou et al. Projection invariant feature and visual saliency-based stereoscopic omnidirectional image quality assessment
Ye et al. Real3d-portrait: One-shot realistic 3d talking portrait synthesis
CN115631121A (en) Panoramic image saliency prediction method based on self-supervision learning
Song et al. Weakly-supervised stitching network for real-world panoramic image generation
CN113657190A (en) Driving method of face picture, training method of related model and related device
Duan et al. BoostGAN for occlusive profile face frontalization and recognition
WO2024032331A1 (en) Image processing method and apparatus, electronic device, and storage medium
CN116681579A (en) Real-time video face replacement method, medium and system
Seitner et al. Trifocal system for high-quality inter-camera mapping and virtual view synthesis
Calagari et al. Data driven 2-D-to-3-D video conversion for soccer
TWI790560B (en) Side by side image detection method and electronic apparatus using the same
Calagari et al. Gradient-based 2D-to-3D conversion for soccer videos
Szymanowicz et al. Photo-Realistic 360∘ Head Avatars in the Wild
Kim et al. Global texture mapping for dynamic objects

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination