CN111783647A - Training method of face fusion model, face fusion method, device and equipment


Info

Publication number
CN111783647A
CN111783647A (application CN202010615462.XA)
Authority
CN
China
Prior art keywords
face
loss function
image
face fusion
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010615462.XA
Other languages
Chinese (zh)
Other versions
CN111783647B (en)
Inventor
薛洁婷
余席宇
洪智滨
韩钧宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202010615462.XA
Publication of CN111783647A
Application granted
Publication of CN111783647B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168 Feature extraction; Face representation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Biomedical Technology (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Human Computer Interaction (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Image Processing (AREA)
  • Image Analysis (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The application discloses a training method for a face fusion model, a face fusion method, a face fusion apparatus and face fusion equipment, and relates to the field of deep learning. The specific implementation scheme is as follows: acquire a sample image, where the sample image comprises a user sample image and a bottom plate (template) sample image; and train a generative adversarial network according to the user sample image and the bottom plate sample image to obtain a face fusion model, where the face fusion model is used to replace the face in the bottom plate image with the face in the user image. Because the face fusion model is obtained by training a generative adversarial network, deep semantic feature information of the images can be extracted and face fusion can be performed based on this deep semantic feature information, so that a good face fusion effect is obtained.

Description

Training method of face fusion model, face fusion method, device and equipment
Technical Field
The embodiment of the application relates to the field of deep learning in image processing, in particular to a training method of a face fusion model, a face fusion method, a face fusion device and face fusion equipment.
Background
Face fusion aims to seamlessly transplant a target face from a source image (a user image) onto a bottom plate (template) image to generate a face fusion result; the fused image needs to remain semantically reasonable and consistent both globally and locally.
However, factors such as large differences in illumination conditions and individual skin color between the face region of the bottom plate image and the face region of the source image often lead to a poor face fusion effect.
Disclosure of Invention
The application provides a training method for a face fusion model, a face fusion method, a face fusion apparatus and face fusion equipment for improving the face fusion effect.
According to a first aspect of the present application, there is provided a training method for a face fusion model, including: acquiring a sample image, where the sample image comprises a user sample image and a bottom plate sample image; and training a generative adversarial network according to the user sample image and the bottom plate sample image to obtain a face fusion model, where the face fusion model is used to replace the face in the bottom plate image with the face in the user image.
According to a second aspect of the present application, there is provided a face fusion method, including: acquiring a user image and a bottom plate image; and inputting the user image and the bottom plate image into a face fusion model obtained by the training method of the first aspect to obtain a face fusion result, wherein the face fusion model is used for replacing the face in the bottom plate image with the user image.
According to a third aspect of the present application, there is provided a training apparatus for a face fusion model, comprising: a first acquisition module, configured to acquire a sample image, where the sample image comprises a user sample image and a bottom plate sample image; and a training module, configured to train a generative adversarial network according to the user sample image and the bottom plate sample image to obtain a face fusion model, where the face fusion model is used to replace the face in the bottom plate image with the face in the user image.
According to a fourth aspect of the present application, there is provided a face fusion apparatus, comprising: the second acquisition module is used for acquiring a user image and a bottom plate image; an input module, configured to input the user image and the bottom plate image into a face fusion model obtained by using the training method of the first aspect, so as to obtain a face fusion result, where the face fusion model is used to replace a face in the bottom plate image with the user image.
According to a fifth aspect of the present application, there is provided an electronic device comprising: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of the first and second aspects.
According to a sixth aspect of the present application, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of the first and second aspects.
The technology solves the problem of poor face fusion effect.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present application, nor do they limit the scope of the present application. Other features of the present application will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not intended to limit the present application. Wherein:
fig. 1A is a schematic view of face fusion according to an embodiment of the present application;
FIG. 1B is a schematic view of face fusion according to another embodiment of the present application;
FIG. 2 is a diagram of a hardware architecture according to an embodiment of the present application;
FIG. 3 is a flowchart of a training method of a face fusion model according to an embodiment of the present application;
FIG. 4 is a block diagram of a generative confrontation network and a face recognition network according to an embodiment of the present application;
FIG. 5 is a schematic diagram of a generator generating a first face fusion result according to an embodiment of the application;
FIG. 6 is a flowchart of a face fusion method according to an embodiment of the present application;
fig. 7 is a schematic structural diagram of a training apparatus for a face fusion model according to an embodiment of the present application;
fig. 8 is a schematic structural diagram of a face fusion device according to an embodiment of the present application;
fig. 9 is a block diagram of an electronic device for implementing a face fusion model training method and/or a face fusion method according to an embodiment of the present application.
Detailed Description
The following description of the exemplary embodiments of the present application, taken in conjunction with the accompanying drawings, includes various details of the embodiments of the application for the understanding of the same, which are to be considered exemplary only. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present application. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
Face fusion technology is widely used in entertainment applications that support facial special effects, as well as in film special-effect production and promotional scenarios. It can quickly and accurately locate facial key points and fuse, at the face level, a photo uploaded by a user (the user image) with a specific character image (the bottom plate image), so that the generated picture contains the user's facial features while presenting the appearance characteristics of the specific character.
Fig. 1A and 1B are schematic diagrams of face fusion provided in embodiments of the present application. As shown in Fig. 1A, given a user face image and a specific character image, face fusion is implemented by replacing face region B in the specific character image with face region A. As shown in Fig. 1B, given a user face image A and a user face image B, face fusion is realized by exchanging the faces of user face image A and user face image B.
Of course, the face fusion process shown in fig. 1A and 1B is only schematically illustrated, and the face fusion of the present application is not limited, and all the techniques related to face fusion are within the scope of the present application.
In the face fusion process, the quality of the fusion effect is very important for the application and popularization of face fusion products. At present there are many face fusion techniques: for example, image fusion can be implemented by directly operating on the original image pixels; or each image can be sparsely coded to obtain its sparse coefficients and the fusion performed on those coefficients; or the images can first undergo a multi-scale transformation and then be fused. Although these techniques can achieve a certain fusion effect, the first method computes directly on pixels, which leads to an unsatisfactory fusion result; the second method requires block-wise computation of the images, which loses a large amount of detail information; and multi-scale transformation requires a transformation tool and fusion rules, and if the tool or the rules are not reasonable, a large amount of image information is lost.
In conclusion, the loss of image information causes existing face fusion techniques to fall short in terms of face edges, face pose and face naturalness, so problems such as false edges and abnormal colors often appear after fusion, and the face fusion effect is poor. For example, a contour line may appear where none should be, or fail to appear where one should be; and the color of the fused face region may be abnormal, making the face fusion result look unnatural.
In the prior art, images are processed at a shallow feature level, so the images are only shallowly understood and the face fusion effect is poor. In the present application, a deep learning model performs deep semantic understanding of the images and extracts richer image information, so that the user face image and the specific character image are fused better during face fusion.
In the process of applying the deep learning model to face fusion, a deep learning model needs to be trained first, a hardware architecture diagram used in the training process is shown in fig. 2, and the hardware architecture includes: an image pickup device 21 and an image processing device 22; the image capturing device 21 includes a camera or other device having a camera function, and the image processing device 22 includes a desktop computer, a notebook computer, an Ipad, a smart phone, a server or other device having an image processing function.
During the training of the deep learning model, the image acquisition device 21 is configured to acquire a sample image, where the sample image exists in the form of an image pair, that is, the sample image includes an image pair composed of a user sample image and a floor sample image. The image processing device 22 is configured to train the neural network according to the sample image, and obtain a deep learning model, where the deep learning model is used to replace a human face in the floor image with a user image.
In some scenarios where a deep learning model is applied for face fusion, the trained deep learning model may be stored in the image processing device 22. In a specific application, the image acquisition device 21 acquires a user image and uploads it to the image processing device 22; the image processing device 22 receives the user image uploaded by the user and a bottom plate image selected by the user, and replaces the face in the bottom plate image with the face in the user image through the deep learning model, thereby realizing face fusion. The bottom plate image may be stored in the image processing device 22 in advance, or may be stored in another device, for example a server.
It should be noted that the image capturing device 21 and the image processing device 22 described above may be two independent devices, or may be integrated on the same electronic device, and if the image capturing device 21 and the image processing device 22 are integrated on the same electronic device, the electronic device is a device having an image capturing function and an image processing function, such as a smart phone, an Ipad, and the like.
The application provides a training method of a face fusion model, a face fusion method, a face fusion device and face fusion equipment, which are applied to the field of deep learning in the field of image processing so as to achieve the purpose of improving the face fusion effect.
The following describes the technical solutions of the present application and how to solve the above technical problems with specific embodiments. The following several specific embodiments may be combined with each other, and details of the same or similar concepts or processes may not be repeated in some embodiments. Embodiments of the present application will be described below with reference to the accompanying drawings.
Fig. 3 is a flowchart of a training method of a face fusion model according to an embodiment of the present application. The embodiment of the present application provides a training method for a face fusion model, which aims at the above technical problems in the prior art, and as shown in fig. 3, the method specifically includes the following steps:
step 301, obtaining a sample image.
The sample image comprises a user sample image and a bottom plate sample image. The user sample image is a photo uploaded by a user and contains the face region to be transferred; the bottom plate sample image is an image of a specific character, for example an image with a specific expression or specific ornaments, such as a cartoon character image, a military uniform photo, a traditional Chinese clothing photo, or an image with a specific headdress.
The sample image in this embodiment may be an image in a public training data set, or may be an image acquired by an image acquisition device, which is not specifically limited in this embodiment.
The sample images in this embodiment are image pairs, each including a user sample image and a bottom plate sample image. The large number of sample images used in training may be multiple image pairs formed by one user sample image and multiple bottom plate sample images, multiple image pairs formed by multiple user sample images and one bottom plate sample image, or multiple image pairs formed by multiple user sample images and multiple bottom plate sample images. In the actual training process, the image pairs may take any one of the above forms alone, or a combination of any two or all three of them; this embodiment does not specifically limit this.
It should be noted that, regardless of how the pairs are formed, each image pair includes one user sample image and one bottom plate sample image, as in the dataset sketch below.
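As an illustration of how such pairs might be organized in practice, the following sketch builds the Cartesian pairing of user images and bottom plate images mentioned above. The folder layout, file extension and the FaceFusionPairDataset class are assumptions made here for the example, not part of the disclosure.

```python
# A minimal sketch of a paired dataset, assuming user images and bottom plate
# (template) images are stored as two folders of RGB images.
import itertools
from pathlib import Path

from PIL import Image
from torch.utils.data import Dataset
import torchvision.transforms as T


class FaceFusionPairDataset(Dataset):
    """Yields (user_sample_image, bottom_plate_sample_image) pairs."""

    def __init__(self, user_dir: str, plate_dir: str, size: int = 256):
        user_paths = sorted(Path(user_dir).glob("*.jpg"))
        plate_paths = sorted(Path(plate_dir).glob("*.jpg"))
        # Cartesian pairing: every user image with every bottom plate image,
        # one of the pairing schemes mentioned in the text.
        self.pairs = list(itertools.product(user_paths, plate_paths))
        self.tf = T.Compose([T.Resize((size, size)), T.ToTensor()])

    def __len__(self) -> int:
        return len(self.pairs)

    def __getitem__(self, idx: int):
        user_path, plate_path = self.pairs[idx]
        user = self.tf(Image.open(user_path).convert("RGB"))
        plate = self.tf(Image.open(plate_path).convert("RGB"))
        return user, plate
```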
Step 302, training a generative adversarial network according to the user sample image and the bottom plate sample image to obtain a face fusion model.
The face fusion model obtained in this step can replace the face in the bottom plate sample image with the face in the user image.
For the structure and principle of the generative adversarial network (GAN), reference may be made to existing introductions, which are not repeated here.
In the embodiment of the application, a sample image including a user sample image and a bottom plate sample image is acquired, and a generative adversarial network is trained according to the user sample image and the bottom plate sample image to obtain a face fusion model; through the face fusion model, the face in the bottom plate image can be replaced with the face in the user image. Because the face fusion model is obtained by training a generative adversarial network, deep semantic feature information of the images can be extracted, and face fusion is performed based on this deep semantic feature information, so that a good face fusion effect is obtained.
As shown in fig. 4, the generative adversarial network includes a feature extraction network 41, a generator 42 and a discriminator 43, and the image processing apparatus further includes a face recognition network 44. Training the generative adversarial network according to the user sample image and the bottom plate sample image to obtain a face fusion model comprises the following steps:
step a1, inputting the bottom plate sample image into a feature extraction network to obtain first feature information.
The feature extraction network may be a neural network, such as a convolutional neural network. This embodiment may adopt an existing network architecture, such as a VGG (Visual Geometry Group) network, a ResNet network, or another general image feature extraction network. The feature extraction network performs feature extraction on the bottom plate sample image to obtain the first feature information. The first feature information in this embodiment is feature map information representing the background region (the regions other than the face region) of the bottom plate sample image; compared with the feature extraction in a conventional face fusion process, it captures much richer deep image semantic information.
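For concreteness, the following sketch shows one way a truncated torchvision ResNet could serve as the feature extraction network described above; the specific cut point (layer3) and the resulting output shape are assumptions, since the patent only states that a VGG/ResNet-style backbone may be used.

```python
# Minimal sketch: "first feature information" from a bottom plate image using
# a truncated ResNet backbone.
import torch
import torch.nn as nn
from torchvision.models import resnet18


class FeatureExtractor(nn.Module):
    def __init__(self):
        super().__init__()
        backbone = resnet18(weights=None)  # pretrained weights could be used instead
        # Keep everything up to layer3 so the output is a spatial feature map.
        self.features = nn.Sequential(*list(backbone.children())[:-3])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.features(x)


if __name__ == "__main__":
    extractor = FeatureExtractor()
    plate = torch.randn(1, 3, 256, 256)        # bottom plate sample image
    first_feature_info = extractor(plate)      # e.g. shape (1, 256, 16, 16)
    print(first_feature_info.shape)
```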
Step a2, obtaining face information of the user sample image through a face recognition network to obtain first face information.
The face recognition network in this step performs face recognition using existing face recognition technology to obtain the face information. The face recognition network may be, for example, an InsightFace-style model.
Optionally, the face recognition network may be stored in the image processing device; the user sample image is then input into the image processing device, and the image processing device performs face recognition on the user sample image through the face recognition network to obtain the first face information.
In this step, face recognition is performed on the user sample image to extract facial feature information, such as the features of the eyes, nose and mouth, which can uniquely identify the user's identity.
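As a purely illustrative sketch, the snippet below shows how an identity embedding (the "first face information") might be obtained from a frozen face recognition network. The IdentityEncoder class is a stand-in introduced here; in practice a pretrained model such as an ArcFace/InsightFace-style network would be used with its weights frozen.

```python
# Sketch of obtaining "first face information": aligned face -> identity embedding.
import torch
import torch.nn as nn


class IdentityEncoder(nn.Module):
    """Placeholder face recognition network: aligned face -> 512-d embedding."""

    def __init__(self, dim: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(128, dim),
        )

    def forward(self, face: torch.Tensor) -> torch.Tensor:
        # L2-normalise so the embedding mainly encodes identity direction.
        return nn.functional.normalize(self.net(face), dim=1)


id_encoder = IdentityEncoder().eval()
for p in id_encoder.parameters():
    p.requires_grad_(False)          # the recognition network itself is not trained

user = torch.randn(1, 3, 112, 112)   # aligned user sample image
first_face_info = id_encoder(user)   # "first face information" (Z_id)
```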
And a3, inputting the first characteristic information and the first face information into a generator to obtain a first face fusion result.
The generator can generate a first face fusion result according to the first characteristic information and the first face information, wherein the first face fusion result refers to a result image obtained by face fusion of the generator on the user sample image and the bottom plate sample image, namely an image after face changing.
Step a4, inputting the first face fusion result and the bottom plate sample image into a discriminator respectively to obtain a first discrimination result aiming at the first face fusion result and a second discrimination result aiming at the bottom plate sample image.
Wherein the first discrimination result for the first face fusion result is to determine whether the first face fusion result is generated by the generator or is real data; likewise, the second determination result for the floor sample image is to determine whether the floor sample image is generated by the generator or real data.
The floor sample image in the present embodiment is used as the label information.
Optionally, the discriminator outputs the probability D_fake that the first face fusion result belongs to real data and the probability D_real that the bottom plate sample image belongs to real data.
Step a5, determining a first loss function according to the first judgment result, the second judgment result and the first face fusion result.
This step a5 can be expressed by the following formula (1):

$$\mathcal{L}_{adv} = \log D_{real} + \log\left(1 - D_{fake}\right) \tag{1}$$

where $\mathcal{L}_{adv}$ denotes the first loss function. The network parameters of the generator are adjusted by minimizing the first loss function, and the network parameters of the discriminator are adjusted by maximizing it, so that through this adversarial game the generator and the discriminator finally reach an equilibrium state, that is, the discriminator can no longer tell whether an image produced by the generator is generated data or real data.
Step a6, adjusting the network parameters of the generator according to the first loss function.
The network parameters of the generator are adjusted by converging the first loss function of the above equation (1) towards minimization.
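For readers who want a concrete picture of formula (1), the following PyTorch sketch shows the alternating update described above. It is an illustrative assumption, not the patent's implementation: the discriminator D is assumed to output raw logits, and the binary cross-entropy form used here is the standard numerically stable equivalent of the log terms.

```python
# Sketch of the adversarial (first) loss with alternating D/G updates.
import torch
import torch.nn.functional as F


def discriminator_step(D, d_optimizer, plate_img, fused_img):
    """Maximise log D_real + log(1 - D_fake) with respect to the discriminator."""
    d_optimizer.zero_grad()
    logits_real = D(plate_img)             # bottom plate sample image (real data)
    logits_fake = D(fused_img.detach())    # first face fusion result (generated)
    loss_d = (F.binary_cross_entropy_with_logits(logits_real, torch.ones_like(logits_real))
              + F.binary_cross_entropy_with_logits(logits_fake, torch.zeros_like(logits_fake)))
    loss_d.backward()
    d_optimizer.step()
    return loss_d.item()


def generator_adv_loss(D, fused_img):
    """Adversarial term seen by the generator: fool D into predicting 'real'."""
    logits_fake = D(fused_img)
    return F.binary_cross_entropy_with_logits(logits_fake, torch.ones_like(logits_fake))
```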
On the basis of the foregoing embodiment, the method of this embodiment may further include the following method steps:
Step b1, inputting the first feature information and the first face information into the generator to obtain attention information.
The attention information comprises an attention map. The generator of this embodiment introduces an attention mechanism; the attention map is intended to make the generator focus more on the background region of the bottom plate sample image, so that the generated attention map is smoother, that is, the background region remains continuous and noise points are reduced as much as possible.
Step b2, determining a second loss function based on the attention information.
Wherein, step b2 can be implemented by the following formula (2):
$$\mathcal{L}_{att} = \frac{1}{H \cdot W}\sum_{i}\sum_{j}\left(\left\lvert G_{mask}^{i+1,j} - G_{mask}^{i,j}\right\rvert + \left\lvert G_{mask}^{i,j+1} - G_{mask}^{i,j}\right\rvert\right) \tag{2}$$

In formula (2), $\mathcal{L}_{att}$ denotes the second loss function; H and W denote the two spatial dimensions (width and height) of the attention map; i and j index the pixels of the attention map along these two directions; $G_{mask}$ denotes the attention information (the attention map); $G_{mask}^{i,j}$ denotes the value of the attention map at pixel (i, j), and $G_{mask}^{i+1,j}$ and $G_{mask}^{i,j+1}$ denote the values at its two neighbouring pixels.
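As an illustration only, formula (2) corresponds to the total-variation style smoothness penalty sketched below; the tensor layout, the use of absolute differences, and averaging over the batch are assumptions made for the example.

```python
# Sketch of the second loss (formula (2)): smoothness of the attention map
# g_mask of shape (N, 1, H, W), averaged over the batch as well.
import torch


def attention_smoothness_loss(g_mask: torch.Tensor) -> torch.Tensor:
    n, _, h, w = g_mask.shape
    dh = (g_mask[:, :, 1:, :] - g_mask[:, :, :-1, :]).abs()  # vertical neighbours
    dw = (g_mask[:, :, :, 1:] - g_mask[:, :, :, :-1]).abs()  # horizontal neighbours
    return (dh.sum() + dw.sum()) / (n * h * w)
```

A smoother mask keeps the background of the bottom plate image continuous, as discussed above.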
And b3, adjusting the network parameters of the feature extraction network and the generator according to the weighted sum of the first loss function and the second loss function.
Wherein adjusting the network parameters of the feature extraction network and the generator according to the weighted sum of the first loss function and the second loss function comprises: and adjusting the network parameters of the feature extraction network and the generator according to the accumulated values of the first loss function and the second loss function.
Optionally, in this embodiment, the network parameters of the feature extraction network and the generator may also be separately adjusted according to the first loss function and the second loss function, for example, the network parameters of the feature extraction network and the generator are adjusted according to the first loss function, and then the network parameters of the feature extraction network and the generator are adjusted according to the second loss function; or adjusting the network parameters of the feature extraction network and the generator according to the second loss function, and then adjusting the network parameters of the feature extraction network and the generator according to the first loss function. The two adjustment modes of the network parameters are not specifically limited in this embodiment, and those skilled in the art can select the adjustment mode of the network parameters according to actual requirements in the actual application process.
In this embodiment, attention information is obtained by inputting the first feature information and the first face information into the generator; a second loss function is determined based on the attention information; and the network parameters of the feature extraction network and the generator are adjusted according to the weighted sum of the first loss function and the second loss function. Because an attention mechanism is introduced into the generator, the generator pays more attention to the background region of the bottom plate sample image and generates a smoother attention map; the background region remains continuous and noise points are reduced as much as possible, so that the obtained face fusion result is more vivid and visually reasonable.
On the basis of the above embodiments, the embodiments of the present application may further include the following method steps:
and c1, performing face fusion according to the user sample image and the bottom plate sample image to obtain a second face fusion result.
In this step, an existing face fusion method may be used to fuse the user sample image and the bottom plate sample image. The fusion index of the first face fusion result is greater than that of the second face fusion result, and the fusion index is positively correlated with the quality of the face fusion result. It can be understood that the second face fusion result is coarser than the first face fusion result in fusion effect; for example, the second face fusion result may suffer from false edges and abnormal colors and lack naturalness, while the first face fusion result does not have these problems and is more natural. In other words, the first face fusion result is better than the second face fusion result in terms of false edges, abnormal colors, naturalness and so on.
And c2, determining a third loss function according to the first face fusion result and the second face fusion result.
Wherein, step c2 can be expressed as the following formula (3):
$$\mathcal{L}_{rec} = \left\lVert G_{fake} - I_{gt} \right\rVert \tag{3}$$

In formula (3), $\mathcal{L}_{rec}$ denotes the third loss function, $G_{fake}$ denotes the first face fusion result, and $I_{gt}$ denotes the ground-truth image, i.e. the bottom plate sample image used as label information. It can be understood that the third loss function is determined using the pixel loss between the first face fusion result and the second face fusion result.
And c3, adjusting the network parameters of the feature extraction network and the generator according to the weighted sum of the first loss function, the second loss function and the third loss function.
The third loss function is intended to give the generator the ability to perform face fusion.
Wherein adjusting network parameters of the feature extraction network and the generator according to a weighted sum of the first loss function, the second loss function, and the third loss function comprises: and adjusting the network parameters of the feature extraction network and the generator according to the accumulated values of the first loss function, the second loss function and the third loss function.
Optionally, in this embodiment, the network parameters of the feature extraction network and the generator may also be individually adjusted according to the first loss function, the second loss function, and the third loss function, for example, the network parameters of the feature extraction network and the generator may be sequentially adjusted according to the order of the first loss function, the second loss function, and the third loss function. Of course, the network parameters of the feature extraction network and the generator may also be sequentially adjusted according to other sequences of the first loss function, the second loss function, and the third loss function, which is not specifically limited in this embodiment.
In the embodiment, the face fusion is performed according to the user sample image and the bottom plate sample image to obtain the second face fusion result, wherein the fusion index of the first face fusion result is greater than the fusion index of the second face fusion result, and the fusion index is positively correlated with the quality of the face fusion result, so that the skin color and the background information in the image after the face fusion are more natural and consistent.
On the basis of the above embodiments, the embodiments of the present application may further include the following method steps:
and d1, acquiring the face information of the first face fusion result through the face recognition network to obtain second face information.
The face recognition network in this step performs face recognition using existing face recognition technology to obtain the face information. The face recognition network may be, for example, an InsightFace-style model.
Optionally, the face recognition network may be stored in the image processing device, and then the first face fusion result is input into the image processing device, and the image processing device performs face recognition on the first face fusion result through the face recognition network to obtain the second face information.
The step is to perform face recognition on the first face fusion result, and extract face feature information, such as feature information of five sense organs, which can uniquely identify the identity of the user in the first face fusion result.
And d2, determining a fourth loss function according to the second face information and the first face information.
This step d2 can be expressed by the following equation (4):
$$\mathcal{L}_{id} = \left\lVert Z_{fake} - Z_{id} \right\rVert \tag{4}$$

In formula (4), $Z_{id}$ denotes the first face information, $Z_{fake}$ denotes the second face information, and $\mathcal{L}_{id}$ denotes the fourth loss function.
Wherein the fourth loss function is intended to enable the feature extraction network to ignore the face information of the floor sample image and to pay attention to the face information of the user sample image.
And d3, adjusting the network parameters of the feature extraction network and the generator according to the weighted sum of the first loss function, the second loss function, the third loss function and the fourth loss function.
Wherein adjusting the network parameters of the feature extraction network and the generator according to the weighted sum of the first loss function, the second loss function, the third loss function, and the fourth loss function comprises: and adjusting the network parameters of the feature extraction network and the generator according to the accumulated values of the first loss function, the second loss function, the third loss function and the fourth loss function.
Optionally, in this embodiment, the network parameters of the feature extraction network and the generator may also be separately adjusted according to the first loss function, the second loss function, the third loss function, and the fourth loss function, for example, the network parameters of the feature extraction network and the generator are sequentially adjusted according to the sequence of the first loss function, the second loss function, the third loss function, and the fourth loss function. Of course, the network parameters of the feature extraction network and the generator may also be sequentially adjusted according to other sequences of the first loss function, the second loss function, the third loss function, and the fourth loss function, which is not specifically limited in this embodiment.
On the basis of the above embodiments, the embodiments of the present application may further include the following method steps:
and e1, inputting the first face fusion result into the feature extraction network to obtain second feature information.
The feature extraction network may be a neural network, such as a convolutional neural network. This embodiment may adopt an existing network architecture, such as a VGG (Visual Geometry Group) network, a ResNet network, or another general image feature extraction network. The feature extraction network performs feature extraction on the first face fusion result to obtain the second feature information. The second feature information in this embodiment is feature map information representing the background region (the region other than the face region) of the first face fusion result.
And e2, determining a fifth loss function according to the first characteristic information and the second characteristic information.
This step e2 can be expressed by the following equation (5):
$$\mathcal{L}_{feat} = \left\lVert F_{fakes} - F_{atts} \right\rVert \tag{5}$$

In formula (5), $F_{atts}$ denotes the first feature information, $F_{fakes}$ denotes the second feature information, and $\mathcal{L}_{feat}$ denotes the fifth loss function.
And e3, adjusting the network parameters of the feature extraction network and the generator according to the weighted sum of the first loss function, the second loss function, the third loss function, the fourth loss function and the fifth loss function.
Adjusting network parameters of the feature extraction network and the generator according to a weighted sum of the first loss function, the second loss function, the third loss function, the fourth loss function and the fifth loss function, including: and adjusting the network parameters of the feature extraction network and the generator according to the accumulated values of the first loss function, the second loss function, the third loss function, the fourth loss function and the fifth loss function.
Optionally, in this embodiment, the network parameters of the feature extraction network and the generator may also be separately adjusted according to the first loss function, the second loss function, the third loss function, the fourth loss function, and the fifth loss function, for example, the network parameters of the feature extraction network and the generator may be sequentially adjusted according to the sequence of the first loss function, the second loss function, the third loss function, the fourth loss function, and the fifth loss function. Of course, the network parameters of the feature extraction network and the generator may also be sequentially adjusted according to other sequences of the first loss function, the second loss function, the third loss function, the fourth loss function, and the fifth loss function, which is not specifically limited in this embodiment.
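To make the combination of the five losses concrete, the sketch below forms their weighted sum before a single backward pass over the feature extraction network and the generator. The weight values are illustrative assumptions; the text only states that a weighted sum (or the plain sum) of the losses is used.

```python
# Sketch of the weighted sum of the five losses that updates the feature
# extraction network and the generator.
def total_generator_loss(l_adv, l_att, l_rec, l_id, l_feat,
                         w_adv=1.0, w_att=1.0, w_rec=10.0, w_id=5.0, w_feat=10.0):
    return (w_adv * l_adv       # first loss: adversarial term, formula (1)
            + w_att * l_att     # second loss: attention smoothness, formula (2)
            + w_rec * l_rec     # third loss: pixel reconstruction, formula (3)
            + w_id * l_id       # fourth loss: identity consistency, formula (4)
            + w_feat * l_feat)  # fifth loss: feature matching, formula (5)


# Typical usage inside the training loop (optimizer covers the feature
# extraction network and the generator):
#   loss = total_generator_loss(l_adv, l_att, l_rec, l_id, l_feat)
#   optimizer.zero_grad(); loss.backward(); optimizer.step()
```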
In another embodiment of the present application, inputting the first feature information and the first face information into the generator to obtain the first face fusion result includes: inputting the first feature information and the first face information into the generator to obtain an initial face fusion result and attention information; and inputting the initial face fusion result and the attention information into the generator to obtain the first face fusion result.
As shown in fig. 5, the output of the generator is divided into two branches: one branch is the initial face fusion result, and the other branch is the attention map. The generator then performs face fusion again according to the attention map and the initial face fusion result to obtain the first face fusion result. This process can be expressed as the following formula (6):
$$G_{fake} = G_{fake1} \cdot G_{mask} + \left(1 - G_{mask}\right) \cdot I_{att} \tag{6}$$

In formula (6), $G_{fake1}$ denotes the initial face fusion result, $G_{mask}$ denotes the attention map, and $I_{att}$ denotes the bottom plate sample image. The first term of the formula can be understood as the face information obtained after the initial fusion, and the second term as the retained background information of the bottom plate sample image.
In this embodiment, the first feature information and the first face information are input into the generator to obtain an initial face fusion result and attention information, and the initial face fusion result and the attention information are then input into the generator again to perform face fusion once more and obtain the first face fusion result. By introducing the attention mechanism into the generator, the background information of the bottom plate sample image can be largely retained while the generator performs face fusion, so that the background region remains continuous, noise points are reduced as much as possible, and the face fusion result generated by the generator is more vivid and natural.
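Formula (6) corresponds directly to the small blending routine below; the tensor shapes are assumptions made for the example.

```python
# Sketch of formula (6): blending the initial fusion result with the bottom
# plate image using the attention map. g_fake1 and i_att are (N, 3, H, W);
# g_mask is (N, 1, H, W) with values in [0, 1].
import torch


def blend_with_attention(g_fake1: torch.Tensor,
                         g_mask: torch.Tensor,
                         i_att: torch.Tensor) -> torch.Tensor:
    # Face region comes from the initial fusion result, background from the
    # bottom plate sample image.
    return g_fake1 * g_mask + (1.0 - g_mask) * i_att
```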
Optionally, training the generative adversarial network according to the user sample image and the bottom plate sample image to obtain the face fusion model includes:
Step f1, performing affine transformation on the user sample image and the bottom plate sample image respectively.
This step may be understood as preprocessing of the user sample image and the bottom plate sample image. Performing affine transformation on the user sample image and the bottom plate sample image respectively includes: extracting key point information from the user sample image and the bottom plate sample image to obtain user key point information and bottom plate key point information; and aligning the user sample image and the bottom plate sample image to a standard face through affine transformation according to the user key point information and the bottom plate key point information, respectively.
For example, a standard face may be set, where the standard face includes the position information of standard face key points. A similarity transformation is then computed between the user key point information and the standard face key points, and another between the bottom plate key point information and the standard face key points; the transformation includes rotation, translation and scaling, and yields a homogeneous transformation matrix M. Using M as the parameter, the user sample image and the bottom plate sample image are warped to the standard face, giving the aligned user sample image and bottom plate sample image.
Among them, the affine transformation aims at adjusting the face angle, for example, adjusting the face to be a forward angle.
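The alignment described above can be sketched as follows: estimate a similarity transform from detected face key points to a fixed "standard face" template and warp the image with it. The five template coordinates below are common values for a 112x112 aligned face crop and are assumptions, not taken from the patent; key point detection itself is outside this sketch.

```python
# Sketch of aligning a face image to a standard face via a similarity transform.
import cv2
import numpy as np

# Hypothetical standard-face key points: left eye, right eye, nose tip,
# left mouth corner, right mouth corner (x, y) for a 112x112 crop.
STANDARD_FACE = np.array([[38.3, 51.7], [73.5, 51.5], [56.0, 71.7],
                          [41.5, 92.4], [70.7, 92.2]], dtype=np.float32)


def align_to_standard_face(image: np.ndarray,
                           keypoints: np.ndarray,
                           size: int = 112) -> np.ndarray:
    """Rotate/translate/scale `image` so `keypoints` match the standard face."""
    m, _ = cv2.estimateAffinePartial2D(keypoints.astype(np.float32),
                                       STANDARD_FACE, method=cv2.LMEDS)
    return cv2.warpAffine(image, m, (size, size), flags=cv2.INTER_LINEAR)
```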
Step f2, training the generative adversarial network according to the affine-transformed user sample image and bottom plate sample image to obtain the face fusion model.
The step is similar to the implementation process of the foregoing embodiment, and reference may be specifically made to the description of the foregoing embodiment, which is not described herein again.
According to the embodiment of the application, the application also provides a face fusion method.
Fig. 6 is a flowchart of a face fusion method according to an embodiment of the present application. The face fusion method of the embodiment comprises the following steps:
step 601, acquiring a user image and a bottom plate image.
Step 602, inputting the user image and the bottom plate image into the face fusion model to obtain a face fusion result.
The human face fusion model is obtained according to a training method of the human face fusion model and used for replacing the human face in the bottom plate image with the user image.
The execution main body of the embodiment may be an electronic device integrating an image capturing function and an image processing function, such as a smartphone, an Ipad, and the like. Taking a smart phone as an example, a user may select an image in an album of the smart phone as a user image, or use a camera of the smart phone to capture an image as the user image. The user can also select the bottom plate image on a face fusion application program of the smart phone, wherein the face fusion application program can provide various specific image images, and after the user inputs the user image and the bottom plate image, the smart phone can perform face fusion on the user image and the bottom plate image.
According to the embodiment of the application, the user image and the bottom plate image are acquired and input into the face fusion model to obtain the face fusion result. Because the face fusion model is obtained by training a generative adversarial network, deep semantic feature information of the images can be extracted, and face fusion is performed based on this deep semantic feature information, so that a good face fusion effect is obtained.
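For illustration, a minimal inference sketch under the assumption that the trained fusion model was saved as a single PyTorch module taking the user image and the bottom plate image and returning the fused image; the file names and loading convention are hypothetical.

```python
# Sketch of applying a trained face fusion model at inference time.
import torch
from torchvision.io import read_image
from torchvision.utils import save_image


def fuse(model_path: str, user_path: str, plate_path: str, out_path: str):
    model = torch.load(model_path, map_location="cpu")   # trained fusion model
    model.eval()
    user = read_image(user_path).float().unsqueeze(0) / 255.0
    plate = read_image(plate_path).float().unsqueeze(0) / 255.0
    with torch.no_grad():
        fused = model(user, plate)       # face in `plate` replaced by `user`
    save_image(fused, out_path)


# fuse("face_fusion_model.pt", "user.jpg", "plate.jpg", "fused.jpg")
```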
According to the embodiment of the application, the application also provides a block diagram of a training device of the face fusion model.
Fig. 7 is a block diagram of a training apparatus for a face fusion model according to an embodiment of the present application. The training apparatus 70 for a face fusion model of this embodiment includes: a first acquisition module 71 and a training module 72. The first acquisition module 71 is configured to acquire a sample image, where the sample image includes a user sample image and a bottom plate sample image; and the training module 72 is configured to train a generative adversarial network according to the user sample image and the bottom plate sample image to obtain a face fusion model, where the face fusion model is used to replace the face in the bottom plate image with the face in the user image.
Optionally, the generative adversarial network includes a feature extraction network, a generator and a discriminator. The training module 72 includes: an input unit 721, configured to input the bottom plate sample image into the feature extraction network to obtain first feature information; an obtaining unit 722, configured to obtain the face information of the user sample image through a face recognition network to obtain first face information; the input unit 721, further configured to input the first feature information and the first face information into the generator to obtain a first face fusion result, and to input the first face fusion result and the bottom plate image into the discriminator respectively to obtain a first discrimination result for the first face fusion result and a second discrimination result for the bottom plate image; a determining unit 723, configured to determine a first loss function according to the first discrimination result, the second discrimination result and the first face fusion result; and an adjusting unit 724, configured to adjust the network parameters of the generator according to the first loss function.
Optionally, the input unit 721 is further configured to input the first feature information and the first face information into the generator to obtain attention information; the determining unit 723 is further configured to determine a second loss function according to the attention information; and the adjusting unit 724 is further configured to adjust the network parameters of the feature extraction network and the generator according to the weighted sum of the first loss function and the second loss function. Optionally, the training module 72 further includes: a face fusion unit 725, configured to perform face fusion according to the user sample image and the bottom plate sample image to obtain a second face fusion result, where the fusion index of the first face fusion result is greater than the fusion index of the second face fusion result, and the fusion index is positively correlated with the quality of the face fusion result; the determining unit 723 is further configured to determine a third loss function according to the first face fusion result and the second face fusion result; and the adjusting unit 724 is further configured to adjust the network parameters of the feature extraction network and the generator according to the weighted sum of the first loss function, the second loss function and the third loss function.
Optionally, the obtaining unit 722 is further configured to obtain face information of the first face fusion result through a face recognition network, so as to obtain second face information; the determining unit 723, configured to determine a fourth loss function according to the second face information and the first face information; the adjusting unit 724 is further configured to adjust a network parameter of the feature extraction network and the generator according to a weighted sum of the first loss function, the second loss function, the third loss function, and the fourth loss function.
Optionally, the input unit 721 is further configured to input the first face fusion result into the feature extraction network, so as to obtain second feature information; the determining unit 723, configured to determine a fifth loss function according to the first characteristic information and the second characteristic information; the adjusting unit 724 is further configured to adjust the network parameters of the feature extraction network and the generator according to a weighted sum of the first loss function, the second loss function, the third loss function, the fourth loss function, and the fifth loss function.
Optionally, when the input unit 721 inputs the first feature information and the first face information into the generator to obtain the first face fusion result, this specifically includes: inputting the first feature information and the first face information into the generator to obtain an initial face fusion result and attention information; and inputting the initial face fusion result and the attention information into the generator to obtain the first face fusion result.
Optionally, the apparatus 70 further comprises: a transformation module 73, configured to perform affine transformation on the user sample image and the backplane sample image respectively; and the training module 72 is configured to train the generating countermeasure network according to the affine-transformed user sample image and the bottom plate sample image to obtain a face fusion model.
According to the embodiment of the application, a sample image including a user sample image and a bottom plate sample image is acquired, and a generative adversarial network is trained according to the user sample image and the bottom plate sample image to obtain the face fusion model; through the face fusion model, the face in the bottom plate image can be replaced with the face in the user image. Because the face fusion model is obtained by training a generative adversarial network, deep semantic feature information of the images can be extracted, and face fusion is performed based on this deep semantic feature information, so that a good face fusion effect is obtained.
According to the embodiment of the application, the application also provides a block diagram of the human face fusion device.
Fig. 8 is a block diagram of a face fusion apparatus according to an embodiment of the present application. The face fusion device of the embodiment comprises: a second acquisition module 81 and an input module 82; the second obtaining module 81 is configured to obtain a user image and a bottom plate image; an input module 82, configured to input the user image and the bottom plate image into the face fusion model obtained by using the training method in the foregoing embodiment, so as to obtain a face fusion result, where the face fusion model is used to replace a face in the bottom plate image with the user image.
The face fusion model includes a feature extraction network and a generator. The input module 82 includes: an obtaining unit 821, configured to obtain the face information of the user image through a face recognition network; and an input unit 822, configured to input the bottom plate image into the pre-trained feature extraction network to obtain feature information, and to input the face information and the feature information into the pre-trained generator to obtain the face fusion result.
According to an embodiment of the present application, an electronic device and a readable storage medium are also provided.
Fig. 9 is a block diagram of an electronic device for a face fusion model training method and/or a face fusion method according to an embodiment of the present application. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the present application that are described and/or claimed herein.
As shown in fig. 9, the electronic apparatus includes: one or more processors 901, memory 902, and interfaces for connecting the various components, including a high-speed interface and a low-speed interface. The various components are interconnected using different buses and may be mounted on a common motherboard or in other manners as desired. The processor may process instructions for execution within the electronic device, including instructions stored in or on the memory to display graphical information of a GUI on an external input/output apparatus (such as a display device coupled to the interface). In other embodiments, multiple processors and/or multiple buses may be used, along with multiple memories, as desired. Also, multiple electronic devices may be connected, with each device providing portions of the necessary operations (e.g., as a server array, a group of blade servers, or a multi-processor system). Fig. 9 illustrates an example of a processor 901.
Memory 902 is a non-transitory computer readable storage medium as provided herein. The memory stores instructions executable by the at least one processor, so that the at least one processor executes the training method of the face fusion model and/or the face fusion method provided by the application. The non-transitory computer-readable storage medium of the present application stores computer instructions for causing a computer to perform the training method of the face fusion model and/or the face fusion method provided by the present application.
The memory 902, which is a non-transitory computer readable storage medium, may be used to store non-transitory software programs, non-transitory computer executable programs, and modules, such as the program instructions/modules corresponding to the training method of the face fusion model and/or the face fusion method in the embodiments of the present application (e.g., the first acquisition module 71, the training module 72 and the transformation module 73 shown in fig. 7, and the second acquisition module 81 and the input module 82 shown in fig. 8). The processor 901 executes various functional applications and data processing of the server by running the non-transitory software programs, instructions and modules stored in the memory 902, that is, it implements the training method of the face fusion model and/or the face fusion method in the above method embodiments.
The memory 902 may include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required for at least one function; the storage data area may store data created from the training method of the face fusion model and/or the use of the electronic device of the face fusion method, and the like. Further, the memory 902 may include high speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory 902 may optionally include a memory remotely located from the processor 901, and these remote memories may be connected to the electronic device of the face fusion model training method and/or the face fusion method via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The electronic device for the training method of the face fusion model and/or the face fusion method may further include: an input device 903 and an output device 904. The processor 901, the memory 902, the input device 903, and the output device 904 may be connected by a bus or in other ways; fig. 9 takes connection by a bus as an example.
The input device 903, such as a touch screen, a keypad, a mouse, a track pad, a touch pad, a pointing stick, one or more mouse buttons, a trackball, or a joystick, may receive input numeric or character information and generate key signal inputs related to user settings and function control of the electronic device for the face fusion model training method and/or the face fusion method. The output device 904 may include a display device, an auxiliary lighting device (e.g., an LED), a tactile feedback device (e.g., a vibration motor), and the like. The display device may include, but is not limited to, a Liquid Crystal Display (LCD), a Light Emitting Diode (LED) display, and a plasma display. In some implementations, the display device can be a touch screen.
Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, ASICs (application-specific integrated circuits), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, and which receives data and instructions from, and transmits data and instructions to, a storage system, at least one input device, and at least one output device.
These computer programs (also known as programs, software applications, or code) include machine instructions for a programmable processor, and may be implemented using high-level procedural and/or object-oriented programming languages, and/or assembly/machine languages. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (e.g., magnetic discs, optical disks, memory, Programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server can be a cloud server, also called a cloud computing server or a cloud host; it is a host product in the cloud computing service system and addresses the drawbacks of difficult management and weak service scalability found in traditional physical hosts and VPS ("Virtual Private Server") services.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present application may be executed in parallel, sequentially, or in different orders, and no limitation is imposed herein as long as the desired results of the technical solutions disclosed in the present application can be achieved.
The above-described embodiments should not be construed as limiting the scope of the present application. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (22)

1. A training method of a face fusion model, comprising:
acquiring a sample image, wherein the sample image comprises a user sample image and a bottom plate sample image;
and training a generative adversarial network according to the user sample image and the bottom plate sample image to obtain a face fusion model, wherein the face fusion model is used for replacing the face in the bottom plate image with the user image.
2. The method of claim 1, wherein the generative adversarial network comprises a feature extraction network, a generator, and a discriminator;
the training of the generative adversarial network according to the user sample image and the bottom plate sample image to obtain the face fusion model comprises:
inputting the bottom plate sample image into the feature extraction network to obtain first feature information;
acquiring face information of the user sample image through a face recognition network to obtain first face information;
inputting the first feature information and the first face information into the generator to obtain a first face fusion result;
inputting the first face fusion result and the bottom plate image into the discriminator respectively to obtain a first discrimination result aiming at the first face fusion result and a second discrimination result aiming at the bottom plate image;
determining a first loss function according to the first discrimination result, the second discrimination result, and the first face fusion result;
adjusting a network parameter of the generator according to the first loss function.
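The following PyTorch-style sketch illustrates one possible reading of the training step of claim 2. The module names feat_net, face_rec, G, and D, the relativistic form of the adversarial term, and the L1 reconstruction term are assumptions for illustration and are not taken from the application.

```python
import torch
import torch.nn.functional as F

def generator_step(feat_net, face_rec, G, D, optimizer_G, user_sample, base_sample):
    """One generator update, assuming the first loss combines both
    discrimination results with a reconstruction term on the fusion result."""
    first_feature_info = feat_net(base_sample)              # from the bottom plate sample image
    with torch.no_grad():
        first_face_info = face_rec(user_sample)             # face information of the user sample image
        d_real = D(base_sample)                             # second discrimination result
    first_fusion = G(first_feature_info, first_face_info)   # first face fusion result
    d_fake = D(first_fusion)                                 # first discrimination result
    # First loss function: relativistic adversarial term (uses both discrimination
    # results) plus an L1 term tying the fusion result to the base image.
    adv = F.binary_cross_entropy_with_logits(d_fake - d_real, torch.ones_like(d_fake))
    recon = F.l1_loss(first_fusion, base_sample)
    first_loss = adv + recon
    optimizer_G.zero_grad()
    first_loss.backward()                                    # adjust the generator's parameters
    optimizer_G.step()
    return first_fusion, first_loss
```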
3. The method of claim 2, wherein the method further comprises:
inputting the first feature information and the first face information into the generator to obtain attention information;
determining a second loss function according to the attention information;
adjusting network parameters of the feature extraction network and the generator according to a weighted sum of the first loss function and the second loss function.
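As a sketch of the attention-based "second loss function" in claim 3, the snippet below assumes the attention information is a spatial mask in [0, 1] and penalizes large, noisy masks; the exact form of the loss and the weights w1 and w2 are illustrative assumptions, not claim language.

```python
import torch

def second_loss(attention: torch.Tensor) -> torch.Tensor:
    # Sparsity: keep the edited region small; smoothness: avoid noisy masks.
    sparsity = attention.mean()
    smooth_h = (attention[..., 1:, :] - attention[..., :-1, :]).abs().mean()
    smooth_w = (attention[..., :, 1:] - attention[..., :, :-1]).abs().mean()
    return sparsity + smooth_h + smooth_w

def loss_claim3(first_loss, attention, w1=1.0, w2=0.1):
    # Weighted sum of the first and second loss functions.
    return w1 * first_loss + w2 * second_loss(attention)
```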
4. The method of claim 3, wherein the method further comprises:
performing face fusion according to the user sample image and the bottom plate sample image to obtain a second face fusion result, wherein the fusion index of the first face fusion result is greater than that of the second face fusion result, and the fusion index is positively correlated with the quality of the face fusion result;
determining a third loss function according to the first face fusion result and the second face fusion result;
adjusting network parameters of the feature extraction network and the generator according to a weighted sum of the first loss function, the second loss function, and the third loss function.
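One way to read the "third loss function" of claim 4 is as a reconstruction term against the coarser, conventionally produced second fusion result; the L1 form and the three-way weighting below are assumptions for illustration only.

```python
import torch.nn.functional as F

def third_loss(first_fusion, second_fusion):
    # Pull the generator output toward the coarse (lower fusion index) reference.
    return F.l1_loss(first_fusion, second_fusion)

def loss_claim4(l1, l2, l3, w=(1.0, 0.1, 10.0)):
    # Weighted sum of the first, second, and third loss functions.
    return w[0] * l1 + w[1] * l2 + w[2] * l3
```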
5. The method of claim 4, wherein the method further comprises:
acquiring face information of the first face fusion result through a face recognition network to obtain second face information;
determining a fourth loss function according to the second face information and the first face information;
adjusting network parameters of the feature extraction network and the generator according to a weighted sum of the first loss function, the second loss function, the third loss function, and the fourth loss function.
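The "fourth loss function" of claim 5 compares face information extracted from the fusion result with the first face information; a common instantiation is a cosine identity-preservation loss, which the hedged sketch below assumes.

```python
import torch.nn.functional as F

def fourth_loss(face_rec, first_fusion, first_face_info):
    second_face_info = face_rec(first_fusion)  # face information of the fusion result
    # 1 - cosine similarity between the two identity embeddings.
    return 1.0 - F.cosine_similarity(second_face_info, first_face_info, dim=-1).mean()
```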
6. The method of claim 5, wherein the method further comprises:
inputting the first face fusion result into the feature extraction network to obtain second feature information;
determining a fifth loss function according to the first feature information and the second feature information;
adjusting network parameters of the feature extraction network and the generator according to a weighted sum of the first loss function, the second loss function, the third loss function, the fourth loss function, and the fifth loss function.
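For claim 6, the "fifth loss function" compares feature information of the fusion result with the first feature information; the L2 (feature-matching) form and the five weights below are placeholders chosen for illustration, not values from the application.

```python
import torch.nn.functional as F

def fifth_loss(feat_net, first_fusion, first_feature_info):
    second_feature_info = feat_net(first_fusion)
    return F.mse_loss(second_feature_info, first_feature_info)

def total_loss(l1, l2, l3, l4, l5, w=(1.0, 0.1, 10.0, 5.0, 1.0)):
    # Weighted sum of the first through fifth loss functions (claim 6).
    return w[0]*l1 + w[1]*l2 + w[2]*l3 + w[3]*l4 + w[4]*l5
```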
7. The method according to any one of claims 2 and 4 to 6, wherein the inputting the first feature information and the first face information into the generator to obtain a first face fusion result comprises:
inputting the first feature information and the first face information into the generator to obtain an initial face fusion result and attention information;
and inputting the initial face fusion result and the attention information into the generator to obtain the first face fusion result.
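Claim 7 describes a two-pass use of the generator: a first pass yields an initial fusion result and attention information, which are then fed back to obtain the first face fusion result. The sketch below assumes a separate refinement entry point on the generator; that interface is hypothetical.

```python
def two_pass_generate(G, first_feature_info, first_face_info):
    # First pass: initial face fusion result plus attention information.
    initial_result, attention = G(first_feature_info, first_face_info)
    # Second pass: the initial result and the attention map are fed back into
    # the generator (here through a hypothetical refine() method).
    first_fusion_result = G.refine(initial_result, attention)
    return first_fusion_result, attention
```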
8. The method according to any one of claims 1 to 6, wherein training the generative adversarial network according to the user sample image and the bottom plate sample image to obtain a face fusion model comprises:
carrying out affine transformation on the user sample image and the bottom plate sample image respectively;
and training the generative adversarial network according to the user sample image and the bottom plate sample image after affine transformation to obtain a face fusion model.
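Claim 8 applies an affine transformation to each sample image before training; one possible implementation with torchvision's RandomAffine is sketched below, with the parameter ranges chosen arbitrarily for illustration.

```python
from torchvision import transforms

# Random rotation, translation, scaling, and shear applied independently
# to the user sample image and the bottom plate sample image.
affine = transforms.RandomAffine(degrees=15, translate=(0.05, 0.05),
                                 scale=(0.9, 1.1), shear=5)

def augment_pair(user_sample_img, base_sample_img):
    return affine(user_sample_img), affine(base_sample_img)
```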
9. A face fusion method, comprising:
acquiring a user image and a bottom plate image;
inputting the user image and the bottom plate image into a face fusion model obtained by the training method according to any one of claims 1 to 8 to obtain a face fusion result, wherein the face fusion model is used for replacing the face in the bottom plate image with the user image.
10. The method of claim 9, wherein the face fusion model comprises a feature extraction network and a generator;
the inputting the user image and the bottom plate image into the face fusion model obtained by the training method according to any one of claims 1 to 8 to obtain a face fusion result, comprising:
acquiring face information of the user image through a face recognition network;
inputting the bottom plate image into the feature extraction network obtained by pre-training to obtain feature information;
and inputting the face information and the feature information into the generator obtained by pre-training to obtain the face fusion result.
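The inference flow of claim 10 reduces to three calls: face recognition on the user image, feature extraction on the bottom plate image, and generation of the fusion result. The sketch below assumes the pre-trained networks are available as face_rec, feat_net, and G; the names are illustrative.

```python
import torch

@torch.no_grad()
def face_fusion(face_rec, feat_net, G, user_image, base_image):
    face_info = face_rec(user_image)       # face information of the user image
    feature_info = feat_net(base_image)    # feature information of the bottom plate image
    return G(feature_info, face_info)      # face fusion result
```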
11. A training device for a face fusion model, comprising:
the system comprises a first acquisition module, a second acquisition module and a third acquisition module, wherein the first acquisition module is used for acquiring a sample image, and the sample image comprises a user sample image and a bottom plate sample image;
and a training module, used for training a generative adversarial network according to the user sample image and the bottom plate sample image to obtain a face fusion model, wherein the face fusion model is used for replacing the face in the bottom plate image with the user image.
12. The apparatus of claim 11, wherein the generative adversarial network comprises a feature extraction network, a generator, and a discriminator;
the training module comprises:
an input unit, used for inputting the bottom plate sample image into the feature extraction network to obtain first feature information;
an acquisition unit, used for acquiring the face information of the user sample image through a face recognition network to obtain first face information;
the input unit is further configured to input the first feature information and the first face information into the generator to obtain a first face fusion result; inputting the first face fusion result and the bottom plate image into the discriminator respectively to obtain a first discrimination result aiming at the first face fusion result and a second discrimination result aiming at the bottom plate image;
a determining unit, configured to determine a first loss function according to the first discrimination result, the second discrimination result, and the first face fusion result;
and the adjusting unit is used for adjusting the network parameters of the generator according to the first loss function.
13. The apparatus of claim 12, wherein the input unit is further configured to input the first feature information and the first face information into the generator, resulting in attention information;
the determining unit is further configured to determine a second loss function according to the attention information;
the adjusting unit is further configured to adjust network parameters of the feature extraction network and the generator according to a weighted sum of the first loss function and the second loss function.
14. The apparatus of claim 13, wherein the training module further comprises:
the face fusion unit is used for carrying out face fusion according to the user sample image and the bottom plate sample image to obtain a second face fusion result, the fusion index of the first face fusion result is greater than that of the second face fusion result, and the fusion index is positively correlated with the quality of the face fusion result;
the determining unit is further configured to determine a third loss function according to the first face fusion result and the second face fusion result;
the adjusting unit is further configured to adjust network parameters of the feature extraction network and the generator according to a weighted sum of the first loss function, the second loss function, and the third loss function.
15. The apparatus according to claim 14, wherein the obtaining unit is further configured to obtain face information of the first face fusion result through a face recognition network to obtain second face information;
the determining unit is further configured to determine a fourth loss function according to the second face information and the first face information;
the adjusting unit is further configured to adjust network parameters of the feature extraction network and the generator according to a weighted sum of the first loss function, the second loss function, the third loss function, and the fourth loss function.
16. The apparatus according to claim 15, wherein the input unit is further configured to input the first face fusion result into the feature extraction network, so as to obtain second feature information;
the determining unit is further configured to determine a fifth loss function according to the first feature information and the second feature information;
the adjusting unit is further configured to adjust network parameters of the feature extraction network and the generator according to a weighted sum of the first loss function, the second loss function, the third loss function, the fourth loss function, and the fifth loss function.
17. The apparatus according to any one of claims 12 and 14 to 16, wherein the input unit, when inputting the first feature information and the first face information into the generator to obtain a first face fusion result, is specifically used for:
inputting the first feature information and the first face information into the generator to obtain an initial face fusion result and attention information;
and inputting the initial face fusion result and the attention information into the generator to obtain the first face fusion result.
18. The apparatus of any of claims 11-16, further comprising:
a transformation module, used for performing affine transformation on the user sample image and the bottom plate sample image respectively;
and the training module is used for training the generative adversarial network according to the user sample image and the bottom plate sample image after affine transformation to obtain a face fusion model.
19. A face fusion apparatus comprising:
a second acquisition module, used for acquiring a user image and a bottom plate image;
an input module, configured to input the user image and the bottom plate image into a face fusion model obtained by using the training method according to any one of claims 1 to 8, so as to obtain a face fusion result, where the face fusion model is used to replace a face in the bottom plate image with the user image.
20. The apparatus of claim 19, wherein the face fusion model comprises a feature extraction network and a generator;
the input module includes:
an acquisition unit, used for acquiring the face information of the user image through a face recognition network;
and the input unit is used for inputting the bottom plate image into the feature extraction network obtained by pre-training to obtain feature information, and inputting the face information and the feature information into the generator obtained by pre-training to obtain the face fusion result.
21. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-10.
22. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any one of claims 1-10.
CN202010615462.XA 2020-06-30 2020-06-30 Training method of face fusion model, face fusion method, device and equipment Active CN111783647B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010615462.XA CN111783647B (en) 2020-06-30 2020-06-30 Training method of face fusion model, face fusion method, device and equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010615462.XA CN111783647B (en) 2020-06-30 2020-06-30 Training method of face fusion model, face fusion method, device and equipment

Publications (2)

Publication Number Publication Date
CN111783647A true CN111783647A (en) 2020-10-16
CN111783647B CN111783647B (en) 2023-11-03

Family

ID=72760013

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010615462.XA Active CN111783647B (en) 2020-06-30 2020-06-30 Training method of face fusion model, face fusion method, device and equipment

Country Status (1)

Country Link
CN (1) CN111783647B (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2013131407A1 (en) * 2012-03-08 2013-09-12 无锡中科奥森科技有限公司 Double verification face anti-counterfeiting method and device
WO2017092592A1 (en) * 2015-12-03 2017-06-08 阿里巴巴集团控股有限公司 Image fusion method, apparatus and device
CN107609506A (en) * 2017-09-08 2018-01-19 百度在线网络技术(北京)有限公司 Method and apparatus for generating image
CN108596024A (en) * 2018-03-13 2018-09-28 杭州电子科技大学 A kind of illustration generation method based on human face structure information
CN110503601A (en) * 2019-08-28 2019-11-26 上海交通大学 Face based on confrontation network generates picture replacement method and system
CN111028142A (en) * 2019-11-25 2020-04-17 泰康保险集团股份有限公司 Image processing method, apparatus and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
CHANG Tianyuan: "Research on Facial Expression Recognition Based on Domain Information Loss and Attention Dynamic Weighted Training", China Masters' Theses Full-text Database, no. 1 *
HAN Xiaocui: "Face Recognition Method Based on DCT and MMC", Computer Engineering and Design, no. 14 *

Cited By (24)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022100690A1 (en) * 2020-11-13 2022-05-19 北京字跳网络技术有限公司 Animal face style image generation method and apparatus, model training method and apparatus, and device
CN112101320A (en) * 2020-11-18 2020-12-18 北京世纪好未来教育科技有限公司 Model training method, image generation method, device, equipment and storage medium
CN112507833A (en) * 2020-11-30 2021-03-16 北京百度网讯科技有限公司 Face recognition and model training method, device, equipment and storage medium
CN112950732B (en) * 2021-02-23 2022-04-01 北京三快在线科技有限公司 Image generation method and device, storage medium and electronic equipment
CN112950732A (en) * 2021-02-23 2021-06-11 北京三快在线科技有限公司 Image generation method and device, storage medium and electronic equipment
CN113052025A (en) * 2021-03-12 2021-06-29 咪咕文化科技有限公司 Training method of image fusion model, image fusion method and electronic equipment
CN113111776A (en) * 2021-04-12 2021-07-13 京东数字科技控股股份有限公司 Method, device and equipment for generating countermeasure sample and storage medium
CN113111776B (en) * 2021-04-12 2024-04-16 京东科技控股股份有限公司 Method, device, equipment and storage medium for generating countermeasure sample
CN113361387A (en) * 2021-06-03 2021-09-07 湖南快乐阳光互动娱乐传媒有限公司 Face image fusion method and device, storage medium and electronic equipment
CN113642491A (en) * 2021-08-20 2021-11-12 北京百度网讯科技有限公司 Face fusion method, and training method and device of face fusion model
CN113902956A (en) * 2021-09-30 2022-01-07 北京百度网讯科技有限公司 Training method of fusion model, image fusion method, device, equipment and medium
WO2023050868A1 (en) * 2021-09-30 2023-04-06 北京百度网讯科技有限公司 Method and apparatus for training fusion model, image fusion method and apparatus, and device and medium
CN114170342A (en) * 2021-12-10 2022-03-11 北京字跳网络技术有限公司 Image processing method, device, equipment and storage medium
CN114170342B (en) * 2021-12-10 2024-10-25 北京字跳网络技术有限公司 Image processing method, device, equipment and storage medium
CN114445877A (en) * 2021-12-27 2022-05-06 厦门市美亚柏科信息股份有限公司 Intelligent face changing method and device and computer storage medium
CN114387656A (en) * 2022-01-14 2022-04-22 平安科技(深圳)有限公司 Face changing method, device, equipment and storage medium based on artificial intelligence
CN114529785A (en) * 2022-02-22 2022-05-24 平安科技(深圳)有限公司 Model training method, video generation method and device, equipment and medium
CN114529785B (en) * 2022-02-22 2024-06-28 平安科技(深圳)有限公司 Model training method, video generating method and device, equipment and medium
CN114821717A (en) * 2022-04-20 2022-07-29 北京百度网讯科技有限公司 Target object fusion method and device, electronic equipment and storage medium
CN114821717B (en) * 2022-04-20 2024-03-12 北京百度网讯科技有限公司 Target object fusion method and device, electronic equipment and storage medium
CN114979088B (en) * 2022-07-05 2024-04-30 颜青 Psychological consultation method and system based on intelligent auxiliary system
CN114979088A (en) * 2022-07-05 2022-08-30 颜青 Psychological consultation method and system based on intelligent auxiliary system
CN117808854A (en) * 2024-02-29 2024-04-02 腾讯科技(深圳)有限公司 Image generation method, model training method, device and electronic equipment
CN117808854B (en) * 2024-02-29 2024-05-14 腾讯科技(深圳)有限公司 Image generation method, model training method, device and electronic equipment

Also Published As

Publication number Publication date
CN111783647B (en) 2023-11-03

Similar Documents

Publication Publication Date Title
CN111783647B (en) Training method of face fusion model, face fusion method, device and equipment
CN111291885B (en) Near infrared image generation method, training method and device for generation network
CN111652828B (en) Face image generation method, device, equipment and medium
CN111598818B (en) Training method and device for face fusion model and electronic equipment
CN111832745B (en) Data augmentation method and device and electronic equipment
US20210241498A1 (en) Method and device for processing image, related electronic device and storage medium
US11568590B2 (en) Cartoonlization processing method for image, electronic device, and storage medium
US11741684B2 (en) Image processing method, electronic device and storage medium for performing skin color recognition on a face image
CN111753908A (en) Image classification method and device and style migration model training method and device
EP4024352A2 (en) Method and apparatus for face liveness detection, and storage medium
CN111709873B (en) Training method and device for image conversion model generator
CN111709875B (en) Image processing method, device, electronic equipment and storage medium
CN111931591A (en) Method and device for constructing key point learning model, electronic equipment and readable storage medium
CN110659600B (en) Object detection method, device and equipment
CN111768356A (en) Face image fusion method and device, electronic equipment and storage medium
CN111539897A (en) Method and apparatus for generating image conversion model
CN113469085B (en) Face living body detection method and device, electronic equipment and storage medium
CN112270745B (en) Image generation method, device, equipment and storage medium
CN112562045B (en) Method, apparatus, device and storage medium for generating model and generating 3D animation
CN112561053B (en) Image processing method, training method and device of pre-training model and electronic equipment
CN111815595A (en) Image semantic segmentation method, device, equipment and readable storage medium
CN111862030A (en) Face synthetic image detection method and device, electronic equipment and storage medium
CN115967823A (en) Video cover generation method and device, electronic equipment and readable medium
CN111986263B (en) Image processing method, device, electronic equipment and storage medium
CN112381927A (en) Image generation method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant