CN111783647B - Training method of face fusion model, face fusion method, device and equipment

Publication number: CN111783647B
Authority: CN (China)
Prior art keywords: face, image, information, face fusion, loss function
Legal status: Active
Application number: CN202010615462.XA
Other languages: Chinese (zh)
Other versions: CN111783647A
Inventors: 薛洁婷, 余席宇, 洪智滨, 韩钧宇
Current assignee: Beijing Baidu Netcom Science and Technology Co Ltd (also the original assignee)
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202010615462.XA
Published as CN111783647A; application granted and published as CN111783647B
Legal status: Active


Classifications

    • G06V 40/168 - Recognition of human faces: feature extraction; face representation
    • G06N 3/045 - Neural network architectures: combinations of networks
    • G06N 3/08 - Neural networks: learning methods

Abstract

The application discloses a training method of a face fusion model, a face fusion method, a device and equipment, and relates to the field of deep learning. The specific implementation scheme is as follows: acquiring a sample image, wherein the sample image comprises a user sample image and a base plate sample image; and training a generative adversarial network according to the user sample image and the base plate sample image to obtain a face fusion model, wherein the face fusion model is used for replacing the face in the base plate image with the user image. Because the face fusion model is obtained by training a generative adversarial network, deep semantic feature information of the image can be extracted and face fusion can be performed based on this deep semantic feature information, so that a good face fusion effect is obtained.

Description

Training method of face fusion model, face fusion method, device and equipment
Technical Field
The embodiment of the application relates to the field of deep learning in image processing, in particular to a training method of a face fusion model, a face fusion method, a device and equipment.
Background
Face fusion aims at seamlessly transferring a target face from a source image (user image) onto a base plate image to generate a face fusion result, and the fused image needs to maintain semantic rationality and consistency both as a whole and locally.
However, influence factors such as large differences in illumination conditions and individual skin color between the face region of the base plate image and the face region of the source image often make the face fusion effect poor.
Disclosure of Invention
The application provides a training method of a face fusion model, a face fusion method, a device and equipment for improving the face fusion effect.
According to a first aspect of the present application, there is provided a training method of a face fusion model, including: acquiring a sample image, wherein the sample image comprises a user sample image and a base plate sample image; and training a generative adversarial network according to the user sample image and the base plate sample image to obtain a face fusion model, wherein the face fusion model is used for replacing the face in the base plate image with the user image.
According to a second aspect of the present application, there is provided a face fusion method comprising: acquiring a user image and a bottom plate image; and inputting the user image and the bottom plate image into a face fusion model obtained by the training method of the first aspect to obtain a face fusion result, wherein the face fusion model is used for replacing the face in the bottom plate image with the user image.
According to a third aspect of the present application, there is provided a training apparatus for a face fusion model, comprising: a first acquisition module for acquiring a sample image, wherein the sample image comprises a user sample image and a base plate sample image; and a training module for training a generative adversarial network according to the user sample image and the base plate sample image to obtain a face fusion model, wherein the face fusion model is used for replacing the face in the base plate image with the user image.
According to a fourth aspect of the present application, there is provided a face fusion apparatus comprising: the second acquisition module is used for acquiring the user image and the bottom plate image; the input module is used for inputting the user image and the base plate image into the face fusion model obtained by the training method according to the first aspect to obtain a face fusion result, and the face fusion model is used for replacing the face in the base plate image with the user image.
According to a fifth aspect of the present application, there is provided an electronic device comprising: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the methods of the first and second aspects.
According to a sixth aspect of the present application there is provided a non-transitory computer readable storage medium storing computer instructions for causing the computer to perform the methods of the first and second aspects.
According to a seventh aspect of the present application there is provided a computer program product comprising: a computer program stored in a readable storage medium from which at least one processor of an electronic device can read, the at least one processor executing the computer program causing the electronic device to perform the method of the first or second aspect.
The technology solves the problem of poor face fusion effect.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the application or to delineate the scope of the application. Other features of the present application will become apparent from the description that follows.
Drawings
The drawings are included to provide a better understanding of the present application and are not to be construed as limiting the application. Wherein:
FIG. 1A is a schematic diagram of face fusion according to an embodiment of the present application;
FIG. 1B is a face fusion schematic diagram of another embodiment of the present application;
FIG. 2 is a schematic diagram of a hardware architecture according to an embodiment of the present application;
FIG. 3 is a flowchart of a training method of a face fusion model according to an embodiment of the present application;
FIG. 4 is a block diagram of a generative adversarial network and a face recognition network of an embodiment of the application;
FIG. 5 is a schematic diagram of a generator of an embodiment of the present application generating a first face fusion result;
FIG. 6 is a flow chart of a face fusion method of an embodiment of the present application;
fig. 7 is a schematic structural diagram of a training device for a face fusion model according to an embodiment of the present application;
fig. 8 is a schematic structural diagram of a face fusion device according to an embodiment of the present application;
fig. 9 is a block diagram of an electronic device for implementing a training method and/or a face fusion method of a face fusion model according to an embodiment of the present application.
Detailed Description
Exemplary embodiments of the present application will now be described with reference to the accompanying drawings, in which various details of the embodiments of the present application are included to facilitate understanding, and are to be considered merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the application. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
Face fusion technology is widely applied in entertainment applications that support face special effects and in movie special-effect production or promotion scenarios. It can rapidly and accurately locate face key points and perform face-level fusion between a photo uploaded by the user (user image) and a specific character image (base plate image), so that the generated picture not only contains the facial features of the user but also presents the appearance characteristics of the specific character.
Fig. 1A and fig. 1B are schematic face fusion diagrams according to embodiments of the present application. As shown in fig. 1A, a user face image and a specific character image are involved, and face fusion is realized by replacing the face area B in the specific character image with the face area A from the user face image. As shown in fig. 1B, a user face image A and a user face image B are involved, and face fusion is realized by exchanging the face areas of user face image A and user face image B.
Of course, the face fusion process shown in fig. 1A and 1B is only illustrative, and is not limited to the face fusion of the present application, and any face fusion technology is within the scope of the present application.
In the face fusion process, the quality of the fusion effect is important to the application and popularization of the face fusion product. At present, there are many face fusion technologies, for example, image fusion is achieved by directly operating on original image pixels, or sparse coding is performed on each image to obtain sparse coefficients of the image, and then fusion processing is performed on the sparse coefficients to obtain a fused image, or multi-scale transformation is performed on the image, and then image fusion is performed. Although a certain fusion effect can be obtained by the fusion technology, the first method directly calculates pixels, which results in unsatisfactory fusion effect, the second method needs to perform block calculation on images, which results in losing a large amount of detail information, and the multi-scale transformation needs to rely on multi-scale transformation tools and fusion rules, which result in losing a large amount of image information if the tools and fusion rules are unreasonable.
In summary, the existing face fusion technology is deficient in face edges, face gestures and face naturalness due to the loss of image information, so that problems of false edges, abnormal colors and the like often occur in the fused result, and the face fusion effect is poor. For example, a contour line occurs where it was originally not, whereas a contour line does not occur where it should occur; and abnormal colors appear in the fused face area, so that the face fusion result is unnatural.
In the prior art, the image is processed at a shallow feature level, so the understanding of the image is shallow and the face fusion effect is therefore poor. The present application performs deep semantic understanding of the image through a deep learning model, extracts richer image information, and thus better fuses the user face image with the specific character image during face fusion.
In the process of applying the deep learning model to the face fusion, the application needs to train a deep learning model firstly, and a hardware architecture diagram used in the training process is shown in fig. 2, and the hardware architecture comprises: an image acquisition device 21 and an image processing device 22; the image capturing apparatus 21 includes a camera or other apparatus having a photographing function, and the image processing apparatus 22 includes a desktop computer, a notebook computer, an Ipad, a smart phone, a server or other apparatus having an image processing function.
During training of the deep learning model, the image acquisition device 21 is used to acquire sample images, wherein the sample images are in the form of image pairs, i.e. each sample image comprises an image pair of a user sample image and a base plate sample image. The image processing device 22 is configured to train the neural network according to the sample images to obtain a deep learning model, where the deep learning model is used to replace the face in the base plate image with the user image.
In some scenarios where a deep learning model is applied for face fusion, the trained deep learning model may be stored in the image processing device 22. In a specific application process, the image acquisition device 21 acquires a user image and uploads it to the image processing device 22; the image processing device 22 receives the user image uploaded by the user and the base plate image selected by the user, and then replaces the face in the base plate image with the user image through the deep learning model, thereby realizing the face fusion. The base plate image may be stored in advance in the image processing device 22 or in another device other than the image processing device 22, for example, in a server.
It should be noted that, the image capturing device 21 and the image processing device 22 described above may be two independent devices, or may be integrated on the same electronic device, and if the image capturing device 21 and the image processing device 22 are integrated on the same electronic device, the electronic device is a device having an image capturing function and an image processing function, for example, a smart phone, an Ipad, or the like.
The application provides a training method of a face fusion model, a face fusion method, a device and equipment, which are applied to the field of deep learning in the field of image processing so as to achieve the aim of improving the face fusion effect.
The following describes the technical scheme of the present application and how the technical scheme of the present application solves the above technical problems in detail with specific embodiments. The following embodiments may be combined with each other, and the same or similar concepts or processes may not be described in detail in some embodiments. Embodiments of the present application will be described below with reference to the accompanying drawings.
Fig. 3 is a flowchart of a training method of a face fusion model provided by an embodiment of the present application. Aiming at the technical problems in the prior art, the embodiment of the application provides a training method of a face fusion model, as shown in fig. 3, which comprises the following specific steps:
step 301, acquiring a sample image.
The sample image comprises a user sample image and a base plate sample image. The user sample image is a photograph uploaded by the user and comprises a face area to be replaced, and the base plate sample image is an image of a specific character or style, for example an image with a specific expression or specific ornaments, such as a cartoon character image, a military uniform photograph, a traditional Chinese costume photograph, or an image with a specific headwear.
The sample image in this embodiment may be an image in the public training data set, or may be an image acquired by the image acquisition device, which is not specifically limited in this embodiment.
The sample images in this embodiment are image pairs including a user sample image and a base plate sample image, and the plurality of sample images used in the training process may be a plurality of image pairs formed by one user sample image and a plurality of base plate sample images, may be a plurality of image pairs formed by a plurality of user sample images and one base plate sample image, or may be a plurality of image pairs formed by a plurality of user sample images and a plurality of base plate sample images. In the actual training process, the image pair can be singly adopted in one form, or the combination of any two or three of the three forms. The present embodiment is not particularly limited thereto.
It should be noted that, regardless of which image pair format is used, each image pair consists of one user sample image and one base plate sample image.
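For illustration, the following is a minimal sketch of how such (user sample image, base plate sample image) pairs could be assembled into a training dataset; the PyTorch Dataset class, directory layout and all names used here are assumptions for this sketch, not part of the patent.

```python
# A minimal sketch of building (user sample, base plate sample) training pairs.
import os
from itertools import product
from PIL import Image
from torch.utils.data import Dataset

class FaceFusionPairDataset(Dataset):
    """Yields (user_sample, base_plate_sample) image pairs."""
    def __init__(self, user_dir, base_dir, transform=None):
        user_paths = sorted(os.path.join(user_dir, f) for f in os.listdir(user_dir))
        base_paths = sorted(os.path.join(base_dir, f) for f in os.listdir(base_dir))
        # Every user sample can be paired with every base plate sample,
        # covering the many-to-many pairing described above.
        self.pairs = list(product(user_paths, base_paths))
        self.transform = transform

    def __len__(self):
        return len(self.pairs)

    def __getitem__(self, idx):
        user_path, base_path = self.pairs[idx]
        user_img = Image.open(user_path).convert("RGB")
        base_img = Image.open(base_path).convert("RGB")
        if self.transform is not None:
            user_img, base_img = self.transform(user_img), self.transform(base_img)
        return user_img, base_img
```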
Step 302, training a generative adversarial network according to the user sample image and the base plate sample image to obtain a face fusion model.
The face fusion model obtained by the step can replace the face in the bottom plate sample image with the user image.
The structure and principle of the generative adversarial network (Generative Adversarial Network, GAN) can be found in the prior art and are not repeated here.
According to the embodiment of the application, a sample image comprising a user sample image and a base plate sample image is acquired, and a generative adversarial network is trained according to the user sample image and the base plate sample image to obtain the face fusion model, which can replace the face in the base plate image with the user image. Because the face fusion model is obtained by training a generative adversarial network, deep semantic feature information of the image can be extracted and face fusion can be performed based on this deep semantic feature information, so that a good face fusion effect is obtained.
As shown in fig. 4, the generative adversarial network includes a feature extraction network 41, a generator 42, and a discriminator 43, and the image processing apparatus further includes a face recognition network 44. Training the generative adversarial network according to the user sample image and the base plate sample image to obtain the face fusion model includes the following steps:
And a1, inputting the bottom plate sample image into a feature extraction network to obtain first feature information.
The feature extraction network may be a neural network, such as a convolutional neural network. The present embodiment may employ an existing network architecture, such as a VGG (Visual Geometry Group) network, a ResNet network, or another general image feature extraction network. The feature extraction network performs feature extraction on the base plate sample image to obtain the first feature information. The first feature information in this embodiment is feature map information representing the background region (the region other than the face region) of the base plate sample image; compared with the feature extraction used in the conventional face fusion process, it captures richer deep image semantic information.
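As an illustration of step a1, the sketch below builds a feature extraction network from a torchvision VGG backbone and applies it to a base plate sample image; the choice of VGG-19 and of the intermediate layer to keep are assumptions, since the patent only names VGG, ResNet or another general feature extraction network as options.

```python
# A minimal sketch of the feature extraction network (step a1).
import torch
import torchvision.models as models

class FeatureExtractor(torch.nn.Module):
    """Extracts feature-map information (first feature information) from a base plate image."""
    def __init__(self):
        super().__init__()
        backbone = models.vgg19()                # randomly initialized here; pretrained weights optional
        self.features = backbone.features[:21]  # keep an intermediate conv stage (assumption)

    def forward(self, base_plate_img):
        return self.features(base_plate_img)

# illustrative usage:
# first_feature_info = FeatureExtractor()(torch.randn(1, 3, 256, 256))
```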
And a2, acquiring face information of a user sample image through a face recognition network to obtain first face information.
The face recognition network of the step adopts the existing face recognition technology to carry out face recognition, and face information is obtained. The face recognition network may be an InsightFace model, among other things.
Optionally, the face recognition network may be stored in the image processing device, and then the user sample image is input into the image processing device, and the image processing device performs face recognition on the user sample image through the face recognition network to obtain the first face information.
The step is to perform face recognition on the user sample image, and extract face characteristic information, such as five sense organs characteristic information, which can uniquely identify the user identity.
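As an illustration of step a2, the sketch below extracts the first face information Z_id from a user sample image with a fixed face recognition network; face_recognizer stands for any pretrained recognizer (the patent mentions an InsightFace-style model), and the 512-dimensional, unit-normalized embedding is an assumption.

```python
# A minimal sketch of extracting the first face information (step a2).
import torch
import torch.nn.functional as F

def extract_identity(face_recognizer: torch.nn.Module, user_img: torch.Tensor) -> torch.Tensor:
    """Returns the first face information Z_id for a batch of user sample images."""
    face_recognizer.eval()
    with torch.no_grad():                  # the recognizer stays fixed during GAN training
        z_id = face_recognizer(user_img)   # e.g. an (N, 512) identity embedding
    return F.normalize(z_id, dim=1)        # unit-norm embedding (a common convention)
```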
And a3, inputting the first characteristic information and the first face information into a generator to obtain a first face fusion result.
The generator can generate a first face fusion result according to the first feature information and the first face information, wherein the first face fusion result is a result image obtained by carrying out face fusion on a user sample image and a bottom plate sample image by the generator, namely an image after face change.
And a4, respectively inputting the first face fusion result and the bottom plate sample image into a discriminator to obtain a first discrimination result aiming at the first face fusion result and a second discrimination result aiming at the bottom plate sample image.
The first discrimination result for the first face fusion result indicates whether the first face fusion result is generated by the generator or is real data; likewise, the second discrimination result for the base plate sample image indicates whether the base plate sample image is generated by the generator or is real data.
The base plate sample image in this embodiment serves as the label information.
Optionally, the discriminator outputs the probability D_fake that the first face fusion result belongs to the real data and the probability D_real that the base plate sample image belongs to the real data.
And a step a5, determining a first loss function according to the first discrimination result, the second discrimination result and the first face fusion result.
The step a5 can be expressed by the following formula (1):
L_adv = log(D_real) + log(1 - D_fake)    (1)
In formula (1), L_adv represents the first loss function, D_real is the probability output by the discriminator that the base plate sample image is real data, and D_fake is the probability output by the discriminator that the first face fusion result is real data. By minimizing the first loss function, the network parameters of the generator can be adjusted; by maximizing the first loss function, the network parameters of the discriminator can be adjusted. Through this game process, the generator and the discriminator finally reach a balanced state, i.e. the discriminator can no longer tell whether an image produced by the generator is a generated image or real data.
And a6, adjusting the network parameters of the generator according to the first loss function.
The network parameters of the generator are adjusted by driving the first loss function of formula (1) above toward its minimum.
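A minimal PyTorch sketch of one training iteration for the adversarial objective in formula (1) follows; it assumes the generator G, discriminator D, feature extraction network and face recognition network are defined elsewhere, that D outputs a probability, and that opt_G covers the generator (and optionally the feature extraction network). The variable and function names are illustrative.

```python
# A minimal sketch of the adversarial game in formula (1).
import torch

def adversarial_step(G, D, feat_net, face_net, base_img, user_img, opt_G, opt_D):
    feats = feat_net(base_img)   # first feature information
    z_id = face_net(user_img)    # first face information
    fused = G(feats, z_id)       # first face fusion result
    eps = 1e-8

    # Discriminator update: maximize log(D_real) + log(1 - D_fake), i.e. minimize its negative.
    d_real = D(base_img)
    d_fake = D(fused.detach())
    loss_D = -(torch.log(d_real + eps) + torch.log(1.0 - d_fake + eps)).mean()
    opt_D.zero_grad(); loss_D.backward(); opt_D.step()

    # Generator update: minimize log(1 - D_fake), i.e. try to fool the discriminator.
    d_fake = D(fused)
    loss_G = torch.log(1.0 - d_fake + eps).mean()
    opt_G.zero_grad(); loss_G.backward(); opt_G.step()
    return loss_D.item(), loss_G.item()
```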
On the basis of the above embodiment, the method of this embodiment may further include the following method steps:
And b1, inputting the first feature information and the first face information into a generator to obtain attention information.
Wherein the attention information comprises an attention map. The generator of the present embodiment introduces an attention mechanism; the attention map aims at making the generator pay more attention to the background area of the base plate sample image, so that the generated attention map is smoother, that is, the background area is kept continuous and noise points are reduced as much as possible.
And b2, determining a second loss function according to the attention information.
Step b2 may be implemented by the following formula (2):
L_tv = sum_{i=1..W-1} sum_{j=1..H-1} [ (G_mask(i+1, j) - G_mask(i, j))^2 + (G_mask(i, j+1) - G_mask(i, j))^2 ]    (2)
In formula (2), L_tv represents the second loss function; W and H represent the width and height of the attention map, respectively; i and j index the pixel points of the attention map in the width direction and the height direction, respectively; G_mask represents the attention information (the attention map); G_mask(i+1, j), G_mask(i, j) and G_mask(i, j+1) denote the values of the attention map at the (i+1)-th pixel point in the width direction and the j-th pixel point in the height direction, at the i-th and j-th pixel points, and at the i-th pixel point in the width direction and the (j+1)-th pixel point in the height direction, respectively.
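A minimal sketch of the smoothness term in formula (2) follows; g_mask is the attention map with shape (N, 1, H, W), and using squared neighbour differences is an assumption consistent with the reconstruction of formula (2) above.

```python
# A minimal sketch of the attention smoothness loss in formula (2).
import torch

def attention_tv_loss(g_mask: torch.Tensor) -> torch.Tensor:
    dw = g_mask[:, :, :, 1:] - g_mask[:, :, :, :-1]   # (i+1, j) - (i, j) along the width
    dh = g_mask[:, :, 1:, :] - g_mask[:, :, :-1, :]   # (i, j+1) - (i, j) along the height
    return (dw.pow(2).sum() + dh.pow(2).sum()) / g_mask.size(0)
```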
And b3, adjusting network parameters of the feature extraction network and the generator according to the weighted sum of the first loss function and the second loss function.
Wherein adjusting network parameters of the feature extraction network and the generator according to a weighted sum of the first and second loss functions comprises: and adjusting network parameters of the feature extraction network and the generator according to the accumulated values of the first loss function and the second loss function.
Optionally, in this embodiment, the network parameters of the feature extraction network and the generator may be adjusted separately according to the first loss function and the second loss function, for example, the network parameters of the feature extraction network and the generator may be adjusted according to the first loss function, and then the network parameters of the feature extraction network and the generator may be adjusted according to the second loss function; or adjusting the network parameters of the feature extraction network and the generator according to the second loss function, and then adjusting the network parameters of the feature extraction network and the generator according to the first loss function. The above two adjustment manners of the network parameters are not particularly limited, and those skilled in the art can select the adjustment manner of the network parameters according to actual requirements in the actual application process.
In this embodiment, attention information is obtained by inputting the first feature information and the first face information into the generator; a second loss function is determined based on the attention information; and the network parameters of the feature extraction network and the generator are adjusted according to a weighted sum of the first loss function and the second loss function. Because the attention mechanism is introduced into the generator, the generator pays more attention to the background area of the base plate sample image and generates a smoother attention map, so that the background area is kept continuous and noise points are reduced as much as possible, and the obtained face fusion result is more vivid and reasonable in visual effect.
On the basis of the above embodiment, the embodiment of the present application may further include the following method steps:
and c1, carrying out face fusion according to the user sample image and the bottom plate sample image to obtain a second face fusion result.
In this step, face fusion is performed on the user sample image and the base plate sample image by adopting an existing face fusion method, wherein the fusion index of the first face fusion result is larger than that of the second face fusion result, and the fusion index is positively correlated with the quality of the face fusion result. It can be understood that, compared with the first face fusion result, the second face fusion result has a coarser fusion effect; for example, the second face fusion result may have problems such as false edges and abnormal colors, so its naturalness is poor, whereas the first face fusion result does not have such problems, so its fusion result is more natural. In other words, the first face fusion result is better than the second face fusion result in terms of false edges, abnormal colors, naturalness and the like.
And c2, determining a third loss function according to the first face fusion result and the second face fusion result.
Wherein step c2 can be expressed as the following formula (3):
L_rec = || G_fake - I_gt ||_1    (3)
In formula (3), L_rec represents the third loss function, G_fake represents the first face fusion result, and I_gt represents the second face fusion result, which serves as the reference fusion image. It can be understood that the third loss function is determined using the pixel loss between the first face fusion result and the second face fusion result.
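A minimal sketch of the pixel loss in formula (3) follows; the L1 norm is an assumption, and i_gt stands for the second (coarse) face fusion result used as the reference.

```python
# A minimal sketch of the pixel reconstruction loss in formula (3).
import torch.nn.functional as F

def reconstruction_loss(g_fake, i_gt):
    """Pixel loss between the first face fusion result and the second (coarse) fusion result."""
    return F.l1_loss(g_fake, i_gt)
```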
And c3, adjusting network parameters of the feature extraction network and the generator according to the weighted sum of the first loss function, the second loss function and the third loss function.
Wherein the third loss function is intended to enable the generator to have face fusion capabilities.
Wherein adjusting network parameters of the feature extraction network and the generator according to a weighted sum of the first, second, and third loss functions comprises: and adjusting network parameters of the feature extraction network and the generator according to the accumulated values of the first loss function, the second loss function and the third loss function.
Optionally, the present embodiment may further adjust network parameters of the feature extraction network and the generator separately according to the first loss function, the second loss function, and the third loss function, for example, sequentially adjust network parameters of the feature extraction network and the generator according to an order of the first loss function, the second loss function, and the third loss function. Of course, the network parameters of the feature extraction network and the generator may be sequentially adjusted according to other orders of the first loss function, the second loss function, and the third loss function, which is not particularly limited in this embodiment.
According to the embodiment, the second face fusion result is obtained by carrying out face fusion according to the user sample image and the bottom plate sample image, wherein the fusion index of the first face fusion result is larger than that of the second face fusion result, and the fusion index is positively correlated with the quality of the face fusion result, so that the skin color and the background information in the face fused image are more natural and consistent.
On the basis of the above embodiment, the embodiment of the present application may further include the following method steps:
and d1, acquiring face information of the first face fusion result through a face recognition network to obtain second face information.
The face recognition network of the step adopts the existing face recognition technology to carry out face recognition, and face information is obtained. The face recognition network may be an InsightFace model, among other things.
Optionally, the face recognition network may be stored in the image processing device, and then the first face fusion result is input into the image processing device, and the image processing device performs face recognition on the first face fusion result through the face recognition network to obtain the second face information.
The step is to perform face recognition on the first face fusion result, and extract face feature information, such as facial feature information, which can uniquely identify the user identity in the first face fusion result.
And d2, determining a fourth loss function according to the second face information and the first face information.
This step d2 can be expressed by the following formula (4):
L_id = 1 - cos(Z_fake, Z_id)    (4)
In formula (4), Z_id represents the first face information, Z_fake represents the second face information, L_id represents the fourth loss function, and cos(·, ·) denotes the cosine similarity between the two face feature vectors.
Wherein the fourth loss function is intended to enable the feature extraction network to ignore the face information of the base plate sample image and to focus on the face information of the user sample image.
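A minimal sketch of the identity term in formula (4) follows; using cosine similarity between the two face embeddings is an assumption consistent with the reconstruction of formula (4) above.

```python
# A minimal sketch of the identity-preserving loss in formula (4).
import torch.nn.functional as F

def identity_loss(z_fake, z_id):
    """Penalizes identity drift between the fused image and the user sample image."""
    return (1.0 - F.cosine_similarity(z_fake, z_id, dim=1)).mean()
```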
And d3, adjusting network parameters of the feature extraction network and the generator according to the weighted sum of the first loss function, the second loss function, the third loss function and the fourth loss function.
Wherein adjusting network parameters of the feature extraction network and the generator according to a weighted sum of the first, second, third, and fourth loss functions comprises: and adjusting network parameters of the feature extraction network and the generator according to the accumulated values of the first loss function, the second loss function, the third loss function and the fourth loss function.
Optionally, the present embodiment may further adjust network parameters of the feature extraction network and the generator separately according to the first loss function, the second loss function, the third loss function, and the fourth loss function, for example, sequentially adjust network parameters of the feature extraction network and the generator according to an order of the first loss function, the second loss function, the third loss function, and the fourth loss function. Of course, the network parameters of the feature extraction network and the generator may be sequentially adjusted according to other orders of the first loss function, the second loss function, the third loss function, and the fourth loss function, which is not specifically limited in this embodiment.
On the basis of the above embodiment, the embodiment of the present application may further include the following method steps:
and e1, inputting the first face fusion result into a feature extraction network to obtain second feature information.
The feature extraction network may be a neural network, such as a convolutional neural network. The present embodiment may employ an existing network architecture, such as a VGG (Visual Geometry Group Network) network, a Resnet network, or other general image feature extraction network. And the feature extraction network performs feature extraction on the first face fusion result to obtain second feature information. The second feature information in the present embodiment is feature map information indicating a background region (other region than the face region) of the first face fusion result.
And e2, determining a fifth loss function according to the first characteristic information and the second characteristic information.
This step e2 can be expressed by the following formula (5):
L_feat = || F_fakes - F_atts ||_2    (5)
In formula (5), F_atts represents the first feature information, F_fakes represents the second feature information, and L_feat represents the fifth loss function.
And e3, adjusting network parameters of the feature extraction network and the generator according to the weighted sum of the first loss function, the second loss function, the third loss function, the fourth loss function and the fifth loss function.
Wherein adjusting network parameters of the feature extraction network and the generator according to a weighted sum of the first, second, third, fourth, and fifth loss functions comprises: and adjusting network parameters of the feature extraction network and the generator according to the accumulated values of the first loss function, the second loss function, the third loss function, the fourth loss function and the fifth loss function.
Optionally, the present embodiment may further adjust the network parameters of the feature extraction network and the generator separately according to the first loss function, the second loss function, the third loss function, the fourth loss function, and the fifth loss function, for example, sequentially adjust the network parameters of the feature extraction network and the generator according to the order of the first loss function, the second loss function, the third loss function, the fourth loss function, and the fifth loss function. Of course, the network parameters of the feature extraction network and the generator may be sequentially adjusted according to other orders of the first loss function, the second loss function, the third loss function, the fourth loss function, and the fifth loss function, which is not specifically limited in this embodiment.
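A minimal sketch of combining the five losses into the weighted sum used to adjust the feature extraction network and the generator follows; the weights lambda_* are hypothetical hyper-parameters (the patent also allows the plain accumulated value, i.e. all weights equal to 1, or adjusting by each loss separately).

```python
# A minimal sketch of the weighted sum of the first to fifth loss functions.
def total_generator_loss(l_adv, l_tv, l_rec, l_id, l_feat,
                         lambda_adv=1.0, lambda_tv=1.0, lambda_rec=1.0,
                         lambda_id=1.0, lambda_feat=1.0):
    """Weighted sum used to update the feature extraction network and the generator."""
    return (lambda_adv * l_adv + lambda_tv * l_tv + lambda_rec * l_rec
            + lambda_id * l_id + lambda_feat * l_feat)
```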
In another embodiment of the present application, inputting the first feature information and the first face information into the generator to obtain a first face fusion result includes: inputting the first feature information and the first face information into the generator to obtain an initial face fusion result and attention information; and inputting the initial face fusion result and the attention information into the generator to obtain the first face fusion result.
As shown in fig. 5, the output of the generator is divided into two branches: one branch is an initial face fusion result and the other branch is an attention map. The generator then performs face fusion again according to the attention map and the initial face fusion result to obtain the first face fusion result. This process can be expressed as the following formula (6):
G_fake = G_fake1 * G_mask + (1 - G_mask) * I_att    (6)
In formula (6), G_fake1 represents the initial face fusion result, G_mask represents the attention map, and I_att represents the base plate sample image. The first term of the formula can be understood as the face information obtained after the initial fusion, and the second term as the background information of the base plate sample image.
According to the embodiment, the initial face fusion result and the attention information are obtained by inputting the first feature information and the first face information into the generator, and the initial face fusion result and the attention information are then fed back into the generator for a second round of face fusion to obtain the first face fusion result. By adding an attention mechanism into the generator, the face fusion result obtained by the generator can retain the background information of the base plate sample image to a greater extent while performing face fusion, thereby keeping the background area continuous and reducing noise points as much as possible, so that the face fusion result generated by the generator is more vivid and natural.
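A minimal sketch of the blending step in formula (6) follows; tensor shapes of (N, 3, H, W) for the images and (N, 1, H, W) for the attention map are assumptions.

```python
# A minimal sketch of formula (6): attention-guided blending.
def blend_with_attention(g_fake1, g_mask, i_att):
    """Keeps the fused face where attention is high and the base plate background elsewhere."""
    return g_fake1 * g_mask + (1.0 - g_mask) * i_att
```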
Optionally, training the generative adversarial network according to the user sample image and the base plate sample image to obtain the face fusion model includes:
and f1, carrying out affine transformation on the user sample image and the bottom plate sample image respectively.
This step is understood to be a preprocessing of the user sample image and the floor sample image. Affine transformation is respectively carried out on the user sample image and the bottom plate sample image, and the affine transformation method comprises the following steps: extracting key point information of a user sample image and a base plate sample image to obtain user key point information and base plate key point information; and aligning the user sample image and the bottom plate sample image to the standard face through affine transformation respectively according to the user key point information and the bottom plate key point information.
For example, a standard face may be set first, the standard face includes position information of key points of the standard face, then, similarity transformation is performed according to the user key point information and the standard face key point information, and similarity transformation is performed according to the base plate key point information and the standard face key point information, the transformation process includes rotation, translation and scaling, so that a homogeneous transformation matrix M may be obtained, then, M is used as a parameter, and the user sample image and the base plate sample image are aligned to the standard face respectively, so as to obtain an aligned user sample image and base plate sample image.
The affine transformation aims at adjusting the face angle, for example, adjusting the face to a forward angle.
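A minimal OpenCV sketch of step f1 follows: a similarity transform (rotation, translation, scaling) is estimated from detected face key points to the standard-face key points and the image is warped accordingly; the landmark arrays and the output size are placeholders supplied by the caller.

```python
# A minimal sketch of aligning an image to the standard face (step f1).
import cv2
import numpy as np

def align_to_standard_face(img, landmarks, std_landmarks, size=(256, 256)):
    """Warps an image so its face key points match the standard-face key points."""
    # estimateAffinePartial2D fits rotation + uniform scale + translation (a similarity transform).
    M, _ = cv2.estimateAffinePartial2D(np.asarray(landmarks, dtype=np.float32),
                                       np.asarray(std_landmarks, dtype=np.float32))
    return cv2.warpAffine(img, M, size)
```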
And f2, training the generative adversarial network according to the affine-transformed user sample image and base plate sample image to obtain the face fusion model.
The implementation process of this step is similar to that of the foregoing embodiment, and specific reference may be made to the description of the foregoing embodiment, which is not repeated here.
According to the embodiment of the application, the application further provides a face fusion method.
As shown in fig. 6, a flowchart of a face fusion method according to an embodiment of the present application is shown. The face fusion method of the embodiment comprises the following method steps:
step 601, acquiring a user image and a bottom plate image.
Step 602, inputting the user image and the bottom plate image into a face fusion model to obtain a face fusion result.
The face fusion model is obtained according to a training method of the face fusion model and is used for replacing the face in the bottom plate image with the user image.
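A minimal sketch of the inference flow in steps 601-602 follows, assuming the trained feature extraction network, face recognition network and generator have been loaded; the function and argument names are illustrative.

```python
# A minimal sketch of applying the trained face fusion model (steps 601-602).
import torch

@torch.no_grad()
def fuse_faces(feat_net, face_net, generator, user_img, base_img):
    """Replaces the face in the base plate image with the face from the user image."""
    feats = feat_net(base_img)     # feature information of the base plate image
    z_id = face_net(user_img)      # face information of the user image
    return generator(feats, z_id)  # face fusion result
```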
The execution subject of the present embodiment may be an electronic device integrating an image capturing function and an image processing function, such as a smart phone, an Ipad, or the like. Taking a smart phone as an example, a user may select an image in an album of the smart phone as a user image, or take an image as a user image using a camera of the smart phone. The user can also select the bottom plate image on the face fusion application program of the smart phone, wherein the face fusion application program can provide various specific image images, and after the user image and the bottom plate image are input by the user, the smart phone can perform face fusion on the user image and the bottom plate image.
According to the embodiment of the application, the user image and the bottom plate image are acquired and input into the face fusion model, so that the face fusion result is obtained. Because the human face fusion model is obtained based on the generated type countermeasure network training, deep semantic feature information of the image can be extracted, and human face fusion is performed based on the deep semantic feature information, so that a good human face fusion effect is obtained.
According to the embodiment of the application, the application further provides a block diagram of the training device of the face fusion model.
Fig. 7 is a block diagram of a training apparatus for a face fusion model according to an embodiment of the present application. The training device 70 of the face fusion model of the present embodiment includes: a first acquisition module 71 and a training module 72. The first acquisition module 71 is configured to acquire a sample image including a user sample image and a base plate sample image; the training module 72 is configured to train a generative adversarial network according to the user sample image and the base plate sample image to obtain a face fusion model, where the face fusion model is used to replace the face in the base plate image with the user image.
Optionally, the generative adversarial network includes a feature extraction network, a generator, and a discriminator; the training module 72 includes: an input unit 721 for inputting the base plate sample image into the feature extraction network to obtain first feature information; an acquiring unit 722, configured to acquire face information of the user sample image through a face recognition network, to obtain first face information; the input unit 721 is further configured to input the first feature information and the first face information into the generator, to obtain a first face fusion result, and to respectively input the first face fusion result and the base plate sample image into the discriminator to obtain a first discrimination result for the first face fusion result and a second discrimination result for the base plate sample image; a determining unit 723, configured to determine a first loss function according to the first discrimination result, the second discrimination result, and the first face fusion result; and an adjusting unit 724 for adjusting the network parameters of the generator according to the first loss function.
Optionally, the input unit 721 is further configured to input the first feature information and the first face information into the generator, so as to obtain attention information; the determining unit 723 is further configured to determine a second loss function based on the attention information; and the adjusting unit 724 is further configured to adjust network parameters of the feature extraction network and the generator according to a weighted sum of the first loss function and the second loss function. Optionally, the training module 72 further includes: a face fusion unit 725, configured to perform face fusion according to the user sample image and the base plate sample image to obtain a second face fusion result, where a fusion index of the first face fusion result is greater than a fusion index of the second face fusion result, and the fusion index is positively correlated with the quality of the face fusion result; the determining unit 723 is further configured to determine a third loss function according to the first face fusion result and the second face fusion result; and the adjusting unit 724 is further configured to adjust network parameters of the feature extraction network and the generator according to a weighted sum of the first loss function, the second loss function, and the third loss function.
Optionally, the acquiring unit 722 is further configured to acquire face information of the first face fusion result through a face recognition network, so as to obtain second face information; a determining unit 723, further configured to determine a fourth loss function according to the second face information and the first face information; the adjusting unit 724 is further configured to adjust network parameters of the feature extraction network and the generator according to a weighted sum of the first loss function, the second loss function, the third loss function, and the fourth loss function.
Optionally, the input unit 721 is further configured to input the first face fusion result into the feature extraction network to obtain second feature information; a determining unit 723, further configured to determine a fifth loss function according to the first feature information and the second feature information; the adjusting unit 724 is further configured to adjust network parameters of the feature extraction network and the generator according to a weighted sum of the first loss function, the second loss function, the third loss function, the fourth loss function, and the fifth loss function.
Optionally, when the input unit 721 inputs the first feature information and the first face information into the generator to obtain the first face fusion result, the input unit is specifically configured to: input the first feature information and the first face information into the generator to obtain an initial face fusion result and attention information; and input the initial face fusion result and the attention information into the generator to obtain the first face fusion result.
Optionally, the apparatus 70 further includes: a transformation module 73 for performing affine transformation on the user sample image and the base plate sample image, respectively; the training module 72 is then configured to train the generative adversarial network according to the affine-transformed user sample image and base plate sample image, so as to obtain the face fusion model.
According to the embodiment of the application, a sample image comprising a user sample image and a base plate sample image is acquired, and a generative adversarial network is trained according to the user sample image and the base plate sample image to obtain the face fusion model, which can replace the face in the base plate image with the user image. Because the face fusion model is obtained by training a generative adversarial network, deep semantic feature information of the image can be extracted and face fusion can be performed based on this deep semantic feature information, so that a good face fusion effect is obtained.
The application further provides a block diagram of the face fusion device according to the embodiment of the application.
As shown in fig. 8, a block diagram of a face fusion apparatus according to an embodiment of the present application is shown. The face fusion device of the embodiment includes: a second acquisition module 81 and an input module 82; wherein, the second obtaining module 81 is configured to obtain a user image and a bottom plate image; the input module 82 is configured to input the user image and the base plate image into the face fusion model obtained by using the training method of the foregoing embodiment, to obtain a face fusion result, where the face fusion model is used to replace a face in the base plate image with the user image.
The face fusion model comprises a feature extraction network and a generator; the input module 82 includes: an obtaining unit 821, configured to obtain face information of a user sample image through a face recognition network; the input unit 822 is configured to input the bottom plate image into a feature extraction network obtained by pre-training to obtain feature information, and input the face information and the feature information into a generator obtained by pre-training to obtain a face fusion result.
According to an embodiment of the present application, the present application also provides an electronic device and a readable storage medium.
According to an embodiment of the present application, there is also provided a computer program product comprising: computer program stored in a readable storage medium, from which the computer program can be read by at least one processor of an electronic device, the at least one processor executing the computer program causing the electronic device to perform the solutions of any of the method embodiments described above.
Fig. 9 is a block diagram of an electronic device according to a training method of a face fusion model and/or a face fusion method according to an embodiment of the present application. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the applications described and/or claimed herein.
As shown in fig. 9, the electronic device includes: one or more processors 901, memory 902, and interfaces for connecting the components, including high-speed interfaces and low-speed interfaces. The various components are interconnected using different buses and may be mounted on a common motherboard or in other manners as desired. The processor may process instructions executing within the electronic device, including instructions stored in or on memory to display graphical information of the GUI on an external input/output device, such as a display device coupled to the interface. In other embodiments, multiple processors and/or multiple buses may be used, if desired, along with multiple memories. Also, multiple electronic devices may be connected, each providing a portion of the necessary operations (e.g., as a server array, a set of blade servers, or a multiprocessor system). In fig. 9, a processor 901 is taken as an example.
Memory 902 is a non-transitory computer readable storage medium provided by the present application. The memory stores instructions executable by the at least one processor to cause the at least one processor to execute the training method and/or the face fusion method of the face fusion model provided by the application. The non-transitory computer readable storage medium of the present application stores computer instructions for causing a computer to execute the training method and/or the face fusion method of the face fusion model provided by the present application.
The memory 902 is used as a non-transitory computer readable storage medium, and may be used to store non-transitory software programs, non-transitory computer executable programs, and modules, such as program instructions/modules corresponding to the training method of the face fusion model and/or the face fusion method in an embodiment of the present application (e.g., the first acquisition module 71, the training module 72, and the transformation module 73 shown in fig. 7, and the second acquisition module 81 and the input module 82 shown in fig. 8). The processor 901 executes various functional applications of the server and data processing, that is, implements the training method and/or the face fusion method of the face fusion model in the above-described method embodiment, by running non-transitory software programs, instructions, and modules stored in the memory 902.
The memory 902 may include a storage program area and a storage data area, wherein the storage program area may store an operating system, at least one application program required for a function; the storage data area may store data created according to a training method of the face fusion model and/or use of an electronic device of the face fusion method, and the like. In addition, the memory 902 may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid-state storage device. In some embodiments, the memory 902 optionally includes memory remotely located relative to the processor 901, which may be connected to the training method of the face fusion model and/or the electronics of the face fusion method via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The training method of the face fusion model and/or the electronic device of the face fusion method may further include: an input device 903 and an output device 904. The processor 901, memory 902, input devices 903, and output devices 904 may be connected by a bus or other means, for example in fig. 9.
The input device 903 may receive input numeric or character information and generate key signal inputs related to user settings and function control of the electronic device of the training method of the face fusion model and/or the face fusion method, such as a touch screen, a keypad, a mouse, a track pad, a touch pad, a pointing stick, one or more mouse buttons, a track ball, a joystick, etc. The output device 904 may include a display device, an auxiliary lighting device (e.g., an LED), a haptic feedback device (e.g., a vibration motor), and the like. The display device may include, but is not limited to, a Liquid Crystal Display (LCD), a Light Emitting Diode (LED) display, and a plasma display. In some implementations, the display device may be a touch screen.
Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, application-specific integrated circuits (ASICs), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor that can receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
These computer programs (also known as programs, software, software applications, or code) include machine instructions for a programmable processor, and can be implemented in a high-level procedural and/or object-oriented programming language, and/or in assembly/machine language. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (e.g., magnetic disks, optical disks, memories, programmable logic devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as machine-readable signals. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include local area networks (LANs), wide area networks (WANs), and the internet.
The computer system may include a client and a server. The client and the server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, also called a cloud computing server or a cloud host, which is a host product in a cloud computing service system and overcomes the defects of high management difficulty and weak service expansibility of traditional physical hosts and Virtual Private Server (VPS) services.
It should be appreciated that the steps in the various flows shown above may be reordered, added or deleted. For example, the steps described in the present application may be performed in parallel, sequentially, or in a different order, as long as the desired results of the technical solution disclosed in the present application can be achieved, which is not limited herein.
The above embodiments do not limit the scope of the present application. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present application should be included in the scope of the present application.

Claims (16)

1. A training method of a face fusion model, comprising:
acquiring a sample image, wherein the sample image comprises a user sample image and a base plate sample image;
training a generative adversarial network according to the user sample image and the base plate sample image to obtain a face fusion model, wherein the face fusion model is used for replacing a face in the base plate image with a face in the user image;
wherein the generative adversarial network comprises a feature extraction network, a generator and a discriminator;
wherein the training a generative adversarial network according to the user sample image and the base plate sample image to obtain the face fusion model comprises:
inputting the base plate sample image into the feature extraction network to obtain first feature information;
acquiring face information of the user sample image through a face recognition network to obtain first face information;
inputting the first feature information and the first face information into the generator to obtain a first face fusion result;
inputting the first face fusion result and the base plate image into the discriminator respectively, to obtain a first discrimination result for the first face fusion result and a second discrimination result for the base plate image;
determining a first loss function according to the first discrimination result, the second discrimination result and the first face fusion result;
inputting the first feature information and the first face information into the generator to obtain attention information;
determining a second loss function according to the attention information;
performing face fusion according to the user sample image and the base plate sample image to obtain a second face fusion result, wherein a fusion index of the first face fusion result is larger than a fusion index of the second face fusion result, and the fusion index is positively correlated with the quality of a face fusion result;
determining a third loss function according to the first face fusion result and the second face fusion result;
and adjusting network parameters of the feature extraction network and the generator according to a weighted sum of the first loss function, the second loss function and the third loss function.
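The training procedure recited in claim 1 can be illustrated with a minimal PyTorch-style sketch. All module definitions below, the face recognition stub, the naive reference fusion used to produce the lower-quality second face fusion result, the adversarial loss formulation, and the loss weights are illustrative assumptions, not the networks or formulas defined by this application.

```python
# Illustrative sketch only: stand-in modules for the procedure of claim 1.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureExtractor(nn.Module):
    """Produces the first feature information from the base plate sample image."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 16, 3, padding=1)
    def forward(self, x):
        return F.relu(self.conv(x))

class Generator(nn.Module):
    """Outputs a fused image and an attention map from features plus face information."""
    def __init__(self, face_dim=128):
        super().__init__()
        self.img = nn.Conv2d(16 + face_dim, 3, 3, padding=1)
        self.att = nn.Conv2d(16 + face_dim, 1, 3, padding=1)
    def forward(self, feat, face_info):
        b, _, h, w = feat.shape
        face_map = face_info.view(b, -1, 1, 1).expand(-1, -1, h, w)
        x = torch.cat([feat, face_map], dim=1)
        return torch.tanh(self.img(x)), torch.sigmoid(self.att(x))

class Discriminator(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Conv2d(3, 16, 4, 2, 1), nn.LeakyReLU(0.2),
                                 nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 1))
    def forward(self, x):
        return self.net(x)

class FaceRecognizerStub(nn.Module):
    """Stand-in for a pre-trained face recognition network (first face information)."""
    def __init__(self, dim=128):
        super().__init__()
        self.fc = nn.Linear(3, dim)
    def forward(self, x):
        return self.fc(x.mean(dim=(2, 3)))

def naive_fusion(user_img, base_img, alpha=0.5):
    """Crude pixel blend standing in for a conventional, lower-quality fusion method."""
    return alpha * user_img + (1 - alpha) * base_img

def train_step(user_img, base_img, face_net, f_net, g_net, d_net, optimizer,
               w1=1.0, w2=0.1, w3=10.0):
    feat = f_net(base_img)                        # first feature information
    face_info = face_net(user_img)                # first face information
    fused, attention = g_net(feat, face_info)     # first face fusion result + attention
    d_fake = d_net(fused)                         # first discrimination result
    d_real = d_net(base_img)                      # second discrimination result
    # one plausible adversarial term using both discrimination results (relativistic style)
    loss1 = F.binary_cross_entropy_with_logits(d_fake - d_real.mean(),
                                               torch.ones_like(d_fake))
    loss2 = attention.abs().mean()                # second loss: regularizes the attention information
    rough = naive_fusion(user_img, base_img)      # second (lower-quality) face fusion result
    loss3 = F.l1_loss(fused, rough)               # third loss
    loss = w1 * loss1 + w2 * loss2 + w3 * loss3   # weighted sum of the three losses
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

if __name__ == "__main__":
    user, base = torch.rand(2, 3, 64, 64), torch.rand(2, 3, 64, 64)
    f, g, d = FeatureExtractor(), Generator(), Discriminator()
    opt = torch.optim.Adam(list(f.parameters()) + list(g.parameters()), lr=2e-4)
    print(train_step(user, base, FaceRecognizerStub(), f, g, d, opt))
```

In an actual implementation the discriminator would be updated in an alternating step with its own objective; that step is omitted here for brevity.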
2. The method of claim 1, wherein the method further comprises:
acquiring face information of the first face fusion result through a face recognition network to obtain second face information;
determining a fourth loss function according to the second face information and the first face information;
and adjusting network parameters of the feature extraction network and the generator according to a weighted sum of the first loss function, the second loss function, the third loss function and the fourth loss function.
3. The method of claim 2, wherein the method further comprises:
inputting the first face fusion result into the feature extraction network to obtain second feature information;
determining a fifth loss function according to the first feature information and the second feature information;
and adjusting network parameters of the feature extraction network and the generator according to a weighted sum of the first loss function, the second loss function, the third loss function, the fourth loss function and the fifth loss function.
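Claims 2 and 3 add an identity-preserving term and a feature-consistency term to the weighted sum. A hedged sketch of how these two extra losses might be computed, reusing the stand-in modules from the sketch after claim 1; the cosine-similarity and L1 formulations are assumptions:

```python
import torch.nn.functional as F

def extra_losses(fused, first_face_info, first_feat, face_net, f_net):
    # fourth loss: the second face information should match the first face information,
    # i.e. the fused face should keep the identity of the user sample image
    second_face_info = face_net(fused)
    loss4 = 1.0 - F.cosine_similarity(second_face_info, first_face_info).mean()
    # fifth loss: the second feature information should match the first feature information,
    # i.e. the fused image should keep the deep features of the base plate sample image
    second_feat = f_net(fused)
    loss5 = F.l1_loss(second_feat, first_feat)
    return loss4, loss5

# total loss = w1*loss1 + w2*loss2 + w3*loss3 + w4*loss4 + w5*loss5
```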
4. The method according to any one of claims 1-3, wherein the inputting the first feature information and the first face information into the generator to obtain a first face fusion result comprises:
inputting the first feature information and the first face information into the generator to obtain an initial face fusion result and attention information;
and inputting the initial face fusion result and the attention information into the generator to obtain the first face fusion result.
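Claim 4 describes a two-pass use of the generator: a first pass yields an initial face fusion result and attention information, and a second pass refines that result. A minimal sketch, assuming a single module with a hypothetical refinement head; layer sizes are arbitrary:

```python
import torch
import torch.nn as nn

class TwoPassGenerator(nn.Module):
    def __init__(self, feat_ch=16, face_dim=128):
        super().__init__()
        self.first = nn.Conv2d(feat_ch + face_dim, 4, 3, padding=1)  # 3 image ch + 1 attention ch
        self.refine = nn.Conv2d(4, 3, 3, padding=1)                  # second pass

    def forward(self, feat, face_info):
        b, _, h, w = feat.shape
        face_map = face_info.view(b, -1, 1, 1).expand(-1, -1, h, w)
        out = self.first(torch.cat([feat, face_map], dim=1))
        initial = torch.tanh(out[:, :3])          # initial face fusion result
        attention = torch.sigmoid(out[:, 3:])     # attention information
        # second pass: feed the initial result and the attention map back into the generator
        refined = torch.tanh(self.refine(torch.cat([initial, attention], dim=1)))
        # the attention map controls where the refined pixels replace the initial ones
        return attention * refined + (1 - attention) * initial, attention

# fused, att = TwoPassGenerator()(torch.rand(1, 16, 64, 64), torch.rand(1, 128))
```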
5. The method according to any one of claims 1-3, wherein the training a generative adversarial network according to the user sample image and the base plate sample image to obtain a face fusion model comprises:
performing affine transformation on the user sample image and the base plate sample image respectively;
training the generative adversarial network according to the affine-transformed user sample image and base plate sample image to obtain the face fusion model.
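Claim 5 pre-processes both sample images with an affine transformation before training, typically so that the faces are geometrically aligned. A sketch using OpenCV; the three reference landmark positions and the output size are assumptions:

```python
import cv2
import numpy as np

def affine_align(image, landmarks, size=256):
    """Warp the image so that three detected landmarks land on fixed template points."""
    template = np.float32([[0.35 * size, 0.38 * size],   # left eye
                           [0.65 * size, 0.38 * size],   # right eye
                           [0.50 * size, 0.68 * size]])  # mouth centre
    m = cv2.getAffineTransform(np.float32(landmarks[:3]), template)
    return cv2.warpAffine(image, m, (size, size), flags=cv2.INTER_LINEAR)

# user_aligned = affine_align(user_img, user_landmarks)
# base_aligned = affine_align(base_img, base_landmarks)
```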
6. A face fusion method comprising:
acquiring a user image and a base plate image;
inputting the user image and the base plate image into a face fusion model obtained by the training method according to any one of claims 1-5 to obtain a face fusion result, wherein the face fusion model is used for replacing a face in the base plate image with a face in the user image.
7. The method of claim 6, wherein the face fusion model comprises a feature extraction network and a generator;
wherein the inputting the user image and the base plate image into a face fusion model obtained by the training method according to any one of claims 1-5 to obtain a face fusion result comprises:
acquiring face information of the user image through a face recognition network;
inputting the base plate image into the pre-trained feature extraction network to obtain feature information;
and inputting the face information and the feature information into the pre-trained generator to obtain the face fusion result.
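At inference time (claims 6 and 7) only the trained feature extraction network and generator are needed, together with a face recognition network for the user image. A minimal sketch reusing the stand-in modules from the training sketch above; names and shapes are assumptions:

```python
import torch

@torch.no_grad()
def fuse_faces(user_img, base_img, face_net, f_net, g_net):
    face_info = face_net(user_img)   # face information of the user image
    feat = f_net(base_img)           # feature information of the base plate image
    fused, _attention = g_net(feat, face_info)
    return fused                     # face fusion result

# result = fuse_faces(user_tensor, base_tensor, FaceRecognizerStub(), f_net, g_net)
```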
8. A training device for a face fusion model, comprising:
the first acquisition module is used for acquiring a sample image, wherein the sample image comprises a user sample image and a base plate sample image;
the training module is used for training a generative adversarial network according to the user sample image and the base plate sample image to obtain a face fusion model, and the face fusion model is used for replacing a face in the base plate image with a face in the user image;
wherein the generative adversarial network comprises a feature extraction network, a generator and a discriminator;
the training module comprises:
the input unit is used for inputting the base plate sample image into the feature extraction network to obtain first feature information;
the acquisition unit is used for acquiring the face information of the user sample image through a face recognition network to obtain first face information;
the input unit is further configured to input the first feature information and the first face information into the generator to obtain a first face fusion result, and input the first face fusion result and the base plate image into the discriminator respectively, to obtain a first discrimination result for the first face fusion result and a second discrimination result for the base plate image;
the determining unit is used for determining a first loss function according to the first discrimination result, the second discrimination result and the first face fusion result;
the input unit is further configured to input the first feature information and the first face information into the generator to obtain attention information;
the determining unit is further configured to determine a second loss function according to the attention information;
the face fusion unit is used for performing face fusion according to the user sample image and the base plate sample image to obtain a second face fusion result, wherein a fusion index of the first face fusion result is larger than a fusion index of the second face fusion result, and the fusion index is positively correlated with the quality of a face fusion result;
the determining unit is further configured to determine a third loss function according to the first face fusion result and the second face fusion result;
and the adjusting unit is used for adjusting the network parameters of the feature extraction network and the generator according to the weighted sum of the first loss function, the second loss function and the third loss function.
9. The apparatus of claim 8, wherein the acquisition unit is further configured to acquire face information of the first face fusion result through a face recognition network to obtain second face information;
the determining unit is further configured to determine a fourth loss function according to the second face information and the first face information;
the adjusting unit is further configured to adjust network parameters of the feature extraction network and the generator according to a weighted sum of the first loss function, the second loss function, the third loss function, and the fourth loss function.
10. The apparatus of claim 9, wherein the input unit is further configured to input the first face fusion result into the feature extraction network to obtain second feature information;
the determining unit is further configured to determine a fifth loss function according to the first feature information and the second feature information;
the adjusting unit is further configured to adjust network parameters of the feature extraction network and the generator according to a weighted sum of the first loss function, the second loss function, the third loss function, the fourth loss function, and the fifth loss function.
11. The apparatus according to any one of claims 8-10, wherein the inputting, by the input unit, the first feature information and the first face information into the generator to obtain a first face fusion result specifically comprises:
inputting the first feature information and the first face information into the generator to obtain an initial face fusion result and attention information;
and inputting the initial face fusion result and the attention information into the generator to obtain the first face fusion result.
12. The apparatus of any of claims 8-10, further comprising:
a transformation module for performing affine transformation on the user sample image and the base plate sample image, respectively;
the training module is used for training the generative adversarial network according to the affine-transformed user sample image and base plate sample image to obtain a face fusion model.
13. A face fusion apparatus comprising:
the second acquisition module is used for acquiring a user image and a base plate image;
the input module is used for inputting the user image and the base plate image into a face fusion model obtained by the training method according to any one of claims 1-5 to obtain a face fusion result, wherein the face fusion model is used for replacing a face in the base plate image with a face in the user image.
14. The apparatus of claim 13, wherein the face fusion model comprises a feature extraction network and a generator;
the input module comprises:
the acquisition unit is used for acquiring face information of the user image through a face recognition network;
the input unit is used for inputting the base plate image into the pre-trained feature extraction network to obtain feature information, and inputting the face information and the feature information into the pre-trained generator to obtain the face fusion result.
15. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-7.
16. A non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1-7.
CN202010615462.XA 2020-06-30 2020-06-30 Training method of face fusion model, face fusion method, device and equipment Active CN111783647B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010615462.XA CN111783647B (en) 2020-06-30 2020-06-30 Training method of face fusion model, face fusion method, device and equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010615462.XA CN111783647B (en) 2020-06-30 2020-06-30 Training method of face fusion model, face fusion method, device and equipment

Publications (2)

Publication Number Publication Date
CN111783647A CN111783647A (en) 2020-10-16
CN111783647B true CN111783647B (en) 2023-11-03

Family

ID=72760013

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010615462.XA Active CN111783647B (en) 2020-06-30 2020-06-30 Training method of face fusion model, face fusion method, device and equipment

Country Status (1)

Country Link
CN (1) CN111783647B (en)

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112330534A (en) * 2020-11-13 2021-02-05 北京字跳网络技术有限公司 Animal face style image generation method, model training method, device and equipment
CN112101320A (en) * 2020-11-18 2020-12-18 北京世纪好未来教育科技有限公司 Model training method, image generation method, device, equipment and storage medium
CN112950732B (en) * 2021-02-23 2022-04-01 北京三快在线科技有限公司 Image generation method and device, storage medium and electronic equipment
CN113052025A (en) * 2021-03-12 2021-06-29 咪咕文化科技有限公司 Training method of image fusion model, image fusion method and electronic equipment
CN113111776B (en) * 2021-04-12 2024-04-16 京东科技控股股份有限公司 Method, device, equipment and storage medium for generating countermeasure sample
CN113361387A (en) * 2021-06-03 2021-09-07 湖南快乐阳光互动娱乐传媒有限公司 Face image fusion method and device, storage medium and electronic equipment
CN113642491A (en) * 2021-08-20 2021-11-12 北京百度网讯科技有限公司 Face fusion method, and training method and device of face fusion model
CN113902956B (en) * 2021-09-30 2023-04-07 北京百度网讯科技有限公司 Training method of fusion model, image fusion method, device, equipment and medium
CN114170342A (en) * 2021-12-10 2022-03-11 北京字跳网络技术有限公司 Image processing method, device, equipment and storage medium
CN114387656A (en) * 2022-01-14 2022-04-22 平安科技(深圳)有限公司 Face changing method, device, equipment and storage medium based on artificial intelligence
CN114821717B (en) * 2022-04-20 2024-03-12 北京百度网讯科技有限公司 Target object fusion method and device, electronic equipment and storage medium
CN117808854A (en) * 2024-02-29 2024-04-02 腾讯科技(深圳)有限公司 Image generation method, model training method, device and electronic equipment

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2013131407A1 (en) * 2012-03-08 2013-09-12 无锡中科奥森科技有限公司 Double verification face anti-counterfeiting method and device
WO2017092592A1 (en) * 2015-12-03 2017-06-08 阿里巴巴集团控股有限公司 Image fusion method, apparatus and device
CN107609506A (en) * 2017-09-08 2018-01-19 百度在线网络技术(北京)有限公司 Method and apparatus for generating image
CN108596024A (en) * 2018-03-13 2018-09-28 杭州电子科技大学 A kind of illustration generation method based on human face structure information
CN110503601A (en) * 2019-08-28 2019-11-26 上海交通大学 Face based on confrontation network generates picture replacement method and system
CN111028142A (en) * 2019-11-25 2020-04-17 泰康保险集团股份有限公司 Image processing method, apparatus and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Face recognition method based on DCT and MMC; Han Xiaocui; Computer Engineering and Design (14); full text *
Research on facial expression recognition based on domain information loss and attention dynamic weighted training; Chang Tianyuan; China Master's Theses Full-text Database (Issue 1); full text *

Also Published As

Publication number Publication date
CN111783647A (en) 2020-10-16

Similar Documents

Publication Publication Date Title
CN111783647B (en) Training method of face fusion model, face fusion method, device and equipment
CN111291885B (en) Near infrared image generation method, training method and device for generation network
CN111598818B (en) Training method and device for face fusion model and electronic equipment
CN111652828B (en) Face image generation method, device, equipment and medium
CN111832745B (en) Data augmentation method and device and electronic equipment
US20140153832A1 (en) Facial expression editing in images based on collections of images
EP4024352A2 (en) Method and apparatus for face liveness detection, and storage medium
US20210343065A1 (en) Cartoonlization processing method for image, electronic device, and storage medium
US11024060B1 (en) Generating neutral-pose transformations of self-portrait images
CN111753908A (en) Image classification method and device and style migration model training method and device
CN111709873B (en) Training method and device for image conversion model generator
WO2021218040A1 (en) Image processing method and apparatus
US20210241498A1 (en) Method and device for processing image, related electronic device and storage medium
CN111709875B (en) Image processing method, device, electronic equipment and storage medium
CN111768356A (en) Face image fusion method and device, electronic equipment and storage medium
CN112270745B (en) Image generation method, device, equipment and storage medium
CN112001248B (en) Active interaction method, device, electronic equipment and readable storage medium
US11741684B2 (en) Image processing method, electronic device and storage medium for performing skin color recognition on a face image
CN112116525B (en) Face recognition method, device, equipment and computer readable storage medium
CN111862031A (en) Face synthetic image detection method and device, electronic equipment and storage medium
JP2022133463A (en) Living face detection method, apparatus, electronic device, and storage medium
CN112561053B (en) Image processing method, training method and device of pre-training model and electronic equipment
CN115967823A (en) Video cover generation method and device, electronic equipment and readable medium
CN112381927A (en) Image generation method, device, equipment and storage medium
Zheng et al. Aristo: An augmented reality platform for immersion and interactivity

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant