CN113850168A - Fusion method, device and equipment of face pictures and storage medium


Info

Publication number: CN113850168A
Application number: CN202111089159.1A
Authority: CN (China)
Prior art keywords: face picture, network, identity, face, hidden code
Legal status: Pending
Other languages: Chinese (zh)
Inventors: 陶洪, 李玉乐, 项伟
Current Assignee: Bigo Technology Singapore Pte Ltd
Original Assignee: Bigo Technology Singapore Pte Ltd
Application filed by Bigo Technology Singapore Pte Ltd
Priority: CN202111089159.1A; PCT/CN2022/116786 (WO2023040679A1)
Publication: CN113850168A


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/25 - Fusion techniques
    • G06F 18/253 - Fusion techniques of extracted features

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Processing (AREA)
  • Image Analysis (AREA)

Abstract

The embodiments of the present application disclose a fusion method, apparatus, and device for face pictures, as well as a storage medium, belonging to the field of machine learning. The method includes: acquiring a source face picture and a target face picture; acquiring an identity feature hidden code of the source face picture; acquiring an attribute feature hidden code of the target face picture; and fusing the identity feature hidden code and the attribute feature hidden code to generate a fused face picture. The present application provides a face fusion method with an improved fusion effect: a realistic fused face picture can be generated even when the feature difference between the source face and the target face is large.

Description

Fusion method, device and equipment of face pictures and storage medium
Technical Field
The present application relates to the field of machine learning technologies, and in particular, to a method, an apparatus, a device, and a storage medium for fusing face pictures.
Background
Face fusion refers to the process of fusing two face pictures into one face picture, where the resulting face carries characteristics of the faces in both input pictures. At present, face fusion technology is widely applied in fields such as photo retouching and video editing.
In the related art, a triangulation method is used to partition the source face picture and the target face picture and obtain a fused picture. First, the face positions in the source face picture and the target face picture are aligned. Then, feature points that represent the identity of the person, as well as anchor points, are extracted from the source face picture and the target face picture respectively: points on the contours of the facial features are usually selected as feature points, and points on the picture border and the face contour are selected as anchor points. The anchor points and feature points are connected, and a triangulation algorithm yields a number of triangular partitions. For any triangular partition on the source face picture, the corresponding triangular partition on the target face picture is found; a mapping transformation of the two partitions produces a fused triangular partition, whose pixel values are determined from the pixel values of the two original partitions. Finally, the fused face picture is generated from all the fused triangular partitions.
However, when face fusion is performed by the triangulation method, natural and harmonious faces cannot be fused if the features of the source face and the target face differ greatly, for example, when the face angles, skin colors, or illumination conditions of the source face picture and the target face picture differ significantly.
Disclosure of Invention
The embodiments of the present application provide a method, an apparatus, a device, and a storage medium for fusing face pictures, so that a natural and harmonious face picture can be fused even when the source face picture and the target face picture differ greatly. The technical solution is as follows:
According to an aspect of the embodiments of the present application, a method for fusing face pictures is provided, the method including:
acquiring a source face picture and a target face picture;
acquiring an identity feature hidden code (latent code) of the source face picture, where the identity feature hidden code is used to represent the identity features of the person in the source face picture;
acquiring an attribute feature hidden code of the target face picture, where the attribute feature hidden code is used to represent the attribute features of the person in the target face picture;
and fusing the identity feature hidden code and the attribute feature hidden code to generate a fused face picture.
According to an aspect of the embodiments of the present application, a training method for a face fusion model is provided, where the face fusion model includes a generation network and a discrimination network, and the generation network includes an identity encoding network, an attribute encoding network, and a decoding network; the method includes:
acquiring training samples of the face fusion model, where the training samples include a source face picture sample and a target face picture sample;
acquiring, through the identity encoding network, an identity feature hidden code of the source face picture sample, where the identity feature hidden code is used to represent the identity features of the person in the source face picture sample;
acquiring, through the attribute encoding network, an attribute feature hidden code of the target face picture sample, where the attribute feature hidden code is used to represent the attribute features of the person in the target face picture sample;
fusing, through the decoding network, the identity feature hidden code and the attribute feature hidden code to generate a fused face picture sample;
determining, through the discrimination network, whether a sample to be discriminated is generated by the generation network, where the sample to be discriminated includes the fused face picture sample;
determining a discrimination network loss based on the discrimination result of the discrimination network, and adjusting parameters in the discrimination network based on the discrimination network loss;
and determining a generation network loss based on the fused face picture sample, the source face picture sample, the target face picture sample, and the discrimination result of the discrimination network, and adjusting parameters in the generation network based on the generation network loss.
According to an aspect of the embodiments of the present application, an apparatus for fusing face pictures is provided, the apparatus including:
a face picture acquisition module, configured to acquire a source face picture and a target face picture;
an identity feature acquisition module, configured to acquire an identity feature hidden code of the source face picture, where the identity feature hidden code is used to represent the identity features of the person in the source face picture;
an attribute feature acquisition module, configured to acquire an attribute feature hidden code of the target face picture, where the attribute feature hidden code is used to represent the attribute features of the person in the target face picture;
and a fused picture generation module, configured to fuse the identity feature hidden code and the attribute feature hidden code to generate a fused face picture.
According to an aspect of the embodiments of the present application, a training apparatus for a face fusion model is provided, where the face fusion model includes a generation network and a discrimination network, and the generation network includes an identity encoding network, an attribute encoding network, and a decoding network; the apparatus includes:
a training sample acquisition module, configured to acquire training samples of the face fusion model, where the training samples include a source face picture sample and a target face picture sample;
an identity feature acquisition module, configured to acquire, through the identity encoding network, an identity feature hidden code of the source face picture sample, where the identity feature hidden code is used to represent the identity features of the person in the source face picture sample;
an attribute feature acquisition module, configured to acquire, through the attribute encoding network, an attribute feature hidden code of the target face picture sample, where the attribute feature hidden code is used to represent the attribute features of the person in the target face picture sample;
a fused picture generation module, configured to fuse, through the decoding network, the identity feature hidden code and the attribute feature hidden code to generate a fused face picture sample;
a face picture discrimination module, configured to determine, through the discrimination network, whether a sample to be discriminated is generated by the generation network, where the sample to be discriminated includes the fused face picture sample;
a first parameter adjustment module, configured to determine a discrimination network loss based on the discrimination result of the discrimination network and adjust parameters in the discrimination network based on the discrimination network loss;
and a second parameter adjustment module, configured to determine a generation network loss based on the fused face picture sample, the source face picture sample, the target face picture sample, and the discrimination result of the discrimination network, and adjust parameters in the generation network based on the generation network loss.
According to an aspect of the embodiments of the present application, a computer device is provided. The computer device includes a processor and a memory, the memory stores a computer program, and the processor executes the computer program to implement the above fusion method for face pictures or the above training method for a face fusion model.
According to an aspect of the embodiments of the present application, a computer-readable storage medium is provided, in which a computer program is stored. The computer program, when executed by a processor, implements the above fusion method for face pictures or the above training method for a face fusion model.
According to an aspect of the present application, a computer program product is provided which, when run on a computer device, causes the computer device to execute the above fusion method for face pictures or the above training method for a face fusion model.
The technical solutions provided by the embodiments of the present application can bring the following beneficial effects:
the identity feature hidden code of the source face picture and the attribute feature hidden code of the target face picture are extracted, and the fused face picture is obtained by fusing the two hidden codes, so that the identity of the source face and the attributes of the target face are each preserved in the fused result.
Drawings
To describe the technical solutions in the embodiments of the present application more clearly, the drawings required for the description of the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present application, and those skilled in the art can derive other drawings from them without creative effort.
FIG. 1 is a schematic illustration of an implementation environment provided by one embodiment of the present application;
FIG. 2 is a flowchart of a method for fusing face pictures according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a method for fusing face pictures according to another embodiment of the present application;
FIG. 4 is a flowchart of a training method for a face fusion model according to an embodiment of the present application;
FIG. 5 is a schematic diagram of a training method for a face fusion model according to an embodiment of the present application;
FIG. 6 is a block diagram of an apparatus for fusing face pictures according to an embodiment of the present application;
FIG. 7 is a block diagram of a training apparatus for a face fusion model according to another embodiment of the present application;
FIG. 8 is a schematic diagram of a computer device provided by one embodiment of the present application.
Detailed Description
Before the technical solutions of the present application are introduced, some related background knowledge is described. The related technologies below, as optional solutions, can be combined arbitrarily with the technical solutions of the embodiments of the present application, and all such combinations belong to the scope of the embodiments of the present application. The embodiments of the present application include at least part of the following contents.
Some terms appearing in this application are introduced below.
Computer Vision (CV) refers to the automated extraction, analysis, and understanding of useful information from an image or a sequence of images by a computer. Computer vision covers fields such as scene reconstruction, event detection, video tracking, object recognition, three-dimensional pose estimation, motion estimation, and image restoration, and also includes biometric recognition technologies such as face recognition and fingerprint recognition, as well as technologies such as face fusion.
A Generative Adversarial Network (GAN) consists of a generative neural network and a discriminative neural network: the generative neural network processes input data to produce generated data, and the discriminative neural network distinguishes real data from generated data. During training, the two networks compete with each other. The generative neural network adjusts its network parameters according to the generation loss function so that its generated data can mislead the judgment of the discriminative neural network; the discriminative neural network adjusts its network parameters according to the discrimination loss function so that it can correctly distinguish real data from generated data. After sufficient training, the data produced by the generative neural network becomes close to real data, and the discriminator can no longer tell generated data apart from real data.
The following describes affine transformation.
In geometry, an affine transformation is a linear transformation of a vector space followed by a translation, yielding a new vector space.
Taking a two-dimensional vector space as an example, an affine transformation maps two-dimensional coordinates (x, y) to two-dimensional coordinates (u, v) as follows:
u = a1 * x + b1 * y + c1
v = a2 * x + b2 * y + c2
Operations such as translation, scaling, and rotation of a two-dimensional image can be realized through affine transformation.
An affine transformation preserves the straightness and parallelism of a two-dimensional image. Straightness means that a straight line remains a straight line after the transformation (and an arc remains an arc); parallelism means that the relative positional relationship between straight lines does not change after the transformation, nor does the relative position of points on a straight line.
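As an illustration, here is a minimal NumPy sketch of the two-dimensional affine transformation above; the coefficient values are arbitrary examples, not taken from the application.

```python
import numpy as np

def affine_transform(points, A, c):
    """Apply an affine transformation to an array of 2D points.

    points: (N, 2) array of (x, y) coordinates
    A:      (2, 2) linear part [[a1, b1], [a2, b2]]
    c:      (2,)   translation [c1, c2]
    Returns the (N, 2) array of transformed (u, v) coordinates.
    """
    return points @ A.T + c

# Example: rotate by 30 degrees, scale by 2, then translate by (1, -1).
theta = np.pi / 6
A = 2.0 * np.array([[np.cos(theta), -np.sin(theta)],
                    [np.sin(theta),  np.cos(theta)]])
c = np.array([1.0, -1.0])
pts = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
print(affine_transform(pts, A, c))  # straight lines stay straight
```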
The Adaptive Instance Normalization (AdaIN) operation is described below.
The AdaIN operation takes a content feature x and a style feature y as input and matches the per-channel mean and variance of x to those of y according to the following formula:
AdaIN(x, y) = σ(y) * (x - μ(x)) / σ(x) + μ(y)
For example, consider a style feature corresponding to a particular brush-stroke texture. After normalization by an AdaIN layer, the output has a high average activation for that style feature while the spatial structure of the content x is preserved. A decoder can then map this style feature into the image space of the content x, and the variance of the texture feature carries finer style information into the AdaIN output and the final output image. In short, AdaIN achieves style transfer in feature space by transferring feature statistics, namely the per-channel mean and variance.
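The AdaIN operation can be sketched in PyTorch as follows, assuming feature maps of shape (batch, channels, height, width) with statistics taken per channel over the spatial dimensions:

```python
import torch

def adain(x: torch.Tensor, y: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """AdaIN(x, y) = sigma(y) * (x - mu(x)) / sigma(x) + mu(y).

    x: content features, y: style features, both shaped (B, C, H, W).
    """
    mu_x = x.mean(dim=(2, 3), keepdim=True)
    sigma_x = x.std(dim=(2, 3), keepdim=True) + eps  # eps avoids division by zero
    mu_y = y.mean(dim=(2, 3), keepdim=True)
    sigma_y = y.std(dim=(2, 3), keepdim=True)
    return sigma_y * (x - mu_x) / sigma_x + mu_y
```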
Refer to FIG. 1, which illustrates an implementation environment of an embodiment of the present application. This implementation environment can be realized as a face fusion system, whose architecture may include a server 10 and at least one terminal device 20.
The terminal device 20 may be an electronic device such as a mobile phone, a tablet computer, a PC (Personal Computer), a smart television, or a multimedia player. The terminal device 20 runs a target application that carries the face fusion model; the target application may be a photography application, a video application, a social application, or the like, and its type is not limited here. In some embodiments, the target application is deployed on the terminal device 20 and the fusion of face pictures is performed on the terminal device: the terminal device acquires the source face picture and the target face picture, extracts the identity feature hidden code from the source face picture and the attribute feature hidden code from the target face picture, and fuses the two hidden codes to generate the fused face picture, completing the fusion process.
The server 10 is a backend server for the target application. The server 10 may be a single server, a server cluster composed of multiple servers, or a cloud computing service center. In other embodiments, the fusion of face pictures is performed on the server 10: the terminal device 20 uploads the acquired source face picture and target face picture to the server 10; the server 10 extracts the identity feature hidden code from the source face picture and the attribute feature hidden code from the target face picture, fuses the two hidden codes to generate a fused face picture, and sends the fused picture back to the terminal device 20, completing the fusion process.
The terminal device 20 and the server 10 can communicate with each other via a network.
The system architecture and service scenarios described in the embodiments of the present application are intended to illustrate the technical solutions more clearly and do not limit them. Those skilled in the art will appreciate that, as the implementation environment evolves and new service scenarios emerge, the technical solutions provided in the embodiments of the present application remain applicable to similar technical problems.
To make the objects, technical solutions, and advantages of the present application clearer, embodiments of the present application are described in further detail below with reference to the accompanying drawings.
Referring to FIG. 2, a flowchart of a method for fusing face pictures according to an embodiment of the present application is shown. The steps of the method may be executed by the terminal device 20 in the implementation environment shown in FIG. 1, or by the server 10. For convenience of description, a computer device is taken as the execution subject below. The method may include at least one of the following steps 210 to 240.
Step 210: acquire a source face picture and a target face picture.
The source face picture is a face picture whose style is to be modified; it is generally a real picture provided by a user, for example, a photo of a person taken with a mobile phone or a camera. The target face picture is a face picture that provides the style change for the source face picture; it may be a face picture provided by an application on the terminal device or a face picture uploaded by the user. The embodiments of the present application do not limit how the source face picture and the target face picture are obtained.
Step 220: acquire an identity feature hidden code of the source face picture, where the identity feature hidden code is used to represent the identity features of the person in the source face picture.
The identity feature hidden code represents features of the face in the source face picture such as the shapes of the facial features, the relative positions between them, and the face shape. These features are related to the identity of the person; that is, different faces generally differ in the shapes of their facial features, the relative positions between those features, and the face shape. Therefore, different identity feature hidden codes are obtained for different source face pictures. In some embodiments, the identity feature hidden code is obtained by encoding the source face picture through an identity encoding network.
Step 230: acquire an attribute feature hidden code of the target face picture, where the attribute feature hidden code is used to represent the attribute features of the person in the target face picture.
The attribute features of the person in the target face picture include, but are not limited to, at least one of the following: makeup, skin color, hairstyle, accessories, head pose, and other features in the target face picture. The head pose feature of the target face picture is the mapping, in the two-dimensional picture, of the deflection angle of the target face in three-dimensional space, where the target face is the face in the target face picture; the head pose consists of a pitch angle (pitch), a yaw angle (yaw), and a roll angle (roll). For example, when the face looks straight into the camera, the pitch, yaw, and roll angles of the head pose are all 0°. In some embodiments, the attribute feature hidden code is obtained by encoding the target face picture through an attribute encoding network.
In some embodiments, the identity feature hidden code of the source face picture and the attribute feature hidden code of the target face picture are obtained by two different encoding networks, so the two hidden codes may be obtained simultaneously or sequentially; this is not limited in the present application.
Step 240: fuse the identity feature hidden code and the attribute feature hidden code to generate a fused face picture.
The fused face picture has both the identity features of the source face picture and the attribute features of the target face picture: the face in the fused picture is visually closer to the source face picture in identity, while its makeup and pose are closer to the target face picture. The face fusion model includes an identity encoding network and an attribute encoding network. In some embodiments, the face fusion model fuses the identity feature hidden code and the attribute feature hidden code to generate the fused face picture.
In summary, in the technical solution provided by this embodiment, a source face picture and a target face picture are acquired, an identity feature hidden code is obtained from the source face picture and an attribute feature hidden code from the target face picture, and the two hidden codes are fused to obtain the fused face picture. This provides a method for generating a fused face picture and achieves the goal of generating natural and realistic face pictures.
In addition, in the related art, the fused face picture is obtained by fusing corresponding triangular partitions of the source face picture and the target face picture. When the features of the two pictures differ greatly, some features in the fused picture are influenced by both inputs at once, so those features do not match reality and the fused face looks unrealistic. In this embodiment, the identity feature hidden code is obtained from the source face picture and the attribute feature hidden code from the target face picture; during fusion, the identity feature hidden code controls the identity features of the generated face and the attribute feature hidden code controls its attribute features, which avoids generating an unrealistic fused face picture when the features of the faces in the source and target pictures differ greatly.
Next, a method for generating a fused face picture by a face fusion model will be described.
Refer to FIG. 3, which illustrates a schematic diagram of a method for fusing face pictures according to another embodiment of the present application.
In some embodiments, the fused face picture is generated by a face fusion model that includes an identity encoding network, an attribute encoding network, and a decoding network. The identity encoding network obtains the identity feature hidden code of the source face picture; the attribute encoding network obtains the attribute feature hidden code of the target face picture; and the decoding network fuses the identity feature hidden code and the attribute feature hidden code to generate the fused face picture.
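Assembled end to end, the generation pipeline might be outlined as below. This is a hypothetical skeleton: the module names id_encoder, attr_encoder, and decoder are illustrative, and the concrete layer definitions are left to the embodiments described next.

```python
import torch
import torch.nn as nn

class FaceFusionModel(nn.Module):
    """Skeleton of the fusion model: two encoders and one decoder."""

    def __init__(self, id_encoder: nn.Module, attr_encoder: nn.Module,
                 decoder: nn.Module):
        super().__init__()
        self.id_encoder = id_encoder      # source picture -> identity hidden code
        self.attr_encoder = attr_encoder  # target picture -> attribute hidden code
        self.decoder = decoder            # (identity, attribute) -> fused picture

    def forward(self, source: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
        id_code = self.id_encoder(source)        # e.g. (B, 16, 512)
        attr_code = self.attr_encoder(target)    # same size as id_code
        return self.decoder(id_code, attr_code)  # fused face picture
```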
In some embodiments, the identity encoding network and the attribute encoding network each have N serially connected coding layers, and the structures and parameters of corresponding coding layers in the two networks are the same, so the identity feature hidden code obtained through the identity encoding network has the same size as the attribute feature hidden code obtained through the attribute encoding network. In both networks, the input of the n-th layer is the output of the (n-1)-th layer, where n is a positive integer less than or equal to N. In some embodiments, each coding layer of the identity encoding network and the attribute encoding network is a ResNet block (residual neural network block). Within a coding layer, the intermediate hidden code from the previous layer is first convolved with a 1 × 1 kernel and activated by LReLU (Leaky Rectified Linear Unit), then convolved with a 3 × 3 kernel and activated by LReLU; finally, the resolution is increased, another 3 × 3 convolution with LReLU activation is applied, and the resulting intermediate hidden code is passed to the next coding layer.
The attribute encoding network encodes the target face picture and outputs the attribute feature hidden code through a fully connected layer.
By encoding the source face picture with an identity encoding network of N coding layers and the target face picture with an attribute encoding network of N coding layers, the identity features and attribute features are decoupled during encoding, which effectively avoids feature entanglement.
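One possible PyTorch reading of such a coding layer is sketched below. The channel counts, the resizing factor, and the shortcut convolution are assumptions; the application itself only names the 1 × 1 and 3 × 3 convolutions, the LReLU activations, and the ResNet block structure.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CodingLayer(nn.Module):
    """ResNet-style coding layer: 1x1 conv, 3x3 conv, resize, 3x3 conv."""

    def __init__(self, in_ch: int, out_ch: int, scale: float = 2.0):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, kernel_size=1)
        self.conv2 = nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1)
        self.conv3 = nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1)
        self.skip = nn.Conv2d(in_ch, out_ch, kernel_size=1)  # residual shortcut
        self.act = nn.LeakyReLU(0.2)
        self.scale = scale  # resolution change between layers (an assumption)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.act(self.conv1(x))                    # 1x1 conv + LReLU
        h = self.act(self.conv2(h))                    # 3x3 conv + LReLU
        h = F.interpolate(h, scale_factor=self.scale)  # change resolution
        h = self.act(self.conv3(h))                    # second 3x3 conv + LReLU
        s = F.interpolate(self.skip(x), scale_factor=self.scale)
        return h + s  # residual connection of the ResNet block
```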
In some embodiments, the identity encoding network includes N serially connected coding layers, N being an integer greater than 1, and obtaining the identity feature hidden code of the source face picture includes: encoding the source face picture through the 1st to n1-th coding layers of the identity encoding network to obtain a shallow hidden code, which represents the facial appearance features of the source face picture; encoding the shallow hidden code through the (n1+1)-th to n2-th coding layers to obtain a middle-layer hidden code, which represents fine facial features of the source face picture; and encoding the middle-layer hidden code through the (n2+1)-th to N-th coding layers to obtain a deep hidden code, which represents the face color features and microscopic facial features of the source face picture. The identity feature hidden code consists of the shallow hidden code, the middle-layer hidden code, and the deep hidden code, where n1 and n2 are positive integers less than N.
The identity encoding network extracts features from the source face picture at multiple levels, yielding identity feature hidden codes with different receptive fields. The shallow hidden code is produced at low resolution after only a few coding layers, so its receptive field is small: each value in the shallow hidden code maps to a small pixel region of the source face picture, and its features are coarse. The shallow hidden code therefore represents the facial appearance of the source face picture, such as the face contour, hairstyle, and pose. As the number of coding layers and the resolution increase, repeated convolutions enlarge the receptive field of the middle-layer hidden code: each value maps to a larger pixel region, and the represented features become more detailed, such as the opening and closing of the eyes and the details of the facial features. As the number of coding layers increases further and the resolution grows, each value in the deep hidden code maps to the largest pixel region of the source face picture, so the deep hidden code represents the finest identity features, such as the skin color and pupil color of the face.
Denote the sizes of the shallow, middle-layer, and deep hidden codes output by the identity encoding network as a1, a2, and a3. In some embodiments, a1 = a2 = a3. In other embodiments, a1, a2, and a3 are unequal, and the face fusion model apportions the sizes of the three hidden codes according to the characteristics of the identity encoding network; for example, when feature entanglement in the shallow hidden code is small, the size of the shallow hidden code is increased and the sizes of the middle-layer and deep hidden codes are reduced.
In some embodiments, the identity encoding network has 6 coding layers with n1 = 2 and n2 = 4: the shallow hidden code is output by the 2nd coding layer, the middle-layer hidden code by the 4th coding layer, and the deep hidden code by the 6th coding layer. In some embodiments, the shallow hidden code has size 8 × 512, the middle-layer hidden code 6 × 512, and the deep hidden code 2 × 512, so the identity feature hidden code has size 16 × 512 (8 + 6 + 2 = 16).
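The multi-level extraction described in this embodiment might be organized as follows; the tap positions (layers 2, 4, and 6) and code sizes (8 × 512, 6 × 512, 2 × 512) follow the embodiment above, while the projection heads that flatten feature maps into codes are assumptions.

```python
import torch
import torch.nn as nn

class IdentityEncoder(nn.Module):
    """Six serial coding layers with taps after layers 2, 4 and 6."""

    def __init__(self, layers: nn.ModuleList, heads: nn.ModuleList):
        super().__init__()
        self.layers = layers  # 6 coding layers (e.g. CodingLayer instances)
        self.heads = heads    # 3 projection heads -> 8x512, 6x512, 2x512 codes

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        codes = []
        taps = {2: 0, 4: 1, 6: 2}  # coding layer index -> head index
        for n, layer in enumerate(self.layers, start=1):
            x = layer(x)
            if n in taps:
                codes.append(self.heads[taps[n]](x))  # (B, k, 512) per tap
        # shallow (B,8,512) + middle (B,6,512) + deep (B,2,512) -> (B,16,512)
        return torch.cat(codes, dim=1)
```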
In some embodiments, the decoding network includes M decoding layers, M being an integer greater than 1, and fusing the identity feature hidden code and the attribute feature hidden code to generate the fused face picture includes: performing an affine transformation on the identity feature hidden code to generate M groups of control vectors, and decoding the attribute feature hidden code and the M groups of control vectors through the M decoding layers to generate the fused face picture. The input of the 1st decoding layer includes the attribute feature hidden code and the 1st group of control vectors; the input of the (i+1)-th decoding layer includes the output of the i-th decoding layer and the (i+1)-th group of control vectors; the output of the M-th decoding layer includes the fused face picture, where i is a positive integer less than M.
The affine transformation does not change the relative positional relationships among the features in the identity feature hidden code: it filters out the absolute positions where features appear while preserving the relative relationships between them. The control vectors are used to control the style of the fused face picture.
In some embodiments, performing an affine transformation on the identity feature hidden code to generate M groups of control vectors includes: dividing the identity feature hidden code into M groups of identity feature vectors, and performing an affine transformation on each of the M groups to generate the M groups of control vectors. Each group contains at least two control vectors, and different control vectors represent identity features of different dimensions.
In some embodiments, adjacent features in the identity feature hidden code are grouped in pairs to obtain the M groups of control vectors; for example, if the identity feature hidden code has size 16 × 512, every two adjacent columns of identity features (each 1 × 512) form one control vector group. Identity features of different dimensions can represent different classes of identity features of the source face picture. In some embodiments, identity features of different dimensions have different receptive fields and therefore represent features of different granularities; in other embodiments, they have the same receptive field and represent different types of identity features of the source face picture, for example, one control vector group may contain a feature characterizing the eye shape and a feature characterizing the nose shape of the source face picture.
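A minimal sketch of this grouping and affine mapping, assuming a 16 × 512 identity feature hidden code split into M = 8 groups of two 512-dimensional vectors, each mapped by its own learned affine (fully connected) layer:

```python
import torch
import torch.nn as nn

class ControlVectors(nn.Module):
    """Split the identity hidden code into M groups and affine-map each column."""

    def __init__(self, m_groups: int = 8, per_group: int = 2, dim: int = 512):
        super().__init__()
        self.per_group = per_group
        # one learned affine map (y = Wx + b) per control vector
        self.affines = nn.ModuleList(
            nn.Linear(dim, dim) for _ in range(m_groups * per_group))

    def forward(self, id_code: torch.Tensor):
        # id_code: (B, 16, 512); columns 2i and 2i+1 form group i
        vectors = [aff(id_code[:, i]) for i, aff in enumerate(self.affines)]
        return [vectors[i:i + self.per_group]
                for i in range(0, len(vectors), self.per_group)]
```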
In some embodiments, decoding the attribute feature hidden code and the M groups of control vectors through the M decoding layers proceeds as follows. The i-th decoding layer receives the output of the (i-1)-th layer and the control vector group corresponding to the i-th layer, which contains a first control vector and a second control vector. The decoding layer applies an adaptive instance normalization operation to its input and the first control vector to obtain an intermediate vector, convolves the intermediate vector with a 3 × 3 kernel, applies an adaptive instance normalization operation to the convolved vector and the second control vector, and passes the result to the (i+1)-th layer, completing the decoding operation of one layer.
In some embodiments, the decoding network includes 8 decoding layers. The attribute feature hidden code is the input of the 1st decoding layer; the single-layer decoding step above is repeated 8 times, and the 8th decoding layer outputs the fused face picture with a resolution of 512 × 512.
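The per-layer decoding step might look as follows. The way a control vector is turned into per-channel scale and bias inside the adaptive normalization is an assumed convention (borrowed from StyleGAN-like designs), not something the application specifies:

```python
import torch
import torch.nn as nn

def adain_modulate(x: torch.Tensor, w: torch.Tensor, eps: float = 1e-5):
    """AdaIN driven by a control vector w: per-channel scale and bias are
    read from the two halves of w (an assumed convention)."""
    b, c, _, _ = x.shape
    mu = x.mean(dim=(2, 3), keepdim=True)
    sigma = x.std(dim=(2, 3), keepdim=True) + eps
    scale = w[:, :c].reshape(b, c, 1, 1)
    bias = w[:, c:2 * c].reshape(b, c, 1, 1)
    return scale * (x - mu) / sigma + bias

class DecodingLayer(nn.Module):
    """One decoding layer: AdaIN, 3x3 convolution, AdaIN."""

    def __init__(self, channels: int):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x, w1, w2):
        x = adain_modulate(x, w1)     # first adaptive normalization
        x = self.conv(x)              # 3x3 convolution
        return adain_modulate(x, w2)  # second adaptive normalization
```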
Encoding the feature hidden codes through multiple coding layers avoids entanglement between them, and decoding the attribute feature hidden code together with the control vector groups through the decoding network lets the control vectors govern the identity features of the fused face picture, producing a realistic and natural fused result.
The training procedure of the face fusion model is described below through an embodiment. The content involved in using the face fusion model corresponds to the content involved in training it; where one side is not described in detail, refer to the description of the other side.
Referring to FIG. 4, a flowchart of a training method for a face fusion model according to an embodiment of the present application is shown. The steps of the method may be executed by the server 10 or by another computer device. For convenience of description, a computer device is taken as the execution subject. The method may include at least one of the following steps 410 to 470.
Step 410: acquire training samples of the face fusion model, where the training samples include a source face picture sample and a target face picture sample.
The face fusion model includes a generation network and a discrimination network, and the generation network includes an identity encoding network, an attribute encoding network, and a decoding network.
The face fusion model is a generative adversarial network model. In some embodiments, the input of the face fusion model includes a source face picture sample and a target face picture sample. Each training sample group contains two picture samples, one used as the source face picture sample and the other as the target face picture sample. The training samples are used to train the face fusion model so that it can generate realistic fused face pictures. The two picture samples in a training sample group may show different persons or have different attribute features; training the model with many such groups ensures that the trained model can still generate realistic, natural fused face pictures even when the input source face picture sample and target face picture sample differ greatly. In some embodiments, the training samples come from a high-definition face dataset (FFHQ) containing face pictures of different genders, ethnicities, face angles, expressions, and makeup. The dataset is divided into a source face picture sample group and a target face picture sample group, and each training sample group takes one picture sample from each group as its source face picture sample and target face picture sample, respectively.
Step 420: acquire an identity feature hidden code of the source face picture sample through the identity encoding network, where the identity feature hidden code is used to represent the identity features of the person in the source face picture sample.
During training, different source face picture samples differ in face angle and identity features. Through training, the identity encoding network learns to decouple this feature information, so the identity feature hidden codes of the source face picture samples obtained through the identity encoding network exhibit little feature entanglement.
Step 430: acquire an attribute feature hidden code of the target face picture sample through the attribute encoding network, where the attribute feature hidden code is used to represent the attribute features of the person in the target face picture sample.
During training, different target face picture samples differ in face pose, makeup, and environmental factors. Through training, the attribute encoding network learns to decouple this feature information, so the attribute feature hidden codes of the target face picture samples obtained through the attribute encoding network exhibit little feature entanglement.
Step 440: fuse the identity feature hidden code and the attribute feature hidden code through the decoding network to generate a fused face picture sample.
The decoding network is pre-trained and does not participate in training during the training of the face fusion model; it is only used to decode the identity feature hidden code and the attribute feature hidden code to generate high-definition, realistic fused face picture samples.
In some embodiments, the decoding network adopts the decoding network of a StyleGAN architecture to decode the identity feature hidden code and the attribute feature hidden code.
Step 450: determine, through the discrimination network, whether a sample to be discriminated is generated by the generation network, where the sample to be discriminated includes the fused face picture sample.
The discrimination network judges whether the image to be discriminated is a real image in a layer-by-layer growing manner: starting from an RGB image with a resolution of 4 × 4, it gradually expands the image to be discriminated to 8 × 8, 16 × 16, 32 × 32, and so on, until the full size of the image to be discriminated is reached.
In some embodiments, after discriminating the image, the discrimination network outputs a predicted value indicating whether the image to be discriminated is a real image or an image generated by the generation network.
Step 460: determine the discrimination network loss based on the discrimination result of the discrimination network, and adjust the parameters of the discrimination network based on the discrimination network loss.
The discrimination network loss measures the performance of the discrimination network. In some embodiments, a gradient descent algorithm is used to optimize the parameters of the discrimination network based on the discrimination network loss.
Step 470: determine the generation network loss based on the fused face picture sample, the source face picture sample, the target face picture sample, and the discrimination result of the discrimination network, and adjust the parameters of the generation network based on the generation network loss.
Since the decoding network in the generation network does not participate in training, the generation network loss measures the performance of the identity encoding network and the attribute encoding network. In some embodiments, a gradient descent algorithm is used to optimize the parameters of the identity encoding network and of the attribute encoding network based on the generation network loss.
In summary, training sample groups are obtained, the parameters of the face fusion model are adjusted through the loss functions, and adversarial training is carried out between the generation network and the discrimination network. The trained face fusion model is therefore more robust: it can handle source and target face picture samples with large feature differences and fuse them into realistic, natural fused face picture samples.
Refer to FIG. 5, which illustrates a schematic diagram of a training method for a face fusion model according to an embodiment of the present application.
In some embodiments, the identity encoding network includes N serially connected coding layers, N being an integer greater than 1, and acquiring the identity feature hidden code of the source face picture sample through the identity encoding network includes: encoding the source face picture sample through the 1st to n1-th coding layers of the identity encoding network to obtain a shallow hidden code, which represents the facial appearance features of the source face picture sample; encoding the shallow hidden code through the (n1+1)-th to n2-th coding layers to obtain a middle-layer hidden code, which represents fine facial features of the source face picture sample; and encoding the middle-layer hidden code through the (n2+1)-th to N-th coding layers to obtain a deep hidden code, which represents the face color features and microscopic facial features of the source face picture sample. The identity feature hidden code consists of the shallow hidden code, the middle-layer hidden code, and the deep hidden code, where n1 and n2 are positive integers less than N.
For the encoding process of the identity encoding network, please refer to the previous embodiment, which is not described herein.
In some embodiments, the decoding network includes M decoding layers, M being an integer greater than 1, and fusing the identity feature hidden code and the attribute feature hidden code through the decoding network to generate the fused face picture sample includes: performing an affine transformation on the identity feature hidden code to generate M groups of control vectors, and decoding the attribute feature hidden code and the M groups of control vectors through the M decoding layers to generate the fused face picture sample. The input of the 1st decoding layer includes the attribute feature hidden code and the 1st group of control vectors; the input of the (i+1)-th decoding layer includes the output of the i-th decoding layer and the (i+1)-th group of control vectors; the output of the M-th decoding layer includes the fused face picture sample, where i is a positive integer less than M.
In some embodiments, performing an affine transformation on the identity feature hidden code to generate M groups of control vectors includes: dividing the identity feature hidden code into M groups of identity feature vectors and performing an affine transformation on each group to generate the M groups of control vectors; each group contains at least two control vectors, and different control vectors represent identity features of different dimensions.
For the decoding process of the decoding network, please refer to the previous embodiment, which is not described herein.
In some embodiments, the discrimination network loss is determined based on the discrimination result. The discrimination loss is the adversarial loss of the discrimination network and is calculated by the following formula:
L_d = log(exp(D(G(x_s))) + 1) + log(exp(-D(x)) + 1)
where x denotes a real picture sample, G(x_s) denotes the fused face picture sample generated by the generation network, D(G(x_s)) denotes the discrimination result of the discrimination network for the fused face picture sample, and D(x) denotes the discrimination result for the real face picture sample. In some implementations, the discrimination result takes the values 0 and 1: a result of 0 means the discrimination network considers the picture to be discriminated to be generated by the generation network (fake), and a result of 1 means the discrimination network considers it to be real.
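A sketch of this discriminator loss in PyTorch; softplus(t) = log(1 + exp(t)) gives a numerically stable form of the two log terms:

```python
import torch
import torch.nn.functional as F

def discriminator_loss(d_fake: torch.Tensor, d_real: torch.Tensor) -> torch.Tensor:
    """L_d = log(exp(D(G(x_s))) + 1) + log(exp(-D(x)) + 1).

    d_fake: discriminator scores D(G(x_s)) for fused (generated) samples
    d_real: discriminator scores D(x) for real samples
    """
    return (F.softplus(d_fake) + F.softplus(-d_real)).mean()
```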
In some embodiments, determining the generation network loss based on the fused face picture sample, the source face picture sample, the target face picture sample, and the discrimination result of the discrimination network includes: determining a perceptual similarity loss based on the target face picture sample and the fused face picture sample, which represents the picture style difference between them; determining a multi-scale identity feature loss based on the source face picture sample and the fused face picture sample, which represents the identity feature difference between them; determining a face pose loss based on the target face picture sample and the fused face picture sample, which describes the face pose difference between them; determining the adversarial loss of the generation network based on the discrimination result; and determining the generation network loss from the perceptual similarity loss, the multi-scale identity feature loss, the face pose loss, and the adversarial loss.
In some embodiments, determining the perceptual similarity loss based on the target face picture sample and the fused face picture sample includes: extracting the visual features of the target face picture sample and of the fused face picture sample through a visual feature extraction network, and computing the similarity between the two sets of visual features to obtain the perceptual similarity loss.
The perceptual similarity loss can be calculated by the following formula:
L_LPIPS = || F(x_t) - F(y_s2t) ||_2
where x_t denotes the target face picture sample, y_s2t denotes the fused face picture sample, F(x_t) denotes the visual features extracted from the target face picture sample by the visual feature extraction network, and F(y_s2t) denotes the visual features extracted from the fused face picture sample.
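A sketch of this computation, assuming a pretrained visual feature extractor F is available as a frozen module:

```python
import torch
import torch.nn as nn

def perceptual_similarity_loss(feature_net: nn.Module,
                               x_t: torch.Tensor,
                               y_s2t: torch.Tensor) -> torch.Tensor:
    """L_LPIPS = || F(x_t) - F(y_s2t) ||_2 with a frozen feature network F."""
    with torch.no_grad():
        f_target = feature_net(x_t)  # visual features of the target sample
    f_fused = feature_net(y_s2t)     # gradients flow through the fused sample
    return (f_target - f_fused).flatten(1).norm(p=2, dim=1).mean()
```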
In some embodiments, determining the multi-scale identity feature loss based on the source face picture sample and the fused face picture sample includes: extracting the identity features of the source face picture sample and of the fused face picture sample through an identity feature extraction network, and computing the similarity between the two sets of identity features to obtain the multi-scale identity feature loss.
The multi-scale identity feature loss can be calculated by the following formula:
L_ID = Σ_i (1 - cos(N_i(x_s), N_i(y_s2t)))
where x_s denotes the source face picture sample, y_s2t denotes the fused face picture sample, N_i(x_s) denotes the i-th scale identity features extracted from the source face picture sample by the identity feature extraction network, and N_i(y_s2t) denotes the i-th scale identity features extracted from the fused face picture sample. In some embodiments, a VGG (Visual Geometry Group) face network is used as the identity feature extraction network to extract the identity features of the source face picture sample and the fused face picture sample respectively.
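A sketch of the multi-scale identity feature loss, assuming the identity feature extraction network returns a list of feature tensors, one per scale:

```python
import torch
import torch.nn as nn

def identity_loss(id_net: nn.Module,
                  x_s: torch.Tensor,
                  y_s2t: torch.Tensor) -> torch.Tensor:
    """L_ID = sum_i (1 - cos(N_i(x_s), N_i(y_s2t))) over feature scales i."""
    feats_src = id_net(x_s)  # list of per-scale identity features
    feats_fused = id_net(y_s2t)
    loss = x_s.new_zeros(())
    for f_s, f_f in zip(feats_src, feats_fused):
        cos = torch.cosine_similarity(f_s.flatten(1), f_f.flatten(1), dim=1)
        loss = loss + (1.0 - cos).mean()
    return loss
```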
In some embodiments, determining the face pose loss based on the target face picture sample and the fused face picture sample includes:
extracting the Euler angles of the face pose of the target face picture sample and the Euler angles of the face pose of the fused face picture sample through a face pose prediction network;
and computing the similarity between the two sets of face pose Euler angles to obtain the face pose loss.
The face pose loss can be calculated by the following formula:
L_POSE = || E(x_t) - E(y_s2t) ||_2
where x_t denotes the target face picture sample, y_s2t denotes the fused face picture sample, E(x_t) denotes the Euler angles of the face pose of the target face picture sample extracted by the face pose prediction network, and E(y_s2t) denotes the Euler angles of the face pose of the fused face picture sample.
In some embodiments, an MTCNN (Multi-task Cascaded Convolutional Networks) network is used as the face pose prediction network to extract the face pose Euler angles of the target face picture sample and the fused face picture sample respectively.
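A sketch of the face pose loss, assuming a pose prediction network E that returns the three Euler angles (pitch, yaw, roll) per image:

```python
import torch
import torch.nn as nn

def pose_loss(pose_net: nn.Module,
              x_t: torch.Tensor,
              y_s2t: torch.Tensor) -> torch.Tensor:
    """L_POSE = || E(x_t) - E(y_s2t) ||_2 over (pitch, yaw, roll)."""
    with torch.no_grad():
        angles_target = pose_net(x_t)  # (B, 3) Euler angles of target faces
    angles_fused = pose_net(y_s2t)     # (B, 3) Euler angles of fused faces
    return (angles_target - angles_fused).norm(p=2, dim=1).mean()
```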
In some embodiments, the countermeasure loss of the generation network is determined based on the discrimination result by the following formula:
L_g = -log(exp(D(G(x_s))) + 1)
wherein G(x_s) represents the fused face picture sample generated by the generation network, and D(G(x_s)) represents the discrimination result of the discrimination network on the fused face picture sample.
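As a sketch only (PyTorch is assumed), this generator-side countermeasure loss can be evaluated stably by noting that log(exp(x) + 1) = softplus(x):

```python
import torch
import torch.nn.functional as F

def generator_adversarial_loss(d_fake):
    # d_fake = D(G(x_s)): raw discriminator score for the fused sample.
    # Literal form of the formula above: L_g = -log(exp(D(G(x_s))) + 1);
    # softplus gives the numerically stable equivalent.
    return -F.softplus(d_fake).mean()
```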
In some embodiments, the training process of the face fusion model is as follows: parameters in the identity coding network, the attribute coding network and the discrimination network are initialized, and m groups of training samples are extracted from a training sample set, each group comprising a source face picture sample and a target face picture sample; for each training sample group, the identity feature hidden code of the source face picture sample is obtained through the identity coding network, the attribute feature hidden code of the target face picture sample is obtained through the attribute coding network, and the two hidden codes are decoded through the decoding network to generate a fused face picture sample; after m fused face picture samples are generated, the generation network is fixed, m real picture samples are extracted from the training sample set, the m fused face picture samples and the m real picture samples are discriminated through the discrimination network, and the discrimination results are output; the loss of the discrimination network is determined by a logistic regression loss function according to the discrimination results, and the parameters in the discrimination network are optimized by gradient descent; the generation loss function is then determined from the fused face picture samples, the source face picture samples, the target face picture samples and the discrimination results of the discrimination network, and the parameters in the generation network are optimized by gradient descent according to the generation loss function, completing one round of training; at the end of each round of training, the total loss of the face fusion model is calculated by the following formula:
L_total = W_LPIPS * L_LPIPS + W_ID * L_ID + W_POSE * L_POSE + W_gan * (L_g + L_d)
wherein W_LPIPS, W_ID, W_POSE and W_gan are the weights occupied by the corresponding losses in the total loss. In some embodiments, the values of W_LPIPS, W_ID, W_POSE and W_gan are 1, 5 and 5, respectively.
Training is stopped when the total loss of the face fusion model reaches its minimum.
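By way of illustration, one round of the alternating training described above may be sketched as follows (PyTorch is assumed; gen, disc, the optimizer handles, the weight dictionary w and the loss helpers from the earlier sketches are all hypothetical names, and the softplus form of the logistic regression discriminator loss is an assumption):

```python
import torch
import torch.nn.functional as F

def train_step(gen, disc, feat_net, id_net, pose_net,
               g_opt, d_opt, x_s, x_t, x_real, w):
    # --- discriminator update: the generation network is held fixed ---
    with torch.no_grad():
        y_s2t = gen(x_s, x_t)              # fused face picture samples
    # logistic-regression style loss: real samples scored high, fused low
    l_d = (F.softplus(disc(y_s2t)) + F.softplus(-disc(x_real))).mean()
    d_opt.zero_grad()
    l_d.backward()
    d_opt.step()

    # --- generator update by gradient descent on the combined loss ---
    y_s2t = gen(x_s, x_t)
    l_lpips = perceptual_similarity_loss(feat_net, x_t, y_s2t)
    l_id = multiscale_identity_loss(id_net, x_s, y_s2t)
    l_pose = face_pose_loss(pose_net, x_t, y_s2t)
    l_g = generator_adversarial_loss(disc(y_s2t))
    # L_total = W_LPIPS*L_LPIPS + W_ID*L_ID + W_POSE*L_POSE + W_gan*(L_g + L_d)
    # (l_d enters as a constant here; its own graph was consumed above)
    l_total = (w["lpips"] * l_lpips + w["id"] * l_id
               + w["pose"] * l_pose + w["gan"] * (l_g + l_d.detach()))
    g_opt.zero_grad()
    l_total.backward()
    g_opt.step()
    return float(l_total)
```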
In an actual training process, a face fusion model capable of generating realistic fused face pictures can be obtained by training for 16 epochs over the training sample set.
By introducing the perceptual similarity loss, the multi-scale identity feature loss, the generative adversarial loss, the face pose loss and other losses, the face fusion model can adjust its parameters more effectively during training.
The following are embodiments of the apparatus of the present application that may be used to perform embodiments of the method of the present application. For details which are not disclosed in the embodiments of the apparatus of the present application, reference is made to the embodiments of the method of the present application.
Please refer to fig. 6, which illustrates a block diagram of a device for fusing facial pictures according to an embodiment of the present application. The device has the function of realizing the fusion method of the face pictures, and the function can be realized by hardware or by hardware executing corresponding software. The apparatus may be the electronic device described above, or may be provided in an electronic device. The apparatus 600 may include: a face image acquisition module 610, an identity characteristic acquisition module 620, an attribute characteristic acquisition module 630 and a fused image generation module 640.
The face image obtaining module 610 is configured to obtain a source face image and a target face image.
An identity characteristic obtaining module 620, configured to obtain an identity characteristic hidden code of the source face picture, where the identity characteristic hidden code is used to represent a person identity characteristic in the source face picture.
An attribute feature obtaining module 630, configured to obtain an attribute feature hidden code of the target face picture, where the attribute feature hidden code is used to represent attribute features of a person in the target face picture.
And a fused picture generating module 640, configured to perform fusion based on the identity feature hidden code and the attribute feature hidden code to generate a fused face picture.
In some embodiments, the fused face picture is generated by a face fusion model, the face fusion model comprising an identity encoding network, an attribute encoding network, and a decoding network; the identity coding network is used for acquiring an identity characteristic hidden code of the source face picture; the attribute coding network is used for acquiring an attribute feature hidden code of the target face picture; and the decoding network is used for fusing based on the identity characteristic hidden code and the attribute characteristic hidden code to generate the fused face picture.
In some embodiments, the identity coding network comprises N coding layers connected in series, N being an integer greater than 1; the identity characteristic obtaining module 620 is configured to:
coding the source face picture through the 1st to n1-th coding layers in the identity coding network to obtain a shallow hidden code, the shallow hidden code being used for representing the facial appearance characteristics of the source face picture; encoding the shallow hidden code through the n1-th to n2-th coding layers to obtain a middle-layer hidden code, the middle-layer hidden code being used for representing fine facial features of the source face picture; encoding the middle-layer hidden code through the n2-th to N-th coding layers to obtain a deep hidden code, the deep hidden code being used for representing the face color characteristics and face micro characteristics of the source face picture; wherein the identity feature hidden code comprises the shallow hidden code, the middle-layer hidden code and the deep hidden code, and n1 and n2 are positive integers smaller than N.
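For illustration only, such a hierarchical identity coding network may be sketched as follows (PyTorch is assumed; the layer counts n1=3, n2=6, N=9 and the channel width are arbitrary assumptions, not values from the embodiment):

```python
import torch
from torch import nn

class IdentityEncoder(nn.Module):
    """A minimal sketch of the N serial coding layers: layers 1..n1 yield
    the shallow hidden code, layers n1..n2 the middle-layer hidden code,
    and layers n2..N the deep hidden code."""

    def __init__(self, n1=3, n2=6, n=9, ch=64):
        super().__init__()

        def stack(num_layers):
            return nn.Sequential(*[
                nn.Sequential(nn.Conv2d(ch, ch, 3, stride=2, padding=1),
                              nn.LeakyReLU(0.2))
                for _ in range(num_layers)])

        self.stem = nn.Conv2d(3, ch, 3, padding=1)
        self.shallow = stack(n1)       # coding layers 1 .. n1
        self.middle = stack(n2 - n1)   # coding layers n1 .. n2
        self.deep = stack(n - n2)      # coding layers n2 .. N

    def forward(self, x):
        h = self.stem(x)
        c_shallow = self.shallow(h)         # facial appearance features
        c_middle = self.middle(c_shallow)   # fine facial features
        c_deep = self.deep(c_middle)        # face color and micro features
        # the identity feature hidden code is the triple of the three codes
        return c_shallow, c_middle, c_deep
```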
In some embodiments, the fused picture generating module 640 includes: the control vector generating unit is used for carrying out affine transformation on the identity characteristic hidden codes to generate M groups of control vectors; the fusion unit is used for decoding the attribute feature hidden codes and the M groups of control vectors through the M decoding layers to generate a fusion face picture; the input of the 1 st decoding layer comprises the attribute feature hidden code and the 1 st group of control vectors, the input of the (i + 1) th decoding layer comprises the output of the (i) th decoding layer and the (i + 1) th group of control vectors, the output of the M th decoding layer comprises the fused face picture, and i is a positive integer smaller than M.
In some embodiments, the control vector generating unit is configured to divide the identity feature hidden code into M groups of identity feature vectors, and perform affine transformation on the M groups of identity feature vectors respectively to generate the M groups of control vectors; each group of control vectors comprises at least two control vectors, and different control vectors are used for representing identity characteristics of different dimensions.
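As an illustrative sketch, the division into M groups, the per-group affine transformation and the chained decoding layers may look as follows (PyTorch is assumed; DecodingLayer, its modulation scheme and all dimensions are hypothetical):

```python
import torch
from torch import nn

class DecodingLayer(nn.Module):
    # A placeholder decoding layer: its convolution is modulated by the
    # control vectors (in the style of a style-based generator).
    def __init__(self, ch, ctrl_width):
        super().__init__()
        self.conv = nn.Conv2d(ch, ch, 3, padding=1)
        self.gain = nn.Linear(ctrl_width, ch)

    def forward(self, h, ctrl):
        s = self.gain(ctrl).unsqueeze(-1).unsqueeze(-1)  # per-channel gain
        return torch.relu(self.conv(h) * (1.0 + s))

class ControlVectorDecoder(nn.Module):
    # The identity feature hidden code is divided into M groups, each group
    # is affine-transformed into a group of (here two) control vectors, and
    # the M decoding layers consume the attribute hidden code plus one
    # group each, as described above.
    def __init__(self, m=8, id_dim=512, ctrl_dim=256, attr_ch=64):
        super().__init__()
        self.m = m
        self.affine = nn.ModuleList(   # one affine transformation per group
            [nn.Linear(id_dim // m, 2 * ctrl_dim) for _ in range(m)])
        self.layers = nn.ModuleList(
            [DecodingLayer(attr_ch, 2 * ctrl_dim) for _ in range(m)])

    def forward(self, id_code, attr_code):
        groups = id_code.chunk(self.m, dim=1)  # M groups of identity vectors
        h = attr_code                 # input of the 1st decoding layer
        for i in range(self.m):
            ctrl = self.affine[i](groups[i])   # i-th group of control vectors
            h = self.layers[i](h, ctrl)        # output feeds layer i+1
        return h                      # output of the M-th decoding layer
```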
Referring to fig. 7, a block diagram of a training apparatus for a face fusion model according to an embodiment of the present application is shown. The device has the function of realizing the training method of the face fusion model, and the function can be realized by hardware or by hardware executing corresponding software. The apparatus may be the computer device described above, or may be provided in a computer device. The apparatus 700 may include: a training sample acquisition module 710, an identity feature acquisition module 720, an attribute feature acquisition module 730, a fused picture generation module 740, a face picture discrimination module 750, a first parameter adjustment module 760 and a second parameter adjustment module 770.
A training sample obtaining module 710, configured to obtain training samples of the face fusion model, where the training samples include a source face image sample and a target face image sample.
An identity characteristic obtaining module 720, configured to obtain, through the identity encoding network, an identity characteristic cryptocode of the source face image sample, where the identity characteristic cryptocode is used to characterize identity characteristics of a person in the source face image sample.
An attribute feature obtaining module 730, configured to obtain an attribute feature hidden code of the target face image sample through the attribute coding network, where the attribute feature hidden code is used to characterize the person attribute feature in the target face image sample.
And a fused picture generating module 740, configured to perform fusion based on the identity feature hidden code and the attribute feature hidden code through the decoding network, so as to generate a fused face picture sample.
And a face image distinguishing module 750, configured to determine, through the distinguishing network, whether a sample to be distinguished is generated by the generating network, where the sample to be distinguished includes the fused face image sample.
A first parameter adjusting module 760, configured to determine a discrimination network loss based on a discrimination result of the discrimination network, and adjust a parameter in the discrimination network based on the discrimination network loss.
A second parameter adjusting module 770, configured to determine a generated network loss based on the fused face picture sample, the source face picture sample, the target face picture sample, and the discrimination result of the discrimination network, and adjust a parameter in the generated network based on the generated network loss.
In some embodiments, the identity coding network comprises N coding layers connected in series, where N is an integer greater than 1, and the identity feature obtaining module 720 is configured to:
coding the source face picture sample through the 1st to n1-th coding layers in the identity coding network to obtain a shallow hidden code, the shallow hidden code being used for representing the facial appearance characteristics of the source face picture sample; encoding the shallow hidden code through the n1-th to n2-th coding layers to obtain a middle-layer hidden code, the middle-layer hidden code being used for representing fine facial features of the source face picture sample; encoding the middle-layer hidden code through the n2-th to N-th coding layers to obtain a deep hidden code, the deep hidden code being used for representing the face color characteristics and face micro characteristics of the source face picture sample; wherein the identity feature hidden code comprises the shallow hidden code, the middle-layer hidden code and the deep hidden code, and n1 and n2 are positive integers smaller than N.
In some embodiments, the decoding network comprises M decoding layers, M being an integer greater than 1, and the fused picture generating module 740 is configured to:
carrying out affine transformation on the identity characteristic hidden codes to generate M groups of control vectors; decoding the attribute feature hidden codes and the M groups of control vectors through the M decoding layers to generate the fused face picture sample; the input of the 1 st decoding layer comprises the attribute feature hidden code and the 1 st group of control vectors, the input of the (i + 1) th decoding layer comprises the output of the (i) th decoding layer and the (i + 1) th group of control vectors, the output of the M th decoding layer comprises the fused face picture sample, and i is a positive integer smaller than M.
In some embodiments, the second parameter adjustment module 770 includes:
a first loss function unit, configured to determine a perceptual similarity loss based on the target face picture sample and the fused face picture sample, where the perceptual similarity loss is used to characterize a picture style difference between the target face picture sample and the fused face picture sample; a second loss function unit, configured to determine the multi-scale identity feature loss based on the source face picture sample and the fused face picture sample, where the multi-scale identity feature loss is used to characterize an identity feature difference between the source face picture sample and the fused face picture sample; a third loss function unit, configured to determine a face pose loss based on the target face picture sample and the fused face picture sample, where the face pose loss is used to describe a face pose difference between the target face picture sample and the fused face picture sample; a loss determination unit configured to determine to generate a network countermeasure loss based on the discrimination result; and determining the generated network loss according to the perception similarity loss, the multi-scale identity feature loss, the face pose loss and the network confrontation loss.
In some embodiments, the first loss function unit is configured to extract, through a visual feature extraction network, a visual feature of the target face picture sample and a visual feature of the fused face picture sample, respectively; and calculating the similarity between the visual characteristics of the target face picture sample and the visual characteristics of the fused face picture sample to obtain the perception similarity loss.
In some embodiments, the second loss function unit is configured to extract, through the identity feature extraction network, an identity feature hidden code of the source face picture sample and an identity feature hidden code of the fused face picture sample respectively; and calculating the similarity between the identity characteristic hidden code of the source face picture sample and the identity characteristic hidden code of the fused face picture sample to obtain the multi-scale identity characteristic loss.
In some embodiments, the third loss function unit is configured to extract, through a face pose prediction network, a face pose euler angle of the target face picture sample and a face pose euler angle of the fused face picture sample, respectively; and calculating the similarity between the Euler angles of the human face postures of the target human face picture samples and the Euler angles of the human face postures of the fused human face picture samples to obtain the loss of the human face postures.
It should be noted that, when the apparatus provided in the foregoing embodiment implements its functions, only the division of the functional modules is illustrated; in practical applications, the functions may be distributed to different functional modules according to needs, that is, the internal structure of the device may be divided into different functional modules to implement all or part of the functions described above. In addition, the apparatus and method embodiments provided by the above embodiments belong to the same concept, and their specific implementation processes are described in detail in the method embodiments, which are not repeated here.
Referring to fig. 8, a block diagram of a computer device 800 according to an embodiment of the present application is shown. The computer device 800 may be configured to implement the fusion method of face pictures described above, or to implement the training method of the face fusion model described above.
Generally, the computer device 800 includes: a processor 801 and a memory 802.
The processor 801 may include one or more processing cores, such as a 4-core processor, an 8-core processor, and so forth. The processor 801 may be implemented in at least one hardware form of a DSP (Digital Signal Processing), an FPGA (Field Programmable Gate Array), and a PLA (Programmable Logic Array). The processor 801 may also include a main processor and a coprocessor, where the main processor is a processor for Processing data in an awake state, and is also called a Central Processing Unit (CPU); a coprocessor is a low power processor for processing data in a standby state. In some embodiments, the processor 801 may be integrated with a GPU (Graphics Processing Unit), which is responsible for rendering and drawing the content required to be displayed on the display screen. In some embodiments, the processor 801 may further include an AI (Artificial Intelligence) processor for processing computing operations related to machine learning.
Memory 802 may include one or more computer-readable storage media, which may be non-transitory. Memory 802 may also include high speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices.
Those skilled in the art will appreciate that the configuration illustrated in FIG. 8 is not intended to be limiting of the computer device 800 and may include more or fewer components than those illustrated, or some components may be combined, or a different arrangement of components may be employed.
In an example embodiment, there is also provided a computer device comprising a processor and a memory, the memory having stored therein a computer program. The computer program is configured to be executed by one or more processors to implement the above-described method for fusion of face pictures or the above-described method for training a face fusion model.
In an exemplary embodiment, a computer-readable storage medium is further provided, in which a computer program is stored, and the computer program, when executed by a processor of a computer device, implements the above-mentioned fusion method of face pictures or the above-mentioned training method of the face fusion model.
Alternatively, the computer-readable storage medium may be a ROM (Read-Only Memory), a RAM (Random Access Memory), a CD-ROM (Compact Disc Read-Only Memory), a magnetic tape, a floppy disk, an optical data storage device, and the like.
In an exemplary embodiment, a computer program product is also provided, which, when running on a computer device, causes the computer device to execute the above-mentioned fusion method of face pictures or the above-mentioned training method of face fusion model.
It should be understood that reference to "a plurality" herein means two or more. "And/or" describes the association relationship of the associated objects, meaning that there may be three relationships; for example, A and/or B may mean: A exists alone, A and B exist simultaneously, or B exists alone. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship. In addition, the step numbers described herein only exemplarily show one possible execution sequence among the steps; in some other embodiments, the steps may also be executed out of the numbering sequence, for example, two steps with different numbers may be executed simultaneously, or in a reverse order to the order shown in the figure, which is not limited by the embodiments of the present application.
The above description is only exemplary of the present application and should not be taken as limiting the present application, and any modifications, equivalents, improvements and the like that are made within the spirit and principle of the present application should be included in the protection scope of the present application.

Claims (16)

1. A method for fusing face pictures is characterized by comprising the following steps:
acquiring a source face picture and a target face picture;
acquiring an identity characteristic hidden code of the source face picture, wherein the identity characteristic hidden code is used for representing the identity characteristics of people in the source face picture;
acquiring an attribute feature hidden code of the target face picture, wherein the attribute feature hidden code is used for representing attribute features of people in the target face picture;
and fusing based on the identity characteristic hidden code and the attribute characteristic hidden code to generate a fused face picture.
2. The method of claim 1, wherein the fused face picture is generated by a face fusion model, the face fusion model comprising an identity coding network, an attribute coding network, and a decoding network; wherein:
the identity coding network is used for acquiring an identity characteristic hidden code of the source face picture;
the attribute coding network is used for acquiring an attribute feature hidden code of the target face picture;
and the decoding network is used for fusing based on the identity characteristic hidden code and the attribute characteristic hidden code to generate the fused face picture.
3. The method of claim 2, wherein the identity coding network comprises N coding layers connected in series, N being an integer greater than 1; and the obtaining of the identity characteristic hidden code of the source face picture comprises:
coding the source face picture through the 1st to n1-th coding layers in the identity coding network to obtain a shallow hidden code, the shallow hidden code being used for representing the facial appearance characteristics of the source face picture;
encoding the shallow hidden code through the n1-th to n2-th coding layers in the identity coding network to obtain a middle-layer hidden code, the middle-layer hidden code being used for representing fine facial features of the source face picture;
encoding the middle-layer hidden code through the n2-th to N-th coding layers in the identity coding network to obtain a deep hidden code, the deep hidden code being used for representing the face color characteristics and face micro characteristics of the source face picture;
wherein the identity characteristic hidden code comprises the shallow hidden code, the middle-layer hidden code and the deep hidden code, and n1 and n2 are positive integers smaller than N.
4. The method of claim 2, wherein the decoding network comprises M decoding layers, M being an integer greater than 1; and the fusing based on the identity characteristic hidden code and the attribute characteristic hidden code to generate the fused face picture comprises the following steps:
carrying out affine transformation on the identity characteristic hidden codes to generate M groups of control vectors;
decoding the attribute feature hidden codes and the M groups of control vectors through the M decoding layers to generate the fused face picture;
the input of the 1 st decoding layer comprises the attribute feature hidden code and the 1 st group of control vectors, the input of the (i + 1) th decoding layer comprises the output of the (i) th decoding layer and the (i + 1) th group of control vectors, the output of the M th decoding layer comprises the fused face picture, and i is a positive integer smaller than M.
5. The method of claim 4, wherein the performing affine transformation on the identity characteristic hidden code to generate M groups of control vectors comprises:
dividing the identity characteristic hidden codes into M groups of identity characteristic vectors;
performing affine transformation on the M groups of identity characteristic vectors respectively to generate M groups of control vectors;
and each group of control vectors comprises at least two control vectors, and different control vectors are used for representing identity characteristics with different dimensions.
6. A method for training a face fusion model, wherein the face fusion model comprises a generation network and a discrimination network, and the generation network comprises an identity coding network, an attribute coding network and a decoding network; the method comprising the following steps:
acquiring training samples of a face fusion model, wherein the training samples comprise a source face picture sample and a target face picture sample;
acquiring an identity characteristic hidden code of the source face picture sample through the identity coding network, wherein the identity characteristic hidden code is used for representing the identity characteristics of people in the source face picture sample;
acquiring an attribute feature hidden code of the target face picture sample through the attribute coding network, wherein the attribute feature hidden code is used for representing character attribute features in the target face picture sample;
fusing based on the identity characteristic hidden code and the attribute characteristic hidden code through the decoding network to generate a fused face picture sample;
determining whether a sample to be distinguished is generated by the generating network or not through the distinguishing network, wherein the sample to be distinguished comprises the fused face picture sample;
determining a discrimination network loss based on a discrimination result of the discrimination network, and adjusting parameters in the discrimination network based on the discrimination network loss;
determining a generated network loss based on the fused face picture sample, the source face picture sample, the target face picture sample and a discrimination result of the discrimination network, and adjusting parameters in the generated network based on the generated network loss.
7. The method of claim 6, wherein the identity coding network comprises N coding layers connected in series, N being an integer greater than 1; and the obtaining of the identity feature hidden code of the source face picture sample through the identity coding network comprises:
coding the source face picture sample through the 1st to n1-th coding layers in the identity coding network to obtain a shallow hidden code, the shallow hidden code being used for representing the facial appearance characteristics of the source face picture sample;
encoding the shallow hidden code through the n1-th to n2-th coding layers in the identity coding network to obtain a middle-layer hidden code, the middle-layer hidden code being used for representing fine facial features of the source face picture sample;
encoding the middle-layer hidden code through the n2-th to N-th coding layers in the identity coding network to obtain a deep hidden code, the deep hidden code being used for representing the face color characteristics and face micro characteristics of the source face picture sample;
wherein the identity feature hidden code comprises the shallow hidden code, the middle-layer hidden code and the deep hidden code, and n1 and n2 are positive integers smaller than N.
8. The method of claim 6, wherein the decoding network comprises M decoding layers, M being an integer greater than 1; and the fusing, through the decoding network, based on the identity feature hidden code and the attribute feature hidden code to generate the fused face picture sample comprises the following steps:
carrying out affine transformation on the identity characteristic hidden codes to generate M groups of control vectors;
decoding the attribute feature hidden codes and the M groups of control vectors through the M decoding layers to generate the fused face picture sample;
the input of the 1 st decoding layer comprises the attribute feature hidden code and the 1 st group of control vectors, the input of the (i + 1) th decoding layer comprises the output of the (i) th decoding layer and the (i + 1) th group of control vectors, the output of the M th decoding layer comprises the fused face picture sample, and i is a positive integer smaller than M.
9. The method of claim 6, wherein the determining a generated network loss based on the fused face picture sample, the source face picture sample, the target face picture sample, and the discrimination result of the discrimination network comprises:
determining a perception similarity loss based on the target face picture sample and the fused face picture sample, wherein the perception similarity loss is used for representing picture style difference between the target face picture sample and the fused face picture sample;
determining the multi-scale identity feature loss based on the source face picture sample and the fused face picture sample, wherein the multi-scale identity feature loss is used for representing identity feature differences between the source face picture sample and the fused face picture sample;
determining a face pose loss based on the target face picture sample and the fused face picture sample, the face pose loss being used to describe a face pose difference between the target face picture sample and the fused face picture sample;
determining to generate network countermeasure loss based on the discrimination result;
and determining the generated network loss according to the perception similarity loss, the multi-scale identity feature loss, the face pose loss and the network confrontation loss.
10. The method of claim 9, wherein determining a perceptual similarity loss based on the target face picture sample and the fused face picture sample comprises:
respectively extracting the visual features of the target face picture sample and the visual features of the fused face picture sample through a visual feature extraction network;
and calculating the similarity between the visual characteristics of the target face picture sample and the visual characteristics of the fused face picture sample to obtain the perception similarity loss.
11. The method of claim 9, wherein the determining the multi-scale identity feature loss based on the source face picture sample and the fused face picture sample comprises:
respectively extracting the identity characteristic hidden codes of the source face picture sample and the identity characteristic hidden codes of the fused face picture sample through the identity characteristic extraction network;
and calculating the similarity between the identity characteristic hidden code of the source face picture sample and the identity characteristic hidden code of the fused face picture sample to obtain the multi-scale identity characteristic loss.
12. The method of claim 9, wherein determining a face pose loss based on the target face picture sample and the fused face picture sample comprises:
respectively extracting a face attitude Euler angle of the target face picture sample and a face attitude Euler angle of the fused face picture sample through a face attitude prediction network;
and calculating the similarity between the Euler angles of the face postures of the target face picture sample and the Euler angles of the face postures of the fused face picture sample to obtain the face posture loss.
13. An apparatus for fusing face pictures, the apparatus comprising:
the face image acquisition module is used for acquiring a source face image and a target face image;
the identity characteristic obtaining module is used for obtaining an identity characteristic hidden code of the source face picture, and the identity characteristic hidden code is used for representing the identity characteristics of people in the source face picture;
the attribute feature acquisition module is used for acquiring an attribute feature hidden code of the target face picture, wherein the attribute feature hidden code is used for representing the attribute features of the person in the target face picture;
and the fused picture generating module is used for fusing based on the identity characteristic hidden code and the attribute characteristic hidden code to generate a fused face picture.
14. An apparatus for training a face fusion model, wherein the face fusion model comprises a generation network and a discrimination network, and the generation network comprises an identity coding network, an attribute coding network and a decoding network; the apparatus comprising:
the system comprises a training sample acquisition module, a face fusion module and a face fusion module, wherein the training sample acquisition module is used for acquiring a training sample of a face fusion model, and the training sample comprises a source face picture sample and a target face picture sample;
the identity characteristic acquisition module is used for acquiring an identity characteristic hidden code of the source face picture sample through the identity coding network, wherein the identity characteristic hidden code is used for representing the identity characteristics of people in the source face picture sample;
the attribute feature acquisition module is used for acquiring an attribute feature hidden code of the target face picture sample through the attribute coding network, wherein the attribute feature hidden code is used for representing the character attribute features in the target face picture sample;
the fused picture generation module is used for fusing based on the identity characteristic hidden code and the attribute characteristic hidden code through the decoding network to generate a fused face picture sample;
the face image distinguishing module is used for determining whether a sample to be distinguished is generated by the generating network or not through the distinguishing network, and the sample to be distinguished comprises the fused face image sample;
the first parameter adjusting module is used for determining the judgment network loss based on the judgment result of the judgment network and adjusting the parameters in the judgment network based on the judgment network loss;
and the second parameter adjusting module is used for determining a generated network loss based on the fused face picture sample, the source face picture sample, the target face picture sample and the discrimination result of the discrimination network, and adjusting parameters in the generation network based on the generated network loss.
15. A computer device comprising a processor and a memory, the memory having stored therein a computer program that is loaded and executed by the processor to implement the method of any of claims 1 to 5 or to implement the method of any of claims 6 to 12.
16. A computer-readable storage medium, in which a computer program is stored, which is loaded and executed by a processor to implement the method of any one of claims 1 to 5 or to implement the method of any one of claims 6 to 12.
CN202111089159.1A 2021-09-16 2021-09-16 Fusion method, device and equipment of face pictures and storage medium Pending CN113850168A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202111089159.1A CN113850168A (en) 2021-09-16 2021-09-16 Fusion method, device and equipment of face pictures and storage medium
PCT/CN2022/116786 WO2023040679A1 (en) 2021-09-16 2022-09-02 Fusion method and apparatus for facial images, and device and storage medium


Publications (1)

Publication Number Publication Date
CN113850168A 2021-12-28

Family

ID=78974417


Cited By (6)

Publication number Priority date Publication date Assignee Title
CN114845067A (en) * 2022-07-04 2022-08-02 中科计算技术创新研究院 Hidden space decoupling-based depth video propagation method for face editing
CN115278297A (en) * 2022-06-14 2022-11-01 北京达佳互联信息技术有限公司 Data processing method, device and equipment based on drive video and storage medium
WO2023040679A1 (en) * 2021-09-16 2023-03-23 百果园技术(新加坡)有限公司 Fusion method and apparatus for facial images, and device and storage medium
CN116246022A (en) * 2023-03-09 2023-06-09 山东省人工智能研究院 Face image identity synthesis method based on progressive denoising guidance
WO2023179074A1 (en) * 2022-03-25 2023-09-28 上海商汤智能科技有限公司 Image fusion method and apparatus, and electronic device, storage medium, computer program and computer program product
JP7479507B2 (en) 2022-03-30 2024-05-08 ▲騰▼▲訊▼科技(深▲セン▼)有限公司 Image processing method and device, computer device, and computer program

Families Citing this family (1)

Publication number Priority date Publication date Assignee Title
CN116310657B (en) * 2023-05-12 2023-09-01 北京百度网讯科技有限公司 Feature point detection model training method, image feature matching method and device

Family Cites Families (7)

Publication number Priority date Publication date Assignee Title
JP6779410B2 (en) * 2018-05-17 2020-11-04 三菱電機株式会社 Video analyzer, video analysis method, and program
CN111339420A (en) * 2020-02-28 2020-06-26 北京市商汤科技开发有限公司 Image processing method, image processing device, electronic equipment and storage medium
CN111860167B (en) * 2020-06-18 2024-01-26 北京百度网讯科技有限公司 Face fusion model acquisition method, face fusion model acquisition device and storage medium
CN112560753A (en) * 2020-12-23 2021-03-26 平安银行股份有限公司 Face recognition method, device and equipment based on feature fusion and storage medium
CN112766160B (en) * 2021-01-20 2023-07-28 西安电子科技大学 Face replacement method based on multi-stage attribute encoder and attention mechanism
CN113343878A (en) * 2021-06-18 2021-09-03 北京邮电大学 High-fidelity face privacy protection method and system based on generation countermeasure network
CN113850168A (en) * 2021-09-16 2021-12-28 百果园技术(新加坡)有限公司 Fusion method, device and equipment of face pictures and storage medium


Also Published As

Publication number Publication date
WO2023040679A1 (en) 2023-03-23


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination