CN111461959B - Face emotion synthesis method and device - Google Patents

Face emotion synthesis method and device

Info

Publication number: CN111461959B
Application number: CN202010095755.XA
Authority: CN (China)
Prior art keywords: image, face, face image, contour, synthesized
Legal status: Active (granted)
Other languages: Chinese (zh)
Other versions: CN111461959A
Inventors: 沈海斌 (Shen Haibin), 孔家慧 (Kong Jiahui), 黄科杰 (Huang Kejie)
Current Assignee: Zhejiang University ZJU
Original Assignee: Zhejiang University ZJU
Priority and filing date: 2020-02-17
Application filed by Zhejiang University ZJU
Publication of CN111461959A: 2020-07-28
Publication of CN111461959B (grant): 2023-04-25


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00 Geometric image transformations in the plane of the image
    • G06T3/04 Context-preserving transformations, e.g. by using an importance map
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/2148 Generating training patterns; Bootstrap methods, e.g. bagging or boosting, characterised by the process organisation or structure, e.g. boosting cascade
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161 Detection; Localisation; Normalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168 Feature extraction; Face representation
    • G06V40/169 Holistic features and representations, i.e. based on the facial image taken as a whole
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/174 Facial expression recognition
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • General Health & Medical Sciences (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Human Computer Interaction (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The invention discloses a face emotion synthesis method and device. The method acquires a color image of the current frame, extracts a face image, and adjusts it to a preset size; detects a plurality of preset key point positions of the face image and draws the contour of each facial part according to the key point positions, obtaining a face contour image; inputs the face image, the face contour image, and a target emotion label into a first-stage convolutional neural network to obtain a coarse synthesized face image; feeds the residual image between the coarse synthesized face image and the original face image into a second-stage convolutional neural network, which predicts an image mask; and computes a corrected synthesized face image from the coarse synthesized face image, the face image, and the image mask. Natural and lifelike face images or face videos with the target emotion can be synthesized under a variety of ambient illumination, face occlusion, and extreme pose conditions.

Description

Face emotion synthesis method and device
Technical Field
The invention belongs to the technical field of facial emotion synthesis, and particularly relates to a facial emotion synthesis method and device.
Background
Facial emotion synthesis refers to changing the emotional expression of a person in a given image or video by technical means, for example to neutral, happy, surprised, or sad. Facial emotion synthesis has many entertainment applications in image editing, photography, and short-video software, and has commercial value in picture production and in film and television production. However, existing face emotion synthesis is not mature enough: it is mainly used in a few special-effects apps, and its practical applicability is weak. The prior art mainly has the following defects: (1) the synthesized emotions are not rich enough; (2) videos synthesized by frame-by-frame processing lack temporal continuity, and the emotion expressions that existing methods can synthesize are relatively uniform; for example, after a video of a person giving a speech is processed, the content of the original speech cannot be preserved, so the result is not natural enough, which limits applications in short-video and in film and television production; (3) under complex illumination, face occlusion, and large head poses, the synthesis is unstable and robustness is poor.
Disclosure of Invention
To solve the above technical problems, the invention provides a method that synthesizes natural and lifelike face images or face videos with a target emotion under a variety of ambient illumination, face occlusion, and extreme pose conditions, and designs a device implementing the method.
The invention adopts the following technical scheme:
a face emotion synthesis method, comprising:
step S101, acquiring a current frame color image;
step S102, extracting a face image from the current frame color image, and adjusting the face image to a preset size;
step S103, obtaining a face contour image according to a plurality of preset key point positions of the face image;
step S104, setting a target emotion label and inputting the face image, the face contour image, and the target emotion label into a first-stage convolutional neural network to obtain a coarse synthesized face image; the target emotion label specifies the desired emotion of the coarse synthesized face image;
step S105, taking the difference between the coarse synthesized face image and the face image to obtain a residual image, and inputting the residual image into a second-stage convolutional neural network to obtain an image mask;
and step S106, combining the coarse synthesized face image and the face image using the image mask to obtain the final corrected synthesized face image.
As a preferred aspect of the present invention, the first-stage convolutional neural network comprises an image encoder, a contour encoder, an image decoder, and a contour decoder. The image encoder and the contour encoder each consist of several downsampling layers; the adjusted face image is input to the image encoder, while the face contour image is concatenated with the emotion label and input to the contour encoder; the encoded features output by the two encoders are concatenated and processed by several cascaded residual blocks to obtain mixed features. The image decoder comprises several upsampling layers and concatenation layers, where each upsampling layer is followed by a concatenation layer and the last concatenation layer is connected to the output layer; the contour decoder consists of several upsampling layers, with the last upsampling layer connected to the output layer. The mixed features are input to the image decoder; after each upsampling layer, the resulting features are concatenated with the same-size features computed by the image encoder, yielding the coarse synthesized face image. The mixed features are also input to the contour decoder to obtain a synthesized face contour image.
As a preferred aspect of the present invention, the second-stage convolutional neural network comprises several residual blocks and a convolutional layer. The adjusted face image is subtracted from the coarse synthesized face image to obtain a residual image; the residual image is input to several cascaded residual blocks and finally processed by a convolutional layer to obtain the image mask.
For the above face emotion synthesis method, the invention further discloses a face emotion synthesis device comprising an image acquisition module, a face extraction module, a contour extraction module, a coarse synthesis module, and a correction module. The image acquisition module acquires a color image of the current frame; the face extraction module extracts a face image from the current frame color image and adjusts its size; the contour extraction module detects a plurality of key point coordinates in the face image and draws the face contour image; the coarse synthesis module processes the adjusted face image, the face contour image, and a target emotion label with the first-stage convolutional neural network to obtain a coarse synthesized face image, where the target emotion label specifies the desired emotion of the coarse synthesized face image; the correction module processes the residual between the adjusted face image and the coarse synthesized face image with the second-stage convolutional neural network to obtain an image mask, and computes the final corrected synthesized face image from the image mask.
Compared with the prior art, the invention has the beneficial effects that:
the scheme of the invention acquires a current frame color image, extracts a face image from the image, detects a plurality of key point coordinates of the face image, draws a face contour image, processes the face image by using a first-stage convolution neural network, acquires a rough synthesized face image by using the face contour image and a target emotion label, processes the rough synthesized face image and a residual image of the face image by using a second-stage convolution neural network, acquires an image mask, and finally calculates to acquire a final corrected synthesized face image. According to the scheme, the robustness of the rough synthesized face image under the conditions of complex illumination, face shielding and extreme gesture is improved through the face contour image and the first-stage convolution neural network. In addition, the cascade connection of the first-stage convolutional neural network and the second-stage convolutional neural network improves the image consistency of the synthesized video obtained after video processing. According to the scheme, natural and lifelike emotion expression of the person can be synthesized under any image or video shooting environment and any gesture.
Drawings
FIG. 1 is a flowchart of a face emotion synthesis method according to an embodiment of the present invention;
FIG. 2 is a schematic structural diagram of a face emotion synthesis device according to an embodiment of the present invention;
FIG. 3 is a block diagram of the first-stage convolutional neural network of the present invention; in the figure: 31 image encoder, 32 contour encoder, 33 image decoder, 34 contour decoder;
FIG. 4 is a block diagram of the second-stage convolutional neural network of the present invention.
Detailed Description
The following describes specific embodiments of the present invention in detail with reference to the drawings. It should be understood that the detailed description and specific examples, while indicating and illustrating the invention, are not intended to limit the invention.
As shown in FIG. 1, a first aspect of the present invention provides a face emotion synthesis method S100, comprising:
step S101, acquiring a current frame color image;
in a specific embodiment of the present invention, a previously photographed image or video clip may be provided, or a current frame color image may be directly obtained through a camera.
Step S102, extracting a face image from the current frame color image, and adjusting the face image to a preset size;
in a specific embodiment of the present invention, face detection is performed using a face detector of a machine learning library such as OpenCV or Dlib, and the detected face image is adjusted to a preset size after the detected face image is acquired. The preset size may be set to m×m (e.g., 128×128), where M is an integer greater than zero.
It should be noted that the extracted face image may contain some background information; it is not restricted to the face region alone, and in general it should include the person's head and part of the background.
Step S103, detecting a plurality of preset key point positions in the face image and drawing the contour of each facial part on a blank image according to the key point positions, obtaining a face contour image;
in one embodiment of the present invention, the face_alignment library is a machine learning library written by python and dedicated to detecting face keypoints, which is used to obtain the coordinates of 68 face keypoints and then to draw a face contour image. The face contour image should correspond to the adjusted face image, i.e. the same size, and the coordinates of the key points.
Step S104, setting a target emotion label and inputting the face image, the face contour image, and the target emotion label into the first-stage convolutional neural network to obtain a coarse synthesized face image; the target emotion label specifies the desired emotion of the coarse synthesized face image;
in a specific embodiment of the present invention, the structure of the first-stage convolutional neural network is shown in fig. 3, the face image is input to an image encoder, and the face contour image and the target emotion label are spliced and then input to the contour encoder; splicing the coding vectors obtained by coding the two encoders, and then processing a plurality of residual blocks to obtain a mixed characteristic, wherein the number of the residual blocks is 3; the mixed features are input to an image decoder, the features obtained at present and the features with the same size obtained by encoding of the encoder are spliced after each layer of up-sampling, the spliced features are input to the next layer of up-sampling layer, and finally a coarse synthesized face image is obtained; optionally, the mixed features are input to a contour decoder to obtain a synthesized face contour image.
The training method of the first-stage convolutional neural network is as follows. A public dataset with expression labels is acquired, and every image in the dataset is preprocessed (the face image is extracted, scaled to the preset size, and the corresponding face contour image is drawn) to obtain face images and face contour images of the preset size. The first-stage convolutional neural network comprises an image encoder, a contour encoder, an image decoder, a contour decoder, and several residual blocks. During training, the network receives a face image, a face contour image, and a target emotion label as inputs, and outputs a coarse synthesized face image together with its corresponding synthesized face contour image. Training follows the adversarial (GAN) paradigm: in addition to the first-stage network, two further convolutional neural networks act as discriminators that judge the realism and the emotion label of the coarse synthesized face image and of its contour image. The coarse synthesized face image, its contour image, and the emotion label of the original face image are also fed back into the first-stage network so that it learns to recover the original face image and face contour image. The loss function is then computed and the model is optimized with the Adam optimizer; the learning rate of all networks may be 0.0001, the total number of iterations may be 300,000, and results may be output every 1,000 iterations for inspection on a test dataset preprocessed into face images and face contour images of the preset size. The contour decoder of the first-stage network can be discarded at test time and in actual use.
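Below is a heavily condensed sketch of one generator update under this training scheme, assuming the FirstStageNet of the previous sketch. The two discriminators D_img and D_cnt, the binary cross-entropy losses, and the cycle-loss weight are assumptions; the discriminators' own updates and their emotion-label supervision are omitted for brevity.

```python
import torch
import torch.nn.functional as F

# G is the FirstStageNet from the previous sketch; D_img and D_cnt are assumed
# discriminators returning real/fake logits for the face and contour outputs.
G = FirstStageNet()
opt_g = torch.optim.Adam(G.parameters(), lr=1e-4)  # learning rate per the description

def generator_step(face, contour, src_label, tgt_label, D_img, D_cnt):
    coarse, coarse_cnt = G(face, contour, tgt_label)
    # Adversarial terms: the generator tries to make both outputs look real.
    s_img, s_cnt = D_img(coarse), D_cnt(coarse_cnt)
    loss_adv = (F.binary_cross_entropy_with_logits(s_img, torch.ones_like(s_img)) +
                F.binary_cross_entropy_with_logits(s_cnt, torch.ones_like(s_cnt)))
    # Cycle term: feeding the coarse outputs and the source emotion label back
    # through G must recover the original face and contour images.
    recon, recon_cnt = G(coarse, coarse_cnt, src_label)
    loss_cyc = F.l1_loss(recon, face) + F.l1_loss(recon_cnt, contour)
    loss = loss_adv + 10.0 * loss_cyc  # cycle weight is an assumption
    opt_g.zero_grad()
    loss.backward()
    opt_g.step()
    return loss.item()
```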
Here the target emotion label specifies the emotional expression of the corresponding face image, including but not limited to neutral, happy, surprised, sad, angry, disgusted, and fearful. For example, the target emotion label may be the one-hot vector (0 neutral, 1 happy, 0 surprised, 0 sad, 0 angry, 0 disgusted, 0 fearful), in which case the emotional expression of the corresponding face image is happy.
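For illustration, the label of the example above can be built as a one-hot tensor suitable for the first-stage network; the class ordering here is an assumption, as the patent fixes seven emotions but not their order.

```python
import torch

# Assumed class ordering for the seven emotions.
EMOTIONS = ["neutral", "happy", "surprised", "sad", "angry", "disgusted", "fearful"]

def emotion_label(name):
    label = torch.zeros(1, len(EMOTIONS))   # a batch of one
    label[0, EMOTIONS.index(name)] = 1.0    # "happy" -> [0, 1, 0, 0, 0, 0, 0]
    return label
```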
Step S105, subtracting the face image from the coarse synthesized face image to obtain a residual image, and processing the residual image with the second-stage convolutional neural network to predict an image mask. The structure of the second-stage convolutional neural network, shown in FIG. 4, consists of several residual blocks and a convolutional layer.
In a specific embodiment of the present invention, the residual image obtained by differencing the coarse synthesized face image and the face image is processed by several residual blocks and then by a convolutional layer, which predicts the final image mask.
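A minimal PyTorch sketch of this second-stage network, following FIG. 4, is given below; the block count, channel width, input-lifting convolution, and sigmoid output are assumptions of the sketch.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):  # plain residual block, as in the first-stage sketch
    def __init__(self, c):
        super().__init__()
        self.body = nn.Sequential(nn.Conv2d(c, c, 3, padding=1),
                                  nn.ReLU(inplace=True),
                                  nn.Conv2d(c, c, 3, padding=1))
    def forward(self, x):
        return x + self.body(x)

class SecondStageNet(nn.Module):
    def __init__(self, n_blocks=3, width=64):
        super().__init__()
        self.head = nn.Conv2d(3, width, 3, padding=1)   # lift the residual image
        self.blocks = nn.Sequential(*[ResBlock(width) for _ in range(n_blocks)])
        self.mask = nn.Conv2d(width, 1, 3, padding=1)   # single-channel mask

    def forward(self, coarse, face):
        residual = coarse - face                        # step S105 input
        x = self.blocks(self.head(residual))
        return torch.sigmoid(self.mask(x))              # mask values in [0, 1]
```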
The training method of the second-stage convolutional neural network is as follows. A public dataset with expression labels is acquired, and a first-stage convolutional neural network trained according to the method above is used to produce coarse synthesized images. The corresponding face image is subtracted from each coarse synthesized image to obtain a residual image; the residual image is processed by the second-stage convolutional neural network (several residual blocks followed by one convolutional layer) to obtain an image mask, and the final corrected synthesized face image is computed as in step S106. Training again follows the adversarial (GAN) paradigm: an additional convolutional neural network judges the realism of the final corrected synthesized face image. The loss function is then computed and the model is optimized with the Adam optimizer; the learning rate of all networks may be 0.0001, the total number of iterations may be 10,000, and results may be output every 1,000 iterations for inspection on residual images obtained from a preprocessed test dataset.
Step S106, combining the coarse synthesized face image and the face image using the image mask to obtain the final corrected synthesized face image;
the final corrected composite face image satisfies the following relationship:
I=Isrc*(1-Mask)+Isyn*Mask
where I is the final corrected synthesized face image, Isrc is the adjusted face image, Isyn is the coarse synthesized face image, and Mask is the image mask.
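In code, this correction reduces to a single broadcasted tensor operation; the sketch assumes mask values in [0, 1] and arrays whose channel axes broadcast (e.g. a 1-channel mask against a 3-channel image).

```python
def correct(face, coarse, mask):
    """I = Isrc * (1 - Mask) + Isyn * Mask; the 1-channel mask broadcasts over RGB."""
    return face * (1 - mask) + coarse * mask
```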
The face emotion synthesis method of the invention can synthesize richer emotions. Because it is assisted by face keypoints and face contour information, it adapts to complex illumination environments, face occlusion, and extreme pose conditions with good robustness. The cascade of two convolutional neural networks further refines the result, so that more natural and lifelike images, or more coherent videos, can be synthesized.
As shown in FIG. 2, a second aspect of the present invention provides a face emotion synthesis device 20, comprising:
an image acquisition module 21 for acquiring a color image of a current frame;
a face extraction module 22, configured to extract a face image from the color image of the current frame and adjust the size;
the contour extraction module 23 is used for detecting a plurality of key point coordinates from the face image and drawing the face contour image;
the coarse synthesis module 24, configured to process the adjusted face image, the face contour image, and a target emotion label with the first-stage convolutional neural network to obtain a coarse synthesized face image, where the target emotion label specifies the desired emotion of the coarse synthesized face image;
and the correction module 25, configured to process the residual between the adjusted face image and the coarse synthesized face image with the second-stage convolutional neural network to obtain an image mask, and to compute the final corrected synthesized face image from the mask.
In a specific embodiment of the present invention, the face extraction module includes:
an extraction unit, for extracting the face image from the current frame color image;
an adjusting unit, for adjusting the face image to a preset size.
In a specific embodiment of the present invention, the contour extraction module includes:
a detection unit, for detecting the coordinates of 68 key points in the adjusted face image;
a drawing unit, for creating a blank image of the preset size and drawing the contours of the corresponding facial parts according to the 68 key point coordinates.
In a specific embodiment of the present invention, the coarse synthesis module includes:
a synthesis unit, for processing the adjusted face image, the face contour image, and the target emotion label with the first-stage convolutional neural network to obtain a coarse synthesized face image; the target emotion label is preset and input to the synthesis unit, and the coarse synthesized face image carries the emotion corresponding to the target emotion label.
In a specific embodiment of the present invention, the correction module includes:
a residual calculation unit, for computing the residual image between the coarse synthesized face image and the adjusted face image;
a prediction unit, for processing the residual image with the second-stage convolutional neural network and predicting the image mask;
a correction unit, for computing the final corrected synthesized face image from the coarse synthesized face image, the adjusted face image, and the predicted image mask.
In one embodiment of the present invention, the face emotion synthesis device 20 works as follows. The image acquisition module acquires a color image of the current frame; the extraction unit and the adjusting unit, connected in sequence, extract a face image from the current frame color image and adjust it to the preset size. The output of the adjusting unit is connected to the input of the detection unit; the detection unit obtains the key point coordinates of the adjusted face image and passes them to the drawing unit, which draws the contours of the corresponding facial parts from the face key point coordinates. The outputs of the drawing unit and the adjusting unit are both connected to the synthesis unit, which additionally has an input port for the target emotion label and is loaded with the trained first-stage convolutional neural network model. The output of the synthesis unit is connected to the correction module, which is loaded with the trained second-stage convolutional neural network model and finally produces the corrected synthesized face image.
The face emotion synthesis device provided by the embodiment of the invention can be applied in the related embodiments of the face emotion synthesis method; the details of the method are described above and are not repeated here.
It is to be understood that the above embodiments merely illustrate the application of the principles of the present invention and do not limit it. Modifications of the above-described embodiments, or equivalent substitutions of some of their features, will be apparent to those of ordinary skill in the art and are intended to fall within the scope of the present invention.

Claims (10)

1. A method of face emotion synthesis, comprising:
step S101, acquiring a current frame color image;
step S102, extracting a face image from the current frame color image, and adjusting the face image to a preset size;
step S103, obtaining a face contour image according to a plurality of preset key point positions of the face image;
step S104, setting a target emotion label and inputting the face image, the face contour image, and the target emotion label into a first-stage convolutional neural network to obtain a coarse synthesized face image; the target emotion label specifies the desired emotion of the coarse synthesized face image;
step S105, taking the difference between the coarse synthesized face image and the face image to obtain a residual image, and inputting the residual image into a second-stage convolutional neural network to obtain an image mask;
and step S106, combining the coarse synthesized face image and the face image using the image mask to obtain the final corrected synthesized face image.
2. The face emotion synthesis method according to claim 1, wherein step S103 specifically comprises: detecting the coordinates of 68 key points of the adjusted face image, and drawing the contour of each facial part on a blank image of the preset size according to the 68 key point coordinates, obtaining the face contour image.
3. The face emotion synthesis method according to claim 1, wherein the first-stage convolutional neural network comprises an image encoder, a contour encoder, an image decoder, and a contour decoder; the image encoder and the contour encoder each consist of several downsampling layers; the adjusted face image is input to the image encoder, while the face contour image is concatenated with the emotion label and input to the contour encoder; the encoded features output by the two encoders are concatenated and processed by several cascaded residual blocks to obtain mixed features;
the image decoder comprises several upsampling layers and concatenation layers, where each upsampling layer is followed by a concatenation layer and the last concatenation layer is connected to the output layer; the contour decoder consists of several upsampling layers, with the last upsampling layer connected to the output layer; the mixed features are input to the image decoder, and after each upsampling layer the resulting features are concatenated with the same-size features computed by the image encoder, yielding the coarse synthesized face image; the mixed features are also input to the contour decoder to obtain a synthesized face contour image.
4. The face emotion synthesis method according to claim 1, wherein the second-stage convolutional neural network comprises several residual blocks and a convolutional layer; the adjusted face image is subtracted from the coarse synthesized face image to obtain a residual image; the residual image is input to several cascaded residual blocks and finally processed by a convolutional layer to obtain the image mask.
5. The face emotion synthesis method according to claim 1, characterized in that in step S106, the final corrected synthesized face image satisfies the following relation:
I=Isrc*(1-Mask)+Isyn*Mask
where I is the final corrected synthesized face image, Isrc is the adjusted face image, Isyn is the coarse synthesized face image, and Mask is the image mask.
6. A facial emotion synthesizing device, characterized by comprising:
the image acquisition module is used for acquiring a color image of the current frame;
the face extraction module is used for extracting a face image from the color image of the current frame and adjusting the size of the face image;
the contour extraction module is used for detecting a plurality of key point coordinates from the face image and drawing the face contour image;
the coarse synthesis module, for processing the adjusted face image, the face contour image, and a target emotion label with the first-stage convolutional neural network to obtain a coarse synthesized face image, wherein the target emotion label specifies the desired emotion of the coarse synthesized face image;
and the correction module, for processing the residual between the adjusted face image and the coarse synthesized face image with the second-stage convolutional neural network to obtain an image mask, and computing the final corrected synthesized face image from the image mask.
7. The facial emotion synthesis apparatus of claim 6, wherein said facial extraction module comprises:
an extraction unit, for extracting the face image from the current frame color image;
an adjusting unit, for adjusting the face image to a preset size.
8. The facial emotion synthesis device as recited in claim 6, said contour extraction module comprising:
a detection unit, for detecting key point coordinates in the adjusted face image;
a drawing unit, for creating a blank image of the preset size and drawing the contours of the corresponding facial parts according to the key point coordinates.
9. The facial emotion synthesis device as recited in claim 6, said correction module comprising:
a residual calculation unit, for computing the residual image between the coarse synthesized face image and the adjusted face image;
a prediction unit, for processing the residual image with the second-stage convolutional neural network and predicting the image mask;
a correction unit, for computing the final corrected synthesized face image from the coarse synthesized face image, the adjusted face image, and the predicted image mask.
10. The facial emotion synthesis device as recited in claim 6, wherein said first level convolutional neural network comprises an image encoder, a contour encoder, an image decoder, and a contour decoder;
the image encoder and the contour encoder each consist of several downsampling layers, and the outputs of the image encoder and the contour encoder are connected in sequence to a concatenation layer and several cascaded residual blocks;
the image decoder comprises several upsampling layers and concatenation layers, where each upsampling layer is followed by a concatenation layer and the last concatenation layer is connected to the output layer; the contour decoder consists of several upsampling layers, with the last upsampling layer connected to the output layer;
the second-stage convolutional neural network is formed by sequentially connecting an input layer, a plurality of cascaded residual blocks, a convolutional layer and an output layer.
CN202010095755.XA (priority and filing date 2020-02-17): Face emotion synthesis method and device. Status: Active. Granted as CN111461959B.

Priority Applications (1)

CN202010095755.XA (priority date 2020-02-17, filing date 2020-02-17): Face emotion synthesis method and device

Applications Claiming Priority (1)

CN202010095755.XA (priority date 2020-02-17, filing date 2020-02-17): Face emotion synthesis method and device

Publications (2)

CN111461959A: published 2020-07-28
CN111461959B: granted and published 2023-04-25

Family

ID=71680899

Family Applications (1)

CN202010095755.XA (Active), priority and filing date 2020-02-17: Face emotion synthesis method and device

Country Status (1)

CN: CN111461959B

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112101320A *, priority date 2020-11-18, published 2020-12-18, 北京世纪好未来教育科技有限公司: Model training method, image generation method, device, equipment and storage medium


Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number; Priority date; Publication date; Assignee; Title
CN107067429A *; 2017-03-17; 2017-08-18; 徐迪; Video editing system and method for deep-learning-based face three-dimensional reconstruction and face replacement
US10552977B1 *; 2017-04-18; 2020-02-04; Twitter, Inc.; Fast face-morphing using neural networks
CN108460812A *; 2018-04-04; 2018-08-28; 北京红云智胜科技有限公司; Deep-learning-based expression pack generation system and method
CN109087379A *; 2018-08-09; 2018-12-25; 北京华捷艾米科技有限公司; Facial expression transfer method and facial expression transfer device
CN109151340A *; 2018-08-24; 2019-01-04; 太平洋未来科技(深圳)有限公司; Video processing method and device, and electronic equipment
CN109840477A *; 2019-01-04; 2019-06-04; 苏州飞搜科技有限公司; Occluded face recognition method and device based on feature transformation
CN110046551A *; 2019-03-18; 2019-07-23; 中国科学院深圳先进技术研究院; Generation method and equipment for a face recognition model
CN110427867A *; 2019-07-30; 2019-11-08; 华中科技大学; Facial expression recognition method and system based on a residual attention mechanism

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
井长兴; 章东平; 杨力. Research on face key point localization with cascaded neural networks (级联神经网络人脸关键点定位研究). Journal of China Jiliang University, 2018, No. 2, pp. 81-87. *

Also Published As

CN111461959A: published 2020-07-28

Similar Documents

Publication; Title
CN110490896B (en) Video frame image processing method and device
US8655152B2 (en) Method and system of presenting foreign films in a native language
CN110659573B (en) Face recognition method and device, electronic equipment and storage medium
CN110033463B (en) Foreground data generation and application method thereof, and related device and system
CN109117753B (en) Part recognition method, device, terminal and storage medium
CN111768388A (en) Product surface defect detection method and system based on positive sample reference
WO2023066173A1 (en) Image processing method and apparatus, and storage medium and electronic device
CN111988657A (en) Advertisement insertion method and device
CN113808005A (en) Video-driving-based face pose migration method and device
CN111461959B (en) Face emotion synthesis method and device
CN115471886A (en) Digital person generation method and system
CN115731597A (en) Automatic segmentation and restoration management platform and method for mask image of face mask
Mattos et al. Multi-view mouth renderization for assisting lip-reading
KR101124560B1 (en) Automatic object processing method in movie and authoring apparatus for object service
CN110062132B (en) Theater performance reconstruction method and device
CN114898447B (en) Personalized fixation point detection method and device based on self-attention mechanism
CN114943746A (en) Motion migration method utilizing depth information assistance and contour enhancement loss
CN110569707A (en) identity recognition method and electronic equipment
CN113766130B (en) Video shooting method, electronic equipment and device
CN113256541B (en) Method for removing water mist from drilling platform monitoring picture by machine learning
US20220207261A1 (en) Method and apparatus for detecting associated objects
CN114627404A (en) Intelligent video character replacing method and system
Liu et al. Image inpainting algorithm based on KSVD and improved CDD
CN112232302A (en) Face recognition method
Zeng et al. Highly fluent sign language synthesis based on variable motion frame interpolation

Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant