CN116681579A - Real-time video face replacement method, medium and system

Info

Publication number
CN116681579A
CN116681579A (application CN202310446852.2A)
Authority
CN
China
Prior art keywords
face
target
face image
image
mouth shape
Prior art date
Legal status
Pending
Application number
CN202310446852.2A
Other languages
Chinese (zh)
Inventor
周安斌
晏武志
潘见见
郑建华
Current Assignee
Shandong Jindong Digital Creative Co ltd
Original Assignee
Shandong Jindong Digital Creative Co ltd
Priority date
Filing date
Publication date
Application filed by Shandong Jindong Digital Creative Co ltd filed Critical Shandong Jindong Digital Creative Co ltd
Priority to CN202310446852.2A
Publication of CN116681579A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/26 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168 Feature extraction; Face representation
    • G06V40/172 Classification, e.g. identification
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT]
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The invention provides a real-time video face replacement method, medium and system in the technical field of video processing. The real-time video face replacement method comprises the following steps: extracting each frame image from a live video stream and determining the position and size of the original face image; selecting, according to facial expression similarity, a target face image and a target mouth shape image corresponding to the original face image from a target face set; performing face alignment on the target face image; performing face segmentation on the target face image and extracting the contour and feature region of the face; performing color correction on the original face image and the target face image; splicing the contour and feature region of the target face image with the corresponding parts of the original face image to generate a basic replacement image; performing mouth shape adjustment on the basic replacement image using the target mouth shape image to obtain the replaced face image; and embedding the replaced face image into the live video stream to produce a real-time video face replacement effect.

Description

Real-time video face replacement method, medium and system
Technical Field
The invention belongs to the technical field of video processing, and particularly relates to a real-time video face replacement method, medium and system.
Background
Video face replacement is a technology that replaces a face in a video picture with another face image; it can be used in fields such as entertainment, education and security. The difficulty lies in achieving a high-quality, efficient and high-fidelity replacement while maintaining the consistency and naturalness of the video picture.
Several methods for video face replacement have been proposed, such as deep-learning-based, optimization-based and three-dimensional-model-based methods. Deep-learning-based methods generally require large amounts of training data and computing resources and are prone to over-fitting or distortion; optimization-based methods generally need to solve a complex optimization problem and struggle with large changes in facial pose and expression; three-dimensional-model-based methods generally require accurate reconstruction of the 3D face model and texture and find it difficult to maintain facial detail and illumination consistency. For example, Chinese patent publication No. CN114005156A (application No. CN202111186185.6) discloses a face replacement method, system, terminal device and computer storage medium, in which the face replacement method is as follows: determining the target face to be replaced while the video picture is playing in real time; acquiring the weight grades of the person actions associated with the target face and optimizing a face feature extraction model based on these weight grades; extracting the first facial features and feature combination relation of the target face with the optimized model; and replacing the face to be replaced in the video picture according to the first facial features and the feature combination relation.
In the prior art, every frame of the video stream must be analyzed and replaced during face replacement, which entails a large computational load and places high demands on hardware.
Disclosure of Invention
In view of the above, the present invention provides a real-time video face replacement method, medium and system that address the technical problems of heavy computation and high hardware requirements caused by analyzing and replacing every frame of the video stream during face replacement.
The invention is realized as follows:
the first aspect of the present invention provides a real-time video face replacement method, which includes the following steps:
s10, extracting an image of each frame from a live video stream, performing face detection on the image, and determining the position and the size of an original face image serving as the original face image;
s20, selecting a target face image corresponding to the original face image from a preset target face set according to the facial expression similarity;
s30, selecting a target mouth shape image corresponding to the mouth shape in the original face image from a preset target mouth shape set according to the mouth shape similarity, wherein the target mouth shape set is the mouth shape of all phonemes of a person corresponding to the target face, the mouth shape has marked mouth shape key points with the numbers of 49-68 according to a Dlib algorithm, the mouth shape key points are marked as target mouth shape key points, and a target mouth shape vector K is established according to the target mouth shape key points;
s40, carrying out face alignment on the target face image to enable the pose and the expression of the target face image to be matched with those of the original face image;
s50, carrying out face segmentation on the target face image, and extracting a contour and a characteristic region of the face, wherein the characteristic region is a region corresponding to points numbered 18-48 in a Dlib algorithm;
s60, carrying out color correction on the original face image and the target face image, so that the brightness and the tone of the original face image and the target face image are kept consistent;
s70, splicing the outline and the characteristic area of the target face image with the corresponding part of the original face image to generate a replaced basic replacement image;
s80, performing mouth shape adjustment on the basic replacement image by using the target mouth shape image to obtain a replaced face image;
s90, embedding the replaced face image into the live video stream to form a real-time video face replacement effect.
Based on the technical scheme, the real-time video face replacement method can be improved as follows:
the preset target face set is a face image set of various angles and expressions of a target face to be replaced, the face image is marked with face key points with numbers of 1-48 according to a Dlib algorithm, the face key points are marked as a target key point set, and a target vector A is established according to the target key point set.
Further, the step of selecting a target face image corresponding to the original face image from the target face set according to the facial expression similarity specifically includes:
marking the face key points with the numbers of 1-48 on the original face image according to a Dlib algorithm, and marking the face key points as an original key point set;
establishing an original vector B according to the original key point set;
calculating, for each face image in the target face set, the similarity S1 between its vector A and the original vector B;
and selecting the face image corresponding to the largest S1 as a target face image.
The step of selecting, from a preset target mouth shape set and according to mouth shape similarity, a target mouth shape corresponding to the mouth shape in the original face image specifically comprises the following steps:
marking face key points with the numbers of 49-68 on the original face image according to a Dlib algorithm, and marking the face key points as an original mouth shape key point set;
establishing an original mouth shape vector C according to the original mouth shape key point set;
calculating, for each target mouth shape in the target mouth shape set, the similarity S2 between its vector K and the original mouth shape vector C;
and selecting the target mouth shape corresponding to the largest S2 as a target mouth shape image.
The step of performing face alignment on the target face image so that its pose and expression match those of the original face image specifically comprises the following steps:
adopting a face alignment algorithm based on deep learning, and realizing the adjustment of the pose and the expression of a target face image by training a face pose estimation network and a face expression generation network;
the step of carrying out face segmentation on the target face image and extracting the outline and the characteristic area of the face comprises the following steps:
adopting a face segmentation algorithm based on deep learning, and realizing extraction of the contour and the characteristic region of the target face image by training a semantic segmentation network;
the step of performing color correction on the original face image and the target face image specifically includes:
adopting a color correction algorithm based on deep learning, and realizing color matching of an original face image and a target face image by training a color conversion network;
the step of embedding the replaced face image into the live video stream specifically comprises the following steps:
the embedding algorithm based on deep learning is adopted, and natural embedding of the replaced face image and the original video frame is realized by training a visual attention network.
The step of performing the mouth shape adjustment on the basic replacement image by using the target mouth shape image to obtain a replaced face image specifically includes:
marking basic mouth shape key points in the basic replacement image, wherein the basic mouth shape key points are key points with the numbers of 49-68 in a Dlib algorithm;
establishing grids for the basic replacement images according to basic mouth shape key points;
and according to the corresponding relation of the numbers, moving the coordinates of the basic mouth shape key points to the corresponding coordinate positions of the key points of the target mouth shape image, and correspondingly adjusting the grids to obtain the replaced face image.
The acquisition mode of the preset target face set is as follows:
the facial expression and the expression of a corresponding target person of a target face are acquired by using three high-definition cameras for recording, and the three high-definition cameras are arranged in the following way: the first high-definition camera is right opposite to the front of the face of the target person, and the other two high-definition cameras are respectively positioned at the left side and the right side of the face of the target person; the high-definition camera is a camera with resolution of 4K.
The mouth shapes of all phonemes are the mouth shapes corresponding to all phonemes in a rule-based phoneme-driven method.
A second aspect of the present invention provides a computer readable storage medium, wherein the computer readable storage medium has stored therein program instructions, which when executed, are configured to perform a real-time video face replacement method as described above.
A third aspect of the present invention provides a real-time video face replacement system comprising a computer readable storage medium as described above.
Compared with the prior art, the real-time video face replacement method, medium and system provided by the invention have the following beneficial effects: the face is divided into a contour, a feature region and a mouth region using the face key points of the Dlib algorithm; a target face image corresponding to the original face image is selected from the target face set according to facial expression similarity, and a target mouth shape image corresponding to the mouth shape in the original face image is selected from the target mouth shape set according to mouth shape similarity, so that the target face image and target mouth shape image most similar to the original face image are obtained. Because the images used for splicing are highly similar to the original face image, the splicing requires little computation and consumes few hardware resources;
in addition, because the mouth moves relatively strongly during speech, direct replacement tends to make the shape and movement of the mouth look unnatural when the person is speaking. Therefore, adjusting the mouth of the basic replacement image with the most similar target mouth shape, retrieved from the preset target mouth shape set of the person corresponding to the target face, makes the mouth of the face-replaced image smoother and more natural.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed in the description of the embodiments are briefly introduced below. It is obvious that the drawings in the following description show only some embodiments of the present invention, and other drawings may be obtained from them by a person skilled in the art without inventive effort.
FIG. 1 is a flow chart of a method for replacing a real-time video face according to a first aspect of the present invention;
FIG. 2 is a diagram of the face recognition key points in the Dlib algorithm.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments will be described clearly and completely below with reference to the accompanying drawings. It is apparent that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the invention without inventive effort fall within the scope of the invention.
Thus, the following detailed description of the embodiments of the invention, as presented in the figures, is not intended to limit the scope of the invention as claimed, but merely represents selected embodiments of the invention.
It should be noted that: like reference numerals and letters denote like items in the following figures, and thus once an item is defined in one figure, no further definition or explanation thereof is necessary in the following figures.
In the description of the present invention, it should be understood that terms such as "center", "longitudinal", "lateral", "length", "width", "thickness", "upper", "lower", "front", "rear", "left", "right", "vertical", "horizontal", "top", "bottom", "inner", "outer", "clockwise" and "counterclockwise" indicate orientations or positional relationships based on those shown in the drawings, are used merely for convenience in describing the present invention and simplifying the description, and do not indicate or imply that the apparatus or elements referred to must have a specific orientation or be configured and operated in a specific orientation; they should therefore not be construed as limiting the present invention.
Furthermore, the terms "first," "second," and the like, are used for descriptive purposes only and are not to be construed as indicating or implying a relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include one or more such feature. In the description of the present invention, the meaning of "a plurality" is two or more, unless explicitly defined otherwise.
As shown in fig. 1, a flowchart of a real-time video face replacement method is provided in a first aspect of the present invention, and the method includes the following steps:
s10, extracting an image of each frame from a live video stream, performing face detection on the image, and determining the position and the size of an original face image serving as the original face image;
s20, selecting a target face image corresponding to the original face image from a preset target face set according to the facial expression similarity;
s30, selecting a target mouth shape image corresponding to the mouth shape in the original face image from a target mouth shape set according to the mouth shape similarity in the preset target mouth shape set, wherein the target mouth shape set is the mouth shape of all phonemes of a person corresponding to the target face, the mouth shape is marked with mouth shape key points with the numbers of 49-68 according to a Dlib algorithm, marked as target mouth shape key points, and a target mouth shape vector K is established according to the target mouth shape key points;
s40, carrying out face alignment on the target face image to enable the pose and the expression of the target face image to be matched with those of the original face image;
s50, carrying out face segmentation on the target face image, and extracting a contour and a characteristic region of the face, wherein the characteristic region is a region corresponding to points numbered 18-48 in a Dlib algorithm;
s60, carrying out color correction on the original face image and the target face image, so that the brightness and the tone of the original face image and the target face image are kept consistent;
s70, splicing the outline and the characteristic area of the target face image with the corresponding part of the original face image to generate a replaced basic replacement image;
s80, performing mouth shape adjustment on the basic replacement image by using the target mouth shape image to obtain a replaced face image;
s90, embedding the replaced face image into the live video stream to form a real-time video face replacement effect.
The face recognition key points of the Dlib algorithm are shown in fig. 2, where the numbers are the key point numbers;
Step S10 may be implemented as follows: the video stream is acquired using the cv2.VideoCapture function in OpenCV, and each frame image is read in a loop. Each frame is then passed to the face detection model, which returns the position and size of the face bounding box. Finally, the face position and size information is stored in a list and returned.
In the above technical solution, the preset target face set is a set of face images of the target face to be replaced at various angles and with various expressions; each face image is annotated, according to the Dlib algorithm, with the face key points numbered 1-48, recorded as the target key point set, and a target vector A is established from the target key point set.
Further, in the above technical solution, the step of selecting the target face image corresponding to the original face image from the target face set according to the facial expression similarity specifically includes:
marking face key points with the numbers of 1-48 on an original face image according to a Dlib algorithm, and marking the face key points as an original key point set;
establishing an original vector B according to the original key point set;
calculating, for each face image in the target face set, the similarity S1 between its vector A and the original vector B;
and selecting the face image corresponding to the largest S1 as a target face image.
S1 is computed as the cosine similarity between vectors A and B.
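A hedged sketch of this computation, assuming Dlib's 68-point shape predictor (the shape_predictor_68_face_landmarks.dat model file, downloaded separately) and that the vectors A and B are the flattened, mean-centered coordinates of the selected key points:

```python
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def landmark_vector(image, first=0, last=48):
    """Flatten key points first..last-1 (0-based; the patent's points 1-48)."""
    rect = detector(image, 1)[0]            # assume one face per image
    shape = predictor(image, rect)
    pts = np.array([(shape.part(i).x, shape.part(i).y)
                    for i in range(first, last)], dtype=float)
    pts -= pts.mean(axis=0)                 # remove translation before comparing
    return pts.ravel()

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# S1: pick the target face whose vector A is most similar to the original B
# best = max(target_images, key=lambda t: cosine_similarity(
#     landmark_vector(t), landmark_vector(original_image)))
```

The same routine with first=48, last=68 yields the mouth shape vectors C and K used for the S2 computation below.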
In the above technical solution, the step of selecting, from the preset target mouth shape set and according to mouth shape similarity, the target mouth shape corresponding to the mouth shape in the original face image specifically includes:
marking face key points with the numbers of 49-68 on an original face image according to a Dlib algorithm, and marking the face key points as an original mouth shape key point set;
establishing an original mouth shape vector C according to the original mouth shape key point set;
calculating, for each target mouth shape in the target mouth shape set, the similarity S2 between its vector K and the original mouth shape vector C;
and selecting the target mouth shape corresponding to the largest S2 as a target mouth shape image.
S2 is computed as the cosine similarity between vectors K and C.
In the above technical solution, the step of performing face alignment on the target face image so that its pose and expression match those of the original face image specifically includes:
adopting a face alignment algorithm based on deep learning, and realizing the adjustment of the pose and the expression of a target face image by training a face pose estimation network and a face expression generation network;
the method comprises the steps of carrying out face segmentation on a target face image and extracting the outline and the characteristic area of the face, and specifically comprises the following steps:
adopting a face segmentation algorithm based on deep learning, and realizing extraction of the contour and the characteristic region of the target face image by training a semantic segmentation network;
the method comprises the following steps of carrying out color correction on an original face image and a target face image:
adopting a color correction algorithm based on deep learning, and realizing color matching of an original face image and a target face image by training a color conversion network;
embedding the replaced face image into a live video stream, specifically:
the embedding algorithm based on deep learning is adopted, and natural embedding of the replaced face image and the original video frame is realized by training a visual attention network.
The face alignment algorithm used in this step is based on the 3DMM (3D Morphable Model) and can be implemented as follows:
First, key point detection is performed on the face in the original face image to obtain the positions of feature points such as the eyes, nose and mouth. Then, following the 3DMM principle, the 3D pose and expression information of the original face is computed from the feature point positions and the standard face model in the 3DMM. Finally, the corresponding 3D pose and expression information of the target face image is computed in the same way, and the target face image is deformed, rotated and translated to match the pose and expression of the original face image.
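A full 3DMM fit is beyond a short example; the following hedged sketch recovers only the 3D head pose from six 2D landmarks using OpenCV's solvePnP, with an illustrative generic 3D template in place of a real 3DMM mean shape:

```python
import cv2
import numpy as np

# rough 3D template (in mm): nose tip, chin, left/right eye corner,
# left/right mouth corner -- illustrative values, not a real 3DMM shape
MODEL_3D = np.array([
    (0.0, 0.0, 0.0), (0.0, -330.0, -65.0),
    (-225.0, 170.0, -135.0), (225.0, 170.0, -135.0),
    (-150.0, -150.0, -125.0), (150.0, -150.0, -125.0)], dtype=np.float64)

def head_pose(landmarks_2d, frame_size):
    """landmarks_2d: (6, 2) float array of the matching 2D points."""
    h, w = frame_size
    # crude pinhole camera: focal length approximated by the image width
    cam = np.array([[w, 0, w / 2], [0, w, h / 2], [0, 0, 1]], dtype=np.float64)
    ok, rvec, tvec = cv2.solvePnP(MODEL_3D, landmarks_2d, cam, None,
                                  flags=cv2.SOLVEPNP_ITERATIVE)
    # rvec/tvec give the rotation and translation used to re-pose the target face
    return rvec, tvec
```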
Face segmentation can also use the cv2.grabCut function in OpenCV: the face area is taken as the foreground, the image is divided into foreground and background, and uncertain pixels are corrected iteratively. After the iterations finish, the foreground pixels are extracted and a new face image is generated from them. Finally, the resulting face image is stored in a list as input for subsequent processing.
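A sketch of this grabCut variant, assuming the initial rectangle (x, y, w, h) comes from the face detector of step S10:

```python
import cv2
import numpy as np

def segment_face(image, face_rect, iterations=5):
    mask = np.zeros(image.shape[:2], np.uint8)
    bgd = np.zeros((1, 65), np.float64)     # background model buffer
    fgd = np.zeros((1, 65), np.float64)     # foreground model buffer
    cv2.grabCut(image, mask, face_rect, bgd, fgd,
                iterations, cv2.GC_INIT_WITH_RECT)
    # definite + probable foreground pixels form the face region
    fg = np.where((mask == cv2.GC_FGD) | (mask == cv2.GC_PR_FGD), 1, 0)
    return image * fg.astype("uint8")[:, :, None]
```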
The color correction algorithm adopts a method based on the gray world assumption, implemented as follows:
First, the average brightness values of the original face image and the target face image are computed to obtain the gray mean of each image. Then, scale factors for the red, green and blue channels of the two images are computed so that the two images match in brightness. Finally, operations such as translation and scaling are applied so that the two images match as closely as possible in hue.
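A sketch of the per-channel gain computation under the gray world assumption (the final translation/scaling hue step is omitted for brevity):

```python
import numpy as np

def gray_world_match(target_face, original_face):
    """Scale the target face's B, G, R channels so its mean color
    matches the original face's mean color."""
    t = target_face.astype(np.float64)
    gains = (original_face.reshape(-1, 3).mean(axis=0)
             / (t.reshape(-1, 3).mean(axis=0) + 1e-6))
    return np.clip(t * gains, 0, 255).astype("uint8")
```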
Face replacement can also use the Seamless Cloning method, implemented as follows:
First, image fusion processing is performed on the original face image and the target face image so that the edges of the two images connect smoothly. Then, the target face image is fused with the corresponding part of the original face image, according to its contour and feature region, using the Seamless Cloning method. Finally, the resulting replaced face image is stored in a list as input for subsequent processing.
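A sketch of this step with OpenCV's cv2.seamlessClone, assuming mask is the segmented face region (same size as the target face patch) and center is the location of the original face in the frame:

```python
import cv2

def blend_face(target_face, frame, mask, center):
    # NORMAL_CLONE keeps the target face's texture while matching the
    # illumination of the surrounding frame along the mask boundary
    return cv2.seamlessClone(target_face, frame, mask, center, cv2.NORMAL_CLONE)
```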
In the above technical solution, the step of performing the mouth shape adjustment on the basic replacement image by using the target mouth shape image to obtain the replaced face image specifically includes:
marking basic mouth shape key points in the basic replacement image, wherein the basic mouth shape key points are key points numbered 49-68 in a Dlib algorithm;
establishing grids for the basic replacement image according to the basic mouth shape key points;
and according to the correspondence of the numbers, moving the coordinates of the basic mouth shape key points to the coordinate positions of the corresponding key points of the target mouth shape image, and adjusting the grid accordingly to obtain the replaced face image (a sketch of this warp is given below).
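One illustrative realisation of this mesh warp uses scikit-image's piecewise affine transform, assuming the target mouth key points have already been mapped into the base image's coordinate frame and that the 20 points (Dlib 49-68) are matched index by index; the image corners are pinned so the rest of the face stays put:

```python
import numpy as np
from skimage.transform import PiecewiseAffineTransform, warp

def adjust_mouth(base_image, base_mouth_pts, target_mouth_pts):
    """base_mouth_pts / target_mouth_pts: (20, 2) (x, y) arrays for points 49-68."""
    h, w = base_image.shape[:2]
    corners = np.array([[0, 0], [w - 1, 0], [0, h - 1], [w - 1, h - 1]], float)
    tform = PiecewiseAffineTransform()
    # warp() pulls pixels via the inverse map, so estimate target -> base
    tform.estimate(np.vstack([target_mouth_pts, corners]),
                   np.vstack([base_mouth_pts, corners]))
    warped = warp(base_image, tform, output_shape=base_image.shape)
    return (warped * 255).astype("uint8")   # warp() returns floats in [0, 1]
```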
In the above technical solution, the preset target face set is obtained by:
the facial expression and the expression of the corresponding target person of the target face are acquired by using three high-definition cameras for recording, and the arrangement modes of the three high-definition cameras are as follows: the first high-definition camera is right opposite to the front of the face of the target person, and the other two high-definition cameras are respectively positioned at the left side and the right side of the face of the target person; the high-definition camera is a camera with resolution of 4K.
In the above technical solution, the mouth shapes of all phonemes are mouth shapes corresponding to all phonemes in the regular method of phoneme driving.
It should be noted that any known face detection algorithm may be used in the present invention, for example a deep-learning-based algorithm such as MTCNN or RetinaFace, or an algorithm based on conventional features such as Haar features or HOG features. The invention can dynamically adjust the face detection parameters according to the resolution and frame rate of the video stream to improve detection speed and accuracy. The detected faces can also be screened to remove faces that do not meet the replacement conditions, such as occluded, blurred or profile faces. Detected faces can also be tracked to maintain the continuity and stability of the replacement.
For example, the invention may adopt a deep-learning-based face detection algorithm such as MTCNN (Multi-task Cascaded Convolutional Neural Networks), which can rapidly and accurately detect the position and size of a face, together with its five key points (left eye, right eye, nose tip, left mouth corner and right mouth corner), on multi-scale images. The algorithm is implemented as follows:
firstly, the input image is scaled to different sizes to form an image pyramid;
then, each layer of the image pyramid is input into the first convolutional neural network (P-Net), which outputs a binary classification heat map and a bounding box regression heat map, indicating respectively whether each pixel belongs to a face and the approximate position of the face;
next, non-maximum suppression (NMS) is applied to the output of P-Net to remove overlapping candidate regions, and the remaining candidates are input to the second convolutional neural network (R-Net), which further classifies and regresses them, improving face detection accuracy;
finally, NMS is applied to the output of R-Net, and the remaining candidate regions are input to the third convolutional neural network (O-Net), which outputs the final face bounding box and the coordinates of the five key points.
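A hedged sketch of invoking such a cascade through the third-party mtcnn PyPI package, which bundles pretrained P-Net/R-Net/O-Net weights (the package and its API are an assumption of this example, not part of the invention):

```python
import cv2
from mtcnn import MTCNN

detector = MTCNN()                          # loads the three-stage cascade

def detect_faces(frame_bgr):
    rgb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)   # MTCNN expects RGB
    faces = detector.detect_faces(rgb)
    # each result carries the bounding box and the five key points
    # (left_eye, right_eye, nose, mouth_left, mouth_right)
    return [(f["box"], f["keypoints"]) for f in faces]
```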
The present invention may employ any known face alignment algorithm, such as a deep-learning-based algorithm, e.g. the Face Alignment Network (FAN), or a conventional feature-based algorithm, e.g. the Active Shape Model (ASM). The invention can also compute an affine transformation matrix between the original and target face images from their feature point coordinates and apply the corresponding rotation, scaling and translation to the target face image, bringing its pose and expression as close as possible to the original face image.
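A sketch of this affine variant with OpenCV; estimateAffinePartial2D restricts the fit to rotation, uniform scale and translation between the matched landmark sets:

```python
import cv2
import numpy as np

def align_target_to_original(target_img, target_pts, original_pts, out_size):
    """out_size is (width, height) of the original frame or face crop."""
    M, _ = cv2.estimateAffinePartial2D(np.asarray(target_pts, np.float32),
                                       np.asarray(original_pts, np.float32))
    # warp the target face so its landmarks land on the original's landmarks
    return cv2.warpAffine(target_img, M, out_size)
```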
The invention may adopt any known face segmentation algorithm, such as a deep-learning-based semantic segmentation algorithm, e.g. U-Net or DeepLab, or a traditional edge detection algorithm, e.g. Canny edge detection. The invention can also determine the corresponding areas between the original and target face images from their feature point coordinates and extract those areas from the target face image as components of the replaced face image. These areas include the eyes, eyebrows, nose, mouth, ears, hair, etc.
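A sketch of cutting one such area out of the target face with a convex-hull mask over the corresponding Dlib key points (an illustrative simplification; e.g. points 18-48 for the feature region of step S50):

```python
import cv2
import numpy as np

def extract_region(image, landmarks, first, last):
    """landmarks: list of (x, y); first/last select e.g. points 18-48."""
    pts = np.array(landmarks[first:last], np.int32)
    mask = np.zeros(image.shape[:2], np.uint8)
    cv2.fillConvexPoly(mask, cv2.convexHull(pts), 255)
    region = cv2.bitwise_and(image, image, mask=mask)
    return region, mask                     # mask reusable for later blending
```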
A second aspect of the present invention provides a computer readable storage medium, wherein the computer readable storage medium has stored therein program instructions, which when executed, are configured to perform a real-time video face replacement method as described above.
A third aspect of the present invention provides a real-time video face replacement system comprising a computer readable storage medium as described above.
The foregoing is merely illustrative of the present invention and is not intended to limit it; any variation or substitution that a person skilled in the art would readily conceive falls within the scope of the present invention. Therefore, the protection scope of the invention is subject to the protection scope of the claims.

Claims (10)

1. The real-time video face replacement method is characterized by comprising the following steps of:
s10, extracting an image of each frame from a live video stream, performing face detection on the image, and determining the position and the size of an original face image serving as the original face image;
s20, selecting a target face image corresponding to the original face image from a preset target face set according to the facial expression similarity;
s30, selecting a target mouth shape image corresponding to the mouth shape in the original face image from a preset target mouth shape set according to the mouth shape similarity, wherein the target mouth shape set is the mouth shape of all phonemes of a person corresponding to the target face, the mouth shape has marked mouth shape key points with the numbers of 49-68 according to a Dlib algorithm, the mouth shape key points are marked as target mouth shape key points, and a target mouth shape vector K is established according to the target mouth shape key points;
s40, carrying out face alignment on the target face image to enable the pose and the expression of the target face image to be matched with those of the original face image;
s50, carrying out face segmentation on the target face image, and extracting a contour and a characteristic region of the face, wherein the characteristic region is a region corresponding to points numbered 18-48 in a Dlib algorithm;
s60, carrying out color correction on the original face image and the target face image, so that the brightness and the tone of the original face image and the target face image are kept consistent;
s70, splicing the outline and the characteristic area of the target face image with the corresponding part of the original face image to generate a replaced basic replacement image;
s80, performing mouth shape adjustment on the basic replacement image by using the target mouth shape image to obtain a replaced face image;
s90, embedding the replaced face image into the live video stream to form a real-time video face replacement effect.
2. The real-time video face replacement method according to claim 1, wherein the preset target face set is a set of face images of the target face to be replaced at various angles and with various expressions; each face image is annotated, according to the Dlib algorithm, with the face key points numbered 1-48, recorded as a target key point set, and a target vector A is established from the target key point set.
3. The method according to claim 2, wherein the step of selecting a target face image corresponding to the original face image from the target face set according to facial expression similarity comprises:
marking the face key points with the numbers of 1-48 on the original face image according to a Dlib algorithm, and marking the face key points as an original key point set;
establishing an original vector B according to the original key point set;
calculating, for each face image in the target face set, the similarity S1 between its vector A and the original vector B;
and selecting the face image corresponding to the largest S1 as a target face image.
4. The real-time video face replacement method according to claim 1, wherein the step of selecting, from a preset target mouth shape set and according to mouth shape similarity, a target mouth shape corresponding to the mouth shape in the original face image specifically comprises:
marking face key points with the numbers of 49-68 on the original face image according to a Dlib algorithm, and marking the face key points as an original mouth shape key point set;
establishing an original mouth shape vector C according to the original mouth shape key point set;
calculating, for each target mouth shape in the target mouth shape set, the similarity S2 between its vector K and the original mouth shape vector C;
and selecting the target mouth shape corresponding to the largest S2 as a target mouth shape image.
5. The real-time video face replacement method according to claim 1, wherein the step of performing face alignment on the target face image so that its pose and expression match those of the original face image specifically comprises the following steps:
adopting a face alignment algorithm based on deep learning, and realizing the adjustment of the pose and the expression of a target face image by training a face pose estimation network and a face expression generation network;
the step of carrying out face segmentation on the target face image and extracting the outline and the characteristic area of the face comprises the following steps:
adopting a face segmentation algorithm based on deep learning, and realizing extraction of the contour and the characteristic region of the target face image by training a semantic segmentation network;
the step of performing color correction on the original face image and the target face image specifically includes:
adopting a color correction algorithm based on deep learning, and realizing color matching of an original face image and a target face image by training a color conversion network;
the step of embedding the replaced face image into the live video stream specifically comprises the following steps:
the embedding algorithm based on deep learning is adopted, and natural embedding of the replaced face image and the original video frame is realized by training a visual attention network.
6. The method for replacing a real-time video face according to claim 1, wherein said step of performing a mouth shape adjustment on said basic replacement image using said target mouth shape image to obtain a replaced face image comprises:
marking basic mouth shape key points in the basic replacement image, wherein the basic mouth shape key points are key points with the numbers of 49-68 in a Dlib algorithm;
establishing grids for the basic replacement images according to basic mouth shape key points;
and according to the corresponding relation of the numbers, moving the coordinates of the basic mouth shape key points to the corresponding coordinate positions of the key points of the target mouth shape image, and correspondingly adjusting the grids to obtain the replaced face image.
7. The method for replacing a real-time video face according to claim 1, wherein the preset target face set is obtained by:
the facial expression and the expression of a corresponding target person of a target face are acquired by using three high-definition cameras for recording, and the three high-definition cameras are arranged in the following way: the first high-definition camera is right opposite to the front of the face of the target person, and the other two high-definition cameras are respectively positioned at the left side and the right side of the face of the target person; the high-definition camera is a camera with resolution of 4K.
8. The real-time video face replacement method according to claim 1, wherein the mouth shapes of all phonemes are the mouth shapes corresponding to all phonemes in a rule-based phoneme-driven method.
9. A computer readable storage medium, wherein program instructions are stored in the computer readable storage medium, which program instructions, when executed, are adapted to perform a real-time video face replacement method as claimed in any one of claims 1-8.
10. A real-time video face replacement system comprising a computer readable storage medium of claim 9.
CN202310446852.2A (priority date 2023-04-24, filing date 2023-04-24) Real-time video face replacement method, medium and system; status: Pending; publication: CN116681579A

Priority Applications (1)

Application Number: CN202310446852.2A
Publication: CN116681579A
Title: Real-time video face replacement method, medium and system


Publications (1)

Publication Number Publication Date
CN116681579A true CN116681579A (en) 2023-09-01

Family

ID=87779845

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310446852.2A Pending CN116681579A (en) 2023-04-24 2023-04-24 Real-time video face replacement method, medium and system

Country Status (1)

Country Link
CN (1) CN116681579A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114998508A (en) * 2022-01-24 2022-09-02 上海幻维数码创意科技股份有限公司 Video face expression generation method based on Dlib and OpenGL


Similar Documents

Publication Publication Date Title
US20220189142A1 (en) Ai-based object classification method and apparatus, and medical imaging device and storage medium
CN110807836B (en) Three-dimensional face model generation method, device, equipment and medium
CN112700523B (en) Virtual object face animation generation method and device, storage medium and terminal
US11900557B2 (en) Three-dimensional face model generation method and apparatus, device, and medium
US11836880B2 (en) Adjusting a digital representation of a head region
RU2358319C2 (en) Method and device for photorealistic three dimensional simulation of face based on image
WO2020103700A1 (en) Image recognition method based on micro facial expressions, apparatus and related device
CN110348496B (en) Face image fusion method and system
CN111243093A (en) Three-dimensional face grid generation method, device, equipment and storage medium
WO2021139557A1 (en) Portrait stick figure generation method and system, and drawing robot
TW202014984A (en) Image processing method, electronic device, and storage medium
US11562536B2 (en) Methods and systems for personalized 3D head model deformation
US11587288B2 (en) Methods and systems for constructing facial position map
US20210158593A1 (en) Pose selection and animation of characters using video data and training techniques
JP7462120B2 (en) Method, system and computer program for extracting color from two-dimensional (2D) facial images
CN111243051B (en) Portrait photo-based simple drawing generation method, system and storage medium
CN116681579A (en) Real-time video face replacement method, medium and system
CN112749611A (en) Face point cloud model generation method and device, storage medium and electronic equipment
KR20230110787A (en) Methods and systems for forming personalized 3D head and face models
CN115482339A (en) Face facial feature map generating method
CN114862716A (en) Image enhancement method, device and equipment for face image and storage medium
US11657573B2 (en) Automatic mesh tracking for 3D face modeling
CN110544200A (en) method for realizing mouth interchange between human and cat in video
CN116778549A (en) Synchronous face replacement method, medium and system in video recording process
CN116052240A (en) Post-makeup face changing method based on deep convolutional neural network and face key point detection

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination