CN116612518A - Facial expression capturing method, system, electronic equipment and medium - Google Patents

Facial expression capturing method, system, electronic equipment and medium

Info

Publication number
CN116612518A
CN116612518A
Authority
CN
China
Prior art keywords
expression
face
key point
facial expression
recognition frame
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310639831.2A
Other languages
Chinese (zh)
Inventor
赵翰琳 (Zhao Hanlin)
Current Assignee
Chongqing Zhongke Yuncong Technology Co ltd
Original Assignee
Chongqing Zhongke Yuncong Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Chongqing Zhongke Yuncong Technology Co ltd
Priority claimed from CN202310639831.2A
Publication of CN116612518A
Legal status: Pending


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/174 Facial expression recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161 Detection; Localisation; Normalisation
    • G06V40/165 Detection; Localisation; Normalisation using facial parts and geometric relationships

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Evolutionary Computation (AREA)
  • Human Computer Interaction (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Geometry (AREA)
  • Processing Or Creating Images (AREA)
  • Image Processing (AREA)

Abstract

The invention relates to the technical field of computer vision, and in particular to a facial expression capturing method, system, electronic device and medium, aiming to solve the problem of how to capture facial expressions with ordinary 2D (two-dimensional) camera equipment. To this end, the facial expression capturing method of the present invention includes: performing face detection on an acquired image to obtain a face recognition frame; performing key-point feature encoding based on the face recognition frame to obtain a plurality of key-point features representing the structural information of the facial expression; constructing, based on a preset set of expression bases, a mapping between the key-point features and the expression bases to obtain expression driving coefficients; and blending and superimposing the expression driving coefficients with a preset head model, then performing visual rendering to obtain a virtual face image. Because facial expression capture is performed on acquired images alone, face capture can be completed with an ordinary 2D camera, so the technology can be applied more simply, conveniently and quickly in fields such as live streaming, film and VR.

Description

Facial expression capturing method, system, electronic equipment and medium
Technical Field
The invention relates to the technical field of computer vision, and particularly provides a facial expression capturing method, a facial expression capturing system, electronic equipment and a medium.
Background
With the rise of the metaverse concept, industries such as digital humans, VR experiences and film animation have entered a stage of rapid development. In particular, with the emergence of new services such as virtual idols and virtual anchors, motion capture technology has been widely applied. One major application of motion capture is driving virtual digital humans, and facial expression capture is one of its important links. Through facial expression capture, a virtual human can make expressions as lifelike as a real person's and, through rich expression presentation, interact with humans more vividly and realistically.
Facial expression capture is widely used in film production; however, it is usually based on expensive equipment and specialized personnel, and driving a virtual human's expressions with such technology consumes considerable labor and material resources: dedicated and costly face-capture helmets, specialist animators, structured-light projection devices, positioning aids on the face, and so on. For most ordinary users this professional equipment is unobtainable, and marker-based capture, in which points are painted on the performer's face, is extremely unfriendly in terms of interaction: the face must be carefully dotted every time the virtual human's expression is driven, a process that is overly troublesome and tedious. As a result, facial expression capture technology has not been well popularized.
With the popularization of 2D cameras, 2D imaging devices are built into equipment that the general public can easily obtain, such as mobile phones and computers. However, existing facial expression capture cannot be completed with an ordinary 2D imaging device and can only be performed with professional face-capture equipment, which greatly limits the application of 2D imaging devices to facial expression capture.
Therefore, how to capture facial expressions with an ordinary 2D imaging device has become a problem to be solved in the development of facial expression capture technology.
Accordingly, there is a need in the art for a new facial expression capture scheme to address the above-described problems.
Disclosure of Invention
In order to overcome the above-mentioned drawbacks, the present invention provides a facial expression capturing method, system, electronic device and medium, so as to solve or at least partially solve the technical problem of how to capture facial expression by using a 2D image capturing device.
In a first aspect, the present invention provides a facial expression capturing method, comprising: performing face detection based on the acquired image to acquire a face recognition frame; performing key point feature coding based on the face recognition frame to obtain a plurality of key point features representing facial expression structural information; based on a preset expression base, constructing mapping relations between the key point features and the expression base to obtain an expression driving coefficient; and mixing and superposing the expression driving coefficient and a preset head model, and visually rendering to obtain a virtual face image.
In one technical scheme of the facial expression capturing method, performing face detection based on the acquired image, acquiring a face recognition frame includes: face detection is carried out based on the acquired image, and a face area is obtained; and expanding the detected face area to obtain a face recognition frame comprising the skull range.
In one technical scheme of the facial expression capturing method, performing key point feature encoding based on the face recognition frame, and obtaining a plurality of key point features representing facial expression structural information includes: obtaining a deep learning algorithm model; and carrying out feature coding on key points based on the face recognition frame and the deep learning algorithm model to obtain coordinate information of a plurality of key points.
In one technical scheme of the facial expression capturing method, performing feature encoding of key points based on the face recognition frame and the deep learning algorithm model, and obtaining coordinate information of a plurality of key points includes: converting the face recognition frame into a multi-channel neural network feature map through feature fusion, wherein each channel corresponds to a face key point; carrying out refinement treatment on each channel to obtain accurate coordinate information of each key point; and acquiring a key point feature map based on the accurate coordinate information of the key points.
In one technical scheme of the facial expression capturing method, based on a preset expression group, constructing a mapping relation between the plurality of key point features and the expression group, and obtaining an expression driving coefficient includes: acquiring a facial expression template with preset standards as an expression base; constructing mapping relations between the plurality of key point features and the expression base through a convolution layer and a full connection layer; based on the mapping relation, the expression driving coefficient with multiple dimensions is obtained.
In one technical scheme of the facial expression capturing method, mixing and superposing the expression driving coefficient and a preset head model, and visually rendering to obtain a virtual face image comprises the following steps: mixing and superposing the expression driving coefficient and a preset head model to obtain the head model with the expression; and performing visual rendering on the head model with the expression to obtain a virtual face image.
In one aspect of the above facial expression capturing method, the method further includes: before face detection is carried out on the obtained images, extracting image frames in the video to obtain a plurality of images; and/or after a plurality of virtual face images are formed based on the acquired images, combining the virtual face images to form a video.
In a second aspect, the present invention provides a facial expression capturing system, including a face detection module, a key point feature recognition module, an expression base mapping module, and a hybrid rendering module; the face detection module is configured to perform face detection based on the acquired image, and acquire a face recognition frame; the key point feature recognition module is configured to perform key point feature coding based on the face recognition frame to obtain a plurality of key point features representing facial expression structural information; the expression base mapping module is configured to construct mapping relations between the key point features and the expression base based on preset expression bases to obtain expression driving coefficients; the hybrid rendering module is configured to mix and superimpose the expression driving coefficient and a preset head model, and perform visual rendering to obtain a virtual face image.
In a third aspect, an electronic device is provided, the electronic device comprising a processor and a memory, the memory being adapted to store a plurality of program codes, the program codes being adapted to be loaded and run by the processor to perform the facial expression capturing method according to any one of the above-mentioned aspects of the facial expression capturing method.
In a fourth aspect, a computer readable storage medium is provided, in which a plurality of program codes are stored, the program codes being adapted to be loaded and executed by a processor to perform the facial expression capturing method according to any one of the above-mentioned aspects of the facial expression capturing method.
One or more of the above technical solutions of the present invention have at least one or more of the following beneficial effects:
in the technical solution of the present invention, facial expression capture is performed on the acquired image to obtain a virtual face image, so face capture can be completed with an ordinary 2D camera, allowing the face capture technology to be applied more simply, conveniently and quickly in fields such as live streaming, film and VR.
Drawings
The present disclosure will become more readily understood with reference to the accompanying drawings. As will be readily appreciated by those skilled in the art: the drawings are for illustrative purposes only and are not intended to limit the scope of the present invention. Moreover, like numerals in the figures are used to designate like parts, wherein:
FIG. 1 is a flow chart of the main steps of a facial expression capture method of one embodiment of the present invention;
FIG. 2 is a block diagram of the main structure of a facial expression capture system according to one embodiment of the invention;
fig. 3 is a main block diagram of an electronic device for performing the facial expression capturing method of the present invention.
List of reference numerals
21: a face detection module; 22: a key point feature identification module; 23: expression base mapping module; 24: and a hybrid rendering module.
Detailed Description
Some embodiments of the invention are described below with reference to the accompanying drawings. It should be understood by those skilled in the art that these embodiments are merely for explaining the technical principles of the present invention, and are not intended to limit the scope of the present invention.
In the description of the present invention, a "module," "processor" may include hardware, software, or a combination of both. A module may comprise hardware circuitry, various suitable sensors, communication ports, memory, or software components, such as program code, or a combination of software and hardware. The processor may be a central processor, a microprocessor, an image processor, a digital signal processor, or any other suitable processor. The processor has data and/or signal processing functions. The processor may be implemented in software, hardware, or a combination of both. Non-transitory computer readable storage media include any suitable medium that can store program code, such as magnetic disks, hard disks, optical disks, flash memory, read-only memory, random access memory, and the like. The term "a and/or B" means all possible combinations of a and B, such as a alone, B alone or a and B. The term "at least one A or B" or "at least one of A and B" has a meaning similar to "A and/or B" and may include A alone, B alone or A and B. The singular forms "a", "an" and "the" include plural referents.
Referring to fig. 1, fig. 1 is a flowchart of main steps of a facial expression capturing method according to an embodiment of the present invention. As shown in fig. 1, the facial expression capturing method in the embodiment of the present invention mainly includes the following steps S11 to S14.
Step S11, face detection is carried out based on the acquired image, and a face recognition frame is acquired.
Specifically, the method further comprises the step of extracting image frames in the video to obtain a plurality of images before face detection based on the obtained images.
In one embodiment, the image or video may be acquired with a 2D imaging device. In one embodiment of the present invention, one skilled in the art can perform facial expression capture by exporting each frame of a video as an image, or perform facial expression capture directly on a single image.
In one embodiment, the image may be a 2D face RGB image. An RGB image is a color image in which each pixel is formed from the three primary colors Red, Green and Blue.
Further, performing face detection based on the acquired image, acquiring a face recognition frame includes: face detection is carried out based on the acquired image, and a face area is obtained; and expanding the detected face area to obtain a face recognition frame comprising the skull range.
The face region obtained in the prior art generally crops a rectangular region containing only the facial features below the eyebrows, so sufficient key-point features cannot be obtained and subsequent motion capture of head rotation is impossible. This technical solution therefore expands the detected face region to obtain a face recognition frame covering the skull range, ensuring that enough key-point features are obtained and enlarging the face recognition frame fed into the subsequent deep learning algorithm model.
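As a concrete illustration of this expansion step, the sketch below widens a detected face box around its center so that it covers the skull range and clamps it to the image bounds. The function shape and the `scale` factor of 1.6 are assumptions for illustration; the patent does not specify a particular value.

```python
def expand_face_box(box, img_w, img_h, scale=1.6):
    """Expand a detected face box so it covers the full skull range.

    box: (x1, y1, x2, y2) detection rectangle.
    scale: assumed expansion factor (not specified in the patent).
    """
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    half_w = (x2 - x1) * scale / 2.0
    half_h = (y2 - y1) * scale / 2.0
    # Clamp to image bounds so the recognition frame stays valid.
    return (max(0.0, cx - half_w), max(0.0, cy - half_h),
            min(float(img_w), cx + half_w), min(float(img_h), cy + half_h))
```

A 100x100 detection centered at (150, 150) becomes a 160x160 recognition frame around the same center.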
And step S12, performing key point feature coding based on the face recognition frame to obtain a plurality of key point features representing facial expression structural information.
The key-point features are a series of key-point coordinates obtained by a face key-point detection algorithm and used to represent the expression state of the face. These key points usually lie in facial areas carrying semantic information, such as the angles and positions of the eyes, eyebrows and mouth. In expression recognition, changes in these key points characterize different expressions, such as mouth opening, mouth closing, eye opening and mouth-corner raising. By analyzing and processing these changes, the structured information of an expression can be extracted and used to classify and identify different expressions.
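The idea that key-point changes characterize an expression can be illustrated with a toy structural feature: a mouth-opening ratio computed from four hypothetical mouth key points. The point choice and formula are illustrative only and are not taken from the patent.

```python
import math

def mouth_open_ratio(upper_lip, lower_lip, mouth_left, mouth_right):
    """Toy structural feature: vertical lip gap divided by mouth width.

    All four inputs are (x, y) key-point coordinates; the specific
    points and the ratio itself are illustrative assumptions.
    """
    gap = math.dist(upper_lip, lower_lip)      # vertical lip opening
    width = math.dist(mouth_left, mouth_right)  # mouth corner distance
    return gap / width if width > 0 else 0.0
```

A larger ratio would correspond to a more open mouth; a classifier could threshold such features to recognize "mouth open" versus "mouth closed".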
Specifically, performing key point feature encoding based on the face recognition frame, and obtaining a plurality of key point features representing facial expression structural information includes: obtaining a deep learning algorithm model; and carrying out feature coding on key points based on the face recognition frame and the deep learning algorithm model to obtain coordinate information of a plurality of key points.
In one embodiment, the acquired deep learning algorithm model is a face key point detection algorithm.
Further, performing feature encoding of key points based on the face recognition frame and the deep learning algorithm model to obtain coordinate information of a plurality of key points includes: converting the face recognition frame into a multi-channel neural network feature map through feature fusion, where each channel corresponds to one face key point; refining each channel to obtain accurate coordinate information for each key point; and obtaining a key-point feature map based on the accurate coordinate information of the key points.
In one embodiment, the key point feature map includes coordinate information of each key point.
Specifically, feature fusion can be performed through convolution and pooling to convert the face recognition frame into a multi-channel neural network feature map; each channel is then further refined by Gaussian blurring, thresholding, and similar processing.
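A minimal sketch of the refinement step for one heatmap channel, assuming thresholding followed by a confidence-weighted centroid; the Gaussian blur mentioned above is omitted for brevity, so this is a simplified stand-in for the patent's refinement, not its exact procedure.

```python
import numpy as np

def refine_keypoint(heatmap, thresh=0.2):
    """Refine one channel of the key-point feature map to an (x, y) coordinate.

    heatmap: 2D array of per-pixel confidences for one key point.
    thresh: assumed noise-suppression threshold (illustrative value).
    """
    h = np.where(heatmap >= thresh, heatmap, 0.0)  # threshold out noise
    total = h.sum()
    if total == 0:  # no confident response in this channel
        return None
    ys, xs = np.mgrid[0:h.shape[0], 0:h.shape[1]]
    # Confidence-weighted centroid gives sub-pixel coordinate accuracy.
    return (float((xs * h).sum() / total), float((ys * h).sum() / total))
```

Running all 68 channels through such a routine yields the coordinate information that forms the key-point feature map.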
In one embodiment, in order to obtain the structural feature information of a facial expression, key-point feature encoding is first performed on the 2D face RGB image, and feature maps of 68 key points are obtained with a key-point encoding network (an Encoder network), where the Encoder network is a neural network that converts input data into high-dimensional feature vectors representing the data. Those skilled in the art can select other face key-point extraction networks as the encoding network, and other numbers of key points, according to actual needs.
In this embodiment, the Encoder network is HRNet-backbone. In HRNet, multiple branches run in parallel at different resolutions and continuously exchange information across resolutions, so the whole backbone maintains high resolution from beginning to end. This both enhances semantic information and preserves accurate position information, and the more accurate key-point features provide more accurate expression structural features for the subsequent expression coefficient mapping module.
HRNet (High-Resolution Network) is a high-resolution neural network structure, and HRNet-backbone refers to the use of HRNet as a backbone network. Compared with traditional network structures, HRNet handles high-resolution input data better and improves accuracy while keeping computational complexity low, so HRNet-backbone is widely used in fields such as face recognition.
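The cross-resolution information exchange described above can be sketched as follows. Nearest-neighbour upsampling and a plain element-wise sum stand in for HRNet's learned fusion layers; this only illustrates the idea of merging a low-resolution branch into a high-resolution one.

```python
import numpy as np

def fuse_branches(high_res, low_res):
    """Minimal sketch of HRNet-style cross-resolution fusion.

    high_res: (H, W) feature map from the high-resolution branch.
    low_res: (H//2, W//2) feature map from the low-resolution branch.
    The low-resolution branch is upsampled (nearest neighbour here;
    HRNet uses learned upsampling) and summed into the other branch.
    """
    up = np.repeat(np.repeat(low_res, 2, axis=0), 2, axis=1)
    return high_res + up
```

In HRNet this exchange happens repeatedly and in both directions (downsampling for the reverse path), which is what keeps semantic and positional information aligned.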
And step S13, constructing mapping relations between the key point features and the expression base based on the preset expression base, and obtaining an expression driving coefficient.
Specifically, based on a preset expression base, constructing a mapping relationship between the plurality of key point features and the expression base, and obtaining an expression driving coefficient includes: acquiring a facial expression template with preset standards as an expression base; constructing mapping relations between the plurality of key point features and the expression base through a convolution layer and a full connection layer; based on the mapping relation, the expression driving coefficient with multiple dimensions is obtained.
In one embodiment, the 51 expressions defined by ARKit (an Apple development toolkit) are used as expression base units (for example, basic bases such as mouth opening, blinking and pouting). The driving coefficient corresponding to each expression base is predicted by the deep learning algorithm model, and the expression-base coefficient mapping module further maps the key-point feature map into normalized driving coefficients for all expression bases, i.e. a 51-dimensional expression driving coefficient vector.
And S14, mixing and superposing the expression driving coefficient and a preset head model, and visually rendering to obtain a virtual face image.
In one embodiment, mixing and superposing the expression driving coefficient and a preset head model, and visually rendering to obtain a virtual face image includes: mixing and superposing the expression driving coefficient and a preset head model to obtain the head model with the expression; and performing visual rendering on the head model with the expression to obtain a virtual face image.
The head model, i.e. the virtual human's head, is obtained by modeling, and after being blended and superimposed with the expression driving coefficients it forms the virtual face image with the expression. Illustratively, the virtual face image may be obtained by visual rendering with PyTorch3D.
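The blending step can be sketched as the standard linear blendshape formula: each expression base contributes a per-vertex offset scaled by its driving coefficient. The rendering step (e.g. with PyTorch3D) is not shown; the array shapes below are illustrative assumptions.

```python
import numpy as np

def blend_head(neutral, deltas, coeffs):
    """Superimpose expression bases on a neutral head model.

    neutral: (V, 3) vertex positions of the neutral head model.
    deltas:  (51, V, 3) per-base vertex offsets (one per expression base).
    coeffs:  (51,) expression driving coefficients.
    Returns the (V, 3) vertices of the head model with the expression.
    """
    # Linear blendshape: neutral + sum_i coeffs[i] * deltas[i]
    return neutral + np.tensordot(coeffs, deltas, axes=1)
```

With all coefficients at zero the neutral head is returned unchanged; driving one coefficient to 1 applies that base's full offset.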
Further, the method further comprises: and after a plurality of virtual face images are formed based on the acquired images, combining the virtual face images to form a video.
In one embodiment, a person skilled in the art may export each frame of the video as a picture, complete facial expression capturing to obtain a plurality of virtual face images, and then combine the plurality of virtual face images to form the video, so as to complete video conversion of the virtual face images.
In actual operation, the method can be used to convert ordinary 2D video and live video streams, turning a real person's live broadcast into a virtual-avatar live broadcast.
Based on the above steps S11 to S14, a virtual face image is obtained from the acquired image, and the face capture that drives the virtual human's motion can be completed with an ordinary 2D camera, so the face capture technology can be applied more simply, conveniently and rapidly in fields such as live streaming, film and VR, effectively improving efficiency and reducing cost.
It should be noted that, although the foregoing embodiments describe the steps in a specific order, it will be understood by those skilled in the art that, in order to achieve the effects of the present invention, the steps are not necessarily performed in such an order, and may be performed simultaneously (in parallel) or in other orders, and these variations are within the scope of the present invention.
Further, the invention also provides a facial expression capturing system.
Referring to fig. 2, fig. 2 is a main block diagram of a facial expression capturing system according to one embodiment of the present invention. As shown in fig. 2, the facial expression capturing system in the embodiment of the present invention mainly includes a face detection module 21, a key point feature recognition module 22, an expression base mapping module 23, and a hybrid rendering module 24. In some embodiments, one or more of the face detection module 21, the keypoint feature recognition module 22, the expression base mapping module 23, and the hybrid rendering module 24 may be combined together into one module. In some embodiments, the face detection module 21 is configured to perform face detection based on the acquired image, and acquire a face recognition frame; the key point feature recognition module 22 is configured to perform key point feature encoding based on the face recognition frame to obtain a plurality of key point features representing facial expression structural information; the expression base mapping module 23 is configured to construct mapping relations between the plurality of key point features and the expression base based on a preset expression base, and obtain an expression driving coefficient; the hybrid rendering module 24 is configured to mix and superimpose the expression driving coefficient and a preset head model, and perform visual rendering to obtain a virtual face image.
In one embodiment, the description of the specific implementation function may be described with reference to step S11 to step S14.
The above facial expression capturing system is used to execute the embodiment of the facial expression capturing method shown in fig. 1. The technical principles of the two, the technical problems they solve and the technical effects they produce are similar; for convenience and brevity of description, the specific working process of the facial expression capturing system and the related description may refer to the description of the method embodiment and are not repeated here.
It will be appreciated by those skilled in the art that all or part of the above methods may be implemented by a computer program instructing relevant hardware; the computer program may be stored in a computer-readable storage medium, and when executed by a processor it implements the steps of the above method embodiments. The computer program comprises computer program code, which may be in source code form, object code form, an executable file, or some intermediate form. The computer-readable storage medium may include any entity or device capable of carrying the program code, such as a USB flash drive, removable hard disk, magnetic disk, optical disk, computer memory, read-only memory, random access memory, electrical carrier signal, telecommunications signal, or software distribution medium. It should be noted that, as required by legislation and patent practice in some jurisdictions, the computer-readable storage medium may exclude electrical carrier signals and telecommunications signals.
The invention further provides electronic equipment. Referring to fig. 3, fig. 3 is a main block diagram of an electronic device for performing the facial expression capturing method of the present invention.
As shown in fig. 3, in one electronic device embodiment according to the present invention, the electronic device comprises a processor 301 and a memory 302, the memory 302 may be configured to store program code 303 for performing the facial expression capturing method of the above-described method embodiment, and the processor 301 may be configured to execute the program code 303 in the memory 302, the program code 303 including, but not limited to, the program code 303 for performing the facial expression capturing method of the above-described method embodiment. For convenience of explanation, only those portions of the embodiments of the present invention that are relevant to the embodiments of the present invention are shown, and specific technical details are not disclosed, please refer to the method portions of the embodiments of the present invention.
Further, the invention also provides a computer readable storage medium. In one computer-readable storage medium embodiment according to the present invention, the computer-readable storage medium may be configured to store a program that performs the facial expression capturing method of the above-described method embodiment, the program being loadable and executable by a processor to implement the above-described facial expression capturing method. For convenience of explanation, only those portions of the embodiments of the present invention that are relevant to the embodiments of the present invention are shown, and specific technical details are not disclosed, please refer to the method portions of the embodiments of the present invention. The computer readable storage medium may be a storage device including various electronic devices, and optionally, the computer readable storage medium in the embodiments of the present invention is a non-transitory computer readable storage medium.
It should be understood that, since the individual modules are presented merely to illustrate the functional units of the facial expression capturing system of the present invention, the physical devices corresponding to these modules may be the processor itself, or a portion of the software, the hardware, or a combination of software and hardware within the processor. Accordingly, the number of individual modules in the figures is merely illustrative.
Those skilled in the art will appreciate that the various modules in the system may be adaptively split or combined. Such splitting or combining does not cause the technical solution to deviate from the principle of the present invention, and the technical solution after splitting or combining therefore falls within the protection scope of the present invention.
Thus far, the technical solution of the present invention has been described in connection with the preferred embodiments shown in the drawings, but it is easily understood by those skilled in the art that the scope of protection of the present invention is not limited to these specific embodiments. Equivalent modifications and substitutions for related technical features may be made by those skilled in the art without departing from the principles of the present invention, and such modifications and substitutions will fall within the scope of the present invention.

Claims (10)

1. A facial expression capturing method, comprising:
performing face detection based on the acquired image to acquire a face recognition frame;
performing key point feature coding based on the face recognition frame to obtain a plurality of key point features representing facial expression structural information;
based on a preset expression base, constructing mapping relations between the key point features and the expression base to obtain an expression driving coefficient;
and mixing and superposing the expression driving coefficient and a preset head model, and visually rendering to obtain a virtual face image.
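The four-step pipeline of claim 1 can be sketched end to end as follows. This is an illustrative toy implementation, not the claimed method: every function name (`detect_face`, `encode_keypoints`, `map_to_expression_coeffs`, `render_virtual_face`) is a hypothetical placeholder, and the detector, encoder, and mapping bodies are stand-ins so the data flow between the four steps can be run and inspected.

```python
import numpy as np

def detect_face(image):
    # Hypothetical detector: return a face recognition frame (x, y, w, h).
    h, w = image.shape[:2]
    return (w // 4, h // 4, w // 2, h // 2)

def encode_keypoints(image, box, n_points=68):
    # Hypothetical key point encoder: n coordinate features inside the frame.
    x, y, w, h = box
    rng = np.random.default_rng(0)
    return np.column_stack([x + rng.uniform(0, w, n_points),
                            y + rng.uniform(0, h, n_points)])

def map_to_expression_coeffs(keypoints, n_bases=52):
    # Hypothetical mapping from key point features to one driving
    # coefficient per expression base, normalised into [0, 1].
    raw = np.resize(keypoints.ravel(), n_bases)
    return (raw - raw.min()) / (np.ptp(raw) + 1e-8)

def render_virtual_face(coeffs, head_model):
    # Mix-and-superpose step: displace the neutral head model by the
    # coefficient-weighted expression bases (rendering proper omitted).
    neutral, bases = head_model
    return neutral + np.tensordot(coeffs, bases, axes=1)

image = np.zeros((128, 128, 3))
head_model = (np.zeros((100, 3)), np.ones((52, 100, 3)) * 0.01)
box = detect_face(image)
kps = encode_keypoints(image, box)
coeffs = map_to_expression_coeffs(kps)
mesh = render_virtual_face(coeffs, head_model)
```

The 52-base count mirrors common blendshape rigs and is an assumption; the patent only states that the driving coefficient has multiple dimensions.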
2. The method of claim 1, wherein performing face detection based on the acquired image, acquiring a face recognition frame comprises:
performing face detection based on the acquired image to obtain a face area;
and expanding the detected face area to obtain a face recognition frame covering the skull range.
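One hedged way to realize the expansion step of claim 2 is to grow the detected face rectangle outward so the recognition frame also covers the skull, clamping it to the image bounds. The 40 % margin below is an illustrative value, not taken from the patent:

```python
def expand_face_box(box, image_w, image_h, margin=0.4):
    # box = (x, y, w, h); grow the rectangle by `margin` of its size on
    # each side, clamped to the image, so the resulting face recognition
    # frame covers the skull range rather than only the inner face.
    x, y, w, h = box
    dx, dy = w * margin, h * margin
    x0 = max(0, int(x - dx))
    y0 = max(0, int(y - dy))
    x1 = min(image_w, int(x + w + dx))
    y1 = min(image_h, int(y + h + dy))
    return (x0, y0, x1 - x0, y1 - y0)
```

For a 100x100 detection at (50, 50) in a 640x480 image this yields a 180x180 frame at (10, 10); near the image border the frame is clipped instead of extended.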
3. The method of claim 1, wherein performing keypoint feature encoding based on the face recognition frame to obtain a plurality of keypoint features representing facial expression structured information comprises:
obtaining a deep learning algorithm model;
and performing key point feature encoding based on the face recognition frame and the deep learning algorithm model to obtain coordinate information of a plurality of key points.
4. The method of claim 3, wherein performing keypoint feature encoding based on the face recognition frame and the deep learning algorithm model, obtaining coordinate information of a plurality of keypoints comprises:
converting the face recognition frame into a multi-channel neural network feature map through feature fusion, wherein each channel corresponds to a face key point;
performing refinement processing on each channel to obtain accurate coordinate information of each key point;
and acquiring a key point feature map based on the accurate coordinate information of the key points.
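Claims 3–4 describe a multi-channel feature map with one channel per face key point, each channel refined into precise coordinates. A common realization of that refinement step — an interpretation, since the patent does not name a specific operator — is a soft-argmax over each heatmap channel, which yields sub-pixel coordinates:

```python
import numpy as np

def soft_argmax(heatmaps):
    # heatmaps: (K, H, W), one channel per face key point.
    # Softmax each channel into a probability map, then take the
    # probability-weighted mean of pixel positions -> sub-pixel (x, y).
    k, h, w = heatmaps.shape
    flat = heatmaps.reshape(k, -1)
    flat = np.exp(flat - flat.max(axis=1, keepdims=True))
    prob = flat / flat.sum(axis=1, keepdims=True)
    ys, xs = np.mgrid[0:h, 0:w]
    x = prob @ xs.ravel()
    y = prob @ ys.ravel()
    return np.stack([x, y], axis=1)   # (K, 2) coordinate features

hm = np.zeros((1, 16, 16))
hm[0, 5, 9] = 20.0            # single sharp peak at (x=9, y=5)
coords = soft_argmax(hm)      # ≈ [[9.0, 5.0]]
```

Unlike a hard argmax, this refinement is differentiable, which is why it pairs naturally with the deep learning model of claim 3.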
5. The method of claim 1, wherein constructing the mapping relationship between the plurality of key point features and the expression base based on a preset expression base, and obtaining an expression driving coefficient comprises:
acquiring a facial expression template with preset standards as an expression base;
constructing mapping relations between the plurality of key point features and the expression base through a convolution layer and a full connection layer;
and obtaining a multi-dimensional expression driving coefficient based on the mapping relation.
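A minimal sketch of the mapping in claim 5, reduced to the fully connected layer the claim names (the convolution layer is omitted here for brevity): the flattened key point coordinates are projected to one driving coefficient per expression base. The random weights are stand-ins for trained parameters, and the sigmoid squashing into (0, 1) is an illustrative choice, not stated in the patent.

```python
import numpy as np

def keypoints_to_coeffs(keypoints, weights, bias):
    # keypoints: (K, 2) coordinate features, flattened to a 2K vector.
    # weights: (n_bases, 2K), bias: (n_bases,) -- stand-ins for the
    # trained fully connected layer mapping features to expression bases.
    x = keypoints.ravel()
    logits = weights @ x + bias
    return 1.0 / (1.0 + np.exp(-logits))   # one coefficient per base

rng = np.random.default_rng(0)
kps = rng.normal(size=(68, 2))             # 68 key points (assumed count)
W = rng.normal(scale=0.05, size=(52, 136)) # 52 expression bases (assumed)
b = np.zeros(52)
coeffs = keypoints_to_coeffs(kps, W, b)
```

In a trained system each output dimension would correspond to one facial expression template of the preset expression base.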
6. The method of claim 1, wherein blending and superimposing the expression driving coefficients with a preset head model, and visually rendering to obtain a virtual face image comprises:
mixing and superposing the expression driving coefficient and a preset head model to obtain the head model with the expression;
and performing visual rendering on the head model with the expression to obtain a virtual face image.
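The mixing-and-superposing step of claim 6 matches the standard linear blendshape model — an interpretation, since the patent does not spell out the blending rule: each expression base is a set of vertex offsets from the neutral head model, added in proportion to its driving coefficient.

```python
import numpy as np

def blend_head_model(neutral, bases, coeffs):
    # neutral: (V, 3) neutral head mesh vertices.
    # bases:   (B, V, 3) per-base vertex offsets from the neutral mesh.
    # coeffs:  (B,) expression driving coefficients.
    # Result: the head model "with the expression", ready for rendering.
    return neutral + np.tensordot(coeffs, bases, axes=1)

neutral = np.zeros((4, 3))
bases = np.zeros((2, 4, 3))
bases[0, 0, 1] = 1.0               # base 0 lifts vertex 0 along y
coeffs = np.array([0.5, 0.0])      # base 0 at half strength
mesh = blend_head_model(neutral, bases, coeffs)   # vertex 0 y == 0.5
```

The subsequent visual rendering of the blended mesh into the virtual face image is left to a standard graphics pipeline and is not sketched here.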
7. The method according to any one of claims 1-6, further comprising:
before face detection is carried out on the obtained images, extracting image frames in the video to obtain a plurality of images;
and/or,
and after a plurality of virtual face images are formed based on the acquired images, combining the virtual face images to form a video.
8. A facial expression capturing system, characterized by comprising a face detection module, a key point feature recognition module, an expression base mapping module and a hybrid rendering module; wherein:
the face detection module is configured to perform face detection based on the acquired image, and acquire a face recognition frame;
the key point feature recognition module is configured to perform key point feature coding based on the face recognition frame to obtain a plurality of key point features representing facial expression structural information;
the expression base mapping module is configured to construct mapping relations between the key point features and the expression base based on preset expression bases to obtain expression driving coefficients;
the hybrid rendering module is configured to mix and superimpose the expression driving coefficient and a preset head model, and perform visual rendering to obtain a virtual face image.
9. An electronic device comprising a processor and a memory, the memory being adapted to store a plurality of program codes, characterized in that the program codes are adapted to be loaded and executed by the processor to perform the facial expression capturing method of any one of claims 1 to 7.
10. A computer readable storage medium having stored therein a plurality of program codes, wherein the program codes are adapted to be loaded and executed by a processor to perform the facial expression capturing method of any one of claims 1 to 7.
CN202310639831.2A 2023-05-31 2023-05-31 Facial expression capturing method, system, electronic equipment and medium Pending CN116612518A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310639831.2A CN116612518A (en) 2023-05-31 2023-05-31 Facial expression capturing method, system, electronic equipment and medium


Publications (1)

Publication Number Publication Date
CN116612518A true CN116612518A (en) 2023-08-18

Family

ID=87676266

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310639831.2A Pending CN116612518A (en) 2023-05-31 2023-05-31 Facial expression capturing method, system, electronic equipment and medium

Country Status (1)

Country Link
CN (1) CN116612518A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117315745A (en) * 2023-09-19 2023-12-29 中影年年(北京)文化传媒有限公司 Facial expression capturing method and system based on machine learning
CN117315745B (en) * 2023-09-19 2024-05-28 中影年年(北京)科技有限公司 Facial expression capturing method and system based on machine learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination