CN115984943A - Facial expression capturing and model training method, device, equipment, medium and product - Google Patents

Facial expression capturing and model training method, device, equipment, medium and product

Info

Publication number
CN115984943A
CN115984943A
Authority
CN
China
Prior art keywords
target
parameter
face
facial
video data
Prior art date
Legal status
Granted
Application number
CN202310088843.0A
Other languages
Chinese (zh)
Other versions
CN115984943B (en)
Inventor
黄美佳
陈志远
马晨光
Current Assignee
Alipay Hangzhou Information Technology Co Ltd
Original Assignee
Alipay Hangzhou Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Alipay Hangzhou Information Technology Co Ltd
Priority to CN202310088843.0A
Publication of CN115984943A
Application granted
Publication of CN115984943B
Legal status: Active
Anticipated expiration

Landscapes

  • Image Analysis (AREA)

Abstract

The embodiments of this specification disclose a facial expression capturing method, a model training method, and a corresponding apparatus, device, medium, and product. The facial expression capturing method includes: acquiring target face video data, where the target face video data includes consecutive multiple frames of target face images; extracting a first target parameter sequence corresponding to the target face video data, where the first target parameter sequence includes first target parameters corresponding to the respective frames of target face images, and the first target parameters include first target expression parameters and first target rotation-translation parameters; and optimizing the first target parameter sequence by using a target time-series neural network model to obtain a second target parameter sequence, where the second target parameter sequence includes a second target expression parameter sequence and a second target rotation-translation parameter sequence, and the target time-series neural network model is trained on face video data with a plurality of known facial feature point sequences.

Description

Facial expression capturing and model training method, device, equipment, medium and product
Technical Field
The present description relates to the field of computer technologies, and in particular, to a method, an apparatus, a device, a medium, and a product for capturing facial expressions and training models.
Background
With the rapid development of electronic devices and rendering technologies in recent years, motion capture has become an indispensable production tool in game development, 3D film animation, and virtual reality: it records the motion trajectories of a moving object, captures spatial feature points, and computes their three-dimensional coordinates, thereby digitizing the motion. Compared with body motion capture, facial expression capture is more difficult, because fine and precise changes in facial expression movements must be acquired.
Disclosure of Invention
The embodiments of this specification provide a facial expression capturing method, a model training method, and a corresponding apparatus, device, medium, and product, which not only improve the accuracy and continuity of facial expression capture on face video data, but can also be widely applied on low-end devices at low application cost. The technical solutions are as follows:
in a first aspect, an embodiment of the present specification provides a facial expression capturing method, including:
acquiring target face video data; the target face video data includes a plurality of consecutive frames of target face images;
extracting a first target parameter sequence corresponding to the target face video data; the first target parameter sequence comprises first target parameters corresponding to the target face images of the multiple frames respectively; the first target parameters comprise first target expression parameters and first target rotation and translation parameters;
optimizing the first target parameter sequence by using a target time sequence neural network model to obtain a second target parameter sequence; the second target parameter sequence comprises a second target expression parameter sequence and a second target rotation and translation parameter sequence; the target time sequence neural network model is obtained by training based on face video data of a plurality of known face feature point sequences.
In a possible implementation manner, the extracting a first target parameter sequence corresponding to the target face video data includes:
extracting a first target parameter sequence corresponding to the target face video data by using a target parameter extractor; the target parameter extractor is obtained by training based on a plurality of facial images of known facial feature points.
In one possible implementation, the target parameter extractor is trained based on facial images of a plurality of known facial feature points and facial video data of the plurality of known facial feature point sequences.
In a possible implementation manner, the target parameter extractor includes a first target convolutional network and a second target convolutional network; the first target convolutional network is used for extracting a first target expression parameter sequence corresponding to the target face video data; the second target convolutional network is used for extracting a first target rotation-translation parameter sequence corresponding to the target face video data.
In one possible implementation, the target time-series neural network model includes a first target time-series neural network and a second target time-series neural network; the first target time sequence neural network is used for optimizing a first target expression parameter sequence corresponding to the target face video data; the second target time sequence neural network is used for optimizing a first target rotation translation parameter sequence corresponding to the target face video data.
In a possible implementation manner, after the optimizing the first target parameter sequence by using the target time-series neural network model to obtain the second target parameter sequence, the method further includes:
and migrating the second target parameter sequence to a target three-dimensional virtual image.
In a possible implementation manner, the second target expression parameter sequence includes second target expression parameters corresponding to the multiple frames of target facial images, and is used to represent a change condition of a facial expression in the target facial video data; the second target expression parameter is used for representing multidimensional target expression base coefficients of the whole facial expression in the target facial image.
In one possible implementation manner, the dimension of the expression base of the whole facial expression of the target three-dimensional virtual image is equal to the dimension of the target expression base coefficient.
In a possible implementation manner, the second target rotation-translation parameter sequence includes second target rotation-translation parameters corresponding to the respective frames of target face images, and is used for characterizing changes of the head pose in the target face video data.
In a possible implementation manner, the acquiring target face video data includes:
acquiring video data based on image acquisition equipment; the video data includes a plurality of consecutive frames of images containing faces;
carrying out face detection on the continuous multiple frames of images containing faces to obtain target face video data; the target face image is an image including only a face or an image with a known face position.
In a second aspect, an embodiment of the present specification provides a method for training a time series neural network model, where the method includes:
acquiring face video data of a plurality of known face feature point sequences; the face video data includes a plurality of consecutive frames of first face images; the facial feature point sequence comprises first facial feature points corresponding to the first facial images of the plurality of frames respectively;
extracting a parameter sequence corresponding to the face video data; the first parameter sequence comprises parameters corresponding to the first face images of the plurality of frames respectively; the parameters comprise a first identity parameter, a first expression parameter and a first rotation and translation parameter;
inputting a first expression parameter sequence and a first rotation and translation parameter sequence corresponding to the facial video data into a time sequence neural network model, and outputting an optimized second expression parameter sequence and an optimized second rotation and translation parameter sequence;
generating a first three-dimensional grid corresponding to each of the plurality of frames of first face images based on a first identity parameter sequence, the second expression parameter sequence and the second rotation and translation parameter sequence corresponding to the face video data;
determining a first loss of the time-series neural network model based on the first three-dimensional grids corresponding to the first facial images of the plurality of frames and the first facial feature points corresponding to the first facial images of the plurality of frames;
training the time sequence neural network model based on the first loss to obtain a trained target time sequence neural network model; the target time series neural network model is used to optimize the first target parameter sequence in the first aspect of the embodiments of the present specification or any one of the possible implementations of the first aspect.
In one possible implementation manner, the determining a first loss of the time-series neural network model based on the first three-dimensional mesh corresponding to each of the plurality of frames of first facial images and the first facial feature point corresponding to each of the plurality of frames of first facial images includes:
acquiring three-dimensional facial feature points of the first three-dimensional grids corresponding to the plurality of frames of first facial images;
projecting the three-dimensional facial feature points to obtain corresponding two-dimensional facial feature points;
and determining a first loss of the time-series neural network model based on the first facial feature points corresponding to the first facial images of the plurality of frames and the two-dimensional facial feature points corresponding to the first three-dimensional grids corresponding to the first facial images of the plurality of frames.
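As a concrete illustration of this projection-based loss, the following Python sketch assumes a weak-perspective camera and an index list selecting the feature-point vertices of each first three-dimensional grid; the function names, the camera model, and the squared-distance form of the loss are assumptions made for illustration rather than the exact formulation of this specification.

```python
import torch

def project_weak_perspective(verts_3d, scale, trans_2d):
    """Drop depth and apply a scale and 2D translation (assumed camera model)."""
    return scale * verts_3d[..., :2] + trans_2d

def landmark_loss(mesh_vertices, landmark_ids, gt_landmarks_2d, scale, trans_2d):
    """First loss: distance between projected mesh feature points and the
    known first facial feature points, averaged over frames and points.

    mesh_vertices:   (T, V, 3) first three-dimensional grids, one per frame
    landmark_ids:    (K,)      vertex indices of the 3D facial feature points
    gt_landmarks_2d: (T, K, 2) known 2D facial feature points per frame
    """
    pts_3d = mesh_vertices[:, landmark_ids, :]                 # (T, K, 3)
    pts_2d = project_weak_perspective(pts_3d, scale, trans_2d)  # (T, K, 2)
    return torch.mean(torch.sum((pts_2d - gt_landmarks_2d) ** 2, dim=-1))
```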
In a possible implementation manner, the extracting a parameter sequence corresponding to the face video data includes:
and extracting a parameter sequence corresponding to the face video data by using a target parameter extractor.
In a possible implementation manner, before the extracting the parameter sequence corresponding to the face video data, the method further includes:
acquiring a plurality of second face images of known second face feature points;
and training a parameter extractor based on the plurality of second face images with known second face feature points to obtain the trained target parameter extractor.
In one possible implementation manner, the training a parameter extractor based on the plurality of second face images with known second face feature points to obtain the trained target parameter extractor includes:
inputting a plurality of second face images with known second face characteristic points into a parameter extractor, and outputting a parameter set corresponding to each second face image;
combining the basis vectors of the parameterized three-dimensional face model based on the parameter set to generate a corresponding second three-dimensional grid;
acquiring two-dimensional face feature points corresponding to the second three-dimensional grid;
rendering the second three-dimensional grid into a two-dimensional image;
determining a second loss corresponding to the parameter extractor based on the two-dimensional face feature points corresponding to the second three-dimensional mesh and the two-dimensional image;
and training the parameter extractor based on the second loss to obtain the trained target parameter extractor.
In a possible implementation manner, the parameter set includes a second identity parameter, a texture parameter, a second expression parameter, and a second rotation-translation parameter;
combining the basis vectors of the parameterized three-dimensional face model based on the parameter set to generate a corresponding second three-dimensional mesh, comprising:
combining the basis vectors of the parameterized three-dimensional face model based on the second identity parameters, the texture parameters and the second expression parameters to generate a corresponding second three-dimensional grid;
rendering the second three-dimensional mesh into a two-dimensional image includes:
rendering the second three-dimensional mesh into a two-dimensional image based on the texture parameter and the second rotational-translational parameter.
In a possible implementation manner, the parameter extractor includes a first convolution network, a second convolution network, a third convolution network, and a fourth convolution network; wherein:
the first convolution network is used for extracting a second expression parameter corresponding to the second facial image;
the second convolution network is configured to extract a second rotation-translation parameter corresponding to the second face image;
the third convolutional network is configured to extract a second identity parameter corresponding to the second face image;
and the fourth convolution network is used for extracting the texture parameters corresponding to the second face image.
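For orientation only, the following sketch shows one way the four convolutional networks could be organized as four branches over the same face image; the backbone layers and the output dimensions (52 expression coefficients, 6 rotation-translation values, 80 identity and 80 texture coefficients) are assumed, not taken from this specification.

```python
import torch
import torch.nn as nn

def conv_branch(out_dim):
    """A small convolutional regressor; the real networks may differ."""
    return nn.Sequential(
        nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
        nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        nn.Linear(64, out_dim),
    )

class ParameterExtractor(nn.Module):
    """Four convolutional networks, one per parameter group (assumed dims)."""
    def __init__(self, n_expr=52, n_pose=6, n_id=80, n_tex=80):
        super().__init__()
        self.expr_net = conv_branch(n_expr)   # first convolution network
        self.pose_net = conv_branch(n_pose)   # second convolution network
        self.id_net   = conv_branch(n_id)     # third convolution network
        self.tex_net  = conv_branch(n_tex)    # fourth convolution network

    def forward(self, face_image):            # face_image: (B, 3, H, W)
        return (self.expr_net(face_image), self.pose_net(face_image),
                self.id_net(face_image), self.tex_net(face_image))
```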
In a possible implementation manner, the acquiring the two-dimensional facial feature points corresponding to the second three-dimensional mesh includes:
acquiring three-dimensional face feature points corresponding to the second three-dimensional grid;
and projecting the three-dimensional facial feature points corresponding to the second three-dimensional grid to obtain corresponding two-dimensional facial feature points.
In one possible implementation manner, the determining a second loss corresponding to the parameter extractor based on the two-dimensional facial feature points corresponding to the second three-dimensional mesh and the two-dimensional image includes:
determining a feature point loss of the parameter extractor based on the two-dimensional facial feature points corresponding to the second three-dimensional mesh and the second facial feature points of the second facial image;
determining a pixel loss of the parameter extractor based on the two-dimensional image and the second face image;
the training of the parameter extractor based on the second loss to obtain the trained target parameter extractor includes:
and training the parameter extractor based on the characteristic point loss and the pixel loss to obtain the trained target parameter extractor.
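A minimal sketch of combining the two terms into the second loss, assuming an L2 feature-point term, an L1 photometric pixel term, and equal weights; none of these choices is fixed by this specification.

```python
import torch

def extractor_loss(pred_landmarks_2d, gt_landmarks_2d,
                   rendered_image, input_image,
                   w_landmark=1.0, w_pixel=1.0):
    """Second loss = feature-point loss + pixel loss (weights assumed)."""
    feature_point_loss = torch.mean(
        torch.sum((pred_landmarks_2d - gt_landmarks_2d) ** 2, dim=-1))
    # The pixel loss compares the rendered two-dimensional image with the
    # original second face image (here a plain L1 photometric term).
    pixel_loss = torch.mean(torch.abs(rendered_image - input_image))
    return w_landmark * feature_point_loss + w_pixel * pixel_loss
```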
In one possible implementation manner, the acquiring a plurality of second face images of known second face feature points includes:
acquiring a plurality of second face images;
and determining second face feature points corresponding to the plurality of second face images by using a face detection algorithm and a feature point detection algorithm.
In a possible implementation manner, each second face image has a plurality of second face feature points; the second face image is a two-dimensional image acquired by an image acquisition device.
In a possible implementation manner, the extracting, by the target parameter extractor, a parameter sequence corresponding to the face video data includes:
extracting a first expression parameter sequence corresponding to the facial video data by using a first convolution network of the target parameter extractor;
extracting a first rotation-translation parameter sequence corresponding to the face video data by using a second convolution network of the target parameter extractor;
and extracting a first identity parameter sequence corresponding to the face video data by using a third convolution network of the target parameter extractor.
In one possible implementation manner, the training the time-series neural network model based on the first loss to obtain a trained target time-series neural network model includes:
training the time sequence neural network model, the first convolution network, the second convolution network and the third convolution network based on the first loss to obtain a trained target time sequence neural network model, a first target convolution network, a second target convolution network and a third target convolution network;
the first target convolutional network is configured to extract a first target expression parameter sequence corresponding to target facial video data in the first aspect or any one of the possible implementation manners of the first aspect of the embodiment of the present specification;
the second target convolutional network is configured to extract a first target rotational-translational parameter sequence corresponding to the target face video data in the first aspect or any one of the possible implementations of the first aspect of the embodiments of the present specification.
In a third aspect, embodiments of the present specification provide a facial expression capture apparatus, comprising:
the first acquisition module is used for acquiring target face video data; the target face video data includes a plurality of consecutive frames of target face images;
the first extraction module is used for extracting a first target parameter sequence corresponding to the target face video data; the first target parameter sequence comprises first target parameters corresponding to the target face images of the multiple frames respectively; the first target parameters comprise first target expression parameters and first target rotation and translation parameters;
the first optimization module is used for optimizing the first target parameter sequence by using a target time sequence neural network model to obtain a second target parameter sequence; the second target parameter sequence comprises a second target expression parameter sequence and a second target rotation and translation parameter sequence; the target time sequence neural network model is obtained by training based on face video data of a plurality of known face feature point sequences.
In a possible implementation manner, the first extraction module is specifically configured to:
extracting a first target parameter sequence corresponding to the target face video data by using a target parameter extractor; the target parameter extractor is obtained by training based on a plurality of facial images of known facial feature points.
In one possible implementation, the target parameter extractor is trained based on facial images of a plurality of known facial feature points and facial video data of the plurality of known facial feature point sequences.
In a possible implementation manner, the target parameter extractor includes a first target convolutional network and a second target convolutional network; the first target convolutional network is used for extracting a first target expression parameter sequence corresponding to the target face video data; the second target convolutional network is used for extracting a first target rotation translation parameter sequence corresponding to the target face video data.
In one possible implementation, the target time-series neural network model includes a first target time-series neural network and a second target time-series neural network; the first target time sequence neural network is used for optimizing a first target expression parameter sequence corresponding to the target face video data; the second target time sequence neural network is used for optimizing a first target rotation translation parameter sequence corresponding to the target face video data.
In one possible implementation manner, the facial expression capturing apparatus further includes:
and the expression transferring module is used for transferring the second target parameter sequence to the target three-dimensional virtual image.
In a possible implementation manner, the second target expression parameter sequence includes second target expression parameters corresponding to the multiple frames of target facial images, and is used to represent a change condition of a facial expression in the target facial video data; the second target expression parameter is used for representing multidimensional target expression base coefficients of the whole facial expression in the target facial image.
In one possible implementation manner, the dimension of the expression base of the whole facial expression of the target three-dimensional virtual image is equal to the dimension of the target expression base coefficient.
In a possible implementation manner, the second target rotation-translation parameter sequence includes second target rotation-translation parameters corresponding to the respective frames of target face images, and is used for characterizing changes of the head pose in the target face video data.
In a possible implementation manner, the first obtaining module includes:
the first acquisition unit is used for acquiring video data based on image acquisition equipment; the video data includes a plurality of consecutive frames of images containing faces;
a first face detection unit, configured to perform face detection on the consecutive frames of images including faces to obtain target face video data; the target face image is an image including only a face or an image with a known face position.
In a fourth aspect, an embodiment of the present specification provides a time series neural network model training apparatus, including:
the second acquisition module is used for acquiring face video data of a plurality of known face feature point sequences; the face video data includes a plurality of consecutive frames of first face images; the facial feature point sequence comprises first facial feature points corresponding to the first facial images of the plurality of frames respectively;
the second extraction module is used for extracting a parameter sequence corresponding to the face video data; the first parameter sequence comprises parameters corresponding to the first face images of the plurality of frames respectively; the parameters comprise a first identity parameter, a first expression parameter and a first rotation and translation parameter;
the second optimization module is used for inputting the first expression parameter sequence and the first rotation and translation parameter sequence corresponding to the facial video data into a time sequence neural network model and outputting the optimized second expression parameter sequence and the optimized second rotation and translation parameter sequence;
a first generating module, configured to generate a first three-dimensional mesh corresponding to each of the plurality of frames of first face images based on a first identity parameter sequence, the second expression parameter sequence, and the second rotation-translation parameter sequence corresponding to the face video data;
a first determining module, configured to determine a first loss of the time-series neural network model based on a first three-dimensional grid corresponding to each of the plurality of frames of first facial images and a first facial feature point corresponding to each of the plurality of frames of first facial images;
the first training module is used for training the time sequence neural network model based on the first loss to obtain a trained target time sequence neural network model; the target time series neural network model is used to optimize the first target parameter sequence in the first aspect of the embodiments of the present specification or any one of the possible implementations of the first aspect.
In a possible implementation manner, the first determining module includes:
a second obtaining unit, configured to obtain three-dimensional facial feature points of first three-dimensional facial meshes corresponding to the plurality of frames of first facial images;
the projection unit is used for projecting the three-dimensional facial feature points to obtain corresponding two-dimensional facial feature points;
and a first determining unit configured to determine a first loss of the time-series neural network model based on the first facial feature points corresponding to the plurality of frames of first facial images and the two-dimensional facial feature points corresponding to the first three-dimensional meshes corresponding to the plurality of frames of first facial images.
In a possible implementation manner, the second extraction module is specifically configured to:
and extracting a parameter sequence corresponding to the face video data by using a target parameter extractor.
In a possible implementation manner, the training apparatus for a time series neural network model further includes:
the third acquisition module is used for acquiring a plurality of second face images of known second face characteristic points;
and the second training module is used for training the parameter extractor based on the plurality of second facial images with known second facial feature points to obtain the trained target parameter extractor.
In a possible implementation manner, the second training module includes:
a parameter extraction unit, configured to input a plurality of second face images with known second face feature points into a parameter extractor, and output a parameter set corresponding to each of the second face images;
the combination unit is used for combining the basis vectors of the parameterized three-dimensional face model based on the parameter set to generate a corresponding second three-dimensional grid;
a third obtaining unit, configured to obtain two-dimensional facial feature points corresponding to the second three-dimensional mesh;
a rendering unit, configured to render the second three-dimensional mesh into a two-dimensional image;
a second determining unit configured to determine a second loss corresponding to the parameter extractor based on the two-dimensional face feature point corresponding to the second three-dimensional mesh and the two-dimensional image;
and a training unit configured to train the parameter extractor based on the second loss to obtain the trained target parameter extractor.
In a possible implementation manner, the parameter set includes a second identity parameter, a texture parameter, a second expression parameter, and a second rotation-translation parameter;
the combination unit is specifically configured to:
combining the basis vectors of the parameterized three-dimensional face model based on the second identity parameters, the texture parameters and the second expression parameters to generate a corresponding second three-dimensional grid;
the rendering unit is specifically configured to:
rendering the second three-dimensional mesh into a two-dimensional image based on the texture parameter and the second rotational-translational parameter.
In a possible implementation manner, the parameter extractor includes a first convolution network, a second convolution network, a third convolution network, and a fourth convolution network; wherein:
the first convolution network is used for extracting a second expression parameter corresponding to the second facial image;
the second convolution network is configured to extract a second rotation-translation parameter corresponding to the second face image;
the third convolutional network is configured to extract a second identity parameter corresponding to the second face image;
and the fourth convolution network is used for extracting the texture parameters corresponding to the second face image.
In a possible implementation manner, the third obtaining unit is specifically configured to:
acquiring three-dimensional face feature points corresponding to the second three-dimensional grid; and projecting the three-dimensional facial feature points corresponding to the second three-dimensional grid to obtain corresponding two-dimensional facial feature points.
In a possible implementation manner, the second determining unit is specifically configured to:
determining a feature point loss of the parameter extractor based on the two-dimensional facial feature points corresponding to the second three-dimensional mesh and the second facial feature points of the second facial image; determining a pixel loss of the parameter extractor based on the two-dimensional image and the second face image;
the training unit is specifically configured to:
and training the parameter extractor based on the characteristic point loss and the pixel loss to obtain the trained target parameter extractor.
In a possible implementation manner, the third obtaining module includes:
a fourth acquiring unit configured to acquire a plurality of second face images;
and a third determining unit configured to determine second face feature points corresponding to the plurality of second face images by using a face detection algorithm and a feature point detection algorithm.
In a possible implementation manner, each second face image has a plurality of second face feature points; the second face image is a two-dimensional image acquired by an image acquisition device.
In a possible implementation manner, the second extraction module is specifically configured to:
extracting a first expression parameter sequence corresponding to the facial video data by using a first convolution network of the target parameter extractor; extracting a first rotation-translation parameter sequence corresponding to the face video data by using a second convolution network of the target parameter extractor; and extracting a first identity parameter sequence corresponding to the face video data by using a third convolution network of the target parameter extractor.
In a possible implementation manner, the first training module is specifically configured to:
training the time sequence neural network model, the first convolutional network, the second convolutional network and the third convolutional network based on the first loss to obtain a trained target time sequence neural network model, a first target convolutional network, a second target convolutional network and a third target convolutional network;
the first target convolutional network is configured to extract a first target expression parameter sequence corresponding to target facial video data in the first aspect or any one of the possible implementation manners of the first aspect of the embodiment of the present specification;
the second target convolutional network is configured to extract a first target rotational-translational parameter sequence corresponding to the target face video data in the first aspect or any one of the possible implementations of the first aspect of the embodiments of the present specification.
In a fifth aspect, an embodiment of the present specification provides an electronic device, including: a processor and a memory;
the processor is connected with the memory;
the memory is used for storing executable program codes;
the processor reads the executable program code stored in the memory to execute a program corresponding to the executable program code, so as to perform the method provided by any one of the first aspect or any one of the possible implementation manners of the first aspect or the second aspect or any one of the possible implementation manners of the second aspect of the embodiments of the present specification.
In a sixth aspect, an embodiment of the present specification provides a computer storage medium, where the computer storage medium stores multiple instructions, and the instructions are adapted to be loaded by a processor and execute a method provided by any one of the possible implementations of the first aspect or any one of the possible implementations of the second aspect or the second aspect of the embodiment of the present specification.
In a seventh aspect, the present specification provides a computer program product containing instructions, which when run on a computer or a processor, causes the computer or the processor to execute the method provided in any one of the possible implementations of the first aspect or any one of the possible implementations of the second aspect or the second aspect of the present specification.
In the embodiments of this specification, target face video data is acquired, where the target face video data includes a plurality of consecutive frames of target face images; a first target parameter sequence corresponding to the target face video data is extracted, where the first target parameter sequence includes first target parameters corresponding to the respective frames of target face images, and the first target parameters include first target expression parameters and first target rotation-translation parameters; and the first target parameter sequence is optimized by using a target time-series neural network model to obtain a second target parameter sequence, where the second target parameter sequence includes a second target expression parameter sequence and a second target rotation-translation parameter sequence, and the target time-series neural network model is trained on face video data with a plurality of known facial feature point sequences. In this way, the first target parameter sequence related to facial expressions in the target face video data is optimized by the target time-series neural network model into a second target parameter sequence with good continuity and high accuracy; that is, high-precision, highly continuous, and low-cost facial expression capture can be achieved without high-end image acquisition equipment.
Drawings
To illustrate the technical solutions in the embodiments of this specification more clearly, the drawings needed in the embodiments are briefly described below. Obviously, the drawings in the following description show only some embodiments of this specification, and those skilled in the art can derive other drawings from them without creative effort.
FIG. 1 is an architectural diagram of a facial expression capture system provided in an exemplary embodiment of the present description;
FIG. 2 is a flowchart illustrating a facial expression capture method according to an exemplary embodiment of the present disclosure;
FIG. 3 is a schematic diagram of an implementation process for acquiring video data of a target face according to an exemplary embodiment of the present disclosure;
fig. 4 is a schematic diagram of an implementation process for extracting a first target parameter sequence corresponding to target face video data according to an exemplary embodiment of the present specification;
FIG. 5 is a schematic diagram illustrating an implementation process of a facial expression capturing method according to an exemplary embodiment of the present specification;
FIG. 6 is a flowchart of another facial expression capture method provided by an exemplary embodiment of the present description;
FIG. 7 is a schematic diagram illustrating an overall implementation process of a facial expression capturing method according to an exemplary embodiment of the present specification;
FIG. 8 is a schematic flow chart diagram illustrating a method for training a sequential neural network model according to an exemplary embodiment of the present disclosure;
FIG. 9 is a schematic diagram illustrating an implementation process of a time series neural network model training method according to an exemplary embodiment of the present disclosure;
FIG. 10 is a schematic flow chart illustrating a process for determining a first loss of a sequential neural network model provided in an exemplary embodiment of the present description;
FIG. 11 is a schematic diagram illustrating a training process of a target parameter extractor according to an exemplary embodiment of the present disclosure;
FIG. 12 is a diagram illustrating a training process of a target parameter extractor according to an exemplary embodiment of the present disclosure;
FIG. 13 is a detailed flowchart of a training target parameter extractor according to an exemplary embodiment of the present disclosure;
FIG. 14 is a diagram illustrating an implementation process of a training target parameter extractor according to an exemplary embodiment of the present disclosure;
FIG. 15 is a schematic diagram of a facial expression capture device according to an exemplary embodiment of the present disclosure;
FIG. 16 is a schematic structural diagram of a sequential neural network model training apparatus according to an exemplary embodiment of the present disclosure;
fig. 17 is a schematic structural diagram of an electronic device according to an exemplary embodiment of the present disclosure.
Detailed Description
The technical solutions in the embodiments of the present specification will be clearly and completely described below with reference to the drawings in the embodiments of the present specification.
The terms "first," "second," "third," and the like in the description and in the claims, and in the drawings described above, are used for distinguishing between different objects and not necessarily for describing a particular sequential order. Furthermore, the terms "include" and "have," as well as any variations thereof, are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those steps or elements listed, but may alternatively include other steps or elements not listed, or inherent to such process, method, article, or apparatus.
It should be noted that the information (including but not limited to user equipment information, user personal information, etc.), data (including but not limited to data for analysis, stored data, displayed data, etc.) and signals referred to in the embodiments of the present description are authorized by the user or fully authorized by various parties, and the collection, use and processing of the relevant data need to comply with relevant laws and regulations and standards in relevant countries and regions. For example, the target face video data, face images, and the like referred to in this specification are acquired with sufficient authority.
First, the noun terms to which one or more embodiments of the present specification relate are explained.
Facial expression capture: the process of recording a series of facial movements and expressions with a device such as a camera and converting those expressions into a set of parameter data.
Parameterized three-dimensional face model (3D Morphable Face Model, 3DMM): generally obtained by statistically decomposing a large number of topologically aligned parameterized 3D faces; it contains shape, expression, and texture basis vectors, and 3D faces of various shapes can be generated by fitting different combinations of these basis vectors.
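As an illustration of this fitting, the sketch below generates a face mesh as the mean shape plus coefficient-weighted shape and expression basis vectors; the array shapes and the purely linear combination are assumptions of a typical 3DMM, not a definition taken from this specification.

```python
import numpy as np

def generate_3dmm_mesh(mean_shape, shape_basis, expr_basis, id_coeffs, expr_coeffs):
    """Combine 3DMM basis vectors into a 3D face mesh.

    mean_shape:  (3V,)      mean face vertices, flattened
    shape_basis: (3V, n_id) identity (shape) basis vectors
    expr_basis:  (3V, n_ex) expression basis vectors
    id_coeffs:   (n_id,)    identity parameters
    expr_coeffs: (n_ex,)    expression parameters
    """
    verts = mean_shape + shape_basis @ id_coeffs + expr_basis @ expr_coeffs
    return verts.reshape(-1, 3)   # (V, 3) mesh vertices
```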
Time-series neural network: a recurrent neural network that takes sequence data as input, recurses along the evolution direction of the sequence, and whose nodes (recurrent units) are connected in a chain.
Blendshape: a set of references (expression bases in 3DMM) that together compose the whole facial expression, typically the 52-blendshape expression reference set of ARKit; various facial expressions can be obtained by weighted combination, which drives the facial expression of a virtual character to produce 3D animation.
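Illustratively, such a weighted combination can be written as a single weighted sum of per-blendshape vertex offsets; the 52-dimensional coefficient vector mirrors the ARKit-style reference set mentioned above, and the array layout is assumed.

```python
import numpy as np

def apply_blendshapes(neutral_verts, blendshape_deltas, coeffs):
    """neutral_verts: (V, 3); blendshape_deltas: (52, V, 3) per-blendshape
    vertex offsets; coeffs: (52,) weights, typically in [0, 1]."""
    return neutral_verts + np.tensordot(coeffs, blendshape_deltas, axes=1)
```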
Currently, there are three main types of facial expression capturing technologies. The first consists of a complete hardware camera system together with an expression capturing algorithm; it is commonly used by professional teams such as film and television special-effects companies, but it is expensive, complex to operate, and dependent on high-end hardware, so it is not suitable for the general public. The second is mainly based on depth acquisition devices: besides image data, it obtains depth information of the face to recover three-dimensional facial data, and an algorithm outputs expression references from the tracked two-dimensional and three-dimensional facial data; however, the depth information relies on the camera's 3D structured-light technology, the hardware cost is high, an ordinary monocular RGB camera cannot obtain depth information, and the approach cannot be applied on low-end devices. The third is a face capture algorithm based on image data, which directly extracts expression parameters from each static frame to realize face capture.
However, whichever of these technologies is used, high-precision facial expression capture can be achieved only for a single frame of facial image; for face video data, the captured facial expression motions lack continuity, so the accuracy is low and the user experience is poor. Therefore, a facial expression capturing method is needed that is accurate and coherent, can be widely applied on low-end devices, and has low application cost.
Reference is next made to fig. 1, which is a schematic diagram illustrating an architecture of a facial expression capture system according to an exemplary embodiment of the present disclosure. As shown in fig. 1, the facial expression capture system includes: image capture device 110 and terminal 120. Wherein:
the image capturing device 110 may be a camera (for example, but not limited to, a monocular RGB camera) or other device equipped with a camera, and the like, which is not limited in this specification. When facial expression capture of a target user, a target animal, or the like is desired, target facial video data corresponding thereto, which includes a plurality of consecutive frames of target facial images, may be captured by the image capture device 110 first. And then sends the target face video data to the terminal 120 for facial expression capture. The data transmission between the image capturing device 110 and the terminal 120 may be a wireless transmission mode or a wired transmission mode. The wireless transmission mode may include wireless internet access, bluetooth, a mobile device network, and the like, and the wired transmission mode may include a coaxial cable, a card reader, an optical fiber, a digital subscriber line, and the like.
The terminal 120 may be a user terminal, and specifically includes one or more user terminals. Any user side of the terminal 120 may establish a data relationship with the network, and establish a data connection with the image capturing device 110 through the network, for example, to receive target face video data. A user version of software may be installed in the terminal 120 and is configured to extract a first target parameter sequence corresponding to the target face video data, where the first target parameter sequence includes first target parameters corresponding to the multiple frames of target face images, and the first target parameters include first target expression parameters and first target rotation-translation parameters, and to optimize the first target parameter sequence by using a target time-series neural network model to obtain a second target parameter sequence, where the second target parameter sequence includes a second target expression parameter sequence and a second target rotation-translation parameter sequence. Before capturing facial expressions, the terminal 120 may also train the time-series neural network model on the face video data of a plurality of known facial feature point sequences by using the time-series neural network model training method provided in the embodiments of this specification, so as to obtain the trained target time-series neural network model used in the facial expression capturing process described above. Any user side of the terminal 120 may be, but is not limited to, a mobile phone, a tablet computer, a notebook computer, and the like with the user version software installed.
Alternatively, the image capturing device 110 and the terminal 120 may be two devices independent from each other, and the image capturing device 110 may also be integrated with the terminal 120 and be a camera disposed on the terminal 120, which is not limited in this embodiment of the specification.
Optionally, the facial expression capturing method and the time-series neural network model training method provided in the embodiments of this specification are not limited to being executed by the terminal 120; they may also be executed by a server connected to the terminal 120 or the image capturing device 110 through a network. The embodiments of this specification do not particularly limit this, and the following embodiments all take facial expression capture executed by the terminal 120 as an example. The server may be, but is not limited to, a hardware server, a virtual server, a cloud server, and the like.
Illustratively, the facial expression capture system can be applied to various scenes such as game development, 3D movie animation, 3D virtual image dynamic display and the like, but is not limited to the application.
The network may be a medium providing a communication link between the terminal 120 and any one of the image capturing devices 110, and may also be the internet including network devices and transmission media, without being limited thereto. The transmission medium may be a wired link such as, but not limited to, a coaxial cable, an optical fiber, a Digital Subscriber Line (DSL), etc., or a wireless link such as, but not limited to, a wireless internet protocol (WIFI), bluetooth, a mobile device network, etc.
It is to be understood that the number of image capture devices 110 and terminals 120 in the facial expression capture system shown in fig. 1 is by way of example only, and that any number of image capture devices and terminals may be included in the facial expression capture system in a particular implementation. The examples in this specification are not particularly limited thereto. For example, but not limited to, the image capturing device 110 may be an image capturing device cluster composed of a plurality of image capturing devices, and the terminal 120 may be a terminal cluster composed of a plurality of terminals.
Next, referring to fig. 1, a facial expression capturing method provided by an embodiment of the present specification will be described by taking facial expression capturing performed by the terminal 120 as an example. Please refer to fig. 2, which is a flowchart illustrating a facial expression capturing method according to an exemplary embodiment of the present disclosure. As shown in fig. 2, the facial expression capturing method includes the following steps:
s202, target face video data are obtained, and the target face video data comprise continuous multi-frame target face images.
Specifically, the target face video data is a 2D face video, and the target face image is a 2D face image including only a face. The multiple faces corresponding to the multiple consecutive frames of target face images included in the target face video data may be from the same target object (for example, but not limited to, a target user or a target animal), or may be from different target objects, which is not limited by the embodiment of the present specification.
Alternatively, when facial expression capture of the expressive actions of the user is required, the image capture device 110 may be used to record video of the face of the user, thereby obtaining target facial video data.
Alternatively, when facial expression capture of a person in target face video data prepared in advance is desired, the target face video data may be transmitted to the terminal 120 or the server through a network, but not limited thereto. After the target face video data is acquired, the terminal 120 or the server captures the facial expression according to the facial expression capturing method provided in the embodiments of the present specification.
Optionally, when the target face video data needs to be acquired through the image acquisition device 110, the terminal 120 generally obtains the video data from the image acquisition device 110, that is, it receives over the network the captured video data sent by the image acquisition device 110, where the video data includes a plurality of consecutive frames of images containing faces. Because an image in the video data captured by the image acquisition device 110 rarely contains only the face of the target object and usually also contains some background, in order to ensure the accuracy of facial expression capture, face detection may be performed on the consecutive frames of face-containing images after the video data is received, as shown in fig. 3, so as to obtain the target face video data actually used for facial expression capture. At this point, each of the consecutive frames of target face images in the target face video data is an image containing only a face or an image with a known face position. Thus, during facial expression capture, interference from non-face regions in each frame of the target face video data can be avoided as far as possible, or the position of the face in each frame can be obtained efficiently and accurately, which ensures both the efficiency and the accuracy of facial expression capture.
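The preprocessing just described can be sketched as running a face detection algorithm on every frame and keeping only the face region (or the detected box); `detect_face` below stands in for whatever detector is used and is not specified by this description.

```python
def build_target_face_video(frames, detect_face):
    """frames: iterable of (H, W, 3) image arrays; detect_face: any face
    detection function returning an (x, y, w, h) box or None."""
    target_faces = []
    for frame in frames:
        box = detect_face(frame)
        if box is None:
            continue                      # skip frames where no face is found
        x, y, w, h = box
        # Crop so each target face image contains only the face region;
        # alternatively, the box alone can be kept as the known face position.
        target_faces.append(frame[y:y + h, x:x + w])
    return target_faces
```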
S204, extracting a first target parameter sequence corresponding to the target face video data, wherein the first target parameter sequence comprises first target parameters corresponding to multiple frames of target face images, and the first target parameters comprise first target expression parameters and first target rotation and translation parameters.
Specifically, the first target parameter sequence includes a first target expression parameter sequence and a first target rotation-translation parameter sequence. After the target face video data is obtained, a first target expression parameter and a first target rotation and translation parameter corresponding to each frame of target face image in the target face video data can be respectively extracted, so that a first target expression parameter sequence and a first target rotation and translation parameter sequence corresponding to the target face video data are respectively obtained according to the sequence of each frame of target face image in the target face video data.
Optionally, after the target face video data is acquired, in order to accurately extract the first target parameters corresponding to each frame of target face image in the target face video data, a trained target parameter extractor may be directly used to extract a first target parameter sequence corresponding to the target face video data, and the target parameter extractor is trained based on a plurality of facial images with known facial feature points, so as to ensure high continuity and low application cost of facial expression capture, and further ensure high accuracy of facial expression capture.
Further, to better ensure the accuracy of the first target parameters extracted for each frame of target face image and to improve the effect of optimizing the first target parameter sequence with the target time-series neural network model in S206, the target parameter extractor may first be trained on facial images with a plurality of known facial feature points, which improves the accuracy of parameter extraction from single images, and then be trained jointly with the target time-series neural network model on face video data with a plurality of known facial feature point sequences. The more accurate target parameter extractor thus produces a first target parameter sequence that is both more accurate and already somewhat temporally consistent before optimization, so that face capture on the target face video data becomes more coherent and more accurate, and a second target parameter sequence with higher continuity and higher accuracy is obtained.
Further, as shown in fig. 4, the target parameter extractor at least includes a first target convolutional network and a second target convolutional network, where the first target convolutional network is used to extract a first target expression parameter sequence corresponding to the target facial video data, and the second target convolutional network is used to extract a first target rotation-translation parameter sequence corresponding to the target facial video data.
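Given the two target convolutional networks, the first target parameter sequence can be assembled frame by frame, for example as in the following sketch; the tensor shapes and parameter dimensions are assumed for illustration.

```python
import torch

def extract_first_target_sequences(face_frames, expr_net, pose_net):
    """face_frames: (T, 3, H, W) consecutive target face images.
    Returns the first target expression parameter sequence (T, 52) and the
    first target rotation-translation parameter sequence (T, 6); dims assumed."""
    with torch.no_grad():
        expr_seq = torch.stack([expr_net(f.unsqueeze(0)).squeeze(0) for f in face_frames])
        pose_seq = torch.stack([pose_net(f.unsqueeze(0)).squeeze(0) for f in face_frames])
    return expr_seq, pose_seq
```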
And S206, optimizing the first target parameter sequence by using the target time sequence neural network model to obtain a second target parameter sequence, wherein the second target parameter sequence comprises a second target expression parameter sequence and a second target rotation and translation parameter sequence.
Specifically, the target time-series neural network model is obtained by training based on face video data of a plurality of known face feature point sequences. The facial feature point sequence comprises facial feature points corresponding to each frame of facial image in the facial video data. In order to ensure the accuracy of the trained target time-series neural network as much as possible, the number of the facial feature points corresponding to each frame of facial image is multiple, such as but not limited to 66, 68, etc. The facial feature points in the facial feature point sequence are 2D feature points.
Specifically, as shown in fig. 5, the target time-series neural network model includes a first target time-series neural network and a second target time-series neural network. After a first target expression parameter sequence and a first target rotation and translation parameter sequence corresponding to target face video data are extracted, the first target expression parameter sequence can be optimized by using a first target time sequence neural network of a target time sequence neural network model, the first target rotation and translation parameter sequence can be optimized by using a second target time sequence neural network of the target time sequence neural network model, namely, the first target expression parameter sequence and the first target rotation and translation parameter sequence are input into a trained target time sequence neural network model, and the optimized second target expression parameter sequence and the optimized second target rotation and translation parameter sequence with high coherence and high precision are correspondingly output respectively.
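One plausible realization of the two target time-series neural networks is a pair of sequence-to-sequence recurrent refiners, sketched below with bidirectional GRUs and a residual output; the architecture and dimensions are illustrative assumptions, not the design fixed by this specification.

```python
import torch
import torch.nn as nn

class SequenceRefiner(nn.Module):
    """Refines a per-frame parameter sequence using temporal context."""
    def __init__(self, dim, hidden=128):
        super().__init__()
        self.rnn = nn.GRU(dim, hidden, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, dim)

    def forward(self, seq):                 # seq: (B, T, dim)
        h, _ = self.rnn(seq)
        return seq + self.out(h)            # residual correction per frame

# First/second target time-series neural networks (dimensions assumed).
expr_refiner = SequenceRefiner(dim=52)      # optimizes the expression sequence
pose_refiner = SequenceRefiner(dim=6)       # optimizes the rotation-translation sequence
```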
Specifically, the second target expression parameter sequence includes second target expression parameters corresponding to respective target face images of multiple frames in the target face video data, and is used for representing a change situation of a face expression in the target face video data. The second target expression parameter in the second target expression parameter sequence is used to characterize a multidimensional target expression base coefficient, such as but not limited to a blendshape coefficient of 52 dimensions, that constitutes the overall expression of the face in the target facial image.
Specifically, the second target rotation and translation parameter sequence includes second target rotation and translation parameters corresponding to the multiple frames of target face images in the target face video data respectively, and is used for representing the change of the head pose in the target face video data. The second target rotation and translation parameters corresponding to each frame of the target face image may include, but are not limited to, three rotation parameters and three translation parameters in three-dimensional space.
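For illustration only, the following is a minimal sketch of how such a two-branch time-series optimization could be organised, assuming a 52-dimensional expression parameter and a 6-dimensional rotation and translation parameter per frame; the residual connection, the hidden size, and all names are assumptions of this sketch rather than the claimed implementation.

```python
import torch
import torch.nn as nn

class SequenceRefiner(nn.Module):
    """Recurrent refiner for one per-frame parameter sequence (illustrative)."""
    def __init__(self, dim: int, hidden: int = 128):
        super().__init__()
        self.lstm = nn.LSTM(dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, dim)

    def forward(self, seq: torch.Tensor) -> torch.Tensor:
        # seq: (batch, frames, dim). A residual correction keeps the refined
        # sequence close to the per-frame estimate (a design assumption here).
        out, _ = self.lstm(seq)
        return seq + self.head(out)

# First branch refines 52-dim expression (e.g. blendshape) coefficients,
# second branch refines 3 rotation + 3 translation parameters per frame.
expr_refiner = SequenceRefiner(dim=52)
pose_refiner = SequenceRefiner(dim=6)

first_expr_seq = torch.randn(1, 120, 52)   # stand-in for the extracted sequence
first_pose_seq = torch.randn(1, 120, 6)
second_expr_seq = expr_refiner(first_expr_seq)
second_pose_seq = pose_refiner(first_pose_seq)
```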
In order to solve the above problems, in the embodiments of the present specification, a target time-series neural network model trained on face video data with a plurality of known facial feature point sequences is used to optimize the first target parameter sequence, which contains the facial-expression-related first target parameters of each of the consecutive multiple frames of target face images in the target face video data, to obtain a second target parameter sequence with good continuity and temporal consistency. In this way, high-continuity, high-precision and low-cost facial expression capture can be realized without high-end image acquisition equipment.
Reference is next made to fig. 6, which is a flowchart illustrating another facial expression capturing method according to an exemplary embodiment of the present disclosure. As shown in fig. 6, the facial expression capturing method includes the following steps:
S602, acquiring target face video data, wherein the target face video data comprises continuous multi-frame target face images.
Specifically, S602 is identical to S202, and is not described herein again.
S604, extracting a first target parameter sequence corresponding to the target face video data, wherein the first target parameter sequence comprises first target parameters corresponding to multiple frames of target face images, and the first target parameters comprise first target expression parameters and first target rotation and translation parameters.
Specifically, S604 is identical to S204, and is not described herein again.
And S606, optimizing the first target parameter sequence by using the target time sequence neural network model to obtain a second target parameter sequence, wherein the second target parameter sequence comprises a second target expression parameter sequence and a second target rotation and translation parameter sequence.
Specifically, S606 is the same as S206, and is not described here.
And S608, migrating the second target parameter sequence to the target three-dimensional virtual image.
Specifically, when the facial expression actions in the target facial video data acquired in the physical world are to be displayed through a preset target three-dimensional virtual image, the first target parameter sequence is first optimized with the target time-series neural network model to obtain the second target parameter sequence, that is, the second target parameter sequence related to the facial expression in the target facial video data is captured. Each second target expression parameter and each second target rotation and translation parameter in the second target parameter sequence are then migrated to the target three-dimensional virtual image in the order of the corresponding frames of target facial images in the target facial video data. The target three-dimensional virtual image thus carries the same second target expression parameters and second target rotation and translation parameters, and the same frame-to-frame changes between them, as the second target parameter sequence corresponding to the target facial video data, so it can reproduce the same facial expression actions as those of the target object in the target facial video data, which greatly ensures the consistency and precision of the actions made by the target three-dimensional virtual image.
Further, the second target expression parameter sequence includes second target expression parameters corresponding to the target facial images of the multiple frames, and is used for representing the change situation of the facial expression in the target facial video data. The second target expression parameters are used for representing multi-dimensional target expression base coefficients forming the whole facial expression in the target facial image. The second target rotational translation parameter sequence comprises second target rotational translation parameters corresponding to multiple frames of target face images in the target face video data respectively, and is used for representing the change situation of the head posture in the target face video data. Therefore, after the second target parameter sequence is migrated to the target three-dimensional avatar, the target three-dimensional avatar can make facial expression changes same as those in the target facial video data according to the second target expression parameter sequence, and simultaneously make head posture changes same as those in the target facial video data according to the second target rotation translation parameter sequence, so as to ensure the accuracy and consistency of facial expression capture.
Further, in order to ensure the expression migration effect on the target three-dimensional virtual image, the dimensionality of the expression basis constituting the whole facial expression of the target three-dimensional virtual image should be equal to the dimensionality of the target expression base coefficient represented by the second target expression parameter, so that the second target parameter sequence can be migrated to the target three-dimensional virtual image smoothly and efficiently.
It can be understood that, in the embodiments of the present specification, it is only necessary that the dimensionality of the expression basis constituting the whole facial expression of the target three-dimensional virtual image be equal to the dimensionality of the target expression base coefficient represented by the second target expression parameter in order for the target three-dimensional virtual image to make the same expression actions as in the target facial video data. The face of the target three-dimensional virtual image may be identical to, similar to, or completely different from the face of the target object in the target facial video data; for example, but not limited to, a target three-dimensional virtual image corresponding to an animal can still make the same expression actions after facial expression capture and migration.
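As an illustration of such migration, the sketch below applies frames of captured parameters to a hypothetical avatar rig whose expression basis also has 52 dimensions; the avatar interface (`set_blendshape_weight`, `set_head_pose`) and the blendshape names are assumptions of this sketch, not an existing API.

```python
import numpy as np

BLENDSHAPE_NAMES = [f"bs_{i:02d}" for i in range(52)]   # placeholder names

def retarget_frame(avatar, expr_params: np.ndarray, pose_params: np.ndarray) -> None:
    """Apply one frame of the second target parameter sequence to an avatar rig."""
    assert expr_params.shape == (52,) and pose_params.shape == (6,)
    for name, weight in zip(BLENDSHAPE_NAMES, expr_params):
        avatar.set_blendshape_weight(name, float(weight))    # hypothetical API
    avatar.set_head_pose(pose_params[:3], pose_params[3:])   # rotation, translation

def retarget_sequence(avatar, expr_seq, pose_seq) -> None:
    # Frames are applied in the same order as in the target face video data.
    for expr_params, pose_params in zip(expr_seq, pose_seq):
        retarget_frame(avatar, np.asarray(expr_params), np.asarray(pose_params))
```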
Illustratively, as shown in fig. 7, the facial expression capturing method provided by the embodiments of the present specification may be mainly divided into three parts: a video data acquisition and preprocessing module, a facial expression capturing module, and an expression migration and virtual avatar display module. The video data acquisition and preprocessing module is mainly used for acquiring video data of the face of a target object (such as a person or an animal) through image acquisition equipment, such as a monocular RGB (red, green and blue) camera or a mobile phone, and performing face detection on each acquired frame of image to obtain target face video data. The facial expression capturing module then extracts the first target parameter sequence corresponding to the preprocessed target face video data, inputs it into the target time-series neural network model, and outputs the optimized second target parameter sequence. Finally, the expression migration and virtual avatar display module migrates the captured expression of the target object in the target face video data (the second target parameter sequence) onto a pre-made 3D virtual avatar (target three-dimensional virtual image); the 3D virtual avatar driven by the second target parameter sequence can make the same expression actions as the target object, and the dynamic effect is displayed.
Referring next to fig. 1-7, the sequential neural network model training method involved in the above embodiments will be described by taking the sequential neural network model training performed by the terminal 120 as an example. Fig. 8 is a schematic flow chart of a method for training a time-series neural network model according to an exemplary embodiment of the present disclosure. As shown in fig. 8, the training method of the time-series neural network model includes the following steps:
S802, face video data of a plurality of known face feature point sequences are obtained, the face video data comprise continuous multi-frame first face images, and the face feature point sequences comprise first face feature points corresponding to the multi-frame first face images respectively.
Specifically, when capturing facial expressions, the first target parameter sequence corresponding to the target facial video data needs to be optimized with the target time-series neural network model in order to ensure the consistency and accuracy of facial expression capture. Therefore, to ensure the optimization effect of the target time-series neural network model, before the target time-series neural network model is used in S206 to optimize the first target parameter sequence and obtain the second target parameter sequence, face video data of a plurality of known facial feature point sequences need to be obtained to train the time-series neural network model, so as to obtain the trained target time-series neural network model.
Specifically, in order to train the time-series neural network model involved in the facial expression capturing method, the terminal 120 may first receive a plurality of video data transmitted by the image capturing apparatus 110 through a network, or acquire a plurality of video data from a plurality of video platforms. To improve the quality of the training data and ensure the training effect of the time-series neural network model, after the plurality of video data are obtained, video data in which the facial expression cannot be identified (for example, because the face is blurred) can be cleaned out. The face region of each remaining video is then cropped with a face detection algorithm to obtain a plurality of face video data, and each frame of first face image of each face video data is detected with a facial feature point detection algorithm (such as, but not limited to, the OpenSeeFace face 2D feature point detection algorithm) to obtain a plurality of first facial feature points (2D feature points), so that the plurality of face video data and the facial feature point sequence corresponding to each of them are obtained. The number of first facial feature points corresponding to each frame of first face image may be 66, 68, or the like, which is not limited in this specification.
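For illustration only, a minimal preprocessing sketch is given below; it uses OpenCV's built-in Haar cascade face detector as a stand-in for the face detection algorithm and treats the 2D landmark detector as an injected callable, since the concrete detector (e.g. an OpenSeeFace-style model) is not specified here.

```python
import cv2
import numpy as np

def preprocess_video(path, landmark_detector):
    """Crop the face region of each frame and record its 2D feature points.

    `landmark_detector` is assumed to map a face crop to an (N, 2) array of
    2D feature points (e.g. N = 68); its interface is an assumption here.
    """
    face_cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    capture = cv2.VideoCapture(path)
    face_frames, landmark_seq = [], []
    while True:
        ok, frame = capture.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        boxes = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
        if len(boxes) == 0:
            continue  # "cleaning": drop frames where no usable face is found
        x, y, w, h = boxes[0]
        crop = frame[y:y + h, x:x + w]
        face_frames.append(crop)
        landmark_seq.append(np.asarray(landmark_detector(crop)))
    capture.release()
    return face_frames, landmark_seq
```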
S804, extracting a first parameter sequence corresponding to the face video data, wherein the first parameter sequence comprises parameters corresponding to a plurality of frames of first face images, and the parameters comprise a first identity parameter, a first expression parameter and a first rotation and translation parameter.
Specifically, the first parameter sequence includes a first identity parameter sequence, a first expression parameter sequence, and a first rotation-translation parameter sequence. After the plurality of facial video data are acquired, the first identity parameter, the first expression parameter and the first rotation and translation parameter corresponding to each frame of the first facial image in each facial video data can be respectively extracted, so that the first identity parameter sequence, the first expression parameter sequence and the first rotation and translation parameter sequence corresponding to the facial video data are respectively obtained according to the sequence of each frame of the first facial image in the facial video data.
It can be understood that the identity of the object corresponding to the faces in the multiple frames of first face images of one piece of face video data may change; for example, the face in the first frame of first face image is the face of user A while the face in the second frame is the face of user B, that is, the head shape in the face video data changes, which would affect the training effect of the time-series neural network model. To guard against this, the first identity parameter corresponding to each frame of first face image in the face video data may be directly extracted to participate in the training of the time-series neural network model. When the object identity corresponding to the faces of the multiple frames of first face images does not change, in order to improve the training efficiency of the time-series neural network model, only the first identity parameter corresponding to a single frame of first face image (for example, the first frame) in the face video data may be extracted to participate in the training.
Optionally, after the plurality of face video data are acquired, in order to accurately extract the parameters corresponding to each frame of first face image in the face video data, a trained target parameter extractor may be used directly to extract the first parameter sequence corresponding to the face video data. The parameter extractor is trained on face images with a plurality of known facial feature points, which improves the precision and accuracy of the parameter sequences that need to be input into the time-series neural network model during its training, and thereby strengthens the ability of the trained target time-series neural network model to improve the coherence and precision of the first target parameter sequence during facial expression capture.
Further, as shown in fig. 9, the target parameter extractor includes at least a first convolution network, a second convolution network, and a third convolution network. When the target parameter extractor is used for extracting a first parameter sequence corresponding to the face video data, a first expression parameter sequence corresponding to the face video data is extracted by using a first convolution network of the target parameter extractor, a first rotation and translation parameter sequence corresponding to the face video data is extracted by using a second convolution network of the target parameter extractor, and a first identity parameter sequence corresponding to the face video data is extracted by using a third convolution network of the target parameter extractor.
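For illustration only, a minimal sketch of such a multi-branch extractor is given below; the backbone depth, the identity dimensionality (here 80) and all names are assumptions of this sketch rather than the architecture actually claimed.

```python
import torch
import torch.nn as nn

class ConvBranch(nn.Module):
    """Small convolutional network regressing one parameter group from a face image."""
    def __init__(self, out_dim: int):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Linear(64, out_dim)

    def forward(self, img: torch.Tensor) -> torch.Tensor:
        return self.head(self.features(img))

class ParameterExtractor(nn.Module):
    """Illustrative extractor with expression, rotation-translation and identity branches."""
    def __init__(self, expr_dim: int = 52, pose_dim: int = 6, id_dim: int = 80):
        super().__init__()
        self.expr_net = ConvBranch(expr_dim)   # "first convolution network"
        self.pose_net = ConvBranch(pose_dim)   # "second convolution network"
        self.id_net = ConvBranch(id_dim)       # "third convolution network"

    def forward(self, frames: torch.Tensor):
        # frames: (num_frames, 3, H, W); one parameter set per frame, kept in frame
        # order, so the outputs already form the per-video parameter sequences.
        return self.expr_net(frames), self.pose_net(frames), self.id_net(frames)
```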
And S806, inputting the first expression parameter sequence and the first rotation and translation parameter sequence corresponding to the face video data into a time sequence neural network model, and outputting the optimized second expression parameter sequence and the optimized second rotation and translation parameter sequence.
Specifically, because the first expression parameter sequence and the first rotation and translation parameter sequence are obtained from single frames of images only, there is no association between the preceding and following frames in the sequences, so the expressions of adjacent frames may be incoherent, or the head may jitter slightly during rotational movement. The time-series neural network model is therefore used to optimize the two sequences so that the parameters of adjacent frames become temporally consistent.
It is understood that since the identity parameters of each frame in the face video data are generally fixed and do not change with the movement of the person, optimization is not required. However, in scenes such as 3D movie animation production and 3D avatar animation display, in order to improve the accuracy and continuity of animation or avatar animation display and enhance the optimization effect of the trained target time sequence network model, the first identity parameter sequence corresponding to the facial video data, the first expression parameter sequence and the first rotation and translation parameter sequence may be input into the time sequence neural network model together, and the output second identity parameter sequence, the second expression parameter sequence and the second rotation and translation parameter sequence are used to generate the first three-dimensional meshes corresponding to the multiple frames of the first facial images together.
Specifically, as shown in fig. 9, after the first parameter sequence corresponding to the face video data is extracted, the first expression parameter sequence corresponding to the face video data may be input into the first time-series neural network of the time-series neural network model, which outputs the corresponding optimized second expression parameter sequence, and the first rotation and translation parameter sequence corresponding to the face video data may be input into the second time-series neural network of the time-series neural network model, which outputs the corresponding optimized second rotation and translation parameter sequence. The first and second time-series neural networks are recurrent neural networks that take sequence data as input, recurse along the evolution direction of the sequence, and connect all nodes (recurrent units) in a chain; they may be, for example but not limited to, Long Short-Term Memory (LSTM) networks.
And S808, generating first three-dimensional grids corresponding to the first face images of the multiple frames based on the first identity parameter sequence, the second expression parameter sequence and the second rotation and translation parameter sequence corresponding to the face video data.
Specifically, as shown in fig. 9, after the optimized second expression parameter sequence and second rotation and translation parameter sequence are obtained, they may be combined with the 3DMM basis vectors to generate a 3D mesh (first three-dimensional mesh) fitting each frame of first face image of the face video data. The first three-dimensional mesh is composed of a plurality of three-dimensional vertices (feature points) capable of characterizing the head, connected into a mesh.
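As an illustration, the following sketch combines the basis vectors of a linear parameterized face model (an assumed 3DMM layout with a mean shape plus identity and expression bases) and then applies the rotation and translation parameters; the axis-angle rotation convention and all array shapes are assumptions of this sketch.

```python
import numpy as np
from scipy.spatial.transform import Rotation

def build_mesh(mean_shape, id_basis, expr_basis, id_params, expr_params, pose_params):
    """Combine assumed 3DMM basis vectors into a rigidly posed 3D mesh.

    mean_shape  : (V, 3) mean face vertices
    id_basis    : (V, 3, K_id) identity basis vectors
    expr_basis  : (V, 3, K_expr) expression basis vectors
    pose_params : 6 values, 3 rotation (axis-angle) + 3 translation
    """
    vertices = mean_shape + id_basis @ id_params + expr_basis @ expr_params  # (V, 3)
    rotation = Rotation.from_rotvec(pose_params[:3]).as_matrix()
    return vertices @ rotation.T + pose_params[3:]   # posed vertices
```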
S810, determining a first loss of the time-series neural network model based on the first three-dimensional grids corresponding to the first face images of the plurality of frames and the first face feature points corresponding to the first face images of the plurality of frames.
Specifically, as shown in fig. 10, the process of determining the first loss of the time series neural network model in S810 includes the following steps:
S1002, three-dimensional face feature points of first three-dimensional face grids corresponding to the first face images of the multiple frames are obtained.
Specifically, after the first three-dimensional meshes to which the plurality of frames of the first face images respectively correspond are determined, the three-dimensional face feature points in each of the first three-dimensional face meshes can be obtained according to the feature point index. The three-dimensional face feature points are points that can represent the face contour and five sense organs among all the vertices that constitute the first three-dimensional face mesh.
And S1004, projecting the three-dimensional facial feature points to obtain corresponding two-dimensional facial feature points.
Specifically, low-end image capturing devices such as a monocular RGB camera can only acquire two-dimensional image data and cannot acquire three-dimensional information. Therefore, in order to ensure that the high-continuity, high-precision facial expression capturing method in the embodiments of the present specification can be widely applied to low-cost image capturing devices, during the training of the time-series neural network model the three-dimensional facial feature points are projected into two dimensions to obtain the corresponding two-dimensional facial feature points, so that the first loss of the time-series neural network model can be determined from the first facial feature points of each two-dimensional first facial image in the facial video data and the corresponding two-dimensional facial feature points, without any additional three-dimensional information.
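For illustration, a minimal projection of the three-dimensional facial feature points to two dimensions with a simple pinhole camera model is sketched below; the exact projection model and camera intrinsics used in practice are not specified here, so the focal length and principal point are assumptions.

```python
import numpy as np

def project_to_2d(points_3d, focal, image_width, image_height):
    """Project (N, 3) facial feature points in camera coordinates (z > 0) to (N, 2) pixels."""
    points_3d = np.asarray(points_3d, dtype=float)
    cx, cy = image_width / 2.0, image_height / 2.0   # assumed principal point
    x = focal * points_3d[:, 0] / points_3d[:, 2] + cx
    y = focal * points_3d[:, 1] / points_3d[:, 2] + cy
    return np.stack([x, y], axis=1)
```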
S1006, determining a first loss of the time-series neural network model based on the first facial feature points corresponding to the first facial images of the plurality of frames and the two-dimensional facial feature points corresponding to the first three-dimensional grids corresponding to the first facial images of the plurality of frames.
Specifically, as shown in fig. 9, after determining two-dimensional facial feature points corresponding to the first three-dimensional meshes corresponding to each of the first facial images of the plurality of frames, the first loss of the time-series neural network model may be determined directly according to the differences between the first facial feature points corresponding to each of the first facial images and the two-dimensional facial feature points. The number of first facial feature points corresponding to the first facial image is the same as the number of corresponding two-dimensional facial feature points.
Exemplarily, the first loss may take the form

$$L_{1} = \sum_{n=1}^{N} \omega_{n}\,\lVert q_{n} - q'_{n} \rVert$$

wherein $N$ represents the total number of first facial feature points corresponding to the first facial image; $\omega_{n}$ represents the weight coefficient corresponding to the nth first facial feature point; $q_{n}$ denotes the nth first facial feature point (vector); $q'_{n}$ represents the nth two-dimensional facial feature point (vector) corresponding to the first three-dimensional mesh.
It can be understood that, for each frame of first facial image, first facial feature points at different parts may correspond to different weight coefficients. For example, but not limited to, the weight coefficient of the first facial feature points of the mouth may be greater than that of the first facial feature points of the facial contour, so that the time-series neural network model trained with the first loss achieves a better optimization effect on the expression parameter sequence.
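A minimal sketch of such a weighted feature-point loss is given below; the landmark indexing used for the mouth weighting follows the common 68-point convention and is only an assumed example.

```python
import torch

def landmark_loss(pred_2d, gt_2d, weights):
    """Weighted 2D feature-point loss: sum_n w_n * ||q_n - q'_n||.

    pred_2d : (N, 2) two-dimensional feature points projected from the 3D mesh
    gt_2d   : (N, 2) detected first facial feature points
    weights : (N,) per-point weight coefficients
    """
    return (weights * torch.linalg.norm(gt_2d - pred_2d, dim=-1)).sum()

# Assumed weighting example: emphasise mouth landmarks over the face contour.
weights = torch.ones(68)
weights[48:] = 2.0   # indices 48-67 are the mouth in the common 68-point layout
```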
And S812, training the time sequence neural network model based on the first loss to obtain a trained target time sequence neural network model.
Optionally, after determining the first loss of the time-series neural network model, only the time-series neural network model may be trained based on the first loss, resulting in a trained target time-series neural network model.
In the embodiments of the present specification, the time-series neural network model is trained with face video data of a plurality of known two-dimensional facial feature point sequences to obtain a trained target time-series neural network model. This gives the target time-series neural network model the ability to strengthen the association between the parameters of adjacent frames in the first target parameter sequence corresponding to the target face video data involved in facial expression capture, so the continuity and precision of facial expression capture can be improved by the target time-series neural network model, which in turn supports the wide application of the facial expression capturing method in the embodiments of the present specification to low-end image acquisition equipment.
Optionally, after the first loss of the time-series neural network model is determined, the time-series neural network model, the first convolution network, the second convolution network, and the third convolution network may all be trained based on the first loss, so as to obtain the trained target time-series neural network model as well as the first target convolutional network, the second target convolutional network, and the third target convolutional network. In this case, besides the time-series neural network model itself, the first, second and third convolution networks in the parameter extractor are trained simultaneously, which further improves the accuracy of the parameter extraction performed by the first convolution network (first target convolutional network) and the second convolution network (second target convolutional network) during facial expression capture. The first target convolutional network is configured to extract the first target expression parameter sequence corresponding to the target facial video data involved in the facial expression capturing method provided in the embodiments of the present specification; the second target convolutional network is configured to extract the first target rotation and translation parameter sequence corresponding to that target facial video data.
In order to ensure the accuracy of the parameter sequence extracted in S804 and improve the training effect of the time-series neural network model, thereby ensuring the accuracy of facial expression capture, the target parameter extractor needs to be trained before it is used in S804 to extract the parameter sequence corresponding to the face video data. Please refer to fig. 11, which is a schematic diagram of a training process of a target parameter extractor according to an exemplary embodiment of the present disclosure. As shown in fig. 11, the training process of the target parameter extractor includes the following steps:
S1102, a plurality of second face images of known second face feature points are acquired.
Specifically, when a trained target parameter extractor is desired, a plurality of second facial images with known second facial feature points may be acquired as training data through a network or an image acquisition device. The number of second facial feature points of each second facial image is plural, such as but not limited to 66 or 68. The second facial image may be, but is not limited to, a two-dimensional image acquired by an image acquisition device. The second facial feature points are 2D feature points.
Alternatively, as shown in fig. 12, to train the target parameter extractor, a plurality of second face images may be obtained, and then the second face feature points corresponding to the plurality of second face images and the positions of the faces in the second face images may be determined by using a face detection algorithm and a feature point detection algorithm (for example, but not limited to, an OpenSeeFace face 2D feature point detection algorithm).
And S1104, training a parameter extractor based on a plurality of second face images with known second face feature points to obtain a trained target parameter extractor.
Specifically, as shown in fig. 13, the specific implementation flow of the training parameter extractor in S1104 includes the following steps:
S1302, a plurality of second face images with known second face feature points are input into the parameter extractor, and a parameter set corresponding to each second face image is output.
Specifically, as shown in fig. 14, the parameter set includes a second identity parameter, a texture parameter, a second expression parameter, and a second rotation and translation parameter. The parameter extractor includes a first convolutional network, a second convolutional network, a third convolutional network, and a fourth convolutional network. Wherein: the first convolutional network is used for extracting the second expression parameter corresponding to the second facial image; the second convolutional network is used for extracting the second rotation and translation parameter corresponding to the second facial image; the third convolutional network is used for extracting the second identity parameter corresponding to the second facial image; and the fourth convolutional network is used for extracting the texture parameter corresponding to the second facial image.
And S1304, combining the basis vectors of the parameterized three-dimensional face model based on the parameter set to generate a corresponding second three-dimensional grid.
Specifically, as shown in fig. 14, when the basis vectors of the parameterized three-dimensional face model are combined based on the parameter set to generate the corresponding second three-dimensional mesh, the basis vectors of the parameterized three-dimensional face model may be combined based on the second identity parameter, the texture parameter, and the second expression parameter to generate the corresponding second three-dimensional mesh including the texture. The parameterized three-dimensional face model is usually obtained by performing statistical decomposition on a large number of parameterized 3D faces with the same topology, and comprises shape base vectors, expression base vectors and texture base vectors, and 3D faces (three-dimensional grids) with various shapes can be generated by fitting various combinations of the base vectors.
And S1306, acquiring two-dimensional facial feature points corresponding to the second three-dimensional grid.
Specifically, after the second three-dimensional meshes corresponding to the plurality of frames of second face images are determined, the three-dimensional face feature points corresponding to the second three-dimensional meshes may be obtained according to the feature point index. Since low-end image capturing devices such as monocular RGB cameras can only acquire two-dimensional image data but cannot acquire three-dimensional information, in order to ensure that the facial expression capturing method with high continuity and high precision in the embodiments of the present specification can be widely applied to low-cost image capturing devices, in the process of training the target parameter extractor, the corresponding two-dimensional facial feature points are also obtained by projecting the three-dimensional facial feature points corresponding to the second three-dimensional mesh, so that the loss of the parameter extractor is determined without providing additional three-dimensional information. The three-dimensional face feature points are points that can represent the facial contour and five sense organs among all the vertices constituting the second three-dimensional face mesh.
And S1308, rendering the second three-dimensional grid into a two-dimensional image.
Specifically, after determining the second three-dimensional meshes corresponding to the second face images of the plurality of frames, the differentiable renderer may be further used to render the second three-dimensional meshes into corresponding two-dimensional images based on the texture parameter and the second rotation and translation parameter.
It is understood that S1308 and S1306 may be executed sequentially or synchronously, which is not limited in this embodiment of the present disclosure.
S1310, a second loss corresponding to the parameter extractor is determined based on the two-dimensional facial feature points and the two-dimensional image corresponding to the second three-dimensional mesh.
Specifically, as shown in fig. 14, the second loss includes a feature point loss and a pixel loss. After determining the two-dimensional facial feature points and the two-dimensional image corresponding to the second three-dimensional mesh, a feature point loss of the parameter extractor may be determined based on the two-dimensional facial feature points corresponding to the second three-dimensional mesh and the second facial feature points of the second facial image, and a pixel loss of the parameter extractor may be determined based on the two-dimensional image and the second facial image.
Exemplarily, the feature point loss may take the form

$$L_{lmk} = \sum_{n=1}^{N} \omega_{n}\,\lVert Q_{n} - Q'_{n} \rVert$$

wherein $N$ represents the total number of second face feature points corresponding to the second face image; $\omega_{n}$ represents the weight coefficient corresponding to the nth second face feature point; $Q_{n}$ represents the nth second face feature point (vector); $Q'_{n}$ represents the nth two-dimensional facial feature point (vector) corresponding to the second three-dimensional mesh. The pixel loss may take the form

$$L_{pix} = \frac{1}{M} \sum_{i=1}^{M} A_{i}\,\lvert I_{i} - I'_{i} \rvert$$

wherein $M$ represents the total number of pixel points of the second face image; $A_{i}$ represents a mask map generated based on the face position in the second face image, the mask map being used to distinguish facial regions from non-facial regions in the second face image; $I_{i}$ denotes the ith pixel point of the second face image $I$; $I'_{i}$ denotes the ith pixel point of the rendered two-dimensional image $I'$.
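For illustration, a minimal sketch of the masked pixel loss is given below (the feature point loss can be computed in the same way as the weighted landmark loss sketched earlier); the L1 difference, the per-channel handling, and the relative weighting of the two terms are assumptions of this sketch.

```python
import torch

def pixel_loss(rendered, target, face_mask):
    """Masked photometric loss of the form (1/M) * sum_i A_i * |I_i - I'_i|.

    rendered, target : (H, W, 3) rendered image and second face image
    face_mask        : (H, W) mask, 1 inside the facial region, 0 elsewhere
    """
    per_pixel = (rendered - target).abs().sum(dim=-1)   # L1 difference per pixel
    return (face_mask * per_pixel).sum() / face_mask.numel()

def extractor_loss(feature_point_loss, pix_loss, pixel_weight=1.0):
    # Combining the two terms with a weight is an assumption for illustration.
    return feature_point_loss + pixel_weight * pix_loss
```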
And S1312, obtaining a trained target parameter extractor based on the second loss training parameter extractor.
Specifically, after the second loss of the parameter extractor is determined based on the plurality of second facial images with known second facial feature points, the parameter extractor may be trained based on the feature point loss in the second loss, which ensures the accuracy of the expression parameters, rotation and translation parameters, and the like extracted by the trained target parameter extractor. At the same time, since the second facial image contains a large amount of information, the pixel loss in the second loss is used to constrain the training of the parameter extractor, so as to avoid overfitting or failure to learn caused by too many parameters, allow all the parameters to be fitted better, and further improve the training effect of the target parameter extractor.
Next, refer to fig. 15, which is a schematic structural diagram of a facial expression capturing apparatus according to an exemplary embodiment of the present disclosure. As shown in fig. 15, the facial expression capture apparatus 1500 includes:
a first obtaining module 1510 configured to obtain target face video data; the target face video data includes a plurality of consecutive frames of target face images;
a first extracting module 1520, configured to extract a first target parameter sequence corresponding to the target face video data; the first target parameter sequence comprises first target parameters corresponding to the multiple frames of target face images respectively; the first target parameters comprise first target expression parameters and first target rotation and translation parameters;
the first optimization module 1530 is configured to optimize the first target parameter sequence by using a target time series neural network model to obtain a second target parameter sequence; the second target parameter sequence comprises a second target expression parameter sequence and a second target rotation and translation parameter sequence; the target time sequence neural network model is obtained by training based on face video data of a plurality of known face feature point sequences.
In a possible implementation manner, the first extraction module 1520 is specifically configured to:
extracting a first target parameter sequence corresponding to the target face video data by using a target parameter extractor; the target parameter extractor is obtained by training based on a plurality of facial images of known facial feature points.
In one possible implementation, the target parameter extractor is trained based on facial images of a plurality of known facial feature points and facial video data of the plurality of known facial feature point sequences.
In a possible implementation manner, the target parameter extractor includes a first target convolutional network and a second target convolutional network; the first target convolutional network is used for extracting a first target expression parameter sequence corresponding to the target face video data; the second target convolutional network is used for extracting a first target rotation-translation parameter sequence corresponding to the target face video data.
In one possible implementation, the target time-series neural network model includes a first target time-series neural network and a second target time-series neural network; the first target time sequence neural network is used for optimizing a first target expression parameter sequence corresponding to the target face video data; the second target time sequence neural network is used for optimizing a first target rotation translation parameter sequence corresponding to the target face video data.
In one possible implementation manner, the facial expression capturing apparatus 1500 further includes:
and the expression transferring module is used for transferring the second target parameter sequence to the target three-dimensional virtual image.
In a possible implementation manner, the second target expression parameter sequence includes second target expression parameters corresponding to the multiple frames of target facial images, and is used to represent a change condition of a facial expression in the target facial video data; the second target expression parameter is used for representing multidimensional target expression base coefficients of the whole facial expression in the target facial image.
In one possible implementation manner, the dimension of the expression base of the whole facial expression of the target three-dimensional virtual image is equal to the dimension of the target expression base coefficient.
In a possible implementation manner, the second target rotation and translation parameter sequence includes second target rotation and translation parameters corresponding to the multiple frames of target face images respectively, and is used for characterizing the change of the head pose in the target face video data.
In a possible implementation manner, the first obtaining module 1510 includes:
the first acquisition unit is used for acquiring video data based on image acquisition equipment; the video data includes a plurality of consecutive frames of images containing faces;
a first face detection unit, configured to perform face detection on the consecutive frames of images including faces to obtain target face video data; the target face image is an image including only a face or an image whose position of the face is known.
The division of the modules in the facial expression capture device is for illustration only, and in other embodiments, the facial expression capture device may be divided into different modules as needed to complete all or part of the functions of the facial expression capture device. The implementation of the respective modules in the facial expression capturing apparatus provided in the embodiments of the present specification may be in the form of a computer program. The computer program may be run on a terminal or a server. The program modules constituted by the computer program may be stored on the memory of the terminal or the server. The computer program, when executed by a processor, implements all or part of the steps of the facial expression capturing method described in the embodiments of the present specification.
Fig. 16 is a schematic structural diagram of a time series neural network model training device according to an exemplary embodiment of the present disclosure. As shown in fig. 16, the time-series neural network model training apparatus 1600 includes:
a second obtaining module 1610, configured to obtain face video data of a plurality of known sequences of facial feature points; the face video data includes a plurality of consecutive frames of first face images; the facial feature point sequence comprises first facial feature points corresponding to the first facial images of the plurality of frames respectively;
a second extracting module 1620, configured to extract a parameter sequence corresponding to the facial video data; the first parameter sequence comprises parameters corresponding to the first face images of the plurality of frames respectively; the parameters comprise a first identity parameter, a first expression parameter and a first rotation and translation parameter;
a second optimizing module 1630, configured to input the first expression parameter sequence and the first rotation/translation parameter sequence corresponding to the facial video data into a time-series neural network model, and output an optimized second expression parameter sequence and an optimized second rotation/translation parameter sequence;
a first generating module 1640, configured to generate a first three-dimensional mesh corresponding to each of the plurality of frames of first facial images based on the first identity parameter sequence, the second expression parameter sequence, and the second rotation-translation parameter sequence corresponding to the facial video data;
a first determining module 1650, configured to determine a first loss of the time-series neural network model based on the first three-dimensional meshes corresponding to the first facial images of the plurality of frames and the first facial feature points corresponding to the first facial images of the plurality of frames;
a first training module 1660, configured to train the time-series neural network model based on the first loss to obtain a trained target time-series neural network model; the target time series neural network model described above is used to optimize the first target parameter sequence described in the embodiments of the present specification.
In a possible implementation manner, the first determining module 1650 includes:
a second obtaining unit, configured to obtain three-dimensional facial feature points of first three-dimensional facial meshes corresponding to the plurality of frames of first facial images;
the projection unit is used for projecting the three-dimensional facial feature points to obtain corresponding two-dimensional facial feature points;
and a first determining unit, configured to determine a first loss of the time-series neural network model based on the first facial feature points corresponding to the first facial images of the plurality of frames and the two-dimensional facial feature points corresponding to the first three-dimensional meshes corresponding to the first facial images of the plurality of frames.
In a possible implementation manner, the second extracting module 1620 is specifically configured to:
and extracting a parameter sequence corresponding to the face video data by using a target parameter extractor.
In a possible implementation manner, the time series neural network model training apparatus 1600 further includes:
the third acquisition module is used for acquiring a plurality of second face images of known second face characteristic points;
and the second training module is used for training the parameter extractor based on the plurality of second face images with known second face characteristic points to obtain the trained target parameter extractor.
In a possible implementation manner, the second training module includes:
a parameter extraction unit, configured to input a plurality of second face images with known second face feature points into a parameter extractor, and output a parameter set corresponding to each of the second face images;
the combination unit is used for combining the basis vectors of the parameterized three-dimensional face model based on the parameter set to generate a corresponding second three-dimensional grid;
a third obtaining unit, configured to obtain two-dimensional facial feature points corresponding to the second three-dimensional mesh;
a rendering unit, configured to render the second three-dimensional mesh into a two-dimensional image;
a second determining unit configured to determine a second loss corresponding to the parameter extractor based on the two-dimensional face feature point corresponding to the second three-dimensional mesh and the two-dimensional image;
and a training unit configured to train the parameter extractor based on the second loss to obtain the trained target parameter extractor.
In a possible implementation manner, the parameter set includes a second identity parameter, a texture parameter, a second expression parameter, and a second rotation and translation parameter;
the combination unit is specifically configured to:
combining the base vectors of the parameterized three-dimensional face model based on the second identity parameters, the texture parameters and the second expression parameters to generate a corresponding second three-dimensional grid;
the rendering unit is specifically configured to:
rendering the second three-dimensional mesh into a two-dimensional image based on the texture parameter and the second rotational-translational parameter.
In a possible implementation manner, the parameter extractor includes a first convolution network, a second convolution network, a third convolution network, and a fourth convolution network; wherein:
the first convolution network is used for extracting a second expression parameter corresponding to the second facial image;
the second convolution network is configured to extract a second rotation-translation parameter corresponding to the second face image;
the third convolutional network is configured to extract a second identity parameter corresponding to the second face image;
and the fourth convolution network is used for extracting the texture parameters corresponding to the second face image.
In a possible implementation manner, the third obtaining unit is specifically configured to:
acquiring three-dimensional face feature points corresponding to the second three-dimensional grid; and projecting the three-dimensional facial feature points corresponding to the second three-dimensional grid to obtain corresponding two-dimensional facial feature points.
In a possible implementation manner, the second determining unit is specifically configured to:
determining a feature point loss of the parameter extractor based on the two-dimensional face feature points corresponding to the second three-dimensional mesh and the second face feature points of the second face image; determining a pixel loss of the parameter extractor based on the two-dimensional image and the second face image;
the training unit is specifically configured to:
and training the parameter extractor based on the feature point loss and the pixel loss to obtain the trained target parameter extractor.
In a possible implementation manner, the third obtaining module includes:
a fourth acquiring unit configured to acquire a plurality of second face images;
and a third determining unit configured to determine second face feature points corresponding to the plurality of second face images by using a face detection algorithm and a feature point detection algorithm.
In a possible implementation manner, the number of the second face feature points of the second face image is multiple; the second face image is a two-dimensional image acquired by an image acquisition device.
In a possible implementation manner, the second extracting module 1620 is specifically configured to:
extracting a first expression parameter sequence corresponding to the facial video data by using a first convolution network of the target parameter extractor; extracting a first rotation-translation parameter sequence corresponding to the face video data by utilizing a second convolution network of the target parameter extractor; and extracting a first identity parameter sequence corresponding to the face video data by using a third convolution network of the target parameter extractor.
In a possible implementation manner, the first training module 1660 is specifically configured to:
training the time sequence neural network model, the first convolution network, the second convolution network and the third convolution network based on the first loss to obtain a trained target time sequence neural network model, a first target convolution network, a second target convolution network and a third target convolution network;
the first target convolutional network is used for extracting a first target expression parameter sequence corresponding to target face video data described in the embodiments of the present specification;
the second target convolutional network is used to extract the first target rototranslation parameter sequence corresponding to the target face video data described in the embodiments of the present specification.
The division of each module in the time sequence neural network model training device is only used for illustration, and in other embodiments, the time sequence neural network model training device may be divided into different modules as needed to complete all or part of the functions of the time sequence neural network model training device. The implementation of each module in the time series neural network model training apparatus provided in the embodiments of the present specification may be in the form of a computer program. The computer program may be run on a terminal or a server. The program modules constituted by the computer program may be stored on the memory of the terminal or the server. The computer program, when executed by a processor, implements all or part of the steps of the sequential neural network model training method described in the embodiments of the present specification.
Next, refer to fig. 17, which is a schematic structural diagram of an electronic device according to an exemplary embodiment of the present disclosure. As shown in fig. 17, the electronic device 1700 may include: at least one processor 1710, at least one communication bus 1720, a user interface 1730, at least one network interface 1740, and memory 1750.
The communication bus 1720 may be used for connection communication among the above components.
User interface 1730 may include a Display screen (Display) and a Camera (Camera), and optional user interfaces may include standard wired and wireless interfaces.
The network interface 1740 may optionally include a bluetooth module, a Near Field Communication (NFC) module, a Wireless Fidelity (Wi-Fi) module, and the like.
Among other things, the processor 1710 may include one or more processing cores. The processor 1710 connects various components throughout the electronic device 1700 using various interfaces and lines, and performs various functions of the electronic device 1700 and processes data by running or executing instructions, programs, code sets, or instruction sets stored in the memory 1750 and invoking data stored in the memory 1750. Optionally, the processor 1710 may be implemented in at least one hardware form of Digital Signal Processing (DSP), Field-Programmable Gate Array (FPGA), and Programmable Logic Array (PLA). The processor 1710 may integrate one or more of a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a modem, and the like. The CPU mainly handles the operating system, the user interface, application programs, and the like; the GPU is used for rendering and drawing the content to be displayed by the display screen; the modem is used to handle wireless communication. It is understood that the modem may also not be integrated into the processor 1710 but implemented by a separate chip.
The Memory 1750 may include a Random Access Memory (RAM) or a Read-Only Memory (ROM). Optionally, the memory 1750 includes non-transitory computer readable media. The memory 1750 may be used to store an instruction, a program, code, a set of codes, or a set of instructions. The memory 1750 may include a program storage area and a data storage area, wherein the program storage area may store instructions for implementing an operating system, instructions for at least one function (such as an acquisition function, an extraction function, an optimization function, etc.), instructions for implementing the various method embodiments described above, and the like; the storage data area may store data and the like referred to in the above respective method embodiments. The memory 1750 may also optionally be at least one memory device located remotely from the processor 1710 as previously described. As shown in fig. 17, memory 1750, which is one type of computer storage medium, may include an operating system, a network communication module, a user interface module, and program instructions.
In some possible embodiments, the electronic device 1700 may be the aforementioned facial expression capture apparatus, and in the electronic device 1700 shown in fig. 17, the user interface 1730 is mainly used for providing an interface for a user to input, such as a key on the facial expression capture apparatus and the like, and acquiring an instruction triggered by the user; and the processor 1710 may be configured to call the program instructions stored in the memory 1750 and specifically perform the following operations:
acquiring target face video data; the target face video data includes a plurality of consecutive frames of target face images.
Extracting a first target parameter sequence corresponding to the target face video data; the first target parameter sequence comprises first target parameters corresponding to the target face images of the multiple frames respectively; the first target parameters comprise first target expression parameters and first target rotation and translation parameters.
Optimizing the first target parameter sequence by using a target time sequence neural network model to obtain a second target parameter sequence; the second target parameter sequence comprises a second target expression parameter sequence and a second target rotation and translation parameter sequence; the target time sequence neural network model is obtained by training based on face video data of a plurality of known face feature point sequences.
In some possible embodiments, when the processor 1710 extracts the first target parameter sequence corresponding to the target face video data, it is specifically configured to:
extracting a first target parameter sequence corresponding to the target face video data by using a target parameter extractor; the target parameter extractor is obtained by training based on a plurality of facial images of known facial feature points.
In some possible embodiments, the target parameter extractor is trained based on facial images of a plurality of known facial feature points and facial video data of the plurality of known facial feature point sequences.
In some possible embodiments, the target parameter extractor includes a first target convolutional network and a second target convolutional network; the first target convolutional network is used for extracting a first target expression parameter sequence corresponding to the target face video data; the second target convolutional network is used for extracting a first target rotation translation parameter sequence corresponding to the target face video data.
In some possible embodiments, the target time-series neural network model includes a first target time-series neural network and a second target time-series neural network; the first target time sequence neural network is used for optimizing a first target expression parameter sequence corresponding to the target face video data; the second target time sequence neural network is used for optimizing a first target rotation translation parameter sequence corresponding to the target face video data.
In some possible embodiments, the processor 1710 optimizes the first target parameter sequence by using a target time-series neural network model, and after obtaining a second target parameter sequence, the processor is further configured to:
migrating the second target parameter sequence to a target three-dimensional avatar.
In some possible embodiments, the second target expression parameter sequence includes second target expression parameters corresponding respectively to the multiple frames of target facial images, and is used to characterize changes in facial expression in the target facial video data; the second target expression parameters represent the multi-dimensional target expression base coefficients of the whole facial expression in the target facial image.
In some possible embodiments, the dimension of the expression base of the whole facial expression constituting the target three-dimensional avatar is equal to the dimension of the target expression base coefficient.
In some possible embodiments, the second target rotation-translation parameter sequence includes second target rotation-translation parameters corresponding respectively to the multiple frames of target face images, and is used to characterize changes in head pose in the target face video data.
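As a hedged illustration of how one frame of the second target parameters could drive such an avatar (the blendshape layout and the variable names are assumptions; the only property taken from the text above is that the coefficient dimension equals the avatar's expression base dimension):

```python
import numpy as np

def drive_avatar(neutral_verts, blendshapes, expr_coeffs, rotation, translation):
    """Sketch: apply one frame of the second target parameters to an avatar.

    neutral_verts: (V, 3) neutral avatar mesh.
    blendshapes:   (K, V, 3) expression base deltas; K equals the dimension
                   of the target expression base coefficients.
    expr_coeffs:   (K,) second target expression parameters for this frame.
    rotation:      (3, 3) head rotation; translation: (3,) head translation.
    """
    verts = neutral_verts + np.tensordot(expr_coeffs, blendshapes, axes=1)
    return verts @ rotation.T + translation
```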
In some possible embodiments, when the processor 1710 obtains the target face video data, it is specifically configured to:
acquiring video data by using an image acquisition device; The video data includes a plurality of consecutive frames of images containing faces.
Performing face detection on the consecutive multiple frames of images containing faces to obtain the target face video data; The target face image is an image containing only a face or an image with a known face position.
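A minimal preprocessing sketch of these two steps, assuming OpenCV and its bundled Haar face detector (any face detector would serve; the single-dominant-face assumption is ours, not the disclosure's):

```python
import cv2

def crop_faces(video_path):
    """Sketch: turn raw video into target face video data (face-only crops)."""
    detector = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    cap = cv2.VideoCapture(video_path)
    crops = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        faces = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
        if len(faces) > 0:
            x, y, w, h = faces[0]  # assume a single, dominant face per frame
            crops.append(frame[y:y + h, x:x + w])
    cap.release()
    return crops
```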
In some possible embodiments, the electronic device 1700 may be the aforementioned sequential neural network model training apparatus, and the aforementioned processor 1710 specifically further performs:
acquiring face video data of a plurality of known face feature point sequences; the face video data includes a plurality of consecutive frames of first face images; the facial feature point sequence includes first facial feature points corresponding to the plurality of frames of first facial images.
Extracting a first parameter sequence corresponding to the face video data; The first parameter sequence comprises parameters corresponding to the plurality of frames of first face images respectively; The parameters comprise a first identity parameter, a first expression parameter and a first rotation and translation parameter.
Inputting the first expression parameter sequence and the first rotation and translation parameter sequence corresponding to the facial video data into a time sequence neural network model, and outputting the optimized second expression parameter sequence and the optimized second rotation and translation parameter sequence.
Generating a first three-dimensional grid corresponding to each of the plurality of frames of first face images based on the first identity parameter sequence, the second expression parameter sequence and the second rotation and translation parameter sequence corresponding to the face video data.
Determining a first loss of the time-series neural network model based on the first three-dimensional grids corresponding to the plurality of frames of first facial images and the first facial feature points corresponding to the plurality of frames of first facial images.
Training the time sequence neural network model based on the first loss to obtain a trained target time sequence neural network model; the target time series neural network model described above is used to optimize the first target parameter sequence described in the embodiments of the present specification.
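The training steps above can be summarized as the following loop; this is a hypothetical outline in which the mesh-building and loss helpers are supplied by the caller (a possible projection-based landmark loss is sketched after the projection step described below):

```python
import torch

def train_temporal_model(temporal_model, extractor, build_meshes, landmark_loss,
                         dataloader, epochs=10, lr=1e-4):
    """Hypothetical outline of the training procedure above.

    dataloader is assumed to yield (frames, landmarks): one face video clip
    together with its known facial feature point sequence.
    build_meshes and landmark_loss are assumed to be differentiable helpers.
    """
    optimizer = torch.optim.Adam(temporal_model.parameters(), lr=lr)
    for _ in range(epochs):
        for frames, landmarks in dataloader:
            # First parameter sequence: identity, expression, rotation-translation.
            id_seq, expr_seq, pose_seq = extractor(frames)
            # Optimized second expression and rotation-translation sequences.
            expr_opt, pose_opt = temporal_model(expr_seq, pose_seq)
            # First three-dimensional grids for every frame.
            meshes = build_meshes(id_seq, expr_opt, pose_opt)
            # First loss against the known facial feature points.
            loss = landmark_loss(meshes, landmarks)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return temporal_model
```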
In some possible embodiments, when the processor 1710 determines the first loss of the time-series neural network model based on the first three-dimensional mesh corresponding to each of the plurality of frames of the first facial images and the first facial feature points corresponding to each of the plurality of frames of the first facial images, the processor is specifically configured to:
acquiring three-dimensional facial feature points of the first three-dimensional grids corresponding to the plurality of frames of first facial images.
Projecting the three-dimensional facial feature points to obtain corresponding two-dimensional facial feature points.
Determining a first loss of the time-series neural network model based on the first facial feature points corresponding to the plurality of frames of first facial images and the two-dimensional facial feature points corresponding to the first three-dimensional grids of those images.
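One plausible way to realize this projection-based loss, assuming a simple pinhole camera with hypothetical intrinsics (the disclosure does not specify the camera model or the distance metric):

```python
import torch

def projected_landmark_loss(feature_points_3d, known_feature_points_2d, cam):
    """Sketch of the first loss: project 3D feature points and compare in 2D.

    feature_points_3d:       (T, N, 3) feature points taken from the meshes.
    known_feature_points_2d: (T, N, 2) first facial feature points.
    cam: dict with assumed pinhole intrinsics {"fx", "fy", "cx", "cy"}.
    """
    x, y, z = feature_points_3d.unbind(-1)
    u = cam["fx"] * x / z + cam["cx"]   # simple pinhole projection
    v = cam["fy"] * y / z + cam["cy"]
    projected = torch.stack([u, v], dim=-1)
    return torch.mean((projected - known_feature_points_2d) ** 2)
```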
In some possible embodiments, when the processor 1710 extracts the first parameter sequence corresponding to the face video data, it is specifically configured to:
extracting the first parameter sequence corresponding to the face video data by using a target parameter extractor.
In some possible embodiments, before the processor 1710 extracts the first parameter sequence corresponding to the face video data, it is further configured to:
acquiring a plurality of second facial images with known second facial feature points.
Training a parameter extractor based on the plurality of second face images with known second face feature points to obtain the trained target parameter extractor.
In some possible embodiments, when the processor 1710 trains the parameter extractor based on the plurality of second facial images with known second facial feature points to obtain the trained target parameter extractor, the processor is specifically configured to:
inputting a plurality of second face images with known second face feature points into a parameter extractor, and outputting a parameter set corresponding to each second face image.
Combining the basis vectors of the parameterized three-dimensional face model based on the parameter set to generate a corresponding second three-dimensional grid.
Acquiring two-dimensional face feature points corresponding to the second three-dimensional grid.
Rendering the second three-dimensional grid into a two-dimensional image.
Determining a second loss corresponding to the parameter extractor based on the two-dimensional face feature points corresponding to the second three-dimensional mesh and the two-dimensional image.
Training the parameter extractor based on the second loss to obtain the trained target parameter extractor.
In some possible embodiments, the parameter set includes a second identity parameter, a texture parameter, a second expression parameter, and a second rotation-translation parameter;
the processor 1710 is specifically configured to, when combining the basis vectors of the parameterized three-dimensional face model based on the parameter set to generate the corresponding second three-dimensional mesh:
combining the basis vectors of the parameterized three-dimensional face model based on the second identity parameter, the texture parameter and the second expression parameter to generate a corresponding second three-dimensional grid.
When the processor 1710 renders the second three-dimensional mesh into a two-dimensional image, the processor is specifically configured to:
rendering the second three-dimensional mesh into a two-dimensional image based on the texture parameter and the second rotation-translation parameter.
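A bare-bones sketch of the basis-vector combination step, in the standard linear 3DMM form (the array shapes are assumptions; texturing and the rotation-translation-dependent rendering into a two-dimensional image would sit on top of this, for example via a differentiable renderer):

```python
import numpy as np

def build_second_mesh(mean_shape, id_basis, expr_basis, id_params, expr_params):
    """Sketch: combine basis vectors of a parameterized 3D face model.

    mean_shape: (3V,) mean face vertices, flattened.
    id_basis:   (3V, n_id) identity basis vectors.
    expr_basis: (3V, n_expr) expression basis vectors.
    The identity and expression parameters weight their respective bases.
    """
    verts = mean_shape + id_basis @ id_params + expr_basis @ expr_params
    return verts.reshape(-1, 3)   # (V, 3) second three-dimensional mesh
```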
In some possible embodiments, the parameter extractor includes a first convolutional network, a second convolutional network, a third convolutional network, and a fourth convolutional network; wherein:
the first convolution network is used for extracting a second expression parameter corresponding to the second facial image;
the second convolution network is configured to extract a second rotation-translation parameter corresponding to the second face image;
the third convolutional network is configured to extract a second identity parameter corresponding to the second face image;
and the fourth convolution network is used for extracting the texture parameters corresponding to the second face image.
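One way such a four-branch extractor could be laid out, purely as a sketch (the backbone depth, the shared-architecture choice, and the output dimensions are our assumptions):

```python
import torch
import torch.nn as nn

def small_cnn(out_dim):
    """Hypothetical backbone reused by each of the four branches."""
    return nn.Sequential(
        nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
        nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        nn.Linear(64, out_dim))

class ParameterExtractor(nn.Module):
    """Sketch: one convolutional branch per parameter group (hypothetical sizes)."""

    def __init__(self, n_expr=52, n_id=80, n_tex=80):
        super().__init__()
        self.expr_net = small_cnn(n_expr)  # first convolutional network
        self.pose_net = small_cnn(6)       # second: rotation (3) + translation (3)
        self.id_net = small_cnn(n_id)      # third convolutional network
        self.tex_net = small_cnn(n_tex)    # fourth convolutional network

    def forward(self, images):             # images: (B, 3, H, W)
        return (self.id_net(images), self.tex_net(images),
                self.expr_net(images), self.pose_net(images))
```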
In some possible embodiments, when the processor 1710 acquires the two-dimensional facial feature points corresponding to the second three-dimensional mesh, the processor is specifically configured to:
acquiring the three-dimensional face feature points corresponding to the second three-dimensional grid.
Projecting the three-dimensional facial feature points corresponding to the second three-dimensional grid to obtain the corresponding two-dimensional facial feature points.
In some possible embodiments, when the processor 1710 determines the second loss corresponding to the parameter extractor based on the two-dimensional facial feature points corresponding to the second three-dimensional mesh and the two-dimensional image, the processor is specifically configured to:
determining a feature point loss of the parameter extractor based on the two-dimensional face feature points corresponding to the second three-dimensional mesh and the second face feature points of the second face image.
Determining a pixel loss of the parameter extractor based on the two-dimensional image and the second face image.
The processor 1710, when training the parameter extractor based on the second loss to obtain the trained target parameter extractor, is specifically configured to:
training the parameter extractor based on the feature point loss and the pixel loss to obtain the trained target parameter extractor.
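A compact sketch of combining the two losses; the relative weights and the L2/L1 choices below are assumptions rather than values stated in this disclosure:

```python
import torch

def extractor_loss(pred_landmarks_2d, gt_landmarks_2d, rendered_image, face_image,
                   w_lmk=1.0, w_pix=1.0):
    """Sketch of the second loss: feature point loss plus pixel loss."""
    feature_point_loss = torch.mean((pred_landmarks_2d - gt_landmarks_2d) ** 2)
    pixel_loss = torch.mean(torch.abs(rendered_image - face_image))
    return w_lmk * feature_point_loss + w_pix * pixel_loss
```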
In some possible embodiments, when the processor 1710 acquires a plurality of second facial images of known second facial feature points, the processor is specifically configured to:
acquiring a plurality of second face images.
Determining second face feature points corresponding to the plurality of second face images by using a face detection algorithm and a feature point detection algorithm.
In some possible embodiments, the second face image has a plurality of second face feature points; the second facial image is a two-dimensional image acquired by an image acquisition device.
In some possible embodiments, when the processor 1710 extracts the parameter sequence corresponding to the face video data by using the target parameter extractor, the processor is specifically configured to:
extracting a first expression parameter sequence corresponding to the facial video data by using a first convolution network of the target parameter extractor.
Extracting a first rotation and translation parameter sequence corresponding to the face video data by using a second convolution network of the target parameter extractor.
Extracting a first identity parameter sequence corresponding to the face video data by using a third convolution network of the target parameter extractor.
In some possible embodiments, the processor 1710 is specifically configured to, when training the time-series neural network model based on the first loss to obtain a trained target time-series neural network model, perform:
training the time sequence neural network model, the first convolution network, the second convolution network and the third convolution network based on the first loss to obtain a trained target time sequence neural network model, a first target convolutional network, a second target convolutional network and a third target convolutional network; the first target convolutional network is used for extracting the first target expression parameter sequence corresponding to the target face video data described in the embodiments of the present specification; the second target convolutional network is used for extracting the first target rotation and translation parameter sequence corresponding to the target face video data described in the embodiments of the present specification.
The present specification also provides a computer-readable storage medium having instructions stored therein which, when run on a computer or processor, cause the computer or processor to perform one or more of the steps of the above embodiments. If the constituent modules of the facial expression capturing apparatus or the time-series neural network model training apparatus are implemented in the form of software functional units and sold or used as independent products, they may be stored in the computer-readable storage medium.
In the above embodiments, the implementation may be realized wholly or partially by software, hardware, firmware, or any combination thereof. When implemented in software, it may be realized wholly or partially in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the processes or functions described in accordance with the embodiments of this specification are performed in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored on or transmitted over a computer-readable storage medium. The computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wireless (e.g., infrared, radio, microwave) means. The computer-readable storage medium can be any available medium that can be accessed by a computer, or a data storage device such as a server or data center that integrates one or more available media. The available medium may be a magnetic medium (e.g., a floppy disk, a hard disk, a magnetic tape), an optical medium (e.g., a Digital Versatile Disc (DVD)), a semiconductor medium (e.g., a Solid State Disk (SSD)), or the like.
It will be understood by those skilled in the art that all or part of the processes of the methods in the above embodiments may be implemented by a computer program instructing the relevant hardware; the program may be stored in a computer-readable storage medium and, when executed, may include the processes of the above method embodiments. The aforementioned storage medium includes various media that can store program code, such as a ROM, a RAM, a magnetic disk, or an optical disk. The technical features in the above examples and embodiments may be combined arbitrarily without conflict.
The above-described embodiments are merely preferred embodiments of the present disclosure, and are not intended to limit the scope of the present disclosure, and various modifications and improvements made to the technical solutions of the present disclosure by those skilled in the art without departing from the design spirit of the present disclosure should fall within the protection scope defined by the claims.
The foregoing description has been directed to specific embodiments of this disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims and in the specification may be performed in an order different than in the embodiments recited in the specification and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.

Claims (28)

1. A method of facial expression capture, the method comprising:
acquiring target face video data; the target face video data includes a plurality of consecutive frames of target face images;
extracting a first target parameter sequence corresponding to the target face video data; the first target parameter sequence comprises first target parameters corresponding to the target face images of the multiple frames respectively; the first target parameters comprise first target expression parameters and first target rotation and translation parameters;
optimizing the first target parameter sequence by using a target time sequence neural network model to obtain a second target parameter sequence; the second target parameter sequence comprises a second target expression parameter sequence and a second target rotation and translation parameter sequence; the target time sequence neural network model is obtained by training based on face video data of a plurality of known face feature point sequences.
2. The method of claim 1, wherein said extracting a first target parameter sequence corresponding to the target face video data comprises:
extracting a first target parameter sequence corresponding to the target face video data by using a target parameter extractor; the target parameter extractor is obtained by training based on a plurality of face images of known face feature points.
3. The method of claim 2, wherein the target parameter extractor is trained based on the plurality of face images with known face feature points and the face video data of the plurality of known face feature point sequences.
4. The method of claim 2, the target parameter extractor comprising a first target convolutional network and a second target convolutional network; the first target convolutional network is used for extracting a first target expression parameter sequence corresponding to the target face video data; the second target convolutional network is used for extracting a first target rotation and translation parameter sequence corresponding to the target face video data.
5. The method of claim 1, the target time-sequential neural network model comprising a first target time-sequential neural network and a second target time-sequential neural network; the first target time sequence neural network is used for optimizing a first target expression parameter sequence corresponding to the target face video data; the second target time sequence neural network is used for optimizing a first target rotation and translation parameter sequence corresponding to the target face video data.
6. The method of any one of claims 1-5, wherein after the optimizing the first target parameter sequence by using a target time-series neural network model to obtain a second target parameter sequence, the method further comprises:
migrating the second target parameter sequence to a target three-dimensional avatar.
7. The method of claim 6, wherein the second target expression parameter sequence comprises second target expression parameters corresponding to respective target facial images of the plurality of frames, and is used for representing changes of facial expressions in the target facial video data; the second target expression parameter is used for representing multi-dimensional target expression base coefficients of the whole facial expression in the target facial image.
8. The method of claim 7, wherein the dimensions of the expression base of the whole facial expression constituting the target three-dimensional avatar are equal to the dimensions of the target expression base coefficients.
9. The method of claim 1, wherein the second target rotation and translation parameter sequence includes second target rotation and translation parameters corresponding respectively to the plurality of frames of target facial images, and is used for characterizing changes in head pose in the target facial video data.
10. The method of claim 1, the obtaining target face video data, comprising:
acquiring video data based on image acquisition equipment; the video data comprises a plurality of consecutive frames of images containing faces;
carrying out face detection on the continuous multi-frame images containing faces to obtain target face video data; the target face image is an image containing only faces or an image of known face locations.
11. A method of time series neural network model training, the method comprising:
acquiring face video data of a plurality of known face feature point sequences; the face video data includes a plurality of consecutive frames of a first face image; the facial feature point sequence comprises first facial feature points corresponding to the first facial images of the plurality of frames respectively;
extracting a first parameter sequence corresponding to the face video data; the first parameter sequence comprises parameters corresponding to the first face images of the plurality of frames respectively; the parameters comprise a first identity parameter, a first expression parameter and a first rotation and translation parameter;
inputting a first expression parameter sequence and a first rotation and translation parameter sequence corresponding to the facial video data into a time sequence neural network model, and outputting an optimized second expression parameter sequence and an optimized second rotation and translation parameter sequence;
generating a first three-dimensional grid corresponding to each of the plurality of frames of first face images based on a first identity parameter sequence, the second expression parameter sequence and the second rotation and translation parameter sequence corresponding to the face video data;
determining a first loss of the time-series neural network model based on the first three-dimensional grids corresponding to the first facial images of the plurality of frames and the first facial feature points corresponding to the first facial images of the plurality of frames;
training the time sequence neural network model based on the first loss to obtain a trained target time sequence neural network model; the target time-series neural network model is used to optimize a first target parameter sequence as claimed in any one of claims 1-10.
12. The method of claim 11, wherein said determining a first loss of the temporal neural network model based on the first three-dimensional mesh corresponding to each of the plurality of frames of the first facial images and the first facial feature points corresponding to each of the plurality of frames of the first facial images, comprises:
acquiring three-dimensional facial feature points of first three-dimensional facial grids corresponding to the multiple frames of first facial images;
projecting the three-dimensional facial feature points to obtain corresponding two-dimensional facial feature points;
and determining a first loss of the time-series neural network model based on the first facial feature points corresponding to the first facial images of the plurality of frames and the two-dimensional facial feature points corresponding to the first three-dimensional grids corresponding to the first facial images of the plurality of frames.
13. The method of claim 11, wherein said extracting a first sequence of parameters corresponding to said facial video data comprises:
and extracting a first parameter sequence corresponding to the face video data by using a target parameter extractor.
14. The method of claim 13, prior to said extracting the first sequence of parameters corresponding to the facial video data, the method further comprising:
acquiring a plurality of second face images of known second face feature points;
and training a parameter extractor based on the plurality of second face images with known second face feature points to obtain the trained target parameter extractor.
15. The method of claim 14, wherein the training a parameter extractor based on the plurality of second face images with known second face feature points to obtain the trained target parameter extractor comprises:
inputting a plurality of second face images with known second face feature points into a parameter extractor, and outputting a parameter set corresponding to each second face image;
combining the basis vectors of the parameterized three-dimensional face model based on the parameter set to generate a corresponding second three-dimensional mesh;
acquiring two-dimensional facial feature points corresponding to the second three-dimensional mesh;
rendering the second three-dimensional mesh into a two-dimensional image;
determining a second loss corresponding to the parameter extractor based on the two-dimensional facial feature points corresponding to the second three-dimensional mesh and the two-dimensional image;
and training the parameter extractor based on the second loss to obtain the trained target parameter extractor.
16. The method of claim 15, the set of parameters comprising a second identity parameter, a texture parameter, a second expression parameter, and a second rototranslation parameter;
said combining the basis vectors of the parameterized three-dimensional face model based on the parameter set to generate a corresponding second three-dimensional mesh comprises:
combining the basis vectors of the parameterized three-dimensional face model based on the second identity parameter, the texture parameter and the second expression parameter to generate a corresponding second three-dimensional grid;
the rendering the second three-dimensional mesh into a two-dimensional image comprises:
rendering the second three-dimensional mesh as a two-dimensional image based on the texture parameter and the second rotational-translation parameter.
17. The method of claim 16, the parameter extractor comprising a first convolutional network, a second convolutional network, a third convolutional network, and a fourth convolutional network; wherein:
the first convolution network is used for extracting a second expression parameter corresponding to the second facial image;
the second convolution network is used for extracting a second rotation and translation parameter corresponding to the second face image;
the third convolutional network is used for extracting a second identity parameter corresponding to the second face image;
and the fourth convolution network is used for extracting the texture parameters corresponding to the second face image.
18. The method of claim 15, wherein said obtaining two-dimensional facial feature points corresponding to said second three-dimensional mesh comprises:
acquiring three-dimensional facial feature points corresponding to the second three-dimensional grid;
and projecting the three-dimensional facial feature points corresponding to the second three-dimensional grid to obtain corresponding two-dimensional facial feature points.
19. The method of claim 15, said determining a second penalty for said parameter extractor based on said two-dimensional facial feature points corresponding to said second three-dimensional mesh and said two-dimensional image, comprising:
determining a feature point loss of the parameter extractor based on the two-dimensional facial feature points corresponding to the second three-dimensional mesh and second facial feature points of the second facial image;
determining a pixel loss of the parameter extractor based on the two-dimensional image and the second face image;
training the parameter extractor based on the second loss to obtain the trained target parameter extractor, including:
and training the parameter extractor based on the characteristic point loss and the pixel loss to obtain the trained target parameter extractor.
20. The method of claim 14, the obtaining a plurality of second face images of known second face feature points, comprising:
acquiring a plurality of second face images;
and determining second face feature points corresponding to the plurality of second face images by using a face detection algorithm and a feature point detection algorithm.
21. The method of any one of claims 14-20, wherein the second facial image has a plurality of second facial feature points; the second facial image is a two-dimensional image acquired by an image acquisition device.
22. The method of any one of claims 13-20, wherein said extracting, with a target parameter extractor, a first sequence of parameters corresponding to the facial video data comprises:
extracting a first expression parameter sequence corresponding to the facial video data by using a first convolution network of the target parameter extractor;
extracting a first rotation-translation parameter sequence corresponding to the face video data by utilizing a second convolution network of the target parameter extractor;
and extracting a first identity parameter sequence corresponding to the face video data by utilizing a third convolution network of the target parameter extractor.
23. The method of claim 22, said training said temporal neural network model based on said first loss resulting in a trained target temporal neural network model, comprising:
training the time sequence neural network model, the first convolutional network, the second convolutional network and the third convolutional network based on the first loss to obtain a trained target time sequence neural network model, a first target convolutional network, a second target convolutional network and a third target convolutional network;
wherein the first target convolutional network is configured to extract a first target expression parameter sequence corresponding to the target facial video data of any one of claims 1 to 10;
the second target convolutional network is used to extract the first target roto-translation parameter sequence corresponding to the target face video data as claimed in any one of claims 1 to 10.
24. A facial expression capture apparatus, the apparatus comprising:
the first acquisition module is used for acquiring target face video data; the target face video data includes a plurality of consecutive frames of target face images;
the first extraction module is used for extracting a first target parameter sequence corresponding to the target face video data; the first target parameter sequence comprises first target parameters corresponding to the target face images of the multiple frames respectively; the first target parameters comprise first target expression parameters and first target rotation and translation parameters;
the first optimization module is used for optimizing the first target parameter sequence by using a target time sequence neural network model to obtain a second target parameter sequence; the second target parameter sequence comprises a second target expression parameter sequence and a second target rotation and translation parameter sequence; the target time sequence neural network model is obtained by training based on face video data of a plurality of known face feature point sequences.
25. An apparatus for sequential neural network model training, the apparatus comprising:
a second obtaining module, configured to obtain face video data of a plurality of known sequences of facial feature points; the face video data includes a plurality of consecutive frames of a first face image; the facial feature point sequence comprises first facial feature points corresponding to the first facial images of the plurality of frames respectively;
the second extraction module is used for extracting a first parameter sequence corresponding to the face video data; the first parameter sequence comprises parameters corresponding to the first face images of the plurality of frames respectively; the parameters comprise a first identity parameter, a first expression parameter and a first rotation and translation parameter;
the second optimization module is used for inputting the first expression parameter sequence and the first rotation and translation parameter sequence corresponding to the facial video data into a time sequence neural network model and outputting the optimized second expression parameter sequence and the optimized second rotation and translation parameter sequence;
a first generating module, configured to generate a first three-dimensional mesh corresponding to each of the plurality of frames of first face images based on a first identity parameter sequence, the second expression parameter sequence, and the second rotation-translation parameter sequence corresponding to the face video data;
a first determining module, configured to determine a first loss of the time-series neural network model based on the first three-dimensional meshes corresponding to the first facial images of the plurality of frames and the first facial feature points corresponding to the first facial images of the plurality of frames;
the first training module is used for training the time sequence neural network model based on the first loss to obtain a trained target time sequence neural network model; the target time-series neural network model is used to optimize a first target parameter sequence as claimed in any one of claims 1-10.
26. An electronic device, comprising: a processor and a memory;
the processor is connected with the memory;
the memory for storing executable program code;
the processor runs a program corresponding to the executable program code by reading the executable program code stored in the memory for performing the method of any one of claims 1-23.
27. A computer storage medium storing a plurality of instructions adapted to be loaded by a processor and to perform the method steps according to any of claims 1-23.
28. A computer program product comprising instructions which, when run on a computer or processor, cause the computer or processor to perform the method of any one of claims 1-23.
CN202310088843.0A 2023-01-16 2023-01-16 Facial expression capturing and model training method, device, equipment, medium and product Active CN115984943B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310088843.0A CN115984943B (en) 2023-01-16 2023-01-16 Facial expression capturing and model training method, device, equipment, medium and product

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310088843.0A CN115984943B (en) 2023-01-16 2023-01-16 Facial expression capturing and model training method, device, equipment, medium and product

Publications (2)

Publication Number Publication Date
CN115984943A true CN115984943A (en) 2023-04-18
CN115984943B CN115984943B (en) 2024-05-14

Family

ID=85959885

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310088843.0A Active CN115984943B (en) 2023-01-16 2023-01-16 Facial expression capturing and model training method, device, equipment, medium and product

Country Status (1)

Country Link
CN (1) CN115984943B (en)

Patent Citations (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2015042867A1 (en) * 2013-09-27 2015-04-02 中国科学院自动化研究所 Method for editing facial expression based on single camera and motion capture data
US20190122411A1 (en) * 2016-06-23 2019-04-25 LoomAi, Inc. Systems and Methods for Generating Computer Ready Animation Models of a Human Head from Captured Data Images
JP2018195309A (en) * 2017-05-17 2018-12-06 富士通株式会社 Training method and training device for image processing device for face recognition
US20210104086A1 (en) * 2018-06-14 2021-04-08 Intel Corporation 3d facial capture and modification using image and temporal tracking neural networks
CN111971713A (en) * 2018-06-14 2020-11-20 英特尔公司 3D face capture and modification using image and time tracking neural networks
US20200286284A1 (en) * 2019-03-07 2020-09-10 Lucasfilm Entertainment Company Ltd. On-set facial performance capture and transfer to a three-dimensional computer-generated model
US20210304001A1 (en) * 2020-03-30 2021-09-30 Google Llc Multi-head neural network model to simultaneously predict multiple physiological signals from facial RGB video
CN112215927A (en) * 2020-09-18 2021-01-12 腾讯科技(深圳)有限公司 Method, device, equipment and medium for synthesizing face video
US11222466B1 (en) * 2020-09-30 2022-01-11 Disney Enterprises, Inc. Three-dimensional geometry-based models for changing facial identities in video frames and images
CN112580617A (en) * 2021-03-01 2021-03-30 中国科学院自动化研究所 Expression recognition method and device in natural scene
CN113269872A (en) * 2021-06-01 2021-08-17 广东工业大学 Synthetic video generation method based on three-dimensional face reconstruction and video key frame optimization
CN114782864A (en) * 2022-04-08 2022-07-22 马上消费金融股份有限公司 Information processing method and device, computer equipment and storage medium
CN114821404A (en) * 2022-04-08 2022-07-29 马上消费金融股份有限公司 Information processing method and device, computer equipment and storage medium
CN115346262A (en) * 2022-08-23 2022-11-15 北京字跳网络技术有限公司 Method, device and equipment for determining expression driving parameters and storage medium
CN115512417A (en) * 2022-09-28 2022-12-23 出门问问创新科技有限公司 Face parameter determination method and device, electronic equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
WU Xiaojun et al.: "A Markerless Facial Expression Capture and Reproduction Algorithm", Acta Electronica Sinica, vol. 44, no. 9, 30 September 2016 (2016-09-30), pages 2141-2147 *

Also Published As

Publication number Publication date
CN115984943B (en) 2024-05-14

Similar Documents

Publication Publication Date Title
CN111028330B (en) Three-dimensional expression base generation method, device, equipment and storage medium
KR20210119438A (en) Systems and methods for face reproduction
CN111402399B (en) Face driving and live broadcasting method and device, electronic equipment and storage medium
JP7387202B2 (en) 3D face model generation method, apparatus, computer device and computer program
CN111985281B (en) Image generation model generation method and device and image generation method and device
WO2021098338A1 (en) Model training method, media information synthesizing method, and related apparatus
US11354774B2 (en) Facial model mapping with a neural network trained on varying levels of detail of facial scans
CN112037320A (en) Image processing method, device, equipment and computer readable storage medium
CN109035415B (en) Virtual model processing method, device, equipment and computer readable storage medium
CN113723317A (en) Reconstruction method and device of 3D face, electronic equipment and storage medium
CN113095206A (en) Virtual anchor generation method and device and terminal equipment
CN112308977A (en) Video processing method, video processing apparatus, and storage medium
CN116569218A (en) Image processing method and image processing apparatus
CN116342782A (en) Method and apparatus for generating avatar rendering model
CN115331265A (en) Training method of posture detection model and driving method and device of digital person
CN114998935A (en) Image processing method, image processing device, computer equipment and storage medium
CN115100707A (en) Model training method, video information generation method, device and storage medium
JP2019016164A (en) Learning data generation device, estimation device, estimation method, and computer program
CN110267079B (en) Method and device for replacing human face in video to be played
CN113808277A (en) Image processing method and related device
CN115984943B (en) Facial expression capturing and model training method, device, equipment, medium and product
US20220157016A1 (en) System and method for automatically reconstructing 3d model of an object using machine learning model
CN111914106B (en) Texture and normal library construction method, texture and normal map generation method and device
CN114299105A (en) Image processing method, image processing device, computer equipment and storage medium
CN111738087A (en) Method and device for generating face model of game role

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant