CN109584353B - Method for reconstructing three-dimensional facial expression model based on monocular video - Google Patents


Info

Publication number
CN109584353B
CN109584353B (application CN201811230151.0A)
Authority
CN
China
Prior art keywords
face
dimensional
frame
optical flow
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811230151.0A
Other languages
Chinese (zh)
Other versions
CN109584353A (en)
Inventor
王珊
沈旭昆
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beihang University
Original Assignee
Beihang University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beihang University
Priority to CN201811230151.0A
Publication of CN109584353A
Application granted
Publication of CN109584353B
Legal status: Active



Classifications

    • G06T 17/00: Three-dimensional [3D] modelling, e.g. data description of 3D objects
    • G06T 7/215: Image analysis; analysis of motion; motion-based segmentation
    • G06V 40/168: Human faces, e.g. facial parts, sketches or expressions; feature extraction; face representation
    • G06V 40/174: Facial expression recognition
    • G06T 2200/08: Indexing scheme involving all processing steps from image acquisition to 3D model generation
    • G06T 2207/10016: Image acquisition modality: video; image sequence
    • G06T 2207/30201: Subject of image: human being; face

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computer Graphics (AREA)
  • Geometry (AREA)
  • Software Systems (AREA)
  • Processing Or Creating Images (AREA)
  • Image Analysis (AREA)

Abstract

A method for reconstructing a three-dimensional facial expression model from monocular video requires no additional multi-angle shooting: a generic 3D face model is driven to deform directly from a neutral-expression image frame of the monocular video to generate a personalized three-dimensional face template; the deformation of the three-dimensional facial expression corresponding to different frames is expressed as the change of the personalized template's 3D vertex flow in three-dimensional space; and a coarse-scale geometric model of the facial expression is solved through consistency with the 2D optical-flow changes. Dense optical flow improves the shape accuracy of the coarse-scale reconstruction and relaxes the shooting requirements on the input video; details are then added to the recovered coarse-scale face model by shape-from-shading to recover fine-scale face geometry, reconstructing a high-fidelity three-dimensional face geometric model.

Description

Method for reconstructing three-dimensional facial expression model based on monocular video
Technical Field
The invention relates to a method for reconstructing a three-dimensional facial expression model based on monocular video, and belongs to the technical field of computer virtual reality.
Background
Vividly reconstructed three-dimensional facial expression models have wide application in computer games, film production, social networking, medicine, and other fields, yet traditional three-dimensional face model acquisition and reconstruction mostly depends on heavy, expensive hardware and the controlled illumination of a laboratory. As virtual reality technology and mobile smart terminals rapidly enter everyday life, people increasingly hope to obtain high-quality three-dimensional facial expression models with low-cost equipment in ordinary living environments and apply them in virtual environments. Reconstructing a three-dimensional facial expression model from video shot with a mobile phone or camera, or directly from internet video, reduces the complexity of the acquisition equipment to a minimum and brings new opportunities for consumer-level three-dimensional face digitization.
Within visual range, a face can be divided into different hierarchical representations by geometric scale: coarse scale (e.g., nose, cheeks, lips, eyelids), fine scale (e.g., wrinkles), and micro scale (e.g., pores, moles, and freckles). Current monocular-video-based three-dimensional facial expression reconstruction algorithms mainly comprise two steps: coarse-scale three-dimensional face geometry reconstruction, and reconstruction of detail-scale geometry such as wrinkles. Since the camera pose is usually relatively fixed and the illumination changes little while a video is shot, coarse-scale three-dimensional face geometry is mostly reconstructed by driving a prior face model with facial 2D feature points. Detail-scale geometry such as wrinkles is mostly recovered with Shape From Shading (SFS), using the recovered coarse-scale model as the reference for detail optimization; some algorithms instead regress wrinkle geometry details in real time from a pre-trained finite wrinkle-detail dataset.
Document 1 (Garrido, P., et al., Reconstructing detailed dynamic face geometry from monocular video. ACM Trans. Graph., 2013. 32(6): p. 1-10) registers a generic Blendshape model to a scanned neutral-expression target 3D face to obtain a personalized Blendshape of the target face, tracks 2D image feature points over the whole video sequence by sparse feature tracking and optical flow estimation, aligns the 3D Blendshape model to the 2D sparse feature points of each frame to obtain coarse-scale expression and pose estimates, then estimates the unknown illumination with an SFS algorithm and recovers fine-scale facial details. The method requires manually aligning 29 3D feature points between the generic Blendshape model and the scanned face model, obtaining the personalized Blendshape model through a deformation algorithm; its biggest limitation is that every subject must first undergo a 3D scan in neutral expression. Document 2 (Shi, F., et al., Automatic acquisition of high-fidelity facial performances using monocular videos. ACM Trans. Graph., 2014. 33(6): p. 1-13) proposes a fully automatic high-fidelity three-dimensional facial expression reconstruction method: it builds an optimization framework with keyframe spatio-temporal constraints that fully exploits temporal continuity, computes the 3D head pose and coarse-scale expression deformation frame by frame from the acquired 2D feature points and a multilinear face model (FaceWarehouse), and recovers fine-scale facial details by approximating the ambient light with spherical harmonics. The assumption that illumination and reflectance are consistent over the whole sequence effectively reduces the ambiguity of the illumination and reflectance estimation, and initializing the fine-scale geometric reconstruction with the coarse-scale deformation result effectively improves accuracy and robustness. Document 3 (Suwajanakorn, S., I. Kemelmacher-Shlizerman, and S.M. Seitz, Total Moving Face Reconstruction. In European Conference on Computer Vision, 2014) proposes a three-dimensional facial expression reconstruction method that jointly optimizes coarse-scale and fine-scale geometry based on 3D flow. For a celebrity video, it first collects from the internet a large photo set of the same face under different illumination environments and generates an average face model using the authors' earlier work, Document 4 (Kemelmacher-Shlizerman, I. and S.M. Seitz, Face reconstruction in the wild. In 2011 International Conference on Computer Vision, 2011); it then establishes correspondences between the vertices of the average face model and the pixels of the input video frames by constructing a scene flow, superimposes an SFS imaging equation to build a unified numerical optimization framework, and alternately iterates between the coarse-scale model and the fine-scale details. In 2015, Document 5 (Cao, C., et al., Real-time high-fidelity facial performance capture. ACM Trans. Graph., 2015. 34(4): p. 1-9)
proposed the first real-time high-fidelity facial expression capture method. It assumes that facial wrinkles appear at different locations and depths but are self-similar, so their visual appearance can be composed from local shapes; it separates coarse-scale geometry from fine-scale geometry such as wrinkles in scanned high-precision three-dimensional face models, divides the wrinkles into small local geometric detail regions to build a training set for geometric-detail regression, and, on top of a prior coarse-scale reconstruction, trains a set of local detail regressors to add geometric details such as wrinkles in real time.
In 2016, Document 6 (Garrido, P., et al., Reconstruction of Personalized 3D Face Rigs from Monocular Video. ACM Trans. Graph., 2016. 35(3): p. 1-15) presented a method for creating high-fidelity facial expressions and their rig parameters from monocular video; it models the facial shape and obtains high-fidelity expressions on three levels, from coarse-scale facial geometry through medium-scale corrective elements to fine-scale wrinkle-level details. A parameterized shape prior encodes identity and expression variables; the coarse-scale model is recovered by jointly tracking and optimizing facial shape, expression, and illumination parameters; linear user-specific corrective elements further improve accuracy; and the shading information of the input image is inverse-rendered to obtain fine-scale wrinkle-level details. Also in 2016, Document 7 (Wu, C., et al., An anatomically-constrained local deformation model for monocular face capture. ACM Trans. Graph., 2016. 35(4): p. 1-12) proposed a three-dimensional facial expression reconstruction algorithm based on the anatomical bone structure of the face to address the reconstruction of complex expressions (such as extreme facial expressions caused by strong wind against the face). The method acquires a high-precision 3D face model from 2D motion data using an anatomically constrained local deformation model comprising many small subspaces distributed over the whole face and an underlying anatomical skeleton structure; it tracks the local patches and the skeleton under the anatomical constraints and combines the patches into a complete face mesh. The anatomical constraints restrict facial deformation to a reasonable and effective expression range, allow extreme expressions to be reconstructed, and help eliminate the ambiguity common in video-based three-dimensional reconstruction.
Document 8 (Ichim, A.E., S. Bouaziz, and M. Pauly, Dynamic 3D avatar creation from hand-held video input. ACM Trans. Graph., 2015. 34(4): p. 1-14) proposes a three-dimensional facial expression reconstruction system that takes face video from a hand-held device and simultaneously creates facial-animation rig parameters for the reconstructed character; it was the first to propose a 3D facial expression rig represented at two scale levels. A mobile phone is moved in a circle around the subject's face while the subject holds a neutral expression; an initial 3D face point cloud is reconstructed with multi-view stereo, and a generic 3D face model is deformed to the point cloud to form a personalized neutral-expression model. The deformations of a set of generic Blendshapes are then transferred to the personalized neutral model to obtain a set of personalized Blendshapes for the subject. Based on the personalized Blendshapes, the coarse-scale three-dimensional facial expression is reconstructed by tracking 2D feature points and optical flow in the video, and geometric details such as wrinkles are reconstructed by estimating a normal map and an ambient-occlusion map. The Blendshape-based expression parameters can be applied directly to large-scale facial animation of virtual characters, and the geometric details are predicted in real time by training a radial-basis-function regressor. Document 9 (Yu, R., et al., Direct, Dense, and Deformable: Template-Based Non-Rigid 3D Reconstruction from RGB Video. In 2015 IEEE International Conference on Computer Vision (ICCV), 2015) proposes a template-based non-rigid three-dimensional reconstruction algorithm with RGB video as input and applies it to three-dimensional face reconstruction. It first searches the video sequence for image sets with different poses but similar expressions and reconstructs a three-dimensional face template with multi-view stereo; it then builds a numerical optimization framework containing a photometric-difference data term, temporal smoothness, spatial smoothness, and local rigidity constraints, and optimizes the deformation of the reconstructed template to each frame, yielding smooth, continuous three-dimensional facial expression. Document 10 (Thies, J., et al., Face2Face: Real-Time Face Capture and Reenactment of RGB Videos. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016) proposes a real-time three-dimensional facial expression reconstruction and expression transfer method. It matches video features with a multilinear face prior to solve for the subject's three-dimensional expression model, casting the problem as a unified numerical-optimization residual function combining pixel-value similarity between the rendered and captured images, consistency between 2D facial feature points and 3D features, and a statistical regularization term.
This function is difficult to solve; the authors provide a data-parallel GPU optimization strategy based on iteratively reweighted least squares (IRLS) to achieve real-time reconstruction.
Current monocular-video-based three-dimensional facial expression reconstruction techniques depend on the two-dimensional and three-dimensional feature-point positions of the input video and on a recovered camera matrix. The most common approach optimizes the similarity between the luminance of the recovered three-dimensional model projected onto the image plane and the luminance of the input image; it cannot obtain good results under large facial expression changes, illumination changes, or slight occlusion. Moreover, because the feature points are too sparse, the shape accuracy of the coarse-scale face reconstruction is hard to guarantee, especially in face regions far from the feature points.
Disclosure of Invention
The technical problem solved by the invention: overcoming the defects of the prior art, a method for reconstructing a three-dimensional facial expression model from monocular video is provided. It uses a coarse-scale facial expression reconstruction method driven by the consistency between a personalized face template and dense optical flow; the dense optical flow improves the shape accuracy of the coarse-scale reconstruction while relaxing the shooting requirements on the input video; and details such as wrinkles are added to the recovered coarse-scale face model by shape-from-shading to recover fine-scale face geometry, reconstructing a high-fidelity three-dimensional facial expression model.
The technical solution of the invention is as follows: a method for reconstructing a three-dimensional facial expression model based on monocular video comprises the following implementation steps:
(1) Under the subspace constraint that the two-dimensional optical-flow field is represented by a linear combination of a finite set of trajectory bases, computing a multi-frame 2D dense optical flow of the face from the input two-dimensional image sequence;
(2) Computing a personalized three-dimensional face template from the input two-dimensional image sequence and a generic neutral-expression face template using a three-dimensional deformation technique;
(3) Constructing a numerical optimization framework based on 2D-3D optical-flow-change consistency from the multi-frame 2D dense optical flow of the face computed in step (1) and the personalized three-dimensional face template computed in step (2), and computing a coarse-scale facial expression model reflecting the facial features and contour of the target face;
(4) Adding geometric detail information such as wrinkles and expression lines to the coarse-scale facial expression model computed in step (3) using a shape-from-shading algorithm, and computing the final three-dimensional facial expression model.
The method for computing the multi-frame 2D dense optical flow of the face in step (1) is:
(21) Constructing an energy function based on subspace constraints, whose energy terms comprise a data term, a smoothness constraint term, and an error term;
(22) Solving the energy function of step (21): it has two unknowns, the linear trajectory bases and the basis coefficients; the linear trajectory bases are pre-estimated by tracking corner points or texture with distinct features, the basis coefficients are solved by singular value decomposition, and the multi-frame 2D dense optical flow of the face is obtained from the solved trajectory bases and coefficients.
The method for calculating the personalized three-dimensional face template in the step (2) comprises the following steps:
(31) Performing feature-point recognition on a neutral-expression image frame of the input monocular video with an active appearance model to obtain 68 face 2D feature points, of which 17 points represent the face contour, 10 points the eyebrow positions, 12 points the eye positions, 9 points the nose position, and 20 points the mouth position (a common index layout is sketched below).
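For reference, the sketch below spells out one common indexing of these 68 points (the iBUG/dlib convention, whose per-region counts match those above). The exact index ranges are an assumption; the patent states only the counts per region.

import numpy as np  # not needed here, but kept for consistency with later sketches

# Hypothetical index layout for the 68 face landmarks (iBUG/dlib convention);
# the patent gives only per-region counts, so these ranges are an assumption.
LANDMARK_REGIONS = {
    "jaw_outline": list(range(0, 17)),   # 17 points: face contour
    "eyebrows":    list(range(17, 27)),  # 10 points: left + right brow
    "nose":        list(range(27, 36)),  # 9 points: bridge + nostrils
    "eyes":        list(range(36, 48)),  # 12 points: 6 per eye
    "mouth":       list(range(48, 68)),  # 20 points: outer + inner lips
}

assert sum(len(v) for v in LANDMARK_REGIONS.values()) == 68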
(32) Using the 2D feature points obtained in step (31) and the generic neutral-expression face template, constructing the following energy function, in which the 2D feature points drive the generic neutral-expression template to deform:
argmin Σ ‖P·D·S_t − W‖ + λ‖ΔS_t − ΔS_g‖

where P is the weak-perspective projection matrix of the camera, the matrix W contains the acquired 2D face feature points, S_t is the personalized face template mesh, the matrix D is a selection matrix of 0s and 1s that picks the corresponding 3D feature points on S_t, S_g is the initial template, and λ is a weight coefficient with value between 0 and 1.
(33) Solving the energy equation constructed in step (32): it has two unknowns, P and S_t, solved by iterative optimization. First fix P as a known parameter and solve the energy equation for a new S_t; then fix S_t as a known parameter and solve for a new P; alternate until the energy equation converges. The current P and S_t are then the result of the iterative optimization, i.e., the computed personalized three-dimensional face template that conforms to the target face characteristics.
The method for calculating the coarse-scale facial expression model in the step (3) comprises the following steps:
(41) Constructing a numerical optimization framework comprising a data term, a spatial smoothness term, a rigid constraint term ARAP, and a temporal regularization term; the energy equation is:
E(f, R, T) = E_flow + β·E_smooth + γ·E_arap + ω·E_temp
where E_flow is the 2D-3D dense optical-flow-change consistency term, i.e., the data term, which enforces the consistent mapping between the 3D vertex flow of the personalized face template and the computed 2D image optical flow, specifically:

E_flow = Σ_i ‖ P(R(v_i + f(v_i)) + T) − P(R·v_i + T) − L(P(R·v_i + T)) ‖_ε

where v_i is the i-th vertex of S_t, f(v_i) is the 3D flow of vertex v_i, P is the projection matrix, L(·) is the dense 2D optical flow, R and T are the rotation and translation matrices of the camera, and ‖·‖_ε is the Huber loss;
E_smooth is the spatial smoothness term and contains two parts: an adjacency smoothing term and a discrete global variation term. The adjacency smoothing term makes the deformed model smooth by directly enforcing consistency of the 3D flows of adjacent vertices; the discrete global variation term is based on the initial face template and ensures that the deformed model still keeps face local features consistent with the initial template, specifically:

E_smooth = Σ_i Σ_{v_j ∈ Λ_i} ‖ f(v_i) − f(v_j) ‖² + σ‖ Δ(S_t + f) − ΔS_t ‖²

where v_j ∈ Λ_i denotes the vertices adjacent to vertex v_i, Δ is the discrete Laplace operator of the mesh, and σ is a weight coefficient with value between 0 and 1.
E_arap is the rigid constraint term ARAP (As-Rigid-As-Possible), commonly used in non-rigid 3D reconstruction and deformation to preserve local rigidity during deformation, specifically:

E_arap = Σ_i Σ_{v_j ∈ Λ_i} ‖ (v_i + f(v_i)) − (v_j + f(v_j)) − A_i(v_i − v_j) ‖²

where A_i is the rotation matrix associated with vertex v_i;
E_temp is the temporal regularization term. Like the 2D optical flow, the 3D flow should be continuous and smooth; this term enforces smooth deformation between frames, prevents inter-frame jumps over the whole video sequence, and improves the robustness of the system, specifically:

E_temp = Σ_i ‖ f_{p+1}(v_i) − f_p(v_i) ‖²

where f_p(v_i) is the flow of v_i at frame p and f_{p+1}(v_i) its flow at frame p+1;
in the energy equation, β, γ, and ω are the weights of the corresponding three energy terms in the energy function, each ranging from 0 to 1;
(42) The unknowns of the energy equation constructed in step (41) are the 3D flows f(v_i) of the vertices v_i and the rotation and translation matrices R and T of the camera. The energy equation is solved with an alternating iterative strategy: keep the 3D flow f(v_i) fixed and solve for the camera rotation and translation R and T; then keep R and T fixed and optimize the 3D flow f(v_i); alternate until convergence, at which point the coarse-scale facial expression model is obtained.
Compared with the prior art, the invention has the advantages that:
(1) By adding the subspace constraint, the computed multi-frame dense 2D optical flow overcomes the large computational cost of traditional dense optical flow methods and their difficulty with large displacements and partial occlusion.
(2) The personalized target-face template is computed with a three-dimensional deformation technique, obtaining a more accurate target face model without additional multi-angle shooting.
(3) The coarse-scale facial expression reconstruction based on the consistency between the 3D vertex flow and the 2D image optical flow gives better recovery under large facial expression changes, illumination changes, and slight occlusion.
Drawings
FIG. 1 is a flow chart of an implementation of the present invention;
FIG. 2 is a schematic diagram of personalized three-dimensional face templates computed from input monocular video and the generic face template; the first image on the left is from the i-bug dataset, the second on the left was downloaded from the internet, and the two images on the right are face images shot with a mobile phone in a daily environment;
FIG. 3 is a schematic diagram of face models reconstructed by the invention from input monocular video, where the input videos are dataset data and face video shot with a mobile phone in a daily environment.
Detailed Description
The present invention will be described in detail below with reference to the accompanying drawings and examples.
As shown in fig. 1, the method comprises the following specific steps:
(1) Multi-frame 2D dense optical flow calculation based on subspace constraint
A two-dimensional optical-flow field can be regarded as the set of motion vectors of every image pixel in the two-dimensional plane. For any pixel j located at (x_1j, y_1j) in the reference image, its 2D motion trajectory vector is defined as:

w_j = [u_1j u_2j … u_Fj | v_1j v_2j … v_Fj]^T

Suppose the image has P pixels and the input video has F frames, where u_ij = x_ij − x_1j, v_ij = y_ij − y_1j, (x_ij, y_ij) are the coordinates of pixel j in frame i, and (x_1j, y_1j) are its coordinates in frame 1, i.e., the reference image, with 1 ≤ i ≤ F and 1 ≤ j ≤ P.
On the basis of traditional optical-flow algorithms, and following the principle that facial expression changes can be linearly combined from a few basic expressions, a tighter subspace constraint is added: the motion vector of every point in the video stream can be obtained by linearly combining a finite number of trajectory bases, expressed as:

w_j = Σ_{r=1}^{R} l_rj·Q_r

where w_j is the motion trajectory vector of the j-th pixel, Q_r is the r-th trajectory basis with 1 ≤ r ≤ R, l_rj is the coefficient of the r-th basis vector for the j-th pixel, Q_r^u denotes the per-frame horizontal component of the r-th motion-vector basis, and Q_r^v its per-frame vertical component.
The above formula is expressed in matrix form as:

W = [Q_u; Q_v]·L

where W is the 2F × P observation matrix, Q_u and Q_v are the motion-vector basis matrices in the horizontal and vertical directions respectively, and L is the coefficient matrix.
Based on this theory, the following energy function is constructed to compute the 2D optical-flow field of the face in the two-dimensional image sequence:

E = E_data + α·E_reg + β·E_error

where the data term E_data quantifies, under the brightness-constancy assumption, the luminance error between each moved pixel and the reference pixel; E_reg is the smoothness constraint term, which penalizes the gradient magnitude of the flow vectors so that the flow of a pixel stays as consistent as possible with the flows in its neighborhood; and the error term E_error adds the difference between the two-dimensional observation matrix and its linear-basis fit to the energy equation. α and β are the weight coefficients of the regularization term and the error term respectively, used to adjust the proportion of the corresponding energy terms in the energy function, with values between 0 and 1. I_f(x, y) denotes the luminance of pixel (x, y) in the f-th frame, I_1(x, y) the luminance of pixel (x, y) in the reference image frame, Q_f^u the horizontal component of the trajectory bases in frame f, and Q_f^v their vertical component.
Solving this energy function: it has two unknowns, the linear trajectory bases and the basis coefficients. The linear trajectory bases are pre-estimated by tracking corner points and texture with distinct features, the basis coefficients are solved through singular value decomposition, and the multi-frame 2D dense optical flow of the face is obtained from the solved trajectory bases and coefficients.
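A minimal numerical sketch of that two-stage solve follows, under one plausible reading: the truncated SVD of the tracked-corner trajectories supplies the bases, and the dense-flow coefficients follow by least squares. The function names are ours, not the patent's.

import numpy as np

def estimate_trajectory_basis(W_corners, R):
    """Pre-estimate R linear trajectory bases from a 2F x K observation
    matrix of K reliably tracked corner points (truncated SVD)."""
    U, s, _ = np.linalg.svd(W_corners, full_matrices=False)
    return U[:, :R]                         # (2F, R): dominant trajectory bases

def solve_basis_coefficients(Q, W_dense):
    """With the bases fixed, fit the coefficient matrix L so that Q @ L
    reproduces the dense-flow observations."""
    L, *_ = np.linalg.lstsq(Q, W_dense, rcond=None)
    return L                                # (R, P)

# The multi-frame dense flow implied by the subspace model is then Q @ L,
# a 2F x P matrix whose column j is the trajectory vector w_j of pixel j.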
(2) Reconstruction of personalized three-dimensional face template by using 3D deformation technology
Active Appearance Models (AAMs) are used to identify 68 two-dimensional feature points of the face in a neutral-expression image frame of the input monocular video; a neutral-expression face mesh is selected from an existing face database as the initial template; and a three-dimensional deformation technique (3D warping) lets the two-dimensional feature points drive the initial template to deform, yielding a personalized three-dimensional face template that conforms to the target face characteristics. The following energy equation is constructed:
argmin Σ ‖P·D·S_t − W‖ + λ‖ΔS_t − ΔS_g‖

where P is the weak-perspective projection matrix of the camera, W is the matrix of acquired two-dimensional face feature points, S_t is the personalized face template mesh, D is a selection matrix of 0s and 1s that picks the corresponding 3D feature points on S_t, and S_g is the initial template. The term ‖ΔS_t − ΔS_g‖ in the energy equation is a spatial regularization term, which prevents the overall geometry of the three-dimensional model from deforming merely to maximize feature-point agreement; λ is the weight coefficient of the spatial regularization term (λ takes a value between 0 and 1).
Solving this energy equation: it has two unknowns, P and S_t, solved by iterative optimization. First fix P as a known parameter and solve the energy equation for a new S_t; then fix S_t as a known parameter and solve for a new P; alternate until the energy equation converges. The current P and S_t are then the result of the iterative optimization, i.e., the computed personalized three-dimensional face template that conforms to the target face characteristics.
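A minimal sketch of this alternation follows (simplifications and assumptions: P is treated as an unconstrained 2 × 4 affine weak-perspective map, the mesh Laplacian is passed in as a dense matrix, and all names are ours):

import numpy as np

def fit_personalized_template(W2d, D, S_g, Lap, lam=0.5, iters=10):
    """Alternating least-squares sketch of
       argmin_{P, S_t} ||P D S_t - W||^2 + lam * ||Lap (S_t - S_g)||^2.

    W2d : (2, K) detected 2D landmark positions.
    D   : (K, V) 0/1 selection matrix for the K landmark vertices.
    S_g : (V, 3) generic neutral-expression template vertices.
    Lap : (V, V) mesh Laplacian (dense here purely for brevity).
    """
    V = S_g.shape[0]
    S_t = S_g.copy()
    P = None
    for _ in range(iters):
        # fix S_t, solve the weak-perspective camera as a 2x4 affine fit
        X = np.vstack([(D @ S_t).T, np.ones((1, D.shape[0]))])   # (4, K)
        P = W2d @ np.linalg.pinv(X)                              # (2, 4)
        P3, t = P[:, :3], P[:, 3]
        # fix P, solve S_t: one linear system over all vertex coordinates
        A = np.vstack([np.kron(D, P3),                           # data term
                       np.sqrt(lam) * np.kron(Lap, np.eye(3))])  # regularizer
        b = np.concatenate([(W2d - t[:, None]).flatten(order="F"),
                            np.sqrt(lam) * (Lap @ S_g).reshape(-1)])
        s, *_ = np.linalg.lstsq(A, b, rcond=None)
        S_t = s.reshape(V, 3)
    return P, S_t

In practice D and the Laplacian are sparse, so the stacked system would be solved with a sparse factorization rather than dense least squares.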
Fig. 2 shows personalized three-dimensional face templates computed from the input images and the face template, where the two images of (a) were downloaded from the internet and the two images of (b) are face images shot with a mobile phone in a daily environment.
FIG. 2 shows that whether the input source is a face image downloaded from the internet or a face image shot with a mobile phone in a daily environment, the method computes a personalized three-dimensional face template that conforms to the target face characteristics.
(3) Constructing a numerical optimization frame by using the personalized three-dimensional face template generated in the step (2) and the 2D dense optical flow generated in the step (1), and calculating a coarse-scale expression model
The deformation of the three-dimensional facial expression corresponding to different frames is expressed as the motion-trajectory estimation of the 3D vertex flow of the face template, so the change of the 3D vertex flow is consistent with the change of the 2D facial pixel optical flow. The consistent mapping between the 3D vertex flow and the 2D image optical flow is established through the intrinsic and extrinsic camera parameters; optical-flow-change consistency is adopted as the data term, and constraint or smoothness terms related to the deformation, such as local deformation characteristics and spatio-temporal deformation characteristics, are added, building a numerical optimization framework that solves for the coarse-scale facial expression. The following energy equation is constructed:
E(f, R, T) = E_flow + β·E_smooth + γ·E_arap + ω·E_temp
where E_flow is the 2D-3D dense optical-flow-change consistency term, i.e., the data term, which enforces the consistent mapping between the 3D vertex flow of the personalized face template and the computed 2D image optical flow, specifically:

E_flow = Σ_i ‖ P(R(v_i + f(v_i)) + T) − P(R·v_i + T) − L(P(R·v_i + T)) ‖_ε

where v_i is the i-th vertex of S_t, f(v_i) is the 3D flow of vertex v_i, P is the projection matrix, L(·) is the dense 2D optical flow, R and T are the rotation and translation matrices of the camera, and ‖·‖_ε is the Huber loss.
E_smooth is the spatial smoothness term and contains two parts: an adjacency smoothing term and a discrete global variation term. The adjacency smoothing term makes the deformed model smooth by directly enforcing consistency of the 3D flows of adjacent vertices; the discrete global variation term is based on the initial face template and ensures that the deformed model still keeps face local features consistent with it, specifically:

E_smooth = Σ_i Σ_{v_j ∈ Λ_i} ‖ f(v_i) − f(v_j) ‖² + σ‖ Δ(S_t + f) − ΔS_t ‖²

where v_j ∈ Λ_i denotes the vertices adjacent to vertex v_i, Δ is the discrete Laplace operator of the mesh, and σ is a weight coefficient with value between 0 and 1.
E_arap is the ARAP term (As-Rigid-As-Possible), used in non-rigid 3D reconstruction and deformation to preserve local rigidity during deformation, specifically:

E_arap = Σ_i Σ_{v_j ∈ Λ_i} ‖ (v_i + f(v_i)) − (v_j + f(v_j)) − A_i(v_i − v_j) ‖²

where A_i is the rotation matrix associated with vertex v_i.
E_temp is the temporal regularization term. Like the 2D optical flow, the 3D flow should be continuous and smooth; this term enforces smooth deformation between frames, prevents inter-frame jumps over the whole video sequence, and improves the robustness of the system, specifically:

E_temp = Σ_i ‖ f_{p+1}(v_i) − f_p(v_i) ‖²

where f_p(v_i) is the flow of v_i at frame p and f_{p+1}(v_i) its flow at frame p+1.
β, γ, and ω in the energy equation are the weight coefficients of the corresponding three energy terms in the energy function, each ranging from 0 to 1.
The unknowns in the energy equation are the 3D flows f(v_i) of the vertices v_i and the rotation and translation matrices R and T of the camera. The equation is solved with an alternating iterative strategy: keep the 3D flow f(v_i) fixed and solve for the camera rotation and translation R and T; then keep R and T fixed and optimize the 3D flow f(v_i); alternate until convergence to obtain the coarse-scale facial expression model.
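One frame of this alternation can be sketched as follows. Assumptions: the dense 2D flow has already been sampled at each vertex's reference projection, giving per-vertex 2D targets; the pose is parameterized as an axis-angle vector; and only the data and temporal terms are shown, the smoothness and ARAP terms being analogous residual blocks. All names are ours.

import numpy as np
from scipy.optimize import least_squares
from scipy.spatial.transform import Rotation

def project(P, X):
    """Weak-perspective projection of (V, 3) points with a (2, 3) matrix P."""
    return X @ P.T                                    # (V, 2)

def data_residual(P, rvec, T, verts, flow3d, targets2d):
    """E_flow residuals: projected flowed vertices vs. 2D-flow targets."""
    Rm = Rotation.from_rotvec(rvec).as_matrix()
    moved = (verts + flow3d) @ Rm.T + T
    return (project(P, moved) - targets2d).ravel()

def solve_coarse_frame(P, verts, targets2d, flow_prev, omega=0.1, iters=5):
    rvec, T = np.zeros(3), np.zeros(3)
    flow3d = flow_prev.copy()
    for _ in range(iters):
        # fix the 3D flow, refine the camera pose (R, T) under a Huber loss
        pose = least_squares(
            lambda p: data_residual(P, p[:3], p[3:], verts, flow3d, targets2d),
            np.concatenate([rvec, T]), loss="huber")
        rvec, T = pose.x[:3], pose.x[3:]
        # fix the pose, refine the per-vertex 3D flow; the E_temp block keeps
        # it close to the previous frame's flow
        def res(f):
            f3 = f.reshape(-1, 3)
            r_data = data_residual(P, rvec, T, verts, f3, targets2d)
            r_temp = np.sqrt(omega) * (f3 - flow_prev).ravel()
            return np.concatenate([r_data, r_temp])
        flow3d = least_squares(res, flow3d.ravel(), loss="huber").x.reshape(-1, 3)
    return rvec, T, flow3d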
The invention adopts the optical-flow-change consistency term in place of the brightness-consistency data term used by traditional methods, avoiding the influence of image brightness noise, projection errors, reflections, and illumination.
(4) Adding details by using a light and shade recovery shape technology to generate a fine-scale expression model
Assuming the face is a Lambertian surface, the ambient light can be approximated by spherical harmonics, and the estimation of wrinkle details can be regarded as an optimization between the luminance values of the actually captured image and the rendered image, i.e., expressed as:
E(l, ρ, n) = Σ_{(u,v)} ‖ I_r(u,v) − ρ(u,v)·l^T·Y(n(u,v)) ‖² + λ‖ ΔG(n(u,v) − n_ref(u,v)) ‖²

where the sum runs over the N image pixels, I_r(u,v) is the pixel luminance of the actually captured image, ρ(u,v) the reflectance, n(u,v) the normal vector, Y(n(u,v)) the spherical-harmonic basis, l^T the coefficients of the harmonic basis, and λ‖ΔG(n(u,v) − n_ref(u,v))‖² the regularization term, in which n_ref(u,v) is the reference normal computed from the recovered coarse-scale face model and the Gaussian operator ΔG reduces the 2D-3D projection error. The equation is solved with an alternating iterative strategy, optimizing the illumination coefficients l, the reflectance ρ(u,v), and the normals n(u,v) in turn; the final wrinkle details are computed from the normals.
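To make the imaging model concrete, here is a minimal sketch of one alternating sub-step: fixing the reflectance ρ and normals n and solving the spherical-harmonic lighting coefficients l linearly. The unnormalized 9-term SH basis and the function names are assumptions; the reflectance and normal updates, and the Gaussian regularizer, are analogous solves omitted here.

import numpy as np

def sh_basis(n):
    """Second-order spherical-harmonic basis (9 terms) for unit normals n (N, 3)."""
    x, y, z = n[:, 0], n[:, 1], n[:, 2]
    return np.stack([np.ones_like(x), x, y, z,
                     x * y, x * z, y * z,
                     x**2 - y**2, 3 * z**2 - 1], axis=1)      # (N, 9)

def solve_lighting(I, rho, n):
    """Fix reflectance rho and normals n; solve the 9 lighting coefficients l
    of  I ~= rho * (Y(n) @ l)  by linear least squares."""
    A = rho[:, None] * sh_basis(n)                            # (N, 9)
    l, *_ = np.linalg.lstsq(A, I, rcond=None)
    return l

def render(rho, n, l):
    """Rendered luminance under the Lambertian + SH lighting model."""
    return rho * (sh_basis(n) @ l)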
Fig. 3 shows face models reconstructed from the input videos, where the input videos are dataset data and face video shot with a mobile phone in a daily environment. FIG. 3 shows that whether the input is dataset data or face video shot with a mobile phone in a daily environment, the invention accurately reconstructs the three-dimensional facial expression model.

Claims (3)

1. A method for reconstructing a three-dimensional facial expression model based on monocular video is characterized by comprising the following steps:
(1) Under the subspace constraint that the two-dimensional optical-flow field is represented by a linear combination of a finite set of trajectory bases, computing a multi-frame 2D dense optical flow of the face from the input two-dimensional image sequence;
(2) Computing a personalized three-dimensional face template from the input two-dimensional image sequence and a generic neutral-expression face template using a three-dimensional deformation technique;
(3) Constructing a numerical optimization framework based on 2D-3D optical-flow-change consistency from the multi-frame 2D dense optical flow of the face computed in step (1) and the personalized three-dimensional face template computed in step (2), and computing a coarse-scale facial expression model reflecting the facial features and contour of the target face;
(4) Adding geometric detail information such as wrinkles and expression lines to the coarse-scale facial expression model computed in step (3) using a shape-from-shading algorithm, and computing the final three-dimensional facial expression model;
in the step (1), the method for calculating the face multi-frame 2D dense optical flow comprises the following steps:
(21) Constructing an energy function based on subspace constraints, wherein the energy term comprises a data term, a smooth constraint term and an error term;
(22) Solving an energy function in the step (21), wherein the energy function has two unknown variables of a linear track base and a base coefficient, the linear track base is pre-estimated in a mode of tracking angular points and materials with distinct features, the base coefficient is solved through singular value decomposition, and the multi-frame 2D dense optical flow of the face is obtained through solving the linear track base and the base coefficient;
the method for calculating the personalized three-dimensional face template in the step (2) comprises the following steps: selecting a neutral expression general face grid from an existing face database as an initial template, driving the initial template to deform by using two-dimensional feature points to obtain an individualized three-dimensional face template which accords with the target face features, and specifically realizing the following steps:
(31) Performing feature-point recognition on a neutral-expression image frame of the input monocular video with an active appearance model to obtain 68 face 2D feature points, of which 17 points represent the face contour, 10 points the eyebrow positions, 12 points the eye positions, 9 points the nose position, and 20 points the mouth position;
(32) Using the 2D feature points obtained in step (31) and the generic neutral-expression face template, constructing the following energy function, in which the 2D feature points drive the generic neutral-expression template to deform:
argmin Σ ‖P·D·S_t − W‖ + λ‖ΔS_t − ΔS_g‖

where P is the weak-perspective projection matrix of the camera, the matrix W contains the acquired 2D face feature points, S_t is the personalized face template mesh, the matrix D is a selection matrix of 0s and 1s that picks the corresponding 3D feature points on S_t, S_g is the initial template, and λ is a weight coefficient with value between 0 and 1;
(33) Solving the energy equation constructed in step (32): it has two unknowns, P and S_t, solved by iterative optimization. First fix P as a known parameter and solve the energy equation for a new S_t; then fix S_t as a known parameter and solve for a new P; alternate until the energy equation converges. The current P and S_t are then the result of the iterative optimization, i.e., the computed personalized three-dimensional face template that conforms to the target face characteristics.
2. The method for reconstructing the three-dimensional facial expression model based on monocular video according to claim 1, wherein the energy function constructed under subspace constraints in step (21) is:

E = E_data + α·E_reg + β·E_error

where E_data is the data term, E_reg the smoothness constraint term, and E_error the error term; α and β are the weight coefficients of the smoothness constraint term and the error term respectively, each ranging from 0 to 1. The input two-dimensional image sequence has F frames in total; I_f denotes the luminance of the f-th frame image, 1 ≤ f ≤ F, and I_1 the luminance of the first frame image. There are R trajectory bases; Q_r denotes the r-th trajectory basis, 1 ≤ r ≤ R, Q_f^u the horizontal component of the trajectory bases in frame f, and Q_f^v their vertical component; L denotes the basis-coefficient matrix, l_rj the coefficient of the r-th basis vector for the j-th pixel, and w_j the motion trajectory vector of the j-th pixel.
3. The method for reconstructing the three-dimensional facial expression model based on the monocular video of claim 1, wherein: the method for solving the coarse-scale facial expression model in the step (3) comprises the following steps:
(41) Constructing a numerical optimization framework comprising a data term, a spatial smoothness term, a rigid constraint term ARAP, and a temporal regularization term; the energy equation is:
E(f, R, T) = E_flow + β·E_smooth + γ·E_arap + ω·E_temp
where E_flow is the 2D-3D dense optical-flow-change consistency term, i.e., the data term, which enforces the consistent mapping between the 3D vertex flow of the personalized face template and the computed 2D image optical flow, specifically:

E_flow = Σ_i ‖ P(R(v_i + f(v_i)) + T) − P(R·v_i + T) − L(P(R·v_i + T)) ‖_ε

where v_i is the i-th vertex of S_t, f(v_i) is the 3D flow of vertex v_i, P is the projection matrix, L(·) is the dense 2D optical flow, R and T are the rotation and translation matrices of the camera, and ‖·‖_ε is the Huber loss;
E_smooth is the spatial smoothness term and contains two parts: an adjacency smoothing term and a discrete global variation term. The adjacency smoothing term makes the deformed model smooth by directly enforcing consistency of the 3D flows of adjacent vertices; the discrete global variation term is based on the initial face template and ensures that the deformed model still keeps face local features consistent with it, specifically:

E_smooth = Σ_i Σ_{v_j ∈ Λ_i} ‖ f(v_i) − f(v_j) ‖² + σ‖ Δ(S_t + f) − ΔS_t ‖²

where v_j ∈ Λ_i denotes the vertices adjacent to vertex v_i, Δ is the discrete Laplace operator of the mesh, and σ is a weight coefficient with value between 0 and 1;
E_arap is the rigid constraint term ARAP (As-Rigid-As-Possible), used in non-rigid 3D reconstruction and deformation to preserve local rigidity during deformation, specifically:

E_arap = Σ_i Σ_{v_j ∈ Λ_i} ‖ (v_i + f(v_i)) − (v_j + f(v_j)) − A_i(v_i − v_j) ‖²

where A_i is the rotation matrix associated with vertex v_i;
E_temp is the temporal regularization term. Like the 2D optical flow, the 3D flow should be continuous and smooth; this term enforces smooth deformation between frames, prevents inter-frame jumps over the whole video sequence, and improves the robustness of the system, specifically:

E_temp = Σ_i ‖ f_{p+1}(v_i) − f_p(v_i) ‖²

where f_p(v_i) is the flow of v_i at frame p and f_{p+1}(v_i) its flow at frame p+1;
β, γ, and ω in the energy equation are the weights of the corresponding three energy terms in the energy function, each ranging from 0 to 1;
(42) The unknowns of the energy equation constructed in step (41) are the 3D flows f(v_i) of the vertices v_i and the rotation and translation matrices R and T of the camera; the energy equation is solved with an alternating iterative strategy: keep the 3D flow f(v_i) fixed and solve for the camera rotation and translation R and T, then keep R and T fixed and optimize the 3D flow f(v_i), and alternate until convergence to obtain the coarse-scale facial expression model.
CN201811230151.0A 2018-10-22 2018-10-22 Method for reconstructing three-dimensional facial expression model based on monocular video Active CN109584353B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811230151.0A CN109584353B (en) 2018-10-22 2018-10-22 Method for reconstructing three-dimensional facial expression model based on monocular video

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811230151.0A CN109584353B (en) 2018-10-22 2018-10-22 Method for reconstructing three-dimensional facial expression model based on monocular video

Publications (2)

Publication Number Publication Date
CN109584353A CN109584353A (en) 2019-04-05
CN109584353B true CN109584353B (en) 2023-04-07

Family

ID=65920336

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811230151.0A Active CN109584353B (en) 2018-10-22 2018-10-22 Method for reconstructing three-dimensional facial expression model based on monocular video

Country Status (1)

Country Link
CN (1) CN109584353B (en)

Families Citing this family (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109977925B (en) * 2019-04-22 2020-11-27 北京字节跳动网络技术有限公司 Expression determination method and device and electronic equipment
CN110176052A (en) * 2019-05-30 2019-08-27 湖南城市学院 Model is used in a kind of simulation of facial expression
CN110298319B (en) * 2019-07-01 2021-10-08 北京字节跳动网络技术有限公司 Image synthesis method and device
CN110298917B (en) * 2019-07-05 2023-07-25 北京华捷艾米科技有限公司 Face reconstruction method and system
CN110443885B (en) * 2019-07-18 2022-05-03 西北工业大学 Three-dimensional human head and face model reconstruction method based on random human face image
CN110689625B (en) * 2019-09-06 2021-07-16 清华大学 Automatic generation method and device for customized face mixed expression model
CN110807364B (en) * 2019-09-27 2022-09-30 中国科学院计算技术研究所 Modeling and capturing method and system for three-dimensional face and eyeball motion
CN111028346B (en) * 2019-12-23 2023-10-10 北京奇艺世纪科技有限公司 Reconstruction method and device of video object
CN111582121A (en) * 2020-04-29 2020-08-25 北京攸乐科技有限公司 Method for capturing facial expression features, terminal device and computer-readable storage medium
CN111598927B (en) * 2020-05-18 2023-08-01 京东方科技集团股份有限公司 Positioning reconstruction method and device
CN112312230B (en) * 2020-11-18 2023-01-31 秒影工场(北京)科技有限公司 Method for automatically generating 3D special effect for film
CN112734895A (en) * 2020-12-30 2021-04-30 科大讯飞股份有限公司 Three-dimensional face processing method and electronic equipment
CN112700523B (en) * 2020-12-31 2022-06-07 魔珐(上海)信息科技有限公司 Virtual object face animation generation method and device, storage medium and terminal
CN112991381B (en) * 2021-03-15 2022-08-02 深圳市慧鲤科技有限公司 Image processing method and device, electronic equipment and storage medium
CN113076918B (en) * 2021-04-15 2022-09-06 河北工业大学 Video-based facial expression cloning method
CN113395476A (en) * 2021-06-07 2021-09-14 广东工业大学 Virtual character video call method and system based on three-dimensional face reconstruction
CN115100327B (en) * 2022-08-26 2022-12-02 广东三维家信息科技有限公司 Method and device for generating animation three-dimensional video and electronic equipment
CN117132461B (en) * 2023-10-27 2023-12-22 中影年年(北京)文化传媒有限公司 Method and system for whole-body optimization of character based on character deformation target body

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103473801A (en) * 2013-09-27 2013-12-25 中国科学院自动化研究所 Facial expression editing method based on single camera and motion capturing data
CN103942822A (en) * 2014-04-11 2014-07-23 浙江大学 Facial feature point tracking and facial animation method based on single video vidicon

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103473801A (en) * 2013-09-27 2013-12-25 中国科学院自动化研究所 Facial expression editing method based on single camera and motion capturing data
CN103942822A (en) * 2014-04-11 2014-07-23 浙江大学 Facial feature point tracking and facial animation method based on single video vidicon

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
Contour-based 3D Face Modeling from a Monocular Video; Gupta H et al.; BMVC; 2004-12-31; entire document *
Dense optical flow variation based 3d face reconstruction from monocular video; Wang S et al.; 2018 25th IEEE International Conference on Image Processing (ICIP); 2019-02-28 *
Face2face: Real-time face capture and reenactment of rgb videos; Thies J et al.; Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2016-12-31; entire document *
Monocular 3D facial information retrieval for automated facial expression analysis; Oveneke M C et al.; 2015 International Conference on Affective Computing and Intelligent Interaction (ACII); 2015-12-31; entire document *
Towards reconstructing a 3D face model from an uncontrolled video sequence; Prithviraj J L et al.; 2012 International Conference on Cyberworlds, IEEE; 2012-12-31; entire document *
A survey of 3D facial expression acquisition and reconstruction technology (三维人脸表情获取及重建技术综述); Wang Shan et al.; Journal of System Simulation (《系统仿真学报》); 2018-07-08 (No. 07); entire document *

Also Published As

Publication number Publication date
CN109584353A (en) 2019-04-05

Similar Documents

Publication Publication Date Title
CN109584353B (en) Method for reconstructing three-dimensional facial expression model based on monocular video
Zeng et al. 3d human mesh regression with dense correspondence
Zhao et al. Thin-plate spline motion model for image animation
CN109636831B (en) Method for estimating three-dimensional human body posture and hand information
Zuffi et al. Lions and tigers and bears: Capturing non-rigid, 3d, articulated shape from images
CN110503680B (en) Unsupervised convolutional neural network-based monocular scene depth estimation method
Achenbach et al. Fast generation of realistic virtual humans
Chen et al. Animatable neural radiance fields from monocular rgb videos
Shi et al. Automatic acquisition of high-fidelity facial performances using monocular videos
Stoll et al. Fast articulated motion tracking using a sums of gaussians body model
Hasler et al. Estimating body shape of dressed humans
CN110569768B (en) Construction method of face model, face recognition method, device and equipment
CN113421328B (en) Three-dimensional human body virtual reconstruction method and device
CN114863035B (en) Implicit representation-based three-dimensional human motion capturing and generating method
US11928778B2 (en) Method for human body model reconstruction and reconstruction system
CN110660076A (en) Face exchange method
JP2023524252A (en) Generative nonlinear human shape model
CN112818860B (en) Real-time three-dimensional face reconstruction method based on end-to-end multitask multi-scale neural network
WO2021228183A1 (en) Facial re-enactment
CN115951784A (en) Dressing human body motion capture and generation method based on double nerve radiation fields
CN111640172A (en) Attitude migration method based on generation of countermeasure network
Yang et al. Human bas-relief generation from a single photograph
Wang et al. Physical Priors Augmented Event-Based 3D Reconstruction
Kabadayi et al. Gan-avatar: Controllable personalized gan-based human head avatar
Jian et al. Realistic face animation generation from videos

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant