CN109584353B - Method for reconstructing three-dimensional facial expression model based on monocular video - Google Patents


Info

Publication number
CN109584353B
CN109584353B (application CN201811230151.0A)
Authority
CN
China
Prior art keywords
face
dimensional
frame
optical flow
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811230151.0A
Other languages
Chinese (zh)
Other versions
CN109584353A (en)
Inventor
王珊
沈旭昆
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beihang University
Original Assignee
Beihang University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beihang University
Priority to CN201811230151.0A
Publication of CN109584353A
Application granted
Publication of CN109584353B
Legal status: Active



Classifications

    • G06T 17/00: Three-dimensional [3D] modelling, e.g. data description of 3D objects
    • G06T 7/215: Image analysis; analysis of motion; motion-based segmentation
    • G06V 40/168: Human faces, e.g. facial parts, sketches or expressions; feature extraction; face representation
    • G06V 40/174: Facial expression recognition
    • G06T 2200/08: Indexing scheme involving all processing steps from image acquisition to 3D model generation
    • G06T 2207/10016: Image acquisition modality: video; image sequence
    • G06T 2207/30201: Subject of image: human being; face

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computer Graphics (AREA)
  • Geometry (AREA)
  • Software Systems (AREA)
  • Processing Or Creating Images (AREA)
  • Image Analysis (AREA)

Abstract

A method for reconstructing a three-dimensional facial expression model from monocular video requires no additional multi-angle shooting: a generic 3D face model is driven to deform directly from a neutral-expression image frame of the monocular video to generate a personalized three-dimensional face template; the deformation of the three-dimensional facial expression corresponding to different frames is expressed as the change of the personalized template's 3D vertex flow in three-dimensional space; and a coarse-scale geometric model of the facial expression is solved through consistency with the 2D optical-flow changes. Dense optical flow improves the shape accuracy of the coarse-scale reconstruction and relaxes the shooting requirements on the input video; details are then added to the recovered coarse-scale face model by shape-from-shading to recover fine-scale face geometry, reconstructing a high-fidelity three-dimensional face geometric model.

Description

Method for reconstructing three-dimensional facial expression model based on monocular video
Technical Field
The invention relates to a method for reconstructing a three-dimensional facial expression model based on monocular video, and belongs to the technical field of computer virtual reality.
Background
Vividly reconstructed three-dimensional facial expression models have wide application in computer games, film production, social networking, medicine, and other fields, yet traditional three-dimensional face model acquisition and reconstruction mostly depends on heavy, expensive hardware and the controlled illumination of a laboratory. As virtual reality technology and mobile smart terminals rapidly enter everyday life, people increasingly hope to obtain high-quality three-dimensional facial expression models with low-cost equipment in ordinary living environments and apply them in virtual environments. Reconstructing a three-dimensional facial expression model from video shot with a mobile phone or camera, or directly from internet video, reduces the complexity of the acquisition equipment to a minimum and brings new opportunities for consumer-level three-dimensional face digitization.
Within visual range, a face can be divided into different hierarchical representations by geometric scale: coarse scale (e.g., nose, cheeks, lips, eyelids), fine scale (e.g., wrinkles), and micro scale (e.g., pores, moles, and freckles). Current monocular-video-based three-dimensional facial expression reconstruction algorithms mainly comprise two steps: coarse-scale three-dimensional face geometry reconstruction, and reconstruction of detail-scale geometry such as wrinkles. Since the camera pose is usually relatively fixed and the illumination changes little while a video is shot, coarse-scale three-dimensional face geometry is mostly reconstructed by driving a prior face model with facial 2D feature points. Detail-scale geometry such as wrinkles is mostly recovered with Shape From Shading (SFS), using the recovered coarse-scale model as the reference for detail optimization; some algorithms instead regress wrinkle geometry details in real time from a pre-trained finite wrinkle-detail dataset.
Document 1 (Garrido, P., et al., Reconstructing detailed dynamic face geometry from monocular video. ACM Trans. Graph., 2013. 32(6): p. 1-10) registers a generic Blendshape model to a scanned neutral-expression target 3D face to obtain a personalized Blendshape of the target face, tracks 2D image feature points over the whole video sequence by sparse feature tracking and optical flow estimation, aligns the 3D Blendshape model to the 2D sparse feature points of each frame to obtain coarse-scale expression and pose estimates, then estimates the unknown illumination with an SFS algorithm and recovers fine-scale facial details. The method requires manually aligning 29 3D feature points between the generic Blendshape model and the scanned face model, obtaining the personalized Blendshape model through a deformation algorithm; its biggest limitation is that every subject must first undergo a 3D scan in neutral expression. Document 2 (Shi, F., et al., Automatic acquisition of high-fidelity facial performances using monocular videos. ACM Trans. Graph., 2014. 33(6): p. 1-13) proposes a fully automatic high-fidelity three-dimensional facial expression reconstruction method: it builds an optimization framework with keyframe spatio-temporal constraints that fully exploits temporal continuity, computes the 3D head pose and coarse-scale expression deformation frame by frame from the acquired 2D feature points and a multilinear face model (FaceWarehouse), and recovers fine-scale facial details by approximating the ambient light with spherical harmonics. The assumption that illumination and reflectance are consistent over the whole sequence effectively reduces the ambiguity of the illumination and reflectance estimation, and initializing the fine-scale geometric reconstruction with the coarse-scale deformation result effectively improves accuracy and robustness. Document 3 (Suwajanakorn, S., I. Kemelmacher-Shlizerman, and S.M. Seitz, Total Moving Face Reconstruction. In European Conference on Computer Vision, 2014) proposes a three-dimensional facial expression reconstruction method that jointly optimizes coarse-scale and fine-scale geometry based on 3D flow. For a celebrity video, it first collects from the internet a large photo set of the same face under different illumination environments and generates an average face model using the authors' earlier work, Document 4 (Kemelmacher-Shlizerman, I. and S.M. Seitz, Face reconstruction in the wild. In 2011 International Conference on Computer Vision, 2011); it then establishes correspondences between the vertices of the average face model and the pixels of the input video frames by constructing a scene flow, superimposes an SFS imaging equation to build a unified numerical optimization framework, and alternately iterates between the coarse-scale model and the fine-scale details. In 2015, Document 5 (Cao, C., et al., Real-time high-fidelity facial performance capture. ACM Trans. Graph., 2015. 34(4): p. 1-9)
proposed the first real-time high-fidelity facial expression capture method. It assumes that facial wrinkles appear at different locations and depths but are self-similar, so their visual appearance can be composed from local shapes; it separates coarse-scale geometry from fine-scale geometry such as wrinkles in scanned high-precision three-dimensional face models, divides the wrinkles into small local geometric detail regions to build a training set for geometric-detail regression, and, on top of a prior coarse-scale reconstruction, trains a set of local detail regressors to add geometric details such as wrinkles in real time.
In 2016, Document 6 (Garrido, P., et al., Reconstruction of Personalized 3D Face Rigs from Monocular Video. ACM Trans. Graph., 2016. 35(3): p. 1-15) presented a method for creating high-fidelity facial expressions and their rig parameters from monocular video; it models the facial shape and obtains high-fidelity expressions on three levels, from coarse-scale facial geometry through medium-scale corrective elements to fine-scale wrinkle-level details. A parameterized shape prior encodes identity and expression variables; the coarse-scale model is recovered by jointly tracking and optimizing facial shape, expression, and illumination parameters; linear user-specific corrective elements further improve accuracy; and the shading information of the input image is inverse-rendered to obtain fine-scale wrinkle-level details. Also in 2016, Document 7 (Wu, C., et al., An anatomically-constrained local deformation model for monocular face capture. ACM Trans. Graph., 2016. 35(4): p. 1-12) proposed a three-dimensional facial expression reconstruction algorithm based on the anatomical bone structure of the face to address the reconstruction of complex expressions (such as extreme facial expressions caused by strong wind against the face). The method acquires a high-precision 3D face model from 2D motion data using an anatomically constrained local deformation model comprising many small subspaces distributed over the whole face and an underlying anatomical skeleton structure; it tracks the local patches and the skeleton under the anatomical constraints and combines the patches into a complete face mesh. The anatomical constraints restrict facial deformation to a reasonable and effective expression range, allow extreme expressions to be reconstructed, and help eliminate the ambiguity common in video-based three-dimensional reconstruction.
Document 8 (Ichim, A.E., S. Bouaziz, and M. Pauly, Dynamic 3D avatar creation from hand-held video input. ACM Trans. Graph., 2015. 34(4): p. 1-14) proposes a three-dimensional facial expression reconstruction system that takes face video from a hand-held device and simultaneously creates facial-animation rig parameters for the reconstructed character; it was the first to propose a 3D facial expression rig represented at two scale levels. A mobile phone is moved in a circle around the subject's face while the subject holds a neutral expression; an initial 3D face point cloud is reconstructed with multi-view stereo, and a generic 3D face model is deformed to the point cloud to form a personalized neutral-expression model. The deformations of a set of generic Blendshapes are then transferred to the personalized neutral model to obtain a set of personalized Blendshapes for the subject. Based on the personalized Blendshapes, the coarse-scale three-dimensional facial expression is reconstructed by tracking 2D feature points and optical flow in the video, and geometric details such as wrinkles are reconstructed by estimating a normal map and an ambient-occlusion map. The Blendshape-based expression parameters can be applied directly to large-scale facial animation of virtual characters, and the geometric details are predicted in real time by training a radial-basis-function regressor. Document 9 (Yu, R., et al., Direct, Dense, and Deformable: Template-Based Non-Rigid 3D Reconstruction from RGB Video. In 2015 IEEE International Conference on Computer Vision (ICCV), 2015) proposes a template-based non-rigid three-dimensional reconstruction algorithm with RGB video as input and applies it to three-dimensional face reconstruction. It first searches the video sequence for image sets with different poses but similar expressions and reconstructs a three-dimensional face template with multi-view stereo; it then builds a numerical optimization framework containing a photometric-difference data term, temporal smoothness, spatial smoothness, and local rigidity constraints, and optimizes the deformation of the reconstructed template to each frame, yielding smooth, continuous three-dimensional facial expression. Document 10 (Thies, J., et al., Face2Face: Real-Time Face Capture and Reenactment of RGB Videos. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016) proposes a real-time three-dimensional facial expression reconstruction and expression transfer method. It matches video features with a multilinear face prior to solve for the subject's three-dimensional expression model, casting the problem as a unified numerical-optimization residual function combining pixel-value similarity between the rendered and captured images, consistency between 2D facial feature points and 3D features, and a statistical regularization term.
This function is difficult to solve; the authors provide a data-parallel GPU optimization strategy based on iteratively reweighted least squares (IRLS) to achieve real-time reconstruction.
Current monocular-video-based three-dimensional facial expression reconstruction techniques depend on the two-dimensional and three-dimensional feature-point positions of the input video and on a recovered camera matrix. The most common approach optimizes the similarity between the luminance of the recovered three-dimensional model projected onto the image plane and the luminance of the input image; it cannot obtain good results under large facial expression changes, illumination changes, or slight occlusion. Moreover, because the feature points are too sparse, the shape accuracy of the coarse-scale face reconstruction is hard to guarantee, especially in face regions far from the feature points.
Disclosure of Invention
The technical problem solved by the invention: overcoming the defects of the prior art, a method for reconstructing a three-dimensional facial expression model from monocular video is provided. It uses a coarse-scale facial expression reconstruction method driven by the consistency between a personalized face template and dense optical flow; the dense optical flow improves the shape accuracy of the coarse-scale reconstruction while relaxing the shooting requirements on the input video; and details such as wrinkles are added to the recovered coarse-scale face model by shape-from-shading to recover fine-scale face geometry, reconstructing a high-fidelity three-dimensional facial expression model.
The technical solution of the invention is as follows: a method for reconstructing a three-dimensional facial expression model based on monocular video comprises the following implementation steps:
(1) Under the subspace constraint that the two-dimensional optical-flow field is represented by a linear combination of a finite set of trajectory bases, computing a multi-frame 2D dense optical flow of the face from the input two-dimensional image sequence;
(2) Computing a personalized three-dimensional face template from the input two-dimensional image sequence and a generic neutral-expression face template using a three-dimensional deformation technique;
(3) Constructing a numerical optimization framework based on 2D-3D optical-flow-change consistency from the multi-frame 2D dense optical flow of the face computed in step (1) and the personalized three-dimensional face template computed in step (2), and computing a coarse-scale facial expression model reflecting the facial features and contour of the target face;
(4) Adding geometric detail information such as wrinkles and expression lines to the coarse-scale facial expression model computed in step (3) using a shape-from-shading algorithm, and computing the final three-dimensional facial expression model.
The method for computing the multi-frame 2D dense optical flow of the face in step (1) is:
(21) Constructing an energy function based on subspace constraints, whose energy terms comprise a data term, a smoothness constraint term, and an error term;
(22) Solving the energy function of step (21): it has two unknowns, the linear trajectory bases and the basis coefficients; the linear trajectory bases are pre-estimated by tracking corner points or texture with distinct features, the basis coefficients are solved by singular value decomposition, and the multi-frame 2D dense optical flow of the face is obtained from the solved trajectory bases and coefficients.
The method for calculating the personalized three-dimensional face template in the step (2) comprises the following steps:
(31) Performing feature-point recognition on a neutral-expression image frame of the input monocular video with an active appearance model to obtain 68 face 2D feature points, of which 17 points represent the face contour, 10 points the eyebrow positions, 12 points the eye positions, 9 points the nose position, and 20 points the mouth position (a common index layout is sketched below).
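For reference, the sketch below spells out one common indexing of these 68 points (the iBUG/dlib convention, whose per-region counts match those above). The exact index ranges are an assumption; the patent states only the counts per region.

import numpy as np  # not needed here, but kept for consistency with later sketches

# Hypothetical index layout for the 68 face landmarks (iBUG/dlib convention);
# the patent gives only per-region counts, so these ranges are an assumption.
LANDMARK_REGIONS = {
    "jaw_outline": list(range(0, 17)),   # 17 points: face contour
    "eyebrows":    list(range(17, 27)),  # 10 points: left + right brow
    "nose":        list(range(27, 36)),  # 9 points: bridge + nostrils
    "eyes":        list(range(36, 48)),  # 12 points: 6 per eye
    "mouth":       list(range(48, 68)),  # 20 points: outer + inner lips
}

assert sum(len(v) for v in LANDMARK_REGIONS.values()) == 68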
(32) Using the 2D feature points obtained in step (31) and the generic neutral-expression face template, constructing the following energy function, in which the 2D feature points drive the generic neutral-expression template to deform:
argmin Σ ‖P·D·S_t − W‖ + λ‖ΔS_t − ΔS_g‖

where P is the weak-perspective projection matrix of the camera, the matrix W contains the acquired 2D face feature points, S_t is the personalized face template mesh, the matrix D is a selection matrix of 0s and 1s that picks the corresponding 3D feature points on S_t, S_g is the initial template, and λ is a weight coefficient with value between 0 and 1.
(33) Solving the energy equation constructed in step (32): it has two unknowns, P and S_t, solved by iterative optimization. First fix P as a known parameter and solve the energy equation for a new S_t; then fix S_t as a known parameter and solve for a new P; alternate until the energy equation converges. The current P and S_t are then the result of the iterative optimization, i.e., the computed personalized three-dimensional face template that conforms to the target face characteristics.
The method for calculating the coarse-scale facial expression model in the step (3) comprises the following steps:
(41) Constructing a numerical optimization framework comprising a data term, a spatial smoothness term, a rigid constraint term ARAP, and a temporal regularization term; the energy equation is:
E(f, R, T) = E_flow + β·E_smooth + γ·E_arap + ω·E_temp
where E_flow is the 2D-3D dense optical-flow-change consistency term, i.e., the data term, which enforces the consistent mapping between the 3D vertex flow of the personalized face template and the computed 2D image optical flow, specifically:

E_flow = Σ_i ‖ P(R(v_i + f(v_i)) + T) − P(R·v_i + T) − L(P(R·v_i + T)) ‖_ε

where v_i is the i-th vertex of S_t, f(v_i) is the 3D flow of vertex v_i, P is the projection matrix, L(·) is the dense 2D optical flow, R and T are the rotation and translation matrices of the camera, and ‖·‖_ε is the Huber loss;
E_smooth is the spatial smoothness term and contains two parts: an adjacency smoothing term and a discrete global variation term. The adjacency smoothing term makes the deformed model smooth by directly enforcing consistency of the 3D flows of adjacent vertices; the discrete global variation term is based on the initial face template and ensures that the deformed model still keeps face local features consistent with the initial template, specifically:

E_smooth = Σ_i Σ_{v_j ∈ Λ_i} ‖ f(v_i) − f(v_j) ‖² + σ‖ Δ(S_t + f) − ΔS_t ‖²

where v_j ∈ Λ_i denotes the vertices adjacent to vertex v_i, Δ is the discrete Laplace operator of the mesh, and σ is a weight coefficient with value between 0 and 1.
E_arap is the rigid constraint term ARAP (As-Rigid-As-Possible), commonly used in non-rigid 3D reconstruction and deformation to preserve local rigidity during deformation, specifically:

E_arap = Σ_i Σ_{v_j ∈ Λ_i} ‖ (v_i + f(v_i)) − (v_j + f(v_j)) − A_i(v_i − v_j) ‖²

where A_i is the rotation matrix associated with vertex v_i;
E_temp is the temporal regularization term. Like the 2D optical flow, the 3D flow should be continuous and smooth; this term enforces smooth deformation between frames, prevents inter-frame jumps over the whole video sequence, and improves the robustness of the system, specifically:

E_temp = Σ_i ‖ f_{p+1}(v_i) − f_p(v_i) ‖²

where f_p(v_i) is the flow of v_i at frame p and f_{p+1}(v_i) its flow at frame p+1;
in the energy equation, β, γ, and ω are the weights of the corresponding three energy terms in the energy function, each ranging from 0 to 1;
(42) The unknowns of the energy equation constructed in step (41) are the 3D flows f(v_i) of the vertices v_i and the rotation and translation matrices R and T of the camera. The energy equation is solved with an alternating iterative strategy: keep the 3D flow f(v_i) fixed and solve for the camera rotation and translation R and T; then keep R and T fixed and optimize the 3D flow f(v_i); alternate until convergence, at which point the coarse-scale facial expression model is obtained.
Compared with the prior art, the invention has the advantages that:
(1) By adding the subspace constraint, the computed multi-frame dense 2D optical flow overcomes the large computational cost of traditional dense optical flow methods and their difficulty with large displacements and partial occlusion.
(2) The personalized target-face template is computed with a three-dimensional deformation technique, obtaining a more accurate target face model without additional multi-angle shooting.
(3) The coarse-scale facial expression reconstruction based on the consistency between the 3D vertex flow and the 2D image optical flow gives better recovery under large facial expression changes, illumination changes, and slight occlusion.
Drawings
FIG. 1 is a flow chart of an implementation of the present invention;
FIG. 2 is a schematic diagram of personalized three-dimensional face templates computed from input monocular video and the generic face template; the first image on the left is from the i-bug dataset, the second on the left was downloaded from the internet, and the two images on the right are face images shot with a mobile phone in a daily environment;
FIG. 3 is a schematic diagram of face models reconstructed by the invention from input monocular video, where the input videos are dataset data and face video shot with a mobile phone in a daily environment.
Detailed Description
The present invention will be described in detail below with reference to the accompanying drawings and examples.
As shown in fig. 1, the method comprises the following specific steps:
(1) Multi-frame 2D dense optical flow calculation based on subspace constraint
A two-dimensional optical-flow field can be regarded as the set of motion vectors of every image pixel in the two-dimensional plane. For any pixel j located at (x_1j, y_1j) in the reference image, its 2D motion trajectory vector is defined as:

w_j = [u_1j u_2j … u_Fj | v_1j v_2j … v_Fj]^T

Suppose the image has P pixels and the input video has F frames, where u_ij = x_ij − x_1j, v_ij = y_ij − y_1j, (x_ij, y_ij) are the coordinates of pixel j in frame i, and (x_1j, y_1j) are its coordinates in frame 1, i.e., the reference image, with 1 ≤ i ≤ F and 1 ≤ j ≤ P.
On the basis of traditional optical-flow algorithms, and following the principle that facial expression changes can be linearly combined from a few basic expressions, a tighter subspace constraint is added: the motion vector of every point in the video stream can be obtained by linearly combining a finite number of trajectory bases, expressed as:

w_j = Σ_{r=1}^{R} l_rj·Q_r

where w_j is the motion trajectory vector of the j-th pixel, Q_r is the r-th trajectory basis with 1 ≤ r ≤ R, l_rj is the coefficient of the r-th basis vector for the j-th pixel, Q_r^u denotes the per-frame horizontal component of the r-th motion-vector basis, and Q_r^v its per-frame vertical component.
The above formula is expressed in matrix form as:

W = [Q_u; Q_v]·L

where W is the 2F × P observation matrix, Q_u and Q_v are the motion-vector basis matrices in the horizontal and vertical directions respectively, and L is the coefficient matrix.
Based on this theory, the following energy function is constructed to compute the 2D optical-flow field of the face in the two-dimensional image sequence:

E = E_data + α·E_reg + β·E_error

where the data term E_data quantifies, under the brightness-constancy assumption, the luminance error between each moved pixel and the reference pixel; E_reg is the smoothness constraint term, which penalizes the gradient magnitude of the flow vectors so that the flow of a pixel stays as consistent as possible with the flows in its neighborhood; and the error term E_error adds the difference between the two-dimensional observation matrix and its linear-basis fit to the energy equation. α and β are the weight coefficients of the regularization term and the error term respectively, used to adjust the proportion of the corresponding energy terms in the energy function, with values between 0 and 1. I_f(x, y) denotes the luminance of pixel (x, y) in the f-th frame, I_1(x, y) the luminance of pixel (x, y) in the reference image frame, Q_f^u the horizontal component of the trajectory bases in frame f, and Q_f^v their vertical component.
Solving this energy function: it has two unknowns, the linear trajectory bases and the basis coefficients. The linear trajectory bases are pre-estimated by tracking corner points and texture with distinct features, the basis coefficients are solved through singular value decomposition, and the multi-frame 2D dense optical flow of the face is obtained from the solved trajectory bases and coefficients.
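A minimal numerical sketch of that two-stage solve follows, under one plausible reading: the truncated SVD of the tracked-corner trajectories supplies the bases, and the dense-flow coefficients follow by least squares. The function names are ours, not the patent's.

import numpy as np

def estimate_trajectory_basis(W_corners, R):
    """Pre-estimate R linear trajectory bases from a 2F x K observation
    matrix of K reliably tracked corner points (truncated SVD)."""
    U, s, _ = np.linalg.svd(W_corners, full_matrices=False)
    return U[:, :R]                         # (2F, R): dominant trajectory bases

def solve_basis_coefficients(Q, W_dense):
    """With the bases fixed, fit the coefficient matrix L so that Q @ L
    reproduces the dense-flow observations."""
    L, *_ = np.linalg.lstsq(Q, W_dense, rcond=None)
    return L                                # (R, P)

# The multi-frame dense flow implied by the subspace model is then Q @ L,
# a 2F x P matrix whose column j is the trajectory vector w_j of pixel j.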
(2) Reconstruction of personalized three-dimensional face template by using 3D deformation technology
Active Appearance Models (AAMs) are used to identify 68 two-dimensional feature points of the face in a neutral-expression image frame of the input monocular video; a neutral-expression face mesh is selected from an existing face database as the initial template; and a three-dimensional deformation technique (3D warping) lets the two-dimensional feature points drive the initial template to deform, yielding a personalized three-dimensional face template that conforms to the target face characteristics. The following energy equation is constructed:
argmin Σ ‖P·D·S_t − W‖ + λ‖ΔS_t − ΔS_g‖

where P is the weak-perspective projection matrix of the camera, W is the matrix of acquired two-dimensional face feature points, S_t is the personalized face template mesh, D is a selection matrix of 0s and 1s that picks the corresponding 3D feature points on S_t, and S_g is the initial template. The term ‖ΔS_t − ΔS_g‖ in the energy equation is a spatial regularization term, which prevents the overall geometry of the three-dimensional model from deforming merely to maximize feature-point agreement; λ is the weight coefficient of the spatial regularization term (λ takes a value between 0 and 1).
Solving this energy equation: it has two unknowns, P and S_t, solved by iterative optimization. First fix P as a known parameter and solve the energy equation for a new S_t; then fix S_t as a known parameter and solve for a new P; alternate until the energy equation converges. The current P and S_t are then the result of the iterative optimization, i.e., the computed personalized three-dimensional face template that conforms to the target face characteristics.
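A minimal sketch of this alternation follows (simplifications and assumptions: P is treated as an unconstrained 2 × 4 affine weak-perspective map, the mesh Laplacian is passed in as a dense matrix, and all names are ours):

import numpy as np

def fit_personalized_template(W2d, D, S_g, Lap, lam=0.5, iters=10):
    """Alternating least-squares sketch of
       argmin_{P, S_t} ||P D S_t - W||^2 + lam * ||Lap (S_t - S_g)||^2.

    W2d : (2, K) detected 2D landmark positions.
    D   : (K, V) 0/1 selection matrix for the K landmark vertices.
    S_g : (V, 3) generic neutral-expression template vertices.
    Lap : (V, V) mesh Laplacian (dense here purely for brevity).
    """
    V = S_g.shape[0]
    S_t = S_g.copy()
    P = None
    for _ in range(iters):
        # fix S_t, solve the weak-perspective camera as a 2x4 affine fit
        X = np.vstack([(D @ S_t).T, np.ones((1, D.shape[0]))])   # (4, K)
        P = W2d @ np.linalg.pinv(X)                              # (2, 4)
        P3, t = P[:, :3], P[:, 3]
        # fix P, solve S_t: one linear system over all vertex coordinates
        A = np.vstack([np.kron(D, P3),                           # data term
                       np.sqrt(lam) * np.kron(Lap, np.eye(3))])  # regularizer
        b = np.concatenate([(W2d - t[:, None]).flatten(order="F"),
                            np.sqrt(lam) * (Lap @ S_g).reshape(-1)])
        s, *_ = np.linalg.lstsq(A, b, rcond=None)
        S_t = s.reshape(V, 3)
    return P, S_t

In practice D and the Laplacian are sparse, so the stacked system would be solved with a sparse factorization rather than dense least squares.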
Fig. 2 shows personalized three-dimensional face templates computed from the input images and the face template, where the two images of (a) were downloaded from the internet and the two images of (b) are face images shot with a mobile phone in a daily environment.
FIG. 2 shows that whether the input source is a face image downloaded from the internet or a face image shot with a mobile phone in a daily environment, the method computes a personalized three-dimensional face template that conforms to the target face characteristics.
(3) Constructing a numerical optimization frame by using the personalized three-dimensional face template generated in the step (2) and the 2D dense optical flow generated in the step (1), and calculating a coarse-scale expression model
The deformation of the three-dimensional facial expression corresponding to different frames is expressed as the motion-trajectory estimation of the 3D vertex flow of the face template, so the change of the 3D vertex flow is consistent with the change of the 2D facial pixel optical flow. The consistent mapping between the 3D vertex flow and the 2D image optical flow is established through the intrinsic and extrinsic camera parameters; optical-flow-change consistency is adopted as the data term, and constraint or smoothness terms related to the deformation, such as local deformation characteristics and spatio-temporal deformation characteristics, are added, building a numerical optimization framework that solves for the coarse-scale facial expression. The following energy equation is constructed:
E(f, R, T) = E_flow + β·E_smooth + γ·E_arap + ω·E_temp
where E_flow is the 2D-3D dense optical-flow-change consistency term, i.e., the data term, which enforces the consistent mapping between the 3D vertex flow of the personalized face template and the computed 2D image optical flow, specifically:

E_flow = Σ_i ‖ P(R(v_i + f(v_i)) + T) − P(R·v_i + T) − L(P(R·v_i + T)) ‖_ε

where v_i is the i-th vertex of S_t, f(v_i) is the 3D flow of vertex v_i, P is the projection matrix, L(·) is the dense 2D optical flow, R and T are the rotation and translation matrices of the camera, and ‖·‖_ε is the Huber loss.
E_smooth is the spatial smoothness term and contains two parts: an adjacency smoothing term and a discrete global variation term. The adjacency smoothing term makes the deformed model smooth by directly enforcing consistency of the 3D flows of adjacent vertices; the discrete global variation term is based on the initial face template and ensures that the deformed model still keeps face local features consistent with it, specifically:

E_smooth = Σ_i Σ_{v_j ∈ Λ_i} ‖ f(v_i) − f(v_j) ‖² + σ‖ Δ(S_t + f) − ΔS_t ‖²

where v_j ∈ Λ_i denotes the vertices adjacent to vertex v_i, Δ is the discrete Laplace operator of the mesh, and σ is a weight coefficient with value between 0 and 1.
E_arap is the ARAP term (As-Rigid-As-Possible), used in non-rigid 3D reconstruction and deformation to preserve local rigidity during deformation, specifically:

E_arap = Σ_i Σ_{v_j ∈ Λ_i} ‖ (v_i + f(v_i)) − (v_j + f(v_j)) − A_i(v_i − v_j) ‖²

where A_i is the rotation matrix associated with vertex v_i.
E_temp is the temporal regularization term. Like the 2D optical flow, the 3D flow should be continuous and smooth; this term enforces smooth deformation between frames, prevents inter-frame jumps over the whole video sequence, and improves the robustness of the system, specifically:

E_temp = Σ_i ‖ f_{p+1}(v_i) − f_p(v_i) ‖²

where f_p(v_i) is the flow of v_i at frame p and f_{p+1}(v_i) its flow at frame p+1.
β, γ, and ω in the energy equation are the weight coefficients of the corresponding three energy terms in the energy function, each ranging from 0 to 1.
The unknowns in the energy equation are the 3D flows f(v_i) of the vertices v_i and the rotation and translation matrices R and T of the camera. The equation is solved with an alternating iterative strategy: keep the 3D flow f(v_i) fixed and solve for the camera rotation and translation R and T; then keep R and T fixed and optimize the 3D flow f(v_i); alternate until convergence to obtain the coarse-scale facial expression model.
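One frame of this alternation can be sketched as follows. Assumptions: the dense 2D flow has already been sampled at each vertex's reference projection, giving per-vertex 2D targets; the pose is parameterized as an axis-angle vector; and only the data and temporal terms are shown, the smoothness and ARAP terms being analogous residual blocks. All names are ours.

import numpy as np
from scipy.optimize import least_squares
from scipy.spatial.transform import Rotation

def project(P, X):
    """Weak-perspective projection of (V, 3) points with a (2, 3) matrix P."""
    return X @ P.T                                    # (V, 2)

def data_residual(P, rvec, T, verts, flow3d, targets2d):
    """E_flow residuals: projected flowed vertices vs. 2D-flow targets."""
    Rm = Rotation.from_rotvec(rvec).as_matrix()
    moved = (verts + flow3d) @ Rm.T + T
    return (project(P, moved) - targets2d).ravel()

def solve_coarse_frame(P, verts, targets2d, flow_prev, omega=0.1, iters=5):
    rvec, T = np.zeros(3), np.zeros(3)
    flow3d = flow_prev.copy()
    for _ in range(iters):
        # fix the 3D flow, refine the camera pose (R, T) under a Huber loss
        pose = least_squares(
            lambda p: data_residual(P, p[:3], p[3:], verts, flow3d, targets2d),
            np.concatenate([rvec, T]), loss="huber")
        rvec, T = pose.x[:3], pose.x[3:]
        # fix the pose, refine the per-vertex 3D flow; the E_temp block keeps
        # it close to the previous frame's flow
        def res(f):
            f3 = f.reshape(-1, 3)
            r_data = data_residual(P, rvec, T, verts, f3, targets2d)
            r_temp = np.sqrt(omega) * (f3 - flow_prev).ravel()
            return np.concatenate([r_data, r_temp])
        flow3d = least_squares(res, flow3d.ravel(), loss="huber").x.reshape(-1, 3)
    return rvec, T, flow3d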
The invention adopts the optical-flow-change consistency term in place of the brightness-consistency data term used by traditional methods, avoiding the influence of image brightness noise, projection errors, reflections, and illumination.
(4) Adding details by using a light and shade recovery shape technology to generate a fine-scale expression model
Assuming the face is a Lambertian surface, the ambient light can be approximated by spherical harmonics, and the estimation of wrinkle details can be regarded as an optimization between the luminance values of the actually captured image and the rendered image, i.e., expressed as:
E(l, ρ, n) = Σ_{(u,v)} ‖ I_r(u,v) − ρ(u,v)·l^T·Y(n(u,v)) ‖² + λ‖ ΔG(n(u,v) − n_ref(u,v)) ‖²

where the sum runs over the N image pixels, I_r(u,v) is the pixel luminance of the actually captured image, ρ(u,v) the reflectance, n(u,v) the normal vector, Y(n(u,v)) the spherical-harmonic basis, l^T the coefficients of the harmonic basis, and λ‖ΔG(n(u,v) − n_ref(u,v))‖² the regularization term, in which n_ref(u,v) is the reference normal computed from the recovered coarse-scale face model and the Gaussian operator ΔG reduces the 2D-3D projection error. The equation is solved with an alternating iterative strategy, optimizing the illumination coefficients l, the reflectance ρ(u,v), and the normals n(u,v) in turn; the final wrinkle details are computed from the normals.
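To make the imaging model concrete, here is a minimal sketch of one alternating sub-step: fixing the reflectance ρ and normals n and solving the spherical-harmonic lighting coefficients l linearly. The unnormalized 9-term SH basis and the function names are assumptions; the reflectance and normal updates, and the Gaussian regularizer, are analogous solves omitted here.

import numpy as np

def sh_basis(n):
    """Second-order spherical-harmonic basis (9 terms) for unit normals n (N, 3)."""
    x, y, z = n[:, 0], n[:, 1], n[:, 2]
    return np.stack([np.ones_like(x), x, y, z,
                     x * y, x * z, y * z,
                     x**2 - y**2, 3 * z**2 - 1], axis=1)      # (N, 9)

def solve_lighting(I, rho, n):
    """Fix reflectance rho and normals n; solve the 9 lighting coefficients l
    of  I ~= rho * (Y(n) @ l)  by linear least squares."""
    A = rho[:, None] * sh_basis(n)                            # (N, 9)
    l, *_ = np.linalg.lstsq(A, I, rcond=None)
    return l

def render(rho, n, l):
    """Rendered luminance under the Lambertian + SH lighting model."""
    return rho * (sh_basis(n) @ l)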
Fig. 3 shows face models reconstructed from the input videos, where the input videos are dataset data and face video shot with a mobile phone in a daily environment. FIG. 3 shows that whether the input is dataset data or face video shot with a mobile phone in a daily environment, the invention accurately reconstructs the three-dimensional facial expression model.

Claims (3)

1. A method for reconstructing a three-dimensional facial expression model based on monocular video is characterized by comprising the following steps:
(1) Under the subspace constraint that the two-dimensional optical-flow field is represented by a linear combination of a finite set of trajectory bases, computing a multi-frame 2D dense optical flow of the face from the input two-dimensional image sequence;
(2) Computing a personalized three-dimensional face template from the input two-dimensional image sequence and a generic neutral-expression face template using a three-dimensional deformation technique;
(3) Constructing a numerical optimization framework based on 2D-3D optical-flow-change consistency from the multi-frame 2D dense optical flow of the face computed in step (1) and the personalized three-dimensional face template computed in step (2), and computing a coarse-scale facial expression model reflecting the facial features and contour of the target face;
(4) Adding geometric detail information such as wrinkles and expression lines to the coarse-scale facial expression model computed in step (3) using a shape-from-shading algorithm, and computing the final three-dimensional facial expression model;
in the step (1), the method for calculating the face multi-frame 2D dense optical flow comprises the following steps:
(21) Constructing an energy function based on subspace constraints, wherein the energy term comprises a data term, a smooth constraint term and an error term;
(22) Solving an energy function in the step (21), wherein the energy function has two unknown variables of a linear track base and a base coefficient, the linear track base is pre-estimated in a mode of tracking angular points and materials with distinct features, the base coefficient is solved through singular value decomposition, and the multi-frame 2D dense optical flow of the face is obtained through solving the linear track base and the base coefficient;
the method for calculating the personalized three-dimensional face template in the step (2) comprises the following steps: selecting a neutral expression general face grid from an existing face database as an initial template, driving the initial template to deform by using two-dimensional feature points to obtain an individualized three-dimensional face template which accords with the target face features, and specifically realizing the following steps:
(31) Performing feature-point recognition on a neutral-expression image frame of the input monocular video with an active appearance model to obtain 68 face 2D feature points, of which 17 points represent the face contour, 10 points the eyebrow positions, 12 points the eye positions, 9 points the nose position, and 20 points the mouth position;
(32) Using the 2D feature points obtained in step (31) and the generic neutral-expression face template, constructing the following energy function, in which the 2D feature points drive the generic neutral-expression template to deform:
argmin Σ ‖P·D·S_t − W‖ + λ‖ΔS_t − ΔS_g‖

where P is the weak-perspective projection matrix of the camera, the matrix W contains the acquired 2D face feature points, S_t is the personalized face template mesh, the matrix D is a selection matrix of 0s and 1s that picks the corresponding 3D feature points on S_t, S_g is the initial template, and λ is a weight coefficient with value between 0 and 1;
(33) Solving the energy equation constructed in step (32): it has two unknowns, P and S_t, solved by iterative optimization. First fix P as a known parameter and solve the energy equation for a new S_t; then fix S_t as a known parameter and solve for a new P; alternate until the energy equation converges. The current P and S_t are then the result of the iterative optimization, i.e., the computed personalized three-dimensional face template that conforms to the target face characteristics.
2. The method for reconstructing the three-dimensional facial expression model based on monocular video according to claim 1, wherein the energy function constructed under subspace constraints in step (21) is:

E = E_data + α·E_reg + β·E_error

where E_data is the data term, E_reg the smoothness constraint term, and E_error the error term; α and β are the weight coefficients of the smoothness constraint term and the error term respectively, each ranging from 0 to 1. The input two-dimensional image sequence has F frames in total; I_f denotes the luminance of the f-th frame image, 1 ≤ f ≤ F, and I_1 the luminance of the first frame image. There are R trajectory bases; Q_r denotes the r-th trajectory basis, 1 ≤ r ≤ R, Q_f^u the horizontal component of the trajectory bases in frame f, and Q_f^v their vertical component; L denotes the basis-coefficient matrix, l_rj the coefficient of the r-th basis vector for the j-th pixel, and w_j the motion trajectory vector of the j-th pixel.
3. The method for reconstructing the three-dimensional facial expression model based on the monocular video of claim 1, wherein: the method for solving the coarse-scale facial expression model in the step (3) comprises the following steps:
(41) Constructing a numerical optimization framework comprising a data term, a spatial smoothness term, a rigid constraint term ARAP, and a temporal regularization term; the energy equation is:
E(f, R, T) = E_flow + β·E_smooth + γ·E_arap + ω·E_temp
where E_flow is the 2D-3D dense optical-flow-change consistency term, i.e., the data term, which enforces the consistent mapping between the 3D vertex flow of the personalized face template and the computed 2D image optical flow, specifically:

E_flow = Σ_i ‖ P(R(v_i + f(v_i)) + T) − P(R·v_i + T) − L(P(R·v_i + T)) ‖_ε

where v_i is the i-th vertex of S_t, f(v_i) is the 3D flow of vertex v_i, P is the projection matrix, L(·) is the dense 2D optical flow, R and T are the rotation and translation matrices of the camera, and ‖·‖_ε is the Huber loss;
E_smooth is the spatial smoothness term and contains two parts: an adjacency smoothing term and a discrete global variation term. The adjacency smoothing term makes the deformed model smooth by directly enforcing consistency of the 3D flows of adjacent vertices; the discrete global variation term is based on the initial face template and ensures that the deformed model still keeps face local features consistent with it, specifically:

E_smooth = Σ_i Σ_{v_j ∈ Λ_i} ‖ f(v_i) − f(v_j) ‖² + σ‖ Δ(S_t + f) − ΔS_t ‖²

where v_j ∈ Λ_i denotes the vertices adjacent to vertex v_i, Δ is the discrete Laplace operator of the mesh, and σ is a weight coefficient with value between 0 and 1;
E_arap is the rigid constraint term ARAP (As-Rigid-As-Possible), used in non-rigid 3D reconstruction and deformation to preserve local rigidity during deformation, specifically:

E_arap = Σ_i Σ_{v_j ∈ Λ_i} ‖ (v_i + f(v_i)) − (v_j + f(v_j)) − A_i(v_i − v_j) ‖²

where A_i is the rotation matrix associated with vertex v_i;
E_temp is the temporal regularization term. Like the 2D optical flow, the 3D flow should be continuous and smooth; this term enforces smooth deformation between frames, prevents inter-frame jumps over the whole video sequence, and improves the robustness of the system, specifically:

E_temp = Σ_i ‖ f_{p+1}(v_i) − f_p(v_i) ‖²

where f_p(v_i) is the flow of v_i at frame p and f_{p+1}(v_i) its flow at frame p+1;
β, γ, and ω in the energy equation are the weights of the corresponding three energy terms in the energy function, each ranging from 0 to 1;
(42) The unknowns of the energy equation constructed in step (41) are the 3D flows f(v_i) of the vertices v_i and the rotation and translation matrices R and T of the camera; the energy equation is solved with an alternating iterative strategy: keep the 3D flow f(v_i) fixed and solve for the camera rotation and translation R and T, then keep R and T fixed and optimize the 3D flow f(v_i), and alternate until convergence to obtain the coarse-scale facial expression model.
CN201811230151.0A 2018-10-22 2018-10-22 Method for reconstructing three-dimensional facial expression model based on monocular video Active CN109584353B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811230151.0A CN109584353B (en) 2018-10-22 2018-10-22 Method for reconstructing three-dimensional facial expression model based on monocular video

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811230151.0A CN109584353B (en) 2018-10-22 2018-10-22 Method for reconstructing three-dimensional facial expression model based on monocular video

Publications (2)

Publication Number Publication Date
CN109584353A CN109584353A (en) 2019-04-05
CN109584353B true CN109584353B (en) 2023-04-07

Family

ID=65920336

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811230151.0A Active CN109584353B (en) 2018-10-22 2018-10-22 Method for reconstructing three-dimensional facial expression model based on monocular video

Country Status (1)

Country Link
CN (1) CN109584353B (en)

Families Citing this family (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109977925B (en) * 2019-04-22 2020-11-27 北京字节跳动网络技术有限公司 Expression determination method and device and electronic equipment
CN110176052A (en) * 2019-05-30 2019-08-27 湖南城市学院 Model is used in a kind of simulation of facial expression
CN110298319B (en) * 2019-07-01 2021-10-08 北京字节跳动网络技术有限公司 Image synthesis method and device
CN110298917B (en) * 2019-07-05 2023-07-25 北京华捷艾米科技有限公司 Face reconstruction method and system
CN110443885B (en) * 2019-07-18 2022-05-03 西北工业大学 Three-dimensional human head and face model reconstruction method based on random human face image
CN110689625B (en) * 2019-09-06 2021-07-16 清华大学 Automatic generation method and device for customized face mixed expression model
CN110807364B (en) * 2019-09-27 2022-09-30 中国科学院计算技术研究所 Modeling and capturing method and system for three-dimensional face and eyeball motion
CN111028346B (en) * 2019-12-23 2023-10-10 北京奇艺世纪科技有限公司 Reconstruction method and device of video object
CN111582121A (en) * 2020-04-29 2020-08-25 北京攸乐科技有限公司 Method for capturing facial expression features, terminal device and computer-readable storage medium
CN111598927B (en) * 2020-05-18 2023-08-01 京东方科技集团股份有限公司 Positioning reconstruction method and device
CN112312230B (en) * 2020-11-18 2023-01-31 秒影工场(北京)科技有限公司 Method for automatically generating 3D special effect for film
CN112734895A (en) * 2020-12-30 2021-04-30 科大讯飞股份有限公司 Three-dimensional face processing method and electronic equipment
CN112700523B (en) * 2020-12-31 2022-06-07 魔珐(上海)信息科技有限公司 Virtual object face animation generation method and device, storage medium and terminal
CN112991381B (en) * 2021-03-15 2022-08-02 深圳市慧鲤科技有限公司 Image processing method and device, electronic equipment and storage medium
CN113076918B (en) * 2021-04-15 2022-09-06 河北工业大学 Video-based facial expression cloning method
CN113395476A (en) * 2021-06-07 2021-09-14 广东工业大学 Virtual character video call method and system based on three-dimensional face reconstruction
CN115100327B (en) * 2022-08-26 2022-12-02 广东三维家信息科技有限公司 Method and device for generating animation three-dimensional video and electronic equipment
CN117132461B (en) * 2023-10-27 2023-12-22 中影年年(北京)文化传媒有限公司 Method and system for whole-body optimization of character based on character deformation target body

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103473801A (en) * 2013-09-27 2013-12-25 中国科学院自动化研究所 Facial expression editing method based on single camera and motion capturing data
CN103942822A (en) * 2014-04-11 2014-07-23 浙江大学 Facial feature point tracking and facial animation method based on single video vidicon

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103473801A (en) * 2013-09-27 2013-12-25 中国科学院自动化研究所 Facial expression editing method based on single camera and motion capturing data
CN103942822A (en) * 2014-04-11 2014-07-23 浙江大学 Facial feature point tracking and facial animation method based on single video vidicon

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
Contour-based 3D Face Modeling from a Monocular Video; Gupta H et al.; BMVC; 2004-12-31; entire document *
Dense optical flow variation based 3d face reconstruction from monocular video; Wang S et al.; 2018 25th IEEE International Conference on Image Processing (ICIP); 2019-02-28 *
Face2face: Real-time face capture and reenactment of rgb videos; Thies J et al.; Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2016-12-31; entire document *
Monocular 3D facial information retrieval for automated facial expression analysis; Oveneke M C et al.; 2015 International Conference on Affective Computing and Intelligent Interaction (ACII); 2015-12-31; entire document *
Towards reconstructing a 3D face model from an uncontrolled video sequence; Prithviraj J L et al.; 2012 International Conference on Cyberworlds, IEEE; 2012-12-31; entire document *
A survey of 3D facial expression acquisition and reconstruction technology (三维人脸表情获取及重建技术综述); Wang Shan et al.; Journal of System Simulation (《系统仿真学报》); 2018-07-08 (No. 07); entire document *

Also Published As

Publication number Publication date
CN109584353A (en) 2019-04-05

Similar Documents

Publication Publication Date Title
CN109584353B (en) Method for reconstructing three-dimensional facial expression model based on monocular video
Zeng et al. 3d human mesh regression with dense correspondence
Zhao et al. Thin-plate spline motion model for image animation
CN109636831B (en) Method for estimating three-dimensional human body posture and hand information
Zuffi et al. Lions and tigers and bears: Capturing non-rigid, 3d, articulated shape from images
CN110503680B (en) Unsupervised convolutional neural network-based monocular scene depth estimation method
Achenbach et al. Fast generation of realistic virtual humans
Chen et al. Animatable neural radiance fields from monocular rgb videos
Shi et al. Automatic acquisition of high-fidelity facial performances using monocular videos
Stoll et al. Fast articulated motion tracking using a sums of gaussians body model
Hasler et al. Estimating body shape of dressed humans
CN110569768B (en) Construction method of face model, face recognition method, device and equipment
CN113421328B (en) Three-dimensional human body virtual reconstruction method and device
CN114863035B (en) Implicit representation-based three-dimensional human motion capturing and generating method
US11928778B2 (en) Method for human body model reconstruction and reconstruction system
CN110660076A (en) Face exchange method
JP2023524252A (en) Generative nonlinear human shape model
CN112818860B (en) Real-time three-dimensional face reconstruction method based on end-to-end multitask multi-scale neural network
WO2021228183A1 (en) Facial re-enactment
CN115951784A (en) Dressing human body motion capture and generation method based on double nerve radiation fields
CN111640172A (en) Attitude migration method based on generation of countermeasure network
Yang et al. Human bas-relief generation from a single photograph
Wang et al. Physical Priors Augmented Event-Based 3D Reconstruction
Kabadayi et al. Gan-avatar: Controllable personalized gan-based human head avatar
Jian et al. Realistic face animation generation from videos

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant