CN1920886A - Video flow based three-dimensional dynamic human face expression model construction method - Google Patents


Info

Publication number
CN1920886A
CN1920886A, CNA2006100533938A, CN200610053393A
Authority
CN
China
Prior art keywords
dimensional
frame
face
video
Prior art date
Legal status
Granted
Application number
CNA2006100533938A
Other languages
Chinese (zh)
Other versions
CN100416612C (en)
Inventor
庄越挺
张剑
肖俊
王玉顺
Current Assignee
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date
Filing date
Publication date
Application filed by Zhejiang University ZJU filed Critical Zhejiang University ZJU
Priority to CNB2006100533938A priority Critical patent/CN100416612C/en
Publication of CN1920886A publication Critical patent/CN1920886A/en
Application granted granted Critical
Publication of CN100416612C publication Critical patent/CN100416612C/en
Status: Expired - Fee Related

Landscapes

  • Processing Or Creating Images (AREA)

Abstract

The invention relates to a method for constructing a three-dimensional dynamic facial expression model from a video stream, which recovers three-dimensional facial expressions from the input video stream. The method comprises: (1) marking facial feature points on the first frame of the input video; (2) tracking the feature points with an affine-corrected optical flow method; (3) recovering three-dimensional motion data from the two-dimensional tracking data by a factorization method; (4) adapting a generic face model to the reconstructed three-dimensional data to generate a personalized face and its dynamic expression motion; (5) compressing the original video with the eigenface technique; (6) reconstructing the input video from the eigenfaces and projecting it as dynamic texture to synthesize a realistic virtual appearance. The invention has high temporal and spatial efficiency and high practical value.

Description

Three-dimensional dynamic human face expression model construction method based on video streams
Technical field
The present invention relates to the intersection of computer vision and computer graphics, and in particular to a method for constructing three-dimensional dynamic facial expression models from video streams.
Background art
Personalized face modeling and realistic expression animation have long been challenging problems, with wide application in virtual reality, film production, and entertainment. Since Parke's pioneering work in 1972 [1], research on face and expression modeling has made remarkable progress. According to the input data required, modeling approaches fall into three main classes: modeling from captured three-dimensional sample data; modeling from images; and modeling from video streams. Blanz et al. [2] learn the statistical structure of a three-dimensional face database and build a personalized face model from a single input face image; this requires an expensive laser scanner to acquire the database in advance, the data volume is very large, and the computational complexity is high. Deng et al. [3] capture the motion of a marked real face, extract independent expression parameters, and synthesize expressions; this likewise requires costly motion-capture equipment, and markers must be placed on the performer's face. Documents [4,5,6,7] extract three-dimensional information from images to reconstruct face models. Pighin et al. [4] reconstruct a face model from several photographs, but feature points must be marked manually on every image, and expression generation also requires much manual interaction. Document [5] models the face from standard orthogonal images and drives expressions with muscle vectors; its drawbacks are that the muscle vector positions are hard to set correctly and the orthogonality constraint is too strict, so the method lacks generality. Document [6] models the face from two frontal images; the camera must be calibrated in advance, few feature points are reconstructed, and a face mesh generated merely by interpolating feature points can hardly capture local facial features accurately. Document [7] also uses orthogonal images and obtains the face model by a progressive refinement optimization, with the same drawback of overly strict constraints. Li Zhang et al. [8] use structured light and stereo vision to reconstruct facial expressions from video streams; this requires hardware including a structured-light projector, the scanned models need tedious manual preprocessing, and the method is demanding on ambient illumination. Zicheng Liu et al. [9] propose a highly significant method, namely reconstructing a three-dimensional face model from uncalibrated video streams; the method places few demands on the input data, but corner detection and matching are not robust in themselves and are easily affected by illumination, which may cause reconstruction to fail.
Traditional facial animation methods mainly consider the geometric deformation of the face model [5,6,7,9], mapping texture to model vertices, so that when the mesh deforms the texture stretches and distorts with it; traditional texture mapping can therefore be regarded as static. Yet the human face is a highly non-rigid surface: facial expressions involve not only small geometric deformations of the surface (such as wrinkles) but also changes in skin color and appearance, and it is very difficult to simulate these variations purely through geometric deformation. In this sense, traditional texture mapping is not sufficient to produce facial expressions with a high degree of realism.
[1] Parke F. Computer generated animation of faces. Proceedings of the ACM Annual Conference, Boston, 1972: 451-457.
[2] Blanz V, Vetter T. A morphable model for the synthesis of 3D faces. Proceedings of SIGGRAPH '99, Los Angeles, 1999: 187-194.
[3] Deng Z, Bulut M, Neumann U, Narayanan S. Automatic dynamic expression synthesis for speech animation. Proceedings of IEEE Computer Animation and Social Agents, Geneva, 2004: 267-274.
[4] Pighin F, Hecker J, Lischinski D, Szeliski R, Salesin D H. Synthesizing realistic facial expressions from photographs. Proceedings of SIGGRAPH '98, Orlando, Florida, 1998: 75-84.
[5] Mei L, Bao H J, Peng Q S. Quick customization of particular human face and muscle-driven expression animation. Journal of Computer-Aided Design & Computer Graphics, 2001, 13(12): 1077-1082.
[6] Wang K, Zheng N N. 3D face modeling based on SFM algorithm. Chinese Journal of Computers, 2005, 28(6): 1048-1053.
[7] Su C Y, Zhuang Y T, Huang L, Wu F. Analysis-by-synthesis approach for facial modeling based on orthogonal images. Journal of Zhejiang University (Engineering Science), 2005, 39(2): 175-179.
[8] Zhang L, Snavely N, Curless B, Seitz S. Spacetime faces: high resolution capture for modeling and animation. ACM Transactions on Graphics, 2004, 23(3): 548-558.
[9] Liu Z C, Zhang Z Y, Jacobs C, Cohen M. Rapid modeling of animated faces from video images. ACM Multimedia, Los Angeles, 2000: 475-476.
Summary of the invention
The object of the present invention is to provide a method for constructing three-dimensional dynamic facial expression models from video streams.
The steps of the method are:
1) manually marking the positions of the facial feature points on the first frame of the input uncalibrated monocular video;
2) tracking the feature points marked on the first frame with an affine-corrected optical flow method, determining the position changes of these feature points in every frame of the video sequence;
3) recovering three-dimensional motion data from the two-dimensional tracking data with a factorization-based method;
4) averaging the first 3 frames of the three-dimensional motion data and adapting a generic three-dimensional face model to this mean to produce a personalized three-dimensional face model;
5) driving the personalized three-dimensional face model with the remaining three-dimensional motion data to generate dynamic three-dimensional facial expressions;
6) compressing the input video with an eigenface-based video compression method to reduce storage space;
7) reconstructing the input video from the eigenfaces and, in combination with the two-dimensional tracking data, automatically applying dynamic texture mapping to the dynamic three-dimensional face, generating a realistic three-dimensional facial expression sequence.
The facial feature points: they are defined according to the face definition parameters and facial animation parameters of the MPEG-4 standard. There are 40 of them, distributed along the facial contour, the eyes, the lip edges, and similar positions. They not only reflect the facial topology well but also describe the facial expression motion: while the face holds a neutral expression it can essentially be regarded as a rigid body, and the feature points then define the facial shape features; when the face performs an expression motion, the feature points define the facial animation parameters.
The affine-corrected optical flow method: the accuracy of traditional optical flow tracking is corrected by computing the affine transformation between video frames. Traditional optical flow tracking searches for the offset that minimizes the matching error over a neighborhood of the corresponding feature point: given two adjacent video frames $I_1$ and $I_2$, let the position of a feature point in $I_1$ be $f = (u, v)^T$ and its optical flow be $p = (p_u, p_v)^T$; the position of the corresponding point in $I_2$ is then $f + p$, and $p$ is obtained by minimizing

$$\sum_{f_t \in T} \left( I_2(f_t + p) - I_1(f_t) \right)^2$$

where $T$ is a square region centered at $f$. When the pose of the face or the illumination changes markedly between frames, tracking of the points on the nose, the chin, and the top of the head becomes very poor, while the points at the eye corners, the hairline, the mouth, and the cheeks are still tracked accurately. Let $P_1^a$ and $P_2^a$ therefore denote the accurately tracked feature points in $I_1$ and $I_2$; by assumption they are related by an affine transformation $w$, namely $P_2^a = w P_1^a = A P_1^a + B$. Applying $w$ to the feature points $P_1^{ia}$ of $I_1$ that need correction gives $P_w = w P_1^{ia}$. Let $P_o$ be the traditional optical flow tracking result of $P_1^{ia}$ in $I_2$; the tracking result of these feature points is then corrected as $P = \arg\min \left( |P - P_o|^2 + |P - P_w|^2 \right)$, i.e. $P_w$ is used as a constraint to further optimize $P_o$. A sketch of this correction appears below.
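As an illustration only, here is a minimal Python/NumPy sketch of the correction under stated assumptions: the accurately tracked point pairs and the plain optical-flow results are supplied as arrays, and all function names are ours, not the patent's.

```python
import numpy as np

def estimate_affine(p1_acc, p2_acc):
    """Least-squares affine transform w(x) = A x + B mapping the accurately
    tracked points of frame 1 (p1_acc, shape (k, 2)) onto frame 2 (p2_acc)."""
    k = p1_acc.shape[0]
    G = np.zeros((2 * k, 6))      # each point contributes two equations
    G[0::2, 0:2] = p1_acc
    G[0::2, 4] = 1.0
    G[1::2, 2:4] = p1_acc
    G[1::2, 5] = 1.0
    b = p2_acc.reshape(-1)        # (x1, y1, x2, y2, ...)
    x, *_ = np.linalg.lstsq(G, b, rcond=None)
    A = np.array([[x[0], x[1]], [x[2], x[3]]])
    B = np.array([x[4], x[5]])
    return A, B

def correct_flow(p_o, p1_rest, A, B):
    """Correct the plain optical-flow results P_o of the remaining feature
    points using the affine prediction P_w = w(P_1).  Minimizing
    |P - P_o|^2 + |P - P_w|^2 over P gives the midpoint P = (P_o + P_w)/2."""
    p_w = p1_rest @ A.T + B
    return 0.5 * (p_o + p_w)
```

Note that the stated objective has a closed-form midpoint solution, so the correction costs almost nothing per frame; in practice one could also weight the two terms to trust the affine prediction more under strong illumination change.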
The factorization-based method: the video imaging process is modeled by weak perspective projection. Under this method, a non-rigid shape is regarded as a weighted linear combination of a set of shape bases; the shape bases are a set of basic three-dimensional shapes from which any three-dimensional shape can be composed. Given the tracking data, the feature points in every frame are described by the weak perspective projection model as

$$P_{fn} = (x, y)_{fn}^T = \left[ e_f c_{f1} R_f \; \cdots \; e_f c_{fK} R_f \right] \cdot \left[ S_{1n} \; \cdots \; S_{Kn} \right]^T + t_f, \quad f = 1, \dots, F, \; n = 1, \dots, N$$

where $F$ and $N$ are the numbers of frames and of feature points, $e_f$ is the non-zero weak perspective scale factor, $S_{1n} \dots S_{Kn}$ are the $K$ shape bases, $c_{f1} \dots c_{fK}$ are the combination weights of the shape bases, $t_f$ is the translation, $R_f$ denotes the first two rows of the camera projection matrix of frame $f$, and $P_{fn}$ denotes the $n$-th feature point in frame $f$. Regarding the $x, y$ coordinates of each feature point in each frame as a $2 \times 1$ matrix, all tracking data form a $2F \times N$ matrix $P$ with $P = MS + T$, where $M$ is the generalized camera projection matrix, $S$ stacks the $K$ shape bases, and $T$ is the translation matrix:

$$M = \begin{bmatrix} e_1 c_{11} R_1 & \cdots & e_1 c_{1K} R_1 \\ \vdots & \ddots & \vdots \\ e_F c_{F1} R_F & \cdots & e_F c_{FK} R_F \end{bmatrix}, \qquad S = \begin{bmatrix} S_{11} & \cdots & S_{1N} \\ \vdots & \ddots & \vdots \\ S_{K1} & \cdots & S_{KN} \end{bmatrix}$$

Subtracting the translation matrix yields the canonical form $P = MS$. Singular value decomposition of $P$ gives a rank-$3K$ approximation $\tilde{P} = \tilde{M} \cdot \tilde{S}$, where $K$ can be determined as $\operatorname{rank}(P)/3$. This decomposition is not unique: for any invertible $3K \times 3K$ matrix $A$, $\tilde{P} = \tilde{M} A \cdot A^{-1} \tilde{S}$ also holds. Once $A$ is known, the generalized camera projection matrix and the shape bases can be expressed as $M = \tilde{M} \cdot A$ and $S = A^{-1} \cdot \tilde{S}$. To compute $A$, the orthonormality of the projection matrix is first used as a constraint: let $Q = A A^T$, so that $M M^T = \tilde{M} Q \tilde{M}^T$, and let $\tilde{M}_i$ denote the $i$-th row of $\tilde{M}$, rows $2i-1$ and $2i$ belonging to frame $i$. The orthonormality of the projection matrix then gives the two orthogonality constraints

$$\tilde{M}_{2i-1} Q \tilde{M}_{2i-1}^T = \tilde{M}_{2i} Q \tilde{M}_{2i}^T, \qquad \tilde{M}_{2i-1} Q \tilde{M}_{2i}^T = 0.$$

Next, shape-basis constraints are used to eliminate the ambiguity that the orthogonality constraints leave in some cases. Denote the $k$-th three-column submatrix of $A$ by $a_k$ and let $Q_k = a_k a_k^T$, $k = 1, \dots, K$; according to the independence of the shape bases, another group of shape-basis constraints is set:

$$\tilde{M}_{2i-1} Q_k \tilde{M}_{2j-1}^T = \begin{cases} 1, & (i,j) \in \omega_1 \\ 0, & (i,j) \in \omega_2 \end{cases} \qquad \tilde{M}_{2i} Q_k \tilde{M}_{2j}^T = \begin{cases} 1, & (i,j) \in \omega_1 \\ 0, & (i,j) \in \omega_2 \end{cases}$$

$$\tilde{M}_{2i-1} Q_k \tilde{M}_{2j}^T = 0, \quad \tilde{M}_{2i} Q_k \tilde{M}_{2j-1}^T = 0, \quad (i,j) \in \omega_1 \cup \omega_2$$

$$\omega_1 = \{(i,j) \mid i = j = k\}, \qquad \omega_2 = \{(i,j) \mid i = 1, \dots, K, \; j = 1, \dots, F, \; i \neq k\}$$

Combining these two classes of constraints, $Q$ is solved correctly; $A$ is then obtained from $Q$ by singular value decomposition, and $M$ follows from $M = \tilde{M} \cdot A$. The scale factors $e_1, \dots, e_F$ can be regarded as constants, so the generalized camera projection matrix can be written

$$M = \begin{bmatrix} c_{11}^1 R_1 & \cdots & c_{1K}^1 R_1 \\ \vdots & \ddots & \vdots \\ c_{F1}^1 R_F & \cdots & c_{FK}^1 R_F \end{bmatrix}.$$

Since $R_f = \begin{bmatrix} r_{f1} & r_{f2} & r_{f3} \\ r_{f4} & r_{f5} & r_{f6} \end{bmatrix}$, $f = 1, \dots, F$, consists of the first two rows of the camera rotation matrix, unrolling the two rows of frame $f$ in $M$ gives

$$m_f = \begin{bmatrix} c_{f1}^1 r_{f1} & c_{f1}^1 r_{f2} & c_{f1}^1 r_{f3} & \cdots & c_{fK}^1 r_{f1} & c_{fK}^1 r_{f2} & c_{fK}^1 r_{f3} \\ c_{f1}^1 r_{f4} & c_{f1}^1 r_{f5} & c_{f1}^1 r_{f6} & \cdots & c_{fK}^1 r_{f4} & c_{fK}^1 r_{f5} & c_{fK}^1 r_{f6} \end{bmatrix},$$

and rearranging its elements yields the new matrix

$$m_f^1 = \begin{bmatrix} c_{f1}^1 r_{f1} & c_{f1}^1 r_{f2} & c_{f1}^1 r_{f3} & c_{f1}^1 r_{f4} & c_{f1}^1 r_{f5} & c_{f1}^1 r_{f6} \\ \vdots & & & & & \vdots \\ c_{fK}^1 r_{f1} & c_{fK}^1 r_{f2} & c_{fK}^1 r_{f3} & c_{fK}^1 r_{f4} & c_{fK}^1 r_{f5} & c_{fK}^1 r_{f6} \end{bmatrix},$$

which is the product of the column vector $(c_{f1}^1, \dots, c_{fK}^1)^T$ and the row vector $(r_{f1} \; r_{f2} \; r_{f3} \; r_{f4} \; r_{f5} \; r_{f6})$. The camera projection matrix of every frame and the shape-basis combination weights can therefore be obtained from $m_f^1$ by singular value decomposition, and from them the three-dimensional shape in Euclidean space, which is exactly the three-dimensional coordinates of the feature points. Computing the three-dimensional coordinates of the feature points of every frame in Euclidean space in fact yields a set of three-dimensional motion data.
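For concreteness, the following is a minimal Python/NumPy sketch of the rank-3K factorization step, under the assumption that the 2F×N tracking matrix is given; solving for the corrective matrix A from the constraints above is left out, so the factors are returned in the ambiguous form $\tilde{M}$, $\tilde{S}$.

```python
import numpy as np

def remove_translation(P_raw):
    """Subtract each frame's centroid (the translation t_f, taken as the row
    mean) to obtain the canonical form P = M S."""
    return P_raw - P_raw.mean(axis=1, keepdims=True)

def factorize_tracks(P, K):
    """Rank-3K factorization of the 2F x N canonical tracking matrix P.
    Returns M_tilde (2F x 3K) and S_tilde (3K x N); these are determined only
    up to an invertible 3K x 3K matrix A, which must then be fixed with the
    orthogonality and shape-basis constraints described in the text."""
    U, s, Vt = np.linalg.svd(P, full_matrices=False)
    r = 3 * K
    sqrt_s = np.sqrt(s[:r])
    M_tilde = U[:, :r] * sqrt_s            # split singular values evenly
    S_tilde = sqrt_s[:, None] * Vt[:r, :]
    return M_tilde, S_tilde
```

K itself can be estimated as rank(P)/3, for example by counting the singular values that stand above a noise threshold.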
The generic three-dimensional face model: it contains more than 3000 vertices and is obtained by registering, simplifying, and averaging several real three-dimensional faces acquired by laser scanning, so it can describe the fine structure of the face. The first 3 frames of the three-dimensional motion data are averaged and used as the three-dimensional feature points describing the facial shape, and the same number of feature vertices are designated on the generic three-dimensional face. The offset between the feature vertices and the feature points is denoted $d$; a radial basis function is trained with $d$ and the feature vertices, and the offsets of the remaining vertices are inferred by feeding them into the trained radial basis function, yielding the personalized three-dimensional face model.
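By way of illustration, here is a minimal Python/NumPy sketch of such radial basis function deformation with a Gaussian kernel; interpreting the embodiments' parameter 0.01 as the kernel width is our assumption, and all names are ours.

```python
import numpy as np

def _gaussian_kernel(a, b, sigma):
    """Pairwise Gaussian kernel matrix between point sets a and b (n x 3)."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def fit_rbf(centers, offsets, sigma=0.01):
    """Train the RBF from the feature vertices (centers, N x 3) and their
    offsets d (N x 3); returns the weight matrix of the interpolant."""
    Phi = _gaussian_kernel(centers, centers, sigma)   # N x N
    return np.linalg.solve(Phi, offsets)              # weights, N x 3

def apply_rbf(weights, centers, vertices, sigma=0.01):
    """Infer offsets for all mesh vertices and displace them, producing the
    personalized (or per-frame deformed) face mesh."""
    Phi = _gaussian_kernel(vertices, centers, sigma)  # num x N
    return vertices + Phi @ weights
```

The same pair of calls serves both uses in the text: fitting to the mean of the first 3 frames gives the personalized model, and fitting to per-frame offsets relative to the first frame drives the expression motion.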
The remaining three-dimensional motion data: all of the three-dimensional motion data except the first 3 frames used to define the facial shape; the expression driving of each frame is likewise carried out with the radial basis function.
The eigenface-based video compression method: given a video sequence of $F$ frames with per-frame resolution $R \times C$, all columns of each frame are stacked to convert the frame into an $RC \times 1$ column vector, so that the sequence becomes an $RC \times F$ sample matrix $X$. Let $\bar{X}$ be the sample mean; the normalized samples are $\tilde{X} = (X - \bar{X}) / F^{1/2}$. To cope with the problems brought by the high dimensionality, the eigenvectors are computed by QR decomposition combined with singular value decomposition:

$$[q, r] = \mathrm{QR}(\tilde{X}), \qquad [u, s, v] = \mathrm{SVD}(r), \qquad U = q \cdot u$$

The QR decomposition solves for the eigenvectors of a high-dimensional matrix in a numerically stable way. The eigenvectors $U$ obtained from the three formulas above reflect the statistical regularities contained in the sample space and are called eigenfaces. Projecting any video frame $f$ onto $U$ gives a group of projection coefficients $y = U^T (f - \bar{X})$, and $f$ can be reconstructed from the eigenfaces and these coefficients as $\tilde{f} = U \cdot y + \bar{X}$. For video transmission only the sample mean, the eigenvectors, the per-frame projection coefficients, the generic face model, and the three-dimensional feature point coordinates need to be transmitted, which saves storage space.
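A minimal Python/NumPy sketch of this compression scheme, assuming the frames are already flattened into the columns of a matrix; the function names are illustrative, not the patent's.

```python
import numpy as np

def eigenface_compress(frames, n_eig):
    """frames: RC x F matrix, one flattened frame per column.
    Returns (mean, eigenfaces U, per-frame projection coefficients Y)."""
    F = frames.shape[1]
    mean = frames.mean(axis=1, keepdims=True)
    X_norm = (frames - mean) / np.sqrt(F)
    q, r = np.linalg.qr(X_norm)          # QR first, for numerical stability
    u, s, vt = np.linalg.svd(r)
    U = (q @ u)[:, :n_eig]               # leading eigenfaces
    Y = U.T @ (frames - mean)            # y = U^T (f - mean), per frame
    return mean, U, Y

def eigenface_reconstruct(mean, U, Y):
    """Rebuild the frames: f_tilde = U y + mean."""
    return U @ Y + mean
```

Only mean, U, and Y need to be stored or transmitted, which is where the space saving over the raw RC×F matrix comes from.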
Dynamic texture mapping: the two-dimensional feature point coordinates obtained by tracking in each frame are regarded as the texture coordinates of a predefined set of feature vertices on the three-dimensional face model, so that the facial texture information automatically extracted from each video frame is mapped by interpolation, frame by frame, onto the face model reconstructed for the corresponding frame of the original video.
Dynamic texture mapping is divided into two steps:
1) Global texture mapping:
First make the following definitions:
$T = (u_n\, v_n)^T$: the coordinates of the feature points in each frame, where $n = 1 \dots N$ and $N$ is the number of feature points;
$num$: the number of all vertices of the three-dimensional face model;
$i$: the indices of a set of pre-specified feature vertices of the three-dimensional model, satisfying $i \subset \{1, \dots, num\}$ and $|i| = N$, with $i$ remaining unchanged throughout;
$P = (X[i]\, Y[i]\, Z[i])^T$: the coordinates of the model vertices corresponding to the image feature points in each frame;
During global texture mapping, the correspondence between the feature points and certain vertices of the three-dimensional model is specified at the first frame; every subsequent frame automatically updates $T$ and $P$, trains a radial basis function with $T$ and $P$, and performs the interpolation mapping.
2) Local texture optimization: global texture mapping depends on interactively specified initial feature vertices, and manually specified feature vertices may not be optimal, so an optimization process is needed to find accurate feature vertices (a sketch of this iteration is given after the procedure below).
To describe local texture optimization, make the following definitions:
$f$: a two-dimensional feature point obtained by tracking;
$S$: an initially specified feature vertex;
$f_1$: the two-dimensional feature point obtained from $S$ by weak perspective projection;
$\Delta p$: the error between $f$ and $f_1$;
$I_{input}$: the input video frame;
$I_{project}$: the two-dimensional image obtained by weak perspective projection of the reconstructed textured three-dimensional model;
$T$: the square region of image $I_{input}$ centered at $f$;
Local texture optimization is completed by an iterative process:
Loop
$\Delta p = \arg\min \sum_{f_i \in T} \| I_{input}(f_i) - I_{project}(f_i + \Delta p) \|^2$;
From $\Delta p$, invert the weak perspective projection model to obtain the offset $\Delta S$ of the three-dimensional feature vertex;
Update $S$: $S = S + \Delta S$;
Perform global texture mapping again and update $I_{project}$;
Until the change in $S$ is smaller than a threshold.
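A minimal Python sketch of this loop under stated assumptions: the renderer and the two solvers (render_project, solve_dp, backproject) are hypothetical callables standing in for the weak perspective projection machinery described above.

```python
import numpy as np

def optimize_feature_vertices(S, tracked_f, I_input, render_project,
                              solve_dp, backproject, tol=1e-3, max_iter=50):
    """Local texture optimization loop.  S: initial 3D feature vertices;
    tracked_f: tracked 2D feature points; I_input: the input video frame.
    tol and max_iter are our assumptions, not values from the patent."""
    for _ in range(max_iter):
        I_project = render_project(S)                  # re-render textured model
        dp = solve_dp(I_input, I_project, tracked_f)   # argmin of the patch error
        dS = backproject(dp, S)                        # lift the 2D offset to 3D
        S = S + dS                                     # update feature vertices
        if np.linalg.norm(dS) < tol:                   # S has stabilized
            return S
    return S
```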
The video-stream-based three-dimensional dynamic facial expression modeling method of the present invention is free of constraints from prior knowledge and can reconstruct three-dimensional facial expressions from natural video streams (such as film and television footage). Compared with traditional optical flow tracking, the affine-corrected optical flow tracking method needs no training data, is more robust to gray-level variation, reduces the number of iterations of the optical flow algorithm, and improves its time efficiency. Compared with traditional texture mapping, dynamic texture mapping produces more realistic and natural expression effects. The eigenface technique compresses the video effectively while preserving image quality, reducing the storage occupied by the original video.
Table 1 compares the compression efficiency of the eigenface technique with that of MPEG-2 compression. The number of eigenfaces used for compression is chosen according to the size of the original video, trading compression efficiency against image quality. In Table 1, e_f-5 indicates that 5 eigenfaces were used for compression, and so on. As the table shows, the compression ratio of the MPEG-2 technique stays around 60:1 regardless of the size of the video to be compressed, whereas the compression efficiency of the eigenface technique improves as the volume of the original video grows: a 1000-frame video compresses to 16.64 MB with MPEG-2 but to 14.83 MB with the eigenface technique (15 eigenfaces). In some application scenarios the eigenface technique is therefore close to MPEG-2 in compression efficiency and image quality, while its compression/decompression algorithms are simpler than those of MPEG-2.
Table 1. Compression efficiency of the eigenface technique vs. the MPEG-2 technique (space occupied by the video in each format, MB)

Frames | AVI    | MPEG-2 | e_f-5 | e_f-7 | e_f-10 | e_f-15
100    | 98.89  | 1.66   | 4.94  | —     | —      | —
200    | 197.78 | 3.33   | —     | 6.92  | —      | —
500    | 494.44 | 8.32   | —     | —     | 9.98   | —
1000   | 988.72 | 16.64  | —     | —     | —      | 14.83
The present invention can quickly and effectively recover dynamic three-dimensional facial expressions from uncalibrated monocular video streams. The generated expressions are realistic and natural, maintain high efficiency in both time and space, are more expressive than two-dimensional expressions, and have good practical value in virtual reality, human-computer interaction, entertainment, and film and animation production.
Description of drawings
Fig. 1 is a flowchart of the video-stream-based three-dimensional dynamic facial expression model construction method;
Fig. 2 shows the facial feature points of the present invention;
Fig. 3 shows the salient feature points of the present invention that can be tracked accurately without correction;
Fig. 4 compares the affine-corrected optical flow tracking of the present invention with plain optical flow tracking;
Fig. 5 compares the generic three-dimensional face model of the present invention with the personalized three-dimensional face model; (a)(c) are the front and side views of the generic face, and (b)(d) the front and side views of the personalized face;
Fig. 6 shows the tracked expression video frames and the corresponding expression-deformed three-dimensional face models; (a)(b)(c) are the angry, fearful, and surprised expressions tracked with the affine-corrected optical flow method, and (d)(e)(f) the corresponding model deformations;
Fig. 7 compares the dynamic texture mapping of the present invention with traditional static texture mapping; (a) shows the effect of dynamic texture mapping, and (b) that of static texture mapping;
Fig. 8 compares different video compression methods; (a) is an original video frame, (b) a frame reconstructed with 5 eigenfaces in the present invention, and (c) a frame compressed with MPEG-2;
Fig. 9 shows the final results of the three-dimensional dynamic expression modeling of the present invention; (a)(c)(e) are captured video frame sequences of angry, surprised, and fearful expressions respectively, and (b)(d)(f) the corresponding realistic dynamic three-dimensional expression sequences.
Embodiment
As shown in Fig. 1, the video-stream-based three-dimensional dynamic facial expression model construction method is implemented as follows:
First, the 40 predefined feature points are marked on the first frame of the uncalibrated monocular video; we developed an interactive tool with which the user conveniently marks the feature points on the first frame with the mouse, following prompts.
Second, the feature points are tracked robustly with the affine-corrected optical flow method. In optical flow tracking, the 8 feature points at the two mouth corners, the inner and outer corners of the eyes, and the two temples can be tracked accurately, so we use these 8 points to compute the affine transformation between two frames and use that transformation to optimize the optical flow tracking results of the remaining 32 feature points.
Third, the factorization-based algorithm recovers the three-dimensional coordinates of the feature points, and the generic face is deformed to obtain the personalized face model and the expression effects.
Fourth, we use the mean of the first 3 frames of three-dimensional feature point coordinates as the three-dimensional feature points describing the specific facial shape, and deform the generic face model with these points to obtain the personalized three-dimensional face model. The deformation is based on radial basis functions; the kernel of the radial basis function is a Gaussian whose parameter is set to 0.01.
Fifth, the successive three-dimensional feature point coordinates drive the frame-by-frame deformation of the personalized three-dimensional face model to produce continuous expressions; this deformation is likewise realized with radial basis functions.
Sixth, the eigenface technique compresses the input video to save storage. The number of eigenfaces depends on the number of frames of the input video: when the error between a frame reconstructed with n eigenfaces and the original frame falls below a threshold q, n is an appropriate number of eigenfaces; a sketch of this selection follows.
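Illustration only: a small Python sketch of this threshold rule, reusing the compression sketch above; the error metric, q, and the search cap are our assumptions.

```python
import numpy as np

def choose_eigenface_count(frames, q, n_max=30):
    """Return the smallest n whose mean per-pixel reconstruction error
    falls below the threshold q (falls back to n_max if none does)."""
    for n in range(1, n_max + 1):
        mean, U, Y = eigenface_compress(frames, n)
        err = np.abs(eigenface_reconstruct(mean, U, Y) - frames).mean()
        if err < q:
            return n
    return n_max
```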
Seventh, dynamic texture mapping uses texture variation rather than geometric deformation to simulate the subtle surface changes of the face during expression motion, such as wrinkles and skin color changes. "Dynamic" means the texture is updated at every frame of the three-dimensional animation rather than fixed once at the start. A continuous video stream contains far richer expression detail than a still image, and the reconstructed three-dimensional face corresponds strictly frame by frame to the original video stream, so we extract texture information from the input stream frame by frame and map it onto the three-dimensional face corresponding to that frame. Before dynamic texture mapping, 40 initial three-dimensional feature vertices are designated on the face model according to the 40 feature points; the 40 feature point coordinates obtained during video tracking can be regarded as the texture coordinates of this set of feature vertices. This establishes a set of correspondences from three-dimensional feature vertices to the two-dimensional image. Because the tracking data are known and the face model reconstructed in each frame is topologically invariant, these correspondences are invariant: at each frame the mapping only needs to update the previous frame's values with the current frame's feature point coordinates and three-dimensional feature point coordinates. Once this set of discrete correspondences is established, a dense correspondence between three-dimensional vertices and texture is obtained by radial basis function interpolation, completing the frame-by-frame texture mapping. Whether the pre-specified three-dimensional feature vertices are accurate affects the quality of dynamic texture mapping, so accurate feature vertex coordinates are obtained by an optical-flow-based iterative optimization starting from the initial feature vertices, after which the texture mapping is completed.
We captured three typical facial expressions, namely anger, surprise, and fear, with an uncalibrated hand-held Sony HDV 1080i camera; the video frame resolution is 1920×1080 pixels. After the manual marking in the first step, the remaining steps run automatically. Fig. 2 shows the 40 facial feature points defined by the present invention, and Fig. 3 the 8 accurately tracked feature points used to compute the inter-frame affine transformation. The affine-corrected optical flow tracking algorithm needs no training data and remains effective for horizontal/vertical rotations of up to 30°. The first row of Fig. 4 shows tracking with the affine-corrected optical flow method, and the second row plain optical-flow-based tracking. It is easy to see that plain optical flow makes mistakes when tracking the nose, chin, and crown points, while affine-corrected optical flow tracking solves this problem well and is more accurate than traditional optical flow tracking. During video capture we asked the performer to first hold a neutral expression and then perform anger, surprise, and fear in turn; each expression comprises a dynamic progression from neutral to peak amplitude. Since the face is neutral in the first 3 frames, the three-dimensional feature point coordinates there describe the shape features of the face; we average the feature point coordinates of the first 3 frames and deform the generic face model with this mean to obtain the personalized face model. Fig. 5 compares the generic and personalized three-dimensional face models: (a)(c) are the front and side views of the generic face, and (b)(d) of the personalized face. When the face performs expression motion, the reconstructed three-dimensional feature points drive the personalized face model well, producing the expression effects. We drive the model with radial-basis-function interpolation; when training the radial basis function we do not use the reconstructed three-dimensional feature point coordinates directly, but the offset of each frame's three-dimensional feature points relative to those of the first frame. Having obtained the offsets of the designated vertices, the radial basis function yields the offsets of the remaining vertices, and the radial-basis-function driving is carried out frame by frame. Fig. 6 shows the tracked expression video frames and the corresponding expression-deformed three-dimensional face models: (a)(b)(c) are the three typical expressions (anger, fear, surprise) tracked with the affine-corrected optical flow method, and (d)(e)(f) the corresponding model deformations. Compared with static texture mapping, the dynamic texture mapping method of the present invention provides a more natural appearance. Comparing Fig. 7(a) with Fig. 7(b), with dynamic texture mapping distinct wrinkles appear on the bridge of the nose, the chin, and both sides of the nose wings, expression details that static texture mapping cannot convey. Applying the eigenface-based compression algorithm to the original video sequence, we found that for a sequence of about 100 frames only 5 eigenfaces suffice to reconstruct every frame well, with very little loss of image quality. Applying the eigenface and MPEG-2 techniques to video compression, the image quality comparison is shown in Fig. 8: (a) is an original video frame, (b) a frame reconstructed with 5 eigenfaces, and (c) a frame compressed with MPEG-2. In image quality, eigenface-based video compression is very close to MPEG-2.
For the captured angry, surprised, and fearful expressions, we carried out expression modeling separately.
Embodiment 1
Modeling of the angry expression:
Step 1: the input video has 100 frames; the 40 predefined feature points are marked on the first frame of the uncalibrated monocular video, as shown in Fig. 2;
Step 2: the feature points are tracked robustly with the affine-corrected optical flow method; the 8 feature points at the two mouth corners, the inner and outer corners of the eyes, and the two temples are used to compute the affine transformation between two frames, which is used to optimize the optical flow tracking results of the remaining 32 feature points;
Step 3: the factorization-based algorithm recovers the three-dimensional coordinates of the feature points, and the generic face is deformed to obtain the personalized face model and the expression effects;
Step 4: the mean of the first 3 frames of three-dimensional feature point coordinates serves as the three-dimensional feature points describing the specific facial shape, and the generic face model is deformed with a radial basis function to obtain the personalized three-dimensional face model; the kernel of the radial basis function is a Gaussian whose parameter is set to 0.01;
Step 5: the successive three-dimensional feature point coordinates drive the frame-by-frame deformation of the personalized three-dimensional face model to produce continuous expressions, likewise realized with radial basis functions;
Step 6: the input video is compressed with 5 eigenfaces;
Step 7: the original input video is reconstructed frame by frame from the eigenface representation, and dynamic texture mapping maps each reconstructed video frame onto the corresponding expression-deformed three-dimensional face model, producing a realistic angry expression sequence.
From the 100-frame video this example reconstructs a 100-frame dynamic three-dimensional angry expression sequence; the wrinkles on the facial surface are clearly visible and very vivid, with rich expressiveness, usable for film and animation production and game development.
Embodiment 2
Modeling of the surprised expression:
Step 1: the input video has 80 frames; the 40 predefined feature points are marked on the first frame of the uncalibrated monocular video;
Step 2: the feature points are tracked robustly with the affine-corrected optical flow method; the 8 feature points at the two mouth corners, the inner and outer corners of the eyes, and the two temples are used to compute the affine transformation between two frames, which is used to optimize the optical flow tracking results of the remaining 32 feature points;
Step 3: the factorization-based algorithm recovers the three-dimensional coordinates of the feature points, and the generic face is deformed to obtain the personalized face model and the expression effects;
Step 4: the mean of the first 3 frames of three-dimensional feature point coordinates serves as the three-dimensional feature points describing the specific facial shape, and the generic face model is deformed with a radial basis function to obtain the personalized three-dimensional face model; the kernel of the radial basis function is a Gaussian whose parameter is set to 0.05;
Step 5: the successive three-dimensional feature point coordinates drive the frame-by-frame deformation of the personalized three-dimensional face model to produce continuous expressions, likewise realized with radial basis functions;
Step 6: the input video is compressed with 5 eigenfaces;
Step 7: the original input video is reconstructed frame by frame from the eigenface representation, and dynamic texture mapping maps each reconstructed video frame onto the corresponding expression-deformed three-dimensional face model, producing a realistic surprised expression sequence.
From the 80-frame video this example reconstructs an 80-frame dynamic three-dimensional surprised expression sequence; the lighting effects on the facial surface are fairly pronounced, and the surprised expression is quite vivid, usable for film and animation production and game development.
Embodiment 3
Modeling of the fearful expression:
Step 1: the input video has 100 frames; the 40 predefined feature points are marked on the first frame of the uncalibrated monocular video;
Step 2: the feature points are tracked robustly with the affine-corrected optical flow method; the 8 feature points at the two mouth corners, the inner and outer corners of the eyes, and the two temples are used to compute the affine transformation between two frames, which is used to optimize the optical flow tracking results of the remaining 32 feature points;
Step 3: the factorization-based algorithm recovers the three-dimensional coordinates of the feature points, and the generic face is deformed to obtain the personalized face model and the expression effects;
Step 4: the mean of the first 3 frames of three-dimensional feature point coordinates serves as the three-dimensional feature points describing the specific facial shape, and the generic face model is deformed with a radial basis function to obtain the personalized three-dimensional face model; the kernel of the radial basis function is a Gaussian whose parameter is set to 0.03;
Step 5: the successive three-dimensional feature point coordinates drive the frame-by-frame deformation of the personalized three-dimensional face model to produce continuous expressions, likewise realized with radial basis functions;
Step 6: the input video is compressed with 5 eigenfaces;
Step 7: the original input video is reconstructed frame by frame from the eigenface representation, and dynamic texture mapping maps each reconstructed video frame onto the corresponding expression-deformed three-dimensional face model, producing a realistic fearful expression sequence.
From the 100-frame video this example reconstructs a 100-frame dynamic three-dimensional fearful expression sequence; the expression details are quite vivid and fully convey the character's inner tension, usable for film and animation production, game development, and human-computer interaction.
The final results are shown in Fig. 9, the schematic of the final three-dimensional dynamic expression modeling: (a)(c)(e) are captured video frame sequences of angry, surprised, and fearful expressions respectively, and (b)(d)(f) the corresponding realistic dynamic three-dimensional expression sequences. For a 100-frame video sequence, the whole reconstruction takes about 7-8 minutes on a Pentium IV 2.4 GHz computer. The present invention places no special restrictions on the input video; it not only produces three-dimensional facial expression sequences with considerable realism but also maintains high performance in both time and space. In the present digital era, with digital video, digital communication, and digital libraries emerging constantly, this method, which creates character expressions for virtual environments from video material, follows the trend of the times and has broad application prospects, with especially high practical value in human-computer interaction, animation production, and entertainment.

Claims (8)

1. A three-dimensional dynamic human face expression model construction method based on video streams, characterized in that the steps of the method are:
1) manually marking the positions of the facial feature points on the first frame of the input uncalibrated monocular video;
2) tracking the feature points marked on the first frame with an affine-corrected optical flow method, determining the position changes of these feature points in every frame of the video sequence;
3) recovering three-dimensional motion data from the two-dimensional tracking data with a factorization-based method;
4) averaging the first 3 frames of the three-dimensional motion data and adapting a generic three-dimensional face model to this mean to produce a personalized three-dimensional face model;
5) driving the personalized three-dimensional face model with the remaining three-dimensional motion data to generate dynamic three-dimensional facial expressions;
6) compressing the input video with an eigenface-based video compression method to reduce storage space;
7) reconstructing the input video from the eigenfaces and, in combination with the two-dimensional tracking data, automatically applying dynamic texture mapping to the dynamic three-dimensional face, generating a realistic three-dimensional facial expression sequence.
2. The three-dimensional dynamic human face expression model construction method based on video streams according to claim 1, characterized in that the facial feature points are defined according to the face definition parameters and facial animation parameters of the MPEG-4 standard; there are 40 of them, distributed along the facial contour, the eyes, the lip edges, and similar positions; they not only reflect the facial topology well but also describe the facial expression motion: while the face holds a neutral expression it can essentially be regarded as a rigid body, and the feature points then define the facial shape features; when the face performs an expression motion, the feature points define the facial animation parameters.
3. The three-dimensional dynamic human face expression model construction method based on video streams according to claim 1, characterized in that the affine-corrected optical flow method is as follows: the accuracy of traditional optical flow tracking is corrected by computing the affine transformation between video frames; traditional optical flow tracking searches for the offset that minimizes the matching error over a neighborhood of the corresponding feature point: given two adjacent video frames $I_1$ and $I_2$, let the position of a feature point in $I_1$ be $f = (u, v)^T$ and its optical flow be $p = (p_u, p_v)^T$; the position of the corresponding point in $I_2$ is then $f + p$, and $p$ is obtained by minimizing $\sum_{f_t \in T} (I_2(f_t + p) - I_1(f_t))^2$, where $T$ is a square region centered at $f$; when the pose of the face or the illumination changes markedly, tracking of the points on the nose, the chin, and the top of the head becomes very poor, while the points at the eye corners, the hairline, the mouth, and the cheeks are still tracked accurately; let $P_1^a$ and $P_2^a$ therefore denote the accurately tracked feature points in $I_1$ and $I_2$; by assumption they are related by an affine transformation $w$, namely $P_2^a = w P_1^a = A P_1^a + B$; applying $w$ to the feature points $P_1^{ia}$ of $I_1$ that need correction gives $P_w = w P_1^{ia}$; letting $P_o$ be the traditional optical flow tracking result of $P_1^{ia}$ in $I_2$, the tracking result of these feature points is corrected as $P = \arg\min (|P - P_o|^2 + |P - P_w|^2)$, i.e. $P_w$ is used as a constraint to further optimize $P_o$.
4. The three-dimensional dynamic human face expression model construction method based on video streams according to claim 1, characterized in that the factorization-based method is as follows: the video imaging process is modeled by weak perspective projection; a non-rigid shape is regarded as a weighted linear combination of a set of shape bases, the shape bases being a set of basic three-dimensional shapes from which any three-dimensional shape can be composed; given the tracking data, the feature points in every frame are described by the weak perspective projection model as

$$P_{fn} = (x, y)_{fn}^T = \left[ e_f c_{f1} R_f \; \cdots \; e_f c_{fK} R_f \right] \cdot \left[ S_{1n} \; \cdots \; S_{Kn} \right]^T + t_f, \quad f = 1, \dots, F, \; n = 1, \dots, N$$

where $F$ and $N$ are the numbers of frames and of feature points, $e_f$ is the non-zero weak perspective scale factor, $S_{1n} \dots S_{Kn}$ are the $K$ shape bases, $c_{f1} \dots c_{fK}$ are the combination weights of the shape bases, $t_f$ is the translation, $R_f$ denotes the first two rows of the camera projection matrix of frame $f$, and $P_{fn}$ denotes the $n$-th feature point in frame $f$; regarding the $x, y$ coordinates of each feature point in each frame as a $2 \times 1$ matrix, all tracking data form a $2F \times N$ matrix $P$ with $P = MS + T$, where $M$ is the generalized camera projection matrix, $S$ stacks the $K$ shape bases, and $T$ is the translation matrix:

$$M = \begin{bmatrix} e_1 c_{11} R_1 & \cdots & e_1 c_{1K} R_1 \\ \vdots & \ddots & \vdots \\ e_F c_{F1} R_F & \cdots & e_F c_{FK} R_F \end{bmatrix}, \qquad S = \begin{bmatrix} S_{11} & \cdots & S_{1N} \\ \vdots & \ddots & \vdots \\ S_{K1} & \cdots & S_{KN} \end{bmatrix}$$

subtracting the translation matrix yields the canonical form $P = MS$; singular value decomposition of $P$ gives a rank-$3K$ approximation $\tilde{P} = \tilde{M} \cdot \tilde{S}$, where $K$ can be determined as $\operatorname{rank}(P)/3$; this decomposition is not unique: for any invertible $3K \times 3K$ matrix $A$, $\tilde{P} = \tilde{M} A \cdot A^{-1} \tilde{S}$ also holds; once $A$ is known, the generalized camera projection matrix and the shape bases can be expressed as $M = \tilde{M} \cdot A$ and $S = A^{-1} \cdot \tilde{S}$; to compute $A$, the orthonormality of the projection matrix is first used as a constraint: let $Q = A A^T$, so that $M M^T = \tilde{M} Q \tilde{M}^T$, and let $\tilde{M}_i$ denote the $i$-th row of $\tilde{M}$, rows $2i-1$ and $2i$ belonging to frame $i$; the orthonormality of the projection matrix then gives the two orthogonality constraints

$$\tilde{M}_{2i-1} Q \tilde{M}_{2i-1}^T = \tilde{M}_{2i} Q \tilde{M}_{2i}^T, \qquad \tilde{M}_{2i-1} Q \tilde{M}_{2i}^T = 0;$$

next, shape-basis constraints are used to eliminate the ambiguity that the orthogonality constraints leave in some cases; denote the $k$-th three-column submatrix of $A$ by $a_k$ and let $Q_k = a_k a_k^T$, $k = 1, \dots, K$; according to the independence of the shape bases, another group of shape-basis constraints is set:

$$\tilde{M}_{2i-1} Q_k \tilde{M}_{2j-1}^T = \begin{cases} 1, & (i,j) \in \omega_1 \\ 0, & (i,j) \in \omega_2 \end{cases} \qquad \tilde{M}_{2i} Q_k \tilde{M}_{2j}^T = \begin{cases} 1, & (i,j) \in \omega_1 \\ 0, & (i,j) \in \omega_2 \end{cases}$$

$$\tilde{M}_{2i-1} Q_k \tilde{M}_{2j}^T = 0, \quad \tilde{M}_{2i} Q_k \tilde{M}_{2j-1}^T = 0, \quad (i,j) \in \omega_1 \cup \omega_2$$

$$\omega_1 = \{(i,j) \mid i = j = k\}, \qquad \omega_2 = \{(i,j) \mid i = 1, \dots, K, \; j = 1, \dots, F, \; i \neq k\}$$

combining these two classes of constraints, $Q$ is solved correctly; $A$ is then obtained from $Q$ by singular value decomposition, and $M$ follows from $M = \tilde{M} \cdot A$; the scale factors $e_1, \dots, e_F$ can be regarded as constants, so the generalized camera projection matrix can be written

$$M = \begin{bmatrix} c_{11}^1 R_1 & \cdots & c_{1K}^1 R_1 \\ \vdots & \ddots & \vdots \\ c_{F1}^1 R_F & \cdots & c_{FK}^1 R_F \end{bmatrix};$$

since $R_f = \begin{bmatrix} r_{f1} & r_{f2} & r_{f3} \\ r_{f4} & r_{f5} & r_{f6} \end{bmatrix}$, $f = 1, \dots, F$, consists of the first two rows of the camera rotation matrix, unrolling the two rows of frame $f$ in $M$ gives

$$m_f = \begin{bmatrix} c_{f1}^1 r_{f1} & c_{f1}^1 r_{f2} & c_{f1}^1 r_{f3} & \cdots & c_{fK}^1 r_{f1} & c_{fK}^1 r_{f2} & c_{fK}^1 r_{f3} \\ c_{f1}^1 r_{f4} & c_{f1}^1 r_{f5} & c_{f1}^1 r_{f6} & \cdots & c_{fK}^1 r_{f4} & c_{fK}^1 r_{f5} & c_{fK}^1 r_{f6} \end{bmatrix},$$

and rearranging its elements yields the new matrix

$$m_f^1 = \begin{bmatrix} c_{f1}^1 r_{f1} & c_{f1}^1 r_{f2} & c_{f1}^1 r_{f3} & c_{f1}^1 r_{f4} & c_{f1}^1 r_{f5} & c_{f1}^1 r_{f6} \\ \vdots & & & & & \vdots \\ c_{fK}^1 r_{f1} & c_{fK}^1 r_{f2} & c_{fK}^1 r_{f3} & c_{fK}^1 r_{f4} & c_{fK}^1 r_{f5} & c_{fK}^1 r_{f6} \end{bmatrix},$$

which is the product of the column vector $(c_{f1}^1, \dots, c_{fK}^1)^T$ and the row vector $(r_{f1} \; r_{f2} \; r_{f3} \; r_{f4} \; r_{f5} \; r_{f6})$; the camera projection matrix of every frame and the shape-basis combination weights can therefore be obtained from $m_f^1$ by singular value decomposition, and from them the three-dimensional shape in Euclidean space, which is exactly the three-dimensional coordinates of the feature points; computing the three-dimensional coordinates of the feature points of every frame in Euclidean space in fact yields a set of three-dimensional motion data.
5. The three-dimensional dynamic human face expression model construction method based on video streams according to claim 1, characterized in that the generic three-dimensional face model contains more than 3000 vertices and is obtained by registering, simplifying, and averaging several real three-dimensional faces acquired by laser scanning, so that it can describe the fine structure of the face; the first 3 frames of the three-dimensional motion data are averaged and used as the three-dimensional feature points describing the facial shape, and the same number of feature vertices are designated on the generic three-dimensional face; the offset between the feature vertices and the feature points is denoted $d$; a radial basis function is trained with $d$ and the feature vertices, and the offsets of the remaining vertices are inferred by feeding them into the trained radial basis function, thereby obtaining the personalized three-dimensional face model.
6. The three-dimensional dynamic human face expression model construction method based on video streams according to claim 1, characterized in that the remaining three-dimensional motion data are all of the three-dimensional motion data except the first 3 frames used to define the facial shape, and the expression driving of each frame is likewise carried out with the radial basis function.
7. The three-dimensional dynamic human face expression model construction method based on video streams according to claim 1, characterized in that the eigenface-based video compression method is as follows: given a video sequence of $F$ frames with per-frame resolution $R \times C$, all columns of each frame are stacked to convert the frame into an $RC \times 1$ column vector, so that the sequence becomes an $RC \times F$ sample matrix $X$; let $\bar{X}$ be the sample mean, so the normalized samples are $\tilde{X} = (X - \bar{X}) / F^{1/2}$; to cope with the problems brought by the high dimensionality, the eigenvectors are computed by QR decomposition combined with singular value decomposition:

$$[q, r] = \mathrm{QR}(\tilde{X}), \qquad [u, s, v] = \mathrm{SVD}(r), \qquad U = q \cdot u$$

the QR decomposition solves for the eigenvectors of a high-dimensional matrix in a numerically stable way; the eigenvectors $U$ obtained from the three formulas above reflect the statistical regularities contained in the sample space and are called eigenfaces; projecting any video frame $f$ onto $U$ gives a group of projection coefficients $y = U^T (f - \bar{X})$, and $f$ can be reconstructed from the eigenfaces and these coefficients as $\tilde{f} = U \cdot y + \bar{X}$; for video transmission only the sample mean, the eigenvectors, the per-frame projection coefficients, the generic face model, and the three-dimensional feature point coordinates need to be transmitted, which saves storage space.
8. a kind of three-dimensional dynamic human face expression model construction method according to claim 1 based on video flowing, it is characterized in that described dynamic texture mapping: regard every frame two dimensional character point position coordinates that tracking obtains the texture coordinate of a predefined stack features summit on the three-dimensional face model as, thereby map to the faceform that reconstruct corresponding frame by frame from original video with each frame of video by people's face texture information that interpolation will extract automatically;
The dynamic texture mapping is divided into two steps:
1) overall texture:
At first make as giving a definition:
T=(u nv n) T: characteristic point coordinates in every frame, n=1...N wherein, N is the number of unique point;
Num: the number on all summits in the three-dimensional face model;
I: the index on the three-dimensional model feature summit of a series of prior appointments, i satisfy i| (i  1 ..., num}) ∩ (| i|=N) } and in whole process i remain unchanged;
P=(X[i] Y[i] Z[i]) T: in every frame three-dimensional model with image characteristic point characteristic of correspondence apex coordinate;
When carrying out overall texture, the corresponding relation on first frame specific characteristic point and some three-dimensional model summits, the every frame thereafter upgrade T and P automatically and carry out interpolation with T and P training radial basis function and shine upon;
2) local grain optimization: overall texture depends on mutual appointment initial characteristics summit, and manual characteristic specified summit may not be optimum, therefore needs the process of an optimization to find feature summit accurately;
To describe the local texture optimization, make the following definitions:
f: a two-dimensional feature point obtained by tracking;
S: the initially specified feature vertex;
$f_1$: the two-dimensional point obtained from S by weak perspective projection;
$\Delta p$: the error between f and $f_1$;
$I_{input}$: the input video frame;
$I_{project}$: the two-dimensional image obtained by projecting the reconstructed, textured three-dimensional model through weak perspective projection;
T: the square region of image $I_{input}$ centered at f;
The local texture optimization is accomplished by an iterative process (see the second sketch after this claim):
Loop
$\Delta p = \arg\min_{\Delta p} \sum_{f_i \in T} \left\| I_{input}(f_i) - I_{project}(f_i + \Delta p) \right\|^2$;
Starting from $\Delta p$, recover the offset $\Delta S$ of the three-dimensional feature vertex by inverting the weak perspective projection model;
Update S, setting $S = S + \Delta S$;
Perform global texture mapping again and update $I_{project}$;
Until the change in S is smaller than a given threshold.
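To make step 1) concrete, here is a minimal sketch of the RBF-based global texture mapping referenced above, assuming a Gaussian kernel; the names, kernel width, and regularization term are illustrative assumptions:

```python
import numpy as np

def gaussian_kernel(a, b, sigma=30.0):
    """Pairwise Gaussian RBF kernel between two point sets."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    return np.exp(-d**2 / (2 * sigma**2))

def global_texture_coords(P, T, vertices, sigma=30.0):
    """P: (N, 3) feature vertices of the current frame; T: (N, 2) tracked 2-D
    feature points used as their texture coordinates; vertices: (num, 3) all
    model vertices.

    Trains an RBF on the (P, T) pairs and interpolates texture coordinates
    for every vertex; since T and P are updated each frame, the uv mapping
    follows the expression.
    """
    phi = gaussian_kernel(P, P, sigma)
    W = np.linalg.solve(phi + 1e-8 * np.eye(len(P)), T)   # fit: phi @ W = T
    return gaussian_kernel(vertices, P, sigma) @ W         # (num, 2) uv per vertex
```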
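And a sketch of the iterative loop in step 2); the `render` and `backproject` callables stand in for the patent's global texture mapping and weak perspective projection steps, which are defined elsewhere, and the window size, search radius, and stopping threshold are illustrative assumptions:

```python
import numpy as np

def refine_feature_vertex(I_input, f, S, render, backproject,
                          half=8, search=2, max_iter=20, tol=0.5):
    """Refine one feature vertex S so the textured model's projection
    matches the input frame around the tracked point f = (x, y).
    """
    # square window T of the claim: pixel coordinates centered at f
    xs, ys = np.meshgrid(np.arange(-half, half + 1), np.arange(-half, half + 1))
    win = np.stack([f[0] + xs.ravel(), f[1] + ys.ravel()], axis=1).astype(int)
    for _ in range(max_iter):
        I_project = render(S)                  # re-textured model projected to 2-D
        best_err, dp = np.inf, np.zeros(2)
        for dx in range(-search, search + 1):  # dp = argmin sum ||I_input - I_project||^2
            for dy in range(-search, search + 1):
                sh = win + [dx, dy]
                err = np.sum((I_input[win[:, 1], win[:, 0]].astype(float) -
                              I_project[sh[:, 1], sh[:, 0]].astype(float)) ** 2)
                if err < best_err:
                    best_err, dp = err, np.array([dx, dy], float)
        dS = backproject(dp)                   # lift dp to a 3-D offset dS
        S = S + dS                             # update the feature vertex
        if np.linalg.norm(dS) < tol:           # stop once S barely changes
            break
    return S
```

The grid search over offsets is one simple way to realize the argmin; any image-alignment scheme that minimizes the same SSD objective would serve.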
CNB2006100533938A 2006-09-14 2006-09-14 Video flow based three-dimensional dynamic human face expression model construction method Expired - Fee Related CN100416612C (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CNB2006100533938A CN100416612C (en) 2006-09-14 2006-09-14 Video flow based three-dimensional dynamic human face expression model construction method

Publications (2)

Publication Number Publication Date
CN1920886A 2007-02-28
CN100416612C CN100416612C (en) 2008-09-03

Family

ID=37778605

Family Applications (1)

Application Number Title Priority Date Filing Date
CNB2006100533938A Expired - Fee Related CN100416612C (en) 2006-09-14 2006-09-14 Video flow based three-dimensional dynamic human face expression model construction method

Country Status (1)

Country Link
CN (1) CN100416612C (en)


Legal Events

C06 / PB01: Publication
C10 / SE01: Entry into force of request for substantive examination
C14 / GR01: Grant of patent or utility model
C17 / CF01: Termination of patent right due to non-payment of annual fee
Granted publication date: 2008-09-03
Termination date: 2012-09-14