CN1952850A - Three-dimensional face cartoon method driven by voice based on dynamic elementary access - Google Patents

Three-dimensional face cartoon method driven by voice based on dynamic elementary access Download PDF

Info

Publication number
CN1952850A
CN1952850A CN200510086646 CN200510086646A
Authority
CN
China
Prior art keywords
primitive
voice
motion
face
audio
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN 200510086646
Other languages
Chinese (zh)
Inventor
陶建华 (Tao Jianhua)
尹潘嵘 (Yin Panrong)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Automation, Chinese Academy of Sciences
Original Assignee
Institute of Automation, Chinese Academy of Sciences
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Automation, Chinese Academy of Sciences
Priority to CN 200510086646 priority Critical patent/CN1952850A/en
Publication of CN1952850A publication Critical patent/CN1952850A/en
Pending legal-status Critical Current

Landscapes

  • Processing Or Creating Images (AREA)

Abstract

This invention discloses a voice-driven face animation method based on dynamic primitive selection, which comprises the following steps: capturing a multi-modal database with the system's real-time motion capture equipment and performing audio-video analysis on the multi-modal data to obtain the corresponding feature vectors; performing synchronous segmentation with the phoneme as the primitive unit; computing the audio matching error of each primitive and the visual matching error between adjacent primitives; and finally dynamically selecting an optimal path among the candidate primitives.

Description

Voice-driven three-dimensional face animation method based on dynamic primitive selection
Technical field
The present invention relates to the field of multi-modal human-computer interaction with voice-driven face animation, and in particular to a voice-driven three-dimensional face animation method based on dynamic primitive selection.
Background art
Talking-head interfaces have promising applications in services for the deaf, information dissemination, film and television entertainment, animated games, human-computer interaction, and many other areas, but the technology has not yet been widely deployed. The main bottlenecks are that real-time synchronization of speech with the face and the realism and intelligibility of the face animation itself do not yet meet requirements. The reason is that users are extremely familiar with faces and their motion and are very sensitive to real-time synchronization errors, while facial motion is a non-rigid process containing a large amount of detailed variation and is therefore very complex. The key to voice-driven face animation is therefore to study the synchronization relationship between speech and lip motion and methods for synthesizing realistic face animation.
Following the formulation in "Picture My Voice", any visual speech synthesis system can be divided into four steps: build an audio-visual multi-modal database; define the audio and video feature representations; find a method to describe the association between the two sets of features; and perform visual synthesis from a speech sequence.
To date there is still no unified, general-purpose multi-modal database in the field of visual speech synthesis, so building a database is the basis of all this work. Many research groups record the multi-modal databases required for their experiments themselves: some record continuous two-dimensional video and extract regions of interest; some capture static visemes with a three-dimensional laser scanner; and others recover the three-dimensional coordinates of feature points from video with 2D-to-3D reconstruction techniques.
Accurately extracting the audio and video feature vectors is usually another difficulty in building such a system. For audio, many researchers classify and analyze textual linguistic units, but the workload is large and the degree of manual intervention is high, so extracting the acoustic and prosodic features of the speech signal has become the main approach; how the choice of features affects lip shape, however, still requires further analysis. For video, dynamic visemes, optical flow, three-dimensional coordinates, and distance components have all been extracted as video features.
Real-time synchronized control of speech and facial motion is the bottleneck of Talking Head synthesis: the mapping between speech and facial motion is a complex many-to-many relation, and people are highly sensitive to audio-visual synchronization, which makes voice-driven face animation a challenging research topic. Current voice-driven approaches fall into two classes: those based on speech recognition and those that are not. The first class segments speech into linguistic units such as phonemes, visemes, and syllables, maps these units directly to lip postures, and then synthesizes the animation by concatenation. Yamamoto E. trained hidden Markov models (HMMs) to recognize phonemes, mapped them directly to the corresponding lip shapes, and obtained lip motion sequences after smoothing; this approach depends heavily on the text and language and does not consider contextual information. The second class bypasses speech units and uses statistical learning instead: on the basis of large amounts of data it performs statistical analysis of the bimodal data, or uses neural networks or linear prediction functions to learn the mapping between the speech signal and the control parameters, and then drives the lip motion directly. Neural-network-based methods attempt to find a real-time nonlinear mapping between speech and face animation; D. W. Massaro et al. trained multi-layer perceptrons to learn the mapping from LPC parameters to face animation parameters, modelling speech context by considering the five analysis frames before and after the current frame. Neural network methods, however, are prone to over-training and still require a stage of manual, experience-based control.
At present, computer-generated face animation is usually based on mesh models consisting of geometric meshes and textures, an approach that is widely adopted. Since parametric face synthesis often has difficulty producing ideal three-dimensional expressions, many researchers have recently turned to direct data-driven methods. To date, data-driven approaches have mainly taken three forms: image sequence concatenation, key-frame morphing, and combination of facial components. Parametric face models, however, still have the advantages of a small data volume and easy portability.
Summary of the invention
The present invention relates to a voice-driven three-dimensional face animation method based on dynamic primitive selection that can be applied to multimedia human-computer interaction systems. It uses real-time motion capture equipment to build a multi-modal database; uses speech analysis and motion analysis techniques to extract the audio and video features; segments the multi-modal data synchronously with the phoneme as the primitive unit; and, for a speech sequence given by the user, computes the audio matching error of each primitive and the visual matching error between adjacent primitives, dynamically selects an optimal path among the candidate primitives, and outputs a face animation parameter sequence synchronized with the speech sequence to drive a three-dimensional face animation model, thereby converting speech in any language from any user into synchronized speech and three-dimensional face animation output.
To achieve the above object, the invention provides a voice-driven face animation method based on dynamic primitive selection, comprising the steps of:
A. creating a multi-modal database with a real-time motion capture system;
B. performing audio-video analysis on the multi-modal data to obtain the corresponding feature vectors;
C. segmenting the multi-modal data synchronously with the phoneme as the primitive unit;
D. using the dynamic primitive selection method to output a face animation parameter sequence synchronized with the user's input speech.
With the method provided by the invention, the system is easy to implement, the output animation sequences maintain good validity and naturalness, and the method is suitable for multi-user and multi-language voice driving. The present invention can be applied to multimedia human-computer interaction systems.
Description of drawings
Fig. 1 is a block diagram of the voice-driven face animation system.
Fig. 2 shows the selection of facial feature points on the experimental subject and a captured sample.
Fig. 3 shows the facial feature point groups and FAP (face animation parameter) units defined by MPEG-4.
Fig. 4 is a schematic diagram of dynamic primitive selection.
Fig. 5 compares the synthesized face animation parameter sequences (FAP 51 and FAP 52) with the recorded parameter sequences.
Fig. 6 shows some frames from an example of voice-driven three-dimensional face animation.
Embodiment
1. Voice-driven face animation system framework (Fig. 1)
The system is implemented in two stages: a training stage and an animation stage.
The training stage consists of four parts: the speech-video multi-modal database, video feature processing, audio feature processing, and storage of the speech-video primitive segmentation. Video feature processing comprises three parts: pose normalization, rigid motion compensation, and FAP parameter extraction. Audio feature processing comprises an acoustic parameter extraction part.
The animation stage comprises five parts: acoustic parameter extraction, primitive selection, FAP parameter sequence synthesis, smoothing, and driving of the face animation model.
2. Creating a multi-modal database with a real-time motion capture system (Fig. 2, Fig. 3)
a) A multi-modal database is first created with a real-time three-dimensional motion capture system. As shown in Fig. 3, 50 feature points consistent with the MPEG-4 face animation standard are attached to the subject's face; Fig. 3(a) shows the basic FAP units defined for the face, and Fig. 3(b) shows the defined group of facial feature points. As shown in Fig. 2, 5 of the markers are attached to a hair band on the head and are used to compensate for rigid head motion.
b) The corpus consists of a training set of 286 sentences, each recorded 5 times (4 recordings used as training samples and 1 as a validation sample), and a test set of 9 sentences.
c) The real-time motion capture system and a video camera record the facial motion and the speech synchronously. The three-dimensional coordinates of the feature points are stored in a .trb file, and the speech is stored in the video .avi file.
3. Performing audio-video analysis on the multi-modal data to obtain the corresponding feature vectors
To perform audio-video analysis on the multi-modal data, the audio and video streams must be separated and the corresponding feature vectors extracted from each.
a) Because the subject's head and body inevitably sway during the experiment, the feature point motion data obtained by the motion capture system, d = {x_1, y_1, z_1, ..., x_n, y_n, z_n}^T ∈ R^(3n), n = 45, must first be compensated for rigid motion in order to obtain accurate non-rigid facial motion. The 5 head markers are used as the reference targets, since they contain only the rigid head motion. Let {p_i} and {q_i}, i = 1, ..., 5, be two sets of three-dimensional feature point coordinates denoting the same points before and after the motion, respectively; then
q_i = R·p_i + t,  i = 1, ..., 5.  (1)
According to the least squares principle, Σ² = Σ_{i=1}^{5} ||R·p_i + t − q_i||², the optimal rotation matrix R and translation vector t are obtained by singular value decomposition of the correlation matrix. The inverse transform of Equation 1 is then applied to the three-dimensional coordinates of the other 45 points in each frame, which yields the non-rigid facial motion.
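The rigid-motion compensation above amounts to a least-squares absolute orientation (SVD) problem on the 5 head markers, followed by an inverse transform of the facial markers. A minimal NumPy sketch is given below; the function names and array shapes are illustrative assumptions, not part of the patent.

```python
import numpy as np

def estimate_rigid_motion(p, q):
    """Estimate R and t of Equation 1 (q_i = R·p_i + t) in the least-squares sense.

    p, q: (5, 3) arrays holding the head-band marker coordinates before and
    after motion. The rotation is recovered from the SVD of the 3x3
    correlation matrix of the centered point sets."""
    p_mean, q_mean = p.mean(axis=0), q.mean(axis=0)
    H = (p - p_mean).T @ (q - q_mean)                            # correlation matrix
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])  # avoid reflections
    R = Vt.T @ D @ U.T
    t = q_mean - R @ p_mean
    return R, t

def remove_rigid_motion(face_points, R, t):
    """Apply the inverse of Equation 1 to the (45, 3) facial markers of one frame,
    leaving only the non-rigid facial motion."""
    return (face_points - t) @ R        # row-vector form of R^T (q - t)
```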
b) The difference between each frame's feature point coordinates and those of the neutral frame is computed:
Δd_t = {Δx_1, Δy_1, Δz_1, ..., Δx_n, Δy_n, Δz_n}^T ∈ R^(3n),  n = 45.
At the same time the scale reference quantity of each feature point, the FAP unit defined by MPEG-4, D = {D_1, D_2, ..., D_n}, is computed. The i-th face motion parameter of frame t is then obtained from
FAP_i^t = ((Δx_i | Δy_i | Δz_i) / D_i) × 1024.
For each frame we finally extract 15 parameters related to lip motion (see Table 1), v_t = {FAP_1^t, FAP_2^t, ..., FAP_15^t}^T, so the video feature vector of an m-frame data sample can be expressed as V = {v_1, v_2, ..., v_m}^T.
Table 1: The 15 extracted FAP parameters
Group 2: Open_jaw, Thrust_jaw, Shift_jaw, Push_b_lip, Push_t_lip
Group 8: Lower_t_midlip_o, Raise_b_midlip_o, Stretch_l_cornerlip_o, Stretch_r_cornerlip_o, Lower_t_lip_lm_o, Lower_t_lip_rm_o, Raise_b_lip_lm_o, Raise_b_lip_rm_o, Raise_l_cornerlip_o, Raise_r_cornerlip_o
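As an illustration of the FAP computation just described, the sketch below converts per-frame marker displacements into the 15 lip FAP values of Table 1. The index arrays fap_marker and fap_axis and the FAPU values D_i are assumed inputs determined by the marker layout; the patent does not specify them in code form.

```python
import numpy as np

# FAP numbers of the 15 lip-related parameters of Table 1 / Table 2
LIP_FAPS = [3, 14, 15, 16, 17, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60]

def fap_frame(frame_xyz, neutral_xyz, fap_marker, fap_axis, fapu):
    """Compute one frame's 15-dimensional FAP vector v_t.

    frame_xyz, neutral_xyz: (45, 3) marker coordinates of the current / neutral frame.
    fap_marker: (15,) index of the marker driving each FAP.
    fap_axis:   (15,) axis (0=x, 1=y, 2=z) whose displacement is taken (Δx | Δy | Δz).
    fapu:       (15,) MPEG-4 FAP units D_i used as scale references."""
    delta = frame_xyz - neutral_xyz                    # Δd_t
    disp = delta[fap_marker, fap_axis]                 # one displacement component per FAP
    return disp / fapu * 1024.0                        # FAP_i^t = (Δ· / D_i) · 1024

def fap_sequence(frames_xyz, neutral_xyz, fap_marker, fap_axis, fapu):
    """Stack the per-frame vectors into V = {v_1, ..., v_m}^T, shape (m, 15)."""
    return np.stack([fap_frame(f, neutral_xyz, fap_marker, fap_axis, fapu)
                     for f in frames_xyz])
```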
c) The speech is separated from the .avi file, the environmental noise is removed and gain amplification applied, and 16th-order Mel-frequency cepstral coefficients (MFCC) are then extracted in a Hamming window as the speech feature vector: a_t = {c_1^t, c_2^t, ..., c_16^t}^T denotes the speech feature vector of frame t, so the speech feature vector of an m-frame data sample can be expressed as
A = {a_1, a_2, ..., a_m}^T.
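A minimal sketch of this acoustic feature extraction, using librosa as one possible MFCC implementation (the patent does not name a toolkit); the 27 ms window and 5 ms shift follow the analysis setup described in the next section, and the sampling rate is an assumption.

```python
import librosa

def extract_mfcc(wav_path, sr=16000, n_mfcc=16, win_ms=27, hop_ms=5):
    """Extract 16th-order MFCCs in a Hamming window; returns A = {a_1, ..., a_m}^T
    with shape (m, 16), one row per 5 ms analysis frame."""
    y, sr = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(
        y=y, sr=sr, n_mfcc=n_mfcc,
        n_fft=int(sr * win_ms / 1000),        # 27 ms analysis window
        hop_length=int(sr * hop_ms / 1000),   # 5 ms window shift -> 200 frames/s
        window="hamming",
    )
    return mfcc.T
```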
4. Segmenting the multi-modal data synchronously with the phoneme as the primitive unit
To synchronize the audio and video feature vectors, the multi-modal data must be segmented synchronously with the phoneme as the primitive unit.
a) For audio, the speech analysis window size is WinSize = 27 ms and the window shift is WinMove = 5 ms, so the audio frame rate is AudioFrameSample = 1 × 1000 / WinMove = 200 frames/s, while the video frame rate is VideoFrameSample = 75 frames/s.
b) The ratio of audio frames to video frames can therefore be computed as n = AudioFrameSample / VideoFrameSample, and the synchronized audio-video feature sets are partitioned according to this ratio n.
c) For audio, boundary markers and sentence-end marks are recorded with the phoneme as the primitive.
After the above processing, all training data are stored in the training database with the primitive as the storage unit, to be retrieved and matched in the dynamic primitive selection stage.
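The sketch below illustrates how the audio (200 frames/s) and video (75 frames/s) feature streams could be cut into synchronized phoneme primitives from a list of phoneme boundaries; the boundary format (phoneme label plus start/end times in seconds) is an assumption for illustration.

```python
AUDIO_FPS = 200   # 5 ms window shift   -> 200 audio frames per second
VIDEO_FPS = 75    # motion capture rate -> 75 video frames per second

def segment_primitives(mfcc, faps, phone_boundaries):
    """Cut the audio feature matrix (m_a, 16) and video feature matrix (m_v, 15)
    into phoneme primitives given (phone, start_s, end_s) boundary records."""
    primitives = []
    for phone, start_s, end_s in phone_boundaries:
        primitives.append({
            "phone": phone,
            "audio": mfcc[int(start_s * AUDIO_FPS):int(end_s * AUDIO_FPS)],
            "video": faps[int(start_s * VIDEO_FPS):int(end_s * VIDEO_FPS)],
        })
    return primitives
```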
5. Using the dynamic primitive selection method to output a face animation parameter sequence synchronized with the speech (Fig. 4)
a) For a new speech sequence input by the user, the speech features are first extracted.
b) For each phoneme primitive, candidate primitives are then retrieved from the training database according to the audio matching error C_a and the visual matching error C_v between adjacent primitives.
c) The audio matching error C_a is computed first:
i. Each phoneme primitive is classified as either a final (vowel unit) or an initial (consonant unit). For finals, the contextual information of the two preceding and two following primitives is considered; for initials, the contextual information of the one preceding and one following primitive is considered.
ii. The contextual information is computed with a linear function representing the coarticulation coefficient: the coarticulation coefficient of the current primitive is maximal, and within the preceding and following primitives the per-frame coarticulation coefficient decreases linearly.
iii. The resulting audio matching error is
C_a = Σ_{t=1}^{n} Σ_m w_m · a(t_{t+m}, u_{t+m}),  m ∈ [-1, 0, +1] or [-2, -1, 0, +1, +2],
where w_m is the coarticulation coefficient and a(t_{t+m}, u_{t+m}) is the Euclidean distance between the acoustic features of the primitives. To reduce the complexity of the Viterbi path search, the number of candidate primitives is limited to 10.
d) The visual matching error C_v between adjacent primitives is then computed:
i. For each phoneme primitive, the smoothness of the junction between adjacent primitives is considered by computing how well the face motion parameters of the last frame of the preceding primitive match those of the first frame of the following primitive.
ii. The resulting visual matching error is
C_v = Σ_{t=2}^{n} v(u_{t-1}, u_t),
where v(u_{t-1}, u_t) is the Euclidean distance between the video parameters of the adjacent frames of two adjacent primitives.
e) Finally the total cost function COST = α·C_a + (1 − α)·C_v is computed. As shown in Fig. 4, after a sentence is finished the Viterbi algorithm searches for an optimal path that minimizes the total cost function; concatenating the primitives along this path yields a continuous face motion parameter sequence.
f) The parameter sequence is smoothed and fed into the three-dimensional face model to drive the face animation.
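Steps c)-e) above can be read as a Viterbi search over at most 10 candidates per phoneme, with the audio cost as the state cost and the visual join cost as the transition cost. The sketch below is a simplified illustration under stated assumptions: candidate and target primitives are dictionaries like those produced by segment_primitives above, the coarticulation weights are applied per neighboring primitive rather than per frame, and the weight α and the helper names are not specified by the patent.

```python
import numpy as np

MAX_CANDIDATES = 10   # upper bound on candidates per phoneme, as stated above

def coarticulation_weights(n_context):
    """Linearly decreasing coarticulation coefficients, maximal at the current
    primitive (m = 0), e.g. [0.5, 1.0, 0.5] for n_context = 1 (an assumed scheme)."""
    return [1.0 - abs(m) / (n_context + 1) for m in range(-n_context, n_context + 1)]

def audio_cost(target_ctx, cand_ctx, weights):
    """C_a contribution: coarticulation-weighted Euclidean distances between the
    target's and the candidate's MFCC matrices over the context window."""
    cost = 0.0
    for w, t_a, u_a in zip(weights, target_ctx, cand_ctx):
        n = min(len(t_a), len(u_a))                    # align by truncation
        cost += w * np.linalg.norm(t_a[:n] - u_a[:n])
    return cost

def visual_cost(prev_cand, cand):
    """C_v contribution: distance between the last FAP frame of the previous
    primitive and the first FAP frame of the current one."""
    return np.linalg.norm(prev_cand["video"][-1] - cand["video"][0])

def select_path(targets, candidates, alpha=0.5):
    """Viterbi search minimizing COST = alpha*C_a + (1-alpha)*C_v over a sentence.

    targets:    per-phoneme dicts with 'context' (list of MFCC matrices) and 'weights'.
    candidates: per-phoneme lists (length <= MAX_CANDIDATES) of database primitives,
                each with 'context' (MFCC matrices) and 'video' (FAP matrix)."""
    n = len(candidates)
    state = [[alpha * audio_cost(targets[i]["context"], c["context"], targets[i]["weights"])
              for c in candidates[i]] for i in range(n)]
    best, back = [state[0][:]], [[-1] * len(candidates[0])]
    for i in range(1, n):
        row_cost, row_back = [], []
        for cand, a_c in zip(candidates[i], state[i]):
            trans = [best[i - 1][k] + (1 - alpha) * visual_cost(prev, cand)
                     for k, prev in enumerate(candidates[i - 1])]
            k_best = int(np.argmin(trans))
            row_cost.append(trans[k_best] + a_c)
            row_back.append(k_best)
        best.append(row_cost)
        back.append(row_back)
    # backtrack the minimum-cost path and return the selected primitives
    j = int(np.argmin(best[-1]))
    path = [j]
    for i in range(n - 1, 0, -1):
        j = back[i][j]
        path.append(j)
    path.reverse()
    return [candidates[i][path[i]] for i in range(n)]
```

The FAP frames of the selected primitives can then be concatenated, smoothed, and used to drive the face model as in step f).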
6. Evaluation of the experimental results (Fig. 5, Fig. 6)
Both quantitative and qualitative evaluations of the experimental results are given here. For the quantitative evaluation, the error between the synthesized parameter sequences and the recorded parameter sequences of the corresponding database samples is computed; the correlation coefficient (Equation 2), which reflects the similarity of the variation of the two curve trajectories, is used to measure this error. Open-set tests (the test set does not include training data) and closed-set tests (the test set includes training data) were both carried out, and Table 2 gives the average correlation coefficients of the synthesized parameter sequences over the entire database. Fig. 5 compares the synthesized parameter sequences of one validation sample and one test sample with the actually recorded sequences: Fig. 5(a) shows the synthesized and recorded FAP#51 and FAP#52 sequences of a validation sample, and Fig. 5(b) shows those of a test sample.
CC = (1/T) · Σ_{t=1}^{T} (f̂(t) − μ̂)(f(t) − μ) / (σ̂·σ)  (2)
FAP No.    Validation    Test
#3         0.855         0.662
#14        0.674         0.656
#15        0.553         0.761
#16        0.767         0.677
#17        0.716         0.788
#51        0.820         0.732
#52        0.812         0.741
#53        0.817         0.722
#54        0.832         0.636
#55        0.709         0.684
#56        0.768         0.736
#57        0.749         0.714
#58        0.787         0.720
#59        0.661         0.640
#60        0.569         0.730
Table 2: Average correlation coefficients of the synthesized FAP parameters over the entire database
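A small sketch of the correlation coefficient of Equation 2, as it could be computed for one synthesized trajectory f̂(t) against the recorded trajectory f(t) (function name assumed):

```python
import numpy as np

def correlation_coefficient(f_hat, f):
    """Equation 2: CC = (1/T) Σ (f̂(t) - μ̂)(f(t) - μ) / (σ̂ σ)."""
    f_hat, f = np.asarray(f_hat, float), np.asarray(f, float)
    return np.mean((f_hat - f_hat.mean()) * (f - f.mean())) / (f_hat.std() * f.std())
```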
Finally, a qualitative analysis of the synthesized face animation is given, i.e. whether the perceived animation sequence looks realistic and natural. The parameter sequence is therefore smoothed and fed into the three-dimensional face model to drive the face animation, as shown in Fig. 6.
The present invention relates to a voice-driven three-dimensional face animation method based on dynamic primitive selection that can be applied to multimedia human-computer interaction systems, converting speech in any language from any user into synchronized speech and three-dimensional face animation output. The experimental results obtained with this method are good in both the quantitative and the qualitative evaluation, and when new speech sequences are input by unseen speakers the synthesized results are also realistic and natural. The voice-driven face animation method based on dynamic primitive selection provided by the invention is therefore easy to implement, the synthesized results maintain good validity and naturalness, and the method is suitable for multiple users and multiple languages.

Claims (5)

1. A voice-driven face animation method based on dynamic primitive selection, characterized in that: real-time motion capture equipment is used to build a multi-modal database; speech analysis and motion analysis techniques are used to extract the audio and video features; the multi-modal data are segmented synchronously with the phoneme as the primitive unit; for a speech sequence given by the user, the audio matching error of each primitive and the visual matching error between adjacent primitives are computed, an optimal path is dynamically selected among the candidate primitives, and a face animation parameter sequence synchronized with the speech sequence is output to drive a three-dimensional face animation model, converting speech in any language from any user into synchronized speech and three-dimensional face animation output; the method comprises the steps of:
A. creating a multi-modal database with a real-time motion capture system;
B. performing audio-video analysis on the multi-modal data to obtain the corresponding feature vectors;
C. segmenting the multi-modal data synchronously with the phoneme as the primitive unit;
D. using the dynamic primitive selection method to output a face animation parameter sequence synchronized with the user's input speech.
2. The method according to claim 1, characterized in that creating the multi-modal database with a real-time motion capture system comprises the steps of:
a) attaching 50 feature points consistent with the MPEG-4 face animation standard to the subject's face, 5 of which are attached to a hair band on the head and used to compensate for rigid head motion;
b) using a corpus consisting of a training set of 286 sentences, each recorded 5 times (4 recordings used as training samples and 1 as a validation sample), and a test set of 9 sentences;
c) recording the coordinate information of the feature points and the synchronized speech information while the real-time motion capture system records the facial motion.
3. The method according to claim 1, characterized in that performing audio-video analysis on the multi-modal data to obtain the corresponding feature vectors comprises the steps of:
a) compensating the feature point motion data obtained by the motion capture system, d = {x_1, y_1, z_1, ..., x_n, y_n, z_n}^T ∈ R^(3n), n = 45, for rigid motion: the coordinates of the 5 head markers are matched with a least-squares three-dimensional point matching method to estimate the rigid head motion q_i = R·p_i + t, i = 1, ..., 5, and the inverse transform of Equation 1 is then applied to the three-dimensional coordinates of the other 45 points in each frame, yielding the non-rigid facial motion;
b) computing the difference between each frame's feature point coordinates and those of the neutral frame, Δd_t = {Δx_1, Δy_1, Δz_1, ..., Δx_n, Δy_n, Δz_n}^T ∈ R^(3n), n = 45, together with the scale reference quantity of each feature point, the FAP unit defined by MPEG-4, D = {D_1, D_2, ..., D_n}; the i-th face motion parameter of frame t is obtained from FAP_i^t = ((Δx_i | Δy_i | Δz_i) / D_i) × 1024, and 15 parameters related to lip motion are finally extracted for each frame: v_t = {FAP_1^t, FAP_2^t, ..., FAP_15^t}^T;
c) for audio, extracting 16th-order Mel-frequency cepstral coefficients (MFCC) in a Hamming window as the speech feature vector.
4. The method according to claim 1, characterized in that segmenting the multi-modal data synchronously with the phoneme as the primitive unit comprises the steps of:
a) for audio, using a speech analysis window of WinSize = 27 ms with a window shift of WinMove = 5 ms, so that the audio frame rate is AudioFrameSample = 1 × 1000 / WinMove = 200 frames/s, while the video frame rate is VideoFrameSample = 75 frames/s;
b) computing the ratio of audio frames to video frames, n = AudioFrameSample / VideoFrameSample, and partitioning the synchronized audio-video feature sets according to the ratio n;
c) for audio, recording boundary markers and sentence-end marks with the phoneme as the primitive.
5. The method according to claim 1, characterized in that using the dynamic primitive selection method to output a face animation parameter sequence synchronized with the user's input speech comprises the steps of:
a) extracting the speech features of the given new speech;
b) for each phoneme primitive, retrieving candidate primitives from the training database according to the audio matching error C_a and the visual matching error C_v between adjacent primitives;
c) computing the audio matching error C_a:
i. classifying each phoneme primitive as either a final or an initial; for finals, the contextual information of the two preceding and two following primitives is considered, and for initials, the contextual information of the one preceding and one following primitive is considered;
ii. computing the contextual information with a linear function representing the coarticulation coefficient;
iii. obtaining the audio matching error
C_a = Σ_{t=1}^{n} Σ_m w_m · a(t_{t+m}, u_{t+m}),  m ∈ [-1, 0, +1] or [-2, -1, 0, +1, +2],
where w_m is the coarticulation coefficient and a(t_{t+m}, u_{t+m}) is the Euclidean distance between the acoustic features of the primitives;
d) computing the visual matching error C_v:
i. for each phoneme primitive, considering the smoothness of the junction between adjacent primitives by computing how well the face motion parameters of the last frame of the preceding primitive match those of the first frame of the following primitive;
ii. obtaining the visual matching error C_v = Σ_{t=2}^{n} v(u_{t-1}, u_t), where v(u_{t-1}, u_t) is the Euclidean distance between the video parameters of the adjacent frames of two adjacent primitives;
e) computing the total cost function COST = α·C_a + (1 − α)·C_v; after a sentence is finished, the Viterbi algorithm searches for an optimal path that minimizes the total cost function, and concatenating the primitives along this path yields a continuous face motion parameter sequence;
f) smoothing the parameter sequence and feeding it into the three-dimensional face model to drive the face animation.
CN 200510086646 2005-10-20 2005-10-20 Three-dimensional face cartoon method driven by voice based on dynamic elementary access Pending CN1952850A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN 200510086646 CN1952850A (en) 2005-10-20 2005-10-20 Three-dimensional face cartoon method driven by voice based on dynamic elementary access

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN 200510086646 CN1952850A (en) 2005-10-20 2005-10-20 Three-dimensional face cartoon method driven by voice based on dynamic elementary access

Publications (1)

Publication Number Publication Date
CN1952850A true CN1952850A (en) 2007-04-25

Family

ID=38059214

Family Applications (1)

Application Number Title Priority Date Filing Date
CN 200510086646 Pending CN1952850A (en) 2005-10-20 2005-10-20 Three-dimensional face cartoon method driven by voice based on dynamic elementary access

Country Status (1)

Country Link
CN (1) CN1952850A (en)

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101488346B (en) * 2009-02-24 2011-11-02 深圳先进技术研究院 Speech visualization system and speech visualization method
CN102176313B (en) * 2009-10-10 2012-07-25 北京理工大学 Formant-frequency-based Mandarin single final vioce visualizing method
CN106327555A (en) * 2016-08-24 2017-01-11 网易(杭州)网络有限公司 Method and device for obtaining lip animation
CN111279413A (en) * 2017-10-26 2020-06-12 斯纳普公司 Joint audio and video facial animation system
CN110610534A (en) * 2019-09-19 2019-12-24 电子科技大学 Automatic mouth shape animation generation method based on Actor-Critic algorithm
CN110610534B (en) * 2019-09-19 2023-04-07 电子科技大学 Automatic mouth shape animation generation method based on Actor-Critic algorithm
CN111508064B (en) * 2020-04-14 2022-06-17 北京世纪好未来教育科技有限公司 Expression synthesis method and device based on phoneme driving and computer storage medium
CN111508064A (en) * 2020-04-14 2020-08-07 北京世纪好未来教育科技有限公司 Expression synthesis method and device based on phoneme driving and computer storage medium
CN113744371A (en) * 2020-05-29 2021-12-03 武汉Tcl集团工业研究院有限公司 Method, device, terminal and storage medium for generating face animation
CN113744371B (en) * 2020-05-29 2024-04-16 武汉Tcl集团工业研究院有限公司 Method, device, terminal and storage medium for generating face animation
CN112102448A (en) * 2020-09-14 2020-12-18 北京百度网讯科技有限公司 Virtual object image display method and device, electronic equipment and storage medium
CN112102448B (en) * 2020-09-14 2023-08-04 北京百度网讯科技有限公司 Virtual object image display method, device, electronic equipment and storage medium
CN114610158A (en) * 2022-03-25 2022-06-10 Oppo广东移动通信有限公司 Data processing method and device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN101751692B (en) Method for voice-driven lip animation
CN111915707B (en) Mouth shape animation display method and device based on audio information and storage medium
CN1952850A (en) Three-dimensional face cartoon method driven by voice based on dynamic elementary access
CN103218842B (en) A kind of voice synchronous drives the method for the three-dimensional face shape of the mouth as one speaks and facial pose animation
CN112053690B (en) Cross-mode multi-feature fusion audio/video voice recognition method and system
CN107515674A (en) It is a kind of that implementation method is interacted based on virtual reality more with the mining processes of augmented reality
CN107423398A (en) Exchange method, device, storage medium and computer equipment
Sargin et al. Analysis of head gesture and prosody patterns for prosody-driven head-gesture animation
KR20060090687A (en) System and method for audio-visual content synthesis
CN112151030A (en) Multi-mode-based complex scene voice recognition method and device
CN113835522A (en) Sign language video generation, translation and customer service method, device and readable medium
Wang et al. Synthesizing photo-real talking head via trajectory-guided sample selection
Ma et al. Accurate visible speech synthesis based on concatenating variable length motion capture data
Wang et al. HMM trajectory-guided sample selection for photo-realistic talking head
CN115188074A (en) Interactive physical training evaluation method, device and system and computer equipment
CN113822187B (en) Sign language translation, customer service, communication method, device and readable medium
CN116665275A (en) Facial expression synthesis and interaction control method based on text-to-Chinese pinyin
Asadiabadi et al. Multimodal speech driven facial shape animation using deep neural networks
Beskow et al. Data-driven synthesis of expressive visual speech using an MPEG-4 talking head.
Putra et al. Designing translation tool: Between sign language to spoken text on kinect time series data using dynamic time warping
Liu et al. Real-time speech-driven animation of expressive talking faces
Li et al. A novel speech-driven lip-sync model with CNN and LSTM
Lan et al. Low level descriptors based DBLSTM bottleneck feature for speech driven talking avatar
Zhao et al. Realizing speech to gesture conversion by keyword spotting
Filntisis et al. Photorealistic adaptation and interpolation of facial expressions using HMMS and AAMS for audio-visual speech synthesis

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication