CN1952850A - Three-dimensional face cartoon method driven by voice based on dynamic elementary access - Google Patents
Three-dimensional face cartoon method driven by voice based on dynamic elementary access
- Publication number
- CN1952850A, CN200510086646A, CN 200510086646
- Authority
- CN
- China
- Prior art keywords
- primitive
- voice
- motion
- human face
- audio frequency
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Processing Or Creating Images (AREA)
Abstract
This invention discloses a voice-driven human face animation method based on dynamic primitive selection, which comprises the following steps: creating a multi-modal database with a real-time motion capture system; performing audio-video analysis on the multi-modal data to obtain the corresponding feature vectors; segmenting the multi-modal data synchronously with the phoneme as the primitive unit; computing the audio matching error of each primitive and the visual matching error between adjacent primitives; and finally dynamically selecting an optimal path among the candidate primitives to output a facial animation parameter sequence synchronized with the input speech.
Description
Technical field
The present invention relates to voice-driven facial animation in the field of multi-modal human-computer interaction, and in particular to a voice-driven three-dimensional facial animation method based on dynamic primitive selection.
Background technology
Talking Head interfaces (animated speaking faces) have broad application prospects in services for the deaf, information dissemination, film and entertainment, animated games, human-computer interaction, and many other areas. The technology has not yet been widely deployed, however; the main bottlenecks are that real-time synchronization of speech with the face, and the realism and intelligibility of the facial animation itself, still fall short of requirements. The reason is that users are extremely familiar with faces and their motion and are therefore highly sensitive to synchronization errors, while facial motion is a non-rigid process with a great deal of detailed variation and is very complex. The key to voice-driven facial animation is therefore to study the synchronization relationship between speech and lip motion and to develop realistic facial animation synthesis methods.
According to the definition in "Picture My Voice", any visual speech synthesis system can be divided into four steps: building an audio-visual multi-modal database; defining feature representations for the speech and the video; finding a method to describe the association between the two sets of features; and synthesizing the visual output from the speech sequence.
To date there is no unified, general-purpose multi-modal database in the field of visual speech synthesis, so building a database is the basis of all such work. Many research institutions record the multi-modal databases needed for their experiments themselves: some record continuous two-dimensional video and extract regions of interest; some capture static visemes with a three-dimensional laser scanner; others recover the three-dimensional coordinates of feature points from video through 2D-to-3D reconstruction.
Accurately extracting the audio and video feature vectors is also a common difficulty in building such systems. For audio, many researchers classify and analyze the linguistic units of the text, but the workload is heavy and the degree of manual intervention is high, so extracting acoustic and prosodic features of the speech signal has become the main approach; how the choice of features affects the resulting lip shapes still requires further analysis. For video, dynamic visemes, optical flow, three-dimensional coordinates, and distance components have all been used as video features.
Real-time synchronized control of speech and the face is the bottleneck of Talking Head synthesis: the mapping between speech and facial motion is a complex many-to-many relation, while people are highly sensitive to audio-visual synchronization, which makes voice-driven facial animation a challenging research topic. Current voice-driven approaches fall into two classes: those based on speech recognition and those that are not. The first class segments speech into linguistic units such as phonemes, visemes, and syllables, maps these units directly to lip postures, and then synthesizes the animation by concatenation. Yamamoto E. trained hidden Markov models (HMMs) to recognize phonemes, mapped them directly to the corresponding lip shapes, and obtained a lip-motion sequence after smoothing; this method depends heavily on the text and language and does not consider contextual information. The second class bypasses speech units entirely and uses statistical learning: on the basis of large amounts of data it performs statistical analysis of the bimodal data, or learns a mapping with neural networks or linear prediction functions, to find the relationship between the speech signal and the control parameters and then drive the lip motion directly. Neural-network methods attempt to find a real-time nonlinear mapping between speech and facial animation; D. W. Massaro and others trained multi-layer perceptrons to learn the mapping from LPC parameters to facial animation parameters, modeling speech context by considering five analysis windows before and after each frame. Neural-network methods, however, are prone to over-fitting during training and also require manual tuning based on experience.
Computer-generated facial animation is currently based almost entirely on mesh models composed of geometric meshes and textures, a representation adopted by most researchers. Because parametric face synthesis often has difficulty producing satisfactory three-dimensional expressions, many researchers have recently turned to direct data-driven methods. To date, data-driven methods have mainly taken three forms: image-sequence concatenation, key-frame morphing, and combination of facial components. Parametric face models, however, still have the advantages of small data volume and easy portability.
Summary of the invention
The present invention relates to a voice-driven three-dimensional facial animation method based on dynamic primitive selection, applicable to multimedia human-computer interaction systems. A multi-modal database is built with real-time motion capture equipment; speech analysis and motion analysis techniques are used to extract the audio and video features; the multi-modal data are segmented synchronously with the phoneme as the primitive unit. For a speech sequence given by the user, the audio matching error of each primitive and the visual matching error between adjacent primitives are computed, and an optimal path is dynamically selected among the candidate primitives. The output is a facial animation parameter sequence synchronized with the speech sequence, which drives the three-dimensional face model, converting input speech of any language from any user into synchronized speech and three-dimensional facial animation output.
To achieve the above object, the invention provides a voice-driven facial animation method based on dynamic primitive selection, comprising the steps of:
A. creating a multi-modal database with a real-time motion capture system;
B. performing audio-video analysis on the multi-modal data to obtain the corresponding feature vectors;
C. segmenting the multi-modal data synchronously with the phoneme as the primitive unit;
D. using a dynamic primitive selection method to output a facial animation parameter sequence synchronized with the user's input speech.
With the method provided by the invention, the output animation sequence maintains good validity and naturalness and is applicable to multi-user, multi-language voice driving. The system is easy to implement and can be applied to multimedia human-computer interaction systems.
Description of drawings
Fig. 1: Framework of the voice-driven facial animation system.
Fig. 2: Facial feature point selection on the subject and a captured sample.
Fig. 3: The facial feature points and FAP (facial animation parameter) units defined by MPEG-4.
Fig. 4: Schematic diagram of dynamic primitive selection.
Fig. 5: Comparison of the synthesized facial animation parameter sequences (FAP 51 and FAP 52) with the recorded parameter sequences.
Fig. 6: Selected frames from a voice-driven three-dimensional facial animation example.
Embodiment
1. Framework of the voice-driven facial animation system (Fig. 1)
The system framework consists of two processes: a training process and an animation process.
The training process comprises four parts: the speech-video multi-modal database, video feature processing, audio feature processing, and storage of the speech-video primitive segmentation. Video feature processing includes three parts: pose normalization, rigid motion compensation, and FAP parameter extraction. Audio feature processing includes an acoustic parameter extraction part.
The animation process comprises five parts: acoustic parameter extraction, primitive retrieval, FAP parameter sequence synthesis, smoothing, and driving of the facial animation model.
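Purely as an illustrative aid (not part of the original disclosure), the following Python sketch shows one way the two processes could be organized. The class and function names (`Primitive`, `PrimitiveDatabase`, `training_process`, `animation_process`) and the array shapes are assumptions of this sketch, not terms defined by the patent.

```python
from dataclasses import dataclass, field
from typing import Callable, List
import numpy as np

@dataclass
class Primitive:
    """One phoneme-aligned audio-visual unit stored in the training database."""
    phoneme: str        # phoneme label of the primitive
    audio: np.ndarray   # per-frame acoustic features, shape (n_audio_frames, 16)
    fap: np.ndarray     # per-frame facial animation parameters, shape (n_video_frames, 15)

@dataclass
class PrimitiveDatabase:
    primitives: List[Primitive] = field(default_factory=list)

    def candidates(self, phoneme: str, limit: int = 10) -> List[Primitive]:
        """Return up to `limit` stored primitives carrying the same phoneme label."""
        return [p for p in self.primitives if p.phoneme == phoneme][:limit]

def training_process(segmented_recordings) -> PrimitiveDatabase:
    """Offline: store phoneme-segmented audio-visual primitives in the database."""
    db = PrimitiveDatabase()
    for phoneme, mfcc_seg, fap_seg in segmented_recordings:
        db.primitives.append(Primitive(phoneme, mfcc_seg, fap_seg))
    return db

def animation_process(db: PrimitiveDatabase, phonemes, mfcc_segments,
                      select: Callable, smooth: Callable) -> np.ndarray:
    """Online: select primitives for the input speech and emit a smoothed FAP sequence."""
    chosen = select(db, phonemes, mfcc_segments)              # dynamic primitive selection
    fap_sequence = np.concatenate([p.fap for p in chosen])    # concatenate per-primitive FAPs
    return smooth(fap_sequence)                               # smooth before driving the model
```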
2. Creating the multi-modal database with a real-time motion capture system (Fig. 2, Fig. 3)
a) A multi-modal database is first created with a three-dimensional real-time motion capture system. As shown in Fig. 3, 50 feature points consistent with the MPEG-4 facial animation standard are attached to the subject's face; Fig. 3(a) shows the basic FAP units defined for the face, and Fig. 3(b) shows the subset of facial feature points used. As shown in Fig. 2, 5 of the points are attached to a head band and are used to compensate for rigid head motion.
b) Corpus size: training set: 286 sentences, each recorded 5 times, with 4 recordings used as training samples and 1 as a validation sample; test set: 9 sentences.
c) The real-time motion capture system and a video camera record the facial motion and the speech synchronously. The three-dimensional coordinates of the feature points are stored in a .trb file, and the speech is stored in the video .avi file.
3. Audio-video analysis of the multi-modal data to obtain the corresponding feature vectors
To analyze the audio and the video of the multi-modal data, the two streams must be separated and the corresponding feature vectors extracted from each.
a) Because the subject's head and body inevitably sway during the experiment, rigid head motion must be compensated in order to obtain accurate non-rigid facial motion. The feature-point motion coordinates obtained from the motion capture system, d = {x_1, y_1, z_1, ..., x_n, y_n, z_n}^T ∈ R^{3n}, n = 45, are therefore first subjected to rigid motion compensation. The 5 markers on the head band are used as the reference target, since they carry only rigid head motion. Suppose two sets of three-dimensional feature-point coordinates {p_i} and {q_i}, i = 1, ..., 5, denote the same points before and after the motion; the two sets are then related by
q_i = R·p_i + t, i = 1, ..., 5. (1)
Following the least-squares principle, the rotation matrix R and the translation vector t are obtained as the optimal solution via singular value decomposition of the correlation matrix; the inverse of transformation (1) is then applied to the three-dimensional coordinates of the other 45 points in each frame, which yields the non-rigid facial motion.
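As a concrete illustration of this compensation step, the sketch below implements least-squares rigid alignment of the five head markers via SVD of their cross-covariance matrix and then removes the estimated head motion from the 45 facial markers. NumPy and the function names are assumptions of this sketch; the patent itself only specifies the least-squares/SVD principle and equation (1).

```python
import numpy as np

def estimate_rigid_transform(p, q):
    """Least-squares rigid transform (R, t) such that q ≈ R @ p + t.

    p, q: (5, 3) arrays with the head-marker coordinates before/after motion.
    """
    p_mean, q_mean = p.mean(axis=0), q.mean(axis=0)
    # SVD of the cross-covariance matrix of the centred point sets
    H = (p - p_mean).T @ (q - q_mean)
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:          # guard against a reflection solution
        Vt[-1, :] *= -1
        R = Vt.T @ U.T
    t = q_mean - R @ p_mean
    return R, t

def compensate_rigid_motion(frame_points, head_ref, head_now):
    """Remove rigid head motion from the 45 facial points of one frame.

    frame_points: (45, 3) facial marker coordinates of the current frame.
    head_ref:     (5, 3) head-band markers in the reference (neutral) frame.
    head_now:     (5, 3) head-band markers in the current frame.
    """
    R, t = estimate_rigid_transform(head_ref, head_now)
    # Invert q = R p + t, i.e. p = R^T (q - t): express the face in the reference head pose
    return (frame_points - t) @ R
```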
b) The difference between the feature-point coordinates of each frame and those of the neutral frame is computed: Δd_t = {Δx_1, Δy_1, Δz_1, ..., Δx_n, Δy_n, Δz_n}^T ∈ R^{3n}, n = 45. At the same time the scale reference of each feature point, the FAP unit defined by MPEG-4, is computed: D = {D_1, D_2, ..., D_n}. The i-th facial animation parameter of frame t is then obtained from the formula FAP_i^t = ((Δx_i | Δy_i | Δz_i) / D_i) × 1024. In total, 15 parameters related to lip motion are extracted for every frame (see Table 1), so the video feature vector of an m-frame data sample is the concatenation of the 15 per-frame parameters.
| Group | FAP name | Group | FAP name |
| --- | --- | --- | --- |
| 2 | Open_jaw | 8 | Stretch_l_cornerlip_o |
| 2 | Thrust_jaw | 8 | Lower_t_lip_lm_o |
| 2 | Shift_jaw | 8 | Lower_t_lip_rm_o |
| 2 | Push_b_lip | 8 | Raise_b_lip_lm_o |
| 2 | Push_t_lip | 8 | Raise_b_lip_rm_o |
| 8 | Lower_t_midlip_o | 8 | Raise_l_cornerlip_o |
| 8 | Raise_b_midlip_o | 8 | Raise_r_cornerlip_o |
|  |  | 8 | Stretch_r_cornerlip_o |

Table 1: The 15 FAP parameters extracted.
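A minimal sketch of the FAP computation described above, assuming the per-FAP choice of displacement component (x, y, or z) is supplied as an index table derived from the MPEG-4 FAP definitions; the function name and array layout are illustrative, not part of the patent.

```python
import numpy as np

def compute_fap_frame(coords, neutral_coords, fap_units, axis_index):
    """Convert one frame of marker coordinates into FAP values.

    coords, neutral_coords: (45, 3) current and neutral-frame marker coordinates.
    fap_units:  (n_fap,) MPEG-4 scale references D_i for the selected feature points.
    axis_index: (n_fap, 2) integer pairs (marker_idx, axis) choosing which displacement
                component (0=x, 1=y, 2=z) each FAP reads, per its MPEG-4 definition.
    """
    delta = coords - neutral_coords                      # Δd_t for this frame
    markers, axes = axis_index[:, 0], axis_index[:, 1]
    return delta[markers, axes] / fap_units * 1024.0     # FAP_i^t = (Δ / D_i) * 1024

# Example: stack the 15 lip-related FAPs of every frame into the video feature matrix
# video_features = np.stack([compute_fap_frame(f, neutral, D, idx) for f in frames])
```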
c) The speech is separated from the .avi file and processed to remove background noise and apply gain amplification; 16th-order Mel-frequency cepstral coefficients (MFCC) are then extracted from the speech data in a Hamming window as the speech feature vector of frame t. The speech feature vector of an m-frame data sample is the concatenation of the per-frame vectors.
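The MFCC extraction could look like the following sketch. The use of librosa, and the 27 ms window with 5 ms shift taken from the segmentation step described in the next section, are assumptions of this sketch; the patent only specifies 16th-order MFCCs computed in a Hamming window.

```python
import librosa

def extract_mfcc(wav_path, n_mfcc=16, win_ms=27, hop_ms=5):
    """16th-order MFCC features with a 27 ms Hamming window and 5 ms shift."""
    y, sr = librosa.load(wav_path, sr=None)              # keep the file's native sample rate
    n_fft = int(sr * win_ms / 1000)
    hop_length = int(sr * hop_ms / 1000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=n_fft, hop_length=hop_length,
                                window="hamming")
    return mfcc.T                                        # (n_frames, 16): one row per speech frame
```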
4. Synchronous segmentation of the multi-modal data with the phoneme as the primitive unit
To synchronize the speech and video feature vectors, the multi-modal data must be segmented synchronously with the phoneme as the primitive unit.
a) For the audio, the speech analysis window size is WinSize = 27 ms and the window shift is WinMove = 5 ms, so the audio frame rate is AudioFrameSample = 1 × 1000 / WinMove = 200 frames/s; the video frame rate is VideoFrameSample = 75 frames/s.
b) The ratio of audio frames to video frames is therefore n = AudioFrameSample / VideoFrameSample; the synchronized audio-video feature sets are then divided according to the ratio n.
c) For the audio, boundary markers and sentence-end (full-stop) records are provided for the phoneme primitives.
After the above processing, all training data are stored in the training database in units of primitives, to be retrieved and matched in the dynamic primitive selection stage.
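To make the synchronization concrete, the following sketch cuts the 200 frames/s audio features and the 75 frames/s video features into phoneme-aligned primitives from a list of phoneme boundaries given in seconds. The boundary format and the dictionary layout are assumptions of this sketch.

```python
def segment_primitives(mfcc, fap, phoneme_boundaries,
                       audio_rate=200.0, video_rate=75.0):
    """Cut synchronized audio/video feature streams into phoneme primitives.

    mfcc: (n_audio_frames, 16) acoustic features at `audio_rate` frames/s.
    fap:  (n_video_frames, 15) facial animation parameters at `video_rate` frames/s.
    phoneme_boundaries: list of (phoneme, start_s, end_s) from the phonetic annotation.
    """
    primitives = []
    for phoneme, start, end in phoneme_boundaries:
        a0, a1 = int(start * audio_rate), int(end * audio_rate)
        v0, v1 = int(start * video_rate), int(end * video_rate)
        primitives.append({"phoneme": phoneme,
                           "audio": mfcc[a0:a1],
                           "fap": fap[v0:v1]})
    return primitives
```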
5. Dynamic primitive selection to output a facial animation parameter sequence synchronized with the speech (Fig. 4)
a) For a new speech sequence input by the user, the speech features are first extracted;
b) For each phoneme primitive, candidate primitives are then retrieved from the training database according to the audio matching error C_a and the visual matching error C_v between adjacent primitives.
c) The audio matching error C_a is computed first:
i. Each phoneme primitive is classified as a final (simple or compound vowel of a Chinese syllable) or an initial consonant. For finals, the contextual information of the two preceding and two following primitives is considered; for initials, the contextual information of one preceding and one following primitive is considered.
ii. The contextual information is computed with a linear function representing the coarticulation coefficient: the coarticulation coefficient of the current primitive is largest, and the per-frame coefficients decrease linearly within the preceding and following primitives.
iii. The audio matching error is computed as the coarticulation-weighted sum of the acoustic distances over the context frames, where a(t_{t+m}, u_{t+m}) denotes the Euclidean distance between the acoustic features of the primitives. To reduce the complexity of the Viterbi path search, the method limits the number of candidate primitives to at most 10.
d) The visual matching error C_v between adjacent primitives is then computed:
i. For each phoneme primitive, the smoothness of the junction between adjacent primitives is considered by evaluating how well the last frame of the preceding primitive matches the first frame of facial motion parameters of the following primitive.
ii. The visual matching error is computed accordingly, where v(u_{t-1}, u_t) is the Euclidean distance between the video parameters of the adjacent video frames of two adjacent primitives.
e) Finally the total cost function COST = α·C_a + (1 − α)·C_v is computed. As shown in Fig. 4, after a sentence is finished, the Viterbi algorithm searches for the optimal path that minimizes the total cost function; concatenating the primitives on this path yields a continuous facial motion parameter sequence (a code sketch of this search is given after these steps).
f) The parameter sequence is smoothed and fed into the three-dimensional face model to drive the facial animation.
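The sketch below illustrates the dynamic selection with a plain Viterbi search over the candidate primitives, minimizing COST = α·C_a + (1 − α)·C_v. For brevity the audio cost here is a simple Euclidean distance between mean MFCC vectors, i.e. the coarticulation weighting over context primitives described in step c) is not reproduced; all function names and data layouts are assumptions of this sketch, not the patent's exact formulation.

```python
import numpy as np

def audio_cost(input_mfcc, candidate):
    """Simplified audio matching error: Euclidean distance between mean MFCC vectors.

    The patent additionally weights context frames with linearly decaying
    coarticulation coefficients; that weighting is omitted in this sketch.
    """
    return float(np.linalg.norm(input_mfcc.mean(axis=0) - candidate["audio"].mean(axis=0)))

def visual_cost(prev_candidate, candidate):
    """Visual matching error: distance between the last FAP frame of the previous
    primitive and the first FAP frame of the current one."""
    return float(np.linalg.norm(prev_candidate["fap"][-1] - candidate["fap"][0]))

def select_primitives(input_segments, candidates_per_phoneme, alpha=0.5):
    """Viterbi search for the candidate sequence minimizing
    COST = alpha * C_a + (1 - alpha) * C_v over the whole sentence.
    Each position is assumed to have at least one candidate."""
    n = len(input_segments)
    cost, back = [], []          # cost[t][j]: best accumulated cost ending in candidate j
    for t in range(n):
        cands = candidates_per_phoneme[t]
        ca = [alpha * audio_cost(input_segments[t]["audio"], c) for c in cands]
        if t == 0:
            cost.append(ca)
            back.append([None] * len(cands))
            continue
        row, brow = [], []
        for j, c in enumerate(cands):
            totals = [cost[t - 1][i] + (1 - alpha) * visual_cost(p, c)
                      for i, p in enumerate(candidates_per_phoneme[t - 1])]
            i_best = int(np.argmin(totals))
            row.append(totals[i_best] + ca[j])
            brow.append(i_best)
        cost.append(row)
        back.append(brow)
    # Backtrack the optimal path
    j = int(np.argmin(cost[-1]))
    path = [j]
    for t in range(n - 1, 0, -1):
        j = back[t][j]
        path.append(j)
    path.reverse()
    return [candidates_per_phoneme[t][j] for t, j in enumerate(path)]
```

In use, `candidates_per_phoneme[t]` would hold at most 10 database primitives with the same phoneme label as the t-th input primitive, consistent with the candidate limit mentioned in step c).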
6. Evaluation of experimental results (Fig. 5, Fig. 6)
Both quantitative and qualitative evaluations of the experimental results are given here. For the quantitative evaluation, the error between the output synthesized parameter sequences and the recorded sample parameter sequences in the database is computed. The correlation coefficient (formula 2), which reflects how similarly the two curve trajectories vary, is used to measure this error (a code sketch of this computation is given after Table 2). Both open-set tests (the test set contains no training data) and closed-set tests (the test set contains training data) were carried out; Table 2 gives the average correlation coefficients of the synthesized parameter sequences over the entire database. As shown in Fig. 5, the synthesized parameter sequences are also compared with the actually recorded parameter sequences for one validation sample and one test sample: Fig. 5(a) shows the synthesized and recorded FAP #51 and FAP #52 sequences of a validation sample, and Fig. 5(b) those of a test sample.
| FAP | Validation | Test |
| --- | --- | --- |
| #3 | 0.855 | 0.662 |
| #14 | 0.674 | 0.656 |
| #15 | 0.553 | 0.761 |
| #16 | 0.767 | 0.677 |
| #17 | 0.716 | 0.788 |
| #51 | 0.820 | 0.732 |
| #52 | 0.812 | 0.741 |
| #53 | 0.817 | 0.722 |
| #54 | 0.832 | 0.636 |
| #55 | 0.709 | 0.684 |
| #56 | 0.768 | 0.736 |
| #57 | 0.749 | 0.714 |
| #58 | 0.787 | 0.720 |
| #59 | 0.661 | 0.640 |
| #60 | 0.569 | 0.730 |

Table 2: Average correlation coefficients of the synthesized FAP parameters over the entire database.
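A minimal sketch of the quantitative evaluation, assuming the correlation measure referred to as formula 2 is the Pearson correlation coefficient between the synthesized and recorded trajectories of each FAP channel (the exact formula is not reproduced in this text):

```python
import numpy as np

def fap_correlation(synth, recorded):
    """Per-FAP Pearson correlation between synthesized and recorded trajectories.

    synth, recorded: (n_frames, n_fap) FAP parameter sequences of the same sentence.
    Returns one correlation coefficient per FAP channel.
    """
    return np.array([np.corrcoef(synth[:, i], recorded[:, i])[0, 1]
                     for i in range(synth.shape[1])])

# Average over all validation (or test) sentences to obtain a Table-2 style summary
# mean_corr = np.mean([fap_correlation(s, r) for s, r in sentence_pairs], axis=0)
```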
For the qualitative analysis of the synthesized facial animation, the question is whether the animation sequence perceived by an observer looks realistic and natural. The parameter sequence is therefore smoothed and fed into the three-dimensional face model to drive the facial animation, as shown in Fig. 6.
The present invention relates to a voice-driven three-dimensional facial animation method based on dynamic primitive selection, applicable to multimedia human-computer interaction systems, which converts input speech of any language from any user into synchronized speech and three-dimensional facial animation output. The experimental results obtained with this method are good in both the quantitative and the qualitative evaluation; when new speech sequences are input by unregistered speakers, the synthesized results are also realistic and natural. The voice-driven facial animation method based on dynamic primitive selection provided by the invention is therefore easy to implement, the synthesized results maintain good validity and naturalness, and the method is applicable to multiple users and multiple languages.
Claims (5)
1. A voice-driven facial animation method based on dynamic primitive selection, characterized in that: real-time motion capture equipment is used to build a multi-modal database; speech analysis and motion analysis techniques are used to extract the audio and video features; the multi-modal data are segmented synchronously with the phoneme as the primitive unit; for a speech sequence given by the user, the audio matching error of each primitive and the visual matching error between adjacent primitives are computed, an optimal path is dynamically selected among the candidate primitives, and a facial animation parameter sequence synchronized with the speech sequence is output to drive the three-dimensional facial animation model, converting input speech of any language from any user into synchronized speech and three-dimensional facial animation output; the method comprises the steps of:
A. creating a multi-modal database with a real-time motion capture system;
B. performing audio-video analysis on the multi-modal data to obtain the corresponding feature vectors;
C. segmenting the multi-modal data synchronously with the phoneme as the primitive unit;
D. using a dynamic primitive selection method to output a facial animation parameter sequence synchronized with the user's input speech.
2. The method according to claim 1, characterized in that creating the multi-modal database with the real-time motion capture system comprises the steps of:
a) attaching to the subject's face 50 feature points consistent with the MPEG-4 facial animation standard, 5 of which are attached to a head band and used to compensate for rigid head motion;
b) using a corpus of the following size: training set: 286 sentences, each recorded 5 times, with 4 recordings used as training samples and 1 as a validation sample; test set: 9 sentences;
c) recording, with the real-time motion capture system, the coordinate information of the feature points during facial motion together with the synchronized speech.
3. The method according to claim 1, characterized in that performing the audio-video analysis on the multi-modal data to obtain the corresponding feature vectors comprises the steps of:
a) applying rigid motion compensation to the feature-point motion coordinates obtained from the motion capture system, d = {x_1, y_1, z_1, ..., x_n, y_n, z_n}^T ∈ R^{3n}, n = 45: the coordinates of the 5 head markers are registered with a least-squares three-dimensional point matching to estimate the rigid head motion q_i = R·p_i + t, i = 1, ..., 5, and the inverse of transformation (1) is then applied to the three-dimensional coordinates of the other 45 points in each frame, which yields the non-rigid facial motion;
b) computing the difference between the feature-point coordinates of each frame and those of the neutral frame, Δd_t = {Δx_1, Δy_1, Δz_1, ..., Δx_n, Δy_n, Δz_n}^T ∈ R^{3n}, n = 45, and at the same time the scale reference of each feature point, the FAP unit defined by MPEG-4, D = {D_1, D_2, ..., D_n}; the i-th facial animation parameter of frame t is obtained from the formula FAP_i^t = ((Δx_i | Δy_i | Δz_i) / D_i) × 1024, and 15 parameters related to lip motion are extracted for every frame;
c) for the audio, extracting 16th-order Mel-frequency cepstral coefficients (MFCC) of the speech data in a Hamming window as the speech feature vector.
4. The method according to claim 1, characterized in that segmenting the multi-modal data synchronously with the phoneme as the primitive unit comprises the steps of:
a) for the audio, using a speech analysis window of WinSize = 27 ms with a window shift of WinMove = 5 ms, so that the audio frame rate is AudioFrameSample = 1 × 1000 / WinMove = 200 frames/s, while the video frame rate is VideoFrameSample = 75 frames/s;
b) computing the ratio of audio frames to video frames, n = AudioFrameSample / VideoFrameSample, and dividing the synchronized audio-video feature sets according to the ratio n;
c) for the audio, providing boundary markers and sentence-end (full-stop) records for the phoneme primitives.
5. The method according to claim 1, characterized in that using the dynamic primitive selection method to output the facial animation parameter sequence synchronized with the user's input speech comprises the steps of:
a) extracting the speech features from the given new speech;
b) for each phoneme primitive, retrieving candidate primitives from the training database according to the audio matching error C_a and the visual matching error C_v between adjacent primitives;
c) computing the audio matching error C_a:
i. classifying each phoneme primitive as a final (simple or compound vowel of a Chinese syllable) or an initial consonant; for finals, considering the contextual information of the two preceding and two following primitives, and for initials, the contextual information of one preceding and one following primitive;
ii. computing the contextual information with a linear function representing the coarticulation coefficient;
iii. computing the audio matching error as the coarticulation-weighted sum of the acoustic distances over the context frames;
d) computing the visual matching error C_v:
i. for each phoneme primitive, considering the smoothness of the junction between adjacent primitives by evaluating how well the last frame of the preceding primitive matches the first frame of facial motion parameters of the following primitive;
ii. computing the visual matching error accordingly, where v(u_{t-1}, u_t) is the Euclidean distance between the video parameters of the adjacent video frames of two adjacent primitives;
e) computing the total cost function COST = α·C_a + (1 − α)·C_v; after a sentence is finished, the Viterbi algorithm searches for the optimal path that minimizes the total cost function, and concatenating the primitives on this path yields a continuous facial motion parameter sequence;
f) smoothing the parameter sequence and feeding it into the three-dimensional face model to drive the facial animation.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN 200510086646 CN1952850A (en) | 2005-10-20 | 2005-10-20 | Three-dimensional face cartoon method driven by voice based on dynamic elementary access |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN 200510086646 CN1952850A (en) | 2005-10-20 | 2005-10-20 | Three-dimensional face cartoon method driven by voice based on dynamic elementary access |
Publications (1)
Publication Number | Publication Date |
---|---|
CN1952850A true CN1952850A (en) | 2007-04-25 |
Family
ID=38059214
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN 200510086646 Pending CN1952850A (en) | 2005-10-20 | 2005-10-20 | Three-dimensional face cartoon method driven by voice based on dynamic elementary access |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN1952850A (en) |
Cited By (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101488346B (en) * | 2009-02-24 | 2011-11-02 | 深圳先进技术研究院 | Speech visualization system and speech visualization method |
CN102176313B (en) * | 2009-10-10 | 2012-07-25 | 北京理工大学 | Formant-frequency-based Mandarin single final vioce visualizing method |
CN106327555A (en) * | 2016-08-24 | 2017-01-11 | 网易(杭州)网络有限公司 | Method and device for obtaining lip animation |
CN111279413A (en) * | 2017-10-26 | 2020-06-12 | 斯纳普公司 | Joint audio and video facial animation system |
CN110610534A (en) * | 2019-09-19 | 2019-12-24 | 电子科技大学 | Automatic mouth shape animation generation method based on Actor-Critic algorithm |
CN110610534B (en) * | 2019-09-19 | 2023-04-07 | 电子科技大学 | Automatic mouth shape animation generation method based on Actor-Critic algorithm |
CN111508064B (en) * | 2020-04-14 | 2022-06-17 | 北京世纪好未来教育科技有限公司 | Expression synthesis method and device based on phoneme driving and computer storage medium |
CN111508064A (en) * | 2020-04-14 | 2020-08-07 | 北京世纪好未来教育科技有限公司 | Expression synthesis method and device based on phoneme driving and computer storage medium |
CN113744371A (en) * | 2020-05-29 | 2021-12-03 | 武汉Tcl集团工业研究院有限公司 | Method, device, terminal and storage medium for generating face animation |
CN113744371B (en) * | 2020-05-29 | 2024-04-16 | 武汉Tcl集团工业研究院有限公司 | Method, device, terminal and storage medium for generating face animation |
CN112102448A (en) * | 2020-09-14 | 2020-12-18 | 北京百度网讯科技有限公司 | Virtual object image display method and device, electronic equipment and storage medium |
CN112102448B (en) * | 2020-09-14 | 2023-08-04 | 北京百度网讯科技有限公司 | Virtual object image display method, device, electronic equipment and storage medium |
CN114610158A (en) * | 2022-03-25 | 2022-06-10 | Oppo广东移动通信有限公司 | Data processing method and device, electronic equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| | C06 | Publication | |
| | PB01 | Publication | |
| | C10 | Entry into substantive examination | |
| | SE01 | Entry into force of request for substantive examination | |
| | C02 | Deemed withdrawal of patent application after publication (patent law 2001) | |
| | WD01 | Invention patent application deemed withdrawn after publication | |