CN1952850A - Three-dimensional face cartoon method driven by voice based on dynamic elementary access - Google Patents

Three-dimensional face cartoon method driven by voice based on dynamic elementary access Download PDF

Info

Publication number
CN1952850A
CN1952850A CN200510086646 CN200510086646A
Authority
CN
China
Prior art keywords
primitive
voice
motion
face
audio
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN 200510086646
Other languages
Chinese (zh)
Inventor
陶建华 (Tao Jianhua)
尹潘嵘 (Yin Panrong)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Automation, Chinese Academy of Sciences
Original Assignee
Institute of Automation, Chinese Academy of Sciences
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Automation, Chinese Academy of Sciences
Priority to CN 200510086646 priority Critical patent/CN1952850A/en
Publication of CN1952850A publication Critical patent/CN1952850A/en
Pending legal-status Critical Current

Landscapes

  • Processing Or Creating Images (AREA)

Abstract

This invention discloses a voice-driven face animation method based on dynamic primitive selection, which comprises the following steps: capturing a multi-modal database with the system's real-time motion capture equipment and performing audio-video analysis on the multi-modal data to obtain the corresponding feature vectors; performing synchronous segmentation with the phoneme as the primitive unit; computing the audio matching error of each primitive and the visual matching error between adjacent primitives; and finally dynamically selecting an optimal path among the candidate primitives.

Description

Voice-driven three-dimensional face animation method based on dynamic primitive selection
Technical field
The present invention relates to the field of multi-modal human-computer interaction with voice-driven face animation, and in particular to a voice-driven three-dimensional face animation method based on dynamic primitive selection.
Background art
Talking-head interfaces have promising applications in services for the deaf, information dissemination, film and television entertainment, animated games, human-computer interaction, and many other areas, but the technology has not yet been widely deployed. The main bottlenecks are that real-time synchronization of speech with the face and the realism and intelligibility of the face animation itself do not yet meet requirements. The reason is that users are extremely familiar with faces and their motion and are very sensitive to real-time synchronization errors, while facial motion is a non-rigid process containing a large amount of detailed variation and is therefore very complex. The key to voice-driven face animation is therefore to study the synchronization relationship between speech and lip motion and methods for synthesizing realistic face animation.
Following the formulation in "Picture My Voice", any visual speech synthesis system can be divided into four steps: build an audio-visual multi-modal database; define the audio and video feature representations; find a method to describe the association between the two sets of features; and perform visual synthesis from a speech sequence.
To date there is still no unified, general-purpose multi-modal database in the field of visual speech synthesis, so building a database is the basis of all this work. Many research groups record the multi-modal databases required for their experiments themselves: some record continuous two-dimensional video and extract regions of interest; some capture static visemes with a three-dimensional laser scanner; and others recover the three-dimensional coordinates of feature points from video with 2D-to-3D reconstruction techniques.
Accurately extracting the audio and video feature vectors is usually another difficulty in building such a system. For audio, many researchers classify and analyze textual linguistic units, but the workload is large and the degree of manual intervention is high, so extracting the acoustic and prosodic features of the speech signal has become the main approach; how the choice of features affects lip shape, however, still requires further analysis. For video, dynamic visemes, optical flow, three-dimensional coordinates, and distance components have all been extracted as video features.
Real-time synchronized control of speech and facial motion is the bottleneck of Talking Head synthesis: the mapping between speech and facial motion is a complex many-to-many relation, and people are highly sensitive to audio-visual synchronization, which makes voice-driven face animation a challenging research topic. Current voice-driven approaches fall into two classes: those based on speech recognition and those that are not. The first class segments speech into linguistic units such as phonemes, visemes, and syllables, maps these units directly to lip postures, and then synthesizes the animation by concatenation. Yamamoto E. trained hidden Markov models (HMMs) to recognize phonemes, mapped them directly to the corresponding lip shapes, and obtained lip motion sequences after smoothing; this approach depends heavily on the text and language and does not consider contextual information. The second class bypasses speech units and uses statistical learning instead: on the basis of large amounts of data it performs statistical analysis of the bimodal data, or uses neural networks or linear prediction functions to learn the mapping between the speech signal and the control parameters, and then drives the lip motion directly. Neural-network-based methods attempt to find a real-time nonlinear mapping between speech and face animation; D. W. Massaro et al. trained multi-layer perceptrons to learn the mapping from LPC parameters to face animation parameters, modelling speech context by considering the five analysis frames before and after the current frame. Neural network methods, however, are prone to over-training and still require a stage of manual, experience-based control.
At present, computer-generated face animation is usually based on mesh models consisting of geometric meshes and textures, an approach that is widely adopted. Since parametric face synthesis often has difficulty producing ideal three-dimensional expressions, many researchers have recently turned to direct data-driven methods. To date, data-driven approaches have mainly taken three forms: image sequence concatenation, key-frame morphing, and combination of facial components. Parametric face models, however, still have the advantages of a small data volume and easy portability.
Summary of the invention
The present invention relates to a voice-driven three-dimensional face animation method based on dynamic primitive selection that can be applied to multimedia human-computer interaction systems. It uses real-time motion capture equipment to build a multi-modal database; uses speech analysis and motion analysis techniques to extract the audio and video features; segments the multi-modal data synchronously with the phoneme as the primitive unit; and, for a speech sequence given by the user, computes the audio matching error of each primitive and the visual matching error between adjacent primitives, dynamically selects an optimal path among the candidate primitives, and outputs a face animation parameter sequence synchronized with the speech sequence to drive a three-dimensional face animation model, thereby converting speech in any language from any user into synchronized speech and three-dimensional face animation output.
To achieve the above object, the invention provides a voice-driven face animation method based on dynamic primitive selection, comprising the steps of:
A. creating a multi-modal database with a real-time motion capture system;
B. performing audio-video analysis on the multi-modal data to obtain the corresponding feature vectors;
C. segmenting the multi-modal data synchronously with the phoneme as the primitive unit;
D. using the dynamic primitive selection method to output a face animation parameter sequence synchronized with the user's input speech.
With the method provided by the invention, the system is easy to implement, the output animation sequences maintain good validity and naturalness, and the method is suitable for multi-user and multi-language voice driving. The present invention can be applied to multimedia human-computer interaction systems.
Description of drawings
Fig. 1 is a block diagram of the voice-driven face animation system.
Fig. 2 shows the selection of facial feature points on the experimental subject and a captured sample.
Fig. 3 shows the facial feature point groups and FAP (face animation parameter) units defined by MPEG-4.
Fig. 4 is a schematic diagram of dynamic primitive selection.
Fig. 5 compares the synthesized face animation parameter sequences (FAP 51 and FAP 52) with the recorded parameter sequences.
Fig. 6 shows some frames from an example of voice-driven three-dimensional face animation.
Embodiment
1. Voice-driven face animation system framework (Fig. 1)
The system is implemented in two stages: a training stage and an animation stage.
The training stage consists of four parts: the speech-video multi-modal database, video feature processing, audio feature processing, and storage of the speech-video primitive segmentation. Video feature processing comprises three parts: pose normalization, rigid motion compensation, and FAP parameter extraction. Audio feature processing comprises an acoustic parameter extraction part.
The animation stage comprises five parts: acoustic parameter extraction, primitive selection, FAP parameter sequence synthesis, smoothing, and driving of the face animation model.
2. Creating a multi-modal database with a real-time motion capture system (Fig. 2, Fig. 3)
a) A multi-modal database is first created with a real-time three-dimensional motion capture system. As shown in Fig. 3, 50 feature points consistent with the MPEG-4 face animation standard are attached to the subject's face; Fig. 3(a) shows the basic FAP units defined for the face, and Fig. 3(b) shows the defined group of facial feature points. As shown in Fig. 2, 5 of the markers are attached to a hair band on the head and are used to compensate for rigid head motion.
b) The corpus consists of a training set of 286 sentences, each recorded 5 times (4 recordings used as training samples and 1 as a validation sample), and a test set of 9 sentences.
c) The real-time motion capture system and a video camera record the facial motion and the speech synchronously. The three-dimensional coordinates of the feature points are stored in a .trb file, and the speech is stored in the video .avi file.
3. Performing audio-video analysis on the multi-modal data to obtain the corresponding feature vectors
To perform audio-video analysis on the multi-modal data, the audio and video streams must be separated and the corresponding feature vectors extracted from each.
a) Because the subject's head and body inevitably sway during the experiment, the feature point motion data obtained by the motion capture system, d = {x_1, y_1, z_1, ..., x_n, y_n, z_n}^T ∈ R^(3n), n = 45, must first be compensated for rigid motion in order to obtain accurate non-rigid facial motion. The 5 head markers are used as the reference targets, since they contain only the rigid head motion. Let {p_i} and {q_i}, i = 1, ..., 5, be two sets of three-dimensional feature point coordinates denoting the same points before and after the motion, respectively; then
q_i = R·p_i + t,  i = 1, ..., 5.  (1)
According to the least squares principle, Σ² = Σ_{i=1}^{5} ||R·p_i + t − q_i||², the optimal rotation matrix R and translation vector t are obtained by singular value decomposition of the correlation matrix. The inverse transform of Equation 1 is then applied to the three-dimensional coordinates of the other 45 points in each frame, which yields the non-rigid facial motion.
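The rigid-motion compensation above amounts to a least-squares absolute orientation (SVD) problem on the 5 head markers, followed by an inverse transform of the facial markers. A minimal NumPy sketch is given below; the function names and array shapes are illustrative assumptions, not part of the patent.

```python
import numpy as np

def estimate_rigid_motion(p, q):
    """Estimate R and t of Equation 1 (q_i = R·p_i + t) in the least-squares sense.

    p, q: (5, 3) arrays holding the head-band marker coordinates before and
    after motion. The rotation is recovered from the SVD of the 3x3
    correlation matrix of the centered point sets."""
    p_mean, q_mean = p.mean(axis=0), q.mean(axis=0)
    H = (p - p_mean).T @ (q - q_mean)                            # correlation matrix
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])  # avoid reflections
    R = Vt.T @ D @ U.T
    t = q_mean - R @ p_mean
    return R, t

def remove_rigid_motion(face_points, R, t):
    """Apply the inverse of Equation 1 to the (45, 3) facial markers of one frame,
    leaving only the non-rigid facial motion."""
    return (face_points - t) @ R        # row-vector form of R^T (q - t)
```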
b) The difference between each frame's feature point coordinates and those of the neutral frame is computed:
Δd_t = {Δx_1, Δy_1, Δz_1, ..., Δx_n, Δy_n, Δz_n}^T ∈ R^(3n),  n = 45.
At the same time the scale reference quantity of each feature point, the FAP unit defined by MPEG-4, D = {D_1, D_2, ..., D_n}, is computed. The i-th face motion parameter of frame t is then obtained from
FAP_i^t = ((Δx_i | Δy_i | Δz_i) / D_i) × 1024.
For each frame we finally extract 15 parameters related to lip motion (see Table 1), v_t = {FAP_1^t, FAP_2^t, ..., FAP_15^t}^T, so the video feature vector of an m-frame data sample can be expressed as V = {v_1, v_2, ..., v_m}^T.
Table 1: The 15 extracted FAP parameters
Group 2: Open_jaw, Thrust_jaw, Shift_jaw, Push_b_lip, Push_t_lip
Group 8: Lower_t_midlip_o, Raise_b_midlip_o, Stretch_l_cornerlip_o, Stretch_r_cornerlip_o, Lower_t_lip_lm_o, Lower_t_lip_rm_o, Raise_b_lip_lm_o, Raise_b_lip_rm_o, Raise_l_cornerlip_o, Raise_r_cornerlip_o
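As an illustration of the FAP computation just described, the sketch below converts per-frame marker displacements into the 15 lip FAP values of Table 1. The index arrays fap_marker and fap_axis and the FAPU values D_i are assumed inputs determined by the marker layout; the patent does not specify them in code form.

```python
import numpy as np

# FAP numbers of the 15 lip-related parameters of Table 1 / Table 2
LIP_FAPS = [3, 14, 15, 16, 17, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60]

def fap_frame(frame_xyz, neutral_xyz, fap_marker, fap_axis, fapu):
    """Compute one frame's 15-dimensional FAP vector v_t.

    frame_xyz, neutral_xyz: (45, 3) marker coordinates of the current / neutral frame.
    fap_marker: (15,) index of the marker driving each FAP.
    fap_axis:   (15,) axis (0=x, 1=y, 2=z) whose displacement is taken (Δx | Δy | Δz).
    fapu:       (15,) MPEG-4 FAP units D_i used as scale references."""
    delta = frame_xyz - neutral_xyz                    # Δd_t
    disp = delta[fap_marker, fap_axis]                 # one displacement component per FAP
    return disp / fapu * 1024.0                        # FAP_i^t = (Δ· / D_i) · 1024

def fap_sequence(frames_xyz, neutral_xyz, fap_marker, fap_axis, fapu):
    """Stack the per-frame vectors into V = {v_1, ..., v_m}^T, shape (m, 15)."""
    return np.stack([fap_frame(f, neutral_xyz, fap_marker, fap_axis, fapu)
                     for f in frames_xyz])
```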
c) The speech is separated from the .avi file, the environmental noise is removed and gain amplification applied, and 16th-order Mel-frequency cepstral coefficients (MFCC) are then extracted in a Hamming window as the speech feature vector: a_t = {c_1^t, c_2^t, ..., c_16^t}^T denotes the speech feature vector of frame t, so the speech feature vector of an m-frame data sample can be expressed as
A = {a_1, a_2, ..., a_m}^T.
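A minimal sketch of this acoustic feature extraction, using librosa as one possible MFCC implementation (the patent does not name a toolkit); the 27 ms window and 5 ms shift follow the analysis setup described in the next section, and the sampling rate is an assumption.

```python
import librosa

def extract_mfcc(wav_path, sr=16000, n_mfcc=16, win_ms=27, hop_ms=5):
    """Extract 16th-order MFCCs in a Hamming window; returns A = {a_1, ..., a_m}^T
    with shape (m, 16), one row per 5 ms analysis frame."""
    y, sr = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(
        y=y, sr=sr, n_mfcc=n_mfcc,
        n_fft=int(sr * win_ms / 1000),        # 27 ms analysis window
        hop_length=int(sr * hop_ms / 1000),   # 5 ms window shift -> 200 frames/s
        window="hamming",
    )
    return mfcc.T
```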
4. Segmenting the multi-modal data synchronously with the phoneme as the primitive unit
To synchronize the audio and video feature vectors, the multi-modal data must be segmented synchronously with the phoneme as the primitive unit.
a) For audio, the speech analysis window size is WinSize = 27 ms and the window shift is WinMove = 5 ms, so the audio frame rate is AudioFrameSample = 1 × 1000 / WinMove = 200 frames/s, while the video frame rate is VideoFrameSample = 75 frames/s.
b) The ratio of audio frames to video frames can therefore be computed as n = AudioFrameSample / VideoFrameSample, and the synchronized audio-video feature sets are partitioned according to this ratio n.
c) For audio, boundary markers and sentence-end marks are recorded with the phoneme as the primitive.
After the above processing, all training data are stored in the training database with the primitive as the storage unit, to be retrieved and matched in the dynamic primitive selection stage.
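The sketch below illustrates how the audio (200 frames/s) and video (75 frames/s) feature streams could be cut into synchronized phoneme primitives from a list of phoneme boundaries; the boundary format (phoneme label plus start/end times in seconds) is an assumption for illustration.

```python
AUDIO_FPS = 200   # 5 ms window shift   -> 200 audio frames per second
VIDEO_FPS = 75    # motion capture rate -> 75 video frames per second

def segment_primitives(mfcc, faps, phone_boundaries):
    """Cut the audio feature matrix (m_a, 16) and video feature matrix (m_v, 15)
    into phoneme primitives given (phone, start_s, end_s) boundary records."""
    primitives = []
    for phone, start_s, end_s in phone_boundaries:
        primitives.append({
            "phone": phone,
            "audio": mfcc[int(start_s * AUDIO_FPS):int(end_s * AUDIO_FPS)],
            "video": faps[int(start_s * VIDEO_FPS):int(end_s * VIDEO_FPS)],
        })
    return primitives
```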
5. Using the dynamic primitive selection method to output a face animation parameter sequence synchronized with the speech (Fig. 4)
a) For a new speech sequence input by the user, the speech features are first extracted.
b) For each phoneme primitive, candidate primitives are then retrieved from the training database according to the audio matching error C_a and the visual matching error C_v between adjacent primitives.
c) The audio matching error C_a is computed first:
i. Each phoneme primitive is classified as either a final (vowel unit) or an initial (consonant unit). For finals, the contextual information of the two preceding and two following primitives is considered; for initials, the contextual information of the one preceding and one following primitive is considered.
ii. The contextual information is computed with a linear function representing the coarticulation coefficient: the coarticulation coefficient of the current primitive is maximal, and within the preceding and following primitives the per-frame coarticulation coefficient decreases linearly.
iii. The resulting audio matching error is
C_a = Σ_{t=1}^{n} Σ_m w_m · a(t_{t+m}, u_{t+m}),  m ∈ [-1, 0, +1] or [-2, -1, 0, +1, +2],
where w_m is the coarticulation coefficient and a(t_{t+m}, u_{t+m}) is the Euclidean distance between the acoustic features of the primitives. To reduce the complexity of the Viterbi path search, the number of candidate primitives is limited to 10.
d) The visual matching error C_v between adjacent primitives is then computed:
i. For each phoneme primitive, the smoothness of the junction between adjacent primitives is considered by computing how well the face motion parameters of the last frame of the preceding primitive match those of the first frame of the following primitive.
ii. The resulting visual matching error is
C_v = Σ_{t=2}^{n} v(u_{t-1}, u_t),
where v(u_{t-1}, u_t) is the Euclidean distance between the video parameters of the adjacent frames of two adjacent primitives.
e) Finally the total cost function COST = α·C_a + (1 − α)·C_v is computed. As shown in Fig. 4, after a sentence is finished the Viterbi algorithm searches for an optimal path that minimizes the total cost function; concatenating the primitives along this path yields a continuous face motion parameter sequence.
f) The parameter sequence is smoothed and fed into the three-dimensional face model to drive the face animation.
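Steps c)-e) above can be read as a Viterbi search over at most 10 candidates per phoneme, with the audio cost as the state cost and the visual join cost as the transition cost. The sketch below is a simplified illustration under stated assumptions: candidate and target primitives are dictionaries like those produced by segment_primitives above, the coarticulation weights are applied per neighboring primitive rather than per frame, and the weight α and the helper names are not specified by the patent.

```python
import numpy as np

MAX_CANDIDATES = 10   # upper bound on candidates per phoneme, as stated above

def coarticulation_weights(n_context):
    """Linearly decreasing coarticulation coefficients, maximal at the current
    primitive (m = 0), e.g. [0.5, 1.0, 0.5] for n_context = 1 (an assumed scheme)."""
    return [1.0 - abs(m) / (n_context + 1) for m in range(-n_context, n_context + 1)]

def audio_cost(target_ctx, cand_ctx, weights):
    """C_a contribution: coarticulation-weighted Euclidean distances between the
    target's and the candidate's MFCC matrices over the context window."""
    cost = 0.0
    for w, t_a, u_a in zip(weights, target_ctx, cand_ctx):
        n = min(len(t_a), len(u_a))                    # align by truncation
        cost += w * np.linalg.norm(t_a[:n] - u_a[:n])
    return cost

def visual_cost(prev_cand, cand):
    """C_v contribution: distance between the last FAP frame of the previous
    primitive and the first FAP frame of the current one."""
    return np.linalg.norm(prev_cand["video"][-1] - cand["video"][0])

def select_path(targets, candidates, alpha=0.5):
    """Viterbi search minimizing COST = alpha*C_a + (1-alpha)*C_v over a sentence.

    targets:    per-phoneme dicts with 'context' (list of MFCC matrices) and 'weights'.
    candidates: per-phoneme lists (length <= MAX_CANDIDATES) of database primitives,
                each with 'context' (MFCC matrices) and 'video' (FAP matrix)."""
    n = len(candidates)
    state = [[alpha * audio_cost(targets[i]["context"], c["context"], targets[i]["weights"])
              for c in candidates[i]] for i in range(n)]
    best, back = [state[0][:]], [[-1] * len(candidates[0])]
    for i in range(1, n):
        row_cost, row_back = [], []
        for cand, a_c in zip(candidates[i], state[i]):
            trans = [best[i - 1][k] + (1 - alpha) * visual_cost(prev, cand)
                     for k, prev in enumerate(candidates[i - 1])]
            k_best = int(np.argmin(trans))
            row_cost.append(trans[k_best] + a_c)
            row_back.append(k_best)
        best.append(row_cost)
        back.append(row_back)
    # backtrack the minimum-cost path and return the selected primitives
    j = int(np.argmin(best[-1]))
    path = [j]
    for i in range(n - 1, 0, -1):
        j = back[i][j]
        path.append(j)
    path.reverse()
    return [candidates[i][path[i]] for i in range(n)]
```

The FAP frames of the selected primitives can then be concatenated, smoothed, and used to drive the face model as in step f).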
6. Evaluation of the experimental results (Fig. 5, Fig. 6)
Both quantitative and qualitative evaluations of the experimental results are given here. For the quantitative evaluation, the error between the synthesized parameter sequences and the recorded parameter sequences of the corresponding database samples is computed; the correlation coefficient (Equation 2), which reflects the similarity of the variation of the two curve trajectories, is used to measure this error. Open-set tests (the test set does not include training data) and closed-set tests (the test set includes training data) were both carried out, and Table 2 gives the average correlation coefficients of the synthesized parameter sequences over the entire database. Fig. 5 compares the synthesized parameter sequences of one validation sample and one test sample with the actually recorded sequences: Fig. 5(a) shows the synthesized and recorded FAP#51 and FAP#52 sequences of a validation sample, and Fig. 5(b) shows those of a test sample.
CC = (1/T) · Σ_{t=1}^{T} (f̂(t) − μ̂)(f(t) − μ) / (σ̂·σ)  (2)
FAP No.    Validation    Test
#3         0.855         0.662
#14        0.674         0.656
#15        0.553         0.761
#16        0.767         0.677
#17        0.716         0.788
#51        0.820         0.732
#52        0.812         0.741
#53        0.817         0.722
#54        0.832         0.636
#55        0.709         0.684
#56        0.768         0.736
#57        0.749         0.714
#58        0.787         0.720
#59        0.661         0.640
#60        0.569         0.730
Table 2: Average correlation coefficients of the synthesized FAP parameters over the entire database
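A small sketch of the correlation coefficient of Equation 2, as it could be computed for one synthesized trajectory f̂(t) against the recorded trajectory f(t) (function name assumed):

```python
import numpy as np

def correlation_coefficient(f_hat, f):
    """Equation 2: CC = (1/T) Σ (f̂(t) - μ̂)(f(t) - μ) / (σ̂ σ)."""
    f_hat, f = np.asarray(f_hat, float), np.asarray(f, float)
    return np.mean((f_hat - f_hat.mean()) * (f - f.mean())) / (f_hat.std() * f.std())
```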
Finally, a qualitative analysis of the synthesized face animation is given, i.e. whether the perceived animation sequence looks realistic and natural. The parameter sequence is therefore smoothed and fed into the three-dimensional face model to drive the face animation, as shown in Fig. 6.
The present invention relates to a voice-driven three-dimensional face animation method based on dynamic primitive selection that can be applied to multimedia human-computer interaction systems, converting speech in any language from any user into synchronized speech and three-dimensional face animation output. The experimental results obtained with this method are good in both the quantitative and the qualitative evaluation, and when new speech sequences are input by unseen speakers the synthesized results are also realistic and natural. The voice-driven face animation method based on dynamic primitive selection provided by the invention is therefore easy to implement, the synthesized results maintain good validity and naturalness, and the method is suitable for multiple users and multiple languages.

Claims (5)

1. A voice-driven face animation method based on dynamic primitive selection, characterized in that: real-time motion capture equipment is used to build a multi-modal database; speech analysis and motion analysis techniques are used to extract the audio and video features; the multi-modal data are segmented synchronously with the phoneme as the primitive unit; for a speech sequence given by the user, the audio matching error of each primitive and the visual matching error between adjacent primitives are computed, an optimal path is dynamically selected among the candidate primitives, and a face animation parameter sequence synchronized with the speech sequence is output to drive a three-dimensional face animation model, converting speech in any language from any user into synchronized speech and three-dimensional face animation output; the method comprises the steps of:
A. creating a multi-modal database with a real-time motion capture system;
B. performing audio-video analysis on the multi-modal data to obtain the corresponding feature vectors;
C. segmenting the multi-modal data synchronously with the phoneme as the primitive unit;
D. using the dynamic primitive selection method to output a face animation parameter sequence synchronized with the user's input speech.
2. The method according to claim 1, characterized in that creating the multi-modal database with a real-time motion capture system comprises the steps of:
a) attaching 50 feature points consistent with the MPEG-4 face animation standard to the subject's face, 5 of which are attached to a hair band on the head and used to compensate for rigid head motion;
b) using a corpus consisting of a training set of 286 sentences, each recorded 5 times (4 recordings used as training samples and 1 as a validation sample), and a test set of 9 sentences;
c) recording the coordinate information of the feature points and the synchronized speech information while the real-time motion capture system records the facial motion.
3. The method according to claim 1, characterized in that performing audio-video analysis on the multi-modal data to obtain the corresponding feature vectors comprises the steps of:
a) compensating the feature point motion data obtained by the motion capture system, d = {x_1, y_1, z_1, ..., x_n, y_n, z_n}^T ∈ R^(3n), n = 45, for rigid motion: the coordinates of the 5 head markers are matched with a least-squares three-dimensional point matching method to estimate the rigid head motion q_i = R·p_i + t, i = 1, ..., 5, and the inverse transform of Equation 1 is then applied to the three-dimensional coordinates of the other 45 points in each frame, yielding the non-rigid facial motion;
b) computing the difference between each frame's feature point coordinates and those of the neutral frame, Δd_t = {Δx_1, Δy_1, Δz_1, ..., Δx_n, Δy_n, Δz_n}^T ∈ R^(3n), n = 45, together with the scale reference quantity of each feature point, the FAP unit defined by MPEG-4, D = {D_1, D_2, ..., D_n}; the i-th face motion parameter of frame t is obtained from FAP_i^t = ((Δx_i | Δy_i | Δz_i) / D_i) × 1024, and 15 parameters related to lip motion are finally extracted for each frame: v_t = {FAP_1^t, FAP_2^t, ..., FAP_15^t}^T;
c) for audio, extracting 16th-order Mel-frequency cepstral coefficients (MFCC) in a Hamming window as the speech feature vector.
4. The method according to claim 1, characterized in that segmenting the multi-modal data synchronously with the phoneme as the primitive unit comprises the steps of:
a) for audio, using a speech analysis window of WinSize = 27 ms with a window shift of WinMove = 5 ms, so that the audio frame rate is AudioFrameSample = 1 × 1000 / WinMove = 200 frames/s, while the video frame rate is VideoFrameSample = 75 frames/s;
b) computing the ratio of audio frames to video frames, n = AudioFrameSample / VideoFrameSample, and partitioning the synchronized audio-video feature sets according to the ratio n;
c) for audio, recording boundary markers and sentence-end marks with the phoneme as the primitive.
5. The method according to claim 1, characterized in that using the dynamic primitive selection method to output a face animation parameter sequence synchronized with the user's input speech comprises the steps of:
a) extracting the speech features of the given new speech;
b) for each phoneme primitive, retrieving candidate primitives from the training database according to the audio matching error C_a and the visual matching error C_v between adjacent primitives;
c) computing the audio matching error C_a:
i. classifying each phoneme primitive as either a final or an initial; for finals, the contextual information of the two preceding and two following primitives is considered, and for initials, the contextual information of the one preceding and one following primitive is considered;
ii. computing the contextual information with a linear function representing the coarticulation coefficient;
iii. obtaining the audio matching error
C_a = Σ_{t=1}^{n} Σ_m w_m · a(t_{t+m}, u_{t+m}),  m ∈ [-1, 0, +1] or [-2, -1, 0, +1, +2],
where w_m is the coarticulation coefficient and a(t_{t+m}, u_{t+m}) is the Euclidean distance between the acoustic features of the primitives;
d) computing the visual matching error C_v:
i. for each phoneme primitive, considering the smoothness of the junction between adjacent primitives by computing how well the face motion parameters of the last frame of the preceding primitive match those of the first frame of the following primitive;
ii. obtaining the visual matching error C_v = Σ_{t=2}^{n} v(u_{t-1}, u_t), where v(u_{t-1}, u_t) is the Euclidean distance between the video parameters of the adjacent frames of two adjacent primitives;
e) computing the total cost function COST = α·C_a + (1 − α)·C_v; after a sentence is finished, the Viterbi algorithm searches for an optimal path that minimizes the total cost function, and concatenating the primitives along this path yields a continuous face motion parameter sequence;
f) smoothing the parameter sequence and feeding it into the three-dimensional face model to drive the face animation.
CN 200510086646 2005-10-20 2005-10-20 Three-dimensional face cartoon method driven by voice based on dynamic elementary access Pending CN1952850A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN 200510086646 CN1952850A (en) 2005-10-20 2005-10-20 Three-dimensional face cartoon method driven by voice based on dynamic elementary access

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN 200510086646 CN1952850A (en) 2005-10-20 2005-10-20 Three-dimensional face cartoon method driven by voice based on dynamic elementary access

Publications (1)

Publication Number Publication Date
CN1952850A true CN1952850A (en) 2007-04-25

Family

ID=38059214

Family Applications (1)

Application Number Title Priority Date Filing Date
CN 200510086646 Pending CN1952850A (en) 2005-10-20 2005-10-20 Three-dimensional face cartoon method driven by voice based on dynamic elementary access

Country Status (1)

Country Link
CN (1) CN1952850A (en)

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101488346B (en) * 2009-02-24 2011-11-02 深圳先进技术研究院 Speech visualization system and speech visualization method
CN102176313B (en) * 2009-10-10 2012-07-25 北京理工大学 Formant-frequency-based Mandarin single final vioce visualizing method
CN106327555A (en) * 2016-08-24 2017-01-11 网易(杭州)网络有限公司 Method and device for obtaining lip animation
CN111279413A (en) * 2017-10-26 2020-06-12 斯纳普公司 Joint audio and video facial animation system
CN110610534A (en) * 2019-09-19 2019-12-24 电子科技大学 Automatic mouth shape animation generation method based on Actor-Critic algorithm
CN110610534B (en) * 2019-09-19 2023-04-07 电子科技大学 Automatic mouth shape animation generation method based on Actor-Critic algorithm
CN111508064B (en) * 2020-04-14 2022-06-17 北京世纪好未来教育科技有限公司 Expression synthesis method and device based on phoneme driving and computer storage medium
CN111508064A (en) * 2020-04-14 2020-08-07 北京世纪好未来教育科技有限公司 Expression synthesis method and device based on phoneme driving and computer storage medium
CN113744371A (en) * 2020-05-29 2021-12-03 武汉Tcl集团工业研究院有限公司 Method, device, terminal and storage medium for generating face animation
CN113744371B (en) * 2020-05-29 2024-04-16 武汉Tcl集团工业研究院有限公司 Method, device, terminal and storage medium for generating face animation
CN112102448A (en) * 2020-09-14 2020-12-18 北京百度网讯科技有限公司 Virtual object image display method and device, electronic equipment and storage medium
CN112102448B (en) * 2020-09-14 2023-08-04 北京百度网讯科技有限公司 Virtual object image display method, device, electronic equipment and storage medium
CN114610158A (en) * 2022-03-25 2022-06-10 Oppo广东移动通信有限公司 Data processing method and device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN101751692B (en) Method for voice-driven lip animation
CN111915707B (en) Mouth shape animation display method and device based on audio information and storage medium
CN1952850A (en) Three-dimensional face cartoon method driven by voice based on dynamic elementary access
CN103218842B (en) A kind of voice synchronous drives the method for the three-dimensional face shape of the mouth as one speaks and facial pose animation
CN112053690B (en) Cross-mode multi-feature fusion audio/video voice recognition method and system
CN107515674A (en) It is a kind of that implementation method is interacted based on virtual reality more with the mining processes of augmented reality
CN107423398A (en) Exchange method, device, storage medium and computer equipment
Sargin et al. Analysis of head gesture and prosody patterns for prosody-driven head-gesture animation
KR20060090687A (en) System and method for audio-visual content synthesis
CN112151030A (en) Multi-mode-based complex scene voice recognition method and device
CN113835522A (en) Sign language video generation, translation and customer service method, device and readable medium
Wang et al. Synthesizing photo-real talking head via trajectory-guided sample selection
Ma et al. Accurate visible speech synthesis based on concatenating variable length motion capture data
Wang et al. HMM trajectory-guided sample selection for photo-realistic talking head
CN115188074A (en) Interactive physical training evaluation method, device and system and computer equipment
CN113822187B (en) Sign language translation, customer service, communication method, device and readable medium
CN116665275A (en) Facial expression synthesis and interaction control method based on text-to-Chinese pinyin
Asadiabadi et al. Multimodal speech driven facial shape animation using deep neural networks
Beskow et al. Data-driven synthesis of expressive visual speech using an MPEG-4 talking head.
Putra et al. Designing translation tool: Between sign language to spoken text on kinect time series data using dynamic time warping
Liu et al. Real-time speech-driven animation of expressive talking faces
Li et al. A novel speech-driven lip-sync model with CNN and LSTM
Lan et al. Low level descriptors based DBLSTM bottleneck feature for speech driven talking avatar
Zhao et al. Realizing speech to gesture conversion by keyword spotting
Filntisis et al. Photorealistic adaptation and interpolation of facial expressions using HMMS and AAMS for audio-visual speech synthesis

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication