CN1971621A - Generating method of cartoon face driven by voice and text together - Google Patents

Generating method of cartoon face driven by voice and text together

Info

Publication number
CN1971621A
CN1971621A, CNA2006101144956A, CN200610114495A
Authority
CN
China
Prior art keywords
voice
syllable
viseme
text
voice segments
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CNA2006101144956A
Other languages
Chinese (zh)
Other versions
CN100476877C (en)
Inventor
Chen Yiqiang (陈益强)
Liu Junfa (刘军发)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhongke Music Intelligent Technology Jinan Co ltd
Original Assignee
Institute of Computing Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Computing Technology of CAS filed Critical Institute of Computing Technology of CAS
Priority to CNB2006101144956A priority Critical patent/CN100476877C/en
Publication of CN1971621A publication Critical patent/CN1971621A/en
Application granted granted Critical
Publication of CN100476877C publication Critical patent/CN100476877C/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Abstract

The invention relates to a method for generating cartoon face animation jointly driven by voice and text. The method comprises the steps of building a syllable-viseme mapping library, text analysis, speech segmentation and merging, and viseme-parameter splicing. Its advantages are: the syllable-viseme mapping can be customized, so that lip shapes and facial expressions with various exaggerated cartoon effects can be realized and a cartoon face animation is finally synthesized; no large-scale database is needed for training; and speech segmentation can be performed under the guidance of the text to extract the duration of each syllable, so that the synthesized face animation is synchronized with the lip shape and the expression.

Description

Method for generating a cartoon face jointly driven by voice and text
Technical field
The invention belongs to the field of computer animation and relates to computer graphics, animation techniques and speech analysis techniques, and in particular to a method for automatically generating a cartoon face jointly driven by voice and text.
Background art
In 2004 the output value of the global digital animation industry reached 222.8 billion US dollars, and the output value of peripheral derivative products related to the animation industry exceeded 500 billion US dollars; judging from the development of countries such as the United Kingdom, the United States, Japan and South Korea, animation has become a huge industry. As an important element of animation works, cartoon faces are widely loved.
In general, there are three kinds of automatic animation generation techniques. The first is the video-driven approach. Such methods track the motion of a real face in video and convert the facial motion information into motion parameters that control a face model. The current difficulty of this approach lies in extracting and tracking facial feature points; for this reason, marker points (such as reflective dots) are often attached to the tracked face. If all facial features or markers can be detected throughout, the extracted data can be mapped directly onto the parameters of the face model and good results are obtained. Video-driven synthesis is suitable for reproducing personalized expressions, but when precise lip-motion control is needed it is difficult for this technique to compute lip shapes synchronized with the speech. In addition, the implementation is rather complicated and the capture and tracking equipment is very expensive.
The second is the text-driven approach, in which the text is usually converted into animation viseme parameters through a database mapping. The viseme parameters have the following meaning: the expression state of a face at a given moment can be represented by a FAP (Facial Animation Parameter) value, which consists of the displacements of 68 feature points; a group of FAP values constitutes one viseme, which describes the face during a certain expression, and a series of visemes can simply be spliced together to generate continuous facial animation. For the process by which the text-driven approach generates animation, see the reference [Wang Zhiming, Cai Lianhong, Wu Zhiyong, Tao Jianhua, Research on Chinese text-to-visual-speech conversion, Mini-Micro Systems, Vol.23, No.4, pp.474-477, 2002.4]. In general, the correspondence between text units (such as syllables) and lip shapes and expressions is established first; the input text is then parsed, the resulting text units (syllables) are converted into lip-shape and expression parameters, faces are synthesized from these parameters, and the face sequence is spliced into an animation. Although the text-driven approach is intuitive and easy to drive, plain text lacks duration information, so the duration of the lip-shape and expression animation cannot be determined when the animation is synthesized; the generated animation therefore easily looks uncoordinated and unnatural. Figure 2 shows a typical text-driven face animation pipeline.
The third way of driving face animation is the voice-driven approach, which converts speech into animation viseme parameters. Current methods fall into two categories: speech recognition and speech-parameter mapping. (1) Text driving through speech recognition: whether based on a database or on rules, a correspondence between text and animation visemes must first be established at the level of single characters or words; the speech is then recognized into syllables and the animation is synthesized in the same way as in the text-driven approach. Because recognition itself is still a research field with much room for improvement, the accuracy of syllable recognition is not high and the synthesized animation is not lifelike enough. (2) The speech-parameter mapping approach, shown in Figure 3, maps speech feature parameters directly to animation viseme parameters. A large amount of speech data and the corresponding facial viseme data are collected first, and machine-learning methods such as artificial neural networks or hidden Markov models are then used to learn the mapping between the two, as in the method given in [Speech-driven face animation method based on machine learning, Chen Yiqiang, Gao Wen, Wang Zhaoqi, Jiang Dalong, Journal of Software, Vol.14, No.2]. This approach can achieve synchronization between expression changes, lip-shape changes and the speech, but because the collected expression and speech data come from real people, the final result can only imitate real people and cannot achieve the exaggerated effect of cartoons. In addition, as described in that document, the approach requires training on a large-scale audio-visual synchronized database, and the final result depends on the scale of the database and the robustness of the training method.
Summary of the invention
The object of the invention is to overcome the deficiencies of the prior art by taking into account the characteristics of both the text-driven and the voice-driven method, thereby providing a method for generating a cartoon face that can display rich expressions while keeping the lip shape and expression synchronized with the voice.
To achieve the above object, the method for generating a cartoon face jointly driven by voice and text provided by the invention comprises the following steps:
1) A piece of speech data and its corresponding text are input; text analysis is performed on the input text to extract all valid pronunciation syllables in it, thereby obtaining the number of syllables contained in the input speech; the syllable-viseme mapping library is then searched with these valid pronunciation syllables to obtain the viseme-parameter subsequence corresponding to each syllable in the input speech;
2) A double-threshold endpoint detection algorithm is used to segment the input speech data, yielding a series of speech segments whose number is larger than the number of syllables obtained in step 1); the shortest speech segment is then repeatedly merged with an adjacent segment until the number of speech segments equals the number of syllables obtained in step 1), and the duration of each final speech segment is taken as the duration of the corresponding syllable;
3) According to the syllable durations obtained in step 2), the viseme-parameter subsequences of the syllables obtained in step 1) are spliced into a viseme-parameter sequence for the whole input speech, and this sequence is used as the continuous animation parameters that are finally output.
In the above technical scheme, the syllable-viseme mapping library in step 1) is built as follows:
An actor with standard pronunciation and rich facial expressions is selected to read a corpus text aloud; the speech corresponding to this corpus covers all commonly used Chinese syllables;
Markers of the motion capture device are fixed on the face according to the MPEG-4 facial animation mesh; the motion capture device is used to record the actor's facial motion data while the speech is recorded at the same time, so that a synchronized speech sequence and viseme-parameter sequence are obtained; after post-segmentation, a syllable-viseme mapping library is obtained in which every commonly used syllable corresponds one-to-one to a viseme-parameter subsequence.
In the above technical scheme, the method of merging the shortest speech segment with an adjacent segment in step 2) is as follows: with the time axis running from left to right, the shortest speech segment is first found among all speech segments; the durations of the two segments adjacent to it on the left and on the right are then compared, and the shortest segment is merged with the shorter of these two neighbors into one segment; during the merge, the leftmost endpoint of the two segments becomes the starting point of the merged segment and the rightmost endpoint becomes its end point.
In the above technical scheme, step 3) comprises the following sub-steps:
31) According to the ratio between the duration of each syllable obtained in step 2) and the duration of the original syllable in the syllable-viseme mapping library, the viseme-parameter subsequence of each syllable obtained in step 1) is scaled proportionally, and the scaled subsequences are then spliced in order into a complete viseme-parameter sequence;
32) A third-order Hermite function is used to smooth the viseme-parameter sequence obtained in step 31), giving the final viseme-parameter sequence.
The advantages of the invention are: (1) the syllable-viseme mapping can be customized, so that lip shapes and facial expressions with various exaggerated effects can be realized and a cartoon face is finally synthesized; (2) no large-scale database is needed for training; (3) speech segmentation can be performed under the guidance of the text to extract syllable durations, so that a face animation synchronized with the lip shape and the expression is synthesized.
Description of drawings
Fig. 1 is a flowchart of speech endpoint detection
Fig. 2 is a schematic diagram of a typical text-driven face animation method
Fig. 3 is a schematic diagram of a typical voice-driven face animation method
Fig. 4 is a schematic diagram of the cartoon face jointly driven by voice and text
Fig. 5 is a flowchart of speech segmentation
Detailed description of the embodiments
The invention is further described below with reference to the drawings and specific embodiments.
Embodiment 1
As shown in Figure 4, the method of the invention comprises four steps: building the syllable-viseme mapping library, text analysis, speech segmentation, and viseme splicing.
First, a syllable-viseme mapping library is built; it contains the mapping between all commonly used Chinese syllables and their viseme parameters.
A piece of text is then input; after text analysis, a group of individual syllables and the number of syllables are obtained, and each syllable is mapped to its corresponding viseme.
All input syllables are also located in the speech by speech segmentation, which yields the duration of each syllable.
Finally, in the viseme splicing stage, the visemes of all syllables are stitched together according to the syllable durations, forming continuous viseme parameters from which a continuous animation can be synthesized.
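Viewed end to end, the four stages form a simple pipeline. The following Python sketch outlines that flow for illustration only; the helper functions (text_analysis, frame_features, thresholds, detect_endpoints, merge_segments, splice_visemes) are placeholder names that are sketched in the subsections below, and a common frame unit for durations is assumed.

import numpy as np

def drive_cartoon_face(speech: np.ndarray, text: str, library: dict) -> np.ndarray:
    """Outline of the whole pipeline; the helpers are sketched in the subsections below."""
    syllables = text_analysis(text)                                  # step 2: valid syllables of the text
    energy, zcr = frame_features(speech)                             # step 3: endpoint detection features
    e_low, e_high = thresholds(energy)
    z_low, z_high = thresholds(zcr)
    segments = detect_endpoints(energy, zcr, e_low, e_high, z_low, z_high)
    segments = merge_segments(segments, len(syllables))              # merge until counts match
    subseqs = [np.asarray(library[s]) for s in syllables]            # step 1: viseme-parameter lookup
    # ratio R = T1/T0; durations (analysis frames) and subsequence lengths (capture frames)
    # are assumed here to be expressed in the same frame unit
    ratios = [(end - start) / len(sub) for (start, end), sub in zip(segments, subseqs)]
    return splice_visemes(subseqs, ratios)                           # step 4: scale, concatenate, smooth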
The implementation of each step is as follows:
1. Building the syllable-viseme mapping library. According to (Cai Lianhong, Huang Dezhi, Cai Rui, Fundamentals and Applications of Modern Speech Technology, Tsinghua University Press, 2003), Chinese has 422 syllables. The method first collects the visemes of all syllables from a real face in an experiment. The data acquisition experiment was carried out as follows: 1) an actor with standard pronunciation and rich facial expressions was selected to read the text aloud; 2) the corpus is the one adopted by the 863 speech-synthesis corpus Coss-1 (Corpus of Speech Synthesis); it contains 265 sentences covering various aspects of social life, and the corresponding speech covers all syllables of Chinese; 3) markers made of a special material (supplied with the capture equipment) were fixed on the face according to the MPEG-4 (Moving Picture Experts Group) facial animation mesh, the motion capture device was used to record the actor's facial motion data, and the speech was recorded at the same time, so that a synchronized speech sequence and viseme-parameter sequence were obtained. The capture equipment used in this embodiment is from Vicon (http://www.vicon.com). The recorded speech sequence and viseme-parameter sequence are then segmented according to the text in a post-processing step, which finally yields a syllable-viseme mapping library based on a real person. The post-segmentation proceeds as follows: a complete syllable is cut out of the speech sequence with a speech segmentation tool (such as praat), giving the start and end time of that syllable; according to this start and end time, the viseme-parameter subsequence corresponding to the syllable is cut out of the viseme-parameter sequence. The viseme-parameter subsequences of all syllables are obtained in the same way, thus establishing a syllable-viseme mapping library in which every syllable corresponds one-to-one to a viseme-parameter subsequence.
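The post-segmentation amounts to indexing the captured viseme-parameter sequence with the syllable boundaries exported from the segmentation tool. A minimal sketch, assuming the boundaries are available as (syllable, start, end) tuples in seconds and the viseme parameters are sampled at a fixed capture frame rate (all identifiers are illustrative, not part of the patent):

from typing import Dict, List, Tuple

def build_mapping_library(
    viseme_frames: List[List[float]],             # one FAP vector per captured frame
    frame_rate: float,                            # capture frame rate in Hz
    boundaries: List[Tuple[str, float, float]],   # (syllable, start_s, end_s) from e.g. praat
) -> Dict[str, List[List[float]]]:
    """Cut the synchronized viseme-parameter sequence into per-syllable subsequences."""
    library: Dict[str, List[List[float]]] = {}
    for syllable, start_s, end_s in boundaries:
        start_frame = int(round(start_s * frame_rate))
        end_frame = int(round(end_s * frame_rate))
        # keep the first occurrence of each syllable; repeated occurrences could be averaged instead
        library.setdefault(syllable, viseme_frames[start_frame:end_frame])
    return library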
Since cartoon visemes do not need to be realistic, and in order to further emphasize exaggeration and a sense of unreality, the invention may also edit the real viseme database manually. In addition, because the real data were collected over a long period, the actor inevitably became tired, making the visemes of some syllables not very pronounced, so these also need to be adjusted. The editing tool is fairly simple and has been publicly reported, for example in [A face animation system on a mobile-phone platform, Wang Jie, Wang Zhaoqi, Huang He, Xia Shihong, First Intelligent CAD and Digital Entertainment Conference (CIDE2004)]. Its input is an initial viseme; the mesh used is the general MPEG-4 face animation mesh, and mesh vertices are dragged manually to suitable positions until the expression and lip shape corresponding to the mesh look appropriate.
2. Text analysis. The purpose of text analysis is to extract all valid pronunciation syllables in the input text so that these syllables can be reflected in the lip animation. It mainly comprises: 1) removing the various punctuation marks in the text, such as quotation marks, book-title marks, dashes and colons, which cannot be reflected in the animation. Although some punctuation marks, such as commas or colons, may correspond to a static state of the face, removing them does not affect the animation, because the corresponding silent segments are detected in the subsequent speech segmentation step. 2) Converting Roman numerals and some English letters into the corresponding Chinese syllables; only after conversion into Chinese syllables can the corresponding visemes be found in the syllable-viseme mapping library.
Every text unit obtained by text analysis is a valid syllable unit whose viseme can be found in the syllable-viseme mapping library. For example, an input sentence meaning "Part 2 of A Chinese Odyssey is very funny", which contains the digit "2" and the English word "funny", becomes after text analysis a pure syllable sequence in which "2" is replaced by the syllable "er" and "funny" by approximate Chinese syllables.
After text analysis, a group of individual syllables and the number of syllables are obtained; each individual syllable is the pinyin corresponding to one Chinese character in the text, and the number of syllables is the number of Chinese characters it contains.
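A minimal text-analysis sketch along these lines, assuming the third-party pypinyin package for the character-to-syllable conversion and handling only single digits; mapping English words to approximate Chinese syllables, as in the example above, is not covered, and all identifiers are illustrative:

import re
from pypinyin import lazy_pinyin  # assumed third-party dependency for character -> pinyin

DIGIT_SYLLABLES = {"0": "ling", "1": "yi", "2": "er", "3": "san", "4": "si",
                   "5": "wu", "6": "liu", "7": "qi", "8": "ba", "9": "jiu"}

def text_analysis(text: str) -> list[str]:
    """Return the list of valid pronunciation syllables for the input text."""
    syllables: list[str] = []
    for ch in text:
        if ch in DIGIT_SYLLABLES:                 # digits become Chinese number syllables
            syllables.append(DIGIT_SYLLABLES[ch])
        elif re.match(r"[\u4e00-\u9fff]", ch):    # Chinese characters -> pinyin syllables
            syllables.extend(lazy_pinyin(ch))
        # punctuation, spaces and other symbols are simply dropped
    return syllables

# e.g. text_analysis("第2部") -> ["di", "er", "bu"]; len(...) gives the syllable count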
3. Speech segmentation. The purpose of speech segmentation is to locate the text syllables in the speech sequence and thereby obtain the duration parameter of each syllable. Many speech segmentation algorithms exist; because this method already knows the text corresponding to the speech, a segmentation algorithm based on the text information has been designed. The text analysis step gives the number of syllables, so the number of units into which the speech must be cut is known before segmentation. This makes it possible to detect as many endpoints as possible in advance, the number of endpoints being larger than the number of syllables; the shortest adjacent segments are then merged repeatedly until the number of detected segments equals the number of syllables, as shown in Figure 5.
The speech segmentation of this embodiment uses a double-threshold endpoint detection algorithm, which requires no model training and yields fairly accurate duration parameters. It uses short-time energy and zero-crossing rate, the two most basic and most important time-domain features of speech. Figure 1 shows the endpoint detection procedure.
The speech signal is first normalized so that its amplitude lies within [-1, 1].
The zero-crossing rate and short-time energy of the speech signal are then computed, and a high and a low threshold are set for each of them. The choice of these thresholds depends on the input speech: statistics of short-time energy and zero-crossing rate are gathered over the input speech, the mean of each is computed, 5% of the mean is taken as the low threshold and 10% of the mean as the high threshold. Exceeding the low threshold does not necessarily mark the beginning of speech, since it may be caused by a very short burst of noise, whereas exceeding the high threshold can basically be attributed to the speech signal. In order to produce more endpoints, the rather low values of 5% and 10% were chosen in this embodiment; clearly, other thresholds may be chosen according to the actual situation when implementing the invention.
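A minimal sketch of the feature and threshold computation; the frame length, hop size and NumPy-based framing are illustrative choices (here a 256-sample frame and 128-sample hop, i.e. 8 ms at 16 kHz), not values prescribed by the patent:

import numpy as np

def frame_features(signal: np.ndarray, frame_len: int = 256, hop: int = 128):
    """Per-frame short-time energy and zero-crossing rate of a normalized signal."""
    signal = signal / (np.max(np.abs(signal)) + 1e-12)            # normalize to [-1, 1]
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop)
    energy = np.empty(n_frames)
    zcr = np.empty(n_frames)
    for i in range(n_frames):
        frame = signal[i * hop: i * hop + frame_len]
        energy[i] = np.sum(frame ** 2)                            # short-time energy
        zcr[i] = np.mean(np.abs(np.diff(np.sign(frame))) > 0)     # zero-crossing rate
    return energy, zcr

def thresholds(feature: np.ndarray, low_ratio: float = 0.05, high_ratio: float = 0.10):
    """Low and high thresholds as 5% and 10% of the feature mean, as in this embodiment."""
    mean = float(np.mean(feature))
    return low_ratio * mean, high_ratio * mean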
A speech signal is generally divided into four kinds of section: silence, transition, speech and ending. The endpoints of the signal are the start and end points of these sections; endpoint detection marks all of these points, which can also be understood as cutting the original signal into sections of the above four types. In the silence section, if the energy or the zero-crossing rate exceeds the low threshold, a starting point is marked and the state changes to transition. In the transition section the parameter values are still small, so it cannot yet be decided whether real speech has begun; as soon as both the energy and the zero-crossing rate fall back below the low threshold, the state returns to silence, whereas if either parameter exceeds the high threshold in the transition section, speech has certainly begun. Noise may raise the short-time energy or the zero-crossing rate to high values, but usually only for a very short time, so a minimum-duration threshold is used: when the current state is speech and the total marked duration is shorter than this threshold (generally 20-100 ms; 60 ms in this embodiment), the section is regarded as noise and the scanning of the subsequent speech data continues; otherwise an end point is marked.
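A minimal sketch of this double-threshold state machine over the per-frame features computed above; the state names, frame-to-millisecond conversion and return format are illustrative assumptions:

def detect_endpoints(energy, zcr, e_low, e_high, z_low, z_high,
                     hop_ms: float = 8.0, min_speech_ms: float = 60.0):
    """Return (start_frame, end_frame) pairs for detected speech segments."""
    segments, state, start = [], "silence", 0
    for i, (e, z) in enumerate(zip(energy, zcr)):
        if state == "silence":
            if e > e_low or z > z_low:                # possible start of speech
                state, start = "transition", i
        elif state == "transition":
            if e > e_high or z > z_high:              # confirmed speech
                state = "speech"
            elif e < e_low and z < z_low:             # fell back below the low thresholds
                state = "silence"
        elif state == "speech":
            if e < e_low and z < z_low:               # candidate end point
                if (i - start) * hop_ms >= min_speech_ms:
                    segments.append((start, i))       # long enough: keep as a speech segment
                state = "silence"                     # otherwise discarded as short noise
    if state == "speech":
        segments.append((start, len(energy)))         # signal ended while still in speech
    return segments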
After enough endpoints have been detected in the previous step, the endpoint merging proceeds as shown in Figure 5. The shortest speech segment is found among all segments; the durations of the two segments adjacent to it on the left and on the right are compared, and the shortest segment is merged with the shorter of the two neighbors into one segment. The leftmost endpoint of the two segments becomes the starting point of the new syllable and the rightmost endpoint becomes its end point (the time axis runs from left to right). Any transition or silence section that may have existed between the two speech segments automatically becomes part of the new syllable. Segments are merged in this way repeatedly until the number of segments equals the number of syllables obtained by text analysis. This yields the duration information of each speech segment (i.e., of each syllable).
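A minimal sketch of the merge loop, operating on time-ordered (start, end) frame pairs; the handling of a shortest segment at either end of the list, where only one neighbor exists, is an illustrative choice:

def merge_segments(segments: list[tuple[int, int]], n_syllables: int) -> list[tuple[int, int]]:
    """Merge the shortest segment with its shorter neighbor until n_syllables segments remain."""
    segs = list(segments)
    while len(segs) > n_syllables:
        i = min(range(len(segs)), key=lambda k: segs[k][1] - segs[k][0])  # shortest segment
        if i == 0:
            j = 1                                      # only a right neighbor exists
        elif i == len(segs) - 1:
            j = i - 1                                  # only a left neighbor exists
        else:                                          # pick the shorter of the two neighbors
            left_len = segs[i - 1][1] - segs[i - 1][0]
            right_len = segs[i + 1][1] - segs[i + 1][0]
            j = i - 1 if left_len <= right_len else i + 1
        lo, hi = min(i, j), max(i, j)
        merged = (segs[lo][0], segs[hi][1])            # leftmost start, rightmost end
        segs[lo:hi + 1] = [merged]
    return segs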
4. Viseme splicing. After the two preceding steps, the viseme parameters of the syllables and the corresponding duration parameters are available, and the viseme-parameter subsequences of the syllables are spliced according to these durations into a complete viseme-parameter sequence (i.e., a group of continuous animation parameters) corresponding to the input speech. However, when a person speaks the lip shape changes continuously and each lip shape is influenced by the lip shapes before and after it, while the visemes from the syllable-viseme mapping library are basic types; if they are simply spliced according to the segmentation times and played without further processing, the result is rather coarse and lacks realism.
The viseme-parameter sequence therefore needs to be adjusted. The invention adopts the method in the published reference [Wang Zhiming, Cai Lianhong, Wu Zhiyong, Tao Jianhua, Research on Chinese text-to-visual-speech conversion, Mini-Micro Systems, Vol.23, No.4, pp.474-477, 2002.4], using a third-order Hermite curve to adjust the viseme-parameter sequence so that it is closer to reality, namely:
FAP(t) = FAP(t₁) + (3β² − 2β³)(FAP(t₂) − FAP(t₁))        (1)
where t₁, t and t₂ denote the previous moment, the intermediate moment and the following moment, respectively, in milliseconds, with t₁ ≤ t ≤ t₂; FAP(t) is the FAP parameter value at moment t, and β = (t − t₁)/(t₂ − t₁). The concrete implementation steps are as follows:
1) The viseme parameters of all syllables are scaled correspondingly. Each original syllable has a duration parameter T₀, and speech segmentation yields its duration T₁ in the current utterance, giving a ratio R = T₁/T₀; the viseme parameters of the syllable are therefore scaled proportionally by R, which yields a viseme-parameter sequence that satisfies the duration parameters.
2) The sequence is then spliced and smoothed according to formula (1). The viseme-parameter sequence consists of the time series (t₁, FAP(t₁)), (t₂, FAP(t₂)), ..., (t_N, FAP(t_N)). The computation proceeds iteratively according to formula (1): first the value at t₂ is recomputed from the data at t₁ and t₃, then the value at t₃ from t₂ and t₄, and so on, until the value at t_{N−1} is computed from t_{N−2} and t_N. The result is a new viseme-parameter sequence, which serves as the final continuous animation parameters corresponding to the input speech.
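A minimal sketch of the scaling and Hermite smoothing, treating each viseme-parameter subsequence as a NumPy array of per-frame FAP vectors; the linear resampling used for the R-fold scaling and the assumption of uniformly spaced frames (so that β = 0.5 for each interior frame) are illustrative choices:

import numpy as np

def scale_subsequence(frames: np.ndarray, ratio: float) -> np.ndarray:
    """Resample a (frames, fap_dims) subsequence to ratio * its original length."""
    n_out = max(2, int(round(frames.shape[0] * ratio)))
    src = np.linspace(0.0, frames.shape[0] - 1, n_out)
    idx = np.arange(frames.shape[0])
    return np.stack([np.interp(src, idx, frames[:, d]) for d in range(frames.shape[1])], axis=1)

def hermite_smooth(seq: np.ndarray) -> np.ndarray:
    """Recompute interior frames with FAP(t) = FAP(t1) + (3b^2 - 2b^3)(FAP(t2) - FAP(t1))."""
    out = seq.copy()
    for i in range(1, seq.shape[0] - 1):
        beta = 0.5                                 # frame i assumed midway between i-1 and i+1
        w = 3 * beta ** 2 - 2 * beta ** 3
        out[i] = out[i - 1] + w * (seq[i + 1] - out[i - 1])   # uses the already-updated left neighbor
    return out

def splice_visemes(subseqs: list, ratios: list) -> np.ndarray:
    """Scale each syllable's subsequence by its duration ratio, concatenate, then smooth."""
    scaled = [scale_subsequence(np.asarray(s, dtype=float), r) for s, r in zip(subseqs, ratios)]
    return hermite_smooth(np.concatenate(scaled, axis=0))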

Claims (4)

1. A method for generating a cartoon face jointly driven by voice and text, characterized by comprising the steps of:
1) inputting a piece of speech data and its corresponding text, performing text analysis on the input text to extract all valid pronunciation syllables in it and obtain the number of syllables contained in the input speech, and searching a syllable-viseme mapping library with the valid pronunciation syllables to obtain the viseme-parameter subsequence corresponding to each syllable in the input speech;
2) segmenting the input speech data with a double-threshold endpoint detection algorithm to obtain a series of speech segments whose number is larger than the number of syllables obtained in step 1), and repeatedly merging the shortest speech segment with an adjacent segment until the number of speech segments equals the number of syllables obtained in step 1), the duration of each final speech segment being taken as the duration of the corresponding syllable;
3) splicing, according to the syllable durations obtained in step 2), the viseme-parameter subsequences of the syllables obtained in step 1) into a viseme-parameter sequence of the whole input speech, and using this viseme-parameter sequence as the continuous animation parameters that are finally output.
2. The method for generating a cartoon face jointly driven by voice and text according to claim 1, characterized in that the syllable-viseme mapping library in step 1) is built as follows:
an actor with standard pronunciation and rich facial expressions is selected to read a corpus text aloud, the speech corresponding to the corpus covering all commonly used Chinese syllables;
markers of a motion capture device are fixed on the face according to the MPEG-4 facial animation mesh, the motion capture device is used to record the actor's facial motion data while the speech is recorded at the same time, so that a synchronized speech sequence and viseme-parameter sequence are obtained; after post-segmentation, a syllable-viseme mapping library is obtained in which every commonly used syllable corresponds one-to-one to a viseme-parameter subsequence.
3. The method for generating a cartoon face jointly driven by voice and text according to claim 1, characterized in that the method of merging the shortest speech segment with an adjacent segment in step 2) is as follows: with the time axis running from left to right, the shortest speech segment is first found among all speech segments, the durations of the two segments adjacent to it on the left and on the right are compared, and the shortest segment is merged with the shorter of these two neighbors into one segment; during the merge, the leftmost endpoint of the two segments is taken as the starting point of the merged segment and the rightmost endpoint as its end point.
4. The method for generating a cartoon face jointly driven by voice and text according to claim 1, characterized in that step 3) comprises the following sub-steps:
31) according to the ratio between the duration of each syllable obtained in step 2) and the duration of the original syllable in the syllable-viseme mapping library, scaling the viseme-parameter subsequence of each syllable obtained in step 1) proportionally, and splicing the scaled subsequences in order into a complete viseme-parameter sequence;
32) smoothing the viseme-parameter sequence obtained in step 31) with a third-order Hermite function to obtain the final viseme-parameter sequence.
CNB2006101144956A 2006-11-10 2006-11-10 Generating method of cartoon face driven by voice and text together Active CN100476877C (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CNB2006101144956A CN100476877C (en) 2006-11-10 2006-11-10 Generating method of cartoon face driven by voice and text together

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CNB2006101144956A CN100476877C (en) 2006-11-10 2006-11-10 Generating method of cartoon face driven by voice and text together

Publications (2)

Publication Number Publication Date
CN1971621A true CN1971621A (en) 2007-05-30
CN100476877C CN100476877C (en) 2009-04-08

Family

ID=38112424

Family Applications (1)

Application Number Title Priority Date Filing Date
CNB2006101144956A Active CN100476877C (en) 2006-11-10 2006-11-10 Generating method of cartoon face driven by voice and text together

Country Status (1)

Country Link
CN (1) CN100476877C (en)

Cited By (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101826216A (en) * 2010-03-31 2010-09-08 中国科学院自动化研究所 Automatic generating system for role Chinese mouth shape cartoon
CN101482976B (en) * 2009-01-19 2010-10-27 腾讯科技(深圳)有限公司 Method for driving change of lip shape by voice, method and apparatus for acquiring lip cartoon
CN101925949A (en) * 2008-01-23 2010-12-22 索尼公司 Method for deriving animation parameters and animation display device
CN102609969A (en) * 2012-02-17 2012-07-25 上海交通大学 Method for processing face and speech synchronous animation based on Chinese text drive
CN103218842A (en) * 2013-03-12 2013-07-24 西南交通大学 Voice synchronous-drive three-dimensional face mouth shape and face posture animation method
CN104424955A (en) * 2013-08-29 2015-03-18 国际商业机器公司 Audio graphical expression generation method and equipment, and audio searching method and equipment
WO2015117373A1 (en) * 2014-07-22 2015-08-13 中兴通讯股份有限公司 Method and device for realizing voice message visualization service
CN106653050A (en) * 2017-02-08 2017-05-10 康梅 Method for matching animation mouth shapes with voice in real time
CN107004287A (en) * 2014-11-05 2017-08-01 英特尔公司 Incarnation video-unit and method
CN107045870A (en) * 2017-05-23 2017-08-15 南京理工大学 A kind of the Method of Speech Endpoint Detection of feature based value coding
CN108447474A (en) * 2018-03-12 2018-08-24 北京灵伴未来科技有限公司 A kind of modeling and the control method of virtual portrait voice and Hp-synchronization
CN109727504A (en) * 2018-12-19 2019-05-07 安徽建筑大学 A kind of animation interactive system based on Art Design
CN110288077A (en) * 2018-11-14 2019-09-27 腾讯科技(深圳)有限公司 A kind of synthesis based on artificial intelligence is spoken the method and relevant apparatus of expression
CN110493613A (en) * 2019-08-16 2019-11-22 江苏遨信科技有限公司 A kind of synthetic method and system of video audio lip sync
CN110624247A (en) * 2018-06-22 2019-12-31 奥多比公司 Determining mouth movement corresponding to real-time speech using machine learning models
CN110677598A (en) * 2019-09-18 2020-01-10 北京市商汤科技开发有限公司 Video generation method and device, electronic equipment and computer storage medium
CN111225237A (en) * 2020-04-23 2020-06-02 腾讯科技(深圳)有限公司 Sound and picture matching method of video, related device and storage medium
CN111863001A (en) * 2020-06-17 2020-10-30 广州华燎电气科技有限公司 Method for inhibiting background noise in multi-party call system
CN112466287A (en) * 2020-11-25 2021-03-09 出门问问(苏州)信息科技有限公司 Voice segmentation method and device and computer readable storage medium
CN112541957A (en) * 2020-12-09 2021-03-23 北京百度网讯科技有限公司 Animation generation method, animation generation device, electronic equipment and computer readable medium
WO2021073416A1 (en) * 2019-10-18 2021-04-22 平安科技(深圳)有限公司 Method for generating virtual character video on the basis of neural network, and related device
CN114581813A (en) * 2022-01-12 2022-06-03 北京云辰信通科技有限公司 Visual language identification method and related equipment
CN110624247B (en) * 2018-06-22 2024-04-30 奥多比公司 Determining movement of a mouth corresponding to real-time speech using a machine learning model

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5657426A (en) * 1994-06-10 1997-08-12 Digital Equipment Corporation Method and apparatus for producing audio-visual synthetic speech
US6654018B1 (en) * 2001-03-29 2003-11-25 At&T Corp. Audio-visual selection process for the synthesis of photo-realistic talking-head animations
CN1152336C (en) * 2002-05-17 2004-06-02 清华大学 Method and system for computer conversion between Chinese audio and video parameters
CN1320497C (en) * 2002-07-03 2007-06-06 中国科学院计算技术研究所 Statistics and rule combination based phonetic driving human face carton method

Cited By (35)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101925949A (en) * 2008-01-23 2010-12-22 索尼公司 Method for deriving animation parameters and animation display device
US8606384B2 (en) 2008-01-23 2013-12-10 Sony Corporation Method for deriving animation parameters and animation display device
CN101482976B (en) * 2009-01-19 2010-10-27 腾讯科技(深圳)有限公司 Method for driving change of lip shape by voice, method and apparatus for acquiring lip cartoon
CN101826216B (en) * 2010-03-31 2011-12-07 中国科学院自动化研究所 Automatic generating system for role Chinese mouth shape cartoon
CN101826216A (en) * 2010-03-31 2010-09-08 中国科学院自动化研究所 Automatic generating system for role Chinese mouth shape cartoon
CN102609969A (en) * 2012-02-17 2012-07-25 上海交通大学 Method for processing face and speech synchronous animation based on Chinese text drive
CN103218842A (en) * 2013-03-12 2013-07-24 西南交通大学 Voice synchronous-drive three-dimensional face mouth shape and face posture animation method
CN103218842B (en) * 2013-03-12 2015-11-25 西南交通大学 A kind of voice synchronous drives the method for the three-dimensional face shape of the mouth as one speaks and facial pose animation
CN104424955B (en) * 2013-08-29 2018-11-27 国际商业机器公司 Generate figured method and apparatus, audio search method and the equipment of audio
CN104424955A (en) * 2013-08-29 2015-03-18 国际商业机器公司 Audio graphical expression generation method and equipment, and audio searching method and equipment
WO2015117373A1 (en) * 2014-07-22 2015-08-13 中兴通讯股份有限公司 Method and device for realizing voice message visualization service
CN107004287A (en) * 2014-11-05 2017-08-01 英特尔公司 Incarnation video-unit and method
CN107004287B (en) * 2014-11-05 2020-10-23 英特尔公司 Avatar video apparatus and method
CN106653050A (en) * 2017-02-08 2017-05-10 康梅 Method for matching animation mouth shapes with voice in real time
CN107045870A (en) * 2017-05-23 2017-08-15 南京理工大学 A kind of the Method of Speech Endpoint Detection of feature based value coding
CN108447474A (en) * 2018-03-12 2018-08-24 北京灵伴未来科技有限公司 A kind of modeling and the control method of virtual portrait voice and Hp-synchronization
CN108447474B (en) * 2018-03-12 2020-10-16 北京灵伴未来科技有限公司 Modeling and control method for synchronizing virtual character voice and mouth shape
CN110624247B (en) * 2018-06-22 2024-04-30 奥多比公司 Determining movement of a mouth corresponding to real-time speech using a machine learning model
CN110624247A (en) * 2018-06-22 2019-12-31 奥多比公司 Determining mouth movement corresponding to real-time speech using machine learning models
CN110288077A (en) * 2018-11-14 2019-09-27 腾讯科技(深圳)有限公司 A kind of synthesis based on artificial intelligence is spoken the method and relevant apparatus of expression
CN110288077B (en) * 2018-11-14 2022-12-16 腾讯科技(深圳)有限公司 Method and related device for synthesizing speaking expression based on artificial intelligence
CN109727504A (en) * 2018-12-19 2019-05-07 安徽建筑大学 A kind of animation interactive system based on Art Design
CN110493613A (en) * 2019-08-16 2019-11-22 江苏遨信科技有限公司 A kind of synthetic method and system of video audio lip sync
CN110677598A (en) * 2019-09-18 2020-01-10 北京市商汤科技开发有限公司 Video generation method and device, electronic equipment and computer storage medium
CN110677598B (en) * 2019-09-18 2022-04-12 北京市商汤科技开发有限公司 Video generation method and device, electronic equipment and computer storage medium
WO2021073416A1 (en) * 2019-10-18 2021-04-22 平安科技(深圳)有限公司 Method for generating virtual character video on the basis of neural network, and related device
CN111225237B (en) * 2020-04-23 2020-08-21 腾讯科技(深圳)有限公司 Sound and picture matching method of video, related device and storage medium
CN111225237A (en) * 2020-04-23 2020-06-02 腾讯科技(深圳)有限公司 Sound and picture matching method of video, related device and storage medium
US11972778B2 (en) 2020-04-23 2024-04-30 Tencent Technology (Shenzhen) Company Limited Sound-picture matching method of video, related apparatus, and storage medium
CN111863001A (en) * 2020-06-17 2020-10-30 广州华燎电气科技有限公司 Method for inhibiting background noise in multi-party call system
CN112466287A (en) * 2020-11-25 2021-03-09 出门问问(苏州)信息科技有限公司 Voice segmentation method and device and computer readable storage medium
CN112466287B (en) * 2020-11-25 2023-06-27 出门问问(苏州)信息科技有限公司 Voice segmentation method, device and computer readable storage medium
CN112541957A (en) * 2020-12-09 2021-03-23 北京百度网讯科技有限公司 Animation generation method, animation generation device, electronic equipment and computer readable medium
US11948236B2 (en) 2020-12-09 2024-04-02 Beijing Baidu Netcom Science Technology Co., Ltd. Method and apparatus for generating animation, electronic device, and computer readable medium
CN114581813A (en) * 2022-01-12 2022-06-03 北京云辰信通科技有限公司 Visual language identification method and related equipment

Also Published As

Publication number Publication date
CN100476877C (en) 2009-04-08

Similar Documents

Publication Publication Date Title
CN100476877C (en) Generating method of cartoon face driven by voice and text together
CN101178896B (en) Unit selection voice synthetic method based on acoustics statistical model
CN102169642B (en) Interactive virtual teacher system having intelligent error correction function
CN111915707B (en) Mouth shape animation display method and device based on audio information and storage medium
CN103218842A (en) Voice synchronous-drive three-dimensional face mouth shape and face posture animation method
CN104361620A (en) Mouth shape animation synthesis method based on comprehensive weighted algorithm
Mattheyses et al. Comprehensive many-to-many phoneme-to-viseme mapping and its application for concatenative visual speech synthesis
CN102184731A (en) Method for converting emotional speech by combining rhythm parameters with tone parameters
Malcangi Text-driven avatars based on artificial neural networks and fuzzy logic
EP3866117A1 (en) Voice signal-driven facial animation generation method
Urbain et al. Arousal-driven synthesis of laughter
CN101930619A (en) Collaborative filtering-based real-time voice-driven human face and lip synchronous animation system
KR20080018408A (en) Computer-readable recording medium with facial expression program by using phonetic sound libraries
Zhou et al. Content-dependent fine-grained speaker embedding for zero-shot speaker adaptation in text-to-speech synthesis
Dongmei Design of English text-to-speech conversion algorithm based on machine learning
Sivaprasad et al. Emotional prosody control for speech generation
Levy et al. The effect of pitch, intensity and pause duration in punctuation detection
WO2022164725A1 (en) Generating diverse and natural text-to-speech samples
Kettebekov et al. Prosody based audiovisual coanalysis for coverbal gesture recognition
Kang et al. Connectionist temporal classification loss for vector quantized variational autoencoder in zero-shot voice conversion
Li et al. A novel speech-driven lip-sync model with CNN and LSTM
Kirandzhiska et al. Sound features used in emotion classification
CN113628610B (en) Voice synthesis method and device and electronic equipment
Hacioglu et al. Parsing speech into articulatory events
Zorić et al. Real-time language independent lip synchronization method using a genetic algorithm

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20230111

Address after: Room 1211, Floor 12, Building A, Binhe Business Center, Tianqiao District, Jinan, Shandong Province 250033

Patentee after: Zhongke Music Intelligent Technology (Jinan) Co.,Ltd.

Address before: 100080 No. 6 South Road, Zhongguancun Academy of Sciences, Beijing, Haidian District

Patentee before: Institute of Computing Technology, Chinese Academy of Sciences

TR01 Transfer of patent right