CN110880315A - Personalized voice and video generation system based on phoneme posterior probability
- Publication number: CN110880315A
- Application number: CN201910991186.4A
- Authority: CN (China)
- Prior art keywords: lip, video, speaker, phoneme posterior probability
- Prior art date: 2019-10-17
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G10L13/08 — Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
- G10L13/033 — Voice editing, e.g. manipulating the voice of the synthesiser
- G10L15/25 — Speech recognition using non-acoustical features, using position of the lips, movement of the lips or face analysis
- G10L25/30 — Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00, characterised by the analysis technique using neural networks
Abstract
The invention discloses a personalized voice and video generation system based on phoneme posterior probabilities, which mainly comprises the following steps: S1, extracting phoneme posterior probabilities with an automatic speech recognition system; S2, training a recurrent neural network to learn the mapping between phoneme posterior probabilities and lip features, so that audio from any target speaker can be input to the network to output the corresponding lip features; S3, synthesizing the lip features into corresponding face images using face alignment, image fusion, optical flow, and related techniques; and S4, generating the final video of the speaker talking from the generated face sequence using dynamic programming and related techniques. The invention relates to the technical fields of speech synthesis and voice conversion. By generating lip shapes from phoneme posterior probabilities, the invention greatly reduces the amount of video data required from the target speaker, and can generate the target speaker's video directly from text content without additionally recording the speaker's audio.
Description
Technical Field
The invention relates to the technical field of voice and video, and in particular to a personalized voice and video generation system based on phoneme posterior probabilities.
Background
With improvements in computing power, the collection of large amounts of Internet data, and breakthroughs in core algorithms, artificial intelligence has entered a new stage of development, and the modes of human-computer interaction are gradually changing. An important part of human-computer interaction is simulating a realistic human likeness to interact with the user; the key technology here is virtual-avatar generation, and combining it with speech synthesis and voice conversion enables personalized voice and video synthesis.
Speech synthesis is a technique that converts text to speech, and voice conversion can be used to customize the timbre of the synthesized speech. With the application of deep learning, the naturalness and fluency of synthesized and converted speech have greatly improved.
The current mainstream virtual-avatar generation technology changes the avatar's expression in real time based on facial recognition. This approach suits two-dimensional avatars but has difficulty producing avatars that resemble real people. In recent years, avatar generation based on modeling real people has been researched and developed in both academia and industry, but the results still need improvement: the lips look strange, the motion is stiff, the facial movements do not match the voice, and the resolution of the face, particularly of the lips, is low. In addition, this technology requires a substantial amount of video data from the target speaker; when the data are insufficient, the generation quality is hard to guarantee, which degrades the user experience, so the overall practicality is weak and operation is inconvenient for the user.
Disclosure of Invention
Technical problem to be solved
Aiming at the shortcomings of the prior art, the invention provides a personalized voice and video generation system based on phoneme posterior probabilities, which greatly reduces the amount of video data required from the target speaker and can generate the target speaker's video directly from text content without additionally recording the speaker's audio.
(II) Technical solution
In order to achieve the above purpose, the invention is realized by the following technical solution: a personalized voice and video generation system based on phoneme posterior probabilities, mainly comprising the following steps:
S1, first, phoneme posterior probabilities (PPGs) are extracted from the source speaker's speech using a speaker-independent automatic speech recognition (SI-ASR) system;
S2, second, a recurrent neural network (RNN) is trained to learn the mapping between phoneme posterior probabilities and lip features; through this network, audio from any target speaker can be input to output the corresponding lip features; if the input is text, the target speaker's audio is first produced by speech synthesis and voice conversion, and the lip features are then output through the network;
S3, the lip features generated by the recurrent neural network are synthesized into corresponding face images using face alignment, image fusion, optical flow, and related techniques, with the lip shape of the face kept synchronized with the audio;
and S4, the final video of the speaker talking is generated from the resulting face sequence using dynamic programming and related techniques.
Preferably, speaker-independent automatic speech recognition is abbreviated SI-ASR, the recurrent neural network is abbreviated RNN, and the phoneme posterior probability is abbreviated PPG.
Preferably, in S2, the inputs are shifted two steps in the RNN model. To generate smooth and natural lip motion, a long short-term memory (LSTM) network is used as the basic unit of the neural network; the gating mechanism of the LSTM unit controls the necessary information storage and state transitions so that the model can simultaneously retain long-term dependencies on the audio and on previous lip and head poses. Thus, after the RNN model is trained, it can generate speaker video whose lip and head motion is natural and consistent with the input audio.
Preferably, the face-image synthesis in S3 uses multiple image processing algorithms; for example, the faces in the video are normalized by face alignment, the synthesized lip texture is seamlessly joined to the face by image fusion, the chin is corrected by optical flow, and the time axis of the video is readjusted by dynamic programming so that the head motion in the video matches the audio more naturally.
Preferably, in S4, the generated video supports further editing and modification.
(III) Advantageous effects
The invention provides a personalized voice and video generation system based on phoneme posterior probabilities, with the following beneficial effects:
(1) The system extracts phoneme posterior probabilities (PPGs) from the source speaker's speech using a speaker-independent automatic speech recognition (SI-ASR) system (S1). Because the PPG representation is speaker-independent, the requirement on the amount of video data from the target speaker is greatly reduced.
(2) The system trains a recurrent neural network to learn the mapping between phoneme posterior probabilities and lip features (S2); through this network, audio from any target speaker can be input to output the corresponding lip features, and if the input is text, the target speaker's audio is first produced by speech synthesis and voice conversion before the lip features are output through the network. The target speaker's video can thus be generated directly from text content without additionally recording the speaker's audio.
Drawings
FIG. 1 is a diagram of the main steps of the practice of the present invention;
FIG. 2 is a schematic representation of the RNN model of the present invention;
FIG. 3 is a detailed flow chart of the practice of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to FIGS. 1-3, an embodiment of the present invention provides a technical solution: a personalized voice and video generation system based on phoneme posterior probabilities, mainly comprising the following steps:
S1, first, phoneme posterior probabilities (PPGs) are extracted from the source speaker's speech using a speaker-independent automatic speech recognition (SI-ASR) system (an illustrative PPG-extraction sketch is given after step S4 below). The posterior-probability-based method rests in part on the following assumption: the posterior probabilities obtained from a speaker-independent speech recognition system are independent of the speaker and depend only on the content of the utterance. The method is divided into three stages: a first training stage (training stage 1), a second training stage (training stage 2), and a video generation stage. In training stage 1, the SI-ASR model is trained to obtain the PPG representation of the input speech; in training stage 2, a recurrent neural network (RNN) model is trained to model the mapping between PPG features and the target speaker's lip features for parameter generation; and in the video generation stage, the SI-ASR and RNN models generate the corresponding lip features for input text or speech, from which the corresponding faces and video are synthesized.
S2, next, the lip shape in each video frame is extracted and normalized by translation, rotation, scaling, etc., to serve as the lip-shape feature vector, yielding the time-ordered lip feature vector sequence {y_0, y_1, ..., y_t} used to train a recurrent neural network (RNN) model (an illustrative normalization sketch is likewise given after step S4 below). FIG. 2 schematically shows how the audio feature x_t at time t is input into the LSTM unit. It is noteworthy that lip movements usually precede voice production; for example, when we say "orange", the mouth has already opened before the sound of "o" is produced, so the inputs are shifted two steps in the model. Through the recurrent neural network, the mapping between the phoneme posterior probabilities of the audio features and the lip features is learned, so the corresponding lip features can be output by inputting any target speaker's audio; if the input is text, the target speaker's audio is first produced by speech synthesis and voice conversion, and the lip features are then output through the network;
S3, the lip features generated by the trained recurrent neural network are synthesized into corresponding face images using face alignment, image fusion, optical flow, and related techniques, with the lip shape of the face kept synchronized with the audio;
and S4, the final video of the speaker talking is generated from the resulting face sequence using dynamic programming and related techniques.
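For illustration only, and not as part of the claimed method, the following minimal Python sketch shows one way the PPG extraction of training stage 1 (step S1) could be realized: frame-level posteriors over phoneme classes obtained by a softmax over the outputs of a pretrained SI-ASR acoustic model. The `si_asr` object and its `forward` method are hypothetical placeholders.

```python
import numpy as np

def extract_ppg(acoustic_features: np.ndarray, si_asr) -> np.ndarray:
    """acoustic_features: (T, D) frame-wise features, e.g. MFCCs.
    Returns a (T, P) phonetic posteriorgram: each row is a posterior
    distribution over P phoneme classes and sums to 1."""
    logits = si_asr.forward(acoustic_features)           # (T, P) unnormalized scores (hypothetical API)
    logits = logits - logits.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    exp = np.exp(logits)
    return exp / exp.sum(axis=1, keepdims=True)          # frame-wise softmax -> PPG
```

Because the SI-ASR model is trained over many speakers, these posteriors capture what is said rather than who says it, which is exactly the assumption stated in S1.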
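Similarly, the lip-feature normalization of step S2 can be sketched as follows (illustrative only: the mouth-corner landmark indices are assumptions, and landmark extraction itself is taken as given):

```python
import numpy as np

def normalize_lip_landmarks(pts: np.ndarray) -> np.ndarray:
    """pts: (K, 2) lip landmark coordinates for one video frame.
    Returns landmarks centered at the origin, rotated so the line through
    the mouth corners is horizontal, and scaled to unit mouth width."""
    centered = pts - pts.mean(axis=0)                    # remove translation
    left, right = centered[0], centered[6]               # assumed mouth-corner indices
    angle = np.arctan2(right[1] - left[1], right[0] - left[0])
    c, s = np.cos(-angle), np.sin(-angle)
    rotated = centered @ np.array([[c, -s], [s, c]]).T   # remove rotation
    width = np.linalg.norm(rotated[6] - rotated[0])
    return rotated / width                               # remove scale
```

Applying this to every frame yields the lip feature vector sequence {y_0, y_1, ..., y_t} used as the RNN training targets.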
In the invention, speaker-independent automatic speech recognition is abbreviated SI-ASR, the recurrent neural network is abbreviated RNN, and the phoneme posterior probability is abbreviated PPG.
In the invention, in S2, the inputs are shifted two steps in the RNN model. To generate smooth and natural lip motion, a long short-term memory (LSTM) network is used as the basic unit of the neural network; the gating mechanism of the LSTM unit controls the necessary information storage and state transitions so that the model can simultaneously retain long-term dependencies on the audio and on previous lip and head poses. Thus, after the RNN model is trained, it can generate speaker video whose lip and head motion is natural and consistent with the input audio.
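As a concrete illustration of the two-step shift, a minimal PyTorch sketch follows (the feature dimensions and layer sizes are assumptions, not values fixed by this disclosure): the prediction at frame t is compared against the lip target from two frames earlier, so the network has seen two extra frames of audio context before committing to a mouth shape.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PPG2Lip(nn.Module):
    """Maps a PPG sequence (B, T, ppg_dim) to lip features (B, T, lip_dim)."""
    def __init__(self, ppg_dim=218, lip_dim=40, hidden=256):
        super().__init__()
        self.lstm = nn.LSTM(ppg_dim, hidden, num_layers=2, batch_first=True)
        self.proj = nn.Linear(hidden, lip_dim)

    def forward(self, ppg):
        out, _ = self.lstm(ppg)   # LSTM gating retains long-term audio/pose context
        return self.proj(out)

def shifted_mse(model, ppg, lip, delay=2):
    """Align the prediction at frame t with the lip frame at t - delay,
    since lip motion leads the audio by roughly `delay` frames."""
    pred = model(ppg)
    return F.mse_loss(pred[:, delay:], lip[:, :-delay])
```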
In the invention, the face-image synthesis in S3 uses multiple image processing algorithms; for example, the faces in the video are normalized by face alignment, the synthesized lip texture is seamlessly joined to the face by image fusion, the chin is corrected by optical flow, and the time axis of the video is readjusted by dynamic programming so that the head motion in the video matches the audio more naturally.
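For the image-fusion step, one plausible realization (an assumption, not necessarily the exact method used here) is Poisson blending via OpenCV's `seamlessClone`, which produces the kind of seamless join between synthesized lip texture and face described above; the mask and center point would come from the aligned lip region.

```python
import cv2

def blend_lip_region(face_frame, lip_texture, lip_mask, center_xy):
    """face_frame: HxWx3 target frame; lip_texture: hxwx3 synthesized mouth patch;
    lip_mask: hxw uint8 mask (255 inside the mouth region);
    center_xy: (x, y) center position of the patch in the face frame."""
    return cv2.seamlessClone(lip_texture, face_frame, lip_mask,
                             center_xy, cv2.NORMAL_CLONE)
```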
In the invention, in S4, the generated video supports further editing and modification.
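The timeline readjustment by dynamic programming can be sketched as follows (a minimal illustration under assumed inputs: a precomputed mismatch cost between each candidate source frame and each audio-rate output step, a monotonicity constraint, and a small penalty for skipping frames):

```python
import numpy as np

def retime_frames(cost: np.ndarray, skip_penalty: float = 0.1) -> list:
    """cost: (T, N) mismatch cost of showing source frame j at output step t.
    Returns one monotone source-frame index per output step."""
    T, N = cost.shape
    dp = np.full((T, N), np.inf)
    back = np.zeros((T, N), dtype=int)
    dp[0] = cost[0]
    for t in range(1, T):
        for j in range(N):
            # hold the current frame, or advance by one or two source frames
            prev = [k for k in (j, j - 1, j - 2) if k >= 0]
            scores = [dp[t - 1, k] + skip_penalty * (j - k) for k in prev]
            best = int(np.argmin(scores))
            dp[t, j] = cost[t, j] + scores[best]
            back[t, j] = prev[best]
    path = [int(np.argmin(dp[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```

Minimizing this cumulative cost keeps the frame sequence monotone and smooth while letting head and lip motion line up with the audio, which is the intent of the retiming step.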
It is noted that, herein, relational terms such as first and second may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such a process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
Although embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that changes, modifications, substitutions and alterations can be made in these embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.
Claims (5)
1. A personalized voice and video generation system based on phoneme posterior probabilities, characterized in that it mainly comprises the following steps:
S1, first, phoneme posterior probabilities (PPGs) are extracted from the source speaker's speech using a speaker-independent automatic speech recognition (SI-ASR) system;
S2, second, a recurrent neural network (RNN) is trained to learn the mapping between phoneme posterior probabilities and lip features; through this network, audio from any target speaker can be input to output the corresponding lip features; if the input is text, the target speaker's audio is first produced by speech synthesis and voice conversion, and the lip features are then output through the network;
S3, the lip features generated by the trained recurrent neural network are synthesized into corresponding face images using face alignment, image fusion, optical flow, and related techniques, with the lip shape of the face kept synchronized with the audio;
and S4, the final video of the speaker talking is generated from the resulting face sequence using dynamic programming and related techniques.
2. The system of claim 1, characterized in that: speaker-independent automatic speech recognition is abbreviated SI-ASR, the recurrent neural network is abbreviated RNN, and the phoneme posterior probability is abbreviated PPG.
3. The system of claim 1, characterized in that: in S2, the inputs are shifted two steps in the RNN model; to generate smooth and natural lip motion, a long short-term memory (LSTM) network is used as the basic unit of the neural network, whose gating mechanism controls the necessary information storage and state transitions so that the model can simultaneously retain long-term dependencies on the audio and on previous lip and head poses; thus, after the RNN model is trained, it can generate speaker video whose lip and head motion is natural and consistent with the input audio.
4. The system of claim 1, characterized in that: the face-image synthesis in S3 uses multiple image processing algorithms; for example, the faces in the video are normalized by face alignment, the synthesized lip texture is seamlessly joined to the face by image fusion, the chin is corrected by optical flow, and the time axis of the video is readjusted by dynamic programming so that the head motion in the video matches the audio more naturally.
5. The system of claim 1, characterized in that: in S4, the generated video supports further editing and modification.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910991186.4A CN110880315A (en) | 2019-10-17 | 2019-10-17 | Personalized voice and video generation system based on phoneme posterior probability |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110880315A (en) | 2020-03-13 |
Family
ID=69728108
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910991186.4A | Personalized voice and video generation system based on phoneme posterior probability (CN110880315A, pending) | 2019-10-17 | 2019-10-17 |
Country Status (1)
Country | Link |
---|---|
CN | CN110880315A (en) |
Patent Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101165679A (en) * | 2006-10-20 | 2008-04-23 | Toshiba TEC Corporation | Pattern matching device and method |
CN103021440A (en) * | 2012-11-22 | 2013-04-03 | Tencent Technology (Shenzhen) Co., Ltd. | Method and system for tracking audio streaming media |
CN103035236A (en) * | 2012-11-27 | 2013-04-10 | Hohai University, Changzhou Campus | High-quality voice conversion method based on modeling of signal timing characteristics |
US20180012613A1 (en) * | 2016-07-11 | 2018-01-11 | The Chinese University Of Hong Kong | Phonetic posteriorgrams for many-to-one voice conversion |
CN107610717A (en) * | 2016-07-11 | 2018-01-19 | The Chinese University of Hong Kong | Many-to-one voice conversion method based on phonetic posterior probabilities |
CN106599198A (en) * | 2016-12-14 | 2017-04-26 | SYSU-CMU Shunde International Joint Research Institute | Image description method for multi-stage connection recurrent neural network |
CN106653052A (en) * | 2016-12-29 | 2017-05-10 | TCL Corporation | Virtual human face animation generation method and device |
CN109308731A (en) * | 2018-08-24 | 2019-02-05 | Zhejiang University | Voice-driven lip-synchronized face video synthesis algorithm based on cascaded convolutional LSTM |
Non-Patent Citations (6)
Title |
---|
Rick Parent, "Computer Animation: Algorithms and Techniques", Tsinghua University Press, 31 January 2018 * |
Samer Al Moubayed, "Expirement for Lips Synchronization Using Phone Lattice to Face Parameters", Leuven University * |
Xinjian Zhang et al., "A New Language Independent, Photo-realistic Talking Head Driven by Voice Only", Interspeech 2013 * |
Yilong Liu et al., "Video-audio driven real-time facial animation", ACM Transactions on Graphics * |
Zhang Pu et al., "Research and Application of Digital Chinese Language Teaching", Yuwen Press, 30 June 2006 * |
Xu Han, "Big Data, Artificial Intelligence and Online Public Opinion Governance", Wuhan University Press, 31 October 2018 * |
Cited By (26)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111432233A (en) * | 2020-03-20 | 2020-07-17 | 北京字节跳动网络技术有限公司 | Method, apparatus, device and medium for generating video |
CN111666831B (en) * | 2020-05-18 | 2023-06-20 | 武汉理工大学 | Method for generating face video of speaker based on decoupling expression learning |
CN111666831A (en) * | 2020-05-18 | 2020-09-15 | 武汉理工大学 | Decoupling representation learning-based speaking face video generation method |
CN111933110A (en) * | 2020-08-12 | 2020-11-13 | 北京字节跳动网络技术有限公司 | Video generation method, generation model training method, device, medium and equipment |
WO2022033327A1 (en) * | 2020-08-12 | 2022-02-17 | 北京字节跳动网络技术有限公司 | Video generation method and apparatus, generation model training method and apparatus, and medium and device |
CN112634918A (en) * | 2020-09-29 | 2021-04-09 | 江苏清微智能科技有限公司 | Acoustic posterior probability based arbitrary speaker voice conversion system and method |
CN112634918B (en) * | 2020-09-29 | 2024-04-16 | 江苏清微智能科技有限公司 | System and method for converting voice of any speaker based on acoustic posterior probability |
CN112541956A (en) * | 2020-11-05 | 2021-03-23 | 北京百度网讯科技有限公司 | Animation synthesis method and device, mobile terminal and electronic equipment |
CN112735371B (en) * | 2020-12-28 | 2023-08-04 | 北京羽扇智信息科技有限公司 | Method and device for generating speaker video based on text information |
CN112735371A (en) * | 2020-12-28 | 2021-04-30 | 出门问问(苏州)信息科技有限公司 | Method and device for generating speaker video based on text information |
CN114578969A (en) * | 2020-12-30 | 2022-06-03 | 北京百度网讯科技有限公司 | Method, apparatus, device and medium for human-computer interaction |
CN114578969B (en) * | 2020-12-30 | 2023-10-20 | 北京百度网讯科技有限公司 | Method, apparatus, device and medium for man-machine interaction |
CN112766166A (en) * | 2021-01-20 | 2021-05-07 | 中国科学技术大学 | Lip-shaped forged video detection method and system based on polyphone selection |
CN112766166B (en) * | 2021-01-20 | 2022-09-06 | 中国科学技术大学 | Lip-shaped forged video detection method and system based on polyphone selection |
WO2022194044A1 (en) * | 2021-03-19 | 2022-09-22 | 北京有竹居网络技术有限公司 | Pronunciation assessment method and apparatus, storage medium, and electronic device |
CN113079327A (en) * | 2021-03-19 | 2021-07-06 | 北京有竹居网络技术有限公司 | Video generation method and device, storage medium and electronic equipment |
CN113077819A (en) * | 2021-03-19 | 2021-07-06 | 北京有竹居网络技术有限公司 | Pronunciation evaluation method and device, storage medium and electronic equipment |
CN113035235A (en) * | 2021-03-19 | 2021-06-25 | 北京有竹居网络技术有限公司 | Pronunciation evaluation method and apparatus, storage medium, and electronic device |
CN114338959A (en) * | 2021-04-15 | 2022-04-12 | 西安汉易汉网络科技股份有限公司 | End-to-end text-to-video synthesis method, system medium and application |
CN113314094A (en) * | 2021-05-28 | 2021-08-27 | 北京达佳互联信息技术有限公司 | Lip-shaped model training method and device and voice animation synthesis method and device |
CN113314094B (en) * | 2021-05-28 | 2024-05-07 | 北京达佳互联信息技术有限公司 | Lip model training method and device and voice animation synthesis method and device |
WO2022252890A1 (en) * | 2021-05-31 | 2022-12-08 | 上海商汤智能科技有限公司 | Interaction object driving and phoneme processing methods and apparatus, device and storage medium |
CN113760100A (en) * | 2021-09-22 | 2021-12-07 | 入微智能科技(南京)有限公司 | Human-computer interaction equipment with virtual image generation, display and control functions |
CN113760100B (en) * | 2021-09-22 | 2024-02-02 | 入微智能科技(南京)有限公司 | Man-machine interaction equipment with virtual image generation, display and control functions |
CN113838174B (en) * | 2021-11-25 | 2022-06-10 | 之江实验室 | Audio-driven face animation generation method, device, equipment and medium |
CN113838174A (en) * | 2021-11-25 | 2021-12-24 | 之江实验室 | Audio-driven face animation generation method, device, equipment and medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
RJ01 | Rejection of invention patent application after publication | Application publication date: 20200313 |