CN110880315A - Personalized voice and video generation system based on phoneme posterior probability - Google Patents

Info

Publication number
CN110880315A
Authority
CN
China
Prior art keywords
lip
video
speaker
phoneme posterior probability
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910991186.4A
Other languages
Chinese (zh)
Inventor
孙立发
周艺超
钟静华
李坤
胡景强
刘鹏飞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen City Of Hope Technology Co Ltd
Original Assignee
Shenzhen City Of Hope Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen City Of Hope Technology Co Ltd filed Critical Shenzhen City Of Hope Technology Co Ltd
Priority to CN201910991186.4A priority Critical patent/CN110880315A/en
Publication of CN110880315A publication Critical patent/CN110880315A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
      • G10: MUSICAL INSTRUMENTS; ACOUSTICS
        • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
          • G10L13/00: Speech synthesis; Text to speech systems
            • G10L13/02: Methods for producing synthetic speech; Speech synthesisers
              • G10L13/033: Voice editing, e.g. manipulating the voice of the synthesiser
            • G10L13/08: Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
          • G10L15/00: Speech recognition
            • G10L15/24: Speech recognition using non-acoustical features
              • G10L15/25: Speech recognition using position of the lips, movement of the lips or face analysis
          • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
            • G10L25/27: characterised by the analysis technique
              • G10L25/30: characterised by the analysis technique using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Signal Processing (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The invention discloses a personalized voice and video generation system based on phoneme posterior probability, which mainly comprises the following steps: S1, extracting the phoneme posterior probability with an automatic speech recognition system; S2, training a recurrent neural network to learn the mapping between the phoneme posterior probability and lip features, so that inputting the audio of any target speaker into the network yields the corresponding lip features; S3, synthesizing the lip features into corresponding face images using face alignment, image fusion, the optical flow method, and related techniques; and S4, generating the final talking video of the speaker from the generated face sequence using dynamic programming and related techniques. The invention relates to the technical fields of speech synthesis and voice conversion. By generating lip shapes from the phoneme posterior probability, the invention greatly reduces the amount of video data required from the target speaker and can generate the target speaker's video directly from text content, without separately recording the speaker's audio.

Description

Personalized voice and video generation system based on phoneme posterior probability
Technical Field
The invention relates to the technical field of voice and video, in particular to a personalized voice and video generation system based on phoneme posterior probability.
Background
With improvements in computing power, the collection of large amounts of internet data, and breakthroughs in core algorithms, artificial intelligence has entered a new stage of development, and modes of human-computer interaction are gradually changing. An important part of human-computer interaction is simulating a real human figure that interacts with the user; the key technology here is virtual-figure generation, and combining it with speech synthesis and voice conversion enables personalized voice and video synthesis.
Speech synthesis is a technique that converts text to speech, and voice conversion can be used to customize the timbre of the synthesized speech. With the application of deep learning, the naturalness and fluency of synthesized and converted speech have improved greatly.
The current mainstream virtual-figure generation technology changes the figure's expression in real time based on facial recognition. This approach suits two-dimensional cartoon figures but struggles to produce a virtual figure that resembles a real person. In recent years, virtual-figure generation based on modeling real people has been pursued in both research and industry, but the results still need further improvement: lip shapes look strange, the voice sounds stiff, facial motion does not match the voice, and the resolution of the face, particularly the lips, is low. In addition, this technology requires a certain amount of video data from the target speaker; when the data are insufficient, the generation quality is hard to guarantee, which degrades the user experience, and the overall approach is neither very practical nor convenient to operate.
Disclosure of Invention
Technical problem to be solved
Aiming at the defects of the prior art, the invention provides a personalized voice and video generation system based on phoneme posterior probability, which greatly reduces the amount of video data required from the target speaker and can generate the target speaker's video directly from text content without separately recording the speaker's audio.
(II) technical scheme
To achieve the above purpose, the invention is realized by the following technical solution: a personalized voice and video generation system based on phoneme posterior probability, mainly comprising the following steps:
S1, first, the phoneme posterior probability (PPG) is extracted from the speech of the source speaker using a speaker-independent automatic speech recognition (SI-ASR) system;
S2, next, a recurrent neural network (RNN) is trained to learn the mapping between the phoneme posterior probability and the lip features; through this network, inputting the audio of any target speaker yields the corresponding lip features; if the input is text, the target speaker's audio is first produced through speech synthesis and voice conversion, and the lip features are then produced through the network;
S3, the lip features generated by the recurrent neural network are synthesized into corresponding face images through face alignment, image fusion, the optical flow method, and related techniques, keeping the lip shape of the face synchronized with the audio;
and S4, the final talking video of the speaker is generated from the generated face sequence using dynamic programming and related techniques.
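The four steps above can be sketched as a pipeline of stand-in functions. This is an illustrative sketch only: every function body below is a toy placeholder (random logits, a fixed random linear map, blank frames) for the trained models and image-processing stages the invention describes, and none of the names come from the patent itself.

```python
import numpy as np

rng = np.random.default_rng(0)

def extract_ppg(audio_frames, n_phonemes=40):
    # S1 stand-in: a real SI-ASR acoustic model would emit per-frame
    # phoneme posteriors; here random logits are normalized instead.
    logits = rng.standard_normal((len(audio_frames), n_phonemes))
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def ppg_to_lip_features(ppg, n_lip_dims=20):
    # S2 stand-in: the patent trains an RNN for this mapping; a fixed
    # random linear map keeps the sketch self-contained.
    W = rng.standard_normal((ppg.shape[1], n_lip_dims))
    return ppg @ W

def lip_features_to_faces(lip_features, h=64, w=64):
    # S3 stand-in: face alignment, image fusion, and optical flow would
    # render one face image per lip-feature vector.
    return np.zeros((len(lip_features), h, w, 3), dtype=np.uint8)

def faces_to_video(face_frames):
    # S4 stand-in: dynamic programming would retime and assemble frames.
    return list(face_frames)

audio = np.zeros((100, 13))            # 100 frames of toy acoustic features
ppg = extract_ppg(audio)               # (100, 40), each row a distribution
lip_feats = ppg_to_lip_features(ppg)   # (100, 20)
video = faces_to_video(lip_features_to_faces(lip_feats))
```

The point of the decomposition is that only `extract_ppg` sees the source speaker; everything downstream works on the speaker-independent PPG representation.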
Preferably, speaker-independent automatic speech recognition is abbreviated as SI-ASR, the recurrent neural network is abbreviated as RNN, and the phoneme posterior probability is abbreviated as PPG.
Preferably, in S2, the input is shifted two steps in the RNN model; to generate smooth and natural lip motion, a long short-term memory network (LSTM) is used as the basic unit of the neural network, and the gating mechanism of the LSTM unit controls what information is stored and how states are updated, so the model can simultaneously retain long-term dependencies on the audio and on the preceding lip and head poses; thus, after the RNN model is trained, it can generate a speaker video whose lip and head motion is natural and consistent with the input audio.
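One plausible reading of the two-step shift is that each lip frame is paired with the audio frame two steps later, since the mouth starts moving before the sound is produced. A minimal sketch of that training-data alignment, with toy arrays standing in for real audio and lip features:

```python
import numpy as np

def shift_pairs(audio_feats, lip_feats, delay=2):
    # Pair lip frame t with audio frame t + delay: lip motion leads the
    # sound (the mouth opens before the "o" in "orange"), so the network
    # sees audio from two steps after the lip frame it must predict.
    assert len(audio_feats) == len(lip_feats)
    n = len(lip_feats) - delay
    return audio_feats[delay:], lip_feats[:n]

x = np.arange(10).reshape(10, 1)        # toy per-frame audio features
y = 10 * np.arange(10).reshape(10, 1)   # toy per-frame lip features
xs, ys = shift_pairs(x, y, delay=2)
# xs[k] is audio frame k + 2, paired with lip frame k in ys
```

The patent does not spell out the shift direction, so treat `delay=2` and this pairing as an assumption for illustration.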
Preferably, the face image synthesized in S3 uses multiple image-processing algorithms: the face in the video is normalized by a face-alignment technique, the synthesized lip texture is seamlessly joined to the face by an image-fusion technique, the chin is corrected by the optical flow method, and the time axis of the video is readjusted by dynamic programming so that head motion in the video matches the audio more naturally.
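As a rough illustration of the image-fusion step, the sketch below blends a synthesized lip patch into a face image under a feathered mask. This is only a stand-in for the seamless fusion described above; production systems often use Poisson blending instead, and all shapes and positions here are arbitrary toy values.

```python
import numpy as np

def fuse_lip_region(face, lip_patch, top, left):
    # Blend a synthesized lip patch into a face image with a feathered
    # mask: weight 1 at the patch centre, falling to 0 at its border,
    # so the pasted texture joins the face without a hard seam.
    h, w = lip_patch.shape[:2]
    yy = np.minimum(np.arange(h), np.arange(h)[::-1])
    xx = np.minimum(np.arange(w), np.arange(w)[::-1])
    alpha = np.minimum.outer(yy, xx).astype(float)
    alpha /= max(alpha.max(), 1.0)
    out = face.astype(float)
    region = out[top:top + h, left:left + w]
    out[top:top + h, left:left + w] = (
        alpha[..., None] * lip_patch + (1.0 - alpha[..., None]) * region
    )
    return out.astype(face.dtype)

face = np.zeros((32, 32, 3), dtype=np.uint8)   # toy face image
lip = np.full((8, 8, 3), 255, dtype=np.uint8)  # toy lip texture
fused = fuse_lip_region(face, lip, top=12, left=12)
```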
Preferably, in S4, the generated video supports further editing and modification.
(III) advantageous effects
The invention provides a personalized voice and video generation system based on phoneme posterior probability. The method has the following beneficial effects:
(1) In the personalized voice and video generation system based on the phoneme posterior probability, step S1 extracts the phoneme posterior probability (PPG) from the speech of the source speaker using a speaker-independent automatic speech recognition (SI-ASR) system; because the PPG is speaker-independent, the requirement on the amount of video data from the target speaker is greatly reduced.
(2) Step S2 trains a recurrent neural network to learn the mapping between the phoneme posterior probability and the lip features; through this network, inputting the audio of any target speaker yields the corresponding lip features, and if the input is text, the target speaker's audio is first produced through speech synthesis and voice conversion before the lip features are produced through the network; the target speaker's video can therefore be generated directly from text content without separately recording the speaker's audio.
Drawings
FIG. 1 is a diagram of the main steps of the practice of the present invention;
FIG. 2 is a schematic representation of the RNN model of the present invention;
FIG. 3 is a detailed flow chart of the practice of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1-3, an embodiment of the present invention provides a technical solution: a personalized voice and video generation system based on phoneme posterior probability mainly comprises the following steps:
s1, first, a phoneme Posterior Probability (PPG) is extracted from the speech of the source speaker using a speaker-independent automatic speech recognition (SI-ASR) system, the posterior probability based method being based in part on the following assumptions: the posterior probability obtained from the speaker-independent speech recognition system is independent of the speaker and only related to the content of the utterance, and the phoneme posterior probability-based method is divided into three stages: the method comprises a first training stage (marked as a training stage 1), a second training stage (marked as a training stage 2) and a video generation stage, wherein the SI-ASR model is used for obtaining PPG representation of input voice, the second training stage is used for modeling mapping relation between PPG characteristics and lip characteristics of a target speaker for voice parameter generation through training a Recurrent Neural Network (RNN) model, and the video generation stage is used for generating corresponding lip characteristics for input text or voice through the SI-ASR and RNN models so as to synthesize corresponding faces and videos.
S2, next, the lip shape in each frame of the video is extracted and normalized by translation, rotation, scaling, and similar operations to obtain a lip feature vector, producing a time-ordered lip feature sequence {y_0, y_1, ..., y_t} that is used to train a recurrent neural network (RNN) model. The model diagram shows how the audio feature x_t at time t is fed into the LSTM unit. Notably, lip motion usually precedes voice production; for example, when we say "orange", the mouth has already opened before the "o" sound is made, so the input is shifted two steps in the model. Through the recurrent neural network, the mapping between the phoneme posterior probabilities of the audio features and the lip features is learned, so the corresponding lip features can be produced for the audio of any target speaker. If the input is text, the target speaker's audio is first produced through speech synthesis and voice conversion, and the lip features are then produced through the network;
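The translate/rotate/scale normalization of lip landmarks can be sketched as a similarity-invariant Procrustes alignment. The patent does not give this exact procedure, and the landmark coordinates below are arbitrary toy values; the sketch only shows that the same lip shape, after being moved, turned, and scaled, normalizes back to one canonical feature vector.

```python
import numpy as np

def normalize_lip(points, reference=None):
    # Remove translation and scale from (n, 2) lip landmarks; if a
    # reference shape is given, also remove rotation via orthogonal
    # Procrustes alignment. One plausible reading of the patent's
    # translate/rotate/scale normalization, not its exact procedure.
    p = points - points.mean(axis=0)               # drop translation
    scale = np.sqrt((p ** 2).sum() / len(p))       # RMS spread
    p = p / scale                                  # drop scale
    if reference is not None:
        u, _, vt = np.linalg.svd(p.T @ reference)  # best rotation onto ref
        p = p @ (u @ vt)
    return p

ref = np.array([[0., 0.], [1., 0.], [1., 1.], [0., 1.]])  # toy landmarks
canon = normalize_lip(ref)
theta = 0.5
rot = np.array([[np.cos(theta), -np.sin(theta)],
                [np.sin(theta),  np.cos(theta)]])
observed = 3.0 * ref @ rot.T + np.array([5.0, 7.0])  # moved, turned, scaled
aligned = normalize_lip(observed, reference=canon)
```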
S3, the lip features generated by the trained recurrent neural network are synthesized into corresponding face images through face alignment, image fusion, the optical flow method, and related techniques, keeping the lip shape of the face synchronized with the audio;
and S4, the final talking video of the speaker is generated from the generated face sequence using dynamic programming and related techniques.
In the invention, speaker-independent automatic speech recognition is abbreviated as SI-ASR, the recurrent neural network is abbreviated as RNN, and the phoneme posterior probability is abbreviated as PPG.
In the present invention, in S2, the input is shifted two steps in the RNN model; to generate smooth and natural lip motion, a long short-term memory network (LSTM) is used as the basic unit of the neural network, and the gating mechanism of the LSTM unit controls what information is stored and how states are updated, so the model can simultaneously retain long-term dependencies on the audio and on the preceding lip and head poses; thus, after the RNN model is trained, it can generate a speaker video whose lip and head motion is natural and consistent with the input audio.
In the invention, the face image synthesized in S3 uses multiple image-processing algorithms: the face in the video is normalized by a face-alignment technique, the synthesized lip texture is seamlessly joined to the face by an image-fusion technique, the chin is corrected by the optical flow method, and the time axis of the video is readjusted by dynamic programming so that head motion in the video matches the audio more naturally.
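The dynamic-programming retiming is not detailed in the patent; one standard choice for aligning two frame sequences is dynamic time warping over a frame-to-frame cost matrix, sketched here with a toy cost matrix:

```python
import numpy as np

def dtw_path(cost):
    # Dynamic-programming alignment over a frame-to-frame cost matrix
    # (classic DTW). The patent attributes its time-axis readjustment to
    # dynamic programming without details; this is one standard option.
    n, m = cost.shape
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(
                acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
    path, i, j = [], n, m                 # backtrack the cheapest warp
    while i > 0 or j > 0:
        path.append((i - 1, j - 1))
        i, j = min(((i - 1, j - 1), (i - 1, j), (i, j - 1)),
                   key=lambda s: acc[s])
    path.reverse()
    return path

# Zero cost on the diagonal: the cheapest path should not warp at all.
cost = 1.0 - np.eye(3)
path = dtw_path(cost)
```

In a retiming application, `cost[i, j]` would score how well generated face frame i fits audio frame j, and the recovered path dictates which frames to hold or skip.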
In the present invention, in S4, the generated video supports further editing and modification.
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element preceded by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
Although embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that changes, modifications, substitutions and alterations can be made in these embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims (5)

1. A personalized voice and video generation system based on phoneme posterior probabilities, characterized in that it mainly comprises the following steps:
S1, first, the phoneme posterior probability (PPG) is extracted from the speech of the source speaker using a speaker-independent automatic speech recognition (SI-ASR) system;
S2, next, a recurrent neural network (RNN) is trained to learn the mapping between the phoneme posterior probability and the lip features; through this network, inputting the audio of any target speaker yields the corresponding lip features; if the input is text, the target speaker's audio is first produced through speech synthesis and voice conversion, and the lip features are then produced through the network;
S3, the lip features generated by the trained recurrent neural network are synthesized into corresponding face images through face alignment, image fusion, the optical flow method, and related techniques, keeping the lip shape of the face synchronized with the audio;
and S4, the final talking video of the speaker is generated from the generated face sequence using dynamic programming and related techniques.
2. The personalized voice and video generation system based on phoneme posterior probabilities of claim 1, characterized in that: speaker-independent automatic speech recognition is abbreviated as SI-ASR, the recurrent neural network is abbreviated as RNN, and the phoneme posterior probability is abbreviated as PPG.
3. The personalized voice and video generation system based on phoneme posterior probabilities of claim 1, characterized in that: in S2, the input is shifted two steps in the RNN model; to generate smooth and natural lip motion, a long short-term memory network (LSTM) is used as the basic unit of the neural network, and the gating mechanism of the LSTM unit controls what information is stored and how states are updated, so the model can simultaneously retain long-term dependencies on the audio and on the preceding lip and head poses; thus, after the RNN model is trained, it can generate a speaker video whose lip and head motion is natural and consistent with the input audio.
4. The personalized voice and video generation system based on phoneme posterior probabilities of claim 1, characterized in that: the face image synthesized in S3 uses multiple image-processing algorithms: the face in the video is normalized by a face-alignment technique, the synthesized lip texture is seamlessly joined to the face by an image-fusion technique, the chin is corrected by the optical flow method, and the time axis of the video is readjusted by dynamic programming so that head motion in the video matches the audio more naturally.
5. The personalized voice and video generation system based on phoneme posterior probabilities of claim 1, characterized in that: in S4, the generated video supports further editing and modification.
CN201910991186.4A 2019-10-17 2019-10-17 Personalized voice and video generation system based on phoneme posterior probability Pending CN110880315A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910991186.4A CN110880315A (en) 2019-10-17 2019-10-17 Personalized voice and video generation system based on phoneme posterior probability

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910991186.4A CN110880315A (en) 2019-10-17 2019-10-17 Personalized voice and video generation system based on phoneme posterior probability

Publications (1)

Publication Number Publication Date
CN110880315A true CN110880315A (en) 2020-03-13

Family

ID=69728108

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910991186.4A Pending CN110880315A (en) 2019-10-17 2019-10-17 Personalized voice and video generation system based on phoneme posterior probability

Country Status (1)

Country Link
CN (1) CN110880315A (en)



Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101165679A (en) * 2006-10-20 2008-04-23 东芝泰格有限公司 Pattern matching device and method
CN103021440A (en) * 2012-11-22 2013-04-03 腾讯科技(深圳)有限公司 Method and system for tracking audio streaming media
CN103035236A (en) * 2012-11-27 2013-04-10 河海大学常州校区 High-quality voice conversion method based on modeling of signal timing characteristics
US20180012613A1 (en) * 2016-07-11 2018-01-11 The Chinese University Of Hong Kong Phonetic posteriorgrams for many-to-one voice conversion
CN107610717A (en) * 2016-07-11 2018-01-19 香港中文大学 Many-one phonetics transfer method based on voice posterior probability
CN106599198A (en) * 2016-12-14 2017-04-26 广东顺德中山大学卡内基梅隆大学国际联合研究院 Image description method for multi-stage connection recurrent neural network
CN106653052A (en) * 2016-12-29 2017-05-10 Tcl集团股份有限公司 Virtual human face animation generation method and device
CN109308731A (en) * 2018-08-24 2019-02-05 浙江大学 The synchronous face video composition algorithm of the voice-driven lip of concatenated convolutional LSTM

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
RICK PARENT: "Computer Animation: Algorithms and Techniques" (Chinese edition), 31 January 2018, Tsinghua University Press *
SAMER AL MOUBAYED: "Experiment for Lips Synchronization Using Phone Lattice to Face Parameters", Leuven University *
XINJIAN ZHANG et al.: "A New Language Independent, Photo-realistic Talking Head Driven by Voice Only", Interspeech 2013 *
YILONG LIU et al.: "Video-audio driven real-time facial animation", ACM Transactions on Graphics *
ZHANG PU et al.: "Research and Application of Digitalized Chinese Language Teaching" (in Chinese), 30 June 2006, 语文出版社 (Language Press) *
XU HAN: "Big Data, Artificial Intelligence, and Online Public Opinion Governance" (in Chinese), 31 October 2018, Wuhan University Press *

Cited By (26)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111432233A (en) * 2020-03-20 2020-07-17 北京字节跳动网络技术有限公司 Method, apparatus, device and medium for generating video
CN111666831B (en) * 2020-05-18 2023-06-20 武汉理工大学 Method for generating face video of speaker based on decoupling expression learning
CN111666831A (en) * 2020-05-18 2020-09-15 武汉理工大学 Decoupling representation learning-based speaking face video generation method
CN111933110A (en) * 2020-08-12 2020-11-13 北京字节跳动网络技术有限公司 Video generation method, generation model training method, device, medium and equipment
WO2022033327A1 (en) * 2020-08-12 2022-02-17 北京字节跳动网络技术有限公司 Video generation method and apparatus, generation model training method and apparatus, and medium and device
CN112634918A (en) * 2020-09-29 2021-04-09 江苏清微智能科技有限公司 Acoustic posterior probability based arbitrary speaker voice conversion system and method
CN112634918B (en) * 2020-09-29 2024-04-16 江苏清微智能科技有限公司 System and method for converting voice of any speaker based on acoustic posterior probability
CN112541956A (en) * 2020-11-05 2021-03-23 北京百度网讯科技有限公司 Animation synthesis method and device, mobile terminal and electronic equipment
CN112735371B (en) * 2020-12-28 2023-08-04 北京羽扇智信息科技有限公司 Method and device for generating speaker video based on text information
CN112735371A (en) * 2020-12-28 2021-04-30 出门问问(苏州)信息科技有限公司 Method and device for generating speaker video based on text information
CN114578969A (en) * 2020-12-30 2022-06-03 北京百度网讯科技有限公司 Method, apparatus, device and medium for human-computer interaction
CN114578969B (en) * 2020-12-30 2023-10-20 北京百度网讯科技有限公司 Method, apparatus, device and medium for man-machine interaction
CN112766166A (en) * 2021-01-20 2021-05-07 中国科学技术大学 Lip-shaped forged video detection method and system based on polyphone selection
CN112766166B (en) * 2021-01-20 2022-09-06 中国科学技术大学 Lip-shaped forged video detection method and system based on polyphone selection
WO2022194044A1 (en) * 2021-03-19 2022-09-22 北京有竹居网络技术有限公司 Pronunciation assessment method and apparatus, storage medium, and electronic device
CN113079327A (en) * 2021-03-19 2021-07-06 北京有竹居网络技术有限公司 Video generation method and device, storage medium and electronic equipment
CN113077819A (en) * 2021-03-19 2021-07-06 北京有竹居网络技术有限公司 Pronunciation evaluation method and device, storage medium and electronic equipment
CN113035235A (en) * 2021-03-19 2021-06-25 北京有竹居网络技术有限公司 Pronunciation evaluation method and apparatus, storage medium, and electronic device
CN114338959A (en) * 2021-04-15 2022-04-12 西安汉易汉网络科技股份有限公司 End-to-end text-to-video synthesis method, system medium and application
CN113314094A (en) * 2021-05-28 2021-08-27 北京达佳互联信息技术有限公司 Lip-shaped model training method and device and voice animation synthesis method and device
CN113314094B (en) * 2021-05-28 2024-05-07 北京达佳互联信息技术有限公司 Lip model training method and device and voice animation synthesis method and device
WO2022252890A1 (en) * 2021-05-31 2022-12-08 上海商汤智能科技有限公司 Interaction object driving and phoneme processing methods and apparatus, device and storage medium
CN113760100A (en) * 2021-09-22 2021-12-07 入微智能科技(南京)有限公司 Human-computer interaction equipment with virtual image generation, display and control functions
CN113760100B (en) * 2021-09-22 2024-02-02 入微智能科技(南京)有限公司 Man-machine interaction equipment with virtual image generation, display and control functions
CN113838174B (en) * 2021-11-25 2022-06-10 之江实验室 Audio-driven face animation generation method, device, equipment and medium
CN113838174A (en) * 2021-11-25 2021-12-24 之江实验室 Audio-driven face animation generation method, device, equipment and medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200313