CN117933318A - Method for constructing teaching digital person - Google Patents

Method for constructing teaching digital person

Info

Publication number
CN117933318A
CN117933318A (application CN202410214465.0A)
Authority
CN
China
Prior art keywords
human
limb
digital
video
teaching
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410214465.0A
Other languages
Chinese (zh)
Inventor
方明 (Fang Ming)
余松 (Yu Song)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan Zhidao Online Education Technology Co., Ltd.
Original Assignee
Wuhan Zhidao Online Education Technology Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan Zhidao Online Education Technology Co., Ltd.
Priority to CN202410214465.0A
Publication of CN117933318A
Pending legal-status Critical Current

Landscapes

  • Processing Or Creating Images (AREA)

Abstract

The invention relates to the field of artificial intelligence and provides a method for constructing a teaching digital person. The method uses the deep learning algorithm so-vits-svc to clone the voice timbre of a teaching teacher and generate an audio stream carrying that timbre; constructs a SadTalker-based Wav2Talker digital human model, which uses deep learning to generate video of the teacher's natural limb, gesture, and expression dynamics from the audio stream and a character picture; applies the video-retalking technique to add emotion changes to the facial expression; extends the GFPGAN face (eyes and nose) super-resolution algorithm to render the full set of facial features in high definition; and adopts the FaceChain deep learning model tool to construct digital human images that resemble real portraits. The method reduces cost and requires neither three-dimensional modeling nor motion capture to form a virtual person. The various generative models, including timbre, gesture, expression, and limb-motion models, are trained from the teacher's recorded and live broadcasts, yielding digital persons with multiple portrait images.

Description

Method for constructing teaching digital person
Technical Field
The invention relates to the field of artificial intelligence, and in particular to a method for constructing a teaching digital person.
Background
Digital human technology is an innovative technology based on artificial intelligence and natural language processing that aims to create more intelligent, personalized, and humanized human-computer interaction experiences. It can be applied to virtual assistants, customer-service robots, education and training, and similar fields, and achieves more natural and fluent communication with users by simulating human language and reasoning.
Behind digital human technology lie many techniques, such as deep learning, natural language processing, and dialog system design, together with large corpora and extensive algorithm optimization. Through continuous training and optimization, digital human technology can gradually raise its level of intelligence so that interaction with users meets personalized requirements.
In general, the development of digital human technology represents the latest progress of artificial intelligence and natural language processing in the field of human-computer interaction, and offers new possibilities for improving user experience and intelligent service capability.
The prior art realizes live broadcasting of virtual characters using computer technology and artificial intelligence. The following are some key technical points of existing virtual-person live broadcasting:
(1) Three-dimensional modeling: virtual-person live broadcasting requires three-dimensional modeling of the virtual character, including the head, body, and limbs. The appearance contours, facial expressions, gestures, and other features of the real person are converted into a three-dimensional model through advanced graphics processing and computer vision algorithms.
(2) Motion capture: live broadcasting requires capturing the motion of the real person and applying it to the virtual person so that the virtual person can simulate the real person's movements in real time. The real person's motion data is typically captured with sensor devices or cameras and mapped onto the skeletal system of the virtual character by algorithms.
(3) Speech synthesis: in virtual-person live broadcasting, the virtual person needs natural and fluent speech. Speech synthesis converts text into realistic speech so that the virtual character can broadcast in a natural voice.
(4) Semantic understanding: natural language processing and artificial intelligence algorithms are required for the virtual character to understand viewers and respond automatically. These techniques analyze audience questions or comments and generate meaningful responses; semantic analysis, emotion recognition, and dialog generation play important roles here.
(5) Real-time rendering: virtual live broadcasting requires the virtual character to be rendered in the scene in real time to maintain a smooth live experience. Real-time rendering converts the three-dimensional model into realistic images by exploiting the parallel computing power of the graphics processing unit (GPU) and presents them to viewers in real time.
(6) Interactive communication: virtual-person live broadcasting requires real-time interaction with the audience. This can be achieved through natural language processing, emotion recognition, and related techniques, so that the avatar can understand audience questions and respond accordingly.
By combining these key techniques, real-time live broadcasting of virtual characters can be realized, bringing brand-new entertainment and communication experiences to audiences.
Disadvantages of the prior art
(1) Professionals, professional sensors, and professional software are required to capture the appearance contours, facial expressions, and postures of the real person and convert them into a three-dimensional model, which is very complex and costly.
(2) Modeling with professional software generally yields a virtual animated character that differs considerably from a real person and feels unreal and unnatural.
(3) The real person's timbre is missing, and the voice is not synchronized with the virtual person's lip motion.
(4) Emotion is missing: the expression is stiff or lacks emotional expressiveness.
(5) Postures are awkward, unsynchronized with the virtual person's voice, and uncoordinated.
(6) Facial expressions are missing or mechanical.
(7) The character image is single and fixed, limited to one portrait.
Disclosure of Invention
Aiming at the defects of the prior art and solving the problem of missing limb motion, the invention provides a method for creating a more complete teaching digital person that combines timbre, expression, limbs, and gestures.
The technical scheme adopted by the invention is as follows:
A method of constructing a teaching digital person comprises:
(1) Using the deep learning algorithm so-vits-svc to clone the voice timbre of a designated teaching teacher and generate an audio stream carrying that teacher's timbre;
(2) Constructing the Wav2Talker digital human model. Wav2Talker is based on the SadTalker model; however, SadTalker fuses only three features (speech audio, head pose, and facial expression) to generate a talking-head video. It is limited to head dynamics, lacks limb motion below the head, and cannot express the body language of a complete digital person. The Wav2Talker model therefore adds human limb-motion generation, including human skeleton key points and hand key points, on top of SadTalker to form a complete digital human image;
(3) Applying the video-retalking technique to add emotion changes, such as happiness, neutrality, and sadness, to the facial expression, so that the digital person exhibits the same emotional expression as a real person;
(4) Extending the GFPGAN face (eyes and nose) super-resolution algorithm to render all facial features in high definition. The algorithm converts an input low-resolution image into a high-resolution image while preserving image detail and sharpness. With this technique, the digital person's entire set of facial organs can be refined, making it more realistic and clear.
(5) Adopting the FaceChain deep learning model tool to construct digital human images that resemble real portraits.
By combining the above techniques, the audio stream carrying the teacher's timbre is synchronized with lip motion, facial expression, and limb motion, so that the digital person appears more real and natural than a traditional virtual person. A minimal end-to-end orchestration sketch is given below.
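The following Python sketch illustrates one possible way to chain the five stages. It is a minimal orchestration sketch only: every helper function is a hypothetical placeholder standing in for the corresponding external tool (so-vits-svc, Wav2Talker, video-retalking, GFPGAN, FaceChain), and none of these function names is an API published by those projects.

```python
"""Minimal orchestration sketch for the five-stage pipeline.

Assumption: every helper below is a hypothetical placeholder standing in for
an external tool (so-vits-svc, Wav2Talker, video-retalking, GFPGAN, FaceChain);
none of these names is an API published by those projects.
"""

def clone_voice(source_audio: str, speaker: str) -> str:          # so-vits-svc
    raise NotImplementedError("call the trained so-vits-svc model here")

def wav2talker_generate(audio: str, portrait: str) -> str:        # Wav2Talker
    raise NotImplementedError("SadTalker head modules + limb GAN here")

def add_emotion(video: str, emotion: str) -> str:                 # video-retalking
    raise NotImplementedError("emotion editing of the facial expression here")

def enhance_faces(video: str) -> str:                             # GFPGAN
    raise NotImplementedError("per-frame face restoration here")

def apply_portrait_style(video: str, style: str) -> str:          # FaceChain
    raise NotImplementedError("portrait-style generation here")

def build_teaching_digital_person(script_audio: str, portrait: str,
                                  emotion: str = "neutral",
                                  style: str = "teacher") -> str:
    """Run the five stages in order and return the final video path."""
    audio = clone_voice(script_audio, speaker="teacher")      # step (1)
    video = wav2talker_generate(audio, portrait)              # step (2)
    video = add_emotion(video, emotion)                       # step (3)
    video = enhance_faces(video)                              # step (4)
    return apply_portrait_style(video, style)                 # step (5)
```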
The process of constructing the Wav2Talker digital human model is as follows:
Input: the input video key frame sequence is V = {V0, ..., Vn}, where n is the number of key frames, and the input audio corresponding to the input video is denoted α = {α0, ..., αn};
First, 24 human skeleton key points and 21 hand key points are extracted from the teaching teacher picture in the initial video key frame V0 (a single-frame image); these 45 points are called the initial limb key points and are denoted η0 (a minimal extraction sketch is given after this list);
Then, a conditional generative adversarial network CGAN (the limb GAN) is built; by progressively feeding in η0 and {α0, ..., αn}, the generator module of the limb GAN progressively generates the subsequent limb key point sequence {η1, ..., ηn};
Next, the {η1, ..., ηn} sequence is turned into a natural, continuous, and consistent limb-motion video of the teaching teacher through limb-key-point-to-video rendering (skeleton-to-video rendering);
Finally, the SadTalker head pose (PoseVAE) and facial expression (ExpNet) modules are combined to form the complete teaching digital human video.
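As a concrete illustration of the first step, the sketch below extracts η0 from the frame V0. The patent does not name a key point detector, so MediaPipe Holistic is used here purely as an assumed choice, keeping the first 24 of its 33 pose landmarks and the 21 right-hand landmarks to match the 24 + 21 = 45 points described above.

```python
# Sketch of extracting the 45 initial limb key points η0 from frame V0.
# Assumption: MediaPipe Holistic is one possible detector; the 24-point body
# subset and the choice of the right hand are illustrative, not the patent's.
import cv2
import numpy as np
import mediapipe as mp

def extract_initial_limb_keypoints(frame_bgr: np.ndarray) -> np.ndarray:
    """Return η0 as a (45, 2) array of normalized (x, y) coordinates."""
    with mp.solutions.holistic.Holistic(static_image_mode=True) as holistic:
        results = holistic.process(cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB))

    if results.pose_landmarks is None or results.right_hand_landmarks is None:
        raise ValueError("body or hand not detected in the initial key frame")

    body = [(lm.x, lm.y) for lm in results.pose_landmarks.landmark[:24]]   # 24 points
    hand = [(lm.x, lm.y) for lm in results.right_hand_landmarks.landmark]  # 21 points
    return np.array(body + hand, dtype=np.float32)                         # (45, 2)

# Usage: eta0 = extract_initial_limb_keypoints(cv2.imread("V0.png"))
```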
Compared with the prior art, the invention has the following beneficial effects:
With the method for constructing a teaching digital person, by cloning the voice timbre, facial expressions, and limb-motion characteristics of a real teaching teacher, the digital person can replace the real teacher for 24-hour live broadcasting and interaction, achieving the same teaching effect as the real person.
The method proposes the Wav2Talker deep learning model on the basis of Wav2Lip and SadTalker. It remedies the limitation that Wav2Lip and SadTalker can only generate expressions and motion of the head, and can generate both facial expressions and limb motion, making the digital person more complete and more realistic.
The method reduces cost: no three-dimensional modeling or motion capture technology is needed to form a virtual person, and the real person's original timbre and gesture motion are cloned directly for producing new videos.
The method builds a real timbre model of the person from that person's past recorded and live broadcast audio, and produces truly natural lip motion, facial expressions, and limb motion; facial expressions with emotional expressiveness; and digital persons with multiple portrait images.
Drawings
FIG. 1 is the main framework diagram of Wav2Talker.
In the SadTalker framework, the monocular three-dimensional face reconstruction module uses 3DMM coefficients as the intermediate motion representation. Realistic 3D motion coefficients (facial expression β, head pose ρ) are first generated from the audio, and these coefficients are then used to implicitly modulate a three-dimensionally aware face renderer to generate the final video.
A limb-motion key point extraction module is added on top of the SadTalker framework. This module extracts 24 human skeleton key points and 21 hand key points, i.e., 45 initial limb key points, denoted η0. Then, a conditional generative adversarial network CGAN (the limb GAN) is built; by progressively feeding in η0 and {α0, ..., αn}, the generator module of the limb GAN progressively generates the subsequent limb key point sequence {η1, ..., ηn}. Next, the {η1, ..., ηn} sequence is turned into a natural, continuous, and consistent limb-motion video of the teaching teacher through limb-key-point-to-video rendering (skeleton-to-video rendering). Finally, the SadTalker head pose (PoseVAE) and facial expression (ExpNet) modules are combined to form the complete teaching digital human video.
Detailed Description
The invention is described in detail below with reference to the accompanying drawings and embodiments:
A method of constructing a teaching digital person comprises:
(1) Using the deep learning algorithm so-vits-svc to clone the voice timbre of a designated teaching teacher and generate an audio stream carrying that teacher's timbre;
(2) Constructing the Wav2Talker digital human model. Wav2Talker is based on the SadTalker model; however, SadTalker fuses only three features (speech audio, head pose, and facial expression) to generate a talking-head video. It is limited to head dynamics, lacks limb motion below the head, and cannot express the body language of a complete digital person. The Wav2Talker model therefore adds human limb-motion features, including human skeleton key points and hand key points, on top of SadTalker to form a complete digital human image;
(3) Applying the video-retalking technique to add emotion changes, such as happiness, neutrality, and sadness, to the facial expression, so that the digital person exhibits the same emotional expression as a real person;
(4) Extending the GFPGAN face (eyes and nose) super-resolution algorithm to render all facial features in high definition. The algorithm converts an input low-resolution image into a high-resolution image while preserving image detail and sharpness. With this technique, the digital person's entire set of facial organs can be refined, making it more realistic and clear (a restoration sketch is given below).
(5) Adopting the FaceChain deep learning model tool to construct digital human images that resemble real portraits.
Digital persons with various portrait images, such as an office image, a teacher image, and a white-collar image, are generated in real time.
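The per-frame face-restoration step (4) can be sketched as follows. The GFPGANer call follows the public TencentARC/GFPGAN repository; the weight file name and exact parameter values are assumptions and may differ by version, so this is an illustrative sketch rather than a pinned API.

```python
# Per-frame face restoration sketch based on the public TencentARC/GFPGAN repo.
# Assumption: "GFPGANv1.4.pth" is a locally downloaded weight file; parameter
# names may differ between GFPGAN versions.
import cv2
from gfpgan import GFPGANer

def restore_frame(frame_bgr, restorer: GFPGANer):
    """Return the frame with the facial region restored and up-scaled."""
    _, _, restored = restorer.enhance(
        frame_bgr,
        has_aligned=False,        # the face is not pre-cropped or aligned
        only_center_face=True,    # one teacher per frame
        paste_back=True)          # blend the restored face back into the frame
    return restored

restorer = GFPGANer(model_path="GFPGANv1.4.pth",
                    upscale=2, arch="clean", channel_multiplier=2)
frame = cv2.imread("frame_0001.png")
cv2.imwrite("frame_0001_hd.png", restore_frame(frame, restorer))
```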
The Shangde teaching digital human system is built through this pipeline.
By combining the above techniques, the audio stream carrying the teacher's timbre is synchronized with lip motion, facial expression, and gesture motion, so that the digital person's performance is more real and natural than that of a traditional virtual person. A minimal muxing sketch for attaching the cloned audio track to the generated video is given below.
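To deliver a file in which lips, expressions, and gestures stay aligned with the cloned voice, the generated video and the audio stream can be muxed together, for example with ffmpeg. The file names below are illustrative assumptions, and ffmpeg must be installed on the system.

```python
# Muxing sketch: combine the Wav2Talker video with the cloned-timbre audio so
# that lip motion, expression, and gestures stay aligned with the sound track.
import subprocess

def mux_audio_video(video_path: str, audio_path: str, out_path: str) -> None:
    subprocess.run([
        "ffmpeg", "-y",
        "-i", video_path,          # silent generated video
        "-i", audio_path,          # cloned-timbre audio stream
        "-c:v", "copy",            # keep the video stream untouched
        "-c:a", "aac",             # encode the audio track
        "-shortest",               # stop at the shorter of the two streams
        out_path,
    ], check=True)

mux_audio_video("wav2talker_out.mp4", "cloned_voice.wav",
                "teaching_digital_person.mp4")
```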
As can be seen from FIG. 1, the process of constructing the Wav2Talker digital human model is as follows:
Input: the input video key frame sequence is V = {V0, ..., Vn}, where n is the number of key frames, and the input audio corresponding to the input video is denoted α = {α0, ..., αn};
First, 24 human skeleton key points and 21 hand key points are extracted from the teaching teacher picture in the initial video key frame V0 (a single-frame image); these 45 points are called the initial limb key points and are denoted η0;
Then, a conditional generative adversarial network CGAN (the limb GAN) is built; by progressively feeding in η0 and {α0, ..., αn}, the generator module of the limb GAN progressively generates the subsequent limb key point sequence {η1, ..., ηn};
Next, the {η1, ..., ηn} sequence is turned into a natural, continuous, and consistent limb-motion video of the teaching teacher through limb-key-point-to-video rendering (skeleton-to-video rendering);
Finally, the SadTalker head pose (PoseVAE) and facial expression (ExpNet) modules are combined to form the complete teaching digital human video.
FIG. 1 is the main framework diagram of Wav2Talker. A limb GAN model is added on top of SadTalker; the limb GAN progressively generates the subsequent limb key points from the input audio and the initial limb key points, and a realistic human-motion video is finally rendered from the key points (key-point-to-video rendering). A rasterization sketch for this rendering step is given below.
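One common way to feed generated key points into a video renderer such as few-shot-vid2vid (reference 7) is to rasterize each key point set ηt into a skeleton map that conditions the renderer. The sketch below assumes normalized (x, y) coordinates and an illustrative bone-connection list; neither is specified by the patent.

```python
# Sketch: rasterize a generated key point set η_t into a skeleton map that a
# vid2vid-style renderer can condition on.
# Assumption: BONES is an illustrative connection list, not the patent's.
import cv2
import numpy as np

BONES = [(0, 1), (1, 2), (2, 3), (1, 4), (4, 5)]   # hypothetical index pairs

def rasterize_skeleton(eta_t: np.ndarray, height: int, width: int) -> np.ndarray:
    """eta_t: (45, 2) normalized key points -> (H, W, 3) skeleton image."""
    canvas = np.zeros((height, width, 3), dtype=np.uint8)
    pts = (eta_t * np.array([width, height])).astype(int)
    for a, b in BONES:
        cv2.line(canvas, (int(pts[a][0]), int(pts[a][1])),
                 (int(pts[b][0]), int(pts[b][1])), (0, 255, 0), thickness=2)
    for x, y in pts:
        cv2.circle(canvas, (int(x), int(y)), radius=3,
                   color=(0, 0, 255), thickness=-1)
    return canvas
```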
The limb GAN is a conditional generative adversarial network (CGAN), an extension of the original generative adversarial network (GAN) that can generate limb-motion key points under specific conditions. The CGAN is based on the following two key ideas:
(1) Conditional input: unlike the original GAN, the conditional GAN introduces additional condition inputs in both the generator and the discriminator. The condition can be any form of auxiliary information, such as audio features or the preceding limb key points. The generator combines these conditions with a random noise input to generate limb key points (limb landmarks) under the current conditions; the discriminator takes these conditions as additional inputs to judge the authenticity of the limb-motion key points and their consistency and continuity with the preceding limb key points.
(2) Adversarial training: the idea of adversarial training between the generator and the discriminator is unchanged in the conditional GAN. The generator's goal is to produce fluent limb-motion key points (limb landmarks) that fool the discriminator as much as possible; the discriminator's goal is to determine as accurately as possible whether the generated limb-motion key points are real or fake and whether they are consistent and continuous with the preceding limb key points. A minimal model sketch follows these two points.
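The following PyTorch sketch shows one way such a conditional generator and discriminator could be laid out. The feature dimensions (90 = 45 key points x 2 coordinates, a 128-dimensional audio feature, 64-dimensional noise) and the simple MLP layout are illustrative assumptions, not the patent's actual network design.

```python
# Minimal sketch of the limb GAN's conditional generator and discriminator.
# Assumption: dimensions and the MLP layout are illustrative only.
import torch
import torch.nn as nn

KP_DIM, AUDIO_DIM, NOISE_DIM = 45 * 2, 128, 64

class LimbGenerator(nn.Module):
    """noise + (audio feature, previous key points) -> next key points η_t."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(NOISE_DIM + AUDIO_DIM + KP_DIM, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, KP_DIM), nn.Tanh())   # normalized coordinates

    def forward(self, z, audio_feat, prev_kp):
        return self.net(torch.cat([z, audio_feat, prev_kp], dim=-1))

class LimbDiscriminator(nn.Module):
    """(candidate key points, condition) -> probability of being real and consistent."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(KP_DIM + AUDIO_DIM + KP_DIM, 256), nn.LeakyReLU(0.2),
            nn.Linear(256, 1), nn.Sigmoid())

    def forward(self, kp, audio_feat, prev_kp):
        return self.net(torch.cat([kp, audio_feat, prev_kp], dim=-1))
```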
The specific working procedure is as follows (a minimal training-loop sketch follows the list):
(1) Condition input: the condition information (audio features, preceding limb key points) is provided as an additional input to the generator and the discriminator.
(2) Generation: the generator receives random noise and the input conditions and generates consecutive limb-motion key points.
(3) Real/fake discrimination: the discriminator receives the generated limb-motion key points and the input conditions, tries to distinguish real limb-motion key points from generated ones, and judges whether they are consistent and continuous with the preceding key points.
(4) Loss computation: the losses of the generator and the discriminator are computed from the discriminator's output. The generator's loss objective is for its generated limb-motion key points to be misclassified as real by the discriminator; the discriminator's loss objective is to accurately distinguish real key points from generated ones and to judge whether consistency and continuity are maintained.
(5) Parameter update: the parameters of the generator and the discriminator are updated according to the gradients of the loss functions.
By repeatedly iterating these steps, the performance of the generator and the discriminator gradually improves, and the generated limb-motion key points become more and more lifelike. The conditional input enables the generator to produce limb-motion key points with specified attributes under specific conditions, which broadens the range of GAN applications.
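A single training iteration corresponding to steps (1) to (5) could look as follows, reusing the LimbGenerator and LimbDiscriminator sketch above. The binary cross-entropy loss and Adam settings are a common CGAN recipe assumed for illustration; the patent does not specify them, and the consistency/continuity criterion is represented only implicitly through the previous-key-point condition.

```python
# One training iteration of the limb GAN, matching steps (1)-(5) above.
# Assumption: a data loader yields batches of (audio_feat, prev_kp, real_kp);
# BCE + Adam is a standard CGAN recipe, not taken from the patent text.
import torch
import torch.nn as nn

G, D = LimbGenerator(), LimbDiscriminator()
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4, betas=(0.5, 0.999))
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4, betas=(0.5, 0.999))
bce = nn.BCELoss()

def train_step(audio_feat, prev_kp, real_kp):
    batch = real_kp.size(0)
    real_lbl, fake_lbl = torch.ones(batch, 1), torch.zeros(batch, 1)

    # Discriminator: real vs generated under the same condition (steps 3-4).
    z = torch.randn(batch, NOISE_DIM)
    fake_kp = G(z, audio_feat, prev_kp).detach()
    loss_d = bce(D(real_kp, audio_feat, prev_kp), real_lbl) + \
             bce(D(fake_kp, audio_feat, prev_kp), fake_lbl)
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()        # step (5)

    # Generator: make the discriminator label its samples as real (step 4).
    z = torch.randn(batch, NOISE_DIM)
    gen_kp = G(z, audio_feat, prev_kp)
    loss_g = bce(D(gen_kp, audio_feat, prev_kp), real_lbl)
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()        # step (5)
    return loss_d.item(), loss_g.item()
```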
As mentioned above, for a better understanding of the invention, reference may be made to the following references. The entire contents of each of these references are incorporated herein by reference.
1. svc-develop-team/so-vits-svc: SoftVC VITS Singing Voice Conversion, voice timbre conversion based on generative adversarial networks.
2. GitHub - Rudrabha/Wav2Lip: This repository contains the code of "A Lip Sync Expert Is All You Need for Speech to Lip Generation In the Wild", published at ACM Multimedia 2020.
3. OpenTalker/SadTalker: [CVPR 2023] SadTalker: Learning Realistic 3D Motion Coefficients for Stylized Audio-Driven Single Image Talking Face Animation (github.com).
4. GitHub - TencentARC/GFPGAN: GFPGAN aims at developing Practical Algorithms for Real-world Face Restoration.
5. GitHub - modelscope/facechain: FaceChain is a deep-learning toolchain for generating your Digital-Twin.
6. OpenTalker/video-retalking: [SIGGRAPH Asia 2022] VideoReTalking: Audio-based Lip Synchronization for Talking Head Video Editing In the Wild (github.com).
7. NVlabs/few-shot-vid2vid: PyTorch implementation for few-shot photorealistic video-to-video translation (github.com), enabling motion transfer.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention in any form. Any simple modification, equivalent variation, or improvement made to the above embodiments according to the technical substance of the present invention falls within the technical scope of the present invention.

Claims (2)

1. A method of constructing a teaching digital person, comprising:
(1) using the deep learning algorithm so-vits-svc to clone the voice timbre of a designated teaching teacher and generate an audio stream carrying that teacher's timbre;
(2) constructing the Wav2Talker digital human model, wherein Wav2Talker is based on the SadTalker model;
(3) applying the video-retalking technique to add emotion changes to the facial expression, so that the digital person's expression has the same emotional expression as a real person;
(4) extending the GFPGAN face (eyes and nose) super-resolution algorithm to render the entire face in high definition, the algorithm converting an input low-resolution image into a high-resolution image while preserving image detail and sharpness;
(5) adopting the FaceChain deep learning model tool to construct digital human images that resemble real portraits.
2. The method of claim 1, wherein:
the process of constructing the Wav2Talker digital human model is as follows:
input: the input video key frame sequence is V = {V0, ..., Vn}, where n is the number of key frames, and the input audio corresponding to the input video is denoted α = {α0, ..., αn};
first, 24 human skeleton key points and 21 hand key points are extracted from the teaching teacher picture in the initial video key frame V0 (a single-frame image); these 45 points are called the initial limb key points and are denoted η0;
then, a conditional generative adversarial network CGAN (the limb GAN) is built; by progressively feeding in η0 and {α0, ..., αn}, the generator module of the limb GAN progressively generates the subsequent limb key point sequence {η1, ..., ηn};
next, the {η1, ..., ηn} sequence is turned into a natural, continuous limb-motion video of the teaching teacher through limb-key-point-to-video rendering (skeleton-to-video rendering);
finally, the SadTalker head pose (PoseVAE) and facial expression (ExpNet) modules are combined to form the complete teaching digital human video.
CN202410214465.0A 2024-02-27 2024-02-27 Method for constructing teaching digital person Pending CN117933318A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410214465.0A CN117933318A (en) 2024-02-27 2024-02-27 Method for constructing teaching digital person

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410214465.0A CN117933318A (en) 2024-02-27 2024-02-27 Method for constructing teaching digital person

Publications (1)

Publication Number Publication Date
CN117933318A true CN117933318A (en) 2024-04-26

Family

ID=90755572

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410214465.0A Pending CN117933318A (en) 2024-02-27 2024-02-27 Method for constructing teaching digital person

Country Status (1)

Country Link
CN (1) CN117933318A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination