CN117933318A - Method for constructing teaching digital person - Google Patents

Method for constructing teaching digital person

Info

Publication number
CN117933318A
CN117933318A (application CN202410214465.0A)
Authority
CN
China
Prior art keywords
human
limb
digital
video
teaching
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410214465.0A
Other languages
Chinese (zh)
Inventor
方明 (Fang Ming)
余松 (Yu Song)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan Zhidao Online Education Technology Co., Ltd.
Original Assignee
Wuhan Zhidao Online Education Technology Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan Zhidao Online Education Technology Co., Ltd.
Priority to CN202410214465.0A
Publication of CN117933318A
Pending legal-status Critical Current

Landscapes

  • Processing Or Creating Images (AREA)

Abstract

The invention relates to the field of artificial intelligence and provides a method for constructing a teaching digital person. The method uses the deep learning algorithm so-vits-svc to clone the voice timbre of a teaching teacher and generate an audio stream carrying that timbre; constructs a SadTalker-based Wav2Talker digital human model, which uses deep learning to generate video of the teacher's natural limb, gesture, and expression dynamics from the audio stream and a character picture; applies the video-retalking technique to add emotion changes to the facial expression; extends the GFPGAN face (eyes and nose) super-resolution algorithm to render the full set of facial features in high definition; and adopts the FaceChain deep learning model tool to construct digital human images that resemble real portraits. The method reduces cost and requires neither three-dimensional modeling nor motion capture to form a virtual person. The various generative models, including timbre, gesture, expression, and limb-motion models, are trained from the teacher's recorded and live broadcasts, yielding digital persons with multiple portrait images.

Description

Method for constructing teaching digital person
Technical Field
The invention relates to the field of artificial intelligence, and in particular to a method for constructing a teaching digital person.
Background
Digital human technology is an innovative technology based on artificial intelligence and natural language processing that aims to create more intelligent, personalized, and humanized human-computer interaction experiences. It can be applied to virtual assistants, customer-service robots, education and training, and similar fields, and achieves more natural and fluent communication with users by simulating human language and reasoning.
Behind digital human technology lie many techniques, such as deep learning, natural language processing, and dialog system design, together with large corpora and extensive algorithm optimization. Through continuous training and optimization, digital human technology can gradually raise its level of intelligence so that interaction with users meets personalized requirements.
In general, the development of digital human technology represents the latest progress of artificial intelligence and natural language processing in the field of human-computer interaction, and offers new possibilities for improving user experience and intelligent service capability.
The prior art realizes live broadcasting of virtual characters using computer technology and artificial intelligence. The following are some key technical points of existing virtual-person live broadcasting:
(1) Three-dimensional modeling: virtual-person live broadcasting requires three-dimensional modeling of the virtual character, including the head, body, and limbs. The appearance contours, facial expressions, gestures, and other features of the real person are converted into a three-dimensional model through advanced graphics processing and computer vision algorithms.
(2) Motion capture: live broadcasting requires capturing the motion of the real person and applying it to the virtual person so that the virtual person can simulate the real person's movements in real time. The real person's motion data is typically captured with sensor devices or cameras and mapped onto the skeletal system of the virtual character by algorithms.
(3) Speech synthesis: in virtual-person live broadcasting, the virtual person needs natural and fluent speech. Speech synthesis converts text into realistic speech so that the virtual character can broadcast in a natural voice.
(4) Semantic understanding: natural language processing and artificial intelligence algorithms are required for the virtual character to understand viewers and respond automatically. These techniques analyze audience questions or comments and generate meaningful responses; semantic analysis, emotion recognition, and dialog generation play important roles here.
(5) Real-time rendering: virtual live broadcasting requires the virtual character to be rendered in the scene in real time to maintain a smooth live experience. Real-time rendering converts the three-dimensional model into realistic images by exploiting the parallel computing power of the graphics processing unit (GPU) and presents them to viewers in real time.
(6) Interactive communication: virtual-person live broadcasting requires real-time interaction with the audience. This can be achieved through natural language processing, emotion recognition, and related techniques, so that the avatar can understand audience questions and respond accordingly.
By combining these key techniques, real-time live broadcasting of virtual characters can be realized, bringing brand-new entertainment and communication experiences to audiences.
Disadvantages of the prior art
(1) Professionals, professional sensors, and professional software are required to capture the appearance contours, facial expressions, and postures of the real person and convert them into a three-dimensional model, which is very complex and costly.
(2) Modeling with professional software generally yields a virtual animated character that differs considerably from a real person and feels unreal and unnatural.
(3) The real person's timbre is missing, and the voice is not synchronized with the virtual person's lip motion.
(4) Emotion is missing: the expression is stiff or lacks emotional expressiveness.
(5) Postures are awkward, unsynchronized with the virtual person's voice, and uncoordinated.
(6) Facial expressions are missing or mechanical.
(7) The character image is single and fixed, limited to one portrait.
Disclosure of Invention
Aiming at the defects of the prior art and solving the problem of missing limb motion, the invention provides a method for creating a more complete teaching digital person that combines timbre, expression, limbs, and gestures.
The technical scheme adopted by the invention is as follows:
A method of constructing a teaching digital person comprises:
(1) Using the deep learning algorithm so-vits-svc to clone the voice timbre of a designated teaching teacher and generate an audio stream carrying that teacher's timbre;
(2) Constructing the Wav2Talker digital human model. Wav2Talker is based on the SadTalker model; however, SadTalker fuses only three features (speech audio, head pose, and facial expression) to generate a talking-head video. It is limited to head dynamics, lacks limb motion below the head, and cannot express the body language of a complete digital person. The Wav2Talker model therefore adds human limb-motion generation, including human skeleton key points and hand key points, on top of SadTalker to form a complete digital human image;
(3) Applying the video-retalking technique to add emotion changes, such as happiness, neutrality, and sadness, to the facial expression, so that the digital person exhibits the same emotional expression as a real person;
(4) Extending the GFPGAN face (eyes and nose) super-resolution algorithm to render all facial features in high definition. The algorithm converts an input low-resolution image into a high-resolution image while preserving image detail and sharpness. With this technique, the digital person's entire set of facial organs can be refined, making it more realistic and clear.
(5) Adopting the FaceChain deep learning model tool to construct digital human images that resemble real portraits.
By combining the above techniques, the audio stream carrying the teacher's timbre is synchronized with lip motion, facial expression, and limb motion, so that the digital person appears more real and natural than a traditional virtual person. A minimal end-to-end orchestration sketch is given below.
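The following Python sketch illustrates one possible way to chain the five stages. It is a minimal orchestration sketch only: every helper function is a hypothetical placeholder standing in for the corresponding external tool (so-vits-svc, Wav2Talker, video-retalking, GFPGAN, FaceChain), and none of these function names is an API published by those projects.

```python
"""Minimal orchestration sketch for the five-stage pipeline.

Assumption: every helper below is a hypothetical placeholder standing in for
an external tool (so-vits-svc, Wav2Talker, video-retalking, GFPGAN, FaceChain);
none of these names is an API published by those projects.
"""

def clone_voice(source_audio: str, speaker: str) -> str:          # so-vits-svc
    raise NotImplementedError("call the trained so-vits-svc model here")

def wav2talker_generate(audio: str, portrait: str) -> str:        # Wav2Talker
    raise NotImplementedError("SadTalker head modules + limb GAN here")

def add_emotion(video: str, emotion: str) -> str:                 # video-retalking
    raise NotImplementedError("emotion editing of the facial expression here")

def enhance_faces(video: str) -> str:                             # GFPGAN
    raise NotImplementedError("per-frame face restoration here")

def apply_portrait_style(video: str, style: str) -> str:          # FaceChain
    raise NotImplementedError("portrait-style generation here")

def build_teaching_digital_person(script_audio: str, portrait: str,
                                  emotion: str = "neutral",
                                  style: str = "teacher") -> str:
    """Run the five stages in order and return the final video path."""
    audio = clone_voice(script_audio, speaker="teacher")      # step (1)
    video = wav2talker_generate(audio, portrait)              # step (2)
    video = add_emotion(video, emotion)                       # step (3)
    video = enhance_faces(video)                              # step (4)
    return apply_portrait_style(video, style)                 # step (5)
```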
The process of constructing the Wav2Talker digital human model is as follows:
Input: the input video key frame sequence is V = {V0, ..., Vn}, where n is the number of key frames, and the input audio corresponding to the input video is denoted α = {α0, ..., αn};
First, 24 human skeleton key points and 21 hand key points are extracted from the teaching teacher picture in the initial video key frame V0 (a single-frame image); these 45 points are called the initial limb key points and are denoted η0 (a minimal extraction sketch is given after this list);
Then, a conditional generative adversarial network CGAN (the limb GAN) is built; by progressively feeding in η0 and {α0, ..., αn}, the generator module of the limb GAN progressively generates the subsequent limb key point sequence {η1, ..., ηn};
Next, the {η1, ..., ηn} sequence is turned into a natural, continuous, and consistent limb-motion video of the teaching teacher through limb-key-point-to-video rendering (skeleton-to-video rendering);
Finally, the SadTalker head pose (PoseVAE) and facial expression (ExpNet) modules are combined to form the complete teaching digital human video.
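As a concrete illustration of the first step, the sketch below extracts η0 from the frame V0. The patent does not name a key point detector, so MediaPipe Holistic is used here purely as an assumed choice, keeping the first 24 of its 33 pose landmarks and the 21 right-hand landmarks to match the 24 + 21 = 45 points described above.

```python
# Sketch of extracting the 45 initial limb key points η0 from frame V0.
# Assumption: MediaPipe Holistic is one possible detector; the 24-point body
# subset and the choice of the right hand are illustrative, not the patent's.
import cv2
import numpy as np
import mediapipe as mp

def extract_initial_limb_keypoints(frame_bgr: np.ndarray) -> np.ndarray:
    """Return η0 as a (45, 2) array of normalized (x, y) coordinates."""
    with mp.solutions.holistic.Holistic(static_image_mode=True) as holistic:
        results = holistic.process(cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB))

    if results.pose_landmarks is None or results.right_hand_landmarks is None:
        raise ValueError("body or hand not detected in the initial key frame")

    body = [(lm.x, lm.y) for lm in results.pose_landmarks.landmark[:24]]   # 24 points
    hand = [(lm.x, lm.y) for lm in results.right_hand_landmarks.landmark]  # 21 points
    return np.array(body + hand, dtype=np.float32)                         # (45, 2)

# Usage: eta0 = extract_initial_limb_keypoints(cv2.imread("V0.png"))
```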
Compared with the prior art, the invention has the following beneficial effects:
With the method for constructing a teaching digital person, by cloning the voice timbre, facial expressions, and limb-motion characteristics of a real teaching teacher, the digital person can replace the real teacher for 24-hour live broadcasting and interaction, achieving the same teaching effect as the real person.
The method proposes the Wav2Talker deep learning model on the basis of Wav2Lip and SadTalker. It remedies the limitation that Wav2Lip and SadTalker can only generate expressions and motion of the head, and can generate both facial expressions and limb motion, making the digital person more complete and more realistic.
The method reduces cost: no three-dimensional modeling or motion capture technology is needed to form a virtual person, and the real person's original timbre and gesture motion are cloned directly for producing new videos.
The method builds a real timbre model of the person from that person's past recorded and live broadcast audio, and produces truly natural lip motion, facial expressions, and limb motion; facial expressions with emotional expressiveness; and digital persons with multiple portrait images.
Drawings
FIG. 1 is the main framework diagram of Wav2Talker.
In the SadTalker framework, the monocular three-dimensional face reconstruction module uses 3DMM coefficients as the intermediate motion representation. Realistic 3D motion coefficients (facial expression β, head pose ρ) are first generated from the audio, and these coefficients are then used to implicitly modulate a three-dimensionally aware face renderer to generate the final video.
A limb-motion key point extraction module is added on top of the SadTalker framework. This module extracts 24 human skeleton key points and 21 hand key points, i.e., 45 initial limb key points, denoted η0. Then, a conditional generative adversarial network CGAN (the limb GAN) is built; by progressively feeding in η0 and {α0, ..., αn}, the generator module of the limb GAN progressively generates the subsequent limb key point sequence {η1, ..., ηn}. Next, the {η1, ..., ηn} sequence is turned into a natural, continuous, and consistent limb-motion video of the teaching teacher through limb-key-point-to-video rendering (skeleton-to-video rendering). Finally, the SadTalker head pose (PoseVAE) and facial expression (ExpNet) modules are combined to form the complete teaching digital human video.
Detailed Description
The invention is described in detail below with reference to the accompanying drawings and embodiments:
A method of constructing a teaching digital person comprises:
(1) Using the deep learning algorithm so-vits-svc to clone the voice timbre of a designated teaching teacher and generate an audio stream carrying that teacher's timbre;
(2) Constructing the Wav2Talker digital human model. Wav2Talker is based on the SadTalker model; however, SadTalker fuses only three features (speech audio, head pose, and facial expression) to generate a talking-head video. It is limited to head dynamics, lacks limb motion below the head, and cannot express the body language of a complete digital person. The Wav2Talker model therefore adds human limb-motion features, including human skeleton key points and hand key points, on top of SadTalker to form a complete digital human image;
(3) Applying the video-retalking technique to add emotion changes, such as happiness, neutrality, and sadness, to the facial expression, so that the digital person exhibits the same emotional expression as a real person;
(4) Extending the GFPGAN face (eyes and nose) super-resolution algorithm to render all facial features in high definition. The algorithm converts an input low-resolution image into a high-resolution image while preserving image detail and sharpness. With this technique, the digital person's entire set of facial organs can be refined, making it more realistic and clear (a restoration sketch is given below).
(5) Adopting the FaceChain deep learning model tool to construct digital human images that resemble real portraits.
Digital persons with various portrait images, such as an office image, a teacher image, and a white-collar image, are generated in real time.
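The per-frame face-restoration step (4) can be sketched as follows. The GFPGANer call follows the public TencentARC/GFPGAN repository; the weight file name and exact parameter values are assumptions and may differ by version, so this is an illustrative sketch rather than a pinned API.

```python
# Per-frame face restoration sketch based on the public TencentARC/GFPGAN repo.
# Assumption: "GFPGANv1.4.pth" is a locally downloaded weight file; parameter
# names may differ between GFPGAN versions.
import cv2
from gfpgan import GFPGANer

def restore_frame(frame_bgr, restorer: GFPGANer):
    """Return the frame with the facial region restored and up-scaled."""
    _, _, restored = restorer.enhance(
        frame_bgr,
        has_aligned=False,        # the face is not pre-cropped or aligned
        only_center_face=True,    # one teacher per frame
        paste_back=True)          # blend the restored face back into the frame
    return restored

restorer = GFPGANer(model_path="GFPGANv1.4.pth",
                    upscale=2, arch="clean", channel_multiplier=2)
frame = cv2.imread("frame_0001.png")
cv2.imwrite("frame_0001_hd.png", restore_frame(frame, restorer))
```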
The Shangde teaching digital human system is built through this pipeline.
By combining the above techniques, the audio stream carrying the teacher's timbre is synchronized with lip motion, facial expression, and gesture motion, so that the digital person's performance is more real and natural than that of a traditional virtual person. A minimal muxing sketch for attaching the cloned audio track to the generated video is given below.
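To deliver a file in which lips, expressions, and gestures stay aligned with the cloned voice, the generated video and the audio stream can be muxed together, for example with ffmpeg. The file names below are illustrative assumptions, and ffmpeg must be installed on the system.

```python
# Muxing sketch: combine the Wav2Talker video with the cloned-timbre audio so
# that lip motion, expression, and gestures stay aligned with the sound track.
import subprocess

def mux_audio_video(video_path: str, audio_path: str, out_path: str) -> None:
    subprocess.run([
        "ffmpeg", "-y",
        "-i", video_path,          # silent generated video
        "-i", audio_path,          # cloned-timbre audio stream
        "-c:v", "copy",            # keep the video stream untouched
        "-c:a", "aac",             # encode the audio track
        "-shortest",               # stop at the shorter of the two streams
        out_path,
    ], check=True)

mux_audio_video("wav2talker_out.mp4", "cloned_voice.wav",
                "teaching_digital_person.mp4")
```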
As can be seen from FIG. 1, the process of constructing the Wav2Talker digital human model is as follows:
Input: the input video key frame sequence is V = {V0, ..., Vn}, where n is the number of key frames, and the input audio corresponding to the input video is denoted α = {α0, ..., αn};
First, 24 human skeleton key points and 21 hand key points are extracted from the teaching teacher picture in the initial video key frame V0 (a single-frame image); these 45 points are called the initial limb key points and are denoted η0;
Then, a conditional generative adversarial network CGAN (the limb GAN) is built; by progressively feeding in η0 and {α0, ..., αn}, the generator module of the limb GAN progressively generates the subsequent limb key point sequence {η1, ..., ηn};
Next, the {η1, ..., ηn} sequence is turned into a natural, continuous, and consistent limb-motion video of the teaching teacher through limb-key-point-to-video rendering (skeleton-to-video rendering);
Finally, the SadTalker head pose (PoseVAE) and facial expression (ExpNet) modules are combined to form the complete teaching digital human video.
FIG. 1 is the main framework diagram of Wav2Talker. A limb GAN model is added on top of SadTalker; the limb GAN progressively generates the subsequent limb key points from the input audio and the initial limb key points, and a realistic human-motion video is finally rendered from the key points (key-point-to-video rendering). A rasterization sketch for this rendering step is given below.
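One common way to feed generated key points into a video renderer such as few-shot-vid2vid (reference 7) is to rasterize each key point set ηt into a skeleton map that conditions the renderer. The sketch below assumes normalized (x, y) coordinates and an illustrative bone-connection list; neither is specified by the patent.

```python
# Sketch: rasterize a generated key point set η_t into a skeleton map that a
# vid2vid-style renderer can condition on.
# Assumption: BONES is an illustrative connection list, not the patent's.
import cv2
import numpy as np

BONES = [(0, 1), (1, 2), (2, 3), (1, 4), (4, 5)]   # hypothetical index pairs

def rasterize_skeleton(eta_t: np.ndarray, height: int, width: int) -> np.ndarray:
    """eta_t: (45, 2) normalized key points -> (H, W, 3) skeleton image."""
    canvas = np.zeros((height, width, 3), dtype=np.uint8)
    pts = (eta_t * np.array([width, height])).astype(int)
    for a, b in BONES:
        cv2.line(canvas, (int(pts[a][0]), int(pts[a][1])),
                 (int(pts[b][0]), int(pts[b][1])), (0, 255, 0), thickness=2)
    for x, y in pts:
        cv2.circle(canvas, (int(x), int(y)), radius=3,
                   color=(0, 0, 255), thickness=-1)
    return canvas
```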
The limb GAN is a conditional generative adversarial network (CGAN), an extension of the original generative adversarial network (GAN) that can generate limb-motion key points under specific conditions. The CGAN is based on the following two key ideas:
(1) Conditional input: unlike the original GAN, the conditional GAN introduces additional condition inputs in both the generator and the discriminator. The condition can be any form of auxiliary information, such as audio features or the preceding limb key points. The generator combines these conditions with a random noise input to generate limb key points (limb landmarks) under the current conditions; the discriminator takes these conditions as additional inputs to judge the authenticity of the limb-motion key points and their consistency and continuity with the preceding limb key points.
(2) Adversarial training: the idea of adversarial training between the generator and the discriminator is unchanged in the conditional GAN. The generator's goal is to produce fluent limb-motion key points (limb landmarks) that fool the discriminator as much as possible; the discriminator's goal is to determine as accurately as possible whether the generated limb-motion key points are real or fake and whether they are consistent and continuous with the preceding limb key points. A minimal model sketch follows these two points.
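The following PyTorch sketch shows one way such a conditional generator and discriminator could be laid out. The feature dimensions (90 = 45 key points x 2 coordinates, a 128-dimensional audio feature, 64-dimensional noise) and the simple MLP layout are illustrative assumptions, not the patent's actual network design.

```python
# Minimal sketch of the limb GAN's conditional generator and discriminator.
# Assumption: dimensions and the MLP layout are illustrative only.
import torch
import torch.nn as nn

KP_DIM, AUDIO_DIM, NOISE_DIM = 45 * 2, 128, 64

class LimbGenerator(nn.Module):
    """noise + (audio feature, previous key points) -> next key points η_t."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(NOISE_DIM + AUDIO_DIM + KP_DIM, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, KP_DIM), nn.Tanh())   # normalized coordinates

    def forward(self, z, audio_feat, prev_kp):
        return self.net(torch.cat([z, audio_feat, prev_kp], dim=-1))

class LimbDiscriminator(nn.Module):
    """(candidate key points, condition) -> probability of being real and consistent."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(KP_DIM + AUDIO_DIM + KP_DIM, 256), nn.LeakyReLU(0.2),
            nn.Linear(256, 1), nn.Sigmoid())

    def forward(self, kp, audio_feat, prev_kp):
        return self.net(torch.cat([kp, audio_feat, prev_kp], dim=-1))
```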
The specific working procedure is as follows (a minimal training-loop sketch follows the list):
(1) Condition input: the condition information (audio features, preceding limb key points) is provided as an additional input to the generator and the discriminator.
(2) Generation: the generator receives random noise and the input conditions and generates consecutive limb-motion key points.
(3) Real/fake discrimination: the discriminator receives the generated limb-motion key points and the input conditions, tries to distinguish real limb-motion key points from generated ones, and judges whether they are consistent and continuous with the preceding key points.
(4) Loss computation: the losses of the generator and the discriminator are computed from the discriminator's output. The generator's loss objective is for its generated limb-motion key points to be misclassified as real by the discriminator; the discriminator's loss objective is to accurately distinguish real key points from generated ones and to judge whether consistency and continuity are maintained.
(5) Parameter update: the parameters of the generator and the discriminator are updated according to the gradients of the loss functions.
By repeatedly iterating these steps, the performance of the generator and the discriminator gradually improves, and the generated limb-motion key points become more and more lifelike. The conditional input enables the generator to produce limb-motion key points with specified attributes under specific conditions, which broadens the range of GAN applications.
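A single training iteration corresponding to steps (1) to (5) could look as follows, reusing the LimbGenerator and LimbDiscriminator sketch above. The binary cross-entropy loss and Adam settings are a common CGAN recipe assumed for illustration; the patent does not specify them, and the consistency/continuity criterion is represented only implicitly through the previous-key-point condition.

```python
# One training iteration of the limb GAN, matching steps (1)-(5) above.
# Assumption: a data loader yields batches of (audio_feat, prev_kp, real_kp);
# BCE + Adam is a standard CGAN recipe, not taken from the patent text.
import torch
import torch.nn as nn

G, D = LimbGenerator(), LimbDiscriminator()
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4, betas=(0.5, 0.999))
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4, betas=(0.5, 0.999))
bce = nn.BCELoss()

def train_step(audio_feat, prev_kp, real_kp):
    batch = real_kp.size(0)
    real_lbl, fake_lbl = torch.ones(batch, 1), torch.zeros(batch, 1)

    # Discriminator: real vs generated under the same condition (steps 3-4).
    z = torch.randn(batch, NOISE_DIM)
    fake_kp = G(z, audio_feat, prev_kp).detach()
    loss_d = bce(D(real_kp, audio_feat, prev_kp), real_lbl) + \
             bce(D(fake_kp, audio_feat, prev_kp), fake_lbl)
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()        # step (5)

    # Generator: make the discriminator label its samples as real (step 4).
    z = torch.randn(batch, NOISE_DIM)
    gen_kp = G(z, audio_feat, prev_kp)
    loss_g = bce(D(gen_kp, audio_feat, prev_kp), real_lbl)
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()        # step (5)
    return loss_d.item(), loss_g.item()
```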
As mentioned above, for a better understanding of the invention, reference may be made to the following references. The entire contents of each of these references are incorporated herein by reference.
1. svc-develop-team/so-vits-svc: SoftVC VITS Singing Voice Conversion, voice timbre conversion based on generative adversarial networks.
2. GitHub - Rudrabha/Wav2Lip: This repository contains the code of "A Lip Sync Expert Is All You Need for Speech to Lip Generation In the Wild", published at ACM Multimedia 2020.
3. OpenTalker/SadTalker: [CVPR 2023] SadTalker: Learning Realistic 3D Motion Coefficients for Stylized Audio-Driven Single Image Talking Face Animation (github.com).
4. GitHub - TencentARC/GFPGAN: GFPGAN aims at developing Practical Algorithms for Real-world Face Restoration.
5. GitHub - modelscope/facechain: FaceChain is a deep-learning toolchain for generating your Digital-Twin.
6. OpenTalker/video-retalking: [SIGGRAPH Asia 2022] VideoReTalking: Audio-based Lip Synchronization for Talking Head Video Editing In the Wild (github.com).
7. NVlabs/few-shot-vid2vid: PyTorch implementation for few-shot photorealistic video-to-video translation (github.com), enabling motion transfer.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention in any form. Any simple modification, equivalent variation, or improvement made to the above embodiments according to the technical substance of the present invention falls within the technical scope of the present invention.

Claims (2)

1. A method of constructing a teaching digital person, comprising:
(1) using the deep learning algorithm so-vits-svc to clone the voice timbre of a designated teaching teacher and generate an audio stream carrying that teacher's timbre;
(2) constructing the Wav2Talker digital human model, wherein Wav2Talker is based on the SadTalker model;
(3) applying the video-retalking technique to add emotion changes to the facial expression, so that the digital person's expression has the same emotional expression as a real person;
(4) extending the GFPGAN face (eyes and nose) super-resolution algorithm to render the entire face in high definition, the algorithm converting an input low-resolution image into a high-resolution image while preserving image detail and sharpness;
(5) adopting the FaceChain deep learning model tool to construct digital human images that resemble real portraits.
2. The method of claim 1, wherein:
the process of constructing the Wav2Talker digital human model is as follows:
input: the input video key frame sequence is V = {V0, ..., Vn}, where n is the number of key frames, and the input audio corresponding to the input video is denoted α = {α0, ..., αn};
first, 24 human skeleton key points and 21 hand key points are extracted from the teaching teacher picture in the initial video key frame V0 (a single-frame image); these 45 points are called the initial limb key points and are denoted η0;
then, a conditional generative adversarial network CGAN (the limb GAN) is built; by progressively feeding in η0 and {α0, ..., αn}, the generator module of the limb GAN progressively generates the subsequent limb key point sequence {η1, ..., ηn};
next, the {η1, ..., ηn} sequence is turned into a natural, continuous limb-motion video of the teaching teacher through limb-key-point-to-video rendering (skeleton-to-video rendering);
finally, the SadTalker head pose (PoseVAE) and facial expression (ExpNet) modules are combined to form the complete teaching digital human video.
CN202410214465.0A 2024-02-27 2024-02-27 Method for constructing teaching digital person Pending CN117933318A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410214465.0A CN117933318A (en) 2024-02-27 2024-02-27 Method for constructing teaching digital person

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410214465.0A CN117933318A (en) 2024-02-27 2024-02-27 Method for constructing teaching digital person

Publications (1)

Publication Number Publication Date
CN117933318A true CN117933318A (en) 2024-04-26

Family

ID=90755572

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410214465.0A Pending CN117933318A (en) 2024-02-27 2024-02-27 Method for constructing teaching digital person

Country Status (1)

Country Link
CN (1) CN117933318A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination