CN116309984A - Mouth shape animation generation method and system based on text driving - Google Patents

Mouth shape animation generation method and system based on text driving

Info

Publication number
CN116309984A
CN116309984A (application number CN202310045379.7A)
Authority
CN
China
Prior art keywords
mouth shape
text
animation
sequence
mouth
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310045379.7A
Other languages
Chinese (zh)
Inventor
陈国伟
李俊
齐慧杰
何强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xinhua Fusion Media Technology Development Beijing Co ltd
Xinhua New Media Culture Communication Co ltd
Communication University of China
Original Assignee
Xinhua Fusion Media Technology Development Beijing Co ltd
Xinhua New Media Culture Communication Co ltd
Communication University of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xinhua Fusion Media Technology Development Beijing Co ltd, Xinhua New Media Culture Communication Co ltd, Communication University of China
Priority to CN202310045379.7A
Publication of CN116309984A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 - Speech synthesis; Text to speech systems
    • G10L13/08 - Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/279 - Recognition of textual entities
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T13/00 - Animation
    • G06T13/20 - 3D [Three Dimensional] animation
    • G06T13/40 - 3D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 - Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/06 - Transformation of speech into a non-audible representation, e.g. speech visualisation or speech processing for tactile aids
    • G10L21/10 - Transforming into visible information
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 - Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/06 - Transformation of speech into a non-audible representation, e.g. speech visualisation or speech processing for tactile aids
    • G10L21/10 - Transforming into visible information
    • G10L2021/105 - Synthesis of the lips movements from speech, e.g. for talking heads

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • General Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Data Mining & Analysis (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The invention provides a text-driven method and system for generating mouth shape animation. The method comprises the following steps: defining a mouth shape action data set of a digital virtual human and establishing a pre-training model; inputting a text sequence of language text and performing phoneme recognition on the text sequence; mapping phonemes to mouth shape actions through the pre-training model and outputting a mouth shape action frame sequence; and synthesizing continuous-frame mouth shape animation from the mouth shape action frame sequence, with linear interpolation of animation frames between mouth shape action frames so that the text length remains consistent with the animation length and the language text stays synchronized with the mouth shape animation. The invention effectively combines linguistic and graphical features to construct a complete mouth shape action data set, which facilitates the subsequent generation of mouth shape actions; the mapping between phonemes and mouth shape action frames is completed with the pre-training model; and, if an attention model is adopted, output of contextually related (preceding and following) phonemes can be produced, providing better motion compensation and improving the mouth shape animation effect.

Description

Mouth shape animation generation method and system based on text driving
Technical Field
The invention relates to the technical field of mouth shape animation for digital virtual humans, and in particular to a text-driven method and system for generating mouth shape animation.
Background
A digital virtual human aims to create, through computer graphics (CG) technology, a digital figure resembling a human, give it a specific character identity, and shorten the visual and psychological distance to humans, so as to provide users with more realistic emotional interaction. In the broad sense, digital virtual humans are virtual characters with a digitized form, created and used by computer means such as computer graphics, graphics rendering, motion capture, deep learning and speech synthesis, and possessing multiple human features (appearance, human performance ability, human interaction ability, and so on).
A typical digital virtual human comprises a character image module, a speech generation module, an animation generation module, an audio-video synthesis and display module, and an interaction module. Producing a digital virtual human involves many technical fields, and the production workflow is not yet fully standardized.
The mouth shape generation technology closely related to digital virtual humans has also developed further. First, mouth shape generation can facilitate the production of multimedia content such as video games, dubbed movies and television programs. Likewise, in the field of virtual image production, a persona that incorporates visual information can enhance its perceived presence, for example in video conferencing, virtual announcers, virtual teachers, virtual assistants and digital twins of real people. In addition, mouth shape generation has great application potential and value in education, communication and multimodal human-computer interaction. In recent years, various mouth shape generation methods have been explored; they can be divided into intermediate feature methods and specific model methods.
(1) The intermediate feature method (Intermediate Features Method) first establishes a relationship between audio features and predefined intermediate features of the face model, and then generates the corresponding mouth shape from the intermediate features, which may be predefined facial landmarks or expression coefficients.
Among them, methods based on two-dimensional models typically use facial landmarks as intermediate features. Suwajanakorn et al. used recurrent neural networks (RNNs) to map Mel-frequency cepstral coefficient (MFCC) features to PCA coefficients of facial landmarks and, with texture information provided by input face images, generated target face images from the reconstructed landmarks. Zhou et al. mapped the input speech content code and facial landmark code to offsets of the target landmarks relative to a face template, and then generated images through an image-to-image network.
Methods based on three-dimensional models typically use facial parameters as intermediate features. Fried et al. designed a neural renderer that generates the target video using facial parameters of a head model as intermediate features. Wiles et al. built a mapping from audio features to hidden variables (latent codes) of a pre-trained face generation model to achieve audio-driven face video synthesis. Guo et al. used conditional implicit functions to generate a dynamic neural radiance field from audio features and then synthesized video by volume rendering.
(2) The specific model method (Specific Model Method) directly generates a video of a talking face or lips from an input driving source and a target face image, without involving intermediate facial parameters. Because the two-stage pipeline of the intermediate feature method is complex and requires auxiliary techniques such as facial landmark detection, facial parameter annotation and three-dimensional face reconstruction, research in recent years has focused more on single-stage generation methods, and corresponding results have been achieved in the synthesis of both two-dimensional and three-dimensional talking faces.
In two-dimensional face synthesis (Synthesis Based on 2D Models), the multidimensional morphable model (MMM), derived from three-dimensional model methods, provided an early approach; in the early stages of face generation, 3D face models based on HMM-predicted head trajectories and articulatory motion were also proposed. With the development of deep learning techniques, particularly DCN and GAN, in the field of face generation, many methods began to generate faces by integrating 3DMM and 3D face models. Pham et al. studied a blendshape model in which 3D rotation and expression coefficients are predicted from speech input alone; Karras et al. proposed a face model that generates facial expression vertex coordinates from speech and an emotion code; Edwards et al. proposed JALI, a 3D face simulation model focusing mainly on jaw and lip movements; and Zhou et al. proposed a face model based on the JALI or FACS standard, driven by viseme parameters predicted by a three-stage LSTM network.
For processing mouth shape animation, the existing mainstream methods include the performance-driven method, the voice-driven method, and the like; however, these conventional mouth shape generation methods have the following drawbacks and disadvantages:
1. The performance-driven method, see FIG. 3, is divided into three parts: face capture, expression generation and animation synthesis. The parameters processed are divided into static parameters (key points) and dynamic parameters (displacements and expression motion rules), and the character model and expression animation are synthesized from these parameters. Its advantage is smoothness close to that of a real person; its disadvantages are that motion capture requires expensive equipment, it places high demands on the operator, and it can produce delay and stutter, with the delay typically around 1-5 s.
2. The voice-driven method, see FIG. 4, processes the input speech in real time, extracts speech feature parameters (Mel-frequency coefficients, linear prediction cepstral coefficients, etc.), outputs a key frame sequence through the mapping between the speech signal and mouth shape motion, and synthesizes expression animation from the key frames. This method is limited by the characteristics of the speech itself: it has no effective way to handle silent or noisy segments, which can leave the animation completely still; moreover, the accuracy of feature extraction is limited, which can have a large impact on the animation effect.
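For background only, the following minimal sketch illustrates the speech feature extraction step of such a voice-driven pipeline using the librosa library; the file name, sampling rate and frame parameters are illustrative assumptions rather than values specified anywhere in this application.

```python
# Background sketch of MFCC extraction for a voice-driven pipeline (prior art);
# "speech.wav" and the parameter values are illustrative assumptions.
import librosa

# Load the input speech; resampling to 16 kHz is an assumed choice.
signal, sr = librosa.load("speech.wav", sr=16000)

# 13 Mel-frequency cepstral coefficients per ~25 ms frame with a 10 ms hop.
mfcc = librosa.feature.mfcc(
    y=signal, sr=sr, n_mfcc=13,
    n_fft=int(0.025 * sr), hop_length=int(0.010 * sr),
)
print(mfcc.shape)  # (13, number_of_frames); each column would drive one key-frame lookup
```

A silent or noisy segment yields MFCC columns that carry little articulatory information, which is exactly the limitation of the voice-driven method noted above.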
Disclosure of Invention
In view of this, the present invention aims to combine linguistic and graphical features and, driven by language text, to achieve real-time animation with higher accuracy, lower delay and a good mouth shape animation effect by means of the mapping between phonemes and mouth shape action frames.
The invention provides a method for generating mouth shape animation based on text driving, which comprises the following steps:
S1, defining a mouth shape action data set of a digital virtual human, and establishing a pre-training model (Pre-Training Model);
S2, inputting a text sequence of a language text, and carrying out phoneme recognition on the text sequence;
S3, mapping the relationship between phonemes and mouth shape actions through the pre-training model, and outputting a mouth shape action frame sequence;
S4, synthesizing continuous-frame mouth shape animation from the mouth shape action frame sequence, and performing linear interpolation of animation frames between the mouth shape action frames, so that the text length remains consistent with the animation length and the language text stays synchronized with the mouth shape animation;
the linear interpolation of animation frames between mouth shape action frames resolves the continuity problem between the mouth shape animations of successive phonemes.
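Purely for illustration, the following Python sketch strings steps S1-S4 together. Every name in it (PreTrainingModel, analyze_phonemes, interpolate_frames, generate_animation), the blendshape-style frame representation and the 25 fps frame rate are assumptions made for the sketch; the application itself does not prescribe this interface.

```python
# Illustrative sketch of the S1-S4 pipeline; all names are hypothetical
# placeholders, not APIs defined by this application.
from typing import Dict, List, Tuple

Frame = List[float]  # one mouth shape action frame, e.g. blendshape weights


class PreTrainingModel:
    """S1: wraps a predefined mouth shape action data set (phoneme -> key frames)."""

    def __init__(self, mouth_shape_dataset: Dict[str, List[Frame]], rest_frame: Frame):
        self.dataset = mouth_shape_dataset
        self.rest_frame = rest_frame

    def map_phoneme(self, phoneme: str) -> List[Frame]:
        # S3: relation mapping from a phoneme to its mouth shape action frames.
        return self.dataset.get(phoneme, [self.rest_frame])


def analyze_phonemes(text: str) -> List[Tuple[str, float]]:
    """S2 stub: a real system would call a phoneme analysis tool here.
    For the sketch, every non-space character counts as one 0.2 s phoneme."""
    return [(ch, 0.2) for ch in text if not ch.isspace()]


def interpolate_frames(a: Frame, b: Frame, n: int) -> List[Frame]:
    """Linear interpolation producing n frames strictly between frames a and b."""
    return [[x + (y - x) * (i + 1) / (n + 1) for x, y in zip(a, b)]
            for i in range(n)]


def generate_animation(text: str, model: PreTrainingModel, fps: int = 25) -> List[Frame]:
    """S4: assemble continuous frames, filling gaps so that the animation length
    stays consistent with the text length and the two remain synchronized."""
    frames: List[Frame] = []
    for phoneme, duration in analyze_phonemes(text):          # S2
        key_frames = model.map_phoneme(phoneme)               # S3
        missing = max(int(round(duration * fps)) - len(key_frames), 0)
        if frames and missing:
            # S4: linear interpolation between adjacent phoneme animations
            frames.extend(interpolate_frames(frames[-1], key_frames[0], missing))
        frames.extend(key_frames)
    return frames
```

In a toy call such as generate_animation("ma", PreTrainingModel({"m": [[0.1, 0.8]], "a": [[0.9, 0.2]]}, rest_frame=[0.0, 0.0])), the interpolated frames are what keep the animation length matched to the text length.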
Further, the method of inputting the text sequence of the language text in step S2 includes:
describing the entered text as a continuous text sequence.
This is because, in practical applications such as dialogue scenes, the language text arrives as a continuous input.
Further, the phoneme processing method in step S2 includes:
using a phoneme analysis tool to obtain the corresponding phoneme sequence from the input features, and obtaining the timing information of the phoneme sequence within the language text.
Further, the input features include:
sentence meaning features, character semantic features, grammar features, polyphone features.
Further, the phoneme recognition of the S2 step includes:
phoneme processing and time processing.
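As one concrete possibility for the phoneme processing and time processing described above, the sketch below uses the pypinyin library as the phoneme analysis tool for Chinese text; treating each pinyin syllable as a phoneme and assigning it a uniform 0.25 s duration are simplifying assumptions made for illustration, since the application does not name a specific tool or timing model.

```python
# Sketch of step S2 (phoneme recognition) for Chinese text with pypinyin;
# the per-syllable duration of 0.25 s is an illustrative assumption.
from typing import List, Tuple

from pypinyin import Style, lazy_pinyin


def recognize_phonemes(text: str,
                       seconds_per_syllable: float = 0.25) -> List[Tuple[str, float]]:
    """Return (phoneme, estimated_duration) pairs in reading order, giving both
    the phoneme sequence and its timing information within the language text."""
    # Style.TONE3 appends the tone digit (e.g. "xin1"); pypinyin's phrase
    # dictionary resolves common polyphonic characters from context.
    syllables = lazy_pinyin(text, style=Style.TONE3, errors="ignore")
    return [(s, seconds_per_syllable) for s in syllables]


if __name__ == "__main__":
    # "新华社数字主播" means "Xinhua News Agency digital anchor"
    for phoneme, duration in recognize_phonemes("新华社数字主播"):
        print(phoneme, duration)
```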
Further, the mouth shape animation synthesis method in step S4 includes:
synthesizing each sentence in the mouth shape action frame sequence into one animation, so that one animation corresponds to the complete phoneme sequence of that sentence; at the same time, with reference to the duration of the language text and the frame skipping between adjacent phoneme animations in the phoneme sequence, supplementing animation frames by linear interpolation; and finally outputting the mouth shape animation.
The invention also provides a system for generating mouth shape animation based on text driving, which executes the above text-driven mouth shape animation generation method and comprises:
a pre-training model module: used for defining a mouth shape action data set of the digital virtual human and establishing a pre-training model (Pre-Training Model);
a phoneme recognition module: used for inputting a text sequence of the language text and performing phoneme recognition on the text sequence;
a mouth shape action frame sequence module: used for mapping the relationship between phonemes and mouth shape actions through the pre-training model and outputting a mouth shape action frame sequence;
a mouth shape animation synthesis module: used for synthesizing continuous-frame mouth shape animation from the mouth shape action frame sequence and performing linear interpolation of animation frames between the mouth shape action frames, so that the text length remains consistent with the animation length and the language text stays synchronized with the mouth shape animation.
The present invention also provides a computer-readable storage medium on which a computer program is stored; when the program is executed by a processor, the steps of the above text-driven mouth shape animation generation method are implemented.
The present invention also provides a computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the above text-driven mouth shape animation generation method when executing the program.
Compared with the prior art, the invention has the beneficial effects that:
the method for generating the mouth shape animation based on the text driving can effectively combine the characteristics of linguistics and graphics to construct a complete mouth shape action data set, thereby being convenient for the generation of subsequent mouth shape actions; the mapping relation between the phonemes and the mouth shape action frames can be completed by combining the pre-training model; if the attention model is adopted, the output of the front and rear related phonemes can be completed, better motion compensation can be provided, so that the real-time animation can realize higher accuracy under lower delay, and the mouth shape animation effect is improved.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention.
In the drawings:
FIG. 1 is a flow chart of a method for generating a mouth shape animation based on text driving of the present invention;
FIG. 2 is a schematic diagram of a computer device according to an embodiment of the present invention;
fig. 3 is a schematic diagram of a prior art performance-based driving method;
FIG. 4 is a schematic diagram of a prior art voice-driven based approach;
FIG. 5 is a timing diagram of a mouth-shaped animation synthesis according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of the alignment process of mouth shape animation synthesis according to an embodiment of the present invention.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary examples are not representative of all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and products consistent with some aspects of the disclosure as detailed in the appended claims.
The terminology used in the present disclosure is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. As used in this disclosure and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any or all possible combinations of one or more of the associated listed items.
It should be understood that although the terms first, second, third, etc. may be used in this disclosure to describe various information, these information should not be limited to these terms. These terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of the present disclosure. The word "if" as used herein may be interpreted as "when" or "upon" or "in response to determining", depending on the context.
Embodiments of the present invention will be described in further detail below with reference to the accompanying drawings.
The embodiment of the invention provides a method for generating mouth shape animation based on text driving, as shown in FIG. 1, comprising the following steps:
S1, defining a mouth shape action data set of a digital virtual human, and establishing a pre-training model (Pre-Training Model);
S2, inputting a text sequence of a language text, and carrying out phoneme recognition on the text sequence;
further, the method of inputting the text sequence of the language text in the step S2 includes:
the entered text words are described as a sequence of text that is continuous.
Because in practical applications, in the case of dialog scenes, the input of language text is a continuous input.
The phoneme processing method includes:
using a phoneme analysis tool to obtain the corresponding phoneme sequence from the input features, and obtaining the timing information of the phoneme sequence within the language text.
The input features include:
sentence meaning features, character semantic features, grammar features, polyphone features.
The phoneme recognition comprises:
phoneme processing and time processing.
S3, mapping the relationship between phonemes and mouth shape actions through the pre-training model, and outputting a mouth shape action frame sequence;
S4, synthesizing continuous-frame mouth shape animation from the mouth shape action frame sequence, and performing linear interpolation of animation frames between the mouth shape action frames, so that the text length remains consistent with the animation length and the language text stays synchronized with the mouth shape animation.
The linear interpolation of animation frames between mouth shape action frames resolves the continuity problem between the mouth shape animations of successive phonemes.
The mouth shape animation synthesis method includes:
synthesizing each sentence in the mouth shape action frame sequence into one animation, so that one animation corresponds to the complete phoneme sequence of that sentence; at the same time, with reference to the duration of the language text and the frame skipping between adjacent phoneme animations in the phoneme sequence, supplementing animation frames by linear interpolation; and finally outputting the mouth shape animation.
Referring to fig. 5, a timing diagram of the composition of the mouth shape animation according to the present embodiment is shown.
In the present embodiment, referring to FIG. 6, assume a piece of input text S which, after analysis by the phoneme analysis tool, forms a phoneme sequence P = (p1, p2, p3, ..., pn) constituting a sentence with total duration T, where each phoneme corresponds to a time in tp = (tp1, tp2, tp3, ..., tpn). Typically T is greater than the sum of the elements of tp, because phoneme recognition does not take the duration of the phonemes into account, and characteristics of spoken expression, such as pauses, introduce a time difference.
Suppose phonemes pm and pm-1 have a time discontinuity; that is, within the same total time T, the animations formed by the two phonemes differ by an error Δt. For this error, frame filling can be performed by linear interpolation between the adjacent phoneme animations, that is, appropriate frames are generated and inserted into the interval between the two phoneme animations, thereby completing time synchronization and filling the Δt error.
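A minimal sketch of this alignment step follows, under the assumptions that each mouth shape action frame is a list of blendshape weights and that the animation plays at 25 fps; the function names are hypothetical, and only the Δt frame-filling idea of FIG. 6 is shown, not the full pre-training model.

```python
# Sketch of the Δt frame-filling / time-alignment step of FIG. 6;
# names, the frame representation and the 25 fps rate are assumptions.
from typing import List

Frame = List[float]  # one mouth shape action frame, e.g. blendshape weights


def lerp(a: Frame, b: Frame, t: float) -> Frame:
    """Linear interpolation between two mouth shape frames, with 0 <= t <= 1."""
    return [x + (y - x) * t for x, y in zip(a, b)]


def align_to_sentence(phoneme_frames: List[List[Frame]],
                      phoneme_times: List[float],
                      sentence_duration: float,
                      fps: int = 25) -> List[Frame]:
    """Concatenate per-phoneme animations, inserting interpolated frames wherever
    a phoneme animation is shorter than its time slot (the Δt error), and pad the
    tail so the total animation length matches the sentence duration T."""
    out: List[Frame] = []
    for frames, t_p in zip(phoneme_frames, phoneme_times):
        target = int(round(t_p * fps))      # frames this phoneme should occupy
        gap = target - len(frames)          # Δt expressed as missing frames
        if out and frames and gap > 0:
            # insert `gap` frames between the adjacent phoneme animations
            out.extend(lerp(out[-1], frames[0], (i + 1) / (gap + 1))
                       for i in range(gap))
        out.extend(frames)
    total = int(round(sentence_duration * fps))
    while out and len(out) < total:         # e.g. pauses make T exceed the sum of tp
        out.append(out[-1])                 # hold the last mouth shape
    return out
```

Holding the last frame for the remaining time is one simple way to absorb the pauses that make T exceed the sum of the phoneme durations.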
The embodiment of the invention also provides a system for generating mouth shape animation based on text driving, which executes the above text-driven mouth shape animation generation method and comprises:
a pre-training model module: used for defining a mouth shape action data set of the digital virtual human and establishing a pre-training model (Pre-Training Model);
a phoneme recognition module: used for inputting a text sequence of the language text and performing phoneme recognition on the text sequence;
a mouth shape action frame sequence module: used for mapping the relationship between phonemes and mouth shape actions through the pre-training model and outputting a mouth shape action frame sequence;
a mouth shape animation synthesis module: used for synthesizing continuous-frame mouth shape animation from the mouth shape action frame sequence and performing linear interpolation of animation frames between the mouth shape action frames, so that the text length remains consistent with the animation length and the language text stays synchronized with the mouth shape animation.
The text-driven mouth shape animation generation method of the invention can effectively combine linguistic and graphical features to construct a complete mouth shape action data set, which facilitates the subsequent generation of mouth shape actions; the mapping between phonemes and mouth shape action frames is completed with the pre-training model; and, if an attention model is adopted, output of contextually related (preceding and following) phonemes can be produced, providing better motion compensation, so that real-time animation achieves higher accuracy with lower delay and the mouth shape animation effect is improved.
The embodiment of the invention also provides a computer device; FIG. 2 is a schematic structural diagram of the computer device provided by the embodiment of the invention. Referring to FIG. 2, the computer device includes: an input device 23, an output device 24, a memory 22 and a processor 21. The memory 22 is configured to store one or more programs; when the one or more programs are executed by the one or more processors 21, the one or more processors 21 implement the text-driven mouth shape animation generation method provided in the above embodiments. The input device 23, the output device 24, the memory 22 and the processor 21 may be connected by a bus or in other ways; in FIG. 2 they are connected by a bus.
The memory 22, as a readable storage medium of the computing device, can be used to store software programs and computer-executable programs, such as the program instructions corresponding to the text-driven mouth shape animation generation method of the embodiment of the invention. The memory 22 may mainly include a program storage area and a data storage area, where the program storage area may store the operating system and at least one application required for functions, and the data storage area may store data created according to the use of the device, etc. In addition, the memory 22 may include high-speed random access memory and may also include non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or another non-volatile solid-state storage device. In some examples, the memory 22 may further include memory located remotely from the processor 21, which may be connected to the device via a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The input device 23 can be used to receive input numeric or character information and to generate key signal inputs related to user settings and function control of the device; the output device 24 may include a display device such as a display screen.
The processor 21 executes various functional applications of the apparatus and data processing by running software programs, instructions and modules stored in the memory 22, i.e., implements the above-described text-driven-based mouth-shape animation generation method.
The computer device provided by this embodiment can be used to execute the above text-driven mouth shape animation generation method, and has corresponding functions and beneficial effects.
Embodiments of the present invention also provide a storage medium containing computer-executable instructions which, when executed by a computer processor, perform the text-driven mouth shape animation generation method provided by the above embodiments. The storage medium may be any of various types of memory devices or storage devices, including: installation media such as CD-ROM, floppy disk or tape devices; computer system memory or random access memory such as DRAM, DDR RAM, SRAM, EDO RAM, Rambus RAM, etc.; non-volatile memory such as flash memory or magnetic media (e.g., hard disk or optical storage); and registers or other similar types of memory elements, etc. The storage medium may also include other types of memory or combinations thereof. In addition, the storage medium may be located in the first computer system in which the program is executed, or in a different, second computer system connected to the first computer system through a network (such as the Internet); the second computer system may provide the program instructions to the first computer for execution. The storage medium may include two or more storage media that reside in different locations (e.g., in different computer systems connected by a network). The storage medium may store program instructions (e.g., embodied as a computer program) executable by one or more processors.
Of course, the storage medium containing computer-executable instructions provided by the embodiments of the present invention is not limited to the operations described above, and may also perform related operations in the text-driven mouth shape animation generation method provided by any embodiment of the present invention.
Thus far, the technical solution of the present invention has been described in connection with the preferred embodiments shown in the drawings, but it is easily understood by those skilled in the art that the scope of protection of the present invention is not limited to these specific embodiments. Equivalent modifications and substitutions for related technical features may be made by those skilled in the art without departing from the principles of the present invention, and such modifications and substitutions will be within the scope of the present invention.
The foregoing description is only of the preferred embodiments of the invention and is not intended to limit the invention; various modifications and variations of the present invention will be apparent to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (9)

1. A method for generating mouth shape animation based on text driving, which is characterized by comprising the following steps:
S1, defining a mouth shape action data set of a digital virtual human, and establishing a pre-training model (Pre-Training Model);
S2, inputting a text sequence of a language text, and carrying out phoneme recognition on the text sequence;
S3, mapping the relationship between phonemes and mouth shape actions through the pre-training model, and outputting a mouth shape action frame sequence;
S4, synthesizing continuous-frame mouth shape animation from the mouth shape action frame sequence, and performing linear interpolation of animation frames between the mouth shape action frames, so that the text length remains consistent with the animation length and the language text stays synchronized with the mouth shape animation.
2. The method for generating mouth shape animation based on text driving according to claim 1, wherein the method of inputting the text sequence of the language text in step S2 comprises:
describing the entered text as a continuous text sequence.
3. The method for generating mouth shape animation based on text driving according to claim 1, wherein the phoneme processing method in step S2 comprises:
using a phoneme analysis tool to obtain the corresponding phoneme sequence from the input features, and obtaining the timing information of the phoneme sequence within the language text.
4. The method for generating mouth shape animation based on text driving according to claim 3, wherein the input features include:
sentence meaning features, character semantic features, grammar features, and polyphone features.
5. The method for generating mouth shape animation based on text driving according to claim 1, wherein the phoneme recognition in step S2 comprises:
phoneme processing and time processing.
6. The method for generating mouth shape animation based on text driving according to claim 1, wherein the mouth shape animation synthesis method in step S4 comprises:
synthesizing each sentence in the mouth shape action frame sequence into one animation, so that one animation corresponds to the complete phoneme sequence of that sentence; at the same time, with reference to the duration of the language text and the frame skipping between adjacent phoneme animations in the phoneme sequence, supplementing animation frames by linear interpolation; and finally outputting the mouth shape animation.
7. A system for generating mouth shape animation based on text driving, characterized in that it executes the method for generating mouth shape animation based on text driving according to any one of claims 1-6 and comprises:
a pre-training model module: used for defining a mouth shape action data set of a digital virtual human and establishing a pre-training model (Pre-Training Model);
a phoneme recognition module: used for inputting a text sequence of the language text and performing phoneme recognition on the text sequence;
a mouth shape action frame sequence module: used for mapping the relationship between phonemes and mouth shape actions through the pre-training model and outputting a mouth shape action frame sequence;
a mouth shape animation synthesis module: used for synthesizing continuous-frame mouth shape animation from the mouth shape action frame sequence and performing linear interpolation of animation frames between the mouth shape action frames, so that the text length remains consistent with the animation length and the language text stays synchronized with the mouth shape animation.
8. A computer-readable storage medium on which a computer program is stored, characterized in that, when the program is executed by a processor, the steps of the method for generating mouth shape animation based on text driving according to any one of claims 1-6 are implemented.
9. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the steps of the method for generating mouth shape animation based on text driving according to any one of claims 1-6 when executing the program.
CN202310045379.7A 2023-01-30 2023-01-30 Mouth shape animation generation method and system based on text driving Pending CN116309984A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310045379.7A CN116309984A (en) 2023-01-30 2023-01-30 Mouth shape animation generation method and system based on text driving

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310045379.7A CN116309984A (en) 2023-01-30 2023-01-30 Mouth shape animation generation method and system based on text driving

Publications (1)

Publication Number Publication Date
CN116309984A true CN116309984A (en) 2023-06-23

Family

ID=86831338

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310045379.7A Pending CN116309984A (en) 2023-01-30 2023-01-30 Mouth shape animation generation method and system based on text driving

Country Status (1)

Country Link
CN (1) CN116309984A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117095672A (en) * 2023-07-12 2023-11-21 支付宝(杭州)信息技术有限公司 Digital human lip shape generation method and device
CN117173292A (en) * 2023-09-07 2023-12-05 河北日凌智能科技有限公司 Digital human interaction method and device based on vowel slices


Similar Documents

Publication Publication Date Title
Cudeiro et al. Capture, learning, and synthesis of 3D speaking styles
Brand Voice puppetry
Deng et al. Expressive facial animation synthesis by learning speech coarticulation and expression spaces
Cao et al. Expressive speech-driven facial animation
US6735566B1 (en) Generating realistic facial animation from speech
Chuang et al. Mood swings: expressive speech animation
US8224652B2 (en) Speech and text driven HMM-based body animation synthesis
JP3664474B2 (en) Language-transparent synthesis of visual speech
Fan et al. A deep bidirectional LSTM approach for video-realistic talking head
US20120130717A1 (en) Real-time Animation for an Expressive Avatar
CN116309984A (en) Mouth shape animation generation method and system based on text driving
CN113077537A (en) Video generation method, storage medium and equipment
Wang et al. HMM trajectory-guided sample selection for photo-realistic talking head
Ma et al. Accurate automatic visible speech synthesis of arbitrary 3D models based on concatenation of diviseme motion capture data
Massaro et al. A multilingual embodied conversational agent
Filntisis et al. Video-realistic expressive audio-visual speech synthesis for the Greek language
Deng et al. Expressive Speech Animation Synthesis with Phoneme‐Level Controls
Li et al. A novel speech-driven lip-sync model with CNN and LSTM
Liu et al. Real-time speech-driven animation of expressive talking faces
Kshirsagar et al. Multimodal animation system based on the MPEG-4 standard
Filntisis et al. Photorealistic adaptation and interpolation of facial expressions using HMMS and AAMS for audio-visual speech synthesis
Busso et al. Learning expressive human-like head motion sequences from speech
Kolivand et al. Realistic lip syncing for virtual character using common viseme set
Verma et al. Animating expressive faces across languages
Altarawneh et al. Leveraging Cloud-based Tools to Talk with Robots.

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination