CN116912377A - Interactive multi-mode stylized two-dimensional digital face animation generation method - Google Patents

Interactive multi-mode stylized two-dimensional digital face animation generation method

Info

Publication number
CN116912377A
Authority
CN
China
Prior art keywords
trained
digital
model
dimensional digital
preset
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310886513.6A
Other languages
Chinese (zh)
Inventor
周颖杰
陈耀栋
付一帆
林坤杰
刘辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China University of Mining and Technology CUMT
Original Assignee
China University of Mining and Technology CUMT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China University of Mining and Technology CUMT filed Critical China University of Mining and Technology CUMT
Priority to CN202310886513.6A priority Critical patent/CN116912377A/en
Publication of CN116912377A publication Critical patent/CN116912377A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T13/00 Animation
    • G06T13/80 2D [Two Dimensional] animation, e.g. using sprites
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00 Geometric image transformation in the plane of the image
    • G06T3/40 Scaling the whole image or part thereof
    • G06T3/4053 Super resolution, i.e. output image resolution higher than sensor resolution
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination

Abstract

The present disclosure provides an interactive multi-modal stylized two-dimensional digital face animation generation method. The method comprises: acquiring preset voice data, preset image data and interactive input text of a user; determining the synthesized voice of a digital person according to the preset voice data and the interactive input text of the user; determining, according to the preset image data, the person appearance image corresponding to the digital person at a preset age group; determining a two-dimensional digital face animation with audio according to the person appearance image corresponding to the digital person at the preset age group and the synthesized voice of the digital person; and sequentially performing stylization processing and super-resolution processing on the two-dimensional digital face animation with audio, then synthesizing the result with the synthesized voice of the digital person to determine the multi-modal stylized two-dimensional digital face animation. With the method, a two-dimensional digital face animation can be generated simply, conveniently and effectively, can be presented to the user in various styles, and the quality of the user experience is improved.

Description

Interactive multi-mode stylized two-dimensional digital face animation generation method
Technical Field
The invention relates to the technical field of computer vision, in particular to an interactive multi-mode stylized two-dimensional digital face animation generation method.
Background
With the rapid development of computer graphics and artificial intelligence, digital person technology has advanced greatly. Digital persons feature vivid appearance, realistic motion and intelligent interaction, and have been integrated into people's daily lives in fields such as film and television, medical care and entertainment.
Classified by the data structure used to generate them, digital persons include two-dimensional digital persons and three-dimensional digital persons. Although a three-dimensional digital person can restore a person's appearance and details more realistically and comprehensively, its data structure is dense and complex, and the technology is less mature than that of two-dimensional digital persons. In daily life, the image of a digital person is usually presented to the user through two-dimensional media such as pictures and videos, and even a three-dimensional digital person is typically presented as two-dimensional media after additional rendering, so two-dimensional digital persons play a significant role in the digital person field.
Existing digital person generation technology has a complex design flow, consumes a large amount of human resources and time, and is inefficient in development and design; in addition, its single, unchanging video style and insufficiently clear image quality degrade the user experience.
Disclosure of Invention
Aiming at the defects in the prior art, the object of the present disclosure is to provide an interactive multi-mode stylized two-dimensional digital face animation generation method.
To achieve the above object, according to a first aspect of the present invention, there is provided an interactive multi-modal stylized two-dimensional digital face animation generating method, including:
acquiring preset voice data, preset image data and interactive input text of a user;
determining the synthetic voice of a digital person according to the preset voice data and the interactive input text of the user;
inputting the preset image data into a pre-trained age conversion model, and determining a figure appearance image corresponding to the digital person in a preset age range;
inputting a figure appearance image corresponding to the digital person at a preset age range and a synthetic voice of the digital person into a pre-trained driving model to determine a two-dimensional digital face animation with audio;
inputting the two-dimensional digital facial animation with the audio into a pre-trained portrait cartoon model for stylizing treatment, and determining the two-dimensional digital facial animation with the preset style;
inputting the two-dimensional digital facial animation with the preset style into a pre-trained video super-division model for super-resolution processing, and determining the two-dimensional digital facial animation subjected to super-resolution processing;
and synthesizing the super-resolution processed two-dimensional digital facial animation and the synthesized voice of the digital person to determine the multi-mode stylized two-dimensional digital facial animation.
Optionally, the determining the synthesized voice of the digital person according to the preset voice data and the interactive input text of the user includes:
inputting the interactive input text of the user into a pre-trained language model, and determining the interactive text of the user and the pre-trained language model, wherein the interactive text comprises the interactive input text of the user and a response text of the pre-trained language model;
inputting the preset voice data into a pre-trained sound cloning model, and determining the sound characteristics of the digital person;
and inputting the sound characteristics of the digital person and the interactive text into the pre-trained sound cloning model to determine the synthetic voice of the digital person.
Optionally, the inputting the interactive input text of the user into a pre-trained language model, determining the interactive text of the user and the pre-trained language model includes:
R_n = Chat(POST(URL, [T_n, TR_{n-1}, …, TR_1])), TR_n = [T_n, R_n]
wherein POST(URL, [T_n, TR_{n-1}, …, TR_1]) represents the POST access request sent by the local host to the pre-trained language model, URL represents the port access IP address of the pre-trained language model, [T_n, TR_{n-1}, …, TR_1] represents the request body of the POST access request, T_n represents the interactive input text of the user in the n-th round of interaction, R_n represents the response text output by the pre-trained language model for the user's interactive input text in the n-th round of interaction, TR_n represents the interactive text of the user and the pre-trained language model in the n-th round of interaction, and Chat represents the process by which the pre-trained language model responds to the POST access request in the n-th round of interaction.
Optionally, the inputting the preset voice data into a pre-trained sound cloning model to determine the sound characteristics of the digital person includes:
F_A = f(A)
wherein F_A represents the sound features of the digital person, A represents the preset voice data, and f represents the sound feature extraction operation performed with the pre-trained sound cloning model.
Optionally, inputting the sound features of the digital person and the interactive text into the pre-trained sound clone model to determine the synthetic speech of the digital person, including:
A_syn = Mock(F_A, TR_n)
wherein A_syn represents the synthesized speech of the digital person, and Mock represents the speech synthesis operation performed with the sound cloning model.
Optionally, the inputting the preset image data into a pre-trained age conversion model, and determining the figure appearance image corresponding to the digital person in the preset age range includes:
P = SAM(I)
P = [p_1, p_2, p_3, …, p_k]
wherein P represents the set of person appearance images corresponding to the digital person at each age group, I represents the preset image data, p_k represents the person appearance image corresponding to the k-th age group, and SAM represents the age conversion operation performed with the pre-trained age conversion model.
Optionally, the step of inputting the character image corresponding to the digital person in the preset age range and the synthetic voice of the digital person into a pre-trained driving model to determine a two-dimensional digital face animation with audio, including:
V = Drive(p, A_syn), p ∈ P
wherein p represents the person appearance image corresponding to the digital person at the preset age group, A_syn represents the synthesized speech of the digital person, V represents the two-dimensional digital face animation with audio, and Drive represents the driving operation performed with the pre-trained driving model.
Optionally, the inputting the two-dimensional digital facial animation with audio into a pre-trained portrait cartoon model for stylizing processing, and determining the two-dimensional digital facial animation with a preset style includes:
V_S = Net(V, S)
wherein S represents the preset style type set, V_S represents the two-dimensional digital face animation with the preset style, and Net represents the stylization processing operation performed with the pre-trained portrait cartoonization model.
Optionally, the inputting the two-dimensional digital facial animation with the preset style into a pre-trained video super-division model to perform super-resolution processing, and determining the two-dimensional digital facial animation after super-resolution processing includes:
V_{VSR-S} = VSR(V_S)
wherein V_{VSR-S} represents the super-resolution processed two-dimensional digital face animation, and VSR represents the super-resolution processing operation performed with the pre-trained video super-division model.
Optionally, the synthesizing the super-resolution processed two-dimensional digital face animation and the synthesized voice of the digital person to determine a multi-mode stylized two-dimensional digital face animation includes:
V_o = V_{VSR-S} ⊕ A_syn
wherein V_o represents the multi-modal stylized two-dimensional digital face animation, A_syn represents the synthesized speech of the digital person, and ⊕ represents the synthesis processing operation.
Compared with the prior art, the embodiment of the invention has at least one of the following beneficial effects:
Through the above technical solution, a two-dimensional digital face animation with audio is synthesized from the preset voice data, the preset image data and the interactive input text of the user by using the pre-trained age conversion model and the pre-trained driving model, so that the two-dimensional digital person animation can be generated step by step and interaction between the user and the two-dimensional digital person can be realized effectively. The two-dimensional digital person animation is stylized by the pre-trained portrait cartoonization model, so that the two-dimensional digital face is presented to the user in various styles; the pre-trained video super-division model then performs super-resolution processing on the two-dimensional digital face animation, which improves its clarity and the quality of the user experience.
Drawings
Other features, objects and advantages of the present invention will become more apparent upon reading of the detailed description of non-limiting embodiments, given with reference to the accompanying drawings in which:
FIG. 1 is a flowchart illustrating a method for interactive multimodal stylized two-dimensional digital facial animation generation, according to an exemplary embodiment.
Fig. 2 is a flow chart illustrating a method of determining digital human synthesized speech according to an exemplary embodiment.
Detailed Description
The present invention will be described in detail with reference to specific examples. The following examples will assist those skilled in the art in further understanding the present invention, but are not intended to limit the invention in any way. It should be noted that variations and modifications could be made by those skilled in the art without departing from the inventive concept. These are all within the scope of the present invention.
FIG. 1 is a flowchart illustrating a method for interactive multimodal stylized two-dimensional digital facial animation generation, according to an exemplary embodiment. As shown in FIG. 1, an interactive multi-mode stylized two-dimensional digital face animation generation method comprises S11 to S17.
S11, acquiring preset voice data, preset image data and interactive input text of a user.
S12, determining the synthesized voice of the digital person according to the preset voice data and the interactive input text of the user.
S13, inputting preset image data into a pre-trained age conversion model, and determining the figure image corresponding to the digital person in the preset age range.
S14, inputting the character image corresponding to the digital person in the preset age range and the synthetic voice of the digital person into a pre-trained driving model, and determining the two-dimensional digital face animation with the audio.
S15, inputting the two-dimensional digital facial animation with the audio into a pre-trained portrait cartoon model for stylizing processing, and determining the two-dimensional digital facial animation with the preset style.
S16, inputting the two-dimensional digital facial animation with the preset style into a pre-trained video super-division model for super-resolution processing, and determining the two-dimensional digital facial animation subjected to the super-resolution processing.
S17, synthesizing the two-dimensional digital facial animation subjected to super-resolution processing and the synthesized voice of the digital person, and determining the multi-mode stylized two-dimensional digital facial animation.
Through the above technical solution, a two-dimensional digital face animation with audio is synthesized from the preset voice data, the preset image data and the interactive input text of the user by using the pre-trained age conversion model and the pre-trained driving model, so that the two-dimensional digital person animation can be generated step by step and interaction between the user and the two-dimensional digital person can be realized effectively. The two-dimensional digital person animation is stylized by the pre-trained portrait cartoonization model, so that the two-dimensional digital face is presented to the user in various styles; the pre-trained video super-division model then performs super-resolution processing on the two-dimensional digital face animation, which improves its clarity and the quality of the user experience.
In some possible embodiments, in S11 of the present disclosure, preset voice data, preset image data, and interactive input text of the user are acquired.
The preset voice data may be voice data from the network or actually collected voice data, and the preset image data may be person images from the network or actually collected ones. The preset voice data is used to synthesize the voice of the digital person, and the preset image data is used to synthesize the person appearance of the digital person.
The interactive input text of the user is the text content input by the user in the current round of interaction with the digital person, where one round of interaction consists of the user asking and the digital person replying, or the digital person asking and the user replying.
Fig. 2 is a flow chart illustrating a method of determining digital human synthesized speech according to an exemplary embodiment.
As shown in fig. 2, in some possible embodiments, in S12 of the present disclosure, determining the synthesized voice of the digital person according to the preset voice data and the interactive input text of the user may include S21 to S23.
S21, inputting the interactive input text of the user into the pre-trained language model, and determining the interactive text of the user and the pre-trained language model.
Wherein the interactive text comprises interactive input text of the user and response text of the pre-trained language model.
In the present disclosure, the pre-trained language model may employ a ChatGLM language model, which may be deployed in advance on a local host. When the user interacts with the ChatGLM language model, the response text output by the ChatGLM language model relative to the interactive input text of the user is used as the response text of the digital person.
In one possible embodiment, S21 further includes:
R_n = Chat(POST(URL, [T_n, TR_{n-1}, …, TR_1])), TR_n = [T_n, R_n]
wherein POST(URL, [T_n, TR_{n-1}, …, TR_1]) represents the POST access request sent by the local host to the pre-trained language model, URL represents the port access IP address of the pre-trained language model, [T_n, TR_{n-1}, …, TR_1] represents the request body of the POST access request, T_n represents the interactive input text of the user in the n-th round of interaction, R_n represents the response text output by the pre-trained language model for the user's interactive input text in the n-th round of interaction, TR_n represents the interactive text of the user and the pre-trained language model in the n-th round of interaction, and Chat represents the process by which the pre-trained language model responds to the POST access request in the n-th round of interaction.
The local host sends a POST access request to the open port access IP address URL of the pre-deployed ChatGLM language model; the request body of the POST access request is [T_n, TR_{n-1}, …, TR_1]. The POST access request comprises a field "sample" and a field "history": the field "sample" carries the interactive input text T_n of the user in the current round of interaction, and the field "history" carries the interactive texts [TR_{n-1}, …, TR_1] of the n-1 rounds of interaction prior to the current round.
The ChatGLM language model outputs the response text R_n according to the POST access request. The response text R_n serves as the reply text of the digital person, and R_n, together with the interactive input text T_n of the user, forms a new element TR_n of the field "history" for use in the (n+1)-th round of interaction between the user and the ChatGLM language model.
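In one possible implementation, this request flow can be sketched as follows in Python. The sketch assumes a locally deployed ChatGLM-style HTTP service; the URL and the name of the reply field ("response") are assumptions, while the request fields "sample" and "history" follow the description above.

```python
import requests

URL = "http://127.0.0.1:8000"  # assumed port access IP address of the locally deployed language model

def chat_round(t_n, history):
    """Send the current-round user text T_n together with the previous interactive
    texts, and return the response text R_n plus the updated history for round n+1."""
    body = {"sample": t_n, "history": history}       # field names as described above
    resp = requests.post(URL, json=body, timeout=60)
    resp.raise_for_status()
    r_n = resp.json()["response"]                    # assumed name of the reply field
    return r_n, history + [[t_n, r_n]]               # TR_n = [T_n, R_n] appended to "history"

# usage sketch: two rounds of interaction with the digital person
history = []
reply, history = chat_round("Hello, please introduce yourself.", history)
reply, history = chat_round("What styles can you show me?", history)
```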
S22, inputting preset voice data into a pre-trained sound cloning model, and determining sound characteristics of the digital person.
The pre-trained sound cloning model may adopt the MockingBird sound cloning model: the preset voice data is input into the MockingBird sound cloning model for a sound feature extraction operation, the sound features of the preset voice data are extracted, and these sound features are taken as the sound features of the digital person.
In some possible embodiments, S22 comprises:
F_A = f(A)
wherein F_A represents the sound features of the digital person, A represents the preset voice data, and f represents the sound feature extraction operation performed with the pre-trained sound cloning model.
S23, inputting the sound characteristics of the digital person and the interactive text into a pre-trained sound cloning model to determine the synthetic voice of the digital person.
The pre-trained sound cloning model can also adopt the MockingBird sound cloning model.
In the present disclosure, the synthesized speech of the digital person is the response text output by the ChatGLM language model in the interactive text, uttered with the determined sound features of the digital person.
In some possible embodiments, S23 includes:
A_syn = Mock(F_A, TR_n)
wherein A_syn represents the synthesized speech of the digital person, and Mock represents the speech synthesis operation performed with the sound cloning model.
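The two formulas F_A = f(A) and A_syn = Mock(F_A, TR_n) can be read as a two-stage interface. The sketch below uses hypothetical wrapper functions extract_voice_feature and clone_speech standing in for the sound cloning model's encoder and synthesizer; the actual MockingBird interfaces differ and would be wired in where the placeholders raise.

```python
import numpy as np

def extract_voice_feature(preset_audio: np.ndarray, sample_rate: int) -> np.ndarray:
    """F_A = f(A): placeholder for the speaker-encoder call of the sound cloning
    model, which maps the preset voice data A to a fixed-length voice embedding."""
    raise NotImplementedError("wrap the pre-trained sound cloning encoder here")

def clone_speech(voice_feature: np.ndarray, reply_text: str) -> np.ndarray:
    """A_syn = Mock(F_A, TR_n): placeholder for the synthesizer and vocoder call
    that utters the response text with the cloned sound features."""
    raise NotImplementedError("wrap the pre-trained sound cloning synthesizer here")

# intended data flow for S22/S23 (preset_audio and reply_text come from S11 and S21)
# f_a = extract_voice_feature(preset_audio, sample_rate=16000)
# a_syn = clone_speech(f_a, reply_text)
```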
Through the above technical solution, interaction between the user and the digital person is realized through interaction between the user and the pre-trained language model; in each round of interaction, the interactive texts of the previous rounds are referred to, which improves the accuracy of the response text output by the pre-trained language model in the current round.
In some possible embodiments, in S13 of the present disclosure, inputting the preset image data into the pre-trained age conversion model, determining the person appearance image corresponding to the digital person at the preset age may include:
P = SAM(I)
P = [p_1, p_2, p_3, …, p_k]
wherein P represents the set of person appearance images corresponding to the digital person at each age group, I represents the preset image data, p_k represents the person appearance image corresponding to the k-th age group, and SAM represents the age conversion operation performed with the pre-trained age conversion model.
In the present disclosure, the pre-trained age conversion model may use the SAM age conversion model, which generates person appearance images of each age group from the preset image data. The age groups may be set according to actual requirements; for example, they may be divided at five-year intervals. Following the above notation, the digital person corresponds to a person appearance image set P containing k person appearance images, one per age group.
First, the preset image data is input into the SAM age conversion model, which outputs a person appearance image data set covering each age group divided at the preset age interval; this data set is taken as the person appearance images corresponding to the digital person at each age group.
Second, the person appearance image corresponding to the preset age group is selected from the person appearance image data set of the digital person as the person appearance image of the digital person for the current round of interaction.
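A minimal sketch of this two-step selection is given below; age_convert is a hypothetical wrapper around the pre-trained age conversion model, and the five-year age interval follows the example above.

```python
from pathlib import Path
from typing import List

AGE_GROUPS = list(range(0, 81, 5))   # example age groups at five-year intervals

def age_convert(preset_image: Path, ages: List[int]) -> List[Path]:
    """P = SAM(I): placeholder for the age conversion model, which renders the
    preset portrait once per requested age group and returns one image per group."""
    raise NotImplementedError("wrap the pre-trained age conversion model here")

def select_for_preset_age(portraits: List[Path], preset_age: int) -> Path:
    """Pick p in P: choose the person appearance image whose age group is closest
    to the preset age for the current round of interaction."""
    idx = min(range(len(AGE_GROUPS)), key=lambda i: abs(AGE_GROUPS[i] - preset_age))
    return portraits[idx]

# usage sketch
# portraits = age_convert(Path("preset_face.png"), AGE_GROUPS)
# p = select_for_preset_age(portraits, preset_age=25)
```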
Through the above technical solution, the image data is converted into person appearance images of all age groups, and the person appearance image of the appropriate age group is chosen as the image of the digital person, which effectively improves the fidelity of the digital person and the quality of the user experience.
In some possible embodiments, in S14 of the present disclosure, inputting a person appearance image corresponding to a digital person at a preset age and a synthesized voice of the digital person into a pre-trained driving model, determining a two-dimensional digital face animation with audio, including:
V = Drive(p, A_syn), p ∈ P
wherein p represents the person appearance image corresponding to the digital person at the preset age group, A_syn represents the synthesized speech of the digital person, V represents the two-dimensional digital face animation with audio, and Drive represents the driving operation performed with the pre-trained driving model.
Wherein, the pre-trained driving model can adopt a SadTalker driving model.
The person appearance image corresponding to the digital person at the preset age group and the synthesized voice of the digital person are input into the SadTalker driving model, which drives the two-dimensional digital face so that it is rendered as a two-dimensional digital face animation with audio.
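In one possible implementation, the driving step is invoked through the driving model's inference script. The sketch below shells out to such a script; the script name and flag names are modeled on the public SadTalker repository and may differ between versions.

```python
import subprocess
from pathlib import Path

def drive_face(source_image: Path, driven_audio: Path, result_dir: Path) -> None:
    """V = Drive(p, A_syn): animate the selected portrait in sync with the
    synthesized speech by calling the driving model's inference script.
    Script path and flag names are assumptions and may need adjusting."""
    subprocess.run(
        ["python", "inference.py",
         "--source_image", str(source_image),
         "--driven_audio", str(driven_audio),
         "--result_dir", str(result_dir)],
        check=True,
    )

# drive_face(Path("p_preset_age.png"), Path("a_syn.wav"), Path("results"))
```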
According to the above technical solution, the pre-trained driving model turns the person appearance image of the digital person and the synthesized voice of the digital person into a two-dimensional digital face animation; the two-dimensional digital person is generated with simple steps, which improves the generation efficiency of the two-dimensional digital person.
In some possible embodiments, in S15 of the present disclosure, inputting a two-dimensional digital face animation with audio into a pre-trained portrait cartoonization model for stylizing processing, determining a two-dimensional digital face animation with a preset style includes:
V_S = Net(V, S)
wherein S represents the preset style type set, V_S represents the two-dimensional digital face animation with the preset style, and Net represents the stylization processing operation performed with the pre-trained portrait cartoonization model.
The pre-trained portrait cartoonization model of the present disclosure may adopt the DCT-Net model to perform style conversion on the two-dimensional digital face animation; the original audio of the two-dimensional digital face animation is discarded during the style conversion.
The preset style set S may include seven styles, namely animation style (animation), three-dimensional style (3d), hand-drawn style (handdraw), sketch style (sketch), art style (artstyle), design style (design) and illustration style (illustration). The DCT-Net model can convert the style of the two-dimensional digital face animation into at least one of these seven styles; after the stylization operation, a two-dimensional digital face animation with the preset style and without a soundtrack is obtained.
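Because the original audio is discarded, the stylization can be applied frame by frame and the frames re-encoded into a silent clip. In the sketch below, stylize_frame is a placeholder for the pre-trained portrait cartoonization (for example DCT-Net) inference call configured with the chosen preset style.

```python
import cv2

def stylize_video(in_path: str, out_path: str, stylize_frame) -> None:
    """V_S = Net(V, S): apply the portrait cartoonization model to every frame of
    the driven animation and write a soundtrack-free video in the preset style.
    `stylize_frame` takes and returns a BGR ndarray and wraps the actual model."""
    cap = cv2.VideoCapture(in_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0
    writer = None
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        styled = stylize_frame(frame)
        if writer is None:
            h, w = styled.shape[:2]
            fourcc = cv2.VideoWriter_fourcc(*"mp4v")
            writer = cv2.VideoWriter(out_path, fourcc, fps, (w, h))
        writer.write(styled)
    cap.release()
    if writer is not None:
        writer.release()
```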
Through the above technical solution, the style of the two-dimensional digital face animation is converted by the pre-trained portrait cartoonization model, so that it is presented to the user in diverse styles, which improves the quality of the user experience and can effectively increase the appeal of the interaction.
In some possible embodiments, in S16 of the present disclosure, inputting a two-dimensional digital face animation having a preset style into a pre-trained video super-division model to perform super-resolution processing, determining the super-resolution processed two-dimensional digital face animation includes:
V_{VSR-S} = VSR(V_S)
wherein V_{VSR-S} represents the super-resolution processed two-dimensional digital face animation, and VSR represents the super-resolution processing operation performed with the pre-trained video super-division model.
The pre-trained video super-division model may adopt the BasicVSR++ model; the super-resolution processed two-dimensional digital face animation remains a two-dimensional digital face animation without a soundtrack.
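A sketch of handing the soundtrack-free stylized clip to a recurrent video super-resolution model such as BasicVSR++ is given below; superresolve_video passes the whole frame sequence at once so that temporal information can be propagated, and vsr_model is a placeholder for the actual pre-trained model call.

```python
import numpy as np
from typing import Callable, List

def superresolve_video(frames: List[np.ndarray],
                       vsr_model: Callable[[np.ndarray], np.ndarray]) -> List[np.ndarray]:
    """V_{VSR-S} = VSR(V_S): run the pre-trained video super-division model over the
    stylized, soundtrack-free clip. `vsr_model` is a placeholder mapping a
    (T, H, W, C) uint8 clip to an upscaled (T, sH, sW, C) clip."""
    clip = np.stack(frames)        # (T, H, W, C): the whole sequence, not single frames
    upscaled = vsr_model(clip)     # assumed interface of the wrapped model
    return list(upscaled)
```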
Through the above technical solution, super-resolution processing is performed on the two-dimensional digital face animation with the preset style, which improves the clarity of the animation video and the user experience.
In some possible embodiments, in S17 of the present disclosure, synthesizing the super-resolution processed two-dimensional digital face animation and the synthesized voice of the digital person, determining the multi-modal stylized two-dimensional digital face animation includes:
V_o = V_{VSR-S} ⊕ A_syn
wherein V_o represents the multi-modal stylized two-dimensional digital face animation, A_syn represents the synthesized speech of the digital person, and ⊕ represents the synthesis processing operation.
The super-resolution processed two-dimensional digital face animation and the synthesized voice of the digital person may be combined with the program FFmpeg: in a mixing-and-cutting manner, the synthesized voice of the digital person is embedded at the corresponding position of the super-resolution processed two-dimensional digital face animation, thereby adding audio to it.
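A minimal FFmpeg invocation for this muxing step is shown below; the file names are illustrative. The video stream of the silent super-resolved clip is copied unchanged, the synthesized speech is encoded as AAC, and the output ends with the shorter of the two streams.

```python
import subprocess

def mux_audio(silent_video: str, speech_wav: str, out_path: str) -> None:
    """V_o = V_{VSR-S} ⊕ A_syn: embed the synthesized speech of the digital person
    into the soundtrack-free, super-resolution processed animation."""
    subprocess.run(
        ["ffmpeg", "-y",
         "-i", silent_video,    # super-resolved two-dimensional digital face animation
         "-i", speech_wav,      # synthesized speech of the digital person
         "-c:v", "copy",        # keep the video stream as-is
         "-c:a", "aac",         # encode the speech track
         "-shortest",           # end when the shorter stream ends
         out_path],
        check=True,
    )

# mux_audio("v_vsr_s.mp4", "a_syn.wav", "v_o.mp4")
```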
Through the above technical solution, audio is added to the soundtrack-free, super-resolution processed two-dimensional digital face animation, yielding the complete multi-mode stylized two-dimensional digital face animation, i.e., an animation video with audio, which can interact intelligently with the user in the visual, auditory and linguistic modalities.
Each of steps S11 to S17 of the present disclosure may be adopted independently or in combination with several other steps, and all such combinations fall within the scope of protection of the present disclosure.
In some possible embodiments, the following steps may also be employed to generate an interactive multimodal stylized two-dimensional digital facial animation.
First step: acquiring preset voice data, preset image data and interactive input text of a user.
Second step: inputting the interactive input text of the user into the pre-trained ChatGLM language model and outputting the interactive text of the user and the pre-trained language model.
Third step: inputting the preset voice data into the pre-trained MockingBird sound cloning model and extracting the sound features of the digital person.
Fourth step: inputting the sound features of the digital person and the interactive text into the MockingBird sound cloning model to determine the synthesized speech of the digital person.
Fifth step: inputting the preset image data into the pre-trained SAM age conversion model and outputting the person appearance image corresponding to the digital person at the preset age group.
Sixth step: inputting the person appearance image corresponding to the digital person at the preset age group and the synthesized speech of the digital person into the pre-trained SadTalker driving model, and outputting the two-dimensional digital face animation with audio.
Seventh step: inputting the two-dimensional digital face animation with audio into the pre-trained DCT-Net portrait cartoonization model for stylization processing, and determining the soundtrack-free two-dimensional digital face animation with the preset style.
Eighth step: inputting the soundtrack-free two-dimensional digital face animation with the preset style into the pre-trained BasicVSR++ video super-division model for super-resolution processing, and determining the soundtrack-free, super-resolution processed two-dimensional digital face animation.
Ninth step: synthesizing the soundtrack-free, super-resolution processed two-dimensional digital face animation and the synthesized speech of the digital person with the program FFmpeg, and determining the multi-mode stylized two-dimensional digital face animation, i.e., an animation video with audio.
With the interactive multi-mode stylized two-dimensional digital face animation generation method of the present disclosure, two-dimensional digital face animations were generated from person images and voice data acquired from the network and collected from real people, and the generated animations were scored subjectively by a number of subjects. The results show that the method can generate two-dimensional digital face animations simply, conveniently and effectively, supports intelligent interaction with the user, presents the generated animation in various styles with high video quality, and provides a good user experience.
The foregoing describes specific embodiments of the present invention. It is to be understood that the invention is not limited to the particular embodiments described above, and that various changes and modifications may be made by those skilled in the art within the scope of the claims without affecting the spirit of the invention. The above-described preferred features may be used in any combination provided they do not conflict with one another.

Claims (10)

1. An interactive multi-mode stylized two-dimensional digital facial animation generation method is characterized by comprising the following steps:
acquiring preset voice data, preset image data and interactive input text of a user;
determining the synthetic voice of a digital person according to the preset voice data and the interactive input text of the user;
inputting the preset image data into a pre-trained age conversion model, and determining a figure appearance image corresponding to the digital person in a preset age range;
inputting a figure appearance image corresponding to the digital person at a preset age range and a synthetic voice of the digital person into a pre-trained driving model to determine a two-dimensional digital face animation with audio;
inputting the two-dimensional digital facial animation with the audio into a pre-trained portrait cartoon model for stylizing treatment, and determining the two-dimensional digital facial animation with the preset style;
inputting the two-dimensional digital facial animation with the preset style into a pre-trained video super-division model for super-resolution processing, and determining the two-dimensional digital facial animation subjected to super-resolution processing;
and synthesizing the super-resolution processed two-dimensional digital facial animation and the synthesized voice of the digital person to determine the multi-mode stylized two-dimensional digital facial animation.
2. The method of claim 1, wherein determining the synthesized voice of the digital person according to the preset voice data and the interactive input text of the user comprises:
inputting the interactive input text of the user into a pre-trained language model, and determining the interactive text of the user and the pre-trained language model, wherein the interactive text comprises the interactive input text of the user and a response text of the pre-trained language model;
inputting the preset voice data into a pre-trained sound cloning model, and determining the sound characteristics of the digital person;
and inputting the sound characteristics of the digital person and the interactive text into the pre-trained sound cloning model to determine the synthetic voice of the digital person.
3. The method of claim 2, wherein the inputting the interactive input text of the user into the pre-trained language model, determining the interactive text of the user and the pre-trained language model, comprises:
R_n = Chat(POST(URL, [T_n, TR_{n-1}, …, TR_1])), TR_n = [T_n, R_n]
wherein POST(URL, [T_n, TR_{n-1}, …, TR_1]) represents the POST access request sent by the local host to the pre-trained language model, URL represents the port access IP address of the pre-trained language model, [T_n, TR_{n-1}, …, TR_1] represents the request body of the POST access request, T_n represents the interactive input text of the user in the n-th round of interaction, R_n represents the response text output by the pre-trained language model for the user's interactive input text in the n-th round of interaction, TR_n represents the interactive text of the user and the pre-trained language model in the n-th round of interaction, and Chat represents the process by which the pre-trained language model responds to the POST access request in the n-th round of interaction.
4. A method according to claim 3, wherein said inputting said pre-set speech data into a pre-trained voice clone model to determine the voice characteristics of said digital person comprises:
F_A = f(A)
wherein F_A represents the sound features of the digital person, A represents the preset voice data, and f represents the sound feature extraction operation performed with the pre-trained sound cloning model.
5. The method of claim 4, wherein inputting the voice features of the digital person and the interactive text into the pre-trained voice clone model determines synthesized speech of the digital person, comprising:
A_syn = Mock(F_A, TR_n)
wherein A_syn represents the synthesized speech of the digital person, and Mock represents the speech synthesis operation performed with the sound cloning model.
6. The method of claim 1, wherein the inputting the pre-set image data into a pre-trained age conversion model to determine the person appearance image corresponding to the digital person at the pre-set age range comprises:
P = SAM(I)
P = [p_1, p_2, p_3, …, p_k]
wherein P represents the set of person appearance images corresponding to the digital person at each age group, I represents the preset image data, p_k represents the person appearance image corresponding to the k-th age group, and SAM represents the age conversion operation performed with the pre-trained age conversion model.
7. The method of claim 6, wherein the inputting the character image corresponding to the digital person at the preset age and the synthesized voice of the digital person into the pre-trained driving model, determining the two-dimensional digital face animation with audio, comprises:
V = Drive(p, A_syn), p ∈ P
wherein p represents the person appearance image corresponding to the digital person at the preset age group, A_syn represents the synthesized speech of the digital person, V represents the two-dimensional digital face animation with audio, and Drive represents the driving operation performed with the pre-trained driving model.
8. The method of claim 7, wherein the inputting the two-dimensional digital facial animation with audio into the pre-trained portrait cartoonization model for stylization processing, determining the two-dimensional digital facial animation with a preset style comprises:
V_S = Net(V, S)
wherein S represents the preset style type set, V_S represents the two-dimensional digital face animation with the preset style, and Net represents the stylization processing operation performed with the pre-trained portrait cartoonization model.
9. The method according to claim 8, wherein inputting the two-dimensional digital face animation with the preset style into the pre-trained video super-division model for super-resolution processing, and determining the two-dimensional digital face animation after super-resolution processing comprises:
V_{VSR-S} = VSR(V_S)
wherein V_{VSR-S} represents the super-resolution processed two-dimensional digital face animation, and VSR represents the super-resolution processing operation performed with the pre-trained video super-division model.
10. The method of claim 9, wherein synthesizing the super-resolution processed two-dimensional digital face animation and the synthesized speech of the digital person to determine a multi-modal stylized two-dimensional digital face animation comprises:
V_o = V_{VSR-S} ⊕ A_syn
wherein V_o represents the multi-modal stylized two-dimensional digital face animation, A_syn represents the synthesized speech of the digital person, and ⊕ represents the synthesis processing operation.
CN202310886513.6A 2023-07-19 2023-07-19 Interactive multi-mode stylized two-dimensional digital face animation generation method Pending CN116912377A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310886513.6A CN116912377A (en) 2023-07-19 2023-07-19 Interactive multi-mode stylized two-dimensional digital face animation generation method


Publications (1)

Publication Number Publication Date
CN116912377A true CN116912377A (en) 2023-10-20

Family

ID=88354522

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310886513.6A Pending CN116912377A (en) 2023-07-19 2023-07-19 Interactive multi-mode stylized two-dimensional digital face animation generation method

Country Status (1)

Country Link
CN (1) CN116912377A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination