CN115601230A - Digital human synthesis method based on multi-task learning - Google Patents
- Publication number: CN115601230A
- Application number: CN202211397710.3A
- Authority: CN (China)
- Prior art keywords: face, module, synthesis, synthesized, sequence
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06T3/04 — Geometric image transformations in the plane of the image; context-preserving transformations, e.g. by using an importance map
- G06N3/088 — Computing arrangements based on biological models; neural networks; learning methods; non-supervised learning, e.g. competitive learning
- G06T13/40 — 3D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings
- G06V40/165 — Recognition of human faces; detection, localisation, normalisation using facial parts and geometric relationships
- G06T2207/20081 — Indexing scheme for image analysis or image enhancement; special algorithmic details; training, learning
- G06T2207/20084 — Artificial neural networks [ANN]
Abstract
The invention discloses a digital human synthesis method based on multi-task learning, relating to the technical field of computers. The method comprises the following steps: acquiring video data containing a speaker, preprocessing the video data, and extracting the speaker's face and audio sequences; masking (MASK) the lower half of the face; and inputting the audio sequence and the masked face sequence into a face synthesis module to synthesize a face sequence. A face rendering module is provided to improve the clarity of the digital human face, and a voice and lip synchronization module is provided to improve the accuracy of voice-lip matching; the two modules interact to improve the quality of the synthesized digital human. During digital human synthesis, the synthesis parameter data of the face synthesis module are monitored and a synthesis deviation value is analyzed; if the synthesis deviation value exceeds a preset deviation threshold, a deviation early-warning signal is generated, reminding the administrator to switch to a new industrial computer for the computation, thereby improving the precision and efficiency of digital human synthesis.
Description
Technical Field
The invention relates to the technical field of computers, in particular to a digital human synthesis method based on multi-task learning.
Background
A digital human is a virtual simulation of the form and function of the human body at different levels, produced by information-science methods. Cartoon-style digital humans are built with computer-graphics techniques and, compared with real-person-style digital humans, offer fast rendering and a large adjustable parameter space.
However, anthropomorphic cartoon-style digital humans require a great deal of up-front design and modeling work, are costly, are unsuitable for serious delivery scenarios such as government-affairs settings, and offer a poor user experience. Real-person-style digital humans are realized mainly with computer-vision techniques: the face region of the digital human is generated by real-time rendering with a deep-learning model and fused with pre-recorded footage of the person, so that changes in the person's expression and mouth shape can be realized. However, the clarity of the rendered face region and the accuracy of the voice-lip matching greatly affect the presentation of real-person-style digital humans. Therefore, the invention provides a digital human synthesis method based on multi-task learning.
Disclosure of Invention
The present invention is directed to solving at least one of the problems of the prior art. To this end, the invention provides a digital human synthesis method based on multi-task learning, in which a face rendering module improves the clarity of the digital human face and a voice and lip synchronization module improves the accuracy of voice-lip matching; the two modules interact to improve the quality of the synthesized digital human.
To achieve the above object, an embodiment according to a first aspect of the present invention provides a digital human synthesis method based on multi-task learning, comprising:
Step one: acquire video data containing a speaker, preprocess the video data, and extract the speaker's face and audio sequences; face extraction may be performed with any face-detection model, and the audio sequence is extracted from the video with the ffmpeg tool;
Step two: preprocess the face sequence by masking (MASK) the lower half of each face, i.e. setting it to black; input the audio sequence and the masked face sequence into the face synthesis module to synthesize a face sequence;
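The masking in step two can be sketched as a minimal NumPy illustration, assuming images are H×W×C arrays (the function name is ours, not from the patent):

```python
import numpy as np

def mask_lower_half(face: np.ndarray) -> np.ndarray:
    """Black out the lower half of an (H, W, C) face image, as in step two."""
    masked = face.copy()
    h = masked.shape[0]
    masked[h // 2:] = 0  # set the lower-half rows to black
    return masked
```

The masked sequence is then paired with the audio sequence as input to the face synthesis module.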
Step three: the synthesized face sequence is optimized under the guidance of a voice and lip synchronization module and a face rendering module; the voice and lip synchronization module judges how well the lips of the synthesized face match the audio and guides the optimization of the face synthesis module; the face rendering module judges the rendering fidelity of the face and likewise guides the optimization of the face synthesis module;
Step four: during digital human synthesis, monitor the synthesis parameter data of the face synthesis module and analyze a synthesis deviation value; if the synthesis deviation value PL exceeds a preset deviation threshold, this indicates that the processing capability of the face synthesis module is low, and a deviation early-warning signal is generated to remind the administrator to switch to a new industrial computer for the computation.
Further, the voice and lip synchronization module works as follows:
Lip-related features are extracted from the audio sequence and the face sequence respectively, and their cosine similarity is calculated; the cosine similarity represents the degree to which the voice and the lip shape match.
The audio feature extractor may be a backbone network in the RNN, CNN, or Transformer paradigm; the lip feature extractor is a backbone network in the CNN or Transformer paradigm.
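The cosine-similarity score between the two feature vectors can be sketched as follows (a minimal NumPy version; the embeddings would come from the backbone networks named above):

```python
import numpy as np

def lip_sync_score(audio_feat: np.ndarray, lip_feat: np.ndarray) -> float:
    """Cosine similarity between an audio embedding and a lip embedding;
    higher values mean the voice and the lip shape match better."""
    denom = np.linalg.norm(audio_feat) * np.linalg.norm(lip_feat) + 1e-8
    return float(np.dot(audio_feat, lip_feat) / denom)
```

A score near 1 indicates well-synchronized voice and lips; a score near 0 indicates a mismatch.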
Further, the training data of the voice and lip synchronization module are real face pictures and synthesized face pictures: in the training stage, a real face picture is a positive sample and is given label 1, while a synthesized face picture is a negative sample and is given label 0; the backbone network is optimized with a cross-entropy loss, calculated as shown in equation 1:

L1 = -(1/N) * Σ_i [ y_i * log(σ_i) + (1 - y_i) * log(1 - σ_i) ]    (1)

where L1 is the cross-entropy loss of the voice and lip synchronization module, y_i is the label of the i-th sample, and σ_i is the predicted value of the i-th sample.
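Equation 1 is the standard binary cross-entropy over N samples, which can be sketched directly:

```python
import numpy as np

def cross_entropy_loss(y: np.ndarray, sigma: np.ndarray, eps: float = 1e-7) -> float:
    """Binary cross-entropy matching equation 1: y_i are the 0/1 labels,
    sigma_i the predicted probabilities; eps guards against log(0)."""
    sigma = np.clip(sigma, eps, 1.0 - eps)
    return float(-np.mean(y * np.log(sigma) + (1 - y) * np.log(1 - sigma)))
```

Perfect predictions give a loss near zero; predicting 0.5 everywhere gives ln 2 ≈ 0.693.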
Furthermore, the face rendering module is divided into three parts: the first and second parts directly calculate loss functions to guide the optimization of the face synthesis module; the third part judges whether a face is a real face; the face feature extractor is mainly a backbone network in the CNN or Transformer paradigm.
The pixel loss is calculated as shown in equation 2 and the perceptual loss as shown in equation 3:

L2 = ||I_0 - I_1||    (2)
L3 = ||φ(I_0) - φ(I_1)||    (3)

where I_0 is the real picture, I_1 is the synthesized picture, φ is a well-trained backbone network, and φ(·) denotes the feature map of the last layer of the backbone network; L2 is the pixel loss and L3 is the perceptual loss.
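Equations 2 and 3 can be sketched in NumPy; the exact norms are not specified in the source, so a mean absolute difference and a mean squared feature difference are assumed here:

```python
import numpy as np

def pixel_loss(i0: np.ndarray, i1: np.ndarray) -> float:
    """Equation 2 sketch: mean absolute pixel difference between the real
    picture I0 and the synthesized picture I1 (norm choice assumed)."""
    return float(np.mean(np.abs(i0 - i1)))

def perceptual_loss(i0: np.ndarray, i1: np.ndarray, phi) -> float:
    """Equation 3 sketch: distance between last-layer feature maps phi(I0)
    and phi(I1) of a trained backbone (here phi is any callable)."""
    return float(np.mean((phi(i0) - phi(i1)) ** 2))
```

In practice φ would be a pretrained network such as VGG19; a toy callable suffices to show the structure.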
Furthermore, the training data of the face rendering module are real face pictures and synthesized face pictures, where a real face picture has label 1 and a synthesized face picture has label 0; the backbone network is optimized with a cross-entropy loss, calculated as shown in equation 4:

L4 = -(1/N) * Σ_i [ y_i * log(σ_i) + (1 - y_i) * log(1 - σ_i) ]    (4)
further, the face synthesis module is used for generating a lip-shaped face corresponding to the audio frequency according to the given audio frequency sequence and the MASK face sequence; the synthetic human face module is of a standard Encoder-Decoder structure, the Encoder is formed by stacking convolution networks, and the Decoder is formed by stacking inverse convolution networks.
Further, the loss function consists of four parts: the cross-entropy loss of the voice and lip synchronization module, and the pixel loss, perceptual loss, and real-face-scoring cross-entropy loss of the face rendering module; the loss function is shown in equation 5:

L = w1*L1 + w2*L2 + w3*L3 + w4*L4    (5)

where w1 + w2 + w3 + w4 = 1.
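Equation 5 is a convex combination of the four losses; the equal weights in the sketch below are illustrative only, since the patent leaves the specific values to be tuned:

```python
def total_loss(l1: float, l2: float, l3: float, l4: float,
               w=(0.25, 0.25, 0.25, 0.25)) -> float:
    """Equation 5: L = w1*L1 + w2*L2 + w3*L3 + w4*L4, weights summing to 1."""
    assert abs(sum(w) - 1.0) < 1e-9, "weights must sum to 1"
    return sum(wi * li for wi, li in zip(w, (l1, l2, l3, l4)))
```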
Further, the synthesis deviation value PL is analyzed as follows:
During digital human synthesis, the synthesis parameter data of the face synthesis module are collected at intervals of R2; the number of access-node connections, the CPU load rate, the bandwidth load rate, and the real-time network rate are denoted W1, W2, W3, and W4 in turn, and a synthesis coefficient YS of the face synthesis module is calculated as YS = (W1*b1 + W4*b4) / (W2*b2 + W3*b3), where b1, b2, b3, and b4 are coefficient factors;
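The YS formula above translates directly to code (parameter names follow the patent's W1–W4 and b1–b4):

```python
def synthesis_coefficient(w1: float, w2: float, w3: float, w4: float,
                          b1: float, b2: float, b3: float, b4: float) -> float:
    """YS = (W1*b1 + W4*b4) / (W2*b2 + W3*b3): W1 = access-node connections,
    W2 = CPU load rate, W3 = bandwidth load rate, W4 = real-time network rate;
    b1..b4 are the coefficient factors."""
    return (w1 * b1 + w4 * b4) / (w2 * b2 + w3 * b3)
```

A low YS (high load rates in the denominator relative to the throughput terms in the numerator) triggers the synthesis deviation signal described next.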
If the synthesis coefficient is less than or equal to a preset synthesis threshold, a synthesis deviation signal is fed back to the control center.
When the control center detects that a synthesis deviation signal has been generated, a countdown is started automatically from D1, where D1 is a preset value; the synthesis deviation signal continues to be monitored during the countdown stage, and if a new synthesis deviation signal is detected, the countdown automatically resets to its original value and counts down again from D1.
The number of synthesis deviation signals occurring during the countdown stage is counted as C1, and the length of the countdown stage is recorded as D1; the synthesis deviation value PL of the face synthesis module is then calculated from C1 and D1, where g1 and g2 are coefficient factors.
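The countdown-and-deviation logic can be sketched as follows. The exact PL formula is not reproduced in the source, so a linear combination of C1 and D1 with the coefficient factors g1 and g2 is assumed here:

```python
class DeviationMonitor:
    """Tracks synthesis deviation signals over a countdown window."""

    def __init__(self, d1: int, g1: float, g2: float):
        self.d1, self.g1, self.g2 = d1, g1, g2
        self.countdown = 0
        self.c1 = 0  # deviation signals seen in the countdown stage

    def sample(self, deviation: bool) -> None:
        """Process one collected synthesis coefficient."""
        if deviation:
            self.countdown = self.d1  # (re)start the countdown from D1
            self.c1 += 1
        elif self.countdown > 0:
            self.countdown -= 1      # one sample collected: count down by one

    def deviation_value(self) -> float:
        """Assumed PL form: g1*C1 + g2*D1 (hypothetical; see lead-in)."""
        return self.g1 * self.c1 + self.g2 * self.d1
```

PL would then be compared against the preset deviation threshold to decide whether to raise the early-warning signal.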
Compared with the prior art, the invention has the following beneficial effects:
1. The voice and lip synchronization module extracts lip-related features from the audio sequence and the face sequence respectively, calculates their cosine similarity, judges how well the lips of the synthesized face match the audio, and guides the optimization of the face synthesis module, improving the matching of lip shape and audio; the face rendering module is divided into three parts, of which the first and second directly calculate loss functions and the third judges whether a face is a real face, so the module judges the rendering fidelity of the face and guides the optimization of the face synthesis module, improving the rendering effect; the two modules interact, improving the quality of the synthesized digital human;
2. During digital human synthesis, the synthesis parameter data of the face synthesis module are monitored and a synthesis deviation value is analyzed: the synthesis parameter data are collected at intervals of R2 and a synthesis coefficient YS of the face synthesis module is calculated; if YS is less than or equal to a preset synthesis threshold, a synthesis deviation signal is fed back to the control center; when the control center detects the synthesis deviation signal, a countdown is started automatically; the number of synthesis deviation signals in the countdown stage is counted as C1 and the length of the countdown stage is recorded as D1, from which the synthesis deviation value PL of the face synthesis module is calculated; if PL exceeds a preset deviation threshold, this indicates that the processing capability of the face synthesis module is low, and a deviation early-warning signal is generated, reminding the administrator to switch to a new industrial computer for the computation and improving the precision and efficiency of digital human synthesis.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description show only some embodiments of the present invention; for those skilled in the art, other drawings can be obtained from these drawings without creative effort.
Fig. 1 is a technical architecture diagram of the digital human synthesis method based on multi-task learning according to the present invention.
Fig. 2 is a technical architecture diagram of the voice and lip synchronization module of the present invention.
Fig. 3 is a technical architecture diagram of the face rendering module of the present invention.
Fig. 4 is a technical architecture diagram of the face synthesis module of the present invention.
Detailed Description
The technical solutions of the present invention will be described clearly and completely below with reference to the embodiments. It should be understood that the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person skilled in the art from the given embodiments without creative effort fall within the protection scope of the present invention.
As shown in figs. 1 to 4, a digital human synthesis method based on multi-task learning includes the following steps:
Step one: acquire video data containing a speaker, perform the necessary preprocessing on the video data, and extract the speaker's face and audio sequences; face extraction may be performed with any face-detection model, and the audio sequence is extracted from the video with the ffmpeg tool;
Step two: preprocess the face sequence by masking (MASK) the lower half of each face, i.e. setting it to black; input the audio sequence and the masked face sequence into the face synthesis module to synthesize a face sequence;
Step three: the synthesized face sequence is optimized under the guidance of the voice and lip synchronization module and the face rendering module respectively, improving the quality of face synthesis;
The voice and lip synchronization module judges how well the lips of the synthesized face match the audio and guides the optimization of the face synthesis module, improving the matching of lip shape and audio;
The face rendering module judges the rendering fidelity of the face and guides the optimization of the face synthesis module, improving the rendering effect;
In this embodiment, the technical architecture of the voice and lip synchronization module is as shown in fig. 2, and its specific steps are as follows:
Lip-related features are extracted from the audio sequence and the face sequence respectively, and their cosine similarity is calculated; this score represents the degree to which the voice and the lip shape match.
The audio feature extractor is not limited to one network: it may be a backbone network in the RNN, CNN, or Transformer paradigm; the lip feature extractor is mainly a backbone network in the CNN or Transformer paradigm.
The training data of the voice and lip synchronization module are real face pictures and synthesized face pictures: in the training stage, a real face picture is a positive sample and is given label 1, while a synthesized face picture is a negative sample and is given label 0; the backbone network is optimized with a cross-entropy loss, calculated as shown in equation 1:

L1 = -(1/N) * Σ_i [ y_i * log(σ_i) + (1 - y_i) * log(1 - σ_i) ]    (1)

where L1 is the cross-entropy loss of the voice and lip synchronization module, y_i is the label of the i-th sample, and σ_i is the predicted value of the i-th sample.
In this embodiment, the technical architecture of the face rendering module is as shown in fig. 3. It is divided into three parts; the first and second parts directly calculate loss functions to guide the optimization of the face synthesis module. The pixel loss is calculated as shown in equation 2 and the perceptual loss as shown in equation 3:

L2 = ||I_0 - I_1||    (2)
L3 = ||φ(I_0) - φ(I_1)||    (3)

where I_0 is the real picture, I_1 is the synthesized picture, φ is a well-trained backbone network such as VGG19 or ResNet50, and φ(·) is the feature map of the last layer of the backbone network; L2 is the pixel loss and L3 is the perceptual loss.
The third part mainly judges whether a face is a real face; the face feature extractor is mainly a backbone network in the CNN or Transformer paradigm. The training data of this part are real face pictures and synthesized face pictures, where a real face picture has label 1 and a synthesized face picture has label 0; this part is optimized with a cross-entropy loss, calculated as shown in equation 4:

L4 = -(1/N) * Σ_i [ y_i * log(σ_i) + (1 - y_i) * log(1 - σ_i) ]    (4)
In this embodiment, the face synthesis module is configured to generate faces with lip shapes corresponding to the audio, given an audio sequence and a masked face sequence; the technical architecture of the whole module is shown in fig. 4.
The face synthesis module has a standard Encoder-Decoder structure: the Encoder is a stack of convolutional networks and the Decoder is a stack of deconvolutional (transposed-convolutional) networks.
The optimization of the face synthesis module is guided by the voice and lip synchronization module and the face rendering module, and the loss function consists of four parts: the cross-entropy loss of the voice and lip synchronization module, and the pixel loss, perceptual loss, and real-face-scoring cross-entropy loss of the face rendering module.
The loss function is shown in equation 5:

L = w1*L1 + w2*L2 + w3*L3 + w4*L4    (5)

where w1 + w2 + w3 + w4 = 1, and the specific values of the weights are adjusted according to the actual situation. The technical architecture of the invention is a typical GAN (generative adversarial network): the generator is the face synthesis module, and the discriminators are the voice and lip synchronization module and the face rendering module. Training therefore follows the GAN paradigm: the generator is fixed while the discriminators are trained, and the process iterates over multiple loops while the generator is trained. In practical application, only the face synthesis module is used to synthesize the digital human.
This embodiment further includes: during digital human synthesis, the synthesis parameter data of the face synthesis module are monitored and a synthesis deviation value is analyzed, specifically:
During digital human synthesis, the synthesis parameter data of the face synthesis module are collected at intervals of R2; the synthesis parameter data comprise the number of access-node connections, the CPU load rate, the bandwidth load rate, and the real-time network rate.
The number of access-node connections, the CPU load rate, the bandwidth load rate, and the real-time network rate are denoted W1, W2, W3, and W4 in turn; a synthesis coefficient YS of the face synthesis module is calculated as YS = (W1*b1 + W4*b4) / (W2*b2 + W3*b3), where b1, b2, b3, and b4 are coefficient factors.
The synthesis coefficient YS is compared with a preset synthesis threshold; if YS is less than or equal to the preset synthesis threshold, a synthesis deviation signal is fed back to the control center.
When the control center detects that a synthesis deviation signal has been generated, a countdown is started automatically from D1, where D1 is a preset value, for example D1 = 10; each time a new synthesis coefficient is collected, the countdown decreases by one. The synthesis deviation signal continues to be monitored during the countdown stage: if a new synthesis deviation signal is detected, the countdown automatically resets to its original value and counts down again from D1; otherwise the countdown reaches zero and stops.
The number of synthesis deviation signals occurring during the countdown stage is counted as C1, and the length of the countdown stage is recorded as D1; the synthesis deviation value PL of the face synthesis module is then calculated from C1 and D1, where g1 and g2 are coefficient factors.
The synthesis deviation value PL is compared with a preset deviation threshold; if PL exceeds the threshold, this indicates that the processing capability of the face synthesis module is low, and a deviation early-warning signal is generated, reminding the administrator to switch to a new industrial computer for the computation and improving the precision and efficiency of digital human synthesis.
All of the above formulas operate on dimensionless numerical values. Each formula was obtained by collecting a large amount of data and fitting through software simulation so as to best approximate real conditions; the preset parameters and thresholds in the formulas are set by those skilled in the art according to the actual situation or obtained by simulation over large amounts of data.
The working principle of the invention is as follows:
In operation, the digital human synthesis method based on multi-task learning acquires video data containing a speaker, performs the necessary preprocessing on the video data, and extracts the speaker's face and audio sequences; the face sequence is preprocessed by masking (MASK) the lower half of each face, i.e. setting it to black; the audio sequence and the masked face sequence are input into the face synthesis module to synthesize a face sequence; the synthesized face sequence is optimized under the guidance of the voice and lip synchronization module and the face rendering module respectively, improving the quality of face synthesis. The voice and lip synchronization module extracts lip-related features from the audio sequence and the face sequence respectively, calculates their cosine similarity, judges how well the lips of the synthesized face match the audio, and guides the optimization of the face synthesis module, improving the matching of lip shape and audio. The face rendering module is divided into three parts: the first and second parts directly calculate loss functions, and the third part mainly judges whether a face is a real face; the module judges the rendering fidelity of the face and guides the optimization of the face synthesis module, improving the rendering effect. The two modules interact, improving the quality of the synthesized digital human.
During digital human synthesis, the synthesis parameter data of the face synthesis module are monitored and a synthesis deviation value is analyzed: the synthesis parameter data are collected at intervals of R2 and a synthesis coefficient YS of the face synthesis module is calculated; if YS is less than or equal to a preset synthesis threshold, a synthesis deviation signal is fed back to the control center; when the control center detects the synthesis deviation signal, a countdown is started automatically; the number of synthesis deviation signals in the countdown stage is counted as C1 and the length of the countdown stage is recorded as D1, from which the synthesis deviation value PL of the face synthesis module is calculated; if PL exceeds a preset deviation threshold, this indicates that the processing capability of the face synthesis module is low, and a deviation early-warning signal is generated, reminding the administrator to switch to a new industrial computer for the computation and improving the precision and efficiency of digital human synthesis.
In the description herein, references to the description of "one embodiment," "an example," "a specific example" or the like are intended to mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
The preferred embodiments of the invention disclosed above are intended to be illustrative only. They are not intended to be exhaustive or to limit the invention to the precise forms disclosed; obviously, many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles of the invention and its practical application, thereby enabling others skilled in the art to best understand and utilize the invention. The invention is limited only by the claims and their full scope and equivalents.
Claims (8)
1. A digital human synthesis method based on multi-task learning, characterized by comprising the following steps:
Step one: acquiring video data containing a speaker, preprocessing the video data, and extracting the speaker's face and audio sequences; wherein face extraction may be performed with any face-detection model, and the audio sequence is extracted from the video with the ffmpeg tool;
Step two: preprocessing the face sequence: masking (MASK) the lower half of each face, i.e. setting it to black; inputting the audio sequence and the masked face sequence into a face synthesis module to synthesize a face sequence;
Step three: optimizing the synthesized face sequence under the guidance of a voice and lip synchronization module and a face rendering module respectively; wherein the voice and lip synchronization module judges how well the lips of the synthesized face match the audio and guides the optimization of the face synthesis module, and the face rendering module judges the rendering fidelity of the face and guides the optimization of the face synthesis module;
Step four: during digital human synthesis, monitoring the synthesis parameter data of the face synthesis module and analyzing a synthesis deviation value; if the synthesis deviation value PL exceeds a preset deviation threshold, indicating that the processing capability of the face synthesis module is low, and generating a deviation early-warning signal to remind the administrator to switch to a new industrial computer for the computation.
2. The digital human synthesis method based on multi-task learning according to claim 1, wherein the voice-lip synchronization module works as follows:
features are extracted from the audio sequence and lip features from the face sequence, and their cosine similarity is calculated; the cosine similarity represents the degree of matching between the voice and the lip shape;
wherein the audio feature extractor is a backbone network of the RNN, CNN or Transformer paradigm, and the lip feature extractor is a backbone network of the CNN or Transformer paradigm.
3. The method as claimed in claim 2, wherein the training data of the voice-lip synchronization module consist of real face pictures and synthesized face pictures; in the training phase, real face pictures are positive samples labeled 1 and synthesized face pictures are negative samples labeled 0; the backbone network is optimized with a cross entropy loss; the cross entropy loss calculation is shown in equation 1:
L1 = -(1/N) · Σ_i [ y_i · log(σ_i) + (1 − y_i) · log(1 − σ_i) ] - (1)

wherein L1 is the cross entropy loss of the voice-lip synchronization module, y_i is the label of the i-th sample, σ_i is the predicted value for the i-th sample, and N is the number of samples.
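The binary cross entropy of equation 1 over labels y_i and predictions σ_i can be computed as below; the clipping epsilon is an assumption added to avoid log(0), not part of the patent text:

```python
import numpy as np

def bce_loss(y, sigma, eps=1e-12):
    """Binary cross entropy (equation 1): labels y_i in {0, 1},
    predicted values sigma_i in [0, 1]."""
    y = np.asarray(y, dtype=float)
    s = np.clip(np.asarray(sigma, dtype=float), eps, 1.0 - eps)  # guard log(0)
    return float(-np.mean(y * np.log(s) + (1 - y) * np.log(1 - s)))
```

For a well-classified batch ([1, 0] predicted as [0.9, 0.1]) the loss equals -log(0.9) ≈ 0.105.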
4. The digital human synthesis method based on multi-task learning, wherein the face rendering module is divided into three parts: the first and second parts directly calculate loss functions that guide the optimization of the face synthesis module; the third part judges whether a face is a real face; the face feature extractor is mainly a backbone network of the CNN or Transformer paradigm;
wherein the pixel loss calculation is shown in equation 2, and the perceptual loss calculation is shown in equation 3;
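Equations 2 and 3 are not reproduced in this text, so the sketch below assumes their standard forms: a mean absolute pixel difference for the pixel loss, and a feature-space L2 distance (through the face feature extractor φ) for the perceptual loss. Both the exact norms and the `feature_fn` interface are assumptions.

```python
import numpy as np

def pixel_loss(real: np.ndarray, fake: np.ndarray) -> float:
    """Assumed form of equation 2: mean absolute difference
    between a real frame and a synthesized frame."""
    return float(np.mean(np.abs(real - fake)))

def perceptual_loss(real: np.ndarray, fake: np.ndarray, feature_fn) -> float:
    """Assumed form of equation 3: mean squared distance between
    deep features of the real and synthesized frames, where
    feature_fn stands in for the face feature extractor."""
    return float(np.mean((feature_fn(real) - feature_fn(fake)) ** 2))
```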
5. The method according to claim 4, wherein the training data of the face rendering module are real face pictures and synthesized face pictures, the real face pictures having the label 1 and the synthesized face pictures the label 0; the backbone network is optimized by a cross entropy loss; the cross entropy loss calculation is shown in equation 4:

L4 = -(1/N) · Σ_i [ y_i · log(σ_i) + (1 − y_i) · log(1 − σ_i) ] - (4)
6. The digital human synthesis method based on multi-task learning according to claim 1, wherein the face synthesis module generates, from a given audio sequence and masked face sequence, a face whose lip shape corresponds to the audio; the face synthesis module has a standard Encoder-Decoder structure, the Encoder being formed by stacking convolutional networks and the Decoder by stacking deconvolutional (transposed convolution) networks.
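The Encoder-Decoder shape symmetry of claim 6 can be illustrated with a toy numpy stand-in, where stride-2 subsampling plays the role of a stacked convolutional Encoder and nearest-neighbour upsampling plays the role of the deconvolutional Decoder. This only demonstrates the spatial down/up-sampling pattern; it contains no learned weights and is not the patent's network.

```python
import numpy as np

def encode(x: np.ndarray) -> list:
    """Toy Encoder: repeated stride-2 subsampling stands in for
    stacked convolutions, halving the spatial size at each stage."""
    feats = [x]
    while feats[-1].shape[0] > 1:
        feats.append(feats[-1][::2, ::2])
    return feats  # feats[-1] is the bottleneck

def decode(z: np.ndarray, target_shape: tuple) -> np.ndarray:
    """Toy Decoder: nearest-neighbour upsampling stands in for
    deconvolutions, doubling the spatial size back to the input."""
    out = z
    while out.shape[0] < target_shape[0]:
        out = np.repeat(np.repeat(out, 2, axis=0), 2, axis=1)
    return out
```

An 8x8 input is reduced to a 1x1 bottleneck and restored to 8x8, mirroring the symmetric conv/deconv stacks.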
7. The digital human synthesis method based on multi-task learning according to claim 5, wherein the loss function consists of four parts: the cross entropy loss of the voice-lip synchronization module, the pixel loss of the face rendering module, the perceptual loss, and the cross entropy loss of the real face score; the loss function is shown in equation 5:
L = w1·L1 + w2·L2 + w3·L3 + w4·L4 - (5)
wherein w1 + w2 + w3 + w4 = 1.
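Equation 5's weighted combination is a direct convex sum; a minimal sketch (the default equal weights are an assumption, the patent only requires that the weights sum to 1):

```python
def total_loss(l1: float, l2: float, l3: float, l4: float,
               w=(0.25, 0.25, 0.25, 0.25)) -> float:
    """Equation 5: L = w1*L1 + w2*L2 + w3*L3 + w4*L4 with sum(w) == 1."""
    assert abs(sum(w) - 1.0) < 1e-9, "weights must sum to 1"
    return w[0] * l1 + w[1] * l2 + w[2] * l3 + w[3] * l4
```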
8. The digital human synthesis method based on multi-task learning according to claim 1, wherein the specific analysis process of the synthesis deviation value PL is as follows:
during digital human synthesis, the synthesis parameter data of the face synthesis module are collected at intervals of R2; the number of access node connections, the CPU load rate, the bandwidth load rate and the real-time network rate are denoted W1, W2, W3 and W4 in sequence; the synthesis coefficient YS of the face synthesis module is calculated using the formula YS = (W1·b1 + W4·b4) / (W2·b2 + W3·b3), wherein b1, b2, b3 and b4 are coefficient factors;
if the synthesis coefficient is less than or equal to a preset synthesis threshold, a synthesis deviation signal is fed back to the control center;
when the control center detects a synthesis deviation signal, it automatically starts a countdown of length D1, where D1 is a preset value; the synthesis deviation signal continues to be monitored during the countdown, and if a new synthesis deviation signal is detected, the countdown is reset and restarted from D1;
the number of synthesis deviation signals occurring during the countdown phase is counted as C1, and the length of the countdown phase is D1; from these, the synthesis deviation value PL of the face synthesis module is calculated using a formula, wherein g1 and g2 are coefficient factors.
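The monitoring logic of claim 8 can be sketched as below. The YS formula is taken directly from the claim; the class name, the discrete `tick` interface, and the default coefficient factors are assumptions. The patent's formula for PL itself is not reproduced in the text, so the sketch only collects the inputs (C1 and D1) it would use.

```python
def synthesis_coefficient(W1: float, W2: float, W3: float, W4: float,
                          b=(1.0, 1.0, 1.0, 1.0)) -> float:
    """YS = (W1*b1 + W4*b4) / (W2*b2 + W3*b3), as stated in claim 8.
    Default coefficient factors b1..b4 = 1 are illustrative only."""
    return (W1 * b[0] + W4 * b[3]) / (W2 * b[1] + W3 * b[2])

class DeviationMonitor:
    """Countdown of length D1; every new deviation signal resets the
    countdown, and C1 counts signals seen during the countdown phase."""
    def __init__(self, D1: int):
        self.D1 = D1
        self.remaining = 0
        self.C1 = 0

    def tick(self, deviation_signal: bool) -> None:
        if deviation_signal:
            self.C1 += 1
            self.remaining = self.D1  # reset the countdown to D1
        elif self.remaining > 0:
            self.remaining -= 1
```

For example, with W1=10, W2=2, W3=2, W4=10 and unit factors, YS = 20/4 = 5.0.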
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211397710.3A | 2022-11-09 | 2022-11-09 | Digital human synthesis method based on multi-task learning |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115601230A true CN115601230A (en) | 2023-01-13 |
Family
ID=84853528
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113378697A (en) * | 2021-06-08 | 2021-09-10 | 安徽大学 | Method and device for generating speaking face video based on convolutional neural network |
CN114119268A (en) * | 2022-01-24 | 2022-03-01 | 科大智能物联技术股份有限公司 | Collaborative manufacturing system for printing and packaging production line |
CN114266944A (en) * | 2021-12-23 | 2022-04-01 | 安徽中科锟铻量子工业互联网有限公司 | Rapid model training result checking system |
CN114488092A (en) * | 2022-01-26 | 2022-05-13 | 安徽科创中光科技股份有限公司 | Carrier-to-noise ratio processing method of coherent wind measurement laser radar |
CN115202288A (en) * | 2022-07-18 | 2022-10-18 | 马鞍山经纬回转支承有限公司 | System and method for solving risk of switch type magnetic sucker |
CN115309475A (en) * | 2022-08-10 | 2022-11-08 | 蚌埠依爱消防电子有限责任公司 | Picture and sound rapid loading method for emergency evacuation display system |
Non-Patent Citations (1)
Title |
---|
李雨思 (Li Yusi): "Design and Implementation of Audio-Driven Video Generation" * |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108447474B (en) | Modeling and control method for synchronizing virtual character voice and mouth shape | |
Cao et al. | Expressive speech-driven facial animation | |
CN109325437B (en) | Image processing method, device and system | |
Morishima et al. | A media conversion from speech to facial image for intelligent man-machine interface | |
Sifakis et al. | Simulating speech with a physics-based facial muscle model | |
DE60101540T2 (en) | Method of animating an artificial model of a human face using acoustic signals | |
CN110751708B (en) | Method and system for driving face animation in real time through voice | |
US20020024519A1 (en) | System and method for producing three-dimensional moving picture authoring tool supporting synthesis of motion, facial expression, lip synchronizing and lip synchronized voice of three-dimensional character | |
CN111325817A (en) | Virtual character scene video generation method, terminal device and medium | |
CN106653052A (en) | Virtual human face animation generation method and device | |
CN110910479B (en) | Video processing method, device, electronic equipment and readable storage medium | |
CN116528019B (en) | Virtual human video synthesis method based on voice driving and face self-driving | |
Mattos et al. | Improving CNN-based viseme recognition using synthetic data | |
CN113689436A (en) | Image semantic segmentation method, device, equipment and storage medium | |
CN116597857A (en) | Method, system, device and storage medium for driving image by voice | |
CN115953521A (en) | Remote digital human rendering method, device and system | |
CN115270184A (en) | Video desensitization method, vehicle video desensitization method and vehicle-mounted processing system | |
CN117528135A (en) | Speech-driven face video generation method and device, electronic equipment and medium | |
CN115601230A (en) | Digital human synthesis method based on multi-task learning | |
CN112002005A (en) | Cloud-based remote virtual collaborative host method | |
Beskow et al. | Data-driven synthesis of expressive visual speech using an MPEG-4 talking head. | |
Huang et al. | Visual speech emotion conversion using deep learning for 3D talking head | |
Morishima et al. | Facial expression synthesis based on natural voice for virtual face-to-face communication with machine | |
CN114494813B (en) | Dense cross attention-based index expression generation method | |
CN117150089B (en) | Character artistic image changing system based on AIGC technology |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||