CN115601230A - Digital human synthesis method based on multi-task learning - Google Patents

Digital human synthesis method based on multi-task learning

Info

Publication number
CN115601230A
Authority
CN
China
Prior art keywords
face
module
synthesis
synthesized
sequence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211397710.3A
Other languages
Chinese (zh)
Inventor
李圆法
熊京萍
蔡劲松
卫海智
施灿灿
陈楷
冯纯博
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Kexun Jialian Information Technology Co ltd
Original Assignee
Kexun Jialian Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Kexun Jialian Information Technology Co ltd
Priority to CN202211397710.3A
Publication of CN115601230A
Current legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00: Geometric image transformations in the plane of the image
    • G06T3/04: Context-preserving transformations, e.g. by using an importance map
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods
    • G06N3/088: Non-supervised learning, e.g. competitive learning
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T13/00: Animation
    • G06T13/20: 3D [Three Dimensional] animation
    • G06T13/40: 3D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10: Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16: Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161: Detection; Localisation; Normalisation
    • G06V40/165: Detection; Localisation; Normalisation using facial parts and geometric relationships
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00: Indexing scheme for image analysis or image enhancement
    • G06T2207/20: Special algorithmic details
    • G06T2207/20081: Training; Learning
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00: Indexing scheme for image analysis or image enhancement
    • G06T2207/20: Special algorithmic details
    • G06T2207/20084: Artificial neural networks [ANN]

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Geometry (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Human Computer Interaction (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a digital human synthesis method based on multi-task learning, relating to the field of computer technology, and comprising the following steps: acquiring video data containing a speaker, preprocessing the video data, and extracting the speaker's face sequence and audio sequence; masking (MASK) the lower half of each face; inputting the audio sequence and the masked face sequence into a face synthesis module to synthesize a face sequence. A face rendering module is provided to improve the clarity of the digital human face, and a voice and lip synchronization module is provided to improve the accuracy of the voice-lip match; the two modules interact to improve the quality of the synthesized digital human. During digital human synthesis, the synthesis parameter data of the face synthesis module are monitored and a synthesis deviation value is analyzed; if the synthesis deviation value is greater than a preset deviation threshold, a deviation early-warning signal is generated to remind the manager to switch to a new industrial computer, thereby improving the precision and efficiency of digital human synthesis.

Description

Digital human synthesis method based on multi-task learning
Technical Field
The invention relates to the technical field of computers, in particular to a digital human synthesis method based on multi-task learning.
Background
A digital human is a virtual simulation of the shape and functions of the human body, at different levels of detail, built with methods from information science. Cartoon-style digital humans are based on computer graphics techniques and, compared with realistic (real-person-style) digital humans, offer fast rendering and a large adjustable parameter space.
However, cartoon-style digital humans require a great deal of up-front design and modeling work, are costly, are unsuitable for serious delivery scenarios such as government affairs, and give a poor user experience. Realistic digital humans are implemented mainly with computer vision techniques: the face region of the digital human is rendered in real time by a deep learning model and fused with pre-recorded character footage, so that changes in the character's expression and mouth shape can be realized. However, the clarity of the rendered face region and the accuracy of the match between voice and lip shape strongly affect how a realistic digital human is perceived. The invention therefore provides a digital human synthesis method based on multi-task learning.
Disclosure of Invention
The present invention is directed to solving at least one of the problems of the prior art. To this end, the invention provides a digital human synthesis method based on multi-task learning: a face rendering module is provided to improve the clarity of the digital human face, and a voice and lip synchronization module is provided to improve the accuracy of the voice-lip match; the two modules interact to improve the quality of the synthesized digital human.
To achieve the above object, an embodiment according to a first aspect of the present invention provides a digital human synthesis method based on multi-task learning, comprising:
Step one: acquiring video data containing a speaker, preprocessing the video data, and extracting the speaker's face sequence and audio sequence; wherein the face extraction can be performed with any face model, and the audio sequence is extracted from the video with the ffmpeg tool;
Step two: preprocessing the face sequence: masking (MASK) the lower half of each face, i.e. setting it to black; inputting the audio sequence and the masked face sequence into a face synthesis module to synthesize a face sequence;
Step three: optimizing the synthesized face sequence under the guidance of a voice and lip synchronization module and a face rendering module; the voice and lip synchronization module is used to judge how well the lip shape of the synthesized face matches the audio and to guide the optimization of the face synthesis module; the face rendering module is used to judge the fidelity of the face rendering and to guide the optimization of the face synthesis module;
Step four: during digital human synthesis, monitoring the synthesis parameter data of the face synthesis module and analyzing a synthesis deviation value; if the synthesis deviation value PL is greater than a preset deviation threshold, this indicates that the computing capacity available to the face synthesis module is low, and a deviation early-warning signal is generated to remind the manager to switch to a new industrial computer.
Further, the specific working steps of the voice and lip synchronization module are as follows:
lip-related features are extracted from the audio sequence and from the face sequence, and their cosine similarity is calculated; the cosine similarity represents the degree to which the voice and the lip shape match;
wherein the audio feature extractor is a backbone network in the RNN, CNN or Transformer paradigm, and the lip feature extractor is a backbone network in the CNN or Transformer paradigm.
Further, the training data of the voice and lip synchronization module are real face pictures and synthesized face pictures, wherein, in the training stage, real face pictures are positive samples and are given the label 1, and synthesized face pictures are negative samples and are given the label 0; the backbone networks are optimized with a cross-entropy loss; the cross-entropy loss is calculated as in Equation 1:

L1 = -(1/N) * Σ_i [ yi*log(σi) + (1 - yi)*log(1 - σi) ]    (1)

wherein L1 is the cross-entropy loss of the voice and lip synchronization module, yi is the label of the i-th sample, σi is the predicted value for the i-th sample, and N is the number of samples.
Furthermore, the face rendering module is divided into 3 parts: the first and second parts directly compute loss functions that guide the optimization of the face synthesis module; the third part judges whether a face is a real face; the face feature extractor is mainly a backbone network in the CNN or Transformer paradigm;
wherein the pixel loss is calculated as in Equation 2 and the perceptual loss as in Equation 3:

L2 = || I0 - I1 ||    (2)
L3 = || φ(I0) - φ(I1) ||    (3)

wherein I0 is the real picture, I1 is the synthesized picture, φ is a pre-trained backbone network, φ(·) denotes the feature map of the last layer of the backbone network, ||·|| denotes a distance over pixels or feature maps, L2 is the pixel loss, and L3 is the perceptual loss.
Furthermore, the training data of the face rendering module are real face pictures and synthesized face pictures, wherein real face pictures have the label 1 and synthesized face pictures have the label 0, and the backbone network is optimized with a cross-entropy loss; the cross-entropy loss is calculated as in Equation 4:

L4 = -(1/N) * Σ_i [ yi*log(σi) + (1 - yi)*log(1 - σi) ]    (4)

Further, the face synthesis module is used to generate a face whose lip shape corresponds to the audio, given the audio sequence and the masked (MASK) face sequence; the face synthesis module has a standard Encoder-Decoder structure, wherein the Encoder is a stack of convolutional layers and the Decoder is a stack of transposed (de-)convolutional layers.
Further, the loss function consists of four parts: the cross-entropy loss of the voice and lip synchronization module, and the pixel loss, perceptual loss and real-face-score cross-entropy loss of the face rendering module; the loss function is given in Equation 5:

L = w1*L1 + w2*L2 + w3*L3 + w4*L4    (5)

wherein w1 + w2 + w3 + w4 = 1.
Further, the specific analysis process of the synthesis deviation value PL is as follows:
during digital human synthesis, the synthesis parameter data of the face synthesis module are collected once every interval R2; the number of access node connections, the CPU load rate, the bandwidth load rate and the real-time network rate are denoted W1, W2, W3 and W4 in turn; the synthesis coefficient YS of the face synthesis module is calculated as YS = (W1*b1 + W4*b4)/(W2*b2 + W3*b3), wherein b1, b2, b3 and b4 are coefficient factors;
if the synthesis coefficient is less than or equal to a preset synthesis threshold, a synthesis deviation signal is fed back to the control center;
when the control center detects that a synthesis deviation signal has been generated, a countdown is started automatically with initial value D1, D1 being a preset value; the synthesis deviation signal continues to be monitored during the countdown stage, and if a new synthesis deviation signal is detected the countdown is reset to its initial value D1;
the number of synthesis deviation signals occurring during the countdown phase is counted as C1 and the length of the countdown phase is taken as D1; the synthesis deviation value PL of the face synthesis module is then calculated from C1 and D1 using a preset formula in which g1 and g2 are coefficient factors.
Compared with the prior art, the invention has the following beneficial effects:
1. The voice and lip synchronization module extracts lip-related features from the audio sequence and the face sequence, calculates their cosine similarity, judges how well the lip shape of the synthesized face matches the audio, and guides the optimization of the face synthesis module, improving the match between lip shape and audio. The face rendering module is divided into 3 parts: the first and second parts directly compute loss functions, and the third part judges whether a face is a real face; the module judges the fidelity of the face rendering and guides the optimization of the face synthesis module, improving the rendering fidelity. The two modules interact, improving the quality of the synthesized digital human.
2. During digital human synthesis, the synthesis parameter data of the face synthesis module are monitored and a synthesis deviation value is analyzed: the synthesis parameter data are collected every interval R2 and the synthesis coefficient YS of the face synthesis module is calculated; if YS is less than or equal to a preset synthesis threshold, a synthesis deviation signal is fed back to the control center; when the control center detects a synthesis deviation signal, a countdown is started automatically; the number of synthesis deviation signals occurring during the countdown phase is counted as C1, the length of the countdown phase is taken as D1, and the synthesis deviation value PL of the face synthesis module is calculated from them; if PL is greater than a preset deviation threshold, this indicates that the computing capacity of the face synthesis module is low, and a deviation early-warning signal is generated to remind the manager to switch to a new industrial computer, improving the precision and efficiency of digital human synthesis.
Drawings
In order to illustrate the embodiments of the present invention or the technical solutions of the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description show only some embodiments of the present invention, and that a person skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a technical architecture diagram of a digital human synthesis method based on multitask learning according to the present invention.
Fig. 2 is a technical architecture diagram of the voice and lip sync modules of the present invention.
FIG. 3 is a technical architecture diagram of the face rendering module of the present invention.
Fig. 4 is a technical architecture diagram of the face synthesis module of the present invention.
Detailed Description
The technical solutions of the present invention will be described clearly and completely below with reference to the embodiments; it should be understood that the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person skilled in the art from the embodiments given herein without creative effort fall within the protection scope of the present invention.
As shown in Figs. 1 to 4, a digital human synthesis method based on multi-task learning includes the following steps:
Step one: acquiring video data containing a speaker, performing the necessary preprocessing on the video data, and extracting the speaker's face sequence and audio sequence; the face extraction can be performed with any face model, and the audio sequence is extracted from the video with the ffmpeg tool;
Step two: preprocessing the face sequence by masking (MASK) the lower half of each face, i.e. setting it to black; inputting the audio sequence and the masked face sequence into the face synthesis module to synthesize a face sequence (a sketch of this preprocessing is given below);
Step three: optimizing the synthesized face sequence under the guidance of a voice and lip synchronization module and a face rendering module, so as to improve the quality of the synthesized faces;
the voice and lip synchronization module judges how well the lip shape of the synthesized face matches the audio and guides the optimization of the face synthesis module, improving the match between lip shape and audio;
the face rendering module judges the fidelity of the face rendering and guides the optimization of the face synthesis module, improving the rendering fidelity;
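As an illustration of steps one and two, the following is a minimal, non-limiting sketch of the preprocessing in Python with ffmpeg and OpenCV. The Haar-cascade detector, the 256x256 crop size, the 16 kHz sample rate and the helper names are assumptions made for the example only; the method itself allows any face model.

```python
# Illustrative sketch of steps one and two: extract the audio track with ffmpeg,
# detect the speaker's face per frame, and black out the lower half (the MASK step).
import subprocess
import cv2

def extract_audio(video_path: str, wav_path: str, sr: int = 16000) -> None:
    """Step one (audio): pull the audio sequence out of the video with the ffmpeg tool."""
    subprocess.run(
        ["ffmpeg", "-y", "-i", video_path, "-vn", "-acodec", "pcm_s16le", "-ar", str(sr), wav_path],
        check=True,
    )

def extract_masked_faces(video_path: str, size: int = 256):
    """Steps one and two (video): crop the face in each frame and mask its lower half."""
    # Any face model may be used; OpenCV's Haar cascade is only a stand-in here.
    detector = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    faces, masked = [], []
    cap = cv2.VideoCapture(video_path)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        boxes = detector.detectMultiScale(cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY), 1.1, 5)
        if len(boxes) == 0:
            continue
        x, y, w, h = boxes[0]
        crop = cv2.resize(frame[y:y + h, x:x + w], (size, size))
        faces.append(crop)
        m = crop.copy()
        m[size // 2:, :, :] = 0  # set the lower half of the face to black
        masked.append(m)
    cap.release()
    return faces, masked
```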
In this embodiment, the technical architecture of the voice and lip synchronization module is shown in Fig. 2, and the specific steps are as follows:
lip-related features are extracted from the audio sequence and from the face sequence, and their cosine similarity is calculated; this score represents the degree to which the voice and the lip shape match;
wherein the audio feature extractor is not limited to one network and can be a backbone network in the RNN, CNN or Transformer paradigm, and the lip feature extractor is mainly a backbone network in the CNN or Transformer paradigm;
the training data of the voice and lip synchronization module are real face pictures and synthesized face pictures; in the training stage, real face pictures are positive samples and are given the label 1, and synthesized face pictures are negative samples and are given the label 0; the backbone networks are optimized with a cross-entropy loss, calculated as in Equation 1:

L1 = -(1/N) * Σ_i [ yi*log(σi) + (1 - yi)*log(1 - σi) ]    (1)

wherein L1 is the cross-entropy loss of the voice and lip synchronization module, yi is the label of the i-th sample, σi is the predicted value for the i-th sample, and N is the number of samples;
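A minimal PyTorch sketch of this module is given below. The concrete layer sizes, the 80-bin mel-spectrogram input and the names SyncDiscriminator and sync_loss are illustrative assumptions; the cosine similarity between the audio and lip embeddings and the cross-entropy objective of Equation 1 follow the description above.

```python
# Sketch of the voice and lip synchronization module: two small encoders stand in for the
# RNN/CNN/Transformer backbones; their cosine similarity is the voice-lip match score.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SyncDiscriminator(nn.Module):
    def __init__(self, feat_dim: int = 256):
        super().__init__()
        # Audio branch: assumes an 80-bin mel spectrogram of shape (B, 80, T).
        self.audio_enc = nn.Sequential(
            nn.Conv1d(80, 128, kernel_size=5, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(128, feat_dim))
        # Lip branch: assumes face crops of shape (B, 3, H, W).
        self.lip_enc = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, feat_dim))

    def forward(self, mel, lips):
        a = F.normalize(self.audio_enc(mel), dim=-1)
        v = F.normalize(self.lip_enc(lips), dim=-1)
        cos_sim = (a * v).sum(dim=-1)          # cosine similarity: voice-lip matching degree
        return torch.sigmoid(cos_sim)          # mapped to (0, 1) for the cross entropy of Eq. 1

def sync_loss(pred, labels):
    """Eq. 1: real faces are positives (label 1), synthesized faces are negatives (label 0)."""
    return F.binary_cross_entropy(pred, labels)
```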
In this embodiment, the technical architecture of the face rendering module is shown in Fig. 3. The module is mainly divided into 3 parts; the first and second parts directly compute loss functions that guide the optimization of the face synthesis module, wherein the pixel loss is calculated as in Equation 2 and the perceptual loss as in Equation 3:

L2 = || I0 - I1 ||    (2)
L3 = || φ(I0) - φ(I1) ||    (3)

wherein I0 is the real picture, I1 is the synthesized picture, φ is a pre-trained backbone network such as VGG19 or ResNet50, φ(·) denotes the feature map of the last layer of the backbone network, L2 is the pixel loss and L3 is the perceptual loss;
the third part mainly judges whether a face is a real face; its face feature extractor is mainly a backbone network in the CNN or Transformer paradigm; the training data of this part are real face pictures and synthesized face pictures, wherein real face pictures have the label 1 and synthesized face pictures have the label 0, and this part is optimized with a cross-entropy loss, calculated as in Equation 4:

L4 = -(1/N) * Σ_i [ yi*log(σi) + (1 - yi)*log(1 - σi) ]    (4)
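The following sketch illustrates the three parts of this module. VGG19 is used as the pre-trained backbone (the text equally allows ResNet50 or others), the L1 distance for the pixel and perceptual losses is an assumption since Equations 2 and 3 do not fix the norm, and the real-face scorer reuses the frozen backbone with a small trainable head purely for brevity, whereas the text describes its own CNN/Transformer face feature extractor.

```python
# Sketch of the face rendering module: pixel loss (Eq. 2), perceptual loss (Eq. 3)
# and a real-vs-synthesized score trained with cross entropy (Eq. 4).
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import vgg19, VGG19_Weights

class RenderCritic(nn.Module):
    def __init__(self):
        super().__init__()
        backbone = vgg19(weights=VGG19_Weights.DEFAULT).features.eval()  # pre-trained backbone
        for p in backbone.parameters():
            p.requires_grad_(False)                                      # keep the backbone frozen
        self.backbone = backbone
        self.realness_head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(512, 1))

    def pixel_loss(self, real, fake):
        return F.l1_loss(fake, real)                                 # Eq. 2: pixel-wise distance (L1 assumed)

    def perceptual_loss(self, real, fake):
        return F.l1_loss(self.backbone(fake), self.backbone(real))   # Eq. 3: last-layer feature distance

    def realness_loss(self, image, label):
        logit = self.realness_head(self.backbone(image)).squeeze(1)
        return F.binary_cross_entropy_with_logits(logit, label)      # Eq. 4: real = 1, synthesized = 0
```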
In this embodiment, the face synthesis module is configured to generate a face whose lip shape corresponds to the audio, given the audio sequence and the masked (MASK) face sequence; the technical architecture of the whole module is shown in Fig. 4;
the face synthesis module has a standard Encoder-Decoder structure, wherein the Encoder is a stack of convolutional layers and the Decoder is a stack of transposed (de-)convolutional layers;
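A minimal sketch of such an Encoder-Decoder generator follows. The three-level depth, the channel widths and the broadcast-and-concatenate injection of the audio embedding are illustrative assumptions; only the convolutional Encoder and transposed-convolutional Decoder structure comes from the description above.

```python
# Sketch of the face synthesis module: convolutional Encoder, audio-conditioned bottleneck,
# transposed-convolutional Decoder that outputs the synthesized face.
import torch
import torch.nn as nn

class FaceGenerator(nn.Module):
    def __init__(self, audio_dim: int = 256):
        super().__init__()
        self.encoder = nn.Sequential(                               # stacked convolutions
            nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(128, 256, 4, stride=2, padding=1), nn.ReLU())
        self.fuse = nn.Conv2d(256 + audio_dim, 256, kernel_size=1)  # inject the audio condition
        self.decoder = nn.Sequential(                               # stacked transposed convolutions
            nn.ConvTranspose2d(256, 128, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1), nn.Sigmoid())

    def forward(self, masked_face, audio_emb):
        h = self.encoder(masked_face)                               # (B, 256, H/8, W/8)
        a = audio_emb[:, :, None, None].expand(-1, -1, h.shape[2], h.shape[3])
        return self.decoder(self.fuse(torch.cat([h, a], dim=1)))    # synthesized face in [0, 1]
```

In this sketch the audio embedding could come from the audio encoder of the synchronization module; the patent does not specify how the audio condition is fused, so this is only one possible choice.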
the optimization of the face synthesis module is guided jointly by the voice and lip synchronization module and the face rendering module; the loss function consists of four parts: the cross-entropy loss of the voice and lip synchronization module, and the pixel loss, perceptual loss and real-face-score cross-entropy loss of the face rendering module;
the loss function is given in Equation 5:

L = w1*L1 + w2*L2 + w3*L3 + w4*L4    (5)

wherein w1 + w2 + w3 + w4 = 1, and the specific values of the weights are tuned to the actual situation. The technical architecture of the invention is a typical GAN (generative adversarial network): the generator is the face synthesis module and the judges are the voice and lip synchronization module and the face rendering module. Training therefore follows the GAN paradigm: the generator is fixed while the judges are trained, and the generator is trained over multiple loop iterations. In practical application, only the face synthesis module is used to synthesize the digital human;
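The sketch below, building on the SyncDiscriminator, RenderCritic and FaceGenerator sketches above, shows how the weighted loss of Equation 5 and the GAN-style alternating schedule could be wired together; the batch keys, the equal default weights and the single critic optimizer are illustrative assumptions, not part of the described method.

```python
# Sketch of the joint objective (Eq. 5) and the GAN-style schedule: the judges (sync + rendering
# modules) are updated with the generator fixed, then the generator is updated for several
# iterations against the fixed judges.  Reuses sync_loss / RenderCritic / FaceGenerator above.
import torch

def generator_loss(l1_sync, l2_pixel, l3_percep, l4_real, w=(0.25, 0.25, 0.25, 0.25)):
    """Eq. 5: L = w1*L1 + w2*L2 + w3*L3 + w4*L4, with the weights summing to 1."""
    w1, w2, w3, w4 = w
    return w1 * l1_sync + w2 * l2_pixel + w3 * l3_percep + w4 * l4_real

def train_step(gen, sync, render, gen_opt, critic_opt, batch, gen_iters: int = 2):
    ones = torch.ones(batch["real_face"].size(0))
    zeros = torch.zeros_like(ones)
    # 1) train the judging modules with the generator fixed
    with torch.no_grad():
        fake = gen(batch["masked_face"], batch["audio_emb"])
    critic_opt.zero_grad()
    d_loss = (sync_loss(sync(batch["mel"], fake), zeros)
              + sync_loss(sync(batch["mel"], batch["real_face"]), ones)
              + render.realness_loss(fake, zeros)
              + render.realness_loss(batch["real_face"], ones))
    d_loss.backward()
    critic_opt.step()
    # 2) train the generator for several loop iterations against the fixed judges
    for _ in range(gen_iters):
        gen_opt.zero_grad()
        fake = gen(batch["masked_face"], batch["audio_emb"])
        g_loss = generator_loss(
            sync_loss(sync(batch["mel"], fake), ones),          # L1: voice-lip sync
            render.pixel_loss(batch["real_face"], fake),        # L2: pixel loss
            render.perceptual_loss(batch["real_face"], fake),   # L3: perceptual loss
            render.realness_loss(fake, ones),                   # L4: real-face score
        )
        g_loss.backward()
        gen_opt.step()
```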
In this embodiment, the method further includes: during digital human synthesis, monitoring the synthesis parameter data of the face synthesis module and analyzing a synthesis deviation value, specifically:
during digital human synthesis, the synthesis parameter data of the face synthesis module are collected once every interval R2; the synthesis parameter data include the number of access node connections, the CPU load rate, the bandwidth load rate and the real-time network rate;
the number of access node connections, the CPU load rate, the bandwidth load rate and the real-time network rate are denoted W1, W2, W3 and W4 in turn; the synthesis coefficient YS of the face synthesis module is calculated as YS = (W1*b1 + W4*b4)/(W2*b2 + W3*b3), wherein b1, b2, b3 and b4 are coefficient factors;
the synthesis coefficient YS is compared with a preset synthesis threshold; if the synthesis coefficient is less than or equal to the preset synthesis threshold, a synthesis deviation signal is fed back to the control center;
when the control center detects that a synthesis deviation signal has been generated, a countdown is started automatically with initial value D1, D1 being a preset value, for example D1 = 10; each time a new synthesis coefficient is collected, the countdown is decreased by one; the synthesis deviation signal continues to be monitored during the countdown stage, and if a new synthesis deviation signal is detected the countdown is reset to its initial value D1; otherwise the countdown runs down to zero and counting stops;
the number of synthesis deviation signals occurring during the countdown phase is counted as C1, and the length of the countdown phase is taken as D1; the synthesis deviation value PL of the face synthesis module is then calculated from C1 and D1 using a preset formula in which g1 and g2 are coefficient factors;
the synthesis deviation value PL is compared with a preset deviation threshold; if PL is greater than the preset deviation threshold, this indicates that the computing capacity of the face synthesis module is low at that moment, and a deviation early-warning signal is generated to remind the manager to switch to a new industrial computer, improving the precision and efficiency of digital human synthesis.
All of the above formulas are evaluated on dimensionless numerical values; each formula is obtained by collecting a large amount of data and fitting it through software simulation so as to approximate the real situation as closely as possible, and the preset parameters and preset thresholds in the formulas are set by a person skilled in the art according to the actual situation or obtained through simulation over a large amount of data.
The working principle of the invention is as follows:
In operation of the digital human synthesis method based on multi-task learning, video data containing a speaker are acquired, the necessary preprocessing is performed, and the speaker's face sequence and audio sequence are extracted; the face sequence is preprocessed by masking (MASK) the lower half of each face, i.e. setting it to black; the audio sequence and the masked face sequence are input to the face synthesis module to synthesize a face sequence; the synthesized face sequence is optimized under the guidance of the voice and lip synchronization module and the face rendering module, improving the quality of face synthesis. The voice and lip synchronization module extracts lip-related features from the audio sequence and the face sequence, calculates their cosine similarity, judges how well the lip shape of the synthesized face matches the audio, and guides the optimization of the face synthesis module, improving the match between lip shape and audio. The face rendering module is divided into 3 parts: the first and second parts directly compute loss functions, and the third part mainly judges whether a face is a real face; the module judges the fidelity of the face rendering and guides the optimization of the face synthesis module, improving the rendering fidelity. The two modules interact, improving the quality of the synthesized digital human.
During digital human synthesis, the synthesis parameter data of the face synthesis module are monitored and a synthesis deviation value is analyzed: the synthesis parameter data are collected every interval R2 and the synthesis coefficient YS of the face synthesis module is calculated; if YS is less than or equal to a preset synthesis threshold, a synthesis deviation signal is fed back to the control center; when the control center detects a synthesis deviation signal, a countdown is started automatically; the number of synthesis deviation signals occurring during the countdown phase is counted as C1, the length of the countdown phase is taken as D1, and the synthesis deviation value PL of the face synthesis module is calculated from them; if PL is greater than a preset deviation threshold, this indicates that the computing capacity of the face synthesis module is low at that moment, and a deviation early-warning signal is generated to remind the manager to switch to a new industrial computer, improving the precision and efficiency of digital human synthesis.
In the description herein, references to the description of "one embodiment," "an example," "a specific example" or the like are intended to mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
The preferred embodiments of the invention disclosed above are intended to be illustrative only. The preferred embodiments are not intended to be exhaustive or to limit the invention to the precise form disclosed. Obviously, many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles of the invention and the practical application, to thereby enable others skilled in the art to best understand the invention for and utilize the invention. The invention is limited only by the claims and their full scope and equivalents.

Claims (8)

1. A digital human synthesis method based on multi-task learning, characterized by comprising:
step one: acquiring video data containing a speaker, preprocessing the video data, and extracting the speaker's face sequence and audio sequence; wherein the face extraction can be performed with any face model, and the audio sequence is extracted from the video with the ffmpeg tool;
step two: preprocessing the face sequence: masking (MASK) the lower half of each face, i.e. setting it to black; inputting the audio sequence and the masked face sequence into a face synthesis module to synthesize a face sequence;
step three: optimizing the synthesized face sequence under the guidance of a voice and lip synchronization module and a face rendering module; the voice and lip synchronization module is used to judge how well the lip shape of the synthesized face matches the audio and to guide the optimization of the face synthesis module; the face rendering module is used to judge the fidelity of the face rendering and to guide the optimization of the face synthesis module;
step four: during digital human synthesis, monitoring the synthesis parameter data of the face synthesis module and analyzing a synthesis deviation value; if the synthesis deviation value PL is greater than a preset deviation threshold, this indicates that the computing capacity available to the face synthesis module is low, and a deviation early-warning signal is generated to remind the manager to switch to a new industrial computer.
2. The digital human synthesis method based on multi-task learning according to claim 1, characterized in that the specific working steps of the voice and lip synchronization module are as follows:
lip-related features are extracted from the audio sequence and from the face sequence, and their cosine similarity is calculated; the cosine similarity represents the degree to which the voice and the lip shape match;
wherein the audio feature extractor is a backbone network in the RNN, CNN or Transformer paradigm, and the lip feature extractor is a backbone network in the CNN or Transformer paradigm.
3. The digital human synthesis method based on multi-task learning according to claim 2, characterized in that the training data of the voice and lip synchronization module are real face pictures and synthesized face pictures, wherein, in the training stage, real face pictures are positive samples and are given the label 1, and synthesized face pictures are negative samples and are given the label 0; the backbone networks are optimized with a cross-entropy loss; the cross-entropy loss is calculated as in Equation 1:

L1 = -(1/N) * Σ_i [ yi*log(σi) + (1 - yi)*log(1 - σi) ]    (1)

wherein L1 is the cross-entropy loss of the voice and lip synchronization module, yi is the label of the i-th sample, σi is the predicted value for the i-th sample, and N is the number of samples.
4. The digital human synthesis method based on multi-task learning according to claim 1, characterized in that the face rendering module is divided into 3 parts, wherein the first and second parts directly compute loss functions that guide the optimization of the face synthesis module, and the third part judges whether a face is a real face; the face feature extractor is mainly a backbone network in the CNN or Transformer paradigm;
wherein the pixel loss is calculated as in Equation 2 and the perceptual loss as in Equation 3:

L2 = || I0 - I1 ||    (2)
L3 = || φ(I0) - φ(I1) ||    (3)

wherein I0 is the real picture, I1 is the synthesized picture, φ is a pre-trained backbone network, φ(·) denotes the feature map of the last layer of the backbone network, L2 is the pixel loss, and L3 is the perceptual loss.
5. The digital human synthesis method based on multi-task learning according to claim 4, characterized in that the training data of the face rendering module are real face pictures and synthesized face pictures, wherein real face pictures have the label 1 and synthesized face pictures have the label 0, and the backbone network is optimized with a cross-entropy loss; the cross-entropy loss is calculated as in Equation 4:

L4 = -(1/N) * Σ_i [ yi*log(σi) + (1 - yi)*log(1 - σi) ]    (4)
6. The digital human synthesis method based on multi-task learning according to claim 1, characterized in that the face synthesis module is used to generate a face whose lip shape corresponds to the audio, given the audio sequence and the masked (MASK) face sequence; the face synthesis module has a standard Encoder-Decoder structure, wherein the Encoder is a stack of convolutional layers and the Decoder is a stack of transposed (de-)convolutional layers.
7. The digital human synthesis method based on multi-task learning according to claim 5, characterized in that the loss function consists of four parts: the cross-entropy loss of the voice and lip synchronization module, and the pixel loss, perceptual loss and real-face-score cross-entropy loss of the face rendering module; the loss function is given in Equation 5:

L = w1*L1 + w2*L2 + w3*L3 + w4*L4    (5)

wherein w1 + w2 + w3 + w4 = 1.
8. The digital human synthesis method based on multi-task learning according to claim 1, characterized in that the specific analysis process of the synthesis deviation value PL is as follows:
during digital human synthesis, the synthesis parameter data of the face synthesis module are collected once every interval R2; the number of access node connections, the CPU load rate, the bandwidth load rate and the real-time network rate are denoted W1, W2, W3 and W4 in turn; the synthesis coefficient YS of the face synthesis module is calculated as YS = (W1*b1 + W4*b4)/(W2*b2 + W3*b3), wherein b1, b2, b3 and b4 are coefficient factors;
if the synthesis coefficient is less than or equal to a preset synthesis threshold, a synthesis deviation signal is fed back to the control center;
when the control center detects that a synthesis deviation signal has been generated, a countdown is started automatically with initial value D1, D1 being a preset value; the synthesis deviation signal continues to be monitored during the countdown stage, and if a new synthesis deviation signal is detected the countdown is reset to its initial value D1;
the number of synthesis deviation signals occurring during the countdown phase is counted as C1 and the length of the countdown phase is taken as D1; the synthesis deviation value PL of the face synthesis module is calculated from C1 and D1 using a preset formula in which g1 and g2 are coefficient factors.
CN202211397710.3A 2022-11-09 2022-11-09 Digital human synthesis method based on multi-task learning Pending CN115601230A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211397710.3A CN115601230A (en) 2022-11-09 2022-11-09 Digital human synthesis method based on multi-task learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211397710.3A CN115601230A (en) 2022-11-09 2022-11-09 Digital human synthesis method based on multi-task learning

Publications (1)

Publication Number Publication Date
CN115601230A true CN115601230A (en) 2023-01-13

Family

ID=84853528

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211397710.3A Pending CN115601230A (en) 2022-11-09 2022-11-09 Digital human synthesis method based on multi-task learning

Country Status (1)

Country Link
CN (1) CN115601230A (en)

Citations (6)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113378697A (en) * 2021-06-08 2021-09-10 安徽大学 Method and device for generating speaking face video based on convolutional neural network
CN114266944A (en) * 2021-12-23 2022-04-01 安徽中科锟铻量子工业互联网有限公司 Rapid model training result checking system
CN114119268A (en) * 2022-01-24 2022-03-01 科大智能物联技术股份有限公司 Collaborative manufacturing system for printing and packaging production line
CN114488092A (en) * 2022-01-26 2022-05-13 安徽科创中光科技股份有限公司 Carrier-to-noise ratio processing method of coherent wind measurement laser radar
CN115202288A (en) * 2022-07-18 2022-10-18 马鞍山经纬回转支承有限公司 System and method for solving risk of switch type magnetic sucker
CN115309475A (en) * 2022-08-10 2022-11-08 蚌埠依爱消防电子有限责任公司 Picture and sound rapid loading method for emergency evacuation display system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
李雨思: "Design and Implementation of Audio-Driven Video Generation (基于音频驱动的视频生成设计与实现)" *

Similar Documents

Publication Publication Date Title
CN108447474B (en) Modeling and control method for synchronizing virtual character voice and mouth shape
Cao et al. Expressive speech-driven facial animation
CN109325437B (en) Image processing method, device and system
Morishima et al. A media conversion from speech to facial image for intelligent man-machine interface
Sifakis et al. Simulating speech with a physics-based facial muscle model
DE60101540T2 (en) Method of animating an artificial model of a human face using acoustic signals
CN110751708B (en) Method and system for driving face animation in real time through voice
US20020024519A1 (en) System and method for producing three-dimensional moving picture authoring tool supporting synthesis of motion, facial expression, lip synchronizing and lip synchronized voice of three-dimensional character
CN111325817A (en) Virtual character scene video generation method, terminal device and medium
CN106653052A (en) Virtual human face animation generation method and device
CN110910479B (en) Video processing method, device, electronic equipment and readable storage medium
CN116528019B (en) Virtual human video synthesis method based on voice driving and face self-driving
Mattos et al. Improving CNN-based viseme recognition using synthetic data
CN113689436A (en) Image semantic segmentation method, device, equipment and storage medium
CN116597857A (en) Method, system, device and storage medium for driving image by voice
CN115953521A (en) Remote digital human rendering method, device and system
CN115270184A (en) Video desensitization method, vehicle video desensitization method and vehicle-mounted processing system
CN117528135A (en) Speech-driven face video generation method and device, electronic equipment and medium
CN115601230A (en) Digital human synthesis method based on multi-task learning
CN112002005A (en) Cloud-based remote virtual collaborative host method
Beskow et al. Data-driven synthesis of expressive visual speech using an MPEG-4 talking head.
Huang et al. Visual speech emotion conversion using deep learning for 3D talking head
Morishima et al. Facial expression synthesis based on natural voice for virtual face-to-face communication with machine
CN114494813B (en) Dense cross attention-based index expression generation method
CN117150089B (en) Character artistic image changing system based on AIGC technology

Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
RJ01: Rejection of invention patent application after publication
Application publication date: 2023-01-13