CN115601230A - Digital human synthesis method based on multi-task learning - Google Patents
- Publication number: CN115601230A
- Application number: CN202211397710.3A
- Authority: CN (China)
- Prior art keywords: face, module, synthesis, synthesized, sequence
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06T3/04 — Geometric image transformations in the plane of the image; context-preserving transformations, e.g. by using an importance map
- G06N3/088 — Computing arrangements based on biological models; neural networks; learning methods; non-supervised learning, e.g. competitive learning
- G06T13/40 — 3D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings
- G06V40/165 — Recognition of human faces; detection, localisation, normalisation using facial parts and geometric relationships
- G06T2207/20081 — Indexing scheme for image analysis or image enhancement; special algorithmic details; training, learning
- G06T2207/20084 — Artificial neural networks [ANN]
Abstract
The invention discloses a digital human synthesis method based on multi-task learning, relating to the technical field of computers. The method comprises the following steps: acquiring video data containing a speaker, preprocessing the video data, and extracting the speaker's face and audio sequences; masking (MASK) the lower half of the face; and inputting the audio sequence and the masked face sequence into a face synthesis module to synthesize a face sequence. A face rendering module is provided to improve the clarity of the digital human face, and a voice and lip synchronization module is provided to improve the accuracy of voice-lip matching; the two modules interact to improve the quality of the synthesized digital human. During digital human synthesis, the synthesis parameter data of the face synthesis module are monitored and a synthesis deviation value is analyzed; if the synthesis deviation value exceeds a preset deviation threshold, a deviation early-warning signal is generated, reminding the administrator to switch to a new industrial computer for the computation, thereby improving the precision and efficiency of digital human synthesis.
Description
Technical Field
The invention relates to the technical field of computers, in particular to a digital human synthesis method based on multi-task learning.
Background
A digital human is a virtual simulation of the form and function of the human body at different levels, produced by information-science methods. Cartoon-style digital humans are built with computer-graphics techniques and, compared with real-person-style digital humans, offer fast rendering and a large adjustable parameter space.
However, anthropomorphic cartoon-style digital humans require a great deal of up-front design and modeling work, are costly, are unsuitable for serious delivery scenarios such as government-affairs settings, and offer a poor user experience. Real-person-style digital humans are realized mainly with computer-vision techniques: the face region of the digital human is generated by real-time rendering with a deep-learning model and fused with pre-recorded footage of the person, so that changes in the person's expression and mouth shape can be realized. However, the clarity of the rendered face region and the accuracy of the voice-lip matching greatly affect the presentation of real-person-style digital humans. Therefore, the invention provides a digital human synthesis method based on multi-task learning.
Disclosure of Invention
The present invention is directed to solving at least one of the problems of the prior art. To this end, the invention provides a digital human synthesis method based on multi-task learning, in which a face rendering module improves the clarity of the digital human face and a voice and lip synchronization module improves the accuracy of voice-lip matching; the two modules interact to improve the quality of the synthesized digital human.
To achieve the above object, an embodiment according to a first aspect of the present invention provides a digital human synthesis method based on multi-task learning, comprising:
Step one: acquire video data containing a speaker, preprocess the video data, and extract the speaker's face and audio sequences; face extraction may be performed with any face-detection model, and the audio sequence is extracted from the video with the ffmpeg tool;
Step two: preprocess the face sequence by masking (MASK) the lower half of each face, i.e. setting it to black; input the audio sequence and the masked face sequence into the face synthesis module to synthesize a face sequence;
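The masking in step two can be sketched as a minimal NumPy illustration, assuming images are H×W×C arrays (the function name is ours, not from the patent):

```python
import numpy as np

def mask_lower_half(face: np.ndarray) -> np.ndarray:
    """Black out the lower half of an (H, W, C) face image, as in step two."""
    masked = face.copy()
    h = masked.shape[0]
    masked[h // 2:] = 0  # set the lower-half rows to black
    return masked
```

The masked sequence is then paired with the audio sequence as input to the face synthesis module.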
Step three: the synthesized face sequence is optimized under the guidance of a voice and lip synchronization module and a face rendering module; the voice and lip synchronization module judges how well the lips of the synthesized face match the audio and guides the optimization of the face synthesis module; the face rendering module judges the rendering fidelity of the face and likewise guides the optimization of the face synthesis module;
Step four: during digital human synthesis, monitor the synthesis parameter data of the face synthesis module and analyze a synthesis deviation value; if the synthesis deviation value PL exceeds a preset deviation threshold, this indicates that the processing capability of the face synthesis module is low, and a deviation early-warning signal is generated to remind the administrator to switch to a new industrial computer for the computation.
Further, the voice and lip synchronization module works as follows:
Lip-related features are extracted from the audio sequence and the face sequence respectively, and their cosine similarity is calculated; the cosine similarity represents the degree to which the voice and the lip shape match.
The audio feature extractor may be a backbone network in the RNN, CNN, or Transformer paradigm; the lip feature extractor is a backbone network in the CNN or Transformer paradigm.
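The cosine-similarity score between the two feature vectors can be sketched as follows (a minimal NumPy version; the embeddings would come from the backbone networks named above):

```python
import numpy as np

def lip_sync_score(audio_feat: np.ndarray, lip_feat: np.ndarray) -> float:
    """Cosine similarity between an audio embedding and a lip embedding;
    higher values mean the voice and the lip shape match better."""
    denom = np.linalg.norm(audio_feat) * np.linalg.norm(lip_feat) + 1e-8
    return float(np.dot(audio_feat, lip_feat) / denom)
```

A score near 1 indicates well-synchronized voice and lips; a score near 0 indicates a mismatch.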
Further, the training data of the voice and lip synchronization module are real face pictures and synthesized face pictures: in the training stage, a real face picture is a positive sample and is given label 1, while a synthesized face picture is a negative sample and is given label 0; the backbone network is optimized with a cross-entropy loss, calculated as shown in equation 1:

L1 = -(1/N) * Σ_i [ y_i * log(σ_i) + (1 - y_i) * log(1 - σ_i) ]    (1)

where L1 is the cross-entropy loss of the voice and lip synchronization module, y_i is the label of the i-th sample, and σ_i is the predicted value of the i-th sample.
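Equation 1 is the standard binary cross-entropy over N samples, which can be sketched directly:

```python
import numpy as np

def cross_entropy_loss(y: np.ndarray, sigma: np.ndarray, eps: float = 1e-7) -> float:
    """Binary cross-entropy matching equation 1: y_i are the 0/1 labels,
    sigma_i the predicted probabilities; eps guards against log(0)."""
    sigma = np.clip(sigma, eps, 1.0 - eps)
    return float(-np.mean(y * np.log(sigma) + (1 - y) * np.log(1 - sigma)))
```

Perfect predictions give a loss near zero; predicting 0.5 everywhere gives ln 2 ≈ 0.693.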
Furthermore, the face rendering module is divided into three parts: the first and second parts directly calculate loss functions to guide the optimization of the face synthesis module; the third part judges whether a face is a real face; the face feature extractor is mainly a backbone network in the CNN or Transformer paradigm.
The pixel loss is calculated as shown in equation 2 and the perceptual loss as shown in equation 3:

L2 = ||I_0 - I_1||    (2)
L3 = ||φ(I_0) - φ(I_1)||    (3)

where I_0 is the real picture, I_1 is the synthesized picture, φ is a well-trained backbone network, and φ(·) denotes the feature map of the last layer of the backbone network; L2 is the pixel loss and L3 is the perceptual loss.
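Equations 2 and 3 can be sketched in NumPy; the exact norms are not specified in the source, so a mean absolute difference and a mean squared feature difference are assumed here:

```python
import numpy as np

def pixel_loss(i0: np.ndarray, i1: np.ndarray) -> float:
    """Equation 2 sketch: mean absolute pixel difference between the real
    picture I0 and the synthesized picture I1 (norm choice assumed)."""
    return float(np.mean(np.abs(i0 - i1)))

def perceptual_loss(i0: np.ndarray, i1: np.ndarray, phi) -> float:
    """Equation 3 sketch: distance between last-layer feature maps phi(I0)
    and phi(I1) of a trained backbone (here phi is any callable)."""
    return float(np.mean((phi(i0) - phi(i1)) ** 2))
```

In practice φ would be a pretrained network such as VGG19; a toy callable suffices to show the structure.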
Furthermore, the training data of the face rendering module are real face pictures and synthesized face pictures, where a real face picture has label 1 and a synthesized face picture has label 0; the backbone network is optimized with a cross-entropy loss, calculated as shown in equation 4:

L4 = -(1/N) * Σ_i [ y_i * log(σ_i) + (1 - y_i) * log(1 - σ_i) ]    (4)
further, the face synthesis module is used for generating a lip-shaped face corresponding to the audio frequency according to the given audio frequency sequence and the MASK face sequence; the synthetic human face module is of a standard Encoder-Decoder structure, the Encoder is formed by stacking convolution networks, and the Decoder is formed by stacking inverse convolution networks.
Further, the loss function consists of four parts: the cross-entropy loss of the voice and lip synchronization module, and the pixel loss, perceptual loss, and real-face-scoring cross-entropy loss of the face rendering module; the loss function is shown in equation 5:

L = w1*L1 + w2*L2 + w3*L3 + w4*L4    (5)

where w1 + w2 + w3 + w4 = 1.
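Equation 5 is a convex combination of the four losses; the equal weights in the sketch below are illustrative only, since the patent leaves the specific values to be tuned:

```python
def total_loss(l1: float, l2: float, l3: float, l4: float,
               w=(0.25, 0.25, 0.25, 0.25)) -> float:
    """Equation 5: L = w1*L1 + w2*L2 + w3*L3 + w4*L4, weights summing to 1."""
    assert abs(sum(w) - 1.0) < 1e-9, "weights must sum to 1"
    return sum(wi * li for wi, li in zip(w, (l1, l2, l3, l4)))
```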
Further, the synthesis deviation value PL is analyzed as follows:
During digital human synthesis, the synthesis parameter data of the face synthesis module are collected at intervals of R2; the number of access-node connections, the CPU load rate, the bandwidth load rate, and the real-time network rate are denoted W1, W2, W3, and W4 in turn, and a synthesis coefficient YS of the face synthesis module is calculated as YS = (W1*b1 + W4*b4) / (W2*b2 + W3*b3), where b1, b2, b3, and b4 are coefficient factors;
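The YS formula above translates directly to code (parameter names follow the patent's W1–W4 and b1–b4):

```python
def synthesis_coefficient(w1: float, w2: float, w3: float, w4: float,
                          b1: float, b2: float, b3: float, b4: float) -> float:
    """YS = (W1*b1 + W4*b4) / (W2*b2 + W3*b3): W1 = access-node connections,
    W2 = CPU load rate, W3 = bandwidth load rate, W4 = real-time network rate;
    b1..b4 are the coefficient factors."""
    return (w1 * b1 + w4 * b4) / (w2 * b2 + w3 * b3)
```

A low YS (high load rates in the denominator relative to the throughput terms in the numerator) triggers the synthesis deviation signal described next.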
If the synthesis coefficient is less than or equal to a preset synthesis threshold, a synthesis deviation signal is fed back to the control center.
When the control center detects that a synthesis deviation signal has been generated, a countdown is started automatically from D1, where D1 is a preset value; the synthesis deviation signal continues to be monitored during the countdown stage, and if a new synthesis deviation signal is detected, the countdown automatically resets to its original value and counts down again from D1.
The number of synthesis deviation signals occurring during the countdown stage is counted as C1, and the length of the countdown stage is recorded as D1; the synthesis deviation value PL of the face synthesis module is then calculated from C1 and D1, where g1 and g2 are coefficient factors.
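The countdown-and-deviation logic can be sketched as follows. The exact PL formula is not reproduced in the source, so a linear combination of C1 and D1 with the coefficient factors g1 and g2 is assumed here:

```python
class DeviationMonitor:
    """Tracks synthesis deviation signals over a countdown window."""

    def __init__(self, d1: int, g1: float, g2: float):
        self.d1, self.g1, self.g2 = d1, g1, g2
        self.countdown = 0
        self.c1 = 0  # deviation signals seen in the countdown stage

    def sample(self, deviation: bool) -> None:
        """Process one collected synthesis coefficient."""
        if deviation:
            self.countdown = self.d1  # (re)start the countdown from D1
            self.c1 += 1
        elif self.countdown > 0:
            self.countdown -= 1      # one sample collected: count down by one

    def deviation_value(self) -> float:
        """Assumed PL form: g1*C1 + g2*D1 (hypothetical; see lead-in)."""
        return self.g1 * self.c1 + self.g2 * self.d1
```

PL would then be compared against the preset deviation threshold to decide whether to raise the early-warning signal.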
Compared with the prior art, the invention has the following beneficial effects:
1. The voice and lip synchronization module extracts lip-related features from the audio sequence and the face sequence respectively, calculates their cosine similarity, judges how well the lips of the synthesized face match the audio, and guides the optimization of the face synthesis module, improving the matching of lip shape and audio; the face rendering module is divided into three parts, of which the first and second directly calculate loss functions and the third judges whether a face is a real face, so the module judges the rendering fidelity of the face and guides the optimization of the face synthesis module, improving the rendering effect; the two modules interact, improving the quality of the synthesized digital human;
2. During digital human synthesis, the synthesis parameter data of the face synthesis module are monitored and a synthesis deviation value is analyzed: the synthesis parameter data are collected at intervals of R2 and a synthesis coefficient YS of the face synthesis module is calculated; if YS is less than or equal to a preset synthesis threshold, a synthesis deviation signal is fed back to the control center; when the control center detects the synthesis deviation signal, a countdown is started automatically; the number of synthesis deviation signals in the countdown stage is counted as C1 and the length of the countdown stage is recorded as D1, from which the synthesis deviation value PL of the face synthesis module is calculated; if PL exceeds a preset deviation threshold, this indicates that the processing capability of the face synthesis module is low, and a deviation early-warning signal is generated, reminding the administrator to switch to a new industrial computer for the computation and improving the precision and efficiency of digital human synthesis.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description show only some embodiments of the present invention; for those skilled in the art, other drawings can be obtained from these drawings without creative effort.
Fig. 1 is a technical architecture diagram of the digital human synthesis method based on multi-task learning according to the present invention.
Fig. 2 is a technical architecture diagram of the voice and lip synchronization module of the present invention.
Fig. 3 is a technical architecture diagram of the face rendering module of the present invention.
Fig. 4 is a technical architecture diagram of the face synthesis module of the present invention.
Detailed Description
The technical solutions of the present invention will be described clearly and completely below with reference to the embodiments. It should be understood that the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person skilled in the art from the given embodiments without creative effort fall within the protection scope of the present invention.
As shown in figs. 1 to 4, a digital human synthesis method based on multi-task learning includes the following steps:
Step one: acquire video data containing a speaker, perform the necessary preprocessing on the video data, and extract the speaker's face and audio sequences; face extraction may be performed with any face-detection model, and the audio sequence is extracted from the video with the ffmpeg tool;
Step two: preprocess the face sequence by masking (MASK) the lower half of each face, i.e. setting it to black; input the audio sequence and the masked face sequence into the face synthesis module to synthesize a face sequence;
Step three: the synthesized face sequence is optimized under the guidance of the voice and lip synchronization module and the face rendering module respectively, improving the quality of face synthesis;
The voice and lip synchronization module judges how well the lips of the synthesized face match the audio and guides the optimization of the face synthesis module, improving the matching of lip shape and audio;
The face rendering module judges the rendering fidelity of the face and guides the optimization of the face synthesis module, improving the rendering effect;
In this embodiment, the technical architecture of the voice and lip synchronization module is as shown in fig. 2, and its specific steps are as follows:
Lip-related features are extracted from the audio sequence and the face sequence respectively, and their cosine similarity is calculated; this score represents the degree to which the voice and the lip shape match.
The audio feature extractor is not limited to one network: it may be a backbone network in the RNN, CNN, or Transformer paradigm; the lip feature extractor is mainly a backbone network in the CNN or Transformer paradigm.
The training data of the voice and lip synchronization module are real face pictures and synthesized face pictures: in the training stage, a real face picture is a positive sample and is given label 1, while a synthesized face picture is a negative sample and is given label 0; the backbone network is optimized with a cross-entropy loss, calculated as shown in equation 1:

L1 = -(1/N) * Σ_i [ y_i * log(σ_i) + (1 - y_i) * log(1 - σ_i) ]    (1)

where L1 is the cross-entropy loss of the voice and lip synchronization module, y_i is the label of the i-th sample, and σ_i is the predicted value of the i-th sample.
In this embodiment, the technical architecture of the face rendering module is as shown in fig. 3. It is divided into three parts; the first and second parts directly calculate loss functions to guide the optimization of the face synthesis module. The pixel loss is calculated as shown in equation 2 and the perceptual loss as shown in equation 3:

L2 = ||I_0 - I_1||    (2)
L3 = ||φ(I_0) - φ(I_1)||    (3)

where I_0 is the real picture, I_1 is the synthesized picture, φ is a well-trained backbone network such as VGG19 or ResNet50, and φ(·) is the feature map of the last layer of the backbone network; L2 is the pixel loss and L3 is the perceptual loss.
The third part mainly judges whether a face is a real face; the face feature extractor is mainly a backbone network in the CNN or Transformer paradigm. The training data of this part are real face pictures and synthesized face pictures, where a real face picture has label 1 and a synthesized face picture has label 0; this part is optimized with a cross-entropy loss, calculated as shown in equation 4:

L4 = -(1/N) * Σ_i [ y_i * log(σ_i) + (1 - y_i) * log(1 - σ_i) ]    (4)
In this embodiment, the face synthesis module is configured to generate faces with lip shapes corresponding to the audio, given an audio sequence and a masked face sequence; the technical architecture of the whole module is shown in fig. 4.
The face synthesis module has a standard Encoder-Decoder structure: the Encoder is a stack of convolutional networks and the Decoder is a stack of deconvolutional (transposed-convolutional) networks.
The optimization of the face synthesis module is guided by the voice and lip synchronization module and the face rendering module, and the loss function consists of four parts: the cross-entropy loss of the voice and lip synchronization module, and the pixel loss, perceptual loss, and real-face-scoring cross-entropy loss of the face rendering module.
The loss function is shown in equation 5:

L = w1*L1 + w2*L2 + w3*L3 + w4*L4    (5)

where w1 + w2 + w3 + w4 = 1, and the specific values of the weights are adjusted according to the actual situation. The technical architecture of the invention is a typical GAN (generative adversarial network): the generator is the face synthesis module, and the discriminators are the voice and lip synchronization module and the face rendering module. Training therefore follows the GAN paradigm: the generator is fixed while the discriminators are trained, and the process iterates over multiple loops while the generator is trained. In practical application, only the face synthesis module is used to synthesize the digital human.
This embodiment further includes: during digital human synthesis, the synthesis parameter data of the face synthesis module are monitored and a synthesis deviation value is analyzed, specifically:
During digital human synthesis, the synthesis parameter data of the face synthesis module are collected at intervals of R2; the synthesis parameter data comprise the number of access-node connections, the CPU load rate, the bandwidth load rate, and the real-time network rate.
The number of access-node connections, the CPU load rate, the bandwidth load rate, and the real-time network rate are denoted W1, W2, W3, and W4 in turn; a synthesis coefficient YS of the face synthesis module is calculated as YS = (W1*b1 + W4*b4) / (W2*b2 + W3*b3), where b1, b2, b3, and b4 are coefficient factors.
The synthesis coefficient YS is compared with a preset synthesis threshold; if YS is less than or equal to the preset synthesis threshold, a synthesis deviation signal is fed back to the control center.
When the control center detects that a synthesis deviation signal has been generated, a countdown is started automatically from D1, where D1 is a preset value, for example D1 = 10; each time a new synthesis coefficient is collected, the countdown decreases by one. The synthesis deviation signal continues to be monitored during the countdown stage: if a new synthesis deviation signal is detected, the countdown automatically resets to its original value and counts down again from D1; otherwise the countdown reaches zero and stops.
The number of synthesis deviation signals occurring during the countdown stage is counted as C1, and the length of the countdown stage is recorded as D1; the synthesis deviation value PL of the face synthesis module is then calculated from C1 and D1, where g1 and g2 are coefficient factors.
The synthesis deviation value PL is compared with a preset deviation threshold; if PL exceeds the threshold, this indicates that the processing capability of the face synthesis module is low, and a deviation early-warning signal is generated, reminding the administrator to switch to a new industrial computer for the computation and improving the precision and efficiency of digital human synthesis.
All of the above formulas operate on dimensionless numerical values. Each formula was obtained by collecting a large amount of data and fitting through software simulation so as to best approximate real conditions; the preset parameters and thresholds in the formulas are set by those skilled in the art according to the actual situation or obtained by simulation over large amounts of data.
The working principle of the invention is as follows:
In operation, the digital human synthesis method based on multi-task learning acquires video data containing a speaker, performs the necessary preprocessing on the video data, and extracts the speaker's face and audio sequences; the face sequence is preprocessed by masking (MASK) the lower half of each face, i.e. setting it to black; the audio sequence and the masked face sequence are input into the face synthesis module to synthesize a face sequence; the synthesized face sequence is optimized under the guidance of the voice and lip synchronization module and the face rendering module respectively, improving the quality of face synthesis. The voice and lip synchronization module extracts lip-related features from the audio sequence and the face sequence respectively, calculates their cosine similarity, judges how well the lips of the synthesized face match the audio, and guides the optimization of the face synthesis module, improving the matching of lip shape and audio. The face rendering module is divided into three parts: the first and second parts directly calculate loss functions, and the third part mainly judges whether a face is a real face; the module judges the rendering fidelity of the face and guides the optimization of the face synthesis module, improving the rendering effect. The two modules interact, improving the quality of the synthesized digital human.
During digital human synthesis, the synthesis parameter data of the face synthesis module are monitored and a synthesis deviation value is analyzed: the synthesis parameter data are collected at intervals of R2 and a synthesis coefficient YS of the face synthesis module is calculated; if YS is less than or equal to a preset synthesis threshold, a synthesis deviation signal is fed back to the control center; when the control center detects the synthesis deviation signal, a countdown is started automatically; the number of synthesis deviation signals in the countdown stage is counted as C1 and the length of the countdown stage is recorded as D1, from which the synthesis deviation value PL of the face synthesis module is calculated; if PL exceeds a preset deviation threshold, this indicates that the processing capability of the face synthesis module is low, and a deviation early-warning signal is generated, reminding the administrator to switch to a new industrial computer for the computation and improving the precision and efficiency of digital human synthesis.
In the description herein, references to the description of "one embodiment," "an example," "a specific example" or the like are intended to mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
The preferred embodiments of the invention disclosed above are intended to be illustrative only. They are not intended to be exhaustive or to limit the invention to the precise forms disclosed; obviously, many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles of the invention and its practical application, thereby enabling others skilled in the art to best understand and utilize the invention. The invention is limited only by the claims and their full scope and equivalents.
Claims (8)
1. A digital human synthesis method based on multi-task learning, characterized by comprising the following steps:
Step one: acquiring video data containing a speaker, preprocessing the video data, and extracting the speaker's face and audio sequences; wherein face extraction may be performed with any face-detection model, and the audio sequence is extracted from the video with the ffmpeg tool;
Step two: preprocessing the face sequence: masking (MASK) the lower half of each face, i.e. setting it to black; inputting the audio sequence and the masked face sequence into a face synthesis module to synthesize a face sequence;
Step three: optimizing the synthesized face sequence under the guidance of a voice and lip synchronization module and a face rendering module respectively; wherein the voice and lip synchronization module judges how well the lips of the synthesized face match the audio and guides the optimization of the face synthesis module, and the face rendering module judges the rendering fidelity of the face and guides the optimization of the face synthesis module;
Step four: during digital human synthesis, monitoring the synthesis parameter data of the face synthesis module and analyzing a synthesis deviation value; if the synthesis deviation value PL exceeds a preset deviation threshold, indicating that the processing capability of the face synthesis module is low, and generating a deviation early-warning signal to remind the administrator to switch to a new industrial computer for the computation.
2. The digital human synthesis method based on multi-task learning according to claim 1, wherein the voice-lip synchronization module works as follows:
features are extracted from the audio sequence and lip features from the face sequence, and their cosine similarity is calculated; the cosine similarity represents the degree of matching between the voice and the lip shape;
wherein the audio feature extractor is a backbone network of the RNN, CNN or Transformer paradigm, and the lip feature extractor is a backbone network of the CNN or Transformer paradigm.
3. The method as claimed in claim 2, wherein the training data of the voice-lip synchronization module consist of real face pictures and synthesized face pictures; in the training phase, real face pictures are positive samples labeled 1 and synthesized face pictures are negative samples labeled 0; the backbone network is optimized with a cross entropy loss; the cross entropy loss calculation is shown in equation 1:
L1 = -(1/N) · Σ_i [ y_i · log(σ_i) + (1 − y_i) · log(1 − σ_i) ] - (1)

wherein L1 is the cross entropy loss of the voice-lip synchronization module, y_i is the label of the i-th sample, σ_i is the predicted value for the i-th sample, and N is the number of samples.
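The binary cross entropy of equation 1 over labels y_i and predictions σ_i can be computed as below; the clipping epsilon is an assumption added to avoid log(0), not part of the patent text:

```python
import numpy as np

def bce_loss(y, sigma, eps=1e-12):
    """Binary cross entropy (equation 1): labels y_i in {0, 1},
    predicted values sigma_i in [0, 1]."""
    y = np.asarray(y, dtype=float)
    s = np.clip(np.asarray(sigma, dtype=float), eps, 1.0 - eps)  # guard log(0)
    return float(-np.mean(y * np.log(s) + (1 - y) * np.log(1 - s)))
```

For a well-classified batch ([1, 0] predicted as [0.9, 0.1]) the loss equals -log(0.9) ≈ 0.105.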
4. The digital human synthesis method based on multi-task learning, wherein the face rendering module is divided into three parts: the first and second parts directly calculate loss functions that guide the optimization of the face synthesis module; the third part judges whether a face is a real face; the face feature extractor is mainly a backbone network of the CNN or Transformer paradigm;
wherein the pixel loss calculation is shown in equation 2, and the perceptual loss calculation is shown in equation 3;
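Equations 2 and 3 are not reproduced in this text, so the sketch below assumes their standard forms: a mean absolute pixel difference for the pixel loss, and a feature-space L2 distance (through the face feature extractor φ) for the perceptual loss. Both the exact norms and the `feature_fn` interface are assumptions.

```python
import numpy as np

def pixel_loss(real: np.ndarray, fake: np.ndarray) -> float:
    """Assumed form of equation 2: mean absolute difference
    between a real frame and a synthesized frame."""
    return float(np.mean(np.abs(real - fake)))

def perceptual_loss(real: np.ndarray, fake: np.ndarray, feature_fn) -> float:
    """Assumed form of equation 3: mean squared distance between
    deep features of the real and synthesized frames, where
    feature_fn stands in for the face feature extractor."""
    return float(np.mean((feature_fn(real) - feature_fn(fake)) ** 2))
```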
5. The method according to claim 4, wherein the training data of the face rendering module are real face pictures and synthesized face pictures, the real face pictures having the label 1 and the synthesized face pictures the label 0; the backbone network is optimized by a cross entropy loss; the cross entropy loss calculation is shown in equation 4:

L4 = -(1/N) · Σ_i [ y_i · log(σ_i) + (1 − y_i) · log(1 − σ_i) ] - (4)
6. The digital human synthesis method based on multi-task learning according to claim 1, wherein the face synthesis module generates, from a given audio sequence and masked face sequence, a face whose lip shape corresponds to the audio; the face synthesis module has a standard Encoder-Decoder structure, the Encoder being formed by stacking convolutional networks and the Decoder by stacking deconvolutional (transposed convolution) networks.
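The Encoder-Decoder shape symmetry of claim 6 can be illustrated with a toy numpy stand-in, where stride-2 subsampling plays the role of a stacked convolutional Encoder and nearest-neighbour upsampling plays the role of the deconvolutional Decoder. This only demonstrates the spatial down/up-sampling pattern; it contains no learned weights and is not the patent's network.

```python
import numpy as np

def encode(x: np.ndarray) -> list:
    """Toy Encoder: repeated stride-2 subsampling stands in for
    stacked convolutions, halving the spatial size at each stage."""
    feats = [x]
    while feats[-1].shape[0] > 1:
        feats.append(feats[-1][::2, ::2])
    return feats  # feats[-1] is the bottleneck

def decode(z: np.ndarray, target_shape: tuple) -> np.ndarray:
    """Toy Decoder: nearest-neighbour upsampling stands in for
    deconvolutions, doubling the spatial size back to the input."""
    out = z
    while out.shape[0] < target_shape[0]:
        out = np.repeat(np.repeat(out, 2, axis=0), 2, axis=1)
    return out
```

An 8x8 input is reduced to a 1x1 bottleneck and restored to 8x8, mirroring the symmetric conv/deconv stacks.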
7. The digital human synthesis method based on multi-task learning according to claim 5, wherein the loss function consists of four parts: the cross entropy loss of the voice-lip synchronization module, the pixel loss of the face rendering module, the perceptual loss, and the cross entropy loss of the real face score; the loss function is shown in equation 5:
L = w1·L1 + w2·L2 + w3·L3 + w4·L4 - (5)
wherein w1 + w2 + w3 + w4 = 1.
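Equation 5's weighted combination is a direct convex sum; a minimal sketch (the default equal weights are an assumption, the patent only requires that the weights sum to 1):

```python
def total_loss(l1: float, l2: float, l3: float, l4: float,
               w=(0.25, 0.25, 0.25, 0.25)) -> float:
    """Equation 5: L = w1*L1 + w2*L2 + w3*L3 + w4*L4 with sum(w) == 1."""
    assert abs(sum(w) - 1.0) < 1e-9, "weights must sum to 1"
    return w[0] * l1 + w[1] * l2 + w[2] * l3 + w[3] * l4
```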
8. The digital human synthesis method based on multi-task learning according to claim 1, wherein the specific analysis process of the synthesis deviation value PL is as follows:
during digital human synthesis, the synthesis parameter data of the face synthesis module are collected at intervals of R2; the number of access node connections, the CPU load rate, the bandwidth load rate and the real-time network rate are denoted W1, W2, W3 and W4 in sequence; the synthesis coefficient YS of the face synthesis module is calculated using the formula YS = (W1·b1 + W4·b4) / (W2·b2 + W3·b3), wherein b1, b2, b3 and b4 are coefficient factors;
if the synthesis coefficient is less than or equal to a preset synthesis threshold, a synthesis deviation signal is fed back to the control center;
when the control center detects a synthesis deviation signal, it automatically starts a countdown of length D1, where D1 is a preset value; the synthesis deviation signal continues to be monitored during the countdown, and if a new synthesis deviation signal is detected, the countdown is reset and restarted from D1;
the number of synthesis deviation signals occurring during the countdown phase is counted as C1, and the length of the countdown phase is D1; from these, the synthesis deviation value PL of the face synthesis module is calculated using a formula, wherein g1 and g2 are coefficient factors.
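The monitoring logic of claim 8 can be sketched as below. The YS formula is taken directly from the claim; the class name, the discrete `tick` interface, and the default coefficient factors are assumptions. The patent's formula for PL itself is not reproduced in the text, so the sketch only collects the inputs (C1 and D1) it would use.

```python
def synthesis_coefficient(W1: float, W2: float, W3: float, W4: float,
                          b=(1.0, 1.0, 1.0, 1.0)) -> float:
    """YS = (W1*b1 + W4*b4) / (W2*b2 + W3*b3), as stated in claim 8.
    Default coefficient factors b1..b4 = 1 are illustrative only."""
    return (W1 * b[0] + W4 * b[3]) / (W2 * b[1] + W3 * b[2])

class DeviationMonitor:
    """Countdown of length D1; every new deviation signal resets the
    countdown, and C1 counts signals seen during the countdown phase."""
    def __init__(self, D1: int):
        self.D1 = D1
        self.remaining = 0
        self.C1 = 0

    def tick(self, deviation_signal: bool) -> None:
        if deviation_signal:
            self.C1 += 1
            self.remaining = self.D1  # reset the countdown to D1
        elif self.remaining > 0:
            self.remaining -= 1
```

For example, with W1=10, W2=2, W3=2, W4=10 and unit factors, YS = 20/4 = 5.0.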
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211397710.3A | 2022-11-09 | 2022-11-09 | Digital human synthesis method based on multi-task learning |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115601230A true CN115601230A (en) | 2023-01-13 |
Family
ID=84853528
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113378697A (en) * | 2021-06-08 | 2021-09-10 | 安徽大学 | Method and device for generating speaking face video based on convolutional neural network |
CN114119268A (en) * | 2022-01-24 | 2022-03-01 | 科大智能物联技术股份有限公司 | Collaborative manufacturing system for printing and packaging production line |
CN114266944A (en) * | 2021-12-23 | 2022-04-01 | 安徽中科锟铻量子工业互联网有限公司 | Rapid model training result checking system |
CN114488092A (en) * | 2022-01-26 | 2022-05-13 | 安徽科创中光科技股份有限公司 | Carrier-to-noise ratio processing method of coherent wind measurement laser radar |
CN115202288A (en) * | 2022-07-18 | 2022-10-18 | 马鞍山经纬回转支承有限公司 | System and method for solving risk of switch type magnetic sucker |
CN115309475A (en) * | 2022-08-10 | 2022-11-08 | 蚌埠依爱消防电子有限责任公司 | Picture and sound rapid loading method for emergency evacuation display system |
Non-Patent Citations (1)
Title |
---|
李雨思 (Li Yusi): "Design and Implementation of Audio-Driven Video Generation" * |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108447474B (en) | Modeling and control method for synchronizing virtual character voice and mouth shape | |
Cao et al. | Expressive speech-driven facial animation | |
CN109325437B (en) | Image processing method, device and system | |
Morishima et al. | A media conversion from speech to facial image for intelligent man-machine interface | |
Sifakis et al. | Simulating speech with a physics-based facial muscle model | |
DE60101540T2 (en) | Method of animating an artificial model of a human face using acoustic signals | |
CN110751708B (en) | Method and system for driving face animation in real time through voice | |
US20020024519A1 (en) | System and method for producing three-dimensional moving picture authoring tool supporting synthesis of motion, facial expression, lip synchronizing and lip synchronized voice of three-dimensional character | |
CN111325817A (en) | Virtual character scene video generation method, terminal device and medium | |
CN106653052A (en) | Virtual human face animation generation method and device | |
CN110910479B (en) | Video processing method, device, electronic equipment and readable storage medium | |
CN116528019B (en) | Virtual human video synthesis method based on voice driving and face self-driving | |
Mattos et al. | Improving CNN-based viseme recognition using synthetic data | |
CN113689436A (en) | Image semantic segmentation method, device, equipment and storage medium | |
CN116597857A (en) | Method, system, device and storage medium for driving image by voice | |
CN115953521A (en) | Remote digital human rendering method, device and system | |
CN115270184A (en) | Video desensitization method, vehicle video desensitization method and vehicle-mounted processing system | |
CN117528135A (en) | Speech-driven face video generation method and device, electronic equipment and medium | |
CN115601230A (en) | Digital human synthesis method based on multi-task learning | |
CN112002005A (en) | Cloud-based remote virtual collaborative host method | |
Beskow et al. | Data-driven synthesis of expressive visual speech using an MPEG-4 talking head. | |
Huang et al. | Visual speech emotion conversion using deep learning for 3D talking head | |
Morishima et al. | Facial expression synthesis based on natural voice for virtual face-to-face communication with machine | |
CN114494813B (en) | Dense cross attention-based index expression generation method | |
CN117150089B (en) | Character artistic image changing system based on AIGC technology |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||