CN114245215A - Method, device, electronic equipment, medium and product for generating speaking video


Info

Publication number
CN114245215A
CN114245215A, CN202111404955.XA, CN202111404955A
Authority
CN
China
Prior art keywords
emotion
sequence
sample
speaking
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111404955.XA
Other languages
Chinese (zh)
Other versions
CN114245215B (en)
Inventor
刘永进
叶子鹏
温玉辉
孙志尧
常亚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Huawei Technologies Co Ltd
Original Assignee
Tsinghua University
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University, Huawei Technologies Co Ltd filed Critical Tsinghua University
Priority to CN202111404955.XA priority Critical patent/CN114245215B/en
Publication of CN114245215A publication Critical patent/CN114245215A/en
Application granted granted Critical
Publication of CN114245215B publication Critical patent/CN114245215B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/45Management operations performed by the client for facilitating the reception of or the interaction with the content or administrating data related to the end-user or to the client device itself, e.g. learning user preferences for recommending movies, resolving scheduling conflicts
    • H04N21/466Learning process for intelligent management, e.g. learning user preferences for recommending movies
    • H04N21/4662Learning process for intelligent management, e.g. learning user preferences for recommending movies characterized by learning algorithms
    • H04N21/4666Learning process for intelligent management, e.g. learning user preferences for recommending movies characterized by learning algorithms using neural networks, e.g. processing the feedback provided by the user
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/02Feature extraction for speech recognition; Selection of recognition unit
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/06Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063Training
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/08Speech classification or search
    • G10L15/16Speech classification or search using artificial neural networks
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/24Speech recognition using non-acoustical features
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/81Monomedia components thereof
    • H04N21/8146Monomedia components thereof involving graphical data, e.g. 3D object, 2D graphics
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/81Monomedia components thereof
    • H04N21/816Monomedia components thereof involving special video data, e.g 3D video

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Signal Processing (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Biomedical Technology (AREA)
  • Computer Graphics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Child & Adolescent Psychology (AREA)
  • Hospice & Palliative Care (AREA)
  • Psychiatry (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The invention provides a method, a device, electronic equipment, a medium and a product for generating a speaking video.

Description

Method, device, electronic equipment, medium and product for generating speaking video
Technical Field
The present invention relates to the field of machine learning technologies, and in particular, to a method, an apparatus, an electronic device, a medium, and a product for generating a speaking video.
Background
Speaking video generation refers to the process of generating a video of a target person speaking in synchronization with given audio, based on that audio and on visual information (such as images or videos) of the target person. Such audio-driven talking-video generation technology has been widely applied in fields such as virtual avatars and virtual anchors.
However, existing speaking-video generation schemes can only generate speaking videos with neutral emotion: in the generated video the person's lip shape changes, but the facial expression does not. As a result, the obtained speaking video cannot intuitively express the emotional changes of the target person and is not realistic enough.
Disclosure of Invention
The invention provides a method, an apparatus, an electronic device, a medium and a product for generating a speaking video, which are used to overcome the defects in the prior art that a speaking video cannot intuitively express a person's emotional changes and is not realistic enough.
In a first aspect, the present invention provides a method for generating a speaking video, the method comprising:
acquiring a speaking audio, an emotion label sequence and a face background sequence of a target person;
carrying out feature extraction on the speaking audio to obtain audio features;
inputting the audio features and the emotion label sequence into an emotion voice model to obtain a human face model sequence output by the emotion voice model; the emotion voice model is obtained by training a neural network based on an audio feature sample, an emotion label sequence sample and a corresponding face model sequence sample;
inputting the human face model sequence, the emotion label sequence and the human face background sequence into a neural rendering model to obtain a video frame sequence output by the neural rendering model; the neural rendering model is obtained by training a neural network based on a face model sequence sample, an emotion label sequence sample and a corresponding video frame sequence sample;
and synthesizing the video frame sequence and the speaking audio to generate the speaking video of the target person.
According to the method for generating the speaking video, the training process of the emotion voice model comprises the following steps:
acquiring a speaking video sample set; the speaking video sample set comprises a plurality of speaking video samples and corresponding emotion label sequence samples, and each speaking video sample comprises a single face;
extracting a speaking audio sample from the speaking video sample, and performing feature extraction on the speaking audio sample to obtain an audio feature sample;
extracting a plurality of video frame samples from the speaking video sample, and carrying out face reconstruction on each frame of video frame sample to generate a face model corresponding to each frame of video frame sample and form a face model sequence sample;
and training a pre-constructed neural network through the emotion label sequence sample, the audio characteristic sample and the face model sequence sample to obtain an emotion voice model.
According to the method for generating the speaking video, the emotion voice model comprises the following steps:
the voice recognition layer is used for extracting deep voice features from the audio features;
the emotion transcoding layer is used for extracting deep emotion characteristics corresponding to the emotion labels from the emotion label sequence;
and the mixed code conversion layer is used for integrating the deep emotion characteristics and the deep voice characteristics corresponding to the emotion labels respectively and outputting a human face model sequence.
Because the human face model sequence output by the emotion voice model is obtained by mixing the deep voice characteristic and the deep emotion characteristic, the three-dimensional information of the human face of the target person can be reflected more truly.
According to the method for generating the speaking video, the neural rendering model comprises the following steps:
the texture generation layer is used for respectively calculating first nerve texture information corresponding to each emotion label in the emotion label sequence;
the texture sampling layer is used for respectively sampling the first nerve texture information corresponding to each emotion label to a screen space according to the human face model sequence to obtain second nerve texture information corresponding to each emotion label;
the tooth refinement layer is used for respectively refining tooth areas in the second nerve texture information corresponding to the emotion labels to obtain third nerve texture information corresponding to the emotion labels;
the background extraction layer is used for extracting the characteristics of each background frame in the face background sequence to obtain the background frame characteristics corresponding to each background frame;
and the nerve rendering layer is used for fusing the third nerve texture information corresponding to each emotion label with the background frame characteristics corresponding to the corresponding background frame respectively to generate a video frame sequence.
According to a method for generating a speech video provided by the present invention, the texture generation layer includes:
the emotion transcoding sublayer is used for converting each emotion label into a dynamic neural texture weight;
and the calculation sublayer is used for multiplying the dynamic nerve texture weight corresponding to each emotion label with a preset dynamic nerve texture base to obtain first nerve texture information corresponding to each emotion label.
According to the method for generating the speaking video, the tooth refinement layer comprises:
the interception sublayer is used for determining the position of teeth in the second nerve texture information according to a preset tooth mask image, performing affine transformation on the second nerve texture information and intercepting to obtain the nerve texture of the tooth area;
the completion sublayer is used for performing characteristic completion on the nerve textures of the tooth area by using corresponding emotion labels to obtain complete tooth characteristics corresponding to the emotion labels;
and the thinning sublayer is used for respectively fusing the complete tooth characteristics corresponding to the emotion labels with the second nerve texture information to obtain third nerve texture information corresponding to each emotion label.
Because the neural rendering model takes both the neural texture information and the tooth characteristics of the face into account during data processing, the facial expression in the finally output video frame sequence is more natural and can intuitively reflect the emotional changes of the target person, resulting in greater realism.
In a second aspect, the present invention further provides an apparatus for generating a speaking video, the apparatus comprising:
the acquisition module is used for acquiring the speaking audio, the emotion label sequence and the face background sequence of the target person;
the first processing module is used for extracting the characteristics of the speaking audio to obtain audio characteristics;
the second processing module is used for inputting the audio features and the emotion label sequence into an emotion voice model to obtain a human face model sequence output by the emotion voice model; the emotion voice model is obtained by training a neural network based on an audio feature sample, an emotion label sequence sample and a corresponding face model sequence sample;
the third processing module is used for inputting the human face model sequence, the emotion label sequence and the human face background sequence into a neural rendering model to obtain a video frame sequence output by the neural rendering model; the neural rendering model is obtained by training a neural network based on a face model sequence sample, an emotion label sequence sample and a corresponding video frame sequence sample;
and the fourth processing module is used for synthesizing the video frame sequence and the speaking audio to generate the speaking video of the target person.
In a third aspect, the present invention further provides an electronic device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the computer program to implement the steps of the method for generating a speaking video.
In a fourth aspect, the present invention also provides a non-transitory computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, implements the steps of the method for generating a speaking video as described in any one of the above.
In a fifth aspect, the present invention also provides a computer program product comprising a computer program, which when executed by a processor, implements the steps of the method for generating speaking video according to any one of the above.
According to the method, apparatus, electronic device, medium and product for generating a speaking video provided by the present invention, the emotion voice model obtains the corresponding face model sequence from the audio features of the speaking audio and the emotion label sequence, the neural rendering model obtains the video frame sequence from the face model sequence, the emotion label sequence and the face background sequence, and finally the speaking video of the target person is obtained by synthesizing the video frame sequence with the speaking audio.
Drawings
In order to more clearly illustrate the technical solutions of the present invention or the prior art, the drawings needed for the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.
FIG. 1 is a flow chart of a method for generating a speaking video according to the present invention;
FIG. 2 is a schematic diagram of the data processing principle of the emotional speech model and the neural rendering model;
FIG. 3 is a schematic diagram of the data processing principle of the tooth refinement layer;
FIG. 4 is a schematic structural diagram of a device for generating speaking video provided by the present invention;
fig. 5 is a schematic structural diagram of an electronic device provided in the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The following describes an implementation flow of a generating method of a speaking video provided by an embodiment of the invention with reference to fig. 1 to fig. 3.
Fig. 1 shows a method for generating a speaking video according to an embodiment of the present invention, where the method includes:
step 110: and acquiring the speaking audio, emotion label sequence and face background sequence of the target person.
It can be understood that the speaking audio may be an audio file recording the speech of the target person; the emotion label sequence comprises a plurality of emotion labels of the target person, such as labels representing emotional states like happy or sad; and the face background sequence comprises a plurality of face background frames of the target person, where a face background frame is the remaining part of a video frame containing a normal face image after the face and tooth regions have been removed.
Step 120: carrying out feature extraction on the speaking audio to obtain audio features.
The audio features in this embodiment are mel-spectrogram features of the audio file; in addition, to ensure the accuracy of subsequent data processing, the extracted mel-spectrogram features should have the same length as the emotion label sequence.
Step 130: inputting the audio features and the emotion label sequence into an emotion voice model to obtain a human face model sequence output by the emotion voice model; the emotion voice model is obtained by training a neural network based on the audio feature sample, the emotion label sequence sample and the corresponding face model sequence sample.
Specifically, the training process of the emotion speech model in this embodiment may include:
firstly, acquiring a speaking video sample set; the speaking video sample set comprises a plurality of speaking video samples and corresponding emotion label sequence samples, and each speaking video sample comprises a single face.
It should be noted that, to ensure the sample data has high reference value, each speaking video sample in the speaking video sample set in this embodiment contains only a single face, and each speaking video corresponds to one emotion label.
And then, extracting a speaking audio sample from the speaking video sample, and performing feature extraction on the speaking audio sample to obtain an audio feature sample.
This process can be understood as preprocessing of the speaking video sample. In this embodiment, mel-spectrogram feature samples are used as the audio feature samples. When extracting mel-spectrogram features from a speaking audio sample, the window size must equal the duration occupied by one video frame of the speaking video sample; that is, if the frame rate of the speaking video sample is k fps, the window size of the mel-spectrogram features is 1/k seconds.
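As an illustrative sketch of this preprocessing step (the library calls, sample rate and frame rate below are assumptions for illustration, not values fixed by this embodiment), the mel-spectrogram window and hop can be matched to the video frame duration as follows:

```python
import librosa

def extract_mel_features(audio_path, video_fps=25, n_mels=80):
    """Extract mel-spectrogram features whose window (and hop) length equals
    the duration of one video frame, i.e. 1/k seconds for a k-fps video."""
    y, sr = librosa.load(audio_path, sr=16000)      # assumed sample rate
    win_length = int(sr / video_fps)                # 1/k seconds, in samples
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=2048, win_length=win_length,
        hop_length=win_length, n_mels=n_mels)
    return librosa.power_to_db(mel).T               # shape: (num_frames, n_mels)
```

With the hop length equal to the window length, each row of the result corresponds to exactly one video frame, which matches the one-to-one correspondence between mel-spectrogram features and face models described below.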
Then, a plurality of video frame samples are extracted from the speaking video samples, face reconstruction is carried out on each frame of video frame sample, a face model corresponding to each frame of video frame sample is generated, and a face model sequence sample is formed.
This process can also be understood as preprocessing of the speaking video sample, and mainly obtains the face model corresponding to each video frame. The face model is a parameterized three-dimensional face model that comprises the identity, expression and pose coefficients of the person, where the pose coefficient represents the pose of the person's head.
It can be understood that the mel-spectrogram feature samples obtained by the two preprocessing steps correspond one-to-one with the face models in the face model sequence samples; if the video frame rate is k fps, there are k one-to-one pairs of mel-spectrogram features and face models per second.
And finally, training the pre-constructed neural network through the emotion label sequence sample, the audio characteristic sample and the face model sequence sample to obtain an emotion voice model.
The above process of model training can be understood as a process of training a deep neural network model from an emotion tag sequence and audio features to a face model sequence using emotion tags, audio features and a reconstructed three-dimensional face model.
Specifically, the emotional voice model in this embodiment may include:
the voice recognition layer is used for extracting deep voice features from the audio features;
the emotion transcoding layer is used for extracting deep emotion characteristics corresponding to the emotion labels from the emotion label sequence;
and the mixed transcoding layer is used for integrating the deep emotion characteristics and the deep voice characteristics corresponding to the emotion labels respectively, namely mixing the deep voice characteristics and the deep emotion characteristics and outputting a human face model sequence.
Referring to fig. 2, when data is processed with the emotion voice model, the audio features extracted from the driving audio (i.e., the speaking audio) are first input into a speech sub-network (i.e., the voice recognition layer) to extract the deep voice features in the audio features; at the same time, the emotion transcoding network T1 (i.e., the emotion transcoding layer) extracts the deep emotion features from the emotion labels. The deep voice features and deep emotion features are combined by audio-expression encoding and input into the hybrid transcoding network T3 (i.e., the hybrid transcoding layer), which outputs the coefficients of a 3D Morphable Model (3DMM); these coefficients can be used to directly compute a three-dimensional face model.
In other words, the deep voice features and the deep emotion features are mixed to directly produce a coefficient sequence of the parameterized face model, and the input emotion label sequence has the same length as, and corresponds one-to-one with, the output coefficient sequence. Each face model in the face model sequence is computed from the corresponding parameterized face model coefficients (the 3DMM coefficients in fig. 2); specifically, the parameterized face model coefficients are multiplied with the face model basis matrix to obtain the corresponding face model, thereby yielding the face model sequence.
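A minimal sketch of this coefficient-to-mesh computation, assuming a linear 3DMM with identity and expression bases (the variable names and shapes are illustrative assumptions, not the specific parameterization used by the embodiment):

```python
import numpy as np

def coeffs_to_face_model(id_coeff, exp_coeff, mean_shape, id_basis, exp_basis):
    """Multiply the parameterized face model coefficients with the face model
    basis matrices to obtain one three-dimensional face model.

    mean_shape: (3N,)       mean face vertices, flattened
    id_basis:   (3N, n_id)  identity basis
    exp_basis:  (3N, n_exp) expression basis
    """
    verts = mean_shape + id_basis @ id_coeff + exp_basis @ exp_coeff
    return verts.reshape(-1, 3)     # (N, 3) vertex positions

def coeff_sequence_to_models(coeff_seq, mean_shape, id_basis, exp_basis):
    """Apply the mapping frame by frame to obtain the face model sequence."""
    return [coeffs_to_face_model(c["id"], c["exp"], mean_shape, id_basis, exp_basis)
            for c in coeff_seq]
```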
In this embodiment, the emotion voice model is trained end-to-end with supervision. During training, the loss function is the three-dimensional face reconstruction loss: the difference between the three-dimensional face model output by the emotion voice model and the three-dimensional face model reconstructed before training is computed, and the 2-norm of this difference is taken as the loss.
It can be understood that the three-dimensional face model output by the emotion voice model and the reconstructed three-dimensional face model have the same topology, so their difference is a vector subtraction, and the 2-norm of the difference is the 2-norm of that vector.
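A minimal sketch of this reconstruction loss, assuming both face models share the same topology and are represented as vertex tensors (PyTorch is used here purely for illustration):

```python
import torch

def face_reconstruction_loss(pred_verts, target_verts):
    """2-norm of the difference between the face model output by the emotion
    voice model and the face model reconstructed before training.
    Both tensors have shape (batch, num_vertices, 3) and share one topology."""
    diff = pred_verts - target_verts                          # vector subtraction
    return torch.norm(diff.reshape(diff.shape[0], -1), dim=1).mean()
```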
Step 140: inputting the human face model sequence, the emotion label sequence and the human face background sequence into a neural rendering model to obtain a video frame sequence output by the neural rendering model; the neural rendering model is obtained by training a neural network based on a human face model sequence sample, an emotion label sequence sample and a corresponding video frame sequence sample.
Specifically, the neural rendering model in this embodiment may include:
the texture generation layer is used for respectively calculating first nerve texture information corresponding to each emotion label in the emotion label sequence;
the texture sampling layer is used for respectively sampling the first nerve texture information corresponding to each emotion label to a screen space according to the face model sequence to obtain second nerve texture information corresponding to each emotion label, and the process can be understood as a process of fusing and mapping the geometric information and the texture information of each face model in the face model sequence with the first nerve texture information to the screen space;
the tooth refinement layer is used for respectively refining tooth areas in the second nerve texture information corresponding to the emotion labels to obtain third nerve texture information corresponding to the emotion labels;
the background extraction layer is used for extracting the characteristics of each background frame in the face background sequence to obtain the background frame characteristics corresponding to each background frame;
and the nerve rendering layer is used for fusing the third nerve texture information corresponding to each emotion label with the background frame characteristics corresponding to the corresponding background frame respectively to generate a video frame sequence.
It can be understood that each emotion label in the emotion label sequence contains not only an emotional-state word, such as "happy", but also corresponding emotion encoding information, which is the linear expression code of that emotional state, as shown in fig. 2. In essence, the emotion encoding information is a fixed-length string of numbers; for example, 0000300 is the linear expression code corresponding to "happy". Both the emotion voice model and the neural rendering model can directly recognize and process the emotion encoding information corresponding to the emotion labels.
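A minimal sketch of how such emotion labels could be turned into fixed-length numeric codes for the two models; the lookup table below is hypothetical except for the "happy" example quoted above.

```python
import numpy as np

# Hypothetical lookup table mapping emotional-state words to fixed-length
# linear expression codes; only the "happy" entry is taken from the description,
# the others are placeholders for illustration.
EMOTION_CODES = {
    "neutral": "0000000",
    "happy":   "0000300",
    "sad":     "0030000",
}

def emotion_label_to_code(label):
    """Convert an emotion label into a numeric vector that the emotion voice
    model and the neural rendering model can process directly."""
    return np.array([int(c) for c in EMOTION_CODES[label]], dtype=np.float32)

emotion_label_sequence = ["happy", "happy", "sad"]
codes = np.stack([emotion_label_to_code(t) for t in emotion_label_sequence])  # (T, 7)
```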
Further, the texture generation layer may specifically include:
the emotion transcoding sublayer is used for respectively converting each emotion label into a dynamic nerve texture weight, and specifically converting the linear expression code corresponding to each emotion label into the dynamic nerve texture weight;
and the calculation sublayer is used for multiplying the dynamic nerve texture weight corresponding to each emotion label with a preset dynamic nerve texture base to obtain first nerve texture information corresponding to each emotion label.
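A minimal sketch of this texture generation layer, assuming the dynamic neural texture basis is a learnable stack of basis textures and the emotion code is a short vector (all shapes are illustrative assumptions):

```python
import torch
import torch.nn as nn

class TextureGenerationLayer(nn.Module):
    """Emotion transcoding sublayer + calculation sublayer:
    emotion code -> dynamic neural texture weight -> first neural texture."""
    def __init__(self, code_dim=7, num_basis=8, channels=16, size=256):
        super().__init__()
        self.to_weight = nn.Linear(code_dim, num_basis)        # emotion transcoding sublayer
        self.texture_basis = nn.Parameter(                     # preset dynamic neural texture basis
            torch.randn(num_basis, channels, size, size) * 0.01)

    def forward(self, emotion_code):                           # (batch, code_dim)
        w = self.to_weight(emotion_code)                       # dynamic neural texture weight
        # calculation sublayer: weighted sum of the basis textures
        return torch.einsum("bk,kchw->bchw", w, self.texture_basis)
```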
In an exemplary embodiment, the texture sampling layer may implement a process of sampling the first nerve texture information based on the face model sequence in a rasterization manner, and the sampled nerve texture has a nonzero value in a face model region and a zero value outside the face model region, so that the nerve texture information in the face model region is more prominent.
Referring to fig. 2, when the neural rendering model processes data, the emotion transcoding network T2 (i.e., the emotion transcoding sublayer) first obtains the dynamic neural texture weight corresponding to the current emotion label, and the dynamic neural texture (i.e., the first neural texture information) is obtained from the dynamic neural texture basis. The first neural texture information is then sampled to screen space to obtain the sampled neural texture (i.e., the second neural texture information). Next, based on the linear expression code and the tooth mask image, the tooth sub-module refines the tooth area of the second neural texture information to obtain the complete neural texture (i.e., the third neural texture information). Finally, neural rendering is performed based on the background features extracted from the background frame and the complete neural texture, producing a photo-realistic output frame and thereby the speaking video of the target person.
It can be understood that the neural rendering module outputs a color mask C^(t) and an attention mask A^(t). Let the background frame be B^(t); the output frame R^(t) is then the weighted sum of the input background frame and the color mask, i.e.:
R^(t) = A^(t) · B^(t) + (1 - A^(t)) · C^(t)    (1)
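A minimal sketch of equation (1), assuming the attention mask, background frame and color mask are image tensors of compatible shapes:

```python
import torch

def compose_output_frame(attention_mask, background_frame, color_mask):
    """Equation (1): R(t) = A(t) * B(t) + (1 - A(t)) * C(t).
    attention_mask is expected in [0, 1]; background_frame and color_mask have
    shape (B, C, H, W), and the mask may be (B, 1, H, W) and broadcast."""
    return attention_mask * background_frame + (1.0 - attention_mask) * color_mask
```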
in an exemplary embodiment, the tooth refinement layer may specifically include:
the interception sublayer is used for determining the position of the tooth in the second nerve texture information according to a preset tooth mask image, performing affine transformation on the second nerve texture information and intercepting to obtain the nerve texture of the tooth area;
the completion sublayer is used for performing characteristic completion on the nerve textures of the tooth area by using corresponding emotion labels to obtain complete tooth characteristics corresponding to the emotion labels;
and the thinning sublayer is used for respectively fusing the complete tooth characteristics corresponding to the emotion labels with the second nerve texture information to obtain third nerve texture information corresponding to each emotion label.
Fig. 3 shows a data processing procedure of the tooth refinement layer, which first calculates affine transformation data based on the tooth mask image, and performs affine transformation and truncation processing on the sampled nerve texture (i.e., second nerve texture information) through the affine transformation data to obtain the nerve texture of the tooth region. The affine transformation data obtained by the above calculation includes information such as the center of the affine transformation, which may be the centroid of the tooth mask, and the scaling, which may be the ratio of the bounding box size of the tooth to the picture size.
Then, feature completion is performed on the neural texture of the tooth area using the corresponding emotion label, to obtain the complete tooth features corresponding to that emotion label.
Finally, the sampled neural texture is combined with the complete tooth features to obtain the complete neural texture (i.e., the third neural texture information).
In an exemplary embodiment, the inverse of the affine transformation of the tooth area may be used to map the tooth features back and combine them with the sampled neural texture to obtain the complete neural texture. Specifically, the combination of the tooth features and the sampled neural texture may be a channel-wise superposition, and in practical applications the inverse affine transformation takes zeros at undefined places.
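A minimal sketch of this crop-complete-paste flow around the tooth area; an axis-aligned bounding-box crop derived from the tooth mask stands in for the general affine transform described above, and the completion sublayer is passed in as a placeholder function (everything here is an illustrative assumption):

```python
import numpy as np

def tooth_refine(sampled_texture, tooth_mask, complete_fn):
    """Crop the tooth area of the sampled neural texture, complete its features,
    map them back (zeros at undefined places) and superpose them channel-wise.

    sampled_texture: (C, H, W) second neural texture information
    tooth_mask:      (H, W) binary mask locating the tooth area
    complete_fn:     hypothetical completion sublayer, (C, h, w) -> (C, h, w)
    """
    ys, xs = np.nonzero(tooth_mask)
    y0, y1 = ys.min(), ys.max() + 1                 # tooth bounding box
    x0, x1 = xs.min(), xs.max() + 1
    tooth_texture = sampled_texture[:, y0:y1, x0:x1]          # cropped tooth region
    tooth_features = complete_fn(tooth_texture)               # completion sublayer

    pasted = np.zeros_like(sampled_texture)                   # zeros where undefined
    pasted[:, y0:y1, x0:x1] = tooth_features                  # inverse mapping

    # channel-wise superposition of tooth features and sampled texture
    return np.concatenate([sampled_texture, pasted], axis=0)  # third neural texture
```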
It can be understood that, in this embodiment, the neural rendering model is also obtained by end-to-end supervised training, and the loss function used in the training process is a multi-layer perceptual loss reconstructed frame by frame. The loss is calculated as follows:
let the output frame of the neural rendering model be RtInputting the corresponding real frame as GTtSetting VGG as a pre-trained VGG-19 classification neural network, taking a multi-layer intermediate result in the network as a perception feature, and assuming thatTaking the k-layer intermediate result, then there is (f)1(x),…,fk(x) VGG (x), from which the multi-layered perceptual loss of frame-by-frame reconstruction is defined as VGG (R)t)-VGG(GTt) 1-norm of (1).
It can be understood that, in this embodiment, both the emotion voice model and the neural rendering model may adopt a deep neural network as the basic model architecture, where the deep neural network comprises data processing structures such as convolutional layers, pooling layers, activation layers and fully-connected layers.
Step 150: synthesizing the video frame sequence and the speaking audio to generate the speaking video of the target person.
It can be understood that dynamic neural texture information is introduced into the speaking-video generation process, so the emotion of the target person can be modeled accurately and the facial expression of the person in the resulting speaking video is more natural. Meanwhile, the tooth refinement process refines the tooth part of the screen-space neural texture, so the generated speaking video contains high-quality, clear teeth and its content is more realistic.
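To tie steps 110 to 150 together, the following is a minimal end-to-end sketch; the trained models and the audio/video muxer are passed in as placeholders, and extract_mel_features refers to the preprocessing sketch given earlier (all of these are assumptions, not an implementation fixed by the embodiment):

```python
def generate_speaking_video(speaking_audio, emotion_label_sequence, face_background_sequence,
                            emotion_voice_model, neural_rendering_model, mux_audio_video):
    """Hypothetical end-to-end driver for the method of fig. 1 (steps 110-150)."""
    # Step 120: extract audio features (mel-spectrogram of the same length as
    # the emotion label sequence)
    audio_features = extract_mel_features(speaking_audio)

    # Step 130: emotion voice model -> face model (3DMM coefficient) sequence
    face_model_sequence = emotion_voice_model(audio_features, emotion_label_sequence)

    # Step 140: neural rendering model -> video frame sequence
    video_frames = neural_rendering_model(face_model_sequence, emotion_label_sequence,
                                          face_background_sequence)

    # Step 150: synthesize the video frames with the speaking audio
    return mux_audio_video(video_frames, speaking_audio)
```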
The following describes the generation device of speaking video provided by the present invention, and the generation device of speaking video described below and the generation method of speaking video described above can be referred to correspondingly.
Fig. 4 shows a device for generating speaking video provided by an embodiment of the present invention, the device includes:
an obtaining module 410, configured to obtain a speaking audio frequency, an emotion tag sequence, and a face background sequence of a target person;
the first processing module 420 is configured to perform feature extraction on the speaking audio to obtain audio features;
the second processing module 430 is configured to input the audio features and the emotion tag sequence into an emotion voice model to obtain a face model sequence output by the emotion voice model; the emotion voice model is obtained by training a neural network based on an audio characteristic sample, an emotion label sequence sample and a corresponding face model sequence sample;
the third processing module 440 is configured to input the face model sequence, the emotion tag sequence, and the face background sequence into the neural rendering model, so as to obtain a video frame sequence output by the neural rendering model; the neural rendering model is obtained by training a neural network based on a face model sequence sample, an emotion label sequence sample and a corresponding video frame sequence sample;
the fourth processing module 450 is configured to synthesize the video frame sequence and the speaking audio to generate a speaking video of the target person.
In an exemplary embodiment, the apparatus for generating a speaking video may further include:
the emotion voice model training module is used for acquiring a speaking video sample set; the speaking video sample set comprises a plurality of speaking video samples and corresponding emotion label sequence samples, and each speaking video sample comprises a single face; extracting a speaking audio sample from the speaking video sample, and performing feature extraction on the speaking audio sample to obtain an audio feature sample; extracting a plurality of video frame samples from the speaking video sample, and carrying out face reconstruction on each frame of video frame sample to generate a face model corresponding to each frame of video frame sample and form a face model sequence sample; training a pre-constructed neural network through an emotion label sequence sample, an audio characteristic sample and a face model sequence sample to obtain an emotion voice model.
In an exemplary embodiment, the emotional voice model may include:
the voice recognition layer is used for extracting deep voice features from the audio features;
the emotion transcoding layer is used for extracting deep emotion characteristics corresponding to the emotion labels from the emotion label sequence;
and the mixed code conversion layer is used for integrating the deep voice features and the deep emotion features corresponding to the emotion labels respectively and outputting a human face model sequence.
In an exemplary embodiment, the neural rendering model may include:
the texture generation layer is used for respectively calculating first nerve texture information corresponding to each emotion label;
the texture sampling layer is used for respectively sampling the first nerve texture information corresponding to each emotion label to a screen space according to the human face model sequence to obtain second nerve texture information corresponding to each emotion label;
the tooth refinement layer is used for respectively refining tooth areas in the second nerve texture information corresponding to the emotion labels to obtain third nerve texture information corresponding to the emotion labels;
the background extraction layer is used for extracting the characteristics of each background frame in the face background sequence to obtain the background frame characteristics corresponding to each background frame;
and the nerve rendering layer is used for fusing the third nerve texture information corresponding to each emotion label with the background frame characteristics corresponding to the corresponding background frame respectively to generate a video frame sequence.
Further, the texture generation layer specifically includes:
the emotion transcoding sublayer is used for converting each emotion label into a dynamic neural texture weight;
and the calculation sublayer is used for multiplying the dynamic nerve texture weight corresponding to each emotion label with a preset dynamic nerve texture base to obtain first nerve texture information corresponding to each emotion label.
Further, the tooth refinement layer specifically includes:
the interception sublayer is used for determining the position of the tooth in the second nerve texture information according to a preset tooth mask image, performing affine transformation on the second nerve texture information and intercepting to obtain the nerve texture of the tooth area;
the completion sublayer is used for performing characteristic completion on the nerve textures of the tooth area by using corresponding emotion labels to obtain complete tooth characteristics corresponding to the emotion labels;
and the thinning sublayer is used for fusing the complete tooth characteristics corresponding to the emotion labels and the second nerve texture information to obtain third nerve texture information corresponding to each emotion label.
Therefore, the apparatus for generating a speaking video provided by the embodiment of the invention can model the emotion of the target person based on dynamic neural texture information, so as to generate a speaking video with more natural facial expressions that intuitively expresses the emotional changes of the target person; meanwhile, the apparatus refines the tooth area in the screen-space neural texture, so that the obtained speaking video is more realistic.
Fig. 5 illustrates a physical structure diagram of an electronic device, which may include, as shown in fig. 5: a processor (processor)510, a communication Interface (Communications Interface)520, a memory (memory)530 and a communication bus 540, wherein the processor 510, the communication Interface 520 and the memory 530 communicate with each other via the communication bus 540. Processor 510 may invoke logic instructions in memory 530 to perform a method of generating a talking video, the method comprising: acquiring a speaking audio, an emotion label sequence and a face background sequence of a target person; carrying out feature extraction on the speaking audio to obtain audio features; inputting the audio features and the emotion label sequence into an emotion voice model to obtain a human face model sequence output by the emotion voice model; the emotion voice model is obtained by training a neural network based on an audio characteristic sample, an emotion label sequence sample and a corresponding face model sequence sample; inputting the human face model sequence, the emotion label sequence and the human face background sequence into a neural rendering model to obtain a video frame sequence output by the neural rendering model; the neural rendering model is obtained by training a neural network based on a face model sequence sample, an emotion label sequence sample and a corresponding video frame sequence sample; and synthesizing the video frame sequence and the speaking audio to generate the speaking video of the target character.
Furthermore, the logic instructions in the memory 530 may be implemented in the form of software functional units and stored in a computer readable storage medium when the software functional units are sold or used as independent products. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
In another aspect, the present invention also provides a computer program product, the computer program product including a computer program, the computer program being storable on a non-transitory computer-readable storage medium, the computer program being capable of executing, when executed by a processor, the method for generating a speaking video provided by the above methods, the method including: acquiring a speaking audio, an emotion label sequence and a face background sequence of a target person; carrying out feature extraction on the speaking audio to obtain audio features; inputting the audio features and the emotion label sequence into an emotion voice model to obtain a human face model sequence output by the emotion voice model; the emotion voice model is obtained by training a neural network based on an audio characteristic sample, an emotion label sequence sample and a corresponding face model sequence sample; inputting the human face model sequence, the emotion label sequence and the human face background sequence into a neural rendering model to obtain a video frame sequence output by the neural rendering model; the neural rendering model is obtained by training a neural network based on a face model sequence sample, an emotion label sequence sample and a corresponding video frame sequence sample; and synthesizing the video frame sequence and the speaking audio to generate the speaking video of the target character.
In yet another aspect, the present invention also provides a non-transitory computer-readable storage medium having stored thereon a computer program, which when executed by a processor, implements a method for generating speaking video provided by the above methods, the method comprising: acquiring a speaking audio, an emotion label sequence and a face background sequence of a target person; carrying out feature extraction on the speaking audio to obtain audio features; inputting the audio features and the emotion label sequence into an emotion voice model to obtain a human face model sequence output by the emotion voice model; the emotion voice model is obtained by training a neural network based on an audio characteristic sample, an emotion label sequence sample and a corresponding face model sequence sample; inputting the human face model sequence, the emotion label sequence and the human face background sequence into a neural rendering model to obtain a video frame sequence output by the neural rendering model; the neural rendering model is obtained by training a neural network based on a face model sequence sample, an emotion label sequence sample and a corresponding video frame sequence sample; and synthesizing the video frame sequence and the speaking audio to generate the speaking video of the target character.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A method for generating a speaking video, comprising:
acquiring a speaking audio, an emotion label sequence and a face background sequence of a target person;
carrying out feature extraction on the speaking audio to obtain audio features;
inputting the audio features and the emotion label sequence into an emotion voice model to obtain a human face model sequence output by the emotion voice model; the emotion voice model is obtained by training a neural network based on an audio feature sample, an emotion label sequence sample and a corresponding face model sequence sample;
inputting the human face model sequence, the emotion label sequence and the human face background sequence into a neural rendering model to obtain a video frame sequence output by the neural rendering model; the neural rendering model is obtained by training a neural network based on a face model sequence sample, an emotion label sequence sample and a corresponding video frame sequence sample;
and synthesizing the video frame sequence and the speaking audio to generate the speaking video of the target person.
2. The method for generating speaking video as claimed in claim 1, wherein the training process of the emotion voice model comprises:
acquiring a speaking video sample set; the speaking video sample set comprises a plurality of speaking video samples and corresponding emotion label sequence samples, and each speaking video sample comprises a single face;
extracting a speaking audio sample from the speaking video sample, and performing feature extraction on the speaking audio sample to obtain an audio feature sample;
extracting a plurality of video frame samples from the speaking video sample, and carrying out face reconstruction on each frame of video frame sample to generate a face model corresponding to each frame of video frame sample and form a face model sequence sample;
and training a pre-constructed neural network through the emotion label sequence sample, the audio characteristic sample and the face model sequence sample to obtain an emotion voice model.
3. The method as claimed in claim 1, wherein the emotional speech model comprises:
the voice recognition layer is used for extracting deep voice features from the audio features;
the emotion transcoding layer is used for extracting deep emotion characteristics corresponding to the emotion labels from the emotion label sequence;
and the mixed code conversion layer is used for integrating the deep emotion characteristics and the deep voice characteristics corresponding to the emotion labels respectively and outputting a human face model sequence.
4. The method of claim 1, wherein the neural rendering model comprises:
the texture generation layer is used for respectively calculating first nerve texture information corresponding to each emotion label in the emotion label sequence;
the texture sampling layer is used for respectively sampling the first nerve texture information corresponding to each emotion label to a screen space according to the human face model sequence to obtain second nerve texture information corresponding to each emotion label;
the tooth refinement layer is used for respectively refining tooth areas in the second nerve texture information corresponding to the emotion labels to obtain third nerve texture information corresponding to the emotion labels;
the background extraction layer is used for extracting the characteristics of each background frame in the face background sequence to obtain the background frame characteristics corresponding to each background frame;
and the nerve rendering layer is used for fusing the third nerve texture information corresponding to each emotion label with the background frame characteristics corresponding to the corresponding background frame respectively to generate a video frame sequence.
5. The method as claimed in claim 4, wherein said texture generation layer comprises:
the emotion transcoding sublayer is used for converting each emotion label into a dynamic neural texture weight;
and the calculation sublayer is used for multiplying the dynamic nerve texture weight corresponding to each emotion label with a preset dynamic nerve texture base to obtain first nerve texture information corresponding to each emotion label.
6. The method as claimed in claim 4, wherein the teeth refinement layer comprises:
the interception sublayer is used for determining the position of teeth in the second nerve texture information according to a preset tooth mask image, carrying out affine transformation on the second nerve texture information and intercepting to obtain the nerve texture of the tooth area;
the completion sublayer is used for performing characteristic completion on the nerve textures of the tooth area by using corresponding emotion labels to obtain complete tooth characteristics corresponding to the emotion labels;
and the thinning sublayer is used for respectively fusing the complete tooth characteristics corresponding to the emotion labels with the second nerve texture information to obtain third nerve texture information corresponding to each emotion label.
7. An apparatus for generating a speaking video, comprising:
the acquisition module is used for acquiring the speaking audio, the emotion label sequence and the face background sequence of the target person;
the first processing module is used for extracting the characteristics of the speaking audio to obtain audio characteristics;
the second processing module is used for inputting the audio features and the emotion label sequence into an emotion voice model to obtain a human face model sequence output by the emotion voice model; the emotion voice model is obtained by training a neural network based on an audio feature sample, an emotion label sequence sample and a corresponding face model sequence sample;
the third processing module is used for inputting the human face model sequence, the emotion label sequence and the human face background sequence into a neural rendering model to obtain a video frame sequence output by the neural rendering model; the neural rendering model is obtained by training a neural network based on a face model sequence sample, an emotion label sequence sample and a corresponding video frame sequence sample;
and the fourth processing module is used for synthesizing the video frame sequence and the speaking audio to generate the speaking video of the target person.
8. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the steps of the method for generating a talking video according to any of the claims 1 to 6 when executing the program.
9. A non-transitory computer-readable storage medium, on which a computer program is stored, wherein the computer program, when being executed by a processor, implements the steps of the method for generating a talking video according to any one of claims 1 to 6.
10. A computer program product comprising a computer program, characterized in that the computer program realizes the steps of the method for generating a talking video according to anyone of claims 1 to 6 when being executed by a processor.
CN202111404955.XA 2021-11-24 2021-11-24 Method, device, electronic equipment, medium and product for generating speaking video Active CN114245215B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111404955.XA CN114245215B (en) 2021-11-24 2021-11-24 Method, device, electronic equipment, medium and product for generating speaking video

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111404955.XA CN114245215B (en) 2021-11-24 2021-11-24 Method, device, electronic equipment, medium and product for generating speaking video

Publications (2)

Publication Number Publication Date
CN114245215A true CN114245215A (en) 2022-03-25
CN114245215B CN114245215B (en) 2023-04-07

Family

ID=80751012

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111404955.XA Active CN114245215B (en) 2021-11-24 2021-11-24 Method, device, electronic equipment, medium and product for generating speaking video

Country Status (1)

Country Link
CN (1) CN114245215B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190163965A1 (en) * 2017-11-24 2019-05-30 Genesis Lab, Inc. Multi-modal emotion recognition device, method, and storage medium using artificial intelligence
CN110688911A (en) * 2019-09-05 2020-01-14 深圳追一科技有限公司 Video processing method, device, system, terminal equipment and storage medium
CN111243626A (en) * 2019-12-30 2020-06-05 清华大学 Speaking video generation method and system
CN113194348A (en) * 2021-04-22 2021-07-30 清华珠三角研究院 Virtual human lecture video generation method, system, device and storage medium
CN113192161A (en) * 2021-04-22 2021-07-30 清华珠三角研究院 Virtual human image video generation method, system, device and storage medium

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115205949A (en) * 2022-09-05 2022-10-18 腾讯科技(深圳)有限公司 Image generation method and related device
WO2024051445A1 (en) * 2022-09-05 2024-03-14 腾讯科技(深圳)有限公司 Image generation method and related device
CN115375809A (en) * 2022-10-25 2022-11-22 科大讯飞股份有限公司 Virtual image generation method, device, equipment and storage medium
CN115375809B (en) * 2022-10-25 2023-03-14 科大讯飞股份有限公司 Method, device and equipment for generating virtual image and storage medium
CN115996303A (en) * 2023-03-23 2023-04-21 科大讯飞股份有限公司 Video generation method, device, electronic equipment and storage medium
CN116091668A (en) * 2023-04-10 2023-05-09 广东工业大学 Talking head video generation method based on emotion feature guidance
CN117218224A (en) * 2023-08-21 2023-12-12 华院计算技术(上海)股份有限公司 Face emotion image generation method and device, readable storage medium and terminal

Also Published As

Publication number Publication date
CN114245215B (en) 2023-04-07

Similar Documents

Publication Publication Date Title
CN114245215B (en) Method, device, electronic equipment, medium and product for generating speaking video
US11276231B2 (en) Semantic deep face models
CN113378697B (en) Method and device for generating speaking face video based on convolutional neural network
CN112215927B (en) Face video synthesis method, device, equipment and medium
WO2024051445A9 (en) Image generation method and related device
CN110599395B (en) Target image generation method, device, server and storage medium
CN113327278B (en) Three-dimensional face reconstruction method, device, equipment and storage medium
CN111901598B (en) Video decoding and encoding method, device, medium and electronic equipment
CN113272870A (en) System and method for realistic real-time portrait animation
CN113901894A (en) Video generation method, device, server and storage medium
CN110288697A (en) 3D face representation and method for reconstructing based on multiple dimensioned figure convolutional neural networks
CN108538283B (en) Method for converting lip image characteristics into voice coding parameters
CN111369646B (en) Expression synthesis method integrating attention mechanism
CN100505840C (en) Method and device for transmitting face synthesized video
CN114241558B (en) Model training method, video generating method and device, equipment and medium
EP3788599A1 (en) Generating a simulated image of a baby
CN108376234B (en) Emotion recognition system and method for video image
CN113469292A (en) Training method, synthesizing method, device, medium and equipment for video synthesizing model
CN114202615A (en) Facial expression reconstruction method, device, equipment and storage medium
CN112949707A (en) Cross-mode face image generation method based on multi-scale semantic information supervision
CN114222179A (en) Virtual image video synthesis method and equipment
CN116597857A (en) Method, system, device and storage medium for driving image by voice
CN112634413B (en) Method, apparatus, device and storage medium for generating model and generating 3D animation
CN114529785A (en) Model training method, video generation method and device, equipment and medium
CN114049290A (en) Image processing method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant