CN114202604A - Voice-driven target person video generation method and device and storage medium - Google Patents
- Publication number
- CN114202604A (application number CN202111466434.7A)
- Authority
- CN
- China
- Prior art keywords
- key point
- voice
- video
- image
- matrix
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T13/00—Animation
- G06T13/20—3D [Three Dimensional] animation
- G06T13/205—3D [Three Dimensional] animation driven by audio data
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T13/00—Animation
- G06T13/20—3D [Three Dimensional] animation
- G06T13/40—3D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/24—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
Abstract
The invention discloses a voice-driven target person video generation method, a voice-driven target person video generation device and a storage medium, wherein the method comprises the following steps: acquiring voice data and a frontal image of the person's upper body; extracting an initial head key point coordinate matrix and an initial upper-body key point coordinate matrix from the acquired frontal image; separating voice content information and audio information from the acquired voice data; training a multi-dimensional mapping relation among the voice content information, the audio information, the head key point coordinates and the upper-body key point coordinates based on the voice content information, the audio information, the initial head key point coordinate matrix and the initial upper-body key point coordinate matrix; generating a video image frame sequence based on the multi-dimensional mapping relation; and splicing the video image frame sequence with the voice data to obtain the target person's speech audio/video. The linkage between head motion and the upper body is fully considered, and the generated video is natural and highly realistic.
Description
Technical Field
The invention relates to the field of computer information technology, and in particular to a method, a device and a storage medium for generating a video of a voice-driven target person.
Background
Voice-driven face video generation is an important research direction in human-computer interaction: given an arbitrary input speech segment, the goal is to generate a video of the target person speaking with synchronized voice and mouth shape. Existing approaches are mainly end-to-end trained encoder-decoder frameworks. Chen et al. adopt an attention mechanism when training facial features, improving video accuracy. Mittal et al. propose separating speech content from emotion information and controlling the movements of the face and head with different feature dimensions. Edwards et al. map speech to faces multi-dimensionally to control face movement. Qian et al. propose AutoVC, a few-shot voice conversion approach that separates speech into voice content and speaker identity information. Zhou et al. separate speech into content and target person information: the content information strongly controls the movement of the lips and nearby facial regions, while the target person information determines the details of the facial expression and other dynamics of the target head. Nian et al. propose a key point method based on face and mouth features, using face contour key points and lip key points to represent the head motion information and lip motion information of the target person, respectively. Although the prior art achieves good results, it mainly concerns facial expression and mouth/lip movement while ignoring the linkage between head movement and the upper body. The generated speaking videos of the target person contain essentially only facial motion and a small amount of head motion, and the head and upper body are not naturally linked.
Disclosure of Invention
The invention provides a voice-driven target person video generation method, device and storage medium, aiming to solve the problems in the prior art that the generated speaking video of a target person contains essentially only facial motion and a small amount of head motion, and that the head and the upper body are not naturally linked.
In a first aspect, a method for generating a video of a voice-driven target person is provided, which includes:
acquiring voice data and a frontal image of the person's upper body;
extracting an initial head key point coordinate matrix and an initial upper body key point coordinate matrix based on the acquired front image of the upper body of the person;
separating voice content information and audio information based on the acquired voice data;
training a multi-dimensional mapping relation among the voice content information, the audio information, the head key point coordinate and the upper half key point coordinate based on the voice content information, the audio information, the initial head key point coordinate matrix and the initial upper half key point coordinate matrix;
generating a video image frame sequence based on a multi-dimensional mapping relation;
and splicing the video image frame sequence with the voice data to obtain the target person's speech audio/video.
Further, the extracting an initial head key point coordinate matrix and an initial upper body key point coordinate matrix based on the acquired image of the front of the upper body of the person includes:
dividing the front image of the upper half of the figure into a head image and an upper half image;
extracting an initial head key point coordinate matrix in a head portrait by adopting a head key point extraction model;
and extracting an initial upper body key point coordinate matrix in the upper body image by adopting an upper body key point extraction model.
Further, the separating the voice content information and the audio information based on the acquired voice data includes:
extracting a Mel spectrogram feature matrix of the voice data;
inputting the Mel spectrogram feature matrix into an LSTM-based voice feature extraction network to obtain a voice feature matrix;
for each frame image, inputting the corresponding voice feature matrix in the preset voice frame number window into a voice content encoder to obtain a voice content matrix;
and for each frame image, inputting the corresponding voice feature matrix in the preset voice frame number window into an audio encoder to obtain an audio matrix.
Further, the training of the multi-dimensional mapping relation among the voice content information, the audio information, the head key point coordinates and the upper-body key point coordinates based on the voice content information, the audio information, the initial head key point coordinate matrix and the initial upper-body key point coordinate matrix specifically includes:
inputting the voice content matrix and the initial head key point coordinate matrix into a first multilayer perceptron, and predicting the displacement of the head key point coordinate matrix in each frame of image;
obtaining a head position prediction coordinate of each frame of image based on the initial head key point coordinate matrix and the displacement of the head key point coordinate matrix in each frame of image;
fusing the voice content matrix and the audio matrix based on a self-attention network to obtain a self-reconstructed audio movement matrix;
inputting the self-reconstructed audio movement matrix, the initial head key point coordinate matrix and the initial upper-body key point coordinate matrix into a second multilayer perceptron, and predicting the overall displacement of the head key point coordinate matrix and the upper-body key point coordinate matrix for each frame of image;
and obtaining the overall predicted coordinates of each frame of image based on the head position prediction coordinates of each frame of image and the predicted overall displacement of the head and upper-body key point coordinate matrices of each frame of image.
Further, in the process of training the mapping relation between the voice content matrix and the head position prediction coordinates, the voice content encoder and the first multilayer perceptron used can be obtained in advance by synchronous training on collected video data, and the minimized loss function L_c adopted in the training process is as follows:

$$L_c = \sum_{t=1}^{T}\sum_{i=1}^{N}\left(\left\|P_{i,t}-\hat{P}_{i,t}\right\|_2^2+\lambda_c\left\|\mathcal{L}(P_{i,t})-\mathcal{L}(\hat{P}_{i,t})\right\|_2^2\right)$$

where P_{i,t} denotes the predicted coordinate position of the i-th head key point in the t-th frame image, \hat{P}_{i,t} denotes the actual coordinate position of the i-th head key point in the t-th frame image, λ_c denotes a weight coefficient, \mathcal{L}(P_{i,t}) and \mathcal{L}(\hat{P}_{i,t}) denote the graph Laplacian coordinates of P_{i,t} and \hat{P}_{i,t}, N denotes the total number of head key points, T denotes the total number of image frames, and ||·||_2 denotes the L2 norm.
Further, the generating a video image frame sequence based on the multi-dimensional mapping relationship includes:
and inputting the initial head key point coordinate matrix, the head position prediction coordinate of each frame of image and the overall prediction coordinate of each frame of image into a video frame reconstruction model, and fusing to obtain a reconstructed video frame sequence.
Further, the minimized loss function L_a adopted in training the video frame reconstruction model is as follows:

$$L_a = \left\|Q_{trg}-\hat{Q}_{trg}\right\|_1+\lambda_a\left\|\phi(Q_{trg})-\phi(\hat{Q}_{trg})\right\|_1$$

where src denotes the source video frame, trg denotes the target video frame, Q_{trg} denotes the reconstructed video frame, \hat{Q}_{trg} denotes the real video frame, λ_a denotes a weight coefficient with λ_a = 1, φ(Q_{trg}) and φ(\hat{Q}_{trg}) denote features of the reconstructed and real video frames extracted by a pre-trained ResNet-34 network, and ||·||_1 denotes the L1 norm.
Further, the splicing of the video image frame sequence with the voice data to obtain the target person's speech audio/video comprises:
splicing the video image frame sequence with the voice data using the ffmpeg tool to obtain the target person's speech video, wherein the number of frames per second of the target person's speech audio/video is a preset value.
In a second aspect, a voice-driven target person video generation apparatus is provided, including:
the data acquisition module is used for acquiring voice data and a frontal image of the person's upper body;
the key point extraction module is used for extracting an initial head key point coordinate matrix and an initial upper-body key point coordinate matrix based on the acquired frontal image of the person's upper body;
the voice data separation module is used for separating content information and audio information based on the acquired voice data;
the voice image mapping module is used for training a multi-dimensional mapping relation among the content information, the audio information, the head key point coordinate and the upper half key point coordinate based on the content information, the audio information, the initial head key point coordinate matrix and the initial upper half key point coordinate matrix;
the video frame generation module is used for generating a video image frame sequence based on a multi-dimensional mapping relation;
and the video generation module is used for splicing the video image frame sequence with the voice data to obtain the target person's speech audio/video.
In a third aspect, a computer-readable storage medium is provided, which stores a computer program executed by a processor to implement the voice-driven target person video generation method as described above.
Advantageous effects
The invention provides a voice-driven target person video generation method, device and storage medium. In the technical scheme of the invention, the voice content information and the audio information in the voice data are first separated; the motion of the head is controlled through the voice content information, and the natural swaying of the head and upper body is controlled through the audio information. Multi-dimensional mapping training between the voice content information and audio information on the one hand and the head key points and upper-body key points on the other generates a mapping relation, from which a video image frame sequence is obtained, finally producing a frontal upper-body speech video of the target person. The linkage between head motion and the upper body is fully considered, and the generated video is natural and highly realistic.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
Fig. 1 is a flowchart of a method for generating a voice-driven target person video according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be described in detail below. It is to be understood that the described embodiments are merely exemplary of the invention, and not restrictive of the full scope of the invention. All other embodiments, which can be derived by a person skilled in the art from the examples given herein without any inventive step, are within the scope of the present invention.
Example 1
The embodiment provides a method for generating a video of a voice-driven target person, which comprises the following steps:
S1: acquiring voice data and a frontal image of the person's upper body; in practice, the frontal image includes the head and the upper body, and there may be one or more such images.
S2: and extracting an initial head key point coordinate matrix and an initial upper body key point coordinate matrix based on the acquired image of the front side of the upper body of the person. More specifically, the method comprises the following steps:
S21: the frontal image of the person's upper body is preprocessed to a uniform size, such as 768 × 768. The image is then cropped into a head image and an upper-body image by fixed-region cropping; a crop region is fully determined by the coordinates of two points, its top-left and bottom-right corners. For example:
S211: with head-region top-left coordinates (256, 0) and bottom-right coordinates (512, 256), the head image img_head is obtained by formula (1).

img_head = Image[256:512, 0:256] (1)
S212: the upper-body image does not need the head, so only the pixels with width coordinate 256 to 768 are taken, together with all pixels along the other axis, and the upper-body image img_body is obtained by formula (2).

img_body = Image[:, 256:768] (2)
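The two crops in formulas (1) and (2) can be sketched directly as numpy slices. A minimal sketch; the function name and the dummy frame are illustrative, not part of the patent:

```python
import numpy as np

def crop_regions(image):
    """Split a 768x768 upper-body frontal image into head and upper-body crops.

    Fixed-region cropping as in formulas (1) and (2): the head crop is the
    256x256 box with top-left (256, 0) and bottom-right (512, 256); the
    upper-body crop keeps the full first axis and indices 256..768 along
    the second axis.
    """
    img_head = image[256:512, 0:256]   # formula (1)
    img_body = image[:, 256:768]       # formula (2)
    return img_head, img_body

# Example on a dummy 768x768 RGB frame
frame = np.zeros((768, 768, 3), dtype=np.uint8)
head, body = crop_regions(frame)
print(head.shape)  # (256, 256, 3)
print(body.shape)  # (768, 512, 3)
```

Note that slicing returns views, so the crops share memory with the original frame; copy them if they are modified downstream.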
S22: an initial head key point coordinate matrix is extracted from the head image using a head key point extraction model. In this embodiment, the open-source DLIB library for face detection and key point extraction is used directly to obtain the matrix q_head of 68 initial head key point coordinates from the head image img_head; the head key point coordinates contain the lip motion information and head motion information of the face.

q_head = DLIB(img_head) (3)
S23: an initial upper-body key point coordinate matrix is extracted from the upper-body image using an upper-body key point extraction model. In this embodiment, the open-source Inception_v4 model is used to obtain the 128-dimensional initial key point coordinate matrix q_body from the upper-body image img_body.

q_body = Inception_v4(img_body) (4)
S3: training a multi-dimensional mapping relation among the voice content information, the audio information, the head key point coordinate and the upper half key point coordinate based on the voice content information, the audio information, the initial head key point coordinate matrix and the initial upper half key point coordinate matrix.
A piece of voice data can be divided into voice content information and audio information, which are obtained using different models. The voice content information mainly determines the general movements of the lips and nearby areas, while the audio information determines the dynamics of the head and the fine movements of the body. For example, while speaking, a person shows changes in head pose and facial expression without obvious breathing movements, whereas during short pauses the changes in head pose and facial expression diminish and breathing movements become noticeable. The training specifically comprises the following steps.
S31: speech content information-image training
The coordinate mapping of the head feature key points, mainly the key points of the lips and the surrounding face, is obtained by training on the voice content signal.
S311: the speech data is read by python with librosa library.
S312: the Mel spectrogram feature matrix Audio_mfcc of the voice data is extracted using the python_speech_features package.
S313: the Mel spectrogram feature matrix Audio_mfcc of the voice data is input into an LSTM-based speech feature extraction network to obtain the speech feature matrix A_t ∈ R^{M×D}, where M is the total number of input speech frames and D is the dimension of the speech feature matrix.

A_t = LSTM(Audio_mfcc) (5)
S314: the sampling frequency of the voice data is much higher than the video frame rate, so several frames of voice data correspond to one image frame within the corresponding time window. Therefore, for each image frame, the speech feature matrix within the window of a preset number of speech frames (set to 18 frames in this embodiment) is input into the speech content encoder AutoVC.E_c to obtain the speech content matrix c_t.

c_t = AutoVC.E_c(A_t; w_lstm,C) (6)
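The many-to-one alignment in step S314 (several speech feature frames per video frame, grouped into an 18-frame window) can be sketched as follows. The centering convention and all names are assumptions; the patent only fixes the window size:

```python
import numpy as np

def speech_windows(A, num_video_frames, win=18):
    """For each video frame, gather the window of `win` speech-feature frames
    aligned to it. `A` has shape (M, D): M speech frames, D feature dims.

    Speech is sampled much faster than video, so roughly M / num_video_frames
    speech frames fall under each video frame; a fixed-size window is taken
    around that position (clamped at the sequence boundaries).
    """
    M, D = A.shape
    step = M / num_video_frames               # speech frames per video frame
    windows = []
    for t in range(num_video_frames):
        center = int((t + 0.5) * step)
        lo = max(0, center - win // 2)
        hi = min(M, lo + win)
        lo = max(0, hi - win)                 # keep the window size fixed
        windows.append(A[lo:hi])
    return np.stack(windows)                  # (num_video_frames, win, D)

A = np.random.randn(180, 64)                  # 180 speech frames, 64-dim features
W = speech_windows(A, num_video_frames=30)
print(W.shape)  # (30, 18, 64)
```

Each of the 30 windows would then be fed to the speech content encoder to produce one c_t per video frame.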
S315: the speech content matrix c_t and the initial head key point coordinate matrix q_head are input into the first multilayer perceptron (MLP_c) to predict the displacement Δq_t of the head key point coordinate matrix in each frame image, where q_head ∈ R^{68×3}.

Δq_t = MLP_c(c_t, q_head; w_mlp,C) (7)

Here {w_lstm,C, w_mlp,C} are the learnable parameters of the AutoVC.E_c and MLP_c networks respectively. The LSTM has three layers of cells, each with an internal hidden state vector of size 256. The decoder (first multilayer perceptron) MLP_c has three layers, with internal hidden state vector sizes of 512, 256 and 204 (68 × 3), respectively.
S316: based on the initial head key point coordinate matrix q_head and the displacement Δq_t of the head key point coordinate matrix in each frame image, the head position prediction coordinates P_t of each frame image are obtained.

P_t = q_head + Δq_t (8)
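A minimal sketch of formulas (7) and (8), with the trained MLP_c replaced by a randomly initialized stand-in of the stated layer sizes (512, 256, 204 = 68 × 3); the 256-dimensional content vector is an assumption:

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp_c(c_t, q_head, weights):
    """Stand-in for the first multilayer perceptron MLP_c of formula (7):
    three layers with hidden sizes 512, 256 and 204 = 68*3, predicting the
    per-frame displacement of the 68 head key points. Weights here are
    random placeholders, not trained parameters."""
    x = np.concatenate([c_t, q_head.ravel()])
    for W, b in weights[:-1]:
        x = np.maximum(0.0, x @ W + b)        # ReLU hidden layers
    W, b = weights[-1]
    return (x @ W + b).reshape(68, 3)         # displacement Δq_t

def init_weights(in_dim, sizes=(512, 256, 204)):
    weights, d = [], in_dim
    for h in sizes:
        weights.append((rng.standard_normal((d, h)) * 0.01, np.zeros(h)))
        d = h
    return weights

c_t = rng.standard_normal(256)                # speech content vector (assumed 256-d)
q_head = rng.standard_normal((68, 3))         # initial head key points, R^{68x3}
weights = init_weights(256 + 68 * 3)
delta_q = mlp_c(c_t, q_head, weights)
P_t = q_head + delta_q                        # formula (8)
print(P_t.shape)  # (68, 3)
```

The real MLP_c would be trained jointly with the speech content encoder under the loss L_c described below in the text.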
In the implementation, during training of the mapping relation between the voice content matrix and the head position prediction coordinates, the voice content encoder and the first multilayer perceptron used can be obtained in advance by synchronous training on collected video data. The minimized loss function L_c adopted in training evaluates the distance between the actual and predicted coordinate positions, as well as the distance between their respective graph Laplacian coordinates, which encourages a proper relative placement of the coordinates and preserves head detail. The loss function is formulated as follows:

$$L_c = \sum_{t=1}^{T}\sum_{i=1}^{N}\left(\left\|P_{i,t}-\hat{P}_{i,t}\right\|_2^2+\lambda_c\left\|\mathcal{L}(P_{i,t})-\mathcal{L}(\hat{P}_{i,t})\right\|_2^2\right) \qquad (9)$$

where P_{i,t} denotes the predicted coordinate position of the i-th head key point in the t-th frame image, \hat{P}_{i,t} denotes the actual coordinate position of the i-th head key point in the t-th frame image, λ_c denotes a weight coefficient (λ_c = 1 in this embodiment), \mathcal{L}(P_{i,t}) and \mathcal{L}(\hat{P}_{i,t}) denote the graph Laplacian coordinates of P_{i,t} and \hat{P}_{i,t}, N denotes the total number of head key points, T denotes the total number of image frames, and ||·||_2 denotes the L2 norm. The graph Laplacian coordinate is calculated by the following formula:

$$\mathcal{L}(p_i) = p_i - \frac{1}{|N(p_i)|}\sum_{p_j \in N(p_i)} p_j \qquad (10)$$

where N(p_i) contains the coordinates adjacent to p_i within the same face part; 8 face parts are used, each containing a predefined subset of the coordinates of the face template. The calculation of \mathcal{L}(\hat{P}_{i,t}) is the same and is not repeated here.
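The graph Laplacian coordinate (each point minus the mean of its adjacent points within the same face part) is small enough to sketch directly; the toy adjacency below is illustrative and not the patent's predefined 8-part face template:

```python
import numpy as np

def laplacian_coords(P, neighbors):
    """Graph Laplacian coordinate of each key point: the point minus the
    mean of its adjacent points, L(p_i) = p_i - mean_{p_j in N(p_i)} p_j.

    P is an (n, 3) coordinate matrix; neighbors[i] lists the indices
    adjacent to point i within the same face part.
    """
    L = np.empty_like(P)
    for i, nbrs in enumerate(neighbors):
        L[i] = P[i] - P[nbrs].mean(axis=0)
    return L

# Toy example: 4 collinear points forming one "part" with chain adjacency.
P = np.array([[0.0, 0, 0], [1, 0, 0], [2, 0, 0], [3, 0, 0]])
neighbors = [[1], [0, 2], [1, 3], [2]]
L = laplacian_coords(P, neighbors)
print(L[1])  # interior point on a straight line -> [0. 0. 0.]
```

Because the Laplacian coordinate is translation-invariant and encodes each point relative to its neighbors, penalizing its difference preserves local head shape detail even when the absolute positions are slightly off.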
s32: speech content information and audio information-image training
Training maximizes the similarity of audio information between different utterances spoken by the same target speaker and minimizes the similarity between different target speakers. Through training on the speech audio information, the mapping relation between the audio information and the head and upper body is obtained. The subtle correlations between head and upper-body motion are also a key factor in generating a realistic speaking target person.
S321: the speech data is read by python with librosa library.
S322: the Mel spectrogram feature matrix Audio_mfcc of the voice data is extracted using the python_speech_features package.
S323: the Mel spectrogram feature matrix Audio_mfcc of the voice data is input into the LSTM-based speech feature extraction network to obtain the speech feature matrix A_t ∈ R^{M×D}, where M is the total number of input speech frames and D is the dimension of the speech feature matrix.

A_t = LSTM(Audio_mfcc) (11)
S324: for each image frame, the speech feature matrix within the corresponding window of a preset number of speech frames (set to 18 frames in this embodiment) is input into the audio encoder AutoVC.E_s to obtain the audio matrix, where w_lstm,S denotes the learnable parameters of the AutoVC.E_s audio encoder.
S325: a single-layer MLP reduces the dimension of the audio matrix from 256 to 128, which improves the generalization ability of the generated face video, particularly for target persons not observed during training.
S326: producing consistent head and upper-body movements requires capturing correlations over a longer time span than voice content movements. While speech audio information typically lasts tens of milliseconds, head movements (e.g., the head swinging from left to right) and upper-body movements (e.g., breathing) may last one or more seconds, or even orders of magnitude longer. To capture this long-term and structured dependency, the output is computed using a self-attention network: the speech content matrix and the audio matrix are input into the self-attention decoder to obtain the self-reconstructed audio movement matrix h_t, where w_attn,S denotes the trainable parameters of the self-attention network (self-attention encoder).
S327: the transformed speech content matrix is concatenated with the audio matrix and the two initial coordinate matrices. The weight assigned to each frame is calculated by a compatibility function that compares the representations of all the frames in the window. In all experiments, the window size was set to τ = 256 frames (4 seconds). The self-reconstructed audio movement matrix h_t, the initial head key point coordinate matrix q_head and the initial upper-body key point coordinate matrix q_body are input into the second multilayer perceptron MLP_s to generate the target-speaker-aware coordinate displacement. MLP_s predicts the final overall displacement Δp_t of the head key point coordinate matrix and the upper-body key point coordinate matrix for each frame image, where w_mlp,S denotes the learnable parameters of the second multilayer perceptron MLP_s (the MLP_s decoder).
Δp_t = MLP_s(h_t, q_head, q_body; w_mlp,S) (15)
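The window-level fusion in steps S326 and S327 rests on standard self-attention, where each frame's output is a compatibility-weighted mix of all frames in the window. A minimal single-head numpy sketch, with identity projections in place of the trained parameters w_attn,S and a small window instead of τ = 256:

```python
import numpy as np

def self_attention(X):
    """Minimal single-head self-attention over a window of frames.

    X has shape (tau, d). Pairwise dot-product scores act as the
    compatibility function comparing all frames in the window; a softmax
    turns them into per-frame mixing weights, capturing long-range
    dependencies such as slow head sway or breathing.
    """
    d = X.shape[1]
    scores = X @ X.T / np.sqrt(d)             # compatibility function
    scores -= scores.max(axis=1, keepdims=True)
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)   # softmax over the window
    return attn @ X                           # fused representation per frame

tau, d = 8, 16                                # patent uses tau = 256 (4 s)
X = np.random.default_rng(1).standard_normal((tau, d))
H = self_attention(X)
print(H.shape)  # (8, 16)
```

In the patent's pipeline the rows of X would be the concatenated speech content and audio features for each frame, and the fused output plays the role of the audio movement matrix h_t fed to MLP_s.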
S328: based on the head position prediction coordinates P_t of each frame image and the overall displacement Δp_t of the head and upper-body key point coordinate matrices of each frame image, the overall predicted coordinates y_t of each frame image are obtained.

y_t = P_t + Δp_t (16)
In practice, when pre-training each network model, the head motion and upper-body motion of the target speaker need to be matched in addition to capturing the key point positions. To this end, a discriminator network is created whose goal is to judge whether the temporal dynamics of the speaker's facial coordinates look "real" or fake. It takes as input the overall predicted coordinate sequence within the same window used in the generator, together with the speech content matrix and the audio matrix, and returns a representation r_t.
During training, an LSGAN loss function L_gan is used to train the discriminator parameters w_attn,D, treating the training coordinates as "real" and the generated coordinates as "fake" for each frame; one form consistent with this description is:

$$L_{gan} = \sum_{t=1}^{T}\left((r_t-1)^2+\hat{r}_t^2\right) \qquad (17)$$

where r_t denotes the discriminator output when the training coordinates y are used as its input, and \hat{r}_t denotes the discriminator output on the generated coordinates.
To train the parameters w_attn,S so as to maximize the "realism" of the output, a minimized loss function L_s is set whose training distance takes into account both the absolute positions and the graph Laplacian coordinates:

$$L_s = \sum_{t=1}^{T}\sum_{i=1}^{N}\left(\left\|y_{i,t}-\hat{y}_{i,t}\right\|_2^2+\lambda_s\left\|\mathcal{L}(y_{i,t})-\mathcal{L}(\hat{y}_{i,t})\right\|_2^2\right)+\mu_s\sum_{t=1}^{T}(r_t-1)^2 \qquad (18)$$

where λ_s = 1 and μ_s = 0.001 were chosen on a held-out validation set, y_{i,t} denotes the predicted coordinate position of the i-th head or upper-body key point in the t-th frame image, \hat{y}_{i,t} denotes the actual coordinate position of the i-th head or upper-body key point in the t-th frame image, \mathcal{L}(y_{i,t}) and \mathcal{L}(\hat{y}_{i,t}) denote their graph Laplacian coordinates, and r_t in the last term denotes the discriminator output on the generated coordinate sequence. As in the usual GAN approach, training alternates between the generator and the discriminator so that they improve each other.
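The least-squares GAN terms can be sketched in a few lines; this is the generic LSGAN form (real pushed toward 1, fake toward 0, generator pushing fakes toward 1), with the patent's per-frame weighting and network internals omitted:

```python
import numpy as np

def lsgan_d_loss(r_real, r_fake):
    """Least-squares GAN discriminator loss: push discriminator outputs on
    training ("real") coordinate sequences toward 1 and outputs on
    generated ("fake") sequences toward 0."""
    return np.mean((r_real - 1.0) ** 2) + np.mean(r_fake ** 2)

def lsgan_g_realism(r_fake):
    """Generator realism term (weighted by mu_s = 0.001 in the overall
    loss): push the discriminator output on generated coordinates
    toward 1, i.e. make the fakes look real."""
    return np.mean((r_fake - 1.0) ** 2)

r_real = np.array([0.9, 1.1, 1.0])   # discriminator outputs on real frames
r_fake = np.array([0.1, 0.0, -0.1])  # discriminator outputs on fakes
print(lsgan_d_loss(r_real, r_fake))
print(lsgan_g_realism(r_fake))
```

With near-perfect discriminator scores as above, the discriminator loss is small while the generator realism term stays large, which is exactly the gradient signal that drives the alternating training.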
S4: and generating a video image frame sequence based on the multi-dimensional mapping relation. The method specifically comprises the following steps:
the initial head key point coordinate matrix q_head, the head position prediction coordinates P_t of each frame image and the overall predicted coordinates y_t of each frame image are input into the video frame reconstruction model and fused to obtain the reconstructed video frame sequence Q_trg.
The minimized loss function L_a adopted in training the video frame reconstruction model is as follows:

$$L_a = \left\|Q_{trg}-\hat{Q}_{trg}\right\|_1+\lambda_a\left\|\phi(Q_{trg})-\phi(\hat{Q}_{trg})\right\|_1 \qquad (19)$$

where src denotes the source video frame, trg denotes the target video frame, Q_{trg} denotes the reconstructed video frame, \hat{Q}_{trg} denotes the real video frame, λ_a denotes a weight coefficient with λ_a = 1, φ(Q_{trg}) and φ(\hat{Q}_{trg}) denote features of the reconstructed and real video frames extracted by a pre-trained ResNet-34 network, and ||·||_1 denotes the L1 norm.
The video frame reconstruction model is an encoder-decoder network that performs image-to-image conversion to produce the video frame images. The encoder uses 6 convolutional layers, each containing a stride-2 convolution, followed by two residual blocks that form a bottleneck; the bottleneck is then decoded by a symmetric upsampling decoder. Skip connections between the symmetric layers of the encoder and decoder are used when generating each frame image. Since the coordinates vary smoothly over time, the output images formed by interpolating these coordinates exhibit temporal coherence.
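The shape flow of such an encoder-decoder can be sketched without the learned weights: stride-2 stages halve the spatial size down to a bottleneck, and a symmetric decoder doubles it back, fusing the matching encoder activation at each scale (the skip connections). Average pooling and nearest-neighbor upsampling below are assumptions standing in for the learned convolutions.

```python
import numpy as np

# Hedged sketch of the encoder/decoder symmetry and skip connections; three stages
# are used here instead of the patent's six so a small example fits.

def encode(x, stages=3):
    """Halve H and W at each stage (2x2 average pool); keep activations for skips."""
    skips = []
    for _ in range(stages):
        skips.append(x)
        h, w = x.shape
        x = x.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))
    return x, skips

def decode(x, skips):
    """Double H and W at each stage (nearest upsample) and fuse the matching skip."""
    for skip in reversed(skips):
        x = x.repeat(2, axis=0).repeat(2, axis=1)
        x = 0.5 * (x + skip)  # simple averaging in place of concat + convolution
    return x
```

The skip connections let fine spatial detail bypass the bottleneck, which is why each symmetric layer pair is linked when generating the frame images.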
S5: splicing the video image frame sequence with the voice data to obtain the target person speech audio/video. This step specifically comprises:
The video image frame sequence and the voice data are spliced using ffmpeg to obtain the target person speech video; in this embodiment, the frame rate of the target person speech audio/video is 29 frames per second.
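The splicing step can be sketched as assembling an ffmpeg command that muxes the rendered frame sequence with the speech audio at 29 frames per second. The file names and encoder choices below are assumptions for illustration, not taken from the patent.

```python
# Hedged sketch: build the ffmpeg argument list for muxing frames with audio.
# Run it with subprocess.run(cmd, check=True) once ffmpeg is installed.

def ffmpeg_mux_command(frame_pattern="frame_%04d.png",
                       audio_path="speech.wav",
                       out_path="target_person.mp4",
                       fps=29):
    """Return the ffmpeg command as an argument list."""
    return [
        "ffmpeg",
        "-framerate", str(fps), "-i", frame_pattern,  # image-sequence input
        "-i", audio_path,                             # audio input
        "-c:v", "libx264", "-pix_fmt", "yuv420p",     # broadly compatible video
        "-c:a", "aac",
        "-shortest",                                  # stop at the shorter stream
        out_path,
    ]
```

Note that `-framerate` is placed before the image-sequence `-i` so it applies to that input, and `-shortest` keeps the audio and video durations aligned.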
Example 2
The embodiment provides a voice-driven target person video generation device, which comprises:
the data acquisition module is used for acquiring voice data and a frontal image of the person's upper body;
the key point extraction module is used for extracting an initial head key point coordinate matrix and an initial upper-body key point coordinate matrix based on the acquired frontal upper-body image;
the voice data separation module is used for separating content information and audio information from the acquired voice data;
the voice-image mapping module is used for training a multi-dimensional mapping relation among the content information, the audio information, the head key point coordinates and the upper-body key point coordinates based on the content information, the audio information, the initial head key point coordinate matrix and the initial upper-body key point coordinate matrix;
the video frame generation module is used for generating a video image frame sequence based on the multi-dimensional mapping relation;
and the video generation module is used for splicing the video image frame sequence with the voice data to obtain the target person speech audio/video.
Example 3
The present embodiment provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the voice-driven target person video generation method described above.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It is understood that the same or similar parts in the above embodiments may be mutually referred to, and the same or similar parts in other embodiments may be referred to for the content which is not described in detail in some embodiments.
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing specific logical functions or steps of the process, and alternate implementations are included within the scope of the preferred embodiment of the present invention in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the present invention.
Although embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present invention, and that variations, modifications, substitutions and alterations can be made to the above embodiments by those of ordinary skill in the art within the scope of the present invention.
Claims (10)
1. A voice-driven target person video generation method is characterized by comprising the following steps:
acquiring voice data and a frontal image of a person's upper body;
extracting an initial head key point coordinate matrix and an initial upper-body key point coordinate matrix based on the acquired frontal image of the person's upper body;
separating voice content information and audio information based on the acquired voice data;
training a multi-dimensional mapping relation among the voice content information, the audio information, the head key point coordinates and the upper-body key point coordinates based on the voice content information, the audio information, the initial head key point coordinate matrix and the initial upper-body key point coordinate matrix;
generating a video image frame sequence based on the multi-dimensional mapping relation;
and splicing the video image frame sequence with the voice data to obtain the target person speech audio/video.
2. The voice-driven target person video generation method according to claim 1, wherein the extracting of an initial head key point coordinate matrix and an initial upper-body key point coordinate matrix based on the acquired frontal image of the person's upper body comprises:
dividing the frontal image of the person's upper body into a head image and an upper-body image;
extracting the initial head key point coordinate matrix from the head image using a head key point extraction model;
and extracting the initial upper-body key point coordinate matrix from the upper-body image using an upper-body key point extraction model.
3. The voice-driven target person video generation method according to claim 1, wherein the separating of the voice content information and the audio information based on the acquired voice data includes:
extracting a Mel spectrogram feature matrix of the voice data;
inputting the Mel spectrogram feature matrix into an LSTM-based voice feature extraction network to obtain a voice feature matrix;
for each frame image, inputting the corresponding voice feature matrix in the preset voice frame number window into a voice content encoder to obtain a voice content matrix;
and for each frame image, inputting the corresponding voice feature matrix in the preset voice frame number window into an audio encoder to obtain an audio matrix.
4. The method for generating video of a voice-driven target person according to claim 3, wherein the training of the multidimensional mapping relationship between the voice content information and the audio information and the coordinates of the head key point and the coordinates of the upper body key point based on the voice content information, the audio information, the coordinate matrix of the head key point and the coordinate matrix of the upper body key point specifically comprises:
inputting the voice content matrix and the initial head key point coordinate matrix into a first multilayer perceptron, and predicting the displacement of the head key point coordinate matrix in each frame of image;
obtaining a head position prediction coordinate of each frame of image based on the initial head key point coordinate matrix and the displacement of the head key point coordinate matrix in each frame of image;
based on a self-attention network, fusing the voice content matrix and the audio matrix to obtain a self-reconfigurable audio mobile matrix;
inputting the self-reconfigurable audio movement matrix, the initial head key point coordinate matrix and the initial upper-body key point coordinate matrix into a second multilayer perceptron, and predicting the overall displacement of the head key point coordinate matrix and the upper-body key point coordinate matrix for each frame of image;
and obtaining the overall predicted coordinate of each frame of image based on the head position predicted coordinate of each frame of image, the head key point coordinate matrix of each frame of image and the overall displacement of the upper body key point coordinate matrix.
5. The method according to claim 4, wherein the speech content encoder and the first multilayer perceptron used in training the mapping relation between the speech content matrix and the head position prediction coordinates are obtained by synchronous training on collected video data, and the loss function L_c minimized during training is as follows:
in the formula, P_i,t denotes the predicted coordinate position of the i-th head key point in the t-th frame image, P̂_i,t denotes the actual coordinate position of the i-th head key point in the t-th frame image, λ_c denotes a weight coefficient, L(P_i,t) and L(P̂_i,t) denote the graph Laplacian coordinates of P_i,t and P̂_i,t, respectively, N denotes the total number of head key points, T denotes the total number of image frames, and ||·||_2 denotes the L2 norm.
6. The method as claimed in claim 4, wherein the generating a video image frame sequence based on multi-dimensional mapping comprises:
and inputting the initial head key point coordinate matrix, the head position prediction coordinate of each frame of image and the overall prediction coordinate of each frame of image into a video frame reconstruction model, and fusing to obtain a reconstructed video frame sequence.
7. The method according to claim 6, wherein the loss function L_a minimized during training of the video frame reconstruction model is as follows:
where src denotes the source video frame, trg denotes the target video frame, Q_trg denotes the reconstructed video frame, Q̂_trg denotes the real video frame, λ_a denotes a weight coefficient with λ_a = 1, φ(Q_trg) and φ(Q̂_trg) denote the features of the reconstructed and real video frames, respectively, extracted by a pre-trained ResNet-34 network, and ||·||_1 denotes the L1 norm.
8. The voice-driven target person video generation method according to claim 4, wherein the splicing of the video image frame sequence with the voice data to obtain the target person speech audio/video comprises:
and splicing the video image frame sequence with the voice data using ffmpeg to obtain the target person speech video, wherein the number of frames per second of the target person speech audio/video is a preset value.
9. A voice-driven target person video generating device, comprising:
the data acquisition module is used for acquiring voice data and a frontal image of the person's upper body;
the key point extraction module is used for extracting an initial head key point coordinate matrix and an initial upper-body key point coordinate matrix based on the acquired frontal upper-body image;
the voice data separation module is used for separating content information and audio information from the acquired voice data;
the voice-image mapping module is used for training a multi-dimensional mapping relation among the content information, the audio information, the head key point coordinates and the upper-body key point coordinates based on the content information, the audio information, the initial head key point coordinate matrix and the initial upper-body key point coordinate matrix;
the video frame generation module is used for generating a video image frame sequence based on the multi-dimensional mapping relation;
and the video generation module is used for splicing the video image frame sequence with the voice data to obtain the target person speech audio/video.
10. A computer-readable storage medium storing a computer program, wherein the computer program is executed by a processor to implement the voice-driven target person video generation method according to any one of claims 1 to 8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111466434.7A CN114202604A (en) | 2021-11-30 | 2021-11-30 | Voice-driven target person video generation method and device and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111466434.7A CN114202604A (en) | 2021-11-30 | 2021-11-30 | Voice-driven target person video generation method and device and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114202604A true CN114202604A (en) | 2022-03-18 |
Family
ID=80650381
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111466434.7A Pending CN114202604A (en) | 2021-11-30 | 2021-11-30 | Voice-driven target person video generation method and device and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114202604A (en) |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115100329A (en) * | 2022-06-27 | 2022-09-23 | 太原理工大学 | Multi-mode driving-based emotion controllable facial animation generation method |
CN115996303A (en) * | 2023-03-23 | 2023-04-21 | 科大讯飞股份有限公司 | Video generation method, device, electronic equipment and storage medium |
CN116342835A (en) * | 2023-03-31 | 2023-06-27 | 华院计算技术(上海)股份有限公司 | Face three-dimensional surface grid generation method, device, computing equipment and storage medium |
WO2023241298A1 (en) * | 2022-06-16 | 2023-12-21 | 虹软科技股份有限公司 | Video generation method and apparatus, storage medium and electronic device |
CN117478818A (en) * | 2023-12-26 | 2024-01-30 | 荣耀终端有限公司 | Voice communication method, terminal and storage medium |
WO2024078243A1 (en) * | 2022-10-13 | 2024-04-18 | 腾讯科技(深圳)有限公司 | Training method and apparatus for video generation model, and storage medium and computer device |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN114202604A (en) | Voice-driven target person video generation method and device and storage medium | |
CN113192161B (en) | Virtual human image video generation method, system, device and storage medium | |
CN110874557B (en) | Voice-driven virtual face video generation method and device | |
Cudeiro et al. | Capture, learning, and synthesis of 3D speaking styles | |
US11776188B2 (en) | Style-aware audio-driven talking head animation from a single image | |
CN113194348B (en) | Virtual human lecture video generation method, system, device and storage medium | |
US6735566B1 (en) | Generating realistic facial animation from speech | |
Sargin et al. | Analysis of head gesture and prosody patterns for prosody-driven head-gesture animation | |
US20100057455A1 (en) | Method and System for 3D Lip-Synch Generation with Data-Faithful Machine Learning | |
CN115100329B (en) | Multi-mode driving-based emotion controllable facial animation generation method | |
US7257538B2 (en) | Generating animation from visual and audio input | |
EP4010899A1 (en) | Audio-driven speech animation using recurrent neutral network | |
Gururani et al. | Space: Speech-driven portrait animation with controllable expression | |
Potamianos et al. | Joint audio-visual speech processing for recognition and enhancement | |
Liu et al. | Synthesizing talking faces from text and audio: an autoencoder and sequence-to-sequence convolutional neural network | |
CN113516990A (en) | Voice enhancement method, method for training neural network and related equipment | |
Lavagetto | Time-delay neural networks for estimating lip movements from speech analysis: A useful tool in audio-video synchronization | |
CN116828129B (en) | Ultra-clear 2D digital person generation method and system | |
CN117409121A (en) | Fine granularity emotion control speaker face video generation method, system, equipment and medium based on audio frequency and single image driving | |
Wang et al. | CA-Wav2Lip: Coordinate Attention-based Speech To Lip Synthesis In The Wild | |
Asadiabadi et al. | Multimodal speech driven facial shape animation using deep neural networks | |
CN117237521A (en) | Speech driving face generation model construction method and target person speaking video generation method | |
Liu et al. | Real-time speech-driven animation of expressive talking faces | |
Caplier et al. | Image and video for hearing impaired people | |
Wang et al. | Flow2Flow: Audio-visual cross-modality generation for talking face videos with rhythmic head |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||