CN114202604A - Voice-driven target person video generation method and device and storage medium - Google Patents

Voice-driven target person video generation method and device and storage medium

Info

Publication number
CN114202604A
CN114202604A (application CN202111466434.7A)
Authority
CN
China
Prior art keywords
key point
voice
video
image
matrix
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111466434.7A
Other languages
Chinese (zh)
Inventor
王波
吴笛
张沅
刘吉伟
罗东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Great Wall Information Co Ltd
Original Assignee
Great Wall Information Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Great Wall Information Co Ltd filed Critical Great Wall Information Co Ltd
Priority to CN202111466434.7A
Publication of CN114202604A
Legal status: Pending

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 13/00 Animation
    • G06T 13/20 3D [Three Dimensional] animation
    • G06T 13/205 3D [Three Dimensional] animation driven by audio data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 13/00 Animation
    • G06T 13/20 3D [Three Dimensional] animation
    • G06T 13/40 3D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L 25/24 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L 25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The invention discloses a voice-driven target person video generation method, device and storage medium. The method comprises the following steps: acquiring voice data and a front image of a person's upper body; extracting an initial head key point coordinate matrix and an initial upper body key point coordinate matrix from the acquired upper body front image; separating voice content information and audio information from the acquired voice data; training a multi-dimensional mapping relation among the voice content information, the audio information, the head key point coordinates and the upper body key point coordinates based on the voice content information, the audio information, the initial head key point coordinate matrix and the initial upper body key point coordinate matrix; generating a video image frame sequence based on the multi-dimensional mapping relation; and splicing the video image frame sequence with the voice data to obtain the target person speech audio/video. The linkage between head motion and the upper body is fully considered, and the generated video is natural and highly realistic.

Description

Voice-driven target person video generation method and device and storage medium
Technical Field
The invention relates to the technical field of computer information, in particular to a method and a device for generating a video of a voice-driven target person and a storage medium.
Background
Voice-driven face video generation is an important research direction in human-computer interaction: an arbitrary segment of speech is input, and a video of a target person speaking with synchronized voice and mouth shape is generated. Existing approaches are mainly end-to-end trained encoder-decoder frameworks. Chen et al. use an attention mechanism to train facial features and improve video accuracy. Mittal et al. propose separating speech content from emotion information and controlling face and head motion with different feature dimensions. Edwards et al. map speech to the face in multiple dimensions to control facial motion. Qian et al. propose AutoVC, a few-shot voice conversion approach that separates speech into content and speaker identity information. Zhou et al. propose separating speech into content and target person information: the content information strongly controls the motion of the lips and nearby facial regions, while the target person information determines facial expression details and other dynamics of the head. Nian et al. propose a face-and-mouth key point method that uses face contour key points and lip key points to represent the head motion and lip motion of the target person, respectively. Although these prior methods work well, they mainly focus on facial expression and lip motion and ignore the linkage between head motion and the upper body. The generated talking videos of the target person basically contain only facial motion and a small amount of head motion, and the head and upper body are not naturally linked.
Disclosure of Invention
The invention provides a voice-driven target person video generation method, device and storage medium, aiming to solve the problem that the target person talking videos generated by the prior art basically contain only facial motion and a small amount of head motion, with unnatural linkage between the head and the upper body.
In a first aspect, a method for generating a video of a voice-driven target person is provided, which includes:
acquiring voice data and a front image of the person's upper body;
extracting an initial head key point coordinate matrix and an initial upper body key point coordinate matrix based on the acquired front image of the upper body of the person;
separating voice content information and audio information based on the acquired voice data;
training a multi-dimensional mapping relation among the voice content information, the audio information, the head key point coordinate and the upper half key point coordinate based on the voice content information, the audio information, the initial head key point coordinate matrix and the initial upper half key point coordinate matrix;
generating a video image frame sequence based on a multi-dimensional mapping relation;
and splicing the video image frame sequence with the voice data to obtain the target person speech audio/video.
Further, the extracting an initial head key point coordinate matrix and an initial upper body key point coordinate matrix based on the acquired image of the front of the upper body of the person includes:
dividing the person's upper body front image into a head image and an upper body image;
extracting an initial head key point coordinate matrix from the head image by using a head key point extraction model;
and extracting an initial upper body key point coordinate matrix in the upper body image by adopting an upper body key point extraction model.
Further, the separating the voice content information and the audio information based on the acquired voice data includes:
extracting a Mel spectrogram feature matrix of the voice data;
inputting the Mel spectrogram feature matrix into an LSTM-based voice feature extraction network to obtain a voice feature matrix;
for each frame image, inputting the corresponding voice feature matrix in the preset voice frame number window into a voice content encoder to obtain a voice content matrix;
and for each frame image, inputting the corresponding voice feature matrix in the preset voice frame number window into an audio encoder to obtain an audio matrix.
Further, the training of the multi-dimensional mapping relation among the voice content information, the audio information, the head key point coordinates and the upper body key point coordinates based on the voice content information, the audio information, the initial head key point coordinate matrix and the initial upper body key point coordinate matrix specifically includes:
inputting the voice content matrix and the initial head key point coordinate matrix into a first multilayer perceptron, and predicting the displacement of the head key point coordinate matrix in each frame of image;
obtaining a head position prediction coordinate of each frame of image based on the initial head key point coordinate matrix and the displacement of the head key point coordinate matrix in each frame of image;
fusing the voice content matrix and the audio matrix based on a self-attention network to obtain a self-reconstructed audio movement matrix;
inputting the self-reconstructed audio movement matrix, the initial head key point coordinate matrix and the initial upper body key point coordinate matrix into a second multilayer perceptron, and predicting the overall displacement of the head key point coordinate matrix and the upper body key point coordinate matrix for each frame of image;
and obtaining the overall predicted coordinate of each frame of image based on the head position predicted coordinate of each frame of image, the head key point coordinate matrix of each frame of image and the overall displacement of the upper body key point coordinate matrix.
Further, in training the mapping relation between the voice content matrix and the head position prediction coordinates, the voice content encoder and the first multilayer perceptron that are used can be obtained in advance by joint training on collected video data, and the minimized loss function L_c adopted during training is as follows:
L_c = Σ_{t=1}^{T} Σ_{i=1}^{N} ( ||P_{i,t} − P̂_{i,t}||_2 + λ_c·||L(P_{i,t}) − L(P̂_{i,t})||_2 )
where P_{i,t} denotes the predicted coordinate position of the i-th head key point in the t-th frame image, P̂_{i,t} denotes the actual coordinate position of the i-th head key point in the t-th frame image, λ_c denotes a weight coefficient, L(P_{i,t}) and L(P̂_{i,t}) denote the graph Laplacian coordinates of P_{i,t} and P̂_{i,t} respectively, N denotes the total number of head key points, T denotes the total number of image frames, and ||·||_2 denotes the L2 norm.
Further, the generating a video image frame sequence based on the multi-dimensional mapping relationship includes:
and inputting the initial head key point coordinate matrix, the head position prediction coordinate of each frame of image and the overall prediction coordinate of each frame of image into a video frame reconstruction model, and fusing to obtain a reconstructed video frame sequence.
Further, the minimized loss function L_a adopted in training the video frame reconstruction model is as follows:
L_a = ||Q_trg − Q̂_trg||_1 + λ_a·||φ(Q_trg) − φ(Q̂_trg)||_1
where src denotes the source video frame and trg the target video frame, Q_trg denotes the reconstructed video frame, Q̂_trg denotes the real video frame, λ_a denotes a weight coefficient with λ_a = 1, φ(Q_trg) and φ(Q̂_trg) denote the features of the reconstructed and real video frames extracted by a pre-trained ResNet-34 network, and ||·||_1 denotes the L1 norm.
Further, the splicing of the video image frame sequence with the voice data to obtain the target person speech audio/video includes:
splicing the video image frame sequence with the voice data using ffmpeg to obtain the target person speech video, wherein the number of frames per second of the target person speech audio/video is a preset value.
In a second aspect, a voice-driven target person video generation apparatus is provided, including:
the data acquisition module is used for acquiring voice data and a front image of the person's upper body;
the key point extraction module is used for extracting an initial head key point coordinate matrix and an initial upper body key point coordinate matrix based on the acquired upper body front image;
the voice data separation module is used for separating content information and audio information based on the acquired voice data;
the voice image mapping module is used for training a multi-dimensional mapping relation among the content information, the audio information, the head key point coordinate and the upper half key point coordinate based on the content information, the audio information, the initial head key point coordinate matrix and the initial upper half key point coordinate matrix;
the video frame generation module is used for generating a video image frame sequence based on a multi-dimensional mapping relation;
and the video generation module is used for splicing the video image frame sequence with the voice data to obtain the target person speech audio/video.
In a third aspect, a computer-readable storage medium is provided, which stores a computer program executed by a processor to implement the voice-driven target person video generation method as described above.
Advantageous effects
The invention provides a voice-driven target person video generation method, device and storage medium. In the technical scheme of the invention, the voice content information and the audio information in the voice data are first separated; the motion of the head is controlled by the voice content information, and the natural swing of the head and upper body is controlled by the audio information. Multi-dimensional mapping training is performed between the voice content information and audio information on one side and the head key points and upper body key points on the other side to generate the mapping relation, a video image frame sequence is obtained, and finally the front-view speech video of the target person's upper body is generated. The linkage between head motion and the upper body is fully considered, and the generated video is natural and highly realistic.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
Fig. 1 is a flowchart of a method for generating a voice-driven target person video according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be described in detail below. It is to be understood that the described embodiments are merely exemplary of the invention, and not restrictive of the full scope of the invention. All other embodiments, which can be derived by a person skilled in the art from the examples given herein without any inventive step, are within the scope of the present invention.
Example 1
The embodiment provides a voice-driven target person video generation method, which comprises the following steps:
S1: Acquire voice data and a front image of the person's upper body. In implementation, the upper body front image includes the head and the upper body part, and there may be one or more such images.
S2: Extract an initial head key point coordinate matrix and an initial upper body key point coordinate matrix based on the acquired upper body front image. More specifically, this comprises the following steps:
S21: The upper body front image is preprocessed to a uniform size, for example 768 × 768. The image is split into a head image and an upper body image by fixed rectangular cropping; the crop region is determined by the coordinates of just two points, the upper left corner and the lower right corner. For example:
S211: With the head's upper left corner coordinates (256, 0) and lower right corner coordinates (512, 256), the head image img_head is obtained by formula (1):
img_head = Image[256:512, 0:256, :]    (1)
S212: The upper body image does not need the head, so only the y coordinates 256 to 768 of the picture are required, with all x-coordinate pixels kept, and the upper body image img_body is obtained by formula (2):
img_body = Image[:, 256:768, :]    (2)
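As a concrete illustration of S21-S212, the cropping can be written directly as array slicing. This is a minimal sketch in which the use of OpenCV for reading and resizing and the file name are assumptions; the slice axes simply follow formulas (1) and (2) as given above.

```python
import cv2

# Read the upper body front image and resize it to the uniform 768 x 768
# resolution described in S21 (file name and OpenCV I/O are illustrative).
image = cv2.imread("person_upper_body.jpg")
image = cv2.resize(image, (768, 768))

# S211: head crop defined by the upper left corner (256, 0) and the
# lower right corner (512, 256), as in formula (1).
img_head = image[256:512, 0:256, :]

# S212: upper body crop keeping coordinates 256..767 on the second axis
# and all pixels on the first axis, as in formula (2).
img_body = image[:, 256:768, :]
```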
S22: The head key point extraction model is used to extract the initial head key point coordinate matrix from the head image. In this embodiment, the open-source DLIB face detection and key point extraction toolkit is used directly to obtain the 68-point initial head key point coordinate matrix q_head of the head image img_head; the head key point coordinates contain the lip motion information and head motion information of the face.
q_head = DLIB(img_head)    (3)
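A minimal sketch of the 68-point extraction in S22 with the open-source DLIB toolkit follows. The shape-predictor file path is an assumption (the standard pre-trained shape_predictor_68_face_landmarks.dat model must be downloaded separately), and DLIB returns 2D landmarks, whereas q_head is later treated as a 68 × 3 matrix, so any lifting to 3D is outside this sketch.

```python
import dlib
import numpy as np

# DLIB frontal face detector and the pre-trained 68-landmark predictor
# (the model file path is illustrative).
detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def extract_head_keypoints(img_head):
    """Return a 68 x 2 matrix of initial head key point coordinates."""
    faces = detector(img_head, 1)          # upsample once to help small crops
    if not faces:
        raise ValueError("no face detected in the head image")
    shape = predictor(img_head, faces[0])  # 68 facial landmarks
    return np.array([[p.x, p.y] for p in shape.parts()], dtype=np.float32)
```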
S23: The upper body key point extraction model is used to extract the initial upper body key point coordinate matrix from the upper body image. In this embodiment, the open-source Inception_v4 model is used to obtain the 128-dimensional initial key point coordinate matrix q_body of the upper body image img_body.
q_body = Inception_v4(img_body)    (4)
S3: training a multi-dimensional mapping relation among the voice content information, the audio information, the head key point coordinate and the upper half key point coordinate based on the voice content information, the audio information, the initial head key point coordinate matrix and the initial upper half key point coordinate matrix.
A piece of voice data can be divided into voice content information and audio information, which are obtained with different models. The voice content information mainly determines the general motion of the lips and the nearby regions, while the audio information determines the dynamics of the head and the fine motion of the body. For example, while a person is speaking, the head and facial expression change and there is no obvious breathing motion, whereas during short pauses the changes in head and facial expression diminish and breathing motion becomes noticeable. The training specifically comprises the following steps.
S31: Voice content information-image training
The coordinate mapping of the head feature key points, mainly the key points of the lips and the surrounding face, is obtained by training on the voice content signal.
S311: The voice data is read in python with the librosa library.
S312: The Mel spectrogram feature matrix Audio_mfcc of the voice data is extracted with the python_speech_features package.
S313: The Mel spectrogram feature matrix Audio_mfcc of the voice data is input into the LSTM-based voice feature extraction network to obtain the voice feature matrix A_t ∈ R^{M×D}, where M is the total number of input speech frames and D is the dimension of the voice feature matrix.
A_t = LSTM(Audio_mfcc)    (5)
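A sketch of the feature extraction in S311-S313 under stated assumptions: the audio file name, sampling rate and the choice of 13 cepstral coefficients are illustrative, and the three-layer LSTM with hidden size 256 follows the architecture note given later in S315.

```python
import librosa
import torch
import torch.nn as nn
from python_speech_features import mfcc

# S311: read the voice data with librosa (file name and rate are illustrative).
signal, sr = librosa.load("speech.wav", sr=16000)

# S312: Mel-cepstral feature matrix Audio_mfcc of shape (M, 13).
audio_mfcc = mfcc(signal, samplerate=sr, numcep=13)

# S313: LSTM-based voice feature extraction network producing A_t in R^{M x D}.
class VoiceFeatureLSTM(nn.Module):
    def __init__(self, in_dim=13, hidden=256, layers=3):
        super().__init__()
        self.lstm = nn.LSTM(in_dim, hidden, num_layers=layers, batch_first=True)

    def forward(self, x):          # x: (1, M, in_dim)
        out, _ = self.lstm(x)
        return out                 # (1, M, hidden), i.e. the matrix A_t

A_t = VoiceFeatureLSTM()(torch.from_numpy(audio_mfcc).float().unsqueeze(0))
```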
S314: The sampling rate of the voice data is much higher than the video frame rate, so multiple frames of voice data correspond to one image frame within the corresponding time window. Therefore, for each image frame, the voice feature matrices within the corresponding window of a preset number of speech frames (set to 18 frames in this embodiment) are input into the AutoVC voice content encoder E_c to obtain the voice content matrix c_t:
c_t = AutoVC.E_c(A_t; w_lstm,C)    (6)
S315: The voice content matrix c_t and the initial head key point coordinate matrix q_head, with q_head ∈ R^{68×3}, are input into the first multilayer perceptron (MLP_c) to predict the displacement Δq_t of the head key point coordinate matrix in each frame of image:
Δq_t = MLP_c(c_t, q_head; w_mlp,C)    (7)
where {w_lstm,C, w_mlp,C} are the learnable parameters of the AutoVC.E_c and MLP_c networks, respectively. The LSTM has three layers of cells, each with an internal hidden state vector of size 256. The decoder (first multilayer perceptron) MLP_c has three layers, with internal hidden state vector sizes of 512, 256 and 204 (68 × 3), respectively.
S316: Based on the initial head key point coordinate matrix q_head and the displacement Δq_t of the head key point coordinate matrix in each frame of image, the head position prediction coordinates P_t of each frame of image are obtained:
P_t = q_head + Δq_t    (8)
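A sketch of the displacement decoder of formulas (7) and (8): a three-layer perceptron with hidden sizes 512 and 256 and a 204-dimensional output (68 × 3), as described in S315. The 256-dimensional voice content vector, the ReLU activations and the concatenation of c_t with the flattened q_head are assumptions of this sketch.

```python
import torch
import torch.nn as nn

class MLPc(nn.Module):
    """First multilayer perceptron: predicts the per-frame head key point
    displacement delta_q_t from the voice content matrix c_t and the
    initial head key point coordinates q_head (formula (7))."""
    def __init__(self, content_dim=256, n_kp=68):
        super().__init__()
        self.n_kp = n_kp
        self.net = nn.Sequential(
            nn.Linear(content_dim + n_kp * 3, 512), nn.ReLU(),
            nn.Linear(512, 256), nn.ReLU(),
            nn.Linear(256, n_kp * 3),          # 204 = 68 x 3 displacements
        )

    def forward(self, c_t, q_head):
        # c_t: (B, content_dim), q_head: (B, 68, 3)
        x = torch.cat([c_t, q_head.flatten(start_dim=1)], dim=-1)
        return self.net(x).view(-1, self.n_kp, 3)

# Formula (8): head position prediction coordinates for frame t.
# P_t = q_head + MLPc()(c_t, q_head)
```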
In implementation, when training the mapping relation between the voice content matrix and the head position prediction coordinates, the voice content encoder and the first multilayer perceptron that are used can be obtained in advance by joint training on collected video data. The minimized loss function L_c adopted during training evaluates the distance between the actual and predicted coordinate positions, as well as the distance between their respective graph Laplacian coordinates, which helps place the coordinates correctly relative to each other and preserves head detail. The loss function is formulated as follows:
L_c = Σ_{t=1}^{T} Σ_{i=1}^{N} ( ||P_{i,t} − P̂_{i,t}||_2 + λ_c·||L(P_{i,t}) − L(P̂_{i,t})||_2 )    (9)
where P_{i,t} denotes the predicted coordinate position of the i-th head key point in the t-th frame image, P̂_{i,t} denotes the actual coordinate position of the i-th head key point in the t-th frame image, λ_c denotes a weight coefficient (λ_c = 1 in this embodiment), L(P_{i,t}) and L(P̂_{i,t}) denote the graph Laplacian coordinates of P_{i,t} and P̂_{i,t} respectively, N denotes the total number of head key points, T denotes the total number of image frames, and ||·||_2 denotes the L2 norm. The graph Laplacian coordinates are calculated by the following formula:
L(P_{i,t}) = P_{i,t} − (1/|N(P_i)|) · Σ_{P_j ∈ N(P_i)} P_{j,t}    (10)
where N(P_i) contains the coordinates adjacent to P_i belonging to the same facial part; 8 facial parts are used, each containing a predefined subset of the coordinates of the face template. L(P̂_{i,t}) is calculated in the same way as in formula (10) and is not repeated here.
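The graph Laplacian coordinate of formula (10) is simply the offset of each key point from the mean of its neighbours within the same facial part. A minimal sketch follows; the grouping of the 68 landmarks into 8 facial parts uses common DLIB index ranges, which are an assumption here, and all other points of a part are treated as neighbours.

```python
import numpy as np

# Illustrative grouping of the 68 landmarks into 8 facial parts
# (jaw, right brow, left brow, nose, right eye, left eye, outer lip, inner lip).
FACE_PARTS = [range(0, 17), range(17, 22), range(22, 27), range(27, 36),
              range(36, 42), range(42, 48), range(48, 60), range(60, 68)]

def graph_laplacian_coords(P):
    """P: (68, 3) key point coordinates -> (68, 3) graph Laplacian coordinates,
    L(P_i) = P_i - mean of the neighbouring points in the same facial part."""
    L = np.zeros_like(P)
    for part in FACE_PARTS:
        idx = np.array(list(part))
        for i in idx:
            neighbours = idx[idx != i]
            L[i] = P[i] - P[neighbours].mean(axis=0)
    return L
```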
s32: speech content information and audio information-image training
The audio information similarity between different utterances spoken by the same target speaker is maximized, while the similarity between different target speakers is minimized. Through training on the audio information of the speech, the mapping relation between the audio information and the corresponding head and upper body motion is obtained. The subtle correlated motion of the head and upper body is also a key factor in generating a realistic talking target person.
S321: The voice data is read in python with the librosa library.
S322: The Mel spectrogram feature matrix Audio_mfcc of the voice data is extracted with the python_speech_features package.
S323: The Mel spectrogram feature matrix Audio_mfcc of the voice data is input into the LSTM-based voice feature extraction network to obtain the voice feature matrix A_t ∈ R^{M×D}, where M is the total number of input speech frames and D is the dimension of the voice feature matrix.
A_t = LSTM(Audio_mfcc)    (11)
S324: For each image frame, the voice feature matrices within the corresponding window of a preset number of speech frames (set to 18 frames in this embodiment) are input into the AutoVC audio encoder E_s to obtain the audio matrix s_t:
s_t = AutoVC.E_s(A_t; w_lstm,S)    (12)
where w_lstm,S are the learnable parameters of the AutoVC.E_s audio encoder.
S325: A single-layer MLP is used to reduce the dimension of the audio matrix s_t from 256 down to 128, giving the reduced audio matrix s̃_t; this improves the generalization ability of the facial video, particularly for target persons not observed during training.
s̃_t = MLP(s_t)    (13)
S326: Producing consistent head and upper body motion requires capturing longer temporal correlations than voice content motion. Speech audio information typically lasts tens of milliseconds, whereas head motion (e.g., the head swinging from left to right) and upper body motion (e.g., breathing) may last one or more seconds, or even orders of magnitude longer. To capture this long-term, structured dependency, the output is computed with a self-attention network: the voice content matrix and the audio matrix are input into the self-attention encoder to obtain the self-reconstructed audio movement matrix h_t, where w_attn,s denotes the trainable parameters of the self-attention network (self-attention encoder).
h_t = Attn_s(c_t, s̃_t; w_attn,s)    (14)
S327: The transformed voice content matrix is concatenated with the audio matrix and the two sets of initial coordinates. The weight assigned to each frame is computed by a compatibility function that compares the representations of all frames in the window. In all experiments, the window size is set to τ = 256 frames (4 seconds). The self-reconstructed audio movement matrix h_t, the initial head key point coordinate matrix q_head and the initial upper body key point coordinate matrix q_body are input into the second multilayer perceptron MLP_s to generate the target-speaker-aware coordinate displacement: MLP_s predicts the final overall displacement Δp_t of the head key point coordinate matrix and upper body key point coordinate matrix for each frame of image, where w_mlp,S denotes the learnable parameters of the second multilayer perceptron MLP_s (the decoder).
Δp_t = MLP_s(h_t, q_head, q_body; w_mlp,S)    (15)
S328: Based on the head position prediction coordinates P_t of each frame of image and the overall displacement Δp_t of the head key point coordinate matrix and upper body key point coordinate matrix of each frame of image, the overall predicted coordinates y_t of each frame of image are obtained:
y_t = P_t + Δp_t    (16)
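A sketch of the speaker-aware branch of formulas (14)-(16) under stated assumptions: the self-attention is realized with PyTorch's nn.MultiheadAttention over the τ-frame window, and the feature dimensions (256-dimensional content, 128-dimensional reduced audio), the single hidden layer of MLP_s and the way q_head and q_body are flattened and concatenated are all illustrative.

```python
import torch
import torch.nn as nn

class SpeakerAwareHead(nn.Module):
    """Fuses the voice content matrix c_t and reduced audio matrix s_t with
    self-attention (formula (14)) and predicts the overall displacement
    delta_p_t of the head and upper body key points (formula (15))."""
    def __init__(self, content_dim=256, audio_dim=128,
                 n_head_kp=68, n_body_kp=128, heads=4):
        super().__init__()
        d_model = content_dim + audio_dim
        self.attn = nn.MultiheadAttention(d_model, heads, batch_first=True)
        kp_dim = (n_head_kp + n_body_kp) * 3
        self.mlp_s = nn.Sequential(
            nn.Linear(d_model + kp_dim, 512), nn.ReLU(),
            nn.Linear(512, kp_dim),
        )

    def forward(self, c, s, q_head, q_body):
        # c: (B, tau, content_dim), s: (B, tau, audio_dim) feature windows.
        x = torch.cat([c, s], dim=-1)
        h, _ = self.attn(x, x, x)                            # h_t, formula (14)
        q0 = torch.cat([q_head, q_body], dim=1).flatten(1)   # (B, kp_dim)
        q0 = q0.unsqueeze(1).expand(-1, h.size(1), -1)
        return self.mlp_s(torch.cat([h, q0], dim=-1))        # delta_p_t, (15)

# Formula (16): y_t = P_t + delta_p_t gives the overall predicted coordinates.
```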
In implementation, when pre-training each network model, in addition to capturing the key point positions, the head motion and upper body motion of the target speaker need to be matched. To this end, a discriminator network is created whose goal is to judge whether the temporal dynamics of the speaker's facial coordinates look "real" or fake. It takes the generated overall predicted coordinate sequence within the same window used by the generator, together with the voice content matrix and the audio matrix, as input, and returns a representation r_t:
r_t = Attn_d(y_{t−τ:t}, c_t, s̃_t; w_attn,d)    (17)
During training, the LSGAN loss function L_gan is used to train the discriminator parameters w_attn,d; for each frame, the training (ground-truth) coordinates are regarded as "real" and the generated coordinates as "fake":
L_gan = Σ_{t=1}^{T} ( (1 − r̂_t)^2 + r_t^2 )    (18)
where r̂_t denotes the discriminator output when the training coordinates are used as its input, and r_t denotes the discriminator output for the generated coordinates.
To train the parameters w_attn,s and maximize the "realism" of the output, a minimized loss function L_s is set that accounts for the training distance in both the absolute positions and the graph Laplacian coordinates:
L_s = Σ_{t=1}^{T} Σ_{i=1}^{N} ( ||y_{i,t} − ŷ_{i,t}||_2 + λ_s·||L(y_{i,t}) − L(ŷ_{i,t})||_2 ) + μ_s·Σ_{t=1}^{T} (1 − r_t)^2    (19)
where λ_s = 1 and μ_s = 0.001 are set on a held-out validation set, y_{i,t} denotes the predicted coordinate position of the i-th head or upper body key point in the t-th frame image, ŷ_{i,t} denotes the actual coordinate position of the i-th head or upper body key point in the t-th frame image, and L(y_{i,t}) and L(ŷ_{i,t}) denote the graph Laplacian coordinates of y_{i,t} and ŷ_{i,t} respectively. Training alternates between the generator and the discriminator so that they improve each other, as in the usual GAN approach.
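A sketch of the loss terms of formulas (18) and (19), assuming the discriminator emits one scalar realism score per frame; the reduction (sum versus mean) and the tensor shapes are illustrative choices.

```python
import torch

def lsgan_discriminator_loss(r_real, r_fake):
    """Formula (18): real (training) coordinates labelled 1, generated
    coordinates labelled 0; r_real, r_fake are per-frame discriminator scores."""
    return ((1.0 - r_real) ** 2).sum() + (r_fake ** 2).sum()

def speaker_aware_loss(y_pred, y_true, L_pred, L_true, r_fake,
                       lambda_s=1.0, mu_s=0.001):
    """Formula (19): absolute-position distance plus graph-Laplacian distance,
    plus a realism term pushing the discriminator score of generated
    coordinates towards 1."""
    pos = torch.norm(y_pred - y_true, dim=-1).sum()
    lap = torch.norm(L_pred - L_true, dim=-1).sum()
    real = ((1.0 - r_fake) ** 2).sum()
    return pos + lambda_s * lap + mu_s * real
```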
S4: A video image frame sequence is generated based on the multi-dimensional mapping relation. Specifically:
The initial head key point coordinate matrix q_head, the head position prediction coordinates P_t of each frame of image and the overall predicted coordinates y_t of each frame of image are input into the video frame reconstruction model G and fused to obtain the reconstructed video frame sequence Q_trg:
Q_trg = G(q_head, P_t, y_t)    (20)
The minimized loss function L_a adopted in training the video frame reconstruction model is as follows:
L_a = ||Q_trg − Q̂_trg||_1 + λ_a·||φ(Q_trg) − φ(Q̂_trg)||_1    (21)
where src denotes the source video frame and trg the target video frame, Q_trg denotes the reconstructed video frame, Q̂_trg denotes the real video frame, λ_a denotes a weight coefficient with λ_a = 1, φ(Q_trg) and φ(Q̂_trg) denote the features of the reconstructed and real video frames extracted by a pre-trained ResNet-34 network, and ||·||_1 denotes the L1 norm.
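A sketch of the reconstruction loss of formula (21), using torchvision's pre-trained ResNet-34 (truncated before its classification layer) as the feature extractor φ; the truncation point and the frozen-weights choice are assumptions.

```python
import torch.nn as nn
from torchvision import models

class ReconstructionLoss(nn.Module):
    """L_a = ||Q_trg - Q_hat||_1 + lambda_a * ||phi(Q_trg) - phi(Q_hat)||_1,
    with phi given by a frozen, pre-trained ResNet-34 (formula (21))."""
    def __init__(self, lambda_a=1.0):
        super().__init__()
        resnet = models.resnet34(weights=models.ResNet34_Weights.IMAGENET1K_V1)
        self.phi = nn.Sequential(*list(resnet.children())[:-1]).eval()
        for p in self.phi.parameters():
            p.requires_grad_(False)
        self.lambda_a = lambda_a
        self.l1 = nn.L1Loss()

    def forward(self, q_trg, q_real):
        # q_trg: reconstructed frames, q_real: real frames, both (B, 3, H, W).
        pixel = self.l1(q_trg, q_real)
        perceptual = self.l1(self.phi(q_trg), self.phi(q_real))
        return pixel + self.lambda_a * perceptual
```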
The video frame reconstruction model is an encoder-decoder network that performs image-to-image translation to produce the video frame images. The encoder uses 6 convolutional layers, each containing a stride-2 convolution followed by two residual blocks, and creates a bottleneck that is then decoded by a symmetric upsampling decoder. Skip connections between the symmetric layers of the encoder and decoder are used when generating each frame image. Since the coordinates vary smoothly over time, the output images formed by interpolating these coordinates exhibit temporal coherence.
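A compact sketch of the encoder-decoder just described: 6 encoder stages, each a stride-2 convolution followed by two residual blocks, a symmetric upsampling decoder, and skip connections between symmetric layers. The channel counts and the idea of feeding the key point coordinates as extra rendered input channels are assumptions of this sketch.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(ch, ch, 3, padding=1))

    def forward(self, x):
        return x + self.body(x)

class FrameReconstructor(nn.Module):
    """Encoder-decoder with skip connections for video frame reconstruction."""
    def __init__(self, in_ch=6, base=32):
        super().__init__()
        chs = [base * 2 ** i for i in range(6)]          # 6 encoder stages
        self.enc = nn.ModuleList()
        c_prev = in_ch
        for c in chs:                                    # stride-2 conv + 2 res blocks
            self.enc.append(nn.Sequential(
                nn.Conv2d(c_prev, c, 4, stride=2, padding=1), nn.ReLU(),
                ResBlock(c), ResBlock(c)))
            c_prev = c
        # Symmetric upsampling decoder; each stage doubles the resolution and
        # concatenates the matching encoder feature (skip connection).
        self.dec = nn.ModuleList()
        for i, c in enumerate(reversed(chs[:-1])):
            in_c = c_prev if i == 0 else c_prev * 2
            self.dec.append(nn.Sequential(
                nn.ConvTranspose2d(in_c, c, 4, stride=2, padding=1), nn.ReLU()))
            c_prev = c
        self.out = nn.ConvTranspose2d(c_prev * 2, 3, 4, stride=2, padding=1)

    def forward(self, x):
        feats = []
        for stage in self.enc:                           # encode to the bottleneck
            x = stage(x)
            feats.append(x)
        for i, stage in enumerate(self.dec):             # decode with skips
            x = stage(x) if i == 0 else stage(torch.cat([x, feats[-1 - i]], dim=1))
        return torch.sigmoid(self.out(torch.cat([x, feats[0]], dim=1)))
```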
S5: The video image frame sequence is spliced with the voice data to obtain the target person speech audio/video. Specifically:
The video image frame sequence and the voice data are spliced with ffmpeg to obtain the target person speech video; in this embodiment, the frame rate of the target person speech audio/video is 29 frames per second.
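A sketch of the final muxing step with the ffmpeg command-line tool, assuming the reconstructed frames have been written out as numbered PNG files and the driving speech is a WAV file (paths and codec choices are illustrative).

```python
import subprocess

def mux_video(frames_pattern="frames/%05d.png", audio_path="speech.wav",
              out_path="target_person.mp4", fps=29):
    """Splice the generated image frame sequence with the voice data."""
    cmd = [
        "ffmpeg", "-y",
        "-framerate", str(fps), "-i", frames_pattern,   # image sequence input
        "-i", audio_path,                               # speech audio input
        "-c:v", "libx264", "-pix_fmt", "yuv420p",       # widely playable video
        "-c:a", "aac",
        "-shortest",                                    # stop at shorter stream
        out_path,
    ]
    subprocess.run(cmd, check=True)
```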
Example 2
The embodiment provides a voice-driven target person video generation device, which comprises:
the data acquisition module is used for acquiring voice data and a front image of the person's upper body;
the key point extraction module is used for extracting an initial head key point coordinate matrix and an initial upper body key point coordinate matrix based on the acquired upper body front image;
the voice data separation module is used for separating content information and audio information based on the acquired voice data;
the voice image mapping module is used for training a multi-dimensional mapping relation among the content information, the audio information, the head key point coordinate and the upper half key point coordinate based on the content information, the audio information, the initial head key point coordinate matrix and the initial upper half key point coordinate matrix;
the video frame generation module is used for generating a video image frame sequence based on a multi-dimensional mapping relation;
and the video generation module is used for splicing the video image frame sequence with the voice data to obtain the target person speech audio/video.
Example 3
The present embodiment provides a computer-readable storage medium storing a computer program executed by a processor to implement the voice-driven target person video generation method as described above.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It is understood that the same or similar parts in the above embodiments may be mutually referred to, and the same or similar parts in other embodiments may be referred to for the content which is not described in detail in some embodiments.
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing specific logical functions or steps of the process, and alternate implementations are included within the scope of the preferred embodiment of the present invention in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the present invention.
Although embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present invention, and that variations, modifications, substitutions and alterations can be made to the above embodiments by those of ordinary skill in the art within the scope of the present invention.

Claims (10)

1. A voice-driven target person video generation method is characterized by comprising the following steps:
acquiring voice data and a front image of the person's upper body;
extracting an initial head key point coordinate matrix and an initial upper body key point coordinate matrix based on the acquired front image of the upper body of the person;
separating voice content information and audio information based on the acquired voice data;
training a multi-dimensional mapping relation among the voice content information, the audio information, the head key point coordinate and the upper half key point coordinate based on the voice content information, the audio information, the initial head key point coordinate matrix and the initial upper half key point coordinate matrix;
generating a video image frame sequence based on a multi-dimensional mapping relation;
and splicing the video image frame sequence with the voice data to obtain the target person speech audio/video.
2. The method for generating video of a voice-driven target person according to claim 1, wherein the extracting an initial head key point coordinate matrix and an initial upper body key point coordinate matrix based on the acquired image of the front of the upper body of the person includes:
dividing the person's upper body front image into a head image and an upper body image;
extracting an initial head key point coordinate matrix from the head image by using a head key point extraction model;
and extracting an initial upper body key point coordinate matrix in the upper body image by adopting an upper body key point extraction model.
3. The voice-driven target person video generation method according to claim 1, wherein the separating of the voice content information and the audio information based on the acquired voice data includes:
extracting a Mel spectrogram feature matrix of the voice data;
inputting the Mel spectrogram feature matrix into an LSTM-based voice feature extraction network to obtain a voice feature matrix;
for each frame image, inputting the corresponding voice feature matrix in the preset voice frame number window into a voice content encoder to obtain a voice content matrix;
and for each frame image, inputting the corresponding voice feature matrix in the preset voice frame number window into an audio encoder to obtain an audio matrix.
4. The method for generating video of a voice-driven target person according to claim 3, wherein the training of the multidimensional mapping relationship between the voice content information and the audio information and the coordinates of the head key point and the coordinates of the upper body key point based on the voice content information, the audio information, the coordinate matrix of the head key point and the coordinate matrix of the upper body key point specifically comprises:
inputting the voice content matrix and the initial head key point coordinate matrix into a first multilayer perceptron, and predicting the displacement of the head key point coordinate matrix in each frame of image;
obtaining a head position prediction coordinate of each frame of image based on the initial head key point coordinate matrix and the displacement of the head key point coordinate matrix in each frame of image;
fusing the voice content matrix and the audio matrix based on a self-attention network to obtain a self-reconstructed audio movement matrix;
inputting the self-reconstructed audio movement matrix, the initial head key point coordinate matrix and the initial upper body key point coordinate matrix into a second multilayer perceptron, and predicting the overall displacement of the head key point coordinate matrix and the upper body key point coordinate matrix for each frame of image;
and obtaining the overall predicted coordinate of each frame of image based on the head position predicted coordinate of each frame of image, the head key point coordinate matrix of each frame of image and the overall displacement of the upper body key point coordinate matrix.
5. The method according to claim 4, wherein the voice content encoder and the first multilayer perceptron used in training the mapping relation between the voice content matrix and the head position prediction coordinates are obtained in advance by joint training on collected video data, and the minimized loss function L_c adopted during training is as follows:
L_c = Σ_{t=1}^{T} Σ_{i=1}^{N} ( ||P_{i,t} − P̂_{i,t}||_2 + λ_c·||L(P_{i,t}) − L(P̂_{i,t})||_2 )
where P_{i,t} denotes the predicted coordinate position of the i-th head key point in the t-th frame image, P̂_{i,t} denotes the actual coordinate position of the i-th head key point in the t-th frame image, λ_c denotes a weight coefficient, L(P_{i,t}) and L(P̂_{i,t}) denote the graph Laplacian coordinates of P_{i,t} and P̂_{i,t} respectively, N denotes the total number of head key points, T denotes the total number of image frames, and ||·||_2 denotes the L2 norm.
6. The method as claimed in claim 4, wherein the generating a video image frame sequence based on multi-dimensional mapping comprises:
and inputting the initial head key point coordinate matrix, the head position prediction coordinate of each frame of image and the overall prediction coordinate of each frame of image into a video frame reconstruction model, and fusing to obtain a reconstructed video frame sequence.
7. The method according to claim 6, wherein the minimized loss function L_a adopted in training the video frame reconstruction model is as follows:
L_a = ||Q_trg − Q̂_trg||_1 + λ_a·||φ(Q_trg) − φ(Q̂_trg)||_1
where src denotes the source video frame and trg the target video frame, Q_trg denotes the reconstructed video frame, Q̂_trg denotes the real video frame, λ_a denotes a weight coefficient with λ_a = 1, φ(Q_trg) and φ(Q̂_trg) denote the features of the reconstructed and real video frames extracted by a pre-trained ResNet-34 network, and ||·||_1 denotes the L1 norm.
8. The method for generating the video of the target person by voice driving according to claim 4, wherein the splicing of the video image frame sequence with the voice data to obtain the target person speech audio/video comprises:
splicing the video image frame sequence with the voice data using ffmpeg to obtain the target person speech video, wherein the number of frames per second of the target person speech audio/video is a preset value.
9. A voice-driven target person video generating device, comprising:
the data acquisition module is used for acquiring voice data and a front image of the person's upper body;
the key point extraction module is used for extracting an initial head key point coordinate matrix and an initial upper body key point coordinate matrix based on the acquired upper body front image;
the voice data separation module is used for separating content information and audio information based on the acquired voice data;
the voice image mapping module is used for training a multi-dimensional mapping relation among the content information, the audio information, the head key point coordinate and the upper half key point coordinate based on the content information, the audio information, the initial head key point coordinate matrix and the initial upper half key point coordinate matrix;
the video frame generation module is used for generating a video image frame sequence based on a multi-dimensional mapping relation;
and the video generation module is used for splicing the video image frame sequence with the voice data to obtain the target person speech audio/video.
10. A computer-readable storage medium storing a computer program, wherein the computer program is executed by a processor to implement the voice-driven target person video generation method according to any one of claims 1 to 8.
CN202111466434.7A 2021-11-30 2021-11-30 Voice-driven target person video generation method and device and storage medium Pending CN114202604A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111466434.7A CN114202604A (en) 2021-11-30 2021-11-30 Voice-driven target person video generation method and device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111466434.7A CN114202604A (en) 2021-11-30 2021-11-30 Voice-driven target person video generation method and device and storage medium

Publications (1)

Publication Number Publication Date
CN114202604A true CN114202604A (en) 2022-03-18

Family

ID=80650381

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111466434.7A Pending CN114202604A (en) 2021-11-30 2021-11-30 Voice-driven target person video generation method and device and storage medium

Country Status (1)

Country Link
CN (1) CN114202604A (en)


Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023241298A1 (en) * 2022-06-16 2023-12-21 虹软科技股份有限公司 Video generation method and apparatus, storage medium and electronic device
CN115100329A (en) * 2022-06-27 2022-09-23 太原理工大学 Multi-mode driving-based emotion controllable facial animation generation method
WO2024078243A1 (en) * 2022-10-13 2024-04-18 腾讯科技(深圳)有限公司 Training method and apparatus for video generation model, and storage medium and computer device
CN115996303A (en) * 2023-03-23 2023-04-21 科大讯飞股份有限公司 Video generation method, device, electronic equipment and storage medium
CN115996303B (en) * 2023-03-23 2023-07-25 科大讯飞股份有限公司 Video generation method, device, electronic equipment and storage medium
CN116342835A (en) * 2023-03-31 2023-06-27 华院计算技术(上海)股份有限公司 Face three-dimensional surface grid generation method, device, computing equipment and storage medium
CN117478818A (en) * 2023-12-26 2024-01-30 荣耀终端有限公司 Voice communication method, terminal and storage medium

Similar Documents

Publication Publication Date Title
CN114202604A (en) Voice-driven target person video generation method and device and storage medium
CN113192161B (en) Virtual human image video generation method, system, device and storage medium
CN110874557B (en) Voice-driven virtual face video generation method and device
Cudeiro et al. Capture, learning, and synthesis of 3D speaking styles
US11776188B2 (en) Style-aware audio-driven talking head animation from a single image
CN113194348B (en) Virtual human lecture video generation method, system, device and storage medium
US6735566B1 (en) Generating realistic facial animation from speech
Sargin et al. Analysis of head gesture and prosody patterns for prosody-driven head-gesture animation
US20100057455A1 (en) Method and System for 3D Lip-Synch Generation with Data-Faithful Machine Learning
CN115100329B (en) Multi-mode driving-based emotion controllable facial animation generation method
US7257538B2 (en) Generating animation from visual and audio input
EP4010899A1 (en) Audio-driven speech animation using recurrent neutral network
Gururani et al. Space: Speech-driven portrait animation with controllable expression
Potamianos et al. Joint audio-visual speech processing for recognition and enhancement
Liu et al. Synthesizing talking faces from text and audio: an autoencoder and sequence-to-sequence convolutional neural network
CN113516990A (en) Voice enhancement method, method for training neural network and related equipment
Lavagetto Time-delay neural networks for estimating lip movements from speech analysis: A useful tool in audio-video synchronization
CN116828129B (en) Ultra-clear 2D digital person generation method and system
CN117409121A (en) Fine granularity emotion control speaker face video generation method, system, equipment and medium based on audio frequency and single image driving
Wang et al. CA-Wav2Lip: Coordinate Attention-based Speech To Lip Synthesis In The Wild
Asadiabadi et al. Multimodal speech driven facial shape animation using deep neural networks
CN117237521A (en) Speech driving face generation model construction method and target person speaking video generation method
Liu et al. Real-time speech-driven animation of expressive talking faces
Caplier et al. Image and video for hearing impaired people
Wang et al. Flow2Flow: Audio-visual cross-modality generation for talking face videos with rhythmic head

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination