CN116228979A - Voice-driven editable face replay method, device and storage medium - Google Patents

Voice-driven editable face replay method, device and storage medium

Info

Publication number
CN116228979A
Authority
CN
China
Prior art keywords
face
editable
model
replay
driven
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310163900.7A
Other languages
Chinese (zh)
Inventor
郑迦恒
夏世宇
孙络祎
谢志峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Shanghai for Science and Technology
Original Assignee
University of Shanghai for Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Shanghai for Science and Technology filed Critical University of Shanghai for Science and Technology
Priority to CN202310163900.7A
Publication of CN116228979A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 17/00: Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G06T 15/00: 3D [Three Dimensional] image rendering
    • G06T 15/08: Volume rendering
    • G06T 19/00: Manipulating 3D models or images for computer graphics
    • G06T 19/20: Editing of 3D images, e.g. changing shapes or colours, aligning objects or positioning parts
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 21/00: Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/06: Transformation of speech into a non-audible representation, e.g. speech visualisation or speech processing for tactile aids
    • G10L 21/10: Transforming into visible information
    • G06T 2219/00: Indexing scheme for manipulating 3D models or images for computer graphics
    • G06T 2219/20: Indexing scheme for editing of 3D models
    • G06T 2219/2021: Shape modification

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Graphics (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Signal Processing (AREA)
  • Computer Hardware Design (AREA)
  • General Engineering & Computer Science (AREA)
  • Geometry (AREA)
  • Computational Linguistics (AREA)
  • Quality & Reliability (AREA)
  • Architecture (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The invention relates to a voice-driven editable face replay method, device and storage medium. The method comprises the following steps: performing three-dimensional face reconstruction on the input video to synthesize a three-dimensional face mesh model; training an LSTM network to construct a cross-modal mapping from audio to facial expression; and constructing an editable dynamic neural radiance field that regresses the color and density of the sampling points, with the replay result finally generated by volume rendering. Addressing the problem that existing methods cannot perform personalized editing of voice-driven face replay, the method converts the dynamic face generation problem into a sampling problem over a static template face in the standard space by constructing an editable dynamic neural radiance field model, and decouples shape and appearance by anchoring geometry and texture latent codes at the model vertices, thereby enabling free editing of the geometry and texture of the face.

Description

Voice-driven editable face replay method, device and storage medium
Technical Field
The present invention relates to the field of computer vision and computer graphics, and in particular, to a speech-driven editable face replay method, apparatus, and storage medium.
Background
Face replay synthesizes a replayed talking video of the original person from that person's identity information and from the expression, pose and other information provided by a driving target. In the traditional pipeline, this task is completed by an artist manually building a detailed three-dimensional face model, followed by facial rigging, motion capture, animation correction and material tuning, with the image finally rendered by a graphics pipeline, which costs production teams considerable labor and time. In recent years, with the advent of deep learning, neural network algorithms have simplified the traditional face replay pipeline, and a user without relevant expertise can quickly produce a face replay video end to end from only a segment of monocular RGB video.
Face replay driven by video has been mature for many years; however, with video as the driving source it is difficult to handle facial occlusion, and the final replay quality depends heavily on the performer's skill. Compared with video, speech is easier to acquire and more convenient to use, and cues such as syllables and intonation can be matched to the replayed character's speaking style in a personalized way. Voice-driven face replay has therefore become a research hotspot in graphics, computer vision and cross-modal learning in recent years. The technology can be applied to virtual anchors, lip-synchronized film dubbing, personal avatar customization in the metaverse and other fields, and has broad research significance and application value.
With the popularity of short-video and self-media platforms, users place new demands on video creation and often want to edit portrait video content, for example slimming the face or adding facial effects. A voice-driven face replay technique that is convenient to operate, high in fidelity and editable has therefore naturally become a new application demand in the current self-media era.
Liu et al propose a sketch-based facial video editing method deep video edit, which can represent editing operations in a latent space, and through a specific propagation and fusion module, a high-quality video editing result is generated based on StyleGAN3, allowing a user to edit a face in a video through sketch and masking.
Suwanakorn et al propose Synthesizing Obama that uses Obama tens of hours of lecture video as training material, learns audio-to-mouth mapping through a recurrent neural network (Recurrent Neural Network, RNN), synthesizes mouth texture using a model based on principal component analysis (Principal Components Analysis, PCA), and achieves voice-driven face replay.
Guo et al first applied the neural radiation field technique to a voice-driven face replay method, using two neural radiation fields to directly construct a mapping of input audio to replay video. Compared with the two-dimensional image distortion-based and the generation type method based on the generation countermeasure network (Generative Adversarial Networks, GAN), the method can generate the face dynamic details with higher precision, and the excellent effect of the nerve radiation field in the dynamic face generation field is shown.
Yuan et al propose NeRF-Editing, which establishes a correspondence between an explicit mesh representation and an implicit neural representation of a target scene, distorts camera rays with a tetrahedral mesh as a proxy, and achieves shape Editing of the implicit object.
Yang et al propose NeuMesh, which inserts geometric and texture latent codes into vertices, and separately decouples static object shape and appearance properties with two multi-layer perceptrons (Multilayer Perceptron, MLP), creating an editable neural radiation field. The method implicitly reconstructs a three-dimensional object from the input of a group of multi-view images, and can realize geometric editing of a model by changing a model grid; texture editing of the model can be achieved by replacing and modifying the latent codes in the vertices.
However, existing voice-driven face replay methods cannot perform personalized editing of the shape and texture of the replayed face.
Disclosure of Invention
The invention aims to overcome the above defects of the prior art and to provide a voice-driven editable face replay method, device and storage medium.
The aim of the invention can be achieved by the following technical scheme:
A speech-driven editable face replay method comprises the following steps:
S1, inputting video materials to form a video data set;
S2, performing three-dimensional face reconstruction on the input video material to synthesize a three-dimensional face mesh model, extracting the identity, expression and head-pose coefficients of the source-face three-dimensional morphable model, generating a mask at the same time, and extracting the face-region image as the ground truth for training;
S3, extracting the expression coefficients of the source-face three-dimensional morphable model from the video data set, training a long short-term memory network with the aligned audio features and facial expression coefficient data, and constructing a cross-modal mapping from audio to facial expression;
S4, constructing an editable dynamic neural radiance field, computing the offset of each camera-ray sampling point with the face model as a proxy, querying the k nearest vertices of the offset sampling point in the standard space, obtaining the texture latent code and geometry latent code corresponding to the sampling point by interpolation, regressing the color and density of the sampling point with a texture decoder and a geometry decoder, and generating the person replay video by volume rendering;
S5, receiving audio input, regressing the facial expression coefficients through the long short-term memory network, combining them with the source-face identity coefficients to synthesize a replayed face mesh model, feeding the editable dynamic neural radiance field framework to regress sampling-point densities and color values, and synthesizing the replay video frames by volume rendering;
S6, modifying the shape of the replayed face mesh model to edit the shape of the target person, and realizing appearance editing of the target person by changing and replacing the texture latent codes and geometry latent codes.
Further, step S2 specifically includes:
using the Deep3DFace algorithm as the three-dimensional face reconstruction algorithm and fitting the face shape and appearance with a convolutional neural network, represented by the coefficient vector
$$\mathcal{X} = (\alpha, \beta, \tau, o, \rho)$$
wherein $\alpha$, $\beta$, $\tau$, $o$ and $\rho$ respectively denote the identity, expression, material (texture), illumination and pose coefficients of the face; once the face identity and expression coefficients are obtained, the face shape is expressed as:
$$S = \bar{S} + B_{id}\,\alpha + B_{exp}\,\beta$$
wherein $\bar{S}$ denotes the average face shape, and $B_{id}$ and $B_{exp}$ are the bases of the 3DMM shape and expression, respectively.
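As a worked illustration of the two formulas above, the following Python sketch synthesizes per-vertex face geometry from identity and expression coefficients; the vertex count, coefficient dimensions and the random placeholder bases are assumptions for demonstration only, not values taken from the patent.

```python
# Sketch of 3DMM shape synthesis: S = S_mean + B_id @ alpha + B_exp @ beta.
# All array sizes and the random "bases" below are illustrative placeholders.
import numpy as np

def synthesize_shape(s_mean, b_id, b_exp, alpha, beta):
    """Return per-vertex positions (V, 3) for given identity/expression coefficients."""
    s = s_mean + b_id @ alpha + b_exp @ beta   # linear blend of mean shape and bases
    return s.reshape(-1, 3)

V, n_id, n_exp = 35709, 80, 64                 # assumed vertex / coefficient counts
rng = np.random.default_rng(0)
shape = synthesize_shape(
    rng.standard_normal(V * 3),                # mean shape S_mean
    rng.standard_normal((V * 3, n_id)),        # identity basis B_id
    rng.standard_normal((V * 3, n_exp)),       # expression basis B_exp
    rng.standard_normal(n_id),                 # identity coefficients alpha
    rng.standard_normal(n_exp),                # expression coefficients beta
)
print(shape.shape)                             # (35709, 3)
```

In the method, the identity coefficients would come from the source person's reconstruction and the expression coefficients from the audio branch, so the same routine could synthesize the replayed mesh frame by frame.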
Further, step S3 specifically includes:
extracting Mel-frequency cepstral coefficient (MFCC) features for each audio clip in the dataset;
constructing the mapping from audio to expression coefficients through a long short-term memory network, expressed as:
$$\hat{\beta}^{(t)},\, h^{(t)},\, c^{(t)} = \mathrm{LSTM}\!\left(E\!\left(s^{(t)}\right),\, h^{(t-1)},\, c^{(t-1)}\right)$$
wherein $E$ is the encoder of the MFCC feature $s^{(t)}$; $h^{(t-1)}$ and $c^{(t-1)}$ are the hidden state and cell state of the LSTM, respectively; and $\hat{\beta}^{(t)}$ is the facial expression coefficient predicted by the network at frame $t$; once a frame of facial expression coefficients predicted by the long short-term memory network is obtained, it can be combined with the identity coefficients of the source person to generate the three-dimensional mesh model of the replayed face.
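For illustration, the audio-to-expression mapping could be realized with a small PyTorch model such as the sketch below; the MFCC dimension, layer widths and the number of expression coefficients are assumptions and do not reproduce the patent's exact network.

```python
# Sketch of the audio-to-expression LSTM: an encoder E over per-frame MFCC features,
# an LSTM carrying hidden/cell state across frames, and a linear head that regresses
# the expression coefficients for every frame. All sizes are illustrative.
import torch
import torch.nn as nn

class AudioToExpression(nn.Module):
    def __init__(self, mfcc_dim=29, feat_dim=128, hidden=256, n_exp=64):
        super().__init__()
        self.encoder = nn.Sequential(               # E(s_t): per-frame MFCC encoder
            nn.Linear(mfcc_dim, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, feat_dim), nn.ReLU())
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_exp)         # regress expression coefficients

    def forward(self, mfcc_seq):                     # mfcc_seq: (B, T, mfcc_dim)
        feats = self.encoder(mfcc_seq)
        out, _ = self.lstm(feats)                    # hidden/cell states carried over frames
        return self.head(out)                        # (B, T, n_exp)

model = AudioToExpression()
beta_hat = model(torch.randn(2, 100, 29))            # 2 clips, 100 frames each
print(beta_hat.shape)                                 # torch.Size([2, 100, 64])
```

Training such a model against the reconstructed expression coefficients (for example with an L2 loss) would give the cross-modal mapping described above.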
Further, step S4 specifically includes:
S401, defining a face template model with a neutral expression and anchoring a geometry latent code and a texture latent code at each vertex;
S402, transforming the sampling points into the standard space according to the head pose and obtaining the offset Δx of each camera-ray sampling point with the model vertices as a proxy;
S403, querying the k nearest vertices of each sampling point and obtaining the corresponding texture and geometry latent codes by interpolation; regressing the color and density of the sampling point through the texture decoder and geometry decoder of the neural radiance field and rendering the face image;
S404, generating the replay result through volume rendering.
Further, step S402 specifically includes:
converting the face template model and the driving face mesh model into tetrahedral representations in the standard space;
transforming the camera-ray sampling points into the standard space and finding the tetrahedron of the driving face mesh model that contains each sampling point;
performing barycentric interpolation over the vertex displacements between the face template model and the driving face mesh model to obtain the offset Δx from a sampling point x in the driving-mesh space to the sampling point x1 in the template space, thereby realizing ray deformation.
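A minimal sketch of this ray-bending step is given below; the helper functions and the toy tetrahedra are hypothetical, and locating the tetrahedron that contains a sample is assumed to be done elsewhere. It only illustrates how the per-vertex displacement between the driving mesh and the template mesh would be barycentrically interpolated at a sampling point.

```python
# Sketch of barycentric interpolation for ray deformation: interpolate, at sample x,
# the displacement of the enclosing driving-mesh tetrahedron towards the template mesh.
import numpy as np

def barycentric_coords(p, tet):
    """Barycentric weights (4,) of point p inside tetrahedron tet (4, 3)."""
    edges = np.column_stack([tet[1] - tet[0], tet[2] - tet[0], tet[3] - tet[0]])
    w = np.linalg.solve(edges, p - tet[0])
    return np.concatenate([[1.0 - w.sum()], w])

def sample_offset(x, tet_drive, tet_template):
    """Offset dx carrying sample x from driving-mesh space into template space."""
    w = barycentric_coords(x, tet_drive)
    displacement = tet_template - tet_drive          # per-vertex displacement (4, 3)
    return w @ displacement

x = np.array([0.10, 0.20, 0.05])
tet_d = np.array([[0, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]], dtype=float)
tet_t = tet_d + 0.02                                 # toy displaced "template" tetrahedron
x1 = x + sample_offset(x, tet_d, tet_t)              # warped sample in template space
```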
Further, step S403 specifically includes:
querying the k nearest vertices of the face template model for each offset sampling point x1 in the standard space;
obtaining the geometry latent codes, texture latent codes and a distance indicator from the vertices of the face template model;
feeding them into the MLP-based geometry decoder F_G and texture decoder F_T, which regress the signed distance value s and the color c.
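The vertex query and latent-code interpolation could look like the sketch below; k, the code dimension, the decoder widths and the exact decoder inputs are assumptions for illustration rather than the patent's precise architecture.

```python
# Sketch of k-nearest-vertex query with inverse-distance interpolation of the
# per-vertex latent codes, followed by two small MLP decoders F_G (signed distance)
# and F_T (colour). All sizes are illustrative assumptions.
import torch
import torch.nn as nn

def knn_interpolate(x1, verts, codes, k=8, eps=1e-8):
    """x1: (3,) warped sample, verts: (V, 3) template vertices, codes: (V, D)."""
    d = torch.cdist(x1[None], verts)[0]                  # (V,) distances to all vertices
    dist, idx = torch.topk(d, k, largest=False)          # k nearest vertices
    w = 1.0 / (dist + eps)
    w = w / w.sum()                                       # inverse-distance weights
    return (w[:, None] * codes[idx]).sum(0), dist.min()   # code (D,), distance indicator

class Decoder(nn.Module):                                 # shared shape for F_G and F_T
    def __init__(self, in_dim, out_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, out_dim))
    def forward(self, z):
        return self.net(z)

V, D = 5023, 32
verts = torch.randn(V, 3)
geo_codes, tex_codes = torch.randn(V, D), torch.randn(V, D)
F_G = Decoder(D + 1, 1)                                    # codes + distance -> SDF value s
F_T = Decoder(D + 1 + 3, 3)                                # codes + distance + view dir -> colour c

x1, view_dir = torch.randn(3), torch.randn(3)
z_g, h = knn_interpolate(x1, verts, geo_codes)
z_t, _ = knn_interpolate(x1, verts, tex_codes)
s = F_G(torch.cat([z_g, h[None]]))                          # signed distance at the sample
c = torch.sigmoid(F_T(torch.cat([z_t, h[None], view_dir]))) # RGB colour in [0, 1]
```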
Further, step S404 specifically includes:
according to the volume rendering equation, for a camera ray r(m) = o + m d with origin o, direction d and near and far bounds $m_n$ and $m_f$, the color of the ray is expressed as:
$$C(r) = \int_{m_n}^{m_f} T(m)\,\sigma(r(m))\,c(r(m), d)\,\mathrm{d}m$$
where $\sigma(r(m))$ and $c(r(m), d)$ are the density and color values of the sampling points, and $T(m)$ is the accumulated transmittance along the ray from $m_n$ to $m$:
$$T(m) = \exp\!\left(-\int_{m_n}^{m} \sigma(r(u))\,\mathrm{d}u\right)$$
The geometry is computed with the signed distance field, and the actual rendered color $\hat{C}$ is expressed as:
$$\hat{C} = \sum_{i=1}^{N} T_i\,\alpha_i\,c_i,\qquad T_i = \prod_{j<i}\left(1-\alpha_j\right),\qquad \alpha_i = \max\!\left(\frac{\Phi_s\!\left(s_i\right)-\Phi_s\!\left(s_{i+1}\right)}{\Phi_s\!\left(s_i\right)},\,0\right)$$
where $T$ is the accumulated transmittance, $\Phi_s$ is the cumulative distribution function of the logistic distribution, $s$ is the SDF value, and $\alpha$ is the opacity derived from adjacent signed distance values.
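The SDF-based compositing above can be sketched, for one ray, as follows; the sample count and the steepness (inverse standard deviation) of the logistic CDF are illustrative assumptions.

```python
# Sketch of SDF-based volume rendering along a single ray: turn per-sample signed
# distances into opacities via the logistic CDF and alpha-composite the colours.
import torch

def render_ray(sdf, rgb, inv_s=64.0):
    """sdf: (N,) signed distances at ray samples (near to far), rgb: (N, 3) colours."""
    phi = torch.sigmoid(inv_s * sdf)                             # Phi_s applied to SDF values
    alpha = ((phi[:-1] - phi[1:]) / (phi[:-1] + 1e-8)).clamp(min=0.0)
    trans = torch.cumprod(torch.cat([alpha.new_ones(1), 1.0 - alpha + 1e-8]), dim=0)[:-1]
    weights = trans * alpha                                       # T_i * alpha_i
    return (weights[:, None] * rgb[:-1]).sum(dim=0)               # composited pixel colour

pixel = render_ray(torch.linspace(0.5, -0.5, 128), torch.rand(128, 3))
print(pixel)                                                      # one RGB value per ray
```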
Further, the model as a whole is trained with an overall loss function L.
A speech-driven editable face replay device comprises a memory and a processor, wherein the memory stores a computer program and the processor calls the program instructions to perform the speech-driven editable face replay method described above.
A computer-readable storage medium comprises a computer program executable by a processor to implement the speech-driven editable face replay method described above.
Compared with the prior art, the invention has the following beneficial effects:
Addressing the defect that existing methods cannot perform personalized editing of voice-driven face replay, the invention converts the dynamic face generation problem into a sampling problem over a static template face in the standard space by constructing an editable dynamic neural radiance field model, and decouples shape and appearance by anchoring geometry and texture latent codes at the vertices, thereby enabling editing of the geometry and texture of the face. The user only needs to provide a 3-5 minute speech video of a single person for training; after training, the model accepts arbitrary speech as input to generate a high-fidelity replay video, and the user can edit the shape and appearance of the face in the video, for example face slimming, face swapping and makeup.
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a training flow chart of the LSTM for the audio-to-expression-coefficient cross-modal mapping;
FIG. 3 is a training flow chart of the editable dynamic neural radiance field framework;
FIG. 4 is a flow chart of geometric editing (face slimming, expression changing) according to an embodiment of the invention;
FIG. 5 is a flow chart of texture editing (face swapping) according to an embodiment of the invention;
FIG. 6 is a flow chart of texture editing (makeup) according to an embodiment of the present invention.
Detailed Description
The invention will now be described in detail with reference to the drawings and specific embodiments. The embodiments are implemented on the premise of the technical solution of the invention, and detailed implementations and specific operation processes are given, but the protection scope of the invention is not limited to the following embodiments.
Addressing the defect that existing methods cannot perform personalized editing of voice-driven face replay, the invention converts the dynamic face generation problem into a sampling problem over a static template face in the standard space by constructing an editable dynamic neural radiance field model, and decouples shape and appearance by anchoring geometry and texture latent codes at the vertices, thereby enabling editing of the geometry and texture of the face. The user only needs to provide a 3-5 minute speech video of a single person for training; after training, the model accepts arbitrary speech as input to generate a high-fidelity replay video, and the user can edit the shape and appearance of the face in the video, for example face slimming, face swapping and makeup.
A speech-driven editable face replay method comprises the following specific operation steps:
S1: data preprocessing. First, a speech video of a single person is input to form a video data set; three-dimensional face reconstruction is performed on the input video to synthesize a three-dimensional face mesh model; the identity, expression and head-pose coefficients of the source-face three-dimensional morphable model (3D Morphable Model, 3DMM) are extracted; a mask is generated; and the face-region image is extracted as the training ground truth (GT).
S2: constructing the mapping from speech to expression. The 3DMM expression coefficients of the faces in the single-person speech video data set are extracted, a long short-term memory (Long Short Term Memory, LSTM) network is trained with the aligned audio features and facial expression coefficient data, and a cross-modal mapping from audio to facial expression is constructed.
S3: constructing the editable dynamic neural radiance field. First, the offset of each camera-ray sampling point is computed with the face model as a proxy; the k nearest vertices of the offset sampling point are queried in the standard space; the texture and geometry latent codes corresponding to the sampling point are obtained by interpolation; the color and density of the sampling point are regressed by the texture decoder and geometry decoder; and the replay result is finally generated by volume rendering.
S4: target person replay and shape and appearance editing. After the model is trained as a whole, the method receives an arbitrary segment of speech as input, regresses the facial expression coefficients through the LSTM, combines them with the source-face identity coefficients to synthesize a replayed face mesh model, feeds the editable dynamic neural radiance field framework to regress sampling-point densities and color values, and finally synthesizes the replay video frames by volume rendering. The shape of the target person can be edited by modifying the shape of the replay model; the appearance of the target person can be edited by changing or replacing the texture latent codes of the template model.
The specific steps of step S1 are as follows:
the Deep3DFace algorithm is used as the three-dimensional face reconstruction algorithm, and the face shape and appearance are fitted with a convolutional neural network (Convolutional Neural Network, CNN); the process is represented by the coefficient vector
$$\mathcal{X} = (\alpha, \beta, \tau, o, \rho)$$
wherein $\alpha$, $\beta$, $\tau$, $o$ and $\rho$ respectively denote the identity, expression, material (texture), illumination and pose coefficients of the face. Once the face identity and expression coefficients are obtained, the face shape can be expressed as
$$S = \bar{S} + B_{id}\,\alpha + B_{exp}\,\beta$$
wherein $\bar{S}$ denotes the average face shape, and $B_{id}$ and $B_{exp}$ are the bases of the 3DMM shape and expression, respectively.
The specific steps of step S2 are as follows:
first, Mel-frequency cepstral coefficient features are extracted for each audio clip in the training set; then, the mapping from audio to expression coefficients is constructed by the LSTM, expressed as:
$$\hat{\beta}^{(t)},\, h^{(t)},\, c^{(t)} = \mathrm{LSTM}\!\left(E\!\left(s^{(t)}\right),\, h^{(t-1)},\, c^{(t-1)}\right)$$
wherein $E$ is the encoder of the MFCC feature $s^{(t)}$; $h^{(t-1)}$ and $c^{(t-1)}$ are the hidden state and cell state of the LSTM, respectively; and $\hat{\beta}^{(t)}$ is the facial expression coefficient predicted by the network at frame $t$. Once the facial expression coefficients are obtained, they can be combined with the identity coefficients of the source person to generate the three-dimensional mesh model of the replayed face.
Further, the specific steps of step S3 are as follows:
first, a template model is defined, with geometry and texture latent codes anchored at its vertices; then, the sampling points are transformed into the standard space according to the head pose, and the offset Δx of each camera-ray sampling point is obtained with the model vertices as a proxy; next, the k nearest vertices of each sampling point are queried, and the corresponding geometry and texture latent codes are obtained by interpolation; finally, the density and color values are regressed by the geometry and texture decoders of the neural radiance field, and the face image is rendered.
S3-1: ray deformation module
First, the template mesh and the driving mesh are converted into tetrahedral representations in the standard space; then, the camera-ray sampling points are transformed into the standard space, and the tetrahedron of the driving model containing each sampling point is found; finally, barycentric interpolation is performed over the vertex displacements of the template model and the driving model to obtain the offset Δx from a sampling point x in driving-model space to the sampling point x1 in template-model space, thereby realizing ray deformation.
S3-2: texture decoder and geometry decoder
For an offset sampling point x1 in the standard space, the k nearest vertices of the template model are first queried; then, the geometry latent codes, texture latent codes and a distance indicator are obtained from those vertices; next, they are fed into the MLP-based geometry decoder F_G and texture decoder F_T, which regress the signed distance field (Signed Distance Field, SDF) value s and the color c.
S3-3: volume rendering
According to the volume rendering equation, for a camera ray r(m) = o + m d with origin o, direction d and near and far bounds $m_n$ and $m_f$, the color of the ray is expressed as:
$$C(r) = \int_{m_n}^{m_f} T(m)\,\sigma(r(m))\,c(r(m), d)\,\mathrm{d}m$$
where $\sigma(r(m))$ and $c(r(m), d)$ are the density and color values of the sampling points, and $T(m)$ is the accumulated transmittance along the ray from $m_n$ to $m$:
$$T(m) = \exp\!\left(-\int_{m_n}^{m} \sigma(r(u))\,\mathrm{d}u\right)$$
In the invention, the geometry is computed with the SDF, and the actual rendered color $\hat{C}$ is expressed as:
$$\hat{C} = \sum_{i=1}^{N} T_i\,\alpha_i\,c_i,\qquad T_i = \prod_{j<i}\left(1-\alpha_j\right),\qquad \alpha_i = \max\!\left(\frac{\Phi_s\!\left(s_i\right)-\Phi_s\!\left(s_{i+1}\right)}{\Phi_s\!\left(s_i\right)},\,0\right)$$
where $T$ is the accumulated transmittance, $\Phi_s$ is the cumulative distribution function of the logistic distribution, $s$ is the SDF value, and $\alpha$ is the opacity derived from adjacent SDF values.
In the invention, the model as a whole is trained with an overall loss function L.
Referring to FIG. 1 for the overall flow: for the input speech, the audio features are first extracted and fed into the LSTM to regress the expression coefficients of the replayed face, which are combined with the source-face identity coefficients to synthesize the face mesh model; then, an editable dynamic neural radiance field model is constructed with the vertex displacements between the replay model and the template model as a proxy; finally, the RGB video frames are rendered by volume rendering.
During training of the audio-mapping module, referring to FIG. 2: first, a data set is built from the single-person speech video, and Mel-frequency cepstral coefficient (Mel Frequency Cepstral Coefficient, MFCC) features are extracted for each audio clip in the training set, with the analysis window length set to 0.25 millisecond and the interval between consecutive windows set to 10 milliseconds; then, three-dimensional face reconstruction is performed on the video faces to obtain one-to-one corresponding audio features and facial expression coefficients; finally, the mapping from audio to expression is constructed by the LSTM.
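For reference, the MFCC features could be extracted with librosa as in the sketch below; the library choice and file name are assumptions, and a conventional 25 ms analysis window is used here and should be adjusted to match the window length stated above, together with the 10 ms hop.

```python
# Sketch of MFCC extraction for the audio branch; "speech.wav" is a hypothetical file.
import librosa

y, sr = librosa.load("speech.wav", sr=16000)
mfcc = librosa.feature.mfcc(
    y=y, sr=sr, n_mfcc=13,
    n_fft=int(0.025 * sr),        # analysis window (25 ms assumed here)
    hop_length=int(0.010 * sr),   # 10 ms interval between consecutive windows
)
print(mfcc.shape)                 # (13, n_frames), one MFCC vector per analysis window
```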
In the invention, referring to FIG. 3: first, a neutral template model of the face is defined, and the face mesh model is reconstructed from the training video; then, the offset Δx of each sampling point is obtained by computing the vertex displacements between the neutral face template model and the synthesized driving face model; next, the k vertices around the sampling point are queried in the standard space, the shape and texture latent codes are obtained by interpolation and fed into the shape decoder and texture decoder; finally, the replay video is rendered by volume rendering.
The geometric editing mode of the invention is shown in FIG. 4: the driving expression can be customized by changing the driving face mesh model, and global geometric edits such as face slimming and head shrinking can be made by modifying the mesh shape of the neutral face template model. Texture editing (face swapping) is shown in FIG. 5: by training editable dynamic neural radiance field models for different persons and replacing the texture latent codes at the vertices of the face template model, a face-swapping operation is realized. Texture editing (makeup) is shown in FIG. 6: after the user edits a replay video frame, the vertices affected by the modified pixels are determined backwards through the camera rays, and the latent codes at those vertices are fine-tuned with the edited image to achieve the makeup effect.
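To illustrate the texture-editing idea, the sketch below swaps the per-vertex texture latent codes of one trained model for those of another inside a chosen face region while leaving the geometry codes untouched; all tensors, the code dimension and the region mask are placeholder assumptions.

```python
# Sketch of appearance editing by replacing texture latent codes at template vertices.
import torch

V, D = 5023, 32
tex_codes_a = torch.randn(V, D)                    # texture codes of person A's model
tex_codes_b = torch.randn(V, D)                    # texture codes of person B's model
geo_codes_a = torch.randn(V, D)                    # geometry codes stay untouched

face_region = torch.zeros(V, dtype=torch.bool)
face_region[:3000] = True                           # hypothetical inner-face vertex indices

edited_tex = tex_codes_a.clone()
edited_tex[face_region] = tex_codes_b[face_region]  # face swap: appearance only
```

Geometric edits would instead modify the template mesh vertices themselves, and makeup-style edits would fine-tune the affected codes from the user-edited frame, as described above.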
The invention also provides a voice-driven editable face replay device and a storage medium. The device comprises a memory and a processor; the memory stores a computer program, and the processor calls the program instructions to execute the voice-driven editable face replay method. At the hardware level, the voice-driven editable face replay device comprises a processor, an internal bus, a network interface, memory and non-volatile storage, and may also comprise hardware required for other services. The processor reads the corresponding computer program from the non-volatile storage into the memory and runs it to realize the editable face replay method. Of course, in addition to software implementations, the invention does not exclude other implementations, such as a logic device or a combination of hardware and software.
The storage medium includes a computer program executable by a processor to implement the voice-driven editable face replay method described above. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device.
It should also be noted that the terms "comprises", "comprising", or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article or apparatus that comprises the element.
The foregoing describes preferred embodiments of the present invention in detail. It should be understood that a person of ordinary skill in the art can make numerous modifications and variations according to the concept of the invention without creative effort. Therefore, all technical solutions that a person skilled in the art can obtain through logical analysis, reasoning or limited experiments on the basis of the prior art and the inventive concept shall fall within the scope of protection defined by the claims.

Claims (10)

1. A speech-driven editable face replay method, characterized by comprising the following steps:
S1, inputting video materials to form a video data set;
S2, performing three-dimensional face reconstruction on the input video material to synthesize a three-dimensional face mesh model, extracting the identity, expression and head-pose coefficients of the source-face three-dimensional morphable model, generating a mask at the same time, and extracting the face-region image as the ground truth for training;
S3, extracting the expression coefficients of the source-face three-dimensional morphable model from the video data set, training a long short-term memory network with the aligned audio features and facial expression coefficient data, and constructing a cross-modal mapping from audio to facial expression;
S4, constructing an editable dynamic neural radiance field, computing the offset of each camera-ray sampling point with the face model as a proxy, querying the k nearest vertices of the offset sampling point in the standard space, obtaining the texture latent code and geometry latent code corresponding to the sampling point by interpolation, regressing the color and density of the sampling point with a texture decoder and a geometry decoder, and generating the person replay video by volume rendering;
S5, receiving audio input, regressing the facial expression coefficients through the long short-term memory network, combining them with the source-face identity coefficients to synthesize a replayed face mesh model, feeding the editable dynamic neural radiance field framework to regress sampling-point densities and color values, and synthesizing the replay video frames by volume rendering;
S6, modifying the shape of the replayed face mesh model to edit the shape of the target person, and realizing appearance editing of the target person by changing and replacing the texture latent codes and geometry latent codes.
2. The speech-driven editable face replay method of claim 1, wherein step S2 specifically comprises:
using the Deep3DFace algorithm as the three-dimensional face reconstruction algorithm and fitting the face shape and appearance with a convolutional neural network, represented by the coefficient vector
$$\mathcal{X} = (\alpha, \beta, \tau, o, \rho)$$
wherein $\alpha$, $\beta$, $\tau$, $o$ and $\rho$ respectively denote the identity, expression, material (texture), illumination and pose coefficients of the face; once the face identity and expression coefficients are obtained, the face shape is expressed as:
$$S = \bar{S} + B_{id}\,\alpha + B_{exp}\,\beta$$
wherein $\bar{S}$ denotes the average face shape, and $B_{id}$ and $B_{exp}$ are the bases of the 3DMM shape and expression, respectively.
3. The speech-driven editable face replay method of claim 1, wherein step S3 specifically comprises:
extracting Mel-frequency cepstral coefficient (MFCC) features for each audio clip in the dataset;
constructing the mapping from audio to expression coefficients through a long short-term memory network, expressed as:
$$\hat{\beta}^{(t)},\, h^{(t)},\, c^{(t)} = \mathrm{LSTM}\!\left(E\!\left(s^{(t)}\right),\, h^{(t-1)},\, c^{(t-1)}\right)$$
wherein $E$ is the encoder of the MFCC feature $s^{(t)}$; $h^{(t-1)}$ and $c^{(t-1)}$ are the hidden state and cell state of the LSTM, respectively; and $\hat{\beta}^{(t)}$ is the facial expression coefficient predicted by the network at frame $t$; once a frame of facial expression coefficients predicted by the long short-term memory network is obtained, it can be combined with the identity coefficients of the source person to generate the three-dimensional mesh model of the replayed face.
4. The speech-driven editable face replay method of claim 1, wherein step S4 specifically comprises:
S401, defining a face template model with a neutral expression and anchoring a geometry latent code and a texture latent code at each vertex;
S402, transforming the sampling points into the standard space according to the head pose and obtaining the offset Δx of each camera-ray sampling point with the model vertices as a proxy;
S403, querying the k nearest vertices of each sampling point and obtaining the corresponding texture and geometry latent codes by interpolation; regressing the color and density of the sampling point through the texture decoder and geometry decoder of the neural radiance field and rendering the face image;
S404, generating the replay result through volume rendering.
5. The speech-driven editable face replay method of claim 4, wherein step S402 specifically comprises:
converting the face template model and the driving face mesh model into tetrahedral representations in the standard space;
transforming the camera-ray sampling points into the standard space and finding the tetrahedron of the driving face mesh model that contains each sampling point;
performing barycentric interpolation over the vertex displacements between the face template model and the driving face mesh model to obtain the offset Δx from a sampling point x in the driving-mesh space to the sampling point x1 in the template space, thereby realizing ray deformation.
6. The speech-driven editable face replay method of claim 5, wherein step S403 specifically comprises:
querying the k nearest vertices of the face template model for each offset sampling point x1 in the standard space;
obtaining the geometry latent codes, texture latent codes and a distance indicator from the vertices of the face template model;
feeding them into the MLP-based geometry decoder F_G and texture decoder F_T, which regress the signed distance value s and the color c.
7. The speech-driven editable face replay method of claim 6, wherein step S404 comprises:
according to the volume rendering equation, for a camera ray r(m) = o + m d with origin o, direction d and near and far bounds $m_n$ and $m_f$, the color of the ray is expressed as:
$$C(r) = \int_{m_n}^{m_f} T(m)\,\sigma(r(m))\,c(r(m), d)\,\mathrm{d}m$$
where $\sigma(r(m))$ and $c(r(m), d)$ are the density and color values of the sampling points, and $T(m)$ is the accumulated transmittance along the ray from $m_n$ to $m$:
$$T(m) = \exp\!\left(-\int_{m_n}^{m} \sigma(r(u))\,\mathrm{d}u\right)$$
the geometry is computed with the signed distance field, and the actual rendered color $\hat{C}$ is expressed as:
$$\hat{C} = \sum_{i=1}^{N} T_i\,\alpha_i\,c_i,\qquad T_i = \prod_{j<i}\left(1-\alpha_j\right),\qquad \alpha_i = \max\!\left(\frac{\Phi_s\!\left(s_i\right)-\Phi_s\!\left(s_{i+1}\right)}{\Phi_s\!\left(s_i\right)},\,0\right)$$
where $T$ is the accumulated transmittance, $\Phi_s$ is the cumulative distribution function of the logistic distribution, $s$ is the SDF value, and $\alpha$ is the opacity derived from adjacent signed distance values.
8. The speech-driven editable face replay method of claim 7, wherein the model as a whole is trained with an overall loss function L.
9. a speech driven editable face replay device comprising a memory and a processor, the memory storing a computer program, the processor invoking the program instructions to enable a speech driven editable face replay method according to any one of claims 1 to 8.
10. A computer-readable storage medium, comprising a computer program executable by a processor to implement the speech-driven editable face replay method according to any one of claims 1 to 8.
CN202310163900.7A 2023-02-24 2023-02-24 Voice-driven editable face replay method, device and storage medium Pending CN116228979A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310163900.7A CN116228979A (en) 2023-02-24 2023-02-24 Voice-driven editable face replay method, device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310163900.7A CN116228979A (en) 2023-02-24 2023-02-24 Voice-driven editable face replay method, device and storage medium

Publications (1)

Publication Number Publication Date
CN116228979A true CN116228979A (en) 2023-06-06

Family

ID=86578204

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310163900.7A Pending CN116228979A (en) 2023-02-24 2023-02-24 Voice-driven editable face replay method, device and storage medium

Country Status (1)

Country Link
CN (1) CN116228979A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117036157A (en) * 2023-10-09 2023-11-10 易方信息科技股份有限公司 Editable simulation digital human figure design method, system, equipment and medium
CN117036157B (en) * 2023-10-09 2024-02-20 易方信息科技股份有限公司 Editable simulation digital human figure design method, system, equipment and medium
CN117422829A (en) * 2023-10-24 2024-01-19 南京航空航天大学 Face image synthesis optimization method based on nerve radiation field
CN117422802A (en) * 2023-12-19 2024-01-19 粤港澳大湾区数字经济研究院(福田) Three-dimensional figure digital reconstruction method, device, terminal equipment and storage medium
CN117422802B (en) * 2023-12-19 2024-04-12 粤港澳大湾区数字经济研究院(福田) Three-dimensional figure digital reconstruction method, device, terminal equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination