CN116228979A - Voice-driven editable face replay method, device and storage medium - Google Patents

Voice-driven editable face replay method, device and storage medium

Info

Publication number
CN116228979A
Authority
CN
China
Prior art keywords
face
editable
model
replay
driven
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310163900.7A
Other languages
Chinese (zh)
Inventor
郑迦恒
夏世宇
孙络祎
谢志峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Shanghai for Science and Technology
Original Assignee
University of Shanghai for Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Shanghai for Science and Technology filed Critical University of Shanghai for Science and Technology
Priority to CN202310163900.7A
Publication of CN116228979A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 17/00: Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G06T 15/00: 3D [Three Dimensional] image rendering
    • G06T 15/08: Volume rendering
    • G06T 19/00: Manipulating 3D models or images for computer graphics
    • G06T 19/20: Editing of 3D images, e.g. changing shapes or colours, aligning objects or positioning parts
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 21/00: Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/06: Transformation of speech into a non-audible representation, e.g. speech visualisation or speech processing for tactile aids
    • G10L 21/10: Transforming into visible information
    • G06T 2219/00: Indexing scheme for manipulating 3D models or images for computer graphics
    • G06T 2219/20: Indexing scheme for editing of 3D models
    • G06T 2219/2021: Shape modification

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Graphics (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Signal Processing (AREA)
  • Computer Hardware Design (AREA)
  • General Engineering & Computer Science (AREA)
  • Geometry (AREA)
  • Computational Linguistics (AREA)
  • Quality & Reliability (AREA)
  • Architecture (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The invention relates to a voice-driven editable face replay method, device and storage medium. The method comprises the following steps: performing three-dimensional face reconstruction on the input video to synthesize a three-dimensional face mesh model; training an LSTM network to construct a cross-modal mapping from audio to facial expression; and constructing an editable dynamic neural radiance field that regresses the color and density of the sampling points, with the replay result finally generated by volume rendering. Addressing the problem that existing methods cannot perform personalized editing of voice-driven face replay, the method converts the dynamic face generation problem into a sampling problem over a static template face in the standard space by constructing an editable dynamic neural radiance field model, and decouples shape and appearance by anchoring geometry and texture latent codes at the model vertices, thereby enabling free editing of the geometry and texture of the face.

Description

Voice-driven editable face replay method, device and storage medium
Technical Field
The present invention relates to the field of computer vision and computer graphics, and in particular, to a speech-driven editable face replay method, apparatus, and storage medium.
Background
Face replay synthesizes a replayed talking video of the original person from that person's identity information and from the expression, pose and other information provided by a driving target. In the traditional pipeline, this task is completed by an artist manually building a detailed three-dimensional face model, followed by facial rigging, motion capture, animation correction and material tuning, with the image finally rendered by a graphics pipeline, which costs production teams considerable labor and time. In recent years, with the advent of deep learning, neural network algorithms have simplified the traditional face replay pipeline, and a user without relevant expertise can quickly produce a face replay video end to end from only a segment of monocular RGB video.
Face replay driven by video has been mature for many years; however, with video as the driving source it is difficult to handle facial occlusion, and the final replay quality depends heavily on the performer's skill. Compared with video, speech is easier to acquire and more convenient to use, and cues such as syllables and intonation can be matched to the replayed character's speaking style in a personalized way. Voice-driven face replay has therefore become a research hotspot in graphics, computer vision and cross-modal learning in recent years. The technology can be applied to virtual anchors, lip-synchronized film dubbing, personal avatar customization in the metaverse and other fields, and has broad research significance and application value.
With the popularity of short-video and self-media platforms, users place new demands on video creation and often want to edit portrait video content, for example slimming the face or adding facial effects. A voice-driven face replay technique that is convenient to operate, high in fidelity and editable has therefore naturally become a new application demand in the current self-media era.
Liu et al propose a sketch-based facial video editing method deep video edit, which can represent editing operations in a latent space, and through a specific propagation and fusion module, a high-quality video editing result is generated based on StyleGAN3, allowing a user to edit a face in a video through sketch and masking.
Suwanakorn et al propose Synthesizing Obama that uses Obama tens of hours of lecture video as training material, learns audio-to-mouth mapping through a recurrent neural network (Recurrent Neural Network, RNN), synthesizes mouth texture using a model based on principal component analysis (Principal Components Analysis, PCA), and achieves voice-driven face replay.
Guo et al first applied the neural radiation field technique to a voice-driven face replay method, using two neural radiation fields to directly construct a mapping of input audio to replay video. Compared with the two-dimensional image distortion-based and the generation type method based on the generation countermeasure network (Generative Adversarial Networks, GAN), the method can generate the face dynamic details with higher precision, and the excellent effect of the nerve radiation field in the dynamic face generation field is shown.
Yuan et al propose NeRF-Editing, which establishes a correspondence between an explicit mesh representation and an implicit neural representation of a target scene, distorts camera rays with a tetrahedral mesh as a proxy, and achieves shape Editing of the implicit object.
Yang et al propose NeuMesh, which inserts geometric and texture latent codes into vertices, and separately decouples static object shape and appearance properties with two multi-layer perceptrons (Multilayer Perceptron, MLP), creating an editable neural radiation field. The method implicitly reconstructs a three-dimensional object from the input of a group of multi-view images, and can realize geometric editing of a model by changing a model grid; texture editing of the model can be achieved by replacing and modifying the latent codes in the vertices.
However, existing voice-driven face replay methods cannot perform personalized editing of the shape and texture of the replayed face.
Disclosure of Invention
The invention aims to overcome the above defects of the prior art and to provide a voice-driven editable face replay method, device and storage medium.
The aim of the invention can be achieved by the following technical scheme:
A speech-driven editable face replay method comprises the following steps:
S1, inputting video materials to form a video data set;
S2, performing three-dimensional face reconstruction on the input video material to synthesize a three-dimensional face mesh model, extracting the identity, expression and head-pose coefficients of the source-face three-dimensional morphable model, generating a mask at the same time, and extracting the face-region image as the ground truth for training;
S3, extracting the expression coefficients of the source-face three-dimensional morphable model from the video data set, training a long short-term memory network with the aligned audio features and facial expression coefficient data, and constructing a cross-modal mapping from audio to facial expression;
S4, constructing an editable dynamic neural radiance field, computing the offset of each camera-ray sampling point with the face model as a proxy, querying the k nearest vertices of the offset sampling point in the standard space, obtaining the texture latent code and geometry latent code corresponding to the sampling point by interpolation, regressing the color and density of the sampling point with a texture decoder and a geometry decoder, and generating the person replay video by volume rendering;
S5, receiving audio input, regressing the facial expression coefficients through the long short-term memory network, combining them with the source-face identity coefficients to synthesize a replayed face mesh model, feeding the editable dynamic neural radiance field framework to regress sampling-point densities and color values, and synthesizing the replay video frames by volume rendering;
S6, modifying the shape of the replayed face mesh model to edit the shape of the target person, and realizing appearance editing of the target person by changing and replacing the texture latent codes and geometry latent codes.
Further, step S2 specifically includes:
using the Deep3DFace algorithm as the three-dimensional face reconstruction algorithm and fitting the face shape and appearance with a convolutional neural network, represented by the coefficient vector
$$\mathcal{X} = (\alpha, \beta, \tau, o, \rho)$$
wherein $\alpha$, $\beta$, $\tau$, $o$ and $\rho$ respectively denote the identity, expression, material (texture), illumination and pose coefficients of the face; once the face identity and expression coefficients are obtained, the face shape is expressed as:
$$S = \bar{S} + B_{id}\,\alpha + B_{exp}\,\beta$$
wherein $\bar{S}$ denotes the average face shape, and $B_{id}$ and $B_{exp}$ are the bases of the 3DMM shape and expression, respectively.
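As a worked illustration of the two formulas above, the following Python sketch synthesizes per-vertex face geometry from identity and expression coefficients; the vertex count, coefficient dimensions and the random placeholder bases are assumptions for demonstration only, not values taken from the patent.

```python
# Sketch of 3DMM shape synthesis: S = S_mean + B_id @ alpha + B_exp @ beta.
# All array sizes and the random "bases" below are illustrative placeholders.
import numpy as np

def synthesize_shape(s_mean, b_id, b_exp, alpha, beta):
    """Return per-vertex positions (V, 3) for given identity/expression coefficients."""
    s = s_mean + b_id @ alpha + b_exp @ beta   # linear blend of mean shape and bases
    return s.reshape(-1, 3)

V, n_id, n_exp = 35709, 80, 64                 # assumed vertex / coefficient counts
rng = np.random.default_rng(0)
shape = synthesize_shape(
    rng.standard_normal(V * 3),                # mean shape S_mean
    rng.standard_normal((V * 3, n_id)),        # identity basis B_id
    rng.standard_normal((V * 3, n_exp)),       # expression basis B_exp
    rng.standard_normal(n_id),                 # identity coefficients alpha
    rng.standard_normal(n_exp),                # expression coefficients beta
)
print(shape.shape)                             # (35709, 3)
```

In the method, the identity coefficients would come from the source person's reconstruction and the expression coefficients from the audio branch, so the same routine could synthesize the replayed mesh frame by frame.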
Further, step S3 specifically includes:
extracting Mel-frequency cepstral coefficient (MFCC) features for each audio clip in the dataset;
constructing the mapping from audio to expression coefficients through a long short-term memory network, expressed as:
$$\hat{\beta}^{(t)},\, h^{(t)},\, c^{(t)} = \mathrm{LSTM}\!\left(E\!\left(s^{(t)}\right),\, h^{(t-1)},\, c^{(t-1)}\right)$$
wherein $E$ is the encoder of the MFCC feature $s^{(t)}$; $h^{(t-1)}$ and $c^{(t-1)}$ are the hidden state and cell state of the LSTM, respectively; and $\hat{\beta}^{(t)}$ is the facial expression coefficient predicted by the network at frame $t$; once a frame of facial expression coefficients predicted by the long short-term memory network is obtained, it can be combined with the identity coefficients of the source person to generate the three-dimensional mesh model of the replayed face.
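For illustration, the audio-to-expression mapping could be realized with a small PyTorch model such as the sketch below; the MFCC dimension, layer widths and the number of expression coefficients are assumptions and do not reproduce the patent's exact network.

```python
# Sketch of the audio-to-expression LSTM: an encoder E over per-frame MFCC features,
# an LSTM carrying hidden/cell state across frames, and a linear head that regresses
# the expression coefficients for every frame. All sizes are illustrative.
import torch
import torch.nn as nn

class AudioToExpression(nn.Module):
    def __init__(self, mfcc_dim=29, feat_dim=128, hidden=256, n_exp=64):
        super().__init__()
        self.encoder = nn.Sequential(               # E(s_t): per-frame MFCC encoder
            nn.Linear(mfcc_dim, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, feat_dim), nn.ReLU())
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_exp)         # regress expression coefficients

    def forward(self, mfcc_seq):                     # mfcc_seq: (B, T, mfcc_dim)
        feats = self.encoder(mfcc_seq)
        out, _ = self.lstm(feats)                    # hidden/cell states carried over frames
        return self.head(out)                        # (B, T, n_exp)

model = AudioToExpression()
beta_hat = model(torch.randn(2, 100, 29))            # 2 clips, 100 frames each
print(beta_hat.shape)                                 # torch.Size([2, 100, 64])
```

Training such a model against the reconstructed expression coefficients (for example with an L2 loss) would give the cross-modal mapping described above.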
Further, step S4 specifically includes:
S401, defining a face template model with a neutral expression and anchoring a geometry latent code and a texture latent code at each vertex;
S402, transforming the sampling points into the standard space according to the head pose and obtaining the offset Δx of each camera-ray sampling point with the model vertices as a proxy;
S403, querying the k nearest vertices of each sampling point and obtaining the corresponding texture and geometry latent codes by interpolation; regressing the color and density of the sampling point through the texture decoder and geometry decoder of the neural radiance field and rendering the face image;
S404, generating the replay result through volume rendering.
Further, step S402 specifically includes:
converting the face template model and the driving face mesh model into tetrahedral representations in the standard space;
transforming the camera-ray sampling points into the standard space and finding the tetrahedron of the driving face mesh model that contains each sampling point;
performing barycentric interpolation over the vertex displacements between the face template model and the driving face mesh model to obtain the offset Δx from a sampling point x in the driving-mesh space to the sampling point x1 in the template space, thereby realizing ray deformation.
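A minimal sketch of this ray-bending step is given below; the helper functions and the toy tetrahedra are hypothetical, and locating the tetrahedron that contains a sample is assumed to be done elsewhere. It only illustrates how the per-vertex displacement between the driving mesh and the template mesh would be barycentrically interpolated at a sampling point.

```python
# Sketch of barycentric interpolation for ray deformation: interpolate, at sample x,
# the displacement of the enclosing driving-mesh tetrahedron towards the template mesh.
import numpy as np

def barycentric_coords(p, tet):
    """Barycentric weights (4,) of point p inside tetrahedron tet (4, 3)."""
    edges = np.column_stack([tet[1] - tet[0], tet[2] - tet[0], tet[3] - tet[0]])
    w = np.linalg.solve(edges, p - tet[0])
    return np.concatenate([[1.0 - w.sum()], w])

def sample_offset(x, tet_drive, tet_template):
    """Offset dx carrying sample x from driving-mesh space into template space."""
    w = barycentric_coords(x, tet_drive)
    displacement = tet_template - tet_drive          # per-vertex displacement (4, 3)
    return w @ displacement

x = np.array([0.10, 0.20, 0.05])
tet_d = np.array([[0, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]], dtype=float)
tet_t = tet_d + 0.02                                 # toy displaced "template" tetrahedron
x1 = x + sample_offset(x, tet_d, tet_t)              # warped sample in template space
```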
Further, step S403 specifically includes:
querying the k nearest vertices of the face template model for each offset sampling point x1 in the standard space;
obtaining the geometry latent codes, texture latent codes and a distance indicator from the vertices of the face template model;
feeding them into the MLP-based geometry decoder F_G and texture decoder F_T, which regress the signed distance value s and the color c.
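The vertex query and latent-code interpolation could look like the sketch below; k, the code dimension, the decoder widths and the exact decoder inputs are assumptions for illustration rather than the patent's precise architecture.

```python
# Sketch of k-nearest-vertex query with inverse-distance interpolation of the
# per-vertex latent codes, followed by two small MLP decoders F_G (signed distance)
# and F_T (colour). All sizes are illustrative assumptions.
import torch
import torch.nn as nn

def knn_interpolate(x1, verts, codes, k=8, eps=1e-8):
    """x1: (3,) warped sample, verts: (V, 3) template vertices, codes: (V, D)."""
    d = torch.cdist(x1[None], verts)[0]                  # (V,) distances to all vertices
    dist, idx = torch.topk(d, k, largest=False)          # k nearest vertices
    w = 1.0 / (dist + eps)
    w = w / w.sum()                                       # inverse-distance weights
    return (w[:, None] * codes[idx]).sum(0), dist.min()   # code (D,), distance indicator

class Decoder(nn.Module):                                 # shared shape for F_G and F_T
    def __init__(self, in_dim, out_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, out_dim))
    def forward(self, z):
        return self.net(z)

V, D = 5023, 32
verts = torch.randn(V, 3)
geo_codes, tex_codes = torch.randn(V, D), torch.randn(V, D)
F_G = Decoder(D + 1, 1)                                    # codes + distance -> SDF value s
F_T = Decoder(D + 1 + 3, 3)                                # codes + distance + view dir -> colour c

x1, view_dir = torch.randn(3), torch.randn(3)
z_g, h = knn_interpolate(x1, verts, geo_codes)
z_t, _ = knn_interpolate(x1, verts, tex_codes)
s = F_G(torch.cat([z_g, h[None]]))                          # signed distance at the sample
c = torch.sigmoid(F_T(torch.cat([z_t, h[None], view_dir]))) # RGB colour in [0, 1]
```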
Further, step S404 specifically includes:
according to the volume rendering equation, for a camera ray r(m) = o + m d with origin o, direction d and near and far bounds $m_n$ and $m_f$, the color of the ray is expressed as:
$$C(r) = \int_{m_n}^{m_f} T(m)\,\sigma(r(m))\,c(r(m), d)\,\mathrm{d}m$$
where $\sigma(r(m))$ and $c(r(m), d)$ are the density and color values of the sampling points, and $T(m)$ is the accumulated transmittance along the ray from $m_n$ to $m$:
$$T(m) = \exp\!\left(-\int_{m_n}^{m} \sigma(r(u))\,\mathrm{d}u\right)$$
The geometry is computed with the signed distance field, and the actual rendered color $\hat{C}$ is expressed as:
$$\hat{C} = \sum_{i=1}^{N} T_i\,\alpha_i\,c_i,\qquad T_i = \prod_{j<i}\left(1-\alpha_j\right),\qquad \alpha_i = \max\!\left(\frac{\Phi_s\!\left(s_i\right)-\Phi_s\!\left(s_{i+1}\right)}{\Phi_s\!\left(s_i\right)},\,0\right)$$
where $T$ is the accumulated transmittance, $\Phi_s$ is the cumulative distribution function of the logistic distribution, $s$ is the SDF value, and $\alpha$ is the opacity derived from adjacent signed distance values.
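The SDF-based compositing above can be sketched, for one ray, as follows; the sample count and the steepness (inverse standard deviation) of the logistic CDF are illustrative assumptions.

```python
# Sketch of SDF-based volume rendering along a single ray: turn per-sample signed
# distances into opacities via the logistic CDF and alpha-composite the colours.
import torch

def render_ray(sdf, rgb, inv_s=64.0):
    """sdf: (N,) signed distances at ray samples (near to far), rgb: (N, 3) colours."""
    phi = torch.sigmoid(inv_s * sdf)                             # Phi_s applied to SDF values
    alpha = ((phi[:-1] - phi[1:]) / (phi[:-1] + 1e-8)).clamp(min=0.0)
    trans = torch.cumprod(torch.cat([alpha.new_ones(1), 1.0 - alpha + 1e-8]), dim=0)[:-1]
    weights = trans * alpha                                       # T_i * alpha_i
    return (weights[:, None] * rgb[:-1]).sum(dim=0)               # composited pixel colour

pixel = render_ray(torch.linspace(0.5, -0.5, 128), torch.rand(128, 3))
print(pixel)                                                      # one RGB value per ray
```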
Further, the model as a whole is trained with an overall loss function L.
A speech-driven editable face replay device comprises a memory and a processor, wherein the memory stores a computer program and the processor calls the program instructions to perform the speech-driven editable face replay method described above.
A computer-readable storage medium comprises a computer program executable by a processor to implement the speech-driven editable face replay method described above.
Compared with the prior art, the invention has the following beneficial effects:
Addressing the defect that existing methods cannot perform personalized editing of voice-driven face replay, the invention converts the dynamic face generation problem into a sampling problem over a static template face in the standard space by constructing an editable dynamic neural radiance field model, and decouples shape and appearance by anchoring geometry and texture latent codes at the vertices, thereby enabling editing of the geometry and texture of the face. The user only needs to provide a 3-5 minute speech video of a single person for training; after training, the model accepts arbitrary speech as input to generate a high-fidelity replay video, and the user can edit the shape and appearance of the face in the video, for example face slimming, face swapping and makeup.
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a training flow chart of the LSTM for the audio-to-expression-coefficient cross-modal mapping;
FIG. 3 is a training flow chart of the editable dynamic neural radiance field framework;
FIG. 4 is a flow chart of geometric editing (face slimming, expression changing) according to an embodiment of the invention;
FIG. 5 is a flow chart of texture editing (face swapping) according to an embodiment of the invention;
FIG. 6 is a flow chart of texture editing (makeup) according to an embodiment of the present invention.
Detailed Description
The invention will now be described in detail with reference to the drawings and specific embodiments. The embodiments are implemented on the premise of the technical solution of the invention, and detailed implementations and specific operation processes are given, but the protection scope of the invention is not limited to the following embodiments.
Addressing the defect that existing methods cannot perform personalized editing of voice-driven face replay, the invention converts the dynamic face generation problem into a sampling problem over a static template face in the standard space by constructing an editable dynamic neural radiance field model, and decouples shape and appearance by anchoring geometry and texture latent codes at the vertices, thereby enabling editing of the geometry and texture of the face. The user only needs to provide a 3-5 minute speech video of a single person for training; after training, the model accepts arbitrary speech as input to generate a high-fidelity replay video, and the user can edit the shape and appearance of the face in the video, for example face slimming, face swapping and makeup.
A speech-driven editable face replay method comprises the following specific operation steps:
S1: data preprocessing. First, a speech video of a single person is input to form a video data set; three-dimensional face reconstruction is performed on the input video to synthesize a three-dimensional face mesh model; the identity, expression and head-pose coefficients of the source-face three-dimensional morphable model (3D Morphable Model, 3DMM) are extracted; a mask is generated; and the face-region image is extracted as the training ground truth (GT).
S2: constructing the mapping from speech to expression. The 3DMM expression coefficients of the faces in the single-person speech video data set are extracted, a long short-term memory (Long Short Term Memory, LSTM) network is trained with the aligned audio features and facial expression coefficient data, and a cross-modal mapping from audio to facial expression is constructed.
S3: constructing the editable dynamic neural radiance field. First, the offset of each camera-ray sampling point is computed with the face model as a proxy; the k nearest vertices of the offset sampling point are queried in the standard space; the texture and geometry latent codes corresponding to the sampling point are obtained by interpolation; the color and density of the sampling point are regressed by the texture decoder and geometry decoder; and the replay result is finally generated by volume rendering.
S4: target person replay and shape and appearance editing. After the model is trained as a whole, the method receives an arbitrary segment of speech as input, regresses the facial expression coefficients through the LSTM, combines them with the source-face identity coefficients to synthesize a replayed face mesh model, feeds the editable dynamic neural radiance field framework to regress sampling-point densities and color values, and finally synthesizes the replay video frames by volume rendering. The shape of the target person can be edited by modifying the shape of the replay model; the appearance of the target person can be edited by changing or replacing the texture latent codes of the template model.
The specific steps of step S1 are as follows:
the Deep3DFace algorithm is used as the three-dimensional face reconstruction algorithm, and the face shape and appearance are fitted with a convolutional neural network (Convolutional Neural Network, CNN); the process is represented by the coefficient vector
$$\mathcal{X} = (\alpha, \beta, \tau, o, \rho)$$
wherein $\alpha$, $\beta$, $\tau$, $o$ and $\rho$ respectively denote the identity, expression, material (texture), illumination and pose coefficients of the face. Once the face identity and expression coefficients are obtained, the face shape can be expressed as
$$S = \bar{S} + B_{id}\,\alpha + B_{exp}\,\beta$$
wherein $\bar{S}$ denotes the average face shape, and $B_{id}$ and $B_{exp}$ are the bases of the 3DMM shape and expression, respectively.
The specific steps of step S2 are as follows:
first, Mel-frequency cepstral coefficient features are extracted for each audio clip in the training set; then, the mapping from audio to expression coefficients is constructed by the LSTM, expressed as:
$$\hat{\beta}^{(t)},\, h^{(t)},\, c^{(t)} = \mathrm{LSTM}\!\left(E\!\left(s^{(t)}\right),\, h^{(t-1)},\, c^{(t-1)}\right)$$
wherein $E$ is the encoder of the MFCC feature $s^{(t)}$; $h^{(t-1)}$ and $c^{(t-1)}$ are the hidden state and cell state of the LSTM, respectively; and $\hat{\beta}^{(t)}$ is the facial expression coefficient predicted by the network at frame $t$. Once the facial expression coefficients are obtained, they can be combined with the identity coefficients of the source person to generate the three-dimensional mesh model of the replayed face.
Further, the specific steps of step S3 are as follows:
first, a template model is defined, with geometry and texture latent codes anchored at its vertices; then, the sampling points are transformed into the standard space according to the head pose, and the offset Δx of each camera-ray sampling point is obtained with the model vertices as a proxy; next, the k nearest vertices of each sampling point are queried, and the corresponding geometry and texture latent codes are obtained by interpolation; finally, the density and color values are regressed by the geometry and texture decoders of the neural radiance field, and the face image is rendered.
S3-1: ray deformation module
First, the template mesh and the driving mesh are converted into tetrahedral representations in the standard space; then, the camera-ray sampling points are transformed into the standard space, and the tetrahedron of the driving model containing each sampling point is found; finally, barycentric interpolation is performed over the vertex displacements of the template model and the driving model to obtain the offset Δx from a sampling point x in driving-model space to the sampling point x1 in template-model space, thereby realizing ray deformation.
S3-2: texture decoder and geometry decoder
For an offset sampling point x1 in the standard space, the k nearest vertices of the template model are first queried; then, the geometry latent codes, texture latent codes and a distance indicator are obtained from those vertices; next, they are fed into the MLP-based geometry decoder F_G and texture decoder F_T, which regress the signed distance field (Signed Distance Field, SDF) value s and the color c.
S3-3: volume rendering
According to the volume rendering equation, for a camera ray r(m) = o + m d with origin o, direction d and near and far bounds $m_n$ and $m_f$, the color of the ray is expressed as:
$$C(r) = \int_{m_n}^{m_f} T(m)\,\sigma(r(m))\,c(r(m), d)\,\mathrm{d}m$$
where $\sigma(r(m))$ and $c(r(m), d)$ are the density and color values of the sampling points, and $T(m)$ is the accumulated transmittance along the ray from $m_n$ to $m$:
$$T(m) = \exp\!\left(-\int_{m_n}^{m} \sigma(r(u))\,\mathrm{d}u\right)$$
In the invention, the geometry is computed with the SDF, and the actual rendered color $\hat{C}$ is expressed as:
$$\hat{C} = \sum_{i=1}^{N} T_i\,\alpha_i\,c_i,\qquad T_i = \prod_{j<i}\left(1-\alpha_j\right),\qquad \alpha_i = \max\!\left(\frac{\Phi_s\!\left(s_i\right)-\Phi_s\!\left(s_{i+1}\right)}{\Phi_s\!\left(s_i\right)},\,0\right)$$
where $T$ is the accumulated transmittance, $\Phi_s$ is the cumulative distribution function of the logistic distribution, $s$ is the SDF value, and $\alpha$ is the opacity derived from adjacent SDF values.
In the invention, the model as a whole is trained with an overall loss function L.
Referring to FIG. 1 for the overall flow: for the input speech, the audio features are first extracted and fed into the LSTM to regress the expression coefficients of the replayed face, which are combined with the source-face identity coefficients to synthesize the face mesh model; then, an editable dynamic neural radiance field model is constructed with the vertex displacements between the replay model and the template model as a proxy; finally, the RGB video frames are rendered by volume rendering.
During training of the audio-mapping module, referring to FIG. 2: first, a data set is built from the single-person speech video, and Mel-frequency cepstral coefficient (Mel Frequency Cepstral Coefficient, MFCC) features are extracted for each audio clip in the training set, with the analysis window length set to 0.25 millisecond and the interval between consecutive windows set to 10 milliseconds; then, three-dimensional face reconstruction is performed on the video faces to obtain one-to-one corresponding audio features and facial expression coefficients; finally, the mapping from audio to expression is constructed by the LSTM.
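For reference, the MFCC features could be extracted with librosa as in the sketch below; the library choice and file name are assumptions, and a conventional 25 ms analysis window is used here and should be adjusted to match the window length stated above, together with the 10 ms hop.

```python
# Sketch of MFCC extraction for the audio branch; "speech.wav" is a hypothetical file.
import librosa

y, sr = librosa.load("speech.wav", sr=16000)
mfcc = librosa.feature.mfcc(
    y=y, sr=sr, n_mfcc=13,
    n_fft=int(0.025 * sr),        # analysis window (25 ms assumed here)
    hop_length=int(0.010 * sr),   # 10 ms interval between consecutive windows
)
print(mfcc.shape)                 # (13, n_frames), one MFCC vector per analysis window
```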
In the invention, referring to FIG. 3: first, a neutral template model of the face is defined, and the face mesh model is reconstructed from the training video; then, the offset Δx of each sampling point is obtained by computing the vertex displacements between the neutral face template model and the synthesized driving face model; next, the k vertices around the sampling point are queried in the standard space, the shape and texture latent codes are obtained by interpolation and fed into the shape decoder and texture decoder; finally, the replay video is rendered by volume rendering.
The geometric editing mode of the invention is shown in FIG. 4: the driving expression can be customized by changing the driving face mesh model, and global geometric edits such as face slimming and head shrinking can be made by modifying the mesh shape of the neutral face template model. Texture editing (face swapping) is shown in FIG. 5: by training editable dynamic neural radiance field models for different persons and replacing the texture latent codes at the vertices of the face template model, a face-swapping operation is realized. Texture editing (makeup) is shown in FIG. 6: after the user edits a replay video frame, the vertices affected by the modified pixels are determined backwards through the camera rays, and the latent codes at those vertices are fine-tuned with the edited image to achieve the makeup effect.
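To illustrate the texture-editing idea, the sketch below swaps the per-vertex texture latent codes of one trained model for those of another inside a chosen face region while leaving the geometry codes untouched; all tensors, the code dimension and the region mask are placeholder assumptions.

```python
# Sketch of appearance editing by replacing texture latent codes at template vertices.
import torch

V, D = 5023, 32
tex_codes_a = torch.randn(V, D)                    # texture codes of person A's model
tex_codes_b = torch.randn(V, D)                    # texture codes of person B's model
geo_codes_a = torch.randn(V, D)                    # geometry codes stay untouched

face_region = torch.zeros(V, dtype=torch.bool)
face_region[:3000] = True                           # hypothetical inner-face vertex indices

edited_tex = tex_codes_a.clone()
edited_tex[face_region] = tex_codes_b[face_region]  # face swap: appearance only
```

Geometric edits would instead modify the template mesh vertices themselves, and makeup-style edits would fine-tune the affected codes from the user-edited frame, as described above.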
The invention also provides a voice-driven editable face replay device and a storage medium. The device comprises a memory and a processor; the memory stores a computer program, and the processor calls the program instructions to execute the voice-driven editable face replay method. At the hardware level, the voice-driven editable face replay device comprises a processor, an internal bus, a network interface, memory and non-volatile storage, and may also comprise hardware required for other services. The processor reads the corresponding computer program from the non-volatile storage into the memory and runs it to realize the editable face replay method. Of course, in addition to software implementations, the invention does not exclude other implementations, such as a logic device or a combination of hardware and software.
The storage medium includes a computer program executable by a processor to implement the voice-driven editable face replay method described above. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device.
It should also be noted that the terms "comprises", "comprising", or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article or apparatus that comprises the element.
The foregoing describes preferred embodiments of the present invention in detail. It should be understood that a person of ordinary skill in the art can make numerous modifications and variations according to the concept of the invention without creative effort. Therefore, all technical solutions that a person skilled in the art can obtain through logical analysis, reasoning or limited experiments on the basis of the prior art and the inventive concept shall fall within the scope of protection defined by the claims.

Claims (10)

1. A speech-driven editable face replay method, characterized by comprising the following steps:
S1, inputting video materials to form a video data set;
S2, performing three-dimensional face reconstruction on the input video material to synthesize a three-dimensional face mesh model, extracting the identity, expression and head-pose coefficients of the source-face three-dimensional morphable model, generating a mask at the same time, and extracting the face-region image as the ground truth for training;
S3, extracting the expression coefficients of the source-face three-dimensional morphable model from the video data set, training a long short-term memory network with the aligned audio features and facial expression coefficient data, and constructing a cross-modal mapping from audio to facial expression;
S4, constructing an editable dynamic neural radiance field, computing the offset of each camera-ray sampling point with the face model as a proxy, querying the k nearest vertices of the offset sampling point in the standard space, obtaining the texture latent code and geometry latent code corresponding to the sampling point by interpolation, regressing the color and density of the sampling point with a texture decoder and a geometry decoder, and generating the person replay video by volume rendering;
S5, receiving audio input, regressing the facial expression coefficients through the long short-term memory network, combining them with the source-face identity coefficients to synthesize a replayed face mesh model, feeding the editable dynamic neural radiance field framework to regress sampling-point densities and color values, and synthesizing the replay video frames by volume rendering;
S6, modifying the shape of the replayed face mesh model to edit the shape of the target person, and realizing appearance editing of the target person by changing and replacing the texture latent codes and geometry latent codes.
2. The speech-driven editable face replay method of claim 1, wherein step S2 specifically comprises:
using the Deep3DFace algorithm as the three-dimensional face reconstruction algorithm and fitting the face shape and appearance with a convolutional neural network, represented by the coefficient vector
$$\mathcal{X} = (\alpha, \beta, \tau, o, \rho)$$
wherein $\alpha$, $\beta$, $\tau$, $o$ and $\rho$ respectively denote the identity, expression, material (texture), illumination and pose coefficients of the face; once the face identity and expression coefficients are obtained, the face shape is expressed as:
$$S = \bar{S} + B_{id}\,\alpha + B_{exp}\,\beta$$
wherein $\bar{S}$ denotes the average face shape, and $B_{id}$ and $B_{exp}$ are the bases of the 3DMM shape and expression, respectively.
3. The speech-driven editable face replay method of claim 1, wherein step S3 specifically comprises:
extracting Mel-frequency cepstral coefficient (MFCC) features for each audio clip in the dataset;
constructing the mapping from audio to expression coefficients through a long short-term memory network, expressed as:
$$\hat{\beta}^{(t)},\, h^{(t)},\, c^{(t)} = \mathrm{LSTM}\!\left(E\!\left(s^{(t)}\right),\, h^{(t-1)},\, c^{(t-1)}\right)$$
wherein $E$ is the encoder of the MFCC feature $s^{(t)}$; $h^{(t-1)}$ and $c^{(t-1)}$ are the hidden state and cell state of the LSTM, respectively; and $\hat{\beta}^{(t)}$ is the facial expression coefficient predicted by the network at frame $t$; once a frame of facial expression coefficients predicted by the long short-term memory network is obtained, it can be combined with the identity coefficients of the source person to generate the three-dimensional mesh model of the replayed face.
4. The speech-driven editable face replay method of claim 1, wherein step S4 specifically comprises:
S401, defining a face template model with a neutral expression and anchoring a geometry latent code and a texture latent code at each vertex;
S402, transforming the sampling points into the standard space according to the head pose and obtaining the offset Δx of each camera-ray sampling point with the model vertices as a proxy;
S403, querying the k nearest vertices of each sampling point and obtaining the corresponding texture and geometry latent codes by interpolation; regressing the color and density of the sampling point through the texture decoder and geometry decoder of the neural radiance field and rendering the face image;
S404, generating the replay result through volume rendering.
5. The speech-driven editable face replay method of claim 4, wherein step S402 specifically comprises:
converting the face template model and the driving face mesh model into tetrahedral representations in the standard space;
transforming the camera-ray sampling points into the standard space and finding the tetrahedron of the driving face mesh model that contains each sampling point;
performing barycentric interpolation over the vertex displacements between the face template model and the driving face mesh model to obtain the offset Δx from a sampling point x in the driving-mesh space to the sampling point x1 in the template space, thereby realizing ray deformation.
6. The speech-driven editable face replay method of claim 5, wherein step S403 specifically comprises:
querying the k nearest vertices of the face template model for each offset sampling point x1 in the standard space;
obtaining the geometry latent codes, texture latent codes and a distance indicator from the vertices of the face template model;
feeding them into the MLP-based geometry decoder F_G and texture decoder F_T, which regress the signed distance value s and the color c.
7. The speech-driven editable face replay method of claim 6, wherein step S404 comprises:
according to the volume rendering equation, for a camera ray r(m) = o + m d with origin o, direction d and near and far bounds $m_n$ and $m_f$, the color of the ray is expressed as:
$$C(r) = \int_{m_n}^{m_f} T(m)\,\sigma(r(m))\,c(r(m), d)\,\mathrm{d}m$$
where $\sigma(r(m))$ and $c(r(m), d)$ are the density and color values of the sampling points, and $T(m)$ is the accumulated transmittance along the ray from $m_n$ to $m$:
$$T(m) = \exp\!\left(-\int_{m_n}^{m} \sigma(r(u))\,\mathrm{d}u\right)$$
the geometry is computed with the signed distance field, and the actual rendered color $\hat{C}$ is expressed as:
$$\hat{C} = \sum_{i=1}^{N} T_i\,\alpha_i\,c_i,\qquad T_i = \prod_{j<i}\left(1-\alpha_j\right),\qquad \alpha_i = \max\!\left(\frac{\Phi_s\!\left(s_i\right)-\Phi_s\!\left(s_{i+1}\right)}{\Phi_s\!\left(s_i\right)},\,0\right)$$
where $T$ is the accumulated transmittance, $\Phi_s$ is the cumulative distribution function of the logistic distribution, $s$ is the SDF value, and $\alpha$ is the opacity derived from adjacent signed distance values.
8. The speech-driven editable face replay method of claim 7, wherein the model as a whole is trained with an overall loss function L.
9. a speech driven editable face replay device comprising a memory and a processor, the memory storing a computer program, the processor invoking the program instructions to enable a speech driven editable face replay method according to any one of claims 1 to 8.
10. A computer-readable storage medium, comprising a computer program executable by a processor to implement the speech-driven editable face replay method according to any one of claims 1 to 8.
CN202310163900.7A 2023-02-24 2023-02-24 Voice-driven editable face replay method, device and storage medium Pending CN116228979A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310163900.7A CN116228979A (en) 2023-02-24 2023-02-24 Voice-driven editable face replay method, device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310163900.7A CN116228979A (en) 2023-02-24 2023-02-24 Voice-driven editable face replay method, device and storage medium

Publications (1)

Publication Number Publication Date
CN116228979A true CN116228979A (en) 2023-06-06

Family

ID=86578204

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310163900.7A Pending CN116228979A (en) 2023-02-24 2023-02-24 Voice-driven editable face replay method, device and storage medium

Country Status (1)

Country Link
CN (1) CN116228979A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117036157A (en) * 2023-10-09 2023-11-10 易方信息科技股份有限公司 Editable simulation digital human figure design method, system, equipment and medium
CN117036157B (en) * 2023-10-09 2024-02-20 易方信息科技股份有限公司 Editable simulation digital human figure design method, system, equipment and medium
CN117422829A (en) * 2023-10-24 2024-01-19 南京航空航天大学 Face image synthesis optimization method based on nerve radiation field
CN117422802A (en) * 2023-12-19 2024-01-19 粤港澳大湾区数字经济研究院(福田) Three-dimensional figure digital reconstruction method, device, terminal equipment and storage medium
CN117422802B (en) * 2023-12-19 2024-04-12 粤港澳大湾区数字经济研究院(福田) Three-dimensional figure digital reconstruction method, device, terminal equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination