CN112562722A - Audio-driven digital human generation method and system based on semantics - Google Patents
- Publication number
- CN112562722A (application number CN202011382282.8A)
- Authority
- CN
- China
- Prior art keywords
- semantic
- face
- audio
- mouth
- face image
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/06—Transformation of speech into a non-audible representation, e.g. speech visualisation or speech processing for tactile aids
- G10L21/10—Transforming into visible information
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T15/00—3D [Three Dimensional] image rendering
- G06T15/005—General purpose rendering architectures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/16—Human faces, e.g. facial parts, sketches or expressions
- G06V40/161—Detection; Localisation; Normalisation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/16—Human faces, e.g. facial parts, sketches or expressions
- G06V40/168—Feature extraction; Face representation
- G06V40/171—Local features and components; Facial parts ; Occluding parts, e.g. glasses; Geometrical relationships
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/16—Speech classification or search using artificial neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/18—Speech classification or search using natural language modelling
- G10L15/1822—Parsing for meaning understanding
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
- G10L25/57—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for processing of video signals
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/06—Transformation of speech into a non-audible representation, e.g. speech visualisation or speech processing for tactile aids
- G10L21/10—Transforming into visible information
- G10L2021/105—Synthesis of the lips movements from speech, e.g. for talking heads
Abstract
The invention discloses a semantic-based, audio-driven digital human generation method and system. The generation method comprises the following steps: acquiring a target audio and a first face image sequence; extracting features of the target audio to obtain corresponding audio features; inputting the audio features into a pre-trained semantic conversion network, which performs semantic conversion on the audio features to obtain a corresponding semantic motion sequence comprising a plurality of mouth semantic graphs; and acquiring face images to be rendered, equal in number to the mouth semantic graphs, based on the first face image sequence, masking the mouth region of each face image to be rendered, and performing face synthesis based on the mouth semantic graphs and the face images to be rendered to generate a synthesized face sequence. The invention realizes the conversion between audio and facial semantics through the semantic conversion network, and uses the facial semantics to achieve accurate expression of the mouth shape.
Description
Technical Field
The invention relates to the field of machine learning, in particular to a semantic-based audio-driven digital human generation method and system.
Background
Audio-driven videos in which a digital human performs synchronized speaking motions are widely used in video-sharing scenarios such as news broadcasting, training and sharing, and advertising.
Publication CN1032188842 discloses a method for synchronously driving a three-dimensional face mouth shape and facial-pose animation with speech: MPEG-4-defined mouth-shape and facial-pose feature parameters corresponding to each initial and final (consonant and vowel unit) in a video frame are extracted; the difference Vel between each feature-point coordinate and the standard-frame coordinate is computed, together with the corresponding MPEG-4-defined scale reference P on the face; the facial motion parameters are then obtained from Vel and P.
the patent application adopts the constructed three-dimensional face as a digital person, and the face generated by modeling is greatly different from a real face, so that the method is not suitable for occasions requiring the consistency of the digital face and the real face, such as news broadcasting, training sharing and the like;
because the face movement and speaking are a very elaborate and complex process, the face movement can only be preliminarily represented by using the coordinates of the feature points, errors exist in the positioning of the face feature points, and the face movement and speaking have individual differences; the method associates each initial and final with the mouth shape face posture characteristic parameters, and the tone, language and speed of the sound are related to the face movement, so the method has large limitation.
Disclosure of Invention
To address the defects of the prior art, the invention provides a semantic-based, audio-driven digital person generation method and system that can express the face accurately and finely, and is suitable for occasions requiring the digital person to resemble a target person.
In order to solve the technical problem, the invention is solved by the following technical scheme:
a semantic-based audio-driven digital human generation method comprises the following steps:
acquiring a target audio and a target face image sequence, and after masking the mouth region of each target face image in the target face image sequence, acquiring a corresponding first face image sequence;
extracting the characteristics of the target audio to obtain corresponding audio characteristics;
inputting the audio features into a pre-trained semantic conversion network, and performing semantic conversion on the audio features by the semantic conversion network to obtain a corresponding semantic motion sequence, wherein the semantic motion sequence comprises a plurality of mouth semantic graphs;
and constructing a second face image sequence based on the first face image sequence, the second face image sequence containing face images to be rendered equal in number to the mouth semantic graphs, and generating a synthesized face sequence based on the mouth semantic graphs and the face images to be rendered, the synthesized face sequence containing synthesized faces in one-to-one correspondence with the mouth semantic graphs.
As an implementable embodiment, the semantic conversion network comprises a recurrent neural network and an upsampling convolutional neural network;
the recurrent neural network is used for converting the audio features into expression vectors:
the up-sampling convolution neural network is used for generating a semantic motion sequence based on the expression vector.
As an implementable embodiment:
respectively connecting the semantic graph of the mouth part with the corresponding face image to be rendered to obtain corresponding data to be synthesized;
and inputting the data to be synthesized into a preset neural rendering network, and synthesizing and rendering the face image to be rendered by the neural rendering network based on the mouth semantic graph to generate a corresponding synthesized face.
As an implementation manner, pre-training the semantic conversion network comprises the following steps:
acquiring a speaking video corresponding to a target face, extracting audio features of the speaking video, acquiring sample audio features, extracting video frames of the speaking video, detecting the face in each video frame, segmenting a mouth semantic graph of the face, and taking the obtained mouth semantic graph as a sample semantic graph;
training the semantic conversion network based on the sample audio features and the sample semantic graph.
As an implementation manner, pre-training the neural rendering network comprises the following steps:
masking the mouth region of the face in the video frame to obtain a corresponding sample image to be rendered;
connecting the sample image to be rendered and the corresponding sample semantic graph to obtain corresponding sample data to be synthesized;
and training the neural rendering network based on the sample data to be synthesized and the sample face image.
As an implementable embodiment:
the audio features are mel-frequency cepstral coefficients.
The invention also provides a semantic-based audio-driven digital human generation system, which comprises:
the data acquisition module is used for acquiring a target audio and a target face image sequence, and after masking processing is carried out on mouth regions of all target face images in the target face image sequence, a corresponding first face image sequence is obtained;
the characteristic extraction module is used for extracting the characteristics of the target audio to obtain corresponding audio characteristics;
the semantic conversion module is used for inputting the audio features to a pre-trained semantic conversion network, and the semantic conversion network performs semantic conversion on the audio features to obtain a corresponding semantic motion sequence, wherein the semantic motion sequence comprises a plurality of mouth semantic graphs;
and the synthetic rendering module is used for constructing a second face image sequence based on the first face image sequence, the second face image sequence comprises the face images to be rendered in the same quantity as the mouth semantic graphs, face synthesis is carried out based on the mouth semantic graphs and the face images to be rendered, a synthetic face sequence is generated, and synthetic faces corresponding to the mouth semantic graphs one by one are contained in the synthetic face sequence.
As an implementable embodiment:
the semantic conversion network comprises a cyclic neural network and an up-sampling convolutional neural network;
the recurrent neural network is used for converting the audio features into expression vectors:
the up-sampling convolution neural network is used for generating a semantic motion sequence based on the expression vector.
As one implementable way, the synthesis rendering module comprises:
the connecting unit is used for connecting the semantic graph of the mouth part with the corresponding face image to be rendered respectively to obtain corresponding data to be synthesized;
and the rendering unit is used for inputting the data to be synthesized into a preset neural rendering network, and the neural rendering network synthesizes and renders the face image to be rendered based on the mouth semantic graph to generate a corresponding synthesized face.
The invention also proposes a computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of the preceding claims.
Due to the adoption of the technical scheme, the invention has the remarkable technical effects that:
according to the invention, through a pre-trained semantic conversion network, the semantic is adopted to achieve the fine expression of the mouth shape, the semantic is essentially a binary image of the mouth shape of the digital human face, and compared with key points or parameters of the face, the expression of the face is more accurate and fine.
The invention performs synthesis rendering through the neural rendering network, which realizes audio-driven digital human generation more accurately and robustly, makes the synthesized face closer to the real face, and improves the viewing experience.
Drawings
To illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in their description are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and those skilled in the art can derive other drawings from them without creative effort.
FIG. 1 is a schematic flow diagram of a semantic-based audio-driven digital human generation method of the present invention;
FIG. 2 is a schematic diagram of a network architecture of a neural rendering network according to embodiment 1;
FIG. 3 is a schematic diagram of a neural rendering network generating a corresponding synthetic face based on a semantic graph of a mouth and a face image to be rendered in a case;
fig. 4 is a schematic diagram of module connection of a semantic-based audio-driven digital human generation system according to the present invention.
Detailed Description
The present invention will be described in further detail with reference to examples, which are illustrative of the present invention and are not to be construed as being limited thereto.
Embodiment 1, a semantic-based audio-driven digital human generation method, as shown in fig. 1, includes the following steps:
s100, acquiring a target audio and a target face image sequence, and masking mouth regions of all target face images in the target face image sequence to obtain a corresponding first face image sequence;
after the mouth region of the target face image is subjected to mask processing, corresponding face images to be rendered are obtained, and the face images to be rendered corresponding to the target face images one to one form a first face image sequence.
S200, extracting the characteristics of the target audio to obtain corresponding audio characteristics;
s300, inputting the audio features into a pre-trained semantic conversion network, and performing semantic conversion on the audio features by the semantic conversion network to obtain a corresponding semantic motion sequence, wherein the semantic motion sequence comprises a plurality of mouth semantic graphs;
s400, constructing a second face image sequence based on the first face image sequence, wherein the second face image sequence comprises the same number of face images to be rendered as the mouth semantic graphs, carrying out face synthesis based on the mouth semantic graphs and the face images to be rendered to generate a synthesized face sequence, and the synthesized face sequence comprises synthesized faces corresponding to the mouth semantic graphs one by one.
When the number of mouth semantic graphs is less than or equal to the number of face images to be rendered in the first face image sequence, the corresponding number of face images to be rendered are extracted in order to form the second face image sequence.
When the number of mouth semantic graphs exceeds the number of face images to be rendered in the first face image sequence, the first face image sequence can be cycled to extend it to the required number of face images to be rendered, forming the second face image sequence.
For example, by playing the first face image sequence forward and then in reverse in a loop, an image sequence of unbounded length can be generated; the second face image sequence is obtained by extracting, in order, face images to be rendered matching the number of mouth semantic graphs.
The resulting synthesized face sequence is a picture sequence of the digital person speaking, generated from the target face, with speaking motions consistent with the target audio; a corresponding video can then be generated from the synthesized face sequence and the target audio.
In this embodiment, the conversion between audio and semantics is realized through the pre-trained semantic conversion network, and semantics are used to achieve fine expression of the mouth shape. The semantic graph is essentially a binary image of the digital human's mouth shape; compared with facial key points or parameters, it expresses the face more accurately and finely, making the method suitable for scenarios such as news broadcasting and teaching or training, where the digital human must bear the target person's face and speak realistically.
The specific way of obtaining the face image to be rendered in step S100 is as follows:
manually setting a mask on the target face image to shield its mouth region, and taking the resulting image as the face image to be rendered; a person skilled in the art can set the mask region according to the actual situation.
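A minimal numpy sketch of this masking step; the rectangle coordinates are hypothetical placeholders for the fixed region a practitioner would choose:

```python
import numpy as np

def mask_mouth(face, top, bottom, left, right):
    """Zero out a fixed rectangular mouth region of an H x W x C face
    image. The rectangle coordinates are hypothetical; the patent only
    says a fixed, manually chosen region is applied to both training
    frames and target frames."""
    masked = face.copy()
    masked[top:bottom, left:right, :] = 0
    return masked

face = np.ones((8, 8, 3), dtype=np.uint8) * 255
to_render = mask_mouth(face, top=5, bottom=8, left=2, right=6)
```

The same function (with the same coordinates) would be reused in step B1 when preparing sample images to be rendered, which is what keeps training and inference inputs consistent.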
The specific way of extracting the audio features in step S200 is as follows:
The audio features are mel-frequency cepstral coefficients (MFCCs). In this embodiment, the target audio is framed into 40-millisecond frames and the corresponding MFCCs are extracted; a person skilled in the art can choose the framing interval according to actual needs.
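The framing step can be sketched as below; this covers only the framing, while a full MFCC pipeline would additionally apply a window, a mel filterbank, a log, and a DCT (commonly done with a library such as `librosa.feature.mfcc`). Non-overlapping frames are an assumption here; the patent does not specify the hop length:

```python
import numpy as np

def frame_audio(signal, sample_rate, frame_ms=40):
    """Split a 1-D audio signal into non-overlapping frames of
    frame_ms milliseconds, dropping any trailing partial frame."""
    frame_len = int(sample_rate * frame_ms / 1000)
    n_frames = len(signal) // frame_len
    return signal[:n_frames * frame_len].reshape(n_frames, frame_len)

audio = np.zeros(16000)             # 1 second of audio at 16 kHz
frames = frame_audio(audio, 16000)  # 25 frames of 640 samples each
```

Each 40 ms frame then yields one MFCC vector, so one second of 16 kHz audio produces 25 feature vectors to drive 25 mouth semantic graphs.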
Further, the pre-training mode of the semantic conversion network in step S300 is as follows:
a1, acquiring a speaking video corresponding to a target face, extracting audio features of the speaking video, acquiring sample audio features, extracting video frames of the speaking video, detecting the face in each video frame, segmenting a semantic graph of the mouth of the face, and taking the obtained semantic graph of the mouth as a sample semantic graph;
in the embodiment, speaking videos of a target person are collected, and audio and video separation is carried out on each speaking video to obtain corresponding speaking audio and a plurality of video frames; framing the audio data according to 40 milliseconds and extracting corresponding Mel frequency cepstrum coefficients to obtain sample audio features; based on the existing face detection and face segmentation technology, the face in each video frame is detected, and a semantic graph of a mouth part corresponding to the face is extracted to be used as a sample semantic graph.
A2, training the semantic conversion network based on the sample audio features and the sample semantic graph, and iteratively training the semantic conversion network based on the following steps:
taking the sample audio features as an input of a semantic conversion network, and outputting a predicted mouth semantic graph, namely a predicted semantic graph, by the semantic conversion network;
performing loss calculation based on the corresponding sample semantic graph (real data) and predicted semantic graph (predicted data), backpropagating gradients based on the computed first loss value, and updating the parameters of the semantic conversion network, wherein the first loss value is the sum of the cross-entropy loss and the perceptual loss;
and finishing the training when the training times reach a preset iteration time threshold or the loss value is reduced to a preset loss threshold.
The model training step belongs to the conventional technical means in the field, so the model training step is not further detailed in this embodiment, and a person skilled in the art can also train to obtain a corresponding semantic conversion network.
Further, the semantic conversion network comprises a recurrent neural network and an upsampling convolutional neural network;
the recurrent neural network is used for converting the audio features into expression vectors:
the up-sampling convolution neural network is used for generating a semantic motion sequence based on the expression vector.
In this embodiment, the recurrent neural network comprises two GRU (gated recurrent unit) layers; this network extracts the temporal structure of the input audio, its output is averaged over the time dimension, and the result is fed into a Linear layer.
In this embodiment, Tanh is used as the activation function of the output layer of the upsampling convolutional neural network, which predicts the mouth semantic graph.
The network structure of the semantic conversion network is specifically shown in the following table:
TABLE 1
Here, kernel denotes the convolution kernel and stride the step size; the Linear layer is a fully connected layer, and the Reshape layer transforms the vector dimensions.
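The layer order described above (two GRU layers, time-average, Linear, Reshape, upsampling, Tanh) can be sketched shape-for-shape in numpy; all sizes below are assumptions standing in for Table 1, the weights are random and untrained, and nearest-neighbour upsampling stands in for the up-sampling convolutions:

```python
import numpy as np

rng = np.random.default_rng(0)

def gru_layer(x, h_dim):
    """Minimal GRU over a (T, D) sequence with random (untrained)
    weights -- a shapes-only stand-in for the patent's GRU layers."""
    T, D = x.shape
    Wz, Wr, Wh = (rng.normal(0, 0.1, (D + h_dim, h_dim)) for _ in range(3))
    h, out = np.zeros(h_dim), np.zeros((T, h_dim))
    sigmoid = lambda a: 1 / (1 + np.exp(-a))
    for t in range(T):
        xh = np.concatenate([x[t], h])
        z, r = sigmoid(xh @ Wz), sigmoid(xh @ Wr)
        h_tilde = np.tanh(np.concatenate([x[t], r * h]) @ Wh)
        h = (1 - z) * h + z * h_tilde
        out[t] = h
    return out

def semantic_conversion(mfcc, h_dim=64, grid=8, up=4):
    """Hypothetical forward pass: two GRU layers -> mean over time ->
    Linear -> Reshape to a coarse map -> nearest-neighbour upsampling
    (standing in for the up-sampling convolutions) -> Tanh output."""
    h = gru_layer(gru_layer(mfcc, h_dim), h_dim)
    v = h.mean(axis=0)                              # expression vector
    W = rng.normal(0, 0.1, (h_dim, grid * grid))
    coarse = (v @ W).reshape(grid, grid)            # Linear + Reshape
    upsampled = np.kron(coarse, np.ones((up, up)))  # spatial upsampling
    return np.tanh(upsampled)                       # Tanh output layer

sem = semantic_conversion(rng.normal(size=(25, 13)))  # 25 MFCC frames
```

In practice this would be a trained network (e.g. in a deep-learning framework) producing one semantic map per audio frame; the sketch only demonstrates how the expression vector bridges the recurrent and convolutional halves.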
Further, in step S400, performing face synthesis based on the mouth semantic graph and the face image to be rendered, and generating a synthesized face sequence specifically includes:
s410, connecting the semantic graphs of the mouth parts with the corresponding face images to be rendered respectively to obtain corresponding data to be synthesized;
the connection refers to connecting two pieces of multidimensional data of the mouth semantic graph and the face image to be rendered on one channel (dimension), for example, connecting a 20-dimensional vector and a 30-dimensional vector to form a 50-dimensional vector.
And S420, inputting the data to be synthesized into a preset neural rendering network, and synthesizing and rendering the face image to be rendered by the neural rendering network based on the mouth semantic graph to generate a corresponding synthesized face.
In this embodiment, synthesis rendering is performed through the neural rendering network, which realizes audio-driven digital human generation more accurately and robustly, makes the synthesized face closer to the real face, and improves the viewing experience.
The pre-training mode of the neural rendering network is as follows:
b1, masking the mouth region of the face in the video frame to obtain a corresponding sample image to be rendered;
the video frame is the video frame extracted in step a 1;
in this step, the mouth region is consistent with the mouth region in step S100, that is, after a person skilled in the art sets a fixed region for shielding the mouth according to actual needs, mask processing is performed on a target face image for synthesizing a digital person and a video frame serving as a training sample based on the fixed region.
B2, connecting the sample image to be rendered and the corresponding sample semantic graph to obtain corresponding sample data to be synthesized;
the connection described in the above-described connection synchronization step S410 is not described in detail.
B3, training the neural rendering network based on the sample data to be synthesized and the sample face image.
Taking the sample data to be synthesized as the input of the neural rendering network, the neural rendering network outputs the predicted synthesized face, i.e., the predicted face image;
loss calculation is performed based on the corresponding sample face image (real data) and predicted face image (predicted data), gradients are backpropagated based on the computed second loss value, and the parameters of the neural rendering network are updated, wherein the second loss value is the sum of the L1 loss and the perceptual loss;
and finishing the training when the training times reach a preset iteration time threshold or the loss value is reduced to a preset loss threshold.
The model training steps are conventional in the art and are therefore not detailed further in this embodiment; a person skilled in the art can likewise train a corresponding neural rendering network.
Note that the cross-entropy loss, the L1 loss and the perceptual loss are all conventional loss functions in the art, and a person skilled in the art can calculate corresponding loss values according to actual situations without providing detailed formulas.
In this embodiment, the semantic conversion network and the neural rendering network compute the perceptual loss as follows:
the real data $Y$ and the predicted data $\hat{Y}$ are each fed into a pre-trained VGG network $V$, where $V_j$ denotes the activation of the $j$-th layer of the VGG network when processing the real or predicted data, with shape $(C_j, H_j, W_j)$. The squared L2 distance between the features of $Y$ and $\hat{Y}$, normalized by the activation shape and summed over the selected layers, is used as the perceptual loss:

$\mathcal{L}_{perc}(Y, \hat{Y}) = \sum_j \frac{1}{C_j H_j W_j} \left\lVert V_j(Y) - V_j(\hat{Y}) \right\rVert_2^2$
When training the semantic conversion network, the real data Y is the sample semantic graph and the prediction data Ŷ is the corresponding predicted semantic graph;
when training the neural rendering network, the real data Y is the sample face image and the prediction data Ŷ is the corresponding predicted face image.
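The perceptual loss above can be written out as a small sketch. A stand-in feature extractor replaces the pre-trained VGG network V here, so the toy layer and the function names are assumptions; in practice each V_j would be a VGG layer activation.

```python
import numpy as np

def perceptual_loss(y_real, y_pred, feature_layers):
    """Sum over layers j of ||V_j(Y) - V_j(Y_hat)||_2^2 / (C_j * H_j * W_j).

    feature_layers: list of callables, each mapping an image to a
    (C_j, H_j, W_j) activation tensor (stand-ins for VGG layers).
    """
    total = 0.0
    for vj in feature_layers:
        a, b = vj(y_real), vj(y_pred)
        c, h, w = a.shape
        total += np.sum((a - b) ** 2) / (c * h * w)  # normalised squared L2
    return total
```

With real VGG features, `feature_layers` would wrap the activations of selected convolutional blocks of the pre-trained network.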
In this embodiment, the network structure of the neural rendering network is specifically shown in the following table:
TABLE 2
In the above table, Mask M_t represents the face image with a mask, namely the face image to be rendered or the sample image to be rendered; Image Q represents the mouth semantic graph; Skip indicates a skip-layer connection structure. The network architecture diagram of the neural rendering network is shown in fig. 2, where the dotted lines indicate skip-layer connections.
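The channel-wise connection of the masked face M_t with the mouth semantic graph Q, and the skip-layer connection, can be illustrated schematically. The encoder/decoder stand-ins below are assumptions; the actual network uses the convolutional layers of Table 2.

```python
import numpy as np

def connect(face_masked, mouth_semantic):
    """Concatenate the masked face (H, W, 3) with the mouth semantic
    graph (H, W, C) along the channel axis, as input to the renderer."""
    return np.concatenate([face_masked, mouth_semantic], axis=-1)

def forward_with_skip(x, encode, decode):
    """Skip-layer connection: the decoder output is combined with the
    feature passed directly from the encoder side (dotted line in fig. 2)."""
    h = encode(x)
    return decode(h) + x  # the skip path re-injects the earlier features
```

This is only the wiring; in the real network `encode`/`decode` are stacks of (transposed) convolutions.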
Case (2):
and acquiring a speaking video of the target person, and training to obtain the semantic conversion network and the neural rendering network by using the audio data (MFCC) and image data of the speaking video according to the training steps above.
Referring to fig. 3, according to actual needs, a segment of speaking video is extracted from the pre-collected speaking videos, or a speaking video (a non-training video) designated by a user is selected; the video frames of this speaking video form the target face image sequence. The mouth region of the face in each video frame (i.e., each target face image) is masked to obtain a face image to be rendered, thereby generating the first face image sequence;
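Masking the mouth region can be sketched as zeroing a rectangle in each frame. The rectangle-from-detection interface is an assumption; any face or landmark detector that yields a mouth bounding box would serve.

```python
import numpy as np

def mask_mouth_region(frame, mouth_box):
    """Return a copy of the frame with the mouth bounding box
    (x0, y0, x1, y1) blacked out, giving a face image to be rendered."""
    x0, y0, x1, y1 = mouth_box
    out = frame.copy()
    out[y0:y1, x0:x1] = 0
    return out

def build_first_sequence(frames, mouth_boxes):
    """Mask every target face image to form the first face image sequence."""
    return [mask_mouth_region(f, b) for f, b in zip(frames, mouth_boxes)]
```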
acquiring a target audio, extracting a mel frequency cepstrum coefficient of the target audio, and acquiring corresponding audio characteristics; inputting the audio features into a semantic conversion network to obtain a plurality of corresponding mouth semantic graphs;
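Since one mouth semantic graph is produced per video frame, the MFCC features must be grouped per frame. A sketch of that alignment follows; the sample rate, hop length and frame rate are assumptions, as the patent does not fix them.

```python
def mfcc_frames_for_video_frame(i, fps=25.0, sr=16000, hop=160):
    """Indices of the MFCC frames (hop samples apart) that fall inside
    video frame i, so each frame gets its own slice of audio features."""
    mfcc_per_second = sr / hop                      # 100 MFCC frames/s here
    start = int(round(i * mfcc_per_second / fps))
    end = int(round((i + 1) * mfcc_per_second / fps))
    return list(range(start, end))
```

At 25 fps with a 10 ms hop, each video frame therefore corresponds to four consecutive MFCC frames.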
after the first face image sequence is extended forward and backward so that the length of the resulting image sequence matches the number of mouth semantic graphs, the face images to be rendered are extracted from the image sequence in order; each mouth semantic graph is connected with its corresponding face image to be rendered and then input into the neural rendering network to obtain the corresponding synthesized face;
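One common way to extend an image sequence forward and backward until it matches the number of mouth semantic graphs is ping-pong (back-and-forth) looping. Reading the forward/backward extension this way is an interpretation, sketched below.

```python
def extend_sequence(frames, target_len):
    """Extend frames to target_len by traversing them forward, then
    backward, then forward again (ping-pong), so playback stays smooth."""
    if not frames:
        raise ValueError("empty sequence")
    out, idx, step = [], 0, 1
    while len(out) < target_len:
        out.append(frames[idx])
        if idx + step < 0 or idx + step >= len(frames):
            step = -step          # bounce at either end of the sequence
        idx += step
    return out
```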
and generating a video based on the obtained synthesized face and the target audio, and synchronizing the mouth shape of the digital person corresponding to the target person with the target audio.
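Generating the final video then amounts to muxing the synthesized face frames with the target audio, for example via an ffmpeg command. The codec choices and the frame-file layout below are assumptions, not mandated by the patent.

```python
def ffmpeg_mux_command(frame_pattern, audio_path, out_path, fps=25):
    """Build an ffmpeg command that encodes the synthesized face frames
    and muxes in the target audio, keeping the mouth shapes in sync."""
    return [
        "ffmpeg", "-y",
        "-framerate", str(fps), "-i", frame_pattern,  # e.g. face_%05d.png
        "-i", audio_path,
        "-c:v", "libx264", "-pix_fmt", "yuv420p",
        "-c:a", "aac", "-shortest",                   # stop at shorter stream
        out_path,
    ]
```

The command list can be executed with `subprocess.run(ffmpeg_mux_command(...), check=True)`.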
Embodiment 2, a semantic-based audio-driven digital human generation system, as shown in fig. 4, includes:
the data acquisition module 100 is configured to acquire a target audio and a target face image sequence, and perform masking processing on a mouth region of each target face image in the target face image sequence to obtain a corresponding first face image sequence;
the feature extraction module 200 is configured to perform feature extraction on the target audio to obtain corresponding audio features;
a semantic conversion module 300, configured to input the audio features into a pre-trained semantic conversion network, where the audio features are subjected to semantic conversion by the semantic conversion network to obtain a corresponding semantic motion sequence, where the semantic motion sequence includes a plurality of mouth semantic graphs;
and a synthesis rendering module 400, configured to construct a second face image sequence based on the first face image sequence, where the second face image sequence includes the same number of face images to be rendered as the mouth semantic graphs, and perform face synthesis based on the mouth semantic graphs and the face images to be rendered to generate a synthesized face sequence, where the synthesized face sequence includes synthesized faces corresponding to the mouth semantic graphs one to one.
The semantic conversion network comprises a cyclic neural network and an up-sampling convolutional neural network;
the recurrent neural network is used for converting the audio features into expression vectors;
the up-sampling convolution neural network is used for generating a semantic motion sequence based on the expression vector.
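The two-stage structure — a recurrent network mapping audio features to an expression vector, then an up-sampling stage expanding it into a mouth semantic graph — can be sketched with NumPy stand-ins. The dimensions and the nearest-neighbour up-sampling are assumptions; the real system uses trained recurrent and up-sampling convolution layers.

```python
import numpy as np

def upsample2x(grid):
    """Nearest-neighbour 2x up-sampling, standing in for an
    up-sampling convolution layer."""
    return grid.repeat(2, axis=0).repeat(2, axis=1)

def expression_to_semantic_map(expr_vec, base=4, stages=3):
    """Reshape an expression vector to a coarse grid, then up-sample
    repeatedly to produce a mouth semantic graph."""
    grid = np.asarray(expr_vec)[: base * base].reshape(base, base)
    for _ in range(stages):
        grid = upsample2x(grid)
    return grid
```

With `base=4` and three stages, a 16-dimensional expression vector expands to a 32x32 map, mirroring how the up-sampling CNN grows the spatial resolution.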
The composite rendering module 400 includes:
the connecting unit is used for connecting the semantic graph of the mouth part with the corresponding face image to be rendered respectively to obtain corresponding data to be synthesized;
and the rendering unit is used for inputting the data to be synthesized into a preset neural rendering network, and the neural rendering network synthesizes and renders the face image to be rendered based on the mouth semantic graph to generate a corresponding synthesized face.
For the device embodiment, since it is basically similar to the method embodiment, the description is simple, and for the relevant points, refer to the partial description of the method embodiment.
Embodiment 3 is a computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of the method of embodiment 1.
The embodiments in the present specification are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, apparatus, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention has been described with reference to flowchart illustrations and/or block diagrams of methods, terminal devices (systems), and computer program products according to the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing terminal to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing terminal to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing terminal to cause a series of operational steps to be performed on the computer or other programmable terminal to produce a computer implemented process such that the instructions which execute on the computer or other programmable terminal provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It should be noted that:
reference in the specification to "one embodiment" or "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the invention. Thus, the appearances of the phrase "one embodiment" or "an embodiment" in various places throughout this specification are not necessarily all referring to the same embodiment.
While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the invention.
In addition, it should be noted that the specific embodiments described in this specification may differ in the shapes of the components, the names of the components, and the like. All equivalent or simple changes made according to the structure, features and principles described in the inventive concept of this patent are included in the protection scope of this patent. Those skilled in the art may make various modifications, additions and substitutions to the specific embodiments described without departing from the scope of the invention as defined in the appended claims.
Claims (10)
1. A semantic-based audio-driven digital human generation method is characterized by comprising the following steps:
acquiring a target audio and a target face image sequence, and after masking the mouth region of each target face image in the target face image sequence, acquiring a corresponding first face image sequence;
extracting the characteristics of the target audio to obtain corresponding audio characteristics;
inputting the audio features into a pre-trained semantic conversion network, and performing semantic conversion on the audio features by the semantic conversion network to obtain a corresponding semantic motion sequence, wherein the semantic motion sequence comprises a plurality of mouth semantic graphs;
and constructing a second face image sequence based on the first face image sequence, wherein the second face image sequence contains the face images to be rendered in the same quantity as the mouth semantic images, and generating a synthesized face sequence based on the mouth semantic images and the face images to be rendered, and the synthesized face sequence contains synthesized faces corresponding to the mouth semantic images one to one.
2. The semantic-based audio driven digital human generation method of claim 1, wherein the semantic conversion network comprises a recurrent neural network and an upsampled convolutional neural network;
the recurrent neural network is used for converting the audio features into expression vectors;
the up-sampling convolution neural network is used for generating a semantic motion sequence based on the expression vector.
3. The semantic-based audio-driven digital human generation method of claim 1 or 2, wherein:
connecting the semantic graph of the mouth with the corresponding face image to be rendered to obtain corresponding data to be synthesized;
and inputting the data to be synthesized into a preset neural rendering network, and performing synthetic rendering on the face image to be rendered by the neural rendering network based on the mouth semantic graph to generate a corresponding synthetic face.
4. The semantic-based audio-driven digital human generation method of claim 3, wherein the pre-training of the semantic conversion network comprises:
acquiring a speaking video corresponding to a target face, extracting audio features of the speaking video, acquiring sample audio features, extracting video frames of the speaking video, detecting the face in each video frame, segmenting a mouth semantic graph of the face, and taking the obtained mouth semantic graph as a sample semantic graph;
training the semantic conversion network based on the sample audio features and the sample semantic graph.
5. The semantic-based audio-driven digital human generation method of claim 4, wherein the pre-training of the neural rendering network comprises:
masking the mouth region of the face in the video frame to obtain a corresponding sample image to be rendered;
connecting the sample image to be rendered and the corresponding sample semantic graph to obtain corresponding sample data to be synthesized;
and training the neural rendering network based on the sample data to be synthesized and the sample face image.
6. The semantic-based audio-driven digital human generation method of claim 5, wherein:
the audio features are mel-frequency cepstral coefficients.
7. A semantic-based audio-driven digital human generation system, comprising:
the data acquisition module is used for acquiring a target audio and a target face image sequence, and after masking processing is carried out on mouth regions of all target face images in the target face image sequence, a corresponding first face image sequence is obtained;
the characteristic extraction module is used for extracting the characteristics of the target audio to obtain corresponding audio characteristics;
the semantic conversion module is used for inputting the audio features to a pre-trained semantic conversion network, and the semantic conversion network performs semantic conversion on the audio features to obtain a corresponding semantic motion sequence, wherein the semantic motion sequence comprises a plurality of mouth semantic graphs;
and the synthetic rendering module is used for constructing a second face image sequence based on the first face image sequence, the second face image sequence contains the face images to be rendered in the same quantity as the mouth semantic graphs, and is also used for carrying out face synthesis based on the mouth semantic graphs and the face images to be rendered to generate a synthetic face sequence, and the synthetic face sequence contains synthetic faces corresponding to the mouth semantic graphs one by one.
8. The semantic-based audio-driven digital human generation system of claim 7, wherein:
the semantic conversion network comprises a cyclic neural network and an up-sampling convolutional neural network;
the recurrent neural network is used for converting the audio features into expression vectors;
the up-sampling convolution neural network is used for generating a semantic motion sequence based on the expression vector.
9. The semantic-based audio-driven digital human generation system according to claim 7 or 8, wherein the synthesis rendering module comprises:
the connecting unit is used for connecting the semantic graph of the mouth part with the corresponding face image to be rendered respectively to obtain corresponding data to be synthesized;
and the rendering unit is used for inputting the data to be synthesized into a preset neural rendering network, and the neural rendering network synthesizes and renders the face image to be rendered based on the mouth semantic graph to generate a corresponding synthesized face.
10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011382282.8A CN112562722A (en) | 2020-12-01 | 2020-12-01 | Audio-driven digital human generation method and system based on semantics |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011382282.8A CN112562722A (en) | 2020-12-01 | 2020-12-01 | Audio-driven digital human generation method and system based on semantics |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112562722A true CN112562722A (en) | 2021-03-26 |
Family
ID=75045817
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011382282.8A Pending CN112562722A (en) | 2020-12-01 | 2020-12-01 | Audio-driven digital human generation method and system based on semantics |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112562722A (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105975952A (en) * | 2016-05-26 | 2016-09-28 | 天津艾思科尔科技有限公司 | Beard detection method and system in video image |
CN108229245A (en) * | 2016-12-14 | 2018-06-29 | 贵港市瑞成科技有限公司 | Method for detecting fatigue driving based on facial video features |
CN110866968A (en) * | 2019-10-18 | 2020-03-06 | 平安科技(深圳)有限公司 | Method for generating virtual character video based on neural network and related equipment |
CN111243626A (en) * | 2019-12-30 | 2020-06-05 | 清华大学 | Speaking video generation method and system |
CN111259875A (en) * | 2020-05-06 | 2020-06-09 | 中国人民解放军国防科技大学 | Lip reading method based on adaptive spatio-temporal graph convolutional network |
Cited By (35)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113096242A (en) * | 2021-04-29 | 2021-07-09 | 平安科技(深圳)有限公司 | Virtual anchor generation method and device, electronic equipment and storage medium |
WO2022242381A1 (en) * | 2021-05-21 | 2022-11-24 | 上海商汤智能科技有限公司 | Image generation method and apparatus, device, and storage medium |
CN113299312A (en) * | 2021-05-21 | 2021-08-24 | 北京市商汤科技开发有限公司 | Image generation method, device, equipment and storage medium |
CN113269872A (en) * | 2021-06-01 | 2021-08-17 | 广东工业大学 | Synthetic video generation method based on three-dimensional face reconstruction and video key frame optimization |
CN113380269A (en) * | 2021-06-08 | 2021-09-10 | 北京百度网讯科技有限公司 | Video image generation method, apparatus, device, medium, and computer program product |
CN113380269B (en) * | 2021-06-08 | 2023-01-10 | 北京百度网讯科技有限公司 | Video image generation method, apparatus, device, medium, and computer program product |
CN113674184A (en) * | 2021-07-19 | 2021-11-19 | 清华大学 | Virtual speaker limb gesture generation method, device, equipment and storage medium |
CN113628635B (en) * | 2021-07-19 | 2023-09-15 | 武汉理工大学 | Voice-driven speaker face video generation method based on teacher student network |
CN113628635A (en) * | 2021-07-19 | 2021-11-09 | 武汉理工大学 | Voice-driven speaking face video generation method based on teacher and student network |
CN113822969A (en) * | 2021-09-15 | 2021-12-21 | 宿迁硅基智能科技有限公司 | Method, device and server for training nerve radiation field model and face generation |
CN113851145A (en) * | 2021-09-23 | 2021-12-28 | 厦门大学 | Virtual human action sequence synthesis method combining voice and semantic key actions |
CN113851145B (en) * | 2021-09-23 | 2024-06-07 | 厦门大学 | Virtual human action sequence synthesis method combining voice and semantic key actions |
CN113747086A (en) * | 2021-09-30 | 2021-12-03 | 深圳追一科技有限公司 | Digital human video generation method and device, electronic equipment and storage medium |
CN113723385A (en) * | 2021-11-04 | 2021-11-30 | 新东方教育科技集团有限公司 | Video processing method and device and neural network training method and device |
WO2023077742A1 (en) * | 2021-11-04 | 2023-05-11 | 新东方教育科技集团有限公司 | Video processing method and apparatus, and neural network training method and apparatus |
WO2023088080A1 (en) * | 2021-11-22 | 2023-05-25 | 上海商汤智能科技有限公司 | Speaking video generation method and apparatus, and electronic device and storage medium |
CN113822968B (en) * | 2021-11-24 | 2022-03-04 | 北京影创信息科技有限公司 | Method, system and storage medium for driving virtual human in real time by voice |
CN113822968A (en) * | 2021-11-24 | 2021-12-21 | 北京影创信息科技有限公司 | Method, system and storage medium for driving virtual human in real time by voice |
CN114419702A (en) * | 2021-12-31 | 2022-04-29 | 南京硅基智能科技有限公司 | Digital human generation model, training method of model, and digital human generation method |
CN114419702B (en) * | 2021-12-31 | 2023-12-01 | 南京硅基智能科技有限公司 | Digital person generation model, training method of model, and digital person generation method |
CN115330913A (en) * | 2022-10-17 | 2022-11-11 | 广州趣丸网络科技有限公司 | Three-dimensional digital population form generation method and device, electronic equipment and storage medium |
CN115953521A (en) * | 2023-03-14 | 2023-04-11 | 世优(北京)科技有限公司 | Remote digital human rendering method, device and system |
CN116342835A (en) * | 2023-03-31 | 2023-06-27 | 华院计算技术(上海)股份有限公司 | Face three-dimensional surface grid generation method, device, computing equipment and storage medium |
CN117036555B (en) * | 2023-05-18 | 2024-06-21 | 无锡捷通数智科技有限公司 | Digital person generation method and device and digital person generation system |
CN117036555A (en) * | 2023-05-18 | 2023-11-10 | 无锡捷通数智科技有限公司 | Digital person generation method and device and digital person generation system |
CN116385604A (en) * | 2023-06-02 | 2023-07-04 | 摩尔线程智能科技(北京)有限责任公司 | Video generation and model training method, device, equipment and storage medium |
CN116385604B (en) * | 2023-06-02 | 2023-12-19 | 摩尔线程智能科技(北京)有限责任公司 | Video generation and model training method, device, equipment and storage medium |
CN116778040B (en) * | 2023-08-17 | 2024-04-09 | 北京百度网讯科技有限公司 | Face image generation method based on mouth shape, training method and device of model |
CN116778040A (en) * | 2023-08-17 | 2023-09-19 | 北京百度网讯科技有限公司 | Face image generation method based on mouth shape, training method and device of model |
CN117372553A (en) * | 2023-08-25 | 2024-01-09 | 华院计算技术(上海)股份有限公司 | Face image generation method and device, computer readable storage medium and terminal |
CN117372553B (en) * | 2023-08-25 | 2024-05-10 | 华院计算技术(上海)股份有限公司 | Face image generation method and device, computer readable storage medium and terminal |
CN116994600B (en) * | 2023-09-28 | 2023-12-12 | 中影年年(北京)文化传媒有限公司 | Method and system for driving character mouth shape based on audio frequency |
CN116994600A (en) * | 2023-09-28 | 2023-11-03 | 中影年年(北京)文化传媒有限公司 | Method and system for driving character mouth shape based on audio frequency |
CN117689783A (en) * | 2024-02-02 | 2024-03-12 | 湖南马栏山视频先进技术研究院有限公司 | Face voice driving method and device based on super-parameter nerve radiation field |
CN117689783B (en) * | 2024-02-02 | 2024-04-30 | 湖南马栏山视频先进技术研究院有限公司 | Face voice driving method and device based on super-parameter nerve radiation field |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112562722A (en) | Audio-driven digital human generation method and system based on semantics | |
CN111325817B (en) | Virtual character scene video generation method, terminal equipment and medium | |
Cao et al. | Expressive speech-driven facial animation | |
CN103650002B (en) | Text based video generates | |
CN110853670B (en) | Music-driven dance generation method | |
CN112001992A (en) | Voice-driven 3D virtual human expression sound-picture synchronization method and system based on deep learning | |
JP2009533786A (en) | Self-realistic talking head creation system and method | |
KR20060090687A (en) | System and method for audio-visual content synthesis | |
US11847726B2 (en) | Method for outputting blend shape value, storage medium, and electronic device | |
JP2003529861A (en) | A method for animating a synthetic model of a human face driven by acoustic signals | |
CN110910479B (en) | Video processing method, device, electronic equipment and readable storage medium | |
CN113838173B (en) | Virtual human head motion synthesis method driven by combination of voice and background sound | |
CN113838174B (en) | Audio-driven face animation generation method, device, equipment and medium | |
CN116051692B (en) | Three-dimensional digital human face animation generation method based on voice driving | |
CN116597857A (en) | Method, system, device and storage medium for driving image by voice | |
CN115578512A (en) | Method, device and equipment for training and using generation model of voice broadcast video | |
CN116828129B (en) | Ultra-clear 2D digital person generation method and system | |
CN117409121A (en) | Fine granularity emotion control speaker face video generation method, system, equipment and medium based on audio frequency and single image driving | |
CN116912375A (en) | Facial animation generation method and device, electronic equipment and storage medium | |
CN116417008A (en) | Cross-mode audio-video fusion voice separation method | |
CN115439614B (en) | Virtual image generation method and device, electronic equipment and storage medium | |
CN115223224A (en) | Digital human speaking video generation method, system, terminal device and medium | |
Deena et al. | Speech-driven facial animation using a shared Gaussian process latent variable model | |
Mahavidyalaya | Phoneme and viseme based approach for lip synchronization | |
CN114360491A (en) | Speech synthesis method, speech synthesis device, electronic equipment and computer-readable storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
Application publication date: 20210326 |