CN112215926A - Voice-driven human face action real-time transfer method and system - Google Patents
- Publication number: CN112215926A
- Application number: CN202011027777.9A
- Authority: CN (China)
- Prior art keywords: audio, dimensional, frame, audio features, human face
- Prior art date: 2020-09-28
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T13/00—Animation
- G06T13/20—3D [Three Dimensional] animation
- G06T13/205—3D [Three Dimensional] animation driven by audio data
- G06T13/40—3D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Multimedia (AREA)
- Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)
- Image Analysis (AREA)
Abstract
The invention discloses a voice-driven method and system for real-time transfer of human face actions, wherein the method comprises the following steps: inputting an audio sequence of a source character; estimating an audio signal characterization for each frame in the audio sequence; driving a three-dimensional face model to act according to the estimated characterization of each audio frame; acquiring a target video frame; predicting the human face action on the target video frame image based on the driven three-dimensional face model to obtain a face action prediction result; and synthesizing the prediction result onto the corresponding frame image in the target video, thereby realizing real-time transfer of voice-driven face actions. The invention greatly improves the realism of the driven face action, greatly reduces the complexity of the face-driving algorithm, and effectively guarantees the real-time performance of driving the face action.
Description
Technical Field
The invention relates to the technical field of face action driving, and in particular to a voice-driven face action real-time transfer method and system.
Background
Voice-driven human face animation is a research hotspot in the field of animation simulation. Its technical core is driving a face model animation with externally input voice information. The currently popular technique establishes a correspondence between voice information and face animation videos: all face animation videos are stored in a face animation material library; the externally input voice information is recognized; the face animation video corresponding to the recognized voice is matched from the library according to the established correspondence; and the matched video is called up directly and displayed to the user. This method cannot achieve real-time voice-driven face animation.
In addition, although some existing voice-driven face animation methods guarantee real-time face driving to a certain extent, their algorithms are complex, the real-time effect is not ideal, and the fidelity of the driven face is poor, so they cannot meet application requirements.
Disclosure of Invention
The invention aims to provide a voice-driven real-time transfer method and system for human face actions, so as to solve the above technical problems.
In order to achieve the purpose, the invention adopts the following technical scheme:
the method for transferring the human face action driven by the voice in real time comprises the following steps:
inputting an audio sequence of a source character;
estimating an audio signal characterization for each frame in the audio sequence;
driving a three-dimensional face model to act according to the estimated audio signal characterization of each audio frame;
acquiring a target video frame;
predicting the human face action on the target video frame image based on the driven three-dimensional face model to obtain a human face action prediction result;
and synthesizing the face action prediction result onto the corresponding frame image in the target video, thereby realizing real-time transfer of voice-driven face actions.
Preferably, the audio signal characterization for each frame in the audio sequence is estimated based on a FacialSpeech speech recognition framework.
Preferably, the feature dimension of each frame of audio input into the FacialSpeech speech recognition framework is 16 × 29, where the number "16" represents a time window in which each frame of audio contains 16 audio features;
the number "29" indicates that the FacialSpeech alphabet is 29 in length.
Preferably, the FacialSpeech speech recognition framework comprises 4 convolutional layers and 3 fully-connected layers cascaded in sequence; the input 16 × 29-dimensional audio features pass through the one-dimensional feature convolution of the first convolutional layer, which outputs 8 × 32-dimensional audio features;
the 8 × 32-dimensional audio features pass through the one-dimensional feature convolution of the second convolutional layer, which outputs 4 × 32-dimensional audio features;
the 4 × 32-dimensional audio features pass through the one-dimensional feature convolution of the third convolutional layer, which outputs 2 × 64-dimensional audio features;
the 2 × 64-dimensional audio features pass through the one-dimensional feature convolution of the fourth convolutional layer, which outputs 64 audio features;
the first fully-connected layer maps the 64 audio features to 128;
the second fully-connected layer maps the 128 audio features to 64;
the third fully-connected layer maps the 64 audio features into an audio characterization vector of length 32.
Preferably, the convolution kernel size of each of the 4 convolutional layers is 3, and the stride is 2.
The invention also provides a voice-driven human face action real-time transfer system capable of implementing the above real-time transfer method. The system comprises:
the audio sequence input module is used for inputting an audio sequence of a source character;
the audio signal expression estimation module is connected with the audio sequence input module and used for estimating the representation of the audio signal of each frame in the audio sequence;
the model action driving module is connected with the audio signal expression estimation module and used for driving a three-dimensional face model action according to the representation of the audio signal of each audio frame;
the target video frame acquisition module is used for acquiring a target video frame;
the target frame human face action prediction module is respectively connected with the model action driving module and the target video frame acquisition module and is used for predicting human face actions on the target video frame images based on the driven three-dimensional face model to obtain a human face action prediction result;
and the face action transfer module is connected with the target frame face action prediction module and used for synthesizing the face action prediction result onto the corresponding frame image in the target video, thereby realizing real-time transfer of voice-driven face actions.
Preferably, the audio signal characterization for each frame in the audio sequence is estimated based on a FacialSpeech speech recognition framework.
Preferably, the feature dimension of each frame of audio input into the FacialSpeech speech recognition framework is 16 × 29, where the number "16" represents a time window in which each frame of audio contains 16 audio features;
the number "29" indicates that the FacialSpeech alphabet is 29 in length.
Preferably, the FacialSpeech speech recognition framework comprises 4 convolutional layers and 3 fully-connected layers cascaded in sequence; the input 16 × 29-dimensional audio features pass through the one-dimensional feature convolution of the first convolutional layer, which outputs 8 × 32-dimensional audio features;
the 8 × 32-dimensional audio features pass through the one-dimensional feature convolution of the second convolutional layer, which outputs 4 × 32-dimensional audio features;
the 4 × 32-dimensional audio features pass through the one-dimensional feature convolution of the third convolutional layer, which outputs 2 × 64-dimensional audio features;
the 2 × 64-dimensional audio features pass through the one-dimensional feature convolution of the fourth convolutional layer, which outputs 64 audio features;
the first fully-connected layer maps the 64 audio features to 128;
the second fully-connected layer maps the 128 audio features to 64;
the third fully-connected layer maps the 64 audio features into an audio characterization vector of length 32.
Preferably, the convolution kernel size of each of the 4 convolutional layers is 3, and the stride is 2.
The invention first drives the three-dimensional face model to act through the estimated audio signal characterization, then predicts the face action of the target frame image based on the driven three-dimensional face model, and synthesizes the prediction result onto the target frame image, realizing real-time transfer of voice-driven face actions.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required to be used in the embodiments of the present invention will be briefly described below. It is obvious that the drawings described below are only some embodiments of the invention, and that for a person skilled in the art, other drawings can be derived from them without inventive effort.
Fig. 1 is a step diagram of a voice-driven real-time transfer method for human face actions according to an embodiment of the present invention;
fig. 2 is a schematic structural diagram of a voice-driven real-time human face motion transfer system according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of the architecture of the FacialSpeech speech recognition framework employed in the present invention;
fig. 4 is a schematic diagram of the invention for realizing real-time transfer of human face actions based on voice driving.
Detailed Description
The technical scheme of the invention is further explained by the specific implementation mode in combination with the attached drawings.
The drawings are for the purpose of illustration only, do not depict actual form, and are not to be construed as limiting the present patent; to better illustrate the embodiments of the present invention, some parts of the drawings may be omitted, enlarged or reduced, and do not represent the size of an actual product; it will be understood by those skilled in the art that certain well-known structures in the drawings and descriptions thereof may be omitted.
The same or similar reference numerals in the drawings of the embodiments of the present invention correspond to the same or similar components; in the description of the present invention, it should be understood that if the terms "upper", "lower", "left", "right", "inner", "outer", etc. are used for indicating the orientation or positional relationship based on the orientation or positional relationship shown in the drawings, it is only for convenience of description and simplification of description, but it is not indicated or implied that the referred device or element must have a specific orientation, be constructed in a specific orientation and be operated, and therefore, the terms describing the positional relationship in the drawings are only used for illustrative purposes and are not to be construed as limitations of the present patent, and the specific meanings of the terms may be understood by those skilled in the art according to specific situations.
In the description of the present invention, unless otherwise explicitly specified or limited, the term "connected" or the like, if appearing to indicate a connection relationship between the components, is to be understood broadly, for example, as being fixed or detachable or integral; can be mechanically or electrically connected; they may be directly connected or indirectly connected through intervening media, or may be connected through one or more other components or may be in an interactive relationship with one another. The specific meanings of the above terms in the present invention can be understood in specific cases to those skilled in the art.
An embodiment of the present invention provides a method for transferring a face action in real time driven by voice, as shown in fig. 1, including the following steps:
step S1, inputting an audio sequence of a source character;
step S2, estimating an audio signal characterization (an expression characterizing the audio features) of each frame in the audio sequence;
step S3, driving a three-dimensional face model to act according to the estimated audio signal characterization of each audio frame (a schematic diagram of the three-dimensional face model appears in fig. 4);
step S4, acquiring a target video frame;
step S5, based on the driven three-dimensional face model, predicting the face action on the target video frame image to obtain the face action prediction result;
and step S6, synthesizing the face action prediction result onto the corresponding frame image in the target video, realizing real-time transfer of voice-driven face actions.
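To make the flow of steps S1 to S6 concrete, a minimal Python sketch of how the pipeline might be wired together is given below; the helper names (audio_net, face_model, and so on) are illustrative assumptions and not part of the claimed method.

```python
# Illustrative sketch of steps S1-S6; all helper names are assumptions.
def transfer_face_actions(audio_frames, target_frames, audio_net, face_model):
    """Drive a 3D face model from per-frame audio and transfer its action
    onto the corresponding target video frames."""
    output_frames = []
    for audio_frame, target_frame in zip(audio_frames, target_frames):
        # S2: estimate the audio signal characterization of this frame
        characterization = audio_net(audio_frame)            # e.g. a 32-dim vector
        # S3: drive the three-dimensional face model with the characterization
        face_model.drive(characterization)
        # S5: predict the face action on the target frame from the driven model
        prediction = face_model.predict_action(target_frame)
        # S6: synthesize the prediction onto the target frame image
        output_frames.append(face_model.composite(prediction, target_frame))
    return output_frames
```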
In step S2, the present invention estimates the audio signal characterization of each frame in the audio sequence based on the FacialSpeech speech recognition framework. FacialSpeech is a speech recognition system developed by Baidu in China. The present invention improves the network architecture for estimating the audio signal characterization on the basis of the FacialSpeech framework, and first sets the feature dimension of each frame of audio in the input audio sequence to 16 × 29, where
the number "16" represents a time window containing 16 audio features per frame of audio;
the number "29" indicates that the FacialSpeech alphabet is 29 in length.
Referring to fig. 3, the improved FacialSpeech speech recognition framework of the present invention includes 4 convolutional layers and 3 fully-connected layers cascaded in sequence; the input 16 × 29-dimensional audio features pass through the one-dimensional feature convolution of the first convolutional layer, which outputs 8 × 32-dimensional audio features;
the 8 × 32-dimensional audio features pass through the one-dimensional feature convolution of the second convolutional layer, which outputs 4 × 32-dimensional audio features;
the 4 × 32-dimensional audio features pass through the one-dimensional feature convolution of the third convolutional layer, which outputs 2 × 64-dimensional audio features;
the 2 × 64-dimensional audio features pass through the one-dimensional feature convolution of the fourth convolutional layer, which outputs 64 audio features;
the first fully-connected layer maps the 64 audio features to 128;
the second fully-connected layer maps the 128 audio features to 64;
the third fully-connected layer maps the 64 audio features into an audio characterization vector of length 32.
The convolution kernel size of each of the 4 convolutional layers is 3, and the stride is 2.
The specific estimation process of the audio signal characterization is not elaborated herein.
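Although the estimation process is not elaborated, a minimal PyTorch sketch of a network with the stated dimensions is given below. With kernel size 3 and stride 2, a padding of 1 reproduces the 16 → 8 → 4 → 2 → 1 length reduction exactly; the padding and the ReLU activations are assumptions, since the text does not specify them.

```python
import torch
import torch.nn as nn

class AudioCharacterizationNet(nn.Module):
    """Sketch of the described 4-conv / 3-FC audio network.
    Input:  (batch, 29, 16) -- 16 time steps of 29-dim alphabet features.
    Output: (batch, 32)     -- the audio characterization vector."""
    def __init__(self):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv1d(29, 32, 3, stride=2, padding=1), nn.ReLU(),  # 16x29 -> 8x32
            nn.Conv1d(32, 32, 3, stride=2, padding=1), nn.ReLU(),  # 8x32  -> 4x32
            nn.Conv1d(32, 64, 3, stride=2, padding=1), nn.ReLU(),  # 4x32  -> 2x64
            nn.Conv1d(64, 64, 3, stride=2, padding=1), nn.ReLU(),  # 2x64  -> 1x64
        )
        self.fcs = nn.Sequential(
            nn.Linear(64, 128), nn.ReLU(),  # 64 -> 128
            nn.Linear(128, 64), nn.ReLU(),  # 128 -> 64
            nn.Linear(64, 32),              # 64 -> 32 characterization vector
        )

    def forward(self, x):
        x = self.convs(x)              # (batch, 64, 1)
        return self.fcs(x.flatten(1))  # (batch, 32)
```

A quick shape check: `AudioCharacterizationNet()(torch.randn(1, 29, 16)).shape` yields `torch.Size([1, 32])`, matching the 32-length characterization vector described above.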
In step S3, a preset three-dimensional face model is driven to act based on the audio signal characterization, which is beneficial both to improving the fidelity of the target video frame's face action and to reducing the complexity of the face action synthesis algorithm. If the audio signal characterization (the expression representing the audio features) were estimated incorrectly, a directly driven face in the target video frame could look unrealistic or even distorted; the invention therefore first drives the three-dimensional face model, and transfers the model's facial action onto the target video frame image with the fidelity guaranteed, which improves the realism of the face action in the target frame. Moreover, an algorithm that drives the target frame's face directly is considerably more complex and would impair the real-time performance of the face driving.
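The patent leaves the driving mechanism of step S3 open; one common realization, sketched below purely as an assumption, is to regress blendshape weights from the 32-dim characterization and deform the model's neutral geometry with a linear blendshape model.

```python
import numpy as np

def drive_face_model(characterization, W, b, neutral_vertices, blendshapes):
    """Hypothetical step-S3 driver; the regression and blendshape model are
    assumptions, not the patented mechanism.
    W: (K, 32) learned regression matrix (assumed pre-trained), b: (K,)
    neutral_vertices: (V, 3) neutral 3D face geometry
    blendshapes: (K, V, 3) per-blendshape vertex offsets."""
    weights = np.clip(W @ characterization + b, 0.0, 1.0)   # K blendshape weights
    # Linear blendshape model: neutral face plus weighted expression offsets.
    return neutral_vertices + np.tensordot(weights, blendshapes, axes=1)
```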
Referring to fig. 4, the principle of transferring the facial action of the three-dimensional face model to the target video frame image is briefly explained below:
a face region is extracted from the target video frame image based on the three-dimensional face model; the model's action is then mapped to the target face region (many existing face mapping methods are available and are not described in detail here); finally, the action-mapped target face is synthesized onto the target video frame image, realizing real-time transfer of voice-driven face actions.
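As one possible realization of the final synthesis step (the patent does not fix a particular method), the sketch below blends an action-mapped face crop back into the target frame with OpenCV's seamless cloning; the rectangular mask and the bounding-box interface are simplifying assumptions.

```python
import cv2
import numpy as np

def composite_face(mapped_face, target_frame, face_bbox):
    """Blend an action-mapped face region into the target video frame.
    face_bbox: (x, y, w, h) of the extracted face region in the frame."""
    x, y, w, h = face_bbox
    face = cv2.resize(mapped_face, (w, h))
    # Full-crop mask; a tighter, face-shaped mask would reduce halo artifacts.
    mask = 255 * np.ones(face.shape[:2], dtype=np.uint8)
    center = (x + w // 2, y + h // 2)
    # Poisson blending hides the seam between the mapped face and the frame.
    return cv2.seamlessClone(face, target_frame, mask, center, cv2.NORMAL_CLONE)
```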
The present invention also provides a voice-driven real-time transfer system for human face actions, which can implement the above-mentioned real-time transfer method. As shown in fig. 2, the system includes:
the audio sequence input module 1 is used for inputting an audio sequence of a source character;
the audio signal expression estimation module 2 is connected with the audio sequence input module 1 and is used for estimating the representation of the audio signal of each frame in the audio sequence;
the model action driving module 3 is connected with the audio signal expression estimation module 2 and used for driving a three-dimensional face model action according to the representation of the audio signal of each audio frame;
a target video frame obtaining module 4, configured to obtain a target video frame;
the target frame human face action prediction module 5 is respectively connected with the model action driving module 3 and the target video frame acquisition module 4 and is used for predicting human face actions on a target video frame image based on the driven three-dimensional face model to obtain a human face action prediction result;
and the human face action transfer module 6 is connected with the target frame human face action prediction module 5 and is used for synthesizing the face action prediction result onto the corresponding frame image in the target video, thereby realizing real-time transfer of voice-driven face actions.
The human face action real-time transfer system provided by the invention estimates the audio signal characterization of each frame in the audio sequence based on the FacialSpeech speech recognition framework.
Specifically, as shown in fig. 3, the improved FacialSpeech framework of the present invention includes 4 convolutional layers and 3 fully-connected layers cascaded in sequence; the input 16 × 29-dimensional audio features (the number "16" represents a time window in which each frame of audio contains 16 audio features; the number "29" represents that the length of the FacialSpeech alphabet is 29) pass through the one-dimensional feature convolution of the first convolutional layer, which outputs 8 × 32-dimensional audio features;
the 8 × 32-dimensional audio features pass through the one-dimensional feature convolution of the second convolutional layer, which outputs 4 × 32-dimensional audio features;
the 4 × 32-dimensional audio features pass through the one-dimensional feature convolution of the third convolutional layer, which outputs 2 × 64-dimensional audio features;
the 2 × 64-dimensional audio features pass through the one-dimensional feature convolution of the fourth convolutional layer, which outputs 64 audio features;
the first fully-connected layer maps the 64 audio features to 128;
the second fully-connected layer maps the 128 audio features to 64;
the third fully-connected layer maps the 64 audio features into an audio characterization vector of length 32.
The convolution kernel size of each of the 4 convolutional layers is 3, and the stride is 2.
It should be understood that the above-described embodiments are merely preferred embodiments of the invention and the technical principles applied thereto. It will be understood by those skilled in the art that various modifications, equivalents, changes, and the like can be made to the present invention; such variations are within the scope of the invention as long as they do not depart from its spirit. In addition, certain terms used in the specification and claims of the present application are not limiting and are used merely for convenience of description.
Claims (10)
1. A voice-driven human face action real-time transfer method is characterized by comprising the following steps:
inputting an audio sequence of a source character;
estimating an audio signal characterization for each frame in the audio sequence;
driving a three-dimensional face model to act according to the estimated audio signal characterization of each audio frame;
acquiring a target video frame;
predicting the human face action on the target video frame image based on the driven three-dimensional face model to obtain a human face action prediction result;
and synthesizing the face action prediction result onto the corresponding frame image in the target video, thereby realizing real-time transfer of voice-driven face actions.
2. The method of claim 1, wherein the audio signal characterization for each frame in the audio sequence is estimated based on a FacialSpeech speech recognition framework.
3. The method of claim 2, wherein the feature dimension of each frame of audio input into the FacialSpeech speech recognition framework is 16 x 29,
the number "16" represents a time window containing 16 audio features per frame of audio;
the number "29" indicates that the FacialSpeech alphabet is 29 in length.
4. The real-time human face action transfer method according to claim 3, wherein the FacialSpeech speech recognition framework comprises 4 convolutional layers and 3 fully-connected layers cascaded in sequence, and the input 16 × 29-dimensional audio features pass through the one-dimensional feature convolution of the first convolutional layer, which outputs 8 × 32-dimensional audio features;
the 8 × 32-dimensional audio features pass through the one-dimensional feature convolution of the second convolutional layer, which outputs 4 × 32-dimensional audio features;
the 4 × 32-dimensional audio features pass through the one-dimensional feature convolution of the third convolutional layer, which outputs 2 × 64-dimensional audio features;
the 2 × 64-dimensional audio features pass through the one-dimensional feature convolution of the fourth convolutional layer, which outputs 64 audio features;
the first fully-connected layer maps the 64 audio features to 128;
the second fully-connected layer maps the 128 audio features to 64;
the third fully-connected layer maps the 64 audio features into an audio characterization vector of length 32.
5. The method according to claim 4, wherein the convolution kernel size of each of the 4 convolutional layers is 3, and the stride is 2.
6. A voice-driven real-time human face action transfer system, which can implement the method of any one of claims 1-5, and comprises:
the audio sequence input module is used for inputting an audio sequence of a source character;
the audio signal expression estimation module is connected with the audio sequence input module and used for estimating the representation of the audio signal of each frame in the audio sequence;
the model action driving module is connected with the audio signal expression estimation module and used for driving a three-dimensional face model action according to the representation of the audio signal of each audio frame;
the target video frame acquisition module is used for acquiring a target video frame;
the target frame human face action prediction module is respectively connected with the model action driving module and the target video frame acquisition module and is used for predicting human face actions on the target video frame images based on the driven three-dimensional face model to obtain a human face action prediction result;
and the face action transfer module is connected with the target frame face action prediction module and used for synthesizing the face action prediction result onto the corresponding frame image in the target video, thereby realizing real-time transfer of voice-driven face actions.
7. The system of claim 6, wherein the audio signal characterization for each frame in the audio sequence is estimated based on a FacialSpeech speech recognition framework.
8. The system of claim 7, wherein the feature dimension of each frame of audio input into the FacialSpeech speech recognition framework is 16 x 29,
the number "16" represents a time window containing 16 audio features per frame of audio;
the number "29" indicates that the FacialSpeech alphabet is 29 in length.
9. The system of claim 8, wherein the FacialSpeech speech recognition framework comprises 4 convolutional layers and 3 fully-connected layers cascaded in sequence, and the input 16 × 29-dimensional audio features pass through the one-dimensional feature convolution of the first convolutional layer, which outputs 8 × 32-dimensional audio features;
the 8 × 32-dimensional audio features pass through the one-dimensional feature convolution of the second convolutional layer, which outputs 4 × 32-dimensional audio features;
the 4 × 32-dimensional audio features pass through the one-dimensional feature convolution of the third convolutional layer, which outputs 2 × 64-dimensional audio features;
the 2 × 64-dimensional audio features pass through the one-dimensional feature convolution of the fourth convolutional layer, which outputs 64 audio features;
the first fully-connected layer maps the 64 audio features to 128;
the second fully-connected layer maps the 128 audio features to 64;
the third fully-connected layer maps the 64 audio features into an audio characterization vector of length 32.
10. The system of claim 9, wherein the convolution kernel size of each of the 4 convolutional layers is 3, and the stride is 2.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011027777.9A CN112215926A (en) | 2020-09-28 | 2020-09-28 | Voice-driven human face action real-time transfer method and system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011027777.9A CN112215926A (en) | 2020-09-28 | 2020-09-28 | Voice-driven human face action real-time transfer method and system |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112215926A true CN112215926A (en) | 2021-01-12 |
Family
ID=74051267
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011027777.9A Pending CN112215926A (en) | 2020-09-28 | 2020-09-28 | Voice-driven human face action real-time transfer method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112215926A (en) |
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102054287A (en) * | 2009-11-09 | 2011-05-11 | 腾讯科技(深圳)有限公司 | Facial animation video generating method and device |
CN106485774A (en) * | 2016-12-30 | 2017-03-08 | 当家移动绿色互联网技术集团有限公司 | Expression based on voice Real Time Drive person model and the method for attitude |
CN109308731A (en) * | 2018-08-24 | 2019-02-05 | 浙江大学 | Speech-driven lip-syncing face video synthesis algorithm based on cascaded convolutional LSTM |
CN111243065A (en) * | 2019-12-26 | 2020-06-05 | 浙江大学 | Voice signal driven face animation generation method |
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113035198A (en) * | 2021-02-26 | 2021-06-25 | 北京百度网讯科技有限公司 | Lip movement control method, device and medium for three-dimensional face |
CN113035198B (en) * | 2021-02-26 | 2023-11-21 | 北京百度网讯科技有限公司 | Three-dimensional face lip movement control method, equipment and medium |
CN113132815A (en) * | 2021-04-22 | 2021-07-16 | 北京房江湖科技有限公司 | Video generation method and device, computer-readable storage medium and electronic equipment |
CN113160799A (en) * | 2021-04-22 | 2021-07-23 | 北京房江湖科技有限公司 | Video generation method and device, computer-readable storage medium and electronic equipment |
CN113408449A (en) * | 2021-06-25 | 2021-09-17 | 达闼科技(北京)有限公司 | Face action synthesis method based on voice drive, electronic equipment and storage medium |
CN113408449B (en) * | 2021-06-25 | 2022-12-06 | 达闼科技(北京)有限公司 | Face action synthesis method based on voice drive, electronic equipment and storage medium |
WO2023088080A1 (en) * | 2021-11-22 | 2023-05-25 | 上海商汤智能科技有限公司 | Speaking video generation method and apparatus, and electronic device and storage medium |
CN117729298A (en) * | 2023-12-15 | 2024-03-19 | 北京中科金财科技股份有限公司 | Photo driving method based on action driving and mouth shape driving |
CN117831126A (en) * | 2024-01-02 | 2024-04-05 | 暗物质(北京)智能科技有限公司 | Voice-driven 3D digital human action generation method, system, equipment and medium |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 2021-01-12 |