CN117152285A - Virtual person generating method, device, equipment and medium based on audio control - Google Patents

Virtual person generating method, device, equipment and medium based on audio control

Info

Publication number
CN117152285A
CN117152285A (application CN202311032590.1A)
Authority
CN
China
Prior art keywords
image
features
audio
virtual person
emotion
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311032590.1A
Other languages
Chinese (zh)
Inventor
林洛阳
梁小丹
操晓春
陈定纬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sun Yat Sen University
Original Assignee
Sun Yat Sen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sun Yat Sen University filed Critical Sun Yat Sen University
Priority to CN202311032590.1A priority Critical patent/CN117152285A/en
Publication of CN117152285A publication Critical patent/CN117152285A/en
Pending legal-status Critical Current


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L 25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L 25/63 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 11/00 2D [Two Dimensional] image generation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/40 Analysis of texture
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 9/00 Image coding
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L 25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L 25/57 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for processing of video signals
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Signal Processing (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Psychiatry (AREA)
  • Hospice & Palliative Care (AREA)
  • General Health & Medical Sciences (AREA)
  • Child & Adolescent Psychology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Processing Or Creating Images (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a virtual person generating method, device, equipment and medium based on audio control, wherein the method comprises the following steps: acquiring an input image, and performing image coding on the input image to obtain a hidden layer vector; adding noise to the hidden layer vector based on Gaussian distribution noise to obtain a noise added image; acquiring a visual image, and performing feature extraction on the visual image to obtain visual features; acquiring input audio, and performing feature extraction and mask processing on the input audio to obtain mask audio features; performing emotion recognition on the input image or the input audio to obtain emotion features; and performing inverse denoising and image decoding on the noisy image based on the visual features, the mask audio features and the emotion features to obtain a target image of the virtual person. The invention effectively overcomes the limitation that existing virtual person generation methods are restricted to facial modeling, efficiently realizes virtual person generation based on audio control, and can be widely applied in the technical field of image processing.

Description

Virtual person generating method, device, equipment and medium based on audio control
Technical Field
The invention relates to the technical field of image processing, in particular to a virtual person generating method, device, equipment and medium based on audio control.
Background
Virtual person generation is a video generation technology based on audio and images. It aims to generate the facial expressions, mouth shapes, gestures and the like of a virtual person in real time from audio and image input, so that the virtual person can realistically simulate human communication and behavior. In recent years, the rapid development of computer vision technology has provided broad space for the generation and application of virtual persons, and the application potential of virtual person generation technology is huge. In the entertainment field, it can be applied to movies, games, virtual anchors and the like, so that virtual characters can generate accurate facial expressions and mouth movements in real time according to voice input, enhancing the expressiveness and emotional delivery of the characters; in the field of education and training, virtual person generation technology can be used for virtual teachers, keeping facial expressions and actions consistent with the speech content and providing a more vivid and engaging teaching experience. In addition, virtual person generation technology can be applied to fields such as human-computer interaction, virtual reality and augmented reality, providing users with more natural and realistic interaction experiences. Driven by these broad real-world demands, how to generate vivid and lifelike virtual figures and realize a more realistic virtual interaction experience has been a long-term goal in the field of artificial intelligence and computer vision. However, existing virtual person generation approaches have certain limitations and relatively low efficiency.
Disclosure of Invention
In view of the above, the embodiments of the present invention provide a method, apparatus, device, and medium for generating a virtual person based on audio control, which can efficiently generate a virtual person based on audio control.
In one aspect, an embodiment of the present invention provides a method for generating a virtual person based on audio control, including:
acquiring an input image, and performing image coding on the input image to obtain a hidden layer vector;
adding noise to the hidden layer vector based on Gaussian distribution noise to obtain a noise added image;
acquiring a visual image, and performing feature extraction on the visual image to obtain visual features; acquiring input audio, and performing feature extraction and mask processing on the input audio to obtain mask audio features;
performing emotion recognition on the input image or the input audio to obtain emotion features;
and performing inverse denoising and image decoding on the noisy image based on the visual features, the mask audio features and the emotion features to obtain a target image of the virtual person.
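To make the relationship between steps S100 to S500 concrete before the optional refinements below, the following is a minimal structural sketch in PyTorch. The module names, dimensions and the simplified update rule are illustrative assumptions for exposition, not the claimed implementation.

```python
# Minimal structural sketch of the audio-controlled virtual person pipeline.
# All module names, tensor shapes and the update rule are illustrative assumptions.
import torch
import torch.nn as nn

class StubEncoder(nn.Module):
    """Stand-in for a pre-trained encoder (image, audio, or emotion)."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.proj = nn.Linear(in_dim, out_dim)
    def forward(self, x):
        return self.proj(x)

def generate_virtual_person(input_image, pose_image, texture_image, audio, T=50):
    d = 64  # latent dimension (assumed)
    image_encoder   = StubEncoder(input_image.shape[-1], d)    # S100
    visual_encoder  = StubEncoder(pose_image.shape[-1], d)     # S300 (visual)
    audio_encoder   = StubEncoder(audio.shape[-1], d)          # S300 (audio)
    emotion_encoder = StubEncoder(input_image.shape[-1], d)    # S400
    denoiser        = StubEncoder(4 * d, d)                    # S500 (u-net stand-in)
    decoder         = StubEncoder(d, input_image.shape[-1])    # image decoder

    z = image_encoder(input_image)                  # S100: hidden layer vector
    z_t = z + torch.randn_like(z)                   # S200: Gaussian noising (schematic)
    pose_feat = visual_encoder(pose_image)          # S300: visual features
    tex_feat = visual_encoder(texture_image)
    audio_feat = audio_encoder(audio)               # mask audio features (mask omitted here)
    emo_feat = emotion_encoder(input_image)         # S400: emotion features

    for _ in range(T):                              # S500: iterative reverse denoising
        cond = torch.cat([z_t, pose_feat + tex_feat, audio_feat, emo_feat], dim=-1)
        z_t = z_t - 0.1 * denoiser(cond)            # schematic update, not the real sampler
    return decoder(z_t)                             # target image of the virtual person

# Usage with dummy tensors (batch of 1, feature dim 128)
img = torch.randn(1, 128); aud = torch.randn(1, 128)
out = generate_virtual_person(img, img.clone(), img.clone(), aud)
```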
Optionally, image encoding the input image to obtain a hidden layer vector includes:
the input image is encoded into a hidden layer space using a pre-trained first image encoder to obtain hidden layer vectors.
Optionally, adding noise to the hidden layer vector based on Gaussian distribution noise to obtain a noise added image includes:
performing forward diffusion processing on the hidden layer vector in the hidden layer space, and adding noise drawn from a preset Gaussian distribution to the hidden layer vector to obtain the noise added image.
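As a concrete illustration of this forward diffusion step, the sketch below noises a hidden layer vector with the standard closed-form noising used by diffusion models; the linear β schedule and tensor shapes are assumptions, not values taken from the invention.

```python
# Forward diffusion in latent space: z_t = sqrt(alpha_bar_t) * z_0 + sqrt(1 - alpha_bar_t) * eps.
# The linear beta schedule and latent size are illustrative assumptions.
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)          # preset Gaussian noise variances
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)      # cumulative product over time steps

def add_noise(z0: torch.Tensor, t: int):
    """Noise the hidden layer vector z0 up to step t in one shot."""
    eps = torch.randn_like(z0)                 # noise drawn from a Gaussian distribution
    z_t = alpha_bars[t].sqrt() * z0 + (1.0 - alpha_bars[t]).sqrt() * eps
    return z_t, eps

z0 = torch.randn(1, 4, 32, 32)                 # hidden layer vector from the image encoder
z_t, eps = add_noise(z0, t=500)                # noise added image representation
```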
Optionally, the visual image comprises a gesture image and a texture image, and the visual feature comprises a gesture feature and a texture feature; extracting features of the visual image to obtain visual features, including:
extracting features of the gesture image by using a pre-trained second image encoder to obtain gesture features;
extracting features of the texture image by using a pre-trained second image encoder to obtain texture features;
the gesture features are used for generating the body gesture of the virtual person, and the texture features are used for generating the face texture of the virtual person.
Optionally, performing feature extraction and masking processing on the input audio to obtain masked audio features, including:
performing feature extraction on input audio by using a pre-trained audio feature extractor to obtain audio features;
adding a lip mask to the audio features to obtain masked audio features;
wherein the masked audio features are used to generate the mouth region of the virtual person.
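The sketch below shows one way such a lip mask could be applied so that the audio features only affect the mouth region; the encoder stand-in, spatial grid and mask box are illustrative assumptions.

```python
# Masking audio features so they act only on the mouth region.
# The encoder stand-in, spatial grid and mask box are assumptions.
import torch
import torch.nn as nn

audio_encoder = nn.Sequential(nn.Linear(80, 256), nn.ReLU(), nn.Linear(256, 256))

def masked_audio_features(mel: torch.Tensor, h: int = 32, w: int = 32) -> torch.Tensor:
    """mel: (B, 80) audio frame features -> (B, 256, h, w) spatially expanded, lip-masked."""
    a = audio_encoder(mel)                            # (B, 256) audio features
    a = a[:, :, None, None].expand(-1, -1, h, w)      # spatial expansion over the latent grid
    lip_mask = torch.zeros(1, 1, h, w)
    lip_mask[:, :, h * 3 // 4:, w // 4: w * 3 // 4] = 1.0   # assumed mouth region (lower-center)
    return a * lip_mask                               # masked audio features

feats = masked_audio_features(torch.randn(2, 80))
```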
Optionally, emotion recognition is performed on the input image or the input audio to obtain emotion characteristics, including:
carrying out emotion recognition on the input image or the input audio by utilizing a pre-trained expression recognition model to obtain emotion features;
wherein the expression recognition model is trained with an emotion classification loss function using input images annotated with emotion labels, together with video features and audio features obtained from the input images.
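A minimal sketch of how a pre-trained expression recognition model could supply emotion features and an emotion classification loss; the backbone, the seven emotion classes and the feature size are assumptions.

```python
# Emotion feature extraction with a (stand-in) pre-trained expression recognition model.
# The backbone, 7 emotion classes, and 128-d feature size are illustrative assumptions.
import torch
import torch.nn as nn

class ExpressionRecognizer(nn.Module):
    def __init__(self, num_classes: int = 7, feat_dim: int = 128):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, feat_dim))
        self.head = nn.Linear(feat_dim, num_classes)

    def forward(self, face: torch.Tensor):
        emotion_feat = self.backbone(face)       # emotion features used to guide denoising
        logits = self.head(emotion_feat)         # logits for the emotion classification loss
        return emotion_feat, logits

model = ExpressionRecognizer()
feat, logits = model(torch.randn(1, 3, 128, 128))
loss = nn.functional.cross_entropy(logits, torch.tensor([3]))  # label chosen for illustration
```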
Optionally, performing inverse denoising and image decoding on the noisy image based on the visual features, the mask audio features and the emotion features to obtain a target image of the virtual person includes:
inputting the visual features, the mask audio features and the emotion features into a pre-trained reverse denoising module, and performing reverse denoising on the noisy image;
wherein the reverse denoising module comprises a plurality of u-nets, each u-net being used for predicting the denoising vector of the noisy image at a different stage to obtain the noise image predicted at that stage; the reverse denoising module is generated based on noise reconstruction loss function training;
the reverse denoising process comprises the following steps:
splicing (concatenating) the visual features with the noise images at different stages, and generating the body posture and the face texture of the virtual person based on the spliced features; the input of the u-net at each stage is the spliced feature of the noise image and the visual features of the previous stage;
inputting the mask audio features into each u-net, and performing attention mechanism processing with the spliced features to generate the mouth region of the virtual person and obtain multi-modal features;
and inputting the emotion features into each u-net, and performing feature fusion with the multi-modal features of the u-net to generate the face of the virtual person, so as to obtain the target image of the virtual person.
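The sketch below illustrates how a single u-net stage could combine the three conditioning signals described above: splicing visual features with the noisy latent, cross-attending to the mask audio features, and fusing the emotion features. The channel sizes and the additive emotion fusion are assumptions rather than the claimed architecture.

```python
# One conditioning stage of the reverse denoising module (schematic, not the real u-net).
# Channel sizes and the add-based emotion fusion are illustrative assumptions.
import torch
import torch.nn as nn

class ConditionedStage(nn.Module):
    def __init__(self, latent_ch=4, visual_ch=8, dim=64):
        super().__init__()
        self.merge = nn.Conv2d(latent_ch + visual_ch, dim, 3, padding=1)  # splice latent + visual
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.emotion_proj = nn.Linear(dim, dim)
        self.to_noise = nn.Conv2d(dim, latent_ch, 3, padding=1)           # predicts noise

    def forward(self, z_t, visual_feat, masked_audio, emotion_feat):
        h = self.merge(torch.cat([z_t, visual_feat], dim=1))              # pose/texture guidance
        b, c, hh, ww = h.shape
        q = h.flatten(2).transpose(1, 2)                                  # (B, HW, C) queries
        h_attn, _ = self.attn(q, masked_audio, masked_audio)              # audio as key/value
        h = h + h_attn.transpose(1, 2).reshape(b, c, hh, ww)              # multi-modal features
        h = h + self.emotion_proj(emotion_feat)[:, :, None, None]         # emotion fusion
        return self.to_noise(h)                                           # stage noise prediction

stage = ConditionedStage()
eps_hat = stage(torch.randn(1, 4, 32, 32), torch.randn(1, 8, 32, 32),
                torch.randn(1, 16, 64), torch.randn(1, 64))
```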
In another aspect, an embodiment of the present invention provides an audio control-based virtual person generating apparatus, including:
the first module is used for acquiring an input image, and performing image coding on the input image to acquire a hidden layer vector;
the second module is used for adding noise to the hidden layer vector based on Gaussian distribution noise to obtain a noise added image;
the third module is used for acquiring a visual image and performing feature extraction on the visual image to obtain visual features, and for acquiring input audio and performing feature extraction and mask processing on the input audio to obtain mask audio features;
a fourth module, configured to perform emotion recognition on an input image or input audio to obtain emotion features;
and a fifth module, configured to perform inverse denoising and image decoding on the noisy image based on the visual features, the mask audio features and the emotion features, so as to obtain a target image of the virtual person.
In another aspect, an embodiment of the present invention provides an electronic device, including a processor and a memory;
the memory is used for storing programs;
the processor executes a program to implement the method as before.
In another aspect, embodiments of the present invention provide a computer-readable storage medium storing a program for execution by a processor to perform a method as previously described.
Embodiments of the present invention also disclose a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The computer instructions may be read from a computer-readable storage medium by a processor of a computer device, and executed by the processor, to cause the computer device to perform the foregoing method.
Firstly, an input image is acquired and image-encoded to obtain a hidden layer vector; noise is added to the hidden layer vector based on Gaussian distribution noise to obtain a noise added image; a visual image is acquired and feature extraction is performed on it to obtain visual features; input audio is acquired, and feature extraction and mask processing are performed on the input audio to obtain mask audio features; emotion recognition is performed on the input image or the input audio to obtain emotion features; and inverse denoising and image decoding are performed on the noisy image based on the visual features, the mask audio features and the emotion features to obtain a target image of the virtual person. According to the embodiment of the invention, noise sampled from a Gaussian distribution is removed through reverse denoising, and the denoised hidden layer vector is decoded by the image decoder to obtain the final image; by fusing the visual features and the mask audio features, the modeling capability for feature information is enhanced; and emotion-feature-guided semantic modeling is introduced into the reverse denoising process, which successfully improves the quality of half-body virtual person generation and effectively overcomes the limitation that existing virtual person generation methods are restricted to facial modeling. The embodiment of the invention can efficiently realize virtual person generation based on audio control.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a schematic flow chart of a virtual person generating method based on audio control according to an embodiment of the present invention;
fig. 2 is a schematic flow chart of a training path of a virtual person generating method based on audio control according to an embodiment of the present invention;
FIG. 3 is a schematic flow chart of one of training path expansion of the audio control-based virtual person generation method according to the embodiment of the present invention;
fig. 4 is a schematic diagram of architecture of a virtual person generation model based on audio control according to an embodiment of the present invention;
FIG. 5 is a schematic flow chart of one of training path expansion of the virtual person generation method based on audio control according to the embodiment of the present invention;
FIG. 6 is a schematic flow chart of one of training path expansion of the audio control-based virtual person generation method according to the embodiment of the present invention;
Fig. 7 is a schematic structural diagram of a virtual person generating apparatus based on audio control according to an embodiment of the present invention;
fig. 8 is a schematic diagram of a frame of an electronic device according to an embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
It should be noted that early virtual person generation techniques relied mainly on statistical models and machine learning methods to implement virtual person generation by modeling audio information or video frame sequences. For example, in one research effort, a Hidden Markov Model (HMM) based method is proposed, where a given speech signal is modeled using a trained model to generate a visual parameter trajectory of lip motion, and an optimal sequence of mouth-shaped images is selected from an original database according to the generated trajectory information, and synthesized with a virtual human background head video.
Later, with the rapid development of deep learning technology, methods based on deep learning and neural networks became widely used in virtual person generation and gradually surpassed and replaced the earlier methods based on statistical models and machine learning. Data-driven learning is one of the main characteristics of deep-learning-based virtual person generation: a deep neural network automatically extracts feature information from audio and image data, analyzes and learns it, captures information such as human body motion and facial expressions, and a generative model is then trained to generate the virtual person, giving higher efficiency and accuracy. Common neural network methods such as Recurrent Neural Networks (RNN) and Long Short-Term Memory networks (LSTM) have all made significant progress in virtual person generation research. In addition, a typical neural-network-based virtual person generation approach uses Generative Adversarial Networks (GAN), where the model is composed of a generator and a discriminator: the generator receives information such as an audio sequence and a video frame sequence and generates a continuous virtual person video sequence, while the discriminator is responsible for evaluating the authenticity of the generated virtual person video and distinguishing it from real video; by alternately training the generator and the discriminator, more realistic and natural virtual person generation is achieved. GAN-based virtual person generation is one of the popular research directions in this field. For example, a representative research effort developed a framework for generating high-resolution virtual person video by facial editing based on a pre-trained generative adversarial network model; another study combined a generative adversarial network with a cross-modal attention mechanism to improve the performance of virtual person generation.
Recently, some research efforts have begun to attempt virtual person generation using diffusion-model-based methods to alleviate the training difficulties of generative adversarial networks. A diffusion-model-based virtual person generation method gradually adds Gaussian noise to audio and image data in a forward diffusion process, and then trains a neural network in the reverse diffusion process to learn the conditional distribution and invert the noise. For example, one research effort proposed the Latent Diffusion Model (LDM), which first encodes images into a feature space and applies the forward diffusion and denoising process to low-dimensional image features to achieve more efficient generation; another study combined a latent diffusion model with spatial and directional temporal attention to enable text-guided virtual person video generation.
Currently, there are many models and methods for virtual person generation, and the three categories above are the common solutions. However, virtual person generation methods based on statistical models and machine learning generally require processing of the input audio or image data by means of hand-crafted feature extraction strategies or classical feature extraction algorithms. Because such traditional techniques cannot capture the feature information in audio and images well, these methods are generally inefficient and costly, and the generated videos have low accuracy and poor transferability. Secondly, virtual person generation methods based on deep learning and diffusion models can adaptively extract and model latent feature information in the data with the aid of neural networks. Although considerable progress has been made, existing research focuses only on image synthesis of the face and on discrete emotion categories in the data, while ignoring feature information such as upper-body motion generation (e.g., gestures) and exploitable emotion semantics, and some works depend heavily on 3D parametric models.
In view of this, in one aspect, as shown in fig. 1, an embodiment of the present invention provides a virtual person generating method based on audio control, including:
s100, acquiring an input image, and performing image coding on the input image to obtain a hidden layer vector;
it should be noted that, in some embodiments, performing image encoding on an input image to obtain a hidden layer vector may include: the input image is encoded into a hidden layer space using a pre-trained first image encoder to obtain hidden layer vectors.
In some embodiments, as shown in FIG. 2, an image auto-encoder is pre-trained to encode the image into the hidden layer space for the forward diffusion and inverse denoising process of the subsequent potential diffusion model.
S200, noise is added to the hidden layer vector based on Gaussian distribution noise, and a noise added image is obtained;
it should be noted that, in some embodiments, step S200 may include: and performing forward diffusion processing on the hidden layer vector in the hidden layer space, and adding noise from Gaussian distribution on the hidden layer vector based on preset Gaussian distribution to obtain a noise-added image.
In some embodiments, as shown in fig. 3, the forward diffusion process of the potential diffusion model is as follows: for each input image, the image is encoded into the hidden layer space using a pre-trained image encoder, and the forward diffusion process is performed in the hidden layer space; specifically, a set of Gaussian distributions is predefined, and noise from these Gaussian distributions is added to the hidden layer vectors.
S300, acquiring a visual image, and performing feature extraction on the visual image to obtain visual features; acquiring input audio, and performing feature extraction and mask processing on the input audio to obtain mask audio features;
it should be noted that, the visual image includes a gesture image and a texture image, the visual features include a gesture feature and a texture feature, and in some embodiments, performing feature extraction on the visual image to obtain the visual features may include: extracting features of the gesture image by using a pre-trained second image encoder to obtain gesture features; extracting features of the texture image by using a pre-trained second image encoder to obtain texture features; the gesture features are used for generating the body gesture of the virtual person, and the texture features are used for generating the face texture of the virtual person.
In some embodiments, performing feature extraction and masking processing on the input audio to obtain masked audio features may include: performing feature extraction on input audio by using a pre-trained audio feature extractor to obtain audio features; adding a lip mask to the audio features to obtain masked audio features; wherein the masking audio features are used for the mouth-shaped portion generation of the virtual person.
In order to achieve high-quality virtual person video generation, a mask image I_m (pose image) for gesture control and a reference image I_r (texture image) for preserving identity information are given. The model adopts a pre-trained image encoder E_I to extract the corresponding feature information: the texture feature z_r and the gesture feature z_m. These features are then input into each u-net of the potential diffusion model to control the facial pose and preserve identity information.
For audio-controlled half-body virtual person generation, the model first uses an audio encoder E_A to extract audio features. Since the audio features typically only control the region around the mouth, the model adds a lip mask to the audio features and inputs the masked audio features as key-value pairs into the cross-attention module of the potential diffusion model for multi-modal learning. Under the action of the lip mask, the audio features have a significant effect on the lip motion generated for the half-body virtual person.
S400, emotion recognition is carried out on the input image or the input audio, and emotion characteristics are obtained;
It should be noted that, in some embodiments, step S400 may include: carrying out emotion recognition on the input image or the input audio by utilizing a pre-trained expression recognition model to obtain emotion features; the expression recognition model is trained with an emotion classification loss function using input images annotated with emotion labels, together with video features and audio features obtained from the input images.
In some embodiments, the half-body virtual person generation model based on audio control incorporates multi-modal features into the potential diffusion model. The audio features mainly control the motion of the lips, while the gesture features mainly govern the posture changes of the body and face. Although the model fuses the texture features of the virtual person into the u-nets of the potential diffusion model for face synthesis, the facial appearance changes are driven by the audio input while the texture features are randomly selected from frames of a virtual person with the same identity in the dataset, so the model lacks the information necessary for predicting facial details. To solve this problem, the model incorporates an emotion information guide module, which introduces emotion-information-guided semantic modeling into the diffusion process to generate high-quality facial textures for the half-body virtual person. Specifically, as shown in the emotion information guide module of fig. 4, the model uses emotion features to assist the multi-modal features in half-body virtual person generation and to guide the semantic denoising process of the potential diffusion model. By adopting the emotion features, the model can adaptively recover the facial texture of the virtual person.
Because the dataset lacks emotion labels, and collecting label information requires substantial labor cost, the model adopts a pre-trained model to extract the emotion information of each frame of virtual person image in the dataset and fuses the emotion labels into the potential diffusion model; meanwhile, a facial expression classification loss is added during training to ensure consistency between the facial expression of the virtual person and the provided emotion label. Specifically, the model first crops out the face portion T_fa of the half-body virtual person and calculates a loss value using a cross-entropy function, which can be formulated as:
L_cls = CE(F_fa(T_fa), z_e)
wherein F_fa represents a pre-trained model for emotion classification, z_e represents the emotion label, and CE represents the cross-entropy loss function.
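A hedged sketch of this face-level emotion consistency loss: the face region is cropped with a bounding box, classified by a stand-in for the pre-trained model F_fa, and compared with the emotion label via cross-entropy. The bounding box and classifier are illustrative assumptions.

```python
# Emotion classification loss L_cls = CE(F_fa(T_fa), z_e) on the cropped face region.
# The face bounding box and the classifier F_fa are assumed stand-ins.
import torch
import torch.nn as nn
import torch.nn.functional as F

F_fa = nn.Sequential(                         # stand-in for the pre-trained emotion classifier
    nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 7))

def emotion_cls_loss(generated: torch.Tensor, bbox: tuple, z_e: torch.Tensor) -> torch.Tensor:
    """generated: (B, 3, H, W) half-body image; bbox: (top, left, h, w); z_e: (B,) labels."""
    top, left, h, w = bbox
    T_fa = generated[:, :, top:top + h, left:left + w]    # crop the face portion
    return F.cross_entropy(F_fa(T_fa), z_e)               # CE against the emotion label

loss = emotion_cls_loss(torch.randn(2, 3, 256, 256), (32, 96, 64, 64), torch.tensor([0, 4]))
```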
S500, based on the visual features, the mask audio features and the emotion features, carrying out inverse denoising and image decoding on the noisy image to obtain a target image of the virtual person.
It should be noted that, in some embodiments, step S500 may include: inputting the visual features, the mask audio features and the emotion features into a pre-trained reverse denoising module, and performing reverse denoising on the noisy image; the reverse denoising module comprises a plurality of u-nets, each u-net being used for predicting the denoising vector of the noisy image at a different stage to obtain the noise image predicted at that stage; the reverse denoising module is generated based on noise reconstruction loss function training. The reverse denoising process comprises the following steps: splicing the visual features with the noise images at different stages, and generating the body posture and the face texture of the virtual person based on the spliced features, where the input of the u-net at each stage is the spliced feature of the noise image and the visual features of the previous stage; inputting the mask audio features into each u-net and performing attention mechanism processing with the spliced features to generate the mouth region of the virtual person and obtain multi-modal features; and inputting the emotion features into each u-net and performing feature fusion with the multi-modal features of the u-net to generate the face of the virtual person, so as to obtain the target image of the virtual person.
In some embodiments, as shown in fig. 5, visual features and audio features of the image are extracted, and the visual features and audio features are fused into a reverse denoising module in the potential diffusion model to guide the generation of mouth shapes and body parts;
in a specific embodiment, extracting visual features and audio features of an image, fusing the visual features and the audio features into a reverse denoising module in a potential diffusion model, guiding the generation of mouth shapes and body parts, comprising:
C1, constructing a reverse denoising module consisting of a plurality of u-nets, wherein each u-net predicts the denoising vector at a different stage to obtain the predicted noise image at that stage;
C2, extracting gesture and texture visual features by using an image feature encoder: the gesture features of the image are extracted to control the human body posture, and the texture features are extracted to generate the face texture; the two visual features are spliced with the noise images predicted at different stages to serve as the input of the u-net at the next stage;
and C3, extracting audio features by using an audio feature encoder, spatially expanding the audio features and masking the mouth region, inputting them into each u-net module, and performing an attention mechanism algorithm with the u-net features to guide the generation of the mouth region.
Further, as shown in fig. 6, since the audio features only control the mouth shape, a pre-trained model is used to extract the emotion features of the face; the emotion features are input into each u-net layer of the reverse denoising module for feature fusion, assisting the multi-modal features and controlling the generation of the whole face;
In a specific embodiment, step D: since the audio features only control the mouth shape, a pre-trained model is used to extract the emotion features of the face; the emotion features are input into each u-net layer of the reverse denoising module for feature fusion, assisting the multi-modal features and controlling the generation of the whole face. Embodiments of the emotion feature extraction include the following:
D1, extracting the facial emotion code by using an expression recognition model as the emotion feature;
D2, extracting facial image features as the emotion feature;
and D3, extracting audio emotion features as the emotion feature.
Furthermore, as shown in fig. 2, step E: the reverse denoising module of the potential diffusion model is trained using a noise reconstruction loss function and an emotion classification loss function, and training stops when the loss function value on the test set no longer decreases;
F. Inference: noise sampled from a Gaussian distribution is input, the Gaussian noise is denoised by the reverse denoising module of the potential diffusion model, and the denoised hidden layer vector is decoded by the image decoder to obtain the final image. With the trained model, the audio-control-based virtual person generation method can be realized in this inference stage through the relevant steps S100 to S500 and the specific embodiments thereof.
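As an illustration of this inference stage, the sketch below runs a plain DDPM-style reverse loop from Gaussian noise and decodes the result; the noise predictor, the decoder and the ancestral sampler are stand-ins and assumptions, not the concrete modules of the invention.

```python
# Inference: start from Gaussian noise, iteratively denoise in latent space, then decode.
# eps_model and decoder are stand-ins; the DDPM ancestral sampler here is an assumption.
import torch
import torch.nn as nn

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

eps_model = nn.Conv2d(4, 4, 3, padding=1)        # stand-in for the conditioned u-net
decoder = nn.ConvTranspose2d(4, 3, 8, stride=8)  # stand-in for the image decoder

@torch.no_grad()
def sample(shape=(1, 4, 32, 32)):
    z = torch.randn(shape)                                    # noise sampled from a Gaussian
    for t in reversed(range(T)):
        eps = eps_model(z)                                    # predicted noise at step t
        mean = (z - betas[t] / (1 - alpha_bars[t]).sqrt() * eps) / alphas[t].sqrt()
        z = mean if t == 0 else mean + betas[t].sqrt() * torch.randn_like(z)
    return decoder(z)                                         # final virtual person image

image = sample()   # (1, 3, 256, 256)
```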
In some embodiments, the method fully models the multi-modal characteristics in the generation of the half-body virtual human, and optimizes the generation process of the half-body virtual human by means of emotion information guidance, so that the high-quality half-body virtual human can be generated efficiently for given audio and images or videos.
With the development of computer vision technology, virtual person generation technology is receiving more and more attention and has huge application potential, and studying how to generate vivid and lifelike virtual figures is of profound significance. However, among virtual person generation methods based on deep learning and diffusion models, early studies focused mainly on synthesizing the mouth shape and facial expression of the virtual person, and ignored the important influence of body motion and emotion information on virtual person generation. In recent years, some studies have attempted to incorporate emotion factors into virtual person generation for facial expression generation. However, such methods only consider discrete emotion categories in the data and make few attempts to generate parts other than the face, so that the generated virtual person does not have a flexible and smooth speaking style and lacks realism and liveliness.
In order to realize comprehensive and vivid virtual person generation, the invention provides a half-body virtual person generation model based on audio control, which synthesizes a half-body virtual person video from a source image or video according to audio information based on a potential diffusion model (LDM), fuses multi-modal features into the diffusion model by combining a designed multi-modal fusion strategy, and introduces semantic modeling guided by emotion information into the diffusion model to help improve the effect of half-body virtual person generation.
In addition, the invention also provides a virtual person generation model based on audio control, and the structural schematic diagram of the model is shown in fig. 4. In the figure, arrows represent specific data flows.
The half-body virtual person generation model based on audio control mainly comprises three parts: a potential diffusion model framework, a multi-modal feature fusion strategy, and emotion-information-guided semantic modeling. The model first uses a pre-trained image feature encoder E_I to extract image features from the pose image I_m and the texture image I_r, and fuses the texture feature z_r and the gesture feature z_m with the noise image predicted at each stage in each u-net of the potential diffusion model. The model then employs an audio encoder E_A to extract the audio feature a. The model spatially expands the audio features and adds a mouth mask to them; the masked audio is merged into the denoising submodule of the diffusion model, and an attention mechanism is applied with the u-net features so as to accurately generate the lip movements of the virtual person. In addition, in order to compensate for the deficiency of the audio information in face detail generation, the model introduces emotion-information-guided semantic modeling and incorporates a pre-extracted emotion label z_e in the denoising process to assist the face detail synthesis of the virtual person, so that the model can achieve higher-quality half-body virtual person generation.
Next, technical details of each component part of the half-body virtual human generation model based on audio control proposed in this patent are specifically explained.
A. Potential diffusion model framework
The Latent Diffusion Model (LDM) is shown as the latent diffusion model submodule in fig. 4. The potential diffusion model is designed to denoise image feature noise with specific parameters. Given an image I, the potential diffusion model first encodes the image into a feature space to obtain an image feature z_0; the forward diffusion process continually adds Gaussian noise to z_0, causing it to eventually tend towards a Gaussian distribution. The second stage, inverse denoising, starts from Gaussian noise samples and aims to continuously denoise the noise samples and restore them to the distribution of the original image.
Given data z_0 sampled from the distribution q of the image feature space, z_0 ∼ q(z_0), the purpose of forward noise diffusion on the image features is to generate a series of latent features z_1, z_2, ..., z_T by adding, at each time step t, noise with variance β_t ∈ (0, 1). The forward process distribution can be formulated as:
q(z_t | z_{t−1}) = N(z_t; √(1 − β_t)·z_{t−1}, β_t·I)
wherein z_t represents the feature after the t-th noise addition, I is the identity matrix, β_t is the variance of the noise added at time t, and N is the Gaussian distribution. When the time step T is large enough, z_T will be close to an isotropic Gaussian distribution. At an arbitrary step t, the latent feature z_t can be generated in the forward diffusion process directly from z_0 without intermediate steps. The formula is described as follows:
z_t = √(ᾱ_t)·z_0 + √(1 − ᾱ_t)·ε,  where α_t = 1 − β_t and ᾱ_t = ∏_{s=1}^{t} α_s
where ε is noise sampled from the normal distribution. To obtain new samples from the distribution q(z_0), the forward diffusion process is reversed: starting from noise z_T sampled from a Gaussian distribution, the reverse diffusion process generates a sequence based on the posterior q(z_{t−1} | z_t). Since q(z_{t−1} | z_t) depends on the unknown data distribution q(z_0), and the posterior q(z_{t−1} | z_t) is also a Gaussian distribution, in order to approximate q, the model takes the given z_t as input and uses a neural network to learn the mean and covariance of z_{t−1} from the input conditions c (gesture features, texture features, audio features, emotion features). The distribution of z_{t−1} can be described by the formula:
q(z_{t−1} | z_t) = N(μ_θ(z_t, t, c), Σ_θ(z_t, t, c))
where μ_θ represents the mean of the Gaussian distribution and Σ_θ represents its variance. In the proposed method, the predicted noise ε_θ(z_t, t, c) is used to parameterize the approximation of the distribution q(z_0); according to Bayes' theorem, the mean is calculated as follows:
μ_θ(z_t, t, c) = (1/√(α_t))·(z_t − (β_t/√(1 − ᾱ_t))·ε_θ(z_t, t, c))
B. Multimodal feature fusion strategy
In order to realize high-quality virtual person video generation, a multi-modal feature fusion strategy is designed and adopted. As shown in the multimodal feature fusion submodule of fig. 4, a mask image I_m for gesture control and a reference image I_r for preserving identity information are given. The model adopts a pre-trained image encoder E_I to extract the corresponding feature information: the texture feature z_r and the gesture feature z_m. These features are then input into each u-net of the potential diffusion model to control the facial pose and preserve identity information.
For audio-controlled half-body virtual person generation, the model first employs an audio encoder E_A to extract audio features. Since the audio features typically only control the region around the mouth, the model adds a lip mask to the audio features and inputs the masked audio features as key-value pairs into the cross-attention module of the potential diffusion model for multi-modal learning. Under the action of the lip mask, the audio features have a significant effect on the lip motion generated for the half-body virtual person.
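To make the key-value role of the masked audio concrete, the following is a from-scratch scaled dot-product cross-attention in which the u-net features act as queries and the masked audio features act as keys and values; the dimensions are illustrative assumptions.

```python
# Cross-attention for multi-modal learning: u-net features as queries,
# masked audio features as keys and values. Dimensions are illustrative assumptions.
import torch
import torch.nn as nn

class AudioCrossAttention(nn.Module):
    def __init__(self, dim=64, audio_dim=64):
        super().__init__()
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(audio_dim, dim)
        self.to_v = nn.Linear(audio_dim, dim)

    def forward(self, unet_feat, masked_audio):
        # unet_feat: (B, N, dim) flattened spatial features; masked_audio: (B, M, audio_dim)
        q, k, v = self.to_q(unet_feat), self.to_k(masked_audio), self.to_v(masked_audio)
        attn = torch.softmax(q @ k.transpose(1, 2) / q.shape[-1] ** 0.5, dim=-1)  # (B, N, M)
        return unet_feat + attn @ v          # audio-conditioned (lip-focused) features

xattn = AudioCrossAttention()
out = xattn(torch.randn(1, 1024, 64), torch.randn(1, 16, 64))
```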
C. Emotion information guided semantic modeling
The half-body virtual person generation model based on audio control incorporates multi-modal features into the potential diffusion model. The audio features mainly control the motion of the lips, while the gesture features mainly govern the posture changes of the body and face. Although the model fuses the texture features of the virtual person into the u-nets of the potential diffusion model for face synthesis, the facial appearance changes are driven by the audio input while the texture features are randomly selected from frames of a virtual person with the same identity in the dataset, so the model lacks the information necessary for predicting facial details. To solve this problem, the model incorporates an emotion information guide module, which introduces emotion-information-guided semantic modeling into the diffusion process to generate high-quality facial textures for the half-body virtual person. Specifically, as shown in the emotion information guide module of fig. 4, the model uses emotion features to assist the multi-modal features in half-body virtual person generation and to guide the semantic denoising process of the potential diffusion model. By adopting the emotion features, the model can adaptively recover the facial texture of the virtual person.
Because the dataset lacks emotion labels, and collecting label information requires a great deal of labor cost, the model adopts a pre-trained model to extract the emotion information of each frame of virtual person image in the dataset and fuses the emotion labels into the potential diffusion model; meanwhile, a facial expression classification loss is added during training to ensure consistency between the facial expression of the virtual person and the provided emotion label. Specifically, the model first crops out the face portion T_fa of the half-body virtual person and calculates a loss value using a cross-entropy function, which can be formulated as:
L_cls = CE(F_fa(T_fa), z_e)
wherein F_fa represents a pre-trained model for emotion classification, z_e represents the emotion label, and CE represents the cross-entropy loss function.
The model uses two classes of loss functions in the training process. Firstly, in order to generate high-quality half-body virtual persons under multiple control factors, the mean square error L_ldm is adopted to calculate the difference between the real noise and the predicted noise; this loss is used to train the denoising process in the potential diffusion model. Secondly, in order to provide additional auxiliary information in the face generation process, the model introduces the emotion classification loss L_cls; this loss is used to optimize the emotion-information-guided semantic modeling employed in virtual person generation.
The model M uses the audio feature a, the gesture feature z_m, the texture feature z_r and the emotion label z_e as input, and calculates the L2 loss between the Gaussian-distributed real noise and the predicted noise. The calculation formula is as follows:
L_ldm = E_{z_0, ε∼N(0,1), t} [ ‖ε − ε_θ(z_t, t, a, z_m, z_r, z_e)‖₂² ]
where ε is noise sampled from the normal distribution and z_t is the noised latent feature at step t.
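A minimal sketch of this noise reconstruction objective, assuming a generic conditioned noise predictor ε_θ (the conditioning inputs are omitted in the stand-in):

```python
# Noise reconstruction loss: L2 between the true Gaussian noise and the predicted noise.
# eps_theta is a stand-in for the conditioned u-net of the potential diffusion model.
import torch
import torch.nn as nn

eps_theta = nn.Conv2d(4, 4, 3, padding=1)   # stand-in noise predictor epsilon_theta(z_t, t, c)

def ldm_loss(z0, alpha_bar_t):
    eps = torch.randn_like(z0)                                   # real Gaussian noise
    z_t = alpha_bar_t.sqrt() * z0 + (1 - alpha_bar_t).sqrt() * eps
    return ((eps - eps_theta(z_t)) ** 2).mean()                  # ||eps - eps_theta||_2^2

loss = ldm_loss(torch.randn(2, 4, 32, 32), torch.tensor(0.5))
```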
However, L_ldm does not constrain well the consistency between the facial emotion and the given emotion label. To alleviate this problem, a pixel-level cross-entropy loss computation is employed in the model. Specifically, the latent feature z_0 in the potential diffusion model is first decoded into the corresponding image, where the feature z_0 is obtained by sampling from z_t. Since the sampling step is not differentiable, the model approximates z_0 as follows:
ẑ_0 = (z_t − √(1 − ᾱ_t)·ε_θ(z_t, t, c)) / √(ᾱ_t)
The model then decodes the predicted latent feature ẑ_0 into the target image Î, crops the target image with a predicted bounding box to obtain the face representation T̂_fa, and applies a cross-entropy loss between the face representation and the emotion label. The calculation formula is as follows:
L_cls = CE(F_fa(T̂_fa), z_e)
finally, the denoising process in the half-body virtual human generation model based on audio control is optimized by means of the total loss function, and high-quality and high-efficiency half-body virtual human generation is realized through model training. The aggregate loss function is formulated as:
L = L_ldm + λ_cls·L_cls
wherein λ_cls denotes the coefficient of the emotion classification loss; in the present invention, λ_cls = 1.
In summary, the invention provides a new virtual person generating method based on audio control, which utilizes a potential diffusion model to synthesize a half-body virtual person video from a source image or video according to audio information, successfully fuses multi-modal features (namely human body textures, gestures and mask audio features) into the diffusion model through a designed multi-modal fusion strategy, and enhances modeling capability of feature information; and semantic modeling guided by emotion information is introduced in the reverse denoising process, so that the quality of half-body virtual person generation is successfully improved, and the limitation that the existing virtual person generation method is limited to facial modeling is effectively solved. Compared with the prior art, the invention at least has the following beneficial effects:
1. The invention proposes, for the first time, a half-body virtual person synthesis method based on audio control, and utilizes a potential diffusion model to synthesize a half-body virtual person video from a source image or video according to audio information, effectively overcoming the limitation of existing virtual person generation methods to the head region, greatly enhancing the realism and fluency of the half-body virtual person generated by the model, and improving the interaction capability of the virtual person with the real world.
2. The invention designs and adopts a novel multi-modal feature fusion strategy to fuse multi-modal feature information such as textures, gestures and the like of the virtual person into the diffusion model, thereby remarkably enhancing the modeling capability of the feature information, overcoming the problem of weaker feature information modeling in the existing virtual person generating method and remarkably improving the quality and efficiency of virtual person generation.
3. The invention provides a semantic modeling method under emotion information guidance, and introduces the semantic modeling method into a model, so that a generated virtual person can have a more flexible and smooth speaking style, the problem of insufficient characteristic information of the existing virtual person generation method is solved, and the generation mobility of the virtual person is enhanced.
4. The method provided by the invention is not limited to the field of virtual person generation; its core ideas, audio-control-based multi-modal feature fusion and emotion-information-guided semantic modeling, are general. Facing different image and video synthesis tasks and different datasets, the model can adapt to different modal information and task scenarios with only simple modifications. Therefore, the method has broad application prospects.
On the other hand, as shown in fig. 7, an embodiment of the present invention provides a virtual person generating apparatus 600 based on audio control, including: a first module 610, configured to acquire an input image and perform image coding on the input image to obtain a hidden layer vector; a second module 620, configured to add noise to the hidden layer vector based on Gaussian distribution noise to obtain a noise added image; a third module 630, configured to acquire a visual image and perform feature extraction on the visual image to obtain visual features, and to acquire input audio and perform feature extraction and mask processing on the input audio to obtain mask audio features; a fourth module 640, configured to perform emotion recognition on the input image or the input audio to obtain emotion features; and a fifth module 650, configured to perform inverse denoising and image decoding on the noisy image based on the visual features, the mask audio features and the emotion features, so as to obtain a target image of the virtual person.
The content of the method embodiment of the invention is suitable for the device embodiment, the specific function of the device embodiment is the same as that of the method embodiment, and the achieved beneficial effects are the same as those of the method.
As shown in fig. 8, another aspect of an embodiment of the present invention further provides an electronic device 700, including a processor 710 and a memory 720;
The memory 720 is used for storing programs;
processor 710 executes a program to implement the method as before.
The content of the method embodiment of the invention is suitable for the electronic equipment embodiment, the functions of the electronic equipment embodiment are the same as those of the method embodiment, and the achieved beneficial effects are the same as those of the method.
Another aspect of the embodiments of the present invention also provides a computer-readable storage medium storing a program that is executed by a processor to implement a method as before.
The content of the method embodiment of the invention is applicable to the computer readable storage medium embodiment, the functions of the computer readable storage medium embodiment are the same as those of the method embodiment, and the achieved beneficial effects are the same as those of the method.
Embodiments of the present invention also disclose a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The computer instructions may be read from a computer-readable storage medium by a processor of a computer device, and executed by the processor, to cause the computer device to perform the foregoing method.
In some alternative embodiments, the functions/acts noted in the block diagrams may occur out of the order noted in the operational illustrations. For example, two blocks shown in succession may in fact be executed substantially concurrently or the blocks may sometimes be executed in the reverse order, depending upon the functionality/acts involved. Furthermore, the embodiments presented and described in the flowcharts of the present invention are provided by way of example in order to provide a more thorough understanding of the technology. The disclosed methods are not limited to the operations and logic flows presented herein. Alternative embodiments are contemplated in which the order of various operations is changed, and in which sub-operations described as part of a larger operation are performed independently.
Furthermore, while the invention is described in the context of functional modules, it should be appreciated that, unless otherwise indicated, one or more of the functions and/or features may be integrated in a single physical device and/or software module or may be implemented in separate physical devices or software modules. It will also be appreciated that a detailed discussion of the actual implementation of each module is not necessary to an understanding of the present invention. Rather, the actual implementation of the various functional modules in the apparatus disclosed herein will be apparent to those skilled in the art from consideration of their attributes, functions and internal relationships. Accordingly, one of ordinary skill in the art can implement the invention as set forth in the claims without undue experimentation. It is also to be understood that the specific concepts disclosed are merely illustrative and are not intended to be limiting upon the scope of the invention, which is to be defined in the appended claims and their full scope of equivalents.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art or in a part of the technical solution in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method of the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a random access Memory (RAM, random Access Memory), a magnetic disk, or an optical disk, or other various media capable of storing program codes.
Logic and/or steps represented in the flowcharts or otherwise described herein, e.g., a ordered listing of executable instructions for implementing logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution apparatus, device, or apparatus, such as a computer-based apparatus, processor-containing apparatus, or other apparatus that can fetch the instructions from the instruction execution apparatus, device, or apparatus and execute the instructions. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution apparatus, device, or apparatus.
More specific examples (a non-exhaustive list) of the computer-readable medium include the following: an electrical connection (an electronic device) having one or more wires, a portable computer diskette (a magnetic device), a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CD-ROM). Additionally, the computer-readable medium may even be paper or another suitable medium upon which the program is printed, as the program can be electronically captured, for instance via optical scanning of the paper or other medium, then compiled, interpreted, or otherwise processed in a suitable manner if necessary, and then stored in a computer memory.
It is to be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in a memory and executed by a suitable instruction execution system. If implemented in hardware, as in another embodiment, the steps may be implemented using any one, or a combination, of the following techniques well known in the art: discrete logic circuits having logic gates for implementing logic functions on data signals, application-specific integrated circuits having suitable combinational logic gates, programmable gate arrays (PGAs), field-programmable gate arrays (FPGAs), and the like.
In the description of the present specification, a description referring to terms "one embodiment," "some embodiments," "examples," "specific examples," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic representations of the above terms do not necessarily refer to the same embodiments or examples. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
While embodiments of the present invention have been shown and described, it will be understood by those of ordinary skill in the art that: many changes, modifications, substitutions and variations may be made to the embodiments without departing from the spirit and principles of the invention, the scope of which is defined by the claims and their equivalents.
While the preferred embodiments of the present invention have been described in detail, the present invention is not limited to these embodiments, and those skilled in the art can make various equivalent modifications or substitutions without departing from the spirit of the present invention; such equivalent modifications or substitutions are intended to be included in the scope of the present invention as defined in the appended claims.

Claims (10)

1. A method of generating a virtual person based on audio control, comprising:
acquiring an input image, and performing image encoding on the input image to obtain a hidden layer vector;
adding noise drawn from a Gaussian distribution to the hidden layer vector to obtain a noise-added image;
acquiring a visual image, and performing feature extraction on the visual image to obtain visual features; acquiring input audio, and performing feature extraction and mask processing on the input audio to obtain masked audio features;
performing emotion recognition on the input image or the input audio to obtain emotion features;
and performing reverse denoising and image decoding on the noise-added image based on the visual features, the masked audio features and the emotion features to obtain a target image of the virtual person.
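
Read as an engineering outline rather than a legal definition, claim 1 chains five stages: encode, add noise, condition, and denoise/decode. A minimal Python sketch of that chain is given below; every stage is injected as a callable, so all names are illustrative placeholders rather than the patent's concrete modules.

```python
import torch


def generate_virtual_person(input_image, visual_image, input_audio,
                            encode, add_noise, extract_visual, extract_audio,
                            recognize_emotion, denoise, decode):
    """Illustrative composition of the claimed pipeline; every callable is injected."""
    z = encode(input_image)                    # input image -> hidden layer vector
    z_noisy, _ = add_noise(z)                  # forward diffusion on the latent
    visual = extract_visual(visual_image)      # gesture + texture features
    audio = extract_audio(input_audio)         # masked audio features
    emotion = recognize_emotion(input_image)   # emotion features (image or audio source)
    z_clean = denoise(z_noisy, visual, audio, emotion)   # reverse denoising
    return decode(z_clean)                     # target image of the virtual person
```
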
2. The audio control-based virtual person generation method according to claim 1, wherein performing image encoding on the input image to obtain a hidden layer vector comprises:
encoding the input image into a hidden layer space by using a pre-trained first image encoder to obtain the hidden layer vector.
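
The claims do not identify the pre-trained first image encoder. As one plausible stand-in, a publicly available KL-VAE (via the diffusers library) maps an image into a latent, i.e. hidden layer, space; the checkpoint name and the scaling constant below are assumptions for illustration only.

```python
import torch
from diffusers import AutoencoderKL

# Assumed stand-in encoder; not necessarily the encoder used by the patent.
vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse")
vae.eval()


@torch.no_grad()
def encode_to_latent(image: torch.Tensor) -> torch.Tensor:
    """image: (B, 3, H, W), values scaled to [-1, 1]; returns the hidden layer vector."""
    posterior = vae.encode(image).latent_dist
    return posterior.sample() * 0.18215  # common SD latent scaling; an assumption here
```
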
3. The audio control-based virtual person generation method according to claim 2, wherein adding noise drawn from a Gaussian distribution to the hidden layer vector to obtain a noise-added image comprises:
performing forward diffusion processing on the hidden layer vector in the hidden layer space, and adding noise sampled from a preset Gaussian distribution to the hidden layer vector to obtain the noise-added image.
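
Claim 3 corresponds to the standard forward-diffusion step used by latent diffusion models; a minimal sketch follows, assuming a DDPM-style cumulative noise schedule (the schedule itself is not specified by the claims).

```python
import torch

# Assumed linear beta schedule; the patent does not fix the schedule.
betas = torch.linspace(1e-4, 0.02, 1000)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)


def add_noise(z0: torch.Tensor, t: torch.Tensor):
    """Forward diffusion q(z_t | z_0): z0 is the latent (B, C, H, W), t is (B,) step indices."""
    eps = torch.randn_like(z0)                       # noise drawn from a Gaussian distribution
    a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)      # cumulative schedule at step t
    zt = a_bar.sqrt() * z0 + (1.0 - a_bar).sqrt() * eps
    return zt, eps                                   # noise-added latent and the noise used
```
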
4. The audio control-based virtual person generation method according to claim 1, wherein the visual image includes a gesture image and a texture image, and the visual features include gesture features and texture features; performing feature extraction on the visual image to obtain visual features comprises:
performing feature extraction on the gesture image by using a pre-trained second image encoder to obtain the gesture features;
performing feature extraction on the texture image by using the pre-trained second image encoder to obtain the texture features;
wherein the gesture features are used for generating the body posture of the virtual person, and the texture features are used for generating the face texture of the virtual person.
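
The second image encoder shared by the gesture and texture branches is likewise left open by the claims; the small convolutional stand-in below is purely illustrative of one encoder being reused for both images.

```python
import torch
import torch.nn as nn


class SecondImageEncoder(nn.Module):
    """Illustrative shared encoder for the gesture (pose) image and the texture image."""
    def __init__(self, out_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(32, out_dim, kernel_size=3, stride=2, padding=1), nn.SiLU(),
        )

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        return self.net(image)  # spatial feature map used as gesture or texture features


encoder = SecondImageEncoder()
# gesture_features = encoder(gesture_image)  -> drives the body posture
# texture_features = encoder(texture_image)  -> drives the face texture
```
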
5. The audio control-based virtual person generation method according to claim 1, wherein performing feature extraction and mask processing on the input audio to obtain masked audio features comprises:
performing feature extraction on the input audio by using a pre-trained audio feature extractor to obtain audio features;
adding a lip mask to the audio features to obtain the masked audio features;
wherein the masked audio features are used for generating the mouth shape of the virtual person.
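
One way to read the lip mask of claim 5 is as a gate that keeps only the frame-level audio features that should drive the mouth region; both the extractor architecture and the gating scheme below are assumptions, not the patent's concrete modules.

```python
import torch
import torch.nn as nn


class MaskedAudioFeatures(nn.Module):
    """Illustrative frame-level audio feature extractor with a lip-region gate."""
    def __init__(self, feat_dim: int = 256):
        super().__init__()
        # stand-in for the pre-trained audio feature extractor named in the claim
        self.extractor = nn.Sequential(
            nn.Conv1d(1, feat_dim, kernel_size=10, stride=5), nn.GELU(),
            nn.Conv1d(feat_dim, feat_dim, kernel_size=3, stride=2),
        )

    def forward(self, waveform: torch.Tensor, lip_mask: torch.Tensor) -> torch.Tensor:
        # waveform: (B, 1, T_samples); lip_mask: (B, T_frames), 1 where audio drives the mouth
        feats = self.extractor(waveform).transpose(1, 2)      # (B, T_frames, feat_dim)
        gate = lip_mask[:, : feats.size(1)].unsqueeze(-1)     # align mask length with frames
        return feats * gate                                   # masked audio features
```
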
6. The audio control-based virtual person generation method according to claim 1, wherein performing emotion recognition on the input image or the input audio to obtain emotion features comprises:
performing emotion recognition on the input image or the input audio by using a pre-trained expression recognition model to obtain the emotion features;
wherein the expression recognition model is trained with an emotion classification loss function by using input images annotated with emotion labels, together with video features and audio features obtained from the input images.
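
The training signal in claim 6 is an emotion classification loss; the head below is a minimal sketch in which the backbone, feature dimensions, and seven-class emotion set are all assumed for illustration.

```python
import torch
import torch.nn as nn


class EmotionRecognizer(nn.Module):
    """Illustrative expression-recognition head trained with a cross-entropy emotion loss."""
    def __init__(self, in_dim: int = 512, num_emotions: int = 7):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(), nn.Linear(256, 128))
        self.classifier = nn.Linear(128, num_emotions)

    def forward(self, fused_features: torch.Tensor):
        h = self.backbone(fused_features)   # fused image/video/audio features
        return h, self.classifier(h)        # h is reused downstream as the emotion feature


criterion = nn.CrossEntropyLoss()           # the emotion classification loss in the claim
# emotion_feature, logits = EmotionRecognizer()(fused_features)
# loss = criterion(logits, emotion_labels)
```
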
7. The audio control-based virtual person generation method according to claim 1, wherein performing reverse denoising and image decoding on the noise-added image based on the visual features, the masked audio features and the emotion features to obtain a target image of the virtual person comprises:
inputting the visual features, the masked audio features and the emotion features into a pre-trained reverse denoising module, and performing reverse denoising on the noise-added image;
wherein the reverse denoising module comprises a plurality of U-Nets, each U-Net being used for predicting the denoising vector of the noise-added image at a different stage to obtain noise images predicted at the different stages; the reverse denoising module is trained with a noise reconstruction loss function;
the reverse denoising comprises the following steps:
concatenating the visual features with the noise images at the different stages, and generating the body posture and the face texture of the virtual person based on the concatenated features, wherein the input of the U-Net at each stage is the concatenation of the noise image of the previous stage and the visual features;
inputting the masked audio features into each U-Net, performing attention processing with the concatenated features to generate the mouth shape of the virtual person, and obtaining multi-modal features;
inputting the emotion features into each U-Net, and performing feature fusion with the multi-modal features of the U-Net to generate the face of the virtual person, so as to obtain the target image of the virtual person.
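
Claim 7 combines three conditioning paths inside each denoising stage: concatenation with the visual features, cross-attention against the masked audio features, and additive fusion of the emotion feature. The block below is a schematic single stage under assumed channel and feature dimensions, not the patent's trained module.

```python
import torch
import torch.nn as nn


class DenoisingStage(nn.Module):
    """Illustrative single reverse-denoising stage with the three conditioning paths."""
    def __init__(self, ch: int = 64, audio_dim: int = 256, emo_dim: int = 128):
        super().__init__()
        self.fuse_visual = nn.Conv2d(ch * 2, ch, kernel_size=3, padding=1)
        self.audio_proj = nn.Linear(audio_dim, ch)
        self.emo_proj = nn.Linear(emo_dim, ch)
        self.attn = nn.MultiheadAttention(embed_dim=ch, num_heads=4, batch_first=True)

    def forward(self, noise_image, visual, audio, emotion):
        # noise_image, visual: (B, ch, H, W); audio: (B, T, audio_dim); emotion: (B, emo_dim)
        x = self.fuse_visual(torch.cat([noise_image, visual], dim=1))  # concatenated features
        b, c, h, w = x.shape
        q = x.flatten(2).transpose(1, 2)                 # (B, H*W, ch) queries from the image
        kv = self.audio_proj(audio)                      # keys/values from masked audio features
        attn_out, _ = self.attn(q, kv, kv)               # audio attends to the mouth region
        multi = q + attn_out                             # multi-modal features
        multi = multi + self.emo_proj(emotion).unsqueeze(1)   # fuse the emotion feature
        return multi.transpose(1, 2).reshape(b, c, h, w)      # predicted denoising update
```
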
8. A virtual person generating apparatus based on audio control, comprising:
a first module, configured to acquire an input image and perform image encoding on the input image to obtain a hidden layer vector;
a second module, configured to add noise drawn from a Gaussian distribution to the hidden layer vector to obtain a noise-added image;
a third module, configured to acquire a visual image and perform feature extraction on the visual image to obtain visual features, and to acquire input audio and perform feature extraction and mask processing on the input audio to obtain masked audio features;
a fourth module, configured to perform emotion recognition on the input image or the input audio to obtain emotion features;
and a fifth module, configured to perform reverse denoising and image decoding on the noise-added image based on the visual features, the masked audio features and the emotion features to obtain a target image of the virtual person.
9. An electronic device comprising a processor and a memory;
wherein the memory is configured to store a program;
and the processor executes the program to implement the method of any one of claims 1 to 7.
10. A computer-readable storage medium, characterized in that the storage medium stores a program that is executed by a processor to implement the method of any one of claims 1 to 7.
CN202311032590.1A 2023-08-15 2023-08-15 Virtual person generating method, device, equipment and medium based on audio control Pending CN117152285A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311032590.1A CN117152285A (en) 2023-08-15 2023-08-15 Virtual person generating method, device, equipment and medium based on audio control

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311032590.1A CN117152285A (en) 2023-08-15 2023-08-15 Virtual person generating method, device, equipment and medium based on audio control

Publications (1)

Publication Number Publication Date
CN117152285A true CN117152285A (en) 2023-12-01

Family

ID=88903623

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311032590.1A Pending CN117152285A (en) 2023-08-15 2023-08-15 Virtual person generating method, device, equipment and medium based on audio control

Country Status (1)

Country Link
CN (1) CN117152285A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118116408A (en) * 2024-04-29 2024-05-31 荣耀终端有限公司 Audio identification method, medium, electronic device and program product

Similar Documents

Publication Publication Date Title
Das et al. Speech-driven facial animation using cascaded gans for learning of motion and texture
CN113194348B (en) Virtual human lecture video generation method, system, device and storage medium
Fan et al. A deep bidirectional LSTM approach for video-realistic talking head
Ezzat et al. Trainable videorealistic speech animation
Brand Voice puppetry
CN110853670A (en) Music-driven dance generating method
Du et al. Stylistic locomotion modeling and synthesis using variational generative models
Li et al. Learning dynamic audio-visual mapping with input-output hidden Markov models
CN117152285A (en) Virtual person generating method, device, equipment and medium based on audio control
CN115187704A (en) Virtual anchor generation method, device, equipment and storage medium
Deng et al. Audio-based head motion synthesis for avatar-based telepresence systems
CN114170353B (en) Multi-condition control dance generation method and system based on neural network
Filntisis et al. Video-realistic expressive audio-visual speech synthesis for the Greek language
Ju et al. Expressive facial gestures from motion capture data
Websdale et al. Speaker-independent speech animation using perceptual loss functions and synthetic data
CN113779224A (en) Personalized dialogue generation method and system based on user dialogue history
US20240013464A1 (en) Multimodal disentanglement for generating virtual human avatars
Nakatsuka et al. Audio-oriented video interpolation using key pose
Deng et al. Synthesizing speech animation by learning compact speech co-articulation models
CN116310003A (en) Semantic-driven martial arts action synthesis method
Wang et al. InstructAvatar: Text-Guided Emotion and Motion Control for Avatar Generation
Gowda et al. From pixels to portraits: A comprehensive survey of talking head generation techniques and applications
Filntisis et al. Photorealistic adaptation and interpolation of facial expressions using HMMS and AAMS for audio-visual speech synthesis
He Exploring style transfer algorithms in Animation: Enhancing visual
Deena et al. Speech-driven facial animation using a shared Gaussian process latent variable model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination