CN115908662A - Method, device and equipment for training and using generation model of speaker video - Google Patents

Method, device and equipment for training and using generation model of speaker video

Info

Publication number
CN115908662A
Authority
CN
China
Prior art keywords
head
rendering
module
torso
trunk
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211631657.9A
Other languages
Chinese (zh)
Inventor
严妍
汪敏
杨春宇
白杨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Kaipuyun Information Technology Co ltd
Cape Cloud Information Technology Co ltd
Original Assignee
Beijing Kaipuyun Information Technology Co ltd
Cape Cloud Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Kaipuyun Information Technology Co ltd, Cape Cloud Information Technology Co ltd filed Critical Beijing Kaipuyun Information Technology Co ltd
Priority to CN202211631657.9A
Publication of CN115908662A


Abstract

The application discloses a method, a device and equipment for training and using a generative model of a speaker video, and belongs to the technical field of machine learning. The method comprises the following steps: improving the AD-NeRF model to obtain a generative model; processing video data and audio data in the training samples by using the generative model to obtain a head semantic code and a torso semantic code; rendering the head semantic code based on a first Transformer module in a head neural radiance field, and calculating a head loss on the head rendering result and the real head image by using a first discriminator in the head neural radiance field; rendering the torso semantic code based on a second Transformer module in a torso neural radiance field, and calculating a torso loss on the torso rendering result and the real torso image by using a second discriminator in the torso neural radiance field; and training the generative model using the head loss and the torso loss. The method and device improve the characterization and image generation capability and alleviate the torso blurring problem.

Description

Method, device and equipment for training and using generation model of speaker video
Technical Field
The application relates to the technical field of machine learning, in particular to a method, a device and equipment for training and using a speaker video generation model.
Background
Synthesizing high-fidelity, audio-driven facial video sequences is an important and challenging problem for many applications, such as digital humans, chat robots, and virtual video conferencing. Treating speaker video generation as a cross-modal mapping from audio to a visual face, the synthesized face image is expected to look as realistic as a frame of the original video while exhibiting a natural speaking style.
In recent years, the AD-NeRF (Audio-Driven Neural Radiance Fields) model has been proposed for generating speaker video. Building on NeRF (Neural Radiance Fields), AD-NeRF generates a speaker video directly from a speech signal: features of the audio signal are fed into a conditional implicit function to produce a dynamic neural radiance field, and the speaker video corresponding to the audio signal is synthesized through volume rendering. AD-NeRF can synthesize not only the head (including hair) region but also the torso, using two separate neural radiance fields; however, in the synthesized speaker video the speaker's mouth looks unnatural and the torso appears blurred.
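For orientation, the following is a minimal sketch of how an audio-conditioned implicit function and volume rendering fit together in an AD-NeRF-style pipeline. The network sizes, the `audio_feat` conditioning, and all names are illustrative assumptions of this sketch, not the architecture of AD-NeRF or of the present application.

```python
import torch
import torch.nn as nn

class AudioConditionedNeRF(nn.Module):
    """Toy conditional implicit function: (3D point, audio feature) -> (density, color)."""
    def __init__(self, audio_dim=64, hidden=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3 + audio_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),  # density (sigma) + RGB
        )

    def forward(self, points, audio_feat):
        # points: (N, S, 3) samples along N rays; audio_feat: (audio_dim,)
        a = audio_feat.expand(*points.shape[:-1], -1)
        out = self.mlp(torch.cat([points, a], dim=-1))
        return out[..., :1].relu(), out[..., 1:].sigmoid()  # sigma, rgb

def volume_render(sigma, rgb, deltas):
    """Standard NeRF compositing: alpha_i = 1 - exp(-sigma_i * delta_i)."""
    alpha = 1.0 - torch.exp(-sigma.squeeze(-1) * deltas)            # (N, S)
    ones = torch.ones_like(alpha[:, :1])
    trans = torch.cumprod(torch.cat([ones, 1.0 - alpha + 1e-10], dim=1), dim=1)[:, :-1]
    weights = alpha * trans                                          # contribution of each sample
    return (weights.unsqueeze(-1) * rgb).sum(dim=1)                  # (N, 3) pixel colors
```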
Disclosure of Invention
The application provides a method, a device and equipment for training and using a generative model of a speaker video, which are used for solving the problems that, in speaker video synthesized with the AD-NeRF model, the speaker's mouth looks unnatural and the torso appears blurred. The technical scheme is as follows:
In one aspect, a method for training a generative model of a speaker video is provided, the method comprising:
acquiring a plurality of groups of training samples, wherein each group of training samples comprises video data and audio data containing a speaker, and a real head image and a real torso image of the speaker;
improving an AD-NeRF model to obtain a generative model, wherein a head neural radiance field in the generative model comprises a first Transformer module and a first discriminator, and a torso neural radiance field in the generative model comprises a second Transformer module and a second discriminator;
for each group of training samples, processing the video data and the audio data by using the generative model to obtain a head semantic code and a torso semantic code;
rendering the head semantic code based on the first Transformer module to obtain a head rendering result, and calculating a head loss on the head rendering result and the real head image by using the first discriminator;
rendering the torso semantic code based on the second Transformer module to obtain a torso rendering result, and calculating a torso loss on the torso rendering result and the real torso image by using the second discriminator;
and training the generative model with the head loss and the torso loss.
In a possible implementation manner, the generative model further includes a video processing module, a wav2vec 2.0 module, and an implicit function, and the processing the video data and the audio data by using the generative model to obtain a head semantic code and a torso semantic code includes:
processing the video data by using the video processing module to obtain a video parsing map and pose parameters;
processing the audio data by using the wav2vec 2.0 module to obtain wav2vec 2.0 features;
and processing the video parsing map, the pose parameters and the wav2vec 2.0 features by using the implicit function to obtain the head semantic code and the torso semantic code.
In a possible implementation manner, the rendering the head semantic code based on the first Transformer module to obtain a head rendering result includes:
performing volume rendering on the head semantic code to obtain a head low-dimensional feature map;
rendering the head low-dimensional feature map by using a head-based two-dimensional neural rendering module to obtain a first intermediate result;
and processing the first intermediate result by utilizing the first Transformer module to obtain a head rendering result.
In a possible implementation manner, the rendering the torso semantic code based on the second Transformer module to obtain a torso rendering result includes:
performing volume rendering on the torso semantic code to obtain a torso low-dimensional feature map;
rendering the torso low-dimensional feature map by using a torso-based two-dimensional neural rendering module to obtain a second intermediate result;
and processing the second intermediate result by utilizing the second Transformer module to obtain a torso rendering result.
In one possible implementation, the first and second discriminators are GAN discriminators.
In one aspect, a method for using a generative model of a speaker video is provided, the method comprising:
acquiring video data and audio data containing a speaker;
processing the video data and the audio data by using a trained generative model to obtain a head semantic code and a torso semantic code, wherein a head neural radiance field in the generative model comprises a first Transformer module and a first discriminator, and a torso neural radiance field in the generative model comprises a second Transformer module and a second discriminator;
rendering the head semantic code based on the first Transformer module to obtain a head rendering result;
rendering the torso semantic code based on the second Transformer module to obtain a torso rendering result;
and rendering the head rendering result and the torso rendering result to obtain the speaker video.
In one aspect, an apparatus for training a generative model of a speaker video is provided, the apparatus comprising:
the acquisition module is used for acquiring a plurality of groups of training samples, wherein each group of training samples comprises video data and audio data containing a speaker, and a real head image and a real torso image of the speaker;
the generation module is used for improving the AD-NeRF model to obtain a generative model, wherein a head neural radiance field in the generative model comprises a first Transformer module and a first discriminator, and a torso neural radiance field in the generative model comprises a second Transformer module and a second discriminator;
the processing module is used for processing the video data and the audio data by using the generative model for each group of training samples to obtain a head semantic code and a torso semantic code;
the processing module is further configured to render the head semantic code based on the first Transformer module to obtain a head rendering result, and calculate a head loss for the head rendering result and the real head image by using the first discriminator;
the processing module is further configured to render the torso semantic code based on the second Transformer module to obtain a torso rendering result, and calculate a torso loss for the torso rendering result and the real torso image by using the second discriminator;
a training module to train the generative model using the head loss and the torso loss.
In one aspect, an apparatus for using a generative model of a speaker video is provided, the apparatus comprising:
the acquisition module is used for acquiring video data and audio data containing a speaker;
the processing module is used for processing the video data and the audio data by using a trained generative model to obtain a head semantic code and a torso semantic code, wherein a head neural radiance field in the generative model comprises a first Transformer module and a first discriminator, and a torso neural radiance field in the generative model comprises a second Transformer module and a second discriminator;
the rendering module is used for rendering the head semantic code based on the first Transformer module to obtain a head rendering result;
the rendering module is further configured to render the torso semantic code based on the second Transformer module to obtain a torso rendering result;
the rendering module is further used for rendering the head rendering result and the torso rendering result to obtain the speaker video.
In one aspect, a computer-readable storage medium is provided, in which at least one instruction is stored, and the at least one instruction is loaded and executed by a processor to implement the method for training a generative model of a speaker video as described above, or to implement the method for using a generative model of a speaker video as described above.
In one aspect, a computer device is provided, which includes a processor and a memory, wherein the memory stores at least one instruction, and the instruction is loaded and executed by the processor to implement the method for training a generative model of a speaker video as described above; alternatively, the instruction is loaded and executed by the processor to implement the method for using a generative model of a speaker video as described above.
The technical scheme provided by the application has the beneficial effects that:
A wav2vec 2.0 module is introduced on the basis of the AD-NeRF model, and features of the speech data are extracted through the wav2vec 2.0 module, yielding better speech features; the frameworks of the head neural radiance field and the torso neural radiance field are modified on the basis of the AD-NeRF model by adding Transformer modules, and the loss calculation strategy is optimized through discriminators, which further improves the characterization and image generation capability and effectively alleviates the torso blurring problem.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed in the description of the embodiments are briefly introduced below; the drawings described below are only some embodiments of the present application, and those skilled in the art can obtain other drawings based on them without creative effort.
FIG. 1 is a flowchart of a method for training a generative model of a speaker video according to an embodiment of the present application;
FIG. 2 is a flowchart of the training process of the head neural radiance field provided by an embodiment of the present application;
FIG. 3 is a flowchart of the training process of the torso neural radiance field provided by an embodiment of the present application;
FIG. 4 is a flowchart of a method for using a generative model of speaker video according to another embodiment of the present application;
FIG. 5 is a schematic diagram illustrating a process for using a generative model of a speaker video according to an embodiment of the present application;
FIG. 6 is a block diagram of an apparatus for training a generative model of a speaker video according to an embodiment of the present application;
FIG. 7 is a block diagram illustrating an apparatus for using a generative model of a speaker video according to an embodiment of the present application.
Detailed Description
To make the objects, technical solutions and advantages of the embodiments of the present application clearer, the embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
Referring to FIG. 1, a flowchart of a method for training a generative model of a speaker video according to an embodiment of the present application is shown; the method can be applied to a computer device and can comprise the following steps:
step 101, a plurality of sets of training samples are obtained, wherein each set of training samples comprises video data and audio data of a speaker, and a real head image and a real body image of the speaker.
The video data in this embodiment may be videos of a speaker speaking, where the speaker may be a real person or a virtual digital human. For example, the computer device may segment a training video into clips of a fixed duration to obtain the video data; the fixed duration can be set as required, for example, to a value within 3-5 minutes, as in the sketch below.
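As an illustration, such fixed-duration segmentation could be performed with ffmpeg's segment muxer; the file names and the 240-second clip length below are assumptions of this sketch, not values prescribed by the application.

```python
import subprocess

def segment_video(src="speaker.mp4", clip_seconds=240, pattern="clip_%03d.mp4"):
    """Split a training video into fixed-duration clips (240 s is within 3-5 minutes)."""
    subprocess.run([
        "ffmpeg", "-i", src,
        "-c", "copy",                        # stream copy: no re-encoding
        "-f", "segment",                     # ffmpeg's segment muxer
        "-segment_time", str(clip_seconds),
        "-reset_timestamps", "1",
        pattern,
    ], check=True)
```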
The audio data may be extracted from the video data of the same group of training samples, in which case the training samples are positive samples; alternatively, the audio data may be uncorrelated with the video data of the same group, in which case the training samples are negative samples.
Step 102, the AD-NeRF model is improved to obtain a generative model, wherein a head neural radiance field in the generative model comprises a first Transformer module and a first discriminator, and a torso neural radiance field in the generative model comprises a second Transformer module and a second discriminator.
In this embodiment, the modification of the AD-NeRF model comprises two parts: the first part introduces a wav2vec 2.0 module on the basis of the AD-NeRF model and extracts features of the speech data through the wav2vec 2.0 module; the second part modifies the frameworks of the head neural radiance field and the torso neural radiance field on the basis of the AD-NeRF model, adding Transformer modules and optimizing the loss calculation strategy through discriminators.
For ease of distinction, in this embodiment the Transformer module in the head neural radiance field is referred to as the first Transformer module, and the Transformer module in the torso neural radiance field is referred to as the second Transformer module; likewise, the discriminator in the head neural radiance field is referred to as the first discriminator, and the discriminator in the torso neural radiance field is referred to as the second discriminator. The first discriminator and the second discriminator are GAN (Generative Adversarial Network) discriminators.
Step 103, for each group of training samples, the video data and the audio data are processed by using the generative model to obtain a head semantic code and a torso semantic code.
Specifically, the generative model further includes a video processing module, a wav2vec 2.0 module and an implicit function, and processing the video data and the audio data by using the generative model to obtain the head semantic code and the torso semantic code may include: processing the video data by using the video processing module to obtain a video parsing map and pose parameters; processing the audio data by using the wav2vec 2.0 module to obtain wav2vec 2.0 features; and processing the video parsing map, the pose parameters and the wav2vec 2.0 features by using the implicit function to obtain the head semantic code and the torso semantic code. Extracting features of the speech data through the wav2vec 2.0 module yields better speech features.
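One concrete (assumed) way to obtain such speech features is the open-source wav2vec 2.0 model from the HuggingFace Transformers library; the checkpoint name and the 16 kHz mono input are assumptions of this sketch, since the application does not name a specific pretrained model.

```python
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

# Assumed checkpoint; any wav2vec 2.0 weights with a 16 kHz front end would do.
extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h").eval()

def wav2vec2_features(waveform, sr=16000):
    """waveform: 1-D float array of mono audio sampled at 16 kHz."""
    inputs = extractor(waveform, sampling_rate=sr, return_tensors="pt")
    with torch.no_grad():
        hidden = model(inputs.input_values).last_hidden_state  # (1, T, 768)
    return hidden.squeeze(0)  # frame-level speech features fed to the implicit function
```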
Step 104, the head semantic code is rendered based on the first Transformer module to obtain a head rendering result, and a head loss is calculated for the head rendering result and the real head image by using the first discriminator.
Specifically, rendering the head semantic code based on the first Transformer module to obtain the head rendering result may include: performing volume rendering on the head semantic code to obtain a head low-dimensional feature map; rendering the head low-dimensional feature map by using a head-based two-dimensional neural rendering module to obtain a first intermediate result; and processing the first intermediate result by using the first Transformer module to obtain the head rendering result.
FIG. 2 illustrates the training process of the head neural radiance field. After the head semantic code is obtained, volume rendering mediated by the head-based implicit function is performed on the head semantic code to obtain a head low-dimensional feature map; the head-based two-dimensional neural rendering module then renders the head low-dimensional feature map to obtain a first intermediate result; the first intermediate result is processed by the first Transformer module to obtain a head rendering result; finally, the first discriminator compares the head rendering result with the real head image to obtain the head loss.
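A schematic of this head branch might look as follows; the channel sizes, the convolutional two-dimensional renderer, and the use of `nn.TransformerEncoder` over flattened feature-map tokens are assumptions of the sketch, not the exact architecture disclosed by the application (the torso branch of FIG. 3 would mirror the same structure).

```python
import torch
import torch.nn as nn

class HeadBranch(nn.Module):
    """Low-dim feature map from volume rendering -> 2D neural renderer -> Transformer -> RGB."""
    def __init__(self, feat_ch=32, d_model=64):
        super().__init__()
        self.renderer2d = nn.Sequential(       # assumed CNN as the 2D neural rendering module
            nn.Conv2d(feat_ch, d_model, 3, padding=1), nn.ReLU(),
            nn.Conv2d(d_model, d_model, 3, padding=1), nn.ReLU(),
        )
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=2)
        self.to_rgb = nn.Conv2d(d_model, 3, 1)

    def forward(self, feat_map):               # feat_map: (B, feat_ch, H, W) from volume rendering
        x = self.renderer2d(feat_map)           # first intermediate result
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)   # (B, H*W, d_model) tokens for the Transformer
        tokens = self.transformer(tokens)
        x = tokens.transpose(1, 2).reshape(b, c, h, w)
        return self.to_rgb(x).sigmoid()         # head rendering result in [0, 1]
```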
Step 105, the torso semantic code is rendered based on the second Transformer module to obtain a torso rendering result, and a torso loss is calculated for the torso rendering result and the real torso image by using the second discriminator.
Specifically, rendering the torso semantic code based on the second Transformer module to obtain the torso rendering result may include: performing volume rendering on the torso semantic code to obtain a torso low-dimensional feature map; rendering the torso low-dimensional feature map by using a torso-based two-dimensional neural rendering module to obtain a second intermediate result; and processing the second intermediate result by using the second Transformer module to obtain the torso rendering result.
FIG. 3 shows the training process of the torso neural radiance field. After the torso semantic code is obtained, volume rendering mediated by the torso-based implicit function is performed on the torso semantic code to obtain a torso low-dimensional feature map; the torso-based two-dimensional neural rendering module then renders the torso low-dimensional feature map to obtain a second intermediate result; the second intermediate result is processed by the second Transformer module to obtain a torso rendering result; finally, the second discriminator compares the torso rendering result with the real torso image to obtain the torso loss.
Step 106, the generative model is trained using the head loss and the torso loss.
After the head loss and the torso loss are obtained, they can be used as feedback to adjust the model parameters of the generative model, and the training samples are processed again with the adjusted parameters; when the model parameters of the generative model meet a preset condition, training stops, and the generative model with those parameters is determined as the final trained generative model.
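One plausible form of this adversarial training step is sketched below, with separate head and torso GAN discriminators and a standard non-saturating loss; the batch layout, the equal weighting of the two losses, and the optimizer handling are assumptions, since the application does not specify them.

```python
import torch
import torch.nn.functional as F

def adv_loss(logits, is_real):
    """Non-saturating GAN loss on raw discriminator logits."""
    target = torch.ones_like(logits) if is_real else torch.zeros_like(logits)
    return F.binary_cross_entropy_with_logits(logits, target)

def train_step(generator, disc_head, disc_torso, opt_g, opt_d, batch):
    # the generator yields both renders from video and audio inputs (steps 103-105)
    head_fake, torso_fake = generator(batch["video"], batch["audio"])

    # discriminator update: real head/torso images vs. detached renders
    opt_d.zero_grad()
    d_loss = (adv_loss(disc_head(batch["head_real"]), True)
              + adv_loss(disc_head(head_fake.detach()), False)
              + adv_loss(disc_torso(batch["torso_real"]), True)
              + adv_loss(disc_torso(torso_fake.detach()), False))
    d_loss.backward()
    opt_d.step()

    # generator update: head loss + torso loss used as feedback (step 106)
    opt_g.zero_grad()
    g_loss = adv_loss(disc_head(head_fake), True) + adv_loss(disc_torso(torso_fake), True)
    g_loss.backward()
    opt_g.step()
    return d_loss.item(), g_loss.item()
```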
To sum up, in the method for training a generative model of a speaker video provided by the embodiment of the application, a wav2vec 2.0 module is introduced on the basis of the AD-NeRF model, and features of the speech data are extracted through the wav2vec 2.0 module, yielding better speech features; the frameworks of the head neural radiance field and the torso neural radiance field are modified on the basis of the AD-NeRF model by adding Transformer modules, and the loss calculation strategy is optimized through discriminators, which further improves the characterization and image generation capability and effectively alleviates the torso blurring problem.
Referring to FIG. 4, a flowchart of a method for using a generative model of a speaker video according to an embodiment of the present application is shown; the method can be applied to a computer device and can comprise the following steps:
step 401, video data and audio data containing a speaker are obtained.
The video data in this embodiment may be a video of a speaker speaking, where the speaker may be a real person or a virtual digital human; the audio data is unrelated to the video data.
Step 402, the video data and the audio data are processed by using the trained generative model to obtain a head semantic code and a torso semantic code, wherein a head neural radiance field in the generative model comprises a first Transformer module and a first discriminator, and a torso neural radiance field in the generative model comprises a second Transformer module and a second discriminator.
Specifically, the generative model includes a video processing module, a wav2vec 2.0 module and an implicit function, and processing the video data and the audio data by using the trained generative model to obtain the head semantic code and the torso semantic code may include: processing the video data by using the video processing module to obtain a video parsing map and pose parameters; processing the audio data by using the wav2vec 2.0 module to obtain wav2vec 2.0 features; and processing the video parsing map, the pose parameters and the wav2vec 2.0 features by using the implicit function to obtain the head semantic code and the torso semantic code. Extracting features of the speech data through the wav2vec 2.0 module yields better speech features.
Step 403, the head semantic code is rendered based on the first Transformer module to obtain a head rendering result.
Specifically, rendering the head semantic code based on the first Transformer module to obtain the head rendering result may include: performing volume rendering on the head semantic code to obtain a head low-dimensional feature map; rendering the head low-dimensional feature map by using a head-based two-dimensional neural rendering module to obtain a first intermediate result; and processing the first intermediate result by using the first Transformer module to obtain the head rendering result.
Step 404, the torso semantic code is rendered based on the second Transformer module to obtain a torso rendering result.
Specifically, rendering the torso semantic code based on the second Transformer module to obtain the torso rendering result may include: performing volume rendering on the torso semantic code to obtain a torso low-dimensional feature map; rendering the torso low-dimensional feature map by using a torso-based two-dimensional neural rendering module to obtain a second intermediate result; and processing the second intermediate result by using the second Transformer module to obtain the torso rendering result.
Step 405, the head rendering result and the torso rendering result are rendered into the speaker video.
FIG. 5 shows the synthesis process of the speaker video: a video parsing map and pose parameters are extracted from the video data, wav2vec 2.0 features are extracted from the audio data, the head semantic code and the torso semantic code are generated from the video parsing map, the pose parameters and the wav2vec 2.0 features by using the implicit function, a head rendering result is generated from the head semantic code by using the head neural radiance field, a torso rendering result is generated from the torso semantic code by using the torso neural radiance field, and the head rendering result and the torso rendering result are rendered into the speaker video.
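As a final illustrative step, the per-frame head and torso renders could be composited and encoded into a video; the alpha-mask compositing and the use of the imageio library are assumptions of this sketch rather than the method disclosed in FIG. 5.

```python
import imageio.v2 as imageio
import numpy as np

def compose_video(head_frames, head_masks, torso_frames, path="speaker.mp4", fps=25):
    """Overlay each head render on the matching torso render and encode a video.

    head_frames/torso_frames: lists of (H, W, 3) uint8 arrays;
    head_masks: matching (H, W, 1) float arrays in [0, 1] (assumed available).
    Requires the imageio-ffmpeg backend for .mp4 output.
    """
    writer = imageio.get_writer(path, fps=fps)
    for head, mask, torso in zip(head_frames, head_masks, torso_frames):
        frame = mask * head + (1.0 - mask) * torso   # simple alpha compositing
        writer.append_data(frame.astype(np.uint8))
    writer.close()
```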
To sum up, in the method for using a generative model of a speaker video provided by the embodiment of the application, a wav2vec 2.0 module is introduced on the basis of the AD-NeRF model, and features of the speech data are extracted through the wav2vec 2.0 module, yielding better speech features; the frameworks of the head neural radiance field and the torso neural radiance field are modified on the basis of the AD-NeRF model by adding Transformer modules, and the loss calculation strategy is optimized through discriminators, which further improves the characterization and image generation capability and effectively alleviates the torso blurring problem.
Referring to FIG. 6, a block diagram of an apparatus for training a generative model of a speaker video according to an embodiment of the present application is shown; the apparatus can be applied to a computer device and can comprise:
an obtaining module 610, configured to obtain a plurality of groups of training samples, where each group of training samples includes video data and audio data containing a speaker, and a real head image and a real torso image of the speaker;
a generating module 620, configured to improve the AD-NeRF model to obtain a generative model, where a head neural radiance field in the generative model includes a first Transformer module and a first discriminator, and a torso neural radiance field in the generative model includes a second Transformer module and a second discriminator;
a processing module 630, configured to process the video data and the audio data by using the generative model for each group of training samples to obtain a head semantic code and a torso semantic code;
the processing module 630 is further configured to render the head semantic code based on the first Transformer module to obtain a head rendering result, and calculate a head loss for the head rendering result and the real head image by using the first discriminator;
the processing module 630 is further configured to render the torso semantic code based on the second Transformer module to obtain a torso rendering result, and calculate a torso loss for the torso rendering result and the real torso image by using the second discriminator;
a training module 640, configured to train the generative model with the head loss and the torso loss.
In an optional embodiment, the generative model further includes a video processing module, a wav2vec 2.0 module, and an implicit function, and the processing module 630 is further configured to:
process the video data by using the video processing module to obtain a video parsing map and pose parameters;
process the audio data by using the wav2vec 2.0 module to obtain wav2vec 2.0 features;
and process the video parsing map, the pose parameters and the wav2vec 2.0 features by using the implicit function to obtain the head semantic code and the torso semantic code.
In an optional embodiment, the processing module 630 is further configured to:
perform volume rendering on the head semantic code to obtain a head low-dimensional feature map;
render the head low-dimensional feature map by using a head-based two-dimensional neural rendering module to obtain a first intermediate result;
and process the first intermediate result by using the first Transformer module to obtain the head rendering result.
In an optional embodiment, the processing module 630 is further configured to:
perform volume rendering on the torso semantic code to obtain a torso low-dimensional feature map;
render the torso low-dimensional feature map by using a torso-based two-dimensional neural rendering module to obtain a second intermediate result;
and process the second intermediate result by using the second Transformer module to obtain the torso rendering result.
In an alternative embodiment, the first and second discriminators are GAN discriminators.
To sum up, the apparatus for training a generative model of a speaker video provided by the embodiment of the application introduces a wav2vec 2.0 module on the basis of the AD-NeRF model and extracts features of the speech data through the wav2vec 2.0 module, yielding better speech features; the frameworks of the head neural radiance field and the torso neural radiance field are modified on the basis of the AD-NeRF model by adding Transformer modules, and the loss calculation strategy is optimized through discriminators, which further improves the characterization and image generation capability and effectively alleviates the torso blurring problem.
Referring to FIG. 7, a block diagram of an apparatus for using a generative model of a speaker video according to an embodiment of the present application is shown; the apparatus can be applied to a computer device and can comprise:
an obtaining module 710, configured to obtain video data and audio data that include a speaker;
a processing module 720, configured to process the video data and the audio data by using the trained generative model to obtain a head semantic code and a torso semantic code, where a head neural radiance field in the generative model includes a first Transformer module and a first discriminator, and a torso neural radiance field in the generative model includes a second Transformer module and a second discriminator;
a rendering module 730, configured to render the head semantic code based on the first Transformer module to obtain a head rendering result;
the rendering module 730 is further configured to render the torso semantic code based on the second Transformer module to obtain a torso rendering result;
the rendering module 730 is further configured to render the head rendering result and the torso rendering result into the speaker video.
In summary, the apparatus for using a generative model of a speaker video provided by the embodiment of the application introduces a wav2vec 2.0 module on the basis of the AD-NeRF model and extracts features of the speech data through the wav2vec 2.0 module, yielding better speech features; the frameworks of the head neural radiance field and the torso neural radiance field are modified on the basis of the AD-NeRF model by adding Transformer modules, and the loss calculation strategy is optimized through discriminators, which further improves the characterization and image generation capability and effectively alleviates the torso blurring problem.
One embodiment of the present application provides a computer-readable storage medium having at least one instruction stored therein, which is loaded and executed by a processor to implement the method for training a generative model of a speaker video or the method for using a generative model of a speaker video as described above.
One embodiment of the present application provides a computer device, which includes a processor and a memory, where the memory stores at least one instruction, and the instruction is loaded and executed by the processor to implement the method for training a generative model of a speaker video or the method for using a generative model of a speaker video as described above.
It will be understood by those skilled in the art that all or part of the steps of the above embodiments may be implemented by hardware, or by a program instructing relevant hardware, where the program may be stored in a computer-readable storage medium, and the storage medium may be a read-only memory, a magnetic disk, an optical disk, or the like.
The above description is not intended to limit the embodiments of the present application, and any modifications, equivalents, improvements, etc. made within the spirit and principle of the embodiments of the present application should be included in the scope of the embodiments of the present application.

Claims (10)

1. A method for training a generative model of a speaker video, the method comprising:
acquiring a plurality of groups of training samples, wherein each group of training samples comprises video data and audio data containing a speaker, and a real head image and a real torso image of the speaker;
improving an AD-NeRF model to obtain a generative model, wherein a head neural radiance field in the generative model comprises a first Transformer module and a first discriminator, and a torso neural radiance field in the generative model comprises a second Transformer module and a second discriminator;
for each group of training samples, processing the video data and the audio data by using the generative model to obtain a head semantic code and a torso semantic code;
rendering the head semantic code based on the first Transformer module to obtain a head rendering result, and calculating a head loss on the head rendering result and the real head image by using the first discriminator;
rendering the torso semantic code based on the second Transformer module to obtain a torso rendering result, and calculating a torso loss on the torso rendering result and the real torso image by using the second discriminator;
and training the generative model with the head loss and the torso loss.
2. The method for training a generative model of a speaker video according to claim 1, wherein the generative model further comprises a video processing module, a wav2vec 2.0 module and an implicit function, and the processing the video data and the audio data by using the generative model to obtain a head semantic code and a torso semantic code comprises:
processing the video data by using the video processing module to obtain a video parsing map and pose parameters;
processing the audio data by using the wav2vec 2.0 module to obtain wav2vec 2.0 features;
and processing the video parsing map, the pose parameters and the wav2vec 2.0 features by using the implicit function to obtain the head semantic code and the torso semantic code.
3. The method for training a generative model of a speaker video according to claim 1, wherein the rendering the head semantic code based on the first Transformer module to obtain a head rendering result comprises:
performing volume rendering on the head semantic code to obtain a head low-dimensional feature map;
rendering the head low-dimensional feature map by using a head-based two-dimensional neural rendering module to obtain a first intermediate result;
and processing the first intermediate result by utilizing the first Transformer module to obtain a head rendering result.
4. The method for training a generative model of a speaker video according to claim 1, wherein the rendering the torso semantic code based on the second Transformer module to obtain a torso rendering result comprises:
performing volume rendering on the torso semantic code to obtain a torso low-dimensional feature map;
rendering the torso low-dimensional feature map by using a torso-based two-dimensional neural rendering module to obtain a second intermediate result;
and processing the second intermediate result by utilizing the second Transformer module to obtain a torso rendering result.
5. The method for training a generative model of a speaker video according to any one of claims 1 to 4, wherein the first discriminator and the second discriminator are GAN discriminators.
6. A method for using a generative model of a speaker video, the method comprising:
acquiring video data and audio data containing a speaker;
processing the video data and the audio data by using a trained generative model to obtain a head semantic code and a torso semantic code, wherein a head neural radiance field in the generative model comprises a first Transformer module and a first discriminator, and a torso neural radiance field in the generative model comprises a second Transformer module and a second discriminator;
rendering the head semantic code based on the first Transformer module to obtain a head rendering result;
rendering the torso semantic code based on the second Transformer module to obtain a torso rendering result;
and rendering the head rendering result and the torso rendering result to obtain the speaker video.
7. An apparatus for training a generative model of a speaker video, the apparatus comprising:
the acquisition module is used for acquiring a plurality of groups of training samples, wherein each group of training samples comprises video data and audio data containing a speaker, and a real head image and a real torso image of the speaker;
the generation module is used for improving the AD-NeRF model to obtain a generative model, wherein a head neural radiance field in the generative model comprises a first Transformer module and a first discriminator, and a torso neural radiance field in the generative model comprises a second Transformer module and a second discriminator;
the processing module is used for processing the video data and the audio data by using the generative model for each group of training samples to obtain a head semantic code and a torso semantic code;
the processing module is further configured to render the head semantic code based on the first Transformer module to obtain a head rendering result, and calculate a head loss for the head rendering result and the real head image by using the first discriminator;
the processing module is further configured to render the torso semantic code based on the second Transformer module to obtain a torso rendering result, and calculate a torso loss for the torso rendering result and the real torso image by using the second discriminator;
a training module to train the generative model using the head loss and the torso loss.
8. An apparatus for using a generative model of a speaker video, the apparatus comprising:
the acquisition module is used for acquiring video data and audio data containing a speaker;
the processing module is used for processing the video data and the audio data by using a trained generative model to obtain a head semantic code and a torso semantic code, wherein a head neural radiance field in the generative model comprises a first Transformer module and a first discriminator, and a torso neural radiance field in the generative model comprises a second Transformer module and a second discriminator;
the rendering module is used for rendering the head semantic code based on the first Transformer module to obtain a head rendering result;
the rendering module is further configured to render the torso semantic code based on the second Transformer module to obtain a torso rendering result;
the rendering module is further used for rendering the head rendering result and the torso rendering result to obtain the speaker video.
9. A computer-readable storage medium, wherein at least one instruction is stored in the storage medium, and the at least one instruction is loaded and executed by a processor to implement the method for training a generative model of a speaker video according to any one of claims 1 to 5, or to implement the method for using a generative model of a speaker video according to claim 6.
10. A computer device comprising a processor and a memory, wherein the memory stores at least one instruction, and the instruction is loaded and executed by the processor to implement the method for training a generative model of a speaker video according to any one of claims 1 to 5; alternatively, the instruction is loaded and executed by the processor to implement the method for using a generative model of a speaker video according to claim 6.
CN202211631657.9A 2022-12-19 2022-12-19 Method, device and equipment for training and using generation model of speaker video Pending CN115908662A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211631657.9A CN115908662A (en) 2022-12-19 2022-12-19 Method, device and equipment for training and using generation model of speaker video

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211631657.9A CN115908662A (en) 2022-12-19 2022-12-19 Method, device and equipment for training and using generation model of speaker video

Publications (1)

Publication Number Publication Date
CN115908662A (en) 2023-04-04

Family

ID=86487924

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211631657.9A Pending CN115908662A (en) 2022-12-19 2022-12-19 Method, device and equipment for training and using generation model of speaker video

Country Status (1)

Country Link
CN (1) CN115908662A (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112887698A (en) * 2021-02-04 2021-06-01 中国科学技术大学 High-quality face voice driving method based on nerve radiation field
CN113378697A (en) * 2021-06-08 2021-09-10 安徽大学 Method and device for generating speaking face video based on convolutional neural network
CN113793408A (en) * 2021-09-15 2021-12-14 宿迁硅基智能科技有限公司 Real-time audio-driven face generation method and device and server
CN113822969A (en) * 2021-09-15 2021-12-21 宿迁硅基智能科技有限公司 Method, device and server for training nerve radiation field model and face generation

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
RICONG HUANG et al.: "Audio-driven Talking Head Generation with Transformer and 3D Morphable Model", MM '22: Proceedings of the 30th ACM International Conference on Multimedia, 30 October 2022 (2022-10-30)
SHUNYU YAO et al.: "DFA-NeRF: Personalized Talking Head Generation via Disentangled Face Attributes Neural Rendering", arXiv.org, 3 January 2022 (2022-01-03)
YANG ZHOU et al.: "MakeItTalk: speaker-aware talking-head animation", ACM Transactions on Graphics, vol. 39, no. 6, 27 November 2020 (2020-11-27)
SU Hongqi et al.: "A survey of audio-driven and motion-driven talking face video generation", Computer and Image Technology, no. 21, 1 November 2022 (2022-11-01)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117689783A (en) * 2024-02-02 2024-03-12 湖南马栏山视频先进技术研究院有限公司 Face voice driving method and device based on super-parameter nerve radiation field
CN117689783B (en) * 2024-02-02 2024-04-30 湖南马栏山视频先进技术研究院有限公司 Face voice driving method and device based on super-parameter nerve radiation field

Similar Documents

Publication Publication Date Title
CN113269872A (en) Synthetic video generation method based on three-dimensional face reconstruction and video key frame optimization
CN113378697A (en) Method and device for generating speaking face video based on convolutional neural network
EP3982362A1 (en) Audio processing method, apparatus, computer device, and storage medium
CN113838173B (en) Virtual human head motion synthesis method driven by combination of voice and background sound
CN113077537A (en) Video generation method, storage medium and equipment
CN113886641A (en) Digital human generation method, apparatus, device and medium
CN115908662A (en) Method, device and equipment for training and using generation model of speaker video
CN115423908A (en) Virtual face generation method, device, equipment and readable storage medium
CN115914505B (en) Video generation method and system based on voice-driven digital human model
Shankar et al. Non-parallel emotion conversion using a deep-generative hybrid network and an adversarial pair discriminator
CN113516990A (en) Voice enhancement method, method for training neural network and related equipment
CN113948105A (en) Voice-based image generation method, device, equipment and medium
CN114882861A (en) Voice generation method, device, equipment, medium and product
Huang et al. Parametric implicit face representation for audio-driven facial reenactment
CN113470170A (en) Real-time video face region space-time consistent synthesis method using voice information
Huang et al. Perceptual conversational head generation with regularized driver and enhanced renderer
CN116912375A (en) Facial animation generation method and device, electronic equipment and storage medium
CN112862672B (en) Liu-bang generation method, device, computer equipment and storage medium
CN113886640A (en) Digital human generation method, apparatus, device and medium
CN115278293A (en) Virtual anchor generation method and device, storage medium and computer equipment
CN113469292A (en) Training method, synthesizing method, device, medium and equipment for video synthesizing model
CN113343761A (en) Real-time facial expression migration method based on generation confrontation
Pan et al. Bone-conducted speech to air-conducted speech conversion based on cycle-consistent adversarial networks
Chen et al. VAST: Vivify Your Talking Avatar via Zero-Shot Expressive Facial Style Transfer
CN117523051B (en) Method, device, equipment and storage medium for generating dynamic image based on audio

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination