CN115908662A - Method, device and equipment for training and using generation model of speaker video - Google Patents
Abstract
The application discloses a method, a device and equipment for training and using a generative model of speaker video, belonging to the technical field of machine learning. The method comprises the following steps: improving the AD-NeRF model to obtain a generative model; processing the video data and audio data in a training sample with the generative model to obtain a head semantic code and a torso semantic code; rendering the head semantic code based on a first Transformer module in the head neural radiance field, and computing a head loss over the head rendering result and the real head image with a first discriminator in the head neural radiance field; rendering the torso semantic code based on a second Transformer module in the torso neural radiance field, and computing a torso loss over the torso rendering result and the real torso image with a second discriminator in the torso neural radiance field; and training the generative model with the head loss and the torso loss. The method and device improve the representation and image generation capability and alleviate the torso-blurring problem.
Description
Technical Field
The application relates to the technical field of machine learning, and in particular to a method, a device and equipment for training and using a generative model of speaker video.
Background
Synthesizing a high-fidelity, audio-driven facial video sequence is an important and challenging problem with many applications, such as digital humans, chatbots, and virtual video conferencing. The generation of a speaker video can be viewed as a cross-modal mapping from audio to a visual face: the synthesized face image should look as realistic as the original video footage while exhibiting a natural speaking style.
In recent years, the AD-NeRF (Audio Driven Neural Radiance Fields) model has been proposed for generating speaker videos. Building on NeRF (Neural Radiance Fields), it is an algorithm that generates a speaker video directly from a speech signal: the features of the audio signal are fed into a conditional implicit function to produce a dynamic neural radiance field, and the speaker video corresponding to the audio signal is synthesized through volume rendering. The AD-NeRF model can synthesize not only the head (including hair) region but also the torso, using two independent neural radiance fields. However, in the synthesized speaker video, the speaker's mouth looks unnatural and the torso is blurred.
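As background, the volume rendering step that NeRF-style models such as AD-NeRF rely on can be sketched as follows. This is a generic illustration of the standard NeRF compositing equation, not the patent's exact implementation; the sample counts and values are made up for the example.

```python
import numpy as np

def volume_render(densities, colors, deltas):
    """Composite per-sample densities and colors along one ray (NeRF-style).

    densities: (N,) non-negative sigma values at N samples along the ray
    colors:    (N, 3) RGB at each sample
    deltas:    (N,) distances between consecutive samples
    Returns the accumulated RGB for the ray.
    """
    alphas = 1.0 - np.exp(-densities * deltas)      # per-sample opacity
    trans = np.cumprod(1.0 - alphas + 1e-10)        # transmittance after each sample
    trans = np.concatenate([[1.0], trans[:-1]])     # T_i depends only on samples before i
    weights = alphas * trans
    return (weights[:, None] * colors).sum(axis=0)

# A single, nearly opaque red sample dominates the ray colour.
rgb = volume_render(np.array([50.0]), np.array([[1.0, 0.0, 0.0]]), np.array([1.0]))
```

With zero density everywhere, the same function returns black, matching the intuition that empty space contributes nothing to the rendered pixel.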
Disclosure of Invention
The application provides a method, a device and equipment for training and using a generative model of speaker video, aiming at the problems that, in speaker videos synthesized with the AD-NeRF model, the speaker's mouth looks unnatural and the torso is blurred. The technical scheme is as follows:
in one aspect, a method for training a generative model of a speaker video is provided, the method comprising:
acquiring a plurality of sets of training samples, wherein each set of training samples comprises video data and audio data containing a speaker, and a real head image and a real torso image of the speaker;
improving the AD-NeRF model to obtain a generative model, wherein a head neural radiance field in the generative model comprises a first Transformer module and a first discriminator, and a torso neural radiance field in the generative model comprises a second Transformer module and a second discriminator;
for each set of training samples, processing the video data and the audio data with the generative model to obtain a head semantic code and a torso semantic code;
rendering the head semantic code based on the first Transformer module to obtain a head rendering result, and computing a head loss over the head rendering result and the real head image with the first discriminator;
rendering the torso semantic code based on the second Transformer module to obtain a torso rendering result, and computing a torso loss over the torso rendering result and the real torso image with the second discriminator;
and training the generative model with the head loss and the torso loss.
In a possible implementation, the generative model further comprises a video processing module, a Wav2Vec 2.0 module and an implicit function, and processing the video data and the audio data with the generative model to obtain a head semantic code and a torso semantic code comprises:
processing the video data with the video processing module to obtain a video parsing map and pose parameters;
processing the audio data with the Wav2Vec 2.0 module to obtain Wav2Vec 2.0 features;
and processing the video parsing map, the pose parameters and the Wav2Vec 2.0 features with the implicit function to obtain the head semantic code and the torso semantic code.
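The implicit function above conditions a per-point prediction on the pose parameters and audio features. A minimal sketch of such a conditional implicit function is given below; all dimensions (12 pose parameters, 768-dim Wav2Vec 2.0 frame features, a 64-dim semantic code) and the two-layer MLP are hypothetical stand-ins chosen only to illustrate the data flow, not the patent's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: 3-D sample position, 12 pose parameters,
# 768-dim audio frame feature, 64-dim semantic code, 128 hidden units.
POS_D, POSE_D, AUDIO_D, CODE_D, HIDDEN = 3, 12, 768, 64, 128

W1 = rng.standard_normal((POS_D + POSE_D + AUDIO_D, HIDDEN)) * 0.02
W2 = rng.standard_normal((HIDDEN, CODE_D)) * 0.02

def implicit_fn(position, pose, audio_feat):
    """Toy conditional implicit function: concatenate the conditioning
    inputs with the sample position and run a two-layer MLP to produce
    a semantic code for this sample point."""
    x = np.concatenate([position, pose, audio_feat])
    h = np.maximum(W1.T @ x, 0.0)    # ReLU hidden layer
    return W2.T @ h

code = implicit_fn(np.zeros(POS_D), np.zeros(POSE_D), np.zeros(AUDIO_D))
```

In the real model the weights are learned and the function is queried at many sample points per ray; here a single query shows the input/output shapes.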
In a possible implementation, rendering the head semantic code based on the first Transformer module to obtain a head rendering result comprises:
performing volume rendering on the head semantic code to obtain a head low-dimensional feature map;
rendering the head low-dimensional feature map with a head-based two-dimensional neural rendering module to obtain a first intermediate result;
and processing the first intermediate result with the first Transformer module to obtain the head rendering result.
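The three-stage head pipeline above (volume rendering to a low-dimensional feature map, 2-D neural rendering, Transformer refinement) can be sketched as a composition of stages. Each stage below is a shape-preserving stub standing in for a learned module; the 16x16 map, 4x upsampling factor and 8-dim code are arbitrary illustrative choices, not values from the patent.

```python
import numpy as np

def volume_render_to_feature_map(semantic_code):
    """Stage 1: volume rendering yields a LOW-dimensional feature map,
    not a full-resolution image (toy 16x16 spatial grid)."""
    return np.tile(semantic_code, (16, 16, 1))          # (16, 16, C)

def neural_render_2d(feat_map):
    """Stage 2: a 2-D neural renderer upsamples features toward image
    resolution (here a plain 4x nearest-neighbour repeat)."""
    return feat_map.repeat(4, axis=0).repeat(4, axis=1)  # (64, 64, C)

def transformer_refine(intermediate):
    """Stage 3: the Transformer module refines the intermediate result
    (a no-op placeholder that keeps the shape)."""
    return intermediate

head_code = np.ones(8)                                   # hypothetical 8-dim code
result = transformer_refine(neural_render_2d(volume_render_to_feature_map(head_code)))
```

The point of the composition is the ordering: the Transformer operates on the 2-D intermediate result, after volume rendering and neural rendering, which is where the claimed improvement over plain AD-NeRF rendering sits.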
In a possible implementation, rendering the torso semantic code based on the second Transformer module to obtain a torso rendering result comprises:
performing volume rendering on the torso semantic code to obtain a torso low-dimensional feature map;
rendering the torso low-dimensional feature map with a torso-based two-dimensional neural rendering module to obtain a second intermediate result;
and processing the second intermediate result with the second Transformer module to obtain the torso rendering result.
In one possible implementation, the first discriminator and the second discriminator are GAN discriminators.
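A GAN discriminator is trained to score real images as 1 and rendered images as 0, and that score difference is what the patent uses as the head and torso losses. Below is a minimal sketch of the standard discriminator loss on precomputed discriminator probabilities; the patent does not specify the exact GAN objective, so the vanilla binary cross-entropy form here is an illustrative assumption.

```python
import numpy as np

def bce(pred, target):
    """Binary cross-entropy over discriminator output probabilities."""
    eps = 1e-7
    pred = np.clip(pred, eps, 1 - eps)   # avoid log(0)
    return -(target * np.log(pred) + (1 - target) * np.log(1 - pred)).mean()

def discriminator_loss(d_real, d_fake):
    """Vanilla GAN discriminator loss: real patches should score 1,
    rendered (fake) patches should score 0."""
    return bce(d_real, np.ones_like(d_real)) + bce(d_fake, np.zeros_like(d_fake))

# Scores from a discriminator that is mostly (but not perfectly) right.
loss = discriminator_loss(np.array([0.9, 0.8]), np.array([0.1, 0.2]))
```

A perfect discriminator (real scored 1.0, fake scored 0.0) drives this loss to essentially zero, while confident mistakes make it large; the generator is trained against the opposing signal.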
In one aspect, a method for using a generative model of a speaker video is provided, the method comprising:
acquiring video data and audio data containing a speaker;
processing the video data and the audio data with a trained generative model to obtain a head semantic code and a torso semantic code, wherein a head neural radiance field in the generative model comprises a first Transformer module and a first discriminator, and a torso neural radiance field in the generative model comprises a second Transformer module and a second discriminator;
rendering the head semantic code based on the first Transformer module to obtain a head rendering result;
rendering the torso semantic code based on the second Transformer module to obtain a torso rendering result;
and rendering the head rendering result and the torso rendering result to obtain the speaker video.
In one aspect, an apparatus for training a generative model of a speaker video is provided, the apparatus comprising:
an acquisition module, configured to acquire a plurality of sets of training samples, wherein each set of training samples comprises video data and audio data containing a speaker, and a real head image and a real torso image of the speaker;
a generation module, configured to improve the AD-NeRF model to obtain a generative model, wherein a head neural radiance field in the generative model comprises a first Transformer module and a first discriminator, and a torso neural radiance field in the generative model comprises a second Transformer module and a second discriminator;
a processing module, configured to, for each set of training samples, process the video data and the audio data with the generative model to obtain a head semantic code and a torso semantic code;
the processing module is further configured to render the head semantic code based on the first Transformer module to obtain a head rendering result, and compute a head loss over the head rendering result and the real head image with the first discriminator;
the processing module is further configured to render the torso semantic code based on the second Transformer module to obtain a torso rendering result, and compute a torso loss over the torso rendering result and the real torso image with the second discriminator;
and a training module, configured to train the generative model with the head loss and the torso loss.
In one aspect, an apparatus for using a generative model of a speaker video is provided, the apparatus comprising:
an acquisition module, configured to acquire video data and audio data containing a speaker;
a processing module, configured to process the video data and the audio data with a trained generative model to obtain a head semantic code and a torso semantic code, wherein a head neural radiance field in the generative model comprises a first Transformer module and a first discriminator, and a torso neural radiance field in the generative model comprises a second Transformer module and a second discriminator;
a rendering module, configured to render the head semantic code based on the first Transformer module to obtain a head rendering result;
the rendering module is further configured to render the torso semantic code based on the second Transformer module to obtain a torso rendering result;
and the rendering module is further configured to render the head rendering result and the torso rendering result to obtain the speaker video.
In one aspect, a computer-readable storage medium is provided, in which at least one instruction is stored, the at least one instruction being loaded and executed by a processor to implement the method for training a generative model of a speaker video as described above, or to implement the method for using a generative model of a speaker video as described above.
In one aspect, a computer device is provided, comprising a processor and a memory, the memory storing at least one instruction, the instruction being loaded and executed by the processor to implement the method for training a generative model of a speaker video as described above, or to implement the method for using a generative model of a speaker video as described above.
The technical scheme provided by the application has the following beneficial effects:
a Wav2Vec 2.0 module is introduced on the basis of the AD-NeRF model, and the features of the speech data are extracted through the Wav2Vec 2.0 module, so that higher-quality audio features are obtained; the frameworks of the head neural radiance field and the torso neural radiance field are modified on the basis of the AD-NeRF model by adding a Transformer module, and the loss calculation strategy is optimized through discriminators, which further improves the representation and image generation capability and effectively alleviates the torso-blurring problem.
Drawings
In order to illustrate the technical solutions in the embodiments of the present application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present application; those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flowchart of a method for training a generative model of a speaker video according to an embodiment of the present application;
FIG. 2 is a flowchart of the training process of the head neural radiance field provided by an embodiment of the present application;
FIG. 3 is a flowchart of the training process of the torso neural radiance field provided by an embodiment of the present application;
FIG. 4 is a flowchart of a method for using a generative model of speaker video according to another embodiment of the present application;
FIG. 5 is a schematic diagram illustrating a process for using a generative model of a speaker video according to an embodiment of the present application;
FIG. 6 is a block diagram of an apparatus for training a generative model of a speaker video according to an embodiment of the present application;
fig. 7 is a block diagram illustrating an apparatus for using a generative model of a speaker video according to an embodiment of the present application.
Detailed Description
To make the objects, technical solutions and advantages of the embodiments of the present application clearer, the embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
Referring to fig. 1, there is shown a flowchart of a method for training a generative model of a speaker video according to an embodiment of the present application; the method can be applied to a computer device and can comprise the following steps:
Step 101, acquire a plurality of sets of training samples, each set comprising video data and audio data containing a speaker, and a real head image and a real torso image of the speaker.
The video data in this embodiment may be videos of a speaker speaking, where the speaker may be a real person or a virtual digital human. For example, the computer device may segment a training video into individual pieces of video data of a fixed duration. The fixed duration can be set as required, for example, to a value within 3-5 minutes.
The audio data may be extracted from the video data of the same set of training samples, in which case the training samples are positive samples; alternatively, the audio data may be unrelated to the video data of the same set of training samples, in which case the training samples are negative samples.
Step 102, improve the AD-NeRF model to obtain a generative model, wherein a head neural radiance field in the generative model comprises a first Transformer module and a first discriminator, and a torso neural radiance field in the generative model comprises a second Transformer module and a second discriminator.
In this embodiment, the improvement of the AD-NeRF model comprises two parts: the first part introduces a Wav2Vec 2.0 module on the basis of the AD-NeRF model and extracts the features of the speech data through the Wav2Vec 2.0 module; the second part modifies the frameworks of the head neural radiance field and the torso neural radiance field on the basis of the AD-NeRF model, adding a Transformer module and optimizing the loss calculation strategy through discriminators.
For ease of distinction, in this embodiment the Transformer module in the head neural radiance field is referred to as the first Transformer module, and the Transformer module in the torso neural radiance field is referred to as the second Transformer module; likewise, the discriminator in the head neural radiance field is referred to as the first discriminator, and the discriminator in the torso neural radiance field is referred to as the second discriminator. Both the first discriminator and the second discriminator are GAN (Generative Adversarial Network) discriminators.
Step 103, for each set of training samples, process the video data and the audio data with the generative model to obtain a head semantic code and a torso semantic code.
Specifically, the generative model further includes a video processing module, a Wav2Vec 2.0 module and an implicit function. Processing the video data and the audio data with the generative model to obtain a head semantic code and a torso semantic code may then include: processing the video data with the video processing module to obtain a video parsing map and pose parameters; processing the audio data with the Wav2Vec 2.0 module to obtain Wav2Vec 2.0 features; and processing the video parsing map, the pose parameters and the Wav2Vec 2.0 features with the implicit function to obtain the head semantic code and the torso semantic code. Extracting the features of the speech data through the Wav2Vec 2.0 module yields higher-quality audio features.
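The Wav2Vec 2.0 front end turns a raw waveform into a sequence of per-frame feature vectors. The sketch below only illustrates those shapes: it frames a 16 kHz waveform with roughly the 20 ms hop that wav2vec 2.0's convolutional front end produces, then applies a fixed random projection as a stand-in for the pretrained self-supervised network. The window size, hop, and 768-dim output are illustrative assumptions, not a reimplementation of the module.

```python
import numpy as np

def frame_audio(waveform, hop=320, win=400, feat_dim=768):
    """Stand-in for the Wav2Vec 2.0 front end: slice a 16 kHz waveform into
    overlapping frames (320-sample hop = 20 ms) and map each frame to a
    feature vector with a fixed random projection. The real module is a
    pretrained network; only the input/output shapes are meaningful here."""
    rng = np.random.default_rng(0)
    proj = rng.standard_normal((win, feat_dim)) * 0.01
    n = 1 + (len(waveform) - win) // hop
    frames = np.stack([waveform[i * hop : i * hop + win] for i in range(n)])
    return frames @ proj                    # (n_frames, feat_dim)

feats = frame_audio(np.zeros(16000))        # one second of silence
```

One second of audio thus becomes a (49, 768) feature matrix; these per-frame features are what the implicit function consumes alongside the video parsing map and pose parameters.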
Step 104, render the head semantic code based on the first Transformer module to obtain a head rendering result, and compute a head loss over the head rendering result and the real head image with the first discriminator.
Specifically, rendering the head semantic code based on the first Transformer module to obtain a head rendering result may include: performing volume rendering on the head semantic code to obtain a head low-dimensional feature map; rendering the head low-dimensional feature map with a head-based two-dimensional neural rendering module to obtain a first intermediate result; and processing the first intermediate result with the first Transformer module to obtain the head rendering result.
Fig. 2 illustrates the training process of the head neural radiance field: after the head semantic code is obtained, implicit-function-based volume rendering for the head is performed on the head semantic code to obtain a head low-dimensional feature map; the head-based two-dimensional neural rendering module then renders the head low-dimensional feature map to obtain a first intermediate result; the first Transformer module processes the first intermediate result to obtain the head rendering result; and finally, the first discriminator compares the head rendering result with the real head image to obtain the head loss.
Step 105, render the torso semantic code based on the second Transformer module to obtain a torso rendering result, and compute a torso loss over the torso rendering result and the real torso image with the second discriminator.
Specifically, rendering the torso semantic code based on the second Transformer module to obtain a torso rendering result may include: performing volume rendering on the torso semantic code to obtain a torso low-dimensional feature map; rendering the torso low-dimensional feature map with a torso-based two-dimensional neural rendering module to obtain a second intermediate result; and processing the second intermediate result with the second Transformer module to obtain the torso rendering result.
Fig. 3 shows the training process of the torso neural radiance field: after the torso semantic code is obtained, implicit-function-based volume rendering for the torso is performed on the torso semantic code to obtain a torso low-dimensional feature map; the torso-based two-dimensional neural rendering module then renders the torso low-dimensional feature map to obtain a second intermediate result; the second Transformer module processes the second intermediate result to obtain the torso rendering result; and finally, the second discriminator compares the torso rendering result with the real torso image to obtain the torso loss.
Step 106, train the generative model with the head loss and the torso loss.
After the head loss and the torso loss are obtained, they can be used as feedback to adjust the model parameters of the generative model, and the training samples are processed again with the adjusted parameters; training stops once the model parameters of the generative model meet a preset condition, and the generative model with those parameters is taken as the final trained generative model.
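The loop described above, process a sample, combine the head and torso losses, adjust parameters, repeat until a stopping condition holds, can be sketched as follows. `ToyModel` is a deliberately trivial stand-in (its "update" just shrinks a scale factor, standing in for a gradient step), so only the control flow mirrors steps 103-106.

```python
class ToyModel:
    """Minimal stand-in for the generative model: losses shrink with
    every update, standing in for actual gradient-based optimisation."""
    def __init__(self):
        self.scale = 1.0
    def encode(self, sample):
        return sample, sample            # (head_code, torso_code)
    def head_loss(self, code):
        return self.scale * abs(code)
    def torso_loss(self, code):
        return self.scale * abs(code)
    def update(self, loss):
        self.scale *= 0.5                # pretend parameter adjustment

def train(model, samples, steps=100, tol=1e-3):
    total = float("inf")
    for _ in range(steps):
        for s in samples:
            head_code, torso_code = model.encode(s)                 # step 103
            total = model.head_loss(head_code) + model.torso_loss(torso_code)  # 104-105
            model.update(total)                                     # step 106 feedback
        if total < tol:                  # preset stopping condition
            break
    return total

final = train(ToyModel(), [1.0])
```

The combined loss is the single training signal: both radiance fields are adjusted from the same total, which is what ties the head and torso branches together during training.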
To sum up, in the method for training a generative model of a speaker video provided by the embodiment of the application, a Wav2Vec 2.0 module is introduced on the basis of the AD-NeRF model, and the features of the speech data are extracted through the Wav2Vec 2.0 module, so that higher-quality audio features are obtained; the frameworks of the head neural radiance field and the torso neural radiance field are modified on the basis of the AD-NeRF model by adding a Transformer module, and the loss calculation strategy is optimized through discriminators, which further improves the representation and image generation capability and effectively alleviates the torso-blurring problem.
Referring to fig. 4, there is shown a flowchart of a method for using a generative model of a speaker video according to an embodiment of the present application; the method can be applied to a computer device and can comprise the following steps:
Step 401, acquire video data and audio data containing a speaker.
The video data in this embodiment may be a video of a speaker speaking, where the speaker may be a real person or a virtual digital human, and the audio data is unrelated to the video data.
Step 402, process the video data and the audio data with the trained generative model to obtain a head semantic code and a torso semantic code.
Specifically, the generative model includes a video processing module, a Wav2Vec 2.0 module and an implicit function. Processing the video data and the audio data with the trained generative model to obtain a head semantic code and a torso semantic code may then include: processing the video data with the video processing module to obtain a video parsing map and pose parameters; processing the audio data with the Wav2Vec 2.0 module to obtain Wav2Vec 2.0 features; and processing the video parsing map, the pose parameters and the Wav2Vec 2.0 features with the implicit function to obtain the head semantic code and the torso semantic code. Extracting the features of the speech data through the Wav2Vec 2.0 module yields higher-quality audio features.
Step 403, render the head semantic code based on the first Transformer module to obtain a head rendering result.
Specifically, rendering the head semantic code based on the first Transformer module to obtain a head rendering result may include: performing volume rendering on the head semantic code to obtain a head low-dimensional feature map; rendering the head low-dimensional feature map with a head-based two-dimensional neural rendering module to obtain a first intermediate result; and processing the first intermediate result with the first Transformer module to obtain the head rendering result.
Step 404, render the torso semantic code based on the second Transformer module to obtain a torso rendering result.
Specifically, rendering the torso semantic code based on the second Transformer module to obtain a torso rendering result may include: performing volume rendering on the torso semantic code to obtain a torso low-dimensional feature map; rendering the torso low-dimensional feature map with a torso-based two-dimensional neural rendering module to obtain a second intermediate result; and processing the second intermediate result with the second Transformer module to obtain the torso rendering result.
Step 405, render the head rendering result and the torso rendering result to obtain the speaker video.
Fig. 5 shows the synthesis process of a speaker video: a video parsing map and pose parameters are extracted from the video data, Wav2Vec 2.0 features are extracted from the audio data, the implicit function generates a head semantic code and a torso semantic code from the video parsing map, the pose parameters and the Wav2Vec 2.0 features, the head neural radiance field generates a head rendering result from the head semantic code, the torso neural radiance field generates a torso rendering result from the torso semantic code, and the head rendering result and the torso rendering result are rendered into the speaker video.
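The final step combines the separately rendered head and torso into one frame. The patent does not spell out the exact fusion, so the sketch below uses plain alpha compositing of the head layer over the torso layer, which is a common way to assemble such a frame and is offered only as an illustration.

```python
import numpy as np

def composite(head_rgb, head_alpha, torso_rgb):
    """Fuse the head rendering result over the torso rendering result
    using the head's alpha mask (illustrative; the patent does not
    specify the fusion operator)."""
    a = head_alpha[..., None]                 # broadcast mask over RGB
    return a * head_rgb + (1.0 - a) * torso_rgb

H = W = 4
head = np.ones((H, W, 3))                     # toy all-white head layer
torso = np.zeros((H, W, 3))                   # toy all-black torso layer
alpha = np.zeros((H, W))
alpha[:2] = 1.0                               # head occupies the top half
frame = composite(head, alpha, torso)
```

Wherever the head mask is 1 the head pixel wins; elsewhere the torso rendering shows through, producing the complete speaker frame.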
To sum up, in the method for using a generative model of a speaker video provided by the embodiment of the application, a Wav2Vec 2.0 module is introduced on the basis of the AD-NeRF model, and the features of the speech data are extracted through the Wav2Vec 2.0 module, so that higher-quality audio features are obtained; the frameworks of the head neural radiance field and the torso neural radiance field are modified on the basis of the AD-NeRF model by adding a Transformer module, and the loss calculation strategy is optimized through discriminators, which further improves the representation and image generation capability and effectively alleviates the torso-blurring problem.
Referring to fig. 6, there is shown a block diagram of an apparatus for training a generative model of a speaker video according to an embodiment of the present application; the apparatus can be applied to a computer device and can comprise:
an acquisition module 610, configured to acquire a plurality of sets of training samples, wherein each set of training samples comprises video data and audio data containing a speaker, and a real head image and a real torso image of the speaker;
a generation module 620, configured to improve the AD-NeRF model to obtain a generative model, wherein a head neural radiance field in the generative model comprises a first Transformer module and a first discriminator, and a torso neural radiance field in the generative model comprises a second Transformer module and a second discriminator;
a processing module 630, configured to, for each set of training samples, process the video data and the audio data with the generative model to obtain a head semantic code and a torso semantic code;
the processing module 630 is further configured to render the head semantic code based on the first Transformer module to obtain a head rendering result, and compute a head loss over the head rendering result and the real head image with the first discriminator;
the processing module 630 is further configured to render the torso semantic code based on the second Transformer module to obtain a torso rendering result, and compute a torso loss over the torso rendering result and the real torso image with the second discriminator;
and a training module 640, configured to train the generative model with the head loss and the torso loss.
In an optional embodiment, the generative model further includes a video processing module, a Wav2Vec 2.0 module and an implicit function, and the processing module 630 is further configured to:
process the video data with the video processing module to obtain a video parsing map and pose parameters;
process the audio data with the Wav2Vec 2.0 module to obtain Wav2Vec 2.0 features;
and process the video parsing map, the pose parameters and the Wav2Vec 2.0 features with the implicit function to obtain the head semantic code and the torso semantic code.
In an optional embodiment, the processing module 630 is further configured to:
perform volume rendering on the head semantic code to obtain a head low-dimensional feature map;
render the head low-dimensional feature map with a head-based two-dimensional neural rendering module to obtain a first intermediate result;
and process the first intermediate result with the first Transformer module to obtain the head rendering result.
In an optional embodiment, the processing module 630 is further configured to:
perform volume rendering on the torso semantic code to obtain a torso low-dimensional feature map;
render the torso low-dimensional feature map with a torso-based two-dimensional neural rendering module to obtain a second intermediate result;
and process the second intermediate result with the second Transformer module to obtain the torso rendering result.
In an alternative embodiment, the first discriminator and the second discriminator are GAN discriminators.
To sum up, the apparatus for training a generative model of a speaker video provided by the embodiment of the application introduces a Wav2Vec 2.0 module on the basis of the AD-NeRF model and extracts the features of the speech data through the Wav2Vec 2.0 module, so that higher-quality audio features are obtained; the frameworks of the head neural radiance field and the torso neural radiance field are modified on the basis of the AD-NeRF model by adding a Transformer module, and the loss calculation strategy is optimized through discriminators, which further improves the representation and image generation capability and effectively alleviates the torso-blurring problem.
Referring to fig. 7, there is shown a block diagram of an apparatus for using a generative model of a speaker video according to an embodiment of the present application; the apparatus can be applied to a computer device and can comprise:
an obtaining module 710, configured to obtain video data and audio data that include a speaker;
the processing module 720 is configured to process the video data and the audio data by using the trained generative model to obtain a head semantic code and a trunk semantic code, where a head nerve radiation field in the generative model includes a first transform module and a first discriminator, and a trunk nerve radiation field in the generative model includes a second transform module and a second discriminator;
the rendering module 730 is configured to render the head semantic code based on the first transform module to obtain a head rendering result;
the rendering module 730 is further configured to render the torso semantic code based on the second Transformer module, so as to obtain a torso rendering result;
the rendering module 730 is further configured to render the head rendering result and the torso rendering result to obtain the speaker video.
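The final step, combining the two renders into one frame, can be pictured as an alpha composite of the head over the torso. The function below and the use of the head's accumulated opacity as the blending mask are illustrative assumptions, not the patent's specified procedure:

```python
import numpy as np

def composite_frame(head_rgb, head_alpha, torso_rgb):
    """Overlay the rendered head onto the rendered torso, using the head's
    accumulated opacity from volume rendering as the blending mask.
    head_rgb, torso_rgb: (H, W, 3) arrays in [0, 1]; head_alpha: (H, W, 1)."""
    return head_alpha * head_rgb + (1.0 - head_alpha) * torso_rgb
```

Repeating this per frame, driven by the audio features, yields the output speaker video.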
In summary, the device for generating the speaker video provided by this embodiment of the application introduces a wav2vec 2.0 module on top of the AD-NeRF model and extracts audio features through it, improving the quality of the extracted speech features. The frameworks of the head neural radiance field and the torso neural radiance field are modified relative to AD-NeRF by adding a Transformer module, and the loss-calculation strategy is strengthened with a discriminator, which further improves the representation and image-generation capability and effectively alleviates the torso-blurring problem.
One embodiment of the present application provides a computer-readable storage medium having at least one instruction stored therein, which is loaded and executed by a processor to implement the speaker video generative model training method or the speaker video generative model using method as described above.
One embodiment of the present application provides a computer device, which includes a processor and a memory, where the memory stores at least one instruction, and the instruction is loaded and executed by the processor to implement the speaker video generative model training method or the speaker video generative model using method as described above.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, where the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
The above description is not intended to limit the embodiments of the present application, and any modifications, equivalents, improvements, etc. made within the spirit and principle of the embodiments of the present application should be included in the scope of the embodiments of the present application.
Claims (10)
1. A method for training generative models of speaker video, the method comprising:
acquiring a plurality of groups of training samples, wherein each group of training samples comprises video data and audio data containing a speaker, and a real head image and a real torso image of the speaker;
improving an AD-NeRF model to obtain a generative model, wherein a head neural radiance field in the generative model comprises a first Transformer module and a first discriminator, and a torso neural radiance field in the generative model comprises a second Transformer module and a second discriminator;
for each group of training samples, processing the video data and the audio data by using the generative model to obtain a head semantic code and a torso semantic code;
rendering the head semantic code based on the first Transformer module to obtain a head rendering result, and calculating a head loss for the head rendering result and the real head image by using the first discriminator;
rendering the torso semantic code based on the second Transformer module to obtain a torso rendering result, and calculating a torso loss for the torso rendering result and the real torso image by using the second discriminator;
training the generative model with the head loss and the torso loss.
2. The method for training the generative model of the speaker video according to claim 1, wherein the generative model further comprises a video processing module, a wav2vec 2.0 module and an implicit function, and the processing of the video data and the audio data by using the generative model to obtain the head semantic code and the torso semantic code comprises:
processing the video data by using the video processing module to obtain a video parsing map and pose parameters;
processing the audio data by using the wav2vec 2.0 module to obtain wav2vec 2.0 features;
and processing the video parsing map, the pose parameters and the wav2vec 2.0 features by using the implicit function to obtain the head semantic code and the torso semantic code.
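As a toy illustration of such an implicit function: it maps a sampled point together with the pose parameters and the audio feature to a semantic code. The two-layer MLP and every dimension below are made-up placeholders, not the model's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def implicit_field(xyz, pose, audio_feat, W1, W2):
    """Toy implicit function: concatenate a 3-D sample point, the pose
    parameters and the wav2vec 2.0 audio feature, then map them through a
    small ReLU MLP to a semantic code for that sample."""
    inp = np.concatenate([xyz, pose, audio_feat])
    hidden = np.maximum(0.0, W1 @ inp)   # ReLU hidden layer
    return W2 @ hidden                   # semantic code for this sample

# hypothetical dimensions: 3-D point, 6-D pose, 64-D audio feature, 32-D code
W1 = rng.standard_normal((128, 3 + 6 + 64))
W2 = rng.standard_normal((32, 128))
code = implicit_field(rng.standard_normal(3), rng.standard_normal(6),
                      rng.standard_normal(64), W1, W2)
```

In the actual pipeline this evaluation happens per sampled ray point, separately for the head and torso fields, before volume rendering aggregates the codes.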
3. The method for training the generative model of the speaker video according to claim 1, wherein the rendering of the head semantic code based on the first Transformer module to obtain a head rendering result comprises:
performing volume rendering on the head semantic code to obtain a head low-dimensional feature map;
rendering the head low-dimensional feature map by using a head-based two-dimensional neural rendering module to obtain a first intermediate result;
and processing the first intermediate result by utilizing the first Transformer module to obtain a head rendering result.
4. The method for training the generative model of the speaker video according to claim 1, wherein the rendering of the torso semantic code based on the second Transformer module to obtain a torso rendering result comprises:
performing volume rendering on the torso semantic code to obtain a torso low-dimensional feature map;
rendering the torso low-dimensional feature map by using a torso-based two-dimensional neural rendering module to obtain a second intermediate result;
and processing the second intermediate result by utilizing the second Transformer module to obtain a torso rendering result.
5. The method for training the generative model of the speaker video according to any one of claims 1 to 4, wherein the first discriminator and the second discriminator are GAN discriminators.
6. A method for using a generative model of a speaker video, the method comprising:
acquiring video data and audio data containing a speaker;
processing the video data and the audio data by using a trained generative model to obtain a head semantic code and a torso semantic code, wherein a head neural radiance field in the generative model comprises a first Transformer module and a first discriminator, and a torso neural radiance field in the generative model comprises a second Transformer module and a second discriminator;
rendering the head semantic code based on the first Transformer module to obtain a head rendering result;
rendering the torso semantic code based on the second Transformer module to obtain a torso rendering result;
rendering the head rendering result and the torso rendering result to obtain the speaker video.
7. An apparatus for training generative models of speaker video, the apparatus comprising:
the acquisition module is used for acquiring a plurality of groups of training samples, wherein each group of training samples comprises video data and audio data containing a speaker, and a real head image and a real torso image of the speaker;
the generation module is used for improving the AD-NeRF model to obtain a generative model, wherein a head neural radiance field in the generative model comprises a first Transformer module and a first discriminator, and a torso neural radiance field in the generative model comprises a second Transformer module and a second discriminator;
the processing module is used for processing the video data and the audio data by using the generative model for each group of training samples to obtain a head semantic code and a torso semantic code;
the processing module is further configured to render the head semantic code based on the first Transformer module to obtain a head rendering result, and calculate a head loss for the head rendering result and the real head image by using the first discriminator;
the processing module is further configured to render the torso semantic code based on the second Transformer module to obtain a torso rendering result, and calculate a torso loss for the torso rendering result and the real torso image by using the second discriminator;
a training module to train the generative model using the head loss and the torso loss.
8. An apparatus for using generative models of speaker video, the apparatus comprising:
the acquisition module is used for acquiring video data and audio data containing a speaker;
the processing module is used for processing the video data and the audio data by using a trained generative model to obtain a head semantic code and a torso semantic code, wherein a head neural radiance field in the generative model comprises a first Transformer module and a first discriminator, and a torso neural radiance field in the generative model comprises a second Transformer module and a second discriminator;
the rendering module is used for rendering the head semantic code based on the first Transformer module to obtain a head rendering result;
the rendering module is further configured to render the torso semantic code based on the second Transformer module to obtain a torso rendering result;
the rendering module is further used for rendering the head rendering result and the torso rendering result to obtain the speaker video.
9. A computer-readable storage medium, wherein at least one instruction is stored in the storage medium, and the at least one instruction is loaded and executed by a processor to implement the training method for generative model of speaker video as claimed in any one of claims 1 to 5, or the at least one instruction is loaded and executed by a processor to implement the using method for generative model of speaker video as claimed in claim 6.
10. A computer device comprising a processor and a memory, wherein the memory has stored therein at least one instruction that is loaded and executed by the processor to implement the generative model training method for speaker video according to any one of claims 1 to 5; alternatively, the instructions are loaded and executed by the processor to implement the speaker video generative model use method as claimed in claim 6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211631657.9A CN115908662A (en) | 2022-12-19 | 2022-12-19 | Method, device and equipment for training and using generation model of speaker video |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115908662A true CN115908662A (en) | 2023-04-04 |
Family
ID=86487924
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211631657.9A Pending CN115908662A (en) | 2022-12-19 | 2022-12-19 | Method, device and equipment for training and using generation model of speaker video |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115908662A (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112887698A (en) * | 2021-02-04 | 2021-06-01 | University of Science and Technology of China | High-quality face voice driving method based on neural radiance fields |
CN113378697A (en) * | 2021-06-08 | 2021-09-10 | 安徽大学 | Method and device for generating speaking face video based on convolutional neural network |
CN113793408A (en) * | 2021-09-15 | 2021-12-14 | 宿迁硅基智能科技有限公司 | Real-time audio-driven face generation method and device and server |
CN113822969A (en) * | 2021-09-15 | 2021-12-21 | 宿迁硅基智能科技有限公司 | Method, device and server for training a neural radiance field model and face generation |
Non-Patent Citations (4)
Title |
---|
RICONG HUANG 等: "Audio-driven Talking Head Generation with Transformer and 3D Morphable Model", MM \'22: PROCEEDINGS OF THE 30TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, 30 October 2022 (2022-10-30) * |
SHUNYU YAO 等: "DFA-NeRF: Personalized Talking Head Generation via Disentangled Face Attributes Neural Rendering", ARXIV.ORG, 3 January 2022 (2022-01-03) * |
YANG ZHOU et al.: "MakeItTalk: speaker-aware talking-head animation", ACM TRANSACTIONS ON GRAPHICS, vol. 39, no. 6, 27 November 2020 (2020-11-27) * |
SU HONGQI et al.: "A survey of audio-driven and motion-driven talking face video generation", COMPUTER AND IMAGE TECHNOLOGY, no. 21, 1 November 2022 (2022-11-01) * |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117689783A (en) * | 2024-02-02 | 2024-03-12 | 湖南马栏山视频先进技术研究院有限公司 | Face voice driving method and device based on hyperparametric neural radiance fields |
CN117689783B (en) * | 2024-02-02 | 2024-04-30 | 湖南马栏山视频先进技术研究院有限公司 | Face voice driving method and device based on hyperparametric neural radiance fields |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||