CN115908662B - Speaker video generation model training and using method, device and equipment

Publication number: CN115908662B
Application number: CN202211631657.9A
Authority: CN (China)
Filing/priority date: 2022-12-19
Published as CN115908662A (2023-04-04); granted as CN115908662B (2024-05-28)
Legal status: Active
Inventors: 严妍, 汪敏, 杨春宇, 白杨
Assignees: Beijing Kaipuyun Information Technology Co., Ltd.; Cape Cloud Information Technology Co., Ltd.

Abstract

The application discloses a method, an apparatus, and a device for training and using a speaker video generation model, belonging to the technical field of machine learning. The method comprises the following steps: improving the AD-NeRF model to obtain a generation model; processing the video data and audio data in the training samples with the generation model to obtain a head semantic code and a torso semantic code; rendering the head semantic code based on a first Transformer module in the head neural radiance field, and computing a head loss between the head rendering result and the real head image with a first discriminator in the head neural radiance field; rendering the torso semantic code based on a second Transformer module in the torso neural radiance field, and computing a torso loss between the torso rendering result and the real torso image with a second discriminator in the torso neural radiance field; and training the generation model with the head loss and the torso loss. The application improves the representation and image-generation capability and alleviates the problem of torso blurring.

Description

Speaker video generation model training and using method, device and equipment
Technical Field
The present application relates to the field of machine learning technologies, and in particular, to a method, an apparatus, and a device for training and using a speaker video generation model.
Background
In many applications, such as digital humans, chat robots, and virtual video conferencing, synthesizing high-fidelity audio-driven facial video sequences is an important and challenging problem. That is, the generation of a speaker video is treated as a cross-modal mapping from audio to a visual face, and the synthesized face image is expected to be as photo-realistic as the original video while exhibiting a natural speaking style.
In recent years, the AD-NeRF (Audio-Driven Neural Radiance Fields) model has been proposed for generating speaker videos. Built on NeRF (Neural Radiance Fields), it is an algorithm that generates a speaker video directly from a speech signal: the features of the audio signal are fed into a conditional implicit function to produce a dynamic neural radiance field, and the speaker video corresponding to the audio signal is synthesized through volume rendering. The AD-NeRF model can synthesize not only the head (including hair) region but also the torso, using two separate neural radiance fields. However, in the synthesized speaker video the speaker's mouth looks unnatural and the torso is blurred.
Disclosure of Invention
The application provides a method, an apparatus, and a device for training and using a speaker video generation model, to solve the problems that a speaker video synthesized with the AD-NeRF model has an unnatural mouth and a blurred torso. The technical scheme is as follows:
In one aspect, a speaker video generation model training method is provided, the method comprising:
acquiring a plurality of groups of training samples, wherein each group of training samples comprises video data, audio data, and a real head image and a real torso image of a speaker;
improving the AD-NeRF model to obtain a generation model, wherein a head neural radiance field in the generation model comprises a first Transformer module and a first discriminator, and a torso neural radiance field in the generation model comprises a second Transformer module and a second discriminator;
for each group of training samples, processing the video data and the audio data with the generation model to obtain a head semantic code and a torso semantic code;
rendering the head semantic code based on the first Transformer module to obtain a head rendering result, and computing a head loss between the head rendering result and the real head image with the first discriminator;
rendering the torso semantic code based on the second Transformer module to obtain a torso rendering result, and computing a torso loss between the torso rendering result and the real torso image with the second discriminator;
and training the generation model with the head loss and the torso loss.
In a possible implementation, the generation model further includes a video processing module, a wav2vec 2.0 module, and an implicit function, and the processing of the video data and the audio data with the generation model to obtain a head semantic code and a torso semantic code includes:
processing the video data with the video processing module to obtain a video parsing map and pose parameters;
processing the audio data with the wav2vec 2.0 module to obtain wav2vec 2.0 features;
and processing the video parsing map, the pose parameters, and the wav2vec 2.0 features with the implicit function to obtain the head semantic code and the torso semantic code.
In a possible implementation, the rendering of the head semantic code based on the first Transformer module to obtain a head rendering result includes:
performing volume rendering on the head semantic code to obtain a head low-dimensional feature map;
rendering the head low-dimensional feature map with a head-based two-dimensional neural rendering module to obtain a first intermediate result;
and processing the first intermediate result with the first Transformer module to obtain the head rendering result.
In a possible implementation, the rendering of the torso semantic code based on the second Transformer module to obtain a torso rendering result includes:
performing volume rendering on the torso semantic code to obtain a torso low-dimensional feature map;
rendering the torso low-dimensional feature map with a torso-based two-dimensional neural rendering module to obtain a second intermediate result;
and processing the second intermediate result with the second Transformer module to obtain the torso rendering result.
In a possible implementation, the first discriminator and the second discriminator are GAN discriminators.
In one aspect, a speaker video generation model using method is provided, the method comprising:
acquiring video data and audio data containing a speaker;
processing the video data and the audio data with a trained generation model to obtain a head semantic code and a torso semantic code, wherein a head neural radiance field in the generation model comprises a first Transformer module and a first discriminator, and a torso neural radiance field in the generation model comprises a second Transformer module and a second discriminator;
rendering the head semantic code based on the first Transformer module to obtain a head rendering result;
rendering the torso semantic code based on the second Transformer module to obtain a torso rendering result;
and rendering the head rendering result and the torso rendering result to obtain a speaker video.
In one aspect, a speaker video generation model training apparatus is provided, the apparatus comprising:
an acquisition module, configured to acquire a plurality of groups of training samples, wherein each group of training samples comprises video data, audio data, and a real head image and a real torso image of a speaker;
a generation module, configured to improve the AD-NeRF model to obtain a generation model, wherein a head neural radiance field in the generation model comprises a first Transformer module and a first discriminator, and a torso neural radiance field in the generation model comprises a second Transformer module and a second discriminator;
a processing module, configured to process, for each group of training samples, the video data and the audio data with the generation model to obtain a head semantic code and a torso semantic code;
the processing module is further configured to render the head semantic code based on the first Transformer module to obtain a head rendering result, and to compute a head loss between the head rendering result and the real head image with the first discriminator;
the processing module is further configured to render the torso semantic code based on the second Transformer module to obtain a torso rendering result, and to compute a torso loss between the torso rendering result and the real torso image with the second discriminator;
and a training module, configured to train the generation model with the head loss and the torso loss.
In one aspect, a speaker video generation model using apparatus is provided, the apparatus comprising:
an acquisition module, configured to acquire video data and audio data containing a speaker;
a processing module, configured to process the video data and the audio data with a trained generation model to obtain a head semantic code and a torso semantic code, wherein a head neural radiance field in the generation model comprises a first Transformer module and a first discriminator, and a torso neural radiance field in the generation model comprises a second Transformer module and a second discriminator;
a rendering module, configured to render the head semantic code based on the first Transformer module to obtain a head rendering result;
the rendering module is further configured to render the torso semantic code based on the second Transformer module to obtain a torso rendering result;
the rendering module is further configured to render the head rendering result and the torso rendering result to obtain a speaker video.
In one aspect, a computer-readable storage medium is provided, in which at least one instruction is stored, the instruction being loaded and executed by a processor to implement the speaker video generation model training method described above, or to implement the speaker video generation model using method described above.
In one aspect, a computer device is provided, the computer device comprising a processor and a memory, the memory storing at least one instruction, the instruction being loaded and executed by the processor to implement the speaker video generation model training method described above, or to implement the speaker video generation model using method described above.
The technical scheme provided by the application has at least the following beneficial effects:
A wav2vec 2.0 module is introduced on the basis of the AD-NeRF model, and the features of the speech data are extracted by the wav2vec 2.0 module, so that the extracted wav2vec 2.0 features are optimized. In addition, the architectures of the head neural radiance field and the torso neural radiance field are modified on the basis of the AD-NeRF model: a Transformer module is added, and the loss computation strategy is optimized through a discriminator, which further improves the representation and image-generation capability and effectively alleviates the problem of torso blurring.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings required for the description of the embodiments are briefly described below. It is apparent that the drawings in the following description show only some embodiments of the present application, and a person skilled in the art may obtain other drawings from them without inventive effort.
FIG. 1 is a flowchart of a speaker video generation model training method according to one embodiment of the present application;
FIG. 2 is a training flow diagram of the head neural radiance field according to one embodiment of the present application;
FIG. 3 is a training flow diagram of the torso neural radiance field according to one embodiment of the present application;
FIG. 4 is a flowchart of a speaker video generation model using method according to another embodiment of the present application;
FIG. 5 is a schematic diagram of the usage flow of a speaker video generation model according to an embodiment of the present application;
FIG. 6 is a block diagram of a speaker video generation model training apparatus according to one embodiment of the present application;
Fig. 7 is a block diagram of a speaker video generation model using apparatus according to an embodiment of the present application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present application more apparent, the following detailed description of the embodiments of the present application will be given with reference to the accompanying drawings.
Referring to fig. 1, a flowchart of a speaker video generation model training method according to an embodiment of the present application is shown; the method may be applied to a computer device. The speaker video generation model training method may include the following steps:
Step 101, a plurality of groups of training samples are acquired, where each group of training samples includes video data, audio data, and a real head image and a real torso image of a speaker.
The video data in this embodiment may be videos of a speaker talking, where the speaker may be a real person or a virtual digital human. For example, the computer device may segment a training video into fixed-duration clips to obtain the individual pieces of video data. The fixed duration may be set as required, for example, to a value in the range of 3-5 minutes.
The audio data may be extracted from the video data of the same group of training samples, in which case the training samples are positive samples; or the audio data may be uncorrelated with the video data of the same group, in which case the training samples are negative samples.
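A minimal sketch, under the assumption of a file-based dataset layout, of how the fixed-duration clips and positive/negative audio pairings described above could be assembled. Every name here is hypothetical; the patent does not specify an implementation.

```python
# Hypothetical sample-assembly sketch; the patent only describes the data, not the code.
import random
from dataclasses import dataclass

@dataclass
class TrainingSample:
    video_clip: str        # path to a 3-5 minute speaker video segment
    audio_clip: str        # path to an audio segment
    real_head_image: str   # path to the speaker's real head image
    real_torso_image: str  # path to the speaker's real torso image
    positive: bool         # True: audio extracted from this clip; False: uncorrelated audio

def build_samples(clips, unrelated_audio_pool):
    samples = []
    for clip in clips:
        head, torso = clip + "_head.png", clip + "_torso.png"
        # Positive sample: audio track extracted from the same clip.
        samples.append(TrainingSample(clip, clip + ".wav", head, torso, True))
        # Negative sample: audio uncorrelated with the clip.
        samples.append(TrainingSample(clip, random.choice(unrelated_audio_pool),
                                      head, torso, False))
    return samples
```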
Step 102, the AD-NeRF model is improved to obtain a generation model, where a head neural radiance field in the generation model includes a first Transformer module and a first discriminator, and a torso neural radiance field in the generation model includes a second Transformer module and a second discriminator.
In this embodiment, the improvement of the AD-NeRF model comprises two parts. The first part introduces a wav2vec 2.0 module on the basis of the AD-NeRF model, and the wav2vec 2.0 module is used to extract the features of the speech data. The second part modifies the architectures of the head neural radiance field and the torso neural radiance field on the basis of the AD-NeRF model, adding a Transformer module and, at the same time, optimizing the loss computation strategy through a discriminator.
For ease of distinction, in this embodiment the Transformer module in the head neural radiance field is referred to as the first Transformer module, and the Transformer module in the torso neural radiance field is referred to as the second Transformer module; likewise, the discriminator in the head neural radiance field is referred to as the first discriminator, and the discriminator in the torso neural radiance field is referred to as the second discriminator. The first discriminator and the second discriminator are GAN (Generative Adversarial Network) discriminators.
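A minimal sketch of a standard GAN discriminator loss of the kind the first and second discriminators could compute between a rendered image and the real head/torso image. The binary cross-entropy formulation below is an assumption; the patent does not state the exact loss.

```python
# Standard (assumed) GAN losses; `disc` is any torch module mapping images to logits.
import torch
import torch.nn.functional as F

def discriminator_loss(disc, real_img, fake_img):
    """Trains the discriminator to tell real images from rendered ones."""
    real_logits = disc(real_img)
    fake_logits = disc(fake_img.detach())  # do not backprop into the generator
    return (F.binary_cross_entropy_with_logits(real_logits, torch.ones_like(real_logits))
            + F.binary_cross_entropy_with_logits(fake_logits, torch.zeros_like(fake_logits)))

def generator_adversarial_loss(disc, fake_img):
    """The head/torso loss fed back to the generation model."""
    fake_logits = disc(fake_img)
    return F.binary_cross_entropy_with_logits(fake_logits, torch.ones_like(fake_logits))
```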
Step 103, for each group of training samples, the video data and the audio data are processed with the generation model to obtain a head semantic code and a torso semantic code.
Specifically, the generation model further includes a video processing module, a wav2vec 2.0 module, and an implicit function, and processing the video data and the audio data with the generation model to obtain a head semantic code and a torso semantic code may include: processing the video data with the video processing module to obtain a video parsing map and pose parameters; processing the audio data with the wav2vec 2.0 module to obtain wav2vec 2.0 features; and processing the video parsing map, the pose parameters, and the wav2vec 2.0 features with the implicit function to obtain the head semantic code and the torso semantic code. Extracting the features of the speech data with the wav2vec 2.0 module optimizes the extracted wav2vec 2.0 features.
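A sketch of extracting wav2vec 2.0 features with the Hugging Face implementation. The patent only names the module; the checkpoint and the 16 kHz sampling rate below are assumptions.

```python
# Assumed checkpoint and preprocessing; the patent does not specify either.
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base-960h")
wav2vec = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h").eval()

def wav2vec_features(waveform, sampling_rate=16000):
    """waveform: 1-D float array of raw audio samples at `sampling_rate`."""
    inputs = extractor(waveform, sampling_rate=sampling_rate, return_tensors="pt")
    with torch.no_grad():
        # One hidden-state vector per ~20 ms audio frame; these frame-level
        # features condition the implicit function.
        return wav2vec(inputs.input_values).last_hidden_state
```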
Step 104, the head semantic code is rendered based on the first Transformer module to obtain a head rendering result, and a head loss is computed between the head rendering result and the real head image with the first discriminator.
Specifically, rendering the head semantic code based on the first Transformer module to obtain a head rendering result may include: performing volume rendering on the head semantic code to obtain a head low-dimensional feature map; rendering the head low-dimensional feature map with a head-based two-dimensional neural rendering module to obtain a first intermediate result; and processing the first intermediate result with the first Transformer module to obtain the head rendering result.
FIG. 2 illustrates the training flow of the head neural radiance field. After the head semantic code is obtained, it can be volume-rendered with the head-based implicit-function-mediated volume rendering to obtain the head low-dimensional feature map; the head low-dimensional feature map is then rendered with the head-based two-dimensional neural rendering module to obtain the first intermediate result; the first intermediate result is processed with the first Transformer module to obtain the head rendering result; and finally the head rendering result is compared with the real head image by the first discriminator to obtain the head loss.
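A structural sketch of the head branch of FIG. 2, using hypothetical module objects: volume rendering of the head semantic code, then the head-based 2D neural rendering, then the first Transformer module. This is not the patented implementation; the torso branch of FIG. 3 follows the same pattern with its own modules.

```python
# Hypothetical composition of the three head-branch stages named in the text.
import torch
import torch.nn as nn

class HeadBranch(nn.Module):
    def __init__(self, volume_renderer: nn.Module,
                 neural_renderer_2d: nn.Module,
                 transformer: nn.Module):
        super().__init__()
        self.volume_renderer = volume_renderer        # semantic code -> low-dim feature map
        self.neural_renderer_2d = neural_renderer_2d  # feature map -> first intermediate result
        self.transformer = transformer                # first Transformer module

    def forward(self, head_semantic_code: torch.Tensor) -> torch.Tensor:
        feature_map = self.volume_renderer(head_semantic_code)
        intermediate = self.neural_renderer_2d(feature_map)
        return self.transformer(intermediate)         # head rendering result
```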
Step 105, the torso semantic code is rendered based on the second Transformer module to obtain a torso rendering result, and a torso loss is computed between the torso rendering result and the real torso image with the second discriminator.
Specifically, rendering the torso semantic code based on the second Transformer module to obtain a torso rendering result may include: performing volume rendering on the torso semantic code to obtain a torso low-dimensional feature map; rendering the torso low-dimensional feature map with a torso-based two-dimensional neural rendering module to obtain a second intermediate result; and processing the second intermediate result with the second Transformer module to obtain the torso rendering result.
FIG. 3 illustrates the training flow of the torso neural radiance field. After the torso semantic code is obtained, it can be volume-rendered with the torso-based implicit-function-mediated volume rendering to obtain the torso low-dimensional feature map; the torso low-dimensional feature map is then rendered with the torso-based two-dimensional neural rendering module to obtain the second intermediate result; the second intermediate result is processed with the second Transformer module to obtain the torso rendering result; and finally the torso rendering result is compared with the real torso image by the second discriminator to obtain the torso loss.
Step 106, the generation model is trained with the head loss and the torso loss.
After the head loss and the torso loss are obtained, they can be used as feedback to adjust the model parameters of the generation model, and the training samples are processed again with the adjusted parameters; training stops when the model parameters of the generation model meet a preset condition, and the generation model with those parameters is determined as the final trained generation model.
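A hedged sketch of the outer training loop of steps 101-106, reusing `generator_adversarial_loss` from the discriminator sketch above and the `TrainingSample` fields from the earlier dataset sketch. The optimizer, learning rate, equal loss weighting, and the model's method names are all assumptions.

```python
# Hypothetical loop: compute head/torso losses and feed them back as updates.
import torch

def train_generation_model(model, head_disc, torso_disc, samples,
                           epochs=10, lr=1e-4):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for s in samples:
            head_code, torso_code = model.encode(s.video_clip, s.audio_clip)
            head_render = model.render_head(head_code)
            torso_render = model.render_torso(torso_code)
            # Head loss and torso loss computed by the two discriminators,
            # summed with equal weighting (an assumption).
            loss = (generator_adversarial_loss(head_disc, head_render)
                    + generator_adversarial_loss(torso_disc, torso_render))
            opt.zero_grad()
            loss.backward()
            opt.step()
```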
In summary, in the speaker video generation model training method provided by the embodiment of the application, a wav2vec 2.0 module is introduced on the basis of the AD-NeRF model, and the features of the speech data are extracted by the wav2vec 2.0 module, so that the extracted wav2vec 2.0 features are optimized; the architectures of the head neural radiance field and the torso neural radiance field are modified on the basis of the AD-NeRF model, a Transformer module is added, and the loss computation strategy is optimized through a discriminator, which further improves the representation and image-generation capability and effectively alleviates the problem of torso blurring.
Referring to fig. 4, a flowchart of a speaker video generation model using method according to an embodiment of the present application is shown; the method may be applied to a computer device. The speaker video generation model using method may include the following steps:
Step 401, video data and audio data containing a speaker are acquired.
The video data in this embodiment may be a video of a speaker talking, where the speaker may be a real person or a virtual digital human; the audio data is unrelated to the video data.
Step 402, the video data and the audio data are processed with the trained generation model to obtain a head semantic code and a torso semantic code, where a head neural radiance field in the generation model includes a first Transformer module and a first discriminator, and a torso neural radiance field in the generation model includes a second Transformer module and a second discriminator.
Specifically, the generation model includes a video processing module, a wav2vec 2.0 module, and an implicit function, and processing the video data and the audio data with the trained generation model to obtain a head semantic code and a torso semantic code may include: processing the video data with the video processing module to obtain a video parsing map and pose parameters; processing the audio data with the wav2vec 2.0 module to obtain wav2vec 2.0 features; and processing the video parsing map, the pose parameters, and the wav2vec 2.0 features with the implicit function to obtain the head semantic code and the torso semantic code. Extracting the features of the speech data with the wav2vec 2.0 module optimizes the extracted wav2vec 2.0 features.
Step 403, the head semantic code is rendered based on the first Transformer module to obtain a head rendering result.
Specifically, rendering the head semantic code based on the first Transformer module to obtain a head rendering result may include: performing volume rendering on the head semantic code to obtain a head low-dimensional feature map; rendering the head low-dimensional feature map with a head-based two-dimensional neural rendering module to obtain a first intermediate result; and processing the first intermediate result with the first Transformer module to obtain the head rendering result.
Step 404, the torso semantic code is rendered based on the second Transformer module to obtain a torso rendering result.
Specifically, rendering the torso semantic code based on the second Transformer module to obtain a torso rendering result may include: performing volume rendering on the torso semantic code to obtain a torso low-dimensional feature map; rendering the torso low-dimensional feature map with a torso-based two-dimensional neural rendering module to obtain a second intermediate result; and processing the second intermediate result with the second Transformer module to obtain the torso rendering result.
Step 405, the head rendering result and the torso rendering result are rendered to obtain a speaker video.
FIG. 5 shows the synthesis flow of a speaker video: first, a video parsing map and pose parameters are extracted from the video data, and wav2vec 2.0 features are extracted from the audio data; the implicit function then generates a head semantic code and a torso semantic code from the video parsing map, the pose parameters, and the wav2vec 2.0 features; the head neural radiance field generates a head rendering result from the head semantic code, the torso neural radiance field generates a torso rendering result from the torso semantic code, and the head rendering result and the torso rendering result are rendered to obtain the speaker video.
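An end-to-end inference sketch following FIG. 5; every method name below is hypothetical shorthand for a stage named in the text above, not an API defined by the patent.

```python
# Hypothetical inference pipeline mirroring the FIG. 5 stages.
def generate_speaker_video(model, video_data, audio_data):
    parsing_map, pose_params = model.process_video(video_data)  # video processing module
    audio_feats = model.wav2vec(audio_data)                     # wav2vec 2.0 features
    head_code, torso_code = model.implicit_fn(parsing_map, pose_params, audio_feats)
    head_result = model.render_head(head_code)     # head neural radiance field branch
    torso_result = model.render_torso(torso_code)  # torso neural radiance field branch
    return model.compose(head_result, torso_result)  # final speaker video frames
```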
In summary, in the speaker video generation model using method provided by the embodiment of the application, a wav2vec 2.0 module is introduced on the basis of the AD-NeRF model, and the features of the speech data are extracted by the wav2vec 2.0 module, so that the extracted wav2vec 2.0 features are optimized; the architectures of the head neural radiance field and the torso neural radiance field are modified on the basis of the AD-NeRF model, a Transformer module is added, and the loss computation strategy is optimized through a discriminator, which further improves the representation and image-generation capability and effectively alleviates the problem of torso blurring.
Referring to fig. 6, a block diagram of a speaker video generation model training apparatus according to an embodiment of the present application is shown; the apparatus may be applied to a computer device. The speaker video generation model training apparatus may include:
an acquisition module 610, configured to acquire a plurality of groups of training samples, where each group of training samples includes video data, audio data, and a real head image and a real torso image of a speaker;
a generation module 620, configured to improve the AD-NeRF model to obtain a generation model, where a head neural radiance field in the generation model includes a first Transformer module and a first discriminator, and a torso neural radiance field in the generation model includes a second Transformer module and a second discriminator;
a processing module 630, configured to process, for each group of training samples, the video data and the audio data with the generation model to obtain a head semantic code and a torso semantic code;
the processing module 630 is further configured to render the head semantic code based on the first Transformer module to obtain a head rendering result, and to compute a head loss between the head rendering result and the real head image with the first discriminator;
the processing module 630 is further configured to render the torso semantic code based on the second Transformer module to obtain a torso rendering result, and to compute a torso loss between the torso rendering result and the real torso image with the second discriminator;
a training module 640, configured to train the generation model with the head loss and the torso loss.
In an alternative embodiment, the generation model further includes a video processing module, a wav2vec 2.0 module, and an implicit function, and the processing module 630 is further configured to:
process the video data with the video processing module to obtain a video parsing map and pose parameters;
process the audio data with the wav2vec 2.0 module to obtain wav2vec 2.0 features;
and process the video parsing map, the pose parameters, and the wav2vec 2.0 features with the implicit function to obtain the head semantic code and the torso semantic code.
In an alternative embodiment, the processing module 630 is further configured to:
perform volume rendering on the head semantic code to obtain a head low-dimensional feature map;
render the head low-dimensional feature map with a head-based two-dimensional neural rendering module to obtain a first intermediate result;
and process the first intermediate result with the first Transformer module to obtain the head rendering result.
In an alternative embodiment, the processing module 630 is further configured to:
perform volume rendering on the torso semantic code to obtain a torso low-dimensional feature map;
render the torso low-dimensional feature map with a torso-based two-dimensional neural rendering module to obtain a second intermediate result;
and process the second intermediate result with the second Transformer module to obtain the torso rendering result.
In an alternative embodiment, the first discriminator and the second discriminator are GAN discriminators.
In summary, the speaker video generation model training apparatus provided by the embodiment of the application introduces a wav2vec 2.0 module on the basis of the AD-NeRF model and extracts the features of the speech data with the wav2vec 2.0 module, so that the extracted wav2vec 2.0 features are optimized; the architectures of the head neural radiance field and the torso neural radiance field are modified on the basis of the AD-NeRF model, a Transformer module is added, and the loss computation strategy is optimized through a discriminator, which further improves the representation and image-generation capability and effectively alleviates the problem of torso blurring.
Referring to fig. 7, a block diagram of a speaker video generation model using apparatus according to an embodiment of the present application is shown; the apparatus may be applied to a computer device. The speaker video generation model using apparatus may include:
an acquisition module 710, configured to acquire video data and audio data containing a speaker;
a processing module 720, configured to process the video data and the audio data with a trained generation model to obtain a head semantic code and a torso semantic code, where a head neural radiance field in the generation model includes a first Transformer module and a first discriminator, and a torso neural radiance field in the generation model includes a second Transformer module and a second discriminator;
a rendering module 730, configured to render the head semantic code based on the first Transformer module to obtain a head rendering result;
the rendering module 730 is further configured to render the torso semantic code based on the second Transformer module to obtain a torso rendering result;
the rendering module 730 is further configured to render the head rendering result and the torso rendering result to obtain a speaker video.
In summary, the speaker video generation model using apparatus provided by the embodiment of the application introduces a wav2vec 2.0 module on the basis of the AD-NeRF model and extracts the features of the speech data with the wav2vec 2.0 module, so that the extracted wav2vec 2.0 features are optimized; the architectures of the head neural radiance field and the torso neural radiance field are modified on the basis of the AD-NeRF model, a Transformer module is added, and the loss computation strategy is optimized through a discriminator, which further improves the representation and image-generation capability and effectively alleviates the problem of torso blurring.
One embodiment of the present application provides a computer-readable storage medium in which at least one instruction is stored, the instruction being loaded and executed by a processor to implement the speaker video generation model training method or the speaker video generation model using method described above.
One embodiment of the present application provides a computer device comprising a processor and a memory, the memory storing at least one instruction, the instruction being loaded and executed by the processor to implement the speaker video generation model training method or the speaker video generation model using method described above.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program for instructing relevant hardware, where the program may be stored in a computer readable storage medium, and the storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
The above description should not be taken as limiting the embodiments of the application, but rather should be construed to cover all modifications, equivalents, improvements, etc. that may fall within the spirit and principles of the embodiments of the application.

Claims (9)

1. A speaker video generation model training method, the method comprising:
acquiring a plurality of groups of training samples, wherein each group of training samples comprises video data, audio data, and a real head image and a real torso image of a speaker;
improving the AD-NeRF model to obtain a generation model, wherein a head neural radiance field in the generation model comprises a first Transformer module and a first discriminator, and a torso neural radiance field in the generation model comprises a second Transformer module and a second discriminator;
for each group of training samples, processing the video data and the audio data with the generation model to obtain a head semantic code and a torso semantic code;
rendering the head semantic code based on the first Transformer module to obtain a head rendering result, and computing a head loss between the head rendering result and the real head image with the first discriminator;
rendering the torso semantic code based on the second Transformer module to obtain a torso rendering result, and computing a torso loss between the torso rendering result and the real torso image with the second discriminator;
and training the generation model with the head loss and the torso loss;
wherein the generation model further comprises a video processing module, a wav2vec 2.0 module, and an implicit function, and the processing of the video data and the audio data with the generation model to obtain a head semantic code and a torso semantic code comprises: processing the video data with the video processing module to obtain a video parsing map and pose parameters; processing the audio data with the wav2vec 2.0 module to obtain wav2vec 2.0 features; and processing the video parsing map, the pose parameters, and the wav2vec 2.0 features with the implicit function to obtain the head semantic code and the torso semantic code.
2. The speaker video generation model training method according to claim 1, wherein the rendering of the head semantic code based on the first Transformer module to obtain a head rendering result comprises:
performing volume rendering on the head semantic code to obtain a head low-dimensional feature map;
rendering the head low-dimensional feature map with a head-based two-dimensional neural rendering module to obtain a first intermediate result;
and processing the first intermediate result with the first Transformer module to obtain the head rendering result.
3. The speaker video generation model training method according to claim 1, wherein the rendering of the torso semantic code based on the second Transformer module to obtain a torso rendering result comprises:
performing volume rendering on the torso semantic code to obtain a torso low-dimensional feature map;
rendering the torso low-dimensional feature map with a torso-based two-dimensional neural rendering module to obtain a second intermediate result;
and processing the second intermediate result with the second Transformer module to obtain the torso rendering result.
4. The speaker video generation model training method according to any one of claims 1 to 3, wherein the first discriminator and the second discriminator are GAN discriminators.
5. A speaker video generation model using method, the method comprising:
acquiring video data and audio data containing a speaker;
processing the video data and the audio data with a trained generation model to obtain a head semantic code and a torso semantic code, wherein a head neural radiance field in the generation model comprises a first Transformer module and a first discriminator, and a torso neural radiance field in the generation model comprises a second Transformer module and a second discriminator;
rendering the head semantic code based on the first Transformer module to obtain a head rendering result;
rendering the torso semantic code based on the second Transformer module to obtain a torso rendering result;
and rendering the head rendering result and the torso rendering result to obtain a speaker video;
wherein the generation model further comprises a video processing module, a wav2vec 2.0 module, and an implicit function, and the processing of the video data and the audio data with the trained generation model to obtain a head semantic code and a torso semantic code comprises: processing the video data with the video processing module to obtain a video parsing map and pose parameters; processing the audio data with the wav2vec 2.0 module to obtain wav2vec 2.0 features; and processing the video parsing map, the pose parameters, and the wav2vec 2.0 features with the implicit function to obtain the head semantic code and the torso semantic code.
6. A speaker video generation model training apparatus, the apparatus comprising:
an acquisition module, configured to acquire a plurality of groups of training samples, wherein each group of training samples comprises video data, audio data, and a real head image and a real torso image of a speaker;
a generation module, configured to improve the AD-NeRF model to obtain a generation model, wherein a head neural radiance field in the generation model comprises a first Transformer module and a first discriminator, and a torso neural radiance field in the generation model comprises a second Transformer module and a second discriminator;
a processing module, configured to process, for each group of training samples, the video data and the audio data with the generation model to obtain a head semantic code and a torso semantic code;
the processing module is further configured to render the head semantic code based on the first Transformer module to obtain a head rendering result, and to compute a head loss between the head rendering result and the real head image with the first discriminator;
the processing module is further configured to render the torso semantic code based on the second Transformer module to obtain a torso rendering result, and to compute a torso loss between the torso rendering result and the real torso image with the second discriminator;
and a training module, configured to train the generation model with the head loss and the torso loss;
wherein the generation model further comprises a video processing module, a wav2vec 2.0 module, and an implicit function, and the processing module is further configured to: process the video data with the video processing module to obtain a video parsing map and pose parameters; process the audio data with the wav2vec 2.0 module to obtain wav2vec 2.0 features; and process the video parsing map, the pose parameters, and the wav2vec 2.0 features with the implicit function to obtain the head semantic code and the torso semantic code.
7. A speaker video generation model using apparatus, the apparatus comprising:
an acquisition module, configured to acquire video data and audio data containing a speaker;
a processing module, configured to process the video data and the audio data with a trained generation model to obtain a head semantic code and a torso semantic code, wherein a head neural radiance field in the generation model comprises a first Transformer module and a first discriminator, and a torso neural radiance field in the generation model comprises a second Transformer module and a second discriminator;
a rendering module, configured to render the head semantic code based on the first Transformer module to obtain a head rendering result;
the rendering module is further configured to render the torso semantic code based on the second Transformer module to obtain a torso rendering result;
the rendering module is further configured to render the head rendering result and the torso rendering result to obtain a speaker video;
wherein the generation model further comprises a video processing module, a wav2vec 2.0 module, and an implicit function, and the processing module is further configured to: process the video data with the video processing module to obtain a video parsing map and pose parameters; process the audio data with the wav2vec 2.0 module to obtain wav2vec 2.0 features; and process the video parsing map, the pose parameters, and the wav2vec 2.0 features with the implicit function to obtain the head semantic code and the torso semantic code.
8. A computer-readable storage medium, in which at least one instruction is stored, the instruction being loaded and executed by a processor to implement the speaker video generation model training method according to any one of claims 1 to 4, or to implement the speaker video generation model using method according to claim 5.
9. A computer device, comprising a processor and a memory, the memory storing at least one instruction, the instruction being loaded and executed by the processor to implement the speaker video generation model training method according to any one of claims 1 to 4, or to implement the speaker video generation model using method according to claim 5.
GR01 Patent grant