CN115908662A - Method, device and equipment for training and using generation model of speaker video - Google Patents

Method, device and equipment for training and using generation model of speaker video

Info

Publication number
CN115908662A
Authority
CN
China
Prior art keywords
head
rendering
module
torso
trunk
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211631657.9A
Other languages
Chinese (zh)
Inventor
严妍
汪敏
杨春宇
白杨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Kaipuyun Information Technology Co ltd
Cape Cloud Information Technology Co ltd
Original Assignee
Beijing Kaipuyun Information Technology Co ltd
Cape Cloud Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Kaipuyun Information Technology Co ltd, Cape Cloud Information Technology Co ltd filed Critical Beijing Kaipuyun Information Technology Co ltd
Priority to CN202211631657.9A
Publication of CN115908662A


Abstract

The application discloses a method, a device and equipment for training and using a generative model of a speaker video, and belongs to the technical field of machine learning. The method comprises the following steps: improving the AD-NeRF model to obtain a generative model; processing video data and audio data in the training samples by using the generative model to obtain a head semantic code and a torso semantic code; rendering the head semantic code based on a first Transformer module in a head neural radiance field, and calculating a head loss on the head rendering result and the real head image by using a first discriminator in the head neural radiance field; rendering the torso semantic code based on a second Transformer module in a torso neural radiance field, and calculating a torso loss on the torso rendering result and the real torso image by using a second discriminator in the torso neural radiance field; and training the generative model using the head loss and the torso loss. The method and device improve the characterization and image generation capability and alleviate the torso blurring problem.

Description

Method, device and equipment for training and using generation model of speaker video
Technical Field
The application relates to the technical field of machine learning, in particular to a method, a device and equipment for training and using a speaker video generation model.
Background
Synthesizing high-fidelity, audio-driven facial video sequences is an important and challenging problem for many applications, such as digital humans, chat robots, and virtual video conferencing. Treating speaker video generation as a cross-modal mapping from audio to a visual face, the synthesized face image is expected to look as realistic as a frame of the original video while exhibiting a natural speaking style.
In recent years, the AD-NeRF (Audio-Driven Neural Radiance Fields) model has been proposed for generating speaker video. Building on NeRF (Neural Radiance Fields), AD-NeRF generates a speaker video directly from a speech signal: features of the audio signal are fed into a conditional implicit function to produce a dynamic neural radiance field, and the speaker video corresponding to the audio signal is synthesized through volume rendering. AD-NeRF can synthesize not only the head (including hair) region but also the torso, using two separate neural radiance fields; however, in the synthesized speaker video the speaker's mouth looks unnatural and the torso appears blurred.
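For orientation, the following is a minimal sketch of how an audio-conditioned implicit function and volume rendering fit together in an AD-NeRF-style pipeline. The network sizes, the `audio_feat` conditioning, and all names are illustrative assumptions of this sketch, not the architecture of AD-NeRF or of the present application.

```python
import torch
import torch.nn as nn

class AudioConditionedNeRF(nn.Module):
    """Toy conditional implicit function: (3D point, audio feature) -> (density, color)."""
    def __init__(self, audio_dim=64, hidden=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3 + audio_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),  # density (sigma) + RGB
        )

    def forward(self, points, audio_feat):
        # points: (N, S, 3) samples along N rays; audio_feat: (audio_dim,)
        a = audio_feat.expand(*points.shape[:-1], -1)
        out = self.mlp(torch.cat([points, a], dim=-1))
        return out[..., :1].relu(), out[..., 1:].sigmoid()  # sigma, rgb

def volume_render(sigma, rgb, deltas):
    """Standard NeRF compositing: alpha_i = 1 - exp(-sigma_i * delta_i)."""
    alpha = 1.0 - torch.exp(-sigma.squeeze(-1) * deltas)            # (N, S)
    ones = torch.ones_like(alpha[:, :1])
    trans = torch.cumprod(torch.cat([ones, 1.0 - alpha + 1e-10], dim=1), dim=1)[:, :-1]
    weights = alpha * trans                                          # contribution of each sample
    return (weights.unsqueeze(-1) * rgb).sum(dim=1)                  # (N, 3) pixel colors
```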
Disclosure of Invention
The application provides a method, a device and equipment for training and using a generative model of a speaker video, which are used for solving the problems that, in speaker video synthesized with the AD-NeRF model, the speaker's mouth looks unnatural and the torso appears blurred. The technical scheme is as follows:
In one aspect, a method for training a generative model of a speaker video is provided, the method comprising:
acquiring a plurality of groups of training samples, wherein each group of training samples comprises video data and audio data containing a speaker, and a real head image and a real torso image of the speaker;
improving an AD-NeRF model to obtain a generative model, wherein a head neural radiance field in the generative model comprises a first Transformer module and a first discriminator, and a torso neural radiance field in the generative model comprises a second Transformer module and a second discriminator;
for each group of training samples, processing the video data and the audio data by using the generative model to obtain a head semantic code and a torso semantic code;
rendering the head semantic code based on the first Transformer module to obtain a head rendering result, and calculating a head loss on the head rendering result and the real head image by using the first discriminator;
rendering the torso semantic code based on the second Transformer module to obtain a torso rendering result, and calculating a torso loss on the torso rendering result and the real torso image by using the second discriminator;
and training the generative model with the head loss and the torso loss.
In a possible implementation manner, the generative model further includes a video processing module, a wav2vec 2.0 module, and an implicit function, and the processing the video data and the audio data by using the generative model to obtain a head semantic code and a torso semantic code includes:
processing the video data by using the video processing module to obtain a video parsing map and pose parameters;
processing the audio data by using the wav2vec 2.0 module to obtain wav2vec 2.0 features;
and processing the video parsing map, the pose parameters and the wav2vec 2.0 features by using the implicit function to obtain the head semantic code and the torso semantic code.
In a possible implementation manner, the rendering the head semantic code based on the first Transformer module to obtain a head rendering result includes:
performing volume rendering on the head semantic code to obtain a head low-dimensional feature map;
rendering the head low-dimensional feature map by using a head-based two-dimensional neural rendering module to obtain a first intermediate result;
and processing the first intermediate result by utilizing the first Transformer module to obtain a head rendering result.
In a possible implementation manner, the rendering the torso semantic code based on the second Transformer module to obtain a torso rendering result includes:
performing volume rendering on the torso semantic code to obtain a torso low-dimensional feature map;
rendering the torso low-dimensional feature map by using a torso-based two-dimensional neural rendering module to obtain a second intermediate result;
and processing the second intermediate result by utilizing the second Transformer module to obtain a torso rendering result.
In one possible implementation, the first and second discriminators are GAN discriminators.
In one aspect, a method for using a generative model of a speaker video is provided, the method comprising:
acquiring video data and audio data containing a speaker;
processing the video data and the audio data by using a trained generative model to obtain a head semantic code and a torso semantic code, wherein a head neural radiance field in the generative model comprises a first Transformer module and a first discriminator, and a torso neural radiance field in the generative model comprises a second Transformer module and a second discriminator;
rendering the head semantic code based on the first Transformer module to obtain a head rendering result;
rendering the torso semantic code based on the second Transformer module to obtain a torso rendering result;
and rendering the head rendering result and the torso rendering result to obtain the speaker video.
In one aspect, an apparatus for training a generative model of a speaker video is provided, the apparatus comprising:
the acquisition module is used for acquiring a plurality of groups of training samples, wherein each group of training samples comprises video data and audio data containing a speaker, and a real head image and a real torso image of the speaker;
the generation module is used for improving the AD-NeRF model to obtain a generative model, wherein a head neural radiance field in the generative model comprises a first Transformer module and a first discriminator, and a torso neural radiance field in the generative model comprises a second Transformer module and a second discriminator;
the processing module is used for processing the video data and the audio data by using the generative model for each group of training samples to obtain a head semantic code and a torso semantic code;
the processing module is further configured to render the head semantic code based on the first Transformer module to obtain a head rendering result, and calculate a head loss for the head rendering result and the real head image by using the first discriminator;
the processing module is further configured to render the torso semantic code based on the second Transformer module to obtain a torso rendering result, and calculate a torso loss for the torso rendering result and the real torso image by using the second discriminator;
a training module to train the generative model using the head loss and the torso loss.
In one aspect, an apparatus for using a generative model of a speaker video is provided, the apparatus comprising:
the acquisition module is used for acquiring video data and audio data containing a speaker;
the processing module is used for processing the video data and the audio data by using a trained generative model to obtain a head semantic code and a torso semantic code, wherein a head neural radiance field in the generative model comprises a first Transformer module and a first discriminator, and a torso neural radiance field in the generative model comprises a second Transformer module and a second discriminator;
the rendering module is used for rendering the head semantic code based on the first Transformer module to obtain a head rendering result;
the rendering module is further configured to render the torso semantic code based on the second Transformer module to obtain a torso rendering result;
the rendering module is further used for rendering the head rendering result and the torso rendering result to obtain the speaker video.
In one aspect, a computer-readable storage medium is provided, in which at least one instruction is stored, and the at least one instruction is loaded and executed by a processor to implement the method for training a generative model of a speaker video as described above, or to implement the method for using a generative model of a speaker video as described above.
In one aspect, a computer device is provided, which includes a processor and a memory, wherein the memory stores at least one instruction, and the instruction is loaded and executed by the processor to implement the method for training a generative model of a speaker video as described above; alternatively, the instruction is loaded and executed by the processor to implement the method for using a generative model of a speaker video as described above.
The technical scheme provided by the application has the beneficial effects that:
A wav2vec 2.0 module is introduced on the basis of the AD-NeRF model, and features of the speech data are extracted through the wav2vec 2.0 module, yielding better speech features; the frameworks of the head neural radiance field and the torso neural radiance field are modified on the basis of the AD-NeRF model by adding Transformer modules, and the loss calculation strategy is optimized through discriminators, which further improves the characterization and image generation capability and effectively alleviates the torso blurring problem.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed in the description of the embodiments are briefly introduced below; the drawings described below are only some embodiments of the present application, and those skilled in the art can obtain other drawings based on them without creative effort.
FIG. 1 is a flowchart of a method for training a generative model of a speaker video according to an embodiment of the present application;
FIG. 2 is a flowchart of the training process of the head neural radiance field provided by an embodiment of the present application;
FIG. 3 is a flowchart of the training process of the torso neural radiance field provided by an embodiment of the present application;
FIG. 4 is a flowchart of a method for using a generative model of speaker video according to another embodiment of the present application;
FIG. 5 is a schematic diagram illustrating a process for using a generative model of a speaker video according to an embodiment of the present application;
FIG. 6 is a block diagram of an apparatus for training a generative model of a speaker video according to an embodiment of the present application;
FIG. 7 is a block diagram illustrating an apparatus for using a generative model of a speaker video according to an embodiment of the present application.
Detailed Description
To make the objects, technical solutions and advantages of the embodiments of the present application clearer, the embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
Referring to FIG. 1, a flowchart of a method for training a generative model of a speaker video according to an embodiment of the present application is shown; the method can be applied to a computer device and can comprise the following steps:
step 101, a plurality of sets of training samples are obtained, wherein each set of training samples comprises video data and audio data of a speaker, and a real head image and a real body image of the speaker.
The video data in this embodiment may be videos of a speaker speaking, where the speaker may be a real person or a virtual digital human. For example, the computer device may segment a training video into clips of a fixed duration to obtain the video data; the fixed duration can be set as required, for example, to a value within 3-5 minutes, as in the sketch below.
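As an illustration, such fixed-duration segmentation could be performed with ffmpeg's segment muxer; the file names and the 240-second clip length below are assumptions of this sketch, not values prescribed by the application.

```python
import subprocess

def segment_video(src="speaker.mp4", clip_seconds=240, pattern="clip_%03d.mp4"):
    """Split a training video into fixed-duration clips (240 s is within 3-5 minutes)."""
    subprocess.run([
        "ffmpeg", "-i", src,
        "-c", "copy",                        # stream copy: no re-encoding
        "-f", "segment",                     # ffmpeg's segment muxer
        "-segment_time", str(clip_seconds),
        "-reset_timestamps", "1",
        pattern,
    ], check=True)
```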
The audio data may be extracted from the video data of the same group of training samples, in which case the training samples are positive samples; alternatively, the audio data may be uncorrelated with the video data of the same group, in which case the training samples are negative samples.
Step 102, the AD-NeRF model is improved to obtain a generative model, wherein a head neural radiance field in the generative model comprises a first Transformer module and a first discriminator, and a torso neural radiance field in the generative model comprises a second Transformer module and a second discriminator.
In this embodiment, the modification of the AD-NeRF model comprises two parts: the first part introduces a wav2vec 2.0 module on the basis of the AD-NeRF model and extracts features of the speech data through the wav2vec 2.0 module; the second part modifies the frameworks of the head neural radiance field and the torso neural radiance field on the basis of the AD-NeRF model, adding Transformer modules and optimizing the loss calculation strategy through discriminators.
For ease of distinction, in this embodiment the Transformer module in the head neural radiance field is referred to as the first Transformer module, and the Transformer module in the torso neural radiance field is referred to as the second Transformer module; likewise, the discriminator in the head neural radiance field is referred to as the first discriminator, and the discriminator in the torso neural radiance field is referred to as the second discriminator. The first discriminator and the second discriminator are GAN (Generative Adversarial Network) discriminators.
Step 103, for each group of training samples, the video data and the audio data are processed by using the generative model to obtain a head semantic code and a torso semantic code.
Specifically, the generative model further includes a video processing module, a wav2vec 2.0 module and an implicit function, and processing the video data and the audio data by using the generative model to obtain the head semantic code and the torso semantic code may include: processing the video data by using the video processing module to obtain a video parsing map and pose parameters; processing the audio data by using the wav2vec 2.0 module to obtain wav2vec 2.0 features; and processing the video parsing map, the pose parameters and the wav2vec 2.0 features by using the implicit function to obtain the head semantic code and the torso semantic code. Extracting features of the speech data through the wav2vec 2.0 module yields better speech features.
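One concrete (assumed) way to obtain such speech features is the open-source wav2vec 2.0 model from the HuggingFace Transformers library; the checkpoint name and the 16 kHz mono input are assumptions of this sketch, since the application does not name a specific pretrained model.

```python
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

# Assumed checkpoint; any wav2vec 2.0 weights with a 16 kHz front end would do.
extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h").eval()

def wav2vec2_features(waveform, sr=16000):
    """waveform: 1-D float array of mono audio sampled at 16 kHz."""
    inputs = extractor(waveform, sampling_rate=sr, return_tensors="pt")
    with torch.no_grad():
        hidden = model(inputs.input_values).last_hidden_state  # (1, T, 768)
    return hidden.squeeze(0)  # frame-level speech features fed to the implicit function
```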
Step 104, the head semantic code is rendered based on the first Transformer module to obtain a head rendering result, and a head loss is calculated for the head rendering result and the real head image by using the first discriminator.
Specifically, rendering the head semantic code based on the first Transformer module to obtain the head rendering result may include: performing volume rendering on the head semantic code to obtain a head low-dimensional feature map; rendering the head low-dimensional feature map by using a head-based two-dimensional neural rendering module to obtain a first intermediate result; and processing the first intermediate result by using the first Transformer module to obtain the head rendering result.
FIG. 2 illustrates the training process of the head neural radiance field. After the head semantic code is obtained, volume rendering mediated by the head-based implicit function is performed on the head semantic code to obtain a head low-dimensional feature map; the head-based two-dimensional neural rendering module then renders the head low-dimensional feature map to obtain a first intermediate result; the first intermediate result is processed by the first Transformer module to obtain a head rendering result; finally, the first discriminator compares the head rendering result with the real head image to obtain the head loss.
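A schematic of this head branch might look as follows; the channel sizes, the convolutional two-dimensional renderer, and the use of `nn.TransformerEncoder` over flattened feature-map tokens are assumptions of the sketch, not the exact architecture disclosed by the application (the torso branch of FIG. 3 would mirror the same structure).

```python
import torch
import torch.nn as nn

class HeadBranch(nn.Module):
    """Low-dim feature map from volume rendering -> 2D neural renderer -> Transformer -> RGB."""
    def __init__(self, feat_ch=32, d_model=64):
        super().__init__()
        self.renderer2d = nn.Sequential(       # assumed CNN as the 2D neural rendering module
            nn.Conv2d(feat_ch, d_model, 3, padding=1), nn.ReLU(),
            nn.Conv2d(d_model, d_model, 3, padding=1), nn.ReLU(),
        )
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=2)
        self.to_rgb = nn.Conv2d(d_model, 3, 1)

    def forward(self, feat_map):               # feat_map: (B, feat_ch, H, W) from volume rendering
        x = self.renderer2d(feat_map)           # first intermediate result
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)   # (B, H*W, d_model) tokens for the Transformer
        tokens = self.transformer(tokens)
        x = tokens.transpose(1, 2).reshape(b, c, h, w)
        return self.to_rgb(x).sigmoid()         # head rendering result in [0, 1]
```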
Step 105, the torso semantic code is rendered based on the second Transformer module to obtain a torso rendering result, and a torso loss is calculated for the torso rendering result and the real torso image by using the second discriminator.
Specifically, rendering the torso semantic code based on the second Transformer module to obtain the torso rendering result may include: performing volume rendering on the torso semantic code to obtain a torso low-dimensional feature map; rendering the torso low-dimensional feature map by using a torso-based two-dimensional neural rendering module to obtain a second intermediate result; and processing the second intermediate result by using the second Transformer module to obtain the torso rendering result.
FIG. 3 shows the training process of the torso neural radiance field. After the torso semantic code is obtained, volume rendering mediated by the torso-based implicit function is performed on the torso semantic code to obtain a torso low-dimensional feature map; the torso-based two-dimensional neural rendering module then renders the torso low-dimensional feature map to obtain a second intermediate result; the second intermediate result is processed by the second Transformer module to obtain a torso rendering result; finally, the second discriminator compares the torso rendering result with the real torso image to obtain the torso loss.
Step 106, the generative model is trained using the head loss and the torso loss.
After the head loss and the torso loss are obtained, they can be used as feedback to adjust the model parameters of the generative model, and the training samples are processed again with the adjusted parameters; when the model parameters of the generative model meet a preset condition, training stops, and the generative model with those parameters is determined as the final trained generative model.
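One plausible form of this adversarial training step is sketched below, with separate head and torso GAN discriminators and a standard non-saturating loss; the batch layout, the equal weighting of the two losses, and the optimizer handling are assumptions, since the application does not specify them.

```python
import torch
import torch.nn.functional as F

def adv_loss(logits, is_real):
    """Non-saturating GAN loss on raw discriminator logits."""
    target = torch.ones_like(logits) if is_real else torch.zeros_like(logits)
    return F.binary_cross_entropy_with_logits(logits, target)

def train_step(generator, disc_head, disc_torso, opt_g, opt_d, batch):
    # the generator yields both renders from video and audio inputs (steps 103-105)
    head_fake, torso_fake = generator(batch["video"], batch["audio"])

    # discriminator update: real head/torso images vs. detached renders
    opt_d.zero_grad()
    d_loss = (adv_loss(disc_head(batch["head_real"]), True)
              + adv_loss(disc_head(head_fake.detach()), False)
              + adv_loss(disc_torso(batch["torso_real"]), True)
              + adv_loss(disc_torso(torso_fake.detach()), False))
    d_loss.backward()
    opt_d.step()

    # generator update: head loss + torso loss used as feedback (step 106)
    opt_g.zero_grad()
    g_loss = adv_loss(disc_head(head_fake), True) + adv_loss(disc_torso(torso_fake), True)
    g_loss.backward()
    opt_g.step()
    return d_loss.item(), g_loss.item()
```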
To sum up, in the method for training a generative model of a speaker video provided by the embodiment of the application, a wav2vec 2.0 module is introduced on the basis of the AD-NeRF model, and features of the speech data are extracted through the wav2vec 2.0 module, yielding better speech features; the frameworks of the head neural radiance field and the torso neural radiance field are modified on the basis of the AD-NeRF model by adding Transformer modules, and the loss calculation strategy is optimized through discriminators, which further improves the characterization and image generation capability and effectively alleviates the torso blurring problem.
Referring to FIG. 4, a flowchart of a method for using a generative model of a speaker video according to an embodiment of the present application is shown; the method can be applied to a computer device and can comprise the following steps:
step 401, video data and audio data containing a speaker are obtained.
The video data in this embodiment may be a video of a speaker speaking, where the speaker may be a real person or a virtual digital human; the audio data is unrelated to the video data.
Step 402, the video data and the audio data are processed by using the trained generative model to obtain a head semantic code and a torso semantic code, wherein a head neural radiance field in the generative model comprises a first Transformer module and a first discriminator, and a torso neural radiance field in the generative model comprises a second Transformer module and a second discriminator.
Specifically, the generative model includes a video processing module, a wav2vec 2.0 module and an implicit function, and processing the video data and the audio data by using the trained generative model to obtain the head semantic code and the torso semantic code may include: processing the video data by using the video processing module to obtain a video parsing map and pose parameters; processing the audio data by using the wav2vec 2.0 module to obtain wav2vec 2.0 features; and processing the video parsing map, the pose parameters and the wav2vec 2.0 features by using the implicit function to obtain the head semantic code and the torso semantic code. Extracting features of the speech data through the wav2vec 2.0 module yields better speech features.
Step 403, the head semantic code is rendered based on the first Transformer module to obtain a head rendering result.
Specifically, rendering the head semantic code based on the first Transformer module to obtain the head rendering result may include: performing volume rendering on the head semantic code to obtain a head low-dimensional feature map; rendering the head low-dimensional feature map by using a head-based two-dimensional neural rendering module to obtain a first intermediate result; and processing the first intermediate result by using the first Transformer module to obtain the head rendering result.
Step 404, the torso semantic code is rendered based on the second Transformer module to obtain a torso rendering result.
Specifically, rendering the torso semantic code based on the second Transformer module to obtain the torso rendering result may include: performing volume rendering on the torso semantic code to obtain a torso low-dimensional feature map; rendering the torso low-dimensional feature map by using a torso-based two-dimensional neural rendering module to obtain a second intermediate result; and processing the second intermediate result by using the second Transformer module to obtain the torso rendering result.
Step 405, the head rendering result and the torso rendering result are rendered into the speaker video.
FIG. 5 shows the synthesis process of the speaker video: a video parsing map and pose parameters are extracted from the video data, wav2vec 2.0 features are extracted from the audio data, the head semantic code and the torso semantic code are generated from the video parsing map, the pose parameters and the wav2vec 2.0 features by using the implicit function, a head rendering result is generated from the head semantic code by using the head neural radiance field, a torso rendering result is generated from the torso semantic code by using the torso neural radiance field, and the head rendering result and the torso rendering result are rendered into the speaker video.
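As a final illustrative step, the per-frame head and torso renders could be composited and encoded into a video; the alpha-mask compositing and the use of the imageio library are assumptions of this sketch rather than the method disclosed in FIG. 5.

```python
import imageio.v2 as imageio
import numpy as np

def compose_video(head_frames, head_masks, torso_frames, path="speaker.mp4", fps=25):
    """Overlay each head render on the matching torso render and encode a video.

    head_frames/torso_frames: lists of (H, W, 3) uint8 arrays;
    head_masks: matching (H, W, 1) float arrays in [0, 1] (assumed available).
    Requires the imageio-ffmpeg backend for .mp4 output.
    """
    writer = imageio.get_writer(path, fps=fps)
    for head, mask, torso in zip(head_frames, head_masks, torso_frames):
        frame = mask * head + (1.0 - mask) * torso   # simple alpha compositing
        writer.append_data(frame.astype(np.uint8))
    writer.close()
```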
To sum up, in the method for using a generative model of a speaker video provided by the embodiment of the application, a wav2vec 2.0 module is introduced on the basis of the AD-NeRF model, and features of the speech data are extracted through the wav2vec 2.0 module, yielding better speech features; the frameworks of the head neural radiance field and the torso neural radiance field are modified on the basis of the AD-NeRF model by adding Transformer modules, and the loss calculation strategy is optimized through discriminators, which further improves the characterization and image generation capability and effectively alleviates the torso blurring problem.
Referring to FIG. 6, a block diagram of an apparatus for training a generative model of a speaker video according to an embodiment of the present application is shown; the apparatus can be applied to a computer device and can comprise:
an obtaining module 610, configured to obtain a plurality of groups of training samples, where each group of training samples includes video data and audio data containing a speaker, and a real head image and a real torso image of the speaker;
a generating module 620, configured to improve the AD-NeRF model to obtain a generative model, where a head neural radiance field in the generative model includes a first Transformer module and a first discriminator, and a torso neural radiance field in the generative model includes a second Transformer module and a second discriminator;
a processing module 630, configured to process the video data and the audio data by using the generative model for each group of training samples to obtain a head semantic code and a torso semantic code;
the processing module 630 is further configured to render the head semantic code based on the first Transformer module to obtain a head rendering result, and calculate a head loss for the head rendering result and the real head image by using the first discriminator;
the processing module 630 is further configured to render the torso semantic code based on the second Transformer module to obtain a torso rendering result, and calculate a torso loss for the torso rendering result and the real torso image by using the second discriminator;
a training module 640, configured to train the generative model with the head loss and the torso loss.
In an optional embodiment, the generative model further includes a video processing module, a wav2vec 2.0 module, and an implicit function, and the processing module 630 is further configured to:
process the video data by using the video processing module to obtain a video parsing map and pose parameters;
process the audio data by using the wav2vec 2.0 module to obtain wav2vec 2.0 features;
and process the video parsing map, the pose parameters and the wav2vec 2.0 features by using the implicit function to obtain the head semantic code and the torso semantic code.
In an optional embodiment, the processing module 630 is further configured to:
perform volume rendering on the head semantic code to obtain a head low-dimensional feature map;
render the head low-dimensional feature map by using a head-based two-dimensional neural rendering module to obtain a first intermediate result;
and process the first intermediate result by using the first Transformer module to obtain the head rendering result.
In an optional embodiment, the processing module 630 is further configured to:
perform volume rendering on the torso semantic code to obtain a torso low-dimensional feature map;
render the torso low-dimensional feature map by using a torso-based two-dimensional neural rendering module to obtain a second intermediate result;
and process the second intermediate result by using the second Transformer module to obtain the torso rendering result.
In an alternative embodiment, the first and second discriminators are GAN discriminators.
To sum up, the apparatus for training a generative model of a speaker video provided by the embodiment of the application introduces a wav2vec 2.0 module on the basis of the AD-NeRF model and extracts features of the speech data through the wav2vec 2.0 module, yielding better speech features; the frameworks of the head neural radiance field and the torso neural radiance field are modified on the basis of the AD-NeRF model by adding Transformer modules, and the loss calculation strategy is optimized through discriminators, which further improves the characterization and image generation capability and effectively alleviates the torso blurring problem.
Referring to FIG. 7, a block diagram of an apparatus for using a generative model of a speaker video according to an embodiment of the present application is shown; the apparatus can be applied to a computer device and can comprise:
an obtaining module 710, configured to obtain video data and audio data that include a speaker;
a processing module 720, configured to process the video data and the audio data by using the trained generative model to obtain a head semantic code and a torso semantic code, where a head neural radiance field in the generative model includes a first Transformer module and a first discriminator, and a torso neural radiance field in the generative model includes a second Transformer module and a second discriminator;
a rendering module 730, configured to render the head semantic code based on the first Transformer module to obtain a head rendering result;
the rendering module 730 is further configured to render the torso semantic code based on the second Transformer module to obtain a torso rendering result;
the rendering module 730 is further configured to render the head rendering result and the torso rendering result into the speaker video.
In summary, the apparatus for using a generative model of a speaker video provided by the embodiment of the application introduces a wav2vec 2.0 module on the basis of the AD-NeRF model and extracts features of the speech data through the wav2vec 2.0 module, yielding better speech features; the frameworks of the head neural radiance field and the torso neural radiance field are modified on the basis of the AD-NeRF model by adding Transformer modules, and the loss calculation strategy is optimized through discriminators, which further improves the characterization and image generation capability and effectively alleviates the torso blurring problem.
One embodiment of the present application provides a computer-readable storage medium having at least one instruction stored therein, which is loaded and executed by a processor to implement the method for training a generative model of a speaker video or the method for using a generative model of a speaker video as described above.
One embodiment of the present application provides a computer device, which includes a processor and a memory, where the memory stores at least one instruction, and the instruction is loaded and executed by the processor to implement the method for training a generative model of a speaker video or the method for using a generative model of a speaker video as described above.
It will be understood by those skilled in the art that all or part of the steps of the above embodiments may be implemented by hardware, or by a program instructing relevant hardware, where the program may be stored in a computer-readable storage medium, and the storage medium may be a read-only memory, a magnetic disk, an optical disk, or the like.
The above description is not intended to limit the embodiments of the present application, and any modifications, equivalents, improvements, etc. made within the spirit and principle of the embodiments of the present application should be included in the scope of the embodiments of the present application.

Claims (10)

1. A method for training a generative model of a speaker video, the method comprising:
acquiring a plurality of groups of training samples, wherein each group of training samples comprises video data and audio data containing a speaker, and a real head image and a real torso image of the speaker;
improving an AD-NeRF model to obtain a generative model, wherein a head neural radiance field in the generative model comprises a first Transformer module and a first discriminator, and a torso neural radiance field in the generative model comprises a second Transformer module and a second discriminator;
for each group of training samples, processing the video data and the audio data by using the generative model to obtain a head semantic code and a torso semantic code;
rendering the head semantic code based on the first Transformer module to obtain a head rendering result, and calculating a head loss on the head rendering result and the real head image by using the first discriminator;
rendering the torso semantic code based on the second Transformer module to obtain a torso rendering result, and calculating a torso loss on the torso rendering result and the real torso image by using the second discriminator;
and training the generative model with the head loss and the torso loss.
2. The method for training a generative model of a speaker video according to claim 1, wherein the generative model further comprises a video processing module, a wav2vec 2.0 module and an implicit function, and the processing the video data and the audio data by using the generative model to obtain a head semantic code and a torso semantic code comprises:
processing the video data by using the video processing module to obtain a video parsing map and pose parameters;
processing the audio data by using the wav2vec 2.0 module to obtain wav2vec 2.0 features;
and processing the video parsing map, the pose parameters and the wav2vec 2.0 features by using the implicit function to obtain the head semantic code and the torso semantic code.
3. The method for training a generative model of a speaker video according to claim 1, wherein the rendering the head semantic code based on the first Transformer module to obtain a head rendering result comprises:
performing volume rendering on the head semantic code to obtain a head low-dimensional feature map;
rendering the head low-dimensional feature map by using a head-based two-dimensional neural rendering module to obtain a first intermediate result;
and processing the first intermediate result by utilizing the first Transformer module to obtain a head rendering result.
4. The method for training a generative model of a speaker video according to claim 1, wherein the rendering the torso semantic code based on the second Transformer module to obtain a torso rendering result comprises:
performing volume rendering on the torso semantic code to obtain a torso low-dimensional feature map;
rendering the torso low-dimensional feature map by using a torso-based two-dimensional neural rendering module to obtain a second intermediate result;
and processing the second intermediate result by utilizing the second Transformer module to obtain a torso rendering result.
5. The method for training a generative model of a speaker video according to any one of claims 1 to 4, wherein the first discriminator and the second discriminator are GAN discriminators.
6. A method for using a generative model of a speaker video, the method comprising:
acquiring video data and audio data containing a speaker;
processing the video data and the audio data by using a trained generative model to obtain a head semantic code and a torso semantic code, wherein a head neural radiance field in the generative model comprises a first Transformer module and a first discriminator, and a torso neural radiance field in the generative model comprises a second Transformer module and a second discriminator;
rendering the head semantic code based on the first Transformer module to obtain a head rendering result;
rendering the torso semantic code based on the second Transformer module to obtain a torso rendering result;
and rendering the head rendering result and the torso rendering result to obtain the speaker video.
7. An apparatus for training a generative model of a speaker video, the apparatus comprising:
the acquisition module is used for acquiring a plurality of groups of training samples, wherein each group of training samples comprises video data and audio data containing a speaker, and a real head image and a real torso image of the speaker;
the generation module is used for improving the AD-NeRF model to obtain a generative model, wherein a head neural radiance field in the generative model comprises a first Transformer module and a first discriminator, and a torso neural radiance field in the generative model comprises a second Transformer module and a second discriminator;
the processing module is used for processing the video data and the audio data by using the generative model for each group of training samples to obtain a head semantic code and a torso semantic code;
the processing module is further configured to render the head semantic code based on the first Transformer module to obtain a head rendering result, and calculate a head loss for the head rendering result and the real head image by using the first discriminator;
the processing module is further configured to render the torso semantic code based on the second Transformer module to obtain a torso rendering result, and calculate a torso loss for the torso rendering result and the real torso image by using the second discriminator;
a training module to train the generative model using the head loss and the torso loss.
8. An apparatus for using a generative model of a speaker video, the apparatus comprising:
the acquisition module is used for acquiring video data and audio data containing a speaker;
the processing module is used for processing the video data and the audio data by using a trained generative model to obtain a head semantic code and a torso semantic code, wherein a head neural radiance field in the generative model comprises a first Transformer module and a first discriminator, and a torso neural radiance field in the generative model comprises a second Transformer module and a second discriminator;
the rendering module is used for rendering the head semantic code based on the first Transformer module to obtain a head rendering result;
the rendering module is further configured to render the torso semantic code based on the second Transformer module to obtain a torso rendering result;
the rendering module is further used for rendering the head rendering result and the torso rendering result to obtain the speaker video.
9. A computer-readable storage medium, wherein at least one instruction is stored in the storage medium, and the at least one instruction is loaded and executed by a processor to implement the method for training a generative model of a speaker video according to any one of claims 1 to 5, or to implement the method for using a generative model of a speaker video according to claim 6.
10. A computer device comprising a processor and a memory, wherein the memory stores at least one instruction, and the instruction is loaded and executed by the processor to implement the method for training a generative model of a speaker video according to any one of claims 1 to 5; alternatively, the instruction is loaded and executed by the processor to implement the method for using a generative model of a speaker video according to claim 6.
CN202211631657.9A 2022-12-19 2022-12-19 Method, device and equipment for training and using generation model of speaker video Pending CN115908662A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211631657.9A CN115908662A (en) 2022-12-19 2022-12-19 Method, device and equipment for training and using generation model of speaker video

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211631657.9A CN115908662A (en) 2022-12-19 2022-12-19 Method, device and equipment for training and using generation model of speaker video

Publications (1)

Publication Number Publication Date
CN115908662A (en) 2023-04-04

Family

ID=86487924

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211631657.9A Pending CN115908662A (en) 2022-12-19 2022-12-19 Method, device and equipment for training and using generation model of speaker video

Country Status (1)

Country Link
CN (1) CN115908662A (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112887698A (en) * 2021-02-04 2021-06-01 中国科学技术大学 High-quality face voice driving method based on nerve radiation field
CN113378697A (en) * 2021-06-08 2021-09-10 安徽大学 Method and device for generating speaking face video based on convolutional neural network
CN113793408A (en) * 2021-09-15 2021-12-14 宿迁硅基智能科技有限公司 Real-time audio-driven face generation method and device and server
CN113822969A (en) * 2021-09-15 2021-12-21 宿迁硅基智能科技有限公司 Method, device and server for training nerve radiation field model and face generation

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
RICONG HUANG et al.: "Audio-driven Talking Head Generation with Transformer and 3D Morphable Model", MM '22: Proceedings of the 30th ACM International Conference on Multimedia, 30 October 2022 (2022-10-30)
SHUNYU YAO et al.: "DFA-NeRF: Personalized Talking Head Generation via Disentangled Face Attributes Neural Rendering", arXiv.org, 3 January 2022 (2022-01-03)
YANG ZHOU et al.: "MakeItTalk: speaker-aware talking-head animation", ACM Transactions on Graphics, vol. 39, no. 6, 27 November 2020 (2020-11-27)
SU Hongqi et al.: "A survey of audio-driven and motion-driven talking face video generation", Computer and Image Technology, no. 21, 1 November 2022 (2022-11-01)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117689783A (en) * 2024-02-02 2024-03-12 湖南马栏山视频先进技术研究院有限公司 Face voice driving method and device based on super-parameter nerve radiation field
CN117689783B (en) * 2024-02-02 2024-04-30 湖南马栏山视频先进技术研究院有限公司 Face voice driving method and device based on super-parameter nerve radiation field

Similar Documents

Publication Publication Date Title
CN113269872A (en) Synthetic video generation method based on three-dimensional face reconstruction and video key frame optimization
CN113378697A (en) Method and device for generating speaking face video based on convolutional neural network
EP3982362A1 (en) Audio processing method, apparatus, computer device, and storage medium
CN113838173B (en) Virtual human head motion synthesis method driven by combination of voice and background sound
CN113077537A (en) Video generation method, storage medium and equipment
CN113886641A (en) Digital human generation method, apparatus, device and medium
CN115908662A (en) Method, device and equipment for training and using generation model of speaker video
CN115423908A (en) Virtual face generation method, device, equipment and readable storage medium
CN115914505B (en) Video generation method and system based on voice-driven digital human model
Shankar et al. Non-parallel emotion conversion using a deep-generative hybrid network and an adversarial pair discriminator
CN113516990A (en) Voice enhancement method, method for training neural network and related equipment
CN113948105A (en) Voice-based image generation method, device, equipment and medium
CN114882861A (en) Voice generation method, device, equipment, medium and product
Huang et al. Parametric implicit face representation for audio-driven facial reenactment
CN113470170A (en) Real-time video face region space-time consistent synthesis method using voice information
Huang et al. Perceptual conversational head generation with regularized driver and enhanced renderer
CN116912375A (en) Facial animation generation method and device, electronic equipment and storage medium
CN112862672B (en) Liu-bang generation method, device, computer equipment and storage medium
CN113886640A (en) Digital human generation method, apparatus, device and medium
CN115278293A (en) Virtual anchor generation method and device, storage medium and computer equipment
CN113469292A (en) Training method, synthesizing method, device, medium and equipment for video synthesizing model
CN113343761A (en) Real-time facial expression migration method based on generation confrontation
Pan et al. Bone-conducted speech to air-conducted speech conversion based on cycle-consistent adversarial networks
Chen et al. VAST: Vivify Your Talking Avatar via Zero-Shot Expressive Facial Style Transfer
CN117523051B (en) Method, device, equipment and storage medium for generating dynamic image based on audio

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination