CN115908662B - Speaker video generation model training and using method, device and equipment - Google Patents
Speaker video generation model training and using method, device and equipment
- Publication number: CN115908662B
- Application number: CN202211631657.9A
- Authority: CN (China)
- Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis)
Abstract
The application discloses a method, a device, and equipment for training and using a speaker video generation model, and belongs to the technical field of machine learning. The method comprises the following steps: modifying the AD-NeRF model to obtain a generation model; processing the video data and audio data in a training sample with the generation model to obtain a head semantic code and a torso semantic code; rendering the head semantic code based on a first Transformer module in the head neural radiance field, and calculating a head loss between the head rendering result and the real head image using a first discriminator in the head neural radiance field; rendering the torso semantic code based on a second Transformer module in the torso neural radiance field, and calculating a torso loss between the torso rendering result and the real torso image using a second discriminator in the torso neural radiance field; and training the generation model using the head loss and the torso loss. The application improves the representation and image generation capability and alleviates the problem of torso blurring.
Description
Technical Field
The present application relates to the field of machine learning technologies, and in particular, to a method, an apparatus, and a device for training and using a speaker video generation model.
Background
In many applications, such as digital humans, chat robots, and virtual video conferencing, synthesizing high-fidelity audio-driven facial video sequences is an important and challenging problem. That is, treating speaker video generation as a cross-modal mapping from audio to visual faces, the synthesized face image should look as photo-realistic as the original video while exhibiting a natural speaking style.
In recent years, the AD-NeRF (Audio Driven Neural Radiance Fields) model has been proposed for generating speaker videos. Building on NeRF (neural radiance fields), AD-NeRF generates a speaker video directly from a speech signal: the features of the audio signal are fed into a conditional implicit function to produce a dynamic neural radiance field, and the speaker video corresponding to the audio signal is synthesized by volume rendering. AD-NeRF can synthesize not only the head (with hair) region but also the torso, using two independent neural radiance fields; however, in the synthesized speaker videos the speaker's mouth often looks unnatural and the torso is blurred.
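To make this conditioning concrete, the following minimal sketch (an illustration, not the patent's or AD-NeRF's actual implementation) shows an audio-conditioned implicit function: an MLP maps a positionally encoded 3D sample point and view direction, together with an audio feature vector, to a density and a color, which volume rendering then composites along each ray. All layer sizes and the assumed 768-dimensional audio feature are illustrative assumptions.

```python
import torch
import torch.nn as nn

class AudioConditionedNeRF(nn.Module):
    """Illustrative audio-conditioned implicit function F(x, d, a) -> (sigma, rgb).

    Assumed dimensions: pos_dim = 63 (3D point with a 10-frequency positional
    encoding), dir_dim = 27, audio_dim = 768 (e.g. a wav2vec 2.0 base model).
    """
    def __init__(self, pos_dim=63, dir_dim=27, audio_dim=768, hidden=256):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(pos_dim + audio_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.sigma_head = nn.Linear(hidden, 1)      # volume density
        self.rgb_head = nn.Sequential(              # view-dependent color
            nn.Linear(hidden + dir_dim, hidden // 2), nn.ReLU(),
            nn.Linear(hidden // 2, 3), nn.Sigmoid(),
        )

    def forward(self, x_enc, d_enc, audio_feat):
        # The audio feature of the current frame conditions every sample point,
        # which is what makes the radiance field dynamic over time.
        h = self.backbone(torch.cat([x_enc, audio_feat], dim=-1))
        sigma = torch.relu(self.sigma_head(h))
        rgb = self.rgb_head(torch.cat([h, d_enc], dim=-1))
        return sigma, rgb
```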
Disclosure of Invention
The application provides a method, an apparatus, and a device for training and using a speaker video generation model, which solve the problems of the unnatural mouth and the blurred torso in speaker videos synthesized with the AD-NeRF model. The technical solution is as follows:
In one aspect, a speaker video generation model training method is provided, the method comprising:
acquiring a plurality of groups of training samples, wherein each group of training samples comprises video data, audio data, and a real head image and a real torso image of a speaker;
modifying the AD-NeRF model to obtain a generation model, wherein a head neural radiance field in the generation model comprises a first Transformer module and a first discriminator, and a torso neural radiance field in the generation model comprises a second Transformer module and a second discriminator;
for each group of training samples, processing the video data and the audio data with the generation model to obtain a head semantic code and a torso semantic code;
rendering the head semantic code based on the first Transformer module to obtain a head rendering result, and calculating a head loss between the head rendering result and the real head image using the first discriminator;
rendering the torso semantic code based on the second Transformer module to obtain a torso rendering result, and calculating a torso loss between the torso rendering result and the real torso image using the second discriminator;
and training the generation model using the head loss and the torso loss.
In a possible implementation, the generation model further includes a video processing module, a wav2vec 2.0 module, and an implicit function, and processing the video data and the audio data with the generation model to obtain a head semantic code and a torso semantic code includes:
processing the video data with the video processing module to obtain a video parsing map and pose parameters;
processing the audio data with the wav2vec 2.0 module to obtain wav2vec 2.0 features;
and processing the video parsing map, the pose parameters, and the wav2vec 2.0 features with the implicit function to obtain a head semantic code and a torso semantic code.
In one possible implementation, rendering the head semantic code based on the first Transformer module to obtain a head rendering result includes:
performing volume rendering on the head semantic code to obtain a low-dimensional head feature map;
rendering the low-dimensional head feature map with a head two-dimensional neural rendering module to obtain a first intermediate result;
and processing the first intermediate result with the first Transformer module to obtain the head rendering result.
In one possible implementation, rendering the torso semantic code based on the second Transformer module to obtain a torso rendering result includes:
performing volume rendering on the torso semantic code to obtain a low-dimensional torso feature map;
rendering the low-dimensional torso feature map with a torso two-dimensional neural rendering module to obtain a second intermediate result;
and processing the second intermediate result with the second Transformer module to obtain the torso rendering result.
In one possible implementation, the first discriminator and the second discriminator are GAN discriminators.
In one aspect, a speaker video generation model using method is provided, the method comprising:
acquiring video data and audio data containing a speaker;
processing the video data and the audio data with a trained generation model to obtain a head semantic code and a torso semantic code, wherein a head neural radiance field in the generation model comprises a first Transformer module and a first discriminator, and a torso neural radiance field in the generation model comprises a second Transformer module and a second discriminator;
rendering the head semantic code based on the first Transformer module to obtain a head rendering result;
rendering the torso semantic code based on the second Transformer module to obtain a torso rendering result;
and rendering the head rendering result and the torso rendering result to obtain a speaker video.
In one aspect, a speaker video generation model training apparatus is provided, the apparatus comprising:
an acquisition module, configured to acquire a plurality of groups of training samples, wherein each group of training samples comprises video data, audio data, and a real head image and a real torso image of a speaker;
a generation module, configured to modify the AD-NeRF model to obtain a generation model, wherein a head neural radiance field in the generation model comprises a first Transformer module and a first discriminator, and a torso neural radiance field in the generation model comprises a second Transformer module and a second discriminator;
a processing module, configured to process, for each group of training samples, the video data and the audio data with the generation model to obtain a head semantic code and a torso semantic code;
the processing module is further configured to render the head semantic code based on the first Transformer module to obtain a head rendering result, and to calculate a head loss between the head rendering result and the real head image using the first discriminator;
the processing module is further configured to render the torso semantic code based on the second Transformer module to obtain a torso rendering result, and to calculate a torso loss between the torso rendering result and the real torso image using the second discriminator;
and a training module, configured to train the generation model using the head loss and the torso loss.
In one aspect, a speaker video generation model using apparatus is provided, the apparatus comprising:
an acquisition module, configured to acquire video data and audio data containing a speaker;
a processing module, configured to process the video data and the audio data with a trained generation model to obtain a head semantic code and a torso semantic code, wherein a head neural radiance field in the generation model comprises a first Transformer module and a first discriminator, and a torso neural radiance field in the generation model comprises a second Transformer module and a second discriminator;
a rendering module, configured to render the head semantic code based on the first Transformer module to obtain a head rendering result;
the rendering module is further configured to render the torso semantic code based on the second Transformer module to obtain a torso rendering result;
and the rendering module is further configured to render the head rendering result and the torso rendering result to obtain a speaker video.
In one aspect, a computer-readable storage medium is provided, in which at least one instruction is stored, the instruction being loaded and executed by a processor to implement the speaker video generation model training method described above, or to implement the speaker video generation model using method described above.
In one aspect, a computer device is provided, the computer device comprising a processor and a memory, the memory storing at least one instruction, the instruction being loaded and executed by the processor to implement the speaker video generation model training method described above, or to implement the speaker video generation model using method described above.
The technical solution provided by the application has at least the following beneficial effects:
A wav2vec 2.0 module is introduced on the basis of the AD-NeRF model, and the features of the speech data are extracted by the wav2vec 2.0 module, so that the extracted wav2vec 2.0 features are optimized. The architectures of the head neural radiance field and the torso neural radiance field are modified on the basis of the AD-NeRF model: a Transformer module is added, and the loss calculation strategy is optimized by means of a discriminator. This further improves the representation and image generation capability and effectively alleviates the problem of torso blurring.
Drawings
In order to illustrate the technical solutions of the embodiments of the present application more clearly, the drawings required for the description of the embodiments are briefly introduced below. It is apparent that the drawings described below are only some embodiments of the present application, and that other drawings can be obtained from them by a person skilled in the art without inventive effort.
FIG. 1 is a flowchart of a speaker video generation model training method according to one embodiment of the present application;
FIG. 2 is a training flow diagram of the head neural radiance field provided by one embodiment of the present application;
FIG. 3 is a training flow diagram of the torso neural radiance field provided by one embodiment of the present application;
FIG. 4 is a flowchart of a speaker video generation model using method according to another embodiment of the present application;
FIG. 5 is a schematic diagram of the speaker video generation model usage flow provided by an embodiment of the present application;
FIG. 6 is a block diagram of a speaker video generation model training apparatus according to one embodiment of the present application;
FIG. 7 is a block diagram of a speaker video generation model using apparatus according to an embodiment of the present application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present application more apparent, the following detailed description of the embodiments of the present application will be given with reference to the accompanying drawings.
Referring to FIG. 1, a flowchart of a speaker video generation model training method according to an embodiment of the present application is shown; the method may be applied to a computer device. The speaker video generation model training method may comprise the following steps:
Step 101, acquiring a plurality of sets of training samples, wherein each set of training samples comprises video data, audio data, and a real head image and a real torso image of a speaker.
The video data in this embodiment may be videos of a speaker speaking, where the speaker may be a real person or a virtual digital human. For example, the computer device may segment a training video into clips of a fixed duration to obtain the individual items of video data. The fixed duration may be set as required, for example to a value in the range of 3-5 minutes.
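As an illustration of such pre-segmentation (the patent does not prescribe any tool), a fixed-duration clip could be cut with ffmpeg from Python; the file paths and the 4-minute (240 s) duration below are hypothetical:

```python
import subprocess

# Hypothetical example: cut a 4-minute clip starting at 0:00 from a training
# video; "-c copy" copies the original streams without re-encoding.
subprocess.run([
    "ffmpeg", "-ss", "0",
    "-i", "training_video.mp4",   # assumed input path
    "-t", "240",
    "-c", "copy", "clip_000.mp4", # assumed output path
], check=True)
```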
The audio data may be extracted from the video data of the same set of training samples, in which case the training sample is a positive sample; or the audio data may be unrelated to the video data of the same set of training samples, in which case the training sample is a negative sample.
Step 102, modifying the AD-NeRF model to obtain a generation model, wherein a head neural radiance field in the generation model comprises a first Transformer module and a first discriminator, and a torso neural radiance field in the generation model comprises a second Transformer module and a second discriminator.
In this embodiment, the modification of the AD-NeRF model comprises two parts. The first part introduces a wav2vec 2.0 module on the basis of the AD-NeRF model; the wav2vec 2.0 module is used to extract the features of the speech data. The second part restructures the architectures of the head neural radiance field and the torso neural radiance field on the basis of the AD-NeRF model: a Transformer module is added, and the loss calculation strategy is optimized by means of a discriminator.
For ease of distinction, in this embodiment the Transformer module in the head neural radiance field is referred to as the first Transformer module, and the Transformer module in the torso neural radiance field is referred to as the second Transformer module; likewise, the discriminator in the head neural radiance field is referred to as the first discriminator, and the discriminator in the torso neural radiance field is referred to as the second discriminator. The first discriminator and the second discriminator are GAN (Generative Adversarial Network) discriminators.
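Purely for illustration, the sketch below shows one plausible form such a GAN discriminator and its adversarial losses could take; the patent does not disclose the discriminator architecture or the exact loss formulation, so the patch-wise convolutional design, the layer sizes, and the non-saturating loss here are all assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PatchDiscriminator(nn.Module):
    """Hypothetical convolutional discriminator scoring image realism patch-wise."""
    def __init__(self, in_ch=3, base=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, base, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(base, base * 2, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(base * 2, base * 4, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(base * 4, 1, 4, stride=1, padding=1),  # per-patch logits
        )

    def forward(self, img):
        return self.net(img)

def adversarial_losses(disc, rendered, real):
    """Non-saturating GAN losses for one branch (head or torso)."""
    real_logits = disc(real)
    fake_logits = disc(rendered.detach())  # detach: D update must not reach G
    d_loss = (F.binary_cross_entropy_with_logits(real_logits, torch.ones_like(real_logits))
              + F.binary_cross_entropy_with_logits(fake_logits, torch.zeros_like(fake_logits)))
    gen_logits = disc(rendered)            # fresh pass kept in the generator graph
    g_loss = F.binary_cross_entropy_with_logits(gen_logits, torch.ones_like(gen_logits))
    return d_loss, g_loss
```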
Step 103, for each group of training samples, processing the video data and the audio data with the generation model to obtain a head semantic code and a torso semantic code.
Specifically, the generation model further includes a video processing module, a wav2vec 2.0 module, and an implicit function, and processing the video data and the audio data with the generation model to obtain a head semantic code and a torso semantic code may include: processing the video data with the video processing module to obtain a video parsing map and pose parameters; processing the audio data with the wav2vec 2.0 module to obtain wav2vec 2.0 features; and processing the video parsing map, the pose parameters, and the wav2vec 2.0 features with the implicit function to obtain the head semantic code and the torso semantic code. The features of the speech data are extracted by the wav2vec 2.0 module, so that the extracted wav2vec 2.0 features are optimized.
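As a concrete illustration of the audio branch, wav2vec 2.0 features can be extracted with a pretrained model available through the Hugging Face transformers library; the checkpoint name and the 16 kHz mono input below are illustrative choices, not values stated in the patent:

```python
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

# Assumed checkpoint; the patent does not name one.
extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h")
model.eval()

# waveform: mono audio resampled to 16 kHz, as a 1-D float tensor.
waveform = torch.zeros(16000)  # one second of silence as a stand-in
inputs = extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# One 768-dim feature vector per ~20 ms frame of audio.
audio_features = outputs.last_hidden_state  # shape: (1, num_frames, 768)
```

Per step 103, each frame-level row of `audio_features` would then be passed to the implicit function together with the video parsing map and the pose parameters.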
Step 104, rendering the head semantic code based on the first Transformer module to obtain a head rendering result, and calculating a head loss between the head rendering result and the real head image using the first discriminator.
Specifically, rendering the head semantic code based on the first Transformer module to obtain a head rendering result may include: performing volume rendering on the head semantic code to obtain a low-dimensional head feature map; rendering the low-dimensional head feature map with the head two-dimensional neural rendering module to obtain a first intermediate result; and processing the first intermediate result with the first Transformer module to obtain the head rendering result.
FIG. 2 illustrates the training flow of the head neural radiance field. After the head semantic code is obtained, implicit-function-based volume rendering for the head can be applied to the head semantic code to obtain a low-dimensional head feature map; the head two-dimensional neural rendering module then renders the low-dimensional head feature map to obtain a first intermediate result; the first Transformer module processes the first intermediate result to obtain the head rendering result; finally, the first discriminator compares the head rendering result with the real head image to obtain the head loss.
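A minimal sketch of this three-stage head branch (volume rendering to a low-dimensional feature map, two-dimensional neural rendering, Transformer refinement) might look as follows; the torso branch of step 105 would mirror it. The feature dimensions, the upsampling design, and the Transformer configuration are assumptions, since the patent specifies none of them:

```python
import torch
import torch.nn as nn

def composite_rays(sigmas, feats, deltas):
    """NeRF-style volume rendering: alpha-composite per-sample features along
    each ray into one low-dimensional feature vector per pixel.
    sigmas: (R, S, 1), feats: (R, S, C), deltas: (R, S, 1) sample spacings."""
    alphas = 1.0 - torch.exp(-sigmas * deltas)
    trans = torch.cumprod(torch.cat([torch.ones_like(alphas[:, :1]),
                                     1.0 - alphas + 1e-10], dim=1), dim=1)[:, :-1]
    weights = alphas * trans
    return (weights * feats).sum(dim=1)  # (R, C)

class HeadRenderBranch(nn.Module):
    """Hypothetical 2D neural renderer followed by Transformer refinement."""
    def __init__(self, feat_ch=64, d_model=128):
        super().__init__()
        self.neural_render = nn.Sequential(   # upsample the feature map
            nn.Conv2d(feat_ch, d_model, 3, padding=1), nn.ReLU(),
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
            nn.Conv2d(d_model, d_model, 3, padding=1), nn.ReLU(),
        )
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=2)
        self.to_rgb = nn.Conv2d(d_model, 3, 1)

    def forward(self, feat_map):               # feat_map: (B, feat_ch, H, W)
        x = self.neural_render(feat_map)       # (B, d_model, 2H, 2W)
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)  # (B, H*W, d_model) pixel tokens
        tokens = self.transformer(tokens)      # global attention refinement
        x = tokens.transpose(1, 2).reshape(b, c, h, w)
        return torch.sigmoid(self.to_rgb(x))   # head rendering result in [0, 1]
```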
Step 105, rendering the torso semantic code based on the second Transformer module to obtain a torso rendering result, and calculating a torso loss between the torso rendering result and the real torso image using the second discriminator.
Specifically, rendering the torso semantic code based on the second Transformer module to obtain a torso rendering result may include: performing volume rendering on the torso semantic code to obtain a low-dimensional torso feature map; rendering the low-dimensional torso feature map with the torso two-dimensional neural rendering module to obtain a second intermediate result; and processing the second intermediate result with the second Transformer module to obtain the torso rendering result.
FIG. 3 illustrates the training flow of the torso neural radiance field. After the torso semantic code is obtained, implicit-function-based volume rendering for the torso can be applied to the torso semantic code to obtain a low-dimensional torso feature map; the torso two-dimensional neural rendering module then renders the low-dimensional torso feature map to obtain a second intermediate result; the second Transformer module processes the second intermediate result to obtain the torso rendering result; finally, the second discriminator compares the torso rendering result with the real torso image to obtain the torso loss.
Step 106, training the generation model using the head loss and the torso loss.
After the head loss and the torso loss are obtained, they can be used as feedback to adjust the model parameters of the generation model, and the training samples are processed again with the adjusted parameters. Training stops when the model parameters of the generation model satisfy a preset condition, and the generation model with those parameters is taken as the final trained generation model.
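Under the assumptions of the sketches above (`PatchDiscriminator`, `adversarial_losses`), one schematic training step alternating discriminator and generator updates could look like this; `gen_model`, the sample fields, and the learning rates are hypothetical stand-ins, not values from the patent:

```python
import torch

def train_generation_model(gen_model, head_disc, torso_disc, training_samples,
                           lr_g=5e-4, lr_d=1e-4):
    """Schematic training loop; all arguments are hypothetical stand-ins."""
    opt_g = torch.optim.Adam(gen_model.parameters(), lr=lr_g)
    opt_d = torch.optim.Adam(list(head_disc.parameters())
                             + list(torso_disc.parameters()), lr=lr_d)
    for sample in training_samples:  # each sample: video, audio, GT images
        head_render, torso_render = gen_model(sample.video, sample.audio)

        # Discriminator update on both branches (the fake input is detached
        # inside adversarial_losses, so this does not touch the generator).
        d_head, _ = adversarial_losses(head_disc, head_render, sample.real_head)
        d_torso, _ = adversarial_losses(torso_disc, torso_render, sample.real_torso)
        opt_d.zero_grad()
        (d_head + d_torso).backward()
        opt_d.step()

        # Generator update: head loss plus torso loss, as in step 106.
        _, g_head = adversarial_losses(head_disc, head_render, sample.real_head)
        _, g_torso = adversarial_losses(torso_disc, torso_render, sample.real_torso)
        opt_g.zero_grad()
        (g_head + g_torso).backward()
        opt_g.step()
```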
In summary, in the speaker video generation model training method provided by the embodiment of the application, a wav2vec 2.0 module is introduced on the basis of the AD-NeRF model, and the features of the speech data are extracted by the wav2vec 2.0 module, so that the extracted wav2vec 2.0 features are optimized; the architectures of the head neural radiance field and the torso neural radiance field are modified on the basis of the AD-NeRF model, a Transformer module is added, and the loss calculation strategy is optimized by means of a discriminator, which further improves the representation and image generation capability and effectively alleviates the problem of torso blurring.
Referring to FIG. 4, a flowchart of a speaker video generation model using method according to another embodiment of the present application is shown; the method may be applied to a computer device. The speaker video generation model using method may comprise the following steps:
Step 401, acquiring video data and audio data containing a speaker.
The video data in this embodiment may be a video of a person speaking, where the speaker may be a real person or a virtual digital human; at inference time, the audio data need not be related to the video data.
Step 402, processing the video data and the audio data with the trained generation model to obtain a head semantic code and a torso semantic code, wherein a head neural radiance field in the generation model comprises a first Transformer module and a first discriminator, and a torso neural radiance field in the generation model comprises a second Transformer module and a second discriminator.
Specifically, the generation model includes a video processing module, a wav2vec 2.0 module, and an implicit function, and processing the video data and the audio data with the trained generation model to obtain a head semantic code and a torso semantic code may include: processing the video data with the video processing module to obtain a video parsing map and pose parameters; processing the audio data with the wav2vec 2.0 module to obtain wav2vec 2.0 features; and processing the video parsing map, the pose parameters, and the wav2vec 2.0 features with the implicit function to obtain the head semantic code and the torso semantic code. The features of the speech data are extracted by the wav2vec 2.0 module, so that the extracted wav2vec 2.0 features are optimized.
Step 403, rendering the head semantic code based on the first Transformer module to obtain a head rendering result.
Specifically, rendering the head semantic code based on the first Transformer module to obtain a head rendering result may include: performing volume rendering on the head semantic code to obtain a low-dimensional head feature map; rendering the low-dimensional head feature map with the head two-dimensional neural rendering module to obtain a first intermediate result; and processing the first intermediate result with the first Transformer module to obtain the head rendering result.
Step 404, rendering the torso semantic code based on the second Transformer module to obtain a torso rendering result.
Specifically, rendering the torso semantic code based on the second Transformer module to obtain a torso rendering result may include: performing volume rendering on the torso semantic code to obtain a low-dimensional torso feature map; rendering the low-dimensional torso feature map with the torso two-dimensional neural rendering module to obtain a second intermediate result; and processing the second intermediate result with the second Transformer module to obtain the torso rendering result.
Step 405, rendering the head rendering result and the torso rendering result to obtain the speaker video.
FIG. 5 shows the synthesis flow of a speaker video: first, a video parsing map and pose parameters are extracted from the video data, and wav2vec 2.0 features are extracted from the audio data; the implicit function then generates a head semantic code and a torso semantic code from the video parsing map, the pose parameters, and the wav2vec 2.0 features; the head neural radiance field generates a head rendering result from the head semantic code, and the torso neural radiance field generates a torso rendering result from the torso semantic code; finally, the head rendering result and the torso rendering result are rendered together to obtain the speaker video.
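End to end, inference under the same hypothetical interfaces reduces to the short pipeline below. The final compositing rule (overlaying the head render on the torso render with a parsing-derived mask) is one plausible reading of "rendering the head rendering result and the torso rendering result", not a detail the patent states:

```python
import torch

@torch.no_grad()
def synthesize_speaker_video(gen_model, video, audio):
    """Hypothetical inference pipeline mirroring FIG. 5; every attribute of
    gen_model used here is an assumed interface, not the patent's API."""
    parsing_map, pose = gen_model.video_processing(video)
    audio_feats = gen_model.wav2vec2(audio)
    head_code, torso_code = gen_model.implicit_fn(parsing_map, pose, audio_feats)

    head_frames = gen_model.head_field.render(head_code)    # (T, 3, H, W)
    torso_frames = gen_model.torso_field.render(torso_code) # (T, 3, H, W)

    # One plausible compositing rule (an assumption): overlay the head render
    # onto the torso render using a head mask derived from the parsing map.
    head_mask = (parsing_map == gen_model.HEAD_LABEL).float().unsqueeze(1)
    return head_mask * head_frames + (1.0 - head_mask) * torso_frames
```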
In summary, in the speaker video generation model using method provided by the embodiment of the application, a wav2vec 2.0 module is introduced on the basis of the AD-NeRF model, and the features of the speech data are extracted by the wav2vec 2.0 module, so that the extracted wav2vec 2.0 features are optimized; the architectures of the head neural radiance field and the torso neural radiance field are modified on the basis of the AD-NeRF model, a Transformer module is added, and the loss calculation strategy is optimized by means of a discriminator, which further improves the representation and image generation capability and effectively alleviates the problem of torso blurring.
Referring to FIG. 6, a block diagram of a speaker video generation model training apparatus according to an embodiment of the present application is shown; the apparatus may be applied to a computer device. The speaker video generation model training apparatus may include:
an acquisition module 610, configured to acquire a plurality of sets of training samples, wherein each set of training samples comprises video data, audio data, and a real head image and a real torso image of a speaker;
a generation module 620, configured to modify the AD-NeRF model to obtain a generation model, wherein a head neural radiance field in the generation model comprises a first Transformer module and a first discriminator, and a torso neural radiance field in the generation model comprises a second Transformer module and a second discriminator;
a processing module 630, configured to process, for each set of training samples, the video data and the audio data with the generation model to obtain a head semantic code and a torso semantic code;
the processing module 630 is further configured to render the head semantic code based on the first Transformer module to obtain a head rendering result, and to calculate a head loss between the head rendering result and the real head image using the first discriminator;
the processing module 630 is further configured to render the torso semantic code based on the second Transformer module to obtain a torso rendering result, and to calculate a torso loss between the torso rendering result and the real torso image using the second discriminator;
and a training module 640, configured to train the generation model using the head loss and the torso loss.
In an alternative embodiment, the generation model further includes a video processing module, a wav2vec 2.0 module, and an implicit function, and the processing module 630 is further configured to:
process the video data with the video processing module to obtain a video parsing map and pose parameters;
process the audio data with the wav2vec 2.0 module to obtain wav2vec 2.0 features;
and process the video parsing map, the pose parameters, and the wav2vec 2.0 features with the implicit function to obtain a head semantic code and a torso semantic code.
In an alternative embodiment, the processing module 630 is further configured to:
perform volume rendering on the head semantic code to obtain a low-dimensional head feature map;
render the low-dimensional head feature map with the head two-dimensional neural rendering module to obtain a first intermediate result;
and process the first intermediate result with the first Transformer module to obtain the head rendering result.
In an alternative embodiment, the processing module 630 is further configured to:
perform volume rendering on the torso semantic code to obtain a low-dimensional torso feature map;
render the low-dimensional torso feature map with the torso two-dimensional neural rendering module to obtain a second intermediate result;
and process the second intermediate result with the second Transformer module to obtain the torso rendering result.
In an alternative embodiment, the first discriminator and the second discriminator are GAN discriminators.
In summary, in the speaker video generation model training apparatus provided by the embodiment of the application, a wav2vec 2.0 module is introduced on the basis of the AD-NeRF model, and the features of the speech data are extracted by the wav2vec 2.0 module, so that the extracted wav2vec 2.0 features are optimized; the architectures of the head neural radiance field and the torso neural radiance field are modified on the basis of the AD-NeRF model, a Transformer module is added, and the loss calculation strategy is optimized by means of a discriminator, which further improves the representation and image generation capability and effectively alleviates the problem of torso blurring.
Referring to FIG. 7, a block diagram of a speaker video generation model using apparatus according to an embodiment of the present application is shown; the apparatus may be applied to a computer device. The speaker video generation model using apparatus may include:
an acquisition module 710, configured to acquire video data and audio data containing a speaker;
a processing module 720, configured to process the video data and the audio data with a trained generation model to obtain a head semantic code and a torso semantic code, wherein a head neural radiance field in the generation model comprises a first Transformer module and a first discriminator, and a torso neural radiance field in the generation model comprises a second Transformer module and a second discriminator;
a rendering module 730, configured to render the head semantic code based on the first Transformer module to obtain a head rendering result;
the rendering module 730 is further configured to render the torso semantic code based on the second Transformer module to obtain a torso rendering result;
and the rendering module 730 is further configured to render the head rendering result and the torso rendering result to obtain a speaker video.
In summary, in the speaker video generation model using apparatus provided by the embodiment of the application, a wav2vec 2.0 module is introduced on the basis of the AD-NeRF model, and the features of the speech data are extracted by the wav2vec 2.0 module, so that the extracted wav2vec 2.0 features are optimized; the architectures of the head neural radiance field and the torso neural radiance field are modified on the basis of the AD-NeRF model, a Transformer module is added, and the loss calculation strategy is optimized by means of a discriminator, which further improves the representation and image generation capability and effectively alleviates the problem of torso blurring.
One embodiment of the present application provides a computer-readable storage medium in which at least one instruction is stored, the instruction being loaded and executed by a processor to implement the speaker video generation model training method or the speaker video generation model using method described above.
One embodiment of the present application provides a computer device comprising a processor and a memory, the memory storing at least one instruction, the instruction being loaded and executed by the processor to implement the speaker video generation model training method or the speaker video generation model using method described above.
It will be understood by those skilled in the art that all or part of the steps of the above embodiments may be implemented by hardware, or by a program instructing the relevant hardware, where the program may be stored in a computer-readable storage medium such as a read-only memory, a magnetic disk, or an optical disc.
The above description is not intended to limit the embodiments of the application; any modification, equivalent replacement, or improvement made within the spirit and principles of the embodiments of the application shall be covered.
Claims (9)
1. A speaker video generation model training method, the method comprising:
acquiring a plurality of groups of training samples, wherein each group of training samples comprises video data, audio data, and a real head image and a real torso image of a speaker;
modifying the AD-NeRF model to obtain a generation model, wherein a head neural radiance field in the generation model comprises a first Transformer module and a first discriminator, and a torso neural radiance field in the generation model comprises a second Transformer module and a second discriminator;
for each group of training samples, processing the video data and the audio data with the generation model to obtain a head semantic code and a torso semantic code;
rendering the head semantic code based on the first Transformer module to obtain a head rendering result, and calculating a head loss between the head rendering result and the real head image using the first discriminator;
rendering the torso semantic code based on the second Transformer module to obtain a torso rendering result, and calculating a torso loss between the torso rendering result and the real torso image using the second discriminator;
and training the generation model using the head loss and the torso loss;
wherein the generation model further includes a video processing module, a wav2vec 2.0 module, and an implicit function, and processing the video data and the audio data with the generation model to obtain a head semantic code and a torso semantic code includes: processing the video data with the video processing module to obtain a video parsing map and pose parameters; processing the audio data with the wav2vec 2.0 module to obtain wav2vec 2.0 features; and processing the video parsing map, the pose parameters, and the wav2vec 2.0 features with the implicit function to obtain the head semantic code and the torso semantic code.
2. The speaker video generation model training method according to claim 1, wherein rendering the head semantic code based on the first Transformer module to obtain a head rendering result comprises:
performing volume rendering on the head semantic code to obtain a low-dimensional head feature map;
rendering the low-dimensional head feature map with a head two-dimensional neural rendering module to obtain a first intermediate result;
and processing the first intermediate result with the first Transformer module to obtain the head rendering result.
3. The speaker video generation model training method according to claim 1, wherein rendering the torso semantic code based on the second Transformer module to obtain a torso rendering result comprises:
performing volume rendering on the torso semantic code to obtain a low-dimensional torso feature map;
rendering the low-dimensional torso feature map with a torso two-dimensional neural rendering module to obtain a second intermediate result;
and processing the second intermediate result with the second Transformer module to obtain the torso rendering result.
4. The speaker video generation model training method according to any one of claims 1 to 3, wherein the first discriminator and the second discriminator are GAN discriminators.
5. A speaker video generation model using method, the method comprising:
acquiring video data and audio data containing a speaker;
processing the video data and the audio data with a trained generation model to obtain a head semantic code and a torso semantic code, wherein a head neural radiance field in the generation model comprises a first Transformer module and a first discriminator, and a torso neural radiance field in the generation model comprises a second Transformer module and a second discriminator;
rendering the head semantic code based on the first Transformer module to obtain a head rendering result;
rendering the torso semantic code based on the second Transformer module to obtain a torso rendering result;
and rendering the head rendering result and the torso rendering result to obtain a speaker video;
wherein the generation model further includes a video processing module, a wav2vec 2.0 module, and an implicit function, and processing the video data and the audio data with the trained generation model to obtain a head semantic code and a torso semantic code includes: processing the video data with the video processing module to obtain a video parsing map and pose parameters; processing the audio data with the wav2vec 2.0 module to obtain wav2vec 2.0 features; and processing the video parsing map, the pose parameters, and the wav2vec 2.0 features with the implicit function to obtain the head semantic code and the torso semantic code.
6. A speaker video generation model training apparatus, the apparatus comprising:
an acquisition module, configured to acquire a plurality of groups of training samples, wherein each group of training samples comprises video data, audio data, and a real head image and a real torso image of a speaker;
a generation module, configured to modify the AD-NeRF model to obtain a generation model, wherein a head neural radiance field in the generation model comprises a first Transformer module and a first discriminator, and a torso neural radiance field in the generation model comprises a second Transformer module and a second discriminator;
a processing module, configured to process, for each group of training samples, the video data and the audio data with the generation model to obtain a head semantic code and a torso semantic code;
the processing module is further configured to render the head semantic code based on the first Transformer module to obtain a head rendering result, and to calculate a head loss between the head rendering result and the real head image using the first discriminator;
the processing module is further configured to render the torso semantic code based on the second Transformer module to obtain a torso rendering result, and to calculate a torso loss between the torso rendering result and the real torso image using the second discriminator;
and a training module, configured to train the generation model using the head loss and the torso loss;
wherein the generation model further includes a video processing module, a wav2vec 2.0 module, and an implicit function, and the processing module is further configured to: process the video data with the video processing module to obtain a video parsing map and pose parameters; process the audio data with the wav2vec 2.0 module to obtain wav2vec 2.0 features; and process the video parsing map, the pose parameters, and the wav2vec 2.0 features with the implicit function to obtain the head semantic code and the torso semantic code.
7. A speaker video generation model using apparatus, the apparatus comprising:
an acquisition module, configured to acquire video data and audio data containing a speaker;
a processing module, configured to process the video data and the audio data with a trained generation model to obtain a head semantic code and a torso semantic code, wherein a head neural radiance field in the generation model comprises a first Transformer module and a first discriminator, and a torso neural radiance field in the generation model comprises a second Transformer module and a second discriminator;
a rendering module, configured to render the head semantic code based on the first Transformer module to obtain a head rendering result;
the rendering module is further configured to render the torso semantic code based on the second Transformer module to obtain a torso rendering result;
and the rendering module is further configured to render the head rendering result and the torso rendering result to obtain a speaker video;
wherein the generation model further includes a video processing module, a wav2vec 2.0 module, and an implicit function, and the processing module is further configured to: process the video data with the video processing module to obtain a video parsing map and pose parameters; process the audio data with the wav2vec 2.0 module to obtain wav2vec 2.0 features; and process the video parsing map, the pose parameters, and the wav2vec 2.0 features with the implicit function to obtain the head semantic code and the torso semantic code.
8. A computer-readable storage medium having stored therein at least one instruction, the instruction being loaded and executed by a processor to implement the speaker video generation model training method according to any one of claims 1 to 4, or to implement the speaker video generation model using method according to claim 5.
9. A computer device comprising a processor and a memory, the memory having stored therein at least one instruction, the instruction being loaded and executed by the processor to implement the speaker video generation model training method according to any one of claims 1 to 4, or to implement the speaker video generation model using method according to claim 5.
Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
---|---|---|---
CN202211631657.9A | 2022-12-19 | 2022-12-19 | Speaker video generation model training and using method, device and equipment
Publications (2)
Publication Number | Publication Date |
---|---|
CN115908662A CN115908662A (en) | 2023-04-04 |
CN115908662B true CN115908662B (en) | 2024-05-28 |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117689783B * | 2024-02-02 | 2024-04-30 | Hunan Malanshan Video Advanced Technology Research Institute Co., Ltd. | Face voice driving method and device based on hyperparameter neural radiance field |
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112887698A (en) * | 2021-02-04 | 2021-06-01 | 中国科学技术大学 | High-quality face voice driving method based on nerve radiation field |
CN113378697A (en) * | 2021-06-08 | 2021-09-10 | 安徽大学 | Method and device for generating speaking face video based on convolutional neural network |
CN113793408A (en) * | 2021-09-15 | 2021-12-14 | 宿迁硅基智能科技有限公司 | Real-time audio-driven face generation method and device and server |
CN113822969A (en) * | 2021-09-15 | 2021-12-21 | 宿迁硅基智能科技有限公司 | Method, device and server for training nerve radiation field model and face generation |
Non-Patent Citations (4)

Title
---
Shunyu Yao et al.; DFA-NeRF: Personalized Talking Head Generation via Disentangled Face Attributes Neural Rendering; arxiv.org; 2022-01-03 *
Yang Zhou et al.; MakeItTalk: speaker-aware talking-head animation; ACM Transactions on Graphics; vol. 39, no. 6; 2020-11-27 *
Ricong Huang et al.; Audio-driven Talking Head Generation with Transformer and 3D Morphable Model; MM '22: Proceedings of the 30th ACM International Conference on Multimedia; 2022 *
Su Hongqi et al.; A survey of talking face video generation driven by audio and motion; Computer and Image Technology; no. 21; 2022-11-01 *
Also Published As
Publication number | Publication date |
---|---|
CN115908662A (en) | 2023-04-04 |
Legal Events

Code | Title
---|---
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant