CN117115317A - Avatar driving and model training method, apparatus, device and storage medium - Google Patents

Avatar driving and model training method, apparatus, device and storage medium

Info

Publication number
CN117115317A
Authority
CN
China
Prior art keywords: target, loss function, action, sequence, parameter sequence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311008609.9A
Other languages
Chinese (zh)
Inventor
陈毅
赵亚飞
范锡睿
杜宗财
王志强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202311008609.9A
Publication of CN117115317A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 13/00: Animation
    • G06T 13/20: 3D [Three Dimensional] animation
    • G06T 13/40: 3D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/33: Querying
    • G06F 16/332: Query formulation
    • G06F 16/3329: Natural language query formulation or dialogue systems
    • G06F 16/3331: Query processing
    • G06F 16/334: Query execution
    • G06F 16/3344: Query execution using natural language analysis
    • G06F 16/34: Browsing; Visualisation therefor

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Human Computer Interaction (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The present disclosure provides an avatar driving and model training method, apparatus, device and storage medium, relating to the technical field of artificial intelligence, in particular to the technical fields of augmented reality, virtual reality, computer vision, deep learning, large models and the like, and applicable to scenes such as the metaverse and virtual digital humans. The avatar driving method includes: determining driving information of the avatar, the driving information including a target action; generating an initial action parameter sequence based on a pre-generated base video and the target action; reconstructing the initial action parameter sequence to generate a target action parameter sequence; and driving the avatar to execute a corresponding action based on the target action parameter sequence. The present disclosure can improve the smoothness of the avatar's motion.

Description

Avatar driving and model training method, apparatus, device and storage medium
Technical Field
The present disclosure relates to the technical field of artificial intelligence, in particular to the technical fields of augmented reality, virtual reality, computer vision, deep learning, large models and the like, can be applied to scenes such as the metaverse and virtual digital humans, and in particular relates to an avatar driving and model training method, apparatus, device and storage medium.
Background
With the development of artificial intelligence technology, two-dimensional (2D) or three-dimensional (3D) digital people are widely used, for example as virtual anchors, virtual customer service agents and virtual idols.
Large models have excellent natural language processing capability and can enhance the dialogue interaction between a digital person and a user.
For digital people, how to perform smoother action driving is a problem to be solved.
Disclosure of Invention
The present disclosure provides an avatar driving method, apparatus, device, and storage medium.
According to an aspect of the present disclosure, there is provided an avatar driving method including: determining driving information of the avatar, the driving information including: a target action; generating an initial action parameter sequence based on a pre-generated base video and the target action; reconstructing the initial action parameter sequence to generate a target action parameter sequence; and driving the avatar to execute a corresponding action based on the target action parameter sequence.
According to another aspect of the present disclosure, there is provided a training method of a sequence reconstruction model, the sequence reconstruction model including: an encoder and decoder, the method comprising: adopting the encoder to encode the real action parameter sequence to obtain an intermediate representation; adopting the decoder to decode the intermediate representation to obtain a predicted action parameter sequence; obtaining a real time correlation loss function based on the real action parameter sequence, and obtaining a predicted time correlation loss function based on the predicted action parameter sequence; constructing a total loss function based on the real time-dependent loss function and the predicted time-dependent loss function; and adjusting model parameters of the encoder and model parameters of the decoder by adopting the total loss function.
According to another aspect of the present disclosure, there is provided an avatar driving apparatus including: a determining module for determining driving information of the avatar, the driving information including: a target action; the generation module is used for generating an initial action parameter sequence based on the pre-generated base video and the target action; the reconstruction module is used for reconstructing the initial action parameter sequence to generate a target action parameter sequence; and the action driving module is used for driving the virtual image to execute corresponding actions based on the target action parameter sequence.
According to another aspect of the present disclosure, there is provided a training apparatus of a sequence reconstruction model, the sequence reconstruction model including: an encoder and decoder, the apparatus comprising: the coding module is used for coding the real action parameter sequence by adopting the coder so as to obtain an intermediate representation; the decoding module is used for decoding the intermediate representation by adopting the decoder so as to obtain a predicted action parameter sequence; the acquisition module is used for acquiring a real time correlation loss function based on the real action parameter sequence and acquiring a predicted time correlation loss function based on the predicted action parameter sequence; the construction module is used for constructing a total loss function based on the real time correlation loss function and the predicted time correlation loss function; and the adjusting module is used for adjusting the model parameters of the encoder and the model parameters of the decoder by adopting the total loss function.
According to another aspect of the present disclosure, there is provided an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of the above aspects.
According to another aspect of the present disclosure, there is provided a non-transitory computer readable storage medium storing computer instructions for causing the computer to perform the method according to any one of the above aspects.
According to another aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements a method according to any of the above aspects.
According to the technical solution of the present disclosure, the smoothness of the avatar's motion can be improved.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The drawings are for a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a schematic diagram according to a first embodiment of the present disclosure;
fig. 2 is a schematic diagram of an application scenario provided according to an embodiment of the present disclosure;
FIG. 3 is a schematic diagram according to a second embodiment of the present disclosure;
FIG. 4 is a schematic diagram according to a third embodiment of the present disclosure;
FIG. 5 is a schematic diagram according to a fourth embodiment of the present disclosure;
FIG. 6 is a schematic diagram according to a fifth embodiment of the present disclosure;
fig. 7 is a schematic diagram of an electronic device for implementing the avatar driving method or training method of the sequence reconstruction model according to an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
In the related art, when a digital person performs action driving, actions are connected to one another by frames in an idle state (for example, a stationary standard standing posture). This scheme can realize intelligent action expression, but because idle-state frames are used for the connection, the digital person as a whole looks relatively stiff; moreover, limited by the idle-state frames, each action starts from the idle state and finally returns to the idle state, which makes the digital person look even stiffer.
In order to enhance the smoothness of the motion of the avatar, the present disclosure provides the following embodiments.
Fig. 1 is a schematic diagram according to a first embodiment of the present disclosure. The present embodiment provides an avatar driving method, which includes:
101. determining driving information of the avatar, the driving information including: the target action.
102. And generating an initial action parameter sequence based on the pre-generated base video and the target action.
103. And reconstructing the initial action parameter sequence to generate a target action parameter sequence.
104. And driving the avatar to execute a corresponding action based on the target action parameter sequence.
Wherein the avatar may be a human figure (digital human), a cartoon animal figure, etc. Taking the avatar as an example of a digital person, a user (e.g., a real world person) may interact with the digital person, during which the user may input text to the digital person, which feeds back corresponding output text based on the text input by the user.
The text entered by the user to the digital person may be referred to as question text and the output text fed back by the digital person to the user may be referred to as reply text.
In order to improve interactive expressive force, a digital person may make some actions and expressions when feeding back a reply text to a user, so that the effect is more realistic.
The action determined based on the reply text may be referred to as a target action, the expression determined based on the reply text may be referred to as a target expression, and the target action and the target expression may be collectively referred to as driving information.
The base video is a generic video, which is pre-generated, and then different actions can be added on the base video based on the difference of the reply texts.
The base video is composed of a plurality of images having a time-series relationship, for example a first image at a first moment, a second image at a second moment, and so on. After the target time is determined, the target action can be added to the image corresponding to the target time. For example, if the target time is the second moment and the target action is raising a hand, the target action is added to the second image, i.e., the digital person's action in the second image is determined to be a hand-raising action.
In addition, the actions in the images corresponding to the moments other than the target time can keep the original actions in the base video; for example, if the original action in the first image is a hand-hanging action, the action in the first image remains the hand-hanging action.
The action can be driven based on action parameters, which may specifically be the angle information of the digital person's key points (joints). Actions and action parameters have a one-to-one correspondence, so once an action is determined its action parameters can be obtained; for example, the action parameter of the hand-hanging action is a first angle, the action parameter of the hand-raising action is a second angle, and so on.
Because the base video is a group of image sequences, after the target action is added in the base video, the action parameter sequence corresponding to the group of image sequences is called an initial action parameter sequence.
For example, the base video includes the first image, the second image, and the like, where the first image corresponds to a hanging motion, the motion parameter is a first angle, the second image corresponds to a lifting motion, the motion parameter is a second angle, and the initial motion parameter sequence is the first angle, the second angle, and the like.
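As a minimal illustration of how such an initial action parameter sequence could be assembled, the following sketch keeps the base action parameters of the base video and overwrites the frame at the target action time with the target action parameter; the helper name is hypothetical and a single joint angle per frame is assumed for simplicity:

```python
from typing import List

def build_initial_action_parameters(
    base_action_params: List[float],   # one joint angle per base-video frame, in time order
    target_action_param: float,        # joint angle of the target action
    target_frame_index: int,           # frame corresponding to the target action time
) -> List[float]:
    """Keep the base video's action parameters for all other frames and
    insert the target action parameter at the target action time."""
    initial_sequence = list(base_action_params)
    initial_sequence[target_frame_index] = target_action_param
    return initial_sequence

# Hand-hanging frames at 90 degrees, with a hand-raising action (0 degrees)
# inserted at the second frame (the target action time).
print(build_initial_action_parameters([90.0, 90.0, 90.0], 0.0, 1))  # [90.0, 0.0, 90.0]
```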
In the related art, after the target action is determined, it is added to the base video to obtain the target video directly, which leads to unsmooth action. For example, the action changes from the hand-hanging action in the first image to the hand-raising action in the second image and back to the hand-hanging action in the third image. Assuming the hand-hanging action corresponds to 90 degrees (arm vertical) and the hand-raising action to 0 degrees (arm horizontal), the angle jumps directly from 90 degrees to 0 degrees and then back from 0 degrees to 90 degrees, which looks rigid and unsmooth. A natural, smooth real scene should instead change gradually from 90 degrees to 0 degrees and back; for example, the first image is 90 degrees, the second 60 degrees, the third 30 degrees, the fourth 0 degrees, and so on, rather than jumping from 90 degrees in the first image directly to 0 degrees in the second.
In order to avoid the problem of unsmooth motion driving based directly on the initial motion parameters, in this embodiment, the initial motion parameter sequence is reconstructed, and the motion parameter sequence obtained after the reconstruction is referred to as a target motion parameter sequence, such as 90 degrees, 60 degrees, 30 degrees, 0 degrees, and the like.
After the target action parameter sequence is obtained, the digital person can be driven to execute corresponding actions based on the target action parameter sequence, for example, the digital person is driven to execute the corresponding actions according to the joint angles, so that smoother and natural actions are obtained.
In this embodiment, the target action parameter sequence is obtained by reconstructing the initial action parameter sequence, and the avatar is driven to execute the corresponding action based on the target action parameter sequence, so that smoother and more natural motion of the avatar can be obtained.
In order to better understand the embodiments of the present disclosure, application scenarios to which the embodiments of the present disclosure may be applied are described.
Fig. 2 is a schematic diagram of an application scenario provided in an embodiment of the present disclosure. The scene comprises: user terminal 201 and server 202, user terminal 201 may include: personal computers (Personal Computer, PCs), cell phones, tablet computers, notebook computers, smart wearable devices, and the like. The server 202 may be a cloud server or a local server, and the user terminal 201 and the server 202 may communicate using a communication network, for example, a wired network and/or a wireless network.
Taking the avatar as a digital person as an example, an application (APP) can be installed in the user terminal, and the APP can provide a digital person service; that is, the user can interact, through the APP installed on the user terminal, with the digital person service deployed on the server. Specifically, the user inputs question content to the digital person, in voice or text form; the digital person obtains corresponding reply content based on the question content and feeds the reply content back to the user, also in voice or text form.
If the user inputs the question content in voice form, a speech recognition algorithm can be used to convert the voice into text, referred to as the question text; a model is then used to process the question text obtained from speech conversion, or the question text directly input by the user, to obtain a reply text. In a voice-feedback scenario, the reply text can be converted into speech and fed back to the user.
In order to improve interactive expressiveness, when the digital person generates the reply text, corresponding driving information can be determined, and the driving information may include a target action and a target action time, and may also include a target expression.
To improve accuracy, a large model (Large Language Model, LLM) may be employed to generate reply text based on the question text, and to determine driving information corresponding to the reply text.
LLMs have been a hot topic in the artificial intelligence field in recent years. An LLM is a pre-trained language model that learns rich linguistic knowledge and world knowledge by pre-training on massive text data, and can achieve remarkable results on various natural language processing (Natural Language Processing, NLP) tasks. ERNIE Bot, ChatGPT and the like are applications developed on the basis of LLMs; they can generate fluent, logical and creative text content and even hold natural dialogues with humans. Specifically, the large model may be a Transformer-based Generative Pre-training (GPT) model, an Enhanced Representation through Knowledge Integration (ERNIE) model, and the like.
Therefore, the problem text is processed based on LLM to obtain the reply text and the driving information thereof, so that more accurate reply text and the corresponding driving information thereof can be obtained, and the interaction effect is improved.
In order to simulate the interaction effect of a real person, the base video can be recorded in advance, and the dynamic change process of a real person can be simulated by playing the base video. The base video may be a segment (e.g., 100 s) of animation video in which the animated character does not stand completely still but has small limb movements, simulating a real person speaking while standing normally.
The base video is generic, i.e. different reply texts share the base video. In order to improve interactive expressive force, aiming at different reply texts, target expressions and target actions corresponding to the different reply texts can be added into a base video, so that when different reply texts are fed back by a digital person, different expressions and actions can be matched.
Taking the target motion as an example, the motion parameter sequence generated based on the base video and the target motion may be referred to as an initial motion parameter sequence. The initial motion parameter sequence may characterize the state of the base video after adding the target motion.
If the digital person is directly driven to execute the corresponding action based on the initial action parameter sequence, the transition between the target action and the original actions in the base video may be unsmooth, which affects the display effect.
For this reason, as shown in fig. 2, in the present embodiment, the initial motion parameter sequence is subjected to a reconstruction process (sequence reconstruction) to obtain a target motion parameter sequence, and then the digital person is driven to execute a corresponding motion (motion driving) based on the target motion parameter sequence.
In combination with the above application scenario, the present disclosure further provides the following embodiments.
Fig. 3 is a schematic view of a second embodiment according to the present disclosure, which provides an avatar driving method including:
301. And processing the question text by adopting a large model to determine a reply text corresponding to the question text and driving information corresponding to the reply text.
In combination with the application scenario, the large model (LLM) is a pre-training language model with excellent performance, so that the adoption of the large model to determine the reply text and the driving information thereof can utilize the excellent performance of the large model to improve accuracy and applicability.
Where the question text may be entered into a large model that outputs relevant parameters including the reply text and the driving information.
Taking the driving information including the target expression and the target action as an example, the relevant parameters of the driving information may specifically include: expression labels, action labels and target action time.
For example, the reply text is: Hello, Ms. Wang, what would you like to ask me?
After adding the expression tag and the action tag, it is converted into:
<natural> Hello, <natural> Ms. Wang [greet], <natural> what would you like to ask me?
Wherein <natural> is an expression tag and [greet] is an action tag.
The target action time, such as the time corresponding to [greet], is obtained from the large model.
The target expression time, such as the time corresponding to <natural>, may be a preset fixed position; for example, based on the above example, an expression tag is added in front of each phrase.
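As a rough sketch of how such an annotated reply could be turned into driving information, the following parses expression tags and action tags out of the text; the tag formats (<natural>, [greet]) and the helper name are illustrative assumptions, not the exact output format of the large model:

```python
import re

def parse_driving_tags(annotated_reply: str) -> dict:
    """Extract expression tags (<...>) and action tags ([...]) from an
    annotated reply text, and recover the plain reply text."""
    expressions = re.findall(r"<\s*([^>]+?)\s*>", annotated_reply)
    actions = re.findall(r"\[\s*([^\]]+?)\s*\]", annotated_reply)
    plain_text = re.sub(r"\s+", " ", re.sub(r"<[^>]*>|\[[^\]]*\]", " ", annotated_reply)).strip()
    return {"expressions": expressions, "actions": actions, "reply_text": plain_text}

tags = parse_driving_tags(
    "<natural> Hello, <natural> Ms. Wang [greet], <natural> what would you like to ask me?"
)
# tags["actions"] == ["greet"]; tags["expressions"] == ["natural", "natural", "natural"]
```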
302. And generating an initial action parameter sequence based on the pre-generated base video and the target action.
Wherein, a segment (e.g., 100 s) of animation video can be preset as the base video, in which the animated character does not stand completely still but has small limb movements, simulating a real person speaking while standing normally.
After determining the target motion, the target motion may be inserted into the base video, and the corresponding motion parameter sequence is referred to as an initial motion parameter sequence.
Specifically, among a plurality of base images included in the base video, determining other base images than the base image corresponding to the target action time; acquiring substrate action parameters corresponding to the other substrate images; and forming the initial action parameter sequence by the base action parameter and the target action parameter corresponding to the target action.
For example, the base video includes a first image, a second image, and a third image (these three images are collectively referred to as base images). Since they come from a video, these images have timing information; assume the timing information is t1, t2, t3, where t1 is the earliest and t3 is the latest, i.e., the first image is played first, then the second image, and then the third image. Assuming that the target action time is t2, the base image corresponding to the target action time is the second image, i.e., the target action needs to be inserted into the second image; the other base images are the first image and the third image.
The motion parameters corresponding to the motion of the digital person in the first image and the third image are called base motion parameters, and the motion parameters corresponding to the target motion are called target motion parameters. The motion parameter may specifically be joint angle information. Thereafter, an initial sequence of motion parameters may be composed of the base motion parameters and the target motion parameters. When the sequence is formed, the sequence is still combined according to the time sequence information of the corresponding images, for example, the action parameters corresponding to the three images included in the base video are a first action parameter, a second action parameter and a third action parameter, and the target action corresponds to the second image, and the initial action parameter sequence is as follows: a first action parameter, a target action parameter and a third action parameter.
In this embodiment, an initial motion parameter sequence is formed based on the substrate motion parameters and the target motion parameters corresponding to other substrate images, so that an accurate initial motion parameter sequence can be obtained.
303. And adopting a pre-trained sequence reconstruction model to reconstruct the input initial motion parameter sequence so as to output a target motion parameter sequence.
Because directly driving the digital person to execute the corresponding action based on the initial action parameter sequence may lead to incoherent actions, in this embodiment the initial action parameter sequence can be reconstructed to obtain a target action parameter sequence corresponding to more coherent, better-performing actions, so that the final actions are more coherent and smooth.
Taking the example that the motion parameter is a joint angle, the initial motion parameter sequence may be: 90 degrees, 0 degrees, 90 degrees, etc., the target action parameter sequence is: 90 degrees, 60 degrees, 30 degrees, etc.
In particular, a deep learning model, which may be referred to as a sequence reconstruction model, may be employed to obtain a target sequence of motion parameters based on the initial sequence of motion parameters.
In this embodiment, the sequence reconstruction model is used to process the initial motion parameter sequence to obtain the target motion parameter sequence, so that the excellent performance of the sequence reconstruction model can be utilized to improve the reconstruction efficiency and accuracy.
Further, the sequence reconstruction model may include an encoder and a decoder: the encoder is used to process the input initial action parameter sequence to obtain an intermediate representation, and the decoder is used to process the intermediate representation to output the target action parameter sequence.
In this embodiment, the encoder and decoder can simply and efficiently reconstruct the time series.
In particular, the sequence reconstruction model may be a model of the kind used by the Vector Quantized Variational AutoEncoder (VQVAE) algorithm.
An AutoEncoder (AE) mainly comprises an encoder and a decoder. Taking an image as the input, the encoder compresses the image into a hidden feature (latent feature), i.e., the intermediate representation; the input of the decoder is the hidden feature, and its output is the reconstructed image.
A Variational AutoEncoder (VAE) differs from an AE in that, whereas an AE learns the representation (hidden feature) z itself, a VAE learns the distribution of z (assumed to be a normal distribution), samples the learned distribution to obtain a representation z, and reconstructs the image based on z.
A Vector Quantized Variational AutoEncoder (VQVAE) differs from a VAE in that the VAE uses a fixed prior distribution, i.e., a normal distribution, whereas VQVAE learns the prior distribution with an autoregressive model. In addition, the VAE uses a continuous intermediate representation z, while VQVAE discretizes the continuous representation output by the encoder to obtain a discrete intermediate representation; the decoder then processes the input discrete intermediate representation to obtain the reconstruction result.
VQVAE can be used for image processing, i.e. the input is the original image and the output is the reconstructed image.
In the present embodiment, however, VQVAE is applied to a time-series reconstruction scene to reconstruct the action parameter sequence; accordingly, the input is the initial action parameter sequence and the output is the target action parameter sequence.
The encoder may include one or more convolutional layers.
In the image processing scene, the information of the space dimension is convolved; in the time-series scenario of the present embodiment, convolution processing is performed on the information of the time dimension.
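The following is a minimal PyTorch sketch of a VQVAE-style sequence reconstruction model of this kind, with 1D convolutions over the time dimension and a small codebook; the class name, layer sizes and codebook size are illustrative assumptions rather than the architecture actually used in this disclosure, and the codebook/commitment losses of a full VQVAE are omitted for brevity:

```python
import torch
import torch.nn as nn

class SequenceReconstructionVQVAE(nn.Module):
    """Encoder -> quantizer -> decoder over an action parameter sequence.
    Input/output shape: (batch, num_params, num_frames)."""

    def __init__(self, num_params: int, hidden: int = 64, codebook_size: int = 128):
        super().__init__()
        # Convolutions run over the time dimension of the parameter sequence.
        self.encoder = nn.Sequential(
            nn.Conv1d(num_params, hidden, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=3, padding=1),
        )
        self.codebook = nn.Embedding(codebook_size, hidden)
        self.decoder = nn.Sequential(
            nn.Conv1d(hidden, hidden, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(hidden, num_params, kernel_size=3, padding=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        z = self.encoder(x)                      # continuous intermediate representation
        z_t = z.permute(0, 2, 1)                 # (batch, frames, hidden)
        # Discretize: replace each frame's feature with its nearest codebook entry.
        distances = (z_t.unsqueeze(2) - self.codebook.weight).pow(2).sum(-1)
        indices = distances.argmin(dim=-1)       # (batch, frames)
        z_q = self.codebook(indices).permute(0, 2, 1)
        z_q = z + (z_q - z).detach()             # straight-through estimator
        return self.decoder(z_q)

# Reconstruct a short sequence of joint angles (1 parameter, 4 frames).
model = SequenceReconstructionVQVAE(num_params=1)
initial_sequence = torch.tensor([[[90.0, 0.0, 90.0, 90.0]]])
target_sequence = model(initial_sequence)   # (1, 1, 4) reconstructed action parameter sequence
```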
The sequence reconstruction model may be pre-trained. The model training process typically includes: constructing a total loss function, and adjusting the model parameters based on the total loss function. In this embodiment, in addition to the reconstruction loss function, the total loss function may include a time-dependent loss function, such as a velocity loss function and/or an acceleration loss function.
Wherein the velocity loss function is constructed based on the real velocity and the predicted velocity, and the acceleration loss function is constructed based on the real acceleration and the predicted acceleration.
Taking the velocity loss function as an example (the acceleration loss function can be handled similarly), it is constructed based on the predicted velocity and the real velocity. In the training stage, the input of the sequence reconstruction model is a real action parameter sequence and the output is a predicted action parameter sequence; the real velocity is calculated based on two adjacent real action parameters in the real action parameter sequence, and the predicted velocity is calculated based on two adjacent predicted action parameters in the predicted action parameter sequence.
Taking the real velocity as an example (the predicted velocity can be computed in the same way), assume that the action parameter is a joint angle, the two adjacent real action parameters are a first angle and a second angle respectively, and the time interval between the two adjacent images corresponding to these action parameters is t. Then the real velocity corresponding to the second angle is (second angle - first angle) / t, and similarly the acceleration is (second angle - first angle) / t².
In this embodiment, the total loss function is constructed based on the velocity loss function and/or the acceleration loss function, so that a time-dependent loss function can be introduced, which is beneficial to smooth motion, and thus more consistent and smooth motion is obtained.
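A rough sketch of such time-dependent losses is given below, assuming action parameter sequences shaped (batch, num_params, num_frames) and a fixed frame interval t; the L1 form of the losses and the definition of acceleration as the frame-to-frame change in velocity are illustrative choices, not necessarily those of this disclosure:

```python
import torch
import torch.nn.functional as F

def time_dependent_losses(real_seq: torch.Tensor, pred_seq: torch.Tensor, t: float = 1.0 / 25):
    """Velocity and acceleration losses built from adjacent-frame differences
    of the real and predicted action parameter sequences."""
    real_vel = (real_seq[..., 1:] - real_seq[..., :-1]) / t
    pred_vel = (pred_seq[..., 1:] - pred_seq[..., :-1]) / t
    real_acc = (real_vel[..., 1:] - real_vel[..., :-1]) / t
    pred_acc = (pred_vel[..., 1:] - pred_vel[..., :-1]) / t
    velocity_loss = F.l1_loss(pred_vel, real_vel)
    acceleration_loss = F.l1_loss(pred_acc, real_acc)
    return velocity_loss, acceleration_loss
```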
304. And driving the avatar to execute a corresponding action based on the target action parameter sequence.
For example, after sequence reconstruction the target action parameter sequence is 90 degrees, 60 degrees, 30 degrees, and so on; the avatar can then be driven to execute the actions corresponding to joint angles of 90 degrees, 60 degrees, 30 degrees, etc., so that the motion is smooth. If, instead, the driving were performed directly based on the initial action parameters, actions of 90 degrees and 0 degrees would be executed and the motion would not be smooth.
In addition, the driving information may further include a target expression, and thus the avatar may be driven to display a corresponding expression based on the target expression. That is, the method may further include:
305. and performing expression driving on the virtual image based on the target expression parameters corresponding to the target expression.
For example, if the target expression is natural, the avatar may be driven to exhibit the natural expression.
In addition, the motion driving and the expression driving may be driven based on the corresponding times. For example, performing expression driving at a target expression time so that an avatar in an image corresponding to the target expression time has the target expression; and performing action driving at the target action time so that the virtual image in the image corresponding to the target action time has the target action. The target expression time and the target action time may correspond to the same or different images, and if the target expression time and the target action time correspond to the same image, the avatar performs action driving and expression driving in the image, thereby having the target expression and the target action.
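For illustration only, a sketch of a frame-by-frame driving loop that applies action driving on every frame and expression driving on the frames matching the target expression time is shown below; the frame structure and the avatar interface names (set_joint_angles, set_expression, render_frame) are hypothetical:

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class FrameDrivingInfo:
    action_params: List[float]                                          # joint angles for this frame
    expression_params: Dict[str, float] = field(default_factory=dict)   # empty unless this frame matches the target expression time

def drive_avatar(frames: List[FrameDrivingInfo], avatar) -> None:
    """Apply action driving on every frame; apply expression driving only on
    frames that carry target expression parameters."""
    for frame in frames:
        avatar.set_joint_angles(frame.action_params)        # action driving
        if frame.expression_params:
            avatar.set_expression(frame.expression_params)  # expression driving
        avatar.render_frame()
```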
In this embodiment, the expression driving is performed on the avatar based on the target expression parameter, so that not only the action driving but also the expression driving can be realized for the avatar, the expressive force is improved, and the interaction effect is improved.
In addition, unlike the target action time, which is determined by the large model, the target expression time corresponding to the target expression may be fixedly set, for example fixed in front of each phrase of the reply text; accordingly, the avatar may be expression-driven at the preset target expression time based on the target expression parameter corresponding to the target expression.
In this embodiment, by setting the target expression time, the operation can be simplified, and the processing efficiency can be improved.
In addition, in the process of performing action driving and expression driving on the virtual image, the content of the reply text can be played through voice, so that the expressive force of the voice, the expression and the action of the virtual image can be enhanced.
Fig. 4 is a schematic diagram of a third embodiment of the present disclosure, which provides a training method of a sequence reconstruction model, the sequence reconstruction model including an encoder and a decoder, the method comprising:
401. and adopting the encoder to encode the real action parameter sequence so as to obtain the intermediate representation.
The real action parameter sequence is a training sample, and can be obtained in the existing sample set or obtained by means of collection and the like.
The input of the encoder is the sequence of real motion parameters and the output is an intermediate representation.
In addition, the real motion parameter sequence is a sequence in a time dimension, and the encoder may include a convolution layer, and when performing convolution processing using the convolution layer, the convolution processing in the time dimension is performed on the real motion parameter sequence.
For example, the real action parameter sequence includes a first action parameter, a second action parameter and a third action parameter in a time sequence arrangement, and then the first action parameter and the second action parameter can be subjected to convolution processing to obtain a convolution result of a first dimension; and carrying out convolution processing on the second action parameter and the third action parameter to obtain a convolution result of the second dimension and the like.
402. And adopting the decoder to decode the intermediate representation to obtain a predicted action parameter sequence.
Wherein the input of the decoder is an intermediate representation and the output is a predicted sequence of motion parameters.
In addition, the output of the encoder can be continuous intermediate representation, the discrete intermediate representation is obtained after the discrete processing of the continuous intermediate representation, the decoder decodes the input discrete intermediate representation, and the output is a predicted action parameter sequence.
403. And obtaining a real time correlation loss function based on the real action parameter sequence, and obtaining a predicted time correlation loss function based on the predicted action parameter sequence.
404. And constructing a total loss function based on the real time-dependent loss function and the predicted time-dependent loss function.
405. And adjusting model parameters of the encoder and model parameters of the decoder by adopting the total loss function.
In this embodiment, the total loss function is constructed based on the time-dependent loss function, so that the time-dependent loss function can be introduced, which is beneficial to smooth motion, and thus more consistent and smooth motion is obtained.
Wherein the time dependent loss function may be a velocity loss function and/or an acceleration loss function.
Wherein the velocity loss function is constructed based on the real velocity and the predicted velocity, and the acceleration loss function is constructed based on the real acceleration and the predicted acceleration.
Taking the velocity loss function as an example (the acceleration loss function can be handled similarly), it is constructed based on the predicted velocity and the real velocity. In the training stage, the input of the sequence reconstruction model is a real action parameter sequence and the output is a predicted action parameter sequence; the real velocity is calculated based on two adjacent real action parameters in the real action parameter sequence, and the predicted velocity is calculated based on two adjacent predicted action parameters in the predicted action parameter sequence.
Taking the real velocity as an example (the predicted velocity can be computed in the same way), assume that the action parameter is a joint angle, the two adjacent real action parameters are a first angle and a second angle respectively, and the time interval between the two adjacent images corresponding to these action parameters is t. Then the real velocity corresponding to the second angle is (second angle - first angle) / t, and similarly the acceleration is (second angle - first angle) / t².
After the time-dependent loss function is obtained, a total loss function may be constructed based on the time-dependent loss function, such as total loss function = velocity loss function + acceleration loss function + existing loss function, e.g. reconstructed loss function, etc.
After the total loss function is obtained, the model parameters can be adjusted by using the total loss function, for example, a Back Propagation (BP) algorithm is used to adjust the model parameters until a preset end condition is reached, for example, a preset number of times or model convergence is reached, so as to obtain a final sequence reconstruction model.
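A sketch of one training step along these lines is shown below, combining a reconstruction loss with the time-dependent losses and adjusting the encoder and decoder parameters by back propagation; it reuses the illustrative SequenceReconstructionVQVAE and time_dependent_losses helpers sketched above, and the optimizer choice and equal loss weights are assumptions:

```python
import torch
import torch.nn.functional as F

# The illustrative model and loss helpers from the earlier sketches are assumed to be in scope.
model = SequenceReconstructionVQVAE(num_params=1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

def training_step(real_seq: torch.Tensor) -> float:
    """One step: encode/decode the real action parameter sequence, build the
    total loss, and back-propagate to adjust encoder and decoder parameters."""
    pred_seq = model(real_seq)
    reconstruction_loss = F.l1_loss(pred_seq, real_seq)
    velocity_loss, acceleration_loss = time_dependent_losses(real_seq, pred_seq)
    total_loss = reconstruction_loss + velocity_loss + acceleration_loss
    optimizer.zero_grad()
    total_loss.backward()   # back propagation
    optimizer.step()
    return total_loss.item()
```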
The final sequence reconstruction model can be used in the virtual image driving process, and specifically, the sequence reconstruction model generated by training can be adopted to reconstruct the initial action parameter sequence so as to obtain the target action parameter sequence.
In this embodiment, the time-dependent loss function may be efficiently obtained by using the velocity loss function and/or the acceleration loss function as the time-dependent loss function, thereby improving the processing efficiency.
Fig. 5 is a schematic diagram according to a fourth embodiment of the present disclosure. The present embodiment provides an avatar driving apparatus, as shown in fig. 5, the apparatus 500 including: a determination module 501, a generation module 502, a reconstruction module 503, and an action driving module 504.
The determining module 501 is configured to determine driving information of the avatar, the driving information including: a target action; the generating module 502 is configured to generate an initial motion parameter sequence based on a pre-generated base video and the target motion; the reconstruction module 503 is configured to perform reconstruction processing on the initial motion parameter sequence to generate a target motion parameter sequence; the action driving module 504 is configured to drive the avatar to perform a corresponding action based on the target action parameter sequence.
In this embodiment, the target action parameter sequence is obtained by reconstructing the initial action parameter sequence, and the avatar is driven to execute the corresponding action based on the target action parameter sequence, so that smoother and more natural motion of the avatar can be obtained.
The determining module 501 is further configured to: process the question text by adopting a large model to determine a reply text corresponding to the question text and driving information corresponding to the reply text.
In this embodiment, the reply text and the driving information thereof are determined by using the large model, so that the accuracy and applicability can be improved by using the excellent performance of the large model.
In some embodiments, the driving information further includes: target action time; the generating module 502 is further configured to: determining other substrate images except the substrate image corresponding to the target action time in a plurality of substrate images included in the substrate video; acquiring substrate action parameters corresponding to the other substrate images; and forming the initial action parameter sequence by the base action parameter and the target action parameter corresponding to the target action.
In this embodiment, an initial motion parameter sequence is formed based on the substrate motion parameters and the target motion parameters corresponding to other substrate images, so that an accurate initial motion parameter sequence can be obtained.
In some embodiments, the reconstruction module 503 is further configured to: and adopting a pre-trained sequence reconstruction model to reconstruct the input initial motion parameter sequence so as to output the target motion parameter sequence.
In this embodiment, the sequence reconstruction model is used to process the initial motion parameter sequence to obtain the target motion parameter sequence, so that the excellent performance of the sequence reconstruction model can be utilized to improve the reconstruction efficiency and accuracy.
In some embodiments, the sequence reconstruction model comprises: an encoder and a decoder; the reconstruction module 503 is further configured to: adopt the encoder to encode the input initial action parameter sequence to obtain an intermediate representation; and adopt the decoder to decode the intermediate representation so as to output the target action parameter sequence.
In this embodiment, the encoder and decoder can simply and efficiently reconstruct the time series.
In some embodiments, the sequence reconstruction model is trained based on a total loss function that is constructed based on a velocity loss function and/or an acceleration loss function.
In this embodiment, the total loss function is constructed based on the velocity loss function and/or the acceleration loss function, so that a time-dependent loss function can be introduced, which is beneficial to smooth motion, and thus more consistent and smooth motion is obtained.
In some embodiments, the driving information further includes: target expression; the apparatus further comprises: and the expression driving module is used for performing expression driving on the virtual image based on the target expression parameters corresponding to the target expression.
In this embodiment, the expression driving is performed on the avatar based on the target expression parameter, so that not only the action driving but also the expression driving can be realized for the avatar, the expressive force is improved, and the interaction effect is improved.
In some embodiments, the expression driving module is further configured to: and performing expression driving on the virtual image based on the target expression parameters corresponding to the target expression in a preset target expression time.
In this embodiment, by setting the target expression time, the operation can be simplified, and the processing efficiency can be improved.
Fig. 6 is a schematic diagram according to a fifth embodiment of the present disclosure. The present embodiment provides a training device for a sequence reconstruction model, where the sequence reconstruction model includes: encoder and decoder, as shown in fig. 6, the apparatus 600 includes: an encoding module 601, a decoding module 602, an acquisition module 603, a construction module 604 and an adjustment module 605.
The encoding module 601 is configured to encode the sequence of real motion parameters by using the encoder to obtain an intermediate representation; the decoding module 602 is configured to perform decoding processing on the intermediate representation by using the decoder to obtain a predicted motion parameter sequence; the obtaining module 603 is configured to obtain a real time-related loss function based on the real motion parameter sequence, and obtain a predicted time-related loss function based on the predicted motion parameter sequence; a construction module 604 is configured to construct a total loss function based on the real time-dependent loss function and the predicted time-dependent loss function; an adjustment module 605 is configured to adjust model parameters of the encoder and model parameters of the decoder using the total loss function.
In this embodiment, the total loss function is constructed based on the time-dependent loss function, so that the time-dependent loss function can be introduced, which is beneficial to smooth motion, and thus more consistent and smooth motion is obtained.
In some embodiments, the real time dependent loss function comprises: a true velocity loss function and/or a true acceleration loss function, the predicted time-dependent loss function comprising: a predicted speed loss function and/or a predicted acceleration loss function.
In this embodiment, the time-dependent loss function may be efficiently obtained by using the velocity loss function and/or the acceleration loss function as the time-dependent loss function, thereby improving the processing efficiency.
It is to be understood that in the embodiments of the disclosure, the same or similar content in different embodiments may be referred to each other.
It can be understood that "first", "second", etc. in the embodiments of the present disclosure are only used for distinguishing, and do not indicate the importance level, the time sequence, etc.
In the technical solution of the present disclosure, the collection, storage, use, processing, transmission, provision, disclosure and other handling of users' personal information comply with the provisions of relevant laws and regulations and do not violate public order and good customs.
According to embodiments of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium and a computer program product.
Fig. 7 illustrates a schematic block diagram of an example electronic device 700 that may be used to implement embodiments of the present disclosure. The electronic device 700 is intended to represent various forms of digital computers, such as laptops, desktops, workstations, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile apparatuses, such as personal digital assistants, cellular telephones, smartphones, wearable devices, and other similar computing apparatuses. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 7, the electronic device 700 includes a computing unit 701 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 702 or a computer program loaded from a storage unit 708 into a Random Access Memory (RAM) 703. In the RAM 703, various programs and data required for the operation of the electronic device 700 may also be stored. The computing unit 701, the ROM 702, and the RAM 703 are connected to each other through a bus 704. An input/output (I/O) interface 705 is also connected to bus 704.
Various components in the electronic device 700 are connected to the I/O interface 705, including: an input unit 706 such as a keyboard, a mouse, etc.; an output unit 707 such as various types of displays, speakers, and the like; a storage unit 708 such as a magnetic disk, an optical disk, or the like; and a communication unit 709 such as a network card, modem, wireless communication transceiver, etc. The communication unit 709 allows the electronic device 700 to exchange information/data with other devices through a computer network, such as the internet, and/or various telecommunication networks.
The computing unit 701 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of computing unit 701 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The calculation unit 701 performs the respective methods and processes described above, for example, an avatar driving method or a training method of a sequence reconstruction model. For example, in some embodiments, the avatar driving method or training method of the sequence reconstruction model may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 708. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 700 via the ROM 702 and/or the communication unit 709. When the computer program is loaded into the RAM 703 and executed by the calculation unit 701, one or more steps of the avatar driving method or training method of the sequence reconstruction model described above may be performed. Alternatively, in other embodiments, the computing unit 701 may be configured to perform the avatar driving method or the training method of the sequence reconstruction model in any other suitable way (e.g. by means of firmware).
Various implementations of the systems and techniques described here above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implemented in one or more computer programs, the one or more computer programs may be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a special purpose or general-purpose programmable processor, that may receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. These program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a background component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such background, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), wide Area Networks (WANs), and the internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server can be a cloud server, also called a cloud computing server or a cloud host, which is a host product in a cloud computing service system and overcomes the drawbacks of difficult management and weak service scalability found in traditional physical hosts and VPS ("Virtual Private Server") services. The server may also be a server of a distributed system or a server combined with a blockchain.
It should be appreciated that steps of the various flows shown above may be reordered, added, or deleted. For example, the steps recited in the present disclosure may be performed in parallel, sequentially, or in a different order, as long as the desired results of the technical solutions of the present disclosure are achieved; this is not limited herein.
The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.

Claims (23)

1. An avatar driving method, comprising:
determining driving information of the avatar, the driving information including: a target action;
generating an initial action parameter sequence based on a pre-generated base video and the target action;
reconstructing the initial action parameter sequence to generate a target action parameter sequence;
and driving the avatar to execute a corresponding action based on the target action parameter sequence.
2. The method of claim 1, wherein the determining the driving information of the avatar comprises:
and processing a question text by adopting a large model to determine a reply text corresponding to the question text and driving information corresponding to the reply text.
3. The method of claim 1, wherein,
the driving information further includes: target action time;
the generating an initial action parameter sequence based on the pre-generated base video and the target action comprises the following steps:
determining, among a plurality of base images included in the base video, other base images than the base image corresponding to the target action time;
acquiring base action parameters corresponding to the other base images;
and forming the initial action parameter sequence from the base action parameters and a target action parameter corresponding to the target action.
4. The method of claim 1, wherein the reconstructing the initial action parameter sequence to generate a target action parameter sequence comprises:
and adopting a pre-trained sequence reconstruction model to reconstruct the input initial action parameter sequence so as to output the target action parameter sequence.
5. The method of claim 4, wherein,
the sequence reconstruction model comprises: an encoder and a decoder;
the reconstructing the input initial action parameter sequence by adopting a pre-trained sequence reconstruction model to output the target action parameter sequence comprises the following steps:
adopting the encoder to encode the input initial action parameter sequence to obtain an intermediate representation;
and adopting the decoder to decode the intermediate representation so as to output the target action parameter sequence.
6. The method of claim 4, wherein the sequence reconstruction model is trained based on a total loss function constructed based on a velocity loss function and/or an acceleration loss function.
7. The method according to any one of claims 1 to 6, wherein,
the driving information further includes: target expression;
the method further comprises the steps of:
and performing expression driving on the avatar based on the target expression parameters corresponding to the target expression.
8. The method of claim 7, wherein the performing expression driving on the avatar based on the target expression parameters corresponding to the target expression comprises:
and performing expression driving on the avatar, within a preset target expression time, based on the target expression parameters corresponding to the target expression.
9. A training method of a sequence reconstruction model, the sequence reconstruction model comprising an encoder and a decoder, the method comprising:
adopting the encoder to encode the real action parameter sequence to obtain an intermediate representation;
adopting the decoder to decode the intermediate representation to obtain a predicted action parameter sequence;
obtaining a real time-correlation loss function based on the real action parameter sequence, and obtaining a predicted time-correlation loss function based on the predicted action parameter sequence;
constructing a total loss function based on the real time-correlation loss function and the predicted time-correlation loss function;
and adjusting model parameters of the encoder and model parameters of the decoder by adopting the total loss function.
10. The method of claim 9, wherein the real time-correlation loss function comprises a real velocity loss function and/or a real acceleration loss function, and the predicted time-correlation loss function comprises a predicted velocity loss function and/or a predicted acceleration loss function.
11. An avatar driving apparatus, comprising:
a determining module for determining driving information of the avatar, the driving information including: a target action;
the generation module is used for generating an initial action parameter sequence based on the pre-generated base video and the target action;
the reconstruction module is used for reconstructing the initial action parameter sequence to generate a target action parameter sequence;
and the action driving module is used for driving the avatar to execute a corresponding action based on the target action parameter sequence.
12. The apparatus of claim 11, wherein the determining module is further configured to:
and processing a question text by adopting a large model to determine a reply text corresponding to the question text and driving information corresponding to the reply text.
13. The apparatus of claim 11, wherein,
the driving information further includes: target action time;
the generation module is further configured to:
determining, among a plurality of base images included in the base video, other base images than the base image corresponding to the target action time;
acquiring base action parameters corresponding to the other base images;
and forming the initial action parameter sequence from the base action parameters and a target action parameter corresponding to the target action.
14. The apparatus of claim 11, wherein the reconstruction module is further configured to:
and adopting a pre-trained sequence reconstruction model to reconstruct the input initial action parameter sequence so as to output the target action parameter sequence.
15. The apparatus of claim 14, wherein,
the sequence reconstruction model comprises: an encoder and a decoder;
the reconstruction module is further configured to:
adopting the encoder to encode the input initial action parameter sequence to obtain an intermediate representation;
and adopting the decoder to decode the intermediate representation so as to output the target action parameter sequence.
16. The apparatus of claim 15, wherein the sequence reconstruction model is trained based on a total loss function constructed based on a velocity loss function and/or an acceleration loss function.
17. The apparatus according to any one of claims 11-16, wherein,
the driving information further includes: target expression;
The apparatus further comprises:
and the expression driving module is used for performing expression driving on the avatar based on the target expression parameters corresponding to the target expression.
18. The apparatus of claim 17, wherein the expression driving module is further configured to:
and performing expression driving on the avatar, within a preset target expression time, based on the target expression parameters corresponding to the target expression.
19. A training apparatus of a sequence reconstruction model, the sequence reconstruction model comprising an encoder and a decoder, the apparatus comprising:
the encoding module is used for encoding the real action parameter sequence by adopting the encoder so as to obtain an intermediate representation;
the decoding module is used for decoding the intermediate representation by adopting the decoder so as to obtain a predicted action parameter sequence;
the acquisition module is used for obtaining a real time-correlation loss function based on the real action parameter sequence and obtaining a predicted time-correlation loss function based on the predicted action parameter sequence;
the construction module is used for constructing a total loss function based on the real time-correlation loss function and the predicted time-correlation loss function;
And the adjusting module is used for adjusting the model parameters of the encoder and the model parameters of the decoder by adopting the total loss function.
20. The apparatus of claim 19, wherein the real time-correlation loss function comprises a real velocity loss function and/or a real acceleration loss function, and the predicted time-correlation loss function comprises a predicted velocity loss function and/or a predicted acceleration loss function.
21. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-10.
22. A non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1-10.
23. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any of claims 1-10.
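For readers less familiar with claim language, the following minimal Python sketch walks through the driving flow recited in claims 1-8: build an initial action parameter sequence from base-video parameters plus a target action spliced in at the target action time, reconstruct it into a target action parameter sequence, and drive the avatar frame by frame. Everything here is an assumption made for illustration and is not disclosed in this publication: the array shapes, the function names (build_initial_sequence, reconstruct, drive_avatar), and especially the moving-average smoother, which merely stands in for the pre-trained encoder-decoder sequence reconstruction model of claims 4-5.

```python
# Illustrative sketch only; names, shapes, and the smoothing surrogate are assumptions.
import numpy as np

def build_initial_sequence(base_action_params: np.ndarray,
                           target_action_params: np.ndarray,
                           target_start: int) -> np.ndarray:
    """Claim 3: keep the base-video action parameters outside the target action
    time and splice the target action parameters in at that time."""
    seq = base_action_params.copy()
    end = target_start + len(target_action_params)
    seq[target_start:end] = target_action_params  # frames at the target action time
    return seq

def reconstruct(seq: np.ndarray, smooth: int = 5) -> np.ndarray:
    """Stand-in for the pre-trained sequence reconstruction model (claims 4-5).
    A per-dimension moving average smooths the splice boundary; the patent uses
    a learned encoder-decoder instead of this hand-written filter."""
    kernel = np.ones(smooth) / smooth
    return np.stack(
        [np.convolve(seq[:, d], kernel, mode="same") for d in range(seq.shape[1])],
        axis=1,
    )

def drive_avatar(target_seq: np.ndarray) -> None:
    """Claim 1, last step: hand each frame's parameters to the renderer.
    A real system would call an avatar engine here; this stub only reports
    how many frames would be driven."""
    print(f"driving avatar with {len(target_seq)} frames of action parameters")

if __name__ == "__main__":
    base = np.zeros((100, 12))                                   # 100 base frames, 12 parameters each
    wave = np.sin(np.linspace(0, np.pi, 20))[:, None] * np.ones((1, 12))  # a 20-frame target action
    initial = build_initial_sequence(base, wave, target_start=40)
    target = reconstruct(initial)                                # temporally smoothed sequence
    drive_avatar(target)
```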
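A second sketch shows one plausible reading of the training method of claims 9-10: an encoder produces an intermediate representation of a real action parameter sequence, a decoder produces a predicted sequence, and the total loss combines velocity and acceleration terms computed from both the real and the predicted sequences. The GRU layers, layer sizes, the plain reconstruction term, the equal loss weights, and the Adam optimizer are all assumptions made for the example, not details given in the claims.

```python
# Illustrative training loop; network architecture and loss weighting are assumptions.
import torch
import torch.nn as nn

class SeqReconstructor(nn.Module):
    def __init__(self, dim: int = 12, hidden: int = 64):
        super().__init__()
        self.encoder = nn.GRU(dim, hidden, batch_first=True)    # claim 9: encode -> intermediate representation
        self.decoder = nn.GRU(hidden, hidden, batch_first=True) # claim 9: decode -> predicted sequence
        self.head = nn.Linear(hidden, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        z, _ = self.encoder(x)          # intermediate representation
        h, _ = self.decoder(z)
        return self.head(h)             # predicted action parameter sequence

def velocity(seq: torch.Tensor) -> torch.Tensor:
    # First-order time difference over the frame axis of a (batch, time, dim) tensor.
    return seq[:, 1:] - seq[:, :-1]

def acceleration(seq: torch.Tensor) -> torch.Tensor:
    # Second-order time difference.
    v = velocity(seq)
    return v[:, 1:] - v[:, :-1]

def total_loss(real: torch.Tensor, pred: torch.Tensor) -> torch.Tensor:
    """Claim 10 reading: velocity/acceleration terms from both the real and the
    predicted sequences, plus a plain reconstruction term (an assumption)."""
    rec = nn.functional.mse_loss(pred, real)
    vel = nn.functional.mse_loss(velocity(pred), velocity(real))
    acc = nn.functional.mse_loss(acceleration(pred), acceleration(real))
    return rec + vel + acc

model = SeqReconstructor()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
real_seq = torch.randn(8, 100, 12)       # batch of real action parameter sequences
for step in range(10):
    pred_seq = model(real_seq)
    loss = total_loss(real_seq, pred_seq)
    opt.zero_grad()
    loss.backward()
    opt.step()                           # claim 9: adjust encoder and decoder parameters
```

One design note: penalizing differences in velocity and acceleration, rather than frame values alone, is what makes such a loss "time-correlated" in spirit; it discourages jitter and abrupt transitions at the splice boundaries between base and target actions.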
CN202311008609.9A 2023-08-10 2023-08-10 Avatar driving and model training method, apparatus, device and storage medium Pending CN117115317A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311008609.9A CN117115317A (en) 2023-08-10 2023-08-10 Avatar driving and model training method, apparatus, device and storage medium

Publications (1)

Publication Number Publication Date
CN117115317A true CN117115317A (en) 2023-11-24

Family

ID=88797590

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311008609.9A Pending CN117115317A (en) 2023-08-10 2023-08-10 Avatar driving and model training method, apparatus, device and storage medium

Country Status (1)

Country Link
CN (1) CN117115317A (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114895817A (en) * 2022-05-24 2022-08-12 北京百度网讯科技有限公司 Interactive information processing method, and training method and device of network model
CN115345968A (en) * 2022-10-19 2022-11-15 北京百度网讯科技有限公司 Virtual object driving method, deep learning network training method and device
WO2023045317A1 (en) * 2021-09-23 2023-03-30 北京百度网讯科技有限公司 Expression driving method and apparatus, electronic device and storage medium
CN116074576A (en) * 2022-11-30 2023-05-05 北京百度网讯科技有限公司 Video generation method, device, electronic equipment and storage medium
CN116166125A (en) * 2023-03-03 2023-05-26 北京百度网讯科技有限公司 Avatar construction method, apparatus, device and storage medium
CN116392812A (en) * 2022-12-02 2023-07-07 阿里巴巴(中国)有限公司 Action generating method and virtual character animation generating method

Similar Documents

Publication Publication Date Title
CN109377544B (en) Human face three-dimensional image generation method and device and readable medium
WO2021169431A1 (en) Interaction method and apparatus, and electronic device and storage medium
CN113658309B (en) Three-dimensional reconstruction method, device, equipment and storage medium
CN115345980B (en) Generation method and device of personalized texture map
CN114895817B (en) Interactive information processing method, network model training method and device
KR101743764B1 (en) Method for providing ultra light-weight data animation type based on sensitivity avatar emoticon
CN115050354B (en) Digital human driving method and device
CN116342782A (en) Method and apparatus for generating avatar rendering model
CN112634413B (en) Method, apparatus, device and storage medium for generating model and generating 3D animation
CN112562045B (en) Method, apparatus, device and storage medium for generating model and generating 3D animation
CN113052962A (en) Model training method, information output method, device, equipment and storage medium
CN115359166B (en) Image generation method and device, electronic equipment and medium
EP4152269B1 (en) Method and apparatus of training model, device, and medium
CN114998490B (en) Virtual object generation method, device, equipment and storage medium
CN113327311B (en) Virtual character-based display method, device, equipment and storage medium
CN115775300A (en) Reconstruction method of human body model, training method and device of human body reconstruction model
CN117115317A (en) Avatar driving and model training method, apparatus, device and storage medium
CN114898018A (en) Animation generation method and device for digital object, electronic equipment and storage medium
CN115170703A (en) Virtual image driving method, device, electronic equipment and storage medium
CN114419182A (en) Image processing method and device
CN114529649A (en) Image processing method and device
CN113379879A (en) Interaction method, device, equipment, storage medium and computer program product
CN114022598B (en) Image processing method, model training method, device, electronic equipment and medium
CN116309977B (en) Face driving and model obtaining method and device, electronic equipment and storage medium
US20230394732A1 (en) Creating images, meshes, and talking animations from mouth shape data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination