CN116863042A - Motion generation method of virtual object and training method of motion generation model


Info

Publication number
CN116863042A
Authority
CN
China
Prior art keywords
action
sample
pose
sequence
motion
Prior art date
Legal status
Pending
Application number
CN202210282160.4A
Other languages
Chinese (zh)
Inventor
王伟强
者雪飞
暴林超
陈欢
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202210282160.4A
Publication of CN116863042A
Status: Pending


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 13/00 - Animation
    • G06T 13/20 - 3D [Three Dimensional] animation
    • G06T 13/40 - 3D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks

Abstract

The application discloses a method and an apparatus for generating actions of a virtual object, a training method and a training apparatus for an action generation model, a computer device, and a storage medium, belonging to the field of computer technologies. In the method, a potential action feature is obtained by random sampling, so that hidden information of the pose of the virtual object is represented in the probability space corresponding to a standard normal distribution. The action sequence of the virtual object is then predicted in an autoregressive manner in combination with bias information related to the action category to be generated, so that the action sequence both expresses the hidden information and satisfies the constraint of the action category. Because the pose of each past frame is introduced autoregressively to predict the pose of the current frame, the generation of the final action sequence does not depend on an RNN generator under a GAN architecture, the mode collapse problem is avoided, and action sequences with higher diversity can be generated.

Description

Motion generation method of virtual object and training method of motion generation model
Technical Field
The present application relates to the field of computer technologies, and in particular, to a method and an apparatus for generating an action of a virtual object, a training method and a training apparatus for an action generation model, a computer device, and a storage medium.
Background
With the development and progress of computer technology, the action generation of virtual objects (namely character action generation) has important practical significance for various computer vision tasks including multimedia interaction technology and vision information understanding. The action generating task of the virtual object can be summarized as: given a class of actions, a sequence of actions for a virtual object is automatically generated by the machine.
Currently, a GAN (Generative Adversarial Network) is generally used to generate actions of virtual objects. Under the GAN architecture, an RNN (Recurrent Neural Network) generator and a CNN (Convolutional Neural Network) discriminator perform adversarial learning. However, adversarial learning faces a mode collapse problem, that is, the RNN generator always tends to generate the same action sequence, resulting in poor generalization capability of the RNN generator.
Disclosure of Invention
The embodiment of the application provides a method and a device for generating actions of a virtual object, a training method and a training device of an action generation model, computer equipment and a storage medium, which can improve the diversity of generated action sequences. The technical scheme is as follows:
In one aspect, there is provided a method for generating an action of a virtual object, the method comprising:
randomly sampling to obtain a potential action feature, wherein the potential action feature is a random sampling value obtained by mapping the pose distribution of the virtual object to a standard normal distribution;
based on an action category to be generated, obtaining bias information of the potential action feature, wherein the bias information is used for representing an influence factor of the potential action feature under the action category;
and generating an action sequence of the virtual object in an autoregressive mode based on the potential action characteristics and the bias information, wherein the action sequence is used for representing the pose of the virtual object for executing the action corresponding to the action category.
In one aspect, a training method of an action generation model is provided, the method comprising:
based on sample action categories and sample action sequences, determining sample distribution obeyed by the sample action sequences, wherein the sample distribution has a mapping relation with pose distribution of a virtual object;
carrying out re-parameterization on the sample distribution through standard normal distribution, and sampling from the re-parameterized sample distribution to obtain sample potential characteristics;
Based on the sample action category, obtaining sample bias information of the sample potential features, wherein the sample bias information is used for representing influence factors on the sample potential features under the sample action category;
training an initial motion model based on the sample potential characteristics, the sample bias information and the sample motion sequence to obtain a motion generation model, wherein the motion generation model is used for generating a motion sequence of a virtual object to execute a motion corresponding to an input motion category in an autoregressive mode.
In one aspect, there is provided an action generating apparatus of a virtual object, the apparatus comprising:
the random sampling module is used for obtaining potential action characteristics through random sampling, wherein the potential action characteristics are random sampling values obtained by mapping the pose distribution of the virtual object to standard normal distribution;
the acquisition module is used for acquiring the bias information of the potential action characteristics based on the action category to be generated, wherein the bias information is used for representing the influence factors on the potential action characteristics under the action category;
and the generation module is used for generating an action sequence of the virtual object in an autoregressive mode based on the potential action characteristics and the bias information, wherein the action sequence is used for representing the pose of the virtual object for executing the action corresponding to the action category.
In one possible implementation, the generating module includes:
and the prediction unit is used for inputting the potential action characteristics, the bias information, the poses and the precursor poses of the poses into an action generation model for any pose in the action sequence, and predicting the next pose in the action sequence through the action generation model.
In one possible implementation, the prediction unit includes:
the first fusion subunit is used for fusing the potential action features and the bias information to obtain action bias features;
the second fusion subunit is used for fusing the pose, each precursor pose of the pose and the pose position characteristics to obtain precursor pose characteristics, wherein the pose position characteristics are used for representing the sequence of the pose and each precursor pose of the pose in time sequence;
and the pose decoding subunit is used for decoding to obtain the next pose based on the action bias characteristic and the preamble pose characteristic.
In one possible embodiment, the second fusion subunit is configured to:
performing full connection processing on the pose and each precursor pose of the pose to obtain initial pose characteristics;
And fusing the initial pose features and the pose position features to obtain the preamble pose features.
In one possible implementation, the action generation model includes a plurality of decoding units for decoding input features based on a self-attention mechanism;
the pose decoding subunit is configured to:
inputting the motion bias feature and the preamble pose feature into the plurality of decoding units, and outputting the hidden vector of the next pose by the last decoding unit;
and performing full connection processing on the hidden vector of the next pose to obtain the next pose.
In one possible implementation manner, the acquiring module is configured to:
and performing full connection processing on the single thermal codes of the action categories to obtain the offset information.
In one possible embodiment, the apparatus further comprises:
and the synthesis module is used for synthesizing an action fragment of the virtual object for executing the action corresponding to the action category based on the action sequence and the object model of the virtual object.
In one aspect, there is provided a training apparatus for an action generation model, the apparatus comprising:
the determining module is used for determining sample distribution obeyed by the sample action sequence based on the sample action category and the sample action sequence, wherein the sample distribution has a mapping relation with the pose distribution of the virtual object;
The sampling module is used for carrying out re-parameterization on the sample distribution through standard normal distribution, and sampling potential characteristics of the sample from the re-parameterized sample distribution;
the acquisition module is used for acquiring sample bias information of the sample potential features based on the sample action category, wherein the sample bias information is used for representing influence factors on the sample potential features under the sample action category;
the training module is used for training the initial motion model based on the sample potential characteristics, the sample bias information and the sample motion sequence to obtain a motion generation model, and the motion generation model is used for generating a motion sequence of the virtual object to execute the motion corresponding to the input motion category in an autoregressive mode.
In one possible implementation manner, the initial motion model includes a plurality of encoding units and a plurality of decoding units, the encoding units are used for predicting distribution parameters of the sample distribution, and the decoding units are used for decoding to obtain the next pose based on a self-attention mechanism, wherein the number of the encoding units and the number of the decoding units are the same.
In one possible implementation, the determining module is configured to:
Performing full connection processing on the sample action category to obtain sample category characteristics;
performing full connection processing on the sample action sequence to obtain a first action feature;
fusing the first action feature with a sample position feature of the sample action sequence to obtain a second action feature, wherein the sample position feature is used for representing the sequence of each pose in the sample action sequence;
inputting the second action feature and the sample category feature into the plurality of coding units, and outputting a target hidden vector associated with the sample distribution by a last coding unit;
and acquiring a distribution parameter for indicating the sample distribution based on the target hidden vector.
In one possible embodiment, the distribution parameters include a mean and a standard deviation of the sample distribution;
the sampling module is used for:
sampling from the standard normal distribution to obtain a re-parameterized adjustment factor;
correcting the standard deviation based on the adjustment factor to obtain a corrected standard deviation;
and determining the re-parameterized sample distribution based on the mean value and the corrected standard deviation.
In one possible implementation, the training module includes:
A decoding unit for generating a predicted motion sequence of the virtual object by the plurality of decoding units in the initial motion model based on the sample latent features and the sample bias information;
the acquisition unit is used for acquiring a loss function value of the iteration based on the sample action sequence and the prediction action sequence;
the iteration unit is used for iteratively adjusting the model parameters of the initial action model;
and an output unit configured to output the plurality of decoding units in the initial motion model as the motion generation model when the loss function value meets a stop condition.
In a possible implementation, the decoding unit is configured to:
fusing the sample potential features and the sample bias information to obtain sample action bias features;
carrying out scheduled sampling, based on a target probability, on the poses whose position sequence numbers are not greater than that of the latest pose, drawing each of them either from the predicted action sequence or from the corresponding sample pose in the sample action sequence, so as to obtain a preamble pose sequence, the target probability being used for representing the possibility that each pose in the preamble pose sequence is sampled from the predicted action sequence;
Fusing the preamble pose sequence and sample position features of the preamble pose sequence to obtain sample preamble pose features, wherein the sample position features are used for representing the sequence of each pose in the preamble pose sequence;
and decoding the sample motion bias characteristic and the sample preamble pose characteristic through the plurality of decoding units to obtain the next pose of the latest pose in the predicted motion sequence.
In one possible implementation, the target probability is positively correlated with a position sequence number of the latest pose in the predicted motion sequence.
In one possible implementation, the loss function value includes: at least one of a pose reconstruction penalty term for characterizing a difference between the sample motion sequence and the predicted motion sequence, a joint position reconstruction penalty term for characterizing a difference between a joint position sample sequence and a joint position prediction sequence of the virtual object, or a relative entropy penalty term for characterizing a difference between the sample distribution and a standard normal distribution; the joint position sample sequence is obtained based on the sample action sequence, and the joint position prediction sequence is obtained based on the prediction action sequence.
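As an illustrative sketch only, the three loss terms described above could be combined as follows; the weighting coefficient, the tensor shapes and the forward-kinematics helper joints_from_pose are assumptions and are not specified by the application.

    import torch
    import torch.nn.functional as F

    def total_loss(pred_seq, sample_seq, mu, log_var, joints_from_pose, w_kl=1e-3):
        # Pose reconstruction term: difference between the sample and predicted action sequences.
        pose_loss = F.mse_loss(pred_seq, sample_seq)

        # Joint position reconstruction term: difference between joint positions recovered
        # from the sample poses and from the predicted poses.
        joint_loss = F.mse_loss(joints_from_pose(pred_seq), joints_from_pose(sample_seq))

        # Relative entropy term: difference between the sample distribution N(mu, sigma^2)
        # predicted by the encoder and the standard normal distribution N(0, 1).
        kl_loss = -0.5 * torch.mean(1 + log_var - mu.pow(2) - log_var.exp())

        return pose_loss + joint_loss + w_kl * kl_loss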
In one aspect, a computer device is provided that includes one or more processors and one or more memories having stored therein at least one computer program loaded and executed by the one or more processors to implement a method of motion generation or a training method of a motion generation model for a virtual object as any of the possible implementations described above.
In one aspect, a storage medium is provided, in which at least one computer program is stored, the at least one computer program being loaded and executed by a processor to implement a method of generating actions of a virtual object or a training method of an action generation model as in any of the possible implementations described above.
In one aspect, a computer program product or computer program is provided, the computer program product or computer program comprising one or more program codes, the one or more program codes being stored in a computer readable storage medium. One or more processors of a computer device are capable of reading the one or more program codes from a computer-readable storage medium, the one or more processors executing the one or more program codes, such that the computer device is capable of performing the method of generating actions of a virtual object or the training method of an action generation model of any one of the possible embodiments described above.
The technical scheme provided by the embodiment of the application has the beneficial effects that at least:
A potential action feature is obtained through random sampling, so that hidden information of the pose of the virtual object is represented in the probability space corresponding to a standard normal distribution. The action sequence of the virtual object is then predicted in an autoregressive manner in combination with bias information related to the action category to be generated, so that the action sequence both expresses the hidden information and satisfies the constraint of the action category. Because the pose of each past frame is introduced autoregressively to predict the pose of the current frame, the generation of the final action sequence does not depend on an RNN generator under a GAN architecture, the mode collapse problem is avoided, and action sequences with higher diversity can be generated.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the description of the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings can be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic diagram of an implementation environment of a method for generating actions of a virtual object according to an embodiment of the present application;
FIG. 2 is a flowchart of a method for generating actions of a virtual object according to an embodiment of the present application;
FIG. 3 is a flowchart of a method for generating actions of a virtual object according to an embodiment of the present application;
FIG. 4 is a schematic diagram of predicting an action sequence using an action generation model provided by an embodiment of the application;
FIG. 5 is a flowchart of a training method of an action generation model provided by an embodiment of the present application;
FIG. 6 is a schematic diagram of a decoding prediction process according to an embodiment of the present application;
FIG. 7 is a schematic flow chart diagram of a training method of an action generation model according to an embodiment of the present application;
FIG. 8 is a schematic diagram of a sequence of actions generated in a non-autoregressive manner provided by an embodiment of the present application;
FIG. 9 is a schematic diagram of a sequence of actions generated in an autoregressive manner provided by an embodiment of the present application;
FIG. 10 is a schematic diagram of a sequence of actions generated in an autoregressive manner provided by an embodiment of the present application;
fig. 11 is a schematic structural diagram of an action generating device of a virtual object according to an embodiment of the present application;
FIG. 12 is a schematic diagram of a training device for an action generation model according to an embodiment of the present application;
FIG. 13 is a schematic diagram of a computer device according to an embodiment of the present application;
Fig. 14 is a schematic structural diagram of a computer device according to an embodiment of the present application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the embodiments of the present application will be described in further detail with reference to the accompanying drawings.
The terms "first," "second," and the like in this disclosure are used for distinguishing between similar elements or items having substantially the same function and function, and it should be understood that there is no logical or chronological dependency between the terms "first," "second," and "n," and that there is no limitation on the amount and order of execution.
The term "at least one" in the present application means one or more, meaning "a plurality" means two or more, for example, a plurality of first positions means two or more first positions.
The term "comprising at least one of A or B" in the present application relates to the following cases: only a, only B, and both a and B.
The user related information (including but not limited to user equipment information, user personal information, etc.), data (including but not limited to data for analysis, stored data, presented data, etc.) and signals referred to by the present application are all user authorized or fully authorized by the parties, and the collection, use and processing of the related data requires compliance with the relevant laws and regulations and standards of the relevant country and region. For example, in the sample action sequence related to the present application, if the virtual object corresponding to the sample action sequence is manipulated by the user, or the sample action sequence itself is parsed by the action video clip of the user, then the sample action sequence is obtained under the condition of sufficient authorization.
Artificial Intelligence (AI) is the theory, method, technique and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, sense the environment, acquire knowledge, and use knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a way similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision-making.
Artificial intelligence technology is a comprehensive discipline that covers a wide range of fields, including both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, machine learning/deep learning, automatic driving, intelligent transportation, and other directions.
With the research and advancement of artificial intelligence technology, it has been studied and applied in many fields, such as smart homes, smart wearable devices, virtual assistants, smart speakers, smart marketing, unmanned driving, automatic driving, unmanned aerial vehicles, robots, smart healthcare, and smart customer service. It is believed that with the development of technology, artificial intelligence will be applied in more fields and play an increasingly important role.
In the field of artificial intelligence, Computer Vision (CV) is a science that studies how to make machines "see"; more specifically, it replaces human eyes with cameras and computers to identify and measure targets and perform other machine vision tasks, and further performs graphic processing so that the computer produces images more suitable for human observation or for transmission to instruments for detection. As a scientific discipline, computer vision studies related theories and technologies in an attempt to build artificial intelligence systems that can acquire information from images or multidimensional data. Computer vision techniques typically include image processing, image recognition, image semantic understanding, image retrieval, OCR (Optical Character Recognition), video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D (three-dimensional) techniques, virtual reality, augmented reality, and simultaneous localization and mapping.
The embodiment of the application relates to action generation for a virtual object (namely, character action generation) in computer vision tasks. The action generation task can be simply summarized as: given an action category, a sequence of actions for a virtual object is automatically generated by the machine. The action generation task has important practical significance for many computer vision tasks, including multimedia interaction and visual information understanding, is applicable to various scenarios such as monocular motion estimation and action recognition, and can realize functions such as motion denoising or data set augmentation. For example, in various types of applications (e.g., game applications, video production applications, animation production applications, etc.), behavior control of a virtual character or virtual object typically involves generating a series of actions for a given action category (also known as a semantic action tag) over a specified duration.
For the action generation task, there are few approaches that generate an action sequence driven by an action category. At present, a GAN (Generative Adversarial Network) can be used to generate actions of a virtual object. Under the GAN architecture, an RNN (Recurrent Neural Network) generator and a CNN (Convolutional Neural Network) discriminator perform adversarial learning: the optimization objective of the RNN generator is to deceive the CNN discriminator as much as possible, so that the CNN discriminator cannot judge whether an input action sequence is a real action or a machine-generated action, while the optimization objective of the CNN discriminator is to distinguish as accurately as possible whether an input action sequence is a real action or a machine-generated action.
Adversarial learning suffers from mode collapse: after the RNN generator finds that a certain action sequence under a certain action category can fool the CNN discriminator, it will always tend to generate that same action sequence in order to keep fooling the CNN discriminator, which results in poor generalization ability of the RNN generator and training that fails to reach a Nash equilibrium. In addition, it is difficult to learn an explicit expression of the pose distribution from the action sequences generated by the RNN generator, and the diversity of the generated action sequences is hard to interpret, so the interpretability is also poor.
In view of this, an embodiment of the present application provides a training method for an action generation model. The method does not rely on the GAN architecture; instead, the action sequence of a virtual object is generated under the driving of a Transformer variational auto-encoder. The action generation model obtained by this training method can be put into the action generation task of a virtual object: given an action category, an action sequence of the virtual object can be automatically generated by the trained action generation model. Based on the obtained action sequence, an action fragment of the virtual object can further be modeled for visual display in combination with an object model of the virtual object (such as a 3D model or a 2D model), as described in detail in the following embodiments.
Hereinafter, terms related to the embodiments of the present application will be explained.
Transducer: the transducer is a preamble codec predictor, originally proposed as a machine translated Seq2Seq (Sequence To Sequence, sequence-to-sequence) model, modified on the basis of the transducer model in embodiments of the present application to accomplish the action generating task.
SMPL (Skinned Multi-Person Linear) model: the SMPL model is a differentiable parametric human body model that can be used for 3D human body modeling and animation driving. It has a high degree of realism: it can simulate the bulging and hollowing of human muscles during limb movement, and can avoid surface distortion of the human body during motion, thereby accurately describing how human muscles stretch and contract while moving.
VAE (Variational Auto-Encoder): the VAE is an unsupervised way of learning complex probability distributions. It is composed of an encoder (i.e. an encoding module comprising a series of cascaded encoding units) and a decoder (i.e. a decoding module comprising a series of cascaded decoding units), imitates the learning and prediction mechanism of an Auto-Encoder (AE), and encodes and decodes between measurable functions. In the embodiment of the present application, for a standard normal distribution and the pose distribution of a given virtual object (characterized by the sample distribution during the training stage), there always exists a measurable function that can be made differentiable and that maps it to another probability distribution, so that the probability distribution obtained by the mapping can be made arbitrarily close to the standard normal distribution.
Relative Entropy: also referred to as KL (Kullback-Leibler) divergence or information divergence, the relative entropy is an asymmetric measure of the difference between two probability distributions. In the embodiments of the present application, the relative entropy is used to measure the difference between the probability distribution mapped by the VAE encoder and the standard normal distribution; that is, by minimizing the relative entropy, the probability distribution mapped by the VAE encoder is made to approximate the standard normal distribution as closely as possible.
Reparameterization (The Reparameterization Trick): the reparameterization process can be described simply as follows: an adjustment factor is sampled from the standard normal distribution; then, combining the distribution parameters of the probability distribution output by the VAE encoder, the standard deviation in the distribution parameters is weighted by the sampled adjustment factor and combined with the existing mean value of the distribution parameters to determine the latent vector obtained by reparameterized sampling. In the VAE training stage, the model parameters of the encoder and decoder are optimized through a gradient descent algorithm under the guidance of the relative entropy, so that the mean value of the posterior distribution of the latent vector approaches 0 and the standard deviation approaches 1, which ensures that after training the probability distribution mapped by the VAE encoder approximates a standard normal distribution. Reparameterization keeps the non-differentiable "sampling" operation out of the gradient descent of the VAE; instead, the latent vector participates in the gradient descent of the VAE, thereby making the entire Transformer variational auto-encoder (i.e., the initial action model) trainable.
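As a concrete illustration of the reparameterization described above, the following is a minimal sketch; the feature dimension and variable names are assumptions and do not come from the application.

    import torch

    def reparameterize(mu, log_var):
        # Sample a latent vector z from N(mu, sigma^2) in a differentiable way.
        std = torch.exp(0.5 * log_var)   # standard deviation predicted by the VAE encoder
        eps = torch.randn_like(std)      # adjustment factor sampled from N(0, 1)
        return mu + eps * std            # sampling itself stays outside the gradient path

    # Hypothetical usage: mu and log_var are the distribution parameters output by the encoder.
    mu = torch.zeros(1, 256)
    log_var = torch.zeros(1, 256)
    z = reparameterize(mu, log_var)      # latent vector that still supports backpropagation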
Scheduled sampling: the decoder of the VAE can be regarded as an action generator, and with a certain probability the elements it has already generated are taken as its input; in other words, the input of the action generator is not necessarily the elements it generated previously, but may instead be sampled from the real action sequence. In this way, even if the elements generated earlier by the action generator are wrong, the error does not accumulate, the overall training target remains the maximum probability of generating the real action sequence, and the model is guaranteed to train in the correct direction.
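A minimal sketch of this sampling decision is given below; the probability schedule and variable names are assumptions, and the sketch only illustrates the idea of mixing generated and real poses, not the application's exact procedure.

    import random

    def build_preamble(pred_poses, sample_poses, i, p_use_pred):
        # Build the preamble pose sequence up to position i, choosing each pose either
        # from the model's own predictions (with probability p_use_pred) or from the
        # ground-truth sample sequence.
        preamble = []
        for t in range(i + 1):
            if random.random() < p_use_pred:
                preamble.append(pred_poses[t])    # pose generated by the action generator
            else:
                preamble.append(sample_poses[t])  # pose taken from the real action sequence
        return preamble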
Action category (Action Label): also referred to as an action tag, meaning the semantic label of an action that the virtual object can perform, generally related to daily actions that a human body can carry out. For example, action categories include: running, jumping, kicking, advancing, reversing, rotating, waving arms, lifting arms, picking up an object, putting down an object, etc.
Generating a 3D human body action sequence: the virtual object is a virtual character modeled by using a 3D human body model, the final motion generation model outputs multi-frame human body rotation and translation information (namely the pose of the virtual object), and joint positions of the virtual object in each frame and human body Mesh (Mesh) vertex coordinates can be further generated by combining the SMPL model, so that a motion segment of the virtual object can be constructed for visual display.
Autoregressive (Autoregressive Model, AR model): autoregressive is a way of processing a time series, and refers to predicting the performance of a current frame by using information of each past frame.
The system configuration of the embodiment of the present application will be described below.
Fig. 1 is an implementation environment schematic diagram of a method for generating actions of a virtual object according to an embodiment of the present application. Referring to fig. 1, in this implementation environment, it includes: terminal 110 and server 120, terminal 110 and server 120 being examples of computer devices.
The terminal 110 is used for playing action segments of the virtual object, and an application program supporting playing the action segments is installed and run on the terminal 110, and the application program includes but is not limited to: game applications, animation applications, video applications, applications supporting virtual reality, social applications, payment applications, etc., the type of application program is not particularly limited in the embodiments of the present application.
Optionally, the virtual object is a movable object capable of performing action, including a virtual character, a virtual animal, a cartoon character, etc., for example, the virtual object is a three-dimensional model constructed based on three-dimensional human skeleton technology, and of course, the virtual object can also be implemented by adopting a 2.5-dimensional or 2-dimensional model, which is not limited in the embodiment of the present application.
Alternatively, the terminal 110 is a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, a smart watch, etc., but is not limited thereto.
The terminal 110 and the server 120 can be directly or indirectly connected through wired or wireless communication, and the present application is not limited herein.
The server 120 is configured to synthesize an action segment of the virtual object, for example, the server 120 predicts an action sequence of the virtual object using an action generation model, combines the action sequence and the three-dimensional human model, synthesizes an action segment in which the virtual object moves according to the action sequence, and pushes the action segment to the terminal 110. Typically, the server 120 is used to provide background services to the application installed on the terminal 110. Optionally, the server 120 is also used to train the action generation model.
Schematically, the server 120 collects sample motion sequences of various virtual objects or real humans, marks the sample motion sequences to obtain corresponding sample motion categories, trains the sample motion sequences and the sample motion categories to obtain a motion generation model, predicts the motion sequences of the virtual objects by using the motion generation model on the server 120 side, combines the motion sequences and the three-dimensional human body model, synthesizes motion fragments of the virtual objects moving according to the motion sequences, and pushes the motion fragments to the terminal 110 so that a user can play the pushed motion fragments by using the application program in the terminal 110.
Schematically, the server 120 collects sample motion sequences of various virtual objects or real humans, marks the sample motion sequences to obtain corresponding sample motion types, trains the sample motion sequences and the sample motion types to obtain motion generation models, and then issues the motion generation models to the terminal 110, so that when the terminal 110 needs, the terminal 110 side locally invokes the motion generation models to predict the motion sequences of the virtual objects, combines the motion sequences and the three-dimensional human body models, synthesizes motion fragments of the virtual objects moving according to the motion sequences, and finally plays the locally synthesized motion fragments by using the application program.
Optionally, the server 120 includes at least one of a server, a plurality of servers, a cloud computing platform, or a virtualization center. For example, the server 120 takes over primary computing work and the terminal 110 takes over secondary computing work; alternatively, the server 120 takes on secondary computing work and the terminal 110 takes on primary computing work; alternatively, a distributed computing architecture is employed between both the terminal 110 and the server 120 for collaborative computing.
In some embodiments, the server is a stand-alone physical server, or a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDNs (Content Delivery Network, content delivery networks), and basic cloud computing services such as big data and artificial intelligence platforms.
Those skilled in the art will appreciate that the number of terminals 110 may be greater or lesser. For example, the number of the terminals 110 may be only one, or the number of the terminals 110 may be several tens or hundreds, or more. The number and device type of the terminals 110 are not limited in the embodiment of the present application.
Fig. 2 is a flowchart of a method for generating actions of a virtual object according to an embodiment of the present application. Referring to fig. 2, this embodiment is performed by a computer device, and is described by taking the computer device as a server, and includes the following steps:
201. The server randomly samples to obtain a potential action feature, wherein the potential action feature is a random sampling value obtained by mapping the pose distribution of the virtual object to a standard normal distribution.
The standard normal distribution is denoted as N (0, 1), and means a normal distribution having a mean value of 0 and a standard deviation of 1.
The pose distribution of the virtual object refers to probability distribution formed by poses which the virtual object may have when executing actions of various action categories, and represents probability space of the poses when executing actions of the virtual object under all action categories.
The pose of the virtual object refers to the translation and rotation of the virtual object, relative to the previous frame, at any frame (or any moment) of the action sequence. For example, the translation and rotation of each joint of the object model relative to the previous frame are used to represent the translation and rotation of the whole object model, and the translation and rotation of the outer surface (Mesh) skin of the object model can be predicted from the translation and rotation of each joint relative to the previous frame.
In some embodiments, random sampling is performed from the standard normal distribution to obtain the potential action feature, which reflects an implicit action representation in which the pose of the virtual object is mapped from its original probability space into the probability space corresponding to the standard normal distribution. For example, if the potential action feature is a latent vector z, then the latent vector z must conform to a standard normal distribution, that is, z ~ N(0, 1), and the latent vector z can be regarded as a random sampling value of the action information implied by the pose of the virtual object in the probability space corresponding to the standard normal distribution.
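For illustration only, drawing the latent vector z described here amounts to sampling from a standard normal distribution; the dimensionality below is an assumption.

    import torch

    latent_dim = 256                 # hypothetical feature dimension
    z = torch.randn(1, latent_dim)   # z ~ N(0, 1): the potential action feature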
202. The server obtains bias information of the potential action feature based on the action category to be generated, wherein the bias information is used for representing an influence factor of the potential action feature under the action category.
The bias information represents features related to the action category to be generated this time. Because the potential action feature is randomly sampled from the standard normal distribution, it can, to a certain extent, represent hidden information of the pose of the virtual object in the probability space corresponding to the standard normal distribution, but it cannot be closely related to the action category to be generated this time. The bias information is therefore obtained from the action category to be generated, so that when the potential action feature is used to decode the action sequence, the bias information related to the action category is introduced as guidance, which is equivalent to representing an influence factor on the potential action feature under the action category to be generated this time.
In some embodiments, the server performs One-Hot (One-Hot) encoding on the action category to be generated, so as to obtain One-Hot encoding (also called One-Hot vector) of the action category, and then inputs the One-Hot encoding into a full connection layer for full connection processing, and outputs bias information of the potential action feature. Wherein, each element in the One-Hot vector is either 1 or 0, the vector length of the One-Hot vector represents the number of kinds of all the action categories that can be generated, and each element in the One-Hot vector is associated with One action category that can be generated, and the element with 1 represents the action category to be generated at this time.
Illustratively, the action categories that may be generated include: kicking, turning around, carrying things, and throwing things, 4 kinds altogether. The first element of the One-Hot vector is associated with the action category "kicking", the second element with "turning around", the third element with "carrying things", and the fourth element with "throwing things"; obviously, the vector length of the One-Hot vector is also 4 (equal to the total number of action categories, 4). Assume that the One-Hot vector takes the value [0, 0, 1, 0]; this indicates that the action category to be generated this time is "carrying things".
It should be noted that, since two or more actions may be involved in one action sequence, there may also be two or more action categories to be generated. In this case, the element corresponding to each of the two or more action categories is simply set to 1 in the One-Hot vector, so that an action sequence containing two or more actions can be synthesized. A sketch of the one-hot encoding and full connection processing follows.
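The sketch below shows how an action category could be turned into bias information through one-hot encoding and a fully connected layer; the number of categories and the feature dimension are assumptions taken from the example above rather than values specified by the application.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    num_categories = 4   # e.g. kicking, turning around, carrying things, throwing things
    latent_dim = 256     # hypothetical dimension of the potential action feature

    category_fc = nn.Linear(num_categories, latent_dim)

    # One-hot vector for the third category ("carrying things" in the example above);
    # several elements may be set to 1 when two or more categories are to be generated.
    one_hot = F.one_hot(torch.tensor([2]), num_classes=num_categories).float()
    bias_info = category_fc(one_hot)   # bias information of the potential action feature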
In other embodiments, the server performs Embedding (Embedding) processing on the action category to be generated to obtain an Embedding vector of the action category, then inputs the Embedding vector into a full connection layer for full connection processing, and outputs the bias information of the potential action feature.
In some embodiments, the fully connected layer used to obtain the bias information is trained and optimized together with the action generation model. After the action generation model is trained, this fully connected layer is also trained and its parameters are fixed, the action categories that the action generation model can support are fixed, and the bias information is associated only with the action category. Therefore, the server only needs to input the One-Hot vector or Embedding vector of each action category into the trained fully connected layer for full connection processing, obtain the bias information corresponding to that action category, and then store each action category in association with its corresponding bias information. When an action is to be generated, the action category to be generated is used as an index to query the bias information stored in association with it. This speeds up the acquisition of the bias information, avoids computing the bias information every time an action is generated, and saves the computing resources of the server.
203. The server generates an action sequence of the virtual object in an autoregressive mode based on the potential action characteristics and the bias information, wherein the action sequence is used for representing the pose of the virtual object for executing the action corresponding to the action category.
Generating an action sequence in an autoregressive manner means: and predicting the pose of the virtual object in the current frame by using the pose of the virtual object in each frame in the past, and continuously executing the process to finally obtain an action sequence formed by the pose of each frame, for example, predicting the pose of the t frame by using the poses of the 1 st frame to the t-1 st frame in the action sequence, wherein t is an integer greater than 1.
In some embodiments, the potential motion features represent, to a certain extent, implicit information of the pose of the virtual object under the probability space corresponding to the standard normal distribution, and the bias information can provide constraint information based on the motion category to be generated at this time, so that the motion sequence of the virtual object can be generated autoregressively by using the motion generation model in combination with the potential motion features and the bias information.
All the above optional solutions can be combined to form an optional embodiment of the present disclosure, which is not described in detail herein.
According to the method provided by the embodiment of the application, a potential action feature is obtained through random sampling, so that hidden information of the pose of the virtual object is represented in the probability space corresponding to a standard normal distribution. The action sequence of the virtual object is then predicted autoregressively in combination with bias information related to the action category to be generated, so that the action sequence both expresses the hidden information and satisfies the constraint of the action category. Because the pose of each past frame is introduced in an autoregressive manner to predict the pose of the current frame, the generation of the final action sequence does not depend on an RNN generator under a GAN architecture, the mode collapse problem is avoided, and action sequences with higher diversity can be generated.
The above embodiment briefly describes the flow and the overall technical concept of action generation for a virtual object. The following embodiment describes in detail how an action sequence is predicted autoregressively. Fig. 3 is a flowchart of a method for generating actions of a virtual object according to an embodiment of the present application. As shown in fig. 3, the embodiment is executed by a computer device, described by taking the computer device as a server for example:
301. The server randomly samples to obtain a potential action feature, wherein the potential action feature is a random sampling value obtained by mapping the pose distribution of the virtual object to a standard normal distribution.
Step 301 is similar to step 201, and will not be described here.
302. The server obtains bias information of the potential action feature based on the action category to be generated, wherein the bias information is used for representing an influence factor of the potential action feature under the action category.
Step 302 is similar to step 202, and will not be described again.
303. And the server fuses the potential action characteristics and the bias information to obtain action bias characteristics.
In some embodiments, the server concatenates (Concat) the potential motion feature and the bias information to obtain the motion bias feature, which is equivalent to transferring the potential motion feature into a motion space associated with the current motion class.
In other embodiments, the server may also fuse the potential action feature with the bias information in other manners, such as element-wise addition, element-wise multiplication, or bilinear pooling; the embodiment of the present application does not specifically limit the fusion manner.
304. The server inputs the motion bias feature and the initial pose into a motion generation model, and predicts the 1 st pose of the motion sequence through the motion generation model.
The initial pose refers to an initialized pose of the virtual object under the action category. For example, a plurality of initial poses are set for the action category and one of them is randomly selected as the initial pose for this action generation, or the pose of one frame is randomly sampled from a sample action sequence corresponding to the action category and used as the initial pose.
Consider the initial pose as the 0-th pose P_0. The 1st pose P_1 is predicted from the initial pose P_0, and the (i+1)-th pose P_{i+1} (i ≥ 1) is predicted from the 1st to the i-th poses [P_1, P_2, …, P_i]; the process is the same in both cases, and the detailed prediction process is described in the following step 305, which is not repeated here.
It should be noted that the pose of the virtual object is used to characterize the translation and rotation of multiple joints of the virtual object. Taking the SMPL model as the object model of the virtual object as an example, the SMPL model involves 24 joints when performing 3D human modeling of the virtual object. For the i-th pose P_i, P_i is formed by concatenating the rotation representation R_i of the 24 joints of the virtual object in the i-th frame and the translation representation of the root joint. In other words, the rotation representation R_i uses 6 dimensions per joint to estimate the posture information of the joint rotation, and the translation representation only needs the displacement information of the single root joint among the 24 joints along the x, y and z coordinate axes.
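For illustration only, the per-frame pose described here can be viewed as a flat vector assembled as in the following sketch; the exact memory layout is an assumption.

    import torch

    num_joints = 24                          # joints in the SMPL model
    rot_dim = 6                              # 6-dimensional rotation estimate per joint
    trans_dim = 3                            # x, y, z displacement of the root joint

    # Hypothetical per-frame pose P_i: joint rotations concatenated with root translation.
    R_i = torch.zeros(num_joints * rot_dim)  # rotation representation, 24 * 6 = 144 values
    D_i = torch.zeros(trans_dim)             # translation representation of the root joint
    P_i = torch.cat([R_i, D_i])              # pose vector of length 147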
305. The server inputs the motion bias characteristic and the 1 st pose to the i-th pose into the motion generation model, and the i+1-th pose is obtained through prediction by the motion generation model, wherein i is an integer greater than or equal to 1.
Since i is an integer greater than or equal to 1, the (i+1)-th pose can characterize any pose in the action sequence other than the 1st pose, so this step 305 provides a possible way of predicting any pose in the action sequence other than the 1st pose. In other words, the server predicts the 1st pose in the action sequence through the above step 304, and predicts any pose other than the 1st pose through this step 305.
In some embodiments, the 1st pose through the i-th pose form a predicted preamble pose sequence [P_1, P_2, …, P_i]. Position Encoding (PE) is performed on the preamble pose sequence [P_1, P_2, …, P_i] to obtain a pose position feature, also called the position code of the preamble pose sequence, which is used to characterize the temporal order of each pose in the sequence. Optionally, absolute position encoding or relative position encoding may be used; the embodiment of the present application does not specifically limit the position encoding method.
Next, the preamble pose sequence [P_1, P_2, …, P_i] and the above pose position feature are fused to obtain the preamble pose feature, which can represent both the features of each pose in the preamble pose sequence [P_1, P_2, …, P_i] and the temporal order of these poses.
In some embodiments, the server inputs the preamble pose sequence [P_1, P_2, …, P_i] into a fully connected layer for full connection processing, and adds the feature vector output by the fully connected layer and the position encoding vector (i.e., the pose position feature) element by element to obtain the preamble pose feature.
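For illustration only, the full connection processing and position encoding described above can be sketched as follows, assuming a standard sinusoidal absolute position code and hypothetical dimensions.

    import math
    import torch
    import torch.nn as nn

    d_model = 256                                    # hypothetical feature dimension
    pose_dim = 147                                   # per-frame pose vector (see above)
    pose_fc = nn.Linear(pose_dim, d_model)

    def positional_encoding(length, dim):
        # Standard sinusoidal absolute position encoding.
        pos = torch.arange(length).unsqueeze(1).float()
        div = torch.exp(torch.arange(0, dim, 2).float() * (-math.log(10000.0) / dim))
        pe = torch.zeros(length, dim)
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        return pe

    preamble_poses = torch.zeros(8, pose_dim)        # poses P_1 ... P_i (i = 8 here)
    initial_feat = pose_fc(preamble_poses)           # full connection processing
    preamble_feat = initial_feat + positional_encoding(8, d_model)  # element-wise addition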
After the preamble pose feature is acquired, the preamble pose feature and the motion bias feature acquired in the step 303 are input into the motion generation model together, and the i+1th pose is obtained by decoding the motion generation model.
The above process is equivalent to the following: for any pose in the action sequence (the i-th pose can represent any pose, because i is an integer greater than or equal to 1), the pose (i.e., the i-th pose) and each of its preceding poses (i.e., the 1st to the (i-1)-th poses) form a preamble pose sequence; the potential action feature, the bias information and the preamble pose sequence are input into the action generation model, and the next pose (i.e., the (i+1)-th pose) in the action sequence is predicted by the action generation model. Here, the "preceding poses of any pose" means the poses that are temporally located before that pose in the action sequence.
In some embodiments, the action generation model includes a plurality of cascaded decoding units. Except for the first decoding unit, which takes the action bias feature and the preamble pose feature as input features, each remaining decoding unit takes the feature vector output by the previous decoding unit as its input feature, and each decoding unit is configured to decode the input feature based on a self-attention mechanism. Optionally, since the Q (Query) vector, the K (Key) vector and the V (Value) vector need to be used for the weighting operation in the self-attention mechanism, the preamble pose feature serves as the Q vector in the self-attention mechanism, and the corresponding K vector and V vector can be generated based on the action bias feature.
In some embodiments, the server inputs the action bias feature and the preamble pose feature to the first decoding unit of the plurality of decoding units. The first decoding unit uses the preamble pose feature as the Q vector and generates the corresponding K vector and V vector based on the action bias feature, then performs decoding based on the self-attention mechanism according to the determined Q, K and V vectors. The decoded feature vector is input to the second decoding unit, and so on, until the last decoding unit outputs the hidden vector of the (i+1)-th pose.
In some embodiments, the server performs full connection processing on the hidden vector of the (i+1)-th pose to obtain the (i+1)-th pose. Optionally, the server inputs the hidden vector of the (i+1)-th pose into at least one fully connected layer; the hidden vector is fully connected through the at least one fully connected layer, and the last fully connected layer outputs the (i+1)-th pose. For example, the hidden vector of the (i+1)-th pose is input into 2 serial fully connected layers, and the latter fully connected layer outputs the (i+1)-th pose.
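The following sketch shows one way a decoding unit of the kind just described could be written, with the preamble pose feature supplying the query and the action bias feature supplying the key and value, followed by two serial fully connected layers that map the last hidden vector to the (i+1)-th pose. The class and function names, head count and feature width are illustrative assumptions, not the patented structure itself.

```python
import torch
import torch.nn as nn

class PoseDecodingUnit(nn.Module):
    """Sketch of one decoding unit: attention where Q comes from the preamble pose feature
    and K/V are generated from the action bias feature, followed by a feed-forward block."""
    def __init__(self, d_model: int = 256, n_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                                 nn.Linear(4 * d_model, d_model))
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, pose_feat: torch.Tensor, bias_feat: torch.Tensor) -> torch.Tensor:
        # pose_feat: (1, i, d_model) queries; bias_feat: (1, 1, d_model) keys/values
        h, _ = self.attn(query=pose_feat, key=bias_feat, value=bias_feat)
        h = self.norm1(pose_feat + h)
        return self.norm2(h + self.ffn(h))

def predict_next_pose(decoding_units, pose_heads, preamble_feat, bias_feat):
    """Cascade the decoding units, then map the hidden vector of the (i+1)-th pose to a pose
    vector through two serial fully connected layers (pose_heads, assumed)."""
    h = preamble_feat
    for unit in decoding_units:
        h = unit(h, bias_feat)
    hidden = h[:, -1]                  # hidden vector of the (i+1)-th pose
    return pose_heads(hidden)          # e.g. nn.Sequential(nn.Linear(...), nn.ReLU(), nn.Linear(...))
```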
306. The server acquires an action sequence formed by all the poses predicted by the action generating model.
In some embodiments, the server iteratively performs step 304 described above, continually predicting new poses using past preamble pose sequences until the predicted poses reach a specified number, corresponding to the predicted action sequences reaching a specified length, where the specified number or specified length is set by a technician.
In some embodiments, the server iteratively performs step 304 described above, continually predicting new poses using the past preamble pose sequences until the action generation model outputs a terminator, at which point the poses from the 1st pose up to the pose immediately before the terminator constitute the action sequence.
In the above steps 303-306, a possible implementation manner of generating the motion sequence of the virtual object by the server in an autoregressive manner based on the potential motion characteristics and the bias information is provided, and in some embodiments, the motion sequence can also be generated by other autoregressive models, which are not particularly limited in the embodiment of the present application.
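A minimal sketch of the autoregressive loop of steps 303-306 is given below: starting from the 1st pose, the model keeps predicting the next pose from all previously predicted poses until a specified length is reached (a terminator token could serve as the stopping condition instead). The function signature and the way the model is called are assumptions made only for illustration.

```python
import torch

def generate_action_sequence(model, action_bias_feat, first_pose, max_len: int = 120):
    """Sketch: autoregressive generation of an action sequence of a specified length."""
    poses = [first_pose]                                # 1st pose, predicted in step 303
    while len(poses) < max_len:                         # or: until a terminator is produced
        preamble = torch.stack(poses, dim=0)            # preamble pose sequence [P_1, ..., P_i]
        next_pose = model(preamble, action_bias_feat)   # predict the (i+1)-th pose (step 304)
        poses.append(next_pose)
    return torch.stack(poses, dim=0)                    # action sequence [P_1, ..., P_T]
```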
307. The server synthesizes an action fragment of the virtual object for executing the action corresponding to the action category based on the action sequence and the object model of the virtual object.
Taking the object model of the virtual object being an SMPL model as an example, 24 joints are involved when the SMPL model performs 3D human modeling of the virtual object. For the i-th pose P_i, P_i is obtained by splicing the rotation characterization R_i of the 24 joints of the virtual object in the i-th frame and the translation characterization D_i of the root joint. In other words, the rotation characterization R_i uses a 6-dimensional representation for each joint to estimate the pose information of the joint rotation, while the translation characterization D_i only needs the displacement information of the single root joint among the 24 joints along the three coordinate axes x, y and z.
Assume that the sequence length of the action sequence is T (T ≥ 1); the action sequence can then be expressed as [P_1, P_2, …, P_T], where each element represents the pose of the virtual object in one media frame. When the virtual object is 3D human modeled using the SMPL model, the pose of the virtual object in each frame is also referred to as the per-frame SMPL pose parameter. The rotation characterization R_i in the SMPL pose parameter reflects the global rotation of the root joint and the joint rotations of the remaining 23 joints in the kinematic tree of the SMPL model, and the translation characterization D_i reflects the displacement information of the root joint, so the SMPL pose parameter decouples the pose and the shape of the virtual object in each frame.
In some embodiments, the action sequence is input into the SMPL model. For the i-th pose in the action sequence, since the pose reflects the rotation characterization of the 24 joints and the translation characterization of the root joint, the joint coordinates of each of the 24 joints of the virtual object in the i-th frame can be determined; the SMPL model can also calculate the vertex coordinates of each of the 6890 vertices of the human body mesh (Mesh). In addition, a technician can also input into the SMPL model a shape parameter for controlling how fat or thin the body of the virtual object is, which improves the controllability and degree of freedom of the virtual object modeling process.
The server can decouple the pose and the shape of the virtual object in each frame by using the SMPL model. Based on each pose in the action sequence (including the translation characterization and the rotation characterization of the joints), the joint coordinates of each joint and the vertex coordinates of each vertex of the virtual object in each frame can be calculated, so that one frame of action picture of the virtual object can finally be rendered. Performing this operation on every frame finally yields an action segment composed of multiple frames of action pictures, in which each action picture matches one pose in the action sequence.
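For reference, the sketch below illustrates the per-frame pose layout implied above: a flattened pose vector split into the 6D rotation characterization R_i of the 24 joints and the x/y/z translation characterization D_i of the root joint. The dataclass and helper are hypothetical; the actual SMPL forward pass (converting rotations and producing the 24 joint coordinates and 6890 mesh vertices) is omitted.

```python
from dataclasses import dataclass
import torch

NUM_JOINTS = 24                          # joints in the SMPL kinematic tree
ROT_DIM = 6                              # 6D rotation representation per joint
POSE_DIM = NUM_JOINTS * ROT_DIM + 3      # rotations of 24 joints + root translation

@dataclass
class FramePose:
    rotation: torch.Tensor      # R_i, shape (24, 6): 6D rotation of every joint
    translation: torch.Tensor   # D_i, shape (3,): x/y/z displacement of the root joint

def split_pose(p: torch.Tensor) -> FramePose:
    """Split a flattened pose vector P_i (length POSE_DIM) into R_i and D_i."""
    rot = p[: NUM_JOINTS * ROT_DIM].reshape(NUM_JOINTS, ROT_DIM)
    trans = p[NUM_JOINTS * ROT_DIM:]
    return FramePose(rotation=rot, translation=trans)

# A full pipeline would convert the 6D rotations to rotation matrices / axis-angle and feed
# them, together with a shape parameter, to an SMPL implementation to obtain the joint and
# vertex coordinates of each frame; that call is intentionally left out of this sketch.
```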
It should be noted that the embodiment of the present application only takes the case in which the object model of the virtual object is an SMPL model and the 3D human action segment is reconstructed using the SMPL model as an example. Optionally, when the virtual object is a reconstructed human model, human parameterized models such as the SCAPE (Shape Completion and Animation of People) model, the SMPL+H model and the SMPL-X model can also be used to reconstruct a 3D human virtual object, and some 2D human models can likewise be used to reconstruct a 2D human virtual object. When the virtual object is not a reconstructed human model, other matching object models can be used to create cartoon characters, avatars, virtual characters and other types of virtual objects. Whether the action segment of the virtual object is reconstructed using the SMPL model is not specifically limited here.
In some embodiments, if multiple action categories to be generated are given, the finally synthesized action segment is typically the virtual object performing consecutive actions corresponding to the multiple action categories. For example, given the action categories "kick" and "fall", the finally synthesized action segment includes the virtual object performing the "kick" action followed by the "fall" action, with the two actions linked coherently and transitioning naturally between the different action categories.
All the above optional solutions can be combined to form an optional embodiment of the present disclosure, which is not described in detail herein.
According to the method provided by the embodiment of the present application, a potential action feature is obtained through random sampling, so that hidden information of the pose of the virtual object is represented in the probability space corresponding to the standard normal distribution. The action sequence of the virtual object is then predicted autoregressively in combination with the bias information related to the action category to be generated, so that the action sequence both carries the hidden information and satisfies the constraint of the action category. The autoregressive manner also introduces the poses of past frames to predict the pose of the current frame, so the generation of the final action sequence does not depend on an RNN generator under a GAN architecture, which avoids the mode collapse problem and makes it possible to generate action sequences with higher diversity.
Further, driven by the given action category, the action sequence corresponding to the action category is generated autoregressively through the trained action generation model. The autoregressive manner predicts the pose of the current frame based on the poses of past frames, so the information of the previously generated poses can be fully used to make the inter-frame action transition more natural; and because the poses are predicted frame by frame, an action sequence of arbitrary length can in theory be generated. This is quite different from non-autoregressive methods, which output the action sequence of all frames at one time and therefore cannot produce sequences of variable length; generating the action sequence autoregressively thus improves the controllability of the action generation model, allowing action sequences of variable length to be generated flexibly.
Furthermore, on the basis of the action sequence predicted by the action generation model, the action segment of the virtual object can be reconstructed in combination with the SMPL model for front-end visual display. The joint coordinates and vertex coordinates constrain the rotation of the body parts of the reconstructed 3D human model within the kinematic tree, so the virtual object in the generated action segment can perform more realistic actions, improving the realism of the synthesized action segment.
Fig. 4 is a schematic diagram of predicting an action sequence using the action generation model according to an embodiment of the present application. As shown in fig. 4, only the generator, i.e., a decoder 410 (referring to a decoding module including a plurality of decoding units), is needed when predicting an action sequence with the action generation model; here, the case in which the decoder 410 adopts the decoder structure of a Transformer model is taken as an example. The input signals of the Transformer decoder 410 include two: one is the vector obtained by splicing the potential vector z and the bias information b, and the other is the vector obtained by adding, element by element, the position coding vector PE and the fully connected preamble pose sequence motion_prev.
Optionally, the server randomly samples a potential vector z ~ N(0, 1) from the standard normal distribution N(0, 1), where the potential vector z obeying the standard normal distribution is the potential action feature involved in step 201. Next, bias information b is obtained from the action category a to be generated, where the bias information b represents the influence factor on the potential vector z under the action category a, and the vector obtained by splicing the potential vector z and the bias information b (i.e., the action bias feature) is input to the Transformer decoder 410.
Optionally, the server inputs the preamble pose sequence motion_prev = [P_1, P_2, …, P_i] into a fully connected layer 420, and the vector obtained by adding, element by element, the feature vector output by the fully connected layer 420 and the position coding vector PE of the preamble pose sequence motion_prev (i.e., the pose position feature) is input to the Transformer decoder 410 as the preamble pose feature.
The internal decoding logic of the Transformer decoder 410 has been described in the previous embodiments and will be described again in the following training process embodiment, so it is not repeated here.
Optionally, the feature vectors output by the Transformer decoder 410 are sequentially input into two cascaded fully connected layers 430 and 440, and the latter fully connected layer 440 outputs the predicted pose P_{i+1} of the next frame. The above autoregressive frame-by-frame prediction process is performed iteratively, and when the Transformer decoder 410 stops predicting, the final action sequence [P_1, P_2, …, P_T] is output, where T represents the sequence length of the action sequence and T ≥ 1.
Optionally, the action sequence [P_1, P_2, …, P_T] is input into an SMPL layer 450, and the joint coordinates of each joint of the 3D human model of the virtual object in each frame, as well as the vertex coordinates of each vertex in the human mesh (corresponding to the human body surface), are acquired through the SMPL layer 450, so as to reconstruct the action segment of the virtual object.
In the embodiment of the present application, since the decoding function of the action generation model is implemented with a Transformer decoder, the position coding of the Transformer decoder combined with the fully connected layers before and after the decoder allows the temporal information and the human structure information of the preamble pose sequence to be fully learned when predicting the pose of the next frame. The predicted pose can reflect the rotation characterization of each joint and the translation characterization of the root joint in the SMPL model, which improves the prediction accuracy of the Transformer decoder and, further combined with the SMPL model, allows more realistic action segments to be synthesized.
In the above embodiments, how the server autoregressively generates the action sequence corresponding to a given action category through a trained action generation model, driven by that action category, has been described in detail. The training method of the action generation model will be described in detail in the following embodiment of the present application.
Fig. 5 is a flowchart of a training method of an action generating model according to an embodiment of the present application, as shown in fig. 5, where the embodiment is executed by a computer device, and the computer device is taken as a server for explanation, and the embodiment includes the following steps:
501. the server obtains a sample action sequence and a corresponding sample action category.
In some embodiments, the server segments the motion video or motion video segment from a pre-stored motion data set to obtain a plurality of motion segmentation sequences, then filters a plurality of sample motion sequences from the plurality of motion segmentation sequences, and labels each sample motion sequence with a corresponding sample motion category.
Alternatively, when the virtual object is a reconstructed human model, the action data set can be a video data set of real humans photographed or recorded while performing various kinds of actions; when the virtual object is an avatar in a game, the action data set can be a video data set composed of game play data, screen recording data, live game streaming data and the like for avatars in the game; when the virtual object is a virtual character in an animation, the action data set can be a video data set composed of finished animations, CG (Computer Graphics) videos and the like. Different action data sets are used according to the type of the virtual object, which is not specifically limited in the embodiment of the present application. The shooting, transmission and use of the action data set must be authorized by the user or fully authorized by all parties.
Optionally, the target detection is performed on each motion video or motion video segment, so that a target object corresponding to the virtual object (the target object may be a real human or a virtual object) can be determined, and whether the detected target object is in a motion state or a static state is judged, so as to obtain a plurality of motion segmentation sequences of the target object.
Optionally, when the sample motion sequence is screened from the motion segmentation sequences, performing primary screening based on the definition or resolution of each motion segmentation sequence, then performing frame rate downsampling on the motion segmentation sequences obtained by the initial screening, and performing secondary screening on the downsampled motion segmentation sequences through duration to finally obtain the sample motion sequence meeting the conditions, so that the sample quality of the training sample can be improved.
For example, from all the motion segmentation sequences, those with lower clarity or lower resolution are removed; the remaining motion segmentation sequences are then downsampled so that their frame rate (Frames Per Second, FPS, i.e., the number of frames per second) reaches 30, and from the downsampled motion segmentation sequences those with a duration between 1 and 3 seconds are selected as sample action sequences. In this way, blurred video frames are not used as samples, and actions that are too long or too short are removed, making the initial action model easier to train.
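Purely as an illustration of the screening just described, the sketch below keeps sufficiently clear segments, downsamples them to 30 FPS and retains only those lasting between 1 and 3 seconds. The data layout, threshold constant and function name are assumptions made for the example.

```python
SHARPNESS_THRESHOLD = 0.5   # assumed clarity score threshold for the primary screening

def build_sample_sequences(segments, min_sec: float = 1.0, max_sec: float = 3.0,
                           target_fps: int = 30):
    """Sketch: screen motion segmentation sequences into sample action sequences."""
    samples = []
    for seg in segments:                                # seg: {"frames": [...], "fps": int, "sharpness": float}
        if seg["sharpness"] < SHARPNESS_THRESHOLD:      # primary screening by clarity/resolution
            continue
        step = max(1, round(seg["fps"] / target_fps))   # frame-rate downsampling to about 30 FPS
        frames = seg["frames"][::step]
        duration = len(frames) / target_fps
        if min_sec <= duration <= max_sec:              # secondary screening by duration
            samples.append(frames)
    return samples
```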
Optionally, performing an action labeling on each sample action sequence to obtain a sample action category corresponding to each sample action sequence, where a mode of action labeling includes manual labeling or automatic machine identification of the action category, which is not limited by the embodiment of the present application. It should be noted that, since the target object in each sample action sequence may perform more than one action, the actions may be in fluid connection, and thus, the sample action category corresponding to each sample action sequence may be one or more.
Optionally, after labeling of the sample action category is completed for each sample action sequence, obtaining the occurrence frequency of each sample action category from the total sample action categories, wherein the occurrence frequency refers to the proportion of the sample action sequence corresponding to the sample action category in the total sample action sequence, and then eliminating the sample action sequence corresponding to the sample action category with the occurrence frequency which does not meet the target condition, so as to realize rescreening of the sample action sequence and achieve a better training effect.
For example, the target condition is that the occurrence frequency is higher than a frequency threshold, i.e., only the sample action sequences corresponding to sample action categories that occur frequently (e.g., turning around, carrying things, throwing, etc.) are used as training samples; or the target condition is that the occurrence frequency lies between 2^6 and 2^11. The embodiment of the present application does not specifically limit the target condition.
In some embodiments, after screening the sample motion sequences, the remaining sample motion sequences and the sample motion types corresponding to each sample motion sequence form a sample data set, optionally, the sample data set is divided into a training data set and a test data set according to a target proportion (greater than or equal to 0 and less than or equal to 100%), for example, 80% of sample data is selected to form the training data set, and the remaining 20% of sample data form the test data set.
It should be noted that, since the training data set involves a plurality of sample action sequences, the embodiment of the present application only shows the processing flows of a single sample action sequence and a corresponding sample action class, and the processing flows of the remaining sample action sequences and corresponding sample action classes are similar, which is not described herein.
502. And the server performs full connection processing on the sample action category to obtain sample category characteristics.
In some embodiments, the server performs One-Hot encoding on the sample action category to obtain the One-Hot encoding (also referred to as the One-Hot vector) of the sample action category, or performs Embedding processing on the sample action category to obtain the Embedding vector of the sample action category. Then, the One-Hot vector or the Embedding vector of the sample action category is input into a fully connected layer for full connection processing, and the sample category feature of the sample action category is output. Both the One-Hot vector and the Embedding vector have been described in step 202 above and are not repeated here.
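As a small illustration of this step, the sketch below One-Hot encodes a sample action category and passes it through a fully connected layer to obtain the sample category feature; the function name, class count and layer size are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def sample_category_feature(category_id: int, num_classes: int, fc: nn.Linear) -> torch.Tensor:
    """Sketch: One-Hot encoding followed by full connection processing."""
    one_hot = F.one_hot(torch.tensor([category_id]), num_classes=num_classes).float()
    return fc(one_hot)   # shape (1, C): the sample category feature

# illustrative usage: fc = nn.Linear(12, 256); feature = sample_category_feature(3, 12, fc)
```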
503. And the server performs full connection processing on the sample action sequence to obtain a first action characteristic.
Each element of each sample action sequence in the sample data set is typically a video frame sampled from an original action video or action video segment. To characterize the action of the target object in the video frame, pose estimation needs to be performed on each video frame in the sample action sequence to obtain the sample pose of the target object in each video frame; in other words, each element in the sample action sequence that is finally used is the sample pose (typically a pose vector) of the target object in one video frame, rather than the original video frame itself.
Schematically, taking a real human as an example, a 3D human model (i.e., a virtual object) corresponding to the target object can be modeled based on the action video of the target object. Assuming the 3D human model is an SMPL model, the SMPL model generally uses joint coordinates and human mesh vertex coordinates to characterize actions. In the embodiment of the present application, since the human mesh vertex coordinates can be calculated from the motion information of the joints, the translation characterization and the rotation characterization of the 24 joints are used as the sample pose of each frame, so as to decouple the pose and the shape of the virtual object in the SMPL model. The sample pose (i.e., the SMPL parameter) of each frame includes a 6D rotation characterization R_i of each of the 24 joints and the translation characterization D_i of only the root joint among the 24 joints along the x, y and z axes. In other words, the rotation characterization R_i and the translation characterization D_i are combined (e.g., spliced) to form the sample pose P_i' of one frame. Based on a specified shape parameter (used to control the body shape of the virtual object) and the sample pose P_i' of each frame, the SMPL model can calculate the joint coordinates of the 24 joints of the virtual object in that frame and the vertex coordinates of the 6890 vertices of the human mesh, thereby facilitating the interaction of the modeled virtual object with the environment.
In some embodiments, the server inputs the sample action sequence [P_1', P_2', …, P_T'] into another fully connected layer for full connection processing, and outputs the first action feature of the sample action sequence, where T represents the sequence length of the sample action sequence and T ≥ 1.
Illustratively, the fully connected layer used in this step 503 and the fully connected layer used in the previous step 502 have different weight parameters, but the same number of feature channels C (C ≥ 1) is set for both fully connected layers; that is, the sample action category can be mapped by the above step 502 to a C-dimensional feature space, and the sample action sequence can be mapped by this step 503 to a feature space with C channels per frame.
504. The server fuses the first action feature and the sample position feature of the sample action sequence to obtain a second action feature, where the sample position feature is used to characterize the temporal order of the poses in the sample action sequence.
In some embodiments, the server performs position coding on the sample motion sequence using an absolute position coding scheme to obtain an absolute position coding vector, and uses the absolute position coding vector as a sample position feature, where in one example, the absolute position coding vector is a position coding vector in a sine function form (also referred to as sine position coding).
In some embodiments, the server uses a relative position coding mode to perform position coding on the sample action sequence to obtain a relative position coding vector, and uses the relative position coding vector as a sample position feature.
In some embodiments, the first action feature acquired in step 503 and the sample position feature are fused to obtain the second action feature of the sample action sequence. For example, the feature vector output by the fully connected layer in step 503 (i.e., the first action feature) and the position coding vector are added element by element to obtain the second action feature, which can characterize both the features of each sample pose in the sample action sequence [P_1', P_2', …, P_T'] and the temporal order of the sample poses.
In some embodiments, if the dimensions of the first motion feature and the sample position feature are different, then the first motion feature and the sample position feature cannot be directly added by element, and at this time, a 1-dimensional convolution layer is used to perform dimensional transformation (i.e., dimension up or dimension down) on the first motion feature and the sample position feature, so that the dimensions of the first motion feature and the sample position feature after dimensional transformation are the same, and then the first motion feature and the sample position feature after dimensional transformation are added by element, so as to obtain the second motion feature, where the 1-dimensional convolution layer refers to a convolution layer with a convolution kernel size of 1×1. It should be noted that, for the rest of the element-wise addition operations in the embodiments of the present application, when dimensions between two features to be added are different, a 1-dimensional convolution layer may be used to perform dimension transformation, which will not be described in detail later.
In some embodiments, in addition to element-wise addition, the first action feature and the sample position feature can also be fused by splicing, element-wise multiplication, bilinear fusion or other manners; the embodiment of the present application does not specifically limit the fusion manner.
505. The server inputs the second motion feature and the sample category feature into a plurality of coding units in an initial motion model, and the last coding unit outputs a target hidden vector associated with sample distribution obeyed by the sample motion sequence.
In some embodiments, since the sample category feature represents feature information of the sample action category, the second action feature represents pose information of the sample action sequence, the server splices the sample category feature and the second action feature, and inputs the spliced feature into the initial action model to predict a target hidden vector associated with sample distribution.
Next, a model structure of the initial motion model and a method for acquiring the target hidden vector will be described.
In some embodiments, the initial motion model comprises an encoding module (or encoder) and a decoding module (or decoder), the encoding module comprising a plurality of cascaded encoding units for predicting distribution parameters of the sample distribution; the decoding module comprises a plurality of cascaded decoding units, wherein the decoding units are used for decoding the preamble pose sequence based on a self-attention mechanism to obtain the next pose, and the number of the encoding units in the encoding module is the same as the number of the decoding units in the decoding module.
In this step 505, the above coding module, i.e., the plurality of cascaded coding units, is used to encode the feature obtained by splicing the sample category feature and the second action feature, and the feature channel of the dimension corresponding to the sample category feature in the feature vector output by the last coding unit is taken as the target hidden vector. For example, if the sample category feature comes first (i.e., occupies the 1st dimension) and the second action feature comes after it when they are spliced, the 1st-dimension feature channel of the feature vector output by the last coding unit is used as the target hidden vector; conversely, if the second action feature comes first and the sample category feature occupies the last dimension, the last-dimension feature channel of the feature vector output by the last coding unit is used as the target hidden vector.
In an exemplary scenario, taking the initial action model being a Transformer model as an example, the Transformer model includes an encoder and a decoder, where the encoder includes N cascaded coding units and the decoder includes N cascaded decoding units; N is an integer greater than or equal to 1, for example N = 6 or another value, and the value of N is not specifically limited in the embodiment of the present application.
In the encoder of the Transformer model, each coding unit internally includes a Multi-Head Attention layer and a feedforward neural network (Feed Forward Neural Network) layer. The multi-head attention layer is used to comprehensively extract, from multiple representation subspaces, the association relationships between the sample poses in the sample action sequence, and the feedforward neural network layer is used to perform full connection processing on the feature vectors output by the multi-head attention layer. A residual structure is arranged after both the multi-head attention layer and the feedforward neural network layer, i.e., the input and output of the current layer are residual-connected (i.e., spliced) and normalized before being input to the next layer; the residual connection and normalization operations can be regarded as a residual normalization (Add & Norm) layer, in other words, a residual normalization layer is connected in series after the multi-head attention layer and after the feedforward neural network layer.
Based on the internal structure of the coding unit, the feature obtained by splicing the sample category feature and the second action feature is input into the multi-head attention layer of the 1st coding unit in the encoder. The multi-head attention layer computes the Q vector, the K vector and the V vector of the input feature, performs the weighting operation based on the self-attention mechanism on the V vector using the Q vector and the K vector, and outputs a multi-head attention feature. The multi-head attention feature and the input feature of the multi-head attention layer (i.e., the feature obtained by splicing the sample category feature and the second action feature) are then input together into a residual normalization layer, i.e., the two are spliced and the resulting residual feature is normalized. The normalized residual feature is input into the feedforward neural network layer for full connection processing, and a fully connected feature is output; the fully connected feature and the input feature of the feedforward neural network layer (i.e., the normalized residual feature) are then input together into another residual normalization layer, i.e., the two are spliced and the resulting residual feature is normalized, which yields the feature vector output by the 1st coding unit. This feature vector is input into the 2nd coding unit, an operation similar to that of the 1st coding unit is performed, and so on until the last coding unit also outputs a feature vector. In the embodiment of the present application, since the sample category feature is placed first (i.e., in the 1st dimension) and the second action feature after it when they are spliced, the 1st-dimension channel feature of this feature vector corresponds to the sample category feature, so the 1st-dimension channel feature of the feature vector is used as the target hidden vector.
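For intuition, the sketch below approximates the coding module with PyTorch's built-in Transformer encoder layers: the sample category feature is spliced in front of the second action feature, encoded by N cascaded units, and the channel at the category position of the last output is taken as the target hidden vector. The class name, widths and use of nn.TransformerEncoderLayer are assumptions, not the patented structure itself.

```python
import torch
import torch.nn as nn

class ActionEncoder(nn.Module):
    """Sketch: N cascaded units of multi-head attention + feed-forward with residual
    normalization, approximated here by nn.TransformerEncoderLayer."""
    def __init__(self, d_model: int = 256, n_heads: int = 4, num_units: int = 6):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, n_heads, dim_feedforward=4 * d_model,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_units)

    def forward(self, category_feat: torch.Tensor, second_action_feat: torch.Tensor) -> torch.Tensor:
        # category_feat: (B, 1, d_model); second_action_feat: (B, T, d_model)
        tokens = torch.cat([category_feat, second_action_feat], dim=1)  # category spliced first
        encoded = self.encoder(tokens)
        return encoded[:, 0]    # 1st-dimension channel feature -> target hidden vector
```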
The encoder of the above Transformer model in effect transfers the pose of the virtual object (characterized by the sample action sequence) from the real pose distribution, through a series of encoding and mapping operations, to the sample distribution in which the target hidden vector lies. Since the original pose vector of the virtual object usually lies in a high-dimensional space while the mapped target hidden vector usually lies in a low-dimensional space, the target hidden vector can be regarded as characterizing some of the feature information implied by the pose vector of the virtual object when it is mapped to the low-dimensional space.
506. The server obtains a distribution parameter indicating the distribution of the sample based on the target hidden vector.
The sample distribution has a mapping relation with the pose distribution of the virtual object.
In some embodiments, the distribution parameters include the mean μ and the standard deviation σ of the sample distribution. Optionally, the server inputs the target hidden vector into two different fully connected layers respectively: one fully connected layer performs full connection processing on the target hidden vector and outputs the mean μ of the sample distribution, and the other fully connected layer performs full connection processing on the target hidden vector and outputs the standard deviation σ of the sample distribution. The mean μ and standard deviation σ can uniquely indicate a sample distribution, which can be expressed as N(μ, σ²). The probability space in which the sample distribution lies is the low-dimensional motion latent space obtained by mapping the pose of the virtual object.
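The sketch below shows one common way of realizing the two fully connected layers that output the mean μ and standard deviation σ; predicting the log-variance and exponentiating it is an added numerical-stability convention assumed here, not something stated in the embodiment.

```python
import torch
import torch.nn as nn

class DistributionHead(nn.Module):
    """Sketch: map the target hidden vector to the parameters of the sample distribution N(mu, sigma^2)."""
    def __init__(self, d_model: int = 256, latent_dim: int = 256):
        super().__init__()
        self.fc_mu = nn.Linear(d_model, latent_dim)       # fully connected layer for the mean
        self.fc_logvar = nn.Linear(d_model, latent_dim)   # fully connected layer for the (log) variance

    def forward(self, target_hidden: torch.Tensor):
        mu = self.fc_mu(target_hidden)
        sigma = torch.exp(0.5 * self.fc_logvar(target_hidden))   # standard deviation, kept positive
        return mu, sigma
```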
In the above steps 502-506, a possible implementation manner of determining, by the server, a sample distribution obeyed by the sample action sequence based on the sample action category and the sample action sequence, that is, a distribution parameter is predicted by the coding module of the initial action model, and the sample distribution can be uniquely determined by the distribution parameter output by the coding module.
507. And the server re-parameterizes the sample distribution through standard normal distribution, and samples the sample distribution after re-parameterization to obtain the potential characteristics of the sample.
In some embodiments, since the sample distribution is usually not a standard normal distribution, the sample distribution is re-parameterized, and a loss term based on the relative entropy (i.e., KL divergence) between the sample distribution and the standard normal distribution is introduced into the loss function value in step 510 below. Under the constraint of this loss term, minimizing the loss function value encourages the sample distribution to be optimized toward the standard normal distribution, so that when training ends the sample distribution can be regarded as approximating the standard normal distribution. Therefore, when predicting with the action generation model, since the action generation model includes only the decoder and no encoder, there is no need to predict the distribution parameters of the sample distribution as in the training stage; the trained decoder can be driven to autoregressively generate the action sequence simply by randomly sampling the potential action feature from the standard normal distribution.
In some embodiments, the process of re-parameterizing the sample distribution includes: randomly sampling a re-parameterization adjustment factor ε ~ N(0, 1) from the standard normal distribution N(0, 1), and then correcting the standard deviation σ of the sample distribution with the adjustment factor ε, i.e., multiplying the standard deviation σ by the adjustment factor ε, to obtain the potential vector z = μ + σ·ε, which corresponds to a sample drawn from the distribution N(μ, σ²). Although this potential vector z is obtained by randomly sampling from the standard normal distribution N(0, 1) and transforming the sample with the distribution parameters of the sample distribution, it can be regarded as a sample from the sample distribution N(μ, σ²); this is called the re-parameterization process, i.e., the sample distribution itself is not changed before and after re-parameterization, but the posterior distribution N(μ, σ²) of the potential vector z is obtained. Assuming that M denotes the probability space in which the posterior distribution of the potential vector z lies, the re-parameterization process is equivalent to the server randomly sampling a sample potential feature from the probability space M (i.e., the potential vector z ∈ M).
It should be noted that the sample distribution is not changed before and after re-parameterization, but re-parameterization turns the random sampling of the potential vector z from the probability space M into randomly sampling ε from the standard normal distribution N(0, 1) and then transforming it with the distribution parameters of the sample distribution to obtain the potential vector z. This sampling manner is similar to the manner of sampling the potential vector z from the standard normal distribution in step 201, and is not repeated here.
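A minimal sketch of the re-parameterized sampling described above follows; the function name is hypothetical, and the tensors are assumed to hold μ and σ per latent dimension.

```python
import torch

def reparameterize(mu: torch.Tensor, sigma: torch.Tensor) -> torch.Tensor:
    """Sketch: z = mu + sigma * epsilon, with epsilon ~ N(0, 1), so z ~ N(mu, sigma^2)
    while gradients can still flow through mu and sigma during training."""
    epsilon = torch.randn_like(sigma)   # adjustment factor sampled from the standard normal distribution
    return mu + sigma * epsilon         # potential vector z, the sample potential feature
```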
508. The server obtains sample bias information for the sample potential feature based on the sample action category, the sample bias information being used to characterize an impact factor on the sample potential feature under the sample action category.
In some embodiments, the server encodes the sample action category to obtain an One-Hot vector of the sample action category, or performs an Embedding process on the sample action category to obtain an Embedding vector of the sample action category, or caches the sample action category after obtaining the One-Hot vector or the Embedding vector of the sample action category in step 502, and directly reads the cached One-Hot vector or the Embedding vector of the sample action category in step 508.
In some embodiments, the server inputs One-Hot vector or an embedded vector of the sample action category into a full connection layer for full connection processing, and outputs the sample offset information. It should be noted that, the fully connected layer for obtaining the sample bias information in the present step 508 may have different weight parameters from the fully connected layer for obtaining the sample class feature used in the above step 502, and the two fully connected layers may also have the same or different input vectors. Illustratively, one-Hot vectors of sample action categories are respectively input into different full-connection layers, one full-connection layer outputs sample category characteristics, and the other full-connection layer outputs sample bias information. Illustratively, one-Hot vectors of sample action categories are input into One fully connected layer to obtain sample category features, and embedded vectors of sample action categories are input into another fully connected layer to obtain sample bias information, which is not particularly limited in the embodiments of the present application.
509. The server generates a predicted motion sequence for the virtual object by a plurality of decoding units in the initial motion model based on the sample potential features and the sample bias information.
In some embodiments, this step 509 involves generating a predicted action sequence using a decoding module in the initial action model that contains a plurality of cascaded decoding units.
Optionally, the server fuses the sample potential feature and the sample bias information to obtain a sample motion bias feature, for example, a potential vector z (i.e., a sample potential feature) randomly sampled from the re-parameterized sample distribution is spliced with the sample bias information b to obtain a sample motion bias feature. Optionally, other ways of adding by element, multiplying by element, bilinear fusion and the like are also supported for fusion, and the embodiment of the application does not specifically limit the fusion way.
Optionally, unlike the prediction stage in the previous embodiment, in which every element of the preamble pose sequence is a pose generated by the decoder, in the training stage each pose in the preamble pose sequence is taken from the predicted poses generated by the decoder only with the target probability, and otherwise from the sample poses that actually occur. The preamble pose sequence is obtained by scheduled sampling, which greatly reduces the influence of mispredicted poses in the autoregressive manner and prevents the prediction error of the pose of a certain frame from being propagated and accumulated in subsequent poses, so the prediction accuracy of the decoder can be improved. It should be noted that scheduled sampling is an optional step; the predicted poses generated by the decoder can also be used directly as the preamble pose sequence, as in the prediction stage, which simplifies the training process.
Optionally, the scheduled sampling process includes: after the server predicts the latest pose P_i in the action sequence, based on the target probability, the preamble pose sequence motion_prev = [P-prev_1, P-prev_2, …, P-prev_i] is obtained by scheduled sampling from the predicted action sequence [P_1, P_2, …, P_i] and from the sample poses [P_1', P_2', …, P_i'] in the sample action sequence whose position sequence numbers do not exceed that of the latest pose. Here the latest pose P_i refers to the last pose in the existing predicted action sequence, i.e., the most recently predicted pose. The target probability is used to characterize the likelihood that each pose in the preamble pose sequence is sampled from the predicted action sequence; that is, for the same frame, the probability that P-prev_i equals the predicted pose P_i is the target probability, and the probability that P-prev_i equals the sample pose P_i' is 1 minus the target probability. The target probability is a probability value greater than or equal to 0 and less than or equal to 1.
Optionally, the target probability is not fixed during prediction but is positively correlated with the position sequence number of the latest pose in the predicted action sequence; in other words, the target probability increases with the sequence length of the predicted action sequence. Thus, at the beginning of prediction, when the predicted action sequence is still short, the predicted poses generated by the decoder are sampled with a smaller probability and the actually occurring sample poses are sampled with a larger probability, so that the decoder learns the relevant knowledge of the sample poses faster; as prediction continues, the predicted poses generated by the decoder are sampled with a larger probability and the actually occurring sample poses with a smaller probability, so that the decoder learns faster how to generate a new predicted pose based on the predicted poses of past frames.
It should be noted that, the technician may set the target probability to a fixed value, for example, fix the target probability to 50%, so as to simplify the training process, and the embodiment of the application does not specifically limit whether the target probability is a variable probability.
In an exemplary scenario, assume that the current predicted action sequence is [P_1, P_2, …, P_i]. Since the decoder has not stopped predicting, the next pose P_{i+1} needs to be predicted. In the prediction stage, the existing [P_1, P_2, …, P_i] would be used directly as the preamble pose sequence; in the training stage, however, scheduled sampling is used. Taking a current target probability of 70% as an example, for the 1st frame, the predicted pose P_1 of the 1st frame in the predicted action sequence is sampled with a probability of 70%, and the sample pose P_1' of the 1st frame in the sample action sequence is sampled with a probability of 30%. The same scheduled sampling process is performed for every frame, finally obtaining the preamble pose sequence motion_prev = [P-prev_1, P-prev_2, …, P-prev_i].
It should be noted that, during scheduled sampling, for each frame only the predicted pose with the corresponding position sequence number in the predicted action sequence and the sample pose with the corresponding position sequence number in the sample action sequence are sampled, never the predicted poses or sample poses of other position sequence numbers. This guarantees that the future sample poses [P_{i+1}', P_{i+2}', …, P_T'] can never appear in the preamble pose sequence of the latest predicted pose P_i, i.e., the decoder never "sees" any information about the sample poses of future frames.
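A short sketch of the scheduled sampling step follows; the helper name and the use of Python lists are assumptions made for illustration.

```python
import random

def scheduled_sampling(predicted_poses, sample_poses, target_prob: float):
    """Sketch: for each frame index, take the decoder's predicted pose with probability
    target_prob, otherwise the sample pose that actually occurs at the same position."""
    preamble = []
    for pred_pose, sample_pose in zip(predicted_poses, sample_poses):
        preamble.append(pred_pose if random.random() < target_prob else sample_pose)
    return preamble    # preamble pose sequence motion_prev = [P-prev_1, ..., P-prev_i]
```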
Optionally, after the preamble pose sequence is acquired, a position-coding vector of the preamble pose sequence (i.e., a sample position feature of the preamble pose sequence) is acquired in a similar manner to that described in step 504 above. And then, inputting the preamble pose sequence into a full-connection layer, performing full-connection processing on the preamble pose sequence through the full-connection layer, and adding the feature vector output by the full-connection layer and the position coding vector according to elements to obtain sample preamble pose features, wherein the sample position features are used for representing the sequence of each pose in the preamble pose sequence.
In the above process, the preamble pose sequence motion_prev = [P-prev_1, P-prev_2, …, P-prev_i] and the sample position feature of the preamble pose sequence are fused to obtain the sample preamble pose feature, so that the sample preamble pose feature can characterize both the features of each pose in the preamble pose sequence and the temporal order of those poses. It should be noted that only fusion by element-wise addition is described here; optionally, fusion can also be performed by splicing, element-wise multiplication, bilinear fusion or other manners, and the embodiment of the present application does not specifically limit the fusion manner.
In some embodiments, after the sample action bias feature and the sample preamble pose feature are obtained, the feature obtained by splicing the two is input into the decoder of the initial action model, and the decoder decodes the spliced feature to obtain the next pose P_{i+1} after the latest pose in the predicted action sequence. For example, the spliced feature is input into the plurality of cascaded decoding units, the last decoding unit outputs the hidden vector of the next pose P_{i+1}, and full connection processing is performed on this hidden vector to obtain the next pose P_{i+1}.
In an exemplary scenario, taking the initial action model being a Transformer model as an example, this step 509 involves predicting the hidden vector of the next pose P_{i+1} with the decoder of the Transformer model. In the decoder of the Transformer model, each decoding unit internally includes a masked multi-head attention layer, a fused multi-head attention layer and a feedforward neural network layer. The masked multi-head attention layer is similar to the multi-head attention layer introduced for the coding unit, but it only attends to the information of the preceding poses before the current frame, which is equivalent to masking out the later sample poses. The fused multi-head attention layer is also similar to the multi-head attention layer in the coding unit, but it takes as input both the feature output by the masked multi-head attention layer of its own decoding unit and the feature output by the coding unit with the corresponding serial number; for example, for the 1st decoding unit, the input signal of the fused multi-head attention layer includes not only the feature output by the masked multi-head attention layer of the 1st decoding unit but also the feature output by the 1st coding unit in the encoder. This design allows the decoding unit to attend to the encoded information from the encoder while generating its own prediction. The feedforward neural network layer is used to perform full connection processing on the feature vector output by the fused multi-head attention layer, and a residual normalization (Add & Norm) layer is connected in series after each of the masked multi-head attention layer, the fused multi-head attention layer and the feedforward neural network layer.
Based on the internal structure of the decoding unit, the sample action bias feature and the sample preamble pose feature are input into the masked multi-head attention layer of the 1st decoding unit of the decoder. The masked multi-head attention layer computes the K vector and the V vector based on the sample action bias feature, uses the sample preamble pose feature as the Q vector, performs the weighting operation based on the self-attention mechanism on the V vector using the Q vector and the K vector, and outputs a multi-head attention feature. The multi-head attention feature and the input feature of the masked multi-head attention layer (i.e., the feature obtained by splicing the sample action bias feature and the sample preamble pose feature) are then input together into a residual normalization layer, i.e., the two are spliced and the resulting residual feature is normalized, and the normalized residual feature is input into the fused multi-head attention layer.
For the fused multi-head attention layer of the 1st decoding unit, the input signal includes the normalized residual feature (i.e., the feature output by the residual normalization layer after the masked multi-head attention layer of the 1st decoding unit) and the feature vector output by the 1st coding unit in the encoder. The feature vector output by the 1st coding unit and the normalized residual feature are spliced and input into the fused multi-head attention layer, which computes the Q vector, the K vector and the V vector of the spliced feature, performs the weighting operation based on the self-attention mechanism on the V vector using the Q vector and the K vector, and outputs a pose interaction feature. The pose interaction feature and the input feature of the fused multi-head attention layer (i.e., the feature obtained by splicing the feature vector output by the 1st coding unit and the normalized residual feature) are then input together into another residual normalization layer, i.e., the two are spliced and the resulting residual feature is normalized, and this normalized residual feature is input into the feedforward neural network layer.
The feedforward neural network layer of the 1st decoding unit performs full connection processing on the normalized residual feature it receives and outputs a fully connected feature. The fully connected feature and the input feature of the feedforward neural network layer (i.e., the normalized residual feature) are then input together into yet another residual normalization layer (different from the previous two), i.e., the two are spliced and the resulting residual feature is normalized; this normalized residual feature is the feature vector output by the 1st decoding unit. The feature vector output by the 1st decoding unit is then input into the 2nd decoding unit, which performs an operation similar to that of the 1st decoding unit (now the fused multi-head attention layer of the 2nd decoding unit also needs to obtain the feature vector output by the 2nd coding unit), and so on until the last decoding unit also outputs a feature vector. The feature vector output by the last decoding unit is input into two cascaded fully connected layers for full connection processing, and the latter fully connected layer outputs the predicted next pose P_{i+1}.
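As a rough stand-in for the decoder structure just described (masked attention over the preamble, attention to the encoder outputs, and feedforward layers with residual normalization), the sketch below uses PyTorch's built-in Transformer decoder. The exact wiring of the embodiment (K/V generated from the sample action bias feature in the masked layer) is approximated here by simply prepending that feature to the target sequence, and all names and sizes are assumptions.

```python
import torch
import torch.nn as nn

decoder_layer = nn.TransformerDecoderLayer(d_model=256, nhead=4, batch_first=True)
decoder = nn.TransformerDecoder(decoder_layer, num_layers=6)   # N cascaded decoding units

def decode_next_pose(preamble_feat, bias_feat, encoder_memory, pose_heads):
    # preamble_feat:  (B, i, 256) sample preamble pose features
    # bias_feat:      (B, 1, 256) sample action bias feature, prepended to the target sequence
    # encoder_memory: (B, T + 1, 256) feature vectors output by the coding units
    tgt = torch.cat([bias_feat, preamble_feat], dim=1)
    length = tgt.shape[1]
    causal_mask = torch.triu(torch.full((length, length), float("-inf")), diagonal=1)
    hidden = decoder(tgt, encoder_memory, tgt_mask=causal_mask)   # masked + fused attention + FFN
    return pose_heads(hidden[:, -1])   # two cascaded FC layers (assumed) -> next pose P_{i+1}
```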
Fig. 6 is a schematic diagram of a decoding prediction process according to an embodiment of the present application. As shown in fig. 6, taking the case in which the initial action model is a Transformer model, it illustrates how the i-th pose and the (i+1)-th pose are predicted. The Transformer model includes N cascaded coding units 610 and N cascaded decoding units. Since the N decoding units are invoked cyclically when predicting the i-th pose and when predicting the (i+1)-th pose, in order to make it easy to compare the input and output of the decoding units in the two cycles, the decoding unit 621 in the i-th cycle and the decoding unit 622 in the (i+1)-th cycle are drawn separately; in reality, however, decoding unit 621 and decoding unit 622 refer to two different cycles of the same group of N cascaded decoding units.
In the training stage, the second motion feature obtained by combining the sample motion sequence 601 with the position coding vector processing and the sample class feature obtained by combining the sample motion class processing are input into N cascade coding units 610 together, the last coding unit 610 outputs a target hidden vector, the mean value mu and standard deviation sigma of sample distribution are predicted through the target hidden vector, then a heavy parameterization mode is used for sampling to obtain a potential vector z, and the potential vector z is combined with sample bias information b obtained based on the sample motion class to predict the (i+1) th pose P i+1 At this time, the potential vector z and the sample offset information b are spliced and input to N concatenated decoding units 622. Furthermore, for N concatenated decoding units 622, in advanceMeasure the i+1th pose P i+1 When it is required to base the sample action sequence 601[ P ] 1 ’,P 2 ’,…,P i ’](from P for the full sample action sequence i ' truncation at) and the predicted action sequence 602[ p ] output from the decoding unit 621 in the ith cycle 1 ,P 2 ,…,P i ]Performing dispatch sampling to obtain a preamble pose sequence 603[ P_prev ] 1 ,P-prev 2 ,…,P-prev i ]This preamble pose sequence 603 includes both the sample pose that actually occurs in the sample motion sequence 601 and the predicted pose that is generated by the decoding unit in the predicted motion sequence 602.
It should be appreciated that FIG. 6 is for ease of reference, only the input to the decoding unit 621 in the ith cycle is depicted as a sample action sequence 601[ P ] 1 ’,P 2 ’,…,P i ’]And the output of the decoding unit 621 in the ith cycle is plotted as a predicted motion sequence 602[ P ] 1 ,P 2 ,…,P i ]. In actual training, the input of the decoding unit 621 in the ith cycle is based on the sample motion sequence [ P ] 1 ’,P 2 ’,…,P i-1 ’]And a predicted motion sequence [ P ] outputted from the decoding unit in the i-1 th cycle 1 ,P 2 ,…,P i-1 ]The preamble pose sequence obtained by the scheduling sampling, and the output of the decoding unit 621 in the ith cycle is the ith pose P i
In the (i+1)-th cycle, the feature obtained by splicing the preamble pose sequence 603 obtained by scheduled sampling, the latent vector z, and the sample bias information b is input into the N cascaded decoding units 622, which decode the input feature. The last decoding unit 622 outputs the hidden vector of the (i+1)-th pose P_{i+1}, which is then input into 2 cascaded fully connected layers; the latter fully connected layer outputs the finally predicted (i+1)-th pose P_{i+1}. This (i+1)-th pose P_{i+1} is appended to the tail of the original predicted motion sequence [P_1, P_2, …, P_i], so as to obtain the new predicted motion sequence [P_1, P_2, …, P_i, P_{i+1}] needed for scheduled sampling in the (i+2)-th cycle.
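The autoregressive rollout described above, where each newly predicted pose is appended to the tail of the predicted sequence for the next cycle, can be sketched as follows; `decoder` and `pose_head` are hypothetical stand-ins for the cascaded decoding units and the two fully connected layers, and their interfaces are assumptions.

```python
import torch

def generate_sequence(decoder, pose_head, z, b, init_pose, num_frames):
    """Sketch of autoregressive pose prediction: cycle i's output is appended
    to the sequence consumed in cycle i+1."""
    poses = [init_pose]                           # P_1
    for _ in range(num_frames - 1):
        preamble = torch.stack(poses, dim=1)      # [P_1, ..., P_i]
        hidden = decoder(preamble, z, b)          # decode with latent z and bias b
        next_pose = pose_head(hidden[:, -1])      # hidden vector of the next pose -> P_{i+1}
        poses.append(next_pose)                   # append to the tail for the next cycle
    return torch.stack(poses, dim=1)              # full predicted action sequence
```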
510. The server obtains a loss function value of the current iteration based on the sample action sequence and the predicted action sequence.
In some embodiments, the loss function value includes: at least one of a pose reconstruction penalty term, a joint position reconstruction penalty term, or a relative entropy penalty term. For example, the loss function value is represented by using only the pose reconstruction loss term, or the loss function value is represented by using the pose reconstruction loss term and the relative entropy loss term together, or the loss function value is represented by using the pose reconstruction loss term, the joint position reconstruction loss term, and the relative entropy loss term together, which is not particularly limited in the embodiment of the present application.
Wherein the pose reconstruction loss term is used to characterize the difference between the sample motion sequence and the predicted motion sequence, i.e., the difference between the motion sequence predicted by the decoder and the real motion sequence of the training sample. For example, assuming that the sample motion sequence is denoted [P_1', P_2', …, P_T'] and the predicted motion sequence is denoted [P_1, P_2, …, P_T], the pose reconstruction loss term is modeled with an L2 regularization loss between the sample motion sequence [P_1', P_2', …, P_T'] and the predicted motion sequence [P_1, P_2, …, P_T]. The L2 regularization loss \mathcal{L}_P is expressed as follows:

\mathcal{L}_P = \sum_{i=1}^{T} \left\| P_i' - P_i \right\|_2^2
wherein i is an integer greater than or equal to 1 and less than or equal to T, and T represents the sequence length of the sample motion sequence.
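A minimal sketch of this pose reconstruction loss is given below, assuming the two sequences are stored as tensors of shape (batch, T, pose_dim); the shapes are an assumption for illustration.

```python
import torch

def pose_reconstruction_loss(sample_seq, pred_seq):
    # sums the squared L2 norm over the T frames as in the formula above,
    # then averages over the batch
    return ((sample_seq - pred_seq) ** 2).sum(dim=(1, 2)).mean()
```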
Wherein the joint position reconstruction loss term is used to characterize the difference between the joint position sample sequence and the joint position prediction sequence of the virtual object. In the case where the object model of the virtual object is an SMPL model, it represents the difference between the SMPL joint positions reconstructed from the motion sequence predicted by the decoder and the SMPL joint positions reconstructed from the real motion sequence of the training sample. That is, using the SMPL model (assuming the shape parameter is taken as the average shape; other values can of course be set as needed), a joint position sample sequence [J_1', J_2', …, J_T'] of the virtual object is reconstructed based on the sample motion sequence [P_1', P_2', …, P_T'], and a joint position prediction sequence [J_1, J_2, …, J_T] of the virtual object is reconstructed based on the predicted motion sequence [P_1, P_2, …, P_T]. For example, the joint position reconstruction loss term is modeled with an L2 regularization loss between the joint position sample sequence [J_1', J_2', …, J_T'] and the joint position prediction sequence [J_1, J_2, …, J_T]. The L2 regularization loss \mathcal{L}_J is expressed as follows:

\mathcal{L}_J = \sum_{i=1}^{T} \left\| J_i' - J_i \right\|_2^2
wherein i is an integer greater than or equal to 1 and less than or equal to T, and T represents the sequence length of the sample motion sequence.
Wherein the relative entropy loss term is used to characterize the difference between the sample distribution and the standard normal distribution, i.e., the difference, in the low-dimensional space to which the encoder maps the real pose distribution, between the sample distribution and the standard normal distribution. Schematically, the relative entropy loss term is characterized as:

\mathcal{L}_{KL} = KL\big(\mathcal{N}(\mu, \sigma^2)\,\|\,\mathcal{N}(0, I)\big)
Optionally, in the case where the pose reconstruction loss term \mathcal{L}_P, the joint position reconstruction loss term \mathcal{L}_J, and the relative entropy loss term \mathcal{L}_{KL} jointly characterize the loss function value, the loss terms are weighted and summed; that is, the final loss function value \mathcal{L} is characterized as the weighted sum of the pose reconstruction loss term \mathcal{L}_P, the joint position reconstruction loss term \mathcal{L}_J, and the relative entropy loss term \mathcal{L}_{KL}. The loss function value \mathcal{L} is expressed as follows:

\mathcal{L} = \lambda_P \mathcal{L}_P + \lambda_J \mathcal{L}_J + \lambda_{KL} \mathcal{L}_{KL}
wherein λ_P represents the weight of the pose reconstruction loss term \mathcal{L}_P in the loss function, λ_J represents the weight of the joint position reconstruction loss term \mathcal{L}_J in the loss function, and λ_KL represents the weight of the relative entropy loss term \mathcal{L}_{KL} in the loss function.
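A hedged sketch of the weighted total loss is given below. The default weights are placeholders rather than values disclosed in this application, the joint sequences are assumed to have been obtained from an SMPL forward pass beforehand, and the closed form used for the KL term is the standard expression for a diagonal Gaussian against N(0, I).

```python
import torch

def total_loss(sample_seq, pred_seq, sample_joints, pred_joints, mu, logvar,
               lambda_p=1.0, lambda_j=1.0, lambda_kl=1e-3):
    """L = lambda_P * L_P + lambda_J * L_J + lambda_KL * L_KL."""
    l_pose = ((sample_seq - pred_seq) ** 2).sum(dim=(1, 2)).mean()          # L_P
    l_joint = ((sample_joints - pred_joints) ** 2).sum(dim=(1, 2)).mean()   # L_J
    # KL divergence between N(mu, sigma^2) and N(0, I), closed form per latent dimension
    l_kl = -0.5 * torch.mean((1 + logvar - mu.pow(2) - logvar.exp()).sum(dim=-1))
    return lambda_p * l_pose + lambda_j * l_joint + lambda_kl * l_kl
```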
A technician can set λ_P, λ_J, and λ_KL in a personalized manner, which makes it easy to adjust the contribution of the different loss terms and thus trade off the diversity and the realism of the generated action sequence.
511. The server adjusts model parameters of the initial motion model, iteratively executes the steps 502-510, and outputs the plurality of decoding units in the initial motion model as a motion generation model when the loss function value meets a stop condition.
The action generation model is used for generating an action sequence of the virtual object to execute actions corresponding to the input action categories in an autoregressive mode.
In some embodiments, if the loss function value obtained in step 510 does not meet the stop condition, the model parameters of the initial motion model are adjusted, and steps 502-510 are iteratively performed until the loss function value obtained in a certain iteration meets the stop condition, training is stopped, and a plurality of decoding units trained in the initial motion model are output as a final motion generation model.
Optionally, the stopping condition means that the loss function value is lower than a loss threshold, and the loss threshold is any value greater than or equal to 0 and less than or equal to 1, then when the loss function value is greater than or equal to the loss threshold, the model parameters of the initial motion model are adjusted, the steps 502-510 are iteratively executed until the loss function value acquired at a certain iteration is lower than the loss threshold, and training is stopped.
Optionally, the stopping condition refers to that the number of iterations exceeds a number threshold, where the number threshold is any integer greater than or equal to 1, and when the number of iterations is less than or equal to the number threshold, model parameters of the initial motion model are adjusted, and the steps 502-510 are iteratively executed until the number of iterations exceeds the number threshold at a certain iteration, and training is stopped.
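The iterative training procedure of steps 502-511 with the two optional stop conditions can be sketched as follows; `model.training_step`, the threshold values, and the data-loader interface are hypothetical names introduced only for illustration.

```python
def train(model, optimizer, data_loader, loss_threshold=0.01, max_iterations=100000):
    """Sketch of the training loop: stop when the loss falls below a threshold
    or the iteration count exceeds a threshold."""
    iteration = 0
    while True:
        for sample_seq, action_class in data_loader:
            loss = model.training_step(sample_seq, action_class)  # one pass of steps 502-510
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()                                      # adjust model parameters
            iteration += 1
            if loss.item() < loss_threshold or iteration > max_iterations:
                # output the trained decoding units as the action generation model
                return model.decoder
```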
In some embodiments, after the training to obtain the action generation model, the trained action generation model is tested on a test dataset based on a manner similar to the predicted action sequence in the previous embodiment, thereby enabling performance assessment of the trained action generation model.
In the above steps 509-511, a possible implementation of training the initial motion model based on the sample latent features, the sample bias information, and the sample motion sequence to obtain the motion generation model is provided. When the Transformer model is adopted as the initial motion model, the corresponding decoder can be trained in combination with the idea of the VAE. By introducing a relative entropy loss term into the loss function, the sample distribution is driven towards the standard normal distribution as the relative entropy loss term is minimized. In addition, by introducing a joint position reconstruction loss term into the loss function, the body part rotations of the 3D human model are constrained, via the joint positions of the SMPL model, as the joint position reconstruction loss term is minimized, so that a more realistic motion sequence is generated.
Furthermore, since the initial motion model is a Transformer model, the problem that some other autoregressive models tend to regress to an average pose is alleviated to a certain extent, which mitigates the drift of the pose of the virtual object in the motion sequence; and after the relative entropy loss term is introduced into the loss function, the diversity and the realism of the generated motion sequence can be well balanced.
Furthermore, since the decoder of the Transformer model predicts the next pose in an autoregressive manner and obtains the preamble pose sequence in combination with scheduled sampling, sample poses (i.e., the ground-truth poses in the sample motion sequence) are used as the decoder input as much as possible early in training, which quickly guides the decoder from a randomly initialized state to a reasonable parameter state; as training proceeds, more predicted poses (i.e., poses generated by the decoder in the predicted motion sequence) are used as the decoder input, which alleviates the domain deviation problem caused by the inconsistent data distributions of the domain of the sample poses and the domain of the predicted poses, and increases the fault tolerance of the trained motion generation model.
Furthermore, since fully connected layers are added before and after the Transformer model, the encoder can learn the distribution parameters of the motion latent space (i.e., the sample distribution) in the sample data, so that the timing information contained in long sequences is effectively utilized, the interpretability of the trained motion generation model is improved, and an explicit representation of the motion latent space can be modeled.
Furthermore, the Transformer model does not depend on the GAN architecture, which avoids the problems that an RNN generator easily falls into overfitting and has poor generalization capability, i.e., the overfitting problem is effectively alleviated. In addition, since the Transformer model predicts future poses autoregressively, the generated motion sequence can be of variable length, which greatly improves the diversity of the generated motion sequences.
In the embodiment of the application, the sample distribution is determined on the sample data set, and in the process of training the motion generation model based on the sample bias information and the sample potential characteristics sampled from the sample distribution, the sample distribution gradually approaches the standard normal distribution through continuous optimization of the initial motion model, and the sample distribution is the standard normal distribution when training is stopped, so that the potential motion characteristics can be directly sampled on the standard normal distribution when the motion generation model is used for predicting the motion sequence, and the accuracy and naturalness of the motion sequence generated based on the motion generation model are finally improved.
Fig. 7 is a schematic flowchart of a training method of an action generation model according to an embodiment of the present application. As shown in fig. 7, the initial action model is taken as a Transformer model as an example, and the related technique of the variational autoencoder (VAE) is also used in the training phase, so the architecture is also called a Transformer variational autoencoder. In the training phase, the Transformer variational autoencoder comprises an encoder and a decoder; in the testing phase and the usage phase after training, only the trained decoder is used as the action generation model. After the action category to be generated is input, the trained action generation model can generate realistic, diverse, and variable-length action sequences matching the input action category. The training process of the Transformer variational autoencoder is described below.
The Transformer variational autoencoder comprises an encoder 710 and a decoder 720. The encoder 710 generates distribution parameters of the sample distribution based on the input sample action sequence and the corresponding sample action class; for example, the distribution parameters include a mean μ and a standard deviation σ. After the sample distribution is reparameterized, the reparameterized sample distribution is sampled to obtain the sample latent feature (i.e., the latent vector z). In addition, sample bias information is obtained according to the sample action class, and scheduled sampling is performed on the sample action sequence and the existing predicted action sequence to obtain a preamble pose sequence. The decoder 720 takes the sample latent feature, the sample bias information, and the preamble pose sequence as input and predicts new poses frame by frame in an autoregressive manner.
When the Transformer model is adopted as the initial motion model, since the encoder 710 predicts the distribution parameters of the sample distribution, the corresponding decoder 720 can be trained in combination with the idea of the VAE. By introducing a relative entropy loss term into the loss function, the sample distribution is driven towards the standard normal distribution as the relative entropy loss term is minimized; the timing information contained in long sequences is thus effectively utilized, the interpretability of the trained motion generation model is improved, and an explicit representation of the motion latent space can be modeled.
In the following, a test comparison will be made between an action sequence generated by a non-autoregressive method and an action sequence generated by an autoregressive method based on an action generation model according to an embodiment of the present application.
Fig. 8 is a schematic diagram of an action sequence generated in a non-autoregressive manner. As shown at 800, the action category used in the test is "rotation". It can be seen that, because the autoregressive algorithm and the scheduled sampling algorithm are not adopted, the finally generated action sequence is neither realistic nor natural, and looks more like the camera view rotating rather than the 3D-reconstructed virtual object rotating.
Fig. 9 is a schematic diagram of action sequences generated in an autoregressive manner according to an embodiment of the present application. As shown in fig. 9, after the Transformer variational autoencoder is trained with the training method of the previous embodiment, the decoder part is used as the action generation model to generate action sequences in an autoregressive manner. The action category used for testing is "reversing" in 901, "lifting arm" in 902, and "placing things" in 903. It should be noted that the initial pose used in the testing stage is either the initial pose corresponding to the action category or the initial pose initialized by the SMPL model, which is not specifically limited in the embodiment of the present application. By comparing 901-903 in fig. 9 with 800 in fig. 8, it can be clearly seen that the action sequences generated in the autoregressive manner in the embodiment of the present application are far more realistic and natural than the action sequence generated in the non-autoregressive manner.
Fig. 10 is a schematic diagram of action sequences generated in an autoregressive manner according to an embodiment of the present application. As shown in fig. 10, 1001-1003 show three different action sequences generated by the action generation model for the same action category, "waving arms". It can be seen that, compared with an RNN generator under a GAN architecture, which always tends to output the same action sequence, the action generation model of the embodiment of the present application can generate more diverse action sequences, and its generalization capability is higher.
Fig. 11 is a schematic structural diagram of an action generating device for a virtual object according to an embodiment of the present application, where, as shown in fig. 11, the device includes:
the random sampling module 1101 is configured to randomly sample to obtain a potential motion feature, where the potential motion feature is mapping the pose distribution of the virtual object to a random sampling value under standard normal distribution;
an obtaining module 1102, configured to obtain, based on an action class to be generated, bias information of the potential action feature, where the bias information is used to characterize an impact factor on the potential action feature under the action class;
the generating module 1103 is configured to generate, in an autoregressive manner, an action sequence of the virtual object based on the potential action feature and the bias information, where the action sequence is used to characterize a pose of the virtual object for executing an action corresponding to the action category.
According to the device provided by the embodiment of the application, potential action characteristics are obtained through random sampling, so that hidden information of the pose of the virtual object is represented under a probability space corresponding to standard normal distribution, and then the action sequence of the virtual object is predicted by autoregressively combining with bias information related to the action category to be generated, so that the action sequence not only represents the hidden information, but also meets the constraint of the action category, and the pose of each frame in the past is also introduced into an autoregressive mode to predict the pose of the current frame, so that the generation process of the final action sequence does not depend on an RNN generator under a GAN architecture, the mode collapse problem is avoided, and the action sequence with higher diversity can be generated.
In one possible implementation, based on the apparatus composition of fig. 11, the generating module 1103 includes:
and the prediction unit is used for inputting the potential action characteristics, the bias information, the pose and each precursor pose of the pose into an action generation model for any pose in the action sequence, and predicting the next pose in the action sequence through the action generation model.
In one possible implementation, based on the apparatus composition of fig. 11, the prediction unit includes:
the first fusion subunit is used for fusing the potential action feature and the bias information to obtain an action bias feature;
the second fusion subunit is used for fusing the pose, each precursor pose of the pose and the pose position characteristics to obtain precursor pose characteristics, wherein the pose position characteristics are used for representing the sequence of the pose and each precursor pose of the pose in time sequence;
and the pose decoding subunit is used for decoding to obtain the next pose based on the action bias characteristic and the preamble pose characteristic.
In one possible embodiment, the second fusion subunit is configured to:
performing full connection processing on the pose and each front pose of the pose to obtain initial pose characteristics;
And fusing the initial pose features and the pose position features to obtain the preamble pose features.
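As an illustration of the pose position features, a sinusoidal encoding is sketched below. The choice of sinusoidal features and of fusing by addition are assumptions made here for illustration, since the source only states that the initial pose features and the pose position features are fused.

```python
import math
import torch

def positional_features(seq_len, dim):
    """Illustrative sinusoidal position features encoding the temporal order
    of the pose and its preamble poses (dim is assumed even)."""
    pos = torch.arange(seq_len).unsqueeze(1).float()
    div = torch.exp(torch.arange(0, dim, 2).float() * (-math.log(10000.0) / dim))
    pe = torch.zeros(seq_len, dim)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe

# fusing by addition (an assumption): initial pose features + position features
# preamble_pose_feat = initial_pose_feat + positional_features(*initial_pose_feat.shape)
```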
In one possible implementation, the action generation model includes a plurality of decoding units for decoding the input features based on a self-attention mechanism;
the pose decoding subunit is configured to:
inputting the motion bias feature and the preamble pose feature into the plurality of decoding units, and outputting the hidden vector of the next pose by the last decoding unit;
and carrying out full connection processing on the hidden vector of the next pose to obtain the next pose.
In one possible implementation, the obtaining module 1102 is configured to:
and performing full connection processing on the single thermal codes of the action types to obtain the offset information.
In one possible embodiment, the device based on fig. 11 is composed, and the device further comprises:
and the synthesis module is used for synthesizing an action segment of the virtual object for executing the action corresponding to the action category based on the action sequence and the object model of the virtual object.
All the above optional solutions can be combined to form an optional embodiment of the present disclosure, which is not described in detail herein.
It should be noted that: the motion generating device for a virtual object provided in the above embodiment only illustrates the division of the above functional modules when generating a motion sequence for a virtual object, and in practical application, the above functional allocation can be completed by different functional modules according to needs, that is, the internal structure of the computer device is divided into different functional modules to complete all or part of the functions described above. In addition, the motion generating device of the virtual object provided in the foregoing embodiment and the motion generating method embodiment of the virtual object belong to the same concept, and the specific implementation process of the motion generating device of the virtual object is detailed in the motion generating method embodiment of the virtual object, which is not described herein.
Fig. 12 is a schematic structural diagram of a training device for an action generating model according to an embodiment of the present application, please refer to fig. 12, wherein the device includes:
a determining module 1201, configured to determine, based on a sample action category and a sample action sequence, a sample distribution to which the sample action sequence is subjected, where the sample distribution has a mapping relationship with a pose distribution of a virtual object;
the sampling module 1202 is configured to re-parameterize the sample distribution through standard normal distribution, and sample potential features from the re-parameterized sample distribution;
an obtaining module 1203 configured to obtain, based on the sample action category, sample bias information of the sample potential feature, where the sample bias information is used to characterize an influence factor on the sample potential feature under the sample action category;
the training module 1204 is configured to train the initial motion model based on the sample potential feature, the sample bias information, and the sample motion sequence, to obtain a motion generation model, where the motion generation model is configured to generate, in an autoregressive manner, a motion sequence in which the virtual object performs a motion corresponding to the input motion class.
According to the device provided by the embodiment of the application, the sample distribution is determined on the sample data set, and the sample distribution gradually approaches the standard normal distribution through continuous optimization of the initial motion model in the process of training the motion generation model based on the sample bias information and the sample potential characteristics sampled from the sample distribution in a re-parameterization mode, so that the sample distribution is the standard normal distribution when training is stopped, and potential motion characteristics can be directly sampled on the standard normal distribution when the motion generation model is used for predicting the motion sequence, and the accuracy and naturalness of the motion sequence generated based on the motion generation model are finally improved.
In one possible implementation, the initial motion model includes a plurality of encoding units for predicting distribution parameters of the sample distribution and a plurality of decoding units for decoding to obtain a next pose based on a self-attention mechanism, wherein the number of encoding units and the number of decoding units are the same.
In one possible implementation, the determining module 1201 is configured to:
performing full connection processing on the sample action category to obtain sample category characteristics;
performing full connection processing on the sample action sequence to obtain a first action feature;
fusing the first action feature with a sample position feature of the sample action sequence to obtain a second action feature, wherein the sample position feature is used for representing the sequential sequence of all the poses in the sample action sequence;
inputting the second action feature and the sample category feature into the plurality of coding units, and outputting a target hidden vector associated with the sample distribution by a last coding unit;
based on the target hidden vector, a distribution parameter indicating the sample distribution is acquired.
In one possible embodiment, the distribution parameters include a mean and standard deviation of the sample distribution;
The sampling module 1202 is configured to:
sampling to obtain a re-parameterized adjustment factor from the standard normal distribution;
correcting the standard deviation based on the adjustment factor to obtain a corrected standard deviation;
and determining the re-parameterized sample distribution based on the mean and the corrected standard deviation.
In one possible implementation, based on the apparatus composition of fig. 12, the training module 1204 includes:
a decoding unit for generating a predicted motion sequence of the virtual object by the plurality of decoding units in the initial motion model based on the sample potential features and the sample bias information;
the acquisition unit is used for acquiring a loss function value of the iteration based on the sample action sequence and the prediction action sequence;
the iteration unit is used for iteratively adjusting the model parameters of the initial action model;
and an output unit configured to output the plurality of decoding units in the initial motion model as the motion generation model when the loss function value meets a stop condition.
In a possible implementation, the decoding unit is configured to:
fusing the sample potential features and the sample bias information to obtain sample action bias features;
carrying out scheduled sampling, based on a target probability, on the latest pose in the predicted action sequence and on each sample pose in the sample action sequence whose position number does not exceed that of the latest pose, so as to obtain a preamble pose sequence, wherein the target probability is used for representing the possibility that each pose in the preamble pose sequence is sampled from the predicted action sequence;
fusing the preamble pose sequence and sample position features of the preamble pose sequence to obtain sample preamble pose features, wherein the sample position features are used for representing the sequence of each pose in the preamble pose sequence;
and decoding the sample motion bias characteristic and the sample preamble pose characteristic through the plurality of decoding units to obtain the next pose of the latest pose in the predicted motion sequence.
In one possible implementation, the target probability is positively correlated with the position number of the latest pose in the predicted motion sequence.
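A sketch of scheduled sampling with such a target probability follows. The linear schedule over the position number is an assumption; the source only states that the target probability is positively correlated with the position number of the latest pose.

```python
import random

def scheduled_sample(sample_poses, predicted_poses, latest_pos, seq_len):
    """Sketch of scheduled sampling: each pose of the preamble sequence is drawn
    from the predicted sequence with a probability that grows with the position
    number of the latest pose (linear growth assumed here)."""
    target_prob = min(1.0, latest_pos / seq_len)
    preamble = []
    for gt, pred in zip(sample_poses, predicted_poses):
        # with probability target_prob take the decoder's predicted pose, otherwise the sample pose
        preamble.append(pred if random.random() < target_prob else gt)
    return preamble
```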
In one possible embodiment, the loss function value includes: at least one of a pose reconstruction penalty term for characterizing a difference between the sample motion sequence and the predicted motion sequence, a joint position reconstruction penalty term for characterizing a difference between a joint position sample sequence and a joint position prediction sequence of the virtual object, or a relative entropy penalty term for characterizing a difference between the sample distribution and a standard normal distribution; the joint position sample sequence is obtained based on the sample action sequence, and the joint position prediction sequence is obtained based on the prediction action sequence.
All the above optional solutions can be combined to form an optional embodiment of the present disclosure, which is not described in detail herein.
It should be noted that: in the training device for an action generating model provided in the above embodiment, only the division of the above functional modules is used for illustration when the action generating model is trained, and in practical application, the above functional allocation can be completed by different functional modules according to needs, that is, the internal structure of the computer device is divided into different functional modules to complete all or part of the functions described above. In addition, the training device of the action generating model provided in the above embodiment and the training method embodiment of the action generating model belong to the same concept, and the specific implementation process of the training device of the action generating model is detailed in the training method embodiment of the action generating model, which is not described herein.
Fig. 13 is a schematic structural diagram of a computer device according to an embodiment of the present application, and as shown in fig. 13, taking the computer device as a terminal 1300 for example, after the server trains an action generating model, the action generating model is issued to the terminal 1300, so that the terminal 1300 can locally generate an action sequence of a virtual object based on a specified action class. Optionally, the device types of the terminal 1300 include: a smart phone, a tablet computer, an MP3 player (Moving Picture Experts Group Audio Layer III, motion picture expert compression standard audio plane 3), an MP4 (Moving Picture Experts Group Audio Layer IV, motion picture expert compression standard audio plane 4) player, a notebook computer, or a desktop computer. Terminal 1300 may also be referred to by other names of user devices, portable terminals, laptop terminals, desktop terminals, etc.
In general, the terminal 1300 includes: a processor 1301, and a memory 1302.
Optionally, processor 1301 includes one or more processing cores, such as a 4-core processor, an 8-core processor, or the like. Optionally, the processor 1301 is implemented in hardware in at least one of a DSP (Digital Signal Processing ), FPGA (Field-Programmable Gate Array, field programmable gate array), PLA (Programmable Logic Array ). In some embodiments, processor 1301 includes a main processor, which is a processor for processing data in an awake state, also referred to as a CPU (Central Processing Unit ), and a coprocessor; a coprocessor is a low-power processor for processing data in a standby state. In some embodiments, processor 1301 is integrated with a GPU (Graphics Processing Unit, image processor) for taking care of rendering and rendering of content that the display screen is required to display. In some embodiments, the processor 1301 also includes an AI (Artificial Intelligence ) processor for processing computing operations related to machine learning.
In some embodiments, memory 1302 includes one or more computer-readable storage media, optionally non-transitory. Memory 1302 also optionally includes high-speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In some embodiments, a non-transitory computer readable storage medium in memory 1302 is used to store at least one program code for execution by processor 1301 to implement the action generation method or training method of the action generation model of a virtual object provided by various embodiments of the present application.
In some embodiments, the terminal 1300 may further optionally include: a peripheral interface 1303 and at least one peripheral. The processor 1301, the memory 1302, and the peripheral interface 1303 can be connected by a bus or signal lines. The respective peripheral devices can be connected to the peripheral device interface 1303 through a bus, a signal line, or a circuit board. Specifically, the peripheral device includes: at least one of radio frequency circuitry 1304, a display screen 1305, a camera assembly 1306, audio circuitry 1307, and a power supply 1308.
A peripheral interface 1303 may be used to connect I/O (Input/Output) related at least one peripheral to the processor 1301 and the memory 1302. In some embodiments, processor 1301, memory 1302, and peripheral interface 1303 are integrated on the same chip or circuit board; in some other embodiments, any one or both of processor 1301, memory 1302, and peripheral interface 1303 are implemented on separate chips or circuit boards, which is not limited in this embodiment.
The Radio Frequency circuit 1304 is used to receive and transmit RF (Radio Frequency) signals, also known as electromagnetic signals. The radio frequency circuit 1304 communicates with a communication network and other communication devices via electromagnetic signals. The radio frequency circuit 1304 converts an electrical signal to an electromagnetic signal for transmission, or converts a received electromagnetic signal to an electrical signal. Optionally, the radio frequency circuit 1304 includes: antenna systems, RF transceivers, one or more amplifiers, tuners, oscillators, digital signal processors, codec chipsets, subscriber identity module cards, and so forth. Optionally, the radio frequency circuit 1304 communicates with other terminals via at least one wireless communication protocol. The wireless communication protocol includes, but is not limited to: metropolitan area networks, various generations of mobile communication networks (2G, 3G, 4G, and 5G), wireless local area networks, and/or WiFi (Wireless Fidelity ) networks. In some embodiments, the radio frequency circuit 1304 further includes NFC (Near Field Communication ) related circuits, which the present application is not limited to.
The display screen 1305 is used to display a UI (User Interface). Optionally, the UI includes graphics, text, icons, video, and any combination thereof. When the display 1305 is a touch display, the display 1305 also has the ability to capture touch signals at or above the surface of the display 1305. The touch signal can be input to the processor 1301 as a control signal for processing. Optionally, the display screen 1305 is also used to provide virtual buttons and/or a virtual keyboard, also referred to as soft buttons and/or a soft keyboard. In some embodiments, display 1305 is one, providing the front panel of terminal 1300; in other embodiments, the display 1305 is at least two, and is disposed on different surfaces of the terminal 1300 or in a folded design; in still other embodiments, the display 1305 is a flexible display disposed on a curved surface or a folded surface of the terminal 1300. Even alternatively, the display screen 1305 is arranged in a non-rectangular irregular pattern, i.e. a shaped screen. Optionally, the display screen 1305 is made of LCD (Liquid Crystal Display ), OLED (Organic Light-Emitting Diode) or other materials.
The camera assembly 1306 is used to capture images or video. Optionally, camera assembly 1306 includes a front camera and a rear camera. Typically, the front camera is disposed on the front panel of the terminal and the rear camera is disposed on the rear surface of the terminal. In some embodiments, the at least two rear cameras are any one of a main camera, a depth camera, a wide-angle camera and a tele camera, so as to realize that the main camera and the depth camera are fused to realize a background blurring function, and the main camera and the wide-angle camera are fused to realize a panoramic shooting and Virtual Reality (VR) shooting function or other fusion shooting functions. In some embodiments, camera assembly 1306 also includes a flash. Alternatively, the flash is a single-color temperature flash, or a dual-color temperature flash. The dual-color temperature flash lamp refers to a combination of a warm light flash lamp and a cold light flash lamp, and is used for light compensation under different color temperatures.
In some embodiments, the audio circuit 1307 comprises a microphone and a speaker. The microphone is used for collecting sound waves of users and environments, converting the sound waves into electric signals, and inputting the electric signals to the processor 1301 for processing, or inputting the electric signals to the radio frequency circuit 1304 for voice communication. For the purpose of stereo acquisition or noise reduction, a plurality of microphones are respectively disposed at different portions of the terminal 1300. Optionally, the microphone is an array microphone or an omni-directional pickup microphone. The speaker is then used to convert electrical signals from the processor 1301 or the radio frequency circuit 1304 into sound waves. Alternatively, the speaker is a conventional thin film speaker, or a piezoelectric ceramic speaker. When the speaker is a piezoelectric ceramic speaker, not only an electric signal but also an acoustic wave audible to humans can be converted into an acoustic wave inaudible to humans for ranging and other purposes. In some embodiments, the audio circuit 1307 also comprises a headphone jack.
A power supply 1308 is used to power the various components in terminal 1300. Alternatively, the power source 1308 is an alternating current, a direct current, a disposable battery, or a rechargeable battery. When the power source 1308 comprises a rechargeable battery, the rechargeable battery supports wired or wireless charging. The rechargeable battery is also used to support fast charge technology.
In some embodiments, terminal 1300 also includes one or more sensors 1310. The one or more sensors 1310 include, but are not limited to: acceleration sensor 1311, gyroscope sensor 1312, pressure sensor 1313, optical sensor 1314, and proximity sensor 1315.
In some embodiments, acceleration sensor 1311 detects the magnitude of acceleration on three coordinate axes of the coordinate system established with terminal 1300. For example, the acceleration sensor 1311 is configured to detect components of gravitational acceleration on three coordinate axes. Optionally, the processor 1301 controls the display screen 1305 to display the user interface in a landscape view or a portrait view according to the gravitational acceleration signal acquired by the acceleration sensor 1311. The acceleration sensor 1311 is also used for acquisition of motion data of a game or user.
In some embodiments, the gyro sensor 1312 detects the body direction and the rotation angle of the terminal 1300, and the gyro sensor 1312 and the acceleration sensor 1311 cooperate to collect 3D actions of the user on the terminal 1300. Processor 1301 realizes the following functions according to the data collected by gyro sensor 1312: motion sensing (e.g., changing UI according to a tilting operation by a user), image stabilization at shooting, game control, and inertial navigation.
Optionally, a pressure sensor 1313 is disposed on a side frame of the terminal 1300 and/or below the display screen 1305. When the pressure sensor 1313 is disposed at a side frame of the terminal 1300, a grip signal of the terminal 1300 by a user can be detected, and the processor 1301 performs left-right hand recognition or quick operation according to the grip signal collected by the pressure sensor 1313. When the pressure sensor 1313 is disposed at the lower layer of the display screen 1305, the processor 1301 realizes control of the operability control on the UI interface according to the pressure operation of the user on the display screen 1305. The operability controls include at least one of a button control, a scroll bar control, an icon control, and a menu control.
The optical sensor 1314 is used to collect ambient light intensity. In one embodiment, processor 1301 controls the display brightness of display screen 1305 based on the intensity of ambient light collected by optical sensor 1314. Specifically, when the intensity of the ambient light is high, the display brightness of the display screen 1305 is turned up; when the ambient light intensity is low, the display brightness of the display screen 1305 is turned down. In another embodiment, processor 1301 also dynamically adjusts the shooting parameters of camera assembly 1306 based on the intensity of ambient light collected by optical sensor 1314.
A proximity sensor 1315, also referred to as a distance sensor, is typically provided on the front panel of the terminal 1300. The proximity sensor 1315 is used to collect the distance between the user and the front of the terminal 1300. In one embodiment, when the proximity sensor 1315 detects that the distance between the user and the front of the terminal 1300 gradually decreases, the processor 1301 controls the display screen 1305 to switch from the on-screen state to the off-screen state; when the proximity sensor 1315 detects that the distance between the user and the front of the terminal 1300 gradually increases, the processor 1301 controls the display screen 1305 to switch from the off-screen state to the on-screen state.
Those skilled in the art will appreciate that the structure shown in fig. 13 is not limiting of terminal 1300 and can include more or fewer components than shown, or may combine certain components, or may employ a different arrangement of components.
Fig. 14 is a schematic structural diagram of a computer device according to an embodiment of the present application, where the computer device 1400 may generate a relatively large difference according to different configurations or performances, and the computer device 1400 includes one or more processors (Central Processing Units, CPU) 1401 and one or more memories 1402, where at least one computer program is stored in the memories 1402, and the at least one computer program is loaded and executed by the one or more processors 1401 to implement the motion generating method or the training method of the motion generating model of the virtual object according to the embodiments described above. Optionally, the computer device 1400 further includes a wired or wireless network interface, a keyboard, an input/output interface, and other components for implementing the functions of the device, which are not described herein.
In an exemplary embodiment, a computer readable storage medium is also provided, for example a memory comprising at least one computer program executable by a processor in a terminal to perform the method of generating actions of a virtual object or the training method of an action generation model in the respective embodiments described above. For example, the computer readable storage medium includes ROM (Read-Only Memory), RAM (Random-Access Memory), CD-ROM (Compact Disc Read-Only Memory), magnetic tape, floppy disk, optical data storage device, and the like.
In an exemplary embodiment, a computer program product or computer program is also provided, comprising one or more program codes, the one or more program codes being stored in a computer readable storage medium. The one or more processors of the computer device are capable of reading the one or more program codes from the computer-readable storage medium, and executing the one or more program codes to enable the computer device to execute to perform the action generating method of the virtual object or the training method of the action generating model in the above embodiments.
Those of ordinary skill in the art will appreciate that all or a portion of the steps implementing the above-described embodiments can be implemented by hardware, or can be implemented by a program instructing the relevant hardware, optionally stored in a computer readable storage medium, optionally a read-only memory, a magnetic disk or an optical disk, etc.
The foregoing description of the preferred embodiments of the present application is not intended to limit the application, but rather, the application is to be construed as limited to the appended claims.

Claims (20)

1. A method of generating actions for a virtual object, the method comprising:
randomly sampling to obtain potential action characteristics, wherein the potential action characteristics are obtained by mapping the pose distribution of the virtual object to random sampling values under standard normal distribution;
based on an action category to be generated, obtaining bias information of the potential action feature, wherein the bias information is used for representing an influence factor of the potential action feature under the action category;
and generating an action sequence of the virtual object in an autoregressive mode based on the potential action characteristics and the bias information, wherein the action sequence is used for representing the pose of the virtual object for executing the action corresponding to the action category.
2. The method of claim 1, wherein the generating the sequence of actions of the virtual object in an autoregressive manner based on the potential action features and the bias information comprises:
and inputting the potential action characteristics, the bias information, the pose and each precursor pose of the pose into an action generation model for any pose in the action sequence, and predicting the next pose in the action sequence through the action generation model.
3. The method of claim 2, wherein the inputting the potential motion features, the bias information, and the poses, and the respective leading poses of the poses, into a motion generation model, predicting a next pose in the sequence of motions from the motion generation model comprises:
fusing the potential action features and the bias information to obtain action bias features;
fusing the pose, each precursor pose of the pose and pose position features to obtain precursor pose features, wherein the pose position features are used for representing the sequence of the pose and each precursor pose of the pose;
And decoding to obtain the next pose based on the motion bias characteristic and the preamble pose characteristic.
4. A method according to claim 3, wherein the fusing the pose and the respective leading pose and pose position features of the pose to obtain leading pose features comprises:
performing full connection processing on the pose and each precursor pose of the pose to obtain initial pose characteristics;
and fusing the initial pose features and the pose position features to obtain the preamble pose features.
5. A method according to claim 3, wherein the action generation model comprises a plurality of decoding units for decoding input features based on a self-attention mechanism;
the decoding to obtain the next pose based on the motion bias feature and the preamble pose feature includes:
inputting the motion bias feature and the preamble pose feature into the plurality of decoding units, and outputting the hidden vector of the next pose by the last decoding unit;
and performing full connection processing on the hidden vector of the next pose to obtain the next pose.
6. The method of claim 1, wherein the obtaining bias information for the potential motion feature based on the motion class to be generated comprises:
and performing full connection processing on the single thermal codes of the action categories to obtain the offset information.
7. The method according to claim 1, wherein the method further comprises:
and synthesizing an action segment of the virtual object for executing the action corresponding to the action category based on the action sequence and the object model of the virtual object.
8. A method of training an action generation model, the method comprising:
based on sample action categories and sample action sequences, determining sample distribution obeyed by the sample action sequences, wherein the sample distribution has a mapping relation with pose distribution of a virtual object;
carrying out re-parameterization on the sample distribution through standard normal distribution, and sampling from the re-parameterized sample distribution to obtain sample potential characteristics;
based on the sample action category, obtaining sample bias information of the sample potential features, wherein the sample bias information is used for representing influence factors on the sample potential features under the sample action category;
Training an initial motion model based on the sample potential characteristics, the sample bias information and the sample motion sequence to obtain a motion generation model, wherein the motion generation model is used for generating a motion sequence of a virtual object to execute a motion corresponding to an input motion category in an autoregressive mode.
9. The method of claim 8, wherein the initial motion model comprises a plurality of encoding units for predicting distribution parameters of the sample distribution and a plurality of decoding units for decoding based on a self-attention mechanism to obtain a next pose, wherein the number of encoding units and decoding units is the same.
10. The method of claim 9, wherein the determining a sample distribution to which the sample action sequence is subjected based on the sample action category and the sample action sequence comprises:
performing full connection processing on the sample action category to obtain sample category characteristics;
performing full connection processing on the sample action sequence to obtain a first action feature;
fusing the first action feature with a sample position feature of the sample action sequence to obtain a second action feature, wherein the sample position feature is used for representing the sequence of each pose in the sample action sequence;
Inputting the second action feature and the sample category feature into the plurality of coding units, and outputting a target hidden vector associated with the sample distribution by a last coding unit;
and acquiring a distribution parameter for indicating the sample distribution based on the target hidden vector.
11. The method of claim 10, wherein the distribution parameters include a mean and a standard deviation of the sample distribution;
the re-parameterizing the sample distribution by a standard normal distribution comprises:
sampling from the standard normal distribution to obtain a re-parameterized adjustment factor;
correcting the standard deviation based on the adjustment factor to obtain a corrected standard deviation;
and determining the sample distribution after the heavy parameterization based on the mean value and the corrected standard deviation.
12. The method of claim 10, wherein training an initial motion model based on the sample potential features, the sample bias information, and the sample motion sequence to obtain a motion generation model comprises:
generating, by the plurality of decoding units in the initial motion model, a predicted motion sequence for the virtual object based on the sample latent features and the sample bias information;
Acquiring a loss function value of the iteration based on the sample action sequence and the predicted action sequence;
and iteratively adjusting model parameters of the initial motion model, and outputting the plurality of decoding units in the initial motion model as the motion generation model when the loss function value meets a stop condition.
13. The method of claim 12, wherein the generating, by the plurality of decoding units in the initial motion model, a predicted motion sequence for the virtual object based on the sample potential features and the sample bias information comprises:
fusing the sample potential features and the sample bias information to obtain sample action bias features;
carrying out scheduled sampling, based on a target probability, on the latest pose in the predicted action sequence and on each sample pose in the sample action sequence whose position number does not exceed that of the latest pose, so as to obtain a preamble pose sequence, wherein the target probability is used for representing the possibility that each pose in the preamble pose sequence is sampled from the predicted action sequence;
fusing the preamble pose sequence and sample position features of the preamble pose sequence to obtain sample preamble pose features, wherein the sample position features are used for representing the sequence of each pose in the preamble pose sequence;
And decoding the sample motion bias characteristic and the sample preamble pose characteristic through the plurality of decoding units to obtain the next pose of the latest pose in the predicted motion sequence.
14. The method of claim 13, wherein the target probability is positively correlated with a position order of the most recent pose in the predicted sequence of actions.
15. The method of claim 12, wherein the loss function value comprises: at least one of a pose reconstruction penalty term for characterizing a difference between the sample motion sequence and the predicted motion sequence, a joint position reconstruction penalty term for characterizing a difference between a joint position sample sequence and a joint position prediction sequence of the virtual object, or a relative entropy penalty term for characterizing a difference between the sample distribution and a standard normal distribution; the joint position sample sequence is obtained based on the sample action sequence, and the joint position prediction sequence is obtained based on the prediction action sequence.
16. An action generating apparatus of a virtual object, the apparatus comprising:
The random sampling module is used for obtaining potential action characteristics through random sampling, wherein the potential action characteristics are random sampling values obtained by mapping the pose distribution of the virtual object to standard normal distribution;
the acquisition module is used for acquiring the bias information of the potential action characteristics based on the action category to be generated, wherein the bias information is used for representing the influence factors on the potential action characteristics under the action category;
and the generation module is used for generating an action sequence of the virtual object in an autoregressive mode based on the potential action characteristics and the bias information, wherein the action sequence is used for representing the pose of the virtual object for executing the action corresponding to the action category.
17. A training device for an action generation model, the device comprising:
the determining module is used for determining sample distribution obeyed by the sample action sequence based on the sample action category and the sample action sequence, wherein the sample distribution has a mapping relation with the pose distribution of the virtual object;
the sampling module is used for carrying out re-parameterization on the sample distribution through standard normal distribution, and sampling potential characteristics of the sample from the re-parameterized sample distribution;
The acquisition module is used for acquiring sample bias information of the sample potential features based on the sample action category, wherein the sample bias information is used for representing influence factors on the sample potential features under the sample action category;
the training module is used for training the initial motion model based on the sample potential characteristics, the sample bias information and the sample motion sequence to obtain a motion generation model, and the motion generation model is used for generating a motion sequence of the virtual object to execute the motion corresponding to the input motion category in an autoregressive mode.
18. A computer device, comprising one or more processors and one or more memories, wherein at least one computer program is stored in the one or more memories, and the at least one computer program is loaded and executed by the one or more processors to implement the action generation method of a virtual object according to any one of claims 1 to 7, or the training method of an action generation model according to any one of claims 8 to 15.
19. A storage medium, wherein at least one computer program is stored in the storage medium, and the at least one computer program is loaded and executed by a processor to implement the action generation method of a virtual object according to any one of claims 1 to 7, or the training method of an action generation model according to any one of claims 8 to 15.
20. A computer program product, comprising at least one computer program, wherein the at least one computer program is loaded and executed by a processor to implement the action generation method of a virtual object according to any one of claims 1 to 7, or the training method of an action generation model according to any one of claims 8 to 15.
CN202210282160.4A 2022-03-21 2022-03-21 Motion generation method of virtual object and training method of motion generation model Pending CN116863042A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210282160.4A CN116863042A (en) 2022-03-21 2022-03-21 Motion generation method of virtual object and training method of motion generation model

Publications (1)

Publication Number Publication Date
CN116863042A true CN116863042A (en) 2023-10-10

Family

ID=88222106

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210282160.4A Pending CN116863042A (en) 2022-03-21 2022-03-21 Motion generation method of virtual object and training method of motion generation model

Country Status (1)

Country Link
CN (1) CN116863042A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination