CN116402925A - Animation generation method based on video data driving - Google Patents
- Publication number: CN116402925A
- Application number: CN202310049210.9A
- Authority: CN (China)
- Prior art keywords: garment; sequence; morphology; clothing; loss
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T13/00—Animation
- G06T13/20—3D [Three Dimensional] animation
- G06T13/40—3D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/46—Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
- G06V20/47—Detecting features for summarising video content
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/20—Movements or behaviour, e.g. gesture recognition
Abstract
The invention provides an animation generation method based on video data driving, comprising the following steps: extracting, from a driven video, a motion gesture sequence containing three-dimensional gesture information of a predetermined target in different video frames, wherein the driven video contains footage of the target performing a corresponding action in the real world, the action being one that needs to be performed by a character different from the target; generating, according to the motion gesture sequence extracted from the driven video, a clothing morphology sequence of the three-dimensional clothing morphology corresponding to each video frame during the action; and generating, according to the motion gesture sequence and the clothing morphology sequence extracted from the driven video, an animation in which the character moves while wearing the clothing and the clothing deforms correspondingly with the character's movement.
Description
Technical Field
The invention relates to the field of video technology, in particular to the design of characters (e.g. in films or games) and their garment animation, and more particularly to an animation generation method based on video data driving.
Background
Characters and their garment deformation and animation techniques are an important branch of computer graphics and play a vital role in fields such as film and games. The core problem of garment deformation and animation techniques is how to generate a continuous, stable sequence of garment deformations from a given motion sequence. In garment deformation and animation, cloth is a flexible material whose deformation trend is closely related to the state of the human body: on the one hand, the character's body shape characteristics (e.g. height and build) directly influence the deformation effect of the garment; on the other hand, the character's movement trend determines the trend of the garment's deformation. Building a model that can generate continuous, realistic garment deformation effects is therefore a challenging task.
Currently, character and garment animation is mainly created with physical simulation methods and data-driven methods. Physical simulation methods simulate the character's motion and the garment's deformation in a simulated scene governed by physical laws; they are computationally expensive, require complex simulation parameters to be set, and require an artist or animator to iteratively adjust the generated effect, so their efficiency is low. Data-driven methods aim to learn garment deformation laws from existing garment deformation instances, but it is often difficult to ensure the continuity of the deformation effect, and they are generally applicable only to garment types that are topologically consistent with the human body. Constructing an efficient and stable garment animation simulation model has therefore gradually become a research hotspot in the field of garment animation.
The main driver of character and garment animation is the character's motion sequence, and garment deformation simulation can involve motion forms of rich variety and diverse postures. At present, however, there are two main ways to obtain a character motion sequence that conforms to real-world motion laws:

Mode one: use an existing open-source character motion sequence; but the actions covered by open-source motion sequences are limited and monotonous, making it difficult to meet personalized requirements.

Mode two: manually design and adjust the animation to obtain a customized character motion sequence; but this approach is cumbersome and time-consuming, its production efficiency is low, and the resulting actions are often stiff.
Disclosure of Invention
It is therefore an object of the present invention to overcome the above-mentioned drawbacks of the prior art and to provide an animation generation method based on video data driving.

This object of the invention is achieved by the following technical solutions:
According to a first aspect of the present invention, there is provided an animation generation method based on video data driving, comprising: extracting, from a driven video, a motion gesture sequence containing three-dimensional gesture information of a predetermined target in different video frames, wherein the driven video contains footage of the target performing a corresponding action in the real world, the action being one that needs to be performed by a character different from the target; generating, according to the motion gesture sequence extracted from the driven video, a clothing morphology sequence of the three-dimensional clothing morphology corresponding to each video frame during the action; and generating, according to the motion gesture sequence and the clothing morphology sequence extracted from the driven video, an animation in which the character moves while wearing the clothing and the clothing deforms correspondingly with the character's movement.
Optionally, a motion gesture sequence of the predetermined target is extracted from the driven video using a pre-trained motion gesture recognition model, wherein the motion gesture recognition model is an image-to-sequence neural network model comprising a feature extractor for extracting image features from the video frames and a sequence generator for generating the motion gesture sequence of the predetermined target from the image features.
Optionally, a pre-trained clothing morphology prediction model is used to generate, according to the motion gesture sequence extracted from the driven video, a clothing morphology sequence of the three-dimensional clothing morphology corresponding to each video frame during the action, wherein the pre-trained clothing morphology prediction model is a sequence-to-sequence neural network model.
Optionally, the pre-trained clothing morphology prediction model is obtained by training a sequence-to-sequence neural network model on a training set for a predetermined clothing morphology. The training set comprises a plurality of data samples, each comprising a motion gesture sample sequence and a corresponding clothing morphology label sequence; the motion gesture sample sequence is a sequence of motion gestures of a predetermined target extracted from a pre-collected sample video, and the clothing morphology label sequence is the sequence of clothing morphologies obtained by driving, under physical simulation, a character model wearing the clothing model to move according to the motion gesture sample sequence extracted from the sample video.
Optionally, during training, the clothing morphology prediction model is trained with the data samples to output a clothing morphology sequence from the motion gesture sample sequence, and the parameters of the clothing morphology prediction model are updated according to a loss calculated from the output clothing morphology sequence and the corresponding clothing morphology label sequence.
Optionally, during training, the clothing morphology prediction model is trained with the data samples to output a clothing morphology sequence from the motion gesture sample sequence, and the parameters of the clothing morphology prediction model are updated according to the loss of clothing morphology and the loss of the clothing penetrating the character, wherein the loss of clothing morphology represents the difference between the output clothing morphology sequence and the corresponding clothing morphology label sequence, and the loss of the clothing penetrating the character represents the degree to which clothing vertices penetrate into the body of the character model.
Optionally, when training the clothing morphology prediction model, the loss is calculated as follows:

L = L_cloth + λ·L_coll

wherein L_cloth denotes the loss of clothing morphology, L_coll denotes the loss of the clothing penetrating the character, and λ is a weighting coefficient set for L_coll.
Optionally, during training, the clothing morphology prediction model is trained with the data samples to output a clothing morphology sequence from the motion gesture sample sequence, and the parameters of the clothing morphology prediction model are updated according to the loss of clothing morphology, the loss of the clothing penetrating the character, and the loss of clothing self-penetration, wherein the loss of clothing morphology represents the difference between the output clothing morphology sequence and the corresponding clothing morphology label sequence, the loss of the clothing penetrating the character represents the degree to which clothing vertices penetrate into the body of the character model, and the loss of clothing self-penetration represents the error caused by clothing vertices penetrating through the clothing itself.
Optionally, when training the clothing morphology prediction model, the loss is calculated as follows:

L = L_cloth + λ·L_coll + μ·L_self

wherein L_cloth denotes the loss of clothing morphology, L_coll denotes the loss of the clothing penetrating the character, λ is a weighting coefficient set for L_coll, L_self denotes the loss of clothing self-penetration, and μ is a weighting coefficient set for L_self.
Optionally, the loss of the clothing penetrating the character, for the clothing morphology sequence output from one motion gesture sample sequence, is calculated as follows:

L_coll = (1/T) · Σ_{t=1}^{T} Σ_{(i,j)∈M_t} max( ε − n_j · (p_i^t − b_j^t), 0 )

wherein T represents the length of the sequence, ε represents the threshold set for the clothing penetrating the character, M_t represents the set of correspondences between clothing vertices and character-model vertices in the clothing morphology predicted for the t-th frame, p_i^t represents the position of clothing vertex i in the clothing morphology predicted for the t-th frame in the clothing morphology sequence, b_j^t represents the position of the body vertex j nearest to p_i^t, and n_j represents the normal vector of body vertex j.
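For illustration, the loss of the clothing penetrating the character described above can be sketched in plain Python. The function signature, data layout (nested lists of 3D positions), and the default threshold value are assumptions for the sketch, not the patent's implementation; the per-frame averaging follows the reconstruction of the formula used here:

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def collision_loss(garment_pos, body_pos, body_normals, matches, eps=0.002):
    """Penalize matched garment vertices that lie less than `eps` outside the body.

    garment_pos[t][i] and body_pos[t][j] are 3D positions at frame t;
    body_normals[j] is the outward normal of body vertex j;
    matches[t] is the set M_t of (garment vertex i, nearest body vertex j) pairs.
    """
    T = len(garment_pos)
    total = 0.0
    for t in range(T):
        for i, j in matches[t]:
            # signed distance of garment vertex i along the body normal at vertex j
            diff = [g - b for g, b in zip(garment_pos[t][i], body_pos[t][j])]
            d = dot(body_normals[j], diff)
            total += max(eps - d, 0.0)  # penalty only when inside the threshold
    return total / T

# One frame, one garment vertex 1 cm outside the body along +z: no penalty.
g = [[[0.0, 0.0, 0.01]]]
b = [[[0.0, 0.0, 0.0]]]
n = [[0.0, 0.0, 1.0]]
m = [[(0, 0)]]
print(collision_loss(g, b, n, m))  # 0.0
```

Moving the same garment vertex 1 cm inside the body (z = −0.01) yields a positive penalty, since the signed distance then falls below the threshold ε.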
Optionally, the loss of clothing self-penetration, for the clothing morphology sequence output from one motion gesture sample sequence, is calculated as follows:

L_self = (1/T) · Σ_{t=1}^{T} Σ_{x∈V_t} max( δ − n_{y,t} · (p_x^t − p_y^t), 0 )

wherein T represents the length of the sequence, V_t represents the set of clothing vertices of the clothing morphology predicted for the t-th frame in the clothing morphology sequence, p_x^t represents the position of clothing vertex x in that morphology, p_y^t represents the position of the clothing vertex y nearest to clothing vertex x in that morphology, n_{y,t} represents the normal vector of clothing vertex y in that morphology, and δ represents the threshold set for clothing self-penetration.
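The self-penetration term differs from the character-penetration term mainly in that the nearest vertex y is searched among the garment's own vertices. A brute-force Python sketch (the data layout, threshold default, and per-frame averaging are assumptions of this sketch; a real implementation would use a spatial index rather than an O(n²) search):

```python
import math

def self_penetration_loss(garment_pos, garment_normals, delta=0.001):
    """For each garment vertex x, find the nearest other garment vertex y and
    penalize x when it sits less than `delta` in front of y's normal."""
    T = len(garment_pos)
    total = 0.0
    for t in range(T):
        verts = garment_pos[t]
        for x, px in enumerate(verts):
            # nearest other garment vertex y at frame t (brute force for clarity)
            y = min((k for k in range(len(verts)) if k != x),
                    key=lambda k: math.dist(px, verts[k]))
            d = sum(nc * (pc - qc) for nc, pc, qc in
                    zip(garment_normals[t][y], px, verts[y]))
            total += max(delta - d, 0.0)
    return total / T

# One frame, two vertices 1 cm apart along z, both normals pointing +z:
verts = [[[0.0, 0.0, 0.01], [0.0, 0.0, 0.0]]]
normals = [[[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]]
print(round(self_penetration_loss(verts, normals), 6))  # 0.011
```

Here the lower vertex lies behind the upper vertex's normal, so only it contributes a penalty of δ − (−0.01) = 0.011.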
According to a second aspect of the present invention, there is provided an electronic device comprising: one or more processors; and a memory for storing executable instructions; the one or more processors are configured to implement, via execution of the executable instructions, the steps of the method of any of the preceding embodiments.
Compared with the prior art, the invention has the advantages that:
The invention extracts, from a captured driven video, a motion gesture sequence containing three-dimensional gesture information of a predetermined target in different video frames, wherein the driven video contains footage of the target performing a corresponding action in the real world and the action is one that needs to be performed by a character different from the target; the action performed by a virtual character can thus be performed by, and extracted from, a real object (the target) in real space (the real world). Next, a clothing morphology sequence of the three-dimensional clothing morphology corresponding to each video frame during the action is generated according to the motion gesture sequence extracted from the driven video. Finally, according to the motion gesture sequence and the clothing morphology sequence extracted from the driven video, an animation is generated in which the character moves while wearing the clothing and the clothing deforms correspondingly with the character's movement. Because the target's actions in the real world necessarily conform to physical laws and a wide variety of gestures are easy to perform, the motion gesture sequence extracted from such video can efficiently meet personalized requirements; moreover, extracting the motion gesture sequence from the driven video yields higher production efficiency than the prior art, more lifelike motion, higher-quality generated animation, and a better visual experience for the user or audience.
Drawings
Embodiments of the invention are further described below with reference to the accompanying drawings, in which:
FIG. 1 is a flow chart of a video data driven based animation generation method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a motion gesture recognition model according to an embodiment of the present invention extracting, from a human motion video, a motion gesture sequence corresponding to a simulated human body model;

FIG. 3 is a schematic diagram of a motion gesture recognition model and a discriminator jointly undergoing adversarial training according to an embodiment of the present invention;

FIG. 4 is a schematic diagram of a clothing morphology prediction model according to an embodiment of the present invention;

FIG. 5 is a schematic diagram of a Transformer model serving as the clothing morphology prediction model according to an embodiment of the present invention;

FIG. 6 is a schematic illustration of a video-driven animation of a character moving while wearing clothing according to an embodiment of the present invention;

FIG. 7 is a schematic diagram of a video-driven three-dimensional garment deformation effect according to an embodiment of the present invention.
Detailed Description
For the purpose of making the technical solutions and advantages of the present invention more apparent, the present invention will be further described in detail by way of specific embodiments with reference to the accompanying drawings. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
As mentioned in the background section, in the field of animation generation it is difficult to meet personalized requirements with existing open-source character motion sequences, while manually designing customized character motion sequences is inefficient and tends to produce stiff actions. To this end, the present invention extracts, from a captured driven video, a motion gesture sequence containing three-dimensional gesture information of a predetermined target in different video frames, wherein the driven video contains footage of the target performing a corresponding action in the real world and the action is one that needs to be performed by a character different from the target; the action performed by a virtual character can thus be performed by, and extracted from, a real object (the target) in real space (the real world). Next, a clothing morphology sequence of the three-dimensional clothing morphology corresponding to each video frame during the action is generated according to the motion gesture sequence extracted from the driven video. Finally, according to the motion gesture sequence and the clothing morphology sequence extracted from the driven video, an animation is generated in which the character moves while wearing the clothing and the clothing deforms correspondingly with the character's movement. Because the target's actions in the real world necessarily conform to physical laws and a wide variety of gestures are easy to perform, the motion gesture sequence extracted from such video can efficiently meet personalized requirements; moreover, extracting the motion gesture sequence from the driven video yields higher production efficiency than the prior art, more lifelike motion, higher-quality generated animation, and a better visual experience for the user or audience.
Before describing embodiments of the present invention in detail, some of the terms used therein are explained as follows:
A character refers to an object designed for a film, video, or game, for example a character in an animation (e.g. a person or an animal) or a character in a game (e.g. a human hero or an animal).
A target refers to the entity whose three-dimensional gesture information is to be recognized. Targets may be customized by the implementer as desired: for example, in one scenario, the three-dimensional gesture information of a person (the target) may be recognized to control the movement of a game hero (the character); in another scenario, the three-dimensional gesture information of an animal (the target, e.g. a dog, or a movable prop in the form of an animal) may be recognized to control the motion of a beast (the character) designed for a film.
According to an embodiment of the present invention, referring to FIG. 1, an animation generation method based on video data driving includes steps S1, S2 and S3. For a better understanding of the present invention, each step is described in detail below with reference to specific examples.
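The three steps can be sketched as a minimal pipeline. All function names, data shapes, and the 72-dimensional pose vector below are illustrative stand-ins for the pre-trained models described in this patent, not the actual implementation:

```python
from typing import List, Tuple

def recognize_motion_poses(video_frames: List[bytes]) -> List[Tuple[float, ...]]:
    """Step S1: one motion-gesture parameter vector per video frame (placeholder)."""
    return [(0.0,) * 72 for _ in video_frames]   # 72 posture parameters (assumed)

def predict_garment_morphology(pose_seq):
    """Step S2: one clothing morphology per motion gesture (placeholder)."""
    return [{"frame": t, "n_vertices": 4000} for t, _ in enumerate(pose_seq)]

def compose_animation(pose_seq, garment_seq):
    """Step S3: pair each gesture with its clothing morphology into animation frames."""
    return list(zip(pose_seq, garment_seq))

frames = [b""] * 5                               # a 5-frame driven video (dummy)
poses = recognize_motion_poses(frames)           # S1
garments = predict_garment_morphology(poses)     # S2
animation = compose_animation(poses, garments)   # S3
print(len(animation))  # one animation frame per video frame
```

The point of the sketch is the data flow: the driven video feeds S1, S1's output feeds S2, and S3 consumes both outputs.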
Step S1: extracting, from a driven video, a motion gesture sequence containing three-dimensional gesture information of a predetermined target in different video frames, wherein the driven video contains footage of the target performing a corresponding action in the real world and the action is one that needs to be performed by a character different from the target.
According to one embodiment of the invention, a motion gesture sequence of the predetermined target is extracted from the driven video using a pre-trained motion gesture recognition model, wherein the motion gesture recognition model is an image-to-sequence neural network model comprising a feature extractor for extracting image features from the video frames and a sequence generator for generating the motion gesture sequence of the predetermined target from the image features. The target's motion gesture sequence extracted from video conforms to real-world motion laws, provides a better data basis for the subsequent clothing morphology prediction, and can also reduce the animator's workload and improve efficiency, lowering the difficulty of creating films, games, and the like. Preferably, the character is a virtual object different from the target. For example, the target is a person and the character is a virtual object such as an avatar, a virtual human, or a virtual gorilla; as another example, the target is a cat and the character is a game hero designed to look like a tiger. Optionally, the motion gesture includes three-dimensional gesture information (the posture parameters); alternatively, the motion gesture includes both three-dimensional gesture information and body-shape information (the body-type parameters).
Schematically, assuming that the target is a person and the length of the motion gesture sequence of the motion video is T, the selected dataset H_sample for human motion gesture recognition can be expressed as:

H_sample = { (V_i, Θ_i) | i = 1, …, T }

wherein (V_i, Θ_i) represents the correspondence between the i-th frame of video data in the dataset and the simulated human body model (e.g. the SMPL human body model), V_i represents the i-th video frame, and Θ_i represents the parameters of the simulated human body model corresponding to the i-th frame, indicating the ground-truth motion gesture (also called the motion gesture label); the T motion gesture labels form a motion gesture label sequence. For any video frame, the following correspondence holds:

V ∈ R^{w×h×c},  M(β, θ)

wherein V represents the image information of a video frame, expressed as pixels X of width w, height h and channel number c; M(β, θ) represents the simulated human body model; β ∈ R^10 represents the body-type parameters of the model, whose 10 dimensions are denoted β_1 … β_10 and referred to as the β parameters; and θ ∈ R^72 represents the posture parameters of the model, whose 72 dimensions are denoted θ_1 … θ_72 and referred to as the θ parameters. It should be understood that, for a non-human target, the character model may be defined based on the movable joints of the corresponding character, and the body-type and posture parameters adjusted accordingly; this does not constitute any limitation.
According to one embodiment of the present invention, where the target whose motion gesture is to be recognized is a person, the process of extracting its motion gesture sequence is shown in FIG. 2. The pre-trained motion gesture recognition model may adopt the model proposed in the existing VIBE method (Video Inference for Human Body Pose and Shape Estimation); referring to FIG. 3, the feature extractor is a CNN layer and the sequence generator comprises 2 GRU layers and a fully connected layer. Of course, the structure of the motion gesture recognition model may be customized by the implementer, for example by improving on the existing structure: the feature extractor adopts a CNN layer and the sequence generator comprises 2 LSTM layers (or 1 or 3 GRU layers) and a fully connected layer. For a custom motion gesture recognition model, the dataset and loss function used in training human motion gesture recognition may refer to the AMASS dataset and the loss function adopted by the VIBE method. It should be appreciated that the dataset may also be self-made. A subset is extracted from the dataset during subsequent training to serve as the training set for human motion gesture recognition.
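The CNN-plus-GRU sequence generator can be sketched with a single hand-written GRU cell. This is a deliberately reduced stand-in: the architecture above uses two GRU layers after a CNN feature extractor, whereas here the CNN features are random placeholders, the weights are untrained, and the 82-dimensional output (72 posture + 10 body-type parameters) is an assumption of the sketch:

```python
import math, random

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def gru_step(x, h, Wz, Uz, Wr, Ur, Wh, Uh):
    """One GRU update: gates z and r control how much of h is kept or rewritten."""
    z = [sigmoid(a + b) for a, b in zip(matvec(Wz, x), matvec(Uz, h))]
    r = [sigmoid(a + b) for a, b in zip(matvec(Wr, x), matvec(Ur, h))]
    h_tilde = [math.tanh(a + b) for a, b in
               zip(matvec(Wh, x), matvec(Uh, [ri * hi for ri, hi in zip(r, h)]))]
    return [(1 - zi) * hi + zi * hti for zi, hi, hti in zip(z, h, h_tilde)]

random.seed(0)
F, H, P = 8, 4, 82            # feature, hidden, and pose-parameter dims (all assumed)
rand_mat = lambda m, n: [[random.uniform(-0.1, 0.1) for _ in range(n)] for _ in range(m)]
params = [rand_mat(H, F), rand_mat(H, H), rand_mat(H, F), rand_mat(H, H),
          rand_mat(H, F), rand_mat(H, H)]
W_out = rand_mat(P, H)        # the final fully connected layer

features = [[random.random() for _ in range(F)] for _ in range(5)]  # 5 frames of CNN features
h = [0.0] * H
pose_seq = []
for x in features:            # run the recurrent sequence generator over the frames
    h = gru_step(x, h, *params)
    pose_seq.append(matvec(W_out, h))
print(len(pose_seq), len(pose_seq[0]))  # 5 82
```

Because the hidden state h carries information across frames, each output gesture depends on the whole preceding sub-sequence, which is what makes the generated motion temporally coherent.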
In the VIBE method, the motion gesture recognition model (serving as a generator) and a discriminator are jointly trained adversarially on the training set for human motion gesture recognition. The discriminator judges whether an input motion gesture sequence is real or fake: each time, the motion gesture sequence output by the motion gesture recognition model and the corresponding motion gesture label sequence (ground truth) are fed to the discriminator, which is trained to recognize the model's output as fake and the label sequence as real. During adversarial training, the motion gesture recognition model also incurs an adversarial sub-loss, determined by the discriminator's real/fake prediction on the generated motion gesture sequence, and updates its parameters with this sub-loss. However, the adversarial loss is often obtained by averaging the losses of the real/fake predictions over a batch of samples, which is inconvenient for optimizing the model parameters. The invention may therefore also adjust the way the adversarial loss is calculated when the motion gesture recognition model and the discriminator undergo adversarial training together, yielding further embodiments. According to one embodiment of the invention, the adversarial sub-loss is calculated as follows:

L_adv = −(1/n) · Σ_{i=1}^{n} log D(θ̂_i)

where n is the number of samples in a batch during training, θ̂_i represents the motion gesture sequence extracted by the motion gesture recognition model from the video of sample i, D(θ̂_i) represents the prediction output by discriminator D for θ̂_i, and log(·) represents the logarithm.
Similarly, for the discriminator, at the time of countermeasure training, the loss corresponding to the discriminator prediction result is calculated based on a predetermined loss function of the discriminator, and the parameters of the discriminator are graded and back-propagated to update. The discriminator predicts the corresponding loss of the result in the following way:
L_D = −(1/n) Σ_{i=1}^{n} [ log D(θ_i) + log(1 − D(θ̂_i)) ]

where n is the number of samples of a batch during training, θ_i represents the motion gesture label sequence derived from sample i in the training set, θ̂_i represents the motion gesture sequence extracted by the motion gesture recognition model from the video of sample i, D(θ_i) represents the prediction result output by discriminator D for the motion gesture label sequence θ_i of sample i, D(θ̂_i) represents the prediction result output by discriminator D for the motion gesture sequence θ̂_i, and log(·) denotes the logarithm. A motion gesture label sequence (such as a motion gesture label sequence corresponding to a video from the AMASS dataset) is assigned the real/fake label 1 (corresponding to real), and a motion gesture sequence output by the model is assigned the label 0 (corresponding to fake); the discriminator can thus be trained as a binary classifier using a cross-entropy loss function. In the loss function, taking log(·) does not change the relative relation between the data, and the monotonicity of the log(·) function facilitates the optimization of the discriminator's parameters.
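As an illustrative sketch (not the patent's implementation), the two adversarial losses above can be written as NumPy functions, where `d_real` and `d_fake` are assumed to be the discriminator's probability outputs for label sequences and generated sequences respectively:

```python
import numpy as np

def generator_sub_loss(d_fake):
    """Sub-loss for the motion gesture recognition model (generator):
    -(1/n) * sum(log D(pose_pred_i)); small when the discriminator
    scores generated sequences close to 1 (real)."""
    return -np.mean(np.log(d_fake))

def discriminator_loss(d_real, d_fake):
    """Binary cross-entropy: label sequences carry real/fake label 1,
    generated sequences carry label 0."""
    return -np.mean(np.log(d_real) + np.log(1.0 - d_fake))

# A discriminator that separates real from fake well incurs lower loss.
confident = discriminator_loss(np.array([0.9, 0.95]), np.array([0.05, 0.1]))
unsure = discriminator_loss(np.array([0.6, 0.55]), np.array([0.45, 0.5]))
```

In a training loop the generator minimizes `generator_sub_loss` while the discriminator minimizes `discriminator_loss`, each on alternating update steps.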
Step S2: generating, according to the motion gesture sequence extracted from the driving video, a clothing morphology sequence of the clothing three-dimensional morphology corresponding to each video frame during the action.
The clothing morphology sequence can be extracted by a neural network model. According to one embodiment of the invention, a clothing morphology sequence of the clothing three-dimensional morphology corresponding to each video frame during the action is generated, according to the motion gesture sequence extracted from the driving video, by a pre-trained clothing morphology prediction model, which is a sequence-to-sequence neural network model. Referring to fig. 4, preferably, the clothing morphology prediction model includes an encoder for extracting from the motion gesture sequence the gesture temporal feature corresponding to each motion gesture, and a decoder for predicting the corresponding clothing morphology from the gesture temporal features; the clothing morphologies corresponding to the motion gestures constitute the clothing morphology sequence. For example, the sequence-to-sequence neural network model may employ an existing Transformer model, LSTM model, GRU model or RNN model, or a combination of Transformer, LSTM, GRU and RNN models may also be employed.
For illustration, see fig. 6, where a Transformer model is illustrated. The Transformer model includes an Encoder and a Decoder. The encoder comprises an embedding layer (Embedding), a positional encoding layer (Positional Encoding), a multi-head attention layer (Multi-Head Attention), a normalization layer (Layer Norm) and a feed-forward fully-connected layer (Feed Forward), with residual connections adopted between some layers to avoid the vanishing-gradient problem; the structure of the decoder is similar to that of the encoder; the connection between encoder and decoder is based on an attention mechanism (Attention layer). For a motion gesture sequence extracted from video, the encoder takes the motion gesture corresponding to each frame of the motion gesture sequence (comprising parameters of the gesture, or parameters of both body shape and gesture, as required by the implementation) as input, calculates the degree of correlation between the sequence data based on a self-attention mechanism, and extracts the gesture temporal features of the sequence:
Attention(Q, K, V) = softmax(QKᵀ/√d_k)·V

wherein W^Q represents the query projection matrix, W^K represents the key projection matrix, and W^V represents the value projection matrix; the matrices W^Q, W^K and W^V linearly transform the input vectors to generate the three matrices Q, K and V. The query matrix Q contains the data information that each vector attends with and is used to issue queries to the vectors at other positions in the sequence; the queried vectors provide the key matrix K, which holds the data information of the queried vectors; the degree of correlation between the sequence data is represented by the product of the Q matrix and the K matrix (scaled by √d_k, the dimension of the key vectors); the resulting weights are multiplied by the value matrix V to obtain the temporally processed result of each vector;
wherein, in general, several different groups of W^Q, W^K and W^V matrices are defined to further transform the input into different subspaces, so that the data information of the different subspaces is attended to; the information attended to by the different subspaces is concatenated and mapped to obtain the gesture temporal feature Z_out.
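A minimal NumPy sketch of the scaled dot-product self-attention described above (illustrative only; the patent does not fix the dimensions or implementation details):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """X: (T, d_model), one pose feature vector per video frame.
    Wq, Wk, Wv: (d_model, d_k) projection matrices."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = K.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))  # (T, T) frame-to-frame relevance
    return weights @ V                         # temporally mixed features

rng = np.random.default_rng(0)
T, d_model, d_k = 5, 8, 4
X = rng.normal(size=(T, d_model))
Z = self_attention(X, *(rng.normal(size=(d_model, d_k)) for _ in range(3)))
```

Multi-head attention repeats this with several independent projection triples and concatenates the results before a final linear map, yielding the gesture temporal feature Z_out.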
Subsequently, the gesture temporal feature Z_out is input to the decoder, and the decoder generates, from Z_out, the clothing morphology corresponding to the three-dimensional clothing morphology of the action in each video frame; the clothing morphologies of all the video frames constitute the clothing morphology sequence.
To train the garment morphology prediction model, a dataset of garment morphologies needs to be prepared. For more realism, a sequence of motion poses of objects extracted from the sample video may be employed as a means to drive the motion of a character model wearing custom-made garments to obtain a dataset of the corresponding garment morphology. According to one embodiment of the invention, any one of the data samples in the garment form data set is obtained in the following physical simulation manner: acquiring a garment model corresponding to a garment obtained by sewing a plurality of two-dimensional panels of a predetermined garment to a character model in a T-shape; acquiring a motion gesture sequence of a target extracted from a sample video, enabling a character model to move according to the motion gesture sequence under a physical simulation condition, so that the clothing model is deformed, obtaining a clothing shape sequence of a simulation state corresponding to the clothing motion of the character model, and taking the clothing shape sequence as a clothing shape tag sequence; and taking each motion gesture sequence and the corresponding clothing morphology label sequence as a data sample. By preparing a plurality of (tens, hundreds, thousands, tens of thousands or even more) sample videos for physical simulation, a data set of clothing forms composed of a plurality of data samples can be obtained. The construction of a data set of a garment form in an artificial character implementation scenario can be seen with reference to fig. 5. After the data set of the garment form is prepared, a subset is extracted from the data set of the garment form and used as a training set of the garment form, and the rest of the data sample can be reserved for verification or test. 
The preparation of the data set of garment forms centres on performing a physical simulation, the manner of which is not limited to this embodiment; other physical simulation software may be used to implement or modify the foregoing embodiment to create other embodiments, for example: performing the stitching of garment pieces with the character model in a non-T pose, etc.; the present invention is not limited in this respect.
The garments of the above characters may be custom-made or pre-designed for the characters to meet the needs of animated scenes in movies, games and the like. According to one embodiment of the invention, two-dimensional garment pieces are drawn for a character to design a clothing style. The required clothing style is designed, the corresponding two-dimensional garment pieces are drawn, and clothing instance data are constructed; this embodiment can be carried out in industrial design software (e.g., Marvelous Designer). The design flow of the clothing model of the virtual character is as follows: firstly, the required clothing style is determined and the two-dimensional garment pieces are drawn with points and lines; the corresponding garment pieces are then sewn together. In order to obtain a good visual cloth effect, after the two-dimensional garment pieces are designed, clothing cloth parameters such as cloth elasticity, rigidity, shearing force and edge bending rate are manually edited; the cloth appearance can thus be effectively improved without changing the structure of the garment pieces. After the two-dimensional garment pieces are drawn, the sewing relations among the pieces are set, the lines to be sewn are marked in pairs, and the required clothing model is generated through a virtual sewing technique. Based on a physical simulation method, simulation environment parameters such as gravity and damping force are set, clothing vertex information is calculated, collision processing and collision detection are carried out between the character model and the clothing model, and clothing morphology label sequences of the character under different motion gesture sequences are constructed.
Before the clothing morphology sequence is extracted, a corresponding neural network model needs to be trained. According to one embodiment of the present invention, the pre-trained clothing morphology prediction model is obtained by training a sequence-to-sequence neural network model by using a training set of a predetermined clothing morphology, the training set includes a plurality of data samples, each data sample includes a motion gesture sample sequence and a clothing morphology label sequence corresponding to the motion gesture sample sequence, the motion gesture sample sequence is a sequence of motion gestures of a predetermined target extracted from a pre-collected sample video, and the clothing morphology label sequence is a sequence of clothing morphology obtained by driving a character model wearing the clothing model to move by using the motion gesture sample sequence extracted from the sample video in a physical simulation mode.
When training the corresponding neural network model, the manner of calculating the loss needs to be defined. According to one embodiment of the invention, the clothing morphology prediction model is trained with the data samples during training, outputs a clothing morphology sequence from a motion gesture sample sequence, and updates the parameters of the clothing morphology prediction model according to the loss calculated from the output clothing morphology sequence and the corresponding clothing morphology label sequence. During training, the gradient of the clothing-deformation loss is computed and back-propagated to update the parameters of the clothing morphology prediction model (including the parameters of the encoder and the decoder).
In a patent application previously filed by the applicant, a trained clothing morphology prediction model may predict clothing forms in which the clothing penetrates the character model, for example: clothing vertices intrude into the skin of the character model, so penetration correction must be performed after prediction. In this regard, improvements may be considered. According to one embodiment of the invention, the clothing morphology prediction model is trained with the data samples during training, outputs a clothing morphology sequence from a motion gesture sample sequence, and updates its parameters according to the loss of clothing morphology and the loss of the clothing penetrating the character, wherein the loss of clothing morphology characterizes the difference between the output clothing morphology sequence and the corresponding clothing morphology label sequence, and the loss of the clothing penetrating the character represents the degree to which clothing vertices penetrate the body of the character model. The gradient of the total loss, determined from the loss of clothing deformation and the loss of the clothing penetrating the character model, is back-propagated to update the parameters of the clothing morphology prediction model. According to one embodiment of the invention, the loss when training the clothing morphology prediction model is calculated as follows:
L=L cloth +λL coll
wherein L_cloth denotes the loss of clothing morphology, L_coll denotes the loss of the clothing penetrating the character, and λ denotes the weighting coefficient set for L_coll. λ may be set according to the needs of the practitioner, for example to 0.2, 0.5, 0.8, 1, or the like. The technical scheme of this embodiment can at least realize the following beneficial technical effects: in the prior art, the loss of the clothing penetrating the character model is not used to guide the updating of the model parameters; after the improvement of this embodiment, the clothing morphology prediction model can better correct its parameters during training, reducing the probability or number of occurrences of the clothing penetrating the character model in the output clothing morphology sequence, further improving the prediction accuracy of the clothing morphology prediction model and making the final animation more lifelike.
Training may be performed in batches (Batch), and if one motion gesture sample sequence is input per Batch, according to one embodiment of the present invention, the loss of clothing morphology corresponding to the clothing morphology sequence output from the single motion gesture sample sequence may be calculated as follows:
L_cloth = (1/T) Σ_{i=1}^{T} ‖V̂_i − V_train,i‖²

wherein T represents the number of frames of the input video, which is also the length of the sequences (the motion gesture sample sequence and any of the clothing morphology sequence and clothing morphology label sequence corresponding to it), corresponding to the T frame images contained in the input video; V_train,i represents the clothing morphology label of the i-th frame of the input video in the clothing morphology label sequence of the training set, and V̂_i represents the clothing morphology of the i-th frame of the input video predicted and output by the clothing morphology prediction model. The clothing morphology can be characterized by clothing vertices, i.e., the clothing shape of the character is controlled by the clothing vertices; the clothing morphology label indicates the true value of the clothing morphology and can be labelled manually, or a part of the data can be labelled manually and a neural network model then trained to assist the labelling. The T clothing morphologies form the clothing morphology sequence, and the T clothing morphology labels in order form the clothing morphology label sequence. It should be understood that if a batch inputs multiple motion gesture sample sequences, the clothing morphology loss is calculated according to the above formula for each motion gesture sample sequence input in the current batch and its correspondingly output clothing morphology sequence, and the mean value is taken as the clothing morphology loss of the current batch. In addition, the above manner of calculating the loss when training the clothing morphology prediction model is not the only embodiment, and the practitioner can adjust it as needed, for example: setting λ to other values, or additionally setting a weighting coefficient for L_cloth, etc.
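Under the assumption that the clothing morphology loss is a mean squared vertex error over the sequence (the exact norm is an implementation choice not fixed by the text), a per-sequence computation might look like:

```python
import numpy as np

def garment_morphology_loss(pred, label):
    """pred, label: (T, N, 3) garment vertex positions for T frames of
    N vertices. Returns the mean over frames of the summed squared
    per-vertex distances between prediction and label."""
    assert pred.shape == label.shape
    per_frame = np.sum((pred - label) ** 2, axis=(1, 2))  # squared error per frame
    return per_frame.mean()
```

For a batch of several sequences, this value would be computed per sequence and averaged, as described above.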
If one motion gesture sample sequence is input per batch, according to one embodiment of the present invention, the loss of the clothing penetrating the character corresponding to the clothing morphology sequence output from a single motion gesture sample sequence may be calculated as follows:
L_coll = Σ_{t=1}^{T} Σ_{(i,j)∈M_t} max(ε − (p̂_{i,t} − b_{j,t})·n_j, 0)

wherein T represents the number of frames of the input video and is also the length of the sequences (the motion gesture sample sequence and any of the clothing morphology sequence and clothing morphology label sequence corresponding to it); ε represents the set threshold for the clothing penetrating the character; M_t represents the set of correspondences between the clothing vertices and the vertices of the character model in the clothing form predicted for the t-th frame; p̂_{i,t} represents the position of clothing vertex i in the clothing form predicted for the t-th frame in the clothing morphology sequence; b_{j,t} represents the position of the body vertex j nearest to p̂_{i,t}; and n_j represents the normal vector of body vertex j, which points toward the outside of the body of the character model. ε is the minimum penetration distance set by the practitioner for increased robustness, and its value can be set as desired (e.g., 3 mm, 3.9 mm, 4 mm, 4.1 mm, 5 mm, etc.) based on experience or on-site training conditions; the invention is not limited in this respect. It should be understood that if a batch inputs multiple motion gesture sample sequences, the loss of the clothing penetrating the character is calculated according to the above formula for the clothing morphology sequences correspondingly output for all motion gesture sample sequences input in the current batch, and the mean value is taken as the loss of the clothing penetrating the character for the current batch.
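A per-frame sketch of the clothing-penetrates-character term, assuming the garment-vertex-to-nearest-body-vertex pairs M_t have been precomputed (all names are illustrative):

```python
import numpy as np

def penetration_loss(garment_pos, body_pos, body_normals, pairs, eps=0.004):
    """garment_pos: (Ng, 3) garment vertex positions; body_pos and
    body_normals: (Nb, 3) body vertex positions and outward normals.
    pairs: iterable of (garment vertex i, nearest body vertex j).
    Penalises garment vertices lying less than eps outside the body
    surface along the body normal; zero once the vertex clears eps."""
    loss = 0.0
    for i, j in pairs:
        # signed distance of garment vertex from the body surface plane
        d = float(np.dot(garment_pos[i] - body_pos[j], body_normals[j]))
        loss += max(eps - d, 0.0)
    return loss
```

The gradient of this hinge-style term pushes offending garment vertices outward along the body normals, which is what the embodiment relies on to reduce embedded-clothing artifacts.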
The technical scheme of the embodiment at least can realize the following beneficial technical effects: the loss of the clothing penetrating the role can ease the penetration problem between the cloth clothing and the body of the role, the loss is favorable for generating effective clothing deformation, the gradient of the loss pushes the clothing vertexes to the outside of the body of the role, the situation that the clothing is embedded into the body of the role in the generated animation is effectively reduced, the model prediction precision is improved, and the animation effect is more lifelike.
In a patent application previously filed by the applicant, the trained clothing morphology prediction model may also predict cases where the clothing penetrates the clothing itself, for example: some of the predicted clothing vertices penetrate the surface of the clothing model, affecting the realism of the animation and the user experience. Further improvements are contemplated for this. According to one embodiment of the invention, the clothing morphology prediction model is trained with the data samples during training, outputs a clothing morphology sequence from a motion gesture sample sequence, and updates its parameters according to the loss of clothing morphology, the loss of the clothing penetrating the character, and the loss of clothing self-penetration, wherein the loss of clothing morphology represents the difference between the output clothing morphology sequence and the corresponding clothing morphology label sequence, the loss of the clothing penetrating the character represents the degree to which clothing vertices penetrate the body of the character model, and the loss of clothing self-penetration represents the error caused by clothing vertices penetrating the clothing itself. For example, the gradient of the total loss, determined from the loss of clothing deformation, the loss of the clothing penetrating the character model, and the loss of clothing self-penetration, is back-propagated to update the parameters of the clothing morphology prediction model. According to one embodiment of the invention, the loss when training the clothing morphology prediction model is calculated as follows:
L=L cloth +λL coll +μL self
wherein L_cloth denotes the loss of clothing morphology, L_coll denotes the loss of the clothing penetrating the character, λ denotes the weighting coefficient set for L_coll, L_self denotes the loss of clothing self-penetration, and μ denotes the weighting coefficient set for L_self. For L_cloth and L_coll, reference may be made to the foregoing embodiments, and the details are not repeated here. μ may be set according to the needs of the practitioner, for example to 0.2, 0.4, 0.5, 0.6, 0.7, 0.8, 1, or the like. The technical scheme of this embodiment can at least realize the following beneficial technical effects: in the prior art, the loss of the clothing penetrating the clothing itself (i.e., the loss of clothing self-penetration) is not used to guide the updating of the model parameters; after the improvement of this embodiment, the clothing morphology prediction model can better correct its parameters during training, reducing the probability or number of occurrences of clothing self-penetration in the output clothing morphology sequence, further improving the prediction accuracy of the clothing morphology prediction model, making the final animation more natural and lifelike, and improving the user experience.
If one motion gesture sample sequence is input per batch, according to one embodiment of the present invention, the loss of clothing self-penetration corresponding to the clothing morphology sequence output from a single motion gesture sample sequence may be calculated as follows:
L_self = Σ_{t=1}^{T} Σ_{x∈V_t} max(ε_self − (p_{x,t} − p_{y,t})·n_{y,t}, 0)

wherein T corresponds to the number of frames of the input video and is also the length of the corresponding sequences (the motion gesture sample sequence and any of the clothing morphology sequence and clothing morphology label sequence corresponding to it); V_t represents the set of clothing vertices corresponding to the clothing form predicted for the t-th frame in the clothing morphology sequence; p_{x,t} represents the position of clothing vertex x in the clothing form predicted for the t-th frame; p_{y,t} represents the position of the clothing vertex y nearest to clothing vertex x in that clothing form; n_{y,t} represents the normal vector of clothing vertex y in that clothing form; and ε_self represents the set threshold for clothing self-penetration. It will be appreciated that at rest (e.g., in a T-pose) the normal vector of clothing vertex y points from that vertex toward the outside of the clothing template (i.e., the arrow of the normal vector faces away from the inside of the clothing template). The value of ε_self may be set as desired (e.g., 3 mm, 3.9 mm, 4 mm, 4.1 mm, 5 mm, etc.) according to the experience of the practitioner or the on-site training situation; the present invention is not limited in this respect.
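The self-penetration term can be sketched per frame with a brute-force nearest-neighbour search (illustrative only; a practical implementation would use a spatial index and typically exclude mesh-adjacent vertices):

```python
import numpy as np

def self_penetration_loss(verts, normals, eps=0.004):
    """verts, normals: (N, 3) garment vertex positions and outward
    normals for one frame. For each vertex x, take its nearest other
    vertex y and penalise x lying less than eps outside the local
    surface along y's normal."""
    loss = 0.0
    for x in range(len(verts)):
        d2 = np.sum((verts - verts[x]) ** 2, axis=1)
        d2[x] = np.inf                  # exclude the vertex itself
        y = int(np.argmin(d2))          # nearest other garment vertex
        d = float(np.dot(verts[x] - verts[y], normals[y]))
        loss += max(eps - d, 0.0)
    return loss
```

Summing this over the T frames of a sequence gives the sequence-level self-penetration loss described above.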
It should be appreciated that the above calculation of the clothing self-penetration loss is only an alternative embodiment, and the practitioner may calculate the clothing self-penetration loss in other customized manners; for example: counting the number of clothing vertices penetrating any clothing face (a face formed by at least 3 clothing vertices), or taking a weighted value over the pairs formed by each clothing vertex and its nearest clothing vertex, as the clothing self-penetration loss; for another example, it may be calculated as follows:
L_self = Σ_{t=1}^{T} Σ_{x∈V_t} max(ε_self − (p_{x,t} − p_{y,t})·n_{y,t}, 0)

wherein T corresponds to the number of frames of the input video and is also the length of the corresponding sequences (the motion gesture sample sequence and any of the clothing morphology sequence and clothing morphology label sequence corresponding to it); V_t represents the set of clothing vertices corresponding to the clothing form predicted for the t-th frame in the clothing morphology sequence; p_{x,t} represents the position of a clothing vertex x at which self-penetration is detected in the clothing form predicted for the t-th frame; p_{y,t} represents the position of the clothing vertex y of the penetrated clothing face nearest to clothing vertex x in that clothing form; n_{y,t} represents the normal vector of clothing vertex y in that clothing form; and ε_self represents the set threshold for clothing self-penetration.
Step S3: generating, according to the motion gesture sequence extracted from the driving video and the clothing morphology sequence, an animation in which the character wears the clothing while moving and the clothing deforms correspondingly with the character's movement.
According to one embodiment of the present invention, the motion gesture sequence extracted from the driving video, which indicates the motion gesture of the character model in each frame, may be used in character animation software (e.g., Blender) to drive the character model to move; correspondingly, after the clothing morphology sequence is obtained, which indicates the clothing form of the clothing model in each frame, the character animation software can drive the clothing model to deform, forming an animation in which the character wears the clothing while moving and the clothing deforms correspondingly with the character's movement. After fusing the animation with a corresponding background, a movie, video or game picture can be obtained.
If a pre-trained motion gesture recognition model and a pre-trained garment morphology prediction model are used, and both are self-trained by the practitioner, an exemplary implementation procedure is given below, comprising the steps of:
step A1: acquiring a data set which takes a video as input data and a marked motion gesture sequence (or a motion gesture label sequence, namely a true value of a motion gesture marked according to the motion of a target in the video) as a label for guiding training, extracting a subset from the data set as a training set, training a motion gesture recognition model which can be used for extracting the motion gesture sequence of a preset target from the video, and obtaining a pre-trained motion gesture recognition model;
step A2: acquiring a data set which takes a motion gesture sequence as input data and a marked clothing morphology sequence (or clothing morphology label sequence, namely, a true value of clothing morphology marked according to the motion gesture sequence induced clothing deformation) as a label for guiding training, extracting a subset from the data set as a training set, training the clothing morphology sequence which can be used for generating clothing three-dimensional morphology corresponding to each video frame during action according to the motion gesture sequence, and obtaining a pre-trained clothing morphology prediction model;
Step B1: extracting, from the driving video by using the pre-trained motion gesture recognition model, a motion gesture sequence containing three-dimensional gesture information of a predetermined target in different video frames, wherein the driving video contains pictures of the target performing corresponding actions in the real world, and the actions are the actions required to be executed by a character different from the target;
step B2: generating a clothing morphology sequence of the clothing three-dimensional morphology corresponding to each video frame in action according to the motion gesture sequence extracted from the driven video by utilizing a pre-trained clothing morphology prediction model;
step B3: generating an animation that a character wears a garment to move and the garment generates corresponding deformation along with the movement of the character according to the movement gesture sequence and the garment form sequence extracted from the driven video; wherein, the motion gesture sequence is used for controlling the motion of the character, and the clothing morphological sequence is used for controlling the deformation generated by the clothing along with the motion of the character.
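The three steps B1-B3 compose a simple pipeline. The sketch below uses illustrative callables standing in for the pre-trained motion gesture recognition model, the clothing morphology prediction model, and the character animation software; none of these interfaces are specified by the patent:

```python
def generate_animation(driving_video, extract_poses, predict_garment, render):
    """B1-B3 as function composition (all interface names hypothetical)."""
    pose_seq = extract_poses(driving_video)   # B1: motion gesture sequence
    garment_seq = predict_garment(pose_seq)   # B2: clothing morphology sequence
    return render(pose_seq, garment_seq)      # B3: animation frames

# usage with trivial stand-ins for the three components
frames = generate_animation(
    "walk.mp4",
    extract_poses=lambda v: ["pose%d" % t for t in range(3)],
    predict_garment=lambda poses: ["cloth_" + p for p in poses],
    render=lambda p, g: list(zip(p, g)),
)
```

The point of the decomposition is that the pose sequence drives the character's skeleton while the garment sequence drives the cloth mesh, so the two models can be trained and swapped independently.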
For ease of understanding, taking a human target as an example, the character model is exemplified herein as an SMPL human body model, with reference to fig. 6. The driving video is input into the motion gesture recognition model to obtain a motion gesture sequence (each motion gesture contains β parameters and θ parameters); each motion gesture in the motion gesture sequence is input into the clothing morphology prediction model in temporal order to obtain the corresponding clothing morphology, and all the clothing morphologies form the clothing morphology sequence.
To verify the effect of generating clothing morphology, the inventors performed related experiments. Table 1 shows the error data of the human body posture parameters (i.e., the error of the human motion gesture sequence) generated from video under different motion sequences. The generated posture parameters are compared with the human posture parameters annotated in the dataset, and the mean absolute error between the two is calculated; the mean absolute error reflects the actual magnitude of the prediction error, and a smaller value indicates higher model accuracy. It can be seen that the human body sequences generated from video conform to the laws of motion in terms of the continuity of human actions, and the human body shape and posture fit the real shape and posture in the video.
TABLE 1 average absolute error of human body pose parameters
Fig. 7 is a schematic diagram of the three-dimensional clothing deformation effect driven by video. Observing the visual fidelity of the generated clothing animation, it can be seen that as the body-shape parameters change, the fit of the clothing changes correspondingly; and as the posture parameters of the human body change, the clothing deforms with the corresponding motion trend. Continuous and stable clothing deformation effects can be generated under the drive of continuous human motion extracted from video.
In addition, Table 2 shows the error data of the clothing vertices driven by different motion gesture sequences. The clothing deformation generated under the same motion sequence is compared with that generated by physical simulation: the per-frame root mean square error between the generated clothing deformation vertices and those from physical simulation is accumulated and finally divided by the sequence length to obtain the overall root mean square error of the clothing deformation sequence; a smaller value indicates a smaller error between the true and predicted values and a better prediction effect of the model.
TABLE 2 root mean square error of vertices under different motion gesture sequences
Therefore, since video data contain intuitive and real human motion, if three-dimensional human motion gesture data can be obtained from human motion videos, intuitive and real motion gesture sequences can be obtained more quickly. In addition, video data consist of coherent, real human motions, and the motion gesture sequences extracted from video satisfy temporal constraints, providing a data basis for the temporal deformation simulation of the clothing. Generating continuous and real motion gesture sequences from video data, and further driving the generation of stable and lifelike three-dimensional clothing deformation, is thus expected to improve the intelligence of clothing deformation creation, reduce the difficulty of the animator's work, and improve work efficiency.
It should be noted that, although the steps are described above in a specific order, it is not meant to necessarily be performed in the specific order, and in fact, some of the steps may be performed concurrently or even in a changed order, as long as the required functions are achieved.
The present invention may be a system, method, and/or computer program product. The computer program product may include a computer readable storage medium having computer readable program instructions embodied thereon for causing a processor to implement aspects of the present invention.
The computer readable storage medium may be a tangible device that retains and stores instructions for use by an instruction execution device. The computer readable storage medium may be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium include the following: a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disc (DVD), a memory stick, a floppy disk, a mechanically encoded device such as a punch card or a raised structure in a groove having instructions stored thereon, and any suitable combination of the foregoing.
The foregoing description of embodiments of the invention has been presented for purposes of illustration and description, and is not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen to best explain the principles of the embodiments, the practical application, or technical improvements over technologies found in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.
Claims (13)
1. A video data driven animation generation method, comprising:
extracting, from a driving video, a motion gesture sequence containing three-dimensional gesture information of a predetermined target in different video frames, wherein the driving video contains footage of the target performing corresponding actions in the real world, the actions being the actions to be performed by a character different from the target;
generating, according to the motion gesture sequence extracted from the driving video, a garment morphology sequence of the three-dimensional garment morphology corresponding to each video frame during the action;
and generating, according to the motion gesture sequence extracted from the driving video and the garment morphology sequence, an animation in which the character wears the garment and moves, with the garment deforming correspondingly as the character moves.
2. The method of claim 1, wherein the motion gesture sequence of the predetermined target is extracted from the driving video using a pre-trained motion gesture recognition model, wherein the motion gesture recognition model is an image-to-sequence neural network model comprising a feature extractor for extracting image features from the video frames and a sequence generator for generating the motion gesture sequence of the predetermined target from the image features.
3. The method of claim 1, wherein a pre-trained garment morphology prediction model is used to generate, according to the motion gesture sequence extracted from the driving video, the garment morphology sequence of the three-dimensional garment morphology corresponding to each video frame during the action, wherein the pre-trained garment morphology prediction model is a sequence-to-sequence neural network model.
4. The method according to claim 3, wherein the pre-trained garment morphology prediction model is obtained by training a sequence-to-sequence neural network model with a training set for a predetermined garment morphology, the training set comprising a plurality of data samples, each data sample comprising a motion gesture sample sequence and a garment morphology label sequence corresponding to the motion gesture sample sequence; the motion gesture sample sequence is a motion gesture sequence of the predetermined target extracted from a pre-collected sample video, and the garment morphology label sequence is the garment morphology sequence obtained, under physical simulation conditions, by driving a character model wearing the garment model to move using the motion gesture sample sequence extracted from the sample video.
5. The method according to claim 3, wherein, during training, the garment morphology prediction model is caused to output a garment morphology sequence from the motion gesture sample sequence of a data sample, and the parameters of the garment morphology prediction model are updated according to a loss calculated from the output garment morphology sequence and the corresponding garment morphology label sequence.
6. The method according to claim 3, wherein, during training, the garment morphology prediction model is caused to output a garment morphology sequence from the motion gesture sample sequence of a data sample, and the parameters of the garment morphology prediction model are updated according to a loss of garment morphology and a loss of the garment penetrating the character, wherein the loss of garment morphology characterizes the difference between the output garment morphology sequence and the corresponding garment morphology label sequence, and the loss of the garment penetrating the character characterizes the degree to which garment vertices on the garment penetrate the body of the character model.
7. The method of claim 6, wherein the loss is calculated when training the garment morphology prediction model by:
L = L_cloth + λ·L_coll

wherein L_cloth represents the loss of garment morphology, L_coll represents the loss of the garment penetrating the character, and λ is the weighting coefficient set for L_coll.
8. The method according to claim 3, wherein, during training, the garment morphology prediction model is caused to output a garment morphology sequence from the motion gesture sample sequence of a data sample, and the parameters of the garment morphology prediction model are updated according to a loss of garment morphology, a loss of the garment penetrating the character, and a loss of garment self-penetration, wherein the loss of garment morphology characterizes the difference between the output garment morphology sequence and the corresponding garment morphology label sequence, the loss of the garment penetrating the character characterizes the degree to which garment vertices on the garment penetrate the body of the character model, and the loss of garment self-penetration characterizes the error caused by garment vertices penetrating the garment itself.
9. The method of claim 8, wherein the loss is calculated when training the garment morphology prediction model by:
L = L_cloth + λ·L_coll + μ·L_self

wherein L_cloth represents the loss of garment morphology, L_coll represents the loss of the garment penetrating the character, λ is the weighting coefficient set for L_coll, L_self represents the loss of garment self-penetration, and μ is the weighting coefficient set for L_self.
10. The method according to any one of claims 6-9, wherein the loss of the garment penetrating the character, corresponding to the garment morphology sequence output from a motion gesture sample sequence, is calculated as follows:

L_coll = (1/T) Σ_{t=1}^{T} Σ_{(i,j)∈M_t} max(ε − n_j,t · (p_i,t − b_j,t), 0)

wherein T represents the length of the sequence, ε represents the set threshold for the garment penetrating the character, M_t represents the set of correspondences between garment vertices and character-model vertices in the garment morphology predicted for the t-th frame, p_i,t represents the position of garment vertex i in the garment morphology predicted for the t-th frame of the garment morphology sequence, b_j,t represents the position of the body vertex j nearest to p_i,t, and n_j,t represents the normal vector of body vertex j.
11. The method according to claim 8 or 9, wherein the loss of garment self-penetration, corresponding to the garment morphology sequence output from a motion gesture sample sequence, is calculated as follows:

L_self = (1/T) Σ_{t=1}^{T} Σ_{x∈V_t} max(δ − n_y,t · (p_x,t − p_y,t), 0)

wherein T represents the length of the sequence, V_t represents the set of garment vertices in the garment morphology predicted for the t-th frame of the garment morphology sequence, p_x,t represents the position of garment vertex x in the garment morphology predicted for the t-th frame, p_y,t represents the position of the garment vertex y nearest to garment vertex x in that frame, n_y,t represents the normal vector of garment vertex y in that frame, and δ represents the set threshold for garment self-penetration.
12. A computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the method according to any one of claims 1 to 11.
13. An electronic device, comprising:
one or more processors; and
a memory, wherein the memory is for storing executable instructions;
the one or more processors are configured to implement the steps of the method of any one of claims 1 to 11 via execution of the executable instructions.
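As a numerical illustration of the loss terms used in claims 6-11, a minimal NumPy sketch follows. The array layouts, the precomputed nearest-vertex index array, and the threshold default are all assumptions made for illustration; the claims fix only the symbolic form L = L_cloth + λ·L_coll + μ·L_self.

```python
import numpy as np

def collision_loss(garment_seq, body_seq, body_normals, nearest_idx, eps=4e-3):
    """Loss of the garment penetrating the character (cf. claim 10).

    garment_seq:  (T, V, 3) predicted garment vertex positions
    body_seq:     (T, B, 3) character body vertex positions
    body_normals: (T, B, 3) body vertex normals
    nearest_idx:  (T, V)    index of the body vertex nearest each garment vertex
    eps: penetration threshold (the default value is an illustrative assumption)
    """
    total = 0.0
    for t in range(garment_seq.shape[0]):
        b = body_seq[t][nearest_idx[t]]        # nearest body vertex positions
        n = body_normals[t][nearest_idx[t]]    # their normal vectors
        # signed distance of each garment vertex along the body normal;
        # penalize vertices lying less than eps outside the body surface
        d = np.einsum('vc,vc->v', garment_seq[t] - b, n)
        total += np.sum(np.maximum(eps - d, 0.0))
    return total / garment_seq.shape[0]

def total_loss(l_cloth, l_coll, l_self=0.0, lam=1.0, mu=1.0):
    # combined objectives of claims 7 and 9: L = L_cloth + λ·L_coll (+ μ·L_self)
    return l_cloth + lam * l_coll + mu * l_self
```

The self-penetration term of claim 11 has the same shape, computed against the garment's own nearest vertices (excluding each vertex itself) instead of the character body.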
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310049210.9A CN116402925A (en) | 2023-02-01 | 2023-02-01 | Animation generation method based on video data driving |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116402925A true CN116402925A (en) | 2023-07-07 |
Family
ID=87016701
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310049210.9A Pending CN116402925A (en) | 2023-02-01 | 2023-02-01 | Animation generation method based on video data driving |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116402925A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117078817A (en) * | 2023-08-23 | 2023-11-17 | 北京百度网讯科技有限公司 | Video generation method, device, equipment and medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||