CN111681321B - Method for synthesizing three-dimensional human motion with a recurrent neural network based on hierarchical learning - Google Patents

Method for synthesizing three-dimensional human motion with a recurrent neural network based on hierarchical learning

Info

Publication number
CN111681321B
CN111681321B (application CN202010506080.3A)
Authority
CN
China
Prior art keywords
motion
network
data
level
input
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010506080.3A
Other languages
Chinese (zh)
Other versions
CN111681321A (en)
Inventor
周东生
郭重阳
杨鑫
张强
魏小鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dalian University
Original Assignee
Dalian University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dalian University filed Critical Dalian University
Priority to CN202010506080.3A priority Critical patent/CN111681321B/en
Publication of CN111681321A publication Critical patent/CN111681321A/en
Application granted granted Critical
Publication of CN111681321B publication Critical patent/CN111681321B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T19/00Manipulating 3D models or images for computer graphics
    • G06T19/20Editing of 3D images, e.g. changing shapes or colours, aligning objects or positioning parts
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Hardware Design (AREA)
  • Architecture (AREA)
  • Computer Graphics (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The invention provides a method for synthesizing three-dimensional human motion using a recurrent neural network based on hierarchical learning, comprising a model-training step and a model-testing step. The training step comprises: constructing a low-level motion-information extraction network from GRU units; building a high-level motion synthesis network from a GRU network; and taking motion data of different styles as input to the high-level network, combining the skeleton features of the motion data with the motion features extracted by the low-level network, and training the high-level network to learn the skeleton spatio-temporal relations of motions of different styles. In the testing step, the first 30 frames of each motion sequence in the test set are fed into the trained high-level motion synthesis network, and the result is finally verified. The invention can synthesize motion that follows an input trajectory, generate transition motion between two different types of motion, and synthesize motion sequences with different emotional styles.

Description

Method for synthesizing three-dimensional human motion with a recurrent neural network based on hierarchical learning
Technical Field
The invention relates to the technical fields of computer graphics and human motion modeling, and in particular to a method for synthesizing three-dimensional human motion with a recurrent neural network based on hierarchical learning.
Background
Three-dimensional human motion capture equipment is high-technology apparatus that accurately measures the motion of a human body in three-dimensional space. Using techniques such as multi-view video and computer graphics, the three-dimensional coordinates of human joints can be obtained accurately, and a human motion dataset can then be reconstructed from the human topological structure. Such datasets have wide application value in fields related to human motion analysis, such as computer animation, virtual reality, security monitoring, and human-robot interaction. On the other hand, the tension between the individuality of human motion and the need for general-purpose data limits the reusability of motion data. Acquiring new data is costly in money and time, and the same motion type tends to accumulate large amounts of redundancy. How to reuse existing motion data effectively has therefore become one of the key problems to be solved in both academia and engineering.
Human motion synthesis technology synthesizes a variety of new motions that meet given requirements from an existing dataset. It not only addresses the data-reuse problem described above but also lowers the hardware barrier in fields related to human motion analysis, and thus has important research value and significance. Meanwhile, in representative artificial-intelligence fields such as natural human-robot interaction and autonomous driving, motion synthesis serves as one of the basic supporting technologies for machines to understand, analyze, and predict human behavior; it has shown increasingly broad research and application value and has attracted more and more researchers.
Because of its research difficulty, strong practicality, and great commercial value, human motion synthesis has become a research hotspot in both academia and industry. Representative motion synthesis methods fall into three classes: optimization-based methods, deep-learning-based methods, and reinforcement-learning-based methods. Optimization-based methods can synthesize motion satisfying constraints, but their modeling process is complex and they struggle with large datasets. Reinforcement-learning-based methods can interact with the external environment, but are still limited by complex modeling and a single form of motion. In contrast, deep-learning-based methods can handle large datasets with diverse motion patterns and can encode complex motion data into a small, fixed-size network. These advantages have made deep learning a growing research hotspot in the field of motion synthesis.
Disclosure of Invention
In view of the above technical problems, a method for synthesizing three-dimensional human motion with a recurrent neural network based on hierarchical learning is provided. The method comprises a model-training step and a model-testing step, characterized in that the training step comprises the following steps:
Step S11: constructing a low-level motion-information extraction network from GRU units, where the network takes the curvature and average-speed information of each skeleton frame in the dataset as input and, after training, outputs the motion features of the character in each frame;
Step S12: building a high-level motion synthesis network from a GRU network; the skeleton features in the dataset are combined with the motion features extracted in step S11 as input, and the network is trained to learn the spatio-temporal relations between motion sequences and to synthesize motion sequences that follow a user-input trajectory;
Step S13: taking motion data of different styles as input to the high-level motion synthesis network, combining the skeleton features of the motion data with the motion features extracted by the low-level network in step S11, and training the high-level network to learn the skeleton spatio-temporal relations of motions of different styles and to synthesize motion sequences of different styles, i.e., converting normal walking-style motion data into emotional walking-style data.
Further, the model-testing step comprises the following steps:
Step S21: randomly selecting test data from the dataset as a test set; feeding the first 30 frames of each motion sequence in the test set into the trained high-level motion synthesis network to synthesize motion sequences of different types; and evaluating the accuracy with which the low-level network extracts motion information, the joint-distance error between the synthesized and real motion sequences, and the quality of the synthesized animation, to test the performance of the trained high-level network;
Step S22: in the style motion synthesis task, demonstrating the effectiveness of the model by comparing the synthesized animations of different motion styles.
Further, in step S11, the recurrent neural network model is obtained by feeding the input motion data into a GRU network, defined as follows:
z_t = σ(W_z · [h_{t-1}, x_t])
r_t = σ(W_r · [h_{t-1}, x_t])
h̃_t = tanh(W · [r_t · h_{t-1}, x_t])
h_t = (1 - z_t) · h_{t-1} + z_t · h̃_t
where σ denotes the sigmoid activation function, tanh the hyperbolic-tangent activation function, and W a weight parameter; (·) denotes point-wise multiplication and (×) matrix multiplication. z_t is the state of the update gate, which combines the previous hidden state h_{t-1} with the updated candidate hidden state h̃_t to form the final hidden state h_t of the GRU unit; W_z is the weight of the update gate in the GRU unit; W_r is the weight of the reset gate r_t, which resets the candidate hidden state h̃_t of the unit according to the previous hidden-state information; and x_t is the input to the GRU unit.
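For illustration, the following is a minimal PyTorch sketch of a single GRU cell implementing the four equations above (the patent names PyTorch as its framework; the module and variable names here are illustrative, not from the patent). In practice, torch.nn.GRU provides an equivalent optimized implementation.

```python
import torch
import torch.nn as nn

class GRUCellSketch(nn.Module):
    """Minimal GRU cell following the update-gate/reset-gate equations above."""
    def __init__(self, input_size: int, hidden_size: int):
        super().__init__()
        # W_z, W_r, W act on the concatenation [h_{t-1}, x_t]
        self.W_z = nn.Linear(input_size + hidden_size, hidden_size)
        self.W_r = nn.Linear(input_size + hidden_size, hidden_size)
        self.W = nn.Linear(input_size + hidden_size, hidden_size)

    def forward(self, x_t: torch.Tensor, h_prev: torch.Tensor) -> torch.Tensor:
        hx = torch.cat([h_prev, x_t], dim=-1)
        z_t = torch.sigmoid(self.W_z(hx))   # update gate
        r_t = torch.sigmoid(self.W_r(hx))   # reset gate
        # candidate hidden state: reset gate rescales the previous hidden state
        h_tilde = torch.tanh(self.W(torch.cat([r_t * h_prev, x_t], dim=-1)))
        # final hidden state: interpolation between h_{t-1} and the candidate
        return (1.0 - z_t) * h_prev + z_t * h_tilde
```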
Still further, the low-level motion features are extracted as follows.
First, a function f over X = (x_2, x_1), X ∈ R^2, is defined (its defining formula is rendered only as an image in the original document).
q_i = p_{i+1} - p_i
where q_i ∈ R^2 is the offset of the character at frame i, and p_i, p_{i+1} ∈ R^2 are the x- and y-axis world-coordinate positions of the character's root joint at frames i and i+1, respectively;
c_i = f(q_i)
where c_i is the curvature feature used as input;
s_i is the instantaneous speed of the character's root joint after Gaussian filtering, where exp(·) denotes the Gaussian filter function and σ its parameter (the filter formula is rendered only as an image in the original);
b = (1/L) Σ_{i=1}^{L} s_i
where b is the average speed of the character and L is the length of the motion sequence; θ_i denotes the foot-contact information, with θ_i = 2π when the left foot contacts the ground and θ_i = π when the right foot contacts the ground;
d_i = {cos δ_i, sin δ_i}
where d_i is the motion orientation of the character in each frame and δ_i is the Euler direction angle about the x and y axes;
f_i = ||(s_i cos θ_i, s_i sin θ_i)||_2
where f_i is the local speed feature of the character; the step motion feature f_i of the human body is computed from the instantaneous speed of the root joint;
β_i denotes a motion parameter of the complete sequence (its formula is rendered only as an image in the original);
g_i = f(β_{i+1}) - f(β_i)
where g_i is the step-frequency feature of the character, computed by differencing.
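As a minimal sketch, the following Python code computes the features that are fully specified above (offsets q_i, curvature c_i, smoothed speed s_i, average speed b, orientation d_i, and local speed f_i). Since the function f and the exact Gaussian-filter formula are given only as images in the original, the turning-angle surrogate for f and the filter form used here are assumptions, and the step-frequency feature g_i is omitted because β_i is unspecified.

```python
import numpy as np

def root_features(p, delta, theta, sigma=2.0):
    """p: (L, 2) root-joint x/y world positions; delta: (L,) Euler heading angles;
    theta: (L,) foot-contact codes (2*pi = left foot down, pi = right foot down)."""
    q = p[1:] - p[:-1]                    # per-frame offsets q_i = p_{i+1} - p_i
    # curvature feature c_i = f(q_i); f is unspecified in the text, so a
    # turning-angle surrogate is assumed here
    c = np.arctan2(q[:, 1], q[:, 0])
    speed = np.linalg.norm(q, axis=1)     # raw instantaneous root-joint speed
    # Gaussian smoothing of the speed (assumed form of the filter in the text)
    i = np.arange(-3 * int(sigma), 3 * int(sigma) + 1)
    kernel = np.exp(-i**2 / (2 * sigma**2))
    kernel /= kernel.sum()
    s = np.convolve(speed, kernel, mode="same")
    b = s.mean()                          # average speed over the sequence
    # per-frame motion orientation d_i = (cos delta_i, sin delta_i)
    d = np.stack([np.cos(delta[:-1]), np.sin(delta[:-1])], axis=1)
    th = theta[:-1]
    # local speed f_i = ||(s_i cos theta_i, s_i sin theta_i)||_2, implemented
    # literally as given (with theta in {pi, 2*pi} this reduces to s_i)
    f_local = np.linalg.norm(
        np.stack([s * np.cos(th), s * np.sin(th)], axis=1), axis=1)
    return c, s, b, d, f_local
```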
Further, the input features of the high-level motion synthesis network described in S12 are the per-frame control-parameter vectors E_i (their exact composition is rendered only as an image in the original); the first feature in the control parameters of the i-th frame involves θ, the contact information of the foot joint.
The process by which the high-level network synthesizes motion can be expressed as:
x_{k+1} = P({x_1, E_1, T_1}, {x_2, E_2, T_2}, ..., {x_k, E_k, T_k}, φ)
where T_i ∈ R^2 is the skeleton height and average speed of the character at the previous moment, used as additional conversion parameters; the additional conversion parameters at time T_1 are the character's own height and average speed; and φ denotes the training parameters of the network.
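A structural sketch of the two-level design described above, in PyTorch: a low-level GRU extracts per-frame motion features E from the control inputs, and a high-level GRU consumes poses x concatenated with E and the conversion parameters T and predicts the next pose. All dimensions (pose size, hidden sizes, feature size) are assumptions, as the patent does not specify them.

```python
import torch
import torch.nn as nn

class LowLevelExtractor(nn.Module):
    """Low-level network: curvature/average-speed inputs -> per-frame features E."""
    def __init__(self, in_dim=4, hid=64, feat_dim=16):
        super().__init__()
        self.gru = nn.GRU(in_dim, hid, batch_first=True)
        self.head = nn.Linear(hid, feat_dim)

    def forward(self, ctrl):               # ctrl: (B, L, in_dim)
        h, _ = self.gru(ctrl)
        return self.head(h)                # E: (B, L, feat_dim)

class HighLevelSynthesizer(nn.Module):
    """High-level network P: past {x_i, E_i, T_i} triples -> next pose x_{k+1}."""
    def __init__(self, pose_dim=63, feat_dim=16, t_dim=2, hid=256):
        super().__init__()
        self.gru = nn.GRU(pose_dim + feat_dim + t_dim, hid, batch_first=True)
        self.head = nn.Linear(hid, pose_dim)

    def forward(self, x, E, T):            # each: (B, k, *)
        inp = torch.cat([x, E, T], dim=-1)
        h, _ = self.gru(inp)
        return self.head(h[:, -1])         # predicted x_{k+1}: (B, pose_dim)
```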
Compared with the prior art, the invention has the following advantages:
1) The invention can be used for synthesizing motion conforming to an input track, generating transition motion between two different types of motion, and synthesizing motion sequences with different emotion styles.
2) Compared with other methods, the error value of the synthesized motion sequence is lower, and the synthesized motion is more accurate.
Drawings
In order to more clearly illustrate the embodiments of the present invention and the technical solutions in the prior art, the drawings required by the embodiments are briefly described below. The drawings described below illustrate only some embodiments of the invention; other drawings can be obtained from them by a person skilled in the art without inventive effort.
Fig. 1 is a schematic diagram of an overall framework of a cyclic neural network based on layered learning for a three-dimensional human motion synthesis method in the invention.
FIG. 2 is a diagram showing the accuracy of the low-level network in extracting motion information for (a) kicking, (b) punching, (c) rolling, and (d) jogging.
FIG. 3 is an error comparison of motion synthesized by the high-level network for (a) kicking, (b) punching, (c) rolling, and (d) jogging.
FIG. 4 is a schematic diagram of an animation effect of a composite kick motion; (a) is a true value; (b) is a method of the invention; (c) is a reference method.
FIG. 5 is a schematic diagram of an animation effect of a synthetic punch motion; (a) is a true value; (b) is a method of the invention; (c) is a reference method.
FIG. 6 is a schematic diagram of an animation effect of the synthesized roll motion; (a) is a true value; (b) is a method of the invention; (c) is a reference method.
FIG. 7 is a schematic diagram of an animation effect of a synthesized jogging motion; (a) is a true value; (b) is a method of the invention; (c) is a reference method.
Fig. 8 is a schematic diagram of the transition effect of synthesized motion from kicking to running.
Fig. 9 is a schematic diagram of the transition effect of synthesized motion from rolling to running.
Fig. 10 is a schematic diagram of a network architecture for athletic style conversion.
FIG. 11 is a schematic diagram of the generation of different motion styles. From top to bottom: (a) neutral, (b) angry, (c) depressed, (d) old, and (e) proud.
Detailed Description
In order that those skilled in the art may better understand the present invention, the technical solutions in the embodiments of the invention are described below clearly and completely with reference to the accompanying drawings. The described embodiments are evidently only some, not all, embodiments of the invention. All other embodiments obtained by those skilled in the art from these embodiments without inventive effort shall fall within the scope of the invention.
It should be noted that the terms "first," "second," and the like in the description and the claims of the present invention and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the invention described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
As shown in figs. 1 to 11, the present invention provides a method for three-dimensional human motion synthesis with a recurrent neural network based on hierarchical learning, comprising a model-training step and a model-testing step.
As a preferred embodiment of the present application, the training model step in the present application includes the steps of:
Step S11: constructing a low-level motion-information extraction network from GRU units, where the network takes the curvature and average-speed information of each skeleton frame in the dataset as input and, after training, outputs the motion features of the character in each frame;
Step S12: building a high-level motion synthesis network from a GRU network; the skeleton features in the dataset are combined with the motion features extracted in step S11 as input, and the network is trained to learn the spatio-temporal relations between motion sequences and to synthesize motion sequences that follow a user-input trajectory. In this application, the network is trained to learn skeleton spatio-temporal relation information of emotional motion, for example features such as arm-swing amplitude when walking in a depressed mood. At run time, the network can then convert input normal walking data into emotional walking data. The relevant posture characteristics are those in which an emotional walking style differs from a neutral one; for example, when depressed, the gait becomes heavy and the arm-swing amplitude becomes small.
Step S13: taking motion data of different styles as input to the high-level motion synthesis network, combining the skeleton features of the motion data with the motion features extracted by the low-level network in step S11, and training the high-level network to learn the skeleton spatio-temporal relations of motions of different styles and to synthesize motion sequences of different styles, i.e., converting normal walking-style motion data into emotional walking-style data.
Preferably, the model-testing step comprises the following steps:
Step S21: randomly selecting test data from the dataset as a test set; feeding the first 30 frames of each motion sequence in the test set into the trained high-level motion synthesis network to synthesize motion sequences of different types; and evaluating the accuracy with which the low-level network extracts motion information, the joint-distance error between the synthesized and real motion sequences, and the quality of the synthesized animation, to test the performance of the trained high-level network;
Step S22: in the style motion synthesis task, demonstrating the effectiveness of the model by comparing the synthesized animations of different motion styles.
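A sketch of the test procedure of step S21, reusing the two modules sketched earlier: the first 30 frames of a test sequence seed the trained high-level network, which is then rolled out autoregressively. The seed length of 30 frames and the 60-frame horizon follow the text, while tensor shapes and function names are assumptions.

```python
import torch

@torch.no_grad()
def synthesize(high_net, low_net, seed_poses, ctrl, T, n_new=60, seed_len=30):
    """seed_poses: (1, seed_len, pose_dim); ctrl: (1, seed_len+n_new, ctrl_dim);
    T: (1, seed_len+n_new, 2) height/average-speed conversion parameters."""
    E = low_net(ctrl)                      # motion features for the whole horizon
    poses = [seed_poses[:, t] for t in range(seed_len)]
    for k in range(seed_len, seed_len + n_new):
        x_hist = torch.stack(poses, dim=1) # all poses available so far
        x_next = high_net(x_hist, E[:, :k], T[:, :k])
        poses.append(x_next)               # feed the prediction back as input
    return torch.stack(poses[seed_len:], dim=1)   # (1, n_new, pose_dim)
```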
As a preferred embodiment of the present application, in step S11 the recurrent neural network model is obtained by feeding the input motion data into a GRU network, defined as follows:
z_t = σ(W_z · [h_{t-1}, x_t])
r_t = σ(W_r · [h_{t-1}, x_t])
h̃_t = tanh(W · [r_t · h_{t-1}, x_t])
h_t = (1 - z_t) · h_{t-1} + z_t · h̃_t
where σ denotes the sigmoid activation function, tanh the hyperbolic-tangent activation function, and W a weight parameter; (·) denotes point-wise multiplication and (×) matrix multiplication. z_t is the state of the update gate, which combines the previous hidden state h_{t-1} with the updated candidate hidden state h̃_t to form the final hidden state h_t of the GRU unit; W_z is the weight of the update gate in the GRU unit; W_r is the weight of the reset gate r_t, which resets the candidate hidden state h̃_t of the unit according to the previous hidden-state information; and x_t is the input to the GRU unit.
The motion data input here may refer either to the curvature and average-speed information fed to the low-level network or to the input data of the high-level network, since both the low-level and high-level networks use the GRU structure; the above serves only to explain the GRU architecture.
As a preferred embodiment, the low-level motion features are extracted as follows.
First, a function f over X = (x_2, x_1), X ∈ R^2, is defined (its defining formula is rendered only as an image in the original document).
q_i = p_{i+1} - p_i
where q_i ∈ R^2 is the offset of the character at frame i, and p_i, p_{i+1} ∈ R^2 are the x- and y-axis world-coordinate positions of the character's root joint at frames i and i+1, respectively;
c_i = f(q_i)
where c_i is the curvature feature used as input;
s_i is the instantaneous speed of the character's root joint after Gaussian filtering, where exp(·) denotes the Gaussian filter function and σ its parameter (the filter formula is rendered only as an image in the original);
b = (1/L) Σ_{i=1}^{L} s_i
where b is the average speed of the character and L is the length of the motion sequence; θ_i denotes the foot-contact information, with θ_i = 2π when the left foot contacts the ground and θ_i = π when the right foot contacts the ground;
d_i = {cos δ_i, sin δ_i}
where d_i is the motion orientation of the character in each frame and δ_i is the Euler direction angle about the x and y axes;
f_i = ||(s_i cos θ_i, s_i sin θ_i)||_2
where f_i is the local speed feature of the character; the step motion feature f_i of the human body is computed from the instantaneous speed of the root joint;
β_i denotes a motion parameter of the complete sequence (its formula is rendered only as an image in the original);
g_i = f(β_{i+1}) - f(β_i)
where g_i is the step-frequency feature of the character, computed by differencing.
It will be appreciated that the input features of the high-level motion synthesis network described in step S12 are the per-frame control-parameter vectors E_i (their exact composition is rendered only as an image in the original); the first feature in the control parameters of the i-th frame involves θ, the contact information of the foot joint.
The process by which the high-level network synthesizes motion can be expressed as:
x_{k+1} = P({x_1, E_1, T_1}, {x_2, E_2, T_2}, ..., {x_k, E_k, T_k}, φ)
where T_i ∈ R^2 is the skeleton height and average speed of the character at the previous moment, used as additional conversion parameters; the additional conversion parameters at time T_1 are the character's own height and average speed; and φ denotes the training parameters of the network.
Example 1
As an embodiment of the present invention, the effect on synthesized human motion can be further explained by the following experiments:
Experimental conditions:
1) The motion dataset used in the experiments is the CMU human motion capture dataset, a large online motion database containing sequences of various motions such as running, walking, kicking, and tumbling.
2) The programming platform used is Python 3.6 and the deep learning framework is PyTorch.
3) The server used is configured with a Quadro K6000 graphics card with 12 GB of graphics memory, an Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40 GHz, 64.0 GB of RAM, and Ubuntu 16.04 LTS as operating system.
4) In the experiments, the performance of the low-level network is evaluated by the accuracy with which it extracts motion information, and the effect of the synthesized motion is evaluated by the joint-position error between a generated 60-frame motion sequence and the real motion sequence.
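A short sketch of the joint-position-error metric described in 4): the mean Euclidean distance between corresponding joints of the synthesized and real sequences, per frame (array shapes are assumptions).

```python
import numpy as np

def joint_position_error(pred, gt):
    """pred, gt: (frames, joints, 3) joint world positions.
    Returns the mean per-joint Euclidean distance for each frame."""
    return np.linalg.norm(pred - gt, axis=-1).mean(axis=-1)   # shape: (frames,)
```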
the experimental contents are as follows:
the experiment is based on the method in the text and the method in the document [1], and is aimed at four representative actions of kicking, punching, rolling, jogging and the like, and the accuracy of extracting the motion information is analyzed and compared, and the result is shown in figure 2.
Table 1: Comparison of errors within 60 frames for motion features extracted by different methods (the table body is rendered as images in the original document).
Next, the joint-position errors of the four synthesized motions (kicking, punching, rolling, jogging) were compared experimentally. The generated motion sequences are compared with the joint positions of the real motion in the database to obtain the per-frame position error, as shown in fig. 3.
A combination experiment on two different types of motion data was also carried out, the core of which is to generate motion data for the transition stage. Monotonous, regular movements such as walking and running are defined as simple motion forms, while violent, variable actions such as rolling, jumping, and kicking are called complex motion forms. Whether the transitional motion synthesized by combining a simple and a complex motion form is natural, realistic, and smooth is one of the research difficulties. Two experiments were performed: the first combines kicking with running, and the second combines rolling with running.
The method is also applied to motion style conversion, where the task is defined as generating emotional motion data given a motion trajectory.
Analysis of experimental results:
as can be seen from table1, the present method compares the motion characteristics extracted by two methods to errors within 60 frames. The black bolded is a smaller error value, and as can be seen from the table, the motion features extracted by the method have smaller errors than the true value, and the step motion speed features extracted based on the method do not completely fit the true value, but are compared with the literature [1] The method has lower error and can embody the dynamic characteristic of the role in motion.
Experiments show that the method can generate a motion sequence with higher fitting degree with the original motion data aiming at motions with smaller character displacement change, such as kicking and punching motions. And adopt the literature [1] The same class of motion synthesized by the method of (a) has irregular rotation and sliding phenomena, which lead to insufficient smoothness and fluency of the overall motion, and the visualized animation sequence is qualitatively shown in fig. 4 and 5. The comparison animations all have the same time axis, wherein the first row of green is the real motion sequenceColumn, second row red is a motion sequence generated based on the method herein, third row black is literature-based [1] A framework sequence generated by the method of (a).
For the motion type with larger motion amplitude and complexity, such as rolling motion, as shown in fig. 6, the method can also generate motion data with higher fitting degree with the original data. But based on literature [1] The method of (a) does not perform well enough in generating such data and the generated motion data is ambiguous. As shown in the figure, the raw data input is a roll motion, but the motion generated is a jogging motion.
In addition, for movements of lesser magnitude and complexity, such as jogging datasets, as shown in FIG. 7, although the method and method [1] The method can generate smooth jogging motion, but the method has higher fitting degree with a true value, so that the error degree of the joint is lower and is more close to the characteristic of the true motion.
In the generated transition animation from kicking to running, the model automatically generates smooth transition frames without any manual editing, as shown in the black-box portion of fig. 8. In the generated transition animation from rolling to running, the generated transition frames show the body regaining balance after an unstable landing from the roll; such lifelike details reflect how a human spontaneously adjusts body balance during motion, as shown in the black-box portion of fig. 9. Notably, these synthesized actions are not contained in the existing dataset; they are entirely new actions generated by the present method.
Given neutral emotional motion data, the network can generate motion data with different emotions, such as angry, depressed, old, and proud walking styles, as shown in fig. 11.
Reference to the literature
[1] D. Pavllo, C. Feichtenhofer, M. Auli, and D. Grangier, "Modeling Human Motion with Quaternion-based Neural Networks," Int. J. Comput. Vis., pp. 1-18, Oct. 2019, doi:10.1007/s11263-019-01245-6.
The foregoing embodiment numbers of the present invention are merely for the purpose of description, and do not represent the advantages or disadvantages of the embodiments.
In the foregoing embodiments of the present invention, each embodiment has its own emphasis; for any part not described in detail in one embodiment, reference may be made to the related descriptions of other embodiments. It should be understood that the technical content disclosed in the several embodiments provided in this application may be implemented in other ways; the test-method embodiments described above are merely illustrative.
Finally, it should be noted that the above embodiments are intended only to illustrate, not to limit, the technical solution of the present invention. Although the invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described therein can still be modified, or some or all of their technical features replaced by equivalents, without such modifications and substitutions departing from the spirit of the invention.

Claims (2)

1. A method for synthesizing three-dimensional human motion with a recurrent neural network based on hierarchical learning, comprising a model-training step and a model-testing step, characterized in that:
the training model step comprises the following steps:
s11: constructing a low-layer motion information extraction network by adopting GRU units, wherein the network takes curvature and average speed information of each frame of a skeleton in a data set as input, and the network can output motion characteristics of each frame of a role after training;
the low-level motion features are extracted as follows:
first, a function f over X = (x_2, x_1), X ∈ R^2, is defined (its defining formula is rendered only as an image in the original document);
q_i = p_{i+1} - p_i
where q_i ∈ R^2 is the offset of the character at frame i, and p_i, p_{i+1} ∈ R^2 are the x- and y-axis world-coordinate positions of the character's root joint at frames i and i+1, respectively;
c_i = f(q_i)
where c_i is the curvature feature used as input;
s_i is the instantaneous speed of the character's root joint after Gaussian filtering, where exp(·) denotes the Gaussian filter function and σ its parameter (the filter formula is rendered only as an image in the original);
b = (1/L) Σ_{i=1}^{L} s_i
where b is the average speed of the character and L is the length of the motion sequence; θ_i denotes the foot-contact information, with θ_i = 2π when the left foot contacts the ground and θ_i = π when the right foot contacts the ground;
d_i = {cos δ_i, sin δ_i}
where d_i is the motion orientation of the character in each frame and δ_i is the Euler direction angle about the x and y axes;
f_i = ||(s_i cos θ_i, s_i sin θ_i)||_2
where f_i is the local speed feature of the character; the step motion feature f_i of the human body is computed from the instantaneous speed of the root joint;
β_i denotes a motion parameter of the complete sequence (its formula is rendered only as an image in the original);
g_i = f(β_{i+1}) - f(β_i)
where g_i is the step-frequency feature of the character, computed by differencing;
s12: establishing a high-level motion synthesis network by adopting a GRU network; combining the skeleton features in the data set with the motion features extracted in the step S11 as input, training the space-time relationship between the network learning motion sequences, and synthesizing the motion sequences following the user input track;
the input features of the high-level motion synthesis network described in S12 can be expressed as:
Figure QLYQS_6
wherein,,
Figure QLYQS_7
representing the first characteristic in the control parameter of the ith frame, wherein theta is contact information of the foot joint;
the process of high-level network composition motion can be expressed as the formula:
x k+1 =P({x 1 ,E 1 ,T 1 },{x 2 ,E 2 ,T 2 },...,{x k ,E k ,T k },φ);
wherein T is i Skeleton height and average speed expressed as last moment character as additional conversion parameter, T i ∈R 2 ,T 1 The additional conversion parameters at the moment are the self height and average speed, phi represents the training parameters of the network;
s13: adopting motion data with different styles as input of a high-level motion synthesis network, combining the motion data skeleton characteristics with the motion characteristics extracted by the low-level motion information extraction network in the step S11 as input, training the high-level motion synthesis network to learn skeleton space-time relation information of motions with different styles, and synthesizing motion sequences with different styles, namely converting normal walking style motion data into emotion walking style data;
the model-testing step comprises the following steps:
S21: randomly selecting test data from the dataset as a test set; feeding the first 30 frames of each motion sequence in the test set into the trained high-level motion synthesis network to synthesize motion sequences of different types; and evaluating the accuracy with which the low-level motion-information extraction network extracts motion information, the joint-distance error between the synthesized and real motion sequences, and the quality of the synthesized animation, to test the performance of the trained high-level network;
S22: in the style motion synthesis task, demonstrating the effectiveness of the model by comparing the synthesized animations of different motion styles.
2. The method for synthesizing three-dimensional human motion with a recurrent neural network based on hierarchical learning according to claim 1, wherein in step S11 the recurrent neural network model is obtained by feeding the input motion data into a GRU network, defined as follows:
z_t = σ(W_z · [h_{t-1}, x_t])
r_t = σ(W_r · [h_{t-1}, x_t])
h̃_t = tanh(W · [r_t · h_{t-1}, x_t])
h_t = (1 - z_t) · h_{t-1} + z_t · h̃_t
where σ denotes the sigmoid activation function, tanh the hyperbolic-tangent activation function, and W a weight parameter; (·) denotes point-wise multiplication and (×) matrix multiplication; z_t is the state of the update gate, which combines the previous hidden state h_{t-1} with the updated candidate hidden state h̃_t to form the final hidden state h_t of the GRU unit; W_z is the weight of the update gate in the GRU unit; W_r is the weight of the reset gate r_t, which resets the candidate hidden state h̃_t of the unit according to the previous hidden-state information; and x_t is the input to the GRU unit.
CN202010506080.3A 2020-06-05 2020-06-05 Method for synthesizing three-dimensional human motion with a recurrent neural network based on hierarchical learning Active CN111681321B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010506080.3A CN111681321B (en) 2020-06-05 2020-06-05 Method for synthesizing three-dimensional human motion with a recurrent neural network based on hierarchical learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010506080.3A CN111681321B (en) 2020-06-05 2020-06-05 Method for synthesizing three-dimensional human motion with a recurrent neural network based on hierarchical learning

Publications (2)

Publication Number Publication Date
CN111681321A CN111681321A (en) 2020-09-18
CN111681321B true CN111681321B (en) 2023-07-04

Family

ID=72434985

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010506080.3A Active CN111681321B (en) Method for synthesizing three-dimensional human motion with a recurrent neural network based on hierarchical learning

Country Status (1)

Country Link
CN (1) CN111681321B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114972441A (en) * 2022-06-27 2022-08-30 南京信息工程大学 Motion synthesis framework based on deep neural network

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109584345A (en) * 2018-11-12 2019-04-05 大连大学 Human motion synthetic method based on convolutional neural networks
CN110321833A (en) * 2019-06-28 2019-10-11 南京邮电大学 Human bodys' response method based on convolutional neural networks and Recognition with Recurrent Neural Network
CN110473284A (en) * 2019-07-29 2019-11-19 电子科技大学 A kind of moving object method for reconstructing three-dimensional model based on deep learning
CN111079928A (en) * 2019-12-14 2020-04-28 大连大学 Method for predicting human motion by using recurrent neural network based on antagonistic learning

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109584345A (en) * 2018-11-12 2019-04-05 大连大学 Human motion synthetic method based on convolutional neural networks
CN110321833A (en) * 2019-06-28 2019-10-11 南京邮电大学 Human bodys' response method based on convolutional neural networks and Recognition with Recurrent Neural Network
CN110473284A (en) * 2019-07-29 2019-11-19 电子科技大学 A kind of moving object method for reconstructing three-dimensional model based on deep learning
CN111079928A (en) * 2019-12-14 2020-04-28 大连大学 Method for predicting human motion by using recurrent neural network based on antagonistic learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Xiao Guo et al. Human Motion Prediction via Learning Local Structure Representations and Temporal Dependencies. arXiv, 2019 (full text). *
Wang Xin et al. Research Progress on Neural-Network-Based Character Motion Synthesis. Computer Science, 2019, vol. 46, no. 9 (full text). *

Also Published As

Publication number Publication date
CN111681321A (en) 2020-09-18

Similar Documents

Publication Publication Date Title
US11017577B2 (en) Skinned multi-person linear model
Tanco et al. Realistic synthesis of novel human movements from a database of motion capture examples
Liu et al. Investigating pose representations and motion contexts modeling for 3D motion prediction
CN110599573B (en) Method for realizing real-time human face interactive animation based on monocular camera
Yamane et al. Human motion database with a binary tree and node transition graphs
CN101241600B (en) Chain-shaped bone matching method in movement capturing technology
CN110310351A (en) A kind of 3 D human body skeleton cartoon automatic generation method based on sketch
Al Borno et al. Robust Physics‐based Motion Retargeting with Realistic Body Shapes
CN111681321B (en) Method for synthesizing three-dimensional human motion with a recurrent neural network based on hierarchical learning
Ji Combining knowledge with data for efficient and generalizable visual learning
Mici et al. An incremental self-organizing architecture for sensorimotor learning and prediction
Alemi et al. Machine learning for data-driven movement generation: a review of the state of the art
CN114170353B (en) Multi-condition control dance generation method and system based on neural network
CN109584345B (en) Human motion synthesis method based on convolutional neural network
Zhou et al. 3D human motion synthesis based on convolutional neural network
Wang et al. A cyclic consistency motion style transfer method combined with kinematic constraints
Tian et al. Augmented Reality Animation Image Information Extraction and Modeling Based on Generative Adversarial Network
CN112949419A (en) Action recognition method based on limb hierarchical structure
Jia et al. A Novel Training Quantitative Evaluation Method Based on Virtual Reality
Ettehadi et al. Learning from demonstration: Generalization via task segmentation
CN116276956B (en) Method and device for simulating and learning operation skills of customized medicine preparation robot
CN114618147B (en) Taijiquan rehabilitation training action recognition method
Yin et al. One-shot SADI-EPE: a visual framework of event progress estimation
CN111951359A (en) Interactive motion control method and system based on neural network
Chen et al. Analysis of moving human body detection and recovery aided training in the background of multimedia technology

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant