CN112581499A - 3D human motion prediction method based on deep state space model - Google Patents

3D human motion prediction method based on deep state space model

Info

Publication number
CN112581499A
CN112581499A · CN202011500519.8A
Authority
CN
China
Prior art keywords
state
space model
state space
depth
sequence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011500519.8A
Other languages
Chinese (zh)
Inventor
Xiaoli Liu (刘小丽)
Jianqin Yin (尹建芹)
Jin Liu (刘金)
Yonghao Dang (党永浩)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Posts and Telecommunications
Original Assignee
Beijing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Posts and Telecommunications filed Critical Beijing University of Posts and Telecommunications
Priority to CN202011500519.8A priority Critical patent/CN112581499A/en
Publication of CN112581499A publication Critical patent/CN112581499A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • G06T 7/246 — Image analysis; analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G06F 18/253 — Pattern recognition; fusion techniques of extracted features
    • G06N 3/045 — Neural networks; combinations of networks
    • G06T 2207/20081 — Special algorithmic details; training, learning
    • G06T 2207/20084 — Special algorithmic details; artificial neural networks [ANN]
    • G06T 2207/30196 — Subject of image; human being, person
    • G06T 2207/30241 — Subject of image; trajectory

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a 3D human motion prediction method based on a deep state space model. The method first takes the position and velocity of human motion as observations, uses a deep network to extract the motion dynamics of the historical pose sequence to initialize the state of the state space model, and then recursively predicts multiple future poses of the 3D human motion through state-to-observation transitions. By exploiting the advantages of deep networks and state space models, the invention models the human motion system as a deep state space model, provides a unified description for various human motion systems, and allows existing models to be analyzed within this framework.

Description

3D human motion prediction method based on deep state space model
Technical Field
The invention belongs to the technical field of human motion prediction, and particularly relates to a 3D human motion prediction method based on a deep state space model.
Background
Human awareness of and interaction with the real world rely on the ability to predict how the surrounding environment changes over time. Likewise, intelligent robots that interact with humans must be able to predict future human dynamics so that they can respond rapidly to human motion. Most existing prediction models are optimized only with an L2 or MPJPE (mean per joint position error) loss computed over all frames of the future poses or over all joints. Because the loss at early time steps is smaller than at later time steps, such models implicitly focus on late-time prediction and ignore the hidden relationship between early and late times, namely that in recursive models early predictions strongly affect later predictions. As a result, none of these models achieves sufficiently accurate predictions, particularly among recursive prediction models.
Therefore, how to provide an accurate 3D human motion prediction method becomes a problem that needs to be solved urgently by those skilled in the art.
Disclosure of Invention
In view of this, the invention provides a 3D human motion prediction method based on a deep state space model, which exploits the advantages of deep networks and state space models to model the human motion system as a deep state space model, provides a unified description for various human motion systems, and allows existing models to be analyzed within this framework.
In order to achieve the purpose, the invention adopts the following technical scheme:
A 3D human motion prediction method based on a deep state space model, in which the position and velocity of human motion are taken as observations, a deep network extracts the motion dynamics of the historical pose sequence to initialize the state of the state space model, the human motion system is modeled as a deep state space model, and multiple future poses of the 3D human motion are predicted recursively through state-to-observation transitions.
Preferably, the future pose is represented as a transition of the model observations.
Preferably, the method for representing the pose sequence is as follows:
Given an input sequence S1 = {p(t0)} (t0 = −(T1−1), …, −1, 0) of length T1, where p(t0) denotes the pose at time t0 in the pose sequence S1, the velocity of the input sequence S1 is defined as V1 = {v(t0)}, where v(t0) = p(t0) − p(t0−1) and v(−(T1−1)) = 0. The input sequence S1 is expressed as three 2D tensors S1x, S1y and S1z, which represent the motion trajectories of S1 along the x, y and z axes, respectively; in velocity space, V1 is expressed as three 2D tensors V1x, V1y and V1z, representing the velocity information along the x, y and z axes, respectively.
Preferably, the human motion system is represented by a state space model of a dynamic system, as shown in formula (1) and formula (2):
I(t+1)=f1(I(t),t)+a(t) (1)
O(t)=f2(I(t),t)+b(t) (2)
where I(t) and O(t) denote the state and the observation at time t, respectively; a(t) and b(t) are the process noise and the measurement noise at time t, respectively; f1(·) and f2(·) are system functions.
Preferably, the position and velocity of human motion are taken as observations, the motion dynamics of the historical pose sequence initialize the state of the state space model, and a(t) and b(t) are both initialized to 0. The future pose sequence corresponding to S1 is defined as S2 = {p̂(t)} (t = 1, 2, …, T2), where S2 has length T2 and its corresponding velocity is V2 = {v̂(t)}; p̂(t) denotes the pose at time t in S2 and v̂(t) denotes the velocity at time t. The state I(t) and the observation O(t) are therefore redefined as equation (3) and equation (4), respectively:
I(t)={O(t-1),Fb(t-1)} (3)
O(t)={p̂(t),v̂(t)} (4)
where O(0) is initialized to {p(0), 0}, and Fb(t−1) denotes the other multi-level information of the historical poses at time t−1.
Preferably, the deep state space model is a two-stage system comprising state-to-state transitions and state-to-observation transitions. State-to-state transition: this stage updates the system state through the system function f1(·), generating the multi-level information of the future pose and updating the other multi-level information of the previous poses, which is learned automatically by the deep network. State-to-observation transition: at this stage the observation is computed from the current state of the system through the system function f2(·); this part is learned automatically by the decoder. Further, since the current position and velocity of the human body determine the position of the human body at the next time step, the future pose is computed by equation (5):
p̂(t+1)=p̂(t)+v̂(t+1) (5)
preferably, the trunk layer of the depth state space model is a Densely Connected Convolution Module (DCCM), and the densely connected convolution module mainly comprises 5 convolution layers; in each convolutional layer, the input fuses features from all previous layers through the 1 × 1 convolutional layer; wherein 1 × 1 convolution in residual concatenation can learn joint level features; therefore, the dense residual connection in the dense connection convolution module allows the network to gradually acquire enhanced features by fusing the shallow joint level features, as shown in equation (6):
Ml=Hb([g0(M0),g1(M1),…,gl-2(Ml-2),Ml-1]) (6)
wherein M isl(1, 2, …,5) represents the output characteristic map of the l-th layer; hb(. to) a feature fusion layer, consisting of a 1 × 1 convolutional layer and a connection operation along the channel; g (-) represents 1 × 1 convolutional layer and Leaky ReLU activation function layer.
Preferably, the state of the deep state space model is initialized as follows: a multi-branch network is constructed from DCCMs to encode the coordinate-level and joint-level motion dynamics of the input sequence in position space and velocity space, respectively; it consists of a pose branch, a velocity branch and a fusion module. Pose branch: S1x, S1y and S1z are fed as the inputs of three sub-branches, each consisting of 2 DCCMs, so that the network captures coordinate-level features; a DCCM module then fuses the coordinate-level features to obtain joint-level features. Velocity branch: V1x, V1y and V1z are fed as the inputs of the corresponding sub-branches to gradually capture the multi-level features of the input sequence in velocity space. Fusion module: the dynamics information captured by the pose branch and the velocity branch is fused; the fusion module is built from a concatenation operation along the channel dimension followed by a convolutional layer and a Leaky ReLU activation layer. The output of the multi-branch network describes the dynamics of the input sequence and is denoted h(0); Fb(0) is initialized to h(0), and the initial state of the system I(0) is set to {Fb(0)}.
Preferably, the deep state space model implements the state transitions, including state-to-state transitions and state-to-observation transitions, with a CNN-based recursive decoder. More operations are used to model the historical features and fewer operations to model the current velocity information, and the two are then fused by an element-wise summation; finally, a convolutional layer and an FC layer predict the future velocity. The state I(t) and the observation O(t) of the dynamic system are updated by equation (7) and equation (8), respectively:
I(t+1)=f1(I(t),t) (7)
O(t)=f2(I(t),t) (8)
where t = 1, 2, …, T2, and f1(·) and f2(·) are implemented by two learnable mappings. In I(t+1), Fb(t) denotes the input state of the decoder at time t, which captures the multi-order information of the pose sequence at the previous time. Fb(t) sparsely aggregates the historical motion dynamics {Fb(i)} of the previous poses through equation (9) to obtain an enhanced feature representation and update Fb(t):
Fb(t)=Hm({Fb(i)},i≤t) (9)
where Hm(·) denotes a memory module that records the motion dynamics information of the historical poses.
Preferably, the deep state space model is optimized with an attention temporal prediction loss function L:
The attention temporal prediction loss function L comprises two parts, Lv and Lp, with L = λ1Lv + λ2Lp, where λ1 and λ2 are two hyperparameters that balance Lv and Lp; Lv guides the network to decode the velocity information of the future poses, and Lp encourages the network to recover the future position information.
Lv is defined by equation (10):
Lv = Σt=1..T2 αt · (1/Nj) Σj=1..Nj ‖v̂j(t) − vj(t)‖2 (10)
where Nj denotes the number of joint points; T2 denotes the length of the predicted future pose sequence; vj(t) and v̂j(t) denote the real joint point and the predicted joint point in velocity space, respectively; αt denotes the attention weight at time t, and αt > αt+1 forces the network to pay more attention to early predictions; αt is initialized to 2(T2 − t + 1) and the weights are normalized so that Σt=1..T2 αt = 1.
Lp is defined by equation (11):
Lp = (1/(Nj·T2)) Σt=1..T2 Σj=1..Nj ‖p̂j(t) − pj(t)‖2 (11)
where pj(t) and p̂j(t) denote the real joint point and the predicted joint point in position space, respectively.
The invention has the beneficial effects that:
the invention utilizes the advantages of depth representation and state space model to model the human motion system into a depth state space model, provides a uniform description for various human motion systems, and can analyze the existing model; the invention provides an end-to-end feedforward network to establish the model, and jointly realizes the state initialization and the state transition of the system; furthermore, the proposed loss of attention to temporal prediction can effectively guide our recursive model to achieve more accurate predictions.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.
Fig. 1 is a diagram illustrating human motion prediction.
Fig. 2 is a network structure diagram of a deep state space model.
Fig. 3 is a network structure diagram of the DCCM.
FIG. 4 is a qualitative result graph of short-term prediction of H3.6M.
FIG. 5 is a qualitative result chart of long-term prediction of H3.6M.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The invention provides a 3D human motion prediction method based on a deep state space model. First, the position and velocity of human motion are taken as observations, and a deep network extracts the motion dynamics of the historical pose sequence to initialize the state of the state space model; the human motion system is thereby modeled as a deep state space model. Multiple future poses of the 3D human motion are then predicted recursively through state-to-observation transitions, and the future poses are expressed as transitions of the model observations.
The method for representing the pose sequence is as follows:
Given an input sequence S1 = {p(t0)} (t0 = −(T1−1), …, −1, 0) of length T1, where p(t0) denotes the pose at time t0 in the pose sequence S1, the velocity of the input sequence S1 is defined as V1 = {v(t0)}, where v(t0) = p(t0) − p(t0−1) and v(−(T1−1)) = 0. The invention introduces a new skeleton representation in which the input skeleton point sequence is represented simultaneously in velocity space and position space, so that its dynamic characteristics can be captured better. Fig. 1(a) illustrates human motion prediction. Fig. 1(b) shows the joint trajectories along each coordinate axis (i.e., x, y and z), where the horizontal axis represents time, the vertical axis represents the joint points, each curve represents the motion trajectory of one joint point along one coordinate axis, and the trajectories of joint points on the same limb are marked with the same color.
As shown in Fig. 1(b), since the motion trajectories of the joint points along different coordinate axes differ considerably, the invention expresses the input sequence S1 as three 2D tensors S1x, S1y and S1z, which represent the motion trajectories of S1 along the x, y and z axes, respectively. Similarly, in velocity space, V1 is expressed as three 2D tensors V1x, V1y and V1z, representing the velocity information along the x, y and z axes, respectively. In this skeleton representation, the width corresponds to the frames and the height corresponds to the joint points; the order of the joint points is kept consistent throughout the representation, and the joint points of the same limb are placed at adjacent positions, which facilitates capturing local features of the human body structure.
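Purely as an illustration of this representation (this code is not part of the patent), the following NumPy sketch builds the per-axis position and velocity maps from a pose array; the array layout (T1, Nj, 3) and the function name are assumptions:

```python
import numpy as np

def build_skeleton_representation(poses):
    """poses: array of shape (T1, Nj, 3) holding T1 historical 3D poses with Nj joints.

    Returns two lists of three 2D maps each, (S1x, S1y, S1z) and (V1x, V1y, V1z),
    every map of shape (Nj, T1): height = joint points, width = frames.
    """
    velocities = np.zeros_like(poses)
    velocities[1:] = poses[1:] - poses[:-1]  # v(t) = p(t) - p(t-1); the first velocity stays 0

    position_maps = [poses[:, :, axis].T for axis in range(3)]       # S1x, S1y, S1z
    velocity_maps = [velocities[:, :, axis].T for axis in range(3)]  # V1x, V1y, V1z
    return position_maps, velocity_maps

# Example: 10 input frames, 22 joints (random values here).
S1, V1 = build_skeleton_representation(np.random.randn(10, 22, 3))
print(S1[0].shape)  # (22, 10)
```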
In the invention, the human motion system is a typical dynamic system, and can be represented by a state space model of the dynamic system, as shown in formula (1) and formula (2):
I(t+1)=f1(I(t),t)+a(t) (1)
O(t)=f2(I(t),t)+b(t) (2)
where I(t) and O(t) denote the state and the observation at time t, respectively; a(t) and b(t) are the process noise and the measurement noise at time t, respectively; f1(·) and f2(·) are system functions.
In the invention, the position and velocity of human motion are taken as observations, the motion dynamics of the historical pose sequence initialize the state of the state space model, and a(t) and b(t) are both initialized to 0. The future pose sequence corresponding to S1 is defined as S2 = {p̂(t)} (t = 1, 2, …, T2), where S2 has length T2 and its corresponding velocity is V2 = {v̂(t)}; p̂(t) denotes the pose at time t in S2 and v̂(t) denotes the velocity at time t. The state I(t) and the observation O(t) are therefore redefined as equation (3) and equation (4), respectively:
I(t)={O(t-1),Fb(t-1)} (3)
O(t)={p̂(t),v̂(t)} (4)
where O(0) is initialized to {p(0), 0}, and Fb(t−1) denotes the other multi-level information of the historical poses at time t−1.
In the invention, the deep state space model is a two-stage system comprising state-to-state transitions and state-to-observation transitions. State-to-state transition: this stage updates the system state through the system function f1(·), generating the multi-level information of the future pose and updating the other multi-level information of the previous poses, which is learned automatically by the deep network. State-to-observation transition: at this stage the observation is computed from the current state of the system through the system function f2(·); this part is learned automatically by the decoder. Further, since the current position and velocity of the human body determine the position of the human body at the next time step, the future pose is computed by equation (5):
p̂(t+1)=p̂(t)+v̂(t+1) (5)
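For illustration only, the recursive state-to-observation rollout described above can be sketched as follows; predict_velocity is a hypothetical stand-in for the learned transition functions f1/f2 and is not an API defined by the patent:

```python
def recursive_rollout(p_last, initial_state, predict_velocity, T2):
    """Roll out T2 future poses starting from the last observed pose p_last.

    predict_velocity(state) is assumed to return (v_next, next_state): the
    predicted velocity for the next frame and the updated system state.
    """
    poses, p, state = [], p_last, initial_state
    for _ in range(T2):
        v_next, state = predict_velocity(state)  # state-to-state transition (Eq. (7))
        p = p + v_next                            # Eq. (5): p(t+1) = p(t) + v(t+1)
        poses.append(p)                           # part of the observation O(t) (Eq. (8))
    return poses
```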
the network structure of the deep state space model of the present invention is shown in fig. 2, and includes state initialization, state transition and loss functions. The goal of state initialization is to initialize the state of the system to the kinematic law of the input gesture sequence, and state transition is to update the system state and generate the future gesture sequence.
The backbone layer of the deep state space model is a densely connected convolution module (DCCM). As shown in Fig. 3, the DCCM consists mainly of 5 convolutional layers. In each convolutional layer, the input fuses the features from all previous layers through a 1×1 convolutional layer, and the 1×1 convolution in the residual connection can learn joint-level features. The dense residual connections in the DCCM therefore allow the network to gradually obtain enhanced features by fusing shallow joint-level features, as shown in equation (6):
Ml=Hb([g0(M0),g1(M1),…,gl-2(Ml-2),Ml-1]) (6)
where Ml (l = 1, 2, …, 5) denotes the output feature map of the l-th layer; Hb(·) denotes the feature fusion layer, composed of a 1×1 convolutional layer and a concatenation operation along the channel dimension; g(·) denotes a 1×1 convolutional layer followed by a Leaky ReLU activation layer.
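A rough PyTorch sketch of such a densely connected convolution module is given below (not the patent's implementation); the 3×3 kernel size of the main convolutions and the channel count are assumptions, since the text only fixes the number of layers and the 1×1 fusion convolutions:

```python
import torch
import torch.nn as nn

class DCCM(nn.Module):
    """Densely connected convolution module: 5 conv layers, each fusing the
    feature maps of all previous layers through a 1x1 convolution (Eq. (6))."""

    def __init__(self, channels=64, num_layers=5):
        super().__init__()
        self.convs = nn.ModuleList(
            [nn.Conv2d(channels, channels, kernel_size=3, padding=1)  # main convs (3x3 assumed)
             for _ in range(num_layers)])
        self.fuse = nn.ModuleList(
            [nn.Conv2d(channels * (l + 1), channels, kernel_size=1)   # H_b / g(.): 1x1 fusion convs
             for l in range(num_layers)])
        self.act = nn.LeakyReLU(0.2)

    def forward(self, x):
        feats = [x]                                          # M_0
        for conv, fuse in zip(self.convs, self.fuse):
            fused = self.act(fuse(torch.cat(feats, dim=1)))  # concat along channels + 1x1 conv
            feats.append(self.act(conv(fused)))              # M_l
        return feats[-1]                                     # M_5
```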
In the present invention, the state of the deep state space model is initialized as follows. As shown in the left part of Fig. 2, based on the above skeleton representation, a multi-branch network is constructed from DCCMs to encode the coordinate-level and joint-level motion dynamics of the input sequence in position space and velocity space, respectively; it consists of a pose branch, a velocity branch and a fusion module. Pose branch: S1x, S1y and S1z are fed as the inputs of three sub-branches, each consisting of 2 DCCMs, so that the network captures coordinate-level features; a DCCM module then fuses the coordinate-level features to obtain joint-level features. All sub-branches share weights to reduce model complexity, which also allows the associations between different coordinate axes to be learned implicitly. Velocity branch: V1x, V1y and V1z are fed as the inputs of the corresponding sub-branches to gradually capture the multi-level features of the input sequence in velocity space. Fusion module: the dynamics information captured by the pose branch and the velocity branch is fused; the fusion module is built from a concatenation operation along the channel dimension followed by a convolutional layer and a Leaky ReLU activation layer. The output of the multi-branch network describes the dynamics of the input sequence and, as shown in Fig. 2, is denoted h(0); Fb(0) is initialized to h(0), and the initial state of the system I(0) is set to {Fb(0)}.
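As a hedged sketch of how such a multi-branch state-initialization network could be assembled (reusing the DCCM class and imports from the previous sketch; channel sizes, the per-axis input lifting, and the way the three axis features are merged are assumptions of this sketch, not claims of the patent):

```python
class Branch(nn.Module):
    """One branch (pose or velocity): weight-shared x/y/z sub-branches of 2 DCCMs
    plus a fusion step producing joint-level features."""

    def __init__(self, channels=64):
        super().__init__()
        self.lift = nn.Conv2d(1, channels, kernel_size=3, padding=1)  # per-axis 2D map -> feature map
        self.sub = nn.Sequential(DCCM(channels), DCCM(channels))      # shared by the x/y/z sub-branches
        self.joint_fuse = nn.Sequential(
            nn.Conv2d(3 * channels, channels, kernel_size=1), DCCM(channels))

    def forward(self, axis_maps):  # list of 3 tensors of shape (N, 1, Nj, T1)
        per_axis = [self.sub(self.lift(m)) for m in axis_maps]
        return self.joint_fuse(torch.cat(per_axis, dim=1))

class StateInitEncoder(nn.Module):
    """Pose branch + velocity branch + fusion module; the output plays the role of F_b(0)."""

    def __init__(self, channels=64):
        super().__init__()
        self.pose_branch = Branch(channels)
        self.velocity_branch = Branch(channels)
        self.fusion = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1),
            nn.LeakyReLU(0.2))

    def forward(self, pose_maps, velocity_maps):
        h0 = self.fusion(torch.cat([self.pose_branch(pose_maps),
                                    self.velocity_branch(velocity_maps)], dim=1))
        return h0  # F_b(0); the initial system state I(0) is set to {F_b(0)}
```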
As shown in the right part of Fig. 2, the invention proposes a CNN-based recursive decoder for the deep state space model to implement the state transitions, including state-to-state transitions and state-to-observation transitions. In each decoder step, because the historical information has a relatively complex structure while the current velocity is comparatively simple, more operations are used to model the historical features and fewer operations to model the current velocity information; the two are then fused by an element-wise summation. Finally, a convolutional layer and an FC layer predict the future velocity. The state I(t) and the observation O(t) of the dynamic system are updated by equation (7) and equation (8), respectively:
I(t+1)=f1(I(t),t) (7)
O(t)=f2(I(t),t) (8)
where t = 1, 2, …, T2, and f1(·) and f2(·) are implemented by two learnable mappings. In I(t+1), Fb(t) denotes the input state of the decoder at time t, which captures the multi-order information of the pose sequence at the previous time. As shown in Fig. 2, the intermediate output of the decoder encodes the motion dynamics of the preceding pose sequence. Owing to the chain structure of the decoder, the historical information of early poses gradually fades over time. To memorize the long-term temporal dependencies required for long-term prediction, Fb(t) sparsely aggregates the historical motion dynamics {Fb(i)} of the previous poses through equation (9), obtaining an enhanced feature representation and updating Fb(t):
Fb(t)=Hm({Fb(i)},i≤t) (9)
where Hm(·) denotes a memory module that records the motion dynamics information of the historical poses and consists of a concatenation operation along the channel dimension and two 3×3 convolutional layers.
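For illustration, one decoder step could be organized as below (reusing the DCCM sketch and its imports); the channel counts, the way the velocity is embedded, and the exact wiring of the memory module are assumptions — the text above only fixes that the memory module is a channel-wise concatenation followed by two 3×3 convolutions and that the step ends in a convolutional layer and an FC layer predicting the future velocity:

```python
class DecoderStep(nn.Module):
    """One recursive decoder step: update the state (Eq. (7)) and predict the
    velocity used to form the next observation (Eqs. (5) and (8))."""

    def __init__(self, channels=64, n_joints=22):
        super().__init__()
        self.hist_path = nn.Sequential(DCCM(channels), DCCM(channels))  # "more operations" for history
        self.vel_path = nn.Conv2d(channels, channels, kernel_size=3, padding=1)  # "fewer" for velocity
        self.memory = nn.Sequential(                                    # H_m: concat + two 3x3 convs
            nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1))
        self.head = nn.Sequential(                                      # conv + FC -> future velocity
            nn.Conv2d(channels, 1, kernel_size=3, padding=1),
            nn.Flatten(),
            nn.LazyLinear(n_joints * 3))

    def forward(self, state_feat, velocity_feat, history_feat):
        """state_feat: F_b(t); velocity_feat: features of the current velocity;
        history_feat: sparsely aggregated features of earlier decoder steps."""
        h = self.hist_path(state_feat) + self.vel_path(velocity_feat)   # element-wise summation fusion
        next_state = self.memory(torch.cat([history_feat, h], dim=1))   # Eq. (9)
        v_next = self.head(next_state)                                  # predicted velocity for the next frame
        return v_next, next_state
```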
To model the multi-order information of the future poses and achieve more accurate predictions, as shown in Fig. 2, the deep state space model is optimized with an attention temporal prediction loss function L:
To alleviate the error-accumulation problem in recursive models, an attention temporal prediction loss (ATPL) is introduced, which guides the model toward more accurate predictions by increasing the attention paid to early time steps. The attention temporal prediction loss function L comprises two parts, Lv and Lp, with L = λ1Lv + λ2Lp, where λ1 and λ2 are two hyperparameters that balance Lv and Lp; Lv guides the network to decode the velocity information of the future poses, and Lp encourages the network to recover the future position information.
Lv is defined by equation (10):
Lv = Σt=1..T2 αt · (1/Nj) Σj=1..Nj ‖v̂j(t) − vj(t)‖2 (10)
where Nj denotes the number of joint points; T2 denotes the length of the predicted future pose sequence; vj(t) and v̂j(t) denote the real joint point and the predicted joint point in velocity space, respectively; αt denotes the attention weight at time t, and αt > αt+1 forces the network to pay more attention to early predictions; αt is initialized to 2(T2 − t + 1) and the weights are normalized so that Σt=1..T2 αt = 1.
Lp is defined by equation (11):
Lp = (1/(Nj·T2)) Σt=1..T2 Σj=1..Nj ‖p̂j(t) − pj(t)‖2 (11)
where pj(t) and p̂j(t) denote the real joint point and the predicted joint point in position space, respectively.
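As a hedged sketch (the exact normalization in Eqs. (10)–(11) is not spelled out above, so the averaging below is an assumption), the attention temporal prediction loss can be written as:

```python
import torch

def attention_temporal_prediction_loss(v_pred, v_true, p_pred, p_true,
                                        lambda_v=3.0, lambda_p=1.0):
    """All inputs have shape (T2, Nj, 3): predicted/true velocities and positions.

    Early time steps receive larger attention weights alpha_t = 2*(T2 - t + 1),
    normalized to sum to 1, which pushes the recursive model toward accurate
    early predictions (Eqs. (10) and (11))."""
    T2 = v_pred.shape[0]
    t = torch.arange(1, T2 + 1, dtype=v_pred.dtype, device=v_pred.device)
    alpha = 2.0 * (T2 - t + 1)
    alpha = alpha / alpha.sum()                                        # normalize the weights to 1

    vel_err = torch.linalg.norm(v_pred - v_true, dim=-1).mean(dim=1)  # per-step mean joint velocity error
    pos_err = torch.linalg.norm(p_pred - p_true, dim=-1).mean(dim=1)  # per-step mean joint position error

    L_v = (alpha * vel_err).sum()   # attention-weighted velocity loss, Eq. (10)
    L_p = pos_err.mean()            # MPJPE-style position loss, Eq. (11)
    return lambda_v * L_v + lambda_p * L_p  # L = lambda1*Lv + lambda2*Lp (3:1 in the experiments)
```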
Experimental example:
1.1 Datasets and experimental details
(1) Data set
Human3.6m (H3.6M): H3.6M is the most commonly used data set in human motion prediction problems. The data set consists of 15 actions performed by 7 professional actors, such as walking, eating, smoking, and discussing.
3D Poses in the Wild dataset (3DPW): 3DPW is an in-the-wild dataset with accurate 3D poses covering various human activities such as shopping and sports. The dataset includes 60 pose sequences with more than 51,000 frames.
(2) Details of the experiment
All experimental settings and data processing are kept consistent with the baselines throughout the experiments. The model of the invention is implemented in TensorFlow. The mean per joint position error (MPJPE) in millimetres is used as the evaluation metric. All models are trained with the Adam optimizer, with the learning rate initialized to 0.0001. The hyperparameters are set to λ1:λ2 = 3:1.
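For reference only, the MPJPE metric used here is simply the mean Euclidean distance over joints and frames; a minimal sketch (the array layout is an assumption):

```python
import numpy as np

def mpjpe(pred, true):
    """Mean per joint position error in millimetres.
    pred, true: arrays of shape (T, Nj, 3), coordinates in millimetres."""
    return float(np.linalg.norm(pred - true, axis=-1).mean())
```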
1.2 Comparison with state-of-the-art methods
Baselines: (1) RGRU: built entirely on GRUs, using a residual connection to model the velocity of future poses implicitly. (2) CS2S: a CNN-based feed-forward model that recursively predicts multiple poses. (3) DTraj: the current state-of-the-art 3D human motion prediction method, constructed with the DCT and a GCN.
Experimental results on H3.6M: Table 1 shows the short-term and long-term prediction results on the H3.6M dataset. The method of the invention outperforms all baselines at all time steps for both short-term and long-term prediction, demonstrating the effectiveness of the proposed deep state space model. In particular, the error of the method is significantly lower than that of the RNN baseline RGRU. A likely reason is that the method of the invention models the position and velocity of the human body directly as observations of the deep state space model, whereas RGRU models velocity only implicitly as an internal state of the system through a residual connection in the decoder, and the GRU-based modeling ignores part of the spatial features between human joints. Compared with CS2S and DTraj, the model of the invention also achieves the best results in most cases. This benefits from two main advantages of the model: (1) the method explicitly models the position and velocity of the human pose as observations of the proposed deep state space model, whereas CS2S and DTraj rely on the residual connection between the input and the output of the decoder and therefore have relatively limited velocity-modeling capability, so the method of the invention captures the dynamics of human motion better through velocity learning; (2) the recursive decoder of the invention enables the network to make full use of early predictions to achieve more accurate predictions, whereas DTraj predicts future poses in a non-recursive manner.
Table 1. Short-term and long-term prediction results on H3.6M ("ms" denotes milliseconds). [The table values are provided as an image in the original document.]
Figs. 4 and 5 provide qualitative results on H3.6M. Fig. 4(a) shows a walking motion and Fig. 4(b) an eating motion; the black poses represent the ground-truth pose sequence, the blue poses the results of the DTraj method, and the red poses the results of the method of the invention. Fig. 5(a) shows a sitting motion and Fig. 5(b) a photo-taking motion, with the same color convention. Compared with DTraj, the method of the invention achieves the best visual quality in both short-term and long-term prediction, again demonstrating its effectiveness. Specifically, the results of the method are better than those of DTraj for the left hand of the predicted poses in Fig. 4, the upper limbs of the poses in Fig. 5(a), and the right hand of the poses in Fig. 5(b). The main reasons are twofold: (1) the superior performance benefits from the proposed deep state space model, which fully combines the advantages of deep networks and state space models; (2) the deep state space model combines position and velocity as observations and designs a new temporal loss function (the attention temporal prediction loss) to guide the model toward accurate early predictions. In contrast, DTraj ignores velocity modeling and does not fully exploit the results of early predictions. The deep state space model of the invention therefore captures motion dynamics better for accurate prediction.
Experimental results on the 3DPW dataset: Table 2 shows the short-term and long-term prediction results on the 3DPW dataset. Overall, the method of the invention outperforms the baselines at all time steps for both short-term and long-term prediction, further validating the proposed deep state space model in modeling motion dynamics for accurate prediction.
Table 2. Short-term and long-term prediction results on 3DPW. [The table values are provided as an image in the original document.]
1.3 Ablation analysis
Ablation experiments are carried out to verify the effectiveness of several important components of the proposed deep state space model, with the corresponding results shown in Table 3. Table 3 reports the ablation results on H3.6M, where "ms" denotes milliseconds, "xyz" denotes coordinate-level feature modeling of the pose sequence, "pb" denotes the pose branch, "vb" denotes the velocity branch, and "x" denotes removal of the corresponding module.
Table 3. Ablation results on H3.6M. [The table values are provided as an image in the original document.]
(1) Multi-order information: a) The "#2", "#3" and "#9" groups of experiments demonstrate the effectiveness of modeling multi-order information of the input sequence. Compared with "#9", the errors of the "#2" and "#3" groups increase at all time steps, indicating that modeling multi-order information with velocity and position is effective for capturing the motion dynamics. Without modeling the pose velocity, the prediction error of the "#3" group increases significantly at early time steps, illustrating the importance of velocity modeling for short-term prediction. b) The "#4", "#5" and "#9" groups of experiments verify the effectiveness of modeling the multi-order information of future poses. Similarly, the prediction errors of the "#4" and "#5" groups are larger than that of the "#9" group at all time steps. Combining Lp and Lv guides the network to model the multi-order information of future poses and thereby obtain better prediction performance. Furthermore, the error of the "#5" group increases significantly, illustrating the importance of modeling future pose velocity information. c) The "#7", "#8" and "#9" groups of experiments further validate the effectiveness of combining position and velocity as the observations of the deep state space model. Omitting either the position or the velocity information of the human pose degrades the performance of the system, especially omitting velocity modeling, which shows the importance of velocity learning. In summary, by combining the position and velocity of the human body, the deep model learns a robust motion dynamics representation for the internal state of the deep state space model, further improving the prediction accuracy of the system.
(2) Multi-level features: comparing the prediction errors of the "#1" and "#9" groups of experiments, the error of the "#1" group increases at all time steps, illustrating the importance of modeling the motion dynamics at both the coordinate level and the joint level.
(3) Attention temporal prediction loss function (ATPL): compared with the "#6" group of experiments, the error of the "#9" group is significantly reduced, especially at later time steps. ATPL assigns higher attention weights to early time steps and can guide the network to make more accurate predictions at those steps, thereby reducing the error accumulation problem of the recursive model and further improving the overall performance of the system.
The invention exploits the advantages of deep representations and state space models to model the human motion system as a deep state space model, provides a unified description for various human motion systems, and allows existing models to be analyzed within this framework. The invention further provides an end-to-end feed-forward network to build the model, jointly realizing the state initialization and the state transitions of the system. Moreover, the proposed attention temporal prediction loss effectively guides the recursive model toward more accurate predictions by increasing the attention paid to early predictions. In addition, the deep state space model is evaluated on two challenging datasets (H3.6M and 3DPW) and achieves state-of-the-art performance. The experiments also show that exploiting the coordinate-level features of human motion can further improve the performance of the system.
The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. The device disclosed by the embodiment corresponds to the method disclosed by the embodiment, so that the description is simple, and the relevant points can be referred to the method part for description.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A 3D human motion prediction method based on a deep state space model, characterized in that the position and velocity of human motion are taken as observations, a deep network extracts the motion dynamics of the historical pose sequence to initialize the state of the state space model, the human motion system is modeled as a deep state space model, and multiple future poses of the 3D human motion are predicted recursively through state-to-observation transitions.
2. The deep state space model-based 3D human motion prediction method of claim 1, wherein the future pose is expressed as a transition of the model observations.
3. The 3D human motion prediction method based on the deep state space model of claim 2, wherein the pose sequence is represented as follows:
Given an input sequence S1 = {p(t0)} (t0 = −(T1−1), …, −1, 0) of length T1, where p(t0) denotes the pose at time t0 in the pose sequence S1, the velocity of the input sequence S1 is defined as V1 = {v(t0)}, where v(t0) = p(t0) − p(t0−1) and v(−(T1−1)) = 0; the input sequence S1 is expressed as three 2D tensors S1x, S1y and S1z, which represent the motion trajectories of S1 along the x, y and z axes, respectively; in velocity space, V1 is expressed as three 2D tensors V1x, V1y and V1z, representing the velocity information along the x, y and z axes, respectively.
4. The 3D human motion prediction method based on the deep state space model according to claim 3, wherein the human motion system is represented by a state space model of a dynamic system, as shown in equation (1) and equation (2):
I(t+1)=f1(I(t),t)+a(t) (1)
O(t)=f2(I(t),t)+b(t) (2)
where I(t) and O(t) denote the state and the observation at time t, respectively; a(t) and b(t) are the process noise and the measurement noise at time t, respectively; f1(·) and f2(·) are system functions.
5. The method of claim 4, wherein the position and velocity of human motion are taken as observations, the motion dynamics of the historical pose sequence initialize the state of the state space model, and a(t) and b(t) are both initialized to 0; the future pose sequence corresponding to S1 is defined as S2 = {p̂(t)} (t = 1, 2, …, T2), where S2 has length T2 and its corresponding velocity is V2 = {v̂(t)}; p̂(t) denotes the pose at time t in S2 and v̂(t) denotes the velocity at time t; the state I(t) and the observation O(t) are therefore redefined as equation (3) and equation (4), respectively:
I(t)={O(t-1),Fb(t-1)} (3)
O(t)={p̂(t),v̂(t)} (4)
where O(0) is initialized to {p(0), 0}, and Fb(t−1) denotes the other multi-level information of the historical poses at time t−1.
6. The 3D human motion prediction method based on the deep state space model according to claim 5, wherein the deep state space model is a two-stage system comprising state-to-state transitions and state-to-observation transitions; state-to-state transition: this stage updates the system state through the system function f1(·), generating the multi-level information of the future pose and updating the other multi-level information of the previous poses, which is learned automatically by the deep network; state-to-observation transition: at this stage the observation is computed from the current state of the system through the system function f2(·), and this part is learned automatically by the decoder; further, since the current position and velocity of the human body determine the position of the human body at the next time step, the future pose is computed by equation (5):
p̂(t+1)=p̂(t)+v̂(t+1) (5)
7. The 3D human motion prediction method based on the deep state space model according to claim 6, wherein the backbone layer of the deep state space model is a densely connected convolution module (DCCM), which consists mainly of 5 convolutional layers; in each convolutional layer, the input fuses the features from all previous layers through a 1×1 convolutional layer, and the 1×1 convolution in the residual connection can learn joint-level features; the dense residual connections in the DCCM therefore allow the network to gradually obtain enhanced features by fusing shallow joint-level features, as shown in equation (6):
Ml=Hb([g0(M0),g1(M1),…,gl-2(Ml-2),Ml-1]) (6)
where Ml (l = 1, 2, …, 5) denotes the output feature map of the l-th layer; Hb(·) denotes the feature fusion layer, composed of a 1×1 convolutional layer and a concatenation operation along the channel dimension; g(·) denotes a 1×1 convolutional layer followed by a Leaky ReLU activation layer.
8. The 3D human motion prediction method based on the deep state space model according to claim 7, wherein the state of the deep state space model is initialized as follows: a multi-branch network is constructed from DCCMs to encode the coordinate-level and joint-level motion dynamics of the input sequence in position space and velocity space, respectively, comprising a pose branch, a velocity branch and a fusion module; pose branch: S1x, S1y and S1z are fed as the inputs of three sub-branches, each consisting of 2 DCCMs, so that the network captures coordinate-level features, and a DCCM module then fuses the coordinate-level features to obtain joint-level features; velocity branch: V1x, V1y and V1z are fed as the inputs of the corresponding sub-branches to gradually capture the multi-level features of the input sequence in velocity space; fusion module: the dynamics information captured by the pose branch and the velocity branch is fused, the fusion module being built from a concatenation operation along the channel dimension followed by a convolutional layer and a Leaky ReLU activation layer; the output of the multi-branch network describes the dynamics of the input sequence and is denoted h(0); Fb(0) is initialized to h(0), and the initial state of the system I(0) is set to {Fb(0)}.
9. The deep state space model-based 3D human motion prediction method of claim 8, wherein the deep state space model implements the state transitions, including state-to-state transitions and state-to-observation transitions, with a CNN-based recursive decoder; more operations are used to model the historical features and fewer operations to model the current velocity information, the two are then fused by an element-wise summation, and finally a convolutional layer and an FC layer predict the future velocity; the state I(t) and the observation O(t) of the dynamic system are updated by equation (7) and equation (8), respectively:
I(t+1)=f1(I(t),t) (7)
O(t)=f2(I(t),t) (8)
where t = 1, 2, …, T2, and f1(·) and f2(·) are implemented by two learnable mappings; in I(t+1), Fb(t) denotes the input state of the decoder at time t, which captures the multi-order information of the pose sequence at the previous time; Fb(t) sparsely aggregates the historical motion dynamics {Fb(i)} of the previous poses through equation (9) to obtain an enhanced feature representation and update Fb(t):
Fb(t)=Hm({Fb(i)},i≤t) (9)
where Hm(·) denotes a memory module that records the motion dynamics information of the historical poses.
10. The 3D human motion prediction method based on the deep state space model according to claim 9, wherein the deep state space model is optimized with an attention temporal prediction loss function L:
the attention temporal prediction loss function L comprises two parts, Lv and Lp, with L = λ1Lv + λ2Lp, where λ1 and λ2 are two hyperparameters that balance Lv and Lp; Lv guides the network to decode the velocity information of the future poses, and Lp encourages the network to recover the future position information;
Lv is defined by equation (10):
Lv = Σt=1..T2 αt · (1/Nj) Σj=1..Nj ‖v̂j(t) − vj(t)‖2 (10)
where Nj denotes the number of joint points; T2 denotes the length of the predicted future pose sequence; vj(t) and v̂j(t) denote the real joint point and the predicted joint point in velocity space, respectively; αt denotes the attention weight at time t, and αt > αt+1 forces the network to pay more attention to early predictions; αt is initialized to 2(T2 − t + 1) and the weights are normalized so that Σt=1..T2 αt = 1;
Lp is defined by equation (11):
Lp = (1/(Nj·T2)) Σt=1..T2 Σj=1..Nj ‖p̂j(t) − pj(t)‖2 (11)
where pj(t) and p̂j(t) denote the real joint point and the predicted joint point in position space, respectively.
CN202011500519.8A 2020-12-17 2020-12-17 3D human motion prediction method based on depth state space model Pending CN112581499A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011500519.8A CN112581499A (en) 2020-12-17 2020-12-17 3D human motion prediction method based on depth state space model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011500519.8A CN112581499A (en) 2020-12-17 2020-12-17 3D human motion prediction method based on depth state space model

Publications (1)

Publication Number Publication Date
CN112581499A true CN112581499A (en) 2021-03-30

Family

ID=75136003

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011500519.8A Pending CN112581499A (en) 2020-12-17 2020-12-17 3D human motion prediction method based on depth state space model

Country Status (1)

Country Link
CN (1) CN112581499A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113887419A (en) * 2021-09-30 2022-01-04 Sichuan University Human behavior identification method and system based on video temporal-spatial information extraction
WO2023082492A1 (en) * 2021-11-09 2023-05-19 中国民航大学 Path expansion and passage method for manned robot in terminal building

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110188598A (en) * 2019-04-13 2019-08-30 Dalian University of Technology A kind of real-time hand Attitude estimation method based on MobileNet-v2
US20190279383A1 (en) * 2016-09-15 2019-09-12 Google Llc Image depth prediction neural networks
CN110826502A (en) * 2019-11-08 2020-02-21 Beijing University of Posts and Telecommunications Three-dimensional attitude prediction method based on pseudo image sequence evolution
CN111199216A (en) * 2020-01-07 2020-05-26 Shanghai Jiao Tong University Motion prediction method and system for human skeleton

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190279383A1 (en) * 2016-09-15 2019-09-12 Google Llc Image depth prediction neural networks
CN110188598A (en) * 2019-04-13 2019-08-30 Dalian University of Technology A kind of real-time hand Attitude estimation method based on MobileNet-v2
CN110826502A (en) * 2019-11-08 2020-02-21 Beijing University of Posts and Telecommunications Three-dimensional attitude prediction method based on pseudo image sequence evolution
CN111199216A (en) * 2020-01-07 2020-05-26 Shanghai Jiao Tong University Motion prediction method and system for human skeleton

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
XIAOLI LIU et al.: "DeepSSM: Deep State-Space Model for 3D Human Motion Prediction", https://arxiv.org/abs/2005.12155v3 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113887419A (en) * 2021-09-30 2022-01-04 Sichuan University Human behavior identification method and system based on video temporal-spatial information extraction
CN113887419B (en) * 2021-09-30 2023-05-12 Sichuan University Human behavior recognition method and system based on extracted video space-time information
WO2023082492A1 (en) * 2021-11-09 2023-05-19 中国民航大学 Path expansion and passage method for manned robot in terminal building

Similar Documents

Publication Publication Date Title
Rahmatizadeh et al. Vision-based multi-task manipulation for inexpensive robots using end-to-end learning from demonstration
Ebert et al. Self-Supervised Visual Planning with Temporal Skip Connections.
CN111079561B (en) Robot intelligent grabbing method based on virtual training
JP6861249B2 (en) How to Train a Convolutional Recurrent Neural Network, and How to Semantic Segmentation of Input Video Using a Trained Convolutional Recurrent Neural Network
Wang et al. A survey of learning‐based robot motion planning
CN110991027A (en) Robot simulation learning method based on virtual scene training
CN105095862A (en) Human gesture recognizing method based on depth convolution condition random field
CN110490035A (en) Human skeleton action identification method, system and medium
CN107480704A (en) It is a kind of that there is the real-time vision method for tracking target for blocking perception mechanism
Wang et al. Modeling motion patterns of dynamic objects by IOHMM
CN112581499A (en) 3D human motion prediction method based on depth state space model
CN109344992B (en) Modeling method for user control behavior habits of smart home integrating time-space factors
CN110014428B (en) Sequential logic task planning method based on reinforcement learning
CN115829171B (en) Pedestrian track prediction method combining space-time information and social interaction characteristics
Zhang et al. Explainable hierarchical imitation learning for robotic drink pouring
Valarezo Anazco et al. Natural object manipulation using anthropomorphic robotic hand through deep reinforcement learning and deep grasping probability network
CN111274438A (en) Language description guided video time sequence positioning method
Hafez et al. Improving robot dual-system motor learning with intrinsically motivated meta-control and latent-space experience imagination
CN113894780B (en) Multi-robot cooperation countermeasure method, device, electronic equipment and storage medium
CN115376103A (en) Pedestrian trajectory prediction method based on space-time diagram attention network
Hafez et al. Efficient intrinsically motivated robotic grasping with learning-adaptive imagination in latent space
Mavsar et al. Simulation-aided handover prediction from video using recurrent image-to-motion networks
CN115990875B (en) Flexible cable state prediction and control system based on hidden space interpolation
Liu et al. Deepssm: Deep state-space model for 3d human motion prediction
Maestre et al. Bootstrapping interactions with objects from raw sensorimotor data: a Novelty Search based approach

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210330