CN112581499A

CN112581499A - 3D human motion prediction method based on depth state space model

Info

Publication number: CN112581499A
Application number: CN202011500519.8A
Authority: CN
Inventors: 刘小丽; 尹建芹; 刘金; 党永浩
Original assignee: Beijing University of Posts and Telecommunications
Current assignee: Beijing University of Posts and Telecommunications
Priority date: 2020-12-17
Filing date: 2020-12-17
Publication date: 2021-03-30

Abstract

The invention discloses a 3D human body motion prediction method based on a depth state space model, which comprises the steps of firstly taking the position and the speed of human body motion as observation, adopting a motion dynamics rule of a depth network extraction historical posture sequence to initialize the state of the state space model, and recursively predicting a plurality of future postures of the 3D human body motion through state to observation transfer. The invention utilizes the advantages of the depth network and the state space model to model the human motion system into the depth state space model, provides a uniform description for various human motion systems, and can analyze the existing model.

Description

3D human motion prediction method based on depth state space model

Technical Field

The invention belongs to the technical field of human motion prediction, and particularly relates to a depth state space model-based 3D human motion prediction method.

Background

Human awareness and interaction with the real world relies on the ability to predict changes in the surrounding environment over time. Also, intelligent robots that interact with humans must have the ability to predict future human dynamics,thereby enabling the robot to respond to changes of the human rapidly. Most of the existing prediction models generally only adopt L₂Or MPJPE (mean per joint point position error) optimizes the prediction model for all frames of future poses or all joint losses. Since the early time penalty is less than the late time penalty, these models implicitly focus on the prediction of the late time, ignoring this hidden relationship between the early and late times, i.e., in recursive models, early predictions tend to affect the prediction of the late time. Therefore, none of these models can achieve relatively accurate predictions, particularly in recursive predictive models.

Therefore, how to provide an accurate 3D human motion prediction method becomes a problem that needs to be solved urgently by those skilled in the art.

Disclosure of Invention

In view of this, the invention provides a depth state space model-based 3D human motion prediction method, which models a human motion system into a depth state space model by using the advantages of a depth network and a state space model, provides a uniform description for various human motion systems, and can analyze the existing model.

In order to achieve the purpose, the invention adopts the following technical scheme:

A3D human motion prediction method based on a depth state space model is characterized in that the position and the speed of human motion are used as observation, the motion dynamics law of a historical attitude sequence is extracted by a depth network to initialize the state of the state space model, a human motion system is modeled as the depth state space model, and a plurality of future attitudes of the 3D human motion are predicted recursively through state-to-observation transfer.

Preferably, the future pose is represented as a transition of the model observations.

Preferably, the method for representing the gesture sequence is as follows:

given a length T₁Is input sequence S₁＝{p(t₀)}(t＝-(T₁-1), …, -1,0), wherein p (t)₀) Representing a sequence of poses S₁Middle t₀The attitude at that moment; input sequence S₁Is defined as V₁＝{v(t₀) Where v (t)₀)＝p(t₀)-p(t₀-1)，v(-(T₁-1)) {0 }; will input a sequence S₁Expressed as 3 2D tensors: s_1x,S_1yAnd S_1zRespectively represent a sequence S₁A motion trajectory along x, y, z axes; in the velocity space, V₁Expressed as 3 2D tensors: v_1x，V_1yAnd V_1zRepresenting velocity information along coordinate axes x, y and z, respectively.

Preferably, the human motion system is represented by a state space model of a dynamic system, as shown in formula (1) and formula (2):

I(t+1)＝f₁(I(t),t)+a(t) (1)

O(t)＝f₂(I(t),t)+b(t) (2)

wherein I (t) and O (t) represent the state and observation at time t, respectively; a (t) and b (t) are respectively process noise and measurement noise at time t; f. of₁(. and f)₂(. cndot.) is a system function.

Preferably, the position and the speed of the human motion are used as observation, the motion dynamics rule of the historical attitude sequence is initialized to the state of the state space model, and a (t) and b (t) are respectively initialized to 0; sequence S₁The corresponding future pose is defined as:

wherein S is₂Has a length of T₂Corresponding to a speed of

Represents the sequence S₂The posture at the middle time t is,

represents the velocity at time t; therefore, the temperature of the molten metal is controlled,the states I (t) and observations O (t) are redefined as equation (3) and equation (4), respectively:

where O (0) is initialized to { p (0),0},

other multi-level information representing historical poses at time t-1.

Preferably, the deep state space model is a two-stage system comprising state-state transitions and state-observation transitions; wherein, state-state transition: this stage is through the system function f₁(-) update the system state, generate multi-level information for future poses, and update other multi-level information for previous poses, which are automatically learned by the deep network; state-observed transitions: the observed value at this stage passes through the system function f₂() performing calculations based on the current state of the system, this part being done by the decoder auto-learning; further, since the current position and velocity of the human body determine the position of the human body at the next time, the future posture is calculated by equation (5):

preferably, the trunk layer of the depth state space model is a Densely Connected Convolution Module (DCCM), and the densely connected convolution module mainly comprises 5 convolution layers; in each convolutional layer, the input fuses features from all previous layers through the 1 × 1 convolutional layer; wherein 1 × 1 convolution in residual concatenation can learn joint level features; therefore, the dense residual connection in the dense connection convolution module allows the network to gradually acquire enhanced features by fusing the shallow joint level features, as shown in equation (6):

M_l＝H_b([g₀(M₀),g₁(M₁),…,g_l-2(M_l-2),M_l-1]) (6)

wherein M is_l(1, 2, …,5) represents the output characteristic map of the l-th layer; h_b(. to) a feature fusion layer, consisting of a 1 × 1 convolutional layer and a connection operation along the channel; g (-) represents 1 × 1 convolutional layer and Leaky ReLU activation function layer.

Preferably, the method for initializing the state of the deep state space model comprises the following steps: constructing a multi-branch network by using DCCMs, coding the motion dynamics of the coordinate level and the joint level of an input sequence in a position space and a speed space respectively, and forming a posture branch, a speed branch and a fusion module; wherein, the posture is branched: s_1x、S_1yAnd S_1zAs input to each sub-branch consisting of 2 DCCMs, respectively, so that the network can capture the features at the coordinate level; fusing coordinate level features by using a DCCM module to obtain joint level features; speed branch: v_1x、V_1yAnd V_1zRespectively as the input of each subbranch to gradually capture the multi-level characteristics of the input sequence in the speed space; a fusion module: fusing the kinetic information captured by the attitude branch and the speed branch; the fusion module is constructed by connecting operation along a channel, and adding a convolution layer and a LeakyReLU activation function layer; the output of the multi-branch network describes the kinetic characteristics of the input sequence, labeled h (0), and F_b(0) Initialized to h (0), and the initial state of system I (0) is set to { F_b(0)}。

Preferably, the deep state space model is based on a CNN recursive decoder to implement state transitions, including state-state transitions and state-observation transitions; firstly, more operation modeling historical characteristics are adopted, less operation modeling current speed information is adopted, and then the information is fused through element summation operation; finally, the other is the convolution layer and the FC layer to predict the future speed; the state of the dynamic system, i (t), and the observation o (t), are updated by equation (7) and equation (8), respectively:

I(t+1)＝f₁(I(t),t) (7)

O(t)＝f₂(I(t),t) (8)

where T is 1,2, …, T₂，f₁(. and f)₂() use two learnable mapping systems; in I (t +1)

Representing an input state of a decoder at time t for capturing multi-order information of a pose sequence at a previous time;

sparsely aggregating historical motion dynamics of previous poses by equation (9)

Obtaining enhanced feature representations and updating

Wherein H_mAnd the (, which represents a memory module, records the motion dynamics information of the historical postures.

Preferably, the depth state space model is optimized by using an attention time prediction loss function L:

note that the temporal prediction loss function L includes two parts: l is_vAnd L_p，L＝λ₁L_v+λ₂L_pWherein λ is₁And λ₂Is to balance L_vAnd L_pTwo hyper-parameters of (a); l is_vDirecting the network to decode speed information for future poses; l is_pEncouraging the network to recover future location information;

L_vdefined by formula (10):

wherein N is_jRepresenting the number of joint points; t is₂Representing the length of the predicted future pose sequence;

and

respectively representing a real joint point and a predicted joint point in a velocity space; alpha is alpha_tDenote the attention weight at time t, let α_t>α_t+1Forcing the network to focus more on early prediction, α_tIs initialized to 2 (T)₂-t +1) by

Normalizing the weight value to 1;

L_pdefined by formula (11):

wherein the content of the first and second substances,

and

representing the real and predicted joint points in the location space, respectively.

The invention has the beneficial effects that:

the invention utilizes the advantages of depth representation and state space model to model the human motion system into a depth state space model, provides a uniform description for various human motion systems, and can analyze the existing model; the invention provides an end-to-end feedforward network to establish the model, and jointly realizes the state initialization and the state transition of the system; furthermore, the proposed loss of attention to temporal prediction can effectively guide our recursive model to achieve more accurate predictions.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.

Fig. 1 is a diagram illustrating human motion prediction.

Fig. 2 is a network structure diagram of a deep state space model.

Fig. 3 is a network structure diagram of the DCCM.

FIG. 4 is a qualitative result graph of short-term prediction of H3.6M.

FIG. 5 is a qualitative result chart of long-term prediction of H3.6M.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

The invention provides a 3D human body motion prediction method based on a depth state space model, which comprises the steps of firstly taking the position and the speed of human body motion as observation, adopting a motion dynamics rule of a depth network extraction historical attitude sequence to initialize the state of the state space model, modeling a human body motion system into the depth state space model, recursively predicting a plurality of future attitudes of the 3D human body motion through state to observation transfer, and expressing the future attitudes as model observation transfer.

The method for representing the attitude sequence comprises the following steps:

given a length T₁Is input sequence S₁＝{p(t₀)}(t＝-(T₁-1), …, -1,0), wherein p (t)₀) Representing a sequence of poses S₁Middle t₀The attitude at that moment; input sequence S₁Is defined as V₁＝{v(t₀) Where v (t)₀)＝p(t₀)-p(t₀-1)，v(-(T₁-1)) {0 }; the invention introduces a new skeleton representation, and the input skeleton point sequence is simultaneously represented in a speed space and a position space, so that the dynamic characteristics of the skeleton point sequence can be better captured. Fig. 1(a) is a diagram illustrating human body movement prediction. Fig. 1(b) is a joint trajectory along each coordinate axis (i.e., x, y, and z), in which the horizontal axis represents "time", the vertical axis represents "joint point", one curve represents a movement trajectory of one joint point along one coordinate axis, and the movement trajectories of the joint points on the same limb are marked with the same color.

As shown in FIG. 1(b), since the motion trajectories of the joint points along different coordinate axes have large differences, the present invention inputs the sequence S₁Expressed as 3 2D tensors: s_1x,S_1yAnd S_1zRespectively represent a sequence S₁Motion trajectory along x, y, z axes. Similarly, in the velocity space, V will be₁Expressed as 3 2D tensors: v_1x，V_1yAnd V_1zRepresenting velocity information along coordinate axes x, y and z, respectively. In the skeletal representation of the present invention, the width represents the image frame and the height represents the joint point. Wherein the order of the joint points is consistent with the skeleton representation in the invention; the joint points of the same limb are placed at adjacent positions, so that the local characteristics of the human body structure can be conveniently captured.

In the invention, the human motion system is a typical dynamic system, and can be represented by a state space model of the dynamic system, as shown in formula (1) and formula (2):

I(t+1)＝f₁(I(t),t)+a(t) (1)

O(t)＝f₂(I(t),t)+b(t) (2)

In the invention, the position and the speed of human motion are used as observation, the motion dynamics rule of the historical attitude sequence is initialized to the state of a state space model, and a (t) and b (t) are respectively initialized to 0; sequence S₁The corresponding future pose is defined as:

wherein S is₂Has a length of T₂Corresponding to a speed of

Represents the sequence S₂The posture at the middle time t is,

represents the velocity at time t; thus, the states I (t) and observations O (t) are redefined as equation (3) and equation (4), respectively:

where O (0) is initialized to { p (0),0},

other multi-level information representing historical poses at time t-1.

In the invention, the depth state space model is a two-stage system and comprises state-state transition and state-observation transition; wherein, state-state transition: this is oneThe phases are passed through a system function f₁(-) update the system state, generate multi-level information for future poses, and update other multi-level information for previous poses, which are automatically learned by the deep network; state-observed transitions: the observed value at this stage passes through the system function f₂() performing calculations based on the current state of the system, this part being done by the decoder auto-learning; further, since the current position and velocity of the human body determine the position of the human body at the next time, the future posture is calculated by equation (5):

the network structure of the deep state space model of the present invention is shown in fig. 2, and includes state initialization, state transition and loss functions. The goal of state initialization is to initialize the state of the system to the kinematic law of the input gesture sequence, and state transition is to update the system state and generate the future gesture sequence.

The trunk layer of the depth state space model is a Dense Connected Convolution Module (DCCM), as shown in fig. 3, the dense Connected convolution Module mainly consists of 5 Convolutional layers; in each convolutional layer, the input fuses features from all previous layers through the 1 × 1 convolutional layer; wherein 1 × 1 convolution in residual concatenation can learn joint level features; therefore, the dense residual connection in the dense connection convolution module allows the network to gradually acquire enhanced features by fusing the shallow joint level features, as shown in equation (6):

M_l＝H_b([g₀(M₀),g₁(M₁),…,g_l-2(M_l-2),M_l-1]) (6)

wherein M is_l(1, 2, …,5) represents the output characteristic map of the l-th layer; h_b(. to) a feature fusion layer, consisting of a 1 × 1 convolutional layer and a connection operation along the channel; g (-) represents 1 × 1 convolutional layer and LeakyReLU activation function layer.

In the present invention, the depth isThe method for initializing the state of the state space model comprises the following steps: as shown in the left diagram of fig. 2, based on the above skeleton representation, a multi-branch network is constructed by DCCMs, and the kinematic dynamics of the coordinate level and the joint level of the input sequence are encoded in the position space and the velocity space, respectively, and are composed of a posture branch, a velocity branch and a fusion module; wherein, attitude branch (position branch): s_1x、S_1yAnd S_1zAs input to each sub-branch consisting of 2 DCCMs, respectively, so that the network can capture the features at the coordinate level; and a DCCM module is utilized to fuse coordinate level features to obtain joint level features, all branches are weight-shared to reduce the complexity of the model, and meanwhile, the association between different coordinate axes can be implicitly learned. Speed branch (velocity branch): v_1x、V_1yAnd V_1zRespectively as the input of each subbranch to gradually capture the multi-level characteristics of the input sequence in the speed space; fusion module (fusion module): fusing the kinetic information captured by the attitude branch and the speed branch; the fusion module is constructed by connecting operation along a channel, and adding a convolution layer and a LeakyReLU activation function layer; the output of the multi-branch network describes the kinetic characteristics of the input sequence, labeled h (0), as shown in FIG. 2, and F_b(0) Initialized to h (0), and the initial state of system I (0) is set to { F_b(0)}。

As shown in the right diagram of fig. 2, the present invention proposes a CNN-based recursive decoder of a deep state space model to implement state transitions, including state-state transitions and state-observation transitions; for each decoder, because the historical information is used as a relatively complex structure, and the current speed is relatively simple, more operation modeling historical characteristics are adopted, less operation modeling current speed information is adopted, and then the information is fused by element summation operation; finally, the other is the convolution layer and the FC layer to predict the future speed; the state of the dynamic system, i (t), and the observation o (t), are updated by equation (7) and equation (8), respectively:

I(t+1)＝f₁(I(t),t) (7)

O(t)＝f₂(I(t),t) (8)

Representing an input state of a decoder at time t for capturing multi-order information of a pose sequence at a previous time; as shown in figure 2 of the drawings, in which,

representing the intermediate output of the decoder, encodes the kinematic laws of the preceding pose sequence. Due to the chain structure of the decoder, the history information of the early pose gradually disappears over time. In order to memorize the long-term time-dependent information of the long-term prediction,

historical motion dynamics of previous poses can be sparsely aggregated by equation (9)

Obtaining enhanced feature representations and updating

Wherein H_mThe memory module is used for recording motion dynamics information of historical postures and consists of connecting operations along a channel and two 3 multiplied by 3 convolutional layers.

To model the multi-order information of future poses to achieve more accurate predictions, as shown in fig. 2, the depth state space model is optimized using an attention time prediction loss function L:

to alleviate the problem of error accumulation in recursive models, we introduceAn attention temporal prediction loss function (ATPL) guides the model to achieve more accurate predictions by increasing the attention to early moments. Note that the temporal prediction loss function L includes two parts: l is_vAnd L_p，L＝λ₁L_v+λ₂L_pWherein λ is₁And λ₂Is to balance L_vAnd L_pTwo hyper-parameters of (a); l is_vDirecting the network to decode speed information for future poses; l is_pEncouraging the network to recover future location information;

L_vdefined by formula (10):

and

Normalizing the weight value to 1;

L_pdefined by formula (11):

wherein the content of the first and second substances,

and

Experimental example:

1.1 data set and Experimental details

(1) Data set

Human3.6m (H3.6M): H3.6M is the most commonly used data set in human motion prediction problems. The data set consists of 15 actions performed by 7 professional actors, such as walking, eating, smoking, and discussing.

3D Pose in the Wild dataset (3 DPW): 3DPW is a field data set with precise 3D poses that contain various human behaviors such as shopping, sports, etc. The data set includes 60 pose sequences, over 51 kilo-frames.

(2) Details of the experiment

All experimental settings and data processing were consistent with baseline throughout the experiment. The model of the invention is based on TensorFlow implementation. The Mean Per Joint Point Error (MPJPE) in millimeters is used as the metric of the present invention. All models were trained using Adam optimizers with an initialization of the learning rate to 0.0001. Hyper-parametric lambda₁:λ₂The setting is 3: 1.

1.2 comparison with the advanced Process

Baseline: (1) RGRU: built entirely on GRUs and use residual concatenation to model the speed of future poses implicitly. (2) CS 2S: based on the feedforward model of the CNN, multiple poses are recursively predicted. (3) DTraj: at present, the most advanced 3D human motion prediction method is constructed by using DCT and GCN.

H3.6M Experimental results: table 1 shows H3.6M short-term and long-term prediction results on the data set. Our method outperforms all baselines for both short-term and long-term predictions at all times, showing the effectiveness of our proposed depth state space model. Wherein the error of the method of the invention is significantly reduced compared to RNN baseline RGRU. The possible reason is that the method of the present invention models the position and velocity of the human body directly as an observation of the depth state space model, whereas the RNN baseline RGRU models velocity implicitly as an internal state of the system through residual concatenation in the decoder, and the GRU-based modeling approach ignores a portion of the spatial features between human joints. The model of the invention also achieves the best results in most cases compared to CS2S and DTraj. This benefits from two major advantages of our model: (1) the method explicitly models the position and velocity of the human body posture as the observation of the depth state space model proposed by us, and CS2S and DTraj depend on the residual connection between the input and the output in the decoder, and the velocity modeling capability is relatively limited. Therefore, the method of the invention can better capture the dynamics law of human motion by adopting the speed learning. (2) The recursive decoder of the present invention enables our network to make full use of early predictions to achieve more accurate predictions, while DTraj predicts future poses in a non-recursive manner. Short-term and long-term prediction results on tables 1, H3.6M. Where "ms" represents milliseconds.

TABLE 1

FIGS. 4 and 5 also provide qualitative results at H3.6M, FIG. 4(a) being a walking motion gesture, FIG. 4(b) being a eating motion gesture, the black gesture representing a true gesture sequence, the blue gesture representing an experimental result of the DTraj method, and the red gesture representing an experimental result of the method of the present invention; fig. 5(a) is a sitting motion posture, fig. 5(b) is a photographing motion posture, a black posture represents a real posture sequence, a blue posture represents an experimental result of the DTraj method, and a red posture represents an experimental result of the method of the present invention. Compared with DTraj, the method of the invention achieves the best visualization effect in short-term and long-term prediction, and the effectiveness of the method of the invention is proved again. Specifically, the results of the method of the present invention are better than DTraj for the left hand of the predicted pose of fig. 4, the upper limb of the pose of fig. 5(a), and the right hand of the pose of fig. 5 (b). The main reasons are reflected in two aspects: (1) the superior performance of the method benefits from the proposed deep state space model, and fully combines the advantages of a deep network and the state space model. (2) The depth state space model of the invention combines position and speed as observation, and designs a new time loss function (attention time prediction loss function) to guide the model to realize accurate early prediction. In contrast, DTraj ignores modeling of velocity and does not fully exploit the results of early prediction. Therefore, the depth state space model of the invention can better capture the motion dynamics for accurate prediction.

Experimental results on 3DPW data set: table 2 shows the short-term and long-term prediction results on the 3DPW data set. In summary, our approach outperforms the baseline at all times, whether short-term or long-term prediction, further validating our proposed depth state space model in modeling motion dynamics for accurate prediction.

TABLE 2

1.3 ablation analysis

The invention carries out ablation experiments to verify the effectiveness of several important components in the depth state space model provided by the invention, and the corresponding experimental results are shown in table 3. Table 3 is the ablation experiment result at H3.6M, where "ms" represents milliseconds, "xyz" represents coordinate-level feature modeling of the gesture sequence, "pb" represents "gesture branches," "vb" represents velocity branches, and "x" represents removal of the corresponding module.

TABLE 3

(1) Multi-order information: a) the "# 2", "# 3", and "# 9" group experiments demonstrate the effectiveness of modeling multiple levels of information on an input sequence. Compared with "# 9", the errors of the experiments of the "# 2" and "# 3" groups increased at all times, indicating that the modeling of multiple orders of information using velocity and position is useful forTo capture the effectiveness of the kinematical laws. Without modeling the attitude velocity, the prediction error of the #3 set of experiments increased significantly at an early stage, illustrating the importance of velocity modeling for short-term predictions. b) The "# 4", "# 5", and "# 9" set of experiments verified the effectiveness of modeling future pose multi-order information. Similarly, the prediction error of the experiments in the "# 4" and "# 5" groups was larger than that of the experiment in the "# 9" group at all times. Binding of L_pAnd L_vThe network can be guided to model multi-order information of future gestures, so that better prediction performance is obtained. Furthermore, the error of the experiment of the #5 group is significantly increased, illustrating the importance of modeling future attitude velocity information. c) The "# 7", "# 8", and "# 9" set of experiments further validated the validity of the binding position and velocity as observations of the depth state space model. The position or velocity information without modeling the human body posture can reduce the performance of the system, especially velocity modeling, and the importance of velocity learning is explained. In conclusion, in combination with the position and the speed of the human body, the depth model can learn a robust motion dynamics rule to represent the internal state of the depth state space model, so that the accurate prediction capability of the system is further improved.

(2) Multi-level characteristics: comparing the prediction error between the experiment sets of "# 1" and "# 9", the prediction error of the experiment set of "# 1" increases at all times, illustrating the importance of modeling the kinematics at both the coordinate and joint levels.

(3) Note the temporal prediction loss function (ATPL): compared with the experiment of the group #6, the error of the experiment of the group #9 is obviously reduced, especially at the later moment. The ATPL gives a higher attention weight at an early moment, and can guide the network to realize more accurate prediction at the early moment, so that the problem of error accumulation in a recursive model can be reduced, and the overall performance of the system is further improved.

The invention utilizes the advantages of depth representation and state space model to model the human motion system into a depth state space model, provides a uniform description for various human motion systems, and can analyze the existing model; the invention provides an end-to-end feedforward network to establish the model, and jointly realizes the state initialization and the state transition of the system; furthermore, the proposed attentive temporal prediction penalty function can effectively guide the recursive model to achieve more accurate predictions by increasing the attention to early temporal predictions. In addition, the deep state space model was evaluated on two challenging data sets (H3.6M and 3DPW), which achieved the most advanced performance. Experiments also show that the performance of the system can be further improved by utilizing the characteristics of the coordinate level of the human motion.

The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. The device disclosed by the embodiment corresponds to the method disclosed by the embodiment, so that the description is simple, and the relevant points can be referred to the method part for description.

The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. A3D human motion prediction method based on a depth state space model is characterized in that the position and the speed of human motion are used as observation, the motion dynamics law of a historical attitude sequence is extracted by a depth network to initialize the state of the state space model, a human motion system is modeled as the depth state space model, and a plurality of future attitudes of the 3D human motion are predicted recursively through state to observation transfer.

2. The depth state space model-based 3D human motion prediction method of claim 1, wherein the future pose is represented as a shift of model observations.

3. The 3D human motion prediction method based on the depth state space model as claimed in claim 2, wherein the method of posture sequence representation is as follows:

4. The 3D human motion prediction method based on the depth state space model according to claim 3, wherein the human motion system is represented by a state space model of a dynamic system, as shown in formula (1) and formula (2):

I(t+1)＝f₁(I(t),t)+a(t) (1)

O(t)＝f₂(I(t),t)+b(t) (2)

5. The method as claimed in claim 4, wherein the position and speed of human motion are used as observationInitializing the motion dynamics law of the historical attitude sequence to the state of a state space model, and respectively initializing a (t) and b (t) to 0; sequence S₁The corresponding future pose is defined as:

wherein S is₂Has a length of T₂Corresponding to a speed of

Represents the sequence S₂The posture at the middle time t is,

where O (0) is initialized to { p (0),0},

other multi-level information representing historical poses at time t-1.

6. The 3D human motion prediction method based on the depth state space model is characterized in that the depth state space model is a two-stage system comprising state-state transition and state-observation transition; wherein, state-state transition: this stage is through the system boxNumber f₁(-) update the system state, generate multi-level information for future poses, and update other multi-level information for previous poses, which are automatically learned by the deep network; state-observed transitions: the observed value at this stage passes through the system function f₂() performing calculations based on the current state of the system, this part being done by the decoder auto-learning; further, since the current position and velocity of the human body determine the position of the human body at the next time, the future posture is calculated by equation (5):

7. the 3D human motion prediction method based on the depth state space model according to claim 6, characterized in that the trunk layer of the depth state space model is a densely connected convolution module, and the densely connected convolution module mainly comprises 5 convolution layers; in each convolutional layer, the input fuses features from all previous layers through the 1 × 1 convolutional layer; wherein 1 × 1 convolution in residual concatenation can learn joint level features; therefore, the dense residual connection in the dense connection convolution module allows the network to gradually acquire enhanced features by fusing the shallow joint level features, as shown in equation (6):

M_l＝H_b([g₀(M₀),g₁(M₁),…,g_l-2(M_l-2),M_l-1]) (6)

8. The 3D human motion prediction method based on the depth state space model according to claim 7, wherein the method for initializing the state of the depth state space model comprises the following steps: construction of multi-branch networks with DCCMsCoding the motion dynamics of coordinate level and joint level of the input sequence in position space and speed space respectively, and comprising a posture branch, a speed branch and a fusion module; wherein, the posture is branched: s_1x、S_1yAnd S_1zAs input to each sub-branch consisting of 2 DCCMs, respectively, so that the network can capture the features at the coordinate level; fusing coordinate level features by using a DCCM module to obtain joint level features; speed branch: v_1x、V_1yAnd V_1zRespectively as the input of each subbranch to gradually capture the multi-level characteristics of the input sequence in the speed space; a fusion module: fusing the kinetic information captured by the attitude branch and the speed branch; the fusion module is constructed by connecting operation along a channel and adding a convolution layer and a Leaky ReLU activation function layer; the output of the multi-branch network describes the kinetic characteristics of the input sequence, labeled h (0), and F_b(0) Initialized to h (0), and the initial state of system I (0) is set to { F_b(0)}。

9. The depth state space model-based 3D human motion prediction method of claim 8, wherein the depth state space model is based on a CNN recursive decoder to implement state transitions, including state-state transitions and state-observation transitions; firstly, modeling historical characteristics by adopting more operations, modeling current speed information by adopting less operations, and then fusing the information by summing operations according to elements; finally, the other is the convolution layer and the FC layer to predict the future speed; the state of the dynamic system, i (t), and the observation o (t), are updated by equation (7) and equation (8), respectively:

I(t+1)＝f₁(I(t),t) (7)

O(t)＝f₂(I(t),t) (8)

Obtaining enhanced feature representations and updating

10. The 3D human motion prediction method based on the depth state space model according to claim 9, wherein the depth state space model is optimized by using an attention time prediction loss function L:

L_vdefined by formula (10):

wherein N is_jDisplay switchThe number of nodes; t is₂Representing the length of the predicted future pose sequence;

and

Normalizing the weight value to 1;

L_pdefined by formula (11):

wherein the content of the first and second substances,

and