CN111291693B - Deep integration method based on skeleton motion recognition - Google Patents

Deep integration method based on skeleton motion recognition Download PDF

Info

Publication number
CN111291693B
CN111291693B (application number CN202010097008.XA)
Authority
CN
China
Prior art keywords
motion recognition
joints
spatial
features
recognition
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010097008.XA
Other languages
Chinese (zh)
Other versions
CN111291693A (en)
Inventor
杨会成
徐姝琪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Anhui Polytechnic University
Original Assignee
Anhui Polytechnic University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Anhui Polytechnic University filed Critical Anhui Polytechnic University
Priority to CN202010097008.XA priority Critical patent/CN111291693B/en
Publication of CN111291693A publication Critical patent/CN111291693A/en
Application granted granted Critical
Publication of CN111291693B publication Critical patent/CN111291693B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20Movements or behaviour, e.g. gesture recognition
    • G06V40/23Recognition of whole body movements, e.g. for sport training
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/049Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a deep integration method for skeleton-based action recognition, which uses a convolutional neural network (CNN) and a long short-term memory network (LSTM), that is, a deep integration model, to capture the various spatio-temporal dynamics of the action recognition task. An action is a spatio-temporal event, so the action recognition task requires spatio-temporal features. With this goal in mind, three sub-networks (called SNet, TNet and BodyNet) are modeled to capture the different spatio-temporal dynamics of the action recognition task. Driven by ensemble learning, a hybrid network (called HNet) is modeled from the two sub-networks TNet and BodyNet to capture strong temporal dynamics. Compared with other methods on the UTD-MHAD data set, the recognition rate of the method reaches 92.1%, which is far higher than the recognition rates of 85.81% and 88.10% in the prior art.

Description

Deep integration method based on skeleton motion recognition
Technical Field
The invention relates to the technical field of human action recognition, and in particular to a deep integration method for skeleton-based action recognition.
Background
At present, human action recognition is a popular research subject in the field of computer vision with good practical application prospects; it can be applied to fields such as image processing, computer vision and machine learning, but the diversity of human actions makes it challenging. Current human action recognition is complex: deep learning models cannot yet recognize human actions completely, and errors occur easily. Therefore, it is important to design a deep integration method based on skeleton action recognition.
Disclosure of Invention
In view of the shortcomings of the prior art, the present invention provides a deep integration method for skeleton-based action recognition, which uses a convolutional neural network (CNN) and a long short-term memory network (LSTM), that is, a deep integration model, to capture the various spatio-temporal dynamics of the action recognition task. An action is a spatio-temporal event, so the action recognition task requires spatio-temporal features. With this goal in mind, three sub-networks (called SNet, TNet and BodyNet) are modeled to capture the different spatio-temporal dynamics of the action recognition task. Driven by ensemble learning, a hybrid network (called HNet) is modeled from the two sub-networks TNet and BodyNet to capture strong temporal dynamics.
The invention provides a deep integration method based on skeleton action recognition, which comprises the following steps:
Step one: establishing a recurrent neural network, and modeling the sequence problem with the recurrent neural network;
Step two: modeling a spatial network (SNet) using two spatial distance maps to capture the spatial dynamics of the action recognition task;
Step three: modeling a temporal network (TNet) using distance maps in the time domain to capture the temporal dynamics of the action recognition task;
Step four: using a multi-layer stacked LSTM network as BaseNet, wherein BaseNet consists of three bidirectional LSTM layers and a dropout layer is introduced between two Bi-LSTM layers to alleviate the overfitting problem that arises when training BaseNet, with a fully connected layer and a softmax layer following for the action classification task;
Step five: modeling a hybrid network (HNet) using the BodyNet and TNet features, and selecting distinct and strongly discriminative temporal features from the BodyNet and TNet features to efficiently construct HNet.
The further improvement lies in that: in step one, the recurrent neural network comprises LSTM units, each consisting of an input gate ($I_t$), an input node ($G_t$), a forget gate ($F_t$) and an output gate ($O_t$). The input gate is given by $I_t=\sigma(W_{IX}X_t+W_{IH}H_{t-1}+b_I)$; the forget gate by $F_t=\sigma(W_{FX}X_t+W_{FH}H_{t-1}+b_F)$; the output gate by $O_t=\sigma(W_{OX}X_t+W_{OH}H_{t-1}+b_O)$; and the input node by $G_t=\tanh(W_{GX}X_t+W_{GH}H_{t-1}+b_G)$. The LSTM unit combines them as $C_t=F_t\odot C_{t-1}+I_t\odot G_t$ and $H_t=O_t\odot\tanh(C_t)$, where the $W$ and $b$ terms are the weight matrices and biases of the respective gates.
The further improvement lies in that: in step two, on the basis of pairwise distance features for the action recognition task, four joint distance maps are constructed, one in 3D space and the other three in the 2D orthogonal spaces. Each action is performed by two subjects so that actions involving human-to-human interaction can be handled; in an action sequence, each frame therefore contains two skeletons, belonging to a main subject and an auxiliary subject. A skeleton sequence AS comprises M skeleton frames, each frame containing 2N joints, where the first N joints belong to the main subject and the remaining N joints to the auxiliary subject: AS = {Fr_1, ..., Fr_M}, where Fr_j = {J^j_1, ..., J^j_{2N}} denotes the j-th skeleton frame and J^j_i = (x^j_i, y^j_i, z^j_i) denotes the 3D coordinates (x, y, z) of the i-th joint of the j-th frame. The first set of spatial features is defined with respect to the hip joint and is named SF1_xyz, SF1_xy, SF1_yz and SF1_xz; the second set of spatial features is defined with respect to the shoulder-center joint and is named SF2_xyz, SF2_xy, SF2_yz and SF2_xz. The spatial features are modeled, for each frame f and every joint i other than the reference joint, as
$SF1_{xyz}(f)=\{D_3(J^f_{i,xyz},J^f_{hip,xyz})\}$, $SF1_{xy}(f)=\{D_2(J^f_{i,xy},J^f_{hip,xy})\}$, $SF1_{yz}(f)=\{D_2(J^f_{i,yz},J^f_{hip,yz})\}$, $SF1_{xz}(f)=\{D_2(J^f_{i,xz},J^f_{hip,xz})\}$,
$SF2_{xyz}(f)=\{D_3(J^f_{i,xyz},J^f_{sc,xyz})\}$, $SF2_{xy}(f)=\{D_2(J^f_{i,xy},J^f_{sc,xy})\}$, $SF2_{yz}(f)=\{D_2(J^f_{i,yz},J^f_{sc,yz})\}$, $SF2_{xz}(f)=\{D_2(J^f_{i,xz},J^f_{sc,xz})\}$,
where f denotes the frame number; J_xyz denotes the (x, y, z) coordinates of a joint; J_xy, J_yz and J_xz denote its (x, y), (y, z) and (x, z) coordinates, respectively; hip and sc denote the hip and shoulder-center reference joints; and D_n() denotes the distance between two points in Euclidean n-space. Let r = (r_1, r_2, ..., r_n) and s = (s_1, s_2, ..., s_n) be two points in Euclidean n-space; then D_n() is computed as
$D_n(r,s)=\sqrt{\sum_{i=1}^{n}(r_i-s_i)^2}$.
The further improvement lies in that: in step three, four temporal features are constructed, namely TF_xyz, TF_xy, TF_yz and TF_xz. For each pair of consecutive frames f and f+1 and every joint i = 1, ..., 2N, they are given by
$TF_{xyz}(f)=\{D_3(J^f_{i,xyz},J^{f+1}_{i,xyz})\}$, $TF_{xy}(f)=\{D_2(J^f_{i,xy},J^{f+1}_{i,xy})\}$, $TF_{yz}(f)=\{D_2(J^f_{i,yz},J^{f+1}_{i,yz})\}$, $TF_{xz}(f)=\{D_2(J^f_{i,xz},J^{f+1}_{i,xz})\}$, for f = 1, ..., M-1.
the further improvement lies in that: in the fourth step, bodinet is used to extract various features from fine-grained body parts in the time domain of the entire sequence, and for each frame Fr, skeletal joints related to a main subject are grouped into five groups, which correspond to five body parts, respectively, and joints of auxiliary subjects are also grouped.
The invention has the beneficial effects that: a convolutional neural network (CNN) and a long short-term memory network (LSTM), that is, a deep integration model, are used to capture the various spatio-temporal dynamics of the action recognition task. Compared with other methods on the UTD-MHAD data set, the recognition rate of the method reaches 92.1%, which is far higher than the recognition rates of 85.81% and 88.10% in the prior art.
Drawings
Fig. 1 is a schematic diagram of the basic structure of the LSTM unit of the present invention.
FIG. 2 shows the distance maps of the SNet and TNet of the present invention.
Fig. 3 is a schematic view of the spatial network structure of the present invention.
Fig. 4 is a schematic diagram of a time domain network structure according to the present invention.
Fig. 5 is a schematic diagram of BaseNet of the present invention.
Fig. 6 is a schematic view of the structure of the BodyNet of the present invention.
Fig. 7 is a schematic view of the HNet (hybrid network) structure of the present invention.
FIG. 8 is a table of action category calculations for the present invention.
Fig. 9 is a comparison graph of recognition accuracy of the present invention.
Detailed Description
In order to further understand the present invention, the following detailed description is made with reference to the examples, which are only used to explain the present invention and are not to be construed as limiting its scope. As shown in figs. 1-9, the present embodiment provides a deep integration method for skeleton-based action recognition, comprising the following steps:
Step one: establishing a recurrent neural network, and modeling the sequence problem with the recurrent neural network;
Step two: modeling a spatial network (SNet) using two spatial distance maps to capture the spatial dynamics of the action recognition task;
Step three: modeling a temporal network (TNet) using distance maps in the time domain to capture the temporal dynamics of the action recognition task;
Step four: using a multi-layer stacked LSTM network as BaseNet, wherein BaseNet consists of three bidirectional LSTM layers and a dropout layer is introduced between two Bi-LSTM layers to alleviate the overfitting problem that arises when training BaseNet, with a fully connected layer and a softmax layer following for the action classification task;
Step five: modeling a hybrid network (HNet) using the BodyNet and TNet features, and selecting distinct and strongly discriminative temporal features from the BodyNet and TNet features to efficiently construct HNet.
Since an RNN has internal memory, it can store information about previous computations. In theory, an RNN can handle sequences of arbitrary length; in practice, however, it cannot model long sequences because of two major problems: vanishing gradients and exploding gradients. Long short-term memory (LSTM) has been proposed to address this problem. Fig. 1 shows the basic structure of an LSTM unit. Suppose $X_t$ is the input of the LSTM and t is the time step; the LSTM unit consists of an input gate ($I_t$), an input node ($G_t$), a forget gate ($F_t$) and an output gate ($O_t$). The basic equations of these gates are defined below.
The further improvement lies in that: in step one, the recurrent neural network comprises LSTM units, each consisting of an input gate ($I_t$), an input node ($G_t$), a forget gate ($F_t$) and an output gate ($O_t$). The input gate is given by $I_t=\sigma(W_{IX}X_t+W_{IH}H_{t-1}+b_I)$; the forget gate by $F_t=\sigma(W_{FX}X_t+W_{FH}H_{t-1}+b_F)$; the output gate by $O_t=\sigma(W_{OX}X_t+W_{OH}H_{t-1}+b_O)$; and the input node by $G_t=\tanh(W_{GX}X_t+W_{GH}H_{t-1}+b_G)$. The LSTM unit combines them as $C_t=F_t\odot C_{t-1}+I_t\odot G_t$ and $H_t=O_t\odot\tanh(C_t)$, where the $W$ and $b$ terms are the weight matrices and biases of the respective gates.
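For reference, a minimal NumPy sketch of one LSTM step implementing the gate equations above; the hidden size, the weight/bias layout and the sigmoid helper are assumptions for illustration, not part of the patent:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step following I_t, F_t, O_t, G_t and the cell/hidden updates.
    W is a dict of weight matrices, b a dict of bias vectors (assumed layout)."""
    i_t = sigmoid(W["IX"] @ x_t + W["IH"] @ h_prev + b["I"])   # input gate
    f_t = sigmoid(W["FX"] @ x_t + W["FH"] @ h_prev + b["F"])   # forget gate
    o_t = sigmoid(W["OX"] @ x_t + W["OH"] @ h_prev + b["O"])   # output gate
    g_t = np.tanh(W["GX"] @ x_t + W["GH"] @ h_prev + b["G"])   # input node
    c_t = f_t * c_prev + i_t * g_t                             # cell state C_t
    h_t = o_t * np.tanh(c_t)                                   # hidden state H_t
    return h_t, c_t

# Tiny usage example with random weights (hidden size 4, input size 3)
H, D = 4, 3
W = {k: 0.1 * np.random.randn(H, D if k.endswith("X") else H)
     for k in ("IX", "IH", "FX", "FH", "OX", "OH", "GX", "GH")}
b = {k: np.zeros(H) for k in ("I", "F", "O", "G")}
h, c = lstm_step(np.random.randn(D), np.zeros(H), np.zeros(H), W, b)
```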
The further improvement lies in that: in step two, on the basis of pairwise distance features for the action recognition task, four joint distance maps are constructed, one in 3D space and the other three in the 2D orthogonal spaces. Each action is performed by two subjects so that actions involving human-to-human interaction can be handled; in an action sequence, each frame therefore contains two skeletons, belonging to a main subject and an auxiliary subject. A skeleton sequence AS comprises M skeleton frames, each frame containing 2N joints, where the first N joints belong to the main subject and the remaining N joints to the auxiliary subject: AS = {Fr_1, ..., Fr_M}, where Fr_j = {J^j_1, ..., J^j_{2N}} denotes the j-th skeleton frame and J^j_i = (x^j_i, y^j_i, z^j_i) denotes the 3D coordinates (x, y, z) of the i-th joint of the j-th frame. The first set of spatial features is defined with respect to the hip joint and is named SF1_xyz, SF1_xy, SF1_yz and SF1_xz; the second set of spatial features is defined with respect to the shoulder-center joint and is named SF2_xyz, SF2_xy, SF2_yz and SF2_xz. The spatial features are modeled, for each frame f and every joint i other than the reference joint, as
$SF1_{xyz}(f)=\{D_3(J^f_{i,xyz},J^f_{hip,xyz})\}$, $SF1_{xy}(f)=\{D_2(J^f_{i,xy},J^f_{hip,xy})\}$, $SF1_{yz}(f)=\{D_2(J^f_{i,yz},J^f_{hip,yz})\}$, $SF1_{xz}(f)=\{D_2(J^f_{i,xz},J^f_{hip,xz})\}$,
$SF2_{xyz}(f)=\{D_3(J^f_{i,xyz},J^f_{sc,xyz})\}$, $SF2_{xy}(f)=\{D_2(J^f_{i,xy},J^f_{sc,xy})\}$, $SF2_{yz}(f)=\{D_2(J^f_{i,yz},J^f_{sc,yz})\}$, $SF2_{xz}(f)=\{D_2(J^f_{i,xz},J^f_{sc,xz})\}$,
where f denotes the frame number; J_xyz denotes the (x, y, z) coordinates of a joint; J_xy, J_yz and J_xz denote its (x, y), (y, z) and (x, z) coordinates, respectively; hip and sc denote the hip and shoulder-center reference joints; and D_n() denotes the distance between two points in Euclidean n-space. Let r = (r_1, r_2, ..., r_n) and s = (s_1, s_2, ..., s_n) be two points in Euclidean n-space; then D_n() is computed as
$D_n(r,s)=\sqrt{\sum_{i=1}^{n}(r_i-s_i)^2}$.
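A minimal NumPy sketch of how the SF1/SF2 spatial distance maps described above could be computed from a skeleton sequence; the array layout, the hip and shoulder-center joint indices and the function name are assumptions for illustration:

```python
import numpy as np

def spatial_distance_map(seq, ref_idx, dims=[0, 1, 2]):
    """seq: (M, 2N, 3) array of joint coordinates for one action sequence.
    ref_idx: index of the reference joint (hip for SF1, shoulder center for SF2).
    dims: coordinate subset, e.g. [0, 1, 2] for xyz, [0, 1] for xy.
    Returns a ((2N-1) x M) matrix of joint-to-reference distances."""
    coords = seq[:, :, dims]                      # (M, 2N, len(dims))
    ref = coords[:, ref_idx:ref_idx + 1, :]       # (M, 1, len(dims))
    dist = np.linalg.norm(coords - ref, axis=2)   # (M, 2N) Euclidean distances
    dist = np.delete(dist, ref_idx, axis=1)       # drop the reference joint itself
    return dist.T                                 # ((2N-1), M)

# Example with hypothetical joint indices: SF1_xyz and SF2_xy for a random sequence
seq = np.random.rand(40, 2 * 20, 3)               # M = 40 frames, 2N = 40 joints
sf1_xyz = spatial_distance_map(seq, ref_idx=0, dims=[0, 1, 2])
sf2_xy = spatial_distance_map(seq, ref_idx=2, dims=[0, 1])
```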
The further improvement lies in that: in step three, four temporal features are constructed, namely TF_xyz, TF_xy, TF_yz and TF_xz. For each pair of consecutive frames f and f+1 and every joint i = 1, ..., 2N, they are given by
$TF_{xyz}(f)=\{D_3(J^f_{i,xyz},J^{f+1}_{i,xyz})\}$, $TF_{xy}(f)=\{D_2(J^f_{i,xy},J^{f+1}_{i,xy})\}$, $TF_{yz}(f)=\{D_2(J^f_{i,yz},J^{f+1}_{i,yz})\}$, $TF_{xz}(f)=\{D_2(J^f_{i,xz},J^{f+1}_{i,xz})\}$, for f = 1, ..., M-1.
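Similarly, a minimal sketch of the temporal features as frame-to-frame joint distances; the array layout and function name are illustrative assumptions:

```python
import numpy as np

def temporal_distance_map(seq, dims=[0, 1, 2]):
    """seq: (M, 2N, 3) joint coordinates. Returns a (2N x (M-1)) matrix whose
    column f holds the distance each joint moved between frame f and frame f+1."""
    coords = seq[:, :, dims]                                 # (M, 2N, len(dims))
    diff = coords[1:] - coords[:-1]                          # consecutive-frame motion
    return np.linalg.norm(diff, axis=2).T                    # (2N, M-1)

tf_xyz = temporal_distance_map(np.random.rand(40, 40, 3))    # example input
```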
The sizes of the TF, SF1 and SF2 features are 2N × (M-1), (2N-1) × M and (2N-1) × M, respectively. Because the number of frames M varies from one action sequence to another, the size of the features also varies with M; for batch learning, however, the features of all action sequences must have the same size. Suppose an action sequence contains M frames, each containing N joints. If the distances calculated between successive frames are arranged as vectors, the TF feature yields (M-1) distance vectors (DV), all of the same size N; together they form a matrix with (M-1) columns, and the number of columns varies with the number of frames M in the sequence. Bicubic interpolation is used to resize this matrix to a fixed number of columns M', finally producing a TF feature matrix of size (N × M'); note that M' is fixed for any value of M. Like the TF feature matrix, the SF1 and SF2 feature matrices are resized to ((N-1) × M'). Since the height and width of a subject in skeleton data may have different proportions, the feature values extracted from the skeleton data need to be normalized to the range [0, 1]. A normalization equation is therefore proposed in this work:
$Normalized\,M = \dfrac{M - \min(M)}{\max(M) - \min(M)}$,
where M here denotes the feature matrix to be normalized, min(M) is the minimum value in M and max(M) is the maximum value in M. To keep the values in the range [0, 255], the normalized matrix is multiplied by 255: Gray image = Normalized M × 255. The resulting matrix looks like a grayscale image with 256 intensity levels. To classify with a pre-trained CNN model, a color-coding mechanism is used to convert the grayscale image into a color image, which is the input (X) to the CNN model. As a result, the action recognition problem is converted into an image classification problem, and the CNN is fine-tuned for the action classification task.
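A minimal sketch of this feature-map-to-image pipeline using OpenCV; the fixed width M', the use of the JET colormap and the function name are assumptions for illustration, since the patent only specifies bicubic interpolation, min-max normalization and a color-coding step:

```python
import cv2
import numpy as np

def feature_matrix_to_image(feat, fixed_cols=100):
    """feat: 2D feature matrix (e.g. a TF or SF distance map) with a
    sequence-dependent number of columns. Returns a color image for the CNN."""
    # Bicubic interpolation to a fixed number of columns M'
    resized = cv2.resize(feat.astype(np.float32),
                         (fixed_cols, feat.shape[0]),
                         interpolation=cv2.INTER_CUBIC)
    # Min-max normalization to [0, 1], then scaling to [0, 255]
    normalized = (resized - resized.min()) / (resized.max() - resized.min() + 1e-8)
    gray = (normalized * 255).astype(np.uint8)
    # Color-code the grayscale image (JET is chosen here as an assumption)
    return cv2.applyColorMap(gray, cv2.COLORMAP_JET)

img = feature_matrix_to_image(np.random.rand(39, 57))   # example feature matrix
```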
Multiplicative fusion is employed to compute the spatial and temporal scores of the proposed SNet and TNet, respectively. Suppose $S1_1, S1_2, S1_3, S1_4, S2_1, S2_2, S2_3, S2_4$ are the score vectors of the CNNs trained on the spatial distance maps SF1 and SF2; the spatial score (ss) for action A is computed as $ss = S1_1 \Delta S1_2 \Delta S1_3 \Delta S1_4 \Delta S2_1 \Delta S2_2 \Delta S2_3 \Delta S2_4$, where Δ denotes the multiplicative fusion operator. Similarly, $t_1, t_2, t_3, t_4$ are the score vectors of the four CNNs trained on the temporal distance maps (TF), and the temporal score (ts) for action A is computed as $ts = t_1 \Delta t_2 \Delta t_3 \Delta t_4$.
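A minimal sketch of this multiplicative score fusion, assuming each network outputs a softmax probability vector over the action classes; the class count and the way ss and ts are combined at the end are illustrative assumptions:

```python
import numpy as np

def multiplicative_fusion(score_vectors):
    """Element-wise product of per-network class-score vectors (the Δ operator
    above, interpreted here as element-wise multiplication), renormalized."""
    fused = np.ones_like(score_vectors[0])
    for s in score_vectors:
        fused *= s
    return fused / fused.sum()

# Spatial score from the eight SF1/SF2 CNNs, temporal score from the four TF CNNs
num_classes = 27                                     # assumed number of action classes
spatial_scores = [np.random.dirichlet(np.ones(num_classes)) for _ in range(8)]
temporal_scores = [np.random.dirichlet(np.ones(num_classes)) for _ in range(4)]
ss = multiplicative_fusion(spatial_scores)
ts = multiplicative_fusion(temporal_scores)
predicted_class = int(np.argmax(ss * ts))            # combining ss and ts is an assumption
```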
In order to study the discriminative ability of the proposed spatial distance maps together with related work features, experiments were performed using AlexNet; however, features extracted in the time domain are also essential for robust action recognition. To investigate this assumption, the sub-network TNet, which uses the temporal distance maps, is proposed herein. To train the sub-networks SNet and TNet with AlexNet, the maximum number of epochs for all experiments was 100, the batch size was set to 128, and the initial learning rate was set to 0.001 both for fine tuning and for training from scratch. The networks are trained with back-propagation using stochastic gradient descent with a momentum value of 0.9. Both spatial and temporal features are crucial for the action recognition task. Furthermore, HNet is modeled herein using BodyNet and TNet to extract robust temporal features.
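A minimal PyTorch sketch of this training setup (fine-tuning AlexNet with SGD, momentum 0.9, learning rate 0.001, batch size 128); the number of action classes and the data loading are placeholders, not part of the patent:

```python
import torch
import torch.nn as nn
from torchvision import models

num_classes = 27                                    # placeholder: number of action classes
model = models.alexnet(pretrained=True)             # pre-trained AlexNet to fine-tune
model.classifier[6] = nn.Linear(4096, num_classes)  # replace the final classification layer

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)

def train(loader, epochs=100):                      # batch size 128 is set in `loader`
    model.train()
    for _ in range(epochs):
        for images, labels in loader:
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()                         # back-propagation
            optimizer.step()                        # SGD update with momentum 0.9
```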
The further improvement lies in that: in step four, BodyNet is used to extract various features from fine-grained body parts over the time domain of the entire sequence; for each frame Fr, the skeletal joints of the main subject are grouped into five groups corresponding to five body parts, and the joints of the auxiliary subject are grouped in the same way. A multi-layer stacked LSTM network serves as BaseNet. The proposed BaseNet consists of three bidirectional LSTM (Bi-LSTM) layers, as shown in fig. 5. A dropout (DP) layer is introduced between two Bi-LSTM layers to alleviate the overfitting problem that occurs when training BaseNet. Finally, a fully connected (FC) layer and a softmax layer follow for the action classification task. Since the relative geometry between body parts provides important information for the action recognition task, the present invention designs BodyNet to extract various features from fine-grained body parts over the time domain of the entire sequence, as shown in fig. 6. For each frame Fr ∈ AS, the skeletal joints of the main subject are grouped into five groups $\tau_i$, i = 1, ..., 5, which are the sets of joints corresponding to the body parts RH, RL, LH, LL and Trunk, respectively; likewise, the joints of the auxiliary subject are also grouped. The proposed BodyNet contains three BaseNets, as shown in fig. 6. It uses three temporal features to extract the temporal dynamics between different body parts in the time domain: the motion of the joints, the part-to-part distance and the edge-to-edge distance, called BodyNet-Feature1 (BNF1), BodyNet-Feature2 (BNF2) and BodyNet-Feature3 (BNF3), respectively. The joint-motion feature (BNF1) is one of the important discriminative features for different classes of actions.
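Returning to the BaseNet described above, a minimal PyTorch sketch of three Bi-LSTM layers with dropout between them, followed by a fully connected layer and softmax; the hidden size, dropout rate and input dimensionality are assumptions for illustration:

```python
import torch
import torch.nn as nn

class BaseNet(nn.Module):
    """Three stacked bidirectional LSTM layers with dropout in between,
    followed by a fully connected layer and softmax for classification."""
    def __init__(self, input_dim, num_classes, hidden=128, dropout=0.5):
        super().__init__()
        self.bilstm1 = nn.LSTM(input_dim, hidden, batch_first=True, bidirectional=True)
        self.drop1 = nn.Dropout(dropout)
        self.bilstm2 = nn.LSTM(2 * hidden, hidden, batch_first=True, bidirectional=True)
        self.drop2 = nn.Dropout(dropout)
        self.bilstm3 = nn.LSTM(2 * hidden, hidden, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden, num_classes)

    def forward(self, x):                    # x: (batch, time, input_dim)
        x, _ = self.bilstm1(x)
        x, _ = self.bilstm2(self.drop1(x))
        x, _ = self.bilstm3(self.drop2(x))
        logits = self.fc(x[:, -1, :])        # last time step used for classification
        return torch.softmax(logits, dim=1)

scores = BaseNet(input_dim=40, num_classes=27)(torch.randn(8, 30, 40))  # example
```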
In addition, to capture the geometric relationships between body parts in the time domain, the BNF2 and BNF3 features are defined as the part-to-part distances and the edge-to-edge distances between body parts, respectively.
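A minimal NumPy sketch of the body-part grouping and a BNF2-style part-to-part distance feature, under the assumption that BNF2 is the per-frame distance between the mean positions of each pair of body parts; the joint indices in `BODY_PARTS` and this exact formulation are illustrative assumptions, not taken from the patent:

```python
import numpy as np
from itertools import combinations

# Hypothetical joint-index grouping into the five body parts tau_1..tau_5
BODY_PARTS = {
    "RH": [4, 5, 6, 7],        # right hand/arm
    "RL": [12, 13, 14, 15],    # right leg
    "LH": [8, 9, 10, 11],      # left hand/arm
    "LL": [16, 17, 18, 19],    # left leg
    "Trunk": [0, 1, 2, 3],     # trunk
}

def bnf2_part_to_part(seq):
    """seq: (M, N, 3) joint coordinates of one subject.
    Returns a (num_pairs x M) matrix of body-part centroid distances per frame."""
    centroids = {name: seq[:, idx, :].mean(axis=1) for name, idx in BODY_PARTS.items()}
    rows = []
    for a, b in combinations(BODY_PARTS, 2):             # 10 part pairs
        rows.append(np.linalg.norm(centroids[a] - centroids[b], axis=1))
    return np.stack(rows)                                # (10, M)

bnf2 = bnf2_part_to_part(np.random.rand(40, 20, 3))      # example sequence
```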
A hybrid network called HNet is modeled using the BodyNet and TNet features, as shown in fig. 7; a study was conducted to select distinct, strongly discriminative temporal features from the BodyNet and TNet features in order to construct HNet efficiently. To find the best temporal features, the accuracy of each BodyNet and TNet feature is first computed for the four action classes reported in fig. 8. According to fig. 8, BNF1 performs well compared with the other features for the "jump" action among the four action classes, because it is highly relevant for capturing the motion of the body relative to the ground. On the other hand, the BNF2 feature is good at recognizing the "answer call" action but performs worst for the "clap hands" action. The reason is that the motion between the two hands is the discriminative cue for identifying the action "clap hands", whereas BNF2 captures the relationship between one body part and the remaining parts over the time domain. This shows that different temporal features have their own unique discriminative power for identifying actions.
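A small sketch of the per-class accuracy comparison used to pick the strongest temporal features for HNet; the feature names, classifier outputs and selection rule here are illustrative assumptions, since the patent only states that distinct, strongly discriminative features are selected:

```python
import numpy as np

def per_class_accuracy(pred, labels, num_classes):
    """pred, labels: integer class arrays. Returns the accuracy for each class."""
    return np.array([
        (pred[labels == c] == c).mean() if np.any(labels == c) else 0.0
        for c in range(num_classes)
    ])

def select_features(feature_preds, labels, num_classes, top_k=2):
    """feature_preds: dict mapping a feature name (e.g. 'BNF1', 'TF') to its
    classifier predictions. Keep the top_k features by mean per-class accuracy."""
    scores = {name: per_class_accuracy(p, labels, num_classes).mean()
              for name, p in feature_preds.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

labels = np.random.randint(0, 4, size=200)                # 4 action classes, as in fig. 8
preds = {name: np.random.randint(0, 4, size=200) for name in ("BNF1", "BNF2", "BNF3", "TF")}
best = select_features(preds, labels, num_classes=4)
```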

Claims (5)

1. A deep integration method based on skeleton action recognition, characterized by comprising the following steps:
Step one: establishing a recurrent neural network, and modeling the sequence problem with the recurrent neural network;
Step two: modeling a spatial network (SNet) using two spatial distance maps to capture the spatial dynamics of the action recognition task;
Step three: modeling a temporal network (TNet) using distance maps in the time domain to capture the temporal dynamics of the action recognition task;
Step four: using a multi-layer stacked LSTM network as BaseNet, wherein BaseNet consists of three bidirectional LSTM layers and a dropout layer is introduced between two Bi-LSTM layers to alleviate the overfitting problem that arises when training BaseNet, with a fully connected layer and a softmax layer following for the action classification task;
Step five: modeling a hybrid network (HNet) using the BodyNet and TNet features, and selecting distinct and strongly discriminative temporal features from the BodyNet and TNet features to efficiently construct HNet.
2. The deep integration method based on skeleton action recognition as claimed in claim 1, characterized in that: in step one, the recurrent neural network comprises LSTM units, each consisting of an input gate ($I_t$), an input node ($G_t$), a forget gate ($F_t$) and an output gate ($O_t$). The input gate is given by $I_t=\sigma(W_{IX}X_t+W_{IH}H_{t-1}+b_I)$; the forget gate by $F_t=\sigma(W_{FX}X_t+W_{FH}H_{t-1}+b_F)$; the output gate by $O_t=\sigma(W_{OX}X_t+W_{OH}H_{t-1}+b_O)$; and the input node by $G_t=\tanh(W_{GX}X_t+W_{GH}H_{t-1}+b_G)$. The LSTM unit combines them as $C_t=F_t\odot C_{t-1}+I_t\odot G_t$ and $H_t=O_t\odot\tanh(C_t)$, where the $W$ and $b$ terms are the weight matrices and biases of the respective gates.
3. The deep integration method based on skeleton action recognition as claimed in claim 1, characterized in that: in step two, on the basis of pairwise distance features for the action recognition task, four joint distance maps are constructed, one in 3D space and the other three in the 2D orthogonal spaces; each action is performed by two subjects so that actions involving human-to-human interaction can be handled, and in an action sequence each frame contains two skeletons, belonging to a main subject and an auxiliary subject. A skeleton sequence AS comprises M skeleton frames, each frame containing 2N joints, where the first N joints belong to the main subject and the remaining N joints to the auxiliary subject: AS = {Fr_1, ..., Fr_M}, where Fr_j = {J^j_1, ..., J^j_{2N}} denotes the j-th skeleton frame and J^j_i = (x^j_i, y^j_i, z^j_i) denotes the 3D coordinates (x, y, z) of the i-th joint of the j-th frame. The first set of spatial features is defined with respect to the hip joint and is named SF1_xyz, SF1_xy, SF1_yz and SF1_xz; the second set of spatial features is defined with respect to the shoulder-center joint and is named SF2_xyz, SF2_xy, SF2_yz and SF2_xz. The spatial features are modeled, for each frame f and every joint i other than the reference joint, as
$SF1_{xyz}(f)=\{D_3(J^f_{i,xyz},J^f_{hip,xyz})\}$, $SF1_{xy}(f)=\{D_2(J^f_{i,xy},J^f_{hip,xy})\}$, $SF1_{yz}(f)=\{D_2(J^f_{i,yz},J^f_{hip,yz})\}$, $SF1_{xz}(f)=\{D_2(J^f_{i,xz},J^f_{hip,xz})\}$,
$SF2_{xyz}(f)=\{D_3(J^f_{i,xyz},J^f_{sc,xyz})\}$, $SF2_{xy}(f)=\{D_2(J^f_{i,xy},J^f_{sc,xy})\}$, $SF2_{yz}(f)=\{D_2(J^f_{i,yz},J^f_{sc,yz})\}$, $SF2_{xz}(f)=\{D_2(J^f_{i,xz},J^f_{sc,xz})\}$,
wherein f denotes the frame number; J_xyz denotes the (x, y, z) coordinates of a joint; J_xy, J_yz and J_xz denote its (x, y), (y, z) and (x, z) coordinates, respectively; hip and sc denote the hip and shoulder-center reference joints; and D_n() denotes the distance between two points in Euclidean n-space; let r = (r_1, r_2, ..., r_n) and s = (s_1, s_2, ..., s_n) be two points in Euclidean n-space, then D_n() is computed as
$D_n(r,s)=\sqrt{\sum_{i=1}^{n}(r_i-s_i)^2}$.
4. The deep integration method based on skeleton action recognition as claimed in claim 1, characterized in that: in step three, four temporal features are constructed, namely TF_xyz, TF_xy, TF_yz and TF_xz, given, for each pair of consecutive frames f and f+1 and every joint i = 1, ..., 2N, by
$TF_{xyz}(f)=\{D_3(J^f_{i,xyz},J^{f+1}_{i,xyz})\}$, $TF_{xy}(f)=\{D_2(J^f_{i,xy},J^{f+1}_{i,xy})\}$, $TF_{yz}(f)=\{D_2(J^f_{i,yz},J^{f+1}_{i,yz})\}$, $TF_{xz}(f)=\{D_2(J^f_{i,xz},J^{f+1}_{i,xz})\}$, for f = 1, ..., M-1.
5. The deep integration method based on skeleton action recognition as claimed in claim 1, characterized in that: in step four, BodyNet is used to extract various features from fine-grained body parts over the time domain of the entire sequence; for each frame Fr, the skeletal joints of the main subject are grouped into five groups corresponding to five body parts, and the joints of the auxiliary subject are grouped in the same way.
CN202010097008.XA 2020-02-17 2020-02-17 Deep integration method based on skeleton motion recognition Active CN111291693B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010097008.XA CN111291693B (en) 2020-02-17 2020-02-17 Deep integration method based on skeleton motion recognition

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010097008.XA CN111291693B (en) 2020-02-17 2020-02-17 Deep integration method based on skeleton motion recognition

Publications (2)

Publication Number Publication Date
CN111291693A CN111291693A (en) 2020-06-16
CN111291693B true CN111291693B (en) 2023-03-31

Family

ID=71017962

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010097008.XA Active CN111291693B (en) 2020-02-17 2020-02-17 Deep integration method based on skeleton motion recognition

Country Status (1)

Country Link
CN (1) CN111291693B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112784812B (en) * 2021-02-08 2022-09-23 安徽工程大学 Deep squatting action recognition method

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11295119B2 (en) * 2017-06-30 2022-04-05 The Johns Hopkins University Systems and method for action recognition using micro-doppler signatures and recurrent neural networks
CN110348321A (en) * 2019-06-18 2019-10-18 杭州电子科技大学 Human motion recognition method based on bone space-time characteristic and long memory network in short-term
CN110796110B (en) * 2019-11-05 2022-07-26 西安电子科技大学 Human behavior identification method and system based on graph convolution network

Also Published As

Publication number Publication date
CN111291693A (en) 2020-06-16

Similar Documents

Publication Publication Date Title
CN108764050B (en) Method, system and equipment for recognizing skeleton behavior based on angle independence
Soo Kim et al. Interpretable 3d human action analysis with temporal convolutional networks
Liu et al. Multi-view hierarchical bidirectional recurrent neural network for depth video sequence based action recognition
Parisi et al. A generalized learning paradigm exploiting the structure of feedforward neural networks
US7379568B2 (en) Weak hypothesis generation apparatus and method, learning apparatus and method, detection apparatus and method, facial expression learning apparatus and method, facial expression recognition apparatus and method, and robot apparatus
CN106066996A (en) The local feature method for expressing of human action and in the application of Activity recognition
CN110135249A (en) Human bodys' response method based on time attention mechanism and LSTM
Jalal et al. American sign language posture understanding with deep neural networks
CN111814719A (en) Skeleton behavior identification method based on 3D space-time diagram convolution
TWI758828B (en) Self-learning intelligent driving device
CN109064389B (en) Deep learning method for generating realistic images by hand-drawn line drawings
Sussner et al. The Kosko subsethood fuzzy associative memory (KS-FAM): Mathematical background and applications in computer vision
CN111401116B (en) Bimodal emotion recognition method based on enhanced convolution and space-time LSTM network
CN111444488A (en) Identity authentication method based on dynamic gesture
Dong et al. Dynamic gesture recognition by directional pulse coupled neural networks for human-robot interaction in real time
CN111291693B (en) Deep integration method based on skeleton motion recognition
CN111339888B (en) Double interaction behavior recognition method based on joint point motion diagram
CN114638408A (en) Pedestrian trajectory prediction method based on spatiotemporal information
CN116453025A (en) Volleyball match group behavior identification method integrating space-time information in frame-missing environment
Aviles-Arriaga et al. Visual recognition of similar gestures
Zacharatos et al. Emotion recognition from 3D motion capture data using deep CNNs
Zhao et al. Human action recognition based on improved fusion attention CNN and RNN
Lee Nonlinear approaches to independent component analysis
CN117115911A (en) Hypergraph learning action recognition system based on attention mechanism
Guo et al. Exploiting LSTM-RNNs and 3D skeleton features for hand gesture recognition

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant