CN112749585A - Skeleton action identification method based on graph convolution - Google Patents

Skeleton action identification method based on graph convolution


Publication number
CN112749585A
CN112749585A (application CN201911041763.XA)
Authority
CN
China
Prior art keywords
skeleton
graph
convolution
human body
component combination
Prior art date
Legal status
Withdrawn
Application number
CN201911041763.XA
Other languages
Chinese (zh)
Inventor
崔振 (Zhen Cui)
刘蓉 (Rong Liu)
许春燕 (Chunyan Xu)
张桐 (Tong Zhang)
杨健 (Jian Yang)
Current Assignee
Nanjing University of Science and Technology
Original Assignee
Nanjing University of Science and Technology
Priority date
Filing date
Publication date
Application filed by Nanjing University of Science and Technology filed Critical Nanjing University of Science and Technology
Priority to CN201911041763.XA
Publication of CN112749585A
Status: Withdrawn

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 - Movements or behaviour, e.g. gesture recognition
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G06F18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods


Abstract

The invention discloses a graph convolution-based skeleton action recognition method whose basic unit is a space-time graph convolution module. The module operates as follows: a skeleton video is acquired and a skeleton graph is constructed from each frame; human body part combinations of different scales are defined on the skeleton graph and a joint point relationship graph is built for each combination, yielding a multi-dimensional relationship interaction graph with a part combination interaction dimension and a joint point interaction dimension; graph convolutions are applied to the multi-dimensional interaction graph in the joint point interaction dimension and in the part combination interaction dimension; the spatial features obtained from the two graph convolutions are then fed into a local convolution network over time slices to obtain temporal dynamic features. The network model stacks multiple space-time graph convolution modules to build a neural network, and a softmax classifier performs the classification.

Description

Skeleton action identification method based on graph convolution
Technical Field
The invention belongs to the field of motion recognition, and particularly relates to a graph convolution-based skeleton action recognition method.
Background
Human motion recognition is a popular research direction in the field of computer vision. Its main purpose is to correctly classify the human actions in a video. The technology can be applied to intelligent video surveillance, natural human-machine interaction, sports video analysis, autonomous driving, and other fields. With the development of hardware, multi-modal human motion data including RGB, depth, and infrared data can be collected easily. Skeleton videos obtained from depth data are robust to changes in appearance, lighting, and surroundings, and are therefore highly valuable as input data for motion recognition.
Deep learning is an important means of skeleton-based motion recognition. Yan et al. proposed the spatial-temporal graph convolutional network (ST-GCN) to adaptively learn the spatial and temporal patterns of human motion from skeleton video samples. Li et al. combined graph local convolution filtering with recursive learning in a spatio-temporal graph convolution (STGC) method that recursively performs multi-scale local graph convolution on the skeleton graph. However, these existing skeleton-based methods generally represent human motion with joint positions or sequence information alone, and do not adequately account for local/global information or for the correspondence between specific actions and human body parts.
Disclosure of Invention
The invention aims to provide a method for recognizing skeleton video actions based on graph convolution.
The technical solution realizing the purpose of the invention is as follows: a graph convolution-based skeleton action recognition method, comprising the following steps:
Step 1, obtaining a skeleton video and constructing a graph sequence based on the skeleton sequence;
Step 2, constructing a joint point interaction graph representing each human body part combination according to human body part combinations of different scales;
Step 3, constructing a part combination interaction graph representing the overall structure of the human body, with each part combination as a node;
Step 4, performing a K-order graph convolution on the part combinations of each frame in the joint point interaction dimension to obtain the corresponding part combination features;
Step 5, performing a K-order graph convolution on the part combinations of each frame in the part combination interaction dimension to obtain the corresponding spatial features;
Step 6, stacking the spatial features of all frames along the time axis and performing a time-dimension convolution to obtain temporal dynamic features;
Step 7, constructing a joint point interaction graph and a part combination interaction graph for the temporal dynamic features by the method of steps 2-6, computing the corresponding part combination features and spatial features, and updating the temporal dynamic features to obtain the representation feature vector of the skeleton video;
Step 8, classifying the representation feature vector of the skeleton video with a softmax classifier to complete the action recognition.
In step 1, the following substeps are included:
(1.1) For the t-th frame of the skeleton video, a skeleton graph $S_t = \{V_t, E_t\}$ is constructed based on the natural connections of the human joints, where $V_t$ denotes all nodes in the graph, consisting of the joint points of the human skeleton, and $E_t$ denotes all edges in the graph. The skeleton graph $S_t$ is undirected: if a bone connection exists between two joint points, an edge exists between the two nodes; otherwise, no edge exists between the two nodes.
(1.2) For a skeleton sequence of length T frames, the corresponding graph sequence $\{S_1, S_2, \dots, S_T\}$ is constructed.
In step 2, the following substeps are included:
(2.1) For each skeleton graph $S_t$, limb parts with salient human motion characteristics are defined as different human body part combinations; in particular, the four limbs are taken as the four basic part combinations.
(2.2) The four basic combinations are pairwise combined to construct six combinations of a higher first-order scale: right hand and right leg, left hand and left leg, right hand and left leg, left hand and right leg, the upper body, and the lower body; body parts with relatively small motion amplitude are assigned to the closest of these combinations.
(2.3) The whole-body skeleton forms the part combination of the highest scale: the whole body.
(2.4) For each part combination, each joint point is taken as a node of the graph and the edges of the graph are established according to the natural connections of the human body, yielding the joint point interaction graph.
(2.5) According to spectral graph theory, the Laplacian matrix of the joint point interaction graph, i.e., the joint point adjacency relation matrix $L_{t,i}$, is obtained.
In step 3, the following substeps are included:
(3.1) For the t-th frame, the h constructed human body part combinations are taken as nodes; each node randomly selects several other nodes to connect with as edges, constructing the part combination interaction graph.
(3.2) According to spectral graph theory, the Laplacian matrix of the part combination interaction graph, i.e., the part combination adjacency relation matrix $L^{(p)}_t$, is obtained.
In step 4, the graph convolution in the joint point interaction dimension is:

$$Y_{t,i} = \sum_{k=0}^{K_1} W_{ik}\, T_k\big(\tilde{L}_{t,i}\big)\, X_{t,i}, \qquad \tilde{L}_{t,i} = \frac{2}{\lambda_{\max}}\, L_{t,i} - I$$

where $X_{t,i}$ denotes the features of part combination $S_{t,i}$, specifically the 3-dimensional coordinates of all joint points inside the combination, provided by the acquired skeleton video; $L_{t,i}$ is the joint point adjacency relation matrix and $\lambda_{\max}$ is its largest eigenvalue; $T_k(\tilde{L}_{t,i})$ is the Chebyshev polynomial expansion of the matrix $\tilde{L}_{t,i}$; $Y_{t,i}$ is the graph convolution response of part combination $S_{t,i}$ in the joint interaction dimension; $K_1$ means that nodes and edges within the $K_1$-neighborhood of a convolved node participate in the convolution operation; and $W_{ik}$ are the model parameters of the graph convolution.
In step 5, the graph convolution in the part combination interaction dimension is:

$$Z_t = \sum_{k=0}^{K_2} W_k\, T_k\big(\tilde{L}^{(p)}_t\big)\, Y_t, \qquad \tilde{L}^{(p)}_t = \frac{2}{\lambda_{\max}}\, L^{(p)}_t - I$$

where $Y_t$ denotes the input features of the part combination interaction-dimension graph convolution, i.e., the part combination features computed by the joint point interaction-dimension graph convolution; $L^{(p)}_t$ is the part combination adjacency relation matrix and $\lambda_{\max}$ is its largest eigenvalue; $T_k(\tilde{L}^{(p)}_t)$ is the Chebyshev polynomial expansion of $\tilde{L}^{(p)}_t$; $W_k$ are the model parameters of the k-order convolution; $Z_t$ is the spatial graph convolution response of the t-th frame of the skeleton video; and $K_2$ means that nodes and edges within the $K_2$-neighborhood of a convolved node participate in the convolution operation.
In step 6, the time-dimension convolution over the stacked spatial features is:

$$Y = L * f$$

where $L$ (here denoting the stacked feature tensor rather than a Laplacian) is the 3-D tensor feature matrix obtained by stacking the spatial features of all frames, $f$ is a $1 \times 9$ convolution kernel with window size 9, and $Y$ is the result of local convolution filtering performed only in the time dimension.
In step 7, the temporal dynamic features are repeatedly updated to obtain a higher-dimensional temporal dynamic feature representation of the skeleton video.
In step 8, the specific method for classifying the representation feature vectors of the skeleton video is as follows:
Pooling and fully connected operations are applied to the representation feature vector of the skeleton video to reduce the feature dimensionality; the softmax classifier computes the classification probability of the skeleton video for each action category, and the category with the largest probability is selected as the skeleton action.
Compared with the prior art, the invention has the following notable advantages: 1) following the natural division of the human body, a multi-dimensional relationship interaction graph representing the overall structure of the human body is constructed for the input human skeleton, which comprehensively describes the global interaction relations contained in the motion between part combinations and between joint points; 2) representations of human skeleton dynamics at different levels are learned through the space-time graph convolution network framework, which captures both the overall structural relations of the human body at a single moment and the dynamic changes of the skeleton in the time domain, improving the accuracy of skeleton action recognition.
Drawings
Fig. 1 is a schematic flow chart of the graph convolution-based skeleton action recognition method of the invention.
FIG. 2 is a diagram of the human body part combinations and their joint connection relationships.
FIG. 3 is a diagram of the adjacency relations, in the part combination interaction dimension, of the multi-dimensional relationship interaction graph constructed by the invention.
Detailed Description
The invention is further illustrated by the following examples in conjunction with the accompanying drawings.
The invention provides a graph convolution-based skeleton action recognition method for the single-video, single-label scenario, which, as shown in FIG. 1, specifically comprises the following steps:
Step 1, obtaining a skeleton video and constructing a graph sequence based on the skeleton sequence, comprising the following substeps:
(1.1) For the t-th frame of the skeleton video, a skeleton graph $S_t = \{V_t, E_t\}$ is constructed based on the natural connections of the human joints, where $V_t$ denotes all nodes in the graph, consisting of the joint points of the human skeleton, and $E_t$ denotes all edges in the graph. The skeleton graph $S_t$ is undirected: if a bone connection exists between two joint points, an edge exists between the two nodes; otherwise, no edge exists between the two nodes.
(1.2) For a skeleton sequence of length T frames, the corresponding graph sequence $\{S_1, S_2, \dots, S_T\}$ is constructed.
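As an illustration of this construction, the sketch below builds per-frame adjacency matrices in Python/NumPy. The five-joint edge list is a hypothetical toy skeleton used only for demonstration; a real capture (e.g., the 25-joint NTU RGB+D layout) would substitute its own bone list.

```python
import numpy as np

# Hypothetical toy skeleton: 5 joints, edges follow natural bone connections.
TOY_EDGES = [(0, 1), (1, 2), (1, 3), (1, 4)]

def skeleton_adjacency(num_joints, edges):
    """Undirected adjacency matrix of one frame's skeleton graph S_t."""
    A = np.zeros((num_joints, num_joints))
    for i, j in edges:
        A[i, j] = A[j, i] = 1.0  # an edge exists iff a bone connects the joints
    return A

# Graph sequence for a T-frame clip: the topology is shared by all frames.
T = 300
graph_sequence = [skeleton_adjacency(5, TOY_EDGES) for _ in range(T)]
```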
Step 2, defining h human body part combinations of different scales for the skeleton graph constructed in step 1, and constructing a joint point interaction graph for each part combination, where h = 11, comprising the following substeps:
(2.1) For each skeleton graph $S_t$, limb parts with salient human motion characteristics are defined as different human body part combinations; in particular, the four limbs are taken as the four basic part combinations.
(2.2) Based on the four basic combinations, and guided by intuitive human understanding of the skeleton, a multi-scale division of the human body is proposed. First, the four basic (lowest-scale) combinations constructed in (2.1) are pairwise combined, yielding six combinations of a relatively higher first-order scale: right hand and right leg, left hand and left leg, right hand and left leg, left hand and right leg, the upper body, and the lower body; during this division, body parts with relatively small motion amplitude (such as the neck and trunk) are assigned to the closest of these combinations. The whole-body skeleton then forms the part combination of the highest scale: the whole body. The concrete construction results and the assignment of the neck, trunk, etc. are shown in FIG. 2.
(2.3) For each part combination, a joint point interaction graph is constructed: each joint point serves as a node of the graph, and each node establishes edges according to the natural connections of the human body. According to spectral graph theory, a Laplacian matrix $L_{t,i}$ can be constructed from the connecting edges between the joint points, representing the adjacency relations among all joint points within the part combination.
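A minimal sketch of this Laplacian computation for one part combination follows, reusing `skeleton_adjacency` and `TOY_EDGES` from the sketch above. The patent only specifies "the Laplacian matrix according to spectral graph theory"; the normalized form and the joint subset chosen here are illustrative assumptions.

```python
import numpy as np

def laplacian(A, normalized=True):
    """Graph Laplacian of an adjacency matrix (spectral graph theory).
    The normalized variant is an assumption; the text just says 'Laplacian'."""
    d = A.sum(axis=1)
    if not normalized:
        return np.diag(d) - A
    d_inv_sqrt = np.zeros_like(d)
    nz = d > 0
    d_inv_sqrt[nz] = d[nz] ** -0.5
    D = np.diag(d_inv_sqrt)
    return np.eye(len(A)) - D @ A @ D

# Hypothetical part combination: joints 1, 3, 4 of the toy skeleton.
part_joints = [1, 3, 4]
A_part = skeleton_adjacency(5, TOY_EDGES)[np.ix_(part_joints, part_joints)]
L_part = laplacian(A_part)   # joint point adjacency relation matrix L_{t,i}
```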
Step 3, based on the multi-scale part combinations divided in step 2, further constructing a part combination interaction graph representing the overall structure of the human body, with each part combination as a node, comprising the following substeps:
(3.1) For the t-th frame, the h part combinations constructed in step 2 are taken as nodes, and each node randomly selects other nodes to connect with as edges, constructing the part combination interaction graph; together with the joint point interaction graphs, it forms the multi-dimensional relationship interaction graph. In the invention, each node randomly selects 5 other nodes to connect with as edges.
(3.2) According to spectral graph theory, the Laplacian matrix $L^{(p)}_t$ of the part combination interaction graph is computed from the connecting edges among the part combination nodes of (3.1); it represents the adjacency relations of the nodes of the multi-dimensional relationship interaction graph in the part combination interaction dimension.
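The random-neighbor construction of (3.1) can be sketched as follows; h = 11 and the 5 random connections per node follow the description, while the seed and the symmetrization are illustrative choices (the helper `laplacian` is reused from the earlier sketch).

```python
import numpy as np

def part_combination_adjacency(h=11, n_neighbors=5, seed=0):
    """Adjacency of the part combination interaction graph: each of the h
    combination nodes randomly connects to n_neighbors other nodes."""
    rng = np.random.default_rng(seed)
    A = np.zeros((h, h))
    for i in range(h):
        others = [j for j in range(h) if j != i]
        for j in rng.choice(others, size=n_neighbors, replace=False):
            A[i, j] = A[j, i] = 1.0  # undirected edge
    return A

L_parts = laplacian(part_combination_adjacency())  # L^{(p)}_t
```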
Step 4, performing a $K_1$-order graph convolution on the h part combinations of each frame in the joint point interaction dimension to obtain the corresponding part combination features:

$$Y_{t,i} = \sum_{k=0}^{K_1} W_{ik}\, T_k\big(\tilde{L}_{t,i}\big)\, X_{t,i}, \qquad \tilde{L}_{t,i} = \frac{2}{\lambda_{\max}}\, L_{t,i} - I$$

where $X_{t,i}$ denotes the features of part combination $S_{t,i}$, specifically the 3-dimensional coordinates of all joint points inside the combination, provided by the acquired skeleton video; $L_{t,i}$ is the joint point adjacency relation matrix and $\lambda_{\max}$ is its largest eigenvalue; $T_k(\tilde{L}_{t,i})$ is the Chebyshev polynomial expansion of the matrix $\tilde{L}_{t,i}$; $Y_{t,i}$ is the graph convolution response of part combination $S_{t,i}$ in the joint interaction dimension; $K_1$ means that nodes and edges within the $K_1$-neighborhood of a convolved node participate in the convolution operation; and $W_{ik}$ are the model parameters of the graph convolution.
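This Chebyshev-polynomial convolution can be sketched generically as below. The recurrence $T_k(\tilde{L}) = 2\tilde{L}\,T_{k-1}(\tilde{L}) - T_{k-2}(\tilde{L})$ avoids forming the polynomials explicitly; the feature and weight shapes are hypothetical stand-ins, and `L_part` is reused from the earlier sketch.

```python
import numpy as np

def cheb_graph_conv(X, L, K, W):
    """K-order Chebyshev graph convolution: sum_k W_k T_k(L~) X,
    where L~ = 2 L / lambda_max - I is the rescaled Laplacian."""
    lam_max = np.linalg.eigvalsh(L).max()
    L_t = (2.0 / lam_max) * L - np.eye(len(L))
    Z_prev, Z = X, L_t @ X                 # T_0(L~) X and T_1(L~) X
    out = Z_prev @ W[0] + (Z @ W[1] if K >= 1 else 0.0)
    for k in range(2, K + 1):
        Z_next = 2.0 * (L_t @ Z) - Z_prev  # Chebyshev recurrence
        out = out + Z_next @ W[k]
        Z_prev, Z = Z, Z_next
    return out

# Joint interaction dimension: X_{t,i} holds the 3-D coordinates of the
# 3 joints of the hypothetical part combination; weights are random stand-ins.
K1, c_out = 3, 16
X_ti = np.random.randn(3, 3)
W = [np.random.randn(3, c_out) for _ in range(K1 + 1)]
Y_ti = cheb_graph_conv(X_ti, L_part, K1, W)   # response, shape (3, 16)
```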
Step 5, performing a $K_2$-order graph convolution on the h part combinations of each frame in the part combination interaction dimension to obtain the corresponding spatial features:

$$Z_t = \sum_{k=0}^{K_2} W_k\, T_k\big(\tilde{L}^{(p)}_t\big)\, Y_t, \qquad \tilde{L}^{(p)}_t = \frac{2}{\lambda_{\max}}\, L^{(p)}_t - I$$

where $Y_t$ denotes the input features of the part combination interaction-dimension graph convolution, i.e., the part combination features computed by the joint point interaction-dimension graph convolution; $L^{(p)}_t$ is the Laplacian matrix of the multi-dimensional relationship interaction graph in the part combination interaction dimension, representing the adjacency relations of the part combination interaction graph, and $\lambda_{\max}$ is its largest eigenvalue; $T_k(\tilde{L}^{(p)}_t)$ is the Chebyshev polynomial expansion of $\tilde{L}^{(p)}_t$; $W_k$ are the model parameters of the k-order convolution; $Z_t$ is the spatial graph convolution response of the t-th frame of the skeleton video; and $K_2$ means that nodes and edges within the $K_2$-neighborhood of a convolved node participate in the convolution operation.
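The part combination dimension reuses the same operator over the h = 11 combination nodes; in this sketch `cheb_graph_conv` and `L_parts` come from the earlier sketches, and the input features `Y_t` are random stand-ins for the per-combination outputs of step 4.

```python
import numpy as np

# One 16-dim feature vector per part combination (hypothetical values).
h, K2, c_out = 11, 3, 16
Y_t = np.random.randn(h, 16)
W2 = [np.random.randn(16, c_out) for _ in range(K2 + 1)]
Z_t = cheb_graph_conv(Y_t, L_parts, K2, W2)   # spatial response of frame t
```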
Step 6, stacking the spatial features of all frames along the time axis and performing a time-dimension convolution to obtain the temporal dynamic features:

$$Y = L * f$$

where $L$ (here denoting the stacked feature tensor rather than a Laplacian) is the 3-D tensor feature matrix obtained by stacking the spatial features $Z_t$ of all frames, $f$ is a $1 \times 9$ convolution kernel with window size 9, and $Y$ is the result of local convolution filtering performed only in the time dimension.
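In a PyTorch implementation, the $1 \times 9$ temporal filtering can be expressed as a 2-D convolution whose kernel spans 9 frames and 1 node; the tensor layout (batch, channels, time, nodes) and the channel width are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Stacked spatial features: batch of 1, 16 channels, 300 frames, 11 combination
# nodes. The (9, 1) kernel convolves along time only; padding 4 keeps T fixed.
L_feat = torch.randn(1, 16, 300, 11)
temporal_conv = nn.Conv2d(16, 16, kernel_size=(9, 1), padding=(4, 0))
Y = temporal_conv(L_feat)   # local convolution filtering in the time dimension
```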
In step 7, the method of steps 2-6 forms a space-time graph convolution module comprising five operations: constructing the joint point interaction graph, constructing the part combination interaction graph, computing the corresponding part combination features, computing the spatial features, and computing the temporal dynamic features. The temporal dynamic features are fed into the space-time graph convolution module again to obtain higher-level temporal dynamic features representing the skeleton video. The network of the invention preferably uses 9 space-time graph convolution modules; that is, the updated temporal dynamic features are cyclically fed back into the space-time graph convolution module, and the temporal dynamic features computed in the 9th pass serve as the representation feature vector of the skeleton video.
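A simplified sketch of the stacked architecture: the spatial step is abbreviated here to a learned mixing over the combination nodes (a stand-in for the two graph convolutions of steps 4-5), followed by the temporal convolution of step 6, repeated 9 times as the description prefers.

```python
import torch
import torch.nn as nn

class SpaceTimeGraphConvModule(nn.Module):
    """One space-time module, sketched: a stand-in spatial mixing over the
    11 combination nodes, then the 1x9 temporal convolution of step 6."""
    def __init__(self, channels, num_nodes=11):
        super().__init__()
        self.spatial = nn.Linear(num_nodes, num_nodes)  # stand-in for graph conv
        self.temporal = nn.Conv2d(channels, channels, (9, 1), padding=(4, 0))
        self.relu = nn.ReLU()

    def forward(self, x):              # x: (N, C, T, V)
        x = self.spatial(x)            # mixes the last (node) dimension
        return self.relu(self.temporal(x))

# Nine modules in sequence; the final output is the representation feature
# tensor of the skeleton video.
backbone = nn.Sequential(*[SpaceTimeGraphConvModule(16) for _ in range(9)])
features = backbone(torch.randn(1, 16, 300, 11))
```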
Step 8, classifying the representation feature vector of the skeleton video with a softmax classifier to complete the action recognition, with the following concrete operations:
pooling and fully connected operations are applied to the representation feature vector of the skeleton video to reduce the feature dimensionality; the softmax classifier computes the classification probability of the skeleton video for each action category, and the category with the largest probability is selected as the skeleton action.
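A minimal classification head matching this description; the 60 action classes match NTU RGB+D, while the channel width is a hypothetical choice. In training one would typically feed the pre-softmax logits to the cross-entropy loss, which applies softmax internally.

```python
import torch
import torch.nn as nn

class ClassificationHead(nn.Module):
    """Pooling + fully connected + softmax, as described in step 8."""
    def __init__(self, channels=16, num_classes=60):
        super().__init__()
        self.fc = nn.Linear(channels, num_classes)

    def forward(self, x):                 # x: (N, C, T, V)
        x = x.mean(dim=(2, 3))            # pooling reduces the feature dims
        return torch.softmax(self.fc(x), dim=1)

probs = ClassificationHead()(features)    # reusing `features` from above
action = probs.argmax(dim=1)              # largest category = skeleton action
```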
Examples
To verify the effectiveness of the scheme, a simulation experiment was carried out on the public NTU RGB+D dataset using the PyTorch deep learning platform. Training and test data were determined according to the two evaluation protocols, cross-view and cross-subject, and the deep graph convolution network was then trained and tested. During training, the training data are fed into the network for forward propagation to obtain the classification probability of each sample for each action class; backpropagation is then performed based on the cross-entropy loss, and the network parameters are adjusted. After training, class prediction is performed on the test samples with this network: each test sample is fed into the trained deep graph convolution network, forward propagation yields its classification probability for each action class, and the class with the largest probability is selected as the predicted class. For each video sample, if the predicted class matches the video's label, the method classifies the video correctly; otherwise it classifies it incorrectly. The experimental results show that the method achieves accuracies of 89% (cross-view) and 84% (cross-subject) under the two evaluation protocols.
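A condensed sketch of the training and evaluation loop described here, with a stand-in model and dummy NTU-shaped data; batch size, learning rate, and optimizer are illustrative assumptions, not the paper's settings.

```python
import torch
import torch.nn as nn

# Stand-in network and dummy NTU-style batch (8 clips, 3-D joints, 300 frames,
# 25 joints, 60 classes); the real model is the stacked graph network above.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 300 * 25, 60))
criterion = nn.CrossEntropyLoss()               # cross-entropy loss
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

clips = torch.randn(8, 3, 300, 25)
labels = torch.randint(0, 60, (8,))

optimizer.zero_grad()
loss = criterion(model(clips), labels)          # forward propagation
loss.backward()                                 # backward propagation
optimizer.step()                                # adjust network parameters

pred = model(clips).argmax(dim=1)               # predicted class per clip
accuracy = (pred == labels).float().mean()      # fraction classified correctly
```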

Claims (9)

1. A graph convolution-based skeleton action recognition method, characterized by comprising the following steps:
step 1, obtaining a skeleton video and constructing a graph sequence based on the skeleton sequence;
step 2, constructing a joint point interaction graph representing each human body part combination according to human body part combinations of different scales;
step 3, constructing a part combination interaction graph representing the overall structure of the human body, with each part combination as a node;
step 4, performing a K-order graph convolution on the part combinations of each frame in the joint point interaction dimension to obtain the corresponding part combination features;
step 5, performing a K-order graph convolution on the part combinations of each frame in the part combination interaction dimension to obtain the corresponding spatial features;
step 6, stacking the spatial features of all frames along the time axis and performing a time-dimension convolution to obtain temporal dynamic features;
step 7, constructing a joint point interaction graph and a part combination interaction graph for the temporal dynamic features by the method of steps 2-6, computing the corresponding part combination features and spatial features, and updating the temporal dynamic features to obtain the representation feature vector of the skeleton video;
step 8, classifying the representation feature vector of the skeleton video with a softmax classifier to complete the action recognition.
2. The graph convolution-based skeleton action recognition method according to claim 1, wherein step 1 comprises the following substeps:
(1.1) for the t-th frame of the skeleton video, constructing a skeleton graph $S_t = \{V_t, E_t\}$ based on the natural connections of the human joints, where $V_t$ denotes all nodes in the graph, consisting of the joint points of the human skeleton, and $E_t$ denotes all edges in the graph; the skeleton graph $S_t$ is undirected, and if a bone connection exists between two joint points, an edge exists between the two nodes; otherwise, no edge exists between the two nodes;
(1.2) for a skeleton sequence of length T frames, constructing the corresponding graph sequence $\{S_1, S_2, \dots, S_T\}$.
3. The graph convolution-based skeleton action recognition method according to claim 1, wherein step 2 comprises the following substeps:
(2.1) for each skeleton graph $S_t$, defining limb parts with salient human motion characteristics as different human body part combinations; in particular, taking the four limbs as the four basic part combinations;
(2.2) pairwise combining the four constructed basic combinations to construct six combinations of a higher first-order scale: right hand and right leg, left hand and left leg, right hand and left leg, left hand and right leg, the upper body, and the lower body, and assigning body parts with relatively small motion amplitude to the closest of these combinations;
(2.3) constructing the part combination of the highest scale from the whole-body skeleton: the whole body;
(2.4) for each part combination, taking each joint point as a node of the graph and establishing the edges of the graph according to the natural connections of the human body, obtaining the joint point interaction graph;
(2.5) obtaining, according to spectral graph theory, the Laplacian matrix of the joint point interaction graph, i.e., the joint point adjacency relation matrix $L_{t,i}$.
4. The graph convolution-based skeleton action recognition method according to claim 1, wherein step 3 comprises the following substeps:
(3.1) for the t-th frame, taking the h constructed human body part combinations as nodes, each node randomly selecting several other nodes to connect with as edges, and constructing the part combination interaction graph;
(3.2) obtaining, according to spectral graph theory, the Laplacian matrix of the part combination interaction graph, i.e., the part combination adjacency relation matrix $L^{(p)}_t$.
5. The graph convolution-based skeleton action recognition method according to claim 1, wherein in step 4 the graph convolution in the joint point interaction dimension is:

$$Y_{t,i} = \sum_{k=0}^{K_1} W_{ik}\, T_k\big(\tilde{L}_{t,i}\big)\, X_{t,i}, \qquad \tilde{L}_{t,i} = \frac{2}{\lambda_{\max}}\, L_{t,i} - I$$

where $X_{t,i}$ denotes the features of part combination $S_{t,i}$, specifically the 3-dimensional coordinates of all joint points inside the combination, provided by the acquired skeleton video; $L_{t,i}$ is the joint point adjacency relation matrix and $\lambda_{\max}$ is its largest eigenvalue; $T_k(\tilde{L}_{t,i})$ is the Chebyshev polynomial expansion of the matrix $\tilde{L}_{t,i}$; $Y_{t,i}$ is the graph convolution response of part combination $S_{t,i}$ in the joint interaction dimension; $K_1$ means that nodes and edges within the $K_1$-neighborhood of a convolved node participate in the convolution operation; and $W_{ik}$ are the model parameters of the graph convolution.
6. The graph convolution-based skeleton action recognition method according to claim 1, wherein in step 5 the graph convolution in the part combination interaction dimension is:

$$Z_t = \sum_{k=0}^{K_2} W_k\, T_k\big(\tilde{L}^{(p)}_t\big)\, Y_t, \qquad \tilde{L}^{(p)}_t = \frac{2}{\lambda_{\max}}\, L^{(p)}_t - I$$

where $Y_t$ denotes the input features of the part combination interaction-dimension graph convolution, i.e., the part combination features computed by the joint point interaction-dimension graph convolution; $L^{(p)}_t$ is the part combination adjacency relation matrix and $\lambda_{\max}$ is its largest eigenvalue; $T_k(\tilde{L}^{(p)}_t)$ is the Chebyshev polynomial expansion of $\tilde{L}^{(p)}_t$; $W_k$ are the model parameters of the k-order convolution; $Z_t$ is the spatial graph convolution response of the t-th frame of the skeleton video; and $K_2$ means that nodes and edges within the $K_2$-neighborhood of a convolved node participate in the convolution operation.
7. The graph convolution-based skeleton action recognition method according to claim 1, wherein in step 6 the time-dimension convolution over the stacked spatial features is:

$$Y = L * f$$

where $L$ (here denoting the stacked feature tensor rather than a Laplacian) is the 3-D tensor feature matrix obtained by stacking the spatial features of all frames, $f$ is a $1 \times 9$ convolution kernel with window size 9, and $Y$ is the result of local convolution filtering performed only in the time dimension.
8. The graph convolution-based skeleton action recognition method according to claim 1, wherein in step 7 the temporal dynamic features are repeatedly updated to obtain a higher-dimensional temporal dynamic feature representation of the skeleton video.
9. The graph convolution-based skeleton action recognition method according to claim 1, wherein in step 8 the representation feature vector of the skeleton video is classified as follows:
pooling and fully connected operations are applied to the representation feature vector of the skeleton video to reduce the feature dimensionality; the softmax classifier computes the classification probability of the skeleton video for each action category, and the category with the largest probability is selected as the skeleton action.
Application CN201911041763.XA, priority date 2019-10-30, filing date 2019-10-30: Skeleton action identification method based on graph convolution, published as CN112749585A (en), status Withdrawn.

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN201911041763.XA | 2019-10-30 | 2019-10-30 | Skeleton action identification method based on graph convolution


Publications (1)

Publication Number | Publication Date
CN112749585A | 2021-05-04

Family

ID=75640356

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN201911041763.XA | Skeleton action identification method based on graph convolution | 2019-10-30 | 2019-10-30

Country Status (1)

Country Link
CN (1) CN112749585A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113283400A (en) * 2021-07-19 2021-08-20 成都考拉悠然科技有限公司 Skeleton action identification method based on selective hypergraph convolutional network
CN115294228A (en) * 2022-07-29 2022-11-04 北京邮电大学 Multi-graph human body posture generation method and device based on modal guidance
CN115294228B (en) * 2022-07-29 2023-07-11 北京邮电大学 Multi-figure human body posture generation method and device based on modal guidance


Legal Events

Code | Event
PB01 | Publication
SE01 | Entry into force of request for substantive examination
WW01 | Invention patent application withdrawn after publication (application publication date: 2021-05-04)