CN113762082A

CN113762082A - Unsupervised skeleton action identification method based on cyclic graph convolution automatic encoder

Info

Publication number: CN113762082A
Application number: CN202110908006.9A
Authority: CN
Inventors: 赵生捷; 梁爽; 姚晗
Original assignee: Tongji University
Current assignee: Tongji University
Priority date: 2021-08-09
Filing date: 2021-08-09
Publication date: 2021-12-07
Anticipated expiration: 2041-08-09
Also published as: CN113762082B

Abstract

The invention relates to an unsupervised skeleton action recognition method based on a cyclic graph convolution automatic encoder, comprising the following steps: inputting a human skeleton action sequence into a cyclic graph convolution encoder; outputting a representation vector of the action sequence from the cyclic graph convolution encoder; and classifying the representation vector of the action sequence with a weighted nearest neighbor classification algorithm to obtain the recognition category of the human skeleton action sequence. The cyclic graph convolution encoder includes: multiple layers of spatial joint attention modules, which combine the human skeleton action sequence with the hidden layer of the cyclic graph convolution encoder to adaptively measure the importance of different joints in different actions and obtain a weighted skeleton sequence; and multiple graph convolution gated recurrent unit (GRU) layers, which integrate the connection relation features of the weighted skeleton sequence to obtain the representation vector of the action sequence. Compared with the prior art, the method can significantly improve the recognition accuracy of unsupervised action recognition systems and has broad application prospects.

Description

Unsupervised skeleton action identification method based on cyclic graph convolution automatic encoder

Technical Field

The invention relates to the technical field of computer vision and motion recognition, in particular to an unsupervised skeleton motion recognition method based on a cyclic graph convolution automatic encoder.

Background

The movement of natural organisms such as humans and animals, including whole-body movement and the movement of body parts such as the head, limbs, hands, and eyes, is generally called biological motion. These forms of motion are crucial for humans to perceive dynamic changes in the environment and to infer the intent of other people or other species. Recognizing and understanding the motion of an observed individual is a basic attribute of human visual perception, and the ability to recognize motion in different scenes is equally important. For these reasons, the human action recognition task has attracted the attention of a large number of researchers in computer vision. Action recognition is widely applied, for example in video surveillance, human-computer interaction, and motion analysis, and has gradually developed into an important research direction. Research on human action recognition dates back to 1973, when Johansson found through experimental observation that human actions are realized mainly through the movement of a few key skeleton points of the body, and that actions such as walking, running, and dancing can be described by combining and tracking 10-12 key nodes, which makes human action recognition possible.

In recent years, with the successive introduction and rapid development of depth sensors such as Kinect and RealSense, the RGB information, depth information, and skeleton information of images have become much easier to obtain. This has greatly advanced the field of action recognition. Early action recognition methods were mostly based on video sequences, which suffer from high computational complexity and sensitivity to unrelated factors. Skeleton information, by contrast, is robust to factors such as human appearance, interaction with the environment, and viewpoint changes, while also being cheap to compute and easy to store. Action recognition based on skeleton data is therefore a rapidly growing research direction, in which the changes of these key points over time are used to recognize actions effectively.

At present, research on action recognition based on skeleton data is evolving rapidly with the development of deep learning: the earliest methods recognized actions from manually extracted features, and later methods use recurrent neural networks or convolutional neural networks. However, these methods cannot exploit the topological features of the skeleton data itself, so their recognition accuracy still needs to be improved.

Disclosure of Invention

The invention aims to overcome the defects of the prior art and provide an unsupervised skeleton action identification method based on a cyclic graph convolution automatic encoder.

The purpose of the invention can be realized by the following technical scheme:

An unsupervised skeleton action identification method based on a cyclic graph convolution automatic encoder comprises the following steps:

S1, inputting a human skeleton action sequence into a cyclic graph convolution encoder;

S2, obtaining a characterization vector of the action sequence as the output of the cyclic graph convolution encoder;

S3, classifying the characterization vector of the action sequence with a weighted nearest neighbor classification algorithm to obtain the recognition category of the human skeleton action sequence.

The cyclic graph convolution encoder includes: multilayer spatial joint attention modules, which combine the human skeleton action sequence with the hidden layer of the cyclic graph convolution encoder to adaptively measure the importance of different joints in different actions and obtain a weighted skeleton sequence; and multiple graph convolution gated recurrent unit (GRU) layers, which integrate the connection relation features of the weighted skeleton sequence to obtain the characterization vector of the action sequence.

Further, in the spatial joint attention module, the weighted skeleton sequence is calculated as follows:

$$x'_t = (\alpha_t + 1) \cdot x_t$$

$$s_t = U_s \, \phi(W_x x_t + W_h h_{t-1} + b_s) + b_u$$

where $x'_t$ represents the weighted skeleton sequence, $\alpha_t$ represents the importance of each joint, $s_t$ represents the importance score of each joint, $x_t$ represents the coordinates of the N joints at time t, $h_{t-1}$ represents the hidden layer information, $W_x$ and $W_h$ represent learnable parameter matrices, $\phi$ represents an activation function, and $b_s$ and $b_u$ represent biases.

Further, in the graph convolution gated recurrent unit (GRU) layer, the connection relation features of the weighted skeleton sequence are integrated according to the expression:

$$H^{(l+1)} = \tau\left(\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} H^{(l)} \Theta^{(l)}\right)$$

where $H^{(l+1)}$ represents the output of the (l+1)-th graph convolution layer, $\tilde{A} = A + I$ represents the symmetric adjacency matrix with self-loops, $A$ represents the adjacency matrix, $I$ represents the identity matrix, $\tilde{D}$ is the degree matrix of $\tilde{A}$, $\tau$ represents the activation function, $H^{(l)}$ represents the output of the l-th graph convolution layer, and $\Theta^{(l)}$ represents the learnable parameter matrix of the l-th layer.

Further, in the expression of the graph convolution gated recurrent unit (GRU) layer, $z_t$ represents the update gate, $r_t$ represents the reset gate, $\tilde{h}_t$ represents the candidate activation vector, $*_{\mathcal{G}}$ represents the graph convolution operation, which corresponds to the expression for $H^{(l+1)}$, $W_{xz}$, $W_{hz}$, $W_{xr}$, $W_{hr}$, $W_{xh}$ and $W_{hh}$ represent the parameter matrices in the different gates, and $\odot$ represents the Hadamard product.

Further, the training of the cyclic graph convolution encoder comprises the following steps:

A1, inputting a training motion sequence set into the cyclic graph convolution encoder to obtain the characterization vectors of the motion sequences;

A2, inputting the characterization vector and the hidden layer vector of the motion sequence into a decoder, which restores the sequence to obtain a reconstructed motion sequence set;

A3, comparing the reconstructed motion sequence set with the training motion sequence set, and calculating a loss function value through a reconstruction loss function.

Steps A1 to A3 are repeated until the loss function value reaches a preset cutoff condition.

Furthermore, the hidden layer vector is an all-zero vector of the same size as the human skeleton action sequence.

Further, the expression of the reconstruction loss function is:

$$L = \| X - \hat{X} \|_F$$

where $X$ represents the training motion sequence set, $\hat{X}$ represents the reconstructed motion sequence set, $\|\cdot\|_F$ represents the Frobenius norm, and $L$ represents the loss function value.

Further, the cyclic graph convolution encoder is trained by a gradient descent method.

Furthermore, the spatial joint attention module and the graph convolution gated recurrent unit (GRU) layer each have three layers.

Further, in the weighted nearest neighbor classification algorithm, the k nearest samples are obtained, where k is a preset value, the votes for each category are counted, and the recognition result is obtained through weighted voting. The voting weight $w_i$ of sample i is calculated from its cosine distance $d_i$.

Compared with the prior art, the invention has the following beneficial effects:

The invention applies a cyclic graph convolution encoder to skeleton action recognition and places multilayer spatial joint attention modules inside the encoder, so that the spatial topological relations of the skeleton sequence data are taken into account during recognition and the spatio-temporal dependencies of the action sequence are used to improve recognition accuracy. In addition, the invention adopts a weighted nearest neighbor classification algorithm as the final classifier and uses the idea of exponential explosion so that samples that benefit the result receive larger voting weights, which further improves recognition accuracy.

Drawings

FIG. 1 is a schematic overall flow chart of the present invention.

Fig. 2 is a schematic view of a spatial joint attention module.

FIG. 3 is a schematic diagram of the graph convolution gated recurrent unit (GRU) layer.

Detailed Description

The invention is described in detail below with reference to the figures and specific embodiments. The present embodiment is implemented on the premise of the technical solution of the present invention, and a detailed implementation manner and a specific operation process are given, but the scope of the present invention is not limited to the following embodiments.

The embodiment provides an unsupervised skeleton action recognition method based on a cyclic graph convolution automatic encoder, which is used for solving the problem that the existing unsupervised action recognition method ignores the action sequence spatial dependency relationship and improving the action recognition accuracy.

As shown by the solid-line flow in Fig. 1, the specific steps of this embodiment are as follows:

Step S1: after preprocessing, the human skeleton action sequence is input into the cyclic graph convolution encoder.

Step S2: the cyclic graph convolution encoder outputs the characterization vector of the action sequence. The cyclic graph convolution encoder includes: multilayer spatial joint attention modules, which combine the human skeleton action sequence with the hidden layer of the cyclic graph convolution encoder to adaptively measure the importance of different joints in different actions and obtain a weighted skeleton sequence; and multiple graph convolution gated recurrent unit layers (graph convolution GRU layers), which integrate the connection relation features of the weighted skeleton sequence to obtain the characterization vector of the action sequence. The spatial joint attention module and the graph convolution GRU layer preferably each use three layers.

Step S3: the characterization vector of the action sequence is classified with a weighted nearest neighbor classification algorithm to obtain the recognition category of the human skeleton action sequence, which completes the action recognition process.

The training steps of the cyclic graph convolution encoder are as described in the dashed flow in fig. 1, and the whole is trained by a gradient descent method, specifically as follows:

Step A1: the training motion sequence set is input into the cyclic graph convolution encoder to obtain the characterization vectors of the motion sequences.

Step A2: the characterization vector and the hidden layer vector of the motion sequence are input into a decoder for sequence restoration, yielding a reconstructed motion sequence set.

Step A3: the reconstructed motion sequence set is compared with the training motion sequence set, and a loss function value is calculated through a reconstruction loss function.

The steps A1 to A3 are repeated until the loss function value reaches a preset cutoff condition.

In the training process, the expression of the reconstruction loss function is:

$$L = \| X - \hat{X} \|_F$$

where $X$ represents the training motion sequence set, $\hat{X}$ represents the reconstructed motion sequence set, $\|\cdot\|_F$ represents the Frobenius norm, and $L$ represents the loss function value.
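
As an illustration of the reconstruction loss, the following is a minimal NumPy sketch that computes the Frobenius norm of the difference between a training sequence and its reconstruction. The function name `reconstruction_loss`, the (frames, joints, coordinates) array layout, and the use of the plain (unsquared, unaveraged) norm are assumptions made for this example and are not taken from the patent text.

```python
import numpy as np

def reconstruction_loss(x, x_hat):
    """Frobenius norm of the difference between a sequence and its reconstruction.

    ASSUMPTION: inputs are arrays of identical shape, e.g. (frames, joints, coords);
    arrays with more than two axes are treated as one flattened matrix.
    """
    diff = np.asarray(x, dtype=float) - np.asarray(x_hat, dtype=float)
    return float(np.sqrt(np.sum(diff ** 2)))

# Toy sequence: 2 frames, 3 joints, 3 coordinates per joint
x = np.zeros((2, 3, 3))
x_hat = x.copy()
x_hat[0, 0] = [3.0, 4.0, 0.0]   # only one joint is reconstructed incorrectly

print(reconstruction_loss(x, x_hat))  # 5.0
```

A perfect reconstruction gives a loss of zero, so the cutoff condition of steps A1 to A3 could, for instance, be a threshold on this value.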

Next, this embodiment will be described in detail in several sections.

First, the spatial joint attention module is shown in Fig. 2. The spatial joint attention module combines the human skeleton action sequence $x_t$ with the hidden layer $h_{t-1}$ of the cyclic graph convolution encoder to adaptively measure the importance of different joints in different actions, yielding the weighted skeleton sequence $x'_t$. The weighted skeleton sequence $x'_t$ is calculated in the following steps:

First, the importance score $s_t$ of each joint is calculated as:

$$s_t = U_s \, \phi(W_x x_t + W_h h_{t-1} + b_s) + b_u$$

where $x_t$ represents the coordinates of the N joints at time t, $h_{t-1}$ represents the hidden layer information, $W_x$ and $W_h$ represent learnable parameter matrices, $\phi$ represents an activation function, and $b_s$ and $b_u$ represent biases.

Then, the importance $\alpha_t$ of each joint is calculated from the importance score $s_t$.

Finally, the weighted skeleton sequence $x'_t$ is calculated as:

$$x'_t = (\alpha_t + 1) \cdot x_t$$

where $\cdot$ denotes point-wise multiplication.
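
The three steps above can be sketched in NumPy for a single frame. This is only an illustrative sketch: the expression for $\alpha_t$ is not available in this text, so a softmax over the scores is assumed here, $\phi$ is assumed to be tanh, and the function name, array shapes, and random parameters are all hypothetical.

```python
import numpy as np

def spatial_joint_attention(x_t, h_prev, W_x, W_h, U_s, b_s, b_u):
    """Return (x_weighted, alpha_t) for one frame.

    x_t: (N, C) joint coordinates; h_prev: (H,) previous hidden state.
    """
    # Importance score s_t = U_s phi(W_x x_t + W_h h_{t-1} + b_s) + b_u, phi = tanh (assumed)
    s_t = U_s @ np.tanh(W_x @ x_t.reshape(-1) + W_h @ h_prev + b_s) + b_u
    # ASSUMPTION: a softmax turns the scores into importances alpha_t
    e = np.exp(s_t - s_t.max())
    alpha_t = e / e.sum()
    # Weighted skeleton x'_t = (alpha_t + 1) * x_t, applied joint by joint
    x_weighted = (alpha_t + 1.0)[:, None] * x_t
    return x_weighted, alpha_t

rng = np.random.default_rng(0)
N, C, H, D = 5, 3, 8, 16                      # joints, coords, hidden size, attention size
x_t = rng.normal(size=(N, C))
h_prev = np.zeros(H)
W_x = rng.normal(scale=0.1, size=(D, N * C))
W_h = rng.normal(scale=0.1, size=(D, H))
U_s = rng.normal(scale=0.1, size=(N, D))
b_s, b_u = np.zeros(D), np.zeros(N)

x_weighted, alpha_t = spatial_joint_attention(x_t, h_prev, W_x, W_h, U_s, b_s, b_u)
print(alpha_t.sum(), x_weighted.shape)
```

The `+ 1` term means every joint keeps at least its original coordinates, and the attention only amplifies the joints judged important.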

Second, the multilayer graph convolution gated recurrent unit (GRU) layers are shown in Fig. 3. These layers integrate the connection relation features of the weighted skeleton sequence, make full use of the spatial dependencies between the joints of each frame, and at the same time preserve the features of the time dimension to obtain the characterization vector of the action sequence.

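In the graph convolution gated recurrent unit (GRU) layer, the connection relation features of the weighted skeleton sequence are integrated according to the expression:

$$H^{(l+1)} = \tau\left(\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} H^{(l)} \Theta^{(l)}\right)$$

where $H^{(l+1)}$ and $H^{(l)}$ represent the outputs of the (l+1)-th and l-th graph convolution layers, $\tilde{A} = A + I$ represents the symmetric adjacency matrix with self-loops, $A$ represents the adjacency matrix of the graph, $I$ represents the identity matrix, $\tilde{D}$ is the degree matrix of $\tilde{A}$, $\tau$ represents the activation function, and $\Theta^{(l)}$ represents the learnable parameter matrix of the l-th layer.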
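
To make the role of the self-loop adjacency matrix and the degree matrix concrete, the following NumPy sketch applies one graph convolution to a toy three-joint chain. The helper names, the choice of ReLU as the activation, and the toy skeleton are assumptions made for illustration only.

```python
import numpy as np

def normalized_adjacency(A):
    """Add self-loops and apply symmetric degree normalisation."""
    A_tilde = A + np.eye(A.shape[0])          # adjacency with self-loops
    d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))  # degrees are >= 1 thanks to self-loops
    return A_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def graph_conv(A_hat, H, Theta, activation=lambda v: np.maximum(v, 0.0)):
    """One graph convolution: aggregate neighbours, project, apply activation (ReLU assumed)."""
    return activation(A_hat @ H @ Theta)

# Toy skeleton: joint 0 - joint 1 - joint 2 (a chain)
A = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])
A_hat = normalized_adjacency(A)

rng = np.random.default_rng(0)
H0 = rng.normal(size=(3, 3))      # 3 joints with 3 coordinates each
Theta = rng.normal(size=(3, 4))   # projects 3 input features to 4 output features
H1 = graph_conv(A_hat, H0, Theta)
print(A_hat[0, 0], H1.shape)
```

Each output row mixes a joint's own features with those of its directly connected joints, which is how the skeleton topology enters the encoder.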

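The graph convolution gated recurrent unit (GRU) layer combines graph convolution with the gated recurrent unit: the update gate $z_t$, the reset gate $r_t$ and the candidate activation vector $\tilde{h}_t$ are computed with graph convolutions using the parameter matrices $W_{xz}$, $W_{hz}$, $W_{xr}$, $W_{hr}$, $W_{xh}$ and $W_{hh}$ of the different gates.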
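
The gate expressions themselves are not available in this text. As a hedged sketch, the code below takes the standard GRU equations and replaces each matrix product with a linear graph convolution. The absence of bias terms, the update convention `h_t = z_t * h_{t-1} + (1 - z_t) * h_cand`, and all function names and shapes are assumptions, not the patent's definition.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def gconv(A_hat, X, W):
    """Linear graph convolution: aggregate over neighbours, then project."""
    return A_hat @ X @ W

def graph_conv_gru_step(A_hat, x_t, h_prev, p):
    """One time step; x_t: (N, C) weighted skeleton, h_prev: (N, H) hidden state."""
    z = sigmoid(gconv(A_hat, x_t, p["W_xz"]) + gconv(A_hat, h_prev, p["W_hz"]))   # update gate
    r = sigmoid(gconv(A_hat, x_t, p["W_xr"]) + gconv(A_hat, h_prev, p["W_hr"]))   # reset gate
    h_cand = np.tanh(gconv(A_hat, x_t, p["W_xh"])
                     + gconv(A_hat, r * h_prev, p["W_hh"]))                        # candidate
    # ASSUMED convention: z keeps the old state, (1 - z) admits the candidate
    return z * h_prev + (1.0 - z) * h_cand

rng = np.random.default_rng(0)
N, C, H, T = 3, 3, 4, 6                       # joints, coords, hidden size, frames
A = np.array([[0.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 0.0]])
A_tilde = A + np.eye(N)
d = 1.0 / np.sqrt(A_tilde.sum(axis=1))
A_hat = A_tilde * d[:, None] * d[None, :]

params = {}
for name in ("W_xz", "W_xr", "W_xh"):
    params[name] = rng.normal(scale=0.1, size=(C, H))
for name in ("W_hz", "W_hr", "W_hh"):
    params[name] = rng.normal(scale=0.1, size=(H, H))

h = np.zeros((N, H))
for x_t in rng.normal(size=(T, N, C)):        # run the cell over a toy sequence
    h = graph_conv_gru_step(A_hat, x_t, h, params)
print(h.shape)
```

The final hidden state `h` plays the role of the characterization vector in this toy setting; stacking several such layers would correspond to the multilayer arrangement described above.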

Third, in the training step of the cyclic graph convolution encoder, the input of the decoder is the characterization vector of the action sequence and a hidden layer vector. To make the decoder depend entirely on the state passed from the encoder, and thereby force the encoder to learn a better feature representation, the hidden layer vector is an all-zero vector of the same size as $x_t$.

Fourth, this embodiment uses a weighted nearest neighbor classification algorithm as the classifier, which obtains the recognition category of the human skeleton action sequence from the characterization vector of the action sequence. Specifically, after the k nearest samples are obtained, the votes for each category are counted and the recognition result is obtained by weighted voting. In this embodiment, k is 9. The voting weight $w_i$ of sample i is calculated from its cosine distance $d_i$.
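
The weighted voting step can be sketched as follows. The weight expression is not available in this text, so the default `weight_fn` below, `exp(-d)`, is only a placeholder assumption that gives closer samples larger weights; the function name and the toy data are likewise hypothetical.

```python
import numpy as np

def weighted_knn_predict(query, train_vecs, train_labels, k=9,
                         weight_fn=lambda d: np.exp(-d)):
    """Classify `query` by distance-weighted voting among its k nearest neighbours.

    ASSUMPTION: weight_fn is a stand-in for the patent's weight expression.
    """
    q = query / np.linalg.norm(query)
    t = train_vecs / np.linalg.norm(train_vecs, axis=1, keepdims=True)
    dist = 1.0 - t @ q                       # cosine distance to every training vector
    nearest = np.argsort(dist)[:k]           # indices of the k closest samples
    votes = {}
    for i in nearest:
        label = train_labels[i]
        votes[label] = votes.get(label, 0.0) + float(weight_fn(dist[i]))
    return max(votes, key=votes.get)

train_vecs = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [0.2, 0.8]])
train_labels = [0, 0, 1, 1, 1]
query = np.array([1.0, 0.05])

print(weighted_knn_predict(query, train_vecs, train_labels, k=5))  # 0
```

In this toy case an unweighted majority vote over all five neighbours would pick class 1 (three votes against two), while the weighted vote picks class 0 because the two class-0 samples are much closer to the query.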

To verify the performance of the proposed action recognition method, this embodiment compares the invention with other existing unsupervised skeleton action recognition methods, including LongT GAN (Long-Term Dynamics GAN, a long-term dynamics generative adversarial network), P&C (Predict & Cluster), and MS2L (Multi-Task Self-Supervised Learning), on three public benchmark datasets, using recognition accuracy as the evaluation metric.

Table 1 compares the recognition accuracy of the invention with that of other unsupervised skeleton-based action recognition methods on the NTU RGB+D 60 dataset. CS (Cross-Subject) and CV (Cross-View) are the two evaluation protocols of the dataset: CS splits the training and test sets by the volunteers who performed the actions, and CV splits them by the camera viewpoints from which the data were captured.

TABLE 1 Comparison of recognition accuracy (%) on the NTU RGB+D 60 dataset

As can be seen from Table 1, under both the CS and CV protocols of the NTU RGB+D 60 dataset, the proposed unsupervised human action recognition method based on the cyclic graph convolution automatic encoder outperforms the existing methods, by 1.8% and 2.9% respectively.

Table 2 shows the comparison of recognition accuracy of the present invention with other unsupervised skeleton-based motion recognition methods in the NW-UCLA dataset.

TABLE 2 Comparison of recognition accuracy (%) on the NW-UCLA dataset

As can be seen from table 2, although the existing method has achieved an accuracy of 80% or more on the NW-UCLA data set, the method proposed by the present invention can still further improve the recognition accuracy.

Table 3 is a comparison of recognition accuracy of the present invention with other unsupervised skeleton-based motion recognition methods on the UWA3D dataset. Wherein V3 and V4 represent the two test methods on the UWA3D data set.

TABLE 3 Comparison of recognition accuracy (%) on the UWA3D dataset

As can be seen from Table 3, on the UWA3D dataset the method achieves higher recognition accuracy than the other unsupervised skeleton-based action recognition methods under all test protocols, and under the V4 test condition the recognition accuracy is improved by 2.4%. Together, the experiments on the three datasets show that the unsupervised human skeleton action recognition method based on the cyclic graph convolution automatic encoder achieves consistently high recognition accuracy across different datasets and different test conditions.

The foregoing detailed description of the preferred embodiments of the invention has been presented. It should be understood that numerous modifications and variations could be devised by those skilled in the art in light of the present teachings without departing from the inventive concepts. Therefore, the technical solutions available to those skilled in the art through logic analysis, reasoning and limited experiments based on the prior art according to the concept of the present invention should be within the scope of protection defined by the claims.

Claims

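1. An unsupervised skeleton action identification method based on a cyclic graph convolution automatic encoder, characterized by comprising the following steps:

S1, inputting a human skeleton action sequence into a cyclic graph convolution encoder;

S2, obtaining a characterization vector of the action sequence as the output of the cyclic graph convolution encoder;

S3, classifying the characterization vector of the action sequence with a weighted nearest neighbor classification algorithm to obtain the recognition category of the human skeleton action sequence;

wherein the cyclic graph convolution encoder includes: multilayer spatial joint attention modules, which combine the human skeleton action sequence with the hidden layer of the cyclic graph convolution encoder to adaptively measure the importance of different joints in different actions and obtain a weighted skeleton sequence; and multiple graph convolution gated recurrent unit (GRU) layers, which integrate the connection relation features of the weighted skeleton sequence to obtain the characterization vector of the action sequence.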

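2. The unsupervised skeleton motion recognition method based on the cyclic graph convolution automatic encoder according to claim 1, wherein in the spatial joint attention module, the weighted skeleton sequence is calculated as follows:

$$x'_t = (\alpha_t + 1) \cdot x_t$$

$$s_t = U_s \, \phi(W_x x_t + W_h h_{t-1} + b_s) + b_u$$

where $x'_t$ represents the weighted skeleton sequence, $\alpha_t$ represents the importance of each joint, $s_t$ represents the importance score of each joint, $x_t$ represents the coordinates of the N joints at time t, $h_{t-1}$ represents the hidden layer information, $W_x$ and $W_h$ represent learnable parameter matrices, $\phi$ represents an activation function, and $b_s$ and $b_u$ represent biases.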

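3. The unsupervised skeleton motion recognition method based on the cyclic graph convolution automatic encoder according to claim 1, wherein in the graph convolution gated recurrent unit (GRU) layer, the connection relation features of the weighted skeleton sequence are integrated according to the expression:

$$H^{(l+1)} = \tau\left(\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} H^{(l)} \Theta^{(l)}\right)$$

where $H^{(l+1)}$ represents the output of the (l+1)-th graph convolution layer, $\tilde{A} = A + I$ represents the symmetric adjacency matrix with self-loops, $A$ represents the adjacency matrix, $I$ represents the identity matrix, $\tilde{D}$ is the degree matrix of $\tilde{A}$, $\tau$ represents the activation function, $H^{(l)}$ represents the output of the l-th graph convolution layer, and $\Theta^{(l)}$ represents the learnable parameter matrix of the l-th layer.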

4. The unsupervised skeleton motion recognition method based on the cyclic graph convolution automatic encoder according to claim 3, wherein, in the expression of the graph convolution gated recurrent unit (GRU) layer, $z_t$ represents the update gate, $r_t$ represents the reset gate, $\tilde{h}_t$ represents the candidate activation vector, $*_{\mathcal{G}}$ represents the graph convolution operation, which corresponds to the expression for $H^{(l+1)}$, $W_{xz}$, $W_{hz}$, $W_{xr}$, $W_{hr}$, $W_{xh}$ and $W_{hh}$ represent the parameter matrices in the different gates, and $\odot$ represents the Hadamard product.

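5. The unsupervised skeleton motion recognition method of claim 1, wherein the training of the cyclic graph convolution encoder comprises the following steps:

A1, inputting a training motion sequence set into the cyclic graph convolution encoder to obtain the characterization vectors of the motion sequences;

A2, inputting the characterization vector and the hidden layer vector of the motion sequence into a decoder, which restores the sequence to obtain a reconstructed motion sequence set;

A3, comparing the reconstructed motion sequence set with the training motion sequence set, and calculating a loss function value through a reconstruction loss function;

wherein steps A1 to A3 are repeated until the loss function value reaches a preset cutoff condition.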

6. The unsupervised skeleton motion recognition method based on cyclic graph convolution auto-encoder of claim 5, wherein the hidden layer vector is a vector with a value of zero and a length identical to a human skeleton motion sequence.

7. The unsupervised skeleton motion recognition method based on the cyclic graph convolution automatic encoder according to claim 5, wherein the expression of the reconstruction loss function is:

$$L = \| X - \hat{X} \|_F$$

where $X$ represents the training motion sequence set, $\hat{X}$ represents the reconstructed motion sequence set, $\|\cdot\|_F$ represents the Frobenius norm, and $L$ represents the loss function value.

8. The unsupervised skeleton motion recognition method based on cyclic graph convolution auto-encoder of claim 5, wherein the cyclic graph convolution encoder is trained using a gradient descent method.

9. The unsupervised skeleton motion recognition method based on the cyclic graph convolution automatic encoder of claim 1, wherein the spatial joint attention module and the graph convolution gated recurrent unit (GRU) layer each have three layers.

10. The unsupervised skeleton motion recognition method based on the cyclic graph convolution automatic encoder as claimed in claim 1, wherein in the weighted nearest neighbor classification algorithm, the k nearest samples are obtained, k being a preset value, the votes for each category are counted, and the recognition result is obtained through weighted voting, wherein the voting weight $w_i$ of sample i is calculated from its cosine distance $d_i$.