CN113780129A

CN113780129A - Motion recognition method based on unsupervised graph sequence predictive coding and storage medium

Info

Publication number: CN113780129A
Application number: CN202111009498.4A
Authority: CN
Inventors: 赵生捷; 梁爽; 叶珂男
Original assignee: Tongji University
Current assignee: Tongji University
Priority date: 2021-08-31
Filing date: 2021-08-31
Publication date: 2021-12-10
Anticipated expiration: 2041-08-31
Also published as: CN113780129B

Abstract

The invention relates to an action recognition method based on unsupervised graph sequence predictive coding and a storage medium, wherein the action recognition method comprises the training and the use of a model and is used for recognizing various actions performed by a human body in a skeleton sequence, and the action recognition method aims to solve the problems that the existing action recognition method highly depends on a large amount of marked data, and has lower precision under the condition of only a small amount of marks, and the existing unsupervised method is over-fitted without utilizing topological information of a graph and has poorer serious generalization capability. The system method comprises the steps of unchanged visual angle transformation, resampling and block-level bone map data enhancement of bone sequence data; embedding and extracting a space-time graph convolution bone sequence block; aggregating context features of the graph convolution cyclic neural network; the predictive coding constructs a positive sample pair and a negative sample pair; and extracting features through a pre-training model, and obtaining action categories corresponding to the bone sequences to be recognized by using a classifier. Compared with the prior art, the method has the advantages of low training difficulty, high identification precision, excellent performance and the like.

Description

Motion recognition method based on unsupervised graph sequence predictive coding and storage medium

Technical Field

The present invention relates to the field of motion recognition technologies, and in particular, to a motion recognition method and a storage medium based on unsupervised graph sequence predictive coding.

Background

In computer vision tasks, motion recognition is a hot problem that is now of great interest. The fields of unmanned robots, smart cities, intelligent transportation and the like need to analyze and recognize human behaviors. In recent years, as image convolution is valued and utilized by more and more researchers, the development of a pose estimation algorithm and a depth sensor, and the robustness and the visual removal characteristics of skeleton data are concentrated on the characteristics of actions, and action identification by using the skeleton data becomes a hot point of current research.

Early motion recognition was mainly based on still pictures. In recent years, as research progresses, more and more researchers have given more attention to the dynamic nature of motion, and thus have turned their attention to video-based motion recognition. The most significant difference of video-based motion recognition compared to still picture-based methods is the increase of the time dimension, the data becoming a time sequence of 2D pictures. The time dimension provides rich features and brings great challenges — computational power and increased storage space. Skeletal-based motion recognition alleviates the computational requirements of motion recognition algorithms, but most methods are based on supervised tasks, highly dependent on the number and quality of data set samples. Because of the high inter-class similarity of actions, accurately labeling enough data to train a deep learning model is a challenging and costly problem, and it is therefore highly desirable for researchers to find a robust, label-free method to learn representations of action recognition to better utilize temporal and spatial information. Existing unsupervised work attempts to address the borrowing task of drawing or reconstructing a skeleton sequence using the potential embedding of encoders. However, these codec models typically flatten spatial channels into a single feature vector, ignoring the spatial relationships of the skeleton map. And these borrowing tasks often have problems with overfitting and are not always helpful in downstream tasks.

Disclosure of Invention

The present invention aims to overcome the defects of the prior art, and provides an action recognition method based on unsupervised graph sequence predictive coding with low training difficulty, high recognition accuracy and excellent performance, and a storage medium.

The purpose of the invention can be realized by the following technical scheme:

an action recognition method based on unsupervised graph sequence predictive coding, the action recognition method comprises the following steps:

step 1: acquiring a skeleton data sequence, and preprocessing the data sequence to obtain an input training data block;

step 2: inputting the input training data block into a null graph convolutional network f (-) to obtain an embedded representation of the sequence skeleton graph block, inputting the embedded representation into a cyclic neural network g (-) and aggregating context information;

and step 3: predicting the next sequence of bone picture block embedded representation through a prediction network phi (-) according to the context information, inputting the predicted embedded representation into a recurrent neural network g (-) to obtain a new context representation, and repeating for a plurality of times to obtain a series of predicted picture embedded representations;

and 4, step 4: comparing the obtained prediction graph embedded representation with the real graph embedded representation, optimizing the space-time graph convolution network f (-) and the graph convolution cyclic neural network g (-) and the prediction network phi (-) through comparing loss function reverse conduction, and obtaining a pre-training model through a plurality of iterations;

and 5: removing the prediction network phi (-) according to the obtained pre-training model, taking the parts of the space-time graph convolution network f (-) and the cyclic neural network g (-) as feature extractors, adding a classifier on the upper layer of the feature extractors, and obtaining a final classification model through training of inputting labeled data;

step 6: acquiring a bone data sequence to be detected, and preprocessing the bone data sequence to obtain an input prediction data block;

and 7: and inputting the input prediction data block into the classification model, predicting various action probabilities of the people needing to be identified, and completing action identification.

Preferably, the step 1 specifically comprises:

step 1-1: for a given bone sequenceObtaining the bone sequence data of the corrected view angle by the data X through view angle invariant transformation F (-) to obtain

Step 1-2: bone sequence data for a given corrected view angle

And input sample window size T_windowFirst, will have T_sampleThe skeleton sequence of the frame is upsampled to T by linear interpolation_windowA sequence of xk frames, where k ∈ N +, T_window·(k-1)＜T_sample＜T_window·k；

Step 1-3: for interpolated data obtained in the preceding step

Is divided into fractions containing T_patchSequence block of frame, P ═ P₁,p₂,...,p_nFor each sequence block p_iApplying random skeleton map data enhancement to finally obtain enhanced skeleton sequence blocks

Preferably, the step 2 specifically comprises:

step 2-1: according to the bone sequence block obtained in the step 1

Inputting the input data block into the space-map convolutional network f (-) to obtain the embedded representation

Step 2-2: according to step 2-1: the resulting embedded representation

Obtaining a context representation C in an input graph convolution recurrent neural network g (-) to_i。

Preferably, the step 3 specifically comprises:

step 3-1: context information C obtained according to step 2_iPredicting a next sequence of bone tile-embedded representations over a prediction network phi (·)

Step 3-2: graph-embedded representation obtained according to step 3-1

Obtaining context information via a graph convolution recurrent neural network g (-) to

Step 3-3: context information obtained according to step 3-2

Repeating the step 3-1 and the step 3-2 for several times by analogy to obtain a series of predicted graph embedding representations

Preferably, the space-time graph convolution network f (-) and the recurrent neural network g (-) are both constructed based on a graph convolution neural network, and the prediction network Φ (-) is constructed based on a neural network.

More preferably, the graph convolution rule of the space-time graph convolution network f (-) and the recurrent neural network g (-) is:

wherein the content of the first and second substances,

and

respectively representing an input characteristic diagram and an output characteristic diagram;

the unit matrix I is added to the drawing defined tie matrix A, namely the node itself links the node itself,

representing its diagonal matrix, τ the activation function, and Θ the learnable weight matrix of the atlas layer.

More preferably, the structure of the recurrent neural network g (-) is based on a gated recurrent unit GRU, and the calculation rule is as follows:

wherein z is_tIndicating an update gate, r_tA reset gate is shown, which is,

representing candidate activation vectors;

is a graph convolution operator; an indication of a hadamard product; sigma represents a Sigmoid activation function, and psi is a Tanh activation function; omega_zz、ω_hz、ω_zrAnd ω_hrRespectively representing the parameters of each memory gate; q. q.s_tIs the memory/forgetting weight.

Preferably, the contrast loss function in step 4 is specifically:

wherein z is_i,kAnd

respectively represent z taken from the i-th sample_kAnd

representing embedded representation pairs

The similarity of (c).

Preferably, the step 5 specifically comprises:

step 5-1: the training model obtained according to the step 4 comprises a space-time graph convolution network f (-) and a graph convolution recurrent neural network g (-) and a prediction network phi (-) and only f (-) and g (-) are used for replacing phi (-) with a classifier network

Constructing a classification model;

step 5-2: inputting the training data with the labels, and training the labeled training data to obtain a final classification model.

A storage medium storing a motion recognition method based on unsupervised graph sequence predictive coding according to any one of the above.

Compared with the prior art, the invention has the following beneficial effects:

firstly, the skeleton action recognition framework based on the unsupervised graph convolution can learn the effective representation of human body action from unlabeled data through comparison and learning, so that the requirement of sample labeling is reduced, and the training difficulty is simplified.

Secondly, the recognition precision is high: the motion recognition method based on the unsupervised graph sequence predictive coding simultaneously and fully utilizes the space and time dependency by utilizing the graph convolution and the contrast learning, avoids the limitation of generative learning and sample-based contrast learning in the motion recognition based on the unsupervised skeleton, and improves the motion recognition precision.

Thirdly, excellent performance: compared with the latest SOTA method on three reference data sets, the action identification method based on the unsupervised graph sequence predictive coding has the performance that the SOTA is higher than 20 percent.

Drawings

FIG. 1 is a flow chart of a method of motion recognition in the present invention;

FIG. 2 is a schematic view of the overall framework of the present invention;

FIG. 3 is a schematic diagram of training of a contrast learning-based and training model according to the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be obtained by a person skilled in the art without any inventive step based on the embodiments of the present invention, shall fall within the scope of protection of the present invention.

As shown in fig. 1, the present embodiment provides a skeleton motion recognition method based on unsupervised graph convolution, and the main objective is to learn a representation of motion recognition from unlabeled data by using an unsupervised contrast learning method, and to maximally utilize time information of a skeleton sequence and space information of a skeleton graph, and perform classification model training on the learned representation by using a small amount of labeled data, so as to more accurately recognize human motions.

As shown in fig. 1 and fig. 2, the method for identifying an action based on unsupervised graph sequence predictive coding in this embodiment mainly includes the following steps:

step 1: obtaining a skeleton data sequence, preprocessing the data through view angle invariant transformation, time window resampling and block-level data enhancement, and obtaining an input training data block which is segmented into specific lengths and has a fixed window size;

the method specifically comprises the following steps:

step 1-1: for given bone sequence data X, obtaining bone sequence data of corrected visual angle through constant visual angle transformation F (-) to obtain

Step 1-2: bone sequence data for a given corrected view angle

Step 1-3: for interpolated data obtained in the preceding step

Is divided into fractions containing T_patchSequence block of frame, P ═ P₁,p₂,...,p_nFor each sequence block p_iRandom skeleton map data enhancement is applied, the same enhancement is applied in blocks, and different enhancements are applied among blocks; the enhancement comprises displacement, inclination and rotation, and finally the enhanced bone sequence block is obtained

the method specifically comprises the following steps:

step 2-1: according to the bone sequence block obtained in the step 1

Step 2-2: according to step 2-1: the resulting embedded representation

Obtaining a context representation C in an input graph convolution recurrent neural network g (-) to_i；

the method specifically comprises the following steps:

Step 3-2: graph-embedded representation obtained according to step 3-1

Step 3-3: context information obtained according to step 3-2

Repeating the step 3-1 and the step 3-2 for a plurality of times by analogy to obtain a series of predicted graph inlaysIn represents

And 4, step 4: comparing the obtained prediction graph embedded representation with the real graph embedded representation, optimizing the space-time graph convolution network f (-) and the graph convolution cyclic neural network g (-) and the prediction network phi (-) through comparing the loss function reverse conduction, and obtaining a pre-training model through a plurality of iterations, as shown in FIG. 3;

in the embodiment, both the space-time graph convolution network f (-) and the recurrent neural network g (-) are constructed based on a graph convolution neural network, and the prediction network phi (-) is constructed based on a neural network;

the graph convolution rule of the space-time graph convolution network f (-) and the recurrent neural network g (-) is as follows:

wherein the content of the first and second substances,

and

representing the angle matrix, tau representing the activation function, theta representing the learnable weight matrix of the graph convolution layer;

the construction of the recurrent neural network g (-) is based on a gated recurrent unit GRU, and the calculation rule is as follows:

wherein z is_tIndicating an update gate, r_tA reset gate is shown, which is,

representing candidate activation vectors;

The contrast loss function is specifically:

wherein z is_i,kAnd

respectively represent z taken from the i-th sample_kAnd

representing embedded representation pairs

The similarity of (2);

the method specifically comprises the following steps:

Constructing a classification model;

step 5-2: inputting labeled training data, and performing labeled data training to obtain a final classification model;

In the embodiment, the prediction network phi (-) is a single-layer fully-connected neural network construction and classifier network

The multi-classification classifier is obtained by training through methods such as a multilayer perceptron.

In order to support and verify the performance of the motion recognition method proposed by the present invention, the method is compared with other latest leading-edge motion recognition methods on three widely used public standard data sets, and the comparison results are shown in table 1.

Experimental comparisons three widely used public standard data sets were used: NTU RGB + D60, Northwestern-UCLA (NW-UCLA) and UWA3D Multiview Activity II (UWA 3D). The experiment adopts a linear probe verification method widely used by an unsupervised learning method to verify, namely, the weight of a pre-training model is fixed, a linear classifier taking the output characteristics of the pre-training model as input is trained, and the performance of a test set is reported to measure the effectiveness of the learning representation.

TABLE 1 comparative results

The comparison results show that the motion recognition method proposed in this example is excellent in performance.

While the invention has been described with reference to specific embodiments, the invention is not limited thereto, and various equivalent modifications and substitutions can be easily made by those skilled in the art within the technical scope of the invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims

1. A motion recognition method based on unsupervised graph sequence predictive coding is characterized by comprising the following steps:

2. The method for motion recognition based on unsupervised graph sequence predictive coding according to claim 1, wherein the step 1 specifically comprises:

Step 1-2: bone sequence data for a given corrected view angle

Step 1-3: for interpolated data obtained in the preceding step

3. The method for motion recognition based on unsupervised graph sequence predictive coding according to claim 1, wherein the step 2 specifically comprises:

step 2-1: according to the bone sequence block obtained in the step 1

Step 2-2: according to step 2-1: the resulting embedded representation

4. The method for motion recognition based on unsupervised graph sequence predictive coding according to claim 1, wherein the step 3 specifically comprises:

Step 3-2: according toGraph-embedded representation from step 3-1

Step 3-3: context information obtained according to step 3-2

5. The method of claim 1, wherein the space-time graph convolutional network f (-) and the recurrent neural network g (-) are constructed based on a graph convolution neural network, and the prediction network Φ (-) is constructed based on a neural network.

6. The method of claim 5, wherein the graph convolution rule between the spatio-temporal graph convolution network f (-) and the recurrent neural network g (-) is as follows:

wherein the content of the first and second substances,

and

respectively representing input characteristicsOutputting a characteristic diagram;

7. The method of claim 5, wherein the recurrent neural network g (-) is constructed based on gated recurrent units GRU, and the calculation rule is:

wherein z is_tIndicating an update gate, r_tA reset gate is shown, which is,

representing candidate activation vectors;

for pattern convolutionAn operator; an indication of a hadamard product; sigma represents a Sigmoid activation function, and psi is a Tanh activation function; omega_zz、ω_hz、ω_zrAnd ω_hrRespectively representing the parameters of each memory gate; q. q.s_tIs the memory/forgetting weight.

8. The method according to claim 1, wherein the contrast loss function in step 4 is specifically:

wherein z is_i,kAnd

respectively represent z taken from the i-th sample_kAnd

representing embedded representation pairs

The similarity of (c).

9. The method for motion recognition based on unsupervised graph sequence predictive coding according to claim 1, wherein the step 5 specifically comprises:

Constructing a classification model;

10. A storage medium storing an unsupervised graph sequence predictive coding-based motion recognition method according to any one of claims 1 to 9.