CN110728183B - Human body action recognition method of neural network based on attention mechanism - Google Patents

Human body action recognition method of neural network based on attention mechanism

Info

Publication number
CN110728183B
Authority
CN
China
Prior art keywords
network
sub
attention
deep
neural network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910846654.9A
Other languages
Chinese (zh)
Other versions
CN110728183A (en)
Inventor
侯永宏
李岳阳
肖任意
李翔宇
郭子慧
刘艳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University
Original Assignee
Tianjin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University
Priority to CN201910846654.9A
Publication of CN110728183A
Application granted
Publication of CN110728183B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a human body action recognition method based on an attention-mechanism neural network, providing an end-to-end trainable network, comprising a deep convolution sub-network and an attention sub-network, that recognizes human actions from skeleton data. First, the skeleton sequence is encoded as a color space-time image and fed into the deep convolution sub-network, which extracts deep features that are then mapped into the label space through a fully connected layer. In the attention sub-network, hand-crafted features representing the importance of each joint are extracted and attention weights are learned through a simple but efficient linear mapping; the result is likewise mapped into the label space by a fully connected layer. The final recognition result is obtained by multiplicative fusion of the two outputs. The invention can automatically extract effective deep features from the data to the maximum extent. The network structure of the present invention comprises two sub-networks that are jointly trained in an end-to-end fashion without post-processing.

Description

Human body action recognition method of neural network based on attention mechanism
Technical Field
The invention belongs to the field of computer vision, and particularly relates to a human body action recognition method of a neural network based on an attention mechanism.
Background
Human body action recognition has a very wide range of applications, such as human-computer interaction, video surveillance and video understanding. Current mainstream methods can be broadly divided into those based on RGB data, depth data and skeleton data. Skeleton data is a higher-level representation than RGB or depth data and is robust to changes in viewpoint, position and appearance, although it remains challenging because of the complex spatio-temporal variations of the skeleton joints. With the spread of affordable and efficient depth cameras such as the Microsoft Kinect and of real-time skeleton estimation algorithms, human action recognition based on 3D skeletons is attracting more and more attention.
Although traditional methods based on manually designed features can achieve good accuracy, designing such features demands rich experience and considerable skill, and hand-crafted features transfer poorly across datasets, so a better approach to recognizing human actions is needed. With the progress of deep learning, convolutional neural networks (CNNs) and recurrent neural networks (RNNs) have achieved remarkable results in recent years in fields such as image classification, object detection and natural language processing. More recently, attention mechanisms have become popular because they can focus on the important regions of an input, thereby improving task performance.
Currently, deep-learning action recognition methods for skeleton data can be divided into two types according to how the skeleton sequence is represented and fed into the deep neural network: CNN-based methods and RNN-based methods.
The first approach encodes the skeleton sequence into texture images, which are then fed into CNNs for feature extraction and classification. For example, the joint coordinates of the skeleton sequence are encoded as a matrix and normalized with respect to the entire training dataset, with the three Cartesian components (x, y, z) of each skeleton joint mapped to the three channels (R, G, B) of a color image. However, this normalization does not guarantee scale invariance.
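To make the drawback concrete, the following NumPy sketch illustrates this prior-art encoding under stated assumptions (the function name, the use of per-axis dataset extrema and the 8-bit quantization are ours, not taken from any cited reference):

```python
# Illustrative sketch of the prior-art encoding: coordinates are normalized
# against extrema computed over the whole training set, so the three
# Cartesian components become the three channels of a color image.
import numpy as np

def encode_prior_art(seq, ds_min, ds_max):
    """seq: (T, N, 3) joint coordinates; ds_min, ds_max: per-axis extrema
    over the entire training dataset, each of shape (3,)."""
    norm = (seq - ds_min) / (ds_max - ds_min)     # dataset-level normalization
    return np.round(255 * norm).astype(np.uint8)  # (T, N, 3) RGB texture image
```

Because ds_min and ds_max come from the whole training set, two copies of the same motion performed at different scales map to different images, which is the invariance problem noted above.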
The second approach extracts features from each time step of the skeleton sequence and feeds the frame-based features into a recurrent neural network. Recent attention models have enhanced this approach by identifying the body parts or time steps that are most discriminative for the action classification task. However, the methods proposed so far tend to overemphasize temporal information and underestimate spatial information, and spatial attention is often ignored. Moreover, when recurrent networks such as LSTMs or GRUs are used to recognize skeleton sequences, they rely on a large amount of sequential computation, which limits the processing speed; in addition, introducing a recurrent network greatly increases the scale of the model, so training takes more time.
Beyond the above, deep-learning-based human action recognition depends heavily on the preprocessing of the skeleton sequence, and the spatio-temporal features produced by this step directly determine recognition quality, so how to extract good spatio-temporal features for efficiently recognizing complex actions remains an open problem.
Disclosure of Invention
The invention discloses a human body action recognition method based on an attention-mechanism neural network. The method adopts end-to-end supervised training and effectively improves the accuracy of human body action recognition by combining feature extraction with a convolutional neural network and an attention mechanism that captures the key skeleton joints.
The invention adopts the following technical scheme for solving the technical problems:
a human body action recognition method of a neural network based on an attention mechanism comprises the following steps:
1) Constructing a feature extraction and classification neural network, wherein the neural network comprises two sub-models, namely a deep convolution sub-network and an attention sub-network;
2) Constructing an end-to-end supervised training scheme, processing an original skeleton sequence, and encoding the skeleton sequence into a color space-time diagramThe three-dimensional matrix is input into a deep convolution sub-network to extract the characteristics of the three-dimensional matrix and output a vector P 1
3) In the attention sub-network, the hand-made characteristics representing the joint movement degree are extracted, the key nodes of the movement are captured, and a vector P is output 2
4) Finally P is arranged 1 And P 2 And (3) fusing, namely training the model by reducing the loss function through an optimization means until the network converges, and obtaining the final recognition accuracy.
Moreover, the deep convolution sub-network adopts a stacked convolutional neural network structure, and the attention sub-network adopts a combination of a custom layer and a fully connected layer.
Further, in step 2):

$$P_1 = W_1 \tilde{O} + b_1, \qquad \tilde{O} = \mathrm{GAP}\bigl(\mathrm{ReLU}(\mathrm{BN}(\mathrm{Conv}(O)))\bigr)$$

where $P_1$ is the deep spatio-temporal feature output by the deep convolution sub-network, representing the probability of the action belonging to each category in the label space; $W_1 \in \mathbb{R}^{M \times C}$ and $b_1 \in \mathbb{R}^{M \times 1}$ denote respectively the weight matrix and bias vector of the fully connected layer; $M$ is the number of label categories and $C$ is the output dimension of the deep convolution sub-network; $\tilde{O}$ is the spatio-temporal feature extracted by the deep convolution sub-network, i.e. the output of GAP in DenseNet-161; $O$ is the color image encoded from the skeleton sequence; GAP is the global average pooling layer, Conv a convolution layer, ReLU the activation function and BN the batch normalization layer.
Further, in step 3):

$$P_2 = W_2 V + b_2$$

where $P_2$ is the attention vector, and $W_2 \in \mathbb{R}^{M \times N}$ and $b_2 \in \mathbb{R}^{M \times 1}$ are respectively the weight matrix and bias vector of the fully connected layer;

$$V = V_X \odot V_Y \odot V_Z$$

where $\odot$ denotes element-wise multiplication and $V_X = [\sigma_{x_1}^2, \ldots, \sigma_{x_N}^2]^{\top}$ collects the per-joint variances $\sigma_{x_k}^2 = \frac{1}{T}\sum_{t=1}^{T}(x_{t,k} - \bar{x}_k)^2$, with $V_Y$ and $V_Z$ defined analogously; $\bar{x}_k$, $\bar{y}_k$, $\bar{z}_k$ denote the mean values of $x_k$, $y_k$, $z_k$, which are respectively the X, Y, Z coordinates of the $k$-th joint in the skeleton sequence, $x_k = [x_{1,k}, \ldots, x_{t,k}, \ldots, x_{T,k}]$, $y_k = [y_{1,k}, \ldots, y_{t,k}, \ldots, y_{T,k}]$, $z_k = [z_{1,k}, \ldots, z_{t,k}, \ldots, z_{T,k}]$, and $T$ is the number of frames of the skeleton sequence.
Furthermore, step 4) is specifically: the deep spatio-temporal feature $P_1$ and the attention vector $P_2$ obtained above are multiplied element-wise to give the final action classification result:

$$\hat{y} = \mathrm{softmax}\bigl(P_1 \odot P_2\bigr)$$

where $\hat{y}$ is the predicted result; a cross-entropy loss measures the difference between the true class label $y$ and the prediction $\hat{y}$.
The invention has the following advantages and beneficial effects:
1. The invention provides a human body action recognition method based on an attention-mechanism neural network. It is built on end-to-end supervised deep learning, requires no manual feature extraction during training, and can automatically extract effective deep features from the data to the maximum extent. The network structure comprises two sub-networks that are jointly trained in an end-to-end fashion without post-processing.
2. In the attention model, the variance feature of each joint is extracted through an efficient linear mapping and the attention weights are learned, so that the key joints for motion recognition in the skeleton data can be effectively captured, significantly improving recognition accuracy on different datasets.
3. During data processing, each skeleton sequence is converted into a space-time image without any normalization, which preserves the translation and scale invariance of the skeleton data during network training.
4. Results better than current mainstream attention models are obtained without introducing a recurrent neural network into the attention sub-network, which avoids the weakness of recurrent networks at extracting spatial information, reduces the computational load of the network and speeds up training.
Drawings
FIG. 1 is a diagram showing a network structure of a human motion recognition method of an attention mechanism according to an embodiment of the present invention;
FIG. 2 is a diagram of a preprocessing procedure for a skeleton sequence;
FIG. 3 is a graph comparing performance of different neural networks on four data sets;
wherein (a) is an NTU-CS dataset; (b) is an NTU-CV dataset; (c) is a SYSU-3D dataset; (d) is a UTD-MHAD dataset.
Detailed Description
The invention will now be described in further detail by way of specific examples, which are given by way of illustration only and not by way of limitation, with reference to the accompanying drawings.
The invention discloses a human body action recognition method based on an attention-mechanism neural network. The method adopts end-to-end supervised training and effectively improves the accuracy of human body action recognition by combining feature extraction with a convolutional neural network and an attention mechanism that captures the key skeleton joints.
The deep convolution sub-network and the attention sub-network are constructed; based on a stacked convolutional neural network design, the model comprises convolution layers, normalization layers, fully connected layers and so on.
Fig. 1 shows the network structure of the attention-mechanism human action recognition method of the present invention.
The action recognition network comprises two sub-networks: a deep convolution sub-network and an attention sub-network.
The deep convolutional sub-network adopts DenseNet-161 as the main body, while the front-end coding network adopts a stacked convolutional network comprising 4 blocks, each consisting of a convolution layer, a batch normalization layer and a ReLU layer. Between blocks there is a 2 x 2 transition layer for pooling and downsampling the feature maps. Finally, a Global Average Pooling layer pools the feature maps globally, and the result is output through a softmax layer.
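For concreteness, the following Keras sketch outlines a sub-network of this shape. It is only an illustration: Keras ships no DenseNet-161, so a plain stacked CNN stands in for the DenseNet-161 main body, and the filter counts are assumptions rather than the patent's configuration.

```python
# A minimal Keras sketch of the deep convolutional sub-network: stacked
# front-end coding blocks (Conv + BN + ReLU) with 2x2 pooling transitions,
# a GAP layer, and a fully connected softmax output over the action classes.
from tensorflow import keras
from tensorflow.keras import layers

def build_deep_conv_subnetwork(T, N, num_classes):
    inp = keras.Input(shape=(T, N, 3))               # encoded color space-time image O
    x = inp
    for filters in (64, 128, 256, 512):              # 4 coding blocks: Conv + BN + ReLU
        x = layers.Conv2D(filters, 3, padding="same")(x)
        x = layers.BatchNormalization()(x)
        x = layers.ReLU()(x)
        x = layers.MaxPooling2D(2, padding="same")(x)  # 2 x 2 transition layer
    x = layers.GlobalAveragePooling2D()(x)           # GAP -> deep feature
    out = layers.Dense(num_classes, activation="softmax")(x)  # FC + softmax
    return keras.Model(inp, out, name="deep_conv_subnetwork")
```

For example, build_deep_conv_subnetwork(T=64, N=25, num_classes=60) would give a model matching the NTU-RGB+D joint count and class count.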
The attention sub-network consists of 3 variance-calculation layers, 1 fusion layer and 1 fully connected layer. For each skeleton joint in the input three-dimensional matrix, the key joints are captured by computing the variance of that joint during the motion. The three variance values of the x, y and z coordinates are fused by multiplication so that all three coordinates are taken into account, and the result is output through a fully connected layer whose number of units equals the number of action classes in the dataset.
The datasets used in the invention are the NTU-RGB+D, SYSU-3D and UTD-MHAD datasets. The NTU-RGB+D dataset, captured by Nanyang Technological University, is the most authoritative dataset in the field of human action recognition; it contains 60 common human actions, including 10 two-person interactions, and has two evaluation protocols: cross-subject and cross-view. The SYSU-3D dataset, captured by Sun Yat-sen University, contains 12 action classes; although relatively small, it is difficult because of the high similarity between actions. The UTD-MHAD dataset contains 861 sequences; it is a medium-scale dataset with actions similar to SYSU-3D. All evaluations follow the protocols in the papers that introduced these datasets.
The frame count T of every skeleton sequence is normalized, with different datasets normalized to different frame counts, so that every skeleton sequence within the same dataset has the same number of frames; the target is generally chosen as the average frame count of most sequences in the dataset.
The normalized skeleton sequence $A_1, \ldots, A_T$ is input and transformed into a T x N x 3 tensor, where T is the number of frames, N is the number of skeleton joints per frame and 3 is the number of channels. Each row holds the coordinates of the different skeleton joints of one frame, and each column holds the coordinates of the same skeleton joint across frames.
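A hedged sketch of this preprocessing follows, with illustrative names; the linear resampling strategy is one plausible way to normalize the frame count, which the patent does not specify:

```python
# Resample every sequence to the dataset's common frame count T and stack
# the frames into a T x N x 3 array (rows = frames, columns = joints,
# channels = x/y/z coordinates).
import numpy as np

def preprocess_sequence(frames, T):
    """frames: list of (N, 3) arrays of joint coordinates, one per frame."""
    seq = np.asarray(frames, dtype=np.float32)          # (T0, N, 3)
    idx = np.linspace(0, len(seq) - 1, T).round().astype(int)
    return seq[idx]                                     # (T, N, 3)
```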
The preprocessed T x N x 3 tensor is input into the deep convolution sub-network built around DenseNet-161 for feature extraction and mapping, outputting a vector $\tilde{O}$. The specific process is as follows:

$$\tilde{O} = \mathrm{GAP}\bigl(\mathrm{ReLU}(\mathrm{BN}(\mathrm{Conv}(O)))\bigr)$$

where $O$ is the color image encoded from the skeleton sequence, GAP is the global average pooling layer ($\tilde{O}$ being the output of GAP in DenseNet-161), Conv is a convolution layer, ReLU is the activation function and BN is the batch normalization layer.
The resulting $\tilde{O}$, i.e. the spatio-temporal feature extracted by the deep convolution sub-network, is input into the fully connected layer and mapped into the label space as follows:

$$P_1 = W_1 \tilde{O} + b_1$$

where $W_1 \in \mathbb{R}^{M \times C}$ and $b_1 \in \mathbb{R}^{M \times 1}$ denote respectively the weight matrix and bias vector of the fully connected layer, and $M$ is the number of label categories. $P_1$, the deep spatio-temporal feature output by the deep convolution sub-network, represents the probability of the action belonging to each category in the label space.
The movement of the joints is represented in the attention sub-network using hand-crafted variance features. The input $O \in \mathbb{R}^{T \times N \times 3}$ is split into three matrices: $X \in \mathbb{R}^{T \times N}$, $Y \in \mathbb{R}^{T \times N}$ and $Z \in \mathbb{R}^{T \times N}$. To describe the variance feature in detail, take X as the example; let $X \in \mathbb{R}^{T \times N}$ be:

$$X = [x_1, \ldots, x_k, \ldots, x_N]$$

where $x_k$, the X coordinate of the $k$-th joint over the skeleton sequence, can be expressed as:

$$x_k = [x_{1,k}, \ldots, x_{t,k}, \ldots, x_{T,k}]$$
The variance $\sigma_{x_k}^2$ of $x_k$ is calculated as:

$$\sigma_{x_k}^2 = \frac{1}{T} \sum_{t=1}^{T} \bigl(x_{t,k} - \bar{x}_k\bigr)^2$$

where $\bar{x}_k$ is the mean value of $x_k$. The output $V_X \in \mathbb{R}^{N \times 1}$ shown in Fig. 1 can then be expressed as:

$$V_X = \bigl[\sigma_{x_1}^2, \ldots, \sigma_{x_k}^2, \ldots, \sigma_{x_N}^2\bigr]^{\top}$$

$V_Y \in \mathbb{R}^{N \times 1}$ and $V_Z \in \mathbb{R}^{N \times 1}$ are calculated in the same way from $y_k = [y_{1,k}, \ldots, y_{t,k}, \ldots, y_{T,k}]$ and $z_k = [z_{1,k}, \ldots, z_{t,k}, \ldots, z_{T,k}]$, the Y and Z coordinates of the $k$-th joint in the skeleton sequence; $T$ is the number of frames of the skeleton sequence.
The final variance feature $V \in \mathbb{R}^{N \times 1}$ is obtained as:

$$V = V_X \odot V_Y \odot V_Z$$

where $\odot$ is element-wise multiplication. The variance feature measures the motion amplitude and importance of each joint, capturing the key joints for identifying the action and thereby improving recognition accuracy. The fully connected layer then learns the attention weights $P_2 \in \mathbb{R}^{M \times 1}$ from the variance feature $V$:

$$P_2 = W_2 V + b_2$$

where $W_2 \in \mathbb{R}^{M \times N}$ and $b_2 \in \mathbb{R}^{M \times 1}$ are respectively the weight matrix and bias vector of the fully connected layer; $W_2$ is updated automatically during training.
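The attention branch is simple enough to sketch end to end in NumPy. The sketch below computes the per-joint variances, fuses them element-wise and applies the fully connected mapping; W2 and b2 are the trainable parameters of the real network, passed in here purely for illustration:

```python
# Per-joint variances over time for each axis, fused by element-wise
# multiplication into V, then a learned linear map to the label space.
import numpy as np

def attention_branch(O, W2, b2):
    """O: (T, N, 3) skeleton tensor; W2: (M, N) weights; b2: (M, 1) bias."""
    X, Y, Z = O[..., 0], O[..., 1], O[..., 2]   # each (T, N)
    V_X = X.var(axis=0)                         # variance of each joint's x over T frames, (N,)
    V_Y = Y.var(axis=0)
    V_Z = Z.var(axis=0)
    V = V_X * V_Y * V_Z                         # element-wise fusion, (N,)
    return W2 @ V[:, None] + b2                 # attention vector P2, (M, 1)
```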
The deep spatio-temporal feature $P_1$ and the attention vector $P_2$ obtained above are multiplied element-wise to give the final action classification result:

$$\hat{y} = \mathrm{softmax}\bigl(P_1 \odot P_2\bigr)$$

where $\hat{y}$ is the final predicted result. A cross-entropy loss function

$$L = -\sum_{i=1}^{M} y_i \log \hat{y}_i$$

measures the difference between the true class label $y$ and the prediction $\hat{y}$.
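The fusion and loss can be sketched as follows, under the assumption that the fused product passes through a softmax before the cross-entropy is applied (consistent with the softmax output layer described earlier); all names are illustrative:

```python
# Element-wise fusion of the two sub-network outputs, softmax to obtain
# class probabilities, and cross-entropy against the one-hot label.
import numpy as np

def fuse_and_loss(P1, P2, y):
    """P1, P2: (M,) sub-network outputs; y: (M,) one-hot true label."""
    logits = P1 * P2                              # element-wise fusion P1 (.) P2
    e = np.exp(logits - logits.max())
    y_hat = e / e.sum()                           # softmax -> predicted distribution
    loss = -np.sum(y * np.log(y_hat + 1e-12))     # cross-entropy between y and y_hat
    return y_hat, loss
```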
The invention adopts the Keras deep learning framework for the experiments; the specific parameters are shown in the table below:
after model training to convergence, evaluations were made on the NTU-RGB+D dataset, SYSU-3D dataset, UTD-MHAD dataset. The evaluation index is shown in the following table. Among them, MANs, VA-LSTM, etc. belong to other methods, ours (only DCM) to our method, but without attention to the subnetwork, our (dcm+sam) belongs to the complete method described above.
The above describes only preferred embodiments of the present invention, but the scope of protection is not limited thereto; any equivalent substitution or modification of the technical solution and inventive concept made by a person skilled in the art within the scope of this disclosure falls within the scope of protection of the present invention.

Claims (1)

1. A human body action recognition method of a neural network based on an attention mechanism, characterized by comprising the following steps:
1) constructing a feature extraction and classification neural network comprising two sub-models: a deep convolution sub-network and an attention sub-network;
2) constructing an end-to-end supervised training scheme: processing the original skeleton sequence, encoding it into a three-dimensional matrix forming a color space-time image, inputting the three-dimensional matrix into the deep convolution sub-network for feature extraction, and outputting a vector $P_1$;
3) in the attention sub-network, extracting hand-crafted features representing the degree of joint movement to capture the key joints of the motion, and outputting a vector $P_2$;
4) finally fusing $P_1$ and $P_2$, and training the model by minimizing the loss function through an optimizer until the network converges, obtaining the final recognition result;
the deep convolution sub-network adopts a structure of a laminated convolution neural network, and the attention sub-network adopts a combination of a custom layer and a full connection layer;
in step 2):

$$P_1 = W_1 \tilde{O} + b_1, \qquad \tilde{O} = \mathrm{GAP}\bigl(\mathrm{ReLU}(\mathrm{BN}(\mathrm{Conv}(O)))\bigr)$$

where $P_1$ is the deep spatio-temporal feature output by the deep convolution sub-network, representing the probability of the action belonging to each category in the label space; $W_1 \in \mathbb{R}^{M \times C}$ and $b_1 \in \mathbb{R}^{M \times 1}$ denote respectively the weight matrix and bias vector of the fully connected layer; $M$ is the number of label categories and $C$ is the output dimension of the deep convolution sub-network; $\tilde{O}$ is the spatio-temporal feature extracted by the deep convolution sub-network, i.e. the output of GAP in DenseNet-161; $O$ is the color image encoded from the skeleton sequence; GAP is the global average pooling layer, Conv a convolution layer, ReLU the activation function and BN the batch normalization layer;
in step 3):

$$P_2 = W_2 V + b_2$$

where $P_2$ is the attention vector, and $W_2 \in \mathbb{R}^{M \times N}$ and $b_2 \in \mathbb{R}^{M \times 1}$ are respectively the weight matrix and bias vector of the fully connected layer;

$$V = V_X \odot V_Y \odot V_Z$$

where $\odot$ denotes element-wise multiplication and $V_X = [\sigma_{x_1}^2, \ldots, \sigma_{x_N}^2]^{\top}$ collects the per-joint variances $\sigma_{x_k}^2 = \frac{1}{T}\sum_{t=1}^{T}(x_{t,k} - \bar{x}_k)^2$, with $V_Y$ and $V_Z$ defined analogously; $\bar{x}_k$, $\bar{y}_k$, $\bar{z}_k$ denote the mean values of $x_k$, $y_k$, $z_k$, which are respectively the X, Y, Z coordinates of the $k$-th joint in the skeleton sequence, $x_k = [x_{1,k}, \ldots, x_{t,k}, \ldots, x_{T,k}]$, $y_k = [y_{1,k}, \ldots, y_{t,k}, \ldots, y_{T,k}]$, $z_k = [z_{1,k}, \ldots, z_{t,k}, \ldots, z_{T,k}]$, and $T$ is the number of frames of the skeleton sequence;
step 4) is specifically: the deep spatio-temporal feature $P_1$ and the attention vector $P_2$ obtained above are multiplied element-wise to give the final action classification result:

$$\hat{y} = \mathrm{softmax}\bigl(P_1 \odot P_2\bigr)$$

where $\hat{y}$ is the predicted result, and a cross-entropy loss measures the difference between the true class label $y$ and the prediction $\hat{y}$.
CN201910846654.9A 2019-09-09 2019-09-09 Human body action recognition method of neural network based on attention mechanism Active CN110728183B (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN201910846654.9A | 2019-09-09 | 2019-09-09 | Human body action recognition method of neural network based on attention mechanism

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN201910846654.9A | 2019-09-09 | 2019-09-09 | Human body action recognition method of neural network based on attention mechanism

Publications (2)

Publication Number Publication Date
CN110728183A CN110728183A (en) 2020-01-24
CN110728183B (en) 2023-09-22

Family

ID=69217957

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910846654.9A Active CN110728183B (en) 2019-09-09 2019-09-09 Human body action recognition method of neural network based on attention mechanism

Country Status (1)

Country Link
CN (1) CN110728183B (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111339942B (en) * 2020-02-26 2022-07-12 山东大学 Method and system for recognizing skeleton action of graph convolution circulation network based on viewpoint adjustment
CN111046980B (en) * 2020-03-16 2020-06-30 腾讯科技(深圳)有限公司 Image detection method, device, equipment and computer readable storage medium
CN111582382B (en) * 2020-05-09 2023-10-31 Oppo广东移动通信有限公司 State identification method and device and electronic equipment
CN111967379B (en) * 2020-08-14 2022-04-08 西北工业大学 Human behavior recognition method based on RGB video and skeleton sequence
CN112130200B (en) * 2020-09-23 2021-07-20 电子科技大学 Fault identification method based on grad-CAM attention guidance
CN112613405B (en) * 2020-12-23 2022-03-25 电子科技大学 Method for recognizing actions at any visual angle
CN112560778B (en) * 2020-12-25 2022-05-27 万里云医疗信息科技(北京)有限公司 DR image body part identification method, device, equipment and readable storage medium
CN112783327B (en) * 2021-01-29 2022-08-30 中国科学院计算技术研究所 Method and system for gesture recognition based on surface electromyogram signals
CN113516242B (en) * 2021-08-10 2024-05-14 中国科学院空天信息创新研究院 Self-attention mechanism-based through-wall radar human body action recognition method
CN114613011A (en) * 2022-03-17 2022-06-10 东华大学 Human body 3D (three-dimensional) bone behavior identification method based on graph attention convolutional neural network

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106203503A (en) * 2016-07-08 2016-12-07 天津大学 A kind of action identification method based on skeleton sequence
CN106228109A (en) * 2016-07-08 2016-12-14 天津大学 A kind of action identification method based on skeleton motion track
WO2017133009A1 (en) * 2016-02-04 2017-08-10 广州新节奏智能科技有限公司 Method for positioning human joint using depth image of convolutional neural network
CN107924472A (en) * 2015-06-03 2018-04-17 英乐爱有限公司 Pass through the image classification of brain computer interface
US10089556B1 (en) * 2017-06-12 2018-10-02 Konica Minolta Laboratory U.S.A., Inc. Self-attention deep neural network for action recognition in surveillance videos
CN108830157A (en) * 2018-05-15 2018-11-16 华北电力大学(保定) Human bodys' response method based on attention mechanism and 3D convolutional neural networks
CN108875708A (en) * 2018-07-18 2018-11-23 广东工业大学 Behavior analysis method, device, equipment, system and storage medium based on video
CN109614874A (en) * 2018-11-16 2019-04-12 深圳市感动智能科技有限公司 A kind of Human bodys' response method and system based on attention perception and tree-like skeleton point structure
CN109858406A (en) * 2019-01-17 2019-06-07 西北大学 A kind of extraction method of key frame based on artis information
CN110084228A (en) * 2019-06-25 2019-08-02 江苏德劭信息科技有限公司 A kind of hazardous act automatic identifying method based on double-current convolutional neural networks

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9830709B2 (en) * 2016-03-11 2017-11-28 Qualcomm Incorporated Video analysis with convolutional attention recurrent neural networks
US10387776B2 (en) * 2017-03-10 2019-08-20 Adobe Inc. Recurrent neural network architectures which provide text describing images

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107924472A (en) * 2015-06-03 2018-04-17 英乐爱有限公司 Pass through the image classification of brain computer interface
WO2017133009A1 (en) * 2016-02-04 2017-08-10 广州新节奏智能科技有限公司 Method for positioning human joint using depth image of convolutional neural network
CN106203503A (en) * 2016-07-08 2016-12-07 天津大学 A kind of action identification method based on skeleton sequence
CN106228109A (en) * 2016-07-08 2016-12-14 天津大学 A kind of action identification method based on skeleton motion track
US10089556B1 (en) * 2017-06-12 2018-10-02 Konica Minolta Laboratory U.S.A., Inc. Self-attention deep neural network for action recognition in surveillance videos
CN108830157A (en) * 2018-05-15 2018-11-16 华北电力大学(保定) Human bodys' response method based on attention mechanism and 3D convolutional neural networks
CN108875708A (en) * 2018-07-18 2018-11-23 广东工业大学 Behavior analysis method, device, equipment, system and storage medium based on video
CN109614874A (en) * 2018-11-16 2019-04-12 深圳市感动智能科技有限公司 A kind of Human bodys' response method and system based on attention perception and tree-like skeleton point structure
CN109858406A (en) * 2019-01-17 2019-06-07 西北大学 A kind of extraction method of key frame based on artis information
CN110084228A (en) * 2019-06-25 2019-08-02 江苏德劭信息科技有限公司 A kind of hazardous act automatic identifying method based on double-current convolutional neural networks

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
丰艳 et al. "View-independent skeleton action recognition based on a spatio-temporal attention deep network." Journal of Computer-Aided Design & Computer Graphics (计算机辅助设计与图形学学报), Vol. 30, 2018, pp. 2271-2277. *

Also Published As

Publication number Publication date
CN110728183A (en) 2020-01-24

Similar Documents

Publication Publication Date Title
CN110728183B (en) Human body action recognition method of neural network based on attention mechanism
CN108520535B (en) Object classification method based on depth recovery information
CN110135319B (en) Abnormal behavior detection method and system
CN110135375B (en) Multi-person attitude estimation method based on global information integration
CN106897670B (en) Express violence sorting identification method based on computer vision
CN107808131B (en) Dynamic gesture recognition method based on dual-channel deep convolutional neural network
CN108460356B (en) Face image automatic processing system based on monitoring system
CN104008370B (en) A kind of video face identification method
CN110580472B (en) Video foreground detection method based on full convolution network and conditional countermeasure network
CN110516533B (en) Pedestrian re-identification method based on depth measurement
CN110390308B (en) Video behavior identification method based on space-time confrontation generation network
CN113963032A (en) Twin network structure target tracking method fusing target re-identification
CN111695523B (en) Double-flow convolutional neural network action recognition method based on skeleton space-time and dynamic information
CN112801015A (en) Multi-mode face recognition method based on attention mechanism
CN111723687A (en) Human body action recognition method and device based on neural network
CN114821764A (en) Gesture image recognition method and system based on KCF tracking detection
Wang et al. Video background/foreground separation model based on non-convex rank approximation RPCA and superpixel motion detection
Shariff et al. Artificial (or) fake human face generator using generative adversarial network (gan) machine learning model
CN103235943A (en) Principal component analysis-based (PCA-based) three-dimensional (3D) face recognition system
CN110348395B (en) Skeleton behavior identification method based on space-time relationship
CN112487926A (en) Scenic spot feeding behavior identification method based on space-time diagram convolutional network
CN116912804A (en) Efficient anchor-frame-free 3-D target detection and tracking method and model
CN116453025A (en) Volleyball match group behavior identification method integrating space-time information in frame-missing environment
CN113192186B (en) 3D human body posture estimation model establishing method based on single-frame image and application thereof
CN113255514B (en) Behavior identification method based on local scene perception graph convolutional network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant