CN112507940A - Skeleton action recognition method based on difference guidance representation learning network - Google Patents

Skeleton action recognition method based on difference guidance representation learning network

Info

Publication number
CN112507940A
Authority
CN
China
Prior art keywords
sequence
representation
differential
difference
original
Prior art date
Legal status
Granted
Application number
CN202011497126.6A
Other languages
Chinese (zh)
Other versions
CN112507940B (en)
Inventor
马千里 (Ma Qianli)
陈子鹏 (Chen Zipeng)
Current Assignee
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date
Filing date
Publication date
Application filed by South China University of Technology SCUT
Priority to CN202011497126.6A
Publication of CN112507940A
Application granted
Publication of CN112507940B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20: Movements or behaviour, e.g. gesture recognition
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/24: Classification techniques
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/044: Recurrent networks, e.g. Hopfield networks
    • G06N3/045: Combinations of networks
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T: CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00: Road transport of goods or passengers
    • Y02T10/10: Internal combustion engine [ICE] based vehicles
    • Y02T10/40: Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Molecular Biology (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a skeleton action recognition method based on a difference-guided representation learning network, which comprises the following steps: acquiring a skeleton action sequence; calculating difference values of the skeleton action sequence to obtain a differential sequence; inputting the differential sequence into a differential information module, where a long short-term memory (LSTM) network computes a differential feature representation; inputting the skeleton action sequence and the differential feature representation into an original information module, where the differential feature representation guides the representation learning of the skeleton action sequence to obtain an original sequence feature representation; concatenating the differential feature representation and the original sequence feature representation, extracting multi-scale features with a multi-scale convolutional neural network, and obtaining a pooled representation by max pooling; and inputting the pooled representation into a fully connected layer for classification. The invention models the differential information of the skeleton action sequence with an LSTM network to guide the representation learning of the skeleton action sequence, and extracts multi-scale features of the skeleton action sequence with a multi-scale convolutional neural network, thereby improving the accuracy of skeleton action recognition.

Description

Skeleton action recognition method based on difference guidance representation learning network
Technical Field
The invention relates to the technical field of skeleton action recognition, and in particular to a skeleton action recognition method based on a difference-guided representation learning network.
Background
As an important branch of computer vision, human action recognition has wide applications. Traditional research mainly recognizes actions from video recorded by two-dimensional cameras. However, human motion takes place, and is most naturally represented and recognized, in three-dimensional space. Therefore, action recognition methods based on three-dimensional human skeletons have attracted increasing attention in recent years and are widely used in scenarios such as human-computer interaction and virtual reality. In skeleton-based action recognition, the human body is represented by a three-dimensional skeleton, and an action is represented by the motion of the skeleton joints in three-dimensional space.
Action recognition based on human skeletons is generally treated as a time-series problem. Traditional approaches use machine learning methods such as k-means clustering, support vector machines and hidden Markov models to model the adjacent joint points of the human body and to recognize different actions. However, these methods cannot effectively model the complex temporal information and motion patterns of skeleton action sequences, which leads to poor recognition performance. Moreover, traditional methods have difficulty handling skeleton action sequences of different lengths.
With the development of deep learning, many neural-network-based methods have been applied to skeleton action recognition. Among them, recurrent neural networks and long short-term memory (LSTM) networks are common and effective, because their recurrent structure models the temporal dependencies of skeleton joint sequences well. However, existing skeleton action recognition methods do not fully model the differential information of the skeleton sequence, even though this information reflects the dynamic evolution of the sequence and plays an important role in its representation learning. Skeleton sequence segments with larger differential values imply a larger range of motion, which provides important cues for skeleton action recognition. It is therefore highly desirable to use differential information to guide the representation learning of the neural network so as to improve the accuracy of skeleton action recognition.
Disclosure of Invention
The invention aims to overcome the defects in the prior art and provides a skeleton action recognition method based on a difference-guided representation learning network.
The purpose of the invention can be achieved by adopting the following technical scheme:
A skeleton action recognition method based on a difference-guided representation learning network comprises the following steps:
S1, acquiring skeleton action sequence data and preprocessing the data;
S2, calculating difference values of the skeleton action sequence to obtain a differential sequence, inputting the differential sequence into a differential information module, computing a differential feature representation with a long short-term memory (LSTM) network, then inputting the skeleton action sequence and the differential feature representation into an original information module, and using the differential feature representation to guide the representation learning of the skeleton action sequence to obtain an original sequence feature representation;
S3, concatenating the differential feature representation and the original sequence feature representation, extracting multi-scale features of the skeleton action sequence with a multi-scale convolutional neural network, and obtaining a pooled representation by a max pooling operation;
S4, inputting the pooled representation into a fully connected layer for classification.
Further, the calculation process of the differential feature representation and the original sequence feature representation in step S2 is as follows:
S2.1, calculating difference values of the skeleton action sequence obtained in step S1 to obtain the differential sequence:
given the original skeleton action sequence $X = \{x_1, x_2, \ldots, x_t, \ldots, x_T\}$, where $T$ is the length of the skeleton action sequence, the corresponding differential sequence $\bar{X} = \{\bar{x}_1, \bar{x}_2, \ldots, \bar{x}_t, \ldots, \bar{x}_T\}$ is calculated as:

$$\bar{x}_t = x_t - x_{t-1}$$

where $x_t$ is the input data at time step $t$ and $\bar{x}_t$ is the differential data at time step $t$;
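For illustration only (not part of the patent text), the differential sequence can be computed with a few array operations. The sketch below assumes the skeleton action sequence is stored as a NumPy array of shape (T, D) and, as one possible boundary convention, sets the first differential frame to zero; the function name is an assumption.

```python
import numpy as np

def differential_sequence(x: np.ndarray) -> np.ndarray:
    """Compute the differential sequence of a skeleton action sequence.

    x: array of shape (T, D), one flattened skeleton per time step.
    Returns an array of the same shape; the first row is zero by convention.
    """
    diff = np.zeros_like(x)
    diff[1:] = x[1:] - x[:-1]          # x_t - x_{t-1} for t >= 2
    return diff

# Toy example: a sequence with T = 5 time steps and 60 joint coordinates.
X = np.random.randn(5, 60).astype(np.float32)
X_bar = differential_sequence(X)
print(X_bar.shape)                      # (5, 60)
```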
s2.2, inputting the differential sequence into a differential information module, in the differential information module, calculating by using a long-short term memory network to obtain differential feature representation, and at a time step t, representing the differential feature representation
$\bar{h}_t$. The calculation formulas are as follows:

$$[\bar{i}_t, \bar{f}_t, \bar{o}_t, \bar{u}_t] = [\sigma, \sigma, \sigma, \tanh]\left(M[\bar{h}_{t-1}, \bar{x}_t]\right)$$
$$\bar{c}_t = \bar{f}_t \odot \bar{c}_{t-1} + \bar{i}_t \odot \bar{u}_t$$
$$\bar{h}_t = \bar{o}_t \odot \tanh(\bar{c}_t)$$

where $\bar{h}_{t-1}$ is the hidden-layer output of the long short-term memory network at time step $t-1$, $\bar{x}_t$ is the input data at time step $t$, $\bar{i}_t$, $\bar{f}_t$ and $\bar{o}_t$ are respectively the input gate, forget gate and output gate of the long short-term memory network, $\bar{u}_t$ is the currently added information, $\bar{c}_t$ is the memory-cell information, $\sigma$ and $\tanh$ are nonlinear activation functions, $\odot$ denotes element-wise multiplication, and $M$ is an affine transformation matrix composed of trainable parameters; the differential feature representation models the differential information of the skeleton action sequence, reflects the dynamic change of skeleton actions, and contributes to skeleton action recognition;
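A minimal PyTorch sketch of such a differential information module is given below. It assumes the single affine map M produces all four gate pre-activations from the concatenation of $\bar{h}_{t-1}$ and $\bar{x}_t$; the class and parameter names are illustrative, not taken from the patent.

```python
import torch
import torch.nn as nn

class DifferentialInfoModule(nn.Module):
    """Runs an LSTM over the differential sequence to get differential features."""

    def __init__(self, input_dim: int, hidden_dim: int):
        super().__init__()
        # M: one affine map producing input, forget, output gates and candidate u.
        self.M = nn.Linear(input_dim + hidden_dim, 4 * hidden_dim)
        self.hidden_dim = hidden_dim

    def forward(self, x_bar: torch.Tensor) -> torch.Tensor:
        # x_bar: (batch, T, input_dim) differential sequence.
        batch, T, _ = x_bar.shape
        h = x_bar.new_zeros(batch, self.hidden_dim)
        c = x_bar.new_zeros(batch, self.hidden_dim)
        outputs = []
        for t in range(T):
            gates = self.M(torch.cat([h, x_bar[:, t]], dim=-1))
            i, f, o, u = gates.chunk(4, dim=-1)
            i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
            u = torch.tanh(u)
            c = f * c + i * u                    # memory-cell update
            h = o * torch.tanh(c)                # differential feature at step t
            outputs.append(h)
        return torch.stack(outputs, dim=1)       # (batch, T, hidden_dim)
```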
S2.3, inputting the skeleton action sequence and the differential feature representation into an original information module, where the differential feature representation guides the representation learning of the skeleton action sequence to obtain the original sequence feature representation; at time step $t$, the original sequence feature representation $h_t$ is calculated with the long short-term memory network by the following formulas:
$$[i_t, f_t, o_t] = \sigma\left(M'[h_{t-1}, x_t, \bar{h}_t]\right)$$
$$u_t = \tanh(W_u[h_{t-1}, x_t] + b_u)$$
$$c_t = f_t \odot c_{t-1} + i_t \odot u_t$$
$$h_t = o_t \odot \tanh(c_t)$$

where $h_{t-1}$ is the hidden-layer output of the long short-term memory network at time step $t-1$, $x_t$ is the input data at time step $t$, $i_t$, $f_t$ and $o_t$ are respectively the input gate, forget gate and output gate of the long short-term memory network, $u_t$ is the currently added information, $c_t$ is the memory-cell information, $M'$ and $W_u$ are affine transformation matrices composed of trainable parameters, and $b_u$ is a bias term;

through the differential information module and the original information module, the differential feature representation $\bar{H} = \{\bar{h}_1, \bar{h}_2, \ldots, \bar{h}_t, \ldots, \bar{h}_T\}$ and the original sequence feature representation $H = \{h_1, h_2, \ldots, h_t, \ldots, h_T\}$ are obtained, where $\bar{h}_t$ is the differential feature representation at time step $t$ and $h_t$ is the original sequence feature representation at time step $t$.
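The following sketch shows one plausible realization of the guidance described above, under the assumption that the differential feature $\bar{h}_t$ enters the gate computation of the original-sequence LSTM while the candidate information $u_t$ depends only on $[h_{t-1}, x_t]$, mirroring the formulas; the names and structure are illustrative, not the patent's reference implementation.

```python
import torch
import torch.nn as nn

class OriginalInfoModule(nn.Module):
    """LSTM over the original sequence whose gates are guided by differential features."""

    def __init__(self, input_dim: int, hidden_dim: int):
        super().__init__()
        # M': produces the three gates from [h_{t-1}, x_t, h_bar_t].
        self.M_prime = nn.Linear(input_dim + 2 * hidden_dim, 3 * hidden_dim)
        # W_u, b_u: candidate information computed from [h_{t-1}, x_t] only.
        self.W_u = nn.Linear(input_dim + hidden_dim, hidden_dim)
        self.hidden_dim = hidden_dim

    def forward(self, x: torch.Tensor, h_bar: torch.Tensor) -> torch.Tensor:
        # x: (batch, T, input_dim) original sequence; h_bar: (batch, T, hidden_dim).
        batch, T, _ = x.shape
        h = x.new_zeros(batch, self.hidden_dim)
        c = x.new_zeros(batch, self.hidden_dim)
        outputs = []
        for t in range(T):
            gates = self.M_prime(torch.cat([h, x[:, t], h_bar[:, t]], dim=-1))
            i, f, o = torch.sigmoid(gates).chunk(3, dim=-1)
            u = torch.tanh(self.W_u(torch.cat([h, x[:, t]], dim=-1)))
            c = f * c + i * u                     # memory-cell update
            h = o * torch.tanh(c)                 # original sequence feature at step t
            outputs.append(h)
        return torch.stack(outputs, dim=1)        # (batch, T, hidden_dim)
```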
Further, the process of extracting the multi-scale features of the action sequence by using the multi-scale convolutional neural network in step S3 and obtaining the pooled representation by using the maximum pooling operation is as follows:
s3.1, splicing the differential feature representation and the original sequence feature representation obtained in the step S2:
$$z_t = [\bar{h}_t, h_t]$$

where $\bar{h}_t$ is the differential feature representation at time step $t$ and $h_t$ is the original sequence feature representation at time step $t$; the concatenation is applied at all time steps to obtain the sequence representation $Z = \{z_1, z_2, \ldots, z_t, \ldots, z_T\}$, where $z_t$ is the sequence representation at time step $t$;

S3.2, extracting multi-scale features of the action sequence with a multi-scale convolutional neural network, which captures skeleton actions of different amplitudes and further improves the accuracy of skeleton action recognition. Let $F \in \mathbb{R}^{w \times n \times 2 \times k}$ be the convolution kernels of the convolution operation, where $w$, $n$ and $k$ respectively denote the width, height and number of the convolution kernels; the convolution operation is expressed as:

$$g_t = f(F \ast z_{t:t+w-1} + b_g)$$

where $z_{t:t+w-1}$ denotes the slice of the sequence representation $Z$ covered by the window, $\ast$ is the convolution operation, $f$ is a nonlinear transformation function, and $b_g$ is a bias term. The convolution kernels are applied to every position of the sequence, and zero padding is used to generate a feature matrix $G \in \mathbb{R}^{T \times k}$ of the same length as the input, where $T$ and $k$ respectively denote the length of the input sequence and the number of convolution kernels, and $G$ is the feature matrix obtained by convolution with windows of the same size;

with the multi-scale convolutional neural network, convolution operations are performed with windows of different sizes; assuming $r$ is the number of window sizes, $r$ convolution results are obtained and concatenated into the multi-scale feature matrix $\hat{G} \in \mathbb{R}^{T \times rk}$; a max pooling operation along the time dimension of the multi-scale feature matrix $\hat{G}$ yields the pooled representation $P \in \mathbb{R}^{rk}$.
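As a hedged illustration of step S3, the sketch below realizes the multi-scale convolution as parallel 1D temporal convolutions over the concatenated representation Z, with zero padding so each scale keeps length T, followed by max pooling over time. The default window sizes, the ReLU nonlinearity standing in for f, and all names are assumptions made for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleConvPool(nn.Module):
    """Multi-scale temporal convolution over Z followed by max pooling over time."""

    def __init__(self, in_dim: int, num_kernels: int, window_sizes=(2, 3, 5)):
        super().__init__()
        self.window_sizes = window_sizes
        self.convs = nn.ModuleList(
            nn.Conv1d(in_dim, num_kernels, kernel_size=w) for w in window_sizes
        )

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # z: (batch, T, in_dim) concatenated differential + original features.
        z = z.transpose(1, 2)                           # (batch, in_dim, T) for Conv1d
        features = []
        for w, conv in zip(self.window_sizes, self.convs):
            # Zero-pad so every scale keeps the input length T.
            padded = F.pad(z, ((w - 1) // 2, w // 2))
            features.append(torch.relu(conv(padded)))   # (batch, num_kernels, T)
        g_hat = torch.cat(features, dim=1)              # multi-scale feature matrix
        return g_hat.max(dim=2).values                  # pooled representation P
```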
Further, the classification process in step S4 is as follows:
the pooled expression P obtained in step S3 is input to the full link layer for classification, and the formula is as follows:
$$\hat{y} = \mathrm{softmax}(W_p P + b_p)$$

where $W_p$ is an affine transformation matrix composed of trainable parameters, $b_p$ is a bias term, softmax is a nonlinear activation function, and $\hat{y}$ is the predicted distribution;

minimizing the cross-entropy loss is used as the training objective, where the cross-entropy function of the two distributions $\ell(y, \hat{y})$ is expressed as:

$$\ell(y, \hat{y}) = -\sum_i y_i \log \hat{y}_i$$

where $y$ is the true distribution, $y_i$ denotes the $i$-th dimension of $y$, and $\hat{y}_i$ denotes the $i$-th dimension of $\hat{y}$.
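For completeness, a small sketch of the classification head and the cross-entropy training objective. It uses the framework's combined softmax/cross-entropy loss, which is numerically equivalent to the formula above; the dimensions and names are illustrative.

```python
import torch
import torch.nn as nn

num_classes = 27                       # e.g. UTD-MHAD has 27 action classes
pooled_dim = 3 * 96                    # r * k pooled features (illustrative)

classifier = nn.Linear(pooled_dim, num_classes)   # W_p, b_p
criterion = nn.CrossEntropyLoss()                 # softmax + cross-entropy in one op

P = torch.randn(8, pooled_dim)                    # a batch of pooled representations
labels = torch.randint(0, num_classes, (8,))      # ground-truth action indices

logits = classifier(P)                            # W_p P + b_p
loss = criterion(logits, labels)                  # cross-entropy training objective
probs = torch.softmax(logits, dim=-1)             # predicted distribution y_hat
loss.backward()
```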
Compared with the prior art, the invention has the following advantages and effects:
the invention uses the long-short term memory network to model the differential information of the skeleton action sequence, and uses the differential information to guide the representation learning of the skeleton action sequence. And moreover, the multi-scale convolutional neural network is used for extracting the multi-scale features of the bone action sequence, so that the accuracy of bone action identification is further improved.
Drawings
FIG. 1 is a detailed flowchart of a skeleton action recognition method based on a difference-guided representation learning network according to an embodiment of the present invention;
FIG. 2 is a network structure diagram of a skeleton action recognition method based on a difference-guided representation learning network according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Examples
The embodiment discloses a skeleton action recognition method based on a difference-guided representation learning network. As shown in Fig. 1, the method includes the following steps:
Step S1: acquire the skeleton action sequence data and preprocess the data. In this embodiment, the data come from the UTD-MHAD data set, which consists of human skeleton data collected in an indoor environment and contains 27 different actions. Each skeleton in the data set is composed of 20 joints, and each joint is represented by its three-dimensional coordinates. The joint coordinates of a skeleton are flattened into a 60-dimensional vector, and the skeletons of consecutive frames form a skeleton action sequence.
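A possible preprocessing sketch for this step is shown below. It assumes the UTD-MHAD skeleton files are MATLAB .mat files holding a (20, 3, T) joint array, commonly stored under the key 'd_skel'; the key, axis order and example path are assumptions and may need to be adapted to the actual file layout.

```python
import numpy as np
from scipy.io import loadmat

def load_skeleton_sequence(mat_path: str) -> np.ndarray:
    """Load one UTD-MHAD skeleton file and flatten each frame to a 60-d vector.

    Assumes the .mat file stores joints in a (20, 3, T) array named 'd_skel';
    adjust the key and axis order if the actual file layout differs.
    """
    skel = loadmat(mat_path)["d_skel"]          # (20 joints, 3 coords, T frames)
    skel = np.transpose(skel, (2, 0, 1))        # -> (T, 20, 3)
    return skel.reshape(skel.shape[0], -1)      # -> (T, 60) skeleton action sequence

# Example (hypothetical path and file name):
# X = load_skeleton_sequence("UTD-MHAD/Skeleton/a1_s1_t1_skeleton.mat")
# print(X.shape)   # (T, 60)
```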
Step S2: calculate difference values of the skeleton action sequence obtained in step S1 to obtain the differential sequence. The differential sequence is input into the differential information module, and the differential feature representation is computed with a long short-term memory network. Then the skeleton action sequence and the differential feature representation are input into the original information module, where the differential feature representation guides the representation learning of the skeleton action sequence to obtain the original sequence feature representation. The specific process is as follows:
and S2.1, calculating difference values of the skeleton action sequence obtained in step S1 to obtain the differential sequence. First, as shown in Fig. 2, a skeleton action sequence of 5 time steps $X = \{x_1, x_2, \ldots, x_5\}$ is given, whose corresponding human action is shooting, and the corresponding differential sequence
$\bar{X} = \{\bar{x}_1, \bar{x}_2, \ldots, \bar{x}_5\}$ is calculated as:

$$\bar{x}_t = x_t - x_{t-1}$$

where $x_t$ is the input data at time step $t$ and $\bar{x}_t$ is the differential data at time step $t$.
And S2.2, as shown in the figure 2, inputting the differential sequence into a differential information module, and calculating by using a long-short term memory network to obtain a differential feature representation. At time step t, differential feature representation
$\bar{h}_t$ is calculated by the following formulas:

$$[\bar{i}_t, \bar{f}_t, \bar{o}_t, \bar{u}_t] = [\sigma, \sigma, \sigma, \tanh]\left(M[\bar{h}_{t-1}, \bar{x}_t]\right)$$
$$\bar{c}_t = \bar{f}_t \odot \bar{c}_{t-1} + \bar{i}_t \odot \bar{u}_t$$
$$\bar{h}_t = \bar{o}_t \odot \tanh(\bar{c}_t)$$

where $\bar{h}_{t-1}$ is the hidden-layer output of the long short-term memory network at time step $t-1$, $\bar{x}_t$ is the input data at time step $t$, $\bar{i}_t$, $\bar{f}_t$ and $\bar{o}_t$ are respectively the input gate, forget gate and output gate of the long short-term memory network, $\bar{u}_t$ is the currently added information, $\bar{c}_t$ is the memory-cell information, $\sigma$ and $\tanh$ are nonlinear activation functions, $\odot$ denotes element-wise multiplication, and $M$ is an affine transformation matrix composed of trainable parameters.
The differential feature representation $\bar{h}_t$ models the differential information of the skeleton action sequence, reflects the dynamic change of the skeleton action, and facilitates human skeleton action recognition. As shown in Fig. 2, the segment with larger dynamic change corresponds to the shooting motion itself, while the segment with smaller dynamic change may correspond to the preparatory motion before shooting; modeling the differential information of the skeleton action sequence therefore helps to better recognize the "shooting" action.
And S2.3, as shown in Fig. 2, inputting the skeleton action sequence and the differential feature representation together into the original information module and computing the original sequence feature representation. In the original information module, the differential feature representation guides the representation learning of the skeleton action sequence to obtain the original sequence feature representation. At time step $t$, the original sequence feature representation $h_t$ is calculated with the long short-term memory network by the following formulas:
$$[i_t, f_t, o_t] = \sigma\left(M'[h_{t-1}, x_t, \bar{h}_t]\right)$$
$$u_t = \tanh(W_u[h_{t-1}, x_t] + b_u)$$
$$c_t = f_t \odot c_{t-1} + i_t \odot u_t$$
$$h_t = o_t \odot \tanh(c_t)$$

where $h_{t-1}$ is the hidden-layer output of the long short-term memory network at time step $t-1$, $x_t$ is the input data at time step $t$, $i_t$, $f_t$ and $o_t$ are respectively the input gate, forget gate and output gate of the long short-term memory network, $u_t$ is the currently added information, $c_t$ is the memory-cell information, $\sigma$ and $\tanh$ are nonlinear activation functions, $\odot$ denotes element-wise multiplication, $M'$ and $W_u$ are affine transformation matrices composed of trainable parameters, and $b_u$ is a bias term.

Through the differential information module and the original information module, the differential feature representation $\bar{H} = \{\bar{h}_1, \bar{h}_2, \ldots, \bar{h}_5\}$ and the original sequence feature representation $H = \{h_1, h_2, \ldots, h_5\}$ are obtained, where $\bar{h}_t$ is the differential feature representation at time step $t$ and $h_t$ is the original sequence feature representation at time step $t$.
And S3, splicing the difference feature representation and the original sequence feature representation obtained in the step S2, extracting the multi-scale features of the action sequence by using a multi-scale convolution neural network, and obtaining a pooled representation by using a maximum pooling operation. The specific process is as follows:
s3.1, splicing the differential feature representation and the original sequence feature representation obtained in the step S2:
$$z_t = [\bar{h}_t, h_t]$$

where $\bar{h}_t$ is the differential feature representation at time step $t$ and $h_t$ is the original sequence feature representation at time step $t$; the concatenation is applied at all time steps to obtain the sequence representation $Z = \{z_1, z_2, \ldots, z_5\}$, where $z_t$ is the sequence representation at time step $t$.
And S3.2, extracting the multi-scale features of the action sequence with a multi-scale convolutional neural network, as shown in Fig. 2. Let $F \in \mathbb{R}^{w \times n \times 2 \times k}$ be the convolution kernels of the convolution operation, where $w$, $n$ and $k$ respectively denote the width, height and number of the convolution kernels. In this embodiment, $w$ is selected from 2, 3 and 5, $n$ is set to 128, and $k$ is set to 96. The convolution operation is expressed as:
$$g_t = f(F \ast z_{t:t+w-1} + b_g)$$

where $z_{t:t+w-1}$ denotes the slice of the sequence representation $Z$ covered by the window, $\ast$ is the convolution operation, $f$ is a nonlinear transformation function, and $b_g$ is a bias term. The convolution kernels are applied to every position of the sequence, and zero padding is used to generate a feature matrix $G \in \mathbb{R}^{T \times k}$ of the same length as the input; $G$ is the feature matrix obtained by convolution with windows of the same size.
Using a multi-scale convolution neural network, carrying out convolution operation by using 3 windows with different sizes, wherein the sizes of the windows are respectively 2, 3 and 5, obtaining results of the 3 convolution operations, and splicing to obtain a multi-scale characteristic matrix
$\hat{G} \in \mathbb{R}^{T \times 3k}$; a max pooling operation along the time dimension of the multi-scale feature matrix $\hat{G}$ then yields the pooled representation $P \in \mathbb{R}^{3k}$.
Step S4: the pooled representation $P$ obtained in step S3 is input to the fully connected layer for classification, with the formula:

$$\hat{y} = \mathrm{softmax}(W_p P + b_p)$$

where $W_p$ is an affine transformation matrix composed of trainable parameters, $b_p$ is a bias term, softmax is a nonlinear activation function, and $\hat{y}$ is the predicted distribution.

Minimizing the cross-entropy loss is used as the training objective, where the cross-entropy function of the two distributions $\ell(y, \hat{y})$ is expressed as:

$$\ell(y, \hat{y}) = -\sum_i y_i \log \hat{y}_i$$

where $y$ is the true distribution, $y_i$ denotes the $i$-th dimension of $y$, and $\hat{y}_i$ denotes the $i$-th dimension of $\hat{y}$.
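To tie the embodiment together, the sketch below wires the modules from the earlier sketches (DifferentialInfoModule, OriginalInfoModule, MultiScaleConvPool) into one network using the hyperparameters stated above: 60-dimensional input, hidden size 128, 96 kernels per scale, window sizes 2, 3 and 5, and 27 classes. It is an illustration under those assumptions, not the patent's reference implementation, and requires the earlier class definitions to be in scope.

```python
import torch
import torch.nn as nn

class DifferenceGuidedNet(nn.Module):
    """End-to-end sketch: difference-guided representation learning network."""

    def __init__(self, input_dim=60, hidden_dim=128, num_kernels=96,
                 window_sizes=(2, 3, 5), num_classes=27):
        super().__init__()
        self.diff_module = DifferentialInfoModule(input_dim, hidden_dim)
        self.orig_module = OriginalInfoModule(input_dim, hidden_dim)
        self.conv_pool = MultiScaleConvPool(2 * hidden_dim, num_kernels, window_sizes)
        self.fc = nn.Linear(len(window_sizes) * num_kernels, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, T, 60) skeleton action sequences.
        x_bar = torch.zeros_like(x)
        x_bar[:, 1:] = x[:, 1:] - x[:, :-1]       # differential sequence
        h_bar = self.diff_module(x_bar)           # differential feature representation
        h = self.orig_module(x, h_bar)            # difference-guided original features
        z = torch.cat([h_bar, h], dim=-1)         # concatenation at every time step
        p = self.conv_pool(z)                     # multi-scale conv + max pooling
        return self.fc(p)                         # class logits

# Toy forward/backward pass on random data (batch of 4, T = 5 frames).
model = DifferenceGuidedNet()
x = torch.randn(4, 5, 60)
logits = model(x)
loss = nn.CrossEntropyLoss()(logits, torch.randint(0, 27, (4,)))
loss.backward()
print(logits.shape)   # torch.Size([4, 27])
```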
In summary, this embodiment uses a long short-term memory network to learn the differential information of the skeleton action sequence, uses this differential information to guide the representation learning of the original sequence, and uses a multi-scale convolutional neural network to extract multi-scale features of the skeleton action sequence, thereby improving the accuracy of skeleton action recognition. Compared with traditional methods, the method fully models the differential information of the skeleton action sequence and is more sensitive to changes in skeleton actions, which helps a machine recognize human actions more accurately and serves scenarios such as human-computer interaction and virtual reality.
The above embodiments are preferred embodiments of the present invention, but the present invention is not limited to the above embodiments, and any other changes, modifications, substitutions, combinations, and simplifications which do not depart from the spirit and principle of the present invention should be construed as equivalents thereof, and all such changes, modifications, substitutions, combinations, and simplifications are intended to be included in the scope of the present invention.

Claims (4)

1. A skeleton action recognition method based on a difference-guided representation learning network, characterized by comprising the following steps:
S1, acquiring skeleton action sequence data and preprocessing the data;
S2, calculating difference values of the skeleton action sequence to obtain a differential sequence, inputting the differential sequence into a differential information module, computing a differential feature representation with a long short-term memory (LSTM) network, then inputting the skeleton action sequence and the differential feature representation into an original information module, and using the differential feature representation to guide the representation learning of the skeleton action sequence to obtain an original sequence feature representation;
S3, concatenating the differential feature representation and the original sequence feature representation, extracting multi-scale features of the skeleton action sequence with a multi-scale convolutional neural network, and obtaining a pooled representation by a max pooling operation;
S4, inputting the pooled representation into a fully connected layer for classification.
2. The skeleton action recognition method based on a difference-guided representation learning network as claimed in claim 1, wherein the calculation process of the differential feature representation and the original sequence feature representation in step S2 is as follows:
S2.1, calculating difference values of the skeleton action sequence obtained in step S1 to obtain the differential sequence:
given the original skeleton action sequence $X = \{x_1, x_2, \ldots, x_t, \ldots, x_T\}$, where $T$ is the length of the skeleton action sequence, the corresponding differential sequence $\bar{X} = \{\bar{x}_1, \bar{x}_2, \ldots, \bar{x}_t, \ldots, \bar{x}_T\}$ is calculated as:

$$\bar{x}_t = x_t - x_{t-1}$$

where $x_t$ is the input data at time step $t$ and $\bar{x}_t$ is the differential data at time step $t$;

S2.2, inputting the differential sequence into the differential information module, where a long short-term memory (LSTM) network computes the differential feature representation; at time step $t$, the differential feature representation $\bar{h}_t$ is calculated as:

$$[\bar{i}_t, \bar{f}_t, \bar{o}_t, \bar{u}_t] = [\sigma, \sigma, \sigma, \tanh]\left(M[\bar{h}_{t-1}, \bar{x}_t]\right)$$
$$\bar{c}_t = \bar{f}_t \odot \bar{c}_{t-1} + \bar{i}_t \odot \bar{u}_t$$
$$\bar{h}_t = \bar{o}_t \odot \tanh(\bar{c}_t)$$

where $\bar{h}_{t-1}$ is the hidden-layer output of the long short-term memory network at time step $t-1$, $\bar{x}_t$ is the input data at time step $t$, $\bar{i}_t$, $\bar{f}_t$ and $\bar{o}_t$ are respectively the input gate, forget gate and output gate of the long short-term memory network, $\bar{u}_t$ is the currently added information, $\bar{c}_t$ is the memory-cell information, $\sigma$ and $\tanh$ are nonlinear activation functions, $\odot$ denotes element-wise multiplication, and $M$ is an affine transformation matrix composed of trainable parameters;

S2.3, inputting the skeleton action sequence and the differential feature representation into the original information module, where the differential feature representation guides the representation learning of the skeleton action sequence to obtain the original sequence feature representation; at time step $t$, the original sequence feature representation $h_t$ is calculated with the long short-term memory network as:

$$[i_t, f_t, o_t] = \sigma\left(M'[h_{t-1}, x_t, \bar{h}_t]\right)$$
$$u_t = \tanh(W_u[h_{t-1}, x_t] + b_u)$$
$$c_t = f_t \odot c_{t-1} + i_t \odot u_t$$
$$h_t = o_t \odot \tanh(c_t)$$

where $h_{t-1}$ is the hidden-layer output of the long short-term memory network at time step $t-1$, $x_t$ is the input data at time step $t$, $i_t$, $f_t$ and $o_t$ are respectively the input gate, forget gate and output gate of the long short-term memory network, $u_t$ is the currently added information, $c_t$ is the memory-cell information, $M'$ and $W_u$ are affine transformation matrices composed of trainable parameters, and $b_u$ is a bias term;

through the differential information module and the original information module, the differential feature representation $\bar{H} = \{\bar{h}_1, \bar{h}_2, \ldots, \bar{h}_t, \ldots, \bar{h}_T\}$ and the original sequence feature representation $H = \{h_1, h_2, \ldots, h_t, \ldots, h_T\}$ are obtained, where $\bar{h}_t$ is the differential feature representation at time step $t$ and $h_t$ is the original sequence feature representation at time step $t$.
3. The skeleton action recognition method based on a difference-guided representation learning network as claimed in claim 1, wherein the process in step S3 of extracting multi-scale features of the action sequence with a multi-scale convolutional neural network and obtaining the pooled representation by the max pooling operation is as follows:
S3.1, concatenating the differential feature representation and the original sequence feature representation obtained in step S2:

$$z_t = [\bar{h}_t, h_t]$$

where $\bar{h}_t$ is the differential feature representation at time step $t$ and $h_t$ is the original sequence feature representation at time step $t$; the concatenation is applied at all time steps to obtain the sequence representation $Z = \{z_1, z_2, \ldots, z_t, \ldots, z_T\}$, where $z_t$ is the sequence representation at time step $t$;

S3.2, extracting multi-scale features of the action sequence with the multi-scale convolutional neural network; let $F \in \mathbb{R}^{w \times n \times 2 \times k}$ be the convolution kernels of the convolution operation, where $w$, $n$ and $k$ respectively denote the width, height and number of the convolution kernels, and the convolution operation is expressed as:

$$g_t = f(F \ast z_{t:t+w-1} + b_g)$$

where $z_{t:t+w-1}$ denotes the slice of the sequence representation $Z$ covered by the window, $\ast$ is the convolution operation, $f$ is a nonlinear transformation function, and $b_g$ is a bias term; the convolution kernels are applied to every position of the sequence, and zero padding is used to generate a feature matrix $G \in \mathbb{R}^{T \times k}$ of the same length as the input, where $T$ and $k$ respectively denote the length of the input sequence and the number of convolution kernels, and $G$ is the feature matrix obtained by convolution with windows of the same size;

with the multi-scale convolutional neural network, convolution operations are performed with windows of different sizes; assuming $r$ is the number of window sizes, $r$ convolution results are obtained and concatenated into the multi-scale feature matrix $\hat{G} \in \mathbb{R}^{T \times rk}$; a max pooling operation along the time dimension of the multi-scale feature matrix $\hat{G}$ yields the pooled representation $P \in \mathbb{R}^{rk}$.
4. The skeleton action recognition method based on a difference-guided representation learning network as claimed in claim 1, wherein the classification process in step S4 is as follows:
the pooled representation $P$ obtained in step S3 is input to the fully connected layer for classification, with the formula:

$$\hat{y} = \mathrm{softmax}(W_p P + b_p)$$

where $W_p$ is an affine transformation matrix composed of trainable parameters, $b_p$ is a bias term, softmax is a nonlinear activation function, and $\hat{y}$ is the predicted distribution;

minimizing the cross-entropy loss is used as the training objective, where the cross-entropy function of the two distributions $\ell(y, \hat{y})$ is expressed as:

$$\ell(y, \hat{y}) = -\sum_i y_i \log \hat{y}_i$$

where $y$ is the true distribution, $y_i$ denotes the $i$-th dimension of $y$, and $\hat{y}_i$ denotes the $i$-th dimension of $\hat{y}$.
CN202011497126.6A 2020-12-17 2020-12-17 Bone action recognition method based on differential guidance representation learning network Active CN112507940B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011497126.6A CN112507940B (en) 2020-12-17 2020-12-17 Bone action recognition method based on differential guidance representation learning network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011497126.6A CN112507940B (en) 2020-12-17 2020-12-17 Bone action recognition method based on differential guidance representation learning network

Publications (2)

Publication Number Publication Date
CN112507940A true CN112507940A (en) 2021-03-16
CN112507940B CN112507940B (en) 2023-08-25

Family

ID=74922265

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011497126.6A Active CN112507940B (en) 2020-12-17 2020-12-17 Bone action recognition method based on differential guidance representation learning network

Country Status (1)

Country Link
CN (1) CN112507940B (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106228109A (en) * 2016-07-08 2016-12-14 天津大学 A kind of action identification method based on skeleton motion track
CN111339942A (en) * 2020-02-26 2020-06-26 山东大学 Method and system for recognizing skeleton action of graph convolution circulation network based on viewpoint adjustment
CN111310707A (en) * 2020-02-28 2020-06-19 山东大学 Skeleton-based method and system for recognizing attention network actions
CN111709323A (en) * 2020-05-29 2020-09-25 重庆大学 Gesture recognition method based on lie group and long-and-short term memory network

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115661919A (en) * 2022-09-26 2023-01-31 珠海视熙科技有限公司 Repeated action cycle statistical method and device, fitness equipment and storage medium
CN115661919B (en) * 2022-09-26 2023-08-29 珠海视熙科技有限公司 Repeated action period statistics method and device, body-building equipment and storage medium

Also Published As

Publication number Publication date
CN112507940B (en) 2023-08-25

Similar Documents

Publication Publication Date Title
Lim et al. Isolated sign language recognition using convolutional neural network hand modelling and hand energy image
CN110222580B (en) Human hand three-dimensional attitude estimation method and device based on three-dimensional point cloud
CN112184752A (en) Video target tracking method based on pyramid convolution
Xin et al. Arch: Adaptive recurrent-convolutional hybrid networks for long-term action recognition
CN108182260B (en) Multivariate time sequence classification method based on semantic selection
CN111368759B (en) Monocular vision-based mobile robot semantic map construction system
CN111222486B (en) Training method, device and equipment for hand gesture recognition model and storage medium
Sincan et al. Using motion history images with 3d convolutional networks in isolated sign language recognition
CN106599810B (en) A kind of head pose estimation method encoded certainly based on stack
CN113963445A (en) Pedestrian falling action recognition method and device based on attitude estimation
CN111028319A (en) Three-dimensional non-photorealistic expression generation method based on facial motion unit
CN114419732A (en) HRNet human body posture identification method based on attention mechanism optimization
CN115761905A (en) Diver action identification method based on skeleton joint points
CN111429481A (en) Target tracking method, device and terminal based on adaptive expression
Sun et al. Two-stage deep regression enhanced depth estimation from a single RGB image
CN113902989A (en) Live scene detection method, storage medium and electronic device
CN112507940B (en) Bone action recognition method based on differential guidance representation learning network
Liang et al. Egocentric hand pose estimation and distance recovery in a single RGB image
Ikram et al. Real time hand gesture recognition using leap motion controller based on CNN-SVM architechture
Sun et al. Human action recognition using a convolutional neural network based on skeleton heatmaps from two-stage pose estimation
CN111368637A (en) Multi-mask convolution neural network-based object recognition method for transfer robot
Usman et al. Skeleton-based motion prediction: A survey
CN115830707A (en) Multi-view human behavior identification method based on hypergraph learning
CN111914751B (en) Image crowd density identification detection method and system
CN114581485A (en) Target tracking method based on language modeling pattern twin network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant