CN112329562A - Human body interaction action recognition method based on skeleton features and slice recurrent neural network - Google Patents

Human body interaction action recognition method based on skeleton features and slice recurrent neural network

Info

Publication number
CN112329562A
Authority
CN
China
Prior art keywords
slice
layer
connection
neural network
recurrent neural
Prior art date
Legal status
Granted
Application number
CN202011146588.3A
Other languages
Chinese (zh)
Other versions
CN112329562B (en)
Inventor
成科扬
吴金霞
毛启容
Current Assignee
Jiangsu University
Original Assignee
Jiangsu University
Priority date
Filing date
Publication date
Application filed by Jiangsu University
Priority to CN202011146588.3A
Publication of CN112329562A
Application granted
Publication of CN112329562B
Status: Active


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/20: Movements or behaviour, e.g. gesture recognition
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/24: Classification techniques
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/044: Recurrent networks, e.g. Hopfield networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a human body interaction action recognition method based on skeleton features and a slice recurrent neural network. For each action, OpenPose is used to acquire the skeleton sequence, action features are obtained from the skeletons, interaction connections between the skeletons are designed, and the additional interaction information thus captured increases the accuracy of action recognition. A new skeleton graph is constructed from these connections, approximated by a fast high-order Chebyshev polynomial expansion of the spectral graph convolution, and action features are extracted from it. To strengthen the extraction of temporal information, a slice recurrent neural network is innovatively applied to video action recognition to capture the dependency information of the whole action sequence, while the high-level feature map from the spatio-temporal modeling appropriately compensates for the long-term dependency loss caused by the slicing. The invention improves the accuracy of interaction recognition, has good applicability, and the slice recurrent neural network increases the speed of feature extraction over long sequences.

Description

Human body interaction action recognition method based on skeleton features and slice recurrent neural network
Technical Field
The invention relates to the technical fields of computer vision and pattern recognition, and in particular to a human body interaction action recognition method based on skeleton features and a slice recurrent neural network.
Background
Video-based interactive behavior recognition has high practical value and broad application prospects. The purpose of human action recognition is to analyze and understand the actions of, and interactions between, people in a video. Although action recognition based on RGB video or optical flow achieves high performance, it is susceptible to changes in background, illumination and appearance, and extracting optical flow information is computationally expensive. As a result, more and more research now targets skeleton data: the human skeleton expresses the motion of the body well and facilitates its analysis.
At present, the relatively mature research addresses single-person skeleton action recognition, and interactions remain little discussed. Compared with single-person actions, interactions are more complex: more types of limb movements occur in the course of completing an interaction, and the variations between the limbs are more diverse. How to effectively characterize an interaction and how to model and analyze interaction events is a very challenging problem.
When processing a video sequence, a recurrent neural network model is generally used to completely capture the temporal information of the whole action sequence and its dependency information. However, in a traditional recurrent neural network the information of the current node depends only on the previous node, so it can model only short-term dynamics and cannot retain long sequences; at the same time, the standard recurrent structure cannot be computed in parallel like a CNN model, so its computation is relatively slow. The invention therefore proposes a sliced recurrent neural network model to solve these problems.
Disclosure of Invention
In order to solve the problems of incomplete extraction of interaction information and missing inter-frame dependency information in action recognition, the invention provides a skeleton-based interaction spatio-temporal modeling method built on single-person skeleton graph convolution: it designs interaction connections between the skeletons and increases the accuracy of action recognition by capturing this additional interaction information. Meanwhile, by combining graph convolution with the slice recurrent neural network, the dependencies between nodes and between frames are better extracted, so that interactive behavior features are extracted accurately and the interactive behavior is recognized.
The technical scheme adopted by the invention is as follows. The invention provides an interaction recognition method based on skeleton features and a slice recurrent neural network, comprising the following steps:
(1) Based on the video frames, extract the skeleton of the action and design different connections between the joint points, so as to extract the interaction information between different nodes.
(2) Construct a new skeleton graph and apply spectral graph convolution to the spatio-temporal skeleton graph to obtain a high-level feature map.
(3) Adopt a slice recurrent neural network model to acquire time-sequence dependency information, its running speed improved through parallel computation, and classify the actions in the video according to the features extracted by the slice recurrent neural network.
Further, in step (1), the connections are divided into single-person connections, interactive connections and inter-frame connections, constructed as follows:
(1-1) Intra-frame connections comprise two parts: single-person self-connections and interactive connections. In the interactive part, connections between points that tend to undergo similar joint changes are called corresponding connections, denoted $\varepsilon_1$; for example, when the actions of the two participants are basically consistent, connections are established between their corresponding joint points, and these corresponding edges play an important role in such cases. Connections occurring between other joint points are called extrinsic connections, denoted $\varepsilon_2$. A weight $\theta$ is assigned to the edges in $\varepsilon_1$ and a weight $\delta$ to the edges in $\varepsilon_2$, i.e.:
$$w_{i,j} = \begin{cases} \theta, & (i,j) \in \varepsilon_1 \\ \delta, & (i,j) \in \varepsilon_2 \end{cases}$$
where $i$ and $j$ denote the joint points of different persons and $w_{i,j}$ denotes the weight of edge $(i,j)$. To determine the node connections in the intra-frame interactive modeling, the relevance between interactive nodes is measured by the Euclidean distance, and the values between all pairs of points are calculated, i.e.:
$$d(x_i, x_j) = \|x_i - x_j\|_2$$
where $x_i$ and $x_j$ are the feature representations of keypoint $i$ and keypoint $j$, respectively. The Euclidean distances $d(x_i, x_j)$ of the edges of the corresponding and extrinsic connections are calculated, the obtained distances are normalized, and the results are mapped into $[0, 1]$ using min-max normalization, i.e.:
$$\hat{d}(x_i, x_j) = \frac{d(x_i, x_j) - d_{\min}}{d_{\max} - d_{\min}}$$
where $d_{\max}$ represents the maximum joint distance and $d_{\min}$ the minimum. In this way, not only can some necessary new interaction connections be added, but the underlying graph also retains a certain sparsity, as sketched below.
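As an illustration, the following is a minimal numpy sketch of this intra-frame interaction weighting, assuming 2D joint coordinates; the function name, the weights theta and delta, and the pruning threshold are illustrative assumptions, since the patent does not fix their values.

```python
import numpy as np

def interaction_edge_weights(p1, p2, theta=1.0, delta=0.5, prune=0.1):
    """p1, p2: (J, 2) arrays of joint coordinates for the two participants.

    Same joint indices form corresponding connections (epsilon_1); all other
    cross-person pairs are extrinsic connections (epsilon_2). Returns a
    (J, J) cross-person weight matrix after min-max distance normalization.
    """
    J = p1.shape[0]
    # Euclidean distance d(x_i, x_j) = ||x_i - x_j||_2 for all joint pairs
    d = np.linalg.norm(p1[:, None, :] - p2[None, :, :], axis=-1)
    # Min-max normalization, mapping distances into [0, 1]
    d_hat = (d - d.min()) / (d.max() - d.min() + 1e-8)
    # Base weights: theta on corresponding edges, delta on extrinsic edges
    w = np.full((J, J), delta)
    np.fill_diagonal(w, theta)
    # Down-weight distant pairs and prune weak edges to keep the graph sparse
    w = w * (1.0 - d_hat)
    w[w < prune] = 0.0
    return w
```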
(1-2) In the time domain, the corresponding joint points of the individual video frames are independent of each other. Frame $x_t$ is connected to its previous frame $x_{t-1}$ and its next frame $x_{t+1}$, and there are two types of connections between the corresponding neighborhoods of adjacent frames: 1) connections between joints of the same type, denoted $\varepsilon_3$; 2) connections between arbitrary joint points in adjacent frames, denoted $\varepsilon_4$. The weights of these two edge types are expressed as:
[Formula image in the source: the per-type weights assigned to the $\varepsilon_3$ and $\varepsilon_4$ edges.]
further, the step (2) is realized by:
and constructing an undirected graph G which is { V, E, A }, and consists of a vertex set V, an edge set E connecting the vertices and a weighted adjacency matrix A. Constructing a multi-frame adjacency matrix:
$$A_{total} = \begin{bmatrix} A^{*(1)} & A_{1,2} & \cdots & 0 \\ A_{2,1} & A^{*(2)} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & A^{*(T)} \end{bmatrix}$$
where $A_{total}$ is the new adjacency matrix, $A^{*(i)}$ is the intra-frame adjacency matrix of frame $i$, $A_{i,j}$ is the adjacency matrix between frame $i$ and frame $j$, and $0$ is the zero matrix. The graph Laplacian is then computed as $L = D - A_{total}$, where $D$ is the degree matrix, a diagonal matrix with entries

$$D_{ii} = \sum_j a_{i,j},$$

and $a_{i,j}$ is the weight assigned to the edge connecting vertex $i$ and vertex $j$.
The change of the bones is modeled through the graph Laplacian. The Laplacian matrix $L$ is essentially a high-pass operator that captures the changes of the underlying signal: for any signal $x \in \mathbb{R}^N$ it satisfies

$$(Lx)(i) = \sum_{j \in N_i} a_{i,j}\,\big(x(i) - x(j)\big),$$

where $(Lx)(i)$ is the $i$-th component of $Lx$ and $N_i$ is the set of vertices connected to $i$. A sketch of assembling these matrices follows.
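A short numpy sketch, under assumed shapes, of assembling $A_{total}$ and the Laplacian $L = D - A_{total}$; since only adjacent frames are connected, the result is block-tridiagonal.

```python
import numpy as np

def build_total_adjacency(intra, inter):
    """intra: list of T (V, V) intra-frame adjacency matrices A*(i).
    inter: list of T-1 (V, V) matrices A_{i,i+1} between adjacent frames.
    Returns the (T*V, T*V) multi-frame adjacency matrix A_total.
    """
    T, V = len(intra), intra[0].shape[0]
    A = np.zeros((T * V, T * V))
    for i in range(T):
        A[i*V:(i+1)*V, i*V:(i+1)*V] = intra[i]        # diagonal blocks A*(i)
        if i < T - 1:                                  # adjacent-frame blocks
            A[i*V:(i+1)*V, (i+1)*V:(i+2)*V] = inter[i]
            A[(i+1)*V:(i+2)*V, i*V:(i+1)*V] = inter[i].T
    return A

def graph_laplacian(A):
    D = np.diag(A.sum(axis=1))   # degree matrix, D_ii = sum_j a_ij
    return D - A                 # L = D - A_total
```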
Then, the spectral graph convolution is approximated with Chebyshev polynomials:

$$g_\theta \star x \approx \sum_{k=0}^{K-1} \theta_k\, T_k(\tilde{L})\, x,$$

where $\tilde{L} = 2L/\lambda_{\max} - I_N$ is the defined symmetric normalized (rescaled) graph Laplacian, $\theta_k$ denotes the $k$-th Chebyshev coefficient, $g_\theta$ the convolution kernel, and $K$ the Chebyshev order. $T_k(\tilde{L})$ is the Chebyshev polynomial of order $k$; it is computed repeatedly by the recurrence

$$T_k(\tilde{L}) = 2\tilde{L}\,T_{k-1}(\tilde{L}) - T_{k-2}(\tilde{L}),$$

where $T_0(\tilde{L}) = I_N$ and $T_1(\tilde{L}) = \tilde{L}$.
Thus, the spectral graph convolution layer is defined as:

$$Y = \mathrm{ReLU}\Big(\sum_{k=0}^{K-1} T_k(\tilde{L})\, X\, W_k + b\Big),$$

where $W_k \in \mathbb{R}^{F_1 \times F_2}$ is the matrix of weight parameters $\theta'_k$ to be learned by the network, whose dimension is determined by the dimensions of the two adjacent connection layers, $F_1$ and $F_2$ being the respective layer dimensions; $b$ is the bias and ReLU is the activation function. A sketch of this layer follows.
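For illustration, a numpy sketch of this $K$-order Chebyshev layer; the value $\lambda_{\max} = 2$ and the weight shapes are assumptions. Applying the recurrence to $T_k(\tilde{L})X$ rather than to the full $N \times N$ polynomial keeps the cost at $K$ matrix products on the feature matrix and avoids any eigendecomposition.

```python
import numpy as np

def chebyshev_graph_conv(X, L, weights, b=0.0, lam_max=2.0):
    """X: (N, F1) node features; L: (N, N) graph Laplacian;
    weights: list of K (F1, F2) matrices W_k. Returns (N, F2).
    """
    N = L.shape[0]
    L_tilde = 2.0 * L / lam_max - np.eye(N)    # rescaled Laplacian
    Tx_prev, Tx = X, L_tilde @ X               # T_0(L~)X = X, T_1(L~)X = L~X
    out = Tx_prev @ weights[0]
    if len(weights) > 1:
        out = out + Tx @ weights[1]
    for k in range(2, len(weights)):
        # Chebyshev recurrence: T_k X = 2 L~ T_{k-1} X - T_{k-2} X
        Tx_prev, Tx = Tx, 2.0 * L_tilde @ Tx - Tx_prev
        out = out + Tx @ weights[k]
    return np.maximum(out + b, 0.0)            # ReLU(sum_k T_k(L~) X W_k + b)
```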
Further, in step (3), the action time-sequence dependency information is acquired with the slice recurrent neural network, and the actions in the video are classified according to the extracted feature information, specifically as follows:
(3-1) After the high-level graph features are extracted by the graph convolution model, let $T$ be the sequence length. The input $X$ is divided into $n$ subsequences of equal length, so the length $t$ of each subsequence is

$$t = \frac{T}{n},$$

whereby the input is represented as $X = [N_1, N_2, \ldots, N_n]$, where $N_p = [x_{(p-1)t+1}, x_{(p-1)t+2}, \ldots, x_{pt}]$ and $X$ comprises the feature sequence of $T$ frames.
(3-2) Each subsequence $N_p$ is sliced again into $n$ subsequences of equal length, and the slicing operation is repeated $k$ times in total, until the minimum subsequences at the bottom layer have a suitable length; slicing $k$ times yields $k+1$ layers.
(3-3) At layer 0, the recurrent unit acts on each minimum subsequence through the connection structure to obtain the last hidden state of each minimum subsequence at layer 0, which is used as the input of its parent sequence at the first layer. Then, the last hidden state of each subsequence at layer $p-1$ is used as the input of the parent sequence at layer $p$, and the last hidden state of the subsequence at layer $p$ is computed, i.e.:

$$h_l^0 = \mathrm{GRU}^0\big(\mathrm{mss}_l\big),$$
$$h_l^p = \mathrm{GRU}^p\big(h_{(l-1)n+1}^{p-1}, \ldots, h_{ln}^{p-1}\big),$$

where $h_l^p$ is the hidden representation of the $l$-th subsequence at layer $p$, $\mathrm{mss}_l$ is the $l$-th minimum subsequence at layer 0, $l_0$ is the minimum subsequence length at layer 0, $l_p$ is the subsequence length at layer $p$, and $\mathrm{GRU}^p$ denotes the forward computation of the layer-$p$ GRU unit.
(3-4) Multiple gated recurrent unit (GRU) network models are applied to the subsequences of each layer until the final hidden state $F$ of the top layer (layer $k$) is obtained, i.e.:

$$F = h^k = \mathrm{GRU}^k\big(h_1^{k-1}, \ldots, h_n^{k-1}\big).$$
(3-5) Action classification is performed on the extracted video features:

$$p = \mathrm{softmax}(W_F F + b_F),$$

where softmax is the normalized exponential function, $W_F$ is the weight matrix and $b_F$ is the bias term. A sketch of the whole sliced structure follows.
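Below is a hedged PyTorch sketch of this sliced recurrent structure; the slice count n, the slice depth k, and all layer sizes are assumed values not fixed by the patent. Each level runs its GRU over all of its subsequences as one large batch, which is where the parallel speed-up over a single long recurrence comes from.

```python
import torch
import torch.nn as nn

class SlicedGRU(nn.Module):
    """k+1 stacked GRU levels; level 0 reads the minimum subsequences and
    each later level reads the last hidden states of its child slices."""

    def __init__(self, in_dim, hid_dim, num_classes, n=4, k=2):
        super().__init__()
        self.n, self.k = n, k
        self.grus = nn.ModuleList(
            [nn.GRU(in_dim, hid_dim, batch_first=True)] +
            [nn.GRU(hid_dim, hid_dim, batch_first=True) for _ in range(k)])
        self.fc = nn.Linear(hid_dim, num_classes)

    def forward(self, x):          # x: (B, T, in_dim), T divisible by n**k
        B = x.size(0)
        for level, gru in enumerate(self.grus):
            pieces = self.n ** (self.k - level)   # subsequence count here
            x = x.reshape(B * pieces, x.size(1) // pieces, x.size(-1))
            _, h = gru(x)          # last hidden state of every subsequence
            x = h[-1].reshape(B, pieces, -1)      # inputs for the parents
        F_top = x.squeeze(1)       # final top-layer hidden state F
        return torch.softmax(self.fc(F_top), dim=-1)  # p = softmax(W_F F + b_F)
```

For example, with n=4 and k=2 an input of T=32 frames is processed at level 0 as 16 minimum subsequences of length 2, then as 4 subsequences of length 4, and finally as one top-level sequence of length 4.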
The invention has the following beneficial effects:
1. The interaction recognition method based on skeleton features and the slice recurrent neural network fuses temporal and spatial interaction features, and to a great extent solves the problems of incomplete interaction-information extraction and lost inter-frame dependencies.
2. The slice recurrent neural network retains long-term dependency information to a large degree while computing in parallel; meanwhile, using the high-level features of the spatio-temporal modeling as its input appropriately compensates for the long-term dependency loss that slicing causes in a recurrent network, so interactions are recognized more completely and accurately.
3. The method can be applied in many fields, such as intelligent surveillance, human-computer interaction, video sequence understanding, and healthcare.
Drawings
FIG. 1 is a schematic flow chart of the practice of the present invention.
Fig. 2 is a diagram of a sliced recurrent neural network according to the present invention.
Detailed Description
The invention will be further explained with reference to the drawings.
As shown in fig. 1, the interaction recognition method based on skeleton features and the slice recurrent neural network mainly involves the connections between different joint points within and between frames, the extraction of joint features by spectral graph convolution, and the sliced recurrent neural network method. The implementation of the invention is explained in detail below with respect to these aspects.
Skeleton extraction is performed on each action sequence by OpenPose. For each frame of each video, the information $(x, y, z)$ of the 15 joint points of the human skeleton is extracted, where $x$ is the abscissa of the joint point in the image, $y$ is its ordinate, and $z$ is the confidence value of the joint point; a parsing sketch follows. The joint-point connections are divided into single-person connections, interactive connections and inter-frame connections, constructed as follows:
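For illustration, a sketch of parsing per-frame OpenPose JSON output into such $(x, y, z)$ joint arrays; the "pose_keypoints_2d" field name follows the OpenPose JSON output format, and the 15-joint layout assumes the MPI body model.

```python
import json
import numpy as np

def load_frame_skeletons(json_path, num_joints=15):
    """Returns a (num_people, num_joints, 3) array of (x, y, confidence)."""
    with open(json_path) as f:
        frame = json.load(f)
    people = []
    for person in frame.get("people", []):
        # Flat [x0, y0, c0, x1, y1, c1, ...] list -> rows of (x, y, z)
        kp = np.asarray(person["pose_keypoints_2d"], dtype=np.float32)
        people.append(kp.reshape(-1, 3)[:num_joints])
    return np.stack(people) if people else np.zeros((0, num_joints, 3))
```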
(1) Intra-frame connections comprise two parts: single-person self-connections and interactive connections. In the interactive part, connections between points that tend to undergo similar joint changes are called corresponding connections, denoted $\varepsilon_1$; for example, when the actions of the two participants are basically consistent, connections are established between their corresponding joint points, and these corresponding edges play an important role in such cases. Connections occurring between other joint points are called extrinsic connections, denoted $\varepsilon_2$. A weight $\theta$ is assigned to the edges in $\varepsilon_1$ and a weight $\delta$ to the edges in $\varepsilon_2$, i.e.:
$$w_{i,j} = \begin{cases} \theta, & (i,j) \in \varepsilon_1 \\ \delta, & (i,j) \in \varepsilon_2 \end{cases}$$
where $i$ and $j$ denote the joint points of different persons and $w_{i,j}$ denotes the weight of edge $(i,j)$. To determine the node connections in the intra-frame interactive modeling, the relevance between interactive nodes is measured by the Euclidean distance, and the values between all pairs of points are calculated, i.e.:
$$d(x_i, x_j) = \|x_i - x_j\|_2$$
where $x_i$ and $x_j$ are the feature representations of keypoint $i$ and keypoint $j$, respectively. The Euclidean distances $d(x_i, x_j)$ of the edges of the corresponding and extrinsic connections are calculated, the obtained distances are normalized, and the results are mapped into $[0, 1]$ using min-max normalization, i.e.:
$$\hat{d}(x_i, x_j) = \frac{d(x_i, x_j) - d_{\min}}{d_{\max} - d_{\min}}$$
where $d_{\max}$ represents the maximum joint distance and $d_{\min}$ the minimum. In this way, not only can some necessary new interaction connections be added, but the underlying graph also retains a certain sparsity.
(2) In the time domain, the corresponding joint points of the individual video frames are independent of each other. Frame $x_t$ is connected to its previous frame $x_{t-1}$ and its next frame $x_{t+1}$, and there are two types of connections between the corresponding neighborhoods of adjacent frames: 1) connections between joints of the same type, denoted $\varepsilon_3$; 2) connections between arbitrary joint points in adjacent frames, denoted $\varepsilon_4$. The weights of these two edge types are expressed as:
[Formula image in the source: the per-type weights assigned to the $\varepsilon_3$ and $\varepsilon_4$ edges.] A sketch with assumed weight values follows.
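Since the exact weight values survive only as an image, the following sketch uses assumed values w3 and w4 to illustrate the two inter-frame edge types.

```python
import numpy as np

def inter_frame_adjacency(num_joints, w3=1.0, w4=0.25):
    """Returns the (V, V) block A_{t,t+1} linking frames t and t+1."""
    A = np.full((num_joints, num_joints), w4)  # epsilon_4: any joint pair
    np.fill_diagonal(A, w3)                    # epsilon_3: same joint type
    return A
```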
(3) An undirected graph $G = \{V, E, A\}$ is constructed, consisting of a vertex set $V$, an edge set $E$ connecting the vertices, and a weighted adjacency matrix $A$. The multi-frame adjacency matrix is constructed as:
$$A_{total} = \begin{bmatrix} A^{*(1)} & A_{1,2} & \cdots & 0 \\ A_{2,1} & A^{*(2)} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & A^{*(T)} \end{bmatrix}$$
where $A_{total}$ is the new adjacency matrix, $A^{*(i)}$ is the intra-frame adjacency matrix of frame $i$, $A_{i,j}$ is the adjacency matrix between frame $i$ and frame $j$, and $0$ is the zero matrix. The graph Laplacian is then computed as $L = D - A_{total}$, where $D$ is the degree matrix, a diagonal matrix with entries

$$D_{ii} = \sum_j a_{i,j},$$

and $a_{i,j}$ is the weight assigned to the edge connecting vertex $i$ and vertex $j$.
The change of the bones is modeled through the graph Laplacian. The Laplacian matrix $L$ is essentially a high-pass operator that captures the changes of the underlying signal: for any signal $x \in \mathbb{R}^N$ it satisfies

$$(Lx)(i) = \sum_{j \in N_i} a_{i,j}\,\big(x(i) - x(j)\big),$$

where $(Lx)(i)$ is the $i$-th component of $Lx$ and $N_i$ is the set of vertices connected to $i$. Then the approximation of the spectral graph convolution by Chebyshev polynomials is adopted:
$$g_\theta \star x \approx \sum_{k=0}^{K-1} \theta_k\, T_k(\tilde{L})\, x,$$

where $\tilde{L} = 2L/\lambda_{\max} - I_N$ is the defined symmetric normalized (rescaled) graph Laplacian, $\theta_k$ denotes the $k$-th Chebyshev coefficient, $g_\theta$ the convolution kernel, and $K$ the Chebyshev order. $T_k(\tilde{L})$ is the Chebyshev polynomial of order $k$, computed repeatedly by the recurrence

$$T_k(\tilde{L}) = 2\tilde{L}\,T_{k-1}(\tilde{L}) - T_{k-2}(\tilde{L}),$$

where $T_0(\tilde{L}) = I_N$ and $T_1(\tilde{L}) = \tilde{L}$.
Thus, the spectral graph convolution layer is defined as:

$$Y = \mathrm{ReLU}\Big(\sum_{k=0}^{K-1} T_k(\tilde{L})\, X\, W_k + b\Big),$$

where $W_k \in \mathbb{R}^{F_1 \times F_2}$ is the matrix of weight parameters $\theta'_k$ to be learned by the network, whose dimension is determined by the dimensions of the two adjacent connection layers, $F_1$ and $F_2$ being the respective layer dimensions; $b$ is the bias and ReLU is the activation function.
As shown in fig. 2, acquiring the action time-sequence dependency information with the slice recurrent neural network and classifying the actions according to this information specifically comprises the following steps:
(1) After the high-level graph features are extracted by the graph convolution model, let $T$ be the sequence length. The input $X$ is divided into $n$ subsequences of equal length, so the length $t$ of each subsequence is

$$t = \frac{T}{n},$$

whereby the input is represented as $X = [N_1, N_2, \ldots, N_n]$, where $N_p = [x_{(p-1)t+1}, x_{(p-1)t+2}, \ldots, x_{pt}]$ and $X$ comprises the feature sequence of $T$ frames.
(2) Each subsequence $N_p$ is sliced again into $n$ subsequences of equal length, and the slicing operation is repeated $k$ times in total, until the minimum subsequences at the bottom layer have a suitable length; slicing $k$ times yields $k+1$ layers.
(3) At layer 0, the recurrent unit acts on each minimum subsequence through the connection structure to obtain the last hidden state of each minimum subsequence at layer 0, which is used as the input of its parent sequence at the first layer. Then, the last hidden state of each subsequence at layer $p-1$ is used as the input of the parent sequence at layer $p$, and the last hidden state of the subsequence at layer $p$ is computed, i.e.:

$$h_l^0 = \mathrm{GRU}^0\big(\mathrm{mss}_l\big),$$
$$h_l^p = \mathrm{GRU}^p\big(h_{(l-1)n+1}^{p-1}, \ldots, h_{ln}^{p-1}\big),$$

where $h_l^p$ is the hidden representation of the $l$-th subsequence at layer $p$, $\mathrm{mss}_l$ is the $l$-th minimum subsequence at layer 0, $l_0$ is the minimum subsequence length at layer 0, $l_p$ is the subsequence length at layer $p$, and $\mathrm{GRU}^p$ denotes the forward computation of the layer-$p$ GRU unit.
(4) Multiple gated recurrent unit (GRU) network models are applied to the subsequences of each layer until the final hidden state $F$ of the top layer (layer $k$) is obtained, i.e.:

$$F = h^k = \mathrm{GRU}^k\big(h_1^{k-1}, \ldots, h_n^{k-1}\big).$$
(5) Action classification is performed on the extracted video features:

$$p = \mathrm{softmax}(W_F F + b_F),$$

where softmax is the normalized exponential function, $W_F$ is the weight matrix and $b_F$ is the bias term.
The series of detailed descriptions listed above are merely specific illustrations of feasible embodiments of the invention. They are not intended to limit the scope of protection of the invention, and all equivalent implementations or modifications that do not depart from the technical spirit of the invention shall be included within its scope.

Claims (8)

1. A human body interaction action recognition method based on skeleton features and a slice recurrent neural network, characterized by comprising the following steps:
S1: extracting the skeleton of the action based on the video frames, and designing different connections for the joint points so as to extract the interaction information between different nodes;
S2: constructing a new skeleton graph, and applying spectral graph convolution to the spatio-temporal skeleton graph to obtain a high-level feature map;
S3: acquiring time-sequence dependency information by adopting a slice recurrent neural network model, and classifying the actions in the video according to the features extracted by the slice recurrent neural network.
2. The human body interaction action recognition method based on the skeleton feature and the slice recurrent neural network as claimed in claim 1, wherein the different connections designed in S1 comprise intra-frame connections and inter-frame connections.
3. The human body interaction action recognition method based on the skeleton feature and the slice recurrent neural network as claimed in claim 2, wherein the intra-frame connections comprise single-person connections and interactive connections; in the interactive part, the connections between points susceptible to similar joint changes are called corresponding connections, denoted $\varepsilon_1$, and are established between the corresponding joint points when the actions of the two participants are basically consistent; the connections occurring between other joint points are called extrinsic connections, denoted $\varepsilon_2$; a weight $\theta$ is assigned to the edges in $\varepsilon_1$ and a weight $\delta$ to the edges in $\varepsilon_2$, i.e.:
$$w_{i,j} = \begin{cases} \theta, & (i,j) \in \varepsilon_1 \\ \delta, & (i,j) \in \varepsilon_2 \end{cases}$$
where $i$ and $j$ denote the joint points of different persons and $w_{i,j}$ denotes the weight of edge $(i,j)$;
to determine the joint-point connections in the intra-frame connection modeling, the relevance between interactive nodes is measured by the Euclidean distance, and the values between all pairs of points are calculated, i.e.:
$$d(x_i, x_j) = \|x_i - x_j\|_2$$
where $x_i$ and $x_j$ respectively represent the features of keypoint $i$ and keypoint $j$; the Euclidean distances $d(x_i, x_j)$ of the edges of the corresponding and extrinsic connections are calculated, the obtained distances are normalized, and the results are mapped into $[0, 1]$ using min-max normalization, i.e.:
$$\hat{d}(x_i, x_j) = \frac{d(x_i, x_j) - d_{\min}}{d_{\max} - d_{\min}}$$
where $d_{\max}$ represents the maximum joint distance and $d_{\min}$ represents the minimum joint distance.
4. The human body interaction action recognition method based on the skeleton feature and the slice recurrent neural network as claimed in claim 2, wherein the inter-frame connections comprise two connection types: 1) connections between joints of the same type, denoted $\varepsilon_3$; 2) connections between arbitrary joint points in adjacent frames, denoted $\varepsilon_4$;
the weights of the two edge types are expressed as:
[Formula image in the source: the per-type weights assigned to the $\varepsilon_3$ and $\varepsilon_4$ edges.]
5. The human body interaction action recognition method based on the skeleton feature and the slice recurrent neural network as claimed in claim 2, wherein the implementation of S2 comprises:
constructing an undirected graph $G = \{V, E, A\}$, consisting of a vertex set $V$, an edge set $E$ connecting the vertices and a weighted adjacency matrix $A$, and constructing the multi-frame adjacency matrix:
$$A_{total} = \begin{bmatrix} A^{*(1)} & A_{1,2} & \cdots & 0 \\ A_{2,1} & A^{*(2)} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & A^{*(T)} \end{bmatrix}$$
where $A_{total}$ is the new adjacency matrix, $A^{*(i)}$ is the intra-frame adjacency matrix of frame $i$, $A_{i,j}$ is the adjacency matrix between frame $i$ and frame $j$, and $0$ is the zero matrix;
the graph Laplacian is computed as $L = D - A_{total}$, where $D$ is the degree matrix, a diagonal matrix with entries

$$D_{ii} = \sum_j a_{i,j},$$

and $a_{i,j}$ represents the weight assigned to the edge connecting vertex $i$ and vertex $j$.
The change of the bones is modeled by the graph Laplacian: the Laplacian matrix $L$ is a high-pass operator used to capture the changes of the underlying signal; for any signal $x \in \mathbb{R}^N$ it satisfies

$$(Lx)(i) = \sum_{j \in N_i} a_{i,j}\,\big(x(i) - x(j)\big),$$

where $(Lx)(i)$ represents the $i$-th component of $Lx$ and $N_i$ is the set of vertices connected to $i$.
6. The human body interaction action recognition method based on the skeleton feature and the slice recurrent neural network as claimed in claim 5, further comprising: realizing the spectral graph convolution with Chebyshev polynomials:

$$g_\theta \star x \approx \sum_{k=0}^{K-1} \theta_k\, T_k(\tilde{L})\, x,$$

where $\tilde{L} = 2L/\lambda_{\max} - I_N$ is the defined symmetric normalized graph Laplacian, $\theta_k$ denotes the $k$-th Chebyshev coefficient, $g_\theta$ the convolution kernel, $K$ the Chebyshev order, and $T_k(\tilde{L})$ the Chebyshev polynomial of order $k$, obtained by the repeated calculation

$$T_k(\tilde{L}) = 2\tilde{L}\,T_{k-1}(\tilde{L}) - T_{k-2}(\tilde{L}), \qquad T_0(\tilde{L}) = I_N,\; T_1(\tilde{L}) = \tilde{L};$$

thus, the spectral graph convolution layer is defined as:

$$Y = \mathrm{ReLU}\Big(\sum_{k=0}^{K-1} T_k(\tilde{L})\, X\, W_k + b\Big),$$

where $W_k \in \mathbb{R}^{F_1 \times F_2}$ is the matrix of weight parameters $\theta'_k$ to be learned by the network, its dimension determined by the dimensions $F_1$ and $F_2$ of the two adjacent connection layers; $b$ is the bias and ReLU is the activation function.
7. The human body interaction action recognition method based on the skeleton feature and the slice recurrent neural network as claimed in claim 1, wherein the implementation of S3 comprises:
S3.1: with $T$ the sequence length, dividing the input $X$ into $n$ subsequences of equal length, so that the length $t$ of each subsequence is

$$t = \frac{T}{n},$$

whereby the input is represented as $X = [N_1, N_2, \ldots, N_n]$, where $N_p = [x_{(p-1)t+1}, x_{(p-1)t+2}, \ldots, x_{pt}]$ and $X$ comprises the feature sequence of $T$ frames;
S3.2: slicing each subsequence again into $n$ subsequences of equal length, repeating the slicing operation $k$ times until the minimum subsequences at the bottom layer have a suitable length, slicing $k$ times yielding $k+1$ layers;
S3.3: at layer 0, applying the recurrent unit to each minimum subsequence through the connection structure to obtain the last hidden state of each minimum subsequence at layer 0, which is used as the input of its parent sequence at the first layer; then using the last hidden state of each subsequence at layer $p-1$ as the input of the parent sequence at layer $p$, and computing the last hidden state of the subsequence at layer $p$, i.e.:

$$h_l^0 = \mathrm{GRU}^0\big(\mathrm{mss}_l\big),$$
$$h_l^p = \mathrm{GRU}^p\big(h_{(l-1)n+1}^{p-1}, \ldots, h_{ln}^{p-1}\big),$$

where $h_l^p$ is the hidden representation of the $l$-th subsequence at layer $p$, $\mathrm{mss}_l$ is the $l$-th minimum subsequence at layer 0, $l_0$ is the minimum subsequence length at layer 0, $l_p$ is the subsequence length at layer $p$, and $\mathrm{GRU}^p$ denotes the forward computation of the layer-$p$ GRU unit;
S3.4: applying the gated recurrent unit (GRU) network models to the subsequences of each layer until the final hidden state $F$ of the top layer (layer $k$) is obtained, i.e.:

$$F = h^k = \mathrm{GRU}^k\big(h_1^{k-1}, \ldots, h_n^{k-1}\big);$$

S3.5: performing action classification on the extracted video features $F$.
8. The human body interaction action recognition method based on the skeleton feature and the slice recurrent neural network as claimed in claim 7, wherein the classification is implemented according to the following formula:

$$p = \mathrm{softmax}(W_F F + b_F),$$

where softmax is the normalized exponential function, $W_F$ is the weight matrix, and $b_F$ is the bias term.
CN202011146588.3A 2020-10-23 2020-10-23 Human interactive action recognition method based on skeleton characteristics and slicing recurrent neural network Active CN112329562B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011146588.3A CN112329562B (en) 2020-10-23 2020-10-23 Human interactive action recognition method based on skeleton characteristics and slicing recurrent neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011146588.3A CN112329562B (en) 2020-10-23 2020-10-23 Human interactive action recognition method based on skeleton characteristics and slicing recurrent neural network

Publications (2)

Publication Number Publication Date
CN112329562A true CN112329562A (en) 2021-02-05
CN112329562B CN112329562B (en) 2024-05-14

Family

ID=74310929

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011146588.3A Active CN112329562B (en) 2020-10-23 2020-10-23 Human interactive action recognition method based on skeleton characteristics and slicing recurrent neural network

Country Status (1)

Country Link
CN (1) CN112329562B (en)


Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110796110A (en) * 2019-11-05 2020-02-14 西安电子科技大学 Human behavior identification method and system based on graph convolution network
CN111353447A (en) * 2020-03-05 2020-06-30 辽宁石油化工大学 Human skeleton behavior identification method based on graph convolution network

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113792712A (en) * 2021-11-15 2021-12-14 长沙海信智能系统研究院有限公司 Action recognition method, device, equipment and storage medium

Also Published As

Publication number Publication date
CN112329562B (en) 2024-05-14

Similar Documents

Publication Publication Date Title
CN107609460B (en) Human body behavior recognition method integrating space-time dual network flow and attention mechanism
CN106407889B (en) Method for recognizing human body interaction in video based on optical flow graph deep learning model
CN111339942B (en) Method and system for recognizing skeleton action of graph convolution circulation network based on viewpoint adjustment
CN108133188A (en) A kind of Activity recognition method based on motion history image and convolutional neural networks
CN105844627B (en) A kind of sea-surface target image background suppressing method based on convolutional neural networks
CN110472604B (en) Pedestrian and crowd behavior identification method based on video
CN110598598A (en) Double-current convolution neural network human behavior identification method based on finite sample set
CN110163127A (en) A kind of video object Activity recognition method from thick to thin
CN104615983A (en) Behavior identification method based on recurrent neural network and human skeleton movement sequences
CN111507182B (en) Skeleton point fusion cyclic cavity convolution-based littering behavior detection method
CN105160310A (en) 3D (three-dimensional) convolutional neural network based human body behavior recognition method
CN110135386B (en) Human body action recognition method and system based on deep learning
CN112508110A (en) Deep learning-based electrocardiosignal graph classification method
CN107169117B (en) Hand-drawn human motion retrieval method based on automatic encoder and DTW
CN110458235B (en) Motion posture similarity comparison method in video
CN111339847A (en) Face emotion recognition method based on graph convolution neural network
CN111931722B (en) Correlated filtering tracking method combining color ratio characteristics
CN110096976A (en) Human behavior micro-Doppler classification method based on sparse migration network
CN110348492A (en) A kind of correlation filtering method for tracking target based on contextual information and multiple features fusion
Zhang et al. A Gaussian mixture based hidden Markov model for motion recognition with 3D vision device
CN113505719A (en) Gait recognition model compression system and method based on local-integral joint knowledge distillation algorithm
CN112329562A (en) Human body interaction action recognition method based on skeleton features and slice recurrent neural network
Qi et al. Research on deep learning expression recognition algorithm based on multi-model fusion
CN112800908B (en) Method for establishing anxiety perception model based on individual gait analysis in video
CN112149613A (en) Motion estimation evaluation method based on improved LSTM model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant