CN111353447A - Human skeleton behavior identification method based on graph convolution network - Google Patents


Info

Publication number
CN111353447A
CN111353447A (application CN202010146319.0A); granted publication CN111353447B
Authority
CN
China
Prior art keywords
skeleton
graph
joint
sequence
frame
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010146319.0A
Other languages
Chinese (zh)
Other versions
CN111353447B (en)
Inventor
曹江涛 (Cao Jiangtao)
赵挺 (Zhao Ting)
洪恺临 (Hong Kailin)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Liaoning Shihua University
Original Assignee
Liaoning Shihua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Liaoning Shihua University filed Critical Liaoning Shihua University
Priority to CN202010146319.0A priority Critical patent/CN111353447B/en
Publication of CN111353447A publication Critical patent/CN111353447A/en
Application granted granted Critical
Publication of CN111353447B publication Critical patent/CN111353447B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/103 Static body considered as a whole, e.g. static pedestrian or occupant recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/047 Probabilistic or stochastic networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

A human skeleton behavior recognition method based on a graph convolution network belongs to the fields of computer vision and deep learning, and comprises the steps of obtaining human skeleton video frames and carrying out normalization processing; constructing, for each frame, a human-joint intrinsic-dependency connection graph, an individual extrinsic-dependency connection graph and an interactive-dependency connection graph; obtaining the joint connection graph of the whole interaction; assigning weight values to the edges of each human-joint connection graph; performing a graph convolution operation to obtain the spatial features of the skeleton sequence; and performing time-series modeling with a long short-term memory network to obtain the category of the interactive behavior. The intrinsic-dependency connection edges allow the basic human behavior features to be learned, the extrinsic-dependency connection edges allow additional behavior features to be learned, and the interactive-dependency connection edges allow the interaction relationship between the two persons to be learned better, so that the motion relationship of two-person interactive behaviors is better represented and the recognition performance is improved.

Description

Human skeleton behavior identification method based on graph convolution network
Technical Field
The invention belongs to the technical field of computer vision and deep learning, and particularly relates to a human skeleton behavior identification method based on a graph convolution network.
Background
Video-based human behavior recognition and understanding is a frontier direction of wide interest in image processing and computer vision. With the fusion and development of deep learning and computer vision techniques, behavior recognition has been widely applied in video analysis, intelligent surveillance, human-computer interaction, augmented reality, video retrieval and other fields. Compared with single-person actions, two-person interactive behaviors are more common in daily life and more difficult to recognize. Research on two-person interaction is mainly divided into RGB-based studies and studies based on skeleton joint point data. Conventional RGB video has poor robustness owing to factors such as illumination change, occlusion and complex backgrounds. Skeleton joint point data, by contrast, contain the compact three-dimensional positions of the main body joints and are therefore robust to changes of viewpoint, body scale and movement speed. Behavior recognition based on skeleton joint point data has accordingly received increasing attention in recent years.
Methods for recognizing two-person interactive behavior from skeleton joint points fall mainly into two categories: methods based on hand-crafted features and methods based on deep learning. In the first category, Vemulapalli et al. [1] represent the human skeleton as points in a Lie group and perform temporal modeling and classification in the Lie algebra. Weng et al. [2] extend the Naive Bayes Nearest Neighbor (NBNN) method to space-time and classify behaviors using a phase-to-class distance. Such hand-crafted features are costly to design, and their recognition accuracy is difficult to improve further. Deep-learning-based methods can be further divided into CNN-based and RNN-based models. CNN-based methods convert the joint data into pictures which are then fed into the network for learning and classification; such methods ignore the timing information in the video. RNN-based methods can effectively model the timing information, but ignore the dependencies between joint points and the interaction relationship between the two persons. ([1] Raviteja Vemulapalli, Felipe Arrate, and Rama Chellappa. Human action recognition by representing 3D skeletons as points in a Lie group. In CVPR, pages 588-595, 2014. [2] Junwu Weng, Chaoqun Weng, and Junsong Yuan. Spatio-temporal naive-bayes nearest-neighbor for skeleton-based action recognition. In CVPR, pages 4171-4180, 2017.)
Recently, with the popularity of the graph convolutional network (GCN), many researchers have applied GCN methods in the field of behavior recognition. However, current research is mainly directed at single-person behaviors and mostly adopts the natural connection graph of the human body, ignoring the dependencies between joints that are not naturally connected. In existing two-person interactive applications, the two persons are split into two individuals and modeled separately, ignoring the interactive dependency relationship between them.
Disclosure of Invention
Aiming at the problems and the defects in the prior art, the invention provides a double interaction behavior recognition method based on a graph convolution network, wherein the recognition method comprises the steps of obtaining a double interaction framework video; carrying out normalization processing on the joint point coordinates of the obtained video; constructing a human body joint internal dependency graph, an individual external dependency graph and an interaction dependency graph; distributing different weights for the connecting edges of the three joint connecting graphs; sending the data into a graph convolution network for learning and extracting spatial features; sending the spatial characteristics obtained by each frame into a long-term and short-term memory network for time sequence modeling; and obtaining the identification result of the interactive behavior category.
The method specifically comprises the following steps:
Step S10, capturing a video: starting a camera, recording double-person interactive videos, collecting skeleton videos of various interactive actions performed by different action executors as training videos of the interactive actions, marking the interactive action meaning of each training video, and establishing a video training set.
Step S20, performing normalization processing on the preset video frames in the acquired skeleton video to obtain a skeleton sequence to be recognized.
Step S30, for each frame graph in the skeleton sequence to be recognized, constructing a corresponding human-joint intrinsic-dependency connection graph according to the joint point coordinates, wherein the joint points are the nodes of the graph and the natural connections among the joint points are the intrinsic-dependency connection edges of the graph; constructing the respective extrinsic-dependency connection edges of each single person and the interactive-dependency connection edges between the two persons, the three together forming the human-joint connection graph of each frame of the skeleton sequence to be recognized;
Step S40, respectively assigning weights to the edges of the three kinds of joint connection graphs corresponding to each frame graph of the skeleton sequence to be recognized, to obtain the corresponding human-joint connection graphs with different weight values;
Step S50, performing a graph convolution operation on the human-joint connection graph with different weight values corresponding to each frame of the skeleton sequence to be recognized, to obtain the spatial features of the skeleton sequence to be recognized;
Step S60, performing time-series modeling in the time dimension based on the spatial features of the skeleton sequence to be recognized, to obtain the behavior category of the skeleton sequence to be recognized.
Further, "normalize the preset video frame in the obtained skeleton video and then use it as the skeleton sequence to be identified", the method is as follows:
step S11, sampling the obtained original skeleton video at preset equal intervals as training and recognizing skeleton sequences;
step S12, performing rotation, translation and scale normalization processing on the coordinates of each frame of joint points in the obtained skeleton sequence to obtain a skeleton sequence to be identified, wherein the specific method comprises the following steps:
x̃_t^i = R^{-1} (x_t^i − o_R),  i ∈ J, t ∈ T
wherein x_t^i is the i-th joint coordinate value of the t-th acquired frame, J and T represent the sets of joint points and acquired frames, and x̃_t^i is the processed coordinate value;
the rotation matrix R and the rotation origin o_R are defined as:
R = [ v̂_1, (v_2 − proj_{v_1} v_2)^, (v_1 × v_2)^ ],  o_R = (x_1^{Lhip} + x_1^{Rhip}) / 2
wherein v_1 and v_2 are a vector perpendicular to the ground and the difference vector between the left and right hip joints of the initial skeleton in each sequence, v_2 − proj_{v_1} v_2 and v_1 × v_2 respectively represent the projection of v_2 onto the plane perpendicular to v_1 and the cross product of the two vectors (the symbol ^ denotes unit normalization), and x_1^{Lhip} and x_1^{Rhip} represent the coordinates of the left and right hip joints of the initial skeleton of each sequence.
Further, for each frame of image in the skeleton sequence to be identified, constructing a corresponding human body joint intrinsic dependence connection image according to joint point coordinates, wherein joint points are nodes of the image, and natural connection among the joint points is an intrinsic dependence connection edge of the image; the method comprises the following steps of constructing respective external dependence connecting edges of a single person and interactive dependence connecting edges of two persons, and forming a human body joint connection graph of each frame of a skeleton sequence to be recognized together by the three parts, wherein the method comprises the following steps:
Each frame of the two-person interaction is regarded as a whole and modeled as a graph G(x, W), where x ∈ R^{2N×3} contains the three-dimensional coordinates of the 2N joints of the two persons, and W is a 2N × 2N weighted adjacency matrix:
W = [ W_1    W_{1,2} ]
    [ W_{2,1}  W_2   ]
wherein W_1 and W_2 contain the intra-person edge weights of the first and second person, and (W_{1,2})_{mn} = γ when a first-person joint point m and a second-person joint point n are connected by an interactive-dependency edge (0 otherwise); α, β and γ represent the weights of the corresponding intrinsic-dependency, extrinsic-dependency and interactive-dependency edges, respectively.
Further, the method comprises the following steps of respectively distributing weights to the edges of three joint connection graphs corresponding to each frame graph of the skeleton sequence to be recognized to obtain corresponding human joint connection graphs with different weight values:
The weights are assigned as α = 3, β = 1 and γ = 5, emphasizing the intrinsic connections, treating the extrinsic connections as a supplement, and highlighting the interactive connections.
Further, "the human body joint connection diagram with different weight values corresponding to each frame diagram of the skeleton sequence to be identified is subjected to diagram convolution operation to obtain the spatial characteristics of the skeleton sequence to be identified", and the method comprises the following steps:
f_out = g_η * f_in
wherein * represents the graph convolution operation, g_η represents the graph convolution kernel, and W is the weighted adjacency matrix of the human-joint connection graph.
The graph convolution kernel is computed as follows. The normalized graph Laplacian over the spectral domain is L = I_n − D^{−1/2} W D^{−1/2}, where D is the diagonal degree matrix with D_ii = Σ_j w_ij. L is rescaled as L̃ = (2 / λ_max) L − I_n, where λ_max is the maximum eigenvalue of L, and T_k denotes the Chebyshev polynomial of order k. The convolution operation can then be expressed as:
f_out = Σ_{k=0}^{K−1} η_k T_k(L̃) f_in
where η = [η_0, η_1, …, η_{K−1}] are the training parameters and K is the size of the graph convolution kernel.
Further, "time-series modeling is performed in the time dimension based on the spatial features of the skeleton sequence to be recognized to obtain the behavior category of the skeleton sequence to be recognized", the method being as follows:
The spatial feature information of each frame obtained by the graph convolution operation is flattened through a fully connected layer and fed into a long short-term memory network for time-series modeling, and softmax classification is applied to obtain the final interactive behavior recognition result.
The invention has the advantages and effects that:
the double-person interactive behavior recognition method based on the graph convolution network comprises the steps of constructing a weighted joint connection graph added with a double-person interactive dependency relationship, obtaining a double-person interactive space characteristic with discriminability by adopting the graph convolution network, and sending the double-person interactive space characteristic into a long-period memory network to obtain a dynamic time relationship for modeling, so that the recognition precision is improved.
Drawings
FIG. 1 is a schematic flow chart diagram of a double-person interactive behavior recognition method based on graph convolution network according to the present invention;
FIG. 2 is a schematic diagram of the human-joint intrinsic-dependency connection graph, extrinsic-dependency connection graph and interactive-dependency connection graph constructed by the present invention;
FIG. 3 is an algorithmic flow chart of the present invention;
FIG. 4 is a diagram of an LSTM module cell;
FIG. 5 is a confusion matrix of results of the test of the NTU RGB + D data set according to the present invention.
Detailed Description
The present application will be described in further detail with reference to the following drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of the invention. It should be noted that, for convenience of description, only the portions related to the related invention are shown in the drawings.
The invention discloses a double interaction behavior recognition method based on a graph convolution network, which comprises the following steps:
Step S10, capturing a video: starting a camera, recording double-person interactive videos, collecting skeleton videos of various interactive actions performed by different action executors as training videos of the interactive actions, marking the interactive action meaning of each training video, and establishing a video training set.
Step S20, performing normalization processing on the preset video frames in the acquired skeleton video to obtain a skeleton sequence to be recognized.
Step S30, for each frame graph in the skeleton sequence to be recognized, constructing a corresponding human-joint intrinsic-dependency connection graph according to the joint point coordinates, wherein the joint points are the nodes of the graph and the natural connections among the joint points are the intrinsic-dependency connection edges of the graph; constructing the respective extrinsic-dependency connection edges of each single person and the interactive-dependency connection edges between the two persons, the three together forming the human-joint connection graph of each frame of the skeleton sequence to be recognized;
Step S40, respectively assigning weights to the edges of the three kinds of joint connection graphs corresponding to each frame graph of the skeleton sequence to be recognized, to obtain the corresponding human-joint connection graphs with different weight values;
Step S50, performing a graph convolution operation on the human-joint connection graph with different weight values corresponding to each frame of the skeleton sequence to be recognized, to obtain the spatial features of the skeleton sequence to be recognized;
Step S60, performing time-series modeling in the time dimension based on the spatial features of the skeleton sequence to be recognized, to obtain the behavior category of the skeleton sequence to be recognized.
In order to more clearly describe the double-person interactive behavior recognition method based on the graph convolution network, the following will expand the detailed description of the steps in the embodiment of the method of the present invention with reference to fig. 1.
Step S10, capturing a video: starting a camera, recording double-person interactive videos, collecting skeleton videos of various interactive actions performed by different action executors as training videos of the interactive actions, marking the interactive action meaning of each training video, and establishing a video training set.
With the development of image processing technology, a Microsoft Kinect camera can be used directly to obtain skeleton videos of two persons performing interactive behaviors, and the corresponding joint point data are stored.
Step S20, performing normalization processing on the preset video frames in the acquired skeleton video to obtain a skeleton sequence to be recognized.
Due to the change of people and the change of visual angles in shooting, normalization processing is carried out on the shot people and the shot visual angles in a data processing stage, and the method specifically comprises the following steps:
x̃_t^i = R^{-1} (x_t^i − o_R),  i ∈ J, t ∈ T
wherein x_t^i is the i-th joint coordinate value of the t-th acquired frame, J and T represent the sets of joint points and acquired frames, and x̃_t^i is the processed coordinate value;
the rotation matrix R and the rotation origin o_R are defined as:
R = [ v̂_1, (v_2 − proj_{v_1} v_2)^, (v_1 × v_2)^ ],  o_R = (x_1^{Lhip} + x_1^{Rhip}) / 2
wherein v_1 and v_2 are a vector perpendicular to the ground and the difference vector between the left and right hip joints of the initial skeleton in each sequence, v_2 − proj_{v_1} v_2 and v_1 × v_2 respectively represent the projection of v_2 onto the plane perpendicular to v_1 and the cross product of the two vectors (the symbol ^ denotes unit normalization), and x_1^{Lhip} and x_1^{Rhip} represent the coordinates of the left and right hip joints of the initial skeleton of each sequence.
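As an illustrative sketch (not part of the patent), the normalization of step S20 can be implemented as below; the joint indices `l_hip`/`r_hip`, the ground-perpendicular vector (0, 1, 0) and the choice of the hip midpoint as the rotation origin o_R are assumptions:

```python
import numpy as np

def normalize_skeleton(seq, l_hip=12, r_hip=16):
    """Rotate and translate a skeleton sequence seq of shape (T, J, 3) so that
    the initial hip centre becomes the origin and the hip axis aligns with x.
    Implements x~ = R^-1 (x - o_R), with R built from v1 (ground-perpendicular)
    and v2 (left/right hip difference of the first frame)."""
    v1 = np.array([0.0, 1.0, 0.0])              # assumed ground-perpendicular direction
    v2 = seq[0, r_hip] - seq[0, l_hip]          # hip difference vector of the initial skeleton
    v2 = v2 - v1 * np.dot(v2, v1)               # project v2 onto the plane perpendicular to v1
    v3 = np.cross(v1, v2)                       # third, mutually orthogonal axis
    R = np.stack([v2 / np.linalg.norm(v2),
                  v1 / np.linalg.norm(v1),
                  v3 / np.linalg.norm(v3)], axis=1)  # orthonormal columns, so R^-1 = R^T
    o_R = 0.5 * (seq[0, l_hip] + seq[0, r_hip])      # rotation origin: initial hip centre
    return (seq - o_R) @ R                      # row-vector form of R^T (x - o_R)
```

Applied to a (T, J, 3) sequence, this places the initial hip centre at the origin and aligns the initial hip axis with the x-axis, which is one common way to realize the rotation, translation and scale-independent normalization described above.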
Step S30, for each frame graph in the skeleton sequence to be recognized, constructing a corresponding human-joint intrinsic-dependency connection graph according to the joint point coordinates, wherein the joint points are the nodes of the graph and the natural connections among the joint points are the intrinsic-dependency connection edges of the graph; constructing the respective extrinsic-dependency connection edges of each single person and the interactive-dependency connection edges between the two persons, the three together forming the human-joint connection graph of each frame of the skeleton sequence to be recognized; the method comprises the following steps:
Each frame of the two-person interaction is regarded as a whole and modeled as a graph G(x, W), where x ∈ R^{2N×3} contains the three-dimensional coordinates of the 2N joints of the two persons, and W is a 2N × 2N weighted adjacency matrix:
W = [ W_1    W_{1,2} ]
    [ W_{2,1}  W_2   ]
wherein W_1 and W_2 contain the intra-person edge weights of the first and second person, and (W_{1,2})_{mn} = γ when a first-person joint point m and a second-person joint point n are connected by an interactive-dependency edge (0 otherwise); α, β and γ represent the weights of the corresponding intrinsic-dependency, extrinsic-dependency and interactive-dependency edges, respectively.
Step S40, respectively distributing weights for the edges of three kinds of joint connection graphs corresponding to each frame graph of the skeleton sequence to be recognized, and obtaining corresponding human body joint connection graphs with different weight values:
the weights are assigned, α -3, β -1, γ -5 to emphasize the internal connection, and to attach external connections, highlighting the inter-connection.
Step S50, carrying out graph convolution operation on the human body joint connection graph with different weight values corresponding to each frame graph of the skeleton sequence to be recognized, and obtaining the spatial characteristics of the skeleton sequence to be recognized:
given a video of T frames, constructing a graph [ G ] according to the method of claim 31,G2,...,GT]For each graph G constructed at t framesTIt is input into the graph volume layer:
f_out = g_η * f_in
wherein * represents the graph convolution operation, g_η represents the graph convolution kernel, and W is the weighted adjacency matrix of the human-joint connection graph.
The graph convolution kernel is computed as follows:
The normalized graph Laplacian over the spectral domain is L = I_n − D^{−1/2} W D^{−1/2}, where D is the diagonal degree matrix with D_ii = Σ_j w_ij. L is rescaled as L̃ = (2 / λ_max) L − I_n, where λ_max is the maximum eigenvalue of L, and T_k denotes the Chebyshev polynomial of order k. The convolution operation can then be expressed as:
f_out = Σ_{k=0}^{K−1} η_k T_k(L̃) f_in
where η = [η_0, η_1, …, η_{K−1}] are the training parameters and K is the size of the graph convolution kernel.
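The Chebyshev graph convolution of step S50 can be sketched as below; this is a plain NumPy illustration in which a fixed coefficient array η stands in for the trainable parameters (a real layer would learn η, typically per input/output channel):

```python
import numpy as np

def chebyshev_gconv(x, W, eta):
    """Spectral graph convolution with a Chebyshev-polynomial kernel of size K.
    x: (num_nodes, channels) node features; W: symmetric weighted adjacency;
    eta: (K,) coefficients (fixed here, trainable in a real layer)."""
    n = W.shape[0]
    d = W.sum(axis=1)
    d_inv_sqrt = np.where(d > 0, 1.0 / np.sqrt(np.maximum(d, 1e-12)), 0.0)
    # normalised Laplacian  L = I - D^{-1/2} W D^{-1/2}
    L = np.eye(n) - (W * d_inv_sqrt[:, None]) * d_inv_sqrt[None, :]
    lam_max = np.linalg.eigvalsh(L).max()
    L_tilde = 2.0 * L / lam_max - np.eye(n)      # rescale the spectrum into [-1, 1]
    Tk_prev, Tk = np.eye(n), L_tilde             # T_0 = I, T_1 = L~
    out = eta[0] * (Tk_prev @ x)
    if len(eta) > 1:
        out += eta[1] * (Tk @ x)
    for k in range(2, len(eta)):                 # recurrence T_k = 2 L~ T_{k-1} - T_{k-2}
        Tk_prev, Tk = Tk, 2.0 * L_tilde @ Tk - Tk_prev
        out += eta[k] * (Tk @ x)
    return out
```

With η = [1, 0, …, 0] the operation reduces to the identity (T_0 = I), which is a convenient sanity check; larger K mixes information from K-hop neighbourhoods of the weighted joint graph.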
Step S60, performing time-series modeling in the time dimension based on the spatial features of the skeleton sequence to be recognized, to obtain the behavior category of the skeleton sequence to be recognized:
The spatial feature information f_t of each frame obtained by the graph convolution operation is flattened through a fully connected layer and fed into a long short-term memory network for time-series modeling, and softmax classification is applied to obtain the final interactive behavior recognition result.
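A toy sketch of the temporal model of step S60: one hand-written LSTM cell unrolled over the frame features, followed by a softmax head. The weights are random and untrained, and all dimensions are illustrative assumptions, not values from the patent:

```python
import numpy as np

def _sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class TinyLSTM:
    """Single LSTM cell unrolled over time, plus a softmax classification head."""

    def __init__(self, in_dim, hid_dim, n_classes, seed=0):
        rng = np.random.default_rng(seed)
        self.H = hid_dim
        self.Wx = rng.normal(scale=0.1, size=(in_dim, 4 * hid_dim))   # input weights (i, f, g, o gates)
        self.Wh = rng.normal(scale=0.1, size=(hid_dim, 4 * hid_dim))  # recurrent weights
        self.b = np.zeros(4 * hid_dim)
        self.Wo = rng.normal(scale=0.1, size=(hid_dim, n_classes))    # softmax head

    def forward(self, xs):
        """xs: (T, in_dim) per-frame spatial features; returns class probabilities."""
        H = self.H
        h = np.zeros(H)
        c = np.zeros(H)
        for x in xs:
            z = x @ self.Wx + h @ self.Wh + self.b
            i, f = _sigmoid(z[:H]), _sigmoid(z[H:2 * H])
            g, o = np.tanh(z[2 * H:3 * H]), _sigmoid(z[3 * H:])
            c = f * c + i * g             # cell-state update
            h = o * np.tanh(c)            # hidden state carried to the next frame
        logits = h @ self.Wo
        p = np.exp(logits - logits.max())
        return p / p.sum()                # softmax over behavior classes
```

In a real pipeline the per-frame graph-convolution outputs would be flattened by a fully connected layer before entering the LSTM, and the whole stack would be trained end to end with a cross-entropy loss.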
The dataset used to validate the algorithm is as follows. The NTU RGB+D dataset is currently the largest skeleton-based behavior recognition dataset, with 56,000 sequences and 4 million frames covering 60 action classes, each skeleton having 25 joint points; it includes both single-person and two-person actions. This embodiment uses the 11 two-person interaction classes in NTU RGB+D as the dataset.
The evaluation protocols for this dataset are of two types: cross-subject (CS) and cross-view (CV). The proposed method is evaluated here using the CV criterion.
According to the CV evaluation criterion, data captured by cameras No. 2 and No. 3 are used for training, and data captured by camera No. 1 are used for testing. The final recognition rate is 88%, a remarkable recognition effect. The confusion matrix is shown in FIG. 5.
So far, the technical solutions of the present invention have been described in connection with the preferred embodiments shown in the drawings, but it is easily understood by those skilled in the art that the scope of the present invention is obviously not limited to these specific embodiments. Equivalent changes or substitutions of related technical features can be made by those skilled in the art without departing from the principle of the invention, and the technical scheme after the changes or substitutions can fall into the protection scope of the invention.

Claims (7)

1. A human skeleton behavior identification method based on a graph convolution network is characterized in that: the identification method comprises the steps of obtaining a double-person interactive skeleton video; carrying out normalization processing on the joint point coordinates of the obtained video; constructing a human body joint internal dependency graph, an individual external dependency graph and an interaction dependency graph; distributing different weights for the connecting edges of the three joint connecting graphs; sending the data into a graph convolution network for learning and extracting spatial features; sending the spatial characteristics obtained by each frame into a long-term and short-term memory network for time sequence modeling; and obtaining the identification result of the interactive behavior category.
2. The method for recognizing human skeleton behaviors based on the graph convolution network according to claim 1, wherein the method comprises the following steps: the identification method specifically comprises the following steps:
step S10, capturing a video: starting a camera, recording double interactive videos, collecting skeleton videos of various interactive actions of different action executors as training videos of the interactive actions, carrying out interactive action meaning marking on various training videos, and establishing a video training set;
step S20, normalizing the preset video frames in the obtained skeleton video to be used as a skeleton sequence to be identified;
step S30, for each frame of image in the skeleton sequence to be recognized, constructing a corresponding human body joint intrinsic dependence connection image according to joint point coordinates, wherein the joint points are nodes of the image, and natural connection among the joint points is an intrinsic dependence connection edge of the image; constructing respective external dependence connecting edges of a single person and interactive dependence connecting edges of two persons, and forming a human body joint connection graph of each frame of the skeleton sequence to be recognized together by the three;
step S40, respectively distributing weights for the edges of three joint connection graphs corresponding to each frame graph of the skeleton sequence to be recognized to obtain corresponding human joint connection graphs with different weight values;
step S50, carrying out image convolution operation on the human body joint connection image with different weight values corresponding to each frame of the skeleton sequence to be recognized to obtain the spatial characteristics of the skeleton sequence to be recognized;
and step S60, performing time sequence modeling on the time dimension based on the spatial characteristics of the skeleton sequence to be recognized, and obtaining the behavior category of the skeleton sequence to be recognized.
3. The method for recognizing human skeleton behaviors based on the graph convolution network as claimed in claim 2, wherein the method comprises the following steps: in the step S20, "normalize the preset video frames in the obtained skeleton video to obtain a skeleton sequence to be recognized", the method includes:
step S11, sampling the obtained original skeleton video at preset equal intervals to serve as the skeleton sequences for training and recognition;
step S12, performing rotation, translation and scale normalization on the joint point coordinates of each frame in the obtained skeleton sequence to obtain the skeleton sequence to be recognized, specifically:

x̂_t^i = R^(-1)(x_t^i − o_R), i ∈ J, t ∈ T

wherein x_t^i is the coordinate of the i-th joint point of the originally acquired frame t, J and T represent the sets of joint points and acquired frames, and x̂_t^i is the processed coordinate value;

the rotation matrix R and the rotation origin o_R are defined from two vectors: v_1, a vector perpendicular to the ground, and v_2, the difference vector between the left and right hip joints of the initial skeleton of each sequence; R is constructed from the projection of v_2 with respect to v_1 and the outer product v_1 × v_2 of the two vectors, and o_R is defined from x_1^lhip and x_1^rhip, the coordinates of the left and right hip joints of the initial skeleton of each sequence.
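The rotation/translation normalization of step S12 can be sketched in code. This is an illustrative reading only: it assumes the transform x̂ = R⁻¹(x − o_R), takes z as the vertical axis, places o_R at the midpoint of the initial skeleton's hips, and uses joint indices 0 and 1 for the hips — none of these specifics are fixed by the claim text.

```python
import math

def normalize_skeleton(frames, lhip=0, rhip=1):
    """Sketch of step S12: x_hat = R^-1 (x - o_R) for every joint of every frame.

    frames: list of frames, each a list of (x, y, z) joints; z is taken as the
    vertical axis and joints lhip/rhip as the hips (assumptions of this sketch).
    o_R is the midpoint of the first frame's hips; R^-1 rotates the initial hip
    difference vector v2 onto the x-axis within the ground plane.
    """
    lx, ly, lz = frames[0][lhip]
    rx, ry, rz = frames[0][rhip]
    ox, oy, oz = (lx + rx) / 2.0, (ly + ry) / 2.0, (lz + rz) / 2.0
    theta = math.atan2(ly - ry, lx - rx)        # heading of v2 = lhip - rhip
    c, s = math.cos(-theta), math.sin(-theta)   # inverse rotation about the vertical
    result = []
    for frame in frames:
        joints = []
        for x, y, z in frame:
            tx, ty, tz = x - ox, y - oy, z - oz  # translate to the rotation origin
            joints.append((c * tx - s * ty, s * tx + c * ty, tz))
        result.append(joints)
    return result
```

After this step the initial hips lie symmetrically about the origin on the x-axis, so every sequence starts from a canonical pose regardless of where the actors stood.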
4. The method for recognizing human skeleton behaviors based on the graph convolution network as claimed in claim 2, wherein in step S30, "for each frame of the skeleton sequence to be recognized, constructing the corresponding intrinsic-dependency connection graph of the human joints according to the joint point coordinates, the joint points being the nodes of the graph and the natural connections between the joint points being the intrinsic-dependency edges of the graph; constructing the extrinsic-dependency edges of each individual person and the interactive-dependency edges between the two persons, the three together forming the human joint connection graph of each frame of the skeleton sequence to be recognized" comprises:
regarding each frame of the two-person interaction as a whole, constructing a graph G(x, W) to model the human bodies in each frame, wherein x ∈ ℝ^(2N×3) contains the three-dimensional coordinates of the 2N joint points, and W is a 2N × 2N weighted adjacency matrix with the block structure

W = [ W_1     W_{1,2} ]
    [ W_{2,1} W_2     ]

wherein (w_{1,2})_{mn} = γ when joint point m of the first person and joint point n of the second person are connected by an interactive-dependency edge, and α, β, γ represent the weights of the corresponding intrinsic-dependency, extrinsic-dependency and interactive-dependency edges, respectively.
5. The method according to claim 2, wherein in step S40, "respectively assigning weights to the three kinds of edges of the joint connection graph corresponding to each frame of the skeleton sequence to be recognized, to obtain the corresponding human joint connection graph with different weight values" comprises:
setting α = 3, β = 1 and γ = 5, so as to emphasize the intrinsic connections within each body and, above all, the interactive connections between the two persons relative to the extrinsic connections.
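Assembling the weighted adjacency matrix of claims 4 and 5 can be sketched as follows. The two-block layout (one diagonal block per person, interactive edges in the off-diagonal blocks) is an assumption consistent with the (w_{1,2})_{mn} notation; the function name and edge-list representation are illustrative.

```python
def build_adjacency(n_joints, intrinsic, extrinsic, interactive,
                    alpha=3.0, beta=1.0, gamma=5.0):
    """Assemble the 2N x 2N weighted adjacency matrix W for two persons.

    Edge lists are (m, n) joint-index pairs. Intrinsic/extrinsic edges are
    placed in both persons' diagonal blocks with weights alpha/beta; an
    interactive edge links joint m of person 1 with joint n of person 2,
    i.e. (w_{1,2})_{mn} = gamma. Block layout is assumed, not claimed.
    """
    size = 2 * n_joints
    W = [[0.0] * size for _ in range(size)]
    for weight, edges in ((alpha, intrinsic), (beta, extrinsic)):
        for m, n in edges:
            for off in (0, n_joints):            # same edges for both persons
                W[m + off][n + off] = W[n + off][m + off] = weight
    for m, n in interactive:                     # cross-person block W_{1,2}
        W[m][n + n_joints] = W[n + n_joints][m] = gamma
    return W
```

With the claim's weights α = 3, β = 1, γ = 5, the interactive edges dominate the graph, which is what lets the convolution propagate information between the two interacting bodies.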
6. The method for recognizing human skeleton behaviors based on the graph convolution network as claimed in claim 2, wherein in step S50, "performing a graph convolution operation on the weighted human joint connection graph corresponding to each frame of the skeleton sequence to be recognized, to obtain the spatial features of the skeleton sequence to be recognized" comprises:
given a video of T frames, constructing the graphs [G_1, G_2, ..., G_T]; for each graph G_t constructed at frame t, inputting it into the graph convolution layer:

f_out = g_η * f_t

wherein * represents the graph convolution operation, g_η represents the graph convolution kernel, and W is the weighted adjacency matrix of the human joint connection graph;

the graph convolution kernel is computed as follows: the normalized graph Laplacian over the spectral domain is L = I_n − D^(−1/2) W D^(−1/2), where D is a diagonal matrix with D_ii = Σ_j w_ij; L is rescaled to L̃ = 2L/λ_max − I_n, wherein λ_max is the maximum eigenvalue of L and T_k is the k-th Chebyshev polynomial; the convolution operation can then be expressed as:

g_η * x = Σ_{k=0}^{K−1} η_k T_k(L̃) x

wherein η = [η_0, η_1, ..., η_{K−1}] are the training parameters and K is the size of the graph convolution kernel.
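The Chebyshev-polynomial graph convolution of claim 6 can be sketched on a single scalar feature per node. One simplification is assumed here: instead of computing the exact λ_max, the sketch uses the standard upper bound λ_max ≤ 2 for a normalized Laplacian, so L̃ = L − I_n; everything else follows the claim's formulas directly.

```python
def cheb_graph_conv(W, x, eta):
    """Sketch of g_eta * x = sum_{k=0}^{K-1} eta_k T_k(L~) x.

    W: weighted adjacency matrix (list of lists), x: one feature per node,
    eta: K trainable kernel coefficients. L = I - D^{-1/2} W D^{-1/2};
    with the lambda_max = 2 bound (an assumption), L~ = L - I reduces to
    -D^{-1/2} W D^{-1/2}. T_k follows the Chebyshev recurrence.
    """
    n = len(W)
    d = [sum(row) for row in W]                       # node degrees D_ii
    dinv = [1.0 / (di ** 0.5) if di > 0 else 0.0 for di in d]
    Lt = [[-dinv[i] * W[i][j] * dinv[j] for j in range(n)] for i in range(n)]

    def matvec(M, v):
        return [sum(M[i][j] * v[j] for j in range(n)) for i in range(n)]

    t_prev, t_cur = x[:], matvec(Lt, x)               # T_0(L~)x, T_1(L~)x
    out = [eta[0] * xi for xi in x]
    for k in range(1, len(eta)):
        out = [o + eta[k] * t for o, t in zip(out, t_cur)]
        # Chebyshev recurrence: T_{k+1}(L~)x = 2 L~ T_k(L~)x - T_{k-1}(L~)x
        t_next = [2 * a - b for a, b in zip(matvec(Lt, t_cur), t_prev)]
        t_prev, t_cur = t_cur, t_next
    return out
```

Because T_k(L̃) is a degree-k polynomial in the adjacency structure, a kernel of size K aggregates information from each joint's K−1-hop neighborhood in one layer.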
7. The method for recognizing human skeleton behaviors based on the graph convolution network as claimed in claim 2, wherein in step S60, "performing time-sequence modeling in the time dimension based on the spatial features of the skeleton sequence to be recognized, to obtain the behavior category of the skeleton sequence to be recognized" comprises:
flattening the spatial feature information f_t of each frame, obtained by the graph convolution operation, through a fully connected layer, feeding it into a long short-term memory network for time-sequence modeling, and classifying with softmax to obtain the final interactive-behavior recognition result.
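The temporal-modeling stage of claim 7 can be sketched with a toy, hidden-size-1 LSTM cell followed by a softmax head. This is a minimal sketch of the standard LSTM gate equations, not the patent's trained network; gate ordering, parameter names, and scalar sizes are all illustrative, and a real system would use a deep-learning framework's LSTM over the per-frame features f_t.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h, c, Wx, Wh, b):
    """One step of a size-1 LSTM cell over the per-frame feature stream.

    x: current input (the flattened spatial feature f_t of one frame);
    h, c: previous hidden and cell state. Wx, Wh, b hold one scalar per
    gate in the order (input, forget, candidate, output) - illustrative.
    """
    i = sigmoid(Wx[0] * x + Wh[0] * h + b[0])    # input gate
    f = sigmoid(Wx[1] * x + Wh[1] * h + b[1])    # forget gate
    g = math.tanh(Wx[2] * x + Wh[2] * h + b[2])  # candidate cell state
    o = sigmoid(Wx[3] * x + Wh[3] * h + b[3])    # output gate
    c_new = f * c + i * g
    h_new = o * math.tanh(c_new)
    return h_new, c_new

def softmax(scores):
    """Softmax head turning class scores into the final probabilities."""
    m = max(scores)                              # shift for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]
```

Running `lstm_step` across the frame sequence and applying `softmax` to a linear readout of the last hidden state mirrors the flatten-LSTM-softmax pipeline the claim describes.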
CN202010146319.0A 2020-03-05 2020-03-05 Human skeleton behavior recognition method based on graph convolution network Active CN111353447B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010146319.0A CN111353447B (en) 2020-03-05 2020-03-05 Human skeleton behavior recognition method based on graph convolution network

Publications (2)

Publication Number Publication Date
CN111353447A true CN111353447A (en) 2020-06-30
CN111353447B CN111353447B (en) 2023-07-04

Family

ID=71194272

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010146319.0A Active CN111353447B (en) 2020-03-05 2020-03-05 Human skeleton behavior recognition method based on graph convolution network

Country Status (1)

Country Link
CN (1) CN111353447B (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112329562A (en) * 2020-10-23 2021-02-05 江苏大学 Human body interaction action recognition method based on skeleton features and slice recurrent neural network
CN112668550A (en) * 2021-01-18 2021-04-16 沈阳航空航天大学 Double-person interaction behavior recognition method based on joint point-depth joint attention RGB modal data
CN113128425A (en) * 2021-04-23 2021-07-16 上海对外经贸大学 Semantic self-adaptive graph network method for human action recognition based on skeleton sequence
CN113283400A (en) * 2021-07-19 2021-08-20 成都考拉悠然科技有限公司 Skeleton action identification method based on selective hypergraph convolutional network
CN113792712A (en) * 2021-11-15 2021-12-14 长沙海信智能系统研究院有限公司 Action recognition method, device, equipment and storage medium

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150186713A1 (en) * 2013-12-31 2015-07-02 Konica Minolta Laboratory U.S.A., Inc. Method and system for emotion and behavior recognition
WO2017133009A1 (en) * 2016-02-04 2017-08-10 广州新节奏智能科技有限公司 Method for positioning human joint using depth image of convolutional neural network
CN107301370A (en) * 2017-05-08 2017-10-27 上海大学 A kind of body action identification method based on Kinect three-dimensional framework models
CN108304795A (en) * 2018-01-29 2018-07-20 清华大学 Human skeleton Activity recognition method and device based on deeply study
CN108764107A (en) * 2018-05-23 2018-11-06 中国科学院自动化研究所 Behavior based on human skeleton sequence and identity combination recognition methods and device
CN109376720A (en) * 2018-12-19 2019-02-22 杭州电子科技大学 Classification of motion method based on artis space-time simple cycle network and attention mechanism
CN110045823A (en) * 2019-03-12 2019-07-23 北京邮电大学 A kind of action director's method and apparatus based on motion capture
US20190251340A1 (en) * 2018-02-15 2019-08-15 Wrnch Inc. Method and system for activity classification
CN110197195A (en) * 2019-04-15 2019-09-03 深圳大学 A kind of novel deep layer network system and method towards Activity recognition
CN110222611A (en) * 2019-05-27 2019-09-10 中国科学院自动化研究所 Human skeleton Activity recognition method, system, device based on figure convolutional network
US20200042776A1 (en) * 2018-08-03 2020-02-06 Baidu Online Network Technology (Beijing) Co., Ltd. Method and apparatus for recognizing body movement

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
CHENYANG SI et al.: "An Attention Enhanced Graph Convolutional LSTM Network for Skeleton-Based Action Recognition", ARXIV *
LEI SHI et al.: "Skeleton-Based Action Recognition with Multi-Stream Adaptive Graph Convolutional Networks", ARXIV *
WU Xiaoying et al.: "Behavior recognition algorithm based on CNN and bidirectional LSTM", Computer Engineering and Design, no. 02 *
CAO Jiangtao et al.: "Two-person interaction behavior recognition based on fusion of whole-body and individual segmentation", Journal of Liaoning Petrochemical University, vol. 39, no. 06 *
DONG An et al.: "Skeleton-based behavior recognition with graph convolution", Modern Computer, no. 02 *
HAN Lili: "Research on human behavior recognition methods based on LSTM", pages 1 - 3 *

Also Published As

Publication number Publication date
CN111353447B (en) 2023-07-04

Similar Documents

Publication Publication Date Title
CN109344701B (en) Kinect-based dynamic gesture recognition method
CN106919920B (en) Scene recognition method based on convolution characteristics and space vision bag-of-words model
CN111353447B (en) Human skeleton behavior recognition method based on graph convolution network
CN108460356B (en) Face image automatic processing system based on monitoring system
CN108520226B (en) Pedestrian re-identification method based on body decomposition and significance detection
Berretti et al. Representation, analysis, and recognition of 3D humans: A survey
CN108596102B (en) RGB-D-based indoor scene object segmentation classifier construction method
CN109086754A (en) A kind of human posture recognition method based on deep learning
CN111339942A (en) Method and system for recognizing skeleton action of graph convolution circulation network based on viewpoint adjustment
WO2021218238A1 (en) Image processing method and image processing apparatus
CN110458235B (en) Motion posture similarity comparison method in video
CN111914643A (en) Human body action recognition method based on skeleton key point detection
Kovač et al. Frame–based classification for cross-speed gait recognition
CN109815923B (en) Needle mushroom head sorting and identifying method based on LBP (local binary pattern) features and deep learning
CN113011253B (en) Facial expression recognition method, device, equipment and storage medium based on ResNeXt network
CN114419732A (en) HRNet human body posture identification method based on attention mechanism optimization
CN114332911A (en) Head posture detection method and device and computer equipment
CN112906520A (en) Gesture coding-based action recognition method and device
CN114170537A (en) Multi-mode three-dimensional visual attention prediction method and application thereof
CN114663807A (en) Smoking behavior detection method based on video analysis
CN110348395B (en) Skeleton behavior identification method based on space-time relationship
CN112149528A (en) Panorama target detection method, system, medium and equipment
CN114330535A (en) Pattern classification method for learning based on support vector regularization dictionary
CN112270228A (en) Pedestrian re-identification method based on DCCA fusion characteristics
CN113011506A (en) Texture image classification method based on depth re-fractal spectrum network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant