CN111967379A - Human behavior recognition method based on RGB video and skeleton sequence - Google Patents

Human behavior recognition method based on RGB video and skeleton sequence Download PDF

Info

Publication number
CN111967379A
Authority
CN
China
Prior art keywords
local
decision
video
feature
skeleton
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010821378.3A
Other languages
Chinese (zh)
Other versions
CN111967379B (en)
Inventor
曹聪琦
李嘉康
李亚娟
张艳宁
郗润平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northwestern Polytechnical University
Original Assignee
Northwestern Polytechnical University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northwestern Polytechnical University filed Critical Northwestern Polytechnical University
Priority to CN202010821378.3A priority Critical patent/CN111967379B/en
Publication of CN111967379A publication Critical patent/CN111967379A/en
Application granted granted Critical
Publication of CN111967379B publication Critical patent/CN111967379B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/56 Extraction of image or video features relating to colour

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Molecular Biology (AREA)
  • Multimedia (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a human behavior recognition method based on RGB (red, green, blue) video and a skeleton sequence, belonging to the technical field of computer vision and pattern recognition, and comprising the following contents: first, the feature stream performs feature extraction on an input video clip to obtain a spatio-temporal feature map; second, the attention stream generates a skeleton-region heat map; third, spatio-temporal features of the skeleton regions are extracted through a bilinear operation; fourth, a local decision block is used to generate local decision results; and fifth, a decision fusion block fuses the local decision results to obtain a global decision result. The invention realizes decision fusion with two plug-and-play modules, a local decision block and a decision fusion block: the local decision block makes a separate decision on the spatio-temporal features of each key region, and the decision fusion block fuses all decision results to obtain the final decision result. The invention effectively improves the accuracy of behavior recognition on the Penn Action and NTU RGB+D datasets.

Description

Human behavior recognition method based on RGB video and skeleton sequence
Technical Field
The invention relates to the technical field of computer vision and pattern recognition, in particular to a human behavior recognition method based on RGB (red, green and blue) videos and a skeleton sequence.
Background
Human behavior recognition, a fundamental problem in computer vision, has attracted a great deal of attention in academia and industry. With the continuous development of intelligent computing technology, human action recognition has broad application prospects in daily life, for example intelligent surveillance, human-computer interaction, motion-sensing games and video retrieval. Human behavior recognition in video faces problems similar to object recognition in still images: both must deal with significant intra-class variation, background clutter and occlusion. However, video carries an additional temporal cue that images lack, and capturing this temporal information is a major difficulty.
There are two main ways of applying a Convolutional Neural Network (CNN) to video data. One is to apply an image-based 2D CNN directly to each frame of the video, but this only captures the visual appearance of the video. The other is a 3D CNN, whose convolution kernels are three-dimensional and can extract both spatial and temporal information, but the number of network parameters increases dramatically, which easily leads to overfitting.
The attention mechanism mimics the internal process of biological observation, i.e., a mechanism that aligns internal experience with external perception to increase the fineness of observation of a local region. An attention mechanism can obtain more detailed information about the object of interest and suppress other, useless information. Most networks extract key features using an attention mechanism, fuse these features into a global feature descriptor, and finally apply a global classifier to obtain the classification result. This feature-fusion approach has the following problems: 1. there are gaps between different feature spaces; 2. the fused global descriptor is high-dimensional, which requires more parameters for classification and easily leads to overfitting; 3. some behavior predictions need to comprehensively consider the decision results of multiple parts, such as state changes of objects and context. These problems seriously affect the performance of human behavior recognition.
Disclosure of Invention
Technical problem to be solved
In order to avoid the defects of the prior art, the invention provides a human body behavior identification method based on RGB video and a skeleton sequence.
Technical scheme
A human behavior recognition method based on RGB video and a skeleton sequence, adopting LD-Net, which comprises two streams, a feature stream and an attention stream, and two modules, a local decision block and a decision fusion block; the method is characterized by comprising the following steps:
step 1: the human behavior data set comprises two parts, video data and human skeleton position data; the data set to be processed is divided into a training set and a test set; let the video set of the training set and the test set be I = {I_1, I_2, ..., I_i, ..., I_V}, wherein V denotes the total number of videos in the training set and the test set and I_i denotes the ith video; let the set of video lengths of the training set and the test set be F = {F_1, F_2, ..., F_i, ..., F_V}, wherein F_i denotes the length of the ith video; let the set of human skeletons of the training set and the test set be J = {J_1, J_2, ..., J_i, ..., J_V}, wherein J_i denotes the set of human skeleton points corresponding to the ith video and has dimension F_i × Z × 2, F_i denoting the length of the ith video, Z denoting the number of human skeleton points in each frame, and 2 denoting the abscissa and the ordinate of each skeleton point position;
step 2: pre-allocating initial labels for all videos in the training set and the test set, defining the total number of behavior categories as K, and setting the initial label set as {y_i = k | 1 ≤ k ≤ K}, wherein y_i denotes the initial label of video I_i, i = 1, 2, ..., V;
step 3: preprocessing the training set and the test set, wherein the number of frames sent by each video to the network is S; the S frames of video data are scaled, randomly cropped and mean-value normalized to obtain video data of dimension S × 3 × 224 × 224 and human skeleton position data of dimension S × Z × 2;
step 4: sending the video data of dimension S × 3 × 224 × 224 into the feature stream, which uses MF-Net for feature extraction to obtain spatio-temporal feature maps after the MF-Net Conv4 and Conv5 blocks; the spatio-temporal feature map dimensions are denoted C, L, H, W, namely the number of channels, the length, the height and the width;
step 5: sending the human skeleton position data of dimension S × Z × 2 into the attention stream to obtain a heat map of the input video clip, whose dimensions are denoted M × L × H × W, namely the number of channels, the length, the height and the width; wherein M = N × L, N denotes the number of body joints in each frame and L denotes the length of the video clip; the heat map is realized by activating the points corresponding to the skeleton points, which is equivalent to assigning hard weights;
step 6: adjusting the heat map of size M × L × H × W into a 2D matrix A having M rows and L·H·W columns, and adjusting the C × L × H × W feature map derived from MF-Net into a 2D matrix B having C rows and L·H·W columns; the bilinear product is then represented as:
X = AB^T
wherein B^T is the transpose of B, the matrix X is the set of all skeleton point features, and its dimension is M × C;
step 7: sending the set X of skeleton point features obtained in step 6 into the local decision block, wherein the local decision block makes decisions on the spatio-temporal features of all local skeleton points through a fully connected layer to obtain local behavior classification results; the specific steps are as follows:
defining the total number of behavior classes as K, the matrix X obtained by the attention mechanism can be represented as:
X = [x_1; x_2; ...; x_i; ...; x_M], x_i ∈ R^(C×1)
wherein x_i represents the feature of the ith skeleton point and can be regarded as a local feature description of the target;
respectively training a linear classifier for each human skeleton point feature:
f_θi(x_i) = θ_i x_i + b_i
x_i ∈ R^(C×1), θ_i ∈ R^(K×C), b_i ∈ R, i ∈ [1, M]
wherein f_θi denotes the ith linear classifier with parameter θ_i, b_i may be included in θ_i, and F_θ denotes the set of linear classifiers;
according to F_θ, a set of decisions D = [d_1; d_2; ...; d_i; ...; d_M] is derived, expressed as:
d_i = f_θi(x_i), i ∈ [1, M]
wherein d_i represents the decision result of the ith skeleton point feature;
step 8: the decision fusion block fuses all local decisions obtained by the local decision block to obtain the final decision result; the specific steps are as follows:
generating a corresponding weight according to the conditioned-on-current-local-patch criterion, wherein the weight represents the importance degree of the corresponding local feature; a linear mapping g_β is used to implement this function, and the weights are then normalized by the sigmoid function, denoted by W:
W = [w_1; w_2; ...; w_i; ...; w_M],
w_i = sigmoid(g_β(x_i)) = 1 / (1 + e^(-g_β(x_i))),
g_β(x_i) = β x_i + b, x_i ∈ R^(C×1), β ∈ R^(1×C), b ∈ R, i ∈ [1, M]
wherein β and b are the parameters of the linear mapping g_β;
obtaining a global decision result according to the generated weights:
D = (1/M) Σ_{i=1..M} w_i d_i
locally supervising the local decision results obtained by the local decision block in step 7, wherein the loss function is as follows:
L_l = -(1/(U·M)) Σ_{i=1..U} Σ_{m=1..M} Σ_{j=1..K} y_ij log(d^m_ij)
wherein U represents the sample size, M represents the total number of skeleton points of an input video segment, y_ij represents the label at the jth position of the ith sample, and d^m_ij represents the value at the jth position of the ith sample in the decision result of the mth skeleton point feature;
globally supervising the global decision result obtained in step 8, wherein the loss function is as follows:
L_g = -(1/U) Σ_{i=1..U} Σ_{j=1..K} y_ij log(ŷ_ij)
wherein U represents the sample size, K represents the total number of behavior classes, y_ij represents the label at the jth position of the ith sample, and ŷ_ij represents the prediction result;
supervising both the local and the global decisions, wherein the loss function is as follows:
L = L_g + L_l
the initial learning rate of training is set to 0.005 and is multiplied by 0.1 after 20, 40 and 60 epochs, and a stochastic gradient descent (SGD) optimizer is used during training.
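For illustration only, the training schedule just described could be configured as in the following PyTorch-style sketch; the placeholder model, the momentum value and the number of epochs are assumptions and are not specified in the patent text:

import torch

# Placeholder module standing in for LD-Net; the real network is defined elsewhere.
ld_net = torch.nn.Linear(10, 5)

# SGD optimizer with the initial learning rate stated in the text (0.005);
# the momentum value 0.9 is an assumed, commonly used setting.
optimizer = torch.optim.SGD(ld_net.parameters(), lr=0.005, momentum=0.9)

# Multiply the learning rate by 0.1 after 20, 40 and 60 epochs.
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[20, 40, 60], gamma=0.1)

for epoch in range(80):        # 80 epochs is an illustrative choice
    # ... one full training pass (forward, loss, backward, optimizer.step()) goes here ...
    scheduler.step()           # learning-rate decay applied once per epoch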
Advantageous effects
The human behavior recognition method based on the RGB video and the skeleton sequence has the following beneficial effects:
(1) Existing networks fuse the features extracted by the attention mechanism to describe the current behavior and train a global classifier. The main problem of this feature-fusion approach is the gap between different feature spaces. In addition, the fused global descriptor is high-dimensional, which requires more parameters for classification and easily leads to overfitting. The decision-fusion method provided by the invention solves these problems: on the one hand, it aggregates decisions that have been projected into the same space; on the other hand, because the local features are low-dimensional and the classifiers can share parameters, the number of parameters is reduced. Decision fusion is also supported theoretically and experimentally in statistics and machine learning: ensemble methods can combine single classifiers to achieve better performance than any individual classifier.
(2) The invention provides a plug-and-play local decision-fusion structure for human behavior recognition, comprising a local decision block and a decision fusion block. The local decision block makes decisions based on the local spatio-temporal features to obtain local decisions, and the decision fusion block fuses all local decision results to obtain the final decision result. This structure makes full use of the local spatio-temporal features and fully considers the influence of local decisions on the recognition result, thereby effectively improving behavior recognition.
(3) The invention allows supervision to be added to both the local and the global decisions. The two supervision modes complement each other and facilitate the training of the model.
Drawings
FIG. 1 is a schematic view of the overall system flow of the present invention
FIG. 2 is an overall structure diagram of LD-Net proposed in the present invention
FIG. 3 is an MF-Net network framework for use in the present invention
FIG. 4 is a schematic view of the attention mechanism
FIG. 5 is a skeletal weight heat map for different behaviors
FIG. 6 is a comparison of confusion matrices on the Penn Action dataset for MF-Net and the present invention
FIG. 7 is an embodiment of skeletal weight in concrete behavior
Detailed Description
The invention will now be further described with reference to the following examples and drawings:
the invention provides a two-stream network structure which comprises two modules, namely a local decision block and a decision fusion block, and the two modules are called LD-Net. One LD-Net stream is a feature stream, and a Multi-Fiber network (MF-Net) is selected to extract the space-time features of the video segments. As the MF-Net is a multi-fiber structure network, the parameter quantity of the three-dimensional network can be effectively reduced, and overfitting is avoided. The MF-Net network framework is shown in fig. 3. The other stream is an attention stream, and the corresponding positions of human skeletal points are taken as attention (attention) areas. Because the skeleton point information reflects the posture characteristics of the human body, and simultaneously, useless information about the target is greatly eliminated. For the extracted key regional characteristics, the extracted key regional characteristics are not directly fused and then are decided, but are fused locally. The invention realizes Decision fusion by using two plug-and-play modules, namely the local Decision block and the Decision fusion block. And (4) the local decision block makes decisions on the space-time characteristics of each key region respectively. And fusing all Decision results by the Decision fusion block to obtain a final Decision result. The invention effectively improves the accuracy of behavior recognition on the Penn Action and NTU RGB + D data sets.
The technical scheme of the invention comprises seven stages. In the first stage, the feature stream performs feature extraction on the input video clip to obtain a spatio-temporal feature map. In the second stage, the attention stream generates the skeleton-region heat map. In the third stage, spatio-temporal features of the skeleton regions are extracted through a bilinear operation. In the fourth stage, the local decision block is used to generate local decision results. In the fifth stage, the decision fusion block fuses the local decision results to obtain a global decision result. The sixth stage concerns the loss functions used to train the network. The seventh stage is the analysis of the experimental results. The specific steps are as follows:
1) feature stream performs Feature extraction on video clips to obtain a space-time Feature map
The feature stream uses the MF-Net network framework, which effectively reduces the number of parameters by using a multi-fiber structure and multiplexer modules. The invention uses MF-Net to obtain the spatio-temporal feature map of the input data. The details are as follows:
(a) Data preprocessing. The invention saves the video data as pictures, which reduces the time the network spends reading data. When the frames are saved as pictures, each frame is scaled as a whole so that its shorter side becomes 256 pixels.
(b) Ensuring the robustness of the trained model. The training set is augmented by mirroring, and in each training iteration image patches of size S × 3 × 224 × 224 are randomly cropped from the data, where S is the number of frames in the video segment.
(c) The spatio-temporal feature map dimensions produced by the MF-Net network are denoted C, L, H, W, namely the number of channels, the length, the height and the width. The invention obtains spatio-temporal feature maps after the MF-Net Conv4 and Conv5 blocks, respectively.
2) Attention stream generation of skeletal region heat maps
The attention stream obtains a heat map for each skeleton point according to the position information of the human skeleton. The specific details are as follows: the invention obtains the skeleton-region heat map of the input video clip from the annotated skeleton positions, and the heat map has the same spatio-temporal dimensions as the spatio-temporal feature map. Its dimensions are denoted M, L, H, W, namely the number of channels, the length, the height and the width, where M = N × L, N denotes the number of body joints in each frame and L denotes the length of the video segment. Joint-guided feature selection is realized by activating the heat map at the points corresponding to the body joints, which is equivalent to assigning hard weights: a weight of 1 is given to the positions corresponding to skeleton points, and a weight of 0 to all other positions. This step produces two heat maps whose dimensions equal those of the spatio-temporal feature maps after the MF-Net Conv4 and Conv5 blocks, respectively.
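A minimal sketch of this hard-weighted heat-map construction is given below; it is an illustrative NumPy example, and the tensor names, the channel ordering (joint index times frame index) and the rounding of joint coordinates to pixel indices are assumptions rather than details fixed by the text:

import numpy as np

def skeleton_heatmaps(joints, L, H, W):
    """Build hard-weighted heat maps from 2D joint positions.

    joints: array of shape (L, N, 2) holding (x, y) joint coordinates already
            scaled to the spatial size (W, H) of the feature map.
    Returns an array of shape (M, L, H, W) with M = N * L: each channel is 1 at
    its joint's location in its frame and 0 everywhere else (a hard weight).
    """
    N = joints.shape[1]
    heat = np.zeros((N * L, L, H, W), dtype=np.float32)
    for t in range(L):                      # frame index
        for n in range(N):                  # joint index
            x, y = joints[t, n]
            col = int(round(float(np.clip(x, 0, W - 1))))
            row = int(round(float(np.clip(y, 0, H - 1))))
            heat[n * L + t, t, row, col] = 1.0
    return heat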
3) Bilinear operation
Through the above operations, the method obtains the spatio-temporal feature map and the skeleton-region heat map of the input data; a bilinear operation is performed on the spatio-temporal feature map and the heat map to extract the feature corresponding to each skeleton point. The specific process is as follows: first, the heat map of size M × L × H × W is adjusted into a 2D matrix A with M rows and L·H·W columns; similarly, the C × L × H × W feature map derived from MF-Net is adjusted into a 2D matrix B with C rows and L·H·W columns. Then the bilinear product can be expressed as:
X = AB^T
where B^T is the transpose of B; X is a matrix of size M × C and is the set of all skeleton point features.
Bilinear operations are performed on the spatio-temporal feature maps and heat maps obtained after the MF-Net Conv4 and Conv5 blocks, respectively, yielding the features corresponding to all skeleton points for each block; finally, the skeleton spatio-temporal features corresponding to these two different blocks are fused to obtain a single representation of the skeleton spatio-temporal features. Because the features after different blocks complement each other's spatio-temporal information, the accuracy of behavior recognition is improved.
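The bilinear extraction of per-joint features can be sketched as follows; this is a minimal NumPy illustration of X = A B^T under the shapes defined above, and the concatenation used to fuse the conv4 and conv5 features is only one plausible choice, not a detail fixed by the text:

import numpy as np

def bilinear_joint_features(heat, feat):
    """Extract per-skeleton-point features via the bilinear product X = A B^T.

    heat: skeleton-region heat map of shape (M, L, H, W)
    feat: spatio-temporal feature map of shape (C, L, H, W)
    Returns X of shape (M, C): one C-dimensional feature per skeleton point.
    """
    M = heat.shape[0]
    C = feat.shape[0]
    A = heat.reshape(M, -1)    # M x (L*H*W)
    B = feat.reshape(C, -1)    # C x (L*H*W)
    return A @ B.T             # M x C

# Fusing the conv4 and conv5 results; concatenating along the feature axis is
# only one plausible choice, the patent text does not fix the fusion operator.
# X4 = bilinear_joint_features(heat4, feat4)
# X5 = bilinear_joint_features(heat5, feat5)
# X = np.concatenate([X4, X5], axis=1)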
4) The local decision block obtains local decision results
The invention makes decisions on all local skeleton point features separately to obtain local behavior classification results. With the total number of behavior classes defined as K, the matrix X obtained from the attention mechanism can be expressed as:
X = [x_1; x_2; ...; x_i; ...; x_M], x_i ∈ R^(C×1)
where x_i represents the feature of the ith skeleton point and can be regarded as a local feature description of the target.
The idea of the invention is to predict the behavior probability of each local feature with a linear mapping, which can be viewed as multiple weak classifiers making decisions. The invention provides two linear classifier schemes: one trains a single parameter-shared linear classifier for all local features, and the other trains a separate linear classifier for each human skeleton point feature. The invention uses F_θ to represent the set of linear classifiers.
The first scheme:
f_θ(x_i) = θ x_i + b
x_i ∈ R^(C×1), θ ∈ R^(K×C), b ∈ R, i ∈ [1, M]
where f_θ denotes the linear classifier with parameter θ; b may be included in θ.
The second scheme:
f_θi(x_i) = θ_i x_i + b_i
x_i ∈ R^(C×1), θ_i ∈ R^(K×C), b_i ∈ R, i ∈ [1, M]
where f_θi denotes the ith linear classifier with parameter θ_i; b_i may be included in θ_i.
Whichever scheme is used, a set of decisions D = [d_1; d_2; ...; d_i; ...; d_M] is derived, expressed as:
d_i = f(x_i), i ∈ [1, M]
where d_i represents the decision result of the ith skeleton point feature and f is the linear classifier of the chosen scheme. Through this operation, the invention obtains a series of local decision results.
Subsequent experiments show that the second scheme introduces a large number of parameters, which causes overfitting and reduces the recognition accuracy. Therefore the first scheme is chosen, and all skeleton point features are decided using a linear classifier with shared parameters.
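A minimal PyTorch-style sketch of such a shared-parameter local decision block is shown below; the class and tensor shapes are illustrative assumptions, not code from the patent:

import torch
import torch.nn as nn

class LocalDecisionBlock(nn.Module):
    """Applies one shared K x C linear classifier to every local skeleton-point feature."""

    def __init__(self, feat_dim, num_classes):
        super().__init__()
        self.classifier = nn.Linear(feat_dim, num_classes)  # shared by all M local features

    def forward(self, x):
        # x: (batch, M, C) skeleton-point features from the bilinear step
        # returns (batch, M, K) local decision scores, one decision per skeleton point
        return self.classifier(x)

# Usage sketch: local_decisions = LocalDecisionBlock(feat_dim=C, num_classes=K)(X)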
5) Fusing the local decisions to obtain a global decision result
Through the local decision block, all local decision results are obtained. Fusing all these decision results is the problem the decision fusion block has to solve. The invention provides two fusion methods: one sums and averages the decision results to obtain the final decision result; the other generates corresponding weights according to the conditioned-on-current-local-patch criterion, multiplies each weight by the corresponding decision result, and then sums and averages them to obtain the global decision result.
The first fusion method can be expressed as:
D = (1/M) Σ_{i=1..M} d_i
The second fusion method first generates corresponding weights according to the conditioned-on-current-local-patch criterion, where each weight represents the importance of the corresponding local feature. The invention uses a linear mapping g_β to implement this function and then normalizes the weights with the sigmoid function, denoted by W:
W = [w_1; w_2; ...; w_i; ...; w_M],
w_i = sigmoid(g_β(x_i)) = 1 / (1 + e^(-g_β(x_i))),
g_β(x_i) = β x_i + b, x_i ∈ R^(C×1), β ∈ R^(1×C), b ∈ R, i ∈ [1, M]
where β and b are the parameters of the linear mapping g_β; likewise, b may be included in β.
Finally, the invention obtains the global decision result according to the generated weights:
D = (1/M) Σ_{i=1..M} w_i d_i
it is known that the amplitude of skeletal changes of different behaviors of a human body is different, and the importance of the skeletal changes on behavior recognition is different. For example, the variation range of the bone points of the upper half part of the human body is larger in the push-up mode; the change range of the bone points of the lower half part of the human body is larger when people play football. This characteristic is proved in the experimental results of the invention, and the use of the second fusion method is more beneficial to human behavior recognition.
6) Loss functions used to train the network
The LD-Net provided by the invention produces both local and global decision results, so the network can be supervised locally, globally, or with both combined.
(a) Global supervision. The invention supervises only the global decision, and the loss function is as follows:
L_g = -(1/U) Σ_{i=1..U} Σ_{j=1..K} y_ij log(ŷ_ij)
where U represents the sample size, K represents the total number of behavior classes, y_ij represents the label at the jth position of the ith sample, and ŷ_ij represents the prediction result.
(b) Local supervision. The invention supervises only the local decisions, and the loss function is as follows:
L_l = -(1/(U·M)) Σ_{i=1..U} Σ_{m=1..M} Σ_{j=1..K} y_ij log(d^m_ij)
where M represents the total number of skeleton points of an input video segment and d^m_ij represents the value at the jth position of the ith sample in the decision result of the mth skeleton point feature.
(c) Global + local supervision. The invention supervises both the local and the global decisions, and the loss function is as follows:
L = L_g + L_l
The present invention sets the initial learning rate of training to 0.005 and multiplies it by 0.1 after 20, 40 and 60 epochs. A stochastic gradient descent (SGD) optimizer is used in the training process.
7) Analysis of Experimental results
The invention performs experiments on the Penn Action and NTU RGB + D datasets.
(1) Experiments on the Penn Action dataset
Penn Action dataset: 2326 video sequences covering 15 action classes. The video lengths range from 18 to 663 frames. The dataset provides 13 human skeleton point annotations per frame, although some skeleton points are not visible in the videos. The dataset contains 1258 training videos and 1068 test videos.
<1> Decision modules: in this subsection, the invention compares the basic MF-Net with networks to which the local decision block and the decision fusion block have been added.
Table 1. Recognition accuracy of the basic MF-Net and of networks using the local decision block and the decision fusion block. conv4 and conv5 denote the spatio-temporal feature maps of MF-Net after the conv4 and conv5 modules, respectively. L denotes the local decision block, D the decision fusion block, S a parameter-shared linear classifier, NS a linear classifier without parameter sharing, W decision fusion with self-learned weights, and NW the average fusion strategy.
Method                               Accuracy
MF-Net                               0.945
MF-Net(conv4)+L(S)+D(NW)             0.928
MF-Net(conv5)+L(S)+D(NW)             0.965
MF-Net(conv5)+L(S)+D(W)              0.973
MF-Net(conv5)+L(NS)+D(NW)            0.932
MF-Net(conv5)+L(NS)+D(W)             0.941
As can be seen from Table 1, the decision modules used after the conv5 block of the MF-Net network perform better than those after the conv4 block, because the spatio-temporal features extracted after the conv5 block are more discriminative. The experimental results show that the network with the added modules performs better than the basic MF-Net, raising the accuracy from 94.5% to 96.5%. When the weight-based decision-averaging strategy is used, the accuracy further improves from 96.5% to 97.3%, because the weight-based strategy derives a set of learnable weights from the local information and can thus focus more on important local information. On the other hand, when a linear classifier without parameter sharing is used, the recognition accuracy drops noticeably compared with the basic MF-Net network. The reason may be that each local feature then corresponds to its own classifier, so the number of network parameters increases significantly and causes overfitting.
FIG. 5 visualizes the local decision weights for four different behaviors as heat maps. (a) shows that for the push-up behavior the network focuses more on the elbow, wrist and shoulder regions. (b) shows that for the tennis-forehand behavior the network focuses more on upper-body regions such as the wrist and elbow. (c) shows that for the jump-rope behavior the network focuses more on lower-body regions such as the knee, wrist and ankle. (d) shows that for the sit-up behavior the network focuses more on the shoulder, hip, wrist and knee regions.
<2> Comparison of different feature fusion strategies: in this subsection, the invention compares two different fusion strategies. One fuses the features and then makes a decision; the other fuses the local decision results. For feature fusion, the invention uses the JDD approach with MF-Net, which fuses the features under the guidance of the skeleton points. JDD has two feature fusion strategies: one integrates all local features directly; the other first integrates the features of the same time step and then aggregates the features along the temporal dimension using max+min pooling.
Table 2. Recognition accuracy of different fusion methods. * denotes the use of max+min pooling in the temporal dimension.
Method                               Accuracy
MF-Net                               0.945
MF-Net(conv5)+JDD                    0.953
MF-Net(conv5)+JDD*                   0.960
MF-Net(conv5)+L(S)+D(NW)             0.965
MF-Net(conv5)+L(S)+D(W)              0.973
As can be seen from Table 2, using the first feature-fusion strategy raises the accuracy from 94.5% to 95.3%. When the max+min pooling is used, the accuracy further improves to 96.0%. However, the accuracy of the feature-fusion method remains lower than that of the decision-fusion method provided by the invention, which proves the effectiveness of the proposed method.
<3> Fusing decisions from different layers: in this subsection, the invention uses the spatio-temporal feature maps after different MF-Net layers to extract local features, makes local decisions, and finally fuses the local decision results to obtain the final decision result.
Table 3. Recognition accuracy of local decision fusion over different layers.
Method                               Accuracy
MF-Net                               0.945
MF-Net(conv4)+L(S)+D(NW)             0.928
MF-Net(conv4)+L(S)+D(W)              0.936
MF-Net(conv5)+L(S)+D(NW)             0.965
MF-Net(conv5)+L(S)+D(W)              0.973
MF-Net(conv4+conv5)+L(S)+D(NW)       0.977
MF-Net(conv4+conv5)+L(S)+D(W)        0.982
As can be seen from Table 3, fusing the local decisions made after the conv5 block works better than fusing those made after the conv4 block, possibly because higher-level semantic information is more favorable for behavior recognition. When the local decision results after the conv4 and conv5 blocks are integrated, the recognition accuracy improves further, because the information from different layers is complementary.
<4> different loss functions were used: in this subsection, the present invention uses different loss functions for the network.
Table 4 identification accuracy using different loss functions. GS denotes global supervision. LS denotes local supervision.
As can be seen from table 4, when the network uses only local supervision, its accuracy is lower than using global supervision. This is because global supervision can directly optimize the final objective function. When the network uses both local and global supervision, the accuracy is further improved. This reflects that local supervision can improve the accuracy of local classification, and cooperates with global supervision to improve the recognition accuracy. When the network fuses the local decision results after the conv4 and conv5 blocks and uses local and global supervision simultaneously in the training process, the accuracy rate reaches 98.4%.
<5> Robustness analysis: in this subsection, the invention assesses the impact of the accuracy of human joint positions on the proposed model. The AlphaPose algorithm is used to generate estimated human skeleton points; it predicts the positions of 17 skeleton points for each person: nose, left eye, right eye, left ear, right ear, left shoulder, right shoulder, left elbow, right elbow, left wrist, right wrist, left hip, right hip, left knee, right knee, left ankle and right ankle. For the Penn Action dataset, the 13 annotated skeleton points are head, left shoulder, right shoulder, left elbow, right elbow, left wrist, right wrist, left hip, right hip, left knee, right knee, left ankle and right ankle. To match the estimated skeleton points to the 13 annotated ones, the estimated nose point is taken as the head point. The average L1 distance errors of the estimated skeleton points in width and height are 54.03 and 28.31, respectively, corresponding to error ratios of (0.11, 0.09) relative to the frame size. The invention is compared with other deep networks that incorporate pose information, such as P-CNN, JDD and two-stream bilinear C3D. The experimental results are shown in Table 5.
Table 5. Effect of estimated versus annotated skeleton points on the Penn Action dataset.
Method                                        Annotated  Estimated  Error
P-CNN                                         0.977      0.953      0.024
JDD(conv5b)                                   0.943      0.874      0.069
JDD(conv5b+conv4b)                            0.957      0.893      0.064
JDD(conv5b+conv4b)*                           0.981      0.938      0.043
two-stream bilinear C3D                       0.943      0.926      0.017
two-stream bilinear C3D*                      0.971      0.953      0.018
MF-Net(conv5)+L(S)+D(NW)+GS                   0.965      0.955      0.010
MF-Net(conv5)+L(S)+D(W)+GS                    0.973      0.961      0.012
MF-Net(conv4+conv5)+L(S)+D(W)+LS+GS           0.984      0.969      0.015
As can be seen from Table 5, the method proposed by the invention is clearly superior to the other methods on the Penn Action dataset, especially when estimated human skeleton points are used. When the positions of the human skeleton points are inaccurate, the accuracy of existing pose-based methods drops rapidly. The proposed method uses the idea of ensemble learning and makes a comprehensive decision from multiple pieces of local information; it achieves the best performance with both annotated and estimated skeleton points. In addition, its accuracy drops by the smallest margin, which proves that the method is robust to errors in the estimated positions of human skeleton points.
<6> Comparison with other state-of-the-art methods: in this subsection, the experimental results of the proposed method are compared with those of other state-of-the-art methods, as shown in Table 6.
Table 6. Experimental results of the invention compared with other state-of-the-art methods. Decision Aggregation denotes MF-Net(conv4+conv5)+L(S)+D(W)+LS+GS.
In Table 6, the methods are classified into three groups. The methods of the first part are based on video features, those of the second part on pose features, and those of the third part on both video and pose features. In the first part, IDT-FV encodes dense trajectories using Fisher vectors and performs better than DT and STIP. Compared with C3D, MF-Net effectively reduces the number of parameters and further improves the accuracy by adopting group convolutions. MGN combines local and global information extracted from the video and raises the accuracy to 95.5%.
In the second part, Action Bank consists of many individual behavior detectors sampled widely in semantic space and viewpoint space. Actemes uses skeleton point labels (e.g., locations) in a data-driven training process to find regions in which spatio-temporal skeleton points are highly clustered. ACPS is a graph-structured model that uses high-level behavior information to incorporate higher-order part dependencies.
In the third part, methods that use pose information to guide behavior recognition can reduce redundant information and further improve the recognition accuracy. ACPS+IDT-FV combines ACPS and IDT-FV and raises the accuracy to 92.9%. MST-AOG and ST-AOG use spatio-temporal and-or graphs for behavior recognition. P-CNN [34] benefits from taking optical flow images as an additional input and from cropping the optical flow and RGB images into patches under the guidance of human skeleton points. JDD, two-stream bilinear C3D and the proposed method all use the human skeleton-point regions in the feature map as attention regions to guide spatio-temporal feature extraction. However, JDD and two-stream bilinear C3D make decisions after fusing the spatio-temporal features: JDD(conv5b+conv4b) performs feature aggregation with max+min pooling in the temporal dimension and then classifies with a support vector machine. The method provided by the invention instead uses a parameter-shared linear classifier to make decisions on the local spatio-temporal features separately and finally performs weight-based average aggregation of the local decision results; it is therefore end-to-end trainable and achieves state-of-the-art performance with annotated skeleton points. RPAN uses a pose attention mechanism to adaptively learn pose-related features and reaches an accuracy of 97.4% without annotated skeleton points. Pose+MD-fusion classifies by combining pose, spatial and motion feature maps and further improves the accuracy to 97.6% without requiring annotated skeleton points. RPAN and Pose+MD-fusion perform better than the proposed method when tested without annotated skeleton points because both take optical flow images as additional input, whereas the proposed method does not use optical flow information.
FIG. 6 visualizes the confusion matrices obtained by MF-Net and by Decision Aggregation, where Decision Aggregation denotes MF-Net(conv4+conv5)+L(S)+D(W)+LS+GS. On the Penn Action dataset, some behaviors are easily confused, such as clean-and-jerk and squat, or tennis forehand and tennis serve, because they have similar appearance and motion information. With the basic MF-Net these similar actions are recognized noticeably less accurately than the other actions. With the method provided by the invention, more comprehensive and more accurate judgments can be made, and the performance is superior to that of the MF-Net network.
(2) Experiments on NTU RGB + D dataset
NTU RGB+D dataset: it consists of 56,880 action samples and provides RGB video, a depth map sequence, 3D skeleton data and infrared video for each sample. The 3D skeleton data contain the three-dimensional positions of 25 major body joints per frame. The NTU dataset contains 60 action classes in total, of which 49 are performed by a single person and the rest by multiple persons. The invention does not require the depth map sequences or the infrared video when using this dataset. The 3D skeleton data of each sample are converted into 2D skeleton data. The training and test sets are divided according to two criteria, CS (cross-subject) and CV (cross-view).
<1> Comparison with other state-of-the-art methods: in this subsection, the experimental results of the proposed method are compared with those of other state-of-the-art methods, as shown in Table 7.
Table 7 experimental results of the method proposed by the present invention compared with other methods.
In Table 7, the methods are also classified into three groups. The methods of the first part are based on video features, those of the second part on pose features, and those of the third part on both video and pose features. Note that the methods shown in italics use pose information estimated by a vision-based method rather than pose information directly output by the Kinect.
In the first part, TSN is a two-stream network based on long-range temporal structure. MF-Net splits a complex network into an ensemble of lightweight networks. DA-Net combines the classification scores of each view with a view classifier and effectively improves the recognition accuracy.
In the second part, Lie Group models the three-dimensional geometric relationships between different body parts in 3D space. HBRNN uses hierarchical RNNs for behavior recognition, taking five body parts rather than the whole skeleton as input.
Compared with Lie Group, its accuracy is clearly improved. Part-aware LSTM models long-range temporal correlations for the features of each body part. Trust Gate ST-LSTM analyzes the motion information in the data in the spatio-temporal domain using a spatio-temporal LSTM. STA-LSTM uses spatial and temporal attention models to focus on the more discriminative skeleton points. VA-LSTM automatically adjusts the viewpoint with a view-adaptation scheme. DS-LSTM captures the temporal connections in the skeleton sequence. 3scale ResNet152 maps the skeleton information to color images and models them with inputs and networks at three different scales. (P+C) Net rearranges the inputs with a permutation network and designs a classification network with gated convolutions to improve learning. ST-GCN automatically learns spatial and temporal patterns from the data with a graph convolutional network. PB-GCN divides the skeleton graph into four sub-graphs and learns a recognition model with a part-based GCN.
In the third part, DSSCA-SSLM is a deep auto-encoder based on a shared-specific feature decomposition network that combines RGB and skeleton sequences. STA-Hands uses a spatio-temporal attention mechanism to focus on the important human hands and to detect discriminative moments in the behavior. PSTA is a two-stream approach in which the pose stream follows the topology of the human body and the RGB stream is processed by a spatio-temporal soft attention mechanism. The attention mechanisms of STA-Hands and PSTA are mainly concerned with the hand regions of the person. The method provided by the invention not only focuses on the hand regions but also considers the key regions of the human body on the basis of the learned weights, and can therefore analyze human behavior comprehensively. Compared with STA-Hands, the accuracy of CS and CV is improved by 8% and 5.6%, respectively; compared with PSTA, by 5.7% and 3.6%. CNN+RNN is also a two-stream structure, in which one stream models the skeleton information and the other extracts features from the RGB frames; finally the features of the two streams are fused and classified with a support vector machine, and the classification accuracy of CV reaches 93.6%. Deep Bilinear uses bilinear blocks to aggregate cube features from different modalities. CentralNet makes a decision for each modality, with a CS accuracy of 89.3%. Chained Network computes and integrates visual cues using Markov chain models. 2D/3D Multi-task and Multi-task Deep Learning estimate the body pose from the image with a multi-task framework and recognize the behavior from the video sequence; Multi-task Deep Learning is an extension of 2D/3D Multi-task. 2D/3D Multi-task predicts pose and behavior sequentially, whereas Multi-task Deep Learning predicts and optimizes them in parallel, improving the CS accuracy from 85.5% to 89.9%. In summary, CNN+RNN and Deep Bilinear fuse multimodal features and then make decisions, while CentralNet, Chained Network, 2D/3D Multi-task and Multi-task Deep Learning make decisions based on the features of each modality and then use these decisions to obtain the final prediction. Compared with these methods, the proposed method extracts the local RGB spatio-temporal features with the help of the skeleton information without sending the skeleton information to an additional network for feature extraction, then makes decisions on the local features with a parameter-shared linear classifier, and finally fuses the local decision results with a weighted-average method to obtain the final decision result. The proposed method fully considers the local information and fuses the local decisions through the proposed decision strategy, thereby achieving the best performance.

Claims (2)

1. A human behavior recognition method based on RGB video and a skeleton sequence, adopting LD-Net, which comprises two streams, a feature stream and an attention stream, and two modules, a local decision block and a decision fusion block; the method is characterized by comprising the following steps:
step 1: the human behavior data set comprises two parts, video data and human skeleton position data; the data set to be processed is divided into a training set and a test set; let the video set of the training set and the test set be I = {I_1, I_2, ..., I_i, ..., I_V}, wherein V denotes the total number of videos in the training set and the test set and I_i denotes the ith video; let the set of video lengths of the training set and the test set be F = {F_1, F_2, ..., F_i, ..., F_V}, wherein F_i denotes the length of the ith video; let the set of human skeletons of the training set and the test set be J = {J_1, J_2, ..., J_i, ..., J_V}, wherein J_i denotes the set of human skeleton points corresponding to the ith video and has dimension F_i × Z × 2, F_i denoting the length of the ith video, Z denoting the number of human skeleton points in each frame, and 2 denoting the abscissa and the ordinate of each skeleton point position;
step 2: pre-allocating initial labels for all videos in the training set and the test set, defining the total number of behavior categories as K, and setting the initial label set as {y_i = k | 1 ≤ k ≤ K}, wherein y_i denotes the initial label of video I_i, i = 1, 2, ..., V;
step 3: preprocessing the training set and the test set, wherein the number of frames sent by each video to the network is S; the S frames of video data are scaled, randomly cropped and mean-value normalized to obtain video data of dimension S × 3 × 224 × 224 and human skeleton position data of dimension S × Z × 2;
step 4: sending the video data of dimension S × 3 × 224 × 224 into the feature stream, which uses MF-Net for feature extraction to obtain spatio-temporal feature maps after the MF-Net Conv4 and Conv5 blocks; the spatio-temporal feature map dimensions are denoted C, L, H, W, namely the number of channels, the length, the height and the width;
step 5: sending the human skeleton position data of dimension S × Z × 2 into the attention stream to obtain a heat map of the input video clip, whose dimensions are denoted M × L × H × W, namely the number of channels, the length, the height and the width; wherein M = N × L, N denotes the number of body joints in each frame and L denotes the length of the video clip; the heat map is realized by activating the points corresponding to the skeleton points, which is equivalent to assigning hard weights;
step 6: adjusting the heat map of size M × L × H × W into a 2D matrix A having M rows and L·H·W columns, and adjusting the C × L × H × W feature map derived from MF-Net into a 2D matrix B having C rows and L·H·W columns; the bilinear product is then represented as:
X = AB^T
wherein B^T is the transpose of B, the matrix X is the set of all skeleton point features, and its dimension is M × C;
step 7: sending the set X of skeleton point features obtained in step 6 into the local decision block, wherein the local decision block makes decisions on the spatio-temporal features of all local skeleton points through a fully connected layer to obtain local behavior classification results; the specific steps are as follows:
defining the total number of behavior classes as K, the matrix X obtained by the attention mechanism can be represented as:
X = [x_1; x_2; ...; x_i; ...; x_M], x_i ∈ R^(C×1)
wherein x_i represents the feature of the ith skeleton point and can be regarded as a local feature description of the target;
respectively training a linear classifier for each human skeleton point feature:
f_θi(x_i) = θ_i x_i + b_i
x_i ∈ R^(C×1), θ_i ∈ R^(K×C), b_i ∈ R, i ∈ [1, M]
wherein f_θi denotes the ith linear classifier with parameter θ_i, b_i may be included in θ_i, and F_θ denotes the set of linear classifiers;
according to F_θ, a set of decisions D = [d_1; d_2; ...; d_i; ...; d_M] is derived, expressed as:
d_i = f_θi(x_i), i ∈ [1, M]
wherein d_i represents the decision result of the ith skeleton point feature;
step 8: the decision fusion block fuses all local decisions obtained by the local decision block to obtain the final decision result; the specific steps are as follows:
generating a corresponding weight according to the conditioned-on-current-local-patch criterion, wherein the weight represents the importance degree of the corresponding local feature; a linear mapping g_β is used to implement this function, and the weights are then normalized by the sigmoid function, denoted by W:
W = [w_1; w_2; ...; w_i; ...; w_M],
w_i = sigmoid(g_β(x_i)) = 1 / (1 + e^(-g_β(x_i))),
g_β(x_i) = β x_i + b, x_i ∈ R^(C×1), β ∈ R^(1×C), b ∈ R, i ∈ [1, M]
wherein β and b are the parameters of the linear mapping g_β;
obtaining a global decision result according to the generated weights:
D = (1/M) Σ_{i=1..M} w_i d_i
2. the human behavior recognition method based on the RGB video and the skeleton sequence as claimed in claim 1, wherein the local decision result obtained by the local decision block in step 7 is locally supervised, and the loss function is as follows:
L_l = -(1/(U·M)) Σ_{i=1..U} Σ_{m=1..M} Σ_{j=1..K} y_ij log(d^m_ij)
wherein U represents the sample size, M represents the total number of skeleton points of an input video segment, y_ij represents the label at the jth position of the ith sample, and d^m_ij represents the value at the jth position of the ith sample in the decision result of the mth skeleton point feature;
globally supervising the global decision result obtained in step 8, wherein the loss function is as follows:
L_g = -(1/U) Σ_{i=1..U} Σ_{j=1..K} y_ij log(ŷ_ij)
wherein U represents the sample size, K represents the total number of behavior classes, y_ij represents the label at the jth position of the ith sample, and ŷ_ij represents the prediction result;
supervising both the local and the global decisions, wherein the loss function is as follows:
L = L_g + L_l
the initial learning rate of training is set to 0.005 and is multiplied by 0.1 after 20, 40 and 60 epochs, and a stochastic gradient descent (SGD) optimizer is used during training.
CN202010821378.3A 2020-08-14 2020-08-14 Human behavior recognition method based on RGB video and skeleton sequence Active CN111967379B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010821378.3A CN111967379B (en) 2020-08-14 2020-08-14 Human behavior recognition method based on RGB video and skeleton sequence

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010821378.3A CN111967379B (en) 2020-08-14 2020-08-14 Human behavior recognition method based on RGB video and skeleton sequence

Publications (2)

Publication Number Publication Date
CN111967379A true CN111967379A (en) 2020-11-20
CN111967379B CN111967379B (en) 2022-04-08

Family

ID=73387758

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010821378.3A Active CN111967379B (en) 2020-08-14 2020-08-14 Human behavior recognition method based on RGB video and skeleton sequence

Country Status (1)

Country Link
CN (1) CN111967379B (en)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112287891A (en) * 2020-11-23 2021-01-29 福州大学 Method for evaluating learning concentration through video based on expression and behavior feature extraction
CN112749671A (en) * 2021-01-19 2021-05-04 澜途集思生态科技集团有限公司 Human behavior recognition method based on video
CN112906604A (en) * 2021-03-03 2021-06-04 安徽省科亿信息科技有限公司 Behavior identification method, device and system based on skeleton and RGB frame fusion
CN113139469A (en) * 2021-04-25 2021-07-20 武汉理工大学 Driver road stress adjusting method and system based on micro-expression recognition
CN113139432A (en) * 2021-03-25 2021-07-20 杭州电子科技大学 Industrial packaging behavior identification method based on human body skeleton and local image
CN113469018A (en) * 2021-06-29 2021-10-01 中北大学 Multi-modal interaction behavior recognition method based on RGB and three-dimensional skeleton
CN113610071A (en) * 2021-10-11 2021-11-05 深圳市一心视觉科技有限公司 Face living body detection method and device, electronic equipment and storage medium
CN114091601A (en) * 2021-11-18 2022-02-25 业成科技(成都)有限公司 Sensor fusion method for detecting personnel condition
CN117137435A (en) * 2023-07-21 2023-12-01 北京体育大学 Rehabilitation action recognition method and system based on multi-mode information fusion

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107609460A (en) * 2017-05-24 2018-01-19 南京邮电大学 A kind of Human bodys' response method for merging space-time dual-network stream and attention mechanism
CN110135251A (en) * 2019-04-09 2019-08-16 上海电力学院 A kind of group's image Emotion identification method based on attention mechanism and hybrid network
CN110135319A (en) * 2019-05-09 2019-08-16 广州大学 A kind of anomaly detection method and its system
CN110222665A (en) * 2019-06-14 2019-09-10 电子科技大学 Human motion recognition method in a kind of monitoring based on deep learning and Attitude estimation
CN110555387A (en) * 2019-08-02 2019-12-10 华侨大学 Behavior identification method based on local joint point track space-time volume in skeleton sequence
CN110728183A (en) * 2019-09-09 2020-01-24 天津大学 Human body action recognition method based on attention mechanism neural network
CN111160164A (en) * 2019-12-18 2020-05-15 上海交通大学 Action recognition method based on human body skeleton and image fusion
US10713493B1 (en) * 2020-02-06 2020-07-14 Shenzhen Malong Technologies Co., Ltd. 4D convolutional neural networks for video recognition

Non-Patent Citations (8)

* Cited by examiner, † Cited by third party
Title
CONGQI CAO ET AL.: "Body Joint Guided 3-D Deep Convolutional Descriptors for Action Recognition", 《IEEE TRANSACTIONS ON CYBERNETICS》 *
ENQING CHEN ET AL.: "A Spatiotemporal Heterogeneous Two-Stream Network for Action Recognition", 《IEEE ACCESS》 *
FABIEN BARADEL ET AL.: "Human Action Recognition: Pose-Based Attention Draws Focus to Hands", 《2017 IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION WORKSHOPS (ICCVW)》 *
JIAN-FANG HU ET AL.: "Deep bilinear learning for RGB-D action recognition", 《ECCV 2018: COMPUTER VISION – ECCV 2018》 *
SRIJAN DAS ET AL.: "Where to Focus on for Human Action Recognition?", 《2019 IEEE WINTER CONFERENCE ON APPLICATIONS OF COMPUTER VISION (WACV)》 *
HE BINGQIAN ET AL.: "Human Action Recognition Model Based on an Improved Deep Neural Network", 《APPLICATION RESEARCH OF COMPUTERS》 *
LIU ZHIQIANG: "Human Action Recognition Fusing Video Information and Skeleton Data on the Kinect Platform", 《CHINA MASTER'S THESES FULL-TEXT DATABASE, INFORMATION SCIENCE AND TECHNOLOGY》 *
WANG JIANXI ET AL.: "Online Gesture Recognition Algorithm Fusing Pose Kernel Learning and Decision Forests", 《VIDEO ENGINEERING》 *

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112287891B (en) * 2020-11-23 2022-06-10 福州大学 Method for evaluating learning concentration through video based on expression behavior feature extraction
CN112287891A (en) * 2020-11-23 2021-01-29 福州大学 Method for evaluating learning concentration through video based on expression and behavior feature extraction
CN112749671A (en) * 2021-01-19 2021-05-04 澜途集思生态科技集团有限公司 Human behavior recognition method based on video
CN112906604A (en) * 2021-03-03 2021-06-04 安徽省科亿信息科技有限公司 Behavior identification method, device and system based on skeleton and RGB frame fusion
CN112906604B (en) * 2021-03-03 2024-02-20 安徽省科亿信息科技有限公司 Behavior recognition method, device and system based on skeleton and RGB frame fusion
CN113139432B (en) * 2021-03-25 2024-02-06 杭州电子科技大学 Industrial packaging behavior identification method based on human skeleton and partial image
CN113139432A (en) * 2021-03-25 2021-07-20 杭州电子科技大学 Industrial packaging behavior identification method based on human body skeleton and local image
CN113139469B (en) * 2021-04-25 2022-04-29 武汉理工大学 Driver road stress adjusting method and system based on micro-expression recognition
CN113139469A (en) * 2021-04-25 2021-07-20 武汉理工大学 Driver road stress adjusting method and system based on micro-expression recognition
CN113469018A (en) * 2021-06-29 2021-10-01 中北大学 Multi-modal interaction behavior recognition method based on RGB and three-dimensional skeleton
CN113469018B (en) * 2021-06-29 2024-02-23 中北大学 Multi-modal interactive behavior recognition method based on RGB and three-dimensional skeleton
CN113610071A (en) * 2021-10-11 2021-11-05 深圳市一心视觉科技有限公司 Face living body detection method and device, electronic equipment and storage medium
CN114091601A (en) * 2021-11-18 2022-02-25 业成科技(成都)有限公司 Sensor fusion method for detecting personnel condition
CN114091601B (en) * 2021-11-18 2023-05-05 业成科技(成都)有限公司 Sensor fusion method for detecting personnel condition
CN117137435A (en) * 2023-07-21 2023-12-01 北京体育大学 Rehabilitation action recognition method and system based on multi-mode information fusion
CN117137435B (en) * 2023-07-21 2024-06-25 北京体育大学 Rehabilitation action recognition method and system based on multi-mode information fusion

Also Published As

Publication number Publication date
CN111967379B (en) 2022-04-08

Similar Documents

Publication Publication Date Title
CN111967379B (en) Human behavior recognition method based on RGB video and skeleton sequence
US11783183B2 (en) Method and system for activity classification
Bhagat et al. Indian sign language gesture recognition using image processing and deep learning
Ling et al. Building data-driven models with microstructural images: Generalization and interpretability
WO2023000872A1 (en) Supervised learning method and apparatus for image features, device, and storage medium
CN110135251B (en) Group image emotion recognition method based on attention mechanism and hybrid network
Liu et al. Si-GCN: Structure-induced graph convolution network for skeleton-based action recognition
CN108537145A (en) Human bodys&#39; response method based on space-time skeleton character and depth belief network
Gammulle et al. Coupled generative adversarial network for continuous fine-grained action segmentation
CN109657634A (en) A kind of 3D gesture identification method and system based on depth convolutional neural networks
Nale et al. Suspicious human activity detection using pose estimation and lstm
Chen et al. Action keypoint network for efficient video recognition
Castro et al. AttenGait: Gait recognition with attention and rich modalities
Batool et al. Fundamental recognition of ADL assessments using machine learning engineering
CN112487926A (en) Scenic spot feeding behavior identification method based on space-time diagram convolutional network
Chen et al. Skeleton moving pose-based human fall detection with sparse coding and temporal pyramid pooling
Hassan et al. Enhanced dynamic sign language recognition using slowfast networks
Ramanathan et al. Combining pose-invariant kinematic features and object context features for rgb-d action recognition
Goga et al. Hand gesture recognition using 3D sensors
Ye et al. Human interactive behaviour recognition method based on multi-feature fusion
CN112597842B (en) Motion detection facial paralysis degree evaluation system based on artificial intelligence
Xie et al. Multi-channel Capsule Network for Micro-expression Recognition with Multiscale Fusion
Zhao et al. Research on human behavior recognition in video based on 3DCCA
CN114973305A (en) Accurate human body analysis method for crowded people
CN113591797A (en) Deep video behavior identification method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant