CN114708665A - Skeleton map human behavior identification method and system based on multi-stream fusion


Info

Publication number
CN114708665A
Authority
CN
China
Prior art keywords
skeleton
stream
training
network
behavior recognition
Prior art date
Legal status
Pending
Application number
CN202210505198.3A
Other languages
Chinese (zh)
Inventor
Tian Zhiqiang (田智强)
Wang Chenyu (王晨宇)
Yue Rujing (岳如靖)
Du Shaoyi (杜少毅)
Current Assignee
Xi'an Jiaotong University
Original Assignee
Xi'an Jiaotong University
Priority date
Filing date
Publication date
Application filed by Xi'an Jiaotong University
Priority to CN202210505198.3A
Publication of CN114708665A
Legal status: Pending


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/045 - Combinations of networks
    • G06N 3/08 - Learning methods


Abstract

The invention discloses a skeleton map human behavior recognition method and system based on multi-stream fusion. Four different data streams are extracted from video skeleton data; network model training is carried out on each of the four data streams to obtain four different training models; multi-stream fusion training is carried out on the four training models to obtain a behavior recognition model; and the behavior recognition model obtained by training is used for human behavior recognition. Finally, the model results of the four-stream training are fused so that the model outputs reinforce one another, predicting the behavior action category more accurately.

Description

Skeleton map human behavior identification method and system based on multi-stream fusion
Technical Field
The invention belongs to the technical field of behavior recognition, and particularly relates to a skeleton map human behavior recognition method and system based on multi-stream fusion.
Background
Human beings exhibit a wide variety of behaviors and actions in daily life, and these contain abundant information. With the advent of the big data age, massive numbers of pictures and videos have become the main carriers of information transmission, and how to understand human behavior has become an important problem in the field of computer vision. Behavior recognition technology can be applied to fields such as human-computer interaction, intelligent monitoring and anomaly detection, and has strong application value and research significance.
Compared with RGB data, skeleton point data sequences have a clearer appearance expression: the human body structure is represented more intuitively, and the motion and spatio-temporal relationships of human joints and bones are more salient. Methods that apply deep learning to skeleton point data for behavior recognition have therefore attracted more and more researchers, and the introduction of graph convolutional networks has provided a new solution for skeleton-point-based human behavior recognition. Extracting the spatio-temporal relationships of motion directly from the skeleton graph, without first converting it into an RGB image for feature extraction, has become a key focus of research.
Spatio-temporal graph convolutional networks have been proposed for behavior recognition and achieve good results. However, several problems remain at present:
First, the graph convolution network in such methods convolves over the neighborhood joints of each joint on the basis of the human skeleton map to obtain the corresponding features and weights, so the connections and global information of remote joints can be ignored. The information of the human body in motion does not depend entirely on adjacent joint points; the relationship between far-apart joint points often reflects the characteristics of a behavior. In hand clapping, for example, the core movement is between the two hands, while the connections and motion influence between the other joints are small. If only adjacent joints are considered, and the two hands are far apart in the skeleton point map, the algorithm has difficulty capturing this long-distance dependency, affecting the accuracy of recognizing some behaviors.
Secondly, the graph convolution structures of the layers in such methods are basically similar: multiple layers are stacked to deepen the network layer by layer, and the features of every channel in every layer are treated identically, lacking flexibility and adaptability in feature attention. A fixed graph convolution structure may differ in feature extraction capability for different actions, and there are also relationships and differences between different feature channels. Actions such as washing the face emphasize the functional relationship between the hands and the head, while actions such as jumping and squatting require attention to the motion information of the lower body. With a fixed network it is difficult to notice the different concerns of different actions and to obtain optimal modeling for a wide variety of behaviors.
Thirdly, such methods use only the joint point information of the human body in modeling and experiments. Although spatial and temporal connections are considered in constructing the spatio-temporal skeleton map, the motion information expressed by the joint points alone is still limited and information loss is possible. The skeleton map of the human body represents only its physical structure, and the spatial and temporal relationships of just 25 joint points are not optimal for expressing motion information, so more supplementary information is needed.
Disclosure of Invention
The invention aims to provide a skeleton map human behavior identification method and system based on multi-stream fusion so as to overcome the defects of the prior art.
A skeleton map human behavior recognition method based on multi-stream fusion comprises the following steps:
s1, extracting four different data streams from the video skeleton data;
and S2, respectively carrying out network model training on the four different data streams to obtain four different training models, carrying out multi-stream fusion training on the four different training models to obtain a behavior recognition model, and carrying out human behavior recognition by using the behavior recognition model obtained by training.
Preferably, the four different data streams comprise a joint stream, a skeleton stream, a joint motion stream and a skeleton motion stream.
Preferably, the skeleton diagrams of the joint stream, skeleton stream, joint motion stream and skeleton motion stream are superposed, and the same joints are connected in the time dimension to form a spatio-temporal relationship diagram as the input of the training model network.
Preferably, the network structure of the network model used for training the four data streams is the same, and the parameter setting of each data stream is kept consistent.
Preferably, the network model for training the four data streams comprises a space attention module, a graph convolution module, a channel attention module and a time domain convolution module;
the spatial attention module encodes long-term dependence between nodes through accurate position information of the nodes;
the graph convolution module is used for performing convolution operation on the bone graph to generate a corresponding feature graph;
the channel attention module adaptively adjusts the feature expression weight among the channels through modeling the interdependence relation among the channels;
the time domain convolution module is a convolution neural network of time domain dimension and is used for channel number transformation.
Preferably, in a skeleton diagram comprising N joints and T frames, all joints in the skeleton sequence are represented by a graph node set V:

$$V = \{\, v_{ti} \mid t = 1, \dots, T;\ i = 1, \dots, N \,\}$$

where each $v$ is the coordinate of a joint in three-dimensional space, represented as $(x, y, z)$;

for a given joint point $v_1 = (x_1, y_1, z_1)$ and a target joint point $v_2 = (x_2, y_2, z_2)$, the bone between them is represented by the following formula:

$$e_{v_1, v_2} = (x_2 - x_1,\; y_2 - y_1,\; z_2 - z_1).$$
preferably, the multidimensional space-time graph convolution network can be simply expressed as the following formula:
Figure BDA0003637207460000031
in the formula AjA contiguous matrix of j, Σ representing the natural connections between jointsjAjThen the summation of intracorporeal connection and self-connection between joints is represented; wherein
Figure BDA0003637207460000032
WjWeights for multiple output channelsAnd vector superposition to form a weight matrix.
Preferably, the obtained spatiotemporal relationship graph is input into the training model in the dimension of (N, C, T, V, M).
Preferably, the global pooling

$$z_c = \frac{1}{V \times T} \sum_{i=1}^{V} \sum_{j=1}^{T} x_c(i, j)$$

is decomposed so that the two-dimensional GAP operation is converted into operations on one-dimensional features.
A skeleton map human behavior recognition system based on multi-stream fusion comprises a preprocessing module and a recognition module;
the preprocessing module is used for extracting four different data streams from video skeleton data, superposing a skeleton diagram of a joint stream, a skeleton stream, a joint motion stream and the skeleton diagram of the skeleton motion stream, connecting the same joint on a time dimension to form a space-time relation diagram as input of a training model network to obtain four different training models, performing multi-stream fusion training on the four different training models to obtain a behavior recognition model, storing the behavior recognition model into the recognition module, and recognizing human body behaviors by using the trained behavior model.
Compared with the prior art, the invention has the following beneficial technical effects:
the invention relates to a bone picture human behavior recognition method based on multi-stream fusion, which comprises the steps of extracting four different data streams from video bone data, respectively carrying out network model training on the four different data streams to obtain four different training models, carrying out multi-stream fusion training on the four different training models to obtain a behavior recognition model, and carrying out human behavior recognition by using the behavior model obtained by training; and finally, fusing model results of the four-stream training to reinforce the output of the model, thereby predicting the behavior and action categories more accurately.
Furthermore, long-term dependence between nodes is coded through accurate position information of the nodes in the time-space graph convolution, so that the nodes can be in contact with other time-space nodes, and the long-distance space sensing capability of the network is improved.
Furthermore, the space-time graph convolution network is utilized to make the network more sensitive to channel information expressing different motions so as to enhance the motion characteristics.
Furthermore, four kinds of stream input data are used in data preprocessing, and finally a multi-stream fusion method is used to reinforce the characteristics, so that the action category can be predicted more accurately.
Drawings
Fig. 1 is a flowchart of an implementation of a skeleton map human behavior recognition method based on multi-stream fusion in an embodiment of the present invention.
Fig. 2 is a block diagram of a multi-stream converged space-time graph convolutional network in an embodiment of the present invention.
Fig. 3 is a network structure diagram used by each data stream in the multi-stream convergence method in the embodiment of the present invention.
FIG. 4 is a block diagram of a spatial attention module according to an embodiment of the present invention.
FIG. 5 is a block diagram of a channel attention module in an embodiment of the invention.
Fig. 6 is a multi-stream data processing diagram in an embodiment of the invention.
Fig. 7 is an effect diagram of a skeleton map human behavior recognition method based on multi-stream fusion in the embodiment of the present invention.
Detailed Description
In order to make those skilled in the art better understand the technical solutions of the present invention, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
As shown in fig. 1, a skeleton map human behavior recognition method based on multi-stream fusion comprises the following steps:
s1, extracting four different data streams from video skeleton data, connecting the obtained different data streams to obtain a data set, and dividing the data set into a training set and a test set;
the four different data streams include joint flow, skeleton flow, joint motion flow and skeleton motion flow; then carrying out data preprocessing: and superposing the joint throttling, the skeleton flow, the joint motion flow and a skeleton diagram of the skeleton motion flow, and connecting the same joints on a time dimension to form a space-time relation diagram as the input of a training model network.
S2, respectively carrying out network model training on the four different data streams to obtain four different training models, carrying out multi-stream fusion training on the four different training models to obtain a behavior recognition model, and carrying out human behavior recognition by using the behavior recognition model obtained by training.
Specifically, model training is carried out on four data streams by using a complete network based on graph convolution respectively to obtain 4 training models, the parameter settings of each data stream are kept consistent, and the network structures are the same;
In the multi-stream fusion stage, the multi-stream prediction results are added in different proportions to obtain the final prediction score and classify the behavior action; the human behavior recognition task based on the skeleton map can then be performed with the trained behavior recognition model. A sketch of the fusion step follows.
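The following is a minimal Python sketch of this score-fusion step. The fusion weights are illustrative assumptions (the patent states only that the streams' prediction results are added in different proportions), and `fuse_predictions` is a hypothetical helper name.

```python
import numpy as np

def fuse_predictions(joint, bone, joint_motion, bone_motion,
                     weights=(1.0, 1.0, 0.5, 0.5)):
    """Weighted sum of per-stream class scores.

    Each argument is a (num_samples, num_classes) score array from one
    trained stream model; `weights` are assumed fusion proportions.
    """
    streams = (joint, bone, joint_motion, bone_motion)
    fused = sum(w * s for w, s in zip(weights, streams))
    return fused.argmax(axis=-1)  # final predicted behavior category
```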
Specifically, the method and the device adopt public data extracted from video skeleton data as a data set, and divide the data set into a training set and a testing set.
Data preprocessing: as shown in fig. 6(a), the joint flow data are acquired as follows: in a skeleton diagram containing N joints and T frames, all joints in the skeleton sequence are represented by a graph node set V, namely:

$$V = \{\, v_{ti} \mid t = 1, \dots, T;\ i = 1, \dots, N \,\}$$

where each $v$ is the coordinate of a joint in three-dimensional space and may be expressed as $(x, y, z)$.

Fig. 6(b) shows the construction of the skeleton stream: for a given joint point $v_1 = (x_1, y_1, z_1)$ and a target joint point $v_2 = (x_2, y_2, z_2)$, the bone between them may be expressed as:

$$e_{v_1, v_2} = (x_2 - x_1,\; y_2 - y_1,\; z_2 - z_1)$$
the bone information therefore includes not only the coordinate position but also the orientation information. Since the number of skeleton data is one less than that of joints, in order to maintain input consistency and simplify the network processing, a hollow skeleton is added to the defined center-of-gravity joint, the value of the hollow skeleton is set to 0, and the hollow skeleton is bound to the center-of-gravity joint. This allows one-to-one correspondence of bones to joints, maintaining the same dimensionality as the joint flow in the preprocessing of the input data, so that the network can process different data streams in the same manner.
Considering that motion information is the key to behavior recognition, appearance features and implicit relationships may still be lost when spatio-temporal modeling is learned from joints and skeletons alone through the network. In the classic two-stream architecture in the field of RGB image behavior recognition, the expression of motion information is enhanced by optical flow, which serves as the input of a separate stream to reinforce the timing information in the RGB image stream. Accordingly, two data streams containing motion are added on the basis of the joint stream and the skeleton stream.
The data of the joint motion stream and the bone motion stream are shown in fig. 6(c) and 6(d). Because skeleton data are represented by joint coordinates, the motion of a joint can be represented, analogously to obtaining pixel motion information between successive frames, as the difference of the same coordinate point along the time dimension, i.e.

$$v_{ti}^{m} = v_{(t+1)i} - v_{ti}$$

Similarly, the motion change of a bone can be represented by the vector difference of the same bone in consecutive frames, i.e.

$$e_{v_1 v_2, t}^{m} = e_{v_1 v_2, t+1} - e_{v_1 v_2, t}$$

Finally, the motion information is likewise expressed as a graph with the (C, T, V) dimensions of the joint stream input, ensuring the consistency of the network structure. A sketch of this temporal-difference preprocessing follows.
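A minimal sketch of the motion streams, under the assumption that the last frame is zero-padded so the (C, T, V) shape of the static streams is preserved:

```python
import numpy as np

def to_motion_stream(data):
    """Frame-to-frame difference of a (C, T, V) joint or bone stream."""
    motion = np.zeros_like(data)
    motion[:, :-1, :] = data[:, 1:, :] - data[:, :-1, :]  # v_{t+1} - v_t
    return motion

# joint_motion = to_motion_stream(joints)  # fig. 6(c), joints as above
# bone_motion  = to_motion_stream(bones)   # fig. 6(d), bones as above
```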
As shown in fig. 2, the recognition model network adopted in the present invention first adds a Spatial-Channel Attention (SCA) module to strengthen the graph convolutional network's ability to extract and model spatio-temporal features, constructing the basic training model (the SCA-GCN network). The bone data are then preprocessed to generate the joint stream, skeleton stream, joint motion stream and skeleton motion stream as network inputs, and a training model is obtained for each. Finally the prediction results of the multiple streams are fused and output.
The network structure used by each data stream in the multi-stream fusion method is the same. Specifically, as shown in fig. 3, the preprocessed data are input into the SCA-GCN network with dimensions (N, C, T, V, M), and the input is first normalized by a Batch Normalization (BN) layer: the input information of the different data streams differs in part, and because the network shares weights across nodes, this keeps the scale of the input data consistent on different nodes. Features are then preliminarily extracted by a basic CA-GCN block that expands the channels from 3 to 64 dimensions;
a Spatial Attention (SA) module then integrates the information of the global graph nodes and the multi-frame time dimension, capturing long-distance dependencies and enhancing the feature map.
The feature map is then input into a group of 9 CA-GCN layers: the first three layers have 64 channels, the middle three 128 channels, and the last three 256 channels. Each CA-GCN layer comprises a Graph Convolutional Network (GCN), Channel Attention (CA) and a temporal convolutional network (TCN), and outputs a feature map after a residual structure and a ReLU activation function. The TCNs of the fourth and seventh layers in the overall structure are pooling layers for channel-number conversion with stride 2; the rest are convolution layers with stride 1. Finally, the obtained feature tensor is globally pooled to produce a 256-dimensional feature vector for each sequence, which is input into a Softmax classifier to obtain the output for classification prediction. The graph convolution computation process can be written as:
$$f_{out}(v_{ti}) = \sum_{v_{tj} \in B(v_{ti})} \frac{1}{Z_{ti}(v_{tj})}\, f_{in}\big(p(v_{ti}, v_{tj})\big)\, w\big(l_{ti}(v_{tj})\big)$$

where $B(v_{ti}) = \{\, v_{tj} \mid d(v_{tj}, v_{ti}) \le D \,\}$ denotes the set of neighboring joint points, $d$ denotes the shortest distance between two nodes, and $D$ is the distance limit for taking neighboring nodes.

In this method $D = 1$, i.e. the set of joint points at adjacency distance 1 is taken. The sampling function is denoted $p(v_{ti}, v_{tj}) = v_{tj}$, and $Z_{ti}(v_{tj}) = |\{\, v_{tk} \mid l_{ti}(v_{tk}) = l_{ti}(v_{tj}) \,\}|$ is a normalization term that represents the cardinality of the corresponding subset and controls the influence of the different subsets on the output. Here $l_{ti}: B(v_{ti}) \to \{0, \dots, K-1\}$ denotes a mapping label that divides the nodes in the neighborhood into a fixed number $K$ of subsets, so the weighting function can be expressed as $w(v_{ti}, v_{tj}) = w'(l_{ti}(v_{tj}))$. For the partitioning scheme, the subsets are divided according to the distance from each node to the skeleton's center of gravity, into the root node itself, the set of nodes closer to the center of gravity than the root node, and the set of nodes farther from the center of gravity than the root node, as shown in the following formula:

$$l_{ti}(v_{tj}) = \begin{cases} 0, & r_j = r_i \\ 1, & r_j < r_i \\ 2, & r_j > r_i \end{cases}$$

where $r_i$ represents the average distance from the skeleton's center of gravity to joint $i$ over all frames in the training dataset. Substituting the neighborhood set, sampling function, weighting function, normalization term and partition labels into the graph convolution formula yields the output result of the GCN module; a sketch of this partitioning follows.
Because the connection relationships between the same nodes of consecutive frames are already included in the construction of the input graph, time-domain features can be further acquired after the graph convolution through a similar convolution operation. The concept of neighborhood is first expanded, extending the graph from spatial to spatio-temporal connections, as represented by:

$$B(v_{ti}) = \{\, v_{qj} \mid d(v_{tj}, v_{ti}) \le K,\ |q - t| \le \Gamma/2 \,\}$$

where $\Gamma$ represents the size of the time-domain kernel and controls the time range of the neighboring frame skeleton maps. The sampling function is the same as that used in the GCN, and a certain modification of the label mapping in the weighting function is needed to complete the spatio-temporal graph convolution. Since adjacent frames are temporally ordered, the spatio-temporal neighborhood of the root node is expressed directly as:

$$l_{ST}(v_{qj}) = l_{ti}(v_{tj}) + (q - t + \Gamma/2) \times K$$

where $l_{ti}(v_{tj})$ is the single-frame label mapping. This allows the convolution operation to be carried out properly on the spatio-temporal graph, obtaining the node relationships in space and time and the spatio-temporal feature map.
The final multidimensional spatio-temporal graph convolutional network can be simplified and expressed as the following formula:

$$\mathbf{f}_{out} = \sum_{j} \Lambda_j^{-\frac{1}{2}} A_j \Lambda_j^{-\frac{1}{2}} \mathbf{f}_{in} W_j$$

where $A_j$ is the adjacency matrix of partition $j$, representing the natural connections between joints, and $\sum_j A_j$ represents the sum of the intra-body connections and self-connections between joints; $\Lambda_j^{ii} = \sum_k A_j^{ik} + \alpha$ is the corresponding degree matrix ($\alpha$ a small constant preventing empty rows); $W_j$ is the weight matrix formed by stacking the weight vectors of the multiple output channels, with the input features simplified to tensors of shape (C, V, T);

a learnable weight matrix $M$ is also attached to each adjacency matrix and initialized to an all-ones matrix. A sketch of this layer follows.
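A minimal PyTorch sketch of this partitioned graph convolution is given below. For brevity it uses the simpler row normalization $\Lambda_j^{-1} A_j$ rather than the symmetric $\Lambda_j^{-1/2} A_j \Lambda_j^{-1/2}$, and realizes $W_j$ as a 1 × 1 convolution; the learnable mask $M$ is initialized to all ones as described:

```python
import torch
import torch.nn as nn

class GraphConv(nn.Module):
    """f_out = sum_j norm(A_j * M_j) f_in W_j over K partitions."""
    def __init__(self, in_channels, out_channels, A):
        super().__init__()                  # A: (K, V, V) partition matrices
        self.register_buffer('A', A.float())
        self.M = nn.Parameter(torch.ones_like(self.A))   # learnable mask
        self.K = A.size(0)
        self.conv = nn.Conv2d(in_channels, out_channels * self.K, 1)

    def forward(self, x):                   # x: (N, C, T, V)
        A = self.A * self.M
        A = A / A.sum(dim=-1, keepdim=True).clamp(min=1e-6)  # row-normalise
        x = self.conv(x)                    # (N, K*C_out, T, V)
        n, kc, t, v = x.shape
        x = x.view(n, self.K, kc // self.K, t, v)
        return torch.einsum('nkctv,kvw->nctw', x, A)  # sum over partitions
```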
As shown in fig. 4, the spatial attention module aggregates features along two different directions according to attention: the V dimension captures remote dependencies along the node direction to establish connections between global nodes, while the T dimension retains accurate node position information along the time direction. The generated feature maps are then encoded separately, producing direction-aware and position-sensitive attention maps; after activation these are applied to the input feature map, enhancing the perception of global spatial information in the features so that regions of interest can be captured and expressed accurately. The specific operation comprises two steps. The first is an information aggregation module: so that the attention module can capture accurate joint position information and long-range spatial interactions, the global pooling

$$z_c = \frac{1}{V \times T} \sum_{i=1}^{V} \sum_{j=1}^{T} x_c(i, j)$$

is decomposed, converting the two-dimensional GAP operation into operations on one-dimensional features. Given an input $X$, each channel is encoded along the two coordinates, space and time, using pooling kernels of size $(V, 1)$ and $(1, T)$ respectively, yielding outputs of shape $(C \times V \times 1)$ and $(C \times 1 \times T)$. The output of channel $c$ at node $v$ and the output of channel $c$ at time $t$ can be represented as

$$z_c^{v}(v) = \frac{1}{T} \sum_{0 \le t < T} x_c(v, t) \qquad \text{and} \qquad z_c^{t}(t) = \frac{1}{V} \sum_{0 \le v < V} x_c(v, t).$$

This operation aggregates features along the two directions separately, following spatial position and temporal relationship, to obtain a pair of feature maps.
The second step is an attention generation module. After obtaining the feature maps with a global receptive field and encoded accurate position information from the information aggregation module, the two groups of vectors are concatenated, and a 1 × 1 convolutional transform function $F_1$ carries out transformation and dimensionality-reduction operations to obtain the feature mapping, as shown in the following formula:

$$f = \delta\big(F_1([z^{v}, z^{t}])\big)$$

where $[\cdot, \cdot]$ denotes the Concat operation along the spatial dimension and $\delta$ is a nonlinear activation function. $f$ is the intermediate feature mapping encoding along node information and time information, with dimensions $\mathbb{R}^{C/r \times (V+T)}$, where $r$ is the channel reduction ratio used to reduce the number of channels and improve computational efficiency; in this method $r = 32$. $f$ is then decomposed by normalization and nonlinearity into two separate tensors $f^{v} \in \mathbb{R}^{C/r \times V}$ and $f^{t} \in \mathbb{R}^{C/r \times T}$, and two 1 × 1 convolutional transforms $F_v$ and $F_t$ expand each back to tensors with the original number of channels, giving $g^{v} = \sigma(F_v(f^{v}))$ and $g^{t} = \sigma(F_t(f^{t}))$, where $\sigma$ is the sigmoid activation function. The expanded $g^{v}$ and $g^{t}$ are the attention weights; finally the weights are multiplied with the input and combined through a residual structure to obtain the output of the module, as shown in the following formula:

$$y_c(i, j) = x_c(i, j) \times g_c^{v}(i) \times g_c^{t}(j)$$

The resulting output $y_c(i, j)$ keeps the same dimensions as the input, while the attention $g^{v}$ generated over the node dimension represents spatial global information and the attention $g^{t}$ generated over the frame-number dimension represents temporal transformation information, enhancing the feature vector's attention to spatial information. A minimal sketch of this module follows.
As shown in fig. 5, a channel attention CA module is added after the GCN extracts spatial features. The CA module models the dependency relationships between channels and adaptively adjusts the feature expression weights between channels, thereby enhancing the channels beneficial to motion recognition and suppressing other, unrelated channels. The channel attention module broadly follows the design of the SE module, but changes its Global Average Pooling (GAP) to a Discrete Cosine Transform (DCT): it has been proved that GAP is a special case of the two-dimensional DCT, the result obtained by GAP being proportional to the lowest-frequency component of the two-dimensional DCT. The present channel attention module uses more frequency components to introduce more information. First, the input $X$ is divided into $n$ blocks along the channel dimension, denoted $[X^0, X^1, \dots, X^{n-1}]$, where each $X^i \in \mathbb{R}^{C' \times V \times T}$, $i \in \{0, 1, \dots, n-1\}$, and $C' = C/n$.

Each block is then assigned a two-dimensional DCT component; the basis function of the 2D DCT is written

$$B_{v,t}^{j,k} = \cos\!\left(\frac{\pi j}{V}\Big(v + \frac{1}{2}\Big)\right)\cos\!\left(\frac{\pi k}{T}\Big(t + \frac{1}{2}\Big)\right)$$

and the output result for each block is given by:

$$\mathrm{Freq}^{i} = \sum_{v=0}^{V-1} \sum_{t=0}^{T-1} X_{:, v, t}^{i}\, B_{v,t}^{j_i, k_i}$$

where $[j, k]$ is the component index of the 2D DCT and $i \in \{0, 1, \dots, n-1\}$, so that each block obtains a different frequency component; the concatenated output $\mathrm{Freq} \in \mathbb{R}^{C}$ is a multispectral vector. This vector is sent to a fully connected layer FC to learn an attention map, and the attention finally obtained by the multispectral channel attention module is:

$$ms\text{-}att = \mathrm{sigmoid}\big(\mathrm{fc}(\mathrm{Freq})\big)$$

Finally, the activation values on the channels are multiplied with the original features so that the learned per-channel weight coefficients let the model better represent the enhanced channel features, and a residual connection adds the channel-attended features to the original input, yielding an output tensor of the same dimensions. A minimal sketch of this module follows.
Finally, the output of the network-trained model is passed to a softmax classifier, which outputs the predicted action category. With the trained recognition model, the human behavior recognition task based on a skeleton map can be performed, as shown in fig. 7.
The skeleton map human behavior recognition method based on multi-stream fusion of the invention, applied to a dataset obtained by extracting and preprocessing video skeleton data, addresses the problems that, in skeleton point behavior recognition scenes, the graph convolution network cannot capture long-distance dependencies, action-sensitive channel attention is difficult to obtain, and a single data input provides insufficient information.
A spatial attention SA module is introduced into the training model network framework adopted by the invention; it encodes long-term dependencies between nodes through the accurate position information of the nodes, so that each node can be linked with other spatio-temporal nodes, improving the network's long-distance spatial perception.
A channel attention CA module is added to the spatio-temporal graph convolutional network, making the network more sensitive to the channel information that expresses different motions and thus enhancing the motion features.
Four preprocessing methods are used to generate the four-stream input data, and a multi-stream fusion method is finally used so that the model outputs reinforce one another and the action category is predicted more accurately.
Examples
A skeleton map human behavior recognition method based on multi-stream fusion comprises the following steps:
s1, extracting four different data streams from video skeleton data: joint flow, bone flow, joint motion flow and bone motion flow. The specific working process is as follows:
(1.1) adopting a group of public skeleton point data sets and a group of public RGB data sets;
(1.2) for the RGB dataset, the 18 joint points of the human body in each frame are extracted using the public OpenPose toolbox, represented by coordinates, and integrated into a data format conforming to the network input;
and (1.3) the bone data of step (1.1) and step (1.2) are divided into four different data streams using four different preprocessing methods: the joint stream, bone stream, joint motion stream and bone motion stream, as shown in fig. 6.
And S2, processing the acquired different data streams, and dividing a training set and a test set. The specific working process is as follows:
(2.1) For the bone point data streams in (1.3), a spatial configuration partitioning strategy is used. The skeleton's center of gravity is marked first, and the neighborhood of each node is then divided into 3 subsets: the node itself (green, subset 0), the set of nodes closer to the skeleton's center of gravity than the node itself (blue, subset 1), and the set of nodes farther from the skeleton's center of gravity than the node itself (yellow, subset 2). This partition rule lets the behavior recognition task pay more attention to the movement of the nodes, and the convolution kernel can obtain richer and more accurate node features;
(2.2) the processed dataset is randomly split into a training set and a test set at a ratio of 4:1, and random translation and scaling are applied to the dataset for expansion and data enhancement, as sketched below;
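A minimal sketch of this augmentation step; the shift and scale ranges are illustrative assumptions, since the patent states only that random translation and scaling are applied:

```python
import numpy as np

def augment(sample, max_shift=0.1, scale_range=(0.9, 1.1)):
    """sample: (3, T, V, M) skeleton data; returns an augmented copy."""
    scale = np.random.uniform(*scale_range)
    shift = np.random.uniform(-max_shift, max_shift, size=(3, 1, 1, 1))
    return sample * scale + shift  # scale about the origin, then translate
```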
and S3, inputting the data of different streams into the time-space graph convolution network respectively, as shown in the figure 2 and the figure 3. The specific working process is as follows:
(3.1) inputting the data preprocessed in the step (2.2) into the SCA-GCN network according to the dimension of (N, C, T, V, M), and firstly, inputting the data to be normalized through a Batch Normalization layer (BN);
(3.2) after step (3.1), features are preliminarily extracted by the basic CA-GCN block, expanding the channels from 3 to 64 dimensions;
and S4, enhancing the feature extraction capability of the network by using a space attention module and a channel attention module, as shown in FIGS. 4 and 5. The specific working process is as follows:
(4.1) after the step (3.1), integrating the global graph nodes and the information of the multi-frame time dimension through a space attention module by the feature graph, capturing long-distance dependence, and enhancing the feature graph;
(4.2) each CA-GCN layer in (3.2) comprises a graph convolution GCN, channel attention CA and a time-domain convolution TCN layer: the GCN is responsible for spatial graph convolution, the TCN is a time-domain-dimension convolutional neural network, and the CA module multiplies the activation values on each channel with the original features to obtain the weight coefficients of each channel, so that the enhanced channel features can be better represented by the model.
And S5, a behavior recognition model is trained through stacking of a multilayer network structure. The specific working process is as follows:
and (5.1) after (4.1), the feature map is sent into the 9-layer CA-GCN group to extract features; the first three layers have 64 channels, the middle three layers 128 channels, and the last three layers 256 channels.
and (5.2) the feature tensors obtained in step (5.1) are globally pooled to obtain a 256-dimensional feature vector for each sequence, which is input into a Softmax classifier to obtain the output for classification prediction; the overall per-stream stack is sketched below.
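For orientation, a minimal sketch of the overall per-stream stack is given below. `CAGCNBlock` is a hypothetical module assumed to bundle GCN + channel attention + TCN with a residual connection, and `SpatialAttention` is the SA sketch given earlier; the reshape conventions around the (N, C, T, V, M) input follow common practice and are assumptions:

```python
import torch.nn as nn

class SCAGCN(nn.Module):
    def __init__(self, num_classes, A, in_ch=3, num_joints=25, num_people=2):
        super().__init__()
        self.bn = nn.BatchNorm1d(num_people * num_joints * in_ch)
        self.stem = CAGCNBlock(in_ch, 64, A)             # 3 -> 64 channels
        self.sa = SpatialAttention(64)                   # spatial attention
        chans = [64, 64, 64, 128, 128, 128, 256, 256, 256]
        blocks, prev = [], 64
        for i, c in enumerate(chans):
            stride = 2 if i in (3, 6) else 1             # layers 4 and 7: stride 2
            blocks.append(CAGCNBlock(prev, c, A, stride=stride))
            prev = c
        self.blocks = nn.ModuleList(blocks)
        self.fc = nn.Linear(256, num_classes)

    def forward(self, x):                                # x: (N, C, T, V, M)
        N, C, T, V, M = x.shape
        x = self.bn(x.permute(0, 4, 3, 1, 2).reshape(N, M * V * C, T))
        x = x.view(N, M, V, C, T).permute(0, 1, 3, 4, 2).reshape(N * M, C, T, V)
        x = self.sa(self.stem(x))
        for blk in self.blocks:
            x = blk(x)
        x = x.mean(dim=(2, 3)).view(N, M, -1).mean(dim=1)  # global pooling
        return self.fc(x)                                # logits for Softmax
```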
S6, fusing model results of the four-stream training, as shown in FIG. 2, and obtaining a final output result;
and S7, regarding the trained multi-stream fusion human behavior recognition model, taking the test image as input to obtain a behavior recognition result, as shown in FIG. 7. The specific working process is as follows:
(7.1) regarding the multi-stream fusion behavior recognition model in the step S6, taking the test set in the step (2.2) as an input to obtain a model classification result;
(7.2) comparing the behavior prediction results of the multi-stream fusion behavior recognition model in step (7.1) with the actual action labels shows that the model achieves an excellent recognition effect, with outstanding accuracy on both datasets, as shown in fig. 7.

Claims (10)

1. A skeleton map human behavior recognition method based on multi-stream fusion is characterized by comprising the following steps:
s1, extracting four different data streams from the video skeleton data;
and S2, respectively carrying out network model training on the four different data streams to obtain four different training models, carrying out multi-stream fusion training on the four different training models to obtain a behavior recognition model, and carrying out human behavior recognition by using the behavior recognition model obtained by training.
2. The skeleton map human behavior recognition method based on multi-stream fusion according to claim 1, wherein the four different data streams comprise a joint stream, a skeleton stream, a joint motion stream and a skeleton motion stream.
3. The skeleton map human behavior recognition method based on multi-stream fusion according to claim 2, wherein the skeleton diagrams of the joint stream, skeleton stream, joint motion stream and skeleton motion stream are superposed, and the same joints are connected in the time dimension to form a spatio-temporal relationship diagram as the input of the training model network.
4. The skeleton map human behavior recognition method based on multi-stream fusion as claimed in claim 3, wherein network structures of network models used for training four data streams are the same, and parameter settings of each data stream are kept consistent.
5. The skeleton map human behavior recognition method based on multi-stream fusion as claimed in claim 1, wherein the network model for training four data streams comprises a spatial attention module, a map convolution module, a channel attention module and a time domain convolution module;
the spatial attention module encodes long-term dependence between nodes through accurate position information of the nodes;
the graph convolution module is used for performing convolution operation on the bone graph to generate a corresponding feature graph;
the channel attention module adaptively adjusts the feature expression weight among the channels through modeling the interdependence relation among the channels;
the time domain convolution module is a convolution neural network of time domain dimension and is used for channel number transformation.
6. The skeleton map human behavior recognition method based on multi-stream fusion according to claim 2, wherein in a skeleton map comprising N joints and T frames, all joints in the skeleton sequence are represented by a graph node set V:

$$V = \{\, v_{ti} \mid t = 1, \dots, T;\ i = 1, \dots, N \,\}$$

where each $v$ is the coordinate of a joint in three-dimensional space, represented as $(x, y, z)$;

for a given joint point $v_1 = (x_1, y_1, z_1)$ and a target joint point $v_2 = (x_2, y_2, z_2)$, the bone between them is represented by the following formula:

$$e_{v_1, v_2} = (x_2 - x_1,\; y_2 - y_1,\; z_2 - z_1).$$
7. The skeleton map human behavior recognition method based on multi-stream fusion according to claim 5, wherein the multidimensional spatio-temporal graph convolutional network can be simplified as:

$$\mathbf{f}_{out} = \sum_{j} \Lambda_j^{-\frac{1}{2}} A_j \Lambda_j^{-\frac{1}{2}} \mathbf{f}_{in} W_j$$

where $A_j$ is the adjacency matrix of partition $j$, representing the natural connections between joints, and $\sum_j A_j$ represents the sum of the intra-body connections and self-connections between joints; $\Lambda_j^{ii} = \sum_k A_j^{ik} + \alpha$ is the corresponding degree matrix; and $W_j$ is the weight matrix formed by stacking the weight vectors of the plurality of output channels.
8. The skeleton map human behavior recognition method based on multi-stream fusion according to claim 3, wherein the obtained spatio-temporal relationship diagram is input into the training model with dimensions (N, C, T, V, M).
9. The skeleton map human behavior recognition method based on multi-stream fusion according to claim 5, wherein the global pooling

$$z_c = \frac{1}{V \times T} \sum_{i=1}^{V} \sum_{j=1}^{T} x_c(i, j)$$

is decomposed so that the two-dimensional GAP operation is converted into operations on one-dimensional features.
10. A skeleton map human behavior recognition system based on multi-stream fusion is characterized by comprising a preprocessing module and a recognition module;
the preprocessing module is used for extracting four different data streams from video skeleton data, superposing a skeleton diagram of a joint stream, a skeleton stream, a joint motion stream and the skeleton diagram of the skeleton motion stream, connecting the same joint on a time dimension to form a space-time relation diagram as input of a training model network to obtain four different training models, performing multi-stream fusion training on the four different training models to obtain a behavior recognition model, storing the behavior recognition model into the recognition module, and recognizing human body behaviors by using the trained behavior model.
CN202210505198.3A (priority date 2022-05-10, filing date 2022-05-10) - Skeleton map human behavior identification method and system based on multi-stream fusion - Pending - CN114708665A (en)

Priority Applications (1)

Application Number: CN202210505198.3A
Priority Date / Filing Date: 2022-05-10 / 2022-05-10
Title: Skeleton map human behavior identification method and system based on multi-stream fusion


Publications (1)

Publication Number: CN114708665A
Publication Date: 2022-07-05

Family

ID=82177177


Country Status (1)

CN: CN114708665A (en)


Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116012950A (en) * 2023-02-15 2023-04-25 杭州电子科技大学信息工程学院 Skeleton action recognition method based on multi-heart space-time attention pattern convolution network
CN116012950B (en) * 2023-02-15 2023-06-30 杭州电子科技大学信息工程学院 Skeleton action recognition method based on multi-heart space-time attention pattern convolution network
CN116524601A (en) * 2023-06-21 2023-08-01 深圳市金大智能创新科技有限公司 Self-adaptive multi-stage human behavior recognition model for assisting in monitoring of pension robot
CN116524601B (en) * 2023-06-21 2023-09-12 深圳市金大智能创新科技有限公司 Self-adaptive multi-stage human behavior recognition model for assisting in monitoring of pension robot
CN117437392A (en) * 2023-12-15 2024-01-23 杭州锐健医疗科技有限公司 Cruciate ligament dead center marker and model training method and arthroscope system thereof
CN117437392B (en) * 2023-12-15 2024-03-26 杭州锐健医疗科技有限公司 Cruciate ligament dead center marker and model training method and arthroscope system thereof
CN117851920A (en) * 2024-03-07 2024-04-09 国网山东省电力公司信息通信公司 Power Internet of things data anomaly detection method and system


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination