CN114708665A - Skeleton map human behavior identification method and system based on multi-stream fusion


Info

Publication number
CN114708665A
Authority
CN
China
Prior art keywords
skeleton
stream
training
network
behavior recognition
Prior art date
Legal status
Pending
Application number
CN202210505198.3A
Other languages
Chinese (zh)
Inventor
Tian Zhiqiang (田智强)
Wang Chenyu (王晨宇)
Yue Rujing (岳如靖)
Du Shaoyi (杜少毅)
Current Assignee
Xi'an Jiaotong University
Original Assignee
Xi'an Jiaotong University
Priority date
Filing date
Publication date
Application filed by Xi'an Jiaotong University
Priority to CN202210505198.3A
Publication of CN114708665A
Legal status: Pending


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/045 - Combinations of networks
    • G06N 3/08 - Learning methods


Abstract

The invention discloses a skeleton map human behavior recognition method and system based on multi-stream fusion. Four different data streams are extracted from video skeleton data; network model training is carried out on each of the four data streams to obtain four different training models; multi-stream fusion training is carried out on the four training models to obtain a behavior recognition model; and the behavior recognition model obtained by training is used for human behavior recognition. Finally, the model results of the four-stream training are fused so that the model outputs reinforce one another, predicting the behavior action category more accurately.

Description

Skeleton map human behavior identification method and system based on multi-stream fusion
Technical Field
The invention belongs to the technical field of behavior recognition, and particularly relates to a skeleton map human behavior recognition method and system based on multi-stream fusion.
Background
Human beings exhibit a wide variety of behaviors and actions in daily life, and these contain abundant information. With the advent of the big data age, massive numbers of pictures and videos have become the main carriers of information transmission, and how to understand human behavior has become an important problem in the field of computer vision. Behavior recognition technology can be applied to fields such as human-computer interaction, intelligent monitoring and anomaly detection, and has strong application value and research significance.
Compared with RGB data, skeleton point data sequences have a clearer appearance expression: the human body structure is represented more intuitively, and the motion and spatio-temporal relationships of human joints and bones are more salient. Methods that apply deep learning to skeleton point data for behavior recognition have therefore attracted more and more researchers, and the introduction of graph convolutional networks has provided a new solution for skeleton-point-based human behavior recognition. Extracting the spatio-temporal relationships of motion directly from the skeleton graph, without first converting it into an RGB image for feature extraction, has become a key focus of research.
Spatio-temporal graph convolutional networks have been proposed for behavior recognition and achieve good results. However, several problems remain at present:
First, the graph convolution network in such methods convolves over the neighborhood joints of each joint on the basis of the human skeleton map to obtain the corresponding features and weights, so the connections and global information of remote joints can be ignored. The information of the human body in motion does not depend entirely on adjacent joint points; the relationship between far-apart joint points often reflects the characteristics of a behavior. In hand clapping, for example, the core movement is between the two hands, while the connections and motion influence between the other joints are small. If only adjacent joints are considered, and the two hands are far apart in the skeleton point map, the algorithm has difficulty capturing this long-distance dependency, affecting the accuracy of recognizing some behaviors.
Secondly, the graph convolution structures of the layers in such methods are basically similar: multiple layers are stacked to deepen the network layer by layer, and the features of every channel in every layer are treated identically, lacking flexibility and adaptability in feature attention. A fixed graph convolution structure may differ in feature extraction capability for different actions, and there are also relationships and differences between different feature channels. Actions such as washing the face emphasize the functional relationship between the hands and the head, while actions such as jumping and squatting require attention to the motion information of the lower body. With a fixed network it is difficult to notice the different concerns of different actions and to obtain optimal modeling for a wide variety of behaviors.
Thirdly, such methods use only the joint point information of the human body in modeling and experiments. Although spatial and temporal connections are considered in constructing the spatio-temporal skeleton map, the motion information expressed by the joint points alone is still limited and information loss is possible. The skeleton map of the human body represents only its physical structure, and the spatial and temporal relationships of just 25 joint points are not optimal for expressing motion information, so more supplementary information is needed.
Disclosure of Invention
The invention aims to provide a skeleton map human behavior identification method and system based on multi-stream fusion so as to overcome the defects of the prior art.
A skeleton map human behavior recognition method based on multi-stream fusion comprises the following steps:
s1, extracting four different data streams from the video skeleton data;
and S2, respectively carrying out network model training on the four different data streams to obtain four different training models, carrying out multi-stream fusion training on the four different training models to obtain a behavior recognition model, and carrying out human behavior recognition by using the behavior recognition model obtained by training.
Preferably, the four different data streams comprise a joint stream, a skeleton stream, a joint motion stream and a skeleton motion stream.
Preferably, the skeleton diagrams of the joint stream, skeleton stream, joint motion stream and skeleton motion stream are superposed, and the same joints are connected in the time dimension to form a spatio-temporal relationship diagram as the input of the training model network.
Preferably, the network structure of the network model used for training the four data streams is the same, and the parameter setting of each data stream is kept consistent.
Preferably, the network model for training the four data streams comprises a space attention module, a graph convolution module, a channel attention module and a time domain convolution module;
the spatial attention module encodes long-term dependence between nodes through accurate position information of the nodes;
the graph convolution module is used for performing convolution operation on the bone graph to generate a corresponding feature graph;
the channel attention module adaptively adjusts the feature expression weight among the channels through modeling the interdependence relation among the channels;
the time domain convolution module is a convolution neural network of time domain dimension and is used for channel number transformation.
Preferably, in a skeleton diagram comprising N joints and T frames, all joints in the skeleton sequence are represented by a graph node set V:

$$V = \{\, v_{ti} \mid t = 1, \dots, T;\ i = 1, \dots, N \,\}$$

where each $v$ is the coordinate of a joint in three-dimensional space, represented as $(x, y, z)$;

for a given joint point $v_1 = (x_1, y_1, z_1)$ and a target joint point $v_2 = (x_2, y_2, z_2)$, the bone between them is represented by the following formula:

$$e_{v_1, v_2} = (x_2 - x_1,\; y_2 - y_1,\; z_2 - z_1).$$
preferably, the multidimensional space-time graph convolution network can be simply expressed as the following formula:
Figure BDA0003637207460000031
in the formula AjA contiguous matrix of j, Σ representing the natural connections between jointsjAjThen the summation of intracorporeal connection and self-connection between joints is represented; wherein
Figure BDA0003637207460000032
WjWeights for multiple output channelsAnd vector superposition to form a weight matrix.
Preferably, the obtained spatiotemporal relationship graph is input into the training model in the dimension of (N, C, T, V, M).
Preferably, the global pooling

$$z_c = \frac{1}{V \times T} \sum_{i=1}^{V} \sum_{j=1}^{T} x_c(i, j)$$

is decomposed so that the two-dimensional GAP operation is converted into operations on one-dimensional features.
A skeleton map human behavior recognition system based on multi-stream fusion comprises a preprocessing module and a recognition module;
the preprocessing module is used for extracting four different data streams from video skeleton data, superposing a skeleton diagram of a joint stream, a skeleton stream, a joint motion stream and the skeleton diagram of the skeleton motion stream, connecting the same joint on a time dimension to form a space-time relation diagram as input of a training model network to obtain four different training models, performing multi-stream fusion training on the four different training models to obtain a behavior recognition model, storing the behavior recognition model into the recognition module, and recognizing human body behaviors by using the trained behavior model.
Compared with the prior art, the invention has the following beneficial technical effects:
the invention relates to a bone picture human behavior recognition method based on multi-stream fusion, which comprises the steps of extracting four different data streams from video bone data, respectively carrying out network model training on the four different data streams to obtain four different training models, carrying out multi-stream fusion training on the four different training models to obtain a behavior recognition model, and carrying out human behavior recognition by using the behavior model obtained by training; and finally, fusing model results of the four-stream training to reinforce the output of the model, thereby predicting the behavior and action categories more accurately.
Furthermore, long-term dependence between nodes is coded through accurate position information of the nodes in the time-space graph convolution, so that the nodes can be in contact with other time-space nodes, and the long-distance space sensing capability of the network is improved.
Furthermore, the space-time graph convolution network is utilized to make the network more sensitive to channel information expressing different motions so as to enhance the motion characteristics.
Furthermore, four kinds of stream input data are used in data preprocessing, and finally a multi-stream fusion method is used to reinforce the characteristics, so that the action category can be predicted more accurately.
Drawings
Fig. 1 is a flowchart of an implementation of a skeleton map human behavior recognition method based on multi-stream fusion in an embodiment of the present invention.
Fig. 2 is a block diagram of a multi-stream converged space-time graph convolutional network in an embodiment of the present invention.
Fig. 3 is a network structure diagram used by each data stream in the multi-stream convergence method in the embodiment of the present invention.
FIG. 4 is a block diagram of a spatial attention module according to an embodiment of the present invention.
FIG. 5 is a block diagram of a channel attention module in an embodiment of the invention.
Fig. 6 is a multi-stream data processing diagram in an embodiment of the invention.
Fig. 7 is an effect diagram of a skeleton map human behavior recognition method based on multi-stream fusion in the embodiment of the present invention.
Detailed Description
In order to make those skilled in the art better understand the technical solutions of the present invention, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
As shown in fig. 1, a skeleton map human behavior recognition method based on multi-stream fusion comprises the following steps:
s1, extracting four different data streams from video skeleton data, connecting the obtained different data streams to obtain a data set, and dividing the data set into a training set and a test set;
the four different data streams include joint flow, skeleton flow, joint motion flow and skeleton motion flow; then carrying out data preprocessing: and superposing the joint throttling, the skeleton flow, the joint motion flow and a skeleton diagram of the skeleton motion flow, and connecting the same joints on a time dimension to form a space-time relation diagram as the input of a training model network.
S2, respectively carrying out network model training on the four different data streams to obtain four different training models, carrying out multi-stream fusion training on the four different training models to obtain a behavior recognition model, and carrying out human behavior recognition by using the behavior recognition model obtained by training.
Specifically, model training is carried out on four data streams by using a complete network based on graph convolution respectively to obtain 4 training models, the parameter settings of each data stream are kept consistent, and the network structures are the same;
In the multi-stream fusion stage, the multi-stream prediction results are added in different proportions to obtain the final prediction score and classify the behavior action; the human behavior recognition task based on the skeleton map can then be performed with the trained behavior recognition model. A sketch of the fusion step follows.
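The following is a minimal Python sketch of this score-fusion step. The fusion weights are illustrative assumptions (the patent states only that the streams' prediction results are added in different proportions), and `fuse_predictions` is a hypothetical helper name.

```python
import numpy as np

def fuse_predictions(joint, bone, joint_motion, bone_motion,
                     weights=(1.0, 1.0, 0.5, 0.5)):
    """Weighted sum of per-stream class scores.

    Each argument is a (num_samples, num_classes) score array from one
    trained stream model; `weights` are assumed fusion proportions.
    """
    streams = (joint, bone, joint_motion, bone_motion)
    fused = sum(w * s for w, s in zip(weights, streams))
    return fused.argmax(axis=-1)  # final predicted behavior category
```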
Specifically, the method and the device adopt public data extracted from video skeleton data as a data set, and divide the data set into a training set and a testing set.
Data preprocessing: as shown in fig. 6(a), the joint flow data are acquired as follows: in a skeleton diagram containing N joints and T frames, all joints in the skeleton sequence are represented by a graph node set V, namely:

$$V = \{\, v_{ti} \mid t = 1, \dots, T;\ i = 1, \dots, N \,\}$$

where each $v$ is the coordinate of a joint in three-dimensional space and may be expressed as $(x, y, z)$.

Fig. 6(b) shows the construction of the skeleton stream: for a given joint point $v_1 = (x_1, y_1, z_1)$ and a target joint point $v_2 = (x_2, y_2, z_2)$, the bone between them may be expressed as:

$$e_{v_1, v_2} = (x_2 - x_1,\; y_2 - y_1,\; z_2 - z_1)$$
the bone information therefore includes not only the coordinate position but also the orientation information. Since the number of skeleton data is one less than that of joints, in order to maintain input consistency and simplify the network processing, a hollow skeleton is added to the defined center-of-gravity joint, the value of the hollow skeleton is set to 0, and the hollow skeleton is bound to the center-of-gravity joint. This allows one-to-one correspondence of bones to joints, maintaining the same dimensionality as the joint flow in the preprocessing of the input data, so that the network can process different data streams in the same manner.
Considering that motion information is the key to behavior recognition, appearance features and implicit relationships may still be lost when spatio-temporal modeling is learned from joints and skeletons alone through the network. In the classic two-stream architecture in the field of RGB image behavior recognition, the expression of motion information is enhanced by optical flow, which serves as the input of a separate stream to reinforce the timing information in the RGB image stream. Accordingly, two data streams containing motion are added on the basis of the joint stream and the skeleton stream.
The data of the joint motion stream and the bone motion stream are shown in fig. 6(c) and 6(d). Because skeleton data are represented by joint coordinates, the motion of a joint can be represented, analogously to obtaining pixel motion information between successive frames, as the difference of the same coordinate point along the time dimension, i.e.

$$v_{ti}^{m} = v_{(t+1)i} - v_{ti}$$

Similarly, the motion change of a bone can be represented by the vector difference of the same bone in consecutive frames, i.e.

$$e_{v_1 v_2, t}^{m} = e_{v_1 v_2, t+1} - e_{v_1 v_2, t}$$

Finally, the motion information is likewise expressed as a graph with the (C, T, V) dimensions of the joint stream input, ensuring the consistency of the network structure. A sketch of this temporal-difference preprocessing follows.
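A minimal sketch of the motion streams, under the assumption that the last frame is zero-padded so the (C, T, V) shape of the static streams is preserved:

```python
import numpy as np

def to_motion_stream(data):
    """Frame-to-frame difference of a (C, T, V) joint or bone stream."""
    motion = np.zeros_like(data)
    motion[:, :-1, :] = data[:, 1:, :] - data[:, :-1, :]  # v_{t+1} - v_t
    return motion

# joint_motion = to_motion_stream(joints)  # fig. 6(c), joints as above
# bone_motion  = to_motion_stream(bones)   # fig. 6(d), bones as above
```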
As shown in fig. 2, the recognition model network adopted in the present invention first adds a Spatial-Channel Attention (SCA) module to strengthen the graph convolutional network's ability to extract and model spatio-temporal features, constructing the basic training model (the SCA-GCN network). The bone data are then preprocessed to generate the joint stream, skeleton stream, joint motion stream and skeleton motion stream as network inputs, and a training model is obtained for each. Finally the prediction results of the multiple streams are fused and output.
The network structure used by each data stream in the multi-stream fusion method is the same. Specifically, as shown in fig. 3, the preprocessed data are input into the SCA-GCN network with dimensions (N, C, T, V, M), and the input is first normalized by a Batch Normalization (BN) layer: the input information of the different data streams differs in part, and because the network shares weights across nodes, this keeps the scale of the input data consistent on different nodes. Features are then preliminarily extracted by a basic CA-GCN block that expands the channels from 3 to 64 dimensions;
a Spatial Attention (SA) module then integrates the information of the global graph nodes and the multi-frame time dimension, capturing long-distance dependencies and enhancing the feature map.
The feature map is then input into a group of 9 CA-GCN layers: the first three layers have 64 channels, the middle three 128 channels, and the last three 256 channels. Each CA-GCN layer comprises a Graph Convolutional Network (GCN), Channel Attention (CA) and a temporal convolutional network (TCN), and outputs a feature map after a residual structure and a ReLU activation function. The TCNs of the fourth and seventh layers in the overall structure are pooling layers for channel-number conversion with stride 2; the rest are convolution layers with stride 1. Finally, the obtained feature tensor is globally pooled to produce a 256-dimensional feature vector for each sequence, which is input into a Softmax classifier to obtain the output for classification prediction. The graph convolution computation process can be written as:
$$f_{out}(v_{ti}) = \sum_{v_{tj} \in B(v_{ti})} \frac{1}{Z_{ti}(v_{tj})}\, f_{in}\big(p(v_{ti}, v_{tj})\big)\, w\big(l_{ti}(v_{tj})\big)$$

where $B(v_{ti}) = \{\, v_{tj} \mid d(v_{tj}, v_{ti}) \le D \,\}$ denotes the set of neighboring joint points, $d$ denotes the shortest distance between two nodes, and $D$ is the distance limit for taking neighboring nodes.

In this method $D = 1$, i.e. the set of joint points at adjacency distance 1 is taken. The sampling function is denoted $p(v_{ti}, v_{tj}) = v_{tj}$, and $Z_{ti}(v_{tj}) = |\{\, v_{tk} \mid l_{ti}(v_{tk}) = l_{ti}(v_{tj}) \,\}|$ is a normalization term that represents the cardinality of the corresponding subset and controls the influence of the different subsets on the output. Here $l_{ti}: B(v_{ti}) \to \{0, \dots, K-1\}$ denotes a mapping label that divides the nodes in the neighborhood into a fixed number $K$ of subsets, so the weighting function can be expressed as $w(v_{ti}, v_{tj}) = w'(l_{ti}(v_{tj}))$. For the partitioning scheme, the subsets are divided according to the distance from each node to the skeleton's center of gravity, into the root node itself, the set of nodes closer to the center of gravity than the root node, and the set of nodes farther from the center of gravity than the root node, as shown in the following formula:

$$l_{ti}(v_{tj}) = \begin{cases} 0, & r_j = r_i \\ 1, & r_j < r_i \\ 2, & r_j > r_i \end{cases}$$

where $r_i$ represents the average distance from the skeleton's center of gravity to joint $i$ over all frames in the training dataset. Substituting the neighborhood set, sampling function, weighting function, normalization term and partition labels into the graph convolution formula yields the output result of the GCN module; a sketch of this partitioning follows.
Because the connection relationships between the same nodes of consecutive frames are already included in the construction of the input graph, time-domain features can be further acquired after the graph convolution through a similar convolution operation. The concept of neighborhood is first expanded, extending the graph from spatial to spatio-temporal connections, as represented by:

$$B(v_{ti}) = \{\, v_{qj} \mid d(v_{tj}, v_{ti}) \le K,\ |q - t| \le \Gamma/2 \,\}$$

where $\Gamma$ represents the size of the time-domain kernel and controls the time range of the neighboring frame skeleton maps. The sampling function is the same as that used in the GCN, and a certain modification of the label mapping in the weighting function is needed to complete the spatio-temporal graph convolution. Since adjacent frames are temporally ordered, the spatio-temporal neighborhood of the root node is expressed directly as:

$$l_{ST}(v_{qj}) = l_{ti}(v_{tj}) + (q - t + \Gamma/2) \times K$$

where $l_{ti}(v_{tj})$ is the single-frame label mapping. This allows the convolution operation to be carried out properly on the spatio-temporal graph, obtaining the node relationships in space and time and the spatio-temporal feature map.
The final multidimensional spatio-temporal graph convolutional network can be simplified and expressed as the following formula:

$$\mathbf{f}_{out} = \sum_{j} \Lambda_j^{-\frac{1}{2}} A_j \Lambda_j^{-\frac{1}{2}} \mathbf{f}_{in} W_j$$

where $A_j$ is the adjacency matrix of partition $j$, representing the natural connections between joints, and $\sum_j A_j$ represents the sum of the intra-body connections and self-connections between joints; $\Lambda_j^{ii} = \sum_k A_j^{ik} + \alpha$ is the corresponding degree matrix ($\alpha$ a small constant preventing empty rows); $W_j$ is the weight matrix formed by stacking the weight vectors of the multiple output channels, with the input features simplified to tensors of shape (C, V, T);

a learnable weight matrix $M$ is also attached to each adjacency matrix and initialized to an all-ones matrix. A sketch of this layer follows.
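A minimal PyTorch sketch of this partitioned graph convolution is given below. For brevity it uses the simpler row normalization $\Lambda_j^{-1} A_j$ rather than the symmetric $\Lambda_j^{-1/2} A_j \Lambda_j^{-1/2}$, and realizes $W_j$ as a 1 × 1 convolution; the learnable mask $M$ is initialized to all ones as described:

```python
import torch
import torch.nn as nn

class GraphConv(nn.Module):
    """f_out = sum_j norm(A_j * M_j) f_in W_j over K partitions."""
    def __init__(self, in_channels, out_channels, A):
        super().__init__()                  # A: (K, V, V) partition matrices
        self.register_buffer('A', A.float())
        self.M = nn.Parameter(torch.ones_like(self.A))   # learnable mask
        self.K = A.size(0)
        self.conv = nn.Conv2d(in_channels, out_channels * self.K, 1)

    def forward(self, x):                   # x: (N, C, T, V)
        A = self.A * self.M
        A = A / A.sum(dim=-1, keepdim=True).clamp(min=1e-6)  # row-normalise
        x = self.conv(x)                    # (N, K*C_out, T, V)
        n, kc, t, v = x.shape
        x = x.view(n, self.K, kc // self.K, t, v)
        return torch.einsum('nkctv,kvw->nctw', x, A)  # sum over partitions
```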
As shown in fig. 4, the spatial attention module aggregates features along two different directions according to attention: the V dimension captures remote dependencies along the node direction to establish connections between global nodes, while the T dimension retains accurate node position information along the time direction. The generated feature maps are then encoded separately, producing direction-aware and position-sensitive attention maps; after activation these are applied to the input feature map, enhancing the perception of global spatial information in the features so that regions of interest can be captured and expressed accurately. The specific operation comprises two steps. The first is an information aggregation module: so that the attention module can capture accurate joint position information and long-range spatial interactions, the global pooling

$$z_c = \frac{1}{V \times T} \sum_{i=1}^{V} \sum_{j=1}^{T} x_c(i, j)$$

is decomposed, converting the two-dimensional GAP operation into operations on one-dimensional features. Given an input $X$, each channel is encoded along the two coordinates, space and time, using pooling kernels of size $(V, 1)$ and $(1, T)$ respectively, yielding outputs of shape $(C \times V \times 1)$ and $(C \times 1 \times T)$. The output of channel $c$ at node $v$ and the output of channel $c$ at time $t$ can be represented as

$$z_c^{v}(v) = \frac{1}{T} \sum_{0 \le t < T} x_c(v, t) \qquad \text{and} \qquad z_c^{t}(t) = \frac{1}{V} \sum_{0 \le v < V} x_c(v, t).$$

This operation aggregates features along the two directions separately, following spatial position and temporal relationship, to obtain a pair of feature maps.
The second step is an attention generation module. After obtaining the feature maps with a global receptive field and encoded accurate position information from the information aggregation module, the two groups of vectors are concatenated, and a 1 × 1 convolutional transform function $F_1$ carries out transformation and dimensionality-reduction operations to obtain the feature mapping, as shown in the following formula:

$$f = \delta\big(F_1([z^{v}, z^{t}])\big)$$

where $[\cdot, \cdot]$ denotes the Concat operation along the spatial dimension and $\delta$ is a nonlinear activation function. $f$ is the intermediate feature mapping encoding along node information and time information, with dimensions $\mathbb{R}^{C/r \times (V+T)}$, where $r$ is the channel reduction ratio used to reduce the number of channels and improve computational efficiency; in this method $r = 32$. $f$ is then decomposed by normalization and nonlinearity into two separate tensors $f^{v} \in \mathbb{R}^{C/r \times V}$ and $f^{t} \in \mathbb{R}^{C/r \times T}$, and two 1 × 1 convolutional transforms $F_v$ and $F_t$ expand each back to tensors with the original number of channels, giving $g^{v} = \sigma(F_v(f^{v}))$ and $g^{t} = \sigma(F_t(f^{t}))$, where $\sigma$ is the sigmoid activation function. The expanded $g^{v}$ and $g^{t}$ are the attention weights; finally the weights are multiplied with the input and combined through a residual structure to obtain the output of the module, as shown in the following formula:

$$y_c(i, j) = x_c(i, j) \times g_c^{v}(i) \times g_c^{t}(j)$$

The resulting output $y_c(i, j)$ keeps the same dimensions as the input, while the attention $g^{v}$ generated over the node dimension represents spatial global information and the attention $g^{t}$ generated over the frame-number dimension represents temporal transformation information, enhancing the feature vector's attention to spatial information. A minimal sketch of this module follows.
As shown in fig. 5, a channel attention CA module is added after the GCN extracts spatial features. The CA module models the dependency relationships between channels and adaptively adjusts the feature expression weights between channels, thereby enhancing the channels beneficial to motion recognition and suppressing other, unrelated channels. The channel attention module broadly follows the design of the SE module, but changes its Global Average Pooling (GAP) to a Discrete Cosine Transform (DCT): it has been proved that GAP is a special case of the two-dimensional DCT, the result obtained by GAP being proportional to the lowest-frequency component of the two-dimensional DCT. The present channel attention module uses more frequency components to introduce more information. First, the input $X$ is divided into $n$ blocks along the channel dimension, denoted $[X^0, X^1, \dots, X^{n-1}]$, where each $X^i \in \mathbb{R}^{C' \times V \times T}$, $i \in \{0, 1, \dots, n-1\}$, and $C' = C/n$.

Each block is then assigned a two-dimensional DCT component; the basis function of the 2D DCT is written

$$B_{v,t}^{j,k} = \cos\!\left(\frac{\pi j}{V}\Big(v + \frac{1}{2}\Big)\right)\cos\!\left(\frac{\pi k}{T}\Big(t + \frac{1}{2}\Big)\right)$$

and the output result for each block is given by:

$$\mathrm{Freq}^{i} = \sum_{v=0}^{V-1} \sum_{t=0}^{T-1} X_{:, v, t}^{i}\, B_{v,t}^{j_i, k_i}$$

where $[j, k]$ is the component index of the 2D DCT and $i \in \{0, 1, \dots, n-1\}$, so that each block obtains a different frequency component; the concatenated output $\mathrm{Freq} \in \mathbb{R}^{C}$ is a multispectral vector. This vector is sent to a fully connected layer FC to learn an attention map, and the attention finally obtained by the multispectral channel attention module is:

$$ms\text{-}att = \mathrm{sigmoid}\big(\mathrm{fc}(\mathrm{Freq})\big)$$

Finally, the activation values on the channels are multiplied with the original features so that the learned per-channel weight coefficients let the model better represent the enhanced channel features, and a residual connection adds the channel-attended features to the original input, yielding an output tensor of the same dimensions. A minimal sketch of this module follows.
Finally, the output of the network-trained model is passed to a softmax classifier, which outputs the predicted action category. With the trained recognition model, the human behavior recognition task based on a skeleton map can be performed, as shown in fig. 7.
The skeleton map human behavior recognition method based on multi-stream fusion of the invention, applied to a dataset obtained by extracting and preprocessing video skeleton data, addresses the problems that, in skeleton point behavior recognition scenes, the graph convolution network cannot capture long-distance dependencies, action-sensitive channel attention is difficult to obtain, and a single data input provides insufficient information.
A spatial attention SA module is introduced into the training model network framework adopted by the invention; it encodes long-term dependencies between nodes through the accurate position information of the nodes, so that each node can be linked with other spatio-temporal nodes, improving the network's long-distance spatial perception.
A channel attention CA module is added to the spatio-temporal graph convolutional network, making the network more sensitive to the channel information that expresses different motions and thus enhancing the motion features.
Four preprocessing methods are used to generate the four-stream input data, and a multi-stream fusion method is finally used so that the model outputs reinforce one another and the action category is predicted more accurately.
Examples
A skeleton map human behavior recognition method based on multi-stream fusion comprises the following steps:
s1, extracting four different data streams from video skeleton data: joint flow, bone flow, joint motion flow and bone motion flow. The specific working process is as follows:
(1.1) adopting a group of public skeleton point data sets and a group of public RGB data sets;
(1.2) for the RGB dataset, the 18 joint points of the human body in each frame are extracted using the public OpenPose toolbox, represented by coordinates, and integrated into a data format conforming to the network input;
and (1.3) the bone data of step (1.1) and step (1.2) are divided into four different data streams using four different preprocessing methods: the joint stream, bone stream, joint motion stream and bone motion stream, as shown in fig. 6.
And S2, processing the acquired different data streams, and dividing a training set and a test set. The specific working process is as follows:
(2.1) For the bone point data streams in (1.3), a spatial configuration partitioning strategy is used. The skeleton's center of gravity is marked first, and the neighborhood of each node is then divided into 3 subsets: the node itself (green, subset 0), the set of nodes closer to the skeleton's center of gravity than the node itself (blue, subset 1), and the set of nodes farther from the skeleton's center of gravity than the node itself (yellow, subset 2). This partition rule lets the behavior recognition task pay more attention to the movement of the nodes, and the convolution kernel can obtain richer and more accurate node features;
(2.2) the processed dataset is randomly split into a training set and a test set at a ratio of 4:1, and random translation and scaling are applied to the dataset for expansion and data enhancement, as sketched below;
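A minimal sketch of this augmentation step; the shift and scale ranges are illustrative assumptions, since the patent states only that random translation and scaling are applied:

```python
import numpy as np

def augment(sample, max_shift=0.1, scale_range=(0.9, 1.1)):
    """sample: (3, T, V, M) skeleton data; returns an augmented copy."""
    scale = np.random.uniform(*scale_range)
    shift = np.random.uniform(-max_shift, max_shift, size=(3, 1, 1, 1))
    return sample * scale + shift  # scale about the origin, then translate
```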
and S3, inputting the data of different streams into the time-space graph convolution network respectively, as shown in the figure 2 and the figure 3. The specific working process is as follows:
(3.1) inputting the data preprocessed in the step (2.2) into the SCA-GCN network according to the dimension of (N, C, T, V, M), and firstly, inputting the data to be normalized through a Batch Normalization layer (BN);
(3.2) after step (3.1), features are preliminarily extracted by the basic CA-GCN block, expanding the channels from 3 to 64 dimensions;
and S4, enhancing the feature extraction capability of the network by using a space attention module and a channel attention module, as shown in FIGS. 4 and 5. The specific working process is as follows:
(4.1) after the step (3.1), integrating the global graph nodes and the information of the multi-frame time dimension through a space attention module by the feature graph, capturing long-distance dependence, and enhancing the feature graph;
(4.2) each CA-GCN layer in (3.2) comprises a graph convolution GCN, channel attention CA and a time-domain convolution TCN layer: the GCN is responsible for spatial graph convolution, the TCN is a time-domain-dimension convolutional neural network, and the CA module multiplies the activation values on each channel with the original features to obtain the weight coefficients of each channel, so that the enhanced channel features can be better represented by the model.
And S5, a behavior recognition model is trained through stacking of a multilayer network structure. The specific working process is as follows:
and (5.1) after (4.1), the feature map is sent into the 9-layer CA-GCN group to extract features; the first three layers have 64 channels, the middle three layers 128 channels, and the last three layers 256 channels.
and (5.2) the feature tensors obtained in step (5.1) are globally pooled to obtain a 256-dimensional feature vector for each sequence, which is input into a Softmax classifier to obtain the output for classification prediction; the overall per-stream stack is sketched below.
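For orientation, a minimal sketch of the overall per-stream stack is given below. `CAGCNBlock` is a hypothetical module assumed to bundle GCN + channel attention + TCN with a residual connection, and `SpatialAttention` is the SA sketch given earlier; the reshape conventions around the (N, C, T, V, M) input follow common practice and are assumptions:

```python
import torch.nn as nn

class SCAGCN(nn.Module):
    def __init__(self, num_classes, A, in_ch=3, num_joints=25, num_people=2):
        super().__init__()
        self.bn = nn.BatchNorm1d(num_people * num_joints * in_ch)
        self.stem = CAGCNBlock(in_ch, 64, A)             # 3 -> 64 channels
        self.sa = SpatialAttention(64)                   # spatial attention
        chans = [64, 64, 64, 128, 128, 128, 256, 256, 256]
        blocks, prev = [], 64
        for i, c in enumerate(chans):
            stride = 2 if i in (3, 6) else 1             # layers 4 and 7: stride 2
            blocks.append(CAGCNBlock(prev, c, A, stride=stride))
            prev = c
        self.blocks = nn.ModuleList(blocks)
        self.fc = nn.Linear(256, num_classes)

    def forward(self, x):                                # x: (N, C, T, V, M)
        N, C, T, V, M = x.shape
        x = self.bn(x.permute(0, 4, 3, 1, 2).reshape(N, M * V * C, T))
        x = x.view(N, M, V, C, T).permute(0, 1, 3, 4, 2).reshape(N * M, C, T, V)
        x = self.sa(self.stem(x))
        for blk in self.blocks:
            x = blk(x)
        x = x.mean(dim=(2, 3)).view(N, M, -1).mean(dim=1)  # global pooling
        return self.fc(x)                                # logits for Softmax
```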
S6, fusing model results of the four-stream training, as shown in FIG. 2, and obtaining a final output result;
and S7, regarding the trained multi-stream fusion human behavior recognition model, taking the test image as input to obtain a behavior recognition result, as shown in FIG. 7. The specific working process is as follows:
(7.1) regarding the multi-stream fusion behavior recognition model in the step S6, taking the test set in the step (2.2) as an input to obtain a model classification result;
(7.2) comparing the behavior prediction results of the multi-stream fusion behavior recognition model in step (7.1) with the actual action labels shows that the model achieves an excellent recognition effect, with outstanding accuracy on both datasets, as shown in fig. 7.

Claims (10)

1. A skeleton map human behavior recognition method based on multi-stream fusion is characterized by comprising the following steps:
s1, extracting four different data streams from the video skeleton data;
and S2, respectively carrying out network model training on the four different data streams to obtain four different training models, carrying out multi-stream fusion training on the four different training models to obtain a behavior recognition model, and carrying out human behavior recognition by using the behavior recognition model obtained by training.
2. The skeleton map human behavior recognition method based on multi-stream fusion according to claim 1, wherein the four different data streams comprise a joint stream, a skeleton stream, a joint motion stream and a skeleton motion stream.
3. The skeleton map human behavior recognition method based on multi-stream fusion according to claim 2, wherein the skeleton diagrams of the joint stream, skeleton stream, joint motion stream and skeleton motion stream are superposed, and the same joints are connected in the time dimension to form a spatio-temporal relationship diagram as the input of the training model network.
4. The skeleton map human behavior recognition method based on multi-stream fusion as claimed in claim 3, wherein network structures of network models used for training four data streams are the same, and parameter settings of each data stream are kept consistent.
5. The skeleton map human behavior recognition method based on multi-stream fusion as claimed in claim 1, wherein the network model for training four data streams comprises a spatial attention module, a map convolution module, a channel attention module and a time domain convolution module;
the spatial attention module encodes long-term dependence between nodes through accurate position information of the nodes;
the graph convolution module is used for performing convolution operation on the bone graph to generate a corresponding feature graph;
the channel attention module adaptively adjusts the feature expression weight among the channels through modeling the interdependence relation among the channels;
the time domain convolution module is a convolution neural network of time domain dimension and is used for channel number transformation.
6. The skeleton map human behavior recognition method based on multi-stream fusion according to claim 2, wherein in a skeleton map comprising N joints and T frames, all joints in the skeleton sequence are represented by a graph node set V:

$$V = \{\, v_{ti} \mid t = 1, \dots, T;\ i = 1, \dots, N \,\}$$

where each $v$ is the coordinate of a joint in three-dimensional space, represented as $(x, y, z)$;

for a given joint point $v_1 = (x_1, y_1, z_1)$ and a target joint point $v_2 = (x_2, y_2, z_2)$, the bone between them is represented by the following formula:

$$e_{v_1, v_2} = (x_2 - x_1,\; y_2 - y_1,\; z_2 - z_1).$$
7. The skeleton map human behavior recognition method based on multi-stream fusion according to claim 5, wherein the multidimensional spatio-temporal graph convolutional network can be simplified as:

$$\mathbf{f}_{out} = \sum_{j} \Lambda_j^{-\frac{1}{2}} A_j \Lambda_j^{-\frac{1}{2}} \mathbf{f}_{in} W_j$$

where $A_j$ is the adjacency matrix of partition $j$, representing the natural connections between joints, and $\sum_j A_j$ represents the sum of the intra-body connections and self-connections between joints; $\Lambda_j^{ii} = \sum_k A_j^{ik} + \alpha$ is the corresponding degree matrix; and $W_j$ is the weight matrix formed by stacking the weight vectors of the plurality of output channels.
8. The skeleton map human behavior recognition method based on multi-stream fusion according to claim 3, wherein the obtained spatio-temporal relationship diagram is input into the training model with dimensions (N, C, T, V, M).
9. The skeleton map human behavior recognition method based on multi-stream fusion according to claim 5, wherein the global pooling

$$z_c = \frac{1}{V \times T} \sum_{i=1}^{V} \sum_{j=1}^{T} x_c(i, j)$$

is decomposed so that the two-dimensional GAP operation is converted into operations on one-dimensional features.
10. A skeleton map human behavior recognition system based on multi-stream fusion is characterized by comprising a preprocessing module and a recognition module;
the preprocessing module is used for extracting four different data streams from video skeleton data, superposing a skeleton diagram of a joint stream, a skeleton stream, a joint motion stream and the skeleton diagram of the skeleton motion stream, connecting the same joint on a time dimension to form a space-time relation diagram as input of a training model network to obtain four different training models, performing multi-stream fusion training on the four different training models to obtain a behavior recognition model, storing the behavior recognition model into the recognition module, and recognizing human body behaviors by using the trained behavior model.
CN202210505198.3A (priority date 2022-05-10, filing date 2022-05-10) - Skeleton map human behavior identification method and system based on multi-stream fusion - Pending - CN114708665A (en)

Priority Applications (1)

Application Number: CN202210505198.3A
Priority Date / Filing Date: 2022-05-10 / 2022-05-10
Title: Skeleton map human behavior identification method and system based on multi-stream fusion


Publications (1)

Publication Number: CN114708665A
Publication Date: 2022-07-05

Family

ID=82177177


Country Status (1)

CN: CN114708665A (en)


Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116012950A (en) * 2023-02-15 2023-04-25 杭州电子科技大学信息工程学院 Skeleton action recognition method based on multi-heart space-time attention pattern convolution network
CN116012950B (en) * 2023-02-15 2023-06-30 杭州电子科技大学信息工程学院 Skeleton action recognition method based on multi-heart space-time attention pattern convolution network
CN116524601A (en) * 2023-06-21 2023-08-01 深圳市金大智能创新科技有限公司 Self-adaptive multi-stage human behavior recognition model for assisting in monitoring of pension robot
CN116524601B (en) * 2023-06-21 2023-09-12 深圳市金大智能创新科技有限公司 Self-adaptive multi-stage human behavior recognition model for assisting in monitoring of pension robot
CN117437392A (en) * 2023-12-15 2024-01-23 杭州锐健医疗科技有限公司 Cruciate ligament dead center marker and model training method and arthroscope system thereof
CN117437392B (en) * 2023-12-15 2024-03-26 杭州锐健医疗科技有限公司 Cruciate ligament dead center marker and model training method and arthroscope system thereof
CN117851920A (en) * 2024-03-07 2024-04-09 国网山东省电力公司信息通信公司 Power Internet of things data anomaly detection method and system


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination