CN111967379A - Human behavior recognition method based on RGB video and skeleton sequence - Google Patents

Human behavior recognition method based on RGB video and skeleton sequence Download PDF

Info

Publication number
CN111967379A
Authority
CN
China
Prior art keywords
local
decision
video
feature
skeleton
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010821378.3A
Other languages
Chinese (zh)
Other versions
CN111967379B (en)
Inventor
曹聪琦
李嘉康
李亚娟
张艳宁
郗润平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northwestern Polytechnical University
Original Assignee
Northwestern Polytechnical University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northwestern Polytechnical University filed Critical Northwestern Polytechnical University
Priority to CN202010821378.3A priority Critical patent/CN111967379B/en
Publication of CN111967379A publication Critical patent/CN111967379A/en
Application granted granted Critical
Publication of CN111967379B publication Critical patent/CN111967379B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/56 Extraction of image or video features relating to colour

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Molecular Biology (AREA)
  • Multimedia (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a human behavior recognition method based on RGB (red, green, blue) video and a skeleton sequence, belonging to the technical field of computer vision and pattern recognition, and comprising the following contents: first, the feature stream performs feature extraction on an input video clip to obtain a spatio-temporal feature map; second, the attention stream generates a skeleton-region heat map; third, spatio-temporal features of the skeleton regions are extracted through a bilinear operation; fourth, a local decision block is used to generate local decision results; and fifth, a decision fusion block fuses the local decision results to obtain a global decision result. The invention realizes decision fusion with two plug-and-play modules, a local decision block and a decision fusion block: the local decision block makes a separate decision on the spatio-temporal features of each key region, and the decision fusion block fuses all decision results to obtain the final decision result. The invention effectively improves the accuracy of behavior recognition on the Penn Action and NTU RGB+D datasets.

Description

Human behavior recognition method based on RGB video and skeleton sequence
Technical Field
The invention relates to the technical field of computer vision and pattern recognition, in particular to a human behavior recognition method based on RGB (red, green and blue) videos and a skeleton sequence.
Background
Human behavior recognition, a fundamental problem in computer vision, has attracted a great deal of attention in academia and industry. With the continuous development of intelligent computing technology, human action recognition has broad application prospects in daily life, for example intelligent surveillance, human-computer interaction, motion-sensing games and video retrieval. Human behavior recognition in video faces problems similar to object recognition in still images: both must deal with significant intra-class variation, background clutter and occlusion. However, video carries an additional temporal cue that images lack, and capturing this temporal information is a major difficulty.
There are two main ways of applying a Convolutional Neural Network (CNN) to video data. One is to apply an image-based 2D CNN directly to each frame of the video, but this only captures the visual appearance of the video. The other is a 3D CNN, whose convolution kernels are three-dimensional and can extract both spatial and temporal information, but the number of network parameters increases dramatically, which easily leads to overfitting.
The attention mechanism mimics the internal process of biological observation, i.e., a mechanism that aligns internal experience with external perception to increase the fineness of observation of a local region. An attention mechanism can obtain more detailed information about the object of interest and suppress other, useless information. Most networks extract key features using an attention mechanism, fuse these features into a global feature descriptor, and finally apply a global classifier to obtain the classification result. This feature-fusion approach has the following problems: 1. there are gaps between different feature spaces; 2. the fused global descriptor is high-dimensional, which requires more parameters for classification and easily leads to overfitting; 3. some behavior predictions need to comprehensively consider the decision results of multiple parts, such as state changes of objects and context. These problems seriously affect the performance of human behavior recognition.
Disclosure of Invention
Technical problem to be solved
In order to avoid the defects of the prior art, the invention provides a human body behavior identification method based on RGB video and a skeleton sequence.
Technical scheme
A human behavior recognition method based on RGB video and a skeleton sequence, adopting LD-Net, which comprises two streams, a feature stream and an attention stream, and two modules, a local decision block and a decision fusion block; the method is characterized by comprising the following steps:
step 1: the human behavior data set comprises two parts, video data and human skeleton position data; the data set to be processed is divided into a training set and a test set; let the video set of the training set and the test set be I = {I_1, I_2, ..., I_i, ..., I_V}, wherein V denotes the total number of videos in the training set and the test set and I_i denotes the ith video; let the set of video lengths of the training set and the test set be F = {F_1, F_2, ..., F_i, ..., F_V}, wherein F_i denotes the length of the ith video; let the set of human skeletons of the training set and the test set be J = {J_1, J_2, ..., J_i, ..., J_V}, wherein J_i denotes the set of human skeleton points corresponding to the ith video and has dimension F_i × Z × 2, F_i denoting the length of the ith video, Z denoting the number of human skeleton points in each frame, and 2 denoting the abscissa and the ordinate of each skeleton point position;
step 2: pre-allocating initial labels for all videos in the training set and the test set, defining the total number of behavior categories as K, and setting the initial label set as {y_i = k | 1 ≤ k ≤ K}, wherein y_i denotes the initial label of video I_i, i = 1, 2, ..., V;
step 3: preprocessing the training set and the test set, wherein the number of frames sent by each video to the network is S; the S frames of video data are scaled, randomly cropped and mean-value normalized to obtain video data of dimension S × 3 × 224 × 224 and human skeleton position data of dimension S × Z × 2;
step 4: sending the video data of dimension S × 3 × 224 × 224 into the feature stream, which uses MF-Net for feature extraction to obtain spatio-temporal feature maps after the MF-Net Conv4 and Conv5 blocks; the spatio-temporal feature map dimensions are denoted C, L, H, W, namely the number of channels, the length, the height and the width;
step 5: sending the human skeleton position data of dimension S × Z × 2 into the attention stream to obtain a heat map of the input video clip, whose dimensions are denoted M × L × H × W, namely the number of channels, the length, the height and the width; wherein M = N × L, N denotes the number of body joints in each frame and L denotes the length of the video clip; the heat map is realized by activating the points corresponding to the skeleton points, which is equivalent to assigning hard weights;
step 6: adjusting the heat map of size M × L × H × W into a 2D matrix A having M rows and L·H·W columns, and adjusting the C × L × H × W feature map derived from MF-Net into a 2D matrix B having C rows and L·H·W columns; the bilinear product is then represented as:
X = AB^T
wherein B^T is the transpose of B, the matrix X is the set of all skeleton point features, and its dimension is M × C;
step 7: sending the set X of skeleton point features obtained in step 6 into the local decision block, wherein the local decision block makes decisions on the spatio-temporal features of all local skeleton points through a fully connected layer to obtain local behavior classification results; the specific steps are as follows:
defining the total number of behavior classes as K, the matrix X obtained by the attention mechanism can be represented as:
X = [x_1; x_2; ...; x_i; ...; x_M], x_i ∈ R^(C×1)
wherein x_i represents the feature of the ith skeleton point and can be regarded as a local feature description of the target;
respectively training a linear classifier for each human skeleton point feature:
f_θi(x_i) = θ_i x_i + b_i
x_i ∈ R^(C×1), θ_i ∈ R^(K×C), b_i ∈ R, i ∈ [1, M]
wherein f_θi denotes the ith linear classifier with parameter θ_i, b_i may be included in θ_i, and F_θ denotes the set of linear classifiers;
according to F_θ, a set of decisions D = [d_1; d_2; ...; d_i; ...; d_M] is derived, expressed as:
d_i = f_θi(x_i), i ∈ [1, M]
wherein d_i represents the decision result of the ith skeleton point feature;
step 8: the decision fusion block fuses all local decisions obtained by the local decision block to obtain the final decision result; the specific steps are as follows:
generating a corresponding weight according to the conditioned-on-current-local-patch criterion, wherein the weight represents the importance degree of the corresponding local feature; a linear mapping g_β is used to implement this function, and the weights are then normalized by the sigmoid function, denoted by W:
W = [w_1; w_2; ...; w_i; ...; w_M],
w_i = sigmoid(g_β(x_i)) = 1 / (1 + e^(-g_β(x_i))),
g_β(x_i) = β x_i + b, x_i ∈ R^(C×1), β ∈ R^(1×C), b ∈ R, i ∈ [1, M]
wherein β and b are the parameters of the linear mapping g_β;
obtaining a global decision result according to the generated weights:
D = (1/M) Σ_{i=1..M} w_i d_i
locally supervising the local decision results obtained by the local decision block in step 7, wherein the loss function is as follows:
L_l = -(1/(U·M)) Σ_{i=1..U} Σ_{m=1..M} Σ_{j=1..K} y_ij log(d^m_ij)
wherein U represents the sample size, M represents the total number of skeleton points of an input video segment, y_ij represents the label at the jth position of the ith sample, and d^m_ij represents the value at the jth position of the ith sample in the decision result of the mth skeleton point feature;
globally supervising the global decision result obtained in step 8, wherein the loss function is as follows:
L_g = -(1/U) Σ_{i=1..U} Σ_{j=1..K} y_ij log(ŷ_ij)
wherein U represents the sample size, K represents the total number of behavior classes, y_ij represents the label at the jth position of the ith sample, and ŷ_ij represents the prediction result;
supervising both the local and the global decisions, wherein the loss function is as follows:
L = L_g + L_l
the initial learning rate of training is set to 0.005 and is multiplied by 0.1 after 20, 40 and 60 epochs, and a stochastic gradient descent (SGD) optimizer is used during training.
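For illustration only, the training schedule just described could be configured as in the following PyTorch-style sketch; the placeholder model, the momentum value and the number of epochs are assumptions and are not specified in the patent text:

import torch

# Placeholder module standing in for LD-Net; the real network is defined elsewhere.
ld_net = torch.nn.Linear(10, 5)

# SGD optimizer with the initial learning rate stated in the text (0.005);
# the momentum value 0.9 is an assumed, commonly used setting.
optimizer = torch.optim.SGD(ld_net.parameters(), lr=0.005, momentum=0.9)

# Multiply the learning rate by 0.1 after 20, 40 and 60 epochs.
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[20, 40, 60], gamma=0.1)

for epoch in range(80):        # 80 epochs is an illustrative choice
    # ... one full training pass (forward, loss, backward, optimizer.step()) goes here ...
    scheduler.step()           # learning-rate decay applied once per epoch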
Advantageous effects
The human behavior recognition method based on the RGB video and the skeleton sequence has the following beneficial effects:
(1) Existing networks fuse the features extracted by the attention mechanism to describe the current behavior and train a global classifier. The main problem of this feature-fusion approach is the gap between different feature spaces. In addition, the fused global descriptor is high-dimensional, which requires more parameters for classification and easily leads to overfitting. The decision-fusion method provided by the invention solves these problems: on the one hand, it aggregates decisions that have been projected into the same space; on the other hand, because the local features are low-dimensional and the classifiers can share parameters, the number of parameters is reduced. Decision fusion is also supported theoretically and experimentally in statistics and machine learning: ensemble methods can combine single classifiers to achieve better performance than any individual classifier.
(2) The invention provides a plug-and-play local decision-fusion structure for human behavior recognition, comprising a local decision block and a decision fusion block. The local decision block makes decisions based on the local spatio-temporal features to obtain local decisions, and the decision fusion block fuses all local decision results to obtain the final decision result. This structure makes full use of the local spatio-temporal features and fully considers the influence of local decisions on the recognition result, thereby effectively improving behavior recognition.
(3) The invention allows supervision to be added to both the local and the global decisions. The two supervision modes complement each other and facilitate the training of the model.
Drawings
FIG. 1 is a schematic view of the overall system flow of the present invention
FIG. 2 is an overall structure diagram of LD-Net proposed in the present invention
FIG. 3 is an MF-Net network framework for use in the present invention
FIG. 4 is a schematic view of the attention mechanism
FIG. 5 is a skeletal weight heat map for different behaviors
FIG. 6 is a comparison of confusion matrices on the Penn Action dataset for MF-Net and the present invention
FIG. 7 is an embodiment of skeletal weight in concrete behavior
Detailed Description
The invention will now be further described with reference to the following examples and drawings:
the invention provides a two-stream network structure which comprises two modules, namely a local decision block and a decision fusion block, and the two modules are called LD-Net. One LD-Net stream is a feature stream, and a Multi-Fiber network (MF-Net) is selected to extract the space-time features of the video segments. As the MF-Net is a multi-fiber structure network, the parameter quantity of the three-dimensional network can be effectively reduced, and overfitting is avoided. The MF-Net network framework is shown in fig. 3. The other stream is an attention stream, and the corresponding positions of human skeletal points are taken as attention (attention) areas. Because the skeleton point information reflects the posture characteristics of the human body, and simultaneously, useless information about the target is greatly eliminated. For the extracted key regional characteristics, the extracted key regional characteristics are not directly fused and then are decided, but are fused locally. The invention realizes Decision fusion by using two plug-and-play modules, namely the local Decision block and the Decision fusion block. And (4) the local decision block makes decisions on the space-time characteristics of each key region respectively. And fusing all Decision results by the Decision fusion block to obtain a final Decision result. The invention effectively improves the accuracy of behavior recognition on the Penn Action and NTU RGB + D data sets.
The technical scheme of the invention comprises seven stages. In the first stage, the feature stream performs feature extraction on the input video clip to obtain a spatio-temporal feature map. In the second stage, the attention stream generates the skeleton-region heat map. In the third stage, spatio-temporal features of the skeleton regions are extracted through a bilinear operation. In the fourth stage, the local decision block is used to generate local decision results. In the fifth stage, the decision fusion block fuses the local decision results to obtain a global decision result. The sixth stage concerns the loss functions used to train the network. The seventh stage is the analysis of the experimental results. The specific steps are as follows:
1) feature stream performs Feature extraction on video clips to obtain a space-time Feature map
The feature stream uses the MF-Net network framework, which effectively reduces the number of parameters by using a multi-fiber structure and multiplexer modules. The invention uses MF-Net to obtain the spatio-temporal feature map of the input data. The details are as follows:
(a) Data preprocessing. The invention saves the video data as pictures, which reduces the time the network spends reading data. When the frames are saved as pictures, each frame is scaled as a whole so that its shorter side becomes 256 pixels.
(b) Ensuring the robustness of the trained model. The training set is augmented by mirroring, and in each training iteration image patches of size S × 3 × 224 × 224 are randomly cropped from the data, where S is the number of frames in the video segment.
(c) The spatio-temporal feature map dimensions produced by the MF-Net network are denoted C, L, H, W, namely the number of channels, the length, the height and the width. The invention obtains spatio-temporal feature maps after the MF-Net Conv4 and Conv5 blocks, respectively.
2) Attention stream generation of skeletal region heat maps
The attention stream obtains a heat map for each skeleton point according to the position information of the human skeleton. The specific details are as follows: the invention obtains the skeleton-region heat map of the input video clip from the annotated skeleton positions, and the heat map has the same spatio-temporal dimensions as the spatio-temporal feature map. Its dimensions are denoted M, L, H, W, namely the number of channels, the length, the height and the width, where M = N × L, N denotes the number of body joints in each frame and L denotes the length of the video segment. Joint-guided feature selection is realized by activating the heat map at the points corresponding to the body joints, which is equivalent to assigning hard weights: a weight of 1 is given to the positions corresponding to skeleton points, and a weight of 0 to all other positions. This step produces two heat maps whose dimensions equal those of the spatio-temporal feature maps after the MF-Net Conv4 and Conv5 blocks, respectively.
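A minimal sketch of this hard-weighted heat-map construction is given below; it is an illustrative NumPy example, and the tensor names, the channel ordering (joint index times frame index) and the rounding of joint coordinates to pixel indices are assumptions rather than details fixed by the text:

import numpy as np

def skeleton_heatmaps(joints, L, H, W):
    """Build hard-weighted heat maps from 2D joint positions.

    joints: array of shape (L, N, 2) holding (x, y) joint coordinates already
            scaled to the spatial size (W, H) of the feature map.
    Returns an array of shape (M, L, H, W) with M = N * L: each channel is 1 at
    its joint's location in its frame and 0 everywhere else (a hard weight).
    """
    N = joints.shape[1]
    heat = np.zeros((N * L, L, H, W), dtype=np.float32)
    for t in range(L):                      # frame index
        for n in range(N):                  # joint index
            x, y = joints[t, n]
            col = int(round(float(np.clip(x, 0, W - 1))))
            row = int(round(float(np.clip(y, 0, H - 1))))
            heat[n * L + t, t, row, col] = 1.0
    return heat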
3) Bilinear operation
Through the above operations, the method obtains the spatio-temporal feature map and the skeleton-region heat map of the input data; a bilinear operation is performed on the spatio-temporal feature map and the heat map to extract the feature corresponding to each skeleton point. The specific process is as follows: first, the heat map of size M × L × H × W is adjusted into a 2D matrix A with M rows and L·H·W columns; similarly, the C × L × H × W feature map derived from MF-Net is adjusted into a 2D matrix B with C rows and L·H·W columns. Then the bilinear product can be expressed as:
X = AB^T
where B^T is the transpose of B; X is a matrix of size M × C and is the set of all skeleton point features.
Bilinear operations are performed on the spatio-temporal feature maps and heat maps obtained after the MF-Net Conv4 and Conv5 blocks, respectively, yielding the features corresponding to all skeleton points for each block; finally, the skeleton spatio-temporal features corresponding to these two different blocks are fused to obtain a single representation of the skeleton spatio-temporal features. Because the features after different blocks complement each other's spatio-temporal information, the accuracy of behavior recognition is improved.
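The bilinear extraction of per-joint features can be sketched as follows; this is a minimal NumPy illustration of X = A B^T under the shapes defined above, and the concatenation used to fuse the conv4 and conv5 features is only one plausible choice, not a detail fixed by the text:

import numpy as np

def bilinear_joint_features(heat, feat):
    """Extract per-skeleton-point features via the bilinear product X = A B^T.

    heat: skeleton-region heat map of shape (M, L, H, W)
    feat: spatio-temporal feature map of shape (C, L, H, W)
    Returns X of shape (M, C): one C-dimensional feature per skeleton point.
    """
    M = heat.shape[0]
    C = feat.shape[0]
    A = heat.reshape(M, -1)    # M x (L*H*W)
    B = feat.reshape(C, -1)    # C x (L*H*W)
    return A @ B.T             # M x C

# Fusing the conv4 and conv5 results; concatenating along the feature axis is
# only one plausible choice, the patent text does not fix the fusion operator.
# X4 = bilinear_joint_features(heat4, feat4)
# X5 = bilinear_joint_features(heat5, feat5)
# X = np.concatenate([X4, X5], axis=1)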
4) The local decision block obtains local decision results
The invention makes decisions on all local skeleton point features separately to obtain local behavior classification results. With the total number of behavior classes defined as K, the matrix X obtained from the attention mechanism can be expressed as:
X = [x_1; x_2; ...; x_i; ...; x_M], x_i ∈ R^(C×1)
where x_i represents the feature of the ith skeleton point and can be regarded as a local feature description of the target.
The idea of the invention is to predict the behavior probability of each local feature with a linear mapping, which can be viewed as multiple weak classifiers making decisions. The invention provides two linear classifier schemes: one trains a single parameter-shared linear classifier for all local features, and the other trains a separate linear classifier for each human skeleton point feature. The invention uses F_θ to represent the set of linear classifiers.
The first scheme:
f_θ(x_i) = θ x_i + b
x_i ∈ R^(C×1), θ ∈ R^(K×C), b ∈ R, i ∈ [1, M]
where f_θ denotes the linear classifier with parameter θ; b may be included in θ.
The second scheme:
f_θi(x_i) = θ_i x_i + b_i
x_i ∈ R^(C×1), θ_i ∈ R^(K×C), b_i ∈ R, i ∈ [1, M]
where f_θi denotes the ith linear classifier with parameter θ_i; b_i may be included in θ_i.
Whichever scheme is used, a set of decisions D = [d_1; d_2; ...; d_i; ...; d_M] is derived, expressed as:
d_i = f(x_i), i ∈ [1, M]
where d_i represents the decision result of the ith skeleton point feature and f is the linear classifier of the chosen scheme. Through this operation, the invention obtains a series of local decision results.
Subsequent experiments show that the second scheme introduces a large number of parameters, which causes overfitting and reduces the recognition accuracy. Therefore the first scheme is chosen, and all skeleton point features are decided using a linear classifier with shared parameters.
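A minimal PyTorch-style sketch of such a shared-parameter local decision block is shown below; the class and tensor shapes are illustrative assumptions, not code from the patent:

import torch
import torch.nn as nn

class LocalDecisionBlock(nn.Module):
    """Applies one shared K x C linear classifier to every local skeleton-point feature."""

    def __init__(self, feat_dim, num_classes):
        super().__init__()
        self.classifier = nn.Linear(feat_dim, num_classes)  # shared by all M local features

    def forward(self, x):
        # x: (batch, M, C) skeleton-point features from the bilinear step
        # returns (batch, M, K) local decision scores, one decision per skeleton point
        return self.classifier(x)

# Usage sketch: local_decisions = LocalDecisionBlock(feat_dim=C, num_classes=K)(X)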
5) Fusing the local decisions to obtain a global decision result
Through the local decision block, all local decision results are obtained. Fusing all these decision results is the problem the decision fusion block has to solve. The invention provides two fusion methods: one sums and averages the decision results to obtain the final decision result; the other generates corresponding weights according to the conditioned-on-current-local-patch criterion, multiplies each weight by the corresponding decision result, and then sums and averages them to obtain the global decision result.
The first fusion method can be expressed as:
D = (1/M) Σ_{i=1..M} d_i
The second fusion method first generates corresponding weights according to the conditioned-on-current-local-patch criterion, where each weight represents the importance of the corresponding local feature. The invention uses a linear mapping g_β to implement this function and then normalizes the weights with the sigmoid function, denoted by W:
W = [w_1; w_2; ...; w_i; ...; w_M],
w_i = sigmoid(g_β(x_i)) = 1 / (1 + e^(-g_β(x_i))),
g_β(x_i) = β x_i + b, x_i ∈ R^(C×1), β ∈ R^(1×C), b ∈ R, i ∈ [1, M]
where β and b are the parameters of the linear mapping g_β; likewise, b may be included in β.
Finally, the invention obtains the global decision result according to the generated weights:
D = (1/M) Σ_{i=1..M} w_i d_i
it is known that the amplitude of skeletal changes of different behaviors of a human body is different, and the importance of the skeletal changes on behavior recognition is different. For example, the variation range of the bone points of the upper half part of the human body is larger in the push-up mode; the change range of the bone points of the lower half part of the human body is larger when people play football. This characteristic is proved in the experimental results of the invention, and the use of the second fusion method is more beneficial to human behavior recognition.
6) Loss functions used to train the network
The LD-Net provided by the invention produces both local and global decision results, so the network can be supervised locally, globally, or with both combined.
(a) Global supervision. The invention supervises only the global decision, and the loss function is as follows:
L_g = -(1/U) Σ_{i=1..U} Σ_{j=1..K} y_ij log(ŷ_ij)
where U represents the sample size, K represents the total number of behavior classes, y_ij represents the label at the jth position of the ith sample, and ŷ_ij represents the prediction result.
(b) Local supervision. The invention supervises only the local decisions, and the loss function is as follows:
L_l = -(1/(U·M)) Σ_{i=1..U} Σ_{m=1..M} Σ_{j=1..K} y_ij log(d^m_ij)
where M represents the total number of skeleton points of an input video segment and d^m_ij represents the value at the jth position of the ith sample in the decision result of the mth skeleton point feature.
(c) Global + local supervision. The invention supervises both the local and the global decisions, and the loss function is as follows:
L = L_g + L_l
The present invention sets the initial learning rate of training to 0.005 and multiplies it by 0.1 after 20, 40 and 60 epochs. A stochastic gradient descent (SGD) optimizer is used in the training process.
7) Analysis of Experimental results
The invention performs experiments on the Penn Action and NTU RGB + D datasets.
(1) Experiments on the Penn Action dataset
Penn Action dataset: 2326 video sequences covering 15 action classes. The video lengths range from 18 to 663 frames. The dataset provides 13 human skeleton point annotations per frame, although some skeleton points are not visible in the videos. The dataset contains 1258 training videos and 1068 test videos.
<1> Decision modules: in this subsection, the invention compares the basic MF-Net with networks to which the local decision block and the decision fusion block have been added.
Table 1. Recognition accuracy of the basic MF-Net and of networks using the local decision block and the decision fusion block. conv4 and conv5 denote the spatio-temporal feature maps of MF-Net after the conv4 and conv5 modules, respectively. L denotes the local decision block, D the decision fusion block, S a parameter-shared linear classifier, NS a linear classifier without parameter sharing, W decision fusion with self-learned weights, and NW the average fusion strategy.
Method                               Accuracy
MF-Net                               0.945
MF-Net(conv4)+L(S)+D(NW)             0.928
MF-Net(conv5)+L(S)+D(NW)             0.965
MF-Net(conv5)+L(S)+D(W)              0.973
MF-Net(conv5)+L(NS)+D(NW)            0.932
MF-Net(conv5)+L(NS)+D(W)             0.941
As can be seen from Table 1, the decision modules used after the conv5 block of the MF-Net network perform better than those after the conv4 block, because the spatio-temporal features extracted after the conv5 block are more discriminative. The experimental results show that the network with the added modules performs better than the basic MF-Net, raising the accuracy from 94.5% to 96.5%. When the weight-based decision-averaging strategy is used, the accuracy further improves from 96.5% to 97.3%, because the weight-based strategy derives a set of learnable weights from the local information and can thus focus more on important local information. On the other hand, when a linear classifier without parameter sharing is used, the recognition accuracy drops noticeably compared with the basic MF-Net network. The reason may be that each local feature then corresponds to its own classifier, so the number of network parameters increases significantly and causes overfitting.
FIG. 5 visualizes the local decision weights for four different behaviors as heat maps. (a) shows that for the push-up behavior the network focuses more on the elbow, wrist and shoulder regions. (b) shows that for the tennis-forehand behavior the network focuses more on upper-body regions such as the wrist and elbow. (c) shows that for the jump-rope behavior the network focuses more on lower-body regions such as the knee, wrist and ankle. (d) shows that for the sit-up behavior the network focuses more on the shoulder, hip, wrist and knee regions.
<2> Comparison of different feature fusion strategies: in this subsection, the invention compares two different fusion strategies. One fuses the features and then makes a decision; the other fuses the local decision results. For feature fusion, the invention uses the JDD approach with MF-Net, which fuses the features under the guidance of the skeleton points. JDD has two feature fusion strategies: one integrates all local features directly; the other first integrates the features of the same time step and then aggregates the features along the temporal dimension using max+min pooling.
Table 2. Recognition accuracy of different fusion methods. * denotes the use of max+min pooling in the temporal dimension.
Method                               Accuracy
MF-Net                               0.945
MF-Net(conv5)+JDD                    0.953
MF-Net(conv5)+JDD*                   0.960
MF-Net(conv5)+L(S)+D(NW)             0.965
MF-Net(conv5)+L(S)+D(W)              0.973
As can be seen from Table 2, using the first feature-fusion strategy raises the accuracy from 94.5% to 95.3%. When the max+min pooling is used, the accuracy further improves to 96.0%. However, the accuracy of the feature-fusion method remains lower than that of the decision-fusion method provided by the invention, which proves the effectiveness of the proposed method.
<3> Fusing decisions from different layers: in this subsection, the invention uses the spatio-temporal feature maps after different MF-Net layers to extract local features, makes local decisions, and finally fuses the local decision results to obtain the final decision result.
Table 3. Recognition accuracy of local decision fusion over different layers.
Method                               Accuracy
MF-Net                               0.945
MF-Net(conv4)+L(S)+D(NW)             0.928
MF-Net(conv4)+L(S)+D(W)              0.936
MF-Net(conv5)+L(S)+D(NW)             0.965
MF-Net(conv5)+L(S)+D(W)              0.973
MF-Net(conv4+conv5)+L(S)+D(NW)       0.977
MF-Net(conv4+conv5)+L(S)+D(W)        0.982
As can be seen from Table 3, fusing the local decisions made after the conv5 block works better than fusing those made after the conv4 block, possibly because higher-level semantic information is more favorable for behavior recognition. When the local decision results after the conv4 and conv5 blocks are integrated, the recognition accuracy improves further, because the information from different layers is complementary.
<4> different loss functions were used: in this subsection, the present invention uses different loss functions for the network.
Table 4 identification accuracy using different loss functions. GS denotes global supervision. LS denotes local supervision.
As can be seen from table 4, when the network uses only local supervision, its accuracy is lower than using global supervision. This is because global supervision can directly optimize the final objective function. When the network uses both local and global supervision, the accuracy is further improved. This reflects that local supervision can improve the accuracy of local classification, and cooperates with global supervision to improve the recognition accuracy. When the network fuses the local decision results after the conv4 and conv5 blocks and uses local and global supervision simultaneously in the training process, the accuracy rate reaches 98.4%.
<5> Robustness analysis: in this subsection, the invention assesses the impact of the accuracy of human joint positions on the proposed model. The AlphaPose algorithm is used to generate estimated human skeleton points; it predicts the positions of 17 skeleton points for each person: nose, left eye, right eye, left ear, right ear, left shoulder, right shoulder, left elbow, right elbow, left wrist, right wrist, left hip, right hip, left knee, right knee, left ankle and right ankle. For the Penn Action dataset, the 13 annotated skeleton points are head, left shoulder, right shoulder, left elbow, right elbow, left wrist, right wrist, left hip, right hip, left knee, right knee, left ankle and right ankle. To match the estimated skeleton points to the 13 annotated ones, the estimated nose point is taken as the head point. The average L1 distance errors of the estimated skeleton points in width and height are 54.03 and 28.31, respectively, corresponding to error ratios of (0.11, 0.09) relative to the frame size. The invention is compared with other deep networks that incorporate pose information, such as P-CNN, JDD and two-stream bilinear C3D. The experimental results are shown in Table 5.
Table 5. Effect of estimated versus annotated skeleton points on the Penn Action dataset.
Method                                        Annotated  Estimated  Error
P-CNN                                         0.977      0.953      0.024
JDD(conv5b)                                   0.943      0.874      0.069
JDD(conv5b+conv4b)                            0.957      0.893      0.064
JDD(conv5b+conv4b)*                           0.981      0.938      0.043
two-stream bilinear C3D                       0.943      0.926      0.017
two-stream bilinear C3D*                      0.971      0.953      0.018
MF-Net(conv5)+L(S)+D(NW)+GS                   0.965      0.955      0.010
MF-Net(conv5)+L(S)+D(W)+GS                    0.973      0.961      0.012
MF-Net(conv4+conv5)+L(S)+D(W)+LS+GS           0.984      0.969      0.015
As can be seen from Table 5, the method proposed by the invention is clearly superior to the other methods on the Penn Action dataset, especially when estimated human skeleton points are used. When the positions of the human skeleton points are inaccurate, the accuracy of existing pose-based methods drops rapidly. The proposed method uses the idea of ensemble learning and makes a comprehensive decision from multiple pieces of local information; it achieves the best performance with both annotated and estimated skeleton points. In addition, its accuracy drops by the smallest margin, which proves that the method is robust to errors in the estimated positions of human skeleton points.
<6> Comparison with other state-of-the-art methods: in this subsection, the experimental results of the proposed method are compared with those of other state-of-the-art methods, as shown in Table 6.
Table 6. Experimental results of the invention compared with other state-of-the-art methods. Decision Aggregation denotes MF-Net(conv4+conv5)+L(S)+D(W)+LS+GS.
In Table 6, the methods are classified into three groups. The methods of the first part are based on video features, those of the second part on pose features, and those of the third part on both video and pose features. In the first part, IDT-FV encodes dense trajectories using Fisher vectors and performs better than DT and STIP. Compared with C3D, MF-Net effectively reduces the number of parameters and further improves the accuracy by adopting group convolutions. MGN combines local and global information extracted from the video and raises the accuracy to 95.5%.
In the second part, Action Bank consists of many individual behavior detectors sampled widely in semantic space and viewpoint space. Actemes uses skeleton point labels (e.g., locations) in a data-driven training process to find regions in which spatio-temporal skeleton points are highly clustered. ACPS is a graph-structured model that uses high-level behavior information to incorporate higher-order part dependencies.
In the third part, methods that use pose information to guide behavior recognition can reduce redundant information and further improve the recognition accuracy. ACPS+IDT-FV combines ACPS and IDT-FV and raises the accuracy to 92.9%. MST-AOG and ST-AOG use spatio-temporal and-or graphs for behavior recognition. P-CNN [34] benefits from taking optical flow images as an additional input and from cropping the optical flow and RGB images into patches under the guidance of human skeleton points. JDD, two-stream bilinear C3D and the proposed method all use the human skeleton-point regions in the feature map as attention regions to guide spatio-temporal feature extraction. However, JDD and two-stream bilinear C3D make decisions after fusing the spatio-temporal features: JDD(conv5b+conv4b) performs feature aggregation with max+min pooling in the temporal dimension and then classifies with a support vector machine. The method provided by the invention instead uses a parameter-shared linear classifier to make decisions on the local spatio-temporal features separately and finally performs weight-based average aggregation of the local decision results; it is therefore end-to-end trainable and achieves state-of-the-art performance with annotated skeleton points. RPAN uses a pose attention mechanism to adaptively learn pose-related features and reaches an accuracy of 97.4% without annotated skeleton points. Pose+MD-fusion classifies by combining pose, spatial and motion feature maps and further improves the accuracy to 97.6% without requiring annotated skeleton points. RPAN and Pose+MD-fusion perform better than the proposed method when tested without annotated skeleton points because both take optical flow images as additional input, whereas the proposed method does not use optical flow information.
FIG. 6 visualizes the confusion matrices obtained by MF-Net and by Decision Aggregation, where Decision Aggregation denotes MF-Net(conv4+conv5)+L(S)+D(W)+LS+GS. On the Penn Action dataset, some behaviors are easily confused, such as clean-and-jerk and squat, or tennis forehand and tennis serve, because they have similar appearance and motion information. With the basic MF-Net these similar actions are recognized noticeably less accurately than the other actions. With the method provided by the invention, more comprehensive and more accurate judgments can be made, and the performance is superior to that of the MF-Net network.
(2) Experiments on NTU RGB + D dataset
NTU RGB+D dataset: it consists of 56,880 action samples and provides RGB video, a depth map sequence, 3D skeleton data and infrared video for each sample. The 3D skeleton data contain the three-dimensional positions of 25 major body joints per frame. The NTU dataset contains 60 action classes in total, of which 49 are performed by a single person and the rest by multiple persons. The invention does not require the depth map sequences or the infrared video when using this dataset. The 3D skeleton data of each sample are converted into 2D skeleton data. The training and test sets are divided according to two criteria, CS (cross-subject) and CV (cross-view).
<1> Comparison with other state-of-the-art methods: in this subsection, the experimental results of the proposed method are compared with those of other state-of-the-art methods, as shown in Table 7.
Table 7 experimental results of the method proposed by the present invention compared with other methods.
In Table 7, the methods are also classified into three groups. The methods of the first part are based on video features, those of the second part on pose features, and those of the third part on both video and pose features. Note that the methods shown in italics use pose information estimated by a vision-based method rather than pose information directly output by the Kinect.
In the first part, TSN is a two-stream network based on long-range temporal structure. MF-Net splits a complex network into an ensemble of lightweight networks. DA-Net combines the classification scores of each view with a view classifier and effectively improves the recognition accuracy.
In the second part, Lie Group models the three-dimensional geometric relationships between different body parts in 3D space. HBRNN uses hierarchical RNNs for behavior recognition, taking five body parts rather than the whole skeleton as input.
Compared with Lie Group, its accuracy is clearly improved. Part-aware LSTM models long-range temporal correlations for the features of each body part. Trust Gate ST-LSTM analyzes the motion information in the data in the spatio-temporal domain using a spatio-temporal LSTM. STA-LSTM uses spatial and temporal attention models to focus on the more discriminative skeleton points. VA-LSTM automatically adjusts the viewpoint with a view-adaptation scheme. DS-LSTM captures the temporal connections in the skeleton sequence. 3scale ResNet152 maps the skeleton information to color images and models them with inputs and networks at three different scales. (P+C) Net rearranges the inputs with a permutation network and designs a classification network with gated convolutions to improve learning. ST-GCN automatically learns spatial and temporal patterns from the data with a graph convolutional network. PB-GCN divides the skeleton graph into four sub-graphs and learns a recognition model with a part-based GCN.
In the third part, DSSCA-SSLM is a deep auto-encoder based on a shared-specific feature decomposition network that combines RGB and skeleton sequences. STA-Hands uses a spatio-temporal attention mechanism to focus on the important human hands and to detect discriminative moments in the behavior. PSTA is a two-stream approach in which the pose stream follows the topology of the human body and the RGB stream is processed by a spatio-temporal soft attention mechanism. The attention mechanisms of STA-Hands and PSTA are mainly concerned with the hand regions of the person. The method provided by the invention not only focuses on the hand regions but also considers the key regions of the human body on the basis of the learned weights, and can therefore analyze human behavior comprehensively. Compared with STA-Hands, the accuracy of CS and CV is improved by 8% and 5.6%, respectively; compared with PSTA, by 5.7% and 3.6%. CNN+RNN is also a two-stream structure, in which one stream models the skeleton information and the other extracts features from the RGB frames; finally the features of the two streams are fused and classified with a support vector machine, and the classification accuracy of CV reaches 93.6%. Deep Bilinear uses bilinear blocks to aggregate cube features from different modalities. CentralNet makes a decision for each modality, with a CS accuracy of 89.3%. Chained Network computes and integrates visual cues using Markov chain models. 2D/3D Multi-task and Multi-task Deep Learning estimate the body pose from the image with a multi-task framework and recognize the behavior from the video sequence; Multi-task Deep Learning is an extension of 2D/3D Multi-task. 2D/3D Multi-task predicts pose and behavior sequentially, whereas Multi-task Deep Learning predicts and optimizes them in parallel, improving the CS accuracy from 85.5% to 89.9%. In summary, CNN+RNN and Deep Bilinear fuse multimodal features and then make decisions, while CentralNet, Chained Network, 2D/3D Multi-task and Multi-task Deep Learning make decisions based on the features of each modality and then use these decisions to obtain the final prediction. Compared with these methods, the proposed method extracts the local RGB spatio-temporal features with the help of the skeleton information without sending the skeleton information to an additional network for feature extraction, then makes decisions on the local features with a parameter-shared linear classifier, and finally fuses the local decision results with a weighted-average method to obtain the final decision result. The proposed method fully considers the local information and fuses the local decisions through the proposed decision strategy, thereby achieving the best performance.

Claims (2)

1. A human behavior recognition method based on RGB video and a skeleton sequence, adopting LD-Net, which comprises two streams, a feature stream and an attention stream, and two modules, a local decision block and a decision fusion block; the method is characterized by comprising the following steps:
step 1: the human behavior data set comprises two parts, video data and human skeleton position data; the data set to be processed is divided into a training set and a test set; let the video set of the training set and the test set be I = {I_1, I_2, ..., I_i, ..., I_V}, wherein V denotes the total number of videos in the training set and the test set and I_i denotes the ith video; let the set of video lengths of the training set and the test set be F = {F_1, F_2, ..., F_i, ..., F_V}, wherein F_i denotes the length of the ith video; let the set of human skeletons of the training set and the test set be J = {J_1, J_2, ..., J_i, ..., J_V}, wherein J_i denotes the set of human skeleton points corresponding to the ith video and has dimension F_i × Z × 2, F_i denoting the length of the ith video, Z denoting the number of human skeleton points in each frame, and 2 denoting the abscissa and the ordinate of each skeleton point position;
step 2: pre-allocating initial labels for all videos in the training set and the test set, defining the total number of behavior categories as K, and setting the initial label set as {y_i = k | 1 ≤ k ≤ K}, wherein y_i denotes the initial label of video I_i, i = 1, 2, ..., V;
step 3: preprocessing the training set and the test set, wherein the number of frames sent by each video to the network is S; the S frames of video data are scaled, randomly cropped and mean-value normalized to obtain video data of dimension S × 3 × 224 × 224 and human skeleton position data of dimension S × Z × 2;
step 4: sending the video data of dimension S × 3 × 224 × 224 into the feature stream, which uses MF-Net for feature extraction to obtain spatio-temporal feature maps after the MF-Net Conv4 and Conv5 blocks; the spatio-temporal feature map dimensions are denoted C, L, H, W, namely the number of channels, the length, the height and the width;
step 5: sending the human skeleton position data of dimension S × Z × 2 into the attention stream to obtain a heat map of the input video clip, whose dimensions are denoted M × L × H × W, namely the number of channels, the length, the height and the width; wherein M = N × L, N denotes the number of body joints in each frame and L denotes the length of the video clip; the heat map is realized by activating the points corresponding to the skeleton points, which is equivalent to assigning hard weights;
step 6: adjusting the heat map of size M × L × H × W into a 2D matrix A having M rows and L·H·W columns, and adjusting the C × L × H × W feature map derived from MF-Net into a 2D matrix B having C rows and L·H·W columns; the bilinear product is then represented as:
X = AB^T
wherein B^T is the transpose of B, the matrix X is the set of all skeleton point features, and its dimension is M × C;
step 7: sending the set X of skeleton point features obtained in step 6 into the local decision block, wherein the local decision block makes decisions on the spatio-temporal features of all local skeleton points through a fully connected layer to obtain local behavior classification results; the specific steps are as follows:
defining the total number of behavior classes as K, the matrix X obtained by the attention mechanism can be represented as:
X = [x_1; x_2; ...; x_i; ...; x_M], x_i ∈ R^(C×1)
wherein x_i represents the feature of the ith skeleton point and can be regarded as a local feature description of the target;
respectively training a linear classifier for each human skeleton point feature:
f_θi(x_i) = θ_i x_i + b_i
x_i ∈ R^(C×1), θ_i ∈ R^(K×C), b_i ∈ R, i ∈ [1, M]
wherein f_θi denotes the ith linear classifier with parameter θ_i, b_i may be included in θ_i, and F_θ denotes the set of linear classifiers;
according to F_θ, a set of decisions D = [d_1; d_2; ...; d_i; ...; d_M] is derived, expressed as:
d_i = f_θi(x_i), i ∈ [1, M]
wherein d_i represents the decision result of the ith skeleton point feature;
step 8: the decision fusion block fuses all local decisions obtained by the local decision block to obtain the final decision result; the specific steps are as follows:
generating a corresponding weight according to the conditioned-on-current-local-patch criterion, wherein the weight represents the importance degree of the corresponding local feature; a linear mapping g_β is used to implement this function, and the weights are then normalized by the sigmoid function, denoted by W:
W = [w_1; w_2; ...; w_i; ...; w_M],
w_i = sigmoid(g_β(x_i)) = 1 / (1 + e^(-g_β(x_i))),
g_β(x_i) = β x_i + b, x_i ∈ R^(C×1), β ∈ R^(1×C), b ∈ R, i ∈ [1, M]
wherein β and b are the parameters of the linear mapping g_β;
obtaining a global decision result according to the generated weights:
D = (1/M) Σ_{i=1..M} w_i d_i
2. the human behavior recognition method based on the RGB video and the skeleton sequence as claimed in claim 1, wherein the local decision result obtained by the local decision block in step 7 is locally supervised, and the loss function is as follows:
L_l = -(1/(U·M)) Σ_{i=1..U} Σ_{m=1..M} Σ_{j=1..K} y_ij log(d^m_ij)
wherein U represents the sample size, M represents the total number of skeleton points of an input video segment, y_ij represents the label at the jth position of the ith sample, and d^m_ij represents the value at the jth position of the ith sample in the decision result of the mth skeleton point feature;
globally supervising the global decision result obtained in step 8, wherein the loss function is as follows:
L_g = -(1/U) Σ_{i=1..U} Σ_{j=1..K} y_ij log(ŷ_ij)
wherein U represents the sample size, K represents the total number of behavior classes, y_ij represents the label at the jth position of the ith sample, and ŷ_ij represents the prediction result;
supervising both the local and the global decisions, wherein the loss function is as follows:
L = L_g + L_l
the initial learning rate of training is set to 0.005 and is multiplied by 0.1 after 20, 40 and 60 epochs, and a stochastic gradient descent (SGD) optimizer is used during training.
CN202010821378.3A 2020-08-14 2020-08-14 Human behavior recognition method based on RGB video and skeleton sequence Active CN111967379B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010821378.3A CN111967379B (en) 2020-08-14 2020-08-14 Human behavior recognition method based on RGB video and skeleton sequence

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010821378.3A CN111967379B (en) 2020-08-14 2020-08-14 Human behavior recognition method based on RGB video and skeleton sequence

Publications (2)

Publication Number Publication Date
CN111967379A true CN111967379A (en) 2020-11-20
CN111967379B CN111967379B (en) 2022-04-08

Family

ID=73387758

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010821378.3A Active CN111967379B (en) 2020-08-14 2020-08-14 Human behavior recognition method based on RGB video and skeleton sequence

Country Status (1)

Country Link
CN (1) CN111967379B (en)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112287891A (en) * 2020-11-23 2021-01-29 福州大学 Method for evaluating learning concentration through video based on expression and behavior feature extraction
CN112749671A (en) * 2021-01-19 2021-05-04 澜途集思生态科技集团有限公司 Human behavior recognition method based on video
CN112906604A (en) * 2021-03-03 2021-06-04 安徽省科亿信息科技有限公司 Behavior identification method, device and system based on skeleton and RGB frame fusion
CN113139469A (en) * 2021-04-25 2021-07-20 武汉理工大学 Driver road stress adjusting method and system based on micro-expression recognition
CN113139432A (en) * 2021-03-25 2021-07-20 杭州电子科技大学 Industrial packaging behavior identification method based on human body skeleton and local image
CN113469018A (en) * 2021-06-29 2021-10-01 中北大学 Multi-modal interaction behavior recognition method based on RGB and three-dimensional skeleton
CN113610071A (en) * 2021-10-11 2021-11-05 深圳市一心视觉科技有限公司 Face living body detection method and device, electronic equipment and storage medium
CN114091601A (en) * 2021-11-18 2022-02-25 业成科技(成都)有限公司 Sensor fusion method for detecting personnel condition
CN117137435A (en) * 2023-07-21 2023-12-01 北京体育大学 Rehabilitation action recognition method and system based on multi-mode information fusion

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107609460A (en) * 2017-05-24 2018-01-19 南京邮电大学 A kind of Human bodys' response method for merging space-time dual-network stream and attention mechanism
CN110135251A (en) * 2019-04-09 2019-08-16 上海电力学院 A kind of group's image Emotion identification method based on attention mechanism and hybrid network
CN110135319A (en) * 2019-05-09 2019-08-16 广州大学 A kind of anomaly detection method and its system
CN110222665A (en) * 2019-06-14 2019-09-10 电子科技大学 Human motion recognition method in a kind of monitoring based on deep learning and Attitude estimation
CN110555387A (en) * 2019-08-02 2019-12-10 华侨大学 Behavior identification method based on local joint point track space-time volume in skeleton sequence
CN110728183A (en) * 2019-09-09 2020-01-24 天津大学 Human body action recognition method based on attention mechanism neural network
CN111160164A (en) * 2019-12-18 2020-05-15 上海交通大学 Action recognition method based on human body skeleton and image fusion
US10713493B1 (en) * 2020-02-06 2020-07-14 Shenzhen Malong Technologies Co., Ltd. 4D convolutional neural networks for video recognition

Non-Patent Citations (8)

* Cited by examiner, † Cited by third party
Title
CONGQI CAO ET AL.: "Body Joint Guided 3-D Deep Convolutional Descriptors for Action Recognition", 《IEEE TRANSACTIONS ON CYBERNETICS》 *
ENQING CHEN ET AL.: "A Spatiotemporal Heterogeneous Two-Stream Network for Action Recognition", 《IEEE ACCESS》 *
FABIEN BARADEL ET AL.: "Human Action Recognition: Pose-Based Attention Draws Focus to Hands", 《2017 IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION WORKSHOPS (ICCVW)》 *
JIAN-FANG HU ET AL.: "Deep bilinear learning for RGB-D action recognition", 《ECCV 2018: COMPUTER VISION – ECCV 2018》 *
SRIJAN DAS ET AL.: "Where to Focus on for Human Action Recognition?", 《2019 IEEE WINTER CONFERENCE ON APPLICATIONS OF COMPUTER VISION (WACV)》 *
HE BINGQIAN ET AL.: "Human Action Recognition Model Based on an Improved Deep Neural Network", 《APPLICATION RESEARCH OF COMPUTERS》 *
LIU ZHIQIANG: "Human Action Recognition Fusing Video Information and Skeleton Data on the Kinect Platform", 《CHINA MASTER'S THESES FULL-TEXT DATABASE, INFORMATION SCIENCE AND TECHNOLOGY》 *
WANG JIANXI ET AL.: "Online Gesture Recognition Algorithm Fusing Pose Kernel Learning and Decision Forests", 《VIDEO ENGINEERING》 *

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112287891B (en) * 2020-11-23 2022-06-10 福州大学 Method for evaluating learning concentration through video based on expression behavior feature extraction
CN112287891A (en) * 2020-11-23 2021-01-29 福州大学 Method for evaluating learning concentration through video based on expression and behavior feature extraction
CN112749671A (en) * 2021-01-19 2021-05-04 澜途集思生态科技集团有限公司 Human behavior recognition method based on video
CN112906604A (en) * 2021-03-03 2021-06-04 安徽省科亿信息科技有限公司 Behavior identification method, device and system based on skeleton and RGB frame fusion
CN112906604B (en) * 2021-03-03 2024-02-20 安徽省科亿信息科技有限公司 Behavior recognition method, device and system based on skeleton and RGB frame fusion
CN113139432B (en) * 2021-03-25 2024-02-06 杭州电子科技大学 Industrial packaging behavior identification method based on human skeleton and partial image
CN113139432A (en) * 2021-03-25 2021-07-20 杭州电子科技大学 Industrial packaging behavior identification method based on human body skeleton and local image
CN113139469B (en) * 2021-04-25 2022-04-29 武汉理工大学 Driver road stress adjusting method and system based on micro-expression recognition
CN113139469A (en) * 2021-04-25 2021-07-20 武汉理工大学 Driver road stress adjusting method and system based on micro-expression recognition
CN113469018A (en) * 2021-06-29 2021-10-01 中北大学 Multi-modal interaction behavior recognition method based on RGB and three-dimensional skeleton
CN113469018B (en) * 2021-06-29 2024-02-23 中北大学 Multi-modal interactive behavior recognition method based on RGB and three-dimensional skeleton
CN113610071A (en) * 2021-10-11 2021-11-05 深圳市一心视觉科技有限公司 Face living body detection method and device, electronic equipment and storage medium
CN114091601A (en) * 2021-11-18 2022-02-25 业成科技(成都)有限公司 Sensor fusion method for detecting personnel condition
CN114091601B (en) * 2021-11-18 2023-05-05 业成科技(成都)有限公司 Sensor fusion method for detecting personnel condition
CN117137435A (en) * 2023-07-21 2023-12-01 北京体育大学 Rehabilitation action recognition method and system based on multi-mode information fusion
CN117137435B (en) * 2023-07-21 2024-06-25 北京体育大学 Rehabilitation action recognition method and system based on multi-mode information fusion

Also Published As

Publication number Publication date
CN111967379B (en) 2022-04-08

Similar Documents

Publication Publication Date Title
CN111967379B (en) Human behavior recognition method based on RGB video and skeleton sequence
US11783183B2 (en) Method and system for activity classification
Bhagat et al. Indian sign language gesture recognition using image processing and deep learning
Ling et al. Building data-driven models with microstructural images: Generalization and interpretability
WO2023000872A1 (en) Supervised learning method and apparatus for image features, device, and storage medium
CN110135251B (en) Group image emotion recognition method based on attention mechanism and hybrid network
Liu et al. Si-GCN: Structure-induced graph convolution network for skeleton-based action recognition
CN108537145A (en) Human bodys&#39; response method based on space-time skeleton character and depth belief network
Gammulle et al. Coupled generative adversarial network for continuous fine-grained action segmentation
CN109657634A (en) A kind of 3D gesture identification method and system based on depth convolutional neural networks
Nale et al. Suspicious human activity detection using pose estimation and lstm
Chen et al. Action keypoint network for efficient video recognition
Castro et al. AttenGait: Gait recognition with attention and rich modalities
Batool et al. Fundamental recognition of ADL assessments using machine learning engineering
CN112487926A (en) Scenic spot feeding behavior identification method based on space-time diagram convolutional network
Chen et al. Skeleton moving pose-based human fall detection with sparse coding and temporal pyramid pooling
Hassan et al. Enhanced dynamic sign language recognition using slowfast networks
Ramanathan et al. Combining pose-invariant kinematic features and object context features for rgb-d action recognition
Goga et al. Hand gesture recognition using 3D sensors
Ye et al. Human interactive behaviour recognition method based on multi-feature fusion
CN112597842B (en) Motion detection facial paralysis degree evaluation system based on artificial intelligence
Xie et al. Multi-channel Capsule Network for Micro-expression Recognition with Multiscale Fusion
Zhao et al. Research on human behavior recognition in video based on 3DCCA
CN114973305A (en) Accurate human body analysis method for crowded people
CN113591797A (en) Deep video behavior identification method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant