CN110321833B - Human body behavior identification method based on convolutional neural network and cyclic neural network - Google Patents

Human body behavior identification method based on convolutional neural network and cyclic neural network Download PDF

Info

Publication number
CN110321833B
Authority
CN
China
Prior art keywords
neural network
time
convolutional neural
rnn
human body
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910580116.XA
Other languages
Chinese (zh)
Other versions
CN110321833A (en)
Inventor
谢子凡
陈志�
岳文静
葛宇轩
王多
崔明浩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Posts and Telecommunications
Priority to CN201910580116.XA priority Critical patent/CN110321833B/en
Publication of CN110321833A publication Critical patent/CN110321833A/en
Application granted granted Critical
Publication of CN110321833B publication Critical patent/CN110321833B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20Movements or behaviour, e.g. gesture recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Molecular Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a human body behavior recognition method based on a convolutional neural network and a recurrent neural network. A sensor tracks human body behavior and collects, over a time period, a group of 3-dimensional coordinate vectors of the human joints together with an RGB video. A recurrent neural network (RNN) is then trained on the 3-dimensional joint coordinates to obtain a temporal feature vector, and a convolutional neural network (CNN) is trained on the RGB video to obtain a spatio-temporal feature vector. Finally, the temporal and spatio-temporal feature vectors are concatenated and normalized, fed to a linear SVM classifier, and a validation data set is used to find the parameter C of the linear support vector machine (SVM), yielding a comprehensive recognition model. The method alleviates overfitting of the model to the action classes during training and effectively improves the efficiency and accuracy of human behavior recognition.

Description

Human body behavior identification method based on convolutional neural network and cyclic neural network
Technical Field
The invention relates to a human body behavior recognition method based on a convolutional neural network and a recurrent neural network, and belongs to the interdisciplinary field spanning behavior recognition, deep learning and machine learning.
Background
Behavior recognition and classification in videos is an important research topic in computer vision, and it has become a research hotspot owing to its wide application in video tracking, motion analysis, virtual reality and artificial-intelligence interaction.
The same action can look different under different illumination, viewing angles, backgrounds and other conditions, and the same object and action can also exhibit obvious differences in appearance and posture across action scenes. Even in a fixed scene the human body has many degrees of freedom, so instances of the same action differ greatly in direction, angle and other respects. In addition, partial occlusion, individual differences and similar problems reflect the spatial complexity of action recognition. Hand-crafted feature design makes it difficult to capture the essential characteristics of an object in a rapidly changing scene, so a more general feature extraction method is needed to overcome the one-sidedness and blindness of manual feature engineering.
At present, many motion recognition algorithms are shallow learning methods with inherent limitations: with limited training samples, their capacity to represent complex functions is restricted, and so is their generalization ability on complex classification problems.
The motion recognition problem can be regarded as a classification problem, and many classification methods have been applied to it, notably logistic regression, decision tree models, naive Bayes classifiers, and support vector machines. Each of these methods has advantages and disadvantages in practical applications.
3D video is widely used in television and general video content, yet the technology for behavior recognition in 3D video is not mature: most human behavior recognition systems rely on manually annotated data that are then fed into models for recognition. Such an approach depends heavily on the data, runs inefficiently, and does not meet the requirements of industrial and commercial deployment.
Disclosure of Invention
The purpose of the invention is as follows: in order to overcome the defects in the prior art, the invention provides a human body behavior recognition method based on a convolutional neural network and a recurrent neural network.
Technical scheme: in order to achieve the above purpose, the invention adopts the following technical scheme:
a human body behavior recognition method based on a convolutional neural network and a cyclic neural network comprises the steps of firstly enabling a user to continuously swing a hand at a fixed position for 5 times, tracking human body behaviors by using a Microsoft Kinect v2 sensor, and collecting 3-dimensional coordinate vector groups of 25 main joints of the user in a time period and RGB videos in the time period by taking 16 milliseconds as time step. Then, a gated circulation unit GRU is used as a basic unit of a circulation layer, and a bidirectional circulation neural network is used for training a 3-dimensional coordinate data set of the human body joint, so that a time feature vector of the 3-dimensional coordinate data of the human body joint is obtained. Dividing the RGB video into continuous RGB frames with 16 milliseconds as time step length, training a convolutional neural network on the data set to obtain a space-time characteristic vector of the RGB video, finally combining the output results of the two cyclic neural networks and the convolutional neural network, normalizing the output results after connection, feeding the normalized result to a classifier of a linear SVM, using a verification data set to find a parameter C of the linear support vector machine SVM, and finally obtaining a comprehensive recognition model, wherein the comprehensive recognition model specifically comprises the following steps:
step 1), the user waves a hand 5 times in succession at a fixed position, a sensor tracks the human body behavior with the time step set to 16 milliseconds during collection, and the 3-dimensional coordinates of the user's 25 main joints over the time period are collected as V = {(x1, y1, z1), (x2, y2, z2), ..., (x25, y25, z25)}; the x axis is perpendicular to the vertical direction of the human body with its positive direction pointing forward, the y axis is parallel to the vertical direction of the human body with its positive direction pointing towards the head, and the z axis is perpendicular to the vertical direction of the human body with its positive direction pointing to the left side of the body; the RGB video of the scene over the same time period is collected simultaneously;
step 2), the 3-dimensional coordinates of the 25 human joints from step 1) are used as the training set of a recurrent neural network (RNN), and the RNN is trained on this set. Gated recurrent units (GRU) operate on the input coordinates inside the RNN, which uses batch normalization and dropout. The RNN comprises two bidirectional gated recurrent layers, a hidden fully connected layer and a softmax output layer; a rectified linear unit (ReLU) serves as the activation function, and a dropout mechanism is used when mapping the motion features to the hand-waving motion category, giving the trained recurrent neural network RNN;
step 3), the RGB video collected in step 1) is divided into consecutive RGB frames with a time step of 16 milliseconds, the resolution of the frames is adjusted, and the frames are used as the training set of a convolutional neural network (CNN). The CNN convolves the input RGB frame stream with several convolution kernels and comprises convolutional layers, pooling layers, fully connected layers and a softmax output layer; the model is fine-tuned with parameters pre-trained on the Sports-1M data set to reduce overfitting and training time, giving the trained convolutional neural network CNN;
step 4), the output of the recurrent neural network RNN trained in step 2) is taken as the temporal feature vector of the 3-dimensional joint coordinate data, and the output of the convolutional neural network CNN trained in step 3) is taken as the spatio-temporal feature vector of the RGB video. The two feature vectors are concatenated, the concatenated vector is normalized and fed to a linear support vector machine (SVM) classifier, and a validation data set is used to find the penalty coefficient C of the linear SVM, giving the comprehensive recognition model;
and step 5), during recognition, the 3-dimensional coordinates of the 25 joints and the RGB video of the human behavior to be recognized are collected with the method of step 1). The collected joint coordinates are fed to the recurrent neural network RNN trained in step 2) to obtain a temporal feature vector, and the collected RGB video is fed to the convolutional neural network CNN trained in step 3) to obtain a spatio-temporal feature vector. The two vectors are concatenated into a feature array, the array is normalized and passed to the comprehensive recognition model, which recognizes the human behavior.
Preferably: the sensor in step 1) is a Microsoft Kinect v2 sensor.
Preferably: the method for obtaining the trained recurrent neural network RNN in the step 2) comprises the following steps:
step 21), taking the 3-dimensional coordinates of the 25 joints of the human body in the step 1) as a training set of the recurrent neural network RNN, wherein the dimensionality of the training set of the recurrent neural network RNN is 16 x (25 x 3);
step 22), a recurrent neural network is trained on the training set of the recurrent neural network RNN, first passing through two bidirectional gated recurrent layers, with gated recurrent units GRU as the basic units of the recurrent layers and a bidirectional recurrent neural network whose two directions process the input data from t = 1 to t = T and from t = T to t = 1 respectively, where t denotes the time variable in the data set and T denotes the last time step in the data set;
step 23), a GRU-based recurrent neural network RNN is used to classify the scene features. The update gate controls how much state information from the previous time step is carried into the current state: z_t = σ(W_z · [h_{t-1}, x_t]), where h_{t-1} is the value of the memory cell at time t-1, x_t is the input at time t, W_z is the weight matrix of the update gate, z_t is the value of the update gate at time t, and σ is the sigmoid activation function;
step 24), the reset gate controls how much information from the previous state is written into the candidate memory cell h̃_t at time t: r_t = σ(W_r · [h_{t-1}, x_t]), where h_{t-1} is the value of the memory cell at time t-1, x_t is the input at time t, W_r is the weight matrix of the reset gate, and r_t is the value of the reset gate at time t;
step 25), the value of the candidate memory cell is calculated as h̃_t = tanh(W · [r_t ⊙ h_{t-1}, x_t]), where h̃_t is the value of the candidate memory cell at time t, h_{t-1} is the value of the memory cell at time t-1, x_t is the input at time t, W is the weight matrix of the GRU unit at the current time, and tanh is the hyperbolic tangent function;
step 26), the state value of the memory cell at the current time is calculated as h_t = (1 − z_t) ⊙ h_{t-1} + z_t ⊙ h̃_t, where ⊙ denotes the element-wise (point-wise) product; the state update depends on the memory-cell value h_{t-1} at time t-1 and on the candidate memory-cell value h̃_t, the two terms being blended by the update gate, while the reset gate acts inside the candidate value of step 25);
step 27), finally the output of the recurrent neural network is obtained as y_t = σ(W · h_t) and passed to the fully connected layer, where the softmax activation function interprets the output as probabilities, y_t denoting the probability predicted by the recurrent neural network.
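For illustration, a minimal NumPy sketch of the gate computations in steps 23) to 27) is given below; the input size (75 = 25 joints × 3 coordinates), the hidden size of 300, the random weight initialization and the output dimension are illustrative assumptions rather than values fixed by the method, and the softmax of step 27) is left to the subsequent fully connected layer.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev, Wz, Wr, W, Wy):
    """One GRU step following steps 23)-27): update gate, reset gate,
    candidate memory cell, state update, and the pre-softmax output."""
    xh = np.concatenate([h_prev, x_t])                           # [h_{t-1}, x_t]
    z_t = sigmoid(Wz @ xh)                                       # update gate, step 23)
    r_t = sigmoid(Wr @ xh)                                       # reset gate, step 24)
    h_cand = np.tanh(W @ np.concatenate([r_t * h_prev, x_t]))    # candidate cell, step 25)
    h_t = (1.0 - z_t) * h_prev + z_t * h_cand                    # state update, step 26)
    y_t = Wy @ h_t                                               # output before softmax, step 27)
    return h_t, y_t

# Illustrative sizes: 75-dimensional joint input, 300 hidden units, 10 output classes.
rng = np.random.default_rng(0)
x_t, h_prev = rng.normal(size=75), np.zeros(300)
Wz = rng.normal(scale=0.01, size=(300, 375))
Wr = rng.normal(scale=0.01, size=(300, 375))
W  = rng.normal(scale=0.01, size=(300, 375))
Wy = rng.normal(scale=0.01, size=(10, 300))
h_t, y_t = gru_step(x_t, h_prev, Wz, Wr, W, Wy)
```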
Preferably: the method for obtaining the trained convolutional neural network CNN in the step 3) is as follows:
step 31), the RGB video collected in step 1) is divided into consecutive RGB frames with a time step of 16 milliseconds, the resolution of the frames is adjusted, and the RGB frame stream is used as the training set of the convolutional neural network CNN; the input to the CNN has size c × t × h × w, where c is the number of channels, t is the time step, and h and w are the height and width of the RGB frames;
step 32), the convolutional neural network CNN receives the parallelized video-frame input. For a convolution window of length h, a convolution kernel slides over the input matrix x_{1:n} and the kernel output is c_i = f(w · x_{i:i+h-1} + b), where w ∈ R^{h×d} is the weight of the convolution kernel (w takes values in the real domain and d denotes the dimension of x_i), b ∈ R is the bias, f is the activation function, and x_{i:i+h-1} is the frame-vector matrix inside the convolution window. Convolving over a frame stream of temporal length n yields the convolved feature vector c = [c_1, c_2, ..., c_{n-h+1}].
Step 33), the maximum value is extracted from each convolved feature vector (max-over-time pooling); for a window with m convolution kernels this yields the pooled feature vector ĉ = [ĉ_1, ĉ_2, ..., ĉ_m], where ĉ is the feature vector extracted by the convolutional neural network and ĉ_m is the feature value obtained from the m-th convolution kernel.
Step 34), the classification result is output through a softmax function: y = softmax(W_fc · (ĉ ⊙ r) + b_fc), where y is the probability predicted by the convolutional neural network, r is the regularization (dropout) mask constraining the output of the down-sampling layer, ⊙ denotes element-wise multiplication, W_fc is the weight matrix of the fully connected layer, and b_fc is the bias; the convolutional neural network CNN is optimized with a stochastic gradient descent optimizer.
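For illustration, a minimal NumPy sketch of steps 32) to 34) follows; the per-frame feature dimension, window length, number of kernels and class count are illustrative assumptions, ReLU is assumed as the activation function f, and the dropout mask r of step 34) is omitted for brevity.

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def conv_maxpool_softmax(X, kernels, biases, W_fc, b_fc):
    """X: (n, d) frame features; kernels: (m, h, d); returns class probabilities.
    Implements c_i = f(w . x_{i:i+h-1} + b), max-over-time pooling, and softmax."""
    n, d = X.shape
    m, h, _ = kernels.shape
    pooled = np.empty(m)
    for k in range(m):
        c = np.array([np.maximum(0.0, np.sum(kernels[k] * X[i:i + h]) + biases[k])
                      for i in range(n - h + 1)])   # ReLU conv outputs c_1..c_{n-h+1}, step 32)
        pooled[k] = c.max()                          # max-over-time pooling, step 33)
    return softmax(W_fc @ pooled + b_fc)             # classification output, step 34)

rng = np.random.default_rng(0)
X = rng.normal(size=(16, 128))                       # 16 frames, 128-dim features (illustrative)
kernels = rng.normal(scale=0.1, size=(8, 3, 128))    # 8 kernels, window length 3
biases = np.zeros(8)
W_fc = rng.normal(scale=0.1, size=(5, 8))            # 5 illustrative classes
b_fc = np.zeros(5)
probs = conv_maxpool_softmax(X, kernels, biases, W_fc, b_fc)
```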
Preferably: the method for obtaining the comprehensive identification model in the step 4) comprises the following steps:
step 41), the recurrent neural network RNN trained in step 2) extracts features from the 3-dimensional human joint coordinate data and its output is taken as the temporal feature vector; the convolutional neural network CNN trained in step 3) extracts features from the RGB video and its output is taken as the spatio-temporal feature vector; the temporal and spatio-temporal feature vectors are concatenated into a feature array and normalized;
step 42), the normalized feature vector groups are taken as input, the specific action or behavior mark corresponding to each normalized feature vector group is taken as output, and the pairs are submitted to a linear support vector machine (SVM) for training, wherein the optimization model of the SVM is as follows:

min over ω, b_0, ξ of (1/2)·||ω||² + C · Σ_{i=1..N} ξ_i
s.t. y_i(ω^T · x_i + b_0) ≥ 1 − ξ_i, ξ_i ≥ 0, i = 1, 2, ..., N

where ω represents the weight vector (the normal of the separating hyperplane), x_i the i-th input feature vector, C the penalty coefficient, ξ_i the classification loss (slack) of the i-th sample point, y_i the action mark corresponding to each sample, b_0 the intercept, and N the total number of feature vectors input to the SVM;
step 43), the optimal value of the penalty coefficient C is found using the training and validation sets, giving the comprehensive recognition model.
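A possible realization of steps 41) to 43) with scikit-learn is sketched below; the use of scikit-learn, the candidate values of C and the placeholder feature arrays are assumptions, since the method itself only specifies a linear SVM whose penalty coefficient C is chosen on a validation set.

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

def l2_normalize(F):
    """Row-wise L2 normalization of the concatenated feature arrays (step 41))."""
    return F / np.linalg.norm(F, axis=1, keepdims=True)

# rnn_feats: (N, 600) temporal features, cnn_feats: (N, 4096) spatio-temporal features,
# labels: (N,) action marks -- placeholders produced by the trained RNN and CNN.
def fit_fusion_svm(rnn_feats, cnn_feats, labels, val_rnn, val_cnn, val_labels):
    X = l2_normalize(np.hstack([rnn_feats, cnn_feats]))       # concatenation, step 41)
    Xv = l2_normalize(np.hstack([val_rnn, val_cnn]))
    best_C, best_acc, best_clf = None, -1.0, None
    for C in (0.01, 0.1, 1.0, 10.0, 100.0):                   # search penalty C, step 43)
        clf = LinearSVC(C=C).fit(X, labels)                   # train linear SVM, step 42)
        acc = accuracy_score(val_labels, clf.predict(Xv))
        if acc > best_acc:
            best_C, best_acc, best_clf = C, acc, clf
    return best_clf, best_C
```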
Preferably: in step 33), when the convolutional neural network CNN is optimized by the stochastic gradient descent optimizer, the initial learning rate is 0.0001, and when no training progress is observed, the learning rate is reduced by half.
Preferably, the normalization formula in step 41) is x'_i = x_i / sqrt(Σ_{j=1..n} x_j²) (L2 normalization), where x_i denotes an element of the feature array and x'_i denotes the corresponding element of the normalized feature vector.
Preferably: step 31) the resolution is adjusted from 1920 x 1080 pixels to 320 x 240 pixels.
Compared with the prior art, the invention has the following beneficial effects:
the invention uses a Microsoft Kinect v2 sensor to collect the 3-dimensional coordinates of human joints and RGB video, uses a convolutional neural network and a recurrent neural network to obtain the temporal and spatio-temporal features of the human behavior data in the video, and effectively combines the two, finally obtaining a model that adapts to complex environments and actions. Given the corresponding data of a new video segment, the model recognizes the human actions in the video quickly and effectively, with good accuracy and stability. In particular:
(1) The invention uses a two-stream RNN/CNN neural network algorithm, which effectively takes into account the influence of action continuity on action recognition and can recognize and predict the action within a short time.
(2) The invention jointly considers the scene information and the action information, and matches action sequences in the action database using the scene information as a label, thereby completing the recognition of human actions more accurately.
(3) The invention provides an effective and practical system architecture, configured with a user interface module and an operation recording module, which improves the stability of human behavior recognition and facilitates concrete industrial applications of the architecture.
Drawings
FIG. 1 is a flow chart of 3D video motion recognition based on a convolutional neural network and a recurrent neural network.
FIG. 2 is a schematic diagram of a GRU-RNN-based neural network.
Fig. 3 is a schematic diagram of three-dimensional data sampling points of a human joint.
Detailed Description
The present invention is further illustrated below in conjunction with the accompanying drawings and specific embodiments. It should be understood that these examples are intended only to illustrate the invention and not to limit its scope, which is defined by the appended claims of this application; after reading the present disclosure, modifications of various equivalent forms by those skilled in the art fall within the scope defined by the claims.
A human behavior recognition method based on a convolutional neural network and a recurrent neural network, as shown in FIGS. 1-3, comprises the following steps:
Step 1), the user waves a hand 5 times in succession at a fixed position, and a Microsoft Kinect v2 sensor tracks the human body behavior. The joint sampling points of the Microsoft Kinect v2 sensor are: 1-spine base, 2-spine middle, 3-neck, 4-head, 5-left shoulder, 6-left elbow, 7-left wrist, 8-left hand, 9-right shoulder, 10-right elbow, 11-right wrist, 12-right hand, 13-left hip, 14-left knee, 15-left ankle, 16-left foot, 17-right hip, 18-right knee, 19-right ankle, 20-right foot, 21-spine shoulder, 22-left hand tip, 23-left thumb, 24-right hand tip, 25-right thumb. The time step is set to 16 milliseconds during collection, and the 3-dimensional coordinates of the user's 25 main joints over the time period are collected as V = {(x1, y1, z1), (x2, y2, z2), ..., (x25, y25, z25)}. The x axis is perpendicular to the vertical direction of the human body with its positive direction pointing forward, the y axis is parallel to the vertical direction of the human body with its positive direction pointing towards the head, and the z axis is perpendicular to the vertical direction of the human body with its positive direction pointing to the left side of the body. The RGB video of the scene over the same time period is collected simultaneously, and the joint coordinates are arranged into per-sample arrays as sketched below;
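For illustration, one way to arrange the collected joint coordinates into the 16 × (25 × 3) sample used later as the RNN training input is sketched here; read_joint_frame is a hypothetical routine standing in for whatever code returns the 25 (x, y, z) tuples of one 16-millisecond time step.

```python
import numpy as np

def build_sample(read_joint_frame, num_steps=16, num_joints=25):
    """Stack per-step joint coordinates into a (num_steps, num_joints*3) array."""
    frames = []
    for _ in range(num_steps):
        joints = read_joint_frame()      # hypothetical: list of 25 (x, y, z) tuples
        frames.append(np.asarray(joints, dtype=np.float32).reshape(num_joints * 3))
    return np.stack(frames)              # shape (16, 75), one motion sample
```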
Step 2), the 3-dimensional coordinates of the 25 human joints from step 1) are used as the training set of a recurrent neural network (RNN), and the RNN is trained on this set. Gated recurrent units (GRU) operate on the input coordinates inside the RNN, which uses batch normalization and dropout. The RNN comprises two bidirectional gated recurrent layers, a hidden fully connected layer and a softmax output layer; a rectified linear unit (ReLU) serves as the activation function, and a dropout mechanism is used when mapping the motion features to the hand-waving motion category, giving the trained recurrent neural network RNN;
step 21), the 3-dimensional coordinates of the 25 human joints from step 1) are used as the training set of the recurrent neural network RNN; the dimensionality of each training sample is 16 (time steps) × (25 × 3) (3-dimensional coordinates of the 25 joints);
step 22), a recurrent neural network is trained on the training set of the recurrent neural network RNN, first passing through two bidirectional gated recurrent layers, with gated recurrent units GRU as the basic units of the recurrent layers and a bidirectional recurrent neural network whose two directions process the input data from t = 1 to t = T and from t = T to t = 1 respectively, where t denotes the time variable in the data set and T denotes the last time step in the data set;
step 23), a GRU-based recurrent neural network RNN is used to classify the scene features. The update gate controls how much state information from the previous time step is carried into the current state: z_t = σ(W_z · [h_{t-1}, x_t]), where h_{t-1} is the value of the memory cell at time t-1, x_t is the input at time t, W_z is the weight matrix of the update gate, z_t is the value of the update gate at time t, and σ is the sigmoid activation function;
step 24), the reset gate controls how much information from the previous state is written into the candidate memory cell h̃_t at time t: r_t = σ(W_r · [h_{t-1}, x_t]), where h_{t-1} is the value of the memory cell at time t-1, x_t is the input at time t, W_r is the weight matrix of the reset gate, and r_t is the value of the reset gate at time t;
step 25), the value of the candidate memory cell is calculated as h̃_t = tanh(W · [r_t ⊙ h_{t-1}, x_t]), where h̃_t is the value of the candidate memory cell at time t, h_{t-1} is the value of the memory cell at time t-1, x_t is the input at time t, W is the weight matrix of the GRU unit at the current time, and tanh is the hyperbolic tangent function;
step 26), the state value of the memory cell at the current time is calculated as h_t = (1 − z_t) ⊙ h_{t-1} + z_t ⊙ h̃_t, where ⊙ denotes the element-wise (point-wise) product; the state update depends on the memory-cell value h_{t-1} at time t-1 and on the candidate memory-cell value h̃_t, the two terms being blended by the update gate, while the reset gate acts inside the candidate value of step 25);
step 27), finally the output of the recurrent neural network is obtained as y_t = σ(W · h_t) and passed to the fully connected layer, where the softmax activation function interprets the output as probabilities, y_t denoting the probability predicted by the recurrent neural network.
Step 3), the RGB video collected in step 1) is divided into consecutive RGB frames with a time step of 16 milliseconds, the resolution of the frames is adjusted from 1920 × 1080 pixels to 320 × 240 pixels, and the frames are used as the training set of the convolutional neural network CNN. The CNN convolves the input RGB frame stream with several convolution kernels and comprises convolutional layers, pooling layers, fully connected layers and a softmax output layer; the model is fine-tuned with parameters pre-trained on the Sports-1M data set to reduce overfitting and training time, giving the trained convolutional neural network CNN;
the method for obtaining the trained convolutional neural network CNN in the step 3) is as follows:
step 31), the RGB video collected in step 1) is divided into consecutive RGB frames with a time step of 16 milliseconds, the resolution of the frames is adjusted, and the RGB frame stream is used as the training set of the convolutional neural network CNN; the input to the CNN has size c × t × h × w, where c is the number of channels, t is the time step, and h and w are the height and width of the RGB frames;
step 32), the convolutional neural network CNN receives the parallelized video-frame input. For a convolution window of length h, a convolution kernel slides over the input matrix x_{1:n} and the kernel output is c_i = f(w · x_{i:i+h-1} + b), where w ∈ R^{h×d} is the weight of the convolution kernel (w takes values in the real domain and d denotes the dimension of x_i), b ∈ R is the bias, f is the activation function, and x_{i:i+h-1} is the frame-vector matrix inside the convolution window. Convolving over a frame stream of temporal length n yields the convolved feature vector c = [c_1, c_2, ..., c_{n-h+1}].
Step 33), the maximum value is extracted from each convolved feature vector (max-over-time pooling); for a window with m convolution kernels this yields the pooled feature vector ĉ = [ĉ_1, ĉ_2, ..., ĉ_m], where ĉ is the feature vector extracted by the convolutional neural network and ĉ_m is the feature value obtained from the m-th convolution kernel.
Step 34), the classification result is output through a softmax function: y = softmax(W_fc · (ĉ ⊙ r) + b_fc), where y is the probability predicted by the convolutional neural network, r is the regularization (dropout) mask constraining the output of the down-sampling layer, ⊙ denotes element-wise multiplication, W_fc is the weight matrix of the fully connected layer, and b_fc is the bias; the convolutional neural network CNN is optimized with a stochastic gradient descent optimizer.
Step 4), the output of the recurrent neural network RNN trained in step 2) is taken as the temporal feature vector of the 3-dimensional joint coordinate data, and the output of the convolutional neural network CNN trained in step 3) is taken as the spatio-temporal feature vector of the RGB video. The two feature vectors are concatenated into a feature array, the array is normalized and fed to a linear support vector machine (SVM) classifier, and a validation data set is used to find the penalty coefficient C of the linear SVM, which expresses the tolerance for classification errors, giving the comprehensive recognition model;
the method for obtaining the comprehensive identification model in the step 4) comprises the following steps:
Step 41), the recurrent neural network RNN trained in step 2) extracts features from the 3-dimensional human joint coordinate data and its output is taken as the temporal feature vector (600 dimensions); the convolutional neural network CNN trained in step 3) extracts features from the RGB video and its output is taken as the spatio-temporal feature vector (4096 dimensions); the temporal and spatio-temporal feature vectors are concatenated into a feature array (4696 dimensions) and normalized;
The normalization formula is x'_i = x_i / sqrt(Σ_{j=1..n} x_j²) (L2 normalization), where x_i denotes an element of the feature array and x'_i denotes the corresponding element of the normalized feature vector.
Step 42), the normalized feature vector groups are taken as input, the specific action or behavior mark corresponding to each normalized feature vector group is taken as output, and the pairs are submitted to a linear support vector machine (SVM) for training, wherein the optimization model of the SVM is as follows:

min over ω, b_0, ξ of (1/2)·||ω||² + C · Σ_{i=1..N} ξ_i
s.t. y_i(ω^T · x_i + b_0) ≥ 1 − ξ_i, ξ_i ≥ 0, i = 1, 2, ..., N

where ω represents the weight vector (the normal of the separating hyperplane), x_i the i-th input feature vector, C the penalty coefficient, ξ_i the classification loss (slack) of the i-th sample point, y_i the action mark corresponding to each sample, b_0 the intercept, and N the total number of feature vectors input to the SVM;
and 43) finding the optimal value of the penalty coefficient C through the training and verification set to obtain a comprehensive recognition model, and finally verifying the motion category in the human body by using the model.
Step 5), during recognition, the 3-dimensional coordinates of the 25 joints and the RGB video of the human behavior to be recognized are collected with the method of step 1). The collected joint coordinates are fed to the recurrent neural network RNN trained in step 2) to obtain a temporal feature vector, and the collected RGB video is fed to the convolutional neural network CNN trained in step 3) to obtain a spatio-temporal feature vector. The two vectors are concatenated into a feature array, the array is normalized and passed to the comprehensive recognition model, which recognizes the human behavior.
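An end-to-end sketch of the recognition flow of step 5) is given below; rnn_model, cnn_model and svm_clf are placeholders for the trained recurrent network, the trained convolutional network and the trained linear SVM, and the exact calling conventions and feature shapes beyond those stated in the text are assumptions.

```python
import numpy as np

def recognize(joint_sample, rgb_clip, rnn_model, cnn_model, svm_clf):
    """joint_sample: (16, 75) joint coordinates; rgb_clip: (3, 16, h, w) RGB frames."""
    t_feat = rnn_model(joint_sample)                               # 600-dim temporal feature
    st_feat = cnn_model(rgb_clip)                                  # 4096-dim spatio-temporal feature
    fused = np.concatenate([np.ravel(t_feat), np.ravel(st_feat)])  # 4696-dim feature array
    fused = fused / np.linalg.norm(fused)                          # L2 normalization
    return svm_clf.predict(fused.reshape(1, -1))[0]                # predicted action label
```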
Simulation:
a user continuously waves hands 5 times at a fixed position, uses a Microsoft Kinect V2 sensor to track human body behaviors, sets a time step length at the collection time to be 16 milliseconds, and collects a 3-dimensional coordinate vector set V { (x) of 25 joints which are main for the user in the time period1,y1,z1),(x2,y2,z2),...,(x25,y25,z25) }; each joint has 25 3D coordinates. This motion sample dimension is processed as 16 (time step) × (25 × 3) (3-dimensional joint coordinates) and is taken as a training set of the recurrent neural network.
The recurrent neural network is trained on multiple GPUs with a learning rate of 0.001 and a decay rate of 0.9. The single-layer model is trained from scratch with mini-batches of 1000 sequences, and the two-layer model with mini-batches of 650 sequences. For all RNN networks, 300 neurons are used in each unidirectional layer and the number of neurons is doubled in the bidirectional layers; dropout with a keep probability of 75% is applied, and finally the temporal feature vector (600 dimensions) of the 3D human joint coordinate data is obtained through a fully connected layer.
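A PyTorch sketch matching this simulated configuration (two bidirectional GRU layers with 300 units per direction, a fully connected layer, and dropout with a keep probability of 0.75) might look as follows; PyTorch itself, the class count, the choice of RMSprop and the reading of the stated decay rate 0.9 as its smoothing constant are all assumptions, as is taking the 600-dimensional temporal feature from the last time step.

```python
import torch
import torch.nn as nn

class JointGRU(nn.Module):
    def __init__(self, in_dim=75, hidden=300, num_classes=10):
        super().__init__()
        # Two stacked bidirectional GRU layers, 300 units per direction (600 combined).
        self.gru = nn.GRU(in_dim, hidden, num_layers=2,
                          bidirectional=True, batch_first=True)
        self.drop = nn.Dropout(p=0.25)                 # keep probability 0.75
        self.fc = nn.Linear(2 * hidden, num_classes)   # fully connected output layer

    def forward(self, x):                              # x: (batch, 16, 75) joint sequences
        out, _ = self.gru(x)
        feat = self.drop(out[:, -1, :])                # 600-dim temporal feature, last step
        return self.fc(feat)                           # class scores; softmax in the loss

model = JointGRU()                                               # num_classes=10 is illustrative
optimizer = torch.optim.RMSprop(model.parameters(), lr=0.001, alpha=0.9)
logits = model(torch.randn(8, 16, 75))                           # dummy batch of 8 sequences
```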
The RGB video is then imported for convolutional neural network training. Frames are extracted from the RGB video using a 3D-CNN model in Caffe and cropped from 1920 × 1080 pixels to 320 × 240 pixels with a time step of 16 milliseconds; we refer to the input of the CNN model as having size c × t × h × w, where c is the number of channels, t is the time step, and h and w are the height and width of the RGB frames, respectively. The network takes a video clip as input, and the data are marked with the hand-waving motion label. The input RGB frames are then resized to a resolution of 128 × 171 pixels, so the input size becomes 3 × 16 × 128 × 171, and the network is trained with a stochastic gradient descent optimizer with an initial learning rate of 0.0001; when no training progress is observed, the learning rate is halved. Finally, the spatio-temporal feature vector (4096 dimensions) of the RGB video is obtained through the fully connected layers.
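A sketch of the clip preparation with OpenCV is given below; the library choice, the file path and the omission of the intermediate 320 × 240 crop are assumptions, and only the 16-frame clip length, the 128 × 171 target resolution and the 3 × 16 × 128 × 171 input layout are taken from the text.

```python
import cv2
import numpy as np

def load_clip(video_path, num_frames=16, size=(171, 128)):
    """Read 16 consecutive RGB frames and return an array of shape (3, 16, 128, 171)."""
    cap = cv2.VideoCapture(video_path)
    frames = []
    while len(frames) < num_frames:
        ok, frame = cap.read()
        if not ok:
            break
        frame = cv2.resize(frame, size)                  # (128, 171, 3) after resize
        frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    cap.release()
    clip = np.stack(frames).astype(np.float32)           # (16, 128, 171, 3)
    return clip.transpose(3, 0, 1, 2)                    # (c, t, h, w) = (3, 16, 128, 171)

clip = load_clip("waving_hand.avi")                      # placeholder path
```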
For feature fusion, we concatenate the temporal feature vector extracted by the RNN with the spatio-temporal feature vector extracted by the CNN into a feature array (4696 dimensions) and perform L2 normalization. Finally, the fused RNN/CNN features are evaluated with training, validation and test splits; the training and validation splits are used to find the optimal value of the parameter C of the linear SVM model, and the comprehensive model is used to verify the accuracy of motion recognition.
The method can solve the problem of overfitting of the model to action classification in the model training process, and can effectively improve the human behavior recognition efficiency and accuracy.
The above description is only of the preferred embodiments of the present invention, and it should be noted that: it will be apparent to those skilled in the art that various modifications and adaptations can be made without departing from the principles of the invention, and such modifications and adaptations are intended to be within the scope of the invention.

Claims (5)

1. A human behavior recognition method based on a convolutional neural network and a cyclic neural network is characterized by comprising the following steps:
step 1), a user waves a hand 5 times in succession at a fixed position, a sensor tracks the human body behavior with the time step set to 16 milliseconds during collection, and the 3-dimensional coordinates of the user's 25 main joints over the time period are collected as V = {(x1, y1, z1), (x2, y2, z2), ..., (x25, y25, z25)}; wherein the x axis is perpendicular to the vertical direction of the human body with its positive direction pointing forward, the y axis is parallel to the vertical direction of the human body with its positive direction pointing towards the head, and the z axis is perpendicular to the vertical direction of the human body with its positive direction pointing to the left side of the body; the RGB video of the scene over the same time period is collected simultaneously;
step 2), the 3-dimensional coordinates of the 25 human joints from step 1) are used as the training set of a recurrent neural network (RNN), and the RNN is trained on this set; gated recurrent units (GRU) operate on the input coordinates inside the RNN, which uses batch normalization and dropout; the RNN comprises two bidirectional gated recurrent layers, a hidden fully connected layer and a softmax output layer; a rectified linear unit (ReLU) serves as the activation function, and a dropout mechanism is used when mapping the motion features to the hand-waving motion category, giving the trained recurrent neural network RNN;
the method for obtaining the trained recurrent neural network RNN is as follows:
step 21), taking the 3-dimensional coordinates of the 25 joints of the human body in the step 1) as a training set of a Recurrent Neural Network (RNN), wherein the dimensionality of the training set of the Recurrent Neural Network (RNN) is 16 x (25 x 3);
step 22), a recurrent neural network is trained on the training set of the recurrent neural network RNN, first passing through two bidirectional gated recurrent layers, with gated recurrent units GRU as the basic units of the recurrent layers and a bidirectional recurrent neural network whose two directions process the input data from t = 1 to t = T and from t = T to t = 1 respectively, where t denotes the time variable in the data set and T denotes the last time step in the data set;
step 23), a GRU-based recurrent neural network RNN is used to classify the scene features; the update gate controls how much state information from the previous time step is carried into the current state: z_t = σ(W_z · [h_{t-1}, x_t]), where h_{t-1} is the value of the memory cell at time t-1, x_t is the input at time t, W_z is the weight matrix of the update gate, z_t is the value of the update gate at time t, and σ is the sigmoid activation function;
step 24), the reset gate controls how much information from the previous state is written into the candidate memory cell h̃_t at time t: r_t = σ(W_r · [h_{t-1}, x_t]), where h_{t-1} is the value of the memory cell at time t-1, x_t is the input at time t, W_r is the weight matrix of the reset gate, and r_t is the value of the reset gate at time t;
step 25), the value of the candidate memory cell is calculated as h̃_t = tanh(W · [r_t ⊙ h_{t-1}, x_t]), where h̃_t is the value of the candidate memory cell at time t, h_{t-1} is the value of the memory cell at time t-1, x_t is the input at time t, W is the weight matrix of the GRU unit at the current time, and tanh is the hyperbolic tangent function;
step 26), the state value of the memory cell at the current time is calculated as h_t = (1 − z_t) ⊙ h_{t-1} + z_t ⊙ h̃_t, where ⊙ denotes the element-wise (point-wise) product; the state update depends on the memory-cell value h_{t-1} at time t-1 and on the candidate memory-cell value h̃_t, the two terms being blended by the update gate, while the reset gate acts inside the candidate value of step 25);
step 27), finally the output of the recurrent neural network is obtained as y_t = σ(W · h_t) and passed to the fully connected layer, where the softmax activation function interprets the output as probabilities, y_t denoting the probability predicted by the recurrent neural network;
step 3), the RGB video collected in step 1) is divided into consecutive RGB frames with a time step of 16 milliseconds, the resolution of the frames is adjusted, and the frames are used as the training set of a convolutional neural network (CNN); the CNN convolves the input RGB frame stream with several convolution kernels and comprises convolutional layers, pooling layers, fully connected layers and a softmax output layer; the model is fine-tuned with parameters pre-trained on the Sports-1M data set to reduce overfitting and training time, giving the trained convolutional neural network CNN;
the method for obtaining the trained convolutional neural network CNN is as follows:
step 31), the RGB video collected in step 1) is divided into consecutive RGB frames with a time step of 16 milliseconds, the resolution of the frames is adjusted, and the RGB frame stream is used as the training set of the convolutional neural network CNN; the input to the CNN has size c × t × h × w, where c is the number of channels, t is the time step, and h and w are the height and width of the RGB frames;
step 32), the convolutional neural network CNN receives the parallelized video-frame input; for a convolution window of length h, a convolution kernel slides over the input matrix x_{1:n} and the kernel output is c_i = f(w · x_{i:i+h-1} + b), where w ∈ R^{h×d} is the weight of the convolution kernel (w takes values in the real domain and d denotes the dimension of x_i), b ∈ R is the bias, f is the activation function, and x_{i:i+h-1} is the frame-vector matrix inside the convolution window; convolving over a frame stream of temporal length n yields the convolved feature vector c = [c_1, c_2, ..., c_{n-h+1}];
Step 33), the maximum value is extracted from each convolved feature vector (max-over-time pooling); for a window with m convolution kernels this yields the pooled feature vector ĉ = [ĉ_1, ĉ_2, ..., ĉ_m], where ĉ is the feature vector extracted by the convolutional neural network and ĉ_m is the feature value obtained from the m-th convolution kernel;
and step 34), the classification result is output through a softmax function: y = softmax(W_fc · (ĉ ⊙ r) + b_fc), where y is the probability predicted by the convolutional neural network, r is the regularization (dropout) mask constraining the output of the down-sampling layer, ⊙ denotes element-wise multiplication, W_fc is the weight matrix of the fully connected layer, and b_fc is the bias; the convolutional neural network CNN is optimized with a stochastic gradient descent optimizer;
step 4), the output of the recurrent neural network RNN trained in step 2) is taken as the temporal feature vector of the 3-dimensional joint coordinate data, and the output of the convolutional neural network CNN trained in step 3) is taken as the spatio-temporal feature vector of the RGB video; the two feature vectors are concatenated, the concatenated vector is normalized and fed to a linear support vector machine (SVM) classifier, and a validation data set is used to find the penalty coefficient C of the linear SVM, giving the comprehensive recognition model;
the method for obtaining the comprehensive identification model comprises the following steps:
step 41), the recurrent neural network RNN trained in step 2) extracts features from the 3-dimensional human joint coordinate data and its output is taken as the temporal feature vector; the convolutional neural network CNN trained in step 3) extracts features from the RGB video and its output is taken as the spatio-temporal feature vector; the temporal and spatio-temporal feature vectors are concatenated into a feature array and normalized;
step 42), the normalized feature vector groups are taken as input, the specific action or behavior mark corresponding to each normalized feature vector group is taken as output, and the pairs are submitted to a linear support vector machine (SVM) for training, wherein the optimization model is as follows:

min over ω, b_0, ξ of (1/2)·||ω||² + C · Σ_{i=1..N} ξ_i
s.t. y_i(ω^T · x_i + b_0) ≥ 1 − ξ_i, ξ_i ≥ 0, i = 1, 2, ..., N

where ω represents the weight vector (the normal of the separating hyperplane), x_i the i-th input feature vector, C the penalty coefficient, ξ_i the classification loss (slack) of the i-th sample point, y_i the action mark corresponding to each sample, b_0 the intercept, and N the total number of feature vectors input to the SVM;
step 43), finding the optimal value of the penalty coefficient C through a training and verification set to obtain a comprehensive identification model;
and step 5), during recognition, the 3-dimensional coordinates of the 25 joints and the RGB video of the human behavior to be recognized are collected with the method of step 1); the collected joint coordinates are fed to the recurrent neural network RNN trained in step 2) to obtain a temporal feature vector, and the collected RGB video is fed to the convolutional neural network CNN trained in step 3) to obtain a spatio-temporal feature vector; the two vectors are concatenated into a feature array, the array is normalized and passed to the comprehensive recognition model, which recognizes the human behavior.
2. The human behavior recognition method based on the convolutional neural network and the cyclic neural network as claimed in claim 1, wherein: the sensor in step 1) is a Microsoft Kinect v2 sensor.
3. The human behavior recognition method based on the convolutional neural network and the cyclic neural network as claimed in claim 2, wherein: in step 33), when the convolutional neural network CNN is optimized by the stochastic gradient descent optimizer, the initial learning rate is 0.0001, and when no training progress is observed, the learning rate is reduced by half.
4. The human behavior recognition method based on the convolutional neural network and the cyclic neural network as claimed in claim 3, wherein the normalization formula in step 41) is x'_i = x_i / sqrt(Σ_{j=1..n} x_j²), where x_i denotes an element of the feature array and x'_i denotes the corresponding element of the normalized feature vector.
5. The human behavior recognition method based on the convolutional neural network and the cyclic neural network as claimed in claim 4, wherein: step 31) the resolution is adjusted from 1920 x 1080 pixels to 320 x 240 pixels.
CN201910580116.XA 2019-06-28 2019-06-28 Human body behavior identification method based on convolutional neural network and cyclic neural network Active CN110321833B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910580116.XA CN110321833B (en) 2019-06-28 2019-06-28 Human body behavior identification method based on convolutional neural network and cyclic neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910580116.XA CN110321833B (en) 2019-06-28 2019-06-28 Human body behavior identification method based on convolutional neural network and cyclic neural network

Publications (2)

Publication Number Publication Date
CN110321833A CN110321833A (en) 2019-10-11
CN110321833B true CN110321833B (en) 2022-05-20

Family

ID=68121381

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910580116.XA Active CN110321833B (en) 2019-06-28 2019-06-28 Human body behavior identification method based on convolutional neural network and cyclic neural network

Country Status (1)

Country Link
CN (1) CN110321833B (en)

Families Citing this family (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110880172A (en) * 2019-11-12 2020-03-13 中山大学 Video face tampering detection method and system based on cyclic convolution neural network
CN111079928B (en) * 2019-12-14 2023-07-07 大连大学 Method for predicting human body movement by using circulating neural network based on countermeasure learning
CN111597881B (en) * 2020-04-03 2022-04-05 浙江工业大学 Human body complex behavior identification method based on data separation multi-scale feature combination
CN111459283A (en) * 2020-04-07 2020-07-28 电子科技大学 Man-machine interaction implementation method integrating artificial intelligence and Web3D
CN111681321B (en) * 2020-06-05 2023-07-04 大连大学 Method for synthesizing three-dimensional human motion by using cyclic neural network based on layered learning
CN111914638B (en) * 2020-06-29 2022-08-12 南京邮电大学 Character action recognition method based on improved long-term recursive deep convolution model
CN111860269B (en) * 2020-07-13 2024-04-16 南京航空航天大学 Multi-feature fusion series RNN structure and pedestrian prediction method
CN112232489A (en) * 2020-10-26 2021-01-15 南京明德产业互联网研究院有限公司 Method and device for gating cycle network and method and device for link prediction
CN112488014B (en) * 2020-12-04 2022-06-10 重庆邮电大学 Video prediction method based on gated cyclic unit
CN112906509A (en) * 2021-01-28 2021-06-04 浙江省隧道工程集团有限公司 Method and system for identifying operation state of water delivery tunnel excavator
CN113568819B (en) * 2021-01-31 2024-04-16 腾讯科技(深圳)有限公司 Abnormal data detection method, device, computer readable medium and electronic equipment
CN112906383B (en) * 2021-02-05 2022-04-19 成都信息工程大学 Integrated adaptive water army identification method based on incremental learning
CN113111756B (en) * 2021-04-02 2024-05-03 浙江工业大学 Human body fall recognition method based on human body skeleton key points and long-short-term memory artificial neural network
CN113378638B (en) * 2021-05-11 2023-12-22 大连海事大学 Method for identifying abnormal behavior of turbine operator based on human body joint point detection and D-GRU network
CN114091596A (en) * 2021-11-15 2022-02-25 长安大学 Problem behavior recognition system and method for barrier population
CN114399841A (en) * 2022-01-25 2022-04-26 台州学院 Human behavior recognition method under man-machine cooperation assembly scene
CN116911955B (en) * 2023-09-12 2024-01-05 深圳须弥云图空间科技有限公司 Training method and device for target recommendation model

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107943967A (en) * 2017-11-28 2018-04-20 华南理工大学 Algorithm of documents categorization based on multi-angle convolutional neural networks and Recognition with Recurrent Neural Network
CN108280406A (en) * 2017-12-30 2018-07-13 广州海昇计算机科技有限公司 A kind of Activity recognition method, system and device based on segmentation double-stream digestion
CN109117701A (en) * 2018-06-05 2019-01-01 东南大学 Pedestrian's intension recognizing method based on picture scroll product
AU2018101512A4 (en) * 2018-10-11 2018-11-15 Dong, Xun Miss A comprehensive stock trend predicting method based on neural networks

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Combination of CNN-GRU Model to Recognize Characters of a License Plate number without Segmentation; Bhargavi Suvarnam et al.; 2019 5th International Conference on Advanced Computing & Communication Systems (ICACCS); 20190606; pp. 317-322 *
Research and Implementation of Action Recognition Methods Based on Pose and Skeleton Information; 马静; China Master's Theses Full-text Database, Information Science and Technology; 20190115 (No. 12); pp. 9-12, 18-28, 35 *
Research on a Text Sentiment Analysis Method Based on an Attention Mechanism and a BGRU Network; 尹良亮 et al.; Wireless Internet Technology; 20190531 (No. 9); pp. 27-28 *
A Spatio-temporal Two-stream Human Action Recognition Model Based on Video Deep Learning; 杨天明 et al.; Journal of Computer Applications; 20180310 (No. 3); pp. 895-899 *

Also Published As

Publication number Publication date
CN110321833A (en) 2019-10-11

Similar Documents

Publication Publication Date Title
CN110321833B (en) Human body behavior identification method based on convolutional neural network and cyclic neural network
Zhang et al. Dynamic hand gesture recognition based on short-term sampling neural networks
Reddy et al. Spontaneous facial micro-expression recognition using 3D spatiotemporal convolutional neural networks
CN110096950B (en) Multi-feature fusion behavior identification method based on key frame
CN106682598B (en) Multi-pose face feature point detection method based on cascade regression
CN112784763B (en) Expression recognition method and system based on local and overall feature adaptive fusion
Cui Applying gradient descent in convolutional neural networks
CN108932500A (en) A kind of dynamic gesture identification method and system based on deep neural network
US9317785B1 (en) Method and system for determining ethnicity category of facial images based on multi-level primary and auxiliary classifiers
Littlewort et al. Dynamics of facial expression extracted automatically from video
CN102930302B (en) Based on the incrementally Human bodys' response method of online sequential extreme learning machine
CN110188637A (en) A kind of Activity recognition technical method based on deep learning
CN112528928B (en) Commodity identification method based on self-attention depth network
Caroppo et al. Comparison between deep learning models and traditional machine learning approaches for facial expression recognition in ageing adults
CN108182397B (en) Multi-pose multi-scale human face verification method
CN110575663B (en) Physical education auxiliary training method based on artificial intelligence
CN107767416B (en) Method for identifying pedestrian orientation in low-resolution image
CN110084211B (en) Action recognition method
Zhang et al. BoMW: Bag of manifold words for one-shot learning gesture recognition from kinect
CN109063626A (en) Dynamic human face recognition methods and device
Lu et al. Automatic lip reading using convolution neural network and bidirectional long short-term memory
CN113255602A (en) Dynamic gesture recognition method based on multi-modal data
CN107220597B (en) Key frame selection method based on local features and bag-of-words model human body action recognition process
Zheng et al. Attention assessment based on multi‐view classroom behaviour recognition
Mohana et al. Emotion recognition from facial expression using hybrid CNN–LSTM network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant