CN110321833B - Human body behavior identification method based on convolutional neural network and cyclic neural network - Google Patents

Human body behavior identification method based on convolutional neural network and cyclic neural network Download PDF

Info

Publication number
CN110321833B
Authority
CN
China
Prior art keywords
neural network
time
convolutional neural
rnn
human body
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910580116.XA
Other languages
Chinese (zh)
Other versions
CN110321833A (en)
Inventor
谢子凡
陈志�
岳文静
葛宇轩
王多
崔明浩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Posts and Telecommunications
Priority to CN201910580116.XA priority Critical patent/CN110321833B/en
Publication of CN110321833A publication Critical patent/CN110321833A/en
Application granted granted Critical
Publication of CN110321833B publication Critical patent/CN110321833B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20Movements or behaviour, e.g. gesture recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Molecular Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a human body behavior recognition method based on a convolutional neural network and a recurrent neural network. A sensor tracks human body behavior and collects, over a time period, a group of 3-dimensional coordinate vectors of the human joints together with an RGB video. A recurrent neural network (RNN) is then trained on the 3-dimensional joint coordinates to obtain a temporal feature vector, and a convolutional neural network (CNN) is trained on the RGB video to obtain a spatio-temporal feature vector. Finally, the temporal and spatio-temporal feature vectors are concatenated and normalized, fed to a linear SVM classifier, and a validation data set is used to find the parameter C of the linear support vector machine (SVM), yielding a comprehensive recognition model. The method alleviates overfitting of the model to the action classes during training and effectively improves the efficiency and accuracy of human behavior recognition.

Description

Human body behavior identification method based on convolutional neural network and cyclic neural network
Technical Field
The invention relates to a human body behavior recognition method based on a convolutional neural network and a recurrent neural network, and belongs to the interdisciplinary field spanning behavior recognition, deep learning and machine learning.
Background
Behavior recognition and classification in videos is an important research topic in computer vision, and it has become a research hotspot owing to its wide application in video tracking, motion analysis, virtual reality and artificial-intelligence interaction.
The same action can look different under different illumination, viewing angles, backgrounds and other conditions, and the same object and action can also exhibit obvious differences in appearance and posture across action scenes. Even in a fixed scene the human body has many degrees of freedom, so instances of the same action differ greatly in direction, angle and other respects. In addition, partial occlusion, individual differences and similar problems reflect the spatial complexity of action recognition. Hand-crafted feature design makes it difficult to capture the essential characteristics of an object in a rapidly changing scene, so a more general feature extraction method is needed to overcome the one-sidedness and blindness of manual feature engineering.
At present, many motion recognition algorithms are shallow learning methods with inherent limitations: with limited training samples, their capacity to represent complex functions is restricted, and so is their generalization ability on complex classification problems.
The motion recognition problem can be regarded as a classification problem, and many classification methods have been applied to it, notably logistic regression, decision tree models, naive Bayes classifiers, and support vector machines. Each of these methods has advantages and disadvantages in practical applications.
3D video is widely used in television and general video content, yet the technology for behavior recognition in 3D video is not mature: most human behavior recognition systems rely on manually annotated data that are then fed into models for recognition. Such an approach depends heavily on the data, runs inefficiently, and does not meet the requirements of industrial and commercial deployment.
Disclosure of Invention
The purpose of the invention is as follows: in order to overcome the defects in the prior art, the invention provides a human body behavior recognition method based on a convolutional neural network and a recurrent neural network.
Technical scheme: in order to achieve the above purpose, the invention adopts the following technical scheme:
a human body behavior recognition method based on a convolutional neural network and a cyclic neural network comprises the steps of firstly enabling a user to continuously swing a hand at a fixed position for 5 times, tracking human body behaviors by using a Microsoft Kinect v2 sensor, and collecting 3-dimensional coordinate vector groups of 25 main joints of the user in a time period and RGB videos in the time period by taking 16 milliseconds as time step. Then, a gated circulation unit GRU is used as a basic unit of a circulation layer, and a bidirectional circulation neural network is used for training a 3-dimensional coordinate data set of the human body joint, so that a time feature vector of the 3-dimensional coordinate data of the human body joint is obtained. Dividing the RGB video into continuous RGB frames with 16 milliseconds as time step length, training a convolutional neural network on the data set to obtain a space-time characteristic vector of the RGB video, finally combining the output results of the two cyclic neural networks and the convolutional neural network, normalizing the output results after connection, feeding the normalized result to a classifier of a linear SVM, using a verification data set to find a parameter C of the linear support vector machine SVM, and finally obtaining a comprehensive recognition model, wherein the comprehensive recognition model specifically comprises the following steps:
step 1), the user waves a hand 5 times in succession at a fixed position, a sensor tracks the human body behavior with the time step set to 16 milliseconds during collection, and the 3-dimensional coordinates of the user's 25 main joints over the time period are collected as V = {(x1, y1, z1), (x2, y2, z2), ..., (x25, y25, z25)}; the x axis is perpendicular to the vertical direction of the human body with its positive direction pointing forward, the y axis is parallel to the vertical direction of the human body with its positive direction pointing towards the head, and the z axis is perpendicular to the vertical direction of the human body with its positive direction pointing to the left side of the body; the RGB video of the scene over the same time period is collected simultaneously;
step 2), the 3-dimensional coordinates of the 25 human joints from step 1) are used as the training set of a recurrent neural network (RNN), and the RNN is trained on this set. Gated recurrent units (GRU) operate on the input coordinates inside the RNN, which uses batch normalization and dropout. The RNN comprises two bidirectional gated recurrent layers, a hidden fully connected layer and a softmax output layer; a rectified linear unit (ReLU) serves as the activation function, and a dropout mechanism is used when mapping the motion features to the hand-waving motion category, giving the trained recurrent neural network RNN;
step 3), the RGB video collected in step 1) is divided into consecutive RGB frames with a time step of 16 milliseconds, the resolution of the frames is adjusted, and the frames are used as the training set of a convolutional neural network (CNN). The CNN convolves the input RGB frame stream with several convolution kernels and comprises convolutional layers, pooling layers, fully connected layers and a softmax output layer; the model is fine-tuned with parameters pre-trained on the Sports-1M data set to reduce overfitting and training time, giving the trained convolutional neural network CNN;
step 4), the output of the recurrent neural network RNN trained in step 2) is taken as the temporal feature vector of the 3-dimensional joint coordinate data, and the output of the convolutional neural network CNN trained in step 3) is taken as the spatio-temporal feature vector of the RGB video. The two feature vectors are concatenated, the concatenated vector is normalized and fed to a linear support vector machine (SVM) classifier, and a validation data set is used to find the penalty coefficient C of the linear SVM, giving the comprehensive recognition model;
and step 5), during recognition, the 3-dimensional coordinates of the 25 joints and the RGB video of the human behavior to be recognized are collected with the method of step 1). The collected joint coordinates are fed to the recurrent neural network RNN trained in step 2) to obtain a temporal feature vector, and the collected RGB video is fed to the convolutional neural network CNN trained in step 3) to obtain a spatio-temporal feature vector. The two vectors are concatenated into a feature array, the array is normalized and passed to the comprehensive recognition model, which recognizes the human behavior.
Preferably: the sensor in step 1) is a Microsoft Kinect v2 sensor.
Preferably: the method for obtaining the trained recurrent neural network RNN in the step 2) comprises the following steps:
step 21), taking the 3-dimensional coordinates of the 25 joints of the human body in the step 1) as a training set of the recurrent neural network RNN, wherein the dimensionality of the training set of the recurrent neural network RNN is 16 x (25 x 3);
step 22), a recurrent neural network is trained on the training set of the recurrent neural network RNN, first passing through two bidirectional gated recurrent layers, with gated recurrent units GRU as the basic units of the recurrent layers and a bidirectional recurrent neural network whose two directions process the input data from t = 1 to t = T and from t = T to t = 1 respectively, where t denotes the time variable in the data set and T denotes the last time step in the data set;
step 23), a GRU-based recurrent neural network RNN is used to classify the scene features. The update gate controls how much state information from the previous time step is carried into the current state: z_t = σ(W_z · [h_{t-1}, x_t]), where h_{t-1} is the value of the memory cell at time t-1, x_t is the input at time t, W_z is the weight matrix of the update gate, z_t is the value of the update gate at time t, and σ is the sigmoid activation function;
step 24), the reset gate controls how much information from the previous state is written into the candidate memory cell h̃_t at time t: r_t = σ(W_r · [h_{t-1}, x_t]), where h_{t-1} is the value of the memory cell at time t-1, x_t is the input at time t, W_r is the weight matrix of the reset gate, and r_t is the value of the reset gate at time t;
step 25), the value of the candidate memory cell is calculated as h̃_t = tanh(W · [r_t ⊙ h_{t-1}, x_t]), where h̃_t is the value of the candidate memory cell at time t, h_{t-1} is the value of the memory cell at time t-1, x_t is the input at time t, W is the weight matrix of the GRU unit at the current time, and tanh is the hyperbolic tangent function;
step 26), the state value of the memory cell at the current time is calculated as h_t = (1 − z_t) ⊙ h_{t-1} + z_t ⊙ h̃_t, where ⊙ denotes the element-wise (point-wise) product; the state update depends on the memory-cell value h_{t-1} at time t-1 and on the candidate memory-cell value h̃_t, the two terms being blended by the update gate, while the reset gate acts inside the candidate value of step 25);
step 27), finally the output of the recurrent neural network is obtained as y_t = σ(W · h_t) and passed to the fully connected layer, where the softmax activation function interprets the output as probabilities, y_t denoting the probability predicted by the recurrent neural network.
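For illustration, a minimal NumPy sketch of the gate computations in steps 23) to 27) is given below; the input size (75 = 25 joints × 3 coordinates), the hidden size of 300, the random weight initialization and the output dimension are illustrative assumptions rather than values fixed by the method, and the softmax of step 27) is left to the subsequent fully connected layer.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev, Wz, Wr, W, Wy):
    """One GRU step following steps 23)-27): update gate, reset gate,
    candidate memory cell, state update, and the pre-softmax output."""
    xh = np.concatenate([h_prev, x_t])                           # [h_{t-1}, x_t]
    z_t = sigmoid(Wz @ xh)                                       # update gate, step 23)
    r_t = sigmoid(Wr @ xh)                                       # reset gate, step 24)
    h_cand = np.tanh(W @ np.concatenate([r_t * h_prev, x_t]))    # candidate cell, step 25)
    h_t = (1.0 - z_t) * h_prev + z_t * h_cand                    # state update, step 26)
    y_t = Wy @ h_t                                               # output before softmax, step 27)
    return h_t, y_t

# Illustrative sizes: 75-dimensional joint input, 300 hidden units, 10 output classes.
rng = np.random.default_rng(0)
x_t, h_prev = rng.normal(size=75), np.zeros(300)
Wz = rng.normal(scale=0.01, size=(300, 375))
Wr = rng.normal(scale=0.01, size=(300, 375))
W  = rng.normal(scale=0.01, size=(300, 375))
Wy = rng.normal(scale=0.01, size=(10, 300))
h_t, y_t = gru_step(x_t, h_prev, Wz, Wr, W, Wy)
```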
Preferably: the method for obtaining the trained convolutional neural network CNN in the step 3) is as follows:
step 31), the RGB video collected in step 1) is divided into consecutive RGB frames with a time step of 16 milliseconds, the resolution of the frames is adjusted, and the RGB frame stream is used as the training set of the convolutional neural network CNN; the input to the CNN has size c × t × h × w, where c is the number of channels, t is the time step, and h and w are the height and width of the RGB frames;
step 32), the convolutional neural network CNN receives the parallelized video-frame input. For a convolution window of length h, a convolution kernel slides over the input matrix x_{1:n} and the kernel output is c_i = f(w · x_{i:i+h-1} + b), where w ∈ R^{h×d} is the weight of the convolution kernel (w takes values in the real domain and d denotes the dimension of x_i), b ∈ R is the bias, f is the activation function, and x_{i:i+h-1} is the frame-vector matrix inside the convolution window. Convolving over a frame stream of temporal length n yields the convolved feature vector c = [c_1, c_2, ..., c_{n-h+1}].
Step 33), the maximum value is extracted from each convolved feature vector (max-over-time pooling); for a window with m convolution kernels this yields the pooled feature vector ĉ = [ĉ_1, ĉ_2, ..., ĉ_m], where ĉ is the feature vector extracted by the convolutional neural network and ĉ_m is the feature value obtained from the m-th convolution kernel.
Step 34), the classification result is output through a softmax function: y = softmax(W_fc · (ĉ ⊙ r) + b_fc), where y is the probability predicted by the convolutional neural network, r is the regularization (dropout) mask constraining the output of the down-sampling layer, ⊙ denotes element-wise multiplication, W_fc is the weight matrix of the fully connected layer, and b_fc is the bias; the convolutional neural network CNN is optimized with a stochastic gradient descent optimizer.
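For illustration, a minimal NumPy sketch of steps 32) to 34) follows; the per-frame feature dimension, window length, number of kernels and class count are illustrative assumptions, ReLU is assumed as the activation function f, and the dropout mask r of step 34) is omitted for brevity.

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def conv_maxpool_softmax(X, kernels, biases, W_fc, b_fc):
    """X: (n, d) frame features; kernels: (m, h, d); returns class probabilities.
    Implements c_i = f(w . x_{i:i+h-1} + b), max-over-time pooling, and softmax."""
    n, d = X.shape
    m, h, _ = kernels.shape
    pooled = np.empty(m)
    for k in range(m):
        c = np.array([np.maximum(0.0, np.sum(kernels[k] * X[i:i + h]) + biases[k])
                      for i in range(n - h + 1)])   # ReLU conv outputs c_1..c_{n-h+1}, step 32)
        pooled[k] = c.max()                          # max-over-time pooling, step 33)
    return softmax(W_fc @ pooled + b_fc)             # classification output, step 34)

rng = np.random.default_rng(0)
X = rng.normal(size=(16, 128))                       # 16 frames, 128-dim features (illustrative)
kernels = rng.normal(scale=0.1, size=(8, 3, 128))    # 8 kernels, window length 3
biases = np.zeros(8)
W_fc = rng.normal(scale=0.1, size=(5, 8))            # 5 illustrative classes
b_fc = np.zeros(5)
probs = conv_maxpool_softmax(X, kernels, biases, W_fc, b_fc)
```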
Preferably: the method for obtaining the comprehensive identification model in the step 4) comprises the following steps:
step 41), the recurrent neural network RNN trained in step 2) extracts features from the 3-dimensional human joint coordinate data and its output is taken as the temporal feature vector; the convolutional neural network CNN trained in step 3) extracts features from the RGB video and its output is taken as the spatio-temporal feature vector; the temporal and spatio-temporal feature vectors are concatenated into a feature array and normalized;
step 42), the normalized feature vector groups are taken as input, the specific action or behavior mark corresponding to each normalized feature vector group is taken as output, and the pairs are submitted to a linear support vector machine (SVM) for training, wherein the optimization model of the SVM is as follows:

min over ω, b_0, ξ of (1/2)·||ω||² + C · Σ_{i=1..N} ξ_i
s.t. y_i(ω^T · x_i + b_0) ≥ 1 − ξ_i, ξ_i ≥ 0, i = 1, 2, ..., N

where ω represents the weight vector (the normal of the separating hyperplane), x_i the i-th input feature vector, C the penalty coefficient, ξ_i the classification loss (slack) of the i-th sample point, y_i the action mark corresponding to each sample, b_0 the intercept, and N the total number of feature vectors input to the SVM;
step 43), the optimal value of the penalty coefficient C is found using the training and validation sets, giving the comprehensive recognition model.
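A possible realization of steps 41) to 43) with scikit-learn is sketched below; the use of scikit-learn, the candidate values of C and the placeholder feature arrays are assumptions, since the method itself only specifies a linear SVM whose penalty coefficient C is chosen on a validation set.

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

def l2_normalize(F):
    """Row-wise L2 normalization of the concatenated feature arrays (step 41))."""
    return F / np.linalg.norm(F, axis=1, keepdims=True)

# rnn_feats: (N, 600) temporal features, cnn_feats: (N, 4096) spatio-temporal features,
# labels: (N,) action marks -- placeholders produced by the trained RNN and CNN.
def fit_fusion_svm(rnn_feats, cnn_feats, labels, val_rnn, val_cnn, val_labels):
    X = l2_normalize(np.hstack([rnn_feats, cnn_feats]))       # concatenation, step 41)
    Xv = l2_normalize(np.hstack([val_rnn, val_cnn]))
    best_C, best_acc, best_clf = None, -1.0, None
    for C in (0.01, 0.1, 1.0, 10.0, 100.0):                   # search penalty C, step 43)
        clf = LinearSVC(C=C).fit(X, labels)                   # train linear SVM, step 42)
        acc = accuracy_score(val_labels, clf.predict(Xv))
        if acc > best_acc:
            best_C, best_acc, best_clf = C, acc, clf
    return best_clf, best_C
```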
Preferably: in step 33), when the convolutional neural network CNN is optimized by the stochastic gradient descent optimizer, the initial learning rate is 0.0001, and when no training progress is observed, the learning rate is reduced by half.
Preferably, the normalization formula in step 41) is x'_i = x_i / sqrt(Σ_{j=1..n} x_j²) (L2 normalization), where x_i denotes an element of the feature array and x'_i denotes the corresponding element of the normalized feature vector.
Preferably: step 31) the resolution is adjusted from 1920 x 1080 pixels to 320 x 240 pixels.
Compared with the prior art, the invention has the following beneficial effects:
the invention uses a Microsoft Kinect v2 sensor to collect the 3-dimensional coordinates of human joints and RGB video, uses a convolutional neural network and a recurrent neural network to obtain the temporal and spatio-temporal features of the human behavior data in the video, and effectively combines the two, finally obtaining a model that adapts to complex environments and actions. Given the corresponding data of a new video segment, the model recognizes the human actions in the video quickly and effectively, with good accuracy and stability. In particular:
(1) The invention uses a two-stream RNN/CNN neural network algorithm, which effectively takes into account the influence of action continuity on action recognition and can recognize and predict the action within a short time.
(2) The invention jointly considers the scene information and the action information, and matches action sequences in the action database using the scene information as a label, thereby completing the recognition of human actions more accurately.
(3) The invention provides an effective and practical system architecture, configured with a user interface module and an operation recording module, which improves the stability of human behavior recognition and facilitates concrete industrial applications of the architecture.
Drawings
FIG. 1 is a flow chart of 3D video motion recognition based on a convolutional neural network and a recurrent neural network.
FIG. 2 is a schematic diagram of a GRU-RNN-based neural network.
Fig. 3 is a schematic diagram of three-dimensional data sampling points of a human joint.
Detailed Description
The present invention is further illustrated below in conjunction with the accompanying drawings and specific embodiments. It should be understood that these examples are intended only to illustrate the invention and not to limit its scope, which is defined by the appended claims of this application; after reading the present disclosure, modifications of various equivalent forms by those skilled in the art fall within the scope defined by the claims.
A human behavior recognition method based on a convolutional neural network and a recurrent neural network, as shown in FIGS. 1-3, comprises the following steps:
Step 1), the user waves a hand 5 times in succession at a fixed position, and a Microsoft Kinect v2 sensor tracks the human body behavior. The joint sampling points of the Microsoft Kinect v2 sensor are: 1-spine base, 2-spine middle, 3-neck, 4-head, 5-left shoulder, 6-left elbow, 7-left wrist, 8-left hand, 9-right shoulder, 10-right elbow, 11-right wrist, 12-right hand, 13-left hip, 14-left knee, 15-left ankle, 16-left foot, 17-right hip, 18-right knee, 19-right ankle, 20-right foot, 21-spine shoulder, 22-left hand tip, 23-left thumb, 24-right hand tip, 25-right thumb. The time step is set to 16 milliseconds during collection, and the 3-dimensional coordinates of the user's 25 main joints over the time period are collected as V = {(x1, y1, z1), (x2, y2, z2), ..., (x25, y25, z25)}. The x axis is perpendicular to the vertical direction of the human body with its positive direction pointing forward, the y axis is parallel to the vertical direction of the human body with its positive direction pointing towards the head, and the z axis is perpendicular to the vertical direction of the human body with its positive direction pointing to the left side of the body. The RGB video of the scene over the same time period is collected simultaneously, and the joint coordinates are arranged into per-sample arrays as sketched below;
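For illustration, one way to arrange the collected joint coordinates into the 16 × (25 × 3) sample used later as the RNN training input is sketched here; read_joint_frame is a hypothetical routine standing in for whatever code returns the 25 (x, y, z) tuples of one 16-millisecond time step.

```python
import numpy as np

def build_sample(read_joint_frame, num_steps=16, num_joints=25):
    """Stack per-step joint coordinates into a (num_steps, num_joints*3) array."""
    frames = []
    for _ in range(num_steps):
        joints = read_joint_frame()      # hypothetical: list of 25 (x, y, z) tuples
        frames.append(np.asarray(joints, dtype=np.float32).reshape(num_joints * 3))
    return np.stack(frames)              # shape (16, 75), one motion sample
```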
Step 2), the 3-dimensional coordinates of the 25 human joints from step 1) are used as the training set of a recurrent neural network (RNN), and the RNN is trained on this set. Gated recurrent units (GRU) operate on the input coordinates inside the RNN, which uses batch normalization and dropout. The RNN comprises two bidirectional gated recurrent layers, a hidden fully connected layer and a softmax output layer; a rectified linear unit (ReLU) serves as the activation function, and a dropout mechanism is used when mapping the motion features to the hand-waving motion category, giving the trained recurrent neural network RNN;
step 21), the 3-dimensional coordinates of the 25 human joints from step 1) are used as the training set of the recurrent neural network RNN; the dimensionality of each training sample is 16 (time steps) × (25 × 3) (3-dimensional coordinates of the 25 joints);
step 22), a recurrent neural network is trained on the training set of the recurrent neural network RNN, first passing through two bidirectional gated recurrent layers, with gated recurrent units GRU as the basic units of the recurrent layers and a bidirectional recurrent neural network whose two directions process the input data from t = 1 to t = T and from t = T to t = 1 respectively, where t denotes the time variable in the data set and T denotes the last time step in the data set;
step 23), a GRU-based recurrent neural network RNN is used to classify the scene features. The update gate controls how much state information from the previous time step is carried into the current state: z_t = σ(W_z · [h_{t-1}, x_t]), where h_{t-1} is the value of the memory cell at time t-1, x_t is the input at time t, W_z is the weight matrix of the update gate, z_t is the value of the update gate at time t, and σ is the sigmoid activation function;
step 24), the reset gate controls how much information from the previous state is written into the candidate memory cell h̃_t at time t: r_t = σ(W_r · [h_{t-1}, x_t]), where h_{t-1} is the value of the memory cell at time t-1, x_t is the input at time t, W_r is the weight matrix of the reset gate, and r_t is the value of the reset gate at time t;
step 25), the value of the candidate memory cell is calculated as h̃_t = tanh(W · [r_t ⊙ h_{t-1}, x_t]), where h̃_t is the value of the candidate memory cell at time t, h_{t-1} is the value of the memory cell at time t-1, x_t is the input at time t, W is the weight matrix of the GRU unit at the current time, and tanh is the hyperbolic tangent function;
step 26), the state value of the memory cell at the current time is calculated as h_t = (1 − z_t) ⊙ h_{t-1} + z_t ⊙ h̃_t, where ⊙ denotes the element-wise (point-wise) product; the state update depends on the memory-cell value h_{t-1} at time t-1 and on the candidate memory-cell value h̃_t, the two terms being blended by the update gate, while the reset gate acts inside the candidate value of step 25);
step 27), finally the output of the recurrent neural network is obtained as y_t = σ(W · h_t) and passed to the fully connected layer, where the softmax activation function interprets the output as probabilities, y_t denoting the probability predicted by the recurrent neural network.
Step 3), the RGB video collected in step 1) is divided into consecutive RGB frames with a time step of 16 milliseconds, the resolution of the frames is adjusted from 1920 × 1080 pixels to 320 × 240 pixels, and the frames are used as the training set of the convolutional neural network CNN. The CNN convolves the input RGB frame stream with several convolution kernels and comprises convolutional layers, pooling layers, fully connected layers and a softmax output layer; the model is fine-tuned with parameters pre-trained on the Sports-1M data set to reduce overfitting and training time, giving the trained convolutional neural network CNN;
the method for obtaining the trained convolutional neural network CNN in the step 3) is as follows:
step 31), the RGB video collected in step 1) is divided into consecutive RGB frames with a time step of 16 milliseconds, the resolution of the frames is adjusted, and the RGB frame stream is used as the training set of the convolutional neural network CNN; the input to the CNN has size c × t × h × w, where c is the number of channels, t is the time step, and h and w are the height and width of the RGB frames;
step 32), the convolutional neural network CNN receives the parallelized video-frame input. For a convolution window of length h, a convolution kernel slides over the input matrix x_{1:n} and the kernel output is c_i = f(w · x_{i:i+h-1} + b), where w ∈ R^{h×d} is the weight of the convolution kernel (w takes values in the real domain and d denotes the dimension of x_i), b ∈ R is the bias, f is the activation function, and x_{i:i+h-1} is the frame-vector matrix inside the convolution window. Convolving over a frame stream of temporal length n yields the convolved feature vector c = [c_1, c_2, ..., c_{n-h+1}].
Step 33), the maximum value is extracted from each convolved feature vector (max-over-time pooling); for a window with m convolution kernels this yields the pooled feature vector ĉ = [ĉ_1, ĉ_2, ..., ĉ_m], where ĉ is the feature vector extracted by the convolutional neural network and ĉ_m is the feature value obtained from the m-th convolution kernel.
Step 34), the classification result is output through a softmax function: y = softmax(W_fc · (ĉ ⊙ r) + b_fc), where y is the probability predicted by the convolutional neural network, r is the regularization (dropout) mask constraining the output of the down-sampling layer, ⊙ denotes element-wise multiplication, W_fc is the weight matrix of the fully connected layer, and b_fc is the bias; the convolutional neural network CNN is optimized with a stochastic gradient descent optimizer.
Step 4), the output of the recurrent neural network RNN trained in step 2) is taken as the temporal feature vector of the 3-dimensional joint coordinate data, and the output of the convolutional neural network CNN trained in step 3) is taken as the spatio-temporal feature vector of the RGB video. The two feature vectors are concatenated into a feature array, the array is normalized and fed to a linear support vector machine (SVM) classifier, and a validation data set is used to find the penalty coefficient C of the linear SVM, which expresses the tolerance for classification errors, giving the comprehensive recognition model;
the method for obtaining the comprehensive identification model in the step 4) comprises the following steps:
Step 41), the recurrent neural network RNN trained in step 2) extracts features from the 3-dimensional human joint coordinate data and its output is taken as the temporal feature vector (600 dimensions); the convolutional neural network CNN trained in step 3) extracts features from the RGB video and its output is taken as the spatio-temporal feature vector (4096 dimensions); the temporal and spatio-temporal feature vectors are concatenated into a feature array (4696 dimensions) and normalized;
The normalization formula is x'_i = x_i / sqrt(Σ_{j=1..n} x_j²) (L2 normalization), where x_i denotes an element of the feature array and x'_i denotes the corresponding element of the normalized feature vector.
Step 42), the normalized feature vector groups are taken as input, the specific action or behavior mark corresponding to each normalized feature vector group is taken as output, and the pairs are submitted to a linear support vector machine (SVM) for training, wherein the optimization model of the SVM is as follows:

min over ω, b_0, ξ of (1/2)·||ω||² + C · Σ_{i=1..N} ξ_i
s.t. y_i(ω^T · x_i + b_0) ≥ 1 − ξ_i, ξ_i ≥ 0, i = 1, 2, ..., N

where ω represents the weight vector (the normal of the separating hyperplane), x_i the i-th input feature vector, C the penalty coefficient, ξ_i the classification loss (slack) of the i-th sample point, y_i the action mark corresponding to each sample, b_0 the intercept, and N the total number of feature vectors input to the SVM;
and 43) finding the optimal value of the penalty coefficient C through the training and verification set to obtain a comprehensive recognition model, and finally verifying the motion category in the human body by using the model.
Step 5), during recognition, the 3-dimensional coordinates of the 25 joints and the RGB video of the human behavior to be recognized are collected with the method of step 1). The collected joint coordinates are fed to the recurrent neural network RNN trained in step 2) to obtain a temporal feature vector, and the collected RGB video is fed to the convolutional neural network CNN trained in step 3) to obtain a spatio-temporal feature vector. The two vectors are concatenated into a feature array, the array is normalized and passed to the comprehensive recognition model, which recognizes the human behavior.
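An end-to-end sketch of the recognition flow of step 5) is given below; rnn_model, cnn_model and svm_clf are placeholders for the trained recurrent network, the trained convolutional network and the trained linear SVM, and the exact calling conventions and feature shapes beyond those stated in the text are assumptions.

```python
import numpy as np

def recognize(joint_sample, rgb_clip, rnn_model, cnn_model, svm_clf):
    """joint_sample: (16, 75) joint coordinates; rgb_clip: (3, 16, h, w) RGB frames."""
    t_feat = rnn_model(joint_sample)                               # 600-dim temporal feature
    st_feat = cnn_model(rgb_clip)                                  # 4096-dim spatio-temporal feature
    fused = np.concatenate([np.ravel(t_feat), np.ravel(st_feat)])  # 4696-dim feature array
    fused = fused / np.linalg.norm(fused)                          # L2 normalization
    return svm_clf.predict(fused.reshape(1, -1))[0]                # predicted action label
```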
Simulation:
a user continuously waves hands 5 times at a fixed position, uses a Microsoft Kinect V2 sensor to track human body behaviors, sets a time step length at the collection time to be 16 milliseconds, and collects a 3-dimensional coordinate vector set V { (x) of 25 joints which are main for the user in the time period1,y1,z1),(x2,y2,z2),...,(x25,y25,z25) }; each joint has 25 3D coordinates. This motion sample dimension is processed as 16 (time step) × (25 × 3) (3-dimensional joint coordinates) and is taken as a training set of the recurrent neural network.
The recurrent neural network is trained on multiple GPUs with a learning rate of 0.001 and a decay rate of 0.9. The single-layer model is trained from scratch with mini-batches of 1000 sequences, and the two-layer model with mini-batches of 650 sequences. For all RNN networks, 300 neurons are used in each unidirectional layer and the number of neurons is doubled in the bidirectional layers; dropout with a keep probability of 75% is applied, and finally the temporal feature vector (600 dimensions) of the 3D human joint coordinate data is obtained through a fully connected layer.
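A PyTorch sketch matching this simulated configuration (two bidirectional GRU layers with 300 units per direction, a fully connected layer, and dropout with a keep probability of 0.75) might look as follows; PyTorch itself, the class count, the choice of RMSprop and the reading of the stated decay rate 0.9 as its smoothing constant are all assumptions, as is taking the 600-dimensional temporal feature from the last time step.

```python
import torch
import torch.nn as nn

class JointGRU(nn.Module):
    def __init__(self, in_dim=75, hidden=300, num_classes=10):
        super().__init__()
        # Two stacked bidirectional GRU layers, 300 units per direction (600 combined).
        self.gru = nn.GRU(in_dim, hidden, num_layers=2,
                          bidirectional=True, batch_first=True)
        self.drop = nn.Dropout(p=0.25)                 # keep probability 0.75
        self.fc = nn.Linear(2 * hidden, num_classes)   # fully connected output layer

    def forward(self, x):                              # x: (batch, 16, 75) joint sequences
        out, _ = self.gru(x)
        feat = self.drop(out[:, -1, :])                # 600-dim temporal feature, last step
        return self.fc(feat)                           # class scores; softmax in the loss

model = JointGRU()                                               # num_classes=10 is illustrative
optimizer = torch.optim.RMSprop(model.parameters(), lr=0.001, alpha=0.9)
logits = model(torch.randn(8, 16, 75))                           # dummy batch of 8 sequences
```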
The RGB video is then imported for convolutional neural network training. Frames are extracted from the RGB video using a 3D-CNN model in Caffe and cropped from 1920 × 1080 pixels to 320 × 240 pixels with a time step of 16 milliseconds; we refer to the input of the CNN model as having size c × t × h × w, where c is the number of channels, t is the time step, and h and w are the height and width of the RGB frames, respectively. The network takes a video clip as input, and the data are marked with the hand-waving motion label. The input RGB frames are then resized to a resolution of 128 × 171 pixels, so the input size becomes 3 × 16 × 128 × 171, and the network is trained with a stochastic gradient descent optimizer with an initial learning rate of 0.0001; when no training progress is observed, the learning rate is halved. Finally, the spatio-temporal feature vector (4096 dimensions) of the RGB video is obtained through the fully connected layers.
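A sketch of the clip preparation with OpenCV is given below; the library choice, the file path and the omission of the intermediate 320 × 240 crop are assumptions, and only the 16-frame clip length, the 128 × 171 target resolution and the 3 × 16 × 128 × 171 input layout are taken from the text.

```python
import cv2
import numpy as np

def load_clip(video_path, num_frames=16, size=(171, 128)):
    """Read 16 consecutive RGB frames and return an array of shape (3, 16, 128, 171)."""
    cap = cv2.VideoCapture(video_path)
    frames = []
    while len(frames) < num_frames:
        ok, frame = cap.read()
        if not ok:
            break
        frame = cv2.resize(frame, size)                  # (128, 171, 3) after resize
        frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    cap.release()
    clip = np.stack(frames).astype(np.float32)           # (16, 128, 171, 3)
    return clip.transpose(3, 0, 1, 2)                    # (c, t, h, w) = (3, 16, 128, 171)

clip = load_clip("waving_hand.avi")                      # placeholder path
```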
For feature fusion, we concatenate the temporal feature vector extracted by the RNN with the spatio-temporal feature vector extracted by the CNN into a feature array (4696 dimensions) and perform L2 normalization. Finally, the fused RNN/CNN features are evaluated with training, validation and test splits; the training and validation splits are used to find the optimal value of the parameter C of the linear SVM model, and the comprehensive model is used to verify the accuracy of motion recognition.
The method can solve the problem of overfitting of the model to action classification in the model training process, and can effectively improve the human behavior recognition efficiency and accuracy.
The above description is only of the preferred embodiments of the present invention, and it should be noted that: it will be apparent to those skilled in the art that various modifications and adaptations can be made without departing from the principles of the invention, and such modifications and adaptations are intended to be within the scope of the invention.

Claims (5)

1. A human behavior recognition method based on a convolutional neural network and a cyclic neural network is characterized by comprising the following steps:
step 1), a user waves a hand 5 times in succession at a fixed position, a sensor tracks the human body behavior with the time step set to 16 milliseconds during collection, and the 3-dimensional coordinates of the user's 25 main joints over the time period are collected as V = {(x1, y1, z1), (x2, y2, z2), ..., (x25, y25, z25)}; wherein the x axis is perpendicular to the vertical direction of the human body with its positive direction pointing forward, the y axis is parallel to the vertical direction of the human body with its positive direction pointing towards the head, and the z axis is perpendicular to the vertical direction of the human body with its positive direction pointing to the left side of the body; the RGB video of the scene over the same time period is collected simultaneously;
step 2), the 3-dimensional coordinates of the 25 human joints from step 1) are used as the training set of a recurrent neural network (RNN), and the RNN is trained on this set; gated recurrent units (GRU) operate on the input coordinates inside the RNN, which uses batch normalization and dropout; the RNN comprises two bidirectional gated recurrent layers, a hidden fully connected layer and a softmax output layer; a rectified linear unit (ReLU) serves as the activation function, and a dropout mechanism is used when mapping the motion features to the hand-waving motion category, giving the trained recurrent neural network RNN;
the method for obtaining the trained recurrent neural network RNN is as follows:
step 21), taking the 3-dimensional coordinates of the 25 joints of the human body in the step 1) as a training set of a Recurrent Neural Network (RNN), wherein the dimensionality of the training set of the Recurrent Neural Network (RNN) is 16 x (25 x 3);
step 22), a recurrent neural network is trained on the training set of the recurrent neural network RNN, first passing through two bidirectional gated recurrent layers, with gated recurrent units GRU as the basic units of the recurrent layers and a bidirectional recurrent neural network whose two directions process the input data from t = 1 to t = T and from t = T to t = 1 respectively, where t denotes the time variable in the data set and T denotes the last time step in the data set;
step 23), a GRU-based recurrent neural network RNN is used to classify the scene features; the update gate controls how much state information from the previous time step is carried into the current state: z_t = σ(W_z · [h_{t-1}, x_t]), where h_{t-1} is the value of the memory cell at time t-1, x_t is the input at time t, W_z is the weight matrix of the update gate, z_t is the value of the update gate at time t, and σ is the sigmoid activation function;
step 24), the reset gate controls how much information from the previous state is written into the candidate memory cell h̃_t at time t: r_t = σ(W_r · [h_{t-1}, x_t]), where h_{t-1} is the value of the memory cell at time t-1, x_t is the input at time t, W_r is the weight matrix of the reset gate, and r_t is the value of the reset gate at time t;
step 25), the value of the candidate memory cell is calculated as h̃_t = tanh(W · [r_t ⊙ h_{t-1}, x_t]), where h̃_t is the value of the candidate memory cell at time t, h_{t-1} is the value of the memory cell at time t-1, x_t is the input at time t, W is the weight matrix of the GRU unit at the current time, and tanh is the hyperbolic tangent function;
step 26), the state value of the memory cell at the current time is calculated as h_t = (1 − z_t) ⊙ h_{t-1} + z_t ⊙ h̃_t, where ⊙ denotes the element-wise (point-wise) product; the state update depends on the memory-cell value h_{t-1} at time t-1 and on the candidate memory-cell value h̃_t, the two terms being blended by the update gate, while the reset gate acts inside the candidate value of step 25);
step 27), finally the output of the recurrent neural network is obtained as y_t = σ(W · h_t) and passed to the fully connected layer, where the softmax activation function interprets the output as probabilities, y_t denoting the probability predicted by the recurrent neural network;
step 3), the RGB video collected in step 1) is divided into consecutive RGB frames with a time step of 16 milliseconds, the resolution of the frames is adjusted, and the frames are used as the training set of a convolutional neural network (CNN); the CNN convolves the input RGB frame stream with several convolution kernels and comprises convolutional layers, pooling layers, fully connected layers and a softmax output layer; the model is fine-tuned with parameters pre-trained on the Sports-1M data set to reduce overfitting and training time, giving the trained convolutional neural network CNN;
the method for obtaining the trained convolutional neural network CNN is as follows:
step 31), the RGB video collected in step 1) is divided into consecutive RGB frames with a time step of 16 milliseconds, the resolution of the frames is adjusted, and the RGB frame stream is used as the training set of the convolutional neural network CNN; the input to the CNN has size c × t × h × w, where c is the number of channels, t is the time step, and h and w are the height and width of the RGB frames;
step 32), the convolutional neural network CNN receives the parallelized video-frame input; for a convolution window of length h, a convolution kernel slides over the input matrix x_{1:n} and the kernel output is c_i = f(w · x_{i:i+h-1} + b), where w ∈ R^{h×d} is the weight of the convolution kernel (w takes values in the real domain and d denotes the dimension of x_i), b ∈ R is the bias, f is the activation function, and x_{i:i+h-1} is the frame-vector matrix inside the convolution window; convolving over a frame stream of temporal length n yields the convolved feature vector c = [c_1, c_2, ..., c_{n-h+1}];
Step 33), the maximum value is extracted from each convolved feature vector (max-over-time pooling); for a window with m convolution kernels this yields the pooled feature vector ĉ = [ĉ_1, ĉ_2, ..., ĉ_m], where ĉ is the feature vector extracted by the convolutional neural network and ĉ_m is the feature value obtained from the m-th convolution kernel;
and step 34), the classification result is output through a softmax function: y = softmax(W_fc · (ĉ ⊙ r) + b_fc), where y is the probability predicted by the convolutional neural network, r is the regularization (dropout) mask constraining the output of the down-sampling layer, ⊙ denotes element-wise multiplication, W_fc is the weight matrix of the fully connected layer, and b_fc is the bias; the convolutional neural network CNN is optimized with a stochastic gradient descent optimizer;
step 4), the output of the recurrent neural network RNN trained in step 2) is taken as the temporal feature vector of the 3-dimensional joint coordinate data, and the output of the convolutional neural network CNN trained in step 3) is taken as the spatio-temporal feature vector of the RGB video; the two feature vectors are concatenated, the concatenated vector is normalized and fed to a linear support vector machine (SVM) classifier, and a validation data set is used to find the penalty coefficient C of the linear SVM, giving the comprehensive recognition model;
the method for obtaining the comprehensive identification model comprises the following steps:
step 41), the recurrent neural network RNN trained in step 2) extracts features from the 3-dimensional human joint coordinate data and its output is taken as the temporal feature vector; the convolutional neural network CNN trained in step 3) extracts features from the RGB video and its output is taken as the spatio-temporal feature vector; the temporal and spatio-temporal feature vectors are concatenated into a feature array and normalized;
step 42), the normalized feature vector groups are taken as input, the specific action or behavior mark corresponding to each normalized feature vector group is taken as output, and the pairs are submitted to a linear support vector machine (SVM) for training, wherein the optimization model is as follows:

min over ω, b_0, ξ of (1/2)·||ω||² + C · Σ_{i=1..N} ξ_i
s.t. y_i(ω^T · x_i + b_0) ≥ 1 − ξ_i, ξ_i ≥ 0, i = 1, 2, ..., N

where ω represents the weight vector (the normal of the separating hyperplane), x_i the i-th input feature vector, C the penalty coefficient, ξ_i the classification loss (slack) of the i-th sample point, y_i the action mark corresponding to each sample, b_0 the intercept, and N the total number of feature vectors input to the SVM;
step 43), finding the optimal value of the penalty coefficient C through a training and verification set to obtain a comprehensive identification model;
and step 5), during recognition, the 3-dimensional coordinates of the 25 joints and the RGB video of the human behavior to be recognized are collected with the method of step 1); the collected joint coordinates are fed to the recurrent neural network RNN trained in step 2) to obtain a temporal feature vector, and the collected RGB video is fed to the convolutional neural network CNN trained in step 3) to obtain a spatio-temporal feature vector; the two vectors are concatenated into a feature array, the array is normalized and passed to the comprehensive recognition model, which recognizes the human behavior.
2. The human behavior recognition method based on the convolutional neural network and the cyclic neural network as claimed in claim 1, wherein: the sensor in step 1) is a Microsoft Kinect v2 sensor.
3. The human behavior recognition method based on the convolutional neural network and the cyclic neural network as claimed in claim 2, wherein: in step 33), when the convolutional neural network CNN is optimized by the stochastic gradient descent optimizer, the initial learning rate is 0.0001, and when no training progress is observed, the learning rate is reduced by half.
4. The human behavior recognition method based on the convolutional neural network and the cyclic neural network as claimed in claim 3, wherein the normalization formula in step 41) is x'_i = x_i / sqrt(Σ_{j=1..n} x_j²), where x_i denotes an element of the feature array and x'_i denotes the corresponding element of the normalized feature vector.
5. The human behavior recognition method based on the convolutional neural network and the cyclic neural network as claimed in claim 4, wherein: step 31) the resolution is adjusted from 1920 x 1080 pixels to 320 x 240 pixels.
CN201910580116.XA 2019-06-28 2019-06-28 Human body behavior identification method based on convolutional neural network and cyclic neural network Active CN110321833B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910580116.XA CN110321833B (en) 2019-06-28 2019-06-28 Human body behavior identification method based on convolutional neural network and cyclic neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910580116.XA CN110321833B (en) 2019-06-28 2019-06-28 Human body behavior identification method based on convolutional neural network and cyclic neural network

Publications (2)

Publication Number Publication Date
CN110321833A CN110321833A (en) 2019-10-11
CN110321833B true CN110321833B (en) 2022-05-20

Family

ID=68121381

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910580116.XA Active CN110321833B (en) 2019-06-28 2019-06-28 Human body behavior identification method based on convolutional neural network and cyclic neural network

Country Status (1)

Country Link
CN (1) CN110321833B (en)

Families Citing this family (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110880172A (en) * 2019-11-12 2020-03-13 中山大学 Video face tampering detection method and system based on cyclic convolution neural network
CN111079928B (en) * 2019-12-14 2023-07-07 大连大学 Method for predicting human body movement by using circulating neural network based on countermeasure learning
CN111597881B (en) * 2020-04-03 2022-04-05 浙江工业大学 Human body complex behavior identification method based on data separation multi-scale feature combination
CN111459283A (en) * 2020-04-07 2020-07-28 电子科技大学 Man-machine interaction implementation method integrating artificial intelligence and Web3D
CN111681321B (en) * 2020-06-05 2023-07-04 大连大学 Method for synthesizing three-dimensional human motion by using cyclic neural network based on layered learning
CN111914638B (en) * 2020-06-29 2022-08-12 南京邮电大学 Character action recognition method based on improved long-term recursive deep convolution model
CN111860269B (en) * 2020-07-13 2024-04-16 南京航空航天大学 Multi-feature fusion series RNN structure and pedestrian prediction method
CN112232489A (en) * 2020-10-26 2021-01-15 南京明德产业互联网研究院有限公司 Method and device for gating cycle network and method and device for link prediction
CN112488014B (en) * 2020-12-04 2022-06-10 重庆邮电大学 Video prediction method based on gated cyclic unit
CN112906509A (en) * 2021-01-28 2021-06-04 浙江省隧道工程集团有限公司 Method and system for identifying operation state of water delivery tunnel excavator
CN113568819B (en) * 2021-01-31 2024-04-16 腾讯科技(深圳)有限公司 Abnormal data detection method, device, computer readable medium and electronic equipment
CN112906383B (en) * 2021-02-05 2022-04-19 成都信息工程大学 Integrated adaptive water army identification method based on incremental learning
CN113111756B (en) * 2021-04-02 2024-05-03 浙江工业大学 Human body fall recognition method based on human body skeleton key points and long-short-term memory artificial neural network
CN113378638B (en) * 2021-05-11 2023-12-22 大连海事大学 Method for identifying abnormal behavior of turbine operator based on human body joint point detection and D-GRU network
CN114091596A (en) * 2021-11-15 2022-02-25 长安大学 Problem behavior recognition system and method for barrier population
CN114399841A (en) * 2022-01-25 2022-04-26 台州学院 Human behavior recognition method under man-machine cooperation assembly scene
CN116911955B (en) * 2023-09-12 2024-01-05 深圳须弥云图空间科技有限公司 Training method and device for target recommendation model

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107943967A (en) * 2017-11-28 2018-04-20 华南理工大学 Algorithm of documents categorization based on multi-angle convolutional neural networks and Recognition with Recurrent Neural Network
CN108280406A (en) * 2017-12-30 2018-07-13 广州海昇计算机科技有限公司 A kind of Activity recognition method, system and device based on segmentation double-stream digestion
CN109117701A (en) * 2018-06-05 2019-01-01 东南大学 Pedestrian's intension recognizing method based on picture scroll product
AU2018101512A4 (en) * 2018-10-11 2018-11-15 Dong, Xun Miss A comprehensive stock trend predicting method based on neural networks

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Combination of CNN-GRU Model to Recognize Characters of a License Plate number without Segmentation; Bhargavi Suvarnam et al.; 2019 5th International Conference on Advanced Computing & Communication Systems (ICACCS); 20190606; pp. 317-322 *
Research and Implementation of Action Recognition Methods Based on Pose and Skeleton Information; 马静; China Master's Theses Full-text Database, Information Science and Technology; 20190115 (No. 12); pp. 9-12, 18-28, 35 *
Research on a Text Sentiment Analysis Method Based on an Attention Mechanism and a BGRU Network; 尹良亮 et al.; Wireless Internet Technology; 20190531 (No. 9); pp. 27-28 *
A Spatio-temporal Two-stream Human Action Recognition Model Based on Video Deep Learning; 杨天明 et al.; Journal of Computer Applications; 20180310 (No. 3); pp. 895-899 *

Also Published As

Publication number Publication date
CN110321833A (en) 2019-10-11

Similar Documents

Publication Publication Date Title
CN110321833B (en) Human body behavior identification method based on convolutional neural network and cyclic neural network
Zhang et al. Dynamic hand gesture recognition based on short-term sampling neural networks
Reddy et al. Spontaneous facial micro-expression recognition using 3D spatiotemporal convolutional neural networks
CN110096950B (en) Multi-feature fusion behavior identification method based on key frame
CN106682598B (en) Multi-pose face feature point detection method based on cascade regression
CN112784763B (en) Expression recognition method and system based on local and overall feature adaptive fusion
Cui Applying gradient descent in convolutional neural networks
CN108932500A (en) A kind of dynamic gesture identification method and system based on deep neural network
US9317785B1 (en) Method and system for determining ethnicity category of facial images based on multi-level primary and auxiliary classifiers
Littlewort et al. Dynamics of facial expression extracted automatically from video
CN102930302B (en) Based on the incrementally Human bodys' response method of online sequential extreme learning machine
CN110188637A (en) A kind of Activity recognition technical method based on deep learning
CN112528928B (en) Commodity identification method based on self-attention depth network
Caroppo et al. Comparison between deep learning models and traditional machine learning approaches for facial expression recognition in ageing adults
CN108182397B (en) Multi-pose multi-scale human face verification method
CN110575663B (en) Physical education auxiliary training method based on artificial intelligence
CN107767416B (en) Method for identifying pedestrian orientation in low-resolution image
CN110084211B (en) Action recognition method
Zhang et al. BoMW: Bag of manifold words for one-shot learning gesture recognition from kinect
CN109063626A (en) Dynamic human face recognition methods and device
Lu et al. Automatic lip reading using convolution neural network and bidirectional long short-term memory
CN113255602A (en) Dynamic gesture recognition method based on multi-modal data
CN107220597B (en) Key frame selection method based on local features and bag-of-words model human body action recognition process
Zheng et al. Attention assessment based on multi‐view classroom behaviour recognition
Mohana et al. Emotion recognition from facial expression using hybrid CNN–LSTM network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant