CN110321833A - Human behavior recognition method based on convolutional neural network and recurrent neural network - Google Patents
- Publication number
- CN110321833A CN110321833A CN201910580116.XA CN201910580116A CN110321833A CN 110321833 A CN110321833 A CN 110321833A CN 201910580116 A CN201910580116 A CN 201910580116A CN 110321833 A CN110321833 A CN 110321833A
- Authority
- CN
- China
- Prior art keywords
- recognition
- neural network
- neural networks
- convolutional neural
- recurrent neural
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2411—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/20—Movements or behaviour, e.g. gesture recognition
Abstract
The invention discloses a human behavior recognition method based on a convolutional neural network and a recurrent neural network. A sensor tracks the human behavior and collects, over a time period, the 3D coordinate vector groups of the human joints together with an RGB video. A recurrent neural network (RNN) is then trained on the 3D joint coordinates to obtain a temporal feature vector, and a convolutional neural network (CNN) is trained on the RGB video to obtain a spatiotemporal feature vector. Finally, the temporal and spatiotemporal feature vectors are concatenated and normalized, then fed to a linear SVM classifier; a validation data set is used to find the penalty parameter C of the linear support vector machine (SVM), yielding a comprehensive recognition model. The invention addresses the overfitting of the action classification during model training, while also effectively improving the efficiency and accuracy of human behavior recognition.
Description
Technical field
The present invention relates to a human behavior recognition method based on a convolutional neural network and a recurrent neural network, and belongs to the interdisciplinary fields of behavior recognition, deep learning, and machine learning.
Background art
Behavior recognition in video is an important research topic in computer vision. With the wide application of video tracking, motion analysis, virtual reality, and artificial-intelligence interaction, it has become a research hotspot in the field. The same action scene looks different under different illumination, viewing angles, and backgrounds, while in different action scenes the same target and action can also differ markedly in appearance and posture. Even in a fixed scene, human actions retain large degrees of freedom: the same action varies greatly in direction and amplitude. In addition, problems such as partial occlusion and individual differences all add to the spatial complexity of action recognition. Manually designed features have difficulty capturing the essential characteristics of objects in rapidly changing scenes, so more general feature-extraction methods are needed to overcome the one-sidedness and blindness introduced by hand-crafted feature extraction.
Many current action recognition algorithms are shallow learning methods with inherent limitations: when training samples are limited, their ability to represent complex functions is restricted, and their generalization to complex classification problems is constrained accordingly.
Action recognition can be regarded as a classification problem, and many classification methods have been applied to it, including logistic regression, decision-tree models, naive Bayes classifiers, and support vector machines. Each of these methods has its pros and cons in practice.
Television and video footage is often 3D video, yet the techniques currently used for behavior recognition in 3D video are immature. Most human behavior recognition systems depend on manually labeling and preprocessing the data before feeding it into a recognition model. Such strong dependence on data and low running efficiency make them unsuitable for industrial and commercial demands.
Summary of the invention
Object of the invention: in order to overcome the deficiencies of the prior art, the present invention provides a human behavior recognition method based on a convolutional neural network and a recurrent neural network. A single algorithm combines the recurrent neural network's extraction of temporal features from the samples with the convolutional neural network's extraction of spatiotemporal features, and further combines them with a linear SVM, training a model that can judge the actions of people in video.
Technical solution: to achieve the above object, the technical solution adopted by the present invention is as follows:
In a human behavior recognition method based on a convolutional neural network and a recurrent neural network, a user first waves 5 times at a fixed position. A Microsoft Kinect v2 sensor tracks the human behavior and, with a time step of 16 milliseconds, collects the 3D coordinate vector groups of the user's 25 main joints over the period together with the RGB video of the period. A recurrent neural network, with the gated recurrent unit (GRU) as the basic unit of its recurrent layers and used bidirectionally, is then trained on the 3D joint coordinate data to obtain the temporal feature vector of the joint coordinates. The RGB video is split into consecutive RGB frames at 16-millisecond time steps, and a convolutional neural network is trained on this data set to obtain the spatiotemporal feature vector of the RGB video. Finally, the outputs of the recurrent and convolutional networks are concatenated and normalized, then fed to a linear SVM classifier; a validation data set is used to find the penalty parameter C of the linear support vector machine (SVM), yielding a comprehensive recognition model. The method specifically includes the following steps:
Step 1): the user waves 5 times at a fixed position; a sensor tracks the human behavior with a collection time step of 16 milliseconds and records the 3D coordinates of the user's 25 main joints over the period, V = {(x_1, y_1, z_1), (x_2, y_2, z_2), ..., (x_25, y_25, z_25)}. The x-axis is perpendicular to the vertical axis of the body with its positive direction pointing forward; the y-axis is parallel to the vertical axis of the body with its positive direction pointing toward the head; the z-axis is perpendicular to the vertical axis of the body with its positive direction pointing to the body's left. The RGB video of the scene over the same period is collected simultaneously.
Step 2): the 3D coordinates of the 25 human joints from step 1) form the training set of the recurrent neural network (RNN), on which one RNN is trained. The RNN operates on the input coordinates with gated recurrent units (GRU) and uses batch normalization and random dropout. It comprises two bidirectional gated recurrent layers, a hidden fully connected layer, and a softmax output layer; the rectified linear unit (ReLU) serves as the activation function, and a dropout mechanism is used while mapping the motion features to the waving action class, yielding the trained RNN.
Step 3): the RGB video collected in step 1) is split into consecutive RGB frames at 16-millisecond time steps and the frames are resized to form the training set of the convolutional neural network (CNN). The CNN applies convolution operations with multiple kernels to the input RGB frame stream and comprises convolutional layers, pooling layers, fully connected layers, and a softmax output layer. A model pre-trained on the Sports-1M data set is fine-tuned to reduce overfitting and training time, yielding the trained CNN.
Step 4): the output of the trained RNN from step 2) is taken as the temporal feature vector of the 3D joint coordinate data, and the output of the trained CNN from step 3) as the spatiotemporal feature vector of the RGB video. The two vectors are concatenated, normalized, and fed to the classifier of the linear support vector machine (SVM); a validation data set is used to find the penalty coefficient C of the linear SVM, yielding the comprehensive recognition model.
Step 5): at recognition time, the 3D coordinates of the 25 joints and the RGB video of the behavior to be recognized are acquired as in step 1). The joint coordinates are fed to the trained RNN of step 2) to obtain the temporal feature vector, and the RGB video to the trained CNN of step 3) to obtain the spatiotemporal feature vector. The two vectors are concatenated into one feature array, normalized, and passed to the comprehensive recognition model, which identifies the person's behavior.
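The recognition pipeline of steps 1)-5) can be sketched as follows. The stub feature extractors, toy dimensions, and random inputs are assumptions for illustration (the real system would run the trained GRU-RNN and 3D-CNN); only the 600-/4096-/4696-dimension layout and the concatenate-normalize-classify order follow the description:

```python
import numpy as np

def rnn_temporal_features(joints):
    """Hypothetical stand-in for the trained GRU-RNN of step 2);
    returns the 600-dim temporal feature vector."""
    return np.resize(joints.ravel(), 600)

def cnn_spatiotemporal_features(frames):
    """Hypothetical stand-in for the trained 3D-CNN of step 3);
    returns the 4096-dim spatiotemporal feature vector."""
    return np.resize(frames.mean(axis=0).ravel(), 4096)

def recognize(joints, frames, w, b):
    """Step 5): fuse both feature vectors, normalize, and apply the
    linear SVM decision function sign(w . x + b)."""
    fused = np.concatenate([rnn_temporal_features(joints),
                            cnn_spatiotemporal_features(frames)])
    lo, hi = fused.min(), fused.max()
    x = (fused - lo) / (hi - lo) if hi > lo else fused
    return int(np.sign(x @ w + b))

rng = np.random.default_rng(0)
joints = rng.normal(size=(16, 25, 3))       # 16 steps x 25 joints x 3D
frames = rng.random(size=(16, 24, 32, 3))   # 16 RGB frames (toy resolution)
label = recognize(joints, frames, rng.normal(size=4696), 0.0)
```

In practice the SVM weights `w`, `b` would come from the validation-tuned training of step 4); here they are random placeholders.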
Preferably, the sensor in step 1) is a Microsoft Kinect v2 sensor.
Preferably, the trained recurrent neural network (RNN) of step 2) is obtained as follows:
Step 21): the 3D coordinates of the 25 human joints from step 1) form the training set of the RNN; its dimension is 16 × (25 × 3).
Step 22): the RNN is trained on this training set. The input first passes through two bidirectional gated recurrent layers, whose basic unit is the gated recurrent unit (GRU); as a bidirectional recurrent network, it is trained on the input data both from t = 1 to t = T and from t = T to t = 1, where t is the time variable of the data set and T is its last moment.
Step 23): scene features are classified with the GRU-type RNN. The update gate controls how much of the previous moment's state is carried into the current state: z_t = σ(W_z · [h_{t-1}, x_t]), where h_{t-1} is the value of the memory unit at time t-1, x_t is the input at time t, W_z is the weight of the update gate, z_t is the value of the update gate at time t, and σ is the sigmoid activation function.
Step 24): the reset gate controls how much previous-state information is written to the candidate memory unit h̃_t at time t: r_t = σ(W_r · [h_{t-1}, x_t]), where h_{t-1} is the value of the memory unit at time t-1, x_t is the input at time t, W_r is the weight of the reset gate, and r_t is the value of the reset gate at time t.
Step 25): the candidate memory value is computed as h̃_t = tanh(W · [r_t ⊙ h_{t-1}, x_t]), where h_{t-1} is the memory value at time t-1, x_t is the input at time t, W is the weight of the current GRU unit, h̃_t is the candidate memory value, and tanh is the hyperbolic tangent function.
Step 26): the memory state at the current time is computed as h_t = (1 - z_t) ⊙ h_{t-1} + z_t ⊙ h̃_t, where ⊙ denotes the element-wise product. The state update depends on the memory value h_{t-1} at time t-1 and the candidate memory value h̃_t, with the update gate and the reset gate adjusting the two contributions respectively.
Step 27): the output of the RNN is finally obtained as y_t = σ(W h_t); it is passed to the fully connected layer with the softmax activation function, and the output is interpreted as a probability, y_t being the probability predicted by the RNN.
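The GRU equations of steps 23)-26) can be sketched as one NumPy forward step. The hidden size, weight scale, and random input are toy assumptions; biases are omitted for brevity:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev, Wz, Wr, W):
    """One GRU step: each weight matrix acts on the concatenation
    [h_{t-1}, x_t] (or [r * h_{t-1}, x_t] for the candidate)."""
    hx = np.concatenate([h_prev, x_t])
    z = sigmoid(Wz @ hx)                                      # update gate, step 23)
    r = sigmoid(Wr @ hx)                                      # reset gate, step 24)
    h_cand = np.tanh(W @ np.concatenate([r * h_prev, x_t]))   # candidate, step 25)
    return (1.0 - z) * h_prev + z * h_cand                    # memory update, step 26)

rng = np.random.default_rng(0)
d_in, d_h = 75, 8   # 25 joints x 3 coords per frame; toy hidden size
Wz = rng.normal(scale=0.1, size=(d_h, d_h + d_in))
Wr = rng.normal(scale=0.1, size=(d_h, d_h + d_in))
W = rng.normal(scale=0.1, size=(d_h, d_h + d_in))
h = np.zeros(d_h)
for t in range(16):  # 16 time steps, as in the training set
    h = gru_step(rng.normal(size=d_in), h, Wz, Wr, W)
```

Because each state is a convex combination of the previous state and a tanh output, the hidden values stay bounded in [-1, 1].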
Preferably, the trained convolutional neural network (CNN) of step 3) is obtained as follows:
Step 31): the RGB video collected in step 1) is split into consecutive RGB frames at 16-millisecond time steps and the frames are resized; this RGB frame stream forms the training set of the CNN. The input is denoted as having size c × t × h × w, where c is the number of channels, t the time step, and h and w the height and width of an RGB frame.
Step 32): the CNN receives the parallelized input of video frames. For a convolution window of length h, the CNN applies convolution kernels to the input matrix x_{1:n}; the output of a kernel is c_i = f(w · x_{i:i+h-1} + b), where w ∈ ℝ^{h×d} is the kernel weight, ℝ denotes the real field, d is the dimension of x_i, b ∈ ℝ is the bias, f is the activation function, and x_{i:i+h-1} is the frame vector matrix of one convolution window. For a frame stream with time extent n, the convolution yields the feature vector c = [c_1, c_2, ..., c_{n-h+1}].
Step 33): one maximum value is extracted from each feature vector, so a window with m convolution kernels yields the feature vector ĉ = [ĉ_1, ĉ_2, ..., ĉ_m], where ĉ is the feature vector extracted by the CNN and ĉ_m is the feature of the m-th convolution kernel.
Step 34): the classification result is output through the softmax function as y = f(W · (ĉ ⊙ r) + b), where y is the probability predicted by the CNN, r is the regularization (dropout) term constraining the output of the down-sampling layer, ⊙ denotes element-wise multiplication, W is the fully connected layer weight matrix, and b is the bias. The CNN is optimized with a stochastic gradient descent optimizer.
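The convolution-then-max-over-time extraction of steps 32)-33) can be sketched directly. The frame dimension, kernel count, window length, and ReLU as the activation f are toy assumptions:

```python
import numpy as np

def conv_max_features(frames, kernels, bias):
    """Slide each kernel (window length h) over the frame stream of
    length n, producing c = [c_1, ..., c_{n-h+1}] per kernel (step 32),
    then keep the maximum value per kernel (step 33)."""
    n, d = frames.shape
    m, h, _ = kernels.shape
    feats = np.empty(m)
    for j in range(m):
        # c_i = f(w . x_{i:i+h-1} + b), with ReLU as the activation f
        c = [np.maximum(0.0, np.sum(kernels[j] * frames[i:i + h]) + bias[j])
             for i in range(n - h + 1)]
        feats[j] = max(c)  # max-over-time pooling
    return feats

rng = np.random.default_rng(0)
frames = rng.normal(size=(16, 24))     # n=16 frames, toy per-frame dimension d=24
kernels = rng.normal(size=(4, 3, 24))  # m=4 kernels, window length h=3
feats = conv_max_features(frames, kernels, np.zeros(4))
```

The resulting m-dimensional vector is what step 34) would pass, after dropout, through the fully connected softmax layer.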
Preferably, the comprehensive recognition model of step 4) is obtained as follows:
Step 41): the trained RNN of step 2) extracts the temporal feature vector from the 3D human coordinate data, and the trained CNN of step 3) extracts the spatiotemporal feature vector from the RGB video; the two vectors are concatenated into one feature array and normalized.
Step 42): the normalized feature vector groups serve as input, with the specific action or behavior label corresponding to each group as output, and are delivered to the linear SVM for training. The optimization model is:
min_{ω, b_0, ξ} (1/2)‖ω‖² + C Σ_{i=1}^{N} ξ_i
s.t. y_i(ω^T x_i + b_0) ≥ 1 - ξ_i,
ξ_i ≥ 0, i = 1, 2, ..., N
where ω is the weight vector, C the penalty coefficient, ξ_i the classification loss of the i-th sample point, y_i the action label of each sample, b_0 the intercept, and N the total number of feature vectors input to the SVM.
Step 43): through the training and validation sets, the optimal value of the penalty coefficient C is found, and the comprehensive recognition model is obtained.
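Steps 42)-43) can be sketched with a minimal subgradient-descent linear SVM and a validation search over C. This is a stand-in, not the patent's solver; the grid of C values, learning rate, and synthetic data are assumptions:

```python
import numpy as np

def train_linear_svm(X, y, C, epochs=200, lr=0.01):
    """Minimize (1/2)||w||^2 + C * sum(hinge losses) by subgradient
    descent; labels y are +/-1."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        mask = margins < 1.0  # points violating the margin
        grad_w = w - C * (y[mask, None] * X[mask]).sum(axis=0)
        grad_b = -C * y[mask].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

def pick_C(X_tr, y_tr, X_val, y_val, grid=(0.01, 0.1, 1.0, 10.0)):
    """Step 43): choose the penalty coefficient C on a validation set."""
    def acc(w, b, X, y):
        return np.mean(np.sign(X @ w + b) == y)
    scores = {C: acc(*train_linear_svm(X_tr, y_tr, C), X_val, y_val)
              for C in grid}
    return max(scores, key=scores.get)

rng = np.random.default_rng(0)
X = rng.normal(size=(80, 5))
y = np.sign(X @ rng.normal(size=5))  # linearly separable toy labels
best_C = pick_C(X[:60], y[:60], X[60:], y[60:])
```

A larger C tolerates fewer margin violations; the validation split keeps that trade-off from being tuned on the training data.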
Preferably, when the CNN is optimized with the stochastic gradient descent optimizer in step 34), the initial learning rate is 0.0001, and the learning rate is halved whenever no training progress is observed.
Preferably, the normalization in step 41) maps each element x_i of the feature array to the corresponding element x'_i of the normalized feature vector.
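The exact normalization formula appears in the original only as an image and is not reproduced here; a common choice consistent with the x_i → x'_i mapping is min-max scaling, shown below as an assumption:

```python
import numpy as np

def normalize(features):
    """Min-max scaling to [0, 1] -- an assumed form of the step 41)
    normalization; x'_i is the normalized counterpart of x_i."""
    lo, hi = features.min(), features.max()
    if hi == lo:
        return np.zeros_like(features)  # constant array: nothing to scale
    return (features - lo) / (hi - lo)

v = np.array([2.0, 4.0, 6.0])
out = normalize(v)  # smallest element maps to 0.0, largest to 1.0
```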
Preferably, in step 31) the resolution is adjusted from 1920 × 1080 pixels to 320 × 240 pixels.
Compared with the prior art, the present invention has the following advantages:
The present invention acquires the 3D coordinates of the human joints and the RGB video with a Microsoft Kinect v2 sensor, obtains the temporal and spatiotemporal features of the behavioral data in the video with a recurrent neural network and a convolutional neural network, and combines them effectively, finally obtaining a model adapted to complex environments and actions. Thereafter, simply inputting a segment of video and its corresponding data quickly and effectively yields the recognition of the actions of the people in it, with good accuracy and stability. Specifically:
(1) The invention uses a two-stream RNN/CNN neural network algorithm that effectively accounts for the influence of action continuity on action recognition, and can recognize and predict actions within a short period.
(2) The invention considers scene information together with action information: the scene information is used as a label to match action sequences in the behavior database, completing the recognition of human behavior more accurately.
(3) The invention proposes an effective and practical system architecture, configured with a user-interface module and an operation-logging module, improving the stability of human behavior recognition and facilitating the industrial application of the invention.
Brief description of the drawings
Fig. 1 is the flowchart of 3D video action recognition based on the convolutional neural network and the recurrent neural network.
Fig. 2 is the schematic diagram of the GRU-RNN neural network.
Fig. 3 is the schematic diagram of the 3D sampling points of the human joints.
Specific embodiment
The present invention is further elucidated below with reference to the drawings and specific embodiments. It should be understood that these examples merely illustrate the invention and do not limit its scope; after reading the present invention, modifications of various equivalent forms by those skilled in the art fall within the scope defined by the appended claims.
A human behavior recognition method based on a convolutional neural network and a recurrent neural network, as shown in Figs. 1-3, includes the following steps:
Step 1): a user waves 5 times at a fixed position, and a Microsoft Kinect v2 sensor tracks the human behavior. The joint sampling points of the Microsoft Kinect v2 sensor are: 1 - base of spine, 2 - middle of spine, 3 - neck, 4 - head, 5 - left shoulder, 6 - left elbow, 7 - left wrist, 8 - left hand, 9 - right shoulder, 10 - right elbow, 11 - right wrist, 12 - right hand, 13 - left hip, 14 - left knee, 15 - left ankle, 16 - left foot, 17 - right hip, 18 - right knee, 19 - right ankle, 20 - right foot, 21 - spine at shoulder, 22 - left hand tip, 23 - left thumb, 24 - right hand tip, 25 - right thumb. With a collection time step of 16 milliseconds, the 3D coordinates of the user's 25 main joints over the period are recorded, V = {(x_1, y_1, z_1), (x_2, y_2, z_2), ..., (x_25, y_25, z_25)}. The x-axis is perpendicular to the vertical axis of the body with its positive direction pointing forward; the y-axis is parallel to the vertical axis of the body with its positive direction pointing toward the head; the z-axis is perpendicular to the vertical axis of the body with its positive direction pointing to the body's left. The RGB video of the scene over the same period is collected simultaneously.
Step 2): the 3D coordinates of the 25 human joints from step 1) form the training set of the recurrent neural network (RNN), on which one RNN is trained. The RNN operates on the input coordinates with gated recurrent units (GRU) and uses batch normalization and random dropout. It comprises two bidirectional gated recurrent layers, a hidden fully connected layer, and a softmax output layer; the rectified linear unit (ReLU) serves as the activation function, and a dropout mechanism is used while mapping the motion features to the waving action class, yielding the trained RNN.
Step 21): the 3D coordinates of the 25 human joints from step 1) form the training set of the RNN; its dimension is 16 (time steps) × (25 × 3) (3D joint coordinates).
Step 22): the RNN is trained on this training set. The input first passes through two bidirectional gated recurrent layers, whose basic unit is the gated recurrent unit (GRU); as a bidirectional recurrent network, it is trained on the input data both from t = 1 to t = T and from t = T to t = 1, where t is the time variable of the data set and T is its last moment.
Step 23): scene features are classified with the GRU-type RNN. The update gate controls how much of the previous moment's state is carried into the current state: z_t = σ(W_z · [h_{t-1}, x_t]), where h_{t-1} is the value of the memory unit at time t-1, x_t is the input at time t, W_z is the weight of the update gate, z_t is the value of the update gate at time t, and σ is the sigmoid activation function.
Step 24): the reset gate controls how much previous-state information is written to the candidate memory unit h̃_t at time t: r_t = σ(W_r · [h_{t-1}, x_t]), where W_r is the weight of the reset gate and r_t is the value of the reset gate at time t.
Step 25): the candidate memory value is computed as h̃_t = tanh(W · [r_t ⊙ h_{t-1}, x_t]), where W is the weight of the current GRU unit and tanh is the hyperbolic tangent function.
Step 26): the memory state at the current time is computed as h_t = (1 - z_t) ⊙ h_{t-1} + z_t ⊙ h̃_t, where ⊙ denotes the element-wise product; the state update depends on the memory value h_{t-1} at time t-1 and the candidate memory value h̃_t, with the update gate and the reset gate adjusting the two contributions respectively.
Step 27): the output of the RNN is finally obtained as y_t = σ(W h_t); it is passed to the fully connected layer with the softmax activation function, and the output is interpreted as a probability, y_t being the probability predicted by the RNN.
Step 3): the RGB video collected in step 1) is split into consecutive RGB frames at 16-millisecond time steps, and the resolution is adjusted from 1920 × 1080 pixels to 320 × 240 pixels to form the training set of the convolutional neural network (CNN). The CNN applies convolution operations with multiple kernels to the input RGB frame stream and comprises convolutional layers, pooling layers, fully connected layers, and a softmax output layer. A model pre-trained on the Sports-1M data set is fine-tuned to reduce overfitting and training time, yielding the trained CNN.
The trained CNN of step 3) is obtained as follows:
Step 31): the RGB video collected in step 1) is split into consecutive RGB frames at 16-millisecond time steps and the frames are resized; this RGB frame stream forms the training set of the CNN. The input is denoted as having size c × t × h × w, where c is the number of channels, t the time step, and h and w the height and width of an RGB frame.
Step 32): the CNN receives the parallelized input of video frames. For a convolution window of length h, the CNN applies convolution kernels to the input matrix x_{1:n}; the output of a kernel is c_i = f(w · x_{i:i+h-1} + b), where w ∈ ℝ^{h×d} is the kernel weight, ℝ denotes the real field, d is the dimension of x_i, b ∈ ℝ is the bias, f is the activation function, and x_{i:i+h-1} is the frame vector matrix of one convolution window. For a frame stream with time extent n, the convolution yields the feature vector c = [c_1, c_2, ..., c_{n-h+1}].
Step 33): one maximum value is extracted from each feature vector, so a window with m convolution kernels yields the feature vector ĉ = [ĉ_1, ĉ_2, ..., ĉ_m], where ĉ is the feature vector extracted by the CNN and ĉ_m is the feature of the m-th convolution kernel.
Step 34): the classification result is output through the softmax function as y = f(W · (ĉ ⊙ r) + b), where y is the probability predicted by the CNN, r is the regularization (dropout) term constraining the output of the down-sampling layer, ⊙ denotes element-wise multiplication, W is the fully connected layer weight matrix, and b is the bias. The CNN is optimized with a stochastic gradient descent optimizer.
Step 4): the output of the trained RNN from step 2) is taken as the temporal feature vector of the 3D joint coordinate data, and the output of the trained CNN from step 3) as the spatiotemporal feature vector of the RGB video. The two vectors are concatenated into one feature array, normalized, and fed to the classifier of the linear support vector machine (SVM); a validation data set is used to find the penalty coefficient C of the linear SVM, the penalty coefficient representing the tolerance for errors, yielding the comprehensive recognition model.
The comprehensive recognition model of step 4) is obtained as follows:
Step 41): the trained RNN of step 2) extracts the temporal feature vector (600 dimensions) from the 3D human coordinate data, and the trained CNN of step 3) extracts the spatiotemporal feature vector (4096 dimensions) from the RGB video; the two vectors are concatenated into one feature array (4696 dimensions) and normalized, where x_i denotes an element of the feature array and x'_i the corresponding element of the normalized feature vector.
Step 42): the normalized feature vector groups serve as input, with the specific action or behavior label corresponding to each group as output, and are delivered to the linear SVM for training. The optimization model is:
min_{ω, b_0, ξ} (1/2)‖ω‖² + C Σ_{i=1}^{N} ξ_i
s.t. y_i(ω^T x_i + b_0) ≥ 1 - ξ_i,
ξ_i ≥ 0, i = 1, 2, ..., N
where ω is the weight vector, C the penalty coefficient, ξ_i the classification loss of the i-th sample point, y_i the action label of each sample, b_0 the intercept, and N the total number of feature vectors input to the SVM.
Step 43): through the training and validation sets, the optimal value of the penalty coefficient C is found, the comprehensive recognition model is obtained, and the model is finally used to verify the action category of the human body.
Step 5): at recognition time, the 3D coordinates of the 25 joints and the RGB video of the behavior to be recognized are acquired as in step 1). The joint coordinates are fed to the trained RNN of step 2) to obtain the temporal feature vector, and the RGB video to the trained CNN of step 3) to obtain the spatiotemporal feature vector. The two vectors are concatenated into one feature array, normalized, and passed to the comprehensive recognition model, which identifies the person's behavior.
Simulation:
A user waves continuously 5 times at a fixed position, and a Microsoft Kinect v2 sensor is used to track the human behavior. The time step during collection is set to 16 milliseconds, and the 3D coordinate vector group of the user's 25 main joints within the period is collected, V={(x1,y1,z1),(x2,y2,z2),...,(x25,y25,z25)}; the 25 joints yield 25 3D coordinates per frame. This action sample has dimension 16 (time steps) × (25 × 3) (3D joint coordinates) and is used as the training set of the recurrent neural network.
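The action-sample layout described above (16 time steps of 25 three-dimensional joints) can be sketched as follows; the zero coordinates stand in for actual Kinect readings:

```python
import numpy as np

T, J = 16, 25                          # time steps (16 ms apart) and tracked joints
frames = []
for t in range(T):
    joints = np.zeros((J, 3))          # placeholder for one frame's (x, y, z) per joint
    frames.append(joints.reshape(-1))  # flatten to 25 * 3 = 75 values

sample = np.stack(frames)              # one RNN training sample of shape (16, 75)
```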
The recurrent neural network is trained on multiple GPUs; the learning rate is set to 0.001 and the decay rate to 0.9, and the network is then trained from scratch with mini-batches of 1000 sequences for the single-layer model and mini-batches of 650 sequences for the two-layer model. For all RNN networks, each unidirectional layer uses 300 neurons, the number of neurons in a bidirectional layer is doubled, dropout is applied with a 75% keep probability, and a final fully connected layer yields the temporal feature vector (600-dimensional) of the human 3D joint coordinate data.
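The 75% keep-probability dropout mentioned above can be sketched in NumPy; inverted dropout (rescaling by the keep probability at training time) is an assumption about the exact variant used:

```python
import numpy as np

def dropout(x, keep_prob=0.75, rng=None, train=True):
    """Inverted dropout: keep each unit with probability keep_prob during
    training and rescale so the expected activation stays unchanged."""
    if not train:
        return x
    rng = rng or np.random.default_rng(0)
    mask = rng.random(x.shape) < keep_prob
    return x * mask / keep_prob

feat = np.ones(600)                    # stands in for the 600-d temporal feature
dropped = dropout(feat, keep_prob=0.75)
```

At inference time (train=False) the feature passes through unchanged, which is why the rescaling is done during training.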
The RGB video is then introduced for convolutional neural network training. A 3D-CNN model is used in Caffe; frames are extracted from the RGB video and cropped from 1920×1080 pixels to 320×240 pixels, with a time step of 16 milliseconds. We denote the input size of the CNN model as c×t×h×w, where c is the number of channels, t is the time step, and h and w are the height and width of an RGB frame. The network takes video clips as input, with the waving action label as the data label. The input RGB frames are then resized to a resolution of 128×171 pixels, so the input size becomes 3×16×128×171. The network is trained and validated with a stochastic gradient descent optimizer, with an initial learning rate of 0.0001; then, when no training progress is observed, the learning rate is reduced by half. Finally, a fully connected layer yields the spatiotemporal feature vector (4096-dimensional) of the RGB video.
For feature fusion, the temporal feature vector extracted from the RNN and the spatiotemporal feature vector extracted from the CNN are concatenated into a feature array (4696-dimensional) and L2-normalized. Finally, the RNN/CNN features are standardized through the training, validation and test splits. The training and validation split is used to find the optimal value of the parameter C of the linear SVM model, and this combined model is used to verify the accuracy of action recognition.
The present invention solves the overfitting problem of the model with respect to action classification during model training, while also effectively improving the efficiency and accuracy of human behavior recognition.
The above is only a preferred embodiment of the present invention. It should be pointed out that, for those of ordinary skill in the art, various improvements and modifications may be made without departing from the principle of the present invention, and these improvements and modifications should also be regarded as falling within the protection scope of the present invention.
Claims (8)
1. A human behavior recognition method based on a convolutional neural network and a recurrent neural network, characterized by comprising the following steps:
Step 1), a user waves continuously 5 times at a fixed position, and a sensor is used to track the human behavior; the time step during collection is set to 16 milliseconds, and the 3D coordinates of the user's 25 main joints within the period are collected, V={(x1,y1,z1),(x2,y2,z2),...,(x25,y25,z25)}; wherein the x-axis is perpendicular to the vertical direction of the human body with its positive half-axis pointing forward, the y-axis is parallel to the vertical direction of the human body with its positive half-axis pointing toward the head, and the z-axis is perpendicular to the vertical direction of the human body with its positive half-axis pointing to the left side of the body; the RGB video of the scene within the same period is acquired at the same time;
Step 2), the 3D coordinates of the 25 human joints from step 1) are used as the training set of a recurrent neural network RNN, and one recurrent neural network RNN is trained on this training set; the recurrent neural network RNN operates on the input coordinates with gated recurrent units GRU and has batch normalization and random dropout functions; the recurrent neural network RNN comprises two bidirectional gated recurrent layers, a hidden fully connected layer and a softmax output layer; the rectified linear unit, i.e. the ReLU function, is used as the activation function, and a dropout mechanism is used to map the motion features to the waving action category, so that the trained recurrent neural network RNN is obtained;
Step 3), the RGB video collected in step 1) is divided into consecutive RGB frames with 16 milliseconds as the time step, and their resolution is adjusted, as the training set of a convolutional neural network CNN; the convolutional neural network CNN performs convolution operations on the input RGB frame stream with multiple convolution kernels, and comprises convolutional layers, pooling layers, fully connected layers and a softmax output layer; a model pre-trained on the Sports-1M dataset is fine-tuned to reduce overfitting and training time, so that the trained convolutional neural network CNN is obtained;
Step 4), the output result of the recurrent neural network RNN trained in step 2) is taken as the temporal feature vector of the 3D coordinate data, and the output result of the convolutional neural network CNN trained in step 3) is taken as the spatiotemporal feature vector of the RGB video; the temporal feature vector of the human joint 3D coordinate data and the spatiotemporal feature vector of the RGB video are concatenated, then normalized, and fed to a linear support vector machine SVM classifier; using a validation dataset, the penalty coefficient C of the linear support vector machine SVM is found, so that the comprehensive recognition model is obtained;
Step 5), at recognition time, the method of step 1) is used to acquire the 3D coordinates of the 25 human joints and the RGB video of the human behavior to be recognized; the collected 3D coordinates of the 25 joints are fed into the recurrent neural network RNN trained in step 2) to obtain the temporal feature vector, and the collected RGB video is fed into the convolutional neural network CNN trained in step 3) to obtain the spatiotemporal feature vector; the temporal feature vector and the spatiotemporal feature vector are concatenated into one feature array, normalized, and imported into the comprehensive recognition model, and the comprehensive recognition model is used to recognize the person's behavior.
2. The human behavior recognition method based on a convolutional neural network and a recurrent neural network according to claim 1, characterized in that: the sensor in step 1) is a Microsoft Kinect v2 sensor.
3. The human behavior recognition method based on a convolutional neural network and a recurrent neural network according to claim 2, characterized in that: the method of obtaining the trained recurrent neural network RNN in step 2) is as follows:
Step 21), the 3D coordinates of the 25 human joints from step 1) are used as the training set of the recurrent neural network RNN; the dimension of this training set is 16 × (25 × 3);
Step 22), the recurrent neural network is trained; the training set of the recurrent neural network RNN first passes through two bidirectional gated recurrent layers, the gated recurrent unit GRU being the basic unit of each recurrent layer and a bidirectional recurrent neural network being used, which trains separately on the input data from t=1 to t=T and from t=T to t=1, where t denotes the time variable in the dataset and T denotes the last moment in the dataset;
Step 23), scene feature classification is carried out with the GRU-type recurrent neural network RNN; an update gate is used to control the degree to which the state information of the previous moment is carried into the current state, the update gate being zt=σ(Wz·[ht-1,xt]), where ht-1 is the value of the memory unit at moment t-1, xt denotes the data input at moment t, Wz denotes the weight of the update gate, zt denotes the value of the update gate at moment t, and σ denotes the sigmoid activation function;
Step 24), a reset gate is used to control how much information of the previous state is written to the candidate memory unit h̃t at moment t, the reset gate being rt=σ(Wr·[ht-1,xt]), where Wr denotes the weight of the reset gate and rt denotes the value of the reset gate at moment t;
Step 25), the candidate memory unit value is calculated, the candidate memory unit being h̃t=tanh(W·[rt⊙ht-1,xt]), where W denotes the weight of the current GRU unit, h̃t denotes the value of the candidate memory unit, and tanh denotes the hyperbolic tangent function;
Step 26), the memory unit state value at the current moment is calculated, the memory unit at moment t being ht=(1−zt)⊙ht-1+zt⊙h̃t, where ⊙ denotes the element-wise product; the memory unit state update depends on the value ht-1 of the memory unit at moment t-1 and the value h̃t of the candidate memory unit, these two factors being adjusted by the update gate and the reset gate respectively;
Step 27), finally the output of the recurrent neural network yt=σ(W·ht) is obtained and transmitted to the fully connected layer, and the output is interpreted as a probability using the softmax activation function, yt denoting the probability predicted by the recurrent neural network.
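The gate equations of steps 23) to 27) can be sanity-checked with a minimal NumPy step function (the small random weights are illustrative, not the patent's trained parameters):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(h_prev, x, Wz, Wr, W):
    """One GRU step: update gate z_t, reset gate r_t, candidate state,
    and the blended new state h_t, following steps 23)-27)."""
    hx = np.concatenate([h_prev, x])
    z = sigmoid(Wz @ hx)                                   # update gate z_t
    r = sigmoid(Wr @ hx)                                   # reset gate r_t
    h_cand = np.tanh(W @ np.concatenate([r * h_prev, x]))  # candidate memory
    return (1 - z) * h_prev + z * h_cand                   # new state h_t

rng = np.random.default_rng(1)
H, D = 300, 75                      # hidden units per layer, input dimension
Wz, Wr, W = (rng.normal(scale=0.01, size=(H, H + D)) for _ in range(3))
h = gru_step(np.zeros(H), rng.normal(size=D), Wz, Wr, W)
```

Because the candidate state passes through tanh and the gates through sigmoid, each component of the new state stays inside (-1, 1) when the previous state is zero.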
4. The human behavior recognition method based on a convolutional neural network and a recurrent neural network according to claim 3, characterized in that: the method by which step 3) obtains the trained convolutional neural network CNN is as follows:
Step 31), the RGB video collected in step 1) is divided into consecutive RGB frames with 16 milliseconds as the time step, and their resolution is adjusted, the RGB frame stream being used as the training set of the convolutional neural network CNN; the input size is denoted c×t×h×w, where c is the number of channels, t is the time step, and h and w are the height and width of an RGB frame;
Step 32), the convolutional neural network CNN receives the parallelized input of video frames; for a convolution window of length h, the convolutional neural network CNN performs a convolution operation on the input matrix x1:n with a convolution kernel, the convolution kernel output value being ci=f(w·xi:i+h-1+b), where w∈R^(hd) is the convolution kernel weight, R denoting the real number field and d the dimension of xi, b∈R is the bias, f is the activation function, and xi:i+h-1 is the frame vector matrix of one convolution window; for a frame stream whose time length is n, the convolution operation yields the convolved feature vector c=[c1,c2,…,cn-h+1];
Step 33), one maximum value is extracted from each feature vector, and a window with m convolution kernels yields the feature vector ĉ=[ĉ1,ĉ2,…,ĉm], where ĉ is the feature vector extracted by the convolutional neural network and ĉm denotes the feature of the m-th convolution kernel;
Step 34), the classification result is output by the softmax function as y=softmax(W·(ĉ∘r)+b), where y denotes the probability predicted by the convolutional neural network, r is the regularization term limiting the output of the down-sampling layer, ∘ denotes element-wise multiplication, W is the weight matrix of the fully connected layer, and b is the bias; the convolutional neural network CNN is optimized by a stochastic gradient descent optimizer.
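The window convolution and max pooling of steps 32) and 33) can be sketched for a single kernel; the toy frame stream and the ReLU activation are assumptions for illustration:

```python
import numpy as np

def conv_feature(x, w, b, h):
    """Slide a window of h frames over the stream x (n frames of d features),
    computing c_i = relu(w . x_{i:i+h-1} + b), then max-pool over the c_i."""
    n = len(x)
    c = np.array([max(0.0, float(w @ x[i:i + h].reshape(-1)) + b)
                  for i in range(n - h + 1)])
    return c, c.max()

x = np.arange(12, dtype=float).reshape(6, 2)  # n=6 frames, d=2 features each
w = np.ones(2 * 2)                            # kernel weight in R^(h*d), h=2
c, c_hat = conv_feature(x, w, b=0.0, h=2)
```

The sliding window yields n-h+1 = 5 activations, and the pooled value ĉ is their maximum, matching the dimensions stated in the claim.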
5. The human behavior recognition method based on a convolutional neural network and a recurrent neural network according to claim 4, characterized in that: the method of obtaining the comprehensive recognition model in step 4) is as follows:
Step 41), the recurrent neural network RNN trained in step 2) extracts features from the human 3D coordinate data, its output result being the temporal feature vector; the convolutional neural network CNN trained in step 3) extracts features from the RGB video, its output result being the spatiotemporal feature vector; the temporal feature vector and the spatiotemporal feature vector are concatenated into one feature array and normalized;
Step 42), the normalized feature vector groups are used as input, and the specific action or behavior label corresponding to each normalized feature vector group is used as output; these are fed into a linear support vector machine SVM for training, the optimization model being as follows:
min (1/2)‖ω‖² + C∑ξi
s.t. yi(ωTxi+b0) ≥ 1−ξi
ξi ≥ 0
i = 1, 2, …, N
where ω denotes the weight vector, C denotes the penalty coefficient, ξi denotes the classification loss of the i-th sample point, xi denotes the i-th input feature vector, yi denotes the action label corresponding to each sample, b0 denotes the intercept, and N denotes the total number of feature vectors input to the SVM;
Step 43), through the training and validation sets, the optimum value of the penalty coefficient C is found, and the comprehensive recognition model is obtained.
6. The human behavior recognition method based on a convolutional neural network and a recurrent neural network according to claim 5, characterized in that: when the convolutional neural network CNN is optimized by the stochastic gradient descent optimizer in step 34), the initial learning rate is 0.0001, and when no training progress is observed the learning rate drops by half.
7. The human behavior recognition method based on a convolutional neural network and a recurrent neural network according to claim 6, characterized in that: the normalization formula in step 41) is x'i = xi/√(∑j xj²), where xi denotes an element of the feature array and x'i denotes the corresponding element of the normalized feature vector.
8. The human behavior recognition method based on a convolutional neural network and a recurrent neural network according to claim 7, characterized in that: in step 31), the resolution is adjusted from 1920×1080 pixels to 320×240 pixels.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910580116.XA CN110321833B (en) | 2019-06-28 | 2019-06-28 | Human body behavior identification method based on convolutional neural network and cyclic neural network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110321833A true CN110321833A (en) | 2019-10-11 |
CN110321833B CN110321833B (en) | 2022-05-20 |
Family
ID=68121381
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110321833B (en) |
Cited By (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110880172A (en) * | 2019-11-12 | 2020-03-13 | 中山大学 | Video face tampering detection method and system based on cyclic convolution neural network |
CN111079928A (en) * | 2019-12-14 | 2020-04-28 | 大连大学 | Method for predicting human motion by using recurrent neural network based on antagonistic learning |
CN111459283A (en) * | 2020-04-07 | 2020-07-28 | 电子科技大学 | Man-machine interaction implementation method integrating artificial intelligence and Web3D |
CN111597881A (en) * | 2020-04-03 | 2020-08-28 | 浙江工业大学 | Human body complex behavior identification method based on data separation multi-scale feature combination |
CN111681321A (en) * | 2020-06-05 | 2020-09-18 | 大连大学 | Method for synthesizing three-dimensional human motion by using recurrent neural network based on layered learning |
CN111860269A (en) * | 2020-07-13 | 2020-10-30 | 南京航空航天大学 | Multi-feature fusion tandem RNN structure and pedestrian prediction method |
CN111914638A (en) * | 2020-06-29 | 2020-11-10 | 南京邮电大学 | Character action recognition method based on improved long-term recursive deep convolution model |
CN112232489A (en) * | 2020-10-26 | 2021-01-15 | 南京明德产业互联网研究院有限公司 | Method and device for gating cycle network and method and device for link prediction |
CN112488014A (en) * | 2020-12-04 | 2021-03-12 | 重庆邮电大学 | Video prediction method based on gated cyclic unit |
CN112906509A (en) * | 2021-01-28 | 2021-06-04 | 浙江省隧道工程集团有限公司 | Method and system for identifying operation state of water delivery tunnel excavator |
CN112906383A (en) * | 2021-02-05 | 2021-06-04 | 成都信息工程大学 | Integrated adaptive water army identification method based on incremental learning |
CN113111756A (en) * | 2021-04-02 | 2021-07-13 | 浙江工业大学 | Human body tumble identification method based on human body skeleton key points and long-term and short-term memory artificial neural network |
CN113378638A (en) * | 2021-05-11 | 2021-09-10 | 大连海事大学 | Human body joint point detection and D-GRU network-based abnormal behavior identification method for wheelers |
CN113568819A (en) * | 2021-01-31 | 2021-10-29 | 腾讯科技(深圳)有限公司 | Abnormal data detection method and device, computer readable medium and electronic equipment |
CN116911955A (en) * | 2023-09-12 | 2023-10-20 | 深圳须弥云图空间科技有限公司 | Training method and device for target recommendation model |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107943967A (en) * | 2017-11-28 | 2018-04-20 | 华南理工大学 | Algorithm of documents categorization based on multi-angle convolutional neural networks and Recognition with Recurrent Neural Network |
CN108280406A (en) * | 2017-12-30 | 2018-07-13 | 广州海昇计算机科技有限公司 | A kind of Activity recognition method, system and device based on segmentation double-stream digestion |
AU2018101512A4 (en) * | 2018-10-11 | 2018-11-15 | Dong, Xun Miss | A comprehensive stock trend predicting method based on neural networks |
CN109117701A (en) * | 2018-06-05 | 2019-01-01 | 东南大学 | Pedestrian's intension recognizing method based on picture scroll product |
Non-Patent Citations (4)
Title |
---|
BHARGAVI SUVARNAM等: "Combination of CNN-GRU Model to Recognize Characters of a License Plate number without Segmentation", 《2019 5TH INTERNATIONAL CONFERENCE ON ADVANCED COMPUTING & COMMUNICATION SYSTEMS (ICACCS)》 * |
YIN LIANGLIANG et al.: "Research on a text sentiment analysis method based on an attention mechanism and a BGRU network", Wireless Internet Technology * |
YANG TIANMING et al.: "Spatio-temporal two-stream human action recognition model based on video deep learning", Journal of Computer Applications * |
MA JING: "Research and implementation of an action recognition method based on posture and skeleton information", China Master's Theses Full-text Database, Information Science and Technology * |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110321833A (en) | Human body behavior recognition method based on convolutional neural network and recurrent neural network | |
Zhang et al. | Dynamic hand gesture recognition based on short-term sampling neural networks | |
CN103268495B (en) | Human body behavior modeling recognition methods based on priori knowledge cluster in computer system | |
CN101777116B (en) | Method for analyzing facial expressions on basis of motion tracking | |
CN108664947A (en) | A kind of fatigue driving method for early warning based on Expression Recognition | |
CN106096535A (en) | A kind of face verification method based on bilinearity associating CNN | |
CN110232412B (en) | Human gait prediction method based on multi-mode deep learning | |
CN108681725A (en) | A kind of weighting sparse representation face identification method | |
CN109389045A (en) | Micro- expression recognition method and device based on mixing space-time convolution model | |
Vakanski et al. | Mathematical modeling and evaluation of human motions in physical therapy using mixture density neural networks | |
CN106909938A (en) | Viewing angle independence Activity recognition method based on deep learning network | |
Li et al. | Meta auxiliary learning for facial action unit detection | |
CN111382699A (en) | Dynamic gesture recognition method based on particle swarm optimization LSTM algorithm | |
CN110096976A (en) | Human behavior micro-Doppler classification method based on sparse migration network | |
Wu et al. | Improving NeuCube spiking neural network for EEG-based pattern recognition using transfer learning | |
CN111062245A (en) | Locomotive driver fatigue state monitoring method based on upper body posture | |
Wu et al. | An unsupervised real-time framework of human pose tracking from range image sequences | |
Khan | Human activity analysis in visual surveillance and healthcare | |
Li et al. | A shortcut enhanced LSTM-GCN network for multi-sensor based human motion tracking | |
CN105046193B (en) | A kind of human motion recognition method based on fusion rarefaction representation matrix | |
CN106096598A (en) | A kind of method and device utilizing degree of depth related neural network model to identify human face expression | |
CN114191797B (en) | Free skiing intelligent training system | |
AU2021100829A4 (en) | A Face Recognition Method Based On Bidirectional 2DPCA And Cascaded Feedforward Neural Network | |
Wang et al. | Spatial-temporal feature representation learning for facial fatigue detection | |
Mahajan et al. | Classification of emotions using a 2-channel convolution neural network |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||