CN109977904A - Lightweight human action recognition method based on deep learning - Google Patents

Lightweight human action recognition method based on deep learning

Info

Publication number
CN109977904A
CN109977904A
Authority
CN
China
Prior art keywords
layer
network
module
shallow
convolutional layer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910269644.3A
Other languages
Chinese (zh)
Inventor
魏维
何冰倩
魏敏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu University of Information Technology
Original Assignee
Chengdu University of Information Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu University of Information Technology
Priority to CN201910269644.3A
Publication of CN109977904A
Pending legal-status Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G06V 20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/20 Movements or behaviour, e.g. gesture recognition

Abstract

The invention discloses a lightweight human action recognition method based on deep learning. The method first constructs a lightweight deep learning network (SDNet) combining shallow and deep networks; the network consists of shallow multi-scale modules and a deep network module, and a lightweight deep-learning-based human action recognition model is built on top of it. In the model, SDNet first performs feature extraction and representation on the spatial and temporal streams; a temporal pyramid pooling layer then aggregates the frame-level features of both streams into video-level representations; fully connected and softmax layers next produce each stream's recognition result for the input sequence; finally, the two streams' results are combined by weighted-average fusion to obtain the final recognition result. With this lightweight deep-learning-based method, the number of model parameters can be greatly reduced without any loss of recognition accuracy.

Description

Lightweight human action recognition method based on deep learning
Technical field
The present invention relates to the field of graphics and image processing, and in particular to a supervised, lightweight, deep-learning-based human action recognition model and method.
Background technique
The central problem of human action recognition is how to analyze and process video sequences captured by cameras or sensors so that a computer can "understand" the movements and behaviors of people in video. This has important research significance for security surveillance, entertainment, and related areas, and video-based human action recognition is also widely applied in human-computer interaction, virtual reality, smart home devices, and similar fields. For many artificial intelligence systems, human action recognition, or human behavior understanding, is essential. For example, a video surveillance system may contain hundreds of hours of footage; traversing it manually is not only tedious but also highly inefficient. With human action recognition technology, the actions of people in surveillance video can be recognized and understood, so that malicious and abnormal behaviors are detected automatically and effectively.
Video-based human action recognition is an inherently challenging task, for two main reasons: environmental factors in the video, and the complexity of the action categories themselves. Changes in illumination, camera shake, and viewpoint variation are all environmental factors. Scenes in video are endlessly varied, and even in a relatively fixed-background setting such as an indoor scene, changes in illumination or occlusion of the subject affect the recognition task. As for the complexity of the categories themselves, the main issues are inter-class similarity and intra-class variation. For example, "jogging", "strolling", and "running" are three distinct categories, yet factors such as movement speed make the differences between these categories small; conversely, the same action can look quite different across viewpoints, producing large variation within a single category.
Since the deep learning network model LeNet was proposed and achieved considerable success on handwritten digit recognition, scholars at home and abroad have proposed a succession of deep learning network models and applied them to human action recognition, such as AlexNet, VggNet, GoogleNet, ResNet, and DenseNet. AlexNet and VggNet improve performance by deepening the network, while GoogleNet and ResNet do so by increasing the model's width or depth; these networks have achieved notable results in image recognition and classification, human action recognition, and other fields. Although experiments with added weight layers have shown that shallow learning networks are limited both in expressing complicated functions and in model generalization, the continual stacking of layers and widening of networks also brings problems: an enormous number of parameters, high computational complexity, and, the deeper the network, the more easily vanishing gradients appear. Moreover, blindly increasing network depth can cause accuracy to saturate or even decline, producing the degradation problem of network models.
The main difference between video-based human action recognition and still-image recognition is that a video sequence contains not only the appearance information of the images but also motion information along the time axis, whereas single-image analysis need not consider temporal information. Because two-dimensional convolutional neural network models cannot effectively exploit the motion information in video sequences, three-dimensional convolutional networks and two-stream convolutional networks have gradually been adopted to recognize human actions in video. These models account, to some degree, for the motion information carried by video sequences, but in the pursuit of higher recognition accuracy their network structures still grow ever deeper.
In conclusion present inventor during realizing the present application technical solution, has found currently based on depth The human action identification model or method for spending study at least have the following technical problems:
One, the recognition performance of network model is improved by deepening network depth and widen network-wide, increase considerably The calculation amount of network model, and since parameter amount is excessive, it is easy to appear the problem of gradient disappears, discrimination does not rise anti-drop.
Two, current light-type deep learning network model reduces although having compressed scale of model to a certain extent Parameter amount, but in the human action identification problem based on video, it is difficult to reply is effectively extracted comprising complicated incidence relation Space-time characteristic problem.
Summary of the invention
To solve the problems of existing deep-learning-based human action recognition models, namely large parameter counts and networks that are too deep and heavy, the present invention provides a lightweight human action recognition method based on deep learning. The method contains a lightweight deep learning network combining shallow and deep networks: the shallow multi-scale modules describe the local features of the video sequence at different scales, and the deep network module effectively fuses and characterizes the extracted multi-scale features. Together they form a lightweight deep-learning-based human action recognition model that reduces the number of model parameters without losing accuracy.
In one aspect, the present invention is realized through the following technical solution:
A lightweight human action recognition method based on deep learning, with the following specific steps:
Step 1: process the video data containing human actions to obtain an RGB frame sequence and an optical flow frame sequence;
Step 2: construct a lightweight deep learning network combining shallow and deep networks (A lightweight deep learning network model combining shallow and deep networks, SDNet), which contains shallow multi-scale modules and a deep network module;
Step 3: use the SDNet network constructed in Step 2 to build the lightweight deep-learning-based human action recognition model; the model is a two-stream structure, i.e., it contains a temporal stream and a spatial stream;
Step 4: use the model constructed in Step 3 to process the RGB data and optical flow data of a video sequence and obtain the human action classification result.
In another aspect, the invention proposes a lightweight deep learning network combining shallow and deep networks (A lightweight deep learning network model combining shallow and deep networks, SDNet), which contains shallow multi-scale modules (Shallow multi-scale module, SMSM) and a deep network module (Deep networks module, DNM):
The shallow multi-scale modules obtain local human action features from the initial RGB frame sequence and optical flow frame sequence;
The deep network module fuses the local human action features extracted by the shallow multi-scale modules and generates high-level features.
Further, the network structure of the lightweight deep learning network combining shallow and deep networks is designed as follows:
(a) Convolutional layer C1: kernel size 3 × 3 × 3, stride 1, 32 filters, with the ReLU function as the activation of this layer, where ReLU is defined as:
ReLU(x) = max(0, x)
(b) Pooling layer S1: a max pooling layer (Max Pooling) resamples the feature maps produced by C1; the pooling kernel size is 2 × 2 × 2 with stride 2;
(c1) Shallow multi-scale module SMSM-1: the module contains three branches, each consisting of three convolutional layers with different kernels; every convolutional layer in the module has 64 filters;
(c2) Shallow multi-scale module SMSM-2: the module contains three branches, each consisting of three convolutional layers with different kernels; every convolutional layer in the module has 128 filters;
(c3) Shallow multi-scale module SMSM-3: the module contains three branches, each consisting of three convolutional layers with different kernels; every convolutional layer in the module has 256 filters;
(d) Convolutional layer C2: the layer has two parts. The first part connects the densely linked shallow multi-scale modules (concatenation), merging their feature maps with the concatenation function; the second part is a convolutional layer with kernel size 1 × 1 × 1, stride 1, and 256 filters, containing no nonlinear activation function;
(e) Deep network module DNM.
Further, in the network structure of the lightweight deep learning network combining shallow and deep networks, the shallow multi-scale module SMSM-1 of layer (c1) is designed as:
1) Branch 1:
(a) Convolutional layer C1: kernel size 1 × 1 × 1, stride 1, 64 filters, with the ReLU function as the activation of this layer;
(b) Convolutional layer C2: kernel size 3 × 3 × 3, stride 1, 64 filters, dilation rate 1, with the ReLU function as the activation of this layer. The effective kernel size after dilation is computed as:
dilated kernel size = dilation rate × (kernel size before dilation − 1) + 1
For example, a 3 × 3 × 3 kernel with dilation rate 3 has an effective size of 3 × (3 − 1) + 1 = 7 in each dimension.
(c) Convolutional layer C3: kernel size 3 × 3 × 3, stride 1, 64 filters, dilation rate 1, with the ReLU function as the activation of this layer;
2) Branch 2:
(a) Convolutional layer C1: kernel size 1 × 1 × 1, stride 1, 64 filters, ReLU activation;
(b) Convolutional layer C2: kernel size 3 × 3 × 3, stride 1, 64 filters, dilation rate 2, ReLU activation;
(c) Convolutional layer C3: kernel size 3 × 3 × 3, stride 1, 64 filters, dilation rate 2, ReLU activation;
3) Branch 3:
(a) Convolutional layer C1: kernel size 1 × 1 × 1, stride 1, 64 filters, ReLU activation;
(b) Convolutional layer C2: kernel size 3 × 3 × 3, stride 1, 64 filters, dilation rate 3, ReLU activation;
(c) Convolutional layer C3: kernel size 3 × 3 × 3, stride 1, 64 filters, dilation rate 3, ReLU activation;
4) Connection layer (concatenation): the feature maps of the three branches are joined with the concatenation function;
5) Pooling layer S1: a max pooling layer (Max Pooling) resamples the concatenated feature maps; the pooling kernel size is 2 × 2 × 1 with stride 2.
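The three dilation rates give the branches three effective receptive fields; a quick plain-Python check of the formula above (illustrative only):

```python
# Effective kernel size k_eff = d * (k - 1) + 1 for the three SMSM branch
# dilation rates, with kernel size k = 3 before dilation:
for d in (1, 2, 3):
    print(f"dilation rate {d}: effective kernel size {d * (3 - 1) + 1}")
# -> 3, 5, 7: each branch describes the local features at a different scale.
```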
Further, in the network structure of the lightweight deep learning network combining shallow and deep networks, the shallow multi-scale module SMSM-2 of layer (c2) is identical to SMSM-1 except that each convolutional layer has 128 filters.
Further, in the same network structure, the shallow multi-scale module SMSM-3 of layer (c3) is identical to SMSM-1 except that each convolutional layer has 256 filters.
Preferably, in the network structure of the lightweight deep learning network combining shallow and deep networks, the three shallow multi-scale modules SMSM of layers (c1), (c2), and (c3) are densely connected, i.e., (c1) SMSM-1 connects to (c2) SMSM-2 and to (d) convolutional layer C2, and SMSM-2 connects to (c3) SMSM-3 and to (d) convolutional layer C2.
Further, in the network structure of the lightweight deep learning network combining shallow and deep networks, the deep network module DNM of layer (e) is designed as:
(a) Convolutional layer C1: kernel size 3 × 3 × 3, stride 1, 512 filters, dilation rate 2, with the ReLU function as the activation of this layer;
(b) MLP convolutional layer C2: kernel size 1 × 1 × 1, stride 1, 512 filters, dilation rate 2, with the ReLU function as the activation of this layer;
(c) Convolutional layer C3: kernel size 3 × 3 × 3, stride 1, 512 filters, dilation rate 2, with the ReLU function as the activation of this layer;
(d) Pooling layer S1: a max pooling layer (Max Pooling) resamples the feature maps produced by C3; the pooling kernel size is 2 × 2 × 2 with stride 2;
(e) Convolutional layer C4: kernel size 3 × 3 × 3, stride 1, 512 filters, dilation rate 1, with the ReLU function as the activation of this layer;
(f) MLP convolutional layer C5: kernel size 1 × 1 × 1, stride 1, 512 filters, dilation rate 1, with the ReLU function as the activation of this layer;
(g) Convolutional layer C6: kernel size 3 × 3 × 3, stride 1, 512 filters, dilation rate 1, with the ReLU function as the activation of this layer;
(h) Pooling layer S2: a global average pooling layer (Global Average Pooling) is applied to the feature maps of C6; this layer replaces a fully connected layer and thereby reduces the parameter count.
In a further aspect, the invention constructs the lightweight deep-learning-based human action recognition model. The model is a two-stream structure containing a temporal stream and a spatial stream, with the following specific structure:
(a) Input layer: the temporal-stream input is the optical flow data of the video sequence, with frame size 224 × 224; the spatial-stream input is the RGB data of the video sequence, likewise with frame size 224 × 224;
(b) SDNet: both the temporal-stream and spatial-stream networks consist of SDNet, which extracts the spatio-temporal features of the video sequence;
(c) Pooling layer S1: a temporal pyramid pooling layer (TPP) aggregates the frame-level features of the temporal and spatial streams into video-level representations. The temporal pyramid levels are set to {4 × 4 × 1, 2 × 2 × 1, 1 × 1 × 1}, i.e., a 3-level pyramid;
(d) Fully connected layer FC: the layer contains 1024 filters, i.e., 1024 neurons connected to S1, with ReLU as the activation function;
(e) Softmax layer: a Softmax classifier converts the feature values produced by the FC layer into relative probabilities of the classes and yields the class score. The Softmax function is defined as:
p_i = exp(V_i) / Σ_{j=1}^{C} exp(V_j)
where V_i is the output of the i-th output unit of the stage preceding the classifier Softmax, i.e., of the FC layer, i is the class index, and C is the total number of classes; p_i is the ratio of the exponential of the current element to the sum of the exponentials of all elements, and max(p_i) is the class score;
(f) Fusion layer: this layer fuses the class scores of the temporal and spatial streams with a decision-level fusion rule to obtain the action classification result. The fusion confidences of the temporal and spatial streams are set to 1 : 1.
Compared with the prior art, the present invention has the following advantages and beneficial effects:
1. The lightweight deep learning network combining shallow and deep networks extracts the spatio-temporal features of the video sequence; with the densely connected shallow multi-scale modules and the deep network module, the number of model parameters is greatly reduced and the problem of an overly deep network is avoided.
2. The lightweight deep-learning-based human action recognition model uses a two-stream network structure, which effectively captures appearance and motion information, makes better use of the spatio-temporal information of human actions, and enhances the model's discrimination and generalization ability.
3. While keeping the action recognition accuracy roughly level with current state-of-the-art methods, the invention considerably reduces the number of model parameters and compresses the model scale.
Detailed description of the invention
The accompanying drawings described here provide a further understanding of the embodiments of the invention and constitute part of the application; they do not limit the embodiments of the invention. In the drawings:
Fig. 1 is a diagram of the lightweight deep learning network combining shallow and deep networks of the invention.
Fig. 2 is an example diagram of the shallow multi-scale module SMSM-1 in the lightweight deep learning network combining shallow and deep networks of the invention.
Fig. 3 is a diagram of the lightweight deep-learning-based human action recognition model of the invention.
Specific embodiment
To make the objectives, technical solutions, and advantages of the present invention clearer, the invention is described in further detail below with reference to the embodiments and the accompanying drawings; the exemplary embodiments of the invention and their explanations are used only to explain the invention and are not a limitation of it.
As shown in Fig. 3, the lightweight deep learning network combining shallow and deep networks (SDNet) proposed by the invention first performs feature extraction and representation on the spatial and temporal streams; a temporal pyramid pooling layer (TPP) then aggregates the frame-level features of the temporal and spatial streams into video-level representations; fully connected and softmax layers next produce each stream's recognition result for the input sequence; and finally the two streams' results are fused by weighted averaging to obtain the final recognition result. The method mainly comprises the following steps:
Step 1: process the video data containing human actions to obtain an RGB frame sequence and an optical flow frame sequence (a minimal preprocessing sketch follows);
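A minimal preprocessing sketch in Python, assuming OpenCV; the patent does not name a particular optical flow algorithm, so Farneback dense flow is used here purely as a stand-in, and the video path is a placeholder:

```python
import cv2
import numpy as np

def rgb_and_flow_frames(path: str):
    """Decode a video into an RGB frame sequence and an optical flow frame sequence."""
    cap = cv2.VideoCapture(path)
    rgb_frames, flow_frames, prev_gray = [], [], None
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frame = cv2.resize(frame, (224, 224))  # frame size used by the model
        rgb_frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        if prev_gray is not None:
            # Dense flow between consecutive frames -> (224, 224, 2) per pair
            flow_frames.append(cv2.calcOpticalFlowFarneback(
                prev_gray, gray, None, 0.5, 3, 15, 3, 5, 1.2, 0))
        prev_gray = gray
    cap.release()
    return np.stack(rgb_frames), np.stack(flow_frames)

# rgb, flow = rgb_and_flow_frames("video.avi")  # placeholder path
```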
Step 2: construct a lightweight deep learning network combining shallow and deep networks (A lightweight deep learning network model combining shallow and deep networks, SDNet), which contains shallow multi-scale modules (Shallow multi-scale module, SMSM) and a deep network module (Deep networks module, DNM). The shallow multi-scale modules obtain local human action features from the initial RGB frame sequence and optical flow frame sequence; the deep network module fuses the local features extracted by the shallow multi-scale modules and generates high-level features.
The network structure of the lightweight SDNet combining shallow and deep networks is as follows:
(a) Convolutional layer C1: kernel size 3 × 3 × 3, stride 1, 32 filters, with the ReLU function as the activation of this layer, where:
ReLU(x) = max(0, x)
(b) Pooling layer S1: a max pooling layer (Max Pooling) resamples the feature maps produced by C1; the pooling kernel size is 2 × 2 × 2 with stride 2;
(c1) Shallow multi-scale module SMSM-1: the module contains three branches, a connection layer, and a pooling layer. As shown in Fig. 2, its specific structure is as follows (a code sketch of the module follows this listing):
(c1-1) Branch 1:
(c1-1-a) Convolutional layer C1: kernel size 1 × 1 × 1, stride 1, 64 filters, ReLU activation;
(c1-1-b) Convolutional layer C2: kernel size 3 × 3 × 3, stride 1, 64 filters, dilation rate 1, ReLU activation. The effective kernel size after dilation is computed as:
dilated kernel size = dilation rate × (kernel size before dilation − 1) + 1
(c1-1-c) Convolutional layer C3: kernel size 3 × 3 × 3, stride 1, 64 filters, dilation rate 1, ReLU activation;
(c1-2) Branch 2:
(c1-2-a) Convolutional layer C1: kernel size 1 × 1 × 1, stride 1, 64 filters, ReLU activation;
(c1-2-b) Convolutional layer C2: kernel size 3 × 3 × 3, stride 1, 64 filters, dilation rate 2, ReLU activation;
(c1-2-c) Convolutional layer C3: kernel size 3 × 3 × 3, stride 1, 64 filters, dilation rate 2, ReLU activation;
(c1-3) Branch 3:
(c1-3-a) Convolutional layer C1: kernel size 1 × 1 × 1, stride 1, 64 filters, ReLU activation;
(c1-3-b) Convolutional layer C2: kernel size 3 × 3 × 3, stride 1, 64 filters, dilation rate 3, ReLU activation;
(c1-3-c) Convolutional layer C3: kernel size 3 × 3 × 3, stride 1, 64 filters, dilation rate 3, ReLU activation;
(c1-4) Connection layer (concatenation): this layer joins the feature maps of the three branches with the concatenation function;
(c1-5) Pooling layer S1: a max pooling layer (Max Pooling) resamples the concatenated feature maps; the pooling kernel size is 2 × 2 × 1 with stride 2.
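A minimal PyTorch sketch of the SMSM structure just listed, assuming a tensor layout of (batch, channels, H, W, T) so that the patent's H × W × T kernel notation carries over directly; the patent publishes no code, so details such as padding are our assumptions:

```python
import torch
import torch.nn as nn

class SMSM(nn.Module):
    """Shallow multi-scale module: three dilated-convolution branches
    (dilation rates 1, 2, 3), concatenation, then a 2x2x1 max pool."""

    def __init__(self, in_channels: int, filters: int):
        super().__init__()

        def branch(dilation: int) -> nn.Sequential:
            return nn.Sequential(
                nn.Conv3d(in_channels, filters, kernel_size=1, stride=1),
                nn.ReLU(inplace=True),
                # padding=dilation keeps the feature-map size unchanged
                nn.Conv3d(filters, filters, kernel_size=3, stride=1,
                          padding=dilation, dilation=dilation),
                nn.ReLU(inplace=True),
                nn.Conv3d(filters, filters, kernel_size=3, stride=1,
                          padding=dilation, dilation=dilation),
                nn.ReLU(inplace=True),
            )

        self.branches = nn.ModuleList([branch(d) for d in (1, 2, 3)])
        self.pool = nn.MaxPool3d(kernel_size=(2, 2, 1), stride=2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # (c1-4) concatenate the three branches, (c1-5) max-pool the result
        return self.pool(torch.cat([b(x) for b in self.branches], dim=1))

smsm1 = SMSM(in_channels=32, filters=64)    # 32 channels arrive from C1/S1
out = smsm1(torch.randn(1, 32, 28, 28, 8))  # dummy (N, C, H, W, T) input
print(out.shape)                            # 3 branches x 64 = 192 channels
```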
(c2) Shallow multi-scale module SMSM-2: the module likewise contains three branches, a connection layer, and a pooling layer; the convolutional layers of each branch have 128 filters, and the rest is identical to SMSM-1;
(c3) Shallow multi-scale module SMSM-3: the module likewise contains three branches, a connection layer, and a pooling layer; the convolutional layers of each branch have 256 filters, and the rest is identical to SMSM-1;
(d) Convolutional layer C2: the layer has two parts. The first part connects the densely linked shallow multi-scale modules (concatenation), merging their feature maps with the concatenation function; the second part is a convolutional layer with kernel size 1 × 1 × 1, stride 1, and 256 filters, containing no nonlinear activation function;
Preferably, as shown in Fig. 1, the three shallow multi-scale modules SMSM of (c1), (c2), and (c3) are densely connected, i.e., (c1) SMSM-1 connects to (c2) SMSM-2 and to (d) convolutional layer C2, and SMSM-2 connects to (c3) SMSM-3 and to (d) convolutional layer C2 (see the wiring sketch below).
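A sketch of this dense wiring, reusing the SMSM class from the previous sketch; the patent does not specify how feature maps of different resolutions are aligned before concatenation, so downsampling the earlier outputs to the size of the SMSM-3 output is our assumption:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DenselyConnectedSMSMs(nn.Module):
    """SMSM-1 -> SMSM-2 -> SMSM-3 chain, with the extra SMSM-1 and SMSM-2
    links into convolutional layer C2 described above."""

    def __init__(self):
        super().__init__()
        self.smsm1 = SMSM(in_channels=32, filters=64)    # -> 192 channels
        self.smsm2 = SMSM(in_channels=192, filters=128)  # -> 384 channels
        self.smsm3 = SMSM(in_channels=384, filters=256)  # -> 768 channels
        # Second part of C2: 1x1x1 conv, stride 1, 256 filters, no activation
        self.c2 = nn.Conv3d(192 + 384 + 768, 256, kernel_size=1, stride=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        f1 = self.smsm1(x)
        f2 = self.smsm2(f1)     # SMSM-1 feeds SMSM-2
        f3 = self.smsm3(f2)     # SMSM-2 feeds SMSM-3
        size = f3.shape[2:]     # align resolutions (our assumption)
        merged = torch.cat([F.adaptive_max_pool3d(f1, size),
                            F.adaptive_max_pool3d(f2, size), f3], dim=1)
        return self.c2(merged)  # first part of C2: the concatenation
```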
(e) deep layer network module DNM.The network structure of the module are as follows:
(e-a) convolutional layer C1: convolution kernel size is step-length 1, filter blocks 512, and the convolution coefficient of expansion is 2, with Activation primitive of the ReLU function as this layer;
(e-b) MLP convolutional layer C2: convolution kernel size is step-length 1, filter blocks 512, and the convolution coefficient of expansion is 2, using ReLU function as the activation primitive of this layer;
(e-c) convolutional layer C3: convolution kernel size is step-length 1, filter blocks 512, and the convolution coefficient of expansion is 2, with Activation primitive of the ReLU function as this layer;
(e-d) it pond layer S1: is adopted again using the characteristic pattern that maximum pond layer (Max Pooling) obtains C3 layers The core size of sample, pond is step-length 2;
(e-e) convolutional layer C4: convolution kernel size is step-length 1, filter blocks 512, and the convolution coefficient of expansion is 1, with Activation primitive of the ReLU function as this layer;
(e-f) MLP convolutional layer C5: convolution kernel size is step-length 1, filter blocks 512, and the convolution coefficient of expansion is 1, using ReLU function as the activation primitive of this layer;
(e-g) convolutional layer C6: convolution kernel size is step-length 1, filter blocks 512, and the convolution coefficient of expansion is 1, with Activation primitive of the ReLU function as this layer;
(e-h) pond layer S2: using global average pond layer (Global Average Pooling) to the characteristic pattern of C6 Global average pondization operation is carried out, full articulamentum is replaced with the layer, to reduce parameter amount.
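A minimal PyTorch sketch of the DNM listing, under the same layout assumption as above; the padding values are our choice so that the 3 × 3 × 3 convolutions preserve feature-map size, and dilation is omitted on the 1 × 1 × 1 MLP convolutions because it has no effect on a unit kernel:

```python
import torch
import torch.nn as nn

def make_dnm(in_channels: int = 256) -> nn.Sequential:
    def conv(cin, cout, k, d):
        pad = d * (k - 1) // 2  # "same" padding for the dilated kernel
        return nn.Conv3d(cin, cout, kernel_size=k, stride=1,
                         padding=pad, dilation=d)

    return nn.Sequential(
        conv(in_channels, 512, 3, 2), nn.ReLU(inplace=True),  # (e-a) C1
        conv(512, 512, 1, 1), nn.ReLU(inplace=True),          # (e-b) C2, MLP
        conv(512, 512, 3, 2), nn.ReLU(inplace=True),          # (e-c) C3
        nn.MaxPool3d(kernel_size=2, stride=2),                # (e-d) S1
        conv(512, 512, 3, 1), nn.ReLU(inplace=True),          # (e-e) C4
        conv(512, 512, 1, 1), nn.ReLU(inplace=True),          # (e-f) C5, MLP
        conv(512, 512, 3, 1), nn.ReLU(inplace=True),          # (e-g) C6
        nn.AdaptiveAvgPool3d(1),                              # (e-h) S2, global avg
    )

dnm = make_dnm()
print(dnm(torch.randn(1, 256, 8, 8, 4)).shape)  # (1, 512, 1, 1, 1)
```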
Step 3: use the SDNet network constructed in Step 2 to build the lightweight deep-learning-based human action recognition model. The model is a two-stream structure, i.e., it contains a temporal stream and a spatial stream. As shown in Fig. 3, its specific structure is as follows:
(a) Input layer: the temporal-stream input is the optical flow data of the video sequence, with frame size 224 × 224; the spatial-stream input is the RGB data of the video sequence, likewise with frame size 224 × 224;
(b) SDNet: both the temporal-stream and spatial-stream networks consist of SDNet, which extracts the spatio-temporal features of the video sequence;
(c) Pooling layer S1: a temporal pyramid pooling layer (TPP) aggregates the frame-level features of the temporal and spatial streams into video-level representations. The temporal pyramid levels are set to {4 × 4 × 1, 2 × 2 × 1, 1 × 1 × 1}, i.e., a 3-level pyramid;
(d) Fully connected layer FC: the layer contains 1024 filters, i.e., 1024 neurons connected to S1, with ReLU as the activation function;
(e) Softmax layer: a Softmax classifier converts the feature values produced by the FC layer into relative probabilities of the classes and yields the class score. The Softmax function is defined as:
p_i = exp(V_i) / Σ_{j=1}^{C} exp(V_j)
where V_i is the output of the i-th output unit of the stage preceding the classifier Softmax, i.e., of the FC layer, i is the class index, and C is the total number of classes; p_i is the ratio of the exponential of the current element to the sum of the exponentials of all elements, and max(p_i) is the class score;
(f) Fusion layer: this layer fuses the class scores of the temporal and spatial streams with a decision-level fusion rule to obtain the action classification result; the fusion confidences of the two streams are set to 1 : 1 (a sketch of this pooling, classification, and fusion head follows).
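A minimal sketch of the recognition head described in (c)-(f), assuming PyTorch; reading the pyramid levels as adaptive-pooling target sizes is our interpretation, and the random weights and class count (101, as in UCF101) are illustrative only:

```python
import torch
import torch.nn.functional as F

def temporal_pyramid_pool(feats: torch.Tensor) -> torch.Tensor:
    """(c) Aggregate frame-level features (N, C, H, W, T) into a video-level vector."""
    levels = [(4, 4, 1), (2, 2, 1), (1, 1, 1)]
    pooled = [F.adaptive_max_pool3d(feats, size).flatten(1) for size in levels]
    return torch.cat(pooled, dim=1)  # (N, C * (16 + 4 + 1))

def stream_scores(feats, fc_w, fc_b, cls_w, cls_b):
    x = F.relu(F.linear(temporal_pyramid_pool(feats), fc_w, fc_b))  # (d) FC
    return F.softmax(F.linear(x, cls_w, cls_b), dim=1)              # (e) p_i

def fuse(p_spatial, p_temporal):
    """(f) Decision-level fusion with confidences 1:1, i.e. an equal-weight average."""
    return 0.5 * p_spatial + 0.5 * p_temporal

feats = torch.randn(2, 512, 4, 4, 3)  # dummy SDNet features for one stream
fc_w, fc_b = torch.randn(1024, 512 * 21), torch.randn(1024)
cls_w, cls_b = torch.randn(101, 1024), torch.randn(101)
p = fuse(stream_scores(feats, fc_w, fc_b, cls_w, cls_b),
         stream_scores(feats, fc_w, fc_b, cls_w, cls_b))
print(p.sum(dim=1))  # each row sums to 1, as a probability vector should
```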
Step 4: use the lightweight deep-learning-based human action recognition model constructed in Step 3 to process the RGB data and optical flow data of the video sequence and obtain the human action classification result.
The model of the invention was first pre-trained and fine-tuned on the ImageNet dataset, and the lightweight deep-learning-based human action recognition model method was then applied to the action recognition datasets UCF101 and HMDB51, finally achieving action recognition accuracies of 94.0% on UCF101 and 69.4% on HMDB51, with a model parameter count of only 19M. It follows that the lightweight deep-learning-based human action recognition model proposed herein not only recognizes human actions in video effectively but also greatly reduces the parameter count compared with recent human action recognition models, saving computation cost.
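For reference, a parameter count such as the 19M figure above is typically measured as below; the tiny stand-in module is illustrative, not a reproduction of the patented model:

```python
import torch.nn as nn

def count_parameters(model: nn.Module) -> int:
    # Sum the element counts of all trainable tensors in the model
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

toy = nn.Sequential(nn.Conv3d(3, 32, kernel_size=3), nn.ReLU(),
                    nn.Conv3d(32, 64, kernel_size=3))
print(f"{count_parameters(toy) / 1e6:.3f}M parameters")  # ~0.058M for this toy
```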
The specific embodiments described above further explain the objectives, technical solutions, and beneficial effects of the present invention in detail. It should be understood that the foregoing is merely a specific embodiment of the invention and is not intended to limit its scope of protection; any modification, equivalent substitution, or improvement made within the spirit and principles of the invention shall fall within its scope of protection.

Claims (9)

1. A lightweight human action recognition method based on deep learning, characterized by comprising the following steps:
Step 1: processing the video data containing human actions to obtain an RGB frame sequence and an optical flow frame sequence;
Step 2: constructing a lightweight deep learning network combining shallow and deep networks (A lightweight deep learning network model combining shallow and deep networks, SDNet), the network containing shallow multi-scale modules and a deep network module;
Step 3: using the SDNet network constructed in Step 2 to build the lightweight deep-learning-based human action recognition model, the model being a two-stream structure, i.e., containing a temporal stream and a spatial stream;
Step 4: using the model constructed in Step 3 to process the RGB data and optical flow data of a video sequence and obtain the human action classification result.
2. The method as claimed in claim 1, characterized in that in Step 2 a lightweight deep learning network combining shallow and deep networks (A lightweight deep learning network model combining shallow and deep networks, SDNet) is constructed, the network containing shallow multi-scale modules (Shallow multi-scale module, SMSM) and a deep network module (Deep networks module, DNM):
the shallow multi-scale modules obtain local human action features from the initial RGB frame sequence and optical flow frame sequence;
the deep network module fuses the local human action features extracted by the shallow multi-scale modules and generates high-level features.
3. The method as claimed in claim 1, characterized in that in Step 2 the network structure of the lightweight deep learning network (SDNet) combining shallow and deep networks is designed as follows:
(a) convolutional layer C1: kernel size 3 × 3 × 3, stride 1, 32 filters, with the ReLU function as the activation of this layer, the ReLU function being:
ReLU(x) = max(0, x)
(b) pooling layer S1: a max pooling layer (Max Pooling) resamples the feature maps produced by C1; the pooling kernel size is 2 × 2 × 2 with stride 2;
(c1) shallow multi-scale module SMSM-1: the module contains three branches, each consisting of three convolutional layers with different kernels; every convolutional layer in the module has 64 filters;
(c2) shallow multi-scale module SMSM-2: the module contains three branches, each consisting of three convolutional layers with different kernels; every convolutional layer in the module has 128 filters;
(c3) shallow multi-scale module SMSM-3: the module contains three branches, each consisting of three convolutional layers with different kernels; every convolutional layer in the module has 256 filters;
(d) convolutional layer C2: the layer has two parts: the first part connects the densely linked shallow multi-scale modules (concatenation), merging their feature maps with the concatenation function; the second part is a convolutional layer with kernel size 1 × 1 × 1, stride 1, and 256 filters, containing no nonlinear activation function;
(e) deep network module DNM.
4. The method as claimed in claim 3, characterized in that the network structure of the shallow multi-scale module SMSM-1 of layer (c1) is designed as:
1) Branch 1:
(a) convolutional layer C1: kernel size 1 × 1 × 1, stride 1, 64 filters, with the ReLU function as the activation of this layer;
(b) convolutional layer C2: kernel size 3 × 3 × 3, stride 1, 64 filters, dilation rate 1, with the ReLU function as the activation of this layer, the effective kernel size after dilation being computed as:
dilated kernel size = dilation rate × (kernel size before dilation − 1) + 1
(c) convolutional layer C3: kernel size 3 × 3 × 3, stride 1, 64 filters, dilation rate 1, with the ReLU function as the activation of this layer;
2) Branch 2:
(a) convolutional layer C1: kernel size 1 × 1 × 1, stride 1, 64 filters, ReLU activation;
(b) convolutional layer C2: kernel size 3 × 3 × 3, stride 1, 64 filters, dilation rate 2, ReLU activation;
(c) convolutional layer C3: kernel size 3 × 3 × 3, stride 1, 64 filters, dilation rate 2, ReLU activation;
3) Branch 3:
(a) convolutional layer C1: kernel size 1 × 1 × 1, stride 1, 64 filters, ReLU activation;
(b) convolutional layer C2: kernel size 3 × 3 × 3, stride 1, 64 filters, dilation rate 3, ReLU activation;
(c) convolutional layer C3: kernel size 3 × 3 × 3, stride 1, 64 filters, dilation rate 3, ReLU activation;
4) connection layer (concatenation): this layer joins the feature maps of the three branches with the concatenation function;
5) pooling layer S1: a max pooling layer (Max Pooling) resamples the concatenated feature maps; the pooling kernel size is 2 × 2 × 1 with stride 2.
5. The method as claimed in claim 3, characterized in that the network structure of the shallow multi-scale module SMSM-2 of layer (c2) is identical to that of claim 4 except that each convolutional layer has 128 filters.
6. The method as claimed in claim 3, characterized in that the network structure of the shallow multi-scale module SMSM-3 of layer (c3) is identical to that of claim 4 except that each convolutional layer has 256 filters.
7. The method as claimed in claim 3, characterized in that the three shallow multi-scale modules SMSM of layers (c1), (c2), and (c3) are densely connected, i.e., (c1) SMSM-1 connects to (c2) SMSM-2 and to (d) convolutional layer C2, and SMSM-2 connects to (c3) SMSM-3 and to (d) convolutional layer C2.
8. The method as claimed in claim 3, characterized in that the network structure of the deep network module DNM of layer (e) is designed as:
(a) convolutional layer C1: kernel size 3 × 3 × 3, stride 1, 512 filters, dilation rate 2, with the ReLU function as the activation of this layer;
(b) MLP convolutional layer C2: kernel size 1 × 1 × 1, stride 1, 512 filters, dilation rate 2, with the ReLU function as the activation of this layer;
(c) convolutional layer C3: kernel size 3 × 3 × 3, stride 1, 512 filters, dilation rate 2, with the ReLU function as the activation of this layer;
(d) pooling layer S1: a max pooling layer (Max Pooling) resamples the feature maps produced by C3; the pooling kernel size is 2 × 2 × 2 with stride 2;
(e) convolutional layer C4: kernel size 3 × 3 × 3, stride 1, 512 filters, dilation rate 1, with the ReLU function as the activation of this layer;
(f) MLP convolutional layer C5: kernel size 1 × 1 × 1, stride 1, 512 filters, dilation rate 1, with the ReLU function as the activation of this layer;
(g) convolutional layer C6: kernel size 3 × 3 × 3, stride 1, 512 filters, dilation rate 1, with the ReLU function as the activation of this layer;
(h) pooling layer S2: a global average pooling layer (Global Average Pooling) is applied to the feature maps of C6; this layer replaces a fully connected layer and thereby reduces the parameter count.
9. The method as claimed in claim 1, characterized in that in Step 3 the lightweight deep-learning-based human action recognition model is constructed as a two-stream structure containing a temporal stream and a spatial stream, with the following specific structure:
(a) input layer: the temporal-stream input is the optical flow data of the video sequence, with frame size 224 × 224; the spatial-stream input is the RGB data of the video sequence, likewise with frame size 224 × 224;
(b) SDNet: both the temporal-stream and spatial-stream networks consist of SDNet, which extracts the spatio-temporal features of the video sequence;
(c) pooling layer S1: a temporal pyramid pooling layer (TPP) aggregates the frame-level features of the temporal and spatial streams into video-level representations; the temporal pyramid levels are set to {4 × 4 × 1, 2 × 2 × 1, 1 × 1 × 1}, i.e., a 3-level pyramid;
(d) fully connected layer FC: the layer contains 1024 filters, i.e., 1024 neurons connected to S1, with ReLU as the activation function;
(e) softmax layer: a Softmax classifier converts the feature values produced by the FC layer into relative probabilities of the classes and yields the class score, the Softmax function being defined as:
p_i = exp(V_i) / Σ_{j=1}^{C} exp(V_j)
where V_i is the output of the i-th output unit of the stage preceding the classifier Softmax, i.e., of the FC layer, i is the class index, C is the total number of classes, p_i is the ratio of the exponential of the current element to the sum of the exponentials of all elements, and max(p_i) is the class score;
(f) fusion layer: this layer fuses the class scores of the temporal and spatial streams with a decision-level fusion rule to obtain the action classification result; the fusion confidences of the temporal and spatial streams are set to 1 : 1.
CN201910269644.3A 2019-04-04 2019-04-04 Lightweight human action recognition method based on deep learning Pending CN109977904A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910269644.3A CN109977904A (en) 2019-04-04 2019-04-04 Lightweight human action recognition method based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910269644.3A CN109977904A (en) 2019-04-04 2019-04-04 Lightweight human action recognition method based on deep learning

Publications (1)

Publication Number Publication Date
CN109977904A true CN109977904A (en) 2019-07-05

Family

ID=67082966

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910269644.3A Pending Lightweight human action recognition method based on deep learning 2019-04-04 2019-04-04

Country Status (1)

Country Link
CN (1) CN109977904A (en)



Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106845351A (en) * 2016-05-13 2017-06-13 苏州大学 It is a kind of for Activity recognition method of the video based on two-way length mnemon in short-term
CN106845329A (en) * 2016-11-11 2017-06-13 广东顺德中山大学卡内基梅隆大学国际联合研究院 A kind of action identification method based on depth convolution feature multichannel pyramid pond
CN107240066A (en) * 2017-04-28 2017-10-10 天津大学 Image super-resolution rebuilding algorithm based on shallow-layer and deep layer convolutional neural networks
CN107862376A (en) * 2017-10-30 2018-03-30 中山大学 A kind of human body image action identification method based on double-current neutral net
CN108875674A (en) * 2018-06-29 2018-11-23 东南大学 A kind of driving behavior recognition methods based on multiple row fusion convolutional neural networks

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
FISHER YU et al.: "MULTI-SCALE CONTEXT AGGREGATION BY DILATED CONVOLUTIONS", 《ICLR 2016》 *
YIFAN WANG et al.: "End-to-End Image Super-Resolution via Deep and Shallow Convolutional Networks", 《DIGITAL OBJECT IDENTIFIER》 *
杨天明 et al.: "基于视频深度学习的时空双流人物动作识别模型" [Spatio-temporal two-stream human action recognition model based on video deep learning], 《计算机应用》 (Journal of Computer Applications) *

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110458038A (en) * 2019-07-19 2019-11-15 天津理工大学 The cross-domain action identification method of small data based on double-strand depth binary-flow network
CN112308885A (en) * 2019-07-29 2021-02-02 顺丰科技有限公司 Violent throwing detection method, device, equipment and storage medium based on optical flow
CN111368666A (en) * 2020-02-25 2020-07-03 上海蠡图信息科技有限公司 Living body detection method based on novel pooling and attention mechanism double-current network
CN111368666B (en) * 2020-02-25 2023-08-18 上海蠡图信息科技有限公司 Living body detection method based on novel pooling and attention mechanism double-flow network
CN111666852A (en) * 2020-05-28 2020-09-15 天津大学 Micro-expression double-flow network identification method based on convolutional neural network
CN113836969A (en) * 2020-06-23 2021-12-24 山西农业大学 Abnormal event detection method based on double flows
CN111738357A (en) * 2020-07-24 2020-10-02 完美世界(北京)软件科技发展有限公司 Junk picture identification method, device and equipment
CN111738357B (en) * 2020-07-24 2020-11-20 完美世界(北京)软件科技发展有限公司 Junk picture identification method, device and equipment
CN112244863A (en) * 2020-10-23 2021-01-22 京东方科技集团股份有限公司 Signal identification method, signal identification device, electronic device and readable storage medium
CN112686329A (en) * 2021-01-06 2021-04-20 西安邮电大学 Electronic laryngoscope image classification method based on dual-core convolution feature extraction
CN112749684A (en) * 2021-01-27 2021-05-04 萱闱(北京)生物科技有限公司 Cardiopulmonary resuscitation training and evaluating method, device, equipment and storage medium
CN114037930A (en) * 2021-10-18 2022-02-11 苏州大学 Video action recognition method based on space-time enhanced network

Similar Documents

Publication Publication Date Title
CN109977904A (en) Lightweight human action recognition method based on deep learning
Yang et al. Visual perception enabled industry intelligence: state of the art, challenges and prospects
Yin et al. Recurrent convolutional network for video-based smoke detection
CN107862300A (en) A kind of descending humanized recognition methods of monitoring scene based on convolutional neural networks
CN102332095B (en) Face motion tracking method, face motion tracking system and method for enhancing reality
CN110427813A (en) Pedestrian's recognition methods again based on the twin production confrontation network that posture instructs pedestrian image to generate
Ming et al. Simple triplet loss based on intra/inter-class metric learning for face verification
CN110210539A (en) The RGB-T saliency object detection method of multistage depth characteristic fusion
CN110232361B (en) Human behavior intention identification method and system based on three-dimensional residual dense network
CN116343330A (en) Abnormal behavior identification method for infrared-visible light image fusion
CN110472634A (en) Change detecting method based on multiple dimensioned depth characteristic difference converged network
CN110022422A (en) A kind of sequence of frames of video generation method based on intensive connection network
Singh et al. A deep learning based technique for anomaly detection in surveillance videos
CN107194380A (en) The depth convolutional network and learning method of a kind of complex scene human face identification
Le et al. Cross-resolution feature fusion for fast hand detection in intelligent homecare systems
CN109583334A (en) A kind of action identification method and its system based on space time correlation neural network
CN115063836A (en) Pedestrian tracking and re-identification method based on deep learning
Sun et al. YOLO-P: An efficient method for pear fast detection in complex orchard picking environment
CN113705384B (en) Facial expression recognition method considering local space-time characteristics and global timing clues
CN114333002A (en) Micro-expression recognition method based on deep learning of image and three-dimensional reconstruction of human face
CN110223240A (en) Image defogging method, system and storage medium based on color decaying priori
CN109583406B (en) Facial expression recognition method based on feature attention mechanism
CN116403286A (en) Social grouping method for large-scene video
CN116342877A (en) Semantic segmentation method based on improved ASPP and fusion module in complex scene
CN105224952A (en) Based on the double interbehavior recognition methods of largest interval markov pessimistic concurrency control

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20190705