CN109977904A - A kind of human motion recognition method of the light-type based on deep learning - Google Patents
- Publication number: CN109977904A; application number: CN201910269644.3A
- Authority
- CN
- China
- Prior art keywords
- layer
- network
- module
- shallow
- convolutional layer
- Prior art date
- Legal status: Pending (the legal status is an assumption, not a legal conclusion)
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/46—Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/20—Movements or behaviour, e.g. gesture recognition
Abstract
The invention discloses a lightweight human action recognition method based on deep learning. The method first constructs a lightweight deep learning network (SDNet) combining shallow and deep networks; the network comprises a shallow multi-scale module and a deep network module, and a lightweight deep-learning-based human action recognition model is built on it. In the model, SDNet first performs feature extraction and representation on the two spatio-temporal streams; a temporal pyramid pooling layer then aggregates the frame-level features of the temporal stream and the spatial stream into video-level representations; fully connected and softmax layers produce each stream's recognition result for the input sequence; finally, the two streams' results are combined by weighted-average fusion to obtain the final recognition result. With this lightweight deep-learning-based method, the number of model parameters can be greatly reduced without loss of recognition accuracy.
Description
Technical field
The present invention relates to the field of graphics and image processing, and in particular to a supervised, lightweight human action recognition model and method based on deep learning.
Background art
The central problem of human action recognition is how to analyze and process video sequences captured by cameras or sensors so that a computer can "understand" the actions and behavior of people in video. This is of great research significance for security monitoring, entertainment, and other areas, and video-based human action recognition is also widely used in fields such as human-computer interaction, virtual reality, and smart home devices. For many artificial intelligence systems, human action recognition or human behavior understanding is essential. For example, a video surveillance system may contain hundreds of hours of footage; traversing it manually is not only tedious but also very inefficient. With human action recognition technology, the actions of people in surveillance video can be recognized and understood, so that malicious and abnormal behavior can be detected automatically and effectively.
Video-based human action recognition is itself a highly challenging task, for two main reasons: video environment factors on the one hand, and the complexity of the action classes themselves on the other. Changes in illumination, camera shake, and changes in viewpoint all belong to the video environment factors. The scenes in which actions occur are endlessly varied; even in a relatively fixed background, changes in indoor lighting or partial occlusion of the actor will affect the recognition task. As for the complexity of the action classes themselves, the main issues are inter-class similarity and intra-class diversity. For example, the three different classes "jogging", "strolling", and "running" differ mainly in movement speed, so the differences between classes are small; conversely, the same action viewed from different angles can look quite different, so the differences within a class can be large.
Since the deep learning network model LeNet was proposed and achieved considerable success in handwritten digit recognition, scholars at home and abroad have proposed a succession of deep learning network models and applied them to human action recognition, such as AlexNet, VggNet, GoogLeNet, ResNet, and DenseNet. AlexNet and VggNet improve network performance by deepening the network, while GoogLeNet and ResNet do so by increasing the width or depth of the network model; these networks have achieved considerable results in fields such as image recognition and classification and human action recognition. Experiments with networks of only three weight layers have shown that shallow learning networks are limited both in expressing complex functions and in model generalization; however, continually stacking layers and expanding width also brings problems: the number of parameters becomes enormous, the computational complexity of the network grows, and the deeper the network, the more easily gradients vanish. Moreover, blindly increasing network depth can cause accuracy to saturate or even decline, bringing about the degradation problem of network models.
The main difference between video-based human action recognition and still-image recognition is that a video sequence contains not only the appearance information of the images but also motion information along the time axis, whereas single-image recognition need not consider temporal information. Because two-dimensional convolutional neural network models cannot effectively exploit the motion information in video sequences, three-dimensional convolutional neural networks and two-stream convolutional neural networks have gradually been adopted for recognizing human actions in video. These models account, to some extent, for the motion information inherent in video sequences, but in the pursuit of higher recognition accuracy their network structures still grow deeper and deeper.
In conclusion present inventor during realizing the present application technical solution, has found currently based on depth
The human action identification model or method for spending study at least have the following technical problems:
One, the recognition performance of network model is improved by deepening network depth and widen network-wide, increase considerably
The calculation amount of network model, and since parameter amount is excessive, it is easy to appear the problem of gradient disappears, discrimination does not rise anti-drop.
Two, current light-type deep learning network model reduces although having compressed scale of model to a certain extent
Parameter amount, but in the human action identification problem based on video, it is difficult to reply is effectively extracted comprising complicated incidence relation
Space-time characteristic problem.
Summary of the invention
To solve problems such as the large parameter counts and excessive depth and weight of existing deep-learning-based human action recognition models, the present invention provides a lightweight human action recognition method based on deep learning. The method contains a lightweight deep learning network that combines shallow and deep networks: the shallow multi-scale module of the network describes the local features of the video sequence at different scales, and the deep network module effectively fuses and characterizes the extracted multi-scale features. These together form a lightweight deep-learning-based human action recognition model that effectively reduces the number of model parameters without losing accuracy.
On the one hand, the present invention is achieved through the following technical solutions:
A lightweight human action recognition method based on deep learning, with the following specific steps:
Step 1: process the video data containing human actions to obtain an RGB frame sequence and an optical-flow frame sequence;
Step 2: construct a lightweight deep learning network combining shallow and deep networks (A lightweight deep learning network model combining shallow and deep networks, SDNet); the network includes a shallow multi-scale module and a deep network module;
Step 3: use the SDNet constructed in Step 2 to build a lightweight deep-learning-based human action recognition model; the model has a two-stream structure, i.e., it includes a temporal stream and a spatial stream;
Step 4: use the model constructed in Step 3 to process the RGB data and optical-flow data of the video sequence and obtain the human action classification result.
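The data flow of the four steps above can be sketched as follows. This is only an illustrative skeleton: the SDNet feature extractor is replaced by a placeholder, frame differencing stands in for true optical flow, and the 512-dimensional feature size is an assumption, none of which comes from the patent itself.

```python
import numpy as np

rng = np.random.default_rng(0)

# Step 1: a video clip as a stack of RGB frames (T, H, W, 3); random values here.
frames = rng.random((8, 224, 224, 3)).astype(np.float32)

# Crude stand-in for the optical-flow stream: frame differences.
# (The method uses true optical flow; this only makes the data flow concrete.)
flow = frames[1:] - frames[:-1]

# Steps 2-3: SDNet is represented by a placeholder that maps each frame
# to a fixed-length feature vector (hypothetical dimension 512).
def sdnet_features(clip, dim=512):
    t = clip.shape[0]
    return clip.reshape(t, -1)[:, :dim]  # placeholder, not a real network

spatial_frame_feats = sdnet_features(frames)   # (8, 512) frame-level features
temporal_frame_feats = sdnet_features(flow)    # (7, 512)

# Step 4 (simplified): aggregate frame-level features to video level,
# then fuse the two streams with equal (1:1) weights.
spatial_video = spatial_frame_feats.mean(axis=0)
temporal_video = temporal_frame_feats.mean(axis=0)
fused = 0.5 * spatial_video + 0.5 * temporal_video
```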
On the other hand, the invention proposes a lightweight deep learning network combining shallow and deep networks (A lightweight deep learning network model combining shallow and deep networks, SDNet). The network includes a shallow multi-scale module (Shallow multi-scale module, SMSM) and a deep network module (Deep networks module, DNM):
the shallow multi-scale module obtains local human-action features from the initial RGB frame sequence and optical-flow frame sequence;
the deep network module fuses the local human-action features extracted by the shallow multi-scale module and generates high-level features.
Further, the network structure of the lightweight deep learning network combining shallow and deep networks is designed as follows:
(a) Convolutional layer C1: the convolution kernel size is …, the stride is 1, and the number of filters is 32; ReLU is used as the activation function of this layer. The ReLU function is:
ReLU(x) = max(0, x)
(b) Pooling layer S1: the feature map produced by C1 is resampled by a max pooling layer (Max Pooling); the pooling kernel size is …, the stride is 2;
(c1) Shallow multi-scale module SMSM-1: the module contains three branches, each consisting of three convolutional layers with different kernels; all convolutional layers in the module have 64 filters;
(c2) Shallow multi-scale module SMSM-2: the module contains three branches, each consisting of three convolutional layers with different kernels; all convolutional layers in the module have 128 filters;
(c3) Shallow multi-scale module SMSM-3: the module contains three branches, each consisting of three convolutional layers with different kernels; all convolutional layers in the module have 256 filters;
(d) Convolutional layer C2: this layer has two parts. The first part connects the densely connected shallow multi-scale modules (concatenation), merging their feature maps through a concatenation operation. The second part is a convolutional layer with kernel size …, stride 1, and 256 filters; this convolution has no nonlinear activation function;
(e) Deep network module DNM.
Further, in the network structure of the lightweight deep learning network combining shallow and deep networks, the shallow multi-scale module SMSM-1 of layer (c1) is designed as follows:
1) Branch 1:
(a) Convolutional layer C1: kernel size …, stride 1, 64 filters, with ReLU as the activation function of this layer;
(b) Convolutional layer C2: kernel size …, stride 1, 64 filters, dilation coefficient 1, with ReLU as the activation function of this layer. The kernel size after dilation is computed as:
dilated kernel size = dilation coefficient × (original kernel size − 1) + 1
(c) Convolutional layer C3: kernel size …, stride 1, 64 filters, dilation coefficient 1, with ReLU as the activation function of this layer;
2) Branch 2:
(a) Convolutional layer C1: kernel size …, stride 1, 64 filters, with ReLU as the activation function of this layer;
(b) Convolutional layer C2: kernel size …, stride 1, 64 filters, dilation coefficient 2, with ReLU as the activation function of this layer;
(c) Convolutional layer C3: kernel size …, stride 1, 64 filters, dilation coefficient 2, with ReLU as the activation function of this layer;
3) Branch 3:
(a) Convolutional layer C1: kernel size …, stride 1, 64 filters, with ReLU as the activation function of this layer;
(b) Convolutional layer C2: kernel size …, stride 1, 64 filters, dilation coefficient 3, with ReLU as the activation function of this layer;
(c) Convolutional layer C3: kernel size …, stride 1, 64 filters, dilation coefficient 3, with ReLU as the activation function of this layer;
4) Concatenation layer: this layer connects the feature maps of the three branches through a concatenation operation;
5) Pooling layer S1: the feature map produced by C1 is resampled by a max pooling layer (Max Pooling); the pooling kernel size is …, the stride is 2.
Further, in the network structure of the lightweight deep learning network combining shallow and deep networks, the shallow multi-scale module SMSM-2 of layer (c2) is identical in structure to SMSM-1 except that the number of filters in each convolutional layer is 128.
Further, in the same network structure, the shallow multi-scale module SMSM-3 of layer (c3) is identical in structure to SMSM-1 except that the number of filters in each convolutional layer is 256.
Preferably, in this network structure, the three shallow multi-scale modules (c1), (c2), and (c3) are densely connected: (c1) SMSM-1 connects to (c2) SMSM-2 and to convolutional layer C2 of (d), and SMSM-2 connects to (c3) SMSM-3 and to convolutional layer C2 of (d).
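As a sanity check on the SMSM structure described above, the following sketch tracks the dilated kernel sizes and output channel counts through one module in plain Python. The base kernel size of 3 × 3 and 'same' padding are assumptions (the source does not state the kernel sizes); only the dilation coefficients and filter counts come from the text.

```python
def dilated_kernel_size(k, d):
    """Kernel size after dilation: d * (k - 1) + 1, per the formula above."""
    return d * (k - 1) + 1

def smsm_summary(filters, base_k=3):
    """Per-branch effective kernel sizes and output channels of an SMSM module.

    Each of the three branches has a plain conv C1 followed by two dilated
    convs C2 and C3; branch i uses dilation coefficient i. The three branch
    outputs are concatenated, then max-pooled with stride 2.
    """
    branches = {}
    for dilation in (1, 2, 3):
        branches[dilation] = [
            ("C1", base_k),
            ("C2", dilated_kernel_size(base_k, dilation)),
            ("C3", dilated_kernel_size(base_k, dilation)),
        ]
    concat_channels = 3 * filters  # concatenation of the three branches
    return branches, concat_channels

branches, out_ch = smsm_summary(filters=64)  # SMSM-1: 64 filters per layer
# Branch 3's dilated 3x3 convs each cover a 7x7 window; concatenating the
# three 64-channel branches yields 192 output channels.
```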
Further, in the network structure of the lightweight deep learning network combining shallow and deep networks, the deep network module DNM of layer (e) is designed as follows:
(a) Convolutional layer C1: kernel size …, stride 1, 512 filters, dilation coefficient 2, with ReLU as the activation function of this layer;
(b) MLP convolutional layer C2: kernel size …, stride 1, 512 filters, dilation coefficient 2, with ReLU as the activation function of this layer;
(c) Convolutional layer C3: kernel size …, stride 1, 512 filters, dilation coefficient 2, with ReLU as the activation function of this layer;
(d) Pooling layer S1: the feature map produced by C3 is resampled by a max pooling layer (Max Pooling); the pooling kernel size is …, the stride is 2;
(e) Convolutional layer C4: kernel size …, stride 1, 512 filters, dilation coefficient 1, with ReLU as the activation function of this layer;
(f) MLP convolutional layer C5: kernel size …, stride 1, 512 filters, dilation coefficient 1, with ReLU as the activation function of this layer;
(g) Convolutional layer C6: kernel size …, stride 1, 512 filters, dilation coefficient 1, with ReLU as the activation function of this layer;
(h) Pooling layer S2: a global average pooling layer (Global Average Pooling) is applied to the feature map of C6; this layer replaces the fully connected layer, thereby reducing the number of parameters.
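Step (h) replaces a fully connected layer with global average pooling. A minimal numpy illustration follows; the 7 × 7 spatial size of the C6 feature maps and the 1024-unit FC comparison are assumptions for the example (only the 512 channels come from the text).

```python
import numpy as np

# A hypothetical C6 output: 512 feature maps of (assumed) spatial size 7 x 7.
feature_maps = np.random.default_rng(1).random((512, 7, 7))

# Global average pooling: one scalar per channel, no learned weights at all.
gap = feature_maps.mean(axis=(1, 2))  # shape (512,)

# By contrast, flattening and feeding a hypothetical 1024-unit FC layer would
# require a weight matrix with (512 * 7 * 7) x 1024 entries (plus biases):
fc_weight_count = 512 * 7 * 7 * 1024  # ~25.7 million weights saved by GAP
```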
In another aspect, the present invention constructs a lightweight deep-learning-based human action recognition model. The model has a two-stream structure comprising a temporal stream and a spatial stream, with the following specific structure:
(a) Input layer: the temporal-stream input is the optical-flow data of the video sequence, with a frame size of 224 × 224; the spatial-stream input is the RGB data of the video sequence, likewise with a frame size of 224 × 224;
(b) SDNet: the temporal-stream and spatial-stream networks are both composed of SDNet, which extracts the spatio-temporal features of the video sequence;
(c) Pooling layer S1: a temporal pyramid pooling layer (TPP) aggregates the frame-level features of the temporal-stream and spatial-stream networks into video-level representations. The temporal pyramid pooling levels are set to {4 × 4 × 1, 2 × 2 × 1, 1 × 1 × 1}, i.e., the temporal pyramid uses a three-level pyramid form;
(d) Fully connected layer FC: this layer has 1024 neurons connected to S1, with ReLU as the activation function;
(e) Softmax layer: the softmax classifier computes the relative probability of each class from the feature values produced by the FC layer and obtains the class score. The softmax function is defined as:
p_i = e^{V_i} / Σ_{j=1}^{C} e^{V_j}
where V_i is the output of the i-th output unit of the FC layer, the stage preceding the softmax classifier, i is the class index, and C is the total number of classes; p_i is the ratio of the exponential of the current element to the sum of the exponentials of all elements, and max(p_i) gives the class score;
(f) Fusion layer: this layer fuses the class scores of the temporal stream and the spatial stream by a decision-level fusion rule to obtain the action classification result. The fusion weights of the temporal and spatial streams are set to 1 : 1.
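The softmax definition and the 1 : 1 decision-level fusion of (e) and (f) can be made concrete as follows; the FC score values and the three-class setting are invented for illustration.

```python
import numpy as np

def softmax(v):
    """p_i = e^{V_i} / sum_j e^{V_j}, shifted by max(v) for numerical stability."""
    e = np.exp(v - v.max())
    return e / e.sum()

# Hypothetical FC-layer outputs V for C = 3 classes, one vector per stream.
temporal_scores = softmax(np.array([2.0, 1.0, 0.5]))
spatial_scores = softmax(np.array([1.5, 2.5, 0.2]))

# Decision-level fusion with equal 1:1 weights (a weighted average).
fused = 0.5 * temporal_scores + 0.5 * spatial_scores
predicted_class = int(np.argmax(fused))  # index achieving max(p_i)
```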
Compared with the prior art, the present invention has the following advantages and beneficial effects:
1. The lightweight deep learning network combining shallow and deep networks extracts the spatio-temporal features of video sequences; the densely connected shallow multi-scale module and the deep network module greatly reduce the number of model parameters and avoid the problem of excessive network depth.
2. The lightweight deep-learning-based human action recognition model constructed by the present invention uses a two-stream network structure, which effectively captures appearance and motion information and makes better use of the spatio-temporal information of human actions, enhancing the recognition and generalization ability of the model.
3. While keeping recognition accuracy essentially on par with current state-of-the-art methods, the present invention considerably reduces the number of model parameters and compresses the model size.
Brief description of the drawings
The drawings described here provide a further understanding of the embodiments of the invention and constitute a part of the application; they do not limit the embodiments of the invention. In the drawings:
Fig. 1 is a diagram of the lightweight deep learning network combining shallow and deep networks of the present invention.
Fig. 2 is an example diagram of the shallow multi-scale module SMSM-1 in the lightweight deep learning network combining shallow and deep networks of the present invention.
Fig. 3 is a diagram of the lightweight deep-learning-based human action recognition model of the present invention.
Detailed description of the embodiments
To make the objectives, technical solutions, and advantages of the present invention clearer, the invention is described in further detail below with reference to the embodiments and drawings. The exemplary embodiments of the invention and their description serve only to explain the invention and do not limit it.
As shown in Fig. 3, the proposed lightweight deep learning network combining shallow and deep networks (SDNet) first performs feature extraction and representation on the two spatio-temporal streams; a temporal pyramid pooling layer (TPP) then aggregates the frame-level features of the temporal and spatial streams into video-level representations; fully connected and softmax layers produce each stream's recognition result for the input sequence; finally, the two streams' results are combined by weighted-average fusion to obtain the final recognition result. The method mainly comprises the following steps:
Step 1: process the video data containing human actions to obtain an RGB frame sequence and an optical-flow frame sequence;
Step 2: construct a lightweight deep learning network combining shallow and deep networks (A lightweight deep learning network model combining shallow and deep networks, SDNet). The network includes a shallow multi-scale module (Shallow multi-scale module, SMSM) and a deep network module (Deep networks module, DNM). The shallow multi-scale module obtains local human-action features from the initial RGB frame sequence and optical-flow frame sequence; the deep network module fuses the local human-action features extracted by the shallow multi-scale module and generates high-level features.
The network structure of the lightweight deep learning network (SDNet) combining shallow and deep networks is as follows:
(a) Convolutional layer C1: the convolution kernel size is …, the stride is 1, and the number of filters is 32; ReLU is used as the activation function of this layer. The ReLU function is:
ReLU(x) = max(0, x)
(b) Pooling layer S1: the feature map produced by C1 is resampled by a max pooling layer (Max Pooling); the pooling kernel size is …, the stride is 2;
(c1) Shallow multi-scale module SMSM-1: the module contains three branches, a concatenation layer, and a pooling layer. As shown in Fig. 2, the specific structure is as follows:
(c1-1) Branch 1:
(c1-1-a) Convolutional layer C1: kernel size …, stride 1, 64 filters, with ReLU as the activation function of this layer;
(c1-1-b) Convolutional layer C2: kernel size …, stride 1, 64 filters, dilation coefficient 1, with ReLU as the activation function of this layer. The kernel size after dilation is computed as:
dilated kernel size = dilation coefficient × (original kernel size − 1) + 1
(c1-1-c) Convolutional layer C3: kernel size …, stride 1, 64 filters, dilation coefficient 1, with ReLU as the activation function of this layer;
(c1-2) Branch 2:
(c1-2-a) Convolutional layer C1: kernel size …, stride 1, 64 filters, with ReLU as the activation function of this layer;
(c1-2-b) Convolutional layer C2: kernel size …, stride 1, 64 filters, dilation coefficient 2, with ReLU as the activation function of this layer;
(c1-2-c) Convolutional layer C3: kernel size …, stride 1, 64 filters, dilation coefficient 2, with ReLU as the activation function of this layer;
(c1-3) Branch 3:
(c1-3-a) Convolutional layer C1: kernel size …, stride 1, 64 filters, with ReLU as the activation function of this layer;
(c1-3-b) Convolutional layer C2: kernel size …, stride 1, 64 filters, dilation coefficient 3, with ReLU as the activation function of this layer;
(c1-3-c) Convolutional layer C3: kernel size …, stride 1, 64 filters, dilation coefficient 3, with ReLU as the activation function of this layer;
(c1-4) Concatenation layer: this layer connects the feature maps of the three branches through a concatenation operation;
(c1-5) Pooling layer S1: the feature map produced by C1 is resampled by a max pooling layer (Max Pooling); the pooling kernel size is …, the stride is 2.
(c2) Shallow multi-scale module SMSM-2: the module likewise contains three branches, a concatenation layer, and a pooling layer; the convolutional layers of each branch have 128 filters, and the rest is the same as SMSM-1;
(c3) Shallow multi-scale module SMSM-3: the module likewise contains three branches, a concatenation layer, and a pooling layer; the convolutional layers of each branch have 256 filters, and the rest is the same as SMSM-1;
(d) Convolutional layer C2: this layer has two parts. The first part connects the densely connected shallow multi-scale modules (concatenation), merging their feature maps through a concatenation operation. The second part is a convolutional layer with kernel size …, stride 1, and 256 filters; this convolution has no nonlinear activation function.
Preferably, as shown in Fig. 1, the three shallow multi-scale modules (c1), (c2), and (c3) are densely connected: (c1) SMSM-1 connects to (c2) SMSM-2 and to convolutional layer C2 of (d), and SMSM-2 connects to (c3) SMSM-3 and to convolutional layer C2 of (d).
(e) Deep network module DNM. The network structure of the module is:
(e-a) Convolutional layer C1: kernel size …, stride 1, 512 filters, dilation coefficient 2, with ReLU as the activation function of this layer;
(e-b) MLP convolutional layer C2: kernel size …, stride 1, 512 filters, dilation coefficient 2, with ReLU as the activation function of this layer;
(e-c) Convolutional layer C3: kernel size …, stride 1, 512 filters, dilation coefficient 2, with ReLU as the activation function of this layer;
(e-d) Pooling layer S1: the feature map produced by C3 is resampled by a max pooling layer (Max Pooling); the pooling kernel size is …, the stride is 2;
(e-e) Convolutional layer C4: kernel size …, stride 1, 512 filters, dilation coefficient 1, with ReLU as the activation function of this layer;
(e-f) MLP convolutional layer C5: kernel size …, stride 1, 512 filters, dilation coefficient 1, with ReLU as the activation function of this layer;
(e-g) Convolutional layer C6: kernel size …, stride 1, 512 filters, dilation coefficient 1, with ReLU as the activation function of this layer;
(e-h) Pooling layer S2: a global average pooling layer (Global Average Pooling) is applied to the feature map of C6; this layer replaces the fully connected layer, thereby reducing the number of parameters.
Step 3: being known using the human action of light-type of the SDNet network struction based on deep learning constructed in step 2
Other model, the model are double-stream digestion, that is, include time flow and spatial flow.The model is as shown in figure 3, specific structure is as follows:
(a) input layer: time flow input data is the optical flow data of video sequence, and frame size is 224 × 224;Spatial flow
Input data is the RGB data of video sequence, and frame size is still 224 × 224;
(b) SDNet: the part-time flow network and space flow network are made of SDNet, extract video sequence with SDNet
The space-time characteristic of column;
(c) pond layer S1: utilize time pyramid pond layer (TPP) by the video frame of time flow network and space flow network
The characteristic aggregation of grade is indicated at videl stage.Time pyramid pond is horizontally placed to { 4 × 4 × 1,2 × 2 × 1,1 × 1 × 1 }, i.e.,
Time pyramid uses 3 layers of pyramid form;
(d) full articulamentum FC: including 1024 groups of filters.1024 neurons are arranged to be connected with S1, are made with ReLU
For activation primitive;
(e) Softmax layer: a Softmax classifier computes the relative probability of each class from the feature values output by the FC layer, yielding the class scores. The Softmax function is defined as follows:
p_i = e^(V_i) / Σ_{j=1}^{C} e^(V_j)
where V_i is the output of the i-th output unit of the stage preceding the classifier, i.e. the FC layer; i denotes the class index and C is the total number of classes; p_i is the ratio of the exponential of the current element to the sum of the exponentials of all elements; and max(p_i) gives the class score;
(f) Fusion layer: this layer fuses the class scores of the temporal stream and the spatial stream using a decision-fusion rule to obtain the action classification result. During fusion, the recognition confidences of the temporal stream and the spatial stream are weighted 1:1.
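The Softmax and 1:1 decision-fusion steps above can be sketched in a few lines. This is a generic illustration under the equal-weight assumption stated in (f), not code from the patent; the 3-class scores are hypothetical:

```python
import math

def softmax(v):
    """Map FC-layer outputs V_i to class probabilities p_i."""
    m = max(v)                                  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in v]
    total = sum(exps)
    return [e / total for e in exps]

def fuse(temporal, spatial, w_t=0.5, w_s=0.5):
    """Weighted-average decision fusion of the two streams (1:1 by default)."""
    return [w_t * t + w_s * s for t, s in zip(temporal, spatial)]

# Hypothetical FC-layer outputs for a 3-class example:
p_time = softmax([2.0, 1.0, 0.1])    # temporal (optical-flow) stream
p_space = softmax([1.5, 1.8, 0.2])   # spatial (RGB) stream
fused = fuse(p_time, p_space)
action = fused.index(max(fused))     # final predicted action class
```

With equal weights the fused scores remain a valid probability distribution, and the argmax over the fused vector gives the final recognition result.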
Step 4: The lightweight deep-learning-based human action recognition model constructed in Step 3 processes the RGB data and optical-flow data of the video sequence, and the human action classification results are obtained.
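For the temporal pyramid pooling setting {4 × 4 × 1, 2 × 2 × 1, 1 × 1 × 1} used in Step 3, the length of the resulting video-level representation can be computed with a short sketch. The 512-channel input is an assumption based on the DNM output described earlier:

```python
# Total pooled bins across the 3 pyramid levels, times the channel count,
# gives the length of the video-level feature vector fed to the FC layer.
def tpp_feature_length(levels, channels):
    bins = sum(h * w * t for (h, w, t) in levels)
    return bins, bins * channels

levels = [(4, 4, 1), (2, 2, 1), (1, 1, 1)]   # pyramid levels from the text
bins, length = tpp_feature_length(levels, 512)
print(bins, length)                           # 16 + 4 + 1 = 21 bins per channel
```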
The model of the present invention was first pre-trained and fine-tuned on the ImageNet data set, and then applied to the action recognition data sets UCF101 and HMDB51 using the lightweight deep-learning-based human action recognition method, finally achieving action recognition accuracies of 94.0% and 69.4% on UCF101 and HMDB51 respectively, with a model parameter count of only 19M. It follows that the proposed lightweight deep-learning-based human action recognition model can not only effectively recognize human actions in video, but also greatly reduces the parameter count compared with recent human action recognition models, saving computational cost.
The specific embodiments described above further detail the purpose, technical solution and beneficial effects of the present invention. It should be understood that the foregoing is merely a specific embodiment of the present invention and is not intended to limit its scope of protection; any modification, equivalent substitution, improvement, etc. made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.
Claims (9)
1. A lightweight deep-learning-based human action recognition method, characterized by comprising the following steps:
Step 1: processing the video data containing human actions to obtain an RGB data frame sequence and an optical-flow data frame sequence;
Step 2: constructing a lightweight deep learning network combining shallow and deep networks (A lightweight deep learning network model combining shallow and deep networks, SDNet), the network comprising shallow multi-scale modules and a deep network module;
Step 3: using the SDNet network constructed in Step 2 to build a lightweight deep-learning-based human action recognition model, the model being a two-stream model, i.e. comprising a temporal stream and a spatial stream;
Step 4: processing the RGB data and optical-flow data of a video sequence with the human action recognition model constructed in Step 3 to obtain human action classification results.
2. The method as claimed in claim 1, characterized in that in Step 2, a lightweight deep learning network combining shallow and deep networks (A lightweight deep learning network model combining shallow and deep networks, SDNet) is constructed; the network comprises shallow multi-scale modules (Shallow multi-scale module, SMSM) and a deep network module (Deep networks module, DNM):
the shallow multi-scale modules are used to obtain local human-action features from the initial RGB data frame sequence and optical-flow data frame sequence;
the deep network module is used to fuse the local human-action features extracted by the shallow multi-scale modules and to generate high-level features.
3. The method as claimed in claim 1, characterized in that in Step 2, a lightweight deep learning network (SDNet) combining shallow and deep networks is constructed, with the following network structure:
(a) Convolutional layer C1: kernel size 3 × 3 × 3, stride 1, 32 filters, with the ReLU function as the activation function of this layer; the ReLU function is given by:
ReLU(x) = max(0, x)
(b) Pooling layer S1: the feature maps obtained from layer C1 are resampled by a max pooling layer (Max Pooling); the pooling kernel size is 2 × 2 × 2, stride 2;
(c1) Shallow multi-scale module SMSM-1: the module comprises three branches, each consisting of three convolutional layers with different kernels; all convolutional layers in the module have 64 filters;
(c2) Shallow multi-scale module SMSM-2: the module comprises three branches, each consisting of three convolutional layers with different kernels; all convolutional layers in the module have 128 filters;
(c3) Shallow multi-scale module SMSM-3: the module comprises three branches, each consisting of three convolutional layers with different kernels; all convolutional layers in the module have 256 filters;
(d) Convolutional layer C2: this layer has two parts. The first part concatenates the densely connected shallow multi-scale modules, merging their feature maps via a concatenation function; the second part is a convolutional layer with kernel size 1 × 1 × 1, stride 1 and 256 filters, and contains no nonlinear activation function;
(e) Deep network module DNM.
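A rough per-layer parameter count for the 3-D convolutions listed above can be sketched as follows. This is an illustration, not part of the claims; the input channel counts and the inclusion of bias terms are assumptions:

```python
def conv3d_params(kd, kh, kw, in_ch, out_ch, bias=True):
    """Learnable parameters of one 3-D convolutional layer."""
    n = kd * kh * kw * in_ch * out_ch
    return n + (out_ch if bias else 0)

# e.g. layer C1 above: 3 x 3 x 3 kernels over an assumed 3-channel RGB input,
# producing 32 filters:
print(conv3d_params(3, 3, 3, 3, 32))
```

Counting each layer this way is a quick check that the 1 × 1 × 1 bottleneck convolutions (such as the second part of C2) stay far cheaper than the 3 × 3 × 3 layers, which is the design lever behind the model's small total parameter count.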
4. The method as claimed in claim 3, characterized in that the network structure of the shallow multi-scale module SMSM-1 of layer (c1) is designed as follows:
1) Branch 1:
(a) Convolutional layer C1: kernel size 1 × 1 × 1, stride 1, 64 filters, with the ReLU function as the activation function of this layer;
(b) Convolutional layer C2: kernel size 3 × 3 × 3, stride 1, 64 filters, dilation factor 1, with ReLU as the activation function of this layer. The kernel size after dilation is computed as:
kernel size after dilation = dilation factor × (kernel size before dilation − 1) + 1
(c) Convolutional layer C3: kernel size 3 × 3 × 3, stride 1, 64 filters, dilation factor 1, with ReLU as the activation function of this layer;
2) Branch 2:
(a) Convolutional layer C1: kernel size 1 × 1 × 1, stride 1, 64 filters, with the ReLU function as the activation function of this layer;
(b) Convolutional layer C2: kernel size 3 × 3 × 3, stride 1, 64 filters, dilation factor 2, with ReLU as the activation function of this layer;
(c) Convolutional layer C3: kernel size 3 × 3 × 3, stride 1, 64 filters, dilation factor 2, with ReLU as the activation function of this layer;
3) Branch 3:
(a) Convolutional layer C1: kernel size 1 × 1 × 1, stride 1, 64 filters, with the ReLU function as the activation function of this layer;
(b) Convolutional layer C2: kernel size 3 × 3 × 3, stride 1, 64 filters, dilation factor 3, with ReLU as the activation function of this layer;
(c) Convolutional layer C3: kernel size 3 × 3 × 3, stride 1, 64 filters, dilation factor 3, with ReLU as the activation function of this layer;
4) Concatenation layer: this layer concatenates the feature maps of the three branches via a concatenation function;
5) Pooling layer S1: the feature maps obtained from layer C1 are resampled by a max pooling layer (Max Pooling); the pooling kernel size is 2 × 2 × 1, stride 2.
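The dilated-kernel formula stated in claim 4 can be checked with a short sketch (illustrative code, not part of the claims):

```python
def dilated_kernel_size(k, d):
    """Effective kernel size after dilation: d * (k - 1) + 1."""
    return d * (k - 1) + 1

# 3 x 3 x 3 kernels with the dilation factors used by the three branches:
for d in (1, 2, 3):
    print(d, dilated_kernel_size(3, d))   # dilation 1 -> 3, 2 -> 5, 3 -> 7
```

The three branches thus cover effective receptive fields of 3, 5 and 7 per spatial axis, which is what makes the module multi-scale without enlarging the stored kernels.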
5. The method as claimed in claim 3, characterized in that the network structure of the shallow multi-scale module SMSM-2 of layer (c2) is the same as in claim 4, except that each convolutional layer has 128 filters.
6. The method as claimed in claim 3, characterized in that the network structure of the shallow multi-scale module SMSM-3 of layer (c3) is the same as in claim 4, except that each convolutional layer has 256 filters.
7. The method as claimed in claim 3, characterized in that the three shallow multi-scale modules SMSM of (c1), (c2) and (c3) are densely connected, i.e. (c1) SMSM-1 connects to (c2) SMSM-2 and to (d) convolutional layer C2, and (c2) SMSM-2 connects to (c3) SMSM-3 and to (d) convolutional layer C2.
8. The method as claimed in claim 3, characterized in that the network structure of the deep network module DNM of layer (e) is designed as follows:
(a) Convolutional layer C1: kernel size 3 × 3 × 3, stride 1, 512 filters, dilation factor 2, with ReLU as the activation function of this layer;
(b) MLP convolutional layer C2: kernel size 1 × 1 × 1, stride 1, 512 filters, dilation factor 2, with ReLU as the activation function of this layer;
(c) Convolutional layer C3: kernel size 3 × 3 × 3, stride 1, 512 filters, dilation factor 2, with ReLU as the activation function of this layer;
(d) Pooling layer S1: the feature maps obtained from layer C3 are resampled by a max pooling layer (Max Pooling); the pooling kernel size is 2 × 2 × 2, stride 2;
(e) Convolutional layer C4: kernel size 3 × 3 × 3, stride 1, 512 filters, dilation factor 1, with ReLU as the activation function of this layer;
(f) MLP convolutional layer C5: kernel size 1 × 1 × 1, stride 1, 512 filters, dilation factor 1, with ReLU as the activation function of this layer;
(g) Convolutional layer C6: kernel size 3 × 3 × 3, stride 1, 512 filters, dilation factor 1, with ReLU as the activation function of this layer;
(h) Pooling layer S2: a global average pooling layer (Global Average Pooling) is applied to the feature maps of C6; this layer replaces a fully connected layer, thereby reducing the parameter count.
9. The method as claimed in claim 1, characterized in that in Step 3, a lightweight deep-learning-based human action recognition model is constructed; the model is a two-stream model, i.e. it comprises a temporal stream and a spatial stream, with the following structure:
(a) Input layer: the temporal-stream input is the optical-flow data of the video sequence, with a frame size of 224 × 224; the spatial-stream input is the RGB data of the video sequence, also with a frame size of 224 × 224;
(b) SDNet: both the temporal-stream and spatial-stream networks consist of SDNet, which extracts the spatio-temporal features of the video sequence;
(c) Pooling layer S1: a temporal pyramid pooling (TPP) layer aggregates the frame-level features of the temporal-stream and spatial-stream networks into a video-level representation; the temporal pyramid pooling levels are set to {4 × 4 × 1, 2 × 2 × 1, 1 × 1 × 1}, i.e. the temporal pyramid uses a 3-level pyramid;
(d) Fully connected layer FC: contains 1024 filters; its 1024 neurons are connected to S1, with ReLU as the activation function;
(e) Softmax layer: a Softmax classifier computes the relative probability of each class from the feature values output by the FC layer, yielding the class scores; the Softmax function is defined as follows:
p_i = e^(V_i) / Σ_{j=1}^{C} e^(V_j)
where V_i is the output of the i-th output unit of the stage preceding the classifier, i.e. the FC layer; i denotes the class index and C is the total number of classes; p_i is the ratio of the exponential of the current element to the sum of the exponentials of all elements; and max(p_i) gives the class score;
(f) Fusion layer: this layer fuses the class scores of the temporal stream and the spatial stream using a decision-fusion rule to obtain the action classification result; during fusion, the recognition confidences of the temporal stream and the spatial stream are weighted 1:1.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910269644.3A CN109977904A (en) | 2019-04-04 | 2019-04-04 | A kind of human motion recognition method of the light-type based on deep learning |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109977904A true CN109977904A (en) | 2019-07-05 |
Family
ID=67082966
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910269644.3A Pending CN109977904A (en) | 2019-04-04 | 2019-04-04 | A kind of human motion recognition method of the light-type based on deep learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109977904A (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106845351A (en) * | 2016-05-13 | 2017-06-13 | 苏州大学 | It is a kind of for Activity recognition method of the video based on two-way length mnemon in short-term |
CN106845329A (en) * | 2016-11-11 | 2017-06-13 | 广东顺德中山大学卡内基梅隆大学国际联合研究院 | A kind of action identification method based on depth convolution feature multichannel pyramid pond |
CN107240066A (en) * | 2017-04-28 | 2017-10-10 | 天津大学 | Image super-resolution rebuilding algorithm based on shallow-layer and deep layer convolutional neural networks |
CN107862376A (en) * | 2017-10-30 | 2018-03-30 | 中山大学 | A kind of human body image action identification method based on double-current neutral net |
CN108875674A (en) * | 2018-06-29 | 2018-11-23 | 东南大学 | A kind of driving behavior recognition methods based on multiple row fusion convolutional neural networks |
Non-Patent Citations (3)
Title |
---|
FISHER YU 等: "MULTI-SCALE CONTEXT AGGREGATION BY DILATED CONVOLUTIONS", 《ICLR 2016》 * |
YIFAN WANG 等: "End-to-End Image Super-Resolution via Deep and Shallow Convolutional Networks", 《DIGITAL OBJECT IDENTIFIER》 * |
杨天明 等: "基于视频深度学习的时空双流人物动作识别模型", 《计算机应用》 * |
Cited By (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110458038A (en) * | 2019-07-19 | 2019-11-15 | 天津理工大学 | The cross-domain action identification method of small data based on double-strand depth binary-flow network |
CN112308885A (en) * | 2019-07-29 | 2021-02-02 | 顺丰科技有限公司 | Violent throwing detection method, device, equipment and storage medium based on optical flow |
CN111368666A (en) * | 2020-02-25 | 2020-07-03 | 上海蠡图信息科技有限公司 | Living body detection method based on novel pooling and attention mechanism double-current network |
CN111368666B (en) * | 2020-02-25 | 2023-08-18 | 上海蠡图信息科技有限公司 | Living body detection method based on novel pooling and attention mechanism double-flow network |
CN111666852A (en) * | 2020-05-28 | 2020-09-15 | 天津大学 | Micro-expression double-flow network identification method based on convolutional neural network |
CN113836969A (en) * | 2020-06-23 | 2021-12-24 | 山西农业大学 | Abnormal event detection method based on double flows |
CN111738357A (en) * | 2020-07-24 | 2020-10-02 | 完美世界(北京)软件科技发展有限公司 | Junk picture identification method, device and equipment |
CN111738357B (en) * | 2020-07-24 | 2020-11-20 | 完美世界(北京)软件科技发展有限公司 | Junk picture identification method, device and equipment |
CN112244863A (en) * | 2020-10-23 | 2021-01-22 | 京东方科技集团股份有限公司 | Signal identification method, signal identification device, electronic device and readable storage medium |
CN112686329A (en) * | 2021-01-06 | 2021-04-20 | 西安邮电大学 | Electronic laryngoscope image classification method based on dual-core convolution feature extraction |
CN112749684A (en) * | 2021-01-27 | 2021-05-04 | 萱闱(北京)生物科技有限公司 | Cardiopulmonary resuscitation training and evaluating method, device, equipment and storage medium |
CN114037930A (en) * | 2021-10-18 | 2022-02-11 | 苏州大学 | Video action recognition method based on space-time enhanced network |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109977904A (en) | A kind of human motion recognition method of the light-type based on deep learning | |
Yang et al. | Visual perception enabled industry intelligence: state of the art, challenges and prospects | |
Yin et al. | Recurrent convolutional network for video-based smoke detection | |
CN107862300A (en) | A kind of descending humanized recognition methods of monitoring scene based on convolutional neural networks | |
CN102332095B (en) | Face motion tracking method, face motion tracking system and method for enhancing reality | |
CN110427813A (en) | Pedestrian's recognition methods again based on the twin production confrontation network that posture instructs pedestrian image to generate | |
Ming et al. | Simple triplet loss based on intra/inter-class metric learning for face verification | |
CN110210539A (en) | The RGB-T saliency object detection method of multistage depth characteristic fusion | |
CN110232361B (en) | Human behavior intention identification method and system based on three-dimensional residual dense network | |
CN116343330A (en) | Abnormal behavior identification method for infrared-visible light image fusion | |
CN110472634A (en) | Change detecting method based on multiple dimensioned depth characteristic difference converged network | |
CN110022422A (en) | A kind of sequence of frames of video generation method based on intensive connection network | |
Singh et al. | A deep learning based technique for anomaly detection in surveillance videos | |
CN107194380A (en) | The depth convolutional network and learning method of a kind of complex scene human face identification | |
Le et al. | Cross-resolution feature fusion for fast hand detection in intelligent homecare systems | |
CN109583334A (en) | A kind of action identification method and its system based on space time correlation neural network | |
CN115063836A (en) | Pedestrian tracking and re-identification method based on deep learning | |
Sun et al. | YOLO-P: An efficient method for pear fast detection in complex orchard picking environment | |
CN113705384B (en) | Facial expression recognition method considering local space-time characteristics and global timing clues | |
CN114333002A (en) | Micro-expression recognition method based on deep learning of image and three-dimensional reconstruction of human face | |
CN110223240A (en) | Image defogging method, system and storage medium based on color decaying priori | |
CN109583406B (en) | Facial expression recognition method based on feature attention mechanism | |
CN116403286A (en) | Social grouping method for large-scene video | |
CN116342877A (en) | Semantic segmentation method based on improved ASPP and fusion module in complex scene | |
CN105224952A (en) | Based on the double interbehavior recognition methods of largest interval markov pessimistic concurrency control |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
WD01 | Invention patent application deemed withdrawn after publication | ||
Application publication date: 20190705 |