CN109977904A - Lightweight human action recognition method based on deep learning - Google Patents

Lightweight human action recognition method based on deep learning

Info

Publication number
CN109977904A
CN109977904A
Authority
CN
China
Prior art keywords
layer
network
module
shallow
convolutional layer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910269644.3A
Other languages
Chinese (zh)
Inventor
魏维
何冰倩
魏敏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu University of Information Technology
Original Assignee
Chengdu University of Information Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu University of Information Technology
Priority to CN201910269644.3A
Publication of CN109977904A
Pending legal-status Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G06V 20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/20 Movements or behaviour, e.g. gesture recognition

Abstract

The invention discloses a lightweight human action recognition method based on deep learning. The method first constructs a lightweight deep learning network (SDNet) combining shallow and deep networks; the network consists of shallow multi-scale modules and a deep network module, and a lightweight deep-learning-based human action recognition model is built on top of it. In the model, SDNet first performs feature extraction and representation on the spatial and temporal streams; a temporal pyramid pooling layer then aggregates the frame-level features of both streams into video-level representations; fully connected and softmax layers next produce each stream's recognition result for the input sequence; finally, the two streams' results are combined by weighted-average fusion to obtain the final recognition result. With this lightweight deep-learning-based method, the number of model parameters can be greatly reduced without any loss of recognition accuracy.

Description

Lightweight human action recognition method based on deep learning
Technical field
The present invention relates to the field of graphics and image processing, and in particular to a supervised, lightweight, deep-learning-based human action recognition model and method.
Background technique
The central problem of human action recognition is how to analyze and process video sequences captured by cameras or sensors so that a computer can "understand" the movements and behaviors of people in video. This has important research significance for security surveillance, entertainment, and related areas, and video-based human action recognition is also widely applied in human-computer interaction, virtual reality, smart home devices, and similar fields. For many artificial intelligence systems, human action recognition, or human behavior understanding, is essential. For example, a video surveillance system may contain hundreds of hours of footage; traversing it manually is not only tedious but also highly inefficient. With human action recognition technology, the actions of people in surveillance video can be recognized and understood, so that malicious and abnormal behaviors are detected automatically and effectively.
Video-based human action recognition is an inherently challenging task, for two main reasons: environmental factors in the video, and the complexity of the action categories themselves. Changes in illumination, camera shake, and viewpoint variation are all environmental factors. Scenes in video are endlessly varied, and even in a relatively fixed-background setting such as an indoor scene, changes in illumination or occlusion of the subject affect the recognition task. As for the complexity of the categories themselves, the main issues are inter-class similarity and intra-class variation. For example, "jogging", "strolling", and "running" are three distinct categories, yet factors such as movement speed make the differences between these categories small; conversely, the same action can look quite different across viewpoints, producing large variation within a single category.
Since the deep learning network model LeNet was proposed and achieved considerable success on handwritten digit recognition, scholars at home and abroad have proposed a succession of deep learning network models and applied them to human action recognition, such as AlexNet, VggNet, GoogleNet, ResNet, and DenseNet. AlexNet and VggNet improve performance by deepening the network, while GoogleNet and ResNet do so by increasing the model's width or depth; these networks have achieved notable results in image recognition and classification, human action recognition, and other fields. Although experiments with added weight layers have shown that shallow learning networks are limited both in expressing complicated functions and in model generalization, the continual stacking of layers and widening of networks also brings problems: an enormous number of parameters, high computational complexity, and, the deeper the network, the more easily vanishing gradients appear. Moreover, blindly increasing network depth can cause accuracy to saturate or even decline, producing the degradation problem of network models.
The main difference between video-based human action recognition and still-image recognition is that a video sequence contains not only the appearance information of the images but also motion information along the time axis, whereas single-image analysis need not consider temporal information. Because two-dimensional convolutional neural network models cannot effectively exploit the motion information in video sequences, three-dimensional convolutional networks and two-stream convolutional networks have gradually been adopted to recognize human actions in video. These models account, to some degree, for the motion information carried by video sequences, but in the pursuit of higher recognition accuracy their network structures still grow ever deeper.
In conclusion present inventor during realizing the present application technical solution, has found currently based on depth The human action identification model or method for spending study at least have the following technical problems:
One, the recognition performance of network model is improved by deepening network depth and widen network-wide, increase considerably The calculation amount of network model, and since parameter amount is excessive, it is easy to appear the problem of gradient disappears, discrimination does not rise anti-drop.
Two, current light-type deep learning network model reduces although having compressed scale of model to a certain extent Parameter amount, but in the human action identification problem based on video, it is difficult to reply is effectively extracted comprising complicated incidence relation Space-time characteristic problem.
Summary of the invention
To solve the problems of existing deep-learning-based human action recognition models, namely large parameter counts and networks that are too deep and heavy, the present invention provides a lightweight human action recognition method based on deep learning. The method contains a lightweight deep learning network combining shallow and deep networks: the shallow multi-scale modules describe the local features of the video sequence at different scales, and the deep network module effectively fuses and characterizes the extracted multi-scale features. Together they form a lightweight deep-learning-based human action recognition model that reduces the number of model parameters without losing accuracy.
In one aspect, the present invention is realized through the following technical solution:
A lightweight human action recognition method based on deep learning, with the following specific steps:
Step 1: process the video data containing human actions to obtain an RGB frame sequence and an optical flow frame sequence;
Step 2: construct a lightweight deep learning network combining shallow and deep networks (A lightweight deep learning network model combining shallow and deep networks, SDNet), which contains shallow multi-scale modules and a deep network module;
Step 3: use the SDNet network constructed in Step 2 to build the lightweight deep-learning-based human action recognition model; the model is a two-stream structure, i.e., it contains a temporal stream and a spatial stream;
Step 4: use the model constructed in Step 3 to process the RGB data and optical flow data of a video sequence and obtain the human action classification result.
In another aspect, the invention proposes a lightweight deep learning network combining shallow and deep networks (A lightweight deep learning network model combining shallow and deep networks, SDNet), which contains shallow multi-scale modules (Shallow multi-scale module, SMSM) and a deep network module (Deep networks module, DNM):
The shallow multi-scale modules obtain local human action features from the initial RGB frame sequence and optical flow frame sequence;
The deep network module fuses the local human action features extracted by the shallow multi-scale modules and generates high-level features.
Further, the network structure of the lightweight deep learning network combining shallow and deep networks is designed as follows:
(a) Convolutional layer C1: kernel size 3 × 3 × 3, stride 1, 32 filters, with the ReLU function as the activation of this layer, where ReLU is defined as:
ReLU(x) = max(0, x)
(b) Pooling layer S1: a max pooling layer (Max Pooling) resamples the feature maps produced by C1; the pooling kernel size is 2 × 2 × 2 with stride 2;
(c1) Shallow multi-scale module SMSM-1: the module contains three branches, each consisting of three convolutional layers with different kernels; every convolutional layer in the module has 64 filters;
(c2) Shallow multi-scale module SMSM-2: the module contains three branches, each consisting of three convolutional layers with different kernels; every convolutional layer in the module has 128 filters;
(c3) Shallow multi-scale module SMSM-3: the module contains three branches, each consisting of three convolutional layers with different kernels; every convolutional layer in the module has 256 filters;
(d) Convolutional layer C2: the layer has two parts. The first part connects the densely linked shallow multi-scale modules (concatenation), merging their feature maps with the concatenation function; the second part is a convolutional layer with kernel size 1 × 1 × 1, stride 1, and 256 filters, containing no nonlinear activation function;
(e) Deep network module DNM.
Further, in the network structure of the lightweight deep learning network combining shallow and deep networks, the shallow multi-scale module SMSM-1 of layer (c1) is designed as:
1) Branch 1:
(a) Convolutional layer C1: kernel size 1 × 1 × 1, stride 1, 64 filters, with the ReLU function as the activation of this layer;
(b) Convolutional layer C2: kernel size 3 × 3 × 3, stride 1, 64 filters, dilation rate 1, with the ReLU function as the activation of this layer. The effective kernel size after dilation is computed as:
dilated kernel size = dilation rate × (kernel size before dilation − 1) + 1
For example, a 3 × 3 × 3 kernel with dilation rate 3 has an effective size of 3 × (3 − 1) + 1 = 7 in each dimension.
(c) Convolutional layer C3: kernel size 3 × 3 × 3, stride 1, 64 filters, dilation rate 1, with the ReLU function as the activation of this layer;
2) Branch 2:
(a) Convolutional layer C1: kernel size 1 × 1 × 1, stride 1, 64 filters, ReLU activation;
(b) Convolutional layer C2: kernel size 3 × 3 × 3, stride 1, 64 filters, dilation rate 2, ReLU activation;
(c) Convolutional layer C3: kernel size 3 × 3 × 3, stride 1, 64 filters, dilation rate 2, ReLU activation;
3) Branch 3:
(a) Convolutional layer C1: kernel size 1 × 1 × 1, stride 1, 64 filters, ReLU activation;
(b) Convolutional layer C2: kernel size 3 × 3 × 3, stride 1, 64 filters, dilation rate 3, ReLU activation;
(c) Convolutional layer C3: kernel size 3 × 3 × 3, stride 1, 64 filters, dilation rate 3, ReLU activation;
4) Connection layer (concatenation): the feature maps of the three branches are joined with the concatenation function;
5) Pooling layer S1: a max pooling layer (Max Pooling) resamples the concatenated feature maps; the pooling kernel size is 2 × 2 × 1 with stride 2.
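The three dilation rates give the branches three effective receptive fields; a quick plain-Python check of the formula above (illustrative only):

```python
# Effective kernel size k_eff = d * (k - 1) + 1 for the three SMSM branch
# dilation rates, with kernel size k = 3 before dilation:
for d in (1, 2, 3):
    print(f"dilation rate {d}: effective kernel size {d * (3 - 1) + 1}")
# -> 3, 5, 7: each branch describes the local features at a different scale.
```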
Further, in the network structure of the lightweight deep learning network combining shallow and deep networks, the shallow multi-scale module SMSM-2 of layer (c2) is identical to SMSM-1 except that each convolutional layer has 128 filters.
Further, in the same network structure, the shallow multi-scale module SMSM-3 of layer (c3) is identical to SMSM-1 except that each convolutional layer has 256 filters.
Preferably, in the network structure of the lightweight deep learning network combining shallow and deep networks, the three shallow multi-scale modules SMSM of layers (c1), (c2), and (c3) are densely connected, i.e., (c1) SMSM-1 connects to (c2) SMSM-2 and to (d) convolutional layer C2, and SMSM-2 connects to (c3) SMSM-3 and to (d) convolutional layer C2.
Further, in the network structure of the lightweight deep learning network combining shallow and deep networks, the deep network module DNM of layer (e) is designed as:
(a) Convolutional layer C1: kernel size 3 × 3 × 3, stride 1, 512 filters, dilation rate 2, with the ReLU function as the activation of this layer;
(b) MLP convolutional layer C2: kernel size 1 × 1 × 1, stride 1, 512 filters, dilation rate 2, with the ReLU function as the activation of this layer;
(c) Convolutional layer C3: kernel size 3 × 3 × 3, stride 1, 512 filters, dilation rate 2, with the ReLU function as the activation of this layer;
(d) Pooling layer S1: a max pooling layer (Max Pooling) resamples the feature maps produced by C3; the pooling kernel size is 2 × 2 × 2 with stride 2;
(e) Convolutional layer C4: kernel size 3 × 3 × 3, stride 1, 512 filters, dilation rate 1, with the ReLU function as the activation of this layer;
(f) MLP convolutional layer C5: kernel size 1 × 1 × 1, stride 1, 512 filters, dilation rate 1, with the ReLU function as the activation of this layer;
(g) Convolutional layer C6: kernel size 3 × 3 × 3, stride 1, 512 filters, dilation rate 1, with the ReLU function as the activation of this layer;
(h) Pooling layer S2: a global average pooling layer (Global Average Pooling) is applied to the feature maps of C6; this layer replaces a fully connected layer and thereby reduces the parameter count.
In a further aspect, the invention constructs the lightweight deep-learning-based human action recognition model. The model is a two-stream structure containing a temporal stream and a spatial stream, with the following specific structure:
(a) Input layer: the temporal-stream input is the optical flow data of the video sequence, with frame size 224 × 224; the spatial-stream input is the RGB data of the video sequence, likewise with frame size 224 × 224;
(b) SDNet: both the temporal-stream and spatial-stream networks consist of SDNet, which extracts the spatio-temporal features of the video sequence;
(c) Pooling layer S1: a temporal pyramid pooling layer (TPP) aggregates the frame-level features of the temporal and spatial streams into video-level representations. The temporal pyramid levels are set to {4 × 4 × 1, 2 × 2 × 1, 1 × 1 × 1}, i.e., a 3-level pyramid;
(d) Fully connected layer FC: the layer contains 1024 filters, i.e., 1024 neurons connected to S1, with ReLU as the activation function;
(e) Softmax layer: a Softmax classifier converts the feature values produced by the FC layer into relative probabilities of the classes and yields the class score. The Softmax function is defined as:
p_i = exp(V_i) / Σ_{j=1}^{C} exp(V_j)
where V_i is the output of the i-th output unit of the stage preceding the classifier Softmax, i.e., of the FC layer, i is the class index, and C is the total number of classes; p_i is the ratio of the exponential of the current element to the sum of the exponentials of all elements, and max(p_i) is the class score;
(f) Fusion layer: this layer fuses the class scores of the temporal and spatial streams with a decision-level fusion rule to obtain the action classification result. The fusion confidences of the temporal and spatial streams are set to 1 : 1.
Compared with the prior art, the present invention has the following advantages and beneficial effects:
1. The lightweight deep learning network combining shallow and deep networks extracts the spatio-temporal features of the video sequence; with the densely connected shallow multi-scale modules and the deep network module, the number of model parameters is greatly reduced and the problem of an overly deep network is avoided.
2. The lightweight deep-learning-based human action recognition model uses a two-stream network structure, which effectively captures appearance and motion information, makes better use of the spatio-temporal information of human actions, and enhances the model's discrimination and generalization ability.
3. While keeping the action recognition accuracy roughly level with current state-of-the-art methods, the invention considerably reduces the number of model parameters and compresses the model scale.
Detailed description of the invention
The accompanying drawings described here provide a further understanding of the embodiments of the invention and constitute part of the application; they do not limit the embodiments of the invention. In the drawings:
Fig. 1 is a diagram of the lightweight deep learning network combining shallow and deep networks of the invention.
Fig. 2 is an example diagram of the shallow multi-scale module SMSM-1 in the lightweight deep learning network combining shallow and deep networks of the invention.
Fig. 3 is a diagram of the lightweight deep-learning-based human action recognition model of the invention.
Specific embodiment
To make the objectives, technical solutions, and advantages of the present invention clearer, the invention is described in further detail below with reference to the embodiments and the accompanying drawings; the exemplary embodiments of the invention and their explanations are used only to explain the invention and are not a limitation of it.
As shown in Fig. 3, the lightweight deep learning network combining shallow and deep networks (SDNet) proposed by the invention first performs feature extraction and representation on the spatial and temporal streams; a temporal pyramid pooling layer (TPP) then aggregates the frame-level features of the temporal and spatial streams into video-level representations; fully connected and softmax layers next produce each stream's recognition result for the input sequence; and finally the two streams' results are fused by weighted averaging to obtain the final recognition result. The method mainly comprises the following steps:
Step 1: process the video data containing human actions to obtain an RGB frame sequence and an optical flow frame sequence (a minimal preprocessing sketch follows);
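A minimal preprocessing sketch in Python, assuming OpenCV; the patent does not name a particular optical flow algorithm, so Farneback dense flow is used here purely as a stand-in, and the video path is a placeholder:

```python
import cv2
import numpy as np

def rgb_and_flow_frames(path: str):
    """Decode a video into an RGB frame sequence and an optical flow frame sequence."""
    cap = cv2.VideoCapture(path)
    rgb_frames, flow_frames, prev_gray = [], [], None
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frame = cv2.resize(frame, (224, 224))  # frame size used by the model
        rgb_frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        if prev_gray is not None:
            # Dense flow between consecutive frames -> (224, 224, 2) per pair
            flow_frames.append(cv2.calcOpticalFlowFarneback(
                prev_gray, gray, None, 0.5, 3, 15, 3, 5, 1.2, 0))
        prev_gray = gray
    cap.release()
    return np.stack(rgb_frames), np.stack(flow_frames)

# rgb, flow = rgb_and_flow_frames("video.avi")  # placeholder path
```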
Step 2: construct a lightweight deep learning network combining shallow and deep networks (A lightweight deep learning network model combining shallow and deep networks, SDNet), which contains shallow multi-scale modules (Shallow multi-scale module, SMSM) and a deep network module (Deep networks module, DNM). The shallow multi-scale modules obtain local human action features from the initial RGB frame sequence and optical flow frame sequence; the deep network module fuses the local features extracted by the shallow multi-scale modules and generates high-level features.
The network structure of the lightweight SDNet combining shallow and deep networks is as follows:
(a) Convolutional layer C1: kernel size 3 × 3 × 3, stride 1, 32 filters, with the ReLU function as the activation of this layer, where:
ReLU(x) = max(0, x)
(b) Pooling layer S1: a max pooling layer (Max Pooling) resamples the feature maps produced by C1; the pooling kernel size is 2 × 2 × 2 with stride 2;
(c1) Shallow multi-scale module SMSM-1: the module contains three branches, a connection layer, and a pooling layer. As shown in Fig. 2, its specific structure is as follows (a code sketch of the module follows this listing):
(c1-1) Branch 1:
(c1-1-a) Convolutional layer C1: kernel size 1 × 1 × 1, stride 1, 64 filters, ReLU activation;
(c1-1-b) Convolutional layer C2: kernel size 3 × 3 × 3, stride 1, 64 filters, dilation rate 1, ReLU activation. The effective kernel size after dilation is computed as:
dilated kernel size = dilation rate × (kernel size before dilation − 1) + 1
(c1-1-c) Convolutional layer C3: kernel size 3 × 3 × 3, stride 1, 64 filters, dilation rate 1, ReLU activation;
(c1-2) Branch 2:
(c1-2-a) Convolutional layer C1: kernel size 1 × 1 × 1, stride 1, 64 filters, ReLU activation;
(c1-2-b) Convolutional layer C2: kernel size 3 × 3 × 3, stride 1, 64 filters, dilation rate 2, ReLU activation;
(c1-2-c) Convolutional layer C3: kernel size 3 × 3 × 3, stride 1, 64 filters, dilation rate 2, ReLU activation;
(c1-3) Branch 3:
(c1-3-a) Convolutional layer C1: kernel size 1 × 1 × 1, stride 1, 64 filters, ReLU activation;
(c1-3-b) Convolutional layer C2: kernel size 3 × 3 × 3, stride 1, 64 filters, dilation rate 3, ReLU activation;
(c1-3-c) Convolutional layer C3: kernel size 3 × 3 × 3, stride 1, 64 filters, dilation rate 3, ReLU activation;
(c1-4) Connection layer (concatenation): this layer joins the feature maps of the three branches with the concatenation function;
(c1-5) Pooling layer S1: a max pooling layer (Max Pooling) resamples the concatenated feature maps; the pooling kernel size is 2 × 2 × 1 with stride 2.
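A minimal PyTorch sketch of the SMSM structure just listed, assuming a tensor layout of (batch, channels, H, W, T) so that the patent's H × W × T kernel notation carries over directly; the patent publishes no code, so details such as padding are our assumptions:

```python
import torch
import torch.nn as nn

class SMSM(nn.Module):
    """Shallow multi-scale module: three dilated-convolution branches
    (dilation rates 1, 2, 3), concatenation, then a 2x2x1 max pool."""

    def __init__(self, in_channels: int, filters: int):
        super().__init__()

        def branch(dilation: int) -> nn.Sequential:
            return nn.Sequential(
                nn.Conv3d(in_channels, filters, kernel_size=1, stride=1),
                nn.ReLU(inplace=True),
                # padding=dilation keeps the feature-map size unchanged
                nn.Conv3d(filters, filters, kernel_size=3, stride=1,
                          padding=dilation, dilation=dilation),
                nn.ReLU(inplace=True),
                nn.Conv3d(filters, filters, kernel_size=3, stride=1,
                          padding=dilation, dilation=dilation),
                nn.ReLU(inplace=True),
            )

        self.branches = nn.ModuleList([branch(d) for d in (1, 2, 3)])
        self.pool = nn.MaxPool3d(kernel_size=(2, 2, 1), stride=2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # (c1-4) concatenate the three branches, (c1-5) max-pool the result
        return self.pool(torch.cat([b(x) for b in self.branches], dim=1))

smsm1 = SMSM(in_channels=32, filters=64)    # 32 channels arrive from C1/S1
out = smsm1(torch.randn(1, 32, 28, 28, 8))  # dummy (N, C, H, W, T) input
print(out.shape)                            # 3 branches x 64 = 192 channels
```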
(c2) Shallow multi-scale module SMSM-2: the module likewise contains three branches, a connection layer, and a pooling layer; the convolutional layers of each branch have 128 filters, and the rest is identical to SMSM-1;
(c3) Shallow multi-scale module SMSM-3: the module likewise contains three branches, a connection layer, and a pooling layer; the convolutional layers of each branch have 256 filters, and the rest is identical to SMSM-1;
(d) Convolutional layer C2: the layer has two parts. The first part connects the densely linked shallow multi-scale modules (concatenation), merging their feature maps with the concatenation function; the second part is a convolutional layer with kernel size 1 × 1 × 1, stride 1, and 256 filters, containing no nonlinear activation function;
Preferably, as shown in Fig. 1, the three shallow multi-scale modules SMSM of (c1), (c2), and (c3) are densely connected, i.e., (c1) SMSM-1 connects to (c2) SMSM-2 and to (d) convolutional layer C2, and SMSM-2 connects to (c3) SMSM-3 and to (d) convolutional layer C2 (see the wiring sketch below).
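A sketch of this dense wiring, reusing the SMSM class from the previous sketch; the patent does not specify how feature maps of different resolutions are aligned before concatenation, so downsampling the earlier outputs to the size of the SMSM-3 output is our assumption:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DenselyConnectedSMSMs(nn.Module):
    """SMSM-1 -> SMSM-2 -> SMSM-3 chain, with the extra SMSM-1 and SMSM-2
    links into convolutional layer C2 described above."""

    def __init__(self):
        super().__init__()
        self.smsm1 = SMSM(in_channels=32, filters=64)    # -> 192 channels
        self.smsm2 = SMSM(in_channels=192, filters=128)  # -> 384 channels
        self.smsm3 = SMSM(in_channels=384, filters=256)  # -> 768 channels
        # Second part of C2: 1x1x1 conv, stride 1, 256 filters, no activation
        self.c2 = nn.Conv3d(192 + 384 + 768, 256, kernel_size=1, stride=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        f1 = self.smsm1(x)
        f2 = self.smsm2(f1)     # SMSM-1 feeds SMSM-2
        f3 = self.smsm3(f2)     # SMSM-2 feeds SMSM-3
        size = f3.shape[2:]     # align resolutions (our assumption)
        merged = torch.cat([F.adaptive_max_pool3d(f1, size),
                            F.adaptive_max_pool3d(f2, size), f3], dim=1)
        return self.c2(merged)  # first part of C2: the concatenation
```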
(e) deep layer network module DNM.The network structure of the module are as follows:
(e-a) convolutional layer C1: convolution kernel size is step-length 1, filter blocks 512, and the convolution coefficient of expansion is 2, with Activation primitive of the ReLU function as this layer;
(e-b) MLP convolutional layer C2: convolution kernel size is step-length 1, filter blocks 512, and the convolution coefficient of expansion is 2, using ReLU function as the activation primitive of this layer;
(e-c) convolutional layer C3: convolution kernel size is step-length 1, filter blocks 512, and the convolution coefficient of expansion is 2, with Activation primitive of the ReLU function as this layer;
(e-d) it pond layer S1: is adopted again using the characteristic pattern that maximum pond layer (Max Pooling) obtains C3 layers The core size of sample, pond is step-length 2;
(e-e) convolutional layer C4: convolution kernel size is step-length 1, filter blocks 512, and the convolution coefficient of expansion is 1, with Activation primitive of the ReLU function as this layer;
(e-f) MLP convolutional layer C5: convolution kernel size is step-length 1, filter blocks 512, and the convolution coefficient of expansion is 1, using ReLU function as the activation primitive of this layer;
(e-g) convolutional layer C6: convolution kernel size is step-length 1, filter blocks 512, and the convolution coefficient of expansion is 1, with Activation primitive of the ReLU function as this layer;
(e-h) pond layer S2: using global average pond layer (Global Average Pooling) to the characteristic pattern of C6 Global average pondization operation is carried out, full articulamentum is replaced with the layer, to reduce parameter amount.
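A minimal PyTorch sketch of the DNM listing, under the same layout assumption as above; the padding values are our choice so that the 3 × 3 × 3 convolutions preserve feature-map size, and dilation is omitted on the 1 × 1 × 1 MLP convolutions because it has no effect on a unit kernel:

```python
import torch
import torch.nn as nn

def make_dnm(in_channels: int = 256) -> nn.Sequential:
    def conv(cin, cout, k, d):
        pad = d * (k - 1) // 2  # "same" padding for the dilated kernel
        return nn.Conv3d(cin, cout, kernel_size=k, stride=1,
                         padding=pad, dilation=d)

    return nn.Sequential(
        conv(in_channels, 512, 3, 2), nn.ReLU(inplace=True),  # (e-a) C1
        conv(512, 512, 1, 1), nn.ReLU(inplace=True),          # (e-b) C2, MLP
        conv(512, 512, 3, 2), nn.ReLU(inplace=True),          # (e-c) C3
        nn.MaxPool3d(kernel_size=2, stride=2),                # (e-d) S1
        conv(512, 512, 3, 1), nn.ReLU(inplace=True),          # (e-e) C4
        conv(512, 512, 1, 1), nn.ReLU(inplace=True),          # (e-f) C5, MLP
        conv(512, 512, 3, 1), nn.ReLU(inplace=True),          # (e-g) C6
        nn.AdaptiveAvgPool3d(1),                              # (e-h) S2, global avg
    )

dnm = make_dnm()
print(dnm(torch.randn(1, 256, 8, 8, 4)).shape)  # (1, 512, 1, 1, 1)
```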
Step 3: use the SDNet network constructed in Step 2 to build the lightweight deep-learning-based human action recognition model. The model is a two-stream structure, i.e., it contains a temporal stream and a spatial stream. As shown in Fig. 3, its specific structure is as follows:
(a) Input layer: the temporal-stream input is the optical flow data of the video sequence, with frame size 224 × 224; the spatial-stream input is the RGB data of the video sequence, likewise with frame size 224 × 224;
(b) SDNet: both the temporal-stream and spatial-stream networks consist of SDNet, which extracts the spatio-temporal features of the video sequence;
(c) Pooling layer S1: a temporal pyramid pooling layer (TPP) aggregates the frame-level features of the temporal and spatial streams into video-level representations. The temporal pyramid levels are set to {4 × 4 × 1, 2 × 2 × 1, 1 × 1 × 1}, i.e., a 3-level pyramid;
(d) Fully connected layer FC: the layer contains 1024 filters, i.e., 1024 neurons connected to S1, with ReLU as the activation function;
(e) Softmax layer: a Softmax classifier converts the feature values produced by the FC layer into relative probabilities of the classes and yields the class score. The Softmax function is defined as:
p_i = exp(V_i) / Σ_{j=1}^{C} exp(V_j)
where V_i is the output of the i-th output unit of the stage preceding the classifier Softmax, i.e., of the FC layer, i is the class index, and C is the total number of classes; p_i is the ratio of the exponential of the current element to the sum of the exponentials of all elements, and max(p_i) is the class score;
(f) Fusion layer: this layer fuses the class scores of the temporal and spatial streams with a decision-level fusion rule to obtain the action classification result; the fusion confidences of the two streams are set to 1 : 1 (a sketch of this pooling, classification, and fusion head follows).
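A minimal sketch of the recognition head described in (c)-(f), assuming PyTorch; reading the pyramid levels as adaptive-pooling target sizes is our interpretation, and the random weights and class count (101, as in UCF101) are illustrative only:

```python
import torch
import torch.nn.functional as F

def temporal_pyramid_pool(feats: torch.Tensor) -> torch.Tensor:
    """(c) Aggregate frame-level features (N, C, H, W, T) into a video-level vector."""
    levels = [(4, 4, 1), (2, 2, 1), (1, 1, 1)]
    pooled = [F.adaptive_max_pool3d(feats, size).flatten(1) for size in levels]
    return torch.cat(pooled, dim=1)  # (N, C * (16 + 4 + 1))

def stream_scores(feats, fc_w, fc_b, cls_w, cls_b):
    x = F.relu(F.linear(temporal_pyramid_pool(feats), fc_w, fc_b))  # (d) FC
    return F.softmax(F.linear(x, cls_w, cls_b), dim=1)              # (e) p_i

def fuse(p_spatial, p_temporal):
    """(f) Decision-level fusion with confidences 1:1, i.e. an equal-weight average."""
    return 0.5 * p_spatial + 0.5 * p_temporal

feats = torch.randn(2, 512, 4, 4, 3)  # dummy SDNet features for one stream
fc_w, fc_b = torch.randn(1024, 512 * 21), torch.randn(1024)
cls_w, cls_b = torch.randn(101, 1024), torch.randn(101)
p = fuse(stream_scores(feats, fc_w, fc_b, cls_w, cls_b),
         stream_scores(feats, fc_w, fc_b, cls_w, cls_b))
print(p.sum(dim=1))  # each row sums to 1, as a probability vector should
```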
Step 4: use the lightweight deep-learning-based human action recognition model constructed in Step 3 to process the RGB data and optical flow data of the video sequence and obtain the human action classification result.
The model of the invention was first pre-trained and fine-tuned on the ImageNet dataset, and the lightweight deep-learning-based human action recognition model method was then applied to the action recognition datasets UCF101 and HMDB51, finally achieving action recognition accuracies of 94.0% on UCF101 and 69.4% on HMDB51, with a model parameter count of only 19M. It follows that the lightweight deep-learning-based human action recognition model proposed herein not only recognizes human actions in video effectively but also greatly reduces the parameter count compared with recent human action recognition models, saving computation cost.
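For reference, a parameter count such as the 19M figure above is typically measured as below; the tiny stand-in module is illustrative, not a reproduction of the patented model:

```python
import torch.nn as nn

def count_parameters(model: nn.Module) -> int:
    # Sum the element counts of all trainable tensors in the model
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

toy = nn.Sequential(nn.Conv3d(3, 32, kernel_size=3), nn.ReLU(),
                    nn.Conv3d(32, 64, kernel_size=3))
print(f"{count_parameters(toy) / 1e6:.3f}M parameters")  # ~0.058M for this toy
```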
The specific embodiments described above further explain the objectives, technical solutions, and beneficial effects of the present invention in detail. It should be understood that the foregoing is merely a specific embodiment of the invention and is not intended to limit its scope of protection; any modification, equivalent substitution, or improvement made within the spirit and principles of the invention shall fall within its scope of protection.

Claims (9)

1. A lightweight human action recognition method based on deep learning, characterized by comprising the following steps:
Step 1: processing the video data containing human actions to obtain an RGB frame sequence and an optical flow frame sequence;
Step 2: constructing a lightweight deep learning network combining shallow and deep networks (A lightweight deep learning network model combining shallow and deep networks, SDNet), the network containing shallow multi-scale modules and a deep network module;
Step 3: using the SDNet network constructed in Step 2 to build the lightweight deep-learning-based human action recognition model, the model being a two-stream structure, i.e., containing a temporal stream and a spatial stream;
Step 4: using the model constructed in Step 3 to process the RGB data and optical flow data of a video sequence and obtain the human action classification result.
2. The method as claimed in claim 1, characterized in that in Step 2 a lightweight deep learning network combining shallow and deep networks (A lightweight deep learning network model combining shallow and deep networks, SDNet) is constructed, the network containing shallow multi-scale modules (Shallow multi-scale module, SMSM) and a deep network module (Deep networks module, DNM):
the shallow multi-scale modules obtain local human action features from the initial RGB frame sequence and optical flow frame sequence;
the deep network module fuses the local human action features extracted by the shallow multi-scale modules and generates high-level features.
3. The method as claimed in claim 1, characterized in that in Step 2 the network structure of the lightweight deep learning network (SDNet) combining shallow and deep networks is designed as follows:
(a) convolutional layer C1: kernel size 3 × 3 × 3, stride 1, 32 filters, with the ReLU function as the activation of this layer, the ReLU function being:
ReLU(x) = max(0, x)
(b) pooling layer S1: a max pooling layer (Max Pooling) resamples the feature maps produced by C1; the pooling kernel size is 2 × 2 × 2 with stride 2;
(c1) shallow multi-scale module SMSM-1: the module contains three branches, each consisting of three convolutional layers with different kernels; every convolutional layer in the module has 64 filters;
(c2) shallow multi-scale module SMSM-2: the module contains three branches, each consisting of three convolutional layers with different kernels; every convolutional layer in the module has 128 filters;
(c3) shallow multi-scale module SMSM-3: the module contains three branches, each consisting of three convolutional layers with different kernels; every convolutional layer in the module has 256 filters;
(d) convolutional layer C2: the layer has two parts: the first part connects the densely linked shallow multi-scale modules (concatenation), merging their feature maps with the concatenation function; the second part is a convolutional layer with kernel size 1 × 1 × 1, stride 1, and 256 filters, containing no nonlinear activation function;
(e) deep network module DNM.
4. The method as claimed in claim 3, characterized in that the network structure of the shallow multi-scale module SMSM-1 of layer (c1) is designed as:
1) Branch 1:
(a) convolutional layer C1: kernel size 1 × 1 × 1, stride 1, 64 filters, with the ReLU function as the activation of this layer;
(b) convolutional layer C2: kernel size 3 × 3 × 3, stride 1, 64 filters, dilation rate 1, with the ReLU function as the activation of this layer, the effective kernel size after dilation being computed as:
dilated kernel size = dilation rate × (kernel size before dilation − 1) + 1
(c) convolutional layer C3: kernel size 3 × 3 × 3, stride 1, 64 filters, dilation rate 1, with the ReLU function as the activation of this layer;
2) Branch 2:
(a) convolutional layer C1: kernel size 1 × 1 × 1, stride 1, 64 filters, ReLU activation;
(b) convolutional layer C2: kernel size 3 × 3 × 3, stride 1, 64 filters, dilation rate 2, ReLU activation;
(c) convolutional layer C3: kernel size 3 × 3 × 3, stride 1, 64 filters, dilation rate 2, ReLU activation;
3) Branch 3:
(a) convolutional layer C1: kernel size 1 × 1 × 1, stride 1, 64 filters, ReLU activation;
(b) convolutional layer C2: kernel size 3 × 3 × 3, stride 1, 64 filters, dilation rate 3, ReLU activation;
(c) convolutional layer C3: kernel size 3 × 3 × 3, stride 1, 64 filters, dilation rate 3, ReLU activation;
4) connection layer (concatenation): this layer joins the feature maps of the three branches with the concatenation function;
5) pooling layer S1: a max pooling layer (Max Pooling) resamples the concatenated feature maps; the pooling kernel size is 2 × 2 × 1 with stride 2.
5. The method as claimed in claim 3, characterized in that the network structure of the shallow multi-scale module SMSM-2 of layer (c2) is identical to that of claim 4 except that each convolutional layer has 128 filters.
6. The method as claimed in claim 3, characterized in that the network structure of the shallow multi-scale module SMSM-3 of layer (c3) is identical to that of claim 4 except that each convolutional layer has 256 filters.
7. The method as claimed in claim 3, characterized in that the three shallow multi-scale modules SMSM of layers (c1), (c2), and (c3) are densely connected, i.e., (c1) SMSM-1 connects to (c2) SMSM-2 and to (d) convolutional layer C2, and SMSM-2 connects to (c3) SMSM-3 and to (d) convolutional layer C2.
8. The method as claimed in claim 3, characterized in that the network structure of the deep network module DNM of layer (e) is designed as:
(a) convolutional layer C1: kernel size 3 × 3 × 3, stride 1, 512 filters, dilation rate 2, with the ReLU function as the activation of this layer;
(b) MLP convolutional layer C2: kernel size 1 × 1 × 1, stride 1, 512 filters, dilation rate 2, with the ReLU function as the activation of this layer;
(c) convolutional layer C3: kernel size 3 × 3 × 3, stride 1, 512 filters, dilation rate 2, with the ReLU function as the activation of this layer;
(d) pooling layer S1: a max pooling layer (Max Pooling) resamples the feature maps produced by C3; the pooling kernel size is 2 × 2 × 2 with stride 2;
(e) convolutional layer C4: kernel size 3 × 3 × 3, stride 1, 512 filters, dilation rate 1, with the ReLU function as the activation of this layer;
(f) MLP convolutional layer C5: kernel size 1 × 1 × 1, stride 1, 512 filters, dilation rate 1, with the ReLU function as the activation of this layer;
(g) convolutional layer C6: kernel size 3 × 3 × 3, stride 1, 512 filters, dilation rate 1, with the ReLU function as the activation of this layer;
(h) pooling layer S2: a global average pooling layer (Global Average Pooling) is applied to the feature maps of C6; this layer replaces a fully connected layer and thereby reduces the parameter count.
9. The method as claimed in claim 1, characterized in that in Step 3 the lightweight deep-learning-based human action recognition model is constructed as a two-stream structure containing a temporal stream and a spatial stream, with the following specific structure:
(a) input layer: the temporal-stream input is the optical flow data of the video sequence, with frame size 224 × 224; the spatial-stream input is the RGB data of the video sequence, likewise with frame size 224 × 224;
(b) SDNet: both the temporal-stream and spatial-stream networks consist of SDNet, which extracts the spatio-temporal features of the video sequence;
(c) pooling layer S1: a temporal pyramid pooling layer (TPP) aggregates the frame-level features of the temporal and spatial streams into video-level representations; the temporal pyramid levels are set to {4 × 4 × 1, 2 × 2 × 1, 1 × 1 × 1}, i.e., a 3-level pyramid;
(d) fully connected layer FC: the layer contains 1024 filters, i.e., 1024 neurons connected to S1, with ReLU as the activation function;
(e) softmax layer: a Softmax classifier converts the feature values produced by the FC layer into relative probabilities of the classes and yields the class score, the Softmax function being defined as:
p_i = exp(V_i) / Σ_{j=1}^{C} exp(V_j)
where V_i is the output of the i-th output unit of the stage preceding the classifier Softmax, i.e., of the FC layer, i is the class index, C is the total number of classes, p_i is the ratio of the exponential of the current element to the sum of the exponentials of all elements, and max(p_i) is the class score;
(f) fusion layer: this layer fuses the class scores of the temporal and spatial streams with a decision-level fusion rule to obtain the action classification result; the fusion confidences of the temporal and spatial streams are set to 1 : 1.
CN201910269644.3A 2019-04-04 2019-04-04 Lightweight human action recognition method based on deep learning Pending CN109977904A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910269644.3A CN109977904A (en) 2019-04-04 2019-04-04 Lightweight human action recognition method based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910269644.3A CN109977904A (en) 2019-04-04 2019-04-04 Lightweight human action recognition method based on deep learning

Publications (1)

Publication Number Publication Date
CN109977904A true CN109977904A (en) 2019-07-05

Family

ID=67082966

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910269644.3A Pending Lightweight human action recognition method based on deep learning 2019-04-04 2019-04-04

Country Status (1)

Country Link
CN (1) CN109977904A (en)



Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106845351A (en) * 2016-05-13 2017-06-13 苏州大学 It is a kind of for Activity recognition method of the video based on two-way length mnemon in short-term
CN106845329A (en) * 2016-11-11 2017-06-13 广东顺德中山大学卡内基梅隆大学国际联合研究院 A kind of action identification method based on depth convolution feature multichannel pyramid pond
CN107240066A (en) * 2017-04-28 2017-10-10 天津大学 Image super-resolution rebuilding algorithm based on shallow-layer and deep layer convolutional neural networks
CN107862376A (en) * 2017-10-30 2018-03-30 中山大学 A kind of human body image action identification method based on double-current neutral net
CN108875674A (en) * 2018-06-29 2018-11-23 东南大学 A kind of driving behavior recognition methods based on multiple row fusion convolutional neural networks

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
FISHER YU et al.: "MULTI-SCALE CONTEXT AGGREGATION BY DILATED CONVOLUTIONS", 《ICLR 2016》 *
YIFAN WANG et al.: "End-to-End Image Super-Resolution via Deep and Shallow Convolutional Networks", 《DIGITAL OBJECT IDENTIFIER》 *
杨天明 et al.: "基于视频深度学习的时空双流人物动作识别模型" [Spatio-temporal two-stream human action recognition model based on video deep learning], 《计算机应用》 (Journal of Computer Applications) *

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110458038A (en) * 2019-07-19 2019-11-15 天津理工大学 The cross-domain action identification method of small data based on double-strand depth binary-flow network
CN112308885A (en) * 2019-07-29 2021-02-02 顺丰科技有限公司 Violent throwing detection method, device, equipment and storage medium based on optical flow
CN111368666A (en) * 2020-02-25 2020-07-03 上海蠡图信息科技有限公司 Living body detection method based on novel pooling and attention mechanism double-current network
CN111368666B (en) * 2020-02-25 2023-08-18 上海蠡图信息科技有限公司 Living body detection method based on novel pooling and attention mechanism double-flow network
CN111666852A (en) * 2020-05-28 2020-09-15 天津大学 Micro-expression double-flow network identification method based on convolutional neural network
CN113836969A (en) * 2020-06-23 2021-12-24 山西农业大学 Abnormal event detection method based on double flows
CN111738357A (en) * 2020-07-24 2020-10-02 完美世界(北京)软件科技发展有限公司 Junk picture identification method, device and equipment
CN111738357B (en) * 2020-07-24 2020-11-20 完美世界(北京)软件科技发展有限公司 Junk picture identification method, device and equipment
CN112244863A (en) * 2020-10-23 2021-01-22 京东方科技集团股份有限公司 Signal identification method, signal identification device, electronic device and readable storage medium
CN112686329A (en) * 2021-01-06 2021-04-20 西安邮电大学 Electronic laryngoscope image classification method based on dual-core convolution feature extraction
CN112749684A (en) * 2021-01-27 2021-05-04 萱闱(北京)生物科技有限公司 Cardiopulmonary resuscitation training and evaluating method, device, equipment and storage medium
CN114037930A (en) * 2021-10-18 2022-02-11 苏州大学 Video action recognition method based on space-time enhanced network

Similar Documents

Publication Publication Date Title
CN109977904A (en) Lightweight human action recognition method based on deep learning
Yang et al. Visual perception enabled industry intelligence: state of the art, challenges and prospects
Yin et al. Recurrent convolutional network for video-based smoke detection
CN107862300A (en) A kind of descending humanized recognition methods of monitoring scene based on convolutional neural networks
CN102332095B (en) Face motion tracking method, face motion tracking system and method for enhancing reality
CN110427813A (en) Pedestrian's recognition methods again based on the twin production confrontation network that posture instructs pedestrian image to generate
Ming et al. Simple triplet loss based on intra/inter-class metric learning for face verification
CN110210539A (en) The RGB-T saliency object detection method of multistage depth characteristic fusion
CN110232361B (en) Human behavior intention identification method and system based on three-dimensional residual dense network
CN116343330A (en) Abnormal behavior identification method for infrared-visible light image fusion
CN110472634A (en) Change detecting method based on multiple dimensioned depth characteristic difference converged network
CN110022422A (en) A kind of sequence of frames of video generation method based on intensive connection network
Singh et al. A deep learning based technique for anomaly detection in surveillance videos
CN107194380A (en) The depth convolutional network and learning method of a kind of complex scene human face identification
Le et al. Cross-resolution feature fusion for fast hand detection in intelligent homecare systems
CN109583334A (en) A kind of action identification method and its system based on space time correlation neural network
CN115063836A (en) Pedestrian tracking and re-identification method based on deep learning
Sun et al. YOLO-P: An efficient method for pear fast detection in complex orchard picking environment
CN113705384B (en) Facial expression recognition method considering local space-time characteristics and global timing clues
CN114333002A (en) Micro-expression recognition method based on deep learning of image and three-dimensional reconstruction of human face
CN110223240A (en) Image defogging method, system and storage medium based on color decaying priori
CN109583406B (en) Facial expression recognition method based on feature attention mechanism
CN116403286A (en) Social grouping method for large-scene video
CN116342877A (en) Semantic segmentation method based on improved ASPP and fusion module in complex scene
CN105224952A (en) Based on the double interbehavior recognition methods of largest interval markov pessimistic concurrency control

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20190705