CN110222556A - Human action recognition system and method - Google Patents

Human action recognition system and method

Info

Publication number
CN110222556A
CN110222556A (application CN201910324097.4A)
Authority
CN
China
Prior art keywords
network
residual
image
video image
video
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910324097.4A
Other languages
Chinese (zh)
Inventor
叶青
钟浩鑫
张永梅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
North China University of Technology
Original Assignee
North China University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by North China University of Technology filed Critical North China University of Technology
Priority to CN201910324097.4A priority Critical patent/CN110222556A/en
Publication of CN110222556A publication Critical patent/CN110222556A/en
Pending legal-status Critical Current

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

The present invention provides a human action recognition system and method. The system comprises a video image acquisition module, an image standardization module, a deep residual bidirectional multi-memory neural network module, and a classifier module. The video image acquisition module samples the raw video images to obtain sampled video images, which are input to the image standardization module for normalization, yielding standardized video images. The standardized video images are input to the deep residual bidirectional multi-memory neural network module for processing; this module comprises a pruned network unit and a deep residual bidirectional network unit. The classifier module performs classification based on the output of the deep residual bidirectional multi-memory neural network module. The technical solution of the present invention effectively addresses the vanishing-gradient problem, enhances the coupling ability of features, markedly accelerates training, and improves both recognition accuracy and computation speed.

Description

Human action recognition system and method
Technical field
The present invention relates to the field of computer vision, and in particular to a neural-network-based visual recognition system and method for human actions.
Background art
With the continuous development of the information society, computer vision has attracted growing attention from many fields. Within computer vision, recognizing human actions in video is a major research direction: identifying the human actions in a video can bring great convenience to fields such as intelligent video surveillance, virtual reality, video security, and smart homes. Computer vision research is mainly concerned with having computers perform tasks such as recognition, detection, and processing of multimedia information in place of human labor, and human action recognition in video is an important direction within this field, playing a significant role in smart homes, security, virtual reality, and related areas. Security is taken here as an example. Its scope is very wide, ranging from accident early warning within a city to border-protection early warning. Applying video-based human action recognition to this field makes it possible to analyze the specifics of an incident and to assess and give early warning of suspicious behavior at a border, thereby greatly improving the efficiency of security work and reducing the consumption of manpower and material resources. Moreover, a computer needs no rest, does not tire with long working hours, and can monitor comprehensively without blind spots to ensure safety. The technology thus brings great convenience to people's lives and has broad application prospects.
Traditional human action recognition methods first perform moving-object detection, then feature extraction, and finally classify the features to obtain a recognition result. Common feature types in conventional action recognition include static features, dynamic features, spatiotemporal features, and descriptive features. Common action recognition methods fall into three classes: template-based methods, probabilistic-statistical methods, and grammar-based methods. To further improve the recognition rate, and with the rise of deep learning, methods that automatically extract features through deep network learning have been applied to human action recognition. However, in existing deep-learning-based human action recognition methods, traditional network architectures such as AlexNet and VGG obtain better training results by increasing the depth (number of layers) of the network, and the added layers bring many negative effects, such as overfitting, vanishing gradients, and exploding gradients.
Summary of the invention
In view of the deficiencies of the prior art, this patent proposes a human action recognition method based on a deep residual bidirectional multi-memory neural network. The network forms its front-end module by removing unneeded network units; the resulting feature matrix is then fed as input into a deep bidirectional network modified with a residual structure for training, and finally a classifier produces the human action recognition result. The network uses an "end-to-end" training technique and jointly optimizes the parameters of the neural network and the deep residual network, which ensures the expressive power of the new neural network features on new datasets and improves the generalization performance of the whole system. Specifically, the present invention provides the following technical solutions:
In one aspect, the present invention provides a human action recognition system, the system comprising: a video image acquisition module, an image standardization module, a deep residual bidirectional multi-memory neural network module, and a classifier module;
the video image acquisition module samples the raw video images to obtain sampled video images, which are input to the image standardization module for normalization, yielding standardized video images;
the standardized video images are input to the deep residual bidirectional multi-memory neural network module for processing;
the deep residual bidirectional multi-memory neural network module comprises a pruned network unit and a deep residual bidirectional network unit;
the pruned network unit processes a convolutional neural network by deletion to obtain the pruned neural network, the pruning following the principle that:
if |w| < α, the corresponding weight is removed, where w is a recurrent weight and α is a sensitivity parameter;
the deep residual bidirectional network unit is implemented by combining a deep residual network with a bidirectional LSTM network, reduces the interdependence between parameters by means of shortcut connections, and maps its input to a superposition layer where it is added to the output of the convolutional layer;
the pruned neural network and the bidirectional LSTM network are trained jointly;
the classifier module performs classification based on the output of the deep residual bidirectional multi-memory neural network module.
Preferably, the video acquisition module samples the raw video images in the following manner:
frames are extracted from a video segment at a fixed interval; the interval, in frames, is T = M/N, rounded down to an integer,
where M is the number of frames in the original video and N is the number of frames in the sampled video.
Preferably, the standardization is realized by processing the contrast of each frame, the contrast of the processed image being C = sqrt( (1/(3rc)) · Σ_{i=1..r} Σ_{j=1..c} Σ_{k=1..3} (X_{i,j,k} − X̄)² ),
where X̄ is the average gray level of the entire image, satisfying X̄ = (1/(3rc)) · Σ_{i=1..r} Σ_{j=1..c} Σ_{k=1..3} X_{i,j,k}, and the image size is r × c.
Preferably, with the parameter sets of the convolutional neural network and the LSTM network denoted (θ₁, θ₂), the system loss function is defined in terms of the following quantities:
X is the abstract feature or hierarchical representation of the input signal, W is the convolution kernel, b is the bias, M is the number of stages of the whole network, N is the number of input features, L is the loss function, t is the network layer index, k is the feature-map channel, TR denotes training, TE denotes testing, and R(W) and R(θ) are regularization terms.
Preferably, the classifier module is: p(y = j | x; θ) = exp(θ_j · x) / Σ_{l=1..k} exp(θ_l · x), j = 1, 2, ..., k,
where {(x⁽¹⁾, y⁽¹⁾), ..., (x⁽ᵐ⁾, y⁽ᵐ⁾)} is the training set, y⁽ⁱ⁾ ∈ {1, 2, 3, ..., k} with k classes in total, p(y = j | x) is the probability that input x belongs to each class j = 1, 2, ..., k, and θ₁, θ₂, ..., θ_k are the parameters of the model.
In addition, the present invention provides a human action recognition method, the method comprising:
S1: sampling the input raw video images to obtain sampled video images, and standardizing the sampled video images to obtain standardized video images;
S2: extracting image features through a pruned convolutional neural network combined with a deep residual bidirectional LSTM network;
S3: classifying based on the image features to obtain the recognition result.
Preferably, in S1, the sampling is performed in the following manner:
frames are extracted from a video segment at a fixed interval; the interval, in frames, is T = M/N, rounded down to an integer,
where M is the number of frames in the original video and N is the number of frames in the sampled video.
Preferably, in S2, the pruned neural network is pruned according to the following principle:
if |w| < α, the corresponding weight is removed, where w is a recurrent weight and α is a sensitivity parameter.
Preferably, in S2, with the parameter sets of the pruned convolutional neural network combined with the deep residual bidirectional LSTM network denoted (θ₁, θ₂), the system loss function is defined in terms of the following quantities:
TR denotes training, TE denotes testing, R(W) and R(θ) are regularization terms, X is the abstract feature or hierarchical representation of the input signal, W is the convolution kernel, b is the bias, M is the number of stages of the whole network, N is the number of input features, L is the loss function, t is the network layer index, and k is the feature-map channel.
Preferably, during network training, the convolutional neural network and the deep residual bidirectional LSTM network are trained jointly.
Compared with the prior art, the present invention has the following advantages:
1. The present invention incorporates the residual-module idea and effectively solves the vanishing-gradient problem caused by an overly deep network structure. Because the residual network enlarges the range of features to choose from, it enhances the coupling ability of features; this structure greatly accelerates training while also improving performance, and it resolves the poor recognition results that arise when an overly deep network causes gradients to vanish.
2. Compared with traditional human action recognition methods, the method of the present invention improves recognition accuracy, and the pruning technique improves computation speed.
Brief description of the drawings
Fig. 1 is the human action recognition flowchart of the embodiment of the present invention;
Fig. 2 is the adjusted GoogLeNet network structure of the embodiment of the present invention;
Fig. 3 is the histogram of 102,400 connection weights in a Conv layer of the network model of the embodiment of the present invention;
Fig. 4 is the single-layer structure diagram of the bidirectional multi-memory neural network of the embodiment of the present invention;
Fig. 5 is the residual network structure of the embodiment of the present invention.
Specific embodiments
The technical solutions in the embodiments of the present invention are described below clearly and completely with reference to the figures in the embodiments. Obviously, the described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort shall fall within the protection scope of the present invention.
Embodiment 1
In a specific embodiment, the technical solution provided by the present invention can be realized as a human action recognition system. Preferably, as shown in Fig. 1, the system includes a video image sampling module, an image standardization module, a deep residual bidirectional multi-memory neural network module, and a classifier module; through this system structure, the input human action video is recognized and the action result is output. Specifically, this can be accomplished as follows:
One, video image sampling module
For a convolutional neural network, video images are fed into the network to extract features, so the video sampling method directly affects the generalization ability of the trained data model. For a complete action video, if every frame participates in image-feature extraction, consecutive frames may carry a large amount of redundant information, and the many near-identical images also add to the network's burden. In addition, since different action videos do not have the same number of frames, reasonable down-sampling of the video is necessary.
To retain more of the effective information in the human action while removing unnecessary redundancy, video sampling in this embodiment is carried out as follows:
For the entire video, frames are extracted at a fixed interval, and the interval must take the total frame counts of different videos into account. Suppose an action video has M frames and down-sampling is to produce a short action video of N frames (N < M); a frame can then be taken every T frames to form the final video. The choice of T depends on the sizes of M and N; under normal conditions, T = M/N, rounded down to an integer.
By the above method, the input videos are sampled to obtain the sampled videos.
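As an illustration of the sampling rule above, the following minimal Python sketch (not part of the original disclosure) keeps one frame every T = M/N frames; the use of OpenCV and the function name are assumptions:

```python
import cv2  # assumed dependency; any frame reader would do

def downsample_video(path, n_frames):
    """Keep one frame every T = M // N frames, as described above."""
    cap = cv2.VideoCapture(path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))  # M, frames in the original video
    interval = max(total // n_frames, 1)            # T = M / N, rounded down
    frames = []
    idx = 0
    while len(frames) < n_frames:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % interval == 0:
            frames.append(frame)
        idx += 1
    cap.release()
    return frames
```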
Two, image standardization module
The images of different videos vary in size, resolution, and illumination, and these variations adversely affect video image classification. Therefore, before feature extraction, the video image data needs preprocessing. In traditional feature-extraction pipelines, preprocessing includes image denoising, contrast enhancement, and so on; preprocessing is just as essential when a neural network performs image recognition. Images should be standardized so that their pixels all lie in the same reasonable range, such as [0, 1] or [−1, 1]. Mixing images in [0, 1] with images in [0, 255] will usually lead to failure. Converting images to the same scale is, strictly speaking, the only universally necessary preprocessing step. Many computer vision frameworks require images of a standard size, so images must be cropped or scaled to fit; strictly speaking, however, even this rescaling is not always necessary. Some convolutional models accept variable-sized inputs and dynamically adjust their pooling region sizes to keep the output size constant; other convolutional models have variable-sized outputs that scale automatically with the input, such as models that denoise or label each pixel in an image.
Image preprocessing in this embodiment is realized as follows. In deep learning, contrast usually refers to the standard deviation of the pixels in an image or image region. Suppose we have an image represented as a tensor X ∈ R^(r×c×3), where X_{i,j,1} is the red intensity at row i, column j, X_{i,j,2} is the green intensity, and X_{i,j,3} is the blue intensity. The contrast of the whole image can then be expressed as C = sqrt( (1/(3rc)) · Σ_{i=1..r} Σ_{j=1..c} Σ_{k=1..3} (X_{i,j,k} − X̄)² ),
where X̄ is the average intensity of the entire image, the image size is r × c, and X̄ satisfies X̄ = (1/(3rc)) · Σ_{i=1..r} Σ_{j=1..c} Σ_{k=1..3} X_{i,j,k}.
The standardization proposed in this embodiment subtracts the mean from each image and then rescales it so that the standard deviation of its pixels equals some constant s, thereby fixing the image's contrast.
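A minimal NumPy sketch of this standardization, under the definition of contrast given above, could look as follows; the function name and the small epsilon guard against all-constant images are assumptions:

```python
import numpy as np

def global_contrast_normalize(image, s=1.0, eps=1e-8):
    """Subtract the image mean, then rescale so that the pixel standard
    deviation (the contrast defined above) equals the constant s."""
    x = image.astype(np.float64)
    x -= x.mean()                          # zero-mean image
    contrast = np.sqrt((x ** 2).mean())    # std over all r*c*3 values
    return s * x / max(contrast, eps)
```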
Three, the deep residual bidirectional multi-memory neural network module
1. Pruned-network feature extraction module
1.1 In this embodiment, this module can be implemented with an adjusted GoogLeNet network.
This embodiment takes full advantage of the fact that a GoogLeNet network converges faster than a traditional convolutional network, reducing training and detection time without affecting the final recognition rate. From the network structure, the size of the original input image is first fixed as a 224x224x3 RGB image. The first convolutional layer uses a fixed structure with pad = 3, a 7x7 kernel, and stride 2, producing 64-dimensional features after convolution, so this layer outputs 112x112x64. The convolution output feeds directly into an activation layer, again using the common ReLU function, followed by a pooling layer with a 3x3 kernel using max pooling; the pooled feature map is 56x56x64. A norm layer is then added. The second convolution uses a 3x3 kernel with stride 1, giving a 56x56x192 feature map, again followed by an activation layer and a pooling layer, outputting 28x28x192. Then 1x1, 3x3, and 5x5 kernels are used in 4 branches, as shown in Fig. 2.
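To make the shape bookkeeping concrete, here is a sketch of the stem just described; it reproduces the stated kernel sizes, strides, and feature-map sizes, while the choice of PyTorch and the local-response-norm setting are assumptions:

```python
import torch
import torch.nn as nn

# Sketch of the adjusted GoogLeNet stem described above (shapes in comments).
stem = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3),    # 224x224x3 -> 112x112x64
    nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=3, stride=2, padding=1),        # -> 56x56x64
    nn.LocalResponseNorm(size=5),                            # the "norm" layer
    nn.Conv2d(64, 192, kernel_size=3, stride=1, padding=1),  # -> 56x56x192
    nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=3, stride=2, padding=1),        # -> 28x28x192
)

x = torch.randn(1, 3, 224, 224)
print(stem(x).shape)  # torch.Size([1, 192, 28, 28])
```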
The mathematical model of the neural network is constructed as follows.
(1) The input-output relation is: Y = f(W ∗ X + b),
where X is the abstract feature or hierarchical representation of the input signal, W is the convolution kernel, b is the bias, and f is the activation function.
(2) The classifier parameters are as follows:
The final output class is: y = argmax{Y(k)} (6)
(3) On the training set, the objective function is constructed using cross-entropy,
where TR denotes training, TE denotes testing, and R(W) and R(θ) are regularization terms that sparsify the parameters and prevent overfitting.
(4) Let the number of stages of the whole network be M (in one preferred embodiment, M = 3), and let the parameter sets of the GoogLeNet network and the LSTM network be (θ₁, θ₂); the system loss function is then defined in terms of the following quantities:
X is the abstract feature or hierarchical representation of the input signal, W is the convolution kernel, b is the bias, M is the number of stages of the whole network, N is the number of input features, L is the loss function, t is the network layer index, and k is the feature-map channel.
On this basis, pruning is used to improve the network's operating efficiency.
1.2 Pruning method
In the Optimal Brain Damage (OBD) algorithm, the saliency of a parameter is defined as the change in the objective function caused by deleting that parameter. Directly evaluating each parameter's saliency from this definition, i.e., temporarily deleting each parameter and re-evaluating the objective function, is avoided by approximating the objective function E with a second-order expansion, δE ≈ (1/2) Σ_i h_ii · u_i²,
where h_ii is the diagonal second derivative and u_i is the parameter under consideration. The saliency s_i = h_ii · u_i² / 2 of each parameter is then computed, the parameters are ranked by saliency, and some low-saliency parameters are deleted.
This process needs an efficient way of computing h_ii, which takes substantial time and computing resources. The approximation required by Optimal Brain Damage keeps mainly the diagonal entries of the Hessian matrix, yet many off-diagonal terms are comparable to the diagonal ones. The Optimal Brain Surgeon (OBS) algorithm therefore improves on OBD: it uses the second-derivative information of the error function to analytically predict how perturbing a weight affects the function, and weakens or removes certain connection weights in a top-down manner to optimize the structure. It turns out, however, that these assumptions do not hold well and are not sufficient to achieve an optimal pruning effect.
The quantity of hidden unit largely determines the ability that network accurately estimates posterior probability.However, neural Connection number in network increases sharply as the quantity of hiding level and unit increases, and the quantity is by actually available The constraint of computing resource.Not only number of parameters is network performance key factor, and understand they be how to come into operation, and Obtain network in it is available can training parameter and connect it is even more important.By taking GoogLeNet model as an example, Fig. 3 is in its convolutional layer Connection weight histogram, wherein x-axis show connection weight, y-axis is the connection number in each section.Filament is normal state point Cloth, average value are -0.01, standard deviation 0.056.If the hypothesis that the weight of small magnitude can be deleted be correctly, As can be seen that many connections may be trimmed to about in Fig. 3.This will greatly reduce the calculation amount of trained network development process, and can be with Improve the generalization ability of network.In the present embodiment, it can be cut off by two state modulator beta pruning processes when w meets (14).
Here α and β are two different sensitivity parameters, E is the objective function, and w_i is the updated weight. The sensitivity value of α is below 0.10; β depends on the network and the training data but can be tuned to obtain the required pruning accuracy. Many experiments show that the second constraint is much weaker than the first: by considering only the absolute value of the weight, pruning the same number of connections yields almost identical network performance. Therefore only the first criterion in (14) is evaluated, which is equivalent to assuming β = ∞. In other words, if |w| < α the weight is simply removed, and α is adjusted continually to obtain the optimal pruning rate.
Gradient descent produces each weight update as the sum of many small steps. If one assumes that, for small weights, the parameters after a single update are independent random variables with the same distribution, then the resulting weights are normally distributed. Judging from the weight distribution in Fig. 3, the actual distribution is well approximated by a normal distribution with mean close to zero, but the fit is poor for weights of magnitude around 0.1. It follows that at least not all weights around 0.1 are artifacts of the training process, which is a valuable reference for choosing the parameter α.
After pruning, the network is retrained to find the new, more relevant optimal network weights, and this retraining converges faster than the original training.
1.3 Pruned neural network implementation
Network pruning means removing the unneeded parts while training a larger neural network, i.e., deleting a unit from the network together with all of its incoming and outgoing connections. Larger network architectures are more sensitive to the initial training conditions, and pruning reduces network complexity, which helps generalization. In this embodiment, the pruned neural network is realized as follows. First, weights are updated through normal network training; unlike conventional training, however, the focus is on which relationships between units are important before and after training. Second, low-weight connections are trimmed: all connections whose weights fall below the threshold are deleted, converting the dense network into a sparse one. Finally, the network is retrained to obtain the final weights of the remaining sparse connections. This step is crucial: if the pruned network is used directly without retraining, accuracy suffers.
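These three steps can be sketched as simple magnitude pruning with mask-preserving retraining; this is an illustrative reading of the procedure, not the patent's reference implementation, and the function names are assumptions:

```python
import torch

def magnitude_prune(model, alpha):
    """Step two: zero every weight with |w| < alpha and record boolean
    masks, converting the dense network into a sparse one."""
    masks = {}
    with torch.no_grad():
        for name, p in model.named_parameters():
            if p.dim() > 1:                        # prune weights, keep biases
                mask = (p.abs() >= alpha)
                p.mul_(mask.to(p.dtype))
                masks[name] = mask
    return masks

def reapply_masks(model, masks):
    """Step three: call after each optimizer step during retraining so
    pruned connections stay at zero."""
    with torch.no_grad():
        for name, p in model.named_parameters():
            if name in masks:
                p.mul_(masks[name].to(p.dtype))
```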
2. Deep residual bidirectional network module
In this embodiment, on the basis of the standardized model, the parameters of the pruned network are not merely fine-tuned; the bidirectional LSTM network is also improved with a deep residual structure. A traditional convolutional network has limited feature-representation ability and weak model generalization, leading to many action recognition errors, whereas a judiciously used residual structure provides the whole network with a degree of perturbation and regularization, prevents the higher layers from falling into an overfitted state, and improves the generalization ability of the whole network.
2.1 Bidirectional LSTM network
This embodiment improves on the bidirectional LSTM network; the basic structure of the LSTM network is shown in Fig. 4. The Forward layer and the Backward layer are jointly connected to the output layer, involving six shared weights w1-w6.
The Forward layer is computed forward from time 1 to time t, and the forward hidden-layer output at each time step is obtained and stored. The Backward layer is computed backward from time t to time 1, and the backward hidden-layer output at each time step is obtained and stored. Finally, the outputs of the Forward and Backward layers at each time step are combined to give the final output. Expressed mathematically, with h_t the forward hidden-state information and h_t' the backward hidden-state information:
h_t = f(w1 · x_t + w2 · h_{t−1}),
h_t' = f(w3 · x_t + w5 · h_{t+1}'),
o_t = g(w4 · h_t + w6 · h_t').
This is equivalent to duplicating the first recurrent layer in the network so that two layers now sit side by side: the input sequence is supplied as-is to the first layer, and a reversed copy of the input sequence is supplied to the second. For problems in which all time steps of the input sequence are available, the bidirectional LSTM trains two LSTMs rather than one on the input sequence. This provides the network with more context and leads to faster and more thorough learning of the problem.
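For concreteness, such a bidirectional layer can be sketched in PyTorch as follows; the layer sizes are illustrative assumptions:

```python
import torch
import torch.nn as nn

# One bidirectional LSTM layer: the output at each time step concatenates
# the forward pass over steps 1..t and the backward pass over steps t..1.
bilstm = nn.LSTM(input_size=128, hidden_size=64,
                 batch_first=True, bidirectional=True)

x = torch.randn(8, 30, 128)            # (batch, time steps, features)
out, _ = bilstm(x)                     # (8, 30, 128) = [h_t ; h_t'] per step
forward_h, backward_h = out[..., :64], out[..., 64:]
```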
2.2 Deep residual structure
In theory, the capacity and feature-discrimination ability of a model can keep improving as the network deepens. In practice, however, simply increasing the depth of the network causes gradient dispersion: an overly deep network structure easily leads to vanishing gradients. This embodiment therefore adopts an improved deep residual network structure.
Suppose H(x) denotes the optimal mapping of a deep neural network given an input sample x. A traditional convolutional neural network directly fits H(x) = x, whereas a deep residual network fits the expected residual mapping, i.e., F(x) = H(x) − x. Since x is the input source image, fitting F(x) can be verified to be equivalent to fitting the target H(x). Under this condition, the original optimal mapping is expressed as H(x) = F(x) + x, as shown in Fig. 5.
At this point, in this embodiment, the feature matrix obtained from the pruned neural network is input to the deep residual bidirectional LSTM network. The deep residual network changes the connection structure so that 2 or 3 layers are skipped, with ReLU activation between the layers; this reduces the interdependence between parameters and appropriately alleviates overfitting. In this embodiment, the deep residual bidirectional LSTM network skips 2 bidirectional LSTM layers as shown in Fig. 5, mapping its input to a superposition layer (the adder in Fig. 5) where it is added to the output of the second, skipped bidirectional LSTM layer. The deep residual bidirectional LSTM network thus takes the feature matrix output by the pruned neural network as its input and adds that input to the output of the skipped layers. In this way, the pruned neural network serves as the front-end network of the deep residual bidirectional multi-memory neural network, and the deep residual bidirectional LSTM network serves as the back-end network; the deep residual bidirectional LSTM network is obtained by introducing the residual network structure into the deep bidirectional LSTM network, improving the plain deep bidirectional LSTM network.
Clearly, making the network fit a definite function F(x) = 0 is much easier than optimizing toward the optimal function H(x). The deep residual structure can be added to an existing deep model without changing the model's existing architecture, making it feasible to train better-performing models with more layers. Even so, as depth keeps increasing, a residual network with identity mappings suffers from multiple residual modules sharing only a small amount of gradient information flow, so only a small fraction of residual-module parameters get updated. To solve this problem, this embodiment introduces the deep bidirectional LSTM network in place of a traditional neural network and brings the residual network structure into it, improving the plain deep bidirectional LSTM network into the deep residual bidirectional network; because the residual network enlarges the range of features to choose from, it enhances the coupling ability of features.
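A minimal sketch of such a residual block, assuming the shortcut skips two bidirectional LSTM layers as in Fig. 5 and that the input and output dimensions match so they can be added, might look like this:

```python
import torch
import torch.nn as nn

class ResidualBiLSTMBlock(nn.Module):
    """The block input x skips two bidirectional LSTM layers and is added
    to their output, so the layers fit the residual F(x) = H(x) - x."""
    def __init__(self, dim):
        super().__init__()
        # hidden_size = dim // 2 so that the concatenated forward/backward
        # outputs have the same width as x and can be added to it.
        self.lstm1 = nn.LSTM(dim, dim // 2, batch_first=True, bidirectional=True)
        self.lstm2 = nn.LSTM(dim, dim // 2, batch_first=True, bidirectional=True)

    def forward(self, x):
        h, _ = self.lstm1(x)
        h, _ = self.lstm2(h)
        return x + h              # shortcut connection (the adder in Fig. 5)

block = ResidualBiLSTMBlock(128)
y = block(torch.randn(4, 30, 128))    # shape preserved: (4, 30, 128)
```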
Four, the classifier module
In this embodiment, a classifier is finally placed in the system to classify the behavioral features for human action recognition, generating a probability label for each video. The classifier is set up as follows:
For the training set {(x⁽¹⁾, y⁽¹⁾), ..., (x⁽ᵐ⁾, y⁽ᵐ⁾)} with y⁽ⁱ⁾ ∈ {1, 2, 3, ..., k}, there are k classes in total, and each input x has a probability p(y = j | x) of belonging to each class j = 1, 2, ..., k. The hypothesis function h_θ(x) outputs a k-dimensional vector (whose elements sum to 1) representing these k estimated probabilities; its j-th element is p(y = j | x; θ) = exp(θ_j · x) / Σ_{l=1..k} exp(θ_l · x).
θ₁, θ₂, ..., θ_k are the parameters of the model.
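A short sketch of this classifier follows; the feature dimension and the number of action classes are illustrative assumptions:

```python
import torch

def softmax_classifier(theta, x):
    """p(y = j | x) for j = 1..k: exp(theta_j . x) normalized over all
    k classes, so each row of the result sums to 1."""
    logits = x @ theta.t()                  # (batch, k)
    return torch.softmax(logits, dim=-1)

theta = torch.randn(10, 256)   # k = 10 action classes (assumed)
x = torch.randn(4, 256)        # pooled video features (assumed dimension)
probs = softmax_classifier(theta, x)
pred = probs.argmax(dim=-1)    # y = argmax over the k probabilities
```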
Through the above recognition system, feature extraction and classification of the input raw video images are finally realized, achieving the final human action recognition.
Embodiment 2
This embodiment describes in detail a human action recognition method, preferably carried out on the system described in Embodiment 1. The method comprises:
S1: sampling the input raw video images to obtain sampled video images, and standardizing the sampled video images to obtain standardized video images;
S2: extracting image features through a pruned convolutional neural network combined with a deep residual bidirectional LSTM network;
S3: classifying based on the image features to obtain the recognition result, as sketched below.
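Tying S1 to S3 together, the following hedged sketch reuses the helpers sketched in Embodiment 1 (downsample_video, global_contrast_normalize, softmax_classifier) and assumes front_end and res_bilstm are the pruned CNN and the residual bidirectional LSTM; it illustrates the data flow and is not the patent's reference pipeline:

```python
import torch

def recognize_action(video_path, n_frames, front_end, res_bilstm, theta):
    # S1: sample the video, then standardize each frame
    frames = [global_contrast_normalize(f)
              for f in downsample_video(video_path, n_frames)]
    clip = torch.stack([torch.as_tensor(f, dtype=torch.float32).permute(2, 0, 1)
                        for f in frames])              # (T, 3, H, W)
    # S2: per-frame features from the pruned CNN, then the residual
    # bidirectional LSTM over the time dimension
    feats = front_end(clip)                            # (T, D), assumed
    feats = res_bilstm(feats.unsqueeze(0))             # (1, T, D)
    # S3: classify the temporally averaged features
    probs = softmax_classifier(theta, feats.mean(dim=1))
    return probs.argmax(dim=-1)
```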
Preferably, in S1, the sampling is performed in the following manner:
frames are extracted from a video segment at a fixed interval; the interval, in frames, is T = M/N, rounded down to an integer,
where M is the number of frames in the original video and N is the number of frames in the sampled video.
Preferably, in S2, the pruned neural network is pruned according to the following principle:
if |w| < α, the corresponding weight is removed, where w is a recurrent weight and α is a sensitivity parameter.
Preferably, in S2, with the parameter sets of the pruned convolutional neural network combined with the deep residual bidirectional LSTM network denoted (θ₁, θ₂), the system loss function is defined in terms of the following quantities:
TR denotes training, TE denotes testing, R(W) and R(θ) are regularization terms, X is the abstract feature or hierarchical representation of the input signal, W is the convolution kernel, b is the bias, M is the number of stages of the whole network, N is the number of input features, L is the loss function, t is the network layer index, and k is the feature-map channel.
Preferably, during network training, the convolutional neural network and the deep residual bidirectional LSTM network are trained jointly.
Preferably, in S3, classification and recognition are performed as follows:
For the training set {(x⁽¹⁾, y⁽¹⁾), ..., (x⁽ᵐ⁾, y⁽ᵐ⁾)} with y⁽ⁱ⁾ ∈ {1, 2, 3, ..., k}, there are k classes in total, and each input x has a probability p(y = j | x) of belonging to each class j = 1, 2, ..., k. The hypothesis function h_θ(x) outputs a k-dimensional vector (whose elements sum to 1) representing these k estimated probabilities; its j-th element is p(y = j | x; θ) = exp(θ_j · x) / Σ_{l=1..k} exp(θ_l · x),
where θ₁, θ₂, ..., θ_k are the parameters of the model.
Those of ordinary skill in the art will understand that all or part of the processes in the above embodiment methods can be accomplished by instructing the relevant hardware through a computer program; the program can be stored in a computer-readable storage medium, and when executed may include the processes of the embodiments of the above methods. The storage medium may be a magnetic disk, an optical disc, a read-only memory (ROM), a random access memory (RAM), or the like.
Finally, it should be noted that the above embodiments merely illustrate the technical solutions of the present invention and do not limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that they may still modify the technical solutions recorded in the foregoing embodiments or substitute equivalents for some of their technical features, and such modifications or substitutions do not depart the essence of the corresponding technical solutions from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1. A human action recognition system, characterized in that the system comprises: a video image acquisition module, an image standardization module, a deep residual bidirectional multi-memory neural network module, and a classifier module;
the video image acquisition module samples the raw video images to obtain sampled video images, which are input to the image standardization module for normalization, yielding standardized video images;
the standardized video images are input to the deep residual bidirectional multi-memory neural network module for processing;
the deep residual bidirectional multi-memory neural network module comprises a pruned network unit and a deep residual bidirectional network unit;
the pruned network unit processes a convolutional neural network by deletion to obtain the pruned neural network, the pruning following the principle that:
if |w| < α, the corresponding weight is removed, where w is a recurrent weight and α is a sensitivity parameter;
the deep residual bidirectional network unit is implemented by combining a deep residual network with a bidirectional LSTM network, reduces the interdependence between parameters by means of shortcut connections, and maps its input to a superposition layer where it is added to the output of the convolutional layer;
the pruned neural network and the bidirectional LSTM network are trained jointly;
the classifier module performs classification based on the output of the deep residual bidirectional multi-memory neural network module.
2. The system according to claim 1, characterized in that the video acquisition module samples the raw video images in the following manner:
frames are extracted from a video segment at a fixed interval; the interval, in frames, is T = M/N, rounded down to an integer,
where M is the number of frames in the original video and N is the number of frames in the sampled video.
3. The system according to claim 1, characterized in that the standardization is realized by processing the contrast of each frame, the contrast of the processed image being C = sqrt( (1/(3rc)) · Σ_{i=1..r} Σ_{j=1..c} Σ_{k=1..3} (X_{i,j,k} − X̄)² ),
where X̄ is the average gray level of the entire image, satisfying X̄ = (1/(3rc)) · Σ_{i=1..r} Σ_{j=1..c} Σ_{k=1..3} X_{i,j,k}, and the image size is r × c.
4. The system according to claim 1, characterized in that, with the parameter sets of the convolutional neural network and the LSTM network denoted (θ₁, θ₂), the system loss function is defined in terms of the following quantities:
X is the abstract feature or hierarchical representation of the input signal, W is the convolution kernel, b is the bias, M is the number of stages of the whole network, N is the number of input features, L is the loss function, t is the network layer index, and k is the feature-map channel.
5. The system according to claim 1, characterized in that the classifier module is: p(y = j | x; θ) = exp(θ_j · x) / Σ_{l=1..k} exp(θ_l · x), j = 1, 2, ..., k,
where {(x⁽¹⁾, y⁽¹⁾), ..., (x⁽ᵐ⁾, y⁽ᵐ⁾)} is the training set, y⁽ⁱ⁾ ∈ {1, 2, 3, ..., k} with k classes in total, p(y = j | x) is the probability that input x belongs to each class, and θ₁, θ₂, ..., θ_k are the parameters of the model.
6. A human action recognition method, characterized in that the method comprises:
S1: sampling the input raw video images to obtain sampled video images, and standardizing the sampled video images to obtain standardized video images;
S2: extracting image features through a pruned convolutional neural network combined with a deep residual bidirectional LSTM network;
S3: classifying based on the image features to obtain the recognition result.
7. The method according to claim 6, characterized in that, in S1, the sampling is performed in the following manner:
frames are extracted from a video segment at a fixed interval; the interval, in frames, is T = M/N, rounded down to an integer,
where M is the number of frames in the original video and N is the number of frames in the sampled video.
8. The method according to claim 6, characterized in that, in S2, the pruned neural network is pruned according to the following principle:
if |w| < α, the corresponding weight is removed, where w is a recurrent weight and α is a sensitivity parameter.
9. The method according to claim 6, characterized in that, in S2, with the parameter sets of the pruned convolutional neural network combined with the deep residual bidirectional LSTM network denoted (θ₁, θ₂), the system loss function is defined in terms of the following quantities:
X is the abstract feature or hierarchical representation of the input signal, W is the convolution kernel, b is the bias, M is the number of stages of the whole network, N is the number of input features, L is the loss function, t is the network layer index, and k is the feature-map channel.
10. The method according to claim 6, characterized in that, during network training, the pruned neural network and the deep residual bidirectional LSTM network are trained jointly.
CN201910324097.4A 2019-04-22 2019-04-22 Human action recognition system and method Pending CN110222556A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910324097.4A CN110222556A (en) 2019-04-22 2019-04-22 Human action recognition system and method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910324097.4A CN110222556A (en) 2019-04-22 2019-04-22 Human action recognition system and method

Publications (1)

Publication Number Publication Date
CN110222556A true CN110222556A (en) 2019-09-10

Family

ID=67819978

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910324097.4A Pending CN110222556A (en) Human action recognition system and method

Country Status (1)

Country Link
CN (1) CN110222556A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110738213A (en) * 2019-09-20 2020-01-31 成都芯云微电子有限公司 image recognition method and device comprising surrounding environment
CN111008640A (en) * 2019-10-17 2020-04-14 平安科技(深圳)有限公司 Image recognition model training and image recognition method, device, terminal and medium
CN111401207A (en) * 2020-03-11 2020-07-10 福州大学 Human body action recognition method based on MARS depth feature extraction and enhancement
CN111931602A (en) * 2020-07-22 2020-11-13 北方工业大学 Multi-stream segmented network human body action identification method and system based on attention mechanism
CN113469062A (en) * 2021-07-05 2021-10-01 中山大学 Method, system and medium for detecting face exchange tampering video based on key frame face characteristics

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107341462A (en) * 2017-06-28 2017-11-10 University of Electronic Science and Technology of China Video classification method based on an attention mechanism
CN108446594A (en) * 2018-02-11 2018-08-24 Sichuan Beiqing Data Technology Co., Ltd. Emergency response capability assessment method based on action recognition
CN108491680A (en) * 2018-03-07 2018-09-04 Anqing Normal University Drug relation extraction method based on a residual network and an attention mechanism

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107341462A (en) * 2017-06-28 2017-11-10 University of Electronic Science and Technology of China Video classification method based on an attention mechanism
CN108446594A (en) * 2018-02-11 2018-08-24 Sichuan Beiqing Data Technology Co., Ltd. Emergency response capability assessment method based on action recognition
CN108491680A (en) * 2018-03-07 2018-09-04 Anqing Normal University Drug relation extraction method based on a residual network and an attention mechanism

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
YU ZHAO: "Deep Residual Bidir-LSTM for Human Activity Recognition Using Wearable Sensors", Hindawi *
Tang Pengjie et al.: "Image description based on multi-stage joint optimization of GoogLeNet", Journal of Jinggangshan University (Natural Science Edition) *
Wang Tianxing: "Research on an improved algorithm based on the GoogLeNet network structure", China Master's Theses Full-text Database, Information Science and Technology *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110738213A (en) * 2019-09-20 2020-01-31 成都芯云微电子有限公司 image recognition method and device comprising surrounding environment
CN110738213B (en) * 2019-09-20 2022-07-01 成都芯云微电子有限公司 Image identification method and device comprising surrounding environment
CN111008640A (en) * 2019-10-17 2020-04-14 平安科技(深圳)有限公司 Image recognition model training and image recognition method, device, terminal and medium
CN111008640B (en) * 2019-10-17 2024-03-19 平安科技(深圳)有限公司 Image recognition model training and image recognition method, device, terminal and medium
CN111401207A (en) * 2020-03-11 2020-07-10 福州大学 Human body action recognition method based on MARS depth feature extraction and enhancement
CN111401207B (en) * 2020-03-11 2022-07-08 福州大学 Human body action recognition method based on MARS depth feature extraction and enhancement
CN111931602A (en) * 2020-07-22 2020-11-13 北方工业大学 Multi-stream segmented network human body action identification method and system based on attention mechanism
CN111931602B (en) * 2020-07-22 2023-08-08 北方工业大学 Attention mechanism-based multi-flow segmented network human body action recognition method and system
CN113469062A (en) * 2021-07-05 2021-10-01 中山大学 Method, system and medium for detecting face exchange tampering video based on key frame face characteristics
CN113469062B (en) * 2021-07-05 2023-07-25 中山大学 Method, system and medium for detecting face exchange tampered video based on key frame face characteristics

Similar Documents

Publication Publication Date Title
CN110222556A (en) Human action recognition system and method
CN109859190B (en) Target area detection method based on deep learning
Liao et al. Deep facial spatiotemporal network for engagement prediction in online learning
CN106650806B (en) A kind of cooperating type depth net model methodology for pedestrian detection
CN106446930B (en) Robot operative scenario recognition methods based on deep layer convolutional neural networks
CN109741318B (en) Real-time detection method of single-stage multi-scale specific target based on effective receptive field
Ma et al. Contrast-based image attention analysis by using fuzzy growing
CN108446729A (en) Egg embryo classification method based on convolutional neural networks
CN107463920A (en) A kind of face identification method for eliminating partial occlusion thing and influenceing
CN109785300A (en) A kind of cancer medical image processing method, system, device and storage medium
CN109255375A (en) Panoramic picture method for checking object based on deep learning
CN107633522A (en) Brain image dividing method and system based on local similarity movable contour model
CN107145889A (en) Target identification method based on double CNN networks with RoI ponds
CN105654141A (en) Isomap and SVM algorithm-based overlooked herded pig individual recognition method
WO2022198808A1 (en) Medical image data classification method and system based on bilinear attention network
CN112529146B (en) Neural network model training method and device
CN114998210B (en) Retinopathy of prematurity detecting system based on deep learning target detection
CN111062296B (en) Automatic white blood cell identification and classification method based on computer
CN109902558A (en) A kind of human health deep learning prediction technique based on CNN-LSTM
CN107066916A (en) Scene Semantics dividing method based on deconvolution neutral net
CN109190683A (en) A kind of classification method based on attention mechanism and bimodal image
CN110648331A (en) Detection method for medical image segmentation, medical image segmentation method and device
JP2024018938A (en) Night object detection and training method and device based on frequency domain self-attention mechanism
CN114581434A (en) Pathological image processing method based on deep learning segmentation model and electronic equipment
CN108230330A (en) A kind of quick express highway pavement segmentation and the method for Camera Positioning

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20190910

RJ01 Rejection of invention patent application after publication