CN106407903A - Multi-scale convolutional neural network-based real-time human abnormal behavior recognition method
Multi-scale convolutional neural network-based real-time human abnormal behavior recognition method
- Publication number
- CN106407903A (application CN201610790306.0A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/20—Movements or behaviour, e.g. gesture recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/41—Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
Abstract
The invention discloses a real-time human abnormal behavior recognition method based on a multi-scale convolutional neural network. A convolutional neural network replaces the conventional feature extraction algorithm and is improved to meet the requirements of human behavior classification: specifically, three-dimensional convolution, three-dimensional down-sampling, NIN and a three-dimensional pyramid structure are added, increasing the network's ability to extract features of abnormal human behavior. Training on a dedicated video set yields features with strong discriminative power, improving the robustness and accuracy of the whole recognition algorithm; GPU acceleration meets the requirements of practical application, so that multi-channel video can be monitored in real time.
Description
Technical field
The present invention relates to the fields of computer vision and machine learning, and in particular to techniques for detecting abnormal human behavior in video.
Background technology
Human behavior recognition is in great demand in places with high security requirements such as railway stations, banks and airports. If a system can identify human behavior, it can automatically detect abnormal situations and raise alarms, substantially reducing labor costs, improving surveillance, and enabling on-line monitoring and real-time alerts.
Conventional human behavior recognition schemes are essentially based on background modeling and feature matching, and contain three steps. The first step extracts spatio-temporal feature points, i.e. pixels with both temporal and spatial character, using background subtraction or optical flow to model the background and extract the foreground. The second step applies a particular transform to the video regions around the feature points obtained in the first step, according to the chosen features, to obtain descriptors of specific behaviors; most of these are based on static scenes and object recognition techniques. The third step feeds these features into a classifier for training, and the trained classifier is applied to recognition.
Patent No. 101719216A proposes a moving-human abnormal behavior recognition method based on template matching. Its main steps are: first pre-process and denoise the video with Gaussian filtering, then divide the pre-processed image into small W*W sub-regions and perform Gaussian background modeling on each sub-region; after extracting the moving target, remove shadows in the HSV color space, then match the moving target against an established standard template library to judge the behavior category.
However, the above methods all have the following problems:
1. In the Gaussian background modeling method, if the foreground and background colors are the same or similar, part of the foreground is easily misjudged as background when determining the foreground and searching for connected regions, causing missing parts in the extracted region;
2. Comparing features extracted from single pictures cannot, to a certain extent, make good use of the characteristics of the time domain;
3. When template matching is used for behavior recognition, the feature distribution of the training-set videos differs from the real monitored environment, so samples that deviate from the template library cannot be recognized well; for example, changes of camera angle or changes of the action easily cause false alarms and missed detections.
Content of the invention
To solve the above technical problems, the present invention proposes a real-time human abnormal behavior recognition method based on a multi-scale convolutional neural network. A convolutional neural network replaces the traditional feature extraction algorithm and is improved by adding three-dimensional convolution, three-dimensional down-sampling, NIN and a three-dimensional pyramid structure, so that the network has a stronger ability to extract features of abnormal human behavior. Training on a dedicated video set yields features with strong discriminative power and strengthens the robustness and accuracy of the whole recognition algorithm; in addition, GPU acceleration enables real-time monitoring of multi-channel video.
The technical solution adopted by the present invention is a real-time human abnormal behavior recognition method based on a multi-scale convolutional neural network, comprising:
S1. Determine the structure of the multi-scale convolutional neural network, comprising a first layer, second layer, third layer, fourth layer, fifth layer, sixth layer and seventh layer. The first layer is the input layer and comprises three channels: the grayscale-converted video image of the last second before the current moment, and the two channels Ox, Oy of the dense optical flow computed from this video. The second layer is a three-dimensional convolution layer, which convolves the video and optical flow input to the first layer with n convolution kernels of size cw*ch*cl. The third layer is a three-dimensional down-sampling layer, which applies max pooling with a window of size pw, ph, pl to the output of the second layer. The fourth layer is a three-dimensional convolution layer that convolves the output of the third layer. The fifth layer is an NIN layer, composed of a network of two-layer perceptron convolution layers, which extracts nonlinear features of human behavior from the output of the fourth layer. The sixth layer is a pyramid down-sampling layer, composed of three-dimensional down-sampling layers of different sizes, which down-samples the nonlinear human behavior features output by the fifth layer. The seventh layer is a fully connected layer, which produces a feature vector of fixed dimension from the output of the sixth layer.
Here Ox denotes the component of the optical flow along the x axis and Oy its component along the y axis; cw denotes the width of a convolution kernel, ch its height, and cl its length along the time axis.
S2. Off-line training of the multi-scale convolutional neural network: by learning on an abnormal human behavior library, a network parameter model is obtained and, combined with the structure of the multi-scale convolutional neural network determined in step S1, serves as the model file for on-line recognition.
S3. On-line recognition with the convolutional neural network: input the video into the model file to obtain the feature vector that characterizes it.
Further, the three-dimensional down-sampling in the third layer uses the following equation:

y'_{m,n,l} = max{ x'_{i,j,k} : m·s ≤ i < m·s + pw, n·t ≤ j < n·t + ph, l·r ≤ k < l·r + pl }

where x' denotes the input vector and y' the output obtained after sampling; s, t and r are the sampling strides along the three directions of image width, image height and video time length. In three-dimensional convolution the feature map output by the previous layer is a two-dimensional matrix; S1 and S2 are its total numbers of rows and columns, and m, n denote the m-th row and n-th column of this matrix, with 0 ≤ m < S1 and 0 ≤ n < S2; S3 denotes the time length of the video, i.e. the video is S3 frames long, and l denotes the l-th frame, with 0 ≤ l < S3; i is the row index of the input matrix, j its column index, and k the frame index.
Further, the down-sampling process is specifically: apply overlapping down-sampling with multiple window sizes and strides to the nonlinear human behavior features output by the fifth layer, then splice every unit of the resulting feature maps into one vector, which serves as the output of the sixth layer.
Further, obtaining the network parameter model of step S2 specifically comprises the following sub-steps:
S21. Load the collected and labeled sample library; the sample library contains positive and negative samples, where positive samples are videos of abnormal human behavior and negative samples are videos of normal human behavior;
S22. Load the multi-scale convolutional neural network determined in step S1;
S23. Pre-process the videos and input them into the multi-scale convolutional neural network;
S24. Judge whether the error of the multi-scale convolutional neural network is below a threshold; if so, go to step S25; otherwise back-propagate the difference between the network output and the true label through the multi-scale convolutional neural network, adjust the network parameters, and return to step S23 to retrain;
S25. Save the network parameters of the multi-scale convolutional neural network.
Further, the pre-processing is: convert to the grayscale color space, subtract the mean, and extract the grayscale information and optical flow information.
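The pre-processing above can be sketched as follows; the standard luminance weights are an assumption, since the text only specifies grayscale conversion and mean subtraction:

```python
import numpy as np

def preprocess(frames):
    """Pre-processing sketch: RGB frames (T, H, W, 3) -> zero-mean grayscale
    (T, H, W). The 0.299/0.587/0.114 luminance weights are an assumption;
    the patent only says 'convert to grayscale and subtract the mean'."""
    gray = frames @ np.array([0.299, 0.587, 0.114])  # per-pixel grayscale
    return gray - gray.mean()                        # subtract the mean
```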
Further, the on-line recognition of the convolutional neural network in step S3 specifically comprises the following sub-steps:
S31. Obtain one second of video V from a camera or a video file;
S32. Scale the video obtained in step S31 to a fixed resolution as required;
S33. Convert the scaled video of step S32 to grayscale, and compute the dense optical flow to obtain the two optical flow channels Ox, Oy;
S34. Whiten the grayscale images obtained in step S33;
S35. Input the whitened images from step S34 and the Ox, Oy obtained in step S33 into the multi-scale convolutional neural network, which outputs the feature vector F after network computation;
S36. Input the video feature F into a classifier C to judge the behavior category of the video; if it belongs to abnormal behavior, handle the anomaly.
Beneficial effects of the present invention: the application replaces the traditional feature extraction algorithm with a convolutional neural network, and improves the network to suit human behavior classification; specifically, three-dimensional convolution, three-dimensional down-sampling, NIN and a three-dimensional pyramid structure are added so that the network has a stronger ability to extract features of abnormal human behavior. Training on a dedicated video set yields features with strong discriminative power and strengthens the robustness and accuracy of the whole recognition algorithm; in addition, GPU acceleration meets the demands of practical application and enables real-time monitoring of multi-channel video.
The method of the application has the following advantages:
1. Replacing traditional feature extraction with a convolutional neural network gives a stronger ability to extract features of abnormal human behavior;
2. Three-dimensional convolution extracts time-domain information and better captures motion information;
3. Three-dimensional down-sampling not only greatly reduces the amount of computation but also introduces temporal invariance, improving the stability of recognition and achieving a higher recognition rate;
4. NIN enables the proposed multi-scale convolutional neural network to extract more complex nonlinear features of human behavior;
5. The pyramid structure improves the flexibility of the system, so that video segments of different resolutions and durations can be used without any change, broadening the range of application;
6. Optical flow channels added at the input give the whole algorithm a higher recognition capability in the time domain;
7. The network is trained on a dedicated abnormal human behavior library; by increasing the number of samples and the variety of scenes as far as possible, the model can be trained better;
8. GPU acceleration allows the whole recognition algorithm to perform real-time detection on multiple videos.
Brief description of the drawings
Fig. 1 is a schematic diagram of the multi-scale convolutional neural network architecture provided by the present invention.
Fig. 2 compares the three-dimensional convolution provided by the present invention with two-dimensional convolution; Fig. 2a is a schematic diagram of three-dimensional convolution, and Fig. 2b of two-dimensional convolution.
Fig. 3 is a schematic diagram of linear convolution and MLP convolution provided by the present invention.
Fig. 4 is a schematic diagram of the pyramid structure provided by the present invention.
Fig. 5 is the parameter-model training flow chart provided by the present invention.
Fig. 6 is the on-line recognition flow chart provided by the present invention.
Specific embodiment
To help those skilled in the art understand the technical content of the present invention, the present invention is further explained below with reference to the accompanying drawings.
The technical scheme of the present invention is a real-time human abnormal behavior recognition method based on a multi-scale convolutional neural network, comprising:
S1. Determine the structure of the multi-scale convolutional neural network. As shown in Fig. 1, it comprises a first layer, second layer, third layer, fourth layer, fifth layer, sixth layer and seventh layer. Each layer comprises several nodes, and the number of nodes determines the number of feature types extracted: the more nodes, the more feature information is extracted, but the larger the amount of computation.
The first layer is the input layer and comprises three channels: the grayscale-converted video image of the last second before the current moment, and the two channels Ox, Oy of the dense optical flow computed from this video. Adding two optical flow channels on top of the original video input channel greatly strengthens the sensitivity to behavioral actions and gives a higher recognition rate in behavior recognition.
The second layer conv1 is a three-dimensional convolution layer, which convolves the video and optical flow input to the first layer with n convolution kernels of size cw*ch*cl. The application uses three-dimensional convolution to incorporate time-domain information and better capture motion information.
Specifically: in video processing tasks, motion information must be captured from multiple consecutive frames. Unlike other convolutional neural networks, the application uses three-dimensional convolution layers. The formula of three-dimensional convolution is formula (1):

y_{e,f,g} = Σ_{p=0}^{P-1} Σ_{q=0}^{Q-1} Σ_{r=0}^{R-1} w_{p,q,r} · u_{e+p, f+q, g+r}      (1)

where w is the weight of the convolution kernel; u is the input vector, representing the image gray values of the three channels and the horizontal and vertical components of the optical flow; y_{e,f,g} is the output, the subscripts e, f, g denoting the element at the corresponding position, i.e. the element at row e, column f of frame g. P, Q, R are the sizes of the three dimensions: in three-dimensional convolution the feature map output by the previous layer is a two-dimensional matrix, P and Q are its total numbers of rows and columns, and lower-case p, q denote the p-th row and q-th column, with 0 ≤ p < P and 0 ≤ q < Q; R denotes the length of the video, i.e. the video is R frames long, lower-case r denotes the r-th frame, and 0 ≤ r < R.
Three-dimensional convolution here can be viewed as convolving, with a three-dimensional kernel, the cube formed by stacking multiple frames. Fig. 2a shows a schematic diagram of three-dimensional convolution; the coordinate axes indicate the three dimensions: image width, image height and video time. The lower cube represents the input of the convolution and the upper cube its output. Fig. 2b is a schematic diagram of two-dimensional convolution, whose input and output are both two-dimensional rectangles containing only the width and height of the image, with no time-domain information. With three-dimensional convolution, the feature map of each convolution layer is associated with multiple consecutive frames of the preceding layer, thereby capturing motion information.
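As a sketch of formula (1), a direct (unoptimized) single-channel three-dimensional convolution might look like this; the function name and the valid-padding choice are illustrative assumptions, not specified by the patent:

```python
import numpy as np

def conv3d_single(u, w):
    """Valid 3D convolution of one channel, following formula (1):
    y[e, f, g] = sum over p, q, r of w[p, q, r] * u[e+p, f+q, g+r].
    u: input volume (rows, cols, frames); w: kernel of size P x Q x R."""
    P, Q, R = w.shape
    E = u.shape[0] - P + 1
    F = u.shape[1] - Q + 1
    G = u.shape[2] - R + 1
    y = np.zeros((E, F, G))
    for e in range(E):
        for f in range(F):
            for g in range(G):
                # weighted sum over the P x Q x R neighbourhood
                y[e, f, g] = np.sum(w * u[e:e+P, f:f+Q, g:g+R])
    return y
```

In practice a framework convolution primitive would replace the triple loop; the loop form is kept here only to mirror the index structure of formula (1).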
The third layer pool1 is a three-dimensional down-sampling layer, which applies max pooling with a window of size pw, ph, pl. The application's three-dimensional down-sampling not only greatly reduces the amount of computation but also gives the algorithm temporal invariance, improving the stability of recognition and achieving a higher recognition rate.
Specifically: as with three-dimensional convolution, when a convolutional neural network processes video, the down-sampling layer also needs to be extended to three dimensions. In networks that process pictures, the down-sampling layer greatly reduces the data volume, accelerating computation, and also gives the network a certain invariance — there, an invariance in the spatial domain. When processing video, a certain invariance is also required in the time domain, and the data volume of video is much larger than that of single-frame pictures; therefore down-sampling must also be extended to three dimensions. The formula of three-dimensional overlapping max down-sampling is formula (2):

y'_{m,n,l} = max{ x'_{i,j,k} : m·s ≤ i < m·s + pw, n·t ≤ j < n·t + ph, l·r ≤ k < l·r + pl }      (2)

where x' is the three-dimensional input vector, i.e. the feature maps extracted by the previous convolution layer (the second layer); y' is the output obtained after sampling; and s, t and r are the sampling strides along the three directions of image width, image height and video time length. The feature map output by the second layer is a two-dimensional matrix; S1 and S2 are its total numbers of rows and columns, and m, n denote the m-th row and n-th column, with 0 ≤ m < S1 and 0 ≤ n < S2; S3 denotes the length of the video, i.e. the video is S3 frames long, lower-case l denotes the l-th frame, and 0 ≤ l < S3; i denotes the row index of the input matrix, i = 1, 2, ..., S1; j its column index, j = 1, 2, ..., S2; and k the frame index, k = 1, 2, ..., S3. Although S1, S2, S3, m, n, l here correspond in meaning to P, Q, R, p, q, r in the second layer's three-dimensional convolution, their values differ between the second and third layers, so different letters are used to distinguish them. After sampling, the data volume of the feature maps shrinks several-fold and the amount of computation drops greatly; at the same time, the network becomes more robust to changes in the time domain.
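The overlapping three-dimensional max down-sampling of formula (2) can be sketched as follows, with window (pw, ph, pl) and strides (s, t, r); the valid-style output sizing is an assumption:

```python
import numpy as np

def max_pool3d(x, window, strides):
    """Overlapping 3D max pooling per formula (2).
    x: input volume (rows, cols, frames); window: (pw, ph, pl);
    strides: (s, t, r) along width, height and time."""
    pw, ph, pl = window
    s, t, r = strides
    M = (x.shape[0] - pw) // s + 1
    N = (x.shape[1] - ph) // t + 1
    L = (x.shape[2] - pl) // r + 1
    y = np.zeros((M, N, L))
    for m in range(M):
        for n in range(N):
            for l in range(L):
                # maximum over the pw x ph x pl window at stride offsets
                y[m, n, l] = x[m*s:m*s+pw, n*t:n*t+ph, l*r:l*r+pl].max()
    return y
```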
The fourth layer conv2 is a three-dimensional convolution layer, similar to the second layer, which convolves the output of the third layer.
The fifth layer is an NIN (Network in Network) layer, composed of a network of two-layer perceptron convolution layers, which extracts nonlinear features of human behavior from the output of the fourth layer; NIN allows the system to extract more complex nonlinear features of human behavior.
Specifically: the application uses an NIN structure to improve the whole network. The convolution in a convolutional neural network is a generalized linear model (GLM). A GLM achieves good abstraction only when the samples are linearly separable; the convolution of a convolutional neural network implicitly assumes this linear separability, but human behavior models do not satisfy the assumption. Hence, replacing the GLM with a model of nonlinear representational power can improve the abstraction capability of the algorithm.
A feedforward neural network, or multi-layer perceptron (MLP), is a model of strong abstraction capability; replacing the ordinary linear convolution kernel with a nonlinear convolution operation performed by an MLP necessarily increases the abstraction capability of the model. A layer that performs convolution with an MLP is here called an MLP convolution layer, while a convolution layer using a linear kernel is called a linear convolution layer; a network using such MLP convolution layers is called NIN (Network in Network).
Classical convolution and MLP convolution are shown in Fig. 3. Classical convolution performs a linear weighted sum over the input of a region and passes it through a nonlinear activation function, fitting a simple nonlinear model. MLP convolution slides a multi-layer perceptron composed of several fully connected layers over the feature maps of the previous layer and then applies a nonlinear activation function (ReLU here), yielding the feature maps of the current layer. The computation of MLP convolution is formula (3):

f^1_{i,j,k_1} = max(0, (w^1_{k_1})^T x_{i,j} + b_{k_1})
...
f^n_{i,j,k_n} = max(0, (w^n_{k_n})^T f^{n-1}_{i,j} + b_{k_n})      (3)

where (i, j) is the pixel index of the current layer's feature map, x_{ij} is the input patch centered at (i, j), k_n indexes the feature maps of the current layer, and n is the number of layers of the MLP. Because the ReLU activation (Y = max(0, X)) is used, the formula takes the maximum against 0.
From another viewpoint, an MLP convolution layer is equivalent to several linear convolution layers: if an MLP has n layers, then one MLP convolution layer can be regarded as n linear convolution layers, of which the last n−1 have 1 × 1 kernels, and each feature map convolves part of the feature maps of the previous layer. NIN thus allows the system to extract more complex nonlinear features of human behavior.
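Since the last n−1 layers of an MLP convolution are equivalent to 1 × 1 convolutions with ReLU, formula (3) can be sketched as a stack of per-pixel linear maps over channels; the weight/bias parameterization here is an illustrative assumption:

```python
import numpy as np

def mlp_conv(feature_maps, weights, biases):
    """NIN-style MLP convolution sketch: a stack of 1x1 convolutions with
    ReLU, per formula (3). feature_maps: (C, H, W); each W_n of shape
    (C_out, C_in) and b_n of shape (C_out,) map channels at every pixel."""
    x = feature_maps
    for W, b in zip(weights, biases):
        # a 1x1 convolution is a linear map over channels at each (i, j)
        x = np.einsum('oc,chw->ohw', W, x) + b[:, None, None]
        x = np.maximum(x, 0.0)  # ReLU: max(0, .) as in formula (3)
    return x
```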
The sixth layer pyramid is a three-dimensional pyramid down-sampling layer, composed of three-dimensional down-sampling layers of different sizes, which down-samples the output of the fifth layer to obtain output feature maps of different resolutions. The three-dimensional pyramid down-sampling layer adopted by the application improves the flexibility of the system, so that video segments of different resolutions and durations can use the system without any change, broadening its range of application.
Specifically: the three-dimensional pyramid down-sampling layer of the application is composed of feature maps of multiple resolutions. A conventional down-sampling layer uses a single sampling scale and its input feature maps all have the same size, so the resulting feature maps share one resolution; the pyramid down-sampling layer instead uses multiple sampling scales to obtain a series of feature maps of fixed but different resolutions.
Behavior recognition samples are video segments that may have different resolutions and different lengths. Traditional convolutional neural networks cannot handle these differences, because each of their feature maps has a fixed size. The reason they cannot process videos of different resolutions and lengths lies not in the convolution or down-sampling layers but in the fully connected layer (the FC layer in Fig. 1): because the architecture of the fully connected layer is fixed and cannot change, the size of the feature maps fed into it must also be fixed. In a convolution layer, the size of the input feature maps does not affect the structure of the network — only the size of the output feature maps changes with the input size, because the convolution kernel simply slides over the input feature maps. A down-sampling layer likewise only shrinks the input feature maps in a fixed manner and does not affect the structure of the network. It is therefore necessary, before the fully connected layer, to process the feature maps so that inputs of different sizes yield feature maps of identical size.
This can be achieved with overlapping down-sampling of different window sizes and strides. Take the two-dimensional case as an example; the three-dimensional case is a natural extension. Suppose the input feature map has size a × a and must be down-sampled to size n × n. The window size used is

win = ⌈a / n⌉      (4)

and the sliding stride is

str = ⌊a / n⌋      (5)

where ⌈·⌉ is the ceiling operation and ⌊·⌋ the floor operation. For example, with a = 13 and n = 3, formulas (4) and (5) give win = 5 and str = 4.
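Formulas (4) and (5) can be sketched directly; the function name is an assumption:

```python
import math

def pyramid_window_stride(a, n):
    """Window size and stride for down-sampling an a x a feature map
    to n x n, per formulas (4) and (5)."""
    win = math.ceil(a / n)   # formula (4): round up
    stride = a // n          # formula (5): round down
    return win, stride
```

With a = 13 and n = 3 this reproduces the worked example in the text: windows of size 5 at stride 4 cover positions 0, 4 and 8, giving exactly 3 outputs per axis.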
Determining the window size and sliding stride with the above formulas effectively produces output feature maps of identical size from input feature maps of different sizes. But a single output size for all resolutions would make high-resolution feature maps lose too much information, while low-resolution feature maps might be sampled too little to gain invariance; hence the pyramidal scheme, shown in Fig. 4, is adopted. The pyramid is formed from feature maps of several resolutions: some comparatively large, some small, and some of intermediate, transitional size (e.g. 16*256-d, 4*256-d and 256-d in Fig. 4). Obtaining output feature maps of different resolutions from the input feature maps requires overlapping down-sampling with multiple window sizes and strides. Every unit of the resulting feature maps is then spliced into one vector, as in the L-Fix layer of Fig. 4, which is then fed into the fully connected layer. The example in Fig. 4 is a three-level pyramid, with fixed resolutions of 3 × 3, 2 × 2 and 1 × 1, and the previous layer has 256 feature maps.
The total number of output units of the pyramid layer is fixed, so the output feature maps can be connected to the fully connected layer; moreover, introducing the pyramid model forms feature maps of multiple resolutions and avoids the impact of irregular inputs of different resolutions.
In the present invention the feature maps used are all three-dimensional, and extending the pyramid model to three dimensions is straightforward: in each dimension the window size and stride follow formulas (4) and (5) — compute the ratio E of the input to the output side length, then round up to obtain the window size and round down to obtain the stride.
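A two-dimensional sketch of the pyramid down-sampling layer, combining formulas (4) and (5) with the splicing into one vector described above; the three-level 3 × 3, 2 × 2, 1 × 1 configuration follows the Fig. 4 example, and the three-dimensional extension applies the same rule per axis:

```python
import math
import numpy as np

def pyramid_pool2d(fmap, levels=(3, 2, 1)):
    """Spatial pyramid max pooling sketch (2-D case).
    fmap: (H, W) feature map of any size; returns a fixed-length vector
    of sum(n*n for n in levels) values, independent of the input size."""
    out = []
    H, W = fmap.shape
    for n in levels:
        wh, sh = math.ceil(H / n), H // n  # formulas (4) and (5), rows
        ww, sw = math.ceil(W / n), W // n  # formulas (4) and (5), cols
        for i in range(n):
            for j in range(n):
                out.append(fmap[i*sh:i*sh+wh, j*sw:j*sw+ww].max())
    return np.array(out)  # spliced vector fed to the FC layer
```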
Layer 7 FC is full articulamentum, for exporting the characteristic vector of fixed dimension, is supplied to grader (softmax) and makees
For identifying the characteristic of division of human body behavior.
S2, the off-line training of multiple dimensioned convolutional neural networks;By being learnt in abnormal human body behavior storehouse, obtain net
Network parameter model combines the structure of the multiple dimensioned convolutional neural networks that step S1 determines as model file during ONLINE RECOGNITION;As
Described parameter model shown in Fig. 5 specifically comprise following step by step:
The Sample Storehouse that S21, loading have marked, described Sample Storehouse comprises positive sample and negative sample, and described positive sample is abnormal
Human body behavior video, negative sample be normal human body behavior video;
The multiple dimensioned convolutional neural networks that S22, loading are determined by step S1;
S23, video is pre-processed after input multiple dimensioned convolutional neural networks;Described pretreatment be:Be converted to gray scale
Color space, and deduct average, extract half-tone information and Optic flow information;
S24: judge whether the difference between the output of the multi-scale convolutional neural network and the true label is below a threshold; if so, go to step S25. Otherwise, back-propagate that difference through the multi-scale convolutional neural network, adjust the network parameters, and execute step S23 again. Here the network parameters include the weight coefficients of the convolutional neural network and the connection weights of the classifier; learning suitable parameters is the purpose of the off-line training of the multi-scale convolutional neural network.
S25: save the network parameters of the multi-scale convolutional neural network; these network parameters will be used for behavior detection.
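Steps S23-S25 amount to a train-until-converged loop: forward pass, compare with the true label, back-propagate if the error is still above the threshold. A minimal numpy sketch, with a toy one-layer model standing in for the patent's 3D CNN (`TinyNet`, `offline_training`, and all hyper-parameters are illustrative assumptions):

```python
import numpy as np

class TinyNet:
    """One-layer linear stand-in for the multi-scale 3D CNN (illustrative only)."""
    def __init__(self, dim):
        self.w = np.zeros(dim)
    def forward(self, x):
        return x @ self.w
    def backward_step(self, x, diff, lr):
        # back-propagate the output-label difference and adjust the parameters
        self.w -= lr * x.T @ diff / len(diff)

def offline_training(net, x, y, threshold=1e-3, lr=0.1, max_iters=5000):
    for _ in range(max_iters):
        out = net.forward(x)                    # S23: forward pass
        diff = out - y                          # difference to the true labels
        if np.mean(np.abs(diff)) < threshold:   # S24: error below threshold?
            break
        net.backward_step(x, diff, lr)          # otherwise adjust parameters
    return net.w                                # S25: the parameters to save
```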
S3, online recognition by the convolutional neural network: the feature vector that characterizes the video is obtained by inputting the video into the model file. As shown in Fig. 6, this specifically comprises the following sub-steps:
S31: obtain one second of video V from a camera or a video file; if it consists of N frames, each frame is Vi, i = [1, N]. Human behavior recognition is performed on one second of video at a time.
S32: scale the video obtained in step S31 to a fixed resolution as required, Vi = resize(Vi).
S33: first convert the scaled video from step S32 to gray scale, then compute dense optical flow to obtain the two optical-flow channels Ox, Oy. Dense optical flow is an image registration method that performs point-by-point matching between adjacent frames of a video. Unlike sparse optical flow, which considers only a few feature points of the image, dense optical flow computes the displacement of every point in the image, forming a dense optical-flow field. With this field, image registration can be performed at the pixel level, so its registration results are also significantly better than those of sparse optical flow.
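The point-by-point matching that dense optical flow performs can be illustrated with a deliberately naive exhaustive block-matching search; the function below is a toy stand-in, not the patent's algorithm (a real system would use a proper dense-flow method such as Farneback's):

```python
import numpy as np

def dense_flow_block_matching(prev, nxt, win=3, search=3):
    """Toy dense flow: for every interior pixel, find the displacement within
    +/-search whose patch in `nxt` best matches (lowest SSD) the patch around
    the pixel in `prev`. Returns per-pixel displacement channels (Ox, Oy)."""
    h, w = prev.shape
    r = win // 2
    Ox = np.zeros((h, w)); Oy = np.zeros((h, w))
    m = r + search                          # margin so all slices stay in bounds
    for y in range(m, h - m):
        for x in range(m, w - m):
            ref = prev[y-r:y+r+1, x-r:x+r+1].astype(float)
            best, bu, bv = np.inf, 0, 0
            for dv in range(-search, search + 1):
                for du in range(-search, search + 1):
                    cand = nxt[y+dv-r:y+dv+r+1, x+du-r:x+du+r+1]
                    ssd = np.sum((ref - cand) ** 2)   # patch dissimilarity
                    if ssd < best:
                        best, bu, bv = ssd, du, dv
            Ox[y, x], Oy[y, x] = bu, bv     # displacement of this pixel
    return Ox, Oy
```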
S34: apply whitening to the gray-scale image H obtained in step S33. Subtract the average Hi(mean) of each frame, Hi := Hi - Hi(mean), and then normalize H as H := (H - 0)/(256 - 0). The purpose of whitening and normalization is to denoise the input to a certain degree and to facilitate the subsequent computation of the convolutional neural network.
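The whitening and normalization of S34 reduce to two numpy operations per frame; the frame layout (N, H, W) and the function name are assumptions for illustration:

```python
import numpy as np

def whiten_and_normalize(frames):
    """Per S34: subtract each frame's own average (Hi := Hi - Hi(mean)),
    then normalize gray values by dividing by 256 (H := (H - 0)/(256 - 0))."""
    frames = frames.astype(float)
    means = frames.mean(axis=(1, 2), keepdims=True)  # per-frame average
    centered = frames - means                        # de-mean each frame
    return centered / 256.0                          # scale into unit range
```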
S35: input the whitened images from step S34 and the Ox, Oy obtained in step S33 into the multi-scale convolutional neural network; after the network computation, a feature vector F is output. F is the feature extracted from this second of video and is used to classify the human behavior in the video.
S36: input the video feature F into the classifier C and judge the behavior class of this video; if it belongs to abnormal behavior, handle the anomaly.
After recording the result, the method loops back to step S31 to obtain the next second of video. During online recognition by the convolutional neural network, human behavior recognition is performed on one second of video at a time. After each second of video is scaled to a fixed resolution, the dense optical-flow algorithm is used to compute the optical-flow features; the gray-scale video, together with the optical-flow features, is then whitened and normalized; finally, the convolutional neural network yields high-dimensional abstract features. These output features are the input to the final human behavior classifier.
The convolutional neural network used in this application has strong feature-extraction capability, so classifiers with excellent recognition ability can be trained for multiple different behaviors. To suit different application scenarios, the feature vector output by the convolutional neural network can be fed into different classifiers, e.g. classifier C1 (classifying fighting versus normal behavior) or classifier C2 (classifying fighting, running, and normal behavior). Such combinations offer higher environmental adaptability and practical value.
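A scenario-specific classifier such as C1 or C2 can be as simple as a linear layer followed by softmax over the CNN feature vector F; the class names, weights, and class layout below are purely illustrative assumptions:

```python
import numpy as np

def softmax(z):
    z = z - z.max()              # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

class LinearSoftmaxClassifier:
    """Sketch of a swappable classifier C: scores = W @ F + b, then softmax.
    Different (W, b, labels) triples give different scenario classifiers."""
    def __init__(self, W, b, labels):
        self.W, self.b, self.labels = W, b, labels
    def predict(self, F):
        p = softmax(self.W @ F + self.b)         # class probabilities
        return self.labels[int(np.argmax(p))], p
```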
Considering that the convolutional neural network used in the present invention has a large number of parameters, that three-dimensional convolution is computationally expensive, and that the network structure is complex, GPU acceleration is employed. The present invention uses the CUDA library and the cuDNN library to optimize the recognition speed of the whole algorithm. The CUDA library is mainly used for the convolution operations of the network computation: video memory is first allocated according to the matrix size, and multiple tasks are then distributed in parallel over multiple GPU cores, which in effect splits the whole matrix into multiple small matrices; the cuDNN library further optimizes the efficiency of the three-dimensional computations. Table 1 shows the recognition speed for one minute of video.
Table 1: Parallel detection speed
As can be seen from the table, when this algorithm detects multiple videos simultaneously with GPU acceleration, it achieves real-time detection of nearly 20 videos, and it is in fact advantageous in recognition speed compared with conventional human behavior recognition algorithms. Such a recognition speed fully supports real-time detection in a variety of practical application scenarios.
Those of ordinary skill in the art will appreciate that the embodiments described here are intended to help the reader understand the principle of the present invention, and it should be understood that the protection scope of the present invention is not limited to such specific statements and embodiments. Various modifications and variations of the present invention are possible for those skilled in the art. Any modification, equivalent substitution, improvement, etc. made within the spirit and principles of the present invention shall be included within the scope of the claims of the present invention.
Claims (6)
1. A real-time human abnormal behavior recognition method based on a multi-scale convolutional neural network, characterized by comprising:
S1: determining the structure of the multi-scale convolutional neural network, comprising a first layer, a second layer, a third layer, a fourth layer, a fifth layer, a sixth layer and a seventh layer; the first layer is the input layer and comprises three channels, which respectively accept the gray-converted video image information of the second preceding the current time and the two channels Ox, Oy of the dense optical flow computed from this video; the second layer is a three-dimensional convolution layer that convolves the video and optical flow input to the first layer with n convolution kernels of size cw*ch*cl; the third layer is a three-dimensional down-sampling layer that applies max pooling to the output of the second layer with a kernel of size pw, ph, pl; the fourth layer is a three-dimensional convolution layer for convolving the output of the third layer; the fifth layer is an NIN layer, composed of a network of two perceptron convolution layers, for extracting nonlinear features of human behavior from the output of the fourth layer; the sixth layer is a pyramid down-sampling layer, composed of three-dimensional down-sampling layers of different sizes, for down-sampling the nonlinear human behavior features output by the fifth layer; the seventh layer is a fully connected layer that obtains a feature vector of fixed dimension from the output of the sixth layer;
wherein Ox represents the component of the optical flow on the x axis and Oy represents the component of the optical flow on the y axis; cw represents the convolution kernel width, ch the convolution kernel height, and cl the convolution kernel length on the time axis;
S2: off-line training of the multi-scale convolutional neural network: obtaining a network parameter model by learning on the abnormal human behavior library, which, combined with the structure of the multi-scale convolutional neural network determined in step S1, serves as the model file during online recognition;
S3: online recognition by the convolutional neural network: obtaining the feature vector that characterizes the video by inputting the video into the model file.
2. The real-time human abnormal behavior recognition method based on a multi-scale convolutional neural network according to claim 1, characterized in that the three-dimensional down-sampling in the third layer adopts the following equation:
wherein x′ represents the input vector and y′ represents the output obtained after sampling; s, t and r are the sampling step lengths in the three directions of image width, image height and video time length, respectively; in the three-dimensional convolution, the feature map output by the previous layer is a two-dimensional matrix, S1 and S2 are respectively the total number of rows and the total number of columns of this two-dimensional matrix, m and n denote the m-th row and n-th column of the matrix, with 0 ≤ m < S1 and 0 ≤ n < S2; S3 represents the time length of the video, i.e. the video is S3 frames long in total, l denotes the l-th frame, with 0 ≤ l < S3; i represents the row index of this two-dimensional matrix, j represents the column index, and k represents the video frame number.
3. The real-time human abnormal behavior recognition method based on a multi-scale convolutional neural network according to claim 1, characterized in that the down-sampling processing in the sixth layer is specifically: performing overlapping down-sampling with multiple window sizes and step lengths on the nonlinear human behavior features output by the fifth layer, and then splicing each unit of the resulting feature maps into a vector as the output of the sixth layer.
4. The real-time human abnormal behavior recognition method based on a multi-scale convolutional neural network according to claim 1, characterized in that obtaining the network parameter model in step S2 specifically comprises the following sub-steps:
S21: loading the collected and annotated sample library, the sample library comprising positive samples and negative samples, where the positive samples are abnormal human behavior videos and the negative samples are normal human behavior videos;
S22: loading the multi-scale convolutional neural network determined in step S1;
S23: preprocessing the video and inputting it into the multi-scale convolutional neural network;
S24: judging whether the error of the multi-scale convolutional neural network is below a threshold; if so, going to step S25; otherwise back-propagating the difference between the output of the multi-scale convolutional neural network and the true label through the multi-scale convolutional neural network, adjusting the network parameters, and executing step S23 to retrain;
S25: saving the network parameters of the multi-scale convolutional neural network to obtain the network parameter model.
5. The real-time human abnormal behavior recognition method based on a multi-scale convolutional neural network according to claim 4, characterized in that the preprocessing is: converting to the gray-scale color space, subtracting the mean, and extracting gray-scale information and optical-flow information.
6. The real-time human abnormal behavior recognition method based on a multi-scale convolutional neural network according to claim 1, characterized in that the online recognition by the convolutional neural network in step S3 specifically comprises the following sub-steps:
S31: obtaining one second of video V from a camera or a video file;
S32: scaling the video obtained in step S31 to a fixed resolution as required;
S33: first converting the scaled video from step S32 to gray scale, then computing dense optical flow to obtain the two optical-flow channels Ox, Oy;
S34: applying whitening to the gray-scale image obtained in step S33;
S35: inputting the whitened image from step S34 and the Ox, Oy obtained in step S33 into the multi-scale convolutional neural network, and outputting a feature vector F after the network computation;
S36: inputting the video feature F into the classifier C and judging the behavior class of this video; if it belongs to abnormal behavior, handling the anomaly.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610790306.0A CN106407903A (en) | 2016-08-31 | 2016-08-31 | Multiple dimensioned convolution neural network-based real time human body abnormal behavior identification method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN106407903A true CN106407903A (en) | 2017-02-15 |
Family
ID=58001242
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610790306.0A Pending CN106407903A (en) | 2016-08-31 | 2016-08-31 | Multiple dimensioned convolution neural network-based real time human body abnormal behavior identification method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106407903A (en) |
Patent Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104281853A (en) * | 2014-09-02 | 2015-01-14 | 电子科技大学 | Behavior identification method based on 3D convolution neural network |
Non-Patent Citations (2)
Title |
---|
MIN LIN et al.: "Network In Network", Neural and Evolutionary Computing (cs.NE) *
WU Jie: "Research on Behavior Recognition Based on Convolutional Neural Networks", China Master's Theses Full-Text Database, Information Science and Technology series *
Cited By (53)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106980826A (en) * | 2017-03-16 | 2017-07-25 | 天津大学 | A kind of action identification method based on neutral net |
CN110574077B (en) * | 2017-03-24 | 2023-08-01 | 株式会社Jlk英思陪胜 | Image analysis device and method using virtual three-dimensional deep neural network |
CN110462637A (en) * | 2017-03-24 | 2019-11-15 | 华为技术有限公司 | Neural Network Data processing unit and method |
CN110574077A (en) * | 2017-03-24 | 2019-12-13 | 株式会社Jlk英思陪胜 | Image analysis device and method using virtual three-dimensional deep neural network |
CN110462637B (en) * | 2017-03-24 | 2022-07-19 | 华为技术有限公司 | Neural network data processing device and method |
CN107016415A (en) * | 2017-04-12 | 2017-08-04 | 合肥工业大学 | A kind of coloured image Color Semantic sorting technique based on full convolutional network |
CN107016415B (en) * | 2017-04-12 | 2019-07-19 | 合肥工业大学 | A kind of color image Color Semantic classification method based on full convolutional network |
CN107194559B (en) * | 2017-05-12 | 2020-06-05 | 杭州电子科技大学 | Workflow identification method based on three-dimensional convolutional neural network |
CN107194559A (en) * | 2017-05-12 | 2017-09-22 | 杭州电子科技大学 | A kind of work stream recognition method based on Three dimensional convolution neutral net |
CN107256386A (en) * | 2017-05-23 | 2017-10-17 | 东南大学 | Human behavior analysis method based on deep learning |
CN107341452B (en) * | 2017-06-20 | 2020-07-14 | 东北电力大学 | Human behavior identification method based on quaternion space-time convolution neural network |
CN107341452A (en) * | 2017-06-20 | 2017-11-10 | 东北电力大学 | Human bodys' response method based on quaternary number space-time convolutional neural networks |
CN107510452B (en) * | 2017-09-30 | 2019-10-08 | 扬美慧普(北京)科技有限公司 | A kind of ECG detecting method based on multiple dimensioned deep learning neural network |
CN107510452A (en) * | 2017-09-30 | 2017-12-26 | 扬美慧普(北京)科技有限公司 | A kind of ECG detecting method based on multiple dimensioned deep learning neutral net |
CN110019793A (en) * | 2017-10-27 | 2019-07-16 | 阿里巴巴集团控股有限公司 | A kind of text semantic coding method and device |
CN107862275A (en) * | 2017-11-01 | 2018-03-30 | 电子科技大学 | Human bodys' response model and its construction method and Human bodys' response method |
CN107992894A (en) * | 2017-12-12 | 2018-05-04 | 北京小米移动软件有限公司 | Image-recognizing method, device and computer-readable recording medium |
CN108898042B (en) * | 2017-12-27 | 2021-10-22 | 浩云科技股份有限公司 | Method for detecting abnormal user behavior in ATM cabin |
CN108898042A (en) * | 2017-12-27 | 2018-11-27 | 浩云科技股份有限公司 | A kind of detection method applied to user's abnormal behaviour in ATM machine cabin |
WO2019127271A1 (en) * | 2017-12-28 | 2019-07-04 | 深圳市锐明技术股份有限公司 | Warning method, device, storage medium and server regarding physical conflict behavior |
CN108124485A (en) * | 2017-12-28 | 2018-06-05 | 深圳市锐明技术股份有限公司 | For the alarm method of limbs conflict behavior, device, storage medium and server |
CN110298210A (en) * | 2018-03-21 | 2019-10-01 | 北京猎户星空科技有限公司 | A kind of method and apparatus that view-based access control model is judged extremely |
CN110533053A (en) * | 2018-05-23 | 2019-12-03 | 杭州海康威视数字技术股份有限公司 | A kind of event detecting method, device and electronic equipment |
CN108830204B (en) * | 2018-06-01 | 2021-10-19 | 中国科学技术大学 | Method for detecting abnormality in target-oriented surveillance video |
CN108830204A (en) * | 2018-06-01 | 2018-11-16 | 中国科学技术大学 | The method for detecting abnormality in the monitor video of target |
CN108805083A (en) * | 2018-06-13 | 2018-11-13 | 中国科学技术大学 | The video behavior detection method of single phase |
CN108830327A (en) * | 2018-06-21 | 2018-11-16 | 中国科学技术大学 | A kind of crowd density estimation method |
CN108830327B (en) * | 2018-06-21 | 2022-03-01 | 中国科学技术大学 | Crowd density estimation method |
CN110866526A (en) * | 2018-08-28 | 2020-03-06 | 北京三星通信技术研究有限公司 | Image segmentation method, electronic device and computer-readable storage medium |
CN109359519A (en) * | 2018-09-04 | 2019-02-19 | 杭州电子科技大学 | A kind of video anomaly detection method based on deep learning |
CN109145874A (en) * | 2018-09-28 | 2019-01-04 | 大连民族大学 | Measure application of the difference in the detection of obstacles of Autonomous Vehicle visual response part between video successive frame and its convolution characteristic pattern |
CN111126115A (en) * | 2018-11-01 | 2020-05-08 | 顺丰科技有限公司 | Violence sorting behavior identification method and device |
CN111126115B (en) * | 2018-11-01 | 2024-06-07 | 顺丰科技有限公司 | Violent sorting behavior identification method and device |
CN109492755A (en) * | 2018-11-07 | 2019-03-19 | 北京旷视科技有限公司 | Image processing method, image processing apparatus and computer readable storage medium |
CN109492755B (en) * | 2018-11-07 | 2022-03-01 | 北京旷视科技有限公司 | Image processing method, image processing apparatus, and computer-readable storage medium |
CN109636802A (en) * | 2019-01-18 | 2019-04-16 | 天津工业大学 | Pulmonary parenchyma based on depth convolutional neural networks is through CT image partition method |
CN110059545A (en) * | 2019-03-08 | 2019-07-26 | 佛山市云米电器科技有限公司 | A kind of smart home user behavior recognition method based on convolutional neural networks |
CN111832336B (en) * | 2019-04-16 | 2022-09-02 | 四川大学 | Improved C3D video behavior detection method |
CN111832336A (en) * | 2019-04-16 | 2020-10-27 | 四川大学 | Improved C3D video behavior detection method |
CN110120020A (en) * | 2019-04-30 | 2019-08-13 | 西北工业大学 | A kind of SAR image denoising method based on multiple dimensioned empty residual error attention network |
CN110245696A (en) * | 2019-05-30 | 2019-09-17 | 武汉智云集思技术有限公司 | Illegal incidents monitoring method, equipment and readable storage medium storing program for executing based on video |
CN110634557A (en) * | 2019-08-23 | 2019-12-31 | 广东省智能制造研究所 | Medical care resource auxiliary allocation method and system based on deep neural network |
CN110634557B (en) * | 2019-08-23 | 2022-08-23 | 广东省智能制造研究所 | Medical care resource auxiliary allocation method and system based on deep neural network |
CN111178344A (en) * | 2020-04-15 | 2020-05-19 | 中国人民解放军国防科技大学 | Multi-scale time sequence behavior identification method |
CN111507297A (en) * | 2020-04-24 | 2020-08-07 | 中国科学院空天信息创新研究院 | Radar signal identification method and system based on measurement information matrix |
CN111898418A (en) * | 2020-06-17 | 2020-11-06 | 北京航空航天大学 | Human body abnormal behavior detection method based on T-TINY-YOLO network |
CN112986210A (en) * | 2021-02-10 | 2021-06-18 | 四川大学 | Scale-adaptive microbial Raman spectrum detection method and system |
CN113221689A (en) * | 2021-04-27 | 2021-08-06 | 苏州工业职业技术学院 | Video multi-target emotion prediction method and system |
CN114973020A (en) * | 2022-06-15 | 2022-08-30 | 北京鹏鹄物宇科技发展有限公司 | Abnormal behavior analysis method based on satellite monitoring video |
CN116433755A (en) * | 2023-03-31 | 2023-07-14 | 哈尔滨工业大学 | Structure dense displacement recognition method and system based on deformable three-dimensional model and optical flow representation learning |
CN116433755B (en) * | 2023-03-31 | 2023-11-14 | 哈尔滨工业大学 | Structure dense displacement recognition method and system based on deformable three-dimensional model and optical flow representation learning |
CN116402691A (en) * | 2023-06-05 | 2023-07-07 | 四川轻化工大学 | Image super-resolution method and system based on rapid image feature stitching |
CN116402691B (en) * | 2023-06-05 | 2023-08-04 | 四川轻化工大学 | Image super-resolution method and system based on rapid image feature stitching |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106407903A (en) | Multiple dimensioned convolution neural network-based real time human body abnormal behavior identification method | |
Tao et al. | Smoke detection based on deep convolutional neural networks | |
CN104978580B (en) | A kind of insulator recognition methods for unmanned plane inspection transmission line of electricity | |
CN104809443B (en) | Detection method of license plate and system based on convolutional neural networks | |
CN107016357B (en) | Video pedestrian detection method based on time domain convolutional neural network | |
CN107341452A (en) | Human bodys' response method based on quaternary number space-time convolutional neural networks | |
CN109635744A (en) | A kind of method for detecting lane lines based on depth segmentation network | |
CN107229929A (en) | A kind of license plate locating method based on R CNN | |
CN106920243A (en) | The ceramic material part method for sequence image segmentation of improved full convolutional neural networks | |
CN107945153A (en) | A kind of road surface crack detection method based on deep learning | |
CN106682569A (en) | Fast traffic signboard recognition method based on convolution neural network | |
CN110378288A (en) | A kind of multistage spatiotemporal motion object detection method based on deep learning | |
CN104281853A (en) | Behavior identification method based on 3D convolution neural network | |
CN105160310A (en) | 3D (three-dimensional) convolutional neural network based human body behavior recognition method | |
CN104992223A (en) | Intensive population estimation method based on deep learning | |
CN104182772A (en) | Gesture recognition method based on deep learning | |
CN112464911A (en) | Improved YOLOv 3-tiny-based traffic sign detection and identification method | |
CN110163077A (en) | A kind of lane recognition method based on full convolutional neural networks | |
CN104299006A (en) | Vehicle license plate recognition method based on deep neural network | |
CN112950780B (en) | Intelligent network map generation method and system based on remote sensing image | |
CN103049763A (en) | Context-constraint-based target identification method | |
CN109002752A (en) | A kind of complicated common scene rapid pedestrian detection method based on deep learning | |
CN107273870A (en) | The pedestrian position detection method of integrating context information under a kind of monitoring scene | |
CN101236608A (en) | Human face detection method based on picture geometry | |
CN109948607A (en) | Candidate frame based on deep learning deconvolution network generates and object detection method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
RJ01 | Rejection of invention patent application after publication |
Application publication date: 20170215 |