CN109447014A - Online video behavior detection method based on a dual-channel convolutional neural network - Google Patents

Online video behavior detection method based on a dual-channel convolutional neural network

Info

Publication number
CN109447014A
CN109447014A
Authority
CN
China
Prior art keywords
detection box
confidence score
convolution
dual channel
frame
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN201811317221.6A
Other languages
Chinese (zh)
Inventor
陆生礼
庞伟
向丽苹
范雪梅
舒程昊
梁彪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sanbao Sci & Tech Co Ltd Nanjing
Wuxi Institute Of Integrated Circuit Technology Southeast University
Southeast University
Original Assignee
Sanbao Sci & Tech Co Ltd Nanjing
Wuxi Institute Of Integrated Circuit Technology Southeast University
Southeast University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sanbao Sci & Tech Co Ltd Nanjing, Wuxi Institute Of Integrated Circuit Technology Southeast University, Southeast University filed Critical Sanbao Sci & Tech Co Ltd Nanjing
Priority to CN201811317221.6A priority Critical patent/CN109447014A/en
Publication of CN109447014A publication Critical patent/CN109447014A/en
Withdrawn legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20Movements or behaviour, e.g. gesture recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/46Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an online video behavior detection method based on a dual-channel convolutional neural network. First, an optical-flow picture-sequence generation module converts the input RGB images into optical-flow images. Second, the optical-flow images and the original RGB images are fed through two channels into two identical lightweight dual-convolution-kernel SSD networks, which extract the temporal and spatial features of the two kinds of images together with detection boxes and confidence scores. Then, a fusion module merges the detection boxes and confidence scores produced for the two kinds of images, forming pictures with detection boxes and confidence scores. Finally, the pictures with detection boxes and confidence scores are fed into an online action-tube module, which gives the final behavior detection result from the perspective of the whole video. By designing the lightweight dual-convolution-kernel SSD network, the present invention significantly simplifies the deep-learning network and improves behavior detection efficiency.

Description

Online video behavior detection method based on a dual-channel convolutional neural network
Technical field
The invention belongs to the field of computer vision, and in particular relates to an online video behavior detection method based on a dual-channel convolutional neural network.
Background technique
Intelligent video analysis is currently a very popular and highly challenging direction in computer vision, with applications in many scenarios. It comprises numerous sub-directions, of which the two main ones are behavior recognition and behavior detection. Behavior recognition is analogous to image classification: it answers the question "what is the behavior in this video", where a trimmed clip containing a single action is given and must be classified. Behavior detection (or localization), by contrast, is analogous to object detection: it answers "is there a corresponding behavior in this video, and if so, in which segment of the frame sequence and where in each frame does it occur". This is mainly done in two steps: first, analogous to extracting candidate regions in object detection, the video segments that may contain an action are found; second, those segments are classified.
Before the advent of deep learning, the best-performing algorithm for behavior classification research was iDT. Its idea is to use an optical-flow field to obtain trajectories in the video sequence, and then to extract HOF, HOG, MBH, and trajectory features along those trajectories; HOF is computed on grayscale images, while the other descriptors are based on dense optical flow. The features are then encoded with the Fisher Vector method, and finally traditional machine-learning classifiers such as SVM and random forests are trained on the encoded features to produce the classification and the final result. After deep learning emerged, end-to-end solutions from feature extraction to classification were realized. Du Tran et al. introduced a time dimension into the two-dimensional convolution kernel, processing video with three-dimensional convolution kernels and achieving end-to-end training with a simple and fast network structure. Beyond the spatial dimension, the biggest difficulty with video is the temporal-sequence problem, which RNN-based networks can handle well; the RNN-based network algorithm proposed by Du Wenbin et al. introduces a pose-supervision mechanism and improves video classification. Behavior detection, being closer to real life as in surveillance and security, has enormous latent value. Its biggest difficulty is how to localize the video segments in which actions occur; most past behavior detection methods use sliding windows, but sliding-window localization is very time-consuming and has low temporal efficiency.
Summary of the invention
To solve the technical problems raised in the background above, the present invention aims to provide an online video behavior detection method based on a dual-channel convolutional neural network.
To achieve the above technical purpose, the technical solution of the present invention is as follows:
An online video behavior detection method based on a dual-channel convolutional neural network: first, an optical-flow picture-sequence generation module converts the input RGB images into optical-flow images; second, the optical-flow images and the original RGB images are fed through two channels into two identical lightweight dual-convolution-kernel SSD networks, which extract the temporal and spatial features of the two kinds of images together with detection boxes and confidence scores; then, a fusion module merges the detection boxes and confidence scores produced for the two kinds of images, forming pictures with detection boxes and confidence scores; finally, the pictures with detection boxes and confidence scores are fed into an online action-tube module, which gives the final behavior detection result from the perspective of the video.
Further, the optical-flow images produced by the optical-flow picture-sequence generation module are equal in size to the original RGB images.
Further, the lightweight dual-convolution-kernel SSD network is formed by using a lightweight dual-convolution-kernel network as the base network of SSD. The lightweight dual-convolution-kernel network comprises depthwise separable convolutions and two model-shrinking hyperparameters: a width multiplier α and a resolution multiplier ρ. The depthwise separable convolution decomposes a standard convolution into a depthwise convolution and a pointwise convolution: the depthwise convolution applies a single filter to each individual input channel, and the pointwise convolution then applies a 1x1 convolution to combine the outputs of all the depthwise convolutions. The width multiplier α ∈ (0, 1] thins the network by changing the numbers of input channels M and output channels N of a given layer into αM and αN. The resolution multiplier ρ ∈ (0, 1] thins the network by scaling the size of the input resolution.
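The effect of the depthwise separable decomposition and the two shrinking hyperparameters on computational cost can be illustrated with a short calculation. The cost formulas below follow the standard MobileNet-style accounting; the function names and the example layer sizes are illustrative choices, not taken from the patent:

```python
def standard_conv_cost(k, m, n, df):
    """Multiply-adds of a standard k x k convolution with M input and
    N output channels on a df x df feature map: k*k*M*N*df*df."""
    return k * k * m * n * df * df

def dws_conv_cost(k, m, n, df):
    """Depthwise separable cost: depthwise filtering (k*k*M*df*df)
    plus the 1x1 pointwise combination (M*N*df*df)."""
    return k * k * m * df * df + m * n * df * df

def thinned_cost(k, m, n, df, alpha=1.0, rho=1.0):
    """Apply the width multiplier alpha to the channel counts and the
    resolution multiplier rho to the feature-map size, then recompute
    the depthwise separable cost."""
    m2, n2 = int(alpha * m), int(alpha * n)
    df2 = int(rho * df)
    return dws_conv_cost(k, m2, n2, df2)

# Example layer: 3x3 kernel, 32 -> 64 channels, 56x56 feature map.
full = standard_conv_cost(3, 32, 64, 56)      # 57,802,752 mult-adds
dws = dws_conv_cost(3, 32, 64, 56)            #  7,325,696 mult-adds, ~7.9x less
thin = thinned_cost(3, 32, 64, 56, 0.5, 0.5)  #    514,304 with alpha = rho = 0.5
```

The numbers show why the substitution matters on embedded hardware: the separable form alone cuts this layer's multiply-adds roughly eightfold, and the two multipliers shrink it further at a controlled accuracy cost.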
Further, the specific process by which the fusion module merges the detection boxes and confidence scores produced for the two kinds of pictures is as follows: when a detection box b_f of the optical-flow image and a detection box b_r of the RGB image have the maximum area overlap, and that overlap exceeds a set threshold, the final fused result is obtained as a detection box whose confidence score is computed from the confidence scores of b_f and b_r weighted by their intersection-over-union (IoU), i.e. the ratio of the intersection to the union of the two detection boxes.
Further, in the online action-tube module, let T_i^c be the action tube for a particular class i, and n_c(t) the number of tubes after each frame has been processed, with t indexing frames and T the total number of frames. When t = 1, n_c(t) = 1. When t = T-1, the tubes from frame 1 to frame T-1 are sorted in descending order of confidence score, giving the tubes T_i^c. When t = T, among all detection boxes of frame T, those whose overlap with the last tube T_i^c exceeds a set threshold are found, and the box with the highest confidence score among them is output as the tube's continuation. When k consecutive frames produce no tube output, the tube is terminated; the terminated tubes are the final action tubes, and their score values are the final behavior detection result of the video.
The beneficial effects brought by adopting the above technical solution are:
Based on a deep-learning convolutional-neural-network approach, the present invention combines an efficient dual-convolution-kernel lightweight module with the classic object detection model SSD to form the lightweight dual-convolution-kernel SSD network, which significantly simplifies the deep-learning network and improves behavior detection efficiency. The invention can therefore be applied on hardware platforms in security scenarios such as detecting someone climbing over a wall to trigger an alarm, in sports scenarios such as spotting and recording a player's foul, and in traffic-management scenarios such as capturing abnormal vehicle behavior.
Detailed description of the invention
Fig. 1 is a flowchart of the invention.
Specific embodiment
The technical solution of the present invention is described in detail below with reference to the accompanying drawing.
The invention proposes an online video behavior detection method based on a dual-channel convolutional neural network. As shown in Fig. 1, it comprises four parts: an optical-flow picture-sequence generation module, the lightweight dual-convolution-kernel SSD networks, a fusion module, and an online action-tube module. First, the optical-flow picture-sequence generation module converts the input RGB images into optical-flow images. Second, the optical-flow images and the original RGB images are fed through two channels into two identical lightweight dual-convolution-kernel SSD networks, which extract the temporal and spatial features of the two kinds of images together with detection boxes and confidence scores. Then, the fusion module merges the detection boxes and confidence scores produced for the two kinds of images, forming pictures with detection boxes and confidence scores. Finally, the pictures with detection boxes and confidence scores are fed into the online action-tube module, which gives the final behavior detection result from the perspective of the video.
Part one: the optical-flow picture-sequence generation module. Its core algorithm is an optical-flow method. In space, motion can be described by a motion field; on the image plane, the motion of objects is often embodied as differences in the gray-level distributions of successive images in a sequence. The motion field in space, transferred onto the image, is represented as an optical-flow field, which reflects the tendency of change of the gray level at every point of the image. Optical flow can be regarded as the instantaneous velocity field generated by the motion of pixels on the image plane. The optical-flow field assigns each pixel in the picture a displacement in the X direction and the Y direction, so the optical flow obtained after computation is a two-channel image of the same size as the original image.
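The patent does not specify which optical-flow algorithm the module uses; as one standard instantiation, the following NumPy sketch computes a dense two-channel flow field of the same spatial size as the input frames via local least squares (Lucas-Kanade). The function name and window size are illustrative assumptions:

```python
import numpy as np

def lucas_kanade_flow(prev, curr, win=5):
    """Dense optical flow by local least squares (Lucas-Kanade).

    Returns an (H, W, 2) array: a two-channel flow "image" of the same
    spatial size as the input frames, channel 0 = x-flow, 1 = y-flow.
    """
    prev = prev.astype(np.float64)
    curr = curr.astype(np.float64)
    Iy, Ix = np.gradient(prev)        # spatial gray-level gradients
    It = curr - prev                  # temporal gradient
    H, W = prev.shape
    flow = np.zeros((H, W, 2))
    r = win // 2
    for y in range(r, H - r):
        for x in range(r, W - r):
            ix = Ix[y - r:y + r + 1, x - r:x + r + 1].ravel()
            iy = Iy[y - r:y + r + 1, x - r:x + r + 1].ravel()
            it = It[y - r:y + r + 1, x - r:x + r + 1].ravel()
            A = np.stack([ix, iy], axis=1)
            ATA = A.T @ A
            # solve Ix*u + Iy*v + It = 0 in the window, skipping
            # flat (ill-conditioned) regions
            if np.linalg.det(ATA) > 1e-6:
                u, v = np.linalg.solve(ATA, -A.T @ it)
                flow[y, x] = (u, v)
    return flow
```

Shifting a smooth test image by one pixel in x and running the function recovers a flow field whose non-zero entries cluster around u = 1, v = 0, confirming the two-channel, same-size property the patent relies on.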
Part two: the lightweight dual-convolution-kernel SSD network. SSD detects objects in images and realizes end-to-end classification and detection output. SSD divides the feature map into multiple grids, and each grid generates several default detection boxes at different size ratios; during training the default boxes are continually adjusted toward the labeled target boxes so that the loss becomes as small as possible, and redundant boxes are afterwards removed by non-maximum suppression to obtain the final grid detection boxes. Detection and classification are performed on feature layers of different scales, and the results of these layers are finally integrated into the final detection and classification output. The base network of SSD is VGG-16, which contains 13 convolutional layers, 5 pooling layers, and 3 fully connected layers; the network is deep, and its parameter count and computation are large. The present invention instead substitutes a lightweight dual-convolution-kernel network for VGG-16 as the base network of SSD, forming the lightweight dual-convolution-kernel SSD network, which cuts a great deal of computation and parameters and is well suited to mobile and embedded vision applications. The lightweight dual-convolution-kernel network is based on a streamlined structure that builds a lightweight deep neural network from depthwise separable convolutions; it contains a depthwise separable convolutional structure and two model-shrinking hyperparameters, namely the width multiplier and the resolution multiplier. A depthwise separable convolution decomposes a standard convolution into a depthwise convolution and a 1x1 (pointwise) convolution: the depthwise convolution applies a single filter to each individual input channel, and the pointwise convolution then applies a 1x1 convolution to combine the outputs of all the depthwise convolutions. To build still smaller and computationally cheaper models, a very simple parameter called the width multiplier α (α ∈ (0, 1]) is introduced, whose role is to thin every layer uniformly: given a layer and a width multiplier α, the numbers of input channels M and output channels N become αM and αN. The second hyperparameter for thinning the network's computation is the resolution multiplier ρ (ρ ∈ (0, 1]), where ρ = 1 corresponds to the basic lightweight dual-convolution-kernel network. On the ImageNet dataset, depthwise separable convolution loses only about 1% accuracy compared with standard convolution, while the amounts of computation and parameters are reduced substantially. Because this module must yield detection and classification results over both time and space, two identical lightweight dual-convolution-kernel SSD networks are used in a dual-channel manner to detect the RGB pictures and the optical-flow pictures respectively, yielding appearance-attribute and action-attribute detection results for each frame, expressed as per-frame detection boxes and target confidence scores.
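A minimal NumPy forward pass makes the two-kernel decomposition concrete. This is a sketch with valid padding and stride 1; the array shapes and names are illustrative, not the patent's:

```python
import numpy as np

def depthwise_separable_conv(x, dw, pw):
    """Depthwise separable convolution, valid padding, stride 1.

    x  : (H, W, M) input feature map
    dw : (k, k, M) one depthwise filter per input channel
    pw : (M, N)    1x1 pointwise weights combining the channels
    """
    H, W, M = x.shape
    k = dw.shape[0]
    Ho, Wo = H - k + 1, W - k + 1
    out_dw = np.zeros((Ho, Wo, M))
    # depthwise: each channel is filtered by its own single filter
    for c in range(M):
        for i in range(Ho):
            for j in range(Wo):
                out_dw[i, j, c] = np.sum(x[i:i + k, j:j + k, c] * dw[:, :, c])
    # pointwise: a 1x1 convolution combines all depthwise outputs
    return out_dw @ pw
```

For a 3x3 kernel mapping 3 channels to 5, the two kernels hold 27 + 15 = 42 weights versus 135 for the equivalent standard convolution, which is the source of the model shrinkage described above.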
Part three: the fusion module. The networks of part two separately produce each frame's temporal and spatial detection boxes and confidence scores, and the fusion module then merges the results of the optical-flow image and the RGB image according to a fixed mechanism: when a detection box b_f of the optical-flow image and a detection box b_r of the RGB image have the maximum area overlap, and that overlap exceeds a set threshold, the final fused result (the frame picture with its detection box and confidence score) is obtained as a detection box whose confidence score is computed from the confidence scores of b_f and b_r weighted by their intersection-over-union (IoU), i.e. the ratio of the intersection to the union of the two detection boxes.
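The fusion formula itself is rendered only as an image in the publication, so the sketch below shows the surrounding mechanism in code: for each RGB box the maximally overlapping optical-flow box is found, and a fused detection is emitted when the overlap exceeds the threshold. The particular score combination used here (RGB confidence boosted by the IoU-weighted flow confidence) is an illustrative assumption, not necessarily the patent's exact formula:

```python
def iou(a, b):
    """Intersection-over-union of boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def fuse_detections(rgb_dets, flow_dets, thr=0.5):
    """rgb_dets, flow_dets: lists of (box, score). For each RGB box,
    take the flow box with maximum area overlap; if the overlap exceeds
    the set threshold, emit the RGB box with a boosted score."""
    fused = []
    for rb, rs in rgb_dets:
        if not flow_dets:
            continue
        fb, fs = max(flow_dets, key=lambda d: iou(rb, d[0]))
        ov = iou(rb, fb)
        if ov > thr:
            # illustrative combination: boost the RGB confidence by
            # the IoU-weighted flow confidence (assumed, not verbatim)
            fused.append((rb, rs + ov * fs))
    return fused
```

With boxes (0, 0, 10, 10) and (0, 0, 10, 6) the IoU is 60/100 = 0.6, so the pair passes a 0.5 threshold and yields a single fused detection.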
Part four: the online action-tube module. The preceding networks process and output individual frames without considering temporal characteristics, whereas video behavior detection and recognition must, from the perspective of the video, find the sub-sequence of frames in which a behavior occurs and perform detection and class prediction on the action. The input of this module is the fused frames with detection boxes and confidence scores from part three. Let T_i^c be the action tube for a particular class i, and n_c(t) the number of tubes after each frame has been processed, with t indexing frames and T the total number of frames. When t = 1, n_c(t) = 1. When t = T-1, the tubes T_i^c from frame 1 to frame T-1 are sorted in descending order of confidence score. When t = T, among all detection boxes of frame T, those whose overlap with the last tube T_i^c exceeds a set threshold are found, and the box with the highest confidence score among them is output as the tube's continuation. When k consecutive frames produce no tube output, the tube is terminated; the terminated tubes are the final action tubes, and their score values are the final behavior detection result of the video.
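The tube-building loop described above can be sketched as greedy online linking. The exact bookkeeping in the patent may differ; here tubes are kept in descending order of mean confidence, each live tube is extended by the unused detection whose overlap with the tube's last box is highest and above the threshold, and a tube that misses k consecutive frames is terminated. Names and defaults are illustrative:

```python
def iou(a, b):
    """Intersection-over-union of boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def link_action_tubes(frames, iou_thr=0.3, k=3):
    """frames: list over time; each element is a list of (box, score)
    detections for one action class. Returns the grown tubes."""
    tubes = []
    for dets in frames:
        used = [False] * len(dets)
        live = [t for t in tubes if t["alive"]]
        live.sort(key=lambda t: sum(t["scores"]) / len(t["scores"]),
                  reverse=True)
        for tube in live:
            best_j, best_ov = -1, iou_thr
            for j, (box, _) in enumerate(dets):
                if used[j]:
                    continue
                ov = iou(tube["boxes"][-1], box)
                if ov > best_ov:
                    best_j, best_ov = j, ov
            if best_j >= 0:
                used[best_j] = True
                tube["boxes"].append(dets[best_j][0])
                tube["scores"].append(dets[best_j][1])
                tube["miss"] = 0
            else:
                tube["miss"] += 1
                if tube["miss"] >= k:
                    tube["alive"] = False  # terminated: a final tube
        for j, (box, score) in enumerate(dets):
            if not used[j]:  # unmatched detections seed new tubes
                tubes.append({"boxes": [box], "scores": [score],
                              "miss": 0, "alive": True})
    return tubes
```

Feeding three frames of a stationary high-confidence box plus one far-away spurious box produces two tubes: a three-frame tube for the persistent action and a one-frame tube for the outlier, matching the online behavior the text describes.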
The embodiment is merely illustrative of the technical idea of the invention and does not limit its scope of protection; any change made to the technical solution on the basis of the technical idea proposed by the present invention falls within the scope of protection of the present invention.

Claims (5)

1. An online video behavior detection method based on a dual-channel convolutional neural network, characterized in that: first, an optical-flow picture-sequence generation module converts the input RGB images into optical-flow images; second, the optical-flow images and the original RGB images are fed through two channels into two identical lightweight dual-convolution-kernel SSD networks, which extract the temporal and spatial features of the two kinds of images together with detection boxes and confidence scores; then, a fusion module merges the detection boxes and confidence scores produced for the two kinds of images, forming pictures with detection boxes and confidence scores; finally, the pictures with detection boxes and confidence scores are fed into an online action-tube module, which gives the final behavior detection result from the perspective of the video.
2. The online video behavior detection method based on a dual-channel convolutional neural network according to claim 1, characterized in that the optical-flow images produced by the optical-flow picture-sequence generation module are equal in size to the original RGB images.
3. The online video behavior detection method based on a dual-channel convolutional neural network according to claim 1, characterized in that the lightweight dual-convolution-kernel SSD network is formed by using a lightweight dual-convolution-kernel network as the base network of SSD; the lightweight dual-convolution-kernel network comprises depthwise separable convolutions and two model-shrinking hyperparameters: a width multiplier α and a resolution multiplier ρ; the depthwise separable convolution decomposes a standard convolution into a depthwise convolution and a pointwise convolution, where the depthwise convolution applies a single filter to each individual input channel and the pointwise convolution then applies a 1x1 convolution to combine the outputs of all the depthwise convolutions; the width multiplier α ∈ (0, 1] thins the network by changing the numbers of input channels M and output channels N of a given layer into αM and αN; the resolution multiplier ρ ∈ (0, 1] thins the network by scaling the size of the input resolution.
4. The online video behavior detection method based on a dual-channel convolutional neural network according to claim 1, characterized in that the specific process by which the fusion module merges the detection boxes and confidence scores produced for the two kinds of pictures is: when a detection box b_f of the optical-flow image and a detection box b_r of the RGB image have the maximum area overlap, and that overlap exceeds a set threshold, the final fused result is obtained as a detection box whose confidence score is computed from the confidence scores of b_f and b_r weighted by their intersection-over-union (IoU), i.e. the ratio of the intersection to the union of the two detection boxes.
5. The online video behavior detection method based on a dual-channel convolutional neural network according to claim 1, characterized in that, in the online action-tube module, T_i^c is the action tube for a particular class i and n_c(t) is the number of tubes after each frame has been processed, with t indexing frames and T the total number of frames; when t = 1, n_c(t) = 1; when t = T-1, the tubes T_i^c from frame 1 to frame T-1 are sorted in descending order of confidence score; when t = T, among all detection boxes of frame T, those whose overlap with the last tube T_i^c exceeds a set threshold are found, and the box with the highest confidence score among them is output as the tube's continuation; when k consecutive frames produce no tube output, the tube is terminated, the terminated tubes are the final action tubes, and their score values are the final behavior detection result of the video.
CN201811317221.6A 2018-11-07 2018-11-07 Online video behavior detection method based on a dual-channel convolutional neural network Withdrawn CN109447014A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811317221.6A CN109447014A (en) 2018-11-07 2018-11-07 Online video behavior detection method based on a dual-channel convolutional neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811317221.6A CN109447014A (en) 2018-11-07 2018-11-07 Online video behavior detection method based on a dual-channel convolutional neural network

Publications (1)

Publication Number Publication Date
CN109447014A true CN109447014A (en) 2019-03-08

Family

ID=65551240

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811317221.6A Withdrawn CN109447014A (en) 2018-11-07 2018-11-07 A kind of online behavioral value method of video based on binary channels convolutional neural networks

Country Status (1)

Country Link
CN (1) CN109447014A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110276751A (en) * 2019-06-17 2019-09-24 北京字节跳动网络技术有限公司 Determine method, apparatus, electronic equipment and the computer readable storage medium of image parameter
CN112733585A (en) * 2019-10-29 2021-04-30 杭州海康威视数字技术股份有限公司 Image recognition method
CN113822110A (en) * 2021-01-07 2021-12-21 北京京东振世信息技术有限公司 Target detection method and device
WO2022134655A1 (en) * 2020-12-25 2022-06-30 神思电子技术股份有限公司 End-to-end video action detection and positioning system
CN116453010A (en) * 2023-03-13 2023-07-18 彩虹鱼科技(广东)有限公司 Ocean biological target detection method and system based on optical flow RGB double-path characteristics

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101187990A (en) * 2007-12-14 2008-05-28 华南理工大学 A session robotic system
US20150347847A1 (en) * 2014-05-29 2015-12-03 Samsung Electronics Co., Ltd. Image processing method and electronic device implementing the same
CN106650655A (en) * 2016-12-16 2017-05-10 北京工业大学 Action detection model based on convolutional neural network
CN106709453A (en) * 2016-12-24 2017-05-24 北京工业大学 Sports video key posture extraction method based on deep learning
CN108304786A (en) * 2018-01-17 2018-07-20 东南大学 A kind of pedestrian detection method based on binaryzation convolutional neural networks
CN108520203A (en) * 2018-03-15 2018-09-11 上海交通大学 Multiple target feature extracting method based on fusion adaptive more external surrounding frames and cross pond feature
CN108764148A (en) * 2018-05-30 2018-11-06 东北大学 Multizone real-time action detection method based on monitor video


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ANDREW G. HOWARD ET AL: "MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications", Computer Vision and Pattern Recognition *
王昕培: "Research on abnormal behavior classification algorithms based on two-stream CNN", China Masters' Theses Full-text Database, Information Science and Technology *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110276751A (en) * 2019-06-17 2019-09-24 北京字节跳动网络技术有限公司 Determine method, apparatus, electronic equipment and the computer readable storage medium of image parameter
CN112733585A (en) * 2019-10-29 2021-04-30 杭州海康威视数字技术股份有限公司 Image recognition method
CN112733585B (en) * 2019-10-29 2023-09-05 杭州海康威视数字技术股份有限公司 image recognition method
WO2022134655A1 (en) * 2020-12-25 2022-06-30 神思电子技术股份有限公司 End-to-end video action detection and positioning system
CN113822110A (en) * 2021-01-07 2021-12-21 北京京东振世信息技术有限公司 Target detection method and device
CN113822110B (en) * 2021-01-07 2023-08-08 北京京东振世信息技术有限公司 Target detection method and device
CN116453010A (en) * 2023-03-13 2023-07-18 彩虹鱼科技(广东)有限公司 Ocean biological target detection method and system based on optical flow RGB double-path characteristics
CN116453010B (en) * 2023-03-13 2024-05-14 彩虹鱼科技(广东)有限公司 Ocean biological target detection method and system based on optical flow RGB double-path characteristics

Similar Documents

Publication Publication Date Title
CN109389055B (en) Video classification method based on mixed convolution and attention mechanism
Li et al. Actional-structural graph convolutional networks for skeleton-based action recognition
CN110210551B (en) Visual target tracking method based on adaptive subject sensitivity
CN108764308B (en) Pedestrian re-identification method based on convolution cycle network
CN108537743B (en) Face image enhancement method based on generation countermeasure network
CN109447014A (en) A kind of online behavioral value method of video based on binary channels convolutional neural networks
CN108764085B (en) Crowd counting method based on generation of confrontation network
CN110516536B (en) Weak supervision video behavior detection method based on time sequence class activation graph complementation
CN109740419A (en) A kind of video behavior recognition methods based on Attention-LSTM network
CN113065558A (en) Lightweight small target detection method combined with attention mechanism
CN111723693B (en) Crowd counting method based on small sample learning
CN107729993A (en) Utilize training sample and the 3D convolutional neural networks construction methods of compromise measurement
CN110689021A (en) Real-time target detection method in low-visibility environment based on deep learning
CN110490907A (en) Motion target tracking method based on multiple target feature and improvement correlation filter
CN110390308B (en) Video behavior identification method based on space-time confrontation generation network
CN116343330A (en) Abnormal behavior identification method for infrared-visible light image fusion
Liu et al. An improved YOLOv5 method for small object detection in UAV capture scenes
CN111460980A (en) Multi-scale detection method for small-target pedestrian based on multi-semantic feature fusion
CN114463677B (en) Safety helmet wearing detection method based on global attention
Liu et al. APSNet: Toward adaptive point sampling for efficient 3D action recognition
Yuan et al. Few-shot scene classification with multi-attention deepemd network in remote sensing
CN116206306A (en) Inter-category characterization contrast driven graph roll point cloud semantic annotation method
Zhao et al. Image dehazing based on haze degree classification
Patil et al. Semantic Segmentation of Satellite Images using Modified U-Net
CN111209886A (en) Rapid pedestrian re-identification method based on deep neural network

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication
WW01 Invention patent application withdrawn after publication

Application publication date: 20190308