CN109145822A - A deep learning violence detection system - Google Patents
A deep learning violence detection system
- Publication number
- CN109145822A CN109145822A CN201810960914.0A CN201810960914A CN109145822A CN 109145822 A CN109145822 A CN 109145822A CN 201810960914 A CN201810960914 A CN 201810960914A CN 109145822 A CN109145822 A CN 109145822A
- Authority
- CN
- China
- Prior art keywords
- image
- network model
- layer
- module
- violence
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/41—Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/44—Event detection
Abstract
The present invention provides a deep learning violence detection system comprising an image input module, an image global feature module, a deep network model module, a 3D network model module, a D3D network model module, and an image output module. The image input module receives the image to be detected; the image global feature module extracts the global features of the image; the deep network model module fuses the extracted global image features within a deep network model; the 3D network model module determines the violence detection result based on the deep network model module; the D3D network model module optimizes the detection result of the 3D network model module; and the image output module outputs the optimized detection result. The benefit of the invention is a deep learning violence detection system that effectively improves the accuracy of violence detection.
Description
Technical field
The present invention relates to the field of violence detection technology, and in particular to a deep learning violence detection system.
Background technique
As the construction of safe cities continues to advance, video surveillance systems have been deployed on a large scale, and using intelligent video analysis technology to detect and give early warning of violent acts has become an urgent need.
Existing violence detection systems can be divided, according to the signal they analyze, into audio-based methods, audio-visual methods, and video-based methods. In real surveillance deployments, most monitoring systems have no audio capture equipment installed; in such cases audio-based methods cannot work, which makes violence detection based on image and video data the more valuable research direction. In addition, behaviors such as explosions, bleeding, and car chases are usually effective cues for detecting violent scenes in film clips, but in daily life such behaviors are very rare. By contrast, violent brawls and group fist-fights occur most frequently in daily life and cause the widest range of damage.
Summary of the invention
In view of the above problems, the present invention provides a deep learning violence detection system.
The object of the present invention is achieved by the following technical scheme:
A deep learning violence detection system is provided, comprising an image input module, an image global feature module, a deep network model module, a 3D network model module, a D3D network model module, and an image output module. The image input module receives the image to be detected; the image global feature module extracts the global features of the image; the deep network model module fuses the extracted global image features within a deep network model; the 3D network model module determines the violence detection result based on the deep network model module; the D3D network model module optimizes the detection result of the 3D network model module; and the image output module outputs the optimized detection result.
The benefit of the invention is a deep learning violence detection system that effectively improves the accuracy of violence detection.
Brief description of the drawings
The present invention will be further described with reference to the accompanying drawings. The embodiments in the drawings do not limit the invention in any way; for those of ordinary skill in the art, other drawings can be obtained from the following drawings without creative effort.
Fig. 1 is a structural schematic diagram of the invention.
Reference numerals:
image input module 1, image global feature module 2, deep network model module 3, 3D network model module 4, D3D network model module 5, image output module 6.
Specific embodiment
The invention will be further described with the following examples.
Referring to Fig. 1, the deep learning violence detection system of this embodiment comprises an image input module 1, an image global feature module 2, a deep network model module 3, a 3D network model module 4, a D3D network model module 5, and an image output module 6. The image input module 1 receives the image to be detected; the image global feature module 2 extracts the global features of the image; the deep network model module 3 fuses the extracted global image features within a deep network model; the 3D network model module 4 determines the violence detection result based on the deep network model module 3; the D3D network model module 5 optimizes the detection result of the 3D network model module 4; and the image output module 6 outputs the optimized detection result.
This embodiment provides a deep learning violence detection system that effectively improves the accuracy of violence detection.
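The six-module flow of this embodiment can be sketched as a simple pipeline. The class name and the stand-in callables below are illustrative assumptions for exposition, not the patent's implementation:

```python
import numpy as np

class ViolencePipeline:
    """Image input -> global features -> deep-model fusion -> 3D result -> D3D optimization -> output."""

    def __init__(self, extract, fuse, detect, refine):
        self.extract = extract  # image global feature module (2)
        self.fuse = fuse        # deep network model module (3)
        self.detect = detect    # 3D network model module (4)
        self.refine = refine    # D3D network model module (5)

    def run(self, clip):
        feats = self.extract(clip)   # global features of the input images
        fused = self.fuse(feats)     # features fused in the deep network model
        raw = self.detect(fused)     # preliminary violence detection result
        return self.refine(raw)      # optimized result: 1 = violent, 0 = not

# Stand-in modules: average the 40-frame clip, then threshold the mean intensity.
pipeline = ViolencePipeline(
    extract=lambda clip: clip.mean(axis=0),
    fuse=lambda feats: feats,
    detect=lambda feats: float(feats.mean() > 0.5),
    refine=lambda result: int(result),
)
```

Each stage only passes its output to the next, which mirrors how the six modules are chained in Fig. 1.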
Preferably, the image global feature module 2 comprises a data input layer, a convolutional computation layer, an excitation layer, and a pooling layer. The data input layer pre-processes the input image; the convolutional computation layer filters the image and performs the convolution operation; the excitation layer applies a non-linear mapping to the output of the convolutional computation layer; and the pooling layer compresses the non-linearly mapped image. In the convolutional computation layer, local neighborhood features are extracted from the pre-processed image by the convolution operation, and through multi-level iteration the global features of the image are extracted by two-dimensional convolution:

v_ij^(xy) = f( Σ_d Σ_{h=0}^{H-1} Σ_{w=0}^{W-1} w_ijd^(hw) · v_(i-1)d^((x+h)(y+w)) + b_ij )

In the formula above, i denotes the convolutional layer the image currently passes through, j denotes the number of feature maps in that layer, and v_ij^(xy) denotes the activation value at position (x, y) in the j-th feature map of the i-th layer; this activation value is the two-dimensional global feature of the image. f(·) denotes the activation function; H and W denote the height and width of the two-dimensional convolution kernel; w_ijd^(hw) denotes the weight of the convolution kernel; v_(i-1)d^((x+h)(y+w)) denotes the activation value of the d-th feature map of the (i-1)-th layer at (x+h, y+w); and b_ij denotes the bias term.
In this preferred embodiment, two-dimensional convolution can conveniently abstract the spatial information of an image, which is simple, convenient, and widely applicable; but these appearance features alone are insufficient to fully represent a video, since temporal information is lost.
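As an illustrative sketch (not the patented implementation), the two-dimensional convolution described above can be written directly in NumPy; the function name and the restriction to a single output feature map are assumptions made for brevity:

```python
import numpy as np

def conv2d_activation(prev_maps, kernels, bias, f=np.tanh):
    """Compute one output feature map v_ij at every valid position (x, y).

    prev_maps: (D, H_in, W_in) activations of layer i-1
    kernels:   (D, H, W) weights w_ijd^hw for this output map
    bias:      scalar b_ij
    f:         activation function applied to the summed response
    """
    D, H, W = kernels.shape
    _, H_in, W_in = prev_maps.shape
    out = np.zeros((H_in - H + 1, W_in - W + 1))
    for x in range(out.shape[0]):
        for y in range(out.shape[1]):
            # sum over all previous feature maps d and kernel offsets (h, w)
            out[x, y] = np.sum(kernels * prev_maps[:, x:x + H, y:y + W]) + bias
    return f(out)
```

With an all-ones 4 × 4 input map, an all-ones 2 × 2 kernel, zero bias, and an identity activation, every output value is 4 and the output map is 3 × 3.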
Preferably, the deep network model module 3 generates a three-dimensional convolution kernel by spatially extending the two-dimensional convolution kernel of the image global feature module 2; the three-dimensional convolution at pixel (x, y, z) is defined as:

v_ij^(xyz) = f( Σ_d Σ_{h=0}^{H-1} Σ_{w=0}^{W-1} Σ_{t=0}^{T-1} w_ijd^(hwt) · v_(i-1)d^((x+h)(y+w)(z+t)) + b_ij )

In the formula above, i denotes the convolutional layer the image currently passes through, j denotes the number of feature maps in that layer, and v_ij^(xyz) denotes the activation value at position (x, y, z) in the j-th feature map of the i-th layer; this activation value is the three-dimensional global feature of the image. f(·) denotes the activation function; H, W, and T denote the sizes of the three-dimensional convolution kernel along the height, width, and time dimensions respectively; w_ijd^(hwt) denotes the weight of the convolution kernel; v_(i-1)d^((x+h)(y+w)(z+t)) denotes the activation value of the d-th feature map of the (i-1)-th layer at (x+h, y+w, z+t); and b_ij denotes the bias term.
Compared with the two-dimensional convolution formula, this preferred embodiment adds the time dimension to both the convolution kernel and the pixel representation. Once the kernel is extended to three-dimensional space, convolution over an image sequence proceeds in space and time simultaneously, so after the convolution and pooling operations the output feature maps remain image sequences, effectively preserving the spatio-temporal information in the video. Through the feature extraction of multiple three-dimensional convolutions, the global spatio-temporal features of the video can be extracted.
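Under the same caveat (an illustrative sketch, not the patent's code), the spatial extension to a three-dimensional kernel adds one more summation over the temporal offset t; NumPy's `sliding_window_view` (NumPy ≥ 1.20) keeps a loop-free version short:

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def conv3d_activation(prev_maps, kernels, bias, f=np.tanh):
    """Compute one output feature map v_ij at every valid position (z, x, y).

    prev_maps: (D, Z_in, X_in, Y_in) activations of layer i-1
    kernels:   (D, T, H, W) weights w_ijd^hwt, T being the temporal extent
    bias:      scalar b_ij
    """
    D, T, H, W = kernels.shape
    # windows: (D, Z', X', Y', T, H, W) — every T x H x W neighborhood
    windows = sliding_window_view(prev_maps, (T, H, W), axis=(1, 2, 3))
    # sum over previous maps d and offsets (t, h, w), as in the formula
    out = np.einsum('dzxythw,dthw->zxy', windows, kernels) + bias
    return f(out)
```

With all-ones inputs of shape (2, 4, 5, 5), an all-ones 2 × 2 × 2 kernel per input map, zero bias, and an identity activation, every output value is 2·2·2·2 = 16.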
Preferably, the 3D network model module 4 uses three convolutional computation layers C1, C2, and C3 based on the deep network model module 3; the three-dimensional kernel sizes used by C1, C2, and C3 are 7 × 7 × 5, 5 × 5 × 5, and 3 × 3 × 3 pixels respectively. The input of the 3D network model module 4 is an image segment X composed of 40 consecutive frames; after pre-processing, the frames are normalized to 60 × 90 pixels and converted to grayscale. The output is a scalar Y indicating the model's detection result for the input images: for a trained model, Y is 1 if the test images contain a violent scene, and 0 otherwise.
The 3D network model module 4 applies a pooling operation to the feature maps computed by the first two convolutional computation layers; the pooling is calculated by the following formula:

a_y^x = f( θ_y^x · δ_T(a_y^(x-1)) + B_y^x )

In the formula, δ_T is the sampling function, δ_T(t) = Σ_n δ(t − nT), where t is the time, T is the sampling period, and n ∈ [0, +∞) is a non-negative integer; a_y^x denotes the y-th feature map of the x-th layer and a_y^(x-1) the y-th feature map of the (x−1)-th layer; θ and B are the multiplicative and additive biases respectively, θ_y^x being the y-th multiplicative bias of the x-th layer and B_y^x the y-th additive bias of the x-th layer.
The pooling operation uses two-dimensional pooling, i.e., the input feature map sequence is not down-sampled in the time dimension; the pooling factors are set to 3 × 3 and 2 × 2 pixels respectively.
During model training, the 3D network model module 4 uses the mean squared error as its cost function, expressed as follows:

H1(X, θ) = (1/N) Σ_{k=1}^{N} ( G(X_k, θ) − Ŷ_k )²

In the formula, H1(X, θ) denotes the 3D network model cost function, G is the model function, θ is the model parameter, X_k is the k-th training sample, N is the number of samples, and Ŷ_k is the true label of the k-th sample; k ∈ [1, N], N ∈ [1, +∞]. The smaller the cost function value, the better the model fits the training set.
On the one hand, this preferred embodiment further reduces the number of network parameters; on the other hand, it gives the feature maps properties such as translation and rotation invariance, making the learned features more robust.
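A minimal sketch of the spatial-only pooling and the mean-squared-error cost described above. Mean pooling and the function names are assumptions (the patent does not specify the pooling operator):

```python
import numpy as np

def pool2d_frames(frames, ph, pw):
    """Non-overlapping spatial pooling of a frame sequence.

    The time dimension is left untouched — no temporal down-sampling,
    matching the 3D network model module's two-dimensional pooling.
    frames: (T, H, W); (ph, pw): pooling factor, e.g. (3, 3) or (2, 2).
    """
    T, H, W = frames.shape
    Hp, Wp = H // ph, W // pw
    blocks = frames[:, :Hp * ph, :Wp * pw].reshape(T, Hp, ph, Wp, pw)
    return blocks.mean(axis=(2, 4))

def mse_cost(preds, labels):
    """H1: mean squared error between model outputs and true labels."""
    return np.mean((np.asarray(preds) - np.asarray(labels)) ** 2)
```

For example, pooling a 40 × 60 × 90 feature sequence with the 3 × 3 factor yields 40 × 20 × 30: 40 frames remain, only the spatial resolution shrinks.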
Preferably, the D3D network model module 5 is based on the 3D network model; its input is 40 consecutive frames of 128 × 128 pixels, and the consecutive frames are three-channel color images.
The three-dimensional convolution kernels are uniformly set to 3 × 3 × 3 pixels. During the convolution operation, the D3D network model module 5 pads the feature maps so that the feature maps obtained after convolution keep the same size as before the computation. Three-dimensional pooling is used during pooling, i.e., the input feature map sequence is also down-sampled in the time dimension; the pooling factor is set to 2 × 2 × 2 pixels.
As its cost function during model training, the D3D network model module 5 chooses the negative log-likelihood function, expressed as follows:

H2(X, θ) = −(1/N) Σ_{k=1}^{N} Σ_{l=1}^{m} Ŷ_k^l · log G_l(X_k, θ)

In the formula, H2(X, θ) denotes the D3D network model cost function, G is the model function, θ is the model parameter, X_k is the k-th training sample, m is the number of classes, N is the number of samples per class, and Ŷ_k is the true label of the k-th sample; k ∈ [1, N], N ∈ [1, +∞], l ∈ [1, m], m ∈ [1, +∞].
This preferred embodiment uses a more complex structure, so it can handle higher-dimensional image data; this accelerates the extraction of temporal information from images and removes much of its redundancy.
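The D3D variant's spatio-temporal pooling and negative log-likelihood cost can be sketched the same way; mean pooling and the function names are again assumptions for illustration:

```python
import numpy as np

def pool3d(volume, p=2):
    """Non-overlapping pooling over time AND space with factor p x p x p,
    as in the D3D network model module. volume: (T, H, W)."""
    T, H, W = volume.shape
    Tp, Hp, Wp = T // p, H // p, W // p
    blocks = volume[:Tp * p, :Hp * p, :Wp * p].reshape(Tp, p, Hp, p, Wp, p)
    return blocks.mean(axis=(1, 3, 5))

def nll_cost(probs, onehot):
    """H2: negative log-likelihood averaged over N samples and m classes.

    probs:  (N, m) predicted class probabilities G_l(X_k, theta)
    onehot: (N, m) one-hot true labels Y_k
    """
    return -np.mean(np.sum(onehot * np.log(probs), axis=1))
```

Unlike the 2D pooling of the 3D network model module, `pool3d` halves the temporal length as well, which is what discards the bulk of the redundant frame-to-frame information.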
Violence detection was carried out using the deep learning violence detection system of the present invention. Five detection scenes were selected for testing, namely detection scenes 1 through 5. Violence detection accuracy and violence detection speed were measured and compared against an existing violence detection system, producing the beneficial effects shown in the following table:
| | Violence detection accuracy improvement | Violence detection speed improvement |
|---|---|---|
| Detection scene 1 | 29% | 27% |
| Detection scene 2 | 27% | 26% |
| Detection scene 3 | 26% | 26% |
| Detection scene 4 | 25% | 24% |
| Detection scene 5 | 24% | 22% |
Through the above description of the embodiments, those skilled in the art will understand that the embodiments described herein can be realized with hardware, software, firmware, middleware, code, or any suitable combination thereof. For a hardware implementation, the processor can be realized in one or more of the following units: an application-specific integrated circuit (ASIC), a digital signal processor (DSP), a digital signal processing device (DSPD), a programmable logic device (PLD), a field-programmable gate array (FPGA), a processor, a controller, a microcontroller, a microprocessor, other electronic units designed to realize the functions described herein, or a combination thereof. For a software implementation, some or all of the processes of the embodiments can be completed by a computer program instructing the relevant hardware. When realized, the above program can be stored in a computer-readable medium, or transmitted as one or more instructions or code on a computer-readable medium. Computer-readable media include computer storage media and communication media, where communication media include any medium that facilitates the transfer of a computer program from one place to another. A storage medium can be any usable medium that a computer can access. Computer-readable media can include, but are not limited to, RAM, ROM, EEPROM, CD-ROM or other optical disc storage, magnetic disk storage media or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer.
Finally, it should be noted that the above embodiments are merely illustrative of the technical solutions of the present invention and do not limit its scope of protection. Although the present invention has been explained in detail with reference to preferred embodiments, those skilled in the art should understand that modifications or equivalent replacements can be made to the technical solutions of the present invention without departing from the substance and scope of those solutions.
Claims (8)
1. A deep learning violence detection system, characterized by comprising an image input module, an image global feature module, a deep network model module, a 3D network model module, a D3D network model module, and an image output module, wherein the image input module is used to input the image to be detected; the image global feature module is used to extract the global features of the image; the deep network model module is used to fuse the extracted global image features within a deep network model; the 3D network model module determines the violence detection result based on the deep network model module; the D3D network model module is used to optimize the violence detection result of the 3D network model module; and the image output module is used to output the optimized violence detection result.
2. The deep learning violence detection system according to claim 1, characterized in that the image global feature module comprises a data input layer, a convolutional computation layer, an excitation layer, and a pooling layer; the data input layer pre-processes the input image; the convolutional computation layer filters the image and performs the convolution operation; the excitation layer applies a non-linear mapping to the output of the convolutional computation layer; and the pooling layer compresses the non-linearly mapped image.
3. The deep learning violence detection system according to claim 2, characterized in that, in the convolutional computation layer, local neighborhood features are extracted from the pre-processed image by the convolution operation, and through multi-level iteration the global features of the image are extracted by two-dimensional convolution:

v_ij^(xy) = f( Σ_d Σ_{h=0}^{H-1} Σ_{w=0}^{W-1} w_ijd^(hw) · v_(i-1)d^((x+h)(y+w)) + b_ij )

In the formula above, i denotes the convolutional layer the image currently passes through, j denotes the number of feature maps in that layer, and v_ij^(xy) denotes the activation value at position (x, y) in the j-th feature map of the i-th layer; this activation value is the two-dimensional global feature of the image; f(·) denotes the activation function; H and W denote the height and width of the two-dimensional convolution kernel; w_ijd^(hw) denotes the weight of the convolution kernel; v_(i-1)d^((x+h)(y+w)) denotes the activation value of the d-th feature map of the (i-1)-th layer at (x+h, y+w); and b_ij denotes the bias term.
4. The deep learning violence detection system according to claim 3, characterized in that the deep network model module generates a three-dimensional convolution kernel by spatially extending the two-dimensional convolution kernel of the image global feature module, and the three-dimensional convolution at pixel (x, y, z) is defined as:

v_ij^(xyz) = f( Σ_d Σ_{h=0}^{H-1} Σ_{w=0}^{W-1} Σ_{t=0}^{T-1} w_ijd^(hwt) · v_(i-1)d^((x+h)(y+w)(z+t)) + b_ij )

In the formula above, i denotes the convolutional layer the image currently passes through, j denotes the number of feature maps in that layer, and v_ij^(xyz) denotes the activation value at position (x, y, z) in the j-th feature map of the i-th layer; this activation value is the three-dimensional global feature of the image; f(·) denotes the activation function; H, W, and T denote the sizes of the three-dimensional convolution kernel along the height, width, and time dimensions respectively; w_ijd^(hwt) denotes the weight of the convolution kernel; v_(i-1)d^((x+h)(y+w)(z+t)) denotes the activation value of the d-th feature map of the (i-1)-th layer at (x+h, y+w, z+t); and b_ij denotes the bias term.
5. The deep learning violence detection system according to claim 4, characterized in that the 3D network model module uses three convolutional computation layers C1, C2, and C3 based on the deep network model module, the three-dimensional kernel sizes used by C1, C2, and C3 being 7 × 7 × 5, 5 × 5 × 5, and 3 × 3 × 3 pixels respectively.
6. The deep learning violence detection system according to claim 5, characterized in that the input of the 3D network model module is an image segment X composed of 40 consecutive frames; after pre-processing, the frames are normalized to 60 × 90 pixels and converted to grayscale; the output is a scalar Y indicating the model's detection result for the input images: for a trained model, Y is 1 if the test images contain a violent scene, and 0 otherwise.
7. The deep learning violence detection system according to claim 6, characterized in that the 3D network model module applies a pooling operation to the feature maps computed by the first two convolutional computation layers, the pooling being calculated by the following formula:

a_y^x = f( θ_y^x · δ_T(a_y^(x-1)) + B_y^x )

In the formula, δ_T is the sampling function, δ_T(t) = Σ_n δ(t − nT), where t is the time, T is the sampling period, and n ∈ [0, +∞) is a non-negative integer; a_y^x denotes the y-th feature map of the x-th layer and a_y^(x-1) the y-th feature map of the (x−1)-th layer; θ and B are the multiplicative and additive biases respectively, θ_y^x being the y-th multiplicative bias of the x-th layer and B_y^x the y-th additive bias of the x-th layer.
8. The deep learning violence detection system according to claim 7, characterized in that the pooling operation uses two-dimensional pooling, i.e., the input feature map sequence is not down-sampled in the time dimension, and the pooling factors are set to 3 × 3 and 2 × 2 pixels respectively.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810960914.0A CN109145822A (en) | 2018-08-22 | 2018-08-22 | A deep learning violence detection system |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109145822A true CN109145822A (en) | 2019-01-04 |
Family
ID=64790766
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810960914.0A Withdrawn CN109145822A (en) | 2018-08-22 | 2018-08-22 | A kind of violence detection system of deep learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109145822A (en) |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111860064A (en) * | 2019-04-30 | 2020-10-30 | 杭州海康威视数字技术股份有限公司 | Target detection method, device and equipment based on video and storage medium |
CN111860064B (en) * | 2019-04-30 | 2023-10-20 | 杭州海康威视数字技术股份有限公司 | Video-based target detection method, device, equipment and storage medium |
CN111091060A (en) * | 2019-11-20 | 2020-05-01 | 吉林大学 | Deep learning-based fall and violence detection method |
CN111091060B (en) * | 2019-11-20 | 2022-11-04 | 吉林大学 | Fall and violence detection method based on deep learning |
CN111191528A (en) * | 2019-12-16 | 2020-05-22 | 江苏理工学院 | Campus violent behavior detection system and method based on deep learning |
CN111191528B (en) * | 2019-12-16 | 2024-02-23 | 江苏理工学院 | Campus violence behavior detection system and method based on deep learning |
CN112287754A (en) * | 2020-09-23 | 2021-01-29 | 济南浪潮高新科技投资发展有限公司 | Violence detection method, device, equipment and medium based on neural network |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
WW01 | Invention patent application withdrawn after publication | ||
Application publication date: 20190104 |