CN110532959B - Real-time violent behavior detection system based on two-channel three-dimensional convolutional neural network - Google Patents

Real-time violent behavior detection system based on two-channel three-dimensional convolutional neural network

Info

Publication number
CN110532959B
Authority
CN
China
Prior art keywords
video
channel
module
processing module
real
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910817372.6A
Other languages
Chinese (zh)
Other versions
CN110532959A (en)
Inventor
沈小艳
阴文佳
毕胜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dalian Maritime University
Original Assignee
Dalian Maritime University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dalian Maritime University filed Critical Dalian Maritime University
Priority to CN201910817372.6A
Publication of CN110532959A
Application granted
Publication of CN110532959B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G06F 18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/20 Movements or behaviour, e.g. gesture recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Molecular Biology (AREA)
  • Evolutionary Biology (AREA)
  • Biomedical Technology (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a real-time violent behavior detection system based on a two-channel three-dimensional convolutional neural network, comprising: a video acquisition module that captures video frames in real time and sends them to the video processing module and the playing module respectively; a video processing module that extracts features from the received video frames using the convolutional neural network, fuses the extracted features, and classifies the image data according to the fused features; and a playing module that marks the image classification results obtained by the video processing module into the video frames sent by the video acquisition module and plays them to the user. The video acquisition module, the video processing module and the playing module operate in parallel. The invention improves recognition accuracy by introducing the two-channel idea, and achieves accurate localization of the time at which violent behavior occurs by introducing deconvolution layers.

Description

Real-time violent behavior detection system based on two-channel three-dimensional convolutional neural network
Technical Field
The invention relates to the technical field of video surveillance, and in particular to a real-time violent behavior detection system based on a fast-slow two-channel three-dimensional convolutional neural network.
Background
Video human behavior recognition and detection is one of the most challenging tasks in computer vision and can be widely applied in numerous fields such as video surveillance, motion retrieval, human-computer interaction, smart homes and medical care. The behavior recognition field currently has two major branches: traditional approaches, represented by the improved dense trajectories (IDT) algorithm, and deep-learning approaches, represented by two-dimensional convolution, three-dimensional convolution and RNN/LSTM. Judging by the development trend, deep learning has surpassed the traditional approaches in performance.
Improved dense trajectories (IDT): the main difference between traditional methods and deep-learning methods is the source of the features used for classification. In traditional methods, one or more features that classify well are manually selected from experience and mixed for classification. In deep-learning methods, labeled samples are given to a computer to learn a model; the learned model extracts some combination of features for classification, and exactly which features it extracts cannot be known. Because traditional methods draw on a limited set of feature types and a limited selection range, manually extracted features are less accurate than those extracted by a model. This is an advantage of deep learning.
Two-stream convolutional neural network (Two-Stream CNN): a representative algorithm for solving the behavior recognition problem with two-dimensional convolution. Its main content is as follows: two streams simultaneously process the RGB frame sequence and the optical-flow frame sequence, with no information exchange between the streams during feature extraction; after feature extraction, the features are fused in some manner for classification to obtain the final result. Because the network can process only one image at a time, every frame in the sequence must be processed, and since adjacent video frames contain a large amount of repeated information, the algorithm performs much repeated computation; this greatly restricts recognition and detection speed and cannot meet the real-time requirement.
Long short-term memory network (LSTM): owing to its unique design, LSTM is suited to processing and predicting significant events in time series separated by very long intervals and delays. LSTM therefore performs well in behavior recognition and detection, and is one of the current mainstream directions.
Two-dimensional convolution has matured in image recognition and detection, but compared with images, video adds information in one time dimension, and the traditional two-dimensional convolution kernel cannot extract three-dimensional features. Three-dimensional convolution has the advantage of operation speed and captures inter-frame information well, and is currently the mainstream research direction. However, existing methods suffer from low recognition accuracy and low recognition speed, which greatly limits the development and application of human behavior recognition and detection technology.
Disclosure of Invention
In view of the technical problems of low recognition accuracy and low recognition speed, a real-time violent behavior detection system based on a fast-slow two-channel three-dimensional convolutional neural network is provided. Recognition accuracy is improved by introducing the two-channel idea, while accurate localization of the time at which violent behavior occurs is achieved by introducing deconvolution layers.
The technical means adopted by the invention are as follows:
a real-time violent behavior detection system based on a two-channel three-dimensional convolutional neural network comprises:
the video acquisition module captures video frames in real time and respectively sends the video frames to the video processing module and the playing module;
the video processing module is used for extracting the characteristics of the received video frames by using the convolutional neural network, combining the extracted characteristics and classifying the image data according to the combined characteristics;
the playing module is used for marking the image classification result obtained by the video processing module into the video frame sent by the video acquisition module and playing the video frame to a user;
the video acquisition module, the video processing module and the playing module work in parallel.
Further, before performing feature extraction on the received video frames, the video processing module preprocesses them: the RGB images are sent into a slow channel and a fast channel respectively for processing, and the resulting slow-channel and fast-channel preprocessing results serve as the input of the video processing module.
Further, the slow channel samples the RGB images at equal intervals into video segments, which are input to the trained slow-channel network model to predict the slow-channel preprocessed data.
Further, the fast channel processes the RGB images into grayscale image data, extracts optical-flow image data, and inputs these to the trained fast-channel network model to predict the fast-channel preprocessed data.
Further, the video processing module performs lateral fusion processing based on convolutional feature fusion on the slow-channel and fast-channel feature extraction results.
Further, the system also comprises a storage module for storing data information during operation of the system.
Compared with the prior art, the invention has the following advantages:
the method utilizes the multilayer convolutional neural network to extract the time correlation characteristics of the video frames, achieves parameter sharing to a certain extent, can capture interframe information with good performance, improves the operation speed and has strong real-time performance. Meanwhile, the fast channel and the slow channel are combined, the characteristics of the fast channel and the slow channel to be fused have similar shapes under the condition that data are not lost through a convolution mode, and after the characteristic fusion structure is added, the slow output of the convolution layer in the slow channel and the output of the fast channel which is subjected to convolution deformation at the same layer are superposed to be used as the input of the next convolution layer, so that the identification accuracy is improved.
In addition, the invention does not use pooling in the time domain, which ensures that temporal information is retained to the maximum extent, so the time of violent behavior can be located more accurately.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.
FIG. 1 is a functional block diagram of the structure of the detection system of the present invention.
FIG. 2 is a flow chart of the operation of the detection system of the present invention.
FIG. 3 is a flow chart of the operation of the video processing module of the present invention.
Detailed Description
It should be noted that the embodiments and features of the embodiments may be combined with each other without conflict. The present invention will be described in detail below with reference to the embodiments with reference to the attached drawings.
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all the embodiments. The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the invention, its application, or uses. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
As shown in fig. 1 to 3, the present invention provides a real-time violent behavior detection system based on a two-channel three-dimensional convolutional neural network. It comprises three modules, namely a video acquisition module, a video processing module and a delayed-playing module, and the three corresponding threads must run simultaneously to meet the real-time requirement.
The video acquisition module captures video frames in real time and sends them to the video processing module and the delayed-playing module respectively. Specifically, video frames are captured in real time using the OpenCV library and a network camera. Each captured frame image is duplicated into two paths: one is stored in the second queue, providing input for the thread-three delayed-playing module; the other is stored in the first queue, providing material for the image preprocessing step of the video processing module.
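By way of illustration, the capture thread and the two queues might be sketched in Python as follows; the queue names, camera source and loop structure are assumptions for illustration and not part of the disclosure:

```python
import cv2
import queue
import threading

play_queue = queue.Queue()     # second queue: feeds the delayed-playing module
process_queue = queue.Queue()  # first queue: feeds the video processing module

def capture_worker(source=0):
    """Thread one: grab frames in real time and fan them out to both queues."""
    cap = cv2.VideoCapture(source)
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        play_queue.put(frame.copy())  # one copy per consumer thread
        process_queue.put(frame)
    cap.release()

threading.Thread(target=capture_worker, daemon=True).start()
```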
The video processing module extracts features from the received video frames using the convolutional neural network, fuses the extracted features, and classifies the image data according to the fused features. It implements image preprocessing, video feature extraction, image feature classification and related functions. To improve recognition accuracy, the invention introduces the Slow_Fast idea, and the data are sent into a fast channel and a slow channel respectively for processing.
As a preferred embodiment of the invention, the following technical scheme is adopted when image preprocessing is carried out:
Slow channel: RGB-frame equal-interval sampling: taking 64 frames as one video unit and sampling one frame every 16 frames yields a video segment of shape 4 × h × w × 3.
Adjusting: in the prediction stage, the width and height of each frame of RGB image is scaled to 224 × 224, and the shape of the output result is 4 × 224 × 3. In the training phase, data was first scaled to 4 × 256 × 3, then randomly cropped to 4 × 224 × 3, with random flipping to complete data augmentation. Thereby increasing the generalization ability of the network model and preventing the model from being over-fitted.
Fast channel: the first part converts the RGB image into a grayscale image; RGB comprises three color channels, while a grayscale image has only one. The gray values of the three RGB channels are multiplied by their corresponding weights and summed to give the gray value of the corresponding point of the grayscale map, using the following formula:
Gray=R*0.299+G*0.587+B*0.114
The output data of this step has shape 64 × w × h × 1.
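The weighted sum may be written out as follows; the (h, w, 3) RGB array layout is an assumption, and in practice cv2.cvtColor applies the same weights:

```python
import numpy as np

def to_gray(rgb):
    """Gray = R*0.299 + G*0.587 + B*0.114, applied per pixel."""
    weights = np.array([0.299, 0.587, 0.114])
    return (rgb[..., :3] @ weights).astype(np.uint8)  # (h, w) grayscale map
```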
The second part converts the grayscale images into optical-flow data; optical flow reflects the motion information of objects between adjacent frames.
This embodiment preferably employs the Farneback optical-flow algorithm to extract dense optical flow. Optical flow is computed for every two frames, and one video unit has 64 frames, so the output data of this step has shape 32 × w × h × 2 (an optical-flow image has 2 channels: x-direction flow and y-direction flow).
Adjusting: in the prediction stage, each frame is scaled to a width and height of 224 × 224, and the output has shape 32 × 224 × 224 × 2. In the training stage, data are first scaled to 32 × 256 × 256 × 2, then randomly cropped to 32 × 224 × 224 × 2, and augmented with random flips, increasing the generalization ability of the network model and preventing overfitting. The random cropping and random flipping here must be consistent with the slow channel.
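An illustrative sketch of the flow-extraction step with OpenCV's Farneback implementation; the parameter values passed to calcOpticalFlowFarneback are common defaults rather than values taken from this disclosure:

```python
import cv2
import numpy as np

def fast_channel_flow(gray_frames):
    """One dense flow field per pair of frames: 64 gray frames -> 32 flows."""
    flows = []
    for i in range(0, len(gray_frames) - 1, 2):
        flow = cv2.calcOpticalFlowFarneback(
            gray_frames[i], gray_frames[i + 1], None,
            pyr_scale=0.5, levels=3, winsize=15,
            iterations=3, poly_n=5, poly_sigma=1.2, flags=0)
        flows.append(flow)        # (h, w, 2): x- and y-direction flow
    return np.stack(flows)        # (32, h, w, 2)
```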
The preprocessing results of the two channels are used together as the input of the feature extraction step.
As a preferred embodiment of the present invention, the following technical scheme is adopted when extracting video features:
and inputting the processed data into corresponding network channels respectively, inputting the trained network model, and extracting the features layer by layer. The output shape of the feature after passing through each layer is labeled in the form of T × W × H × C, and example 32 × 112 × 8 refers to the shape of the feature after passing through the last convolution module, where the output feature is 32 frames wide and 112 high, and the number of convolution kernel channels is 8.
Fast channel: the input is 32 optical-flow frames, each 224 wide and 224 high, with 2 channels (x-direction flow and y-direction flow). The fast channel comprises five convolution modules of identical structure, each consisting of a three-dimensional convolution layer, a BN layer, a ReLU activation layer and a three-dimensional pooling layer; the module names, convolution-kernel sizes and pooling-kernel sizes are labeled in the figure. For example, Conv1_3_1_2 indicates that the layer is named Conv1, its convolution kernel size is 3 × 3 × 3 and its pooling kernel size is 1 × 2 × 2. After the 5 convolution modules, an Average Pooling 3D layer with a pooling kernel of size 1 × 7 × 7 reduces the feature shape from 32 × 7 × 7 × 128 to 32 × 1 × 1 × 128, reducing the computational cost.
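By way of illustration, one such convolution module could be sketched in PyTorch as follows; the channel counts, padding and pooling type are assumptions not specified here:

```python
import torch.nn as nn

def conv_module(in_ch, out_ch):
    """3D conv (3x3x3) + BN + ReLU + spatial-only pooling (1x2x2)."""
    return nn.Sequential(
        nn.Conv3d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm3d(out_ch),
        nn.ReLU(inplace=True),
        nn.MaxPool3d(kernel_size=(1, 2, 2)),  # no pooling in the time dimension
    )

# e.g. a first fast-channel module mapping the 2-channel optical-flow input:
# conv1_f = conv_module(2, 8)
```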
Slow passage: input 4 frames of RGB color images, each frame 224 wide and 224 high, in three RGB color channels. 5 convolution modules are also included in the slow channel, with the names Convx _ S and x being 1-5, respectively. The first and last module hierarchies are the same as the fast path convolution module hierarchy, and comprise a three-dimensional convolution layer, a BN layer, a relu excitation layer and a three-dimensional pooling layer, wherein the convolution kernel size is 3 x 3, and the pooling kernel size is 1 x 2. In the middle three layers of the slow channel, the up-sampling in the time domain and the up-down sampling in the space domain are simultaneously completed by using the convolution and deconvolution operations. The convolution kernel and pooling kernel sizes are shown. Similarly, after 5 convolutional layers, an Average Possing 3D layer was added, pooling the nuclei with a size of 1 × 7, reducing the feature shapes from 32 × 7 × 128 to 32 × 1 × 128, reducing the computational cost.
Lateral fusion: so that each channel can make full use of what the other channel has learned, lateral feature fusion is used in many schemes, with various fusion modes. Here, convolution reshapes the two features to be fused into similar shapes without loss of data, and the output shape of each convolution layer is labeled in the figure. After the feature-fusion structure is added, in the slow channel the slow-channel output of a convolution layer is superposed with the convolution-deformed fast-channel output of the same layer to serve as the input of the next convolution layer.
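An illustrative sketch of one such lateral connection; the kernel size, the time stride of 8 (matching the 32-versus-4 frame counts at the early layers) and elementwise addition as the superposition are all assumptions for illustration:

```python
import torch.nn as nn

class LateralFusion(nn.Module):
    """Reshape a fast-channel feature with a strided 3D conv so it matches the
    same-layer slow-channel feature, then superpose the two by addition."""
    def __init__(self, fast_ch, slow_ch, time_stride=8):
        super().__init__()
        self.transform = nn.Conv3d(
            fast_ch, slow_ch, kernel_size=(5, 1, 1),
            stride=(time_stride, 1, 1), padding=(2, 0, 0))

    def forward(self, slow_feat, fast_feat):
        return slow_feat + self.transform(fast_feat)  # next slow-layer input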
As a preferred embodiment of the present invention, the arrangement of the fully connected layer and the classifier is as follows:
After the feature information of the two channels has been fused and compressed, many local features have been extracted; a fully connected layer reassembles these local features into a complete global feature, which serves as the input for classifier classification. In this embodiment, the number of nodes in the fully connected layer is preferably 1024.
The Sigmoid function is chosen as the classifier because the action only needs to be classified as violent or not; the number of output nodes is 2.
The classifier output has shape 32 × 2; that is, for one 64-frame video unit, two probability values (violent / non-violent) are obtained for every two frames of images. The class with the higher probability is taken as the prediction, so the final result is a sequence of length 32, corresponding to the 32 sampled frames, thereby enabling frame-level prediction.
The Sigmoid function is:
g(x) = 1 / (1 + e^(-x))
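Putting the head together, a hedged sketch follows; the 256-dimensional per-step input (128 slow plus 128 fast channels concatenated) and the layer composition are assumptions for illustration:

```python
import torch
import torch.nn as nn

head = nn.Sequential(
    nn.Linear(256, 1024),   # 1024-node fully connected layer from the text
    nn.ReLU(inplace=True),
    nn.Linear(1024, 2),     # two output nodes: violent / non-violent
    nn.Sigmoid(),           # g(x) = 1 / (1 + e^(-x))
)

fused = torch.randn(32, 256)   # 32 time steps from one 64-frame video unit
scores = head(fused)           # shape (32, 2): per-step class probabilities
labels = scores.argmax(dim=1)  # length-32 sequence -> frame-level prediction
```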
it should be noted that the pooling operation is not used in the time domain in the whole network, so as to ensure that the time domain information is retained to the maximum extent. In addition, since RGB mainly focuses on detail information, the change speed over time is low, the repeated information is more, calculation expense is saved in order to avoid repeated calculation, a slow channel is carried out at a low frame rate, 1 frame is sampled every 16 frames, and 4 frames are sampled in a time unit. Since the light flow graph mainly focuses on motion information, the time change speed is high, the time change is carried out at a high frame rate, 1 frame is sampled every 2 frames, and 32 frames are sampled in a time unit. Finally, although the number of input frames is small, the slow channel needs to pay attention to more detailed information, and it is known that the more the number of convolution kernels, the more detailed information that can be paid attention to, the more the number of convolution kernels of the fast channel is, the more the number of convolution kernels of the slow channel is, the more the number of convolution kernels of the fast channel is set to be 8 times that of the slow channel in the whole process.
The delayed-playing module marks the image classification results obtained by the video processing module into the video frames sent by the video acquisition module and plays the marked frames to the user with a delay.
In addition, the system in this embodiment further includes a storage module for storing data information during the operation of the system.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and these modifications or substitutions do not depart from the spirit of the corresponding technical solutions of the embodiments of the present invention.

Claims (2)

1. A real-time violent behavior detection system based on a two-channel three-dimensional convolution neural network is characterized by comprising:
the video acquisition module captures video frames in real time and respectively sends the video frames to the video processing module and the playing module;
a video processing module for preprocessing the video frame, which comprises respectively sending RGB images into a slow channel and a fast channel for processing, using the obtained preprocessing result of the slow channel and the preprocessing result of the fast channel as the input of the video processing module, the slow channel is used for sampling the RGB images into video segments at equal intervals, inputting the trained network model of the slow channel for predicting to obtain the preprocessing data of the slow channel, the fast channel is used for processing the RGB images into gray image data and extracting the optical flow image data, inputting the trained network model of the fast channel for predicting to obtain the preprocessing data of the fast channel,
performing feature extraction on the received video frames by using a convolutional neural network, and combining the extracted features, wherein the transverse fusion processing based on convolutional feature fusion is performed on the slow channel feature extraction result and the fast channel feature extraction result, and then the image data is classified according to the combined features;
the playing module is used for marking the image classification result obtained by the video processing module into the video frame sent by the video acquisition module and playing the video frame to a user;
the video acquisition module, the video processing module and the playing module work in parallel.
2. The real-time violent behavior detection system of claim 1 further comprising a storage module for storing data information during operation of the system.
CN201910817372.6A 2019-08-30 2019-08-30 Real-time violent behavior detection system based on two-channel three-dimensional convolutional neural network Active CN110532959B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910817372.6A CN110532959B (en) 2019-08-30 2019-08-30 Real-time violent behavior detection system based on two-channel three-dimensional convolutional neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910817372.6A CN110532959B (en) 2019-08-30 2019-08-30 Real-time violent behavior detection system based on two-channel three-dimensional convolutional neural network

Publications (2)

Publication Number Publication Date
CN110532959A (en) 2019-12-03
CN110532959B 2022-10-14

Family

ID=68665934

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910817372.6A Active CN110532959B (en) 2019-08-30 2019-08-30 Real-time violent behavior detection system based on two-channel three-dimensional convolutional neural network

Country Status (1)

Country Link
CN (1) CN110532959B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111191528B (en) * 2019-12-16 2024-02-23 江苏理工学院 Campus violence behavior detection system and method based on deep learning
JP2021179728A (en) * 2020-05-12 2021-11-18 株式会社日立製作所 Video processing device and method thereof
CN111860395A (en) * 2020-07-28 2020-10-30 公安部第三研究所 Method for realizing prison violent behavior detection based on vision and acceleration information
CN112990013B (en) * 2021-03-15 2024-01-12 西安邮电大学 Time sequence behavior detection method based on dense boundary space-time network

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017177661A1 (en) * 2016-04-15 2017-10-19 乐视控股(北京)有限公司 Convolutional neural network-based video retrieval method and system
CN110175596A (en) * 2019-06-04 2019-08-27 重庆邮电大学 The micro- Expression Recognition of collaborative virtual learning environment and exchange method based on double-current convolutional neural networks

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Improved human behavior recognition algorithm based on a two-stream convolutional neural network; Zhang Yijia et al.; Computer Measurement & Control; 2018-08-25 (No. 08); full text *

Also Published As

Publication number Publication date
CN110532959A (en) 2019-12-03

Similar Documents

Publication Publication Date Title
CN111639692B (en) Shadow detection method based on attention mechanism
CN111401177B (en) End-to-end behavior recognition method and system based on adaptive space-time attention mechanism
CN110532959B (en) Real-time violent behavior detection system based on two-channel three-dimensional convolutional neural network
CN106683048B (en) Image super-resolution method and device
CN108334848B (en) Tiny face recognition method based on generation countermeasure network
CN114202672A (en) Small target detection method based on attention mechanism
CN111291809B (en) Processing device, method and storage medium
CN110717851A (en) Image processing method and device, neural network training method and storage medium
CN113591795B (en) Lightweight face detection method and system based on mixed attention characteristic pyramid structure
CN112464807A (en) Video motion recognition method and device, electronic equipment and storage medium
CN110059728B (en) RGB-D image visual saliency detection method based on attention model
CN113642634A (en) Shadow detection method based on mixed attention
CN108875482B (en) Object detection method and device and neural network training method and device
CN108830185B (en) Behavior identification and positioning method based on multi-task joint learning
CN110222718B (en) Image processing method and device
CN111079507B (en) Behavior recognition method and device, computer device and readable storage medium
CN113011329A (en) Pyramid network based on multi-scale features and dense crowd counting method
CN110837786B (en) Density map generation method and device based on spatial channel, electronic terminal and medium
CN111738344A (en) Rapid target detection method based on multi-scale fusion
CN113297956B (en) Gesture recognition method and system based on vision
CN110929685A (en) Pedestrian detection network structure based on mixed feature pyramid and mixed expansion convolution
CN114241422A (en) Student classroom behavior detection method based on ESRGAN and improved YOLOv5s
CN111797841A (en) Visual saliency detection method based on depth residual error network
CN112183649A (en) Algorithm for predicting pyramid feature map
CN115484410A (en) Event camera video reconstruction method based on deep learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant