CN111027440A - Crowd abnormal behavior detection device and method based on neural network - Google Patents
Crowd abnormal behavior detection device and method based on neural network
- Publication number
- CN111027440A CN201911221923.9A CN201911221923A
- Authority
- CN
- China
- Prior art keywords
- neural network
- convolution
- network
- layer
- layers
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/50—Context or environment of the image
- G06V20/52—Surveillance or monitoring of activities, e.g. for recognising suspicious objects
- G06V20/53—Recognition of crowd images, e.g. recognition of crowd congestion
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2415—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/047—Probabilistic or stochastic networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- General Physics & Mathematics (AREA)
- Evolutionary Computation (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- General Engineering & Computer Science (AREA)
- General Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- Software Systems (AREA)
- Mathematical Physics (AREA)
- Computing Systems (AREA)
- Health & Medical Sciences (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- Probability & Statistics with Applications (AREA)
- Evolutionary Biology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Multimedia (AREA)
- Image Analysis (AREA)
Abstract
The invention discloses a crowd abnormal behavior detection device and detection method based on a neural network. The method comprises two steps: first, constructing and training a dual-stream residual neural network; second, processing the video image to be detected and detecting it with the trained network. The advantage of simple optical flow acquisition is exploited by using an optical-flow-based classifier of local features for preliminary screening, cascaded with a classifier built from features extracted by a neural-network-based autoencoder for fine judgment. A 70-layer deep residual neural network model, constructed on the idea of residual networks, is used as the detection model, so that feature learning is more accurate and both detection speed and detection accuracy are improved.
Description
[ technical field ]
The invention relates to the technical field of computer vision, in particular to a crowd abnormal behavior detection method based on a neural network.
[ background of the invention ]
In modern cities, surveillance cameras are deployed throughout public areas. These closed-circuit television systems usually require a person to be present to watch the monitored scenes so that the responsible staff can be alerted when an abnormal event occurs; however, manually inspecting the recorded footage is time-consuming and labor-intensive. In the narrow sense, crowd abnormal event detection refers to judging particular behaviors in particular scenes; in the broad sense, it refers to detecting events whose probability of occurrence is much lower than that of normal events, generally below 5% to 10%.
At present, crowd anomaly detection at home and abroad mainly follows two kinds of methods: object-recognition-based methods (Cen Yigang, Wang Wen, et al. Salient optical flow histogram dictionary representation for crowd abnormal event detection [J]. Signal Processing, 2017, 33(03): 330-) and global-feature-based methods. Object-based methods preserve local features well but are relatively time-consuming, whereas global methods show the opposite behavior.
Mainstream global methods are mainly based on the following two kinds of feature extraction:
1) Using easily obtained low-dimensional features, most commonly optical flow features (Colque R V H M, Caetano C, de Andrade M T L, et al. Histograms of optical flow orientation and magnitude and entropy to detect anomalous events in videos [J]. IEEE Transactions on Circuits and Systems for Video Technology, 2017, 27(3): 673-682), and performing local or global feature extraction to build classifiers. For example, Wang and Snoussi (Wang T, Snoussi H. Detection of abnormal visual events via global optical flow orientation histogram [J]. IEEE Transactions on Information Forensics and Security, 9(6): 988-) built more global scales on the basis of the optical flow histogram, and Cong et al. studied video anomaly search via spatio-temporal motion context (Cong Y, Yuan J, Tang Y. Video anomaly search in crowded scenes via spatio-temporal motion context. IEEE Trans. Inf. Forensics Security, 2013, 8(10): 1590-). This kind of method has the advantage of simple feature acquisition, but its optical flow accuracy is limited and global optical flow computation consumes a large amount of time.
2) Constructing complex high-dimensional features, such as energy flow features (Nam Y. Crowd flux analysis and abnormal event detection in unstructured and structured scenes [J]. Multimedia Tools and Applications, 2014, 72(3): 3001-). The invention uses the advantage of simple optical flow acquisition: an optical-flow-based classifier of local features performs preliminary screening, cascaded with a classifier built from features extracted by a neural-network-based autoencoder that performs the fine judgment.
In terms of patents, the Chinese patent of Wu Yuanchun and Yang Yuan (A crowd abnormal event detection method, electronic device and storage medium [P]. CN108288021A, 2018-07-17) computes an adjacency matrix of optical flow points from the horizontal and vertical coordinate relationships between them to detect abnormal events; its accuracy is high, but the time cost of optical flow computation remains high. The Chinese patent of Xuan Zuxing, Guo Yanfei, Wang Hai and Sun Xin (A crowd abnormal event detection method based on mixed tracking and a generalized linear model [P]. CN108280408A, 2018-07-13) proposes a detection method based on a generalized linear model; the linear model keeps the accuracy of high-dimensional features while offering some speed advantage, but the hand-crafted feature model is prone to over-fitting and under-fitting and has low robustness.
[ summary of the invention ]
The invention provides a crowd abnormal behavior detection method based on a neural network. The method exploits the advantage of simple optical flow acquisition by using an optical-flow-based classifier of local features for preliminary screening, cascaded with a classifier built from features extracted by a neural-network-based autoencoder for fine judgment.
The technical scheme adopted by the invention is as follows:
a crowd abnormal behavior detection method based on a neural network is characterized by comprising the following steps:
1) constructing and training a dual-stream residual neural network;
2) processing the video image to be detected and then detecting with the trained neural network.
Further, the step 1) includes the following sub-steps:
1-1) dividing a standard crowd-movement data set into two parts, a training data set and a testing data set, each of which contains both normal and abnormal behavior samples;
1-2) performing optical flow processing on the training samples and test samples respectively; after the optical flow image sequence is obtained, resizing the optical flow image sequence and the RGB image sequence to 320 × 240, and expanding the data set;
1-3) randomly shuffling the training samples, inputting them into the dual-stream residual neural network, and training the network;
1-4) evaluating the trained neural network on the test data set; if the expected accuracy is reached, the network can be used to detect crowd abnormal behavior in the video to be detected, otherwise training is repeated.
Further, the step 2) includes the following sub-steps:
2-1) denoising the video image sequence to be detected, using a Wiener filter as the denoising algorithm;
2-2) performing optical flow processing on the video to be detected, and resizing the optical flow image sequence and the RGB image sequence to 320 × 240 after the optical flow image sequence is obtained;
2-3) inputting the video images into the trained dual-stream residual neural network, which directly gives the detection result.
Further, if the detection result indicates abnormal behavior, a warning mark is issued; if no abnormal behavior is detected, the procedure returns to the video sequence to be detected for re-detection.
A crowd abnormal behavior detection device based on a neural network is characterized by comprising:
the device comprises a neural network module and a to-be-detected video processing module.
Further, the main structure of the neural network is as follows (a minimal sketch of one bottleneck sub-network with a shortcut connection is given after this list):
1) the input layer converts the input pictures into a database file, unifies the images to 320 × 240 pixels, and randomly shuffles the data images;
2) the first part of the hidden layers is a composite layer consisting of a convolutional layer "Conv1" and a pooling layer "Pool". "Conv1" consists of 64 convolution kernels of size 7 × 7 with a convolution stride of 2; after the 64 kernels are convolved separately, their output matrices are superposed and averaged to give the final convolution result. The "Pool" layer performs overlapping max-pooling on the convolution result with a pooling window of size 2 × 2;
3) the second part is a composite convolution network "Conv2" made up of three sub-convolution networks, each with 3 layers; the convolution kernels number 128, 128 and 512 with sizes 1 × 1, 3 × 3 and 1 × 1 respectively. Each sub-part is connected by a shortcut, and the three sub-convolution networks form a 12-layer residual convolution network in total;
4) the third part is a convolution network "Conv3" made up of three sub-convolution networks, each with 3 layers; the convolution kernels number 64, 64 and 128 with sizes 1 × 1, 3 × 3 and 1 × 1 respectively. Each sub-part is connected by a shortcut, and the third part contains 12 convolutional layers forming a composite convolution network;
5) the fourth part is a composite convolution network "Conv4" comprising twenty-three sub-convolution networks, each with 3 layers; the convolution kernels number 256, 512 and 1024 with sizes 1 × 1, 3 × 3 and 1 × 1 respectively. Each sub-part is connected by a residual-network shortcut, and the fourth part has 69 convolutional layers in total;
6) the fifth part is the last part of the residual network, a composite convolution network "Conv5" consisting of three sub-convolution networks, each with 3 layers; the convolution kernels number 512, 1024 and 2048 with sizes 1 × 1, 3 × 3 and 1 × 1 respectively. Each sub-part is connected by a residual-network shortcut, and the fifth part has 9 convolutional layers in total;
7) the last part of the neural network is the output part; the output layer is composed of an expansion (flatten) layer, 4 fully connected layers and a softmax classification layer. The expansion layer flattens the fused features into a one-dimensional vector, the fully connected layers have output sizes of 1024, 512, 256 and 64 respectively, and softmax normalizes the outputs of the fully connected layers.
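The bottleneck sub-networks described above can be illustrated with the following PyTorch sketch, given only as an example under assumptions: the class name, the BatchNorm/ReLU placement and the 1 × 1 projection shortcut are not specified by the patent, and the channel numbers follow the "Conv2" figures above.

```python
# Illustrative sketch (not the patent's reference implementation) of one
# 3-layer bottleneck sub-network with a shortcut connection, using the
# "Conv2" kernel counts above (128 @ 1x1, 128 @ 3x3, 512 @ 1x1).
import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    def __init__(self, in_channels=512, mid_channels=128, out_channels=512):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, mid_channels, kernel_size=1, bias=False)
        self.bn1 = nn.BatchNorm2d(mid_channels)
        self.conv2 = nn.Conv2d(mid_channels, mid_channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(mid_channels)
        self.conv3 = nn.Conv2d(mid_channels, out_channels, kernel_size=1, bias=False)
        self.bn3 = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU(inplace=True)
        # 1x1 projection so the shortcut matches the output width when needed
        self.shortcut = (nn.Identity() if in_channels == out_channels
                         else nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False))

    def forward(self, x):
        identity = self.shortcut(x)            # residual-network shortcut
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.relu(self.bn2(self.conv2(out)))
        out = self.bn3(self.conv3(out))
        return self.relu(out + identity)
```

Stacking several such blocks and connecting them with shortcuts gives one of the composite convolution parts (Conv2 to Conv5).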
Further, the video processing module is responsible for preprocessing the video to be detected; the main process, sketched after this list, is as follows:
1) image noise reduction (Wiener filtering);
2) optical flow extraction to generate optical flow images;
3) RGB image extraction.
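A minimal sketch of this preprocessing pipeline is given below. The library choices (SciPy's Wiener filter, OpenCV's Farneback optical flow) and the function name `preprocess_clip` are illustrative assumptions; the patent does not prescribe a particular implementation.

```python
# Illustrative preprocessing sketch: Wiener denoising, dense optical flow,
# and resizing both sequences to 320 x 240. Library choices are assumptions.
import cv2
import numpy as np
from scipy.signal import wiener

def preprocess_clip(frames):
    """frames: list of BGR uint8 images taken from the video to be detected."""
    rgb_seq, flow_seq = [], []
    prev_gray = None
    for frame in frames:
        # 1) image noise reduction with a Wiener filter, applied per channel
        denoised = np.stack(
            [wiener(frame[:, :, c].astype(np.float64)) for c in range(3)], axis=2
        )
        denoised = np.clip(denoised, 0, 255).astype(np.uint8)
        # 3) RGB image, resized to 320 x 240
        rgb_seq.append(cv2.resize(denoised, (320, 240)))
        # 2) dense optical flow between consecutive frames (Farneback method)
        gray = cv2.cvtColor(denoised, cv2.COLOR_BGR2GRAY)
        if prev_gray is not None:
            flow = cv2.calcOpticalFlowFarneback(
                prev_gray, gray, None, 0.5, 3, 15, 3, 5, 1.2, 0)
            flow_seq.append(cv2.resize(flow, (320, 240)))
        prev_gray = gray
    return rgb_seq, flow_seq
```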
The invention has the beneficial effects that:
the present invention builds a system by using deep learning based ideas. The neural network is used for automatically carrying out high-dimensional feature learning on various behaviors of the sports crowd; traditional artificial feature extraction is limited by human visual fields, too complicated features are extremely poor in robustness, and too simple features cannot represent features effectively enough. Therefore, the accuracy and speed of detection are greatly limited, and the extraction method can fully and effectively represent the crowd characteristics.
The invention uses the 70 layers of deep residual error neural network models as the detection models, so that the learning of the characteristics is more accurate, and the method uses the thought of the residual error network to construct the 70 layers of deep residual error neural network models as the detection models, so that the method has higher efficiency in detection speed and detection accuracy.
The invention adopts a double-flow neural network structure and simultaneously extracts two kinds of information respectively. The information related to the time and the space sequence is fully utilized, the features obtained by learning from the two kinds of information are fused to be used as the final detection index, and the information loss is not easy to generate.
[ description of the drawings ]
FIG. 1 shows an example of the data augmentation used by the present invention;
FIG. 2 is a flow chart of the training of the neural network of the present invention;
FIG. 3 is a flow chart of the present invention for processing a video image to be detected and detecting using a neural network;
fig. 4 shows a specific structure of the neural network of the present invention.
[ detailed description of the embodiments ]
The present invention is further illustrated in detail by the following examples and the accompanying drawings.
The invention provides a crowd abnormal behavior detection method based on a neural network. The method exploits the advantage of simple optical flow acquisition by using an optical-flow-based classifier of local features for preliminary screening, cascaded with a classifier built from features extracted by a neural-network-based autoencoder for fine judgment. With reference to the drawings, the construction of the neural network and the concrete method steps are as follows:
Construction of the neural network (a sketch of the dual-stream output head is given after this list):
1) the input layer converts the input pictures into a database file, unifies the images to 320 × 240 pixels, and randomly shuffles the data images;
2) the first part of the hidden layers is a composite layer consisting of a convolutional layer "Conv1" and a pooling layer "Pool". "Conv1" consists of 64 convolution kernels of size 7 × 7 with a convolution stride of 2; after the 64 kernels are convolved separately, their output matrices are superposed and averaged to give the final convolution result. The "Pool" layer performs overlapping max-pooling on the convolution result with a pooling window of size 2 × 2;
3) the second part is a composite convolution network "Conv2" made up of three sub-convolution networks, each with 3 layers; the convolution kernels number 128, 128 and 512 with sizes 1 × 1, 3 × 3 and 1 × 1 respectively. Each sub-part is connected by a shortcut, and the three sub-convolution networks form a 12-layer residual convolution network in total;
4) the third part is a convolution network "Conv3" made up of three sub-convolution networks, each with 3 layers; the convolution kernels number 64, 64 and 128 with sizes 1 × 1, 3 × 3 and 1 × 1 respectively. Each sub-part is connected by a shortcut, and the third part contains 12 convolutional layers forming a composite convolution network;
5) the fourth part is a composite convolution network "Conv4" comprising twenty-three sub-convolution networks, each with 3 layers; the convolution kernels number 256, 512 and 1024 with sizes 1 × 1, 3 × 3 and 1 × 1 respectively. Each sub-part is connected by a residual-network shortcut, and the fourth part has 69 convolutional layers in total;
6) the fifth part is the last part of the residual network, a composite convolution network "Conv5" consisting of three sub-convolution networks, each with 3 layers; the convolution kernels number 512, 1024 and 2048 with sizes 1 × 1, 3 × 3 and 1 × 1 respectively. Each sub-part is connected by a residual-network shortcut, and the fifth part has 9 convolutional layers in total;
7) the last part of the neural network is the output part; the output layer is composed of an expansion (flatten) layer, 4 fully connected layers and a softmax classification layer. The expansion layer flattens the fused features into a one-dimensional vector, the fully connected layers have output sizes of 1024, 512, 256 and 64 respectively, and softmax normalizes the outputs of the fully connected layers.
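The output part above, together with the fusion of the two streams, can be sketched as follows. The flattened feature size, the class count (normal/abnormal) and the final 2-way layer before softmax are assumptions added for illustration; the patent itself only fixes the 1024/512/256/64 fully connected sizes and the softmax normalization.

```python
# Illustrative sketch of the output part of the dual-stream network: flatten
# the RGB-stream and flow-stream feature maps, fuse them, then apply the
# 1024/512/256/64 fully connected layers and softmax. The fused feature size
# and the final 2-way classification layer are assumptions.
import torch
import torch.nn as nn

class FusionHead(nn.Module):
    def __init__(self, fused_dim=4096, num_classes=2):
        super().__init__()
        self.flatten = nn.Flatten()            # "expansion layer": feature map -> 1-D vector
        self.fc = nn.Sequential(
            nn.Linear(fused_dim, 1024), nn.ReLU(),
            nn.Linear(1024, 512), nn.ReLU(),
            nn.Linear(512, 256), nn.ReLU(),
            nn.Linear(256, 64), nn.ReLU(),
            nn.Linear(64, num_classes),        # assumed normal / abnormal output
        )

    def forward(self, rgb_feat, flow_feat):
        # fuse the features learned from the two kinds of information
        fused = torch.cat([self.flatten(rgb_feat), self.flatten(flow_feat)], dim=1)
        return torch.softmax(self.fc(fused), dim=1)   # softmax normalization
```

During training, a cross-entropy loss is usually applied to the scores before the softmax; the softmax output is shown here because the patent describes it as part of the output layer.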
The method comprises the following specific steps:
1. Data set production
Take the UMN database as an example. It is an open database established by the University of Minnesota as a standard benchmark for crowd abnormal event detection algorithms; the videos it contains were recorded by students of the university in cooperation with their teachers. The database contains 11 video segments in total; each segment contains a part with normal crowd motion and a part with abnormal motion, starting with normal behavior and ending with abnormal behavior.
The data set is divided into normal and abnormal samples, which are then split into two parts: training samples and testing samples, each part containing both normal and abnormal behavior. Both parts are reserved for the subsequent network training.
2. Data set expansion
For a neural network applied to the task of detecting and identifying abnormal behavior, the training result is closely related to the quantity and scale of the data sets: the larger the data set used during training, the less likely the trained network is to over-fit or under-fit. Because the training set of abnormal behavior is limited, the existing data is expanded by a data augmentation method.
Before the training data set is input into the network, each frame is rotated, scaled and translated. The rotation angle is limited to [-30°, +30°]; the translation amplitude of each image is controlled within [-10%, +10%], with the amplitude and the horizontal/vertical direction chosen randomly by the program; the scaling range is likewise controlled to [-10%, +10%], and when shrinking a picture leaves a blank region, the blank is filled with the nearest pixel of the original image. Keeping each kind of change within a limited range prevents the newly generated image from differing too much from the original, which would make network training unpredictable. The ranges set in this embodiment include zero, and a zero-valued transformation means that no transformation is applied; the expanded data set is 4 times the size of the original.
Fig. 1 shows examples of the original image after +30° and -20° rotation, 10% reduction and 10% enlargement, and 10% translation to the right and 10% translation downward, respectively.
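The augmentation described above can be sketched with torchvision's RandomAffine, used here purely as an illustrative assumption (the patent does not name an implementation, and RandomAffine's default border filling differs from the nearest-pixel filling described above):

```python
# Illustrative data-gain (augmentation) sketch: rotation in [-30, +30] degrees,
# horizontal/vertical translation up to 10%, and scaling in [-10%, +10%].
# torchvision is an assumed implementation choice.
from torchvision import transforms

augment = transforms.RandomAffine(
    degrees=30,            # rotation angle drawn from [-30, +30]
    translate=(0.1, 0.1),  # horizontal and vertical shift of at most 10%
    scale=(0.9, 1.1),      # zoom factor drawn from [-10%, +10%]
)

# Applying the transform to every frame (e.g. three extra variants per frame)
# would expand the data set to roughly 4 times its original size.
# augmented_frame = augment(original_pil_frame)
```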
3. Neural network setup
Most parameters of the neural network adjust themselves to the learning content, but before training the hyper-parameters of the network need to be set so that learning can proceed smoothly. The setting of each hyper-parameter is explained in detail below, and they are collected in a sketch after the list:
(1) The batch_size of the training set is set to 2. batch_size is the number of samples used in a single training batch. Because the training set is relatively small, setting batch_size to a relatively low value makes fuller use of the limited information in the training set and clearly improves training effectiveness, at the expense of longer training time. By common practice, batch_size is usually chosen as a power of 2 to match CPU and GPU processing.
(2) The number of epochs is set to 50. The epoch count is the total number of training passes; each epoch is a new round of training over the training set building on the previous rounds. Training of the neural network is considered complete when the accuracy and the loss of the network converge to specific values over several epochs.
(3) The momentum of the network is set to 0.9. A common method for optimizing parameters during network training is gradient descent. During descent, the initial state of the network influences whether it converges to the optimal solution: when convergence goes well the network quickly reaches the global optimum, and when it goes wrong the network can collapse into a local optimum. Momentum borrows the relationship between potential and kinetic energy in physics to guide the direction of descent: the larger the gradient, the smaller the adjustment; the smaller the gradient, the larger the adjustment, so the network is more likely to escape local optima. The momentum value must balance escaping local optima against not amplifying the oscillation; after several tests, training worked best with the momentum set to 0.9.
(4) The initial learning rate is set to 0.001. The learning rate is the amplitude of the network's parameter updates; in gradient descent it is the step length of each descent step. A larger learning rate converges faster but more easily causes gradient explosion and oscillation near the optimum; a smaller learning rate gives more accurate results but the learning result over-fits more easily. Therefore the learning rate in this embodiment changes dynamically with the epochs, increasing gradually from the initial value of 0.001 at each epoch.
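The hyper-parameters above can be collected into a PyTorch training set-up as sketched below; the model, the data set object and the exact per-epoch learning-rate increase (a factor of 1.05 here) are assumptions, since the patent only states that the rate grows gradually from 0.001.

```python
# Illustrative training set-up using the hyper-parameters listed above.
# `model` and `train_set` are placeholders; the 1.05 growth factor of the
# learning-rate schedule is an assumption.
import torch
from torch.utils.data import DataLoader

BATCH_SIZE = 2      # (1) small batches to exploit the limited training set
EPOCHS = 50         # (2) total number of training passes
MOMENTUM = 0.9      # (3) momentum for gradient descent
BASE_LR = 0.001     # (4) initial learning rate

def make_training(model, train_set):
    loader = DataLoader(train_set, batch_size=BATCH_SIZE, shuffle=True)   # random shuffling
    optimizer = torch.optim.SGD(model.parameters(), lr=BASE_LR, momentum=MOMENTUM)
    # grow the learning rate slightly after each epoch
    scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=1.05)
    return loader, optimizer, scheduler
```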
4. Training of the neural network
The preprocessed video set is input into the neural network for training. When training reaches the expected accuracy, training of the neural network is finished; if the expected accuracy is not reached, training is repeated.
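A minimal sketch of this train-then-verify loop is given below; the cross-entropy loss, the 0.9 accuracy threshold and the assumption that the model returns raw class scores are illustrative choices not fixed by the patent.

```python
# Illustrative training/evaluation loop: train for the configured epochs,
# test on the test set, and repeat training if the expected accuracy is not
# reached. Loss function and threshold are assumptions.
import torch
import torch.nn as nn

def evaluate(model, test_loader, device="cpu"):
    model.eval()
    correct = total = 0
    with torch.no_grad():
        for rgb, flow, label in test_loader:
            pred = model(rgb.to(device), flow.to(device)).argmax(dim=1)
            correct += (pred == label.to(device)).sum().item()
            total += label.numel()
    return correct / total

def train_until_accurate(model, loader, test_loader, optimizer, scheduler,
                         epochs=50, expected_accuracy=0.9, device="cpu"):
    criterion = nn.CrossEntropyLoss()   # assumes the model returns raw class scores
    while True:
        model.train()
        for _ in range(epochs):
            for rgb, flow, label in loader:
                optimizer.zero_grad()
                loss = criterion(model(rgb.to(device), flow.to(device)), label.to(device))
                loss.backward()
                optimizer.step()
            scheduler.step()
        if evaluate(model, test_loader, device) >= expected_accuracy:
            return model    # ready to detect crowd abnormal behaviour
        # otherwise the loop continues and the network is re-trained
```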
5. Pre-processing and detection of the video to be detected
After training is finished, a test video (the video to be detected) is input into the trained neural network for detection. When abnormal behavior appears in the frames, the system issues a warning mark; if no abnormal behavior appears, the system continues detecting.
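A sketch of this detection stage is shown below; it reuses the hypothetical `preprocess_clip` helper from the earlier preprocessing sketch, and the label convention (abnormal = 1) is an assumption.

```python
# Illustrative detection stage: preprocess the clip, run the trained
# dual-stream network, and issue a warning mark when the abnormal class is
# predicted for any frame. ABNORMAL = 1 is an assumed label convention.
import numpy as np
import torch

ABNORMAL = 1

def detect(model, frames, device="cpu"):
    rgb_seq, flow_seq = preprocess_clip(frames)        # see the preprocessing sketch
    rgb_seq = rgb_seq[1:]                              # align RGB frames with the flow fields
    rgb = torch.from_numpy(np.stack(rgb_seq)).float().permute(0, 3, 1, 2)
    flow = torch.from_numpy(np.stack(flow_seq)).float().permute(0, 3, 1, 2)
    model.eval()
    with torch.no_grad():
        pred = model(rgb.to(device), flow.to(device)).argmax(dim=1)
    if (pred == ABNORMAL).any():
        print("WARNING: abnormal crowd behaviour detected")   # warning mark
    else:
        print("No abnormal behaviour found; continue with the next segment")
```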
The above-mentioned embodiments are merely preferred embodiments of the present invention and should not be construed as limiting its scope; equivalent variations made according to the shape, structure and principle of the invention also fall within the scope of the invention.
Claims (7)
1. A crowd abnormal behavior detection method based on a neural network is characterized by comprising the following steps:
1) constructing and training a dual-stream residual neural network;
2) processing the video image to be detected and then detecting with the trained neural network.
2. The method for detecting the abnormal behavior of the crowd based on the neural network as claimed in claim 1, wherein the step 1) comprises the following substeps:
1-1) dividing a standard crowd-movement data set into two parts, a training data set and a testing data set, each of which contains both normal and abnormal behavior samples;
1-2) performing optical flow processing on the training samples and test samples respectively; after the optical flow image sequence is obtained, resizing the optical flow image sequence and the RGB image sequence to 320 × 240, and expanding the data set;
1-3) randomly shuffling the training samples, inputting them into the dual-stream residual neural network, and training the network;
1-4) evaluating the trained neural network on the test data set; if the expected accuracy is reached, the network can be used to detect crowd abnormal behavior in the video to be detected, otherwise training is repeated.
3. The method for detecting the abnormal behavior of the crowd based on the neural network as claimed in claim 1, wherein the step 2) comprises the following substeps:
2-1) denoising the video image sequence to be detected, using a Wiener filter as the denoising algorithm;
2-2) performing optical flow processing on the video to be detected, and resizing the optical flow image sequence and the RGB image sequence to 320 × 240 after the optical flow image sequence is obtained;
2-3) inputting the video images into the trained dual-stream residual neural network, which directly gives the detection result.
4. The method as claimed in claim 3, wherein a warning mark is issued if the detection result indicates abnormal behavior, and the procedure returns to the video sequence to be detected for re-detection if no abnormal behavior is detected.
5. A crowd abnormal behavior detection device based on a neural network is characterized by comprising:
the device comprises a neural network module and a to-be-detected video processing module.
6. The device according to claim 5, wherein the main structure of the neural network is as follows:
1) the input layer converts the input pictures into a database file, unifies the images to 320 × 240 pixels, and randomly shuffles the data images;
2) the first part of the hidden layers is a composite layer consisting of a convolutional layer "Conv1" and a pooling layer "Pool". "Conv1" consists of 64 convolution kernels of size 7 × 7 with a convolution stride of 2; after the 64 kernels are convolved separately, their output matrices are superposed and averaged to give the final convolution result. The "Pool" layer performs overlapping max-pooling on the convolution result with a pooling window of size 2 × 2;
3) the second part is a composite convolution network "Conv2" made up of three sub-convolution networks, each with 3 layers; the convolution kernels number 128, 128 and 512 with sizes 1 × 1, 3 × 3 and 1 × 1 respectively. Each sub-part is connected by a shortcut, and the three sub-convolution networks form a 12-layer residual convolution network in total;
4) the third part is a convolution network "Conv3" made up of three sub-convolution networks, each with 3 layers; the convolution kernels number 64, 64 and 128 with sizes 1 × 1, 3 × 3 and 1 × 1 respectively. Each sub-part is connected by a shortcut, and the third part contains 12 convolutional layers forming a composite convolution network;
5) the fourth part is a composite convolution network "Conv4" comprising twenty-three sub-convolution networks, each with 3 layers; the convolution kernels number 256, 512 and 1024 with sizes 1 × 1, 3 × 3 and 1 × 1 respectively. Each sub-part is connected by a residual-network shortcut, and the fourth part has 69 convolutional layers in total;
6) the fifth part is the last part of the residual network, a composite convolution network "Conv5" consisting of three sub-convolution networks, each with 3 layers; the convolution kernels number 512, 1024 and 2048 with sizes 1 × 1, 3 × 3 and 1 × 1 respectively. Each sub-part is connected by a residual-network shortcut, and the fifth part has 9 convolutional layers in total;
7) the last part of the neural network is the output part; the output layer is composed of an expansion (flatten) layer, 4 fully connected layers and a softmax classification layer. The expansion layer flattens all the fused features into a one-dimensional vector, the fully connected layers have output sizes of 1024, 512, 256 and 64 respectively, and softmax normalizes the outputs of the fully connected layers.
7. The device according to claim 5, wherein the video processing module is responsible for preprocessing the video to be detected, and the main process is as follows:
1) image noise reduction (Wiener filtering);
2) optical flow extraction to generate optical flow images;
3) RGB image extraction.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911221923.9A CN111027440B (en) | 2019-12-03 | 2019-12-03 | Crowd abnormal behavior detection device and detection method based on neural network |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911221923.9A CN111027440B (en) | 2019-12-03 | 2019-12-03 | Crowd abnormal behavior detection device and detection method based on neural network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111027440A true CN111027440A (en) | 2020-04-17 |
CN111027440B CN111027440B (en) | 2023-05-30 |
Family
ID=70207908
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911221923.9A Active CN111027440B (en) | 2019-12-03 | 2019-12-03 | Crowd abnormal behavior detection device and detection method based on neural network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111027440B (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111598179A (en) * | 2020-05-21 | 2020-08-28 | 国网电力科学研究院有限公司 | Power monitoring system user abnormal behavior analysis method, storage medium and equipment |
CN112613359A (en) * | 2020-12-09 | 2021-04-06 | 苏州玖合智能科技有限公司 | Method for constructing neural network for detecting abnormal behaviors of people |
CN113221817A (en) * | 2021-05-27 | 2021-08-06 | 江苏奥易克斯汽车电子科技股份有限公司 | Abnormal behavior detection method, device and equipment |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109359519A (en) * | 2018-09-04 | 2019-02-19 | 杭州电子科技大学 | A kind of video anomaly detection method based on deep learning |
CN109635790A (en) * | 2019-01-28 | 2019-04-16 | 杭州电子科技大学 | A kind of pedestrian's abnormal behaviour recognition methods based on 3D convolution |
CN109670446A (en) * | 2018-12-20 | 2019-04-23 | 泉州装备制造研究所 | Anomaly detection method based on linear dynamic system and depth network |
CN109934042A (en) * | 2017-12-15 | 2019-06-25 | 吉林大学 | Adaptive video object behavior trajectory analysis method based on convolutional neural networks |
CN110188637A (en) * | 2019-05-17 | 2019-08-30 | 西安电子科技大学 | A kind of Activity recognition technical method based on deep learning |
CN110210555A (en) * | 2019-05-29 | 2019-09-06 | 西南交通大学 | Rail fish scale hurt detection method based on deep learning |
CN110503063A (en) * | 2019-08-28 | 2019-11-26 | 东北大学秦皇岛分校 | Fall detection method based on hourglass convolution autocoding neural network |
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109934042A (en) * | 2017-12-15 | 2019-06-25 | 吉林大学 | Adaptive video object behavior trajectory analysis method based on convolutional neural networks |
CN109359519A (en) * | 2018-09-04 | 2019-02-19 | 杭州电子科技大学 | A kind of video anomaly detection method based on deep learning |
CN109670446A (en) * | 2018-12-20 | 2019-04-23 | 泉州装备制造研究所 | Anomaly detection method based on linear dynamic system and depth network |
CN109635790A (en) * | 2019-01-28 | 2019-04-16 | 杭州电子科技大学 | A kind of pedestrian's abnormal behaviour recognition methods based on 3D convolution |
CN110188637A (en) * | 2019-05-17 | 2019-08-30 | 西安电子科技大学 | A kind of Activity recognition technical method based on deep learning |
CN110210555A (en) * | 2019-05-29 | 2019-09-06 | 西南交通大学 | Rail fish scale hurt detection method based on deep learning |
CN110503063A (en) * | 2019-08-28 | 2019-11-26 | 东北大学秦皇岛分校 | Fall detection method based on hourglass convolution autocoding neural network |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111598179A (en) * | 2020-05-21 | 2020-08-28 | 国网电力科学研究院有限公司 | Power monitoring system user abnormal behavior analysis method, storage medium and equipment |
CN111598179B (en) * | 2020-05-21 | 2022-10-04 | 国网电力科学研究院有限公司 | Power monitoring system user abnormal behavior analysis method, storage medium and equipment |
CN112613359A (en) * | 2020-12-09 | 2021-04-06 | 苏州玖合智能科技有限公司 | Method for constructing neural network for detecting abnormal behaviors of people |
CN112613359B (en) * | 2020-12-09 | 2024-02-02 | 苏州玖合智能科技有限公司 | Construction method of neural network for detecting abnormal behaviors of personnel |
CN113221817A (en) * | 2021-05-27 | 2021-08-06 | 江苏奥易克斯汽车电子科技股份有限公司 | Abnormal behavior detection method, device and equipment |
Also Published As
Publication number | Publication date |
---|---|
CN111027440B (en) | 2023-05-30 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN113052210A (en) | Fast low-illumination target detection method based on convolutional neural network | |
US20210034840A1 (en) | Method for Recognzing Face from Monitoring Video Data | |
CN113591795A (en) | Lightweight face detection method and system based on mixed attention feature pyramid structure | |
CN103530638B (en) | Method for pedestrian matching under multi-cam | |
CN113591968A (en) | Infrared weak and small target detection method based on asymmetric attention feature fusion | |
CN112183468A (en) | Pedestrian re-identification method based on multi-attention combined multi-level features | |
CN111027440A (en) | Crowd abnormal behavior detection device and method based on neural network | |
CN109886159B (en) | Face detection method under non-limited condition | |
CN110532959B (en) | Real-time violent behavior detection system based on two-channel three-dimensional convolutional neural network | |
CN109919223B (en) | Target detection method and device based on deep neural network | |
CN113780132A (en) | Lane line detection method based on convolutional neural network | |
CN112308087B (en) | Integrated imaging identification method based on dynamic vision sensor | |
CN109377499A (en) | A kind of Pixel-level method for segmenting objects and device | |
CN114155474A (en) | Damage identification technology based on video semantic segmentation algorithm | |
Zhu et al. | Towards automatic wild animal detection in low quality camera-trap images using two-channeled perceiving residual pyramid networks | |
CN111339950B (en) | Remote sensing image target detection method | |
CN115661459A (en) | 2D mean teacher model using difference information | |
Sun et al. | UAV image detection algorithm based on improved YOLOv5 | |
CN111881803B (en) | Face recognition method based on improved YOLOv3 | |
CN112232236B (en) | Pedestrian flow monitoring method, system, computer equipment and storage medium | |
CN113177956A (en) | Semantic segmentation method for unmanned aerial vehicle remote sensing image | |
CN113052139A (en) | Deep learning double-flow network-based climbing behavior detection method and system | |
Ren et al. | Research on Safety Helmet Detection for Construction Site | |
CN108711147A (en) | A kind of conspicuousness fusion detection algorithm based on convolutional neural networks | |
CN112418229A (en) | Unmanned ship marine scene image real-time segmentation method based on deep learning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |