CN111027440B - Crowd abnormal behavior detection device and detection method based on neural network

Info

Publication number: CN111027440B
Application number: CN201911221923.9A
Authority: CN
Inventors: 杨戈; 陈德城
Original assignee: Beijing Normal University Zhuhai
Current assignee: Beijing Normal University Zhuhai
Priority date: 2019-12-03
Filing date: 2019-12-03
Publication date: 2023-05-30
Anticipated expiration: 2039-12-03
Also published as: CN111027440A

Abstract

The invention discloses a crowd abnormal behavior detection device and detection method based on a neural network. The method comprises two steps: first, constructing and training a two-stream residual neural network; second, processing the video images to be detected and then detecting them with the neural network. A classifier on local features, which exploits the ease of acquiring optical flow, performs preliminary screening, and a cascaded classifier built on features extracted by a neural-network-based autoencoder performs fine judgment. A 70-layer deep residual neural network model, constructed following the idea of residual networks, serves as the detection model, which makes feature learning more accurate and improves both detection speed and detection accuracy.

Description

Crowd abnormal behavior detection device and detection method based on neural network

Technical Field

The invention relates to the technology in the field of computer vision, in particular to a crowd abnormal behavior detection device and method based on a neural network.

Background

In modern cities, surveillance cameras cover nearly all public areas. These closed-circuit television systems typically require staff to be present at all times to monitor the captured scenes and to alert the responsible persons when an abnormal event occurs. However, manually inspecting video footage is time-consuming and laborious. In the narrow sense, crowd abnormal event detection refers to judging specific behaviors in specific scenes; in the broad sense, it refers to detecting events whose probability of occurrence is lower than that of normal events, generally by 5% to 10%.

At present, crowd detection methods in China and abroad fall mainly into two categories: methods based on object recognition (Cen Ying Gang, Wang Wenjiang, Li Ang, Liang Liequan, Wang Hengyou. Group anomaly detection by significant optical flow histogram dictionary [J]. Signal Processing, 2017, 33(03): 330-337) and global methods (Xie S, Zhang X, Cai J. Video crowd detection and abnormal behavior model detection based on machine learning method [J]. Neural Computing & Applications, 2018: 1-10) (Colque RVHM, Caetano C, Andrade MTLD, et al. Histograms of Optical Flow Orientation and Magnitude and Entropy to Detect Anomalous Events in Videos [J]. IEEE Transactions on Circuits & Systems for Video Technology, 2017, 27(3): 673-682). Object-recognition-based methods preserve local features well but are relatively time-consuming, whereas global methods are the opposite: faster, but weaker at preserving local features.

Mainstream global methods rely mainly on the following two kinds of feature extraction:

1) Using easily obtained low-dimensional features, most commonly optical flow features (Colque RVHM, Caetano C, Andrade MTLD, et al. Histograms of Optical Flow Orientation and Magnitude and Entropy to Detect Anomalous Events in Videos [J]. IEEE Transactions on Circuits & Systems for Video Technology, 2017, 27(3): 673-682), and performing local or global feature extraction to construct a classifier. For example, the literature (Wang T, Snoussi H. Detection of abnormal visual events via global optical flow orientation histogram [J]. IEEE Trans. Inf. Forensics Security, 2014, 9(6): 988-998) proposes incorporating more features on top of global optical flow to construct a multi-scale optical flow histogram; the literature (Cong Y, Yuan J, Tang Y. Video anomaly search in crowded scenes via spatio-temporal motion context [J]. IEEE Trans. Inf. Forensics Security, 2013, 8(10): 1590-1599) proposes an improved method that constructs an optical flow orientation histogram with spatio-temporal dimensions. The advantage of this approach is that the features are simple to acquire; the disadvantages are that optical flow is not sufficiently accurate and that computing global optical flow takes a significant amount of time.

2) Constructing complex high-dimensional features, such as energy flow features (Nam Y. Crowd flux analysis and abnormal event detection in unstructured and structured scenes [J]. Multimedia Tools and Applications, 2014, 72(3): 3001-3029), super-cluster features (Rao AS, Gubi J, Rajasegarar S, Marusic S, Palaniswami M. Detection of anomalous crowd behaviour using hyperspherical clustering [J]. International Conference on Digital Image Computing: Techniques and Applications (DICTA). IEEE, 2014, pp 1-8), and contrast filter features (Shi Y, Liu Y, Zhang Q, et al. Saliency-based abnormal event detection in crowded scenes [J]. Journal of Electronic Imaging, 2016, 25(6): 061608). The advantage is that high-dimensional features tend to represent events more compactly; the disadvantage is that sufficiently accurate high-dimensional features tend to be difficult to obtain. The invention uses a classifier on local features, which exploits the ease of acquiring optical flow, for preliminary screening, and cascades it with a classifier built on features extracted by a neural-network-based autoencoder for fine judgment.

As for patents, the patent (Wu Yochun; Yang Yansheng. Crowd abnormal event detection method, electronic equipment and storage medium [P]. Chinese patent: CN108288021A, 2018-07-17) uses the abscissa and ordinate relationships between optical flow points to compute an optical flow point adjacency matrix for abnormal event detection. Its advantage is higher accuracy; its disadvantage is that the time cost of optical flow computation remains high. The patent (Xuan Zu Xing; Guo Yanfei; Wang Hai; Sun Xin. Crowd abnormal event detection method based on hybrid tracking and generalized linear model [P]. Chinese patent: CN108280408A, 2018-07-13) provides a crowd abnormal event detection method based on a generalized linear model. Its advantages are that the linear model retains high-dimensional accuracy and offers some advantage in speed; its disadvantage is that manually extracted feature models are more prone to overfitting and underfitting and are less robust.

Disclosure of Invention

The invention provides a crowd abnormal behavior detection method based on a neural network. A classifier on local features, which exploits the ease of acquiring optical flow, performs preliminary screening; a cascaded classifier built on features extracted by a neural-network-based autoencoder then performs fine judgment.

The technical scheme adopted by the invention is as follows:

A crowd abnormal behavior detection method based on a neural network, characterized by comprising the following steps:

1) constructing and training a two-stream residual neural network;

2) processing the video images to be detected and then detecting them with the neural network.

Further, step 1) comprises the following sub-steps:

1-1) dividing a standard crowd-motion data set into two parts, a training data set and a test data set, each of which contains both normal and abnormal behaviors;

1-2) performing optical flow processing on the training samples and the test samples respectively to obtain optical flow image sequences, then resizing the optical flow image sequences and the RGB image sequences to 320 × 240, and expanding the data set;

1-3) randomly shuffling the training samples, inputting them into the two-stream residual neural network, and training the network;

1-4) testing the trained neural network with the test data set. If the test result reaches the expected accuracy, the network can be used to detect crowd abnormal behavior in the video to be detected; if it does not, training is performed again.
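Sub-steps 1-1) and 1-3) can be illustrated with a small sketch. It is only an illustration under stated assumptions: the 80/20 split ratio, the `(clip_id, label)` sample layout and the function name `split_dataset` are hypothetical and do not come from the source, which only requires that both parts contain normal and abnormal behaviors and that the training samples be shuffled.

```python
import random


def split_dataset(samples, train_fraction=0.8, seed=0):
    """Split (clip_id, label) pairs so both parts contain both labels.

    train_fraction is an assumed value; the source does not state a ratio.
    Each label group must hold at least two samples.
    """
    rng = random.Random(seed)
    train, test = [], []
    for label in ("normal", "abnormal"):
        group = [s for s in samples if s[1] == label]
        rng.shuffle(group)
        # Keep at least one sample of this label on each side of the cut.
        cut = max(1, min(len(group) - 1, round(len(group) * train_fraction)))
        train += group[:cut]
        test += group[cut:]
    rng.shuffle(train)  # sub-step 1-3: random shuffling before training
    return train, test


samples = [(i, "normal") for i in range(10)] + [(i, "abnormal") for i in range(10, 14)]
train_set, test_set = split_dataset(samples)
```

Splitting per label (a stratified split) is one simple way to guarantee that neither part ends up with only one kind of behavior.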

Further, step 2) comprises the following sub-steps:

2-1) denoising the video image sequence to be detected, using a Wiener filter as the denoising algorithm;

2-2) performing optical flow processing on the video to be detected to obtain an optical flow image sequence, then resizing the optical flow image sequence and the RGB image sequence to 320 × 240;

2-3) inputting the video images into the trained two-stream residual neural network, which directly gives the detection result.

Further, if the detection result is that abnormal behavior has occurred, a warning mark is issued; if the detection result is that no abnormal behavior has occurred, the method returns to the video sequence to be detected and detection continues.
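The decision flow of step 2) can be sketched as a simple loop. This is a minimal illustration, not the patented implementation: `classify` is a hypothetical stand-in for the trained two-stream network, and the label strings `"abnormal"` and `"normal"` are assumed names.

```python
def monitor(frames, classify):
    """Return the indices of frames for which a warning would be issued.

    classify is a hypothetical callable standing in for the trained network;
    it returns "abnormal" or "normal" for one preprocessed frame.
    """
    warnings = []
    for index, frame in enumerate(frames):
        if classify(frame) == "abnormal":
            warnings.append(index)  # abnormal behavior: issue a warning mark
        # otherwise: return to the video sequence and keep detecting
    return warnings


# Toy stand-in classifier: treats a score above 0.5 as abnormal.
toy_classify = lambda score: "abnormal" if score > 0.5 else "normal"
alerts = monitor([0.1, 0.9, 0.2, 0.7], toy_classify)
```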

A crowd abnormal behavior detection device based on a neural network, characterized by comprising:

a neural network module and a to-be-detected video processing module.

Further, the neural network is constructed as follows:

1) The input layer converts the input pictures into a database file, resizes the images to 320 × 240 pixels, and randomly shuffles the images;

2) The first part of the hidden layer is a composite layer consisting of a convolutional layer "Conv1" and a pooling layer "Pool". "Conv1" consists of 64 convolution kernels of size 7 × 7 with a convolution stride of 2. After each of the 64 kernels completes its convolution, the output matrices are superimposed and averaged to give the final convolution result. The "Pool" layer performs overlapping max downsampling on the convolution result with a pooling window of 2 × 2;

3) The second part is a composite convolutional network "Conv2" consisting of three sub-convolutional networks, each having 3 layers, the number and size of the convolution kernels being 128×1×1, 128×3×3 and 512×1×1 respectively. Shortcut connections are used between the sub-convolutional networks, and the three sub-convolutional networks form a residual convolutional network of 12 layers in total;

4) The third part is a convolutional network "Conv3" consisting of three sub-convolutional networks, each having 3 layers, the number and size of the convolution kernels being 64×1×1, 64×3×3 and 128×1×1 respectively. Shortcut connections are used between the sub-convolutional networks, and the third part has 12 convolutional layers in total, forming a composite convolutional layer;

5) The fourth part is a composite convolutional network "Conv4" consisting of twenty-three sub-convolutional networks, each having 3 layers, the number and size of the convolution kernels being 256×1×1, 512×3×3 and 1024×1×1 respectively. Residual-network shortcut connections are used between the sub-convolutional networks, and the fourth part has 69 convolutional layers in total;

6) The fifth part is the last part of the residual network. It is a composite convolutional network "Conv5" consisting of three sub-convolutional networks, each having 3 layers, the number and size of the convolution kernels being 512×1×1, 1024×3×3 and 2048×1×1 respectively. Residual-network shortcut connections are used between the sub-convolutional networks, and the fifth part has 9 convolutional layers in total;

7) The last part of the neural network is the output part. The output layer consists of a flatten layer, 4 fully connected layers and a softmax classification layer. The flatten layer unfolds the fused features into a one-dimensional vector, the fully connected layers have output sizes of 1024, 512, 256 and 64 respectively, and the softmax layer normalizes the output of the fully connected layers.
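Two pieces of arithmetic in the structure above can be made concrete: the spatial size after "Conv1" and "Pool", and the softmax normalization at the output. The sketch below is hedged. The source gives the kernel size, stride and pooling window but not the padding or the pooling stride, so `padding=3` and a pooling stride of 1 (one reading of "overlapping") are assumptions, and the two-element logit vector is purely illustrative.

```python
import math


def conv_out(size, kernel, stride, padding=0):
    """Output length of a convolution or pooling window along one axis."""
    return (size + 2 * padding - kernel) // stride + 1


def softmax(logits):
    """Normalize a vector of scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Conv1: 7 x 7 kernels, stride 2, on a 320 x 240 input (padding 3 is assumed).
conv_w = conv_out(320, kernel=7, stride=2, padding=3)
conv_h = conv_out(240, kernel=7, stride=2, padding=3)

# Pool: 2 x 2 window; stride 1 is an assumed reading of "overlapping".
pool_w = conv_out(conv_w, kernel=2, stride=1)
pool_h = conv_out(conv_h, kernel=2, stride=1)

# Illustrative scores only; the real input is the last fully connected output.
probs = softmax([2.0, 0.5])
```

With a different padding or pooling stride the sizes change, so these numbers should be read as one consistent example rather than the dimensions of the patented network.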

Further, the to-be-detected video processing module is responsible for preprocessing the video to be detected. The main flow is as follows:

1) image noise reduction by Wiener filtering;

2) optical flow extraction to generate optical flow images;

3) RGB image extraction.
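The noise-reduction step can be sketched with a local adaptive Wiener filter, a common formulation for image denoising. The source does not specify the window size or how the noise power is estimated; the 3 × 3 window, the mean-of-local-variances noise estimate and the truncated windows at the image border are all assumptions of this sketch.

```python
def local_stats(img, k=3):
    """Per-pixel mean and variance over a k x k window (truncated at borders)."""
    h, w = len(img), len(img[0])
    r = k // 2
    means = [[0.0] * w for _ in range(h)]
    variances = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            vals = [img[yy][xx]
                    for yy in range(max(0, y - r), min(h, y + r + 1))
                    for xx in range(max(0, x - r), min(w, x + r + 1))]
            m = sum(vals) / len(vals)
            means[y][x] = m
            variances[y][x] = sum((v - m) ** 2 for v in vals) / len(vals)
    return means, variances


def wiener_denoise(img, k=3, noise=None):
    """Adaptive Wiener filter on a grayscale image given as a list of rows."""
    means, variances = local_stats(img, k)
    h, w = len(img), len(img[0])
    if noise is None:
        # Assumed noise estimate: average of all local variances.
        noise = sum(sum(row) for row in variances) / (h * w)
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            m, v = means[y][x], variances[y][x]
            if v <= noise:
                out[y][x] = m  # flat region: replace by the local mean
            else:
                out[y][x] = m + (1.0 - noise / v) * (img[y][x] - m)
    return out


# A flat 5 x 5 image with one bright outlier in the centre.
noisy = [[10.0] * 5 for _ in range(5)]
noisy[2][2] = 100.0
clean = wiener_denoise(noisy)
```

In flat regions the filter falls back to the local mean, while in high-variance regions it keeps more of the original pixel, which is why the outlier above is attenuated rather than removed.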

The beneficial effects of the invention are as follows:

The invention builds its system on deep learning. The neural network automatically learns high-dimensional features of the various behaviors of moving crowds. Traditional hand-crafted feature extraction is often limited by the designer's perspective: features that are too complex are very poorly robust, and features that are too simple cannot represent the data effectively, which greatly limits detection accuracy and speed. The extraction method of the invention represents crowd features fully and effectively.

The invention uses a 70-layer deep residual neural network model, constructed following the idea of residual networks, as the detection model, which makes feature learning more accurate and gives higher efficiency in both detection speed and detection accuracy.

The invention adopts a two-stream neural network structure that extracts two kinds of information simultaneously, one per stream. Temporal and spatial sequence information is fully utilized, and the features learned from the two kinds of information are fused to serve as the final detection indicator, so information loss is unlikely.
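The fusion of the two streams can be illustrated as follows. The source states that the features of the two streams are fused and then unfolded into a one-dimensional vector, but it does not say how they are fused; channel-wise concatenation is an assumption of this sketch, and the function name is hypothetical.

```python
def fuse_and_flatten(spatial, temporal):
    """Fuse two feature maps and unfold them into a one-dimensional vector.

    Each argument is a list of channels; each channel is a list of rows.
    Channel-wise concatenation is an assumed fusion rule.
    """
    fused = spatial + temporal  # RGB-stream channels, then optical-flow channels
    return [value for channel in fused for row in channel for value in row]


rgb_features = [[[1, 2], [3, 4]]]    # one 2 x 2 channel from the RGB stream
flow_features = [[[5, 6], [7, 8]]]   # one 2 x 2 channel from the optical-flow stream
vector = fuse_and_flatten(rgb_features, flow_features)
```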

Drawings

FIG. 1 is an illustration of the data expansion of the present invention;

FIG. 2 is a training flow diagram of the neural network of the present invention;

FIG. 3 is a flow chart of the present invention for processing a video image to be detected and detecting using a neural network;

FIG. 4 shows the specific structure of the neural network of the present invention.

Detailed Description

The present invention is described in further detail below with reference to the examples and drawings.

The invention provides a crowd abnormal behavior detection method based on a neural network. A classifier on local features, which exploits the ease of acquiring optical flow, performs preliminary screening; a cascaded classifier built on features extracted by a neural-network-based autoencoder then performs fine judgment. The construction of the neural network and the specific method steps of the invention are as follows:

Construction of the neural network:

3) The second part is a composite convolutional network "Conv2" consisting of three sub-convolutional networks, each having 3 layers, the number and size of the convolution kernels being 128×1×1, 128×3×3 and 512×1×1 respectively. Shortcut connections are used between the sub-parts, and the three sub-convolutional networks form a residual convolutional network of 12 layers in total;

4) The third part is a convolutional network "Conv3" consisting of three sub-convolutional networks, each having 3 layers, the number and size of the convolution kernels being 64×1×1, 64×3×3 and 128×1×1 respectively. Shortcut connections are used between the sub-parts, and the third part has 12 convolutional layers in total, forming a composite convolutional layer;

5) The fourth part is a composite convolutional network "Conv4" consisting of twenty-three sub-convolutional networks, each having 3 layers, the number and size of the convolution kernels being 256×1×1, 512×3×3 and 1024×1×1 respectively. Residual-network shortcut connections are used between the sub-parts, and the fourth part has 69 convolutional layers in total;

6) The fifth part is the last part of the residual network. It is a composite convolutional network "Conv5" consisting of three sub-convolutional networks, each having 3 layers, the number and size of the convolution kernels being 512×1×1, 1024×3×3 and 2048×1×1 respectively. Residual-network shortcut connections are used between the sub-parts, and the fifth part has 9 convolutional layers in total;
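The shortcut connection used in each part can be sketched in a few lines: the output of a sub-network is added to its own input. The sketch below is a toy on plain vectors, not the patented network. The "layers" are hypothetical callables standing in for the three convolutions of a sub-network, and an identity shortcut is assumed (real residual networks often use a projection where input and output sizes differ).

```python
def residual_block(x, layers):
    """Apply the layers in sequence, then add the input back (identity shortcut)."""
    out = list(x)
    for layer in layers:
        out = layer(out)
    if len(out) != len(x):
        raise ValueError("identity shortcut needs matching input and output sizes")
    return [o + xi for o, xi in zip(out, x)]


def run_stage(x, blocks):
    """Chain several residual blocks, as in one part ('Conv2' to 'Conv5')."""
    for layers in blocks:
        x = residual_block(x, layers)
    return x


# Toy stand-ins for convolutional layers.
double = lambda v: [2.0 * a for a in v]
minus_one = lambda v: [a - 1.0 for a in v]
zero = lambda v: [0.0 for _ in v]

y = residual_block([1.0, 2.0], [double, minus_one])

# 23 sub-networks of 3 layers each, as described for "Conv4".
conv4_like = [[zero, zero, zero]] * 23
z = run_stage([1.0, 2.0], conv4_like)
```

When every layer outputs zero, each block reduces to the identity, which is the property that lets very deep stacks such as the 23-block part still pass the input through unchanged.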

The specific method comprises the following steps:

Take the UMN database as an example. This public database was established by the University of Minnesota as a standard benchmark for crowd abnormal event detection algorithms; the videos in it were recorded with the participation of students and teachers of the university. The database contains 11 video clips, each comprising a part in which the crowd moves normally and a part in which it moves abnormally. Each video starts with normal behavior and ends with abnormal behavior.

The data set is divided into two parts, a training sample set and a test sample set, with normal and abnormal samples distributed so that each part contains both normal and abnormal behaviors. The two parts are set aside for the subsequent network training.

For a neural network applied to abnormal behavior detection and recognition, the training result is closely tied to the size of the data set: the larger the data set used in training, the less likely the trained network is to overfit or underfit. Because the abnormal behavior training set is limited, the existing data is expanded by data augmentation.

Before the training data set is input into the network, each frame is rotated, scaled and translated. For rotation, the angle is limited to [-30°, +30°]. For translation, the shift of each image is limited to [-10%, +10%]; within this range the program randomly determines the magnitude and direction of the horizontal and vertical shifts. For scaling, the scale change is likewise limited to [-10%, +10%]; when shrinking a picture leaves blank areas, they are filled with the nearest pixels of the original image. Keeping each kind of change within a fixed range prevents the newly generated images from differing too much from the originals, which would make the network training process unpredictable. The ranges set in this embodiment include zero; a zero-valued transform means that no transform is applied. After expansion, the data set is 4 times its original size.

FIG. 1 shows example operations applied to an original image: shrinking by 10% and enlarging by 10%, shifting 10% to the right and 10% downward, and rotating by +30 degrees and -20 degrees.
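The augmentation ranges above can be sketched as a sampling plan. This is a hedged illustration: it only samples transform parameters and does not warp any pixels, and it assumes that the 4-fold expansion comes from keeping the original frame plus one rotated, one translated and one scaled copy. The function and key names are hypothetical.

```python
import random

ROTATION_DEG = (-30.0, 30.0)       # rotation range from the embodiment
TRANSLATION_FRAC = (-0.10, 0.10)   # shift as a fraction of the image size
SCALE_FRAC = (-0.10, 0.10)         # relative change of scale


def expansion_plan(frame_ids, seed=0):
    """List (frame_id, transform, params) entries: the original plus 3 variants.

    One variant per transform type is an assumed way to reach the 4x size.
    """
    rng = random.Random(seed)
    plan = []
    for fid in frame_ids:
        plan.append((fid, "original", {}))
        plan.append((fid, "rotate", {"deg": rng.uniform(*ROTATION_DEG)}))
        plan.append((fid, "translate", {"dx": rng.uniform(*TRANSLATION_FRAC),
                                        "dy": rng.uniform(*TRANSLATION_FRAC)}))
        plan.append((fid, "scale", {"factor": 1.0 + rng.uniform(*SCALE_FRAC)}))
    return plan


plan = expansion_plan(range(5))
```

Each entry could then be handed to an image library of choice to perform the actual rotation, shift or rescale, with nearest-pixel filling for the blank areas.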

Most parameters of the neural network adjust themselves to the learning content, but before training the hyperparameters of the network must be set so that learning proceeds smoothly. The setting of each hyperparameter is described in detail below:

The batch_size of the training set is set to 2. Batch_size is the number of samples used in a single training batch. Crowd abnormal behavior training sets are relatively small, so to make full use of the information in a training set of limited size, batch_size is set to a relatively low value; this increases the training time cost but markedly improves training effectiveness. By common practice, batch_size is generally a power of 2 to suit CPU and GPU processing.

The number of epochs for the training set is set to 50. An epoch is one complete round of network training; each epoch is a new round of training on the training set that builds on the previous round. Neural network training is usually judged complete when, over multiple epochs, the accuracy and the loss converge to particular values.
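How batch_size and the number of epochs interact can be sketched as a small generator. The values 2 and 50 are those of the embodiment; reshuffling at the start of every epoch and the function name are assumptions of this sketch.

```python
import random


def iterate_minibatches(samples, batch_size=2, epochs=50, seed=0):
    """Yield (epoch, batch) pairs; the last batch of an epoch may be smaller."""
    rng = random.Random(seed)
    for epoch in range(epochs):
        order = list(samples)
        rng.shuffle(order)  # assumed: reshuffle once per epoch
        for start in range(0, len(order), batch_size):
            yield epoch, order[start:start + batch_size]


# 7 samples give 4 batches per epoch (sizes 2, 2, 2, 1).
steps = list(iterate_minibatches(range(7)))
```

The total number of parameter updates is the number of batches per epoch times the number of epochs, which is why a small batch_size lengthens training.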

A common method for optimizing parameters during network training is gradient descent; the momentum of the network is set to 0.9. During gradient descent, the initial state of the network affects whether it converges to the optimal solution: when descent proceeds correctly the network converges quickly to the global optimum, and otherwise it can become trapped in a local optimum. Momentum borrows the relationship between potential and kinetic energy in physics to guide the direction of descent. The larger the descent gradient, the smaller the adjustment angle; the smaller the descent gradient, the larger the adjustment angle, so that the perturbation is more likely to escape a local optimum. When setting the momentum value, two goals should be balanced: escaping local optima as far as possible, and keeping the oscillation amplitude from growing too large. After multiple tests, this embodiment finds that network training works best with a momentum value of 0.9.
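The effect of momentum can be seen on a one-dimensional toy problem. The sketch below uses the classical momentum update (velocity = momentum × velocity − learning rate × gradient), which is an assumption, since the source does not give the update rule; the objective f(w) = w² is likewise only illustrative. The values 0.9 and 0.001 are those of the embodiment.

```python
def sgd_momentum(grad, w0, lr=0.001, momentum=0.9, steps=100):
    """Minimize a 1-D function with classical momentum gradient descent."""
    w, velocity = w0, 0.0
    for _ in range(steps):
        # The velocity accumulates past gradients, like kinetic energy.
        velocity = momentum * velocity - lr * grad(w)
        w = w + velocity
    return w


grad_square = lambda w: 2.0 * w  # gradient of f(w) = w**2

w_momentum = sgd_momentum(grad_square, w0=1.0, momentum=0.9)
w_plain = sgd_momentum(grad_square, w0=1.0, momentum=0.0)
```

After the same number of steps the run with momentum ends closer to the minimum at 0 than the run without it, because consecutive gradients that point the same way reinforce each other.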

The initial learning rate is set to 0.001. During training, the learning rate represents the magnitude by which the network adjusts its parameters; in gradient descent, it corresponds to the step size of each step. A larger learning rate gives faster convergence but more easily causes exploding gradients and oscillation around the optimal solution; a smaller learning rate gives more accurate results but makes the learned model prone to overfitting. Therefore, the learning rate in this embodiment changes dynamically with the epoch: starting from an initial learning rate of 0.001, it is increased step by step.
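The trade-off between small and large learning rates can be reproduced on the toy objective f(w) = w². The sketch is illustrative only: the objective, the starting point and the learning rate 1.1 are assumptions chosen to make the oscillation visible, and only 0.001 comes from the embodiment.

```python
def gradient_descent(lr, w0=1.0, steps=20):
    """Run plain gradient descent on f(w) = w**2 and return every iterate."""
    w = w0
    path = [w]
    for _ in range(steps):
        w -= lr * 2.0 * w  # gradient of w**2 is 2w
        path.append(w)
    return path


slow = gradient_descent(lr=0.001)    # the embodiment's initial learning rate
unstable = gradient_descent(lr=1.1)  # deliberately too large
```

With the small rate the iterate creeps towards 0 without ever overshooting; with the large rate it jumps across the minimum on every step and its magnitude grows, which is the oscillation and blow-up described above.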

The preprocessed video set is input into the neural network for training. Once training reaches the expected accuracy, training of the neural network is complete; if the expected accuracy is not reached, training is performed again.

After training is complete, the test video (the video to be detected) is likewise input into the trained neural network for detection. When abnormal behavior appears in a frame, the system issues a warning mark; if no abnormal behavior appears, the system continues detection.

The above embodiments are merely preferred embodiments of the present invention and are not intended to limit its scope; all equivalent changes made according to the shape, construction and principle of the present invention are intended to fall within its scope.

Claims

1. A crowd abnormal behavior detection device based on a neural network, characterized by comprising: a neural network module and a to-be-detected video processing module;

the main structure of the neural network is as follows:

2) The first part of the hidden layer is a composite layer consisting of a convolutional layer "Conv1" and a pooling layer "Pool", wherein "Conv1" consists of 64 convolution kernels of size 7 × 7 with a convolution stride of 2; after each of the 64 kernels completes its convolution, the output matrices are superimposed and averaged to give the final convolution result; the "Pool" layer performs overlapping max downsampling on the convolution result with a pooling window of 2 × 2;

3) The second part is a composite convolutional network "Conv2" consisting of three sub-convolutional networks, each having 3 layers, the number and size of the convolution kernels being 128×1×1, 128×3×3 and 512×1×1 respectively; shortcut connections are used between the sub-convolutional networks, and the three sub-convolutional networks form a residual convolutional network of 12 layers in total;

4) The third part is a convolutional network "Conv3" consisting of three sub-convolutional networks, each having 3 layers, the number and size of the convolution kernels being 64×1×1, 64×3×3 and 128×1×1 respectively; shortcut connections are used between the sub-convolutional networks, and the third part has 12 convolutional layers in total, forming a composite convolutional layer;

5) The fourth part is a composite convolutional network "Conv4" consisting of twenty-three sub-convolutional networks, each having 3 layers, the number and size of the convolution kernels being 256×1×1, 512×3×3 and 1024×1×1 respectively; residual-network shortcut connections are used between the sub-convolutional networks, and the fourth part has 69 convolutional layers in total;

6) The fifth part is the last part of the residual network, which is a composite convolutional network "Conv5" consisting of three sub-convolutional networks, each having 3 layers, the number and size of the convolution kernels being 512×1×1, 1024×3×3 and 2048×1×1 respectively; residual-network shortcut connections are used between the sub-convolutional networks, and the fifth part has 9 convolutional layers in total;

7) The last part of the neural network is the output part; the output layer consists of a flatten layer, 4 fully connected layers and a softmax classification layer; the flatten layer unfolds the fused features into a one-dimensional vector, the fully connected layers have output sizes of 1024, 512, 256 and 64 respectively, and the softmax layer normalizes the output of the fully connected layers;

the to-be-detected video processing module is responsible for preprocessing the video to be detected, and the main flow is as follows:

1) image noise reduction by Wiener filtering;

2) optical flow extraction to generate optical flow images;

3) RGB image extraction.

2. A crowd abnormal behavior detection method based on a neural network, implemented with the neural-network-based crowd abnormal behavior detection device according to claim 1, characterized by comprising the following steps:

1) constructing and training a two-stream residual neural network;

2) processing the video images to be detected and then detecting them with the neural network;

said step 1) comprises the following sub-steps:

1-4) testing the trained neural network with a test data set; if the test result reaches the expected accuracy, the network is used to detect crowd abnormal behavior in the video to be detected; if it does not reach the expected accuracy, the network is retrained.

3. The crowd abnormal behavior detection method based on a neural network according to claim 2, wherein step 2) comprises the following sub-steps:

2-1) denoising the video image sequence to be detected, using a Wiener filter as the denoising algorithm;

2-3) inputting the video images into the trained two-stream residual neural network, which directly gives the detection result.

4. The crowd abnormal behavior detection method based on a neural network according to claim 3, wherein if the detection result is that abnormal behavior has occurred, a warning mark is issued, and if the detection result is that no abnormal behavior has occurred, the method returns to the video sequence to be detected and detection continues.