CN109753906B - Method for detecting abnormal behaviors in public places based on domain migration - Google Patents
Publication number: CN109753906B · Application: CN201811594841.4A · Legal status: Active
Abstract
The invention relates to a method for detecting abnormal behaviors in public places based on domain migration. Simulation in a virtual world is used to create a large number of virtual abnormal-event videos, addressing the problem that abnormal events are highly diverse yet real training data are scarce. A domain-migration method then transfers the virtual data to the real domain, which improves the adaptability of the classification network to real surveillance videos and effectively improves the usability of the trained network.
Description
Technical Field
The invention belongs to the fields of computer vision and video surveillance. It detects abnormal behaviors such as fighting and fleeing in surveillance videos of public places.
Background
Nowadays, cameras in public areas throughout cities generate countless surveillance videos around the clock. If abnormal behaviors in the collected videos could be detected automatically, surveillance would have a strong preventive effect against public-safety incidents. However, detecting abnormal events is very difficult, because abnormal behavior occurs far less frequently than normal behavior and is highly diverse.
At present, there are two main approaches to detecting abnormal behaviors in public places. The first is the social-force-model-based method proposed by R. Mehran et al. in "R. Mehran, A. Oyama, and M. Shah, Abnormal crowd behavior detection using social force model, Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pp. 935-942, 2009", which treats pedestrians as individual moving particles, models human-human interactions as forces between particles, and detects abnormal behavior in a video by finding abnormal particle movements.
The second approach is based on optical flow, such as the method proposed in "Y. Yu, W. Shen, H. Huang, and Z. Zhang, Abnormal event detection in crowded scenes using two sparse dictionaries with saliency prior, Journal of Electronic Imaging, vol. 26, no. 3, p. 033013, 2017", which combines multi-scale optical-flow histograms and multi-scale gradient histograms to obtain appearance and motion features of pedestrians, and adds abnormal features to the traditional sparse model (which contains only normal features) to construct a dictionary. The saliency of a test sample is then combined with its sparse reconstruction cost on the normal and abnormal dictionaries to measure how normal the sample is.
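As an illustration of the sparse-dictionary idea (a sketch of the general technique, not the cited authors' exact pipeline), the normality of a test feature can be scored by its reconstruction cost on a dictionary; the dictionary and test vector below are random placeholders:

```python
import numpy as np
from sklearn.linear_model import orthogonal_mp

rng = np.random.default_rng(0)

# Hypothetical dictionary: columns are atoms learned from normal features.
D = rng.standard_normal((64, 128))
D /= np.linalg.norm(D, axis=0)  # unit-norm atoms

x = rng.standard_normal(64)  # a test feature vector (placeholder)

# Sparse code with at most 10 non-zero coefficients (orthogonal matching pursuit).
coef = orthogonal_mp(D, x, n_nonzero_coefs=10)

# Sparse reconstruction cost: a high cost suggests the sample is not well
# explained by the normal dictionary, i.e. it is potentially abnormal.
cost = np.linalg.norm(x - D @ coef)
```

In a real system the threshold on `cost` (and the saliency weighting the paper describes) would be tuned on held-out data.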
These methods have limitations: the particle model cannot capture the motion characteristics of individual persons, and an optical-flow-based feature dictionary cannot guarantee that all abnormal behaviors are represented in the dictionary.
Disclosure of Invention
Technical problem to be solved
In order to avoid the defects of the prior art, the invention provides a method for detecting abnormal behaviors in public places based on domain migration.
Technical scheme
A method for detecting abnormal behaviors in public places based on domain migration is characterized by comprising the following steps:
Step 1: generate virtual abnormal data using existing virtual imagery products; the virtual abnormal data comprise different abnormal categories and a normal category, with the same amount of data in each category;
Step 2: train a video classification network with the virtual abnormal data generated in step 1 to obtain a virtual abnormal data classification network;
Step 3: train a domain migration network with the generated virtual abnormal data and the acquired real data to obtain real-domain video data corresponding to the virtual abnormal video data; the domain migration network is a modified cycle-GAN, the modification being that all 2D convolution structures in the cycle-GAN network are changed into 3D convolution structures oriented to video data, the 3D convolution structure being computed as

v_{ij}^{xyz} = b_{ij} + \sum_{m} \sum_{p=0}^{P-1} \sum_{q=0}^{Q-1} \sum_{r=0}^{R-1} w_{ijm}^{pqr} \, v_{(i-1)m}^{(x+p)(y+q)(z+r)}

where P, Q, R denote the length, width, and height of the 3D convolution kernel applied to the feature maps output by the previous layer, and m indexes those feature maps; finally, under the convolution module W, the corresponding feature map V in the next layer is computed, where b is the offset, subscripts i and j denote the j-th 3D convolution structure of the i-th layer, and x, y, z are the length, width, and height coordinates;
Step 4: further train, for classification, the virtual abnormal data classification network obtained in step 2 using the real-domain abnormal data obtained in step 3, the training process being the same as in step 2, thereby obtaining a real-domain abnormal video classification network;
Step 5: input the real abnormal data to be tested into the network model trained in step 4, obtain the probability of the input video for each abnormal category using a softmax function, and take the category with the maximum value as the abnormal type of the video.
The video classification network in step 2 is 3DResNet or a spatio-temporal two-stream video classification network.
Advantageous effects
The proposed method for detecting abnormal behaviors in public places based on domain migration creates a large number of virtual abnormal-event videos through simulation in a virtual world, addressing the problem that abnormal events are highly diverse yet real data are scarce. By migrating the virtual data to the real domain with the domain-migration method, the adaptability of the classification network to real surveillance videos is improved, and the usability of the trained network is effectively improved.
Drawings
FIG. 1 is a model, data flow diagram of the present invention;
fig. 2 is a data flow diagram of a domain migration network.
Detailed Description
The invention will now be further described with reference to the following examples and drawings:
the invention provides a public scene abnormal behavior detection method based on domain migration, which aims to solve the difficulty of abnormal behavior detection caused by the phenomena of abnormal behavior diversity, low frequency and the like. The whole technical scheme comprises the following steps:
1. the existing virtual image products such as games, CG and the like are used for creating virtual scenes, tasks, models and actions related to the abnormity, and recording abnormal behaviors in the virtual world.
2. After recording a large amount of virtual video data, use these data to train a video classification deep neural network that can effectively distinguish abnormal behavior categories (such as fighting and fleeing) from normal situations in the virtual data set.
3. Collect some real-world surveillance videos; these videos need not contain abnormal events. Using the mutual conversion between these videos and the existing virtual videos, learn a domain migration network that performs unsupervised video domain migration, transferring the virtual videos into a lifelike real video domain that closely resembles real scenes, thereby obtaining a large number of surveillance-style videos containing abnormal behaviors.
4. Train the classification network obtained in step 2 again with the migrated videos as the data set, improving its adaptability across domains, i.e., in the real data domain, and raising the detection capability of the network when applied to real video surveillance.
5. In actual use, surveillance video of a fixed duration is fed into the trained neural network in real time; the classification probabilities of the captured short video for each abnormal category and the normal case are obtained, and the category with the highest probability is taken as the category of the video. Whether the detected category is abnormal or normal determines whether abnormal behavior has occurred under surveillance.
The invention has the following concrete implementation steps:
step 1, first, an unsupervised domain migration network of the type "j.zhu, t.park, p.isola, and a.a.efros, unapplied image-to-image transformation using cycle-dependent adaptive networks, arXiv print,2017. In contrast, it should be modified somewhat so that it can process data of the video domain (cycle-GAN can only process images). The modified method is to change all 2D convolution structures in the cycle-GAN network into 3D convolution structures facing the video data. The calculation method of the 3D convolution structure comprises the following steps:
v_{ij}^{xyz} = b_{ij} + \sum_{m} \sum_{p=0}^{P-1} \sum_{q=0}^{Q-1} \sum_{r=0}^{R-1} w_{ijm}^{pqr} \, v_{(i-1)m}^{(x+p)(y+q)(z+r)}

where P, Q, R respectively denote the length, width, and height of the 3D convolution kernel applied to the feature maps output by the previous layer, and m indexes those feature maps. Finally, under the convolution module W, the corresponding feature map V in the next layer is obtained. Meanwhile, related abnormal-event video data are simulated and recorded in the virtual world, represented as rounded blocks in FIG. 1, i.e., the virtual abnormal video data. These data cover different abnormal categories such as fighting, chasing, fleeing, gunshots, running, and arrests, as well as a normal category, with approximately the same total duration of data per category. Finally, a portion of real video surveillance data is needed to represent what surveillance video looks like in real scenes; these data need not be labeled, and the video content is not restricted.
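The 3D-convolution computation above can be transcribed directly in NumPy for one output feature map (a minimal illustrative sketch; a practical implementation would use `nn.Conv3d` in the PyTorch environment the patent describes):

```python
import numpy as np

def conv3d_feature_map(v_prev, w, b):
    """Valid 3D convolution producing one output feature map.

    v_prev: (M, X, Y, Z) -- the M feature maps output by the previous layer
    w:      (M, P, Q, R) -- one 3D kernel of the module W (P, Q, R are its
                            length, width and height)
    b:      scalar offset
    """
    M, P, Q, R = w.shape
    _, X, Y, Z = v_prev.shape
    out = np.zeros((X - P + 1, Y - Q + 1, Z - R + 1))
    for x in range(out.shape[0]):
        for y in range(out.shape[1]):
            for z in range(out.shape[2]):
                # v^{xyz} = b + sum_m sum_{p,q,r} w_m^{pqr} * v_prev_m^{(x+p)(y+q)(z+r)}
                out[x, y, z] = b + np.sum(w * v_prev[:, x:x + P, y:y + Q, z:z + R])
    return out

# Tiny check: all-ones input and kernel -> every output equals b + M*P*Q*R.
V = np.ones((2, 4, 4, 4))
W = np.ones((2, 3, 3, 3))
out = conv3d_feature_map(V, W, b=1.0)
print(out.shape)     # (2, 2, 2)
print(out[0, 0, 0])  # 55.0  (1 + 2*3*3*3)
```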
Step 2: initialize a video classification network, which can be 3DResNet, a spatio-temporal two-stream video classification network, or another existing video classification network. Here we use the existing 3DResNet from "K. Hara, H. Kataoka, and Y. Satoh, Learning spatio-temporal features with 3D residual networks for action recognition, Proceedings of the ICCV Workshop on Action, Gesture, and Emotion Recognition, vol. 2, no. 3, p. 4, 2017". This network is an improved version of ResNet, the network structure proposed in 2015, improved in the same way as set forth in step 1, i.e., by changing the 2D convolution structures to 3D convolution structures.
Step 3: train the domain migration network with the collected virtual abnormal data and arbitrary real data to obtain real-domain video data corresponding to the virtual abnormal video data. As shown in FIG. 2, let S_real and R_real denote the collected virtual abnormal data and arbitrary real data, respectively. They are fed into the generator networks G_StoR and G_RtoS to obtain R_fake and S_fake, which are then fed into G_RtoS and G_StoR, respectively, to recover the videos corresponding to S_real and R_real. Consistency comparison and the discriminators D_R and D_S improve the fidelity of the videos after domain migration.
The whole process can be represented by the cycle-GAN objective

L(G_StoR, G_RtoS, D_R, D_S) = L_GAN(G_StoR, D_R, S, R) + L_GAN(G_RtoS, D_S, R, S) + λ L_cyc(G_StoR, G_RtoS)

That is, while training the generators we minimize the adversarial terms against the discriminators and maximize cycle consistency (i.e., minimize L_cyc); while training the discriminators we maximize their adversarial terms. The R_fake obtained at the end can be regarded as the real-domain video data corresponding to the virtual abnormal videos in FIG. 1.
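The loss terms in this objective can be sketched as follows, using the least-squares GAN formulation that cycle-GAN adopts (NumPy stand-ins for videos and discriminator outputs; all names and values here are illustrative placeholders, not the patent's exact implementation):

```python
import numpy as np

def lsgan_g_loss(d_fake):
    """Generator's adversarial term: push D's score on fakes toward 1."""
    return np.mean((d_fake - 1.0) ** 2)

def lsgan_d_loss(d_real, d_fake):
    """Discriminator's term: score real samples toward 1, fakes toward 0."""
    return np.mean((d_real - 1.0) ** 2) + np.mean(d_fake ** 2)

def cycle_consistency(s_real, s_cyc, r_real, r_cyc, lam=10.0):
    """L1 cycle loss: G_RtoS(G_StoR(S)) should reproduce S, and vice versa."""
    return lam * (np.mean(np.abs(s_real - s_cyc)) + np.mean(np.abs(r_real - r_cyc)))

# Placeholder tensors standing in for video batches / discriminator score maps.
s = np.zeros((2, 3, 8, 32, 32))
r = np.zeros_like(s)
g_total = lsgan_g_loss(d_fake=np.full((2, 4, 4), 0.5)) \
        + cycle_consistency(s, s + 0.1, r, r + 0.1)
print(round(g_total, 4))  # 0.25 adversarial + 10 * (0.1 + 0.1) cycle = 2.25
```

In training, `g_total` would be minimized over the generator parameters while `lsgan_d_loss` is minimized over each discriminator's parameters, alternating between the two.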
Step 4: further train, for classification, the network obtained in step 2 using the real-domain abnormal data obtained in step 3; the process is the same as in step 2, yielding the real-domain abnormal video classification network.
Step 5: in the actual test process, input the real abnormal data into the network model trained in step 4, obtain the probability of the input video for each abnormal category using a softmax function, and take the category with the maximum value as the abnormal type of the video.
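Step 5 amounts to a softmax over the network's output logits followed by an argmax; a minimal sketch, with hypothetical category names and logit values:

```python
import numpy as np

def softmax(logits):
    e = np.exp(logits - np.max(logits))  # subtract max for numerical stability
    return e / e.sum()

# Hypothetical abnormal/normal categories and network outputs for one clip.
classes = ["normal", "fight", "chase", "flee", "gunshot"]
logits = np.array([0.2, 2.1, 0.5, -0.3, 0.1])

probs = softmax(logits)
pred = classes[int(np.argmax(probs))]
print(pred)  # fight
```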
The effects of the present invention can be further explained by the following simulation experiments.
1. Simulation conditions
The experiments were carried out on four GeForce GTX 1080 Ti GPUs, using Python 3.5.4 on a 64-bit Ubuntu 16.04 LTS system, with PyTorch 0.4.1 and CUDA 9.2 as the software environment.
2. Simulation content
First, the virtual video data set obtained by simulation and video data taken from several existing video data sets are used for training according to FIG. 1, finally yielding the real-domain abnormal video classification network. The results of our model with and without domain-migration training data are compared with those of the 3DResNet from "K. Hara, H. Kataoka, and Y. Satoh, Learning spatio-temporal features with 3D residual networks for action recognition, Proceedings of the ICCV Workshop on Action, Gesture, and Emotion Recognition, vol. 2, no. 3, p. 4, 2017" with and without domain-migration training data. Two criteria are used: the classification accuracy of the videos and the misclassification severity (MISE). The latter ranks the abnormality categories by severity and then computes the severity of misclassifications. The results are as follows:
Table 1: test results of four models on a real data set
Accuracy (%) | 3D ResNet | The invention
---|---|---
Before domain migration | 19.51 | 17.07
After domain migration | 21.14 | 26.02
As can be seen from Table 1, the classification accuracy of the network of the present invention on the real data set improves significantly after domain migration. The proposed domain-migration technique also improves the performance of 3DResNet to some extent, so it yields higher prediction accuracy for abnormal-behavior detection in public places.
Table 2: misclassification severity of four models on a real dataset
MISE | 3D ResNet | The invention
---|---|---
Before domain migration | 3.48 | 3.45
After domain migration | 3.45 | 2.74
As shown in Table 2, our method also achieves the lowest misclassification severity, which confirms that the present invention yields less severe misclassifications when detecting abnormal behaviors in public places.
Claims (2)
1. A method for detecting abnormal behaviors in public places based on domain migration is characterized by comprising the following steps:
Step 1: generating virtual abnormal data by using existing virtual imagery products, wherein the virtual abnormal data comprise different abnormal categories and a normal category, with the same amount of data in each category;
Step 2: training a video classification network with the virtual abnormal data generated in step 1 to obtain a virtual abnormal data classification network;
Step 3: training a domain migration network with the generated virtual abnormal data and the acquired real data to obtain real-domain video data corresponding to the virtual abnormal video data, wherein the domain migration network is a modified cycle-GAN, the modification being that all 2D convolution structures in the cycle-GAN network are changed into 3D convolution structures oriented to video data, the 3D convolution structure being computed as

v_{ij}^{xyz} = b_{ij} + \sum_{m} \sum_{p=0}^{P-1} \sum_{q=0}^{Q-1} \sum_{r=0}^{R-1} w_{ijm}^{pqr} \, v_{(i-1)m}^{(x+p)(y+q)(z+r)}

where P, Q, R respectively denote the length, width, and height of the 3D convolution kernel applied to the feature maps output by the previous layer, and m denotes the number of feature maps output by the previous layer; finally, under the convolution module W, the corresponding feature map V in the next layer is computed, where b is the offset, i and j denote the j-th 3D convolution structure of the i-th layer, and x, y, z are the length, width, and height coordinates;
Step 4: further training, for classification, the virtual abnormal data classification network obtained in step 2 using the real-domain abnormal data obtained in step 3, the training process being the same as in step 2, thereby obtaining a real-domain abnormal video classification network;
Step 5: inputting the real abnormal data to be tested into the network model trained in step 4, obtaining the probability of the input video for each abnormal category using a softmax function, and taking the category with the maximum value as the abnormal type of the video.
2. The method according to claim 1, wherein the video classification network in step 2 is 3DResNet or a spatio-temporal two-stream video classification network.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811594841.4A CN109753906B (en) | 2018-12-25 | 2018-12-25 | Method for detecting abnormal behaviors in public places based on domain migration |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109753906A CN109753906A (en) | 2019-05-14 |
CN109753906B true CN109753906B (en) | 2022-06-07 |
Family
ID=66403930
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110490078B (en) * | 2019-07-18 | 2024-05-03 | 平安科技(深圳)有限公司 | Monitoring video processing method, device, computer equipment and storage medium |
CN111027594B (en) * | 2019-11-18 | 2022-08-12 | 西北工业大学 | Step-by-step anomaly detection method based on dictionary representation |
CN111401149B (en) * | 2020-02-27 | 2022-05-13 | 西北工业大学 | Lightweight video behavior identification method based on long-short-term time domain modeling algorithm |
CN111666852A (en) * | 2020-05-28 | 2020-09-15 | 天津大学 | Micro-expression double-flow network identification method based on convolutional neural network |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107437083A (en) * | 2017-08-16 | 2017-12-05 | Shanghai Hefu Artificial Intelligence Technology (Group) Co., Ltd. | Video behavior recognition method with adaptive pooling |
CN107563431A (en) * | 2017-08-28 | 2018-01-09 | Southwest Jiaotong University | Image anomaly detection method combining CNN transfer learning and SVDD |
CN108140075A (en) * | 2015-07-27 | 2018-06-08 | Pivotal Software, Inc. | Classifying user behavior as anomalous |
CN108334832A (en) * | 2018-01-26 | 2018-07-27 | Shenzhen Weiteshi Technology Co., Ltd. | Gaze estimation method based on generative adversarial networks |
CN108446667A (en) * | 2018-04-04 | 2018-08-24 | Beihang University | Facial expression recognition method and device based on generative-adversarial-network data augmentation |
CN108664922A (en) * | 2018-05-10 | 2018-10-16 | Donghua University | Infrared-video human behavior recognition method for personal safety |
CN108805978A (en) * | 2018-06-12 | 2018-11-13 | Jiangxi Normal University | Automatic 3D model generation device and method based on deep learning |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9613277B2 (en) * | 2013-08-26 | 2017-04-04 | International Business Machines Corporation | Role-based tracking and surveillance |
CN108345869B (en) * | 2018-03-09 | 2022-04-08 | 南京理工大学 | Driver posture recognition method based on depth image and virtual data |
Non-Patent Citations (5)
Title |
---|
Action recognition using spatial-optical data organization and sequential learning framework; Yuan Yuan et al.; Neurocomputing; 2018-07-17; vol. 315; 221-233 *
Learning Spatio-Temporal Features with 3D Residual Networks for Action Recognition; Kensho Hara et al.; 2017 IEEE International Conference on Computer Vision Workshops; 2017-12-31; 3154-3160 *
Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks; Jun-Yan Zhu et al.; 2017 IEEE International Conference on Computer Vision; 2017-12-31; 2242-2251 *
Abnormal behavior detection of small and medium crowds based on intelligent surveillance; He Chuanyang et al.; Journal of Computer Applications; 2016-06-10; vol. 36, no. 6; 1724-1729 *
Human abnormal behavior recognition in video surveillance; Zhao Renfeng; Journal of Suzhou University; Nov. 2018; vol. 33, no. 11; 111-115 *
Legal Events
Date | Code | Title | Description
---|---|---|---
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |