CN110705376A - Abnormal behavior detection method based on generative adversarial network - Google Patents

Abnormal behavior detection method based on generative adversarial network

Info

Publication number
CN110705376A
Authority
CN
China
Prior art keywords: frame, network, frames, training, predicted
Prior art date: 2019-09-11
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910860850.1A
Other languages
Chinese (zh)
Inventor
卢博文
郭文波
朱松豪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.): 2019-09-11
Filing date: 2019-09-11
Publication date: 2020-01-17
Application filed by Nanjing University of Posts and Telecommunications
Priority to CN201910860850.1A
Publication of CN110705376A
Legal status: Pending (current)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06V20/42 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items of sport video content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting

Abstract

The invention discloses an abnormal behavior detection method based on a generative adversarial network (GAN). The method comprises the following steps: first, a U-Net network is built and treated as the generator module of the GAN, where not only appearance (spatial) constraints but also motion (temporal) constraints are used. Next, a patch discriminator is employed as the discriminator module of the GAN. The generator and discriminator are then trained adversarially in alternation until the discriminator cannot distinguish generated frames from real frames. Finally, abnormal behavior detection experiments are carried out with the trained GAN model. Experimental results on three publicly available anomaly detection datasets show that the proposed method effectively improves the accuracy of abnormal behavior detection.

Description

Abnormal behavior detection method based on generative adversarial network
Technical Field
The invention relates to the technical field of video analysis, and in particular to an abnormal behavior detection method based on a generative adversarial network, i.e. a method for detecting abnormal behaviors in surveillance video scenes.
Background
Traditional video surveillance relies mainly on humans watching for abnormal behavior in a scene, which not only incurs a very high labor cost but also induces visual fatigue, so that some abnormal behaviors may not be noticed in time. Abnormal behavior detection and analysis aims to automatically detect abnormal behavior in surveillance scenes through algorithms such as video signal processing and machine learning, thereby helping people take corresponding measures in time. Developing abnormal behavior detection algorithms for surveillance video is therefore of great significance and value.
Anomaly detection in video refers to identifying events that do not conform to expected behavior, an important task in video surveillance. However, since abnormal events are uncontrolled in practical applications, it is almost impossible to collect and cover every kind of abnormal event with a classification method, which makes the anomaly detection task very challenging.
Existing methods can be roughly divided into two categories according to the features they use:
i) Methods based on hand-crafted features. These methods typically represent each video with manually designed appearance and motion features. On this basis, a dictionary is learned that reconstructs normal events with small reconstruction error, so that abnormal events correspond to larger reconstruction errors. However, since the dictionary is not trained on abnormal events and is often over-complete, the expected result cannot be guaranteed.
ii) Methods based on deep learning. These methods typically learn a deep neural network, e.g. an auto-encoder, and use it to reconstruct normal events with relatively small error compared with abnormal events. However, deep neural networks have a certain tolerance to unseen inputs, and a large reconstruction error does not necessarily occur for an abnormal event. It follows that almost all methods based on reconstructing the training data cannot guarantee the detection of abnormal events.
Interestingly, although anomalies are defined as unexpected events, much of the existing work in computer vision addresses the problem within the framework of reconstructing the training data. Video frame prediction used to be far from satisfactory, but in recent years, with the advent of generative adversarial networks (GANs), its performance has improved dramatically. In this context, instead of reconstructing training data for anomaly detection, we propose to identify an abnormal event by comparing it with its prediction, and introduce an anomaly detection method based on video frame prediction. Specifically, given a video segment, we predict future frames from its historical observations. We first train a predictor (generator) that predicts future frames of normal training data well. In the test phase, if a frame agrees with its prediction, it likely corresponds to a normal event; otherwise, it likely corresponds to an abnormal event. A good predictor is therefore the key to the present technique.
Disclosure of Invention
The invention aims to overcome the shortcomings of existing abnormal behavior detection methods in surveillance video scenes and provides an abnormal behavior detection method based on a generative adversarial network. The invention constructs a generative adversarial network with U-Net as the generator and a patch discriminator as the discriminator, and carries out abnormal behavior detection experiments with the trained GAN model. Compared with current state-of-the-art methods, the proposed method further improves the accuracy of abnormal behavior detection and is better suited to practical applications.
The abnormal behavior detection method based on a generative adversarial network according to the invention comprises the following steps:
S1: select a video to be detected consisting of $t$ consecutive frames $I_1, I_2, \ldots, I_t$;
S2: input the video to be detected into the trained generative adversarial network to obtain a predicted frame $\hat{I}_{t+1}$;
S3: compare the predicted frame $\hat{I}_{t+1}$ with its ground truth $I_{t+1}$ and judge whether it is normal.
The generative adversarial network comprises a generator network $\mathcal{G}$ and a discriminator network $\mathcal{D}$.
Further, the generative adversarial network is trained as follows:
S2-1: acquire the data of each frame of the video to be detected, and preprocess the data of each frame;
S2-2: build a U-Net-based prediction network, treat it as the generator network $\mathcal{G}$, and employ a patch discriminator as the discriminator network $\mathcal{D}$;
S2-3: introduce appearance constraints, and alternately optimize the generator network $\mathcal{G}$ and the discriminator network $\mathcal{D}$ by stochastic gradient descent, minimizing the intensity and gradient differences between the predicted frame and the ground truth;
S2-4: introduce a motion constraint, penalizing the difference between the optical flow of the predicted frame and that of the ground truth;
S2-5: alternately perform adversarial training of the generator network $\mathcal{G}$ and the discriminator network $\mathcal{D}$, training $\mathcal{G}$ with the combined objective function and $\mathcal{D}$ with its adversarial loss function;
S2-6: input the preprocessed data into the trained generative adversarial network, calculate the regular score of each frame, and judge whether the frame is normal according to a preset threshold.
Further, in step S2-1, the data of each frame is preprocessed by normalizing the pixel intensities of all frames to [ -1,1], and the size of each frame is set to 256 × 256.
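By way of illustration, the preprocessing of step S2-1 might be sketched as follows; OpenCV and NumPy are assumed here for frame handling, since the patent does not name any library:

```python
# A minimal preprocessing sketch for step S2-1 (assumed libraries: OpenCV, NumPy):
# resize each frame to 256 x 256 and scale pixel intensities to [-1, 1].
import cv2
import numpy as np

def preprocess_frame(frame):
    frame = cv2.resize(frame, (256, 256))            # fixed spatial size
    return frame.astype(np.float32) / 127.5 - 1.0    # [0, 255] -> [-1, 1]
```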
Further, in step S2-2, when the U-Net-based prediction network is built, shortcuts are added between the high-resolution layers and the low-resolution layers.
The depth of a network strongly influences the performance of a convolutional neural network, but simply increasing the depth does not by itself improve the network and may even hurt the model through gradient divergence. One trick to address this problem is to introduce shortcuts (skip connections), i.e. connections between two non-adjacent layers.
Further, in step S2-3, the $L_2$ distance between the predicted frame $\hat{I}$ and the ground truth $I$ is minimized in intensity space, as in equation (1):

$$L_{int}(\hat{I}, I) = \|\hat{I} - I\|_2^2 \quad (1)$$

The gradient loss function is as in equation (2):

$$L_{gd}(\hat{I}, I) = \sum_{i,j} \left( \left| |\hat{I}_{i,j} - \hat{I}_{i-1,j}| - |I_{i,j} - I_{i-1,j}| \right| + \left| |\hat{I}_{i,j} - \hat{I}_{i,j-1}| - |I_{i,j} - I_{i,j-1}| \right| \right) \quad (2)$$

where $i, j$ denote the spatial index of the video frame, $|\hat{I}_{i,j} - \hat{I}_{i-1,j}|$ represents the gradient of the predicted frame along the horizontal axis, $|I_{i,j} - I_{i-1,j}|$ the gradient of the real frame along the horizontal axis, $|\hat{I}_{i,j} - \hat{I}_{i,j-1}|$ the gradient of the predicted frame along the vertical axis, and $|I_{i,j} - I_{i,j-1}|$ the gradient of the real frame along the vertical axis.
Further, in step S2-4, the difference between the optical flow of the predicted frame and that of the ground truth is expressed by the optical flow loss function of equation (3):

$$L_{op}(\hat{I}_{t+1}, I_{t+1}, I_t) = \|f(\hat{I}_{t+1}, I_t) - f(I_{t+1}, I_t)\|_1 \quad (3)$$

where $\hat{I}_{t+1}$ denotes the predicted $(t+1)$-th frame, $I_{t+1}$ the real $(t+1)$-th frame, and $I_t$ the $t$-th frame. $I_1, I_2, \ldots, I_t$ denote the $t$ consecutive frames of the video to be detected; they are input into the generator $\mathcal{G}$, whose function $\hat{I}_{t+1} = \mathcal{G}(I_1, I_2, \ldots, I_t)$ generates the predicted $(t+1)$-th frame $\hat{I}_{t+1}$. In the implementation, the flow estimator $f$ is pre-trained on a synthetic dataset. During detection, the predicted frame $\hat{I}_{t+1}$ is compared with the real $(t+1)$-th frame $I_{t+1}$ to detect anomalies.
Further, in step S2-5:
Training $\mathcal{D}$ aims to classify $I_{t+1}$ into class 1 and $\hat{I}_{t+1}$ into class 0, where 0 and 1 represent the fake and real labels, respectively. When training $\mathcal{D}$, $\mathcal{G}$ is fixed and a mean squared error (MSE) loss function is applied:

$$L^{\mathcal{D}}_{adv}(\hat{I}, I) = \sum_{i,j} \tfrac{1}{2} L_{MSE}(\mathcal{D}(I)_{i,j}, 1) + \tfrac{1}{2} L_{MSE}(\mathcal{D}(\hat{I})_{i,j}, 0)$$

where $i, j$ is the spatial patch index and $L_{MSE}$ is the MSE function, defined as:

$$L_{MSE}(\hat{Y}, Y) = (\hat{Y} - Y)^2$$

where $Y$ takes values in $\{0, 1\}$ and $\hat{Y}$ takes values in $[0, 1]$.
When training $\mathcal{D}$, the following loss function is used:

$$L_{\mathcal{D}} = L^{\mathcal{D}}_{adv}(\hat{I}_{t+1}, I_{t+1})$$

Training $\mathcal{G}$ aims to generate frames that $\mathcal{D}$ classifies into class 1. When training $\mathcal{G}$, $\mathcal{D}$ is fixed, and again an MSE function is applied:

$$L^{\mathcal{G}}_{adv}(\hat{I}) = \sum_{i,j} \tfrac{1}{2} L_{MSE}(\mathcal{D}(\hat{I})_{i,j}, 1)$$

When training $\mathcal{G}$, all the constraints on appearance, motion and adversarial training are combined into the objective function, yielding:

$$L_{\mathcal{G}} = \lambda_{int} L_{int}(\hat{I}_{t+1}, I_{t+1}) + \lambda_{gd} L_{gd}(\hat{I}_{t+1}, I_{t+1}) + \lambda_{op} L_{op}(\hat{I}_{t+1}, I_{t+1}, I_t) + \lambda_{adv} L^{\mathcal{G}}_{adv}(\hat{I}_{t+1})$$
In step S2-6, the peak signal-to-noise ratio (PSNR) of each frame of each test video is calculated as:

$$PSNR(I, \hat{I}) = 10 \log_{10} \frac{[\max_{\hat{I}}]^2}{\frac{1}{N} \sum_{i=1}^{N} (I_i - \hat{I}_i)^2}$$

The PSNRs of all frames in each test video are normalized to the range $[0, 1]$, and the regular score of each frame is calculated using:

$$s(t) = \frac{PSNR_t - \min_t PSNR_t}{\max_t PSNR_t - \min_t PSNR_t}$$

A threshold may be set to distinguish regular from irregular frames, predicting from the score $s(t)$ of a frame whether it is normal or abnormal. If the difference between the regular score of the ground truth and that of the predicted frame is greater than the discrimination threshold, the video is judged abnormal.
The invention has the following beneficial effects: the method overcomes the shortcomings of existing abnormal behavior detection methods in surveillance video scenes, constructs a generative adversarial network with U-Net as the generator and a patch discriminator as the discriminator, and carries out abnormal behavior detection experiments with the trained GAN model. Compared with current state-of-the-art methods, it further improves the accuracy of abnormal behavior detection and is better suited to practical applications.
Drawings
In order that the present invention may be more readily and clearly understood, a more particular description of the invention follows, in terms of specific embodiments and with reference to the accompanying drawings.
FIG. 1 is a flow chart of the abnormal behavior detection method according to the present invention;
FIG. 2 is a block diagram of the generator used in the present invention;
FIG. 3 shows optical flow maps generated with/without the motion constraint;
FIG. 4 is a schematic diagram of the anomaly detection results of the video prediction network and the auto-encoder network;
FIG. 5 is an evaluation of abnormal behavior detection performance on three publicly available datasets based on the AUC metric.
Detailed Description
Mathematically, given a video with $t$ consecutive frames $I_1, I_2, \ldots, I_t$, we stack all these frames in order and use them to predict the future frame $I_{t+1}$, denoting our prediction by $\hat{I}_{t+1}$. To make $\hat{I}_{t+1}$ close to $I_{t+1}$, we minimize the distance between them in terms of intensity and gradient. To preserve temporal coherence between adjacent frames, we enforce the optical flow between $\hat{I}_{t+1}$ and $I_t$ to be close to the optical flow between $I_{t+1}$ and $I_t$. Finally, the difference between a future frame and its prediction determines whether that frame is normal or abnormal. The structure of our network framework is shown in FIG. 1. Next, we describe all the components of the framework in detail.
1. Building a generator network
Frame or image generation networks commonly used in existing work generally comprise two modules: i) an encoder that extracts features by gradually reducing the spatial resolution; and ii) a decoder that gradually recovers the frame by increasing the spatial resolution. However, this scheme suffers from gradient vanishing and from information imbalance between layers. To avoid this, U-Net adds shortcut paths between the high-resolution and low-resolution layers, which suppresses gradient vanishing and keeps information symmetric across the network. In the implementation, U-Net is slightly modified for future frame prediction: the output resolution is kept constant for every two convolutional layers, so crop and resize operations are no longer needed when adding shortcuts. Details of this network are shown in FIG. 2. All convolution and deconvolution kernels are of size 3 × 3, and the max pooling layers are of size 2 × 2.
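As an illustration only, a minimal PyTorch sketch of such a modified U-Net generator is given below. The channel widths, the number of resolution levels, the choice of t = 4 input frames (12 stacked RGB channels) and the final tanh are assumptions, not specifications from the patent:

```python
import torch
import torch.nn as nn

def double_conv(c_in, c_out):
    # Two 3x3 convolutions with padding=1, so the spatial resolution stays
    # constant for every two convolutional layers and shortcuts need no
    # crop/resize when concatenated.
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(c_out, c_out, 3, padding=1), nn.ReLU(inplace=True),
    )

class UNetGenerator(nn.Module):
    def __init__(self, in_ch=12, out_ch=3):
        super().__init__()
        self.enc1 = double_conv(in_ch, 64)
        self.enc2 = double_conv(64, 128)
        self.enc3 = double_conv(128, 256)
        self.pool = nn.MaxPool2d(2)                       # 2x2 max pooling
        self.up2 = nn.ConvTranspose2d(256, 128, 2, stride=2)
        self.dec2 = double_conv(256, 128)                 # 128 upsampled + 128 shortcut
        self.up1 = nn.ConvTranspose2d(128, 64, 2, stride=2)
        self.dec1 = double_conv(128, 64)                  # 64 upsampled + 64 shortcut
        self.out = nn.Conv2d(64, out_ch, 3, padding=1)

    def forward(self, x):                                 # x: (B, 12, 256, 256)
        e1 = self.enc1(x)
        e2 = self.enc2(self.pool(e1))
        e3 = self.enc3(self.pool(e2))
        d2 = self.dec2(torch.cat([self.up2(e3), e2], dim=1))  # shortcut from e2
        d1 = self.dec1(torch.cat([self.up1(d2), e1], dim=1))  # shortcut from e1
        return torch.tanh(self.out(d1))                   # predicted frame in [-1, 1]
```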
2. Appearance constraint
To make the prediction close to its ground truth, intensity and gradient differences are used, following prior work. The intensity penalty ensures the similarity of all pixels in RGB space, while the gradient penalty sharpens the generated image. Specifically, in intensity space, the $L_2$ distance between the predicted frame $\hat{I}$ and its ground truth $I$ is minimized:

$$L_{int}(\hat{I}, I) = \|\hat{I} - I\|_2^2$$

Further, the gradient loss defined in previous work is adopted:

$$L_{gd}(\hat{I}, I) = \sum_{i,j} \left( \left| |\hat{I}_{i,j} - \hat{I}_{i-1,j}| - |I_{i,j} - I_{i-1,j}| \right| + \left| |\hat{I}_{i,j} - \hat{I}_{i,j-1}| - |I_{i,j} - I_{i,j-1}| \right| \right)$$

where $i, j$ denote the spatial index of a video frame.
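A hedged PyTorch sketch of these two appearance losses follows; averaging over pixels instead of summing is an assumption made for numerical convenience:

```python
import torch

def intensity_loss(pred, gt):
    # L2 distance in intensity space between the predicted and real frame.
    return torch.mean((pred - gt) ** 2)

def gradient_loss(pred, gt):
    # Absolute difference between the image gradients of the prediction and
    # of the ground truth, along both spatial axes.
    def grads(x):                                   # x: (B, C, H, W)
        gx = torch.abs(x[:, :, 1:, :] - x[:, :, :-1, :])
        gy = torch.abs(x[:, :, :, 1:] - x[:, :, :, :-1])
        return gx, gy
    pgx, pgy = grads(pred)
    tgx, tgy = grads(gt)
    return torch.mean(torch.abs(pgx - tgx)) + torch.mean(torch.abs(pgy - tgy))
```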
3. Motion constraint
Previous work considers only the intensity and gradient differences when generating future frames, which cannot guarantee that the predicted frames have correct motion: even if the pixel intensities of a predicted frame deviate only slightly, corresponding to a small prediction error in gradient and intensity, the resulting optical flow (a good motion estimator) may be completely different.
Therefore, it is necessary to ensure the correctness of the motion prediction; especially in anomaly detection, coherence of motion is an important factor in evaluating normal events. The present invention therefore introduces a temporal penalty, defined as the difference between the optical flow of the predicted frame and that of the ground truth. However, optical flow is not easy to compute. Recently, CNN-based optical flow estimation methods have appeared; the invention adopts such a network to estimate optical flow, with $f$ denoting the flow estimation function, so the optical flow loss can be expressed as:
$$L_{op}(\hat{I}_{t+1}, I_{t+1}, I_t) = \|f(\hat{I}_{t+1}, I_t) - f(I_{t+1}, I_t)\|_1$$

where $\hat{I}_{t+1}$ denotes the predicted $(t+1)$-th frame, $I_{t+1}$ the real $(t+1)$-th frame, and $I_t$ the $t$-th frame. $I_1, I_2, \ldots, I_t$ denote the $t$ consecutive frames of the video to be detected; they are input into the generator $\mathcal{G}$, whose function $\hat{I}_{t+1} = \mathcal{G}(I_1, I_2, \ldots, I_t)$ generates the predicted $(t+1)$-th frame $\hat{I}_{t+1}$. In the implementation, the flow estimator $f$ is pre-trained on a synthetic dataset. During detection, the predicted frame $\hat{I}_{t+1}$ is compared with the real $(t+1)$-th frame $I_{t+1}$ to detect anomalies.
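A sketch of this motion constraint is given below. The flow estimator f is assumed to be a pre-trained, frozen CNN (a FlowNet-style model) that maps a pair of frames to a flow field; `flow_net` is a placeholder for it, not an API from the patent:

```python
import torch

def flow_loss(flow_net, pred_next, gt_next, cur):
    # Compare the optical flow implied by the predicted frame with the
    # optical flow of the real frame, both relative to the current frame.
    with torch.no_grad():
        flow_true = flow_net(gt_next, cur)      # f(I_{t+1}, I_t), no gradients needed
    flow_pred = flow_net(pred_next, cur)        # f(Î_{t+1}, I_t), gradients reach G
    return torch.mean(torch.abs(flow_pred - flow_true))   # L1 difference
```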
4. Adversarial training
Generative adversarial networks (GANs) have demonstrated their usefulness for image and video generation. The present invention likewise utilizes a variant of the GAN module (the least squares GAN) to generate more realistic frames. Typically, a GAN comprises a discriminator network $\mathcal{D}$ and a generator network $\mathcal{G}$: $\mathcal{G}$ learns to generate frames that are hard for $\mathcal{D}$ to classify, while $\mathcal{D}$ aims to tell apart the frames generated by $\mathcal{G}$. In practice, adversarial training is realized by an alternating update scheme. Furthermore, the U-Net-based prediction network is treated as the generator network $\mathcal{G}$, and for the discriminator network $\mathcal{D}$ a patch discriminator is used, i.e. each output scalar of $\mathcal{D}$ corresponds to a patch of the input image. The training procedure is as follows:
training
Figure BDA00021997123700000710
Training
Figure BDA00021997123700000720
Is aimed att+1Classify to class1 and
Figure BDA00021997123700000711
class to class0, where 0 and 1 represent a false label and a true label, respectively. When training
Figure BDA00021997123700000721
At the same time, fixApplying a Mean Square Error (MSE) loss function:
Figure BDA00021997123700000712
where i, j is the spatial patch index, LMSEIs a Mean Square Error (MSE) function. L isMSEThe definition is as follows:
wherein the value range of Y is {0,1},
Figure BDA00021997123700000723
is in the value range of [0,1]]。
TrainingTraining
Figure BDA0002199712370000087
Is to generate a quilt
Figure BDA0002199712370000088
Frames classified into class 1. When training
Figure BDA0002199712370000089
When the temperature of the water is higher than the set temperature,is fixed. Also, a Mean Square Error (MSE) function is applied as follows:
Figure BDA0002199712370000082
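A hedged PyTorch sketch of the patch discriminator and the two least-squares adversarial losses above; the discriminator's depth and channel widths are assumptions:

```python
import torch
import torch.nn as nn

class PatchDiscriminator(nn.Module):
    def __init__(self, in_ch=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(128, 1, 4, stride=1, padding=1), nn.Sigmoid(),
        )

    def forward(self, x):
        # Each scalar of the output map rates one patch of the input image,
        # with values in [0, 1] as required by the MSE losses above.
        return self.net(x)

def d_adv_loss(D, real, fake):
    # L_adv^D: push real patches toward label 1 and generated ones toward 0;
    # fake.detach() keeps the generator fixed while D is trained.
    return 0.5 * torch.mean((D(real) - 1) ** 2) + 0.5 * torch.mean(D(fake.detach()) ** 2)

def g_adv_loss(D, fake):
    # L_adv^G: push the discriminator's patch scores for generated frames toward 1.
    return 0.5 * torch.mean((D(fake) - 1) ** 2)
```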
5. Objective function
When training $\mathcal{G}$, all the constraints on appearance, motion and adversarial training are combined into the objective function, yielding:

$$L_{\mathcal{G}} = \lambda_{int} L_{int}(\hat{I}_{t+1}, I_{t+1}) + \lambda_{gd} L_{gd}(\hat{I}_{t+1}, I_{t+1}) + \lambda_{op} L_{op}(\hat{I}_{t+1}, I_{t+1}, I_t) + \lambda_{adv} L^{\mathcal{G}}_{adv}(\hat{I}_{t+1})$$

When training $\mathcal{D}$, the following loss function is used:

$$L_{\mathcal{D}} = L^{\mathcal{D}}_{adv}(\hat{I}_{t+1}, I_{t+1})$$
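Assembling the loss terms sketched in the previous sections, the generator objective can be written as follows; the default lambda weights are the ones reported in the experiments below:

```python
def generator_objective(pred, gt_next, cur, D, flow_net,
                        lam_int=1.0, lam_gd=1.0, lam_op=2.0, lam_adv=0.05):
    # L_G = λ_int·L_int + λ_gd·L_gd + λ_op·L_op + λ_adv·L_adv^G, reusing the
    # intensity_loss, gradient_loss, flow_loss and g_adv_loss sketches above.
    return (lam_int * intensity_loss(pred, gt_next)
            + lam_gd * gradient_loss(pred, gt_next)
            + lam_op * flow_loss(flow_net, pred, gt_next, cur)
            + lam_adv * g_adv_loss(D, pred))
```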
6. Anomaly detection on the test set
Assuming normal events are predicted well, the predicted frame $\hat{I}$ and its ground truth $I$ can be used to make an anomaly prediction. MSE (mean squared error) is a popular measure of predicted image quality, computing the Euclidean distance between the prediction and the ground truth over all pixels in RGB color space. However, Mathieu showed that the peak signal-to-noise ratio (PSNR) is a better image quality assessment measure, expressed as:

$$PSNR(I, \hat{I}) = 10 \log_{10} \frac{[\max_{\hat{I}}]^2}{\frac{1}{N} \sum_{i=1}^{N} (I_i - \hat{I}_i)^2}$$

A high PSNR for the $t$-th frame indicates that it is more likely to be normal. After calculating the PSNR of each frame of each test video, we normalize the PSNRs of all frames in each test video to the range $[0, 1]$ and calculate the regular score of each frame using:

$$s(t) = \frac{PSNR_t - \min_t PSNR_t}{\max_t PSNR_t - \min_t PSNR_t}$$

Whether a frame is normal or abnormal can thus be predicted from its score $s(t)$; a threshold may be set to distinguish regular from irregular frames.
The method specifically comprises the following steps:
1. preparation of Experimental data
The present invention evaluates the proposed method and the contribution of its different components on three publicly available anomaly detection datasets: the CUHK Avenue dataset, the UCSD Pedestrian dataset, and the ShanghaiTech dataset. We further verify the robustness of our method on a small dataset; that is, our method can correctly classify normal and abnormal events even when there is some uncertainty in the normal events.
2. Experimental settings
To train the network, the pixel intensities of all frames are normalized to $[-1, 1]$ and the size of each frame is set to $256 \times 256$. We set $t = 4$ and use random clips consisting of 5 consecutive frames. Parameters are optimized with Adam-based stochastic gradient descent, with a mini-batch size of 4. For grayscale datasets, the learning rates of the generator and discriminator are set to 0.0001 and 0.00001, respectively, while for color datasets they start from 0.0002 and 0.00002, respectively. The coefficient factors $\lambda_{int}$, $\lambda_{gd}$, $\lambda_{op}$ and $\lambda_{adv}$ differ slightly across datasets; a simple choice is to set them to 1.0, 1.0, 2.0 and 0.05, respectively.
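For a grayscale dataset, this training setup might be instantiated as below, reusing the generator and discriminator sketches from the earlier sections; only the learning rates and batch size come from the text above:

```python
import torch

G = UNetGenerator(in_ch=4, out_ch=1)    # t = 4 stacked grayscale frames in, 1 frame out
D = PatchDiscriminator(in_ch=1)
opt_g = torch.optim.Adam(G.parameters(), lr=1e-4)    # generator: 0.0001
opt_d = torch.optim.Adam(D.parameters(), lr=1e-5)    # discriminator: 0.00001
batch_size = 4                                       # mini-batch size
```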
3. Adversarial training
During training, the generator network $\mathcal{G}$ aims to generate pictures as realistic as possible in order to deceive the discriminator network $\mathcal{D}$, while $\mathcal{D}$ aims to separate the fake images generated by $\mathcal{G}$ from the real images as well as possible. Training therefore alternates between $\mathcal{G}$ and $\mathcal{D}$; together they form a dynamic "game process" whose equilibrium point, the Nash equilibrium, is regarded as the converged state of the network.
4. Performance evaluation
In the conventional anomaly detection literature, a common evaluation metric is the receiver operating characteristic (ROC) curve, computed by gradually changing the threshold on the regular score. The area under the curve (AUC) is then used as a scalar for performance evaluation, with higher values indicating better anomaly detection performance. In the present invention, frame-level AUC is used. (There are three levels of criteria for evaluating abnormal behavior detection algorithms; from fine to coarse they are pixel level, frame level and behavior level.)
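Frame-level AUC can be computed, for instance, with scikit-learn, which integrates the ROC obtained by sweeping the score threshold; the label convention (1 = abnormal) is an assumption of this sketch:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def frame_level_auc(labels, scores):
    # labels: per-frame ground truth (1 = abnormal); scores: regular scores s(t).
    # Lower regular scores indicate anomalies, so 1 - s(t) serves as the anomaly score.
    return roc_auc_score(np.asarray(labels), 1.0 - np.asarray(scores))
```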
5. Comparison with existing methods
To verify the effectiveness of the method, it is compared with both traditional methods and the latest deep-learning-based methods. FIG. 5 shows the results of abnormal behavior detection performance evaluation on the three publicly available datasets based on the AUC metric, listing the AUC of the different methods on these common datasets. The method proposed by the present invention is about 3%-5% better than all existing methods on all datasets, which demonstrates its effectiveness.
6. Effects of motion constraints
To assess the importance of the motion constraint for video frame generation and anomaly detection, an experiment is conducted in which this constraint is removed during training. The resulting baseline is then compared with the method of the present invention.
The motion constraint is evaluated using optical flow maps. FIG. 3 shows optical flow maps generated with/without the motion constraint, together with visualizations of the predicted images and optical flow on the Ped1 dataset; the red boxes mark the predicted optical flow differences. The optical flow generated with the motion constraint is more consistent with the ground truth, showing that the motion constraint term helps our prediction network capture motion information more accurately. In addition, the mean squared errors between the optical flow maps generated with/without the motion constraint and the ground truth (7.51 and 8.26, respectively) are compared, further illustrating the effectiveness of the motion constraint.
7. Comparing performance of predictive networks and autoencoder networks in anomaly detection
As shown in FIG. 4, anomaly detection based on the video prediction network is also compared with anomaly detection based on an auto-encoder network. For auto-encoder-based anomaly detection, the latest Conv-AE achieves state-of-the-art performance. Owing to the capacity of deep neural networks, auto-encoder-based methods can reconstruct both normal and abnormal frames well at test time. To evaluate the prediction network against the auto-encoder, the difference $\Delta_s$ between the average scores of normal and abnormal frames is used; a larger $\Delta_s$ indicates that the network distinguishes normal from abnormal patterns better, corresponding to a lower false alarm rate and a higher detection rate. The average scores of normal and abnormal frames in the test sets of the Ped1, Ped2 and Avenue datasets are first calculated; the difference $\Delta_s$ of these two scores is then computed to measure the ability of our method and of Conv-AE to distinguish normal from abnormal frames. As the results in FIG. 4 show, the proposed method consistently achieves a larger score difference $\Delta_s$ between normal and abnormal events than Conv-AE, verifying the effectiveness of the video-prediction-based anomaly detection method.
The above description is only a preferred embodiment of the present invention, and is not intended to limit the present invention further, and all equivalent variations made by using the contents of the present specification and the drawings are within the protection scope of the present invention.

Claims (8)

1. An abnormal behavior detection method based on a generative adversarial network, comprising the following steps:
S1: select a video to be detected consisting of $t$ consecutive frames $I_1, I_2, \ldots, I_t$;
S2: input the video to be detected into the trained generative adversarial network to obtain a predicted frame $\hat{I}_{t+1}$;
S3: compare the predicted frame $\hat{I}_{t+1}$ with the ground truth $I_{t+1}$ and judge whether it is normal;
wherein the generative adversarial network comprises a generator network $\mathcal{G}$ and a discriminator network $\mathcal{D}$.
2. The abnormal behavior detection method based on a generative adversarial network as claimed in claim 1, wherein the generative adversarial network training process comprises:
S2-1: acquire the data of each frame of the video to be detected, and preprocess the data of each frame;
S2-2: build a U-Net-based prediction network, treat it as the generator network $\mathcal{G}$, and employ a patch discriminator as the discriminator network $\mathcal{D}$;
S2-3: introduce appearance constraints, and alternately optimize the generator network $\mathcal{G}$ and the discriminator network $\mathcal{D}$ by stochastic gradient descent, minimizing the intensity and gradient differences between the predicted frame and the ground truth;
S2-4: introduce a motion constraint, penalizing the difference between the optical flow of the predicted frame and that of the ground truth;
S2-5: alternately perform adversarial training of the generator network $\mathcal{G}$ and the discriminator network $\mathcal{D}$, training $\mathcal{G}$ with the combined objective function and $\mathcal{D}$ with its adversarial loss function;
S2-6: input the preprocessed data into the trained generative adversarial network, calculate the regular score of each frame, and judge whether the frame is normal according to a preset threshold.
3. The abnormal behavior detection method based on a generative adversarial network as claimed in claim 2, wherein, in step S2-1, the data of each frame is preprocessed by normalizing the pixel intensities of all frames to [-1, 1] and setting the size of each frame to 256 × 256.
4. The abnormal behavior detection method based on a generative adversarial network as claimed in claim 2, wherein, in step S2-2, when building the U-Net-based prediction network, U-Net adds shortcuts between the high-resolution layers and the low-resolution layers.
5. The abnormal behavior detection method based on a generative adversarial network as claimed in claim 2, wherein, in step S2-3, the $L_2$ distance between the predicted frame $\hat{I}$ and the ground truth $I$ is minimized in intensity space, as in equation (1):

$$L_{int}(\hat{I}, I) = \|\hat{I} - I\|_2^2 \quad (1)$$

and the gradient loss function is as in equation (2):

$$L_{gd}(\hat{I}, I) = \sum_{i,j} \left( \left| |\hat{I}_{i,j} - \hat{I}_{i-1,j}| - |I_{i,j} - I_{i-1,j}| \right| + \left| |\hat{I}_{i,j} - \hat{I}_{i,j-1}| - |I_{i,j} - I_{i,j-1}| \right| \right) \quad (2)$$

where $i, j$ denote the spatial index of the video frame, $|\hat{I}_{i,j} - \hat{I}_{i-1,j}|$ represents the gradient of the predicted frame along the horizontal axis, and $|\hat{I}_{i,j} - \hat{I}_{i,j-1}|$ represents the gradient of the predicted frame along the vertical axis.
6. The abnormal behavior detection method based on a generative adversarial network as claimed in claim 2, wherein, in step S2-4, the difference between the optical flow of the predicted frame and that of the ground truth is expressed by the optical flow loss function of equation (3):

$$L_{op}(\hat{I}_{t+1}, I_{t+1}, I_t) = \|f(\hat{I}_{t+1}, I_t) - f(I_{t+1}, I_t)\|_1 \quad (3)$$

where $\hat{I}_{t+1}$ denotes the predicted $(t+1)$-th frame, $I_{t+1}$ the real $(t+1)$-th frame, and $I_t$ the $t$-th frame; $I_1, I_2, \ldots, I_t$ denote the $t$ consecutive frames of the video to be detected, which are input into the generator $\mathcal{G}$, whose function $\hat{I}_{t+1} = \mathcal{G}(I_1, I_2, \ldots, I_t)$ generates the predicted $(t+1)$-th frame $\hat{I}_{t+1}$; in the implementation, the flow estimator $f$ is pre-trained on a synthetic dataset; during detection, the predicted frame $\hat{I}_{t+1}$ is compared with the real $(t+1)$-th frame $I_{t+1}$ to detect anomalies.
7. The abnormal behavior detection method based on a generative adversarial network as claimed in claim 2, wherein, in step S2-5:
training $\mathcal{D}$ aims to classify $I_{t+1}$ into class 1 and $\hat{I}_{t+1}$ into class 0, where 0 and 1 represent the fake and real labels, respectively; when training $\mathcal{D}$, $\mathcal{G}$ is fixed and a mean squared error (MSE) loss function is applied:

$$L^{\mathcal{D}}_{adv}(\hat{I}, I) = \sum_{i,j} \tfrac{1}{2} L_{MSE}(\mathcal{D}(I)_{i,j}, 1) + \tfrac{1}{2} L_{MSE}(\mathcal{D}(\hat{I})_{i,j}, 0)$$

where $i, j$ is the spatial patch index and $L_{MSE}$ is the MSE function, defined as:

$$L_{MSE}(\hat{Y}, Y) = (\hat{Y} - Y)^2$$

where $Y$ takes values in $\{0, 1\}$ and $\hat{Y}$ takes values in $[0, 1]$;
when training $\mathcal{D}$, the following loss function is used:

$$L_{\mathcal{D}} = L^{\mathcal{D}}_{adv}(\hat{I}_{t+1}, I_{t+1})$$

training $\mathcal{G}$ aims to generate frames that $\mathcal{D}$ classifies into class 1; when training $\mathcal{G}$, $\mathcal{D}$ is fixed and again an MSE function is applied:

$$L^{\mathcal{G}}_{adv}(\hat{I}) = \sum_{i,j} \tfrac{1}{2} L_{MSE}(\mathcal{D}(\hat{I})_{i,j}, 1)$$

when training $\mathcal{G}$, all the constraints on appearance, motion and adversarial training are combined into the objective function, yielding:

$$L_{\mathcal{G}} = \lambda_{int} L_{int}(\hat{I}_{t+1}, I_{t+1}) + \lambda_{gd} L_{gd}(\hat{I}_{t+1}, I_{t+1}) + \lambda_{op} L_{op}(\hat{I}_{t+1}, I_{t+1}, I_t) + \lambda_{adv} L^{\mathcal{G}}_{adv}(\hat{I}_{t+1})$$
8. The abnormal behavior detection method based on a generative adversarial network as claimed in claim 2, wherein, in step S2-6, the peak signal-to-noise ratio of each frame of each test video is calculated as:

$$PSNR(I, \hat{I}) = 10 \log_{10} \frac{[\max_{\hat{I}}]^2}{\frac{1}{N} \sum_{i=1}^{N} (I_i - \hat{I}_i)^2}$$

the peak signal-to-noise ratios of all frames in each test video are normalized to the range $[0, 1]$, and the regular score of each frame is calculated using:

$$s(t) = \frac{PSNR_t - \min_t PSNR_t}{\max_t PSNR_t - \min_t PSNR_t}$$

and a threshold may be set to distinguish regular from irregular frames, predicting from the score $s(t)$ of a frame whether it is normal or abnormal.
CN201910860850.1A 2019-09-11 2019-09-11 Abnormal behavior detection method based on generative adversarial network Pending CN110705376A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910860850.1A CN110705376A (en) 2019-09-11 2019-09-11 Abnormal behavior detection method based on generative adversarial network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910860850.1A CN110705376A (en) 2019-09-11 2019-09-11 Abnormal behavior detection method based on generative adversarial network

Publications (1)

Publication Number Publication Date
CN110705376A (en) 2020-01-17

Family

ID=69194461

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910860850.1A Pending CN110705376A (en) Abnormal behavior detection method based on generative adversarial network

Country Status (1)

Country Link
CN (1) CN110705376A (en)


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
WEN LIU et al.: "Future Frame Prediction for Anomaly Detection – A New Baseline", arXiv:1712.09867v3 [cs.CV] *
HU Zhengping et al.: "Survey of abnormal object detection and localization in video surveillance systems", Journal of Yanshan University *

Cited By (24)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111310915A (en) * 2020-01-21 2020-06-19 浙江工业大学 Data anomaly detection and defense method for reinforcement learning
CN111310915B (en) * 2020-01-21 2023-09-01 浙江工业大学 Data anomaly detection defense method oriented to reinforcement learning
WO2021147055A1 (en) * 2020-01-22 2021-07-29 Beijing Didi Infinity Technology And Development Co., Ltd. Systems and methods for video anomaly detection using multi-scale image frame prediction network
WO2021147066A1 (en) * 2020-01-23 2021-07-29 Telefonaktiebolaget Lm Ericsson (Publ) Method and apparatus for image classification
CN113837351B (en) * 2020-06-08 2024-04-23 爱思开海力士有限公司 New different detector
CN113837351A (en) * 2020-06-08 2021-12-24 爱思开海力士有限公司 New different detector
CN111783582A (en) * 2020-06-22 2020-10-16 东南大学 Unsupervised monocular depth estimation algorithm based on deep learning
CN112036955A (en) * 2020-09-07 2020-12-04 贝壳技术有限公司 User identification method and device, computer readable storage medium and electronic equipment
CN112036955B (en) * 2020-09-07 2021-09-24 贝壳找房(北京)科技有限公司 User identification method and device, computer readable storage medium and electronic equipment
CN112149596A (en) * 2020-09-29 2020-12-29 厦门理工学院 Abnormal behavior detection method, terminal device and storage medium
CN112364745A (en) * 2020-11-04 2021-02-12 北京瑞莱智慧科技有限公司 Method and device for generating countermeasure sample and electronic equipment
WO2022116322A1 (en) * 2020-12-02 2022-06-09 罗普特科技集团股份有限公司 Method and apparatus for generating anomaly detection model, and anomaly event detection method and apparatus
CN112418149A (en) * 2020-12-04 2021-02-26 清华大学深圳国际研究生院 Abnormal behavior detection method based on deep convolutional neural network
CN113313012A (en) * 2021-05-26 2021-08-27 北京航空航天大学 Dangerous driving behavior identification method based on convolution generation countermeasure network
CN113269104A (en) * 2021-05-28 2021-08-17 山东大学 Group abnormal behavior identification method, system, storage medium and equipment
CN113313037A (en) * 2021-06-02 2021-08-27 郑州大学 Method for detecting video abnormity of generation countermeasure network based on self-attention mechanism
CN113449679A (en) * 2021-07-14 2021-09-28 湖南长城科技信息有限公司 Method and device for identifying abnormal behaviors of human body
CN113283849A (en) * 2021-07-26 2021-08-20 山东新北洋信息技术股份有限公司 Logistics abnormity intelligent detection method based on video context association
CN113283849B (en) * 2021-07-26 2021-11-02 山东建筑大学 Logistics abnormity intelligent detection method based on video context association
CN113435432A (en) * 2021-08-27 2021-09-24 腾讯科技(深圳)有限公司 Video anomaly detection model training method, video anomaly detection method and device
CN115049781A (en) * 2022-05-11 2022-09-13 西南石油大学 Shale digital core three-dimensional reconstruction method based on deep learning
CN115690665A (en) * 2023-01-03 2023-02-03 华东交通大学 Video anomaly detection method and device based on cross U-Net network
CN117411674A (en) * 2023-09-22 2024-01-16 南京中新赛克科技有限责任公司 Industrial Internet abnormal flow detection method and detection system based on generation and diffusion
CN117411674B (en) * 2023-09-22 2024-05-14 南京中新赛克科技有限责任公司 Industrial Internet abnormal flow detection method and detection system based on generation and diffusion

Similar Documents

Publication Publication Date Title
CN110705376A (en) Abnormal behavior detection method based on generative adversarial network
CN109360232B (en) Indoor scene layout estimation method and device based on condition generation countermeasure network
CN109870461B (en) Electronic components quality detection system
CN108549841A (en) A kind of recognition methods of the Falls Among Old People behavior based on deep learning
CN110321873A (en) Sensitization picture recognition methods and system based on deep learning convolutional neural networks
CN106127148A (en) A kind of escalator passenger's unusual checking algorithm based on machine vision
CN112597864B (en) Monitoring video anomaly detection method and device
CN108846826A (en) Object detecting method, device, image processing equipment and storage medium
CN113658115A (en) Image anomaly detection method for generating countermeasure network based on deep convolution
CN111627044A (en) Target tracking attack and defense method based on deep network
CN113569756B (en) Abnormal behavior detection and positioning method, system, terminal equipment and readable storage medium
CN113705490B (en) Anomaly detection method based on reconstruction and prediction
CN113313037A (en) Method for detecting video abnormity of generation countermeasure network based on self-attention mechanism
CN112990052A (en) Partially-shielded face recognition method and device based on face restoration
CN110807396A (en) Face changing video tampering detection method and system based on illumination direction consistency
CN113344475A (en) Transformer bushing defect identification method and system based on sequence modal decomposition
CN115171218A (en) Material sample feeding abnormal behavior recognition system based on image recognition technology
CN114067251B (en) Method for detecting anomaly of unsupervised monitoring video prediction frame
CN109815943B (en) Hazardous chemical storage stacking picture sample generation method and system
CN116823673B (en) High-speed elevator car passenger state visual perception method based on image processing
CN111401209B (en) Action recognition method based on deep learning
CN105844671B (en) A kind of fast background relief method under the conditions of change illumination
CN116994044A (en) Construction method of image anomaly detection model based on mask multi-mode generation countermeasure network
CN106375756A (en) Single object removing and tampering detection method for monitored video
CN116229347A (en) Crowd safety abnormal event identification method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
