CN110263728B

CN110263728B - Abnormal behavior detection method based on improved pseudo-three-dimensional residual neural network

Info

Publication number: CN110263728B
Application number: CN201910548528.5A
Authority: CN
Inventors: 卢博文; 郭文波; 朱松豪
Original assignee: Nanjing University of Posts and Telecommunications
Current assignee: Nanjing University of Posts and Telecommunications
Priority date: 2019-06-24
Filing date: 2019-06-24
Publication date: 2022-08-19
Anticipated expiration: 2039-06-24
Also published as: CN110263728A

Abstract

The invention discloses an abnormal behavior detection method based on an improved pseudo-three-dimensional residual neural network, comprising the following steps: first, each video in a training set is divided into a plurality of video segments; second, all video segments of a video are input into the improved pseudo-three-dimensional residual neural network to obtain the features of the video segments; then, the feature vectors of all frames in each segment are averaged and the average is L2-normalized to obtain the feature vector of the video segment; finally, the feature vector of the video segment is input into a 3-layer fully connected neural network, which outputs the anomaly score of the video segment. Experimental results show that, compared with existing methods, the method further improves the accuracy of abnormal behavior detection and is better suited to practical application.

Description

Abnormal behavior detection method based on improved pseudo-three-dimensional residual neural network

Technical Field

The invention relates to a method for detecting abnormal behavior in surveillance video scenes, in particular to an abnormal behavior detection method based on multiple-instance learning and an improved pseudo-three-dimensional residual neural network, and belongs to the technical field of video analysis.

Background

Traditional video surveillance relies mainly on people manually watching a scene for abnormal behavior. This incurs very high labor costs and readily causes visual fatigue, so some abnormal behaviors may not be noticed in time. Abnormal behavior detection and analysis aims to detect abnormal behavior in a monitored scene automatically, using algorithms from video signal processing and machine learning, so that people can take appropriate measures in time. Abnormal behavior detection in surveillance scenes is therefore of considerable research significance.

Early research on abnormal behavior detection used low-level trajectory features to describe normal patterns. Because reliable trajectories are difficult to obtain, these methods are not robust in complex or crowded scenes with multiple occlusions. To address the shortcomings of trajectory features, low-level spatio-temporal features such as the Histogram of Oriented Gradients (HOG), the Histogram of Optical Flow (HOF), and the Motion Boundary Histogram (MBH) came into wide use, and on this basis the Markov Random Field (MRF) model, the Social Force Model (SFM), the Multi-scale Histogram of Optical Flow (MHOF), and the Mixture of Dynamic Textures (MDT) were proposed one after another. These methods model normal behavior from training data of normal behavior and detect low-probability patterns as anomalies. However, such hand-crafted features struggle to capture behavioral characteristics effectively and are computationally complex.

With the success of sparse representation and dictionary learning in several computer vision problems, researchers began to use sparse representations to learn dictionaries of normal behavior; at test time, patterns with large reconstruction errors are regarded as abnormal. More recently, researchers have used deep-learning-based autoencoders to learn a model of normal behavior and detect anomalies through the reconstruction loss. These methods rest on the assumption that any behavior deviating from the learned normal patterns is abnormal. This assumption may not hold, because both normal and abnormal behaviors are complex and diverse, and the boundary between them is sometimes ambiguous.

It is therefore inappropriate to learn a dictionary of normal behavior from normal training data alone and to detect abnormal behavior from the reconstruction error. It is more reasonable to use video data of both normal and abnormal behavior, with as little label information as possible.

Disclosure of Invention

The invention aims to overcome the shortcomings of existing methods for detecting abnormal behavior in surveillance video scenes, and provides an abnormal behavior detection method based on an improved Pseudo-three-dimensional Residual Neural Network (P3D-ResNet), in which P3D-ResNet is improved and used to learn video features.

The abnormal behavior detection method based on the improved pseudo-three-dimensional residual neural network comprises the following steps:

Step one, based on a multiple-instance learning method, a training set with only coarse-grained labels (namely video-level labels) is adopted, data preprocessing is carried out on each video in the training set, and each video is divided into a plurality of video segments;

Step two, the structure of the P3D-ResNet network is improved, and all video segments of a video are input into the improved P3D-ResNet to obtain a feature vector for each video frame in each video segment;

Step three, the average of the feature vectors of all frames in each video segment is calculated, and L2 norm normalization is then performed on the average to obtain the feature vector of the video segment;

Step four, the feature vector of the video segment obtained in step three is input into a 3-layer fully connected neural network (FC neural network for short), which outputs the anomaly score of the video segment (i.e. the probability that it is abnormal);

Step five, a Receiver Operating Characteristic curve (ROC for short) is drawn, the corresponding Area Under the Curve (AUC for short) is calculated, and the abnormal segments of the input video are evaluated.
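The averaging and normalization of step three can be illustrated with a short sketch. It is a minimal example, assuming the per-frame feature vectors of one segment are already available as equal-length lists of floats; the function name `segment_feature` and the `eps` guard are illustrative choices, not part of the method as stated.

```python
import math

def segment_feature(frame_features, eps=1e-12):
    """Average per-frame feature vectors, then L2-normalize the mean.

    frame_features: list of equal-length lists of floats (one per frame).
    eps: hypothetical guard against division by zero for an all-zero mean.
    """
    if not frame_features:
        raise ValueError("a segment needs at least one frame")
    dim = len(frame_features[0])
    n = len(frame_features)
    # Element-wise mean over all frames in the segment.
    mean = [sum(f[d] for f in frame_features) / n for d in range(dim)]
    # L2 norm of the mean vector.
    norm = math.sqrt(sum(v * v for v in mean))
    return [v / max(norm, eps) for v in mean]

# Two frames with 2-dimensional features: the mean is [3, 4], whose norm is 5.
feat = segment_feature([[3.0, 0.0], [3.0, 8.0]])
```

After normalization every segment feature has unit length, so segments are compared by direction rather than by magnitude.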

Further, in step one, each video in the training set is regarded as a bag: a video containing abnormal behavior is labeled as a positive bag, and a normal video without abnormal behavior is labeled as a negative bag. The size and frame rate of the video frames are adjusted, and the data preprocessing is as follows: each video is divided into non-overlapping video segments with the same number of frames, and each segment serves as an example in a positive or negative bag.

Further, in step two, the P3D-ResNet network structure is improved as follows: on the basis of the pseudo-three-dimensional convolutional residual neural network framework, a 3 × 3 × 3 average pooling operation is added to the shortcut connection, and a Batch Normalization (BN) operation is added after each convolution operation.
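The pooling operation added to the shortcut can be pictured in isolation. The following is a minimal sketch in plain Python, assuming stride 1 and a window that is clipped at the borders (the text fixes only the 3 × 3 × 3 window size); the function name `avg_pool3d_3x3x3` and the nested-list layout `[time][height][width]` are illustrative assumptions, and the convolution and batch normalization layers of the block are not modeled.

```python
def avg_pool3d_3x3x3(volume):
    """3 x 3 x 3 average pooling over a [time][height][width] nested list.

    Assumptions of this sketch: stride 1, output has the input's shape,
    and border windows average only the elements that actually exist.
    """
    T, H, W = len(volume), len(volume[0]), len(volume[0][0])
    out = [[[0.0] * W for _ in range(H)] for _ in range(T)]
    for t in range(T):
        for h in range(H):
            for w in range(W):
                total, count = 0.0, 0
                # Visit the 3 x 3 x 3 neighbourhood centred on (t, h, w).
                for dt in (-1, 0, 1):
                    for dh in (-1, 0, 1):
                        for dw in (-1, 0, 1):
                            tt, hh, ww = t + dt, h + dh, w + dw
                            if 0 <= tt < T and 0 <= hh < H and 0 <= ww < W:
                                total += volume[tt][hh][ww]
                                count += 1
                out[t][h][w] = total / count
    return out

# A 3 x 3 x 3 volume that is zero except for its centre element.
vol = [[[0.0] * 3 for _ in range(3)] for _ in range(3)]
vol[1][1][1] = 27.0
pooled = avg_pool3d_3x3x3(vol)
```

In a deep-learning framework this step would normally be a built-in 3D average pooling layer; the loop form is shown only to make the arithmetic explicit.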

Further, the classifier used in step four, a 3-layer fully connected neural network, is trained with an objective function.

Further, in step four, the objective function is designed as follows:

1) In a multiple-instance learning algorithm, the training set with video-level labels can be represented as $\{(x_1, y_1), \ldots, (x_i, y_i), \ldots, (x_N, y_N)\}$, where $x_i$ is the $i$-th bag in the training set, $y_i$ is the label of that bag, and $N$ is the total number of bags in the training set. The $i$-th bag can be represented as

$$x_i = \{x_{i1}, x_{i2}, \ldots, x_{in_i}\}$$

where $x_{ik}$ is the $k$-th example of bag $x_i$ and $n_i$ is the total number of examples in the positive bag $x_i$. Suppose there are $N_a$ positive bags in the training set; then $x_i$ ($i \in [1, N_a]$) is a positive bag and $y_i = 1$. The training set also contains $N - N_a$ negative bags; then $x_j$ ($j \in [N_a + 1, N]$) is a negative bag and $y_j = -1$.

In a standard supervised classification problem using a Support Vector Machine (SVM), the optimization objective is:

$$\min_{w,\,b}\; C \sum_{i=1}^{m} \max\bigl(0,\; 1 - y_i\,(w \cdot \phi(x_i) + b)\bigr) + \frac{1}{2}\lVert w \rVert^{2}$$

where the first term is the risk term, the second term is the regularization term, $C$ is the regularization constant, $m$ is the total number of training examples, $y_i$ is the label of each example, $\phi(x_i)$ is the feature vector of an image block or video segment, $w$ is the weight parameter of the model, and $b$ is the bias parameter.

2) Since the anomaly score is the probability of being abnormal, an abnormal video should receive a higher anomaly score than a normal video. To improve detection accuracy, an abnormal video segment should likewise receive a higher anomaly score than a normal video segment, which a ranking loss can enforce. However, without segment-level annotation (it is unknown whether a given segment of an abnormal video is normal), a multiple-instance ranking objective is needed:

$$\max_{k} f(x_{ik}) > \max_{l} f(x_{jl})$$

where $x_{ik}$ and $x_{jl}$ denote abnormal video segments and normal video segments respectively, and $f(x_{ik})$ and $f(x_{jl})$ denote the corresponding anomaly scores, with values in $[0, 1]$.

3) The example with the highest anomaly score in a positive bag is most likely a true positive example (an abnormal segment). The example with the highest anomaly score in a negative bag is in fact a normal example, but it is the negative example most likely to be falsely detected, and is therefore called a hard example in anomaly detection. For this reason, the ranking is not enforced on every example in a bag, but only on the two examples with the highest anomaly scores in the positive and negative bags. To push these two scores far apart, a ranking loss function in hinge-loss form is used:

$$l(x_i, x_j) = \max\Bigl(0,\; 1 - \max_{k} f(x_{ik}) + \max_{l} f(x_{jl})\Bigr)$$

4) The ranking loss function above is not sufficient on its own, because it ignores the temporal structure of the video. First, since the sequence of segments is continuous, the anomaly scores of two adjacent video segments should vary smoothly; a temporal smoothness constraint is therefore added to minimize the score difference between adjacent segments. Second, since abnormal behavior usually occupies only a short time span within a video, the scores of the examples (video segments) in a positive bag should be sparse; a sparsity constraint is therefore added. Combining the smoothness constraint and the sparsity constraint gives the complete objective function:

$$L(W) = \max\Bigl(0,\; 1 - \max_{k} f(x_{ik}) + \max_{l} f(x_{jl})\Bigr) + \lambda_1 \sum_{k=1}^{n_i - 1} \bigl(f(x_{ik}) - f(x_{i,k+1})\bigr)^{2} + \lambda_2 \sum_{k=1}^{n_i} f(x_{ik}) + \lambda_3 \lVert W \rVert_F$$

where the $\lambda_1$ term is the temporal smoothness constraint, the $\lambda_2$ term is the sparsity constraint, the $\lambda_3$ term is the regularization term, $W$ denotes the model weights, $\lambda_1$, $\lambda_2$ and $\lambda_3$ are the coefficients of the temporal smoothness, sparsity and regularization terms respectively, and the other variables are the same as in formula (2). In the MIL ranking loss, the error is back-propagated from the highest-scoring video segment in each of the positive and negative bags.
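The loss for one pair of bags, as described in items 3) and 4), can be sketched as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the smoothness term is taken as the squared difference of neighbouring scores, the sparsity term as the plain sum of positive-bag scores, the hinge margin as 1, and the weight regularization term is left out because it depends on the model. The function name `mil_ranking_objective` is hypothetical; the default coefficients follow the value 8×10⁻⁵ given in the embodiment.

```python
def mil_ranking_objective(pos_scores, neg_scores, lam1=8e-5, lam2=8e-5):
    """Ranking loss between one positive bag and one negative bag.

    pos_scores / neg_scores: anomaly scores in [0, 1] for the segments
    (examples) of an abnormal and a normal video, in temporal order.
    """
    # Hinge-form ranking on the highest-scoring example of each bag.
    ranking = max(0.0, 1.0 - max(pos_scores) + max(neg_scores))
    # Temporal smoothness: squared score difference of neighbouring segments
    # (the squared form is an assumption of this sketch).
    smooth = sum((a - b) ** 2 for a, b in zip(pos_scores, pos_scores[1:]))
    # Sparsity: sum of the scores in the positive bag.
    sparse = sum(pos_scores)
    return ranking + lam1 * smooth + lam2 * sparse

loss = mil_ranking_objective([0.1, 0.9, 0.2], [0.3, 0.2])
```

Because only the two maxima enter the hinge term, its gradient reaches the network through the highest-scoring segment of each bag, which matches the remark on back-propagation above.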

Further, in step five, the classifier is used to obtain the anomaly probability of each video segment, and an ROC curve is then drawn to evaluate the performance of the abnormal behavior detection method. The specific method is as follows:

1) sort the video segments in descending order of anomaly probability, take the maximum anomaly probability as the threshold, and classify examples greater than or equal to the threshold as positive and all others as negative;

2) calculate the false positive rate and the true positive rate from the classification result, and use them as the abscissa and the ordinate respectively to obtain one point in the coordinate plane;

3) set the classification threshold to the predicted value of each of the other examples in turn, and obtain a series of points in the same way;

4) connect all the points to form the ROC curve; the area under the ROC curve is the AUC.
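The four steps above can be sketched in a few lines of plain Python. This is an illustrative sketch rather than the evaluation code used for the reported results: it assumes binary ground-truth labels per segment (1 for abnormal, 0 for normal), starts the curve at the origin, and integrates with the trapezoidal rule; the names `roc_curve_points` and `auc` are hypothetical.

```python
def roc_curve_points(scores, labels):
    """Return ROC points (FPR, TPR) by sweeping the threshold over the scores.

    scores: anomaly probabilities; labels: 1 for abnormal, 0 for normal.
    """
    pos = sum(labels)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need both positive and negative examples")
    points = [(0.0, 0.0)]  # assumed starting point of the curve
    # Thresholds are the distinct scores, visited in descending order.
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 0)
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Toy scores, not experimental data.
pts = roc_curve_points([0.9, 0.8, 0.4, 0.3], [1, 0, 1, 0])
area = auc(pts)
```

For the toy input the curve passes through (0, 0.5), (0.5, 0.5), (0.5, 1) and (1, 1), giving an area of 0.75.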

The invention has the beneficial effects that: the method only uses the weakly labeled training data, not only utilizes the video data of the normal behavior, but also utilizes the video data of the abnormal behavior, improves the accuracy rate of the abnormal behavior detection, and is more suitable for practical application.

Drawings

In order that the manner in which the present invention is attained and can be understood in detail, a more particular description of the invention briefly summarized above may be had by reference to the embodiments thereof which are illustrated in the appended drawings.

FIG. 1 is a flow chart of a method for detecting abnormal behavior as set forth herein;

FIG. 2 is a diagram of the structure of 3 pseudo three-dimensional residual blocks in P3D-ResNet;

FIG. 3 is a structural diagram of the improved P3D-ResNet block;

FIG. 4 shows the ROC curves of the method of the present application, two existing methods based on three-dimensional neural network frameworks, and a reference method; where ① denotes the method of the present application, ② denotes the method based on 3D CNN, ③ denotes the method based on 3D ResNet-34, and ④ denotes the method based on a binary classification support vector machine.

Detailed Description

The invention aims to provide a method for detecting abnormal behavior in surveillance video scenes, in particular an abnormal behavior detection method based on multiple-instance learning and an improved pseudo-three-dimensional residual neural network, so as to enhance monitoring capability and improve public safety.

As shown in FIG. 1, the method comprises the following steps:

First, each video in the training set is divided into a plurality of video segments, which are input into the improved P3D-ResNet to obtain the features of the video segments; then, the features of all frames in each video segment are averaged, and the average is L2-normalized to obtain the feature of the video segment; finally, the feature is input into a 3-layer fully connected neural network, which outputs the anomaly score of the video segment.

As shown in FIG. 2, a pseudo-three-dimensional residual neural network (P3D-ResNet) that mixes three block structures in sequence, namely the pseudo-three-dimensional serial shortcut residual block (P3D-A), the pseudo-three-dimensional parallel shortcut residual block (P3D-B) and the pseudo-three-dimensional serial-parallel shortcut residual block (P3D-C), performs better than the three variant networks P3D-A ResNet, P3D-B ResNet and P3D-C ResNet, each of which is built from a single block structure. A pseudo-three-dimensional convolutional residual neural network with diverse block structures is therefore obtained by replacing the 2-dimensional block structures of ResNet-50 with the 3 pseudo-three-dimensional residual block structures in sequence. On the basis of this framework, a 3 × 3 × 3 average pooling operation is added to the shortcut connection and a batch normalization operation is added after each convolution operation; a schematic diagram of the improved P3D-ResNet is shown in FIG. 3.
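The sequential mixing of the three block types can be pictured as a simple schedule over the residual blocks of ResNet-50. This is only an illustration: it assumes the blocks cycle A, B, C without restarting at stage boundaries, and uses the 3, 4, 6, 3 residual-block layout of ResNet-50; the names `RESNET50_STAGE_BLOCKS` and `p3d_block_schedule` are hypothetical.

```python
from itertools import cycle, islice

# Residual blocks per stage in ResNet-50 (3 + 4 + 6 + 3 = 16 bottleneck blocks).
RESNET50_STAGE_BLOCKS = (3, 4, 6, 3)

def p3d_block_schedule(stage_blocks=RESNET50_STAGE_BLOCKS):
    """Assign P3D-A, P3D-B, P3D-C in turn to every residual block.

    Assumption of this sketch: the A -> B -> C cycle runs straight through
    the network and does not restart at the beginning of each stage.
    """
    order = cycle(("P3D-A", "P3D-B", "P3D-C"))
    total = sum(stage_blocks)
    return list(islice(order, total))

schedule = p3d_block_schedule()
```

The resulting list names the block type that replaces each 2-dimensional residual block, from the first block of the network to the last.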

The method specifically comprises the following steps:

1. Preparation of experimental data

The abnormal behaviors in earlier datasets are limited in variety, and some datasets were staged and recorded by people at a single location, so they do not reflect real video surveillance scenes.

Because of these limitations, the method is evaluated on UCF-Anomaly-Detection-Dataset, a new large-scale dataset with video-level labels. The dataset consists of untrimmed surveillance videos with a long total duration: 1900 videos in all, 950 normal and 950 abnormal. It covers 13 kinds of real-world abnormal events: abuse of people or animals, arrest of suspects, arson, assault, traffic accidents, burglary, explosion, fighting, robbery, shooting, stealing, shoplifting, and destruction of property.

The dataset is divided into two parts: the training set comprises 1610 videos (800 normal videos and 810 abnormal videos), and the test set comprises 290 videos (150 normal videos and 140 abnormal videos).

2. Experimental settings

All video frames are resized to 240 × 320 pixels and the frame rate is fixed at 30 frames per second. To extract pseudo-three-dimensional features, each video is divided, according to its length, into video segments of 16 frames, with 8 frames of overlap between consecutive segments.
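The segmentation into 16-frame segments with an 8-frame overlap amounts to a sliding window with a stride of 8 frames. A minimal sketch follows; the function name `clip_ranges` is hypothetical, and dropping trailing frames that do not fill a whole segment is an assumption, since the text does not say how the tail of a video is handled.

```python
def clip_ranges(num_frames, clip_len=16, overlap=8):
    """Return (start, end) frame index pairs, end exclusive.

    Frames at the tail that do not fill a whole clip are dropped; that
    policy is an assumption of this sketch.
    """
    if not 0 <= overlap < clip_len:
        raise ValueError("overlap must be in [0, clip_len)")
    stride = clip_len - overlap
    return [(s, s + clip_len)
            for s in range(0, num_frames - clip_len + 1, stride)]

# A 40-frame video yields four segments, each sharing 8 frames with the next.
ranges = clip_ranges(40)
```

Setting `overlap=0` gives non-overlapping segments with the same helper.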

Firstly, dividing input video data into a plurality of video segments; secondly, inputting the video clips into an improved pseudo three-dimensional residual error neural network to obtain characteristics; then, taking the feature average value of all frames in each video clip; then, performing L2 norm normalization on the feature mean value to obtain feature representation of the input video data; and finally, inputting the obtained characteristics into a 3-layer fully-connected neural network to realize the detection of abnormal segments of the input video data.

The first FC (fully connected) layer has 512 units, and the second and third FC layers have 32 units and 1 unit respectively; 50% Dropout regularization is used between the FC layers. In this embodiment, an ELU activation function and a Swish activation function are used for the first and last fully connected layers respectively, and the Adam optimizer is used with an initial learning rate of 0.001, $\beta_1 = 0.9$, $\beta_2 = 0.999$ and $\epsilon = 1 \times 10^{-8}$. To obtain the best performance, the coefficients of the temporal smoothness constraint term, the sparsity constraint term and the regularization term in the multiple-instance ranking loss are set to $\lambda_1 = \lambda_2 = 8 \times 10^{-5}$ and $\lambda_3 = 0.01$.
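For reference, the two activation functions named above and the listed hyperparameters can be written down as follows. This is a sketch under assumptions: ELU is taken with α = 1 and Swish with a unit slope parameter, neither of which is stated in the text, and the constant names are illustrative.

```python
import math

def elu(x, alpha=1.0):
    """ELU: x for x > 0, alpha * (exp(x) - 1) otherwise (alpha = 1 assumed)."""
    return x if x > 0 else alpha * (math.exp(x) - 1.0)

def swish(x):
    """Swish with unit slope parameter (assumed): x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))

# Layer widths, dropout rate, optimizer and loss settings from the embodiment.
FC_UNITS = (512, 32, 1)
DROPOUT_RATE = 0.5
ADAM = {"lr": 1e-3, "beta1": 0.9, "beta2": 0.999, "eps": 1e-8}
LOSS_COEFFS = {"lambda1": 8e-5, "lambda2": 8e-5, "lambda3": 0.01}
```

Both functions pass positive inputs through almost unchanged and damp negative inputs smoothly instead of clipping them to zero.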

3. Selection of evaluation index

The evaluation metrics commonly used in abnormal behavior detection are the receiver operating characteristic (ROC) curve and the corresponding area under the curve (AUC).

The horizontal axis of the ROC curve is the False Positive Rate (FPR), and the vertical axis is the True Positive Rate (TPR):

$$\mathrm{FPR} = \frac{FP}{FP + TN}, \qquad \mathrm{TPR} = \frac{TP}{TP + FN}$$

where TP, FN, FP and TN denote the numbers of samples that are true positives, false negatives, false positives and true negatives respectively. The closer the ROC curve lies to the top and the larger the area under the curve (AUC), the higher the detection accuracy; conversely, the lower the detection accuracy.

After the classifier has produced the anomaly probabilities of all video segments, the ROC curve is drawn and the AUC is calculated to evaluate the performance of the abnormal behavior detection method. The specific method is as follows: 1) sort the video segments in descending order of anomaly probability, take the maximum anomaly probability as the threshold, and classify examples greater than or equal to the threshold as positive and all others as negative; 2) calculate the false positive rate and the true positive rate from the classification result, and use them as the abscissa and the ordinate respectively to obtain one point in the coordinate plane; 3) set the classification threshold to the predicted value of each of the other examples in descending order, and obtain a series of points in the same way; 4) connect all the points to form the ROC curve and calculate the area under it, the AUC.

Abnormal behavior detection is performed on the UCF dataset with the method of the present application, two existing methods based on three-dimensional neural networks, and a reference method, and ROC curves are drawn for performance evaluation, as shown in FIG. 4. In FIG. 4, ① is the method proposed by the present application, based on the improved pseudo-three-dimensional residual neural network: using video-level label data, behavior patterns are learned by a deep multiple-instance method, with the improved pseudo-three-dimensional residual neural network as the feature extractor and a three-layer fully connected neural network regularized with 50% Dropout as the classifier, thereby detecting abnormal behavior;

② is the method based on a three-dimensional convolutional neural network with fully connected layers (3D CNN): using video-level label data, a 3-dimensional convolutional neural network is used as the feature extractor and a 3-layer fully connected neural network regularized with 50% Dropout as the classifier, thereby detecting abnormal behavior;

③ is the method based on a 34-layer three-dimensional residual neural network (3D ResNet-34): 3D ResNet-34 is used as the feature extractor and a linear Support Vector Machine (SVM) as the classifier, thereby detecting abnormal behavior;

④ is the reference method, which detects abnormal behavior directly with a binary support vector machine (binary SVM) classifier.

The quantitative comparison of abnormal behavior detection performance between the present application and the existing methods based on three-dimensional neural network frameworks is shown in Table 1 below:

TABLE 1

Method	Acc.	AUC
① P3D-ResNet (ours)	89.2	94.6
② C3D+FC	76.6	86.5
③ 3D ResNet-34	68.9	72.7
④ Binary classifier	50.0	55.5

As can be seen from the experimental results of fig. 4 and table 1, the method of the present application has the best abnormal behavior detection effect compared to the other three methods.

The above description is only a preferred embodiment of the present invention, and is not intended to limit the present invention, and all equivalent variations made by using the contents of the present specification and the drawings are within the protection scope of the present invention.

Claims

1. An abnormal behavior detection method based on an improved pseudo-three-dimensional residual neural network, characterized by comprising the following steps:

step one, based on a multiple-instance learning method, adopting a training set with only coarse-grained, video-level labels, performing data preprocessing on each video in the training set, and dividing each video into a plurality of video segments;

step two, improving the structure of the P3D-ResNet network, and inputting all video segments of a video into the improved P3D-ResNet to obtain the feature vector of each video frame in each video segment;

the improvement of the P3D-ResNet network structure being that, on the basis of the pseudo-three-dimensional convolutional residual neural network framework, a 3 × 3 × 3 average pooling operation is added to the shortcut connection, and a batch normalization operation is added after each convolution operation;

step four, inputting the feature vector of the video segment obtained in step three into a 3-layer fully connected neural network, and outputting the anomaly score of the video segment;

in step four, the objective function is designed as follows:

1) in a multiple-instance learning algorithm, the training set with video-level labels is represented as $\{(x_1, y_1), \ldots, (x_i, y_i), \ldots, (x_N, y_N)\}$, where $x_i$ is the $i$-th bag in the training set, $y_i$ is the label of that bag, and $N$ is the total number of bags in the training set; the $i$-th bag can be expressed as

$$x_i = \{x_{i1}, x_{i2}, \ldots, x_{in_i}\}$$

where $x_{ik}$ is the $k$-th example of bag $x_i$ and $n_i$ is the total number of examples in the positive bag $x_i$; suppose there are $N_a$ positive bags in the training set, then $x_i$, $i \in [1, N_a]$, is a positive bag and $y_i = 1$; the training set also contains $N - N_a$ negative bags, then $x_j$, $j \in [N_a + 1, N]$, is a negative bag and $y_j = -1$;

in the standard supervised classification problem using a support vector machine (SVM), the optimization objective is:

$$\min_{w,\,b}\; C \sum_{i=1}^{m} \max\bigl(0,\; 1 - y_i\,(w \cdot \phi(x_i) + b)\bigr) + \frac{1}{2}\lVert w \rVert^{2}$$

where the first term is the risk term, the second term is the regularization term, $C$ is the regularization constant, $m$ is the total number of training examples, $y_i$ is the label of each example, $\phi(x_i)$ is the feature vector of an image block or video segment, $w$ is the weight parameter of the model, and $b$ is the bias parameter;

2) using a ranking loss, abnormal video segments obtain a higher anomaly score than normal segments; considering that no segment-level annotation is available, a multiple-instance ranking objective is used:

$$\max_{k} f(x_{ik}) > \max_{l} f(x_{jl})$$

where $x_{ik}$ and $x_{jl}$ denote abnormal video segments and normal video segments respectively, and $f(x_{ik})$ and $f(x_{jl})$ denote the corresponding anomaly scores, with values in $[0, 1]$;

3) instead of ranking every example in a bag, the ranking is enforced only on the two examples with the highest anomaly scores in the positive and negative bags; to make the anomaly score of the highest-scoring positive example differ greatly from that of the highest-scoring negative example, a ranking loss function in hinge-loss form is used:

$$l(x_i, x_j) = \max\Bigl(0,\; 1 - \max_{k} f(x_{ik}) + \max_{l} f(x_{jl})\Bigr)$$

4) the temporal structure of the video is also considered; first, since the sequence of segments is continuous, the anomaly scores of two adjacent video segments are smooth, and the difference between the anomaly scores of adjacent video segments is minimized by adding a temporal smoothness constraint; second, since abnormal behavior occupies only a short time span in the video, the scores of the examples in a positive bag are sparse, and the anomaly scores of the video segments are made sparse by adding a sparsity constraint; combining the smoothness constraint and the sparsity constraint gives the complete objective function:

$$L(W) = \max\Bigl(0,\; 1 - \max_{k} f(x_{ik}) + \max_{l} f(x_{jl})\Bigr) + \lambda_1 \sum_{k=1}^{n_i - 1} \bigl(f(x_{ik}) - f(x_{i,k+1})\bigr)^{2} + \lambda_2 \sum_{k=1}^{n_i} f(x_{ik}) + \lambda_3 \lVert W \rVert_F$$

where $W$ denotes the model weights, and $\lambda_1$, $\lambda_2$ and $\lambda_3$ denote the coefficients of the temporal smoothness constraint term, the sparsity constraint term and the regularization term respectively;

and step five, drawing a receiver operating characteristic curve, calculating the corresponding area under the curve, and evaluating the abnormal segments of the input video.

2. The improved detection method for abnormal behaviors based on pseudo-three-dimensional residual error neural network of claim 1, wherein in step one, each video in the training set is regarded as a packet, the video with abnormal behaviors is marked as a positive packet, the normal video without abnormal behaviors is marked as a negative packet, the size and frame rate of each frame of the video are adjusted, and the data preprocessing method comprises: each video is divided into video segments with the same number of frames with overlap between neighbors, each as an example in a positive or negative packet.

3. The improved method for detecting abnormal behavior based on pseudo-three-dimensional residual error neural network as claimed in claim 1, wherein the 3-layer fully-connected neural network used in step four is trained with an objective function.

4. The improved abnormal behavior detection method based on the pseudo-three-dimensional residual error neural network according to claim 1, wherein in step five, the classifier is used to obtain the abnormal probability of each video segment, and an ROC curve is drawn, and the specific method is as follows:

3) sequentially setting the classification threshold values to the predicted values of other examples, and obtaining a series of points on coordinate axes according to the method;

4) connecting all the points to form the ROC curve, and calculating the corresponding area under the ROC curve, the AUC.