CN112418149A - Abnormal behavior detection method based on deep convolutional neural network - Google Patents
Abnormal behavior detection method based on deep convolutional neural network
- Publication number
- CN112418149A (application number CN202011408898.8A)
- Authority
- CN
- China
- Prior art keywords
- frame
- image
- optical flow
- motion
- appearance
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/50—Context or environment of the image
- G06V20/52—Surveillance or monitoring of activities, e.g. for recognising suspicious objects
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2415—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/047—Probabilistic or stochastic networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/048—Activation functions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Abstract
A method for detecting abnormal behavior based on a deep convolutional neural network, the method comprising: A1: encoding an input video frame; A2: decoding the encoded stream to obtain an appearance stream and a motion stream; A3: scoring each frame with an anomaly detection module, comparing the score against a threshold, and flagging abnormal behavior. The method makes full use of the structural and motion information extracted from video frames and can detect abnormal behaviors accurately and efficiently.
Description
Technical Field
The invention relates to the field of computer vision and video detection and analysis, in particular to an abnormal behavior detection method based on a deep convolutional neural network.
Background
A practical anomaly monitoring system should raise an alert promptly when an anomaly occurs and identify its type. In general, anomaly detection can be viewed as a coarse form of video understanding that only distinguishes anomalous events from normal ones. Once an anomaly is detected, further classification techniques are used to identify and categorize the abnormal behavior.
Online detection of abnormal behaviors in video surveillance must overcome three difficulties: the algorithm must meet real-time requirements; it must make effective use of long, untrimmed video datasets; and it must cope with the complexity of the environment in which the surveillance camera is located.
To date, image-based tasks such as image classification and object detection have been revolutionized by deep learning, especially convolutional neural networks. Compared with traditional methods, deep learning offers higher recognition accuracy and stronger robustness. Progress in video analysis, however, has been less satisfactory, suggesting that learning representations of spatiotemporal data is very difficult. The main difficulty is that capturing the motion information present in video requires new network designs that have yet to be found and validated.
Previous research has learned features by performing convolutions simultaneously in the spatial and temporal dimensions. Optical flow features are widely and effectively used in video analysis: applying optical flow to video understanding tasks allows motion cues to be modeled explicitly and conveniently. However, this approach is inefficient, as computing and storing the estimated optical flow is costly.
One application of abnormal behavior detection in video surveillance is detecting littering. Household garbage discarded at random releases large amounts of harmful gases such as ammonia and sulfides, pollutes water bodies, and breeds bacteria and pests; such careless disposal is a major cause of urban environmental pollution, which is why household garbage classification measures are necessary. An abnormal behavior detection method based on an intelligent computer vision algorithm is therefore desirable, one that can accurately and efficiently detect abnormal behaviors such as littering.
Disclosure of Invention
The main purpose of the present invention is to overcome the problems in the background art, and to provide an abnormal behavior detection method based on a deep convolutional neural network, so as to achieve accurate and efficient intelligent detection of abnormal behaviors.
In order to achieve the purpose, the invention adopts the following technical scheme:
a method for detecting abnormal behavior based on a deep convolutional neural network, the method comprising:
a1: encoding an input video frame;
a2: decoding the encoded stream to obtain an appearance stream and a motion stream;
A3: scoring each frame with an anomaly detection module, comparing the score against a threshold, and judging abnormal behavior.
Further:
the step a1 specifically includes:
A11: adding an Inception module after the input layer to determine low-level features;
a12: video is encoded using a convolutional auto-encoder.
In the step A11:
An Inception module is added after the input layer to determine low-level features as early as possible, allowing the model to select suitable convolution operations automatically; this is preferably applied to processing surveillance video shot from a fixed angle.
In the step A12:
The encoder uses a convolutional auto-encoder (Conv-AE) that learns to detect abnormal targets from templates of normal appearance; the encoder is a sequence of layer blocks, each comprising three layers: convolution, batch normalization, and a Leaky-ReLU activation function, applying strided convolution directly rather than pooling layers to reduce the resolution of the feature maps;
wherein reducing the spatial resolution of the feature maps through learned parameters allows the network to find an informative way to downsample, with the corresponding up-sampling learned in the decoding phase.
The step a2 specifically includes:
a21: decoding the coded stream by an appearance decoder to obtain an appearance stream;
a22: and decoding the coded stream by a motion decoder to obtain the motion stream.
In the step A21:
the appearance decoder learns appearance information from a static image and outputs probability distribution of different abnormal behavior categories, wherein the appearance information comprises textures, contours and interest points; the appearance decoder is a layer block sequence, and a Dropout layer is added before the ReLU activation function of each block as a regularization method for reducing the risk of over-fitting in the training phase.
In the step A21:
for input image I and its reconstructed imageForcing the generation of an image with similar intensity for each pixel, the intensity loss being estimated as
Adding a constraint to preserve the original gradient, i.e. sharpness, in the reconstructed image, the gradient loss being defined as the difference between the absolute gradients along two spatial dimensions
Wherein x, y represent the horizontal and vertical directions of the image space, respectively, gdRepresenting the image gradient in both the horizontal and vertical directions, the final loss function of the appearance transformation is the sum of the intensity and gradient losses:
in the step A22:
The motion decoder learns motion information and predicts the probability of different abnormal behavior categories; the motion decoder is a sequence of layer blocks, with a Dropout layer added before the ReLU activation function of each block as a regularization method to reduce the risk of over-fitting during training; the network used by the motion decoder also contains skip connections that carry low-level features from the original image;
wherein a pre-trained FlowNet2 is employed to estimate optical flow;
wherein a U-Net subnetwork is employed to learn the associations between appearance patterns and the corresponding movements;
the distance-based loss between the output optical flow and the ground-truth optical flow is
\[ L_{flow}(F_t,\hat{F}_t) = \|F_t-\hat{F}_t\|_1 \]
where \(F_t\) is the ground-truth optical flow estimated from two successive frames \(I_t\) and \(I_{t+1}\), and \(\hat{F}_t\) is the output of the U-Net given \(I_t\);
given an input video frame I and its associated optical flow F obtained by FlowNet2, the network in the model graph produces a reconstructed frameAnd predicted optical flowDiscriminator D estimates the probability that the optical flow associated with I is the ground truth F, and the GAN objective function consists of two loss functions:
where x, y and c represent the spatial positions and corresponding channels, respectively, of the cells in the feature map output from discriminator D, and the λ value is the weight associated with the partial loss in the model; GAN is optimized by alternately minimizing the two GAN losses to indicate the efficiency of motion prediction.
The step a3 specifically includes:
A score estimation scheme is used in which only a small region is considered instead of the entire frame;
wherein partial scores are defined, estimated respectively on the two model streams over a patch at the same position:
\[ S_I(P) = \frac{1}{|P|}\sum_{(i,j)\in P}\big(I_{i,j}-\hat{I}_{i,j}\big)^2, \qquad S_F(P) = \frac{1}{|P|}\sum_{(i,j)\in P}\big|F_{i,j}-\hat{F}_{i,j}\big| \]
where P denotes an image patch, |P| is its number of pixels, i and j denote pixel indices in the horizontal and vertical directions of the image, \(I_{i,j}\) is the value of the input image at (i, j), \(\hat{I}_{i,j}\) the value of its reconstruction at (i, j), \(F_{i,j}\) the ground-truth optical flow at (i, j), and \(\hat{F}_{i,j}\) the U-Net output at (i, j); \(S_I\) and \(S_F\) denote the appearance score and the optical-flow score, respectively. The frame-level score is then computed as a weighted combination of the two partial scores:
\[ S = w_F\, S_F(\tilde{P}) + \lambda_S\, w_I\, S_I(\tilde{P}) \]
where \(w_F\) and \(w_I\) are weights computed from the training data, \(\lambda_S\) controls the contribution of the appearance partial score to the sum, and \(\tilde{P}\) is the patch with the highest \(S_F\) value in the frame, namely:
\[ \tilde{P} = \arg\max_{P}\, S_F(P) \]
The weights \(w_F\) and \(w_I\) are estimated as the inverse of the average score over the training data of n images:
\[ w_F = \Big(\frac{1}{n}\sum_{i=1}^{n} S_F^{(i)}(\tilde{P}^{(i)})\Big)^{-1}, \qquad w_I = \Big(\frac{1}{n}\sum_{i=1}^{n} S_I^{(i)}(\tilde{P}^{(i)})\Big)^{-1} \]
where i denotes an image index, \(S_F^{(i)}\) the optical-flow score of the i-th image, and \(\tilde{P}^{(i)}\) the patch with the highest \(S_F\) value in the i-th frame.
The frame-level scores of each evaluation video are normalized, the final frame-level score being
\[ \tilde{S}_t = \frac{S_t}{\max(S_{1\ldots m})} \]
where t is the frame index in a video containing m frames, \(S_t\) denotes the score of the t-th frame, \(\max(S_{1\ldots m})\) is the maximum score over all frames, and \(\tilde{S}_t\) is the normalized frame score.
A computer readable storage medium storing computer instructions which, when executed by a processor, implement the method.
The invention has the following beneficial effects:
the invention provides an abnormal behavior detection method based on a deep convolutional neural network. The method makes full use of the structure information and the motion information extracted from the video frame, and can accurately and efficiently finish the intelligent detection of abnormal behaviors. In a preferred embodiment, the deep convolutional neural network combines a convolutional auto-encoder (Conv-AE) and U-Net, so that each stream contributes to the task of detecting outlier frames. Usually the network depth is a carefully selected hyper-parameter, and in order to mitigate the influence of the network depth on the accuracy, it is preferable that the method integrates a tuned inclusion module after the input layer. The method further provides a patch-based approach for evaluating the framework-level normalization score that reduces the effects of model output noise. Compared with other high-level methods, the method has obvious competitive advantages in the operation effect of the reference data set.
Drawings
FIG. 1 is a flow chart of an abnormal behavior detection method based on a deep convolutional neural network according to an embodiment of the present invention;
FIG. 2 is a block diagram of the model, annotated with the spatial resolution of the feature maps, according to an embodiment of the present invention.
Detailed Description
The embodiments of the present invention will be described in detail below. It should be emphasized that the following description is merely exemplary in nature and is not intended to limit the scope of the invention or its application.
The embodiment of the invention provides an abnormal behavior detection method based on a deep convolutional neural network, which mainly comprises the following steps: after the input video passes through an encoder formed by a series of sub-modules, an appearance decoder and a motion decoder are used to obtain an appearance stream and a motion stream respectively, and finally an anomaly detection module judges whether the input video contains abnormal behavior. The invention can be used to detect abnormal behaviors such as littering. Referring to FIG. 1 and FIG. 2, the method includes the following steps:
a1: an input video frame is encoded. The encoder comprises an inclusion module, a convolution module, a batch standardization module and an activation module;
a2: decoding the coded stream, and obtaining an appearance stream through an appearance decoder; and obtaining the motion stream through a motion decoder.
A3: and scoring the frame through an abnormality detection module, comparing the frame with a threshold value, and judging abnormal behaviors.
In particular embodiments, when performing the above steps, the following may be followed. It should be noted that the specific methods employed in the practice are merely illustrative, and the scope of the present invention includes, but is not limited to, the following methods.
A1: an input video frame is encoded.
The encoder in the preferred embodiment includes Inception, convolution, batch normalization and activation modules.
The network proposed in the embodiments comprises an encoding-decoding architecture, which creates a bottleneck. A deep structure may omit features critical to decoding; conversely, a shallow network may lose high-level abstract information. The Inception module was originally developed to let a convolutional neural network determine filter sizes automatically. Preferably, the method uses an Inception module so that the model automatically selects suitable convolution operations.
Some embodiments are mainly applied to surveillance video shot from a fixed angle. If a convolutional layer with a predefined kernel size is placed directly after the input layer, the information extracted from a target changes with its distance from the camera, and this effect propagates to subsequent layers; the method therefore adds an Inception module after the input layer to determine low-level features as early as possible. The Inception module also significantly reduces the amount of computation compared with other approaches.
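As a rough sketch of the idea, an Inception-style block applies several kernel sizes in parallel to the same input and stacks the results, letting later layers weight whichever receptive field suits each target scale. The averaging kernels and single-channel input below are simplifying assumptions, not the patented filters:

```python
import numpy as np

def conv2d_same(x, kernel):
    """Naive single-channel 2-D convolution with zero 'same' padding."""
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    xp = np.pad(x, ((ph, ph), (pw, pw)))
    out = np.empty_like(x, dtype=float)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.sum(xp[i:i + kh, j:j + kw] * kernel)
    return out

def inception_block(x, kernel_sizes=(1, 3, 5)):
    """Run parallel convolutions of different sizes and stack them channel-wise."""
    branches = [conv2d_same(x, np.ones((k, k)) / (k * k)) for k in kernel_sizes]
    return np.stack(branches)  # shape: (len(kernel_sizes), H, W)
```

Because every branch sees the same input, the block need not commit to one kernel size; near and distant targets are each captured by some branch.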
The convolutional auto-encoder (Conv-AE) used in some embodiments may learn the method of detecting abnormal objects from templates of normal performance. The convolutional self-encoder includes an encoder and a decoder.
The encoder consists of a series of blocks, each comprising three layers: convolution, batch normalization and a Leaky-ReLU activation function. Some embodiments apply strided convolution directly, rather than pooling layers, to reduce the resolution of the feature maps. This parameterization lets the network learn an informative way to reduce the spatial resolution of the feature maps, with the corresponding up-sampling learned in the decoding phase.
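A minimal sketch of such a layer block follows: a stride-2 convolution halves the spatial resolution in place of pooling, followed by a simplified per-map batch normalization and Leaky-ReLU. The kernel values, stride and negative slope are illustrative assumptions:

```python
import numpy as np

def strided_conv(x, kernel, stride=2):
    """Valid convolution with stride; replaces pooling for learned downsampling."""
    kh, kw = kernel.shape
    h = (x.shape[0] - kh) // stride + 1
    w = (x.shape[1] - kw) // stride + 1
    out = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(x[i * stride:i * stride + kh,
                                 j * stride:j * stride + kw] * kernel)
    return out

def batch_norm(x, eps=1e-5):
    """Simplified normalization over a single feature map."""
    return (x - x.mean()) / np.sqrt(x.var() + eps)

def leaky_relu(x, alpha=0.2):
    """Leaky-ReLU keeps a weak negative response instead of zeroing it."""
    return np.where(x > 0, x, alpha * x)

def encoder_block(x, kernel):
    """Convolution -> batch normalization -> Leaky-ReLU, as in the encoder blocks."""
    return leaky_relu(batch_norm(strided_conv(x, kernel)))
```

Unlike max pooling, the downsampling here has trainable weights, so the network itself decides what to keep when halving the resolution.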
A2: decoding the coded stream, and obtaining an appearance stream through an appearance decoder; and obtaining the motion stream through a motion decoder.
The decoder is a sequence of layer blocks, each block having a Dropout layer added before the ReLU activation function, as a regularization method to reduce the risk of over-fitting during the training phase.
The appearance decoder can effectively learn appearance information such as textures, outlines, interest points and the like from the static image and output probability distribution of different abnormal behavior categories. The motion decoder can effectively learn motion information and predict the probability of different abnormal behavior categories.
The Conv-AE used in some embodiments supports detecting anomalous objects within an input frame by learning common appearance templates of normal events. Since the Conv-AE learns common appearance patterns of normal events, we consider the \(\ell_2\) distance between the input image I and its reconstructed image \(\hat{I}\); the model thus forces the generation of images with similar intensity at each pixel. The intensity loss is estimated as
\[ L_{int}(I,\hat{I}) = \|I-\hat{I}\|_2^2 \]
One drawback of using only the \(\ell_2\) loss is blur in the output, so we add a constraint to preserve the original gradients (i.e. sharpness) in the reconstructed image. The gradient loss is defined as the difference between the absolute gradients along the two spatial dimensions:
\[ L_{grad}(I,\hat{I}) = \sum_{d\in\{x,y\}} \big\|\,|g_d(I)| - |g_d(\hat{I})|\,\big\|_1 \]
where x, y denote the horizontal and vertical directions of the image space and \(g_d\) denotes the image gradient along direction d. The final loss function of the appearance stream is the sum of the intensity and gradient losses:
\[ L_{appe}(I,\hat{I}) = L_{int}(I,\hat{I}) + L_{grad}(I,\hat{I}) \]
this combination of losses provides good performance for the video prediction task.
The motion decoder can effectively learn motion information and predict the probability of different abnormal behavior categories. It differs from the appearance decoder in that its network contains skip connections, which carry low-level features (edges, small patches, etc.) from the original image.
In addition to abnormal object structure, abnormal motion of typical objects is also suitable for evaluating video frames. Each module in the encoder is to enhance the level of spatial abstraction of common objects in the training frame. Thus, the method employs an association between a U-Net sub-network learning mode and a corresponding motion.
Some embodiments employ a pre-trained FlowNet2 to estimate optical flow. Compared with other models, the optical flow output by FlowNet2 is not only much smoother but also preserves motion discontinuities with sharp boundaries. Using Leaky-ReLU activations in the encoder also retains weak responses, which helps provide useful information to the decoder.
The U-Net subnetwork focuses on learning the associations between these patterns and the corresponding motions; the ground-truth optical flow used in the method is estimated by a pre-trained FlowNet2. To reduce the effect of outliers when learning the motion correspondence, the loss between the output optical flow and its ground truth is measured by the \(\ell_1\) distance:
\[ L_{flow}(F_t,\hat{F}_t) = \|F_t-\hat{F}_t\|_1 \]
where \(F_t\) is the ground-truth optical flow estimated from two successive frames \(I_t\) and \(I_{t+1}\), and \(\hat{F}_t\) is the output of the U-Net given \(I_t\). This stream can predict the instantaneous motion of objects appearing in the video.
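The flow loss above is a plain l1 distance over the flow field and can be sketched as:

```python
import numpy as np

def flow_loss(flow_gt, flow_pred):
    """l1 distance between ground-truth and predicted optical flow fields."""
    return float(np.sum(np.abs(flow_gt - flow_pred)))
```

Choosing l1 over l2 here grows the penalty linearly rather than quadratically with the error, which reduces the influence of outlier flow vectors, matching the stated goal.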
In addition to the distance-based loss \(L_{flow}\), another loss is added to make the distribution of the predicted optical flow similar to that of the ground-truth optical flow.
Given an input video frame I and its associated optical flow F obtained by FlowNet2, the network proposed in the model diagram (G denotes the generator) produces a reconstructed frame \(\hat{I}\) and a predicted optical flow \(\hat{F}\), and the discriminator D estimates the probability that the optical flow associated with I is the ground truth F. The GAN objective consists of two loss functions:
\[ L_D(I,F,\hat{F}) = \sum_{x,y,c} \Big[-\log D(I,F)_{x,y,c} - \log\big(1 - D(I,\hat{F})_{x,y,c}\big)\Big] \]
\[ L_G(I,\hat{F}) = \lambda_{adv} \sum_{x,y,c} -\log D(I,\hat{F})_{x,y,c} \]
where x, y and c denote the spatial position and corresponding channel of a cell in the feature map output by D, and the λ values are the weights associated with the partial losses in the proposed model. The GAN is optimized by alternately minimizing the two losses; it is used to indicate the efficiency of motion prediction.
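A sketch of the two adversarial objectives, written over a patch-wise probability map output by D; the clipping epsilon is a numerical-stability assumption, and the λ-weighted appearance/flow terms are omitted for brevity:

```python
import numpy as np

def discriminator_loss(d_real, d_fake, eps=1e-8):
    """D should assign high probability to ground-truth flow, low to predicted flow."""
    return float(np.mean(-np.log(d_real + eps) - np.log(1.0 - d_fake + eps)))

def generator_adv_loss(d_fake, eps=1e-8):
    """G is rewarded when D believes its predicted flow is the ground truth."""
    return float(np.mean(-np.log(d_fake + eps)))
```

Training alternates between the two: a step on D with G fixed, then a step on G (with the appearance and flow losses added, weighted by the λ values) while D is fixed.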
A3: and scoring the frame through an abnormality detection module, comparing the frame with a threshold value, and judging abnormal behaviors.
The anomaly detection model in some embodiments aims to provide a normalized score for each frame. In related approaches, the score is typically a quantity measuring the similarity between the ground truth and the reconstructed or predicted output, and the normality of each video frame is determined by comparing its score with a threshold. However, because of the summing or averaging over all pixel positions, anomalous events occurring within small image regions may be missed. The method therefore proposes an alternative score estimation scheme that considers only a small region rather than the entire frame.
Partial scores are defined, estimated respectively on the two model streams over a patch at the same position:
\[ S_I(P) = \frac{1}{|P|}\sum_{(i,j)\in P}\big(I_{i,j}-\hat{I}_{i,j}\big)^2, \qquad S_F(P) = \frac{1}{|P|}\sum_{(i,j)\in P}\big|F_{i,j}-\hat{F}_{i,j}\big| \]
where P denotes an image patch, |P| is its number of pixels, i and j denote pixel indices in the horizontal and vertical directions of the image, \(I_{i,j}\) is the value of the input image at (i, j), \(\hat{I}_{i,j}\) the value of its reconstruction at (i, j), \(F_{i,j}\) the ground-truth optical flow at (i, j), and \(\hat{F}_{i,j}\) the U-Net output at (i, j); \(S_I\) and \(S_F\) denote the appearance score and the optical-flow score, respectively. Then, our frame-level score is computed as a weighted combination of the two partial scores:
\[ S = w_F\, S_F(\tilde{P}) + \lambda_S\, w_I\, S_I(\tilde{P}) \]
where \(w_F\) and \(w_I\) are weights computed from the training data, \(\lambda_S\) controls the contribution of the appearance partial score to the sum, and \(\tilde{P}\) is the patch with the highest \(S_F\) value in the frame, namely:
\[ \tilde{P} = \arg\max_{P}\, S_F(P) \]
The weights \(w_F\) and \(w_I\) are estimated as the inverse of the average score over the training data of n images:
\[ w_F = \Big(\frac{1}{n}\sum_{i=1}^{n} S_F^{(i)}(\tilde{P}^{(i)})\Big)^{-1}, \qquad w_I = \Big(\frac{1}{n}\sum_{i=1}^{n} S_I^{(i)}(\tilde{P}^{(i)})\Big)^{-1} \]
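The patch-based scoring described above can be sketched as follows; the exhaustive sliding-window search for the highest-\(S_F\) patch mirrors the description, while the patch size and λ value are illustrative assumptions:

```python
import numpy as np

def patch_scores(img, recon, flow, flow_pred, top, left, size):
    """Partial scores S_I and S_F over one square patch at the same position."""
    win = (slice(top, top + size), slice(left, left + size))
    n = size * size
    s_i = float(np.sum((img[win] - recon[win]) ** 2)) / n       # appearance score
    s_f = float(np.sum(np.abs(flow[win] - flow_pred[win]))) / n  # motion score
    return s_i, s_f

def frame_score(img, recon, flow, flow_pred, w_i, w_f, size=2, lam=0.2):
    """Find the patch with the highest S_F, then combine the two partial scores."""
    best = None
    for top in range(img.shape[0] - size + 1):
        for left in range(img.shape[1] - size + 1):
            s_i, s_f = patch_scores(img, recon, flow, flow_pred, top, left, size)
            if best is None or s_f > best[1]:
                best = (s_i, s_f)
    s_i, s_f = best
    return w_f * s_f + lam * w_i * s_i
```

With a motion error confined to one 2×2 patch of a 4×4 frame, the patch score is 0.5, whereas averaging over the whole frame would dilute the same error to 0.125, which is the motivation for the patch-based scheme.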
finally, the frame-level scores of each evaluation video were normalized according to the recommendations of the relevant study.
The final frame-level score is
Where t is the frame index in a video containing m frames, StDenotes the fraction of the t-th frame, max (S)1...m) Represents the maximum value of the scores of all the frames,i.e. the normalized frame fraction.
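The per-video normalization can be sketched as dividing by the video's maximum score, as described above:

```python
def normalize_scores(scores):
    """Scale the frame-level scores of one video by that video's maximum score."""
    peak = max(scores)
    return [s / peak for s in scores]
```

Normalizing per video puts all scores into (0, 1], so a single threshold can be applied across videos whose raw score ranges differ.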
The background of the present invention may contain background information related to the problem or environment of the present invention and does not necessarily describe the prior art. Accordingly, the inclusion in the background section is not an admission of prior art by the applicant.
The foregoing is a more detailed description of the invention in connection with specific/preferred embodiments and is not intended to limit the practice of the invention to those descriptions. It will be apparent to those skilled in the art that various substitutions and modifications can be made to the described embodiments without departing from the spirit of the invention, and these substitutions and modifications should be considered to fall within the scope of the invention. In the description herein, references to the description of the term "one embodiment," "some embodiments," "preferred embodiments," "an example," "a specific example," or "some examples" or the like are intended to mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above are not necessarily intended to refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Various embodiments or examples and features of various embodiments or examples described in this specification can be combined and combined by one skilled in the art without contradiction. Although embodiments of the present invention and their advantages have been described in detail, it should be understood that various changes, substitutions and alterations can be made herein without departing from the scope of the claims.
Claims (10)
1. A method for detecting abnormal behaviors based on a deep convolutional neural network is characterized by comprising the following steps:
a1: encoding an input video frame;
a2: decoding the encoded stream to obtain an appearance stream and a motion stream;
A3: scoring each frame with an anomaly detection module, comparing the score against a threshold, and judging abnormal behavior.
2. The method according to claim 1, wherein the step a1 specifically comprises:
A11: adding an Inception module after the input layer to determine low-level features;
a12: video is encoded using a convolutional auto-encoder.
3. The method of claim 2, wherein in step a11:
An Inception module is added after the input layer to determine low-level features as early as possible, allowing the model to select suitable convolution operations automatically; this is preferably applied to processing surveillance video shot from a fixed angle.
4. The method of claim 2, wherein in step a 12:
The encoder uses a convolutional auto-encoder (Conv-AE) that learns to detect abnormal targets from templates of normal appearance; the encoder is a sequence of layer blocks, each comprising three layers: convolution, batch normalization, and a Leaky-ReLU activation function, applying strided convolution directly rather than pooling layers to reduce the resolution of the feature maps;
wherein reducing the spatial resolution of the feature maps through learned parameters allows the network to find an informative way to downsample, with the corresponding up-sampling learned in the decoding phase.
5. The method according to claim 1, wherein step A2 specifically comprises:
A21: decoding the coded stream by an appearance decoder to obtain the appearance stream;
A22: decoding the coded stream by a motion decoder to obtain the motion stream.
6. The method of claim 5, wherein in step A21:
the appearance decoder learns appearance information from static images, the appearance information comprising textures, contours, and interest points, and outputs a probability distribution over the different abnormal-behavior categories; the appearance decoder is a sequence of layer blocks, with a Dropout layer added before the ReLU activation function of each block as a regularization method to reduce the risk of overfitting during the training phase.
7. The method of claim 5, wherein in step A21:
for an input image $I$ and its reconstructed image $\hat{I}$, the model is forced to generate an image with similar intensity at each pixel; the intensity loss is estimated as

$$L_{int}(I,\hat{I}) = \big\|I - \hat{I}\big\|_2^2$$

a constraint is added to preserve the original gradients, i.e. sharpness, in the reconstructed image; the gradient loss is defined as the difference between the absolute gradients along the two spatial dimensions:

$$L_{grad}(I,\hat{I}) = \sum_{d \in \{x,y\}} \Big\| \, \big|g_d(I)\big| - \big|g_d(\hat{I})\big| \, \Big\|_1$$

where $x$ and $y$ respectively represent the horizontal and vertical directions of image space, and $g_d$ represents the image gradient along direction $d$; the final loss function of the appearance stream is the sum of the intensity and gradient losses:

$$L_{appe}(I,\hat{I}) = L_{int}(I,\hat{I}) + L_{grad}(I,\hat{I})$$
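The intensity and gradient losses described in this claim can be sketched in NumPy. This is a minimal illustration, not the patent's implementation; the toy 2×2 arrays are assumptions, and finite differences (`np.diff`) stand in for the image gradient operator:

```python
import numpy as np

# Sketch of the appearance losses: squared-intensity error plus an L1
# penalty on the difference of absolute gradients along each spatial axis.
# Toy arrays are illustrative; np.diff is used as the gradient operator.

def intensity_loss(img, recon):
    # Sum of squared per-pixel intensity differences.
    return np.sum((img - recon) ** 2)

def gradient_loss(img, recon):
    # L1 distance between absolute finite-difference gradients
    # along the vertical (axis 0) and horizontal (axis 1) directions.
    loss = 0.0
    for axis in (0, 1):
        g_img = np.abs(np.diff(img, axis=axis))
        g_rec = np.abs(np.diff(recon, axis=axis))
        loss += np.sum(np.abs(g_img - g_rec))
    return loss

def appearance_loss(img, recon):
    # Final appearance loss: intensity term plus gradient term.
    return intensity_loss(img, recon) + gradient_loss(img, recon)

img = np.array([[0.0, 1.0], [1.0, 0.0]])
recon = np.array([[0.0, 1.0], [1.0, 1.0]])   # one pixel reconstructed wrong
print(appearance_loss(img, recon))
```

A single wrong pixel is penalized both for its intensity error and for the blur it introduces into the local gradients.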
8. the method of claim 5, wherein in step A22:
the motion decoder learns motion information and predicts the probabilities of the different abnormal-behavior categories; the motion decoder is a sequence of layer blocks, with a Dropout layer added before the ReLU activation function of each block as a regularization method to reduce the risk of overfitting during the training phase; and the network used by the motion decoder comprises skip connections that extract low-level features from the original image;
wherein a pre-trained FlowNet2 is employed to estimate optical flow;
wherein a U-Net sub-network is employed to learn the associations between appearance patterns and the corresponding motion;
the distance-based loss between the output optical flow and the ground-truth optical flow is

$$L_{flow}(F_t,\hat{F}_t) = \big\|F_t - \hat{F}_t\big\|_1$$

where $F_t$ is the ground-truth optical flow estimated from two successive frames $I_t$ and $I_{t+1}$, and $\hat{F}_t$ is the output of the U-Net given $I_t$;
given an input video frame $I$ and its associated optical flow $F$ obtained by FlowNet2, the network in the model produces a reconstructed frame $\hat{I}$ and a predicted optical flow $\hat{F}$; a discriminator $D$ is used to estimate the probability that the optical flow associated with $I$ is the ground truth $F$, and the GAN objective consists of two loss functions:

$$L_D(I,F,\hat{F}) = \frac{1}{2}\sum_{x,y,c}\Big[-\log D(I,F)_{x,y,c} - \log\big(1 - D(I,\hat{F})_{x,y,c}\big)\Big]$$

$$L_G(I,\hat{F}) = \sum_{x,y,c} -\log D(I,\hat{F})_{x,y,c}$$

where $x$, $y$, and $c$ represent the spatial position and the corresponding channel of the cells in the feature map output by discriminator $D$, and the $\lambda$ values are the weights associated with the partial losses in the model; the GAN is optimized by alternately minimizing the two GAN losses, improving the quality of the motion prediction.
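The alternating adversarial losses can be sketched on a patch discriminator's per-cell probability maps. The maps below are illustrative stand-ins for discriminator outputs, not real network activations:

```python
import numpy as np

# Sketch of the two adversarial losses over a patch discriminator's
# per-cell probability maps: d_real stands in for D(I, F) and d_fake
# for D(I, F_hat). The arrays are illustrative, not network outputs.

def discriminator_loss(d_real, d_fake):
    # Push real (I, F) cells toward probability 1, predicted-flow
    # cells toward 0 (averaged with the 1/2 factor from the objective).
    return 0.5 * np.sum(-np.log(d_real) - np.log(1.0 - d_fake))

def generator_adv_loss(d_fake):
    # The motion decoder tries to make D score its flow as real.
    return np.sum(-np.log(d_fake))

d_real = np.array([[0.9, 0.8], [0.85, 0.95]])   # D confident on real flow
d_fake = np.array([[0.2, 0.1], [0.15, 0.25]])   # D rejects predicted flow
ld = discriminator_loss(d_real, d_fake)
lg = generator_adv_loss(d_fake)
print(ld, lg)
```

Training alternates: one step lowers `ld` by sharpening the discriminator, the next lowers `lg` by making the predicted flow harder to reject.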
9. The method according to any one of claims 1 to 8, wherein step A3 specifically comprises:
a patch-based score estimation scheme is used, in which only a small region, rather than the entire frame, is considered;
wherein partial scores are defined that are estimated respectively on the two model streams, sharing the same patch position:

$$S_I(P) = \frac{1}{|P|}\sum_{(i,j)\in P}\big(I_{i,j}-\hat{I}_{i,j}\big)^2, \qquad S_F(P) = \frac{1}{|P|}\sum_{(i,j)\in P}\big|F_{i,j}-\hat{F}_{i,j}\big|$$

where $P$ represents an image patch, $|P|$ is its number of pixels, $i$ and $j$ represent the pixel indices along the horizontal and vertical directions of the image, $I_{i,j}$ is the value of the input image at $(i,j)$, $\hat{I}_{i,j}$ is the value of its reconstructed image at $(i,j)$, $F_{i,j}$ is the ground-truth optical flow at $(i,j)$, $\hat{F}_{i,j}$ is the output of the U-Net at $(i,j)$, and $S_I$ and $S_F$ respectively represent the score of the original image and the score of the optical flow; the frame-level score is then computed as a weighted combination of the two partial scores:

$$S = \log\big(w_F\,S_F(\tilde{P})\big) + \lambda_S \log\big(w_I\,S_I(\tilde{P})\big)$$

where $w_F$ and $w_I$ are weights calculated from the training data, $\lambda_S$ controls the contribution of the partial score to the sum, and $\tilde{P}$ is the patch with the highest $S_F$ value in the frame, namely:

$$\tilde{P} = \arg\max_{P}\, S_F(P)$$

the weights $w_F$ and $w_I$ are estimated as the inverses of the average scores over the training data of $n$ images:

$$w_F = \bigg(\frac{1}{n}\sum_{i=1}^{n} S_F^{(i)}\big(\tilde{P}_i\big)\bigg)^{-1}, \qquad w_I = \bigg(\frac{1}{n}\sum_{i=1}^{n} S_I^{(i)}\big(\tilde{P}_i\big)\bigg)^{-1}$$

where $i$ represents the image index, $S_F^{(i)}$ represents the optical-flow score of the $i$-th image, and $\tilde{P}_i$ is the patch with the highest $S_F$ value in the frame of the $i$-th image;
the frame-level scores of each evaluation video are normalized, the final frame-level score being

$$\tilde{S}_t = \frac{S_t - \min_t S_t}{\max_t S_t - \min_t S_t}$$
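The patch-based scoring of step A3 can be sketched end to end. This is a toy illustration under stated assumptions: the patch size, the $\lambda_S$ weight, unit $w_I$/$w_F$ weights, and the random toy arrays are all choices made here, not values from the patent:

```python
import numpy as np

# Sketch of patch-based frame scoring: compute S_I and S_F on every
# patch, keep the patch maximizing S_F, combine the two scores with
# weights, then min-max normalize scores over a video.
# Patch size, lambda_S, unit weights, and toy data are assumptions.

def patch_scores(img, recon, flow, flow_hat, size=2):
    h, w = img.shape
    best = None
    for i in range(h - size + 1):
        for j in range(w - size + 1):
            sl = (slice(i, i + size), slice(j, j + size))
            s_i = np.mean((img[sl] - recon[sl]) ** 2)    # S_I(P)
            s_f = np.mean(np.abs(flow[sl] - flow_hat[sl]))  # S_F(P)
            if best is None or s_f > best[1]:
                best = (s_i, s_f)
    return best  # (S_I, S_F) at the patch maximizing S_F

def frame_score(s_i, s_f, w_i, w_f, lambda_s=1.0):
    # Weighted log combination of the two partial scores.
    return np.log(w_f * s_f) + lambda_s * np.log(w_i * s_i)

def normalize(scores):
    # Min-max normalization of frame-level scores over one video.
    scores = np.asarray(scores, dtype=float)
    return (scores - scores.min()) / (scores.max() - scores.min())

rng = np.random.default_rng(0)
img, recon = rng.random((4, 4)), rng.random((4, 4))
flow, flow_hat = rng.random((4, 4)), rng.random((4, 4))
s_i, s_f = patch_scores(img, recon, flow, flow_hat)
# w_I and w_F would be inverses of average training scores; 1.0 here.
scores = [frame_score(s_i, s_f, 1.0, 1.0), -1.0, 2.0]
print(normalize(scores))
```

Scoring a single highest-error patch rather than the whole frame keeps a small anomalous region from being averaged away by the many normal pixels around it.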
10. A computer-readable storage medium storing computer instructions, wherein the computer instructions, when executed by a processor, implement the method of any one of claims 1 to 9.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011408898.8A CN112418149A (en) | 2020-12-04 | 2020-12-04 | Abnormal behavior detection method based on deep convolutional neural network |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112418149A true CN112418149A (en) | 2021-02-26 |
Family
ID=74830341
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011408898.8A Pending CN112418149A (en) | 2020-12-04 | 2020-12-04 | Abnormal behavior detection method based on deep convolutional neural network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112418149A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113343757A (en) * | 2021-04-23 | 2021-09-03 | 重庆七腾科技有限公司 | Space-time anomaly detection method based on convolution sparse coding and optical flow |
CN115078894A (en) * | 2022-08-22 | 2022-09-20 | 广东电网有限责任公司肇庆供电局 | Method, device and equipment for detecting abnormity of electric power machine room and readable storage medium |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109919032A (en) * | 2019-01-31 | 2019-06-21 | 华南理工大学 | A kind of video anomaly detection method based on action prediction |
US20200013148A1 (en) * | 2018-07-06 | 2020-01-09 | Mitsubishi Electric Research Laboratories, Inc. | System and Method for Detecting Motion Anomalies in Video |
CN110705376A (en) * | 2019-09-11 | 2020-01-17 | 南京邮电大学 | Abnormal behavior detection method based on generative countermeasure network |
Non-Patent Citations (1)
Title |
---|
TRONG-NGUYEN NGUYEN 等: "Anomaly Detection in Video Sequence with Appearance-Motion Correspondence", 《2019 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION(ICCV)》 * |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication |
Application publication date: 20210226 |