Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides an abnormal behavior detection method based on dual prediction of appearance and motion features, and aims to design a dual-stream network structure incorporating a memory enhancement module for predicting appearance and motion features, so that abnormal video frames yield larger prediction errors and the accuracy of abnormal behavior detection is improved.
The invention adopts the following technical scheme: an abnormal behavior detection method based on dual prediction of appearance and motion features, comprising the following steps:
(1) sequentially reading video frames, calculating the inter-frame difference of adjacent images, and acquiring a video frame sequence with a fixed length and a corresponding frame difference image sequence;
(2) using a dual-stream network model incorporating a memory enhancement module, extracting the appearance and motion features characteristic of normal behaviors through an appearance sub-network and a motion sub-network respectively, and predicting a video frame image and a frame difference image;
(3) adding and fusing the predicted video frame and the frame difference image to obtain a final predicted video frame;
(4) obtaining the anomaly score of the frame by measuring the motion and appearance features extracted by the memory enhancement module and the quality of the final predicted image.
Compared with the prior art, the invention has the beneficial effects that:
1. the invention feeds both the video frame sequence and the RGB frame difference image sequence as input into a dual-stream convolutional autoencoder network for prediction; compared with prior methods that use optical flow maps to extract motion features, the use of frame difference images reduces the network complexity and the amount of computation;
2. the invention improves the encoder and decoder structures of the autoencoder network, so that features are extracted more effectively and the image prediction quality is improved;
3. the method adds a memory enhancement module that better learns the features of normal samples, enhances the robustness of the network and enables abnormal videos to obtain higher anomaly scores;
4. the method takes into account both the quality of the predicted image and the feature similarity score between the extracted sample features and the normal behavior features as the evaluation basis, which effectively improves the anomaly detection performance and reduces the false detection rate.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to the accompanying drawings. It should be understood that the detailed description and specific examples, while indicating the invention, are intended for purposes of illustration only and are not intended to limit the scope of the invention.
As shown in fig. 1-2, an abnormal behavior detection method based on appearance and motion feature dual prediction includes the following steps:
(1) sequentially reading video frames, calculating the inter-frame difference of adjacent images, and acquiring a video frame sequence with a fixed length and a corresponding frame difference image sequence;
(2) using a dual-stream network model incorporating a memory enhancement module, extracting the appearance and motion features characteristic of normal behaviors through an appearance sub-network and a motion sub-network respectively, and predicting a video frame image and a frame difference image;
(3) adding and fusing the predicted video frame and the frame difference image to obtain a final predicted video frame;
(4) obtaining the frame anomaly score by measuring the motion and appearance features extracted by the memory enhancement module and the quality of the final predicted image.
The detailed steps are as follows:
Step 1: acquire fixed-length video frames and frame difference maps. A video stream is obtained from a fixed camera; after the video is divided into frames, a continuous video frame sequence of fixed length t is selected, of which the first t-1 frame images are fed directly into the appearance sub-network. For the video stream of a fixed camera, a background image I_B of the video can be acquired with an OpenCV method; I_B is then subtracted from the t RGB video frames to obtain foreground images I'_1, I'_2, …, I'_t free of background noise. Finally, each frame of the foreground image sequence is subtracted from the frame that follows it, yielding the t-1 consecutive frame difference images X_1, X_2, …, X_{t-1} required by the motion sub-network.
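As an illustration of step 1, the following is a minimal Python sketch of how the fixed-length frame sequence and its frame difference maps could be assembled; the temporal-median background estimate and the use of absolute differences are assumptions standing in for whichever OpenCV background method the invention actually uses.

```python
import cv2
import numpy as np

def build_sequences(video_path, t=5):
    """Return t consecutive RGB frames and the t-1 RGB frame difference maps."""
    cap = cv2.VideoCapture(video_path)
    frames = []
    ok, frame = cap.read()
    while ok:
        frames.append(frame)
        ok, frame = cap.read()
    cap.release()

    # Assumed background model for a fixed camera: per-pixel temporal median.
    background = np.median(np.stack(frames), axis=0).astype(np.uint8)

    seq = frames[:t]                                        # fixed-length frame sequence
    fg = [cv2.absdiff(f, background) for f in seq]          # foreground images I'_1..I'_t
    diffs = [cv2.absdiff(fg[i + 1], fg[i]) for i in range(t - 1)]  # X_1..X_{t-1}
    return seq, diffs
```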
Step 2: the fixed-length video frames and the frame difference images are fed respectively into the dual-stream network incorporating the memory enhancement module for prediction, generating a predicted video frame and a predicted RGB frame difference image.
Regarding the network architecture, as shown in fig. 2, the network consists of two structurally identical autoencoder sub-networks; autoencoders are widely used for feature extraction and for image reconstruction and prediction tasks. Taking the appearance sub-network as an example, the sub-network is a cascade of an encoder E_a, a memory enhancement module M_a and a decoder D_a. The encoder and the decoder are connected by skip connections at feature layers of the same resolution, and the memory enhancement module performs normal-sample feature enhancement on the features extracted by the encoder before sending them to the decoder for reconstruction. For the encoder and the decoder, the invention improves their up-sampling and down-sampling layers. As shown in fig. 3, the improved up-sampling and down-sampling modules both adopt a residual-like structure: the two branches of the down-sampling module apply, respectively, convolution operations with different kernels and a max-pooling operation, while the up-sampling module uses deconvolution operations with convolution kernels of different sizes. The improved modules capture richer information and extract more effective semantic features.
Let the input of the appearance sub-network be I_1, I_2, …, I_t. The encoder E_a extracts by down-sampling the deep features Z_a describing the image scene, the target appearance and related information; the memory enhancement module M_a then performs normal-sample memory enhancement on the features Z_a to obtain the enhanced features Z'_a; and the decoder D_a takes Z'_a as input and predicts the (t+1)-th frame Î_{t+1}. The calculation is shown in formula (1):
Î_{t+1} = D_a(M_a(E_a(I_1, I_2, …, I_t; θ_{E_a}); θ_{M_a}); θ_{D_a})        (1)
where θ_{E_a}, θ_{M_a} and θ_{D_a} denote the parameters of the encoder E_a, the memory enhancement module M_a and the decoder D_a, respectively.
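The channel widths, kernel sizes and normalization layers of the improved modules are not specified in the text above; the following PyTorch sketch only illustrates the described two-branch, residual-like structure (a convolution branch and a max-pooling branch for down-sampling, and two deconvolutions with different kernel sizes for up-sampling) under assumed hyper-parameters.

```python
import torch
import torch.nn as nn

class DownBlock(nn.Module):
    """Residual-like down-sampling: a strided-convolution branch plus a
    1x1-convolution + max-pooling branch, summed (assumed configuration)."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv_branch = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )
        self.pool_branch = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=1),
            nn.MaxPool2d(kernel_size=2, stride=2),
        )

    def forward(self, x):
        return self.conv_branch(x) + self.pool_branch(x)

class UpBlock(nn.Module):
    """Residual-like up-sampling: two deconvolutions with different kernel
    sizes, both doubling the spatial resolution, summed (assumed configuration)."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.deconv3 = nn.ConvTranspose2d(in_ch, out_ch, kernel_size=3, stride=2,
                                          padding=1, output_padding=1)
        self.deconv5 = nn.ConvTranspose2d(in_ch, out_ch, kernel_size=5, stride=2,
                                          padding=2, output_padding=1)

    def forward(self, x):
        return torch.relu(self.deconv3(x) + self.deconv5(x))

# Example: halve and then restore the resolution of a feature map.
x = torch.randn(1, 16, 64, 64)
y = UpBlock(32, 16)(DownBlock(16, 32)(x))   # y has shape (1, 16, 64, 64)
```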
The memory enhancement module is described in detail as follows:
The module contains a memory bank that stores M normal-sample feature vectors locally. In the training phase, the encoder feeds all the extracted normal-sample features into the module, which learns the M features that best characterize the normal samples and stores them locally. The function of the module is realized by two operations, reading and updating.
The read operation, which exists in both the training and testing phases of the network, produces the enhanced features used by the decoder for reconstruction. The read operation proceeds as follows: for the output features z of the encoder, the cosine similarity between z and the stored features p in the memory bank is calculated as shown in formula (2):
s(z_k, p_m) = (z_k · p_m) / (||z_k|| ||p_m||)        (2)
where k and m are the indices of the features z and p, respectively. A softmax function is applied to s(z_k, p_m) to obtain the read weights ω_{k,m}, as shown in formula (3):
ω_{k,m} = exp(s(z_k, p_m)) / Σ_{m'=1}^{M} exp(s(z_k, p_{m'}))        (3)
Applying the calculated weights ω_{k,m} to the memory bank features p yields the memory-enhanced features ẑ_k, calculated as follows:
ẑ_k = Σ_{m=1}^{M} ω_{k,m} p_m        (4)
the updating operation only exists in the training stage and is used for learning the characteristic features of the normal sample, firstly, the cosine similarity is calculated by using the formula (1), and then, the updating weight v is calculatedm,kThe calculation method is as the formula (5):
the calculation method of the updated local memory is as the formula (6):
in order for the memory items to really remember the characteristics of the normal samples, the module introduces a characteristic compression loss LcAnd characteristic separation loss LsTwo loss functions. Characteristic compression loss LcAs shown in equation (7):
in the above formula pτRepresenting all memory items with zkThe one with the highest similarity.
The feature separation loss L_s is calculated as shown in equation (9), where τ and γ denote the values of the index m at which ω_{k,m} in formula (3) takes its largest and second-largest values, respectively.
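Equations (7) and (9) are likewise given in the drawings; the Python sketch below assumes the usual compactness/separateness pattern for such losses, in which L_c pulls each feature toward its most similar memory item p_τ and L_s keeps it a margin away from the second most similar item p_γ. It is an illustrative assumption, not the exact losses of the invention.

```python
import numpy as np

def memory_losses(z, p, margin=1.0):
    """z: (K, C) encoder features; p: (M, C) memory items.
    Returns assumed stand-ins for the compression loss L_c (eq. 7)
    and the separation loss L_s (eq. 9)."""
    zn = z / np.linalg.norm(z, axis=1, keepdims=True)
    pn = p / np.linalg.norm(p, axis=1, keepdims=True)
    sim = zn @ pn.T                                # cosine similarity, formula (2)
    order = np.argsort(-sim, axis=1)
    tau, gamma = order[:, 0], order[:, 1]          # most and second-most similar items
    d_tau = np.linalg.norm(z - p[tau], axis=1)
    d_gamma = np.linalg.norm(z - p[gamma], axis=1)
    l_c = np.mean(d_tau ** 2)                                  # assumed form of L_c
    l_s = np.mean(np.maximum(d_tau - d_gamma + margin, 0.0))   # assumed form of L_s
    return l_c, l_s
```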
Step 3: the predicted video frame produced by the appearance sub-network and the predicted RGB frame difference map produced by the motion sub-network in step 2 are added and fused to obtain the network's final predicted video frame for frame t+1.
Step 4: the anomaly score is calculated as follows.
First, the peak signal-to-noise ratio (PSNR) between the predicted (t+1)-th frame and the real frame I_{t+1} is calculated as shown in equation (10), where N denotes the number of pixels of the (t+1)-th frame image I_{t+1}.
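Equation (10) itself appears in the drawings; for reference, prediction-based anomaly detection work typically defines the PSNR as below, with the peak taken as the maximum intensity of the predicted frame, which is an assumption here.

$$\mathrm{PSNR}\big(I_{t+1},\hat{I}_{t+1}\big)=10\,\log_{10}\frac{\big[\max(\hat{I}_{t+1})\big]^{2}}{\tfrac{1}{N}\sum_{i=1}^{N}\big(I_{t+1,i}-\hat{I}_{t+1,i}\big)^{2}}$$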
Secondly, for each output feature z_k of the appearance and motion sub-network encoders, the L2 distance to the memory item feature p_τ of the memory enhancement module is calculated and used as the feature similarity score of the corresponding sub-network, as shown in formula (11), where τ is the index of the memory item feature with the highest similarity to z_k.
Finally, after the three scores are normalized to [0,1], the weight of each score is balanced by a hyper-parameter β and the final anomaly score is obtained, as shown in formula (12). In formula (12), D'_a(z_a, p_a) and D'_m(z_m, p_m) denote the appearance feature similarity score and the motion feature similarity score, respectively, and the remaining term is the normalized PSNR.
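Formula (12) is given in the drawings; the following Python sketch assumes one common convention for combining the three normalized per-frame scores (min-max normalization over the test video, low PSNR and large feature distances indicating abnormality, and β balancing the prediction-quality term against the feature-similarity terms). The weighting scheme is an assumption for illustration.

```python
import numpy as np

def anomaly_scores(psnr, d_app, d_mot, beta=0.5):
    """psnr, d_app, d_mot: per-frame arrays of the PSNR of the final predicted frame
    and the L2 feature-distance scores of the appearance and motion sub-networks.
    Returns per-frame anomaly scores in [0, 1] under an assumed weighting."""
    def norm01(x):  # min-max normalization to [0, 1] over the sequence
        x = np.asarray(x, dtype=float)
        return (x - x.min()) / (x.max() - x.min() + 1e-8)

    p = 1.0 - norm01(psnr)      # poor prediction quality -> higher abnormality
    a = norm01(d_app)           # far from appearance memory items -> higher abnormality
    m = norm01(d_mot)           # far from motion memory items -> higher abnormality
    return beta * p + (1.0 - beta) * 0.5 * (a + m)
```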
In order to verify the effectiveness of the method, training and testing were carried out on three data sets commonly used in the field of video abnormal behavior detection: Avenue, UCSD-Ped2 and ShanghaiTech. Four deep-learning-based abnormal behavior detection methods were selected for comparison, specifically:
the method comprises the following steps: the methods proposed by Abati et al, references "D.Abati, A.Porrello, S.Calderara, and R.Cucchiaara," tension space autogiration for novel detection, "in Proceedings of the IEEE Conference on Computer Vision and Pattern registration 2019, pp.481-490".
Method 2: the method proposed by Nguyen et al., reference: T.-N. Nguyen and J. Meunier, "Anomaly detection in video sequence with appearance-motion correspondence," in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 1273-1283.
Method 3: the method proposed by Liu et al., reference: W. Liu, W. Luo, D. Lian, and S. Gao, "Future frame prediction for anomaly detection - a new baseline," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 6536-6545.
Method 4: the method proposed by Gong et al., reference: D. Gong et al., "Memorizing normality to detect anomaly: Memory-augmented deep autoencoder for unsupervised anomaly detection," in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 1705-1714.
As shown in Table 1, with AUC as the evaluation index on the three data sets, the detection accuracy of the method provided by the invention is clearly superior to that of the four comparison methods.
Table 1. Comparison of the evaluation index (AUC) with the other methods
Finally, it should be noted that the above examples are only used to illustrate the technical solutions of the present invention, but not to limit the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.