Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides an abnormal behavior detection method based on dual prediction of appearance and motion features, and aims to design a dual-stream network structure incorporating a memory enhancement module for predicting appearance and motion features, so that abnormal video frames yield larger prediction errors and the accuracy of abnormal behavior detection is improved.
The invention adopts the following technical scheme: an abnormal behavior detection method based on dual prediction of appearance and motion features, comprising the following steps:
(1) sequentially reading video frames, calculating the inter-frame difference of adjacent images, and acquiring a video frame sequence with a fixed length and a corresponding frame difference image sequence;
(2) using a dual-stream network model incorporating a memory enhancement module, extracting the appearance and motion features characteristic of normal behaviors through an appearance sub-network and a motion sub-network respectively, and predicting a video frame image and a frame difference image;
(3) adding and fusing the predicted video frame and the frame difference image to obtain a final predicted video frame;
(4) obtaining the anomaly score of the frame by measuring the motion and appearance features extracted by the memory enhancement module and the quality of the final predicted image.
Compared with the prior art, the invention has the beneficial effects that:
1. the invention feeds both the video frame sequence and the RGB frame difference image sequence as input into a dual-stream convolutional autoencoder network for prediction; compared with prior methods that use optical flow maps to extract motion features, the use of frame difference images reduces the network complexity and the amount of computation;
2. the invention improves the encoder and decoder structures of the autoencoder network, so that features are extracted more effectively and the image prediction quality is improved;
3. the method adds a memory enhancement module that better learns the features of normal samples, enhances the robustness of the network and enables abnormal videos to obtain higher anomaly scores;
4. the method takes into account both the quality of the predicted image and the feature similarity score between the extracted sample features and the normal behavior features as the evaluation basis, which effectively improves the anomaly detection performance and reduces the false detection rate.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to the accompanying drawings. It should be understood that the detailed description and specific examples, while indicating the invention, are intended for purposes of illustration only and are not intended to limit the scope of the invention.
As shown in fig. 1-2, an abnormal behavior detection method based on appearance and motion feature dual prediction includes the following steps:
(1) sequentially reading video frames, calculating the inter-frame difference of adjacent images, and acquiring a video frame sequence with a fixed length and a corresponding frame difference image sequence;
(2) using a dual-stream network model incorporating a memory enhancement module, extracting the appearance and motion features characteristic of normal behaviors through an appearance sub-network and a motion sub-network respectively, and predicting a video frame image and a frame difference image;
(3) adding and fusing the predicted video frame and the frame difference image to obtain a final predicted video frame;
(4) obtaining the frame anomaly score by measuring the motion and appearance features extracted by the memory enhancement module and the quality of the final predicted image.
The detailed steps are as follows:
Step 1: acquire fixed-length video frames and frame difference maps. A video stream is obtained from a fixed camera; after the video is divided into frames, a continuous video frame sequence of fixed length t is selected, of which the first t-1 frame images are fed directly into the appearance sub-network. For the video stream of a fixed camera, a background image I_B of the video can be acquired with an OpenCV method; I_B is then subtracted from the t RGB video frames to obtain foreground images I'_1, I'_2, …, I'_t free of background noise. Finally, each frame of the foreground image sequence is subtracted from the frame that follows it, yielding the t-1 consecutive frame difference images X_1, X_2, …, X_{t-1} required by the motion sub-network.
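As an illustration of step 1, the following is a minimal Python sketch of how the fixed-length frame sequence and its frame difference maps could be assembled; the temporal-median background estimate and the use of absolute differences are assumptions standing in for whichever OpenCV background method the invention actually uses.

```python
import cv2
import numpy as np

def build_sequences(video_path, t=5):
    """Return t consecutive RGB frames and the t-1 RGB frame difference maps."""
    cap = cv2.VideoCapture(video_path)
    frames = []
    ok, frame = cap.read()
    while ok:
        frames.append(frame)
        ok, frame = cap.read()
    cap.release()

    # Assumed background model for a fixed camera: per-pixel temporal median.
    background = np.median(np.stack(frames), axis=0).astype(np.uint8)

    seq = frames[:t]                                        # fixed-length frame sequence
    fg = [cv2.absdiff(f, background) for f in seq]          # foreground images I'_1..I'_t
    diffs = [cv2.absdiff(fg[i + 1], fg[i]) for i in range(t - 1)]  # X_1..X_{t-1}
    return seq, diffs
```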
Step 2: the fixed-length video frames and the frame difference images are fed respectively into the dual-stream network incorporating the memory enhancement module for prediction, generating a predicted video frame and a predicted RGB frame difference image.
Regarding the network architecture, as shown in fig. 2, the network consists of two structurally identical autoencoder sub-networks; autoencoders are widely used for feature extraction and for image reconstruction and prediction tasks. Taking the appearance sub-network as an example, the sub-network is a cascade of an encoder E_a, a memory enhancement module M_a and a decoder D_a. The encoder and the decoder are connected by skip connections at feature layers of the same resolution, and the memory enhancement module performs normal-sample feature enhancement on the features extracted by the encoder before sending them to the decoder for reconstruction. For the encoder and the decoder, the invention improves their up-sampling and down-sampling layers. As shown in fig. 3, the improved up-sampling and down-sampling modules both adopt a residual-like structure: the two branches of the down-sampling module apply, respectively, convolution operations with different kernels and a max-pooling operation, while the up-sampling module uses deconvolution operations with convolution kernels of different sizes. The improved modules capture richer information and extract more effective semantic features.
Let the input of the appearance sub-network be I_1, I_2, …, I_t. The encoder E_a extracts by down-sampling the deep features Z_a describing the image scene, the target appearance and related information; the memory enhancement module M_a then performs normal-sample memory enhancement on the features Z_a to obtain the enhanced features Z'_a; and the decoder D_a takes Z'_a as input and predicts the (t+1)-th frame Î_{t+1}. The calculation is shown in formula (1):
Î_{t+1} = D_a(M_a(E_a(I_1, I_2, …, I_t; θ_{E_a}); θ_{M_a}); θ_{D_a})        (1)
where θ_{E_a}, θ_{M_a} and θ_{D_a} denote the parameters of the encoder E_a, the memory enhancement module M_a and the decoder D_a, respectively.
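The channel widths, kernel sizes and normalization layers of the improved modules are not specified in the text above; the following PyTorch sketch only illustrates the described two-branch, residual-like structure (a convolution branch and a max-pooling branch for down-sampling, and two deconvolutions with different kernel sizes for up-sampling) under assumed hyper-parameters.

```python
import torch
import torch.nn as nn

class DownBlock(nn.Module):
    """Residual-like down-sampling: a strided-convolution branch plus a
    1x1-convolution + max-pooling branch, summed (assumed configuration)."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv_branch = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )
        self.pool_branch = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=1),
            nn.MaxPool2d(kernel_size=2, stride=2),
        )

    def forward(self, x):
        return self.conv_branch(x) + self.pool_branch(x)

class UpBlock(nn.Module):
    """Residual-like up-sampling: two deconvolutions with different kernel
    sizes, both doubling the spatial resolution, summed (assumed configuration)."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.deconv3 = nn.ConvTranspose2d(in_ch, out_ch, kernel_size=3, stride=2,
                                          padding=1, output_padding=1)
        self.deconv5 = nn.ConvTranspose2d(in_ch, out_ch, kernel_size=5, stride=2,
                                          padding=2, output_padding=1)

    def forward(self, x):
        return torch.relu(self.deconv3(x) + self.deconv5(x))

# Example: halve and then restore the resolution of a feature map.
x = torch.randn(1, 16, 64, 64)
y = UpBlock(32, 16)(DownBlock(16, 32)(x))   # y has shape (1, 16, 64, 64)
```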
The memory enhancement module is described in detail as follows:
The module contains a memory bank that stores M normal-sample feature vectors locally. In the training phase, the encoder feeds all the extracted normal-sample features into the module, which learns the M features that best characterize the normal samples and stores them locally. The function of the module is realized by two operations, reading and updating.
The read operation, which exists in both the training and testing phases of the network, produces the enhanced features used by the decoder for reconstruction. The read operation proceeds as follows: for the output features z of the encoder, the cosine similarity between z and the stored features p in the memory bank is calculated as shown in formula (2):
s(z_k, p_m) = (z_k · p_m) / (||z_k|| ||p_m||)        (2)
where k and m are the indices of the features z and p, respectively. A softmax function is applied to s(z_k, p_m) to obtain the read weights ω_{k,m}, as shown in formula (3):
ω_{k,m} = exp(s(z_k, p_m)) / Σ_{m'=1}^{M} exp(s(z_k, p_{m'}))        (3)
Applying the calculated weights ω_{k,m} to the memory bank features p yields the memory-enhanced features ẑ_k, calculated as follows:
ẑ_k = Σ_{m=1}^{M} ω_{k,m} p_m        (4)
the updating operation only exists in the training stage and is used for learning the characteristic features of the normal sample, firstly, the cosine similarity is calculated by using the formula (1), and then, the updating weight v is calculatedm,kThe calculation method is as the formula (5):
the calculation method of the updated local memory is as the formula (6):
in order for the memory items to really remember the characteristics of the normal samples, the module introduces a characteristic compression loss LcAnd characteristic separation loss LsTwo loss functions. Characteristic compression loss LcAs shown in equation (7):
in the above formula pτRepresenting all memory items with zkThe one with the highest similarity.
The feature separation loss L_s is calculated as shown in equation (9), where τ and γ denote the values of the index m at which ω_{k,m} in formula (3) takes its largest and second-largest values, respectively.
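Equations (7) and (9) are likewise given in the drawings; the Python sketch below assumes the usual compactness/separateness pattern for such losses, in which L_c pulls each feature toward its most similar memory item p_τ and L_s keeps it a margin away from the second most similar item p_γ. It is an illustrative assumption, not the exact losses of the invention.

```python
import numpy as np

def memory_losses(z, p, margin=1.0):
    """z: (K, C) encoder features; p: (M, C) memory items.
    Returns assumed stand-ins for the compression loss L_c (eq. 7)
    and the separation loss L_s (eq. 9)."""
    zn = z / np.linalg.norm(z, axis=1, keepdims=True)
    pn = p / np.linalg.norm(p, axis=1, keepdims=True)
    sim = zn @ pn.T                                # cosine similarity, formula (2)
    order = np.argsort(-sim, axis=1)
    tau, gamma = order[:, 0], order[:, 1]          # most and second-most similar items
    d_tau = np.linalg.norm(z - p[tau], axis=1)
    d_gamma = np.linalg.norm(z - p[gamma], axis=1)
    l_c = np.mean(d_tau ** 2)                                  # assumed form of L_c
    l_s = np.mean(np.maximum(d_tau - d_gamma + margin, 0.0))   # assumed form of L_s
    return l_c, l_s
```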
Step 3: the predicted video frame produced by the appearance sub-network and the predicted RGB frame difference map produced by the motion sub-network in step 2 are added and fused to obtain the network's final predicted video frame for frame t+1.
Step 4: the anomaly score is calculated as follows.
First, the peak signal-to-noise ratio (PSNR) between the predicted (t+1)-th frame and the real frame I_{t+1} is calculated as shown in equation (10), where N denotes the number of pixels of the (t+1)-th frame image I_{t+1}.
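Equation (10) itself appears in the drawings; for reference, prediction-based anomaly detection work typically defines the PSNR as below, with the peak taken as the maximum intensity of the predicted frame, which is an assumption here.

$$\mathrm{PSNR}\big(I_{t+1},\hat{I}_{t+1}\big)=10\,\log_{10}\frac{\big[\max(\hat{I}_{t+1})\big]^{2}}{\tfrac{1}{N}\sum_{i=1}^{N}\big(I_{t+1,i}-\hat{I}_{t+1,i}\big)^{2}}$$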
Secondly, for each output feature z_k of the appearance and motion sub-network encoders, the L2 distance to the memory item feature p_τ of the memory enhancement module is calculated and used as the feature similarity score of the corresponding sub-network, as shown in formula (11), where τ is the index of the memory item feature with the highest similarity to z_k.
Finally, after the three scores are normalized to [0,1], the weight of each score is balanced by a hyper-parameter β and the final anomaly score is obtained, as shown in formula (12). In formula (12), D'_a(z_a, p_a) and D'_m(z_m, p_m) denote the appearance feature similarity score and the motion feature similarity score, respectively, and the remaining term is the normalized PSNR.
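Formula (12) is given in the drawings; the following Python sketch assumes one common convention for combining the three normalized per-frame scores (min-max normalization over the test video, low PSNR and large feature distances indicating abnormality, and β balancing the prediction-quality term against the feature-similarity terms). The weighting scheme is an assumption for illustration.

```python
import numpy as np

def anomaly_scores(psnr, d_app, d_mot, beta=0.5):
    """psnr, d_app, d_mot: per-frame arrays of the PSNR of the final predicted frame
    and the L2 feature-distance scores of the appearance and motion sub-networks.
    Returns per-frame anomaly scores in [0, 1] under an assumed weighting."""
    def norm01(x):  # min-max normalization to [0, 1] over the sequence
        x = np.asarray(x, dtype=float)
        return (x - x.min()) / (x.max() - x.min() + 1e-8)

    p = 1.0 - norm01(psnr)      # poor prediction quality -> higher abnormality
    a = norm01(d_app)           # far from appearance memory items -> higher abnormality
    m = norm01(d_mot)           # far from motion memory items -> higher abnormality
    return beta * p + (1.0 - beta) * 0.5 * (a + m)
```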
In order to verify the effectiveness of the method, training and testing were carried out on three data sets commonly used in the field of video abnormal behavior detection: Avenue, UCSD-Ped2 and ShanghaiTech. Four deep-learning-based abnormal behavior detection methods were selected for comparison, specifically:
the method comprises the following steps: the methods proposed by Abati et al, references "D.Abati, A.Porrello, S.Calderara, and R.Cucchiaara," tension space autogiration for novel detection, "in Proceedings of the IEEE Conference on Computer Vision and Pattern registration 2019, pp.481-490".
Method 2: the method proposed by Nguyen et al., reference: T.-N. Nguyen and J. Meunier, "Anomaly detection in video sequence with appearance-motion correspondence," in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 1273-1283.
Method 3: the method proposed by Liu et al., reference: W. Liu, W. Luo, D. Lian, and S. Gao, "Future frame prediction for anomaly detection - a new baseline," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 6536-6545.
Method 4: the method proposed by Gong et al., reference: D. Gong et al., "Memorizing normality to detect anomaly: Memory-augmented deep autoencoder for unsupervised anomaly detection," in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 1705-1714.
As shown in Table 1, with AUC as the evaluation index on the three data sets, the detection accuracy of the method provided by the invention is clearly superior to that of the four comparison methods.
Table 1. Comparison of the evaluation index (AUC) with the other methods
Finally, it should be noted that the above examples are only used to illustrate the technical solutions of the present invention, but not to limit the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.