CN117132919A - Multi-scale high-dimensional feature analysis unsupervised learning video anomaly detection method - Google Patents

Multi-scale high-dimensional feature analysis unsupervised learning video anomaly detection method

Info

Publication number
CN117132919A
Authority
CN
China
Prior art keywords
feature
data
convolution
video
features
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311010650.XA
Other languages
Chinese (zh)
Inventor
汪炜杰
樊谨
陈淼
陈琪凯
杨勇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Dianzi University
Original Assignee
Hangzhou Dianzi University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Dianzi University filed Critical Hangzhou Dianzi University
Priority to CN202311010650.XA
Publication of CN117132919A
Pending legal-status Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/088 Non-supervised learning, e.g. competitive learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/26 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Computational Linguistics (AREA)
  • Databases & Information Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a multi-scale high-dimensional feature analysis unsupervised learning video anomaly detection method. The method adopts a parallel network that extracts features with convolution and self-attention as its core module and an end-to-end auto-encoder architecture overall: the representation features and spatio-temporal features of a video sequence are extracted by a convolution layer and a Transformer attention calculation layer, all features are fused, the multi-scale features are decoded and predicted in combination with a memory network, and a predicted video frame is finally output. By using an end-to-end network to extract the representation features and spatio-temporal features of video sequence images, the method improves video frame prediction precision, enhances the diversity of the feature information of normal training samples, and improves the anomaly score evaluation of video anomaly detection.

Description

Multi-scale high-dimensional feature analysis unsupervised learning video anomaly detection method
Technical Field
The invention relates to the technical field of anomaly detection, in particular to a multi-scale high-dimensional feature analysis unsupervised learning video anomaly detection method.
Background
Video anomaly detection is the task of detecting abnormal events in video; it is a long-standing research topic in academia and has important application value in industry. Because abnormal events are defined differently in different scenes, any event that does not conform to normal behavior logic can be regarded as abnormal. Since processing video is difficult and complex, video anomaly detection is extremely challenging. It is impossible to collect a dataset containing all precisely defined abnormal events, so the problem cannot be solved by standard classification methods. Existing work typically trains a model on normal data to learn the distribution of normal behavior, and during testing the model determines anomalies by comparing the distribution gap between a given test sample and the training samples.
In recent years, convolutional neural network-based methods have been widely used in Auto-Encoder (AE) models for video anomaly detection, mainly because the convolution operation captures local features as effective image representations in a hierarchical manner. Despite their advantages in local feature extraction, however, convolutional neural networks have difficulty capturing global features of video, such as the spatio-temporal relationships between successive frames and the feature dependencies among normal-sample video data. As the Vision Transformer architecture based on the self-attention mechanism has shown good performance in computer vision, more and more methods apply self-attention to vision tasks such as image classification. By means of the self-attention mechanism and multi-layer perceptron structure, the Vision Transformer models complex spatial sequence relationships and long-range feature dependencies and thus constructs global representation features, and it has fewer inductive-bias limitations than convolutional neural networks. Although convolution kernels are designed to capture short-range spatio-temporal information, they cannot model dependencies beyond the receptive field; deep stacking of convolutions can expand the receptive field, but such methods capture long-range dependencies only by aggregating shorter-range information and therefore have clear limitations. In contrast, by directly comparing the feature information of all spatio-temporal positions, an attention mechanism can capture global dependencies beyond the empirical range of traditional convolution filtering. However, the Vision Transformer ignores local feature details, so that when handling image-type frame sequences the Transformer cannot attend sensitively to weak changes in the video.
Disclosure of Invention
The invention aims to provide a multi-scale high-dimensional feature analysis unsupervised learning video anomaly detection method, which improves the video frame prediction precision, enhances the diversity of training normal sample feature information and improves the anomaly score evaluation of video anomaly detection.
In order to achieve the above purpose, the invention provides a multi-scale high-dimensional feature analysis unsupervised learning video anomaly detection method, which comprises the following steps:
step S1: data preprocessing, namely dividing a test set and a training set video into image frames;
step S2: according to the training data set obtained in the step S1, carrying out batch processing on training set data, and transmitting the training set data into a self-encoder model, wherein each group of data comprises 5 continuous video frames;
step S3: the data are respectively transmitted into a convolution channel encoder and a Transformer channel encoder (Encoder), and the features of the two are fused to serve as the representation features and spatio-temporal features;
step S4: inputting the features into a Memory Network (Memory Network), and performing normal sample feature Memory storage and updating on the features obtained in the step S3 through Softmax function calculation to obtain diversified normal event data;
step S5: transmitting the updated and stored video features into a Decoder (Decoder) to generate predicted frame data;
step S6: calculating the Mean Square Error (MSE) between the predicted frame generated in step S5 and the real frame, carrying out back propagation, and updating network parameters to obtain a trained self-encoder model;
step S7: testing the self-encoder model with the verification data set;
step S8: calculating the Mean Square Error (MSE) between the predicted frames generated from the verification data set and the corresponding real frames, and averaging the MSE over all groups of data to obtain the MSE error on the verification data set;
step S9: repeating the steps S2 to S8 until the Mean Square Error (MSE) obtained in the step S8 no longer decreases significantly;
step S10: loading the network parameter model from the training task into the test task, transmitting the test data set into the model for anomaly score calculation, and finally obtaining the evaluation score by calculating the area under the ROC curve (AUC).
Preferably, step S2 specifically includes:
step S21: setting, as required, the size (width and height) of each group of images, the length of the data sequence, and the sequence step length;
step S22: grouping by a sliding window mechanism, wherein the window length is the sequence length and the window moves by one frame each time;
step S23: after the training data grouping is completed, the same sliding window data processing is carried out on the test set data.
Preferably, the end-to-end self-encoder model in step S3 is composed of a convolution channel encoder, a Transformer channel encoder, a feature fusion module, a memory network, and a decoder (Decoder).
Preferably, the specific method of step S3 is as follows:
the encoder is divided into a convolution channel encoder module and a Transformer channel encoder module;
the core module of the convolution channel encoder is a convolutional neural network, and N layers of convolution operations are used to perform feature extraction on the image; the convolution channel encoder module formula is as follows:
F_c^i = ReLU(BN(Conv(F_p^(i-1)))), with F_p^0 = F
F_p^i = MaxPool(F_c^i)
wherein F represents an input training sample; F_c^i represents the representation features extracted by convolution, i denoting the i-th layer of convolution feature extraction, BN is a feature normalization function, and ReLU is an activation function; F_p^i is the feature after resolution reduction, and the MaxPool function is a maximum pooling function;
the core of the Transformer channel encoder is self-attention mechanism calculation: first, the video sequence frames are divided frame by frame into N×N image blocks and normalized by LayerNorm (LN); sequence features are extracted through Sequential Attention (SA); dimension transformation is performed through LayerNorm (LN) and a multi-layer perceptron (MLP); frame-global feature self-attention extraction (Attention, AT) is then performed; and finally the results are spliced and output, the overall formulas being as follows:
F_{1...p} = {F_1, ..., F_p} = Patch(F)
F_SA = MLP(LN(SA(LN(F_{1...p}))))
F_t = AT(F_SA) ⊕ F_SA
wherein F_{1...p} represents the image blocks after F is segmented, F_SA represents the feature result after the sequential attention calculation, F_t represents the spatio-temporal features connected after the self-attention computation and the sequential attention computation, and ⊕ represents a connection (concatenation) symbol;
the feature dimensions of the convolution operation and the Transformer differ: the image block dimension in the Transformer is (1+L)×E, wherein 1 and L are respectively the identification position information and the number of the image blocks, and E represents the embedding dimension;
in the convolution channel the feature dimensions are H×W×C, corresponding to height, width and number of channels. Therefore, in the Transformer channel a 1×1 convolution is first used to align the spatial dimensions of the feature information with the corresponding stage of the convolution layer, BatchNorm (BN) regularization is then applied, and the Interpolate operation finally restores the features to the corresponding feature resolution; in the convolution channel, the dimension of the feature information captured by each convolution block is reduced by an average pooling operation, and feature dimension conversion and LayerNorm (LN) regularization then make it identical to the feature dimensions in the Transformer channel. The overall conversion formulas are as follows:
F′_c = Interpolate(Conv(Reshape(F_t)))
F_ct = Concat(F′_c, F_c)
wherein F_t represents the spatio-temporal features obtained from the Transformer channel, F′_c is the converted convolution-format feature, F_c is the representation feature extracted by the convolution channel, and F_ct represents the final feature data output by the encoder; the Reshape operation represents feature format conversion, the Conv operation samples the input to a given size scale to unify the feature format, and Concat is a splicing function.
Preferably, the specific method of the memory network module in step S4 is as follows:
in the memory network, M prototype features are used to record normal data, each item being defined as M_t (t = 1, ..., M). While the video frame features output by the encoder (Encoder) are passed through the memory network to the decoder (Decoder), a Softmax operation is used to obtain matching weights w_t that capture the relationship between M_t and the encoder feature F_ct; the overall formula is as follows:
w_t = exp(F_ct · M_t) / Σ_{t′=1}^{M} exp(F_ct · M_{t′})
Preferably, the Mean Square Error (MSE) in step S6 is formulated as follows:
MSE = (1/n) Σ_{i=1}^{n} (F̂_i - F_i)²
wherein F̂ is the predicted frame, F is the real frame, and n represents the video sequence length.
Therefore, the multi-scale high-dimensional feature analysis unsupervised learning video anomaly detection method of the present invention has the following beneficial effects:
(1) The representation features and spatio-temporal features of the video sequence images are extracted, the precision of the predicted frames is improved, and the normal-sample feature information of the model is enhanced;
(2) Using the spatio-temporal features together with the representation features improves the fitting ability for sequence fluctuations of the video images, increases the anomaly detection score of the model, and greatly improves the model's effect on video anomaly detection.
The technical scheme of the invention is further described in detail through the drawings and the embodiments.
Drawings
FIG. 1 is a schematic diagram of the overall structure of an embodiment of the present invention;
FIG. 2 is a schematic diagram of a convolutional channel encoder according to an embodiment of the present invention;
FIG. 3 is a block diagram of a Transformer channel encoder according to an embodiment of the present invention;
FIG. 4 is a block diagram of a feature fusion module according to an embodiment of the present invention;
FIG. 5 is a diagram of a memory network module according to an embodiment of the present invention;
FIG. 6 is a comparison of the AUC scores of an embodiment of the present invention with a plurality of existing methods on the four public data sets UCSD Ped1, UCSD Ped2, CUHK Avenue and ShanghaiTech.
Detailed Description
Examples
A multi-scale high-dimensional feature analysis unsupervised learning video anomaly detection method based on end-to-end architecture comprises the following steps:
step S1: and (3) preprocessing data, namely dividing the video of the test set and the training set into image frames.
Step S2: the training data set obtained in the step 1 is used for carrying out batch processing on training set data and transmitting the training set data into a self-encoder model, each group of data comprises 5 continuous video frames, wherein the first 4 frames are used as characteristic extraction information data for realizing frame prediction, and the last 1 frames are used as real frames and predicted frames for carrying out difference comparison;
the method comprises the following specific steps: and selecting a proper public video anomaly detection data set, and carrying out video frame segmentation on the data set to adapt to the requirements of a model on a data format. Firstly, setting the size of each group of images according to the requirement, and respectively corresponding the width and height data of each group of images and the sequence step length of the data sequence. The sliding window mechanism is adopted for grouping, the window length is the sequence length, and each time the window moves by one bit, namely, only one bit of difference between two adjacent groups of data is adopted. After the training data packet is completed, the same sliding window data processing is performed on the test set data.
As shown in fig. 1, the overall structure of the present invention is shown. The data processing and dividing part is at the entrance of the structure of the invention and is responsible for carrying out preliminary processing on the original data to form a data structure required by the prediction model.
Step S3: the data are respectively transmitted into a convolution channel and a converter channel coder (Encoder), and the two characteristics are fused after convolution calculation and self-attention calculation to be the final performance characteristic and the space-time characteristic;
the end-to-end self-encoder model in step S3 is composed of a convolutional channel encoder, a transducer channel encoder, a feature fusion module, a memory network, and a Decoder. The convolution channel encoder needs to input each group of image data, and uses convolution operation to extract the expression characteristics of the image; the transducer channel encoder needs to divide each group of input data into a plurality of image blocks, and uses a self-attention mechanism to perform attention calculation on each image block; features extracted from the two channels are subjected to feature splicing through a feature fusion module to form complete features with performance and space time; the memory network is responsible for receiving the extracted features and storing the features into the memory block, and updating the latest memory features; the decoder analyzes and reconstructs the characteristics, the real frames in each group of data are used as correct results to be compared with the generated predicted frames finally output by the model, and the errors between the two frames are calculated.
As shown in fig. 2, which shows the overall structure of the convolution feature extraction of the present invention, the convolution encoder receives the image sequence in each group of data obtained in step S2 and uses N layers of convolution operations to perform feature extraction on the images. The convolution channel encoder module formula is as follows:
F_c^i = ReLU(BN(Conv(F_p^(i-1)))), with F_p^0 = F
F_p^i = MaxPool(F_c^i)
wherein F represents an input training sample; F_c^i represents the representation features extracted by convolution, i denoting the i-th layer of convolution feature extraction, BN is a feature normalization function, and ReLU is an activation function; F_p^i is the feature after resolution reduction, and the MaxPool function is a maximum pooling function. For the pooling, a two-dimensional sliding window is set and slides one step at a time; the maximum of all image pixel values within the window is taken as the new pixel unit.
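By way of illustration, a minimal PyTorch sketch of one Conv-BN-ReLU-MaxPool stage of such a convolution channel encoder is given below; the layer count, channel widths and input size are illustrative assumptions rather than the configuration claimed by the invention.

import torch
import torch.nn as nn

# One stage of the convolution channel encoder described above:
# Conv -> BatchNorm -> ReLU extracts representation features (F_c^i), and
# MaxPool halves the spatial resolution (F_p^i).
class ConvEncoderStage(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2, stride=2),
        )

    def forward(self, x):
        return self.block(x)

# Stacking N = 3 stages; input is 4 stacked grayscale frames (4 channels).
encoder = nn.Sequential(ConvEncoderStage(4, 64),
                        ConvEncoderStage(64, 128),
                        ConvEncoderStage(128, 256))
features = encoder(torch.randn(2, 4, 256, 256))   # -> (2, 256, 32, 32)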
As shown in fig. 3, which illustrates the overall structure of the Transformer feature extraction of the present invention, the core module of the Transformer channel encoder is self-attention mechanism calculation, used to model the spatio-temporal relationships between video sequence frames. First, the video sequence frames are divided frame by frame into N×N image blocks and normalized by LayerNorm (LN); sequence features are extracted through Sequential Attention (SA); dimension transformation is performed through LayerNorm (LN) and a multi-layer perceptron (MLP); frame-global feature self-attention extraction (Attention, AT) is then performed; and finally the results are spliced and output. The overall formulas are as follows:
F_{1...p} = {F_1, ..., F_p} = Patch(F)
F_SA = MLP(LN(SA(LN(F_{1...p}))))
F_t = AT(F_SA) ⊕ F_SA
wherein F_{1...p} represents the image blocks after F is segmented, F_SA represents the feature result after the sequential attention calculation, F_t represents the spatio-temporal features connected after the self-attention computation and the sequential attention computation, and ⊕ represents a connection (concatenation) symbol;
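By way of illustration, a simplified PyTorch sketch of the patch splitting, sequential attention (SA), LayerNorm/MLP transformation and self-attention (AT) steps is given below; the patch size, embedding width and the use of nn.MultiheadAttention for both attention stages are assumptions made for illustration only.

import torch
import torch.nn as nn

class TransformerChannelSketch(nn.Module):
    # Patch embedding, sequence attention (SA) over the patch tokens,
    # LayerNorm + MLP, then a second self-attention (AT) pass; the two
    # results are concatenated as the spatio-temporal feature F_t.
    def __init__(self, patch=16, in_ch=4, embed=128, heads=4):
        super().__init__()
        self.patch_embed = nn.Conv2d(in_ch, embed, kernel_size=patch, stride=patch)
        self.ln1 = nn.LayerNorm(embed)
        self.ln2 = nn.LayerNorm(embed)
        self.seq_attn = nn.MultiheadAttention(embed, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(embed, embed * 2), nn.GELU(),
                                 nn.Linear(embed * 2, embed))
        self.self_attn = nn.MultiheadAttention(embed, heads, batch_first=True)

    def forward(self, x):                                        # x: (B, C, H, W)
        tokens = self.patch_embed(x).flatten(2).transpose(1, 2)  # (B, L, E)
        normed = self.ln1(tokens)
        sa, _ = self.seq_attn(normed, normed, normed)            # sequential attention
        f_sa = self.mlp(self.ln2(sa))                            # LN + MLP
        at, _ = self.self_attn(f_sa, f_sa, f_sa)                 # self-attention
        return torch.cat([f_sa, at], dim=-1)                     # F_t: (B, L, 2E)

f_t = TransformerChannelSketch()(torch.randn(2, 4, 256, 256))    # -> (2, 256, 256)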
as shown in fig. 4, a feature fusion module is shown for fusing convolutions with features of different scales extracted from a transducer. The convolution operation is different from the characteristic dimension of a transducer, the image block dimension in the transducer is (1+L) x E, wherein 1 and L are respectively the identification position information and the number of the image blocks, and E represents the embedding dimension;
in the convolution, the feature dimensions are h×w×c, and correspond to the height, width and channel number, so in the transform channel, first, 1×1 convolution is used to align the space dimensions of the feature information with the corresponding stage of the convolution layer, then, the regularization process is performed on the feature information by using the Batch Norm (BN), finally, the feature information captured by each convolution block in the convolution channel is restored to the corresponding feature resolution by using the Interpolate operation, the dimension of the feature information captured by each convolution block is reduced by using the average pooling operation, and then, the feature dimension conversion and LayerNorm (LN) regularization modes are used to make the feature information identical to the feature dimensions in the transform channel, and the overall conversion formula is as follows:
F′_c = Interpolate(Conv(Reshape(F_t)))
F_ct = Concat(F′_c, F_c)
wherein F_t represents the spatio-temporal features obtained from the Transformer channel, F′_c is the converted convolution-format feature, F_c is the representation feature extracted by the convolution channel, and F_ct represents the final feature data output by the encoder; the Reshape operation represents feature format conversion, the Conv operation samples the input to a given size scale to unify the feature format, and Concat is a splicing function.
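By way of illustration, a PyTorch sketch of the Transformer-to-convolution alignment direction (Reshape, 1×1 Conv with BatchNorm, Interpolate, then Concat) is given below; the channel counts, the square patch grid and the bilinear interpolation mode are illustrative assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureFusionSketch(nn.Module):
    # Aligns Transformer tokens with the convolution feature map and concatenates
    # them, mirroring F'_c = Interpolate(Conv(Reshape(F_t))) and F_ct = Concat(., .).
    def __init__(self, token_dim=256, conv_ch=256):
        super().__init__()
        self.align = nn.Sequential(nn.Conv2d(token_dim, conv_ch, kernel_size=1),
                                   nn.BatchNorm2d(conv_ch))

    def forward(self, f_t, f_c):
        b, l, e = f_t.shape                                      # tokens: (B, L, E)
        side = int(l ** 0.5)                                     # assume a square patch grid
        fmap = f_t.transpose(1, 2).reshape(b, e, side, side)     # Reshape
        fmap = self.align(fmap)                                  # 1x1 Conv + BN
        fmap = F.interpolate(fmap, size=f_c.shape[-2:],          # Interpolate
                             mode="bilinear", align_corners=False)
        return torch.cat([fmap, f_c], dim=1)                     # Concat -> F_ct

fuse = FeatureFusionSketch()
f_ct = fuse(torch.randn(2, 256, 256), torch.randn(2, 256, 32, 32))  # (2, 512, 32, 32)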
Step S4: inputting the final features into a Memory Network (Memory Network), and performing normal sample feature Memory storage and updating on the features finally obtained in the step S3 through Softmax function calculation to obtain diversified normal event data;
as shown in fig. 5, the memory network module mainly has a Softmax computing and updating memory function. In a memory network, M prototype features for recording data, each item being defined as M t In the process of outputting video frames to an incoming Decoder in a memory network through an Encoder, a Softmax operation is used to obtainComprises M t And encoder feature F ct The relation between the two is as follows:
step S5: the video features updated and stored through the memory network are transmitted into a Decoder (Decoder), and the next frame of the continuous video frame is reconstructed through deconvolution and residual connection operation, so that predicted frame data is generated.
Step S6: and (5) obtaining a final Mean Square Error (MSE) between the generated predicted frame and the reality in the calculation step, then carrying out back propagation through an Adam optimizer, updating network parameters, and finally obtaining a trained self-encoder model.
Step S7: the self-encoder model is tested by validating the data set.
Step S8: and calculating a Mean Square Error (MSE) between the predicted frame generated based on the verification data set and the real predicted frame, solving the Mean Square Error (MSE) of all groups of data, and obtaining the MSE error of the training data set.
Step S9: and repeating the steps S2 to S8 until the Mean Square Error (MSE) obtained in the step S8 is not greatly reduced, which indicates that the model training is basically finished and the network parameters are updated.
Step S10: and loading the network parameter model in the training task into the test task, transmitting the test data set data into the model for abnormal score calculation, and finally obtaining the evaluation score by calculating the area AUC under the ROC curve.
Fig. 6 shows the experimental results of various methods on the four public data sets UCSD Ped1, UCSD Ped2, CUHK Avenue and ShanghaiTech under the same experimental conditions. The metric is the area under the ROC curve (AUC), where the ROC curve is drawn, for different judgment thresholds, with the false alarm probability P(y|N) as the abscissa and the hit probability P(y|SN) as the ordinate. The experimental result of the best performing model under each condition is shown in bold in the table. As can be seen from fig. 6, the multi-scale high-dimensional feature analysis video anomaly detection method based on the end-to-end architecture achieves a larger improvement than the other methods: it is second only to the STAE method on the UCSD Ped1 data set and achieves the best results on the UCSD Ped2, CUHK Avenue and ShanghaiTech data sets.
Therefore, the multi-scale high-dimensional feature analysis unsupervised learning video anomaly detection method improves the accuracy of video frame prediction, enhances the diversity of the feature information of normal training samples, and improves the anomaly score assessment of video anomaly detection.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention and not for limiting it, and although the present invention has been described in detail with reference to the preferred embodiments, it will be understood by those skilled in the art that: the technical scheme of the invention can be modified or replaced by the same, and the modified technical scheme cannot deviate from the spirit and scope of the technical scheme of the invention.

Claims (6)

1. A multi-scale high-dimensional feature analysis unsupervised learning video anomaly detection method, characterized by comprising the following steps:
step S1: data preprocessing, namely dividing a test set and a training set video into image frames;
step S2: according to the training data set obtained in the step S1, carrying out batch processing on training set data, and transmitting the training set data into a self-encoder model, wherein each group of data comprises 5 continuous video frames;
step S3: the data are respectively transmitted into a convolution channel encoder and a Transformer channel encoder, and the features of the two are fused to serve as the representation features and spatio-temporal features;
step S4: inputting the features into a memory network, and performing normal sample feature memory storage and updating on the features obtained in the step S3 through Softmax function calculation to obtain diversified normal event data;
step S5: transmitting the updated and stored video characteristics into a decoder to generate predicted frame data;
step S6: calculating the mean square error between the predicted frame generated in step S5 and the real frame, carrying out back propagation, and updating network parameters to obtain a trained self-encoder model;
step S7: testing the self-encoder model with the verification data set;
step S8: calculating the mean square error between the predicted frames generated from the verification data set and the corresponding real frames, and averaging the mean square error over all groups of data to obtain the mean square error on the verification data set;
step S9: repeating the steps S2 to S8 until the mean square error obtained in the step S8 no longer decreases significantly;
step S10: loading the network parameter model from the training task into the test task, transmitting the test data set into the model for anomaly score calculation, and finally obtaining the evaluation score by calculating the area under the ROC curve (AUC).
2. The multi-scale high-dimensional feature analysis unsupervised learning video anomaly detection method according to claim 1, wherein step S2 specifically comprises:
step S21: setting, as required, the size (width and height) of each group of images, the length of the data sequence, and the sequence step length;
step S22: grouping by a sliding window mechanism, wherein the window length is the sequence length and the window moves by one frame each time;
step S23: after the training data grouping is completed, the same sliding window data processing is carried out on the test set data.
3. The multi-scale high-dimensional feature analysis unsupervised learning video anomaly detection method according to claim 1, wherein the end-to-end self-encoder model described in step S3 is composed of a convolution channel encoder, a Transformer channel encoder, a feature fusion module, a memory network, and a decoder (Decoder).
4. The multi-scale high-dimensional feature analysis unsupervised learning video anomaly detection method according to claim 1, wherein the specific method of step S3 is as follows:
the encoder is divided into a convolution channel encoder module and a Transformer channel encoder module;
the core module of the convolution channel encoder is a convolutional neural network, and N layers of convolution operations are used to perform feature extraction on the image; the convolution channel encoder module formula is as follows:
F_c^i = ReLU(BN(Conv(F_p^(i-1)))), with F_p^0 = F
F_p^i = MaxPool(F_c^i)
wherein F represents an input training sample; F_c^i represents the representation features extracted by convolution, i denoting the i-th layer of convolution feature extraction, BN is a feature normalization function, and ReLU is an activation function; F_p^i is the feature after resolution reduction, and the MaxPool function is a maximum pooling function;
the core module of the Transformer channel encoder is self-attention mechanism calculation: first, the video sequence frames are divided frame by frame into N×N image blocks and normalized by LayerNorm; sequence features are extracted through sequential attention; dimension transformation is performed through LayerNorm and a multi-layer perceptron; frame-global feature self-attention extraction is then performed; and finally the results are spliced and output, wherein the overall formulas are as follows:
F_{1...p} = {F_1, ..., F_p} = Patch(F)
F_SA = MLP(LN(SA(LN(F_{1...p}))))
F_t = AT(F_SA) ⊕ F_SA
wherein F_{1...p} represents the image blocks after F is segmented, F_SA represents the feature result after the sequential attention calculation, F_t represents the spatio-temporal features connected after the self-attention computation and the sequential attention computation, and ⊕ represents a connection (concatenation) symbol;
the image block dimension in the Transformer is (1+L)×E, wherein 1 and L are respectively the identification position information and the number of the image blocks, and E represents the embedding dimension;
in the convolution channel the feature dimensions are H×W×C, corresponding to height, width and number of channels; in the Transformer channel a 1×1 convolution is first used to align the spatial dimensions of the feature information with the corresponding stage of the convolution layer, BatchNorm regularization is then applied, and the Interpolate operation finally restores the features to the corresponding feature resolution; the dimension of the feature information captured by each convolution block in the convolution channel is reduced by an average pooling operation, and feature dimension transformation and LayerNorm regularization then make it identical to the feature dimensions in the Transformer channel, the overall transformation formulas being as follows:
F′_c = Interpolate(Conv(Reshape(F_t)))
F_ct = Concat(F′_c, F_c)
wherein F_t represents the spatio-temporal features obtained from the Transformer channel, F′_c is the converted convolution-format feature, F_c is the representation feature extracted by the convolution channel, and F_ct represents the final feature data output by the encoder; the Reshape operation represents feature format conversion, the Conv operation samples the input to a given size scale to unify the feature format, and Concat is a splicing function.
5. The multi-scale high-dimensional feature analysis unsupervised learning video anomaly detection method according to claim 1, wherein the specific method of the memory network module in step S4 is as follows:
in the memory network, M prototype features are used to record normal data, each item being defined as M_t (t = 1, ..., M); while the video frame features output by the encoder (Encoder) are passed through the memory network to the decoder (Decoder), a Softmax operation is used to obtain matching weights w_t that capture the relationship between M_t and the encoder feature F_ct, the overall formula being as follows:
w_t = exp(F_ct · M_t) / Σ_{t′=1}^{M} exp(F_ct · M_{t′})
6. The multi-scale high-dimensional feature analysis unsupervised learning video anomaly detection method according to claim 1, wherein the mean square error formula in step S6 is as follows:
MSE = (1/n) Σ_{i=1}^{n} (F̂_i - F_i)²
wherein F̂ is the predicted frame, F is the real frame, and n represents the video sequence length.
CN202311010650.XA 2023-08-11 2023-08-11 Multi-scale high-dimensional feature analysis unsupervised learning video anomaly detection method Pending CN117132919A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311010650.XA CN117132919A (en) 2023-08-11 2023-08-11 Multi-scale high-dimensional feature analysis unsupervised learning video anomaly detection method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311010650.XA CN117132919A (en) 2023-08-11 2023-08-11 Multi-scale high-dimensional feature analysis unsupervised learning video anomaly detection method

Publications (1)

Publication Number Publication Date
CN117132919A true CN117132919A (en) 2023-11-28

Family

ID=88850102

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311010650.XA Pending CN117132919A (en) 2023-08-11 2023-08-11 Multi-scale high-dimensional feature analysis unsupervised learning video anomaly detection method

Country Status (1)

Country Link
CN (1) CN117132919A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117556311A (en) * 2024-01-11 2024-02-13 电子科技大学 Unsupervised time sequence anomaly detection method based on multidimensional feature fusion
CN117556311B (en) * 2024-01-11 2024-03-19 电子科技大学 Unsupervised time sequence anomaly detection method based on multidimensional feature fusion

Similar Documents

Publication Publication Date Title
CN113936339B (en) Fighting identification method and device based on double-channel cross attention mechanism
CN108805015B (en) Crowd abnormity detection method for weighted convolution self-coding long-short term memory network
CN112287816B (en) Dangerous work area accident automatic detection and alarm method based on deep learning
CN111738363B (en) Alzheimer disease classification method based on improved 3D CNN network
WO2022083335A1 (en) Self-attention mechanism-based behavior recognition method
CN105138973A (en) Face authentication method and device
CN111709313B (en) Pedestrian re-identification method based on local and channel combination characteristics
CN111832484A (en) Loop detection method based on convolution perception hash algorithm
CN110930378B (en) Emphysema image processing method and system based on low data demand
CN112818969A (en) Knowledge distillation-based face pose estimation method and system
CN111738054A (en) Behavior anomaly detection method based on space-time self-encoder network and space-time CNN
CN116824239A (en) Image recognition method and system based on transfer learning and ResNet50 neural network
CN117132919A (en) Multi-scale high-dimensional feature analysis unsupervised learning video anomaly detection method
CN115131313A (en) Hyperspectral image change detection method and device based on Transformer
CN116563243A (en) Foreign matter detection method and device for power transmission line, computer equipment and storage medium
CN115546223A (en) Method and system for detecting loss of fastening bolt of equipment under train
CN114821299A (en) Remote sensing image change detection method
CN111242028A (en) Remote sensing image ground object segmentation method based on U-Net
CN116994175A (en) Space-time combination detection method, device and equipment for depth fake video
CN114387524B (en) Image identification method and system for small sample learning based on multilevel second-order representation
CN109800719B (en) Low-resolution face recognition method based on sparse representation of partial component and compression dictionary
CN114140524A (en) Closed loop detection system and method for multi-scale feature fusion
CN113537240A (en) Deformation region intelligent extraction method and system based on radar sequence image
CN112464989A (en) Closed loop detection method based on target detection network
CN111696070A (en) Multispectral image fusion power internet of things fault point detection method based on deep learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination