CN115527151B - Video anomaly detection method, system, electronic equipment and storage medium - Google Patents


Info

Publication number
CN115527151B
CN115527151B (application CN202211374647.1A)
Authority
CN
China
Prior art keywords
video
characteristic
feature
prediction
anomaly detection
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202211374647.1A
Other languages
Chinese (zh)
Other versions
CN115527151A (en)
Inventor
崔振
朱小涵
曾志勇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Science and Technology
Original Assignee
Nanjing University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Science and Technology filed Critical Nanjing University of Science and Technology
Priority to CN202211374647.1A priority Critical patent/CN115527151B/en
Publication of CN115527151A publication Critical patent/CN115527151A/en
Application granted granted Critical
Publication of CN115527151B publication Critical patent/CN115527151B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/41Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/44Event detection
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/46Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • General Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a video anomaly detection method, system, electronic device and storage medium, relating to the technical field of computer vision. The method comprises: acquiring target input data, the target input data being continuous frame images of a target video; and determining whether an abnormality exists in the target video according to the target input data and a video anomaly detection model, and outputting prediction data. The video anomaly detection model comprises a feature extractor, a feature encoder, a feature decoder and an anomaly score processor connected in sequence, with a prediction module further connected between the feature encoder and the feature decoder. The invention can improve the accuracy of video anomaly detection.

Description

Video anomaly detection method, system, electronic equipment and storage medium
Technical Field
The present invention relates to the field of computer vision, and in particular, to a method and system for detecting video anomalies, an electronic device, and a storage medium.
Background
Video anomaly detection is a hotspot in the monitoring and security fields and is particularly important for safeguarding public safety. Its main purpose is to detect events beyond conventional cognition, i.e. to distinguish events that do not conform to expected behavior. An anomaly detection task typically trains a normality model on a training data set containing only normal samples; in the test phase, samples that do not conform to the normality model are judged abnormal. If all expected behaviors are collected into a closed set, behaviors outside the closed set are considered abnormal events. The main difficulties of the video anomaly detection task are: 1) the boundary between normal and abnormal behavior is fuzzy; 2) video target behaviors are diverse; 3) scene diversity brings different behavior definitions, which reduces the generalization of the system, i.e. behaviors are highly scene-dependent; 4) abnormal samples are usually far fewer than normal samples, and the imbalance between positive and negative samples easily leaves the model's learning of abnormal features insufficient. Video anomaly detection in real scenes therefore remains a great challenge. According to the training mode of the neural network, existing video anomaly detection methods mainly fall into supervised learning, weakly supervised learning and unsupervised learning. Detection and localization of abnormal video behavior are generally distinguished by the differences represented by different behavior features, and proceed through three steps: moving-target detection, feature extraction, and abnormal-behavior classification.
By development stage, typical methods include traditional machine learning and deep learning. In the traditional stage, algorithms for video anomaly detection were mainly based on feature spaces built from hand-crafted features, using traditional machine learning to detect abnormal behavior; but heavy manual involvement brings low objectivity and high scene dependence, and the generalization ability of such methods is weak. In the deep learning stage, anomaly detection algorithms generally use end-to-end neural networks for adaptive feature learning and anomaly detection; however, supervised algorithms suffer from heavy manual labeling and predefined anomalies, while unsupervised or weakly supervised methods have higher false-detection and missed-detection probabilities and generalization that still needs improvement. The accuracy of video anomaly detection in the current prior art is therefore not high.
Disclosure of Invention
The invention aims to provide a video anomaly detection method, a system, electronic equipment and a storage medium, which can improve the video anomaly detection precision.
In order to achieve the above object, the present invention provides a method for detecting video anomalies, comprising:
acquiring target input data; the target input data are continuous frame images of a target video;
determining whether an abnormality exists in the target video according to the target input data and the video abnormality detection model, and outputting predicted data; the video anomaly detection model comprises a feature extractor, a feature encoder, a feature decoder and an anomaly score processor which are connected in sequence; and a prediction module is also connected between the feature encoder and the feature decoder.
Optionally, the training process of the video anomaly detection model specifically includes:
acquiring training data; the training data comprises a sample video and a corresponding detection result; the detection result comprises video normality and video abnormality;
constructing a deep learning network model based on U-Net connection;
and inputting the training data into the deep learning network model, training the deep learning network model by adopting a batch random gradient descent method, and determining the trained deep learning network model as the video anomaly detection model.
Optionally, the determining whether the target video has an abnormality according to the target input data and the video abnormality detection model specifically includes:
inputting the target input data into the feature extractor to extract pixel point feature information;
inputting the pixel point characteristic information into the characteristic encoder, obtaining characteristic reconstruction coding information according to characteristic abstract operation, and obtaining characteristic diffusion coding information according to dynamic diffusion equation operation;
inputting the characteristic reconstruction coding information and the characteristic diffusion coding information into the prediction module for prediction to obtain characteristic prediction coding information;
inputting the characteristic reconstruction coding information and the characteristic prediction coding information into the characteristic decoder for decoding operation to obtain a characteristic reconstruction sample and a characteristic prediction sample;
and determining whether an abnormality exists in the target video according to the characteristic reconstruction sample, the characteristic prediction sample and the abnormality score processor.
Optionally, inputting the feature reconstruction coding information and the feature diffusion coding information into the prediction module for prediction to obtain feature prediction coding information, which specifically includes:
constructing a dynamic diffusion equation according to the state parameters and diffusion parameters of the target input data; the state parameters comprise the spatial position, time and characteristic information of the pixel points;
and inputting the characteristic reconstruction coding information and the characteristic diffusion coding information into the dynamic diffusion equation to obtain characteristic prediction coding information.
Optionally, the determining whether an abnormality exists in the target video according to the feature reconstruction sample, the feature prediction sample and the abnormality score processor specifically includes:
performing loss operation on the characteristic reconstruction sample to obtain a reconstruction loss value;
carrying out loss operation on the characteristic prediction sample to obtain a predicted loss value;
and inputting the reconstruction loss value and the prediction loss value into the anomaly score processor to perform video anomaly detection operation, and determining whether anomalies exist in the target video.
The invention also provides a video anomaly detection system, which comprises:
the data acquisition unit is used for acquiring target input data; the target input data are continuous frame images of a target video;
the abnormality detection unit is used for determining whether an abnormality exists in the target video according to the target input data and the video abnormality detection model; the video anomaly detection model comprises a feature extractor, a feature encoder, a feature decoder and an anomaly score processor which are connected in sequence; and a prediction module is also connected between the feature encoder and the feature decoder.
The invention also provides an electronic device, comprising a memory and a processor, wherein the memory is used for storing a computer program and the processor runs the computer program to enable the electronic device to execute the video anomaly detection method described above.
The present invention also provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the video anomaly detection method described above.
According to the specific embodiment provided by the invention, the invention discloses the following technical effects:
the invention discloses a video anomaly detection method, a system, electronic equipment and a storage medium, wherein the method comprises the steps of obtaining target input data; the target input data are continuous frame images of a target video; determining whether an abnormality exists in the target video according to the target input data and the video abnormality detection model, and outputting predicted data; the video anomaly detection model comprises a feature extractor, a feature encoder, a feature decoder and an anomaly score processor which are connected in sequence; and a prediction module is also connected between the feature encoder and the feature decoder. The video anomaly detection method and the video anomaly detection system can improve the accuracy of video anomaly detection by constructing the video anomaly detection model.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions of the prior art, the drawings that are needed in the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a method of detecting video anomalies according to the present invention;
FIG. 2 is a schematic diagram of a training network of a video sequence by the video anomaly detection model in the present embodiment;
FIG. 3 is a schematic diagram of a test network of a video sequence by a video anomaly detection model in the present embodiment;
FIG. 4 is a schematic diagram of a video anomaly detection system according to the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The invention aims to provide a video anomaly detection method, a system, electronic equipment and a storage medium, which can improve the video anomaly detection precision.
In order that the above-recited objects, features and advantages of the present invention will become more readily apparent, a more particular description of the invention will be rendered by reference to the appended drawings and appended detailed description.
As shown in fig. 1, a method for detecting video anomalies provided by an embodiment of the present invention includes:
step 100: acquiring target input data; the target input data is a succession of frame images of a target video.
Step 200: determining whether an abnormality exists in the target video according to the target input data and the video abnormality detection model, and outputting predicted data; the video anomaly detection model comprises a feature extractor, a feature encoder, a feature decoder and an anomaly score processor which are connected in sequence; and a prediction module is also connected between the feature encoder and the feature decoder.
The specific process comprises the following steps:
and the first step is to input the target input data into the input feature extractor to extract the feature information of the pixel points.
And secondly, inputting the pixel point characteristic information into the characteristic encoder, obtaining characteristic reconstruction coding information according to characteristic abstract operation, and obtaining characteristic diffusion coding information according to dynamic diffusion equation operation.
And thirdly, inputting the characteristic reconstruction coding information and the characteristic diffusion coding information into the prediction module for prediction to obtain characteristic prediction coding information.
The further operation mode of the step comprises the following steps: constructing a dynamic diffusion equation according to the state parameters and diffusion parameters of the target input data; the state parameters comprise the spatial position, time and characteristic information of the pixel points; and inputting the characteristic reconstruction coding information and the characteristic diffusion coding information into the dynamic diffusion equation to obtain characteristic prediction coding information.
And fourthly, inputting the characteristic reconstruction coding information and the characteristic prediction coding information into the characteristic decoder for decoding operation to obtain a characteristic reconstruction sample and a characteristic prediction sample.
And fifthly, determining whether an abnormality exists in the target video according to the characteristic reconstruction sample, the characteristic prediction sample and the abnormality score processor.
The further operation mode of the step comprises the following steps: performing loss operation on the characteristic reconstruction sample to obtain a reconstruction loss value; carrying out loss operation on the characteristic prediction sample to obtain a predicted loss value; and inputting the reconstruction loss value and the prediction loss value into the anomaly score processor to perform video anomaly detection operation, and determining whether anomalies exist in the target video.
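The five steps above can be wired together as a toy pipeline. Everything here — the stub models, shapes and the averaging stand-ins — is an illustrative assumption, not the patent's actual network:

```python
import numpy as np

rng = np.random.default_rng(42)

def feature_extractor(frames):            # stand-in for a conv backbone
    return frames / 255.0

def feature_encoder(feats):
    recon_code = feats.mean(axis=0)                   # "feature reconstruction coding"
    diff_code = np.diff(feats, axis=0).mean(axis=0)   # "feature diffusion coding"
    return recon_code, diff_code

def predictor(recon_code, diff_code):     # diffusion-style prediction module
    return recon_code + diff_code         # forward-Euler-like step

def feature_decoder(recon_code, pred_code):
    return recon_code, pred_code          # stand-in for a U-Net decoder

def anomaly_score(recon_sample, pred_sample, target, lam=0.5):
    rec_loss = np.mean((recon_sample - target[:-1].mean(axis=0)) ** 2)
    pred_loss = np.mean((pred_sample - target[-1]) ** 2)
    return pred_loss + lam * rec_loss

frames = rng.integers(0, 256, size=(5, 8, 8)).astype(float)  # 4 history + 1 future
feats = feature_extractor(frames)
recon_code, diff_code = feature_encoder(feats[:-1])
pred_code = predictor(recon_code, diff_code)
recon_sample, pred_sample = feature_decoder(recon_code, pred_code)
score = anomaly_score(recon_sample, pred_sample, feats)
```

A larger score indicates a clip that the reconstruction-plus-prediction model explains poorly.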
The training process of the video anomaly detection model specifically comprises the following steps:
firstly, acquiring training data; the training data comprises a sample video and a corresponding detection result; the detection result comprises video normality and video abnormality.
And secondly, constructing a deep learning network model based on U-Net connection.
And thirdly, inputting the training data into the deep learning network model, training the deep learning network model by adopting a batch random gradient descent method, and determining the trained deep learning network model as the video anomaly detection model.
As shown in fig. 2 to 3, as a specific embodiment, the method is mainly divided into the following parts:
data preparation stage:
for video anomaly detection tasks, a large number of relevant videos are collected, continuous frame images in the videos are selected as a data set A, and the A is divided into a training set T containing category differences r And test set T e Wherein T is r Includes only normal mode samples and T e Including normal and abnormal pattern samples.
Model modeling stage:
In this embodiment, the complete video anomaly detection model is denoted M and includes a feature extractor F, a feature encoder E, a feature decoder D and an anomaly score processor C. M uses a reconstruction module and a prediction module to realize feature extraction, encoding, decoding and end-to-end network training. The input video sequence of the model is denoted I = (I_1, I_2, I_3, ..., I_L), where the subscript L denotes the sequence length. Each frame image I_i yields the mapped spatial feature F_i = F(I_i), and the feature sequence after mapping I is denoted
F = (F_1, F_2, ..., F_L), F_i ∈ R^D
where D represents the feature dimension. In the actual implementation, a 4-frames-predict-1-frame strategy is adopted: the input is a sequence of continuous video frames in which the first four frames are history frames and the following frame is the future frame.
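A sketch of the 4-frames-predict-1 input construction under a sliding window; `make_clips` and the dummy video are assumptions for illustration:

```python
import numpy as np

def make_clips(frames, history=4):
    """Slide a (history+1)-frame window over a video: the first `history`
    frames are the input history, the last frame is the prediction target."""
    return [(frames[i:i + history], frames[i + history])
            for i in range(len(frames) - history)]

# 10 dummy 4x4 frames; frame i has constant pixel value i
video = np.arange(10)[:, None, None] * np.ones((1, 4, 4))
clips = make_clips(video)
```

Each element of `clips` is a (history frames, future frame) pair ready for the model.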
The model first abstracts the features F layer by layer with an encoder (Encoder, E), then maps the abstracted features back to the original feature space with a decoder (Decoder, D) to obtain a reconstructed sample Î. Under the constraint of the loss function, the encoder E extracts more accurate features to facilitate the subsequent prediction process.
Prediction derivation based on reconstruction assistance: modeling the video sequence I in continuous space, a diffusion equation is established over variables such as spatial position, time and characteristic information (or energy), and the final prediction form is obtained by discretization. Specifically, the characteristic information (or energy) of the pixel at time t and spatial position (x, y) is denoted u(x, y, t). For a local square region S, where S is small enough to be treated as a pixel-level particle, the characteristic information H(t) of region S can be expressed by the integral
H(t) = ρ ∬_S u(x, y, t) dx dy
where ρ is a coefficient analogous to the specific heat capacity in physics. The change of information in region S can be represented by the gradient
dH/dt = ρ ∬_S ∂u(x, y, t)/∂t dx dy
An infinitesimal calculation is then performed on region S, approximating the spatial variation by the value at the center point. Based on conservation of heat in the heat-conduction process, the diffusion information is computed following the well-known Fourier law — the diffusion heat flowing through unit area in unit time is proportional to the temperature gradient of the diffusing material — giving the energy changes
ΔQ_x = k (∂u/∂x) Δy Δt,  ΔQ_y = k (∂u/∂y) Δx Δt
where ΔQ_x is the change of heat in the x direction, ΔQ_y is the change of heat in the y direction, H is the characteristic information and t is time.
The following dynamic equation of pixel-level particle motion can then be constructed:
∂u/∂t = α (∂²u/∂x² + ∂²u/∂y²)
where the diffusion parameter α is expressed through the relevant physical parameter ρ. Discretizing the continuous derivatives, with the four-frame time sequence defined as the history frame u_t and the frame after the fourth frame as the future frame u_{t+1}, gives the prediction form of the feature-level future frame:
u_{t+1} = u_t + α · u_t · diagM · Δt + α · diagM · u_t · Δt
where diagM is a tridiagonal matrix obtained in the derivation, Δt is the discretized time-interval change, and α is the diffusion parameter. This yields a prediction module based on a partial differential dynamic diffusion equation (abbreviated PDE, defined as predictor) that predicts the future frame u_{t+1} from the history frame u_t at the feature level. Since u_t comes from the history-frame sequence features of the feature extractor F, the predictor is a dynamic diffusion predictor, i.e. the derived prediction samples are
F_p = predictor(F)
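A minimal numpy sketch of this discretized update, assuming diagM is the standard (1, −2, 1) second-difference stencil (the patent does not give its entries explicitly) and reading diagM·u_t and u_t·diagM as the differences along the two spatial axes:

```python
import numpy as np

def tridiag_laplacian(n):
    # tridiagonal matrix diagM: 1D second-difference stencil (1, -2, 1)
    return -2.0 * np.eye(n) + np.eye(n, k=1) + np.eye(n, k=-1)

def diffusion_step(u, alpha, dt):
    # u_{t+1} = u_t + alpha * (M @ u_t + u_t @ M) * dt
    # M @ u_t differences along one axis, u_t @ M along the other
    M = tridiag_laplacian(u.shape[0])
    return u + alpha * dt * (M @ u + u @ M)

u_t = np.zeros((8, 8))
u_t[4, 4] = 1.0                      # a single "hot" pixel-level particle
u_next = diffusion_step(u_t, alpha=0.1, dt=0.5)
```

One step spreads the particle's energy into its four neighbors while shrinking the center value, mimicking heat diffusion at the feature level.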
the diffusion parameter α results from the encoding process, with each layer downsampling outputting the features u and α accordingly. For each video segment, the current input I corresponds to the downsampled feature F i And diffusion parameter alpha i The definition is as follows:
F i =f u (I)
α i =f α (I)
f u and f α Respectively, the convolution layers for feature u learning and diffusion parameter alpha learning, i representing the downsampling layer index. The encoding process based on encoder E can be summarized as f=e (F '), where F' is the current input, i.e. the upper layer output.
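A sketch of the per-layer branches f_u and f_α, modeled here as two stride-2 single-channel convolutions (the kernel size, stride and naive convolution routine are assumptions for illustration):

```python
import numpy as np

def conv2d(x, w, stride=1):
    # naive single-channel "valid" convolution (illustrative, not optimized)
    k = w.shape[0]
    H, W = x.shape
    return np.array([[np.sum(x[i:i + k, j:j + k] * w)
                      for j in range(0, W - k + 1, stride)]
                     for i in range(0, H - k + 1, stride)])

rng = np.random.default_rng(0)
w_u = rng.normal(size=(3, 3))           # weights of the f_u branch
w_alpha = rng.normal(size=(3, 3))       # weights of the f_alpha branch

I = rng.normal(size=(16, 16))           # current input of a downsampling layer
F_i = conv2d(I, w_u, stride=2)          # downsampled feature u
alpha_i = conv2d(I, w_alpha, stride=2)  # diffusion parameter map
```

Both branches share the input and downsample it in lockstep, so each feature location has a matching diffusion parameter.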
Correspondingly, the decoding process based on decoder D is summarized as
(Î_r, Î_p) = D(F, F_p)
where Î_r is the feature reconstruction sample and Î_p is the feature prediction sample. By combining low-level detail features with high-level semantic features, the utilization of feature information in every layer of the network is greatly improved.
The overall loss of the model is defined as
L = L_r + λ L_p
where the weight parameter λ is used to balance the prediction loss L_p and the reconstruction loss L_r. Both losses are computed as mean squared error, i.e. the average squared error between the output result and the true value:
L_mse = (1/N) Σ_{i=1}^{N} || x̂_i − x_i ||²
The encoder E and decoder D are optimized simultaneously by minimizing the overall loss. In the decoder, the loss is computed from the obtained frame-level outputs; since a 4-frames-predict-1-frame strategy is adopted, for each input video segment of consecutive frames I = (I_i, I_{i+1}, I_{i+2}, I_{i+3}), the resulting reconstructed history frames Î_r and predicted future frame Î_p have the same size as the corresponding ground-truth values.
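The loss combination can be sketched as follows; the λ value and array shapes are illustrative assumptions:

```python
import numpy as np

def mse(pred, target):
    # mean squared error between output and ground truth
    return float(np.mean((pred - target) ** 2))

def overall_loss(recon, history, pred, future, lam=0.6):
    # L = L_r + lambda * L_p
    return mse(recon, history) + lam * mse(pred, future)

rng = np.random.default_rng(1)
history = rng.normal(size=(4, 8, 8))   # four ground-truth history frames
future = rng.normal(size=(8, 8))       # one ground-truth future frame

perfect = overall_loss(history, history, future, future)       # ideal outputs
off_by_one = overall_loss(history + 1.0, history, future, future)
```

A perfect reconstruction and prediction gives zero loss; any deviation raises the corresponding term.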
The above steps are unified into an integral end-to-end deep learning network framework, and the model can realize end-to-end training and effect testing.
Model training stage:
first, a normal event data set T acquired by a data preparation stage r For training of the subsequent model M. Second, randomly sampling training set T r The multi-frame continuous images in the model are used as input samples, input into the current model M, optimize the model through a batch random gradient descent method, aim at minimizing the total loss, stop training when the total loss does not obviously change along with the increase of training rounds, otherwise continue to execute the present step process. Finally, the model M' is obtained after training is completed.
Model test stage:
first, a given video test data set T containing normal and abnormal events is input e Based on the trained model M', a complete test process of reconstruction prediction is realized by sequentially passing through the feature extractor F, the feature encoder E, the feature decoder D and the anomaly score processor C.
Next, a fixed-size continuous frame video clip sequence I obtained by a sliding window method is tested, and an anomaly score processor C is used to calculate a video anomaly detection result (anomaly or abnormal)Normal), anomaly score employs predictive score
Figure BDA0003926088430000091
And reconstruction score->
Figure BDA0003926088430000092
Form of combination:
Figure BDA0003926088430000093
wherein the prediction score
Figure BDA0003926088430000094
And reconstruction score->
Figure BDA0003926088430000095
From->
Figure BDA0003926088430000096
And->
Figure BDA0003926088430000097
Assignment of losses, setting of the balance weight parameter lambda depends on the properties of different data sets.
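The score combination at test time can be sketched as below; the λ, threshold and error values are illustrative assumptions:

```python
def anomaly_score(pred_err, recon_err, lam=0.4):
    # S = S_p + lambda * S_r, scores assigned from the two losses
    return pred_err + lam * recon_err

def detect(pred_err, recon_err, threshold=0.5, lam=0.4):
    # a clip whose combined score exceeds the threshold is flagged abnormal
    if anomaly_score(pred_err, recon_err, lam) > threshold:
        return "abnormal"
    return "normal"

normal_clip = detect(pred_err=0.05, recon_err=0.10)    # small errors
abnormal_clip = detect(pred_err=0.80, recon_err=0.60)  # large errors
```

In practice λ and the threshold would be tuned per data set, as the text notes.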
Through the above operation process, it can be seen that:
modeling the relative changes of the moving object at different moments by using a thermal conduction equation based on dynamic diffusion. In practice, anomaly detection aims to distinguish between all events that do not meet the expected behavior. If all expected behavior is collected into a closed set, behavior outside the closed set will be considered an exception event. From the angle of particle motion, the motion track of the abnormal pixel particles is in a state inconsistent with the normal pixel particles due to breaking the natural law of object motion. In order to build a model of the movement of the pixel-level particles, the invention researches the energy diffusion phenomenon of the pixel particle flow in video anomaly detection, and constructs a heat conduction equation of the pixel-level particles to describe the dynamic characteristics of the video by referring to the heat diffusion process in physics. The present invention describes the variation of a time frame by defining and calculating a diffusion equation, and describes the spatial variation using diffusion parameters. For different events, the diffusion parameters encode corresponding distribution characterization, for example, rapid changes of speed in running, throwing and other events cause strong comparison of the corresponding diffusion parameters with the diffusion parameters in a normal mode, fluctuation of the diffusion process is formed, the diffusion parameters obtained through a neural network deviate from the diffusion parameters obtained in the normal mode training, and the modeling process accords with the change rule generated by abnormality so as to better detect the abnormality.
Diffusion-equation-based discrete prediction is jointly trained with reconstruction-assisted anomaly detection, solving the dynamic partial differential equation end to end so that future-frame prediction and history-frame reconstruction are realized together. A model based only on reconstruction or only on prediction is prone to missing abnormal behaviors; combining reconstruction and prediction enhances the model's ability to capture and detect anomalies and motion changes. The reconstruction process helps improve the accuracy of the neural network's feature learning, thereby strengthening the model's ability to predict future frames. The ability to capture and detect motion changes can be established by a model trained on normal samples, achieving a better anomaly detection effect in the test stage.
A U-Net based on an automatic encoder-decoder is adopted as the backbone network to perform feature extraction and reconstruction-prediction on the input multi-frame continuous images, and the network loss is computed at the image level. Through cross-layer connections, the U-Net realizes cross-layer integration of features at different scales, reduces the parameter count and enhances the model's learning and generalization ability for motion information.
Therefore, the video anomaly detection method provided by the invention draws inspiration from the dynamic heat-conduction equation of the diffusion mechanism and realizes video anomaly detection end to end by simulating the feature-learning process in video anomaly detection, so that the model achieves better performance through explicit modeling. On the one hand, the neural network adaptively learns feature changes in space and time; on the other hand, the dynamic diffusion module adaptively encodes the normal mode of multiple frames. Unlike previous methods that use self-supervised discriminative cues to distinguish abnormal from normal modes, or that rely on scene-reconstruction strategies to regress normal texture modes, the invention uses a reconstruction-aided prediction strategy to predict normal texture modes by capturing the most basic components of video-stream change, in which pixel-level particle motion rules are mined and encoded for better generalization. The diffusion parameters corresponding to different events encode the corresponding distribution characterizations, and rapid speed changes lead to strong contrast with habitual patterns, causing fluctuations in the diffusion process. The invention defines a dynamic diffusion equation and obtains the final prediction form by modeling the diffusion equation over the spatial position, characteristic information (or, in physics, energy) and time of the video sequence in continuous space. The invention achieves considerable and comparable performance on real video data sets.
As shown in fig. 4, the present invention further provides a video anomaly detection system, including:
the data acquisition unit is used for acquiring target input data; the target input data is a succession of frame images of a target video.
The abnormality detection unit is used for determining whether an abnormality exists in the target video according to the target input data and the video abnormality detection model; the video anomaly detection model comprises a feature extractor, a feature encoder, a feature decoder and an anomaly score processor which are connected in sequence; and a prediction module is also connected between the feature encoder and the feature decoder.
The invention also provides electronic equipment, comprising a memory and a processor, wherein the memory is used for storing a computer program and the processor runs the computer program to cause the electronic equipment to execute the video anomaly detection method described above.
The present invention also provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the video anomaly detection method described above.
The embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and for the identical or similar parts between embodiments reference may be made to one another.
The principles and embodiments of the present invention have been described herein with reference to specific examples; the above description is intended only to assist in understanding the method of the invention and its core concept. Meanwhile, a person of ordinary skill in the art may make modifications to the specific implementation and application scope in light of the idea of the invention. In summary, the content of this specification should not be construed as limiting the invention.

Claims (5)

1. A video anomaly detection method, comprising:
acquiring target input data; the target input data are continuous frame images of a target video;
determining whether an abnormality exists in the target video according to the target input data and the video abnormality detection model, and outputting predicted data; the video anomaly detection model comprises a feature extractor, a feature encoder, a feature decoder and an anomaly score processor which are connected in sequence; a prediction module is also connected between the feature encoder and the feature decoder;
the determining whether the target video has an abnormality according to the target input data and the video abnormality detection model specifically includes:
inputting the target input data into the feature extractor to extract pixel point feature information;
inputting the pixel point characteristic information into the characteristic encoder, obtaining characteristic reconstruction coding information according to characteristic abstract operation, and obtaining characteristic diffusion coding information according to dynamic diffusion equation operation;
inputting the characteristic reconstruction coding information and the characteristic diffusion coding information into the prediction module for prediction to obtain characteristic prediction coding information;
inputting the characteristic reconstruction coding information and the characteristic prediction coding information into the characteristic decoder for decoding operation to obtain a characteristic reconstruction sample and a characteristic prediction sample;
determining whether an abnormality exists in the target video according to the characteristic reconstruction sample, the characteristic prediction sample and the abnormality score processor;
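The five steps above can be sketched as a single data flow. The component bodies below are trivial hypothetical stand-ins (the scaling factors and tensor shapes are invented for illustration, not the patented implementation), showing only how the feature extractor, feature encoder, prediction module, and feature decoder chain together:

```python
import numpy as np

# hypothetical stand-ins for the four model components
def feature_extractor(frames):            # pixel-point feature information
    return frames.astype(float)

def feature_encoder(feats):               # reconstruction code + diffusion code
    return feats * 0.9, feats * 1.1

def prediction_module(rec_code, diff_code):
    return (rec_code + diff_code) / 2     # feature prediction coding information

def feature_decoder(rec_code, pred_code): # decode both coding streams
    return rec_code, pred_code            # reconstruction / prediction samples

frames = np.ones((4, 8, 8))               # four consecutive input frames
feats = feature_extractor(frames)
rec_code, diff_code = feature_encoder(feats)
pred_code = prediction_module(rec_code, diff_code)
rec_sample, pred_sample = feature_decoder(rec_code, pred_code)
```

The two decoder outputs are exactly the pair that the anomaly score processor consumes in the final step.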
inputting the characteristic reconstruction coding information and the characteristic diffusion coding information into the prediction module for prediction to obtain characteristic prediction coding information, wherein the method specifically comprises the following steps:
constructing a dynamic diffusion equation according to the state parameters and diffusion parameters of the target input data; the state parameters comprise the spatial position, time and characteristic information of the pixel points;
inputting the characteristic reconstruction coding information and the characteristic diffusion coding information into the dynamic diffusion equation to obtain characteristic prediction coding information;
defining the four-frame time-series history frames at the feature level as u_t and the future frame as u_{t+1}, the prediction form of the feature-level future frame can be obtained as:

u_{t+1} = u_t + α·u_t·diagM·Δt + α·diagM·u_t·Δt

wherein diagM is a tri-diagonal matrix obtained in the derivation process, Δt is the discretized time-interval change, and α is the diffusion parameter, thereby obtaining, at the feature level, the partial-differential-based dynamic diffusion equation from the history frame u_t to the predicted future frame u_{t+1}.
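A minimal numerical sketch of this discretized update: the tri-diagonal matrix is taken here to be a discrete second-difference operator, and the feature vector, α and Δt are illustrative values, not the patent's learned parameters:

```python
import numpy as np

def tridiag(n, lower, diag, upper):
    # build an n x n tri-diagonal matrix
    return (np.diag(np.full(n - 1, lower), -1)
            + np.diag(np.full(n, diag))
            + np.diag(np.full(n - 1, upper), 1))

def diffuse_step(u_t, M, alpha, dt):
    # u_{t+1} = u_t + a*(u_t @ M)*dt + a*(M @ u_t)*dt
    return u_t + alpha * (u_t @ M) * dt + alpha * (M @ u_t) * dt

n = 5
M = tridiag(n, 1.0, -2.0, 1.0)   # discrete second-difference operator
u = np.zeros(n)
u[2] = 1.0                       # a single "hot" feature in the middle
u_next = diffuse_step(u, M, alpha=0.1, dt=0.5)
```

With this symmetric choice of diagM the two terms coincide, and one step spreads the central feature value to its neighbours while (away from the boundary) conserving the total, which is the diffusion behaviour the equation is meant to capture.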
2. The video anomaly detection method of claim 1, wherein the training process of the video anomaly detection model specifically comprises:
acquiring training data; the training data comprises a sample video and a corresponding detection result; the detection result comprises video normality and video abnormality;
constructing a deep learning network model based on U-Net connection;
and inputting the training data into the deep learning network model, training the deep learning network model by adopting a batch random gradient descent method, and determining the trained deep learning network model as the video anomaly detection model.
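Batch random (stochastic) gradient descent itself can be illustrated on a toy stand-in for the deep network, here plain linear regression in numpy; the learning rate, batch size, and epoch count are arbitrary choices for the sketch, not the patent's training settings:

```python
import numpy as np

rng = np.random.default_rng(0)
# toy stand-in for the network: a linear model y = X @ w
X = rng.normal(size=(256, 4))
true_w = np.array([1.0, -2.0, 0.5, 3.0])
y = X @ true_w

w = np.zeros(4)
lr, batch = 0.1, 32
for epoch in range(200):
    idx = rng.permutation(len(X))       # shuffle: the "random" in SGD
    for s in range(0, len(X), batch):   # iterate over mini-batches
        b = idx[s:s + batch]
        # gradient of mean squared error on this batch
        grad = 2 * X[b].T @ (X[b] @ w - y[b]) / len(b)
        w -= lr * grad
```

Each epoch reshuffles the data and updates the parameters once per mini-batch, which is the same optimization loop a deep network would use with the gradients supplied by backpropagation.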
3. The method for detecting video anomalies according to claim 1, wherein the determining whether anomalies exist in the target video according to the feature reconstruction samples, the feature prediction samples and the anomaly score processor specifically comprises:
performing loss operation on the characteristic reconstruction sample to obtain a reconstruction loss value;
carrying out loss operation on the characteristic prediction sample to obtain a predicted loss value;
and inputting the reconstruction loss value and the prediction loss value into the anomaly score processor to perform video anomaly detection operation, and determining whether anomalies exist in the target video.
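As a hedged sketch of this scoring step, the snippet below fuses per-frame reconstruction and prediction errors with an assumed equal weight and min-max normalises the result over the video; the error values and the 0.5 threshold are invented for illustration:

```python
import numpy as np

def frame_error(pred, target):
    # mean squared error between a generated frame and the ground truth
    return float(np.mean((pred - target) ** 2))

def anomaly_scores(rec_errors, pred_errors, w=0.5):
    # fuse the two loss streams, then min-max normalise over the video
    s = w * np.asarray(rec_errors) + (1 - w) * np.asarray(pred_errors)
    return (s - s.min()) / (s.max() - s.min() + 1e-8)

rec = [0.10, 0.11, 0.50, 0.12]    # reconstruction losses per frame
pred = [0.08, 0.09, 0.60, 0.10]   # prediction losses per frame
scores = anomaly_scores(rec, pred)
abnormal = scores > 0.5           # threshold the fused anomaly score
```

Frames whose fused, normalised score exceeds the threshold are flagged as abnormal; here the third frame, whose reconstruction and prediction both fail, stands out.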
4. An electronic device comprising a memory for storing a computer program and a processor that runs the computer program to cause the electronic device to perform the video anomaly detection method of any one of claims 1 to 3.
5. A computer-readable storage medium, characterized in that it stores a computer program which, when executed by a processor, implements the video abnormality detection method according to any one of claims 1 to 3.
CN202211374647.1A 2022-11-04 2022-11-04 Video anomaly detection method, system, electronic equipment and storage medium Active CN115527151B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211374647.1A CN115527151B (en) 2022-11-04 2022-11-04 Video anomaly detection method, system, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN115527151A CN115527151A (en) 2022-12-27
CN115527151B true CN115527151B (en) 2023-07-11

Family

ID=84705094

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211374647.1A Active CN115527151B (en) 2022-11-04 2022-11-04 Video anomaly detection method, system, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115527151B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115797349B (en) * 2023-02-07 2023-07-07 广东奥普特科技股份有限公司 Defect detection method, device and equipment

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112908465A (en) * 2021-01-04 2021-06-04 西北工业大学 Ultrasonic key frame automatic identification method based on anomaly detection and semi-supervision
CN113132737A (en) * 2021-04-21 2021-07-16 北京邮电大学 Video prediction method based on Taylor decoupling and memory unit correction
CN113569756A (en) * 2021-07-29 2021-10-29 西安交通大学 Abnormal behavior detection and positioning method, system, terminal equipment and readable storage medium
CN113705490A (en) * 2021-08-31 2021-11-26 重庆大学 Anomaly detection method based on reconstruction and prediction
CN114820708A (en) * 2022-04-28 2022-07-29 江苏大学 Peripheral multi-target trajectory prediction method based on monocular visual motion estimation, model training method and device
CN114926767A (en) * 2022-05-27 2022-08-19 湖南工商大学 Prediction reconstruction video anomaly detection method fused with implicit space autoregression


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Advanced Image Analysis for Learning Underlying Partial Differential Equations for Anomaly Identification; Andrew Miller et al.; Journal of Imaging Science and Technology, Vol. 64, No. 2 *
Research on Several Key Issues in Intelligent Video Surveillance Systems; Zou Yibo; China Doctoral Dissertations Full-text Database, Information Science and Technology Series, No. 02 *


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant