CN114627421A

CN114627421A - Multi-memory video anomaly detection and positioning method and system based on scene target

Info

Publication number: CN114627421A
Application number: CN202210288377.6A
Authority: CN
Inventors: 李洪均; 陈金怡; 孙晓虎; 陈俊杰
Original assignee: Nantong University
Current assignee: Nantong University
Priority date: 2022-03-22
Filing date: 2022-03-22
Publication date: 2022-06-14

Abstract

The invention discloses a multi-memory video anomaly detection and positioning method and system based on scene targets, which are used for carrying out anomaly detection on targets in a video from global and local angles by utilizing global anomaly branches and local anomaly branches respectively, so that the targets focusing on regions where anomalies possibly occur are realized; by fully utilizing the distribution of the scene targets, the far-point targets and the near-point targets are scaled to be uniform in size, so that the difference caused by the field angle is effectively relieved; by combining the characteristics of the model double branches, each target is abnormally quantized and positioned, only an abnormal target area can be marked, and the positioning effect is very clear. The method can improve the video abnormity detection performance and clearly position the abnormal target, and has important significance for the intelligent video abnormity detection field.

Description

Multi-memory video anomaly detection and positioning method and system based on scene target

Technical Field

The invention belongs to the field of video monitoring, and particularly relates to a scene target-based multi-memory video anomaly detection and positioning method and system.

Background

With the popularization of monitoring equipment and the rise of intelligent monitoring, video anomaly detection gradually becomes a research hotspot at home and abroad. Video anomaly detection refers to identifying undesirable behavior or appearance patterns in all scenes in a particular location. Because anomalies have a wide variety of unpredictable characteristics, the video anomaly detection task is often considered an unsupervised task, i.e., the anomalies can be detected during testing by training the normal data. Video frame prediction and video frame reconstruction are currently the main popular methods. The conventional method for video frame prediction is to input the continuous whole video frame as a normal sample into a model for training, and the model is used for fitting a normal mode. Therefore, when in testing, because the abnormal sample is not contacted in the training process, the abnormal sample cannot be well predicted when the abnormal sample is input into the model, and the purpose of abnormal detection is achieved.

At present, many existing video frame prediction methods directly use the whole video frame as model input, which is a global approach. This approach ignores that anomalies are more likely to occur on foreground objects that belong to local information in the video frame. Regarding the positioning problem, the current prediction-based method regards the pixel block with large error between the predicted frame and the real frame as an abnormal position. However, this positioning method has three problems. Firstly, the normal moving object also has an error area. Secondly, the error regions are dispersed, and once the distance is very close, two objects are difficult to distinguish. And thirdly, in real world application, the monitoring cameras are installed at different positions of different scenes. For the same target, the target windows are different in size due to different shooting angles. Due to the difference in the angle of view, the slight difference in the presence of the far point target is weakened and the slight difference in the presence of the near point target is enlarged for the targets of the same pattern, so that there is a problem that the near point target is more likely to be positioned as an abnormality.

Disclosure of Invention

In order to overcome the defects of the prior art, the invention provides a scene target-based multi-memory video anomaly detection and positioning method, which is used for positioning an anomaly target while improving the video anomaly detection performance.

In order to achieve the purpose, the invention adopts the following technical scheme:

the method for detecting and positioning the abnormal condition of the multi-memory video based on the scene target comprises the following steps of executing training from step S1 to step S2 to obtain an optimal global prediction model and an optimal local prediction model based on all normal videos in a target scene, and performing abnormal condition detection and positioning on a currently acquired video frame by combining the optimal global prediction model and the optimal local prediction model through the steps S3 to S4:

step S1: respectively aiming at each normal video in a target scene, firstly extracting each frame of the normal video to be used as a global training set, then extracting a target window corresponding to each preset type target in each video frame in the normal video to obtain a target pipeline of each preset type target in the normal video along each video frame in the normal video respectively to be used as a local training set;

step S2: based on a global training set, taking a first preset number of continuous video frames as input and the next video frame of the continuous video frames as output, training a global prediction model to obtain an optimal global prediction model;

based on a local training set, training a local prediction model by taking a target pipeline corresponding to each preset type target in a first preset number of continuous video frames as input and a target window corresponding to each preset type target in a next video frame of the continuous video frames as output to obtain an optimal local prediction model; and (3) respectively and iteratively executing the step (S2.1) to the step (S2.3) by the global prediction model training and the local prediction model training until the loss function is stable or the maximum iteration times are reached:

step S2.1: inputting input data of a first preset number of continuous video frames into a first network for feature extraction based on a prediction model to be trained to obtain each preset query item corresponding to the input data;

step S2.2: respectively inputting the preset query items into a second network aiming at the preset query items to obtain fusion characteristic items corresponding to the preset query items, and further obtaining fusion characteristic items corresponding to the preset query items;

step S2.3: inputting the input data into a third network based on each preset query item corresponding to the input data and the fusion feature item corresponding to each preset query item, obtaining output data of a next video frame related to the input data, and updating the prediction model;

step S3: taking a first preset number of continuous video frames of a previous video frame at the current time in the historical time direction as data to be analyzed, predicting and obtaining a current prediction video frame based on an optimal global prediction model, predicting and obtaining a target prediction window of each preset type of target in the current prediction video frame by an optimal local prediction model, carrying out abnormity detection on a current collected video frame, and executing step S4 if the judgment result is abnormal; if the video frame is judged to be not abnormal, carrying out abnormal detection on the next collected video frame in the mode of S3;

step S4: and aiming at detecting that the current collected video frame is abnormal, predicting and obtaining the current prediction video frame based on the optimal global prediction model, predicting and obtaining target prediction windows of various preset types of targets in the current prediction video frame by the optimal local prediction model, positioning the abnormal target windows of the current collected video frame, and finishing the detection.

As a preferred technical solution of the present invention, in the step S1, for each normal video, a specific process of extracting a target pipeline corresponding to each preset type target is as follows:

detecting each preset type target in each video frame by using a Yolov5 model pre-trained on an MS COCO data set, corresponding each preset type target of the front and rear video frames by means of a DeepsORT target tracking algorithm to obtain target pipelines respectively corresponding to each preset type target, and carrying out undistorted scaling uniform size on each target pipeline to obtain a local training set.

As a preferred technical solution of the present invention, in the step S2, the first network and the third network are both U-Net network structures, and the first network removes the last batch normalization layer and the ReLU layer and adopts the L2 normalization layer instead.

As a preferred technical solution of the present invention, in the step S2.1, the input data Y based on the first preset number of consecutive video frames_S＝(E_t-S,…,E_t-2,E_t-1) Performing feature extraction on the input data through a first network to obtain each preset query item corresponding to the input data

Wherein, Y_TThe method comprises the steps of indicating input data of a first preset S continuous video frames, indicating predicted time, and indicating input data of each video frame;

predicting the kth preset query term corresponding to input data of a video frame at the moment t, wherein k is 1_,..._,K_，K is the total number of the preset query items corresponding to the input data, K is (W/8) × (H/8), and W and H respectively represent the width and the height of the input data.

As a preferred technical solution of the present invention, in step S2.2, the second network includes M memory items

Respectively aiming at preset query items

The preset query item and the M memory items are respectively subjected to cosine similarity calculation and are applied to a softmax function to obtain the preset query item

Respectively associated with each memory item

Weight of (2)

Obtaining fusion characteristic items corresponding to the preset query items based on the weights of the memory items and the preset query items

As a preferred technical solution of the present invention, for each memory item in the second network, the following process is executed to update each memory item:

weights of preset query items and memory items corresponding to input data

Obtaining index sets corresponding to all the memory items respectively

Each subset in the index set is the weight of the memory item

The highest preset query item; aiming at each memory item, based on the memory item and the corresponding index set

Weight of each preset query item in

The memory term is updated using the following equation:

wherein f (. cndot.) denotes L2 normalization, p^mRefers to the updated memory entry.

As a preferred technical solution of the present invention, the training loss functions L of the global prediction model and the local prediction model in step 2 are predicted losses L respectively_predCharacteristic loss of compaction L_compactAnd characteristic separation loss L_separateThe three parts of the utility model are formed,

L＝L_pred+η_cL_compact+η_sL_separate

wherein eta is_cAnd η_sIs a preset coefficient; l is_predTo predict the L2 distance between the output data of a video frame and the pixels of the captured video frame data,

for predicting output data of video frames, a_tIn order to collect data corresponding to output data in video frames, T represents the total video frame number of a normal video, and T is b, which means that input data of the first b-1 video frames of the normal video is used as an initial input value of a prediction model;L_compact is a preset query term

And the memory item p closest to the query item_pThe L2 distance of (a) is,

indicating the k-th preset query item corresponding to the input data of the video frame at the predicted time t

The similarity between the memory items and the mth memory item, namely the weight; l is a radical of an alcohol_separateTo make the query term preset

Memory items p second proximate to the query item_nIs further from the predetermined query term and the closest memory term p_pIs far away by a preset threshold theta.

As a preferred technical solution of the present invention, in step S3, a current predicted video frame is obtained based on the optimal global prediction model prediction, a target prediction window of each preset type of target in the current predicted video frame is obtained based on the optimal local prediction model prediction, and a frame-level abnormality score S of a current captured video frame is calculated for the current captured video frame_tPerforming an abnormality detection, if the abnormality is scored by S_tIf the preset threshold value is reached, the current collected video frame is abnormal;

S_t＝λS_local,t+(1-λ)S_global,t

wherein the content of the first and second substances,

in the formula, S_local,tRepresents the local abnormal score of the video frame acquired at the current time t, S_global,tRepresenting the global abnormal score of the video frame collected at the current time t;

representing the abnormal scores of the jth preset type target in the video frame collected at the current time t, wherein J targets are in total, and lambda represents the contribution degree and sigma₁、σ₂Are all preset coefficients, g (-) represents min-max normalization; d (q)_tP) represents each preset query term in the video frame at the predicted time t

The nearest memory item p corresponding to each preset query item_pThe distance between the preset query items is L2, and K is the total number of the preset query items corresponding to each preset type target;

representing a target prediction window

And capturing a target window o in a video frame_tPeak signal to noise ratio, N, between_oIndicating the number of pixels contained in the target window,

representing a target prediction window

A medium pixel maximum value;

representing predicted video frames

And capturing video frames I_tPeak signal to noise ratio, N, between_IIndicates the number of pixels contained in the video frame,

representing predicted video frames

The medium pixel maximum.

As a preferred technical solution of the present invention, in step S4, for detecting that the current captured video frame is abnormal, the current predicted video frame is obtained based on the prediction of the optimal global prediction model, the target prediction windows of each preset type of target in the current predicted video frame are obtained by the prediction of the optimal and local prediction models, and the abnormal score of each target window in the video frame is calculated for the current captured video frame

Positioning an abnormal target window of a currently acquired video frame;

wherein the content of the first and second substances,

in the formula (I), the compound is shown in the specification,

representing a target window corresponding to the jth preset type target in the video frame at the predicted time t

The pixel in (b) and a target window corresponding to the jth target in the video frame at the acquisition time t

The corresponding pixels in the image are subjected to difference and summation, local represents local prediction, global represents global prediction, J targets are in total, and lambda represents contribution degree;

calculating the abnormal score of each target window in the video frame aiming at the current collected video frame, and taking the preset proportion gamma of the target with the highest abnormal score as a standard threshold theta based on the target with the highest abnormal score_tExceeding the standard threshold value theta_tThe target corresponding to the abnormality score of (1) is determined to be abnormal;

the system of the multi-memory video anomaly detection and positioning method based on the scene target comprises a training set module, a global prediction module, a local prediction module, an anomaly detection module and an anomaly positioning module,

the training set module obtains a global training set and a local training set based on each normal video in a target scene;

the global prediction module and the local prediction module respectively comprise an encoder, a decoder and a memory module, wherein the encoder corresponds to the first network and performs feature extraction based on input data to obtain each preset query item corresponding to the input data; the memory module corresponds to the second network, obtains fusion feature items corresponding to the preset query items based on the preset query items, and updates the memory items based on the preset query items; the decoder corresponds to the third network, and obtains output data of the prediction video frame based on all the preset query items corresponding to the input data and the fusion feature items corresponding to all the preset query items;

the anomaly detection module carries out anomaly detection on the currently acquired video frame based on a current prediction video frame obtained by global prediction model prediction and a target prediction window of each preset type of target in the current prediction video frame obtained by optimal local prediction model prediction;

and the abnormity positioning module is used for detecting that the current collected video frame is abnormal, predicting the obtained current predicted video frame based on the optimal global prediction model, predicting a target prediction window of each preset type of target in the current video frame by using the optimal and local prediction models, and positioning the abnormal target window of the current collected video frame.

The invention has the beneficial effects that: the invention provides a multi-memory video anomaly detection and positioning method and system based on scene targets, which are used for carrying out anomaly detection on targets in a video from a global angle and a local angle respectively by utilizing a global anomaly branch and a local anomaly branch, so that the targets focusing on regions where anomalies possibly occur are realized; by fully utilizing the distribution of the scene targets, the far-point targets and the near-point targets are scaled to be uniform in size, so that the difference caused by the field angle is effectively relieved; the method for performing abnormal quantification and positioning on each target by combining the characteristics of the double branches of the model can only mark an abnormal target area, and has very clear positioning effect. The method can improve the video abnormity detection performance and clearly position the abnormal target, and has important significance for the intelligent video abnormity detection field.

Drawings

FIG. 1 is an overall frame diagram;

FIG. 2 is a diagram illustrating a target pipeline and adaptive size in a local abnormal branch;

FIG. 3a is a graph of a target window distribution of a UCSD Ped2 data set;

FIG. 3b is a graph of a ShanghaiTech dataset target window distribution;

FIG. 4 is a table of two-branch fusion performance;

FIG. 5a is a ROC curve for a local outlier branch on a UCSD Ped2 data set before and after adding to a global outlier branch;

FIG. 5b is a ROC curve for local aberrant branching on both ShanghaiTec datasets before and after addition to global aberrant branching;

FIG. 6 is a comparison of the present model and other models on the UCSD Ped2 dataset and the ShanghaiTech dataset, respectively;

FIG. 7a is a graph illustrating anomaly detection results at a UCSD Ped2 data set frame level;

FIG. 7b is a frame level anomaly detection result on the ShanghaiTech dataset;

fig. 8 is a diagram of the visualization effect of the abnormal localization.

Detailed Description

The following examples will give the skilled person a more complete understanding of the present invention, but do not limit the invention in any way.

A scene target-based multi-memory video anomaly detection and positioning method and system are mainly used for realizing an unsupervised task in a mode of predicting video frames, namely predicting current frames through continuous frame video frames, as shown in figure 1. The model is mainly composed of two parts: global exception branches and local exception branches. The global abnormal branch mainly detects all the targets in the whole frame and the abnormal existing in the background; the local abnormal branch mainly detects an abnormality such as a pattern form of a target. Since the extracted targets are different in distance from the camera and in target form, resulting in different target windows, undistorted scaling operation needs to be performed on the detected targets according to the scene target distribution. Then, the encoder extracts the normal pattern of the target and reads or updates the memory module during the process of training the decoder to predict the next frame image, so that the memory module can help check whether the single target is abnormal or not at the time of testing. And finally, for the same frame, fusing the abnormal scores obtained by the global abnormal branches and the abnormal scores obtained by the local abnormal branches through the hyper-parameters to obtain the frame and the final abnormal scores corresponding to all targets in the frame.

The method trains two public data sets in the field of video anomaly detection, namely a UCSD Ped2 data set and a ShanghaiTech data set. The UCSD Ped2 data set comprises 16 training videos and 12 testing videos, wherein the abnormity mainly comprises a skateboard, a bicycle riding and a car driving; the ShanghaiTech dataset had 330 training videos and 107 test videos, with abnormalities mainly including cycling, driving, skateboarding, falling, running, etc. The UCSD Ped2 dataset has only 1 scene, and the ShanghaiTech dataset has 13 scenes in total.

The method for detecting and positioning the abnormal condition of the multi-memory video based on the scene target comprises the steps of S1-S2 training to obtain an optimal global prediction model and an optimal local prediction model based on all normal videos in the target scene, and performing abnormal condition detection and positioning on a current collected video frame by combining the optimal global prediction model and the optimal local prediction model through the steps of S3-S4.

Step S1: and respectively aiming at each normal video in a target scene, firstly extracting each frame of the normal video to be used as a global training set, then extracting a target window corresponding to each preset type target in each video frame in the normal video to obtain a target pipeline of each preset type target in the normal video along each video frame in the normal video respectively to be used as a local training set.

In step S1, for each normal video, a specific process of extracting a target pipeline corresponding to each preset type of target is as follows: in the local abnormal branch part, in order to focus on the target itself, a Yolov5 model pre-trained on an MS COCO data set is used for detecting each preset type target in each video frame, each preset type target of the front video frame and the rear video frame is corresponded by means of a DeepsORT target tracking algorithm, target pipelines respectively corresponding to each preset type target are obtained, and the target pipelines are subjected to undistorted scaling and unified size to obtain a local training set. Due to the motion of the target and other reasons, the target pipeline of each target obtained after the input video frame is processed by Yolov5 and the deepSORT algorithm is different in size, so that the target pipeline cannot be directly used as input. For the targets of the same mode, due to the difference of the field angles, the fine prediction error of the near-point target is amplified, the fine prediction of the far-point target is weakened, so that the two targets are not above a consideration standard, and the near-point target is more likely to be judged as abnormal. Therefore, a method of undistorted scaling of the detected objects according to the scene object distribution is adopted. Scene distribution refers to the distribution of object sizes in the training set. Before this, the target window is scaled in a non-deformable manner, so that one side of the target window is the same as the corresponding side with the fixed size, and the other side of the target window is smaller than the corresponding side with the fixed size, and then the smaller side is filled to achieve the fixed size. And for the selection of the width and the height with fixed sizes, considering that the distribution of the sizes of the targets is different due to different shooting angles, the height and the width which comprise the peak value of the target window and correspond to a certain proportion of the peak value of the number of the target windows are selected.

Step S2: and based on the global training set, training a global prediction model by taking a first preset number of continuous video frames as input and the next video frame of the continuous video frames as output to obtain an optimal global prediction model.

Based on a local training set, training a local prediction model by taking a target pipeline corresponding to each preset type target in a first preset number of continuous video frames as input and a target window corresponding to each preset type target in a next video frame of the continuous video frames as output to obtain an optimal local prediction model; inputting the input data of the local prediction model into target pipelines corresponding to preset types of targets respectively; and (4) respectively and iteratively executing the step (S2.1) to the step (S2.3) by the global prediction model training and the local prediction model training until the loss function is stable or the maximum iteration number is reached.

The first network and the third network are both of U-Net network structures, the first network removes the last batch standardization layer and the ReLU layer, and the first network is replaced by an L2 standardization layer. Since the last ReLU layer will remove negative values, the characterization capability is diminished. Meanwhile, since there are three down-sampling layers in the first network, the input target pipeline width and height are required to be at least a multiple of 8.

Step S2.1: and inputting the input data of a first preset number of continuous video frames into a first network for feature extraction based on the prediction model to be trained to obtain each preset query item corresponding to the input data.

In step S2.1, input data Y based on a first preset number of consecutive video frames_S＝(E_t-S,…,E_t-2,E_t-1) Performing feature extraction on the input data through a first network to obtain each preset query item corresponding to the input data

Wherein, Y_TThe method comprises the steps of indicating input data of a first preset S continuous video frames, indicating predicted time, corresponding to each time for each frame in the technical scheme, and indicating input data of each video frame by E;

in order to predict the kth preset query term corresponding to the input data of the video frame at the time t, K is 1.

Step S2.2: and respectively inputting the preset query items into a second network aiming at the preset query items to obtain fusion characteristic items corresponding to the preset query items, and further obtaining fusion characteristic items corresponding to the preset query items.

In step S2.2, the second network has reading and updating functions, and the second network includes M memory items

Respectively aiming at preset query items

Respectively associated with each memory item

Weight of (2)

For each query term

The corresponding normal mode characteristic item can be obtained by reading the second network

The second network has various memory items corresponding to multiple normal modes, and each preset query item

The most relevant information can be extracted from a plurality of normal modes and fused, and finally, (W/8) x (H/8) fused feature items can be obtained from (W/8) x (H/8) preset query items.

For each memory item in the second network, executing the following process to update each memory item: the updating strategy is that for each memory item in the memory module, the most similar memory item in the query items is selected as the query item of the memory item to update the memory item;

weights of preset query items and memory items corresponding to input data

Get each memoryIndex set corresponding to items respectively

Each subset in the index set is the weight of the memory item

Weight of each preset query item in

The memory term is updated using the following equation:

wherein f (. cndot.) denotes L2 normalization, p^mRefers to the updated memory term;

representing preset query terms

And index collection

The cosine similarity calculation is carried out on each preset query item in the query table, the weights after the softmax function is applied are used, and the most similar normal mode query items can be more noticed when the memory items are updated through the weighted sum of the query items.

Step S2.3: and inputting the input data into a third network based on each preset query item corresponding to the input data and the fusion characteristic item corresponding to each preset query item, obtaining output data of a next video frame related to the input data, and updating the prediction model.

The first network corresponds to an encoder, the second network corresponds to a memory module, and the third network corresponds to a decoder.

The network structure of the local abnormal branch prediction part is the same as that of the global abnormal branch network, and the parameters are different. In the training process, although two stages of separate training are performed, the training loss is the same. The training loss functions L of the global prediction model and the local prediction model are respectively predicted loss L_predCharacteristic loss of compaction L_compactAnd characteristic separation loss L_separateThe three parts of the utility model are formed,

L＝L_pred+η_cL_compact+η_sL_separate

wherein eta is_cAnd η_sThe coefficients are preset, and are respectively 0.1 and 0.1 in the embodiment; in order for the model to learn to correctly predict normal samples during the training process, the prediction loss penalty decoder predicts the difference between the output data and the collected data at the pixel level, specifically the L2 distance between pixels, i.e., L_predTo predict the L2 distance between the output data of a video frame and the pixels of the captured video frame data,

for predicting output data of video frames, a_tFor capturing data corresponding to output data in a video frame, T represents a normal viewThe total video frame number of the video, t ═ b, refers to a normal video, which uses the input data of the first b-1 video frames as the initial input value of the prediction model, and t ═ 5 is used in this embodiment; the feature tight loss punished is the query term

And the memory item p closest to the query item_pThe difference is specifically the L2 distance between the two, because the query terms output by the encoder during the training process are all query terms in the normal mode, the distance between the memory term and the query term can be reduced due to the tight loss of the features, so that the memory module can learn the normal mode and store the normal mode, and L is the distance between the memory term and the query term_compactFor presetting query terms

The memory item p closest to the preset query item_pThe L2 distance of (a) is,

representing the similarity, namely the weight, between a kth preset query item and an mth memory item corresponding to input data of a video frame at the predicted time t; l is_separateTo make the query term preset

Memory items p second proximate to the query item_nIs further from the predetermined query term and the closest memory term p_pThe distance of the query term and the second near memory term is further than a preset threshold θ, which is set to 1 in this embodiment, because the distance between the memory term and the query term is shortened due to the tight loss of the features, and the parameters of the encoder and the memory module are all updatable in the training process, all the memory terms and the query term are close to being similar in the late training period, and the loss of the feature separation is just for increasing the difference between the memory terms to achieve the purpose of multiple normal modes, specifically, the distance between the query term and the second near memory term is further than the distance between the query term and the nearest memory term by a threshold θ,

express a certainA query term, p_pRepresenting the memory term, p, nearest to the query term_nRepresenting the memory term that is the second closest to the query term.

Step S3: taking a first preset number of continuous video frames of a previous video frame at the current time in the historical time direction as data to be analyzed, predicting and obtaining a current prediction video frame based on an optimal global prediction model, predicting and obtaining a target prediction window of each preset type of target in the current prediction video frame by an optimal local prediction model, carrying out abnormity detection on a current collected video frame, and executing step S4 if the judgment result is abnormal; if it is determined not to be abnormal, abnormality detection is performed on the next captured video frame in the manner of S3.

In the step S3, a current prediction video frame is obtained based on the optimal global prediction model prediction, and a target prediction window of each preset type of target in the current prediction video frame is obtained based on the optimal local prediction model prediction, and the global abnormal branch mainly detects the abnormality of the whole frame; the local abnormal branch mainly detects an abnormality of the target pattern. The two branches have different emphasis points for anomaly detection, and the contribution degrees of the two branches are different aiming at different scenes. Therefore, the results of the global anomaly detection and the local anomaly detection are adjusted and fused through the hyper-parameter lambda, so that the model can be adaptive to various scenes. Calculating a frame-level anomaly score S for a currently captured video frame_tPerforming an abnormality detection, if the abnormality is scored by S_tIf the current collected video frame is abnormal, selecting a threshold with highest identification accuracy according to the threshold after testing;

S_t＝λS_local,t+(1-λ)S_global,t

wherein, the first and the second end of the pipe are connected with each other,

in the formula, S_local,tRepresenting local abnormal score, S, of video frame acquired at current time t_global,tRepresenting the global abnormal score of the video frame collected at the current time t;

representing the abnormal scores of the jth preset type target in the video frame collected at the current time t, wherein J targets are in total, and lambda represents the contribution degree and sigma₁、σ₂All are preset coefficients, the values are 0.6 and 0.6 respectively, and g (-) represents the min-max normalization; d (q)_tP) represents each preset query term in the video frame at the predicted time t

The nearest memory item p corresponding to each preset query item_pThe distance between the preset type targets is L2, and K is the total number of the preset query items corresponding to the preset type targets; since the anomalous target window cannot be well predicted, the prediction difference is measured by the peak signal-to-noise ratio (PSNR) between the predicted target window (predicted video frame) and the real target window (real video frame), when the predicted target window and the real window are very different, the PSNR value will be very low, since these two metrics are not above one order of magnitude, and when an anomaly occurs, one metric will increase and the other metric will decrease, so that the two metrics will be differentTo utilize sigma₁、σ₂The balance is carried out on the raw materials,

representing a target prediction window

representing a target prediction window

A medium pixel maximum;

representing predicted video frames

And capturing video frames I_tPeak signal to noise ratio, N, in between_IIndicates the number of pixels contained in the video frame,

representing predicted video frames

The medium pixel maximum. For the frame-level video anomaly detection task, the goal is to get an anomaly score for the entire frame. Therefore, for local abnormal branches, the highest abnormal target score is taken as the abnormal score representation of the whole video frame, and the global abnormal branch itself scores the abnormal score of the whole video frame. Double-branch anomalies score anomaly scores from two aspects. On one hand, the difference degree between the query item and the memory item is preset. Since normal samples are trained in the training process, a plurality of normal modes are stored in the memory module. If the abnormal sample is input into the encoder, the encoder will input the abnormal sampleAnd extracting abnormal features, wherein the obtained query term has a larger difference with the memory term. g (-) denotes min-max normalization by D (q)_tAnd p) is taken as an example,

In step S4, for detecting that the current captured video frame is abnormal, the current predicted video frame is obtained based on the prediction of the optimal global prediction model, the target prediction windows of each preset type of target in the current predicted video frame are obtained by the prediction of the optimal and local prediction models, and the abnormal score of each target window in the video frame is calculated for the current captured video frame

And positioning an abnormal target window of the current acquisition video frame, wherein the abnormal score is composed of two parts, one part is formed by focusing on an abnormal value presumed by the self mode of the target through a local abnormal branch, and the other part is formed by combining the background and the abnormal value presumed by the target through a global abnormal branch. The fusion of the two parts is determined according to the contribution degree of the two branches to video frame abnormity detection at the frame level;

wherein the content of the first and second substances,

in the formula (I), the compound is shown in the specification,

Pixel in (b) and the target window corresponding to the jth target in the captured video frame t

The corresponding pixels in the image are subjected to difference and summation, local represents local prediction, global represents global prediction, J targets are in total, and lambda represents contribution degree; in the local abnormal branch part, the pixel in the target window is predicted directly by the local abnormal branch, and the predicted pixel in the target window of the global abnormal branch part is divided according to the target window from the global predicted video frame. Because the abnormal values of the two parts are not in the same magnitude, the two parts are unified and normalized to 0,1]And then the two parts are fused through the contribution degree lambda, and finally the abnormal score of each target can be obtained.

Since the anomaly localization problem presupposes that the model detects a video frame in the frame-level task that is an anomaly for that frame, i.e., there is at least one anomaly in the video frame, the target with the highest anomaly score is considered to be an anomalous target. Meanwhile, considering that the abnormal scores of all abnormal video frames are different and the abnormal score of an abnormal target and the normal target in the same frame have larger difference, an abnormal target judgment standard is set, the abnormal score of each target window in the video frame is calculated aiming at the current collected video frame, and based on the target with the highest abnormal score, the preset proportion gamma of the target is used as a standard threshold theta_tExceeding the standard threshold value theta_tThe target corresponding to the abnormality score of (1) is determined to be abnormal;

by this abnormal object judgment criterion, the abnormal object can be clearly judged. Furthermore, a more specific visualization effect is to display only the anomalous target. Therefore, the abnormal object window is marked red as a whole, visualization is clearer, and the problems that a normal object also has an error area and the object is adhered are effectively solved. And a scene target distribution method is adopted, far and near targets are unified to the same measurement standard for comparison, and the visual field difference is effectively relieved.

And respectively training the global abnormal branch and the local abnormal branch during training. Global exception branching normalizes video frames to [ -1, 1 ] after scaling to size]The range of (1). Normalizing video frames to [ -1, after local abnormal branch adaptive window size]The range of (1). The training batches for the global abnormal branches on the UCSD Ped2 dataset and ShanghaiTech dataset were 60 and 10, respectively, and the training batches for the local abnormal branches were 20 and 10, respectively. The memory modules are all 10-entry 512-dimensional in size, all using an Adam optimizer, σ₂＝0.6，σ₃＝0.6，η_c＝0.1，η_sThe initial learning rate is 2e-4, and the cosine annealing attenuation method is adopted to attenuate the data, wherein the theta is 0.1, the gamma is 0.3, and the theta is 1. All models are built by using a pytore frame, the CPU model is Intel (R) Xeon (R) Silver 4216CPU @2.10GHz, and the used display card is Nvidia GeForce RTX 3090.

The general framework diagram of the scheme is shown in fig. 1, the schematic diagram of the target pipeline and the adaptive size in the local abnormal branch is shown in fig. 2, and the distribution diagram of the scene target on the two data sets is shown in fig. 3a and fig. 3 b. The abnormal detection emphasis points of the global abnormal branch and the local abnormal branch are different, and the contribution degrees of the two branches are different aiming at different scenes. Therefore, the results of the global anomaly detection and the local anomaly detection are adjusted and fused through the hyper-parameter lambda, so that the model can be adaptive to various scenes, and an optimal value is selected through experiments, and the result is shown in fig. 4. Where λ represents the contribution ratio of local abnormal branches, 1- λ represents the contribution ratio of global abnormal branches, the darkened font represents the optimal value, and the underlined font represents the suboptimal value. From the table, it can be seen that λ can achieve the optimal performance when taking 0.2 and 0.8 on the UCSD Ped2 and ShanghaiTech data sets respectively, and achieve the suboptimal performance when taking 0.3 and 0.7, which proves that the global abnormal branch and the local abnormal branch have complementarity. Their contributions to the anomaly detection result are different, and for a single scene data set such as UCSD Ped2, the global anomaly branch contribution is larger; for such multi-scenario datasets as ShanghaiTech, the contribution of local outlier branches is larger. Fig. 5a and 5b show ROC graphs on UCSD Ped2 and ShanghaiTech data sets before and after a local abnormal branch is added to a global abnormal branch, respectively, and it can be seen that after the local abnormal branch is added to the model, the model focuses more on the region where an abnormality may occur, and the performance is improved.

the global prediction module and the local prediction module respectively comprise an encoder, a decoder and a memory module, wherein the encoder corresponds to the first network and performs feature extraction based on input data to obtain each preset query item corresponding to the input data; the memory module corresponds to the second network, obtains fusion feature items corresponding to the preset query items based on the preset query items, and updates the memory items based on the preset query items; the decoder corresponds to a third network, and obtains output data of the prediction video frame based on all the preset query items corresponding to the input data and fusion feature items corresponding to all the preset query items;

Fig. 6 shows the comparison result between the model obtained by the present embodiment and other models on the UCSD Ped2 dataset and ShanghaiTech dataset. The performance of the model was assessed by comparing the area under the ROC curve, i.e., the AUC (%) values. Higher AUC values indicate better model performance. The darkened font represents the highest value and the underlined font represents the next highest value. It can be seen that the effect of this method on the ShanghaiTech data set reached the 75.34% highest AUC value compared to the other methods. The next highest AUC value of 96.75% was achieved on the UCSD Ped2 dataset, differing only 0.15% from the highest. Fig. 7a and 7b show the frame-level anomaly detection results of the model in the UCSD Ped2 dataset and ShanghaiTech dataset, respectively. As can be clearly seen from fig. 7b, the effect of adding the partial abnormal branch to the global abnormal branch on the distinctiveness of the video abnormal detection is significantly improved, which proves the effectiveness of the two-branch memory module network in performing the video abnormal detection.

Fig. 8 shows the visualization effect of the anomaly localization. Fig. 8 shows the original image on the left side, the conventional predicted abnormal localization effect in the middle, and the abnormal localization effect of the model on the UCSD Ped2 and ShanghaiTech data sets on the right side. In view of the characteristics of the dual branches of the model, the anomaly localization is composed of two parts, the first part is the anomaly score of each target window obtained by local anomaly branches, and the second part is the error sum of the error result predicted by global anomaly in the target window. And fusing the two parts according to the contribution degree of the double branches to the abnormal detection result to obtain the final abnormal region positioning. The defects that the traditional prediction effect has misjudgment and a normal target has a small number of error areas can be seen from the graph, and the abnormal positioning effect of the model is more accurate.

The invention designs a multi-memory video anomaly detection and positioning method and system based on scene targets, and utilizes a global anomaly branch and a local anomaly branch to carry out anomaly detection on targets in a video from global and local angles respectively, thereby realizing the purpose of focusing attention on regions where anomalies possibly occur; by fully utilizing the distribution of the scene targets, the far-point targets and the near-point targets are scaled to be uniform in size, so that the difference caused by the field angle is effectively relieved; the method for performing abnormal quantification and positioning on each target by combining the characteristics of the double branches of the model can only mark an abnormal target area, and has very clear positioning effect. The method can improve the video abnormity detection performance and clearly position the abnormal target, and has important significance for the intelligent video abnormity detection field.

The above description is only a few of the preferred embodiments of the present application and is not intended to limit the present application, which may be modified and varied by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims

1. A scene target-based multi-memory video anomaly detection and positioning method is characterized by comprising the following steps: based on each normal video in the target scene, training from step S1 to step S2 is performed to obtain an optimal global prediction model and an optimal local prediction model, and through the steps S3 to S4, the optimal global prediction model and the optimal local prediction model are combined to perform anomaly detection and positioning on the currently acquired video frame:

step S3: taking a first preset number of continuous video frames of a previous video frame at the current moment in the historical time direction as data to be analyzed, predicting and obtaining a current prediction video frame based on an optimal global prediction model, predicting and obtaining a target prediction window of each preset type of target in the current prediction video frame by an optimal local prediction model, carrying out abnormity detection on a current collected video frame, and executing step S4 if the judgment result shows that the current collected video frame is abnormal; if the video frame is judged to be not abnormal, carrying out abnormal detection on the next collected video frame in the mode of S3;

2. The scene object based multi-memory video anomaly detection and localization method according to claim 1, characterized in that: in step S1, for each normal video, the specific process of extracting the target pipeline corresponding to each preset type target is as follows:

3. The scene object based multi-memory video anomaly detection and localization method according to claim 1, characterized in that: in step S2, the first network and the third network are both U-Net network structures, and the first network removes the last batch normalization layer and the ReLU layer and replaces the last batch normalization layer with the L2 normalization layer.

4. The scene object based multi-memory video anomaly detection and localization method according to claim 1, characterized in that: in step S2.1, input data Y based on a first preset number of consecutive video frames_S＝(E_t-S,…,E_t-2,E_t-1) Performing feature extraction on the input data through a first network to obtain each preset query item corresponding to the input data

5. The scenario object-based multi of claim 4The memory video abnormity detection and positioning method is characterized in that: in step S2.2, the second network includes M memory entries

Respectively aiming at preset query items

Respectively associated with each memory item

Weight of (2)

6. The scene object based multi-memory video anomaly detection and localization method according to claim 5, characterized in that: for each memory item in the second network, executing the following process to update each memory item:

weights of preset query items and memory items corresponding to input data

Obtaining index sets corresponding to all the memory items respectively

Each subset in the index set is the weight of the memory item

Weight of each preset query item in

The memory term is updated using the following equation:

wherein f (. cndot.) represents L2 normalization, p^mRefers to the updated memory entries.

7. The scene object based multi-memory video anomaly detection and localization method according to claim 1, characterized in that: in the step 2, the training loss functions L of the global prediction model and the local prediction model are respectively predicted loss L_predCharacteristic loss of compaction L_compactAnd characteristic separation loss L_separateThe three parts of the utility model are formed,

L＝L_pred+η_cL_compact+η_sL_separate

wherein eta is_cAnd η_sIs a preset coefficient; l is a radical of an alcohol_predTo predict the L2 distance between the output data of a video frame and the pixels of the captured video frame data,

for predicting output data of video frames, a_tIn order to collect data corresponding to output data in video frames, T represents the total video frame number of a normal video, and T is b, which means that input data of the first b-1 video frames of the normal video is used as an initial input value of a prediction model; l is_compactFor presetting query terms

The memory item p closest to the preset query item_pThe L2 distance of (a) is,

The similarity between the memory items and the mth memory item, namely the weight; l is_separateTo make the query term preset

8. The scene object based multi-memory video anomaly detection and localization method according to claim 1, characterized in that: the step S3 is to predict and obtain the current prediction video frame based on the optimal global prediction model, predict and obtain the target prediction window of each preset type target in the current prediction video frame based on the optimal local prediction model, and calculate the frame level abnormal score S of the collected video frame aiming at the current collected video frame_tPerforming an abnormality detection, if the abnormality is scored by S_tIf the preset threshold value is reached, the current collected video frame is abnormal;

S_t＝λS_local,t+(1-λ)S_global,t

wherein the content of the first and second substances,

representing a target prediction window

representing a target prediction window

A medium pixel maximum;

representing predicted video frames

representing predicted video frames

The medium pixel maximum.

9. The scene object based multi-memory video anomaly detection and localization method according to claim 1, characterized in that: in step S4, for detecting that the current captured video frame is abnormal, the current predicted video frame is obtained based on the prediction of the optimal global prediction model, the target prediction windows of each preset type of target in the current predicted video frame are obtained by the prediction of the optimal and local prediction models, and the abnormal score of each target window in the video frame is calculated for the current captured video frame

Positioning an abnormal target window of a currently acquired video frame;

wherein the content of the first and second substances,

in the formula (I), the compound is shown in the specification,

10. the system of the scene target-based multi-memory video anomaly detection and positioning method is characterized in that: comprises a training set module, a global prediction module, a local prediction module, an abnormality detection module and an abnormality positioning module,