CN114627421A - Multi-memory video anomaly detection and localization method and system based on scene targets


Info

Publication number
CN114627421A
Authority: CN (China)
Prior art keywords: target, video frame, preset, video, memory
Prior art date
Legal status: Withdrawn (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Application number
CN202210288377.6A
Other languages
Chinese (zh)
Inventors
李洪均 (Li Hongjun)
陈金怡 (Chen Jinyi)
孙晓虎 (Sun Xiaohu)
陈俊杰 (Chen Junjie)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nantong University
Original Assignee
Nantong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nantong University
Priority to CN202210288377.6A
Publication of CN114627421A
Legal status: Withdrawn

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/08 Learning methods

Abstract

The invention discloses a multi-memory video anomaly detection and localization method and system based on scene targets. A global anomaly branch and a local anomaly branch detect anomalies in the video from global and local perspectives respectively, so that attention is focused on regions where anomalies are likely to occur. By making full use of the distribution of targets in the scene, far-point and near-point targets are scaled to a uniform size, effectively mitigating the differences caused by the camera's angle of view. By combining the features of the model's two branches, each target's anomaly is quantified and localized, and only the anomalous target region is marked, giving a very clear localization effect. The method improves video anomaly detection performance while clearly localizing anomalous targets, and is of practical significance for intelligent video anomaly detection.

Description

Multi-memory video anomaly detection and localization method and system based on scene targets
Technical Field
The invention belongs to the field of video surveillance, and specifically relates to a scene-target-based multi-memory video anomaly detection and localization method and system.
Background
With the spread of surveillance equipment and the rise of intelligent monitoring, video anomaly detection has gradually become a research hotspot at home and abroad. Video anomaly detection refers to identifying undesirable behaviors or appearance patterns in the scenes of a particular location. Because anomalies are diverse and unpredictable, the task is usually treated as unsupervised: the model is trained only on normal data so that anomalies can be detected at test time. Video frame prediction and video frame reconstruction are currently the dominant approaches. A conventional frame-prediction method feeds consecutive whole video frames into a model as normal samples, and the model learns to fit the normal pattern. At test time, because anomalous samples were never seen during training, they cannot be predicted well when fed into the model, which achieves the purpose of anomaly detection.
At present, many existing frame-prediction methods take the whole video frame directly as model input, which is a global approach. It ignores the fact that anomalies are more likely to occur on foreground targets, which are local information within the frame. Regarding localization, current prediction-based methods treat pixel blocks with a large error between the predicted frame and the real frame as anomalous locations. This localization strategy has three problems. First, normally moving targets also produce error regions. Second, the error regions are scattered, and two targets that are very close together are hard to separate. Third, in real-world deployments, surveillance cameras are installed at different positions in different scenes, so the same target yields target windows of different sizes depending on the shooting angle. Because of this difference in the angle of view, for targets of the same pattern, subtle differences on far-point targets are weakened while subtle differences on near-point targets are enlarged, so near-point targets are more likely to be localized as anomalous.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides a scene-target-based multi-memory video anomaly detection and localization method that localizes anomalous targets while improving video anomaly detection performance.
In order to achieve this purpose, the invention adopts the following technical scheme:
In the scene-target-based multi-memory video anomaly detection and localization method, steps S1 to S2 are executed on all normal videos of the target scene to train an optimal global prediction model and an optimal local prediction model, and steps S3 to S4 combine the two models to detect and localize anomalies in the currently captured video frame:
Step S1: for each normal video in the target scene, first extract every frame of the video as the global training set; then extract the target window of each preset-type target in each video frame, obtaining for each preset-type target a target pipeline that follows it through the frames of the video, and use these pipelines as the local training set;
Step S2: based on the global training set, train a global prediction model with a first preset number of consecutive video frames as input and the following video frame as output, obtaining the optimal global prediction model;
based on the local training set, train a local prediction model with the target pipelines of the preset-type targets over a first preset number of consecutive frames as input and their target windows in the following frame as output, obtaining the optimal local prediction model; both the global and the local prediction model are trained by iterating steps S2.1 to S2.3 until the loss function stabilizes or the maximum number of iterations is reached:
Step S2.1: input the input data of the first preset number of consecutive video frames into a first network of the prediction model under training for feature extraction, obtaining the preset query items corresponding to the input data;
Step S2.2: input each preset query item into a second network to obtain its fused feature item, thereby obtaining the fused feature items of all preset query items;
Step S2.3: based on the preset query items of the input data and their fused feature items, pass them to a third network to obtain the output data of the video frame following the input data, and update the prediction model;
Step S3: take the first preset number of consecutive video frames preceding the current moment, in the direction of history, as the data to be analyzed; predict the current video frame with the optimal global prediction model and the target prediction windows of the preset-type targets in it with the optimal local prediction model, and perform anomaly detection on the currently captured video frame; if it is judged anomalous, execute step S4; if not, perform anomaly detection on the next captured video frame in the same way as S3;
Step S4: for a currently captured video frame detected as anomalous, use the current predicted video frame from the optimal global prediction model and the target prediction windows of the preset-type targets from the optimal local prediction model to localize the anomalous target windows in the captured frame, completing the detection.
As a preferred technical solution of the present invention, in step S1 the target pipeline of each preset-type target is extracted from a normal video as follows:
detect each preset-type target in every video frame with a YOLOv5 model pre-trained on the MS COCO dataset, associate the preset-type targets across consecutive frames with the DeepSORT tracking algorithm to obtain the target pipeline of each target, and scale each target pipeline to a uniform size without distortion, yielding the local training set.
As a preferred technical solution of the present invention, in step S2 the first and third networks both use a U-Net structure, and the first network removes the last batch-normalization layer and ReLU layer and uses an L2 normalization layer instead.
As a preferred technical solution of the present invention, in step S2.1 the input data of the first preset number S of consecutive video frames, $Y_S = (E_{t-S}, \ldots, E_{t-2}, E_{t-1})$, is passed through the first network for feature extraction, yielding the preset query items $q_t^k$ corresponding to the input data. Here $Y_S$ denotes the input data of the first preset S consecutive video frames, $t$ denotes the predicted moment, and $E$ denotes the input data of a single video frame; $q_t^k$ is the k-th preset query item corresponding to the input data of the frame predicted at time $t$, $k = 1, \ldots, K$, where $K$ is the total number of preset query items, $K = (W/8) \times (H/8)$, and $W$ and $H$ are the width and height of the input data.
As a preferred technical solution of the present invention, in step S2.2 the second network contains M memory items $p_m$, $m = 1, \ldots, M$. For each preset query item $q_t^k$, cosine similarity is computed between the query item and each of the M memory items and passed through a softmax function, giving the weight $w_t^{k,m}$ of the preset query item $q_t^k$ with respect to each memory item $p_m$:

$$w_t^{k,m} = \frac{\exp\big((p_m)^{\top} q_t^k\big)}{\sum_{m'=1}^{M} \exp\big((p_{m'})^{\top} q_t^k\big)}$$

Based on these weights and the memory items, the fused feature item $\hat{p}_t^k$ corresponding to each preset query item is obtained:

$$\hat{p}_t^k = \sum_{m=1}^{M} w_t^{k,m} \, p_m$$
As a preferred technical solution of the present invention, each memory item in the second network is updated by the following process:

From the weights $w_t^{k,m}$ between the preset query items of the input data and the memory items, obtain the index set $U_t^m$ of each memory item, where each subset of the index set contains the preset query items for which that memory item has the highest weight. For each memory item, based on the memory item and the weights $v_t^{k,m}$ of the preset query items in its index set $U_t^m$, the memory item is updated with:

$$p_m \leftarrow f\Big(p_m + \sum_{k \in U_t^m} v_t^{k,m} \, q_t^k\Big)$$

$$v_t^{k,m} = \frac{w_t^{k,m}}{\max_{k' \in U_t^m} w_t^{k',m}}$$

where $f(\cdot)$ denotes L2 normalization and $p_m$ refers to the updated memory item.
As a preferred technical solution of the present invention, the training loss function L of the global and local prediction models in step S2 consists of three parts: the prediction loss $L_{pred}$, the feature compactness loss $L_{compact}$ and the feature separateness loss $L_{separate}$:

$$L = L_{pred} + \eta_c L_{compact} + \eta_s L_{separate}$$

$$L_{pred} = \sum_{t=b}^{T} \big\| \hat{a}_t - a_t \big\|_2$$

$$L_{compact} = \sum_{t=b}^{T} \sum_{k=1}^{K} \big\| q_t^k - p_p \big\|_2$$

$$L_{separate} = \sum_{t=b}^{T} \sum_{k=1}^{K} \max\Big( \big\| q_t^k - p_p \big\|_2 - \big\| q_t^k - p_n \big\|_2 + \theta, \; 0 \Big)$$

where $\eta_c$ and $\eta_s$ are preset coefficients; $L_{pred}$ is the pixel-wise L2 distance between the output data $\hat{a}_t$ of the predicted video frame and the captured video frame data $a_t$; $T$ is the total number of frames of a normal video, and $t = b$ means that the input data of the first $b-1$ frames of the video serve as the initial input of the prediction model; $L_{compact}$ is the L2 distance between a preset query item $q_t^k$ and the memory item $p_p$ closest to it, where $w_t^{k,m}$ denotes the similarity, i.e. the weight, between the k-th preset query item of the frame predicted at time $t$ and the m-th memory item; $L_{separate}$ makes the memory item $p_n$ second-closest to a preset query item lie farther from the query item than the closest memory item $p_p$ by at least a preset margin $\theta$.
As a preferred technical solution of the present invention, in step S3 the current predicted video frame is obtained from the optimal global prediction model and the target prediction windows of the preset-type targets in it from the optimal local prediction model, and a frame-level anomaly score $S_t$ is computed for the currently captured video frame; if $S_t$ reaches the preset threshold, the currently captured frame is anomalous:

$$S_t = \lambda S_{local,t} + (1-\lambda) S_{global,t}$$

where

$$S_{local,t} = \max_{j=1,\ldots,J} S_t^j$$

$$S_t^j = \sigma_1 \Big(1 - g\big(P(\hat{o}_t^j, o_t^j)\big)\Big) + (1-\sigma_1)\, g\big(D(q_t, p)\big)$$

$$S_{global,t} = \sigma_2 \Big(1 - g\big(P(\hat{I}_t, I_t)\big)\Big) + (1-\sigma_2)\, g\big(D(q_t, p)\big)$$

$$D(q_t, p) = \frac{1}{K} \sum_{k=1}^{K} \big\| q_t^k - p_p^k \big\|_2$$

$$P(\hat{o}_t, o_t) = 10 \log_{10} \frac{\big[\max(\hat{o}_t)\big]^2}{\frac{1}{N_o} \big\| \hat{o}_t - o_t \big\|_2^2}$$

$$P(\hat{I}_t, I_t) = 10 \log_{10} \frac{\big[\max(\hat{I}_t)\big]^2}{\frac{1}{N_I} \big\| \hat{I}_t - I_t \big\|_2^2}$$

In these formulas, $S_{local,t}$ is the local anomaly score of the video frame captured at the current time $t$ and $S_{global,t}$ its global anomaly score; $S_t^j$ is the anomaly score of the j-th preset-type target among the J targets in the frame; $\lambda$ is the contribution degree, $\sigma_1$ and $\sigma_2$ are preset coefficients, and $g(\cdot)$ denotes min-max normalization. $D(q_t, p)$ is the average L2 distance between the preset query items $q_t^k$ of the frame predicted at time $t$ and their closest memory items $p_p^k$, with $K$ the total number of preset query items per preset-type target. $P(\hat{o}_t^j, o_t^j)$ is the peak signal-to-noise ratio between a target prediction window $\hat{o}_t^j$ and the target window $o_t$ in the captured frame, with $N_o$ the number of pixels in the target window and $\max(\hat{o}_t)$ the maximum pixel value of the prediction window; $P(\hat{I}_t, I_t)$ is the peak signal-to-noise ratio between the predicted frame $\hat{I}_t$ and the captured frame $I_t$, with $N_I$ the number of pixels in the frame and $\max(\hat{I}_t)$ the maximum pixel value of the predicted frame.
As a preferred technical solution of the present invention, in step S4, for a currently captured video frame detected as anomalous, the current predicted frame is obtained from the optimal global prediction model and the target prediction windows of the preset-type targets from the optimal local prediction model, and the anomaly score $S_t^{o_j}$ of each target window in the frame is computed to localize the anomalous target windows:

$$S_t^{o_j} = \lambda \, g\big(S_{local,t}^{o_j}\big) + (1-\lambda)\, g\big(S_{global,t}^{o_j}\big)$$

where

$$S_{*,t}^{o_j} = \sum \big| \hat{o}_{*,t}^{\,j} - o_t^{\,j} \big|, \qquad * \in \{local,\ global\}$$

Here $S_{*,t}^{o_j}$ sums, over the window, the differences between the pixels of the target window $\hat{o}_{*,t}^{\,j}$ predicted for the j-th preset-type target at time $t$ and the corresponding pixels of the target window $o_t^{\,j}$ of the j-th target in the captured frame; local denotes the local prediction, global the global prediction, there are J targets in total, and $\lambda$ denotes the contribution degree.

The anomaly score of every target window in the currently captured frame is computed, and a preset proportion $\gamma$ of the highest target anomaly score is taken as the standard threshold $\theta_t$; targets whose anomaly scores exceed $\theta_t$ are judged anomalous:

$$\theta_t = \gamma \, \max_{j=1,\ldots,J} S_t^{o_j}$$
the system of the multi-memory video anomaly detection and positioning method based on the scene target comprises a training set module, a global prediction module, a local prediction module, an anomaly detection module and an anomaly positioning module,
the training set module obtains a global training set and a local training set based on each normal video in a target scene;
the global prediction module and the local prediction module respectively comprise an encoder, a decoder and a memory module, wherein the encoder corresponds to the first network and performs feature extraction based on input data to obtain each preset query item corresponding to the input data; the memory module corresponds to the second network, obtains fusion feature items corresponding to the preset query items based on the preset query items, and updates the memory items based on the preset query items; the decoder corresponds to the third network, and obtains output data of the prediction video frame based on all the preset query items corresponding to the input data and the fusion feature items corresponding to all the preset query items;
the anomaly detection module carries out anomaly detection on the currently acquired video frame based on a current prediction video frame obtained by global prediction model prediction and a target prediction window of each preset type of target in the current prediction video frame obtained by optimal local prediction model prediction;
and the abnormity positioning module is used for detecting that the current collected video frame is abnormal, predicting the obtained current predicted video frame based on the optimal global prediction model, predicting a target prediction window of each preset type of target in the current video frame by using the optimal and local prediction models, and positioning the abnormal target window of the current collected video frame.
The invention has the beneficial effects that: the invention provides a multi-memory video anomaly detection and positioning method and system based on scene targets, which are used for carrying out anomaly detection on targets in a video from a global angle and a local angle respectively by utilizing a global anomaly branch and a local anomaly branch, so that the targets focusing on regions where anomalies possibly occur are realized; by fully utilizing the distribution of the scene targets, the far-point targets and the near-point targets are scaled to be uniform in size, so that the difference caused by the field angle is effectively relieved; the method for performing abnormal quantification and positioning on each target by combining the characteristics of the double branches of the model can only mark an abnormal target area, and has very clear positioning effect. The method can improve the video abnormity detection performance and clearly position the abnormal target, and has important significance for the intelligent video abnormity detection field.
Drawings
FIG. 1 is the overall framework diagram;
FIG. 2 illustrates the target pipeline and adaptive size in the local anomaly branch;
FIG. 3a shows the target-window distribution of the UCSD Ped2 dataset;
FIG. 3b shows the target-window distribution of the ShanghaiTech dataset;
FIG. 4 is a table of two-branch fusion performance;
FIG. 5a shows ROC curves on the UCSD Ped2 dataset before and after adding the local anomaly branch to the global anomaly branch;
FIG. 5b shows ROC curves on the ShanghaiTech dataset before and after adding the local anomaly branch to the global anomaly branch;
FIG. 6 compares the present model with other models on the UCSD Ped2 and ShanghaiTech datasets;
FIG. 7a shows frame-level anomaly detection results on the UCSD Ped2 dataset;
FIG. 7b shows frame-level anomaly detection results on the ShanghaiTech dataset;
FIG. 8 shows the visualization effect of anomaly localization.
Detailed Description
The following examples will give the skilled person a more complete understanding of the present invention, but do not limit it in any way.
The scene-target-based multi-memory video anomaly detection and localization method and system realize the unsupervised task by predicting video frames, i.e. predicting the current frame from consecutive preceding frames, as shown in FIG. 1. The model consists of two parts: a global anomaly branch and a local anomaly branch. The global anomaly branch mainly detects anomalies of all targets in the whole frame and of the background; the local anomaly branch mainly detects anomalies in the pattern and form of individual targets. Because the extracted targets differ in distance from the camera and in form, their target windows differ in size, so the detected targets must be scaled without distortion according to the scene target distribution. The encoder then extracts the normal pattern of the target, and the memory module is read or updated while the decoder is trained to predict the next frame, so that at test time the memory module helps check whether a single target is anomalous. Finally, for the same frame, the anomaly scores from the global anomaly branch and the local anomaly branch are fused through a hyper-parameter to obtain the final anomaly scores of the frame and of all targets in it.
The method is trained on two public datasets in the video anomaly detection field, the UCSD Ped2 dataset and the ShanghaiTech dataset. UCSD Ped2 comprises 16 training videos and 12 test videos, with anomalies mainly including skateboarding, cycling and driving; ShanghaiTech has 330 training videos and 107 test videos, with anomalies mainly including cycling, driving, skateboarding, falling, running, etc. UCSD Ped2 contains only 1 scene, while ShanghaiTech contains 13 scenes in total.
The scene-target-based multi-memory video anomaly detection and localization method executes steps S1 to S2 on all normal videos of the target scene to train the optimal global and local prediction models, and through steps S3 to S4 combines the two models to detect and localize anomalies in the currently captured video frame.
Step S1: for each normal video in the target scene, first extract every frame of the video as the global training set; then extract the target window of each preset-type target in each video frame, obtaining for each preset-type target a target pipeline that follows it through the frames of the video, and use these pipelines as the local training set.
In step S1, the target pipeline of each preset-type target is extracted from a normal video as follows. In the local anomaly branch, in order to focus on the target itself, each preset-type target in every frame is detected with a YOLOv5 model pre-trained on the MS COCO dataset, and the preset-type targets of consecutive frames are associated with the DeepSORT tracking algorithm to obtain the target pipeline of each target; the pipelines are then scaled to a uniform size without distortion to form the local training set. Because of target motion and other factors, the target pipelines produced by YOLOv5 and DeepSORT differ in size, so they cannot be used directly as input. For targets of the same pattern, the difference in the angle of view amplifies the fine prediction errors of near-point targets and weakens those of far-point targets, so the two are not judged against one standard and near-point targets are more likely to be declared anomalous. The detected targets are therefore scaled without distortion according to the scene target distribution, i.e. the distribution of target sizes in the training set. The target window is first resized without deformation so that one side equals the corresponding side of the fixed size while the other remains smaller, and the smaller side is then padded up to the fixed size. As for choosing the fixed width and height, considering that the distribution of target sizes varies with the shooting angle, the height and width are chosen so as to cover the peak of the target-window size distribution and a certain proportion of the window counts around that peak.
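As an illustration of the undistorted scaling step, the sketch below shows one way to letterbox a detected target window to a fixed size with OpenCV. The function name, the zero padding value, and the default size are illustrative assumptions, not details taken from the patent; the patent only requires the fixed size to be chosen from the scene's target-window distribution and (per the first-network constraint in step S2) to be a multiple of 8.

```python
import cv2

def undistorted_resize(window, size=(64, 64)):
    """Scale a target window to a fixed size without distortion: resize so
    that neither side exceeds the fixed size, then zero-pad the shorter side.
    `size` is (height, width) and should be a multiple of 8 for the encoder."""
    fixed_h, fixed_w = size
    h, w = window.shape[:2]
    scale = min(fixed_h / h, fixed_w / w)          # keep the aspect ratio
    new_w, new_h = int(round(w * scale)), int(round(h * scale))
    resized = cv2.resize(window, (new_w, new_h))   # cv2 expects (width, height)
    pad_h, pad_w = fixed_h - new_h, fixed_w - new_w
    return cv2.copyMakeBorder(resized, pad_h // 2, pad_h - pad_h // 2,
                              pad_w // 2, pad_w - pad_w // 2,
                              cv2.BORDER_CONSTANT, value=0)
```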
Step S2: based on the global training set, train the global prediction model with a first preset number of consecutive video frames as input and the following video frame as output, obtaining the optimal global prediction model.
Based on the local training set, train the local prediction model with the target pipelines of the preset-type targets over a first preset number of consecutive frames as input and their target windows in the following frame as output, obtaining the optimal local prediction model; the input data of the local prediction model are the target pipelines of the respective preset-type targets. Both the global and the local prediction model are trained by iterating steps S2.1 to S2.3 until the loss function stabilizes or the maximum number of iterations is reached.
The first and third networks both use a U-Net structure; the first network removes the last batch-normalization layer and ReLU layer and uses an L2 normalization layer instead, because the final ReLU layer would remove negative values and weaken the representation capability. Also, since the first network contains three down-sampling layers, the width and height of the input target pipeline must be multiples of 8.
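A minimal sketch of that modification, assuming a PyTorch U-Net-style encoder: the output of the final convolution becomes the query feature map and is L2-normalized along the channel axis instead of passing through BatchNorm and ReLU. The layer sizes are illustrative.

```python
import torch.nn as nn
import torch.nn.functional as F

class QueryHead(nn.Module):
    """Last encoder block: L2 normalization replaces BatchNorm + ReLU,
    so negative feature values are preserved and each of the
    (W/8) x (H/8) spatial positions yields a unit-length query item."""
    def __init__(self, in_ch, feat_dim=512):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, feat_dim, kernel_size=3, padding=1)

    def forward(self, x):
        q = self.conv(x)                    # (B, feat_dim, H/8, W/8)
        return F.normalize(q, p=2, dim=1)   # one unit-length query per position
```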
Step S2.1: input the input data of the first preset number of consecutive video frames into the first network of the prediction model under training for feature extraction, obtaining the preset query items corresponding to the input data.
In step S2.1, the input data of the first preset number S of consecutive video frames, $Y_S = (E_{t-S}, \ldots, E_{t-2}, E_{t-1})$, is passed through the first network for feature extraction, yielding the preset query items $q_t^k$ corresponding to the input data. Here $Y_S$ denotes the input data of the S consecutive frames, $t$ denotes the predicted moment (in this technical scheme each frame corresponds to one moment), and $E$ denotes the input data of a single video frame; $q_t^k$ is the k-th preset query item corresponding to the input data of the frame predicted at time $t$, $k = 1, \ldots, K$, where $K$ is the total number of preset query items, $K = (W/8) \times (H/8)$, and $W$ and $H$ are the width and height of the input data.
Step S2.2: input each preset query item into the second network to obtain its fused feature item, thereby obtaining the fused feature items of all preset query items.
In step S2.2, the second network supports both reading and updating, and contains M memory items $p_m$, $m = 1, \ldots, M$. For each preset query item $q_t^k$, cosine similarity is computed between the query item and each of the M memory items and passed through a softmax function, giving the weight $w_t^{k,m}$ of the preset query item $q_t^k$ with respect to each memory item $p_m$:

$$w_t^{k,m} = \frac{\exp\big((p_m)^{\top} q_t^k\big)}{\sum_{m'=1}^{M} \exp\big((p_{m'})^{\top} q_t^k\big)}$$

Based on these weights and the memory items, the fused feature item $\hat{p}_t^k$ corresponding to each preset query item is obtained:

$$\hat{p}_t^k = \sum_{m=1}^{M} w_t^{k,m} \, p_m$$

For each query item $q_t^k$, the corresponding normal-pattern feature item $\hat{p}_t^k$ is obtained by reading the second network. The memory items of the second network correspond to multiple normal patterns, and each preset query item $q_t^k$ extracts the most relevant information from several normal patterns and fuses it; finally, $(W/8) \times (H/8)$ fused feature items are obtained from the $(W/8) \times (H/8)$ preset query items.
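The read operation just described can be sketched as follows. This is a minimal PyTorch version; the tensor shapes are assumptions, with the K query items flattened into rows.

```python
import torch
import torch.nn.functional as F

def read_memory(queries, memory):
    """Read step. queries: (K, C) preset query items q_t^k; memory: (M, C)
    memory items p_m. Returns the (K, M) weights w and (K, C) fused items."""
    # Cosine similarity reduces to a dot product once both sides are
    # L2-normalized (the encoder and memory keep unit-length vectors).
    sim = F.normalize(queries, dim=1) @ F.normalize(memory, dim=1).t()
    w = torch.softmax(sim, dim=1)   # softmax over the M memory items
    fused = w @ memory              # weighted sum of memory items per query
    return w, fused
```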
For each memory item in the second network, the following update process is executed. The update strategy is that, for each memory item in the memory module, the query items most similar to it are selected as that memory item's query items for the update.
From the weights $w_t^{k,m}$ between the preset query items of the input data and the memory items, the index set $U_t^m$ of each memory item is obtained; each subset of the index set contains the preset query items for which that memory item has the highest weight. For each memory item, based on the memory item and the weights $v_t^{k,m}$ of the preset query items in its index set $U_t^m$, the memory item is updated with:

$$p_m \leftarrow f\Big(p_m + \sum_{k \in U_t^m} v_t^{k,m} \, q_t^k\Big)$$

$$v_t^{k,m} = \frac{w_t^{k,m}}{\max_{k' \in U_t^m} w_t^{k',m}}$$

where $f(\cdot)$ denotes L2 normalization and $p_m$ refers to the updated memory item. Here $v_t^{k,m}$ re-weights, within the index set $U_t^m$, the softmax weights obtained from the cosine similarity between the preset query items and the memory item, so that the most similar normal-pattern query items receive more attention when the memory item is updated through the weighted sum of query items.
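A sketch of the update step under the same assumed shapes. Note that the re-weighting inside the index set is reconstructed here as normalization by the set's maximum weight; the patent's original formula image is not recoverable, so treat that detail as an assumption.

```python
import torch.nn.functional as F

def update_memory(queries, memory, w):
    """Update step (training only). queries: (K, C); memory: (M, C);
    w: (K, M) read weights. Each memory item p_m is pulled toward the
    queries that rank it first, then L2-normalized."""
    nearest = w.argmax(dim=1)        # index of the top memory item per query
    new_memory = memory.clone()
    for m in range(memory.size(0)):
        idx = (nearest == m).nonzero(as_tuple=True)[0]   # index set U_m
        if idx.numel() == 0:
            continue                 # no query selected this memory item
        v = w[idx, m] / w[idx, m].max()   # re-weight inside U_m (assumed form)
        new_memory[m] = F.normalize(
            memory[m] + (v.unsqueeze(1) * queries[idx]).sum(dim=0), dim=0)
    return new_memory
```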
Step S2.3: based on the preset query items of the input data and their fused feature items, pass them to the third network to obtain the output data of the video frame following the input data, and update the prediction model.
The first network corresponds to the encoder, the second network to the memory module, and the third network to the decoder.
The network structure of the prediction part of the local anomaly branch is identical to that of the global anomaly branch; only the parameters differ. Although the two branches are trained separately in two stages, the training loss is the same. The training loss function L of the global and local prediction models consists of three parts: the prediction loss $L_{pred}$, the feature compactness loss $L_{compact}$ and the feature separateness loss $L_{separate}$:

$$L = L_{pred} + \eta_c L_{compact} + \eta_s L_{separate}$$

$$L_{pred} = \sum_{t=b}^{T} \big\| \hat{a}_t - a_t \big\|_2$$

$$L_{compact} = \sum_{t=b}^{T} \sum_{k=1}^{K} \big\| q_t^k - p_p \big\|_2$$

$$L_{separate} = \sum_{t=b}^{T} \sum_{k=1}^{K} \max\Big( \big\| q_t^k - p_p \big\|_2 - \big\| q_t^k - p_n \big\|_2 + \theta, \; 0 \Big)$$

Here $\eta_c$ and $\eta_s$ are preset coefficients, both set to 0.1 in this embodiment. So that the model learns to predict normal samples correctly during training, the prediction loss penalizes the pixel-level difference between the decoder's output data and the captured data, specifically the L2 distance between pixels: $L_{pred}$ is the L2 distance between the output data $\hat{a}_t$ of the predicted video frame and the pixels of the captured video frame data $a_t$; $T$ is the total number of frames of a normal video, and $t = b$ means the input data of the first $b-1$ frames of the video serve as the initial input of the prediction model, with $b = 5$ used in this embodiment. The feature compactness loss penalizes the difference between a query item $q_t^k$ and the memory item $p_p$ closest to it, specifically their L2 distance; because the query items output by the encoder during training all belong to normal patterns, this loss pulls memory items and query items together so that the memory module learns and stores the normal patterns. The feature separateness loss makes the memory item $p_n$ second-closest to a preset query item lie farther from the query item than the closest memory item $p_p$ by at least a preset margin $\theta$, set to 1 in this embodiment: because the compactness loss shortens the distance between memory items and query items, and the parameters of the encoder and memory module are all updatable during training, all memory items and query items would become nearly identical late in training; the separateness loss counteracts this by increasing the differences between memory items, achieving the goal of multiple normal patterns.
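Under the same assumed shapes, the three-part loss can be sketched as below; the summation over query items follows the description above, and the coefficient defaults match the embodiment (0.1, 0.1, theta = 1).

```python
import torch

def training_loss(pred, real, queries, memory, eta_c=0.1, eta_s=0.1, theta=1.0):
    """L = L_pred + eta_c * L_compact + eta_s * L_separate.
    pred/real: predicted and captured frame (or target window) tensors;
    queries: (K, C) query items; memory: (M, C) memory items."""
    l_pred = torch.norm(pred - real, p=2)            # pixel-level L2 distance

    d = torch.cdist(queries, memory)                 # (K, M) pairwise L2 distances
    nearest2, _ = d.topk(2, dim=1, largest=False)    # distances to p_p and p_n
    l_compact = nearest2[:, 0].sum()                 # pull each query toward p_p
    # Keep p_n at least theta farther away than p_p (hinge / margin form).
    l_separate = torch.clamp(nearest2[:, 0] - nearest2[:, 1] + theta, min=0).sum()

    return l_pred + eta_c * l_compact + eta_s * l_separate
```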
Step S3: take the first preset number of consecutive video frames preceding the current moment, in the direction of history, as the data to be analyzed; predict the current video frame with the optimal global prediction model and the target prediction windows of the preset-type targets in it with the optimal local prediction model, and perform anomaly detection on the currently captured video frame; if it is judged anomalous, execute step S4; if not, perform anomaly detection on the next captured video frame in the same way as S3.
In step S3, the current predicted video frame is obtained from the optimal global prediction model and the target prediction windows of the preset-type targets from the optimal local prediction model. The global anomaly branch mainly detects anomalies of the whole frame; the local anomaly branch mainly detects anomalies of the target pattern. The two branches emphasize different aspects of anomaly detection, and their contributions differ across scenes, so the results of global and local anomaly detection are fused through the hyper-parameter $\lambda$, allowing the model to adapt to various scenes. A frame-level anomaly score $S_t$ is computed for the currently captured frame; if $S_t$ reaches the preset threshold (after testing, the threshold with the highest recognition accuracy is selected), the current frame is anomalous:

$$S_t = \lambda S_{local,t} + (1-\lambda) S_{global,t}$$

where

$$S_{local,t} = \max_{j=1,\ldots,J} S_t^j$$

$$S_t^j = \sigma_1 \Big(1 - g\big(P(\hat{o}_t^j, o_t^j)\big)\Big) + (1-\sigma_1)\, g\big(D(q_t, p)\big)$$

$$S_{global,t} = \sigma_2 \Big(1 - g\big(P(\hat{I}_t, I_t)\big)\Big) + (1-\sigma_2)\, g\big(D(q_t, p)\big)$$

$$D(q_t, p) = \frac{1}{K} \sum_{k=1}^{K} \big\| q_t^k - p_p^k \big\|_2$$

$$P(\hat{o}_t, o_t) = 10 \log_{10} \frac{\big[\max(\hat{o}_t)\big]^2}{\frac{1}{N_o} \big\| \hat{o}_t - o_t \big\|_2^2}$$

$$P(\hat{I}_t, I_t) = 10 \log_{10} \frac{\big[\max(\hat{I}_t)\big]^2}{\frac{1}{N_I} \big\| \hat{I}_t - I_t \big\|_2^2}$$

Here $S_{local,t}$ is the local anomaly score of the frame captured at the current time $t$ and $S_{global,t}$ its global anomaly score; $S_t^j$ is the anomaly score of the j-th preset-type target among the J targets in the frame; $\lambda$ is the contribution degree, and $\sigma_1$, $\sigma_2$ are preset coefficients, both 0.6 here; $g(\cdot)$ denotes min-max normalization. $D(q_t, p)$ is the average L2 distance between the preset query items $q_t^k$ of the frame predicted at time $t$ and their closest memory items $p_p^k$, with $K$ the total number of preset query items per preset-type target. Because an anomalous target window cannot be predicted well, the prediction difference is measured by the peak signal-to-noise ratio (PSNR) between the predicted target window (or predicted frame) and the real target window (or real frame); when the prediction and the real window differ greatly, the PSNR is very low. Since the two measures are not on the same order of magnitude, and when an anomaly occurs one rises while the other falls, they are balanced by $\sigma_1$ and $\sigma_2$. $P(\hat{o}_t^j, o_t^j)$ is the PSNR between the target prediction window $\hat{o}_t^j$ and the target window $o_t$ in the captured frame, with $N_o$ the number of pixels in the target window and $\max(\hat{o}_t)$ the maximum pixel value of the prediction window; $P(\hat{I}_t, I_t)$ is the PSNR between the predicted frame $\hat{I}_t$ and the captured frame $I_t$, with $N_I$ the number of pixels in the frame and $\max(\hat{I}_t)$ the maximum pixel value of the predicted frame.
For the frame-level video anomaly detection task, the goal is an anomaly score for the whole frame. Therefore, for the local anomaly branch, the highest target anomaly score represents the anomaly score of the entire frame, while the global anomaly branch scores the whole frame directly. The two-branch anomaly score draws on two aspects. One is the degree of difference between the preset query items and the memory items: since only normal samples are seen during training, the memory module stores several normal patterns; if an anomalous sample is fed to the encoder, the encoder extracts its anomalous features and the resulting query items differ substantially from the memory items. Taking $D(q_t, p)$ as an example, the min-max normalization $g(\cdot)$ is computed as

$$g\big(D(q_t, p)\big) = \frac{D(q_t, p) - \min_t D(q_t, p)}{\max_t D(q_t, p) - \min_t D(q_t, p)}$$
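A sketch of the frame-level scoring, assuming the per-frame PSNR and memory-distance values of a test video have already been collected into arrays. The exact balancing of the two measures by sigma follows the reconstruction above and should be treated as an assumption.

```python
import numpy as np

def psnr(pred, real):
    """Peak signal-to-noise ratio between a predicted and a real frame/window."""
    mse = np.mean((pred - real) ** 2)
    return 10.0 * np.log10(pred.max() ** 2 / mse)

def g(x):
    """Min-max normalization over the values collected for one test video."""
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min() + 1e-8)

def branch_scores(psnr_vals, mem_dist, sigma=0.6):
    """Per-frame branch score: balance low PSNR (bad prediction) against a
    large query-to-memory distance D(q_t, p), as reconstructed above."""
    return sigma * (1.0 - g(psnr_vals)) + (1.0 - sigma) * g(mem_dist)

def fuse(local_scores, global_scores, lam):
    """S_t = lambda * S_local,t + (1 - lambda) * S_global,t."""
    return lam * np.asarray(local_scores) + (1.0 - lam) * np.asarray(global_scores)
```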
step S4: and aiming at detecting that the current collected video frame is abnormal, predicting and obtaining the current prediction video frame based on the optimal global prediction model, predicting and obtaining target prediction windows of various preset types of targets in the current prediction video frame by the optimal local prediction model, positioning the abnormal target windows of the current collected video frame, and finishing the detection.
In step S4, for detecting that the current captured video frame is abnormal, the current predicted video frame is obtained based on the prediction of the optimal global prediction model, the target prediction windows of each preset type of target in the current predicted video frame are obtained by the prediction of the optimal and local prediction models, and the abnormal score of each target window in the video frame is calculated for the current captured video frame
Figure BDA0003559192270000131
And positioning an abnormal target window of the current acquisition video frame, wherein the abnormal score is composed of two parts, one part is formed by focusing on an abnormal value presumed by the self mode of the target through a local abnormal branch, and the other part is formed by combining the background and the abnormal value presumed by the target through a global abnormal branch. The fusion of the two parts is determined according to the contribution degree of the two branches to video frame abnormity detection at the frame level;
Figure BDA0003559192270000132
wherein the content of the first and second substances,
Figure BDA0003559192270000133
in the formula (I), the compound is shown in the specification,
Figure BDA0003559192270000134
representing a target window corresponding to the jth preset type target in the video frame at the predicted time t
Figure BDA0003559192270000135
Pixel in (b) and the target window corresponding to the jth target in the captured video frame t
Figure BDA0003559192270000136
The corresponding pixels in the image are subjected to difference and summation, local represents local prediction, global represents global prediction, J targets are in total, and lambda represents contribution degree; in the local abnormal branch part, the pixel in the target window is predicted directly by the local abnormal branch, and the predicted pixel in the target window of the global abnormal branch part is divided according to the target window from the global predicted video frame. Because the abnormal values of the two parts are not in the same magnitude, the two parts are unified and normalized to 0,1]And then the two parts are fused through the contribution degree lambda, and finally the abnormal score of each target can be obtained.
Since the anomaly localization problem presupposes that the model detects a video frame in the frame-level task that is an anomaly for that frame, i.e., there is at least one anomaly in the video frame, the target with the highest anomaly score is considered to be an anomalous target. Meanwhile, considering that the abnormal scores of all abnormal video frames are different and the abnormal score of an abnormal target and the normal target in the same frame have larger difference, an abnormal target judgment standard is set, the abnormal score of each target window in the video frame is calculated aiming at the current collected video frame, and based on the target with the highest abnormal score, the preset proportion gamma of the target is used as a standard threshold thetatExceeding the standard threshold value thetatThe target corresponding to the abnormality score of (1) is determined to be abnormal;
Figure BDA0003559192270000137
by this abnormal object judgment criterion, the abnormal object can be clearly judged. Furthermore, a more specific visualization effect is to display only the anomalous target. Therefore, the abnormal object window is marked red as a whole, visualization is clearer, and the problems that a normal object also has an error area and the object is adhered are effectively solved. And a scene target distribution method is adopted, far and near targets are unified to the same measurement standard for comparison, and the visual field difference is effectively relieved.
The global anomaly branch and the local anomaly branch are trained separately. The global anomaly branch scales video frames and normalizes them to the range [-1, 1]; the local anomaly branch normalizes the adaptively sized target windows to [-1, 1]. The training batches of the global anomaly branch on the UCSD Ped2 and ShanghaiTech datasets are 60 and 10 respectively, and those of the local anomaly branch are 20 and 10. The memory modules all hold 10 items of dimension 512. The Adam optimizer is used throughout, with $\sigma_1 = 0.6$, $\sigma_2 = 0.6$, $\eta_c = 0.1$, $\eta_s = 0.1$, $\gamma = 0.3$ and $\theta = 1$; the initial learning rate is 2e-4 and is decayed by cosine annealing. All models are built with the PyTorch framework; the CPU is an Intel(R) Xeon(R) Silver 4216 CPU @ 2.10 GHz and the graphics card is an Nvidia GeForce RTX 3090.
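For reference, a minimal PyTorch optimizer setup matching the stated hyper-parameters (Adam, initial learning rate 2e-4, cosine-annealing decay) might look like this; the stand-in model and the annealing period are placeholders, not values from the patent:

```python
import torch

# Stand-in for one of the two prediction branches (placeholder model).
model = torch.nn.Conv2d(3, 3, kernel_size=3, padding=1)
optimizer = torch.optim.Adam(model.parameters(), lr=2e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=60)

for epoch in range(60):   # 60 matches the global-branch batches on UCSD Ped2
    # ... compute L = L_pred + eta_c * L_compact + eta_s * L_separate,
    #     then loss.backward(), optimizer.step(), optimizer.zero_grad() ...
    scheduler.step()
```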
The overall framework of the scheme is shown in FIG. 1, the target pipeline and adaptive size in the local anomaly branch in FIG. 2, and the scene-target distributions on the two datasets in FIG. 3a and FIG. 3b. The global and local anomaly branches emphasize different aspects of anomaly detection, and their contributions differ across scenes; the results of global and local anomaly detection are therefore fused through the hyper-parameter $\lambda$ so that the model adapts to various scenes, with the optimal value selected experimentally (FIG. 4). In the table, $\lambda$ is the contribution ratio of the local anomaly branch and $1-\lambda$ that of the global anomaly branch; bold denotes the optimal value and underline the suboptimal value. The table shows that $\lambda$ achieves optimal performance at 0.2 on UCSD Ped2 and 0.8 on ShanghaiTech, and suboptimal performance at 0.3 and 0.7 respectively, demonstrating the complementarity of the global and local anomaly branches. Their contributions to the detection result differ: for a single-scene dataset such as UCSD Ped2, the global anomaly branch contributes more; for a multi-scene dataset such as ShanghaiTech, the local anomaly branch contributes more. FIG. 5a and FIG. 5b show ROC curves on UCSD Ped2 and ShanghaiTech before and after adding the local anomaly branch to the global anomaly branch; after adding the local branch, the model focuses more on regions where anomalies may occur and performance improves.
The system of the scene-target-based multi-memory video anomaly detection and localization method comprises a training-set module, a global prediction module, a local prediction module, an anomaly detection module and an anomaly localization module.
The training-set module obtains the global and local training sets from the normal videos of the target scene.
The global prediction module and the local prediction module each comprise an encoder, a decoder and a memory module. The encoder corresponds to the first network and extracts features from the input data to obtain the preset query items corresponding to it; the memory module corresponds to the second network, obtains the fused feature item of each preset query item and updates the memory items based on the preset query items; the decoder corresponds to the third network and obtains the output data of the predicted video frame from the preset query items of the input data and their fused feature items.
The anomaly detection module detects anomalies in the currently captured video frame based on the current predicted frame from the global prediction model and the target prediction windows of the preset-type targets from the optimal local prediction model.
The anomaly localization module, for a captured frame detected as anomalous, uses the current predicted frame from the optimal global prediction model and the target prediction windows of the preset-type targets from the optimal local prediction model to localize the anomalous target windows in the captured frame.
FIG. 6 shows the comparison between the model of this embodiment and other models on the UCSD Ped2 and ShanghaiTech datasets. Performance is assessed by the area under the ROC curve, i.e. the AUC (%) value; a higher AUC indicates better performance. Bold denotes the highest value and underline the second-highest. Compared with the other methods, this method reaches the highest AUC of 75.34% on the ShanghaiTech dataset, and the second-highest AUC of 96.75% on the UCSD Ped2 dataset, only 0.15% below the highest. FIG. 7a and FIG. 7b show the frame-level anomaly detection results on UCSD Ped2 and ShanghaiTech respectively. As FIG. 7b shows clearly, adding the local anomaly branch to the global anomaly branch markedly improves the discriminability of video anomaly detection, demonstrating the effectiveness of the two-branch memory-module network.
FIG. 8 shows the visualization effect of anomaly localization: the original image on the left, the conventional prediction-based localization in the middle, and the localization of this model on the UCSD Ped2 and ShanghaiTech datasets on the right. Owing to the model's two branches, the localization consists of two parts: the first is the anomaly score of each target window from the local anomaly branch, and the second is the summed error of the global prediction within the target window. The two parts are fused according to the contributions of the two branches to the anomaly detection result, giving the final anomalous-region localization. The figure shows that the conventional prediction approach suffers from misjudgments and small error regions on normal targets, whereas the localization of this model is more accurate.
The invention designs a scene-target-based multi-memory video anomaly detection and localization method and system in which a global anomaly branch and a local anomaly branch detect anomalies in the video from global and local perspectives respectively, focusing attention on regions where anomalies are likely to occur; by making full use of the scene target distribution, far-point and near-point targets are scaled to a uniform size, effectively mitigating the differences caused by the angle of view; and by combining the features of the model's two branches, each target is quantified and localized for anomaly, marking only the anomalous target region with a very clear localization effect. The method improves video anomaly detection performance while clearly localizing anomalous targets, and is of significance for intelligent video anomaly detection.
The above description covers only a few preferred embodiments of the present application and is not intended to limit it; those skilled in the art may make various modifications and variations. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall fall within its protection scope.

Claims (10)

1. A scene-target-based multi-memory video anomaly detection and localization method, characterized by comprising the following steps: based on each normal video in a target scene, performing the training of steps S1 to S2 to obtain an optimal global prediction model and an optimal local prediction model, and then, through steps S3 to S4, combining the optimal global prediction model and the optimal local prediction model to perform anomaly detection and localization on the currently acquired video frame:
step S1: for each normal video in the target scene, first extracting each frame of the normal video as a global training set, and then extracting the target window corresponding to each preset-type target in each video frame of the normal video, so as to obtain, for each preset-type target, a target pipeline following that target through the video frames of the normal video, as a local training set;
step S2: based on the global training set, training a global prediction model that takes a first preset number of consecutive video frames as input and the video frame following those consecutive frames as output, to obtain the optimal global prediction model;
based on the local training set, training a local prediction model that takes the target pipeline of each preset-type target over a first preset number of consecutive video frames as input and the target window of that preset-type target in the video frame following those consecutive frames as output, to obtain the optimal local prediction model; the global prediction model training and the local prediction model training each iteratively execute steps S2.1 to S2.3 until the loss function stabilizes or the maximum number of iterations is reached:
step S2.1: based on the prediction model to be trained, inputting the input data of the first preset number of consecutive video frames into a first network for feature extraction, to obtain each preset query item corresponding to the input data;
step S2.2: inputting each preset query item into a second network to obtain the fusion feature item corresponding to that preset query item, thereby obtaining the fusion feature items corresponding to all the preset query items;
step S2.3: based on each preset query item corresponding to the input data and the fusion feature item corresponding to each preset query item, inputting them into a third network to obtain the output data of the video frame following the input data, and updating the prediction model;
step S3: taking the first preset number of consecutive video frames preceding the current moment in the historical time direction as the data to be analyzed, predicting the current predicted video frame with the optimal global prediction model and the target prediction window of each preset-type target in the current predicted video frame with the optimal local prediction model, and performing anomaly detection on the currently acquired video frame; if the frame is judged abnormal, executing step S4; if the frame is judged not abnormal, performing anomaly detection on the next acquired video frame in the manner of step S3;
step S4: upon detecting that the currently acquired video frame is abnormal, based on the current predicted video frame obtained by the optimal global prediction model and the target prediction windows of each preset-type target in the current predicted video frame obtained by the optimal local prediction model, locating the abnormal target windows of the currently acquired video frame, and ending the detection.
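For illustration only (not part of the claims): a minimal sketch of the step S3/S4 control flow, with the two prediction models, the scoring function and the localization function supplied as caller-provided callables; all names here are illustrative, not taken from the patent.

```python
from collections import deque

def monitor(stream, s, threshold, predict_global, predict_local,
            score_fn, localize_fn):
    """Slide a window of the s most recent frames (the first preset number),
    score each newly acquired frame, and localize once a frame is flagged."""
    history = deque(maxlen=s)
    for frame in stream:
        if len(history) == s:
            pred_frame = predict_global(list(history))    # optimal global model
            pred_windows = predict_local(list(history))   # optimal local model
            if score_fn(frame, pred_frame, pred_windows) > threshold:
                return localize_fn(frame, pred_frame, pred_windows)  # step S4
        history.append(frame)
    return None  # no anomaly detected in the stream
```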
2. The scene-target-based multi-memory video anomaly detection and localization method according to claim 1, characterized in that: in step S1, for each normal video, the target pipeline corresponding to each preset-type target is extracted as follows:
each preset-type target in each video frame is detected with a YOLOv5 model pre-trained on the MS COCO dataset; the preset-type targets in consecutive video frames are associated by means of the DeepSORT target tracking algorithm, yielding the target pipeline corresponding to each preset-type target; and each target pipeline is scaled without distortion to a uniform size, obtaining the local training set.
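For illustration only (not part of the claims): a sketch of the undistorted (aspect-preserving) scaling step in Python with OpenCV, assuming a 64x64 output size — the uniform size itself is not specified in the patent; the YOLOv5 detection and DeepSORT association calls are omitted.

```python
import cv2
import numpy as np

def letterbox(patch: np.ndarray, size: int = 64) -> np.ndarray:
    """Scale the longer side of a target window to `size`, keep the aspect
    ratio, and zero-pad the rest so far and near targets share one size."""
    h, w = patch.shape[:2]
    scale = size / max(h, w)
    new_w, new_h = max(1, round(w * scale)), max(1, round(h * scale))
    resized = cv2.resize(patch, (new_w, new_h))
    canvas = np.zeros((size, size) + patch.shape[2:], dtype=patch.dtype)
    top, left = (size - new_h) // 2, (size - new_w) // 2
    canvas[top:top + new_h, left:left + new_w] = resized
    return canvas
```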
3. The scene-target-based multi-memory video anomaly detection and localization method according to claim 1, characterized in that: in step S2, the first network and the third network both adopt the U-Net network structure, and in the first network the final batch normalization layer and ReLU layer are removed and replaced with an L2 normalization layer.
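For illustration only (not part of the claims): a PyTorch sketch of the claim-3 modification to the tail of the first network, where the final batch normalization and ReLU are replaced by L2 normalization of the feature map; the channel counts are assumed values.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EncoderTail(nn.Module):
    """Final block of the first network: no BatchNorm, no ReLU; the output
    features are L2-normalized channel-wise so query items have unit norm."""
    def __init__(self, in_ch: int = 128, out_ch: int = 512):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        q = self.conv(x)                   # last conv of the encoder
        return F.normalize(q, p=2, dim=1)  # L2 normalization over channels
```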
4. The scene-target-based multi-memory video anomaly detection and localization method according to claim 1, characterized in that: in step S2.1, based on the input data $Y_S = (E_{t-S}, \ldots, E_{t-2}, E_{t-1})$ of the first preset number of consecutive video frames, feature extraction is performed on the input data through the first network to obtain each preset query item $q_t^k$ corresponding to the input data;
wherein $Y_S$ denotes the input data of the first preset number $S$ of consecutive video frames, $t$ denotes the predicted time, and $E$ denotes the input data of a single video frame; $q_t^k$ is the $k$-th preset query item corresponding to the input data for the video frame at time $t$, $k = 1, \ldots, K$.
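For illustration only (not part of the claims): a PyTorch sketch of one common way to obtain the $K$ preset query items from the first network's feature map, treating each spatial location as one query of dimension $C$; this per-location reading is an assumption consistent with memory-network practice, not an explicit statement of the claim.

```python
import torch

def to_query_items(feature_map: torch.Tensor) -> torch.Tensor:
    """Turn an encoder feature map (B, C, H, W) into K = H*W query items of
    dimension C, one per spatial location."""
    b, c, h, w = feature_map.shape
    return feature_map.permute(0, 2, 3, 1).reshape(b, h * w, c)  # (B, K, C)
```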
5. The scene-target-based multi-memory video anomaly detection and localization method according to claim 4, characterized in that: in step S2.2, the second network comprises $M$ memory items $p_m$, $m = 1, \ldots, M$; for each preset query item $q_t^k$, the cosine similarity between the preset query item and each of the $M$ memory items is computed and applied to a softmax function, obtaining the weight $w_t^{k,m}$ of the preset query item $q_t^k$ with respect to each memory item $p_m$:

$$w_t^{k,m} = \frac{\exp\!\big((p_m)^\top q_t^k\big)}{\sum_{m'=1}^{M} \exp\!\big((p_{m'})^\top q_t^k\big)}$$

based on the weights of the memory items and each preset query item, the fusion feature item $\hat{p}_t^k$ corresponding to each preset query item is obtained as the weighted sum:

$$\hat{p}_t^k = \sum_{m'=1}^{M} w_t^{k,m'} \, p_{m'}$$
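For illustration only (not part of the claims): a PyTorch sketch of the claim-5 memory read — cosine similarity between query items and memory items, softmax weights over the memory axis, and the weighted-sum fusion feature items. Since the first network L2-normalizes its output (claim 3), the dot product of normalized vectors equals the cosine similarity.

```python
import torch
import torch.nn.functional as F

def read_memory(queries: torch.Tensor, memory: torch.Tensor):
    """queries: (K, C) preset query items; memory: (M, C) memory items.
    Returns the (K, C) fusion feature items and the (K, M) weights w[k, m]."""
    sim = F.normalize(queries, dim=1) @ F.normalize(memory, dim=1).T  # cosine
    weights = torch.softmax(sim, dim=1)   # softmax over the M memory items
    fused = weights @ memory              # weighted sum = fusion feature items
    return fused, weights
```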
6. The scene-target-based multi-memory video anomaly detection and localization method according to claim 5, characterized in that: for each memory item in the second network, the following process is executed to update the memory items:
based on the weights $w_t^{k,m}$ between the preset query items corresponding to the input data and the memory items, the index set $U_t^m$ corresponding to each memory item is obtained, where each subset of the index set contains the preset query items for which that memory item's weight $w_t^{k,m}$ is the highest; for each memory item, based on the memory item and the weights $v_t^{k,m}$ of the preset query items in its corresponding index set $U_t^m$, the memory item is updated using the following formulas:

$$p_m \leftarrow f\!\left(p_m + \sum_{k \in U_t^m} v_t^{k,m} \, q_t^k\right)$$

$$v_t^{k,m} = \frac{w_t^{k,m}}{\max_{k' \in U_t^m} w_t^{k',m}}$$

wherein $f(\cdot)$ denotes L2 normalization, and $p_m$ refers to the updated memory item.
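For illustration only (not part of the claims): a PyTorch sketch of the claim-6 update — each query item is assigned to its highest-weight memory item, the weights inside each index set are renormalized by their maximum (an assumption consistent with the formulas reconstructed above), and the result is L2-normalized.

```python
import torch
import torch.nn.functional as F

def update_memory(memory: torch.Tensor, queries: torch.Tensor,
                  weights: torch.Tensor) -> torch.Tensor:
    """memory: (M, C); queries: (K, C); weights: (K, M) from the memory read."""
    nearest = weights.argmax(dim=1)                    # assignment of each query
    new_memory = memory.clone()
    for m in range(memory.size(0)):
        idx = (nearest == m).nonzero(as_tuple=True)[0]  # index set U_t^m
        if idx.numel() == 0:
            continue                                    # no query matched item m
        v = weights[idx, m] / weights[idx, m].max()     # renormalized weights
        new_memory[m] = F.normalize(
            memory[m] + (v[:, None] * queries[idx]).sum(0), dim=0)  # f(.) = L2 norm
    return new_memory
```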
7. The scene-target-based multi-memory video anomaly detection and localization method according to claim 1, characterized in that: in step S2, the training loss function $L$ of the global prediction model and of the local prediction model each consists of three parts: the prediction loss $L_{pred}$, the feature compactness loss $L_{compact}$ and the feature separateness loss $L_{separate}$:

$$L = L_{pred} + \eta_c L_{compact} + \eta_s L_{separate}$$

$$L_{pred} = \sum_{t=b}^{T} \big\| \hat{a}_t - a_t \big\|_2$$

$$L_{compact} = \sum_{t=b}^{T} \sum_{k=1}^{K} \big\| q_t^k - p_p \big\|_2$$

$$L_{separate} = \sum_{t=b}^{T} \sum_{k=1}^{K} \Big[ \big\| q_t^k - p_p \big\|_2 - \big\| q_t^k - p_n \big\|_2 + \theta \Big]_+$$

wherein $\eta_c$ and $\eta_s$ are preset coefficients; $L_{pred}$ is the L2 distance between the pixels of the predicted video frame's output data $\hat{a}_t$ and the acquired video frame data $a_t$ corresponding to the output data, $T$ denotes the total number of video frames of the normal video, and $t = b$ means that the input data of the first $b-1$ video frames of the normal video serves as the initial input of the prediction model; $L_{compact}$ is the L2 distance between each preset query item $q_t^k$ and the memory item $p_p$ closest to it, where $w_t^{k,m}$ denotes the similarity, i.e., the weight, between the $k$-th preset query item $q_t^k$ of the input data for the video frame at time $t$ and the $m$-th memory item; $L_{separate}$ pushes the memory item $p_n$ second closest to the preset query item $q_t^k$ further from the preset query item than the closest memory item $p_p$ by at least a preset threshold $\theta$.
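For illustration only (not part of the claims): a PyTorch sketch of the claim-7 training loss for one frame; the eta coefficients and the margin theta are placeholder values, not the patent's presets.

```python
import torch

def training_loss(pred, target, queries, memory,
                  eta_c=0.1, eta_s=0.1, theta=1.0):
    """pred/target: predicted and acquired frame tensors;
    queries: (K, C) preset query items; memory: (M, C) memory items."""
    l_pred = torch.norm(pred - target, p=2)            # prediction L2 error
    dist = torch.cdist(queries, memory)                # (K, M) pairwise L2
    two_nearest, _ = dist.topk(2, dim=1, largest=False)
    d_p, d_n = two_nearest[:, 0], two_nearest[:, 1]    # nearest / second nearest
    l_compact = d_p.sum()                              # pull to nearest item
    l_separate = torch.clamp(d_p - d_n + theta, min=0).sum()  # margin push
    return l_pred + eta_c * l_compact + eta_s * l_separate
```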
8. The scene-target-based multi-memory video anomaly detection and localization method according to claim 1, characterized in that: in step S3, based on the current predicted video frame obtained by the optimal global prediction model and the target prediction window of each preset-type target in the current predicted video frame obtained by the optimal local prediction model, the frame-level anomaly score $S_t$ of the currently acquired video frame is calculated to perform anomaly detection; if the anomaly score $S_t$ reaches the preset threshold, the currently acquired video frame is abnormal:

$$S_t = \lambda S_{local,t} + (1-\lambda) S_{global,t}$$

$$S_{local,t} = \max_{j=1,\ldots,J} S_t^j$$

$$S_t^j = \sigma_1 \Big( 1 - g\big(\mathrm{PSNR}(o_t, \hat{o}_t^j)\big) \Big) + \sigma_2 \, g\big(D(q_t, P)\big)$$

$$S_{global,t} = \sigma_1 \Big( 1 - g\big(\mathrm{PSNR}(I_t, \hat{I}_t)\big) \Big) + \sigma_2 \, g\big(D(q_t, P)\big)$$

$$D(q_t, P) = \frac{1}{K} \sum_{k=1}^{K} \big\| q_t^k - p_p \big\|_2$$

$$\mathrm{PSNR}(o_t, \hat{o}_t^j) = 10 \log_{10} \frac{\max(\hat{o}_t^j)^2}{\frac{1}{N_o} \big\| o_t - \hat{o}_t^j \big\|_2^2}$$

$$\mathrm{PSNR}(I_t, \hat{I}_t) = 10 \log_{10} \frac{\max(\hat{I}_t)^2}{\frac{1}{N_I} \big\| I_t - \hat{I}_t \big\|_2^2}$$

wherein $S_{local,t}$ denotes the local anomaly score of the video frame acquired at the current time $t$, and $S_{global,t}$ denotes its global anomaly score; $S_t^j$ denotes the anomaly score of the $j$-th preset-type target in the video frame acquired at the current time $t$, with $J$ targets in total; $\lambda$ denotes the contribution degree, $\sigma_1$ and $\sigma_2$ are preset coefficients, and $g(\cdot)$ denotes min-max normalization; $D(q_t, P)$ denotes the mean L2 distance between each preset query item $q_t^k$ of the video frame at time $t$ and the memory item $p_p$ nearest to that preset query item, with $K$ the total number of preset query items corresponding to each preset-type target; $\mathrm{PSNR}(o_t, \hat{o}_t^j)$ denotes the peak signal-to-noise ratio between the target prediction window $\hat{o}_t^j$ and the target window $o_t$ in the acquired video frame, $N_o$ denotes the number of pixels contained in the target window, and $\max(\hat{o}_t^j)$ denotes the maximum pixel value in the target prediction window; $\mathrm{PSNR}(I_t, \hat{I}_t)$ denotes the peak signal-to-noise ratio between the predicted video frame $\hat{I}_t$ and the acquired video frame $I_t$, $N_I$ denotes the number of pixels contained in the video frame, and $\max(\hat{I}_t)$ denotes the maximum pixel value in the predicted video frame.
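For illustration only (not part of the claims): a NumPy sketch of the claim-8 frame-level score. The max over per-target scores, the coefficient values, and the identity default for g are assumptions; in practice g should be a min-max normalization computed over the whole test video.

```python
import numpy as np

def psnr(pred: np.ndarray, real: np.ndarray) -> float:
    """PSNR between a predicted and a captured frame (or target window)."""
    mse = np.mean((pred.astype(np.float64) - real.astype(np.float64)) ** 2)
    return 10.0 * np.log10(float(pred.max()) ** 2 / max(mse, 1e-12))

def frame_score(psnr_global, dist_global, psnr_locals, dist_locals,
                lam=0.5, s1=0.6, s2=0.4, g=lambda x: x):
    """Blend the local and global branch scores into the frame score S_t;
    each branch mixes normalized prediction error with query-memory distance."""
    s_global = s1 * (1 - g(psnr_global)) + s2 * g(dist_global)
    s_local = max(s1 * (1 - g(p)) + s2 * g(d)
                  for p, d in zip(psnr_locals, dist_locals))
    return lam * s_local + (1 - lam) * s_global
```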
9. The scene-target-based multi-memory video anomaly detection and localization method according to claim 1, characterized in that: in step S4, upon detecting that the currently acquired video frame is abnormal, based on the current predicted video frame obtained by the optimal global prediction model and the target prediction windows of each preset-type target in the current predicted video frame obtained by the optimal local prediction model, the anomaly score $\hat{S}_t^j$ of each target window in the currently acquired video frame is calculated to locate the abnormal target windows:

$$\hat{S}_t^j = \lambda S_{local,t}^j + (1-\lambda) S_{global,t}^j$$

wherein

$$S_{global,t}^j = \sum_{i} \big| \hat{o}_t^j(i) - o_t^j(i) \big|$$

in the formula, $S_{global,t}^j$ denotes the sum, over pixels $i$, of the differences between the pixels of the target window $\hat{o}_t^j$ corresponding to the $j$-th preset-type target in the predicted video frame at time $t$ and the corresponding pixels of the target window $o_t^j$ of the $j$-th target in the acquired video frame at time $t$; local denotes the local prediction, global denotes the global prediction, there are $J$ targets in total, and $\lambda$ denotes the contribution degree;
the anomaly score of each target window in the currently acquired video frame is calculated, and a preset proportion $\gamma$ of the highest anomaly score is taken as the standard threshold $\theta_t$; targets whose anomaly scores exceed the standard threshold $\theta_t$ are determined to be abnormal:

$$\theta_t = \gamma \max_{j=1,\ldots,J} \hat{S}_t^j$$
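For illustration only (not part of the claims): a NumPy sketch of the claim-9 localization, blending each window's local-branch score with the global prediction-error sum inside the window and thresholding at gamma times the highest score; the gamma and lambda values are placeholders for the patent's presets.

```python
import numpy as np

def localize(windows, local_scores, pred_frame, real_frame,
             lam=0.5, gamma=0.8):
    """windows: list of (x1, y1, x2, y2) target boxes; local_scores: per-window
    scores from the local branch. Returns the windows judged abnormal."""
    err = np.abs(pred_frame.astype(np.float64) - real_frame.astype(np.float64))
    scores = []
    for (x1, y1, x2, y2), s_local in zip(windows, local_scores):
        s_global = err[y1:y2, x1:x2].sum()     # error sum inside the window
        scores.append(lam * s_local + (1 - lam) * s_global)
    threshold = gamma * max(scores)            # standard threshold theta_t
    return [w for w, s in zip(windows, scores) if s >= threshold]
```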
10. A system for the scene-target-based multi-memory video anomaly detection and localization method, characterized by comprising a training set module, a global prediction module, a local prediction module, an anomaly detection module and an anomaly localization module, wherein:
the training set module obtains the global training set and the local training set based on each normal video in the target scene;
the global prediction module and the local prediction module each comprise an encoder, a decoder and a memory module; the encoder corresponds to the first network and performs feature extraction on the input data to obtain each preset query item corresponding to the input data; the memory module corresponds to the second network, obtains the fusion feature item corresponding to each preset query item based on the preset query items, and updates the memory items based on the preset query items; the decoder corresponds to the third network and obtains the output data of the predicted video frame based on each preset query item corresponding to the input data and the fusion feature item corresponding to each preset query item;
the anomaly detection module performs anomaly detection on the currently acquired video frame based on the current predicted video frame obtained by the optimal global prediction model and the target prediction window of each preset-type target in the current predicted video frame obtained by the optimal local prediction model;
and the anomaly localization module, upon detection that the currently acquired video frame is abnormal, locates the abnormal target windows of the currently acquired video frame based on the current predicted video frame obtained by the optimal global prediction model and the target prediction windows of each preset-type target obtained by the optimal local prediction model.
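For illustration only (not part of the claims): a sketch of the claim-10 module layout as a Python dataclass; the field names and types are illustrative placeholders for the five modules.

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class AnomalySystem:
    """Each prediction branch bundles an encoder (first network), a memory
    module (second network) and a decoder (third network); detection and
    localization consume both branches' predictions."""
    build_training_sets: Callable[[Any], tuple]  # training set module
    global_branch: Any                           # encoder + memory + decoder
    local_branch: Any                            # encoder + memory + decoder
    detect: Callable[..., float]                 # anomaly detection module
    localize: Callable[..., list]                # anomaly localization module
```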
CN202210288377.6A 2022-03-22 2022-03-22 Multi-memory video anomaly detection and positioning method and system based on scene target Withdrawn CN114627421A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210288377.6A CN114627421A (en) 2022-03-22 2022-03-22 Multi-memory video anomaly detection and positioning method and system based on scene target

Publications (1)

Publication Number Publication Date
CN114627421A true CN114627421A (en) 2022-06-14

Family

ID=81904591

Country Status (1)

Country Link
CN (1) CN114627421A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117011616A (en) * 2023-10-07 2023-11-07 腾讯科技(深圳)有限公司 Image content auditing method and device, storage medium and electronic equipment
CN117011616B (en) * 2023-10-07 2024-01-26 腾讯科技(深圳)有限公司 Image content auditing method and device, storage medium and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication
Application publication date: 20220614