WO2021147055A1

WO2021147055A1 - Systems and methods for video anomaly detection using multi-scale image frame prediction network

Info

Publication number: WO2021147055A1
Application number: PCT/CN2020/073932
Authority: WO
Inventors: Zhengping Che; Xuanzhao WANG; Ke Yang; Bo Jiang; Jian Tang
Original assignee: Beijing Didi Infinity Technology And Development Co., Ltd.
Priority date: 2020-01-22
Filing date: 2020-01-22
Publication date: 2021-07-29

Abstract

Systems, methods, and computer-readable media for video anomaly detection using a learning model. The method may include receiving a video including a plurality of image frames captured by a camera, the plurality of image frames including a current image frame corresponding to a selected time point and a set of prior image frames corresponding to time points prior to the selected time point. The method may also include extracting spatial features from each prior image frame at different resolutions and predicting an estimated image frame in each resolution based on the spatial features of the set of prior image frames in that resolution. The method may further include predicting an estimated current image frame based on the estimated image frames in the different resolutions and detecting the video anomaly based on a difference between the captured current image frame and the estimated current image frame.

Description

SYSTEMS AND METHODS FOR VIDEO ANOMALY DETECTION USING MULTI-SCALE IMAGE FRAME PREDICTION NETWORK

TECHNICAL FIELD

The present disclosure relates to systems and methods for video anomaly detection, and more particularly to, systems and methods for video anomaly detection using a multi-scale image frame prediction network.

BACKGROUND

Video anomaly detection plays an essential role in computer vision and is used in many applications such as warning system, scene understanding, activity recognition, road traffic analysis, etc. Given a video clip, the frame-level video anomaly detection aims at identifying the frames in which there exist events or behaviors different from expectations or regulations. Detecting anomalous events in videos is very challenging for mainly two reasons. First, the definition of an anomalous event is usually ambiguous, and its pattern varies a lot because it highly depends on the contexts of the event and the scenario where the event happens. For example, a vehicle driving down the road within the speed limits is normal but a vehicle dashing towards a crowd of people is anomalous. For another example, a cloud of smoke rising from inside of a building often indicates an anomaly, but cloud of smoke coming from a chimney is normal. Therefore, a robust vide anomaly detection method has to be able to address the ambiguous definition of the anomalous event. Second, the data between normal and abnormal samples is usually imbalanced in practice. Because anomalous events are rare and unpredictable in real-world scenarios, it is very difficult and high cost to collect and label abnormal videos. Therefore, video anomaly detection methods have to learn from normal data only (i.e., using the video of normal events to distinguish unseen and unbounded abnormal events from normal ones) .

Various video anomaly detection methods have been developed throughout time. Traditional video anomaly detection methods rely on hand-crafted features from prior domain knowledge. These methods learn dictionaries or descriptors of normal events based on extracted appearance and motion features. However, their detection performance is limited by the poor discriminative power of the simple features. The recent advance in deep learning networks promoted the deep learning-based video anomaly detection methods. These methods can be generally grouped into three categories: classification-based method, reconstruction-based methods, and prediction-based methods.

Classification-based method directly classifies each video frame to be normal or abnormal. Among them, some methods require additional anomaly data and labels in the training phase via a weakly supervised approach. However, because collecting and labeling anomaly data for training is expensive or even infeasible, this approach cannot be generalized to unbounded anomalies. Other works focus on specific types of anomaly events. For example, a multi-stage classification method first detects and crops out the object of interests and then builds one-versus-rest classifiers on the extracted features of that object. Since its anomaly detection result is substantially influenced by the performance of its object detection, it will fail in recognizing anomalies with unseen objects or no objects to attribute. Therefore, this type of method is very limited when dealing with complicated and uncertain real-world scenarios.

Reconstruction-based method is a more general way for video anomaly detection, which learns to reconstruct the input video frame with minimum reconstruction errors for normal frames and is supposed to have large errors for abnormal frames. Auto-encoder based methods and generative adversarial networks are commonly used as the reconstruction models. However, due to the improved performance of deep neural networks, an abnormal event may also have small reconstruction errors, so there is no guarantee that reconstruction-based methods can detect abnormal events well.

Prediction-based method has been developed to remedy the issues of the construction-based method and reconstruction-based method. It takes consecutive video frames to predict the next frame and determines whether the next frame is abnormal by the prediction error. However, the existing prediction-based methods are still suboptimal. For example, their U-Net architecture cannot fully learn temporal information and using adversarial learning and additional optical flow loss makes the training inefficient.

Embodiments of the disclosure address the above problems by providing prediction-based video anomaly detection methods and systems using a multi-scale frame prediction network

SUMMARY

Embodiments of the disclosure provide a method for video anomaly detection using a learning model. An exemplary method may include receiving a video including a plurality of image frames captured by a camera, the plurality of image frames including a current image frame corresponding to a selected time point and a set of prior image frames corresponding to time points prior to the selected time point. The method may also include extracting spatial features from each prior image frame at different resolutions and predicting an estimated image frame in each resolution based on the spatial features of the set of prior image frames in that resolution. The method may further include predicting an estimated current image frame based on the estimated image frames in the different resolutions and detecting the video anomaly based on a difference between the captured current image frame and the estimated current image frame.

Embodiments of the disclosure also provide a system for performing video anomaly detection using a learning model. An exemplary system may include a communication interface configured to receive a video including a plurality of image frames captured by a camera, the plurality of image frames including a current image frame corresponding to a selected time point and a set of prior image frames corresponding to time points prior to the selected time point. The system may also include at least one processor coupled to the communication interface. The at least one processor may be configured to extract spatial features from each prior image frame at different resolutions and predict an estimated image frame in each resolution based on the spatial features of the set of prior image frames in that resolution. The at least one processor may be further configured to predict an estimated current image frame based on the estimated image frames in the different resolutions and detect the video anomaly based on a difference between the captured current image frame and the estimated current image frame.

Embodiments of the disclosure further provide a non-transitory computer-readable medium storing instructions that, when executed by one or more processors, cause the one or more processors to perform a method for video anomaly detection using a learning model. The method may include receiving a video including a plurality of image frames captured by a camera, the plurality of image frames including a current image frame corresponding to a selected time point and a set of prior image frames corresponding to time points prior to the selected time point. The method may also include extracting spatial features from each prior image frame at different resolutions and predicting an estimated image frame in each resolution based on the spatial features of the set of prior image frames in that resolution. The method may further include predicting an estimated current image frame based on the estimated image frames in the different resolutions and detecting the video anomaly based on a difference between the captured current image frame and the estimated current image frame.

It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention, as claimed.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 illustrates a schematic diagram of an exemplary anomaly detection system, according to embodiments of the disclosure.

FIG. 2 illustrates a block diagram of an exemplary anomaly detection device, according to embodiments of the disclosure.

FIG. 3 illustrates a flowchart of an exemplary method for anomaly detection, according to embodiments of the disclosure.

FIG. 4 illustrates a flowchart of an exemplary method for training an anomaly detection network, according to embodiments of the disclosure.

FIG. 5 illustrates a schematic framework of an exemplary anomaly detection network along with its’ training network, according to embodiments of the disclosure.

DETAILED DESCRIPTION

Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. Wherever possible, the same reference numbers will be used throughout the drawings to refer to the same or like parts.

FIG. 1 illustrates a schematic diagram of an exemplary video anomaly detection system (referred to as “anomaly detection system 100” ) , according to embodiments of the disclosure. Consistent with the present disclosure, anomaly detection system 100 is configured to detect anomalous events recorded in a video (i.e., determine if there is a video anomaly) captured by a camera 160. The anomaly detection may be based on a multi-scale image frame prediction network (referred to as an “anomaly detection network 105” hereafter) trained using sample videos (e.g., training data 101) . In some embodiments, anomaly detection system 100 may include components shown in FIG. 1, including an anomaly detection device 110, a model training device 120, a display device 130, a training database 140, a database/repository 150, a camera 160 and a network 170 for facilitating communications among the various components. It is to be contemplated that anomaly detection system 100 may include more or less components compared to those shown in FIG. 1.

As shown in FIG. 1, anomaly detection system 100 may perform two stages: an anomaly detection model training stage and an anomaly detection stage applying the trained model. To perform the training stage (e.g., training a learning model such as anomaly detection network 105) , anomaly detection system 100 may include model training device 120 and training database 140. To perform the video anomaly detection process to obtain an anomaly detection result 107 (e.g., whether the video includes anomalous event (s) ) , anomaly detection system 100 may include anomaly detection device 110 and database/repository 150. In some embodiments, anomaly detection system 100 may also include display device 130 to display anomaly detection result 107. In some embodiments, when the learning model (e.g., anomaly detection network 105) is pre-trained, anomaly detection system 100 may only include components for performing the anomaly detection related functions, namely anomaly detection device 110, database/repository 150 and optionally display device 130.

In some embodiments, anomaly detection system 100 may optionally include network 170 to facilitate the communication among the various components of anomaly detection system 100, such as

databases

140 and 150,

devices

110 and 120, and camera 160. For example, network 170 may be a local area network (LAN) , a wireless network, a cloud computing environment (e.g., software as a service, platform as a service, infrastructure as a service) , a client-server, a wide area network (WAN) , etc. In some embodiments, network 170 may be replaced by wired data communication systems or devices.

In some embodiments, the various components of anomaly detection system 100 may be remote from each other or in different locations and be connected through network 170 as shown in FIG. 1. In some alternative embodiments, certain components of anomaly detection system 100 may be located on the same site or inside one device. For example, training database 140 may be located on-site with or be part of model training device 120. As another example, model training device 120 and anomaly detection device 110 may be inside the same computer or processing device.

Consistent with the present disclosure, anomaly detection system 100 may store video that include multiple image frames. Image frames of a video may each correspond to a time point. Consistent with the present disclosure, a “current image frame” refers to the image frame corresponding to a selected time point. Accordingly, “prior image frames” refer to the set of image frames corresponding to time points prior to the selected time point. Likewise, “future image frames” refer to the set of image frames corresponding to time points subsequent to the selected time point. In some embodiments, sample image frames where anomalies are known (e.g., training data 101) may be stored in training database 140 and image frames for detection (e.g., image frames 102) may be stored in database/repository 150.

Image frames may be generated based on video data (e.g., a video comprising a plurality of image frames) received from video recording devices (e.g., camera 160) . In some embodiments, the video data may be visual image streams including the current image frame, the set of prior image frames and the set of future image frames, acquired by camera 160 or a wearable device, a smart phone, a tablet, a computer, a surveillance camera, or the like that includes a video recording device for acquiring the video data. In some embodiments, camera 160 may be any suitable video recording device that can acquire image frames. Image frames may be generated based on the acquired video data. Optionally, the video data may also come from a post-processing device implementing image enhancement technology (e.g., an application on a user device for adding night visions to the surveillance devices) .

In some embodiments, training database 140 may store training data 101, which includes sample image frames. In some embodiments, training data 101 may further include the known anomalies (if any) in the sample image frames. For example, the sample image frames may include an image frame corresponding to a time point when anomalous event happens, and image frames prior and subsequent to that time point. In some embodiments, training data 101 may include image frames that are known anomaly-free. Sample image frames may be stored in training database 140 as training data 101.

Consistent with some embodiments, anomaly detection network 105 (described in detail in connection with FIG. 5) may be a multi-scale image frame prediction network (referred to as “prediction network” hereafter) for predicting an estimated current image frame (i.e., a prediction of an image frame at a selected time point) based on a set of prior image frames corresponding to time points prior to the selected time point. Anomaly detection network 105 may also include an anomaly score calculation model (referred to as “assessment model” hereafter) for determining an anomaly score, indicative of a difference between the ground truth current image frame (i.e., captured image frame at the selected time point) and the estimated current image frame. In some embodiments, anomaly detection network 105 may decide whether the current image frame records/includes anomalous event (s) (i.e., whether an anomaly occurs at or around the selected time point) based on the anomaly sore.

In some embodiments, the prediction network may include an encoder for extracting spatial features from each prior image frame (i.e., the set of prior image frames corresponding to time points prior to the selected time point) at different resolutions, a predictor for predicting an estimated image frame in each resolution based on the spatial features of the set of prior image frames in that resolution and a decoder for predicting the estimated current image frame based on the estimated image frames in the different resolutions.

In some embodiments, the encoder may include at least one convolution layer and a plurality of residual blocks for obtaining spatial features of the prior image frames in different resolutions. Each residual block may be configured to extract spatial features in a specific resolution. The residual blocks may be connected in a sequential manner, each producing spatial features in a different resolution. The more residual blocks are applied to an image frame, the lower spatial features are extracted. For example, each residual block may include a 2-D convolution, and when extracting the spatial features from each prior image frame, a first residual block for a first resolution may be applied to the image frame for generating spatial features in a first resolution, and a second residual block may be applied to the spatial features in the first resolution to obtain spatial features in a second resolution. The second resolution is lower than the first resolution. It is understood that the number of residual blocks of the plurality of residual blocks is not limited to two. In some embodiments, the plurality of residual blocks may include three or more residual blocks applied in a sequential manner, for generating spatial features in three or more different resolutions. The multi-scale architecture may allow the encoder to extract spatial features of the image frame at different scales.

In some embodiments, the predictor may include a plurality of sub-models in parallel, each corresponding to a resolution. Each of the sub-models may include a first block for extracting global temporal features from the spatial features of the set of prior image frames in that resolution, and a second block for extracting local temporal features from the global temporal features in that resolution. For example, within each sub-model, the first block may be a non-local block. The second block may be a convolutional gated recurrent unit (ConvGRU) and may extract local temporal features with its receptive field in that resolution.

In some embodiments, the decoder may include a plurality of residual blocks for fusing the local and global spatial-temporal features in the different resolutions generated by the predictor to generate an estimated current image frame of the selected time point. The estimated current image frame may have the same shape with the input prior image frames. For example, the decoder may fuse features at different scales and construct the output (e.g., the estimated current image frame) by upsampling and concatenating the channels of the features. In some embodiments, the lower-resolution features may be upsampled by the nearest neighbor interpolation (i.e., finding a value between the nearest neighbors) . Checkerboard artifacts (i.e., checkerboard patterns in the gradient) in the output may be eliminated using the residual blocks.

In some embodiments, the assessment model may calculate the anomaly score based on a peak signal-to-noise ratio (PSNR) of a mean squared error between the ground truth/captured current image frame and the estimated current image frame. For example, the assessment model may first calculate the mean squared error between the ground truth current image frame I _t and the estimated current image frame

and calculate the PSNR based on the mean squared error. The assessment model may then calculate an anomaly score based on normalizing the PSNR values of all the prior image frames.

Consistent with some embodiments, anomaly detection network 105 may be trained by minimizing the differences between an image frame predicted by using anomaly detection network 105 (hereafter “predicated training image frame” ) and a ground truth image frame corresponding to the same time point as the predicted image frame provided as part of training data 101 (hereafter “ground truth training image frame” ) . For example, model training device 120 may minimize a loss including a weighted sum of a perceptual loss indicative of a level of noise between the predicted training image frame and the ground truth training image frame, an intensity loss indicative of an intensity difference between the predicted training image frame and the ground truth training image frame, and a gradient difference loss indicative of a difference between an gradient image of the estimated current image frame and the ground truth current image frame. The details of the training will be disclosed in connection with FIG. 4 blow.

As show in FIG. 1, model training device 120 may communicate with training database 140 to receive one or more set of training data 101. Each set of training data 101 may include sample image frames (i.e., a plurality of image frames from a sample video) including a current image frame corresponding to a selected time point and a set of prior image frames corresponding to time points prior to the selected time point. Model training device 120 may use training data 101 received from training database 140 to train the learning model, e.g., anomaly detection network 105. Model training device 120 may be implemented with hardware specially programmed by software that performs the training process. For example, model training device 120 may include a processor and a non-transitory computer-readable medium. The processor may conduct the training by performing instructions of a training process stored in the computer-readable medium. Model training device 120 may additionally include input and output interfaces to communicate with training database 140, network 170, and/or a user interface (not shown) . The user interface may be used for selecting sets of training data, adjusting one or more parameters of the training process, selecting or modifying a framework of the learning model, and/or manually adjust the selected time point of training data 101.

Anomaly detection device 110 may receive trained anomaly detection network 105 from model training device 120. Anomaly detection device 110 may include a processor and a non-transitory computer-readable medium (not shown) . The processor may perform instructions of an anomaly detection process stored in the medium. Anomaly detection device 110 may additionally include input and output interfaces to communicate with database/repository 150, camera 160, network 170 and/or a user interface of display device 130. The input interface may be used for selecting a video that includes the plurality of image frames or initiating the detection process. The output interface may be used for providing an anomaly detection result 107 associated with the video.

Display 130 may include a display such as a Liquid Crystal Display (LCD) , a Light Emitting Diode Display (LED) , a plasma display, or any other type of display, and provide a Graphical User Interface (GUI) presented on the display for user input and data depiction. The display may include a number of different types of materials, such as plastic or glass, and may be touch-sensitive to receive inputs from the user. For example, the display may include a touch-sensitive material that is substantially rigid, such as Gorilla Glass ^TM, or substantially pliable, such as Willow Glass ^TM. In some embodiments, display 130 may be a standalone device, or may be an integrated part of anomaly detection device 110.

FIG. 2 illustrates a block diagram of an exemplary anomaly detection device 110, according to embodiments of the disclosure. In some embodiments, as shown in FIG. 2, anomaly detection device 110 may include a communication interface 202, a processor 204, a memory 206, and a storage 208. In some embodiments, anomaly detection device 110 may have different modules in a single device, such as an integrated circuit (IC) chip (e.g., implemented as an application-specific integrated circuit (ASIC) or a field-programmable gate array (FPGA) ) , or separate devices with dedicated functions. In some embodiments, one or more components of anomaly detection device 110 may be located in a cloud or may be alternatively in a single location (such as inside a mobile device) or distributed locations. Components of anomaly detection device 110 may be in an integrated device or distributed at different locations but communicate with each other through a network (not shown) . Consistent with the president disclosure, anomaly detection device 110 may be configured to detect anomalous event (s) recorded in image frames 102 received from database/repository 150, using anomaly detection network 105 trained in model training device 120.

Communication interface 202 may send data to and receive data from components such as database/repository 150, camera 160, model training device 120 and display device 130 via communication cables, a Wireless Local Area Network (WLAN) , a Wide Area Network (WAN) , wireless networks such as radio waves, a cellular network, and/or a local or short-range wireless network (e.g., Bluetooth ^TM) , or other communication methods. In some embodiments, communication interface 202 may include an integrated service digital network (ISDN) card, cable modem, satellite modem, or a modem to provide a data communication connection. As another example, communication interface 202 may include a local area network (LAN) card to provide a data communication connection to a compatible LAN. Wireless links can also be implemented by communication interface 202. In such an implementation, communication interface 202 can send and receive electrical, electromagnetic or optical signals that carry digital data streams representing various types of information.

Consistent with some embodiments, communication interface 202 may receive anomaly detection network 105 from model training device 120 and image frames 102 from database/repository 150. Communication interface 202 may further provide image frames 102 and anomaly detection network 105 to memory 206 and/or storage 208 for storage or to processor 204 for processing.

Processor 204 may include any appropriate type of general-purpose or special-purpose microprocessor, digital signal processor, or microcontroller. Processor 204 may be configured as a separate processor module dedicated to detecting anomalous event (s) in a video including a plurality of image frames captured by a camera using a learning model. Alternatively, processor 204 may be configured as a shared processor module for performing other functions in addition to anomaly detection.

Memory 206 and storage 208 may include any appropriate type of mass storage provided to store any type of information that processor 204 may need to operate. Memory 206 and storage 208 may be a volatile or non-volatile, magnetic, semiconductor, tape, optical, removable, non-removable, or other type of storage device or tangible (i.e., non-transitory) computer-readable medium including, but not limited to, a ROM, a flash memory, a dynamic RAM, and a static RAM. Memory 206 and/or storage 208 may be configured to store one or more computer programs that may be executed by processor 204 to perform functions disclosed herein. For example, memory 206 and/or storage 208 may be configured to store program (s) that may be executed by processor 204 to detect anomalous events in image frames 102 based on anomaly detection network 105.

In some embodiments, memory 206 and/or storage 208 may also store intermediate data such as spatial features at different resolutions, estimated image frames in different resolutions, estimated current image frame, PSNR of a mean squared error, anomaly score, etc. Memory 206 and/or storage 208 may additionally store various learning models including their model parameters, such as learnable parameters of the encoder, the predictor, and the decoder, etc.

As shown in FIG. 2, processor 204 may include multiple modules, such as an encoder unit 240, a predictor unit 242, a decoder unit 244, an assessment unit 246, and the like. These modules (and any corresponding sub-modules or sub-units) can be hardware units (e.g., portions of an integrated circuit) of processor 204 designed for use with other components or software units implemented by processor 204 through executing at least part of a program. The program may be stored on a computer-readable medium, and when executed by processor 204, it may perform one or more functions. Although FIG. 2 shows units 240-246 all within one processor 204, it is contemplated that these units may be distributed among different processors located closely or remotely with each other.

In some embodiments, units 242-246 of FIG. 2 may execute computer instructions to perform the anomaly detection. For example, FIG. 3 illustrates a flowchart of an exemplary method 300 for anomaly detection based on anomaly detection network 105 (an example shown in FIG. 5) , according to embodiments of the disclosure. Method 300 may be implemented by anomaly detection device 110 and particularly processor 204 or a separate processor not shown in FIG. 2. Method 300 may include steps S302-S314 as described below. It is to be appreciated that some of the steps may be optional, and some of the steps may be performed simultaneously, or in a different order than shown in FIG. 3. FIG. 3 and FIG. 5 will be described together.

In step S302, communication interface 202 may receive image frames 102 from database/repository 150. In some embodiments, image frames 102 may be part of a video that includes at least I _t-P, …, I _t-1, I _t image frames recorded using camera 160. For example, image frames 102 may include a current image frame I _t corresponding to a selected time point t and a set of P prior image frames I _t-P, …, I _t-1 corresponding to time points t-P, …, t-1, prior to the selected time point t.

In step S304, encoder unit 240 may apply at least one convolution layer 510 to an input image frame (i.e., one of a prior image frame of image frames 102) to generate a convoluted result of the input image frame. Encoder unit 240 may extract spatial features from the convoluted result of the input image frame at different resolutions. For example, encoder unit 240 may use several basic residual blocks 512 for extracting multi-scale spatial features (i.e., spatial features in different resolutions) . One or more basic residual blocks 512 may be applied to the convoluted result in sequence. Specifically, as shown in FIG. 5, L may be set as 3 (i.e., three different resolutions in total) . Basic residual blocks 512 may accordingly include three residual blocks applied in sequence for generating spatial features at different resolutions in a descending manner (e.g., from image resolution of 128×128×128 to image resolution of 64×64×256 to image resolution of 32×32×512) . For example, the first residual block may be applied to the convoluted result from the convolution layer and generate a first set of spatial features at a first resolution (e.g., 128×128×128) . The second residual block may be applied to the first set of spatial features and generate a second set of spatial features at a second resolution (e.g., 64×64×256) , lower than the first resolution. In some embodiments, each of basic residual blocks 512 may include a 2-D convolution for convoluting the spatial features.

For example, encoder unit 240 may calculate spatial features in L resolutions for each image frame from I ₁ to I _t-1 using basic residual blocks 512 according toequation (1) :

where t′= t-P, …, t-1 and f _enc (·) is the encoder function.

In step S306, predictor unit 242 may predict an estimate image frame in each resolution based on the spatial features of the set of P prior image frames I _t-P, …, I _t-1 in that resolution using a plurality of prediction sub-models 520. In some embodiments, the number of prediction sub-models 520 corresponds to the number of different resolutions. For example, in a specific example, because the spatial features were extracted at three different resolutions by encoder unit 240 (i.e., L=3) , prediction sub-models 520 may include three prediction sub-models. In some embodiments, the three prediction sub-models are applied in parallel, each of which corresponds to spatial features in one resolution for predicting an estimated image frame in that resolution. For example, each sub-model may include a first block (e.g., a non-local block) for extracting global temporal features from the spatial features of the set of prior image frames in the corresponding resolution and a second block for extracting local temporal features from the global temporal features in that resolution.

In some embodiments, the second block may be a convolutional gated recurrent unit (ConvGRU) . For example, within each sub-model, the first block (e.g., the non-local block) may be applied to capture long-range dependencies at all positions in the image frame and the ConvGRU may be applied to extract local temporal features (i.e., temporal patterns) using its receptive field focusing on the local neighborhood of the global temporal features extracted by the first block. The estimated current image frame in each resolution may be predicted according to equation (2) :

where l = 1, 2, …, L and f _pre (·) is the predictor function.

In step S308, decoder unit 244 may predict the estimated current image frame by fusing the features (i.e., both global temporal features and local temporal features) in the different resolutions using sub-models 530. In some embodiments, each sub-model of sub-models 530 includes an upsampling unit for upsampling the features, followed by a residual block for eliminating checkerboard artifacts (i.e., checkerboard patterns in the gradient) in the estimated current image frame (i.e., the output image frame by decoder unit 244) . In some embodiments, the features may be upsampled by nearest neighbor interpolation (i.e., finding a value between the nearest neighbors) .

For example, the estimated current image frame may be estimated by upsampling and concatenating the channels of the features according to equation (3) :

where

sthe predicted current image frame and f _dec (·) is the decoder function.

In step S310, assessment unit 246 may determine an anomaly score indicative of the difference between the captured current image frame I _t and the estimated current image frame

using the assessment model (not shown in FIG. 5) .

In some embodiments, the assessment model may calculate the anomaly score based on a peak signal-to-noise ratio (PSNR) of a mean squared error between the captured current image frame and the estimated current image frame. For example, the assessment model may first calculate the mean squared error (MSE) between the captured current image frame I _t and the estimated current image frame

according to equation (4)

where H and W are the height and the width of image frames respectively, and I _t (i, j) and

(i, j) are the Red, Green, Blue (RGB) values of the (i, j) ^th pixel in I _t and

respectively.

The assessment model may then calculate the PSNR based on the mean squared error according to equation (5) .

where MAX _It is the maximum possible pixel value of I _t.

The assessment model may further calculate the anomaly score of the image frame I _t at the time point t based on normalizing the PSNR values of all the T prior image frames. In some embodiments, the anomaly score may be calculated according to equation (6) :

where t′= 1, 2, …, T. In some embodiments, a higher anomaly sore S _t indicates the higher probability that the image frame I _t contains anomalous event (s) .

In step S312, assessment unit 246 may determine if the anomaly score is higher than a predetermined threshold. In some embodiments, an operator or a designer of the learning model (e.g., anomaly detection network 105) may set a predetermined value based on the domain knowledge for determining if the image frame records anomalous event (s) . For example, if the anomaly sore S _t is higher than the predetermined threshold, the image frame contains anomalous event (s) . Thus, in step S314, the video may be determined to include anomaly. Otherwise, if the anomaly sore S _t is not higher than the predetermined threshold, the current image frame contains no anomalous event (s) . Thus, in step S316, the video may be determined to be normal at or around time point t.

Consistent with some embodiments, anomaly detection network 105 may be trained by model training device 120 by minimizing a loss including a weighted sum of a perceptual loss indicative of a level of noise between the predicted training image frame and the ground truth training image frame, an intensity loss indicative of an intensity difference between the predicted training image frame and the ground truth training image frame, and a gradient difference loss indicative of a difference between an gradient image of the predicted training image frame and an gradient image of the ground truth training image frame.

FIG. 4 illustrates a flowchart of an exemplary method 400 for training anomaly detection network 105, according to embodiments of the disclosure. Method 400 may include steps S402-S412 as described below. It is to be appreciated that some of the steps may be optional, and some of the steps may be performed simultaneously, or in a different order than shown in FIG. 4.

In step S402, model training device 120 may receive training data 101 from training database 140. In some embodiments, training data 101 may include a video that includes at least I _t-P, …, I _t-1, I _t training image frames recorded using camera 160. For example, training data 101 may include a training image frame I _t corresponding to a selected time point t and a set of P prior training image frames I _t-P, …, I _t-1 corresponding to time points t-P, …, t-1, prior to the selected time point t.

In step S404, model training device 120 may calculate the perceptual loss indicative of a level of noise between the training image frame

predicted by using the learning model (e.g., anomaly detection network 105) and the ground truth training image frame I _t. For example, model training device 120 may apply separate pre-trained deep convolution networks such as VGG16 networks 540 (shown in FIG. 5) to the predicted training image frame

and the ground truth training image frame I _t simultaneously.

In some embodiments, the pre-trained deep convolution networks are trained on ImageNet for image classification. Each VGG16 network may include multiple sub- convolution layers 542 with a rectified linear unit (ReLU) . For a specific example, the multiple sub-convolution layers 542 may include 13 sub-convolution layers. Model training device 120 may calculate the perceptual loss based on a weighted l ₁ distance of the 2 ^nd, 4 ^th, 7 ^th, 10 ^th and 13 ^th sub-convolution layers of the VGG16 network according to equation (7) :

where V= {2, 4, 7, 10, 13} ,

is the output from the VGG16’s v ^th sub-convolution layer, and the hyperparameter (i.e., a predetermined parameter) α _v controls the strength of each part of the loss. In a specific example, (α ₂, α ₄, α ₇, α ₁₀, α ₁₃) may be set as (0.1, 1, 10, 10, 10) .

In step S406, model training device 120 may calculate the intensity loss L _int indicative of the intensity l ₂ distance between the predicted training image frame

and the ground truth training image frame I _t according to equation (8) :

where 1 ≤ i ≤ H and 1 ≤ j ≤ W, and I _t (i, j) and

(i, j) are the RGB values of the (i, j) th pixel in I _t and

respectively.

In step S408, model training device 120 may calculate the gradient different loss L _gd indicative of the gradient difference between the gradient image of the predicted training image frame

and the gradient image of the ground truth training image frame I _t. For example, model training device 120 may measure the l ₁ distance in both vertical and horizontal directions (i.e., the vertical and horizontal gradient differences) according to equation (9) :

and calculate the gradient different loss L _gd according to equation (9) :

Instep S410, model training device 120 may calculate the overall lose as a weighted sum of the perceptual loss, the intensity loss and the gradient difference loss according to equation (11) :

where λ _int, λ _gd and λ _pl are hyperparameters (i.e., parameters with predefined values) . In step S412, model training device 120 may train the learning model (e.g., anomaly detection network 105) by minimizing the overall lose.

The design of the multi-scale architectures of the learning model applied by the disclosed system and method ensures more attention paid to semantically meaningful parts such as a person, a vehicle, etc. comparing to background information (i.e., the static/unchanged objects) in each of the image frames of the video to be detect. Moreover, the disclosed multi-scale learning model is more sensitive to objects with different scales of features. For example, spatial features extracted at different resolutions may ensure the same object with different granularities in the image frame may all be detected for determining if there is an anomalous event happened. Finally, because the learning model can be trained based on solely normal videos (i.e., videos that are anomaly-free) , it is less expensive to capture the training data for training the learning model.

Another aspect of the disclosure is directed to a non-transitory computer-readable medium storing instructions which, when executed, cause one or more processors to perform the methods, as discussed above. The computer-readable medium may include volatile or non-volatile, magnetic, semiconductor, tape, optical, removable, non-removable, or other types of computer-readable medium or computer-readable storage devices. For example, the computer-readable medium may be the storage device or the memory module having the computer instructions stored thereon, as disclosed. In some embodiments, the computer-readable medium may be a disc or a flash drive having the computer instructions stored thereon.

It will be apparent to those skilled in the art that various modifications and variations can be made to the disclosed system and related methods. Other embodiments will be apparent to those skilled in the art from consideration of the specification and practice of the disclosed system and related methods.

It is intended that the specification and examples be considered as exemplary only, with a true scope being indicated by the following claims and their equivalents.

Claims

A method for performing video anomaly detection using a learning model, comprising:

receiving a video comprising a plurality of image frames captured by a camera, the plurality of image frames including a current image frame corresponding to a selected time point and a set of prior image frames corresponding to time points prior to the selected time point;

extracting spatial features from each prior image frame at different resolutions;

predicting an estimated image frame in each resolution based on the spatial features of the set of prior image frames in that resolution;

predicting an estimated current image frame based on the estimated image frames in the different resolutions; and

detecting the video anomaly based on a difference between the captured current image frame and the estimated current image frame.
The method of claim 1, wherein the learning model comprises at least one convolution layer and a plurality of residual blocks, wherein extracting the spatial features from each prior image frame further comprises:

applying the convolution layer to the prior image frame to obtain a convoluted result;

applying a first residual block to the convoluted result to obtain spatial features in a first resolution; and

applying a second residual block to the spatial features in the first resolution to obtain spatial features in a second resolution.
The method of claim 1, wherein the learning model comprises a plurality of prediction models each corresponding to a resolution, wherein the estimated image frames in the different resolutions are predicted using the plurality of prediction sub-models in parallel.
The method of claim 3, wherein each sub-model comprises a first block and a second block, wherein predicting the estimated image frame in each resolution further comprises:

extracting global features from the spatial features of the set of prior image frames in that resolution using the first block; and

extracting local temporal features from the global features in that resolution using the second block, wherein the second block has a receptive field applied to local neighborhoods of the global temporal features.
The method of claim 4, wherein the second block is a convolutional gated recurrent unit (ConvGRU) .
The method of claim 4, wherein predicting the estimated current image frame further comprises fusing the global temporal features in the different resolutions.
The method of claim 1, wherein the learning model is trained by minimizing a loss comprising a perceptual loss indicative of a level of noise between a training image frame predicted by using the learning model and a ground truth training image frame.
The method of claim 7, wherein the loss further comprises an intensity loss indicative of an intensity difference between the predicted training image frame and the ground truth training image frame, and a gradient difference loss indicative of a difference between an gradient image of the predicted training image frame and an gradient image of the ground truth training image frame.
The method of claim 8, wherein the loss is a weighted sum of the perceptual loss, the intensity loss and the gradient difference loss.
The method of claim 7, wherein the perceptual loss is calculated by:

applying a pre-trained classification model to the ground truth image frame to obtain a first output from at least one hidden layer of the pre-trained classification model;

applying the pre-trained classification model to the predicted image frame to obtain a second output from the at least one hidden layer of the pre-trained classification model; and

determining a difference between the first output and the second output.
The method of claim 10, wherein the at least one hidden layer comprises a plurality of hidden layers and the difference is a weighted distance between the first output and the second output from the plurality of hidden layers.
The method of claim 1, wherein detecting the video anomaly further comprises:

determining an anomaly score indicative of the difference between the captured current image frame and the estimated current image frame; and

detecting the video anomaly if the determined anomaly score is higher than a predetermined threshold.
The method of claim 12, wherein determining the anomaly score further comprises computing a peak signal-to-noise ratio (PSNR) of a mean squared error between the captured current image frame and the estimated current image frame.
A system for performing video anomaly detection using a learning model, comprising:

a communication interface configured to receive a video comprising a plurality of image frames captured by a camera, the plurality of image frames including a current image frame corresponding to a selected time point and a set of prior image frames corresponding to time points prior to the selected time point; and

at least one processor coupled to the communication interface and configured to:

extract spatial features from each prior image frame at different resolutions;

predict an estimated image frame in each resolution based on the spatial features of the set of prior image frames in that resolution;

predict an estimated current image frame based on the estimated image frames in the different resolutions; and

detect the video anomaly based on a difference between the captured current image frame and the estimated current image frame.
The system of claim 14, wherein the learning model comprises at least one convolution layer and a plurality of residual blocks, wherein to extract the spatial features from each prior image frame, the at least one processor is further configured to:

apply the convolution layer to the prior image frame to obtain a convoluted result;

apply a first residual block to the convoluted result to obtain spatial features in a first resolution; and

apply a second residual block to the spatial features in the first resolution to obtain spatial features in a second resolution.
The system of claim 14, wherein the learning model comprises a plurality of prediction models each corresponding to a resolution, wherein the estimated image frames in the different resolutions are predicted using the plurality of prediction sub-models in parallel.
The system of claim 16, wherein each sub-model comprises a first block and a second block, wherein to predict the estimated image frame in each resolution, the at least one processor is further configured to:

extract global temporal features from the spatial features of the set of prior image frames in that resolution using the first block; and

extract local temporal features from the global temporal features in that resolution using the second block, wherein the second block has a receptive field applied to local neighborhoods of the global temporal features.
The system of claim 14, wherein the learning model is trained by minimizing a loss comprising a perceptual loss indicative of a level of noise between a training image frame predicted by using the learning model and a ground truth training image frame.
A non-transitory computer-readable medium storing instructions that, when executed by one or more processors, cause the one or more processors to perform a method for performing video anomaly detection using a learning model, the method comprising:

extracting spatial features from each prior image frame at different resolutions;

predicting an estimated image frame in each resolution based on the spatial features of the set of prior image frames in that resolution;

predicting an estimated current image frame based on the estimated image frames in the different resolutions; and

detecting the video anomaly based on a difference between the captured current image frame and the estimated current image frame.
The non-transitory computer-readable medium of claim 19, wherein the learning model is trained by minimizing a loss comprising a perceptual loss indicative of a level of noise between a training image frame predicted by using the learning model and a ground truth training image frame.