WO2021147055A1 - Systems and methods for video anomaly detection using multi-scale image frame prediction network - Google Patents
- Publication number
- WO2021147055A1 (PCT/CN2020/073932)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- image frame
- resolution
- estimated
- image frames
- prior
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/41—Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
Definitions
- the present disclosure relates to systems and methods for video anomaly detection, and more particularly to, systems and methods for video anomaly detection using a multi-scale image frame prediction network.
- Video anomaly detection plays an essential role in computer vision and is used in many applications such as warning systems, scene understanding, activity recognition, road traffic analysis, etc.
- Frame-level video anomaly detection aims at identifying the frames in which there exist events or behaviors different from expectations or regulations. Detecting anomalous events in videos is very challenging for two main reasons. First, the definition of an anomalous event is usually ambiguous, and its pattern varies greatly because it highly depends on the context of the event and the scenario where the event happens. For example, a vehicle driving down the road within the speed limit is normal, but a vehicle dashing towards a crowd of people is anomalous.
- Similarly, a cloud of smoke rising from inside a building often indicates an anomaly, but a cloud of smoke coming from a chimney is normal. Therefore, a robust video anomaly detection method has to be able to address the ambiguous definition of the anomalous event.
- Second, the data of normal and abnormal samples is usually imbalanced in practice. Because anomalous events are rare and unpredictable in real-world scenarios, it is difficult and costly to collect and label abnormal videos. Therefore, video anomaly detection methods have to learn from normal data only (i.e., using videos of normal events to distinguish unseen and unbounded abnormal events from normal ones) .
- Classification-based methods directly classify each video frame as normal or abnormal.
- Some methods require additional anomaly data and labels in the training phase via a weakly supervised approach, but this approach cannot generalize to unbounded anomalies.
- Other works focus on specific types of anomalous events. For example, a multi-stage classification method first detects and crops out the object of interest and then builds one-versus-rest classifiers on the extracted features of that object. Since its anomaly detection result is substantially influenced by the performance of its object detection, it will fail to recognize anomalies with unseen objects or with no objects to attribute the anomaly to. Therefore, this type of method is very limited when dealing with complicated and uncertain real-world scenarios.
- Reconstruction-based methods are a more general way to perform video anomaly detection: they learn to reconstruct the input video frame, with minimal reconstruction errors expected for normal frames and large errors expected for abnormal frames.
- Auto-encoder based methods and generative adversarial networks are commonly used as the reconstruction models.
- However, an abnormal event may also have small reconstruction errors, so there is no guarantee that reconstruction-based methods can detect abnormal events well.
- Prediction-based methods have been developed to remedy the issues of the classification-based and reconstruction-based methods. A prediction-based method takes consecutive video frames to predict the next frame and determines whether the next frame is abnormal based on the prediction error.
- However, the existing prediction-based methods are still suboptimal. For example, their U-Net architecture cannot fully learn temporal information, and using adversarial learning and an additional optical flow loss makes training inefficient.
- Embodiments of the disclosure address the above problems by providing prediction-based video anomaly detection methods and systems using a multi-scale frame prediction network.
- Embodiments of the disclosure provide a method for video anomaly detection using a learning model.
- An exemplary method may include receiving a video including a plurality of image frames captured by a camera, the plurality of image frames including a current image frame corresponding to a selected time point and a set of prior image frames corresponding to time points prior to the selected time point.
- the method may also include extracting spatial features from each prior image frame at different resolutions and predicting an estimated image frame in each resolution based on the spatial features of the set of prior image frames in that resolution.
- the method may further include predicting an estimated current image frame based on the estimated image frames in the different resolutions and detecting the video anomaly based on a difference between the captured current image frame and the estimated current image frame.
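The claimed flow can be sketched end-to-end with toy stand-ins for the trained networks. This is a minimal illustration only: the `encode`, `predict`, and `decode` callables are hypothetical placeholders (here an identity encoder and a mean-of-prior-frames predictor), not the disclosed multi-scale network, and a mean-squared-error threshold stands in for the PSNR-based anomaly score described later.

```python
import numpy as np

def detect_anomaly(frames, encode, predict, decode, threshold):
    # High-level flow of the claimed method (all callables are
    # illustrative stand-ins for the trained encoder, predictor, decoder).
    *prior, current = frames
    pyramids = [encode(f) for f in prior]       # spatial features per frame
    per_scale = predict(pyramids)               # estimated frame(s) per scale
    estimate = decode(per_scale)                # estimated current frame
    error = np.mean((current - estimate) ** 2)  # prediction difference
    return error > threshold                    # large error => anomaly

# Toy stand-ins: identity "networks" that predict the mean of prior frames.
frames = [np.full((8, 8), float(v)) for v in (1, 1, 1, 5)]  # last frame deviates
flag = detect_anomaly(
    frames,
    encode=lambda f: f,
    predict=lambda ps: ps,
    decode=lambda ps: np.mean(ps, axis=0),
    threshold=0.5,
)
print(flag)  # True: the current frame differs sharply from the prediction
```

The design point is that anomaly detection reduces to comparing a predicted current frame against the captured one; everything else in the disclosure refines how the prediction is made.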
- Embodiments of the disclosure also provide a system for performing video anomaly detection using a learning model.
- An exemplary system may include a communication interface configured to receive a video including a plurality of image frames captured by a camera, the plurality of image frames including a current image frame corresponding to a selected time point and a set of prior image frames corresponding to time points prior to the selected time point.
- the system may also include at least one processor coupled to the communication interface.
- the at least one processor may be configured to extract spatial features from each prior image frame at different resolutions and predict an estimated image frame in each resolution based on the spatial features of the set of prior image frames in that resolution.
- the at least one processor may be further configured to predict an estimated current image frame based on the estimated image frames in the different resolutions and detect the video anomaly based on a difference between the captured current image frame and the estimated current image frame.
- Embodiments of the disclosure further provide a non-transitory computer-readable medium storing instructions that, when executed by one or more processors, cause the one or more processors to perform a method for video anomaly detection using a learning model.
- the method may include receiving a video including a plurality of image frames captured by a camera, the plurality of image frames including a current image frame corresponding to a selected time point and a set of prior image frames corresponding to time points prior to the selected time point.
- the method may also include extracting spatial features from each prior image frame at different resolutions and predicting an estimated image frame in each resolution based on the spatial features of the set of prior image frames in that resolution.
- the method may further include predicting an estimated current image frame based on the estimated image frames in the different resolutions and detecting the video anomaly based on a difference between the captured current image frame and the estimated current image frame.
- FIG. 1 illustrates a schematic diagram of an exemplary anomaly detection system, according to embodiments of the disclosure.
- FIG. 2 illustrates a block diagram of an exemplary anomaly detection device, according to embodiments of the disclosure.
- FIG. 3 illustrates a flowchart of an exemplary method for anomaly detection, according to embodiments of the disclosure.
- FIG. 4 illustrates a flowchart of an exemplary method for training an anomaly detection network, according to embodiments of the disclosure.
- FIG. 5 illustrates a schematic framework of an exemplary anomaly detection network along with its training network, according to embodiments of the disclosure.
- FIG. 1 illustrates a schematic diagram of an exemplary video anomaly detection system (referred to as “anomaly detection system 100” ) , according to embodiments of the disclosure.
- anomaly detection system 100 is configured to detect anomalous events recorded in a video (i.e., determine if there is a video anomaly) captured by a camera 160.
- the anomaly detection may be based on a multi-scale image frame prediction network (referred to as an “anomaly detection network 105” hereafter) trained using sample videos (e.g., training data 101) .
- As shown in FIG. 1, anomaly detection system 100 may include an anomaly detection device 110, a model training device 120, a display device 130, a training database 140, a database/repository 150, a camera 160, and a network 170 for facilitating communications among the various components. It is contemplated that anomaly detection system 100 may include more or fewer components than those shown in FIG. 1.
- anomaly detection system 100 may perform two stages: an anomaly detection model training stage and an anomaly detection stage applying the trained model.
- In the training stage (e.g., training a learning model such as anomaly detection network 105) , anomaly detection system 100 may include model training device 120 and training database 140.
- In the anomaly detection stage (i.e., applying the trained model in a video anomaly detection process to obtain an anomaly detection result 107, e.g., whether the video includes anomalous event (s) ) , anomaly detection system 100 may include anomaly detection device 110 and database/repository 150.
- anomaly detection system 100 may also include display device 130 to display anomaly detection result 107.
- anomaly detection system 100 may only include components for performing the anomaly detection related functions, namely anomaly detection device 110, database/repository 150 and optionally display device 130.
- anomaly detection system 100 may optionally include network 170 to facilitate the communication among the various components of anomaly detection system 100, such as databases 140 and 150, devices 110 and 120, and camera 160.
- network 170 may be a local area network (LAN) , a wireless network, a cloud computing environment (e.g., software as a service, platform as a service, infrastructure as a service) , a client-server, a wide area network (WAN) , etc.
- network 170 may be replaced by wired data communication systems or devices.
- the various components of anomaly detection system 100 may be remote from each other or in different locations and be connected through network 170 as shown in FIG. 1.
- certain components of anomaly detection system 100 may be located on the same site or inside one device.
- training database 140 may be located on-site with or be part of model training device 120.
- model training device 120 and anomaly detection device 110 may be inside the same computer or processing device.
- anomaly detection system 100 may store videos that include multiple image frames.
- Image frames of a video may each correspond to a time point.
- a “current image frame” refers to the image frame corresponding to a selected time point.
- prior image frames refer to the set of image frames corresponding to time points prior to the selected time point.
- “future image frames” refer to the set of image frames corresponding to time points subsequent to the selected time point.
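The frame terminology above can be illustrated with a small NumPy sketch; the tensor layout (frames × height × width × channels) and the values of t and P below are arbitrary choices for illustration.

```python
import numpy as np

# Hypothetical video tensor: T frames of H x W RGB images.
video = np.zeros((10, 64, 64, 3))
t, P = 7, 4                        # selected time point, number of prior frames
prior = video[t - P : t]           # prior image frames I_{t-P}, ..., I_{t-1}
current = video[t]                 # current image frame I_t
future = video[t + 1 :]            # future image frames I_{t+1}, ...
print(prior.shape, current.shape)  # (4, 64, 64, 3) (64, 64, 3)
```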
- The stored image frames may include sample image frames where anomalies are known (e.g., training data 101) and image frames for detection (e.g., image frames 102) .
- Image frames may be generated based on video data (e.g., a video comprising a plurality of image frames) received from video recording devices (e.g., camera 160) .
- the video data may be visual image streams including the current image frame, the set of prior image frames and the set of future image frames, acquired by camera 160 or a wearable device, a smart phone, a tablet, a computer, a surveillance camera, or the like that includes a video recording device for acquiring the video data.
- camera 160 may be any suitable video recording device that can acquire image frames.
- Image frames may be generated based on the acquired video data.
- the video data may also come from a post-processing device implementing image enhancement technology (e.g., an application on a user device for adding night vision to surveillance devices) .
- training database 140 may store training data 101, which includes sample image frames.
- training data 101 may further include the known anomalies (if any) in the sample image frames.
- the sample image frames may include an image frame corresponding to a time point when an anomalous event happens, and image frames prior and subsequent to that time point.
- training data 101 may include image frames that are known to be anomaly-free. Sample image frames may be stored in training database 140 as training data 101.
- anomaly detection network 105 may be a multi-scale image frame prediction network (referred to as “prediction network” hereafter) for predicting an estimated current image frame (i.e., a prediction of an image frame at a selected time point) based on a set of prior image frames corresponding to time points prior to the selected time point.
- Anomaly detection network 105 may also include an anomaly score calculation model (referred to as “assessment model” hereafter) for determining an anomaly score, indicative of a difference between the ground truth current image frame (i.e., captured image frame at the selected time point) and the estimated current image frame.
- anomaly detection network 105 may decide whether the current image frame records/includes anomalous event (s) (i.e., whether an anomaly occurs at or around the selected time point) based on the anomaly score.
- the prediction network may include an encoder for extracting spatial features from each prior image frame (i.e., the set of prior image frames corresponding to time points prior to the selected time point) at different resolutions, a predictor for predicting an estimated image frame in each resolution based on the spatial features of the set of prior image frames in that resolution and a decoder for predicting the estimated current image frame based on the estimated image frames in the different resolutions.
- the encoder may include at least one convolution layer and a plurality of residual blocks for obtaining spatial features of the prior image frames in different resolutions.
- Each residual block may be configured to extract spatial features in a specific resolution.
- the residual blocks may be connected in a sequential manner, each producing spatial features in a different resolution. The more residual blocks that are applied to an image frame, the lower the resolution of the extracted spatial features.
- each residual block may include a 2-D convolution, and when extracting the spatial features from each prior image frame, a first residual block for a first resolution may be applied to the image frame for generating spatial features in a first resolution, and a second residual block may be applied to the spatial features in the first resolution to obtain spatial features in a second resolution.
- the second resolution is lower than the first resolution.
- the number of residual blocks of the plurality of residual blocks is not limited to two.
- the plurality of residual blocks may include three or more residual blocks applied in a sequential manner, for generating spatial features in three or more different resolutions.
- the multi-scale architecture may allow the encoder to extract spatial features of the image frame at different scales.
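The descending-resolution encoding can be sketched as follows. This is a hedged stand-in, not the disclosed network: the "residual block" here uses stride-2 average pooling plus random per-pixel (1×1) projections merely to reproduce the shape behavior of the described blocks (halved spatial resolution, increased channels, with a residual addition).

```python
import numpy as np

def residual_block(x, out_channels, rng):
    # Hypothetical stand-in for a strided 2-D convolutional residual block:
    # halve the spatial resolution (stride-2 average pooling) and project
    # the channels with random 1x1 "convolutions" (per-pixel linear maps).
    h, w, c = x.shape
    pooled = x.reshape(h // 2, 2, w // 2, 2, c).mean(axis=(1, 3))
    main = pooled @ (rng.standard_normal((c, out_channels)) / np.sqrt(c))
    skip = pooled @ (rng.standard_normal((c, out_channels)) / np.sqrt(c))
    return main + skip                      # residual addition

def encode_multiscale(frame, channels=(128, 256, 512), seed=0):
    # Apply the blocks in sequence and keep every intermediate output,
    # yielding a pyramid of spatial features, one entry per resolution.
    rng = np.random.default_rng(seed)
    features, x = [], frame
    for c in channels:
        x = residual_block(x, c, rng)
        features.append(x)
    return features

frame = np.random.default_rng(1).random((256, 256, 3))
pyramid = encode_multiscale(frame)
print([f.shape for f in pyramid])  # [(128, 128, 128), (64, 64, 256), (32, 32, 512)]
```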
- the predictor may include a plurality of sub-models in parallel, each corresponding to a resolution.
- Each of the sub-models may include a first block for extracting global temporal features from the spatial features of the set of prior image frames in that resolution, and a second block for extracting local temporal features from the global temporal features in that resolution.
- the first block may be a non-local block.
- the second block may be a convolutional gated recurrent unit (ConvGRU) and may extract local temporal features with its receptive field in that resolution.
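The role of the second block can be illustrated with a minimal ConvGRU-style cell. As an assumption for brevity, the gate "convolutions" are reduced to 1×1 (per-pixel) linear maps over channels; an actual ConvGRU would use spatial kernels (e.g., 3×3) so each gate covers a local receptive field.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class ConvGRUCell:
    # Minimal sketch of a convolutional GRU cell on H x W x C feature maps.
    def __init__(self, in_ch, hid_ch, seed=0):
        rng = np.random.default_rng(seed)
        scale = 1.0 / np.sqrt(in_ch + hid_ch)
        # One weight matrix per gate, acting on concatenated [input, hidden].
        self.wz = rng.standard_normal((in_ch + hid_ch, hid_ch)) * scale
        self.wr = rng.standard_normal((in_ch + hid_ch, hid_ch)) * scale
        self.wh = rng.standard_normal((in_ch + hid_ch, hid_ch)) * scale

    def step(self, x, h):
        xh = np.concatenate([x, h], axis=-1)
        z = sigmoid(xh @ self.wz)                        # update gate
        r = sigmoid(xh @ self.wr)                        # reset gate
        cand = np.tanh(np.concatenate([x, r * h], axis=-1) @ self.wh)
        return (1.0 - z) * h + z * cand                  # new hidden state

cell = ConvGRUCell(in_ch=8, hid_ch=8)
h = np.zeros((16, 16, 8))
for x in np.random.default_rng(1).random((4, 16, 16, 8)):  # 4 prior frames
    h = cell.step(x, h)                                    # accumulate over time
print(h.shape)  # (16, 16, 8)
```

Feeding the per-resolution features of the prior frames through such a cell, one time step per frame, is how the recurrent block accumulates local temporal patterns at that scale.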
- the decoder may include a plurality of residual blocks for fusing the local and global spatial-temporal features in the different resolutions generated by the predictor to generate an estimated current image frame of the selected time point.
- the estimated current image frame may have the same shape as the input prior image frames.
- the decoder may fuse features at different scales and construct the output (e.g., the estimated current image frame) by upsampling and concatenating the channels of the features.
- the lower-resolution features may be upsampled by nearest neighbor interpolation (i.e., each output value is copied from its nearest neighbor in the input) .
- Checkerboard artifacts (i.e., checkerboard patterns in the gradient) in the output may be eliminated using the residual blocks.
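The upsample-and-concatenate fusion can be sketched in a few lines of NumPy; the feature shapes are illustrative, and the residual blocks that would follow to remove checkerboard artifacts are omitted.

```python
import numpy as np

def upsample_nearest(x, factor=2):
    # Nearest-neighbor interpolation: each pixel value is copied into a
    # factor x factor block (no new values are computed).
    return x.repeat(factor, axis=0).repeat(factor, axis=1)

def fuse(features):
    # Upsample every feature map in the pyramid to the largest spatial
    # resolution and concatenate along the channel axis, as the decoder
    # does before its residual blocks clean up checkerboard artifacts.
    target = max(f.shape[0] for f in features)
    ups = [upsample_nearest(f, target // f.shape[0]) for f in features]
    return np.concatenate(ups, axis=-1)

f1 = np.ones((8, 8, 4))    # higher-resolution features
f2 = np.ones((4, 4, 8))    # lower-resolution features
fused = fuse([f1, f2])
print(fused.shape)  # (8, 8, 12)
```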
- the assessment model may calculate the anomaly score based on a peak signal-to-noise ratio (PSNR) of a mean squared error between the ground truth/captured current image frame and the estimated current image frame. For example, the assessment model may first calculate the mean squared error between the ground truth current image frame I_t and the estimated current image frame Î_t, and calculate the PSNR based on the mean squared error. The assessment model may then calculate an anomaly score by normalizing the PSNR values of all the prior image frames.
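One plausible reading of the PSNR-based scoring is sketched below. The min-max normalization over a video's frames is an assumption drawn from common practice in prediction-based anomaly detection, and the peak value of 1.0 assumes pixel intensities scaled to [0, 1].

```python
import numpy as np

def psnr(frame, estimate, peak=1.0):
    # PSNR of the mean squared error between a captured frame and its
    # prediction; `peak` is the maximum possible pixel intensity.
    mse = np.mean((frame - estimate) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)

def anomaly_scores(psnr_values):
    # Normalize PSNR values across the frames of a video to [0, 1]; a LOW
    # score (low PSNR, large prediction error) suggests an anomalous frame.
    p = np.asarray(psnr_values, dtype=float)
    return (p - p.min()) / (p.max() - p.min())

rng = np.random.default_rng(0)
truth = rng.random((4, 32, 32, 3))
preds = truth + rng.normal(0.0, 0.01, truth.shape)   # small prediction error
preds[2] += rng.normal(0.0, 0.3, truth.shape[1:])    # one poorly predicted frame
scores = anomaly_scores([psnr(t, p) for t, p in zip(truth, preds)])
print(scores.argmin())  # frame 2 has the lowest score
```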
- anomaly detection network 105 may be trained by minimizing the differences between an image frame predicted by using anomaly detection network 105 (hereafter “predicted training image frame” ) and a ground truth image frame corresponding to the same time point as the predicted image frame provided as part of training data 101 (hereafter “ground truth training image frame” ) .
- model training device 120 may minimize a loss including a weighted sum of a perceptual loss indicative of a level of noise between the predicted training image frame and the ground truth training image frame, an intensity loss indicative of an intensity difference between the predicted training image frame and the ground truth training image frame, and a gradient difference loss indicative of a difference between a gradient image of the predicted training image frame and that of the ground truth training image frame.
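The training objective can be sketched as a weighted sum of the three terms. The perceptual term would require a pretrained feature network, so it is passed in here as a precomputed scalar; the specific finite-difference gradient loss and the unit weights are illustrative assumptions, not the disclosed formulation.

```python
import numpy as np

def intensity_loss(pred, truth):
    # L2 intensity difference between predicted and ground-truth frames.
    return np.mean((pred - truth) ** 2)

def gradient_loss(pred, truth):
    # Difference between the image gradients (finite differences along
    # height and width) of the predicted and ground-truth frames.
    def grads(img):
        return np.abs(np.diff(img, axis=0)), np.abs(np.diff(img, axis=1))
    (ph, pw), (th, tw) = grads(pred), grads(truth)
    return np.mean(np.abs(ph - th)) + np.mean(np.abs(pw - tw))

def total_loss(pred, truth, perceptual, w=(1.0, 1.0, 1.0)):
    # Weighted sum of perceptual, intensity, and gradient-difference terms;
    # `perceptual` is assumed precomputed by a pretrained feature network.
    return (w[0] * perceptual + w[1] * intensity_loss(pred, truth)
            + w[2] * gradient_loss(pred, truth))

truth = np.random.default_rng(0).random((32, 32, 3))
print(total_loss(truth, truth, perceptual=0.0))  # 0.0 for a perfect prediction
```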
- model training device 120 may communicate with training database 140 to receive one or more sets of training data 101.
- Each set of training data 101 may include sample image frames (i.e., a plurality of image frames from a sample video) including a current image frame corresponding to a selected time point and a set of prior image frames corresponding to time points prior to the selected time point.
- Model training device 120 may use training data 101 received from training database 140 to train the learning model, e.g., anomaly detection network 105.
- Model training device 120 may be implemented with hardware specially programmed by software that performs the training process.
- model training device 120 may include a processor and a non-transitory computer-readable medium. The processor may conduct the training by performing instructions of a training process stored in the computer-readable medium.
- Model training device 120 may additionally include input and output interfaces to communicate with training database 140, network 170, and/or a user interface (not shown) .
- the user interface may be used for selecting sets of training data, adjusting one or more parameters of the training process, selecting or modifying a framework of the learning model, and/or manually adjusting the selected time point of training data 101.
- Anomaly detection device 110 may receive trained anomaly detection network 105 from model training device 120.
- Anomaly detection device 110 may include a processor and a non-transitory computer-readable medium (not shown) .
- the processor may perform instructions of an anomaly detection process stored in the medium.
- Anomaly detection device 110 may additionally include input and output interfaces to communicate with database/repository 150, camera 160, network 170 and/or a user interface of display device 130.
- the input interface may be used for selecting a video that includes the plurality of image frames or initiating the detection process.
- the output interface may be used for providing an anomaly detection result 107 associated with the video.
- Display device 130 may include a display such as a Liquid Crystal Display (LCD) , a Light Emitting Diode Display (LED) , a plasma display, or any other type of display, and provide a Graphical User Interface (GUI) presented on the display for user input and data depiction.
- the display may include a number of different types of materials, such as plastic or glass, and may be touch-sensitive to receive inputs from the user.
- the display may include a touch-sensitive material that is substantially rigid, such as Gorilla Glass™, or substantially pliable, such as Willow Glass™.
- display 130 may be a standalone device, or may be an integrated part of anomaly detection device 110.
- FIG. 2 illustrates a block diagram of an exemplary anomaly detection device 110, according to embodiments of the disclosure.
- anomaly detection device 110 may include a communication interface 202, a processor 204, a memory 206, and a storage 208.
- anomaly detection device 110 may have different modules in a single device, such as an integrated circuit (IC) chip (e.g., implemented as an application-specific integrated circuit (ASIC) or a field-programmable gate array (FPGA) ) , or separate devices with dedicated functions.
- one or more components of anomaly detection device 110 may be located in a cloud or may be alternatively in a single location (such as inside a mobile device) or distributed locations.
- the components of anomaly detection device 110 may be in an integrated device or distributed at different locations but communicate with each other through a network (not shown) . Consistent with the present disclosure, anomaly detection device 110 may be configured to detect anomalous event (s) recorded in image frames 102 received from database/repository 150, using anomaly detection network 105 trained in model training device 120.
- Communication interface 202 may send data to and receive data from components such as database/repository 150, camera 160, model training device 120 and display device 130 via communication cables, a Wireless Local Area Network (WLAN) , a Wide Area Network (WAN) , wireless networks such as radio waves, a cellular network, and/or a local or short-range wireless network (e.g., Bluetooth TM ) , or other communication methods.
- communication interface 202 may include an integrated service digital network (ISDN) card, cable modem, satellite modem, or a modem to provide a data communication connection.
- communication interface 202 may include a local area network (LAN) card to provide a data communication connection to a compatible LAN.
- Wireless links can also be implemented by communication interface 202.
- communication interface 202 can send and receive electrical, electromagnetic or optical signals that carry digital data streams representing various types of information.
- communication interface 202 may receive anomaly detection network 105 from model training device 120 and image frames 102 from database/repository 150. Communication interface 202 may further provide image frames 102 and anomaly detection network 105 to memory 206 and/or storage 208 for storage or to processor 204 for processing.
- Processor 204 may include any appropriate type of general-purpose or special-purpose microprocessor, digital signal processor, or microcontroller. Processor 204 may be configured as a separate processor module dedicated to detecting anomalous event (s) in a video including a plurality of image frames captured by a camera using a learning model. Alternatively, processor 204 may be configured as a shared processor module for performing other functions in addition to anomaly detection.
- Memory 206 and storage 208 may include any appropriate type of mass storage provided to store any type of information that processor 204 may need to operate.
- Memory 206 and storage 208 may be a volatile or non-volatile, magnetic, semiconductor, tape, optical, removable, non-removable, or other type of storage device or tangible (i.e., non-transitory) computer-readable medium including, but not limited to, a ROM, a flash memory, a dynamic RAM, and a static RAM.
- Memory 206 and/or storage 208 may be configured to store one or more computer programs that may be executed by processor 204 to perform functions disclosed herein.
- memory 206 and/or storage 208 may be configured to store program (s) that may be executed by processor 204 to detect anomalous events in image frames 102 based on anomaly detection network 105.
- memory 206 and/or storage 208 may also store intermediate data such as spatial features at different resolutions, estimated image frames in different resolutions, estimated current image frame, PSNR of a mean squared error, anomaly score, etc.
- Memory 206 and/or storage 208 may additionally store various learning models including their model parameters, such as learnable parameters of the encoder, the predictor, and the decoder, etc.
- processor 204 may include multiple modules, such as an encoder unit 240, a predictor unit 242, a decoder unit 244, an assessment unit 246, and the like. These modules (and any corresponding sub-modules or sub-units) can be hardware units (e.g., portions of an integrated circuit) of processor 204 designed for use with other components or software units implemented by processor 204 through executing at least part of a program.
- the program may be stored on a computer-readable medium, and when executed by processor 204, it may perform one or more functions.
- FIG. 2 shows units 240-246 all within one processor 204, it is contemplated that these units may be distributed among different processors located closely or remotely with each other.
- FIG. 3 illustrates a flowchart of an exemplary method 300 for anomaly detection based on anomaly detection network 105 (an example shown in FIG. 5) , according to embodiments of the disclosure.
- Method 300 may be implemented by anomaly detection device 110 and particularly processor 204 or a separate processor not shown in FIG. 2.
- Method 300 may include steps S302-S314 as described below. It is to be appreciated that some of the steps may be optional, and some of the steps may be performed simultaneously, or in a different order than shown in FIG. 3. FIG. 3 and FIG. 5 will be described together.
- communication interface 202 may receive image frames 102 from database/repository 150.
- image frames 102 may be part of a video that includes at least image frames I_{t-P}, ..., I_{t-1}, I_t recorded using camera 160.
- image frames 102 may include a current image frame I_t corresponding to a selected time point t and a set of P prior image frames I_{t-P}, ..., I_{t-1} corresponding to time points t-P, ..., t-1, prior to the selected time point t.
- encoder unit 240 may apply at least one convolution layer 510 to an input image frame (i.e., one of the prior image frames of image frames 102) to generate a convoluted result of the input image frame.
- Encoder unit 240 may extract spatial features from the convoluted result of the input image frame at different resolutions.
- encoder unit 240 may use several basic residual blocks 512 for extracting multi-scale spatial features (i.e., spatial features in different resolutions) .
- One or more basic residual blocks 512 may be applied to the convoluted result in sequence. Specifically, as shown in FIG. 5, L may be set as 3 (i.e., three different resolutions in total) .
- Basic residual blocks 512 may accordingly include three residual blocks applied in sequence for generating spatial features at different resolutions in a descending manner (e.g., from image resolution of 128 × 128 × 128 to image resolution of 64 × 64 × 256 to image resolution of 32 × 32 × 512) .
- the first residual block may be applied to the convoluted result from the convolution layer and generate a first set of spatial features at a first resolution (e.g., 128 × 128 × 128) .
- the second residual block may be applied to the first set of spatial features and generate a second set of spatial features at a second resolution (e.g., 64 × 64 × 256) , lower than the first resolution.
- each of basic residual blocks 512 may include a 2-D convolution for convoluting the spatial features.
- encoder unit 240 may calculate spatial features in L resolutions for each image frame from I_{t-P} to I_{t-1} using basic residual blocks 512 according to equation (1) :
F_k^(l) = R_l (F_k^(l-1)) , l = 1, ..., L (1)
where F_k^(0) is the convoluted result of image frame I_k from convolution layer 510 and R_l denotes the l-th residual block of basic residual blocks 512.
- predictor unit 242 may predict an estimated image frame in each resolution based on the spatial features of the set of P prior image frames I_{t-P}, ..., I_{t-1} in that resolution using a plurality of prediction sub-models 520.
- the number of prediction sub-models 520 corresponds to the number of different resolutions.
- prediction sub-models 520 may include three prediction sub-models.
- the three prediction sub-models are applied in parallel, each of which corresponds to spatial features in one resolution for predicting an estimated image frame in that resolution.
- each sub-model may include a first block (e.g., a non-local block) for extracting global temporal features from the spatial features of the set of prior image frames in the corresponding resolution and a second block for extracting local temporal features from the global temporal features in that resolution.
- the second block may be a convolutional gated recurrent unit (ConvGRU) .
- the first block (e.g., the non-local block) may be applied to extract global temporal features by capturing long-range dependencies across the spatial features of the set of prior image frames.
- the ConvGRU may be applied to extract local temporal features (i.e., temporal patterns) using its receptive field focusing on the local neighborhood of the global temporal features extracted by the first block.
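Equation (2) is likewise not reproduced in this text. For reference, the standard ConvGRU update (as introduced by Ballas et al.), which the second block may apply with 2-D convolutions over the feature maps, is:

```latex
\begin{aligned}
z_t &= \sigma\left(W_z \ast x_t + U_z \ast h_{t-1}\right) \\
r_t &= \sigma\left(W_r \ast x_t + U_r \ast h_{t-1}\right) \\
\tilde{h}_t &= \tanh\left(W \ast x_t + U \ast \left(r_t \odot h_{t-1}\right)\right) \\
h_t &= \left(1 - z_t\right) \odot h_{t-1} + z_t \odot \tilde{h}_t
\end{aligned}
```

where ∗ denotes 2-D convolution, ⊙ the element-wise product, x_t the input features at time t, and h_t the hidden state carrying the local temporal features. Whether the patent's equation (2) matches this exact formulation is not confirmed by the text.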
- the estimated current image frame in each resolution may be predicted according to equation (2) :
- decoder unit 244 may predict the estimated current image frame by fusing the features (i.e., both global temporal features and local temporal features) in the different resolutions using sub-models 530.
- each sub-model of sub-models 530 includes an upsampling unit for upsampling the features, followed by a residual block for eliminating checkerboard artifacts (i.e., checkerboard patterns in the gradient) in the estimated current image frame (i.e., the output image frame by decoder unit 244) .
- the features may be upsampled by nearest neighbor interpolation (i.e., assigning each new pixel the value of its nearest neighbor) .
- the estimated current image frame may be estimated by upsampling and concatenating the channels of the features according to equation (3) :
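Equation (3) is not reproduced here. A minimal sketch of the fusion step it describes — nearest-neighbor upsampling of the lower-resolution features followed by channel concatenation — might look like the following; the residual block that removes checkerboard artifacts and the final projection back to an RGB frame are omitted:

```python
import numpy as np

def upsample_nearest(x, factor):
    """Nearest-neighbor upsampling: each pixel's value is repeated
    `factor` times along height and width."""
    return x.repeat(factor, axis=0).repeat(factor, axis=1)

def fuse_multiscale(features):
    """Upsample every feature map to the largest (first) resolution and
    concatenate along the channel axis, sketching the decoder's fusion
    of features from the different resolutions."""
    target_h = features[0].shape[0]
    upsampled = [upsample_nearest(f, target_h // f.shape[0]) for f in features]
    return np.concatenate(upsampled, axis=2)
```

With the example feature sizes above (128 × 128 × 128, 64 × 64 × 256, 32 × 32 × 512), the fused tensor has shape 128 × 128 × 896.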
- assessment unit 246 may determine an anomaly score indicative of the difference between the captured current image frame I t and the estimated current image frame using the assessment model (not shown in FIG. 5) .
- the assessment model may calculate the anomaly score based on a peak signal-to-noise ratio (PSNR) of a mean squared error between the captured current image frame and the estimated current image frame. For example, the assessment model may first calculate the mean squared error (MSE) between the captured current image frame I t and the estimated current image frame according to equation (4)
- H and W are the height and the width of the image frames, respectively, and I t (i, j) and Î t (i, j) are the Red, Green, Blue (RGB) values of the (i, j) -th pixel in I t and Î t , respectively.
- the assessment model may then calculate the PSNR based on the mean squared error according to equation (5) .
- MAX It is the maximum possible pixel value of I t (e.g., 255 for 8-bit pixels) .
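Equations (4) and (5) are not shown in this text, but MSE and PSNR are standard quantities; a sketch consistent with the description (assuming 8-bit RGB frames, so MAX = 255) is:

```python
import numpy as np

def mse(frame, estimate):
    """Mean squared error over all H x W pixels and RGB channels,
    as described for equation (4)."""
    diff = frame.astype(np.float64) - estimate.astype(np.float64)
    return np.mean(diff ** 2)

def psnr(frame, estimate, max_val=255.0):
    """Standard PSNR in dB: 10 * log10(MAX^2 / MSE), as described for
    equation (5). max_val = 255 assumes 8-bit pixels."""
    return 10.0 * np.log10(max_val ** 2 / mse(frame, estimate))
```

A well-predicted frame yields a high PSNR; a frame containing an anomalous (hence poorly predicted) event yields a low PSNR.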
- the assessment model may further calculate the anomaly score of the image frame I t at the time point t based on normalizing the PSNR values of all the T prior image frames.
- the anomaly score may be calculated according to equation (6) :
- a higher anomaly score S t indicates a higher probability that the image frame I t contains anomalous event (s) .
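Equation (6) is not reproduced here. A common choice consistent with the description — min-max normalizing the PSNR values over the window of frames and inverting, so that poorly predicted frames score near 1 — is sketched below; whether the patent uses exactly this form is an assumption:

```python
import numpy as np

def anomaly_scores(psnr_values):
    """Min-max normalize the PSNR values over a window of frames and
    invert, so that poorly predicted frames (low PSNR) score close to 1.
    A hypothetical form of equation (6)."""
    p = np.asarray(psnr_values, dtype=np.float64)
    regularity = (p - p.min()) / (p.max() - p.min())
    return 1.0 - regularity
```

The frame with the lowest PSNR in the window receives score 1 and would be flagged whenever the predetermined threshold is below that value.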
- assessment unit 246 may determine if the anomaly score is higher than a predetermined threshold.
- the predetermined threshold may be set by an operator or a designer of the learning model (e.g., anomaly detection network 105) .
- anomaly detection network 105 may be trained by model training device 120 by minimizing a loss including a weighted sum of a perceptual loss indicative of a level of noise between the predicted training image frame and the ground truth training image frame, an intensity loss indicative of an intensity difference between the predicted training image frame and the ground truth training image frame, and a gradient difference loss indicative of a difference between a gradient image of the predicted training image frame and a gradient image of the ground truth training image frame.
- FIG. 4 illustrates a flowchart of an exemplary method 400 for training anomaly detection network 105, according to embodiments of the disclosure.
- Method 400 may include steps S402-S412 as described below. It is to be appreciated that some of the steps may be optional, and some of the steps may be performed simultaneously, or in a different order than shown in FIG. 4.
- model training device 120 may receive training data 101 from training database 140.
- training data 101 may include a video that includes at least the training image frames I t-P , ..., I t-1 , I t recorded using camera 160.
- training data 101 may include a training image frame I t corresponding to a selected time point t and a set of P prior training image frames I t-P , ..., I t-1 corresponding to time points t-P, ..., t-1, prior to the selected time point t.
- model training device 120 may calculate the perceptual loss indicative of a level of noise between the training image frame predicted by using the learning model (e.g., anomaly detection network 105) and the ground truth training image frame I t .
- model training device 120 may apply separate pre-trained deep convolution networks such as VGG16 networks 540 (shown in FIG. 5) to the predicted training image frame and the ground truth training image frame I t simultaneously.
- the pre-trained deep convolution networks are trained on ImageNet for image classification.
- Each VGG16 network may include multiple sub-convolution layers 542 with a rectified linear unit (ReLU) .
- the multiple sub-convolution layers 542 may include 13 sub-convolution layers.
- Model training device 120 may calculate the perceptual loss based on a weighted l1 distance between the features at the 2nd, 4th, 7th, 10th and 13th sub-convolution layers of the VGG16 network according to equation (7) .
- the corresponding layer weights (for the 2nd, 4th, 7th, 10th and 13th layers) may be set as (0.1, 1, 10, 10, 10) .
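Equation (7) is not shown in this text. With those weights, the perceptual loss can be sketched as a weighted l1 distance over the selected VGG16 feature maps; `feats_pred` and `feats_gt` are hypothetical lists holding the outputs of the 2nd, 4th, 7th, 10th and 13th sub-convolution layers, and the per-layer mean reduction is an assumption:

```python
import numpy as np

def perceptual_loss(feats_pred, feats_gt, weights=(0.1, 1.0, 10.0, 10.0, 10.0)):
    """Weighted l1 distance between the VGG16 feature maps of the
    predicted and ground truth frames. One term per selected layer;
    the mean reduction per layer is an assumption about equation (7)."""
    return sum(w * np.abs(fp - fg).mean()
               for w, fp, fg in zip(weights, feats_pred, feats_gt))
```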
- model training device 120 may calculate the intensity loss L int indicative of the intensity l 2 distance between the predicted training image frame and the ground truth training image frame I t according to equation (8) :
- I t (i, j) and Î t (i, j) are the RGB values of the (i, j) -th pixel in I t and Î t , respectively.
- model training device 120 may calculate the gradient difference loss L gd indicative of the gradient difference between the gradient image of the predicted training image frame and the gradient image of the ground truth training image frame I t .
- model training device 120 may measure the l 1 distance in both vertical and horizontal directions (i.e., the vertical and horizontal gradient differences) according to equation (9) :
- model training device 120 may calculate the overall loss as a weighted sum of the perceptual loss, the intensity loss and the gradient difference loss according to equation (11) .
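Equation (11) is not shown in this text; based on the description, the overall loss presumably takes the form of a weighted sum (the λ symbols are placeholders for the patent's actual weight notation):

```latex
L = \lambda_{per} L_{per} + \lambda_{int} L_{int} + \lambda_{gd} L_{gd}
```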
- model training device 120 may train the learning model (e.g., anomaly detection network 105) by minimizing the overall loss.
- the design of the multi-scale architecture of the learning model applied by the disclosed system and method ensures that more attention is paid to semantically meaningful parts, such as a person or a vehicle, compared to background information (i.e., the static/unchanged objects) in each of the image frames of the video to be analyzed.
- the disclosed multi-scale learning model is more sensitive to objects with different scales of features. For example, spatial features extracted at different resolutions may ensure that the same object at different granularities in the image frame can be detected for determining whether an anomalous event has happened.
- because the learning model can be trained solely on normal videos (i.e., videos that are anomaly-free) , it is less expensive to capture the training data for training the learning model.
- the computer-readable medium may include volatile or non-volatile, magnetic, semiconductor, tape, optical, removable, non-removable, or other types of computer-readable medium or computer-readable storage devices.
- the computer-readable medium may be the storage device or the memory module having the computer instructions stored thereon, as disclosed.
- the computer-readable medium may be a disc or a flash drive having the computer instructions stored thereon.
Abstract
Systems, methods, and computer-readable media for video anomaly detection using a learning model. The method may include receiving a video including a plurality of image frames captured by a camera, the plurality of image frames including a current image frame corresponding to a selected time point and a set of prior image frames corresponding to time points prior to the selected time point. The method may also include extracting spatial features from each prior image frame at different resolutions and predicting an estimated image frame in each resolution based on the spatial features of the set of prior image frames in that resolution. The method may further include predicting an estimated current image frame based on the estimated image frames in the different resolutions and detecting the video anomaly based on a difference between the captured current image frame and the estimated current image frame.
Description
The present disclosure relates to systems and methods for video anomaly detection, and more particularly to, systems and methods for video anomaly detection using a multi-scale image frame prediction network.
Video anomaly detection plays an essential role in computer vision and is used in many applications such as warning systems, scene understanding, activity recognition, road traffic analysis, etc. Given a video clip, frame-level video anomaly detection aims at identifying the frames in which there exist events or behaviors different from expectations or regulations. Detecting anomalous events in videos is very challenging for mainly two reasons. First, the definition of an anomalous event is usually ambiguous, and its pattern varies a lot because it highly depends on the contexts of the event and the scenario where the event happens. For example, a vehicle driving down the road within the speed limits is normal, but a vehicle dashing towards a crowd of people is anomalous. For another example, a cloud of smoke rising from inside of a building often indicates an anomaly, but a cloud of smoke coming from a chimney is normal. Therefore, a robust video anomaly detection method has to be able to address the ambiguous definition of the anomalous event. Second, the data between normal and abnormal samples is usually imbalanced in practice. Because anomalous events are rare and unpredictable in real-world scenarios, it is very difficult and costly to collect and label abnormal videos. Therefore, video anomaly detection methods have to learn from normal data only (i.e., using the video of normal events to distinguish unseen and unbounded abnormal events from normal ones) .
Various video anomaly detection methods have been developed over time. Traditional video anomaly detection methods rely on hand-crafted features from prior domain knowledge. These methods learn dictionaries or descriptors of normal events based on extracted appearance and motion features. However, their detection performance is limited by the poor discriminative power of the simple features. The recent advance in deep learning networks has promoted deep learning-based video anomaly detection methods. These methods can be generally grouped into three categories: classification-based methods, reconstruction-based methods, and prediction-based methods.
Classification-based methods directly classify each video frame as normal or abnormal. Among them, some methods require additional anomaly data and labels in the training phase via a weakly supervised approach. However, because collecting and labeling anomaly data for training is expensive or even infeasible, this approach cannot be generalized to unbounded anomalies. Other works focus on specific types of anomalous events. For example, a multi-stage classification method first detects and crops out the object of interest and then builds one-versus-rest classifiers on the extracted features of that object. Since its anomaly detection result is substantially influenced by the performance of its object detection, it will fail in recognizing anomalies with unseen objects or with no objects to attribute. Therefore, this type of method is very limited when dealing with complicated and uncertain real-world scenarios.
Reconstruction-based methods are a more general approach for video anomaly detection; they learn to reconstruct the input video frame with minimum reconstruction errors for normal frames and are expected to have large errors for abnormal frames. Auto-encoder based methods and generative adversarial networks are commonly used as the reconstruction models. However, due to the strong capacity of deep neural networks, an abnormal event may also have small reconstruction errors, so there is no guarantee that reconstruction-based methods can detect abnormal events well.
Prediction-based methods have been developed to remedy the issues of the classification-based and reconstruction-based methods. They take consecutive video frames to predict the next frame and determine whether the next frame is abnormal by the prediction error. However, the existing prediction-based methods are still suboptimal. For example, their U-Net architecture cannot fully learn temporal information, and using adversarial learning and an additional optical flow loss makes the training inefficient.
Embodiments of the disclosure address the above problems by providing prediction-based video anomaly detection methods and systems using a multi-scale frame prediction network.
SUMMARY
Embodiments of the disclosure provide a method for video anomaly detection using a learning model. An exemplary method may include receiving a video including a plurality of image frames captured by a camera, the plurality of image frames including a current image frame corresponding to a selected time point and a set of prior image frames corresponding to time points prior to the selected time point. The method may also include extracting spatial features from each prior image frame at different resolutions and predicting an estimated image frame in each resolution based on the spatial features of the set of prior image frames in that resolution. The method may further include predicting an estimated current image frame based on the estimated image frames in the different resolutions and detecting the video anomaly based on a difference between the captured current image frame and the estimated current image frame.
Embodiments of the disclosure also provide a system for performing video anomaly detection using a learning model. An exemplary system may include a communication interface configured to receive a video including a plurality of image frames captured by a camera, the plurality of image frames including a current image frame corresponding to a selected time point and a set of prior image frames corresponding to time points prior to the selected time point. The system may also include at least one processor coupled to the communication interface. The at least one processor may be configured to extract spatial features from each prior image frame at different resolutions and predict an estimated image frame in each resolution based on the spatial features of the set of prior image frames in that resolution. The at least one processor may be further configured to predict an estimated current image frame based on the estimated image frames in the different resolutions and detect the video anomaly based on a difference between the captured current image frame and the estimated current image frame.
Embodiments of the disclosure further provide a non-transitory computer-readable medium storing instructions that, when executed by one or more processors, cause the one or more processors to perform a method for video anomaly detection using a learning model. The method may include receiving a video including a plurality of image frames captured by a camera, the plurality of image frames including a current image frame corresponding to a selected time point and a set of prior image frames corresponding to time points prior to the selected time point. The method may also include extracting spatial features from each prior image frame at different resolutions and predicting an estimated image frame in each resolution based on the spatial features of the set of prior image frames in that resolution. The method may further include predicting an estimated current image frame based on the estimated image frames in the different resolutions and detecting the video anomaly based on a difference between the captured current image frame and the estimated current image frame.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention, as claimed.
FIG. 1 illustrates a schematic diagram of an exemplary anomaly detection system, according to embodiments of the disclosure.
FIG. 2 illustrates a block diagram of an exemplary anomaly detection device, according to embodiments of the disclosure.
FIG. 3 illustrates a flowchart of an exemplary method for anomaly detection, according to embodiments of the disclosure.
FIG. 4 illustrates a flowchart of an exemplary method for training an anomaly detection network, according to embodiments of the disclosure.
FIG. 5 illustrates a schematic framework of an exemplary anomaly detection network along with its training network, according to embodiments of the disclosure.
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. Wherever possible, the same reference numbers will be used throughout the drawings to refer to the same or like parts.
FIG. 1 illustrates a schematic diagram of an exemplary video anomaly detection system (referred to as “anomaly detection system 100” ) , according to embodiments of the disclosure. Consistent with the present disclosure, anomaly detection system 100 is configured to detect anomalous events recorded in a video (i.e., determine if there is a video anomaly) captured by a camera 160. The anomaly detection may be based on a multi-scale image frame prediction network (referred to as an “anomaly detection network 105” hereafter) trained using sample videos (e.g., training data 101) . In some embodiments, anomaly detection system 100 may include components shown in FIG. 1, including an anomaly detection device 110, a model training device 120, a display device 130, a training database 140, a database/repository 150, a camera 160 and a network 170 for facilitating communications among the various components. It is contemplated that anomaly detection system 100 may include more or fewer components compared to those shown in FIG. 1.
As shown in FIG. 1, anomaly detection system 100 may perform two stages: an anomaly detection model training stage and an anomaly detection stage applying the trained model. To perform the training stage (e.g., training a learning model such as anomaly detection network 105) , anomaly detection system 100 may include model training device 120 and training database 140. To perform the video anomaly detection process to obtain an anomaly detection result 107 (e.g., whether the video includes anomalous event (s) ) , anomaly detection system 100 may include anomaly detection device 110 and database/repository 150. In some embodiments, anomaly detection system 100 may also include display device 130 to display anomaly detection result 107. In some embodiments, when the learning model (e.g., anomaly detection network 105) is pre-trained, anomaly detection system 100 may only include components for performing the anomaly detection related functions, namely anomaly detection device 110, database/repository 150 and optionally display device 130.
In some embodiments, anomaly detection system 100 may optionally include network 170 to facilitate the communication among the various components of anomaly detection system 100, such as databases 140 and 150, devices 110 and 120, and camera 160. For example, network 170 may be a local area network (LAN) , a wireless network, a cloud computing environment (e.g., software as a service, platform as a service, infrastructure as a service) , a client-server, a wide area network (WAN) , etc. In some embodiments, network 170 may be replaced by wired data communication systems or devices.
In some embodiments, the various components of anomaly detection system 100 may be remote from each other or in different locations and be connected through network 170 as shown in FIG. 1. In some alternative embodiments, certain components of anomaly detection system 100 may be located on the same site or inside one device. For example, training database 140 may be located on-site with or be part of model training device 120. As another example, model training device 120 and anomaly detection device 110 may be inside the same computer or processing device.
Consistent with the present disclosure, anomaly detection system 100 may store videos that include multiple image frames. Image frames of a video may each correspond to a time point. Consistent with the present disclosure, a “current image frame” refers to the image frame corresponding to a selected time point. Accordingly, “prior image frames” refer to the set of image frames corresponding to time points prior to the selected time point. Likewise, “future image frames” refer to the set of image frames corresponding to time points subsequent to the selected time point. In some embodiments, sample image frames where anomalies are known (e.g., training data 101) may be stored in training database 140 and image frames for detection (e.g., image frames 102) may be stored in database/repository 150.
Image frames may be generated based on video data (e.g., a video comprising a plurality of image frames) received from video recording devices (e.g., camera 160) . In some embodiments, the video data may be visual image streams including the current image frame, the set of prior image frames and the set of future image frames, acquired by camera 160 or a wearable device, a smart phone, a tablet, a computer, a surveillance camera, or the like that includes a video recording device for acquiring the video data. In some embodiments, camera 160 may be any suitable video recording device that can acquire image frames. Image frames may be generated based on the acquired video data. Optionally, the video data may also come from a post-processing device implementing image enhancement technology (e.g., an application on a user device for adding night visions to the surveillance devices) .
In some embodiments, training database 140 may store training data 101, which includes sample image frames. In some embodiments, training data 101 may further include the known anomalies (if any) in the sample image frames. For example, the sample image frames may include an image frame corresponding to a time point when an anomalous event happens, and image frames prior and subsequent to that time point. In some embodiments, training data 101 may include image frames that are known to be anomaly-free. Sample image frames may be stored in training database 140 as training data 101.
Consistent with some embodiments, anomaly detection network 105 (described in detail in connection with FIG. 5) may be a multi-scale image frame prediction network (referred to as “prediction network” hereafter) for predicting an estimated current image frame (i.e., a prediction of an image frame at a selected time point) based on a set of prior image frames corresponding to time points prior to the selected time point. Anomaly detection network 105 may also include an anomaly score calculation model (referred to as “assessment model” hereafter) for determining an anomaly score, indicative of a difference between the ground truth current image frame (i.e., the captured image frame at the selected time point) and the estimated current image frame. In some embodiments, anomaly detection network 105 may decide whether the current image frame records/includes anomalous event (s) (i.e., whether an anomaly occurs at or around the selected time point) based on the anomaly score.
In some embodiments, the prediction network may include an encoder for extracting spatial features from each prior image frame (i.e., the set of prior image frames corresponding to time points prior to the selected time point) at different resolutions, a predictor for predicting an estimated image frame in each resolution based on the spatial features of the set of prior image frames in that resolution and a decoder for predicting the estimated current image frame based on the estimated image frames in the different resolutions.
In some embodiments, the encoder may include at least one convolution layer and a plurality of residual blocks for obtaining spatial features of the prior image frames in different resolutions. Each residual block may be configured to extract spatial features in a specific resolution. The residual blocks may be connected in a sequential manner, each producing spatial features in a different resolution. The more residual blocks are applied to an image frame, the lower the resolution of the extracted spatial features. For example, each residual block may include a 2-D convolution, and when extracting the spatial features from each prior image frame, a first residual block may be applied to the image frame for generating spatial features in a first resolution, and a second residual block may be applied to the spatial features in the first resolution to obtain spatial features in a second resolution. The second resolution is lower than the first resolution. It is understood that the number of residual blocks of the plurality of residual blocks is not limited to two. In some embodiments, the plurality of residual blocks may include three or more residual blocks applied in a sequential manner, for generating spatial features in three or more different resolutions. The multi-scale architecture may allow the encoder to extract spatial features of the image frame at different scales.
In some embodiments, the predictor may include a plurality of sub-models in parallel, each corresponding to a resolution. Each of the sub-models may include a first block for extracting global temporal features from the spatial features of the set of prior image frames in that resolution, and a second block for extracting local temporal features from the global temporal features in that resolution. For example, within each sub-model, the first block may be a non-local block. The second block may be a convolutional gated recurrent unit (ConvGRU) and may extract local temporal features with its receptive field in that resolution.
In some embodiments, the decoder may include a plurality of residual blocks for fusing the local and global spatial-temporal features in the different resolutions generated by the predictor to generate an estimated current image frame of the selected time point. The estimated current image frame may have the same shape as the input prior image frames. For example, the decoder may fuse features at different scales and construct the output (e.g., the estimated current image frame) by upsampling and concatenating the channels of the features. In some embodiments, the lower-resolution features may be upsampled by nearest neighbor interpolation (i.e., assigning each new pixel the value of its nearest neighbor) . Checkerboard artifacts (i.e., checkerboard patterns in the gradient) in the output may be eliminated using the residual blocks.
In some embodiments, the assessment model may calculate the anomaly score based on a peak signal-to-noise ratio (PSNR) of a mean squared error between the ground truth/captured current image frame and the estimated current image frame. For example, the assessment model may first calculate the mean squared error between the ground truth current image frame I t and the estimated current image frame Î t , and calculate the PSNR based on the mean squared error. The assessment model may then calculate an anomaly score based on normalizing the PSNR values of all the prior image frames.
Consistent with some embodiments, anomaly detection network 105 may be trained by minimizing the differences between an image frame predicted by using anomaly detection network 105 (hereafter “predicted training image frame” ) and a ground truth image frame corresponding to the same time point as the predicted image frame provided as part of training data 101 (hereafter “ground truth training image frame” ) . For example, model training device 120 may minimize a loss including a weighted sum of a perceptual loss indicative of a level of noise between the predicted training image frame and the ground truth training image frame, an intensity loss indicative of an intensity difference between the predicted training image frame and the ground truth training image frame, and a gradient difference loss indicative of a difference between a gradient image of the predicted training image frame and a gradient image of the ground truth training image frame. The details of the training will be disclosed in connection with FIG. 4 below.
As shown in FIG. 1, model training device 120 may communicate with training database 140 to receive one or more sets of training data 101. Each set of training data 101 may include sample image frames (i.e., a plurality of image frames from a sample video) including a current image frame corresponding to a selected time point and a set of prior image frames corresponding to time points prior to the selected time point. Model training device 120 may use training data 101 received from training database 140 to train the learning model, e.g., anomaly detection network 105. Model training device 120 may be implemented with hardware specially programmed by software that performs the training process. For example, model training device 120 may include a processor and a non-transitory computer-readable medium. The processor may conduct the training by performing instructions of a training process stored in the computer-readable medium. Model training device 120 may additionally include input and output interfaces to communicate with training database 140, network 170, and/or a user interface (not shown) . The user interface may be used for selecting sets of training data, adjusting one or more parameters of the training process, selecting or modifying a framework of the learning model, and/or manually adjusting the selected time point of training data 101.
FIG. 2 illustrates a block diagram of an exemplary anomaly detection device 110, according to embodiments of the disclosure. In some embodiments, as shown in FIG. 2, anomaly detection device 110 may include a communication interface 202, a processor 204, a memory 206, and a storage 208. In some embodiments, anomaly detection device 110 may have different modules in a single device, such as an integrated circuit (IC) chip (e.g., implemented as an application-specific integrated circuit (ASIC) or a field-programmable gate array (FPGA) ) , or separate devices with dedicated functions. In some embodiments, one or more components of anomaly detection device 110 may be located in a cloud or may be alternatively in a single location (such as inside a mobile device) or distributed locations. Components of anomaly detection device 110 may be in an integrated device or distributed at different locations but communicate with each other through a network (not shown) . Consistent with the present disclosure, anomaly detection device 110 may be configured to detect anomalous event (s) recorded in image frames 102 received from database/repository 150, using anomaly detection network 105 trained in model training device 120.
Consistent with some embodiments, communication interface 202 may receive anomaly detection network 105 from model training device 120 and image frames 102 from database/repository 150. Communication interface 202 may further provide image frames 102 and anomaly detection network 105 to memory 206 and/or storage 208 for storage or to processor 204 for processing.
In some embodiments, memory 206 and/or storage 208 may also store intermediate data such as spatial features at different resolutions, estimated image frames in different resolutions, estimated current image frame, PSNR of a mean squared error, anomaly score, etc. Memory 206 and/or storage 208 may additionally store various learning models including their model parameters, such as learnable parameters of the encoder, the predictor, and the decoder, etc.
As shown in FIG. 2, processor 204 may include multiple modules, such as an encoder unit 240, a predictor unit 242, a decoder unit 244, an assessment unit 246, and the like. These modules (and any corresponding sub-modules or sub-units) can be hardware units (e.g., portions of an integrated circuit) of processor 204 designed for use with other components or software units implemented by processor 204 through executing at least part of a program. The program may be stored on a computer-readable medium, and when executed by processor 204, it may perform one or more functions. Although FIG. 2 shows units 240-246 all within one processor 204, it is contemplated that these units may be distributed among different processors located close to or remote from each other.
In some embodiments, units 240-246 of FIG. 2 may execute computer instructions to perform the anomaly detection. For example, FIG. 3 illustrates a flowchart of an exemplary method 300 for anomaly detection based on anomaly detection network 105 (an example shown in FIG. 5), according to embodiments of the disclosure. Method 300 may be implemented by anomaly detection device 110 and particularly processor 204 or a separate processor not shown in FIG. 2. Method 300 may include steps S302-S314 as described below. It is to be appreciated that some of the steps may be optional, and some of the steps may be performed simultaneously, or in a different order than shown in FIG. 3. FIG. 3 and FIG. 5 will be described together.
In step S302, communication interface 202 may receive image frames 102 from database/repository 150. In some embodiments, image frames 102 may be part of a video that includes at least image frames I_t-P, …, I_t-1, I_t recorded using camera 160. For example, image frames 102 may include a current image frame I_t corresponding to a selected time point t and a set of P prior image frames I_t-P, …, I_t-1 corresponding to time points t-P, …, t-1, prior to the selected time point t.
In step S304, encoder unit 240 may apply at least one convolution layer 510 to an input image frame (i.e., one of the prior image frames of image frames 102) to generate a convoluted result of the input image frame. Encoder unit 240 may extract spatial features from the convoluted result of the input image frame at different resolutions. For example, encoder unit 240 may use several basic residual blocks 512 for extracting multi-scale spatial features (i.e., spatial features at different resolutions). One or more basic residual blocks 512 may be applied to the convoluted result in sequence. Specifically, as shown in FIG. 5, L may be set as 3 (i.e., three different resolutions in total). Basic residual blocks 512 may accordingly include three residual blocks applied in sequence for generating spatial features at different resolutions in a descending manner (e.g., from a feature map size of 128×128×128 to 64×64×256 to 32×32×512). For example, the first residual block may be applied to the convoluted result from the convolution layer and generate a first set of spatial features at a first resolution (e.g., 128×128×128). The second residual block may be applied to the first set of spatial features and generate a second set of spatial features at a second resolution (e.g., 64×64×256), lower than the first resolution. In some embodiments, each of basic residual blocks 512 may include a 2-D convolution for convoluting the spatial features.
For example, encoder unit 240 may calculate spatial features in L resolutions for each image frame from I_t-P to I_t-1 using basic residual blocks 512 according to equation (1):

F_t′^1, …, F_t′^L = f_enc (I_t′)    (1)

where t′ = t-P, …, t-1 and f_enc (·) is the encoder function.
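As an illustration of the descending resolution schedule described above, a sketch in Python is given below. It only tracks the feature-map shapes produced by each residual block (spatial size halved, channel count doubled at each level, matching the 128×128×128 to 64×64×256 to 32×32×512 example of FIG. 5); it does not implement the convolutions themselves, and the starting size is taken from that example, not mandated by the disclosure.

```python
def multiscale_feature_shapes(height=128, width=128, base_channels=128, levels=3):
    """Return the (H, W, C) shape produced by each residual block, where the
    spatial resolution halves and the channel count doubles at every level."""
    shapes = []
    h, w, c = height, width, base_channels
    for _ in range(levels):
        shapes.append((h, w, c))
        h, w, c = h // 2, w // 2, c * 2
    return shapes

# Reproduces the 128x128x128 -> 64x64x256 -> 32x32x512 progression of FIG. 5.
print(multiscale_feature_shapes())
```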
In step S306, predictor unit 242 may predict an estimated image frame in each resolution based on the spatial features of the set of P prior image frames I_t-P, …, I_t-1 in that resolution using a plurality of prediction sub-models 520. In some embodiments, the number of prediction sub-models 520 corresponds to the number of different resolutions. For example, in a specific example, because the spatial features were extracted at three different resolutions by encoder unit 240 (i.e., L=3), prediction sub-models 520 may include three prediction sub-models. In some embodiments, the three prediction sub-models are applied in parallel, each of which corresponds to spatial features in one resolution for predicting an estimated image frame in that resolution. For example, each sub-model may include a first block (e.g., a non-local block) for extracting global temporal features from the spatial features of the set of prior image frames in the corresponding resolution and a second block for extracting local temporal features from the global temporal features in that resolution.
In some embodiments, the second block may be a convolutional gated recurrent unit (ConvGRU). For example, within each sub-model, the first block (e.g., the non-local block) may be applied to capture long-range dependencies at all positions in the image frame and the ConvGRU may be applied to extract local temporal features (i.e., temporal patterns) using its receptive field focusing on the local neighborhood of the global temporal features extracted by the first block. The estimated current image frame in each resolution may be predicted according to equation (2):

Î_t^l = f_pre (F_t-P^l, …, F_t-1^l)    (2)

where l = 1, 2, …, L and f_pre (·) is the predictor function.
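A minimal sketch of the gated update inside a ConvGRU cell is given below. For brevity it replaces the learned convolution kernels with scalar (1×1, pointwise) weights applied independently at each pixel, so it illustrates only the update/reset gating and not the spatial receptive field; all weight values are arbitrary placeholders, not parameters from the disclosure.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def convgru_step(h, x, wz=0.5, uz=0.5, wr=0.5, ur=0.5, wh=1.0, uh=1.0):
    """One ConvGRU time step with pointwise weights.
    h, x: 2-D grids (lists of lists) holding the hidden state and input features."""
    new_h = []
    for h_row, x_row in zip(h, x):
        row = []
        for hv, xv in zip(h_row, x_row):
            z = sigmoid(wz * xv + uz * hv)              # update gate
            r = sigmoid(wr * xv + ur * hv)              # reset gate
            h_cand = math.tanh(wh * xv + uh * r * hv)   # candidate state
            row.append((1.0 - z) * hv + z * h_cand)     # gated blend
        new_h.append(row)
    return new_h

# Run the cell over a short sequence of 2x2 "frames" of spatial features.
frames = [[[0.1, 0.2], [0.3, 0.4]], [[0.5, 0.6], [0.7, 0.8]]]
state = [[0.0, 0.0], [0.0, 0.0]]
for frame in frames:
    state = convgru_step(state, frame)
```

Because each update blends the previous state with a tanh candidate, the hidden state stays bounded in (-1, 1), which is one practical reason gated recurrent units are stable over long frame sequences.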
In step S308, decoder unit 244 may predict the estimated current image frame by fusing the features (i.e., both global temporal features and local temporal features) in the different resolutions using sub-models 530. In some embodiments, each sub-model of sub-models 530 includes an upsampling unit for upsampling the features, followed by a residual block for eliminating checkerboard artifacts (i.e., checkerboard patterns in the gradient) in the estimated current image frame (i.e., the output image frame of decoder unit 244). In some embodiments, the features may be upsampled by nearest neighbor interpolation (i.e., each new pixel takes the value of its nearest original pixel).
For example, the estimated current image frame may be estimated by upsampling and concatenating the channels of the features according to equation (3):

Î_t = f_dec (Î_t^1, Î_t^2, …, Î_t^L)    (3)

where f_dec (·) is the decoder function.
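The nearest neighbor upsampling used by each decoder sub-model can be sketched as follows; this toy version operates on a single-channel 2-D grid with an integer scale factor, whereas the actual decoder upsamples multi-channel feature maps.

```python
def upsample_nearest(grid, factor):
    """Upsample a 2-D grid by repeating each value in a factor-by-factor block,
    i.e., every output pixel copies its nearest input pixel."""
    out = []
    for row in grid:
        expanded = []
        for value in row:
            expanded.extend([value] * factor)
        for _ in range(factor):
            out.append(list(expanded))
    return out

print(upsample_nearest([[1, 2], [3, 4]], 2))
# → [[1, 1, 2, 2], [1, 1, 2, 2], [3, 3, 4, 4], [3, 3, 4, 4]]
```

Unlike bilinear upsampling, nearest neighbor interpolation introduces no new intermediate values, which pairs well with the subsequent residual block that suppresses checkerboard artifacts.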
In step S310, assessment unit 246 may determine an anomaly score indicative of the difference between the captured current image frame I_t and the estimated current image frame Î_t using the assessment model (not shown in FIG. 5).
In some embodiments, the assessment model may calculate the anomaly score based on a peak signal-to-noise ratio (PSNR) of a mean squared error between the captured current image frame and the estimated current image frame. For example, the assessment model may first calculate the mean squared error (MSE) between the captured current image frame I_t and the estimated current image frame Î_t according to equation (4):

MSE = (1 / (H·W)) Σ_{i=1}^{H} Σ_{j=1}^{W} ‖I_t (i, j) − Î_t (i, j)‖²    (4)

where H and W are the height and the width of the image frames respectively, and I_t (i, j) and Î_t (i, j) are the Red, Green, Blue (RGB) values of the (i, j)-th pixel in I_t and Î_t respectively.
The assessment model may then calculate the PSNR based on the mean squared error according to equation (5):

PSNR_t = 10 · log_10 (MAX_{I_t}² / MSE)    (5)

where MAX_{I_t} is the maximum possible pixel value of I_t.
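Equations (4) and (5) can be computed directly. The sketch below uses single-channel (grayscale) frames stored as nested lists instead of the RGB triples of the disclosure, purely to keep the example short; the frame values are illustrative only.

```python
import math

def mse(frame_a, frame_b):
    """Mean squared error over all pixels of two equally sized frames."""
    height, width = len(frame_a), len(frame_a[0])
    total = sum(
        (frame_a[i][j] - frame_b[i][j]) ** 2
        for i in range(height) for j in range(width)
    )
    return total / (height * width)

def psnr(frame_a, frame_b, max_value=255.0):
    """Peak signal-to-noise ratio in dB, following equation (5)."""
    return 10.0 * math.log10(max_value ** 2 / mse(frame_a, frame_b))

current = [[100, 110], [120, 130]]
estimated = [[110, 120], [130, 140]]    # every pixel off by 10, so MSE = 100
print(round(psnr(current, estimated), 2))  # → 28.13
```

A well-predicted (normal) frame yields a high PSNR; a frame the model fails to predict, which is the signature of an anomalous event, yields a low PSNR.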
The assessment model may further calculate the anomaly score of the image frame I_t at the time point t by normalizing the PSNR values of all the T prior image frames. In some embodiments, the anomaly score may be calculated according to equation (6):

S_t = 1 − (PSNR_t − min_{t′} PSNR_{t′}) / (max_{t′} PSNR_{t′} − min_{t′} PSNR_{t′})    (6)

where t′ = 1, 2, …, T. In some embodiments, a higher anomaly score S_t indicates a higher probability that the image frame I_t contains anomalous event(s).
In step S312, assessment unit 246 may determine if the anomaly score is higher than a predetermined threshold. In some embodiments, an operator or a designer of the learning model (e.g., anomaly detection network 105) may set a predetermined value based on domain knowledge for determining if the image frame records anomalous event(s). For example, if the anomaly score S_t is higher than the predetermined threshold, the image frame contains anomalous event(s). Thus, in step S314, the video may be determined to include an anomaly. Otherwise, if the anomaly score S_t is not higher than the predetermined threshold, the current image frame contains no anomalous event(s). Thus, in step S316, the video may be determined to be normal at or around time point t.
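The score-and-threshold decision of steps S310-S316 can be sketched as below. This assumes the common inverted min-max normalization of PSNR values over a window of frames (a frame with the lowest PSNR in the window scores 1); the exact normalization of the disclosure may differ, the PSNR values are illustrative, and the 0.7 threshold is an arbitrary placeholder for the operator-chosen value.

```python
def anomaly_score(psnr_t, psnr_window):
    """Min-max normalize the PSNR over a window of frames and invert it,
    so a poorly predicted frame (low PSNR) gets a score close to 1."""
    lo, hi = min(psnr_window), max(psnr_window)
    return 1.0 - (psnr_t - lo) / (hi - lo)

def contains_anomaly(psnr_t, psnr_window, threshold=0.7):
    """Flag the frame when its anomaly score exceeds the threshold."""
    return anomaly_score(psnr_t, psnr_window) > threshold

window = [30.0, 32.0, 35.0, 31.0, 22.0]   # hypothetical per-frame PSNR values
print(anomaly_score(22.0, window))        # → 1.0 (the worst-predicted frame)
print(contains_anomaly(22.0, window))     # → True
```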
Consistent with some embodiments, anomaly detection network 105 may be trained by model training device 120 by minimizing a loss including a weighted sum of a perceptual loss indicative of a level of noise between the predicted training image frame and the ground truth training image frame, an intensity loss indicative of an intensity difference between the predicted training image frame and the ground truth training image frame, and a gradient difference loss indicative of a difference between a gradient image of the predicted training image frame and a gradient image of the ground truth training image frame.
FIG. 4 illustrates a flowchart of an exemplary method 400 for training anomaly detection network 105, according to embodiments of the disclosure. Method 400 may include steps S402-S412 as described below. It is to be appreciated that some of the steps may be optional, and some of the steps may be performed simultaneously, or in a different order than shown in FIG. 4.
In step S402, model training device 120 may receive training data 101 from training database 140. In some embodiments, training data 101 may include a video that includes at least training image frames I_t-P, …, I_t-1, I_t recorded using camera 160. For example, training data 101 may include a training image frame I_t corresponding to a selected time point t and a set of P prior training image frames I_t-P, …, I_t-1 corresponding to time points t-P, …, t-1, prior to the selected time point t.
In step S404, model training device 120 may calculate the perceptual loss indicative of a level of noise between the training image frame Î_t predicted by using the learning model (e.g., anomaly detection network 105) and the ground truth training image frame I_t. For example, model training device 120 may apply separate pre-trained deep convolution networks such as VGG16 networks 540 (shown in FIG. 5) to the predicted training image frame Î_t and the ground truth training image frame I_t simultaneously.
In some embodiments, the pre-trained deep convolution networks are trained on ImageNet for image classification. Each VGG16 network may include multiple sub-convolution layers 542 with a rectified linear unit (ReLU). For a specific example, the multiple sub-convolution layers 542 may include 13 sub-convolution layers. Model training device 120 may calculate the perceptual loss based on a weighted l_1 distance of the outputs of the 2nd, 4th, 7th, 10th and 13th sub-convolution layers of the VGG16 network according to equation (7):

L_pl = Σ_{v∈V} α_v ‖φ_v (Î_t) − φ_v (I_t)‖_1    (7)

where V = {2, 4, 7, 10, 13}, φ_v (·) is the output from the VGG16's v-th sub-convolution layer, and the hyperparameter (i.e., a predetermined parameter) α_v controls the strength of each part of the loss. In a specific example, (α_2, α_4, α_7, α_10, α_13) may be set as (0.1, 1, 10, 10, 10).
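The weighted l_1 combination of equation (7) can be sketched as below. Small flat lists stand in for the feature maps of the selected VGG16 layers, since running an actual VGG16 is out of scope here; the feature values and the two-layer alpha schedule are illustrative only.

```python
def perceptual_loss(pred_feats, gt_feats, alphas):
    """Weighted l1 distance between corresponding layer outputs.
    pred_feats / gt_feats: one flat list of feature values per selected layer."""
    loss = 0.0
    for fp, fg, alpha in zip(pred_feats, gt_feats, alphas):
        loss += alpha * sum(abs(p - g) for p, g in zip(fp, fg))
    return loss

# Two mock "layers" weighted (0.1, 1.0), echoing the alpha schedule above.
pred = [[1.0, 2.0], [3.0]]
gt = [[0.0, 2.0], [1.0]]
print(perceptual_loss(pred, gt, [0.1, 1.0]))  # → 0.1*1 + 1.0*2 = 2.1
```

Weighting deeper layers more heavily (as in the (0.1, 1, 10, 10, 10) example) emphasizes semantic agreement over raw pixel agreement.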
In step S406, model training device 120 may calculate the intensity loss L_int indicative of the intensity l_2 distance between the predicted training image frame Î_t and the ground truth training image frame I_t according to equation (8):

L_int = Σ_{i,j} ‖I_t (i, j) − Î_t (i, j)‖_2²    (8)

where 1 ≤ i ≤ H and 1 ≤ j ≤ W, and I_t (i, j) and Î_t (i, j) are the RGB values of the (i, j)-th pixel in I_t and Î_t respectively.
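Equation (8) accumulates the squared pixel differences over the whole frame; the sketch below assumes the sum is left unnormalized (the MSE of equation (4) is its mean-normalized counterpart) and again uses grayscale values for brevity.

```python
def intensity_loss(pred, gt):
    """Sum of squared pixel differences between predicted and ground truth frames."""
    return sum(
        (p - g) ** 2
        for pred_row, gt_row in zip(pred, gt)
        for p, g in zip(pred_row, gt_row)
    )

print(intensity_loss([[1.0, 2.0], [3.0, 4.0]],
                     [[1.0, 0.0], [3.0, 1.0]]))  # → 0 + 4 + 0 + 9 = 13.0
```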
In step S408, model training device 120 may calculate the gradient difference loss L_gd indicative of the gradient difference between the gradient image of the predicted training image frame Î_t and the gradient image of the ground truth training image frame I_t. For example, model training device 120 may measure the l_1 distance in both vertical and horizontal directions (i.e., the vertical and horizontal gradient differences) according to equation (9):

g_v (I) (i, j) = |I (i, j) − I (i−1, j)|,  g_h (I) (i, j) = |I (i, j) − I (i, j−1)|    (9)

and calculate the gradient difference loss L_gd according to equation (10):

L_gd = Σ_{i,j} ( |g_v (I_t) (i, j) − g_v (Î_t) (i, j)| + |g_h (I_t) (i, j) − g_h (Î_t) (i, j)| )    (10)
In step S410, model training device 120 may calculate the overall loss as a weighted sum of the perceptual loss, the intensity loss and the gradient difference loss according to equation (11):

L = λ_int · L_int + λ_gd · L_gd + λ_pl · L_pl    (11)

where λ_int, λ_gd and λ_pl are hyperparameters (i.e., parameters with predefined values). In step S412, model training device 120 may train the learning model (e.g., anomaly detection network 105) by minimizing the overall loss.
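The overall objective of equation (11) is then a plain weighted sum; the lambda values below default to 1.0 as placeholders, since the disclosure leaves them as predefined hyperparameters.

```python
def overall_loss(l_int, l_gd, l_pl, lambda_int=1.0, lambda_gd=1.0, lambda_pl=1.0):
    """Weighted sum of the intensity, gradient difference and perceptual losses."""
    return lambda_int * l_int + lambda_gd * l_gd + lambda_pl * l_pl

print(round(overall_loss(13.0, 4.0, 2.1), 2))  # → 19.1
```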
The design of the multi-scale architecture of the learning model applied by the disclosed system and method ensures that more attention is paid to semantically meaningful parts, such as a person, a vehicle, etc., compared to background information (i.e., the static/unchanged objects) in each of the image frames of the video to be detected. Moreover, the disclosed multi-scale learning model is more sensitive to objects with different scales of features. For example, spatial features extracted at different resolutions may ensure that the same object at different granularities in the image frame can be detected for determining whether an anomalous event has happened. Finally, because the learning model can be trained solely on normal videos (i.e., videos that are anomaly-free), it is less expensive to collect the training data for training the learning model.
Another aspect of the disclosure is directed to a non-transitory computer-readable medium storing instructions which, when executed, cause one or more processors to perform the methods, as discussed above. The computer-readable medium may include volatile or non-volatile, magnetic, semiconductor, tape, optical, removable, non-removable, or other types of computer-readable medium or computer-readable storage devices. For example, the computer-readable medium may be the storage device or the memory module having the computer instructions stored thereon, as disclosed. In some embodiments, the computer-readable medium may be a disc or a flash drive having the computer instructions stored thereon.
It will be apparent to those skilled in the art that various modifications and variations can be made to the disclosed system and related methods. Other embodiments will be apparent to those skilled in the art from consideration of the specification and practice of the disclosed system and related methods.
It is intended that the specification and examples be considered as exemplary only, with a true scope being indicated by the following claims and their equivalents.
Claims (20)
- A method for performing video anomaly detection using a learning model, comprising: receiving a video comprising a plurality of image frames captured by a camera, the plurality of image frames including a current image frame corresponding to a selected time point and a set of prior image frames corresponding to time points prior to the selected time point; extracting spatial features from each prior image frame at different resolutions; predicting an estimated image frame in each resolution based on the spatial features of the set of prior image frames in that resolution; predicting an estimated current image frame based on the estimated image frames in the different resolutions; and detecting the video anomaly based on a difference between the captured current image frame and the estimated current image frame.
- The method of claim 1, wherein the learning model comprises at least one convolution layer and a plurality of residual blocks, wherein extracting the spatial features from each prior image frame further comprises: applying the convolution layer to the prior image frame to obtain a convoluted result; applying a first residual block to the convoluted result to obtain spatial features in a first resolution; and applying a second residual block to the spatial features in the first resolution to obtain spatial features in a second resolution.
- The method of claim 1, wherein the learning model comprises a plurality of prediction sub-models each corresponding to a resolution, wherein the estimated image frames in the different resolutions are predicted using the plurality of prediction sub-models in parallel.
- The method of claim 3, wherein each sub-model comprises a first block and a second block, wherein predicting the estimated image frame in each resolution further comprises: extracting global temporal features from the spatial features of the set of prior image frames in that resolution using the first block; and extracting local temporal features from the global temporal features in that resolution using the second block, wherein the second block has a receptive field applied to local neighborhoods of the global temporal features.
- The method of claim 4, wherein the second block is a convolutional gated recurrent unit (ConvGRU) .
- The method of claim 4, wherein predicting the estimated current image frame further comprises fusing the global temporal features in the different resolutions.
- The method of claim 1, wherein the learning model is trained by minimizing a loss comprising a perceptual loss indicative of a level of noise between a training image frame predicted by using the learning model and a ground truth training image frame.
- The method of claim 7, wherein the loss further comprises an intensity loss indicative of an intensity difference between the predicted training image frame and the ground truth training image frame, and a gradient difference loss indicative of a difference between a gradient image of the predicted training image frame and a gradient image of the ground truth training image frame.
- The method of claim 8, wherein the loss is a weighted sum of the perceptual loss, the intensity loss and the gradient difference loss.
- The method of claim 7, wherein the perceptual loss is calculated by: applying a pre-trained classification model to the ground truth image frame to obtain a first output from at least one hidden layer of the pre-trained classification model; applying the pre-trained classification model to the predicted image frame to obtain a second output from the at least one hidden layer of the pre-trained classification model; and determining a difference between the first output and the second output.
- The method of claim 10, wherein the at least one hidden layer comprises a plurality of hidden layers and the difference is a weighted distance between the first output and the second output from the plurality of hidden layers.
- The method of claim 1, wherein detecting the video anomaly further comprises: determining an anomaly score indicative of the difference between the captured current image frame and the estimated current image frame; and detecting the video anomaly if the determined anomaly score is higher than a predetermined threshold.
- The method of claim 12, wherein determining the anomaly score further comprises computing a peak signal-to-noise ratio (PSNR) of a mean squared error between the captured current image frame and the estimated current image frame.
- A system for performing video anomaly detection using a learning model, comprising: a communication interface configured to receive a video comprising a plurality of image frames captured by a camera, the plurality of image frames including a current image frame corresponding to a selected time point and a set of prior image frames corresponding to time points prior to the selected time point; and at least one processor coupled to the communication interface and configured to: extract spatial features from each prior image frame at different resolutions; predict an estimated image frame in each resolution based on the spatial features of the set of prior image frames in that resolution; predict an estimated current image frame based on the estimated image frames in the different resolutions; and detect the video anomaly based on a difference between the captured current image frame and the estimated current image frame.
- The system of claim 14, wherein the learning model comprises at least one convolution layer and a plurality of residual blocks, wherein to extract the spatial features from each prior image frame, the at least one processor is further configured to: apply the convolution layer to the prior image frame to obtain a convoluted result; apply a first residual block to the convoluted result to obtain spatial features in a first resolution; and apply a second residual block to the spatial features in the first resolution to obtain spatial features in a second resolution.
- The system of claim 14, wherein the learning model comprises a plurality of prediction sub-models each corresponding to a resolution, wherein the estimated image frames in the different resolutions are predicted using the plurality of prediction sub-models in parallel.
- The system of claim 16, wherein each sub-model comprises a first block and a second block, wherein to predict the estimated image frame in each resolution, the at least one processor is further configured to: extract global temporal features from the spatial features of the set of prior image frames in that resolution using the first block; and extract local temporal features from the global temporal features in that resolution using the second block, wherein the second block has a receptive field applied to local neighborhoods of the global temporal features.
- The system of claim 14, wherein the learning model is trained by minimizing a loss comprising a perceptual loss indicative of a level of noise between a training image frame predicted by using the learning model and a ground truth training image frame.
- A non-transitory computer-readable medium storing instructions that, when executed by one or more processors, cause the one or more processors to perform a method for performing video anomaly detection using a learning model, the method comprising: extracting spatial features from each prior image frame at different resolutions; predicting an estimated image frame in each resolution based on the spatial features of the set of prior image frames in that resolution; predicting an estimated current image frame based on the estimated image frames in the different resolutions; and detecting the video anomaly based on a difference between the captured current image frame and the estimated current image frame.
- The non-transitory computer-readable medium of claim 19, wherein the learning model is trained by minimizing a loss comprising a perceptual loss indicative of a level of noise between a training image frame predicted by using the learning model and a ground truth training image frame.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/CN2020/073932 WO2021147055A1 (en) | 2020-01-22 | 2020-01-22 | Systems and methods for video anomaly detection using multi-scale image frame prediction network |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2021147055A1 true WO2021147055A1 (en) | 2021-07-29 |
Family
ID=76992032
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2020/073932 WO2021147055A1 (en) | 2020-01-22 | 2020-01-22 | Systems and methods for video anomaly detection using multi-scale image frame prediction network |
Country Status (1)
Country | Link |
---|---|
WO (1) | WO2021147055A1 (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20150332434A1 (en) * | 2014-05-15 | 2015-11-19 | The Government Of The United States Of America, As Represented By The Secretary Of The Navy | Demosaicking System and Method for Color array Based Multi-Spectral Sensors |
CN110245603A (en) * | 2019-06-12 | 2019-09-17 | 成都信息工程大学 | A kind of group abnormality behavior real-time detection method |
CN110582748A (en) * | 2017-04-07 | 2019-12-17 | 英特尔公司 | Method and system for boosting deep neural networks for deep learning |
CN110705376A (en) * | 2019-09-11 | 2020-01-17 | 南京邮电大学 | Abnormal behavior detection method based on generative countermeasure network |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113592719A (en) * | 2021-08-14 | 2021-11-02 | 北京达佳互联信息技术有限公司 | Training method of video super-resolution model, video processing method and corresponding equipment |
CN113592719B (en) * | 2021-08-14 | 2023-11-28 | 北京达佳互联信息技术有限公司 | Training method of video super-resolution model, video processing method and corresponding equipment |
CN116152722A (en) * | 2023-04-19 | 2023-05-23 | 南京邮电大学 | Video anomaly detection method based on combination of residual attention block and self-selection learning |
CN116450880A (en) * | 2023-05-11 | 2023-07-18 | 湖南承希科技有限公司 | Intelligent processing method for vehicle-mounted video of semantic detection |
CN116450880B (en) * | 2023-05-11 | 2023-09-01 | 湖南承希科技有限公司 | Intelligent processing method for vehicle-mounted video of semantic detection |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 20915037 Country of ref document: EP Kind code of ref document: A1 |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 20915037 Country of ref document: EP Kind code of ref document: A1 |