WO2021147055A1 - Systems and methods for video anomaly detection using multi-scale image frame prediction network - Google Patents

Systems and methods for video anomaly detection using multi-scale image frame prediction network

Info

Publication number
WO2021147055A1
WO2021147055A1 (PCT/CN2020/073932)
Authority
WO
WIPO (PCT)
Prior art keywords
image frame
resolution
estimated
image frames
prior
Prior art date
Application number
PCT/CN2020/073932
Other languages
French (fr)
Inventor
Zhengping Che
Xuanzhao WANG
Ke Yang
Bo Jiang
Jian Tang
Original Assignee
Beijing Didi Infinity Technology And Development Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Didi Infinity Technology And Development Co., Ltd.
Priority to PCT/CN2020/073932 priority Critical patent/WO2021147055A1/en
Publication of WO2021147055A1 publication Critical patent/WO2021147055A1/en

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00: Scenes; Scene-specific elements
    • G06V20/40: Scenes; Scene-specific elements in video content
    • G06V20/41: Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/24: Classification techniques
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82: Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Definitions

  • the present disclosure relates to systems and methods for video anomaly detection, and more particularly to, systems and methods for video anomaly detection using a multi-scale image frame prediction network.
  • Video anomaly detection plays an essential role in computer vision and is used in many applications such as warning system, scene understanding, activity recognition, road traffic analysis, etc.
  • the frame-level video anomaly detection aims at identifying the frames in which there exist events or behaviors different from expectations or regulations. Detecting anomalous events in videos is very challenging for mainly two reasons. First, the definition of an anomalous event is usually ambiguous, and its pattern varies a lot because it highly depends on the contexts of the event and the scenario where the event happens. For example, a vehicle driving down the road within the speed limits is normal but a vehicle dashing towards a crowd of people is anomalous.
  • a cloud of smoke rising from inside of a building often indicates an anomaly, but a cloud of smoke coming from a chimney is normal. Therefore, a robust video anomaly detection method has to be able to address the ambiguous definition of the anomalous event.
  • the data between normal and abnormal samples is usually imbalanced in practice. Because anomalous events are rare and unpredictable in real-world scenarios, it is very difficult and costly to collect and label abnormal videos. Therefore, video anomaly detection methods have to learn from normal data only (i.e., using the video of normal events to distinguish unseen and unbounded abnormal events from normal ones) .
  • Classification-based method directly classifies each video frame to be normal or abnormal.
  • some methods require additional anomaly data and labels in the training phase via a weakly supervised approach.
  • this approach cannot be generalized to unbounded anomalies.
  • Other works focus on specific types of anomaly events. For example, a multi-stage classification method first detects and crops out the object of interest and then builds one-versus-rest classifiers on the extracted features of that object. Since its anomaly detection result is substantially influenced by the performance of its object detection, it will fail in recognizing anomalies with unseen objects or no objects to attribute. Therefore, this type of method is very limited when dealing with complicated and uncertain real-world scenarios.
  • Reconstruction-based method is a more general way for video anomaly detection, which learns to reconstruct the input video frame with minimum reconstruction errors for normal frames and is supposed to have large errors for abnormal frames.
  • Auto-encoder based methods and generative adversarial networks are commonly used as the reconstruction models.
  • an abnormal event may also have small reconstruction errors, so there is no guarantee that reconstruction-based methods can detect abnormal events well.
  • Prediction-based method has been developed to remedy the issues of the classification-based method and the reconstruction-based method. It takes consecutive video frames to predict the next frame and determines whether the next frame is abnormal by the prediction error.
  • the existing prediction-based methods are still suboptimal. For example, their U-Net architecture cannot fully learn temporal information and using adversarial learning and additional optical flow loss makes the training inefficient.
  • Embodiments of the disclosure address the above problems by providing prediction-based video anomaly detection methods and systems using a multi-scale frame prediction network.
  • Embodiments of the disclosure provide a method for video anomaly detection using a learning model.
  • An exemplary method may include receiving a video including a plurality of image frames captured by a camera, the plurality of image frames including a current image frame corresponding to a selected time point and a set of prior image frames corresponding to time points prior to the selected time point.
  • the method may also include extracting spatial features from each prior image frame at different resolutions and predicting an estimated image frame in each resolution based on the spatial features of the set of prior image frames in that resolution.
  • the method may further include predicting an estimated current image frame based on the estimated image frames in the different resolutions and detecting the video anomaly based on a difference between the captured current image frame and the estimated current image frame.
  • Embodiments of the disclosure also provide a system for performing video anomaly detection using a learning model.
  • An exemplary system may include a communication interface configured to receive a video including a plurality of image frames captured by a camera, the plurality of image frames including a current image frame corresponding to a selected time point and a set of prior image frames corresponding to time points prior to the selected time point.
  • the system may also include at least one processor coupled to the communication interface.
  • the at least one processor may be configured to extract spatial features from each prior image frame at different resolutions and predict an estimated image frame in each resolution based on the spatial features of the set of prior image frames in that resolution.
  • the at least one processor may be further configured to predict an estimated current image frame based on the estimated image frames in the different resolutions and detect the video anomaly based on a difference between the captured current image frame and the estimated current image frame.
  • Embodiments of the disclosure further provide a non-transitory computer-readable medium storing instructions that, when executed by one or more processors, cause the one or more processors to perform a method for video anomaly detection using a learning model.
  • the method may include receiving a video including a plurality of image frames captured by a camera, the plurality of image frames including a current image frame corresponding to a selected time point and a set of prior image frames corresponding to time points prior to the selected time point.
  • the method may also include extracting spatial features from each prior image frame at different resolutions and predicting an estimated image frame in each resolution based on the spatial features of the set of prior image frames in that resolution.
  • the method may further include predicting an estimated current image frame based on the estimated image frames in the different resolutions and detecting the video anomaly based on a difference between the captured current image frame and the estimated current image frame.
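  • As a high-level orientation, the sketch below (Python) summarizes the flow described in the preceding summary. The function name and signature are illustrative assumptions only; the encoder, predictor, decoder, and scoring components are passed in as callables and are sketched in more detail later in this document.

```python
# A high-level sketch (Python) of the detection flow summarized above. The
# names and signature are illustrative assumptions, not the patented method.
from typing import Callable, List, Sequence


def detect_video_anomaly(
    prior_frames: Sequence,                    # captured frames at t-P, ..., t-1
    current_frame,                             # captured frame at the selected time point t
    encoder: Callable[[object], List],         # frame -> spatial features at different resolutions
    predictor: Callable[[List[List]], List],   # per-resolution feature sequences -> per-resolution estimates
    decoder: Callable[[List], object],         # per-resolution estimates -> estimated current frame
    score: Callable[[object, object], float],  # (captured frame, estimated frame) -> anomaly score
    threshold: float,
) -> bool:
    # 1. Extract spatial features from each prior frame at different resolutions.
    per_frame_features = [encoder(frame) for frame in prior_frames]
    # 2. Regroup so each resolution has its own sequence of features over time.
    per_resolution = [list(sequence) for sequence in zip(*per_frame_features)]
    # 3. Predict an estimated image frame in each resolution.
    estimates = predictor(per_resolution)
    # 4. Predict the estimated current frame from the per-resolution estimates.
    estimated_current = decoder(estimates)
    # 5. Detect the anomaly from the difference between captured and estimated frames.
    return score(current_frame, estimated_current) > threshold
```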
  • FIG. 1 illustrates a schematic diagram of an exemplary anomaly detection system, according to embodiments of the disclosure.
  • FIG. 2 illustrates a block diagram of an exemplary anomaly detection device, according to embodiments of the disclosure.
  • FIG. 3 illustrates a flowchart of an exemplary method for anomaly detection, according to embodiments of the disclosure.
  • FIG. 4 illustrates a flowchart of an exemplary method for training an anomaly detection network, according to embodiments of the disclosure.
  • FIG. 5 illustrates a schematic framework of an exemplary anomaly detection network along with its training network, according to embodiments of the disclosure.
  • FIG. 1 illustrates a schematic diagram of an exemplary video anomaly detection system (referred to as “anomaly detection system 100” ) , according to embodiments of the disclosure.
  • anomaly detection system 100 is configured to detect anomalous events recorded in a video (i.e., determine if there is a video anomaly) captured by a camera 160.
  • the anomaly detection may be based on a multi-scale image frame prediction network (referred to as an “anomaly detection network 105” hereafter) trained using sample videos (e.g., training data 101) .
  • anomaly detection system 100 may include the components shown in FIG. 1, including an anomaly detection device 110, a model training device 120, a display device 130, a training database 140, a database/repository 150, a camera 160, and a network 170 for facilitating communications among the various components. It is to be contemplated that anomaly detection system 100 may include more or fewer components than those shown in FIG. 1.
  • anomaly detection system 100 may perform two stages: an anomaly detection model training stage and an anomaly detection stage applying the trained model.
  • To perform the training stage (e.g., training a learning model such as anomaly detection network 105) , anomaly detection system 100 may include model training device 120 and training database 140.
  • To perform the video anomaly detection process and obtain an anomaly detection result 107 (e.g., whether the video includes anomalous event (s) ) , anomaly detection system 100 may include anomaly detection device 110 and database/repository 150.
  • anomaly detection system 100 may also include display device 130 to display anomaly detection result 107.
  • anomaly detection system 100 may only include components for performing the anomaly detection related functions, namely anomaly detection device 110, database/repository 150 and optionally display device 130.
  • anomaly detection system 100 may optionally include network 170 to facilitate the communication among the various components of anomaly detection system 100, such as databases 140 and 150, devices 110 and 120, and camera 160.
  • network 170 may be a local area network (LAN) , a wireless network, a cloud computing environment (e.g., software as a service, platform as a service, infrastructure as a service) , a client-server, a wide area network (WAN) , etc.
  • network 170 may be replaced by wired data communication systems or devices.
  • the various components of anomaly detection system 100 may be remote from each other or in different locations and be connected through network 170 as shown in FIG. 1.
  • certain components of anomaly detection system 100 may be located on the same site or inside one device.
  • training database 140 may be located on-site with or be part of model training device 120.
  • model training device 120 and anomaly detection device 110 may be inside the same computer or processing device.
  • anomaly detection system 100 may store videos that include multiple image frames.
  • Image frames of a video may each correspond to a time point.
  • a “current image frame” refers to the image frame corresponding to a selected time point.
  • prior image frames refer to the set of image frames corresponding to time points prior to the selected time point.
  • “future image frames” refer to the set of image frames corresponding to time points subsequent to the selected time point.
  • sample image frames where anomalies are known (e.g., training data 101) may be stored in training database 140, and image frames for detection (e.g., image frames 102) may be stored in database/repository 150.
  • Image frames may be generated based on video data (e.g., a video comprising a plurality of image frames) received from video recording devices (e.g., camera 160) .
  • the video data may be visual image streams including the current image frame, the set of prior image frames and the set of future image frames, acquired by camera 160 or a wearable device, a smart phone, a tablet, a computer, a surveillance camera, or the like that includes a video recording device for acquiring the video data.
  • camera 160 may be any suitable video recording device that can acquire image frames.
  • Image frames may be generated based on the acquired video data.
  • the video data may also come from a post-processing device implementing image enhancement technology (e.g., an application on a user device for adding night visions to the surveillance devices) .
  • training database 140 may store training data 101, which includes sample image frames.
  • training data 101 may further include the known anomalies (if any) in the sample image frames.
  • the sample image frames may include an image frame corresponding to a time point when an anomalous event happens, and image frames prior and subsequent to that time point.
  • training data 101 may include image frames that are known to be anomaly-free. Sample image frames may be stored in training database 140 as training data 101.
  • anomaly detection network 105 may be a multi-scale image frame prediction network (referred to as “prediction network” hereafter) for predicting an estimated current image frame (i.e., a prediction of an image frame at a selected time point) based on a set of prior image frames corresponding to time points prior to the selected time point.
  • Anomaly detection network 105 may also include an anomaly score calculation model (referred to as “assessment model” hereafter) for determining an anomaly score, indicative of a difference between the ground truth current image frame (i.e., captured image frame at the selected time point) and the estimated current image frame.
  • anomaly detection network 105 may decide whether the current image frame records/includes anomalous event (s) (i.e., whether an anomaly occurs at or around the selected time point) based on the anomaly score.
  • the prediction network may include an encoder for extracting spatial features from each prior image frame (i.e., the set of prior image frames corresponding to time points prior to the selected time point) at different resolutions, a predictor for predicting an estimated image frame in each resolution based on the spatial features of the set of prior image frames in that resolution and a decoder for predicting the estimated current image frame based on the estimated image frames in the different resolutions.
  • the encoder may include at least one convolution layer and a plurality of residual blocks for obtaining spatial features of the prior image frames in different resolutions.
  • Each residual block may be configured to extract spatial features in a specific resolution.
  • the residual blocks may be connected in a sequential manner, each producing spatial features in a different resolution. The more residual blocks are applied to an image frame, the lower the resolution of the extracted spatial features.
  • each residual block may include a 2-D convolution, and when extracting the spatial features from each prior image frame, a first residual block for a first resolution may be applied to the image frame for generating spatial features in a first resolution, and a second residual block may be applied to the spatial features in the first resolution to obtain spatial features in a second resolution.
  • the second resolution is lower than the first resolution.
  • the number of residual blocks of the plurality of residual blocks is not limited to two.
  • the plurality of residual blocks may include three or more residual blocks applied in a sequential manner, for generating spatial features in three or more different resolutions.
  • the multi-scale architecture may allow the encoder to extract spatial features of the image frame at different scales.
  • the predictor may include a plurality of sub-models in parallel, each corresponding to a resolution.
  • Each of the sub-models may include a first block for extracting global temporal features from the spatial features of the set of prior image frames in that resolution, and a second block for extracting local temporal features from the global temporal features in that resolution.
  • the first block may be a non-local block.
  • the second block may be a convolutional gated recurrent unit (ConvGRU) and may extract local temporal features with its receptive field in that resolution.
  • the decoder may include a plurality of residual blocks for fusing the local and global spatial-temporal features in the different resolutions generated by the predictor to generate an estimated current image frame of the selected time point.
  • the estimated current image frame may have the same shape as the input prior image frames.
  • the decoder may fuse features at different scales and construct the output (e.g., the estimated current image frame) by upsampling and concatenating the channels of the features.
  • the lower-resolution features may be upsampled by nearest neighbor interpolation (i.e., each upsampled value is copied from its nearest neighbor).
  • Checkerboard artifacts (i.e., checkerboard patterns in the gradient) in the output may be eliminated using the residual blocks.
  • the assessment model may calculate the anomaly score based on a peak signal-to-noise ratio (PSNR) of a mean squared error between the ground truth/captured current image frame and the estimated current image frame. For example, the assessment model may first calculate the mean squared error between the ground truth current image frame I t and the estimated current image frame and calculate the PSNR based on the mean squared error. The assessment model may then calculate an anomaly score based on normalizing the PSNR values of all the prior image frames.
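  • As an illustration of the assessment computation described above, the sketch below (Python/NumPy) derives per-frame PSNR from the mean squared error and converts it into an anomaly score. The min-max normalization over the frames and the 1 - (...) inversion are assumptions consistent with a higher score indicating a higher probability of an anomalous event.

```python
# A minimal sketch (Python/NumPy) of the assessment computation described
# above: per-frame MSE, PSNR, and an anomaly score from normalized PSNR.
# The normalization scheme and threshold value are illustrative assumptions.
import numpy as np


def anomaly_scores(captured: np.ndarray, estimated: np.ndarray, max_val: float = 255.0) -> np.ndarray:
    """captured, estimated: arrays of shape (T, H, W, 3) holding RGB frames."""
    diff = captured.astype(np.float64) - estimated.astype(np.float64)
    mse = (diff ** 2).mean(axis=(1, 2, 3))                        # per-frame mean squared error
    psnr = 10.0 * np.log10(max_val ** 2 / np.maximum(mse, 1e-12))
    norm = (psnr - psnr.min()) / (psnr.max() - psnr.min() + 1e-12)
    return 1.0 - norm                                             # higher score = more anomalous


def is_anomalous(captured: np.ndarray, estimated: np.ndarray, threshold: float = 0.8) -> np.ndarray:
    """Flags the frames whose anomaly score exceeds a (tunable) threshold."""
    return anomaly_scores(captured, estimated) > threshold
```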
  • anomaly detection network 105 may be trained by minimizing the differences between an image frame predicted by using anomaly detection network 105 (hereafter “predicted training image frame” ) and a ground truth image frame corresponding to the same time point as the predicted image frame provided as part of training data 101 (hereafter “ground truth training image frame” ) .
  • model training device 120 may minimize a loss including a weighted sum of a perceptual loss indicative of a level of noise between the predicted training image frame and the ground truth training image frame, an intensity loss indicative of an intensity difference between the predicted training image frame and the ground truth training image frame, and a gradient difference loss indicative of a difference between a gradient image of the predicted training image frame and a gradient image of the ground truth training image frame.
  • model training device 120 may communicate with training database 140 to receive one or more sets of training data 101.
  • Each set of training data 101 may include sample image frames (i.e., a plurality of image frames from a sample video) including a current image frame corresponding to a selected time point and a set of prior image frames corresponding to time points prior to the selected time point.
  • Model training device 120 may use training data 101 received from training database 140 to train the learning model, e.g., anomaly detection network 105.
  • Model training device 120 may be implemented with hardware specially programmed by software that performs the training process.
  • model training device 120 may include a processor and a non-transitory computer-readable medium. The processor may conduct the training by performing instructions of a training process stored in the computer-readable medium.
  • Model training device 120 may additionally include input and output interfaces to communicate with training database 140, network 170, and/or a user interface (not shown) .
  • the user interface may be used for selecting sets of training data, adjusting one or more parameters of the training process, selecting or modifying a framework of the learning model, and/or manually adjusting the selected time point of training data 101.
  • Anomaly detection device 110 may receive trained anomaly detection network 105 from model training device 120.
  • Anomaly detection device 110 may include a processor and a non-transitory computer-readable medium (not shown) .
  • the processor may perform instructions of an anomaly detection process stored in the medium.
  • Anomaly detection device 110 may additionally include input and output interfaces to communicate with database/repository 150, camera 160, network 170 and/or a user interface of display device 130.
  • the input interface may be used for selecting a video that includes the plurality of image frames or initiating the detection process.
  • the output interface may be used for providing an anomaly detection result 107 associated with the video.
  • Display device 130 may include a display such as a Liquid Crystal Display (LCD) , a Light Emitting Diode Display (LED) , a plasma display, or any other type of display, and provide a Graphical User Interface (GUI) presented on the display for user input and data depiction.
  • the display may include a number of different types of materials, such as plastic or glass, and may be touch-sensitive to receive inputs from the user.
  • the display may include a touch-sensitive material that is substantially rigid, such as Gorilla Glass TM , or substantially pliable, such as Willow Glass TM .
  • display 130 may be a standalone device, or may be an integrated part of anomaly detection device 110.
  • FIG. 2 illustrates a block diagram of an exemplary anomaly detection device 110, according to embodiments of the disclosure.
  • anomaly detection device 110 may include a communication interface 202, a processor 204, a memory 206, and a storage 208.
  • anomaly detection device 110 may have different modules in a single device, such as an integrated circuit (IC) chip (e.g., implemented as an application-specific integrated circuit (ASIC) or a field-programmable gate array (FPGA) ) , or separate devices with dedicated functions.
  • one or more components of anomaly detection device 110 may be located in a cloud or may be alternatively in a single location (such as inside a mobile device) or distributed locations.
  • anomaly detection device 110 may be in an integrated device or distributed at different locations but communicate with each other through a network (not shown) . Consistent with the present disclosure, anomaly detection device 110 may be configured to detect anomalous event (s) recorded in image frames 102 received from database/repository 150, using anomaly detection network 105 trained in model training device 120.
  • Communication interface 202 may send data to and receive data from components such as database/repository 150, camera 160, model training device 120 and display device 130 via communication cables, a Wireless Local Area Network (WLAN) , a Wide Area Network (WAN) , wireless networks such as radio waves, a cellular network, and/or a local or short-range wireless network (e.g., Bluetooth TM ) , or other communication methods.
  • communication interface 202 may include an integrated service digital network (ISDN) card, cable modem, satellite modem, or a modem to provide a data communication connection.
  • communication interface 202 may include a local area network (LAN) card to provide a data communication connection to a compatible LAN.
  • Wireless links can also be implemented by communication interface 202.
  • communication interface 202 can send and receive electrical, electromagnetic or optical signals that carry digital data streams representing various types of information.
  • communication interface 202 may receive anomaly detection network 105 from model training device 120 and image frames 102 from database/repository 150. Communication interface 202 may further provide image frames 102 and anomaly detection network 105 to memory 206 and/or storage 208 for storage or to processor 204 for processing.
  • Processor 204 may include any appropriate type of general-purpose or special-purpose microprocessor, digital signal processor, or microcontroller. Processor 204 may be configured as a separate processor module dedicated to detecting anomalous event (s) in a video including a plurality of image frames captured by a camera using a learning model. Alternatively, processor 204 may be configured as a shared processor module for performing other functions in addition to anomaly detection.
  • Memory 206 and storage 208 may include any appropriate type of mass storage provided to store any type of information that processor 204 may need to operate.
  • Memory 206 and storage 208 may be a volatile or non-volatile, magnetic, semiconductor, tape, optical, removable, non-removable, or other type of storage device or tangible (i.e., non-transitory) computer-readable medium including, but not limited to, a ROM, a flash memory, a dynamic RAM, and a static RAM.
  • Memory 206 and/or storage 208 may be configured to store one or more computer programs that may be executed by processor 204 to perform functions disclosed herein.
  • memory 206 and/or storage 208 may be configured to store program (s) that may be executed by processor 204 to detect anomalous events in image frames 102 based on anomaly detection network 105.
  • memory 206 and/or storage 208 may also store intermediate data such as spatial features at different resolutions, estimated image frames in different resolutions, estimated current image frame, PSNR of a mean squared error, anomaly score, etc.
  • Memory 206 and/or storage 208 may additionally store various learning models including their model parameters, such as learnable parameters of the encoder, the predictor, and the decoder, etc.
  • processor 204 may include multiple modules, such as an encoder unit 240, a predictor unit 242, a decoder unit 244, an assessment unit 246, and the like. These modules (and any corresponding sub-modules or sub-units) can be hardware units (e.g., portions of an integrated circuit) of processor 204 designed for use with other components or software units implemented by processor 204 through executing at least part of a program.
  • the program may be stored on a computer-readable medium, and when executed by processor 204, it may perform one or more functions.
  • FIG. 2 shows units 240-246 all within one processor 204, it is contemplated that these units may be distributed among different processors located closely or remotely with each other.
  • FIG. 3 illustrates a flowchart of an exemplary method 300 for anomaly detection based on anomaly detection network 105 (an example shown in FIG. 5) , according to embodiments of the disclosure.
  • Method 300 may be implemented by anomaly detection device 110 and particularly processor 204 or a separate processor not shown in FIG. 2.
  • Method 300 may include steps S302-S314 as described below. It is to be appreciated that some of the steps may be optional, and some of the steps may be performed simultaneously, or in a different order than shown in FIG. 3. FIG. 3 and FIG. 5 will be described together.
  • communication interface 202 may receive image frames 102 from database/repository 150.
  • image frames 102 may be part of a video that includes at least I t-P , ..., I t-1 , I t image frames recorded using camera 160.
  • image frames 102 may include a current image frame I t corresponding to a selected time point t and a set of P prior image frames I t-P , ..., I t-1 corresponding to time points t-P, ..., t-1, prior to the selected time point t.
  • encoder unit 240 may apply at least one convolution layer 510 to an input image frame (i.e., one of the prior image frames of image frames 102) to generate a convoluted result of the input image frame.
  • Encoder unit 240 may extract spatial features from the convoluted result of the input image frame at different resolutions.
  • encoder unit 240 may use several basic residual blocks 512 for extracting multi-scale spatial features (i.e., spatial features in different resolutions) .
  • One or more basic residual blocks 512 may be applied to the convoluted result in sequence. Specifically, as shown in FIG. 5, L may be set as 3 (i.e., three different resolutions in total) .
  • Basic residual blocks 512 may accordingly include three residual blocks applied in sequence for generating spatial features at different resolutions in a descending manner (e.g., from image resolution of 128 ⁇ 128 ⁇ 128 to image resolution of 64 ⁇ 64 ⁇ 256 to image resolution of 32 ⁇ 32 ⁇ 512) .
  • the first residual block may be applied to the convoluted result from the convolution layer and generate a first set of spatial features at a first resolution (e.g., 128 ⁇ 128 ⁇ 128) .
  • the second residual block may be applied to the first set of spatial features and generate a second set of spatial features at a second resolution (e.g., 64 ⁇ 64 ⁇ 256) , lower than the first resolution.
  • each of basic residual blocks 512 may include a 2-D convolution for convoluting the spatial features.
  • encoder unit 240 may calculate spatial features in L resolutions for each image frame from I 1 to I t-1 using basic residual blocks 512 according to equation (1) :
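  • Equation (1) is not reproduced in the text above. As a rough illustration only, the sketch below (Python/PyTorch) implements a multi-scale encoder of the kind described: an initial convolution followed by residual blocks applied in sequence, each halving the spatial resolution. The 256 × 256 input size, the stride-2 downsampling, and the exact layer layout are assumptions, not the patented implementation.

```python
# A minimal sketch (PyTorch) of the multi-scale encoder described above.
# Assumptions (not from the patent text): 256x256 RGB input, stride-2
# downsampling inside each residual block, and the channel sizes
# 128 -> 256 -> 512 quoted for the three resolutions.
import torch
import torch.nn as nn


class BasicResidualBlock(nn.Module):
    """Residual block with a 2-D convolution that halves spatial resolution."""

    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2, padding=1)
        self.conv2 = nn.Conv2d(out_ch, out_ch, kernel_size=3, stride=1, padding=1)
        self.relu = nn.ReLU(inplace=True)
        # 1x1 shortcut so the skip connection matches the main path's shape.
        self.shortcut = nn.Conv2d(in_ch, out_ch, kernel_size=1, stride=2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.conv2(self.relu(self.conv1(x)))
        return self.relu(out + self.shortcut(x))


class MultiScaleEncoder(nn.Module):
    """Initial convolution followed by L = 3 residual blocks in sequence.

    Each residual block yields spatial features at a lower resolution, e.g.
    128x128x128 -> 64x64x256 -> 32x32x512 for a 256x256 input frame.
    """

    def __init__(self):
        super().__init__()
        self.stem = nn.Conv2d(3, 64, kernel_size=3, stride=1, padding=1)
        self.blocks = nn.ModuleList([
            BasicResidualBlock(64, 128),
            BasicResidualBlock(128, 256),
            BasicResidualBlock(256, 512),
        ])

    def forward(self, frame: torch.Tensor) -> list:
        feats, x = [], self.stem(frame)
        for block in self.blocks:          # one feature map per resolution
            x = block(x)
            feats.append(x)
        return feats


if __name__ == "__main__":
    frames = torch.randn(4, 3, 256, 256)   # a small batch of prior image frames
    for f in MultiScaleEncoder()(frames):
        print(f.shape)                      # (N, C, H, W) at each of the 3 scales
```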
  • predictor unit 242 may predict an estimated image frame in each resolution based on the spatial features of the set of P prior image frames I t-P , ..., I t-1 in that resolution using a plurality of prediction sub-models 520.
  • the number of prediction sub-models 520 corresponds to the number of different resolutions.
  • prediction sub-models 520 may include three prediction sub-models.
  • the three prediction sub-models are applied in parallel, each of which corresponds to spatial features in one resolution for predicting an estimated image frame in that resolution.
  • each sub-model may include a first block (e.g., a non-local block) for extracting global temporal features from the spatial features of the set of prior image frames in the corresponding resolution and a second block for extracting local temporal features from the global temporal features in that resolution.
  • the second block may be a convolutional gated recurrent unit (ConvGRU) .
  • the first block (e.g., the non-local block) may be applied to extract global temporal features from the spatial features of the set of prior image frames in that resolution.
  • the ConvGRU may be applied to extract local temporal features (i.e., temporal patterns) using its receptive field focusing on the local neighborhood of the global temporal features extracted by the first block.
  • the estimated image frame in each resolution may be predicted according to equation (2) :
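  • Equation (2) is likewise not reproduced here. The sketch below (Python/PyTorch) shows one plausible prediction sub-model built from the two blocks described above: a non-local block that attends over all prior frames and spatial positions (global temporal features), followed by a ConvGRU whose convolutional gates have a limited receptive field (local temporal features). Layer sizes, the gate layout, and the use of the final hidden state as the per-resolution estimate are assumptions.

```python
# A minimal sketch (PyTorch) of one predictor sub-model: non-local block
# followed by a ConvGRU. All hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class NonLocalBlock(nn.Module):
    """Embedded-Gaussian non-local attention over all frames and positions."""

    def __init__(self, ch: int):
        super().__init__()
        self.theta = nn.Conv2d(ch, ch // 2, 1)
        self.phi = nn.Conv2d(ch, ch // 2, 1)
        self.g = nn.Conv2d(ch, ch // 2, 1)
        self.out = nn.Conv2d(ch // 2, ch, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (T, C, H, W) spatial features of the prior frames at one scale.
        T, C, H, W = x.shape
        q = self.theta(x).reshape(T, C // 2, H * W).permute(0, 2, 1).reshape(T * H * W, C // 2)
        k = self.phi(x).reshape(T, C // 2, H * W).permute(0, 2, 1).reshape(T * H * W, C // 2)
        v = self.g(x).reshape(T, C // 2, H * W).permute(0, 2, 1).reshape(T * H * W, C // 2)
        attn = F.softmax(q @ k.t(), dim=-1)            # global (space-time) affinity
        y = (attn @ v).reshape(T, H * W, C // 2).permute(0, 2, 1).reshape(T, C // 2, H, W)
        return x + self.out(y)                          # residual connection


class ConvGRUCell(nn.Module):
    """Convolutional GRU: gates are convolutions, so the update is local."""

    def __init__(self, ch: int, k: int = 3):
        super().__init__()
        self.gates = nn.Conv2d(2 * ch, 2 * ch, k, padding=k // 2)   # update/reset gates
        self.cand = nn.Conv2d(2 * ch, ch, k, padding=k // 2)        # candidate state

    def forward(self, x: torch.Tensor, h: torch.Tensor) -> torch.Tensor:
        z, r = torch.sigmoid(self.gates(torch.cat([x, h], dim=1))).chunk(2, dim=1)
        n = torch.tanh(self.cand(torch.cat([x, r * h], dim=1)))
        return (1.0 - z) * h + z * n


class PredictionSubModel(nn.Module):
    """Predicts a feature-level estimate at one resolution from prior frames."""

    def __init__(self, ch: int):
        super().__init__()
        self.non_local = NonLocalBlock(ch)
        self.conv_gru = ConvGRUCell(ch)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (T, C, H, W) for the T prior frames at this resolution.
        g = self.non_local(feats)                       # global temporal features
        h = torch.zeros_like(g[0:1])
        for t in range(g.shape[0]):                     # local temporal features
            h = self.conv_gru(g[t:t + 1], h)
        return h                                        # estimate at this scale


if __name__ == "__main__":
    prior = torch.randn(4, 128, 32, 32)                 # 4 prior frames at one scale
    print(PredictionSubModel(128)(prior).shape)         # torch.Size([1, 128, 32, 32])
```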
  • decoder unit 244 may predict the estimated current image frame by fusing the features (i.e., both global temporal features and local temporal features) in the different resolutions using sub-models 530.
  • each sub-model of sub-models 530 includes an upsampling unit for upsampling the features, followed by a residual block for eliminating checkerboard artifacts (i.e., checkerboard patterns in the gradient) in the estimated current image frame (i.e., the output image frame by decoder unit 244) .
  • the features may be upsampled by nearest neighbor interpolation (i.e., each upsampled value is copied from its nearest neighbor).
  • the estimated current image frame may be estimated by upsampling and concatenating the channels of the features according to equation (3) :
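  • Equation (3) is also not reproduced here. A minimal sketch (Python/PyTorch) of the decoding step described above, with nearest-neighbor upsampling, channel concatenation, and residual blocks, could look as follows; channel counts and the number of residual blocks are assumptions.

```python
# A minimal sketch (PyTorch) of the decoder: per-scale estimates are upsampled
# with nearest-neighbor interpolation, concatenated along channels, and passed
# through residual blocks before a final convolution produces the estimated
# current frame. Layer sizes are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PlainResidualBlock(nn.Module):
    """Resolution-preserving residual block used to suppress checkerboard artifacts."""

    def __init__(self, ch: int):
        super().__init__()
        self.conv1 = nn.Conv2d(ch, ch, 3, padding=1)
        self.conv2 = nn.Conv2d(ch, ch, 3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.relu(x + self.conv2(self.relu(self.conv1(x))))


class MultiScaleDecoder(nn.Module):
    def __init__(self, chs=(128, 256, 512), out_ch: int = 3):
        super().__init__()
        fused = sum(chs)
        self.blocks = nn.Sequential(PlainResidualBlock(fused), PlainResidualBlock(fused))
        self.to_frame = nn.Conv2d(fused, out_ch, 3, padding=1)

    def forward(self, per_scale_feats: list) -> torch.Tensor:
        # per_scale_feats: predictor outputs, highest resolution first.
        target = per_scale_feats[0].shape[-2:]
        ups = [F.interpolate(f, size=target, mode="nearest") for f in per_scale_feats]
        x = torch.cat(ups, dim=1)               # fuse scales along the channel axis
        x = self.blocks(x)                      # residual blocks clean up artifacts
        x = F.interpolate(x, scale_factor=2, mode="nearest")   # back to input-frame size
        return torch.tanh(self.to_frame(x))     # same shape as the input frames


if __name__ == "__main__":
    feats = [torch.randn(1, 128, 128, 128), torch.randn(1, 256, 64, 64), torch.randn(1, 512, 32, 32)]
    print(MultiScaleDecoder()(feats).shape)     # torch.Size([1, 3, 256, 256])
```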
  • assessment unit 246 may determine an anomaly score indicative of the difference between the captured current image frame I t and the estimated current image frame using the assessment model (not shown in FIG. 5) .
  • the assessment model may calculate the anomaly score based on a peak signal-to-noise ratio (PSNR) of a mean squared error between the captured current image frame and the estimated current image frame. For example, the assessment model may first calculate the mean squared error (MSE) between the captured current image frame I t and the estimated current image frame according to equation (4)
  • H and W are the height and the width of the image frames, respectively, and I t (i, j) denotes the Red, Green, Blue (RGB) values of the (i, j) th pixel in I t (and likewise for the corresponding pixel of the estimated current image frame) .
  • the assessment model may then calculate the PSNR based on the mean squared error according to equation (5) .
  • MAX It is the maximum possible pixel value of I t .
  • the assessment model may further calculate the anomaly score of the image frame I t at the time point t based on normalizing the PSNR values of all the T prior image frames.
  • the anomaly score may be calculated according to equation (6) :
  • a higher anomaly score S t indicates a higher probability that the image frame I t contains anomalous event (s) .
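  • Equations (4) to (6) are not reproduced in the text above. The LaTeX block below shows standard forms consistent with the surrounding description; the exact min-max normalization and the 1 - (...) inversion that makes a higher score more anomalous are assumptions.

```latex
% Plausible forms of equations (4)-(6), reconstructed from the surrounding
% description; \hat{I}_t denotes the estimated current image frame and the
% exact normalization is an assumption.
\begin{align}
  \mathrm{MSE}(I_t, \hat{I}_t) &= \frac{1}{3HW}\sum_{c \in \{R,G,B\}}\sum_{i=1}^{H}\sum_{j=1}^{W}
      \bigl(I_t^{c}(i,j) - \hat{I}_t^{c}(i,j)\bigr)^2 \\
  \mathrm{PSNR}(I_t, \hat{I}_t) &= 10\log_{10}\frac{\mathrm{MAX}_{I_t}^{2}}{\mathrm{MSE}(I_t, \hat{I}_t)} \\
  S_t &= 1 - \frac{\mathrm{PSNR}(I_t, \hat{I}_t) - \min_{t'}\mathrm{PSNR}(I_{t'}, \hat{I}_{t'})}
               {\max_{t'}\mathrm{PSNR}(I_{t'}, \hat{I}_{t'}) - \min_{t'}\mathrm{PSNR}(I_{t'}, \hat{I}_{t'})}
\end{align}
```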
  • assessment unit 246 may determine if the anomaly score is higher than a predetermined threshold.
  • the predetermined threshold may be set, for example, by an operator or a designer of the learning model (e.g., anomaly detection network 105) .
  • anomaly detection network 105 may be trained by model training device 120 by minimizing a loss including a weighted sum of a perceptual loss indicative of a level of noise between the predicted training image frame and the ground truth training image frame, an intensity loss indicative of an intensity difference between the predicted training image frame and the ground truth training image frame, and a gradient difference loss indicative of a difference between a gradient image of the predicted training image frame and a gradient image of the ground truth training image frame.
  • FIG. 4 illustrates a flowchart of an exemplary method 400 for training anomaly detection network 105, according to embodiments of the disclosure.
  • Method 400 may include steps S402-S412 as described below. It is to be appreciated that some of the steps may be optional, and some of the steps may be performed simultaneously, or in a different order than shown in FIG. 4.
  • model training device 120 may receive training data 101 from training database 140.
  • training data 101 may include a video that includes at least I t-P , ..., I t-1 , I t training image frames recorded using camera 160.
  • training data 101 may include a training image frame I t corresponding to a selected time point t and a set of P prior training image frames I t-P , ..., I t-1 corresponding to time points t-P, ..., t-1, prior to the selected time point t.
  • model training device 120 may calculate the perceptual loss indicative of a level of noise between the training image frame predicted by using the learning model (e.g., anomaly detection network 105) and the ground truth training image frame I t .
  • model training device 120 may apply separate pre-trained deep convolution networks such as VGG16 networks 540 (shown in FIG. 5) to the predicted training image frame and the ground truth training image frame I t simultaneously.
  • the pre-trained deep convolution networks are trained on ImageNet for image classification.
  • Each VGG16 network may include multiple sub-convolution layers 542 with a rectified linear unit (ReLU) .
  • the multiple sub-convolution layers 542 may include 13 sub-convolution layers.
  • Model training device 120 may calculate the perceptual loss based on a weighted l1 distance of the 2nd, 4th, 7th, 10th and 13th sub-convolution layers of the VGG16 network according to equation (7) :
  • the weights λ 2 , λ 4 , λ 7 , λ 10 , and λ 13 may be set as (0.1, 1, 10, 10, 10) .
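  • Equation (7) is not reproduced here either; a plausible form consistent with the description (a weighted l1 distance between the VGG16 feature maps of the predicted and ground truth training image frames at the listed sub-convolution layers) is shown below. The per-layer normalization by the number of feature elements is an assumption.

```latex
% A plausible form of equation (7): \phi_k is the feature map produced by the
% k-th sub-convolution layer (with ReLU) of the pre-trained VGG16 network,
% N_k its number of elements, and \hat{I}_t the predicted training image frame.
\begin{equation}
  L_{per}(\hat{I}_t, I_t) = \sum_{k \in \{2,4,7,10,13\}} \frac{\lambda_k}{N_k}
      \bigl\lVert \phi_k(\hat{I}_t) - \phi_k(I_t) \bigr\rVert_1 ,
  \qquad (\lambda_2, \lambda_4, \lambda_7, \lambda_{10}, \lambda_{13}) = (0.1, 1, 10, 10, 10)
\end{equation}
```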
  • model training device 120 may calculate the intensity loss L int indicative of the intensity l 2 distance between the predicted training image frame and the ground truth training image frame I t according to equation (8) :
  • I t (i, j) denotes the RGB values of the (i, j) th pixel in I t , and likewise for the corresponding pixel of the predicted training image frame.
  • model training device 120 may calculate the gradient difference loss L gd indicative of the gradient difference between the gradient image of the predicted training image frame and the gradient image of the ground truth training image frame I t .
  • model training device 120 may measure the l 1 distance in both vertical and horizontal directions (i.e., the vertical and horizontal gradient differences) according to equation (9) :
  • model training device 120 may calculate the overall loss as a weighted sum of the perceptual loss, the intensity loss and the gradient difference loss according to equation (11) :
  • model training device 120 may train the learning model (e.g., anomaly detection network 105) by minimizing the overall loss.
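  • Equations (8) , (9) and (11) are not reproduced above. The LaTeX block below gives plausible forms consistent with the description: an l2 intensity loss, a gradient difference loss measured along both directions, and a weighted sum as the overall loss. The weighting coefficients w are unspecified hyperparameters and the exact formulations are assumptions.

```latex
% Plausible forms of equations (8), (9), and (11), reconstructed from the
% description; the weights w_per, w_int, w_gd are unspecified hyperparameters.
\begin{align}
  L_{int}(\hat{I}_t, I_t) &= \frac{1}{HW}\sum_{i,j}\bigl\lVert \hat{I}_t(i,j) - I_t(i,j) \bigr\rVert_2^2 \\
  L_{gd}(\hat{I}_t, I_t)  &= \sum_{i,j}\Bigl(
      \bigl| \, |\hat{I}_t(i,j) - \hat{I}_t(i-1,j)| - |I_t(i,j) - I_t(i-1,j)| \, \bigr| \nonumber \\
      &\qquad\quad + \bigl| \, |\hat{I}_t(i,j) - \hat{I}_t(i,j-1)| - |I_t(i,j) - I_t(i,j-1)| \, \bigr| \Bigr) \\
  L &= w_{per} L_{per} + w_{int} L_{int} + w_{gd} L_{gd}
\end{align}
```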
  • the design of the multi-scale architecture of the learning model applied by the disclosed system and method ensures that more attention is paid to semantically meaningful parts, such as a person or a vehicle, than to background information (i.e., the static/unchanged objects) in each of the image frames of the video to be analyzed.
  • the disclosed multi-scale learning model is more sensitive to objects with different scales of features. For example, spatial features extracted at different resolutions help ensure that the same object, appearing at different granularities in the image frame, can be detected for determining whether an anomalous event has happened.
  • because the learning model can be trained based solely on normal videos (i.e., videos that are anomaly-free) , it is less expensive to capture the training data for training the learning model.
  • the computer-readable medium may include volatile or non-volatile, magnetic, semiconductor, tape, optical, removable, non-removable, or other types of computer-readable medium or computer-readable storage devices.
  • the computer-readable medium may be the storage device or the memory module having the computer instructions stored thereon, as disclosed.
  • the computer-readable medium may be a disc or a flash drive having the computer instructions stored thereon.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Medical Informatics (AREA)
  • Computational Linguistics (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

Systems, methods, and computer-readable media for video anomaly detection using a learning model. The method may include receiving a video including a plurality of image frames captured by a camera, the plurality of image frames including a current image frame corresponding to a selected time point and a set of prior image frames corresponding to time points prior to the selected time point. The method may also include extracting spatial features from each prior image frame at different resolutions and predicting an estimated image frame in each resolution based on the spatial features of the set of prior image frames in that resolution. The method may further include predicting an estimated current image frame based on the estimated image frames in the different resolutions and detecting the video anomaly based on a difference between the captured current image frame and the estimated current image frame.

Description

SYSTEMS AND METHODS FOR VIDEO ANOMALY DETECTION USING MULTI-SCALE IMAGE FRAME PREDICTION NETWORK TECHNICAL FIELD
The present disclosure relates to systems and methods for video anomaly detection, and more particularly to, systems and methods for video anomaly detection using a multi-scale image frame prediction network.
BACKGROUND
Video anomaly detection plays an essential role in computer vision and is used in many applications such as warning system, scene understanding, activity recognition, road traffic analysis, etc. Given a video clip, the frame-level video anomaly detection aims at identifying the frames in which there exist events or behaviors different from expectations or regulations. Detecting anomalous events in videos is very challenging for mainly two reasons. First, the definition of an anomalous event is usually ambiguous, and its pattern varies a lot because it highly depends on the contexts of the event and the scenario where the event happens. For example, a vehicle driving down the road within the speed limits is normal but a vehicle dashing towards a crowd of people is anomalous. For another example, a cloud of smoke rising from inside of a building often indicates an anomaly, but a cloud of smoke coming from a chimney is normal. Therefore, a robust video anomaly detection method has to be able to address the ambiguous definition of the anomalous event. Second, the data between normal and abnormal samples is usually imbalanced in practice. Because anomalous events are rare and unpredictable in real-world scenarios, it is very difficult and costly to collect and label abnormal videos. Therefore, video anomaly detection methods have to learn from normal data only (i.e., using the video of normal events to distinguish unseen and unbounded abnormal events from normal ones) .
Various video anomaly detection methods have been developed throughout time. Traditional video anomaly detection methods rely on hand-crafted features from prior domain knowledge. These methods learn dictionaries or descriptors of normal events based on extracted appearance and motion features. However, their detection performance is limited by the poor discriminative power of the simple features. The recent advance in deep learning networks promoted the deep learning-based video anomaly detection methods. These methods can be generally grouped into three  categories: classification-based method, reconstruction-based methods, and prediction-based methods.
Classification-based method directly classifies each video frame to be normal or abnormal. Among them, some methods require additional anomaly data and labels in the training phase via a weakly supervised approach. However, because collecting and labeling anomaly data for training is expensive or even infeasible, this approach cannot be generalized to unbounded anomalies. Other works focus on specific types of anomaly events. For example, a multi-stage classification method first detects and crops out the object of interest and then builds one-versus-rest classifiers on the extracted features of that object. Since its anomaly detection result is substantially influenced by the performance of its object detection, it will fail in recognizing anomalies with unseen objects or no objects to attribute. Therefore, this type of method is very limited when dealing with complicated and uncertain real-world scenarios.
Reconstruction-based method is a more general way for video anomaly detection, which learns to reconstruct the input video frame with minimum reconstruction errors for normal frames and is supposed to have large errors for abnormal frames. Auto-encoder based methods and generative adversarial networks are commonly used as the reconstruction models. However, due to the improved performance of deep neural networks, an abnormal event may also have small reconstruction errors, so there is no guarantee that reconstruction-based methods can detect abnormal events well.
Prediction-based method has been developed to remedy the issues of the classification-based method and the reconstruction-based method. It takes consecutive video frames to predict the next frame and determines whether the next frame is abnormal by the prediction error. However, the existing prediction-based methods are still suboptimal. For example, their U-Net architecture cannot fully learn temporal information, and using adversarial learning and additional optical flow loss makes the training inefficient.
Embodiments of the disclosure address the above problems by providing prediction-based video anomaly detection methods and systems using a multi-scale frame prediction network.
SUMMARY
Embodiments of the disclosure provide a method for video anomaly detection using a learning model. An exemplary method may include receiving a video including  a plurality of image frames captured by a camera, the plurality of image frames including a current image frame corresponding to a selected time point and a set of prior image frames corresponding to time points prior to the selected time point. The method may also include extracting spatial features from each prior image frame at different resolutions and predicting an estimated image frame in each resolution based on the spatial features of the set of prior image frames in that resolution. The method may further include predicting an estimated current image frame based on the estimated image frames in the different resolutions and detecting the video anomaly based on a difference between the captured current image frame and the estimated current image frame.
Embodiments of the disclosure also provide a system for performing video anomaly detection using a learning model. An exemplary system may include a communication interface configured to receive a video including a plurality of image frames captured by a camera, the plurality of image frames including a current image frame corresponding to a selected time point and a set of prior image frames corresponding to time points prior to the selected time point. The system may also include at least one processor coupled to the communication interface. The at least one processor may be configured to extract spatial features from each prior image frame at different resolutions and predict an estimated image frame in each resolution based on the spatial features of the set of prior image frames in that resolution. The at least one processor may be further configured to predict an estimated current image frame based on the estimated image frames in the different resolutions and detect the video anomaly based on a difference between the captured current image frame and the estimated current image frame.
Embodiments of the disclosure further provide a non-transitory computer-readable medium storing instructions that, when executed by one or more processors, cause the one or more processors to perform a method for video anomaly detection using a learning model. The method may include receiving a video including a plurality of image frames captured by a camera, the plurality of image frames including a current image frame corresponding to a selected time point and a set of prior image frames corresponding to time points prior to the selected time point. The method may also include extracting spatial features from each prior image frame at different resolutions and predicting an estimated image frame in each resolution based on the spatial features of the set of prior image frames in that resolution. The method may further include predicting an estimated current image frame based on the estimated image  frames in the different resolutions and detecting the video anomaly based on a difference between the captured current image frame and the estimated current image frame.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention, as claimed.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 illustrates a schematic diagram of an exemplary anomaly detection system, according to embodiments of the disclosure.
FIG. 2 illustrates a block diagram of an exemplary anomaly detection device, according to embodiments of the disclosure.
FIG. 3 illustrates a flowchart of an exemplary method for anomaly detection, according to embodiments of the disclosure.
FIG. 4 illustrates a flowchart of an exemplary method for training an anomaly detection network, according to embodiments of the disclosure.
FIG. 5 illustrates a schematic framework of an exemplary anomaly detection network along with its training network, according to embodiments of the disclosure.
DETAILED DESCRIPTION
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. Wherever possible, the same reference numbers will be used throughout the drawings to refer to the same or like parts.
FIG. 1 illustrates a schematic diagram of an exemplary video anomaly detection system (referred to as “anomaly detection system 100” ) , according to embodiments of the disclosure. Consistent with the present disclosure, anomaly detection system 100 is configured to detect anomalous events recorded in a video (i.e., determine if there is a video anomaly) captured by a camera 160. The anomaly detection may be based on a multi-scale image frame prediction network (referred to as an “anomaly detection network 105” hereafter) trained using sample videos (e.g., training data 101) . In some embodiments, anomaly detection system 100 may include components shown in FIG. 1, including an anomaly detection device 110, a model training device 120, a display device 130, a training database 140, a database/repository 150, a camera 160 and a network 170 for facilitating communications among the various components. It is to be contemplated that anomaly detection system 100 may include more or fewer components than those shown in FIG. 1.
As shown in FIG. 1, anomaly detection system 100 may perform two stages: an anomaly detection model training stage and an anomaly detection stage applying the trained model. To perform the training stage (e.g., training a learning model such as anomaly detection network 105) , anomaly detection system 100 may include model training device 120 and training database 140. To perform the video anomaly detection process to obtain an anomaly detection result 107 (e.g., whether the video includes anomalous event (s) ) , anomaly detection system 100 may include anomaly detection device 110 and database/repository 150. In some embodiments, anomaly detection system 100 may also include display device 130 to display anomaly detection result 107. In some embodiments, when the learning model (e.g., anomaly detection network 105) is pre-trained, anomaly detection system 100 may only include components for performing the anomaly detection related functions, namely anomaly detection device 110, database/repository 150 and optionally display device 130.
In some embodiments, anomaly detection system 100 may optionally include network 170 to facilitate the communication among the various components of anomaly detection system 100, such as  databases  140 and 150,  devices  110 and 120, and camera 160. For example, network 170 may be a local area network (LAN) , a wireless network, a cloud computing environment (e.g., software as a service, platform as a service, infrastructure as a service) , a client-server, a wide area network (WAN) , etc. In some embodiments, network 170 may be replaced by wired data communication systems or devices.
In some embodiments, the various components of anomaly detection system 100 may be remote from each other or in different locations and be connected through network 170 as shown in FIG. 1. In some alternative embodiments, certain components of anomaly detection system 100 may be located on the same site or inside one device. For example, training database 140 may be located on-site with or be part of model training device 120. As another example, model training device 120 and anomaly detection device 110 may be inside the same computer or processing device.
Consistent with the present disclosure, anomaly detection system 100 may store videos that include multiple image frames. Image frames of a video may each correspond to a time point. Consistent with the present disclosure, a “current image frame” refers to the image frame corresponding to a selected time point. Accordingly, “prior image frames” refer to the set of image frames corresponding to time points prior to the selected time point. Likewise, “future image frames” refer to the set of image frames corresponding to time points subsequent to the selected time point. In some embodiments, sample image frames where anomalies are known (e.g., training data 101) may be stored in training database 140 and image frames for detection (e.g., image frames 102) may be stored in database/repository 150.
Image frames may be generated based on video data (e.g., a video comprising a plurality of image frames) received from video recording devices (e.g., camera 160) . In some embodiments, the video data may be visual image streams including the current image frame, the set of prior image frames and the set of future image frames, acquired by camera 160 or a wearable device, a smart phone, a tablet, a computer, a surveillance camera, or the like that includes a video recording device for acquiring the video data. In some embodiments, camera 160 may be any suitable video recording device that can acquire image frames. Image frames may be generated based on the acquired video data. Optionally, the video data may also come from a post-processing device implementing image enhancement technology (e.g., an application on a user device for adding night visions to the surveillance devices) .
In some embodiments, training database 140 may store training data 101, which includes sample image frames. In some embodiments, training data 101 may further include the known anomalies (if any) in the sample image frames. For example, the sample image frames may include an image frame corresponding to a time point when an anomalous event happens, and image frames prior and subsequent to that time point. In some embodiments, training data 101 may include image frames that are known to be anomaly-free. Sample image frames may be stored in training database 140 as training data 101.
Consistent with some embodiments, anomaly detection network 105 (described in detail in connection with FIG. 5) may be a multi-scale image frame prediction network (referred to as “prediction network” hereafter) for predicting an estimated current image frame (i.e., a prediction of an image frame at a selected time point) based on a set of prior image frames corresponding to time points prior to the selected time point. Anomaly detection network 105 may also include an anomaly score calculation model (referred to as “assessment model” hereafter) for determining an anomaly score indicative of a difference between the ground truth current image frame (i.e., the captured image frame at the selected time point) and the estimated current image frame. In some embodiments, anomaly detection network 105 may decide whether the current image frame records/includes anomalous event(s) (i.e., whether an anomaly occurs at or around the selected time point) based on the anomaly score.
In some embodiments, the prediction network may include an encoder for extracting spatial features at different resolutions from each prior image frame (i.e., each of the set of prior image frames corresponding to time points prior to the selected time point) ; a predictor for predicting an estimated image frame in each resolution based on the spatial features of the set of prior image frames in that resolution; and a decoder for predicting the estimated current image frame based on the estimated image frames in the different resolutions.
In some embodiments, the encoder may include at least one convolution layer and a plurality of residual blocks for obtaining spatial features of the prior image frames in different resolutions. Each residual block may be configured to extract spatial features in a specific resolution. The residual blocks may be connected in a sequential manner, each producing spatial features in a different resolution; the more residual blocks applied to an image frame, the lower the resolution of the extracted spatial features. For example, each residual block may include a 2-D convolution, and when extracting the spatial features from each prior image frame, a first residual block may be applied to the image frame to generate spatial features in a first resolution, and a second residual block may be applied to the spatial features in the first resolution to obtain spatial features in a second resolution, where the second resolution is lower than the first resolution. It is understood that the number of residual blocks of the plurality of residual blocks is not limited to two. In some embodiments, the plurality of residual blocks may include three or more residual blocks applied in a sequential manner, for generating spatial features in three or more different resolutions. The multi-scale architecture may allow the encoder to extract spatial features of the image frame at different scales.
In some embodiments, the predictor may include a plurality of sub-models in parallel, each corresponding to a resolution. Each of the sub-models may include a first block for extracting global temporal features from the spatial features of the set of prior image frames in that resolution, and a second block for extracting local temporal features from the global temporal features in that resolution. For example, within each sub-model, the first block may be a non-local block. The second block may be a convolutional gated recurrent unit (ConvGRU) and may extract local temporal features with its receptive field in that resolution.
In some embodiments, the decoder may include a plurality of residual blocks for fusing the local and global spatial-temporal features in the different resolutions generated by the predictor to generate an estimated current image frame of the selected time point. The estimated current image frame may have the same shape as the input prior image frames. For example, the decoder may fuse features at different scales and construct the output (e.g., the estimated current image frame) by upsampling and concatenating the channels of the features. In some embodiments, the lower-resolution features may be upsampled by nearest neighbor interpolation (i.e., assigning each upsampled position the value of its nearest neighbor) . Checkerboard artifacts (i.e., checkerboard patterns in the gradient) in the output may be eliminated using the residual blocks.
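For illustration only, the following is a minimal PyTorch-style sketch of how an encoder, a set of per-resolution predictors, and a decoder of the kind described above might be composed into one prediction network. The class name MultiScalePredictionNet, the tensor shapes in the comments, and the module interfaces are assumptions of this sketch, not the disclosed implementation; the component sketches accompanying steps S304-S308 below give one possible reading of each sub-module.

```python
# Hypothetical composition sketch (not the disclosed implementation): an encoder
# extracts features at L resolutions from each prior frame, one predictor per
# resolution estimates the current-frame features at that scale, and a decoder
# fuses all scales into the estimated current image frame.
import torch
import torch.nn as nn

class MultiScalePredictionNet(nn.Module):
    def __init__(self, encoder, predictors, decoder):
        super().__init__()
        self.encoder = encoder                         # frame -> [feat_1, ..., feat_L]
        self.predictors = nn.ModuleList(predictors)    # one sub-model per resolution
        self.decoder = decoder                         # [pred_1, ..., pred_L] -> frame

    def forward(self, prior_frames):
        # prior_frames: (B, P, C, H, W), the P frames before the selected time point
        B, P, _, _, _ = prior_frames.shape
        feats_per_scale = None
        for p in range(P):
            feats = self.encoder(prior_frames[:, p])   # list of L feature maps
            if feats_per_scale is None:
                feats_per_scale = [[f] for f in feats]
            else:
                for l, f in enumerate(feats):
                    feats_per_scale[l].append(f)
        # predict the current-frame features independently at each resolution
        preds = [predictor(torch.stack(seq, dim=1))    # (B, P, C_l, H_l, W_l) per scale
                 for predictor, seq in zip(self.predictors, feats_per_scale)]
        return self.decoder(preds)                     # estimated current image frame
```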
In some embodiments, the assessment model may calculate the anomaly score based on a peak signal-to-noise ratio (PSNR) of a mean squared error between the ground truth/captured current image frame and the estimated current image frame. For example, the assessment model may first calculate the mean squared error between the ground truth current image frame I_t and the estimated current image frame $\hat{I}_t$, and calculate the PSNR based on the mean squared error. The assessment model may then calculate an anomaly score based on normalizing the PSNR values of all the prior image frames.
Consistent with some embodiments, anomaly detection network 105 may be trained by minimizing the differences between an image frame predicted by using anomaly detection network 105 (hereafter “predicted training image frame” ) and a ground truth image frame corresponding to the same time point as the predicted image frame provided as part of training data 101 (hereafter “ground truth training image frame” ) . For example, model training device 120 may minimize a loss including a weighted sum of a perceptual loss indicative of a level of noise between the predicted training image frame and the ground truth training image frame, an intensity loss indicative of an intensity difference between the predicted training image frame and the ground truth training image frame, and a gradient difference loss indicative of a difference between a gradient image of the predicted training image frame and a gradient image of the ground truth training image frame. The details of the training will be disclosed in connection with FIG. 4 below.
As shown in FIG. 1, model training device 120 may communicate with training database 140 to receive one or more sets of training data 101. Each set of training data 101 may include sample image frames (i.e., a plurality of image frames from a sample video) including a current image frame corresponding to a selected time point and a set of prior image frames corresponding to time points prior to the selected time point. Model training device 120 may use training data 101 received from training database 140 to train the learning model, e.g., anomaly detection network 105. Model training device 120 may be implemented with hardware specially programmed by software that performs the training process. For example, model training device 120 may include a processor and a non-transitory computer-readable medium. The processor may conduct the training by performing instructions of a training process stored in the computer-readable medium. Model training device 120 may additionally include input and output interfaces to communicate with training database 140, network 170, and/or a user interface (not shown) . The user interface may be used for selecting sets of training data, adjusting one or more parameters of the training process, selecting or modifying a framework of the learning model, and/or manually adjusting the selected time point of training data 101.
Anomaly detection device 110 may receive trained anomaly detection network 105 from model training device 120. Anomaly detection device 110 may include a processor and a non-transitory computer-readable medium (not shown) . The processor may perform instructions of an anomaly detection process stored in the medium. Anomaly detection device 110 may additionally include input and output interfaces to communicate with database/repository 150, camera 160, network 170 and/or a user interface of display device 130. The input interface may be used for selecting a video that includes the plurality of image frames or initiating the detection process. The output interface may be used for providing an anomaly detection result 107 associated with the video.
Display 130 may include a display such as a Liquid Crystal Display (LCD) , a Light Emitting Diode Display (LED) , a plasma display, or any other type of display, and provide a Graphical User Interface (GUI) presented on the display for user input and data depiction. The display may include a number of different types of materials, such as plastic or glass, and may be touch-sensitive to receive inputs from the user. For example, the display may include a touch-sensitive material that is substantially rigid, such as Gorilla Glass TM, or substantially pliable, such as Willow Glass TM. In some embodiments, display 130 may be a standalone device, or may be an integrated part of anomaly detection device 110.
FIG. 2 illustrates a block diagram of an exemplary anomaly detection device 110, according to embodiments of the disclosure. In some embodiments, as shown in FIG. 2, anomaly detection device 110 may include a communication interface 202, a processor 204, a memory 206, and a storage 208. In some embodiments, anomaly detection device 110 may have different modules in a single device, such as an integrated circuit (IC) chip (e.g., implemented as an application-specific integrated circuit (ASIC) or a field-programmable gate array (FPGA) ) , or separate devices with dedicated functions. In some embodiments, one or more components of anomaly detection device 110 may be located in a cloud or may be alternatively in a single location (such as inside a mobile device) or distributed locations. Components of anomaly detection device 110 may be in an integrated device or distributed at different locations but communicate with each other through a network (not shown) . Consistent with the present disclosure, anomaly detection device 110 may be configured to detect anomalous event(s) recorded in image frames 102 received from database/repository 150, using anomaly detection network 105 trained in model training device 120.
Communication interface 202 may send data to and receive data from components such as database/repository 150, camera 160, model training device 120 and display device 130 via communication cables, a Wireless Local Area Network (WLAN) , a Wide Area Network (WAN) , wireless networks such as radio waves, a cellular network, and/or a local or short-range wireless network (e.g., Bluetooth TM) , or other communication methods. In some embodiments, communication interface 202 may include an integrated service digital network (ISDN) card, cable modem, satellite modem, or a modem to provide a data communication connection. As another example, communication interface 202 may include a local area network (LAN) card to provide a data communication connection to a compatible LAN. Wireless links can also be implemented by communication interface 202. In such an implementation, communication interface 202 can send and receive electrical, electromagnetic or optical signals that carry digital data streams representing various types of information.
Consistent with some embodiments, communication interface 202 may receive anomaly detection network 105 from model training device 120 and image frames 102 from database/repository 150. Communication interface 202 may further provide image frames 102 and anomaly detection network 105 to memory 206 and/or storage 208 for storage or to processor 204 for processing.
Processor 204 may include any appropriate type of general-purpose or special-purpose microprocessor, digital signal processor, or microcontroller. Processor 204 may be configured as a separate processor module dedicated to detecting anomalous event (s) in a video including a plurality of image frames captured by a camera using a  learning model. Alternatively, processor 204 may be configured as a shared processor module for performing other functions in addition to anomaly detection.
Memory 206 and storage 208 may include any appropriate type of mass storage provided to store any type of information that processor 204 may need to operate. Memory 206 and storage 208 may be a volatile or non-volatile, magnetic, semiconductor, tape, optical, removable, non-removable, or other type of storage device or tangible (i.e., non-transitory) computer-readable medium including, but not limited to, a ROM, a flash memory, a dynamic RAM, and a static RAM. Memory 206 and/or storage 208 may be configured to store one or more computer programs that may be executed by processor 204 to perform functions disclosed herein. For example, memory 206 and/or storage 208 may be configured to store program (s) that may be executed by processor 204 to detect anomalous events in image frames 102 based on anomaly detection network 105.
In some embodiments, memory 206 and/or storage 208 may also store intermediate data such as spatial features at different resolutions, estimated image frames in different resolutions, estimated current image frame, PSNR of a mean squared error, anomaly score, etc. Memory 206 and/or storage 208 may additionally store various learning models including their model parameters, such as learnable parameters of the encoder, the predictor, and the decoder, etc.
As shown in FIG. 2, processor 204 may include multiple modules, such as an encoder unit 240, a predictor unit 242, a decoder unit 244, an assessment unit 246, and the like. These modules (and any corresponding sub-modules or sub-units) can be hardware units (e.g., portions of an integrated circuit) of processor 204 designed for use with other components or software units implemented by processor 204 through executing at least part of a program. The program may be stored on a computer-readable medium, and when executed by processor 204, it may perform one or more functions. Although FIG. 2 shows units 240-246 all within one processor 204, it is contemplated that these units may be distributed among different processors located closely or remotely with each other.
In some embodiments, units 240-246 of FIG. 2 may execute computer instructions to perform the anomaly detection. For example, FIG. 3 illustrates a flowchart of an exemplary method 300 for anomaly detection based on anomaly detection network 105 (an example shown in FIG. 5) , according to embodiments of the disclosure. Method 300 may be implemented by anomaly detection device 110 and particularly processor 204 or a separate processor not shown in FIG. 2. Method 300 may include steps S302-S316 as described below. It is to be appreciated that some of the steps may be optional, and some of the steps may be performed simultaneously, or in a different order than shown in FIG. 3. FIG. 3 and FIG. 5 will be described together.
In step S302, communication interface 202 may receive image frames 102 from database/repository 150. In some embodiments, image frames 102 may be part of a video that includes at least image frames I_{t-P}, …, I_{t-1}, I_t recorded using camera 160. For example, image frames 102 may include a current image frame I_t corresponding to a selected time point t and a set of P prior image frames I_{t-P}, …, I_{t-1} corresponding to time points t-P, …, t-1, prior to the selected time point t.
In step S304, encoder unit 240 may apply at least one convolution layer 510 to an input image frame (i.e., one of the prior image frames of image frames 102) to generate a convoluted result of the input image frame. Encoder unit 240 may extract spatial features from the convoluted result of the input image frame at different resolutions. For example, encoder unit 240 may use several basic residual blocks 512 for extracting multi-scale spatial features (i.e., spatial features in different resolutions) . One or more basic residual blocks 512 may be applied to the convoluted result in sequence. Specifically, as shown in FIG. 5, L may be set as 3 (i.e., three different resolutions in total) . Basic residual blocks 512 may accordingly include three residual blocks applied in sequence for generating spatial features at different resolutions in a descending manner (e.g., from a feature resolution of 128×128×128 to 64×64×256 to 32×32×512) . For example, the first residual block may be applied to the convoluted result from the convolution layer and generate a first set of spatial features at a first resolution (e.g., 128×128×128) . The second residual block may be applied to the first set of spatial features and generate a second set of spatial features at a second resolution (e.g., 64×64×256) , lower than the first resolution. In some embodiments, each of basic residual blocks 512 may include a 2-D convolution for convoluting the spatial features.
For example, encoder unit 240 may calculate spatial features in L resolutions for each prior image frame from I_{t-P} to I_{t-1} using basic residual blocks 512 according to equation (1):

$\{F_{t'}^{1}, F_{t'}^{2}, \ldots, F_{t'}^{L}\} = f_{enc}(I_{t'})$   (1)

where t′ = t-P, …, t-1, f_enc(·) is the encoder function, and $F_{t'}^{l}$ denotes the spatial features of image frame $I_{t'}$ in the l-th resolution.
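As one possible reading of this step, the sketch below implements a convolution layer followed by three stride-2 residual blocks, so that a 256×256 RGB frame yields features of roughly the 128×128×128, 64×64×256 and 32×32×512 shapes mentioned above. The kernel sizes, strides, channel counts, and input resolution are assumptions for illustration only, not the parameters of the disclosed network.

```python
import torch
import torch.nn as nn

class BasicResBlock(nn.Module):
    """Residual block built around 2-D convolutions; stride 2 halves the resolution."""
    def __init__(self, in_ch, out_ch, stride=2):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, stride=1, padding=1)
        self.skip = nn.Conv2d(in_ch, out_ch, 1, stride=stride)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.conv2(self.relu(self.conv1(x)))
        return self.relu(out + self.skip(x))

class MultiScaleEncoder(nn.Module):
    """Extracts spatial features at L = 3 resolutions from a single image frame."""
    def __init__(self):
        super().__init__()
        self.stem = nn.Conv2d(3, 64, 3, padding=1)   # the convolution layer applied first
        self.block1 = BasicResBlock(64, 128)         # e.g. 128 x 128 x 128
        self.block2 = BasicResBlock(128, 256)        # e.g. 64 x 64 x 256
        self.block3 = BasicResBlock(256, 512)        # e.g. 32 x 32 x 512

    def forward(self, frame):                        # frame: (B, 3, 256, 256)
        x = self.stem(frame)
        f1 = self.block1(x)
        f2 = self.block2(f1)
        f3 = self.block3(f2)
        return [f1, f2, f3]                          # spatial features, high to low resolution
```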
In step S306, predictor unit 242 may predict an estimated image frame in each resolution based on the spatial features of the set of P prior image frames I_{t-P}, …, I_{t-1} in that resolution using a plurality of prediction sub-models 520. In some embodiments, the number of prediction sub-models 520 corresponds to the number of different resolutions. In a specific example, because the spatial features were extracted at three different resolutions by encoder unit 240 (i.e., L = 3) , prediction sub-models 520 may include three prediction sub-models. In some embodiments, the three prediction sub-models are applied in parallel, each of which corresponds to spatial features in one resolution for predicting an estimated image frame in that resolution. For example, each sub-model may include a first block (e.g., a non-local block) for extracting global temporal features from the spatial features of the set of prior image frames in the corresponding resolution and a second block for extracting local temporal features from the global temporal features in that resolution.
In some embodiments, the second block may be a convolutional gated recurrent unit (ConvGRU) . For example, within each sub-model, the first block (e.g., the non-local block) may be applied to capture long-range dependencies at all positions in the image frame, and the ConvGRU may be applied to extract local temporal features (i.e., temporal patterns) using its receptive field focusing on the local neighborhood of the global temporal features extracted by the first block. The estimated image frame in each resolution may be predicted according to equation (2):

$\hat{F}_{t}^{l} = f_{pre}(F_{t-P}^{l}, F_{t-P+1}^{l}, \ldots, F_{t-1}^{l})$   (2)

where l = 1, 2, …, L, f_pre(·) is the predictor function, and $\hat{F}_{t}^{l}$ denotes the estimated image frame (feature representation) in the l-th resolution.
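The following is a rough sketch of one such prediction sub-model: a simplified non-local block captures the global (long-range) dependencies and a single convolutional GRU cell extracts the local temporal features through its convolutional receptive field. Both blocks are abbreviated illustrations under assumed channel counts, not the exact blocks of FIG. 5.

```python
import torch
import torch.nn as nn

class SimpleNonLocalBlock(nn.Module):
    """Simplified non-local block: self-attention over all spatial positions,
    used here to capture long-range (global) dependencies in a feature map."""
    def __init__(self, ch):
        super().__init__()
        self.theta = nn.Conv2d(ch, ch // 2, 1)
        self.phi = nn.Conv2d(ch, ch // 2, 1)
        self.g = nn.Conv2d(ch, ch // 2, 1)
        self.out = nn.Conv2d(ch // 2, ch, 1)

    def forward(self, x):                                   # x: (B, C, H, W)
        B, C, H, W = x.shape
        q = self.theta(x).flatten(2).transpose(1, 2)        # (B, HW, C/2)
        k = self.phi(x).flatten(2)                          # (B, C/2, HW)
        v = self.g(x).flatten(2).transpose(1, 2)            # (B, HW, C/2)
        attn = torch.softmax(q @ k / (C // 2) ** 0.5, dim=-1)
        y = (attn @ v).transpose(1, 2).reshape(B, C // 2, H, W)
        return x + self.out(y)                              # residual connection

class ConvGRUCell(nn.Module):
    """Single convolutional GRU cell; its convolution kernel gives a local
    receptive field over the features produced by the non-local block."""
    def __init__(self, ch, kernel=3):
        super().__init__()
        pad = kernel // 2
        self.gates = nn.Conv2d(2 * ch, 2 * ch, kernel, padding=pad)
        self.cand = nn.Conv2d(2 * ch, ch, kernel, padding=pad)

    def forward(self, x, h):
        z, r = torch.chunk(torch.sigmoid(self.gates(torch.cat([x, h], dim=1))), 2, dim=1)
        h_tilde = torch.tanh(self.cand(torch.cat([x, r * h], dim=1)))
        return (1 - z) * h + z * h_tilde

class ScalePredictor(nn.Module):
    """Predicts the current-frame features at one resolution from the P prior features."""
    def __init__(self, ch):
        super().__init__()
        self.global_block = SimpleNonLocalBlock(ch)
        self.local_block = ConvGRUCell(ch)

    def forward(self, feats):                               # feats: (B, P, C, H, W)
        B, P, C, H, W = feats.shape
        h = feats.new_zeros(B, C, H, W)
        for p in range(P):
            g = self.global_block(feats[:, p])              # global dependencies per frame
            h = self.local_block(g, h)                      # local temporal patterns
        return h                                            # estimated features at this scale
```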
In step S308, decoder unit 244 may predict the estimated current image frame by fusing the features (i.e., both global temporal features and local temporal features) in the different resolutions using sub-models 530. In some embodiments, each sub-model of sub-models 530 includes an upsampling unit for upsampling the features, followed by a residual block for eliminating checkerboard artifacts (i.e., checkerboard patterns in the gradient) in the estimated current image frame (i.e., the output image frame of decoder unit 244) . In some embodiments, the features may be upsampled by nearest neighbor interpolation (i.e., assigning each upsampled position the value of its nearest neighbor) .
For example, the estimated current image frame may be constructed by upsampling and concatenating the channels of the features according to equation (3):

$\hat{I}_{t} = f_{dec}(\hat{F}_{t}^{1}, \hat{F}_{t}^{2}, \ldots, \hat{F}_{t}^{L})$   (3)

where $\hat{I}_{t}$ is the predicted current image frame and f_dec(·) is the decoder function.
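A minimal sketch of such a decoder is shown below, assuming the three feature resolutions of the encoder sketch above. Nearest-neighbor upsampling, channel concatenation, and stride-1 residual blocks follow the description; the final sigmoid mapping the output to [0, 1] pixel values is an added assumption of this sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FuseResBlock(nn.Module):
    """Stride-1 residual block applied after each upsampling step; smoothing the
    fused features helps suppress checkerboard artifacts in the output."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, padding=1)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, padding=1)
        self.skip = nn.Conv2d(in_ch, out_ch, 1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.conv2(self.relu(self.conv1(x)))
        return self.relu(out + self.skip(x))

class FusionDecoder(nn.Module):
    """Fuses the per-resolution predicted features into the estimated current frame."""
    def __init__(self, channels=(128, 256, 512)):        # high -> low resolution channels
        super().__init__()
        c_high, c_mid, c_low = channels
        self.fuse_low_mid = FuseResBlock(c_low + c_mid, c_mid)
        self.fuse_mid_high = FuseResBlock(c_mid + c_high, c_high)
        self.to_rgb = nn.Conv2d(c_high, 3, 3, padding=1)

    def forward(self, preds):                            # preds: [f_high, f_mid, f_low]
        f_high, f_mid, f_low = preds
        x = F.interpolate(f_low, scale_factor=2, mode="nearest")    # nearest-neighbor upsample
        x = self.fuse_low_mid(torch.cat([x, f_mid], dim=1))
        x = F.interpolate(x, scale_factor=2, mode="nearest")
        x = self.fuse_mid_high(torch.cat([x, f_high], dim=1))
        x = F.interpolate(x, scale_factor=2, mode="nearest")        # back to the frame size
        return torch.sigmoid(self.to_rgb(x))             # estimated current image frame in [0, 1]
```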
In step S310, assessment unit 246 may determine an anomaly score indicative of the difference between the captured current image frame I_t and the estimated current image frame $\hat{I}_{t}$ using the assessment model (not shown in FIG. 5) .
In some embodiments, the assessment model may calculate the anomaly score based on a peak signal-to-noise ratio (PSNR) of a mean squared error between the captured current image frame and the estimated current image frame. For example, the assessment model may first calculate the mean squared error (MSE) between the captured current image frame I_t and the estimated current image frame $\hat{I}_{t}$ according to equation (4):

$\mathrm{MSE}(I_t, \hat{I}_t) = \frac{1}{H W} \sum_{i=1}^{H} \sum_{j=1}^{W} \left\| I_t(i, j) - \hat{I}_t(i, j) \right\|_2^2$   (4)

where H and W are the height and the width of the image frames, respectively, and I_t(i, j) and $\hat{I}_t(i, j)$ are the Red, Green, Blue (RGB) values of the (i, j)-th pixel in I_t and $\hat{I}_t$, respectively.
The assessment model may then calculate the PSNR based on the mean squared error according to equation (5):

$\mathrm{PSNR}(I_t, \hat{I}_t) = 10 \log_{10} \frac{\mathrm{MAX}_{I_t}^{2}}{\mathrm{MSE}(I_t, \hat{I}_t)}$   (5)

where $\mathrm{MAX}_{I_t}$ is the maximum possible pixel value of I_t.
The assessment model may further calculate the anomaly score of the image frame I_t at the time point t based on normalizing the PSNR values of all the T prior image frames. In some embodiments, the anomaly score may be calculated according to equation (6):

$S_t = 1 - \frac{\mathrm{PSNR}(I_t, \hat{I}_t) - \min_{t'} \mathrm{PSNR}(I_{t'}, \hat{I}_{t'})}{\max_{t'} \mathrm{PSNR}(I_{t'}, \hat{I}_{t'}) - \min_{t'} \mathrm{PSNR}(I_{t'}, \hat{I}_{t'})}$   (6)

where t′ = 1, 2, …, T. In some embodiments, a higher anomaly score S_t indicates a higher probability that the image frame I_t contains anomalous event(s) .
In step S312, assessment unit 246 may determine if the anomaly score is higher than a predetermined threshold. In some embodiments, an operator or a designer of the learning model (e.g., anomaly detection network 105) may set a predetermined threshold value based on the domain knowledge for determining if the image frame records anomalous event(s) . For example, if the anomaly score S_t is higher than the predetermined threshold, the image frame contains anomalous event(s) . Thus, in step S314, the video may be determined to include an anomaly. Otherwise, if the anomaly score S_t is not higher than the predetermined threshold, the current image frame contains no anomalous event(s) . Thus, in step S316, the video may be determined to be normal at or around time point t.
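A compact sketch of steps S310-S316 is shown below. The inverted min-max normalization in anomaly_scores is one common way to realize the normalization described for equation (6), and the threshold value in the usage comment is purely illustrative.

```python
import torch

def psnr(frame, frame_hat, max_val=1.0):
    """PSNR of the mean squared error between a captured frame and the
    estimated frame predicted for the same time point (equations (4)-(5))."""
    mse = torch.mean((frame - frame_hat) ** 2)
    return (10.0 * torch.log10(max_val ** 2 / mse)).item()

def anomaly_scores(psnr_values):
    """One possible normalization of the per-frame PSNR values (equation (6)):
    min-max normalize over the clip and invert, so that a higher score means a
    higher probability that the frame contains an anomalous event."""
    p = torch.tensor(psnr_values, dtype=torch.float32)
    p_norm = (p - p.min()) / (p.max() - p.min() + 1e-8)
    return 1.0 - p_norm

# Usage sketch for steps S312-S316: compare each score to a predetermined
# threshold chosen by the operator (the value 0.8 here is purely illustrative).
# scores = anomaly_scores([psnr(f, f_hat) for f, f_hat in zip(frames, predictions)])
# is_anomalous = scores > 0.8
```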
Consistent with some embodiments, anomaly detection network 105 may be trained by model training device 120 by minimizing a loss including a weighted sum of a perceptual loss indicative of a level of noise between the predicted training image frame and the ground truth training image frame, an intensity loss indicative of an intensity difference between the predicted training image frame and the ground truth training image frame, and a gradient difference loss indicative of a difference between a gradient image of the predicted training image frame and a gradient image of the ground truth training image frame.
FIG. 4 illustrates a flowchart of an exemplary method 400 for training anomaly detection network 105, according to embodiments of the disclosure. Method 400 may include steps S402-S412 as described below. It is to be appreciated that some of the steps may be optional, and some of the steps may be performed simultaneously, or in a different order than shown in FIG. 4.
In step S402, model training device 120 may receive training data 101 from training database 140. In some embodiments, training data 101 may include a video that includes at least training image frames I_{t-P}, …, I_{t-1}, I_t recorded using camera 160. For example, training data 101 may include a training image frame I_t corresponding to a selected time point t and a set of P prior training image frames I_{t-P}, …, I_{t-1} corresponding to time points t-P, …, t-1, prior to the selected time point t.
In step S404, model training device 120 may calculate the perceptual loss indicative of a level of noise between the training image frame $\hat{I}_t$ predicted by using the learning model (e.g., anomaly detection network 105) and the ground truth training image frame I_t. For example, model training device 120 may apply separate pre-trained deep convolution networks such as VGG16 networks 540 (shown in FIG. 5) to the predicted training image frame $\hat{I}_t$ and the ground truth training image frame I_t simultaneously.
In some embodiments, the pre-trained deep convolution networks are trained on ImageNet for image classification. Each VGG16 network may include multiple sub-convolution layers 542 with a rectified linear unit (ReLU) . In a specific example, the multiple sub-convolution layers 542 may include 13 sub-convolution layers. Model training device 120 may calculate the perceptual loss based on a weighted l1 distance of the outputs of the 2nd, 4th, 7th, 10th and 13th sub-convolution layers of the VGG16 network according to equation (7):

$L_{pl} = \sum_{v \in V} \alpha_{v} \left\| \phi_{v}(I_t) - \phi_{v}(\hat{I}_t) \right\|_1$   (7)

where V = {2, 4, 7, 10, 13}, $\phi_{v}(\cdot)$ is the output from the VGG16's v-th sub-convolution layer, and the hyperparameter (i.e., a predetermined parameter) α_v controls the strength of each part of the loss. In a specific example, (α_2, α_4, α_7, α_10, α_13) may be set as (0.1, 1, 10, 10, 10) .
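A sketch of such a perceptual loss is shown below, assuming a recent torchvision and its VGG16 layer ordering. The mapping of the 2nd, 4th, 7th, 10th and 13th sub-convolution layers to feature indices 2, 7, 14, 21 and 28, the use of a mean rather than a sum for the l1 distance, and the omission of ImageNet input normalization are simplifications of this sketch.

```python
import torch
import torch.nn as nn
from torchvision import models

class PerceptualLoss(nn.Module):
    """Weighted l1 distance between VGG16 feature maps of the predicted and the
    ground truth frames, in the spirit of equation (7)."""
    def __init__(self, weights=(0.1, 1.0, 10.0, 10.0, 10.0)):
        super().__init__()
        vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1).features.eval()
        for p in vgg.parameters():
            p.requires_grad_(False)
        self.vgg = vgg
        # assumed torchvision indices of the 2nd, 4th, 7th, 10th and 13th conv layers
        self.taps = (2, 7, 14, 21, 28)
        self.weights = weights

    def _features(self, x):
        feats = []
        for idx, layer in enumerate(self.vgg):
            x = layer(x)
            if idx in self.taps:
                feats.append(x)
                if idx == self.taps[-1]:
                    break
        return feats

    def forward(self, pred, target):
        loss = 0.0
        for w, f_p, f_t in zip(self.weights, self._features(pred), self._features(target)):
            loss = loss + w * torch.mean(torch.abs(f_p - f_t))
        return loss
```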
In step S406, model training device 120 may calculate the intensity loss L_int indicative of the intensity l2 distance between the predicted training image frame $\hat{I}_t$ and the ground truth training image frame I_t according to equation (8):

$L_{int} = \sum_{i=1}^{H} \sum_{j=1}^{W} \left\| I_t(i, j) - \hat{I}_t(i, j) \right\|_2^2$   (8)

where 1 ≤ i ≤ H and 1 ≤ j ≤ W, and I_t(i, j) and $\hat{I}_t(i, j)$ are the RGB values of the (i, j)-th pixel in I_t and $\hat{I}_t$, respectively.
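A short sketch of this intensity loss, assuming (B, 3, H, W) tensors and a per-frame sum of squared RGB differences averaged over the batch, is:

```python
import torch

def intensity_loss(pred, target):
    """l2 intensity distance between the predicted and ground truth frames (equation (8))."""
    # squared RGB differences at every (i, j) position, summed over each frame
    return torch.sum((pred - target) ** 2, dim=(1, 2, 3)).mean()
```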
In step S408, model training device 120 may calculate the gradient difference loss L_gd indicative of the gradient difference between the gradient image of the predicted training image frame $\hat{I}_t$ and the gradient image of the ground truth training image frame I_t. For example, model training device 120 may measure the l1 distance between the gradient images in both the vertical and horizontal directions (i.e., the vertical and horizontal gradient differences) and combine the two directions into the gradient difference loss L_gd according to equations (9) and (10):

$L_{gd} = \sum_{i, j} \Big( \big| \, |I_t(i, j) - I_t(i-1, j)| - |\hat{I}_t(i, j) - \hat{I}_t(i-1, j)| \, \big| + \big| \, |I_t(i, j) - I_t(i, j-1)| - |\hat{I}_t(i, j) - \hat{I}_t(i, j-1)| \, \big| \Big)$   (9)-(10)
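A sketch of this gradient difference loss, using simple finite differences along the height and width axes of (B, 3, H, W) tensors, might look as follows; the summation convention is an assumption of this sketch.

```python
import torch

def gradient_difference_loss(pred, target):
    """l1 distance between the vertical/horizontal gradient images of the
    predicted and ground truth frames (equations (9) and (10))."""
    pred_dv = torch.abs(pred[:, :, 1:, :] - pred[:, :, :-1, :])      # vertical gradients
    pred_dh = torch.abs(pred[:, :, :, 1:] - pred[:, :, :, :-1])      # horizontal gradients
    gt_dv = torch.abs(target[:, :, 1:, :] - target[:, :, :-1, :])
    gt_dh = torch.abs(target[:, :, :, 1:] - target[:, :, :, :-1])
    return torch.sum(torch.abs(pred_dv - gt_dv)) + torch.sum(torch.abs(pred_dh - gt_dh))
```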
In step S410, model training device 120 may calculate the overall loss as a weighted sum of the perceptual loss, the intensity loss, and the gradient difference loss according to equation (11):

$L = \lambda_{int} L_{int} + \lambda_{gd} L_{gd} + \lambda_{pl} L_{pl}$   (11)

where λ_int, λ_gd, and λ_pl are hyperparameters (i.e., parameters with predefined values) . In step S412, model training device 120 may train the learning model (e.g., anomaly detection network 105) by minimizing the overall loss.
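The overall loss can then be assembled as a plain weighted sum. The sketch below reuses the intensity, gradient difference, and perceptual loss helpers sketched above; the default lambda values are placeholders, not the hyperparameters of the disclosed embodiment.

```python
def overall_loss(l_int, l_gd, l_pl, lambda_int=1.0, lambda_gd=1.0, lambda_pl=1.0):
    """Weighted sum of the three training losses (equation (11))."""
    return lambda_int * l_int + lambda_gd * l_gd + lambda_pl * l_pl

# Hypothetical training iteration for step S412, assuming the model and loss
# sketches above and an already-constructed optimizer:
# pred = model(prior_frames)                                   # estimated current frame
# loss = overall_loss(intensity_loss(pred, current_frame),
#                     gradient_difference_loss(pred, current_frame),
#                     perceptual(pred, current_frame))
# optimizer.zero_grad(); loss.backward(); optimizer.step()
```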
The design of the multi-scale architecture of the learning model applied by the disclosed system and method ensures that more attention is paid to semantically meaningful parts, such as a person, a vehicle, etc., compared to background information (i.e., the static/unchanged objects) in each of the image frames of the video to be analyzed. Moreover, the disclosed multi-scale learning model is more sensitive to objects with different scales of features. For example, spatial features extracted at different resolutions may ensure that the same object, appearing at different granularities in the image frame, can be detected for determining whether an anomalous event has happened. Finally, because the learning model can be trained based solely on normal videos (i.e., videos that are anomaly-free) , it is less expensive to capture the training data for training the learning model.
Another aspect of the disclosure is directed to a non-transitory computer-readable medium storing instructions which, when executed, cause one or more processors to perform the methods, as discussed above. The computer-readable medium may include volatile or non-volatile, magnetic, semiconductor, tape, optical, removable, non-removable, or other types of computer-readable medium or computer-readable storage devices. For example, the computer-readable medium may be the storage device or the memory module having the computer instructions stored thereon, as disclosed. In some embodiments, the computer-readable medium may be a disc or a flash drive having the computer instructions stored thereon.
It will be apparent to those skilled in the art that various modifications and variations can be made to the disclosed system and related methods. Other embodiments will be apparent to those skilled in the art from consideration of the specification and practice of the disclosed system and related methods.
It is intended that the specification and examples be considered as exemplary only, with a true scope being indicated by the following claims and their equivalents.

Claims (20)

  1. A method for performing video anomaly detection using a learning model, comprising:
    receiving a video comprising a plurality of image frames captured by a camera, the plurality of image frames including a current image frame corresponding to a selected time point and a set of prior image frames corresponding to time points prior to the selected time point;
    extracting spatial features from each prior image frame at different resolutions;
    predicting an estimated image frame in each resolution based on the spatial features of the set of prior image frames in that resolution;
    predicting an estimated current image frame based on the estimated image frames in the different resolutions; and
    detecting the video anomaly based on a difference between the captured current image frame and the estimated current image frame.
  2. The method of claim 1, wherein the learning model comprises at least one convolution layer and a plurality of residual blocks, wherein extracting the spatial features from each prior image frame further comprises:
    applying the convolution layer to the prior image frame to obtain a convoluted result;
    applying a first residual block to the convoluted result to obtain spatial features in a first resolution; and
    applying a second residual block to the spatial features in the first resolution to obtain spatial features in a second resolution.
  3. The method of claim 1, wherein the learning model comprises a plurality of prediction sub-models each corresponding to a resolution, wherein the estimated image frames in the different resolutions are predicted using the plurality of prediction sub-models in parallel.
  4. The method of claim 3, wherein each sub-model comprises a first block and a second block, wherein predicting the estimated image frame in each resolution further comprises:
    extracting global temporal features from the spatial features of the set of prior image frames in that resolution using the first block; and
    extracting local temporal features from the global temporal features in that resolution using the second block, wherein the second block has a receptive field applied to local neighborhoods of the global temporal features.
  5. The method of claim 4, wherein the second block is a convolutional gated recurrent unit (ConvGRU) .
  6. The method of claim 4, wherein predicting the estimated current image frame further comprises fusing the global temporal features in the different resolutions.
  7. The method of claim 1, wherein the learning model is trained by minimizing a loss comprising a perceptual loss indicative of a level of noise between a training image frame predicted by using the learning model and a ground truth training image frame.
  8. The method of claim 7, wherein the loss further comprises an intensity loss indicative of an intensity difference between the predicted training image frame and the ground truth training image frame, and a gradient difference loss indicative of a difference between a gradient image of the predicted training image frame and a gradient image of the ground truth training image frame.
  9. The method of claim 8, wherein the loss is a weighted sum of the perceptual loss, the intensity loss and the gradient difference loss.
  10. The method of claim 7, wherein the perceptual loss is calculated by:
    applying a pre-trained classification model to the ground truth image frame to obtain a first output from at least one hidden layer of the pre-trained classification model;
    applying the pre-trained classification model to the predicted image frame to obtain a second output from the at least one hidden layer of the pre-trained classification model; and
    determining a difference between the first output and the second output.
  11. The method of claim 10, wherein the at least one hidden layer comprises a plurality of hidden layers and the difference is a weighted distance between the first output and the second output from the plurality of hidden layers.
  12. The method of claim 1, wherein detecting the video anomaly further comprises:
    determining an anomaly score indicative of the difference between the captured current image frame and the estimated current image frame; and
    detecting the video anomaly if the determined anomaly score is higher than a predetermined threshold.
  13. The method of claim 12, wherein determining the anomaly score further comprises computing a peak signal-to-noise ratio (PSNR) of a mean squared error between the captured current image frame and the estimated current image frame.
  14. A system for performing video anomaly detection using a learning model, comprising:
    a communication interface configured to receive a video comprising a plurality of image frames captured by a camera, the plurality of image frames including a current image frame corresponding to a selected time point and a set of prior image frames corresponding to time points prior to the selected time point; and
    at least one processor coupled to the communication interface and configured to:
    extract spatial features from each prior image frame at different resolutions;
    predict an estimated image frame in each resolution based on the spatial features of the set of prior image frames in that resolution;
    predict an estimated current image frame based on the estimated image frames in the different resolutions; and
    detect the video anomaly based on a difference between the captured current image frame and the estimated current image frame.
  15. The system of claim 14, wherein the learning model comprises at least one convolution layer and a plurality of residual blocks, wherein to extract the spatial features from each prior image frame, the at least one processor is further configured to:
    apply the convolution layer to the prior image frame to obtain a convoluted result;
    apply a first residual block to the convoluted result to obtain spatial features in a first resolution; and
    apply a second residual block to the spatial features in the first resolution to obtain spatial features in a second resolution.
  16. The system of claim 14, wherein the learning model comprises a plurality of prediction sub-models each corresponding to a resolution, wherein the estimated image frames in the different resolutions are predicted using the plurality of prediction sub-models in parallel.
  17. The system of claim 16, wherein each sub-model comprises a first block and a second block, wherein to predict the estimated image frame in each resolution, the at least one processor is further configured to:
    extract global temporal features from the spatial features of the set of prior image frames in that resolution using the first block; and
    extract local temporal features from the global temporal features in that resolution using the second block, wherein the second block has a receptive field applied to local neighborhoods of the global temporal features.
  18. The system of claim 14, wherein the learning model is trained by minimizing a loss comprising a perceptual loss indicative of a level of noise between a training image frame predicted by using the learning model and a ground truth training image frame.
  19. A non-transitory computer-readable medium storing instructions that, when executed by one or more processors, cause the one or more processors to perform a method for performing video anomaly detection using a learning model, the method comprising:
    extracting spatial features from each prior image frame at different resolutions;
    predicting an estimated image frame in each resolution based on the spatial features of the set of prior image frames in that resolution;
    predicting an estimated current image frame based on the estimated image frames in the different resolutions; and
    detecting the video anomaly based on a difference between the captured current image frame and the estimated current image frame.
  20. The non-transitory computer-readable medium of claim 19, wherein the learning model is trained by minimizing a loss comprising a perceptual loss indicative of a level  of noise between a training image frame predicted by using the learning model and a ground truth training image frame.
PCT/CN2020/073932 2020-01-22 2020-01-22 Systems and methods for video anomaly detection using multi-scale image frame prediction network WO2021147055A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2020/073932 WO2021147055A1 (en) 2020-01-22 2020-01-22 Systems and methods for video anomaly detection using multi-scale image frame prediction network


Publications (1)

Publication Number Publication Date
WO2021147055A1



Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150332434A1 (en) * 2014-05-15 2015-11-19 The Government Of The United States Of America, As Represented By The Secretary Of The Navy Demosaicking System and Method for Color array Based Multi-Spectral Sensors
CN110245603A (en) * 2019-06-12 2019-09-17 成都信息工程大学 A kind of group abnormality behavior real-time detection method
CN110582748A (en) * 2017-04-07 2019-12-17 英特尔公司 Method and system for boosting deep neural networks for deep learning
CN110705376A (en) * 2019-09-11 2020-01-17 南京邮电大学 Abnormal behavior detection method based on generative countermeasure network


Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113592719A (en) * 2021-08-14 2021-11-02 北京达佳互联信息技术有限公司 Training method of video super-resolution model, video processing method and corresponding equipment
CN113592719B (en) * 2021-08-14 2023-11-28 北京达佳互联信息技术有限公司 Training method of video super-resolution model, video processing method and corresponding equipment
CN116152722A (en) * 2023-04-19 2023-05-23 南京邮电大学 Video anomaly detection method based on combination of residual attention block and self-selection learning
CN116450880A (en) * 2023-05-11 2023-07-18 湖南承希科技有限公司 Intelligent processing method for vehicle-mounted video of semantic detection
CN116450880B (en) * 2023-05-11 2023-09-01 湖南承希科技有限公司 Intelligent processing method for vehicle-mounted video of semantic detection


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 20915037; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 20915037; Country of ref document: EP; Kind code of ref document: A1)