WO2019043406A1 - Anomaly detection from video data from surveillance cameras - Google Patents

Anomaly detection from video data from surveillance cameras Download PDF

Info

Publication number
WO2019043406A1
Authority
WO
WIPO (PCT)
Prior art keywords
input data
statistical model
objects
operable
video
Prior art date
Application number
PCT/GB2018/052478
Other languages
French (fr)
Inventor
Jiameng GAO
Boris PLOIX
Original Assignee
Calipsa Limited
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Calipsa Limited
Publication of WO2019043406A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/50Context or environment of the image
    • G06V20/52Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Abstract

The present invention relates to object orientated data analysis. More particularly, the present invention relates to analysis of objects within video data from surveillance cameras. According to a first aspect, there is provided a method of detecting anomalous behaviour, the method comprising the steps of: receiving a first set of input data, comprising one or more digital image frames; generating a statistical model based on the first set of input data, the statistical model operable to detect one or more objects within the first set of input data; analysing a second set of input data with respect to the statistical model; and detecting one or more objects within the second set of input data.

Description

ANOMALY DETECTION FROM VIDEO DATA FROM SURVEILLANCE CAMERAS
Field
The present invention relates to object orientated data analysis. More particularly, the present invention relates to analysis of objects within video data from surveillance cameras.
Background
It is estimated that there are close to 250 million video surveillance cameras deployed worldwide, capturing 1.6 trillion hours of video annually. In order to review even 20% of the most critical video streams, either in real time or post-processing, approximately 110 million human operators would be required to keep up. Human errors may also be present, and important details may be missed from the video stream.
Each human operator also requires training, and hence expansion of surveillance systems may be difficult to scale effectively. Further, humans are often not suited to watching video streams on multiple screens simultaneously for many hours. Focus will be lost, and important details become increasingly likely to be missed.
Summary of Invention
Aspects and/or embodiments seek to provide a method, apparatus and system to detect anomalous behaviour from video data.
According to a first aspect, there is provided a method of detecting anomalous behaviour, the method comprising the steps of: receiving a first set of input data, comprising one or more digital image frames; generating a statistical model based on the first set of input data, the statistical model operable to detect one or more objects within the first set of input data; analysing a second set of input data with respect to the statistical model; and detecting one or more objects within the second set of input data.
Surveillance cameras are becoming increasingly popular. The videos they record can help catch criminals or detect certain behaviours. They may also be useful in detecting potential accidents before they occur, and analysing behaviours and movements to ensure that such accidents do not happen in the future. However in order to do so, the video which is recorded needs to be viewed and analysed. By using the abovementioned method, such analysis may be performed ceaselessly and with a lower rate of errors than a human camera operator. Human operators can be used to train and hence generate the statistical model as they would train a new employee. The model can therefore be arranged to learn relentlessly and is less likely to make the same mistake twice. By training over a wide range of video and image data under different environmental conditions, algorithms used (which may comprise convolutional neural networks) can be arranged to be robust to a wide range of environment changes. A further advantage may be provided in that the same models may be deployed to various environments without further specific engineering and/or training. Any algorithms used may hence be robust to weather, lighting and camera placement issues and work without any further configuration. This method may achieve levels of accuracy far beyond that of a human operator at a much lower cost. Typically, 50-80% may be saved using this method compared to manual enumeration methods.
Optionally, the first and/or second set of input data comprises one or more digital videos, formed from the one or more digital image frames.
Digital videos are often difficult to analyse, as the task of watching them can be mentally unstimulating and hence not performed effectively. Humans are not generally suited to watching endless hours of video on multiple screens, as they get tired and lose focus. However using the method disclosed herein, any human operators may instead be provided with actionable alerts as opposed to raw video feeds, keeping them more engaged and making the best use of their decision-making skills. A statistical model does not tire or require breaks, and can process several video streams simultaneously.
Optionally, the one or more digital videos are recorded from one or more surveillance cameras.
Surveillance cameras, while often placed at sites of interest, frequently fail to deliver their full value because the videos which they record are not fully analysed. Such a system would allow the video produced by the surveillance cameras to be used to its full effect. The method disclosed herein is agnostic to the type of camera used to record the video: regardless of whether the video was recorded with a mobile phone or an HD video recorder, it may still be used in the video analysis.
Optionally, the generation of the statistical model is performed using one or more of: Convolutional Neural Networks (CNNs); Deep Convolutional Networks; Recurrent Neural Networks; Reinforced Learning; Scale-Invariant Feature Transform (SIFT) features, and/or optical flow feature vectors.
CNNs, and/or similar tools, can provide a useful and robust means for generating a statistical model. They can be trained effectively, and learn from previous errors. This machine learning method may be used in combination with proprietary datasets to deliver greater accuracy of analysis. However other arrangements may be used, for example hand designed or modelled filters, such as Scale-Invariant Feature Transform (SIFT) features, and/or optical flow feature vectors. CNNs themselves may comprise a series of layers (also referred to as "sets") of filters, wherein each set of filters may be of the same width and height. Each layer of filters can be convolved over its input and the outputs fed into a subsequent layer. Optionally, between each layer, the outputs are fed into a non-linear function before being fed into a subsequent layer. The values of the filters may be randomly selected at the beginning of a training session, and during gradient descent training, they converge to optimal values in relation to the tasks for which they are being trained.
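By way of illustration only (the patent does not specify an implementation), the layered arrangement of filters described above can be sketched in a few lines of PyTorch; the layer sizes, the use of ReLU as the non-linear function, and the dummy training batch are assumptions made for this sketch, not details taken from the application.

```python
# Illustrative sketch only: a small stack of convolutional filter layers with a
# non-linearity between layers, as described above. Layer sizes are arbitrary.
import torch
import torch.nn as nn

class SmallCNN(nn.Module):
    def __init__(self, num_classes: int = 2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),   # first set of filters, randomly initialised
            nn.ReLU(),                                    # non-linear function between layers
            nn.Conv2d(16, 32, kernel_size=3, padding=1),  # second set of filters
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(32, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.features(x)
        return self.classifier(h.flatten(1))

# Gradient descent drives the randomly initialised filter values towards values
# suited to the task; a dummy batch and dummy labels stand in for real data here.
model = SmallCNN()
optimiser = torch.optim.SGD(model.parameters(), lr=0.01)
images = torch.randn(4, 3, 64, 64)       # dummy batch of frames
labels = torch.randint(0, 2, (4,))       # dummy labels
loss = nn.CrossEntropyLoss()(model(images), labels)
loss.backward()
optimiser.step()
```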
Optionally, the method further comprises the steps of: analysing the first set of input data through one or more filters; and obtaining one or more filter outputs. Optionally, the generation of the statistical model comprises the use of the one or more filter outputs. Optionally, the one or more filters comprise one or more of: CNNs; Deep Convolutional Networks; Recurrent Neural Networks; Reinforced Learning; Scale-Invariant Feature Transform (SIFT) features, and/or optical flow feature vectors.
Filters can reduce noise and other distractions from a video and create higher-level abstractions of images and video, hence enabling a more efficient and accurate analysis. CNNs and/or similar tools may be used in place of one or more filters.
Optionally, the one or more objects comprise one or more of: vehicles; human beings; animals; plants; buildings; and/or weather formations. Optionally, the statistical model is operable to track one or more objects in the first and/or second set of input data. Optionally, the statistical model is operable to detect anomalous objects in the first and/or second set of input data.
It can be advantageous to analyse the presence of certain objects. For example, accidents can be detected through anomalous behaviours, for example collisions, between two vehicles. Unusual movements of human beings, for example a single person travelling against a large crowd, may also be useful to detect. Such analysis may allow accidents to be prevented in the future, or an emergency team to be dispatched to the site of an accident more efficiently.
Optionally, the analysis of the second set of input data is unsupervised. Optionally, the analysis of the second set of input data occurs in real time.
By analysing a set of data without supervision, the analysis of the data may be rapidly scaled up. If a statistical model is trained to recognise anomalous behaviour on roads with an accuracy above a specified level, then video feeds from a large number of other surveillance cameras may be immediately analysed using that same statistical model. Such a task would have previously required the employment and training of a correspondingly large number of people to review and label the raw video data. In this embodiment, instead of building a classification model that classifies the behaviour of objects into pre-determined classes of activities (e.g. car driving slowly, making illegal turns), a probability distribution may be assigned over the positions and other features of the object. Therefore an anomaly may be detected by comparing a probability value against a predetermined threshold.
According to a further aspect, there is provided an apparatus for detecting anomalous behaviour, comprising: means for receiving a first set of input data, comprising one or more digital image frames; means for generating a statistical model based on the first set of input data, the statistical model operable to detect one or more objects within the first set of input data; means for analysing a second set of input data with respect to the statistical model; and means for detecting one or more objects within the second set of input data. According to a further aspect, there is provided a system operable to perform the method disclosed herein. According to a further aspect, there is provided a computer program product operable to perform the method and/or apparatus and/or system disclosed herein.
By providing such an apparatus and/or system, the method disclosed herein may be effectively implemented. Such an implementation may allow for more effective use of human resources within a company or organisation, as well as reducing undesirable anomalous behaviours and detecting their root causes.
Brief Description of Drawings
Embodiments will now be described, by way of example only and with reference to the accompanying drawings having like-reference numerals, in which: Figure 1 shows an example analysis of a video frame; and Figure 2 shows a diagrammatic process flow.
Specific Description
Referring to Figure 1, a first embodiment will now be described. In this figure, a view 100 from a surveillance camera is shown. The view 100 comprises a single digital image frame taken from a plurality of such frames, together comprising a digital video. The surveillance camera from which this view 100 is observed is positioned over a road 110, and positioned in such a way that objects on the road 110 are visible. Such objects may comprise vehicles, road markings, vehicle paths, and road barriers. In this example, vehicles are being detected and tracked.
In particular, vehicles are being analysed in reference to what type of vehicle they appear to be. Cars 105 on the road 110 are identified as being different from motorcycles 115. The paths of the vehicles 120 are also identified and recorded. Anomalous interactions between vehicles, for example a collision, may be detected and recorded. Further, illegal manoeuvres such as forbidden lane changes and travelling at excessive speeds may be accurately monitored and recorded.
An exemplary analysis process is shown in Figure 2. In this figure, a first surveillance camera 205 records a video. This video may include, for example, a road on which traffic travels. The video is processed into an appropriately sized video file 210. The video file 210 is then used to develop a statistical model 215. For example, objects which may fall under the umbrella term of "vehicle" within the video file 210 may be identified by an operator, such that the statistical model 215 develops the ability to detect vehicles autonomously. Further, the statistical model 215 may be arranged to differentiate between different types of vehicle, for example categorising vehicles as "motorcycles", "bicycles", "cars", "vans", and/or "lorries". This training is not necessarily limited to road surveillance videos, but could be applied to any video file. For example, surveillance camera video footage from a sports event could be used and the statistical model trained to recognise "humans", or even "home fans" and "away fans". If the detected humans were acting in an anomalous manner, for example a fight begins, the statistical model could be trained to recognise such behaviour. A simple self-service web interface may be used to train the statistical model 215 for personal use, for example in the case of a household camera used to detect cats in a garden, or differentiating black taxis from private cars on the street. Once trained in such a manner, the statistical model 215 may then be applied to a second video file 211, derived from a second surveillance camera 206. The statistical model 215 may then autonomously detect the objects which it has been trained to recognise, which in this embodiment is vehicles travelling along a road. Such detection can occur without human supervision, and may be applied to many video feeds simultaneously, both recorded and in real time. The accuracy of the statistical model 215 may improve over time with supervision from a human operator. The algorithms used may be rewarded for correct notifications and penalised for false alarms.
A specific example of such an algorithm could comprise calculating a function Q(s, a), where the expected reward Q is determined by the state "s" of the current conditions of the image or object, such as its position, its length of stay in the video, appearance features, etc. and actions "a", which would be to raise an alarm, or not raise an alarm. In such a scenario, the algorithm could take action a* which has the maximum expected reward for a given state s. With such an arrangement in place, every time an alert is raised, or a notable event is missed, the user can reward the system for correct detections or penalise the system for false alarms or false negatives, which will reinforce and/or correct the function Q. Such correction may be arranged through gradient descent. Here the function Q may take the form of various differentiable function approximators, such as a neural network, with the state s and action a as input, or an individual Gaussian Process for each action a.
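As an illustrative sketch only, and not the application's prescribed implementation, the function Q(s, a) could be approximated by a small neural network whose two outputs correspond to the actions "raise an alarm" and "do not raise an alarm"; the state dimensionality, network sizes and squared-error update below are assumptions made for the sketch.

```python
# Sketch of a differentiable Q-function approximator with two actions:
# 0 = do not raise an alarm, 1 = raise an alarm. The state vector (position,
# length of stay, appearance features, ...) is assumed to be pre-computed.
import torch
import torch.nn as nn

q_net = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 2))
optimiser = torch.optim.SGD(q_net.parameters(), lr=0.01)

def act(state: torch.Tensor) -> int:
    """Take the action a* with the maximum expected reward for state s."""
    with torch.no_grad():
        return int(q_net(state).argmax())

def update(state: torch.Tensor, action: int, reward: float) -> None:
    """Reinforce or correct Q via gradient descent from the user's feedback
    (positive reward for correct detections, negative for false alarms or misses)."""
    q_value = q_net(state)[action]
    loss = (q_value - torch.tensor(reward)) ** 2
    optimiser.zero_grad()
    loss.backward()
    optimiser.step()

state = torch.randn(8)                              # dummy state features for one object
a = act(state)
update(state, a, reward=1.0 if a == 1 else -0.1)    # example user feedback
```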
An output may be provided in the form of an annotated display 220, on which the analysed video file 211 is provided along with an overlay. The overlay can represent the findings of the statistical model, for example providing labels to any vehicles labelled "bicycle", or highlighting areas in which collisions seem to occur most frequently. An alert system may be employed to alert a human operator if a dangerous or otherwise anomalous situation arises. A web interface may be provided, comprising a search and reporting functionality. The interface may be operable to allow users to filter by event type, location and action taken. Interactive reports may be provided, which give a level of detail that is significantly more difficult to achieve with manual enumeration. For example, such a report may comprise vehicle speeds, classification, colour, changes in lanes, heatmaps and/or advanced flow analysis. The trained statistical model 215 may be stored in cloud storage, at a location remote from the site of the second surveillance camera 206. Any camera feeds or recorded videos may be uploaded to such a cloud platform using a simple interface. A user would only need access to a web browser and a working internet connection. No dedicated hardware would be required in this case. Local analysis may also be provided, for example if privacy of the data was a major concern.
The implementation of the abovementioned arrangement may involve the following steps:
1. Obtain a sequence of images from a live or stored video feed, where the length of the sequence is set to a threshold. For example, 1 minute of video may be provided at a frame rate of 25 frames per second.
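A minimal sketch of step 1, assuming OpenCV is used to read the feed (the application does not name a library), might look as follows; the file name is a placeholder.

```python
# Sketch: collect one minute of frames (25 fps) from a stored or live feed.
# "camera_feed.mp4" is a placeholder; a camera index or RTSP URL also works.
import cv2

capture = cv2.VideoCapture("camera_feed.mp4")
frames = []
max_frames = 25 * 60                 # 1 minute at 25 frames per second
while len(frames) < max_frames:
    ok, frame = capture.read()
    if not ok:                       # end of file or dropped connection
        break
    frames.append(frame)             # BGR image as a NumPy array
capture.release()
```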
2. Pass the sequence of images as input to a convolutional neural network-based object detector, which in this embodiment comprises a convolutional neural network (CNN). The object detector may further or in addition comprise the use of at least one of: Deep Convolutional Networks; Recurrent Neural Networks; Reinforced Learning; Scale-Invariant Feature Transform (SIFT) features, and/or optical flow feature vectors. The CNN comprises a set of automatically learnt filters, which produce a tensor for each image, and a region proposal component. The region proposal component may be in the form of a neural network (a set of automatically learnt matrix transformations) that uses the tensor as input to estimate bounding box coordinates for regions with high likelihoods of objects. This gives, for each image, a set of bounding boxes for each of the detected objects.
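The application does not name a particular detector; as one hedged example of a CNN detector with a region proposal component, an off-the-shelf Faster R-CNN from torchvision (assuming torchvision 0.13 or later) returns the per-image sets of bounding boxes described in step 2. The confidence threshold and the dummy frame are assumptions for the sketch.

```python
# Sketch: a CNN-based object detector with a region proposal network.
# Faster R-CNN is one example of this kind of architecture; the application
# does not mandate a particular model.
import numpy as np
import torch
import torchvision

detector = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
detector.eval()

# A dummy frame stands in here; in practice use the BGR frames collected in step 1.
frames = [np.zeros((480, 640, 3), dtype=np.uint8)]

# Convert HxWx3 uint8 BGR arrays to RGB, CHW float tensors in [0, 1].
inputs = [
    torch.from_numpy(f[:, :, ::-1].copy()).permute(2, 0, 1).float() / 255.0
    for f in frames
]

with torch.no_grad():
    outputs = detector(inputs)       # one dict per image

for out in outputs:
    boxes = out["boxes"]             # (num_detections, 4) as (x_min, y_min, x_max, y_max)
    scores = out["scores"]
    keep = scores > 0.5              # an assumed confidence threshold
    print(boxes[keep])
```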
3. Using the sequence of images and the sets of bounding boxes, link the detections of each individual object together using a CNN-based tracker. A CNN-based tracker may comprise a CNN (itself comprising a collection of filters), which takes a crop of the image as input, wherein the crop may be centred on the target tracked object. The filter outputs may then be arranged to produce a tensor representation which can be flattened to form a vector. For the next frame, the CNN may be applied across a search area around a previous location of the object, obtaining one or more tensor and/or vector outputs. A normalised dot-product may then be used to calculate the cosine between the new vectors and the vector of the target object. The location of the target object may then be assigned to be the location that produced the vector with the highest cosine value. Therefore, by using a CNN (which could be different from the CNN used in the object detector) to produce a tensor representation for each object on the first frame (also referred to as the target objects), detections on a subsequent frame may be matched to the target objects by finding the bounding boxes whose tensor representations are closest to those of the target objects. These bounding boxes may be drawn onto the sequence of images and displayed to the user in video form.
The tracker may be arranged to output four coordinates (x_minimum, y_minimum, x_maximum, y_maximum). From these coordinates four lines may be generated to form a rectangle. Once one or more rectangles have been formed, a tool can be used to draw them onto an image. In this embodiment, such a tool may comprise a program arranged to change one or more colours of the pixels on the bounding box to a desired colour for the bounding box. Hence the tool may change the pixel values of the lines between (x_minimum, y_minimum) to (x_maximum, y_minimum), (x_minimum, y_minimum) to (x_minimum, y_maximum), etc. to red and/or blue and/or green. Having done this for all detected objects through the sequence of images, the full trajectories of each of the detected objects may be obtained. This provides the 2D spatial coordinates for each object at every frame it is detected within the video. A set of 3D coordinates is therefore provided for each object, with the third dimension being time. Using these 3D trajectory coordinates, the starting frame-number of each of the objects is set to zero, such that the third dimension represents how long the object has remained within the video.
Build a probability density model over the sets of 3D coordinates for all detected objects. This can be done using Kernel Density Estimation, where the kernel could be Gaussian, triangular, or any other suitable arrangement, such that the density at any point in this 3D space would correlate with the number of individual trajectory data points close by. This provides a method to numerically calculate the probability at any point for future data points.
For further images or sequences of images from the live or stored video feed, the same object detection and tracking methods may be used as previously described to obtain the 3D coordinates for each new object in the new images. These coordinates may then be evaluated using the Kernel Density Estimation model.
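Two operations in step 3 can be made concrete with the short sketch below: matching a target object to candidate detections by the cosine (normalised dot-product) between flattened feature vectors, and the "tool" that recolours the pixels along a bounding box's edges. This is a simplified stand-in rather than the application's exact tracker; the 512-dimensional vectors, the dummy data and the choice of red are assumptions.

```python
# Sketch of two operations in step 3: (a) matching a target object to candidate
# detections by the cosine between flattened CNN feature vectors, and (b) drawing
# a bounding box by recolouring its edge pixels.
import numpy as np

def best_match(target_vec: np.ndarray, candidate_vecs: np.ndarray) -> int:
    """Return the index of the candidate whose vector has the highest cosine
    (normalised dot-product) with the target object's vector."""
    t = target_vec / np.linalg.norm(target_vec)
    c = candidate_vecs / np.linalg.norm(candidate_vecs, axis=1, keepdims=True)
    return int(np.argmax(c @ t))

def draw_box(image: np.ndarray, box: tuple, colour=(0, 0, 255)) -> None:
    """Recolour the pixels along the four edges of (x_min, y_min, x_max, y_max)."""
    x_min, y_min, x_max, y_max = box
    image[y_min, x_min:x_max + 1] = colour     # top edge
    image[y_max, x_min:x_max + 1] = colour     # bottom edge
    image[y_min:y_max + 1, x_min] = colour     # left edge
    image[y_min:y_max + 1, x_max] = colour     # right edge

# Example with dummy data: three candidate crops, one target object.
target = np.random.rand(512)
candidates = np.random.rand(3, 512)
chosen = best_match(target, candidates)

frame = np.zeros((120, 160, 3), dtype=np.uint8)
draw_box(frame, (10, 20, 60, 80))              # draws a red box (BGR channel order assumed)
```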
Conventionally, a Kernel Density Estimation model calculates the probability of a certain point by counting and/or weighting all nearby points using a function ("kernel"), such as a Gaussian or a triangular function, giving a weighted average for the expected number of data points at a predetermined location. For any data points with a probability less than a threshold (for example, 0.05), an alert would be raised to the user. This could be extended to the user rewards and/or reinforcement learning arrangement as disclosed above. Examples of alerts include drawing the bounding box of the object in a different colour to distinguish it to the user, displaying a text box directly to the user, or sending the user an email.
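A minimal sketch of the density step, assuming SciPy's Gaussian-kernel estimator (the application permits other kernels), fits a Kernel Density Estimate to previously observed (x, y, t) trajectory points and raises an alert for any new point whose estimated probability falls below the example threshold of 0.05; the dummy training data is an assumption.

```python
# Sketch: Kernel Density Estimation over 3D trajectory points
# (x, y, frames-since-first-seen) with a simple threshold for raising alerts.
import numpy as np
from scipy.stats import gaussian_kde

# Dummy training trajectories: columns are (x, y, t) points from tracked objects.
history = np.random.rand(3, 1000)        # gaussian_kde expects shape (dims, n_points)
density = gaussian_kde(history)          # Gaussian kernel; other kernels are possible

def check_point(point_xyt: np.ndarray, threshold: float = 0.05) -> bool:
    """Return True (raise an alert) if the estimated density is below threshold."""
    p = float(density(point_xyt.reshape(3, 1)))
    return p < threshold

new_point = np.array([0.5, 0.5, 0.1])
if check_point(new_point):
    print("Anomalous trajectory point - alert the operator")
```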
A further embodiment is also disclosed as in the following steps:
1. Obtain a sequence of images from a live or stored video feed, where the length of the sequence is set to a threshold, as previously described.
2. Pass the sequence of images through a CNN (which could be different from the CNNs used in the object detector and tracker), such that a tensor representation is obtained for each whole image, where these tensors may have dimensions such as (40 × 30 × 512).
3. The tensor representation is then flattened into a vector (which in this exemplary embodiment is of dimension 614,400 = 40 × 30 × 512), wherein this vector representation is used as the input for a probability density estimation model, an example of which is Kernel Density Estimation.
4. For all future frames, their tensor representations are flattened and their probability values may be estimated using the density estimation model, such that low probability scores (such as less than 0.05) would raise an alert to the user.
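A hedged sketch of this whole-frame embodiment is given below: a truncated ResNet-18 backbone stands in for the unspecified CNN, each frame's feature tensor is flattened to a vector, and scikit-learn's KernelDensity is fitted and used for scoring. The backbone choice, the dummy frames and the log-density threshold are assumptions; in practice the threshold would be calibrated on representative footage.

```python
# Sketch of the whole-frame embodiment: a CNN feature tensor per frame, flattened
# to a vector, then fitted/scored with a kernel density model. ResNet-18 and the
# threshold below are assumptions, not requirements of the application.
import numpy as np
import torch
import torchvision
from sklearn.neighbors import KernelDensity

backbone = torch.nn.Sequential(
    *list(torchvision.models.resnet18(weights="DEFAULT").children())[:-2]
)  # outputs a (512, H/32, W/32) feature tensor per image
backbone.eval()

def embed(batch: torch.Tensor) -> np.ndarray:
    """Flatten each frame's feature tensor into a single vector."""
    with torch.no_grad():
        feats = backbone(batch)          # (N, 512, h, w)
    return feats.flatten(1).numpy()      # (N, 512*h*w)

train_frames = torch.randn(8, 3, 224, 224)      # dummy "normal" frames
density = KernelDensity(kernel="gaussian").fit(embed(train_frames))

new_frames = torch.randn(2, 3, 224, 224)
log_p = density.score_samples(embed(new_frames))  # log-density per frame
alerts = log_p < np.log(0.05)                     # assumed threshold; calibrate in practice
```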
A yet further embodiment may at least in part combine the above methods in the following steps:
1. Obtain a sequence of images from a live or stored video feed, where the length of the sequence is set to a threshold, as previously described.
2. Pass the sequence of images through a CNN-based object detector and an object tracker, such that for each object, a trajectory (which may be in the form of a sequence of bounding boxes) is obtained for each object.
3. Make a crop of the image from each bounding box of each object, thus obtaining a set of the sequence of crops for each object as it moves through the frame.
4. This set of crops is then fed into a CNN (which could be different from the CNNs used in the object detector and tracker), wherein a tensor representation is obtained for each crop. All the obtained tensor representations are then flattened and used as input for a probability density estimation model, as described previously.
5. For future frames, the object detector and tracker are run on each new incoming frame, such that bounding boxes for each object are obtained. Cropped images of these detected objects are then fed into the CNN of step 4, to obtain flattened vector representations which are then fed into the density estimation model, where probability scores are obtained.
6. An alert is then sent to the user, as described above.
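The per-frame loop of steps 5 and 6 can be summarised in the sketch below; the detector, crop-embedding CNN and density model are passed in as placeholder callables standing for the components built in the earlier steps, so only the plumbing is shown and all names are assumptions.

```python
# Sketch of the combined per-frame loop: detect objects, crop them, embed the
# crops, score the flattened embeddings with the density model, and flag any
# object whose probability score falls below the threshold. The callables are
# placeholders for the components described in steps 2 to 4.
from typing import Callable, List, Tuple
import numpy as np

Box = Tuple[int, int, int, int]   # (x_min, y_min, x_max, y_max)

def process_frame(
    frame: np.ndarray,
    detect: Callable[[np.ndarray], List[Box]],
    embed_crop: Callable[[np.ndarray], np.ndarray],
    score: Callable[[np.ndarray], float],
    threshold: float = 0.05,
) -> List[Box]:
    """Return the bounding boxes of objects scored as anomalous in this frame."""
    anomalous = []
    for (x_min, y_min, x_max, y_max) in detect(frame):
        crop = frame[y_min:y_max, x_min:x_max]       # cropped image of the object
        vector = embed_crop(crop).ravel()            # flattened CNN representation
        if score(vector) < threshold:                # low probability => alert
            anomalous.append((x_min, y_min, x_max, y_max))
    return anomalous
```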
Any system feature as described herein may also be provided as a method feature, and vice versa. As used herein, means plus function features may be expressed alternatively in terms of their corresponding structure.
Any feature in one aspect of the invention may be applied to other aspects of the invention, in any appropriate combination. In particular, method aspects may be applied to system aspects, and vice versa. Furthermore, any, some and/or all features in one aspect can be applied to any, some and/or all features in any other aspect, in any appropriate combination.
It should also be appreciated that particular combinations of the various features described and defined in any aspects of the invention can be implemented and/or supplied and/or used independently.

Claims

A method of detecting anomalous behaviour, the method comprising the steps of: receiving a first set of input data, comprising one or more digital image frames;
generating a statistical model based on the first set of input data, the statistical model operable to detect one or more objects within the first set of input data;
analysing a second set of input data with respect to the statistical model; and detecting one or more objects within the second set of input data.
A method as claimed in claim 1, wherein the first and/or second set of input data comprises one or more digital videos, formed from the one or more digital image frames.
A method as claimed in any one of claims 1 or 2, wherein the one or more digital videos are recorded from one or more surveillance cameras.
A method as claimed in any preceding claim, wherein the generation of the statistical model is performed using one or more of: Convolutional Neural Networks (CNNs); Deep Convolutional Networks; Recurrent Neural Networks; Reinforced Learning; Scale-Invariant Feature Transform (SIFT) features, and/or optical flow feature vectors.
A method as claimed in any preceding claim, further comprising the steps of:
analysing the first set of input data through one or more filters; and
obtaining one or more filter outputs.
A method as claimed in claim 5, wherein the generation of the statistical model comprises the use of the one or more filter outputs.
A method as claimed in any one of claims 5 or 6, wherein the one or more filters comprise one or more of: Convolutional Neural Networks (CNNs); Deep Convolutional Networks; Recurrent Neural Networks; Reinforced Learning; Scale-Invariant Feature Transform (SIFT) features, and/or optical flow feature vectors.
A method as claimed in any preceding claim, wherein the one or more objects comprise one or more of: vehicles; human beings; animals; plants; buildings; and/or weather formations.
A method as claimed in any preceding claim, wherein the statistical model is operable to track one or more objects in the first and/or second set of input data.
A method as claimed in any preceding claim, wherein the statistical model is operable to detect anomalous objects in the first and/or second set of input data.
A method as claimed in any preceding claim, wherein the analysis of the second set of input data is unsupervised.
A method as claimed in any preceding claim, wherein the analysis of the second set of input data occurs in real time.
An apparatus for detecting anomalous behaviour, comprising:
means for receiving a first set of input data, comprising one or more digital image frames;
means for generating a statistical model based on the first set of input data, the statistical model operable to detect one or more objects within the first set of input data;
means for analysing a second set of input data with respect to the statistical model; and
means for detecting one or more objects within the second set of input data.
A system operable to perform the method of any one of claims 1 to 12.
A computer program product operable to perform the method and/or apparatus and/or system of any preceding claim.
PCT/GB2018/052478 2017-08-31 2018-08-31 Anomaly detection from video data from surveillance cameras WO2019043406A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
GBGB1713977.5A GB201713977D0 (en) 2017-08-31 2017-08-31 Anomaly detection
GB1713977.5 2017-08-31

Publications (1)

Publication Number Publication Date
WO2019043406A1 true WO2019043406A1 (en) 2019-03-07

Family

ID=60050552

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/GB2018/052478 WO2019043406A1 (en) 2017-08-31 2018-08-31 Anomaly detection from video data from surveillance cameras

Country Status (2)

Country Link
GB (1) GB201713977D0 (en)
WO (1) WO2019043406A1 (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2010111748A1 (en) * 2009-04-01 2010-10-07 Curtin University Of Technology Systems and methods for detecting anomalies from data
WO2015001544A2 (en) * 2013-07-01 2015-01-08 Agent Video Intelligence Ltd. System and method for abnormality detection
GB2554948A (en) * 2016-10-17 2018-04-18 Calipsa Ltd Video monitoring using machine learning

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109934283A (en) * 2019-03-08 2019-06-25 西南石油大学 Adaptive moving object detection method fusing CNN and SIFT optical flow
CN109934283B (en) * 2019-03-08 2023-04-25 西南石油大学 Self-adaptive moving object detection method integrating CNN and SIFT optical flows
CN109978067A (en) * 2019-04-02 2019-07-05 北京市天元网络技术股份有限公司 Trademark search method and device based on convolutional neural networks and scale-invariant feature transform
CN110674334A (en) * 2019-09-16 2020-01-10 南京信息工程大学 Near-duplicate image retrieval method based on consistency-region deep learning features
CN113065431A (en) * 2021-03-22 2021-07-02 浙江理工大学 Human body violation prediction method based on hidden Markov model and recurrent neural network
CN113065431B (en) * 2021-03-22 2022-06-17 浙江理工大学 Human body violation prediction method based on hidden Markov model and recurrent neural network
CN113569675A (en) * 2021-07-15 2021-10-29 郑州大学 Mouse open field experimental behavior analysis method based on ConvLSTM network
CN113569675B (en) * 2021-07-15 2023-05-23 郑州大学 ConvLSTM network-based mouse open field experimental behavior analysis method

Also Published As

Publication number Publication date
GB201713977D0 (en) 2017-10-18

Similar Documents

Publication Publication Date Title
US11468660B2 (en) Pixel-level based micro-feature extraction
CN108053427B (en) Improved multi-target tracking method, system and device based on KCF and Kalman
WO2019043406A1 (en) Anomaly detection from video data from surveillance cameras
Liu et al. Intelligent video systems and analytics: A survey
US9111148B2 (en) Unsupervised learning of feature anomalies for a video surveillance system
US9652863B2 (en) Multi-mode video event indexing
JP5602792B2 (en) Behavior recognition system
US10096235B2 (en) Alert directives and focused alert directives in a behavioral recognition system
US8280153B2 (en) Visualizing and updating learned trajectories in video surveillance systems
Maddalena et al. Stopped object detection by learning foreground model in videos
US8416296B2 (en) Mapper component for multiple art networks in a video analysis system
Duque et al. Prediction of abnormal behaviors for intelligent video surveillance systems
US11017236B1 (en) Anomalous object interaction detection and reporting
CN104378582A (en) Intelligent video analysis system and method based on PTZ video camera cruising
Zabłocki et al. Intelligent video surveillance systems for public spaces–a survey
US10824935B2 (en) System and method for detecting anomalies in video using a similarity function trained by machine learning
Cermeño et al. Intelligent video surveillance beyond robust background modeling
WO2020008667A1 (en) System and method for video anomaly detection
CN110598570A (en) Pedestrian abnormal behavior detection method and system, storage medium and computer equipment
CN114140745A (en) Method, system, device and medium for detecting personnel attributes of construction site
Khan et al. Comparative study of various crowd detection and classification methods for safety control system
KR et al. Moving vehicle identification using background registration technique for traffic surveillance
Micheloni et al. Exploiting temporal statistics for events analysis and understanding
Seidenari et al. Non-parametric anomaly detection exploiting space-time features
Amato et al. Neural network based video surveillance system

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18766020

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS (EPO FORM 1205A DATED 14.07.2020)

122 Ep: pct application non-entry in european phase

Ref document number: 18766020

Country of ref document: EP

Kind code of ref document: A1