WO2008040945A1 - A method of identifying a measure of feature saliency in a sequence of images - Google Patents

A method of identifying a measure of feature saliency in a sequence of images

Info

Publication number: WO2008040945A1
Application number: PCT/GB2007/003707
Authority: WIPO (PCT)
Prior art keywords: feature, motion, salient, predicted, identifying
Other languages: French (fr)
Inventors: Peter Cheung, Christos Bouganis, Yang Liu
Original Assignee: Imperial Innovations Limited
Application filed by Imperial Innovations Limited
Publication of WO2008040945A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00: Image analysis
    • G06T7/20: Analysis of motion
    • G06T7/246: Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/40: Extraction of image or video features
    • G06V10/46: Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V10/462: Salient features, e.g. scale invariant feature transforms [SIFT]


Abstract

A method is provided of identifying feature saliency in a sequence of images. The method comprises extracting a corresponding feature from first and second images in the sequence, calculating an actual motion of the feature, calculating a predicted motion of the feature, and identifying the feature saliency as a function of the difference between the actual and predicted motion.

Description

A METHOD OF IDENTIFYING A MEASURE OF FEATURE SALIENCY IN A
SEQUENCE OF IMAGES
FIELD OF THE INVENTION
This invention relates to the field of video processing. Particularly, the present invention is related to a method and an apparatus for creating the temporal saliency map of a sequence of video frames.
BACKGROUND OF INVENTION
It is well known that the Human Visual System (HVS) has the ability to capture a vast amount of visual information but only a fraction of this reaches higher levels of processing in the brain. This is due to the human's remarkable ability to focus attention only on the parts of the image that are more "interesting" (or more salient) than the other parts, without requiring any attentional effort. These points (or features) of the image that attract human attention are called salient points. Initial research focused on spatial saliency, the detection of salient points in a still image. Recently, some work has used temporal (i.e. changes over time) as well as spatial information to derive the salient points, using so-called spatiotemporal saliency models. From such a map (known as a saliency map), regions of significance can be identified.
However, the existing work on temporal saliency detection is based on a direct extension of the spatial saliency detection model to the time domain or on the concept of motion contrast. Using the idea of motion contrast, temporal salient points are those points (or features) that exhibit motion that is different from the average motion in the image or from the motion in the point's neighbourhood. These existing methods for deriving the spatiotemporal saliency map are still founded mostly on spatial properties of features and do not fully exploit the available time domain information. Nor do they emulate sufficiently the way that the HVS handles the temporal information.
SUMMARY OF INVENTION
The invention is set out in the claims. The invention allows for determining the salient regions in a sequence of video frames based on both spatial and temporal information in the video sequence. Because of the provision of steps such as 1) the processing of spatial information of individual frames in order to extract a plurality of features; 2) a collection of processing modules that provide predictions on the motion of each of the said features in future frames; and 3) a set of combining modules that determine the degree of saliency of the said features based on the degree of unpredictability of the motions of the said features, and/or by selecting different types of prediction modules, the present invention can be applied to different application domains with different saliency characteristics. By changing the parameters in the prediction modules in a time-varying fashion, the present invention can be used to adapt to changing conditions.
The present invention provides a new way of exploiting both the spatial and temporal information of such feature points and presents a general framework for creating a temporal saliency map that is both flexible (i.e. coping with different applications) and adaptable (i.e. adapting to a time-varying environment).
Applications for the present invention include, but are not limited to, the field of communications, where the non-salient parts of the image can be compressed more heavily than the salient parts; the field of computer graphics, where high-fidelity selective rendering can be achieved based on the output of the saliency model; video quality estimation; and the field of intelligent surveillance systems, where only the vital regions are processed or recorded, thus saving both processing time and storage.
BRIEF DESCRIPTION OF THE DRAWINGS
The present invention will be understood more fully from the detailed description given below and from the accompanying drawings of the preferred embodiments of the invention, which, however, should not be taken to limit the invention to the specific embodiments, but are for explanation and understanding only.
Figure 1 is a diagram that shows an overview of the system. Figure 2 is a diagram showing the architecture of the system. Figure 3 is a diagram showing one embodiment of the invention.
Overview
Figure 1 depicts aspects of the present approach. The input to the system is the pixel data from a frame of the video sequence (101). Features such as corners, lines and defined shapes are extracted from the image frame in (102). The coordinates of the features are stored and are compared with the coordinates of the respective features from one or more previous frames to determine the actual motion of each feature in (103). The result is the calculated motion. A predictor is used to predict the motion of each feature in (104). This is carried out from one or more preceding frames in which the feature is identified and its future motion predicted using a model, for example, assuming a predetermined motion behaviour such as linear motion. The actual and the predicted motions are compared and their differences computed using a distance measure (105) to produce a numerical value that represents the degree of saliency for each feature (106). If the degree of saliency is larger than a predetermined or computed threshold, the feature is deemed salient (i.e. significant), otherwise it is deemed non-salient (i.e. not significant). In this way, the degree of saliency is a function of the difference between the predicted and the actual feature motion.
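To make the data flow of Figure 1 concrete, the Python sketch below (not part of the patent; the helper names extract_features, track_features, motion_distance and the predictor interface are hypothetical placeholders for blocks 102 to 105) shows how one per-frame saliency decision could be organised:

```python
import numpy as np

def frame_saliency(prev_frame, curr_frame, predictor, threshold,
                   extract_features, track_features, motion_distance):
    """Hypothetical per-frame pass corresponding to blocks 102-106 of Figure 1."""
    features = extract_features(prev_frame)                 # block 102: e.g. corner detection
    new_positions = track_features(prev_frame, curr_frame, features)  # block 103: tracking
    actual_motion = new_positions - features                 # the "calculated motion"
    predicted_motion = predictor.predict(features)           # block 104: e.g. linear-motion model
    # block 105: distance between actual and predicted motion, one value per feature
    saliency = np.array([motion_distance(a, p)
                         for a, p in zip(actual_motion, predicted_motion)])
    salient_mask = saliency > threshold                      # block 106: threshold decision
    return saliency, salient_mask
```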
In an initialization step (not shown), and every n frames, the system performs detection of "interesting features" in the current frame. The term "interesting features" can include low-level features, e.g. corners, or high-level features, e.g. faces or textures, and can be predetermined or user selected from an appropriate menu. This is performed using existing feature selection or object detection algorithms of the type identified above. Information regarding the coordinates of the detected features is stored in a "pool" of features, for example as a table stored in computer memory. The "pool" of features is updated every n frames in order for new features to be detected by the system. The parameter n can be set according to the scene's activity (the more rapidly the scene changes, the smaller the value of n, increasing the update frequency), user selected, or predetermined. The current system utilizes two realizations of predictors. The first one is realized through the "short-term" temporal saliency map module, which is responsible for the prediction of each feature's motion in the next frame. The second one is realized through the "long-term" temporal saliency map module, which is responsible for detecting any periodicity in the feature's motion.
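One possible, assumed reading of the "pool" of features described above is a simple in-memory table keyed by feature index and refreshed every n frames; the names FeatureRecord and FeaturePool below are illustrative only:

```python
from dataclasses import dataclass, field

@dataclass
class FeatureRecord:
    """One row of the feature 'pool': coordinates plus the saliency-related quantities."""
    position: tuple                                 # (x, y) in the current frame
    history: list = field(default_factory=list)     # last N positions (long-term module)
    dissimilarity: float = 0.0                      # short-term unpredictability measure
    periodicity: float = 0.0                        # long-term periodicity measure

class FeaturePool:
    """Table of detected features, refreshed every n frames."""
    def __init__(self, refresh_interval_n):
        self.n = refresh_interval_n                 # re-detect features every n frames
        self.records = {}                           # feature id -> FeatureRecord

    def needs_refresh(self, frame_index):
        return frame_index % self.n == 0
```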
In more detail, the "short-term" temporal saliency map module detects the features that exhibit discrepancies between their calculated motion and their predicted motion. At the core of the module is an array of predictors, one for each feature. The predictors are responsible for predicting the position of each feature in the next frame. The module stores in the "pool" or table of features information regarding each feature's dissimilarity measure between its calculated and predicted motion.
The "long-term" saliency map module detects features that undergo a periodic motion. The module stores in the "pool" or table of features information regarding the detected degree of periodicity in the features' motion.
Finally, the system uses the information stored in the "pool" or table to produce the temporal saliency map. The module combines the information regarding the predictability about the motion of a feature and the degree of periodicity in its motion that has been detected by the system in order to produce the temporal saliency map.
System realization
This section describes a realization of the present temporal attention model. In the following description some details are set forth, such as the type of motion to be predicted, the specific algorithms used for feature detection and tracking in consecutive frames etc., in order to provide a thorough understanding of how this invention works. However, it will be obvious to one skilled in the art that the present invention may be practiced without these specific details and/or using alternative implementational approaches that will be familiar to the skilled reader.
In one possible approach, it is assumed that features that undergo a linear motion are classified as non-salient, while features with more complicated motion types are classified as salient. The motivation behind this comes from the observation that the human brain can predict linear motion more easily than any other type of motion. However, alternative approaches and assumptions can of course be adopted.
The present realization focuses on the low-level features which are identified every n frames. Shi and Tomasi's algorithm [1] has been adopted for feature detection. This algorithm is chosen because a) it detects the corners in an image with robustness and b) the algorithm has been developed to maximize the quality of tracking, which is important for the subsequent steps of the proposed system.
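Shi and Tomasi's detector [1] is available, for example, as goodFeaturesToTrack in OpenCV; a minimal sketch of the every-n-frames detection step (the parameter values are illustrative assumptions, not taken from the patent) could be:

```python
import cv2
import numpy as np

def detect_features(gray_frame, max_corners=200, quality=0.01, min_distance=7):
    """Shi-Tomasi corner detection [1] on a single grayscale frame."""
    corners = cv2.goodFeaturesToTrack(gray_frame,
                                      maxCorners=max_corners,
                                      qualityLevel=quality,
                                      minDistance=min_distance)
    # goodFeaturesToTrack returns an (N, 1, 2) float32 array, or None if nothing is found
    return np.empty((0, 2), np.float32) if corners is None else corners.reshape(-1, 2)
```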
The method described herein can be implemented by any appropriate computer processor or hardware processor, the results stored in any appropriate memory and displayed in any appropriate manner. In one embodiment, for example, an apparatus for performing the approach is shown in Fig. 2 and includes a feature detection module 200, a feature tracker module 202, a memory module 204, a short term predictor module 206, a long term predictor module 208, a distance measure module to compute the degree of saliency 210 and an output module 212 which may be a display or interface for communication or download of the saliency or other data.
The steps performed can be understood with reference to Fig. 3. The calculation of the features' position in the new frame (k), 300 is performed (step 302) in the feature tracker module. The calculation is based on the pyramidal implementation of the iterative Lucas-Kanade optical flow algorithm [2]. According to this algorithm, the position of a feature in the new frame is calculated (step 304) in the lowest-resolution level within the pyramid and the result is propagated to the next finer level until the original resolution is reached. The output of the module is the position of each feature in the new frame. From this the calculated motion for this feature is computed simply and at step 306 stored in the "pool" or table in the memory module.
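The pyramidal Lucas-Kanade step [2] is likewise available in OpenCV as calcOpticalFlowPyrLK; the sketch below (window size and pyramid depth are illustrative assumptions) tracks the pooled features from the previous frame to the new frame k and returns the calculated motion:

```python
import cv2
import numpy as np

def track_features(prev_gray, curr_gray, prev_pts):
    """Pyramidal Lucas-Kanade tracking [2]: coarse-to-fine position estimate per feature."""
    prev = prev_pts.reshape(-1, 1, 2).astype(np.float32)
    curr, status, _err = cv2.calcOpticalFlowPyrLK(
        prev_gray, curr_gray, prev, None,
        winSize=(15, 15), maxLevel=3)               # 3 pyramid levels, coarse to fine
    curr = curr.reshape(-1, 2)
    ok = status.ravel() == 1                        # keep only successfully tracked features
    calculated_motion = curr - prev_pts.reshape(-1, 2)   # per-feature displacement
    return curr, calculated_motion, ok
```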
To implement feature tracking at step 310, the implementation of the system described here is based on the non-salient assumption of linearity; thus the Kalman filter [3] is used as the prediction instrument at step 314 to determine short-term temporal saliency in the short term prediction module, taking into account the feature coordinates in frame R (312) stored in the pool of features 306. The measurements for the Kalman filters are the coordinates of the features in the previous frame. The output of the filter is a prediction for the position of the features in the current frame (308a, 308b). The state vector is composed of four variables: the two components of a feature's position in the image, and the two components of its velocity. The state vector is denoted by

$$\mathbf{x} = \begin{bmatrix} x & y & x' & y' \end{bmatrix}^T \qquad (1)$$

where x and y are the coordinates of the feature, and x' and y' are the velocity components in the x and y directions respectively.
The transition matrix from one state to another is denoted by F. The values in the matrix guarantee that the model follows a linear motion with constant velocity. Finally, the measurement matrix is denoted by H. The entries of the matrix denote that the only measurements that are used by the system are the coordinates of the features. The values for F and H matrices are shown below:
$$F = \begin{bmatrix} 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 1 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}, \qquad H = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \end{bmatrix}$$
The covariance matrices for the model noise and the measurement noise are assumed to be diagonal. The measurements Zk denote the coordinates of a feature at time k. The Kalman filter has two stages: the prediction and the update stage. During the prediction stage, the coordinates in the next time step of each feature are predicted using the "project state" and "project covariance" equations, where the Kalman filter parameters are updated using the "Kalman gain", "update estimate" and "update covariance" equations. Table 1 summarizes the notation used in the Kalman filters, where Table 2 summarizes the aforementioned equations.
Table 1. Kalman filter's symbol definition
(The symbol table is an image in the source. The symbols used in Table 2 are: F, the state transition matrix; H, the measurement matrix; Q, the model noise covariance; R, the measurement noise covariance; P, the estimation error covariance; K, the Kalman gain; x̂, the state estimate, where a superscript minus denotes the predicted (prior) value; and Z_k, the measurement, i.e. the feature coordinates at time k.)
Table 2. Kalman filter's update and prediction equations
Kalman gain: $K_k = P_k^{-} H^T \left( H P_k^{-} H^T + R \right)^{-1}$
Update estimate: $\hat{x}_k = \hat{x}_k^{-} + K_k \left( Z_k - H \hat{x}_k^{-} \right)$
Update covariance: $P_k = \left( I - K_k H \right) P_k^{-}$
Project state: $\hat{x}_k^{-} = F \hat{x}_{k-1}$
Project covariance: $P_k^{-} = F P_{k-1} F^T + Q$
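A minimal NumPy sketch of the per-feature constant-velocity Kalman filter defined by the F and H matrices and the Table 2 equations is given below; the diagonal noise covariances Q and R use illustrative values, since the patent only assumes they are diagonal:

```python
import numpy as np

class ConstantVelocityKalman:
    """Per-feature Kalman filter [3] with state [x, y, x', y'] and a linear-motion model."""
    F = np.array([[1, 0, 1, 0],
                  [0, 1, 0, 1],
                  [0, 0, 1, 0],
                  [0, 0, 0, 1]], dtype=float)
    H = np.array([[1, 0, 0, 0],
                  [0, 1, 0, 0]], dtype=float)

    def __init__(self, x0, y0, q=1e-2, r=1e-1):
        self.x = np.array([x0, y0, 0.0, 0.0])   # state estimate
        self.P = np.eye(4)                       # estimation error covariance
        self.Q = q * np.eye(4)                   # model (process) noise, diagonal
        self.R = r * np.eye(2)                   # measurement noise, diagonal

    def predict(self):
        # Project state and project covariance (Table 2)
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.x.copy(), self.P.copy()

    def update(self, z):
        # Kalman gain, update estimate, update covariance (Table 2)
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)
        self.x = self.x + K @ (np.asarray(z, dtype=float) - self.H @ self.x)
        self.P = (np.eye(4) - K @ self.H) @ self.P
        return self.x.copy(), self.P.copy()
```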
The system uses the prediction regarding the features' positions that is provided by the Kalman filter to calculate the predicted velocity of each feature. The predicted velocity is stored as the predicted motion of the feature in the memory module and is employed by the system for the saliency detection. The amount of information that we learn from the new measurement about the velocity of the feature is used as a measure of unpredictability. This is achieved at step 316 by measuring the distance between the probability distributions of the feature's predicted velocities before and after the measurement. To quantify the distance between the two distributions, the Bhattacharyya distance [4] is applied. Under the Kalman filter formulation, the variables in the state vector are assumed to follow a Normal probability distribution. The Bhattacharyya distance between two Gaussian distributions q and p can be expressed as follows:
$$D_B(p, q) = \sum_{i} \left[ \frac{1}{4} \ln\!\left( \frac{1}{4}\left( \frac{\sigma_{p,i}^2}{\sigma_{q,i}^2} + \frac{\sigma_{q,i}^2}{\sigma_{p,i}^2} + 2 \right) \right) + \frac{1}{4}\, \frac{(\mu_{p,i} - \mu_{q,i})^2}{\sigma_{p,i}^2 + \sigma_{q,i}^2} \right]$$
Here q and p denote the prior and posterior distributions for the velocity of the features. The variable $\sigma_{p,i}^2$ denotes the variance of the i-th component of the velocity in the p distribution, and $\mu_{p,i}$ denotes the corresponding mean. The mean is calculated using the Kalman filter's prediction, while the variance is extracted from the estimation error covariance matrix P.
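Under the diagonal-covariance assumption the distance reduces to a sum over the velocity components; a small helper (a sketch, with the prior q taken from the Kalman prediction and the posterior p from the updated estimate, as described above) might be:

```python
import numpy as np

def bhattacharyya_gaussian(mu_q, var_q, mu_p, var_p):
    """Bhattacharyya distance [4] between two Gaussians with diagonal covariance,
    summed over the velocity components (prior q vs. posterior p)."""
    mu_q, var_q = np.asarray(mu_q, float), np.asarray(var_q, float)
    mu_p, var_p = np.asarray(mu_p, float), np.asarray(var_p, float)
    term_var = 0.25 * np.log(0.25 * (var_p / var_q + var_q / var_p + 2.0))
    term_mean = 0.25 * (mu_p - mu_q) ** 2 / (var_p + var_q)
    return float(np.sum(term_var + term_mean))
```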
The short-term saliency system has the ability to cope with both fast and slow motions. This is achieved by updating the appropriate predictor in the array of predictors with the coordinates of the feature according to its average speed. Using this motion interpolation, slow and fast moving features are treated equally well by the system.
The long-term saliency map is based on the periodicity in the motion of a feature. The system maintains a history of N frames regarding the features' position (318). Their periodicity, if any, is detected in the long-term prediction module 320 through the use of the autocorrelation function R(t), where t denotes the lag.
$$R(t) = \frac{E\!\left[ (X_k - m)^T (X_{k+t} - m) \right]}{\sigma_x \sigma_y}$$
The vector X denotes the position of a feature in the image, and the vector m denotes its mean value. The variable σx denotes the standard deviation of the feature's position in the x direction, and the variable σy denotes the standard deviation in the y direction. The long-term periodicity value 322 is stored in the pool 306 in the memory module (R(t) is stored for different values of t).
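One way to realise the normalised autocorrelation R(t) over the stored N-frame position history is sketched below; the rule of reporting the largest R(t) over the candidate lags as the periodicity value is an assumption, since the patent only requires a degree-of-periodicity value per feature:

```python
import numpy as np

def periodicity_measure(history, max_lag):
    """Normalised autocorrelation R(t) of a feature's (x, y) trajectory over N frames.
    Returns the largest R(t) for 1 <= t <= max_lag as the periodicity value."""
    X = np.asarray(history, dtype=float)          # shape (N, 2): positions over N frames
    m = X.mean(axis=0)                            # mean position vector m
    sx, sy = X[:, 0].std(), X[:, 1].std()         # sigma_x, sigma_y
    if sx == 0 or sy == 0:
        return 0.0                                # stationary feature: no periodicity
    dev = X - m
    N = len(X)
    R = []
    for t in range(1, min(max_lag, N - 1) + 1):
        # E[(X_k - m)^T (X_{k+t} - m)] estimated over the available frame pairs
        num = np.mean(np.sum(dev[:N - t] * dev[t:], axis=1))
        R.append(num / (sx * sy))
    return float(max(R)) if R else 0.0
```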
In the last stage 324, the information regarding the features' position, the level of periodicity that they exhibit, and their dissimilarity measures between the prior and posterior distribution of the velocity are combined together in the saliency module 326. The features that have large values in their dissimilarity measure, implying that they do not undergo a linear motion, i.e. that exceed a saliency threshold, give rise to salient regions. The threshold is a means of removing features that have a low degree of saliency. It may be set to zero, in which case the saliency map will simply be a 2-dimensional array of numbers representing how significant the scene is at each location. However, if the map is used to make decisions, such as whether a location is worth further processing and attention, or to display the locations of salient features on a monitor, then a threshold is set appropriate to the application. Any such features that are detected to undergo a periodic motion are removed from the temporal saliency map. The resulting salient map 328 is smoothed by a 2D Gaussian filter which produces the final temporal saliency map comprising a 2D array of numbers, one for each pixel location at the appropriate resolution, to indicate the degree of saliency (i.e. significance). The size of the Gaussian filter and its extent depend on the image resolution.
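A sketch of this final combination stage follows; the periodicity threshold, saliency threshold and Gaussian width are illustrative assumptions, the patent only requiring that periodic features are removed, a threshold applied where appropriate, and the map smoothed by a 2D Gaussian filter:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def build_saliency_map(shape, positions, dissimilarity, periodicity,
                       saliency_threshold=0.0, periodicity_threshold=0.8, sigma=5.0):
    """Combine short-term dissimilarity and long-term periodicity into the final
    temporal saliency map: one value per pixel location."""
    saliency_map = np.zeros(shape, dtype=float)
    for (x, y), d, p in zip(positions, dissimilarity, periodicity):
        if p >= periodicity_threshold:      # periodic features are removed from the map
            continue
        if d <= saliency_threshold:         # below-threshold features are non-salient
            continue
        r, c = int(round(y)), int(round(x))
        if 0 <= r < shape[0] and 0 <= c < shape[1]:
            saliency_map[r, c] = max(saliency_map[r, c], d)
    return gaussian_filter(saliency_map, sigma=sigma)   # 2D Gaussian smoothing
```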
It is important to note that the present invention can easily be adapted to detect saliency in different applications. Some example variations to the above embodiment are:
1) Different feature extraction methods can be used. For example, one could focus on extracting oval-like objects to detect faces, or objects having rectangular or other geometric or definable shapes;
2) Different predictors can be used. For example, the above embodiment uses a Kalman filter as the predictor to predict linear motion. As a result, features exhibiting motion that is not linear are detected as salient. Replacing the predictor with one that is good at predicting sinusoidal motion will eliminate wave-like motion from the saliency map. Furthermore, a plurality of predictors can be operated in parallel to eliminate a plurality of motion behaviours.
3) Parameters within the predictor can be varied in time. In this way, the invention can be used to cope with time-varying factors due to, say, a changing environment.
4) The resolution at which respective stages of the algorithm are performed can be varied.
In the above description of the present invention numerous specific details are set forth, such as the specific feature extraction method, the Kalman filter predictor, the Bhattacharyya distance measure etc., in order to provide a thorough understanding of the present invention. However, it will be obvious to one skilled in the art that the present invention may be practiced without these specific details. In other instances well known methods, functions, components and procedures have not been described in detail so as not to unnecessarily obscure the present invention. Furthermore, the present invention can easily be modified in such a way that different processing steps can be applied to the video frames presented at different resolutions.
[1] J.Shi and C. Tomasi, "Good features to track", in IEEE Conference on Computer Vision and Pattern Recognition, 1994
[2] J. Bouguet, "Pyramidal implementation of the Lucas-Kanade feature tracker. Description of the algorithm", 2001 (http://sourceforge.net/projects/opencvlibrary)
[3] G. Welch and G. Bishop, "An Introduction to the Kalman Filter", Technical Report, TR 95-041, Department of Computer Science, University of North Carolina at Chapel Hill.
[4] A. Djouadi, O. Snorrason and F. Garber, "The quality of training-sample estimates of the Bhattacharyya coefficient", IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 12, pp. 92-97, 1990

Claims

1. A method of identifying feature saliency in a sequence of images comprising extracting a corresponding feature from first and second images in the sequence, calculating an actual motion of the feature, calculating a predicted motion of the feature and identifying the feature saliency as a function of the difference between the actual and predicted motion.
2. A method as claimed in claim 1 in which the extracted feature comprises one of a geometric feature, a colour feature or a texture.
3. A method as claimed in claim 1 or claim 2 in which the feature to be extracted is identified based on a predetermined characteristic in at least one of an initialization step and/or at repeated intervals in an image sequence.
4. A method as claimed in any preceding claim in which the predicted motion is predicted using a prediction model based on motion of the feature in one or more preceding images.
5. A method as claimed in any preceding claim in which the predicted motion is predicted using a prediction model identifying periodic motion.
6. A method as claimed in claim 4 or 5 in which a feature exhibiting predicted or periodic motion is excluded from being a salient feature.
7. A method as claimed in any preceding claim in which an extracted feature is stored together with its calculated actual and predicted motion in a store.
8. A method as claimed in any preceding claim in which the predicted motion for a feature is predicted using a prediction model specific to that feature.
9. A method as claimed in any preceding claim in which the extracted feature is extracted using an extraction process specific to that feature.
10. A method as claimed in any preceding claim in which a feature is identified as salient if the difference between actual and predicted motion exceeds a saliency threshold.
11. A method as claimed in any preceding claim further comprising constructing a map of identified salient features.
12. A method of compressing data comprising identifying salient features according to the method of any of claims 1 to 11 and discarding data relating to non salient features.
13. A method as claimed in claim 12 in which all non-salient feature data is discarded.
14. A method as claimed in claim 13 in which a proportion of non-salient feature data is discarded.
15. A method of processing multiple sequential images for display comprising identifying a salient feature according to the method of any of claims 1 to 12 and rendering a non-salient feature at a lower resolution for display.
16. A method of monitoring a sequence of images comprising identifying salient feature according to the method of any of claims 1 to 12 and storing only data relating to the salient feature.
17. A computer programme comprising a set of instructions configured to carry out the method steps of any of claims 1 to 16. (This method may be practised as a computer program in software. However, due to the high data rate and large amount of computation, it may also be partly implemented using hardware. Therefore this may be too restrictive.)
18. A computer arranged to operate according to a set of instructions as claimed in claim 17.
19. A video surveillance apparatus including the computer as claimed in claim 18.
PCT/GB2007/003707 2006-10-06 2007-09-28 A method of identifying a measure of feature saliency in a sequence of images WO2008040945A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
GBGB0619817.0A GB0619817D0 (en) 2006-10-06 2006-10-06 A method of identifying a measure of feature saliency in a sequence of images
GB0619817.0 2006-10-06

Publications (1)

Publication Number Publication Date
WO2008040945A1 true WO2008040945A1 (en) 2008-04-10

Family

ID=37454140

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/GB2007/003707 WO2008040945A1 (en) 2006-10-06 2007-09-28 A method of identifying a measure of feature saliency in a sequence of images

Country Status (2)

Country Link
GB (1) GB0619817D0 (en)
WO (1) WO2008040945A1 (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101697593A (en) * 2009-09-08 2010-04-21 武汉大学 Time domain prediction-based saliency extraction method
US8098886B2 (en) 2001-03-08 2012-01-17 California Institute Of Technology Computation of intrinsic perceptual saliency in visual environments, and applications
CN103020985A (en) * 2012-11-12 2013-04-03 华中科技大学 Video image saliency detection method based on field quantity analysis
CN103077536A (en) * 2012-12-31 2013-05-01 华中科技大学 Space-time mutative scale moving target detection method
CN103077534A (en) * 2012-12-31 2013-05-01 南京华图信息技术有限公司 Space-time multi-scale moving target detection method
US8649606B2 (en) 2010-02-10 2014-02-11 California Institute Of Technology Methods and systems for generating saliency models through linear and/or nonlinear integration
CN104778721A (en) * 2015-05-08 2015-07-15 哈尔滨工业大学 Distance measuring method of significant target in binocular image
CN113591708A (en) * 2021-07-30 2021-11-02 金陵科技学院 Meteorological disaster monitoring method based on satellite-borne hyperspectral image

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1225769A2 (en) * 2001-01-17 2002-07-24 Tektronix, Inc. Spatial temporal visual attention model for a video frame sequence
US20040086046A1 (en) * 2002-11-01 2004-05-06 Yu-Fei Ma Systems and methods for generating a motion attention model
WO2004043054A2 (en) * 2002-11-06 2004-05-21 Agency For Science, Technology And Research A method for generating a quality oriented significance map for assessing the quality of an image or video
WO2006072637A1 (en) * 2005-01-10 2006-07-13 Thomson Licensing Device and method for creating a saliency map of an image

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1225769A2 (en) * 2001-01-17 2002-07-24 Tektronix, Inc. Spatial temporal visual attention model for a video frame sequence
US20040086046A1 (en) * 2002-11-01 2004-05-06 Yu-Fei Ma Systems and methods for generating a motion attention model
WO2004043054A2 (en) * 2002-11-06 2004-05-21 Agency For Science, Technology And Research A method for generating a quality oriented significance map for assessing the quality of an image or video
WO2006072637A1 (en) * 2005-01-10 2006-07-13 Thomson Licensing Device and method for creating a saliency map of an image

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
CHIA-CHIANG HO ET AL: "A user-attention based focus detection framework and its applications", INFORMATION, COMMUNICATIONS AND SIGNAL PROCESSING, 2003 AND FOURTH PACIFIC RIM CONFERENCE ON MULTIMEDIA. PROCEEDINGS OF THE 2003 JOINT CONFERENCE OF THE FOURTH INTERNATIONAL CONFERENCE ON SINGAPORE 15-18 DEC. 2003, PISCATAWAY, NJ, USA,IEEE, 15 December 2003 (2003-12-15), pages 1315 - 1319, XP010702992, ISBN: 0-7803-8185-8 *
WIXSON L ET AL: "Detecting salient motion by accumulating directionally-consistent flow", COMPUTER VISION, 1999. THE PROCEEDINGS OF THE SEVENTH IEEE INTERNATIONAL CONFERENCE ON KERKYRA, GREECE 20-27 SEPT. 1999, LOS ALAMITOS, CA, USA,IEEE COMPUT. SOC, US, vol. 2, 20 September 1999 (1999-09-20), pages 797 - 804, XP010350486, ISBN: 0-7695-0164-8 *
YANG LIU ET AL: "A Spatiotemporal Saliency Framework", IMAGE PROCESSING, 2006 IEEE INTERNATIONAL CONFERENCE ON, IEEE, PI, October 2006 (2006-10-01), pages 437 - 440, XP031048667, ISBN: 1-4244-0480-0 *

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8098886B2 (en) 2001-03-08 2012-01-17 California Institute Of Technology Computation of intrinsic perceptual saliency in visual environments, and applications
CN101697593A (en) * 2009-09-08 2010-04-21 武汉大学 Time domain prediction-based saliency extraction method
US8649606B2 (en) 2010-02-10 2014-02-11 California Institute Of Technology Methods and systems for generating saliency models through linear and/or nonlinear integration
CN103020985A (en) * 2012-11-12 2013-04-03 华中科技大学 Video image saliency detection method based on field quantity analysis
CN103077536A (en) * 2012-12-31 2013-05-01 华中科技大学 Space-time mutative scale moving target detection method
CN103077534A (en) * 2012-12-31 2013-05-01 南京华图信息技术有限公司 Space-time multi-scale moving target detection method
CN103077534B (en) * 2012-12-31 2015-08-19 南京华图信息技术有限公司 Spatiotemporal object moving target detecting method
CN103077536B (en) * 2012-12-31 2016-01-13 华中科技大学 Space-time mutative scale moving target detecting method
CN104778721A (en) * 2015-05-08 2015-07-15 哈尔滨工业大学 Distance measuring method of significant target in binocular image
CN113591708A (en) * 2021-07-30 2021-11-02 金陵科技学院 Meteorological disaster monitoring method based on satellite-borne hyperspectral image
CN113591708B (en) * 2021-07-30 2023-06-23 金陵科技学院 Meteorological disaster monitoring method based on satellite-borne hyperspectral image

Also Published As

Publication number Publication date
GB0619817D0 (en) 2006-11-15

Similar Documents

Publication Publication Date Title
EP1975879B1 (en) Computer implemented method for tracking object in sequence of frames of video
WO2008040945A1 (en) A method of identifying a measure of feature saliency in a sequence of images
JP4208898B2 (en) Object tracking device and object tracking method
CN109272509B (en) Target detection method, device and equipment for continuous images and storage medium
KR101643672B1 (en) Optical flow tracking method and apparatus
CN107256225B (en) Method and device for generating heat map based on video analysis
EP2378485B1 (en) Moving object detection method and moving object detection apparatus
JP4699564B2 (en) Visual background extractor
US7822275B2 (en) Method for detecting water regions in video
JP2978406B2 (en) Apparatus and method for generating motion vector field by eliminating local anomalies
EP2352128B1 (en) Mobile body detection method and mobile body detection apparatus
CN108182695B (en) Target tracking model training method and device, electronic equipment and storage medium
CN106875426B (en) Visual tracking method and device based on related particle filtering
CN110827320B (en) Target tracking method and device based on time sequence prediction
Mahmoudi et al. Multi-gpu based event detection and localization using high definition videos
CN112036381B (en) Visual tracking method, video monitoring method and terminal equipment
CN111914756A (en) Video data processing method and device
CN102314591B (en) Method and equipment for detecting static foreground object
CN107346547B (en) Monocular platform-based real-time foreground extraction method and device
KR101799143B1 (en) System and method for estimating target size
CN113542868A (en) Video key frame selection method and device, electronic equipment and storage medium
Okarma et al. A fast image analysis technique for the line tracking robots
Kim et al. Video object segmentation and its salient motion detection using adaptive background generation
KR20210132998A (en) Apparatus and method tracking object in image fames based on neural network
CN113807354A (en) Image semantic segmentation method, device, equipment and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 07823965

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 07823965

Country of ref document: EP

Kind code of ref document: A1