NL2004878C2 - System and method for detecting a person's direction of interest, such as a person's gaze direction.


Info

Publication number
NL2004878C2
NL2004878C2
Authority
NL
Netherlands
Prior art keywords
interest
processor
real time
person
determined
Prior art date
Application number
NL2004878A
Other languages
Dutch (nl)
Inventor
Vladimir Nedovic
Roberto Valenti
Original Assignee
Univ Amsterdam
Priority date
Filing date
Publication date
Application filed by Univ Amsterdam filed Critical Univ Amsterdam
Priority to NL2004878A priority Critical patent/NL2004878C2/en
Priority to PCT/NL2011/050423 priority patent/WO2012008827A1/en
Application granted granted Critical
Publication of NL2004878C2 publication Critical patent/NL2004878C2/en

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00: Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/01: Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F 3/011: Arrangements for interaction with the human body, e.g. for user immersion in virtual reality
    • G06F 3/013: Eye tracking input arrangements

Landscapes

  • Engineering & Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Image Analysis (AREA)

Description

SYSTEM AND METHOD FOR DETECTING A PERSON'S DIRECTION OF INTEREST, SUCH AS A PERSON'S GAZE DIRECTION
The invention relates to a system for detecting a person's direction of interest, such as a person's gaze direction.
The system can also be used to detect other visual directions of interest of a person, such as the direction of a person's head, the person's eye, the person's arm and/or finger pointing, or the person's whole body, or a combination thereof.
Visual gaze estimation is the process which determines the 3D line of sight of a person in order to analyze the location of interest. The estimation of the direction or the location of interest of a user is key for many applications, spanning from gaze-based human-computer interaction and advertisement [see: Smith, K., Ba, S.O., Odobez, J.M., Gatica-Perez, D.: Tracking the visual focus of attention for a varying number of wandering people. PAMI 30 (2008)] to human cognitive state analysis, attentive interfaces (e.g. a gaze-controlled mouse) and human behavior analysis. Gaze direction can also provide high-level semantic cues such as who is speaking to whom, information on non-verbal communication (e.g. interest, pointing with the head or the eyes) and the mental state or attention of a user (e.g. a driver).
Overall, visual gaze estimation is important for understanding someone's attention, motivation and intentions.
Typically, the pipeline of estimating visual gaze consists of two steps (see Figure 2): (1) analyze and transform pixel-based image features obtained from sensory devices into a higher-level representation (e.g. the position of the head or the location of the eyes), and (2) map these features to estimate the visual gaze vector (line of sight), hence finding the area of interest in the scene.
There is an abundance of research in the literature concerning the first component of the pipeline, which principally covers methods to estimate the head position and the eye location, as both are contributing factors to the final estimation of the visual gaze [see: Langton, S.R., Honeyman, H., Tessler, E.: The influence of head contour and nose angle on the perception of eye-gaze direction. Perception & Psychophysics 66 (2004)].
Nowadays, commercial eye gaze trackers are among the most successful visual gaze devices. However, to achieve good detection accuracy, they have the drawback of using intrusive or expensive sensors (pointed infrared cameras) which cannot be used in daylight and often limit the possible movement of the head, or require the user to wear the device [see: Bates, R., Istance, H., Oosthuizen, L., Majaranta, P.: Survey of de-facto standards in eye tracking. In: COGAIN Conf. on Comm. by Gaze Inter. (2005)]. Therefore, eye center locators based solely on appearance have recently been proposed [see: Cristinacce, D., Cootes, T., Scott, I.: A multi-stage approach to facial feature detection. In: BMVC. (2004) 277-286; Kroon, B., Boughorbel, S., Hanjalic, A.: Accurate eye localization in webcam content. In: FG. (2008); and Valenti, R., Gevers, T.: Accurate eye center location and tracking using isophote curvature. In: CVPR. (2008)], which reach reasonable accuracy for roughly estimating the area of attention on a screen in the second step of the pipeline.
A recent survey [Hansen, D.W., Ji, Q.: In the eye of the beholder: A survey of models for eyes and gaze. PAMI 32 (2010)] discusses the different methodologies to obtain the eye location information through video-based devices. Some of the methods can also be used to estimate the face location and the head pose in geometric head pose estimation methods. Other methods in this category track the appearance between video frames, or treat the problem as an image classification one, often interpolating the results between known poses. The survey in [Murphy-Chutorian, E., Trivedi, M.: Head pose estimation in computer vision: A survey. PAMI 31 (2009)] gives a good overview of appearance-based head pose estimation methods.
Once the correct features are determined using one of the methods and devices discussed above, the second step in gaze estimation (see Figure 2) is to map the obtained information to the 3D scene in front of the user. In eye gaze trackers, this is often achieved by direct mapping of the eye center position to the screen location. This requires the system to be calibrated and often limits the possible position of the user (e.g. using chinrests). In case of 3D visual gaze estimation, this often requires the intrinsic camera parameters to be known. Failure to correctly calibrate or comply with the restrictions of the gaze estimation device may result in wrong estimations of the gaze.
The invention aims at a more accurate, user-friendly and/or cheaper system for detecting a person's direction of interest.
To that end, the system comprises: a processor; at least one video camera connected to said processor for capturing video data; electronic memory connected to said processor; wherein said processor is arranged to determine in real time an interest vector of a person from said video data; wherein said processor is arranged to determine in real time a salient peak closest to the determined interest vector; wherein said processor is arranged to determine in real time a saliency-corrected interest vector between said person and said closest salient peak; wherein said processor is arranged to determine in real time the deviation between the determined interest vector and the determined saliency-corrected interest vector; wherein said processor is arranged to determine in real time further interest vectors of said person from said video data; and wherein said processor is arranged to calculate in real time recalibrated interest vectors by using a calibration error value calculated from said determined deviation.
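By way of illustration only (this is not the patented implementation, and all names below are hypothetical), the data flow just summarized can be sketched minimally in Python, with 2D fixation points standing in for interest vectors and the calibration error value taken, for simplicity, as the mean deviation:

```python
import numpy as np

def recalibrate(fixations, salient_peaks):
    """Hypothetical sketch of the claimed flow: estimated fixations are
    pulled to their nearest salient peak, the deviations yield a
    calibration error value, and that value recalibrates the estimates."""
    fixations = np.asarray(fixations, dtype=float)   # (N, 2) estimated gaze points
    peaks = np.asarray(salient_peaks, dtype=float)   # (M, 2) salient peak locations

    # Saliency-corrected fixation: the salient peak closest to each estimate.
    dists = np.linalg.norm(fixations[:, None, :] - peaks[None, :, :], axis=2)
    corrected = peaks[np.argmin(dists, axis=1)]

    # Deviations between determined and saliency-corrected fixations.
    deviations = corrected - fixations

    # Simplest possible calibration error value: the mean deviation.
    calibration_error = deviations.mean(axis=0)

    # Recalibrated fixations compensate for the estimated error.
    return fixations + calibration_error, calibration_error
```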
Preferably said processor is arranged to determine in real time the salient peaks closest to a multitude of determined interest vectors; said processor is arranged to determine in real time a multitude of saliency-corrected interest vectors between the person and said closest salient peaks; wherein said processor is arranged to determine in real time the deviations between the multitude of determined interest vectors and the multitude of saliency-corrected interest vectors; wherein said calibration error value is calculated from said multitude of determined deviations.
Preferably said processor is arranged to iterate in real time said process of calculating said calibration error value by replacing previously determined interest vectors with interest vectors which are corrected using a previous calibration error value, for calculating a current calibration error value.
Preferably said processor is arranged to calculate in real time said calibration error value by minimizing the difference between the multitude of determined deviations and said calibration error value, for instance by using a weighted least-squares error minimization method.
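A minimal sketch of such a minimization, under the assumption that the calibration error value is modeled as a constant 2D offset (names hypothetical): the offset minimizing the weighted squared differences to the deviations is simply their weighted mean.

```python
import numpy as np

def weighted_calibration_error(deviations, weights):
    """Sketch: the constant offset e minimizing sum_i w_i * ||d_i - e||^2
    is the weighted mean of the observed deviations d_i."""
    d = np.asarray(deviations, dtype=float)  # (N, 2) determined deviations
    w = np.asarray(weights, dtype=float)     # (N,) confidence weights
    return (w[:, None] * d).sum(axis=0) / w.sum()
```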
Preferably said salient peaks in the region around the determined interest vector are determined using saliency data about the area which the person is expected to look at, such as video data, screen capture data or manually input data, such as annotated saliency data.
In one preferred embodiment said processor is arranged to determine in real time salient peaks in the region around the determined interest vector from video data before determining said salient peak closest to the determined interest vector.
In a further preferred embodiment said system comprises at least two video cameras connected to said processor, one camera for capturing video data of a person's face and/or body, and one camera for capturing said video data.
In an alternative preferred embodiment the processor, electronic memory and said at least two video cameras are combined in one device. The device may for instance be a smartphone, having a video camera in the back aimed at an area of interest, and a webcam in the front aimed at the user's face and eyes. A smartphone with gaze detection capabilities is described in US 2010/0079508, wherein gaze detection is used to determine if a person is looking at the screen of the smartphone. By using the teaching of the current invention, the smartphone can be used to detect which objects behind the smartphone the person is looking at.
The invention furthermore relates to a method for detecting a person's interest direction, wherein a processor performs the steps of: determining in real time an interest vector of a person from video data captured by a video camera; determining in real time a salient peak closest to the determined interest vector; determining in real time a saliency-corrected interest vector between said person and said closest salient peak; determining in real time the deviation between the determined interest vector and the determined saliency-corrected interest vector; determining in real time further interest vectors of said person from said video data; and calculating in real time recalibrated interest vectors by using a calibration error value calculated from said determined deviation.
The invention also relates to a computer software program arranged to run on a processor to perform the steps of the method of the invention, a computer-readable data carrier comprising a computer software program arranged to run on a processor to perform the steps of the method of the invention, and a computer comprising a processor and electronic memory connected thereto loaded with a computer software program arranged to perform the steps of the method of the invention.
A preferred embodiment of the invention is described in more detail below with reference to the drawings, in which:
Figure 1 is a perspective view of a system in accordance with the invention; and
Figure 2 is a flow chart of the system in accordance with the invention.
According to figure 1, a system for detecting a person's gaze direction comprises a computer 1 with, amongst others, a processor unit, system memory and a hard drive, and a video camera 2 aimed at the face of a person 6, connected to, for instance, a USB port of the computer 1. A second camera behind the person, which is aimed at the area where the person 6 is looking, is also connected to a USB port of the computer 1. A software program is loaded from the hard drive into the system memory of the computer 1 in order to perform the steps of the gaze detection method.
An image 4 having several (salient) objects (in this example a car and its components) that may be of interest to the person is present in front of the person 6. Alternatively, the system may be used to determine the gaze direction in a physical environment where (salient) real-world objects of interest are present.
According to figure 2, a visual gaze vector can be resolved from a combination of body/head pose and eye location obtained from imaging device 2 in component I (box 10). As this is a rough estimation, the obtained gaze line 13 in component II (box 11) is then followed until an uncertain location in the gazed area. The area of interest, in this example obtained from imaging device 5, is analyzed in component III (box 12). In the proposed framework, the gaze vector 13 will be steered (arrow 14) to the most probable (salient) object which is close to the previously estimated point of interest. It has been shown that salient objects attract eye fixations [see: Spain, M., Perona, P.: Some objects are more equal than others: Measuring and predicting importance. In: ECCV. (2009); and Einhauser, W., Spain, M., Perona, P.: Objects predict fixations better than early saliency. J. Vis. 8 (2008) 1-26], and this property is extensively used in the literature to create saliency maps (probability maps which represent the likelihood of receiving an eye fixation) to automate the generation of fixation maps [see: Judd, T., Ehinger, K., Durand, F., Torralba, A.: Learning to predict where humans look. In: ICCV. (2009); and Peters, R.J., Iyer, A., Koch, C., Itti, L.: Components of bottom-up gaze allocation in natural scenes. J. Vis. 5 (2005) 692-692].
The prior art on saliency predicts where the interesting parts of the scene are, and thereby tries to predict where a person will look. However, now that accurate saliency algorithms are available [see: Valenti, R., Sebe, N., Gevers, T.: Image saliency by isocentric curvedness and color. In: ICCV. (2009); Itti, L., Koch, C., Niebur, E.: A model of saliency-based visual attention for rapid scene analysis. PAMI 20 (1998) 1254-1259; and Ma, Y.F., Zhang, H.J.: Contrast-based image attention analysis by using fuzzy growing. In: ACM MM. (2003)], the invention proposes to reverse the problem by using saliency maps to aid the uncertain fixations.
In the system according to the invention, the gaze vector 13 obtained by an existing visual gaze estimation system is used to estimate the possible interest area in the scene.
The size of this area will depend on device capabilities and on the scenario. This area is evaluated for salient regions, and filtered so that salient regions which are far away from the center of interest will be less relevant for the final estimation. The obtained probability landscape is then explored to find the best candidate for the location of the adjusted fixation. This process is repeated for every estimated fixation in the image. After all the fixations and respective adjustments are obtained, the least-squares error between them is minimized in order to find the best transformation from the estimated set of fixations to the adjusted ones.
This transformation is then applied to the original fixations and future ones, in order to compensate for the found error. When a sequence of estimations is available, the obtained improvement is used to correct the previously erroneous estimates. The found error is used to adjust and recalibrate the gaze estimation devices at runtime, in order to improve future estimations. The method may be used to fix the shortcomings of low-quality monocular head and eye trackers, improving their overall accuracy.
Visual gaze estimators have inherent errors which may occur in each of the components of the visual gaze pipeline. From these errors the size of the area where interesting locations may be found can be derived. To this end, three errors which should be taken into account when estimating visual gaze (one for each of the components of the pipeline) can be identified: the device error, the calibration error and the foveating error. Depending on the scenario, the actual size of the area of interest will be computed by cumulating these three errors and mapping them to the distance of the gazed scene.
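As a hedged illustration of this cumulation (the pseudo code near the end of the description writes the projection as tan(d * total error); the sketch below uses the standard projection d * tan(total error), which is an assumption about the intended geometry; angular errors in radians, names hypothetical):

```python
import math

def area_of_interest_radius(device_error, calibration_error, foveating_error, d):
    """Sketch: cumulate the three angular error terms (radians) and
    project the total onto the gazed scene at distance d (any length unit)."""
    total_error = device_error + calibration_error + foveating_error
    return d * math.tan(total_error)
```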
The device error:
This error is attributed to the first component of the visual gaze estimation pipeline. As imaging devices are limited in resolution, there are a discrete number of states in which image features can be detected and recognized. The variables defining this error are often the maximum level of detail which the device can achieve while interpreting pixels as the location of the eye or the position of the head. Therefore, this error mainly depends on the scenario (e.g. the distance of the subject from the imaging device) and on the device that is being used.
The calibration error:
This error is attributed to the resolution of the visual gaze starting from the features extracted in the first component. Eye gaze trackers often use a mapping between the position of the eye and the corresponding locations on the screen. Therefore, the tracking system needs to be calibrated. In case the subject moves from his original location, this mapping will be inconsistent and the system may erroneously estimate the visual gaze. Chinrests are often required in these situations to limit the movements of the users to a minimum. Muscular distress, the length of the session and the tiredness of the subject may all influence the calibration error. As the calibration error cannot be known a priori, it cannot be modeled. Therefore, the aim is to estimate it, so that it can be compensated.
The foveating error:
As this error is associated with the new component proposed in the pipeline, the properties of the fovea need to be analyzed to define it. The fovea is the part of the retina responsible for accurate central vision in the direction in which it is pointed, and is necessary for any activity that requires a high level of visual detail. The human fovea has a diameter of about 1.0 mm with a high concentration of cone photoreceptors, which account for the high visual acuity capability. Through saccades (more than 10,000 per hour [see: Geisler, W.S., Banks, M.S.: Handbook of Optics, 2nd Ed. Volume I: Fundamentals, Techniques and Design. McGraw-Hill, Inc., New York, NY, USA (1995)]), the fovea is moved to the regions of interest, generating eye fixations. In fact, if the gazed object is large, the eyes constantly shift their gaze to subsequently bring images into the fovea. For this reason, fixations obtained by analyzing the location of the center of the cornea are widely used in the literature as an indication of the gaze and interest of the user.
However, it is generally assumed that the fixation obtained by analyzing the center of the cornea corresponds to the exact location of interest. While this is a valid assumption in most scenarios, the size of the fovea actually permits seeing the central two degrees of the visual field. For instance, when reading a text, humans do not fixate on each of the letters; a single fixation permits reading and seeing multiple words at once.
Another important aspect to be taken into account is the decrease in visual resolution as we move away from the center of the fovea. The fovea is surrounded by the parafovea belt, which extends up to 1.25 mm away from the center, followed by the perifovea (2.75 mm away), which in turn is surrounded by a larger area that delivers low-resolution information. Starting at the outskirts of the fovea, the density of receptors progressively decreases, hence the visual resolution decreases rapidly away from the foveal center [see: Rossi, E.A., Roorda, A.: The relationship between visual resolution and cone spacing in the human fovea. Nature Neuroscience 13 (2009)]. This is modeled by using a Gaussian kernel centered on the area of interest, with a standard deviation of a quarter of the estimated area of interest. In this way, areas which are close to the border of the area of interest are of lesser importance. In our model, we consider this region as the possible location for the interest point. As the area of interest is increased by the projection of the total error, the tail of the Gaussian of the area of interest will help to balance the importance of a fixation point against the distance from the original fixation point. As the point of interest could be anywhere in this limited area, the next step is to use saliency to extract potential fixation candidates.
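A minimal sketch of this foveal weighting, assuming a square area of interest whose saliency map is multiplied element-wise by a Gaussian with a standard deviation of a quarter of the area's size (names hypothetical):

```python
import numpy as np

def foveal_mask(size, sigma=None):
    """Sketch: Gaussian kernel centered on a size x size area of interest,
    standard deviation a quarter of the area's size, so saliency near the
    border weighs less than saliency near the estimated fixation."""
    sigma = size / 4.0 if sigma is None else sigma
    ax = np.arange(size) - (size - 1) / 2.0
    xx, yy = np.meshgrid(ax, ax)
    return np.exp(-(xx**2 + yy**2) / (2.0 * sigma**2))

# Usage on a square saliency patch of the same size:
# masked_saliency = saliency_patch * foveal_mask(saliency_patch.shape[0])
```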
The saliency is evaluated on the interest area by using a customized version of the saliency framework proposed in [Valenti, R., Sebe, N., Gevers, T.: Image saliency by isocentric curvedness and color. In: ICCV. (2009)]. The framework uses isophote curvature to extract the displacement vectors, which indicate the center of the osculating circle at each point of the image. In Cartesian coordinates, the isophote curvature is defined as:

$$\kappa = -\frac{L_y^2 L_{xx} - 2 L_x L_{xy} L_y + L_x^2 L_{yy}}{\left(L_x^2 + L_y^2\right)^{3/2}}$$
where $L_x$ represents the first-order derivative of the luminance function in the x direction, $L_{xx}$ the second-order derivative in the x direction, and so on. The isophote curvature is used to estimate points which are closer to the center of the structure they belong to; therefore, the isophote curvature is inverted and multiplied by the gradient. The displacement coordinates D(x, y) to the estimated centers are then obtained by:

$$D(x, y) = -\frac{\{L_x, L_y\}\left(L_x^2 + L_y^2\right)}{L_y^2 L_{xx} - 2 L_x L_{xy} L_y + L_x^2 L_{yy}}$$
In this way, every pixel in the image gives an estimate of the potential structure it belongs to. To collect and reinforce this information and to deduce the location of the objects, the D(x, y)'s are mapped into an accumulator, weighted according to their local importance, defined as the amount of image curvature and color edges. The accumulator is then convolved with a Gaussian kernel so that each cluster of votes forms a single estimate. This clustering of votes in the accumulator gives an indication of where the centers of interesting or structured objects are in the image.
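A sketch of this voting scheme under stated assumptions (derivatives via Gaussian derivative filters at a single scale; local importance approximated by curvedness only, since the text below discards the color-edge term anyway; names hypothetical):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def isocenter_saliency(image, scale=2.0):
    """Sketch of the voting scheme described above: every pixel votes for
    the estimated center of its isophote, and the smoothed accumulator
    serves as the saliency map."""
    L = np.asarray(image, dtype=float)
    # Gaussian derivative filters; order=(row, col), so (0, 1) is d/dx.
    Lx = gaussian_filter(L, scale, order=(0, 1))
    Ly = gaussian_filter(L, scale, order=(1, 0))
    Lxx = gaussian_filter(L, scale, order=(0, 2))
    Lxy = gaussian_filter(L, scale, order=(1, 1))
    Lyy = gaussian_filter(L, scale, order=(2, 0))

    grad_sq = Lx**2 + Ly**2
    denom = Ly**2 * Lxx - 2.0 * Lx * Lxy * Ly + Lx**2 * Lyy
    flat = np.abs(denom) < 1e-9                 # no curvature: no vote
    safe = np.where(flat, 1.0, denom)

    # Displacement D(x, y) toward the center of the osculating circle.
    dx = -Lx * grad_sq / safe
    dy = -Ly * grad_sq / safe

    h, w = L.shape
    ys, xs = np.mgrid[0:h, 0:w]
    cx = np.rint(xs + dx).astype(int)
    cy = np.rint(ys + dy).astype(int)
    valid = ~flat & (cx >= 0) & (cx < w) & (cy >= 0) & (cy < h)

    # Votes weighted by local curvedness as a proxy for local importance.
    weight = np.sqrt(Lxx**2 + 2.0 * Lxy**2 + Lyy**2)
    acc = np.zeros_like(L)
    np.add.at(acc, (cy[valid], cx[valid]), weight[valid])

    # Smooth the accumulator so each cluster of votes forms one estimate.
    return gaussian_filter(acc, scale)
```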
In [Valenti, R., Sebe, N., Gevers, T.: Image saliency by isocentric curvedness and color. In: ICCV. (2009)], multiple scales are used. Here, since the scale is directly related to the size of the area of interest, the optimal scale can be determined once and then linked to the area of interest itself. Furthermore, in the abovementioned document the color and curvature information is added to the final saliency map, while here this information is discarded. The reasoning behind this choice is that this information is mainly useful to enhance objects on their edges, while the isocentric saliency is fit to locate the adjusted fixations closer to the centers of the objects rather than on their edges. While removing this information from the saliency map might reduce the overall response of salient objects in the scene, it brings the ability to use the saliency maps as smooth probability density functions.
Once the saliency of the area of interest is obtained, it is masked by the area of interest model as defined before. Hence, the Gaussian kernel in the middle of the area of interest will aid in suppressing saliency peaks in its outskirts. However, there may still be uncertainties about multiple optimal fixation candidates.
Therefore, a meanshift window with a size corresponding to the standard deviation of the Gaussian kernel is initialized on the location of the estimated fixation point (corresponding to the center of the area of interest). The meanshift algorithm then iterates from that point towards the point of highest energy. After convergence, the saliency peak in the area of interest which is closest to the center of the converged meanshift window is selected as the new (adjusted) fixation point. This process is repeated for all fixation points in an image, obtaining a set of corrections. An analysis of a number of these corrections holds information about the overall calibration error. This allows for estimation of the current calibration error of the gaze estimation system, which can thereafter be used to compensate it. The highest peaks in the saliency maps are used to align fixation points with the salient points discovered in the area of interest.
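A minimal meanshift sketch on the masked saliency map, assuming a square window and a simple center-of-mass update (names hypothetical):

```python
import numpy as np

def meanshift_fixation(saliency, start, window, max_iter=20, eps=0.5):
    """Sketch: slide a window from the estimated fixation (center of the
    area of interest) toward the point of highest energy on the masked
    saliency map; the nearby saliency peak becomes the adjusted fixation."""
    pos = np.asarray(start, dtype=float)        # (row, col) start position
    h, w = saliency.shape
    for _ in range(max_iter):
        r0, r1 = max(int(pos[0]) - window, 0), min(int(pos[0]) + window + 1, h)
        c0, c1 = max(int(pos[1]) - window, 0), min(int(pos[1]) + window + 1, w)
        patch = saliency[r0:r1, c0:c1]
        if patch.sum() <= 0:
            break                               # no energy to climb
        rows, cols = np.mgrid[r0:r1, c0:c1]
        new_pos = np.array([(rows * patch).sum(), (cols * patch).sum()]) / patch.sum()
        if np.linalg.norm(new_pos - pos) < eps: # shift below threshold: converged
            return new_pos
        pos = new_pos
    return pos
```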
A weighted least-squares error minimization between the estimated gaze locations and the corrected ones is performed. In this way, the affine transformation matrix T is derived. The weight is retrieved as the confidence of the adjustment, which considers both the distance from the original fixation and the saliency value sampled at the same location. The obtained transformation matrix T is thereafter applied to the original fixations to obtain the final fixation estimates. These new fixations should have minimized the calibration error.
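A sketch of this weighted least-squares fit, assuming fixations are 2D points in homogeneous coordinates and the weights express adjustment confidence (names hypothetical):

```python
import numpy as np

def fit_affine(estimated, adjusted, weights):
    """Sketch: weighted least-squares fit of an affine matrix T mapping
    estimated fixations onto their saliency-adjusted counterparts."""
    src = np.asarray(estimated, dtype=float)          # (N, 2)
    dst = np.asarray(adjusted, dtype=float)           # (N, 2)
    A = np.hstack([src, np.ones((len(src), 1))])      # homogeneous coordinates
    sw = np.sqrt(np.asarray(weights, dtype=float))[:, None]
    T, *_ = np.linalg.lstsq(sw * A, sw * dst, rcond=None)
    return T                                          # (3, 2)

def apply_affine(T, fixations):
    """Apply T to original (and future) fixations to compensate the error."""
    f = np.asarray(fixations, dtype=float)
    return np.hstack([f, np.ones((len(f), 1))]) @ T
```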
The pseudo code of the proposed system is as follows:

Initialize scenario parameters
  - Calculate the total error = foveating error + device error + calibration error
  - Calculate the size of the area of interest by projecting the total error at distance d as tan(d * total error)
for (each new fixation point p) do
  - Retrieve the estimated gaze point from the device
  - Extract the area of interest around the fixation p
  - Inspect the area of interest for salient objects
  - Filter the result by the Gaussian kernel
  - Initialize a meanshift window on the center of the area of interest
  while (maximum iterations not reached or Δp < threshold) do
    - climb the distribution to the point of maximum energy
  end while
  - Select the saliency peak closest to the center of the converged meanshift window as being the correct adjusted fixation
  - Store the original fixation and the adjusted fixation, with the weight w found at the same location on the saliency map
  - Calculate the weighted least-squares solution between all the stored points to derive the transformation matrix T
  - Transform all original fixations with the obtained transformation matrix
  - Use the transformation matrix T to compensate the calibration error in the device
end for
It will be appreciated by those skilled in the art that changes could be made to the embodiments described above without departing from the broad inventive concept thereof. It is understood, therefore, that this invention is not limited to the particular embodiments disclosed, but it is intended to cover modifications within the spirit and scope of the present invention as defined by the appended claims.

Claims (13)

1. A system for detecting the direction of interest of a person, such as the person's gaze direction, eye direction, head direction, body direction or finger-pointing direction, comprising: a processor; at least one video camera connected to the processor for capturing video data; electronic memory connected to the processor; wherein the processor is arranged to determine in real time an interest vector of a person from the video data; wherein the processor is arranged to determine in real time a salient peak closest to the interest vector; wherein the processor is arranged to determine in real time a saliency-corrected interest vector between the person and the closest salient peak; wherein the processor is arranged to determine in real time the deviation between the determined interest vector and the determined saliency-corrected interest vector; wherein the processor is arranged to determine in real time further interest vectors of the person from the video data; and wherein the processor is arranged to calculate in real time calibrated interest vectors by using an error value calculated from the determined deviation.

2. The system according to claim 1, wherein the processor is arranged to determine in real time the salient peaks closest to a multitude of determined interest vectors; the processor is arranged to determine in real time a multitude of saliency-corrected interest vectors between the person and the closest salient peaks; wherein the processor is arranged to determine in real time the deviations between the multitude of determined interest vectors and the multitude of saliency-corrected vectors; wherein the calibration error value is calculated from the multitude of determined deviations.

3. The system according to claim 1 or 2, wherein the processor is arranged to iterate in real time the process of calculating the calibration error value by replacing previously determined interest vectors with interest vectors that have been corrected using a previously determined calibration error value, for calculating a current calibration error value.

4. The system according to claim 2 or 3, wherein the processor is arranged to calculate in real time the calibration error value by minimizing the difference between the multitude of determined deviations and the calibration error value, for instance by using a weighted least-squares method.

5. The system according to any of claims 1-4, wherein the salient peaks are determined using saliency data about the area the person is expected to look at, such as video data, screen capture data or manually entered data.

6. The system according to any of claims 1-5, wherein the processor is arranged to determine in real time salient peaks in the region around the determined interest vector from the video data before determining the salient peak closest to the determined interest vector.

7. The system according to claim 6, wherein the system comprises at least two video cameras connected to the processor, one camera for capturing video data of a person's face and/or body, and one camera for capturing said video data.

8. The system according to claim 7, wherein the processor, the electronic memory and the at least two video cameras are combined in one device, for instance a smartphone.

9. The system according to any of claims 1-8, wherein the direction of interest is a gaze direction.

10. A method for detecting a person's direction of interest, wherein a processor performs the following steps: determining in real time an interest vector of a person from video data captured by a video camera; determining in real time a salient peak closest to the interest vector; determining in real time a saliency-corrected interest vector between the person and the closest salient peak; determining in real time the deviation between the determined interest vector and the determined saliency-corrected interest vector; determining in real time further interest vectors of the person from the video data; and calculating in real time calibrated interest vectors by using an error value calculated from the determined deviation.

11. A computer software program arranged to run on a processor to perform the steps of the method according to claim 10.

12. A computer-readable carrier comprising a computer software program arranged to run on a processor to perform the steps of the method according to claim 10.

13. A computer comprising a processor and electronic memory connected thereto, loaded with a computer software program arranged to perform the steps of the method according to claim 10.
NL2004878A 2010-06-11 2010-06-11 System and method for detecting a person's direction of interest, such as a person's gaze direction. NL2004878C2 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
NL2004878A NL2004878C2 (en) 2010-06-11 2010-06-11 System and method for detecting a person's direction of interest, such as a person's gaze direction.
PCT/NL2011/050423 WO2012008827A1 (en) 2010-06-11 2011-06-10 System and method for detecting a person's direction of interest, such as a person's gaze direction

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
NL2004878 2010-06-11
NL2004878A NL2004878C2 (en) 2010-06-11 2010-06-11 System and method for detecting a person's direction of interest, such as a person's gaze direction.

Publications (1)

Publication Number Publication Date
NL2004878C2 (en) 2011-12-13

Family

ID=43589565

Family Applications (1)

Application Number Title Priority Date Filing Date
NL2004878A NL2004878C2 (en) 2010-06-11 2010-06-11 System and method for detecting a person's direction of interest, such as a person's gaze direction.

Country Status (2)

Country Link
NL (1) NL2004878C2 (en)
WO (1) WO2012008827A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106415442A (en) 2014-05-08 2017-02-15 索尼公司 Portable electronic equipment and method of controlling a portable electronic equipment
US10248280B2 (en) 2015-08-18 2019-04-02 International Business Machines Corporation Controlling input to a plurality of computer windows

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100079508A1 (en) 2008-09-30 2010-04-01 Andrew Hodge Electronic devices with gaze detection capabilities

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
Hillaire, S., Breton, G., Ouarti, N., Cozot, R., Lécuyer, A.: "Using a Visual Attention Model to Improve Gaze Tracking Systems in Interactive 3D Applications", Computer Graphics Forum, vol. 29, no. 6, 22 March 2010, p. 1830, XP002624749, DOI: 10.1111/j.1467-8659.2010.01651.x *
Itti, L., Koch, C., Niebur, E.: "A Model of Saliency-Based Visual Attention for Rapid Scene Analysis", IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 11, 1 November 1998, pp. 1254-1259, XP001203933, ISSN: 0162-8828, DOI: 10.1109/34.730558 *
Valenti, R., Sebe, N., Gevers, T.: "Image saliency by isocentric curvedness and color", 2009 IEEE 12th International Conference on Computer Vision (ICCV), 29 September 2009, pp. 2185-2192, XP031672570, ISBN: 978-1-4244-4420-5 *
Valenti, R., Gevers, T.: "Accurate eye center location and tracking using isophote curvature", IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2008), 23 June 2008, pp. 1-8, XP031297087, ISBN: 978-1-4244-2242-5 *
Sugano, Y. et al.: "Calibration-free gaze sensing using saliency maps", 2010 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 13-18 June 2010, San Francisco, CA, USA, pp. 2667-2674, XP031725813, ISBN: 978-1-4244-6984-0 *

Also Published As

Publication number Publication date
WO2012008827A1 (en) 2012-01-19

Similar Documents

Publication Publication Date Title
US20180211104A1 (en) Method and device for target tracking
US10048749B2 (en) Gaze detection offset for gaze tracking models
JP6411510B2 (en) System and method for identifying faces in unconstrained media
US10109056B2 (en) Method for calibration free gaze tracking using low cost camera
US9075453B2 (en) Human eye controlled computer mouse interface
US9405364B2 (en) Method of determining reflections of light
CN110807427B (en) Sight tracking method and device, computer equipment and storage medium
US20160162673A1 (en) Technologies for learning body part geometry for use in biometric authentication
JP2016515242A (en) Method and apparatus for gazing point estimation without calibration
Valenti et al. What are you looking at? Improving visual gaze estimation by saliency
JP6822482B2 (en) Line-of-sight estimation device, line-of-sight estimation method, and program recording medium
WO2012126844A1 (en) Method and apparatus for gaze point mapping
JP5001930B2 (en) Motion recognition apparatus and method
KR101288447B1 (en) Gaze tracking apparatus, display apparatus and method therof
Valenti et al. Webcam-based visual gaze estimation
NL2004878C2 (en) System and method for detecting a person's direction of interest, such as a person's gaze direction.
Wu et al. NIR-based gaze tracking with fast pupil ellipse fitting for real-time wearable eye trackers
CN112396654A (en) Method and device for determining pose of tracking object in image tracking process
Strupczewski Commodity camera eye gaze tracking
EP2685351A1 (en) Method for calibration free gaze tracking using low cost camera
CN112114659A (en) Method and system for determining a fine point of regard for a user
Kim et al. Gaze tracking based on pupil estimation using multilayer perception
Huang et al. Robust feature extraction for non-contact gaze tracking with eyeglasses
Ratnayake et al. Head Movement Invariant Eye Tracking System
García-Dopico et al. Precise Non-Intrusive Real-Time Gaze Tracking System for Embedded Setups.