WO2014132259A1 - Method and system for correlating gaze information - Google Patents

Method and system for correlating gaze information

Info

Publication number: WO2014132259A1
Authority: WO (WIPO, PCT)
Prior art keywords: gaze, image, scene, gaze direction, individual
Application number: PCT/IL2014/050205
Other languages: French (fr)
Inventors: Noam Meir, Ziv TSOREF
Original assignee: Inuitive Ltd.
Priority claimed from: PCT/IL2013/050997 (WO2014087406A1)
Application filed by Inuitive Ltd.
Publication of WO2014132259A1


Classifications

    • G06T 7/70: Image analysis; determining position or orientation of objects or cameras
    • G06T 7/593: Image analysis; depth or shape recovery from multiple images, from stereo images
    • G06V 40/193: Recognition of eye characteristics (e.g., of the iris); preprocessing, feature extraction
    • G06T 2207/10021: Image acquisition modality; stereoscopic video, stereoscopic image sequence
    • G06T 2207/30041: Subject of image; biomedical image processing; eye, retina, ophthalmic
    • G06T 2207/30201: Subject of image; human being, person; face

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Ophthalmology & Optometry (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

A computer vision system is disclosed. The system comprises: an imaging system configured for capturing an image of a scene and an individual in the scene; and a data processor, configured for receiving imagery data from the imaging system and processing the imagery data so as to construct a three-dimensional image of the scene, and to extract a gaze direction of the individual. The three-dimensional image and the gaze direction are defined over the same three-dimensional coordinate system.

Description

METHOD AND SYSTEM FOR CORRELATING GAZE INFORMATION
RELATED APPLICATION
This application claims the benefit of priority of U.S. Provisional Patent Application No. 61/769,989 filed February 27, 2013, U.S. Provisional Patent Application No. 61/817,375 filed April 30, 2013, and PCT Application No. IL2013/050997 filed December 4, 2013, the contents of which are incorporated herein by reference in their entirety.
FIELD AND BACKGROUND OF THE INVENTION
The present invention, in some embodiments thereof, relates to image processing and, more particularly, but not exclusively, to a method and system for correlating gaze information.
The use of cameras in computer systems, commonly termed computer vision, continues to increase. Video conferencing, live feeds and the like are common applications that require computer vision, and advanced user interfaces that use computer vision are becoming increasingly available for desktop, home, and mobile computing devices.
Conventional camera applications involve the use of a camera operator who controls the camera, and, in particular, controls the image that the camera records by appropriately orienting the camera. The camera operator may also provide direction to improve the appearance of the objects being recorded. In the terminology common to the field, proper image framing assures that a desired image is included within the field of view of the camera. A typical computer vision application is often operated using a fixed position camera and no camera operator per se.
Some computer vision systems employ stereo camera systems to acquire three-dimensional information about objects. Typical stereo camera systems include two or more electronic cameras which are mounted at spaced apart locations. The electronic cameras have partially overlapping fields of view. A computer connected to receive images from each of the cameras can compare the images to derive three-dimensional information about objects in the field of view. Information such as the distances to the objects and the sizes, dimensions and orientations of the objects can be determined by triangulation.
Use of computer vision techniques for gaze tracking is increasing in a number of diagnostic, human performance, and control applications. Gaze tracking involves the determination and tracking of the gaze or fixation point of a person's eyes on a surface of an object such as the screen of a computer monitor. The gaze point is generally defined as the intersection of the person's line of sight with the surface of the object being viewed.
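By way of illustration only, the following sketch shows one way a gaze point could be computed as the intersection of a line of sight with a planar surface such as a monitor screen. The function name and the ray-plane formulation are assumptions made for the example and are not part of the disclosure.

```python
import numpy as np

def gaze_point_on_plane(eye_pos, gaze_dir, plane_point, plane_normal):
    """Intersect a gaze ray with a planar surface (e.g., a monitor screen).

    eye_pos      -- 3D position of the eye (origin of the line of sight)
    gaze_dir     -- unit vector of the gaze direction
    plane_point  -- any point on the viewed surface
    plane_normal -- unit normal of the surface
    Returns the 3D intersection point, or None if the ray is parallel
    to the plane or the surface lies behind the viewer.
    """
    denom = np.dot(plane_normal, gaze_dir)
    if abs(denom) < 1e-9:          # ray parallel to the plane
        return None
    t = np.dot(plane_normal, plane_point - eye_pos) / denom
    if t < 0:                      # surface is behind the viewer
        return None
    return eye_pos + t * gaze_dir

# Example: viewer 60 cm in front of a screen lying in the z = 0 plane.
eye = np.array([0.0, 0.0, 0.6])
direction = np.array([0.1, -0.05, -1.0])
direction /= np.linalg.norm(direction)
print(gaze_point_on_plane(eye, direction,
                          np.array([0.0, 0.0, 0.0]),
                          np.array([0.0, 0.0, 1.0])))
```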
Gaze tracking techniques and applications are disclosed in numerous publications, such as U.S. Patent Nos. 8,510,166, 8,553,936, 8,434,868, U.S. Published Application Nos. 20130107207 and 20120294478, and International Publication No. WO2012154418, WO2013066790 and WO2013070788.
Additional background art includes U.S. Patent Nos. 7,613,356 and 7,576,847, and Qiang Ji and Xiaojie Yang, "Real-Time Eye, Gaze, and Face Pose Tracking for Monitoring Driver Vigilance", Department of Electrical, Computer, and System Engineering Rensselaer Polytechnic Institute, Troy, NY 12180, USA.
SUMMARY OF THE INVENTION
According to an aspect of some embodiments of the present invention there is provided a computer vision system. The system comprises: an imaging system configured for capturing an image of a scene and an individual in the scene; and a data processor, configured for receiving imagery data from the imaging system and processing the imagery data so as to construct a three-dimensional image of the scene, and to extract a gaze direction of the individual. The three-dimensional image and the gaze direction are defined over the same three-dimensional coordinate system.
According to some embodiments of the invention the imaging system is configured to provide a stereoscopic image, and the data processor is configured for reconstructing the three-dimensional image from the stereoscopic image.
According to some embodiments of the invention the imaging system is configured to provide a two-dimensional image and range data, and the data processor is configured for reconstructing the three-dimensional image from the two-dimensional image and range data.
According to some embodiments of the invention the data processor is configured for determining whether the gaze direction passes through an object in the scene.
According to some embodiments of the invention the object is imaged by the imaging system, and the data processor is configured for determining a position of the object and defining the position over the three-dimensional coordinate system.
According to some embodiments of the invention the data processor is configured for receiving a position of the object and defining the position over the three-dimensional coordinate system.
According to some embodiments of the invention the object is a display device.
According to some embodiments of the invention the system comprises a video system, configured for receiving and displaying a video image on the display device.
According to some embodiments of the invention the data processor is configured for synchronizing the extraction of the gaze direction with the video image.
According to some embodiments of the invention the system comprises a correlator configured for correlating the gaze direction with the video image by dynamically associating a gaze point on the display with a spatiotemporal region of the image.
According to an aspect of some embodiments of the present invention there is provided a method of determining gaze. The method comprises: receiving from an imaging system imagery data describing a scene and an individual in the scene; and using a data processor for processing the imagery data to construct a three-dimensional image of the scene, and to extract a gaze direction of the individual. The three-dimensional image and the gaze direction are defined over the same three-dimensional coordinate system.
According to some embodiments of the invention the gaze direction is defined over the three-dimensional coordinate system during the extraction of the gaze direction.
According to some embodiments of the invention the method comprises determining whether the gaze direction passes through an object in the scene.
According to some embodiments of the invention the object is imaged by the imaging system, and the method comprises determining a position of the object in the scene and defining the position over the three-dimensional coordinate system.
According to some embodiments of the invention the method comprises receiving a position of the object and defining the position over the three-dimensional coordinate system.
According to some embodiments of the invention the object is a display device.
According to some embodiments of the invention the method comprises receiving and displaying a video image on the display device.
According to some embodiments of the invention the method comprises synchronizing the extraction of the gaze direction with the video image.
According to some embodiments of the invention the method comprises correlating the gaze direction with the video image by dynamically associating a gaze point on the display with a spatiotemporal region of the image.
According to an aspect of some embodiments of the present invention there is provided a computer vision system. The system comprises: a video system, configured for receiving and displaying a video image on a display device; a gaze tracker, configured for receiving a local image of a local scene in front of the display device, and extracting a gaze direction of an individual in the local scene; a correlator, configured for correlating the gaze direction with the video image by dynamically associating a gaze point on the display with a spatiotemporal region of the image; and an output circuit, for transmitting the associated region to a remote location.
According to some embodiments of the invention the gaze tracker is configured for determining presence or absence of the individual.
According to some embodiments of the invention the gaze tracker is configured for extracting a plurality of gaze directions of a respective plurality of individuals in the local scene.
According to some embodiments of the invention the local image is a three-dimensional image.
According to some embodiments of the invention the system comprises a controller configured for automatically activating and deactivating the gaze tracker, the correlator and the output circuit responsively to a content of the video image.
According to some embodiments of the invention the video system is configured for automatically identifying the content.
According to some embodiments of the invention the gaze tracker, the correlator and the output circuit are activated when the video image comprises a commercial advertisement, and are deactivated otherwise.
According to some embodiments of the invention the video image comprises a hidden commercial advertisement, and the correlator is configured for determining whether the spatiotemporal region encompasses the hidden commercial advertisement.
According to some embodiments of the invention the correlator is configured for calculating correlation statistics pertaining to an amount of time the spatiotemporal region is being associated with the gaze direction.
According to some embodiments of the invention the gaze tracker is configured for recognizing a face of the individual, wherein the correlator is configured for dynamically associating the gaze point with the face.
According to an aspect of some embodiments of the present invention there is provided a method of correlating gaze information. The method comprises: displaying a video image on a display device; receiving a local image of a local scene in front of the display device, and extracting a gaze direction of an individual in the local scene; correlating the gaze direction with the video image by dynamically associating a gaze point on the display with a spatiotemporal region of the image; and transmitting the associated region to a remote location.
According to some embodiments of the invention the method comprises determining presence or absence of the individual.
According to some embodiments of the invention the method comprises extracting at least one additional gaze direction of at least one additional individual in the local scene.
According to some embodiments of the invention the local image is a three-dimensional image.
According to some embodiments of the invention the method comprises activating and deactivating the gaze tracking, the correlation and the transmission responsively to a content of the video image.
According to some embodiments of the invention the method comprises identifying the content.
According to some embodiments of the invention the gaze tracker, the correlator and the output circuit are activated when the video image comprises a commercial advertisement, and are deactivated otherwise.
According to some embodiments of the invention the video image comprises a hidden commercial advertisement, and the method comprises determining whether the spatiotemporal region encompasses the hidden commercial advertisement.
According to some embodiments of the invention the method comprises calculating correlation statistics pertaining to an amount of time the spatiotemporal region is being associated with the gaze direction.
According to some embodiments of the invention the method comprises recognizing a face of the individual, wherein the correlation is further according to the recognition.
Unless otherwise defined, all technical and/or scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which the invention pertains. Although methods and materials similar or equivalent to those described herein can be used in the practice or testing of embodiments of the invention, exemplary methods and/or materials are described below. In case of conflict, the patent specification, including definitions, will control. In addition, the materials, methods, and examples are illustrative only and are not intended to be necessarily limiting.
Implementation of the method and/or system of embodiments of the invention can involve performing or completing selected tasks manually, automatically, or a combination thereof. Moreover, according to actual instrumentation and equipment of embodiments of the method and/or system of the invention, several selected tasks could be implemented by hardware, by software or by firmware or by a combination thereof using an operating system.
For example, hardware for performing selected tasks according to embodiments of the invention could be implemented as a chip or a circuit. As software, selected tasks according to embodiments of the invention could be implemented as a plurality of software instructions being executed by a computer using any suitable operating system. In an exemplary embodiment of the invention, one or more tasks according to exemplary embodiments of method and/or system as described herein are performed by a data processor, such as a computing platform for executing a plurality of instructions. Optionally, the data processor includes a volatile memory for storing instructions and/or data and/or a non-volatile storage, for example, a magnetic hard-disk and/or removable media, for storing instructions and/or data. Optionally, a network connection is provided as well. A display and/or a user input device such as a keyboard or mouse are optionally provided as well.
BRIEF DESCRIPTION OF THE DRAWINGS
Some embodiments of the invention are herein described, by way of example only, with reference to the accompanying drawings and images. With specific reference now to the drawings in detail, it is stressed that the particulars shown are by way of example and for purposes of illustrative discussion of embodiments of the invention. In this regard, the description taken with the drawings makes apparent to those skilled in the art how embodiments of the invention may be practiced.
FIG. 1 is a schematic diagram of a computer vision system, according to some embodiments of the present invention;
FIG. 2 is a flowchart diagram of a method suitable for determining gaze, according to some embodiments of the present invention;
FIG. 3 is a schematic diagram of an exemplified computer vision system, according to additional embodiments of the present invention; and
FIG. 4 is a flowchart diagram of a preferred procedure according to some embodiments of the present invention.
DESCRIPTION OF SPECIFIC EMBODIMENTS OF THE INVENTION
The present invention, in some embodiments thereof, relates to image processing and, more particularly, but not exclusively, to a method and system for correlating gaze information.
Before explaining at least one embodiment of the invention in detail, it is to be understood that the invention is not necessarily limited in its application to the details of construction and the arrangement of the components and/or methods set forth in the following description and/or illustrated in the drawings and/or the Examples. The invention is capable of other embodiments or of being practiced or carried out in various ways.
The present embodiments comprise a method and system that can be used in computer vision. The method and/or system of the present embodiments can provide, for example using a data processor, a three-dimensional geometric model of a scene and the locations of various objects within the scene. The three-dimensional geometric model is calculated based on an image of the scene that corresponds to imagery data generated by an imaging system. The three-dimensional geometric model can be a three-dimensional image of the scene. The three-dimensional image can optionally and preferably overlap with a previously prepared three-dimensional model of the scene. In these embodiments, the coordinates of the objects as imaged by an imaging system are preferably expressed by means of the coordinates defined by the previously prepared three-dimensional model.
The method and/or system of the present embodiments can identify objects in the scene, track their position and, optionally and preferably, also track their orientation. Thus, the system and method of the present embodiments provide tracking over six degrees of freedom (DOF): three translational degrees of freedom and three rotational degrees of freedom. The tracked objects can be stationary or they can be moving objects. In various exemplary embodiments of the invention the tracking data (position and/or orientation) is associated with a time tag of the captured image so as to allow correlating the activity of the tracked object with other inputs such as voice or gesture.
The tracked objects can include at least one individual. The method and/or system of the present embodiments can detect presence or absence of individuals in the scene, and can optionally also identify the individual by employing a face recognition procedure.
The method and/or system of the present embodiments can extract the gaze direction of one or more individuals in the scene. The gaze direction can be along the direction of the nose of the individual, or along a direction that is defined based on the orientation of the head of the individual and the location of the pupils within the eyes of the individual. In various exemplary embodiments of the invention the gaze direction is defined over the same three-dimensional coordinate system that is used to map the scene over the three-dimensional geometrical model or three-dimensional image. The method and/or system of the present embodiments can determine whether the extracted gaze direction passes through an object in the scene. The object can be stationary or it can be a moving object. In some embodiments of the present invention the object is a display device. The object can be imaged, in which case the method and/or system of the present embodiments can determine the position of the object in the scene. Alternatively, the method and/or system of the present embodiments can receive the position of the object from an external source. The object can be within or outside the current field-of-view of the imaging system.
The determination whether the gaze passes through the object is optionally and preferably based on the gaze direction, the three-dimensional location of the individual, the three-dimensional location of the object, and the time stamp of the images used to extract the locations and gaze directions.
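As a hedged illustration of such a determination, the sketch below approximates the object by a bounding sphere and tests whether the gaze ray, anchored at the individual's 3D location, passes within that sphere while the time stamps of the two measurements agree. The tolerance values, the spherical approximation and the function name are assumptions of the example, not the patent's method.

```python
import numpy as np

def gaze_hits_object(eye_pos, gaze_dir, obj_center, obj_radius,
                     t_gaze, t_obj, max_dt=0.04):
    """Decide whether a gaze ray passes through an object.

    Both positions are expressed in the same scene coordinate system,
    gaze_dir is a unit vector, and the two time stamps must be close
    enough to describe the same moment.  The object is approximated by
    a bounding sphere of radius obj_radius.
    """
    if abs(t_gaze - t_obj) > max_dt:       # measurements not simultaneous
        return False
    to_obj = obj_center - eye_pos
    along = np.dot(to_obj, gaze_dir)
    if along <= 0:                          # object is behind the viewer
        return False
    closest = eye_pos + along * gaze_dir    # closest ray point to the center
    return np.linalg.norm(closest - obj_center) <= obj_radius
```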
The system and/or method of the present embodiments can synchronize the extraction of the gaze direction with a video image displayed on a display device. The method and/or system of the present embodiments can correlate the gaze direction with the video image on the display device by dynamically associating a gaze point on the display with a spatiotemporal region of the image.
Referring now to the drawings, FIG. 1 is a schematic diagram of a computer vision system 10, according to some embodiments of the present invention. System 10 optionally comprises a video system 12, configured for receiving and displaying a video image on a display device. System 12 receives a video stream from a remote location (not shown). For example, the stream (referred to below as the remote stream) can be broadcast over a communication network 18 (e.g., a local area network, or a wide area network such as the Internet or a cellular network), and system 12 can be configured to communicate with network 18, and receive the remote stream therefrom. In various exemplary embodiments of the invention system 12 also receives from the remote location an audio stream (referred to below as the remote audio stream) accompanying the video stream.
Video system 12 is preferably configured for decoding the remote video stream such that it can be interpreted and properly displayed. System 12 is preferably configured to support a number of known codecs, such as MPEG-2, MPEG-4, H.264, H.263+, or other codecs. Video system 12 uses the remote video stream for displaying a video image 14 of the remote user on a display device 16. In various exemplary embodiments of the invention system 10 also comprises display device 16.
Video image 14 includes a plurality of time-dependent values (e.g., grey-levels, intensities, color intensities, etc.), wherein a particular value at a particular time-point corresponds to a picture-element (e.g., a pixel, a sub-pixel or a group of pixels) in a single frame of video image 14.
Video system 12 can decode the remote audio stream such that it can be interpreted and properly output. If the remote audio stream is compressed, video system preferably decompresses it. Video system 12 is preferably configured to support a number of known audio codecs. Video system 12 transmits the decoded remote audio stream to an audio output device 34, such as, but not limited to, a speaker, a headset and the like. In various exemplary embodiments of the invention system 10 comprises audio output device 34.
Video system 12 can be any man-made machine capable of receiving a video stream and optionally also executing a set of instructions and/or performing calculations. In the schematic illustration of FIG. 1 video system 12 is implemented in a data processor 44, which can include, but is not limited to, a general purpose computer supplemented by dedicated software, general purpose microprocessor supplemented by dedicated software, general purpose microcontroller supplemented by dedicated software, general purpose graphics processor supplemented by dedicated software and/or a digital signal processor (DSP) supplemented by dedicated software. The dedicated software can be in the form of program instructions stored on a non-volatile memory medium readable by the circuitry of processor 44, wherein when the program instructions are read by the circuitry, the program instructions cause the circuitry to perform the operations described herein.
Video system 12 can also comprise dedicated circuitry (e.g. , a printed circuit board) and/or a programmable electronic chip into which dedicated software is burned.
In various exemplary embodiments of the invention system 10 comprises a gaze tracker 42, which receives an imagery data stream describing a local scene 22, for example, the scene in front of display 16, and extracts a gaze direction 28 of an individual 20 in local scene 22. In some embodiments of the present invention gaze tracker 42 also extracts a head orientation of individual 20. As used herein, "gaze direction" refers to the direction of a point or a set of points in the scene, relative to individual 20.
Typically, but not necessarily, gaze direction 28 is the direction of an eye or a nose of individual 20 relative to the object in the scene. The object can be display device 16 or a portion thereof. The object can also be another object in local scene 22 (e.g. , a table, a chair, walls, an appliance, a carpet, a painting, a lamp, a window, a door, etc.) and/or other individuals, which are not shown for clarity of presentation. When there is more than one individual in scene 22, gaze tracker 42 optionally and preferably extracts a gaze direction for each of at least two of the individuals in local scene 22. In some embodiments of the present invention gaze tracker 42 is configured for recognizing a face of individual 20.
Gaze tracker 42 can be any man-made machine capable of receiving data and also executing a set of instructions and/or performing calculations. In the schematic illustration of FIG. 1, gaze tracker 42 is implemented in a data processor 44. Gaze tracker 42 can also comprise dedicated circuitry (e.g., a printed circuit board) and/or a programmable electronic chip into which dedicated software is burned. Gaze tracker 42 optionally and preferably also has a non-volatile memory medium readable by the circuitry, wherein the memory medium stores program instructions which, when read by the circuitry, cause the circuitry to extract the gaze direction as described herein.
The imagery data stream received by gaze tracker 42 is referred to herein as the local imagery data stream.
As used herein "imagery data" refers to a plurality of values that represent a two- or three-dimensional image, and that can therefore be used to reconstruct a two- or three- dimensional image.
Typically, the imagery data comprise values (e.g. , grey-levels, intensities, color intensities, etc.), each corresponding to a picture-element (e.g., a pixel, a sub-pixel or a group of pixels) in the image. In some embodiments of the present invention the imagery data correspond to a three-dimensional image, and in some embodiments of the present invention the imagery data also comprise range data as further detailed hereinbelow.
As used herein "stream of imagery data" or "imagery data stream" refers to time- dependent imagery data, wherein the plurality of values varies with time. Typically, the stream of imagery data comprises a video stream which may include a plurality of time-dependent values (e.g., grey-levels, intensities, color intensities, etc.), wherein a particular value at a particular time-point corresponds to a picture-element (e.g. , a pixel, a sub-pixel or a group of pixels) in a video frame.
The imagery data stream can be generated by an imaging system 24 constituted to capture a view of local scene 22 and transmit a corresponding imagery data to gaze tracker 42. In various exemplary embodiments of the invention gaze tracker 42 is configured for determining the presence or absence of individual 20 in scene 22.
The local imagery data preferably comprises a video stream, referred to herein as the local video stream. In various exemplary embodiments of the invention system 10 comprises imaging system 24.
Gaze tracker 42 and optionally also imaging system 24 can be operated synchronously with the content of image 14. This can be done by a synchronizer 43 which transmits a sync signal whenever it is desired to extract correlation information between the image of scene 22 and image 14. When image 14 is a video image, synchronizer 43 can be configured for analyzing a video time code of image 14 in order to synchronize the image captured by system 24 and its time tag with the displayed video time code.
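A minimal sketch of the kind of time-code matching such a synchronizer could perform is given below, assuming the displayed video exposes a sorted list of frame time codes; the function name and the nearest-frame policy are illustrative assumptions rather than the disclosed implementation.

```python
import bisect

def nearest_display_frame(capture_time, frame_times):
    """Return the index of the displayed video frame whose time code is
    closest to the time tag of a captured camera frame.

    frame_times -- non-empty list of display time codes (seconds), sorted.
    """
    i = bisect.bisect_left(frame_times, capture_time)
    candidates = [j for j in (i - 1, i) if 0 <= j < len(frame_times)]
    return min(candidates, key=lambda j: abs(frame_times[j] - capture_time))

# Example: camera frame tagged 1.013 s matches the display frame at 1.0 s.
print(nearest_display_frame(1.013, [0.0, 0.5, 1.0, 1.5, 2.0]))
```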
Synchronizer 43 can comprise dedicated circuitry (e.g., a printed circuit board) and/or a programmable electronic chip into which dedicated software is burned. Synchronizer 43 can optionally and preferably also have a non-volatile memory medium readable by the circuitry, wherein the memory medium stores program instructions which, when read by the circuitry, cause the circuitry to synchronize the gaze tracking with the displayed image.
Imaging system 24 is optionally and preferably configured for capturing the video stream and range data, or for capturing a stereoscopic video stream, so that a three-dimensional video image can be reconstructed by data processor 44. The range data can be of any type known in the art.
The range data describe topographical information of the individual 20 in the local scene 22 and can comprise a depth map.
The term "depth map," as used herein, refers to a representation of a scene as a two-dimensional matrix, in which each matrix element corresponds to a respective location in the scene and has a respective matrix element value indicative of the distance from a certain reference location to the respective scene location. The reference location is typically static and the same for all matrix elements. A depth map optionally and preferably has the form of an image in which the pixel values indicate depth information. By way of example, an 8-bit grey-scale image can be used to represent depth information. The depth map can provide depth information on a per-pixel basis of the image data, but may also use a coarser granularity, such as a lower resolution depth map wherein each matrix element value provides depth information for a group of pixels of the image data.
The range data can also be provided in the form of a disparity map. A disparity map refers to the apparent shift of objects or parts of objects in a scene when observed from two different viewpoints, such as from the left-eye and the right-eye viewpoint. Disparity map and depth map are related and can be mapped onto one another provided the geometry of the respective viewpoints of the disparity map are known, as is commonly known to those skilled in the art.
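A common mapping between the two representations for a rectified stereo pair is Z = f * B / d, where f is the focal length in pixels, B the camera baseline and d the disparity; the sketch below applies it per pixel under that assumption, with invalid (near-zero) disparities marked as NaN.

```python
import numpy as np

def disparity_to_depth(disparity, focal_px, baseline_m, min_disp=1e-6):
    """Map a disparity map to a depth map for a rectified stereo pair:
    Z = f * B / d, with f in pixels and the baseline B in meters.
    Pixels with (near-)zero disparity are marked invalid (NaN).
    """
    d = np.asarray(disparity, dtype=float)
    depth = np.full_like(d, np.nan)
    valid = d > min_disp
    depth[valid] = focal_px * baseline_m / d[valid]
    return depth
```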
The range data and/or stereoscopic image can be used to reconstruct a three-dimensional image comprising geometric properties of a non-planar surface which at least partially encloses a three-dimensional volume. Generally, the non-planar surface is a two-dimensional object embedded in a three-dimensional space. Formally, a non-planar surface is a metric space induced by a smooth, connected and compact Riemannian 2-manifold. Ideally, the geometric properties of the non-planar surface would be provided explicitly, for example, the slope and curvature (or even other spatial derivatives or combinations thereof) for every point of the non-planar surface. Yet, such information is rarely attainable, and the spatial information of the three-dimensional image is provided for a sampled version of the non-planar surface, which is a set of points on the Riemannian 2-manifold and which is sufficient for describing the topology of the 2-manifold. Typically, the spatial information of the non-planar surface is a reduced version of a 3D spatial representation, which may be either a point cloud or a 3D reconstruction (e.g., a polygonal mesh or a curvilinear mesh) based on the point cloud. The 3D image is expressed via a 3D coordinate system, such as, but not limited to, a Cartesian, spherical, ellipsoidal, 3D parabolic or paraboloidal 3D coordinate system. It is appreciated that a three-dimensional image of an object is typically a two-dimensional image which, in addition to indicating the lateral extent of object members, further indicates the relative or absolute distance of the object members, or portions thereof, from some reference point, such as the location of the imaging device. Thus, a three-dimensional image typically includes information residing on a non-planar surface of a three-dimensional body and not necessarily in the bulk. Yet, it is commonly acceptable to refer to such an image as "three-dimensional" because the non-planar surface is conveniently defined over a three-dimensional system of coordinates. Thus, throughout this specification and in the claims section that follows, the term "three-dimensional image" primarily relates to surface entities.
In order to improve the quality of the reconstructed three-dimensional image, additional occlusion information, known in the art as de-occlusion information, can be provided. (De-)occlusion information relates to image and/or depth information which can be used to represent views for additional viewpoints (e.g., other than those used for generating the disparity map). In addition to the information that was occluded by objects, the occlusion information may also comprise information in the vicinity of occluded regions. The availability of occlusion information enables filling in of holes which occur when reconstructing the three-dimensional image using 2D+range data.
The reconstruction of three-dimensional image using the image and range data and optionally also occlusion data can be done using any procedure known in the art. Representative examples of suitable algorithms are found in Qingxiong Yang, "Spatial- Depth Super Resolution for Range Images," IEEE Conference on Computer Vision and Pattern Recognition, 2007, pages 1-8; H. Hirschmuller, "Stereo Processing by Semiglobal Matching and Mutual Information," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2008, 30(2):328-341 ; International Publication No. WO1999/006956; European Publication Nos. EP1612733 and EP2570079; U.S. Published Application Nos. 20120183238 and 20120306876; and U.S. Patent Nos. 7,583,777 and 8,249,334, the contents of which are hereby incorporated by reference.
Range data can be captured by imaging system 24 in more than one way.
In some embodiments of the present invention imaging system 24 is configured for capturing stereoscopic image data, for example, by capturing scene 22 from two or more different viewpoints. Thus, in various exemplary embodiments of the invention imaging system 24 comprises a plurality of spaced apart imaging sensors. Shown in FIG. 1 are two imaging sensors 24a and 24b, but it is not intended to limit the scope of the present invention to a system with two imaging sensors. Thus, the present embodiments contemplate configurations with one imaging sensor or more than two (e.g., 3, 4, 5 or more) imaging sensors.
From the stereoscopic image data, data processor 44 optionally and preferably calculates range data. The calculated range data can be in the form of a depth map and/or a disparity map. Optionally, the data processor also calculates occlusion data. In some embodiments, imaging system 24 captures scene 22 from three or more viewpoints so as to allow data processor 44 to calculate the occlusion data with higher precision.
The gaze direction and/or the head orientation can be extracted by processing the images in the local imagery data stream using any technique known in the art. In various exemplary embodiments of the invention the reconstruction of the three-dimensional image from the imagery data stream and the extraction of the gaze direction are performed by the same gaze tracker or data processor. This allows determining the gaze of individual 20 within the local scene 22 without the need for a calibration step.
In some embodiments of the present invention gaze tracker 42 employs an eye-tracking procedure for tracking the eyes of the individual as known in the art, and then determines the gaze direction and/or the head orientation based on the position of the eyes on the image. As a representative example for an eye-tracking procedure, the corners of one or more of the eyes can be detected, e.g., as described in Everingham et al. [In BMVC, 2006], the contents of which are hereby incorporated by reference. Following such detection, each eye can be defined as a region between two identified corners. Additional eye-tracking techniques are found, for example, in U.S. Patent Nos. 6,526,159 and 8,342,687; European Publication Nos. EP1403680, EP0596868, and International Publication No. WO 1999/027412, the contents of which are hereby incorporated by reference.
Also contemplated are other face recognition techniques that use geometrical characteristics of a face to identify facial features in the face of individual 20, including, without limitation, the tip of the nose, the nostrils, the mouth corners, and the ears. Once these facial features are identified, gaze tracker 42 determines the gaze direction and/or the head orientation based on the position of the identified facial features on the image. Techniques for the identification of facial features in an image are known in the art and are found, for example, in U.S. Patent No. 8,369,586, European Publication Nos. EP1296279, EP1693782, and International Publication No. WO 1999/053443, the contents of which are hereby incorporated by reference.
When range and/or stereoscopic data of scene 22 are available (for example, in the form of a depth map or disparity map), the range and/or stereoscopic data can be used to aid the extraction of gaze and/or the head orientation information. In some embodiments of the present invention the nose of individual 20 can be identified based on the range and/or stereoscopic data associated with the face of individual 20. For example, the region in the range data of the face that has the shortest distance to imaging system 24 can be identified as the nose. Once the nose is identified, gaze tracker 42 determines the gaze direction and/or the head orientation based on the position of the nose on the image, optionally and preferably also based on the location and/or state (e.g., eyelid open/close, pupil location within the eye) of the eyes of the individual.
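A minimal sketch of this nose-finding heuristic is shown below: within a detected face bounding box, the pixel with the smallest range value is taken as the nose-tip candidate. The bounding-box input and the NaN convention for missing range data are assumptions of the example.

```python
import numpy as np

def locate_nose(depth, face_box):
    """Pick the nose tip as the pixel with the smallest distance to the
    camera inside a detected face bounding box (x0, y0, x1, y1).

    depth -- HxW depth map; NaN marks pixels without range data.
    Returns (row, col) of the candidate nose tip in image coordinates.
    """
    x0, y0, x1, y1 = face_box
    roi = depth[y0:y1, x0:x1]
    flat = np.nanargmin(roi)              # closest valid pixel in the face
    r, c = np.unravel_index(flat, roi.shape)
    return y0 + r, x0 + c
```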
The range data can also be used according to some embodiments of the present invention for mapping the relation between the gaze and the local scene. In various exemplary embodiments of the invention gaze tracker 42 expresses gaze direction 28 in the three-dimensional coordinate system that describes and maps a plurality of objects present in scene 22. It was found by the present inventor that these embodiments are advantageous from the standpoint of timing because they allow tracking the gaze of the individual substantially in real time (e.g., within less than half a second, or less than 50 ms, or less than 5 ms, or less than 1 ms, or less than 200 μs or less than 100 μs).
Shown in FIG. 1 is a representative example of a three-dimensional coordinate system 30 having an origin O1 that is defined by processor 44. Origin O1 can be defined at any point. For clarity of presentation coordinate system 30 is drawn in FIG. 1 with origin O1 at an arbitrary location, but other locations are also contemplated. In some embodiments of the present invention O1 is defined at one of imaging sensors 24a and 24b, in some embodiments of the present invention O1 is defined at a point between imaging sensors 24a and 24b, in some embodiments of the present invention O1 is defined at an object in scene 22, preferably a static object, but may also be a moving object (e.g., at individual 20), and in some embodiments of the present invention O1 is defined at a point on display 16. While FIG. 1 shows coordinate system 30 as a three-dimensional Cartesian coordinate system, this need not necessarily be the case, since, for some applications, other three-dimensional coordinate systems such as one of the aforementioned coordinate systems can be employed.
Coordinate system 30 is used by processor 44 to define the location of individual 20 relative to origin O1. In various exemplary embodiments of the invention coordinate system 30 is used by processor 44 to define the location of the head, or one or more identifiable points on the head of individual 20 (e.g., nose, eyes, ears), relative to origin O1. Coordinate system 30 can also be used to define the orientation of the head of individual 20. In some embodiments of the present invention, gaze tracker 42 expresses gaze direction also in terms of coordinate system 30 relative to origin O1.
Also shown in FIG. 1 is a representative example of a secondary coordinate system 32 having a secondary origin O2 that can be defined by gaze tracker 42. Origin O2 is optionally and preferably defined at a point on the head of individual 20, but for clarity of presentation, coordinate system 32 is drawn in FIG. 1 with secondary origin O2 shifted relative to individual 20. Coordinate system 32 can be a two-dimensional or three-dimensional coordinate system, and can be either Cartesian or of any of the aforementioned types of coordinate systems. In the representative illustration of FIG. 1, which is not to be considered as limiting, system 32 is a three-dimensional Cartesian coordinate system.
Since the location of the head of individual 20 is expressed in terms of coordinate system 30, any coordinate within coordinate system 32 is also expressible in terms of coordinate system 30, by means of a coordinate transformation, e.g., using an appropriate transformation matrix.
In embodiments in which such a secondary coordinate system 32 is defined, gaze tracker 42 expresses the location of the eyes and/or nose of individual 20 relative to secondary origin O2. When system 32 is a two-dimensional coordinate system, the coordinates of the eyes and/or nose of individual 20 as extracted by gaze tracker 42 within system 32 are transformed, for example, by data processor 44, to coordinate system 30 by means of coordinate transformation. The gaze direction is then expressed in terms of coordinate system 30, for example, along the direction of the nose, or along a direction that is defined based on the orientation of the head and the location of the pupils. When system 32 is a three-dimensional coordinate system, gaze tracker 42 first expresses the gaze direction in terms of system 32. Thereafter, the coordinates of the gaze direction are transformed to coordinate system 30 by means of coordinate transformation, so that the gaze direction is expressed in terms of coordinate system 30.
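The following sketch illustrates one such coordinate transformation: a point expressed in a head-centered system (origin O2) is mapped into the scene system (origin O1) by a rigid transformation assembled as a 4x4 homogeneous matrix. The rotation and translation inputs are assumed to be available from a head-pose estimate; the function name is hypothetical.

```python
import numpy as np

def to_scene_coords(point_head, R_head_to_scene, head_origin_scene):
    """Express a point given in the head-centered system (origin O2) in the
    scene system (origin O1) using a rigid transformation p' = R p + t.

    R_head_to_scene   -- 3x3 rotation of the head frame relative to the scene
    head_origin_scene -- position of O2 expressed in scene coordinates
    """
    T = np.eye(4)
    T[:3, :3] = R_head_to_scene
    T[:3, 3] = head_origin_scene
    p = np.append(point_head, 1.0)        # homogeneous coordinates
    return (T @ p)[:3]

# Example: head frame rotated 90 degrees about z, head 1.5 m along scene x.
R = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
print(to_scene_coords(np.array([0.1, 0.0, 0.0]), R, np.array([1.5, 0.0, 0.0])))
```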
In various exemplary embodiments of the invention at least one of the imaging sensors of system 24 is configured to provide a field-of-view of scene 22 or a part thereof over a spectral range from infrared to visible light. Preferably at least one, and more preferably all, of the imaging sensors comprise a pixelated imager, such as, but not limited to, a CCD or CMOS matrix, which is devoid of an IR cut filter and which therefore generates a signal in response to light at any wavelength within the visible range and any wavelength within the IR range, more preferably the near IR range.
Representative examples of a characteristic wavelength range detectable by the imaging sensors include, without limitation, any wavelength from about 400 nm to about 1100 nm, or any wavelength from about 400 nm to about 1000 nm. In some embodiments of the present invention the imaging devices also provide signal responsively to light at the ultraviolet (UV) range. In these embodiments, the characteristic wavelength range detectable by the imaging devices can be from about 300 nm to about 1100 nm. Other characteristic wavelength ranges are not excluded from the scope of the present invention.
As used herein the term "about" refers to ± 10 %.
The imaging sensors optionally and preferably provide partially overlapping fields-of-view of scene 22. The overlap between the fields-of-view allows data processor 44 to combine the fields-of-view. In various exemplary embodiments of the invention the spacing between the imaging sensors is selected to allow data processor 44 to provide a three-dimensional reconstruction of individual 20 and optionally other objects and/or individuals in scene 22.
In some embodiments of the present invention system 10 comprises one or more light sources 26 constituted for illuminating at least part of scene 22. Shown in FIG. 1 is one light source, but it is not intended to limit the scope of the present invention to a system with one light source. Thus, the present embodiments contemplate configurations with two light sources or more than two (e.g., 3, 4, 5 or more) light sources. Herein, unless explicitly stated, a reference to light sources in the plural form should be construed as a reference to one or more light sources.
The light source can be of any type known in the art, including, without limitation, light emitting diode (LED) and a laser device.
One or more of light sources 26 optionally and preferably emits infrared light. In these embodiments, the infrared light sources generate infrared light at a wavelength detectable by the imaging sensors. The infrared light sources are optionally and preferably positioned adjacent to the imaging sensors, although this need not necessarily be the case, since, for some applications, it may not be necessary for the light sources to be adjacent to the imaging sensors.
System 10 can also comprise a correlator 36 which correlates gaze direction 28 with the location of an object in the scene, thereby to determine whether or not individual 20 is looking at the object. The correlation is based on the three-dimensional coordinates of the object, preferably relative to origin O1 in terms of system 30, and can be done either when the object is within the current field-of-view of imaging system 24 or when the object is outside the current field-of-view of system 24 (namely an object that has been previously imaged by system 24, but does not currently appear in the field-of-view of system 24).
When gaze tracker 42 recognizes the face of individual 20, correlator 36 optionally and preferably dynamically associates the gaze direction or gaze point with the recognized face.
When the object is display device 16, correlator 36 optionally and preferably correlates gaze direction 28 with video image 14 by dynamically associating a gaze point 28a on display 16 with a spatiotemporal region 14a of image 14.
Spatiotemporal region 14a encompasses a part of image 14 both spatially and temporally. Specifically, region 14a corresponds to a spatial portion of N frames of image 14, where N is a positive integer which is less than the total number of frames in image 14. The spatial size of region 14a can be from 10 pixels to 100 pixels in diameter. The temporal duration of spatiotemporal region 14a is represented by the number of frames N, which is typically from 1 to 100 or from 1 to 50 or from 1 to 25.
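As an illustrative data structure only, a spatiotemporal region of the kind described could be represented as a circular image patch tied to a run of frames, as sketched below; the field names and the circular shape are assumptions for the example, not the patent's definition.

```python
from dataclasses import dataclass

@dataclass
class SpatiotemporalRegion:
    """A circular patch of the displayed image over a run of frames."""
    cx: float          # patch center on the display, pixels
    cy: float
    radius_px: float   # e.g., 5-50 px, i.e., 10-100 px in diameter
    first_frame: int
    last_frame: int    # first_frame + N - 1, with N typically 1-100

    def contains(self, frame, gx, gy):
        """True if gaze point (gx, gy) on the display falls inside the
        patch while one of its frames is being shown."""
        if not (self.first_frame <= frame <= self.last_frame):
            return False
        return (gx - self.cx) ** 2 + (gy - self.cy) ** 2 <= self.radius_px ** 2
```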
Also contemplated are embodiments in which correlator 36 correlates the gaze direction with sound information or speech recognition that may be detected by system 10. Further contemplated are embodiments in which correlator 36 correlates the gaze direction with other gestures that may be performed by individual 20.
Correlator 36 preferably expresses the coordinate system of the object in terms of the coordinate system over which gaze direction 28 is defined. Conversely, correlator 36 can express the coordinate system over which gaze direction 28 is defined in terms of the coordinate system of the object.
The coordinate system over which gaze direction 28 is defined is typically provided by gaze tracker 42 or data processor 44, which in some embodiments of the present invention is configured for reconstructing a three-dimensional image from the imagery data provided by the image sensors as well as for extracting the gaze direction from the reconstructed image.
The coordinate system of the object can be predetermined, in case of a static object, or it can be provided by other means. For example, one or more of the imaging sensors can be positioned so as to capture also the object (e.g., display device 16) wherein the coordinates of the object can be calculated by data processor 44 and/or gaze tracker 42.
Correlator 36 receives the coordinates of the object and the gaze direction and expresses the object and the gaze direction using the same coordinate system. Since both the object and the gaze direction are expressed using the same coordinate system, no calibration is required for determining gaze point 28a.
In some embodiments of the present invention system 10 comprises an output circuit 38 configured for transmitting data pertaining to the correlation to a remote location. For example, when correlator 36 determines whether or not the individual is looking at the object, output circuit 38 can transmit the results of such a determination. When correlator 36 dynamically associates a gaze point on display 16 with a spatiotemporal region 14a of image 14, output circuit 38 can transmit the associated spatiotemporal region to the remote location. The transmission can be over network 18.
The present embodiments are useful for many applications, particularly, but not exclusively, behavioral research. For example, it is recognized that the success of visual advertisements, such as commercial television advertisements, is measured by TV viewing rating, which is a measure that indicates how many TVs are switched to a given channel at a given time. It has been recognized by the present inventor that although conventional techniques are capable of determining whether or not a particular TV device is tuned to a particular channel, it is unknown whether an individual is in front of the display, how many individuals are in front of the display, and whether or not the individual(s) are actually viewing the contents displayed by the TV. The present inventor also found that the problem is particularly complicated when a hidden commercial advertisement is displayed, because the timing at which the relevant content is displayed is unknown.
The present embodiments provide a solution to at least some of these problems. In some embodiments of the present invention system 10 comprises a controller 46 configured for automatically activating and deactivating gaze tracker 42, correlator 36 and output circuit 38 responsively to a content of video image 14. For example, gaze tracker 42, correlator 36 and output circuit 38 are activated when the video image comprises a commercial advertisement, and are deactivated otherwise. In some embodiments, video system 12 is configured for automatically identifying the content of video image 14.
When the video image comprises a hidden commercial advertisement, correlator 36 is configured for determining whether the spatiotemporal region 14a encompasses the hidden commercial advertisement. Correlator 36 can also calculate correlation statistics pertaining to an amount of time the spatiotemporal region is being associated with the gaze direction. The correlation statistics can also be transmitted to the remote location.
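A hedged sketch of such a dwell-time statistic follows: it counts the gaze samples that fall inside a given spatiotemporal region and converts the count to seconds using an assumed display frame period. The predicate interface and the 30 fps default are assumptions of the example (the contains() method of the region sketch above would fit).

```python
def dwell_time(in_region, gaze_samples, frame_period=1.0 / 30):
    """Correlation statistic: total time the gaze point stays inside a
    spatiotemporal region (for example, one containing a hidden advertisement).

    in_region    -- predicate in_region(frame, gx, gy) -> bool
    gaze_samples -- iterable of (frame_index, gx, gy) tuples
    frame_period -- display frame duration in seconds (30 fps assumed here)
    """
    hits = sum(1 for frame, gx, gy in gaze_samples if in_region(frame, gx, gy))
    return hits * frame_period
```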
Reference is now made to FIG. 2 which is a flowchart diagram of a method suitable for determining gaze, according to some embodiments of the present invention. At least some operations of the method can be executed by a data processor, e.g. , data processor 44, and at least some operations of the method can be executed by an imaging system, e.g. , imaging system 24. In some exemplary embodiments of the invention the operations below are executed by computer vision system 10.
It is to be understood that, unless otherwise defined, the operations described hereinbelow can be executed either contemporaneously or sequentially in many combinations or orders of execution. Specifically, the ordering of the flowchart diagrams is not to be considered as limiting. For example, two or more operations, appearing in the following description or in the flowchart diagrams in a particular order, can be executed in a different order (e.g., a reverse order) or substantially contemporaneously. Additionally, several operations described below are optional and may not be executed.
The method can be embodied in many forms. For example, it can be embodied on a tangible medium such as a computer for performing the method operations. It can be embodied on a computer readable medium, comprising computer readable instructions for carrying out at least some of the method operations. It can also be embodied in an electronic device having digital computer capabilities arranged to run the computer program on the tangible medium or execute the instructions on a computer readable medium.
The method begins at 50. At 51 the method optionally receives from a remote location a stream of imagery data, and at 52 the method optionally displays an image (e.g., a video image) corresponding to the stream of imagery data on a display device, as further detailed hereinabove. In various exemplary embodiments of the invention, the method decodes and/or decompresses the video data before the image is displayed. In some embodiments, the method also receives from the remote location an audio stream accompanying the video stream, and transmits the audio stream to an audio output device, as further detailed hereinabove. In various exemplary embodiments of the invention, the method decodes and/or decompresses the audio data before the image is displayed.
At 53 the method optionally illuminates at least a portion of the local scene in front of the display device, for example, by visible and/or infrared light, as further detailed hereinabove.
At 54, an imaging system, such as imaging system 24, is used for capturing a stream of imagery data of an individual in the local scene. In some embodiments of the present invention the method captures a video stream and range data, as further detailed hereinabove. When the scene is illuminated by infrared light, the method optionally and preferably captures imaging data in the visible range and in the infrared range. At 55 the method constructs a three-dimensional image of the local scene and extracts a gaze direction and/or the head orientation of the individual, using a data processor. The three-dimensional image and the gaze direction are defined over the same three-dimensional coordinate system, as further detailed hereinabove. The method can provide, for example, using a data processor, a three-dimensional geometric model of the scene and the locations of various objects within the scene. The imaging system can provide the imagery data generally continually (e.g., at a frame rate of at least 12 or at least 24 or at least 30 or at least 48 or at least 50 or at least 60 or at least 72 or at least 120 or at least 300 frames per second) or via a scanning process that maps the scene.
The three-dimensional image can optionally and preferably overlap with a previously prepared three-dimensional model of the scene. In these embodiments, the coordinates of the objects as imaged by the imaging system are expressed by means of the coordinates defined by the previously prepared three-dimensional model.
In various exemplary embodiments of the invention the method identifies objects in the scene, within the field-of-view of the imaging system, and tracks their position and, optionally and preferably, also their orientation. The tracked objects can be stationary or moving objects. In various exemplary embodiments of the invention the tracking data (position and/or orientation) is associated with a time tag of the captured image so as to allow correlating the activity of the tracked object with other inputs such as voice or gesture.
In various exemplary embodiments of the invention the tracked objects include at least one individual. In some embodiments of the present invention the method detects presence or absence of individuals in the scene, and optionally identifies the individual by employing a face recognition procedure.
In various exemplary embodiments of the invention the gaze direction is defined over the three-dimensional coordinate system already during the extraction of the gaze direction. In some embodiments, the face of the individual is also defined over the same coordinate system as the scene, so that there is no transfer of coordinates between the coordinates of the face of the individual and the coordinates of the scene. Alternatively, the face of the individual can be defined over a separate coordinate system (e.g., a two- dimensional coordinate system), and the method expresses the coordinates of the face in terms of the three-dimensional coordinate system of the scene, so that the gaze direction is also defined over the three-dimensional coordinate system of the scene.
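For illustration only, a minimal Python sketch of the coordinate transfer mentioned above is given below, assuming the face tracker reports points in its own coordinate system and that a rotation matrix and translation vector relating that system to the scene are available; the names face_to_scene, R_face_to_scene and t_face_to_scene are hypothetical.

    import numpy as np

    def face_to_scene(points_face, R_face_to_scene, t_face_to_scene):
        """Express points given in a separate face coordinate system in the
        three-dimensional coordinate system of the scene, using a known
        3x3 rotation matrix R and 3-vector translation t."""
        P = np.asarray(points_face, dtype=float)        # (N, 3) face-frame points
        R = np.asarray(R_face_to_scene, dtype=float)
        t = np.asarray(t_face_to_scene, dtype=float)
        return P @ R.T + t                              # (N, 3) scene-frame points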
The method optionally and preferably continues to 56 at which the method determines whether the gaze direction passes through an object in the scene. The object can be stationary or it can be a moving object. In some embodiments of the present invention the object is the display device. The object can be imaged by the imaging system at 54, in which case the method optionally and preferably determines the position of the object in the scene. Alternatively, the method can receive the position of the object from an external source (for example, the object can be static at a predetermined position, and/or the method can receive the position from a position tracking system). In various exemplary embodiments of the invention the method defines the position over the three-dimensional coordinate system.
The determination 56 is optionally and preferably based on the gaze direction, the three-dimensional location of the individual, the three-dimensional location of the object, and the time stamp of the images used to extract the locations and gaze directions.
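A minimal, non-limiting Python sketch of one possible realization of determination 56 follows, under the assumption that the object is approximated by a bounding sphere and that gaze and object measurements are considered simultaneous within a hypothetical tolerance max_dt; all names are illustrative.

    import numpy as np

    def gaze_passes_through(gaze_origin, gaze_dir, obj_center, obj_radius,
                            gaze_time, obj_time, max_dt=0.04):
        """Return True if the gaze ray passes through the object's bounding sphere
        and the two measurements share (approximately) the same time stamp.
        All positions are in the scene's 3D coordinate system."""
        if abs(gaze_time - obj_time) > max_dt:
            return False                          # measurements not simultaneous
        d = np.asarray(gaze_dir, float)
        d = d / np.linalg.norm(d)
        v = np.asarray(obj_center, float) - np.asarray(gaze_origin, float)
        along = float(np.dot(v, d))               # distance along the ray to the closest point
        if along < 0.0:
            return False                          # object lies behind the viewer
        offset = v - along * d                    # perpendicular offset from the ray
        return float(np.linalg.norm(offset)) <= float(obj_radius)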
In some embodiments of the present invention the method synchronizes the extraction of the gaze direction with the video image. In some embodiments of the present invention the method correlates 57 the gaze direction with the video image on the display device, by dynamically associating a gaze point on the display with a spatiotemporal region of the image, as further detailed hereinabove.
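As a non-limiting sketch of correlation 57, the Python fragment below associates a gaze point on the display with a spatiotemporal region of the video; the representation of a region as a tuple (t_start, t_end, x0, y0, x1, y1, content_label) is an assumption made for illustration only.

    def correlate_gaze(gaze_point_px, gaze_time, regions):
        """Return the labels of all spatiotemporal regions that contain the gaze
        point (display pixel coordinates) at the given time."""
        x, y = gaze_point_px
        hits = []
        for t0, t1, x0, y0, x1, y1, label in regions:
            if t0 <= gaze_time <= t1 and x0 <= x <= x1 and y0 <= y <= y1:
                hits.append(label)
        return hits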
At 58 the method optionally and preferably transmits the correlation to a remote location.
The method ends at 59.
As used herein the term "about" refers to ± 10 %.
The word "exemplary" is used herein to mean "serving as an example, instance or illustration." Any embodiment described as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments and/or to exclude the incorporation of features from other embodiments.
The word "optionally" is used herein to mean "is provided in some embodiments and not provided in other embodiments." Any particular embodiment of the invention may include a plurality of "optional" features unless such features conflict.
The terms "comprises", "comprising", "includes", "including", "having" and their conjugates mean "including but not limited to".
The term "consisting of means "including and limited to". The term "consisting essentially of" means that the composition, method or structure may include additional ingredients, steps and/or parts, but only if the additional ingredients, steps and/or parts do not materially alter the basic and novel characteristics of the claimed composition, method or structure.
As used herein, the singular form "a", "an" and "the" include plural references unless the context clearly dictates otherwise. For example, the term "a compound" or "at least one compound" may include a plurality of compounds, including mixtures thereof.
Throughout this application, various embodiments of this invention may be presented in a range format. It should be understood that the description in range format is merely for convenience and brevity and should not be construed as an inflexible limitation on the scope of the invention. Accordingly, the description of a range should be considered to have specifically disclosed all the possible subranges as well as individual numerical values within that range. For example, description of a range such as from 1 to 6 should be considered to have specifically disclosed subranges such as from 1 to 3, from 1 to 4, from 1 to 5, from 2 to 4, from 2 to 6, from 3 to 6 etc., as well as individual numbers within that range, for example, 1, 2, 3, 4, 5, and 6. This applies regardless of the breadth of the range.
Whenever a numerical range is indicated herein, it is meant to include any cited numeral (fractional or integral) within the indicated range. The phrases "ranging/ranges between" a first indicated number and a second indicated number and "ranging/ranges from" a first indicated number "to" a second indicated number are used herein interchangeably and are meant to include the first and second indicated numbers and all the fractional and integral numerals therebetween.
It is appreciated that certain features of the invention, which are, for clarity, described in the context of separate embodiments, may also be provided in combination in a single embodiment. Conversely, various features of the invention, which are, for brevity, described in the context of a single embodiment, may also be provided separately or in any suitable subcombination or as suitable in any other described embodiment of the invention. Certain features described in the context of various embodiments are not to be considered essential features of those embodiments, unless the embodiment is inoperative without those elements. Various embodiments and aspects of the present invention as delineated hereinabove and as claimed in the claims section below find experimental support in the following examples.

EXAMPLES
Reference is now made to the following examples, which together with the above descriptions illustrate some embodiments of the invention in a non-limiting fashion.
The present example describes a system which tracks the position of an observer and his or her attention direction, and correlates the viewing site and line of gaze with the content that is displayed on a display. The local scene near the display can comprise one or more individuals, each of whom may be watching the display or any other object or individual within the scene. The display screen displays content that may include a plurality of images, such as a video image and/or a split screen image.
The system employs viewer tracking by performing a process that expresses the screen 2D coordinate system in terms of the 3D tracking device coordinate system. The system comprises a 3D imaging system that captures a 3D image of the scene and provides the locations of any relevant object in the scene.
In some embodiments, the system synchronizes changes in the displayed content or in the scene with the gaze tracking procedure so as to correlate between the displayed image and the gaze.
In some embodiments, the system automatically recognizes a content and its location on the screen of the display device, e.g., in terms of the coordinates of the screen. The system then expresses the location of the particular content in terms of the 3D coordinates of the scene.
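For illustration only, the following Python sketch expresses a screen location, given in pixel coordinates, in the 3D coordinates of the scene, assuming the screen origin and per-pixel basis vectors have already been obtained from the space correlation data described further below; the names screen_px_to_scene, u_per_px and v_per_px are hypothetical.

    import numpy as np

    def screen_px_to_scene(px, screen_origin, u_per_px, v_per_px):
        """Map a screen pixel location (col, row) to a 3D point in the scene,
        given the 3D position of the top-left screen corner and the 3D
        displacement per pixel along the screen width (u) and height (v)."""
        col, row = px
        return (np.asarray(screen_origin, float)
                + col * np.asarray(u_per_px, float)
                + row * np.asarray(v_per_px, float))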
In some embodiments, the processing is performed substantially in real time (e.g., within less than half a second, or less than 50 ms, or less than 5 ms, or less than 1 ms, or less than 200 μs, or less than 100 μs). The system of the present embodiments can also perform post processing activities by recording content location and corresponding tracked data to a computer readable medium.
The system of the present embodiments uses data received from the same imaging sensor both for 3D imaging of a plurality of objects in the scene, and for determining the gaze of an individual within the scene. Both the 3D image and the gaze direction or gaze spot are optionally and preferably expressed in terms of the same 3D coordinate system.
Following is a description of the exemplified system, with reference to FIG. 3.
A video image from a video stream is displayed on a display with input information on a specific content in the video stream, its timing and its location on the screen. The source for the information may be a predefined database that comes with the video stream, or a computerized device that operates an image processing algorithm which automatically detects the content in the video stream, its timing and its location on the screen.
A tracking system detects, recognizes and tracks the individuals that watch the video image on the screen. The system repeatedly indicates who is watching, when, and the spot on the screen that is being watched at a particular time-instant. The system may employ any suitable tracking procedure that complies with the setup conditions in terms of distance from the screen, lighting conditions, etc.
A computer correlates the information on the content in the video stream with the tracking information on the individuals that watch the screen. This information may be used immediately, for example in order to identify a specific item of interest and to display additional information such as its detailed specification. This information can also be used at a later time, for example, to provide statistical data on how specific content was watched by a specific viewer or group of viewers.
The system optionally comprises a correlator that can be a dedicated circuit configured for synchronizing the 3D spatial imaging system input with any specific content that is displayed on the screen.
The system optionally comprises a data processor that processes the provided 3D map reconstruction data in order to obtain information on the viewers' positions and their attention directions. The data processor can also calculate the spot on the screen that has been watched at any time. The data processor can also process the video streaming data, so as to identify a content, such as a commercial advertisement, and to determine the location of the content on the screen of the display device. The data processor can also process the geometrical information of the viewer's spot of attention and the content location on the screen so as to correlate between the content and the gaze. In some embodiments, the data processor employs a spatial correlation process that expresses the coordinate system of the display in terms of the coordinate system of the scene as provided by the 3D imaging system. This allows the processor to calculate the geometrical relations between any position of the individual, any gaze direction thereof, and any location on the screen of the display.
The data processor can employ one or more of the following algorithms: a content detection algorithm that detects a content in the video stream and provides its location on the screen; a face recognition algorithm that identifies the individual(s) in the local scene; a face tracking algorithm that tracks the position of the individual in the scene; a gaze direction tracking algorithm that tracks and measures the gaze direction of the individual(s) in the scene; and a geometrical calculation algorithm that provides the geometrical relation between each tracked viewer, his or her gaze direction, and the content spot on the screen that he or she is gazing at.
The system of the present Example comprises a spatial imaging system that contains one or more imaging sensors, which may be 2D image sensors and/or 3D sensors. The system provides an image of the scene in the system field of view (FOV), where each object in the image is defined by a set of 3D spatial coordinates (X, Y, Z). The coordinate system is aligned with the display screen coordinate system by a set of space correlation data.
A space correlation process may comprise measurements of the 3D spatial coordinates (X, Y, Z) of three or more points whose positions in the 2D screen coordinate system are known.
This may be done by mounting at least one of the imaging sensors at a position such that at least part of the screen is in its field of view.
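A minimal, non-limiting Python sketch of such a space correlation process is given below, assuming exactly three screen corners are measured in the 3D coordinate system of the scene and that the screen resolution is known; the function name space_correlation and the corner naming are hypothetical. Together with the screen_px_to_scene sketch above, this allows any screen pixel to be expressed in scene coordinates.

    import numpy as np

    def space_correlation(p_topleft, p_topright, p_bottomleft, width_px, height_px):
        """Derive the screen origin and the per-pixel basis vectors from the
        measured 3D positions (scene coordinates) of three screen corners."""
        o = np.asarray(p_topleft, float)
        u = (np.asarray(p_topright, float) - o) / float(width_px)     # per pixel along the width
        v = (np.asarray(p_bottomleft, float) - o) / float(height_px)  # per pixel along the height
        return o, u, v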
In some embodiments, the processor employs an image processing algorithm that automatically recognizes faces and provides their 6 degrees-of-freedom (DOF) position and orientation in the 3D coordinate system. An example of such an image processing algorithm is faceAPI marketed by Seeing Machines Inc., USA.
The processor receives the images from the spatial imaging system, employs the face tracking algorithm, which provides the face 6 DOF position and orientation in the defined coordinate system, and, based on a space correlation data set, calculates the spot on the screen that each face is looking at. This process may further include a gaze tracking algorithm. The correlator receives information pertaining to the content that is displayed on the screen, together with information from the spatial imaging system. This synchronization circuit provides the correlation information of any specific viewer and his or her gaze direction with a specific spot on the screen that contains a specific content.
In some embodiments, the correlator receives the video streaming data and provides a syncing signal of predefined content on the screen. This sync signal allows the system to timely correlate any spot on the screen that has been watched with the specific content that was displayed on the screen at that spot.
The correlation provides information that includes the content on the screen that is being viewed by each of the viewers at a particular time instant or a particular period of time.
A preferred procedure according to some embodiments of the present invention will now be described with reference to FIG. 4, which is not to be considered as limiting.
A spatial image is captured by the spatial image sensor and the imagery data are transferred to the data processor. The spatial image may be a single 2D image, a set of multiple 2D images, a 3D image, a set of multiple 3D images, or any combination that provides the required spatial image.
The capturing of the images is preferably performed synchronously with the content displayed on the screen by means of a sync signal that is received in the data processor, and a time mark is recorded.
When the content on the display is a video image, a video time code analyzer can optionally be used in order to synchronize the capturing of the spatial images and their time tags with the displayed video time code. The images are then processed, for example using a face tracker algorithm, to provide the position and orientation (6 DOF) of each viewer face in the 3D coordinate system of the scene.
The processor calculates, based on the space correlation data, the geometrical relations between each viewer and the screen. Based on this calculation, the processor determines, for each viewer, whether the screen is being viewed by the respective viewer and, optionally, which part of the screen is being viewed. The criteria for such determination are optionally and preferably based on the field of view of each viewer relative to the position and orientation of his or her eyes, as extracted by the face tracker algorithm. The geometrical calculation may include, for example, calculation of a gaze vector that is aiming outward from the tracked face at the middle point between the eyes of the tracked face. The direction of this vector in the 3D coordinate system of the scene can be used to determine the FOV of that tracked viewer. The data processor can then determine whether the screen is within that FOV and associate the determination with the time tag of the image of the scene, as well as the time code of the video, if applicable.
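The following Python sketch is a non-limiting illustration of this calculation, assuming the face tracker returns a 3x3 rotation matrix and the eye mid-point in scene coordinates; the choice of face-forward axis and the viewer half field of view are assumptions made for illustration only.

    import numpy as np

    def screen_in_view(R_face, eye_midpoint, screen_center, half_fov_deg=30.0):
        """Compute a gaze vector aiming outward from the tracked face and test
        whether the screen centre lies within the assumed viewer field of view."""
        gaze = np.asarray(R_face, float) @ np.array([0.0, 0.0, 1.0])  # assumed forward axis
        gaze = gaze / np.linalg.norm(gaze)
        to_screen = np.asarray(screen_center, float) - np.asarray(eye_midpoint, float)
        to_screen = to_screen / np.linalg.norm(to_screen)
        return float(np.dot(gaze, to_screen)) >= np.cos(np.radians(half_fov_deg))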
In some embodiments, the processor calculates the spot on the screen that the gaze vector is aiming at. This spot may be subsequently correlated with a spatiotemporal region of the displayed video image.
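A non-limiting Python sketch of this spot calculation follows, reusing the hypothetical screen description (origin and per-pixel basis vectors) introduced above; it intersects the gaze ray with the screen plane and expresses the hit point in screen pixel coordinates.

    import numpy as np

    def gaze_spot_on_screen(eye, gaze_dir, screen_origin, u_per_px, v_per_px):
        """Return the (col, row) pixel coordinates of the spot the gaze ray hits,
        or None when the ray is parallel to the screen or aims away from it."""
        o = np.asarray(screen_origin, float)
        u = np.asarray(u_per_px, float)
        v = np.asarray(v_per_px, float)
        n = np.cross(u, v)                            # screen plane normal
        d = np.asarray(gaze_dir, float)
        denom = float(np.dot(n, d))
        if abs(denom) < 1e-9:
            return None                               # gaze parallel to the screen plane
        t = float(np.dot(n, o - np.asarray(eye, float))) / denom
        if t <= 0.0:
            return None                               # screen is behind the viewer
        hit = np.asarray(eye, float) + t * d          # 3D intersection point
        coords, *_ = np.linalg.lstsq(np.stack([u, v], axis=1), hit - o, rcond=None)
        return float(coords[0]), float(coords[1])     # (col, row) on the screen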
In some embodiments, the tracker is synchronized with the video stream and therefore any latency in the tracking system output is maintained constant during the operation. This allows the system of the present embodiments to treat the latency as a constant time delay while providing the correlation information.
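For illustration only, a minimal Python sketch of treating the latency as a constant time delay is shown below; the latency value and the data layout are assumptions, not part of the original disclosure.

    def align_with_video(tracker_samples, video_time_codes, latency=0.12):
        """Match gaze-tracker samples to video time codes assuming a constant
        tracking latency in seconds.  Each tracker sample is a (capture_time,
        payload) pair; each video time code is a float in the same time base."""
        aligned = []
        for capture_time, payload in tracker_samples:
            corrected = capture_time - latency        # undo the constant delay
            nearest = min(video_time_codes, key=lambda tc: abs(tc - corrected))
            aligned.append((nearest, payload))
        return aligned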
When the system records the correlation as determined by the correlator, post processing calculations are optionally and preferably employed.
Although the invention has been described in conjunction with specific embodiments thereof, it is evident that many alternatives, modifications and variations will be apparent to those skilled in the art. Accordingly, it is intended to embrace all such alternatives, modifications and variations that fall within the spirit and broad scope of the appended claims.
All publications, patents and patent applications mentioned in this specification are herein incorporated in their entirety by reference into the specification, to the same extent as if each individual publication, patent or patent application was specifically and individually indicated to be incorporated herein by reference. In addition, citation or identification of any reference in this application shall not be construed as an admission that such reference is available as prior art to the present invention. To the extent that section headings are used, they should not be construed as necessarily limiting.

Claims

WHAT IS CLAIMED IS:
1. A computer vision system, comprising:
an imaging system configured for capturing an image of a scene and an individual in said scene; and
a data processor, configured for receiving imagery data from said imaging system and processing said imaging data so as to construct a three-dimensional image of said scene, and to extract a gaze direction of said individual, wherein said three- dimensional image and said gaze direction are defined over the same three-dimensional coordinate system.
2. The system of claim 1, wherein said gaze direction is defined over said three-dimensional coordinate system during said extraction of said gaze direction.
3. The system according to any of claims 1 and 2, wherein said imaging system is configured to provide a stereoscopic image, and said data processor is configured for reconstructing said three-dimensional image from said stereoscopic image.
4. The system according to any of claims 1 and 2, wherein said imaging system is configured to provide a two-dimensional image and range data, and said data processor is configured for reconstructing said three-dimensional image from said two- dimensional image and range data.
5. The system according to any of claims 1 and 2, wherein said data processor is configured for determining whether said gaze direction passes through an object in the scene.
6. The system according to claim 5, wherein said object is imaged by said imaging system, and said data processor is configured for determining a position of said object and defining said position over said three-dimensional coordinate system.
7. The system according to claim 5, wherein said data processor is configured for receiving a position of said object and defining said position over said three-dimensional coordinate system.
8. The system according to any of claims 3 and 4, wherein said data processor is configured for determining whether said gaze direction passes through an object in the scene.
9. The system according to claim 8, wherein said object is imaged by said imaging system, and said data processor is configured for determining a position of said object and defining said position over said three-dimensional coordinate system.
10. The system according to claim 8, wherein said data processor is configured for receiving a position of said object and defining said position over said three-dimensional coordinate system.
11. The system according to any of claims 1 and 2, wherein said object is a display device.
12. The system according to claim 11, further comprising a video system, configured for receiving and displaying a video image on said display device.
13. The system according to claim 12, wherein said data processor is configured for synchronizing said extraction of said gaze direction with said video image.
14. The system according to claim 12, further comprising a correlator configured for correlating said gaze direction with said video image by dynamically associating a gaze point on said display with a spatiotemporal region of said image.
15. The system according to any of claims 8-10, wherein said object is a display device.
16. The system according to claim 15, further comprising a video system, configured for receiving and displaying a video image on said display device.
17. The system according to claim 16, wherein said data processor is configured for synchronizing said extraction of said gaze direction with said video image.
18. The system according to any of claims 16 and 17, further comprising a correlator configured for correlating said gaze direction with said video image by dynamically associating a gaze point on said display with a spatiotemporal region of said image.
19. A method of determining gaze, comprising:
receiving from an imaging system imagery data describing a scene and an individual in said scene;
using a data processor for processing said imagery data to construct a three- dimensional image of said scene, and to extract a gaze direction of said individual, wherein said three-dimensional image and said gaze direction are defined over the same three-dimensional coordinate system.
20. The method of claim 19, wherein said gaze direction is defined over said three-dimensional coordinate system during said extraction of said gaze direction.
21. The method according to any of claims 19 and 20, further comprising determining whether said gaze direction passes through an object in the scene.
22. The method according to claim 21, wherein said object is imaged by said imaging system, and the method comprising determining a position of said object in said scene and defining said position over said three-dimensional coordinate system.
23. The method according to claim 21, further comprising receiving a position of said object and defining said position over said three-dimensional coordinate system.
24. The method according to any of claims 21-23, wherein said object is a display device.
25. The method according to claim 24, further comprising receiving and displaying a video image on said display device.
26. The method according to claim 25, further comprising synchronizing said extraction of said gaze direction with said video image.
27. The method according to any of claims 25 and 26, further comprising correlating said gaze direction with said video image by dynamically associating a gaze point on said display with a spatiotemporal region of said image.
28. A computer vision system, comprising:
a video system, configured for receiving and displaying a video image on a display device;
a gaze tracker, configured for receiving a local image of a local scene in front of said display device, and extracting a gaze direction of an individual in said local scene;
a correlator, configured for correlating said gaze direction with said video image by dynamically associating a gaze point on said display with a spatiotemporal region of said image; and
an output circuit, for transmitting said associated region to a remote location.
29. The system of claim 28, wherein said gaze tracker is configured for determining presence or absence of said individual.
30. The system according to any of claims 28 and 29, wherein said gaze tracker is configured for extracting a plurality of gaze directions of a respective plurality of individuals in said local scene.
31. The system according to any of claims 28-30, wherein said local image is a three-dimensional image.
32. The system according to any of claims 28 and 29, further comprising a controller configured for automatically activating and deactivating said gaze tracker, said correlator and said output circuit responsively to a content of said video image.
33. The system according to claim 32, wherein said video system is configured for automatically identifying said content.
34. The system according to claim 32, wherein said gaze tracker, said correlator and said output circuit are activated when said video image comprises a commercial advertisement, and are deactivated otherwise.
35. The system according to any of claims 28-31, further comprising a controller configured for automatically activating and deactivating said gaze tracker, said correlator and said output circuit responsively to a content of said video image.
36. The system according to claim 35, wherein said video system is configured for automatically identifying said content.
37. The system according to any of claims 35 and 36, wherein said gaze tracker, said correlator and said output circuit are activated when said video image comprises a commercial advertisement, and are deactivated otherwise.
38. The system according to any of claims 28 and 29, wherein said video image comprises a hidden commercial advertisement, and said correlator is configured for determining whether said spatiotemporal region encompasses said hidden commercial advertisement.
39. The system according to any of claims 30-37, wherein said video image comprises a hidden commercial advertisement, and said correlator is configured for determining whether said spatiotemporal region encompasses said hidden commercial advertisement.
40. The system according to any of claims 28 and 29, wherein said correlator is configured for calculating correlation statistics pertaining to an amount of time said spatiotemporal region is being associated with said gaze direction.
41. The system according to any of claims 30-39, wherein said correlator is configured for calculating correlation statistics pertaining to an amount of time said spatiotemporal region is being associated with said gaze direction.
42. The system according to any of claims 28 and 29, wherein said gaze tracker is configured for recognizing a face of said individual, wherein said correlator is configured for dynamically associating said gaze point with said face.
43. The system according to any of claims 30-42, wherein said gaze tracker is configured for recognizing a face of said individual, wherein said correlator is configured for dynamically associating said gaze point with said face.
44. A method of correlating gaze information, comprising:
displaying a video image on a display device;
receiving a local image of a local scene in front of said display device, and extracting a gaze direction of an individual in said local scene;
correlating said gaze direction with said video image by dynamically associating a gaze point on said display with a spatiotemporal region of said image; and
transmitting said associated region to a remote location.
45. The method of claim 44, further comprising determining presence or absence of said individual.
46. The method according to any of claims 44 and 45, further comprising extracting at least one additional gaze direction of at least one additional individual in said local scene.
47. The method according to any of claims 44-46, wherein said local image is a three-dimensional image.
48. The method according to any of claims 44-47, further comprising activating and deactivating said gaze tracking, said correlation and said transmission responsively to a content of said video image.
49. The method according to claim 48, further comprising identifying said content.
50. The method according to any of claims 48 and 49, wherein said gaze tracker, said correlator and said output circuit are activated when said video image comprises a commercial advertisement, and are deactivated otherwise.
51. The method according to any of claims 44-50, wherein said video image comprises a hidden commercial advertisement, and the method comprises determining whether said spatiotemporal region encompasses said hidden commercial advertisement.
52. The method according to any of claims 44-51, further comprising calculating correlation statistics pertaining to an amount of time said spatiotemporal region is being associated with said gaze direction.
53. The method according to any of claims 44-52, further comprising recognizing a face of said individual, wherein said correlation is further according to said recognition.
PCT/IL2014/050205 2013-02-27 2014-02-27 Method and system for correlating gaze information WO2014132259A1 (en)

Applications Claiming Priority (6)

Application Number Priority Date Filing Date Title
US201361769989P 2013-02-27 2013-02-27
US61/769,989 2013-02-27
US201361817375P 2013-04-30 2013-04-30
US61/817,375 2013-04-30
ILPCT/IL2013/050997 2013-12-04
PCT/IL2013/050997 WO2014087406A1 (en) 2012-12-05 2013-12-04 Method and system for remote controlling

Publications (1)

Publication Number Publication Date
WO2014132259A1 true WO2014132259A1 (en) 2014-09-04

Family

ID=51427589

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/IL2014/050205 WO2014132259A1 (en) 2013-02-27 2014-02-27 Method and system for correlating gaze information

Country Status (1)

Country Link
WO (1) WO2014132259A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10341611B2 (en) 2013-04-30 2019-07-02 Inuitive Ltd. System and method for video conferencing

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2008141460A1 (en) * 2007-05-23 2008-11-27 The University Of British Columbia Methods and apparatus for estimating point-of-gaze in three dimensions
US8077914B1 (en) * 2006-08-07 2011-12-13 Arkady Kaplan Optical tracking apparatus using six degrees of freedom
US20120147328A1 (en) * 2010-12-13 2012-06-14 Microsoft Corporation 3d gaze tracker


Similar Documents

Publication Publication Date Title
US10341611B2 (en) System and method for video conferencing
CN109477966B (en) Head mounted display for virtual reality and mixed reality with inside-outside position tracking, user body tracking, and environment tracking
US9842433B2 (en) Method, apparatus, and smart wearable device for fusing augmented reality and virtual reality
US9947141B2 (en) Method of image processing for an augmented reality application
US20210343057A1 (en) Image processing system, image processing method and program, and device
US20140152530A1 (en) Multimedia near to eye display system
US20120124604A1 (en) Automatic passive and anonymous feedback system
US20120200667A1 (en) Systems and methods to facilitate interactions with virtual content
US20170213085A1 (en) See-through smart glasses and see-through method thereof
TWI620098B (en) Head mounted device and guiding method
WO2017028498A1 (en) 3d scenario display method and apparatus
EP3243331A1 (en) Devices, systems and methods for auto-delay video presentation
US20180316877A1 (en) Video Display System for Video Surveillance
KR20130107981A (en) Device and method for tracking sight line
US10460466B2 (en) Line-of-sight measurement system, line-of-sight measurement method and program thereof
WO2022233111A1 (en) Transparent object tracking method and system based on image difference
JPH1124603A (en) Information display device and information collecting device
US20100123716A1 (en) Interactive 3D image Display method and Related 3D Display Apparatus
US9046921B2 (en) Display apparatus and control method thereof
WO2014103088A1 (en) Display system and display control method
JP2010204823A (en) Line-of-sight recognition device
KR20170011854A (en) Display apparatus and method for realizing picture thereof
WO2014132259A1 (en) Method and system for correlating gaze information
US11205405B2 (en) Content arrangements on mirrored displays
US11189047B2 (en) Gaze based rendering for audience engagement

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 14757232

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 08/03/2016)

122 Ep: pct application non-entry in european phase

Ref document number: 14757232

Country of ref document: EP

Kind code of ref document: A1