WO2021146177A1 - Systems and methods for eye tracking using machine learning techniques - Google Patents

Systems and methods for eye tracking using machine learning techniques

Info

Publication number
WO2021146177A1
Authority
WO
WIPO (PCT)
Prior art keywords
eye
finder module
user
digital image
gaze
Prior art date
Application number
PCT/US2021/013058
Other languages
French (fr)
Other versions
WO2021146177A8 (en)
Inventor
Robert C. CHAPPELL
Zachary Sharp MICKELSON
Tai Chan
Original Assignee
Eye Tech Digital Systems, Inc.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Eye Tech Digital Systems, Inc. filed Critical Eye Tech Digital Systems, Inc.
Priority to EP21741279.0A priority Critical patent/EP4091095A1/en
Priority to CN202180007615.5A priority patent/CN114930410A/en
Publication of WO2021146177A1 publication Critical patent/WO2021146177A1/en
Publication of WO2021146177A8 publication Critical patent/WO2021146177A8/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/70Determining position or orientation of objects or cameras
    • G06T7/73Determining position or orientation of objects or cameras using feature-based methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/011Arrangements for interaction with the human body, e.g. for user immersion in virtual reality
    • G06F3/013Eye tracking input arrangements
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/18Eye characteristics, e.g. of the iris
    • G06V40/193Preprocessing; Feature extraction
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10016Video; Image sequence
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10048Infrared image
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20016Hierarchical, coarse-to-fine, multiscale or multiresolution image processing; Pyramid transform
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20021Dividing image into blocks, subimages or windows
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/30Subject of image; Context of image processing
    • G06T2207/30004Biomedical image processing
    • G06T2207/30041Eye; Retina; Ophthalmic
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/30Subject of image; Context of image processing
    • G06T2207/30196Human being; Person
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/255Detecting or recognising potential candidate objects based on visual cues, e.g. shapes

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Biomedical Technology (AREA)
  • Ophthalmology & Optometry (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • Medical Informatics (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)
  • Studio Devices (AREA)
  • Image Processing (AREA)

Abstract

An eye tracking system includes an eye finder module configured to receive a digital image of a user's face and produce eye region data specifying first and second locations of the user's eyes. A pupil finder module receives the eye region data and determines, using the digital image and the eye region data, the locations of first and second pupil centers within the digital image. A gaze finder module determines a user gaze point based in part on the locations of the first and second pupil centers. At least one of the eye finder module, pupil finder module, and gaze finder module are implemented as a previously trained machine learning model.

Description

SYSTEMS AND METHODS FOR EYE TRACKING USING
MACHINE LEARNING TECHNIQUES
TECHNICAL FIELD
[0001] The present invention relates, generally, to eye-tracking systems and methods and, more particularly, to the application of machine learning techniques to such eye-tracking systems.
BACKGROUND
[0002] Eye-tracking systems, such as those used in conjunction with desktop computers, laptops, tablets, virtual reality headsets, and other computing devices that include a display, generally include one or more illuminators configured to direct infrared light to the user’s eyes and an image sensor that captures the resulting images for further processing. By determining the relative locations of the user’s pupils and the corneal reflections produced by the illuminators, the eye-tracking system can accurately predict the user’s gaze point on the display.
[0003] While currently known eye-tracking systems are reasonably accurate and responsive for gaming purposes, there are a number of ways in which such systems might be improved. For example, there is a need for improved robustness in eye-tracking systems to address partial occlusions and circumstances in which a user’s eyeglasses present image-processing challenges.
[0004] Systems and methods are therefore needed that overcome these and other limitations of the prior art.
SUMMARY OF THE INVENTION
[0005] Various embodiments of the present invention relate to systems and methods for, inter alia: i) providing improved eye-tracking using previously trained machine learning models; ii) providing improved eye-tracking calibration through frequent retraining of a machine learning model during normal use of the system; iii) providing improved eye-tracking using machine learning models configured to perform eye finding and/or pupil finding; iv) providing improved eye-tracking using a combination of shallow artificial neural networks (ANNs) and convolutional neural networks (CNNs); and v) providing improved eye-tracking functionality using a hybrid approach including both traditional and machine learning models.
BRIEF DESCRIPTION OF THE DRAWING FIGURES
[0006] The present invention will hereinafter be described in conjunction with the appended drawing figures, wherein like numerals denote like elements, and:
[0007] FIG. 1 is a conceptual block diagram illustrating an eye-tracking system in accordance with various embodiments;
[0008] FIGS. 2 and 3 present schematic block diagrams of eye-tracking systems in accordance with various embodiments;
[0009] FIG. 4 is a flowchart illustrating an eye-tracking method in accordance with various embodiments;
[0010] FIGS. 5A-5C illustrate the determination of eye regions in accordance with various embodiments;
[0011] FIGS. 6A-6B illustrate the imaging of a user’s corneal reflections (CRs) and pupil center (PC) in accordance with various embodiments; and
[0012] FIG. 7 illustrates a shallow neural network in accordance with various embodiments; and
[0013] FIG. 8 illustrates a convolutional neural network (CNN) in accordance with various embodiments.
DETAILED DESCRIPTION OF PREFERRED
EXEMPLARY EMBODIMENTS
[0014] The present subject matter relates to improved systems and methods for performing eye-tracking using artificial intelligence (AI) techniques and machine learning (ML) models in place of, or in conjunction with, traditional eye-tracking techniques. In that regard, the following detailed description is merely exemplary in nature and is not intended to limit the inventions or the application and uses of the inventions described herein. Furthermore, there is no intention to be bound by any theory presented in the preceding background or the following detailed description. In the interest of brevity, conventional techniques and components related to eye-tracking algorithms, image sensors, machine learning systems, and digital image processing may not be described in detail herein.
[0015] Referring first to FIG. 1, the present invention is generally implemented in the context of a system 100 including a computing device 110 (e.g., a desktop computer, tablet computer, laptop, or the like) having an eye-tracking assembly 120 coupled to, integrated into, or otherwise associated with device 110. The eye-tracking assembly 120 is configured to observe the facial region 181 of a user 180 within a field of view 170 and, through various techniques described in detail below, track the location and movement of the user’s gaze (or "gaze point”) 113 on display 112 of computing device 110. The gaze point 113 may be characterized, for example, by a tuple (x, y) specifying linear coordinates (in pixels, centimeters, or other suitable unit) relative to an arbitrary reference point on display screen 112 (e.g., the upper left corner, as shown).
[0016] In the illustrated embodiment, eye-tracking assembly 120 includes one or more infrared (IR) light emitting diodes (LEDs) 121 positioned to illuminate facial region 181 of user 180. Assembly 120 further includes one or more cameras 125 configured to acquire (at a suitable frame-rate) digital images corresponding to region 181 of the user’s face (generally referred to as "image data”).
[0017] In some embodiments, the image data may be processed locally (i.e., within computing device 110) to determine gaze point 113. In some embodiments, however, eye tracking is accomplished using an image processing module or modules 162 that are remote from computing device 110, e.g., hosted within a cloud computing system 160 communicatively coupled to computing device 110 over a network 150 (e.g., the Internet). In such embodiments, image processing module 162 performs the computationally complex operations necessary to determine the gaze point, which is then transmitted back (as eye and gaze data) over the network to computing device 110. An example cloud-based eye-tracking system that may be employed in the context of the present invention may be found, for example, in U.S. Pat. App. No. 16/434,830, entitled "Devices and Methods for Reducing Computational and Transmission Latencies in Cloud Based Eye Tracking Systems,” filed June 7, 2019, the contents of which are hereby incorporated by reference.
[0018] Referring now to the block diagram illustrated in FIG. 2 and with continued reference to FIG. 1, an eye-tracking system 200 in accordance with various embodiments includes an eye finder module 210 configured to receive an image 201 (i.e., an image acquired of the user’s facial region 181) and determine, as described in further detail below, the location within image 201 of the user’s eyes. This eye location data 211 is then provided to a pupil finder module (or simply "pupil finder”) 220 and a corneal reflection (CR) finder (or simply "CR finder”) 230. In parallel, image 201 may also be directly provided (in raw form) to pupil finder 220 and CR finder 230, as shown.
[0019] The output 221 of pupil finder 220 (e.g., data specifying the predicted location of the pupil centers (PCs) in image 201) is provided to gaze finder module (or simply "gaze finder”) 240. Similarly, the output 231 of CR finder 230 is provided to gaze finder 240. Gaze finder 240 then takes the received pupil and CR information and produces gaze data 241, which in one embodiment includes the gaze coordinates (x, y) (113 in FIG. 1) along with other optional information, such as a value specifying the user’s distance from eye tracking assembly 120.
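For readers who prefer code, the dataflow of FIG. 2 can be summarized with the following minimal Python sketch; the module interfaces, data types, and the function name track_gaze are illustrative assumptions rather than the implementation described in this disclosure.

```python
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[float, float]

@dataclass
class EyeRegions:
    """Output 211 of the eye finder: one bounding box per eye (x, y, w, h)."""
    left_box: Tuple[int, int, int, int]
    right_box: Tuple[int, int, int, int]

@dataclass
class GazeData:
    """Output 241 of the gaze finder: gaze coordinates plus optional extras."""
    x: float
    y: float
    user_distance: float = 0.0

def track_gaze(image, eye_finder, pupil_finder, cr_finder, gaze_finder) -> GazeData:
    """Wire the FIG. 2 modules together (hypothetical callable interfaces)."""
    regions: EyeRegions = eye_finder(image)                     # 210 -> 211
    pupil_centers: List[Point] = pupil_finder(image, regions)   # 220 -> 221
    corneal_refls: List[Point] = cr_finder(image, regions)      # 230 -> 231
    return gaze_finder(pupil_centers, corneal_refls)            # 240 -> 241
```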
[0020] In accordance with the present invention, one or more of modules 210, 220, 230, and 240 as illustrated in FIG. 2 are implemented using previously-trained machine learning models, rather than traditional eye tracking techniques (e.g., conventional geometric models). For example, in one embodiment, eye finder 210, pupil finder 220, and/or CR finder 230 are implemented as convolutional neural networks (CNNs) that perform object detection. In one embodiment, one or more of modules 210, 220, 230, and 240 implement a You Only Look Once (YOLO) algorithm (e.g., YOLO v3) configured to produce a regression output that includes predicted coordinates (i.e., of the CRs and PCs of the user’s eyes). As is known in the art, the YOLO algorithm may be implemented using a variety of programming languages and libraries. In one embodiment, for example, YOLO object detection is implemented on a cloud computing platform using a TensorFlow library.
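As a hedged illustration of how a YOLO-style detection output might be reduced to the point estimates used downstream, the sketch below converts assumed detection boxes into pupil-center and corneal-reflection coordinates; the class indices and array layout are assumptions about how such a detector could be configured, not the configuration of any particular network described here.

```python
import numpy as np

def extract_pcs_and_crs(boxes: np.ndarray, classes: np.ndarray,
                        scores: np.ndarray, threshold: float = 0.5):
    """Convert YOLO-style detections on one eye crop into point estimates.

    boxes:   (N, 4) array of (x1, y1, x2, y2) box corners
    classes: (N,) array; class 0 = pupil, class 1 = corneal reflection
             (an illustrative labeling convention)
    scores:  (N,) array of detection confidences
    """
    centers = np.stack([(boxes[:, 0] + boxes[:, 2]) / 2.0,
                        (boxes[:, 1] + boxes[:, 3]) / 2.0], axis=1)
    keep = scores >= threshold
    pupil_centers = centers[keep & (classes == 0)]
    corneal_refls = centers[keep & (classes == 1)]
    return pupil_centers, corneal_refls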
[0021] In accordance with various embodiments, gaze finder 240 is implemented as a shallow ANN (e.g., an ANN with a single hidden layer) that takes as its input a vector of integers produced by pupil finder 220 and CR finder 230 and produces a regression output including the predicted gaze point coordinates along with confidence levels associated with that prediction.
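A minimal sketch of such a shallow gaze-finder network, written with the TensorFlow/Keras API, is shown below; the input length, hidden-layer width, activation, and loss function are illustrative choices rather than values taken from this disclosure.

```python
import tensorflow as tf

def build_gaze_finder(n_inputs: int = 12, n_hidden: int = 64) -> tf.keras.Model:
    """Shallow ANN: one hidden layer mapping PC/CR coordinates to a gaze point."""
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(n_inputs,)),            # PC and CR coordinates
        tf.keras.layers.Dense(n_hidden, activation="tanh"),  # single hidden layer
        tf.keras.layers.Dense(2, activation="linear"),       # regression output (x, y)
    ])
    model.compile(optimizer="adam", loss="mse")
    return model
```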
[0022] Referring now to FIG. 3, an eye tracking system 300 in accordance with an alternate embodiment includes an eye finder module (or simply "eye finder”) 310 (which, as above, may be implemented using a machine learning model or conventional eye-finding techniques) and a gaze finder module (or simply "gaze finder”) 340. In this embodiment, gaze finder 340 is implemented as a full CNN and receives as its input the data 311 from eye finder 310 as well as the raw user image 301. The result is a gaze point output that may correspond to a regression output (e.g., integer coordinate data) or classification output (a discrete region on display screen 112). Stated another way, while system 200 of FIG. 2 implements a hybrid approach to perform eye tracking (including both machine learning and conventional techniques), system 300 of FIG. 3 performs eye tracking primarily through a single, properly trained CNN.
[0023] In accordance with one embodiment, the gaze point output 341 of gaze finder 340 is further processed to improve the accuracy of the predicted gaze point. The present inventors have determined that such an embodiment is particularly advantageous in accounting for differences between the appearance of the given user and the appearance of the users used for supervised training of the CNN. In one embodiment, for example, numeric x and y offsets are added to the gaze point output 341. In other embodiments, the gaze point output values 341 are multiplied by one or more constants.
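A hedged sketch of such post-processing, with the per-user offsets and scale factors treated as illustrative calibration constants, might look like this:

```python
def correct_gaze(x_raw: float, y_raw: float,
                 x_scale: float = 1.0, y_scale: float = 1.0,
                 x_offset: float = 0.0, y_offset: float = 0.0):
    """Apply per-user scaling and offsets to the raw gaze point output 341."""
    return (x_raw * x_scale + x_offset, y_raw * y_scale + y_offset)
```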
[0024] FIG. 4 is a flowchart illustrating an eye-tracking method 400 in accordance with various embodiments that might be performed, for example, by the eye tracking system illustrated in FIG. 2. More particularly, referring to FIG. 4 together with FIGS. 5A-5C and FIGS. 6A-6B, the method 400 begins with capturing a first image (510) that includes at least a portion of the user’s face 511 (step 401). In one embodiment, the first image 510 is a high resolution image produced by camera 125 of FIG. 1.
[0025] Next, at step 402, a second image 520 of the user’s facial region 521 is produced by decimating (or otherwise down-sampling or reducing the resolution of) the first image 510. In one embodiment, the second image 520 is a 416 x 416 pixel image. Next, at 403, the eye regions 531, 532 are determined from second image 520 or a transformed/decimated third image 530 based on image 520. In one embodiment, as described above, the eye region determination is made by eye finder module 210 using, for example, a YOLO machine learning model.
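The decimation in step 402 can be sketched as follows, assuming OpenCV for the down-sampling; the interpolation mode is an illustrative choice, and the fixed 416 x 416 size is taken from the example above.

```python
import cv2
import numpy as np

def decimate(first_image: np.ndarray, size=(416, 416)) -> np.ndarray:
    """Produce the second, lower-resolution image used for coarse eye finding."""
    return cv2.resize(first_image, size, interpolation=cv2.INTER_AREA)
```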
[0026] After the general eye regions 531 and 532 are determined, the system then crops out, from the first image (i.e., the high resolution image 510), a pair of close-up images of the respective eye regions at those locations (step 404). In one embodiment, these close-up images are 416 x 416 pixel images.
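Step 404 involves mapping an eye-region location found in the decimated image back into the coordinate frame of the high-resolution first image and cropping a fixed-size close-up there. A sketch under those assumptions (the rescaling assumes the decimated image covers the full camera frame):

```python
import numpy as np

def crop_eye(first_image: np.ndarray, center_lowres, lowres_size: int = 416,
             crop_size: int = 416) -> np.ndarray:
    """Crop one close-up eye image from the high-resolution first image.

    center_lowres is the (x, y) eye-region center found in the 416 x 416 image.
    """
    h, w = first_image.shape[:2]
    cx = int(center_lowres[0] * w / lowres_size)
    cy = int(center_lowres[1] * h / lowres_size)
    half = crop_size // 2
    x0 = int(np.clip(cx - half, 0, max(w - crop_size, 0)))
    y0 = int(np.clip(cy - half, 0, max(h - crop_size, 0)))
    return first_image[y0:y0 + crop_size, x0:x0 + crop_size]
```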
[0027] Subsequently, at step 405, the system determines (e.g., using a YOLO machine learning model as described above) the PCs and CRs for each eye. This is illustrated in FIGS. 6A and 6B as a first image 601 including a first eye 531 having a PC 542 and CRs 552; and a second image 602 including a second eye having a PC 543 and CRs 553. Given this data, the system (e.g., gaze finder 240) determines (at step 406) the predicted gaze point (x, y).
[0028] FIG. 7 is a schematic block diagram of an artificial neural network (ANN) 700 in accordance with various embodiments that may be used to implement, for example, the gaze finder 240 of FIG. 2.
[0029] In general, ANN 700 includes an input layer 701 with a number of input nodes (e.g., 701-1 to 701-n), an output layer 703 with a number of output nodes (e.g., 703-1 to 703-j), and one or more interconnected hidden layers 702 (in this example, a single hidden layer 702 including nodes 702-1 to 702-k). The number of nodes in each layer (n, k, and j) may vary depending upon the application, and in fact may be modified dynamically by the system itself to optimize its performance. In some embodiments (e.g., deep learning systems), multiple hidden layers 702 may be incorporated into ANN 700.
[0030] Each of the layers 702 and 703 receives input from a previous layer via a network of weighted connections (illustrated as arrows in FIG. 7). That is, the arrows in Fig. 7 may be represented as a matrix of floating point values representing weights between pairs of interconnected nodes. Each of the nodes implements an "activation function” (e.g., sigmoid, tanh, or linear) that will generally vary depending upon the particular application, and which produces an output that is based on the sum of the inputs at each node.
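In code, a single layer of FIG. 7 reduces to a weighted sum followed by an activation function; the tanh choice below is purely an example.

```python
import numpy as np

def layer_forward(inputs: np.ndarray, weights: np.ndarray,
                  bias: np.ndarray, activation=np.tanh) -> np.ndarray:
    """One ANN layer: each node outputs the activation of its weighted input sum."""
    return activation(inputs @ weights + bias)
```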
[0031] ANN 700 is trained via a learning rule and "cost function” that are used to modify the weights of the connections in response to the input patterns (i.e., eye tracking data) provided to input layer 701 and the training set provided at output layer 703, thereby allowing ANN 700 to learn by example through a combination of backpropagation and gradient descent optimization. Such learning may be supervised (with previously acquired eye tracking data provided as input layer 701 and known gaze point information provided as output layer 703), unsupervised (with uncategorized examples provided to input layer 701), or involve reinforcement learning, where some notion of "reward” is provided during training on the eye-tracking data.
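For the supervised case, a hedged training sketch using the shallow gaze-finder defined in the earlier example might look like the following; the file names, array shapes, and hyperparameters are placeholders, and Keras handles the backpropagation and gradient-descent updates internally.

```python
import numpy as np

# Hypothetical training set: rows of PC/CR coordinates paired with known gaze points.
features = np.load("eye_features.npy")   # shape (num_samples, 12) -- assumed file
targets = np.load("gaze_points.npy")     # shape (num_samples, 2)  -- assumed file

model = build_gaze_finder(n_inputs=features.shape[1])
model.fit(features, targets, epochs=50, batch_size=32, validation_split=0.1)
```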
[0032] Once ANN 700 is trained to a satisfactory level, it may be used as an analytical tool to make predictions and perform classification or regression based on the input 701. That is, new inputs are presented to input layer 701, where they are processed by the middle layer 702 and, via forward propagation through the weights associated with each of the edges, produce an output 703. As described above, output layer 703 will typically include a set of confidence levels or probabilities associated with a corresponding number of different classes, such as the location of the gaze point.
[0033] FIG. 8 is a block diagram of an exemplary convolutional neural network (CNN) in accordance with various embodiments, and which may be used, for example, to implement the gaze finder 340 of FIG. 3.
[0034] As shown in FIG. 8, CNN 800 generally receives an input image 810 (e.g., an image of a user’s facial region) and produces an output 840 comprising a vector of gaze point data.
[0035] In general, CNN 800 implements a convolutional phase 822, followed by feature extraction 820 and classification 830. Convolutional phase 822 uses an appropriately sized convolutional filter that produces a set of feature maps 821 corresponding to smaller tilings of input image 810. As is known, convolution as a process is translationally invariant - i.e., features of interest (e.g., nose, eyes, mouth) can be identified regardless of their location within image 810.
[0036] Subsampling 824 is then performed to produce a set of smaller feature maps 823 that are effectively "smoothed” to reduce sensitivity of the convolutional filters to noise and other variations. Subsampling might involve taking an average or a maximum value over a sample of the inputs 821. Feature maps 823 then undergo another convolution 826, as is known in the art, to produce a large set of smaller feature maps 825. Feature maps 825 are then subsampled 828 to produce feature maps 827.
[0037] During the classification phase (830), the feature maps 827 are processed to produce a first layer 831, followed by a fully-connected layer 833, from which outputs 840 are produced. For example, outputs 841 and 842 might correspond to the likelihood that particular features have been recognized.
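An illustrative Keras definition mirroring the phases of FIG. 8 is given below; the filter counts, kernel sizes, and layer widths are arbitrary example values, not the architecture of any network described in this disclosure.

```python
import tensorflow as tf

def build_gaze_cnn(input_shape=(416, 416, 1)) -> tf.keras.Model:
    """CNN sketch following FIG. 8: two convolution/subsampling stages, then
    fully-connected layers producing a regression gaze-point output."""
    return tf.keras.Sequential([
        tf.keras.layers.Input(shape=input_shape),          # input image 810
        tf.keras.layers.Conv2D(16, 5, activation="relu"),  # convolution 822 -> maps 821
        tf.keras.layers.MaxPooling2D(2),                   # subsampling 824 -> maps 823
        tf.keras.layers.Conv2D(32, 3, activation="relu"),  # convolution 826 -> maps 825
        tf.keras.layers.MaxPooling2D(2),                   # subsampling 828 -> maps 827
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(128, activation="relu"),     # layer 831
        tf.keras.layers.Dense(64, activation="relu"),      # fully-connected layer 833
        tf.keras.layers.Dense(2, activation="linear"),     # outputs 840: (x, y)
    ])
```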
[0038] In general, the CNN illustrated in FIG. 8 is trained in a supervised mode by presenting it with a large number (i.e., a "corpus”) of input images of users’ faces, and "clamping” outputs 840 based on the known, ground truth location of the user’s gaze. Backpropagation, as is known in the art, is then used to refine the training of CNN 800. Subsequently, during normal operation, the trained CNN is used to process images 810 as described above.
[0039] In accordance with various embodiments, training of the machine learning models and consequently the eye-tracking calibration takes place in the background in a way that is largely transparent to the user. That is, the user is not prompted to enter a specified "calibration” mode. Rather, the system, during normal operation, continuously updates and trains the models based on the acquired images.
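One way such transparent recalibration could be organized, sketched here under the assumption that reliably labeled frames are buffered during normal use and periodically consumed in a brief fine-tuning pass, is:

```python
import numpy as np

def background_update(model, feature_buffer: list, target_buffer: list,
                      min_samples: int = 256) -> None:
    """Fine-tune the gaze model in the background once enough new samples exist."""
    if len(feature_buffer) >= min_samples:
        model.fit(np.array(feature_buffer), np.array(target_buffer),
                  epochs=1, batch_size=32, verbose=0)
        feature_buffer.clear()
        target_buffer.clear()
```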
[0040] While the above discussion often focuses on the use of artificial neural networks, the range of embodiments is not so limited. Any of the various modules described herein (e.g., in FIGS. 2 and 3) may be implemented as one or more machine learning models that undergo supervised, unsupervised, semi-supervised, or reinforcement learning and perform classification (e.g., binary or multiclass classification), regression, clustering, dimensionality reduction, and/or similar tasks. Examples of such models include, without limitation, artificial neural networks (ANN) (such as recurrent neural networks (RNN) and convolutional neural networks (CNN)), decision tree models (such as classification and regression trees (CART)), ensemble learning models (such as boosting, bootstrapped aggregation, gradient boosting machines, and random forests), Bayesian network models (e.g., naive Bayes), principal component analysis (PCA), support vector machines (SVM), clustering models (such as K-nearest-neighbor, K-means, expectation maximization, hierarchical clustering, etc.), and linear discriminant analysis models.
[0041] In summary, what have been described are various eye-tracking systems and methods utilizing novel machine learning techniques. In accordance with one embodiment, an eye tracking system includes: an eye finder module configured to receive a digital image of a user’s face and produce eye region data specifying first and second locations of the user’s eyes; a pupil finder module configured to receive the eye region data and to determine, using the digital image and the eye region data, the locations of first and second pupil centers within the digital image; and a gaze finder module configured to determine a user gaze point based in part on the locations of the first and second pupil centers; wherein at least one of the eye finder module, pupil finder module, and gaze finder module are implemented as a previously trained machine learning model.
[0042] In accordance with one embodiment, the gaze finder module is implemented as a shallow artificial neural network (ANN).
[0043] In accordance with one embodiment, the pupil finder module is implemented as a convolutional neural network (CNN).
[0044] In accordance with one embodiment, the CNN is implemented using YOLO object detection.
[0045] In accordance with one embodiment, the system further includes a corneal reflection finder module configured to receive the eye region data and to determine, using the digital image and the eye region data, a plurality of corneal reflections within the digital image, wherein the gaze finder determines the user gaze point based in part on the locations of the plurality of corneal reflections.
[0046] In accordance with one embodiment, the machine learning model is trained by acquiring images of the user during normal operation.
[0047] In accordance with one embodiment, at least one of the eye finder module and gaze finder module is implemented on a cloud computing platform remote from the user.
[0048] An eye tracking system in accordance with another embodiment includes: an eye finder module configured to receive a digital image of a user’s face and produce eye region data specifying first and second locations of the user’s eyes; and a gaze finder module, including a previously trained convolutional machine learning model, configured to determine a user gaze point based in part on the eye region data.
[0049] In accordance with one embodiment, the previously trained machine learning model is a convolutional neural network (CNN).
[0050] In accordance with one embodiment, the eye finder module is implemented using YOLO object detection.
[0051] In accordance with one embodiment, the previously trained machine learning model is trained by acquiring images of the user during normal operation.
[0052] In accordance with one embodiment, at least one of the gaze finder module and eye finder module is implemented using a cloud computing platform remote from the user.
[0053] An eye tracking method in accordance with one embodiment includes: receiving a digital image of a user’s face; producing eye region data specifying first and second locations of the user’s eyes; determining, using the digital image and the eye region data, the locations of first and second pupil centers within the digital image; and determining, using a previously trained shallow neural network model, a user gaze point based in part on the locations of the first and second pupil centers.
[0054] In accordance with one embodiment, the eye region data is determined using a shallow artificial neural network (ANN).
[0055] In accordance with one embodiment, the locations of the first and second pupil centers are determined using a convolutional neural network (CNN).
[0056] In accordance with one embodiment, the CNN is implemented using YOLO object detection.
[0057] In accordance with one embodiment, the method includes receiving the eye region data and determining, using the digital image and the eye region data, a plurality of corneal reflections within the digital image.
[0058] In accordance with another embodiment, the previously trained shallow neural network model is trained by acquiring images of the user during normal operation.
[0059] An eye tracking system in accordance with one embodiment includes: an eye-tracking assembly including at least one infrared (IR) light emitting diode (LED) positioned to illuminate a user’s facial region, and at least one camera configured to acquire a digital image of the user’s facial region; an eye finder module configured to receive a digital image of a user’s facial region from the eye-tracking assembly and produce eye region data specifying first and second locations of the user’s eyes; a pupil finder module implemented as a YOLO convolutional neural network (CNN) configured to receive the eye region data and to determine, using the digital image and the eye region data, the locations of first and second pupil centers within the digital image; a corneal reflection finder module implemented as a YOLO CNN configured to receive the eye region data and to determine, using the digital image and the eye region data, a plurality of corneal reflections within the digital image; and a gaze finder module implemented as a shallow artificial neural network (ANN) configured to determine a user gaze point based in part on the locations of the first and second pupil centers and the locations of the plurality of corneal reflections. At least one of the eye finder module, pupil finder module, corneal reflection finder module, and gaze finder module are trained by acquiring images of the user during normal operation. At least one of the gaze finder module, corneal reflection module, and eye finder module may be implemented using a cloud computing platform that is remote from the user and/or the eye-tracking assembly and/or the computing device with which the eye-tracking assembly is used.
[0060] Embodiments of the present disclosure may be described herein in terms of functional and/or logical block components and various processing steps. It should be appreciated that such block components may be realized by any number of hardware, software, and/or firmware components configured to perform the specified functions. For example, an embodiment of the present disclosure may employ various integrated circuit components, e.g., memory elements, digital signal processing elements, logic elements, look up tables, or the like, which may carry out a variety of functions under the control of one or more microprocessors or other control devices.
[0061] In addition, those skilled in the art will appreciate that embodiments of the present disclosure may be practiced in conjunction with any number of systems, and that the systems described herein are merely exemplary embodiments of the present disclosure. Further, the connecting lines shown in the various figures contained herein are intended to represent example functional relationships and/or physical couplings between the various elements. It should be noted that many alternative or additional functional relationships or physical connections may be present in an embodiment of the present disclosure.

[0062] As used herein, the terms “module” or “controller” refer to any hardware, software, firmware, electronic control component, processing logic, and/or processor device, individually or in any combination, including without limitation: application specific integrated circuits (ASICs), field-programmable gate-arrays (FPGAs), dedicated neural network devices (e.g., Google Tensor Processing Units), electronic circuits, processors (shared, dedicated, or group) configured to execute one or more software or firmware programs, a combinational logic circuit, and/or other suitable components that provide the described functionality.
[0063] As used herein, the word “exemplary” means “serving as an example, instance, or illustration.” Any implementation described herein as “exemplary” is not necessarily to be construed as preferred or advantageous over other implementations, nor is it intended to be construed as a model that must be literally duplicated.
[0064] While the foregoing detailed description will provide those skilled in the art with a convenient road map for implementing various embodiments of the invention, it should be appreciated that the particular embodiments described above are only examples, and are not intended to limit the scope, applicability, or configuration of the invention in any way. To the contrary, various changes may be made in the function and arrangement of elements described without departing from the scope of the invention.

Claims

1. An eye tracking system comprising: an eye finder module configured to receive a digital image of a user’s face and produce eye region data specifying first and second locations of the user’s eyes; a pupil finder module configured to receive the eye region data and to determine, using the digital image and the eye region data, the locations of first and second pupil centers within the digital image; and a gaze finder module configured to determine a user gaze point based in part on the locations of the first and second pupil centers; wherein at least one of the eye finder module, pupil finder module, and gaze finder module is implemented as a previously trained machine learning model.
2. The eye tracking system of claim 1, wherein the gaze finder module is implemented as an artificial neural network (ANN).
3. The eye tracking system of claim 1, wherein the pupil finder module is implemented as a convolutional neural network (CNN).
4. The eye tracking system of claim 3, wherein the CNN is implemented using YOLO object detection.
5. The eye tracking system of claim 1, further including a corneal reflection finder module configured to receive the eye region data and to determine, using the digital image and the eye region data, at least one corneal reflection within the digital image, wherein the gaze finder module determines the user gaze point based in part on the location of the at least one corneal reflection.
6. The eye tracking system of claim 1, wherein the machine learning model is trained by acquiring images of the user during normal operation.
7. The eye tracking system of claim 1, wherein at least one of the eye finder module and gaze finder module is implemented on a cloud computing platform remote from the user.
8. An eye tracking system comprising: an eye finder module configured to receive a digital image of a user’s face and produce eye region data specifying first and second locations of the user’s eyes; and a gaze finder module, including a previously trained machine learning model, configured to determine a user gaze point based in part on the eye region data.
9. The eye tracking system of claim 8, wherein the previously trained machine learning model is a convolutional neural network (CNN).
10. The eye tracking system of claim 8, wherein the eye finder module is implemented using YOLO object detection.
11. The eye tracking system of claim 8, wherein the previously trained machine learning model is trained by acquiring images of the user during normal operation.
12. The eye tracking system of claim 8, wherein at least one of the gaze finder module and eye finder module is implemented using a cloud computing platform remote from the user.
13. An eye tracking method comprising: receiving a digital image of a user’s face; producing eye region data specifying first and second locations of the user’s eyes; determining, using the digital image and the eye region data, the locations of first and second pupil centers within the digital image; and determining, using a previously trained shallow neural network model, a user gaze point based in part on the locations of the first and second pupil centers.
14. The eye tracking method of claim 13, wherein the eye region data is determined using a shallow artificial neural network (ANN).
15. The eye tracking method of claim 13, wherein the locations of the first and second pupil centers are determined using a convolutional neural network (CNN).
16. The eye tracking method of claim 15, wherein the CNN is implemented using YOLO object detection.
17. The eye tracking method of claim 13, further including: receiving the eye region data; and determining, using the digital image and the eye region data, at least one corneal reflection within the digital image.
18. The eye tracking method of claim 13, wherein the previously trained shallow neural network model is trained by acquiring images of the user during normal operation.
19. An eye tracking system comprising: an eye-tracking assembly including at least one infrared (IR) light emitting diode (LED) positioned to illuminate a user’s facial region, and at least one camera configured to acquire a digital image of the user’s facial region; an eye finder module configured to receive a digital image of a user’s facial region from the eye-tracking assembly and produce eye region data specifying first and second locations of the user’s eyes; a pupil finder module implemented as a YOLO convolutional neural network (CNN) configured to receive the eye region data and to determine, using the digital image and the eye region data, the locations of first and second pupil centers within the digital image; a corneal reflection finder module implemented as a YOLO CNN configured to receive the eye region data and to determine, using the digital image and the eye region data, at least one corneal reflection within the digital image; and a gaze finder module implemented as a shallow artificial neural network (ANN) configured to determine a user gaze point based in part on the respective locations of the first and second pupil centers and the at least one corneal reflection; wherein at least one of the eye finder module, pupil finder module, corneal reflection finder module, and gaze finder module is trained by acquiring images of the user during normal operation.
20. The eye tracking system of claim 19, wherein at least one of the gaze finder module, corneal reflection finder module, and eye finder module is implemented using a cloud computing platform remote from the user.
PCT/US2021/013058 2020-01-13 2021-01-12 Systems and methods for eye tracking using machine learning techniques WO2021146177A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP21741279.0A EP4091095A1 (en) 2020-01-13 2021-01-12 Systems and methods for eye tracking using machine learning techniques
CN202180007615.5A CN114930410A (en) 2020-01-13 2021-01-12 System and method for eye tracking using machine learning techniques

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US202016741081A 2020-01-13 2020-01-13
US16/741,081 2020-01-13

Publications (2)

Publication Number Publication Date
WO2021146177A1 true WO2021146177A1 (en) 2021-07-22
WO2021146177A8 WO2021146177A8 (en) 2022-07-21

Family

ID=76864141

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2021/013058 WO2021146177A1 (en) 2020-01-13 2021-01-12 Systems and methods for eye tracking using machine learning techniques

Country Status (3)

Country Link
EP (1) EP4091095A1 (en)
CN (1) CN114930410A (en)
WO (1) WO2021146177A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2626136A (en) * 2023-01-10 2024-07-17 Mercedes Benz Group Ag System and method for estimation of eye gaze direction of a user with or without eyeglasses

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190303724A1 (en) * 2018-03-30 2019-10-03 Tobii Ab Neural Network Training For Three Dimensional (3D) Gaze Prediction With Calibration Parameters

Also Published As

Publication number Publication date
CN114930410A (en) 2022-08-19
EP4091095A1 (en) 2022-11-23
WO2021146177A8 (en) 2022-07-21

Similar Documents

Publication Publication Date Title
US20240013526A1 (en) Depth and motion estimations in machine learning environments
Akinyelu et al. Convolutional neural network-based methods for eye gaze estimation: A survey
US11436437B2 (en) Three-dimension (3D) assisted personalized home object detection
KR102526700B1 (en) Electronic device and method for displaying three dimensions image
WO2020015752A1 (en) Object attribute identification method, apparatus and system, and computing device
WO2021190296A1 (en) Dynamic gesture recognition method and device
US10325184B2 (en) Depth-value classification using forests
US20210319585A1 (en) Method and system for gaze estimation
US20220198836A1 (en) Gesture recognition method, electronic device, computer-readable storage medium, and chip
US11698529B2 (en) Systems and methods for distributing a neural network across multiple computing devices
US20230214458A1 (en) Hand Pose Estimation for Machine Learning Based Gesture Recognition
US20230137337A1 (en) Enhanced machine learning model for joint detection and multi person pose estimation
CN112419326B (en) Image segmentation data processing method, device, equipment and storage medium
CN110738650B (en) Infectious disease infection identification method, terminal device and storage medium
US20230020965A1 (en) Method and apparatus for updating object recognition model
EP4091095A1 (en) Systems and methods for eye tracking using machine learning techniques
Geisler et al. Real-time 3d glint detection in remote eye tracking based on bayesian inference
Uke et al. Optimal video processing and soft computing algorithms for human hand gesture recognition from real-time video
CN115205806A (en) Method and device for generating target detection model and automatic driving vehicle
US12039630B2 (en) Three-dimensional pose detection based on two-dimensional signature matching
El-Baz et al. Robust boosted parameter based combined classifier for rotation invariant texture classification
CN112766063B (en) Micro-expression fitting method and system based on displacement compensation
US20240070892A1 (en) Stereovision annotation tool
Sandoval et al. On the Use of a Low-Cost Embedded System for Face Detection and Recognition
Saleh et al. Reliable switching mechanism for low cost multi-screen eye tracking devices via deep recurrent neural networks

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application
  Ref document number: 21741279
  Country of ref document: EP
  Kind code of ref document: A1
NENP Non-entry into the national phase
  Ref country code: DE
ENP Entry into the national phase
  Ref document number: 2021741279
  Country of ref document: EP
  Effective date: 20220816