GB2574410A - Apparatus and method for eye-tracking based text input

Apparatus and method for eye-tracking based text input

Info

Publication number
GB2574410A
Authority
GB
United Kingdom
Prior art keywords
camera
scene
information
image stream
user
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
GB1809138.9A
Other versions
GB201809138D0 (en)
Inventor
Jan Stefan Hamminga Derk
Beliaeva Marila
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Robot Protos Ltd
Original Assignee
Robot Protos Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Robot Protos Ltd filed Critical Robot Protos Ltd
Priority to GB1809138.9A
Publication of GB201809138D0
Publication of GB2574410A
Legal status: Withdrawn

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/10 Image acquisition
    • G06V10/12 Details of acquisition arrangements; Constructional details thereof
    • G06V10/14 Optical characteristics of the device performing the acquisition or on the illumination arrangements
    • G06V10/147 Details of sensors, e.g. sensor lenses
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01 Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/011 Arrangements for interaction with the human body, e.g. for user immersion in virtual reality
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01 Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/011 Arrangements for interaction with the human body, e.g. for user immersion in virtual reality
    • G06F3/013 Eye tracking input arrangements
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01 Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/03 Arrangements for converting the position or the displacement of a member into a coded form
    • G06F3/033 Pointing devices displaced or positioned by the user, e.g. mice, trackballs, pens or joysticks; Accessories therefor
    • G06F3/0346 Pointing devices displaced or positioned by the user, e.g. mice, trackballs, pens or joysticks; Accessories therefor with detection of the device orientation or free movement in a 3D space, e.g. 3D mice, 6-DOF [six degrees of freedom] pointers using gyroscopes, accelerometers or tilt-sensors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/70 Determining position or orientation of objects or cameras
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/60 Type of objects
    • G06V20/64 Three-dimensional objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2203/00 Indexing scheme relating to G06F3/00 - G06F3/048
    • G06F2203/01 Indexing scheme relating to G06F3/01
    • G06F2203/011 Emotion or mood input determined on the basis of sensed human body parameters such as pulse, heart rate or beat, temperature of skin, facial expressions, iris, voice pitch, brain activity patterns

Abstract

Constructing a 3D scene from multiple camera image streams for the purpose of text input using eye-tracking and microexpression emotion analysis. Includes a primary camera and a secondary rotatable camera of higher angular resolution, together used to incrementally construct and refine a 3D scene of the objects in the primary camera view. The constructed scene is then used to perform eye gaze pattern tracking over a displayed keyboard and emotion classification by microexpression extraction, wherein the emotional information is used as a real-time feedback system for auto-correcting the text input. The 3D scene may be iteratively constructed based on corresponding features found between the primary camera image stream and the secondary camera image stream combined with the secondary camera rotation angle. Alternatively, the 3D scene may be constructed from features located in the image streams of secondary cameras, the location of the features determined using pixel coordinates, position, orientation, and optical properties of the cameras.

Description

Apparatus and method for eye-tracking based text input
Field and background of the Invention
The present application relates to a method and apparatus for tracking the gaze point of a user or information consumer (eye-tracking), in particular in a closed-loop, non-contact, and non-emitting manner, and to using said method and apparatus as a means of text input. Comparable methods for such text input are, for example, pressing keys on the touchscreen of a phone or using a voice recognition system.
Eye-tracking as a text input method offers great potential in situations where operation by hands or voice commands is either impossible or undesired. Examples are sterile environments where physical interaction with the phone is prohibited, or simply situations requiring unobtrusive communication. Current technical solutions, however, are either too large to fit in most computing devices, too unpleasant to use because they require helmets, glasses, or contact lenses to be worn, or too inaccurate to serve as a practical, everyday text input method.
Summary of the Invention
The invention proposes an apparatus and method offering substantial benefits over current solutions, combining a very compact physical form with a novel feedback system that overcomes the accuracy issues associated with small form-factor solutions. The apparatus comprises a primary camera module mounted fixedly, one or multiple secondary camera modules mounted to allow rotation around a hinge-point at the lens opening face, sensors to measure said rotation angles, and a processor to determine and control the secondary camera rotations.
The method comprises extracting visually distinct features from the primary camera image stream, locating one or more of said features in the image stream of one or more secondary cameras, and, when found, determining the location of each located feature in 3D space using the pixel coordinates, position, orientation, and optical properties of each camera. This process runs continuously to incrementally construct and refine a 3D scene.
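By way of illustration only, the following Python sketch shows one conventional way such a triangulation from pixel coordinates and camera geometry may be carried out; the pinhole camera model, the pan/tilt rotation convention, the placement of the primary camera at the origin, and all function and parameter names are illustrative assumptions rather than part of the claimed method:

    import numpy as np

    def pixel_to_ray(uv, focal_px, principal_point):
        # Unit viewing ray in the camera frame for a pixel, assuming a simple pinhole model.
        u, v = uv
        cx, cy = principal_point
        ray = np.array([(u - cx) / focal_px, (v - cy) / focal_px, 1.0])
        return ray / np.linalg.norm(ray)

    def rotation_yx(pan_rad, tilt_rad):
        # Assumed convention: the secondary camera pans about the y axis, then tilts about the x axis.
        cp, sp = np.cos(pan_rad), np.sin(pan_rad)
        ct, st = np.cos(tilt_rad), np.sin(tilt_rad)
        ry = np.array([[cp, 0.0, sp], [0.0, 1.0, 0.0], [-sp, 0.0, cp]])
        rx = np.array([[1.0, 0.0, 0.0], [0.0, ct, -st], [0.0, st, ct]])
        return rx @ ry

    def triangulate(uv_primary, uv_secondary, secondary_position, pan_rad, tilt_rad,
                    focal_primary, pp_primary, focal_secondary, pp_secondary):
        # Midpoint of closest approach between the two viewing rays; primary camera at the origin.
        d1 = pixel_to_ray(uv_primary, focal_primary, pp_primary)
        d2 = rotation_yx(pan_rad, tilt_rad) @ pixel_to_ray(uv_secondary, focal_secondary, pp_secondary)
        p1, p2 = np.zeros(3), np.asarray(secondary_position, dtype=float)
        w0 = p1 - p2
        b, d, e = d1 @ d2, d1 @ w0, d2 @ w0
        denom = 1.0 - b * b                        # near zero when the rays are almost parallel
        t, s = (b * e - d) / denom, (e - b * d) / denom
        closest1, closest2 = p1 + t * d1, p2 + s * d2
        gap = np.linalg.norm(closest1 - closest2)  # residual; usable as a crude certainty measure
        return (closest1 + closest2) / 2.0, gap

    # Example: both cameras see the feature at their image centres; the secondary camera sits
    # 10 cm to the right and is panned roughly -5.7 degrees towards the primary optical axis.
    point, gap = triangulate((640, 480), (640, 480), (0.10, 0.0, 0.0),
                             pan_rad=-0.0997, tilt_rad=0.0,
                             focal_primary=1000.0, pp_primary=(640, 480),
                             focal_secondary=2000.0, pp_secondary=(640, 480))
    print(point, gap)   # approximately (0, 0, 1): the feature lies about 1 m in front of the apparatus

In such a sketch, the residual gap between the two rays is one possible ingredient of the certainty score kept for each mapped point, as discussed in the detailed description below.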
The 3D scene is simultaneously and continuously scanned for known patterns. If a human face is found, the refinement process is directed to focus on the eye area, and a new process is started to continuously analyse the face for specific gaze patterns. Gaze patterns over a displayed keyboard are used to determine which words a user is spelling by looking at the sequence of letters. When a word is deemed complete, it is displayed to the user along with one or more less likely alternatives.
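Purely as an illustration of how a gazed letter sequence could be turned into ranked word candidates, the short Python sketch below scores a small example vocabulary against the gazed letters with a generic string-similarity measure; the vocabulary, the scoring function, and all names are assumptions of the sketch, not part of the claimed method:

    from difflib import SequenceMatcher

    def rank_candidates(gazed_letters, vocabulary, max_candidates=3):
        # Rank vocabulary words by similarity to the (possibly noisy) sequence of gazed letters.
        gazed = "".join(gazed_letters).lower()
        scored = sorted(
            vocabulary,
            key=lambda word: SequenceMatcher(None, gazed, word.lower()).ratio(),
            reverse=True,
        )
        return scored[:max_candidates]

    # Example: a noisy gaze trace produced "helko" while the user was spelling "hello".
    vocabulary = ["hello", "help", "hold", "halo", "yellow"]
    print(rank_candidates(list("helko"), vocabulary))   # "hello" is ranked first

In terms of the description above, the top-ranked candidate corresponds to the completed word displayed to the user, and the remaining entries to the less likely alternatives.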
The user's microexpression is captured and categorized as 'approval' or 'disapproval' at the moment of first gazing upon each displayed word, after which the best candidate word is selected for input. Alternatively, each completed word is immediately added to a text input field and the user's microexpression is captured upon first gaze on the added word; the word is replaced with the next best candidate each time the captured microexpression signifies disapproval.
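The selection step can be pictured as the small feedback loop sketched below; the classifier is represented by a caller-supplied function because the microexpression classification itself is not specified here, and all names are illustrative:

    def select_word(candidates, classify_reaction):
        # Walk the ranked candidates and keep the first word whose first-gaze microexpression
        # is classified as approval; otherwise fall back to the top-ranked candidate.
        for word in candidates:
            # classify_reaction stands in for the microexpression classifier that is
            # invoked at the moment the user first gazes upon the displayed word.
            if classify_reaction(word) == "approval":
                return word
        return candidates[0]

    # Example with a simulated classifier: the user disapproves of the first candidate only.
    reactions = iter(["disapproval", "approval"])
    print(select_word(["halo", "hello", "help"], lambda word: next(reactions)))   # -> hello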
Through the greatly enhanced speed of the feedback loop, compared to existing methods of input correction, a user will experience the correction as a natural and integral part of the input session. The user is continuously incentivized and trained to provide clear microexpressions, improving the system's effectiveness over a longer period of use.
Brief description of the drawings
Figure 1 shows an example of the apparatus, the fields-of-view of the two cameras, and the eyes gazing upon a keyboard key.
Detailed description
An apparatus according to claims 1-10 and a method according to claims 11-19 are further detailed below:
The invention comprises an apparatus with two or more cameras (figure 1), wherein one (primary) camera is fixedly mounted and serves as a fixed reference, featuring a relatively wide-angle lens to capture most of the target object in its field-of-view (5), comparable to what is customarily used in mobile phone front-facing cameras. At least one (secondary) camera, using a lens system characterized by a narrower field-of-view (7) and greater angular resolution, is mounted offset from the primary camera optical axis in such a way that a hinge point at the lens face allows it to image a significant part of the field-of-view of the primary camera by rotating (6). A series of electromagnetic (voice coil) actuators is mounted in the apparatus, at the image sensor side of the secondary camera, allowing exact control of the rotation of the camera unit.
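As a rough illustration of how the processor might translate a feature position in the primary image into a commanded secondary camera rotation, the sketch below computes first-estimate pan and tilt angles under a pinhole model; it neglects the baseline between the cameras, and the axis convention, focal length, and pixel values are assumptions of the sketch rather than properties of the apparatus:

    import math

    def aim_secondary_camera(uv, focal_px, principal_point):
        # Approximate pan/tilt angles (radians) that point the secondary camera's optical
        # axis towards the scene direction behind a primary-image pixel.  The small
        # baseline between the cameras is neglected, so this is only a first aiming estimate.
        u, v = uv
        cx, cy = principal_point
        dx, dy, dz = (u - cx) / focal_px, (v - cy) / focal_px, 1.0
        pan = math.atan2(dx, math.hypot(dy, dz))   # rotation about the vertical axis
        tilt = math.atan2(-dy, dz)                 # rotation about the horizontal axis
        return pan, tilt

    # Example: a feature detected to the right of and below the primary image centre.
    pan, tilt = aim_secondary_camera((760, 620), focal_px=1000.0, principal_point=(640, 480))
    print(math.degrees(pan), math.degrees(tilt))   # roughly 6.8 and -8.0 degrees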
A continuous process analyses the respective image streams from each camera: a heuristic method, such as a neural network, is used to find distinctive image features in both image streams that correspond to the same physical object. If no match can be found, the secondary camera is rotated to the next best orientation; if a match is found, the pixel coordinates of the matching feature in each image are combined with the secondary camera rotation to perform a triangulation in 3-dimensional (3D) space. The determined geometry points are mapped to a 3D scene along with information on colour and time; optionally, the secondary camera rotation and a certainty score of the triangulation are kept and associated with the mapped points.
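One possible in-memory representation of such a mapped scene, keeping the colour, time, rotation, and certainty metadata mentioned above, is sketched below; the field names, the merge radius, and the keep-the-more-certain-point rule are illustrative choices, not part of the described method:

    from dataclasses import dataclass, field
    from typing import List, Tuple

    @dataclass
    class ScenePoint:
        # One mapped geometry point together with the metadata mentioned above.
        position: Tuple[float, float, float]                    # metres, in the primary camera frame
        colour: Tuple[int, int, int]
        timestamp: float
        secondary_rotation: Tuple[float, float] = (0.0, 0.0)    # pan, tilt in radians
        certainty: float = 0.0

    @dataclass
    class Scene:
        points: List[ScenePoint] = field(default_factory=list)
        merge_radius: float = 0.005   # metres; tolerance for treating observations as the same feature

        def add(self, new: ScenePoint) -> None:
            # Keep the more certain of two nearby observations; otherwise append a new point.
            for i, existing in enumerate(self.points):
                dist = sum((a - b) ** 2 for a, b in zip(existing.position, new.position)) ** 0.5
                if dist < self.merge_radius:
                    if new.certainty > existing.certainty:
                        self.points[i] = new
                    return
            self.points.append(new)

    # Example: a second, more certain observation of the same feature replaces the first.
    scene = Scene()
    scene.add(ScenePoint((0.0, 0.0, 1.0), (200, 180, 170), timestamp=0.0, certainty=0.4))
    scene.add(ScenePoint((0.001, 0.0, 1.0), (205, 182, 171), timestamp=0.1, certainty=0.9))
    print(len(scene.points))   # 1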
A simultaneous process continuously analyses the 3D scene for known objects, i.e. objects of fixed size, for example part of the apparatus itself, a standardized object such as a coin, or a specific object whose size has previously been established. Found objects are then used to assess and improve the overall scene accuracy.
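A minimal sketch of such a correction, assuming that a single uniform scale factor derived from one recognised object is sufficient (the object sizes and coordinates below are invented for the example):

    def rescale_scene(points, measured_size, known_size):
        # Rescale triangulated points so that a recognised object of known physical size
        # (for example a coin) takes on its correct dimensions in the reconstructed scene.
        scale = known_size / measured_size
        return [tuple(scale * coordinate for coordinate in point) for point in points]

    # Example: an object known to be 22.5 mm across was reconstructed as 25.0 mm (units: metres).
    points = [(0.10, 0.02, 0.50), (0.12, 0.03, 0.52)]
    print(rescale_scene(points, measured_size=0.0250, known_size=0.0225))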
Once a human face has been detected, the eyes (4) are continuously monitored to determine the pattern of gazing on keys (2) displayed on a screen (1).
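For illustration, the sketch below maps a gaze ray onto a key by intersecting it with the screen plane; placing the screen in the plane z = 0, the key rectangles, and the example coordinates are all assumptions of the sketch rather than part of the described method:

    def gaze_to_key(eye_position, gaze_direction, key_rects):
        # Intersect the gaze ray with the screen plane z = 0 and return the key, if any,
        # whose rectangle contains the intersection point.
        ex, ey, ez = eye_position
        dx, dy, dz = gaze_direction
        if abs(dz) < 1e-9:
            return None                      # gaze is parallel to the screen plane
        t = -ez / dz
        if t <= 0:
            return None                      # the screen lies behind the eye
        x, y = ex + t * dx, ey + t * dy
        for key, (x0, y0, x1, y1) in key_rects.items():
            if x0 <= x <= x1 and y0 <= y <= y1:
                return key
        return None

    # Example with two illustrative 18 mm wide keys near the screen origin (units: metres).
    keys = {"q": (0.000, 0.000, 0.018, 0.018), "w": (0.019, 0.000, 0.037, 0.018)}
    print(gaze_to_key((0.02, 0.05, 0.30), (0.01, -0.05, -0.30), keys))   # -> w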
Microexpressions, facial expressions generally lasting less than half a second, are used to separate the person's instant reaction to a displayed word from the user's general emotional state and mood changes, such as reactions to the environment, the text to input, or any other factor. A user may, during an input session, see a word incorrectly interpreted by the method and show an instinctive and direct reaction before reverting to the user's overall mood.
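A minimal sketch of how such a filter could separate short-lived reactions from the prevailing mood, assuming an expression label is already available for each sampled video frame (the labels, the half-second threshold passed as a parameter, and the dominant-label baseline rule are illustrative assumptions):

    from itertools import groupby

    def extract_microexpressions(samples, max_duration=0.5):
        # samples: list of (timestamp_seconds, expression_label) in time order.
        # Returns the short-lived runs that differ from the dominant baseline label,
        # the baseline being taken as the user's prevailing mood.
        if not samples:
            return []
        runs = []
        for label, group in groupby(samples, key=lambda s: s[1]):
            group = list(group)
            runs.append((label, group[0][0], group[-1][0]))
        totals = {}
        for label, start, end in runs:
            totals[label] = totals.get(label, 0.0) + (end - start)
        baseline = max(totals, key=totals.get)
        return [(label, start) for label, start, end in runs
                if label != baseline and (end - start) < max_duration]

    # Example: a brief flash of disapproval inside an otherwise neutral input session.
    samples = [(0.00, "neutral"), (0.50, "neutral"), (1.00, "neutral"),
               (1.10, "disapproval"), (1.30, "disapproval"),
               (1.60, "neutral"), (2.50, "neutral")]
    print(extract_microexpressions(samples))   # -> [('disapproval', 1.1)]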
Combining microexpressions with tracking a user's gaze pattern is also advantageous in situations where there is a need to understand exactly if and how a user reacts to a display of information. For example, the designer of a railway information system will want to understand which displayed information is consumed first and which information is ineffective.

Claims (19)

What is claimed is:
1. An apparatus comprising: a primary camera module mounted fixedly relative to the apparatus; one or multiple secondary camera modules mounted to allow rotation around a first axis perpendicular to the primary camera optical axis and a second axis of which the direction vector is perpendicular to both the primary camera optical axis and the first rotation axis; one or more actuators to rotate a secondary camera around said axes; sensors to measure the secondary camera rotation angle for each of said rotation axes; a processor coupled to all actuators and sensors used to control secondary camera rotation; said processor iteratively constructing and locally refining a 3-dimensional (3D) scene, based on corresponding features found between the primary camera image stream and any secondary camera image stream, combined with the corresponding secondary camera rotation angle.
2. The apparatus of claim 1, wherein: the primary camera has a wider angle of view than the secondary cameras; secondary cameras each have greater angular resolution than the primary camera.
3. The apparatus of claim 1, wherein: the lens opening side of a secondary camera module is mounted in a flexible medium which functions as a hinge for rotation of said camera module; conductors for powering and signalling between the apparatus and said camera module are integrated in said flexible medium.
4. The apparatus of claim 1, wherein at least one of the cameras enables measuring the distance between camera and subject by determining the elapsed time between the apparatus emitting either a series of photons, a pattern of photons, or both and said camera(s) identifying the reflected photons.
5. The apparatus of claim 1, further comprising inertia and orientation measurement sensors to account for and compensate for orientation changes of the apparatus with respect to the 3D scene.
6. The apparatus of claim 1, wherein the 3D scene constructed by the apparatus is used to analyse eye movements and gaze points.
7. The apparatus of claim 6, wherein a geometrical representation of a known display of text or graphical information is part of the 3D scene and the eye movements and gaze points are used to determine which displayed information a person is gazing upon at any specific time.
8. The apparatus of claim 7, wherein said display comprises a keyboard layout to allow a person to input text based on said gazing information.
9. The apparatus as in any of claims 1-8, wherein: the display is part of the apparatus; the apparatus is a mobile device, such as, but not limited to, a phone, tablet, or laptop.
10. The apparatus as in any of claims 1-8, wherein the apparatus is mounted in a stationary manner in or near a display of textual information, such as, but not limited to, advertising billboards or public announcement screens.
11. A method comprising: extracting visually distinct features from a primary camera image stream; locating one or more of said features in the image stream of one or more secondary cameras; determining the location of each said located feature in 3D space using pixel coordinates, position, orientation, and optical properties of each camera; incrementally constructing and refining a 3D scene from said located features.
12. The method of claim 11, further comprising a database of predefined 3D geometrical patterns, such as, but not limited to, facial geometry, to generate visual features of interest that may appear in the primary camera image stream.
13. The method of claim 11, wherein an image stream of higher angular resolution and narrower angle of view can be oriented to enhance detail in specific areas of the 3D scene.
14. The method of claim 11, wherein the 3D scene comprises at least one human face and the 3D scene is used to determine: gaze points, gaze durations, and motion patterns between said gaze points of each face present in the scene.
15. The method of claim 14, wherein: the 3D scene is used to determine microexpressions; said microexpressions are used to determine a category of emotion, such as, but not limited to, approval or disapproval; said microexpressions and emotional information are coupled to eye gazing patterns.
16. The method of claim 14, wherein a keyboard is displayed to a user and said user's pattern of sequentially gazing at letters, spelling a word, is converted to words to be used as textual input.
17. The method as in any of claims 14-15, wherein the gathered information is used to analyse consumption of information on a display, such as, but not limited to: determining which information on a billboard is most effective in catching attention; or establishing comparison information for multiple information display locations.
18. The method according to claims 15 and 16, wherein: the words best fitting a gaze pattern are presented to the user on a display device; for each said word, at the moment the user gazes on it, the user's microexpression is classified as approving or disapproving; the best word candidate is selected based upon said classification.
19. The method according to claims 15 and 16, wherein: the words best fitting a gaze pattern are continuously written to an input field, presented to the user on a display device; for each of said written words a certainty score is kept; for each said word a list of alternative choices is kept; if the user gazes on a specific word in said input field, the user's microexpression is classified as approving or disapproving; if disapproval is detected, the disapproved word is swapped out for the next best candidate.
GB1809138.9A 2018-06-04 2018-06-04 Apparatus and method for eye-tracking based text input Withdrawn GB2574410A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
GB1809138.9A GB2574410A (en) 2018-06-04 2018-06-04 Apparatus and method for eye-tracking based text input

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
GB1809138.9A GB2574410A (en) 2018-06-04 2018-06-04 Apparatus and method for eye-tracking based text input

Publications (2)

Publication Number Publication Date
GB201809138D0 GB201809138D0 (en) 2018-07-18
GB2574410A true GB2574410A (en) 2019-12-11

Family

ID=62872830

Family Applications (1)

Application Number Title Priority Date Filing Date
GB1809138.9A Withdrawn GB2574410A (en) 2018-06-04 2018-06-04 Apparatus and method for eye-tracking based text input

Country Status (1)

Country Link
GB (1) GB2574410A (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140098198A1 (en) * 2012-10-09 2014-04-10 Electronics And Telecommunications Research Institute Apparatus and method for eye tracking
US20160210503A1 (en) * 2011-07-14 2016-07-21 The Research Foundation For The State University Of New York Real time eye tracking for human computer interaction

Also Published As

Publication number Publication date
GB201809138D0 (en) 2018-07-18

Similar Documents

Publication Publication Date Title
CN110647237B (en) Gesture-based content sharing in an artificial reality environment
US10132633B2 (en) User controlled real object disappearance in a mixed reality display
EP2978218B1 (en) Computer display device mounted on eyeglasses
CN103561635B (en) Sight line tracking system
US20150379770A1 (en) Digital action in response to object interaction
US20130083063A1 (en) Service Provision Using Personal Audio/Visual System
KR101563312B1 (en) System for gaze-based providing education content
Mehrubeoglu et al. Real-time eye tracking using a smart camera
US20110213664A1 (en) Local advertising content on an interactive head-mounted eyepiece
KR20130000401A (en) Local advertising content on an interactive head-mounted eyepiece
US9442571B2 (en) Control method for generating control instruction based on motion parameter of hand and electronic device using the control method
KR20190030140A (en) Method for eye-tracking and user terminal for executing the same
US11567569B2 (en) Object selection based on eye tracking in wearable device
WO2012119371A1 (en) User interaction system and method
WO2021073743A1 (en) Determining user input based on hand gestures and eye tracking
Lander et al. hEYEbrid: A hybrid approach for mobile calibration-free gaze estimation
EP3991013A1 (en) Method, computer program and head-mounted device for triggering an action, method and computer program for a computing device and computing device
KR20190113252A (en) Method for eye-tracking and terminal for executing the same
WO2018122709A1 (en) Wearable augmented reality eyeglass communication device including mobile phone and mobile computing via virtual touch screen gesture control and neuron command
CN103713387A (en) Electronic device and acquisition method
US10558951B2 (en) Method and arrangement for generating event data
GB2574410A (en) Apparatus and method for eye-tracking based text input
KR20200079748A (en) Virtual reality education system and method for language training of disabled person
AlKassim et al. Sixth sense technology: Comparisons and future predictions
Bilal et al. Design a Real-Time Eye Tracker

Legal Events

Date Code Title Description
WAP Application withdrawn, taken to be withdrawn or refused ** after publication under section 16(1)