WO2016151396A1 - Method for refining control by combining eye tracking and voice recognition - Google Patents

Method for refining control by combining eye tracking and voice recognition

Info

Publication number
WO2016151396A1
Authority
WO
WIPO (PCT)
Prior art keywords
screen
area
objects
user
eye tracking
Prior art date
Application number
PCT/IB2016/000412
Other languages
French (fr)
Inventor
Martin Henrik Tall
Jonas Priesum
Javier SAN AGUSTIN
Original Assignee
The Eye Tribe
Priority date
Filing date
Publication date
Application filed by The Eye Tribe filed Critical The Eye Tribe
Priority to EP16720164.9A priority Critical patent/EP3271803A1/en
Priority to KR1020177027275A priority patent/KR20170129165A/en
Priority to CN201680025224.5A priority patent/CN107567611A/en
Priority to JP2017567559A priority patent/JP2018515817A/en
Publication of WO2016151396A1 publication Critical patent/WO2016151396A1/en

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 - Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01 - Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/011 - Arrangements for interaction with the human body, e.g. for user immersion in virtual reality
    • G06F3/013 - Eye tracking input arrangements
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 - Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01 - Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/048 - Interaction techniques based on graphical user interfaces [GUI]
    • G06F3/0484 - Interaction techniques based on graphical user interfaces [GUI] for the control of specific functions or operations, e.g. selecting or manipulating an object, an image or a displayed text element, setting a parameter value or selecting a range
    • G06F3/04842 - Selection of displayed objects or displayed text elements
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 - Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/16 - Sound input; Sound output
    • G06F3/167 - Audio in a user interface, e.g. using voice commands for navigating, audio feedback
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F2203/00 - Indexing scheme relating to G06F3/00 - G06F3/048
    • G06F2203/038 - Indexing scheme relating to G06F3/038
    • G06F2203/0381 - Multimodal input, i.e. interface arrangements enabling the user to issue commands by simultaneous use of input devices of different nature, e.g. voice plus gesture on digitizer

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The invention is a method for combining eye tracking and voice-recognition control technologies to increase the speed and/or accuracy of locating and selecting objects displayed on a display screen for subsequent control and operations.

Description

METHOD FOR REFINING CONTROL BY COMBINING EYE TRACKING
AND VOICE RECOGNITION
BACKGROUND
[0001] Computing devices, such as personal computers, smartphones, tablets, and others make use of graphical user interfaces (GUIs) to facilitate control by their users. Objects which may include images, words, and alphanumeric characters can be displayed on screens; and users employ cursor-control devices (e.g. mouse or touch pad) and switches to indicate choice and selection of interactive screen elements.
[0002] In other cases, rather than cursor and switch, systems may use a touch sensitive screen whereby a user identifies and selects something by touching its screen location with a finger or stylus. In this way, for example, one could select a control icon, such as "print," or select a hyperlink. One could also select a sequence of alphanumeric characters or words for text editing and/or copy-and-paste interactions. Cursor control and touch-control panels are designed such that users physically manipulate a control device to locate and select screen items.
[0003] There are alternative means for such control, however, that do not involve physically moving or touching a control subsystem. One such alternative makes use of eye tracking, where a user's gaze at a screen can be employed to identify a screen area of interest and a screen item for interactive selection. Another alternative makes use of voice recognition and associates recognized words with related items displayed on a screen. Neither eye tracking nor voice recognition control, on its own, is as precise with regard to locating and selecting screen objects as, say, cursor control or touch control. In the case of eye tracking, one is often limited in resolution to a screen area rather than a point or small cluster of points. If there is more than one screen object within or near that screen area, then selection may be ambiguous. Similarly, with a screen full of text and object choices, a voice recognition subsystem could also suffer ambiguity when trying to resolve a recognized word with a singularly related screen object or word. As a result, such control methodologies may employ zooming so as to limit the number of screen objects and increase the distance between them, as in eye tracking control; or require iterative spoken commands in order to increase the probability of correct control or selection interpretation.
SUMMARY
[0004] By combining eye tracking and voice recognition controls, one can effectively increase the accuracy of location and selection and thereby reduce the iterative zooming or spoken commands that are currently required when using one or the other control technology.
[0005] The method herein disclosed and claimed enables independently implemented eye tracking and voice recognition controls to co-operate so as to make overall control faster and/or more accurate.
[0006] The method herein disclosed and claimed could be employed in an integrated control system that combines eye tracking with voice recognition control.
[0007] The method herein disclosed and claimed is applicable to locating and selecting screen objects that may result from booting up a system in preparation for running an application, or interacting with a server-based HTML page aggregate using a client user system (e.g. interacting with a website via the Internet). In essence, this method in conjunction with eye tracking and voice recognition control subsystems would provide enhanced control over the interaction of screen displayed objects irrespective of the underlying platform specifics.
[0008] The method herein disclosed and claimed uses attributes of eye tracking to reduce the ambiguities of voice-recognition control, and uses voice recognition to reduce the ambiguities of eye tracking control. The result is control synergy; that is, control speed and accuracy that exceed what either eye tracking or voice recognition control achieves on its own.
BRIEF DESCRIPTIONS OF THE DRAWINGS
[0009] Figure 1 depicts a display screen displaying non-text and textual objects. The screen, for example, could be any system display and control screen, such as a computer monitor, smartphone screen, tablet screen, or the like.
[0010] Figure 2 depicts the screen of Figure 1 where eye tracking control determines that the user's gaze is essentially on a non-textual object.
[0011] Figure 3 depicts the screen of Figure 1 where eye tracking control determines that the user's gaze is essentially on a screen area comprising text objects.
[0012] Figure 4 depicts an exemplary flow chart illustrating how combining eye tracking and voice recognition would increase the confidence level of determining a location and selection, and, therefore, the accuracy.
[0013] Figure 5 depicts an exemplary flow chart illustrating how combining eye tracking and voice recognition would increase the probability level of determining a location and selection, and, therefore, the accuracy.
[0014] Figure 6 depicts an exemplary flow chart illustrating how combining eye tracking and voice recognition would increase the probability level of determining the selected word in a group of words by associating the interpreted word with its occurrence in a smaller screen area determined as the user's gaze screen area.
DETAILED DESCRIPTION
[0015] As interactive computing systems of all kinds have evolved, GUIs have become the primary interaction mechanism between systems and users. With displayed objects on a screen, which could be images, alphanumeric characters, text, icons, and the like, the user makes use of a portion of the GUI that enables the user to locate and select a screen object. The two most common GUI subsystems employ cursor control devices (e.g. mouse or touch pad) and selection switches to locate and select screen objects. The screen object could be a control icon, like a print button, so locating and selecting it may cause a displayed document file to be printed. If the screen object is a letter, word, or highlighted text portion, the selection would make it available for editing, deletion, copy-and-paste, or similar operations. Today many devices use a touch panel screen which enables a finger or stylus touch to locate and/or select a screen object. In both cases, the control relies on the user to physically engage with a control device in order to locate and select a screen object.
[0016] With cursor control, one is usually able to precisely locate and select a screen object. Sometimes one has to enlarge a portion of the screen to make objects larger and move them farther apart from one another in order to precisely locate and select an intended screen object. This zooming function is more typical of finger-touch controls where a finger touch on an area with several small screen objects is imprecise until zooming is applied.
[0017] A GUI could also serve to enable location and selection of screen objects without requiring physical engagement. For example, a GUI that makes use of eye tracking control would determine where on a screen the user is gazing (e.g. location) and use some method for selection control (e.g. gaze dwell time). This would be analogous to using a mouse to move a cursor over a screen object and then pressing a button to signify selection intent.
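To make the dwell-time analogy above concrete, the following is a minimal sketch of a gaze-dwell selector. It is not taken from the disclosure: the 0.5-second dwell threshold, the bounding-box representation of screen objects, and the `dwell_select` helper are assumptions for illustration only.

```python
# Minimal gaze-dwell selection sketch (illustrative assumptions throughout):
# an object counts as selected once consecutive gaze samples stay inside its
# bounding box for at least DWELL_SECONDS.
DWELL_SECONDS = 0.5  # assumed dwell threshold; a real subsystem would tune this

def contains(bounds, point):
    """Return True if point (x, y) lies inside bounds (left, top, right, bottom)."""
    x, y = point
    left, top, right, bottom = bounds
    return left <= x <= right and top <= y <= bottom

def dwell_select(gaze_samples, screen_objects):
    """gaze_samples: iterable of (timestamp_seconds, (x, y)).
    screen_objects: dict mapping object name -> bounding box.
    Returns the first object gazed at continuously for DWELL_SECONDS, else None."""
    dwell_start = {}  # object name -> time the gaze entered its bounds
    for t, point in gaze_samples:
        for name, bounds in screen_objects.items():
            if contains(bounds, point):
                start = dwell_start.setdefault(name, t)
                if t - start >= DWELL_SECONDS:
                    return name  # dwell threshold reached: analogous to a mouse click
            else:
                dwell_start.pop(name, None)  # gaze left the object, so reset its timer
    return None
```

For instance, `dwell_select(samples, {"print_icon": (10, 10, 60, 60)})` would report "print_icon" once the gaze has lingered on that box for half a second.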
[0018] Voice-recognition-based control could also serve as a control technology where physical engagement would not be required. A screen of objects would have a vocabulary of spoken words associated with the objects, and when a user says a word or phrase, the control system recognizes the word and associates it with a particular screen object. So, for example, a screen with an object that is a circle with a letter A in its center could be located and selected by a user who says "circle A," which may cause the GUI system to highlight it, and then saying "select," which would cause the GUI system to select the object and perhaps remove the highlighting. Clearly, if there were many objects on a screen, some having the same description, saying "circle" where there are five circles of various size and color would be ambiguous. The system could prompt the user for further delineation in order to have a higher confidence level or higher probability estimation.
[0019] Thus, the tradeoff in using eye tracking or voice-recognition control is eliminating the need for physical engagement with a pointing/selecting device or the screen, but accepting less precise location and selection resolution. Often, as a result of the lower resolution, there may be more steps performed before the system can determine the location and selection of an object with a probability commensurate with more resolute controls, such as cursor, touch pad, or touch screen.
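Returning to the "circle A" example in paragraph [0018], a minimal sketch of such a spoken-word vocabulary follows. The object identifiers, the phrases, and the two-step highlight/select protocol are illustrative assumptions, not part of the disclosure.

```python
# Hypothetical vocabulary for the "circle A" example of [0018]: each spoken
# phrase maps to one screen object, and "select" confirms whatever is highlighted.
SCREEN_VOCABULARY = {
    "circle a": "obj_circle_a",   # assumed object identifiers
    "circle b": "obj_circle_b",
    "print": "obj_print_icon",
}

def handle_utterance(utterance, highlighted=None):
    """Return an (action, object) pair for a recognized utterance."""
    phrase = utterance.strip().lower()
    if phrase == "select" and highlighted is not None:
        return ("select", highlighted)                   # second step: confirm the highlighted object
    if phrase in SCREEN_VOCABULARY:
        return ("highlight", SCREEN_VOCABULARY[phrase])  # first step: locate and highlight
    return ("ambiguous_or_unknown", None)                # e.g. "circle" when several circles are shown
```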
[0020] Typically, a type-selecting cursor is smaller than an alphanumeric character standing alone or immersed in a word. So, if one is fixing a typographical error, one can select a single letter and delete or change it. Using touch control, the area of finger or stylus touch is typically larger than a cursor pointer. It would be difficult to select a letter immersed in a word for similar typographical error correction. One may have to make several pointing attempts to select the correct letter, or expand (i.e. zoom) the word to larger proportions so that the touch point can be resolved to the single, intended letter target.
[0021] Regardless of which GUI location and selection technology one uses, font sizes and nontextual object dimensions will affect the control resolution, but in general, technologies that do not require physical engagement cannot accommodate dense text having small characters and non-text objects having small dimensions without iterative zooming steps.
[0022] The method herein disclosed and claimed makes use of eye tracking and voice recognition control technologies in conjunction to, in effect, improve the accuracy of locating and selecting screen objects beyond what either control technology achieves on its own. The method applies to any system having displayed objects whereby users interact with said system by locating and selecting screen objects and directing the system to carry out some operation or operations on one or a plurality of screen objects. Such systems can comprise combinations of hardware, firmware and software that, in concert, support displaying, locating, selecting and operating on displayed objects. The method may comprise interacting with system hardware and/or software as part of an integrated control subsystem incorporating eye tracking and voice-recognition controls; or as part of a system in which separate eye tracking and voice-recognition control subsystems can interact. The method invention herein disclosed and claimed should therefore not be limited in scope to any particular system architecture or parsing of hardware and software.
[0023] Eye tracking technology or subsystem refers to any such technology or subsystem, regardless of architecture or implementation, which is capable of determining approximately where a user's eye or eyes are gazing at some area of a display screen. The eye tracking technology or subsystem may also be capable of determining that a user has selected one or more objects in the gazed area so located. An object could be an icon or link that initiates an operation if so selected.
[0024] Voice-recognition technology or subsystem refers to any such technology or subsystem, regardless of architecture or implementation, which is capable of recognizing a user's spoken word or phrase of words and associating that recognized word or phrase with a displayed object and/or an operational command.
[0025] Figure 1 depicts a display of objects on a screen. Objects consist of text objects, such as alphanumeric characters, words, sentences and paragraphs; and non-text objects which comprise images, line art, icons, and the like. This drawing is exemplary and should not be read as limiting the layout and content of objects on a screen.
[0026] With eye tracking control technology one can determine an area where a user's eye or eyes are gazing at the screen of Figure 1. For example, in Figure 2, an eye tracking control subsystem has determined that a user's eye is gazing at a portion of a non-text object and the gazed area is defined by the area circled by 201.
[0027] Figure 3 depicts the screen of Figure 1 where an eye tracking control subsystem has determined that a user's eye is gazing at a portion of text objects, the area of which is circled by 301.
[0028] In Figure 2, if the non-text object were smaller than 201, and more than one such object were located in area 201, the eye tracking subsystem could not, at that time, resolve which object in area 201 is a user's object of interest. By engaging in a subsequent step, the screen objects could be enlarged such that only one object would be located in area 201. But the subsequent step adds time for the sake of accuracy. It may also be the case that a first zooming attempt results in two or more objects still within area 201. Hence, a second zoom operation may have to be done in order to determine the object of interest. Here, again, more time is used.
[0029] In Figure 3, the gazed area, 301, covers a plurality of alphanumeric characters and words. Here, again, the eye tracking control subsystem would be unable to determine specifically which character or word is the object of interest. Again, iterative zoom operations may have to be done in order to resolve which letter or word is the object of interest. As with the non-text object case, each time a zoom operation is applied, more time is required.
[0030] Using a voice-recognition technology in association with Figure 1, the entire visible screen and any of its objects could be a user's object of choice. For example, if the user said "delete word 'here'", the voice-recognition subsystem would first have to recognize the word "here," then associate it with any instances of it among the screen objects. As shown in Figure 1, there are three instances of the word "here." Thus, the voice-recognition subsystem would be unable to resolve the command to a singular object choice. It may have to engage in a repetitive sequence of highlighting each instance of "here" in turn until the user says "yes," for example. This would take more time.
[0031] In one embodiment of the invention herein disclosed and claimed, Figure 4 shows an exemplary task flow. The flow shown in Figure 4 should not be read as limiting. The flow begins at 401, where a system loads and parses the elements that will comprise the screen objects. Although not shown in the flow chart, this operation may be done repeatedly. In 402, the eye tracking subsystem computes repeated screen gaze coordinates and passes them to the system. From 402, a gazed area, G, is determined (403). In 404 and 405, once area G is determined, the system builds a dictionary of links, D, and a vocabulary, V, for the links found in area G. Depending on the capabilities of the computing device and/or the voice recognition subsystem, vocabulary V may be updated for every gaze coordinate, for every fixation, every N gaze coordinates, every T milliseconds, and so on. Steps 402 through 405 continue to refresh until a voice command is received (406). The system then recognizes the voice command based on vocabulary, V (407) and determines link L along with a confidence level of accuracy, C (408). With voice recognition, extraneous sounds coupled with a voice command can also introduce audio artifacts that may reduce recognition accuracy. In order to avoid incorrect selections due to extraneous sounds, the confidence level C may be compared to a threshold value, th, and if it is greater (409), the system activates link L (410); otherwise it returns to operation (402). The threshold th may take a fixed value, or it may be computed on a per-case basis depending on different factors, for example, noise in the gaze coordinates, on-screen accuracy reported by the eye tracking system, confidence level in the gaze coordinates, location of the link L on the screen, or any combination of these. Here is a case where eye tracking technology is used to reduce the whole screen of possible objects to just those within the gazed area, G. Rather than having to iterate with repeated zoom steps, by using the eye tracking gazed area G as a delineator, the system can activate the link, L, with a sufficient level of confidence using fewer steps and in less time.
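The Figure 4 flow could be sketched as below. This is only one reading of operations 401-410 under stated assumptions: `eye_tracker`, `recognizer`, and `screen` are hypothetical stand-ins for the eye tracking, voice-recognition, and display subsystems; the gazed area is approximated by a fixed-radius circle around the mean gaze point; and the threshold th is a fixed value rather than the per-case computation the disclosure also allows.

```python
def estimate_gazed_area(gaze_points, radius_px=75):
    """Operation 403: approximate the gazed area G as a circle around the mean gaze
    point. The 75-pixel radius is an assumed stand-in for the tracker's accuracy."""
    xs = [x for x, _ in gaze_points]
    ys = [y for _, y in gaze_points]
    return (sum(xs) / len(xs), sum(ys) / len(ys), radius_px)

def figure4_loop(eye_tracker, recognizer, screen, th=0.7):
    """Sketch of operations 401-410: restrict the recognition vocabulary to links
    inside the gazed area G, then activate link L only if confidence C exceeds th."""
    screen.load_and_parse()                                   # 401: build the screen objects
    while True:
        gaze_points = eye_tracker.latest_coordinates()        # 402: repeated gaze samples
        area_g = estimate_gazed_area(gaze_points)             # 403: gazed area G
        links_d = screen.links_in(area_g)                     # 404: dictionary of links D in G
        vocab_v = {link.label: link for link in links_d}      # 405: vocabulary V for those links

        command = recognizer.poll_command()                   # 406: has a voice command arrived?
        if command is None:
            continue                                          # keep refreshing 402-405

        label, confidence_c = recognizer.best_match(command, list(vocab_v))  # 407-408
        if label in vocab_v and confidence_c > th:            # 409: compare C to threshold th
            vocab_v[label].activate()                         # 410: activate link L
        # otherwise fall through and return to the gaze refresh (402)
```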
[0032] In another embodiment, Figure 5 shows an exemplary task flow. The flow in Figure 5 should not be read as limiting. The flow begins at 501, where a system loads and parses the elements that will comprise the screen objects. Although not shown in the flow chart, this operation may be done repeatedly. The eye tracking control subsystem repeatedly refreshes the gazed area coordinates and feeds that data to the system (502). When a voice command is received (503), a gazed area G is determined by the eye tracking coordinates received during a time window that may extend from the time the command is received to some predetermined number of seconds before that (504). A dictionary of links, D, present in area G is built (505) and a vocabulary, V, of links in the area G is built (506). The voice command is recognized based on V (507) with probability P. In case multiple links are recognized, the accuracy probability P for each link may be computed (508) based on different factors, for example, the confidence level of the voice recognition C, the distance from the gaze point or a fixation to the link, the duration of said fixation, the time elapsed between the link being gazed upon and the emission of the voice command, and the like; and the link with the highest probability P may be selected. If P is larger than a threshold value, th (509), the link, L, is activated (510); otherwise the system returns to operation (502) and waits for a new voice command. The threshold value th may take a fixed value, or it may be computed on a per-case basis as explained above for operation (409). Note that in both Figures 4 and 5 a link is activated. In fact, these operations are not limited to links, but rather, could be applied to any interactive screen object.
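Operation 508 names the factors but not a formula, so the per-link probability below is only one plausible weighting; the decay constants and the 0.3 threshold are assumptions for illustration.

```python
import math

def link_probability(voice_confidence, gaze_distance_px, fixation_ms, elapsed_ms):
    """One plausible reading of operation 508: combine the factors named in the
    disclosure into a single score P for a candidate link. The decay constants
    are illustrative assumptions, not values from the patent."""
    distance_term = math.exp(-gaze_distance_px / 100.0)  # closer to the fixation -> higher P
    duration_term = min(fixation_ms / 500.0, 1.0)        # longer fixation -> higher P (capped at 1)
    recency_term = math.exp(-elapsed_ms / 2000.0)        # command soon after the gaze -> higher P
    return voice_confidence * distance_term * duration_term * recency_term

def pick_link(candidates, th=0.3):
    """Operations 508-510: score each recognized candidate link and return the best
    one only if its probability P clears the threshold th; otherwise None signals
    a return to operation 502."""
    scored = [(link_probability(**features), link) for link, features in candidates]
    best_p, best_link = max(scored, key=lambda pair: pair[0])
    return best_link if best_p > th else None
```

For example, a link gazed at 20 px from the fixation for 400 ms, spoken 300 ms after the gaze with voice confidence 0.9, scores roughly 0.9 × 0.82 × 0.8 × 0.86 ≈ 0.51 under these assumed constants and would be activated.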
[0033] In another embodiment, Figure 6 shows an exemplary task flow. The flow in Figure 6 should not be read as limiting. The flow begins with the system loading and parsing the elements that will comprise the screen objects. Although not shown in the flow chart, this operation may be done repeatedly. Then, the system awaits a voice command. Here, for example, the command is "select" (603). A gazed area, G, is determined (604) by using the eye tracking coordinates received during a time window that may extend from the time the command is received to some predetermined number of seconds before that. Here, the gazed area is, as in Figure 3, over text objects. So, the text, T, in area G is parsed and a vocabulary, V, is built (605). Based on vocabulary, V, the text object of the voice command is recognized (606). A word, W, is evaluated as to probability, P (607), and compared to a threshold value, th (608). If P exceeds th, word W is selected (609). Probability P and threshold value th may be computed as explained previously.
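A comparable sketch of the Figure 6 word-selection flow follows, reusing `estimate_gazed_area` from the Figure 4 sketch above; the two-second gaze window, the 0.6 threshold, and the subsystem helpers are again illustrative assumptions.

```python
def select_word(recognizer, eye_tracker, screen, window_s=2.0, th=0.6):
    """Sketch of the Figure 6 flow: after a "select" command, restrict recognition
    to the words visible inside the gazed area G and select the best match W."""
    command = recognizer.wait_for_command()                            # 602-603: e.g. "select <word>"
    gaze_points = eye_tracker.coordinates_since(seconds=window_s)      # 604: gaze window before the command
    area_g = estimate_gazed_area(gaze_points)                          # reuse the helper from the Figure 4 sketch

    words_in_g = screen.words_in(area_g)                               # 605: parse text T in area G
    vocabulary_v = set(words_in_g)                                     #      and build vocabulary V from it

    word_w, p = recognizer.best_match(command.argument, vocabulary_v)  # 606-607: recognize W with probability P
    if p > th:                                                         # 608: compare P to threshold th
        screen.select(word_w)                                          # 609: word W is selected
        return word_w
    return None
```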
[0034] The flows shown in Figures 4, 5 and 6 are exemplary. In each example, the entire screen of objects is reduced to those objects within a gazed area, increasing the confidence or probability level without resorting to zooming operations. It is of course possible that a gazed area will still contain some ambiguity about the object of interest, but the likelihood is far lower than when using only voice-recognition control. Often the spoken word in combination with the gazed area is sufficient to resolve the object of interest without any zooming operations. Clearly, the combination of eye tracking and voice-recognition technologies will resolve the object of interest faster than either eye tracking or voice-recognition controls applied exclusively.

Claims

WHAT IS CLAIMED IS:
1. A method comprising:
determining an area on a display screen at which a user is gazing;
recognizing a spoken word or plurality of spoken words;
associating said spoken word or plurality of spoken words with objects displayed on said display screen;
limiting said objects displayed on said display screen to said area on said screen at which a user is gazing;
associating said objects displayed on said display screen in said area on a screen at which said user is gazing with said spoken word or plurality of spoken words.
2. A method as in claim 1 further comprising:
determining a level of confidence in said associating said objects displayed on said display screen in said area on a screen at which said user is gazing with said spoken word or plurality of spoken words;
comparing said level of confidence with a predetermined level of confidence value and if greater than said predetermined level of confidence value, accepting the association of said spoken word or plurality of spoken words with said objects displayed on said display screen in said area on a screen which said user is gazing.
3. A method as in claim 1 further comprising:
determining said level of confidence value based on the accuracy of the gaze coordinates, the noise of the gaze coordinates, the confidence level in the gaze coordinates, the location of the objects on the screen, or any combination thereof.
4. A method as in claim 1 further comprising:
determining a level of probability in said associating said objects displayed on said display screen in said area on a screen at which said user is gazing with recognizing said spoken word or plurality of spoken words; comparing said level of probability with a predetermined level of probability value and if greater than said predetermined level of probability value, accepting the association of said spoken word or plurality of spoken words with said objects displayed on said display screen in said area on a screen at which said user is gazing.
5. A method as in claim 4 further comprising:
determining said level of probability based on the confidence level of the voice recognition, the distance from the gaze fixation to each object, the duration of the gaze fixation, the time elapsed between the gaze fixation and the emission of the voice command, or any combination thereof.
6. A method comprising:
determining the objects present in an area on a display screen at which said user is gazing, building a vocabulary of a voice recognition engine based on said objects,
recognizing a spoken word or plurality of spoken words using said vocabulary;
associating said objects present in the gazed area with said spoken word or plurality of spoken words.
7. A method as in claim 6 further comprising:
updating said vocabulary of said voice recognition engine on every fixation of said user.
PCT/IB2016/000412 2015-03-20 2016-03-15 Method for refining control by combining eye tracking and voice recognition WO2016151396A1 (en)

Priority Applications (4)

Application Number Priority Date Filing Date Title
EP16720164.9A EP3271803A1 (en) 2015-03-20 2016-03-15 Method for refining control by combining eye tracking and voice recognition
KR1020177027275A KR20170129165A (en) 2015-03-20 2016-03-15 How to improve control by combining eye tracking and speech recognition
CN201680025224.5A CN107567611A (en) 2015-03-20 2016-03-15 By the way that eyes are tracked into the method combined with voice recognition to finely control
JP2017567559A JP2018515817A (en) 2015-03-20 2016-03-15 How to improve control by combining eye tracking and speech recognition

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US201562013590P 2015-03-20 2015-03-20
US62/13590 2015-03-20

Publications (1)

Publication Number Publication Date
WO2016151396A1 true WO2016151396A1 (en) 2016-09-29

Family

ID=56979172

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/IB2016/000412 WO2016151396A1 (en) 2015-03-20 2016-03-15 Method for refining control by combining eye tracking and voice recognition

Country Status (1)

Country Link
WO (1) WO2016151396A1 (en)

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1320848B1 (en) * 2000-09-20 2006-08-16 International Business Machines Corporation Eye gaze for contextual speech recognition
EP2801890A1 (en) * 2013-05-07 2014-11-12 Samsung Electronics Co., Ltd Method and apparatus for selecting an object on a screen

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106527729A (en) * 2016-11-17 2017-03-22 科大讯飞股份有限公司 Non-contact type input method and device
WO2019107507A1 (en) * 2017-11-30 2019-06-06 国立大学法人東京工業大学 Line-of-sight guiding system
JP2019101137A (en) * 2017-11-30 2019-06-24 国立大学法人東京工業大学 Sight line guiding system
CN114174972A (en) * 2019-07-19 2022-03-11 谷歌有限责任公司 Compressed spoken utterances for automated assistant control of complex application GUIs
CN114174972B (en) * 2019-07-19 2024-05-17 谷歌有限责任公司 Automated assistant controlled compressed spoken utterances for complex application GUIs
US11995379B2 (en) 2019-07-19 2024-05-28 Google Llc Condensed spoken utterances for automated assistant control of an intricate application GUI

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 16720164

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2017567559

Country of ref document: JP

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE

ENP Entry into the national phase

Ref document number: 20177027275

Country of ref document: KR

Kind code of ref document: A