WO2012063247A1 - Input processing - Google Patents

Input processing

Info

Publication number
WO2012063247A1
Authority
WO
WIPO (PCT)
Prior art keywords
user
activity
input
computer
adapted
Prior art date
Application number
PCT/IN2010/000739
Other languages
French (fr)
Inventor
Prasenjit Dey
Bowon Lee
Prakash Nama
Muthuselvam Selvaraj
Original Assignee
Hewlett-Packard Development Company, L.P.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hewlett-Packard Development Company, L.P.
Priority to PCT/IN2010/000739
Publication of WO2012063247A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING; COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00: Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01: Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/017: Gesture based interaction, e.g. based on a set of recognized hand gestures
    • G06F3/03: Arrangements for converting the position or the displacement of a member into a coded form
    • G06F3/033: Pointing devices displaced or positioned by the user, e.g. mice, trackballs, pens or joysticks; Accessories therefor
    • G06F3/038: Control and interface arrangements therefor, e.g. drivers or device-embedded control circuitry
    • G06K: RECOGNITION OF DATA; PRESENTATION OF DATA; RECORD CARRIERS; HANDLING RECORD CARRIERS
    • G06K9/00: Methods or arrangements for reading or recognising printed or written characters or for recognising patterns, e.g. fingerprints
    • G06K9/00221: Acquiring or recognising human faces, facial parts, facial sketches, facial expressions
    • G06K9/00302: Facial expression recognition
    • G06K9/00308: Static expression
    • G06K9/00335: Recognising movements or behaviour, e.g. recognition of gestures, dynamic facial expressions; Lip-reading
    • G06F2203/00: Indexing scheme relating to G06F3/00 - G06F3/048
    • G06F2203/038: Indexing scheme relating to G06F3/038
    • G06F2203/0381: Multimodal input, i.e. interface arrangements enabling the user to issue commands by simultaneous use of input devices of different nature, e.g. voice plus gesture on digitizer

Abstract

Presented is a method and system for processing a user input provided by a user of an input device. The method comprises detecting the user input and obtaining a visual and audio representation of the user's actions. A user activity is determined from the obtained audio and visual representations. A user command is then determined based on the detected user input and the determined user activity.

Description

INPUT PROCESSING

Background

Computing systems accept a variety of inputs. Some computer applications accept gestures provided by input devices to enable easier control and navigation of the applications.

Gestures are ways to invoke an action, similar to clicking a toolbar button or typing a keyboard shortcut. Gestures may be performed with a pointing device (including but not limited to a mouse, stylus, hand and/or finger). A gesture typically has a shape, pose or movement associated with it. Such a gesture may be as simple as a stationary pose or a straight line movement or as complicated as a series of movements or poses.

Computer systems may comprise sensors such as cameras and microphones for detecting user inputs and/or gestures. Such sensors can be arranged to continuously capture signals from the surrounding environment. Robust detection of user inputs and/or gestures is therefore a factor that can influence a user's interaction experience with a system.

Brief Description of the Drawings

For a better understanding, embodiments will now be described, purely by way of example, with reference to the accompanying drawings, in which:

Fig. 1 depicts the arrangement of apparatus according to an embodiment;

Fig. 2 is a block diagram depicting the operation of an embodiment;

Fig. 3 illustrates an algorithm for detecting lip movements according to an embodiment;

Figs. 4a-4d illustrate optical flow direction caused by facial movement of a user;

Fig. 5 depicts optical flow caused by movement of a user performing a gesture;

Fig. 6 is a flow diagram of a method according to an embodiment; and

Fig. 7 depicts a portable computing device according to an embodiment.

Detailed Description

Embodiments provide a method for robust processing of user inputs which takes account of user activity. Taking account of user activity, such as user movement, attention, etc., may help provide context to a detected user input, thereby assisting the correct determination of a user input. For example, the attention and activity of a user may be determined by detecting the facial position and movement of the user which, in turn, may enable the determination of a level of engagement of the user with the system.

A plurality of user activity properties may be combined to obtain a more robust understanding of the user's engagement with the system and place a detected user input in context. Such information about the context of user inputs may be used to reject detected input signals as not being deliberate and, instead, identify the signals as relating to other activity (such as human-to-human interactions or background discussion/noise) in the vicinity of the system so that the system does not respond to them. Thus, a determined level of attention and engagement of a user may be used to reject ambient/background user activities which may otherwise interfere with the detection of user inputs and/or prevent correct user interaction with a system.

In a multiple user scenario, embodiments may be adapted to determine the passing of controls or an input device from one user to another user based on determined user activity. Also, the detected identity of a user may be used to personalize the behavior, appearance and/or interaction methods of embodiments.

Embodiments provide a method of processing a user input provided by input devices, the method comprising: detecting the user input; obtaining visual and audio representations of the user's actions; determining a user activity from a combination of the visual and audio representations of the user's actions; and determining a user command based on the detected user input and the determined user activity. Accordingly, there is provided a natural and intuitive interface method by which to command an action using natural activities, such as gestures. Such a method may also be suitable for use in a multiple user environment.

Embodiments comprise an architecture and related computational infrastructure such that the activity of a user may be used to provide context to a detected user input so as to assist in determining the intention of the detected user input in more detail (in other words, disambiguate or qualify the detected user input). Once detected, a user input may be combined with the determined activity of a user to determine a command or action desired by the user. Thus, embodiments may employ hardware and software such that the user is free to move and perform gestures when providing an input, as well as hardware and software such that the movement or activity of the user can be detected and determined. A variety of architectures may be used to enable such functions.

One exemplary way of enabling the detection of user activity is to employ conventional voice recognition technology which is adapted to detect and determine the speech of a user. In such a system, a user may provide an audible parameter (for example, by speaking) which disambiguates an input. By adapting the voice recognition process to recognize the audible parameters spoken by a particular user (such as the user controlling the input device), detected sound that is determined to originate from other users may be ignored as not relating to the detected input signal(s).

Similarly, video or image recognition technology may be employed to detect and determine user activity that is visible. For example, a video camera may be arranged to detect a user's movement or facial expression and determine whether or not the user's attention is directed towards a user interface or to another source of interest (such as other users).

A natural and intuitive means of interaction is provided, enabling a user of such a system to feel as though he or she is interacting with the system, for example, by controlling the system only when their attention is directed towards a user interface of the system. Thus, a unique and compelling user interface is hereby disclosed as a means of interacting with a graphical user interface (GUI).

Commands or operations may be associated with user activities such as gestures. These operations may include navigation forward, backward, scrolling up or down, changing applications, and arbitrary application commands. Further, a user activity need not have a predefined meaning but rather may be customizable by a developer or user to perform an action or combination of actions so that a user may have quick access to keyboard shortcuts or macros, for example.

An embodiment, pictured in Fig. 1, provides apparatus for processing a gesture performed by a user 90 of an input device. The apparatus comprises a display surface 10; an input device 20, for performing a gesture; a video camera 30 for producing a video recording of user movements; and a processor 50. The field of view of the video camera 30 includes the input device 20. The processor 50 is adapted to detect, from the video recording (or otherwise), a gesture performed by a user of the input device 20. It is also adapted to determine the attention direction of the user 90.

The processor then uses the determined attention direction to specify a detected gesture in more detail (in other words, disambiguate or qualify the gesture).

In the embodiment of Fig. 1, the input device 20 comprises part of the body of a user 90. In particular, the input device comprises the user's hand.

Using the input device 20, the user 90 can select, highlight, and/or modify items displayed on the display surface 10. The processor 50 interprets gestures made using the input device 20 in order to manipulate data, objects and/or execute conventional computer application tasks.

Other types of input devices, such as a mouse, stylus, trackball, or the like could be used. Additionally, a user's own hand or finger could be the input device 20 and used for selecting or indicating portions of a displayed image on a proximity-sensitive display. Consequently, the term "user input device", as used herein, is intended to have a broad definition and encompasses many variations on well-known input devices.

The video camera 30 is also adapted to perform as a depth camera. This is an imaging system which provides two-dimensional arrays of depth values - that is, depth images. Thus, in the illustrated embodiment, the video camera 30 is adapted to produce normal (grayscale or color) video images in addition to the depth images. In the present example, the depth camera is based on the time-of-flight principle: pulses of infra-red light are emitted to all objects in the field of view and the time of arrival of the reflected pulses is measured, to determine the distance from the sensor.
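
By way of illustration only, the time-of-flight relationship just described can be written as a one-line computation: the measured round-trip time of the reflected pulse is halved and multiplied by the speed of light. The function name and example value below are illustrative and do not come from the described apparatus.

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458.0

def tof_distance_m(round_trip_time_s: float) -> float:
    """Distance to a reflecting object from the round-trip time of a light pulse."""
    # The pulse travels to the object and back, so halve the path length.
    return SPEED_OF_LIGHT_M_PER_S * round_trip_time_s / 2.0

# Example: a reflection arriving about 6.67 nanoseconds after emission is ~1 m away.
print(tof_distance_m(6.67e-9))  # ~1.0
```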

Note that depth cameras of other types may also be used. The skilled person will be familiar with a variety of other potentially suitable distance-sensing technologies. These include stereo imaging, or stereo triangulation, in which two (or more) image sensors are used to determine a depth image by making disparity measurements. Another possibility is to illuminate a scene with so-called "structured light", where a geometric pattern such as a checkerboard is projected, and depth/distance information is determined from the distortions observed when this known pattern falls on the objects in the scene.

In the arrangement of Fig. 1, the depth camera 30 is positioned to observe the display surface 10, from a relatively short distance of about 0.5 m to 1 m. The camera 30 is spatially positioned such that the display surface 10 is visible in the field-of-view of the camera.

A simple, one-time calibration procedure can be used to locate the four corners of the display surface. This may be either manual, whereby the user indicates the positions of the vertices, or could be automatic, by analysis of the image of the scene. To help with automatic or semi-automatic detection of the surface, its boundaries may be identified with markers of distinctive color or brightness. If calibration is manual, then the camera should be manually recalibrated if it is disturbed.
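
Purely as a sketch of how such a one-time calibration might be realized (the text does not prescribe an implementation), the four indicated corners can be used to compute a perspective transform mapping camera pixels to display coordinates, for example with OpenCV; the corner coordinates below are placeholder values that the user or an automatic detector would supply.

```python
import numpy as np
import cv2

# Camera-image positions of the display's four corners (placeholder values,
# e.g. clicked by the user during manual calibration), in pixel coordinates.
corners_camera = np.float32([[112, 80], [530, 95], [518, 410], [105, 395]])

# Corresponding display coordinates (here an assumed 1920x1080 screen).
corners_display = np.float32([[0, 0], [1920, 0], [1920, 1080], [0, 1080]])

# Homography that maps camera pixels onto display coordinates.
H = cv2.getPerspectiveTransform(corners_camera, corners_display)

def camera_to_display(point_xy):
    """Map a camera-image point (e.g. a detected fingertip) to display coordinates."""
    pt = np.float32([[point_xy]])          # shape (1, 1, 2) as cv2 expects
    return cv2.perspectiveTransform(pt, H)[0, 0]

print(camera_to_display((300, 240)))
```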

Repeated calculation of the distance from the hand to the display surface 10, for example, can be analyzed to determine user activity. The user activity can be used to provide context to a detected user input so as to assist in determining the intention of the detected user input in more detail (in other words, disambiguate or qualify the detected user input). The processor 50 can use the user activity to control the user interaction. Thus, a user input may be combined with the determined activity of a user to determine a command or action desired by the user. It will be understood that the location of the input device 20 may be determined relative to other predetermined locations instead of the display surface 10. For example, the distance of a user's body may be determined from the depth image, thus enabling the distance of the input device 20 from the user's body to be determined and used to determine a user's activity and specify an input in more detail (in other words, disambiguate or qualify a detected input).
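
A minimal sketch of the repeated hand-to-surface distance calculation, assuming the hand and surface regions have already been located in the depth image (the region format and units are illustrative assumptions):

```python
import numpy as np

def region_depth_m(depth_image: np.ndarray, region) -> float:
    """Median depth (in metres) of a rectangular region (x, y, w, h) of a depth image."""
    x, y, w, h = region
    return float(np.median(depth_image[y:y + h, x:x + w]))

def hand_to_surface_distance(depth_image, hand_region, surface_region) -> float:
    """Approximate separation between the hand and the display surface along the depth axis."""
    return abs(region_depth_m(depth_image, hand_region) -
               region_depth_m(depth_image, surface_region))

# Repeated over successive frames, a shrinking distance can be read as the hand
# approaching the surface, which is one way to give context to a detected input.
```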

It will also be understood that alternative camera arrangements may be used such as a conventional arrangement of the camera in the same plane as the display and observing the user so that the display is not in the field of view of the camera.

The processor 50 can comprise hardware of various types. In this example, the processor is a central processing unit (CPU) of a personal computer (PC). Accordingly, the display surface 10 is the display of the PC, which is under the control of the CPU 50. The apparatus allows the user 90 to provide input to the PC using speech or making hand gestures, for example. The processor 50 acts on this user input by controlling an operation depending on the determined activity of the user. The operation could be of almost any type: for example, the activation of a software application or the pressing of a button or selection of a menu item within an application. Of course, as will be readily apparent to those skilled in the art, the processor may be comprised in another hardware device, such as a set-top box (STB). The range of suitable operations which may be controlled will vary accordingly. With a STB, for example, the operation controlled may involve changing channels or browsing an electronic program guide (EPG).

Identification of the context of a user input may be useful for multi-user systems and/or systems having a microphone and video camera arrangement. It may be advantageous for such systems to ignore interactions or user-inputs that are not directed to the system. Further, in a multi-user interaction scenario, the detection of user context may enable a transfer of interactive control from one user to another.

It has been identified that the following information may be useful for determining the context of a user input:

i) Attention Detection

ii) Movement Detection

iii) User Identification

Attention detection

A user's attention can be inferred from explicit and implicit cues, like head pose, gaze, facial expression etc. Although attention is the state of the mind of a user in response to their surroundings, it may manifest itself in several observable cues from the user. Embodiments may use observable cues in order to determine the attention of a user. For example, if a user's gaze is directed towards a system display, but the user is thinking of something else, embodiments may still conclude that the user's attention is directed towards the system.

Common, observable cues include posture, gaze, hand position, etc., which are visually detectable and used to determine the attention level and/or direction of the user.

Movement Detection

Although a user's attention may be directed towards the system, the user may not be interacting with the system and may, instead, be a silent observer whilst another user actively interacts with the system. Observable cues including speech, hand gestures, pointing, lip activity etc. may be used to determine a user's engagement with the system.

Determination of the attention of users may inform the system which users to monitor for movement. For example, the presence of a required attention level may be used to activate the system for listening to user commands.

Determination of "genuine" activity may require a combination of evidence from multimodal cues.

For example, when users are gazing towards a display of a system and there is some background speech, a speech detection unit will indicate activity whilst a processor may detect lip activity of the users using a video camera. The detected audio activity may then be considered jointly with the lip activity to determine whether or not the detected speech was from a user whose attention is directed towards the display. Another example of combining detected cues is the situation when users are gesturing with their hands whilst talking to each other. Combined with the cue of attention, embodiments may be arranged to determine that even though there is gesture activity, the gestures are not directed towards the system. Similarly, speech activity between the users can also be rejected by combining it with the evidence that the attention of the users is not towards the system.

User Identification

The identity of the user may be used to personalize interaction. User identities may be registered with a system. A face recognition engine may be used to detect the identity of a user from an image or video. The identity of the user can then be used to personalize a user interface, speaker dependent models, gesture training data, etc.

Cues/Modalities

According to embodiments, some of the observable cues that may be detected and tracked include: user attention; user gaze direction; user pose; user face; change of depth of the user (front-most user, lunged forward, etc.); user gestures; user lip movement; and user speech.

Evidence Fusion

While detected cues can be used individually to provide information about the state of a user, cues may be combined to provide a more robust and complex understanding of the context within which a user input is provided. For example, combining information regarding a user's face, lip activity, and audio activity from a particular direction may help infer whether detected speech was from a particular user and directed towards the system, or whether it was a spurious, background speech utterance (i.e. background noise). Further, combination of frontal face detection and gesture activity detection can be used to reject spurious gestures.

Figure 2 depicts an embodiment which uses a video camera 60 and a microphone array 70. Here, the multimodal cues of Lip Activity Detection (LAD), Sound Source Localization using a microphone array (SSL), Face Detection (FD) and Gesture Activity Detection (GAD) are used to determine an engagement level of a user with the system. User activity is determined from speech activity (LAD + SSL) and gesture activity (GAD). The speech activity is determined by a speech determination unit 75 which is input with signals from the video camera 60 and the microphone array 70. The gesture activity is determined by a gesture determination unit 80 which is input with a signal from the video camera 60. Thus, it will be understood that LAD and GAD are captured using the video camera 60 and the SSL is captured using the microphone array 70. In other words, the user activity is determined using multiple cues that are provided from a visual representation of the user and/or an audio representation of the user.

Such use of multiple cues may help to disambiguate and/or qualify a detected user input. For example, the detection of lip activity in a visual representation of the user without the detection of concurrently occurring speech in an audio recording/representation of the user may indicate spurious lip movements like chewing gum, lip biting, etc. Conversely, the detection of speech from a direction of the user without concurrently occurring lip movement in the visual representation may indicate background speech from the direction of the user and not directed at the system. Also, the presence of speech audio from a direction without any user detected in that direction may indicate an ambient/background sound from that direction.
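
The cue combinations just described can be read as a simple rule table. The following sketch is one illustrative rendering of those rules, not an implementation taken from the embodiment:

```python
def interpret_cues(lip_activity: bool, speech_from_user_direction: bool,
                   user_present_in_direction: bool) -> str:
    """Return a coarse interpretation of a visual/audio cue combination."""
    if speech_from_user_direction and not user_present_in_direction:
        return "ambient/background sound from that direction"
    if lip_activity and speech_from_user_direction:
        return "genuine speech activity from the user"
    if lip_activity and not speech_from_user_direction:
        return "spurious lip movement (e.g. chewing gum, lip biting)"
    if speech_from_user_direction and not lip_activity:
        return "background speech from the user's direction, not directed at the system"
    return "no relevant activity"

print(interpret_cues(lip_activity=True, speech_from_user_direction=False,
                     user_present_in_direction=True))
```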

The speech determination unit 75 detects speech activity using a visual cue of lip activity (from a video input provided by the video camera 60) and the presence of an audio cue from the direction of the user whose lips are moving (from an audio input provided by the microphone array 70).

Here, the speech determination unit 75 comprises a lip activity detector (LAD) 77 and a Sound Source Localization (SSL) unit 79.

The LAD 77 uses a detected frontal face in a video signal to identify the lip region of a user as a Region of Interest (ROI) and looks for vertical optical flow in the ROI. For example, an algorithm for detecting lip movements is illustrated in Figure 3. In a video image 100, the lip region 105 is identified as the ROI by a lip ROI localizer 77a of the LAD 77. Since inadvertent movements (left-right, front-back) of the face can also result in optical flow in the mouth region, optical flow from patches of the user's left and right cheek regions 110 is also determined. Accordingly, as shown in Figure 3, the amount of optical flow in the lip region 105, the left cheek region 110 and right cheek region 110 is determined in steps 115, 120 and 125, respectively. These steps are undertaken by a motion detector 77b of the LAD 77. Next, in step 130, the amounts of optical flow in the left and right cheek regions are averaged to obtain a single reference value 135 of optical flow for the cheek regions. A ratio of the determined optical flow of the mouth region to the value 135 of optical flow of the cheek region is then determined in step 140. The ratio value 150 is used as an indicator of lip movement by comparing it to a reference/threshold value in step 155 (using a lip activity thresholding unit 77c of the LAD 77). If the ratio value 150 is determined to be equal to or greater than the reference/threshold value, the algorithm proceeds to step 160 where the lip activity thresholding unit 77c determines that there is significant vertical flow in the mouth region compared to the other regions of the face, therefore determining the existence of genuine lip activity. If, on the other hand, the ratio value 150 is determined to be less than the reference/threshold value, the algorithm proceeds to step 165 where the lip activity thresholding unit 77c determines that there is no significant vertical flow in the mouth region compared to the other regions of the face, therefore determining that there is no genuine lip activity.
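
A compact sketch of this lip/cheek optical-flow ratio test, using dense optical flow from OpenCV; the ROI coordinates would come from the face detector, and the threshold value is an illustrative assumption rather than a figure from the text:

```python
import cv2
import numpy as np

def vertical_flow_magnitude(prev_gray, curr_gray, roi):
    """Mean absolute vertical optical flow inside a rectangular ROI (x, y, w, h)."""
    x, y, w, h = roi
    flow = cv2.calcOpticalFlowFarneback(prev_gray[y:y + h, x:x + w],
                                        curr_gray[y:y + h, x:x + w],
                                        None, 0.5, 3, 15, 3, 5, 1.2, 0)
    return float(np.mean(np.abs(flow[..., 1])))  # channel 1 holds the vertical component

def lip_activity(prev_gray, curr_gray, lip_roi, left_cheek_roi, right_cheek_roi,
                 threshold=2.0):
    """Genuine lip activity if vertical flow in the lip ROI dominates the cheek ROIs."""
    lip = vertical_flow_magnitude(prev_gray, curr_gray, lip_roi)
    cheeks = 0.5 * (vertical_flow_magnitude(prev_gray, curr_gray, left_cheek_roi) +
                    vertical_flow_magnitude(prev_gray, curr_gray, right_cheek_roi))
    ratio = lip / (cheeks + 1e-6)   # guard against division by zero on static frames
    return ratio >= threshold
```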

In another embodiment of the LAD 77, the direction of optical flow in the mouth region is used to decide whether the optical flow was due to lip movement or movement of the whole face (sideways, forward-backward), as shown in Figures 4a to 4d.

The SSL unit 79 receives a signal from the microphone array 70 which is adapted to continuously detect audible sounds. The signal from the microphone array 70 is processed by an audio activity thresholding unit 79a of the SSL unit 79. The audio activity thresholding unit 79a compares the signal from the microphone array 70 with a predetermined threshold value to determine whether or not audio activity of interest has occurred. If audio activity of interest is detected, the audio signal is passed to a time delay estimation unit 79b of the SSL unit 79. The time delay estimation unit 79b is adapted to estimate the time delay between the microphones of the microphone array 70 using any suitable known method, such as the generalized cross-correlation method with phase transform (PHAT) frequency weighting, for example. Using the obtained time delay information, a mapping unit 79c of the SSL unit 79 determines which region (such as left, right or center) the detected audio activity is located in (relative to the front of the system, for example).
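
A sketch of the time-delay estimation and mapping steps for a single microphone pair, using the GCC-PHAT weighting mentioned above; the sample rate, microphone spacing, propagation speed and region boundaries are illustrative assumptions:

```python
import numpy as np

def gcc_phat_delay(sig, ref, fs):
    """Estimate the delay (in seconds) of `sig` relative to `ref` with GCC-PHAT."""
    n = len(sig) + len(ref)
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(ref, n=n)
    cross = SIG * np.conj(REF)
    cross /= np.abs(cross) + 1e-12                          # PHAT weighting: keep phase only
    cc = np.fft.irfft(cross, n=n)
    cc = np.concatenate((cc[-(n // 2):], cc[:n // 2 + 1]))  # centre the zero-lag bin
    shift = np.argmax(np.abs(cc)) - n // 2
    return shift / fs

def localise(delay_s, mic_spacing_m=0.2, speed_of_sound=343.0):
    """Map a pairwise delay to a coarse direction relative to the front of the system."""
    # sin(angle) = c * delay / d; clip to keep arcsin defined under noisy estimates.
    angle = np.degrees(np.arcsin(np.clip(speed_of_sound * delay_s / mic_spacing_m, -1, 1)))
    if angle < -15:
        return "left"
    if angle > 15:
        return "right"
    return "center"
```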

The gesture activity determination unit 80 is input with a signal from the video camera 60. In this example, the GAD of the gesture activity determination unit 80 uses a detected face in the video signal to identify the chest region of the user as the ROI 180 (as illustrated in Figure 5). Here, the assumption made is that, whenever the user makes a gesture directed at the system, the user's face 185 will be visible in the video image and can be used to identify the location of the chest region of the user (which is also where the gesture is expected to be performed by the user). Using such an approach, the ROI detector 80a of the gesture activity determination unit 80 identifies the user's chest region as the ROI 180.

Optical flow of a value exceeding a predetermined threshold which occurs in the ROI 180 (i.e. around the chest region) is considered to be genuine gesture activity. Such an assumption may help to reject background movement activity (like a person walking past behind the user, or inadvertent movement of the user's hand, for example). Also, when the user is not facing the system and is talking to another user (i.e. when the user's face is not detected in the video signal), detected gesture activity may be rejected on the basis that it is not directed towards the system (but to the other user instead). Accordingly, the gesture activity determination unit 80 comprises a motion detector 80b which is adapted to detect user motion in the ROI 180 of the video signal. Here, the motion detector 80b is adapted to determine optical flow in the ROI 180 of the video signal. A value of optical flow is therefore provided by the motion detector 80b which is passed to a gesture activity thresholding unit 80c of the gesture activity determination unit 80. The gesture activity thresholding unit 80c compares the value of optical flow to a reference/threshold value. If the value of optical flow is determined to be equal to or greater than the reference/threshold value, the gesture activity thresholding unit 80c determines that there is significant optical flow and therefore determines the existence of genuine gesture activity. If, on the other hand, the value of optical flow is determined to be less than the reference/threshold value, the gesture activity thresholding unit 80c determines that there is no significant optical flow in the ROI 180 and therefore determines that there is no genuine gesture activity.
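
The gesture activity test can be sketched in the same way as the lip activity test: derive a chest ROI from the detected face box and threshold the optical-flow magnitude inside it. The ROI geometry and the threshold below are illustrative assumptions:

```python
import cv2
import numpy as np

def chest_roi_from_face(face_box, frame_shape):
    """Place the ROI below and around a detected face box (x, y, w, h)."""
    x, y, w, h = face_box
    frame_h, frame_w = frame_shape[:2]
    roi_x = max(0, x - w)                   # widen to either side of the face
    roi_y = min(frame_h - 1, y + h)         # start just below the chin
    roi_w = min(frame_w - roi_x, 3 * w)
    roi_h = min(frame_h - roi_y, 2 * h)
    return roi_x, roi_y, roi_w, roi_h

def gesture_activity(prev_gray, curr_gray, face_box, threshold=1.5):
    """Genuine gesture activity if mean flow magnitude in the chest ROI is large enough."""
    if face_box is None:                    # user not facing the system: reject
        return False
    x, y, w, h = chest_roi_from_face(face_box, curr_gray.shape)
    flow = cv2.calcOpticalFlowFarneback(prev_gray[y:y + h, x:x + w],
                                        curr_gray[y:y + h, x:x + w],
                                        None, 0.5, 3, 15, 3, 5, 1.2, 0)
    magnitude = np.linalg.norm(flow, axis=2)
    return float(np.mean(magnitude)) >= threshold
```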

The different levels of detected user activity from the various units (LAD, GAD and SSL) of the gesture activity determination unit 80 and the speech determination unit 75 are combined (using a suitable algorithm) by the evidence fusion unit 85 to provide a value/score representing the user's engagement level with the system.

In the current example, a rule-based fusion/combination is used where, when lip activity is detected from a particular direction, audio activity is also checked using SSL in the same direction and time interval. If it is determined that there is co-occurring lip activity and audio activity, genuine speech activity is reported, along with a confidence score which is the product of the confidence scores of the LAD and SSL components.

Gesture activity may be reported as another parallel activity stream along with its corresponding confidence score. More sophisticated embodiments may combine confidence scores from individual components, and come up with a joint interpretation of the engagement level (i.e. score) for the user.
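
A minimal sketch of such a rule-based fusion, with illustrative confidence handling (the data structures and the engagement score below are assumptions, not prescribed by the embodiment):

```python
def fuse_evidence(lad, ssl, gad):
    """Combine per-cue detections into activity streams with confidence scores.

    Each argument is a dict like {"active": bool, "direction": str, "confidence": float}.
    """
    streams = {}
    # Genuine speech only when lip activity and audio activity co-occur
    # from the same direction within the same time interval.
    if lad["active"] and ssl["active"] and lad["direction"] == ssl["direction"]:
        streams["speech"] = {"direction": lad["direction"],
                             "confidence": lad["confidence"] * ssl["confidence"]}
    # Gesture activity is reported as a parallel stream with its own score.
    if gad["active"]:
        streams["gesture"] = {"direction": gad["direction"],
                              "confidence": gad["confidence"]}
    # Overall engagement score: here simply the best stream confidence.
    engagement = max((s["confidence"] for s in streams.values()), default=0.0)
    return streams, engagement

streams, engagement = fuse_evidence(
    lad={"active": True, "direction": "center", "confidence": 0.8},
    ssl={"active": True, "direction": "center", "confidence": 0.9},
    gad={"active": False, "direction": "center", "confidence": 0.0})
print(streams, engagement)   # speech stream reported with confidence 0.72
```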

The obtained value/score is then passed to a multi-modal application 90 to provide a context for a detected user input. As has been explained above, this context may enable different kinds of user interactions (ranging from complete sensor input rejection to complete interactions).

A detected input may therefore be interpreted in combination with a measure of user activity (which places the detected input in context) to determine a command or action desired by the user. Such an input which is combined with a representation of user-activity is hereinafter referred to as a context-based (CB) input, since a single input may provide multiple modes of operation, the chosen mode being dependent on the context (i.e. measure of user activity) of the user input. The context may determine, for example, a target software program or desired command. The context-based input concept specifies a general pattern of interaction where there is an input command part and a context-based parameter part of an interaction. For example, a CB input according to an embodiment may be represented as follows:

CB Input = Input Command + User Activity Parameter.

Thus, a CB input as an interaction consists of a user input provided in combination with user activity. When the user provides a speech input, for example, the activity of the user is used as an extra parameter to provide context to the speech input and specify the speech input in more detail. Such a CB speech input may therefore be represented as follows:

CB Speech Input = Speech + User Activity Parameter.
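
Read as data, the CB input pattern simply pairs an input command with a user activity parameter. The sketch below is an illustrative rendering of that pattern; the command names and the rejection rule are assumptions:

```python
from dataclasses import dataclass

@dataclass
class CBInput:
    command: str          # e.g. a recognised speech or gesture command
    activity: dict        # user activity parameter, e.g. engagement score, attention

def resolve(cb: CBInput, engagement_threshold: float = 0.5):
    """Accept the command only when the activity context says it was meant for the system."""
    if cb.activity.get("engagement", 0.0) < engagement_threshold:
        return None                       # rejected as background/ambient activity
    return cb.command

print(resolve(CBInput("scroll_down", {"engagement": 0.72})))   # accepted
print(resolve(CBInput("scroll_down", {"engagement": 0.10})))   # rejected -> None
```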

Embodiments may use different forms of user activities like gesture, gaze, pose, speech, face or torso movement, etc. to determine a context for a user input. Such embodiments may jointly interpret multiple forms of user activity to identify a context for a user input. This may provide user context information to a multimodal application which can make use of the information to disambiguate or qualify user inputs or interactions.

For example, embodiments may determine the occurrence of any one of the following user activities when a user input is provided:

The user is gesturing to another user, therefore providing spurious hand movements which are not part of a gesture directed towards the system;

The user is exhibiting unexpected lip activity caused by the user chewing gum or talking to another user, for example;

Background speech from another person or nearby devices such as a radio or mobile phone, etc;

The user is gazing or blankly staring at the display without paying actual attention to the system;

The user gets up and walks away from the system; and

The user is sleeping.

The detection of such exemplary user activities may be used to indicate that a detected user input is not valid for example. A combination of user activities can be used to disambiguate whether a user's attention and/or activity is directed towards the system. All available cues may be captured using sensors like cameras and microphones.

A user input may be a gesture. Gestures may be performed without requiring the user to enter any special mode when using an application - although a mode requirement may be used in alternative embodiments, for example, requiring the user to hold a button while performing a gesture. The occurrence of a gesture may be determined based on a profile of the physical or logical x and y co-ordinates charted against time. A gesture may also be determined based upon timing information. Because a gesture of a human may be a quick gesture, one or more predefined thresholds can be chosen. A movement threshold may be, for example, greater than 1 cm and the time threshold greater than 0.2 milliseconds and less than 700 milliseconds. These values of course may be varied to accommodate all users. In some embodiments a threshold may be defined based upon the size of a screen and/or the distance of the graphical element from an edge of the screen. In other embodiments, a velocity threshold may be used instead of or in addition to a speed threshold, wherein the velocity threshold defines a minimum velocity at which the user must move his/her finger or hand for it to qualify as a gesture. Other aspects of a gesture may be compared against other thresholds. For instance, the system may calculate velocity, acceleration, curvature, lift, and the like and use these derived values or sets of values to determine if a user has performed a gesture.
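
One way the movement and time thresholds described above could be checked against a trace of (x, y, t) samples is sketched below; the default values mirror the example thresholds in the text and are otherwise illustrative:

```python
import math

def qualifies_as_gesture(trace, min_distance_cm=1.0,
                         min_duration_ms=0.2, max_duration_ms=700.0):
    """Decide whether an (x_cm, y_cm, t_ms) trace meets the movement and time thresholds."""
    if len(trace) < 2:
        return False
    (x0, y0, t0), (x1, y1, t1) = trace[0], trace[-1]
    distance = math.hypot(x1 - x0, y1 - y0)   # overall displacement of the trace
    duration = t1 - t0
    return (distance > min_distance_cm and
            min_duration_ms < duration < max_duration_ms)

# A 3 cm swipe performed over 150 ms qualifies; a slow 2-second drift does not.
print(qualifies_as_gesture([(0, 0, 0), (3, 0, 150)]))      # True
print(qualifies_as_gesture([(0, 0, 0), (3, 0, 2000)]))     # False
```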

Turning now to Figure 6, a method 600 according to an embodiment is depicted. A processor like that of Figure 1 may be adapted to execute such a method.

The method 600 comprises: detecting 610 a user input; obtaining 620 an electronic representation of the user's actions; determining 630, from the electronic representation, a user activity; and determining 640 a user command based on the detected user input and the determined user activity.

Here, the electronic representation of the user's actions is a digital video comprising a visual and audio recording of the user. Thus, the electronic representation may be obtained using a conventional digital video camera having a microphone arrangement (of any suitable number of microphones).
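
Putting the four steps of method 600 together as a processing loop gives an outline such as the following; the sensor and recogniser objects and their method names are placeholders for whatever a particular embodiment provides:

```python
def process_inputs(input_detector, camera, microphone, activity_model, resolve_command):
    """Illustrative loop over the steps of method 600: detect, observe, classify, decide."""
    while True:
        user_input = input_detector.poll()                  # step 610: detect a user input
        if user_input is None:
            continue
        frames = camera.recent_frames()                     # step 620: visual representation
        audio = microphone.recent_samples()                 #           and audio representation
        activity = activity_model.classify(frames, audio)   # step 630: determine user activity
        command = resolve_command(user_input, activity)     # step 640: input + activity -> command
        if command is not None:                             # None: rejected as background activity
            command()
```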

The user's hand is one simple and intuitive example of the input device 20. However, other user devices may also be used. For example, the user may hold a wand or other pointer in his/her hand. This could be colored distinctively or exhibit characteristic markers, to aid detection in the image. Equally, another body part could be used, such as an extended finger or the head. In each case, the position of the input device can be calculated from a depth image for example.

Embodiments provide an architecture and related computational infrastructure such that the context of a detected user input may be determined and used to disambiguate or qualify the input. Embodiments may therefore be robust to noise conditions by rejecting spurious speech and gesture activities. In this way, embodiments may identify user inputs accurately. Multiple cues such as attention direction, posture, lip movement, etc., may be used to determine user activity.

While specific embodiments have been described herein for purposes of illustration, various other modifications will be apparent to a person skilled in the art and may be made without departing from the scope of the concepts disclosed.

For example, an embodiment as shown in Fig. 7 comprises a portable computing device 700 having a touch screen 701 which functions both as an output of visual content and an input device for the device 700. A conventional touch screen interface enables a user to provide input to a graphical user interface ("GUI") 701a by manually touching the surface of the screen as a means of targeting and selecting displayed graphical elements. In general, simulated buttons, icons, sliders, and/or other displayed elements are engaged by a user by directly touching the screen area at the location of the displayed user interface element.

The computing device 700 further comprises a processing unit 702, user activity detection means 704 and data storage means 706. The user activity detection means 704 comprises a video camera 704a and a microphone 704b arranged to obtain a visual and audio recording, respectively, of the user of the device 700.

The data storage means 706 store one or more software programs for controlling the operation of the computing device. The software program includes routines for determining user activity such that an input provided by the user can be disambiguated or further defined by the determined user activity (determined from the visual and audio recordings). These routines may be implemented in hardware and/or software and may be implemented in a variety of ways. In general, the routines are configured to determine when a user provides an input and to determine an activity of a user when the input is provided.

Claims

Claims:
1. A method of processing a user input provided by a user of an input device, the method comprising:
using a computer, detecting the user input;
using a computer, obtaining a visual representation of the user's actions;
using a computer, obtaining an audio representation of the user's actions;
using a computer, determining a user activity based on the obtained visual and audio representations; and
using a computer, determining a user command based on the detected user input and the determined user activity.
2. The method of claim 1, wherein the audio representation is provided by a microphone arrangement and wherein the step of determining a user activity comprises, using a computer, processing the audio representation in accordance with a sound recognition process.
3. The method of claim 2, wherein the sound recognition process is adapted to determine at least one of: the location of a sound source; speech content of a sound; and the volume of a sound.
4. The method of claim 1, wherein the visual representation is provided by a video capture device and wherein the step of determining a user activity comprises, using a computer, processing the visual representation in accordance with a feature recognition process.
5. The method of claim 4, wherein the feature recognition process is adapted to determine at least one of: an attention direction of the user; facial movement of the user; lip movement of the user; and a body pose of the user.
6. The method of claim 4, wherein the video capture device is a depth camera adapted to provide one or more depth images, and wherein the step of determining a user activity comprises, using a computer, processing the one or more depth images in accordance with a depth recognition process so as to determine the distance of an object or the user from a predetermined location.
7. The method of claim 1, wherein the step of detecting the user input comprises:
using a computer, detecting movement or shape of the input device;
using a computer, comparing the detected movement or shape with a predetermined threshold value; and
using a computer, determining a gesture has occurred if the detected movement or shape is equal to or exceeds the predetermined threshold value.
8. The method of claim 7, wherein the predetermined threshold value is at least one of: a value of speed; a velocity value; a duration of time; a measure of straightness; a coordinate direction; a description of a shape; and an acceleration value.
9. A system for processing a user input provided by a user of an input device, the system comprising:
detection means adapted to detect the user input;
image capture means adapted to obtain a visual representation of the user's actions;
a microphone arrangement adapted to obtain an audio representation of the user's actions;
activity determination means adapted to determine a user activity from the obtained visual and audio representations of the user's actions; and
a processing unit adapted to determine a user command based on the detected user input and the determined user activity.
10. The system of claim 9, wherein the activity determination means are adapted to determine a user activity by processing the audio representation in accordance with a sound recognition process.
11. The system of claim 9, wherein the image capture means comprise a video capture device and wherein the activity determination means are adapted to determine a user activity by processing the visual representation in accordance with a feature recognition process.
12. The system of claim 11, wherein the feature recognition process is adapted to determine at least one of: an attention direction of the user; facial movement of the user; lip movement of the user; and a body pose of the user.
13. The system of claim 9, wherein the detection means comprise:
input device detection means adapted to detect movement or shape of the input device;
a comparison unit adapted to compare the detected movement or shape with a predetermined threshold value; and
a gesture determination unit adapted to determine a gesture has occurred if the detected movement or shape is equal to or exceeds the predetermined threshold value.
14. A computer program comprising computer program code means adapted to perform, when run on a computer, the steps of:
detecting a user input provided by a user of an input device;
obtaining a visual representation of the user's actions;
obtaining an audio representation of the user's actions;
determining a user activity based on the obtained visual and audio representations; and
determining a user command based on the detected user input and the determined user activity.
15. A computer program as claimed in claim 14 embodied on a computer readable medium.
PCT/IN2010/000739 2010-11-12 2010-11-12 Input processing WO2012063247A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/IN2010/000739 WO2012063247A1 (en) 2010-11-12 2010-11-12 Input processing

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/IN2010/000739 WO2012063247A1 (en) 2010-11-12 2010-11-12 Input processing

Publications (1)

Publication Number Publication Date
WO2012063247A1 true WO2012063247A1 (en) 2012-05-18

Family

ID=46050458

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/IN2010/000739 WO2012063247A1 (en) 2010-11-12 2010-11-12 Input processing

Country Status (1)

Country Link
WO (1) WO2012063247A1 (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1520685A (en) * 2001-06-29 2004-08-11 皇家菲利浦电子有限公司 Picture-in-picture repositioning and/or resizing based on speech and gesture control
CN101038523A (en) * 2007-04-26 2007-09-19 上海交通大学 Mouse system based on visual tracking and voice recognition
US20100266210A1 (en) * 2009-01-30 2010-10-21 Microsoft Corporation Predictive Determination

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9662980B2 (en) 2013-06-07 2017-05-30 Shimane Prefectural Government Gesture input apparatus for car navigation system
WO2018071004A1 (en) * 2016-10-11 2018-04-19 Hewlett-Packard Development Company, L.P. Visual cue system

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 10859456

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase in:

Ref country code: DE

122 Ep: pct app. not ent. europ. phase

Ref document number: 10859456

Country of ref document: EP

Kind code of ref document: A1