WO2002063599A1 - System and method for multi-modal focus detection, referential ambiguity resolution and mood classification using multi-modal input - Google Patents

System and method for multi-modal focus detection, referential ambiguity resolution and mood classification using multi-modal input

Info

Publication number
WO2002063599A1
Authority
WO
WIPO (PCT)
Prior art keywords
modal
data
input
environment
mood
Prior art date
Application number
PCT/US2002/002853
Other languages
French (fr)
Inventor
Stephane H. Maes
Chalapathy V. Neti
Original Assignee
International Business Machines Corporation
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by International Business Machines Corporation filed Critical International Business Machines Corporation
Priority to JP2002563459A priority Critical patent/JP2004538543A/en
Priority to CA002437164A priority patent/CA2437164A1/en
Priority to EP02724896A priority patent/EP1358650A4/en
Priority to KR1020037010176A priority patent/KR100586767B1/en
Publication of WO2002063599A1 publication Critical patent/WO2002063599A1/en
Priority to HK04106079A priority patent/HK1063371A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00Digital computing or data processing equipment or methods, specially adapted for specific functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161Detection; Localisation; Normalisation
    • G06V40/165Detection; Localisation; Normalisation using facial parts and geometric relationships
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/048Interaction techniques based on graphical user interfaces [GUI]
    • G06F3/0481Interaction techniques based on graphical user interfaces [GUI] based on specific properties of the displayed interaction object or a metaphor-based environment, e.g. interaction with desktop elements like windows or icons, or assisted by a cursor's changing behaviour or appearance
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/24Speech recognition using non-acoustical features
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/22Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/226Procedures used during a speech recognition process, e.g. man-machine dialogue using non-speech characteristics
    • G10L2015/227Procedures used during a speech recognition process, e.g. man-machine dialogue using non-speech characteristics of the speaker; Human-factor methodology
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y10TECHNICAL SUBJECTS COVERED BY FORMER USPC
    • Y10STECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y10S715/00Data processing: presentation processing of document, operator interface processing, and screen saver display processing
    • Y10S715/961Operator interface with visual structure or function dictated by intended use
    • Y10S715/965Operator interface with visual structure or function dictated by intended use for process control and configuration
    • Y10S715/966Computer process, e.g. operation of computer

Definitions

  • the present invention relates to multi-modal data processing techniques and, more particularly, to systems and methods for performing focus detection, referential ambiguity resolution and mood classification in accordance with multi-modal input data.
  • apparatus may be provided for allowing a user to designate a position on the display screen by detecting the user's gaze point, which is designated by his line of sight with respect to the screen, without the user having to manually operate one of the conventional input devices.
  • EOG signals serve as input for use in controlling certain task-performing functions.
  • Still other multi-modal systems are capable of accepting user commands by use of voice and gesture inputs.
  • U.S. Patent No. 5,600,765 to Ando et al. issued February 4, 1997 discloses such a system wherein, while pointing to either a display object or a display position on a display screen of a graphics display system through a pointing input device, a user commands the graphics display system to cause an event on a graphics display.
  • gestures are provided to the system directly as part of commands. Alternatively, a user may give spoken commands.
  • the existing multi-modal techniques fall significantly short of providing an effective conversational environment between the user and the computing system with which the user wishes to interact. That is, the conventional multi-modal systems fail to provide effective conversational computing environments. For instance, the use of user gestures or eye gaze in conventional systems, such as illustrated above, is merely a substitute for the use of a traditional GUI pointing device.
  • the system independently recognizes voice-based commands and independently recognizes gesture-based commands. Thus, there is no attempt in the conventional systems to use one or more input modes to disambiguate or understand data input by one or more other input modes.
  • the present invention provides techniques for performing focus detection, referential ambiguity resolution and mood classification in accordance with multi-modal input data, in varying operating conditions, in order to provide an effective conversational computing environment for one or more users.
  • a multi-modal conversational computing system comprises a user interface subsystem configured to input multi-modal data from an environment in which the user interface subsystem is deployed.
  • the multi-modal data includes at least audio-based data and image-based data.
  • the environment includes one or more users and one or more devices which are controllable by the multi-modal system of the invention.
  • the system also comprises at least one processor, operatively coupled to the user interface subsystem, and configured to receive at least a portion of the multi-modal input data from the user interface subsystem.
  • the processor is further configured to then make a determination of at least one of an intent, a focus and a mood of at least one of the one or more users based on at least a portion of the received multi-modal input data.
  • the processor is still further configured to then cause execution of one or more actions to occur in the environment based on at least one of the determined intent, the determined focus and the determined mood.
  • the system further comprises a memory, operatively coupled to the at least one processor, which stores at least a portion of results associated with the intent, focus and mood determinations made by the processor for possible use in a subsequent determination or action.
  • such a multi-modal conversational computing system provides the capability to: (i) determine an object, application or appliance addressed by the user; (ii) determine the focus of the user and therefore determine if the user is actively focused on an appropriate application and, on that basis, determine if an action should be taken; (iii) understand queries based on who said or did what, what the focus of the user was when he gave a multi-modal query/command, and what the history of these commands and focuses is; and (iv) estimate the mood of the user and initiate and/or adapt some behavior/service/appliance accordingly.
  • the computing system may also change the associated business logic of an application with which the user interacts.
  • multi-modality may comprise a combination of other modalities other than voice and video.
  • multi-modality may include keyboard/pointer/mouse (or telephone keypad) and other sensors, etc.
  • a general principle of the present invention, namely the combination of modalities through at least two different sensors (and actuators for outputs) to disambiguate the input and to estimate the mood or focus, can be generalized to any such combination.
  • Engines or classifiers for determining the mood or focus will then be specific to the sensors, but the methodology of using them is the same as disclosed herein.
  • FIG. 1 is a block diagram illustrating a multi-modal conversational computing system according to an embodiment of the present invention
  • FIG. 2 is a flow diagram illustrating a referential ambiguity resolution methodology performed by a multi-modal conversational computing system according to an embodiment of the present invention
  • FIG. 3 is a flow diagram illustrating a mood/focus classification methodology performed by a multi-modal conversational computing system according to an embodiment of the present invention
  • FIG. 4 is a block diagram illustrating an audio-visual speech recognition module for use according to an embodiment of the present invention
  • FIG. 5A is a diagram illustrating exemplary frontal face poses and non-frontal face poses for use according to an embodiment of the present invention
  • FIG. 5B is a flow diagram of a face/feature and frontal pose detection methodology for use according to an embodiment of the present invention
  • FIG. 5C is a flow diagram of an event detection methodology for use according to an embodiment of the present invention
  • FIG. 5D is a flow diagram of an event detection methodology employing utterance verification for use according to an embodiment of the present invention.
  • FIG. 6 is a block diagram illustrating an audio-visual speaker recognition module for use according to an embodiment of the present invention.
  • FIG. 7 is a flow diagram of an utterance verification methodology for use according to an embodiment of the present invention.
  • FIGs. 8A and 8B are block diagrams illustrating a conversational computing system for use according to an embodiment of the present invention
  • FIGs. 9A through 9C are block diagrams illustrating respective mood classification systems for use according to an embodiment of the present invention.
  • FIG. 10 is a block diagram of an illustrative hardware implementation of a multi-modal conversational computing system according to the invention.
  • Referring to FIG. 1, a block diagram illustrates a multi-modal conversational computing system according to an embodiment of the present invention.
  • the multi-modal conversational computing system 10 comprises an input/output (I/O) subsystem 12, an I/O manager module 14, one or more recognition engines 16, a dialog manager module 18, a context stack 20 and a mood/focus classifier 22.
  • the multi-modal conversational computing system 10 of the present invention receives multi-modal input in the form of audio input data, video input data, as well as other types of input data (in accordance with the I/O subsystem 12), processes the multi-modal data (in accordance with the I/O manager 14), and performs various recognition tasks (e.g., speech recognition, speaker recognition, gesture recognition, lip reading, face recognition, etc., in accordance with the recognition engines 16), if necessary, using this processed data.
  • the results of the recognition tasks and/or the processed data itself are then used to perform one or more conversational computing tasks, e.g., focus detection, referential ambiguity resolution, and mood classification (in accordance with the dialog manager 18, the context stack 20 and/or the classifier 22), as will be explained in detail below.
  • the multi-modal conversational computing system 10 may be employed within a vehicle.
  • the system may be used to detect a distracted or sleepy operator based on detection of abnormally long eye closure or gazing in another direction (by video input) and/or speech that indicates distraction or sleepiness (by audio input), and to then alert the operator of this potentially dangerous state. This is referred to as focus detection.
  • By extracting and then tracking eye conditions (e.g., opened or closed) and/or face direction, the system can make a determination as to what the operator is focusing on.
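The focus detection described above can be sketched as a simple per-frame monitor. This is an illustrative sketch, not the patent's implementation: the frame fields and the thresholds (frame counts standing in for "abnormally long") are assumptions.

```python
# Minimal focus-detection sketch: flag abnormally long eye closure or
# sustained off-road gaze from per-frame visual observations.
# Field names and thresholds are illustrative assumptions.

EYE_CLOSURE_LIMIT = 15   # consecutive frames with eyes closed
GAZE_AWAY_LIMIT = 30     # consecutive frames gazing away from the road

def detect_distraction(frames):
    """Each frame is a dict like {"eyes_open": bool, "gaze": "road"/"away"}.
    Returns a warning string, or None if the operator appears attentive."""
    closed_run = 0
    away_run = 0
    for frame in frames:
        closed_run = 0 if frame["eyes_open"] else closed_run + 1
        away_run = 0 if frame["gaze"] == "road" else away_run + 1
        if closed_run >= EYE_CLOSURE_LIMIT:
            return "alert: abnormally long eye closure"
        if away_run >= GAZE_AWAY_LIMIT:
            return "alert: operator gazing away from the road"
    return None
```

In a deployed system the alert would be routed back through the I/O manager (e.g., to a warning message), rather than returned as a string.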
  • the system 10 may be configured to receive and process not only visible image data, but also (or alternatively) non-visible image data such as infrared (IR) visual data. Also (or, again, alternatively), radio frequency (RF) data may be received and processed. So, in the case where the multi-modal conversational computing system is deployed in an operating environment where light is not abundant (i.e., poor lighting conditions), e.g., a vehicle driven at night, the system can still acquire multi-modal input, process data and then, if necessary, output an appropriate response. The system could therefore also operate in the absence of light.
  • the vehicle application lends itself also to an understanding of the concept of referential ambiguity resolution.
  • the multi-modal conversational computing system 10 is coupled to several devices (e.g., telephone, radio, television, lights) which may be controlled by user input commands received and processed by the system.
  • the system 10 must be able to perform user reference resolution, e.g., the system may receive the spoken utterance, "call my office,” but unless the system can resolve which occupant made this statement, it will not know which office phone number to direct an associated cellular telephone to call.
  • the system 10 therefore performs referential ambiguity resolution with respect to multiple users by taking both audio input data and image input data and processing them to make a user resolution determination. This may include detecting speech activity and/or the identity of the user based on both audio and image cues. Techniques for accomplishing this will be explained below.
  • the system 10 therefore performs referential ambiguity resolution with respect to multiple devices by taking both audio input data and image input data and processing them to make a device resolution determination. This may include detecting the speaker's head pose, using gross spatial resolution of the direction being addressed, or body pose (e.g., pointing). It may also include disambiguating an I/O (input/output) event generated previously and stored in a context manager/history stack (e.g., if a beeper rang and the user asked "turn it off," the term "it" can be disambiguated). Techniques for accomplishing this will be explained below.
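The beeper example above suggests how a context stack resolves a pronoun against the most recent relevant I/O event. The following is a hypothetical sketch; the event dictionaries are an assumed data shape, not the patent's data model.

```python
# Hypothetical sketch: resolve a pronoun such as "it" against a context
# stack of recent events (most recent last). Event shapes are assumptions.

def resolve_reference(utterance, context_stack):
    """Return the device that 'it' most plausibly refers to, or None."""
    if "it" not in utterance.split():
        return None
    for event in reversed(context_stack):
        if event.get("type") == "device_event":
            return event["device"]
    return None

stack = [
    {"type": "speech", "text": "call my office"},
    {"type": "device_event", "device": "beeper", "action": "rang"},
]
resolve_reference("turn it off", stack)  # -> "beeper"
```

A real dialog manager would combine this with recognition confidences and could fall back to asking the user when no candidate event is found.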
  • the system 10 may make a determination of a vehicle occupant's mood or emotional state in order to effect control of other associated devices that may then affect that state. For instance, if the system detects that the user is warm or cold, the system may cause the temperature to be adjusted for each passenger. If the passenger is tired, the system may cause the seat to be adjusted, increase the music volume, etc. Also, as another example (not necessarily an in-vehicle system), the responsiveness of an application interface may be tuned to the mood of the user. For instance, if the user seems confused, help may be provided by the system. Further, if the user seems upset, faster execution is attempted. Still further, if the user is uncertain, the system may ask for confirmation or offer to guide the user.
  • the system can be deployed in a larger area, e.g., a room with multiple video input and speech input devices, as well as multiple associated devices controlled by the system 10.
  • FIGs. 2 and 3 provide a general explanation of the interaction of the functional components of the system 10 during the course of the execution of one or more such applications.
  • raw multi-modal input data is obtained from multi-modal data sources associated with the system.
  • the data input portion of the subsystem may comprise one or more cameras or sensors for capturing video input data representing the environment in which the system (or, at least, the I/O subsystem) is deployed.
  • the cameras/sensors may be capable of capturing not only visible image data (images in the visible electromagnetic spectrum), but also IR (near, mid and/or far field IR video) and/or RF image data.
  • the I/O subsystem 12 may comprise one or more microphones for capturing audio input data from the environment in which the system is deployed. Further, the I/O subsystem may also include an analog-to-digital converter which converts the electrical signal generated by a microphone into a digital signal representative of speech uttered or other sounds that are captured. Further, the subsystem may sample the speech signal and partition the signal into overlapping frames so that each frame is discretely processed by the remainder of the system.
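The overlapping-frame partitioning mentioned above can be sketched as follows. The frame length and shift used here are typical speech-processing values (e.g., 25 ms frames every 10 ms at 16 kHz), assumed for illustration rather than taken from the text.

```python
# Illustrative sketch: partition a sampled speech signal into overlapping
# frames for discrete downstream processing. Parameter values are assumed.

def partition_frames(samples, frame_len=400, shift=160):
    """Split a sequence of samples into overlapping frames of frame_len
    samples, advancing by shift samples each time."""
    frames = []
    for start in range(0, len(samples) - frame_len + 1, shift):
        frames.append(samples[start:start + frame_len])
    return frames

frames = partition_frames(list(range(1000)))
# consecutive frames overlap by frame_len - shift = 240 samples
```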
  • the cameras and microphones may be strategically placed throughout the vehicle in order to attempt to fully capture all visual activity and audio activity that may be necessary for the system to make ambiguity resolution determinations.
  • the I/O subsystem 12 may also comprise other typical input devices for obtaining user input, e.g., GUI-based devices such as a keyboard, a mouse, etc., and/or other devices such as a stylus and digitizer pad for capturing electronic handwriting, etc. It is to be understood that one of ordinary skill in the art will realize other user interfaces and devices that may be included for capturing user activity.
  • the raw multi-modal input data is abstracted into one or more events.
  • the data abstraction is performed by the I/O manager 14.
  • the I/O manager receives the raw multi-modal data and abstracts the data into a form that represents one or more events, e.g., a spoken utterance, a visual gesture, etc.
  • a data abstraction operation may involve generalizing details associated with all or portions of the input data so as to yield a more generalized representation of the data for use in further operations.
  • the abstracted data or event is then sent by the I/O manager 14 to one or more recognition engines 16 in order to have the event recognized, if necessary. That is, depending on the nature of the event, one or more recognition engines may be used to recognize the event. For example, if the event is some form of spoken utterance wherein the microphone picks up the audible portion of the utterance and a camera picks up the visual portion (e.g., lip movement) of the utterance, the event may be sent to an audio-visual speech recognition engine to have the utterance recognized using both the audio input and the video input associated with the speech. Alternatively, or in addition, the event may be sent to an audio-visual speaker recognition engine to have the speaker of the utterance identified, verified and/or authenticated. Also, both speech recognition and speaker recognition can be combined on the same utterance.
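The routing of abstracted events to recognition engines can be pictured as a simple dispatch table. This is a hedged sketch; the event kinds and engine names are placeholders, not identifiers from the patent.

```python
# Hypothetical dispatch of abstracted events to recognition engines,
# mirroring the routing described above. Names are placeholders.

def route_event(event):
    """Return the recognition engines an event should be sent to,
    or an empty list if the event needs no recognition."""
    routes = {
        "spoken_utterance": ["audio_visual_speech_recognizer",
                             "audio_visual_speaker_recognizer"],
        "gesture": ["gesture_recognizer"],
        "handwriting": ["handwriting_recognizer"],
    }
    return routes.get(event["kind"], [])

route_event({"kind": "spoken_utterance"})
# a spoken utterance may go to both speech and speaker recognition
```

Note that some events (e.g., a GUI button press) return an empty route, reflecting the observation below that some data is already identifiable without recognition.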
  • the event may be sent to a gesture recognition engine for recognition.
  • the event may comprise handwritten input provided by the user such that one of the recognition engines may be a handwriting recognition engine.
  • the data may not necessarily need to be recognized since the data is already identifiable without recognition operations.
  • other recognition engines may perform gesture recognition (e.g., of body, arm and/or hand movements that a user employs to passively or actively give instruction to the system) and focus recognition (e.g., of the direction of the face and eyes of a user).
  • the classifier 22 is preferably used to determine the focus of the user and, in addition, the user's mood.
  • in step 208, the recognized events, as well as the events that do not need to be recognized, are stored in a storage unit referred to as the context stack 20.
  • the context stack is used to create a history of interaction between the user and the system so as to assist the dialog manager 18 in making referential ambiguity resolution determinations when determining the user's intent.
  • in step 210, the system 10 attempts to determine the user intent based on the current event and the historical interaction information stored in the context stack, and then to determine and execute one or more application programs that effectuate the user's intention and/or react to the user activity.
  • the application depends on the environment in which the system is deployed.
  • the application may be written in any computer programming language, but preferably it is written in a Conversational Markup Language (CML) as disclosed in U.S. patent application identified as 09/544,823 (attorney docket no. YO999-478) filed April 6, 2000 and entitled "Methods and Systems for Multi-modal Browsing and Implementation of a Conversational Markup Language."
  • the dialog manager must first determine the user's intent based on the current event and, if available, the historical information (e.g., past events) stored in the context stack. For instance, returning to the vehicle example, the user may say “turn it on,” while pointing at the vehicle radio. The dialog manager would therefore receive the results of the recognized events associated with the spoken utterance "turn it on” and the gesture of pointing to the radio. Based on these events, the dialog manager does a search of the existing applications, transactions or "dialogs," or portions thereof, with which such an utterance and gesture could be associated. Accordingly, as shown in FIG. 1, the dialog manager 18 determines the appropriate CML-authored application 24.
  • the application may be stored on the system 10 or accessed (e.g., downloaded) from some remote location. If the dialog manager determines with some predetermined degree of confidence that the application it selects is the one which will effectuate the user's desire, the dialog manager executes the next step of the multi-modal dialog of that application based on the multi-modal input (e.g., it prompts or displays for missing, ambiguous or confusing information, asks for confirmation, or launches the execution of an action associated with a fully understood multi-modal request from the user). That is, the dialog manager selects the appropriate device (e.g., radio) activation routine and instructs the I/O manager to output a command to activate the radio.
  • the predetermined degree of confidence may be that at least two input parameters or variables of the application are satisfied or provided by the received events.
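The confidence test just described, that at least two input parameters of an application are satisfied by the received events, can be sketched as a slot-counting check. This is an illustrative sketch; the parameter and event names are assumptions.

```python
# Minimal sketch of the confidence test: select an application only when
# enough of its input parameters are filled by recognized multi-modal
# events. The threshold of two follows the example in the text.

def confident_enough(app_params, events, threshold=2):
    """app_params: parameter names the application needs, e.g.
    {"action", "device"}; events: dicts of slots filled by events."""
    filled = set()
    for event in events:
        filled |= set(event) & set(app_params)
    return len(filled) >= threshold

events = [{"action": "turn_on"},   # from the utterance "turn it on"
          {"device": "radio"}]     # from the gesture pointing at the radio
confident_enough({"action", "device"}, events)  # -> True
```

With only the utterance event present, the check fails and the dialog manager would instead request clarification, as described in the following steps.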
  • levels of confidence and algorithms may be established as, for example, described in K.A. Papineni, S. Roukos, R.T. Ward, "Free-flow dialog management using forms,” Proc. Eurospeech, Budapest, 1999; and K. Davies et al., "The IBM conversational telephony system for financial applications,” Proc. Eurospeech, Budapest, 1999, the disclosures of which are incorporated by reference herein.
  • the dialog manager would first try to determine user intent based solely on the "turn it on” command. However, since there are likely other devices in the vehicle that could be turned on, the system would likely not be able to determine with a sufficient degree of confidence what the user was referring to. However, this recognized spoken utterance event is stored on the context stack. Then, when the recognized gesture event (e.g., pointing to the radio) is received, the dialog manager takes this event and the previous spoken utterance event stored on the context stack and makes a determination that the user intended to have the radio turned on. Consider the case where the user says “turn it on,” but makes no gesture and provides no other utterance.
  • the dialog manager does not have enough input to determine the user intent (step 212 in FIG. 2) and thus implement the command.
  • the dialog manager, in step 214, then causes the generation of an output to the user requesting further input data so that the user's intent can be disambiguated. This may be accomplished by the dialog manager instructing the I/O manager to have the I/O subsystem output a request for clarification.
  • the I/O subsystem 12 may comprise a text-to-speech (TTS) engine and one or more output speakers.
  • the dialog manager then generates a predetermined question such as "what device do you want to have turned on?" which the TTS engine converts to a synthesized utterance that is audibly output by the speaker to the user.
  • the system 10 obtains the raw input data, again in step 202, and the process 200 iterates based on the new data. Such iteration can continue as long as necessary for the dialog manager to determine the user's intent.
  • the dialog manager 18 may also seek confirmation in step 216 from the user in the same manner as the request for more information (step 214) before executing the processed event, dispatching a task and/or executing some other action in step 218 (e.g., causing the radio to be turned on). For example, the system may output "do you want the radio turned on?" To which the user may respond "yes.” The system then causes the radio to be turned on. Further, the dialog manager 18 may store information it generates and/or obtains during the processing of a current event on the context stack 20 for use in making resolution or other determinations at some later time.
  • the system 10 can also make user ambiguity resolution determinations, e.g., in a multiple user environment, someone says "dial my office.” Given the explanation above, one of ordinary skill will appreciate how the system 10 could handle such a command in order to decide who among the multiple users made the request and then effectuate the order.
  • the output to the user to request further input may be made in any other number of ways and with any amount of interaction turns between the user and feedback from the system to the user.
  • the I/O subsystem 12 may include a GUI-based display whereby the request is made by the system in the form of a text message displayed on the screen of the display.
  • YO999-111) filed on October 1, 1999 and entitled "Conversational Computing Via Conversational Virtual Machine," the disclosure of which is incorporated by reference herein, may be employed to provide a framework for the I/O manager, recognition engines, dialog manager and context stack of the invention. A description of such a conversational virtual machine will be provided below.
  • while focus or attention detection is preferably performed in accordance with the focus/mood classifier 22, as will be explained below, it is to be appreciated that such operation can also be performed by the dialog manager 18, as explained above.
  • Referring to FIG. 3, a flow diagram illustrates a methodology 300 performed by a multi-modal conversational computing system by which mood classification and/or focus detection is accomplished. It is to be appreciated that the system 10 may perform the methodology of FIG. 3 in parallel with the methodology of FIG. 2 or at separate times. Because of this, the events that are stored by one process in the context stack can be used by the other.
  • steps 302 through 308 are similar to steps 202 through 208 in FIG. 2. That is, the I/O subsystem 12 obtains raw multi-modal input data from the various multi-modal sources (step 302); the I/O manager 14 abstracts the multi-modal input data into one or more events (step 304); the one or more recognition engines 16 recognize the event, if necessary, based on the nature of the one or more events (step 306); and the events are stored on the context stack (step 308).
  • the system 10 may determine the focus (and focus history) of the user in order to determine whether he is paying sufficient attention to the task of driving (assuming he is the driver).
  • Such determination may be made by noting abnormally long eye closure or gazing in another direction and/or speech that indicates distraction or sleepiness.
  • the system may then alert the operator of this potentially dangerous state.
  • the system may make a determination of a vehicle occupant's mood or emotional state in order to effect control of other associated devices that may then affect that state.
  • Such focus and mood determinations are made in step 310 by the focus/mood classifier 22.
  • the focus/mood classifier 22 receives either the events directly from the I/O manager 14 or, if necessary depending on the nature of the event, the classifier receives the recognized events from the one or more recognition engines 16.
  • the focus/mood classifier may receive visual events indicating the position of the user's eyes and/or head as well as audio events indicating sounds the user may be making (e.g., snoring). Using these events, as well as past information stored on the context stack, the classifier makes the focus detection and/or mood classification determination. Results of such determinations may also be stored on the context stack.
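The fusion of visual events, audio events, and context-stack history in the classifier can be sketched as a rule-based check. This is a hedged sketch in the spirit of the classifier 22; all feature names, categories, and rules are assumptions for illustration.

```python
# Illustrative rule-based focus/mood classifier fusing visual and audio
# events with context-stack history. Feature names are assumptions.

def classify_state(visual, audio, history=()):
    """visual: e.g. {"eyes_open": False, "head": "down"};
    audio: e.g. {"snoring": True};
    history: past classifications from the context stack."""
    if audio.get("snoring") or (not visual.get("eyes_open", True)
                                and visual.get("head") == "down"):
        return "sleepy"
    if visual.get("head") == "away":
        return "distracted"
    # persistent distraction in recent history also triggers a flag
    if list(history[-3:]).count("distracted") >= 2:
        return "distracted"
    return "attentive"

classify_state({"eyes_open": False, "head": "down"}, {})  # -> "sleepy"
```

A statistical classifier (e.g., trained on labeled audio-visual features) could replace these rules while keeping the same interface to the I/O manager and context stack.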
  • the classifier may cause the execution of some action depending on the resultant determination. For example, if the driver's attention is determined to be distracted, the I/O manager may be instructed by the classifier to output a warning message to the driver via the TTS system and the one or more output speakers. If the driver is determined to be tired due, for example, to his monitored body posture, the I/O manager may be instructed by the classifier to provide a warning message, adjust the temperature or radio volume in the vehicle, etc. It is to be appreciated the conversational data mining system disclosed in U.S. patent application identified as Serial No. 09/371,400 (attorney docket no.
• Referring to FIG. 4, a block diagram illustrates a preferred embodiment of an audio-visual speech recognition module that may be employed as one of the recognition modules of FIG. 1 to perform speech recognition using multi-modal input data received in accordance with the invention. It is to be appreciated that such an audio-visual speech recognition module is disclosed in the above-referenced U.S. patent application identified as Serial No. 09/369,707 (attorney docket no.
  • This particular illustrative embodiment depicts audio-visual recognition using a decision fusion approach.
• the audio-visual speech recognition module described herein provides the ability to process arbitrary content video. Previous systems that have attempted to utilize visual cues from a video source in the context of speech recognition have used video captured under controlled conditions, i.e., non-arbitrary content video: the video content included only faces, from which visual cues were taken in order to recognize short commands or single words in a predominantly noiseless environment.
  • the module described herein is preferably able to process arbitrary content video which may not only contain faces but may also contain arbitrary background objects in a noisy environment.
• one example of arbitrary content video arises in the context of broadcast news. Such video can possibly contain a newsperson speaking at a location where there is arbitrary activity and noise in the background.
  • the module is able to locate and track a face and, more particularly, a mouth, to determine what is relevant visual information to be used in more accurately recognizing the accompanying speech provided by the speaker.
• the module is also able to continue to recognize when the speaker's face is not visible (audio only) or when the speech is inaudible (lip reading only).
  • the module is capable of receiving real-time arbitrary content from a video camera 404 and microphone 406 via the I/O manager 14. It is to be understood that the camera and microphone are part of the I/O subsystem 12. While the video signals received from the camera 404 and the audio signals received from the microphone 406 are shown in FIG. 4 as not being compressed, they may be compressed and therefore need to be decompressed in accordance with the applied compression scheme.
  • the video signal captured by the camera 404 can be of any particular type.
  • the face and pose detection techniques may process images of any wavelength such as, e.g., visible and/or non- visible electromagnetic spectrum images.
  • this may include infrared (IR) images (e.g., near, mid and far field IR video) and radio frequency (RF) images.
  • the module may perform audio-visual speech detection and recognition techniques in poor lighting conditions, changing lighting conditions, or in environments without light.
  • the system may be installed in an automobile or some other form of vehicle and capable of capturing IR images so that improved speech recognition may be performed.
  • the module provides the capability to perform accurate LVCSR (large vocabulary continuous speech recognition).
  • a phantom line denoted by Roman numeral I represents the processing path the audio information signal takes within the module
  • a phantom line denoted by Roman numeral II represents the processing path the video information signal takes within the module.
  • the module includes an auditory feature extractor 414.
  • the feature extractor 414 receives an audio or speech signal and, as is known in the art, extracts spectral features from the signal at regular intervals.
  • the spectral features are in the form of acoustic feature vectors (signals) which are then passed on to a probability module 416.
  • the speech signal may be sampled at a rate of 16 kilohertz (kHz).
  • a frame may consist of a segment of speech having a 25 millisecond (msec) duration.
  • the extraction process preferably produces 24 dimensional acoustic cepstral vectors via the process described below. Frames are advanced every 10 msec to obtain succeeding acoustic vectors. Note that other acoustic front-ends with other frame sizes and sampling rates/signal bandwidths can also be employed.
  • magnitudes of discrete Fourier transforms of samples of speech data in a frame are considered in a logarithmically warped frequency scale.
  • these amplitude values themselves are transformed to a logarithmic scale.
  • the latter two steps are motivated by a logarithmic sensitivity of human hearing to frequency and amplitude.
  • a rotation in the form of discrete cosine transform is applied.
  • One way to capture the dynamics is to use the delta (first-difference) and the delta-delta (second-order differences) information.
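The front-end described above (Fourier magnitudes on a logarithmically warped frequency scale, log amplitudes, a DCT rotation, then delta features) can be sketched roughly as follows. The filter-bank layout and numeric details here are illustrative assumptions, not taken from the patent; only the overall pipeline follows the text.

```python
import numpy as np

def mel(f):
    # logarithmically warped frequency scale (the mel scale)
    return 2595.0 * np.log10(1.0 + f / 700.0)

def inv_mel(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def cepstra(frame, sr=16000, n_filt=24):
    """One 24-dimensional cepstral vector from one 25 ms (400-sample) frame."""
    spec = np.abs(np.fft.rfft(frame))                 # Fourier transform magnitudes
    freqs = np.fft.rfftfreq(len(frame), 1.0 / sr)
    edges = inv_mel(np.linspace(0.0, mel(sr / 2.0), n_filt + 2))
    energy = np.empty(n_filt)
    for i in range(n_filt):                           # triangular filters on the warped scale
        lo, c, hi = edges[i], edges[i + 1], edges[i + 2]
        tri = np.minimum(np.clip((freqs - lo) / (c - lo), 0.0, 1.0),
                         np.clip((hi - freqs) / (hi - c), 0.0, 1.0))
        energy[i] = tri @ spec
    log_e = np.log(energy + 1e-12)                    # logarithmic amplitude scale
    k = np.arange(n_filt)[:, None]
    m = np.arange(n_filt)[None, :]
    dct = np.cos(np.pi * k * (2 * m + 1) / (2.0 * n_filt))  # DCT "rotation"
    return dct @ log_e

def deltas(cep_sequence):
    """First differences over frames advanced every 10 ms (the delta features);
    applying deltas() to the result yields the delta-delta features."""
    return np.diff(np.asarray(cep_sequence), axis=0)
```

Frames are advanced every 10 msec, so `deltas` operates over a sequence of successive cepstral vectors, as the text notes.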
  • the probability module labels the extracted vectors with one or more previously stored phonemes which, as is known in the art, are sub-phonetic or acoustic units of speech.
  • the module may also work with lefemes, which are portions of phones in a given context.
  • Each phoneme associated with one or more feature vectors has a probability associated therewith indicating the likelihood that it was that particular acoustic unit that was spoken.
• the probability module yields likelihood scores for each considered phoneme in the form of the probability that, given a particular phoneme or acoustic unit (au), the acoustic unit represents the uttered speech characterized by one or more acoustic feature vectors A or, in other words, P(A|au).
  • the processing performed in blocks 414 and 416 may be accomplished via any conventional acoustic information recognition system capable of extracting and labeling acoustic feature vectors, e.g., Lawrence Rabiner, Biing-Hwang Juang, "Fundamentals of Speech
  • the audio-visual speech recognition module (denoted in FIG. 4 as part of block 16 from FIG. 1) includes an active speaker face detection module 418.
• the active speaker face detection module 418 receives video input from the camera 404. It is to be appreciated that speaker face detection can also be performed directly in the compressed data domain and/or from audio and video information rather than just from video information. In any case, module 418 generally locates and tracks the speaker's face and facial features within the arbitrary video background. This will be explained in detail below.
  • the recognition module also preferably includes a frontal pose detection module 420.
  • the detection module 420 serves to determine whether a speaker in a video frame is in a frontal pose. This serves the function of reliably determining when someone is likely to be uttering or is likely to start uttering speech that is meant to be processed by the module, e.g., recognized by the module. This is the case at least when the speaker's face is visible from one of the cameras.
  • conventional speech recognition with, for example, silence detection, speech activity detection and/or noise compensation can be used. Thus, background noise is not recognized as though it were speech, and the starts of utterances are not mistakenly discarded.
  • the module implements a detection module such that the modality of vision is used in connection with the modality of speech to determine when to perform certain functions in auditory and visual speech recognition.
  • One way to determine when a user is speaking to the system is to detect when he is facing the camera and when his mouth indicates a speech or verbal activity. This copies human behavior well. That is, when someone is looking at you and moves his lips, this indicates, in general, that he is speaking to you.
• a frontal face pose is when a user is considered to be: (i) more or less looking at the camera; or (ii) looking directly at the camera (also referred to as “strictly frontal”).
• we determine “frontalness” by determining that a face is absolutely not frontal (also referred to as “non-frontal”).
• a non-frontal face pose is when the orientation of the head is far enough from the strictly frontal orientation that the gaze cannot be interpreted as directed at the camera nor interpreted as more or less directed at the camera.
• examples of what are considered frontal face poses and non-frontal face poses in a preferred embodiment are shown in FIG. 5A.
  • Poses I, II and III illustrate face poses where the user's face is considered frontal
• poses IV and V illustrate face poses where the user's face is considered non-frontal.
• Referring to FIG. 5B, a flow diagram of an illustrative method of performing face detection and frontal pose detection is shown.
  • the first step (step 502) is to detect face candidates in an arbitrary content video frame received from the camera 404.
• in step 504, we detect facial features on each candidate such as, for example, nose, eyes, mouth, ears, etc.
• in step 506, we remove candidates that do not have sufficient frontal characteristics, e.g., a number of well detected facial features and distances between these features.
  • An alternate process in step 506 to the pruning method involves a hierarchical template matching technique, also explained in detail below.
• in step 508, if at least one face candidate exists after the pruning mechanism, it is determined that a frontal face is in the video frame being considered.
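Steps 502 through 508 can be sketched as a short pipeline. The callbacks below are hypothetical stand-ins: the actual detectors score candidates with FLD/DFFS techniques described later, not with a bare feature count.

```python
# Sketch of steps 502-508 of FIG. 5B; `detect_candidates` and
# `detect_features` are illustrative stand-ins for the real detectors.
def frontal_face_present(frame, detect_candidates, detect_features,
                         min_features=4):
    candidates = detect_candidates(frame)        # step 502: face candidates
    survivors = []
    for cand in candidates:
        feats = detect_features(cand)            # step 504: eyes, nose, mouth, ears...
        if len(feats) >= min_features:           # step 506: prune weakly frontal candidates
            survivors.append((cand, feats))
    return len(survivors) > 0                    # step 508: any candidate left => frontal face
```

The pruning predicate in a real system would also check the distances between detected features, as the text describes.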
• a geometric method simply considers variations of distances between some features in a two dimensional representation of a face (i.e., a camera image), according to the pose. For instance, on a picture of a slightly turned face, the distance between the right eye and the nose should be different from the distance between the left eye and the nose, and this difference should increase as the face turns.
  • the facial normal is estimated by considering mostly pose invariant distance ratios within a face.
• this is a two-class detection problem, which is less complex than the general pose detection problem that aims to determine face pose very precisely.
• by two-class detection we mean that a binary decision is made between two options, e.g., presence of a face or absence of a face, frontal face or non-frontal face, etc. While one or more of the techniques described above may be employed, the techniques we implement in a preferred embodiment are described below.
  • the main technique employed by the active speaker face detection module 418 and the frontal pose detection module 420 to do face and feature detection is based on Fisher Linear Discriminant (FLD) analysis.
  • a goal of FLD analysis is to get maximum discrimination between classes and reduce the dimensionality of the feature space.
  • face detection we consider two classes: (i) the In-Class, which comprises faces, and; (ii) the Out-Class, composed of non-faces.
• the criterion of FLD analysis is then to find the vector W of the feature space that maximizes the ratio of the between-class scatter to the within-class scatter, i.e., (W'SBW)/(W'SWW), where SB and SW denote the between-class and within-class scatter matrices, respectively.
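For the two-class case (In-Class faces vs. Out-Class non-faces), the Fisher criterion has a well-known closed-form maximizer, which can be sketched as follows; the function name and regularization constant are illustrative.

```python
import numpy as np

def fisher_direction(X_in, X_out):
    """Direction w maximizing (w' Sb w) / (w' Sw w) for two classes
    (In-Class = faces, Out-Class = non-faces)."""
    mu_in, mu_out = X_in.mean(axis=0), X_out.mean(axis=0)
    # within-class scatter: pooled (biased) covariances scaled by class sizes
    Sw = np.cov(X_in.T, bias=True) * len(X_in) + np.cov(X_out.T, bias=True) * len(X_out)
    diff = (mu_in - mu_out)[:, None]
    # for two classes the maximizer is w ∝ Sw^{-1} (mu_in - mu_out)
    w = np.linalg.solve(Sw + 1e-9 * np.eye(Sw.shape[0]), diff).ravel()
    return w / np.linalg.norm(w)
```

Projecting samples onto `w` yields maximum discrimination between the two classes while reducing the feature space to one dimension, which is the stated goal of the FLD analysis.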
  • Face detection involves first locating a face in the first frame of a video sequence and the location is tracked across frames in the video clip. Face detection is preferably performed in the following manner. For locating a face, an image pyramid over permissible scales is generated and, for every location in the pyramid, we score the surrounding area as a face location. After a skin-tone segmentation process that aims to locate image regions in the pyramid where colors could indicate the presence of a face, the image is sub-sampled and regions are compared to a previously stored diverse training set of face templates using FLD analysis. This yields a score that is combined with a Distance From Face Space (DFFS) measure to give a face likelihood score.
• DFFS considers the distribution of the image energy over the eigenvectors of the covariance matrix. The higher the total score, the higher the chance that the considered region is a face. Thus, the locations scoring highly on all criteria are determined to be faces. For each high scoring face location, we consider small translations, scale and rotation changes that occur from one frame to the next and re-score the face region under each of these changes to optimize the estimates of these parameters (i.e., FLD and DFFS). DFFS is also described in the article by M. Turk and A. Pentland, "Eigenfaces for Recognition," Journal of Cognitive Neuro Science, vol. 3, no. 1, pp. 71-86, 1991.
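A minimal sketch of the DFFS measure: the residual energy of an image patch after projecting it onto the top eigenvectors ("face space"). The interface below is an assumption for illustration; in the module this score is combined with the FLD score.

```python
import numpy as np

def dffs(x, mean, eigvecs):
    """Distance From Face Space: the image energy left over after projecting
    onto the top eigenvectors of the training covariance (rows of `eigvecs`)."""
    centered = x - mean
    coeffs = eigvecs @ centered       # energy captured inside face space
    recon = eigvecs.T @ coeffs        # reconstruction from face space
    return float(np.sum((centered - recon) ** 2))
```

A patch lying inside the span of the eigenvectors scores near zero (face-like); energy orthogonal to face space raises the distance.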
• a computer vision-based face identification method for face and feature finding which may be employed in accordance with the invention is described in Andrew Senior, "Face and feature finding for face recognition system," 2nd Int. Conf. on Audio- and Video-based Biometric Person Authentication, Washington DC, March 1999.
  • a similar method is applied, combined with statistical considerations of position, to detect the features within a face (step 504 of FIG. 5B).
  • this face and feature detection technique is designed to detect strictly frontal faces only, and the templates are intended only to distinguish strictly frontal faces from non-faces: more general frontal faces are not considered at all.
  • this method requires the creation of face and feature templates. These are generated from a database of frontal face images. The training face or feature vectors are added to the In-class and some Out-class vectors are generated randomly from the background in our training images.
  • the total score may be compared to a threshold to decide whether or not a face candidate or a feature candidate is a true face or feature.
  • the module provides two alternate ways to adapt (step 506 of FIG. 5B) the detection method: (i) a pruning mechanism and; (ii) a hierarchical template matching technique.
  • the false candidates are pruned from the candidates according to the following independent computations:
  • This pruning method has many advantages. For example, it does not require the computation of a specific database: we can reuse the one computed to do face detection.
  • Out-Class includes non-frontal faces.
• the hierarchical template method makes it easier to find a threshold that is less user-dependent, so that the problem can be solved by simple face-finding score thresholding.
  • One advantage of the hierarchical template matching method is that the pose score (i.e., the score given by the pose template matching) is very low for non-faces (i.e., for non-faces that could have been wrongly detected as faces by the face template matching), which helps to discard non-faces.
• frontal pose estimates are generated by the module 420 (FIG. 4). These estimates represent whether or not a face having a frontal pose is detected in the video frame under consideration. These estimates are used by an event detection module 428, along with the audio feature vectors A extracted in module 414 and visual speech feature vectors V extracted in a visual speech feature extractor module 422, explained below.
• the visual speech feature extractor 422 extracts visual speech feature vectors (e.g., mouth or lip-related parameters), denoted in FIG. 4 as the letter V, from the face detected in the video frame by the active speaker face detector 418.
  • visual speech features examples include grey scale parameters of the mouth region; geometric/model based parameters such as area, height, width of mouth region; lip contours arrived at by curve fitting, spline parameters of inner/outer contour; and motion parameters obtained by three dimensional tracking.
  • Still another feature set that may be extracted via module 422 takes into account the above factors.
• Such technique is known as Active Shape modeling and is described in Iain Matthews, "Features for audio visual speech recognition," Ph.D. dissertation, School of Information Systems, University of East Anglia, January 1998.
  • the visual speech feature extractor 422 may implement one or more known visual feature extraction techniques
• the extractor extracts grey scale parameters associated with the mouth region of the image. Given the location of the lip corners, after normalization of scale and rotation, a rectangular region containing the lip region at the center of the rectangle is extracted from the original decompressed video frame. Principal Component Analysis (PCA), as is known, may be used to extract a vector of smaller dimension from this vector of grey-scale values.
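The PCA reduction of the flattened grey-scale lip rectangle can be sketched as below. The function shape and the choice of k are illustrative assumptions; the patent only specifies that PCA yields a smaller-dimension vector from the grey-scale values.

```python
import numpy as np

def pca_features(patches, k=8):
    """Reduce flattened grey-scale lip-region rectangles (rows of `patches`)
    to k-dimensional feature vectors via Principal Component Analysis."""
    mean = patches.mean(axis=0)
    centered = patches - mean
    # principal directions are the top right-singular vectors of the data
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    basis = vt[:k]                      # (k, D) orthonormal basis
    return centered @ basis.T, mean, basis
```

At runtime, a new lip patch would be centered with the stored `mean` and projected onto `basis` to obtain its visual speech feature vector.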
• Another method of extracting visual feature vectors may include extracting geometric features. This entails extracting the phonetic/visemic information from the geometry of the lip contour and its time dynamics.
  • Typical parameters may be the mouth corners, the height or the area of opening, the curvature of inner as well as the outer lips.
  • Positions of articulators, e.g., teeth and tongue, may also be feature parameters, to the extent that they are discernible by the camera.
  • the method of extraction of these parameters from grey scale values may involve minimization of a function (e.g., a cost function) that describes the mismatch between the lip contour associated with parameter values and the grey scale image. Color information may be utilized as well in extracting these parameters.
• From the captured (or demultiplexed and decompressed) video stream, one performs boundary detection, the ultimate result of which is a parameterized contour, e.g., circles, parabolas, ellipses or, more generally, spline contours, each of which can be described by a finite set of parameters.
  • a wire-frame may consist of a large number of triangular patches. These patches together give a structural representation of the mouth/lip/jaw region, each of which contain useful features in speech-reading. These parameters could also be used in combination with grey scale values of the image to benefit from the relative advantages of both schemes.
  • the extracted visual speech feature vectors are then normalized in block 424 with respect to the frontal pose estimates generated by the detection module 420.
  • the normalized visual speech feature vectors are then provided to a probability module 426.
  • the probability module 426 labels the extracted visual speech vectors with one or more previously stored phonemes.
  • each phoneme associated with one or more visual speech feature vectors has a probability associated therewith indicating the likelihood that it was that particular acoustic unit that was spoken in the video segment being considered.
• the probability module yields likelihood scores for each considered phoneme in the form of the probability that, given a particular phoneme or acoustic unit (au), the acoustic unit represents the uttered speech characterized by one or more visual speech feature vectors V or, in other words, P(V|au).
  • the visual speech feature vectors may be labeled with visemes which, as previously mentioned, are visual phonemes or canonical mouth shapes that accompany speech utterances.
• the probabilities generated by modules 416 and 426 are jointly used by A-V probability module 430.
  • the respective probabilities from modules 416 and 426 are combined based on a confidence measure 432.
  • Confidence estimation refers to a likelihood or other confidence measure being determined with regard to the recognized input. Recently, efforts have been initiated to develop appropriate confidence measures for recognized speech. In LVCSR Hub5 Workshop, April 29 - May 1, 1996,
• a first method uses decision trees trained on word-dependent features (amount of training utterances, minimum and average triphone occurrences, occurrence in language model training, number of phonemes/lefemes, duration, acoustic score (fast match and detailed match), speech or non-speech), sentence-dependent features (signal-to-noise ratio, estimates of speaking rates: number of words or of lefemes or of vowels per second, sentence likelihood provided by the language model, trigram occurrence in the language model), word in a context features (trigram occurrence in language model) as well as speaker profile features (accent, dialect, gender, age, speaking rate, identity, audio quality, SNR, etc.).
• a probability of error is computed on the training data for each of the leaves of the tree. Algorithms to build such trees are disclosed, for example, in Breiman et al., "Classification and regression trees," Chapman & Hall, 1993. At recognition, all or some of these features are measured during recognition and for each word the decision tree is walked to a leaf which provides a confidence level. In C. Neti, S. Roukos and E. Eide, "Word based confidence measures as a guide for stack search in speech recognition," ICASSP97,
• the probability module 430 decides which probability, i.e., the probability from the visual information path or the probability from the audio information path, to rely on more. This determination may be represented as the weighted sum w1*vp + w2*ap, where vp represents a probability associated with the visual information, ap represents a probability associated with the corresponding audio information, and w1 and w2 represent respective weights.
• the module 430 assigns appropriate weights to the probabilities. For instance, if the surrounding environmental noise level is particularly high, i.e., resulting in a lower acoustic confidence measure, there is more of a chance that the probabilities generated by the acoustic decoding path contain errors. Thus, the module 430 assigns a lower weight for w2 than for w1, placing more reliance on the decoded information from the visual path.
• the module may set w2 higher than w1.
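The confidence-driven weighting just described might be sketched as follows; deriving the weights by normalizing the two confidence measures is an assumption for illustration, not the patent's stated scheme.

```python
def fuse(v_p, a_p, acoustic_conf, visual_conf=1.0):
    """Confidence-weighted combination w1*v_p + w2*a_p: a low acoustic
    confidence (e.g., a noisy environment) shifts weight to the visual path."""
    total = visual_conf + acoustic_conf
    w1 = visual_conf / total
    w2 = acoustic_conf / total
    return w1 * v_p + w2 * a_p
```

With a very low acoustic confidence the fused score tracks the visual probability; with a high acoustic confidence it tracks the audio probability.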
• a visual confidence measure may be used. It is to be appreciated that the first joint use of the visual information and audio information in module 430 is referred to as decision or score fusion.
  • An alternative embodiment implements feature fusion as described in the above-referenced U.S. patent application identified as Serial No. 09/369,707 (attorney docket no. YO999-317).
• decoding is then performed by a search module 434 with language models (LM) based on the weighted probabilities received from module 430. That is, the acoustic units identified as having the highest probabilities of representing what was uttered in the arbitrary content video are put together to form words.
  • the words are output by the search engine 434 as the decoded system output.
  • a conventional search engine may be employed. This output is provided to the dialog manager 18 of FIG. 1 for use in disambiguating the user's intent, as described above.
  • the audio-visual speech recognition module of FIG. 4 also includes an event detection module 428.
  • the module may use information from the video path only, information from the audio path only, or information from both paths simultaneously to decide whether or not to decode information. This is accomplished via the event detection module 428. It is to be understood that “event detection” refers to the determination of whether or not an actual speech event that is intended to be decoded is occurring or is going to occur. Based on the output of the event detection module, microphone 406 or the search engine 434 may be enabled/disabled. Note that if no face is detected, then the audio can be processed to make decisions.
• the event detection module 428 receives input from the frontal pose detector 420, the visual feature extractor 422 (via the pose normalization block 424), and the audio feature extractor 414.
  • any mouth openings on a face identified as "frontal” are detected. This detection is based on the tracking of the facial features associated with a detected frontal face, as described in detail above with respect to modules 418 and 420.
  • microphone 406 is turned on, in step 512. Once the microphone is turned on, any signal received therefrom is stored in a buffer (step 514). Then, mouth opening pattern recognition (e.g., periodicity) is performed on the mouth movements associated with the buffered signal to determine if what was buffered was in fact speech (step 516). This is determined by comparing the visual speech feature vectors to pre-stored visual speech patterns consistent with speech. If the buffered data is tagged as speech, in step 518, the buffered data is sent on through the acoustic path so that the buffered data may be recognized, in step 520, so as to yield a decoded output.
  • FIG. 5C depicts one example of how visual information (e.g., mouth openings) is used to decide whether or not to decode an input audio signal.
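The visually gated decoding loop of FIG. 5C can be sketched as below. The callbacks are illustrative stand-ins: in the module, mouth-opening detection comes from the tracked facial features and the speech check compares visual speech feature vectors against pre-stored patterns.

```python
def event_driven_decode(frames, is_mouth_opening, looks_like_speech, decode):
    """Sketch of FIG. 5C: the microphone and decoder run only when visual
    evidence suggests a speech event. Callbacks are hypothetical stand-ins."""
    mic_on, buffer, outputs = False, [], []
    for frame in frames:
        if not mic_on and is_mouth_opening(frame):
            mic_on = True                       # step 512: turn the microphone on
        if mic_on:
            buffer.append(frame)                # step 514: buffer the incoming signal
            if looks_like_speech(buffer):       # step 516: mouth-opening pattern check
                outputs.append(decode(buffer))  # steps 518-520: tag as speech and recognize
                mic_on, buffer = False, []      # reset until the next visual trigger
    return outputs
```

This mirrors the stated advantage: the recognizer (and its power draw) stays idle until the visual modality indicates speech.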
  • the event detection module may alternatively control the search module 434, e.g., turning it on or off, in response to whether or not a speech event is detected.
• the event detection module is generally a module that decides whether an input signal captured by the microphone is speech given audio and corresponding video information or, in other words, P(Speech|A,V).
• the event detection module 428 may perform one or more speech-only based detection methods such as, for example: signal energy level detection (e.g., is the audio signal above a given level); signal zero crossing detection (e.g., are there high enough zero crossings); voice activity detection (non-stationarity of the spectrum) as described in, e.g., N.R. Garner et al., "Robust noise detection for speech recognition and enhancement," Electronics Letters, Feb. 1997, vol. 33, no. 4, pp. 270-271; D.K. Freeman et al., "The voice activity detector of the pan-European digital mobile telephone service," IEEE 1989, CH2673-2; N.R.
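The first two speech-only detectors (energy level and zero-crossing count) are simple enough to sketch directly; the thresholds below are illustrative assumptions, and real voice activity detection adds spectral non-stationarity tests as the citations describe.

```python
import math

def frame_energy(samples):
    # mean squared amplitude of the frame
    return sum(s * s for s in samples) / len(samples)

def zero_crossings(samples):
    # count sign changes between consecutive samples
    return sum(1 for a, b in zip(samples, samples[1:]) if a * b < 0)

def speech_like(samples, energy_thresh=0.01, zc_min=5):
    """Simple speech/non-speech decision from signal energy and
    zero-crossing count; thresholds are illustrative, not from the patent."""
    return frame_energy(samples) > energy_thresh and zero_crossings(samples) >= zc_min
```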
• Verification," the disclosure of which is incorporated by reference herein.
  • utterance verification is performed when the text (script) is not known and available to the system.
  • the uttered speech to be verified may be decoded by classical speech recognition techniques so that a decoded script and associated time alignments are available. This is accomplished using the feature data from the acoustic feature extractor 414.
  • the visual speech feature vectors from the visual feature extractor 422 are used to produce a visual phonemes (visemes) sequence.
  • the script is aligned with the visemes.
  • a rapid (or other) alignment may be performed in a conventional manner in order to attempt to synchronize the two information streams. For example, in one embodiment, rapid alignment as disclosed in the U.S. patent application identified as Serial No. 09/015,150 (docket no.
• YO997-386 and entitled “Apparatus and Method for Generating Phonetic Transcription from Enrollment Utterances,” the disclosure of which is incorporated by reference herein, may be employed.
• in step 528, a likelihood on the alignment is computed to determine how well the script aligns to the visual data.
  • the results of the likelihood are then used, in step 530, to decide whether an actual speech event occurred or is occurring and whether the information in the paths needs to be recognized.
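A toy sketch of the alignment-likelihood idea: map the decoded phoneme script to viseme classes and score agreement with the observed viseme sequence. The phoneme-to-viseme grouping here is purely illustrative, and a real implementation would use probabilistic alignment rather than position-wise matching.

```python
# Illustrative phoneme -> viseme grouping; actual viseme classes differ.
VISEME = {"p": "bilabial", "b": "bilabial", "m": "bilabial",
          "f": "labiodental", "v": "labiodental",
          "a": "open", "o": "round", "u": "round"}

def alignment_likelihood(decoded_phonemes, observed_visemes):
    """Score how well a decoded script aligns with the viseme sequence:
    the fraction of time-aligned positions whose viseme classes agree."""
    pairs = list(zip(decoded_phonemes, observed_visemes))
    if not pairs:
        return 0.0
    hits = sum(1 for ph, vis in pairs if VISEME.get(ph) == vis)
    return hits / len(pairs)

def is_speech_event(decoded, observed, threshold=0.6):
    # step 530: decide whether an actual speech event occurred
    return alignment_likelihood(decoded, observed) >= threshold
```

A high likelihood indicates the audio decode and the lip movements describe the same utterance, supporting the decision that a genuine speech event occurred.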
  • the audio-visual speech recognition module of FIG. 4 may apply one of, a combination of two of, or all three of, the approaches described above in the event detection module 428 to perform event detection.
  • Video information only based detection is useful so that the module can do the detection when the background noise is too high for a speech only decision.
  • the audio only approach is useful when speech occurs without a visible face present.
  • the combined approach offered by unsupervised utterance verification improves the decision process when a face is detectable with the right pose to improve the acoustic decision.
  • the event detection methodology provides better modeling of background noise, that is, when no speech is detected, silence is detected. Also, for embedded applications, such event detection provides additional advantages. For example, the CPU associated with an embedded device can focus on other tasks instead of having to run in a speech detection mode. Also, a battery power savings is realized since speech recognition engine and associated components may be powered off when no speech is present.
  • the audio-visual speech recognition module of FIG. 4 may employ the alternative embodiments of audio-visual speech detection and recognition described in the above-referenced U.S. patent application identified as Serial No. 09/369,707 (attorney docket no. YO999-317).
  • the embodiment of FIG. 4 illustrates a decision or score fusion approach
  • the module may employ a feature fusion approach and/or a serial rescoring approach, as described in the above-referenced U.S. patent application identified as Serial No. 09/369,707 (attorney docket no. YO999-317).
• Referring to FIG. 6, a block diagram illustrates a preferred embodiment of an audio-visual speaker recognition module that may be employed as one of the recognition modules of FIG. 1 to perform speaker recognition using multi-modal input data received in accordance with the invention. It is to be appreciated that such an audio-visual speaker recognition module is disclosed in the above-referenced U.S. patent application identified as Serial No. 09/369,706 (attorney docket no.
• The audio-visual speaker recognition and utterance verification module shown in FIG. 6 uses a decision fusion approach. Like the audio-visual speech recognition module of FIG. 4, the speaker recognition module of FIG. 6 may receive the same types of arbitrary content video from the camera 604 and audio from the microphone 606 via the I/O manager 14. While the camera and microphone have different reference numerals in FIG. 6 than in FIG. 4, it is to be appreciated that they may be the same camera and microphone.
  • a phantom line denoted by Roman numeral I represents the processing path the audio information signal takes within the module, while a phantom line denoted by Roman numeral II represents the processing path the video information signal takes within the module.
• the audio signal path I will be discussed, then the video signal path II, followed by an explanation of how the two types of information are combined to provide improved speaker recognition accuracy.
  • the module includes an auditory feature extractor 614.
  • the feature extractor 614 receives an audio or speech signal and, as is known in the art, extracts spectral features from the signal at regular intervals.
• the spectral features are in the form of acoustic feature vectors (signals) which are then passed on to an audio speaker recognition module 616.
  • the speech signal may be sampled at a rate of 16 kilohertz (kHz).
  • a frame may consist of a segment of speech having a 25 millisecond (msec) duration.
  • the extraction process preferably produces 24 dimensional acoustic cepstral vectors via the process described below. Frames are advanced every 10 msec to obtain succeeding acoustic vectors.
  • Of course, other front-ends may be employed, for example, front-ends that apply LDA (Linear Discriminant Analysis) to the spectral features.
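The framing arithmetic above (16 kHz sampling, 25 msec frames advanced every 10 msec, 24-dimensional cepstral vectors) can be sketched as follows. The simple real-cepstrum computation and all function names are illustrative assumptions, not the patent's actual front-end.

```python
import numpy as np

def frame_signal(signal, sample_rate=16000, frame_ms=25, hop_ms=10):
    frame_len = int(sample_rate * frame_ms / 1000)   # 400 samples per 25 msec frame
    hop_len = int(sample_rate * hop_ms / 1000)       # advance 160 samples (10 msec)
    n_frames = 1 + (len(signal) - frame_len) // hop_len
    return np.stack([signal[i * hop_len: i * hop_len + frame_len]
                     for i in range(n_frames)])

def cepstra(frames, n_coeffs=24):
    # Windowed log magnitude spectrum, then a real cepstrum via inverse FFT;
    # keep the first n_coeffs coefficients as the acoustic feature vector.
    spectrum = np.abs(np.fft.rfft(frames * np.hamming(frames.shape[1]), axis=1))
    log_spec = np.log(spectrum + 1e-10)
    ceps = np.fft.irfft(log_spec, axis=1)
    return ceps[:, :n_coeffs]

signal = np.random.randn(16000)   # one second of noise as dummy audio
frames = frame_signal(signal)
vectors = cepstra(frames)
print(frames.shape, vectors.shape)
```

With one second of 16 kHz audio, this yields 98 overlapping frames and a 24-dimensional vector per frame.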
  • the audio speaker recognition module 616 may perform speaker identification and/or speaker verification using the extracted acoustic feature vectors.
  • the processes of speaker identification and verification may be accomplished via any conventional acoustic information speaker recognition system.
  • speaker recognition module 616 may implement the recognition techniques described in the U.S. patent application identified by Serial No. 08/788,471, filed on January 28, 1997, and entitled: "Text Independent Speaker
  • the illustrative speaker identification system may use two techniques: a model-based approach and a frame-based approach.
  • For speaker identification based on audio, a model-based approach may be used.
  • the frame-based approach can be described in the following manner.
  • Let M_i be the model corresponding to the i-th enrolled speaker. M_i is represented by a mixture Gaussian model defined by a parameter set comprising the mixture component weights, means and variances for each of the n_i components of speaker i's model, created using training data consisting of a sequence of K frames of speech with d-dimensional acoustic feature vectors. The total distance D_i of model M_i from the test data is then taken to be the sum of the distances over all the test frames: D_i = sum over k = 1..K of d(f_k, M_i), where d(f_k, M_i) denotes the distance of test frame f_k from model M_i.
  • The above approach finds the closest matching model, and the person whom that model represents is determined to be the person whose utterance is being processed. Speaker verification may be performed in a similar manner; however, the input acoustic data is compared to determine if the data matches closely enough with stored models. If the comparison yields a close enough match, the person uttering the speech is verified. The match is accepted or rejected by comparing the match with competing models. These models can be selected to be similar to the claimant speaker or be speaker independent (i.e., a single or a set of speaker independent models). If the claimant wins, and wins with enough margin (computed at the level of the likelihood or the distance to the models), we accept the claimant. Otherwise, the claimant is rejected. It should be understood that, at enrollment, input speech is collected for a speaker to build the mixture Gaussian model M_i that characterizes each speaker.
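The frame-based scoring just described (a per-frame distance to each enrolled speaker's Gaussian mixture model, summed over all test frames, with the smallest total distance winning) might be sketched as below. The diagonal-covariance, likelihood-based distance and the toy one-component models are assumptions for illustration.

```python
import numpy as np

def gmm_neg_log_lik(x, weights, means, variances):
    # Negative log-likelihood of frame x under a diagonal-covariance mixture;
    # used here as the per-frame distance d(f_k, M_i).
    diff = x - means                                   # (n_components, dim)
    exponent = -0.5 * np.sum(diff**2 / variances, axis=1)
    norm = -0.5 * np.sum(np.log(2 * np.pi * variances), axis=1)
    return -np.log(np.sum(weights * np.exp(exponent + norm)) + 1e-300)

def total_distance(frames, model):
    # D_i: sum of the per-frame distances over all K test frames.
    return sum(gmm_neg_log_lik(x, *model) for x in frames)

def identify(frames, models):
    distances = [total_distance(frames, m) for m in models]
    return int(np.argmin(distances)), distances

rng = np.random.default_rng(0)
dim = 4
# Two enrolled speakers, each modeled (for simplicity) by one Gaussian component.
models = [
    (np.array([1.0]), np.zeros((1, dim)), np.ones((1, dim))),       # speaker 0
    (np.array([1.0]), np.full((1, dim), 3.0), np.ones((1, dim))),   # speaker 1
]
test = rng.normal(3.0, 1.0, size=(50, dim))   # test frames drawn near speaker 1
best, dists = identify(test, models)
print(best)
```

Verification by margin, as in the text, would compare the claimant's total distance against those of competing models and accept only if the gap is large enough.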
  • the audio-visual speaker recognition and utterance verification module includes an active speaker face segmentation module 620 and a face recognition module 624.
  • the active speaker face segmentation module 620 receives video input from camera 604. It is to be appreciated that speaker face detection can also be performed directly in the compressed data domain and/or from audio and video information rather than just from video information. In any case, segmentation module 620 generally locates and tracks the speaker's face and facial features within the arbitrary video background. This will be explained in detail below.
  • an identification and/or verification operation may be performed by recognition module 624 to identify and/or verify the face of the person assumed to be the speaker in the video. Verification can also be performed by adding score thresholding or competing models.
  • the visual mode of speaker identification is implemented as a face recognition system where faces are found and tracked in the video sequences, and recognized by comparison with a database of candidate face templates.
  • utterance verification provides a technique to verify that the person actually uttered the speech used to recognize him. Face detection and recognition may be performed in a variety of ways. For example, in an embodiment employing an infrared camera 604, face detection and identification may be performed as disclosed in Francine J. Prokoski and Robert R.
  • Faces can occur at a variety of scales, locations and orientations in the video frames.
  • the system searches for a fixed size template in an image pyramid.
  • the image pyramid is constructed by repeatedly down-sampling the original image to give progressively lower resolution representations of the original frame.
  • To be considered a candidate, the region must contain a high proportion of skin-tone pixels, and then the intensities of the candidate region are compared with a trained face model. Pixels falling into a pre-defined cuboid of hue-chromaticity-intensity space are deemed to be skin tone, and the proportion of skin tone pixels must exceed a threshold for the candidate region to be considered further.
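The candidate-screening steps above (an image pyramid of progressively down-sampled frames, plus a skin-tone test against a pre-defined cuboid in hue-chromaticity-intensity space) could look roughly like this. The cuboid bounds and the 0.5 threshold are made-up values, not from the patent.

```python
import numpy as np

def pyramid(image, levels=3, factor=2):
    # Repeatedly down-sample the original image to give progressively
    # lower-resolution representations of the original frame.
    out = [image]
    for _ in range(levels - 1):
        out.append(out[-1][::factor, ::factor])
    return out

def skin_tone_proportion(region_hci, cuboid):
    # region_hci: (H, W, 3) hue/chromaticity/intensity values in [0, 1].
    # cuboid: ((h_lo, h_hi), (c_lo, c_hi), (i_lo, i_hi)) bounds.
    lo = np.array([b[0] for b in cuboid])
    hi = np.array([b[1] for b in cuboid])
    inside = np.all((region_hci >= lo) & (region_hci <= hi), axis=-1)
    return inside.mean()

def is_candidate(region_hci, cuboid, threshold=0.5):
    # The proportion of skin-tone pixels must exceed a threshold.
    return bool(skin_tone_proportion(region_hci, cuboid) > threshold)

img = np.arange(64 * 64, dtype=float).reshape(64, 64)
levels = pyramid(img)
region = np.full((8, 8, 3), 0.5)                     # uniform mid-range region
cuboid = ((0.3, 0.7), (0.3, 0.7), (0.3, 0.7))        # assumed skin-tone cuboid
candidate = is_candidate(region, cuboid)
print([l.shape for l in levels], candidate)
```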
  • The face model is based on a training set of cropped, normalized, grey-scale face images. Statistics of these faces are gathered and a variety of classifiers are trained based on these statistics.
  • a Fisher linear discriminant (FLD) trained with a linear program is found to distinguish between faces and background images, and "Distance from face space” (DFFS), as described in M. Turk and A. Pentland, "Eigenfaces for Recognition,” Journal of Cognitive Neuro Science, vol. 3, no. 1, pp. 71-86, 1991, is used to score the quality of faces given high scores by the first method. A high combined score from both these face detectors indicates that the candidate region is indeed a face.
  • Candidate face regions with small perturbations of scale, location and rotation relative to high-scoring face candidates are also tested and the maximum scoring candidate among the perturbations is chosen, giving refined estimates of these three parameters.
  • the face is tracked by using a velocity estimate to predict the new face location, and models are used to search for the face in candidate regions near the predicted location with similar scales and rotations.
  • a low score is interpreted as a failure of tracking, and the algorithm begins again with an exhaustive search.
  • a Gabor Jet representation as described in L. Wiskott and C. von der Malsburg, "Recognizing Faces by Dynamic Link Matching," Proceedings of the International Conference on Artificial Neural Networks, pp. 347-352, 1995, is generated.
  • a Gabor jet is a set of two-dimensional Gabor filters - each a sine wave modulated by a Gaussian.
  • a simple distance metric is used to compute the distance between the feature vectors for trained faces and the test candidates.
  • the distance between the i-th trained candidate and a test candidate for feature k is defined as:
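Since the distance equation itself is not reproduced above, the following sketch only illustrates the general idea: build a jet of 2-D Gabor filter responses (each a sinusoid modulated by a Gaussian) and compare jets with a simple normalized Euclidean distance. The filter parameters and the distance form are assumptions.

```python
import numpy as np

def gabor_kernel(size, wavelength, theta, sigma):
    # 2-D Gabor filter: a complex sinusoid modulated by a Gaussian envelope.
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)
    envelope = np.exp(-(x**2 + y**2) / (2 * sigma**2))
    carrier = np.exp(2j * np.pi * xr / wavelength)
    return envelope * carrier

def gabor_jet(patch, wavelengths=(4, 8),
              thetas=(0, np.pi / 4, np.pi / 2, 3 * np.pi / 4)):
    # The jet: response magnitudes of the patch to each filter in the set.
    size = patch.shape[0]          # patch assumed square with odd side length
    return np.array([
        np.abs(np.vdot(gabor_kernel(size, w, t, size / 4), patch))
        for w in wavelengths for t in thetas
    ])

def jet_distance(jet_a, jet_b):
    # An assumed, simple normalized Euclidean distance between feature jets.
    return np.linalg.norm(jet_a - jet_b) / (
        np.linalg.norm(jet_a) + np.linalg.norm(jet_b) + 1e-12)

rng = np.random.default_rng(1)
patch = rng.normal(size=(15, 15))
jet = gabor_jet(patch)
d0 = jet_distance(jet, jet)
print(jet.shape, d0)
```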
  • Confidence estimation refers to a likelihood or other confidence measure being determined with regard to the recognized input.
  • the confidence estimation procedure may include measurement of noise levels respectively associated with the audio signal and the video signal. These levels may be measured internally or externally with respect to the system. A higher level of noise associated with a signal generally means that the confidence attributed to the recognition results associated with that signal is lower. Therefore, these confidence measures are taken into consideration during the weighting of the visual and acoustic results discussed below.
  • audio-visual speaker identification/verification may be performed by a joint identification/verification module 630 as follows.
  • the top N scores are generated based on both audio-based and video-based identification techniques.
  • the two lists are combined by a weighted sum and the best-scoring candidate is chosen. Since the weights need only be defined up to a scaling factor, we can define the combined score S_av as a function of the single parameter α:
  • the mixture angle α has to be selected according to the relative reliability of audio identification and face identification.
  • One way to achieve this is to optimize α in order to maximize the audio-visual accuracy on some training data.
  • the audio-visual speaker recognition module of FIG. 6 provides another decision or score fusion technique derived from the previous technique, but which does not require any training. It consists of selecting at testing time, for each clip, the value of α in a given range which maximizes the difference between the highest and the second highest scores. The corresponding best hypothesis I(n) is then chosen.
  • α(n) = arg max over α of [ max_i S_av^i(n) - 2nd max_i S_av^i(n) ], and I(n) = arg max_i S_av^i(n) evaluated at α = α(n).
  • The motivation behind this technique is the following.
  • the point corresponding to the correct decision is expected to lie apart from the others.
  • the fixed linear weights assume that the "direction" where this point can be found relative to the others is always the same, which is not necessarily true.
  • the equations relating to α(n) and I(n) above find the point which lies farthest apart from the others in any direction within the given range of α.
  • Another interpretation is that the distance between the best combined score and the second best is an indicator of the reliability of the decision.
  • the method adaptively chooses the weights which maximize that confidence measure.
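The adaptive fusion described above (sweep the mixture angle over a given range and pick the angle maximizing the gap between the highest and second-highest combined scores) can be sketched as follows. The particular combined-score form, S_av = cos(α)·S_audio + sin(α)·S_video, and the sweep granularity are assumptions.

```python
import numpy as np

def fuse_adaptive(audio_scores, video_scores,
                  angles=np.linspace(0.05, np.pi / 2 - 0.05, 50)):
    # For each candidate mixture angle, form the combined scores, measure the
    # margin between the best and second-best candidates, and keep the angle
    # (and winning candidate) that maximizes this reliability indicator.
    best_margin, best_id = -np.inf, None
    for a in angles:
        s = np.cos(a) * np.asarray(audio_scores) + np.sin(a) * np.asarray(video_scores)
        top2 = np.sort(s)[-2:]
        margin = top2[1] - top2[0]       # highest minus second highest
        if margin > best_margin:
            best_margin, best_id = margin, int(np.argmax(s))
    return best_id, best_margin

audio = [2.0, 0.5, 0.1]   # e.g., top-3 acoustic speaker-ID scores
video = [1.5, 1.4, 0.2]   # e.g., top-3 face-ID scores for the same candidates
best_id, margin = fuse_adaptive(audio, video)
print(best_id, margin)
```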
  • the joint identification/verification module 630 makes a decision with regard to the speaker.
  • a decision may be made to accept the speaker if he is verified via both the acoustic path and the visual path. However, he may be rejected if he is only verified through one of the paths.
  • the top three scores from the face identification process may be combined with the top three scores from the acoustic speaker identification process. Then, the highest combined score is identified as the speaker.
  • Preferably, before the module makes a final disposition with respect to the speaker, the system performs an utterance verification operation. It is to be appreciated that utterance verification is performed by the utterance verification module 628 (FIG. 6) based on input from the acoustic feature extractor 614 and a visual speech feature extractor 622. Before describing utterance verification, a description of illustrative techniques for extracting visual speech feature vectors will be given. Particularly, the visual speech feature extractor 622 extracts visual speech feature vectors (e.g., mouth or lip-related parameters), denoted in FIG. 6 as the letter V, from the face detected in the video frame by the active speaker face segmentation module 620.
  • Examples of visual speech features that may be extracted are grey scale parameters of the mouth region; geometric/model based parameters such as area, height, width of mouth region; lip contours arrived at by curve fitting, spline parameters of inner/outer contour; and motion parameters obtained by three dimensional tracking. Still another feature set that may be extracted via module 622 takes into account the above factors. Such a technique is known as Active Shape modeling and is described in Iain Matthews, "Features for audio visual speech recognition," Ph.D. dissertation, School of Information Systems, University of East Anglia, January 1998. Thus, while the visual speech feature extractor 622 may implement one or more known visual feature extraction techniques, in one embodiment, the extractor extracts grey scale parameters associated with the mouth region of the image.
  • For example, PCA (Principal Component Analysis) may then be applied to the extracted grey scale parameters to reduce their dimensionality.
  • Another method of extracting visual feature vectors may include extracting geometric features. This entails extracting the phonetic/visemic information from the geometry ofthe lip contour and its time dynamics. Typical parameters may be the mouth corners, the height or the area of opening, the curvature of inner as well as the outer lips. Positions of articulators, e.g., teeth and tongue, may also be feature parameters, to the extent that they are discernible by the camera.
  • the method of extraction of these parameters from grey scale values may involve minimization of a function (e.g., a cost function) that describes the mismatch between the lip contour associated with parameter values and the grey scale image. Color information may be utilized as well in extracting these parameters.
  • From the captured (or demultiplexed and decompressed) video stream, one performs a boundary detection, the ultimate result of which is a parameterized contour, e.g., circles, parabolas, ellipses or, more generally, spline contours, each of which can be described by a finite set of parameters.
  • a wire-frame may consist of a large number of triangular patches. These patches together give a structural representation of the mouth/lip/jaw region, each of which contains useful features in speech-reading. These parameters could also be used in combination with grey scale values of the image to benefit from the relative advantages of both schemes.
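As one concrete illustration of fitting a parameterized contour, a parabola can be least-squares fitted to upper and lower lip boundary points, from which geometric parameters such as mouth width and opening height follow. The parabola choice and helper names are illustrative assumptions, not the patent's method.

```python
import numpy as np

def fit_parabola(points):
    # Least-squares fit of y = a*x^2 + b*x + c to lip-boundary points.
    x, y = points[:, 0], points[:, 1]
    return np.polyfit(x, y, 2)

def mouth_geometry(upper_pts, lower_pts):
    # Derive simple geometric parameters: mouth width (corner to corner)
    # and opening height at the mouth's horizontal midpoint.
    all_pts = np.vstack([upper_pts, lower_pts])
    left, right = all_pts[:, 0].min(), all_pts[:, 0].max()
    width = right - left
    mid = (left + right) / 2.0
    upper_y = np.polyval(fit_parabola(upper_pts), mid)
    lower_y = np.polyval(fit_parabola(lower_pts), mid)
    return width, lower_y - upper_y

x = np.linspace(-2, 2, 9)
upper = np.column_stack([x, 0.5 * x**2])          # synthetic upper lip contour
lower = np.column_stack([x, 2.0 - 0.5 * x**2])    # synthetic lower lip contour
w, h = mouth_geometry(upper, lower)
print(w, h)
```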
  • Given the extracted visual speech feature vectors (V) from extractor 622 and the acoustic feature vectors (A) from extractor 614, the AV utterance verifier 628 performs verification. Verification may involve a comparison of the resulting likelihood, for example, of aligning the audio on a random sequence of visemes. As is known, visemes, or visual phonemes, are generally canonical mouth shapes that accompany speech utterances, which are categorized and pre-stored similar to acoustic phonemes. A goal associated with utterance verification is to make a determination that the speech used to verify the speaker in the audio path I and the visual cues used to verify the speaker in the video path II correlate or align.
  • Such a determination has many advantages. For example, from the utterance verification, it can be determined whether the user is lip synching to a pre-recorded tape playback in an attempt to fool the system. Also, from utterance verification, errors in the audio decoding path may be detected. Depending on the number of errors, a confidence measure may be produced and used by the system.
  • Utterance verification may be performed in: (i) a supervised mode, i.e., when the text (script) is known and available to the system; or (ii) an unsupervised mode, i.e., when the text (script) is not known and available to the system.
  • In step 702A (unsupervised mode), the uttered speech to be verified may be decoded by classical speech recognition techniques so that a decoded script and associated time alignments are available. This is accomplished using the feature data from the acoustic feature extractor 614. Contemporaneously, in step 704, the visual speech feature vectors from the visual feature extractor 622 are used to produce a visual phoneme or viseme sequence.
  • In step 706, the script is aligned with the visemes.
  • a rapid (or other) alignment may be performed in a conventional manner in order to attempt to synchronize the two information streams.
  • rapid alignment as disclosed in the U.S. patent application identified by Serial No. 09/015,150 (docket no.
  • In the supervised mode, step 702B replaces step 702A such that the expected or known script is aligned with the visemes in step 706, rather than the decoded version of the script.
  • In step 708, a likelihood on the alignment is computed to determine how well the script aligns to the visual data.
  • the results of the likelihood are then provided to a decision block 632 which, along with the results of the score module 630, decides on a final disposition of the speaker, e.g., accept him or reject him. This may be used to allow or deny access to a variety of devices, applications, facilities, etc.
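The unsupervised flow above (decode the script, produce a viseme sequence, align the two, score the alignment) might be approximated with a toy phoneme-to-viseme map and an edit-distance-based alignment score in place of a true likelihood. Every mapping entry, the score normalization and the 0.7 threshold are hypothetical.

```python
# Assumed, coarse many-to-one phoneme-to-viseme mapping (illustrative only).
PHONEME_TO_VISEME = {
    "p": "bilabial", "b": "bilabial", "m": "bilabial",
    "f": "labiodental", "v": "labiodental",
    "a": "open", "o": "round", "u": "round", "i": "spread",
}

def edit_distance(a, b):
    # Standard dynamic-programming edit distance between two sequences.
    d = [[i + j if i * j == 0 else 0 for j in range(len(b) + 1)]
         for i in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1,
                          d[i - 1][j - 1] + (a[i - 1] != b[j - 1]))
    return d[len(a)][len(b)]

def alignment_score(script_phonemes, observed_visemes):
    # Map the (decoded or known) script to its expected viseme sequence and
    # score how well it aligns with the observed visemes, in [0, 1].
    expected = [PHONEME_TO_VISEME[p] for p in script_phonemes
                if p in PHONEME_TO_VISEME]
    dist = edit_distance(expected, observed_visemes)
    return 1.0 - dist / max(len(expected), len(observed_visemes), 1)

def verify(script_phonemes, observed_visemes, threshold=0.7):
    return alignment_score(script_phonemes, observed_visemes) >= threshold

script = ["m", "a", "m", "a"]                      # script phonemes
good = ["bilabial", "open", "bilabial", "open"]    # matching lip movements
bad = ["round", "round", "round", "round"]         # e.g., playback mismatch
print(verify(script, good), verify(script, bad))
```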
  • In the unsupervised utterance verification mode, the system is able to check that the user is indeed speaking rather than using a playback device and moving his lips. Also, a priori, errors may be detected in the audio decoding. In the supervised mode, the system is able to prove that the user uttered the text if the recognized text is sufficiently aligned or correlated to the extracted lip parameters.
  • utterance verification in the unsupervised mode can be used to perform speech detection as disclosed in the above-referenced U.S. patent application identified as U.S. Serial No. 09/369,707 (attorney docket no. YO999-317). Indeed, if acoustic and visual activities are detected, they can be verified against each other. When the resulting acoustic utterance is accepted, the system considers that speech is detected. Otherwise, it is considered that extraneous activities are present. It is to be appreciated that the audio-visual speaker recognition module of FIG. 6 may employ the alternative embodiments of audio-visual speaker recognition described in the above-referenced U.S. patent application identified as Serial No.
  • the module 20 may employ a feature fusion approach and/or a serial rescoring approach, as described in the above-referenced U.S. patent application identified as Serial No. 09/369,706 (attorney docket no. YO999-318).
  • the output of the audio-visual speaker recognition system of FIG. 6 is provided to the dialog manager 18 of FIG. 1 for use in disambiguating the user's intent, as explained above.
  • Referring now to FIGs. 8A and 8B, block diagrams illustrate a preferred embodiment of a conversational virtual machine (CVM).
  • The CVM described below may be employed to provide a framework for portions of the I/O subsystem 12, the I/O manager 14, and the recognition engines of FIG. 1.
  • a multi-modal conversational computing system of the invention may be implemented through a browser that carries these functions, an OSS (operating system service) layer, a VM (virtual machine) or even just an application that implements all these functionalities, possibly without explicitly identifying these components but rather by implementing hard-coded equivalent services.
  • the implementation may support only modalities of speech and video and, in such a case, does not need to support other modalities (e.g., handwriting, GUI, etc.).
  • the CVM may be employed as a main component for implementing conversational computing according to the conversational computing paradigm described above with respect to the present invention.
  • the CVM is a conversational platform or kernel running on top of a conventional OS (operating system) or RTOS (real-time operating system).
  • a CVM platform can also be implemented with PvC (pervasive computing) clients as well as servers and can be distributed across multiple systems (clients and servers).
  • the CVM provides conversational APIs (application programming interfaces) and protocols between conversational subsystems (e.g., speech recognition engine, text-to-speech, etc.) and conversational and/or conventional applications.
  • the CVM may also provide backward compatibility to existing applications, with a more limited interface.
  • the CVM provides conversational services and behaviors as well as conversational protocols for interaction with multiple applications and devices also equipped with a CVM layer, or at least conversationally aware.
  • a CVM (or operating system) based on the conversational computing paradigm described herein allows a computer or any other interactive device to converse with a user.
  • the CVM further allows the user to run multiple tasks on a machine even if the machine has no display or GUI capabilities, nor any keyboard, pen or pointing device. Indeed, the user can manage these tasks like a conversation and bring a task, or multiple simultaneous tasks, to closure.
  • the CVM affords the capability of relying on mixed initiatives, contexts and advanced levels of abstraction, to perform its various functions. Mixed initiative or free flow navigation allows a user to naturally complete, modify, or correct a request via dialog with the system.
  • the CVM can actively help (take the initiative to help) and coach a user through a task, especially in speech-enabled applications, wherein the mixed initiative capability is a natural way of compensating for a display-less system or system with limited display capabilities.
  • the CVM complements conventional interfaces and user input/output rather than replacing them. This is the notion of "multi-modality" whereby speech, and video as described above, may be used in parallel with a mouse, keyboard, and other input devices such as a pen.
  • Conventional interfaces can be replaced when device limitations constrain the implementation of certain interfaces.
  • the ubiquity and uniformity of the resulting interface across devices, tiers and services is an additional mandatory characteristic.
  • a CVM system can, to a large extent, function with conventional input and/or output media. Indeed, a computer with classical keyboard inputs and pointing devices coupled with a traditional monitor display can profit significantly by utilizing the CVM.
  • U.S. patent application identified as U.S. Serial No. 09/507,526 (attorney docket no. YO999-178) filed on February 18, 2000 and entitled "Multi-Modal Shell” which claims priority to U.S. provisional patent application identified as U.S. Serial No. 60/128,081 filed on April 7, 1999 and U.S. provisional patent application identified by Serial No.
  • Referring to FIG. 8A, a block diagram illustrates a CVM system according to a preferred embodiment, which may be implemented on a client device or a server.
  • the components of the system 10 may be located locally (in the vehicle), remotely (e.g., connected wirelessly to the vehicle), or some combination thereof.
  • the CVM provides a universal coordinated multi-modal conversational user interface (CUI) 780.
  • the "multi-modality" aspect of the CUI implies that various I/O resources such as voice, keyboard, pen, and pointing device (mouse), keypads, touch screens, etc., and video as described above, can be used in conjunction with the CVM platform.
  • the "universality" aspect of the CUI implies that the CVM system provides the same UI (user interface) to a user whether the CVM is implemented in connection with a desktop computer, a PDA with limited display capabilities, or with a phone where no display is provided.
  • universality implies that the CVM system can appropriately handle the UI of devices with capabilities ranging from speech-only, to multi-modal (i.e., speech + GUI), to purely GUI.
  • the system may be extended to include video input data as well. Therefore, the universal CUI provides the same UI for all user interactions, regardless of the access modality.
  • the concept of a universal CUI extends to the concept of a coordinated CUI. When multiple devices (within or across multiple computer tiers) offer the same CUI, they can be managed through a single discourse, i.e., a coordinated interface. That is, when multiple devices are conversationally connected (i.e., aware of each other), it is possible to simultaneously control them through one interface (e.g., a single microphone).
  • For example, voice can automatically control, via a universal coordinated CUI, a smart phone, a pager, a PDA (personal digital assistant), networked computers, an IVR (interactive voice response) system and a car embedded computer that are conversationally connected.
  • the CVM system can run a plurality of applications including conversationally aware applications 782 (i.e., applications that "speak" conversational protocols) and conventional applications 784.
  • the conversationally aware applications 782 are applications that are specifically programmed for operating with a CVM core layer (or kernel) 788 via conversational application APIs 786.
  • the CVM kernel 788 controls the dialog across applications and devices on the basis of their registered conversational capabilities and requirements and provides a unified conversational user interface which goes far beyond adding speech as I/O modality to provide conversational system behaviors.
  • the CVM system may be built on top of a conventional OS and APIs 790 and conventional device hardware 792 and located on a server or any client device (PC, PDA, PvC).
  • the conventional applications 784 are managed by the CVM kernel layer 788 which is responsible for accessing, via the OS APIs, GUI menus and commands of the conventional applications as well as the underlying OS commands.
  • the CVM automatically handles all the input/output issues, including the conversational subsystems 796 (i.e., conversational engines) and conventional subsystems (e.g., file system and conventional drivers) of the conventional OS 790.
  • conversational sub-systems 796 are responsible for converting voice requests into queries and converting outputs and results into spoken messages using the appropriate data files 794 (e.g., contexts, finite state grammars, vocabularies, language models, symbolic query maps, etc.).
  • the conversational application API 786 conveys all the information for the CVM 788 to transform queries into application calls and conversely converts output into speech, appropriately sorted before being provided to the user.
  • Referring to FIG. 8B, a diagram illustrates the abstract programming layers of a CVM according to a preferred embodiment.
  • the abstract layers of the CVM comprise conversationally aware applications 800 and conventional applications 801 that can run on top of the CVM.
  • An application that relies on multi-modal disambiguation is an example of such a conversational application that executes on top of the CVM.
  • an application that exploits focus information or mood can be considered as a conversational application on top of the CVM.
  • These applications are the programs that are executed by the system to provide the user with the interaction he desires within the environment in which the system is deployed.
  • the conversationally aware applications 800 interact with a CVM kernel layer 802 via a conversational application API layer 803.
  • the conversational application API layer 803 encompasses conversational programming languages/scripts and libraries (conversational foundation classes) to provide the various features offered by the CVM kernel 802.
  • the conversational programming languages/scripts provide the conversational APIs that allow an application developer to hook (or develop) conversationally aware applications 800. They also provide the conversational API layer 803, conversational protocols 804 and system calls that allow a developer to build the conversational features into an application to make it "conversationally aware.”
  • the code implementing the applications, API calls and protocol calls includes interpreted and compiled scripts and programs, with library links, conversational logic engine calls and conversational foundation classes.
  • the conversational application API layer 803 comprises a plurality of conversational foundation classes 805 (or fundamental dialog components) which are provided to the application developer through library functions that may be used to build a CUI or conversationally aware applications 800.
  • the conversational foundation classes 805 are the elementary components or conversational gestures (as described by T.V. Raman, in "Auditory User Interfaces, Toward The Speaking Computer," Kluwer Academic Publishers, Boston, 1997) that characterize any dialog, independently of the modality or combination of modalities (which can be implemented procedurally or declaratively).
  • the conversational foundation classes 805 comprise CUI building blocks and conversational platform libraries, dialog modules and components, and dialog scripts and beans.
  • the conversational foundation classes 805 may be compiled locally into conversational objects 806. More specifically, the conversational objects 806 (or dialog components) are compiled from the conversational foundation classes 805 (fundamental dialog components) by combining the different individual classes in code calling these libraries through a programming language such as Java.
  • coding comprises embedding such fundamental dialog components into declarative code or linking them to imperative code.
  • Nesting and embedding of the conversational foundation classes 805 allows the conversational object 806 (either reusable or not) to be constructed (either declaratively or via compilation/interpretation) for performing specific dialog tasks or applications.
  • CML is not the only way to program the CVM. Any programming language that interfaces to the applications APIs and protocols would fit.
  • the conversational objects 806 may be implemented declaratively such as pages of CML (conversational markup language) (nested or not) which are processed or loaded by a conversational browser (or viewer) (800a) as disclosed in the PCT patent application identified as PCT/US99/23008 (attorney docket no. YO9998-392) filed on October 1, 1999 and entitled "Conversational Browser and Conversational Systems," which is incorporated herein by reference.
  • the dialog objects comprise applets or objects that may be loaded through CML (conversational markup language) pages (via a conversational browser), imperative objects on top of CVM (possibly distributed on top ofthe CVM), script tags in CML, and servlet components.
  • conversational gestures are as follows.
  • a conversational gesture message is used by a machine to convey informational messages to the user.
  • the gesture messages will typically be rendered as a displayed string or spoken prompt. Portions of the message to be spoken can be a function of the current state of the various applications/dialogs running on top of the CVM.
  • a conversational gesture "select from set” is used to encapsulate dialogues where the user is expected to pick from a set of discrete choices. It encapsulates the prompt, the default selection, as well as the set of legal choices.
  • The conversational gesture "select from range" encapsulates dialogs where the user is allowed to pick a value from a continuous range of values.
  • the gesture encapsulates the valid range, the current selection, and an informational prompt.
  • The conversational gesture "input" is used to obtain user input when the input constraints are more complex (or perhaps non-existent).
  • the gesture encapsulates the user prompt, application-level semantics about the item of information being requested and possibly a predicate to test the validity of the input.
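The four gestures just enumerated map naturally onto small data structures. A possible, purely illustrative encoding of the fields each gesture is described as encapsulating:

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class Message:
    # Informational message, rendered as a displayed string or spoken prompt.
    text: str

@dataclass
class SelectFromSet:
    prompt: str
    choices: List[str]               # the set of legal choices
    default: Optional[str] = None    # the default selection

@dataclass
class SelectFromRange:
    prompt: str                      # an informational prompt
    low: float
    high: float                      # together: the valid range
    current: Optional[float] = None  # the current selection

@dataclass
class Input:
    prompt: str                      # the user prompt
    semantics: str                   # application-level semantics of the item
    validate: Callable[[str], bool] = lambda v: True  # validity predicate

g = SelectFromSet(prompt="Choose a city", choices=["Boston", "Paris"],
                  default="Boston")
r = SelectFromRange(prompt="Set volume", low=0.0, high=10.0)
q = Input(prompt="Destination?", semantics="trip.destination")
print(g.default, r.high, q.validate("anything"))
```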
  • the conversational foundation classes include, yet surpass, the concept of conversational gestures (i.e., they extend to the level of fundamental behavior and services as well as rules to perform conversational tasks).
  • a programming model allows the connection between a master dialog manager and engines through conversational APIs. It is to be understood that such a master dialog manager may be implemented as part of the dialog manager 18 of FIG. 1, while the engines would include the one or more recognition engines of FIG. 1.
  • Data files of the foundation classes, as well as data needed by any recognition engine are present on CVM (loadable for embedded platforms or client platforms).
  • Data files of objects can be expanded and loaded.
  • the development environment offered by the CVM is referred to herein as SPOKEN AGETM.
  • Spoken Age allows a developer to build, simulate and debug conversationally aware applications for CVM. Besides offering direct implementation of the API calls, it also offers tools to build advanced conversational interfaces with multiple personalities, voice fonts which allow the user to select the type of voice providing the output, and conversational formatting languages which build conversational presentations like PostScript and AFL (audio formatting languages).
  • the conversational application API layer 803 encompasses conversational programming languages and scripts to provide universal conversational input and output, conversational logic and conversational meta-information exchange protocols. The conversational programming languages/scripts allow the use of any available resources as input or output streams.
  • the conversational engines 808 may be implemented as the one or more recognition engines 16 of FIG. 1.
  • each input is converted into a binary or ASCII input, which can be directly processed by the programming language as built-in objects. Calls, flags and tags can be automatically included to transmit, between objects and processes, the conversational meta-information required to correctly interface with the different objects. Moreover, output streams can be specially formatted according to the needs of the application or user.
  • logic statement status and operators are expanded to handle the richness of conversational queries that can be compared on the basis of their ASCII/binary content or on the basis of their NLU-converted (natural language understanding-converted) query (input/output of conventional and conversational sub-systems) or FSG-based queries (where the system uses restricted commands).
  • Logic operators can be implemented to test or modify such systems.
  • Conversational logic values/operators expand to include: true, false, incomplete, ambiguous, different/equivalent from an ASCII point of view, different/equivalent from an NLU point of view, different/equivalent from an active query field point of view, unknown, incompatible, and incomparable.
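By way of a hedged illustration only, the expanded conversational logic values and a content-level versus NLU-level comparison might be sketched as follows; all names, the `toy_nlu` parser, and the comparison policy are illustrative assumptions, not the patent's specification:

```python
from enum import Enum

class ConvTruth(Enum):
    """Expanded conversational logic values (names are illustrative)."""
    TRUE = "true"
    FALSE = "false"
    INCOMPLETE = "incomplete"
    AMBIGUOUS = "ambiguous"
    UNKNOWN = "unknown"
    INCOMPATIBLE = "incompatible"
    INCOMPARABLE = "incomparable"

def compare_queries(a, b, nlu=None):
    """Compare two queries on an ASCII basis, falling back to an
    NLU-level comparison when a parser is supplied."""
    if a is None or b is None:
        return ConvTruth.UNKNOWN
    if a == b:                       # equivalent from an ASCII point of view
        return ConvTruth.TRUE
    if nlu is not None:              # equivalent from an NLU point of view?
        pa, pb = nlu(a), nlu(b)
        if pa is None or pb is None:
            return ConvTruth.INCOMPARABLE
        return ConvTruth.TRUE if pa == pb else ConvTruth.FALSE
    return ConvTruth.FALSE

# hypothetical NLU that reduces a query to a bag of content words
toy_nlu = lambda s: tuple(sorted(s.lower().replace("please ", "").split()))
```

Here "open file" and "please open file" differ at the ASCII level but are equivalent at the NLU level under the toy parser.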
  • the conversational application API layer 803 comprises code for providing extensions of the underlying OS features and behavior. Such extensions include, for example, high level of abstraction and abstract categories associated with any object, self-registration mechanisms of abstract categories, memorization, summarization, conversational search, selection, redirection, user customization, trainability, help, multiuser and security capabilities, as well as the foundation class libraries.
  • the conversational computing system of FIG. 8B further comprises a conversational engine API layer 807 which provides an interface between the core conversational engines 808 (e.g., speech recognition, speaker recognition, NL parsing, NLU, TTS and speech compression/decompression engines, visual recognition) and the applications using them.
  • the engine API layer 807 also provides the protocols to communicate with core engines whether they be local or remote.
  • An I/O API layer 810 provides an interface with conventional I/O resources 811 such as a keyboard, mouse, touch screen, keypad, etc. (for providing a multi-modal conversational UI), an audio subsystem for capturing speech I/O (audio in audio out), and a video subsystem for capturing video I/O.
  • the I/O API layer 810 provides device abstractions, I/O abstractions and UI abstractions.
  • the I/O resources 811 will register with the CVM kernel layer 802 via the I/O API layer 810. It is to be understood that the I/O APIs 810 may be implemented as part of the I/O manager 14 of FIG. 1, while the I/O resources 811 may be implemented as part of the I/O subsystem 12 of FIG. 1.
  • the core CVM kernel layer 802 comprises programming layers such as a conversational application and behavior/service manager layer 815, a conversational dialog manager (arbitrator) layer 819, a conversational resource manager layer 820, a task/dispatcher manager 821 and a meta-information manager 822, which provide the core functions of the CVM layer 802. It is to be understood that these components may be implemented as part of the dialog manager 18 of FIG. 1.
  • the conversational application and behavior/service manager layer 815 comprises functions for managing the conventional and conversationally aware applications 800 and 801. Such management functions include, for example, keeping track of which applications are registered (both local and network-distributed), what are the dialog interfaces (if any) of the applications, and what is the state of each application.
  • the conversational application and services/behavior manager 815 initiates all the tasks associated with any specific service or behavior provided by the CVM system.
  • the conversational services and behaviors are all the behaviors and features of a conversational UI that the user may expect to find in the applications and interactions, as well as the features that an application developer may expect to be able to access via APIs (without having to implement them in the development of the application).
  • Examples of the conversational services and behavior provided by the CVM kernel 802 include, but are not limited to, conversational categorization and meta-information, conversational object, resource and file management, conversational search, conversational selection, conversational customization, conversational security, conversational help, conversational prioritization, conversational resource management, output formatting and presentation, summarization, conversational delayed actions/agents/memorization, conversational logic, and coordinated interfaces and devices.
  • Such services are provided through API calls via the conversational application API Layer 803.
  • the conversational application and behavior/services manager 815 is responsible for executing all the different functions needed to adapt the UI to the capabilities and constraints of the device, application and/or user preferences.
  • the conversational dialog manager 819 comprises functions for managing the dialog (conversational dialog comprising speech and other multi-modal I/O such as GUI keyboard, pointer, mouse, as well as video input, etc.) and arbitration (dialog manager arbitrator or DMA) across all registered applications.
  • the conversational dialog manager 819 determines what information the user has, which inputs the user presents, and which application(s) should handle the user inputs.
  • the DMA processes abstracted I/O events (abstracted by the I/O manager) using the context/history to understand the user intent.
  • the DMA determines the target of the event and, if needed, seeks confirmation, disambiguation, correction, more details, etc., until the intent is unambiguous and fully determined.
  • the DMA then launches the action associated to the user's query.
  • the DMA function handles multi-modal I/O events to: (1) determine the target application or dialog (or portion of it); and (2) use past history and context to: (a) understand the intent of the user; (b) follow up with a dialog to disambiguate, complete, correct or confirm the understanding; (c) or, dispatch a task resulting from full understanding of the intent of the user.
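The two-step DMA behavior described above might be sketched as follows; the slot-based event representation, field names and return codes are illustrative assumptions rather than the patent's data model:

```python
class DialogManagerArbitrator:
    """Sketch of a DMA: route an abstracted I/O event to a target
    application, then use context to complete or disambiguate the intent."""

    def __init__(self):
        self.context_history = []          # past fully resolved intents

    def handle_event(self, event, applications):
        # (1) determine which registered application(s) claim the event
        targets = [a for a in applications if a["accepts"](event)]
        if not targets:
            return ("seek_clarification", None)
        if len(targets) > 1:
            return ("disambiguate", [a["name"] for a in targets])
        target = targets[0]
        # (2) use past history and context to complete the intent
        intent = dict(event)
        for past in reversed(self.context_history):
            for slot, value in past.items():
                if intent.get(slot) is None:
                    intent[slot] = value   # fill missing slots from context
        if any(v is None for v in intent.values()):
            return ("follow_up_dialog", target["name"])  # (b) complete/confirm
        self.context_history.append(intent)
        return ("dispatch", target["name"])              # (c) launch the task
```

In use, an underspecified "send" event first triggers a follow-up dialog; once a complete intent has entered the history, a later underspecified event can be completed from context and dispatched directly.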
  • the conversational resource manager 820 determines which conversational engines and resources are to be engaged for a given task.
  • the conversational resource manager 820 prioritizes the allocation of CPU cycles or input/output priorities to maintain a flowing dialog with the active application (e.g., the engines engaged for recognizing or processing a current input or output have priorities). Similarly, for distributed applications, it routes and selects the engine and network path to be used to minimize any network delay for the active foreground process.
  • the task dispatcher/manager 821 dispatches and coordinates different tasks and processes that are spawned (by the user and machine) on local and networked conventional and conversational resources.
  • the meta-information manager 822 manages the meta-information associated with the system via a meta-information repository 818.
  • the meta-information manager 822 and repository 818 collect all the information typically assumed known in a conversational interaction but not available at the level of the current conversation. Examples are a-priori knowledge, cultural and educational assumptions, persistent information, past requests, references, and information about the user and applications, all managed by the meta-information manager 822 and stored in the meta-information repository 818.
  • meta-information repository 818 includes a user-usage log based on user identity.
  • services such as conversational help and assistance, as well as some dialog prompts (introduction, questions, feedback, etc.) provided by the CVM system can be tailored based on the usage history of the user as stored in the meta-information repository 818 and associated with the application. If a user has been previously interacting with a given application, an explanation can be reduced on the assumption that the application is familiar to the user. Similarly, if a user commits many errors, the explanations can be more complex, as multiple errors are interpreted as user uncertainty, unfamiliarity, or incomprehension or misunderstanding of the application or function.
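Tailoring explanations by usage history, as described above, might be sketched as follows; the log fields, level names and error threshold are illustrative assumptions:

```python
def help_level(usage_log, app, error_threshold=3):
    """Pick an explanation style from the user's per-application usage log.
    `usage_log` maps app name -> {"sessions": int, "errors": int}."""
    record = usage_log.get(app, {"sessions": 0, "errors": 0})
    if record["errors"] >= error_threshold:
        return "detailed"     # many errors read as unfamiliarity
    if record["sessions"] > 0:
        return "brief"        # returning user: reduce the explanation
    return "standard"         # first contact with this application
```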
  • a context stack 817 is managed by the dialog manager 819, possibly through a context manager that interacts with the dialog manager and arbitrator. It is to be understood that the context stack 817 may be implemented as part of the context stack 20 of FIG. 1.
  • the context stack 817 comprises all the information associated with an application. Such information includes all the variables, states, inputs, outputs and queries to the backend that are performed in the context of the dialog and any extraneous event that occurs during the dialog.
  • the context stack is associated with the organized/sorted context corresponding to each active dialog (or deferred dialog-agents/memorization).
  • a global history 816 is included in the CVM system and includes information that is stored beyond the context of each application. The global history stores, for example, the information that is associated with all the applications and actions taken during a conversational session (i.e., the history of the dialog between user and machine for a current session or from when the machine was activated).
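A minimal sketch of a per-dialog context stack alongside a session-wide global history follows; the data layout is an assumption, as the patent does not fix one at this level:

```python
class ContextStack:
    """Per-dialog context frames plus a session-wide global history."""

    def __init__(self):
        self._stack = []                  # one frame per active dialog
        self.global_history = []          # persists across applications

    def push(self, app, **state):
        # open a context frame for a newly activated application/dialog
        self._stack.append({"app": app, "state": state})
        self.global_history.append(("activate", app))

    def record(self, key, value):
        # attach a variable/state/query result to the active dialog,
        # and log it in the session-wide history as well
        self._stack[-1]["state"][key] = value
        self.global_history.append(("event", key, value))

    def active(self):
        return self._stack[-1] if self._stack else None
```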
  • the CVM kernel layer 802 further comprises a backend abstraction layer 823 which allows access to backend business logic 813 via the dialog manager 819 (rather than bypassing the dialog manager 819). This allows such accesses to be added to the context stack 817 and global history 816.
  • the backend abstraction layer 823 can translate input and output to and from the dialog manager 819 to database queries.
  • This layer 823 will convert standardized attribute value n-tuples into database queries and translate the result of such queries into tables or sets of attribute value n-tuples back to the dialog manager 819.
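The conversion of standardized attribute-value n-tuples into database queries might be sketched as follows; the table/column names and the use of SQL are illustrative assumptions:

```python
def tuples_to_query(table, pairs):
    """Translate attribute-value pairs into a parameterized SQL query,
    skipping attributes whose value is still unknown (None)."""
    known = [(k, v) for k, v in pairs if v is not None]
    where = " AND ".join(f"{k} = ?" for k, _ in known)
    sql = f"SELECT * FROM {table}" + (f" WHERE {where}" if where else "")
    return sql, [v for _, v in known]
```

For example, the tuples ("origin", "JFK"), ("date", None), ("dest", "SFO") would yield a query constrained only on the two attributes the dialog has filled so far.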
  • a conversational transcoding layer 824 is provided to adapt the behavior, UI and dialog presented to the user based on the I/O and engine capabilities of the device which executes the CVM system.
  • the CVM system further comprises a communication stack 814 (or communication engines) as part of the underlying system services provided by the OS 812.
  • the CVM system utilizes the communication stack to transmit information via conversational protocols 804 which extend the conventional communication services to provide conversational communication.
  • the communication stack 814 may be implemented in connection with the well-known OSI (open system interconnection) protocol layers for providing conversational communication exchange between conversational devices.
  • OSI open system interconnection
  • OSI comprises seven layers with each layer performing a respective function to provide communication between network distributed conversational applications of network-connected devices.
  • Such layers (whose functions are well-understood) comprise an application layer, a presentation layer, a session layer, a transport layer, a network layer, a data link layer and a physical layer.
  • the application layer is extended to allow conversational communication via the conversational protocols 804.
  • the conversational protocols 804 allow, in general, remote applications and resources to register their conversational capabilities and proxies. These conversational protocols 804 are further disclosed in the PCT patent application identified as PCT/US99/22925 (attorney docket no. Y0999-113) filed on October 1, 1999 and entitled
  • Referring now to FIGs. 9A and 9B, block diagrams illustrate preferred embodiments of respective conversational data mining systems. It is to be appreciated that such conversational data mining systems are disclosed in the above-referenced U.S. patent application identified as Serial No. 09/371,400 (attorney docket no. Y0999-227) filed on August 10, 1999 and entitled “Conversational Data Mining,” incorporated by reference herein. A description of such systems, one of which may be employed to implement the mood/focus classifier module 22 of FIG. 1, is provided below in this section. However, it is to be appreciated that other mechanisms for implementing mood classification and focus detection according to the invention may be employed.
  • While focus detection may be performed in accordance with the dialog manager 18 (FIG. 1) along with ambiguity resolution, it is preferably performed in accordance with the mood/focus classifier 22 (FIG. 1), an implementation of which will be described below. It is to be appreciated that focus can be determined by classification and data mining in exactly the same way as mood is determined or the user is classified (as will be explained below), i.e., the attitude and moves/gestures of the user are used to determine stochastically the most likely focus item and focus state.
  • FIGs. 9A and 9B will be used to generally describe mood/focus classification techniques that may be employed in the mood/focus classifier 22 (FIG. 1) with respect to speech-based event data.
  • FIG. 9A depicts an apparatus for collecting data associated with the voice of a user, in accordance with the present invention.
  • the apparatus is designated generally as 900.
  • the apparatus includes a dialog management unit 902 which conducts a conversation with the user. It is to be understood that the user-provided input data events are preferably provided to the system 900 via the I/O manager 14 of FIG. 1.
  • Apparatus 900 further includes an audio capture module 906 which is coupled to the dialog management unit 902 and which captures a speech waveform associated with utterances spoken by the user 904 during the conversation. While shown for ease of explanation in FIG. 9A, the audio capture unit 906 may be part of the I/O subsystem 12 of FIG. 1, in which case the captured input data is passed on to system 900 via the I/O manager 14.
  • a conversation should be broadly understood to include any interaction, between a first human and either a second human, a machine, or a combination thereof, which includes at least some speech.
  • the mood classification (focus detection) system 900 may be extended to process video in a similar manner.
  • Apparatus 900 further includes an acoustic front end 908 which is coupled to the audio capture module 906 and which is configured to receive and digitize the speech waveform so as to provide a digitized speech waveform. Further, acoustic front end 908 is also configured to extract, from the digitized speech waveform, at least one acoustic feature which is correlated with at least one user attribute.
  • the at least one user attribute can include at least one of the following: gender of the user, age of the user, accent of the user, native language of the user, dialect of the user, socioeconomic classification of the user, educational level of the user, and emotional state of the user.
  • the dialog management unit 902 may employ acoustic features, such as MEL cepstra, obtained from acoustic front end 908 and may therefore, if desired, have a direct coupling thereto.
  • Apparatus 900 further includes a processing module 910 which is coupled to the acoustic front end 908 and which analyzes the at least one acoustic feature to determine the at least one user attribute.
  • apparatus 900 includes a data warehouse 912 which is coupled to the processing module 910 and which stores the at least one user attribute, together with at least one identifying indicia, in a form for subsequent data mining thereon. Identifying indicia will be discussed elsewhere herein.
  • the gender of the user can be determined by classifying the pitch of the user's voice, or by simply clustering the features. In the latter method, voice prints associated with a large set of speakers of a given gender are built and a speaker classification is then performed with the two sets of models.
  • Age of the user can also be determined via classification of age groups, in a manner similar to gender. Although having limited reliability, broad classes of ages, such as children, teenagers, adults and senior citizens can be separated in this fashion.
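The model-based classification described above (voice prints built per class, then classification against the sets of models) might be sketched with single-Gaussian stand-ins for full voice prints; the feature values and model parameters below are illustrative only:

```python
import math

def gaussian_ll(x, mean, std):
    # log-likelihood of x under a 1-D Gaussian, up to a constant
    return -0.5 * ((x - mean) / std) ** 2 - math.log(std)

def classify_speaker(features, class_models):
    """Score a feature vector against per-class voice models and return
    the best-scoring class (e.g., gender or broad age group)."""
    def score(model):
        return sum(gaussian_ll(x, m, s) for x, (m, s) in zip(features, model))
    return max(class_models, key=lambda c: score(class_models[c]))
```

With one pitch-like feature and two hypothetical class models, a measurement near either model's mean is assigned to that class.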
  • ICSLP'98 sets forth useful techniques.
  • Native language of the user can be determined in a manner essentially equivalent to accent classification.
  • Meta information about the native language of the speaker can be added to define each accent/native language model.
  • the socioeconomic classification of the user can include such factors as the racial background of the user, ethnic background of the user, and economic class of the user, for example, blue collar, white collar-middle class or wealthy. Such determinations can be made via annotated accents and dialects at the moment of training, as well as by examining the choice of words of the user. While only moderately reliable, it is believed that these techniques will give sufficient insight into the background of the user so as to be useful for data mining.
  • the educational level of the user can be determined by the word choice and accent, in a manner similar to the socioeconomic classification; again, only partial reliability is expected, but sufficient for data mining purposes.
  • the audio capture module 906 can include, for example, at least one of an analog-to-digital converter board, an interactive voice response system, and a microphone.
  • the dialog management unit 902 can include a telephone interactive voice response system, for example, the same one used to implement the audio capturing.
  • Dialog management unit 902 can include natural language understanding (NLU), natural language generation (NLG), finite state grammar (FSG), and/or text-to-speech synthesis (TTS) for machine-prompting the user in lieu of, or in addition to, the human operator.
  • the processing module 910 can be implemented in the processor portion of the IVR, or can be implemented in a separate general purpose computer with appropriate software. Still further, the processing module can be implemented using an application specific circuit such as an application specific integrated circuit (ASIC) or can be implemented in an application specific circuit employing discrete components, or a combination of discrete and integrated components.
  • Processing module 910 can include an emotional state classifier 914. Classifier 914 can include an emotional state classification module 916 and an emotional state prototypes database 918.
  • Processing module 910 can further include a speaker clusterer and classifier 920.
  • Element 920 can further include a speaker clustering and classification module 922 and a speaker class data base 924.
  • Processing module 910 can further include a speech recognizor 926 which can, in turn, itself include a speech recognition module 928 and a speech prototype, language model and grammar database 930. Speech recognizor 926 can be part of the dialog management unit 902 or, for example, a separate element within the implementation of processing module 910. Yet further, processing module 910 can include an accent identifier 932, which in turn includes an accent identification module 934 and an accent database 936.
  • Processing module 910 can include any one of elements 914, 920, 926 and 932; all of those elements together; or any combination thereof.
  • Apparatus 900 can further include a post processor 938 which is coupled to the data warehouse 912 and which is configured to transcribe user utterances and to perform keyword spotting thereon.
  • the post processor can be a part of the processing module 910 or of any of the sub-components thereof.
  • it can be implemented as part of the speech recognizor 926.
  • Post processor 938 can be implemented as part of the processor of an IVR, as an application specific circuit, or on a general purpose computer with suitable software modules.
  • Post processor 938 can employ speech recognizor 926.
  • Post processor 938 can also include a semantic module (not shown) to interpret meaning of phrases.
  • the semantic module could be used by speech recognizor 926 to indicate that some decoding candidates in a list are meaningless and should be discarded/replaced with meaningful candidates.
  • the acoustic front end 908 can typically be an eight dimensions plus energy front end as known in the art. However, it should be understood that 13, 24, or any other number of dimensions could be used. MEL cepstra can be computed, for example, over 25 ms frames with a 10 ms overlap, along with the delta and delta delta parameters, that is, the first and second finite derivatives. Such acoustic features can be supplied to the speaker clusterer and classifier 920, speech recognizor 926 and accent identifier 932, as shown in FIG. 9A.
  • acoustic features can be extracted by the acoustic front end 908. These can be designated as emotional state features, such as running average pitch, running pitch variance, pitch jitter, running energy variance, speech rate, shimmer, fundamental frequency, and variation in fundamental frequency. Pitch jitter refers to the number of sign changes of the first derivative of pitch. Shimmer is energy jitter. These features can be supplied from the acoustic front end 908 to the emotional state classifier 914.
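The delta parameters (first finite derivative, applied twice for delta-delta) and the pitch-jitter definition given above can be sketched directly; this is a minimal sketch over a per-frame pitch track, whereas a real front end operates on framed, windowed audio:

```python
def deltas(frames):
    """First finite derivative of a per-frame feature track (the delta
    parameters); applying it to its own output gives delta-delta."""
    return [b - a for a, b in zip(frames, frames[1:])]

def pitch_jitter(pitch_track):
    """Number of sign changes of the first derivative of the pitch track,
    per the definition of pitch jitter given above."""
    d = deltas(pitch_track)
    return sum(1 for a, b in zip(d, d[1:]) if a * b < 0)
```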
  • the aforementioned acoustic features, including the MEL cepstra and the emotional state features can be thought of as the raw, that is, unprocessed features.
  • Speech features can be processed by a text-independent speaker classification system, for example, in speaker clusterer and classifier 920. This permits classification ofthe speakers based on acoustic similarities of their voices. Implementation and use of such a system is disclosed in U.S. patent application Serial No. 60/011,058, filed February 2, 1996; U.S. patent application Serial No. 08/787,031, filed January 28, 1997 (now U.S. Patent No. 5,895,447 issued April 20, 1999); U.S. patent application Serial No. 08/788,471, filed January 28, 1997; and U.S. patent application Serial No.
  • the classification of the speakers can be supervised or unsupervised. In the supervised case, the classes have been decided beforehand based on external information. Typically, such classification can separate between male and female, adult versus child, native speakers versus different classes of non-native speakers, and the like.
  • the indices of this classification process constitute processed features.
  • the results of this process can be supplied to the emotional state classifier 914 and can be used to normalize the emotional state features with respect to the average (mean) observed for a given class, during training, for a neutral emotional state.
  • the normalized emotional state features are used by the emotional state classifier 914 which then outputs an estimate of the emotional state. This output is also considered to be part of the processed features.
  • the emotional state features can be normalized by the emotional state classifier 914 with respect to each class produced by the speaker clusterer and classifier 920.
  • a feature can be normalized as follows. Let X0 be the normal frequency. Let Xi be the measured frequency. Then, the normalized feature will be given by Xi minus X0. This quantity can be positive or negative, and is not, in general, dimensionless.
  • the speech recognizor 926 can transcribe the queries from the user.
  • the output can be full sentences, but finer granularity can also be attained; for example, time alignment of the recognized words.
  • the time stamped transcriptions can also be considered as part of the processed features, and will be discussed further below with respect to methods in accordance with the present invention.
  • conversation from every stage of a transaction can be transcribed and stored.
  • appropriate data is transferred from the speaker clusterer and classifier 920 to the emotional state classifier 914.
  • a continuous speech recognizor can be trained on speech with several speakers having the different accents which are to be recognized.
  • Each of the training speakers is also associated with an accent vector, with each dimension representing the most likely mixture component associated with each state of each lefeme.
  • the speakers can be clustered based on the distance between these accent vectors, and the clusters can be identified by, for example, the accent of the member speakers.
  • the accent identification can be performed by extracting an accent vector from the user's speech and classifying it.
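Classifying an extracted accent vector against accent clusters might be sketched as a nearest-centroid decision; the vectors and labels below are illustrative, whereas the patent builds the vectors from lefeme-level mixture components:

```python
def nearest_accent(accent_vector, cluster_centroids):
    """Assign an accent vector to the nearest labeled cluster centroid
    (a stand-in for clustering speakers by accent-vector distance)."""
    def sq_dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))
    return min(cluster_centroids,
               key=lambda lbl: sq_dist(accent_vector, cluster_centroids[lbl]))
```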
  • dialect, socioeconomic classification, and the like can be estimated based on vocabulary and word series used by the user.
  • Appropriate key words, sentences, or grammatical mistakes to detect can be compiled via expert linguistic knowledge.
  • the accent, socioeconomic background, gender, age and the like are part of the processed features.
  • any of the processed features, indicated by the solid arrows, can be stored in the data warehouse 912.
  • raw features, indicated by the dotted lines can also be stored in the data warehouse 912.
  • Any of the processed or raw features can be stored in the data warehouse 912 and then associated with the other data which has been collected, upon completion of the transaction.
  • Classical data mining techniques can then be applied. Such techniques are known, for example, as set forth in the book “Data Warehousing, Data Mining and OLAP,” by Alex Berson and Stephen J. Smith, published by McGraw-Hill in 1997, and in “Discovering Data Mining,” by Cabena et al., published by Prentice Hall in 1998.
  • For objectives such as target marketing, predictive models or classifiers are automatically obtained by applying appropriate mining recipes. All data stored in the data warehouse 912 can be stored in a format to facilitate subsequent data mining thereon.
  • Business objectives can include, for example, detection of users who are vulnerable to a proposal to buy a given product or service, detection of users who have problems with the automated system and should be transferred to an operator, and detection of users who are angry at the service and should be transferred to a supervisory person.
  • the user can be a customer of a business which employs the apparatus 900, or can be a client of some other type of institution, such as a nonprofit institution, a government agency or the like.
  • FIG. 9B depicts a real-time-modifiable voice system for interaction with a user, in accordance with the present invention, which is designated generally as 1000. Elements in FIG. 9B which are similar to those in FIG. 9A have received the same reference numerals incremented by 100.
  • System 1000 can include a dialog management unit 1002 similar to that discussed above.
  • unit 1002 can be a human operator or supervisor, an IVR, or a Voice User Interface (VUI).
  • System 1000 can also include an audio capture module 1006 similar to that described above, and an acoustic front end 1008, also similar to that described above.
  • unit 1002 can be directly coupled to acoustic front end 1008, if desired, to permit use of MEL cepstra or other acoustic features determined by front end 1008.
  • Processing module 1010 can include a dynamic classification module 1040 which performs dynamic classification of the user. Accordingly, processing module 1010 is configured to modify behavior of the voice system 1000 based on at least one user attribute which has been determined based on at least one acoustic feature extracted from the user's speech.
  • System 1000 can further include a business logic unit 1042 which is coupled to the dialog management unit 1002, the dynamic classification module 1040, and optionally to the acoustic front end 1008.
  • the business logic unit can be implemented as a processing portion of the IVR or VUI, can be part of an appropriately programmed general purpose computer, or can be an application specific circuit.
  • It is preferred that the processing module 1010 (including module 1040) be implemented as a general purpose computer and that the business logic 1042 be implemented in a processor portion of an interactive voice response system.
  • Dynamic classification module 1040 can be configured to provide feedback which can be real-time feedback to the business logic unit 1042 and the dialog management unit 1002.
  • a data warehouse 1012 and post processor 1038 can be optionally provided as shown and can operate as discussed above with respect to the data collecting apparatus 900. It should be emphasized, however, that in the real-time-modifiable voice system 1000 of the present invention, data warehousing is optional and, if desired, the system can be limited to the real time feedback discussed with respect to elements 1040, 1042 and 1002.
  • Processing module 1010 can modify behavior of the system 1000, at least in part, by prompting a human operator thereof, as suggested by the feedback line connected with dialog management unit 1002. For example, a human operator could be alerted when an angry emotional state of the user is detected and could be prompted to utter soothing words to the user, or transfer the user to a higher level human supervisor. Further, the processing module 1010 could modify business logic 1042 of the system 1000. This could be done, for example, when both the processing module 1010 and business logic unit 1042 were part of an IVR system. Examples of modification of business logic will be discussed further below, but could include tailoring a marketing offer to the user based on attributes of the user detected by the system 1000.
  • Referring now to FIG. 9C, a block diagram illustrates how the mood/focus classification techniques described above may be implemented by the mood/focus classifier 22 of FIG. 1.
  • the classifier shown in FIG. 9C comprises a speech input channel 1050-1, a speech channel controller 1052-1, and a speech-based mood classification subsystem 1054-1.
  • the classifier also comprises a video input channel 1050-N, a video channel controller 1052-N, and a video-based mood classification subsystem 1054-N.
  • other input channels and corresponding classification subsystems may be included to extend the classifier to other modalities.
  • the individual classification subsystems each take raw features from their respective input channel and employ recognition and classification engines to process the features and then, in conjunction with data warehouse 1058, make a dynamic classification determination.
  • Video features may be treated similar to speech features.
  • joint dynamic classification may be performed in block 1056 using the data from each input modality to make an overall classification determination.
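The joint dynamic classification of block 1056 might be sketched as a late fusion of per-modality mood scores; the uniform-weight additive fusion rule is an illustrative assumption, as the patent does not fix one here:

```python
def joint_classify(channel_scores, weights=None):
    """Combine per-modality mood scores (e.g., from the speech and video
    subsystems) into one overall classification determination."""
    weights = weights or {ch: 1.0 for ch in channel_scores}
    combined = {}
    for ch, scores in channel_scores.items():
        for mood, s in scores.items():
            combined[mood] = combined.get(mood, 0.0) + weights[ch] * s
    return max(combined, key=combined.get)
```

With equal weights, a strong "angry" score from speech can outweigh a mild "neutral" reading from video; weighting the video channel more heavily can reverse the decision.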
  • Business logic unit 1060 and multi-modal shell 1062 are used to control the process in accordance with the particular application(s) being run by the mood/focus classifier.
  • Channel controllers 1052-1 and 1052-N are used to control the input of speech data and video data, respectively.
  • a mood classification system as described above can instruct the I/O subsystem 12 of FIG. 1, via the I/O manager 14, to adjust devices in the environment that would have the effect of changing the user's mood and/or focus, e.g., temperature control system, music system, etc.
  • Referring now to FIG. 10, a block diagram of an illustrative hardware implementation of a multi-modal conversational computing system according to the invention is shown.
  • a processor 1092 for controlling and performing the various operations associated with the illustrative systems of the invention depicted in FIGs. 1 through 9C is coupled to a memory 1094 and a user interface 1096.
  • processor as used herein is intended to include any processing device, such as, for example, one that includes a CPU (central processing unit) and/or other processing circuitry.
  • the processor may be a digital signal processor, as is known in the art.
  • processor may refer to more than one individual processor.
  • memory as used herein is intended to include memory associated with a processor or CPU, such as, for example, RAM, ROM, a fixed memory device (e.g., hard drive), a removable memory device (e.g., diskette), a flash memory, etc.
  • user interface as used herein is intended to include, for example, one or more input devices, e.g., keyboard, for inputting data to the processing unit, and/or one or more output devices, e.g., CRT display and/or printer, for providing results associated with the processing unit.
  • the user interface 1096 is also intended to include the one or more microphones for receiving user speech and the one or more cameras/sensors for capturing image data, as well as any other I/O interface devices used in the multi-modal system.
  • computer software including instructions or code for performing the methodologies of the invention, as described herein, may be stored in one or more of the associated memory devices (e.g., ROM, fixed or removable memory) and, when ready to be utilized, loaded in part or in whole (e.g., into RAM) and executed by a CPU.
  • the elements illustrated in FIGs. 1 through 9C may be implemented in various forms of hardware, software, or combinations thereof, e.g., one or more digital signal processors with associated memory, application specific integrated circuit(s), functional circuitry, one or more appropriately programmed general purpose digital computers with associated memory, etc. Given the teachings of the invention provided herein, one of ordinary skill in the related art will be able to contemplate other implementations of the elements of the invention.
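The joint dynamic classification performed in block 1056 is not specified in detail above. As a purely hypothetical sketch (the function names, mood labels and weights are illustrative assumptions, not taken from the description), the fusion of the per-modality subsystem outputs might be performed as a confidence-weighted combination of mood posteriors:

```python
# Hypothetical fusion rule for block 1056: each modality-specific
# subsystem (speech, video) emits a posterior distribution over mood
# labels, and the joint classifier combines them with fixed weights.

MOODS = ["neutral", "tired", "confused", "upset"]  # illustrative labels

def joint_classification(speech_scores, video_scores,
                         speech_weight=0.6, video_weight=0.4):
    """Fuse per-modality mood posteriors into one overall distribution."""
    fused = {}
    for mood in MOODS:
        fused[mood] = (speech_weight * speech_scores.get(mood, 0.0)
                       + video_weight * video_scores.get(mood, 0.0))
    total = sum(fused.values()) or 1.0   # renormalize to a distribution
    return {mood: score / total for mood, score in fused.items()}
```

In practice the weights could themselves be adapted dynamically, e.g., lowering the video weight under poor lighting conditions.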

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Multimedia (AREA)
  • General Health & Medical Sciences (AREA)
  • Geometry (AREA)
  • Acoustics & Sound (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • User Interface Of Digital Computer (AREA)
  • Image Processing (AREA)
  • Devices For Executing Special Programs (AREA)
  • Image Analysis (AREA)

Abstract

A method is provided for performing focus detection, ambiguity resolution and mood classification (815) in accordance with multi-modal input data, in varying operating conditions, in order to provide an effective conversational computing environment (418, 422) for one or more users (812).

Description

SYSTEM AND METHOD FOR MULTI-MODAL FOCUS DETECTION, REFERENTIAL AMBIGUITY RESOLUTION AND MOOD CLASSIFICATION
USING MULTI-MODAL INPUT
Field of the Invention
The present invention relates to multi-modal data processing techniques and, more particularly, to systems and methods for performing focus detection, referential ambiguity resolution and mood classification in accordance with multi-modal input data.
Background of the Invention
The use of more than one input mode to obtain data that may be used to perform various computing tasks is becoming increasingly more prevalent in today's computer-based processing systems. Systems that employ such "multi-modal" input techniques have inherent advantages over systems that use only one data input mode.
For example, there are systems that include a video input source and more traditional computer data input sources, such as the manual operation of a mouse device and/or keyboard in coordination with a multi-window graphical user interface (GUI).
Examples of such systems are disclosed in U.S. Patent No. 5,912,721 to Yamaguchi et al. issued on June 15, 1999. In accordance with teachings in the Yamaguchi et al. system, apparatus may be provided for allowing a user to designate a position on the display screen by detecting the user's gaze point, which is designated by his line of sight with respect to the screen, without the user having to manually operate one of the conventional input devices.
Other systems that rely on eye tracking may include other input sources besides video to obtain data for subsequent processing. For example, U.S. Patent No. 5,517,021 to Kaufman et al. issued May 14, 1996 discloses the use of an electro-oculographic (EOG) device to detect signals generated by eye movement and other eye gestures. Such
EOG signals serve as input for use in controlling certain task-performing functions. Still other multi-modal systems are capable of accepting user commands by use of voice and gesture inputs. U.S. Patent No. 5,600,765 to Ando et al. issued February 4, 1997 discloses such a system wherein, while pointing to either a display object or a display position on a display screen of a graphics display system through a pointing input device, a user commands the graphics display system to cause an event on a graphics display.
Another multi-modal computing concept employing voice and gesture input is known as "natural computing." In accordance with natural computing techniques, gestures are provided to the system directly as part of commands. Alternatively, a user may give spoken commands.
However, while such multi-modal systems would appear to have inherent advantages over systems that use only one data input mode, the existing multi-modal techniques fall significantly short of providing an effective conversational environment between the user and the computing system with which the user wishes to interact. That is, the conventional multi-modal systems fail to provide effective conversational computing environments. For instance, the use of user gestures or eye gaze in conventional systems, such as illustrated above, is merely a substitute for the use of a traditional GUI pointing device. In the case of natural computing techniques, the system independently recognizes voice-based commands and independently recognizes gesture-based commands. Thus, there is no attempt in the conventional systems to use one or more input modes to disambiguate or understand data input by one or more other input modes. Further, there is no attempt in the conventional systems to utilize multi-modal input to perform user mood or attention classification. Still further, in the conventional systems that utilize video as a data input modality, the video input mechanisms are confined to the visible wavelength spectrum. Thus, the usefulness of such systems is restricted to environments where light is abundantly available. Unfortunately, depending on the operating conditions, an abundance of light may not be possible or the level of light may be frequently changing (e.g., as in a moving car). Accordingly, it would be highly advantageous to provide systems and methods for performing focus detection, referential ambiguity resolution and mood classification in accordance with multi-modal input data, in varying operating conditions, in order to provide an effective conversational computing environment for one or more users.
Summary of the Invention
The present invention provides techniques for performing focus detection, referential ambiguity resolution and mood classification in accordance with multi-modal input data, in varying operating conditions, in order to provide an effective conversational computing environment for one or more users.
In one aspect of the invention, a multi-modal conversational computing system comprises a user interface subsystem configured to input multi-modal data from an environment in which the user interface subsystem is deployed. The multi-modal data includes at least audio-based data and image-based data. The environment includes one or more users and one or more devices which are controllable by the multi-modal system of the invention. The system also comprises at least one processor, operatively coupled to the user interface subsystem, and configured to receive at least a portion of the multi-modal input data from the user interface subsystem. The processor is further configured to then make a determination of at least one of an intent, a focus and a mood of at least one of the one or more users based on at least a portion of the received multi-modal input data. The processor is still further configured to then cause execution of one or more actions to occur in the environment based on at least one of the determined intent, the determined focus and the determined mood. The system further comprises a memory, operatively coupled to the at least one processor, which stores at least a portion of results associated with the intent, focus and mood determinations made by the processor for possible use in a subsequent determination or action.
Advantageously, such a multi-modal conversational computing system provides the capability to: (i) determine an object, application or appliance addressed by the user; (ii) determine the focus of the user and therefore determine whether the user is actively focused on an appropriate application and, on that basis, determine whether an action should be taken; (iii) understand queries based on who said or did what, what was the focus of the user when he gave a multi-modal query/command, and what is the history of these commands and focuses; and (iv) estimate the mood of the user and initiate and/or adapt some behavior/service/appliance accordingly. The computing system may also change the associated business logic of an application with which the user interacts.
It is to be understood that multi-modality, in accordance with the present invention, may comprise a combination of modalities other than voice and video. For example, multi-modality may include keyboard/pointer/mouse (or telephone keypad) input and other sensors, etc. Thus, the general principle of the present invention, namely the combination of modalities through at least two different sensors (and actuators for outputs) to disambiguate the input and to estimate the mood or focus, can be generalized to any such combination. The engines or classifiers for determining the mood or focus will then be specific to the sensors, but the methodology of using them is the same as disclosed herein.
This should be understood throughout the descriptions herein, even if illustrative embodiments focus on sensors that produce a stream of audio and video data.
These and other objects, features and advantages of the present invention will become apparent from the following detailed description of illustrative embodiments thereof, which is to be read in connection with the accompanying drawings.
Brief Description of the Drawings
FIG. 1 is a block diagram illustrating a multi-modal conversational computing system according to an embodiment of the present invention;
FIG. 2 is a flow diagram illustrating a referential ambiguity resolution methodology performed by a multi-modal conversational computing system according to an embodiment of the present invention; FIG. 3 is a flow diagram illustrating a mood/focus classification methodology performed by a multi-modal conversational computing system according to an embodiment of the present invention;
FIG. 4 is a block diagram illustrating an audio-visual speech recognition module for use according to an embodiment of the present invention;
FIG. 5A is a diagram illustrating exemplary frontal face poses and non-frontal face poses for use according to an embodiment of the present invention;
FIG. 5B is a flow diagram of a face/feature and frontal pose detection methodology for use according to an embodiment of the present invention; FIG. 5C is a flow diagram of an event detection methodology for use according to an embodiment of the present invention;
FIG. 5D is a flow diagram of an event detection methodology employing utterance verification for use according to an embodiment of the present invention;
FIG. 6 is a block diagram illustrating an audio-visual speaker recognition module for use according to an embodiment of the present invention;
FIG. 7 is a flow diagram of an utterance verification methodology for use according to an embodiment of the present invention;
FIGs. 8A and 8B are block diagrams illustrating a conversational computing system for use according to an embodiment of the present invention; FIGs. 9A through 9C are block diagrams illustrating respective mood classification systems for use according to an embodiment of the present invention; and
FIG. 10 is a block diagram of an illustrative hardware implementation of a multi-modal conversational computing system according to the invention.
Detailed Description of Preferred Embodiments
Referring initially to FIG. 1, a block diagram illustrates a multi-modal conversational computing system according to an embodiment of the present invention. As shown, the multi-modal conversational computing system 10 comprises an input/output (I/O) subsystem 12, an I/O manager module 14, one or more recognition engines 16, a dialog manager module 18, a context stack 20 and a mood/focus classifier 22.
Generally, the multi-modal conversational computing system 10 of the present invention receives multi-modal input in the form of audio input data, video input data, as well as other types of input data (in accordance with the I/O subsystem 12), processes the multi-modal data (in accordance with the I/O manager 14), and performs various recognition tasks (e.g., speech recognition, speaker recognition, gesture recognition, lip reading, face recognition, etc., in accordance with the recognition engines 16), if necessary, using this processed data. The results of the recognition tasks and/or the processed data itself are then used to perform one or more conversational computing tasks, e.g., focus detection, referential ambiguity resolution, and mood classification (in accordance with the dialog manager 18, the context stack 20 and/or the classifier 22), as will be explained in detail below. While the multi-modal conversational computing system of the present invention is not limited to a particular application, initially describing a few exemplary applications will assist in contextually understanding the various features that the system offers and functions that it is capable of performing.
Thus, by way of a first illustrative application, the multi-modal conversational computing system 10 may be employed within a vehicle. In such an example, the system may be used to detect a distracted or sleepy operator based on detection of abnormally long eye closure or gazing in another direction (by video input) and/or speech that indicates distraction or sleepiness (by audio input), and to then alert the operator of this potentially dangerous state. This is referred to as focus detection. By extracting and then tracking eye conditions (e.g., opened or closed) and/or face direction, the system can make a determination as to what the operator is focusing on. As will be seen, the system 10 may be configured to receive and process, not only visible image data, but also (or alternatively) non-visible image data such as infrared (IR) visual data. Also (or, again, alternatively), radio frequency (RF) data may be received and processed. So, in the case where the multi-modal conversational computing system is deployed in an operating environment where light is not abundant (i.e., poor lighting conditions), e.g., a vehicle driven at night, the system can still acquire multi-modal input, process data and then, if necessary, output an appropriate response. The system could also therefore operate in the absence of light.
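As an illustrative sketch of the focus detection described above (the function name, frame rate and threshold are assumptions for illustration, not part of the description), abnormally long eye closure can be flagged by tracking the longest run of eyes-closed frames in the extracted eye-condition stream:

```python
def detect_distraction(eye_closed_frames, frame_rate_hz=30,
                       closed_threshold_s=1.5):
    """Return True if the longest run of eyes-closed frames lasts at
    least the threshold duration (abnormally long eye closure).
    eye_closed_frames: per-frame booleans, True = eyes closed."""
    longest_run = run = 0
    for closed in eye_closed_frames:
        run = run + 1 if closed else 0      # extend or reset the run
        longest_run = max(longest_run, run)
    return longest_run / frame_rate_hz >= closed_threshold_s
```

Ordinary blinking produces only short runs of closed-eye frames, so it falls well below the threshold, while drowsiness produces sustained closure that crosses it.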
The vehicle application lends itself also to an understanding of the concept of referential ambiguity resolution. Consider that there are multiple users in the vehicle and that the multi-modal conversational computing system 10 is coupled to several devices (e.g., telephone, radio, television, lights) which may be controlled by user input commands received and processed by the system. In such a situation, not only is there multi-modal input, but there may be multi-modal input from multiple occupants of the vehicle.
Thus, the system 10 must be able to perform user reference resolution, e.g., the system may receive the spoken utterance, "call my office," but unless the system can resolve which occupant made this statement, it will not know which office phone number to direct an associated cellular telephone to call. The system 10 therefore performs referential ambiguity resolution with respect to multiple users by taking both audio input data and image input data and processing them to make a user resolution determination. This may include detecting speech activity and/or the identity of the user based on both audio and image cues. Techniques for accomplishing this will be explained below.
Similarly, a user may say to the system, "turn that off," but without device reference resolution, the system would not know which associated device to direct to be turned off. The system 10 therefore performs referential ambiguity resolution with respect to multiple devices by taking both audio input data and image input data and processing them to make a device resolution determination. This may include detecting the speaker's head pose using gross spatial resolution of the direction being addressed, or body pose (e.g., pointing). This may also include disambiguating an I/O (input/output) event generated previously and stored in a context manager/history stack (e.g., if a beeper rang and the user asked "turn it off," the term "it" can be disambiguated). Techniques for accomplishing this will be explained below.
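A minimal sketch of such gross spatial resolution, under the assumption that the system can estimate a head or pointing direction as a bearing angle and knows the bearing of each controllable device (the function name, bearings and tolerance are illustrative, not from the description):

```python
def resolve_device(pose_angle_deg, device_bearings, tolerance_deg=20.0):
    """Resolve "that" to the device nearest the user's head/body pose.
    device_bearings: {device_name: bearing in degrees}.  Returns the
    closest device within the angular tolerance, or None if the pose
    does not point at any known device."""
    best_name, best_diff = None, tolerance_deg
    for name, bearing in device_bearings.items():
        # wrap-around angular difference in [0, 180]
        diff = abs((pose_angle_deg - bearing + 180.0) % 360.0 - 180.0)
        if diff <= best_diff:
            best_name, best_diff = name, diff
    return best_name
```

A None result corresponds to the ambiguous case, where the dialog manager would fall back on the context stack or ask the user for clarification.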
In addition, the system 10 may make a determination of a vehicle occupant's mood or emotional state in order to effect control of other associated devices that may then affect that state. For instance, if the system detects that the user is warm or cold, the system may cause the temperature to be adjusted for each passenger. If the passenger is tired, the system may cause the adjustment of the seat, increase the music volume, etc. Also, as another example (not necessarily an in-vehicle system), the responsiveness of an application interface may be tuned to the mood of the user. For instance, if the user seems confused, help may be provided by the system. Further, if the user seems upset, faster executions are attempted. Still further, if the user is uncertain, the system may ask for confirmation or offer to guide the user.
While the above example illustrates an application where the multi-modal conversational computing system 10 is deployed in a vehicle, in another illustrative arrangement, the system can be deployed in a larger area, e.g., a room with multiple video input and speech input devices, as well as multiple associated devices controlled by the system 10. Given the inventive teachings herein, one of ordinary skill in the art will realize other applications in which the multi-modal conversational computing system may be employed.
Given the functional components of the multi-modal conversational computing system 10 of FIG. 1, as well as keeping in mind the exemplary applications described above, the following description of FIGs. 2 and 3 provide a general explanation of the interaction of the functional components of the system 10 during the course of the execution of one or more such applications.
Referring now to FIG. 2, a flow diagram illustrates a methodology 200 performed by a multi-modal conversational computing system by which referential ambiguity resolution (e.g., user and/or device disambiguation) is accomplished. First, in step 202, raw multi-modal input data is obtained from multi-modal data sources associated with the system. In terms of the computing system 10 in FIG. 1, such sources are represented by I/O subsystem 12. As mentioned above, the data input portion of the subsystem may comprise one or more cameras or sensors for capturing video input data representing the environment in which the system (or, at least, the I/O subsystem) is deployed. The cameras/sensors may be capable of capturing not only visible image data (images in the visible electromagnetic spectrum), but also IR (near, mid and/or far field IR video) and/or RF image data. Of course, in systems with more than one camera, different mixes of cameras/sensors may be employed, e.g., a system having one or more video cameras, one or more IR sensors and/or one or more RF sensors.
In addition to the one or more cameras, the I/O subsystem 12 may comprise one or more microphones for capturing audio input data from the environment in which the system is deployed. Further, the I/O subsystem may also include an analog-to-digital converter which converts the electrical signal generated by a microphone into a digital signal representative of speech uttered or other sounds that are captured. Further, the subsystem may sample the speech signal and partition the signal into overlapping frames so that each frame is discretely processed by the remainder of the system.
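The partitioning into overlapping frames can be sketched as follows; the frame and hop sizes shown (400 and 160 samples, i.e., 25 ms frames every 10 ms at a 16 kHz sampling rate) are conventional speech-processing values assumed for illustration, not taken from the description:

```python
def frame_signal(samples, frame_len=400, hop=160):
    """Partition a sampled speech signal into overlapping frames.
    Consecutive frames share frame_len - hop samples, so each frame
    can be processed discretely by the rest of the system while
    preserving continuity across frame boundaries."""
    return [samples[start:start + frame_len]
            for start in range(0, len(samples) - frame_len + 1, hop)]
```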
Thus, referring to the vehicle example above, it is to be understood that the cameras and microphones may be strategically placed throughout the vehicle in order to attempt to fully capture all visual activity and audio activity that may be necessary for the system to make ambiguity resolution determinations.
Still further, the I/O subsystem 12 may also comprise other typical input devices for obtaining user input, e.g., GUI-based devices such as a keyboard, a mouse, etc., and/or other devices such as a stylus and digitizer pad for capturing electronic handwriting, etc. It is to be understood that one of ordinary skill in the art will realize other user interfaces and devices that may be included for capturing user activity.
Next, in step 204, the raw multi-modal input data is abstracted into one or more events. In terms of the computing system 10 in FIG. 1, the data abstraction is performed by the I/O manager 14. The I/O manager receives the raw multi-modal data and abstracts the data into a form that represents one or more events, e.g., a spoken utterance, a visual gesture, etc. As is known, a data abstraction operation may involve generalizing details associated with all or portions of the input data so as to yield a more generalized representation of the data for use in further operations.
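A hypothetical sketch of such an abstraction, with field names invented for illustration: the I/O manager wraps each raw input in a generalized event record carrying its modality, kind and timestamp, so that later stages can reason about events uniformly:

```python
from dataclasses import dataclass, field
import time

@dataclass
class Event:
    """One abstracted input event, as the I/O manager might produce it."""
    modality: str            # e.g. "speech", "video", "gui"
    kind: str                # e.g. "utterance", "gesture", "keypress"
    payload: object          # the abstracted content of the raw input
    timestamp: float = field(default_factory=time.time)

def abstract_input(modality, raw_data):
    """Generalize raw multi-modal input into an Event for later stages."""
    kind = {"speech": "utterance", "video": "gesture"}.get(modality, "generic")
    return Event(modality=modality, kind=kind, payload=raw_data)
```

The timestamp lets downstream components (the context stack, the dialog manager) order events and decide which past events are still relevant to the current one.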
In step 206, the abstracted data or event is then sent by the I/O manager 14 to one or more recognition engines 16 in order to have the event recognized, if necessary. That is, depending on the nature of the event, one or more recognition engines may be used to recognize the event. For example, if the event is some form of spoken utterance wherein the microphone picks up the audible portion of the utterance and a camera picks up the visual portion (e.g., lip movement) of the utterance, the event may be sent to an audio-visual speech recognition engine to have the utterance recognized using both the audio input and the video input associated with the speech. Alternatively, or in addition, the event may be sent to an audio-visual speaker recognition engine to have the speaker of the utterance identified, verified and/or authenticated. Also, both speech recognition and speaker recognition can be combined on the same utterance.
If the event is some form of user gesture picked up by a camera, the event may be sent to a gesture recognition engine for recognition. Again, depending on the types of user interfaces provided by the system, the event may comprise handwritten input provided by the user such that one of the recognition engines may be a handwriting recognition engine. In the case of more typical GUI-based input (e.g., keyboard, mouse, etc.), the data may not necessarily need to be recognized since the data is already identifiable without recognition operations.
An audio-visual speech recognition module that may be employed as one of the recognition engines 16 is disclosed in U.S. patent application identified as Serial No.
09/369,707 (attorney docket no. YO999-317), filed on August 6, 1999 and entitled "Methods and Apparatus for Audio-visual Speech Detection and Recognition," the disclosure of which is incorporated by reference herein. A description of such an audio-visual speech recognition system will be provided below. An audio-visual speaker recognition module that may be employed as one of the recognition engines 16 is disclosed in U.S. patent application identified as Serial No. 09/369,706 (attorney docket no. YO999-318), filed on August 6, 1999 and entitled "Methods And Apparatus for Audio-Visual Speaker Recognition and Utterance Verification," the disclosure of which is incorporated by reference herein. A description of such an audio-visual speaker recognition system will be provided below. It is to be appreciated that gesture recognition (e.g., body, arms and/or hand movement, etc., that a user employs to passively or actively give instruction to the system) and focus recognition (e.g., direction of face and eyes of a user) may be performed using the recognition modules described in the above-referenced patent applications. With regard to focus detection, however, the classifier 22 is preferably used to determine the focus of the user and, in addition, the user's mood.
It is to be appreciated that two or more, or even all, of the input modes described herein may be synchronized via the techniques disclosed in U.S. patent application identified as Serial No. 09/507,526 (attorney docket no. YO999-178) filed on February 18, 2000 and entitled "Systems and Method for Synchronizing Multi-modal Interactions," which claims priority to U.S. provisional patent application identified as U.S. Serial No. 60/128,081 filed on April 1, 1999 and U.S. provisional patent application identified by Serial No. 60/158,777 filed on October 12, 1999, the disclosures of which are incorporated by reference herein.
In step 208, the recognized events, as well as the events that do not need to be recognized, are stored in a storage unit referred to as the context stack 20. The context stack is used to create a history of interaction between the user and the system so as to assist the dialog manager 18 in making referential ambiguity resolution determinations when determining the user's intent.
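As an illustrative sketch (the class name, aging policy and time values are assumptions, not part of the description), a context stack can retain timestamped events and expose only those recent enough to help disambiguate the current one:

```python
class ContextStack:
    """Minimal history of events for referential ambiguity resolution."""

    def __init__(self, max_age_s=10.0):
        self.max_age_s = max_age_s
        self._history = []          # list of (timestamp, event) pairs

    def push(self, event, now):
        self._history.append((now, event))

    def recent(self, now):
        """Past events still recent enough to disambiguate a new one."""
        return [ev for t, ev in self._history if now - t <= self.max_age_s]
```

A real implementation would likely age events by dialog turns or topic rather than a fixed time window, but the principle of consulting recent history when resolving references is the same.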
Next, in step 210, the system 10 attempts to determine the user intent based on the current event and the historical interaction information stored in the context stack and then determine and execute one or more application programs that effectuate the user's intention and/or react to the user activity. The application depends on the environment that the system is deployed in. The application may be written in any computer programming language but preferably it is written in a Conversational Markup Language (CML) as disclosed in U.S. patent application identified as 09/544,823 (attorney docket no. YO999-478) filed April 6, 2000 and entitled "Methods and Systems for Multi-modal Browsing and Implementation of a Conversational Markup Language;" U.S. patent application identified as Serial No. 60/102,957 (attorney docket no. YO998-392) filed on October 2, 1998 and entitled "Conversational Browser and Conversational Systems" to which priority is claimed by PCT patent application identified as PCT/US99/23008 filed on October 1, 1999; as well as the above-referenced U.S. patent application identified as Serial No. 09/507,526 (attorney docket no. YO999-178), the disclosures of which are incorporated by reference herein.
Thus, the dialog manager must first determine the user's intent based on the current event and, if available, the historical information (e.g., past events) stored in the context stack. For instance, returning to the vehicle example, the user may say "turn it on," while pointing at the vehicle radio. The dialog manager would therefore receive the results of the recognized events associated with the spoken utterance "turn it on" and the gesture of pointing to the radio. Based on these events, the dialog manager does a search of the existing applications, transactions or "dialogs," or portions thereof, with which such an utterance and gesture could be associated. Accordingly, as shown in FIG. 1, the dialog manager 18 determines the appropriate CML-authored application 24. The application may be stored on the system 10 or accessed (e.g., downloaded) from some remote location. If the dialog manager determines with some predetermined degree of confidence that the application it selects is the one which will effectuate the user's desire, the dialog manager executes the next step of the multi-modal dialog of that application based on the multi-modal input (e.g., prompts or displays for missing, ambiguous or confusing information, asks for confirmation, or launches the execution of an action associated with a fully understood multi-modal request from the user). That is, the dialog manager selects the appropriate device (e.g., radio) activation routine and instructs the I/O manager to output a command to activate the radio. The predetermined degree of confidence may be that at least two input parameters or variables of the application are satisfied or provided by the received events. Of course, depending on the application, other levels of confidence and algorithms may be established as, for example, described in K.A. Papineni, S. Roukos, R.T. Ward, "Free-flow dialog management using forms," Proc. Eurospeech, Budapest, 1999; and K.
Davies et al., "The IBM conversational telephony system for financial applications," Proc. Eurospeech, Budapest, 1999, the disclosures of which are incorporated by reference herein.
Consider the case where the user first says "turn it on," and then a few seconds later points to the radio. The dialog manager would first try to determine user intent based solely on the "turn it on" command. However, since there are likely other devices in the vehicle that could be turned on, the system would likely not be able to determine with a sufficient degree of confidence what the user was referring to. However, this recognized spoken utterance event is stored on the context stack. Then, when the recognized gesture event (e.g., pointing to the radio) is received, the dialog manager takes this event and the previous spoken utterance event stored on the context stack and makes a determination that the user intended to have the radio turned on. Consider the case where the user says "turn it on," but makes no gesture and provides no other utterance. In this case, assume that the dialog manager does not have enough input to determine the user intent (step 212 in FIG. 2) and thus implement the command. The dialog manager, in step 214, then causes the generation of an output to the user requesting further input data so that the user's intent can be disambiguated. This may be accomplished by the dialog manager instructing the I/O manager to have the I/O subsystem output a request for clarification. In one embodiment, the I/O subsystem 12 may comprise a text-to-speech (TTS) engine and one or more output speakers. The dialog manager then generates a predetermined question such as "what device do you want to have turned on?" which the TTS engine converts to a synthesized utterance that is audibly output by the speaker to the user. The user, hearing the query, could then point to the radio or say "the radio" thereby providing the dialog manager with the additional input data to disambiguate his request. That is, with reference to FIG. 2, the system 10 obtains the raw input data, again in step 202, and the process 200 iterates based on the new data. 
Such iteration can continue as long as necessary for the dialog manager to determine the user's intent.
The dialog manager 18 may also seek confirmation in step 216 from the user in the same manner as the request for more information (step 214) before executing the processed event, dispatching a task and/or executing some other action in step 218 (e.g., causing the radio to be turned on). For example, the system may output "do you want the radio turned on?" To which the user may respond "yes." The system then causes the radio to be turned on. Further, the dialog manager 18 may store information it generates and/or obtains during the processing of a current event on the context stack 20 for use in making resolution or other determinations at some later time.
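The resolution flow just described — events pushed onto the context stack, the dialog manager combining a spoken command with a later gesture referent — can be sketched as follows. This is a minimal illustrative sketch, not the patent's implementation; the `Event` and `DialogManager` classes and their field names are hypothetical.

```python
# Hypothetical sketch of the dialog manager's ambiguity resolution loop.
# All class, field and event names are illustrative; the patent does not
# prescribe a concrete API.
from dataclasses import dataclass, field

@dataclass
class Event:
    modality: str          # e.g. "speech" or "gesture"
    content: dict          # recognized payload

@dataclass
class DialogManager:
    context_stack: list = field(default_factory=list)

    def handle(self, event):
        """Push the event on the context stack, then try to resolve intent."""
        self.context_stack.append(event)
        return self.resolve()

    def resolve(self):
        command = referent = None
        # Walk the stack from most recent to oldest, combining a spoken
        # command ("turn it on") with a gesture referent (pointing at radio).
        for ev in reversed(self.context_stack):
            if ev.modality == "speech" and command is None:
                command = ev.content.get("action")
                referent = referent or ev.content.get("device")
            elif ev.modality == "gesture" and referent is None:
                referent = ev.content.get("device")
        if command and referent:
            return f"{command} {referent}"
        return None  # still ambiguous: caller should request clarification

dm = DialogManager()
assert dm.handle(Event("speech", {"action": "turn on"})) is None   # ambiguous
assert dm.handle(Event("gesture", {"device": "radio"})) == "turn on radio"
```

A `None` result corresponds to step 214 of FIG. 2 (requesting further input), while a resolved string corresponds to steps 216/218 (confirmation and execution).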
Of course, it is to be understood that the above example is a simple example of device ambiguity resolution. As mentioned, the system 10 can also make user ambiguity resolution determinations, e.g., in a multiple user environment, someone says "dial my office." Given the explanation above, one of ordinary skill will appreciate how the system 10 could handle such a command in order to decide who among the multiple users made the request and then effectuate the order.
Also, the output to the user to request further input may be made in any other number of ways and with any amount of interaction turns between the user and feedback from the system to the user. For example, the I/O subsystem 12 may include a GUI-based display whereby the request is made by the system in the form of a text message displayed on the screen of the display. One of ordinary skill in the art will appreciate many other output mechanisms for implementing the teachings herein. It is to be appreciated that the conversational virtual machine disclosed in PCT patent application identified as PCT/US99/22927 (attorney docket no. YO999-111) filed on October 1, 1999 and entitled "Conversational Computing Via Conversational Virtual Machine," the disclosure of which is incorporated by reference herein, may be employed to provide a framework for the I/O manager, recognition engines, dialog manager and context stack of the invention. A description of such a conversational virtual machine will be provided below.
Also, while focus or attention detection is preferably performed in accordance with the focus/mood classifier 22, as will be explained below, it is to be appreciated that such operation can also be performed by the dialog manager 18, as explained above.
Referring now to FIG. 3, a flow diagram illustrates a methodology 300 performed by a multi-modal conversational computing system by which mood classification and/or focus detection is accomplished. It is to be appreciated that the system 10 may perform the methodology of FIG. 3 in parallel with the methodology of FIG. 2 or at separate times. And because of this, the events that are stored by one process in the context stack can be used by the other.
It is to be appreciated that steps 302 through 308 are similar to steps 202 through 208 in FIG. 2. That is, the I/O subsystem 12 obtains raw multi-modal input data from the various multi-modal sources (step 302); the I/O manager 14 abstracts the multi-modal input data into one or more events (step 304); the one or more recognition engines 16 recognize the event, if necessary, based on the nature of the one or more events (step 306); and the events are stored on the context stack (step 308).
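Steps 302 through 308 can be sketched as a small pipeline. This is only an illustrative sketch under assumed names (`abstract`, `RECOGNIZERS`, `process` are hypothetical); the stand-in recognizers merely tag the data so the routing logic is visible.

```python
# Illustrative sketch of steps 302-308: raw multi-modal data is abstracted
# into events, routed to a recognizer when the modality needs one, and the
# result is stored on the context stack. All names are hypothetical.
def abstract(raw):
    """I/O manager: wrap each raw input with its modality tag (step 304)."""
    return [{"modality": m, "data": d} for m, d in raw.items()]

RECOGNIZERS = {
    "audio": lambda d: {"transcript": d.upper()},   # stand-in for speech reco
    "video": lambda d: {"gesture": d},              # stand-in for gesture reco
}

def process(raw, context_stack):
    for event in abstract(raw):                     # step 304
        reco = RECOGNIZERS.get(event["modality"])   # step 306 (if needed)
        if reco:
            event["recognized"] = reco(event["data"])
        context_stack.append(event)                 # step 308
    return context_stack

stack = process({"audio": "turn it on", "video": "point-at-radio"}, [])
assert stack[0]["recognized"]["transcript"] == "TURN IT ON"
assert len(stack) == 2
```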
As described in the above vehicle example, in the case of focus detection, the system 10 may determine the focus (and focus history) of the user in order to determine whether he is paying sufficient attention to the task of driving (assuming he is the driver).
Such determination may be made by noting abnormally long eye closure or gazing in another direction and/or speech that indicates distraction or sleepiness. The system may then alert the operator of this potentially dangerous state. In addition, with respect to mood classification, the system may make a determination of a vehicle occupant's mood or emotional state in order to effect control of other associated devices that may then affect that state. Such focus and mood determinations are made in step 310 by the focus/mood classifier 22. The focus/mood classifier 22 receives either the events directly from the I/O manager 14 or, if necessary depending on the nature of the event, the classifier receives the recognized events from the one or more recognition engines 16. For instance, in the vehicle example, the focus/mood classifier may receive visual events indicating the position of the user's eyes and/or head as well as audio events indicating sounds the user may be making (e.g., snoring). Using these events, as well as past information stored on the context stack, the classifier makes the focus detection and/or mood classification determination. Results of such determinations may also be stored on the context stack.
Then, in step 312, the classifier may cause the execution of some action depending on the resultant determination. For example, if the driver's attention is determined to be distracted, the I/O manager may be instructed by the classifier to output a warning message to the driver via the TTS system and the one or more output speakers. If the driver is determined to be tired due, for example, to his monitored body posture, the I/O manager may be instructed by the classifier to provide a warning message, adjust the temperature or radio volume in the vehicle, etc. It is to be appreciated that the conversational data mining system disclosed in U.S. patent application identified as Serial No. 09/371,400 (attorney docket no. YO999-227) filed on August 10, 1999 and entitled "Conversational Data Mining," the disclosure of which is incorporated by reference herein, may be employed to provide a framework for the mood/focus classifier of the invention. A description of such a conversational data mining system will be provided below.
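The focus determination of step 310 can be illustrated with a toy rule. This is purely a sketch under assumptions: the event representation and the two-second threshold are invented for illustration and are not taken from the patent.

```python
# Toy rule-based sketch of the focus determination in step 310: flag the
# driver as inattentive after abnormally long eye closure or off-road gaze.
# Event tuples and the threshold value are illustrative, not from the patent.
def classify_focus(eye_events, closure_threshold=2.0):
    """eye_events: list of (state, duration_sec) tuples; returns 'attentive'
    or 'distracted' for use in step 312 (e.g. triggering a TTS warning)."""
    for state, duration in eye_events:
        if state in ("closed", "gaze-away") and duration > closure_threshold:
            return "distracted"
    return "attentive"

assert classify_focus([("open", 5.0), ("closed", 0.3)]) == "attentive"
assert classify_focus([("closed", 3.5)]) == "distracted"
```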
For ease of reference, the remainder of the detailed description will be divided into the following sections: (A) Audio-visual speech recognition; (B) Audio-visual speaker recognition; (C) Conversational Virtual Machine; and (D) Conversational Data Mining. These sections describe detailed preferred embodiments of certain components of the multi-modal conversational computing system 10 shown in FIG. 1, as will be explained in each section.
A. Audio-visual speech recognition Referring now to FIG. 4, a block diagram illustrates a preferred embodiment of an audio-visual speech recognition module that may be employed as one of the recognition modules of FIG. 1 to perform speech recognition using multi-modal input data received in accordance with the invention. It is to be appreciated that such an audio-visual speech recognition module is disclosed in the above-referenced U.S. patent application identified as Serial No. 09/369,707 (attorney docket no. YO999-317), filed on August 6, 1999 and entitled "Methods and Apparatus for Audio-visual Speech Detection and Recognition." A description of one of the embodiments of such an audio-visual speech recognition module for use in a preferred embodiment of the multi-modal conversational computing system of the invention is provided below in this section. However, it is to be appreciated that other mechanisms for performing speech recognition may be employed.
This particular illustrative embodiment, as will be explained, depicts audio-visual recognition using a decision fusion approach. It is to be appreciated that one of the advantages that the audio-visual speech recognition module described herein provides is the ability to process arbitrary content video. That is, previous systems that have attempted to utilize visual cues from a video source in the context of speech recognition have utilized video with controlled conditions, i.e., non-arbitrary content video. That is, the video content included only faces from which the visual cues were taken in order to try to recognize short commands or single words in a predominantly noiseless environment. However, as will be explained in detail below, the module described herein is preferably able to process arbitrary content video which may not only contain faces but may also contain arbitrary background objects in a noisy environment. One example of arbitrary content video is in the context of broadcast news. Such video can possibly contain a newsperson speaking at a location where there is arbitrary activity and noise in the background. In such a case, as will be explained, the module is able to locate and track a face and, more particularly, a mouth, to determine what is relevant visual information to be used in more accurately recognizing the accompanying speech provided by the speaker. The module is also able to continue to recognize when the speaker's face is not visible (audio only) or when the speech is inaudible (lip reading only).
Thus, the module is capable of receiving real-time arbitrary content from a video camera 404 and microphone 406 via the I/O manager 14. It is to be understood that the camera and microphone are part of the I/O subsystem 12. While the video signals received from the camera 404 and the audio signals received from the microphone 406 are shown in FIG. 4 as not being compressed, they may be compressed and therefore need to be decompressed in accordance with the applied compression scheme.
It is to be understood that the video signal captured by the camera 404 can be of any particular type. As mentioned, the face and pose detection techniques may process images of any wavelength such as, e.g., visible and/or non-visible electromagnetic spectrum images. By way of example only, this may include infrared (IR) images (e.g., near, mid and far field IR video) and radio frequency (RF) images. Accordingly, the module may perform audio-visual speech detection and recognition techniques in poor lighting conditions, changing lighting conditions, or in environments without light. For example, the system may be installed in an automobile or some other form of vehicle and capable of capturing IR images so that improved speech recognition may be performed. Because video information (i.e., including visible and/or non-visible electromagnetic spectrum images) is used in the speech recognition process, the system is less susceptible to recognition errors due to noisy conditions, which significantly hamper conventional recognition systems that use only audio information. In addition, due to the methodologies for processing the visual information described herein, the module provides the capability to perform accurate LVCSR (large vocabulary continuous speech recognition). A phantom line denoted by Roman numeral I represents the processing path the audio information signal takes within the module, while a phantom line denoted by Roman numeral II represents the processing path the video information signal takes within the module. First, the audio signal path I will be discussed, then the video signal path II, followed by an explanation of how the two types of information are combined to provide improved recognition accuracy.
The module includes an auditory feature extractor 414. The feature extractor 414 receives an audio or speech signal and, as is known in the art, extracts spectral features from the signal at regular intervals. The spectral features are in the form of acoustic feature vectors (signals) which are then passed on to a probability module 416. Before acoustic vectors are extracted, the speech signal may be sampled at a rate of 16 kilohertz (kHz). A frame may consist of a segment of speech having a 25 millisecond (msec) duration. In such an arrangement, the extraction process preferably produces 24 dimensional acoustic cepstral vectors via the process described below. Frames are advanced every 10 msec to obtain succeeding acoustic vectors. Note that other acoustic front-ends with other frame sizes and sampling rates/signal bandwidths can also be employed.
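The framing arithmetic implied by the numbers above can be made concrete: at a 16 kHz sampling rate, a 25 msec frame holds 400 samples and a 10 msec advance is 160 samples. The helper below is a sketch of that windowing only, not of the full front-end.

```python
# Framing arithmetic implied by the front-end above: at 16 kHz, a 25 ms
# frame holds 400 samples and a 10 ms advance is 160 samples.
def frame_signal(samples, rate=16000, frame_ms=25, shift_ms=10):
    frame_len = rate * frame_ms // 1000     # 400 samples per frame
    shift = rate * shift_ms // 1000         # 160 samples between frames
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, shift)]

one_second = list(range(16000))
frames = frame_signal(one_second)
assert len(frames[0]) == 400
assert frames[1][0] == 160                  # frames advance every 10 ms
assert len(frames) == 98                    # (16000 - 400) // 160 + 1
```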
First, in accordance with a preferred acoustic feature extraction process, magnitudes of discrete Fourier transforms of samples of speech data in a frame are considered in a logarithmically warped frequency scale. Next, these amplitude values themselves are transformed to a logarithmic scale. The latter two steps are motivated by a logarithmic sensitivity of human hearing to frequency and amplitude. Subsequently, a rotation in the form of discrete cosine transform is applied. One way to capture the dynamics is to use the delta (first-difference) and the delta-delta (second-order differences) information. An alternative way to capture dynamic information is to append a set of (e.g., four) preceding and succeeding vectors to the vector under consideration and then project the vector to a lower dimensional space, which is chosen to have the most discrimination. The latter procedure is known as Linear Discriminant Analysis (LDA) and is well known in the art.
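The delta (first-difference) dynamic features mentioned above can be sketched directly. This is an illustrative sketch only: the centered-difference window of one frame on each side is a common choice, not a parameter stated in the text.

```python
# Sketch of delta (first-difference) dynamic features: each cepstral vector
# is extended with a centered difference of its neighbors. The +/-1 frame
# window is illustrative.
def add_deltas(cepstra):
    out = []
    for t, c in enumerate(cepstra):
        prev = cepstra[max(t - 1, 0)]              # clamp at sequence edges
        nxt = cepstra[min(t + 1, len(cepstra) - 1)]
        delta = [(n - p) / 2.0 for p, n in zip(prev, nxt)]
        out.append(c + delta)                      # static + dynamic part
    return out

feats = add_deltas([[1.0, 2.0], [3.0, 6.0], [5.0, 10.0]])
assert feats[1] == [3.0, 6.0, 2.0, 4.0]            # centered difference
assert len(feats[0]) == 4                          # dimension doubles
```

Delta-delta features would repeat the same operation on the delta part; the LDA alternative instead concatenates neighboring vectors and projects to a lower-dimensional, maximally discriminative space.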
After the acoustic feature vectors, denoted in FIG. 4 by the letter A, are extracted, the probability module labels the extracted vectors with one or more previously stored phonemes which, as is known in the art, are sub-phonetic or acoustic units of speech. The module may also work with lefemes, which are portions of phones in a given context. Each phoneme associated with one or more feature vectors has a probability associated therewith indicating the likelihood that it was that particular acoustic unit that was spoken. Thus, the probability module yields likelihood scores for each considered phoneme in the form of the probability that, given a particular phoneme or acoustic unit (au), the acoustic unit represents the uttered speech characterized by one or more acoustic feature vectors A or, in other words, P(A|acoustic unit). It is to be appreciated that the processing performed in blocks 414 and 416 may be accomplished via any conventional acoustic information recognition system capable of extracting and labeling acoustic feature vectors, e.g., Lawrence Rabiner, Biing-Hwang Juang, "Fundamentals of Speech Recognition," Prentice Hall, 1993.
Referring now to the video signal path II of FIG. 4, the methodologies of processing visual information will now be explained. The audio-visual speech recognition module (denoted in FIG. 4 as part of block 16 from FIG. 1) includes an active speaker face detection module 418. The active speaker face detection module 418 receives video input from camera 404. It is to be appreciated that speaker face detection can also be performed directly in the compressed data domain and/or from audio and video information rather than just from video information. In any case, module 418 generally locates and tracks the speaker's face and facial features within the arbitrary video background. This will be explained in detail below.
The recognition module also preferably includes a frontal pose detection module 420. It is to be understood that the detection module 420 serves to determine whether a speaker in a video frame is in a frontal pose. This serves the function of reliably determining when someone is likely to be uttering or is likely to start uttering speech that is meant to be processed by the module, e.g., recognized by the module. This is the case at least when the speaker's face is visible from one of the cameras. When it is not, conventional speech recognition with, for example, silence detection, speech activity detection and/or noise compensation can be used. Thus, background noise is not recognized as though it were speech, and the starts of utterances are not mistakenly discarded. It is to be appreciated that not all speech acts performed within the hearing of the module are intended for the system. The user may not be speaking to the system, but to another person present or on the telephone. Accordingly, the module implements a detection module such that the modality of vision is used in connection with the modality of speech to determine when to perform certain functions in auditory and visual speech recognition.
One way to determine when a user is speaking to the system is to detect when he is facing the camera and when his mouth indicates a speech or verbal activity. This copies human behavior well. That is, when someone is looking at you and moves his lips, this indicates, in general, that he is speaking to you.
In accordance with the face detection module 418 and frontal pose detection module 420, we detect the "frontalness" of a face pose in the video image being considered. We call a face pose "frontal" when a user is considered to be: (i) more or less looking at the camera; or (ii) looking directly at the camera (also referred to as "strictly frontal"). Thus, in a preferred embodiment, we determine "frontalness" by determining that a face is absolutely not frontal (also referred to as "non-frontal"). A non-frontal face pose is when the orientation of the head is far enough from the strictly frontal orientation that the gaze cannot be interpreted as directed to the camera nor interpreted as more or less directed at the camera. Examples of what are considered frontal face poses and non-frontal face poses in a preferred embodiment are shown in FIG. 5A. Poses I, II and III illustrate face poses where the user's face is considered frontal, and poses IV and V illustrate face poses where the user's face is considered non-frontal. Referring to FIG. 5B, a flow diagram of an illustrative method of performing face detection and frontal pose detection is shown. The first step (step 502) is to detect face candidates in an arbitrary content video frame received from the camera 404. Next, in step 504, we detect facial features on each candidate such as, for example, nose, eyes, mouth, ears, etc. Thus, we have all the information necessary to prune the face candidates according to their frontalness, in step 506. That is, we remove candidates that do not have sufficient frontal characteristics, e.g., a number of well detected facial features and distances between these features. An alternate process in step 506 to the pruning method involves a hierarchical template matching technique, also explained in detail below. In step 508, if at least one face candidate exists after the pruning mechanism, it is determined that a frontal face is in the video frame being considered.
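The pipeline of FIG. 5B can be sketched end to end. The candidate structure, feature names and score threshold below are hypothetical stand-ins for the FLD/DFFS scores discussed later.

```python
# Pipeline sketch of FIG. 5B (steps 502-508): detect face candidates, locate
# facial features, prune non-frontal candidates, and report whether any
# frontal face survives. Candidate structure and threshold are hypothetical.
def frontal_face_present(candidates, min_feature_score=0.5):
    frontal = []
    for cand in candidates:                         # step 502 output
        features = cand["features"]                 # step 504 output
        # step 506: keep only candidates whose key features score well
        if all(features.get(f, 0.0) >= min_feature_score
               for f in ("eyes", "nose", "mouth")):
            frontal.append(cand)
    return len(frontal) > 0                         # step 508

faces = [{"features": {"eyes": 0.9, "nose": 0.8, "mouth": 0.7}},
         {"features": {"eyes": 0.2, "nose": 0.1, "mouth": 0.0}}]
assert frontal_face_present(faces) is True
assert frontal_face_present(faces[1:]) is False
```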
There are several ways to solve the general problem of pose detection. First, a geometric method suggests to simply consider variations of distances between some features in a two dimensional representation of a face (i.e., a camera image), according to the pose. For instance, on a picture of a slightly turned face, the distance between the right eye and the nose should be different from the distance between the left eye and the nose, and this difference should increase as the face turns. We can also try to estimate the facial orientation from inherent properties of a face. In the article by A. Gee and R. Cipolla, "Estimating Gaze from a Single View of a Face," Tech. Rep. CUED/F-INFENG/TR174, March 1994, it is suggested that the facial normal is estimated by considering mostly pose invariant distance ratios within a face.
Another way is to use filters and other simple transformations on the original image or the face region. In the article by R. Brunelli, "Estimation of pose and illuminant direction for face processing," Image and Vision Computing 15, pp. 741-748, 1997, for instance, after a preprocessing stage that tends to reduce sensitivity to illumination, the two eyes are projected on the horizontal axis and the amount of asymmetry yields an estimation of the rotation of the face. In methods referred to as training methods, one tries to "recognize" the face pose by modeling several possible poses of the face. One possibility is the use of Neural Networks like Radial Basis Function (RBF) networks as described in the article by A.J. Howell and Hilary Buxton, "Towards Visually Mediated Interaction Using Appearance-Based Models," CSRP 490, June 1998. The RBF networks are trained to classify images in terms of pose classes from low resolution pictures of faces.
Another approach is to use three dimensional template matching. In the article by N. Kruger, M. Potzch, and C. von der Malsburg, "Determination of face position and pose with a learned representation based on labeled graphs," Image and Vision Computing 15, pp. 665-673, 1997, it is suggested to use a three dimensional elastic graph matching to represent a face. Each node is associated with a set of Gabor jets and the similarity between the candidate graph and the templates for different poses can be optimized by deforming the graph.
Of course, these different ways can be combined to yield better results. Almost all of these methods assume that a face has been previously located on a picture, and often assume that some features in the face like the eyes, the nose and so on, have been detected. Moreover some techniques, especially the geometric ones, rely very much on the accuracy of this feature position detection.
But face and feature finding on a picture is a problem that also has many different solutions. In a preferred embodiment, we consider it as a two-class detection problem which is less complex than the general pose detection problem that aims to determine face pose very precisely. By two-class detection, as opposed to multi-class detection, we mean that a binary decision is made between two options, e.g., presence of a face or absence of a face, frontal face or non-frontal face, etc. While one or more of the techniques described above may be employed, the techniques we implement in a preferred embodiment are described below.
In such a preferred embodiment, the main technique employed by the active speaker face detection module 418 and the frontal pose detection module 420 to do face and feature detection is based on Fisher Linear Discriminant (FLD) analysis. A goal of FLD analysis is to get maximum discrimination between classes and reduce the dimensionality of the feature space. For face detection, we consider two classes: (i) the In-Class, which comprises faces; and (ii) the Out-Class, composed of non-faces. The criterion of FLD analysis is then to find the vector w of the feature space that maximizes the following ratio:
J(w) = (w^T SB w) / (w^T Sw w)    (1)
where SB is the between-class scatter matrix and Sw the within-class scatter matrix. Having found the right w (which is referred to as the FLD), we then project each feature vector x on it by computing w^T x and compare the result to a threshold in order to decide whether x belongs to the In-Class or to the Out-Class. It should be noted that we may use Principal Component Analysis (PCA), as is known, to reduce dimensionality of the feature space prior to finding the vector w that maximizes the ratio in equation (1), e.g., see P.N. Belhumeur, J.P. Hespanha, and D.J. Kriegman, "Eigenfaces vs. Fisherfaces: Recognition Using Class Specific Linear Projection," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 19, no. 7, July 1997.
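For the two-class case described above, the direction maximizing the ratio in equation (1) has the well-known closed form w ∝ Sw⁻¹(m_in − m_out), where m_in and m_out are the class means; the projection w·x is then compared to a threshold. The sketch below works in two dimensions with toy data so the arithmetic stays self-contained; it is illustrative, not the module's implementation.

```python
# Two-class Fisher Linear Discriminant sketch in 2-D. The direction that
# maximizes equation (1) for two classes is w = Sw^{-1}(m_in - m_out);
# the projection w.x is thresholded to assign In-Class or Out-Class.
def mean(pts):
    n = len(pts)
    return [sum(p[0] for p in pts) / n, sum(p[1] for p in pts) / n]

def scatter(pts, m):
    """2x2 scatter matrix of one class about its mean."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for x, y in pts:
        dx, dy = x - m[0], y - m[1]
        s[0][0] += dx * dx; s[0][1] += dx * dy
        s[1][0] += dy * dx; s[1][1] += dy * dy
    return s

def fld_direction(in_class, out_class):
    m1, m2 = mean(in_class), mean(out_class)
    s1, s2 = scatter(in_class, m1), scatter(out_class, m2)
    sw = [[s1[0][0] + s2[0][0], s1[0][1] + s2[0][1]],   # within-class scatter
          [s1[1][0] + s2[1][0], s1[1][1] + s2[1][1]]]
    d = [m1[0] - m2[0], m1[1] - m2[1]]
    det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]
    # w = Sw^{-1} d via the explicit 2x2 inverse
    return [(sw[1][1] * d[0] - sw[0][1] * d[1]) / det,
            (-sw[1][0] * d[0] + sw[0][0] * d[1]) / det]

faces = [[2.0, 2.1], [2.2, 1.9], [1.9, 2.0]]       # toy In-Class
nonfaces = [[0.0, 0.1], [0.1, -0.1], [-0.1, 0.0]]  # toy Out-Class
w = fld_direction(faces, nonfaces)
proj = lambda p: p[0] * w[0] + p[1] * w[1]
# Projections of the two classes separate around an intermediate threshold
assert min(proj(f) for f in faces) > max(proj(n) for n in nonfaces)
```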
Face detection (step 502 of FIG. 5B) involves first locating a face in the first frame of a video sequence and the location is tracked across frames in the video clip. Face detection is preferably performed in the following manner. For locating a face, an image pyramid over permissible scales is generated and, for every location in the pyramid, we score the surrounding area as a face location. After a skin-tone segmentation process that aims to locate image regions in the pyramid where colors could indicate the presence of a face, the image is sub-sampled and regions are compared to a previously stored diverse training set of face templates using FLD analysis. This yields a score that is combined with a Distance From Face Space (DFFS) measure to give a face likelihood score. As is known, DFFS considers the distribution of the image energy over the eigenvectors of the covariance matrix. The higher the total score, the higher the chance that the considered region is a face. Thus, the locations scoring highly on all criteria are determined to be faces. For each high scoring face location, we consider small translations, scale and rotation changes that occur from one frame to the next and re-score the face region under each of these changes to optimize the estimates of these parameters (i.e., FLD and DFFS). DFFS is also described in the article by M. Turk and A. Pentland, "Eigenfaces for Recognition," Journal of Cognitive Neuro Science, vol. 3, no. 1, pp. 71-86, 1991. A computer vision-based face identification method for face and feature finding which may be employed in accordance with the invention is described in Andrew Senior, "Face and feature finding for face recognition system," 2nd Int. Conf. On Audio-Video based Biometric Person Authentication, Washington DC, March 1999. A similar method is applied, combined with statistical considerations of position, to detect the features within a face (step 504 of FIG. 5B).
Notice that this face and feature detection technique is designed to detect strictly frontal faces only, and the templates are intended only to distinguish strictly frontal faces from non-faces: more general frontal faces are not considered at all. Of course, this method requires the creation of face and feature templates. These are generated from a database of frontal face images. The training face or feature vectors are added to the In-class and some Out-class vectors are generated randomly from the background in our training images.
In a score thresholding technique, the total score may be compared to a threshold to decide whether or not a face candidate or a feature candidate is a true face or feature.
This score, being based on FLD analysis, has interesting properties for the practical pose detection problem. Indeed, for a given user, the score varies as the user is turning his head, e.g., the score being higher when the face is more frontal. Then, having already a method to detect strictly frontal faces and features in it, we adapt it as closely as possible for our two-class detection problem. In a preferred embodiment, the module provides two alternate ways to adapt (step 506 of FIG. 5B) the detection method: (i) a pruning mechanism; and (ii) a hierarchical template matching technique.
Pruning mechanism
Here, we reuse templates already computed for face detection. Our face and feature detection technique only needs strictly frontal faces training data and thus we do not require a broader database. The method involves combining face and feature detection to prune non-frontal faces. We first detect faces in the frame according to the algorithm we have discussed above, but intentionally with a low score threshold. This low threshold allows us to detect faces that are far from being strictly frontal, so that we do not miss any more or less frontal faces. Of course, this yields the detection of some profile faces and even non-faces. Then, in each candidate, we estimate the location ofthe face features (eyes, nose, lips, etc.).
The false candidates are pruned from the candidates according to the following independent computations:
(i) The sum of all the facial feature scores: this is the score given by our combination of FLD and DFFS. The sum is to be compared to a threshold to decide if the candidate should be discarded.
(ii) The number of main features that are well recognized: we discard candidates with a low score for the eyes, the nose and the mouth. Indeed, these are the most characteristic and visible features of a human face and they differ a lot between frontal and non-frontal faces.
(iii) The ratio of the distance between each eye and the center of the nose.
(iv) The ratio of the distance between each eye and the side of the face region (each face is delimited by a square for template matching; see, e.g., the A. Senior reference cited above). Particularly, the ratio is the distance of the outer extremity of the left eye from the medial axis over the distance of the outer extremity of the right eye from the medial axis. The ratio depends on the perspective angle of the viewer and can therefore be used as a criterion. These ratios, for two-dimensional projection reasons, will differ from unity, the more the face is non-frontal. So, we compute these ratios for each face candidate and compare them to unity to decide if the candidate has to be discarded or not.
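Criteria (iii) and (iv) both reduce to comparing a left/right distance ratio to unity. A minimal sketch, with an illustrative tolerance value that the patent does not specify:

```python
# Sketch of pruning criteria (iii)/(iv): left/right distance ratios (eye to
# nose center, or eye to face edge) are near unity for frontal faces and
# drift from unity as the head turns. The tolerance is illustrative.
def is_frontal_by_ratio(left_dist, right_dist, tolerance=0.25):
    ratio = left_dist / right_dist
    return abs(ratio - 1.0) <= tolerance

assert is_frontal_by_ratio(40.0, 42.0) is True     # nearly symmetric: frontal
assert is_frontal_by_ratio(25.0, 50.0) is False    # turned head: pruned
```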
Then, if one or more face candidates remain in the candidates stack, we will consider that a frontal face has been detected in the considered frame. Finally, for practical reasons, we preferably use a burst mechanism to smooth results. Here, we use the particularity of our interactive system: since we consider a user who is (or is not) in front of the camera, we can take his behavior in time into account. As the video camera is expected to take pictures of the user at a high rate (typically 30 frames per second), we can use the results of the former frames to predict the results in the current one, considering that humans move slowly compared to the frame rate.
So, if a frontal face has been detected in the current frame, we may consider that it will remain frontal in the next x frames (x depends on the frame rate). Of course, this will add some false positive detections when the face actually becomes non-frontal from frontal as the user turns his head or leaves, but we can accept some more false positive detections if we get lower false negative detections. Indeed, false negative detections are worse for our human-computer interaction system than false positive ones: it is very important to not miss a single word of the user speech, even if the computer sometimes listens too much.
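The burst mechanism can be sketched as a hold counter over per-frame detections. The hold length x is illustrative; as noted above it would be chosen from the frame rate.

```python
# Sketch of the burst-smoothing mechanism: once a frontal face is detected,
# the next `hold` frames are also reported frontal, trading extra false
# positives for fewer missed words. The hold length is illustrative.
def smooth_detections(raw, hold=3):
    out, remaining = [], 0
    for detected in raw:
        if detected:
            remaining = hold          # restart the hold window
            out.append(True)
        elif remaining > 0:
            remaining -= 1
            out.append(True)          # held frontal from a recent detection
        else:
            out.append(False)
    return out

raw = [True, False, False, False, False, False, False]
assert smooth_detections(raw, hold=3) == [True, True, True, True,
                                          False, False, False]
```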
This pruning method has many advantages. For example, it does not require the computation of a specific database: we can reuse the one computed to do face detection.
Also, compared to simple thresholding, it discards some high score non-faces, because it relies on some face-specific considerations such as face features and face geometry.
Hierarchical template matching
Another solution to solve our detection problem is to modify the template matching technique. Indeed, our FLD computation technique does not consider "non-frontal" faces at all: In-class comprises only "strictly frontal" faces and Out-class only non-faces. So, in accordance with this alternate embodiment, we may use other forms of templates such as:
(i) A face template where the In-Class includes frontal faces as well as non-frontal faces, unlike the previous technique, and where the Out-Class comprises non-faces. (ii) A pose template where the In-Class includes strictly frontal faces and the
Out-Class includes non-frontal faces.
The use of these two templates allows us to do a hierarchical template matching. First, we do template matching with the face template in order to compute a real face-likelihood score. This score will indicate (after comparison with a threshold) whether we have a face (frontal or non-frontal) or a non-face. Then, if a face has actually been detected by this matching, we can perform the second template matching with the pose template that, this time, will yield a frontalness-likelihood score. This final pose score has better variations from non-frontal to frontal faces than the previous face score.
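The two-stage decision can be sketched as follows. The scoring functions and thresholds are hypothetical stand-ins for the FLD/DFFS template matchers; only the hierarchical control flow is from the text.

```python
# Two-stage sketch of hierarchical template matching: a face-template score
# first separates faces from non-faces; only surviving candidates receive a
# pose-template score deciding frontal vs. non-frontal. Scoring functions
# and thresholds are hypothetical.
def classify(candidate, face_score, pose_score,
             face_threshold=0.5, pose_threshold=0.5):
    if face_score(candidate) < face_threshold:
        return "non-face"                  # first matching stage
    if pose_score(candidate) < pose_threshold:
        return "non-frontal face"          # second matching stage
    return "frontal face"

face_score = lambda c: c["face"]
pose_score = lambda c: c["pose"]
assert classify({"face": 0.9, "pose": 0.8}, face_score, pose_score) == "frontal face"
assert classify({"face": 0.9, "pose": 0.1}, face_score, pose_score) == "non-frontal face"
assert classify({"face": 0.2, "pose": 0.9}, face_score, pose_score) == "non-face"
```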
Thus, the hierarchical template method makes it easier to find a less user-dependent threshold so that we could solve our problem by simple face finding score thresholding. One advantage of the hierarchical template matching method is that the pose score (i.e., the score given by the pose template matching) is very low for non-faces (i.e., for non-faces that could have been wrongly detected as faces by the face template matching), which helps to discard non-faces. Given the results of either the pruning method or the hierarchical template matching method, one or more frontal pose presence estimates are output by the module 420 (FIG. 4). These estimates (which may include the FLD and DFFS parameters computed in accordance with modules 418 and 420) represent whether or not a face having a frontal pose is detected in the video frame under consideration. These estimates are used by an event detection module 428, along with the audio feature vectors A extracted in module 414 and visual speech feature vectors V extracted in a visual speech feature extractor module 422, explained below. Returning now to FIG. 4, the visual speech feature extractor 422 extracts visual speech feature vectors (e.g., mouth or lip-related parameters), denoted in FIG. 4 as the letter V, from the face detected in the video frame by the active speaker face detector 418.
Examples of visual speech features that may be extracted are grey scale parameters of the mouth region; geometric/model based parameters such as area, height, width of mouth region; lip contours arrived at by curve fitting, spline parameters of inner/outer contour; and motion parameters obtained by three dimensional tracking. Still another feature set that may be extracted via module 422 takes into account the above factors. Such a technique is known as Active Shape modeling and is described in Iain Matthews, "Features for audio visual speech recognition," Ph.D. dissertation, School of Information Systems, University of East Anglia, January 1998.
Thus, while the visual speech feature extractor 422 may implement one or more known visual feature extraction techniques, in one embodiment, the extractor extracts grey scale parameters associated with the mouth region of the image. Given the location of the lip corners, after normalization of scale and rotation, a rectangular region containing the lip region at the center of the rectangle is extracted from the original decompressed video frame. Principal Component Analysis (PCA), as is known, may be used to extract a vector of smaller dimension from this vector of grey-scale values.
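A minimal sketch of this dimensionality-reduction step, assuming flattened grey-scale mouth-region vectors (one row per training frame) as input and using an SVD of the centered data to obtain the principal components; the training pipeline shown here is an illustration, not the system's actual implementation.

```python
import numpy as np

def pca_project(vectors, k):
    """Learn a k-dimensional PCA basis from grey-scale mouth-region
    vectors (one row per training frame)."""
    X = np.asarray(vectors, dtype=float)
    mean = X.mean(axis=0)
    # SVD of the centered data; rows of vt are the principal directions.
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    basis = vt[:k]                     # top-k principal components
    return mean, basis

def reduce_vector(v, mean, basis):
    """Project one grey-scale vector onto the learned low-dimensional space."""
    return basis @ (np.asarray(v, dtype=float) - mean)
```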
Another method of extracting visual feature vectors that may be implemented in module 422 may include extracting geometric features. This entails extracting the phonetic/visemic information from the geometry of the lip contour and its time dynamics.
Typical parameters may be the mouth corners, the height or the area of opening, and the curvature of the inner as well as the outer lips. Positions of articulators, e.g., teeth and tongue, may also be feature parameters, to the extent that they are discernible by the camera.
The method of extraction of these parameters from grey scale values may involve minimization of a function (e.g., a cost function) that describes the mismatch between the lip contour associated with parameter values and the grey scale image. Color information may be utilized as well in extracting these parameters.
From the captured (or demultiplexed and decompressed) video stream one performs a boundary detection, the ultimate result of which is a parameterized contour, e.g., circles, parabolas, ellipses or, more generally, spline contours, each of which can be described by a finite set of parameters.
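As an illustration of such a finite-parameter contour, a parabola can be fitted to detected lip-boundary points by least squares. The choice of a parabola (and the helper name) is an assumption for brevity; as noted above, circles, ellipses or spline contours are equally valid parameterizations.

```python
import numpy as np

def fit_parabola(xs, ys):
    """Least-squares fit of y = a*x^2 + b*x + c to detected boundary
    points, returning the finite parameter set (a, b, c)."""
    A = np.vstack([np.square(xs), xs, np.ones_like(xs)]).T
    coeffs, *_ = np.linalg.lstsq(A, ys, rcond=None)
    return coeffs
```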
Still other features that can be extracted include two or three dimensional wire-frame model-based techniques of the type used in computer graphics for the purposes of animation. A wire-frame may consist of a large number of triangular patches. These patches together give a structural representation of the mouth/lip/jaw region, each of which contains useful features in speech-reading. These parameters could also be used in combination with grey scale values of the image to benefit from the relative advantages of both schemes.
The extracted visual speech feature vectors are then normalized in block 424 with respect to the frontal pose estimates generated by the detection module 420. The normalized visual speech feature vectors are then provided to a probability module 426.
Similar to the probability module 416 in the audio information path which labels the acoustic feature vectors with one or more phonemes, the probability module 426 labels the extracted visual speech vectors with one or more previously stored phonemes. Again, each phoneme associated with one or more visual speech feature vectors has a probability associated therewith indicating the likelihood that it was that particular acoustic unit that was spoken in the video segment being considered. Thus, the probability module yields likelihood scores for each considered phoneme in the form of the probability that, given a particular phoneme or acoustic unit (au), the acoustic unit represents the uttered speech characterized by one or more visual speech feature vectors V or, in other words, P(V|acoustic unit). Alternatively, the visual speech feature vectors may be labeled with visemes which, as previously mentioned, are visual phonemes or canonical mouth shapes that accompany speech utterances. Next, the probabilities generated by modules 416 and 426 are jointly used by AV probability module 430. In module 430, the respective probabilities from modules 416 and 426 are combined based on a confidence measure 432. Confidence estimation refers to a likelihood or other confidence measure being determined with regard to the recognized input. Recently, efforts have been initiated to develop appropriate confidence measures for recognized speech. In LVCSR Hub5 Workshop, April 29 - May 1, 1996,
MITAGS, MD, organized by NIST and DARPA, different approaches are proposed to attach to each word a confidence level. A first method uses decision trees trained on word-dependent features (amount of training utterances, minimum and average triphone occurrences, occurrence in language model training, number of phonemes/lefemes, duration, acoustic score (fast match and detailed match), speech or non-speech), sentence-dependent features (signal-to-noise ratio, estimates of speaking rates: number of words or of lefemes or of vowels per second, sentence likelihood provided by the language model, trigram occurrence in the language model), word in a context features (trigram occurrence in language model) as well as speaker profile features (accent, dialect, gender, age, speaking rate, identity, audio quality, SNR, etc.). A probability of error is computed on the training data for each of the leaves of the tree. Algorithms to build such trees are disclosed, for example, in Breiman et al., "Classification and regression trees," Chapman & Hall, 1993. At recognition, all or some of these features are measured during recognition and for each word the decision tree is walked to a leaf which provides a confidence level. In C. Neti, S. Roukos and E. Eide, "Word based confidence measures as a guide for stack search in speech recognition," ICASSP97,
Munich, Germany, April 1997, a method is described that relies entirely on scores returned by the IBM stack decoder (using log-likelihood - actually the average incremental log-likelihood, detailed match, fast match). In the LVCSR proceeding, another method to estimate the confidence level uses predictors via linear regression. The predictors used are: the word duration, the language model score, the average acoustic score (best score) per frame and the fraction of the N-best list with the same word as top choice. The present embodiment preferably offers a combination of these two approaches
(confidence level measured via decision trees and via linear predictors) to systematically extract the confidence level in any translation process, not limited to speech recognition. Another method to detect incorrectly recognized words is disclosed in U.S. Patent No. 5,937,383 entitled "Apparatus and Methods for Speech Recognition Including Individual or Speaker Class Dependent Decoding History Caches for Fast Word Acceptance or Rejection," the disclosure of which is incorporated herein by reference.
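One of the cited approaches, confidence estimation via linear regression over per-word predictors (word duration, language model score, average acoustic score per frame, fraction of the N-best list agreeing on the word), can be sketched as below. The feature layout, helper names and training routine are assumptions made for illustration only.

```python
import numpy as np

def fit_confidence_predictor(features, correctness):
    """Least-squares linear predictor of per-word correctness (0/1).
    `features` holds one row of predictors per training word."""
    X = np.column_stack([features, np.ones(len(features))])  # bias term
    w, *_ = np.linalg.lstsq(X, correctness, rcond=None)
    return w

def predict_confidence(word_features, w):
    """Confidence level for one word, clipped into [0, 1]."""
    x = np.append(word_features, 1.0)
    return float(np.clip(x @ w, 0.0, 1.0))
```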
Thus, based on the confidence measure, the probability module 430 decides which probability, i.e., the probability from the visual information path or the probability from the audio information path, to rely on more. This determination may be represented in the following manner:
$$p = w_1 \, vp + w_2 \, ap$$
It is to be understood that vp represents a probability associated with the visual information, ap represents a probability associated with the corresponding audio information, and w1 and w2 represent respective weights. Thus, based on the confidence measure 432, the module 430 assigns appropriate weights to the probabilities. For instance, if the surrounding environmental noise level is particularly high, i.e., resulting in a lower acoustic confidence measure, there is more of a chance that the probabilities generated by the acoustic decoding path contain errors. Thus, the module 430 assigns a lower weight for w2 than for w1, placing more reliance on the decoded information from the visual path. However, if the noise level is low and thus the acoustic confidence measure is relatively higher, the module may set w2 higher than w1. Alternatively, a visual confidence measure may be used. It is to be appreciated that the first joint use of the visual information and audio information in module 430 is referred to as decision or score fusion. An alternative embodiment implements feature fusion as described in the above-referenced U.S. patent application identified as Serial No. 09/369,707 (attorney docket no. YO999-317).
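The confidence-based weighting can be sketched as follows. Mapping the per-stream confidence measures directly to normalized weights is a simplifying assumption of this sketch; the document does not prescribe exactly how w1 and w2 are derived from the confidence measure 432.

```python
def fuse_scores(vp, ap, audio_confidence, visual_confidence=1.0):
    """Score fusion w1*vp + w2*ap, with the weights derived from the
    per-stream confidences (an illustrative choice): a noisy audio
    channel lowers audio_confidence, shifting weight onto the visual
    stream, and vice versa."""
    total = audio_confidence + visual_confidence
    w1 = visual_confidence / total   # weight on the visual-path probability
    w2 = audio_confidence / total    # weight on the audio-path probability
    return w1 * vp + w2 * ap
```

For example, with a completely unreliable audio channel (audio_confidence = 0), the fused score reduces to the visual probability alone.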
Then, a search is performed in search module 434 with language models (LM) based on the weighted probabilities received from module 430. That is, the acoustic units identified as having the highest probabilities of representing what was uttered in the arbitrary content video are put together to form words. The words are output by the search engine 434 as the decoded system output. A conventional search engine may be employed. This output is provided to the dialog manager 18 of FIG. 1 for use in disambiguating the user's intent, as described above. In a preferred embodiment, the audio-visual speech recognition module of FIG. 4 also includes an event detection module 428. As previously mentioned, one problem of conventional speech recognition systems is their inability to discriminate between extraneous audible activity, e.g., background noise or background speech not intended to be decoded, and speech that is indeed intended to be decoded. This causes such problems as misfiring of the system and "junk" recognition. According to various embodiments, the module may use information from the video path only, information from the audio path only, or information from both paths simultaneously to decide whether or not to decode information. This is accomplished via the event detection module 428. It is to be understood that "event detection" refers to the determination of whether or not an actual speech event that is intended to be decoded is occurring or is going to occur. Based on the output of the event detection module, microphone 406 or the search engine 434 may be enabled/disabled. Note that if no face is detected, then the audio can be processed alone to make decisions.
Referring now to FIG. 5C, an illustrative event detection method using information from the video path only to make the detection decision is shown. To make this determination, the event detection module 428 receives input from the frontal pose detector 420, the visual feature extractor 422 (via the pose normalization block 424), and the audio feature extractor 414.
First, in step 510, any mouth openings on a face identified as "frontal" are detected. This detection is based on the tracking of the facial features associated with a detected frontal face, as described in detail above with respect to modules 418 and 420.
If a mouth opening or some mouth motion is detected, microphone 406 is turned on, in step 512. Once the microphone is turned on, any signal received therefrom is stored in a buffer (step 514). Then, mouth opening pattern recognition (e.g., periodicity) is performed on the mouth movements associated with the buffered signal to determine if what was buffered was in fact speech (step 516). This is determined by comparing the visual speech feature vectors to pre-stored visual speech patterns consistent with speech. If the buffered data is tagged as speech, in step 518, the buffered data is sent on through the acoustic path so that the buffered data may be recognized, in step 520, so as to yield a decoded output. The process is repeated for each subsequent portion of buffered data until no more mouth openings are detected. In such case, the microphone is then turned off. It is to be understood that FIG. 5C depicts one example of how visual information (e.g., mouth openings) is used to decide whether or not to decode an input audio signal. The event detection module may alternatively control the search module 434, e.g., turning it on or off, in response to whether or not a speech event is detected. Thus, the event detection module is generally a module that decides whether an input signal captured by the microphone is speech given audio and corresponding video information or, P(Speech|AV). It is also to be appreciated that the event detection methodology may be performed using the audio path information only. In such case, the event detection module 428 may perform one or more speech-only based detection methods such as, for example: signal energy level detection (e.g., is the audio signal above a given level); signal zero crossing detection (e.g., is the zero-crossing rate high enough); voice activity detection (non-stationarity of the spectrum) as described in, e.g., N.R.
Garner et al., "Robust noise detection for speech recognition and enhancement," Electronics Letters, Feb. 1997, vol. 33, no. 4, pp. 270-271; D.K. Freeman et al., "The voice activity detector of the pan-European digital mobile telephone service," IEEE 1989, CH2673-2; N.R. Garner, "Speech detection in adverse mobile telephony acoustic environments," to appear in Speech Communications; B.S. Atal et al., "A pattern recognition approach to voiced-unvoiced-silence classification with applications to speech recognition," IEEE Trans. Acoustic, Speech and Signal Processing, vol. ASSP-24, no. 3, 1976. See also, L.R. Rabiner, "Digital processing of speech signals," Prentice-Hall, 1978. Referring now to FIG. 5D, an illustrative event detection method simultaneously using both information from the video path and the audio path to make the detection decision is shown. The flow diagram illustrates unsupervised utterance verification methodology as is also described in the U.S. patent application identified as U.S. Serial No. 09/369,706 (attorney docket no. YO999-318), filed August 6, 1999 and entitled: "Methods And Apparatus for Audio-Visual Speaker Recognition and Utterance
Verification," the disclosure of which is incorporated by reference herein. In the unsupervised mode, utterance verification is performed when the text (script) is not known or available to the system.
Thus, in step 522, the uttered speech to be verified may be decoded by classical speech recognition techniques so that a decoded script and associated time alignments are available. This is accomplished using the feature data from the acoustic feature extractor 414. Contemporaneously, in step 524, the visual speech feature vectors from the visual feature extractor 422 are used to produce a visual phoneme (viseme) sequence. Next, in step 526, the script is aligned with the visemes. A rapid (or other) alignment may be performed in a conventional manner in order to attempt to synchronize the two information streams. For example, in one embodiment, rapid alignment as disclosed in the U.S. patent application identified as Serial No. 09/015,150 (docket no. YO997-386) and entitled "Apparatus and Method for Generating Phonetic Transcription from Enrollment Utterances," the disclosure of which is incorporated by reference herein, may be employed. Then, in step 528, a likelihood on the alignment is computed to determine how well the script aligns to the visual data. The results of the likelihood are then used, in step 530, to decide whether an actual speech event occurred or is occurring and whether the information in the paths needs to be recognized.
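A drastically simplified sketch of the script/viseme comparison in steps 522-530 is given below: the decoded phoneme script is mapped through a phoneme-to-viseme table (a tiny, hypothetical subset; real viseme inventories are larger) and scored against the observed viseme sequence by per-position agreement. This agreement ratio is only a crude stand-in for the true alignment likelihood of step 528.

```python
# Illustrative phoneme-to-viseme map (hypothetical subset).
PHONEME_TO_VISEME = {"p": "bilabial", "b": "bilabial", "m": "bilabial",
                     "f": "labiodental", "v": "labiodental",
                     "aa": "open", "iy": "spread"}

def alignment_likelihood(decoded_phonemes, observed_visemes):
    """Fraction of aligned positions where the viseme predicted from the
    decoded script matches the observed viseme."""
    predicted = [PHONEME_TO_VISEME.get(p) for p in decoded_phonemes]
    n = min(len(predicted), len(observed_visemes))
    if n == 0:
        return 0.0
    hits = sum(1 for i in range(n) if predicted[i] == observed_visemes[i])
    return hits / n

def is_speech_event(decoded_phonemes, observed_visemes, threshold=0.6):
    """Step 530: declare a speech event when the script explains the
    visual data well enough (threshold is a hypothetical value)."""
    return alignment_likelihood(decoded_phonemes, observed_visemes) >= threshold
```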
The audio-visual speech recognition module of FIG. 4 may apply one of, a combination of two of, or all three of, the approaches described above in the event detection module 428 to perform event detection. Video-only based detection is useful because the module can make the detection decision when the background noise is too high for a speech-only decision. The audio-only approach is useful when speech occurs without a visible face present. The combined approach offered by unsupervised utterance verification improves the decision process when a face is detectable with the right pose, improving the acoustic decision.
Besides minimizing or eliminating recognition engine misfiring and/or "junk" recognition, the event detection methodology provides better modeling of background noise; that is, when no speech is detected, silence is detected. Also, for embedded applications, such event detection provides additional advantages. For example, the CPU associated with an embedded device can focus on other tasks instead of having to run in a speech detection mode. Also, a battery power savings is realized since the speech recognition engine and associated components may be powered off when no speech is present. Other general applications of this speech detection methodology include: (i) use with a visible electromagnetic spectrum image or non-visible electromagnetic spectrum image (e.g., far IR) camera in vehicle-based speech detection or noisy environments; (ii) speaker detection in an audience to focus local or array microphones; (iii) speaker recognition (as in the above-referenced U.S. patent application identified by docket no. YO999-318) and tagging in broadcast news or televideo conferencing. One of ordinary skill in the art will contemplate other applications given the inventive teachings described herein.
It is to be appreciated that the audio-visual speech recognition module of FIG. 4 may employ the alternative embodiments of audio-visual speech detection and recognition described in the above-referenced U.S. patent application identified as Serial No. 09/369,707 (attorney docket no. YO999-317). For instance, whereas the embodiment of FIG. 4 illustrates a decision or score fusion approach, the module may employ a feature fusion approach and/or a serial rescoring approach, as described in the above-referenced U.S. patent application identified as Serial No. 09/369,707 (attorney docket no. YO999-317).
B. Audio-visual speaker recognition

Referring now to FIG. 6, a block diagram illustrates a preferred embodiment of an audio-visual speaker recognition module that may be employed as one of the recognition modules of FIG. 1 to perform speaker recognition using multi-modal input data received in accordance with the invention. It is to be appreciated that such an audio-visual speaker recognition module is disclosed in the above-referenced U.S. patent application identified as Serial No. 09/369,706 (attorney docket no. YO999-318), filed on August 6, 1999 and entitled "Methods And Apparatus for Audio-Visual Speaker Recognition and Utterance Verification." A description of one of the embodiments of such an audio-visual speaker recognition module for use in a preferred embodiment of the multi-modal conversational computing system of the invention is provided below in this section. However, it is to be appreciated that other mechanisms for performing speaker recognition may be employed.
The audio-visual speaker recognition and utterance verification module shown in FIG. 6 uses a decision fusion approach. Like the audio-visual speech recognition module of FIG. 4, the speaker recognition module of FIG. 6 may receive the same types of arbitrary content video from the camera 604 and audio from the microphone 606 via the I/O manager 14. While the camera and microphone have different reference numerals in FIG. 6 than in FIG. 4, it is to be appreciated that they may be the same camera and microphone.
A phantom line denoted by Roman numeral I represents the processing path the audio information signal takes within the module, while a phantom line denoted by Roman numeral II represents the processing path the video information signal takes within the module. First, the audio signal path I will be discussed, then the video signal path II, followed by an explanation of how the two types of information are combined to provide improved speaker recognition accuracy.
The module includes an auditory feature extractor 614. The feature extractor 614 receives an audio or speech signal and, as is known in the art, extracts spectral features from the signal at regular intervals. The spectral features are in the form of acoustic feature vectors (signals) which are then passed on to an audio speaker recognition module
616. Before acoustic vectors are extracted, the speech signal may be sampled at a rate of 16 kilohertz (kHz). A frame may consist of a segment of speech having a 25 millisecond (msec) duration. In such an arrangement, the extraction process preferably produces 24 dimensional acoustic cepstral vectors via the process described below. Frames are advanced every 10 msec to obtain succeeding acoustic vectors. Of course, other front-ends may be employed.
First, in accordance with a preferred acoustic feature extraction process, magnitudes of discrete Fourier transforms of samples of speech data in a frame are considered in a logarithmically warped frequency scale. Next, these amplitude values themselves are transformed to a logarithmic scale. The latter two steps are motivated by a logarithmic sensitivity of human hearing to frequency and amplitude. Subsequently, a rotation in the form of discrete cosine transform is applied. One way to capture the dynamics is to use the delta (first-difference) and the delta-delta (second-order differences) information. An alternative way to capture dynamic information is to append a set of (e.g., four) preceding and succeeding vectors to the vector under consideration and then project the vector to a lower dimensional space, which is chosen to have the most discrimination. The latter procedure is known as Linear Discriminant Analysis (LDA) and is well known in the art. It is to be understood that other variations on features may be used, e.g., LPC cepstra, PLP, etc., and that the invention is not limited to any particular type.
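The framing and cepstral steps described above can be sketched as follows: 25 msec frames advanced every 10 msec at 16 kHz, then |DFT| -> log -> DCT per frame. The logarithmic frequency warp and the delta/LDA stages are omitted for brevity, so this is a simplified stand-in for the actual front-end, not a faithful reproduction of it.

```python
import numpy as np

def frame_signal(signal, rate=16000, frame_ms=25, step_ms=10):
    """Slice a waveform into overlapping 25 ms frames advanced by 10 ms."""
    frame = int(rate * frame_ms / 1000)   # 400 samples at 16 kHz
    step = int(rate * step_ms / 1000)     # 160 samples at 16 kHz
    n = 1 + max(0, (len(signal) - frame) // step)
    return np.stack([signal[i * step: i * step + frame] for i in range(n)])

def cepstra(frames, n_ceps=24):
    """Per-frame cepstra: magnitude spectrum -> log amplitude -> DCT-II,
    keeping the first 24 coefficients as in the described front-end."""
    spectrum = np.abs(np.fft.rfft(frames, axis=1))
    log_spec = np.log(spectrum + 1e-10)
    n = log_spec.shape[1]
    k = np.arange(n_ceps)[:, None]
    m = np.arange(n)[None, :]
    basis = np.cos(np.pi * k * (2 * m + 1) / (2 * n))  # DCT-II basis
    return log_spec @ basis.T
```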
After the acoustic feature vectors, denoted in FIG. 6 by the letter A, are extracted, they are provided to the audio speaker recognition module 616. It is to be understood that the module 616 may perform speaker identification and/or speaker verification using the extracted acoustic feature vectors. The processes of speaker identification and verification may be accomplished via any conventional acoustic information speaker recognition system. For example, speaker recognition module 616 may implement the recognition techniques described in the U.S. patent application identified by Serial No. 08/788,471, filed on January 28, 1997, and entitled: "Text Independent Speaker Recognition for Transparent Command Ambiguity Resolution and Continuous Access Control," the disclosure of which is incorporated herein by reference.
An illustrative speaker identification process for use in module 616 will now be described. The illustrative system is disclosed in H. Beigi, S.H. Maes, U.V. Chaudhari and J.S. Sorenson, "IBM model-based and frame-by-frame speaker recognition," Speaker Recognition and its Commercial and Forensic Applications, Avignon, France, 1998. The illustrative speaker identification system may use two techniques: a model-based approach and a frame-based approach. In the experiments described herein, we use the frame-based approach for speaker identification based on audio. The frame-based approach can be described in the following manner.
Let $M_i$ be the model corresponding to the $i$th enrolled speaker. $M_i$ is represented by a mixture Gaussian model defined by the parameter set

$$\{\mu_{i,j}, \Sigma_{i,j}, p_{i,j}\}_{j=1,\dots,n_i}$$

consisting of the mean vector, covariance matrix and mixture weights for each of the $n_i$ components of speaker $i$'s model. These models are created using training data consisting of a sequence of $K$ frames of speech with $d$-dimensional cepstral feature vectors $\{f_m\}_{m=1,\dots,K}$. The goal of speaker identification is to find the model $M_i$ that best explains the test data represented by a sequence of $N$ frames $\{f_n\}_{n=1,\dots,N}$. We use the following frame-based weighted likelihood distance measure $d_{i,n}$ in making the decision:

$$d_{i,n} = -\log \sum_{j=1}^{n_i} p_{i,j} \, p(f_n \mid \mu_{i,j}, \Sigma_{i,j})$$

The total distance $D_i$ of model $M_i$ from the test data is then taken to be the sum of the distances over all the test frames:

$$D_i = \sum_{n=1}^{N} d_{i,n}$$
Thus, the above approach finds the closest matching model, and the person whom that model represents is determined to be the person whose utterance is being processed. Speaker verification may be performed in a similar manner; however, the input acoustic data is compared to determine if the data matches closely enough with stored models. If the comparison yields a close enough match, the person uttering the speech is verified. The match is accepted or rejected by comparing the match with competing models. These models can be selected to be similar to the claimant speaker or be speaker independent (i.e., a single or a set of speaker independent models). If the claimant wins and wins with enough margin (computed at the level of the likelihood or the distance to the models), we accept the claimant. Otherwise, the claimant is rejected. It should be understood that, at enrollment, the input speech is collected for a speaker to build the mixture Gaussian model $M_i$ that characterizes each speaker.
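A sketch of the frame-based decision rule: each cepstral frame is scored against each enrolled speaker's Gaussian mixture model, per-frame distances are summed into a total distance per model, and the closest model wins. Using the negative log-likelihood under a diagonal-covariance mixture as the per-frame distance is one plausible reading of the weighted likelihood measure, assumed here for illustration.

```python
import numpy as np

def frame_distance(frame, means, variances, weights):
    """Negative log-likelihood of one cepstral frame under a
    diagonal-covariance Gaussian mixture (one form of d_{i,n})."""
    diff = frame - means                              # (n_components, dim)
    exponent = -0.5 * np.sum(diff**2 / variances, axis=1)
    norm = np.sqrt((2 * np.pi)**frame.size * np.prod(variances, axis=1))
    likelihood = np.sum(weights * np.exp(exponent) / norm)
    return -np.log(likelihood + 1e-300)

def identify_speaker(frames, models):
    """Total distance D_i = sum over frames of d_{i,n}; return the index
    of the closest enrolled model."""
    totals = []
    for means, variances, weights in models:
        totals.append(sum(frame_distance(f, means, variances, weights)
                          for f in frames))
    return int(np.argmin(totals))
```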
Referring now to the video signal path II of FIG. 6, the methodologies of processing visual information will now be explained. The audio-visual speaker recognition and utterance verification module includes an active speaker face segmentation module 620 and a face recognition module 624. The active speaker face segmentation module 620 receives video input from camera 604. It is to be appreciated that speaker face detection can also be performed directly in the compressed data domain and/or from audio and video information rather than just from video information. In any case, segmentation module 620 generally locates and tracks the speaker's face and facial features within the arbitrary video background. This will be explained in detail below. From data provided by the segmentation module 620, an identification and/or verification operation may be performed by recognition module 624 to identify and/or verify the face of the person assumed to be the speaker in the video. Verification can also be performed by adding score thresholding or competing models. Thus, the visual mode of speaker identification is implemented as a face recognition system where faces are found and tracked in the video sequences, and recognized by comparison with a database of candidate face templates. As will be explained later, utterance verification provides a technique to verify that the person actually uttered the speech used to recognize him. Face detection and recognition may be performed in a variety of ways. For example, in an embodiment employing an infrared camera 604, face detection and identification may be performed as disclosed in Francine J. Prokoski and Robert R. Riedel, "Infrared Identification of Faces and Body Parts," BIOMETRICS, Personal Identification in Networked Society, Kluwer Academic Publishers, 1999. In a preferred embodiment, techniques described in Andrew Senior, "Face and feature finding for face recognition system," 2nd Int. Conf. on Audio-Video based Biometric Person Authentication, Washington DC, March 1999, are employed. The following is an illustrative description of face detection and recognition as respectively performed by segmentation module 620 and recognition module 624.
Face Detection
Faces can occur at a variety of scales, locations and orientations in the video frames. In this system, we make the assumption that faces are close to the vertical, and that there is no face smaller than 66 pixels high. However, to test for a face at all the remaining locations and scales, the system searches for a fixed size template in an image pyramid. The image pyramid is constructed by repeatedly down-sampling the original image to give progressively lower resolution representations of the original frame.
Within each of these sub-images, we consider all square regions of the same size as our face template (typically 11x11 pixels) as candidate face locations. A sequence of tests is used to test whether a region contains a face or not.
First, the region must contain a high proportion of skin-tone pixels, and then the intensities of the candidate region are compared with a trained face model. Pixels falling into a pre-defined cuboid of hue-chromaticity-intensity space are deemed to be skin tone, and the proportion of skin tone pixels must exceed a threshold for the candidate region to be considered further.
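The skin-tone gate can be sketched as below. The cuboid bounds and the fraction threshold are hypothetical placeholders, since the actual values would be pre-defined from training data.

```python
import numpy as np

# Hypothetical skin-tone cuboid bounds in (hue, chromaticity, intensity).
SKIN_LO = np.array([0.0, 0.2, 0.3])
SKIN_HI = np.array([0.1, 0.7, 0.9])

def skin_tone_fraction(hci_pixels):
    """Fraction of pixels falling inside the skin-tone cuboid."""
    inside = np.all((hci_pixels >= SKIN_LO) & (hci_pixels <= SKIN_HI), axis=-1)
    return float(inside.mean())

def is_candidate_face(hci_region, threshold=0.4):
    """First gate of the face test: enough skin-tone pixels present for
    the region to be considered further."""
    return skin_tone_fraction(hci_region.reshape(-1, 3)) >= threshold
```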
The face model is based on a training set of cropped, normalized, grey-scale face images. Statistics of these faces are gathered and a variety of classifiers are trained based on these statistics. A Fisher linear discriminant (FLD) trained with a linear program is found to distinguish between faces and background images, and "Distance from face space" (DFFS), as described in M. Turk and A. Pentland, "Eigenfaces for Recognition," Journal of Cognitive Neuro Science, vol. 3, no. 1, pp. 71-86, 1991, is used to score the quality of faces given high scores by the first method. A high combined score from both these face detectors indicates that the candidate region is indeed a face. Candidate face regions with small perturbations of scale, location and rotation relative to high-scoring face candidates are also tested and the maximum scoring candidate among the perturbations is chosen, giving refined estimates of these three parameters. In subsequent frames, the face is tracked by using a velocity estimate to predict the new face location, and models are used to search for the face in candidate regions near the predicted location with similar scales and rotations. A low score is interpreted as a failure of tracking, and the algorithm begins again with an exhaustive search.
Face Recognition

Having found the face, K facial features are located using the same techniques
(FLD and DFFS) used for face detection. Features are found using a hierarchical approach where large-scale features, such as eyes, nose and mouth are first found, then sub-features are found relative to these features. As many as 29 sub-features are used, including the hairline, chin, ears, and the corners of mouth, nose, eyes and eyebrows. Prior statistics are used to restrict the search area for each feature and sub-feature relative to the face and feature positions, respectively. At each of the estimated sub-feature locations, a Gabor Jet representation, as described in L. Wiskott and C. von der Malsburg, "Recognizing Faces by Dynamic Link Matching," Proceedings of the International Conference on Artificial Neural Networks, pp. 347-352, 1995, is generated. A Gabor jet is a set of two-dimensional Gabor filters - each a sine wave modulated by a Gaussian.
Each filter has scale (the sine wavelength and Gaussian standard deviation with fixed ratio) and orientation (of the sine wave). We use five scales and eight orientations, giving 40 complex coefficients (a(j), j = 1,..., 40) at each feature location.
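A sketch of such a jet: five scales and eight orientations of a complex Gabor filter, each applied to the patch around a feature location, yielding 40 complex coefficients. The kernel parameterization details (wavelength set, Gaussian-to-wavelength ratio) are assumptions for illustration.

```python
import numpy as np

def gabor_kernel(size, wavelength, orientation):
    """2-D Gabor filter: a complex sine wave of the given wavelength and
    orientation, modulated by a Gaussian whose standard deviation is a
    fixed ratio of the wavelength."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(orientation) + y * np.sin(orientation)
    sigma = 0.5 * wavelength                      # fixed ratio (assumed)
    envelope = np.exp(-(x**2 + y**2) / (2 * sigma**2))
    carrier = np.exp(2j * np.pi * xr / wavelength)
    return envelope * carrier

def gabor_jet(patch, wavelengths=(2, 4, 6, 8, 10), n_orient=8):
    """40 complex coefficients (5 scales x 8 orientations) at a feature
    location: inner product of the patch with each filter."""
    coeffs = []
    for wl in wavelengths:
        for k in range(n_orient):
            g = gabor_kernel(patch.shape[0], wl, np.pi * k / n_orient)
            coeffs.append(np.sum(patch * g))
    return np.array(coeffs)
```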
A simple distance metric is used to compute the distance between the feature vectors for trained faces and the test candidates. The distance between the ith trained candidate and a test candidate for feature k is defined as:
$$S_k(i) = \frac{\sum_{j=1}^{40} |a_i(j)| \, |a_t(j)|}{\sqrt{\sum_{j=1}^{40} |a_i(j)|^2 \, \sum_{j=1}^{40} |a_t(j)|^2}}$$
A simple average of these similarities, $S_i = \frac{1}{K} \sum_{k=1}^{K} S_k(i)$, gives an overall measure for the similarity of the test face to the face template in the database. Accordingly, based on the similarity measure, an identification and/or verification of the person in the video sequence under consideration is made.
Next, the results of the face recognition module 624 and the audio speaker recognition module 616 are provided to respective confidence estimation blocks 626 and 618 where confidence estimation is performed. Confidence estimation refers to a likelihood or other confidence measure being determined with regard to the recognized input. In one embodiment, the confidence estimation procedure may include measurement of noise levels respectively associated with the audio signal and the video signal. These levels may be measured internally or externally with respect to the system. A higher level of noise associated with a signal generally means that the confidence attributed to the recognition results associated with that signal is lower. Therefore, these confidence measures are taken into consideration during the weighting of the visual and acoustic results discussed below. Given the audio-based speaker recognition and face recognition scores provided by respective modules 616 and 624, audio-visual speaker identification/verification may be performed by a joint identification/verification module 630 as follows. The top N scores are generated based on both audio- and video-based identification techniques. The two lists are combined by a weighted sum and the best-scoring candidate is chosen. Since the weights need only be defined up to a scaling factor, we can define the combined score $S^{av}$ as a function of the single parameter $\alpha$:
S_i^av = cos α · D_i + sin α · S_i
The mixture angle α has to be selected according to the relative reliability of audio identification and face identification. One way to achieve this is to optimize α in order to maximize the audio-visual accuracy on some training data. Let us denote by D_i(n) and S_i(n) the audio ID (identification) and video ID scores for the i-th enrolled speaker (i = 1...P) computed on the n-th training clip. Let us define the variable
T_i(n), equal to zero when the n-th clip belongs to the i-th speaker and one otherwise. The cost function to be minimized is the empirical error, as discussed in V.N. Vapnik, "The Nature of Statistical Learning Theory," Springer, 1995, which can be written as:
C(α) = (1/N) Σ_{n=1}^{N} T_{î(n)}(n), where î(n) = arg max_i S_i^av(n)

and where:
S_i^av(n) = cos α · D_i(n) + sin α · S_i(n).

In order to prevent over-fitting, one can also resort to the smoothed error rate, as discussed in H. Ney, "On the Probabilistic Interpretation of Neural Network Classification and Discriminative Training Criteria," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 17, no. 2, pp. 107-119, 1995, defined as:
C'(α) = (1/N) Σ_{n=1}^{N} Σ_{i=1}^{P} T_i(n) · [S_i^av(n)]^η / Σ_{j=1}^{P} [S_j^av(n)]^η,

where η > 0 is a smoothing exponent.
When η is large, all the terms of the inner sum approach zero, except for i = î(n), and C'(α) approaches the raw error count C(α). Otherwise, all the incorrect hypotheses (those for which T_i(n) = 1) have a contribution that is a decreasing function of the distance between their score and the maximum score. If the best hypothesis is incorrect, it has the largest contribution. Hence, by minimizing the latter cost function, one tends to maximize not only the recognition accuracy on the training data, but also the margin by which the best score wins. This function also presents the advantage of being differentiable, which can facilitate the optimization process when there is more than one parameter. The audio-visual speaker recognition module of FIG. 6 provides another decision or score fusion technique, derived from the previous technique, but which does not require any training. It consists in selecting at testing time, for each clip, the value of α in a given range which maximizes the difference between the highest and the second highest scores. The corresponding best hypothesis î(n) is then chosen. We have:
α̂(n) = arg max_{α1 ≤ α ≤ α2} [max_i S_i^av(n) − 2nd max_i S_i^av(n)], and

î(n) = arg max_i S_i^av(n), evaluated at α = α̂(n).
The values of α1 and α2 should be restricted to the interval [0, π/2]. The rationale of this technique is the following. In the (D_i(n), S_i(n)) plane, the point corresponding to the correct decision is expected to lie apart from the others. The fixed linear weights assume that the "direction" where this point can be found relative to the others is always the same, which is not necessarily true. The equations for α̂(n) and î(n) above find the point which lies farthest apart from the others in any direction between α1 and α2. Another interpretation is that the distance between the best combined score and the second best is an indicator of the reliability of the decision. The method adaptively chooses the weights which maximize that confidence measure.
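Both fusion strategies can be sketched as follows, assuming score arrays indexed by speaker; the function names and the discrete search grid over α are illustrative, not part of the patent:

```python
import numpy as np

def fused_scores(audio_scores, video_scores, alpha):
    # Combined score S_i^av = cos(alpha) * D_i + sin(alpha) * S_i,
    # for a fixed mixture angle alpha.
    return (np.cos(alpha) * np.asarray(audio_scores, dtype=float)
            + np.sin(alpha) * np.asarray(video_scores, dtype=float))

def adaptive_identify(audio_scores, video_scores, alphas):
    # Training-free fusion: pick, per clip, the alpha in the candidate
    # range that maximizes the margin between the best and second-best
    # combined scores, then return the best-scoring speaker at that alpha.
    best_id, best_alpha, best_margin = None, None, -np.inf
    for a in alphas:
        s = fused_scores(audio_scores, video_scores, a)
        top2 = np.sort(s)[-2:]
        margin = top2[1] - top2[0]
        if margin > best_margin:
            best_margin, best_alpha = margin, a
            best_id = int(np.argmax(s))
    return best_id, best_alpha
```

With clearly audio-dominated scores, the margin is maximized near α = 0, so the adaptive rule effectively trusts the audio path for that clip.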
Thus, the joint identification/verification module 630 makes a decision with regard to the speaker. In a verification scenario, based on one of the techniques described above, a decision may be made to accept the speaker if he is verified via both the acoustic path and the visual path. However, he may be rejected if he is only verified through one of the paths. In an identification scenario, for example, the top three scores from the face identification process may be combined with the top three scores from the acoustic speaker identification process. Then, the candidate with the highest combined score is identified as the speaker.
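The accept/reject logic just described can be sketched minimally as follows; a plain unweighted sum of the two top-3 score lists is assumed for the combination, which is an illustrative simplification:

```python
def verification_decision(audio_verified, video_verified):
    # Accept the claimed identity only if BOTH the acoustic path
    # and the visual path verified the speaker.
    return audio_verified and video_verified

def identification_decision(audio_top3, video_top3):
    # Combine two top-3 score lists (dicts: speaker -> score) by
    # summing scores; the highest combined score identifies the
    # speaker. Equal weighting is assumed here for simplicity.
    combined = {}
    for scores in (audio_top3, video_top3):
        for speaker, score in scores.items():
            combined[speaker] = combined.get(speaker, 0.0) + score
    return max(combined, key=combined.get)
```

In practice the two lists would be weighted by the per-modality confidence measures described earlier, rather than summed equally.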
In a preferred embodiment, before the module makes a final disposition with respect to the speaker, the system performs an utterance verification operation. It is to be appreciated that utterance verification is performed by the utterance verification module 628 (FIG. 6) based on input from the acoustic feature extractor 614 and a visual speech feature extractor 622. Before describing utterance verification, a description of illustrative techniques for extracting visual speech feature vectors follows. Particularly, the visual speech feature extractor 622 extracts visual speech feature vectors (e.g., mouth or lip-related parameters), denoted in FIG. 6 by the letter V, from the face detected in the video frame by the active speaker face segmentation module 620.
Examples of visual speech features that may be extracted are grey scale parameters of the mouth region; geometric/model-based parameters such as the area, height, and width of the mouth region; lip contours arrived at by curve fitting; spline parameters of the inner/outer contour; and motion parameters obtained by three-dimensional tracking. Still another feature set that may be extracted via module 622 takes into account the above factors. One such technique is known as Active Shape modeling and is described in Iain Matthews, "Features for audio visual speech recognition," Ph.D. dissertation, School of Information Systems, University of East Anglia, January 1998. Thus, while the visual speech feature extractor 622 may implement one or more known visual feature extraction techniques, in one embodiment, the extractor extracts grey scale parameters associated with the mouth region of the image. Given the location of the lip corners, after normalization of scale and rotation, a rectangular region containing the lip region at the center of the rectangle is extracted from the original decompressed video frame. Principal Component Analysis (PCA), as is known, may be used to extract a vector of smaller dimension from this vector of grey-scale values.
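The grey-scale extraction step can be sketched as below, assuming the lip corners have already been located and scale/rotation have been normalized; the patch dimensions and the offline-learned PCA basis are illustrative assumptions:

```python
import numpy as np

def lip_region_vector(frame, lip_center, half_h=16, half_w=24, pca_basis=None):
    # Cut a rectangular region centered on the lips out of the
    # decompressed grey-scale video frame and flatten its pixel
    # values into one feature vector.
    r, c = lip_center
    patch = frame[r - half_h:r + half_h, c - half_w:c + half_w]
    vec = patch.astype(float).ravel()
    if pca_basis is not None:
        # Project onto a lower-dimensional PCA basis (rows = principal
        # components) learned offline from training lip images.
        vec = pca_basis @ (vec - vec.mean())
    return vec
```

Without a PCA basis the function returns the raw grey-scale vector; with one, it returns the reduced-dimension projection used as the visual speech feature V.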
Another method of extracting visual feature vectors that may be implemented in module 622 may include extracting geometric features. This entails extracting the phonetic/visemic information from the geometry of the lip contour and its time dynamics. Typical parameters may be the mouth corners, the height or the area of opening, and the curvature of the inner as well as the outer lips. Positions of articulators, e.g., teeth and tongue, may also be feature parameters, to the extent that they are discernible by the camera. The method of extraction of these parameters from grey scale values may involve minimization of a function (e.g., a cost function) that describes the mismatch between the lip contour associated with parameter values and the grey scale image. Color information may be utilized as well in extracting these parameters. From the captured (or demultiplexed and decompressed) video stream, one performs a boundary detection, the ultimate result of which is a parameterized contour, e.g., circles, parabolas, ellipses or, more generally, spline contours, each of which can be described by a finite set of parameters.
Still other features that can be extracted include two- or three-dimensional wire-frame model-based techniques of the type used in computer graphics for the purposes of animation. A wire-frame may consist of a large number of triangular patches. These patches together give a structural representation of the mouth/lip/jaw region, and each contains features useful in speech-reading. These parameters could also be used in combination with grey scale values of the image to benefit from the relative advantages of both schemes.
Given the extracted visual speech feature vectors (V) from extractor 622 and the acoustic feature vectors (A) from extractor 614, the AV utterance verifier 628 performs verification. Verification may involve a comparison of the resulting likelihood, for example, of aligning the audio on a random sequence of visemes. As is known, visemes, or visual phonemes, are generally canonical mouth shapes that accompany speech utterances, which are categorized and pre-stored similarly to acoustic phonemes. A goal associated with utterance verification is to make a determination that the speech used to verify the speaker in the audio path and the visual cues used to verify the speaker in the video path correlate or align. This allows the system to be confident that the speech data that is being used to recognize the speaker is actually what the speaker uttered. Such a determination has many advantages. For example, from the utterance verification, it can be determined whether the user is lip synching to a pre-recorded tape playback to attempt to fool the system. Also, from utterance verification, errors in the audio decoding path may be detected. Depending on the number of errors, a confidence measure may be produced and used by the system.
Referring now to FIG. 7, a flow diagram of an utterance verification methodology is shown. Utterance verification may be performed in: (i) a supervised mode, i.e., when the text (script) is known and available to the system; or (ii) an unsupervised mode, i.e., when the text (script) is not known or available to the system.
Thus, in step 702A (unsupervised mode), the uttered speech to be verified may be decoded by classical speech recognition techniques so that a decoded script and associated time alignments are available. This is accomplished using the feature data from the acoustic feature extractor 614. Contemporaneously, in step 704, the visual speech feature vectors from the visual feature extractor 622 are used to produce a visual phoneme (viseme) sequence.
Next, in step 706, the script is aligned with the visemes. A rapid (or other) alignment may be performed in a conventional manner in order to attempt to synchronize the two information streams. For example, in one embodiment, rapid alignment as disclosed in the U.S. patent application identified by Serial No. 09/015,150 (docket no. YO997-386) and entitled "Apparatus and Method for Generating Phonetic Transcription from Enrollment Utterances," the disclosure of which is incorporated by reference herein, may be employed. Note that in a supervised mode, step 702B replaces step 702A such that the expected or known script is aligned with the visemes in step 706, rather than the decoded version of the script. Then, in step 708, a likelihood on the alignment is computed to determine how well the script aligns to the visual data. The results of the likelihood are then provided to a decision block 632 which, along with the results of the score module 630, decides on a final disposition of the speaker, e.g., accept him or reject him. This may be used to allow or deny access to a variety of devices, applications, facilities, etc.
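A toy version of the alignment check in step 708 might look like the following; the phoneme-to-viseme table and the simple match-ratio "likelihood" are stand-ins for the time-aligned, model-based scoring a real system would use:

```python
# Hypothetical viseme classes; real systems map ~40 phonemes onto a
# much smaller set of canonical mouth shapes.
PHONEME_TO_VISEME = {"p": "bilabial", "b": "bilabial", "m": "bilabial",
                     "f": "labiodental", "v": "labiodental",
                     "aa": "open", "iy": "spread"}

def alignment_likelihood(script_phonemes, observed_visemes):
    # Score how well the (decoded or known) script aligns with the
    # observed viseme sequence: fraction of time-aligned segments
    # whose expected viseme matches the observed one.
    pairs = zip(script_phonemes, observed_visemes)
    matches = sum(1 for ph, vis in pairs
                  if PHONEME_TO_VISEME.get(ph) == vis)
    return matches / max(len(script_phonemes), 1)

def verify_utterance(script_phonemes, observed_visemes, threshold=0.7):
    # Accept the utterance only if the alignment score clears a
    # threshold (the threshold value here is arbitrary).
    return alignment_likelihood(script_phonemes, observed_visemes) >= threshold
```

A lip-synched playback attack would produce visemes uncorrelated with the decoded script, driving the score toward zero and causing rejection.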
So, in the unsupervised utterance verification mode, the system is able to check that the user is indeed speaking rather than using a playback device and moving his lips. Also, errors may be detected a priori in the audio decoding. In the supervised mode, the system is able to prove that the user uttered the text if the recognized text is sufficiently aligned or correlated to the extracted lip parameters.
It is to be appreciated that utterance verification in the unsupervised mode can be used to perform speech detection as disclosed in the above-referenced U.S. patent application identified as U.S. Serial No. 09/369,707 (attorney docket no. YO999-317). Indeed, if acoustic and visual activities are detected, they can be verified against each other. When the resulting acoustic utterance is accepted, the system considers that speech is detected. Otherwise, it is considered that extraneous activities are present. It is to be appreciated that the audio-visual speaker recognition module of FIG. 6 may employ the alternative embodiments of audio-visual speaker recognition described in the above-referenced U.S. patent application identified as Serial No. 09/369,706 (attorney docket no. YO999-318). For instance, whereas the embodiment of FIG. 6 illustrates a decision or score fusion approach, the module 20 may employ a feature fusion approach and/or a serial rescoring approach, as described in the above-referenced U.S. patent application identified as Serial No. 09/369,706 (attorney docket no. YO999-318).
It is to be further appreciated that the output of the audio-visual speaker recognition system of FIG. 6 is provided to the dialog manager 18 of FIG. 1 for use in disambiguating the user's intent, as explained above.
C. Conversational Virtual Machine
Referring now to FIGs. 8A and 8B, block diagrams illustrate a preferred embodiment of a conversational virtual machine (CVM). It is to be appreciated that such a conversational virtual machine is disclosed in the above-referenced PCT international patent application identified as US99/22927 (attorney docket no. YO999-111) filed on October 1, 1999 and entitled "Conversational Computing Via Conversational Virtual Machine." A description of one of the embodiments of such a machine for use in a preferred embodiment of the multi-modal conversational computing system of the present invention is provided below in this section. However, it is to be appreciated that other mechanisms for implementing conversational computing according to the invention may be employed, as explained below.
It is to be understood that the CVM described below may be employed to provide a framework for: portions of the I/O subsystem 12; I/O manager 14; recognition engines
16; dialog manager 18; and context stack 20 of FIG. 1. Throughout the description of the CVM below, the components of the CVM that may be employed to implement these functional components of FIG. 1 will be noted. However, while the CVM may be used because of its ability to implement an I/O manager, a modality-independent context manager (context stack), a dialog manager (when disambiguation is performed), a classifier (when mood or focus is determined), the required engines, and APIs/interfaces to the dialog manager to run applications, it is important to note that other mechanisms may alternatively be used to implement these functional components of a multi-modal conversational computing system of the invention. For example, functional components of a multi-modal conversational computing system of the invention may be implemented through a browser that carries these functions, an OSS (operating system service) layer, a VM (virtual machine), or even just an application that implements all these functionalities, possibly without explicitly identifying these components but rather by implementing hard-coded equivalent services. It is also to be appreciated that the implementation may support only the modalities of speech and video and, in such a case, does not need to support other modalities (e.g., handwriting, GUI, etc.).
Thus, the CVM may be employed as a main component for implementing conversational computing according to the conversational computing paradigm described above with respect to the present invention. In one embodiment, the CVM is a conversational platform or kernel running on top of a conventional OS (operating system) or RTOS (real-time operating system). A CVM platform can also be implemented with PvC (pervasive computing) clients as well as servers and can be distributed across multiple systems (clients and servers). In general, the CVM provides conversational APIs (application programming interfaces) and protocols between conversational subsystems (e.g., speech recognition engine, text-to-speech, etc.) and conversational and/or conventional applications. The CVM may also provide backward compatibility to existing applications, with a more limited interface. As discussed in detail below, the CVM provides conversational services and behaviors as well as conversational protocols for interaction with multiple applications and devices also equipped with a CVM layer, or at least conversationally aware.
It is to be understood that the different elements and protocol APIs described herein are defined on the basis of the function that they perform or the information that they exchange. Their actual organization or implementation can vary, e.g., they may be implemented by the same or different entities, as a component of a larger component, or as an independently instantiated object or a family of such objects or classes.
A CVM (or operating system) based on the conversational computing paradigm described herein allows a computer or any other interactive device to converse with a user. The CVM further allows the user to run multiple tasks on a machine even if the machine has no display or GUI capabilities, nor any keyboard, pen or pointing device. Indeed, the user can manage these tasks like a conversation and bring a task, or multiple simultaneous tasks, to closure. To manage tasks like a conversation, the CVM affords the capability of relying on mixed initiatives, contexts and advanced levels of abstraction to perform its various functions. Mixed initiative or free flow navigation allows a user to naturally complete, modify, or correct a request via dialog with the system. Mixed initiative also implies that the CVM can actively help (take the initiative to help) and coach a user through a task, especially in speech-enabled applications, wherein the mixed initiative capability is a natural way of compensating for a display-less system or a system with limited display capabilities. In general, the CVM complements conventional interfaces and user input/output rather than replacing them. This is the notion of "multi-modality," whereby speech, and video as described above, may be used in parallel with a mouse, keyboard, and other input devices such as a pen. Conventional interfaces can be replaced when device limitations constrain the implementation of certain interfaces. In addition, the ubiquity and uniformity of the resulting interface across devices, tiers and services is an additional mandatory characteristic. It is to be understood that a CVM system can, to a large extent, function with conventional input and/or output media. Indeed, a computer with classical keyboard inputs and pointing devices coupled with a traditional monitor display can profit significantly by utilizing the CVM. One example is described in the U.S. patent application identified as U.S. Serial No.
09/507,526 (attorney docket no. YO999-178) filed on February 18, 2000 and entitled "Multi-Modal Shell," which claims priority to the U.S. provisional patent application identified as U.S. Serial No. 60/128,081 filed on April 7, 1999 and the U.S. provisional patent application identified by Serial No. 60/158,777 filed on October 12, 1999, the disclosures of which are incorporated by reference herein (and which describes a method for constructing a true multi-modal application with tight synchronization between a GUI modality and a speech modality). In other words, even users who do not want to talk to their computer can realize a dramatic positive change to their interaction with the CVM-enabled machine.
Referring now to FIG. 8A, a block diagram illustrates a CVM system according to a preferred embodiment, which may be implemented on a client device or a server. In terms of the vehicle example above, this means that the components of the system 10 may be located locally (in the vehicle), remotely (e.g., connected wirelessly to the vehicle), or some combination thereof. In general, the CVM provides a universal coordinated multi-modal conversational user interface (CUI) 780. The "multi-modality" aspect of the CUI implies that various I/O resources such as voice, keyboard, pen, and pointing device (mouse), keypads, touch screens, etc., and video as described above, can be used in conjunction with the CVM platform. The "universality" aspect of the CUI implies that the CVM system provides the same UI (user interface) to a user whether the
CVM is implemented in connection with a desktop computer, a PDA with limited display capabilities, or with a phone where no display is provided. In other words, universality implies that the CVM system can appropriately handle the UI of devices with capabilities ranging from speech only to multi-modal, i.e., speech + GUI, to purely GUI. As per the present invention, the system may be extended to include video input data as well. Therefore, the universal CUI provides the same UI for all user interactions, regardless of the access modality. Moreover, the concept of universal CUI extends to the concept of a coordinated
CUI. In particular, assuming a plurality of devices (within or across multiple computer tiers) offer the same CUI, they can be managed through a single discourse - i.e., a coordinated interface. That is, when multiple devices are conversationally connected (i.e., aware of each other), it is possible to simultaneously control them through one interface (e.g., single microphone). For example, voice can automatically control via a universal coordinated CUI a smart phone, a pager, a PDA (personal digital assistant), networked computers, IVR (interactive voice response) and a car embedded computer that are conversationally connected. These CUI concepts will be explained in greater detail below. The CVM system can run a plurality of applications including conversationally aware applications 782 (i.e., applications that "speak" conversational protocols) and conventional applications 784. The conversationally aware applications 782 are applications that are specifically programmed for operating with a CVM core layer (or kernel) 788 via conversational application APIs 786. In general, the CVM kernel 788 controls the dialog across applications and devices on the basis of their registered conversational capabilities and requirements and provides a unified conversational user interface which goes far beyond adding speech as I/O modality to provide conversational system behaviors. The CVM system may be built on top of a conventional OS and APIs 790 and conventional device hardware 792 and located on a server or any client device (PC, PDA, PvC). The conventional applications 784 are managed by the CVM kernel layer 788 which is responsible for accessing, via the OS APIs, GUI menus and commands of the conventional applications as well as the underlying OS commands. 
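The registration bookkeeping the kernel performs for conversationally aware and conventional applications can be sketched as follows; the class and method names are invented for illustration, since the text does not specify the CVM's API at this level:

```python
class CVMKernel:
    # Minimal sketch: the kernel tracks registered applications and
    # their declared conversational capabilities, so it can arbitrate
    # dialog across them on the basis of those capabilities.
    def __init__(self):
        self.registry = {}

    def register(self, app_name, capabilities, conversationally_aware=True):
        # Conventional applications register too, but are managed via
        # OS APIs (GUI menus/commands) rather than conversational APIs.
        self.registry[app_name] = {
            "capabilities": set(capabilities),
            "aware": conversationally_aware,
            "state": "idle",
        }

    def apps_supporting(self, modality):
        # Which registered applications can handle a given modality.
        return [name for name, info in self.registry.items()
                if modality in info["capabilities"]]
```

A coordinated CUI then follows naturally: one spoken command can be routed to any registered device or application whose capabilities match.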
The CVM automatically handles all the input/output issues, including the conversational subsystems 796 (i.e., conversational engines) and conventional subsystems (e.g., file system and conventional drivers) of the conventional OS 790. In general, the conversational sub-systems 796 are responsible for converting voice requests into queries and converting outputs and results into spoken messages using the appropriate data files 794 (e.g., contexts, finite state grammars, vocabularies, language models, symbolic query maps, etc.). The conversational application API 786 conveys all the information for the CVM 788 to transform queries into application calls and, conversely, converts output into speech, appropriately sorted before being provided to the user.
Referring now to FIG. 8B, a diagram illustrates the abstract programming layers of a CVM according to a preferred embodiment. The abstract layers of the CVM comprise conversationally aware applications 800 and conventional applications 801 that can run on top of the CVM. An application that relies on multi-modal disambiguation is an example of such a conversational application that executes on top of the CVM. Similarly, an application that exploits focus information or mood can be considered a conversational application on top of the CVM. These applications are the programs that are executed by the system to provide the user with the interaction he desires within the environment in which the system is deployed. As discussed above, the conversationally aware applications 800 interact with a CVM kernel layer 802 via a conversational application API layer 803. The conversational application API layer 803 encompasses conversational programming languages/scripts and libraries (conversational foundation classes) to provide the various features offered by the CVM kernel 802. For example, the conversational programming languages/scripts provide the conversational APIs that allow an application developer to hook (or develop) conversationally aware applications 800. They also provide the conversational API layer 803, conversational protocols 804 and system calls that allow a developer to build the conversational features into an application to make it "conversationally aware." The code implementing the applications, API calls and protocol calls includes interpreted and compiled scripts and programs, with library links, conversational logic engine calls and conversational foundation classes.
More specifically, the conversational application API layer 803 comprises a plurality of conversational foundation classes 805 (or fundamental dialog components) which are provided to the application developer through library functions that may be used to build a CUI or conversationally aware applications 800. The conversational foundation classes 805 are the elementary components or conversational gestures (as described by T.V. Raman in "Auditory User Interfaces, Toward The Speaking Computer," Kluwer Academic Publishers, Boston, 1997) that characterize any dialog, independently of the modality or combination of modalities (and which can be implemented procedurally or declaratively). The conversational foundation classes 805 comprise CUI building blocks and conversational platform libraries, dialog modules and components, and dialog scripts and beans. The conversational foundation classes 805 may be compiled locally into conversational objects 806. More specifically, the conversational objects 806 (or dialog components) are compiled from the conversational foundation classes 805 (fundamental dialog components) by combining the different individual classes in code calling these libraries through a programming language such as Java or
C++.
As noted above, coding comprises embedding such fundamental dialog components into declarative code or linking them to imperative code. Nesting and embedding of the conversational foundation classes 805 allows the conversational object 806 (either reusable or not) to be constructed (either declaratively or via compilation/interpretation) for performing specific dialog tasks or applications. Note that CFC (Conversational Foundation Classes) or CML is not the only way to program the CVM. Any programming language that interfaces to the application APIs and protocols would fit. The conversational objects 806 may be implemented declaratively, such as pages of CML (conversational markup language) (nested or not) which are processed or loaded by a conversational browser (or viewer) (800a) as disclosed in the PCT patent application identified as PCT/US99/23008 (attorney docket no. YO9998-392) filed on October 1, 1999 and entitled "Conversational Browser and Conversational Systems," which is incorporated herein by reference. The dialog objects comprise applets or objects that may be loaded through CML (conversational markup language) pages (via a conversational browser), imperative objects on top of the CVM (possibly distributed on top of the CVM), script tags in CML, and servlet components. Some examples of conversational gestures that may be implemented are as follows. A conversational gesture message is used by a machine to convey informational messages to the user. The gesture messages will typically be rendered as a displayed string or spoken prompt. Portions of the message to be spoken can be a function of the current state of the various applications/dialogs running on top of the CVM. A conversational gesture "select from set" is used to encapsulate dialogues where the user is expected to pick from a set of discrete choices. It encapsulates the prompt, the default selection, as well as the set of legal choices.
A conversational gesture "select from range" encapsulates dialogs where the user is allowed to pick a value from a continuous range of values. The gesture encapsulates the valid range, the current selection, and an informational prompt. In addition, a conversational gesture "input" is used to obtain user input when the input constraints are more complex (or perhaps non-existent). The gesture encapsulates the user prompt, application-level semantics about the item of information being requested, and possibly a predicate to test the validity of the input. As described above, however, the conversational foundation classes include, yet surpass, the concept of conversational gestures (i.e., they extend to the level of fundamental behavior and services, as well as rules to perform conversational tasks).
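As a hedged sketch, a "select from set" gesture could be modeled as a modality-independent object like the one below; the class and field names are illustrative, not the CVM's actual foundation-class API:

```python
from dataclasses import dataclass

@dataclass
class SelectFromSet:
    # Sketch of the "select from set" conversational gesture: it
    # encapsulates the prompt, the legal choices, and the default
    # selection, independent of whether the dialog is rendered as a
    # spoken prompt, a GUI list, or both.
    prompt: str
    choices: list
    default: object = None

    def resolve(self, user_input):
        # Accept a legal choice; otherwise fall back to the default
        # (a real dialog engine would re-prompt the user instead).
        if user_input in self.choices:
            return user_input
        return self.default
```

Because the gesture carries only dialog semantics, the same object can be rendered by a speech front end or a GUI front end without change.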
As discussed below, a programming model allows the connection between a master dialog manager and engines through conversational APIs. It is to be understood that such a master dialog manager may be implemented as part of the dialog manager 18 of FIG. 1, while the engines would include the one or more recognition engines of FIG. 1.
Data files of the foundation classes, as well as data needed by any recognition engine (e.g., grammar, acoustic models, video patterns, etc.), are present on CVM (loadable for embedded platforms or client platforms). Data files of objects can be expanded and loaded.
The development environment offered by the CVM is referred to herein as SPOKEN AGE™. Spoken Age allows a developer to build, simulate and debug conversationally aware applications for the CVM. Besides offering direct implementation of the API calls, it also offers tools to build advanced conversational interfaces with multiple personalities, voice fonts which allow the user to select the type of voice providing the output, and conversational formatting languages which build conversational presentations like PostScript and AFL (audio formatting languages). As described above, the conversational application API layer 803 encompasses conversational programming languages and scripts to provide universal conversational input and output, conversational logic and conversational meta-information exchange protocols. The conversational programming languages/scripts allow any available resource to be used as an input or output stream. Using the conversational engines 808 (recognition engines 16 of FIG. 1) and conversational data files 809 (accessed by CVM 802 via conversational engine APIs 807), each input is converted into a binary or ASCII input, which can be directly processed by the programming language as built-in objects. Calls, flags and tags can be automatically included to transmit, between objects and processes, the conversational meta-information required to correctly interface with the different objects. Moreover, output streams can be specially formatted according to the needs of the application or user. These programming tools allow multi-modal discourse processing to be readily built. Moreover, logic statement status and operators are expanded to handle the richness of conversational queries, which can be compared on the basis of their ASCII/binary content, on the basis of their NLU-converted (natural language understanding-converted) query (input/output of conventional and conversational sub-systems), or as FSG-based queries (where the system uses restricted commands).
Logic operators can be implemented to test or modify such systems. Conversational logic values/operators expand to include: true, false, incomplete, ambiguous, different/equivalent from an ASCII point of view, different/equivalent from an NLU point of view, different/equivalent from an active query field point of view, unknown, incompatible, and incomparable.
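These expanded truth values can be captured in an enumeration. The comparison helper below is a toy that operates at the ASCII level only; a full implementation would also compare NLU parses and active query fields, and the names here are illustrative:

```python
from enum import Enum

class ConvLogic(Enum):
    # Conversational logic values named in the text above.
    TRUE = "true"
    FALSE = "false"
    INCOMPLETE = "incomplete"
    AMBIGUOUS = "ambiguous"
    ASCII_EQUIVALENT = "ascii-equivalent"
    ASCII_DIFFERENT = "ascii-different"
    NLU_EQUIVALENT = "nlu-equivalent"
    NLU_DIFFERENT = "nlu-different"
    FIELD_EQUIVALENT = "field-equivalent"
    FIELD_DIFFERENT = "field-different"
    UNKNOWN = "unknown"
    INCOMPATIBLE = "incompatible"
    INCOMPARABLE = "incomparable"

def ascii_compare(query_a, query_b):
    # Toy logic operator: compare two conversational queries at the
    # raw ASCII level only; missing inputs yield UNKNOWN.
    if query_a is None or query_b is None:
        return ConvLogic.UNKNOWN
    if query_a == query_b:
        return ConvLogic.ASCII_EQUIVALENT
    return ConvLogic.ASCII_DIFFERENT
```

Two queries that differ at the ASCII level could still be NLU-equivalent ("open the file" vs. "please open that file"), which is exactly why the value set distinguishes the two viewpoints.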
Furthermore, the conversational application API layer 803 comprises code for providing extensions of the underlying OS features and behavior. Such extensions include, for example, high levels of abstraction and abstract categories associated with any object, self-registration mechanisms of abstract categories, memorization, summarization, conversational search, selection, redirection, user customization, trainability, help, multi-user and security capabilities, as well as the foundation class libraries. The conversational computing system of FIG. 8B further comprises a conversational engine API layer 807 which provides an interface between the core conversational engines 808 (e.g., speech recognition, speaker recognition, NL parsing, NLU, TTS and speech compression/decompression engines, visual recognition) and the applications using them. The engine API layer 807 also provides the protocols to communicate with core engines whether they be local or remote. An I/O API layer 810 provides an interface with conventional I/O resources 811 such as a keyboard, mouse, touch screen, keypad, etc. (for providing a multi-modal conversational UI), an audio subsystem for capturing speech I/O (audio in, audio out), and a video subsystem for capturing video I/O. The I/O API layer 810 provides device abstractions, I/O abstractions and UI abstractions. The I/O resources 811 will register with the CVM kernel layer 802 via the I/O API layer 810. It is to be understood that the I/O APIs 810 may be implemented as part of the I/O manager 14 of FIG. 1, while the I/O resources 811 may be implemented as part of the I/O subsystem 12 of FIG. 1.
The core CVM kernel layer 802 comprises programming layers such as a conversational application and behavior/service manager layer 815, a conversational dialog manager (arbitrator) layer 819, a conversational resource manager layer 820, a task/dispatcher manager 821 and a meta-information manager 822, which provide the core functions of the CVM layer 802. It is to be understood that these components may be implemented as part of the dialog manager 18 of FIG. 1. The conversational application and behavior/service manager layer 815 comprises functions for managing the conventional and conversationally aware applications 800 and 801. Such management functions include, for example, keeping track of which applications are registered (both local and network-distributed), what the dialog interfaces (if any) of the applications are, and what the state of each application is. In addition, the conversational application and services/behavior manager 815 initiates all the tasks associated with any specific service or behavior provided by the CVM system. The conversational services and behaviors are all the behaviors and features of a conversational UI that the user may expect to find in the applications and interactions, as well as the features that an application developer may expect to be able to access via APIs (without having to implement them during the development of the application). Examples of the conversational services and behavior provided by the CVM kernel 802 include, but are not limited to, conversational categorization and meta-information, conversational object, resource and file management, conversational search, conversational selection, conversational customization, conversational security, conversational help, conversational prioritization, conversational resource management, output formatting and presentation, summarization, conversational delayed actions/agents/memorization, conversational logic, and coordinated interfaces and devices.
Such services are provided through API calls via the conversational application API layer 803. The conversational application and behavior/services manager 815 is responsible for executing all the different functions needed to adapt the UI to the capabilities and constraints of the device, application and/or user preferences.
The conversational dialog manager 819 comprises functions for managing the dialog (conversational dialog comprising speech and other multi-modal I/O such as GUI keyboard, pointer, mouse, as well as video input, etc.) and arbitration (dialog manager arbitrator or DMA) across all registered applications. In particular, the conversational dialog manager 819 determines what information the user has provided, which inputs the user presents, and which application(s) should handle the user inputs. The DMA processes abstracted I/O events (abstracted by the I/O manager) using the context/history to understand the user intent. When an abstract event occurs, the DMA determines the target of the event and, if needed, seeks confirmation, disambiguation, correction, more details, etc., until the intent is unambiguous and fully determined. The DMA then launches the action associated with the user's query. The DMA function handles multi-modal I/O events to: (1) determine the target application or dialog (or portion of it); and (2) use past history and context to: (a) understand the intent of the user; (b) follow up with a dialog to disambiguate, complete, correct or confirm the understanding; or (c) dispatch a task resulting from full understanding of the intent of the user. The conversational resource manager 820 determines what conversational engines
808 are registered (either local engines 808 and/or network-distributed resources), the capabilities of each registered resource, and the state of each registered resource. In addition, the conversational resource manager 820 prioritizes the allocation of CPU cycles or input/output priorities to maintain a flowing dialog with the active application (e.g., the engines engaged for recognizing or processing a current input or output have priority). Similarly, for distributed applications, it routes and selects the engine and network path to be used to minimize any network delay for the active foreground process.
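The DMA arbitration described above might be sketched as follows; the per-application scoring functions stand in for the intent-understanding engines, and the thresholds are illustrative assumptions rather than values taken from the disclosure:

```python
def arbitrate(event, registered_apps, history):
    """Minimal sketch of the dialog manager/arbitrator (DMA) step:
    score each registered application against an abstracted I/O event
    plus the dialog history, then dispatch, ask for disambiguation, or
    ask a clarifying follow-up question."""
    scores = {name: app["match"](event, history)
              for name, app in registered_apps.items()}
    best = max(scores, key=scores.get)
    ranked = sorted(scores.values(), reverse=True)
    if ranked[0] < 0.5:
        # Intent not understood: follow up with a clarification dialog.
        return ("clarify", None)
    if len(ranked) > 1 and ranked[0] - ranked[1] < 0.1:
        # Several near-equal targets: seek disambiguation from the user.
        return ("disambiguate",
                [n for n, s in scores.items() if ranked[0] - s < 0.1])
    history.append((event, best))  # update the context/history
    return ("dispatch", best)
```

A real DMA would of course derive its scores from the NLU engines and the context stack rather than from hand-written match functions.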
The task dispatcher/manager 821 dispatches and coordinates different tasks and processes that are spawned (by the user and machine) on local and networked conventional and conversational resources. The meta-information manager 822 manages the meta-information associated with the system via a meta-information repository 818. The meta-information manager 822 and repository 818 collect all the information typically assumed known in a conversational interaction but not available at the level of the current conversation. Examples are a-priori knowledge, cultural and educational assumptions, persistent information, past requests, references, information about the
user, the application, news, etc. It is typically the information that needs to be preserved and persist beyond the length/life of the conversational history/context, and the information that is expected to be common knowledge for the conversation and, therefore, has never been defined during the current and possibly past conversational interactions. Also, shortcuts to commands, resources and macros, etc., are managed by the meta-information manager 822 and stored in the meta-information repository 818. In addition, the meta-information repository 818 includes a user-usage log based on user identity. It is to be appreciated that services such as conversational help and assistance, as well as some dialog prompts (introduction, questions, feedback, etc.) provided by the CVM system, can be tailored based on the usage history of the user as stored in the meta-information repository 818 and associated with the application. If a user has been previously interacting with a given application, an explanation can be reduced assuming that it is familiar to the user. Similarly, if a user commits many errors, the explanations can be more complete, as multiple errors are interpreted as user uncertainty, unfamiliarity, or misunderstanding of the application or function.
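A minimal sketch of such usage-based tailoring might look as follows; the log layout and the session/error thresholds are hypothetical, chosen only to illustrate the behavior:

```python
def help_verbosity(usage_log, application):
    """Choose an explanation style from the per-user usage log kept in
    the meta-information repository: familiarity shortens explanations,
    while repeated errors lengthen them (thresholds are illustrative)."""
    entry = usage_log.get(application, {})
    sessions = entry.get("sessions", 0)
    errors = entry.get("errors", 0)
    if errors >= 3:
        return "detailed"  # repeated errors read as unfamiliarity
    if sessions >= 5:
        return "terse"     # familiar user: reduce the explanation
    return "standard"
```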
A context stack 817 is managed by the dialog manager 819, possibly through a context manager that interacts with the dialog manager and arbitrator. It is to be understood that the context stack 817 may be implemented as part of the context stack 20 of FIG. 1. The context stack 817 comprises all the information associated with an application. Such information includes all the variables, states, inputs, outputs and queries to the backend that are performed in the context of the dialog, and any extraneous event that occurs during the dialog. The context stack is associated with the organized/sorted context corresponding to each active dialog (or deferred dialog-agents/memorization). A global history 816 is included in the CVM system and includes information that is stored beyond the context of each application. The global history stores, for example, the information that is associated with all the applications and actions taken during a conversational session (i.e., the history of the dialog between user and machine for a current session or from when the machine was activated).
The CVM kernel layer 802 further comprises a backend abstraction layer 823 which allows access to backend business logic 813 via the dialog manager 819 (rather than bypassing the dialog manager 819). This allows such accesses to be added to the context stack 817 and global history 816. For instance, the backend abstraction layer 823 can translate input and output to and from the dialog manager 819 to database queries. This layer 823 will convert standardized attribute value n-tuples into database queries and translate the result of such queries into tables or sets of attribute value n-tuples back to the dialog manager 819. In addition, a conversational transcoding layer 824 is provided to adapt the behavior, UI and dialog presented to the user based on the I/O and engine capabilities of the device which executes the CVM system.
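The translation performed by the backend abstraction layer 823 might be sketched as follows, assuming attribute-value pairs in which a value of None marks an attribute the dialog still needs filled in (the table and column names are illustrative, not from the disclosure):

```python
def tuples_to_query(table, attribute_values):
    """Convert standardized attribute-value pairs into a parameterized
    database query: known attributes become WHERE constraints, unknown
    attributes (value None) become the columns to retrieve."""
    known = {k: v for k, v in attribute_values.items() if v is not None}
    unknown = [k for k, v in attribute_values.items() if v is None]
    where = " AND ".join(f"{k} = ?" for k in known)
    sql = f"SELECT {', '.join(unknown) or '*'} FROM {table}"
    if where:
        sql += f" WHERE {where}"
    return sql, tuple(known.values())
```

The returned rows would then be folded back into attribute-value n-tuples for the dialog manager, so that the backend access is recorded on the context stack like any other dialog event.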
The CVM system further comprises a communication stack 814 (or communication engines) as part of the underlying system services provided by the OS 812. The CVM system utilizes the communication stack to transmit information via conversational protocols 804 which extend the conventional communication services to provide conversational communication. It is to be understood that the communication stack 814 may be implemented in connection with the well-known OSI (open system interconnection) protocol layers for providing conversational communication exchange between conversational devices. As is known in the art, OSI comprises seven layers with each layer performing a respective function to provide communication between network distributed conversational applications of network-connected devices. Such layers (whose functions are well-understood) comprise an application layer, a presentation layer, a session layer, a transport layer, a network layer, a data link layer and a physical layer. The application layer is extended to allow conversational communication via the conversational protocols 804.
The conversational protocols 804 allow, in general, remote applications and resources to register their conversational capabilities and proxies. These conversational protocols 804 are further disclosed in the PCT patent application identified as PCT/US99/22925 (attorney docket no. Y0999-113) filed on October 1, 1999 and entitled
"System and Method For Providing Network Coordinated Conversational Services," which is incorporated herein by reference (wherein the conversational protocols are utilized in a system that does not utilize a CVM system). It is to be appreciated that while a preferred embodiment of the multi-modal conversational computing system 10 of FIG. 1 may implement a CVM-based system as described above in the context of FIGs. 8A and 8B, the multi-modal conversational computing system 10 may alternatively be implemented as a "conversational browser" as described in the above-referenced PCT patent application identified as PCT/US99/23008
(attorney docket no. Y0998-392). Given the teachings provided herein, one of ordinary skill in the art will realize various other ways of implementing the multi-modal conversational computing system of the present invention.
D. Conversational Data Mining

Referring now to FIGs. 9A and 9B, block diagrams illustrate preferred embodiments of respective conversational data mining systems. It is to be appreciated that such conversational data mining systems are disclosed in the above-referenced U.S. patent application identified as Serial No. 09/371,400 (attorney docket no. Y0999-227) filed on August 10, 1999 and entitled "Conversational Data Mining," incorporated by reference herein. A description of such systems, one of which may be employed to implement the mood/focus classifier module 22 of FIG. 1, is provided below in this section. However, it is to be appreciated that other mechanisms for implementing mood classification and focus detection according to the invention may be employed.
While focus detection may be performed in accordance with the dialog manager 18 (FIG. 1) along with ambiguity resolution, it is preferably performed in accordance with the mood/focus classifier 22 (FIG. 1), an implementation of which will be described below. It is to be appreciated that focus can be determined by classification and data mining exactly the same way as mood is determined or the user is classified (as will be explained below), i.e., the attitude and moves/gestures of the user are used to determine stochastically the most likely focus item and focus state.
FIGs. 9A and 9B will be used to generally describe mood/focus classification techniques that may be employed in the mood/focus classifier 22 (FIG. 1) with respect to speech-based event data. However, the extended application to include the modality associated with video-based event data will be illustrated in the context of FIG. 9C, where it is shown that these classification techniques can be easily applied to multi-modal input. FIG. 9A depicts an apparatus for collecting data associated with a voice of a user, in accordance with the present invention. The apparatus is designated generally as 900.
The apparatus includes a dialog management unit 902 which conducts a conversation with the user. It is to be understood that the user-provided input data events are preferably provided to the system 900 via the I/O manager 14 of FIG. 1. Apparatus 900 further includes an audio capture module 906 which is coupled to the dialog management unit 902 and which captures a speech waveform associated with utterances spoken by the user 904 during the conversation. While shown for ease of explanation in FIG. 9A, the audio capture unit 906 may be part of the I/O subsystem 12 of FIG. 1, in which case the captured input data is passed on to system 900 via the I/O manager 14. As used herein, a conversation should be broadly understood to include any interaction, between a first human and either a second human, a machine, or a combination thereof, which includes at least some speech. Again, based on the above-described teachings of the multi-modal system 10 of the invention, the mood classification (focus detection) system 900 may be extended to process video in a similar manner.
Apparatus 900 further includes an acoustic front end 908 which is coupled to the audio capture module 906 and which is configured to receive and digitize the speech waveform so as to provide a digitized speech waveform. Further, acoustic front end 908 is also configured to extract, from the digitized speech waveform, at least one acoustic feature which is correlated with at least one user attribute. The at least one user attribute can include at least one of the following: gender of the user, age of the user, accent of the user, native language of the user, dialect of the user, socioeconomic classification of the user, educational level of the user, and emotional state of the user. The dialog management unit 902 may employ acoustic features, such as MEL cepstra, obtained from acoustic front end 908 and may therefore, if desired, have a direct coupling thereto. Apparatus 900 further includes a processing module 910 which is coupled to the acoustic front end 908 and which analyzes the at least one acoustic feature to determine the at least one user attribute. Yet further, apparatus 900 includes a data warehouse 912 which is coupled to the processing module 910 and which stores the at least one user attribute, together with at least one identifying indicia, in a form for subsequent data mining thereon. Identifying indicia will be discussed elsewhere herein.
The gender of the user can be determined by classifying the pitch of the user's voice, or by simply clustering the features. In the latter method, voice prints associated with a large set of speakers of a given gender are built and a speaker classification is then performed with the two sets of models. Age of the user can also be determined via classification of age groups, in a manner similar to gender. Although having limited reliability, broad classes of ages, such as children, teenagers, adults and senior citizens can be separated in this fashion.
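A toy nearest-mean version of such pitch-based gender/age classification might look as follows; the class means are illustrative numbers, not trained voice prints:

```python
def classify_by_pitch(mean_pitch_hz, class_means=None):
    """Assign a speaker to the broad class whose average pitch is
    closest to the speaker's measured average pitch.  A real system
    would instead score the features against per-class voice prints."""
    class_means = class_means or {
        "adult male": 120.0,    # illustrative average pitches in Hz
        "adult female": 210.0,
        "child": 300.0,
    }
    return min(class_means,
               key=lambda c: abs(class_means[c] - mean_pitch_hz))
```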
Determination of accent from acoustic features is known in the art. For example, the paper "A Comparison of Two Unsupervised Approaches to Accent Identification" by
Lincoln et al., presented at the 1998 International Conference on Spoken Language Processing, Sydney, Australia [hereinafter ICSLP'98], sets forth useful techniques. Native language of the user can be determined in a manner essentially equivalent to accent classification. Meta-information about the native language of the speaker can be added to define each accent/native language model.
That is, at the creation of the models for each native language, one employs a speaker or speakers who are tagged with that language as their native language. The paper "Language Identification Incorporating Lexical Information" by Matrouf et al., also presented at ICSLP'98, discusses various techniques for language identification. The user's dialect can be determined from the accent and the usage of keywords or idioms which are specific to a given dialect. For example, in the French language, the choice of "nonante" for the numeral 90 instead of "quatre-vingt-dix" would identify the speaker as being of Belgian or Swiss extraction, and not French or Canadian. Further, the consequent choice of "quatre-vingt" instead of "octante" or "huitante" for the numeral 80 would identify the individual as Belgian and not Swiss. In American English, the choice of "grocery sack" rather than "grocery bag" might identify a person as being of Midwestern origin rather than Midatlantic origin. Another example of Midwestern versus Midatlantic American English would be the choice of "pop" for a soft drink in the
Midwest and the choice of "soda" for the corresponding soft drink in the middle Atlantic region. In an international context, the use of "holiday" rather than "vacation" might identify someone as being of British rather than United States origin. The operations described in this paragraph can be carried out using a speech recognizor 926 which will be discussed below.
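The keyword-based dialect identification described above can be sketched as a set intersection over marker words; the marker table below merely restates the examples given, whereas a real system would be compiled via expert linguistic knowledge:

```python
DIALECT_MARKERS = {
    # Marker words and the candidate dialects they are consistent with
    "nonante": {"Belgian French", "Swiss French"},
    "quatre-vingt": {"Belgian French", "Metropolitan French", "Canadian French"},
    "huitante": {"Swiss French"},
    "octante": {"Swiss French"},
    "pop": {"Midwestern American English"},
    "soda": {"Midatlantic American English"},
    "holiday": {"British English"},
}


def spot_dialect(transcript):
    """Intersect the candidate-dialect sets of every marker word found
    in a transcript; the surviving candidates are the likely dialects."""
    candidates = None
    for word in transcript.lower().split():
        if word in DIALECT_MARKERS:
            hits = DIALECT_MARKERS[word]
            candidates = hits if candidates is None else candidates & hits
    return candidates or set()
```

This reproduces the reasoning in the text: "nonante" alone leaves Belgian or Swiss open, and adding "quatre-vingt" (rather than "huitante") narrows the candidate set to Belgian.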
The socioeconomic classification of the user can include such factors as the racial background of the user, ethnic background of the user, and economic class of the user, for example, blue collar, white collar-middle class or wealthy. Such determinations can be made via annotated accents and dialects at the moment of training, as well as by examining the choice of words of the user. While only moderately reliable, it is believed that these techniques will give sufficient insight into the background of the user so as to be useful for data mining.
The educational level of the user can be determined by the word choice and accent, in a manner similar to the socioeconomic classification; again, only partial reliability is expected, but sufficient for data mining purposes.
Determination of the emotional state of the user from acoustic features is well known in the art. Emotional categories which can be recognized include hot anger, cold anger, panic, fear, anxiety, sadness, elation, despair, happiness, interest, boredom, shame, contempt, confusion, disgust and pride. Exemplary methods of determining emotional state from relevant acoustic features are set forth in the following papers: "Some Acoustic
Characteristics of Emotion" by Pereira and Watson, "Towards an Automatic Classification of Emotions in Speech" by Amir and Ron, and "Simulated Emotions: An Acoustic Study of Voice and Perturbation Measures" by Whiteside, all of which were presented at ICSLP'98.
The audio capture module 906 can include, for example, at least one of an analog-to-digital converter board, an interactive voice response system, and a microphone. The dialog management unit 902 can include a telephone interactive voice response system, for example, the same one used to implement the audio capturing.
Alternatively, the dialog management unit may simply be an acoustic interface to a human operator. Dialog management unit 902 can include natural language understanding (NLU), natural language generation (NLG), finite state grammar (FSG), and/or text-to-speech synthesis (TTS) for machine-prompting the user in lieu of, or in addition to, the human operator. The processing module 910 can be implemented in the processor portion of the IVR, or can be implemented in a separate general purpose computer with appropriate software. Still further, the processing module can be implemented using an application specific circuit such as an application specific integrated circuit (ASIC), or can be implemented in an application specific circuit employing discrete components, or a combination of discrete and integrated components. Processing module 910 can include an emotional state classifier 914. Classifier
914 can in turn include an emotional state classification module 916 and an emotional state prototype data base 918. Processing module 910 can further include a speaker clusterer and classifier 920.
Element 920 can further include a speaker clustering and classification module 922 and a speaker class data base 924.
Processing module 910 can further include a speech recognizor 926 which can, in turn, itself include a speech recognition module 928 and a speech prototype, language model and grammar database 930. Speech recognizor 926 can be part of the dialog management unit 902 or, for example, a separate element within the implementation of processing module 910. Yet further, processing module 910 can include an accent identifier 932, which in turn includes an accent identification module 934 and an accent database 936.
Processing module 910 can include any one of elements 914, 920, 926 and 932; all of those elements together; or any combination thereof. Apparatus 900 can further include a post processor 938 which is coupled to the data warehouse 912 and which is configured to transcribe user utterances and to perform keyword spotting thereon. Although shown as a separate item in FIG. 9A, the post processor can be a part of the processing module 910 or of any of the sub-components thereof. For example, it can be implemented as part of the speech recognizor 926. Post processor 938 can be implemented as part of the processor of an IVR, as an application specific circuit, or on a general purpose computer with suitable software modules. Post processor 938 can employ speech recognizor 926. Post processor 938 can also include a semantic module (not shown) to interpret the meaning of phrases. The semantic module could be used by speech recognizor 926 to indicate that some decoding candidates in a list are meaningless and should be discarded/replaced with meaningful candidates.
The acoustic front end 908 can typically be an eight dimensions plus energy front end as known in the art. However, it should be understood that 13, 24, or any other number of dimensions could be used. MEL cepstra can be computed, for example, over 25 ms frames with a 10 ms overlap, along with the delta and delta delta parameters, that is, the first and second finite derivatives. Such acoustic features can be supplied to the speaker clusterer and classifier 920, speech recognizor 926 and accent identifier 932, as shown in FIG. 9A.
Other types of acoustic features can be extracted by the acoustic front end 908. These can be designated as emotional state features, such as running average pitch, running pitch variance, pitch jitter, running energy variance, speech rate, shimmer, fundamental frequency, and variation in fundamental frequency. Pitch jitter refers to the number of sign changes of the first derivative of pitch. Shimmer is energy jitter. These features can be supplied from the acoustic front end 908 to the emotional state classifier 914. The aforementioned acoustic features, including the MEL cepstra and the emotional state features, can be thought of as the raw, that is, unprocessed features.
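As one example, the pitch-jitter feature defined above (the number of sign changes of the first derivative of pitch) might be computed as follows over a frame-by-frame pitch track; the handling of flat segments is an assumption made for the sketch:

```python
def pitch_jitter(pitch_track):
    """Count sign changes of the first finite derivative of a pitch
    track (one pitch value per frame).  Zero deltas from flat segments
    are skipped so they neither create nor suppress a sign change."""
    deltas = [b - a for a, b in zip(pitch_track, pitch_track[1:])]
    deltas = [d for d in deltas if d != 0]
    return sum(1 for a, b in zip(deltas, deltas[1:]) if a * b < 0)
```

Shimmer, being energy jitter, could be computed by the same routine applied to the frame energies instead of the pitch values.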
User queries can be transcribed by an IVR or otherwise. Speech features can first be processed by a text-independent speaker classification system, for example, in speaker clusterer and classifier 920. This permits classification of the speakers based on acoustic similarities of their voices. Implementation and use of such a system is disclosed in U.S. patent application Serial No. 60/011,058, filed February 2, 1996; U.S. patent application Serial No. 08/787,031, filed January 28, 1997 (now U.S. Patent No. 5,895,447 issued April 20, 1999); U.S. patent application Serial No. 08/788,471, filed January 28, 1997; and U.S. patent application Serial No. 08/787,029, filed January 28, 1997, all of which are co-assigned to International Business Machines Corporation, and the disclosure of all of which is expressly incorporated herein by reference for all purposes. The classification of the speakers can be supervised or unsupervised. In the supervised case, the classes have been decided beforehand based on external information. Typically, such classification can separate between male and female, adult versus child, native speakers versus different classes of non-native speakers, and the like. The indices of this classification process constitute processed features. The results of this process can be supplied to the emotional state classifier 914 and can be used to normalize the emotional state features with respect to the average (mean) observed for a given class, during training, for a neutral emotional state. The normalized emotional state features are used by the emotional state classifier 914, which then outputs an estimate of the emotional state. This output is also considered to be part of the processed features. To summarize, the emotional state features can be normalized by the emotional state classifier 914 with respect to each class produced by the speaker clusterer and classifier 920. A feature can be normalized as follows. Let X0 be the normal frequency.
Let Xi be the measured frequency. Then, the normalized feature will be given by Xi minus X0. This quantity can be positive or negative, and is not, in general, dimensionless. The speech recognizor 926 can transcribe the queries from the user. It can be a speaker-independent or class-dependent large vocabulary continuous speech recognition system, or could be something as simple as a keyword spotter to detect insults (for example) and the like. Such systems are well known in the art. The output can be full sentences, but finer granularity can also be attained; for example, time alignment of the recognized words. The time-stamped transcriptions can also be considered as part of the processed features, and will be discussed further below with respect to methods in accordance with the present invention. Thus, conversation from every stage of a transaction can be transcribed and stored. As shown in FIG. 9A, appropriate data is transferred from the speaker clusterer and classifier 920 to the emotional state classifier
914 and the speech recognizor 926. As noted, it is possible to perform accent, dialect and language recognition with the input speech from the user. A continuous speech recognizor can be trained on speech with several speakers having the different accents which are to be recognized. Each of the training speakers is also associated with an accent vector, with each dimension representing the most likely mixture component associated with each state of each lefeme. The speakers can be clustered based on the distance between these accent vectors, and the clusters can be identified by, for example, the accent of the member speakers. The accent identification can be performed by extracting an accent vector from the user's speech and classifying it. As noted, dialect, socioeconomic classification, and the like can be estimated based on vocabulary and word series used by the user. Appropriate key words, sentences, or grammatical mistakes to detect can be compiled via expert linguistic knowledge. The accent, socioeconomic background, gender, age and the like are part of the processed features. As shown in FIG. 9A, any of the processed features, indicated by the solid arrows, can be stored in the data warehouse 912. Further, raw features, indicated by the dotted lines, can also be stored in the data warehouse 912.
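The accent-vector classification described above might be sketched as a nearest-centroid decision, where each centroid stands for a cluster of training speakers sharing an accent (the centroid values below are illustrative placeholders, not learned models):

```python
import math


def identify_accent(accent_vector, cluster_centroids):
    """Classify a speaker's accent vector by Euclidean distance to the
    per-accent cluster centroids obtained from the training speakers."""
    def dist(u, v):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
    return min(cluster_centroids,
               key=lambda accent: dist(accent_vector, cluster_centroids[accent]))
```

The class-relative normalization described in the preceding paragraphs (the normalized feature Xi minus X0, with X0 the neutral-state class mean) would be applied to the emotional-state features before any such classification step.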
Any of the processed or raw features can be stored in the data warehouse 912 and then associated with the other data which has been collected, upon completion of the transaction. Classical data mining techniques can then be applied. Such techniques are known, for example, as set forth in the book "Data Warehousing, Data Mining and OLAP," by Alex Berson and Stephen J. Smith, published by McGraw Hill in 1997, and in "Discovering Data Mining," by Cabena et al., published by Prentice Hall in 1998. For a given business objective, for example, target marketing, predictive models or classifiers are automatically obtained by applying appropriate mining recipes. All data stored in the data warehouse 912 can be stored in a format to facilitate subsequent data mining thereon. Those of skill in the art are aware of appropriate formats for data which is to be mined, as set forth in the two cited reference books. Business objectives can include, for example, detection of users who are vulnerable to a proposal to buy a given product or service, detection of users who have problems with the automated system and should be transferred to an operator, and detection of users who are angry at the service and should be transferred to a supervisory person. The user can be a customer of a business which employs the apparatus 900, or can be a client of some other type of institution, such as a nonprofit institution, a government agency or the like.
Features can be extracted and decisions dynamically returned by the models. This will be discussed further below.
FIG. 9B depicts a real-time-modifiable voice system for interaction with a user, in accordance with the present invention, which is designated generally as 1000. Elements in FIG. 9B which are similar to those in FIG. 9A have received the same reference numerals incremented by 100. System 1000 can include a dialog management unit 1002 similar to that discussed above. In particular, as suggested in FIG. 9B, unit 1002 can be a human operator or supervisor, an IVR, or a Voice User Interface (VUI). System 1000 can also include an audio capture module 1006 similar to that described above, and an acoustic front end 1008, also similar to that described above. Just as with apparatus 900, unit 1002 can be directly coupled to acoustic front end 1008, if desired, to permit use of MEL cepstra or other acoustic features determined by front end 1008. Further, system 1000 includes a processing module 1010 similar to that described above, but having certain additional features which will now be discussed. Processing module 1010 can include a dynamic classification module 1040 which performs dynamic classification of the user. Accordingly, processing module 1010 is configured to modify behavior of the voice system 1000 based on at least one user attribute which has been determined based on at least one acoustic feature extracted from the user's speech.
System 1000 can further include a business logic unit 1042 which is coupled to the dialog management unit 1002, the dynamic classification module 1040, and optionally to the acoustic front end 1008. The business logic unit can be implemented as a processing portion of the IVR or VUI, can be part of an appropriately programmed general purpose computer, or can be an application specific circuit. At present, it is believed preferable that the processing module 1010 (including module 1040) be implemented as a general purpose computer and that the business logic 1042 be implemented in a processor portion of an interactive voice response system. Dynamic classification module 1040 can be configured to provide feedback, which can be real-time feedback, to the business logic unit 1042 and the dialog management unit 1002.
A data warehouse 1012 and post processor 1038 can be optionally provided as shown and can operate as discussed above with respect to the data collecting apparatus 900. It should be emphasized, however, that in the real-time-modifiable voice system 1000 of the present invention, data warehousing is optional and, if desired, the system can be limited to the real-time feedback discussed with respect to elements 1040, 1042 and
1002.
Processing module 1010 can modify behavior of the system 1000, at least in part, by prompting a human operator thereof, as suggested by the feedback line connected with dialog management unit 1002. For example, a human operator could be alerted when an angry emotional state of the user is detected, and could be prompted to utter soothing words to the user or transfer the user to a higher-level human supervisor. Further, the processing module 1010 could modify business logic 1042 of the system 1000. This could be done, for example, when both the processing module 1010 and business logic unit 1042 were part of an IVR system. Examples of modification of business logic will be discussed further below, but could include tailoring a marketing offer to the user based on attributes of the user detected by the system 1000.
Referring now to FIG. 9C, a block diagram illustrates how the mood/focus classification techniques described above may be implemented by a mood/focus classifier 2 (FIG. 1) in a multi-modal environment which includes speech and video input event data. As shown, the classifier shown in FIG. 9C comprises a speech input channel 1050-1, a speech channel controller 1052-1, and a speech-based mood classification subsystem 1054-1. The classifier also comprises a video input channel 1050-N, a video channel controller 1052-N, and a video-based mood classification subsystem 1054-N. Of course, other input channels and corresponding classification subsystems may be included to extend the classifier to other modalities. The individual classification subsystems each take raw features from their respective input channel and employ recognition and classification engines to process the features and then, in conjunction with data warehouse 1058, make a dynamic classification determination. The details of these processes are described above with respect to FIGs. 9A and 9B. Video features may be treated similarly to speech features. Then, joint dynamic classification may be performed in block 1056 using the data from each input modality to make an overall classification determination. Business logic unit 1060 and multi-modal shell 1062 are used to control the process in accordance with the particular application(s) being run by the mood/focus classifier. Channel controllers 1052-1 and 1052-N are used to control the input of speech data and video data, respectively.
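One plausible form for the joint dynamic classification of block 1056 is a late fusion of the per-modality results: the speech subsystem (1054-1) and video subsystem (1054-N) each produce a score per mood class, and the joint step combines them with per-channel reliability weights. The weights, class names, and scores below are assumptions chosen for illustration, not values from the patent.

```python
MOODS = ["angry", "neutral", "happy"]

def joint_classification(speech_scores, video_scores,
                         speech_weight=0.6, video_weight=0.4):
    """Combine per-modality mood scores into an overall determination
    via a weighted sum (late fusion). Weights are hypothetical."""
    combined = {}
    for mood in MOODS:
        combined[mood] = (speech_weight * speech_scores[mood]
                          + video_weight * video_scores[mood])
    # The overall classification is the class with the highest combined score.
    return max(combined, key=combined.get), combined

# Example per-modality outputs (subsystems 1054-1 and 1054-N, invented values):
speech = {"angry": 0.7, "neutral": 0.2, "happy": 0.1}
video = {"angry": 0.5, "neutral": 0.4, "happy": 0.1}
mood, scores = joint_classification(speech, video)
```

Weighting lets the classifier discount a noisy channel, e.g., lowering `video_weight` under poor lighting, which is one reason to fuse scores rather than raw features.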
Accordingly, it is to be understood that, after determining the mood of a user, a mood classification system as described above can instruct the I/O subsystem 12 of FIG. 1, via the I/O manager 14, to adjust devices in the environment that would have the effect of changing the user's mood and/or focus, e.g., a temperature control system, a music system, etc.
Referring now to FIG. 10, a block diagram of an illustrative hardware implementation of a multi-modal conversational computing system according to the invention is shown. In this particular implementation, a processor 1092 for controlling and performing the various operations associated with the illustrative systems of the invention depicted in FIGs. 1 through 9C is coupled to a memory 1094 and a user interface 1096. It is to be appreciated that the term "processor" as used herein is intended to include any processing device, such as, for example, one that includes a CPU (central processing unit) and/or other processing circuitry. For example, the processor may be a digital signal processor, as is known in the art. Also, the term "processor" may refer to more than one individual processor. The term "memory" as used herein is intended to include memory associated with a processor or CPU, such as, for example, RAM, ROM, a fixed memory device (e.g., hard drive), a removable memory device (e.g., diskette), a flash memory, etc. In addition, the term "user interface" as used herein is intended to include, for example, one or more input devices, e.g., keyboard, for inputting data to the processing unit, and/or one or more output devices, e.g., CRT display and/or printer, for providing results associated with the processing unit. The user interface 1096 is also intended to include the one or more microphones for receiving user speech and the one or more cameras/sensors for capturing image data, as well as any other I/O interface devices used in the multi-modal system.
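As noted above, once a mood is determined the system can instruct the I/O subsystem 12, via the I/O manager 14, to adjust environment devices intended to change that mood. A minimal sketch of that dispatch path follows; the device names, set-points, and mood labels are hypothetical placeholders, not part of the patented system.

```python
def plan_adjustments(mood: str) -> dict:
    """Map a determined mood to device commands intended to change it.
    Moods and commands are illustrative assumptions."""
    if mood == "stressed":
        return {"temperature_control": "lower_to_21C",
                "music_system": "play_calming_playlist"}
    if mood == "drowsy":
        return {"temperature_control": "cool_to_19C",
                "music_system": "play_upbeat_playlist"}
    return {}  # no adjustment warranted

class IOManager:
    """Stand-in for I/O manager 14: forwards each command to the
    corresponding device in the I/O subsystem."""
    def dispatch(self, commands: dict) -> list:
        # Here the dispatch is simply recorded; a real manager would
        # invoke device drivers in the I/O subsystem.
        return [f"{device}: {action}" for device, action in commands.items()]

sent = IOManager().dispatch(plan_adjustments("stressed"))
```

The point of routing through the I/O manager is that the classification logic stays device-independent: adding a new controllable device changes only the command mapping, not the classifier.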
Accordingly, computer software including instructions or code for performing the methodologies of the invention, as described herein, may be stored in one or more of the associated memory devices (e.g., ROM, fixed or removable memory) and, when ready to be utilized, loaded in part or in whole (e.g., into RAM) and executed by a CPU. In any case, it should be understood that the elements illustrated in FIGs. 1 through 9C may be implemented in various forms of hardware, software, or combinations thereof, e.g., one or more digital signal processors with associated memory, application specific integrated circuit(s), functional circuitry, one or more appropriately programmed general purpose digital computers with associated memory, etc. Given the teachings of the invention provided herein, one of ordinary skill in the related art will be able to contemplate other implementations of the elements of the invention.
Although illustrative embodiments of the present invention have been described herein with reference to the accompanying drawings, it is to be understood that the invention is not limited to those precise embodiments, and that various other changes and modifications may be effected therein by one skilled in the art without departing from the scope or spirit of the invention.

Claims

What is claimed is:
1. A multi-modal conversational computing system, the system comprising: a user interface subsystem, the user interface subsystem being configured to input multi-modal data from an environment in which the user interface subsystem is deployed, the multi-modal data including data associated with a first modality input sensor and data associated with at least a second modality input sensor, and the environment including one or more users and one or more devices which are controllable by the multi-modal system; at least one processor, the at least one processor being operatively coupled to the user interface subsystem and being configured to: (i) receive at least a portion of the multi-modal input data from the user interface subsystem; (ii) make a determination of at least one of an intent, a focus and a mood of at least one of the one or more users based on at least a portion of the received multi-modal input data; and (iii) cause execution of one or more actions to occur in the environment based on at least one of the determined intent, the determined focus and the determined mood; and memory, operatively coupled to the at least one processor, which stores at least a portion of results associated with the intent, focus and mood determinations made by the processor for possible use in a subsequent determination.
2. The system of claim 1, wherein the intent determination comprises resolving referential ambiguity associated with the one or more users in the environment based on at least a portion of the received multi-modal data.
3. The system of claim 1, wherein the intent determination comprises resolving referential ambiguity associated with the one or more devices in the environment based on at least a portion of the received multi-modal data.
4. The system of claim 1, wherein the execution of one or more actions in the environment comprises controlling at least one of the one or more devices in the environment to at least one of effectuate the determined intent, effect the determined focus, and effect the determined mood of the one or more users.
5. The system of claim 1, wherein the execution of one or more actions in the environment comprises controlling at least one of the one or more devices in the environment to request further user input to assist in making at least one of the determinations.
6. The system of claim 1, wherein the execution of the one or more actions comprises initiating a process to at least one of further complete, correct, and disambiguate what the system understands from previous input.
7. The system of claim 1, wherein the at least one processor is further configured to abstract the received multi-modal input data into one or more events prior to making the one or more determinations.
8. The system of claim 1, wherein the at least one processor is further configured to perform one or more recognition operations on the received multi-modal input data prior to making the one or more determinations.
9. A multi-modal conversational computing system, the system comprising: a user interface subsystem, the user interface subsystem being configured to input multi-modal data from an environment in which the user interface subsystem is deployed, the multi-modal data including data associated with a first modality input sensor and data associated with at least a second modality input sensor, and the environment including one or more users and one or more devices which are controllable by the multi-modal system; an input/output manager module operatively coupled to the user interface subsystem and configured to abstract the multi-modal input data into one or more events; one or more recognition engines operatively coupled to the input/output manager module and configured to perform, when necessary, one or more recognition operations on the abstracted multi-modal input data; a dialog manager module operatively coupled to the one or more recognition engines and the input/output manager module and configured to: (i) receive at least a portion of the abstracted multi-modal input data and, when necessary, the recognized multi-modal input data; (ii) make a determination of an intent of at least one of the one or more users based on at least a portion of the received multi-modal input data; and (iii) cause execution of one or more actions to occur in the environment based on the determined intent; a focus and mood classification module operatively coupled to the one or more recognition engines and the input/output manager module and configured to: (i) receive at least a portion of the abstracted multi-modal input data and, when necessary, the recognized multi-modal input data; (ii) make a determination of at least one of a focus and a mood of at least one of the one or more users based on at least a portion of the received multi-modal input data; and (iii) cause execution of one or more actions to occur in the environment based on at least one of the determined focus and mood;
and a context stack memory operatively coupled to the dialog manager module, the one or more recognition engines and the focus and mood classification module, which stores at least a portion of results associated with the intent, focus and mood determinations made by the dialog manager and the classification module for possible use in a subsequent determination.
10. A computer-based conversational computing method, the method comprising the steps of: obtaining multi-modal data from an environment including one or more users and one or more controllable devices, the multi-modal data including data associated with a first modality input sensor and data associated with at least a second modality input sensor; making a determination of at least one of an intent, a focus and a mood of at least one of the one or more users based on at least a portion of the obtained multi-modal input data; causing execution of one or more actions to occur in the environment based on at least one of the determined intent, the determined focus and the determined mood; and storing at least a portion of results associated with the intent, focus and mood determinations for possible use in a subsequent determination.
11. The method of claim 10, wherein the intent determination step comprises resolving referential ambiguity associated with the one or more users in the environment based on at least a portion of the received multi-modal data.
12. The method of claim 10, wherein the intent determination step comprises resolving referential ambiguity associated with the one or more devices in the environment based on at least a portion of the received multi-modal data.
13. The method of claim 10, wherein the step of causing the execution of one or more actions in the environment comprises controlling at least one of the one or more devices in the environment to at least one of effectuate the determined intent, effect the determined focus, and effect the determined mood of the one or more users.
14. The method of claim 10, wherein the step of causing the execution of one or more actions in the environment comprises controlling at least one of the one or more devices in the environment to request further user input to assist in making at least one of the determinations.
15. The method of claim 10, wherein the step of causing the execution of the one or more actions comprises initiating a process to at least one of further complete, correct, and disambiguate what the system understands from previous input.
16. The method of claim 10, further comprising the step of abstracting the received multi-modal input data into one or more events prior to making the one or more determinations.
17. The method of claim 10, further comprising the step of performing one or more recognition operations on the received multi-modal input data prior to making the one or more determinations.
18. An article of manufacture for performing conversational computing, comprising a machine readable medium containing one or more programs which when executed implement the steps of: obtaining multi-modal data from an environment including one or more users and one or more controllable devices, the multi-modal data including data associated with a first modality input sensor and data associated with at least a second modality input sensor; making a determination of at least one of an intent, a focus and a mood of at least one of the one or more users based on at least a portion of the obtained multi-modal input data; causing execution of one or more actions to occur in the environment based on at least one of the determined intent, the determined focus and the determined mood; and storing at least a portion of results associated with the intent, focus and mood determinations for possible use in a subsequent determination.
19. A multi-modal conversational computing system, the system comprising: a user interface subsystem, the user interface subsystem being configured to input multi-modal data from an environment in which the user interface subsystem is deployed, the multi-modal data including at least audio-based data and image-based data, and the environment including one or more users and one or more devices which are controllable by the multi-modal system; at least one processor, the at least one processor being operatively coupled to the user interface subsystem and being configured to: (i) receive at least a portion of the multi-modal input data from the user interface subsystem; (ii) make a determination of at least one of an intent, a focus and a mood of at least one of the one or more users based on at least a portion of the received multi-modal input data; and (iii) cause execution of one or more actions to occur in the environment based on at least one of the determined intent, the determined focus and the determined mood; and memory, operatively coupled to the at least one processor, which stores at least a portion of results associated with the intent, focus and mood determinations made by the processor for possible use in a subsequent determination.
20. The system of claim 19, wherein the intent determination comprises resolving referential ambiguity associated with the one or more users in the environment based on at least a portion of the received multi-modal data.
21. The system of claim 19, wherein the intent determination comprises resolving referential ambiguity associated with the one or more devices in the environment based on at least a portion of the received multi-modal data.
22. The system of claim 19, wherein the user interface subsystem comprises one or more image capturing devices, deployed in the environment, for capturing the image-based data.
23. The system of claim 22, wherein the image-based data is at least one of in the visible wavelength spectrum and not in the visible wavelength spectrum.
24. The system of claim 22, wherein the image-based data is at least one of video, infrared, and radio frequency-based image data.
25. The system of claim 19, wherein the user interface subsystem comprises one or more audio capturing devices, deployed in the environment, for capturing the audio-based data.
26. The system of claim 25, wherein the one or more audio capturing devices comprise one or more microphones.
27. The system of claim 19, wherein the user interface subsystem comprises one or more graphical user interface-based input devices, deployed in the environment, for capturing graphical user interface-based data.
28. The system of claim 19, wherein the user interface subsystem comprises a stylus-based input device, deployed in the environment, for capturing handwritten-based data.
29. The system of claim 19, wherein the execution of one or more actions in the environment comprises controlling at least one of the one or more devices in the environment to at least one of effectuate the determined intent, effect the determined focus, and effect the determined mood of the one or more users.
30. The system of claim 19, wherein the execution of one or more actions in the environment comprises controlling at least one of the one or more devices in the environment to request further user input to assist in making at least one of the determinations.
31. The system of claim 19, wherein the at least one processor is further configured to abstract the received multi-modal input data into one or more events prior to making the one or more determinations.
32. The system of claim 19, wherein the at least one processor is further configured to perform one or more recognition operations on the received multi-modal input data prior to making the one or more determinations.
33. The system of claim 32, wherein one of the one or more recognition operations comprises speech recognition.
34. The system of claim 32, wherein one of the one or more recognition operations comprises speaker recognition.
35. The system of claim 32, wherein one of the one or more recognition operations comprises gesture recognition.
36. The system of claim 19, wherein the execution of the one or more actions comprises initiating a process to at least one of further complete, correct, and disambiguate what the system understands from previous input.
37. A multi-modal conversational computing system, the system comprising: a user interface subsystem, the user interface subsystem being configured to input multi-modal data from an environment in which the user interface subsystem is deployed, the multi-modal data including at least audio-based data and image-based data, and the environment including one or more users and one or more devices which are controllable by the multi-modal system; an input/output manager module operatively coupled to the user interface subsystem and configured to abstract the multi-modal input data into one or more events; one or more recognition engines operatively coupled to the input/output manager module and configured to perform, when necessary, one or more recognition operations on the abstracted multi-modal input data; a dialog manager module operatively coupled to the one or more recognition engines and the input/output manager module and configured to: (i) receive at least a portion of the abstracted multi-modal input data and, when necessary, the recognized multi-modal input data; (ii) make a determination of an intent of at least one of the one or more users based on at least a portion of the received multi-modal input data; and (iii) cause execution of one or more actions to occur in the environment based on the determined intent; a focus and mood classification module operatively coupled to the one or more recognition engines and the input/output manager module and configured to: (i) receive at least a portion of the abstracted multi-modal input data and, when necessary, the recognized multi-modal input data; (ii) make a determination of at least one of a focus and a mood of at least one of the one or more users based on at least a portion of the received multi-modal input data; and (iii) cause execution of one or more actions to occur in the environment based on at least one of the determined focus and mood; and a context stack memory operatively coupled to the dialog manager module, the one or more recognition engines and the focus and mood classification module, which stores at least a portion of results associated with the intent, focus and mood determinations made by the dialog manager and the classification module for possible use in a subsequent determination.
38. A computer-based conversational computing method, the method comprising the steps of: obtaining multi-modal data from an environment including one or more users and one or more controllable devices, the multi-modal data including at least audio-based data and image-based data; making a determination of at least one of an intent, a focus and a mood of at least one of the one or more users based on at least a portion of the obtained multi-modal input data; causing execution of one or more actions to occur in the environment based on at least one of the determined intent, the determined focus and the determined mood; and storing at least a portion of results associated with the intent, focus and mood determinations for possible use in a subsequent determination.
PCT/US2002/002853 2001-02-05 2002-01-31 System and method for multi-modal focus detection, referential ambiguity resolution and mood classification using multi-modal input WO2002063599A1 (en)

Priority Applications (5)

Application Number Priority Date Filing Date Title
JP2002563459A JP2004538543A (en) 2001-02-05 2002-01-31 System and method for multi-mode focus detection, reference ambiguity resolution and mood classification using multi-mode input
CA002437164A CA2437164A1 (en) 2001-02-05 2002-01-31 System and method for multi-modal focus detection, referential ambiguity resolution and mood classification using multi-modal input
EP02724896A EP1358650A4 (en) 2001-02-05 2002-01-31 System and method for multi-modal focus detection, referential ambiguity resolution and mood classification using multi-modal input
KR1020037010176A KR100586767B1 (en) 2001-02-05 2002-01-31 System and method for multi-modal focus detection, referential ambiguity resolution and mood classification using multi-modal input
HK04106079A HK1063371A1 (en) 2001-02-05 2004-08-13 System and method for multi-modal focus detection,referential ambiguity resolution and mood classif ication using multi-modal input

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US09/776,654 2001-02-05
US09/776,654 US6964023B2 (en) 2001-02-05 2001-02-05 System and method for multi-modal focus detection, referential ambiguity resolution and mood classification using multi-modal input

Publications (1)

Publication Number Publication Date
WO2002063599A1 true WO2002063599A1 (en) 2002-08-15

Family

ID=25108023

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2002/002853 WO2002063599A1 (en) 2001-02-05 2002-01-31 System and method for multi-modal focus detection, referential ambiguity resolution and mood classification using multi-modal input

Country Status (8)

Country Link
US (1) US6964023B2 (en)
EP (1) EP1358650A4 (en)
JP (1) JP2004538543A (en)
KR (1) KR100586767B1 (en)
CN (1) CN1310207C (en)
CA (1) CA2437164A1 (en)
HK (1) HK1063371A1 (en)
WO (1) WO2002063599A1 (en)

US20040249825A1 (en) * 2003-06-05 2004-12-09 International Business Machines Corporation Administering devices with dynamic action lists
US20040249826A1 (en) * 2003-06-05 2004-12-09 International Business Machines Corporation Administering devices including creating a user reaction log
US7386863B2 (en) * 2003-06-26 2008-06-10 International Business Machines Corporation Administering devices in dependence upon user metric vectors with multiple users
US7464062B2 (en) 2003-06-26 2008-12-09 International Business Machines Corporation Administering devices in dependence upon user metric vectors including generic metric spaces
US7151969B2 (en) 2003-06-26 2006-12-19 International Business Machines Corporation Administering devices in dependence upon user metric vectors with optimizing metric action lists
US7437443B2 (en) 2003-07-02 2008-10-14 International Business Machines Corporation Administering devices with domain state objects
US20050108366A1 (en) * 2003-07-02 2005-05-19 International Business Machines Corporation Administering devices with domain state objects
US20050004788A1 (en) * 2003-07-03 2005-01-06 Lee Hang Shun Raymond Multi-level confidence measures for task modeling and its application to task-oriented multi-modal dialog management
US20050050137A1 (en) * 2003-08-29 2005-03-03 International Business Machines Corporation Administering devices in dependence upon metric patterns
US7460652B2 (en) 2003-09-26 2008-12-02 At&T Intellectual Property I, L.P. VoiceXML and rule engine based switchboard for interactive voice response (IVR) services
US20050071462A1 (en) * 2003-09-30 2005-03-31 Ibm Corporation Creating user metric patterns
US20050071463A1 (en) * 2003-09-30 2005-03-31 Ibm Corporation Administering devices in dependence upon device content metadata
US20050108429A1 (en) * 2003-10-23 2005-05-19 International Business Machines Corporation Devices in a domain
US7461143B2 (en) 2003-10-23 2008-12-02 International Business Machines Corporation Administering devices including allowed action lists
US7263511B2 (en) * 2003-10-23 2007-08-28 International Business Machines Corporation Creating user metric patterns including user notification
US6961668B2 (en) * 2003-10-23 2005-11-01 International Business Machines Corporation Evaluating test actions
US7199802B2 (en) * 2003-10-24 2007-04-03 Microsoft Corporation Multiple-mode window presentation system and process
JP2005157494A (en) 2003-11-20 2005-06-16 Aruze Corp Conversation control apparatus and conversation control method
US7257454B2 (en) * 2003-11-21 2007-08-14 Taiwan Semiconductor Manufacturing Company, Ltd. Dynamically adjusting the distribution for dispatching lot between current and downstream tool by using expertise weighting mechanism
US7376565B2 (en) * 2003-12-15 2008-05-20 International Business Machines Corporation Method, system, and apparatus for monitoring security events using speech recognition
US7542971B2 (en) * 2004-02-02 2009-06-02 Fuji Xerox Co., Ltd. Systems and methods for collaborative note-taking
US20050177373A1 (en) * 2004-02-05 2005-08-11 Avaya Technology Corp. Methods and apparatus for providing context and experience sensitive help in voice applications
US7412393B1 (en) * 2004-03-01 2008-08-12 At&T Corp. Method for developing a dialog manager using modular spoken-dialog components
US7369100B2 (en) * 2004-03-04 2008-05-06 Eastman Kodak Company Display system and method with multi-person presentation function
US7090358B2 (en) * 2004-03-04 2006-08-15 International Business Machines Corporation System, apparatus and method of displaying information for foveal vision and peripheral vision
US20050197843A1 (en) * 2004-03-07 2005-09-08 International Business Machines Corporation Multimodal aggregating unit
JP4458888B2 (en) * 2004-03-22 2010-04-28 富士通株式会社 Conference support system, minutes generation method, and computer program
US20050240424A1 (en) * 2004-04-27 2005-10-27 Xiaofan Lin System and method for hierarchical attribute extraction within a call handling system
US7676754B2 (en) * 2004-05-04 2010-03-09 International Business Machines Corporation Method and program product for resolving ambiguities through fading marks in a user interface
FR2871978B1 (en) * 2004-06-16 2006-09-22 Alcatel Sa Method for processing sound signals for a communication terminal, and communication terminal using the same
US7663788B2 (en) * 2004-06-29 2010-02-16 Fujifilm Corporation Image correcting apparatus and method, and image correction program
US7978827B1 (en) 2004-06-30 2011-07-12 Avaya Inc. Automatic configuration of call handling based on end-user needs and characteristics
US7936861B2 (en) 2004-07-23 2011-05-03 At&T Intellectual Property I, L.P. Announcement system and method of use
US8165281B2 (en) 2004-07-28 2012-04-24 At&T Intellectual Property I, L.P. Method and system for mapping caller information to call center agent transactions
US7580837B2 (en) 2004-08-12 2009-08-25 At&T Intellectual Property I, L.P. System and method for targeted tuning module of a speech recognition system
US7623685B2 (en) * 2004-08-20 2009-11-24 The Regents Of The University Of Colorado Biometric signatures and identification through the use of projective invariants
US7295904B2 (en) * 2004-08-31 2007-11-13 International Business Machines Corporation Touch gesture based interface for motor vehicle
US7197130B2 (en) 2004-10-05 2007-03-27 Sbc Knowledge Ventures, L.P. Dynamic load balancing between multiple locations with different telephony system
US7668889B2 (en) 2004-10-27 2010-02-23 At&T Intellectual Property I, Lp Method and system to combine keyword and natural language search results
US7657005B2 (en) * 2004-11-02 2010-02-02 At&T Intellectual Property I, L.P. System and method for identifying telephone callers
US7502835B1 (en) 2004-11-17 2009-03-10 Juniper Networks, Inc. Virtual folders for tracking HTTP sessions
US7461134B2 (en) * 2004-11-19 2008-12-02 W.A. Krapf, Inc. Bi-directional communication between a web client and a web server
US7724889B2 (en) 2004-11-29 2010-05-25 At&T Intellectual Property I, L.P. System and method for utilizing confidence levels in automated call routing
US7242751B2 (en) 2004-12-06 2007-07-10 Sbc Knowledge Ventures, L.P. System and method for speech recognition-enabled automatic call routing
US7864942B2 (en) 2004-12-06 2011-01-04 At&T Intellectual Property I, L.P. System and method for routing calls
KR20060066416A (en) * 2004-12-13 2006-06-16 한국전자통신연구원 A remote service apparatus and method that diagnoses laryngeal disorder and/or state using a speech codec
TWI251754B (en) * 2004-12-16 2006-03-21 Delta Electronics Inc Method for optimizing loads of speech/user recognition system
US7747437B2 (en) * 2004-12-16 2010-06-29 Nuance Communications, Inc. N-best list rescoring in speech recognition
US8340971B1 (en) * 2005-01-05 2012-12-25 At&T Intellectual Property Ii, L.P. System and method of dialog trajectory analysis
US7751551B2 (en) 2005-01-10 2010-07-06 At&T Intellectual Property I, L.P. System and method for speech-enabled call routing
TWI269268B (en) * 2005-01-24 2006-12-21 Delta Electronics Inc Speech recognizing method and system
US7627109B2 (en) 2005-02-04 2009-12-01 At&T Intellectual Property I, Lp Call center system for multiple transaction selections
US7697766B2 (en) * 2005-03-17 2010-04-13 Delphi Technologies, Inc. System and method to determine awareness
US7996219B2 (en) 2005-03-21 2011-08-09 At&T Intellectual Property Ii, L.P. Apparatus and method for model adaptation for spoken language understanding
US8223954B2 (en) 2005-03-22 2012-07-17 At&T Intellectual Property I, L.P. System and method for automating customer relations in a communications environment
US20060229882A1 (en) * 2005-03-29 2006-10-12 Pitney Bowes Incorporated Method and system for modifying printed text to indicate the author's state of mind
US7653547B2 (en) * 2005-03-31 2010-01-26 Microsoft Corporation Method for testing a speech server
US7636432B2 (en) 2005-05-13 2009-12-22 At&T Intellectual Property I, L.P. System and method of determining call treatment of repeat calls
US20060260624A1 (en) * 2005-05-17 2006-11-23 Battelle Memorial Institute Method, program, and system for automatic profiling of entities
US20060271520A1 (en) * 2005-05-27 2006-11-30 Ragan Gene Z Content-based implicit search query
US20090049388A1 (en) * 2005-06-02 2009-02-19 Ronnie Bernard Francis Taib Multimodal computer navigation
US20070015121A1 (en) * 2005-06-02 2007-01-18 University Of Southern California Interactive Foreign Language Teaching
US7657020B2 (en) 2005-06-03 2010-02-02 At&T Intellectual Property I, Lp Call routing system and method of using the same
US8005204B2 (en) 2005-06-03 2011-08-23 At&T Intellectual Property I, L.P. Call routing system and method of using the same
US7917365B2 (en) * 2005-06-16 2011-03-29 Nuance Communications, Inc. Synchronizing visual and speech events in a multimodal application
US7496513B2 (en) * 2005-06-28 2009-02-24 Microsoft Corporation Combined input processing for a computing device
US7457753B2 (en) * 2005-06-29 2008-11-25 University College Dublin National University Of Ireland Telephone pathology assessment
US8503641B2 (en) 2005-07-01 2013-08-06 At&T Intellectual Property I, L.P. System and method of automated order status retrieval
JP4717539B2 (en) * 2005-07-26 2011-07-06 キヤノン株式会社 Imaging apparatus and imaging method
EP1748378B1 (en) 2005-07-26 2009-09-16 Canon Kabushiki Kaisha Image capturing apparatus and image capturing method
US7640160B2 (en) 2005-08-05 2009-12-29 Voicebox Technologies, Inc. Systems and methods for responding to natural language speech utterance
US7904300B2 (en) * 2005-08-10 2011-03-08 Nuance Communications, Inc. Supporting multiple speech enabled user interface consoles within a motor vehicle
US20070038633A1 (en) * 2005-08-10 2007-02-15 International Business Machines Corporation Method and system for executing procedures in mixed-initiative mode
US8073699B2 (en) * 2005-08-16 2011-12-06 Nuance Communications, Inc. Numeric weighting of error recovery prompts for transfer to a human agent from an automated speech response system
US8526577B2 (en) 2005-08-25 2013-09-03 At&T Intellectual Property I, L.P. System and method to access content from a speech-enabled automated system
US20070055523A1 (en) * 2005-08-25 2007-03-08 Yang George L Pronunciation training system
US8548157B2 (en) 2005-08-29 2013-10-01 At&T Intellectual Property I, L.P. System and method of managing incoming telephone calls at a call center
US7949529B2 (en) * 2005-08-29 2011-05-24 Voicebox Technologies, Inc. Mobile systems and methods of supporting natural language human-machine interactions
US8265939B2 (en) * 2005-08-31 2012-09-11 Nuance Communications, Inc. Hierarchical methods and apparatus for extracting user intent from spoken utterances
US8677377B2 (en) 2005-09-08 2014-03-18 Apple Inc. Method and apparatus for building an intelligent automated assistant
US8447592B2 (en) * 2005-09-13 2013-05-21 Nuance Communications, Inc. Methods and apparatus for formant-based voice systems
US8825482B2 (en) * 2005-09-15 2014-09-02 Sony Computer Entertainment Inc. Audio, video, simulation, and user interface paradigms
JP2009508553A (en) * 2005-09-16 2009-03-05 アイモーションズ−エモーション テクノロジー エー/エス System and method for determining human emotion by analyzing eyeball properties
US7633076B2 (en) 2005-09-30 2009-12-15 Apple Inc. Automated response to and sensing of user activity in portable devices
US7889892B2 (en) * 2005-10-13 2011-02-15 Fujifilm Corporation Face detecting method, and system and program for the methods
US7697827B2 (en) 2005-10-17 2010-04-13 Konicek Jeffrey C User-friendlier interfaces for a camera
JP4888996B2 (en) * 2005-10-21 2012-02-29 株式会社ユニバーサルエンターテインメント Conversation control device
US20070092007A1 (en) * 2005-10-24 2007-04-26 Mediatek Inc. Methods and systems for video data processing employing frame/field region predictions in motion estimation
US7840898B2 (en) * 2005-11-01 2010-11-23 Microsoft Corporation Video booklet
KR100715949B1 (en) * 2005-11-11 2007-05-08 삼성전자주식회사 Method and apparatus for classifying mood of music at high speed
US20070117072A1 (en) * 2005-11-21 2007-05-24 Conopco Inc, D/B/A Unilever Attitude reaction monitoring
US8209182B2 (en) * 2005-11-30 2012-06-26 University Of Southern California Emotion recognition system
US7860718B2 (en) * 2005-12-08 2010-12-28 Electronics And Telecommunications Research Institute Apparatus and method for speech segment detection and system for speech recognition
US20070143307A1 (en) * 2005-12-15 2007-06-21 Bowers Matthew N Communication system employing a context engine
US7552098B1 (en) 2005-12-30 2009-06-23 At&T Corporation Methods to distribute multi-class classification learning on several processors
KR100745980B1 (en) * 2006-01-11 2007-08-06 삼성전자주식회사 Score fusion method and apparatus thereof for combining multiple classifiers
US8265349B2 (en) 2006-02-07 2012-09-11 Qualcomm Incorporated Intra-mode region-of-interest video object segmentation
US8265392B2 (en) 2006-02-07 2012-09-11 Qualcomm Incorporated Inter-mode region-of-interest video object segmentation
JP5055781B2 (en) * 2006-02-14 2012-10-24 株式会社日立製作所 Conversation speech analysis method and conversation speech analysis apparatus
US8209181B2 (en) * 2006-02-14 2012-06-26 Microsoft Corporation Personal audio-video recorder for live meetings
US8781837B2 (en) * 2006-03-23 2014-07-15 Nec Corporation Speech recognition system and method for plural applications
US7848917B2 (en) * 2006-03-30 2010-12-07 Microsoft Corporation Common word graph based multimodal input
US8150692B2 (en) * 2006-05-18 2012-04-03 Nuance Communications, Inc. Method and apparatus for recognizing a user personality trait based on a number of compound words used by the user
JP2007318438A (en) * 2006-05-25 2007-12-06 Yamaha Corp Voice state data generating device, voice state visualizing device, voice state data editing device, voice data reproducing device, and voice communication system
US8332218B2 (en) 2006-06-13 2012-12-11 Nuance Communications, Inc. Context-based grammars for automated speech recognition
US20080005068A1 (en) * 2006-06-28 2008-01-03 Microsoft Corporation Context-based search, retrieval, and awareness
CN101506859A (en) * 2006-07-12 2009-08-12 医疗网络世界公司 Computerized medical training system
US7502767B1 (en) * 2006-07-21 2009-03-10 Hewlett-Packard Development Company, L.P. Computing a count of cases in a class
US9583096B2 (en) * 2006-08-15 2017-02-28 Nuance Communications, Inc. Enhancing environment voice macros via a stackable save/restore state of an object within an environment controlled by voice commands for control of vehicle components
US20080059027A1 (en) * 2006-08-31 2008-03-06 Farmer Michael E Methods and apparatus for classification of occupancy using wavelet transforms
US9318108B2 (en) 2010-01-18 2016-04-19 Apple Inc. Intelligent automated assistant
US8374874B2 (en) 2006-09-11 2013-02-12 Nuance Communications, Inc. Establishing a multimodal personality for a multimodal application in dependence upon attributes of user interaction
US8073681B2 (en) 2006-10-16 2011-12-06 Voicebox Technologies, Inc. System and method for a cooperative conversational voice user interface
US20080091515A1 (en) * 2006-10-17 2008-04-17 Patentvc Ltd. Methods for utilizing user emotional state in a business process
US20100007726A1 (en) * 2006-10-19 2010-01-14 Koninklijke Philips Electronics N.V. Method and apparatus for classifying a person
US8355915B2 (en) * 2006-11-30 2013-01-15 Rao Ashwin P Multimodal speech recognition system
US9830912B2 (en) 2006-11-30 2017-11-28 Ashwin P Rao Speak and touch auto correction interface
US8000969B2 (en) * 2006-12-19 2011-08-16 Nuance Communications, Inc. Inferring switching conditions for switching between modalities in a speech application environment extended for interactive text exchanges
US7912724B1 (en) * 2007-01-18 2011-03-22 Adobe Systems Incorporated Audio comparison using phoneme matching
US7617337B1 (en) 2007-02-06 2009-11-10 Avaya Inc. VoIP quality tradeoff system
US7818176B2 (en) 2007-02-06 2010-10-19 Voicebox Technologies, Inc. System and method for selecting and presenting advertisements based on natural language processing of voice-based input
US20080201369A1 (en) * 2007-02-16 2008-08-21 At&T Knowledge Ventures, Lp System and method of modifying media content
WO2008106655A1 (en) * 2007-03-01 2008-09-04 Apapx, Inc. System and method for dynamic learning
US8069044B1 (en) * 2007-03-16 2011-11-29 Adobe Systems Incorporated Content matching using phoneme comparison and scoring
US8977255B2 (en) 2007-04-03 2015-03-10 Apple Inc. Method and system for operating a multi-function portable electronic device using voice-activation
US8983051B2 (en) 2007-04-03 2015-03-17 William F. Barton Outgoing call classification and disposition
US8131556B2 (en) * 2007-04-03 2012-03-06 Microsoft Corporation Communications using different modalities
JP4337064B2 (en) * 2007-04-04 2009-09-30 ソニー株式会社 Information processing apparatus, information processing method, and program
US8660841B2 (en) * 2007-04-06 2014-02-25 Technion Research & Development Foundation Limited Method and apparatus for the use of cross modal association to isolate individual media sources
US7925505B2 (en) * 2007-04-10 2011-04-12 Microsoft Corporation Adaptation of language models and context free grammar in speech recognition
US8856002B2 (en) * 2007-04-12 2014-10-07 International Business Machines Corporation Distance metrics for universal pattern processing tasks
US8131549B2 (en) 2007-05-24 2012-03-06 Microsoft Corporation Personality-based device
US20090033622A1 (en) * 2007-05-30 2009-02-05 24/8 Llc Smartscope/smartshelf
US8166109B2 (en) * 2007-06-21 2012-04-24 Cisco Technology, Inc. Linking recognized emotions to non-visual representations
DE102007030209A1 (en) * 2007-06-27 2009-01-08 Siemens Audiologische Technik Gmbh smoothing process
ITFI20070177A1 (en) 2007-07-26 2009-01-27 Riccardo Vieri System for creating and setting up an advertising campaign deriving from the insertion of advertising messages within an exchange of messages, and method for its operation
CN101119209A (en) * 2007-09-19 2008-02-06 腾讯科技(深圳)有限公司 Virtual pet system and virtual pet chatting method and device
US20090083035A1 (en) * 2007-09-25 2009-03-26 Ritchie Winson Huang Text pre-processing for text-to-speech generation
US8218811B2 (en) 2007-09-28 2012-07-10 Uti Limited Partnership Method and system for video interaction based on motion swarms
US9053089B2 (en) 2007-10-02 2015-06-09 Apple Inc. Part-of-speech tagging using latent analogy
JP2009086581A (en) * 2007-10-03 2009-04-23 Toshiba Corp Apparatus and program for creating speaker model of speech recognition
US8165886B1 (en) 2007-10-04 2012-04-24 Great Northern Research LLC Speech interface system and method for control and interaction with applications on a computing system
US8595642B1 (en) 2007-10-04 2013-11-26 Great Northern Research, LLC Multiple shell multi faceted graphical user interface
WO2009045861A1 (en) * 2007-10-05 2009-04-09 Sensory, Incorporated Systems and methods of performing speech recognition using gestures
CN101414348A (en) * 2007-10-19 2009-04-22 三星电子株式会社 Method and system for identifying human face in multiple angles
US8364694B2 (en) 2007-10-26 2013-01-29 Apple Inc. Search assistant for digital media assets
US8620662B2 (en) 2007-11-20 2013-12-31 Apple Inc. Context-aware unit selection
US8127235B2 (en) 2007-11-30 2012-02-28 International Business Machines Corporation Automatic increasing of capacity of a virtual space in a virtual world
US8140335B2 (en) 2007-12-11 2012-03-20 Voicebox Technologies, Inc. System and method for providing a natural language voice user interface in an integrated voice navigation services environment
US10002189B2 (en) 2007-12-20 2018-06-19 Apple Inc. Method and apparatus for searching using an active ontology
US20090164919A1 (en) 2007-12-24 2009-06-25 Cary Lee Bates Generating data for managing encounters in a virtual world environment
US8219407B1 (en) 2007-12-27 2012-07-10 Great Northern Research, LLC Method for processing the output of a speech recognizer
US9330720B2 (en) 2008-01-03 2016-05-03 Apple Inc. Methods and apparatus for altering audio output signals
US8327272B2 (en) 2008-01-06 2012-12-04 Apple Inc. Portable multifunction device, method, and graphical user interface for viewing and managing electronic calendars
US20090198496A1 (en) * 2008-01-31 2009-08-06 Matthias Denecke Aspect oriented programmable dialogue manager and apparatus operated thereby
JP5181704B2 (en) * 2008-02-07 2013-04-10 日本電気株式会社 Data processing apparatus, posture estimation system, posture estimation method and program
US8065143B2 (en) 2008-02-22 2011-11-22 Apple Inc. Providing text input using speech data and non-speech data
US8289283B2 (en) 2008-03-04 2012-10-16 Apple Inc. Language input interface on a device
EP2099198A1 (en) * 2008-03-05 2009-09-09 Sony Corporation Method and device for personalizing a multimedia application
US8996376B2 (en) 2008-04-05 2015-03-31 Apple Inc. Intelligent text-to-speech conversion
US10496753B2 (en) * 2010-01-18 2019-12-03 Apple Inc. Automatically adapting user interfaces for hands-free interaction
US8615664B2 (en) * 2008-05-23 2013-12-24 The Invention Science Fund I, Llc Acquisition and particular association of inference data indicative of an inferred mental state of an authoring user and source identity data
US9161715B2 (en) * 2008-05-23 2015-10-20 Invention Science Fund I, Llc Determination of extent of congruity between observation of authoring user and observation of receiving user
US9192300B2 (en) * 2008-05-23 2015-11-24 Invention Science Fund I, Llc Acquisition and particular association of data indicative of an inferred mental state of an authoring user
US9101263B2 (en) * 2008-05-23 2015-08-11 The Invention Science Fund I, Llc Acquisition and association of data indicative of an inferred mental state of an authoring user
US20090292658A1 (en) * 2008-05-23 2009-11-26 Searete Llc, A Limited Liability Corporation Of The State Of Delaware Acquisition and particular association of inference data indicative of inferred mental states of authoring users
US9305548B2 (en) 2008-05-27 2016-04-05 Voicebox Technologies Corporation System and method for an integrated, multi-modal, multi-device natural language voice services environment
US8464150B2 (en) 2008-06-07 2013-06-11 Apple Inc. Automatic language identification for dynamic text processing
US8219397B2 (en) * 2008-06-10 2012-07-10 Nuance Communications, Inc. Data processing system for autonomously building speech identification and tagging data
US20090327974A1 (en) * 2008-06-26 2009-12-31 Microsoft Corporation User interface for gestural control
US20110115702A1 (en) * 2008-07-08 2011-05-19 David Seaberg Process for Providing and Editing Instructions, Data, Data Structures, and Algorithms in a Computer System
US20100010370A1 (en) 2008-07-09 2010-01-14 De Lemos Jakob System and method for calibrating and normalizing eye data in emotional testing
KR100889026B1 (en) * 2008-07-22 2009-03-17 김정태 Searching system using image
US20100030549A1 (en) 2008-07-31 2010-02-04 Lee Michael M Mobile device having human language translation capability with positional feedback
US8136944B2 (en) 2008-08-15 2012-03-20 iMotions - Eye Tracking A/S System and method for identifying the existence and position of text in visual media content and for determining a subject's interactions with the text
US8165881B2 (en) * 2008-08-29 2012-04-24 Honda Motor Co., Ltd. System and method for variable text-to-speech with minimized distraction to operator of an automotive vehicle
US20100057465A1 (en) * 2008-09-03 2010-03-04 David Michael Kirsch Variable text-to-speech for automotive application
US8768702B2 (en) 2008-09-05 2014-07-01 Apple Inc. Multi-tiered voice feedback in an electronic device
US8285550B2 (en) * 2008-09-09 2012-10-09 Industrial Technology Research Institute Method and system for generating dialogue managers with diversified dialogue acts
US8898568B2 (en) 2008-09-09 2014-11-25 Apple Inc. Audio user interface
US8712776B2 (en) 2008-09-29 2014-04-29 Apple Inc. Systems and methods for selective text to speech synthesis
US8396714B2 (en) 2008-09-29 2013-03-12 Apple Inc. Systems and methods for concatenation of words in text to speech synthesis
US8583418B2 (en) 2008-09-29 2013-11-12 Apple Inc. Systems and methods of detecting language and natural language strings for text to speech synthesis
US8352272B2 (en) 2008-09-29 2013-01-08 Apple Inc. Systems and methods for text to speech synthesis
US8352268B2 (en) 2008-09-29 2013-01-08 Apple Inc. Systems and methods for selective rate of speech and speech preferences for text to speech synthesis
US8355919B2 (en) 2008-09-29 2013-01-15 Apple Inc. Systems and methods for text normalization for text to speech synthesis
US8218751B2 (en) 2008-09-29 2012-07-10 Avaya Inc. Method and apparatus for identifying and eliminating the source of background noise in multi-party teleconferences
US8676904B2 (en) 2008-10-02 2014-03-18 Apple Inc. Electronic devices with voice command and contextual data processing capabilities
US9922640B2 (en) 2008-10-17 2018-03-20 Ashwin P Rao System and method for multimodal utterance detection
KR101019335B1 (en) * 2008-11-11 2011-03-07 주식회사 팬택 Method and system for controlling application of mobile terminal using gesture
WO2010067118A1 (en) 2008-12-11 2010-06-17 Novauris Technologies Limited Speech recognition involving a mobile device
US8494857B2 (en) 2009-01-06 2013-07-23 Regents Of The University Of Minnesota Automatic measurement of speech fluency
US20100178956A1 (en) * 2009-01-14 2010-07-15 Safadi Rami B Method and apparatus for mobile voice recognition training
US8327040B2 (en) 2009-01-26 2012-12-04 Micron Technology, Inc. Host controller
US8862252B2 (en) 2009-01-30 2014-10-14 Apple Inc. Audio user interface for displayless electronic device
US8326637B2 (en) 2009-02-20 2012-12-04 Voicebox Technologies, Inc. System and method for processing multi-modal device interactions in a natural language voice services environment
KR101041039B1 (en) * 2009-02-27 2011-06-14 고려대학교 산학협력단 Method and Apparatus for space-time voice activity detection using audio and video information
US9295806B2 (en) 2009-03-06 2016-03-29 Imotions A/S System and method for determining emotional response to olfactory stimuli
US8380507B2 (en) 2009-03-09 2013-02-19 Apple Inc. Systems and methods for determining the language to use for speech generated by a text to speech engine
US9183554B1 (en) * 2009-04-21 2015-11-10 United Services Automobile Association (Usaa) Systems and methods for user authentication via mobile device
CN102405463B (en) * 2009-04-30 2015-07-29 三星电子株式会社 User intent reasoning device and method utilizing multi-modal information
US10255566B2 (en) 2011-06-03 2019-04-09 Apple Inc. Generating and processing task items that represent tasks to perform
US9858925B2 (en) 2009-06-05 2018-01-02 Apple Inc. Using context information to facilitate processing of commands in a virtual assistant
US10241752B2 (en) 2011-09-30 2019-03-26 Apple Inc. Interface for a virtual digital assistant
US10540976B2 (en) 2009-06-05 2020-01-21 Apple Inc. Contextual voice commands
US10241644B2 (en) 2011-06-03 2019-03-26 Apple Inc. Actionable reminder entries
US9431006B2 (en) 2009-07-02 2016-08-30 Apple Inc. Methods and apparatuses for automatic speech recognition
US8363957B2 (en) * 2009-08-06 2013-01-29 Delphi Technologies, Inc. Image classification system and method thereof
JP5715132B2 (en) * 2009-08-20 2015-05-07 コーニンクレッカ フィリップス エヌ ヴェ Method and system for image analysis
US9154730B2 (en) * 2009-10-16 2015-10-06 Hewlett-Packard Development Company, L.P. System and method for determining the active talkers in a video conference
US20110093263A1 (en) * 2009-10-20 2011-04-21 Mowzoon Shahin M Automated Video Captioning
US9653066B2 (en) * 2009-10-23 2017-05-16 Nuance Communications, Inc. System and method for estimating the reliability of alternate speech recognition hypotheses in real time
US8121618B2 (en) 2009-10-28 2012-02-21 Digimarc Corporation Intuitive computing methods and systems
US9171541B2 (en) 2009-11-10 2015-10-27 Voicebox Technologies Corporation System and method for hybrid processing in a natural language voice services environment
US8682649B2 (en) 2009-11-12 2014-03-25 Apple Inc. Sentiment prediction from textual data
KR101377459B1 (en) * 2009-12-21 2014-03-26 한국전자통신연구원 Apparatus for interpreting using utterance similarity measure and method thereof
US8600743B2 (en) 2010-01-06 2013-12-03 Apple Inc. Noise profile determination for voice-related feature
US8381107B2 (en) 2010-01-13 2013-02-19 Apple Inc. Adaptive audio feedback system and method
US8311838B2 (en) 2010-01-13 2012-11-13 Apple Inc. Devices and methods for identifying a prompt corresponding to a voice input in a sequence of prompts
JP5407880B2 (en) * 2010-01-13 2014-02-05 株式会社リコー Optical scanning apparatus and image forming apparatus
US10553209B2 (en) 2010-01-18 2020-02-04 Apple Inc. Systems and methods for hands-free notification summaries
US10276170B2 (en) 2010-01-18 2019-04-30 Apple Inc. Intelligent automated assistant
US10679605B2 (en) 2010-01-18 2020-06-09 Apple Inc. Hands-free list-reading by intelligent automated assistant
US10705794B2 (en) 2010-01-18 2020-07-07 Apple Inc. Automatically adapting user interfaces for hands-free interaction
US8676581B2 (en) * 2010-01-22 2014-03-18 Microsoft Corporation Speech recognition analysis via identification information
WO2011089450A2 (en) 2010-01-25 2011-07-28 Andrew Peter Nelson Jerram Apparatuses, methods and systems for a digital conversation management platform
US9205328B2 (en) 2010-02-18 2015-12-08 Activision Publishing, Inc. Videogame system and method that enables characters to earn virtual fans by completing secondary objectives
US8682667B2 (en) 2010-02-25 2014-03-25 Apple Inc. User profiling for selecting user specific voice input processing information
WO2011116514A1 (en) * 2010-03-23 2011-09-29 Nokia Corporation Method and apparatus for determining a user age range
JP2011209787A (en) * 2010-03-29 2011-10-20 Sony Corp Information processor, information processing method, and program
US9682324B2 (en) 2010-05-12 2017-06-20 Activision Publishing, Inc. System and method for enabling players to participate in asynchronous, competitive challenges
US8560318B2 (en) * 2010-05-14 2013-10-15 Sony Computer Entertainment Inc. Methods and system for evaluating potential confusion within grammar structure for set of statements to be used in speech recognition during computing event
US8639516B2 (en) 2010-06-04 2014-01-28 Apple Inc. User-specific noise suppression for voice quality improvements
US20200226012A1 (en) * 2010-06-07 2020-07-16 Affectiva, Inc. File system manipulation using machine learning
US8296151B2 (en) * 2010-06-18 2012-10-23 Microsoft Corporation Compound gesture-speech commands
US20110313762A1 (en) * 2010-06-20 2011-12-22 International Business Machines Corporation Speech output with confidence indication
US8903891B2 (en) * 2010-06-24 2014-12-02 Sap Se User interface communication utilizing service request identification to manage service requests
US8713021B2 (en) 2010-07-07 2014-04-29 Apple Inc. Unsupervised document clustering using latent semantic density analysis
US9104670B2 (en) 2010-07-21 2015-08-11 Apple Inc. Customized search or acquisition of digital media assets
US10353495B2 (en) * 2010-08-20 2019-07-16 Knowles Electronics, Llc Personalized operation of a mobile device using sensor signatures
JP2012047924A (en) * 2010-08-26 2012-03-08 Sony Corp Information processing device and information processing method, and program
US8719006B2 (en) 2010-08-27 2014-05-06 Apple Inc. Combined statistical and rule-based part-of-speech tagging for text-to-speech synthesis
US8719014B2 (en) 2010-09-27 2014-05-06 Apple Inc. Electronic device with text error correction based on voice recognition data
US8594997B2 (en) * 2010-09-27 2013-11-26 Sap Ag Context-aware conversational user interface
US9484046B2 (en) 2010-11-04 2016-11-01 Digimarc Corporation Smartphone-based methods and systems
US8676574B2 (en) 2010-11-10 2014-03-18 Sony Computer Entertainment Inc. Method for tone/intonation recognition using auditory attention cues
US8966036B1 (en) * 2010-11-24 2015-02-24 Google Inc. Method and system for website user account management based on event transition matrixes
CN103493126B (en) * 2010-11-25 2015-09-09 爱立信(中国)通信有限公司 Audio data analysis system and method
US8559606B2 (en) 2010-12-07 2013-10-15 Microsoft Corporation Multimodal telephone calls
US10762293B2 (en) 2010-12-22 2020-09-01 Apple Inc. Using parts-of-speech tagging and named entity recognition for spelling correction
US10515147B2 (en) 2010-12-22 2019-12-24 Apple Inc. Using statistical language models for contextual lookup
KR101749100B1 (en) * 2010-12-23 2017-07-03 한국전자통신연구원 System and method for integrating gesture and sound for controlling device
CN102637071A (en) * 2011-02-09 2012-08-15 英华达(上海)电子有限公司 Multimedia input method applied to multimedia input device
US9047867B2 (en) * 2011-02-21 2015-06-02 Adobe Systems Incorporated Systems and methods for concurrent signal recognition
US8781836B2 (en) 2011-02-22 2014-07-15 Apple Inc. Hearing assistance system for providing consistent human speech
US20120239396A1 (en) * 2011-03-15 2012-09-20 At&T Intellectual Property I, L.P. Multimodal remote control
US9262612B2 (en) 2011-03-21 2016-02-16 Apple Inc. Device access using voice authentication
US8756061B2 (en) * 2011-04-01 2014-06-17 Sony Computer Entertainment Inc. Speech syllable/vowel/phone boundary detection using auditory attention cues
US20120259638A1 (en) * 2011-04-08 2012-10-11 Sony Computer Entertainment Inc. Apparatus and method for determining relevance of input speech
US9135562B2 (en) 2011-04-13 2015-09-15 Tata Consultancy Services Limited Method for gender verification of individuals based on multimodal data analysis utilizing an individual's expression prompted by a greeting
US9230549B1 (en) 2011-05-18 2016-01-05 The United States Of America As Represented By The Secretary Of The Air Force Multi-modal communications (MMC)
US10057736B2 (en) 2011-06-03 2018-08-21 Apple Inc. Active transport based notifications
US10672399B2 (en) 2011-06-03 2020-06-02 Apple Inc. Switching between text data and audio data based on a mapping
US8975903B2 (en) 2011-06-09 2015-03-10 Ford Global Technologies, Llc Proximity switch having learned sensitivity and method therefor
US8928336B2 (en) 2011-06-09 2015-01-06 Ford Global Technologies, Llc Proximity switch having sensitivity control and method therefor
US8812294B2 (en) 2011-06-21 2014-08-19 Apple Inc. Translating phrases from one language into another using an order-based set of declarative rules
US8929598B2 (en) * 2011-06-29 2015-01-06 Olympus Imaging Corp. Tracking apparatus, tracking method, and storage medium to store tracking program
JP5664480B2 (en) * 2011-06-30 2015-02-04 富士通株式会社 Abnormal state detection device, telephone, abnormal state detection method, and program
KR101801327B1 (en) * 2011-07-29 2017-11-27 삼성전자주식회사 Apparatus for generating emotion information, method for for generating emotion information and recommendation apparatus based on emotion information
US10004286B2 (en) 2011-08-08 2018-06-26 Ford Global Technologies, Llc Glove having conductive ink and method of interacting with proximity sensor
US8706472B2 (en) 2011-08-11 2014-04-22 Apple Inc. Method for disambiguating multiple readings in language conversion
US8994660B2 (en) 2011-08-29 2015-03-31 Apple Inc. Text correction processing
US9143126B2 (en) 2011-09-22 2015-09-22 Ford Global Technologies, Llc Proximity switch having lockout control for controlling movable panel
US8762156B2 (en) 2011-09-28 2014-06-24 Apple Inc. Speech recognition repair using contextual information
US8994228B2 (en) 2011-11-03 2015-03-31 Ford Global Technologies, Llc Proximity switch having wrong touch feedback
US10112556B2 (en) 2011-11-03 2018-10-30 Ford Global Technologies, Llc Proximity switch having wrong touch adaptive learning and method
US8878438B2 (en) 2011-11-04 2014-11-04 Ford Global Technologies, Llc Lamp and proximity switch assembly and method
GB2496893A (en) * 2011-11-25 2013-05-29 Nokia Corp Presenting Name Bubbles at Different Image Zoom Levels
JP5682543B2 (en) * 2011-11-28 2015-03-11 トヨタ自動車株式会社 Dialogue device, dialogue method and dialogue program
US9250713B2 (en) * 2011-12-05 2016-02-02 Microsoft Technology Licensing, Llc Control exposure
BR112014015844A8 (en) * 2011-12-26 2017-07-04 Intel Corp determining vehicle-based occupant audio and visual inputs
US20130212501A1 (en) * 2012-02-10 2013-08-15 Glen J. Anderson Perceptual computing with conversational agent
KR101971697B1 (en) * 2012-02-24 2019-04-23 삼성전자주식회사 Method and apparatus for authenticating user using hybrid biometrics information in a user device
US8843364B2 (en) 2012-02-29 2014-09-23 Adobe Systems Incorporated Language informed source separation
US9384493B2 (en) 2012-03-01 2016-07-05 Visa International Service Association Systems and methods to quantify consumer sentiment based on transaction data
US10134385B2 (en) 2012-03-02 2018-11-20 Apple Inc. Systems and methods for name pronunciation
US9483461B2 (en) 2012-03-06 2016-11-01 Apple Inc. Handling speech synthesis of content for multiple languages
WO2013138633A1 (en) 2012-03-15 2013-09-19 Regents Of The University Of Minnesota Automated verbal fluency assessment
US8687880B2 (en) 2012-03-20 2014-04-01 Microsoft Corporation Real time head pose estimation
CN102592593B (en) * 2012-03-31 2014-01-01 山东大学 Emotional-characteristic extraction method implemented through considering sparsity of multilinear group in speech
US9660644B2 (en) 2012-04-11 2017-05-23 Ford Global Technologies, Llc Proximity switch assembly and activation method
US9287864B2 (en) 2012-04-11 2016-03-15 Ford Global Technologies, Llc Proximity switch assembly and calibration method therefor
US8933708B2 (en) 2012-04-11 2015-01-13 Ford Global Technologies, Llc Proximity switch assembly and activation method with exploration mode
US9219472B2 (en) 2012-04-11 2015-12-22 Ford Global Technologies, Llc Proximity switch assembly and activation method using rate monitoring
US9944237B2 (en) 2012-04-11 2018-04-17 Ford Global Technologies, Llc Proximity switch assembly with signal drift rejection and method
US9568527B2 (en) 2012-04-11 2017-02-14 Ford Global Technologies, Llc Proximity switch assembly and activation method having virtual button mode
US9184745B2 (en) 2012-04-11 2015-11-10 Ford Global Technologies, Llc Proximity switch assembly and method of sensing user input based on signal rate of change
US9197206B2 (en) 2012-04-11 2015-11-24 Ford Global Technologies, Llc Proximity switch having differential contact surface
US9831870B2 (en) 2012-04-11 2017-11-28 Ford Global Technologies, Llc Proximity switch assembly and method of tuning same
US9531379B2 (en) 2012-04-11 2016-12-27 Ford Global Technologies, Llc Proximity switch assembly having groove between adjacent proximity sensors
US9520875B2 (en) 2012-04-11 2016-12-13 Ford Global Technologies, Llc Pliable proximity switch assembly and activation method
US9065447B2 (en) 2012-04-11 2015-06-23 Ford Global Technologies, Llc Proximity switch assembly and method having adaptive time delay
US9559688B2 (en) 2012-04-11 2017-01-31 Ford Global Technologies, Llc Proximity switch assembly having pliable surface and depression
US9280610B2 (en) 2012-05-14 2016-03-08 Apple Inc. Crowd sourcing information to fulfill user requests
US10417037B2 (en) 2012-05-15 2019-09-17 Apple Inc. Systems and methods for integrating third party services with a digital assistant
US8775442B2 (en) 2012-05-15 2014-07-08 Apple Inc. Semantic search using a single-source semantic model
US9136840B2 (en) 2012-05-17 2015-09-15 Ford Global Technologies, Llc Proximity switch assembly having dynamic tuned threshold
JP2013242763A (en) * 2012-05-22 2013-12-05 Clarion Co Ltd Dialogue apparatus, dialogue system and dialogue control method
US9251704B2 (en) * 2012-05-29 2016-02-02 GM Global Technology Operations LLC Reducing driver distraction in spoken dialogue
US8981602B2 (en) 2012-05-29 2015-03-17 Ford Global Technologies, Llc Proximity switch assembly having non-switch contact and method
US8849041B2 (en) * 2012-06-04 2014-09-30 Comcast Cable Communications, Llc Data recognition in content
US9337832B2 (en) 2012-06-06 2016-05-10 Ford Global Technologies, Llc Proximity switch and method of adjusting sensitivity therefor
US9881616B2 (en) * 2012-06-06 2018-01-30 Qualcomm Incorporated Method and systems having improved speech recognition
US9721563B2 (en) 2012-06-08 2017-08-01 Apple Inc. Name recognition system
WO2013185109A2 (en) 2012-06-08 2013-12-12 Apple Inc. Systems and methods for recognizing textual identifiers within a plurality of words
US9641172B2 (en) 2012-06-27 2017-05-02 Ford Global Technologies, Llc Proximity switch assembly having varying size electrode fingers
US20140007115A1 (en) * 2012-06-29 2014-01-02 Ning Lu Multi-modal behavior awareness for human natural command control
US9495129B2 (en) 2012-06-29 2016-11-15 Apple Inc. Device, method, and user interface for voice-activated navigation and browsing of a document
TWI456515B (en) * 2012-07-13 2014-10-11 Univ Nat Chiao Tung Human identification system by fusion of face recognition and speaker recognition, method and service robot thereof
US9672815B2 (en) * 2012-07-20 2017-06-06 Interactive Intelligence Group, Inc. Method and system for real-time keyword spotting for speech analytics
NZ730641A (en) * 2012-08-24 2018-08-31 Interactive Intelligence Inc Method and system for selectively biased linear discriminant analysis in automatic speech recognition systems
US9424840B1 (en) * 2012-08-31 2016-08-23 Amazon Technologies, Inc. Speech recognition platforms
US9576574B2 (en) 2012-09-10 2017-02-21 Apple Inc. Context-sensitive handling of interruptions by intelligent digital assistant
US8922340B2 (en) 2012-09-11 2014-12-30 Ford Global Technologies, Llc Proximity switch based door latch release
US9547647B2 (en) 2012-09-19 2017-01-17 Apple Inc. Voice-based media searching
US9105268B2 (en) 2012-09-19 2015-08-11 24/7 Customer, Inc. Method and apparatus for predicting intent in IVR using natural language queries
US8935167B2 (en) 2012-09-25 2015-01-13 Apple Inc. Exemplar-based latent perceptual modeling for automatic speech recognition
US8484025B1 (en) * 2012-10-04 2013-07-09 Google Inc. Mapping an audio utterance to an action using a classifier
US9031293B2 (en) 2012-10-19 2015-05-12 Sony Computer Entertainment Inc. Multi-modal sensor based emotion recognition and emotional interface
US9020822B2 (en) 2012-10-19 2015-04-28 Sony Computer Entertainment Inc. Emotion recognition using auditory attention cues extracted from users voice
US8796575B2 (en) 2012-10-31 2014-08-05 Ford Global Technologies, Llc Proximity switch assembly having ground layer
KR20140070861A (en) * 2012-11-28 2014-06-11 한국전자통신연구원 Apparatus and method for controlling multi modal human-machine interface
US9672811B2 (en) 2012-11-29 2017-06-06 Sony Interactive Entertainment Inc. Combining auditory attention cues with phoneme posterior scores for phone/vowel/syllable boundary detection
US9265458B2 (en) 2012-12-04 2016-02-23 Sync-Think, Inc. Application of smooth pursuit cognitive testing paradigms to clinical drug development
US20140173440A1 (en) * 2012-12-13 2014-06-19 Imimtek, Inc. Systems and methods for natural interaction with operating systems and application graphical user interfaces using gestural and vocal input
US9646605B2 (en) * 2013-01-22 2017-05-09 Interactive Intelligence Group, Inc. False alarm reduction in speech recognition systems using contextual information
DE112014000709B4 (en) 2013-02-07 2021-12-30 Apple Inc. METHOD AND DEVICE FOR OPERATING A VOICE TRIGGER FOR A DIGITAL ASSISTANT
KR102050897B1 (en) * 2013-02-07 2019-12-02 삼성전자주식회사 Mobile terminal comprising voice communication function and voice communication method thereof
US9311640B2 (en) 2014-02-11 2016-04-12 Digimarc Corporation Methods and arrangements for smartphone payments and transactions
US9380976B2 (en) 2013-03-11 2016-07-05 Sync-Think, Inc. Optical neuroinformatics
US9311204B2 (en) 2013-03-13 2016-04-12 Ford Global Technologies, Llc Proximity interface development system having replicator and method
US10242097B2 (en) 2013-03-14 2019-03-26 Aperture Investments, Llc Music selection and organization using rhythm, texture and pitch
US10623480B2 (en) 2013-03-14 2020-04-14 Aperture Investments, Llc Music categorization using rhythm, texture and pitch
US10061476B2 (en) 2013-03-14 2018-08-28 Aperture Investments, Llc Systems and methods for identifying, searching, organizing, selecting and distributing content based on mood
US10225328B2 (en) 2013-03-14 2019-03-05 Aperture Investments, Llc Music selection and organization using audio fingerprints
US10424292B1 (en) 2013-03-14 2019-09-24 Amazon Technologies, Inc. System for recognizing and responding to environmental noises
US9733821B2 (en) 2013-03-14 2017-08-15 Apple Inc. Voice control to diagnose inadvertent activation of accessibility features
US11271993B2 (en) 2013-03-14 2022-03-08 Aperture Investments, Llc Streaming music categorization using rhythm, texture and pitch
US9368114B2 (en) 2013-03-14 2016-06-14 Apple Inc. Context-sensitive handling of interruptions
US10652394B2 (en) 2013-03-14 2020-05-12 Apple Inc. System and method for processing voicemail
US9977779B2 (en) 2013-03-14 2018-05-22 Apple Inc. Automatic supplementation of word correction dictionaries
US10572476B2 (en) 2013-03-14 2020-02-25 Apple Inc. Refining a search based on schedule items
US9875304B2 (en) 2013-03-14 2018-01-23 Aperture Investments, Llc Music selection and organization using audio fingerprints
US10642574B2 (en) 2013-03-14 2020-05-05 Apple Inc. Device, method, and graphical user interface for outputting captions
WO2014144579A1 (en) 2013-03-15 2014-09-18 Apple Inc. System and method for updating an adaptive speech recognition model
US10748529B1 (en) 2013-03-15 2020-08-18 Apple Inc. Voice activated device for use with a voice-based digital assistant
AU2014233517B2 (en) * 2013-03-15 2017-05-25 Apple Inc. Training an at least partial voice command system
KR101857648B1 (en) 2013-03-15 2018-05-15 애플 인크. User training by intelligent digital assistant
US10078487B2 (en) 2013-03-15 2018-09-18 Apple Inc. Context-sensitive handling of interruptions
US20140288939A1 (en) * 2013-03-20 2014-09-25 Navteq B.V. Method and apparatus for optimizing timing of audio commands based on recognized audio patterns
US9202459B2 (en) * 2013-04-19 2015-12-01 GM Global Technology Operations LLC Methods and systems for managing dialog of speech systems
US9609272B2 (en) * 2013-05-02 2017-03-28 Avaya Inc. Optimized video snapshot
US20160063335A1 (en) 2013-05-03 2016-03-03 Nokia Technologies Oy A method and technical equipment for people identification
KR101351561B1 (en) * 2013-05-08 2014-01-15 주식회사 아몬드 소프트 Big data extracting system and method
US9251275B2 (en) * 2013-05-16 2016-02-02 International Business Machines Corporation Data clustering and user modeling for next-best-action decisions
WO2014197334A2 (en) 2013-06-07 2014-12-11 Apple Inc. System and method for user-specified pronunciation of words for speech synthesis and recognition
WO2014197336A1 (en) 2013-06-07 2014-12-11 Apple Inc. System and method for detecting errors in interactions with a voice-based digital assistant
US9582608B2 (en) 2013-06-07 2017-02-28 Apple Inc. Unified ranking with entropy-weighted information for phrase-based semantic auto-completion
WO2014197335A1 (en) 2013-06-08 2014-12-11 Apple Inc. Interpreting and acting upon commands that involve sharing information with remote devices
US10176167B2 (en) 2013-06-09 2019-01-08 Apple Inc. System and method for inferring user intent from speech inputs
EP3937002A1 (en) 2013-06-09 2022-01-12 Apple Inc. Device, method, and graphical user interface for enabling conversation persistence across two or more instances of a digital assistant
AU2014278595B2 (en) 2013-06-13 2017-04-06 Apple Inc. System and method for emergency calls initiated by voice command
TW201504839A (en) * 2013-07-19 2015-02-01 Quanta Comp Inc Portable electronic apparatus and interactive human face login method
US20150039312A1 (en) * 2013-07-31 2015-02-05 GM Global Technology Operations LLC Controlling speech dialog using an additional sensor
DE112014003653B4 (en) 2013-08-06 2024-04-18 Apple Inc. Automatically activate intelligent responses based on activities from remote devices
US9165182B2 (en) * 2013-08-19 2015-10-20 Cisco Technology, Inc. Method and apparatus for using face detection information to improve speaker segmentation
US11199906B1 (en) 2013-09-04 2021-12-14 Amazon Technologies, Inc. Global user input management
DE102013016196B4 (en) 2013-09-27 2023-10-12 Volkswagen Ag Motor vehicle operation using combined input modalities
US9330171B1 (en) * 2013-10-17 2016-05-03 Google Inc. Video annotation using deep network architectures
US9779722B2 (en) * 2013-11-05 2017-10-03 GM Global Technology Operations LLC System for adapting speech recognition vocabulary
US20150154002A1 (en) * 2013-12-04 2015-06-04 Google Inc. User interface customization based on speaker characteristics
US10296160B2 (en) 2013-12-06 2019-05-21 Apple Inc. Method for extracting salient dialog usage from live data
US9354778B2 (en) 2013-12-06 2016-05-31 Digimarc Corporation Smartphone-based methods and systems
TWI543635B (en) * 2013-12-18 2016-07-21 jing-feng Liu Speech Acquisition Method of Hearing Aid System and Hearing Aid System
KR101550580B1 (en) * 2014-01-17 2015-09-08 한국과학기술연구원 User interface apparatus and control method thereof
CN104795067B (en) * 2014-01-20 2019-08-06 华为技术有限公司 Voice interactive method and device
WO2015120263A1 (en) 2014-02-06 2015-08-13 Contact Solutions LLC Systems, apparatuses and methods for communication flow modification
GB2523353B (en) * 2014-02-21 2017-03-01 Jaguar Land Rover Ltd System for use in a vehicle
US9412363B2 (en) 2014-03-03 2016-08-09 Microsoft Technology Licensing, Llc Model based approach for on-screen item selection and disambiguation
US10304458B1 (en) * 2014-03-06 2019-05-28 Board of Trustees of the University of Alabama and the University of Alabama in Huntsville Systems and methods for transcribing videos using speaker identification
US8825585B1 (en) * 2014-03-11 2014-09-02 Fmr Llc Interpretation of natural communication
US9966079B2 (en) * 2014-03-24 2018-05-08 Lenovo (Singapore) Pte. Ltd. Directing voice input based on eye tracking
US20220147562A1 (en) 2014-03-27 2022-05-12 Aperture Investments, Llc Music streaming, playlist creation and streaming architecture
US9620105B2 (en) 2014-05-15 2017-04-11 Apple Inc. Analyzing audio input for efficient speech and music recognition
US10095850B2 (en) * 2014-05-19 2018-10-09 Kadenze, Inc. User identity authentication techniques for on-line content or access
US11669090B2 (en) 2014-05-20 2023-06-06 State Farm Mutual Automobile Insurance Company Autonomous vehicle operation feature monitoring and evaluation of effectiveness
US9792656B1 (en) 2014-05-20 2017-10-17 State Farm Mutual Automobile Insurance Company Fault determination with autonomous feature use monitoring
US10373259B1 (en) 2014-05-20 2019-08-06 State Farm Mutual Automobile Insurance Company Fully autonomous vehicle insurance pricing
US10599155B1 (en) 2014-05-20 2020-03-24 State Farm Mutual Automobile Insurance Company Autonomous vehicle operation feature monitoring and evaluation of effectiveness
US9972054B1 (en) 2014-05-20 2018-05-15 State Farm Mutual Automobile Insurance Company Accident fault determination for autonomous vehicles
US10592095B2 (en) 2014-05-23 2020-03-17 Apple Inc. Instantaneous speaking of content on touch devices
US9502031B2 (en) 2014-05-27 2016-11-22 Apple Inc. Method for supporting dynamic grammars in WFST-based ASR
US9508360B2 (en) * 2014-05-28 2016-11-29 International Business Machines Corporation Semantic-free text analysis for identifying traits
US9734193B2 (en) 2014-05-30 2017-08-15 Apple Inc. Determining domain salience ranking from ambiguous words in natural speech
US9715875B2 (en) 2014-05-30 2017-07-25 Apple Inc. Reducing the need for manual start/end-pointing and trigger phrases
US10078631B2 (en) 2014-05-30 2018-09-18 Apple Inc. Entropy-guided text prediction using combined word and character n-gram language models
US9842101B2 (en) 2014-05-30 2017-12-12 Apple Inc. Predictive conversion of language input
US9430463B2 (en) 2014-05-30 2016-08-30 Apple Inc. Exemplar-based natural language processing
US9785630B2 (en) 2014-05-30 2017-10-10 Apple Inc. Text prediction using combined word N-gram and unigram language models
US10170123B2 (en) 2014-05-30 2019-01-01 Apple Inc. Intelligent assistant for home automation
US9633004B2 (en) 2014-05-30 2017-04-25 Apple Inc. Better resolution when referencing to concepts
TWI566107B (en) 2014-05-30 2017-01-11 蘋果公司 Method for processing a multi-part voice command, non-transitory computer readable storage medium and electronic device
US10289433B2 (en) 2014-05-30 2019-05-14 Apple Inc. Domain specific language for encoding assistant dialog
US9760559B2 (en) 2014-05-30 2017-09-12 Apple Inc. Predictive text input
US10146318B2 (en) * 2014-06-13 2018-12-04 Thomas Malzbender Techniques for using gesture recognition to effectuate character selection
US9697828B1 (en) * 2014-06-20 2017-07-04 Amazon Technologies, Inc. Keyword detection modeling using contextual and environmental information
US9600743B2 (en) 2014-06-27 2017-03-21 International Business Machines Corporation Directing field of vision based on personal interests
US9338493B2 (en) 2014-06-30 2016-05-10 Apple Inc. Intelligent automated assistant for TV user interactions
US10659851B2 (en) 2014-06-30 2020-05-19 Apple Inc. Real-time digital assistant knowledge updates
US10376792B2 (en) 2014-07-03 2019-08-13 Activision Publishing, Inc. Group composition matchmaking system and method for multiplayer video games
US10540723B1 (en) 2014-07-21 2020-01-21 State Farm Mutual Automobile Insurance Company Methods of providing insurance savings based upon telematics and usage-based insurance
US9972184B2 (en) * 2014-07-24 2018-05-15 State Farm Mutual Automobile Insurance Company Systems and methods for monitoring a vehicle operator and for monitoring an operating environment within the vehicle
US9646198B2 (en) * 2014-08-08 2017-05-09 International Business Machines Corporation Sentiment analysis in a video conference
US9471837B2 (en) 2014-08-19 2016-10-18 International Business Machines Corporation Real-time analytics to identify visual objects of interest
US10446141B2 (en) 2014-08-28 2019-10-15 Apple Inc. Automatic speech recognition based on user feedback
CN104217226B (en) * 2014-09-09 2017-07-11 天津大学 Conversation activity recognition methods based on deep neural network and conditional random field
US9818400B2 (en) 2014-09-11 2017-11-14 Apple Inc. Method and apparatus for discovering trending terms in speech requests
US10789041B2 (en) 2014-09-12 2020-09-29 Apple Inc. Dynamic thresholds for always listening speech trigger
EP3195145A4 (en) 2014-09-16 2018-01-24 VoiceBox Technologies Corporation Voice commerce
WO2016044321A1 (en) 2014-09-16 2016-03-24 Min Tang Integration of domain information into state transitions of a finite state transducer for natural language processing
US10317992B2 (en) 2014-09-25 2019-06-11 Microsoft Technology Licensing, Llc Eye gaze for spoken language understanding in multi-modal conversational interactions
US9646609B2 (en) 2014-09-30 2017-05-09 Apple Inc. Caching apparatus for serving phonetic pronunciations
US10074360B2 (en) 2014-09-30 2018-09-11 Apple Inc. Providing an indication of the suitability of speech recognition
US9668121B2 (en) 2014-09-30 2017-05-30 Apple Inc. Social reminders
US9886432B2 (en) 2014-09-30 2018-02-06 Apple Inc. Parsimonious handling of word inflection via categorical stem + suffix N-gram language models
US10127911B2 (en) 2014-09-30 2018-11-13 Apple Inc. Speaker identification and unsupervised speaker adaptation techniques
JP5907231B1 (en) * 2014-10-15 2016-04-26 富士通株式会社 INPUT INFORMATION SUPPORT DEVICE, INPUT INFORMATION SUPPORT METHOD, AND INPUT INFORMATION SUPPORT PROGRAM
WO2016061309A1 (en) 2014-10-15 2016-04-21 Voicebox Technologies Corporation System and method for providing follow-up responses to prior natural language inputs of a user
US10038443B2 (en) 2014-10-20 2018-07-31 Ford Global Technologies, Llc Directional proximity switch assembly
JP6365229B2 (en) 2014-10-23 2018-08-01 株式会社デンソー Multisensory interface control method, multisensory interface control device, and multisensory interface system
US9269374B1 (en) 2014-10-27 2016-02-23 Mattersight Corporation Predictive video analytics system and methods
US9946531B1 (en) 2014-11-13 2018-04-17 State Farm Mutual Automobile Insurance Company Autonomous vehicle software version assessment
US10431214B2 (en) 2014-11-26 2019-10-01 Voicebox Technologies Corporation System and method of determining a domain and/or an action related to a natural language input
US10614799B2 (en) 2014-11-26 2020-04-07 Voicebox Technologies Corporation System and method of providing intent predictions for an utterance prior to a system detection of an end of the utterance
US10552013B2 (en) 2014-12-02 2020-02-04 Apple Inc. Data detection
US9711141B2 (en) 2014-12-09 2017-07-18 Apple Inc. Disambiguating heteronyms in speech synthesis
US9898170B2 (en) 2014-12-10 2018-02-20 International Business Machines Corporation Establishing user specified interaction modes in a question answering dialogue
US10118099B2 (en) 2014-12-16 2018-11-06 Activision Publishing, Inc. System and method for transparently styling non-player characters in a multiplayer video game
WO2016126248A1 (en) * 2015-02-04 2016-08-11 Empire Technology Development Llc Adaptive merchant site sampling linked to payment transactions
US9374465B1 (en) * 2015-02-11 2016-06-21 Language Line Services, Inc. Multi-channel and multi-modal language interpretation system utilizing a gated or non-gated configuration
WO2016137797A1 (en) * 2015-02-23 2016-09-01 SomniQ, Inc. Empathetic user interface, systems, and methods for interfacing with empathetic computing device
US9865280B2 (en) 2015-03-06 2018-01-09 Apple Inc. Structured dictation using intelligent automated assistants
US9886953B2 (en) 2015-03-08 2018-02-06 Apple Inc. Virtual assistant activation
US9721566B2 (en) 2015-03-08 2017-08-01 Apple Inc. Competing devices responding to voice triggers
US10567477B2 (en) 2015-03-08 2020-02-18 Apple Inc. Virtual assistant continuity
US9654103B2 (en) 2015-03-18 2017-05-16 Ford Global Technologies, Llc Proximity switch assembly having haptic feedback and method
US9899019B2 (en) 2015-03-18 2018-02-20 Apple Inc. Systems and methods for structured stem and suffix language models
US9431003B1 (en) 2015-03-27 2016-08-30 International Business Machines Corporation Imbuing artificial intelligence systems with idiomatic traits
US10418032B1 (en) * 2015-04-10 2019-09-17 Soundhound, Inc. System and methods for a virtual assistant to manage and use context in a natural language dialog
CN104820678B (en) * 2015-04-15 2018-10-19 小米科技有限责任公司 Audio-frequency information recognition methods and device
US9842105B2 (en) 2015-04-16 2017-12-12 Apple Inc. Parsimonious continuous-space phrase representations for natural language processing
US10315113B2 (en) 2015-05-14 2019-06-11 Activision Publishing, Inc. System and method for simulating gameplay of nonplayer characters distributed across networked end user devices
US9548733B2 (en) 2015-05-20 2017-01-17 Ford Global Technologies, Llc Proximity sensor assembly having interleaved electrode configuration
US10083688B2 (en) 2015-05-27 2018-09-25 Apple Inc. Device voice control for selecting a displayed affordance
US10127220B2 (en) 2015-06-04 2018-11-13 Apple Inc. Language identification from short strings
US9578173B2 (en) 2015-06-05 2017-02-21 Apple Inc. Virtual assistant aided communication with 3rd party service in a communication session
US10101822B2 (en) 2015-06-05 2018-10-16 Apple Inc. Language input correction
US10255907B2 (en) 2015-06-07 2019-04-09 Apple Inc. Automatic accent detection using acoustic models
US10186254B2 (en) 2015-06-07 2019-01-22 Apple Inc. Context-based endpoint detection
US11025565B2 (en) 2015-06-07 2021-06-01 Apple Inc. Personalized prediction of responses for instant messaging
US10471348B2 (en) 2015-07-24 2019-11-12 Activision Publishing, Inc. System and method for creating and sharing customized video game weapon configurations in multiplayer video games via one or more social networks
US10437871B2 (en) * 2015-08-12 2019-10-08 Hithink Royalflush Information Network Co., Ltd. Method and system for sentiment analysis of information
CN105159111B (en) * 2015-08-24 2019-01-25 百度在线网络技术(北京)有限公司 Intelligent interaction device control method and system based on artificial intelligence
US9805601B1 (en) 2015-08-28 2017-10-31 State Farm Mutual Automobile Insurance Company Vehicular traffic alerts for avoidance of abnormal traffic conditions
US10671428B2 (en) 2015-09-08 2020-06-02 Apple Inc. Distributed personal assistant
US10747498B2 (en) 2015-09-08 2020-08-18 Apple Inc. Zero latency digital assistant
CN105187981A (en) * 2015-09-18 2015-12-23 智车优行科技(北京)有限公司 In-vehicle sound field distribution controlling apparatus and method
US9665567B2 (en) 2015-09-21 2017-05-30 International Business Machines Corporation Suggesting emoji characters based on current contextual emotional state of user
US9697820B2 (en) 2015-09-24 2017-07-04 Apple Inc. Unit-selection text-to-speech synthesis using concatenation-sensitive neural networks
US10366158B2 (en) 2015-09-29 2019-07-30 Apple Inc. Efficient word encoding for recurrent neural network language models
US11010550B2 (en) 2015-09-29 2021-05-18 Apple Inc. Unified language modeling framework for word prediction, auto-completion and auto-correction
US11587559B2 (en) 2015-09-30 2023-02-21 Apple Inc. Intelligent device identification
US10691473B2 (en) 2015-11-06 2020-06-23 Apple Inc. Intelligent automated assistant in a messaging environment
US10049668B2 (en) 2015-12-02 2018-08-14 Apple Inc. Applying neural network language models to weighted finite state transducers for automatic speech recognition
USD806711S1 (en) 2015-12-11 2018-01-02 SomniQ, Inc. Portable electronic device
US9886958B2 (en) 2015-12-11 2018-02-06 Microsoft Technology Licensing, Llc Language and domain independent model based approach for on-screen item selection
US10223066B2 (en) 2015-12-23 2019-03-05 Apple Inc. Proactive assistance based on dialog communication between devices
KR102434604B1 (en) * 2016-01-05 2022-08-23 한국전자통신연구원 Voice recognition terminal, voice recognition server and voice recognition method for performing personalized voice recognition
CN105700682A (en) * 2016-01-08 2016-06-22 北京乐驾科技有限公司 Intelligent gender and emotion recognition detection system and method based on vision and voice
US11719545B2 (en) 2016-01-22 2023-08-08 Hyundai Motor Company Autonomous vehicle component damage and salvage assessment
US10308246B1 (en) 2016-01-22 2019-06-04 State Farm Mutual Automobile Insurance Company Autonomous vehicle signal control
US10134278B1 (en) 2016-01-22 2018-11-20 State Farm Mutual Automobile Insurance Company Autonomous vehicle application
US10324463B1 (en) 2016-01-22 2019-06-18 State Farm Mutual Automobile Insurance Company Autonomous vehicle operation adjustment based upon route
US11242051B1 (en) 2016-01-22 2022-02-08 State Farm Mutual Automobile Insurance Company Autonomous vehicle action communications
US10395332B1 (en) 2016-01-22 2019-08-27 State Farm Mutual Automobile Insurance Company Coordinated autonomous vehicle automatic area scanning
US11441916B1 (en) 2016-01-22 2022-09-13 State Farm Mutual Automobile Insurance Company Autonomous vehicle trip routing
US10446143B2 (en) 2016-03-14 2019-10-15 Apple Inc. Identification of voice inputs providing credentials
US9817817B2 (en) 2016-03-17 2017-11-14 International Business Machines Corporation Detection and labeling of conversational actions
JP2017182776A (en) * 2016-03-29 2017-10-05 株式会社デンソー Vehicle periphery monitoring apparatus and computer program
US9767349B1 (en) * 2016-05-09 2017-09-19 Xerox Corporation Learning emotional states using personalized calibration tasks
US9934775B2 (en) 2016-05-26 2018-04-03 Apple Inc. Unit-selection text-to-speech synthesis based on predicted concatenation parameters
US10832665B2 (en) * 2016-05-27 2020-11-10 Centurylink Intellectual Property Llc Internet of things (IoT) human interface apparatus, system, and method
US9972304B2 (en) 2016-06-03 2018-05-15 Apple Inc. Privacy preserving distributed evaluation framework for embedded personalized systems
US10249300B2 (en) 2016-06-06 2019-04-02 Apple Inc. Intelligent list reading
US10049663B2 (en) 2016-06-08 2018-08-14 Apple, Inc. Intelligent automated assistant for media exploration
DK179588B1 (en) 2016-06-09 2019-02-22 Apple Inc. Intelligent automated assistant in a home environment
US10490187B2 (en) 2016-06-10 2019-11-26 Apple Inc. Digital assistant providing automated status report
US10586535B2 (en) 2016-06-10 2020-03-10 Apple Inc. Intelligent digital assistant in a multi-tasking environment
US10192552B2 (en) 2016-06-10 2019-01-29 Apple Inc. Digital assistant providing whispered speech
US10509862B2 (en) 2016-06-10 2019-12-17 Apple Inc. Dynamic phrase expansion of language input
US10067938B2 (en) 2016-06-10 2018-09-04 Apple Inc. Multilingual word prediction
DK179343B1 (en) 2016-06-11 2018-05-14 Apple Inc Intelligent task discovery
DK201670540A1 (en) 2016-06-11 2018-01-08 Apple Inc Application integration with a digital assistant
DK179049B1 (en) 2016-06-11 2017-09-18 Apple Inc Data driven natural language event detection and classification
DK179415B1 (en) 2016-06-11 2018-06-14 Apple Inc Intelligent device arbitration and control
US10235993B1 (en) * 2016-06-14 2019-03-19 Friday Harbor Llc Classifying signals using correlations of segments
US10789534B2 (en) 2016-07-29 2020-09-29 International Business Machines Corporation Measuring mutual understanding in human-computer conversation
US10331784B2 (en) 2016-07-29 2019-06-25 Voicebox Technologies Corporation System and method of disambiguating natural language processing requests
US9922649B1 (en) * 2016-08-24 2018-03-20 Jpmorgan Chase Bank, N.A. System and method for customer interaction management
CN116844543A (en) * 2016-08-26 2023-10-03 王峥嵘 Control method and system based on voice interaction
US10043516B2 (en) 2016-09-23 2018-08-07 Apple Inc. Intelligent automated assistant
KR101700099B1 (en) * 2016-10-11 2017-01-31 미디어젠(주) Hybrid speech recognition composite performance automatic evaluation system
US10135989B1 (en) 2016-10-27 2018-11-20 Intuit Inc. Personalized support routing based on paralinguistic information
KR102591413B1 (en) * 2016-11-16 2023-10-19 엘지전자 주식회사 Mobile terminal and method for controlling the same
KR102450374B1 (en) * 2016-11-17 2022-10-04 삼성전자주식회사 Method and device to train and recognize data
US10500498B2 (en) 2016-11-29 2019-12-10 Activision Publishing, Inc. System and method for optimizing virtual games
US10593346B2 (en) 2016-12-22 2020-03-17 Apple Inc. Rank-reduced token representation for automatic speech recognition
US10515623B1 (en) * 2016-12-23 2019-12-24 Amazon Technologies, Inc. Non-speech input to speech processing system
US10229682B2 (en) 2017-02-01 2019-03-12 International Business Machines Corporation Cognitive intervention for voice recognition failure
US11128675B2 (en) 2017-03-20 2021-09-21 At&T Intellectual Property I, L.P. Automatic ad-hoc multimedia conference generator
DK201770439A1 (en) 2017-05-11 2018-12-13 Apple Inc. Offline personal assistant
DK179496B1 (en) 2017-05-12 2019-01-15 Apple Inc. USER-SPECIFIC Acoustic Models
DK179745B1 (en) 2017-05-12 2019-05-01 Apple Inc. SYNCHRONIZATION AND TASK DELEGATION OF A DIGITAL ASSISTANT
DK201770411A1 (en) 2017-05-15 2018-12-20 Apple Inc. Multi-modal interfaces
DK201770431A1 (en) 2017-05-15 2018-12-20 Apple Inc. Optimizing dialogue policy decisions for digital assistants using implicit feedback
DK201770432A1 (en) 2017-05-15 2018-12-21 Apple Inc. Hierarchical belief states for digital assistants
DK179560B1 (en) 2017-05-16 2019-02-18 Apple Inc. Far-field extension for digital assistant services
US10535344B2 (en) * 2017-06-08 2020-01-14 Microsoft Technology Licensing, Llc Conversational system user experience
US10769138B2 (en) 2017-06-13 2020-09-08 International Business Machines Corporation Processing context-based inquiries for knowledge retrieval
KR102299847B1 (en) * 2017-06-26 2021-09-08 삼성전자주식회사 Face verifying method and apparatus
US10503467B2 (en) * 2017-07-13 2019-12-10 International Business Machines Corporation User interface sound emanation activity classification
US11315560B2 (en) 2017-07-14 2022-04-26 Cognigy Gmbh Method for conducting dialog between human and computer
US11424947B2 (en) * 2017-08-02 2022-08-23 Lenovo (Singapore) Pte. Ltd. Grouping electronic devices to coordinate action based on context awareness
MX2020001279A (en) * 2017-08-03 2020-08-20 Lingochamp Information Tech Shanghai Co Ltd Deep context-based grammatical error correction using artificial neural networks.
US10409132B2 (en) 2017-08-30 2019-09-10 International Business Machines Corporation Dynamically changing vehicle interior
US10974150B2 (en) 2017-09-27 2021-04-13 Activision Publishing, Inc. Methods and systems for improved content customization in multiplayer gaming environments
US10561945B2 (en) 2017-09-27 2020-02-18 Activision Publishing, Inc. Methods and systems for incentivizing team cooperation in multiplayer gaming environments
US11040286B2 (en) 2017-09-27 2021-06-22 Activision Publishing, Inc. Methods and systems for improved content generation in multiplayer gaming environments
US10714144B2 (en) * 2017-11-06 2020-07-14 International Business Machines Corporation Corroborating video data with audio data from video content to create section tagging
US10515640B2 (en) * 2017-11-08 2019-12-24 Intel Corporation Generating dialogue based on verification scores
CN108081901A (en) * 2017-11-08 2018-05-29 珠海格力电器股份有限公司 Vehicle-mounted air conditioner control method and device
US11273836B2 (en) 2017-12-18 2022-03-15 Plusai, Inc. Method and system for human-like driving lane planning in autonomous driving vehicles
US11130497B2 (en) 2017-12-18 2021-09-28 Plusai Limited Method and system for ensemble vehicle control prediction in autonomous driving vehicles
US20190185012A1 (en) * 2017-12-18 2019-06-20 PlusAI Corp Method and system for personalized motion planning in autonomous driving vehicles
US10864443B2 (en) 2017-12-22 2020-12-15 Activision Publishing, Inc. Video game content aggregation, normalization, and publication systems and methods
CN108091324B (en) * 2017-12-22 2021-08-17 北京百度网讯科技有限公司 Tone recognition method and device, electronic equipment and computer-readable storage medium
KR102466942B1 (en) * 2017-12-27 2022-11-14 한국전자통신연구원 Apparatus and method for registering face posture for face recognition
US10839160B2 (en) * 2018-01-19 2020-11-17 International Business Machines Corporation Ontology-based automatic bootstrapping of state-based dialog systems
CN108520748B (en) * 2018-02-01 2020-03-03 百度在线网络技术(北京)有限公司 Intelligent device function guiding method and system
US20210005203A1 (en) 2018-03-13 2021-01-07 Mitsubishi Electric Corporation Voice processing apparatus and voice processing method
CN108492350A (en) * 2018-04-02 2018-09-04 吉林动画学院 Role's mouth shape cartoon production method based on lip-reading
WO2019204186A1 (en) * 2018-04-18 2019-10-24 Sony Interactive Entertainment Inc. Integrated understanding of user characteristics by multimodal processing
EP4307093A3 (en) 2018-05-04 2024-03-13 Google LLC Invoking automated assistant function(s) based on detected gesture and gaze
JP7471279B2 (en) * 2018-05-04 2024-04-19 グーグル エルエルシー Adapting an automated assistant based on detected mouth movements and/or gaze
EP4130941A1 (en) 2018-05-04 2023-02-08 Google LLC Hot-word free adaptation of automated assistant function(s)
US11169668B2 (en) * 2018-05-16 2021-11-09 Google Llc Selecting an input mode for a virtual assistant
US10789200B2 (en) 2018-06-01 2020-09-29 Dell Products L.P. Server message block remote direct memory access persistent memory dialect
US10699705B2 (en) * 2018-06-22 2020-06-30 Adobe Inc. Using machine-learning models to determine movements of a mouth corresponding to live speech
CN110147702B (en) * 2018-07-13 2023-05-23 腾讯科技(深圳)有限公司 Method and system for detecting and identifying target of real-time video
US10831442B2 (en) * 2018-10-19 2020-11-10 International Business Machines Corporation Digital assistant user interface amalgamation
US11342002B1 (en) * 2018-12-05 2022-05-24 Amazon Technologies, Inc. Caption timestamp predictor
US10770072B2 (en) 2018-12-10 2020-09-08 International Business Machines Corporation Cognitive triggering of human interaction strategies to facilitate collaboration, productivity, and learning
US11679330B2 (en) 2018-12-18 2023-06-20 Activision Publishing, Inc. Systems and methods for generating improved non-player characters
US11455982B2 (en) * 2019-01-07 2022-09-27 Cerence Operating Company Contextual utterance resolution in multimodal systems
US11315692B1 (en) * 2019-02-06 2022-04-26 Vitalchat, Inc. Systems and methods for video-based user-interaction and information-acquisition
US10902220B2 (en) 2019-04-12 2021-01-26 The Toronto-Dominion Bank Systems and methods of generating responses associated with natural language input
CA3137927A1 (en) * 2019-06-06 2020-12-10 Artie, Inc. Multi-modal model for dynamically responsive virtual characters
US11875231B2 (en) * 2019-06-26 2024-01-16 Samsung Electronics Co., Ltd. System and method for complex task machine learning
CN110390942A (en) * 2019-06-28 2019-10-29 平安科技(深圳)有限公司 Mood detection method and device based on vagitus
US11257493B2 (en) 2019-07-11 2022-02-22 Soundhound, Inc. Vision-assisted speech processing
US11263634B2 (en) 2019-08-16 2022-03-01 Advanced New Technologies Co., Ltd. Payment method and device
JP6977004B2 (en) * 2019-08-23 2021-12-08 サウンドハウンド,インコーポレイテッド In-vehicle devices, methods and programs for processing vocalizations
US11481599B2 (en) * 2019-09-04 2022-10-25 Tencent America LLC Understanding a query intention for medical artificial intelligence systems using semi-supervised deep learning
US11097193B2 (en) 2019-09-11 2021-08-24 Activision Publishing, Inc. Methods and systems for increasing player engagement in multiplayer gaming environments
US11743719B2 (en) 2019-10-07 2023-08-29 Denso Corporation System and method for authenticating an occupant of a vehicle
US11712627B2 (en) 2019-11-08 2023-08-01 Activision Publishing, Inc. System and method for providing conditional access to virtual gaming items
CN111128157B (en) * 2019-12-12 2022-05-27 珠海格力电器股份有限公司 Wake-up-free voice recognition control method for intelligent household appliance, computer readable storage medium and air conditioner
US11132535B2 (en) * 2019-12-16 2021-09-28 Avaya Inc. Automatic video conference configuration to mitigate a disability
CN111274372A (en) * 2020-01-15 2020-06-12 上海浦东发展银行股份有限公司 Method, electronic device, and computer-readable storage medium for human-computer interaction
KR20210099988A (en) * 2020-02-05 2021-08-13 삼성전자주식회사 Method and apparatus for meta-training neural network and method and apparatus for training class vector of neuarl network
CN113362828B (en) 2020-03-04 2022-07-05 阿波罗智联(北京)科技有限公司 Method and apparatus for recognizing speech
KR102137060B1 (en) * 2020-03-04 2020-07-23 씨엠아이텍주식회사 Face Recognition System and Method for Updating Registration Face Template
US11354906B2 (en) * 2020-04-13 2022-06-07 Adobe Inc. Temporally distributed neural networks for video semantic segmentation
GB2596141A (en) * 2020-06-19 2021-12-22 Continental Automotive Gmbh Driving companion
US11351459B2 (en) 2020-08-18 2022-06-07 Activision Publishing, Inc. Multiplayer video games with virtual characters having dynamically generated attribute profiles unconstrained by predefined discrete values
US11524234B2 (en) 2020-08-18 2022-12-13 Activision Publishing, Inc. Multiplayer video games with virtual characters having dynamically modified fields of view
CN112435653B (en) * 2020-10-14 2024-07-30 北京地平线机器人技术研发有限公司 Voice recognition method and device and electronic equipment
US11461681B2 (en) * 2020-10-14 2022-10-04 Openstream Inc. System and method for multi-modality soft-agent for query population and information mining
US11769018B2 (en) 2020-11-24 2023-09-26 Openstream Inc. System and method for temporal attention behavioral analysis of multi-modal conversations in a question and answer system
US12057116B2 (en) * 2021-01-29 2024-08-06 Salesforce, Inc. Intent disambiguation within a virtual agent platform
US20220415311A1 (en) * 2021-06-24 2022-12-29 Amazon Technologies, Inc. Early invocation for contextual data processing
US12020704B2 (en) 2022-01-19 2024-06-25 Google Llc Dynamic adaptation of parameter set used in hot word free adaptation of automated assistant
US20230377560A1 (en) * 2022-05-18 2023-11-23 Lemon Inc. Speech tendency classification
CN115062328B (en) * 2022-07-12 2023-03-10 中国科学院大学 Intelligent information analysis method based on cross-modal data fusion
US12017674B2 (en) 2022-09-02 2024-06-25 Toyota Motor North America, Inc. Directional audio for distracted driver applications
CN116882496B (en) * 2023-09-07 2023-12-05 中南大学湘雅医院 Medical knowledge base construction method for multistage logic reasoning
CN117409780B (en) * 2023-12-14 2024-02-27 浙江宇宙奇点科技有限公司 AI digital human voice interaction method and system
CN118276684A (en) * 2024-05-29 2024-07-02 河北金融学院 Virtual tour guide area characteristic culture VR display system


Family Cites Families (35)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH0612401A (en) * 1992-06-26 1994-01-21 Fuji Xerox Co Ltd Emotion simulating device
FR2696574B1 (en) * 1992-10-06 1994-11-18 Sextant Avionique Method and device for analyzing a message supplied by means of interaction with a human-machine dialogue system.
JPH06131437A (en) * 1992-10-20 1994-05-13 Hitachi Ltd Method for instructing operation in composite form
US5517021A (en) * 1993-01-19 1996-05-14 The Research Foundation State University Of New York Apparatus and method for eye tracking interface
US5694150A (en) * 1995-09-21 1997-12-02 Elo Touchsystems, Inc. Multiuser/multi pointing device graphical user interface system
US5895447A (en) * 1996-02-02 1999-04-20 International Business Machines Corporation Speech recognition using thresholded speaker class model selection or model adaptation
US6073101A (en) * 1996-02-02 2000-06-06 International Business Machines Corporation Text independent speaker recognition for transparent command ambiguity resolution and continuous access control
US5937383A (en) * 1996-02-02 1999-08-10 International Business Machines Corporation Apparatus and methods for speech recognition including individual or speaker class dependent decoding history caches for fast word acceptance or rejection
US5912721A (en) * 1996-03-13 1999-06-15 Kabushiki Kaisha Toshiba Gaze detection apparatus and its method as well as information display apparatus
US6018341A (en) * 1996-11-20 2000-01-25 International Business Machines Corporation Data processing system and method for performing automatic actions in a graphical user interface
US5877763A (en) * 1996-11-20 1999-03-02 International Business Machines Corporation Data processing system and method for viewing objects on a user interface
US6088669A (en) * 1997-01-28 2000-07-11 International Business Machines, Corporation Speech recognition with attempted speaker recognition for speaker model prefetching or alternative speech modeling
US6269336B1 (en) * 1998-07-24 2001-07-31 Motorola, Inc. Voice browser for interactive services and methods thereof
WO2000008547A1 (en) * 1998-08-05 2000-02-17 British Telecommunications Public Limited Company Multimodal user interface
US6243076B1 (en) * 1998-09-01 2001-06-05 Synthetic Environments, Inc. System and method for controlling host system interface with point-of-interest data
US6629065B1 (en) * 1998-09-30 2003-09-30 Wisconsin Alumni Research Foundation Methods and apparata for rapid computer-aided design of objects in virtual reality and other environments
IL140805A0 (en) 1998-10-02 2002-02-10 Ibm Structure skeletons for efficient voice navigation through generic hierarchical objects
US6539359B1 (en) * 1998-10-02 2003-03-25 Motorola, Inc. Markup language for interactive services and methods thereof
US6246981B1 (en) * 1998-11-25 2001-06-12 International Business Machines Corporation Natural language task-oriented dialog manager and method
US6523172B1 (en) * 1998-12-17 2003-02-18 Evolutionary Technologies International, Inc. Parser translator system and method
US6675356B1 (en) * 1998-12-22 2004-01-06 Xerox Corporation Distributed document-based calendaring system
US6493703B1 (en) * 1999-05-11 2002-12-10 Prophet Financial Systems System and method for implementing intelligent online community message board
JP3514372B2 (en) * 1999-06-04 2004-03-31 日本電気株式会社 Multimodal dialogue device
US6219640B1 (en) * 1999-08-06 2001-04-17 International Business Machines Corporation Methods and apparatus for audio-visual speaker recognition and utterance verification
US6594629B1 (en) * 1999-08-06 2003-07-15 International Business Machines Corporation Methods and apparatus for audio-visual speech detection and recognition
US6665644B1 (en) * 1999-08-10 2003-12-16 International Business Machines Corporation Conversational data mining
US7069220B2 (en) * 1999-08-13 2006-06-27 International Business Machines Corporation Method for determining and maintaining dialog focus in a conversational speech system
US6377913B1 (en) * 1999-08-13 2002-04-23 International Business Machines Corporation Method and system for multi-client access to a dialog system
US6598020B1 (en) * 1999-09-10 2003-07-22 International Business Machines Corporation Adaptive emotion and initiative generator for conversational systems
US6658388B1 (en) * 1999-09-10 2003-12-02 International Business Machines Corporation Personality generator for conversational systems
US6847959B1 (en) * 2000-01-05 2005-01-25 Apple Computer, Inc. Universal interface for retrieval of information in a computer system
US6600502B1 (en) * 2000-04-14 2003-07-29 Innovative Technology Application, Inc. Immersive interface interactive multimedia software method and apparatus for networked computers
US6751661B1 (en) * 2000-06-22 2004-06-15 Applied Systems Intelligence, Inc. Method and system for providing intelligent network management
US6754643B1 (en) * 2000-10-03 2004-06-22 Sandia Corporation Adaptive method with intercessory feedback control for an intelligent agent
US6731307B1 (en) * 2000-10-30 2004-05-04 Koninklijke Philips Electronics N.V. User interface/entertainment device that simulates personal interaction and responds to user's mental state and/or personality

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6219048B1 (en) * 1991-11-12 2001-04-17 Apple Computer, Inc. Object selection using hit test tracks
US6144391A (en) * 1992-03-13 2000-11-07 Quantel Limited Electronic video processing system
US5771042A (en) * 1996-07-17 1998-06-23 International Business Machines Corporation Multi-size control for multiple adjacent workspaces

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP1358650A4 *

Cited By (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1641157A3 (en) * 2004-09-28 2012-05-16 Sony Corporation Audio/visual content providing system and audio/visual content providing method
EP1641157A2 (en) * 2004-09-28 2006-03-29 Sony Corporation Audio/visual content providing system and audio/visual content providing method
US8605945B2 (en) 2006-02-07 2013-12-10 Qualcomm, Incorporated Multi-mode region-of-interest video object segmentation
WO2007141052A1 (en) * 2006-06-09 2007-12-13 Sony Ericsson Mobile Communications Ab Methods, electronic devices, and computer program products for setting a feature of an electronic device based on at least one user characteristic
US8660479B2 (en) 2007-09-04 2014-02-25 Ibiquity Digital Corporation Digital radio broadcast receiver, broadcasting methods and methods for tagging content of interest
US8676114B2 (en) 2007-09-04 2014-03-18 Ibiquity Digital Corporation Digital radio broadcast receiver, broadcasting methods and methods for tagging content of interest
US9123341B2 (en) 2009-03-18 2015-09-01 Robert Bosch Gmbh System and method for multi-modal input synchronization and disambiguation
WO2010107526A1 (en) * 2009-03-18 2010-09-23 Robert Bosch Gmbh System and method for multi-modal input synchronization and disambiguation
EP2504745A2 (en) * 2009-11-27 2012-10-03 Samsung Electronics Co., Ltd. Communication interface apparatus and method for multi-user and system
WO2011065686A2 (en) 2009-11-27 2011-06-03 Samsung Electronics Co., Ltd. Communication interface apparatus and method for multi-user and system
EP2504745A4 (en) * 2009-11-27 2014-12-10 Samsung Electronics Co Ltd Communication interface apparatus and method for multi-user and system
US9799332B2 (en) 2009-11-27 2017-10-24 Samsung Electronics Co., Ltd. Apparatus and method for providing a reliable voice interface between a system and multiple users
DE102009058146B4 (en) 2009-12-12 2024-07-11 Volkswagen Ag Method and device for multimodal context-sensitive operation
EP3293613A1 (en) * 2010-01-21 2018-03-14 Tobii AB Eye tracker based contextual action
WO2013058728A1 (en) * 2011-10-17 2013-04-25 Nuance Communications, Inc. Speech signal enhancement using visual information
US9293151B2 (en) 2011-10-17 2016-03-22 Nuance Communications, Inc. Speech signal enhancement using visual information
US9263060B2 (en) 2012-08-21 2016-02-16 Marian Mason Publishing Company, Llc Artificial neural network based system for classification of the emotional content of digital music
WO2014070872A3 (en) * 2012-10-30 2014-06-26 Robert Bosch Gmbh System and method for multimodal interaction with reduced distraction in operating vehicles
US9190058B2 (en) 2013-01-25 2015-11-17 Microsoft Technology Licensing, Llc Using visual cues to disambiguate speech inputs
WO2014116614A1 (en) * 2013-01-25 2014-07-31 Microsoft Corporation Using visual cues to disambiguate speech inputs
US9552810B2 (en) 2015-03-31 2017-01-24 International Business Machines Corporation Customizable and individualized speech recognition settings interface for users with language accents
DE102019118184A1 (en) * 2019-07-05 2021-01-07 Bayerische Motoren Werke Aktiengesellschaft System and method for user-specific adaptation of vehicle parameters

Also Published As

Publication number Publication date
KR100586767B1 (en) 2006-06-08
EP1358650A4 (en) 2008-03-19
CA2437164A1 (en) 2002-08-15
US20020135618A1 (en) 2002-09-26
CN1310207C (en) 2007-04-11
JP2004538543A (en) 2004-12-24
HK1063371A1 (en) 2004-12-24
CN1494711A (en) 2004-05-05
KR20030077012A (en) 2003-09-29
US6964023B2 (en) 2005-11-08
EP1358650A1 (en) 2003-11-05

Similar Documents

Publication Publication Date Title
US6964023B2 (en) System and method for multi-modal focus detection, referential ambiguity resolution and mood classification using multi-modal input
US12039975B2 (en) Dialog management for multiple users
US10977452B2 (en) Multi-lingual virtual personal assistant
US20210081056A1 (en) Vpa with integrated object recognition and facial expression recognition
CN108701453B (en) Modular deep learning model
US6816836B2 (en) Method and apparatus for audio-visual speech detection and recognition
US11830485B2 (en) Multiple speech processing system with synthesized speech styles
Stiefelhagen et al. Enabling multimodal human–robot interaction for the karlsruhe humanoid robot
US20190279616A1 (en) Voice Characterization-Based Natural Language Filtering
WO2022125381A1 (en) Multiple virtual assistants
Këpuska et al. A novel wake-up-word speech recognition system, wake-up-word recognition task, technology and evaluation
US11681364B1 (en) Gaze prediction
Këpuska Wake-up-word speech recognition
AU2020103587A4 (en) A system and a method for cross-linguistic automatic speech recognition
US11991511B2 (en) Contextual awareness in dynamic device groups
CN117882131A (en) Multiple wake word detection
Holzapfel et al. A robot learns to know people—first contacts of a robot
Kostoulas et al. Detection of negative emotional states in real-world scenario
WO2022265872A1 (en) Presence-based application invocation
RBB et al. Deliverable 5.1

Legal Events

Date Code Title Description
WWE Wipo information: entry into national phase

Ref document number: 02805565.9

Country of ref document: CN

AK Designated states

Kind code of ref document: A1

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ OM PH PL PT RO RU SD SE SG SI SK SL TJ TM TN TR TT TZ UA UG UZ VN YU ZA ZM ZW

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZM ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG

121 Ep: the epo has been informed by wipo that ep was designated in this application
DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101)
WWE Wipo information: entry into national phase

Ref document number: 2437164

Country of ref document: CA

WWE Wipo information: entry into national phase

Ref document number: 1020037010176

Country of ref document: KR

WWE Wipo information: entry into national phase

Ref document number: 2002563459

Country of ref document: JP

WWE Wipo information: entry into national phase

Ref document number: 2002724896

Country of ref document: EP

WWP Wipo information: published in national office

Ref document number: 1020037010176

Country of ref document: KR

WWP Wipo information: published in national office

Ref document number: 2002724896

Country of ref document: EP

REG Reference to national code

Ref country code: DE

Ref legal event code: 8642

WWG Wipo information: grant in national office

Ref document number: 1020037010176

Country of ref document: KR