WO2012174515A1 - Hybrid dialog speech recognition for in-vehicle automated interaction and in-vehicle user interfaces requiring minimal cognitive driver processing therefor - Google Patents

Hybrid dialog speech recognition for in-vehicle automated interaction and in-vehicle user interfaces requiring minimal cognitive driver processing therefor

Info

Publication number
WO2012174515A1
Authority
WO
WIPO (PCT)
Prior art keywords
user
speech recognition
vehicle
dialog application
spoken
Prior art date
Application number
PCT/US2012/042916
Other languages
English (en)
Inventor
Thomas Barton Schalk
Leonel Saenz
Barry Burch
Original Assignee
Agero Connected Services, Inc.
Priority date
Filing date
Publication date
Priority claimed from US13/524,695 (published as US20120253823A1)
Application filed by Agero Connected Services, Inc.
Priority to CA2839285A1
Publication of WO2012174515A1

Links

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/08 - Speech classification or search
    • G10L15/18 - Speech classification or search using natural language modelling
    • G10L15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L15/28 - Constructional details of speech recognition systems
    • G10L15/30 - Distributed recognition, e.g. in client-server systems, for mobile phones or network applications
    • G10L15/32 - Multiple recognisers used in sequence or in parallel; Score combination systems therefor, e.g. voting systems

Definitions

  • the present invention relates generally to a system and method for implementing a server- based speech recognition system for multi-modal interaction that may be applied to any interactive automated system, such as an interactive automated system that is being used inside a motor vehicle. More particularly, the present invention pertains to a system and method of utilizing multiple speech recognizers and an associated human-to-machine, in-vehicle interface to create an efficient, safe, reliable, convenient, and enjoyable experience for the motorist under driving conditions while simultaneously achieving high automation rates.
  • the present invention provides call center enterprises with highly effective automation to reduce costs without sacrificing the quality of service for the customer.
  • Interactive automation should be a preferred means of interaction for the customer, or motorist, to accomplish tasks that could otherwise be handled by a human agent at a call center.
  • By using a service oriented architecture (SOA), the benefits of such an approach are to provide a safe and enjoyable user interface and to improve a call center's efficiency, as described herein.
  • Telematics includes the integration of wireless communications, vehicle monitoring systems, and location devices. Such technologies in automotive communications combine wireless voice and data capability for management of information and safety applications.
  • Driver distraction can occur when a driver's cognitive processing is allocated to any task that is not focused on driving a vehicle safely.
  • Making phone calls and entering data into mobile devices are examples of tasks that can be highly distractive while driving.
  • Conventional typing while driving is extremely dangerous because both vision and touch are involved, making it impractical to drive safely.
  • Yet conventional typing is often the only way to enter a destination into a vehicle navigation system.
  • some existing built-in systems purposefully limit use of the interface to times when the vehicle is stationary. Unfortunately, this stationary requirement adversely compromises the range of capabilities that may be possible with in-vehicle systems.
  • the microphone must be hands free and, therefore, may be at a distance from the speaker's mouth.
  • road noise can be harsh and non-stationary.
  • the present invention provides safe measures for completing tasks that involve typing under driving conditions. Safety is maintained because the interface is designed to be extremely simple and quick to use. Simplicity to the driver is achieved by leveraging speech and hearing as primary input/output modalities during interactions within the vehicle while, at the same time, minimizing the need for visual and mechanical interactions that relate to completing tasks. Accordingly, in the present invention, an advanced human-like speech recognition system as described above is used to enable the process of typing short text strings.
  • the present invention pertains to a method of prompting that begins with the speaking task and follows with a computerized verbalization of the text. Subsequent user interface steps can be visual in nature, or only involve sound. In terms of the use case, the vehicle driver hears audio prompts and responds with speech to complete a task such as creating a text message. As a result, the present invention makes it practical for vehicle drivers to use their speech to enter text strings.
  • Within the SOA, an asynchronous approach can be used to recognize speech.
  • the dialog is always forward moving and the user is not asked to repeat utterances, even though the user can ask to repeat a phrase, if the application includes an appropriate query.
  • the benefits of such an approach provide a safe and enjoyable user interface that is compelling to use while driving a vehicle.
  • Embodiments of the present invention provide a method for implementing an interactive automated system, comprising processing spoken utterances of a person using a processing system located in proximity to the person, transmitting the processed speech information to a remote data center using a wireless link, analyzing the transmitted processed speech information to scale and end-point the speech utterance, converting the analyzed speech information into packet data format, selecting at least one optimal specialized speech recognition engine to translate the converted speech information into text format, transporting the packet speech information to the at least one selected specialized speech recognition engine using an internet-protocol transport network, retrieving the recognition results and an associated confidence score from the at least one specialized speech recognition engine, continuing the automated dialog with the person if the confidence score meets or exceeds a pre-determined threshold for the best match, and selecting at least one alternative specialized speech recognition engine to translate the converted speech information into text format if the confidence score is low such that it is below a pre-determined threshold for the best match.
  • the at least one alternative specialized speech recognition engine is agent-assisted.
  • the at least one selected optimal specialized speech recognition engine is not local.
  • the at least one selected optimal specialized speech engine is selected based on a given intent of the person.
  • the automated dialog is continued with the person prior to, or subsequent to, receiving the recognition results in an asynchronous manner.
  • the automated dialog is continued with the person subsequent to receiving the recognition results in a synchronous manner.
  • the packet data and recognition results are logged for subsequent analysis.
  • the processing system is located on-board a vehicle.
  • the vehicle location information is also transported with the packet speech information to the at least one selected specialized speech recognition engine.
  • the vehicle location information is logged for subsequent analysis.
  • the intent of the person includes at least one of texting, browsing, navigation, and social networking.
  • Embodiments of the present invention also provide an interactive automated speech recognition system comprising a processing system located in proximity to a person wherein the processing system processes spoken utterances of the person, a remote data center, a wireless link that transmits the processed speech information from the processing system to the remote data center wherein the transmitted processed speech information is analyzed to scale and end-point the speech utterance and converted into packet data format, at least one optimal specialized speech recognition engine selected to translate the converted speech information into text format, an internet protocol transport network that transports the converted speech information to the at least one selected optimal specialized speech recognition engine, and wherein the at least one specialized speech recognition engine produces recognition results and an associated confidence score, and based upon the confidence score, the automated dialog is continued with the person if the confidence score meets or exceeds a pre-determined threshold for the best match, or at least one alternative specialized speech recognition engine is selected to translate the converted speech information into text format if the confidence score is low such that it is below a pre-determined threshold for the best match.
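The confidence-gated engine selection described in the two preceding paragraphs can be sketched as follows. This is only an illustrative outline, not the patented implementation; the engine interface, the example 0.85 threshold, and all identifiers are assumptions introduced for the sketch.

```python
# Hedged sketch of confidence-gated engine selection: try the optimal engine
# first and fall back to an alternative (e.g., agent-assisted) engine when the
# confidence score is below a pre-determined threshold. All names are invented.
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class RecognitionResult:
    text: str
    confidence: float  # normalized to the range 0..1

# An "engine" is anything that maps packetized audio to a recognition result.
Engine = Callable[[bytes], RecognitionResult]

def recognize_with_fallback(audio: bytes,
                            primary: Engine,
                            alternatives: List[Engine],
                            threshold: float = 0.85) -> Tuple[RecognitionResult, str]:
    result = primary(audio)
    if result.confidence >= threshold:
        return result, "primary"           # continue the automated dialog
    for engine in alternatives:            # e.g., dictation engine, live transcriber
        result = engine(audio)
        if result.confidence >= threshold:
            return result, "alternative"
    return result, "unresolved"            # the application decides how to proceed
```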
  • a method for providing dynamic interactive voice recognition (IVR) over a wireless network comprises establishing a connection with a telematics control unit via the wireless network, configuring a directed dialog application of at least one of a remote data center and a vehicle to provide IVR for use with expected spoken user commands, and using an open dialog application separate from the remote data center to provide IVR for use with unexpected spoken user commands.
  • the process switches back and forth between the directed dialog application and the open dialog application in accordance with pre-defined criteria.
  • the directed dialog application is executed to present questions corresponding to a limited subset of possible spoken user commands.
  • a teaching mode of the directed dialog application is entered before using the open dialog application when an unexpected spoken user command is received.
  • a reduced subset of possible choices is presented in the teaching mode in order to obtain a valid spoken user command.
  • the reduced subset is less than the limited subset.
  • a further subset of possible choices is presented in response to a user selection.
  • the process switches to the open dialog application absent a selection of the further subset of possible choices.
  • the process switches to the open dialog application when a further unexpected spoken user command is received.
  • the process switches to the open dialog application absent a selection of one of the reduced subset of possible choices.
  • resources for the open dialog application are provided with a speech recognition service cloud.
  • the speech recognition service cloud provides different speech recognition systems in parallel.
  • the different speech recognition systems are selected from a group consisting of at least two of a navigation speech recognition system, a dictation speech recognition system, an audio information recognition system, and a human assisted speech recognition system.
  • when an unexpected user command comprising natural user language is received, the open dialog application provided by the speech recognition service cloud is used to provide information based on the natural user language.
  • the information comprises possible spoken user commands.
  • suggested user commands are provided in response to receiving an invalid spoken user command.
  • a method for minimizing task completion time using dynamic interactive voice recognition (IVR) over a wireless network comprises the steps of establishing a connection with a telematics control unit via the wireless network, configuring a directed dialog application of at least one of a remote data center and a vehicle to provide IVR for use with expected spoken user commands, receiving one of a plurality of expected spoken user commands, depending on the user command received, prompting a user for further spoken information in order to complete one or more actions required by the user command received, and completing the one or more actions upon receipt of the further spoken information.
  • the expected spoken user command received comprises a shortcut.
  • a system for providing dynamic interactive voice recognition (IVR) over a wireless network comprising a remote data center comprising a communications device operable to establish a connection with a telematics control unit via the wireless network, a directed dialog application operable to provide IVR for use with expected spoken user commands, and a communications subsystem, and an open dialog application separate from the remote data center, communicatively connected to the remote data center through the communications subsystem, and operable to provide IVR for use with unexpected spoken user commands.
  • the remote data center is configured to use resources for the open dialog application that are provided by a speech recognition service cloud.
  • FIG. 1 is a system architecture diagram illustrating components of a speech recognizer according to an exemplary embodiment of the present invention.
  • FIG. 2 is a system architecture diagram illustrating components of a service-oriented architecture for in-vehicle speech recognition according to an exemplary embodiment of the present invention.
  • FIG. 3 is a system architecture diagram illustrating components of a connected vehicle speech recognizer according to an exemplary embodiment of the present invention.
  • FIG. 4 is a graph illustrating usability and flexibility of the system according to an exemplary embodiment of the present invention utilizing directed and open dialogs.
  • FIG. 5 is a flow diagram illustrating the system of processes that comprise a multi-modal user interface design and how commonalities are shared among a number of exemplary user interfaces according to an exemplary embodiment of the present invention.
  • FIG. 6 is a process flow diagram of a synchronous speech recognition approach aimed at showing the limitations of the user experience.
  • FIG. 7 is a process flow diagram of an asynchronous speech recognition approach aimed at showing the advantages of the asynchronous approach according to an exemplary embodiment of the present invention.
  • FIG. 8 is a process flow diagram of a method for providing dynamic interactive voice recognition (IVR) over a wireless network according to an exemplary embodiment of the present invention.
  • FIG. 9 is a process flow diagram of a method for minimizing task completion time using dynamic interactive voice recognition (IVR) over a wireless network according to an exemplary embodiment of the present invention.
  • As used herein, the term "about" or "approximately" applies to all numeric values, whether or not explicitly indicated. These terms generally refer to a range of numbers that one of skill in the art would consider equivalent to the recited values (i.e., having the same function or result). In many instances these terms may include numbers that are rounded to the nearest significant figure.
  • The terms "program," "software," "software application," and the like, as used herein, are defined as a sequence of instructions designed for execution on a computer system.
  • a "program,” “software,” “computer program,” or “software application” may include a subroutine, a function, a procedure, an object method, an object implementation, an executable application, an applet, a servlet, a source code, an object code, a shared library/dynamic load library, and/or other sequence of instructions designed for execution on a computer system.
  • Referring to FIG. 1 of the drawings, there is shown a system architecture diagram representing the basic components of a speech recognizer in connection with a remote data center that require special optimization under conditions in which the environment is harsh and the recognition task is complex (e.g., recognition of dictation or a street address).
  • Even when a speech recognizer is highly tuned, accuracy can be unpredictable simply because it is virtually impossible to model every sound that a person can make when speaking into a microphone.
  • Nevertheless, with proper tuning, acceptable results can be achieved.
  • acoustic models represent how speech sounds in a target environment
  • grammars represent what can be recognized during an application
  • dictionaries represent the way that words are to be pronounced
  • language models govern the allowable sentence structure.
  • acoustic models 15 are statistical representations of phonetic sounds that are produced under specific environmental conditions. Phonetic sounds can be thought of as sub-units of spoken words that are to be recognized by an automated speech recognition (ASR) system.
  • the environmental conditions are characterized by numerous components, including the microphone type and its placement, the surrounding acoustic media, audio transmission properties, background noise, signal conditioning software, and any other variable that influences the quality of the sound that the speech recognizer processes.
  • Acoustic models 15 are needed for high-accuracy speech recognition, and the more highly tuned the acoustic model, the more accurate the speech recognition. Speech data collections form the basis of acoustic models; however, live adaptation is used for in-the-field tuning. Thousands of recordings that represent the environmental extremes of a target recognition environment constitute a "good" base of speech data.
  • "Grammar” or “Grammars” 17 are a set of rules that define the set of words and phrases (i.e., a vocabulary) that may be recognized during voice applications.
  • An application may have several grammars such as "yes/no,” numerical digits, street names, action menu items, etc. To maximize accuracy, only the necessary vocabulary should be active at any point of an application call flow.
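One possible way to keep only the necessary vocabulary active at each point of a call flow is sketched below. The dialog states and phrases are invented for illustration and are not taken from the patent.

```python
# Illustrative only: per-state active grammars, with a simple validity check.
ACTIVE_GRAMMARS = {
    "confirm":   {"yes", "no"},
    "main_menu": {"traffic", "new destination", "gas prices"},
    "digits":    {str(d) for d in range(10)},
}

def is_in_grammar(state: str, utterance: str) -> bool:
    # An utterance is valid only if it appears in the grammar active for this state.
    return utterance.strip().lower() in ACTIVE_GRAMMARS.get(state, set())

print(is_in_grammar("confirm", "Yes"))        # True
print(is_in_grammar("main_menu", "weather"))  # False -> out of vocabulary
```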
  • Grammars 17 rely on "dictionaries” for pronunciation information. Dictionaries are commonly referred to as “lexicons.”
  • a “lexicon” 16 is a collection of words and their associated pronunciations in terms of phonetic transcriptions. Similar to a common dictionary, pronunciation is specified by a standard symbol set.
  • "Language models" 18 are designed to assist the recognition matching process for multiple words in a phrase or a sentence. Common language models are statistical in nature and attempt to assign a probability to a sequence of allowable words by means of a probability distribution. Language modeling can be used in many natural language processing applications such as speech recognition, machine translation, part-of-speech tagging, parsing, and information retrieval. In speech recognition, to predict the next word in a speech sequence, a language model can be used to capture the properties of a language.
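The statistical idea behind such a language model can be illustrated with a toy bigram model; the counts and words below are invented purely for demonstration.

```python
# Toy bigram language model: assign a probability to a word sequence.
from collections import Counter

bigram_counts = Counter({("send", "a"): 8, ("a", "text"): 6, ("a", "message"): 2})
context_counts = Counter({"send": 10, "a": 8})

def sequence_probability(words):
    # P(w1..wn) approximated by the product of bigram probabilities P(w_i | w_{i-1}).
    p = 1.0
    for prev, cur in zip(words, words[1:]):
        p *= bigram_counts[(prev, cur)] / context_counts[prev]
    return p

print(sequence_probability(["send", "a", "text"]))     # 0.6  (more likely)
print(sequence_probability(["send", "a", "message"]))  # 0.2  (less likely)
```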
  • acoustic models 15, grammars 17, lexicons 16, and language models 18 are optimized to reach a high level of accuracy.
  • in principle, if a person can understand an utterance, a properly tuned speech recognizer should be able to recognize it too.
  • using real-world recordings for adaptation purposes is one way to improve accuracy.
  • a key feature of the present invention lies in a particular division of duties: the performance of complex speech recognition tasks is separated from the system that is hosting the application.
  • the base speech application contains a dialog structure that relies on its own recognizer for basic command and control. For complex speech recognition tasks, accessible specialized recognizers can be used.
  • the connect time may prevent a good user experience because a typical telephone connection takes approximately 10 seconds to establish.
  • Speech recognition that is server-based leverages the remote device's Internet connection to transmit packetized audio and to have recognition results returned almost instantaneously.
  • the remote device acts as the client and the recognition is performed off-board by way of a data channel.
  • the present invention is unique in that it viably mixes recognition engines in real time within a real-time dialog with humans.
  • the present invention deals with an enterprise automated system having its own speech recognition resources and an actual dialog occurs (i.e., audio prompting occurs).
  • the Internet is not accessed wirelessly; rather, a telephone voice channel serves as the means of communication between the person, or motorist, and the enterprise automated system.
  • the present invention provides an automatic interactive system at an enterprise remote data center (ERDC) that leverages multiple specialized speech recognizers over a data channel (i.e., the Internet) and allows, by way of a wireless voice communication channel, a person, such as a motorist, to interact in a hands-free environment with the automated system, the automated system being capable of understanding complex speech requests.
  • ERDC enterprise remote data center
  • the primary advantages of hosting the application on-premise at the ERDC include ease of back-end integration, control of application design and support, improvement of application maintenance, and cost-effective implementation due to server sharing.
  • the application dialog design can easily be modified without changing any remote, or in-vehicle, hardware or software.
  • the ERDC can prototype and launch automated interactive applications that are designed internally. This means that complete application ownership is possible even though sophisticated speech recognition is used within the application and candidate speech recognition engines can be evaluated without making application changes. Also, multiple-language speech recognition is easily accommodated through the use of specialized speech recognition services.
  • each channel of a server-based, interactive, automation system could accommodate numerous vehicles simultaneously.
  • Locating an automated interactive automation service cluster within the ERDC provides substantial benefits over an embedded speech system inside a vehicle.
  • this architecture provides increased operational flexibility and control from the call center. Efficiency is increased because content can be added or modified with centralized hardware and/or software. Calls from the vehicles can be monitored and improvements can be made at the centralized locations, rather than inside each vehicle. Also, there is an improved scalability as computer resources can be shared across a large number of vehicles.
  • the inventive system provides the ability to manage personalization in terms of customer preferences.
  • the present invention is directed to a system and method that leverages multiple specialized speech recognizers that are accessed on-premise through an Internet-protocol transport network.
  • the ERDC is equipped with highly available connectivity for high-speed Internet access, thereby preventing wireless coverage from being an issue.
  • the speech application is hosted on an automated interactive system located within the ERDC (or call center). All application maintenance and updating can be managed by the enterprise remote data center (ERDC) without the need of costly subject-matter experts in speech recognition.
  • this particular embodiment is shown as being used in connection with motor vehicles in FIG. 2.
  • the system and method of the present invention is applicable to all interactive systems.
  • the vehicle's telematics control unit (TCU) 34 connects to the ERDC 48 by way of a wireless communications link that includes antennas 35, 37 and the cellular network 36.
  • the final stage of the telematics connection includes the telephone network (e.g., POTS) 38, which terminates at a communications device, e.g., the PBX 39, within the ERDC 48.
  • the ERDC 48 is comprised of a media gateway 40 and an interactive automation service cluster 41.
  • the media gateway 40 manages the communications media session between the PBX 39 and the interactive automation service cluster 41.
  • the interactive automation service cluster 41 is the central point of intelligence of the inventive systems and methods as described in the following text.
  • a telematics request can be accomplished, for example, by the vehicle driver 31 pressing a button, in response to which the TCU 34 initiates a connection with the ERDC 48 as described above.
  • the vehicle driver 31 is able to hear audio prompts, for example, through the in-vehicle speaker 33 located in proximity to the vehicle driver 31.
  • Using the in-vehicle speaker 33 and a microphone 32, an automated interaction takes place with the vehicle driver 31. The interaction could begin, for example, with the audio prompt "How may I help you?" Simultaneously and in a seamless fashion, when the telematics connection is established to the ERDC 48, data information such as the vehicle's location, vehicle model information, vehicle driver information, diagnostic information, etc. is captured and communicated via a data channel to the interactive automation service cluster 41.
  • the vehicle driver may then respond out loud with a request and say, for example, "I need to find an Italian restaurant nearby" or "I want to text my friend Bob.”
  • the interactive automation server cluster 41, which comprises a group of interconnected servers that form the ERDC-located speech system, automatically selects the appropriate speech recognition engine.
  • the appropriate speech recognition engine could be located internal to the interactive automation server cluster 41 within the ERDC 48 or could be externally available to the interactive automation server cluster in a speech recognition service cloud 49 that may be accessed through the world-wide-web (referred to as "cloud computing") from one or more speech vendors that offer a URL access point to their speech server farm.
  • the speech engine that is selected depends on the type of request made by the vehicle driver 31. For example, simple "yes/no" queries or action menu selections may be handled by the recognition engine hosted within the interactive automation server cluster 41. More complex speech recognition tasks, such as recognizing a spoken sentence, could be handled by a remote dictation engine 44.
  • the Internet protocol transport network 42 is highly available to the interactive automation server cluster 41 and operates at a high speed, making it practical to recognize complex speech utterances in just a matter of seconds from the time the vehicle driver utters the directive.
  • the process for selecting the appropriate recognition engine and whether or not to direct the user to a live operator can be dependent upon the complexity of the speech to be recognized.
  • When handling a complex speech recognition task, such as recognizing a navigational destination, a remote navigational engine 43, accessed by way of the Internet protocol transport network 42, can perform the handling.
  • the speech application is executed within the interactive automation server cluster 41 and waits for a response from the remote navigational engine 43 before proceeding with the next step (e.g., a subsequent prompt, visual display of the destination information, or an end to the interactive session).
  • For each spoken utterance, a recognition process occurs and, as part of the process, the recognizer creates an "n"-best list of the top hypotheses, or "recognition results." In other words, if "n" equals five, the recognizer generates up to five text representations of what was possibly spoken, each with an associated probability of correct recognition.
  • the variable "n” may be a pre-defined, limited number and/or is dependent upon the number of results returned that meet or exceed a certain probability of correct recognition.
  • Each recognition hypothesis is assigned a confidence score (or probability) that is typically normalized to 1. If the top choice is assigned a confidence score above a specified threshold (e.g., 0.85), the spoken input is considered to be a final recognition result without requiring further processing.
  • In some cases, the result provided by the remote navigational engine 43 is of low confidence, meaning that the spoken speech was not automatically recognized with a sufficient level of confidence. In that case, the corresponding audio wave file could be passed over the web to a live transcription agent 47.
  • the speech application, executed within the interactive automation server cluster 41, waits for a response from the transcription agent 47 before proceeding to the next step (e.g., a subsequent prompt, a visual display of the destination information, or an end to the interactive session).
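The n-best handling just described can be sketched as follows; the helper names, the value of n, and the 0.85 threshold are illustrative assumptions rather than details taken from the patent.

```python
# Hedged sketch: accept the top hypothesis when it clears the threshold,
# otherwise hand the audio wave file to a live transcription agent.
def resolve_nbest(hypotheses, audio_wav, send_to_agent, n=5, threshold=0.85):
    """hypotheses: list of (text, confidence) pairs from the recognizer."""
    top_n = sorted(hypotheses, key=lambda h: h[1], reverse=True)[:n]
    if top_n and top_n[0][1] >= threshold:
        return top_n[0][0]              # final recognition result
    return send_to_agent(audio_wav)     # low confidence: human transcription

# Example usage with a stand-in agent:
result = resolve_nbest([("main street", 0.91), ("maine street", 0.42)],
                       b"...wav bytes...", send_to_agent=lambda wav: "(agent result)")
print(result)  # "main street"
```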
  • the interactive automation server cluster 41 serves as the intelligence behind the automation experienced by the vehicle driver, or other users.
  • the fact that the systems and methods are agnostic (i.e., not tied to one particular technology vendor) in choosing the speech recognition engine means that the system and method are highly robust and flexible because multiple best-in-class recognizers can be leveraged.
  • Because the system and method of the present invention incorporate agent assistance into the implementation model, the user experience can also "feel" human even in the presence of noise and heavy accents.
  • the vehicle driver or other user 31 can be connected to a live agent (see, e.g., FIG. 7) to handle extreme circumstances. Such a connection is easily managed by the interactive automation server cluster 41 if the application is configured accordingly, examples of which are described in the following section.
  • Dialog designs of the invention improve the connected car speech process.
  • the methodologies discussed herein can be applied to any number of tasks including managing search results, requesting information, and texting through voice, to name a few.
  • the inventive system utilizes the positive characteristics of both limited dialog and expansive dialog, the latter of which can include live voice exchanges.
  • the inventive system employs dynamic interactive voice recognition that switches seamlessly between a limited dialog having a minimal vocabulary and an open dialog having a virtually unlimited vocabulary. Throughout the process, the user is able to keep his/her hands on the wheel because the inventive systems and methods and their human-machine interfaces enable the user to get information to/from the remote data center.
  • the inventive dialog processes and interfaces are configured so that the communication flow makes it easier for the user to obtain the desired information and avoids situations in which the user dislikes the process, or gives up on learning it, because it is too complicated.
  • the systems and methods are configured to react in different ways depending on whether the user is speaking within the defined vocabulary or the user is generating instructions that are outside the defined vocabulary.
  • a power user presumably limits uttered speech to the defined vocabulary, while a novice user utters instructions that are generally outside the defined vocabulary.
  • the invention provides a hybrid system that moves back and forth between a structured/directed dialog mode, e.g., power mode, and an unstructured/open dialog mode, e.g., novice mode. To explain this process, reference is made to FIGS.
  • When having an interactive voice exchange, many processes start with a limited grammar. This means that the user is asked questions that have a small subset of possible valid responses. Herein, this limited process is referred to as a "directed dialog." Some examples of a directed dialog include binary responses such as yes/no, up/down, left/right, or a set of numbers that is limited between 1 and 10. In such cases, the ERDC 48 can easily handle expected responses when correct. When the user's utterances are incorrect or invalid, the system recognizes this as an indication that the user needs help. At this point, the system transitions into a teaching mode.
  • the teaching mode does not try to figure out what was said by the user. Instead, the system utilizes a process that tries to make it easier for the user to say a valid response. In attempting to obtain a valid response from the user, the system provides a small or reduced subset of possible choices. Assume, for example, that the user has thirty possible valid responses in a directed dialog but that the user does not give a valid response. Instead of trying to determine what was actually said, which would require use of off-board, process-intensive voice recognition, the system provides a second level of choices that is much smaller than the thirty possible valid responses. The system is configured to give the user broad categories from which to choose a valid command.
  • For example, the system says "please say traffic, new destination, or gas prices."
  • each of these three categories broadly describes a number of different possible valid responses within the set of thirty.
  • the traffic sub-category, if selected, could provide six different choices.
  • the destination sub-category, if selected, could provide five different choices, and the gas sub-category, if selected, could provide four different choices.
  • instead of listing thirty items to the user, which is unworkable, only three are listed for selection. If one is not appropriate, the user can be prompted for "more choices."
  • when a category is selected, the corresponding small, manageable subset is uttered to the user.
  • the strategy of the invention is not just to skip steps but to leverage the low-vocabulary mode (i.e., directed dialog) and teach the user the available categories.
  • the system will not say to the user: "I'm sorry, I don't understand." If no categories are selected or if the user's utterance is still not understood with confidence, then the system switches to a high-vocabulary mode by having an open dialog with the user (i.e., natural language).
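A minimal sketch of this hybrid behavior, under invented names and prompts, is shown below: stay in directed dialog while commands are valid, drop into the teaching mode with a reduced subset of categories when they are not, and switch to the open dialog only if the teaching mode also fails.

```python
# Illustrative directed -> teaching -> open dialog flow (not the patented code).
VALID_COMMANDS = {"traffic", "new destination", "gas prices"}  # subset of ~30
TEACHING_PROMPT = "Please say traffic, new destination, or gas prices."

def handle_utterance(utterance: str, ask, open_dialog):
    """ask(prompt) returns the next utterance; open_dialog(utterance) returns a result."""
    if utterance in VALID_COMMANDS:
        return ("directed", utterance)          # power-user path
    # Teaching mode: do not try to interpret; offer a reduced subset instead.
    retry = ask(TEACHING_PROMPT)
    if retry in VALID_COMMANDS:
        return ("teaching", retry)
    # Still no valid selection: hand off to the open dialog application.
    return ("open", open_dialog(retry))
```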
  • Without this hybrid approach, a dialog session would end because the user has shown an inability to say simple commands, like someone failing to enter their password correctly on a web page.
  • when natural language is actually needed (e.g., the user gets through the directed dialog interaction), the inventive system proceeds to the speech recognition service cloud 49. This process is illustrated in FIG.
  • for navigation requests, a navigation speech recognition system 102 can take over and quickly provide the ERDC 48 with the information to be sent to the user.
  • for dictation (e.g., text or email), a dictation speech recognition system 104 can take over and quickly provide the ERDC 48 with the information to be sent to the user.
  • for audio information (e.g., music), a voice search speech recognition system 106 can take over and quickly provide the ERDC 48 with the information to be sent to the user.
  • when human assistance is needed, a human assisted speech recognition system 108 can take over and quickly provide the ERDC 48 with the information to be sent to the user.
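A simple routing table can make the preceding four cases concrete. The mapping below is an assumption made for illustration; only the reference numerals 102-108 come from the description above.

```python
# Illustrative intent-to-engine routing for the speech recognition service cloud.
def route_to_engine(intent: str) -> str:
    engines = {
        "navigation": "navigation_speech_recognition_102",
        "dictation":  "dictation_speech_recognition_104",
        "audio":      "voice_search_speech_recognition_106",
    }
    # Anything the automated engines cannot cover goes to human assistance.
    return engines.get(intent, "human_assisted_speech_recognition_108")

print(route_to_engine("dictation"))  # dictation engine 104
print(route_to_engine("unknown"))    # falls back to human-assisted engine 108
```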
  • the system can use the confidence scoring process set forth in co-pending U.S. Patent Application Serial No. 12/636,327 or U.S. Patent Nos. 7,373,248 and 7,634,357, for example, to address the invalid user response.
  • FIG. 4 is a graph illustrating the problem associated with use of the directed and open dialogs.
  • the limited directed dialog is usable only to a limited extent because of the small vocabulary. It is flexible to a point (dashed vertical line A), but starts to become unusable and inflexible as the user utters responses that are unintelligible or outside the vocabulary.
  • Open dialog is very flexible because it can interpret virtually all responses from a user. However, open dialog requires time and outside processing and, possibly, requires live operator assistance and, therefore, there is a point at which open dialog is undesirable (dashed vertical line B). From this, there exists a range in which both directed and open dialogs are disadvantageous.
  • the system of the invention guides the user and utilizes the directed and open dialogs to prevent use of either dialog in a disadvantageous situation.
  • the present invention also provides a user interface that enables functionality in a vehicle in a way that makes it safe while driving.
  • the user interface of the present invention allows navigation, dialing, web browsing, text messaging (texting), and other applications for mobile devices by speech-enabled typing.
  • the primary objective of the user interface is to make it practical for a vehicle driver to access a set of applications that are controlled and managed by user interfaces that share a strong degree of commonality.
  • the user interface for texting shares commonality with the user interface for web browsing.
  • the user interface for web browsing shares commonality with the user interface for local search and full address entry for navigation.
  • the invention utilizes a three-step approach for completing tasks that normally require conventional typing. The three steps are:
  • the vehicle driver initiates the task by indicating intent.
  • Intent can be communicated through a specific button push, touching a specific icon, or saying a specific speech command such as "I want to send a text message.”
  • the user is prompted by speech to say a phrase that will match the intended text for a text message or the intended text to enter in a search box, or a destination category, name, or address.
  • the invention makes it practical for vehicle drivers to use their own speech to enter text strings.
  • the recognized result is then managed by the user in a way that depends on the task. Web browsing would entail a simple glance at a screen. Texting would entail saying the name of the recipient and then speaking the content of the text message.
  • Destination entry could entail touching a screen to download a specific destination to an on-board navigation system.
  • Other examples follow the same pattern: input intent; speak a phrase; and manage the result.
  • the user interface of the inventive systems and methods requires advanced speech recognition that allows free-form dictation in the highly challenging environment of a vehicle's interior.
  • the present invention also encompasses asynchronous speech recognition, which means that the user interface can step forward in execution before recognition results are obtained. For example, a user could speak a text message and be prompted to say the name of the recipient before the spoken text message is actually recognized.
  • the user interface can include playing back the later-recognized text message along with the target recipient.
  • Longer latencies associated with obtaining recognition results can be managed by sending the message without confirmation but subsequent to the user interaction within the vehicle. For example, the message may be recognized and sent twenty (20) seconds later, without the user knowing exactly when the message was sent.
  • some tasks, such as web browsing or local search, are sensitive to timing, and a synchronous approach is only practical when the latencies are controlled to be within several seconds, analogous to delays typically experienced with conventional web browsing.
  • a conventional interactive voice response system typically includes error handling dialogs that slow down the interactive process and often cause user frustration when recognition failures occur.
  • the dialog is always forward moving (i.e., the next-level prompts occur immediately after a user speaks even if the speech is not recognized) and the user is not asked to repeat spoken utterances.
  • a portion of the dialog can be synchronous and thereby allow for the system to ask a user to confirm a phrase that was spoken by the user or the user can cause the system to repeat a result by responding to a yes/no query (e.g., answering "no" to the system's query of "did you say... ?").
  • FIG. 5 shows a representation of the inventive in-vehicle, user-interface solution, based on a system of user interface processes or applications that involve or require the same basic steps albeit accomplishing the steps by different methods and producing different results or providing different functions to the user.
  • the user interface is multi-modal in nature and is based on three steps that are common among a variety of applications including, but not limited to, texting 210, browsing 213, navigation 216, and social networking 219, as well as other applications 222.
  • Step one 225 involves establishment of intent, or selecting the application intended to be used.
  • Application selection may be achieved by touching an icon on a display, pushing a particular button, or by saying a speech command such as "web search" or "text-by-voice."
  • the second step 226 involves speaking the phrase to be converted to text, which can be referred to as speech-enabled typing.
  • the nature of the phrase to be converted to text depends on the user intent.
  • the type(s) of phrases to be converted include, but are not limited to, text messages 211, search string entries 214, target destinations 217, or brief announcements 220, as well as other phrases 223, depending on the intent 225.
  • the recognized phrase is played through audio (text-to-speech, for example) and the user then decides how to manage the result 227.
  • Step three 227 can entail such actions as saying the name of a target text recipient 212, glancing 215 at search results such as a weather report on a display, touching a displayed destination to enter 218 the destination into a navigation system, or speaking a group 221 name for a social networking communication.
  • steps one and three can involve input modalities other than speech, but step two entails speech-enabled typing.
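The shared three-step pattern can be sketched as a single dispatch routine; the handler strings are invented placeholders, and the recognition step itself is out of scope for the sketch.

```python
# Illustrative three-step pattern: intent (step 1) -> spoken phrase (step 2)
# -> task-specific management of the recognized result (step 3).
RESULT_HANDLERS = {
    "texting":    lambda text: f"Say the recipient for: {text}",
    "browsing":   lambda text: f"Showing results for: {text}",
    "navigation": lambda text: f"Touch to send destination: {text}",
    "social":     lambda text: f"Say the group for: {text}",
}

def three_step_task(intent: str, spoken_phrase: str) -> str:
    recognized = spoken_phrase          # step 2: speech-enabled typing (recognition omitted)
    manage = RESULT_HANDLERS[intent]    # step 3 depends on the intent from step 1
    return manage(recognized)

print(three_step_task("texting", "running ten minutes late"))
```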
  • a key to the present invention is the simplicity of a single user interface and method that can be applied across a variety of different applications. The simplicity of the resultant user interface is highly appealing under driving conditions because very little cognitive processing is required by a driver to learn and use many applications. Because there are so few steps, task completion is fast and distraction is thereby minimized.
  • FIG. 6 is a process flow diagram of a synchronous speech recognition approach.
  • the user starts 300 and experiences an IVR prompt 301 and, typically, utters a verbal response.
  • the recognition engine 302 processes the verbal response and, based upon matching scores that are referred to as recognition confidence levels 303, either moves on to the next prompt after processing is completed within the enterprise back-end 304 or re-prompts 301.
  • the interactive process ends.
  • the potential issue with a synchronous approach is that the user can get stuck in an error loop when successive low confidence levels 303 occur.
  • a conventional interactive voice response system typically includes error handling dialogs that increase the duration of the interactive process and, often, cause user frustration when recognition failures occur.
  • the user starts 310 and experiences an IVR prompt 312.
  • the IVR captures the user utterance, transfers the audio to a speech recognition engine 313 that can be queued, and executes the next prompt 312 (if any remain) in the series.
  • Processing 315 of the user utterances occurs in parallel to the prompting 312 for user input; that is, the two activities are asynchronous.
  • the user prompting 311 process will not be interrupted due to low recognition confidence scores 314 or excessive recognition latencies.
  • low confidence utterances can be transcribed by a human 316, thereby assuring high accuracy, but at a cost that is greater than fully automated speech recognition.
  • prompting is a forward moving process whether a valid recognition result is obtained or not.
  • the potential issue of a user getting stuck in a prompting error loop 312 is eliminated and there is some guarantee of a good user experience.
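The forward-moving, asynchronous flow of FIG. 7 can be sketched with a background worker; the thread pool, the stand-in recognizers, and the threshold are all assumptions made for the illustration.

```python
# Hedged sketch: prompting continues while recognition runs in the background,
# and low-confidence audio is handed to a human transcriber (step 316).
from concurrent.futures import ThreadPoolExecutor

def recognize(audio):                   # stand-in automated recognizer
    return {"text": "pizza near me", "confidence": 0.55}

def human_transcribe(audio):            # stand-in human-assisted transcription
    return {"text": "pizza near me", "confidence": 1.0}

def process_utterance(audio, threshold=0.85):
    result = recognize(audio)
    if result["confidence"] < threshold:
        result = human_transcribe(audio)   # accuracy assured, at a higher cost
    return result

with ThreadPoolExecutor() as pool:
    future = pool.submit(process_utterance, b"utterance-1")   # recognition queued
    print("Who should receive the message?")                  # next prompt proceeds
    print(future.result())                 # result is consumed whenever it arrives
```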
  • Those experienced in the science of automatic speech recognition identify unexpected audio input as a major cause of recognition errors.
  • Involving humans within the systems and processes of the invention allows these errors to disappear because those humans can usually still transcribe such "infected" audio.
  • human-assisted speech recognition employed by the inventive methods and systems is very practical when combined with the asynchronous speech recognition solutions.
  • a portion of the dialog can be synchronous and a portion can be asynchronous.
  • a portion of the dialog may be required to be synchronous to, perhaps, allow for a user-requested repetition of a phrase (a scenario in which a user is prompted with "Did you say <...>?").
  • the inventive systems and methods can be purely synchronous, purely asynchronous, or a combination of both.
  • Conventional speech applications utilize prompting schemes within which, for each prompt, prompting is continued after a recognition result is obtained.
  • Certain applications must be implemented with limits on latency between the time an utterance is finished being spoken and the time the recognition result is utilized (such as dialing a phone number by voice); these applications generally require a synchronous approach.
  • certain applications can be implemented with less stringent limits on latency between the time an utterance is finished being spoken and the time the recognition result is utilized (for example, a text message can be sent several minutes after a driver has spoken a text message); these applications generally require a synchronous approach, but can tolerate asynchronous speech recognition for part of the dialog.
  • a driver may request to send a text message (intent); the user is prompted and speaks the text message (which could be recognized asynchronously); the user is prompted and speaks the name of the text message recipient, which is recognized synchronously or asynchronously; the text message is sent after all recognition results are determined.
  • Some applications, such as form-filling, can be completely asynchronous.
  • Form-filling applications can include, for example, capturing a user name, address, credit card number, and service selection; the form can be filled out with text after the recognition results are determined, perhaps hours after the user dialog is complete.
  • part of a form-filling dialog can include having a user describe something like an automobile accident, where an application simply records the description for subsequent recognition, possibly through human-assisted speech recognition.
  • FIG. 8 illustrates a diagram of a method for providing dynamic interactive voice recognition (IVR) over a wireless network.
  • a connection with a telematics control unit is established via the wireless network.
  • a directed dialog application is configured to provide IVR for use with expected spoken user commands.
  • the directed dialog application may be implemented at either the vehicle via TCU 34 or ERDC 48 via interactive automation server cluster 41.
  • an open dialog application separate from the remote data center is used to provide IVR for use with unexpected spoken user commands.
  • the inventive hybrid dialog employs systems to accommodate Power and Novice User Modes simultaneously.
  • the system is either listening for a user to speak a specific utterance, e.g., an expected spoken user command from a limited set of grammar items (e.g., a limited vocabulary mode) or listening and trying to interpret anything it hears (unlimited vocabulary mode or unexpected spoken user command mode).
  • the unlimited vocabulary mode does have boundaries, in that only certain domains are covered (e.g., destinations, song names, dictation).
  • the system is very capable of detecting that a user has spoken an invalid command (e.g., out-of-vocabulary).
  • the remote data center switches back and forth between the directed dialog application and the open dialog application in accordance with pre-defined criteria.
  • the pre-defined criteria may be a confidence score as described above.
  • the present hybrid system moves back and forth between a structured/directed dialog mode, e.g., power mode, and an unstructured/open dialog mode, e.g., novice mode.
  • the directed dialog application presents questions that correspond to a limited subset of possible spoken user commands. This means that the user is asked questions having a small subset of possible valid responses. When the user's utterances, e.g., spoken user commands, are incorrect or invalid, the system recognizes this as an indication that the user needs help. At this point, the system transitions into a teaching mode.
  • a teaching mode, e.g., learning mode, of the directed dialog application is entered before using the open dialog application when an unexpected spoken user command is received.
  • the directed dialog application seamlessly switches to the learning mode, in which the user is prompted with exactly what to say (e.g., "Please say traffic, weather, or navigation") as opposed to prompting the user to say the name of a service.
  • the teaching mode does not try to figure out what was said by the user. Instead, the system utilizes a process that tries to make it easier for the user to say a valid response. In attempting to obtain a valid response from the user, the teaching mode of the system provides a small or reduced subset of possible choices.
  • the system provides a second level of choices that is much smaller than the thirty possible valid responses.
  • the system is configured to give the user broad categories, e.g., a reduced subset, from which to choose a valid command. For example, the system says "please say traffic, new destination, or gas prices.”
  • each of these three categories broadly describes a number of different possible valid responses within the set of thirty.
  • the traffic sub-category, if selected, could provide six different choices.
  • the destination sub-category, if selected, could provide five different choices, and the gas sub-category, if selected, could provide four different choices.
  • the reduced subset is less than the limited subset of possible spoken user commands.
  • switching to the open dialog application occurs when a further unexpected spoken user command is received. In one exemplary embodiment, switching to the open dialog application occurs absent a selection of one of the reduced subsets of possible choices.
  • a further subset of possible choices can be presented. If a sub-category is not appropriate, the user can be prompted for "more choices," e.g., a further subset of possible choices. When one is selected, the small, manageable subset is uttered to the user. In one embodiment, switching to the open dialog application occurs absent a selection of the further subset of possible choices.
  • Resources for the open dialog application may be provided by a speech recognition service cloud.
  • the speech recognition service cloud can provide different speech recognition systems in parallel.
  • the different speech recognition systems may be a navigation speech recognition system, a dictation speech recognition system, an audio information recognition system, and a human assisted speech recognition system.
  • an unexpected user command comprising natural user language is received.
  • a user may utter an unexpected user command asking, for example, for "best directions based on traffic.”
  • the open dialog application provided by the speech recognition service cloud is used to provide information based on the natural user language.
  • the information provided is a list of possible spoken user commands.
  • suggested user commands can be provided in response to receiving an invalid spoken user command.
  • the present hybrid system treats "dynamic grammars", i.e., open dialog or natural user language, the same way as "limited grammars”, e.g., limited vocabulary, so that on-the-fly vocabularies accommodate the novice and power users.
  • in the unlimited vocabulary mode (e.g., open dialog mode), invalid utterances are not readily detected; the user is asked to confirm a recognition result if a low confidence score occurs, or is re-prompted.
  • FIG. 9 discloses a method 900 for minimizing task completion time using dynamic interactive voice recognition (IVR) over a wireless network.
  • a connection is established with a telematics control unit via the wireless network.
  • a directed dialog application is configured to provide IVR for use with expected spoken user commands.
  • the directed dialog application may be implemented at either the vehicle via TCU 34 or ERDC 48 via interactive automation server cluster 41.
  • one of a plurality of expected spoken user commands is received.
  • a user is prompted for further spoken information in order to complete one or more actions required by the received user command.
  • the one or more actions are completed upon receipt of the further spoken information.
  • the hybrid dialog minimizes the number of steps in the dialog to minimize task completion time. Task completion time from start to finish becomes very short and, therefore, very safe.
  • the user is prompted to speak a minimal number of times for any given task. For example, if a user wants to send a message to someone, the hybrid design prompts the user to speak the message first, and follows this query by asking to whom the message is to be sent. If the user utters a valid recipient (e.g., from the user's contact list), then it is assumed that the message was recognized correctly. In addition, the system also recognizes that the user wants to send the message and that the user wants to text a specific contact. This recognition by the system minimizes the number of steps in the dialog and, in addition, minimizes time spent by the user on the activity.
  • once a user has determined that a text message is to be sent, the user only needs to complete two steps: 1) uttering the message for inclusion in the text message; and 2) uttering the contact(s) to whom the message is to be sent. If the system receives an answer that is not expected, a learning mode can be entered in order to provide other suggestions. For example, if the user mistakenly utters "Mary Smithers" as a contact instead of "Mary Smith", the system may ask the user to try again or to select another contact (a dialog sketch illustrating these two steps appears after this list).
  • the received expected spoken user command comprises a shortcut.
  • the user is enabled to use verbal shortcuts by speaking commands that are specific to a particular service, such as saying "new destination" instead of saying "navigation" first. If a user utters an incorrect shortcut command, the system may enter a learning mode in order to teach the user the proper shortcut to use (a shortcut-mapping sketch appears after this list).
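The hybrid routing described in the bullets above (a directed dialog first, with a switch to parallel cloud recognizers for unexpected or low-confidence utterances) can be pictured with a short sketch. The following Python is a minimal illustration only: the function names, the stubbed recognizers, the engine labels, and the 0.6 confidence threshold are assumptions made for this sketch, not the patent's implementation.

    from concurrent.futures import ThreadPoolExecutor
    from dataclasses import dataclass

    @dataclass
    class Result:
        text: str          # recognized text
        confidence: float  # score between 0.0 and 1.0
        source: str        # which recognizer produced the result

    CONFIDENCE_THRESHOLD = 0.6  # assumed value; the document does not give one

    def recognize_directed(audio, grammar):
        # Stub for the limited-grammar directed dialog recognizer: confident
        # only when the utterance matches an expected command.
        hit = audio in grammar
        return Result(audio if hit else "", 0.9 if hit else 0.2, "directed")

    def recognize_cloud(audio, engine):
        # Stub for one specialized cloud recognizer (navigation, dictation,
        # audio information, or human-assisted).
        return Result(audio, 0.7, engine)

    def hybrid_recognize(audio, grammar):
        directed = recognize_directed(audio, grammar)
        if directed.confidence >= CONFIDENCE_THRESHOLD:
            return directed  # expected command handled by the directed dialog

        # Unexpected or low-confidence utterance: switch to the open dialog
        # application and query the cloud recognizers in parallel.
        engines = ["navigation", "dictation", "audio_info", "human_assisted"]
        with ThreadPoolExecutor(max_workers=len(engines)) as pool:
            results = list(pool.map(lambda e: recognize_cloud(audio, e), engines))
        best = max(results, key=lambda r: r.confidence)

        if best.confidence < CONFIDENCE_THRESHOLD:
            # In open dialog mode invalid utterances are not readily detected,
            # so the user would be asked to confirm the result or be re-prompted.
            best = Result(best.text, best.confidence, best.source + " (needs confirmation)")
        return best

    print(hybrid_recognize("new destination", {"new destination", "call home"}))
    print(hybrid_recognize("best directions based on traffic", {"new destination"}))
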
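The two-step text-message dialog and its learning mode can likewise be sketched. The prompt/listen helpers, the sample contact list, and the use of difflib to suggest close matches are assumptions made only for this illustration.

    import difflib

    CONTACTS = ["Mary Smith", "John Jones"]  # illustrative contact list

    def prompt(text):
        print("SYSTEM: " + text)        # stands in for a text-to-speech prompt

    def listen():
        return input("USER: ").strip()  # stands in for speech recognition

    def send_text_dialog():
        prompt("What is your message?")
        message = listen()              # step 1: the message itself

        prompt("Who should receive it?")
        recipient = listen()            # step 2: the recipient
        while recipient not in CONTACTS:
            # Learning mode: suggest close matches from the contact list.
            suggestions = difflib.get_close_matches(recipient, CONTACTS)
            if suggestions:
                prompt("I did not find that contact. Did you mean " + ", ".join(suggestions) + "?")
            else:
                prompt("I did not find that contact. Please try again.")
            recipient = listen()

        # A valid recipient implies the message was recognized correctly,
        # so the task completes without further confirmation steps.
        prompt('Sending "' + message + '" to ' + recipient + ".")

    if __name__ == "__main__":
        send_text_dialog()
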
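Finally, verbal shortcuts amount to a direct mapping from a spoken command to a service action, with a learning mode for unrecognized shortcuts. The shortcut table and action names below are invented for illustration and are not taken from the document.

    # Assumed shortcut table mapping a spoken command to a service action.
    SHORTCUTS = {
        "new destination": ("navigation", "enter_destination"),
        "send a text": ("messaging", "compose_text"),
        "play my playlist": ("audio", "play_playlist"),
    }

    def handle_shortcut(utterance):
        key = utterance.lower().strip()
        if key in SHORTCUTS:
            service, action = SHORTCUTS[key]
            return "routing to " + service + "." + action
        # Learning mode: teach the user which shortcuts are available.
        return "Shortcut not recognized. Try one of: " + ", ".join(sorted(SHORTCUTS))

    print(handle_shortcut("New Destination"))   # routed without saying "navigation" first
    print(handle_shortcut("go someplace new"))  # learning-mode suggestion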

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Navigation (AREA)

Abstract

A system and method are disclosed for implementing a server-based speech recognition system for automated, multimodal interaction in a vehicle, in which a vehicle driver receives audio prompts through an on-board human-machine interface and responds with speech to complete tasks such as creating and sending text messages, browsing the Internet, navigation, and the like. A service-oriented architecture is used to invoke specialized speech recognizers in an adaptive manner. The human-machine interface allows a text-entry task to be completed while driving a vehicle in a way that minimizes the frequency of the driver's visual and manual interactions with the interface, thereby eliminating dangerous moments of inattention under driving conditions. After the initial prompt, the text-entry task is followed by a computerized verbalization of the text. Subsequent interface steps may be visual in nature or consist of audio only.
PCT/US2012/042916 2011-06-16 2012-06-18 Reconnaissance de paroles de dialogue hybride pour interaction automatisée dans un véhicule et interfaces utilisateur dans le véhicule nécessitant un traitement de commande cognitive minimal pour celle-ci WO2012174515A1 (fr)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CA2839285A CA2839285A1 (fr) 2011-06-16 2012-06-18 Reconnaissance de paroles de dialogue hybride pour interaction automatisee dans un vehicule et interfaces utilisateur dans le vehicule necessitant un traitement de commande cognitive minimal pour celle-ci

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US201161497699P 2011-06-16 2011-06-16
US61/497,699 2011-06-16
US13/524,695 US20120253823A1 (en) 2004-09-10 2012-06-15 Hybrid Dialog Speech Recognition for In-Vehicle Automated Interaction and In-Vehicle Interfaces Requiring Minimal Driver Processing
US13/524,695 2012-06-15

Publications (1)

Publication Number Publication Date
WO2012174515A1 true WO2012174515A1 (fr) 2012-12-20

Family

ID=47357523

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2012/042916 WO2012174515A1 (fr) 2011-06-16 2012-06-18 Reconnaissance de paroles de dialogue hybride pour interaction automatisée dans un véhicule et interfaces utilisateur dans le véhicule nécessitant un traitement de commande cognitive minimal pour celle-ci

Country Status (2)

Country Link
CA (1) CA2839285A1 (fr)
WO (1) WO2012174515A1 (fr)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10701305B2 (en) * 2013-01-30 2020-06-30 Kebron G. Dejene Video signature system and method
CN112289317A (zh) * 2020-11-20 2021-01-29 苏州思必驰信息科技有限公司 用于语音交互的回复方法及系统
WO2023227129A1 (fr) * 2022-05-27 2023-11-30 广州小鹏汽车科技有限公司 Procédé d'interaction vocale, terminal d'unité de tête, véhicule et support d'enregistrement

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11676062B2 (en) * 2018-03-06 2023-06-13 Samsung Electronics Co., Ltd. Dynamically evolving hybrid personalized artificial intelligence system

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020049535A1 (en) * 1999-09-20 2002-04-25 Ralf Rigo Wireless interactive voice-actuated mobile telematics system
US20070219807A1 (en) * 2003-11-19 2007-09-20 Atx Group, Inc. Wirelessly delivered owner's manual
US20100079336A1 (en) * 2008-09-30 2010-04-01 Sense Networks, Inc. Comparing Spatial-Temporal Trails In Location Analytics
US7831431B2 (en) * 2006-10-31 2010-11-09 Honda Motor Co., Ltd. Voice recognition updates via remote broadcast signal

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020049535A1 (en) * 1999-09-20 2002-04-25 Ralf Rigo Wireless interactive voice-actuated mobile telematics system
US20070219807A1 (en) * 2003-11-19 2007-09-20 Atx Group, Inc. Wirelessly delivered owner's manual
US7831431B2 (en) * 2006-10-31 2010-11-09 Honda Motor Co., Ltd. Voice recognition updates via remote broadcast signal
US20100079336A1 (en) * 2008-09-30 2010-04-01 Sense Networks, Inc. Comparing Spatial-Temporal Trails In Location Analytics

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
HUA ET AL.: "Speech Recognition Interface Design for In-Vehicle System.", 11 November 2010 (2010-11-11), Retrieved from the Internet <URL:http://www.auto-ui.org/10/proceedings/p29.pdf> [retrieved on 20120807] *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10701305B2 (en) * 2013-01-30 2020-06-30 Kebron G. Dejene Video signature system and method
CN112289317A (zh) * 2020-11-20 2021-01-29 苏州思必驰信息科技有限公司 用于语音交互的回复方法及系统
CN112289317B (zh) * 2020-11-20 2022-05-20 思必驰科技股份有限公司 用于语音交互的回复方法及系统
WO2023227129A1 (fr) * 2022-05-27 2023-11-30 广州小鹏汽车科技有限公司 Procédé d'interaction vocale, terminal d'unité de tête, véhicule et support d'enregistrement

Also Published As

Publication number Publication date
CA2839285A1 (fr) 2012-12-20

Similar Documents

Publication Publication Date Title
US9558745B2 (en) Service oriented speech recognition for in-vehicle automated interaction and in-vehicle user interfaces requiring minimal cognitive driver processing for same
US20120253823A1 (en) Hybrid Dialog Speech Recognition for In-Vehicle Automated Interaction and In-Vehicle Interfaces Requiring Minimal Driver Processing
US7826945B2 (en) Automobile speech-recognition interface
US9619572B2 (en) Multiple web-based content category searching in mobile search application
US9495956B2 (en) Dealing with switch latency in speech recognition
US9202465B2 (en) Speech recognition dependent on text message content
US8635243B2 (en) Sending a communications header with voice recording to send metadata for use in speech recognition, formatting, and search mobile search application
CA2546913C (fr) Manuel de proprietaire distribue au moyen d'un procede sans fil
US20110054895A1 (en) Utilizing user transmitted text to improve language model in mobile dictation application
US20110060587A1 (en) Command and control utilizing ancillary information in a mobile voice-to-speech application
US20110054894A1 (en) Speech recognition through the collection of contact information in mobile dictation application
US20110054899A1 (en) Command and control utilizing content information in a mobile voice-to-speech application
US20110054900A1 (en) Hybrid command and control between resident and remote speech recognition facilities in a mobile voice-to-speech application
US20110054898A1 (en) Multiple web-based content search user interface in mobile search application
US20110054897A1 (en) Transmitting signal quality information in mobile dictation application
US20110288867A1 (en) Nametag confusability determination
US11605387B1 (en) Assistant determination in a skill
WO2012174515A1 (fr) Reconnaissance de paroles de dialogue hybride pour interaction automatisée dans un véhicule et interfaces utilisateur dans le véhicule nécessitant un traitement de commande cognitive minimal pour celle-ci
KR20190037470A (ko) 대화 시스템, 이를 포함하는 차량 및 대화 처리 방법
CA2737850C (fr) Manuel de proprietaire distribue au moyen d'un procede sans fil
KR20230135396A (ko) 대화 관리 방법, 사용자 단말 및 컴퓨터로 판독 가능한 기록 매체

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 12801437

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2839285

Country of ref document: CA

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 12801437

Country of ref document: EP

Kind code of ref document: A1