EP1273004A1 - System for processing a natural language dialogue system - Google Patents

System for processing a natural language dialogue system

Info

Publication number
EP1273004A1
Authority
EP
European Patent Office
Prior art keywords
matching
entry
phrase
context
searching
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
EP01924726A
Other languages
English (en)
French (fr)
Inventor
Dean C. Weber
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
One Voice Technologies Inc
Original Assignee
One Voice Technologies Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by One Voice Technologies Inc filed Critical One Voice Technologies Inc
Publication of EP1273004A1

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L15/06: Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063: Training
    • G10L2015/0631: Creating reference templates; Clustering
    • G10L2015/226: Procedures used during a speech recognition process, e.g. man-machine dialogue, using non-speech characteristics
    • G10L2015/228: Procedures used during a speech recognition process, e.g. man-machine dialogue, using non-speech characteristics of application context

Definitions

  • The present invention relates to speech recognition for an object-based computer user interface. More specifically, the embodiments relate to a novel method and system for user interaction with a computer using speech recognition and natural language processing.
  • Speech recognition involves software and hardware that act together to audibly detect human speech and translate the detected speech into a string of words. As is known in the art, speech recognition works by breaking down sounds the hardware detects into smaller, non-divisible sounds called phonemes.
  • Phonemes are distinct units of sound. For example, the word “those” is made up of three phonemes; the first is the “th” sound, the second is the “o” sound, and the third is the “s” sound.
  • The speech recognition software then attempts to match the detected phonemes with known words from a stored dictionary.
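The dictionary-matching step described above can be sketched as a simple lookup. The phoneme symbols and dictionary entries below are invented for illustration; a real recognizer uses probabilistic acoustic models rather than exact lookup.

```python
# Illustrative sketch: matching a detected phoneme sequence against a
# stored dictionary. Phoneme symbols and entries are made-up examples.

PHONEME_DICTIONARY = {
    ("th", "o", "s"): "those",
    ("k", "ae", "t"): "cat",
    ("d", "o", "g"): "dog",
}

def match_phonemes(phonemes):
    """Return the dictionary word for a phoneme sequence, or None."""
    return PHONEME_DICTIONARY.get(tuple(phonemes))

print(match_phonemes(["th", "o", "s"]))  # those
```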
  • An example of a speech recognition system is given in U.S. Patent No. 4,783,803, entitled “SPEECH RECOGNITION APPARATUS AND METHOD", issued
  • A proposed enhancement to these speech recognition systems is to process the detected words using a natural language processing system.
  • Natural language processing generally involves determining a conceptual "meaning” (e.g., what meaning the speaker intended to convey) of the detected words by analyzing their grammatical relationship and relative context.
  • Natural language processing used in concert with speech recognition provides a powerful tool for operating a computer using spoken words rather than manual input such as a keyboard or mouse.
  • A conventional natural language processing system may, however, fail to determine the correct "meaning" of the words detected by the speech recognition system. In such a case, the user is typically required to recompose or restate the phrase, in the hope that the natural language processing system will determine the correct "meaning" on subsequent attempts. Clearly, this may lead to substantial delays, as the user is required to restate the entire sentence or command.
  • Another drawback of conventional systems is that the processing time required for the speech recognition can be prohibitively long. This is primarily due to the finite speed of the processing resources as compared with the large amount of information to be processed. For example, in many conventional speech recognition programs, the time required to recognize the utterance is long due to the size of the dictionary file being searched.
  • Conventional speech recognition and natural language processing systems also fail to generate intelligent conversations because they lack a conversational memory.
  • Conventional systems do not remember the contents of their conversations with users.
  • Conversations with such systems are therefore repetitive, and may result in users telling the systems the same facts over and over, or worse, in systems failing to learn and remember simple concepts.
  • Another drawback of conventional speech recognition and natural language processing systems is that once a user successfully "trains" a computer system to recognize the user's speech and voice commands, the user cannot easily move to another computer without having to undergo the process of training the new computer. As a result, changing a user's computer workstations or location results in wasted time by users that need to re-train the new computer to the user's speech habits and voice commands.
  • FIG. 1 is a functional block diagram of an exemplary computer system;
  • FIG. 2 is an expanded functional block diagram of the CPU and storage medium of the computer system of FIG. 1;
  • FIGS. 3A-D are a flowchart of the method of providing interactive speech recognition and natural language processing to a computer;
  • FIG. 4 is a diagram of selected columns of an exemplary natural language processing (NLP) database;
  • FIG. 5 is a diagram of an exemplary Database Definition File (DDF);
  • FIG. 6 is a diagram of selected columns of an exemplary object table;
  • FIGS. 7A-D are a flowchart illustrating the linking of interactive speech recognition and natural language processing to a networked object, such as a web-page;
  • FIG. 8 is a diagram depicting a computer system connecting to other computers, storage media, and web-sites via the Internet;
  • FIG. 9 is a diagram of an exemplary global user registry;
  • FIGS. 10A-B are flowcharts illustrating alternate embodiments of the retrieval and enabling of an individual's global user registry during login at a computer workstation;
  • FIG. 11 is a diagram depicting an exemplary Dialogue Generation Log.
  • Computer system 100 includes a central processing unit (CPU) 102.
  • The CPU 102 may be any general-purpose microprocessor or microcontroller known in the art, appropriately programmed to perform the method described herein with reference to FIGS. 3A-D.
  • The software for programming the CPU may be found on storage medium 108, or alternatively at another location across a computer network.
  • The CPU 102 may be a conventional microprocessor such as the Pentium III™ processor manufactured by Intel Corporation, or the like.
  • CPU 102 communicates with a plurality of peripheral equipment, including a display 104, manual input 106, storage medium 108, microphone 110, speaker 112, data input port 114 and network interface 116.
  • Display 104 may be a visual display such as a CRT, LCD screen, touch-sensitive screen, or other monitors as are known in the art for visually displaying images and text to a user.
  • Manual input 106 may be a conventional keyboard, keypad, mouse, trackball, or other input device as is known in the art for the manual input of data.
  • Storage medium 108 may be a conventional read/write memory such as a magnetic disk drive, floppy disk drive, CD- ROM drive, silicon memory or other memory device as is known in the art for storing and retrieving data.
  • Storage medium 108 may alternatively be remotely located from CPU 102 and connected to CPU 102 via a network such as a local area network (LAN), a wide area network (WAN), or the Internet.
  • Microphone 110 may be any suitable microphone as is known in the art for providing audio signals to CPU 102.
  • Speaker 112 may be any suitable speaker as is known in the art for reproducing audio signals from CPU 102.
  • Data input port 114 may be any data port as is known in the art for interfacing with an external accessory using a data protocol such as RS-232, Universal Serial Bus, or the like.
  • Network interface 116 may be any interface known in the art for communicating or transferring files across a computer network; examples of such networks include TCP/IP, Ethernet, and token-ring networks.
  • Alternatively, a network interface 116 may consist of a modem connected to the data input port 114.
  • FIG. 1 illustrates the functional elements of a computer system 100.
  • Each of the elements of computer system 100 may be suitable off-the-shelf components as described above.
  • The present embodiment provides a method and system for human interaction with the computer system 100 using speech.
  • The computer system 100 may be connected to the Internet 700, a collection of computer networks.
  • To connect to the Internet, computer system 100 may use a network interface 116, a modem connected to the data input port 114, or any other method known in the art.
  • Web-sites 710, other computers 720, and storage media 108 may also be connected to the Internet through such methods known in the art.
  • Referring to FIG. 2, an expanded functional block diagram of CPU 102 and storage medium 108 is illustrated.
  • CPU 102 includes speech recognition processor 200, data processor 201, natural language processor 202, and application interface 220.
  • The data processor 201 interfaces with the display 104, storage medium 108, microphone 110, speaker 112, data input port 114, and network interface 116.
  • The data processor 201 allows the CPU to locate and read data from these sources.
  • Natural language processor 202 further includes variable replacer 204, string formatter 206, word weighter 208, boolean tester 210, pronoun replacer 211, and search engine 213.
  • Storage medium 108 includes a plurality of context-specific grammar files 212, general grammar file 214, dictation grammar 216, context-specific dictation model 217, natural language processor (NLP) database 218, and a dialogue generation log (DGL) file 219.
  • The grammar files 212, 214, and 216 are Backus-Naur Form (BNF) files, which describe the structure of the language spoken by the user.
  • BNF files are well known in the art for describing the structure of language, and details of BNF files will therefore not be discussed herein.
  • One advantage of BNF files is that hierarchical tree-like structures may be used to describe phrases or word sequences, without the need to explicitly recite all combinations of these word sequences.
  • The context-specific dictation model 217 is an optional file that contains specific models to improve dictation accuracy. These models enable users to specify word orders and word models. The models accomplish this by describing words and their relationship to other words, thus determining word meaning by contextual interpretation in a specific field or topic.
  • For example, a context-specific dictation model 217 for computers may indicate the likelihood of the word "microprocessor" being associated with "computer," and that a number, such as "650", is likely to be found near the word "megahertz."
  • A speech recognition processor would analyze such a phrase, interpret a single object (i.e., the computer), and recognize that "650 megahertz microprocessor" consists of adjectives or traits describing the type of computer.
  • Topics for context-specific dictation models 217 vary widely, and may include any topic area of interest to a user — both broad and narrow. Broad topics may include: history, law, medicine, science, technology, or computers. Specialized topics, such as a particular field of literature encountered at a book retailer's web-site are also possible. Such a context-specific dictation model 217 may contain text for author and title information, for example.
  • The context-specific dictation model 217 format depends upon the underlying speech recognition processor 200 and is specific to each type of speech recognition processor 200.
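As a rough illustration of what such a model might encode, the sketch below uses a table of topic-specific co-occurrence likelihoods. The words and scores are invented for illustration and are not taken from the patent.

```python
# Hypothetical sketch of a context-specific dictation model for a
# "computers" topic: co-occurrence likelihoods used to bias recognition.
# All scores are invented for illustration.

COMPUTER_MODEL = {
    ("microprocessor", "computer"): 0.9,
    ("megahertz", "<number>"): 0.8,
}

def association_likelihood(word, neighbor):
    """Likelihood that `word` appears near `neighbor` in this topic."""
    return COMPUTER_MODEL.get((word, neighbor), 0.1)  # default prior

# "microprocessor" is far more likely near "computer" than near an
# unrelated word, so recognition can be biased accordingly.
assert association_likelihood("microprocessor", "computer") > \
       association_likelihood("microprocessor", "banana")
```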
  • A dialogue generation log 219 is a file, database table, or any other log known in the art that contains a memory of the conversations between a computer or system practicing the present embodiment and a user of such a system.
  • The purpose behind the dialogue generation log 219 is to create a conversational memory by storing a log of what is spoken.
  • The log is also used to avoid repetitive speech generation.
  • An exemplary dialogue generation log 219 is shown in FIG. 11.
  • The dialogue generation log 219 comprises individual dialogue generation log entries 1112A-JC, each entry representing a line of dialogue or a statement exchanged between the system and a user.
  • An example dialogue generation log entry would contain information such as a log entry identifier (ID) 1102, a user identifier 1104, a statement 1106, the context of the statement 1108, and a date/time stamp 1110.
  • The ID 1102 uniquely identifies each entry 1112.
  • A user ID 1104 represents each user of the system. Statements made by the system itself are also recorded in the log, and a special user ID 1104 is reserved for the system.
  • Each entry contains a statement of what was spoken 1106 and, preferably, the context 1108 in which the statement was spoken. For example, the statement "I would like to see movies" would be linked to the context of "movies" or "film."
  • A date/time stamp 1110 may be recorded with each entry to identify how recently the statement was last made.
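The log entry layout described above might be modeled as follows. This is a minimal sketch: the field names and the choice of a reserved system user ID are assumptions for illustration, not the patent's implementation.

```python
from dataclasses import dataclass
from datetime import datetime

SYSTEM_USER_ID = 0  # special user ID reserved for the system itself

@dataclass
class DialogueLogEntry:
    """One dialogue generation log entry: ID 1102, user ID 1104,
    statement 1106, context 1108, and date/time stamp 1110."""
    entry_id: int
    user_id: int
    statement: str
    context: str
    timestamp: datetime

# A two-turn exchange: one user statement, one system statement.
log = [
    DialogueLogEntry(1, 42, "I would like to see movies", "movies",
                     datetime(1999, 8, 28, 20, 0)),
    DialogueLogEntry(2, SYSTEM_USER_ID, "Which theater?", "movies",
                     datetime(1999, 8, 28, 20, 0, 5)),
]
```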
  • The flow begins at block 300 with the providing of an utterance to speech processor 200.
  • An utterance is a series of sounds having a beginning and an end, and may include one or more spoken words.
  • A microphone 110 may capture the spoken words in block 300.
  • Alternatively, the utterance may be provided to the speech processor 200 over data input port 114, or from storage medium 108.
  • In these cases, the utterance is in a digital format such as the well-known ".wav" audio file format.
  • The speech processor 200 then determines whether one of the context-specific grammars 212 has been enabled. If the context-specific grammars 212 are enabled, the context-specific grammars 212 are searched at block 304.
  • The context-specific grammars 212 are BNF files that contain words and phrases related to a parent context.
  • A context is a subject area. For example, in one embodiment applicable to personal computers, examples of contexts may be "news", "weather", or "stocks". In such a case, the context-specific grammars 212 would each contain commands, control words, descriptors, qualifiers, or parameters that correspond to a different one of these contexts.
  • The use of contexts provides a hierarchical structure for types of information.
  • Contexts and their use will be described further below with reference to the NLP database 218. If a context-specific grammar 212 has been enabled, the context-specific grammar 212 is searched for a match to the utterance provided at block 300. However, if a context-specific grammar 212 has not been enabled, the flow proceeds to block 308 where the general grammar 214 is enabled.
  • The general grammar 214 is a BNF file that contains words and phrases which do not, themselves, belong to a parent context, but which may have an associated context for which a context-specific grammar file 212 exists. In other words, the words and phrases in the general grammar 214 may be at the root of the hierarchical context structure. For example, in one embodiment applicable to personal computers, the general grammar 214 would contain commands and control phrases.
  • The general grammar 214 is searched for a word or phrase matching the utterance provided at block 300. At block 312, a decision is made depending on whether a match is found. If a match is not found, the dictation grammar 216 is enabled at block 314.
  • The dictation grammar 216 is a BNF file that contains a list of words that do not, themselves, have either a parent context or an associated context.
  • The dictation grammar 216 contains a relatively large list of general words, similar to a general dictionary.
  • The dictation grammar is searched for matching words for each word of the utterance provided at block 300.
  • At decision block 318, if no matching words are found, any relevant context-specific dictation model 217 is enabled at block 317.
  • A visual error message is optionally displayed at the display 104, or an audible error message is optionally reproduced through speaker 112, at block 320.
  • The process then ends until another utterance is provided to the speech processor 200 at block 300.
  • When an utterance is provided to the speech processor 200, the enabled context-specific grammar 212, if any, is first searched. If there are no matches in the enabled context-specific grammar 212, then the general grammar 214 is enabled and searched. If there are no matches in the general grammar 214, then the dictation grammar 216 is enabled and searched. Finally, if there are no matches in the dictation grammar 216, a context-specific dictation model 217 is enabled at block 317 and used to interpret the utterance.
  • When the speech recognition processor 200 is searching either the context-specific grammar 212 or the general grammar 214, it is said to be in the "command and control" mode. In this mode, the speech recognition processor 200 compares the entire utterance as a whole to the entries in the grammar. By contrast, when the speech recognition processor 200 is searching the dictation grammar, it is said to be in the "dictation" mode. In this mode, the speech recognition processor 200 compares the utterance to the entries in the dictation grammar 216 one word at a time. Finally, when the speech recognition processor 200 is matching the utterance with a context-specific dictation model 217, it is said to be in the "model matching" mode.
  • It is expected that searching for a match for an entire utterance in the command and control mode will generally be faster than searching for one word at a time in the dictation or model matching modes.
  • Any individual context-specific grammar 212 will be smaller in size (i.e., fewer total words and phrases) than the general grammar 214, which in turn will be smaller in size than the dictation grammar 216.
  • By searching any enabled context-specific grammar 212 first, it is likely that a match, if any, will be found more quickly, due at least in part to the smaller file size.
  • Likewise, by searching the general grammar 214 before the dictation grammar 216, it is likely that a match, if any, will be found more quickly.
  • The words and phrases in the enabled context-specific grammar 212 are also more likely to be uttered by the user, because they are words that are highly relevant to the subject matter about which the user was most recently speaking. This also allows the user to speak in a more conversational style, using sentence fragments, with the meaning of his words being interpreted according to the enabled context-specific grammar 212.
  • Thus, the present embodiment may search more efficiently than if the searching were to occur one entry at a time in a single, large list of all expected words and phrases.
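The grammar cascade described above can be sketched schematically. The grammar contents below are invented, and real matching would be performed by the speech recognition processor against BNF grammars rather than by simple set lookups; this sketch only illustrates the search order and the two matching modes.

```python
# Schematic sketch of the search order: enabled context-specific grammar
# first, then the general grammar, then the dictation grammar one word
# at a time. Grammar contents are illustrative only.

context_grammar = {"8 o'clock", "the late show"}           # smallest
general_grammar = {"what movies are playing", "news"}      # larger
dictation_grammar = {"what", "movies", "are", "playing", "hello"}

def recognize(utterance, context_enabled=True):
    # Command-and-control mode: match the entire utterance as a whole.
    if context_enabled and utterance in context_grammar:
        return ("command-and-control", utterance)
    if utterance in general_grammar:
        return ("command-and-control", utterance)
    # Dictation mode: match one word at a time.
    words = [w for w in utterance.split() if w in dictation_grammar]
    if words:
        return ("dictation", " ".join(words))
    return ("no-match", None)

print(recognize("8 o'clock"))    # ('command-and-control', "8 o'clock")
print(recognize("hello there"))  # ('dictation', 'hello')
```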
  • Block 322 shows that one action may be to direct application interface 220 to take some action with respect to a separate software application or entity.
  • application interface 220 may use the Speech Application Programming Interface (SAPI) standard by Microsoft to communicate with an external application.
  • the external application may be directed, for example, to access a particular Internet web site URL or to speak a particular phrase by converting text to speech.
  • Other actions may be taken as will be discussed further below with reference to the NLP database 218 of FIG. 4.
  • Block 324 shows that another action may be to access a row in the natural language processing (NLP) database 218 directly, thereby bypassing the natural language processing method described further below.
  • Block 326 shows that another action may be to prepend a word or phrase for the enabled context to the matching word or phrase found in the context-specific grammar at block 306. For example, if the enabled context were "movies" and the matching utterance were "8 o'clock," the word "movies" would be prepended to the phrase "8 o'clock" to form the phrase "movies at 8 o'clock."
  • the flow may proceed to block 322 where the application interface 220 is directed to take an action as described above, or to block 324 where a row in the NLP database is directly accessed.
  • When the match is found in the general grammar 214, no prepending of a context occurs because, as stated above, the entries in the general grammar 214 do not, themselves, have a parent context.
  • Manually entered words may be captured at block 301 and input into the natural language processor.
  • Words may be entered manually via manual input 106. In this case, no speech recognition is required, and yet natural language processing of the entered words is still desired. Thus, the flow proceeds to FIG. 3B.
  • The natural language processor 202 formats the phrase for natural language processing analysis. This formatting is accomplished by string formatter 206 and may include such text processing as removing duplicate spaces between words, making all letters lower case (or upper case), expanding contractions (e.g., changing "it's" to "it is"), and the like. The formatting prepares the phrase for parsing.
  • "Word-variables" refers to words or phrases that represent amounts, dates, times, currencies, and the like.
  • For example, the phrase "what movies are playing at 8 o'clock" would be transformed at block 330 to "what movies are playing at $time", where "$time" is a wildcard function used to represent any time value.
  • Likewise, the phrase "sell IBM stock at 100 dollars" would be transformed at block 330 to "sell IBM stock at $dollars", where "$dollars" is a wildcard function used to represent any dollar value.
  • This block may be accomplished by a simple loop that searches the phrase for key tokens such as the words "dollar” or "o'clock” and replaces the word- variables with a specified wildcard function.
  • In order to keep track of the locations in the phrase where a substitution was made, an array may be used. This allows re-substitution of the original word-variable back into the phrase at the same position after the NLP database 218 has been searched.
  • The purpose of replacing word-variables with an associated wildcard function at block 330 is to reduce the number of entries that must be present in the NLP database 218. For example, the NLP database 218 would only need to contain the phrase "what movies are playing at $time", rather than a separate entry for each possible time.
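The wildcard-substitution step might be sketched with simple regular expressions. The patterns below are simplified assumptions that cover only the two examples given above; they also record each original word-variable so it can be re-substituted after the database search.

```python
import re

# Sketch of the word-variable replacement at block 330: time and dollar
# amounts are swapped for wildcard tokens, and the originals are saved
# for later re-substitution. The regexes are simplified assumptions.

PATTERNS = [
    (re.compile(r"\b\d{1,2} o'clock\b"), "$time"),
    (re.compile(r"\b\d+ dollars\b"), "$dollars"),
]

def replace_word_variables(phrase):
    saved = []  # (wildcard, original text) pairs for re-substitution
    for pattern, wildcard in PATTERNS:
        for match in pattern.finditer(phrase):
            saved.append((wildcard, match.group()))
        phrase = pattern.sub(wildcard, phrase)
    return phrase, saved

phrase, saved = replace_word_variables("what movies are playing at 8 o'clock")
print(phrase)  # what movies are playing at $time
```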
  • Next, pronouns in the phrase are replaced with proper names by pronoun replacer 211.
  • For example, the pronouns "I," "my," or "mine" would be replaced with the speaker's name.
  • Block 332 allows user-specific facts to be stored and accessed in the NLP database 218. For example, the sentence "who are my children" would be transformed into "who are Dean's children", where "Dean" is the speaker's proper name.
  • This block may be performed in a simple loop that searches the phrase for pronouns and replaces the pronouns found with an appropriate proper name. In order to keep track of the locations in the phrase where a substitution was made, an array may be used.
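A minimal sketch of such a pronoun-replacement loop, assuming a known speaker name and a small invented pronoun mapping (possessives take the "'s" form):

```python
# Sketch of the pronoun-replacement step at block 332: first-person
# pronouns become the speaker's proper name, and the positions of the
# substitutions are recorded. The mapping is an illustrative assumption.

def replace_pronouns(phrase, speaker="Dean"):
    mapping = {"i": speaker, "my": speaker + "'s", "mine": speaker + "'s"}
    words, positions = [], []
    for index, word in enumerate(phrase.split()):
        replacement = mapping.get(word.lower())
        if replacement is not None:
            positions.append(index)  # remember where we substituted
            word = replacement
        words.append(word)
    return " ".join(words), positions

print(replace_pronouns("who are my children"))  # ("who are Dean's children", [2])
```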
  • Next, the individual words in the phrase are weighted according to their relative "importance" or "significance" to the overall meaning of the phrase by word weighter 208. For example, in one embodiment there are three weighting factors assigned. The lowest weighting factor is assigned to words such as "a," "an," "the," and other articles. The highest weighting factor is given to words that are likely to have a significant relation to the meaning of the phrase. For example, these may include all verbs, nouns, adjectives, and proper names in the NLP database 218. A medium weighting factor is given to all other words in the phrase. The purpose of this weighting is to allow for more powerful searching of the NLP database 218.
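The three-level weighting scheme might look like the following sketch. The word lists and the weight values 1, 2, and 3 are illustrative assumptions; a real system would consult the NLP database for the significant words.

```python
# Sketch of three-level word weighting: articles get the lowest weight,
# significant words (verbs, nouns, adjectives, proper names) the
# highest, and everything else a medium weight.

ARTICLES = {"a", "an", "the"}
SIGNIFICANT = {"movies", "playing", "stock", "price"}  # invented list

LOW, MEDIUM, HIGH = 1, 2, 3

def weight_words(phrase):
    weights = {}
    for word in phrase.split():
        if word in ARTICLES:
            weights[word] = LOW
        elif word in SIGNIFICANT:
            weights[word] = HIGH
        else:
            weights[word] = MEDIUM
    return weights

print(weight_words("what movies are playing"))
```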
  • the NLP database 218 comprises a plurality of columns 400-410, and a plurality of rows 412A-412N.
  • the entries represent phrases that are "known" to the NLP database.
  • The number of required words for each entry in column 400 is also shown.
  • An associated context or subcontext for each entry in column 400 is also shown.
  • One or more associated actions are also shown for each entry in column 400.
  • the NLP database 218 shown in FIG. 4 is merely a simplified example for the purpose of teaching the present embodiment. Other embodiments may have more or fewer columns with different entries.
  • The NLP database 218 is searched for possible matches to the phrase, based on whether the entry in column 400 of the NLP database 218 contains any of the words in the phrase (or their synonyms), and the relative weights of those words.
  • A confidence value is generated for each of the possible matching entries based on the number of occurrences of each of the words in the phrase and their relative weights.
  • Weighted-word searching of a database is well known in the art and may be performed by commercially available search engines, such as the product "dtsearch" by DT Software, Inc.
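A minimal sketch of such a weighted relevance search, with invented entries and weights, could compute a confidence value as the sum of the weights of the phrase words found in each entry:

```python
# Sketch of relevance scoring at block 336: each NLP database entry
# gets a confidence value from the weighted count of phrase words it
# contains. Entries and weights are invented for illustration.

def confidence(entry_words, phrase_weights):
    """Sum the weights of the phrase words that occur in the entry."""
    return sum(w for word, w in phrase_weights.items()
               if word in entry_words)

entries = {
    "what movies are playing at $time":
        {"what", "movies", "are", "playing", "at", "$time"},
    "what is the price of IBM stock on $date":
        {"what", "is", "the", "price", "of", "ibm", "stock", "on", "$date"},
}

phrase_weights = {"what": 2, "movies": 3, "are": 2, "playing": 3,
                  "at": 1, "$time": 3}

scores = {entry: confidence(words, phrase_weights)
          for entry, words in entries.items()}
best = max(scores, key=scores.get)
print(best)  # what movies are playing at $time
```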
  • Next, the natural language processor 202 determines whether any of the possible matching entries has a confidence value greater than or equal to some predetermined minimum threshold, T.
  • T represents the lowest acceptable confidence value for which a decision can be made as to whether the phrase matched any of the entries in the NLP database 218. If there is no possible matching entry with a confidence value greater than or equal to T, then the flow proceeds to block 342, where an optional error message is either visually displayed to the user over display 104 or audibly reproduced over speaker 112. In one embodiment, the type of error message, if any, displayed to the user may depend on how many "hits" (i.e., how many matching words from the phrase) were found in the NLP database 218.
  • Otherwise, the flow proceeds to block 344, where the "noise" words are discarded from the phrase.
  • The "noise" words include words that do not contribute significantly to the overall meaning of the phrase relative to the other words in the phrase. These may include articles, pronouns, conjunctions, and words of a similar nature. "Non-noise" words, which do contribute significantly to the overall meaning of the phrase, include verbs, nouns, adjectives, proper names, and words of a similar nature.
  • Next, the non-noise word requirement is retrieved from column 402 of the NLP database 218 for the highest-confidence matching entry at block 346. For example, if the highest-confidence matching phrase was the entry in row 412A (e.g., "what movies are playing at $time"), then the number of required non-noise words is 3.
  • A test is then made to determine whether the required number of non-noise words from the phrase is actually present in the highest-confidence entry retrieved from the NLP database 218. This test is a verification of the accuracy of the relevance-style search performed at block 336, it being understood that an entry may generate a confidence value higher than the minimum threshold, T, without being an acceptable match for the phrase.
  • The test performed at decision 348 is a boolean "AND" test performed by boolean tester 210. The test determines whether each one of the non-noise words in the phrase is present in the highest-confidence entry.
  • If the test succeeds, the associated action in column 408 (e.g., access movie web site) is taken at block 350.
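The boolean "AND" verification might be sketched as follows; the noise-word list is a simplified assumption, and a real implementation would also consult synonyms.

```python
# Sketch of the boolean "AND" test at decision 348: every non-noise
# word in the phrase must appear in the highest-confidence entry.

NOISE_WORDS = {"a", "an", "the", "at", "is", "are", "what", "of", "on"}

def non_noise(words):
    return [w for w in words if w not in NOISE_WORDS]

def and_test(phrase, entry):
    """True only if every non-noise phrase word occurs in the entry."""
    entry_words = set(entry.split())
    return all(word in entry_words for word in non_noise(phrase.split()))

print(and_test("what movies are playing at $time",
               "what movies are playing at $time"))  # True
print(and_test("movies tonight",
               "what movies are playing at $time"))  # False
```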
  • For example, the associated action may be for natural language processor 202 to direct a text-to-speech application (not shown) to speak the present time to the user through the speaker 112.
  • As another example, a first associated action may be to access a predetermined news web site on the Internet, and a second associated action may be to direct an image display application (not shown) to display images associated with the news.
  • At block 352, the natural language processor 202 instructs the speech recognition processor 200 to enable the context-specific grammar 212 for the associated context of column 404.
  • For example, the context-specific grammar 212 for the context "movies" would be enabled.
  • Otherwise, the flow proceeds to block 354, where the user is prompted over display 104 or speaker 112 as to whether the highest-confidence entry was meant, and the inquiry is recorded in the dialogue generation log 219. For example, if the user uttered "How much is IBM stock selling for today," the highest-confidence entry in the NLP database 218 may be the entry in row 412B. In this case, although the relevance factor may be high, the number of required words (or their synonyms) may not be sufficient. Thus, the user would be prompted at block 354 whether he meant "what is the price of IBM stock on August 28, 1998."
  • The user may respond either affirmatively or negatively. If it is determined at decision 356 that the user has responded affirmatively, then the action associated with the highest-confidence entry is taken at block 350, and the associated context-specific grammar 212 is enabled at block 352. The speech made by the user is also recorded as an entry of the dialogue generation log 219.
  • If the user responds negatively, the flow proceeds to FIG. 3D, where the associated context from column 404 of NLP database 218 is retrieved for the highest-confidence entry, and the user is prompted for information using a context-based interactive dialog at block 360. For example, if the user uttered "what is the price of XICOR stock today," and the highest-confidence entry from the NLP database 218 was row 412B (e.g., "what is the price of IBM stock on $date"), then the user would be prompted at block 354 whether that was what he meant.
  • The context-based interactive dialog may entail prompting the user for the name and stock ticker symbol of XICOR stock. The user may respond by speaking the required information.
  • A different context-based interactive dialog may be used for each of the possible contexts. For example, the "weather" context-based interactive dialog may entail prompting the user for the name of the location (e.g., the city) about which weather information is desired. Also, the "news" context-based interactive dialog may entail prompting the user for types of articles, news source, Internet URL for the news site, or other related information.
  • the NLP database 218, general grammar 214, and context-specific grammar 212 are updated to include the new information, at block 362. In this way, the next time the user asks for that information, a proper match will be found, and the appropriate action taken without prompting the user for more information.
  • the present embodiment adaptively "learns" to recognize phrases uttered by the user.
  • one or more of the NLP database 218, context-specific grammar 212, general grammar 214, and dictation grammar 216 also contain time-stamp values (not shown) associated with each entry. Each time a matching entry is used, the time-stamp value associated with that entry is updated. At periodic intervals, or when initiated by the user, the entries that have a time-stamp value before a certain date and time are removed from their respective databases/grammars. In this way, the databases/grammars may be kept to an efficient size by "purging" old or out-of-date entries. This also assists in avoiding false matches.
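The time-stamp purge described above can be sketched as follows. This is an illustrative sketch only: the dictionary layout, function names, and the 30-day cut-off are assumptions for demonstration, not structures taken from the patent.

```python
import time

def purge_stale_entries(grammar, max_age_seconds):
    """Remove entries whose time-stamp predates the cut-off, keeping the
    grammar to an efficient size and avoiding false matches on stale phrases."""
    cutoff = time.time() - max_age_seconds
    return {phrase: ts for phrase, ts in grammar.items() if ts >= cutoff}

def touch(grammar, phrase):
    """Refresh an entry's time-stamp each time it produces a match."""
    grammar[phrase] = time.time()

# A phrase last used 90 days ago vs. one used just now.
grammar = {
    "price of IBM stock": time.time() - 90 * 86400,
    "weather in San Diego": time.time(),
}
touch(grammar, "weather in San Diego")
grammar = purge_stale_entries(grammar, max_age_seconds=30 * 86400)
# only the recently used phrase survives the purge
```

The same purge could run at periodic intervals or on user request, as the text describes.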
  • the updates to the NLP database 218, general grammar 214, and context-specific grammar 212 are stored in a global user registry 800, shown in FIG. 9.
  • a global user registry entry 800a would be comprised of any general grammar additions 214a, context-specific grammar additions 212a, and NLP database additions 218a created by the user training. Since each user of the system would have a different global user registry entry 800a, the embodiment would be flexible enough to allow for special customizations and could adapt to the idiosyncrasies of individual users.
  • the global user registry entry 800a also contains dialogue generation log additions 219a, user preferences 802, and remembered information 804, as shown in FIG. 9.
  • the dialogue generation log additions 219a would be comprised of the dialogue exchanged between the system and an individual user.
  • the addition of dialogue generation log additions 219a allows the system greater intelligence in its conversations with an individual user, and allows those conversations to be transparent and mobile across different computers.
  • User preferences 802 represent the stored system preferences of an individual user. Examples of such user preferences 802 include, but are not limited to: window color and placement, system sounds, font preferences, and other user-interface preferences.
  • Remembered information 804 includes user schedule information, learned facts derived from the dialogue generation log 219, and any other user-specific (but non-user-interface) stored data not included in user preferences. For example, if the system is told, "I own a dog," the remembered information will include that fact.
  • Remembered information may be stored or represented by any information storage format known in the art, including any database format.
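The six components of a global user registry entry 800a described above can be sketched as one per-user record. The field names and list/dict representations below are illustrative assumptions keyed loosely to the reference numerals; the text permits any storage format known in the art.

```python
from dataclasses import dataclass, field

@dataclass
class GlobalUserRegistryEntry:
    """One entry 800a per user, so customizations stay per-user."""
    general_grammar_additions: list = field(default_factory=list)           # 214a
    context_specific_grammar_additions: list = field(default_factory=list)  # 212a
    nlp_database_additions: list = field(default_factory=list)              # 218a
    dialogue_log_additions: list = field(default_factory=list)              # 219a
    user_preferences: dict = field(default_factory=dict)                    # 802
    remembered_information: dict = field(default_factory=dict)              # 804

entry = GlobalUserRegistryEntry()
entry.remembered_information["owns_dog"] = True     # learned from "I own a dog"
entry.user_preferences["window_color"] = "blue"     # a user-interface preference
```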
  • the global user registry 800 would be stored locally and mirrored at known server locations.
  • the mirrored copy, referred to as the "travelling" global user registry entry, enables users to access the phrases "adaptively" learned by the embodiment even when logged in at a different location.
  • FIG. 10A illustrates an embodiment that accesses customized global user registry entries 800a at local and remote (travelling) locations. Initially, a valid system user is verified, by any means known in the art, and then the system searches for a locally stored global user registry entry. For example, the system queries users for their login ID and password as shown in block 900. If the password and login ID match, as determined by decision block 905, the user is deemed to be a valid user.
  • if no locally stored entry is found, the system searches for a travelling global user registry entry, at block 920. If either search turns up a global user registry entry, it is loaded, at blocks 915 and 925, respectively.
  • the global user registry entry 800a is enabled at block 940 by extracting the general grammar additions 214a, context-specific grammar additions 212a, NLP database additions 218a, dialogue generation log additions 219a, user preferences 802, and remembered information 804. These "learned" adaptations are then used by the system, as discussed earlier with the method of FIGS 3 A-3D. If the retrieval of the travelling global user registry entry 800a is unsuccessful, standard (non-custom user) processing is performed as reflected at block 945.
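The lookup order of FIG. 10A — local entry first, then the travelling (mirrored) entry, else standard non-custom processing — can be sketched as below. The store layout (plain dictionaries keyed by user ID) is an assumption for illustration.

```python
def load_registry_entry(user_id, local_store, remote_store):
    """Prefer the locally stored global user registry entry (blocks 910/915),
    fall back to the travelling copy at a known server (blocks 920/925),
    and return None to signal standard processing (block 945)."""
    entry = local_store.get(user_id)
    if entry is None:
        entry = remote_store.get(user_id)
    return entry

local = {}  # this terminal has no entry for "bob"
remote = {"bob": {"user_preferences": {"font": "large"}}}

travelling = load_registry_entry("bob", local, remote)   # found on the mirror
missing = load_registry_entry("carol", local, remote)    # standard processing
```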
  • the user only logs into the system when accessing a terminal that does not recognize the user as a local user.
  • This embodiment is shown in FIG. 10B. From this remote system, the user is required to specify an ID and a password.
  • the system determines whether the user has a locally stored voice profile, block 910. If the user has a local voice profile, it is loaded at block 915, and flow continues at decision block 950. If the user does not have a locally stored global user registry entry, the system queries the user for a login identifier and a password at block 900. If the user is valid, the travelling global user registry entry is retrieved at block 925, and flow continues at block 950.
  • the user is treated as a user with standard (non-custom) processing, at block 945.
  • the global user registry entry 800a is enabled at block 940 by extracting the general grammar additions 214a, context-specific grammar additions 212a, NLP database additions 218a, dialogue generation log additions 219a, user preferences 802, and remembered information 804. These "learned" adaptations are then used by the system, as discussed earlier with the method of FIGS 3A-3D.
  • speech recognition and natural language processing may be used to interact with objects, such as help files (".hlp" files), World-Wide-Web ("WWW" or "web") pages, or any other objects that have a context-sensitive voice-based interface.
  • FIG. 5 illustrates an exemplary Dialog Definition File (DDF) 500 which represents information necessary to associate the speech recognition and natural language processing to an internet object, such as a text or graphics file or, in the preferred embodiment, a web-page or help file.
  • the Dialog Definition File 500 consists of an object table 510.
  • the DDF may also contain additional context-specific grammar files 214 and additional entries for the natural language processing (NLP) database 218, as illustrated in FIG. 5.
  • the preferred embodiment of the DDF 500 includes an object table 510, a context-specific grammar file 214, a context-specific dictation model 217, and a file containing entries to the natural language processing database 218. These components may be compressed and combined into the DDF file 500 by any method known in the art, such as through Lempel-Ziv compression.
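A minimal sketch of combining and compressing the four DDF components follows. Here zlib's DEFLATE (an LZ77 derivative) stands in for the generic "Lempel-Ziv compression" the text mentions, and the JSON container format is an assumption; the patent allows any combination method known in the art.

```python
import json
import zlib

def pack_ddf(object_table, grammar, dictation_model, nlp_entries):
    """Combine the DDF 500 components into one compressed blob."""
    payload = json.dumps({
        "object_table": object_table,        # 510
        "grammar": grammar,                  # context-specific grammar 214
        "dictation_model": dictation_model,  # context-specific dictation model 217
        "nlp_entries": nlp_entries,          # NLP database entries 218
    }).encode("utf-8")
    return zlib.compress(payload)

def unpack_ddf(blob):
    """Recover the components on the receiving system."""
    return json.loads(zlib.decompress(blob))

ddf = pack_ddf({"http://example.com": {}}, ["phrase one"], {}, [])
components = unpack_ddf(ddf)
```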
  • the context-specific grammar file 214 and the natural language processing database 218 are as described in earlier sections.
  • the object table 510 is a memory structure, such as a memory tree, chain or table, which associates an address of a resource with various actions, grammars, or entries in the NLP database 218.
  • FIG. 6 illustrates a memory table which may contain entry columns for: an object 520, a Text-to-Speech (TTS) flag 522, a text speech 524, a use grammar flag 526, an append grammar flag 528, an "is yes/no?” flag, and "do yes” 532 and "do no” 534 actions.
  • Each row in the table 540A-540n would represent the grammar and speech related to an individual object.
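One row 540A-540n of the object table in FIG. 6 can be sketched as a record keyed by the object's URL. The field names and values below are illustrative stand-ins for the columns the text describes, not definitions from the patent.

```python
# Object table 510: one row per object, keyed by its URL (column 520).
object_table = {
    "http://www.our-pet-store.com": {
        "tts": True,                                           # TTS flag 522
        "text_speech": "Hello, welcome to our pet store...",   # text speech 524
        "use_grammar": True,                                   # 526
        "append_grammar": False,                               # 528
        "is_yes_no": True,
        "do_yes": "show_dog_specials",                         # "do yes" action 532
        "do_no": None,                                         # "do no" action 534
    },
}

row = object_table.get("http://www.our-pet-store.com")
if row and row["tts"]:
    # Voice the initial statement over the speaker when the object loads.
    print(row["text_speech"])
```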
  • the exemplary embodiment would refer to objects 520 through a Universal Resource Locator (URL).
  • a URL is a standard method of specifying the address of any resource on the Internet that is part of the World-Wide-Web.
  • URLs can specify information in a large variety of object formats, including hypertext, graphical, database and other files, in addition to a number of object devices and communication protocols.
  • URLs and other methods of specifying objects can be used.
  • the Text-to-Speech (TTS) flag 522 indicates whether an initial statement should be voiced over speaker 112 when the corresponding object is transferred. For example, when transferring the web page listed in the object column 520 of row 540A (http://www.our-pet-store.com), the TTS flag 522 is marked, indicating the text speech 524, "Hello, welcome to...," is to be voiced over speaker 112.
  • the utterance of the initial statement may be dependent on the dialogue generation log 219.
  • the system first checks to see if the statement was made recently to a user, before re-voicing the statement.
  • a conversation can be made more intelligent by using the dialogue generation log 219 as a "conversational memory.”
  • a conversational memory formed by the dialogue generation log 219 also acts as a "sanity check" to make sure conversations do not become repetitive.
  • the use of the dialogue generation log 219 is best illustrated by example.
  • a user was examining a speech-enhanced pet-store web-site using the present embodiment.
  • an object table 510 is transferred from the web-site to the user's computer system 100.
  • the system notes that the TTS flag 522 is marked, indicating the text speech 524, "Hello, welcome to our pet store, would you like to see our dog specials?" Since this is the user's first visit to the web page, no context or statement- related entries are found in the dialogue generation log 219, so the statement is to be voiced over speaker 112.
  • the user responds, "no," and decides to surf a different web page instead.
  • Transparent to the user, the system records both statements in the dialogue generation log 219.
  • the next three object table 510 flags relate to the use of grammars associated with this object.
  • the affirmative marking of the "use grammar" 526 or "append grammar" 528 flags indicates the presence of a content-specific grammar file 214 related to the indicated object.
  • the marking of the "use grammar” flag 526 indicates that the new content-specific grammar file 214 replaces the existing content-specific grammar file, and the existing file is disabled.
  • the "append grammar” flag 528 indicates that the new content-specific grammar file should be enabled concurrently with the existing content-specific grammar file.
  • Referring to FIG. 7A, a method and system of providing speech and voice commands to objects, such as a computer reading a help file or browsing the World-Wide-Web, is illustrated.
  • the method of FIGS. 7A-7D may be used in conjunction with the method of FIGS 3A-3D and FIG. 10.
  • an object location is provided to a help file reader or World-Wide-Web browser.
  • a help file reader/browser is a program used to examine hypertext documents that are written to help users accomplish tasks or solve problems, and is well known in the art.
  • the web browser is a program used to navigate through the Internet, and is well known in the art.
  • Block 602, providing an object location to the browser, can be as simple as a user clicking on a program "help" menu item, manually typing in a URL, or having a user select a "link" at a chosen web-site. It may also be the result of a voiced command, as described earlier with reference to the action associated with each entry in the NLP database 218.
  • Given the object location, the computer must decide whether it can resolve the object location specified, at block 604. This resolution process is well known in the art. If the computer is unable to resolve the object location or Internet address, an error message is displayed in the browser window, at block 605, and the system is returned to its initial starting state 600. If the object location or Internet address is resolved, the computer retrieves the object at block 606.
  • a web browser sends the web-site a request for the web page, at block 606.
  • the help reader reads the help file off of storage media 108, at block 606.
  • the computer examines whether the DDF file 500 location is encoded within the object.
  • the DDF file location could be encoded within web page HyperText Markup Language (HTML) as a URL.
  • Encoding the DDF file location within HTML code may be done, for example, by listing the DDF location in a tag such as: <!-- <DDF="http://www.our-pet-store.com/ConverseIt.ddf"> -->. If the DDF file location information is encoded within the web page, the location's Internet address is resolved, at block 616, and the computer requests transfer of the DDF file 500, at block 626.
  • An equivalent encoding scheme could be used within help file hypertext.
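Detecting an encoded DDF location in retrieved page source can be sketched with a simple scan. The exact tag syntax matched here is an assumption reconstructed from the fragment quoted above, and the regular-expression approach is illustrative only.

```python
import re

# Assumed comment-style encoding: <!-- <DDF="..."> --> embedded in the HTML.
DDF_PATTERN = re.compile(r'<DDF="([^"]+)">')

def find_ddf_location(html):
    """Return the encoded DDF file URL, or None if the page encodes none."""
    match = DDF_PATTERN.search(html)
    return match.group(1) if match else None

page = '<html><!-- <DDF="http://www.our-pet-store.com/ConverseIt.ddf"> --></html>'
location = find_ddf_location(page)
```

If `find_ddf_location` returns None, the system would fall through to querying the web-site and then the centralized location, as the surrounding blocks describe.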
  • Block 618 determines whether the DDF file is located at the web-site.
  • the computer sends a query to the web-site inquiring about the presence of the DDF file 500. If the DDF file 500 is present at the web-site, the computer requests transfer of the DDF file 500, at block 626.
  • the computer queries the centralized location about the presence of a DDF file for the web-site, at block 620. If the DDF file is present at the centralized location, the computer requests transfer of the DDF file, at block 626. If the DDF file 500 cannot be found, the existing components of any present DDF file, such as the object table 510, context-specific dictation model 217, NLP database 218 associated with the object, and context-specific grammar 214 for any previously-viewed object, are deactivated in block 622. Furthermore, the object is treated as a non-voice-activated object, and only standard grammar files are used, at block 624. Standard grammar files are the grammar files existing on the system excluding any grammars associated with the content-specific grammar file associated with the object.
  • any existing components of any present DDF file 500 are deactivated, at block 622, and the web-site is treated as a non-voice-activated object, and only standard grammar files are used, at block 624.
  • If the DDF file 500 is requested at block 626 and its transfer is successful at block 628, it replaces any prior DDF file, at block 630. Any components of the DDF file 500, such as the object table 510, context-specific-grammar files 214, context-specific-dictation models 217, and NLP database 218 are extracted at block 632. A similar technique may be used for obtaining the software necessary to implement the method illustrated in FIGS. 3A-3D, comprising the functional elements of FIG. 2.
  • the object table 510 is read into memory by the computer in block 634. If the object is present in the site object table 510, as determined by block 636, it will be represented by a row 540A-540n of the table, as shown in FIG. 6. Each row of the object table represents the speech-interactions available to a user for that particular object. If no row corresponding to the object exists, then no speech-interaction exists for the web page, and processing ends.
  • the computer checks if the TTS flag 522 is marked, to determine whether a text speech 524 is associated with the web-page, at block 638. If there is a text speech 524, a decision must be made on whether it is appropriate to play the spoken statement, at block 639. For example, if a user was queried, "Would you like to hear about the history of Mars?" in the past five minutes, it would be inappropriate, and probably annoying, to ask the user the same question again. In making such a determination, the system checks whether the statement has been made to the user recently by examining the dialogue generation log 219.
  • the system notes that the TTS flag 522 is marked, indicating the text speech 524, "Hello, welcome to our pet store, would you like to see our cat specials?" Since the text-to-speech statement is an inquiry about cats, the system performs a check in the dialogue generation log 219 to see if there are past conversations with a context 1108 of "cat.” The context of "cat" is located, and the system notes that the user responded "no," therefore the user does not own a cat. Consequently, the system realizes that a negative match has been made, and it would be inappropriate to voice the question to the user.
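The check described in the cat-specials example can be sketched as a lookup in the dialogue generation log before voicing a statement. The log-entry shape (a context field 1108 plus the user's response) is an illustrative assumption.

```python
def should_voice(statement_context, user_log):
    """Sketch of block 639: consult the dialogue generation log 219 before
    re-voicing a statement. A prior negative response recorded under the
    same context makes re-asking inappropriate."""
    for log_entry in user_log:
        if log_entry["context"] == statement_context and log_entry["response"] == "no":
            return False  # negative match found; suppress the statement
    return True           # no conflicting history; voice the statement

# The user previously answered "no" to a question in the "cat" context.
log = [{"context": "cat", "response": "no"}]
voice_cat = should_voice("cat", log)   # suppressed
voice_dog = should_voice("dog", log)   # allowed
```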
  • Since dialogue generation log entries 1112 are time-dependent, they must be kept up-to-date by the system, and old entries should be culled and removed on a periodic basis. After all, in the case of the above example, it is possible that the user eventually acquires a cat. Old entries may be removed after a certain time-period expires, or after the dialogue generation log 219 reaches a specific size. Dialogue generation log entries 1112 for each user can be removed on a first-in-first-out (FIFO) basis, so that older entries are removed before more recent entries.
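The size-capped FIFO culling described above falls out naturally from a bounded queue: once the log reaches the cap, each new entry silently evicts the oldest one. The cap value here is an illustrative choice, not one from the patent.

```python
from collections import deque

MAX_LOG_ENTRIES = 1000  # assumed size cap for the dialogue generation log

# A deque with maxlen discards from the front (oldest first) on overflow,
# which is exactly first-in-first-out culling.
dialogue_log = deque(maxlen=MAX_LOG_ENTRIES)

for i in range(MAX_LOG_ENTRIES + 5):
    dialogue_log.append({"id": i, "context": "cat"})

# The five oldest entries (ids 0-4) have been culled automatically.
```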
  • the present embodiment provides a method and system for an object interactive user-interface for a computer.
  • the present embodiment decreases speech recognition time and increases the user's ability to communicate with local and networked objects, such as help files or web-pages, in a conversational style.
  • the present embodiment further increases interactive efficiency.
  • the adaptive updates can be incorporated into global user registry entries that can be stored locally and remotely, to allow users access to the global user registry entries at various locations.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Machine Translation (AREA)
EP01924726A 2000-04-06 2001-04-06 System zum verarbeiten eines natürlichen sprach-dialog-systems Withdrawn EP1273004A1 (de)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US54379000A 2000-04-06 2000-04-06
US543790 2000-04-06
PCT/US2001/011138 WO2001078065A1 (en) 2000-04-06 2001-04-06 Natural language and dialogue generation processing

Publications (1)

Publication Number Publication Date
EP1273004A1 true EP1273004A1 (de) 2003-01-08

Family

ID=24169568

Family Applications (1)

Application Number Title Priority Date Filing Date
EP01924726A Withdrawn EP1273004A1 (de) 2000-04-06 2001-04-06 System zum verarbeiten eines natürlichen sprach-dialog-systems

Country Status (4)

Country Link
EP (1) EP1273004A1 (de)
AU (1) AU2001251354A1 (de)
CA (1) CA2408584A1 (de)
WO (1) WO2001078065A1 (de)

Families Citing this family (26)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7725307B2 (en) 1999-11-12 2010-05-25 Phoenix Solutions, Inc. Query engine for processing voice based queries including semantic decoding
US9076448B2 (en) 1999-11-12 2015-07-07 Nuance Communications, Inc. Distributed real time speech recognition system
US6665640B1 (en) 1999-11-12 2003-12-16 Phoenix Solutions, Inc. Interactive speech based learning/training system formulating search queries based on natural language parsing of recognized user queries
US6633846B1 (en) 1999-11-12 2003-10-14 Phoenix Solutions, Inc. Distributed realtime speech recognition system
US6795808B1 (en) * 2000-10-30 2004-09-21 Koninklijke Philips Electronics N.V. User interface/entertainment device that simulates personal interaction and charges external database with relevant data
US7693720B2 (en) 2002-07-15 2010-04-06 Voicebox Technologies, Inc. Mobile systems and methods for responding to natural language speech utterance
US7640160B2 (en) 2005-08-05 2009-12-29 Voicebox Technologies, Inc. Systems and methods for responding to natural language speech utterance
DE102005037621A1 (de) * 2005-08-09 2007-02-22 Siemens Ag Verfahren und Sprachdialogsystem zur Ermittlung zumindest einer Transaktion zur Bedienung einer Hintergrundapplikation
US7620549B2 (en) 2005-08-10 2009-11-17 Voicebox Technologies, Inc. System and method of supporting adaptive misrecognition in conversational speech
US7949529B2 (en) 2005-08-29 2011-05-24 Voicebox Technologies, Inc. Mobile systems and methods of supporting natural language human-machine interactions
DE102006029755A1 (de) 2006-06-27 2008-01-03 Deutsche Telekom Ag Verfahren und Vorrichtung zur natürlichsprachlichen Erkennung einer Sprachäußerung
DE102006036338A1 (de) * 2006-08-03 2008-02-07 Siemens Ag Verfahren zum Erzeugen einer kontextbasierten Sprachdialogausgabe in einem Sprachdialogsystem
US8073681B2 (en) 2006-10-16 2011-12-06 Voicebox Technologies, Inc. System and method for a cooperative conversational voice user interface
US7818176B2 (en) 2007-02-06 2010-10-19 Voicebox Technologies, Inc. System and method for selecting and presenting advertisements based on natural language processing of voice-based input
US8140335B2 (en) 2007-12-11 2012-03-20 Voicebox Technologies, Inc. System and method for providing a natural language voice user interface in an integrated voice navigation services environment
US9305548B2 (en) 2008-05-27 2016-04-05 Voicebox Technologies Corporation System and method for an integrated, multi-modal, multi-device natural language voice services environment
DE102008025532B4 (de) * 2008-05-28 2014-01-09 Audi Ag Kommunikationssystem und Verfahren zum Durchführen einer Kommunikation zwischen einem Nutzer und einer Kommunikationseinrichtung
US8326637B2 (en) 2009-02-20 2012-12-04 Voicebox Technologies, Inc. System and method for processing multi-modal device interactions in a natural language voice services environment
US9502025B2 (en) 2009-11-10 2016-11-22 Voicebox Technologies Corporation System and method for providing a natural language content dedication service
US9171541B2 (en) 2009-11-10 2015-10-27 Voicebox Technologies Corporation System and method for hybrid processing in a natural language voice services environment
WO2016044321A1 (en) 2014-09-16 2016-03-24 Min Tang Integration of domain information into state transitions of a finite state transducer for natural language processing
CN107003996A (zh) 2014-09-16 2017-08-01 声钰科技 语音商务
CN107003999B (zh) 2014-10-15 2020-08-21 声钰科技 对用户的在先自然语言输入的后续响应的系统和方法
US10431214B2 (en) 2014-11-26 2019-10-01 Voicebox Technologies Corporation System and method of determining a domain and/or an action related to a natural language input
US10614799B2 (en) 2014-11-26 2020-04-07 Voicebox Technologies Corporation System and method of providing intent predictions for an utterance prior to a system detection of an end of the utterance
WO2018023106A1 (en) 2016-07-29 2018-02-01 Erik SWART System and method of disambiguating natural language processing requests

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH02301869A (ja) * 1989-05-17 1990-12-13 Hitachi Ltd 自然言語処理システム保守支援方式
JP2855409B2 (ja) * 1994-11-17 1999-02-10 日本アイ・ビー・エム株式会社 自然言語処理方法及びシステム
US6499013B1 (en) * 1998-09-09 2002-12-24 One Voice Technologies, Inc. Interactive user interface using speech recognition and natural language processing

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
None *
See also references of WO0178065A1 *

Also Published As

Publication number Publication date
AU2001251354A1 (en) 2001-10-23
WO2001078065A1 (en) 2001-10-18
CA2408584A1 (en) 2001-10-18

Similar Documents

Publication Publication Date Title
US6434524B1 (en) Object interactive user interface using speech recognition and natural language processing
AU762282B2 (en) Network interactive user interface using speech recognition and natural language processing
WO2001078065A1 (en) Natural language and dialogue generation processing
US7729913B1 (en) Generation and selection of voice recognition grammars for conducting database searches
CA2280331C (en) Web-based platform for interactive voice response (ivr)
CA2437620C (en) Hierarchichal language models
JP4267081B2 (ja) 分散システムにおけるパターン認識登録
JP5330450B2 (ja) テキストフォーマッティング及びスピーチ認識のためのトピック特有のモデル
US6910012B2 (en) Method and system for speech recognition using phonetically similar word alternatives
US6311182B1 (en) Voice activated web browser
US20020087315A1 (en) Computer-implemented multi-scanning language method and system
JPH08335160A (ja) ビデオスクリーン表示を音声対話型にするシステム
WO2000045375A1 (en) Method and apparatus for voice annotation and retrieval of multimedia data
WO2002033542A2 (en) Software development systems and methods
WO2002054385A1 (en) Computer-implemented dynamic language model generation method and system
House Spoken-language access to multimedia(SLAM): a multimodal interface to the World-Wide Web
JP3893893B2 (ja) ウエブページの音声検索方法、音声検索装置および音声検索プログラム
López-Cózar et al. Combining language models in the input interface of a spoken dialogue system

Legal Events

Date Code Title Description
PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

17P Request for examination filed

Effective date: 20021106

AK Designated contracting states

Kind code of ref document: A1

Designated state(s): AT BE CH CY DE DK ES FI FR GB GR IE IT LI LU MC NL PT SE TR

AX Request for extension of the european patent

Free format text: AL;LT;LV;MK;RO;SI

RIN1 Information on inventor provided before grant (corrected)

Inventor name: WEBER, DEAN, C.

17Q First examination report despatched

Effective date: 20050728

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE APPLICATION IS DEEMED TO BE WITHDRAWN

18D Application deemed to be withdrawn

Effective date: 20070530